# TMDB movies

Load all the necessary packages.

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


The TMDB movies dataset contains detailed information about movies including ratings, financials, crew, cast, language, genres, and production details. 

### Loading the data

Load the dataset into a DataFrame named `df_movies` and inspect the first rows of the DataFrame.
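A minimal sketch of this step. So that it runs on its own, a tiny made-up CSV held in a string buffer stands in for the real file; the column names are assumptions, and with the actual data you would pass the path of the CSV file to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up stand-in for the real file, so the snippet runs anywhere.
# With the actual data, pass the CSV path to pd.read_csv instead.
csv_text = """title,release_date,vote_average,runtime
Movie A,2000-05-12,9.1,120
Movie B,1999-11-03,6.4,95
Movie C,2000-01-28,7.8,
"""

df_movies = pd.read_csv(io.StringIO(csv_text))

# First rows, plus the shape as a quick size check
print(df_movies.head())
print(df_movies.shape)
```

Note that the empty `runtime` field of the last row is read as a missing value (`NaN`).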

### Top movies in 2000

Create a new DataFrame `df_2000` with all movies released in the year 2000. Inspect the number of rows in the new DataFrame.
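One possible approach, sketched on hypothetical sample data. It assumes the release date is stored as text in a column named `release_date`.

```python
import pandas as pd

# Hypothetical sample; the column name `release_date` is an assumption
df_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": ["2000-05-12", "1999-11-03", "2000-01-28", None],
})

# Parse to datetime; missing or unparseable dates become NaT
df_movies["release_date"] = pd.to_datetime(df_movies["release_date"], errors="coerce")

# Keep only the rows whose release year is 2000 (NaT rows compare as False)
df_2000 = df_movies[df_movies["release_date"].dt.year == 2000]

# Number of rows in the new DataFrame
print(len(df_2000))
```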

Create a new DataFrame `df_topmovies` that contains only the movies from the year 2000 with an average score of at least 9. Compare the total number of rows in the original DataFrame with the number in `df_topmovies`.
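A sketch of the filter on hypothetical data, assuming the average score lives in a column named `vote_average` and the release date is already a datetime column.

```python
import pandas as pd

# Hypothetical sample; `release_date` and `vote_average` are assumed column names
df_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": pd.to_datetime(
        ["2000-05-12", "1999-11-03", "2000-01-28", "2000-09-09"]
    ),
    "vote_average": [9.1, 9.5, 7.8, 9.0],
})

# Build each condition separately, then combine the Boolean masks with &
is_2000 = df_movies["release_date"].dt.year == 2000
is_top = df_movies["vote_average"] >= 9
df_topmovies = df_movies[is_2000 & is_top]

# Compare the row counts of the original and the filtered DataFrame
print(len(df_movies), len(df_topmovies))
```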

In `df_topmovies`, only include the movies with a valid release date and runtime.
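A minimal sketch, assuming that "valid" means "not missing" and that the columns are named `release_date` and `runtime`. If the dataset also encodes unknown runtimes as 0, an extra condition such as `runtime > 0` may be needed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing date and one missing runtime
df_topmovies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_date": pd.to_datetime(["2000-05-12", None, "2000-01-28"]),
    "runtime": [120.0, 95.0, np.nan],
})

# Drop rows where either of the two columns is missing;
# missing values in other columns are not considered
df_topmovies = df_topmovies.dropna(subset=["release_date", "runtime"])
print(df_topmovies)
```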

Add a column "final_run" displaying the last day the film was screened in theaters. 

Show how many films were released each month using a bar plot.
* Use appropriate labels and titles
* Set the ticks and tick labels so that every month has a tick, labeled with the abbreviation of the month (Jan, Feb, Mar, etc.)
* Change the color of the graph
* Show a grid
* Show the legend
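The requirements above can be sketched as follows. The release dates are made up, and the colors, labels, and title are only illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical release dates; with real data, use the release date column
dates = pd.Series(pd.to_datetime(
    ["2000-01-15", "2000-01-30", "2000-03-10", "2000-07-04", "2000-12-25"]
))

# Count releases per month; reindex so months without releases show up as 0
per_month = dates.dt.month.value_counts().reindex(range(1, 13), fill_value=0)

month_labels = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(per_month.index, per_month.values, color="seagreen", label="Releases")
ax.set_xticks(range(1, 13))        # one tick per month
ax.set_xticklabels(month_labels)   # abbreviated month names as labels
ax.set_xlabel("Month")
ax.set_ylabel("Number of films")
ax.set_title("Films released per month")
ax.grid(axis="y", alpha=0.4)
ax.legend()
fig.tight_layout()
```

The `label` argument of `ax.bar` is what `ax.legend()` picks up; without it the legend would be empty.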

### All movies

For the following questions, work with the dataframe `df_movies`.

Extract the following information from the release date and store it in new columns: day of the week, day of the month, month, and year.
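A sketch using the `.dt` accessor on hypothetical data. The names of the new columns (`release_weekday`, `release_day`, `release_month`, `release_year`) are assumptions; pick whatever names suit you.

```python
import pandas as pd

# Hypothetical sample; `release_date` is an assumed column name
df_movies = pd.DataFrame({
    "title": ["A", "B"],
    "release_date": ["2000-05-12", "2015-11-03"],
})
df_movies["release_date"] = pd.to_datetime(df_movies["release_date"], errors="coerce")

# The .dt accessor exposes the individual parts of each datetime
df_movies["release_weekday"] = df_movies["release_date"].dt.day_name()
df_movies["release_day"] = df_movies["release_date"].dt.day
df_movies["release_month"] = df_movies["release_date"].dt.month
df_movies["release_year"] = df_movies["release_date"].dt.year

print(df_movies)
```

If you prefer a number for the day of the week, `.dt.dayofweek` returns 0 for Monday through 6 for Sunday.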

Create a new column `profit` by subtracting the budget from the revenue. Remove all rows that do not contain valid revenue or budget data. Also create a Boolean column `is_profitable` that indicates whether a film made a profit (profit > 0).
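A sketch on hypothetical data. It assumes that a value is "valid" when it is present and greater than zero, which is a guess about how unknown amounts are stored; adjust the condition after inspecting the real columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; `budget` and `revenue` are assumed column names
df_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100.0, 0.0, 50.0, 80.0],
    "revenue": [250.0, 300.0, np.nan, 60.0],
})

# Assumption: valid means present and greater than zero (NaN > 0 is False)
valid = (df_movies["budget"] > 0) & (df_movies["revenue"] > 0)
df_movies = df_movies[valid].copy()  # copy so new columns can be added safely

df_movies["profit"] = df_movies["revenue"] - df_movies["budget"]
df_movies["is_profitable"] = df_movies["profit"] > 0

print(df_movies)
```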

**What are the top 10 most frequent genres?**

We are looking for the individual values, not the unique entries of genre combinations. It should be clear that you will have to do some data manipulations in order to get the answer.

To get you started: when analysing the video games dataset, we saw a method that creates separate rows (each with the same index) to represent multiple values of a column. This method takes a list-like object as argument.

Tip: tackle this assignment step by step.
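One way to take it step by step, sketched on hypothetical data. It assumes the genres are stored as one comma-separated string per movie, and it uses `explode`, which is one pandas method that matches the hint above.

```python
import pandas as pd

# Hypothetical sample; the "Genre, Genre" string format is an assumption
df_movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action, Drama", "Drama", None],
})

# 1. Turn each string into a list of genre names (missing values dropped first)
genre_lists = df_movies["genres"].dropna().str.split(", ")

# 2. explode() gives every list element its own row, repeating the original index
genre_rows = genre_lists.explode()

# 3. Count the individual genres and keep the ten most frequent
top_genres = genre_rows.value_counts().head(10)
print(top_genres)
```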

**Number of actors and budget**


Plot the data so we can visually evaluate whether there seems to be a relation between the number of actors in a movie and its budget. Create a new DataFrame containing only the relevant columns.
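A possible sketch on hypothetical data. It assumes the actors are stored as a comma-separated string in a column named `cast`; the names `df_cast` and `n_actors` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; the `cast` format is an assumption
df_movies = pd.DataFrame({
    "cast": ["Ann, Bob, Cy", "Dee", None, "Eve, Fay"],
    "budget": [30_000_000, 5_000_000, 1_000_000, 12_000_000],
})

# New DataFrame with only the relevant columns; rows without cast are dropped
df_cast = df_movies[["cast", "budget"]].dropna().copy()

# Number of actors = number of names in the comma-separated string
df_cast["n_actors"] = df_cast["cast"].str.split(", ").str.len()

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df_cast["n_actors"], df_cast["budget"], alpha=0.5)
ax.set_xlabel("Number of actors")
ax.set_ylabel("Budget")
ax.set_title("Number of actors vs budget")
ax.grid(True)
```

A scatter plot is a natural choice here because both variables are numeric and each movie contributes one point.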

Create the same plot, but now containing only the movies whose budget did not exceed 20,000,000.
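A short sketch of the filter, on a hypothetical DataFrame that already holds an actor count. The names `df_cast`, `n_actors`, and `BUDGET_LIMIT` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

# Hypothetical sample with a precomputed actor count
df_cast = pd.DataFrame({
    "n_actors": [3, 1, 2, 5],
    "budget": [30_000_000, 5_000_000, 20_000_000, 45_000_000],
})

# "Did not exceed" means less than or equal to the limit
BUDGET_LIMIT = 20_000_000
df_low_budget = df_cast[df_cast["budget"] <= BUDGET_LIMIT]

# pandas can draw the scatter plot directly from the filtered DataFrame
ax = df_low_budget.plot.scatter(x="n_actors", y="budget", alpha=0.5)
ax.set_title("Number of actors vs budget (budget up to 20,000,000)")
```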

**Votes**


Show the total number of votes for the year with the most releases.
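A sketch on hypothetical data. It assumes a `release_year` column has already been derived from the release date and that "total number of votes" means the sum of a `vote_count` column.

```python
import pandas as pd

# Hypothetical sample; `release_year` and `vote_count` are assumed column names
df_movies = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2000, 2001],
    "vote_count": [10, 20, 100, 5, 200],
})

# Year that occurs most often, i.e. the year with the most releases
# (if several years tie, only one of them is returned)
busiest_year = df_movies["release_year"].value_counts().idxmax()

# Sum the votes of all movies released in that year
in_busiest_year = df_movies["release_year"] == busiest_year
total_votes = df_movies.loc[in_busiest_year, "vote_count"].sum()

print(busiest_year, total_votes)
```

Note that the year with the most releases is not necessarily the year with the most votes, as the sample data shows.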

More exercises? Feel free to explore the dataset further.

In [None]:


df = pd.read_csv("TMDB-movies-small.csv")
df.head()



df.info()



df.isnull().sum()



# Drop movies that have no rating information
df = df.dropna(subset=['vote_average', 'vote_count'])



# Ten best-rated movies among those with more than 1000 votes
top_movies = df[df['vote_count'] > 1000].sort_values(by='vote_average', ascending=False).head(10)
top_movies[['title', 'vote_average', 'vote_count']]



plt.figure(figsize=(10, 6))
plt.barh(top_movies['title'], top_movies['vote_average'], color='skyblue')
plt.xlabel('Average Vote')
plt.title('Top 10 Highest Rated Movies (min 1000 votes)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()



plt.figure(figsize=(10, 6))
plt.hist(df['runtime'].dropna(), bins=50, color='lightgreen', edgecolor='black')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Number of Movies')
plt.title('Distribution of Movie Runtimes')
plt.show()



# Parse the dates (invalid ones become NaT), then count movies per release year
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['release_year'] = df['release_date'].dt.year
release_counts = df['release_year'].value_counts().sort_index()

plt.figure(figsize=(12, 6))
plt.plot(release_counts.index, release_counts.values, color='steelblue')
plt.xlabel('Year')
plt.ylabel('Number of Movies')
plt.title('Movies Released Per Year')
plt.grid(True)
plt.show()



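# Zeros cannot be shown on a log axis, so plot only rows with a positive budget and revenue
positive = df[(df['budget'] > 0) & (df['revenue'] > 0)]

plt.figure(figsize=(10, 6))
plt.scatter(positive['budget'], positive['revenue'], alpha=0.5)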
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.title('Budget vs Revenue')
plt.grid(True)
plt.show()



from collections import Counter
from itertools import chain

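# Split each genre string into a list, then flatten all lists into one
genre_series = df['genres'].dropna().apply(lambda x: x.split(', '))
all_genres = list(chain.from_iterable(genre_series))

# Count occurrences and keep the ten most frequent genres
genre_counts = Counter(all_genres)
top_genres = genre_counts.most_common(10)

# Unzip the (genre, count) pairs into two tuples for plotting
genres, counts = zip(*top_genres)

plt.figure(figsize=(12, 6))
plt.barh(genres, counts, color='orange')
plt.xlabel('Number of Movies')
plt.title('Top 10 Most Common Genres')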
plt.gca().invert_yaxis()
plt.show()



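# Average rating per genre; a movie counts toward every genre it lists
genre_avg_vote = []

for genre in genre_counts:
    # regex=False matches the genre name as plain text, not as a pattern
    genre_movies = df[df['genres'].str.contains(genre, regex=False, na=False)]
    genre_avg_vote.append((genre, genre_movies['vote_average'].mean()))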

genre_avg_vote.sort(key=lambda x: x[1], reverse=True)
genres, avg_votes = zip(*genre_avg_vote[:10])

plt.figure(figsize=(12, 6))
plt.barh(genres, avg_votes, color='purple')
plt.xlabel('Average Vote')
plt.title('Top 10 Genres by Average Vote')
plt.gca().invert_yaxis()
plt.show()



language_counts = df['original_language'].value_counts().head(10)

plt.figure(figsize=(10, 6))
plt.barh(language_counts.index, language_counts.values, color='teal')
plt.xlabel('Number of Movies')
plt.title('Top 10 Languages')
plt.gca().invert_yaxis()
plt.show()



# Count movies per director, ignoring entries with an empty name
top_directors = df['director'].value_counts().drop('', errors='ignore').head(10)

plt.figure(figsize=(10, 6))
plt.barh(top_directors.index, top_directors.values, color='slateblue')
plt.xlabel('Number of Movies')
plt.title('Top 10 Directors')
plt.gca().invert_yaxis()
plt.show()



correlation = df[['vote_average', 'vote_count', 'budget', 'revenue', 'popularity']].corr()

plt.figure(figsize=(8, 6))
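# Fix the color range to [-1, 1] so the neutral color of the diverging colormap means zero correlation
plt.imshow(correlation, cmap='coolwarm', interpolation='none', vmin=-1, vmax=1)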
plt.colorbar()
plt.xticks(range(len(correlation)), correlation.columns, rotation=45)
plt.yticks(range(len(correlation)), correlation.columns)
plt.title('Correlation Between Features')
plt.show()



revenue_by_year = df.groupby('release_year')['revenue'].mean()

plt.figure(figsize=(12, 6))
plt.plot(revenue_by_year.index, revenue_by_year.values, color='darkred')
plt.xlabel('Year')
plt.ylabel('Average Revenue')
plt.title('Average Revenue Per Year')
plt.grid(True)
plt.show()



plt.figure(figsize=(10, 6))
plt.scatter(df['vote_count'], df['popularity'], alpha=0.5, color='darkgreen')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Vote Count')
plt.ylabel('Popularity')
plt.title('Vote Count vs Popularity')
plt.grid(True)
plt.show()



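# Average revenue per genre; a movie counts toward every genre it lists
avg_rev = []
for genre in genre_counts:
    # regex=False matches the genre name as plain text, not as a pattern
    genre_movies = df[df['genres'].str.contains(genre, regex=False, na=False)]
    avg_rev.append((genre, genre_movies['revenue'].mean()))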

avg_rev.sort(key=lambda x: x[1] if not pd.isna(x[1]) else 0, reverse=True)
genres, revenues = zip(*avg_rev[:10])

plt.figure(figsize=(12, 6))
plt.barh(genres, revenues, color='gold')
plt.xlabel('Average Revenue')
plt.title('Top 10 Genres by Avg Revenue')
plt.gca().invert_yaxis()
plt.show()



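# Average runtime per genre; a movie counts toward every genre it lists
avg_runtime = []
for genre in genre_counts:
    # regex=False matches the genre name as plain text, not as a pattern
    genre_movies = df[df['genres'].str.contains(genre, regex=False, na=False)]
    avg_runtime.append((genre, genre_movies['runtime'].mean()))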

avg_runtime.sort(key=lambda x: x[1] if not pd.isna(x[1]) else 0, reverse=True)
genres, runtimes = zip(*avg_runtime[:10])

plt.figure(figsize=(12, 6))
plt.barh(genres, runtimes, color='coral')
plt.xlabel('Average Runtime')
plt.title('Top 10 Genres by Avg Runtime')
plt.gca().invert_yaxis()
plt.show()



# Floor each release year to the start of its decade (e.g. 1997 becomes 1990)
df['decade'] = (df['release_year'] // 10) * 10
popular_decades = df.groupby('decade')['popularity'].mean().sort_index()

plt.figure(figsize=(10, 6))
plt.bar(popular_decades.index.astype(int), popular_decades.values, width=8, color='dodgerblue')
plt.xlabel('Decade')
plt.ylabel('Average Popularity')
plt.title('Popularity by Decade')
plt.show()



df.describe(include='all')
