DATA ANALYSIS
------------------------------------------------------

How many unique movies are in the dataset?

What is the average IMDb rating of all movies?


How many movies were released each year?

What is the most common genre in the dataset?

What is the most common language for movies?

How many movies are available on each streaming platform?

Which platform has the highest average IMDb rating?

Are there any movies available on all four platforms (Netflix, Hulu, Prime Video, Disney+)?

Which platform has the most movies with an IMDb rating above 8.0?

Which platform has the most movies released after 2015?

What are the top 5 genres with the most movies?

Which genre has the highest average IMDb rating?

What is the most popular genre on Netflix?

How many movies in the Comedy genre have an IMDb rating above 7.5?

What is the average IMDb rating for Horror movies?

Who are the top 5 directors with the most movies in the dataset?

Which director has the highest average IMDb rating?

Which director has made the most Action movies?

What is the average IMDb rating of movies directed by Christopher Nolan?

How many movies by Steven Spielberg are available on Prime Video?

What are the top 5 most common languages in the dataset?

What is the average IMDb rating of French movies?

How many movies are from the United States?

Which country has produced the most Drama movies?

What is the most common language for movies on Hulu?

IMDb and Rotten Tomatoes Score Analysis

What is the highest IMDb rating in the dataset, and which movie has it?

What is the lowest Rotten Tomatoes score, and which movie has it?

What is the correlation between IMDb ratings and Rotten Tomatoes scores?

How many movies have both IMDb rating above 8.0 and Rotten Tomatoes score above 90%?

Which platform has the highest average Rotten Tomatoes score?

What is the average runtime of all movies?

Which movie has the longest runtime?

Which genre tends to have the longest runtime on average?

How many movies have a runtime of less than 60 minutes?

What is the average IMDb rating for movies with a runtime above 150 minutes?



In [97]:
import seaborn as sns
import pandas as pd
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt

In [98]:
df=pd.read_csv('Movies.csv')

In [99]:
df.columns

Index(['Unnamed: 0', 'ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes',
       'Netflix', 'Hulu', 'Prime Video', 'Disney+', 'Type', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime'],
      dtype='object')

In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16744 entries, 0 to 16743
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       16744 non-null  int64  
 1   ID               16744 non-null  int64  
 2   Title            16744 non-null  object 
 3   Year             16744 non-null  int64  
 4   Age              7354 non-null   object 
 5   IMDb             16173 non-null  float64
 6   Rotten Tomatoes  5158 non-null   object 
 7   Netflix          16744 non-null  int64  
 8   Hulu             16744 non-null  int64  
 9   Prime Video      16744 non-null  int64  
 10  Disney+          16744 non-null  int64  
 11  Type             16744 non-null  int64  
 12  Directors        16018 non-null  object 
 13  Genres           16469 non-null  object 
 14  Country          16309 non-null  object 
 15  Language         16130 non-null  object 
 16  Runtime          16152 non-null  float64
dtypes: float64(2

In [101]:
df.duplicated().any()

np.False_

In [102]:
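# Fill values: mode for the categorical columns, mean for the numeric ones.
# Series.mode() returns a Series (indexed 0, 1, ...), not a scalar.
a=df['Age'].mode()
im=df['IMDb'].mean()
rt=df['Rotten Tomatoes'].mode()
ct=df['Country'].mode()
dr=df['Directors'].mode()
lt=df['Language'].mode()
rt=df['Runtime'].mean()  # rebinds rt, so the Rotten Tomatoes mode above is discarded


In [103]:
# fillna() with a Series aligns on index labels, so the mode-based fills only
# reach rows whose label appears in that Series; the scalar means fill every gap.
df['Age']=df['Age'].fillna(a)
df['IMDb']=df['IMDb'].fillna(im)
df['Rotten Tomatoes']=df['Rotten Tomatoes'].fillna(rt)  # rt holds the Runtime mean here
df['Country']=df['Country'].fillna(ct)
df['Directors']=df['Directors'].fillna(dr)
df['Language']=df['Language'].fillna(lt)
df['Runtime']=df['Runtime'].fillna(rt)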

In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16744 entries, 0 to 16743
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       16744 non-null  int64  
 1   ID               16744 non-null  int64  
 2   Title            16744 non-null  object 
 3   Year             16744 non-null  int64  
 4   Age              7354 non-null   object 
 5   IMDb             16744 non-null  float64
 6   Rotten Tomatoes  16744 non-null  object 
 7   Netflix          16744 non-null  int64  
 8   Hulu             16744 non-null  int64  
 9   Prime Video      16744 non-null  int64  
 10  Disney+          16744 non-null  int64  
 11  Type             16744 non-null  int64  
 12  Directors        16018 non-null  object 
 13  Genres           16469 non-null  object 
 14  Country          16309 non-null  object 
 15  Language         16130 non-null  object 
 16  Runtime          16744 non-null  float64
dtypes: float64(2
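
The non-null counts for Age, Directors, Country and Language are the same before and after the fill step, which is what one would expect when `fillna` receives the Series returned by `mode()` and aligns it on index labels. The following is a minimal sketch, on a made-up toy frame rather than `Movies.csv`, of filling with a scalar instead; the names `toy`, `language_mode`, `runtime_mean` and `filled` are illustrative assumptions. Applying the same idea to the real data would change the counts reported later in this notebook, so it is shown only as an illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Language": ["English", None, "English", "French", None],
    "Runtime": [90.0, np.nan, 120.0, np.nan, 60.0],
})

# mode() returns a Series; .iloc[0] picks one scalar fill value from it.
language_mode = toy["Language"].mode().iloc[0]
runtime_mean = toy["Runtime"].mean()

# A scalar fill value is applied to every missing entry in the column.
filled = toy.assign(
    Language=toy["Language"].fillna(language_mode),
    Runtime=toy["Runtime"].fillna(runtime_mean),
)
print(filled.isna().sum())
```

Using a separate, descriptive name per fill value also avoids one variable silently replacing another.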

In [105]:
df=df.drop(columns=['Unnamed: 0'])

In [106]:
# Strip the '%' sign and rescale from 0-100 to 0-10, the same range as IMDb
df['Rotten Tomatoes'] = df['Rotten Tomatoes'].str.rstrip('%').astype(float) / 10

In [107]:
# Long format: one row per (movie, platform) pair
platform_df = pd.melt(
    df,
    id_vars=['Title', 'IMDb', 'Rotten Tomatoes'],
    value_vars=['Netflix', 'Hulu', 'Prime Video', 'Disney+'],
    var_name='Platform',
    value_name='Availability',
)

In [108]:
df['Rotten Tomatoes'] = pd.to_numeric(df['Rotten Tomatoes'], errors='coerce')
df['IMDb'] = pd.to_numeric(df['IMDb'], errors='coerce')

In [109]:
df['Runtime'] = df['Runtime'].astype(int)

How many unique movies are in the dataset?

In [110]:
df['Title'].nunique()

16744

What is the average IMDb rating of all movies?

In [111]:
df['IMDb'].mean()

np.float64(5.902751499412601)

How many movies were released each year?

In [112]:
movies_per_year = df.groupby('Year')['Title'].count().reset_index()
movies_per_year.columns = ['Year', 'Number_of_Movies']
print(movies_per_year)


     Year  Number_of_Movies
0    1902                 1
1    1912                 1
2    1913                 2
3    1915                 4
4    1916                 1
..    ...               ...
104  2016              1206
105  2017              1401
106  2018              1285
107  2019               689
108  2020               147

[109 rows x 2 columns]


What is the most common genre in the dataset?

In [113]:
# Split genres into separate values
genre_counts = df['Genres'].str.split(',', expand=True).stack().value_counts()

# Most common genre
most_common_genre = genre_counts.idxmax()
most_common_count = genre_counts.max()

print("Most Common Genre:", most_common_genre)
print("Count:", most_common_count)


Most Common Genre: Drama
Count: 7227


What is the most common language for movies?

In [114]:
# Split languages and count frequencies
language_counts = df['Language'].str.split(',', expand=True).stack().value_counts()

# Most common language
most_common_language = language_counts.idxmax()
most_common_count = language_counts.max()

print("Most Common Language:", most_common_language)
print("Count:", most_common_count)


Most Common Language: English
Count: 13233


How many movies are available on each streaming platform?

In [115]:
platform_counts = df[['Netflix', 'Hulu', 'Prime Video', 'Disney+']].sum()

print(platform_counts)


Netflix         3560
Hulu             903
Prime Video    12354
Disney+          564
dtype: int64


Which platform has the highest average IMDb rating?

In [116]:
platform_avg_imdb = {
    "Netflix": df.loc[df["Netflix"] == 1, "IMDb"].mean(),
    "Hulu": df.loc[df["Hulu"] == 1, "IMDb"].mean(),
    "Prime Video": df.loc[df["Prime Video"] == 1, "IMDb"].mean(),
    "Disney+": df.loc[df["Disney+"] == 1, "IMDb"].mean()
}

print(platform_avg_imdb)
print("Highest Avg IMDb Platform:", max(platform_avg_imdb, key=platform_avg_imdb.get))



{'Netflix': np.float64(6.241355246328845), 'Hulu': np.float64(6.1352494645554145), 'Prime Video': np.float64(5.775669999088394), 'Disney+': np.float64(6.440430410459951)}
Highest Avg IMDb Platform: Disney+
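
The same per-platform averages can also be computed from a long-format table, similar in shape to the `platform_df` built with `pd.melt` earlier. The sketch below is a hedged illustration on a made-up toy frame; `toy`, `long_df`, `available` and `avg_by_platform` are assumed names, not part of the original analysis.

```python
import pandas as pd

# Toy data with the same 0/1 platform flags as the real dataset (values are made up).
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "IMDb": [8.0, 6.0, 7.0, 5.0],
    "Netflix": [1, 0, 1, 0],
    "Hulu": [0, 1, 1, 0],
    "Prime Video": [0, 1, 0, 1],
    "Disney+": [1, 0, 0, 0],
})
platforms = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# One row per (movie, platform) pair.
long_df = toy.melt(
    id_vars=["Title", "IMDb"],
    value_vars=platforms,
    var_name="Platform",
    value_name="Availability",
)

# Keep only the pairs where the movie is actually on the platform.
available = long_df[long_df["Availability"] == 1]
avg_by_platform = available.groupby("Platform")["IMDb"].mean()

print(avg_by_platform)
print("Highest Avg IMDb Platform:", avg_by_platform.idxmax())
```

A single `groupby` replaces the four hand-written filters, which helps if platforms are added or renamed.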


Are there any movies available on all four platforms (Netflix, Hulu, Prime Video, Disney+)?

In [117]:
# Filter movies available on all 4 platforms
all_platforms = df[(df['Netflix']==1) & (df['Hulu']==1) & (df['Prime Video']==1) & (df['Disney+']==1)]

print("Number of movies on all 4 platforms:", len(all_platforms))
print(all_platforms[['Title', 'Netflix', 'Hulu', 'Prime Video', 'Disney+']])


Number of movies on all 4 platforms: 0
Empty DataFrame
Columns: [Title, Netflix, Hulu, Prime Video, Disney+]
Index: []
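
As a variation on the chained `&` filter, the platform flags can be checked row-wise. This is a small sketch on made-up data, with `toy`, `on_all` and `platform_count` as assumed names; it also shows how many platforms carry each title, which may be more informative when no movie is on all four.

```python
import pandas as pd

# Toy data with 0/1 availability flags (values are made up).
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Netflix": [1, 1, 0],
    "Hulu": [1, 0, 1],
    "Prime Video": [1, 1, 1],
    "Disney+": [1, 0, 0],
})
platforms = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# True only for rows where every platform flag equals 1.
on_all = toy[toy[platforms].eq(1).all(axis=1)]

# Number of platforms carrying each title.
platform_count = toy[platforms].sum(axis=1)

print(on_all["Title"].tolist())
print(platform_count.tolist())
```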


Which platform has the most movies with an IMDb rating above 8.0?

In [118]:
# Filter movies with IMDb > 8.0
high_imdb = df[df['IMDb'] > 8.0]

# Count per platform
platform_high_imdb_counts = high_imdb[['Netflix','Hulu','Prime Video','Disney+']].sum()

print(platform_high_imdb_counts)
print("Platform with most movies IMDb>8.0:", platform_high_imdb_counts.idxmax())


Netflix        129
Hulu            23
Prime Video    324
Disney+         21
dtype: int64
Platform with most movies IMDb>8.0: Prime Video


Which platform has the most movies released after 2015?

In [119]:
# Filter movies released after 2015
movies_post_2015 = df[df['Year'] > 2015]

# Count per platform
platform_post2015_counts = movies_post_2015[['Netflix','Hulu','Prime Video','Disney+']].sum()

print(platform_post2015_counts)
print("Platform with most movies released after 2015:", platform_post2015_counts.idxmax())


Netflix        2105
Hulu            454
Prime Video    2321
Disney+          87
dtype: int64
Platform with most movies released after 2015: Prime Video


What are the top 5 genres with the most movies?

In [120]:
# Split genres into separate rows and count
genre_counts = df['Genres'].str.split(',', expand=True).stack().value_counts()

# Top 5 genres
top5_genres = genre_counts.head(5)
print(top5_genres)


Drama          7227
Comedy         4637
Thriller       3354
Action         2683
Documentary    2546
Name: count, dtype: int64


Which genre has the highest average IMDb rating?

In [121]:
# Split genres into separate rows
df_genres = df.copy()
df_genres = df_genres.assign(Genre=df_genres['Genres'].str.split(',')).explode('Genre')

# Calculate average IMDb rating per genre
genre_avg_imdb = df_genres.groupby('Genre')['IMDb'].mean().sort_values(ascending=False)

# Genre with highest average IMDb
highest_avg_genre = genre_avg_imdb.idxmax()
highest_avg_rating = genre_avg_imdb.max()

print("Genre with highest average IMDb rating:", highest_avg_genre)
print("Average IMDb rating:", highest_avg_rating)


Genre with highest average IMDb rating: News
Average IMDb rating: 7.063914342834766


What is the most popular genre on Netflix?

In [122]:
# Filter Netflix movies
netflix_movies = df[df['Netflix'] == 1]

# Split genres into separate rows
netflix_genres = netflix_movies.assign(Genre=netflix_movies['Genres'].str.split(',')).explode('Genre')

# Count genres and find the most popular
most_popular_genre_netflix = netflix_genres['Genre'].value_counts().idxmax()
count = netflix_genres['Genre'].value_counts().max()

print("Most popular genre on Netflix:", most_popular_genre_netflix)
print("Count:", count)


Most popular genre on Netflix: Drama
Count: 1501


How many movies in the Comedy genre have an IMDb rating above 7.5?

In [123]:
# Filter Comedy movies with IMDb > 7.5
comedy_high_imdb = df[df['Genres'].str.contains('Comedy', na=False) & (df['IMDb'] > 7.5)]

# Count
num_comedy_high_imdb = len(comedy_high_imdb)

print("Number of Comedy movies with IMDb > 7.5:", num_comedy_high_imdb)


Number of Comedy movies with IMDb > 7.5: 316


What is the average IMDb rating for Horror movies?

In [124]:
# Filter Horror movies safely
horror_movies = df[df['Genres'].str.contains('Horror', na=False)]

# Average IMDb rating
avg_imdb_horror = horror_movies['IMDb'].mean()

print("Average IMDb rating for Horror movies:", round(avg_imdb_horror, 2))


Average IMDb rating for Horror movies: 4.73


Who are the top 5 directors with the most movies in the dataset?

In [125]:
# Split directors into separate rows
directors_df = df.assign(Director=df['Directors'].str.split(',')).explode('Director')

# Count movies per director
top5_directors = directors_df['Director'].value_counts().head(5)

print(top5_directors)


Director
Jay Chapman         36
Joseph Kane         31
Cheh Chang          29
Jim Wynorski        23
William Beaudine    23
Name: count, dtype: int64


Which director has the highest average IMDb rating?

In [126]:
# Split directors into separate rows
directors_df = df.assign(Director=df['Directors'].str.split(',')).explode('Director')

# Calculate average IMDb rating per director
director_avg_imdb = directors_df.groupby('Director')['IMDb'].mean().sort_values(ascending=False)

# Director with highest average IMDb
top_director = director_avg_imdb.idxmax()
top_rating = director_avg_imdb.max()

print("Director with highest average IMDb rating:", top_director)
print("Average IMDb rating:", round(top_rating, 2))


Director with highest average IMDb rating: Danny Wu
Average IMDb rating: 9.3
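
A plain mean per director tends to favour directors with a single highly rated title, and `idxmax()` returns only the first name when several share the top value. One possible refinement, sketched here on made-up data, is to require a minimum number of films before ranking. The threshold `MIN_FILMS` and the names `toy`, `stats` and `eligible` are hypothetical choices for illustration.

```python
import pandas as pd

# Toy data; director names and ratings are made up.
toy = pd.DataFrame({
    "Directors": ["Ann Lee", "Bob Ray,Ann Lee", "Bob Ray", "Cy Dow", "Ann Lee"],
    "IMDb": [7.0, 8.0, 6.0, 9.5, 7.5],
})

MIN_FILMS = 2  # hypothetical cutoff; pick whatever suits the analysis

# One row per (movie, director) pair.
exploded = toy.assign(Director=toy["Directors"].str.split(",")).explode("Director")

# Mean rating and number of films per director.
stats = exploded.groupby("Director")["IMDb"].agg(["mean", "count"])

# Rank only directors with at least MIN_FILMS films.
eligible = stats[stats["count"] >= MIN_FILMS].sort_values("mean", ascending=False)

print("Unfiltered top:", stats["mean"].idxmax())
print(eligible)
```

In this toy example the unfiltered winner has one film, while the filtered ranking puts a director with three films first.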


Which director has made the most Action movies?

In [127]:
# Filter Action movies
action_movies = df[df['Genres'].str.contains('Action', na=False)]

# Split directors into separate rows
action_directors = action_movies.assign(Director=action_movies['Directors'].str.split(',')).explode('Director')

# Count movies per director
top_action_director = action_directors['Director'].value_counts().idxmax()
count = action_directors['Director'].value_counts().max()

print("Director with most Action movies:", top_action_director)
print("Number of Action movies:", count)


Director with most Action movies: Cheh Chang
Number of Action movies: 29


What is the average IMDb rating of movies directed by Christopher Nolan?

In [128]:
# Filter Christopher Nolan movies
nolan_movies = df[df['Directors'].str.contains('Christopher Nolan', na=False)]

# Average IMDb rating
avg_imdb_nolan = nolan_movies['IMDb'].mean()

print("Average IMDb rating of Christopher Nolan movies:", round(avg_imdb_nolan, 2))


Average IMDb rating of Christopher Nolan movies: 8.3


How many movies by Steven Spielberg are available on Prime Video?

In [129]:
# Filter Steven Spielberg movies
spielberg_movies = df[df['Directors'].str.contains('Steven Spielberg', na=False)]

# Filter movies available on Prime Video
prime_spielberg = spielberg_movies[spielberg_movies['Prime Video'] == 1]

# Count
num_prime_spielberg = len(prime_spielberg)

print("Number of Steven Spielberg movies on Prime Video:", num_prime_spielberg)


Number of Steven Spielberg movies on Prime Video: 1


What are the top 5 most common languages in the dataset?

In [130]:
# Split languages into separate rows
languages_df = df.assign(Language=df['Language'].str.split(',')).explode('Language')

# Count movies per language
top5_languages = languages_df['Language'].value_counts().head(5)

print(top5_languages)


Language
English    13233
Spanish      872
French       799
Hindi        731
German       483
Name: count, dtype: int64


What is the average IMDb rating of French movies?

In [131]:
# Filter French movies
french_movies = df[df['Language'].str.contains('French', na=False)]

# Average IMDb rating
avg_imdb_french = french_movies['IMDb'].mean()

print("Average IMDb rating of French movies:", round(avg_imdb_french, 2))


Average IMDb rating of French movies: 6.36
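
Note that `str.contains('French')` matches substrings, so any language label that merely contains the word would be counted too. The sketch below, on made-up data, compares substring matching with matching a whole comma-separated entry; the regular expression and the names `toy`, `substring_mask` and `exact_mask` are assumptions for illustration, and whether the real column contains such labels has not been checked here.

```python
import pandas as pd

# Toy data; the labels are made up to show the difference.
toy = pd.DataFrame({
    "Language": ["French", "English,French", "French Sign Language", None],
    "IMDb": [7.0, 6.0, 5.0, 8.0],
})

# Substring match: also hits "French Sign Language".
substring_mask = toy["Language"].str.contains("French", na=False)

# Whole-entry match: "French" must be delimited by commas or the string ends.
exact_mask = toy["Language"].str.contains(r"(?:^|,)\s*French\s*(?:,|$)", na=False)

print(int(substring_mask.sum()), int(exact_mask.sum()))
print(toy.loc[exact_mask, "IMDb"].mean())
```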


How many movies are from the United States?

In [132]:
# Filter movies from the United States
us_movies = df[df['Country'].str.contains('United States', na=False)]

# Count
num_us_movies = len(us_movies)

print("Number of movies from the United States:", num_us_movies)


Number of movies from the United States: 10486


Which country has produced the most Drama movies?

In [133]:
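# Split multi-country entries so each co-producing country is counted separately
drama_countries = df.loc[df['Genres'].str.contains('Drama', na=False), 'Country'].str.split(',').explode()
top_drama_country = drama_countries.value_counts().idxmax()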
print("Country with most Drama movies:", top_drama_country)

Country with most Drama movies: United States


What is the most common language for movies on Hulu?

In [134]:
top_hulu_language = df.loc[df['Hulu'] == 1, 'Language'].str.split(',').explode().value_counts().idxmax()
print("Most common language for movies on Hulu:", top_hulu_language)


Most common language for movies on Hulu: English


# IMDb and Rotten Tomatoes Score Analysis

What is the highest IMDb rating in the dataset, and which movie has it?

In [135]:
# Highest IMDb rating
max_imdb = df['IMDb'].max()

# Movie(s) with highest IMDb
movies_max_imdb = df.loc[df['IMDb'] == max_imdb, 'Title'].tolist()

print("Highest IMDb rating:", max_imdb)
print("Movie(s) with highest IMDb rating:", movies_max_imdb)


Highest IMDb rating: 9.3
Movie(s) with highest IMDb rating: ['My Next Guest with David Letterman and Shah Rukh Khan', 'Love on a Leash', 'Square One', 'Steven Banks: Home Entertainment Center', 'Down, But Not Out!', 'Bounty']


What is the lowest Rotten Tomatoes score, and which movie has it?

In [136]:
# Lowest Rotten Tomatoes score
min_rt = df['Rotten Tomatoes'].min()

# Movie(s) with lowest Rotten Tomatoes score
movies_min_rt = df.loc[df['Rotten Tomatoes'] == min_rt, 'Title'].tolist()

print("Lowest Rotten Tomatoes score:", min_rt)
print("Movie(s) with lowest Rotten Tomatoes score:", movies_min_rt)


Lowest Rotten Tomatoes score: 0.2
Movie(s) with lowest Rotten Tomatoes score: ['Getaway', 'Nina', 'Strange Wilderness', "Kickin' It Old Skool"]


What is the correlation between IMDb ratings and Rotten Tomatoes scores?

In [137]:
# Correlation between IMDb and Rotten Tomatoes
correlation = df['IMDb'].corr(df['Rotten Tomatoes'])
print("Correlation between IMDb and Rotten Tomatoes:", round(correlation, 2))

Correlation between IMDb and Rotten Tomatoes: 0.62
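
`Series.corr` computes the Pearson coefficient by default and uses only the rows where both values are present, so movies without a Rotten Tomatoes score do not enter the figure above. A minimal sketch of that behaviour on made-up numbers, with `scores`, `r` and `r_manual` as assumed names:

```python
import numpy as np
import pandas as pd

# Toy scores with gaps in both columns (values are made up).
scores = pd.DataFrame({
    "imdb": [1.0, 2.0, 3.0, np.nan, 5.0],
    "rt": [2.0, 4.0, np.nan, 8.0, 10.0],
})

# pandas drops incomplete pairs before computing the coefficient.
r = scores["imdb"].corr(scores["rt"])

# Same result computed explicitly on the complete rows only.
complete = scores.dropna()
r_manual = np.corrcoef(complete["imdb"], complete["rt"])[0, 1]

print(round(r, 2), round(r_manual, 2))
```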


How many movies have both IMDb rating above 8.0 and Rotten Tomatoes score above 90%?

In [138]:
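# 'Rotten Tomatoes' was rescaled to 0-10 in cell [106], so no value can exceed 90;
# on that scale a 90% cutoff corresponds to 9.0
high_rated_movies = df.loc[(df['IMDb'] > 8.0) & (df['Rotten Tomatoes'] > 90)]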
num_high_rated = len(high_rated_movies)

print("Number of movies with IMDb > 8.0 and Rotten Tomatoes > 90%:", num_high_rated)


Number of movies with IMDb > 8.0 and Rotten Tomatoes > 90%: 0
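
The count of 0 follows from the scale of the column: the Rotten Tomatoes percentages were divided by 10 during cleaning, so a comparison against 90 cannot match any row. The sketch below uses made-up data to show the effect of expressing the cutoff on the rescaled 0-10 range; `toy`, `rt_scaled`, `on_percent_scale` and `on_ten_scale` are assumed names, and the real count for `Movies.csv` is not known from this notebook.

```python
import pandas as pd

# Toy data in the raw format of the dataset (values are made up).
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "IMDb": [8.5, 8.2, 7.9, 8.8],
    "Rotten Tomatoes": ["95%", "88%", "97%", None],
})

# Same cleaning step as in the notebook: strip '%' and rescale to 0-10.
rt_scaled = toy["Rotten Tomatoes"].str.rstrip("%").astype(float) / 10

# Cutoff written as a percentage: never true after rescaling.
on_percent_scale = ((toy["IMDb"] > 8.0) & (rt_scaled > 90)).sum()

# Cutoff written on the rescaled range: 90% becomes 9.0.
on_ten_scale = ((toy["IMDb"] > 8.0) & (rt_scaled > 9.0)).sum()

print(int(on_percent_scale), int(on_ten_scale))
```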


Which platform has the highest average Rotten Tomatoes score?

In [139]:
# Calculate average Rotten Tomatoes score per platform
avg_rt_scores = {
    'Netflix': df.loc[df['Netflix'] == 1, 'Rotten Tomatoes'].mean(),
    'Hulu': df.loc[df['Hulu'] == 1, 'Rotten Tomatoes'].mean(),
    'Prime Video': df.loc[df['Prime Video'] == 1, 'Rotten Tomatoes'].mean(),
    'Disney+': df.loc[df['Disney+'] == 1, 'Rotten Tomatoes'].mean()
}

# Find platform with highest average
top_platform = max(avg_rt_scores, key=avg_rt_scores.get)
print("Platform with highest average Rotten Tomatoes score:", top_platform)
print("Average Rotten Tomatoes score:", round(avg_rt_scores[top_platform], 2))


Platform with highest average Rotten Tomatoes score: Hulu
Average Rotten Tomatoes score: 6.69


What is the average runtime of all movies?

In [140]:
# Average runtime
avg_runtime = df['Runtime'].mean()
print("Average runtime of all movies:", round(avg_runtime, 2), "minutes")


Average runtime of all movies: 93.4 minutes


Which movie has the longest runtime?

In [141]:
# Longest runtime
max_runtime = df['Runtime'].max()

# Movie(s) with longest runtime
longest_movies = df.loc[df['Runtime'] == max_runtime, 'Title'].tolist()

print("Longest runtime:", max_runtime, "minutes")
print("Movie(s) with the longest runtime:", longest_movies)


Longest runtime: 1256 minutes
Movie(s) with the longest runtime: ['Colorado']


Which genre tends to have the longest runtime on average?

In [142]:
# Split and explode genres
genre_runtime = df.assign(Genres=df['Genres'].str.split(',')).explode('Genres')

# Average runtime per genre
avg_runtime_per_genre = genre_runtime.groupby('Genres')['Runtime'].mean()

# Genre with longest average runtime
longest_genre = avg_runtime_per_genre.idxmax()
longest_avg_runtime = avg_runtime_per_genre.max()

print("Genre with longest average runtime:", longest_genre)
print("Average runtime:", round(longest_avg_runtime, 2), "minutes")


Genre with longest average runtime: Musical
Average runtime: 105.18 minutes


How many movies have a runtime of less than 60 minutes?

In [143]:
short_movies_count = len(df.loc[df['Runtime'] < 60])
print("Number of movies with runtime less than 60 minutes:", short_movies_count)


Number of movies with runtime less than 60 minutes: 1219


What is the average IMDb rating for movies with a runtime above 150 minutes?

In [144]:
# Mean IMDb rating of movies longer than 150 minutes
df.loc[df['Runtime'] > 150, 'IMDb'].mean()

np.float64(6.622602863917855)