<a href="https://colab.research.google.com/github/InsightByHarshit/IMDB-Data-Analysis/blob/main/IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset

data = pd.read_csv('Copy of imdb_data.csv')


The **info()** method in a pandas DataFrame provides a concise summary of the DataFrame's structure. It is particularly useful for quickly understanding the data's size, data types, and missing values.

In [3]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     3000 non-null   int64  
 1   belongs_to_collection  604 non-null    object 
 2   budget                 3000 non-null   int64  
 3   genres                 2993 non-null   object 
 4   homepage               946 non-null    object 
 5   imdb_id                3000 non-null   object 
 6   original_language      3000 non-null   object 
 7   original_title         3000 non-null   object 
 8   overview               2992 non-null   object 
 9   popularity             3000 non-null   float64
 10  poster_path            2999 non-null   object 
 11  production_companies   2844 non-null   object 
 12  production_countries   2945 non-null   object 
 13  release_date           3000 non-null   object 
 14  runtime                2998 non-null   float64
 15  spok

In pandas, the **dtypes** attribute of a DataFrame shows the data type of each column. These types matter because they determine which operations are valid on a column and how they behave.

In [4]:
print("Data types of each column:\n", data.dtypes)


Data types of each column:
 id                         int64
belongs_to_collection     object
budget                     int64
genres                    object
homepage                  object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity               float64
poster_path               object
production_companies      object
production_countries      object
release_date              object
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
Keywords                  object
cast                      object
crew                      object
revenue                    int64
dtype: object
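To illustrate why the data type matters, here is a minimal sketch on a toy DataFrame (the column names and values are made up for illustration and are not taken from the IMDB file). The same `+` operator adds numbers in an `int64` column but concatenates text in an `object` column.

```python
import pandas as pd

# Toy frame with hypothetical values: the same numbers stored as integers and as text
df = pd.DataFrame({
    "budget": pd.Series([100, 200], dtype="int64"),
    "budget_text": pd.Series(["100", "200"], dtype=object),
})

print(df.dtypes)

# int64 column: '+' performs arithmetic
numeric_sum = df["budget"] + df["budget"]

# object column holding strings: '+' concatenates
text_sum = df["budget_text"] + df["budget_text"]

# Converting the text column to numbers restores arithmetic behavior
converted = pd.to_numeric(df["budget_text"])

print(numeric_sum.tolist())
print(text_sum.tolist())
print(converted.tolist())
```

In the dataset above, most columns are `object`, so they would likely need parsing or conversion before any numeric or date operation.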


**isnull().sum()** is a concise way to check for missing (null) values in a pandas DataFrame or Series.

In [5]:
missing_values = data.isnull().sum()
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 id                          0
belongs_to_collection    2396
budget                      0
genres                      7
homepage                 2054
imdb_id                     0
original_language           0
original_title              0
overview                    8
popularity                  0
poster_path                 1
production_companies      156
production_countries       55
release_date                0
runtime                     2
spoken_languages           20
status                      0
tagline                   597
title                       0
Keywords                  276
cast                       13
crew                       16
revenue                     0
dtype: int64
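Raw counts are easier to judge as a share of all rows. For example, 2396 missing values out of 3000 rows means roughly 80% of `belongs_to_collection` is empty. The sketch below shows one way to build such a summary. It uses a small toy DataFrame with made-up values so that it runs on its own, and the same pattern should apply to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; NaN marks a missing entry
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tagline": ["x", np.nan, np.nan, "y"],
    "homepage": [np.nan, np.nan, np.nan, "http://example.com"],
})

missing_count = toy.isnull().sum()        # number of missing values per column
missing_pct = toy.isnull().mean() * 100   # share of missing values per column, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})

# Keep only columns that have gaps, worst first
summary = summary[summary["missing"] > 0].sort_values("percent", ascending=False)
print(summary)
```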


When working with a pandas DataFrame, **columns** refer to the vertical structures that hold data for a specific feature or attribute. Each column has a name (or label) and contains data of a specific type.

In [6]:
print(data.columns)

Index(['id', 'belongs_to_collection', 'budget', 'genres', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'runtime', 'spoken_languages',
       'status', 'tagline', 'title', 'Keywords', 'cast', 'crew', 'revenue'],
      dtype='object')
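Column labels are case-sensitive, and this dataset mixes styles: `Keywords` is capitalized while every other column is lowercase, so `data['keywords']` would raise a `KeyError`. One way to guard against that is a case-insensitive lookup. The helper name `find_column` below is hypothetical (it is not part of pandas), and the example uses an empty toy DataFrame.

```python
import pandas as pd

def find_column(df, name):
    """Return the actual column label that matches `name` ignoring case, or None."""
    lookup = {str(col).lower(): col for col in df.columns}
    return lookup.get(name.lower())

# Toy frame that only mimics a few of the column labels shown above
toy = pd.DataFrame(columns=["id", "title", "Keywords", "cast"])

print(find_column(toy, "keywords"))   # matches the capitalized label
print(find_column(toy, "director"))   # no such column
```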


The **.describe()** method in pandas computes summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns in your dataset.

In [7]:
print(data.describe())

                id        budget   popularity      runtime       revenue
count  3000.000000  3.000000e+03  3000.000000  2998.000000  3.000000e+03
mean   1500.500000  2.253133e+07     8.463274   107.856571  6.672585e+07
std     866.169729  3.702609e+07    12.104000    22.086434  1.375323e+08
min       1.000000  0.000000e+00     0.000001     0.000000  1.000000e+00
25%     750.750000  0.000000e+00     4.018053    94.000000  2.379808e+06
50%    1500.500000  8.000000e+06     7.374861   104.000000  1.680707e+07
75%    2250.250000  2.900000e+07    10.890983   118.000000  6.891920e+07
max    3000.000000  3.800000e+08   294.337037   338.000000  1.519558e+09
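The summary above shows a minimum budget of 0, a 25th-percentile budget of 0, and a minimum runtime of 0. A plausible reading, which is an assumption and not something the dataset confirms, is that 0 stands in for "unknown" in those columns. Under that assumption, the sketch below masks zeros as missing before summarizing, using toy values so that it runs on its own.

```python
import pandas as pd

# Toy frame with hypothetical values; 0 is assumed to mean "unknown"
toy = pd.DataFrame({
    "budget": [0, 0, 8_000_000, 29_000_000],
    "runtime": [94.0, 0.0, 104.0, 118.0],
})

# mask() replaces every cell where the condition is True with NaN
cleaned = toy.mask(toy == 0)

print("Median budget with zeros:   ", toy["budget"].median())
print("Median budget without zeros:", cleaned["budget"].median())
print(cleaned.describe())
```

Because `describe()` skips missing values, the statistics of `cleaned` cover only the rows with a known value, which can shift the mean and quartiles considerably.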


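Converting a column to the **datetime** type lets pandas treat its values as dates rather than plain strings, which enables sorting by date, date arithmetic, and access to components such as year and month through the `.dt` accessor. With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error.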

In [8]:
data['release_date'] = pd.to_datetime(data['release_date'], errors='coerce')
print("Data types after conversion:\n", data.dtypes)

Data types after conversion:
 id                                int64
belongs_to_collection            object
budget                            int64
genres                           object
homepage                         object
imdb_id                          object
original_language                object
original_title                   object
overview                         object
popularity                      float64
poster_path                      object
production_companies             object
production_countries             object
release_date             datetime64[ns]
runtime                         float64
spoken_languages                 object
status                           object
tagline                          object
title                            object
Keywords                         object
cast                             object
crew                             object
revenue                           int64
dtype: object
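When no format is given, `pd.to_datetime` has to infer one, which can be slow and ambiguous. Passing `format=` makes the parsing explicit. The sketch below assumes the raw strings look like month/day/two-digit-year (for example `2/20/15`). That format is an assumption, since the raw `release_date` values are not shown in this notebook. With `%y`, two-digit years 00 to 68 are read as 2000 to 2068, so old films may land in the future. The `cutoff_year` correction is a hypothetical workaround, not part of pandas.

```python
import pandas as pd

# Toy values in an assumed month/day/two-digit-year format, plus one unparseable entry
raw = pd.Series(["2/20/15", "8/6/04", "5/12/65", "not a date"])

# Explicit format; unparseable values become NaT because of errors='coerce'
parsed = pd.to_datetime(raw, format="%m/%d/%y", errors="coerce")

# Hypothetical correction: any year after the cutoff is assumed to belong to the 1900s
cutoff_year = 2019
future = parsed.dt.year > cutoff_year
parsed.loc[future] = parsed.loc[future] - pd.DateOffset(years=100)

print(parsed)
```

The cutoff should be chosen from what is known about the data (for instance, the latest plausible release year); checking `parsed.dt.year.min()` and `.max()` afterwards is a quick sanity test.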



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset

data = pd.read_csv('Copy of imdb_data.csv')

# 1. Basic Dataset Summary
# Display the first 5 and last 5 rows
print(data.head())
print(data.tail())

# Summary statistics for numerical columns
print(data.describe())

# 2. Check for Missing Data
missing_values = data.isnull().sum()
print("Missing values in each column:\n", missing_values)

# 3. Data Type Check
print("Data types of each column:\n", data.dtypes)

# Convert 'release_date' to datetime if not already
data['release_date'] = pd.to_datetime(data['release_date'], errors='coerce')
print("Data types after conversion:\n", data.dtypes)

# 4. Check for Duplicates
duplicates = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Remove duplicates if necessary
data = data.drop_duplicates()

# 5. Count Unique Values
# 'genres' holds stringified lists, so nunique() counts distinct genre combinations, not individual genres
unique_genre_combos = data['genres'].nunique()
unique_languages = data['original_language'].nunique()
unique_status = data['status'].nunique()
print(f"Unique genre combinations: {unique_genre_combos}, Unique languages: {unique_languages}, Unique status: {unique_status}")

# 6. Find Top 10 Most Expensive Movies
top_expensive_movies = data[['title', 'budget']].sort_values(by='budget', ascending=False).head(10)
print("Top 10 most expensive movies:\n", top_expensive_movies)

# 7. Movies Released in Specific Year
movies_2020 = data[data['release_date'].dt.year == 2020]
print("Movies released in 2020:\n", movies_2020[['title', 'release_date']])

# 8. Count Movies with Specific Status
status_counts = data['status'].value_counts()
print("Movies by status:\n", status_counts)

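# 9. Movies with Multiple Genres
# ast.literal_eval parses the stringified lists without executing arbitrary code, unlike eval
import ast
data['genres_parsed'] = data['genres'].apply(lambda x: [g['name'] for g in ast.literal_eval(x)] if isinstance(x, str) else [])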
multiple_genres = data[data['genres_parsed'].apply(lambda x: len(x) > 1)]
print(f"Movies with multiple genres: {len(multiple_genres)}")

# 10. Movies with the Longest and Shortest Runtime
longest_runtime = data.loc[data['runtime'].idxmax()]
shortest_runtime = data.loc[data['runtime'].idxmin()]
print(f"Movie with longest runtime: {longest_runtime['title']} ({longest_runtime['runtime']} minutes)")
print(f"Movie with shortest runtime: {shortest_runtime['title']} ({shortest_runtime['runtime']} minutes)")

# 11. Top 10 Most Popular Movies
top_popular_movies = data[['title', 'popularity']].sort_values(by='popularity', ascending=False).head(10)
print("Top 10 most popular movies:\n", top_popular_movies)

# 12. Movies by Language
movies_by_language = data['original_language'].value_counts()
print("Movies by language:\n", movies_by_language)

# 13. Budget Distribution
plt.figure(figsize=(10, 6))
sns.histplot(data['budget'], kde=True, bins=30, color='skyblue')
plt.title("Budget Distribution")
plt.xlabel("Budget")
plt.ylabel("Frequency")
plt.show()

# 14. Average Revenue per Genre
# Reuses the 'genres_parsed' column built in step 9; explode() gives one row per (movie, genre) pair
genre_revenue = data.explode('genres_parsed').groupby('genres_parsed')['revenue'].mean().sort_values(ascending=False)
print("Average revenue per genre:\n", genre_revenue.head(10))

# 15. Top 5 Movies by Revenue
top_revenue_movies = data[['title', 'revenue']].sort_values(by='revenue', ascending=False).head(5)
print("Top 5 movies by revenue:\n", top_revenue_movies)

# 16. Identify Movies with Missing Overviews
missing_overviews = data[data['overview'].isnull()]
print("Movies with missing overviews:\n", missing_overviews[['title', 'overview']])

# 17. Revenue Distribution by Genre
plt.figure(figsize=(12, 6))
# Reset index before plotting to avoid duplicate index issue
sns.boxplot(x='genres_parsed', y='revenue', data=data.explode('genres_parsed').reset_index())
plt.title("Revenue Distribution by Genre")
plt.xlabel("Genre")
plt.ylabel("Revenue")
plt.xticks(rotation=90)
plt.show()

# 18. Movies with the Most Keywords
# Count the keywords per movie; the column is named 'Keywords' (capital K) in this dataset
import ast
data['keyword_count'] = data['Keywords'].apply(lambda x: len(ast.literal_eval(x)) if isinstance(x, str) else 0)
movies_most_keywords = data[['title', 'keyword_count']].sort_values(by='keyword_count', ascending=False).head(5)
print("Movies with most keywords:\n", movies_most_keywords)

# 19. Movies Released in Specific Month
data['release_month'] = data['release_date'].dt.month
movies_in_january = data[data['release_month'] == 1]
print(f"Movies released in January: {len(movies_in_january)}")

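# 20. Movies with the Most Cast Members
# Parse the stringified cast list safely with ast.literal_eval instead of eval
import ast
data['cast_count'] = data['cast'].apply(lambda x: len(ast.literal_eval(x)) if isinstance(x, str) else 0)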
movies_most_cast = data[['title', 'cast_count']].sort_values(by='cast_count', ascending=False).head(5)
print("Movies with most cast members:\n", movies_most_cast)


   id                              belongs_to_collection    budget  \
0   1  [{'id': 313576, 'name': 'Hot Tub Time Machine ...  14000000   
1   2  [{'id': 107674, 'name': 'The Princess Diaries ...  40000000   
2   3                                                NaN   3300000   
3   4                                                NaN   1200000   
4   5                                                NaN         0   

                                              genres  \
0                     [{'id': 35, 'name': 'Comedy'}]   
1  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   
2                      [{'id': 18, 'name': 'Drama'}]   
3  [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...   
4  [{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...   

                            homepage    imdb_id original_language  \
0                                NaN  tt2637294                en   
1                                NaN  tt0368933                en   
2  http://sonyclassics.com/whiplash


Data types after conversion:
 id                                int64
belongs_to_collection            object
budget                            int64
genres                           object
homepage                         object
imdb_id                          object
original_language                object
original_title                   object
overview                         object
popularity                      float64
poster_path                      object
production_companies             object
production_countries             object
release_date             datetime64[ns]
runtime                         float64
spoken_languages                 object
status                           object
tagline                          object
title                            object
Keywords                         object
cast                             object
crew                             object
revenue                           int64
dtype: object
Number of duplicate rows: 0
Unique g

KeyboardInterrupt: 
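The `genres` column stores each movie's genres as a string that looks like a Python list of dictionaries, as the `head()` output shows. Calling `nunique()` on the raw column therefore counts distinct strings, not distinct genres. The sketch below shows one way to parse such cells with `ast.literal_eval` and count individual genre names. The helper `parse_names` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import ast
import pandas as pd

def parse_names(cell):
    """Turn a stringified list of dicts into a list of their 'name' values."""
    if not isinstance(cell, str):
        return []  # missing value (NaN/None): no genres
    return [item["name"] for item in ast.literal_eval(cell)]

# Toy rows that mimic the format seen in the head() output
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}]",
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        "[{'id': 18, 'name': 'Drama'}]",
        None,
    ],
})

toy["genres_parsed"] = toy["genres"].apply(parse_names)

# Distinct raw strings (genre combinations) versus distinct individual genres
raw_unique = toy["genres"].nunique()
distinct_genres = toy["genres_parsed"].explode().dropna().nunique()

print(raw_unique, distinct_genres)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so a malformed or malicious cell raises an error instead of running code.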