# Netflix Content-Viewership Dataset Analysis
This notebook analyzes the `netflix_titles` dataset downloaded from Kaggle.  
We will:
- Load the dataset
- Explore its structure
- Perform some analysis using NumPy and Pandas

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("netflix_titles.csv")
pd.set_option('display.max_columns', None) # To display all columns

# 1st Step - Exploring Dataset

In [2]:
df.head()  # show the first five rows of the dataset

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [3]:
df.tail()  # show the last five rows of the dataset

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


In [4]:
df.shape #tells total no. of rows and columns

(8807, 12)

In [5]:
df.info()  # show the non-null count and dtype of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [6]:
df.describe()  # summary statistics for the only numeric column (release_year)

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


In [7]:
df.describe(include="object")  # count, unique values, most frequent value and its frequency for the object columns

Unnamed: 0,show_id,type,title,director,cast,country,date_added,rating,duration,listed_in,description
count,8807,8807,8807,6173,7982,7976,8797,8803,8804,8807,8807
unique,8807,2,8807,4528,7692,748,1767,17,220,514,8775
top,s1,Movie,Dick Johnson Is Dead,Rajiv Chilaka,David Attenborough,United States,"January 1, 2020",TV-MA,1 Season,"Dramas, International Movies","Paranormal activity at a lush, abandoned prope..."
freq,1,6131,1,19,19,2818,109,3207,1793,362,4


In [8]:
df[df.duplicated()]  # empty result: there are no fully duplicated rows

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


# 2nd Step - Cleaning dataset

In [9]:
# Exploring the dataset revealed three main problems to address during cleaning:

# 1. Convert date_added to a datetime dtype
# 2. Split the duration column into duration_num and duration_type, and make duration_num numeric
# 3. Replace missing values with a placeholder ("Unknown"), the mode, or the mean, as appropriate

df_clean = df.copy()
# Work on a copy so the original df stays unchanged: the exploration cells above keep
# producing the same outputs, and the data can be compared before and after cleaning.


In [10]:
# Convert the date_added column to a datetime dtype (unparseable values become NaT)
df_clean["date_added"] = pd.to_datetime(df_clean["date_added"], errors="coerce")

# Split duration into 'duration_num' and 'duration_type'
df_clean[['duration_num', 'duration_type']] = df_clean['duration'].str.extract(r'(\d+)\s*(\w+)')

# Convert duration_num to numeric (float64, because missing values become NaN)
df_clean['duration_num'] = pd.to_numeric(df_clean['duration_num'], errors='coerce')

# Move duration_num and duration_type to just after duration
pos = df_clean.columns.get_loc("duration")
col1 = df_clean.pop("duration_num")  
col2 = df_clean.pop("duration_type")

df_clean.insert(pos + 1, "duration_num", col1)  
df_clean.insert(pos + 2, "duration_type", col2)
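
To see what the regular expression in this cell captures, here is a minimal sketch on a few made-up duration strings. The `sample` series and its values are illustrative assumptions, not rows read from the file. It also shows why `duration_num` ends up as `float64`: missing entries become `NaN`.

```python
import pandas as pd

# Hypothetical values that mimic the raw 'duration' column
sample = pd.Series(["90 min", "2 Seasons", "1 Season", None])

# Same pattern as the cell above: digits, optional whitespace, then a word
parts = sample.str.extract(r"(\d+)\s*(\w+)")
parts.columns = ["duration_num", "duration_type"]

# The captured digits are still strings, so convert them explicitly
parts["duration_num"] = pd.to_numeric(parts["duration_num"], errors="coerce")

print(parts)
print(parts.dtypes)
```

Note that "Season" and "Seasons" come out as two distinct `duration_type` values, which matters if the column is later grouped or counted.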


In [11]:
# Drop the original 'duration' column since it is no longer needed
df_clean.drop("duration", axis=1, inplace=True)

In [12]:
# director, cast and country have many null values, so replace them with the placeholder text "Unknown"
cols_to_fill = ['cast', 'country', 'director']

for col in cols_to_fill:
    df_clean[col] = df_clean[col].fillna("Unknown") 

# rating, duration_num, duration_type and date_added have very few nulls, so fill them with the mode (most frequent value) or the mean

# Fill with mode (top occurring value)
df_clean['rating'] = df_clean['rating'].fillna(df_clean['rating'].mode()[0])
df_clean['duration_type'] = df_clean['duration_type'].fillna(df_clean['duration_type'].mode()[0])
df_clean['date_added'] = df_clean['date_added'].fillna(df_clean['date_added'].mode()[0])

# Fill with the overall mean (note: this averages movie minutes and season counts together)
df_clean['duration_num'] = df_clean['duration_num'].fillna(df_clean['duration_num'].mean())

# Convert all object (string) columns to lowercase and strip spaces
for col in df_clean.select_dtypes(include='object').columns:
    df_clean[col] = df_clean[col].str.lower().str.strip()
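
Filling `duration_num` with the overall mean blends movie lengths in minutes with season counts. One possible alternative, sketched here on a toy frame, fills each gap with the mean of its own `type` group. The `toy` data is hypothetical, and this is not the approach used in the notebook.

```python
import pandas as pd

# Hypothetical mini-frame: one movie is missing its duration
toy = pd.DataFrame({
    "type": ["movie", "movie", "movie", "tv show", "tv show"],
    "duration_num": [90.0, 110.0, None, 1.0, 3.0],
})

# Per-group mean, aligned row by row with the original frame
group_mean = toy.groupby("type")["duration_num"].transform("mean")

# Only the missing entries are replaced; existing values are untouched
toy["duration_num"] = toy["duration_num"].fillna(group_mean)

print(toy)
```

With only a handful of missing values the choice has little effect on the aggregate results, but it keeps the filled value in the same unit as its neighbors.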


# Verifying changes

In [13]:
df_clean.head(1)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration_num,duration_type,listed_in,description
0,s1,movie,dick johnson is dead,kirsten johnson,unknown,united states,2021-09-25,2020,pg-13,90.0,min,documentaries,"as her father nears the end of his life, filmm..."


In [14]:
df_clean.dtypes

show_id                  object
type                     object
title                    object
director                 object
cast                     object
country                  object
date_added       datetime64[ns]
release_year              int64
rating                   object
duration_num            float64
duration_type            object
listed_in                object
description              object
dtype: object
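
The `datetime64[ns]` dtype above is the result of `pd.to_datetime(..., errors="coerce")`. The sketch below, on made-up strings, shows how `errors="coerce"` turns anything unparseable into `NaT`, and how stripping whitespace and passing an explicit `format` can avoid losing valid dates. The `raw` values and the `"%B %d, %Y"` format are assumptions based on the date strings visible in `df.head()`.

```python
import pandas as pd

# Hypothetical raw values: a clean date, one with stray whitespace,
# a non-date string, and a missing value
raw = pd.Series(["September 25, 2021", " August 4, 2017", "not a date", None])

# Strip whitespace first, then parse with an explicit format;
# values that do not match become NaT instead of raising an error
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("unparseable or missing:", parsed.isna().sum())
```

Comparing the `NaT` count before and after parsing is a quick way to check whether coercion silently discarded any dates.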

In [15]:
df_clean.isnull().sum()

show_id          0
type             0
title            0
director         0
cast             0
country          0
date_added       0
release_year     0
rating           0
duration_num     0
duration_type    0
listed_in        0
description      0
dtype: int64

Now we save the cleaned dataset to a CSV file.

In [16]:
df_clean.to_csv("netflix_cleaned_dataset.csv",index=False)

# 3rd Step - Analysis

In [16]:
# Finding total number of shows and movies
movies_vs_shows = df_clean["type"].value_counts().reset_index()
movies_vs_shows.columns = ["Category", "Count"]
movies_vs_shows["Analysis"] = "Movies vs TV Shows"
print(movies_vs_shows)

  Category  Count            Analysis
0    movie   6131  Movies vs TV Shows
1  tv show   2676  Movies vs TV Shows


In [17]:
# Split multi-country entries and explode them so each country is counted separately
countries_exploded = df_clean['country'].str.split(',').explode().str.strip()

# Top 2 countries by number of titles ("unknown" is the lowercased fill value for missing countries)
Top_countries = countries_exploded[countries_exploded != "unknown"].value_counts().head(2).reset_index()

Top_countries.columns = ['Category', 'Count']
Top_countries['Analysis'] = 'Top 2 Countries'
print(Top_countries)


# Bottom 2 countries by number of titles
bottom_countries = countries_exploded[countries_exploded != "unknown"].value_counts().tail(2).reset_index()

bottom_countries.columns = ['Category', 'Count']
bottom_countries['Analysis'] = 'Bottom 2 Countries'
print(bottom_countries)

        Category  Count         Analysis
0  united states   3690  Top 2 Countries
1          india   1046  Top 2 Countries
       Category  Count            Analysis
0  east germany      1  Bottom 2 Countries
1    montenegro      1  Bottom 2 Countries
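
As a minimal sketch of the split-and-explode step, the toy example below counts countries from a few made-up entries. The `country` series is hypothetical, and dropping empty strings is an extra safeguard that is not part of the original cell.

```python
import pandas as pd

# Hypothetical entries: a co-production, a single country,
# the placeholder for missing data, and an entry with a stray leading comma
country = pd.Series(["united states, india", "india", "unknown", ", france"])

# One row per country, with surrounding whitespace removed
exploded = country.str.split(",").explode().str.strip()

# Drop the placeholder and any empty strings produced by stray commas
exploded = exploded[(exploded != "unknown") & (exploded != "")]

counts = exploded.value_counts()
print(counts)
```

When several countries tie at the minimum count, `tail(2)` simply returns two of them in sort order, so a "bottom 2" list is not unique.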


In [18]:

# Top 3 release years by number of titles
Top_content_trend = df_clean['release_year'].value_counts().head(3).reset_index()
Top_content_trend.columns = ['Category', 'Count']
Top_content_trend['Analysis'] = 'Top 3 most content produced year'
print(Top_content_trend)

# Bottom 3 release years by number of titles
Bottom_content_trend = df_clean['release_year'].value_counts().tail(3).reset_index()
Bottom_content_trend.columns = ['Category', 'Count']
Bottom_content_trend['Analysis'] = 'Bottom 3 most content produced year'
print(Bottom_content_trend)

   Category  Count                          Analysis
0      2018   1147  Top 3 most content produced year
1      2017   1032  Top 3 most content produced year
2      2019   1030  Top 3 most content produced year
   Category  Count                             Analysis
0      1959      1  Bottom 3 most content produced year
1      1966      1  Bottom 3 most content produced year
2      1947      1  Bottom 3 most content produced year


In [19]:
# Split co-directed entries and explode them so each director is counted separately
directors_exploded = df_clean['director'].str.split(', ').explode()

# Top 2 Directors
top_directors = directors_exploded[directors_exploded != "unknown"].value_counts().head(2).reset_index()

top_directors.columns = ['Category', 'Count']
top_directors['Analysis'] = 'Top 2 Directors'

print(top_directors)

        Category  Count         Analysis
0  rajiv chilaka     22  Top 2 Directors
1      jan suter     21  Top 2 Directors


In [20]:
# Split multi-genre entries and explode them so each genre is counted separately
genres_exploded = df_clean['listed_in'].str.split(', ').explode()

# Top 2 Genres
top_genres = genres_exploded[genres_exploded != "unknown"].value_counts().head(2).reset_index()

top_genres.columns = ['Category', 'Count']
top_genres['Analysis'] = 'Top 2 genres'
print(top_genres)

# Bottom 2 Genres
Bottom_genres = genres_exploded[genres_exploded != "unknown"].value_counts().tail(2).reset_index()

Bottom_genres.columns = ['Category', 'Count']
Bottom_genres['Analysis'] = 'Bottom 2 genres'
print(Bottom_genres)


               Category  Count      Analysis
0  international movies   2752  Top 2 genres
1                dramas   2427  Top 2 genres
            Category  Count         Analysis
0  classic & cult tv     28  Bottom 2 genres
1           tv shows     16  Bottom 2 genres


In [22]:
describe_stats = df_clean['release_year'].describe().reset_index()
describe_stats.columns = ['Category', 'Count'] 
describe_stats['Analysis'] = 'stats collected by release year'
print(describe_stats)

  Category        Count                         Analysis
0    count  8807.000000  stats collected by release year
1     mean  2014.180198  stats collected by release year
2      std     8.819312  stats collected by release year
3      min  1925.000000  stats collected by release year
4      25%  2013.000000  stats collected by release year
5      50%  2017.000000  stats collected by release year
6      75%  2019.000000  stats collected by release year
7      max  2021.000000  stats collected by release year


Concatenating the analysis DataFrames and saving the result to a separate file

In [26]:
final_analysis = pd.concat([movies_vs_shows,Top_countries,bottom_countries,Top_content_trend,Bottom_content_trend,describe_stats],ignore_index=True)

final_analysis.to_csv("netflix_analysis_csv_file.csv", index=False)
print(final_analysis)

         Category        Count                             Analysis
0           movie  6131.000000                   Movies vs TV Shows
1         tv show  2676.000000                   Movies vs TV Shows
2   united states  3690.000000                      Top 2 Countries
3           india  1046.000000                      Top 2 Countries
4    east germany     1.000000                   Bottom 2 Countries
5      montenegro     1.000000                   Bottom 2 Countries
6            2018  1147.000000     Top 3 most content produced year
7            2017  1032.000000     Top 3 most content produced year
8            2019  1030.000000     Top 3 most content produced year
9            1959     1.000000  Bottom 3 most content produced year
10           1966     1.000000  Bottom 3 most content produced year
11           1947     1.000000  Bottom 3 most content produced year
12          count  8807.000000      stats collected by release year
13           mean  2014.180198      stats collec
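
The counts in the combined table print as floats (for example `6131.000000`) because `pd.concat` upcasts the integer `Count` columns once they are stacked with the float values from `describe()`. A minimal sketch with two hypothetical one-row frames:

```python
import pandas as pd

# Hypothetical stand-ins for an integer count table and a describe() table
counts_part = pd.DataFrame({"Category": ["movie"], "Count": [6131]})
stats_part = pd.DataFrame({"Category": ["mean"], "Count": [2014.180198]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's own index
combined = pd.concat([counts_part, stats_part], ignore_index=True)

print(combined)
print(combined["Count"].dtype)
```

Keeping the summary statistics in a separate file, or a separate column, would preserve integer counts.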

# 📊 Final Insights from Netflix Dataset Analysis

After cleaning, transforming, and exploring the dataset, here are the key takeaways:

1. **Movies vs TV Shows**
   - Movies dominate the dataset with **6,131 entries**, while TV Shows account for **2,676**.
   - Movies outnumber TV shows by more than 2 to 1 in this snapshot of the catalog (about 70% of the 8,807 titles).

2. **Content Production by Country**
   - **United States (3,690)** and **India (1,046)** are the top two content producers.
   - Some countries, such as **East Germany** and **Montenegro**, appear on only **1 title each**, showing how thinly many countries are represented.

3. **Release Year Trends**
   - The most common release years are **2018 (1,147 titles)**, **2017 (1,032 titles)**, and **2019 (1,030 titles)**.
   - The earliest title dates back to **1925**, while the latest is from **2021**.
   - Years such as **1947, 1959, and 1966** are represented by a single title each.

4. **Statistical Summary of Release Year**
   - Average release year ≈ **2014**.
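   - Standard deviation (std) ≈ **8.8** years, so mean ± 1 std spans roughly **2005–2023**. The distribution is strongly left-skewed (median 2017, mean 2014), so the normal-distribution "~68%" rule of thumb does not apply directly.
   - 25% of titles were released in or before **2013**, 50% in or before **2017**, and 75% in or before **2019**.
   - At least three quarters of the catalog was therefore released in **2013 or later**, so the library skews heavily toward recent titles.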
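
The interval quoted for the standard deviation can be recomputed from the `describe()` output, and the share of titles inside it can be measured directly rather than assumed. The sketch below does both. The helper `share_within_one_std` and the `toy_years` data are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Values copied from the describe() output for release_year
mean, std = 2014.180198, 8.819312
low, high = mean - std, mean + std
print(f"mean ± 1 std: {low:.1f} to {high:.1f}")

def share_within_one_std(values: pd.Series) -> float:
    """Fraction of values within one sample standard deviation of the mean."""
    m, s = values.mean(), values.std()
    return float(values.between(m - s, m + s).mean())

# Hypothetical years with a long tail toward older titles
toy_years = pd.Series([1950, 2015, 2016, 2017, 2018, 2018, 2019, 2019, 2020, 2021])
share = share_within_one_std(toy_years)
print(f"share within one std on the toy data: {share:.0%}")
```

In the notebook, `share_within_one_std(df_clean["release_year"])` would give the actual share for the catalog.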

---

# 🚀 Key Insights

- Netflix’s library is **movie-heavy** compared to TV shows.  
- The **US and India** dominate the content landscape.  
- Titles released in the **2017–2019 period** are the most heavily represented in the catalog.  
- About 75% of titles were released in **2013 or later**, which is consistent with Netflix’s global expansion.  
- Some countries have very **minimal contributions**, suggesting opportunities for more localized content.

---
