# Loading and understanding the dataset

Loading dataset

In [1]:
import pandas as pd
df=pd.read_csv("netflix.csv")

Information about the dataset

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [3]:
print(df.dtypes)

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


In [8]:
df.head(5)

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | Unknown | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | Unknown | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | Unknown | | Unknown | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | Unknown | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


# Handle Null (Missing) Values

Checking whether there are any null values

In [5]:
df.isnull().sum(axis=0)

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

Replacing all null values in the 'director' and 'country' columns with 'Unknown', since both columns are of "object" type

In [6]:
df['director']=df['director'].fillna('Unknown')

In [7]:
df['country']=df['country'].fillna('Unknown')

Replacing all null values in the 'cast' column with 'Not Mentioned', since this column is also of "object" type

In [9]:
df['cast']=df['cast'].fillna('Not Mentioned')
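
The three fills above could also be written as a single call, since `fillna` accepts a dictionary that maps column names to fill values. This is only a sketch on a made-up three-row sample (the `sample` frame and the `placeholders` name are assumptions, not part of the notebook):

```python
import pandas as pd

# Hypothetical three-row sample standing in for netflix.csv
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None, None],
    "cast": [None, "Ama Qamata, Khosi Ngema", None],
    "country": ["United States", "South Africa", None],
})

# Map each column to its own placeholder and fill them all in one call
placeholders = {
    "director": "Unknown",
    "country": "Unknown",
    "cast": "Not Mentioned",
}
filled = sample.fillna(value=placeholders)

# Every column should now report zero nulls
print(filled.isnull().sum())
```

Columns that are not listed in the dictionary are left untouched, so 'date_added', 'rating' and 'duration' would keep their nulls.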

# Handling Duplicate Values

Checking whether there are any duplicate rows

In [12]:
df.duplicated().sum(axis=0)

np.int64(0)

*Since the dataset has no duplicate rows, no operation is needed. Otherwise I would apply "df.drop_duplicates()".*

# What is the number of Movies vs TV Shows?

Method: 1

In [11]:
no_movies=df.groupby('type')['show_id'].count()
print(no_movies)

type
Movie      6131
TV Show    2676
Name: show_id, dtype: int64


Method: 2

In [12]:
print(df['type'].value_counts())

type
Movie      6131
TV Show    2676
Name: count, dtype: int64


(Need to find the most years in the dataset)

# What are the top 10 countries with the most Netflix content?

(*Need to remove extra spaces, commas, etc.*)

In [15]:
df['country']=df['country'].str.strip()

In [18]:
print(df['country'].value_counts().head(10))

country
United States     2818
India              972
Unknown            831
United Kingdom     419
Japan              245
South Korea        199
Canada             181
Spain              145
France             124
Mexico             110
Name: count, dtype: int64
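
`str.strip()` only trims the ends of each value, so a title listed under several comma-separated countries is still counted as one combined string. One possible way to count each country separately is to split and explode the column first. The sketch below uses a made-up four-row sample, so the `sample` frame and its values are assumptions:

```python
import pandas as pd

# Hypothetical sample: multi-country rows, stray spaces and a trailing comma
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States, India", "India", " United Kingdom,", "Unknown"],
})

countries = (
    sample["country"]
    .str.split(",")   # one list of country names per title
    .explode()        # one row per (title, country) pair
    .str.strip()      # remove spaces left around each name
)

# Drop empty strings left by trailing commas and the 'Unknown' placeholder
countries = countries[(countries != "") & (countries != "Unknown")]

country_counts = countries.value_counts()
print(country_counts.head(10))
```

With this approach a co-production counts once for every country involved, so the totals can add up to more than the number of titles.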


# How many titles were added each year?

In [20]:
# Removing extra spaces
df['date_added']=df['date_added'].str.strip()

In [21]:
# Convert 'date_added' to datetime
df['date_added'] = pd.to_datetime(df['date_added'])

In [24]:
# Extract year
df['year_added'] = df['date_added'].dt.year

In [26]:
# Count by year
print(df['year_added'].value_counts().sort_index())

year_added
2008.0       2
2009.0       2
2010.0       1
2011.0      13
2012.0       3
2013.0      11
2014.0      24
2015.0      82
2016.0     429
2017.0    1188
2018.0    1649
2019.0    2016
2020.0    1879
2021.0    1498
Name: count, dtype: int64


The "year_added" column needs to be converted to an integer data type, because the missing dates turned its values into floats.

In [27]:
# Checking whether the "year_added" column has any null values
df['year_added'].isnull().sum(axis=0)

np.int64(10)

In [36]:
# Found 10 null values; replacing them with the placeholder 0
df['year_added']=df['year_added'].fillna(0)

In [38]:
# Converting the data type to integer
df['year_added']=df['year_added'].astype(int)

In [40]:
# Temporarily dropping the rows with the value 0
temp=df[df['year_added']!=0]

In [48]:
print(temp['year_added'].value_counts().sort_index())

year_added
2008       2
2009       2
2010       1
2011      13
2012       3
2013      11
2014      24
2015      82
2016     429
2017    1188
2018    1649
2019    2016
2020    1879
2021    1498
Name: count, dtype: int64
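
An alternative to the 0 placeholder might be pandas' nullable integer dtype "Int64", which keeps missing years as `<NA>` while still showing whole numbers. A small sketch on a made-up sample (the `sample` frame is an assumption):

```python
import pandas as pd

# Hypothetical sample: two valid dates (one with a leading space) and one missing
sample = pd.DataFrame(
    {"date_added": ["September 25, 2021", " August 14, 2017", None]}
)

# Strip spaces, then parse; the missing value becomes NaT
dates = pd.to_datetime(sample["date_added"].str.strip())

# .dt.year yields floats because of the NaT; 'Int64' stores integers plus <NA>
year_added = dates.dt.year.astype("Int64")

# value_counts() ignores <NA> by default, so no temporary filtering is needed
year_counts = year_added.value_counts().sort_index()
print(year_counts)
```

This would avoid having to filter out the 0 rows before every later count.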


# What are the most common genres (listed in 'listed_in')?

In [52]:
df['listed_in'].isnull().sum(axis=0)

np.int64(0)

In [59]:
from collections import Counter
all_genres=list()

# Split genres and count
genres = df['listed_in'].apply(lambda x: x.split(', '))
for i in genres:
    for j in i:
        all_genres.append(j)
store=Counter(all_genres).most_common(10)
for i,j in store:
    print(f"{i}: {j}")

International Movies: 2752
Dramas: 2427
Comedies: 1674
International TV Shows: 1351
Documentaries: 869
Action & Adventure: 859
TV Dramas: 763
Independent Movies: 756
Children & Family Movies: 641
Romantic Movies: 616
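
The same count can probably be done without an explicit loop by splitting and exploding the column in pandas. A minimal sketch on a made-up three-row sample (the `sample` frame is an assumption):

```python
import pandas as pd

# Hypothetical sample of the 'listed_in' column
sample = pd.DataFrame({
    "listed_in": ["Dramas, International Movies", "Dramas", "Comedies, Dramas"]
})

genre_counts = (
    sample["listed_in"]
    .str.split(", ")   # list of genres per title
    .explode()         # one row per genre
    .value_counts()    # sorted from most to least common
)
print(genre_counts.head(10))
```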


# Who are the top 10 directors on Netflix?

Option: 1

In [None]:
# Temporarily dropping the rows where 'director' is "Unknown"
temp = df[df['director'] != 'Unknown']

In [None]:
from collections import Counter
direc=list()
for i in temp['director']:
    direc.append(i.strip())
res=Counter(direc).most_common(10)
for i,j in res:
    print(f"{i}: {j}")


Rajiv Chilaka: 19
Raúl Campos, Jan Suter: 18
Suhas Kadav: 16
Marcus Raboy: 16
Jay Karas: 14
Cathy Garcia-Molina: 13
Youssef Chahine: 12
Martin Scorsese: 12
Jay Chapman: 12
Steven Spielberg: 11


Option: 2

In [64]:
temp['director'].value_counts().head(10)

director
Rajiv Chilaka             19
Raúl Campos, Jan Suter    18
Suhas Kadav               16
Marcus Raboy              16
Jay Karas                 14
Cathy Garcia-Molina       13
Martin Scorsese           12
Youssef Chahine           12
Jay Chapman               12
Steven Spielberg          11
Name: count, dtype: int64

# What are the top 10 countries with the most Netflix content?

In [None]:
df.head()

In [66]:
df['country'].isnull().sum(axis=0)

np.int64(0)

In [68]:
# Temporarily dropping the rows where 'country' is "Unknown"
temp=df[df['country']!='Unknown']

In [70]:
temp['country'].value_counts().head(10)

country
United States     2818
India              972
United Kingdom     419
Japan              245
South Korea        199
Canada             181
Spain              145
France             124
Mexico             110
Egypt              106
Name: count, dtype: int64

# What is the percentage of Movies vs TV Shows?

In [71]:
df['type'].isnull().sum(axis=0)

np.int64(0)

In [72]:
percent=df['type'].value_counts(normalize=True)*100
print(round(percent,2))

type
Movie      69.62
TV Show    30.38
Name: proportion, dtype: float64


# What is the trend of content being added over the years? (Hint: use year_added)

In [76]:
df['year_added'].isnull().sum(axis=0)

np.int64(0)

In [None]:
# Temporarily dropping the rows where 'year_added' is 0
temp=df[df['year_added']!=0]

In [80]:
temp.groupby(['year_added','type'])['type'].count().sort_index()

year_added  type   
2008        Movie         1
            TV Show       1
2009        Movie         2
2010        Movie         1
2011        Movie        13
2012        Movie         3
2013        Movie         6
            TV Show       5
2014        Movie        19
            TV Show       5
2015        Movie        56
            TV Show      26
2016        Movie       253
            TV Show     176
2017        Movie       839
            TV Show     349
2018        Movie      1237
            TV Show     412
2019        Movie      1424
            TV Show     592
2020        Movie      1284
            TV Show     595
2021        Movie       993
            TV Show     505
Name: type, dtype: int64
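
The long output above may be easier to compare year by year as a wide table with one column per type. A sketch using `unstack` on a made-up sample (the `sample` frame and the "Total" column are assumptions):

```python
import pandas as pd

# Hypothetical sample of (year_added, type) pairs
sample = pd.DataFrame({
    "year_added": [2019, 2019, 2019, 2020, 2020, 2021],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "Movie"],
})

# Count rows per (year, type), then pivot 'type' into columns;
# years with no titles of a type get 0 instead of NaN
trend = sample.groupby(["year_added", "type"]).size().unstack(fill_value=0)

# Row total, to see the overall growth alongside the split
trend["Total"] = trend.sum(axis=1)
print(trend)
```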

# Which year had the most content added?

In [82]:
no_content=df['year_added'].value_counts().idxmax()
print(no_content)

2019


# Show all TV Shows released in 2020.

In [None]:
# Checking whether there are any null values
df['release_year'].isnull().sum(axis=0)

np.int64(0)

In [91]:
stack = df[(df['release_year'] == 2020) & (df['type'] == 'TV Show')]['title'].tolist()
print(stack)

['Falsa identidad', 'Sex Education', 'Tayo and Little Wizards', 'The Smart Money Woman', 'HQ Barbers', 'Bread Barbershop', 'The Defeated', 'The Creative Indians', '44 Cats', 'Darwin’s Game', 'Wynonna Earp', 'Okupas', 'I AM A KILLER', 'The New Legends of Monkey', 'The Worst Witch', 'Undercover', 'YooHoo to the Rescue', "Grey's Anatomy", 'Big Timber', 'Bureau of Magical Things', 'Quarantine Tales', 'Legend\xa0of\xa0Exorcism', 'The Daily Life of the Immortal King', 'Cleo & Cuquin', 'Deadwind', 'Gameboys Level-Up Edition', 'Locked Up', "Schitt's Creek", 'The Last Dance', 'Millennials', 'Avengers Climate Conundrum', 'Dirty John', 'The Mystic River', 'Selena: The Series', 'Miniforce: Super Dino Power', 'Next in Fashion', 'Power Players', 'Indian Matchmaking', 'Jeffrey Epstein: Filthy Rich', 'The Underclass', 'The Baker and the Beauty', "Heaven Official's Blessing", 'Glimpses of a Future', 'Rainbow High', 'Zindagi in Short', 'Good Girls', 'The Sinner', 'The House Arrest of Us', 'The Magicians

# List all movies longer than 1 hour 30 minutes. (Hint: clean the duration column first)

In [None]:
# Replacing null values with '0'
df['duration']=df['duration'].fillna(0)

In [None]:
# Filter only movies
temp = df[df['type'] == 'Movie'].copy()

In [101]:
# Removing the ' min' from 'duration' column
temp['duration_clean']=temp['duration'].str.replace(' min','')

In [None]:
# Replacing null values with '0'
temp['duration_clean']=temp['duration_clean'].fillna(0)

In [109]:
# Converting 'duration_clean' into 'integer'
temp['duration_clean']=temp['duration_clean'].astype(int)
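
The cleaning steps above could also be done with a regular expression that pulls out the number of minutes directly, which leaves missing durations as NaN instead of 0. A sketch on a made-up sample (the `sample` frame and the `duration_min` column name are assumptions):

```python
import pandas as pd

# Hypothetical sample: two movies, one TV show, one movie with no duration
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "title": ["A", "B", "C", "D"],
    "duration": ["90 min", "125 min", "2 Seasons", None],
})

movies = sample[sample["type"] == "Movie"].copy()

# Capture the digits that come before 'min'; rows without a match become NaN
minutes = movies["duration"].str.extract(r"(\d+)\s*min", expand=False)
movies["duration_min"] = pd.to_numeric(minutes)

# NaN > 90 evaluates to False, so rows with no duration are excluded
long_movies = movies[movies["duration_min"] > 90]
print(long_movies[["title", "duration_min"]])
```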

Option: 1

In [114]:
temp=temp[temp['duration_clean']>90]
print(temp[['title','duration_clean']])

                                 title  duration_clean
6     My Little Pony: A New Generation              91
7                              Sankofa             125
9                         The Starling             104
12                        Je Suis Karl             127
13    Confessions of an Invisible Girl              91
...                                ...             ...
8798                          Zed Plus             131
8799                             Zenda             120
8801                           Zinzana              96
8802                            Zodiac             158
8806                            Zubaan             111

[4138 rows x 2 columns]


Option: 2

In [122]:
temp_dic=dict(zip(temp['title'],temp['duration_clean']))
for i,j in temp_dic.items():
    print(f"Movie Name: {i}. Duration: {j} min")

Movie Name: My Little Pony: A New Generation. Duration: 91 min
Movie Name: Sankofa. Duration: 125 min
Movie Name: The Starling. Duration: 104 min
Movie Name: Je Suis Karl. Duration: 127 min
Movie Name: Confessions of an Invisible Girl. Duration: 91 min
Movie Name: Intrusion. Duration: 94 min
Movie Name: Avvai Shanmughi. Duration: 161 min
Movie Name: Jeans. Duration: 166 min
Movie Name: Minsara Kanavu. Duration: 147 min
Movie Name: Grown Ups. Duration: 103 min
Movie Name: Dark Skies. Duration: 97 min
Movie Name: Paranoia. Duration: 106 min
Movie Name: Ankahi Kahaniya. Duration: 111 min
Movie Name: The Father Who Moves Mountains. Duration: 110 min
Movie Name: The Stronghold. Duration: 105 min
Movie Name: Birth of the Dragon. Duration: 96 min
Movie Name: Jaws. Duration: 124 min
Movie Name: Jaws 2. Duration: 116 min
Movie Name: Jaws 3. Duration: 98 min
Movie Name: Jaws: The Revenge. Duration: 91 min
Movie Name: Safe House. Duration: 115 min
Movie Name: Training Day. Duration: 122 min
Movie

# Find all content involving the actor “David Attenborough”.

In [124]:
temp=df[df['cast']!='Not Mentioned']

In [135]:
temp_cast=temp[temp['cast'].str.contains("David Attenborough", case=False)]
cast_dic=dict(zip(temp_cast['title'],temp_cast['cast']))
for i,j in cast_dic.items():
    print(f"Movie Title: {i}. Movie Cast: {j}")

Movie Title: Breaking Boundaries: The Science Of Our Planet. Movie Cast: David Attenborough, Johan Rockström
Movie Title: Life in Color with David Attenborough. Movie Cast: David Attenborough
Movie Title: David Attenborough: A Life on Our Planet. Movie Cast: David Attenborough
Movie Title: Our Planet - Behind The Scenes. Movie Cast: David Attenborough
Movie Title: Our Planet. Movie Cast: David Attenborough
Movie Title: Africa. Movie Cast: David Attenborough
Movie Title: Blue Planet II. Movie Cast: David Attenborough
Movie Title: Frozen Planet. Movie Cast: David Attenborough
Movie Title: Frozen Planet: On Thin Ice. Movie Cast: David Attenborough
Movie Title: Frozen Planet: The Epic Journey. Movie Cast: David Attenborough
Movie Title: Life on Location. Movie Cast: David Attenborough
Movie Title: Life Story. Movie Cast: David Attenborough
Movie Title: Nature: Raising the Dinosaur Giant. Movie Cast: David Attenborough
Movie Title: Nature's Great Events (2009). Movie Cast: David Attenboroug
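
If the 'cast' column still contained nulls, `str.contains` would return NaN for those rows and the filter would fail. Passing `na=False` is one way to handle that without pre-filtering. A sketch on a made-up sample (the `sample` frame is an assumption):

```python
import pandas as pd

# Hypothetical sample with one missing cast value
sample = pd.DataFrame({
    "title": ["Our Planet", "Blood & Water", "Jailbirds New Orleans"],
    "cast": ["David Attenborough", "Ama Qamata, Khosi Ngema", None],
})

# regex=False treats the name as plain text; na=False maps missing casts to False
mask = sample["cast"].str.contains(
    "david attenborough", case=False, regex=False, na=False
)
matches = sample.loc[mask, ["title", "cast"]]
print(matches)
```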

# Group the number of Movies and TV Shows by rating.
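
One possible approach is `pd.crosstab`, which builds a rating-by-type count table in one step. The sketch below runs on a made-up sample, not on the real dataset, so the `sample` frame and its values are assumptions:

```python
import pandas as pd

# Hypothetical sample; the last row has no rating
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["PG-13", "TV-MA", "TV-MA", "TV-MA", None],
})

# Rows are ratings, columns are types; rows with a missing rating are skipped
by_rating = pd.crosstab(sample["rating"], sample["type"])
print(by_rating)
```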

# Which director has the most titles on Netflix?

# Count the number of titles per country per year. (group by both)

# Which month sees the most new additions? (extract month from date_added)
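
A possible starting point, assuming 'date_added' has already been converted to datetime: extract the month name and count it. The sketch uses a made-up sample (the `sample` frame is an assumption), so it shows the method only, not the answer for the real data:

```python
import pandas as pd

# Hypothetical sample of parsed dates, including one missing value
sample = pd.DataFrame({
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24", "2019-12-01", None])
})

# month_name() returns NaN for missing dates; value_counts() skips them
month_counts = sample["date_added"].dt.month_name().value_counts()

# Month with the highest number of additions
busiest_month = month_counts.idxmax()
print(month_counts)
print(busiest_month)
```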

# What’s the most common release year for content? (use release_year)

# Is there any seasonal trend in adding content (e.g., more in December)?

# Create a bar chart of number of content items added each year.

# Build a pie chart showing share of top 5 countries in total content.