# Netflix data set exploration

Netflix is one of the most popular media and video streaming platforms. As of mid-2021, it offers over 10,000 movies and TV shows and has over 222M subscribers globally. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.

Business problem

Netflix wants to know which types of shows and movies to produce and how it can grow the business in different countries.

In [21]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [22]:
df = pd.read_csv('netflix.csv')

In [23]:
df.head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...


Finding the shape and datatypes of the DataFrame

In [24]:
df.shape

(8807, 12)

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


1. The Netflix dataset initially has 8807 rows and 12 columns.

2. Six columns contain null values: 'director', 'cast', 'country', 'date_added', 'rating', and 'duration'.

3. 'release_year' is the only integer (int64) column; all the other columns have the object (string) datatype.
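The observations above were read off the `info()` output by eye. As a minimal sketch, the same facts can be pulled out programmatically; the `sample` frame and its values are hypothetical stand-ins, not rows from the Netflix file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Netflix table (hypothetical values, not the real data).
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", np.nan, np.nan],
    "country": ["India", "Japan", np.nan],
    "release_year": [2019, 2020, 2021],
})

# Count missing values per column and keep only the columns that have any.
null_counts = sample.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]

# List the columns whose dtype is numeric.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print(columns_with_nulls)
print(numeric_columns)
```

Running the same three lines against `df` should list the six columns with nulls and show `release_year` as the only numeric column.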

In [26]:
title = df["title"]
country = df['country']
cast = df['cast']
director = df['director']
rating = df['rating']
listed_in = df['listed_in']

In [27]:
def split_columns(inp):
  # Cast to str first so missing values (NaN) do not raise; NaN becomes the text 'nan'
  return str(inp).split(', ')

In [28]:
columns_list = ['country', 'cast', 'director', 'rating', 'listed_in']

for col in columns_list:
  df[col] = df[col].apply(split_columns)

Several columns in the dataset hold multiple comma-separated values in a single string. These need to be split and then spread across individual rows so that each value can be counted on its own, which is essential for reliable analysis.

---> Below is the DataFrame after the split; the strings in the selected columns are now lists.

In [29]:
df.head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,[Kirsten Johnson],[nan],[United States],"September 25, 2021",2020,[PG-13],90 min,[Documentaries],"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,[nan],"[Ama Qamata, Khosi Ngema, Gail Mabalane, Thaba...",[South Africa],"September 24, 2021",2021,[TV-MA],2 Seasons,"[International TV Shows, TV Dramas, TV Mysteries]","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,[Julien Leclercq],"[Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nab...",[nan],"September 24, 2021",2021,[TV-MA],1 Season,"[Crime TV Shows, International TV Shows, TV Ac...",To protect his family from a powerful drug lor...
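The `[nan]` entries in the output above come from the `str()` call inside the helper: a missing value is converted to the text `'nan'` before it is split. A small sketch of that behaviour, using made-up inputs:

```python
import numpy as np

def split_columns(inp):
    # Same logic as the notebook's helper: cast to str, then split on ', '.
    return str(inp).split(', ')

# A normal multi-valued string is split into its parts.
genres = split_columns("Dramas, International Movies")

# A missing value is no longer NaN afterwards; it is a one-element list
# holding the string 'nan'.
missing = split_columns(np.nan)

print(genres)
print(missing)
```

This is why `isna()` would not flag these cells until the text `'nan'` is turned back into a real missing value.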


---> After splitting, we use the explode function so that each list element gets its own row; the number of rows increases drastically, to 201991.

In [30]:
for col in columns_list:
  df = df.explode(col)
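Exploding several list columns one after another produces every combination of their values, so the row count per title is the product of the list lengths. The sketch below illustrates this on a hypothetical two-title frame (`toy` and its values are invented for the example).

```python
import pandas as pd

# Hypothetical frame: T1 has 3 cast members and 2 genres, T2 has 1 of each.
toy = pd.DataFrame({
    "title": ["T1", "T2"],
    "cast": [["a", "b", "c"], ["d"]],
    "listed_in": [["g1", "g2"], ["g1"]],
})

# Explode one column at a time, as in the notebook.
exploded = toy
for col in ["cast", "listed_in"]:
    exploded = exploded.explode(col)

# T1 now occupies 3 * 2 = 6 rows, T2 occupies 1 row.
rows_per_title = exploded.groupby("title").size()

# Counting titles after the explode needs nunique(), not len().
unique_titles = exploded["title"].nunique()

print(rows_per_title)
print(unique_titles)
```

Because each title is repeated many times, any later count of shows or movies on the exploded frame should be based on unique `show_id` or `title` values rather than on raw row counts.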

In [31]:
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,Ama Qamata,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,International TV Shows,"After crossing paths at a party, a Cape Town t..."
1,s2,TV Show,Blood & Water,,Ama Qamata,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,TV Dramas,"After crossing paths at a party, a Cape Town t..."
1,s2,TV Show,Blood & Water,,Ama Qamata,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,TV Mysteries,"After crossing paths at a party, a Cape Town t..."
1,s2,TV Show,Blood & Water,,Khosi Ngema,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,International TV Shows,"After crossing paths at a party, a Cape Town t..."
...,...,...,...,...,...,...,...,...,...,...,...,...
8806,s8807,Movie,Zubaan,Mozez Singh,Anita Shabdish,India,"March 2, 2019",2015,TV-14,111 min,International Movies,A scrappy but poor boy worms his way into a ty...
8806,s8807,Movie,Zubaan,Mozez Singh,Anita Shabdish,India,"March 2, 2019",2015,TV-14,111 min,Music & Musicals,A scrappy but poor boy worms his way into a ty...
8806,s8807,Movie,Zubaan,Mozez Singh,Chittaranjan Tripathy,India,"March 2, 2019",2015,TV-14,111 min,Dramas,A scrappy but poor boy worms his way into a ty...
8806,s8807,Movie,Zubaan,Mozez Singh,Chittaranjan Tripathy,India,"March 2, 2019",2015,TV-14,111 min,International Movies,A scrappy but poor boy worms his way into a ty...


Below, the string 'nan' values in each column (produced by str() on missing values) are replaced with NaN (np.nan).

In [12]:
df = df.replace('nan', np.nan)

In [13]:
df.head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,Ama Qamata,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,International TV Shows,"After crossing paths at a party, a Cape Town t..."
1,s2,TV Show,Blood & Water,,Ama Qamata,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,TV Dramas,"After crossing paths at a party, a Cape Town t..."
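The replacement step is needed only because the splitting helper turned missing values into text. As an alternative sketch (not what this notebook does), pandas' `Series.str.split` leaves missing values as NaN, so no clean-up is required afterwards; `toy` is a hypothetical frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one multi-valued cell and one missing cell.
toy = pd.DataFrame({
    "title": ["T1", "T2"],
    "country": ["United States, India", np.nan],
})

# .str.split keeps NaN as NaN instead of producing the string 'nan'.
toy["country"] = toy["country"].str.split(", ")

# explode() passes the missing value through as a single NaN row.
exploded = toy.explode("country")

print(exploded)
print(exploded["country"].isna().sum())
```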


Finding the number of nulls

In [14]:
df.isna().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,50643
cast,2146
country,11897
date_added,158
release_year,0
rating,67
duration,3


Using fillna() to fill NaN values with a placeholder suited to each column.

In [15]:
df['director'] = df['director'].fillna('unknown director')
df['cast'] = df['cast'].fillna('unknown cast')
df['country'] = df['country'].fillna('unknown country')
df['rating'] = df['rating'].fillna('unknown rating')
df['date_added'] = df['date_added'].fillna(0)
df['duration']= df['duration'].fillna(0)
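The per-column fills above can also be written as one call by passing a dictionary to `DataFrame.fillna`. This is a sketch on a hypothetical frame; it covers only the text placeholders used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in the text columns.
toy = pd.DataFrame({
    "director": ["X", np.nan],
    "cast": [np.nan, "Y"],
    "country": [np.nan, np.nan],
})

# Map each column name to the placeholder it should receive.
fill_values = {
    "director": "unknown director",
    "cast": "unknown cast",
    "country": "unknown country",
}

# fillna with a dict fills each listed column with its own value.
filled = toy.fillna(value=fill_values)

print(filled)
```

Keeping the placeholders in one dictionary makes it easier to see, and later change, how each column's missing values are labelled.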

Renaming some column names to improve clarity

In [16]:
df = df.rename(columns = {'type': 'content_type', 'listed_in': 'Genre', 'duration': 'content_duration'})

Converting the 'date_added' column from 'object' to a datetime datatype

In [17]:
df['date_added'] = pd.to_datetime(df['date_added'], errors = 'coerce')
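With `errors='coerce'`, any string that cannot be parsed silently becomes NaT. The `info()` output in this notebook reports 200245 non-null dates out of 201991 rows, which is more missing values than the 158 counted before the conversion, so some strings were probably dropped during parsing. One possible cause, stated here as an assumption and not verified against the file, is stray leading or trailing whitespace in some date strings. The sketch below shows how stripping whitespace and giving an explicit format can be tried; the `raw` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical date strings; the second has a leading space (assumed problem case).
raw = pd.Series(["September 25, 2021", " August 4, 2017", np.nan])

# Strip surrounding whitespace, then parse with an explicit "Month day, year" format.
cleaned = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Compare how many values were missing before and after parsing.
missing_before = raw.isna().sum()
missing_after = cleaned.isna().sum()

print(cleaned)
print(missing_before, missing_after)
```

If the two missing counts match on the real data, the conversion did not lose any dates.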

Displaying info() again after the initial preprocessing

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 201991 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   show_id           201991 non-null  object        
 1   content_type      201991 non-null  object        
 2   title             201991 non-null  object        
 3   director          201991 non-null  object        
 4   cast              201991 non-null  object        
 5   country           201991 non-null  object        
 6   date_added        200245 non-null  datetime64[ns]
 7   release_year      201991 non-null  int64         
 8   rating            201991 non-null  object        
 9   content_duration  201991 non-null  object        
 10  Genre             201991 non-null  object        
 11  description       201991 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 20.0+ MB


Preprocessing steps done:

1. Observed the initial shape and understood the datatypes of the features in the dataset.

2. Split some string columns and used the explode function to give each value its own row, in order to identify deeper relationships and trends in the data.

3. Found nulls in the dataset and replaced them with appropriate values using the pandas fillna() method.

4. Renamed a few columns to improve the clarity and understanding of the features.

5. Converted the 'date_added' column to a datetime datatype.