# Netflix Movies and TV Shows




#### About this Dataset: 
Netflix is one of the most popular media and video streaming platforms. It has over 8,000 movies and TV shows available on its platform and, as of mid-2021, over 200 million subscribers globally. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.


#### Interesting Task Ideas

1. Understanding what content is available in different countries
2. Identifying similar content by matching text-based features
3. Network analysis of actors and directors to find interesting insights
4. Has Netflix focused more on TV shows than movies in recent years?
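
As a rough illustration of the last task idea, the share of each content type per year added can be tabulated with `pd.crosstab`. This is only a sketch on made-up rows; it assumes columns named `type` and `added_year`, and the `toy` frame is hypothetical rather than the real data.

```python
import pandas as pd

# Toy rows standing in for the cleaned table; the values are illustrative only.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "added_year": [2017, 2017, 2017, 2018, 2018, 2018],
})

# normalize="index" turns counts into per-year shares, so each row sums to 1.
share_by_year = pd.crosstab(toy["added_year"], toy["type"], normalize="index")
print(share_by_year)
```

A rising "TV Show" share across years would point towards a growing focus on shows, although the raw counts are worth checking alongside the shares.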


#### Source of Dataset : https://www.kaggle.com/datasets/shivamb/netflix-shows


In [1]:
# Importing required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Importing all worksheets

data_file = "C:/Users/navee/OneDrive/Desktop/GitHub/End_to_end_case_studies/Netflix/Data/netflix_titles.xlsx"

movies_shows = pd.read_excel(data_file, sheet_name='netflix_titles')
directors = pd.read_excel(data_file, sheet_name='netflix_titles_directors')
countries = pd.read_excel(data_file, sheet_name='netflix_titles_countries')
cast = pd.read_excel(data_file, sheet_name='netflix_titles_cast')
category = pd.read_excel(data_file, sheet_name='netflix_titles_category')
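
Since all five worksheets live in one workbook, they could also be loaded with a single call. The sketch below assumes the sheet names used above; `load_sheets` and its `path` argument are hypothetical names. Passing `sheet_name=None` asks pandas to return every sheet as a dict keyed by sheet name.

```python
import pandas as pd


def load_sheets(path):
    """Return the five worksheets as DataFrames (hypothetical helper)."""
    # sheet_name=None makes read_excel return {sheet name: DataFrame} for all sheets.
    sheets = pd.read_excel(path, sheet_name=None)
    prefix = "netflix_titles"
    return (
        sheets[prefix],
        sheets[prefix + "_directors"],
        sheets[prefix + "_countries"],
        sheets[prefix + "_cast"],
        sheets[prefix + "_category"],
    )
```

The workbook is then opened once instead of five times, which may matter for larger files.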

##### Analysis will be done show_id-wise

###### EDA on movies

In [3]:
movies_shows

Unnamed: 0,duration_minutes,duration_seasons,type,title,date_added,release_year,rating,description,show_id
0,90,,Movie,Norm of the North: King Sized Adventure,2019-09-09 00:00:00,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0
1,94,,Movie,Jandino: Whatever it Takes,2016-09-09 00:00:00,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0
2,,1,TV Show,Transformers Prime,2018-09-08 00:00:00,2013.0,TV-Y7-FV,"With the help of three human allies, the Autob...",70234439.0
3,,1,TV Show,Transformers: Robots in Disguise,2018-09-08 00:00:00,2016.0,TV-Y7,When a prison ship crash unleashes hundreds of...,80058654.0
4,99,,Movie,#realityhigh,2017-09-08 00:00:00,2017.0,TV-14,When nerdy high schooler Dani finally attracts...,80125979.0
...,...,...,...,...,...,...,...,...,...
6231,,13,TV Show,Red vs. Blue,,2015.0,NR,"This parody of first-person shooter games, mil...",80000063.0
6232,,4,TV Show,Maron,,2016.0,TV-MA,"Marc Maron stars as Marc Maron, who interviews...",70286564.0
6233,60,,Movie,Little Baby Bum: Nursery Rhyme Friends,,2016.0,,Nursery rhymes and original music for children...,80116008.0
6234,,2,TV Show,A Young Doctor's Notebook and Other Stories,,2013.0,TV-MA,"Set during the Russian Revolution, this comic ...",70281022.0


In [4]:
# Count of duplicate records

movies_shows.duplicated().sum()

# No duplicate records found

0

In [5]:
# Info about the data

movies_shows.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6236 entries, 0 to 6235
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   duration_minutes  4267 non-null   object 
 1   duration_seasons  1971 non-null   object 
 2   type              6235 non-null   object 
 3   title             6235 non-null   object 
 4   date_added        6223 non-null   object 
 5   release_year      6234 non-null   float64
 6   rating            6223 non-null   object 
 7   description       6233 non-null   object 
 8   show_id           6232 non-null   float64
dtypes: float64(2), object(7)
memory usage: 438.6+ KB


In [6]:
# From the above we can infer that:
# if it is a movie, "duration_minutes" is filled and "duration_seasons" is null;
# if it is a TV show, "duration_minutes" is null and "duration_seasons" is filled.


# By this logic, the non-null counts of "duration_minutes" and "duration_seasons" should sum to 6236 or less.
# Here they sum to 4267 + 1971 = 6238, so at least two records have both columns filled.
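
The same consistency rule can be checked row by row instead of through column totals. A minimal sketch on a hypothetical `toy` frame, assuming the two column names from the `info()` output above:

```python
import pandas as pd

# Toy frame: the third row has both duration columns filled, which breaks the rule.
toy = pd.DataFrame({
    "duration_minutes": [90, None, 40, None],
    "duration_seasons": [None, 1, "Flying Fortress", 2],
})

# Number of filled duration columns per row; a well-formed row has exactly one.
filled = toy[["duration_minutes", "duration_seasons"]].notna().sum(axis=1)

both_filled = int((filled == 2).sum())
neither_filled = int((filled == 0).sum())
print("both filled:", both_filled, "| neither filled:", neither_filled)
```

Counting per row catches both kinds of violation at once, whereas comparing totals only reveals a problem when one kind outnumbers the other.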

In [7]:
# Logic 1: records where both duration columns are filled

movies_shows[movies_shows.duration_minutes.notnull() & movies_shows.duration_seasons.notnull()]

Unnamed: 0,duration_minutes,duration_seasons,type,title,date_added,release_year,rating,description,show_id
2018,"Flying Fortress""",2017-03-31 00:00:00,1944.0,TV-PG,This documentary centers on the crew of the B-...,80119194.0,,,
4525,"and probably will.""",80188902,,,,,,,


In [8]:
# There are 2 records in which both "duration_minutes" and "duration_seasons" are non-null.
# These records contain unrelated values in several fields, so they should be deleted.


movies_shows = movies_shows.drop(index=[2018, 4525])
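
Hard-coding the index labels works for this file, but such shifted rows could also be found by validating a column with a known set of values. The sketch below assumes that `type` should only ever be "Movie" or "TV Show"; the `toy` frame and the `valid_types` name are hypothetical.

```python
import pandas as pd

# Toy frame: the last two rows mimic shifted records, where "type" holds a year or nothing.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", 1944.0, None],
    "title": ["A", "B", "TV-PG", None],
})

valid_types = {"Movie", "TV Show"}  # assumption: the only legitimate values
is_valid = toy["type"].isin(valid_types)

malformed = toy[~is_valid]  # rows to inspect before deleting
clean = toy[is_valid]
print(malformed)
```

Inspecting `malformed` before dropping keeps the decision visible and avoids relying on index labels that change if the source file is refreshed.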

In [9]:
# Logic 2: records where both duration columns are null

movies_shows[movies_shows.duration_minutes.isnull() & movies_shows.duration_seasons.isnull()]


# There is no record in which both "duration_minutes" and "duration_seasons" are null

Unnamed: 0,duration_minutes,duration_seasons,type,title,date_added,release_year,rating,description,show_id


In [10]:
movies_shows['duration'] = np.where(movies_shows.duration_seasons.isna(), 
                                    movies_shows.duration_minutes, movies_shows.duration_seasons)
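
An equivalent way to merge two mutually exclusive columns is `Series.combine_first`, which takes values from the first Series and fills its gaps from the second. A small sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "duration_minutes": [90, np.nan, 99],
    "duration_seasons": [np.nan, 1, np.nan],
})

# Take the season count where present, otherwise fall back to the minutes.
toy["duration"] = toy["duration_seasons"].combine_first(toy["duration_minutes"])
print(toy)
```

Either way, the merged `duration` column mixes two units (minutes for movies, seasons for TV shows), so it should always be read together with `type`.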

In [11]:
movies_shows = movies_shows.drop(columns=['duration_seasons', 'duration_minutes'])
movies_shows.head()

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration
0,Movie,Norm of the North: King Sized Adventure,2019-09-09 00:00:00,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0,90
1,Movie,Jandino: Whatever it Takes,2016-09-09 00:00:00,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0,94
2,TV Show,Transformers Prime,2018-09-08 00:00:00,2013.0,TV-Y7-FV,"With the help of three human allies, the Autob...",70234439.0,1
3,TV Show,Transformers: Robots in Disguise,2018-09-08 00:00:00,2016.0,TV-Y7,When a prison ship crash unleashes hundreds of...,80058654.0,1
4,Movie,#realityhigh,2017-09-08 00:00:00,2017.0,TV-14,When nerdy high schooler Dani finally attracts...,80125979.0,99


In [12]:
# Getting null counts for movies_shows

movies_shows.isna().sum()


type             0
title            0
date_added      12
release_year     1
rating          11
description      1
show_id          2
duration         0
dtype: int64

In [13]:
# date_added, release_year, rating, description and show_id contain very few nulls
# Checking the records where these columns contain nulls

rec_having_nulls = movies_shows[movies_shows.date_added.isna() | movies_shows.release_year.isna() 
                         | movies_shows.rating.isna() | movies_shows.description.isna() | movies_shows.show_id.isna()]


rec_having_nulls


# These are very few records, so we can drop them for cleaner analysis

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration
211,Movie,Louis C.K.: Hilarious,2016-09-16 00:00:00,2010.0,,Emmy-winning comedy writer Louis C.K. brings h...,70129452.0,84
2017,Movie,The Memphis Belle: A Story of a,,,,,,40
2412,Movie,My Honor Was Loyalty,2017-03-01 00:00:00,2015.0,,"Amid the chaos and horror of World War II, a c...",80144119.0,115
3289,Movie,13TH: A Conversation with Oprah Winfrey & Ava ...,2017-01-26 00:00:00,2017.0,,Oprah Winfrey sits down with director Ava DuVe...,80169801.0,37
4057,TV Show,Little Lunch,2018-02-01 00:00:00,2015.0,,"Adopting a child's perspective, this show take...",80078037.0,1
4403,Movie,Fireplace 4K: Classic Crackling Fireplace from...,2015-12-21 00:00:00,2015.0,,"The first of its kind in UHD 4K, with the clea...",80092839.0,60
4404,Movie,Fireplace 4K: Crackling Birchwood from Firepla...,2015-12-21 00:00:00,2015.0,,"For the first time in 4K Ultra-HD, everyone's ...",80092835.0,60
4524,Movie,The Bad Education Movie,2018-12-15 00:00:00,2015.0,TV-MA,Britain's most ineffective but caring teacher ...,,87
4708,TV Show,Gargantia on the Verdurous Planet,2016-12-01 00:00:00,2013.0,,"After falling through a wormhole, a space-dwel...",80039789.0,1
5017,Movie,Louis C.K.: Live at the Comedy Store,2016-08-15 00:00:00,2015.0,,The comic puts his trademark hilarious/thought...,80114111.0,66
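
Selecting rows with nulls and dropping them can also be done in one step with `DataFrame.dropna` and its `subset` argument. A minimal sketch, where the `toy` frame and the shortened `cols` list are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": ["TV-PG", None, "TV-MA"],
    "show_id": [1.0, 2.0, None],
})

# Drop every row that has a null in any of the listed columns, then renumber the rows.
cols = ["rating", "show_id"]
cleaned = toy.dropna(subset=cols).reset_index(drop=True)
print(cleaned)
```

Building the mask explicitly, as done above, has the advantage that the affected rows can be reviewed first; `dropna` is the shorter form once the decision to drop is made.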


In [14]:
# Deleting records having nulls


movies_shows = movies_shows.drop(index=list(rec_having_nulls.index))

movies_shows

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration
0,Movie,Norm of the North: King Sized Adventure,2019-09-09 00:00:00,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0,90
1,Movie,Jandino: Whatever it Takes,2016-09-09 00:00:00,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0,94
2,TV Show,Transformers Prime,2018-09-08 00:00:00,2013.0,TV-Y7-FV,"With the help of three human allies, the Autob...",70234439.0,1
3,TV Show,Transformers: Robots in Disguise,2018-09-08 00:00:00,2016.0,TV-Y7,When a prison ship crash unleashes hundreds of...,80058654.0,1
4,Movie,#realityhigh,2017-09-08 00:00:00,2017.0,TV-14,When nerdy high schooler Dani finally attracts...,80125979.0,99
...,...,...,...,...,...,...,...,...
6220,TV Show,Talking Tom and Friends,2019-04-10 00:00:00,2017.0,TV-G,Full of funny one-liners and always ready for ...,80162994.0,2
6221,TV Show,Pokémon the Series,2019-04-01 00:00:00,2019.0,TV-Y7-FV,Ash and his Pikachu travel to the Alola region...,80186475.0,2
6222,TV Show,Justin Time,2016-04-01 00:00:00,2012.0,TV-Y,"In Justin's dreams, he and his imaginary frien...",70272742.0,2
6223,TV Show,Terrace House: Boys & Girls in the City,2016-04-01 00:00:00,2016.0,TV-14,A new set of six men and women start their liv...,80067942.0,2


In [15]:
# Resetting the index

movies_shows = movies_shows.reset_index(drop=True)
movies_shows.head()

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration
0,Movie,Norm of the North: King Sized Adventure,2019-09-09 00:00:00,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0,90
1,Movie,Jandino: Whatever it Takes,2016-09-09 00:00:00,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0,94
2,TV Show,Transformers Prime,2018-09-08 00:00:00,2013.0,TV-Y7-FV,"With the help of three human allies, the Autob...",70234439.0,1
3,TV Show,Transformers: Robots in Disguise,2018-09-08 00:00:00,2016.0,TV-Y7,When a prison ship crash unleashes hundreds of...,80058654.0,1
4,Movie,#realityhigh,2017-09-08 00:00:00,2017.0,TV-14,When nerdy high schooler Dani finally attracts...,80125979.0,99


In [16]:
movies_shows.isna().sum()

# No nulls remain

type            0
title           0
date_added      0
release_year    0
rating          0
description     0
show_id         0
duration        0
dtype: int64

In [17]:
# EDA for column "type"

movies_shows.type.value_counts()

# Cardinality is only 2 

type
Movie      4255
TV Show    1957
Name: count, dtype: int64

In [18]:
# Correcting datatype of column "date_added"


movies_shows.date_added = pd.to_datetime(movies_shows.date_added, format= '%Y-%m-%d')

movies_shows.date_added

0      2019-09-09
1      2016-09-09
2      2018-09-08
3      2018-09-08
4      2017-09-08
          ...    
6207   2019-04-10
6208   2019-04-01
6209   2016-04-01
6210   2016-04-01
6211   2014-04-01
Name: date_added, Length: 6212, dtype: datetime64[ns]
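
If a date column may still contain malformed entries, `pd.to_datetime` can be told to convert them to `NaT` instead of raising an error. A sketch on a hypothetical `raw` Series, assuming the same `%Y-%m-%d` format:

```python
import pandas as pd

raw = pd.Series(["2019-09-09", "2016-09-09", "not a date"])

# errors="coerce" turns values that do not match the format into NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Original strings that failed to parse, for manual inspection.
bad = raw[parsed.isna()]
print(bad)
```

Here the conversion succeeded without coercion because the rows with missing dates had already been dropped, so this is only a defensive variant.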

In [19]:
# We can extract month, month_name and year from "date_added" column


movies_shows['added_year'] = movies_shows.date_added.dt.year
movies_shows['added_month'] = movies_shows.date_added.dt.month
movies_shows['added_month_name'] = movies_shows.date_added.dt.month_name()


movies_shows.head()

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration,added_year,added_month,added_month_name
0,Movie,Norm of the North: King Sized Adventure,2019-09-09,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0,90,2019,9,September
1,Movie,Jandino: Whatever it Takes,2016-09-09,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0,94,2016,9,September
2,TV Show,Transformers Prime,2018-09-08,2013.0,TV-Y7-FV,"With the help of three human allies, the Autob...",70234439.0,1,2018,9,September
3,TV Show,Transformers: Robots in Disguise,2018-09-08,2016.0,TV-Y7,When a prison ship crash unleashes hundreds of...,80058654.0,1,2018,9,September
4,Movie,#realityhigh,2017-09-08,2017.0,TV-14,When nerdy high schooler Dani finally attracts...,80125979.0,99,2017,9,September
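
Month names sort alphabetically by default, which puts April before January in tables and plots. One option is to store them as an ordered categorical. The sketch below uses a hypothetical `dates` Series; the `month_order` list is built with pandas itself so that its spelling matches `dt.month_name()`.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2019-09-09", "2019-04-10", "2018-12-15", "2017-01-26"]))

# Calendar order of month names: 'January', 'February', ..., 'December'.
month_order = pd.date_range("2000-01-01", periods=12, freq="MS").month_name().tolist()

month_names = pd.Series(
    pd.Categorical(dates.dt.month_name(), categories=month_order, ordered=True)
)

# Sorting now follows the calendar instead of the alphabet.
sorted_names = month_names.sort_values().tolist()
print(sorted_names)
```

Keeping the numeric `added_month` column, as the notebook does, achieves the same ordering when it is used as the sort key.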


In [20]:
# EDA for column "rating"


movies_shows.rating.value_counts()

rating
TV-MA       2024
TV-14       1695
TV-PG        698
R            508
PG-13        286
NR           217
PG           184
TV-Y7        168
TV-G         149
TV-Y         142
TV-Y7-FV      95
G             37
UR             7
NC-17          2
Name: count, dtype: int64

In [21]:
# Correction of datatype in column "release_year"

movies_shows.release_year = movies_shows.release_year.astype(int)

movies_shows.release_year.info()

<class 'pandas.core.series.Series'>
RangeIndex: 6212 entries, 0 to 6211
Series name: release_year
Non-Null Count  Dtype
--------------  -----
6212 non-null   int32
dtypes: int32(1)
memory usage: 24.4 KB


In [22]:
# Correction of datatype of column "show_id"

movies_shows.show_id = movies_shows.show_id.astype(int)
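
The `astype(int)` conversions above work because the rows with nulls were dropped first; a float column that still contains `NaN` cannot be cast to a plain integer dtype. If nulls had to be kept, the nullable `Int64` dtype is one alternative. A sketch on a hypothetical `ids` Series:

```python
import numpy as np
import pandas as pd

ids = pd.Series([81145628.0, np.nan, 70234439.0])

# "Int64" (capital I) is the pandas nullable integer dtype; missing values become <NA>.
nullable_ids = ids.astype("Int64")
print(nullable_ids)
```

This keeps the identifiers as whole numbers without forcing a decision about the missing rows at conversion time.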

In [23]:
# Checking if duplicates available in column "show_id"

movies_shows.show_id.duplicated().sum()

# No duplicates available.

0

In [24]:
# EDA for "title" and "description"


movies_shows[['title', 'description']].duplicated().sum()


1

In [25]:
# There is one movie with the same title and the same description as another record
# "show_id" contains only unique values, so let's check why this duplicate exists


movies_shows[movies_shows[['title', 'description']].duplicated()]

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration,added_year,added_month,added_month_name
2122,Movie,Sarkar,2019-03-02,2018,TV-MA,A ruthless businessman’s mission to expose ele...,81072516,162,2019,3,March


In [26]:
movies_shows[movies_shows.title == 'Sarkar']

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration,added_year,added_month,added_month_name
2121,Movie,Sarkar,2019-03-02,2018,TV-MA,A ruthless businessman’s mission to expose ele...,81075235,162,2019,3,March
2122,Movie,Sarkar,2019-03-02,2018,TV-MA,A ruthless businessman’s mission to expose ele...,81072516,162,2019,3,March
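
By default `duplicated()` flags only the later copies, which is why a second lookup by title was needed to see both rows. Passing `keep=False` flags every member of a duplicate group in one step. A sketch on a hypothetical `toy` frame that reuses the titles and IDs shown above, with made-up descriptions:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Sarkar", "Sarkar", "Benji", "Benji"],
    "description": ["same text", "same text", "1974 film", "2018 film"],
    "show_id": [81075235, 81072516, 296682, 80204923],
})

# keep=False marks all rows of a duplicate group, not just the repeats.
mask = toy.duplicated(subset=["title", "description"], keep=False)
dupes = toy[mask]
print(dupes)
```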


In [27]:
# So the above movies are the same and have the same characteristics, but different "show_id" values.
# This doesn't make sense, so we can drop one of these records
# Dropping the record at index 2122

movies_shows = movies_shows.drop(index=2122)


# Resetting the index

movies_shows = movies_shows.reset_index(drop=True)


movies_shows.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6211 entries, 0 to 6210
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   type              6211 non-null   object        
 1   title             6211 non-null   object        
 2   date_added        6211 non-null   datetime64[ns]
 3   release_year      6211 non-null   int32         
 4   rating            6211 non-null   object        
 5   description       6211 non-null   object        
 6   show_id           6211 non-null   int32         
 7   duration          6211 non-null   object        
 8   added_year        6211 non-null   int32         
 9   added_month       6211 non-null   int32         
 10  added_month_name  6211 non-null   object        
dtypes: datetime64[ns](1), int32(4), object(6)
memory usage: 436.8+ KB


In [28]:
# EDA on column "title"

movies_shows.title.duplicated().sum()

60

In [29]:
# There are 60 duplicate movie or show titles, let's check which they are


movies_shows[movies_shows.title.isin(list(movies_shows.loc[movies_shows.title.duplicated(), 
                                                           'title']))].sort_values(by='title')


# Some movies and TV shows share the same title.

# Some of them appear to be remakes,
# as they have the same title but different descriptions and other characteristics


Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration,added_year,added_month,added_month_name
5893,TV Show,Aquarius,2017-06-16,2016,TV-MA,"Amid the turmoil of 1960s LA, two cops and a p...",80026224,2,2017,6,June
3436,Movie,Aquarius,2017-01-13,2016,UR,The final holdout in her historic beachside bu...,80113667,146,2017,1,January
1968,Movie,Benji,2018-03-06,1974,G,After lovable abandoned mutt Benji is adopted ...,296682,86,2018,3,March
2156,Movie,Benji,2018-03-16,2018,TV-PG,A determined dog comes to the rescue and helps...,80204923,87,2018,3,March
5712,TV Show,Bleach,2018-11-03,2006,TV-14,After teenager Ichigo Kurosaki acquires superp...,70204957,3,2018,11,November
...,...,...,...,...,...,...,...,...,...,...,...
2317,Movie,Wet Hot American Summer,2019-03-01,2001,R,Everyone wants a final shot at action on the l...,60021299,98,2019,3,March
3127,Movie,Zoo,2018-07-01,2018,TV-MA,A drug dealer starts having doubts about his t...,80993648,94,2018,7,July
5644,TV Show,Zoo,2017-10-03,2017,TV-14,When animal species all over the world begin a...,80011206,3,2017,10,October
3451,Movie,Zoom,2020-01-11,2006,PG,"Dragged from civilian life, a former superhero...",81221873,88,2020,1,January
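
To separate the two situations described above (a movie and a TV show sharing a title, versus a remake of the same type), the repeated titles can be summarised per group. This is only a sketch: the `toy` frame borrows a few titles from the output above, and the summary column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Aquarius", "Aquarius", "Benji", "Benji", "Zoom"],
    "type": ["TV Show", "Movie", "Movie", "Movie", "Movie"],
    "release_year": [2016, 2016, 1974, 2018, 2006],
})

# Keep only titles that occur more than once.
repeated = toy[toy["title"].duplicated(keep=False)]

# Per title: number of records, distinct types, and distinct release years.
summary = repeated.groupby("title").agg(
    n_records=("type", "size"),
    n_types=("type", "nunique"),
    n_years=("release_year", "nunique"),
)
print(summary)
```

In this toy data, `n_types == 2` marks a movie and a show sharing a name, while `n_types == 1` with `n_years == 2` suggests a remake.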


In [30]:
# EDA on column "description"

movies_shows.description.duplicated().sum()

7

In [31]:
# There are 7 duplicate movie or show descriptions, let's check which they are


movies_shows[movies_shows.description.isin(list(movies_shows.loc[movies_shows.description.duplicated(), 
                                                           'description']))].sort_values(by='description')


# Some movies have the same description.

# This may be because they were released in different languages or were added to Netflix at different times.
# As the number of duplicate descriptions is very low, they add little value to the data,
# so we can drop duplicates where all characteristics apart from "date_added" and "title" are the same.

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration,added_year,added_month,added_month_name
119,Movie,Oh! Baby (Malayalam),2019-09-25,2019,TV-14,A surly septuagenarian gets another chance at ...,81186758,146,2019,9,September
120,Movie,Oh! Baby (Tamil),2019-09-25,2019,TV-14,A surly septuagenarian gets another chance at ...,81186757,146,2019,9,September
251,Movie,Oh! Baby,2019-09-14,2019,TV-14,A surly septuagenarian gets another chance at ...,81093951,157,2019,9,September
3180,Movie,Solo: A Star Wars Story,2019-01-09,2018,PG-13,A young Han Solo tries to settle an old score ...,80220814,135,2019,1,January
3181,Movie,Solo: A Star Wars Story (Spanish Version),2019-01-09,2018,PG-13,A young Han Solo tries to settle an old score ...,81046962,135,2019,1,January
5177,Movie,Petta (Telugu Version),2019-04-07,2019,TV-14,"An affable, newly appointed college warden pro...",81091424,170,2019,4,April
5201,Movie,Petta,2019-04-05,2019,TV-14,"An affable, newly appointed college warden pro...",81091423,170,2019,4,April
2309,Movie,Sarvam Thaala Mayam (Tamil Version),2019-03-01,2018,TV-14,An aspiring musician battles age-old caste div...,81083971,131,2019,3,March
5171,Movie,Sarvam Thaala Mayam (Telugu Version),2019-04-08,2018,TV-14,An aspiring musician battles age-old caste div...,81074135,131,2019,4,April
4869,Movie,Game Over (Hindi Version),2019-08-21,2019,TV-MA,"As a series of murders hit close to home, a vi...",81151880,98,2019,8,August


In [32]:
# Deleting records having duplicate "description", "type", "release_year" and "duration"

movies_shows = movies_shows.drop_duplicates(subset=['description', 'type', 'release_year', 'duration'], keep='first')


# Resetting the index

movies_shows = movies_shows.reset_index(drop=True)

In [33]:
movies_shows

Unnamed: 0,type,title,date_added,release_year,rating,description,show_id,duration,added_year,added_month,added_month_name
0,Movie,Norm of the North: King Sized Adventure,2019-09-09,2019,TV-PG,Before planning an awesome wedding for his gra...,81145628,90,2019,9,September
1,Movie,Jandino: Whatever it Takes,2016-09-09,2016,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401,94,2016,9,September
2,TV Show,Transformers Prime,2018-09-08,2013,TV-Y7-FV,"With the help of three human allies, the Autob...",70234439,1,2018,9,September
3,TV Show,Transformers: Robots in Disguise,2018-09-08,2016,TV-Y7,When a prison ship crash unleashes hundreds of...,80058654,1,2018,9,September
4,Movie,#realityhigh,2017-09-08,2017,TV-14,When nerdy high schooler Dani finally attracts...,80125979,99,2017,9,September
...,...,...,...,...,...,...,...,...,...,...,...
6200,TV Show,Talking Tom and Friends,2019-04-10,2017,TV-G,Full of funny one-liners and always ready for ...,80162994,2,2019,4,April
6201,TV Show,Pokémon the Series,2019-04-01,2019,TV-Y7-FV,Ash and his Pikachu travel to the Alola region...,80186475,2,2019,4,April
6202,TV Show,Justin Time,2016-04-01,2012,TV-Y,"In Justin's dreams, he and his imaginary frien...",70272742,2,2016,4,April
6203,TV Show,Terrace House: Boys & Girls in the City,2016-04-01,2016,TV-14,A new set of six men and women start their liv...,80067942,2,2016,4,April
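
The effect of the `subset` and `keep='first'` arguments can be seen on the three "Oh! Baby" records listed earlier. The sketch below rebuilds just those rows in a hypothetical `toy` frame, with the truncated description copied as displayed.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Oh! Baby (Malayalam)", "Oh! Baby (Tamil)", "Oh! Baby"],
    "type": ["Movie", "Movie", "Movie"],
    "release_year": [2019, 2019, 2019],
    "duration": [146, 146, 157],
    "description": ["A surly septuagenarian gets another chance at ..."] * 3,
})

subset = ["description", "type", "release_year", "duration"]

# Rows count as duplicates only if all four subset columns match; the first one is kept.
deduped = toy.drop_duplicates(subset=subset, keep="first").reset_index(drop=True)
print(deduped["title"].tolist())
```

The 157-minute version survives because its `duration` differs, so including `duration` in the subset keeps cuts of different length as separate records.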


###### EDA on directors

In [34]:
directors.head()

Unnamed: 0,director,show_id
0,Richard Finn,81145628
1,Fernando Lebrija,80125979
2,Gabe Ibáñez,70304989
3,Rodrigo Toro,80164077
4,Henrik Ruben Genz,70304990


In [35]:
directors.info()

# No nulls present

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4852 entries, 0 to 4851
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   director  4852 non-null   object
 1   show_id   4852 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 75.9+ KB


In [36]:
# Checking for duplicates

directors.duplicated().sum()

# Dropping duplicates

directors = directors.drop_duplicates()

In [37]:
# Checking how many unique values are available

directors.nunique()

director    3655
show_id     4265
dtype: int64

In [38]:
# Checking how many titles were directed by each director


cnt_of_movies_per_director = pd.DataFrame(directors.director.value_counts()).reset_index()
cnt_of_movies_per_director.columns = ['director', 'cnt_of_movies']

cnt_of_movies_per_director

Unnamed: 0,director,cnt_of_movies
0,Jan Suter,21
1,Raúl Campos,19
2,Marcus Raboy,14
3,Jay Karas,14
4,Jay Chapman,12
...,...,...
3650,Haissam Hussain,1
3651,Mitch Gould,1
3652,Kaizad Gustad,1
3653,Clay Porter,1
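
The column names produced by `value_counts()` followed by `reset_index()` differ between pandas versions, which is presumably why the columns are renamed by hand above. Naming the index and the count column explicitly avoids that dependency. A sketch on a hypothetical `toy` frame; `cnt_of_titles` is an illustrative column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "director": ["Jan Suter", "Raúl Campos", "Jan Suter", "Jay Karas"],
    "show_id": [1, 1, 2, 3],
})

counts = (
    toy["director"]
    .value_counts()                      # counts per director, highest first
    .rename_axis("director")             # name the index explicitly
    .reset_index(name="cnt_of_titles")   # name the count column explicitly
)
print(counts)
```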


###### EDA on Cast

In [39]:
cast.head()

Unnamed: 0,cast,show_id
0,Alan Marriott,81145628
1,Jandino Asporaat,80117401
2,Peter Cullen,70234439
3,Will Friedle,80058654
4,Nesta Cooper,80125979


In [40]:
cast.info()

# No nulls present

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44311 entries, 0 to 44310
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   cast     44311 non-null  object
 1   show_id  44311 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 692.5+ KB


In [41]:
# Checking for duplicate records

print("Count of duplicate records : ", cast.duplicated().sum())

# Dropping the duplicate record and resetting the index

cast = cast.drop_duplicates().reset_index(drop=True)


cast

Count of duplicate records :  1


Unnamed: 0,cast,show_id
0,Alan Marriott,81145628
1,Jandino Asporaat,80117401
2,Peter Cullen,70234439
3,Will Friedle,80058654
4,Nesta Cooper,80125979
...,...,...
44305,Kaden Stephen,80108373
44306,Tonye Patano,70136122
44307,Rie Nakagawa,70204989
44308,Yomary Cruz,80000063


In [42]:
# Extracting the unique value counts from cast

cast.nunique()

cast       27405
show_id     5664
dtype: int64

###### EDA on Countries

In [43]:
countries.head()

Unnamed: 0,country,show_id
0,Germany,80016401
1,South Africa,80182274
2,United States,80182274
3,United States,81145628
4,United Kingdom,80117401


In [44]:
countries.info()

# No nulls present

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7179 entries, 0 to 7178
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   country  7179 non-null   object
 1   show_id  7179 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 112.3+ KB


In [45]:
# Checking for duplicate records

print("Count of duplicate records : ", countries.duplicated().sum())


# No duplicate records found

Count of duplicate records :  0


In [46]:
# unique value count of countries

countries.nunique()

country     113
show_id    5758
dtype: int64

###### EDA on category

In [47]:
category.head()

Unnamed: 0,listed_in,show_id
0,Children & Family Movies,81145628
1,Stand-Up Comedy,80117401
2,Kids' TV,70234439
3,Kids' TV,80058654
4,Comedies,80125979


In [48]:
category.info()

# No nulls available

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13670 entries, 0 to 13669
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   listed_in  13670 non-null  object
 1   show_id    13670 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 213.7+ KB


In [49]:
# Checking for duplicate records

print("Count of duplicate records : ", category.duplicated().sum())

Count of duplicate records :  0


In [50]:
# Unique value counts

category.nunique()

listed_in      42
show_id      6234
dtype: int64
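
`category` lists 6234 distinct `show_id` values, more than the number of rows left in `movies_shows` after cleaning, so the child tables still refer to titles that were dropped. Whether to remove those rows depends on the later analysis; the sketch below only shows one way to find them, using hypothetical `movies` and `cast_toy` frames.

```python
import pandas as pd

movies = pd.DataFrame({"show_id": [1, 2], "title": ["A", "B"]})
cast_toy = pd.DataFrame({"cast": ["x", "y", "z"], "show_id": [1, 2, 3]})

kept_ids = set(movies["show_id"])
in_movies = cast_toy["show_id"].isin(kept_ids)

orphans = cast_toy[~in_movies]                             # rows whose title was dropped
cast_matched = cast_toy[in_movies].reset_index(drop=True)  # rows safe to join on show_id
print(len(orphans), "orphan row(s)")
```

An inner join on `show_id` during analysis would discard the orphans implicitly, so filtering them up front is mainly useful for keeping the exported files consistent with each other.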

In [51]:
# Exporting the cleaned data into csv files for further analysis

out_dir = "C:/Users/navee/OneDrive/Desktop/GitHub/End_to_end_case_studies/Netflix/Data/Cleaned_data"

movies_shows.to_csv(f"{out_dir}/movies_shows.csv", index=False)
directors.to_csv(f"{out_dir}/directors.csv", index=False)
cast.to_csv(f"{out_dir}/cast.csv", index=False)
countries.to_csv(f"{out_dir}/countries.csv", index=False)
category.to_csv(f"{out_dir}/category.csv", index=False)
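
The export can also be written as a loop over a name-to-table mapping with `pathlib`, which creates the output folder if it does not exist. This is a sketch only: the `tables` dict holds hypothetical stand-in frames, and a temporary directory replaces the real output folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-ins for the cleaned tables; in the notebook these would be the real DataFrames.
tables = {
    "movies_shows": pd.DataFrame({"show_id": [1], "title": ["A"]}),
    "cast": pd.DataFrame({"cast": ["x"], "show_id": [1]}),
}

# Temporary folder used here as a stand-in for the real "Cleaned_data" directory.
out_dir = Path(tempfile.mkdtemp()) / "Cleaned_data"
out_dir.mkdir(parents=True, exist_ok=True)

written = []
for name, frame in tables.items():
    target = out_dir / f"{name}.csv"
    frame.to_csv(target, index=False)  # index=False keeps the row labels out of the file
    written.append(target)

print([p.name for p in written])
```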

# ---------------------------------------------------END---------------------------------------------------