# Netflix Movies and TV Shows

#### About this Dataset: 
Netflix is one of the most popular media and video streaming platforms. It has over 8,000 movies and TV shows available, and as of mid-2021 it had over 200M subscribers globally. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.


#### Interesting Task Ideas

1. Understanding what content is available in different countries
2. Identifying similar content by matching text-based features
3. Network analysis of actors and directors to find interesting insights
4. Has Netflix focused more on TV shows than movies in recent years?


#### Source of Dataset : https://www.kaggle.com/datasets/nishanthkv/netflix

In [1]:
# Importing required libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [57]:
# importing dataset


data_path = "C:/Users/Naveen/Desktop/GitHub/For_practice/End_to_end_case_studies/Netflix/Project_1/Data/netflix_titles.xlsx"

# Each sheet of the workbook holds one table, keyed by show_id
title = pd.read_excel(data_path, sheet_name="netflix_titles")
director = pd.read_excel(data_path, sheet_name="netflix_titles_directors")
countries = pd.read_excel(data_path, sheet_name="netflix_titles_countries")
cast = pd.read_excel(data_path, sheet_name="netflix_titles_cast")
category = pd.read_excel(data_path, sheet_name="netflix_titles_category")

In [58]:
title.head(2)

Unnamed: 0,duration_minutes,duration_seasons,type,title,date_added,release_year,rating,description,show_id
0,90,,Movie,Norm of the North: King Sized Adventure,2019-09-09 00:00:00,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0
1,94,,Movie,Jandino: Whatever it Takes,2016-09-09 00:00:00,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0


In [59]:
cast.head(2)

Unnamed: 0,cast,show_id
0,Alan Marriott,81145628
1,Jandino Asporaat,80117401


In [60]:
director.head(2)

Unnamed: 0,director,show_id
0,Richard Finn,81145628
1,Fernando Lebrija,80125979


In [61]:
countries.head(2)

Unnamed: 0,country,show_id
0,Germany,80016401
1,South Africa,80182274


In [62]:
category.head(2)

Unnamed: 0,listed_in,show_id
0,Children & Family Movies,81145628
1,Stand-Up Comedy,80117401
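
The five tables share a `show_id` column, so they can presumably be combined with a join. The sketch below uses toy frames built from the rows shown in the previews above and aligns the key dtype first, since `show_id` appears as float in `title` and as int in the other tables. The name `title_cast` is hypothetical, and a left join is only one reasonable choice.

```python
import pandas as pd

# Toy frames built from the rows shown in the previews above.
title = pd.DataFrame({
    "title": ["Norm of the North: King Sized Adventure", "Jandino: Whatever it Takes"],
    "type": ["Movie", "Movie"],
    "show_id": [81145628.0, 80117401.0],  # float, as in title.head()
})
cast = pd.DataFrame({
    "cast": ["Alan Marriott", "Jandino Asporaat"],
    "show_id": [81145628, 80117401],  # int, as in cast.head()
})

# Defensive step: give both keys the same nullable integer dtype.
title["show_id"] = title["show_id"].astype("Int64")
cast["show_id"] = cast["show_id"].astype("Int64")

# A left join keeps every title, even one without a cast entry.
title_cast = title.merge(cast, on="show_id", how="left")
print(title_cast[["title", "cast"]])
```

Because a title can have several cast members, the merged frame may contain more rows than `title` on the full data.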


#### For title

In [63]:
title.info()

# duration_seasons has only 1971 non-null values out of 6236, which is too few,
# so it is dropped for further analysis.

title.drop(columns=["duration_seasons"], inplace=True)

# Data types to correct: duration_minutes -> int, date_added -> datetime,
# release_year -> int.
# show_id is an identifier, so it should not be float; convert it to int, or to str/object.
#
# duration_minutes contains too many nulls to drop those rows, so fill them with
# the median and add a missing_duration column so the imputed rows can be
# identified later if needed.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6236 entries, 0 to 6235
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   duration_minutes  4267 non-null   object 
 1   duration_seasons  1971 non-null   object 
 2   type              6235 non-null   object 
 3   title             6235 non-null   object 
 4   date_added        6223 non-null   object 
 5   release_year      6234 non-null   float64
 6   rating            6223 non-null   object 
 7   description       6233 non-null   object 
 8   show_id           6232 non-null   float64
dtypes: float64(2), object(7)
memory usage: 438.6+ KB
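
The cleanup plan in the notes above (type conversion, median fill, and a `missing_duration` flag) could look roughly like the following. This is a minimal sketch on toy rows, not the notebook's actual implementation: coercing unparseable values to missing and using the nullable `Int64` dtype are assumptions.

```python
import pandas as pd

# Toy rows mimicking the raw sheet: object-typed minutes, float IDs, gaps,
# and one malformed row with text in the wrong columns.
title = pd.DataFrame({
    "duration_minutes": ["90", "94", None, "Flying Fortress"],
    "date_added": ["2019-09-09 00:00:00", "2016-09-09 00:00:00", None, "TV-PG"],
    "release_year": [2019.0, 2016.0, 2017.0, None],
    "show_id": [81145628.0, 80117401.0, 80125979.0, None],
})

# Values that cannot be parsed become NaN / NaT instead of raising.
title["duration_minutes"] = pd.to_numeric(title["duration_minutes"], errors="coerce")
title["date_added"] = pd.to_datetime(title["date_added"], errors="coerce")

# Nullable Int64 keeps missing values while dropping the ".0" float form.
title["release_year"] = title["release_year"].astype("Int64")
# show_id is an identifier, so store it as text after removing the float form.
title["show_id"] = title["show_id"].astype("Int64").astype("string")

# Flag missing durations first, then fill them with the median.
title["missing_duration"] = title["duration_minutes"].isna()
median_minutes = title["duration_minutes"].median()
title["duration_minutes"] = title["duration_minutes"].fillna(median_minutes)

print(title.dtypes)
```

Creating the flag before the fill matters: once the gaps are filled, the imputed rows can no longer be told apart from real values.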

In [64]:
print(f"No. of duplicates : {title.duplicated().sum()}")

No. of duplicates : 0
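
`duplicated()` with no arguments only flags rows that match in every column. A complementary check, sketched here on toy rows, is to look for repeats on a subset of columns such as `show_id` or `title`. The variable names are hypothetical.

```python
import pandas as pd

# Toy rows: two entries share a title but differ in year and show_id.
title = pd.DataFrame({
    "title": ["Frank and Cindy", "Frank and Cindy", "Iverson"],
    "release_year": [2007.0, 2015.0, 2014.0],
    "show_id": [80085438.0, 80085439.0, 80011846.0],
})

# Rows identical in every column.
full_dupes = title.duplicated().sum()
# Rows that reuse an identifier already seen.
id_dupes = title.duplicated(subset=["show_id"]).sum()
# keep=False marks every row that shares its title with another row.
name_dupes = title.duplicated(subset=["title"], keep=False).sum()

print(full_dupes, id_dupes, name_dupes)
```

A repeated title is not necessarily an error, since two distinct works can share a name, so such rows are worth inspecting rather than dropping automatically.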


In [71]:
# Preview the rows with no missing values (the result is not assigned, so title is unchanged)
title.dropna()

Unnamed: 0,duration_minutes,type,title,date_added,release_year,rating,description,show_id
0,90,Movie,Norm of the North: King Sized Adventure,2019-09-09 00:00:00,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0
1,94,Movie,Jandino: Whatever it Takes,2016-09-09 00:00:00,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0
4,99,Movie,#realityhigh,2017-09-08 00:00:00,2017.0,TV-14,When nerdy high schooler Dani finally attracts...,80125979.0
6,110,Movie,Automata,2017-09-08 00:00:00,2014.0,R,"In a dystopian future, an insurance adjuster f...",70304989.0
7,60,Movie,Fabrizio Copano: Solo pienso en mi,2017-09-08 00:00:00,2017.0,TV-MA,Fabrizio Copano takes audience participation t...,80164077.0
...,...,...,...,...,...,...,...,...
5577,106,Movie,Toro,2017-04-01 00:00:00,2016.0,NR,Ex-con Toro's brother and former partner in cr...,80093107.0
5579,70,Movie,Frank and Cindy,2016-04-01 00:00:00,2007.0,TV-MA,Frank was a rising pop star when he married Ci...,80085438.0
5580,102,Movie,Frank and Cindy,2016-04-01 00:00:00,2015.0,R,A student filmmaker vengefully turns his camer...,80085439.0
5581,88,Movie,Iverson,2016-04-01 00:00:00,2014.0,NR,This unfiltered documentary follows the rocky ...,80011846.0


In [66]:
title[title.type == "TV Show"]["duration_minutes"].isna().sum()

np.int64(1969)
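
The count above matches the total number of missing `duration_minutes` values reported by `title.info()` (6236 - 4267 = 1969), which suggests the gaps belong to TV shows rather than movies. A per-type breakdown makes this visible in one step. The sketch below uses toy data, and `missing_by_type` is a hypothetical name.

```python
import pandas as pd

# Toy data: movies carry a runtime, TV shows do not.
title = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "duration_minutes": [90, 94, None, None],
})

# Count missing durations separately for each content type.
missing_by_type = title["duration_minutes"].isna().groupby(title["type"]).sum()
print(missing_by_type)
```

If the real data shows the same pattern, a median computed from movie runtimes may not be a meaningful fill value for TV shows, so the `missing_duration` flag becomes important for later analysis.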

In [68]:
# Inspect the row whose type column holds 1944: its values are shifted into the wrong columns
title[title.type == 1944]

Unnamed: 0,duration_minutes,type,title,date_added,release_year,rating,description,show_id
2018,"Flying Fortress""",1944,TV-PG,This documentary centers on the crew of the B-...,80119194.0,,,
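
Rather than searching for one known bad value, misaligned rows like this one can be found by checking `type` against the values it is expected to take. This is a sketch on toy rows. The set `expected_types` is an assumption based on the two types seen in this notebook.

```python
import pandas as pd

# Toy rows: one valid row and one whose values are shifted by a column.
title = pd.DataFrame({
    "duration_minutes": [90, 'Flying Fortress"'],
    "type": ["Movie", 1944],
    "title": ["Norm of the North: King Sized Adventure", "TV-PG"],
})

# Assumption: only these two content types are valid.
expected_types = {"Movie", "TV Show"}

# Any row with an unexpected (or missing) type is a candidate for repair.
bad_rows = title[~title["type"].isin(expected_types)]
print(bad_rows)
```

Rows with a missing `type` are flagged too, which is usually desirable at this stage.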

