# EDA of Netflix

## Dataset Content
This dataset consists of TV shows and movies available on Netflix as of 2019.
The dataset was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix
has nearly tripled since 2010, while the number of movies has decreased by more than
2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or
Rotten Tomatoes scores could also provide many interesting findings.

## Inspiration

- Understanding what content is available in different countries
- Identifying similar content by matching text-based features
- Network analysis of actors and directors to find interesting insights
- Has Netflix increasingly focused on TV shows rather than movies in recent years?




In [1]:
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
df= pd.read_csv('netflix_titles_nov_2019.csv')
df.head()

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,type
0,81193313,Chocolate,,"Ha Ji-won, Yoon Kye-sang, Jang Seung-jo, Kang ...",South Korea,"November 30, 2019",2019,TV-14,1 Season,"International TV Shows, Korean TV Shows, Roman...",Brought together by meaningful meals in the pa...,TV Show
1,81197050,Guatemala: Heart of the Mayan World,"Luis Ara, Ignacio Jaunsolo",Christian Morales,,"November 30, 2019",2019,TV-G,67 min,"Documentaries, International Movies","From Sierra de las Minas to Esquipulas, explor...",Movie
2,81213894,The Zoya Factor,Abhishek Sharma,"Sonam Kapoor, Dulquer Salmaan, Sanjay Kapoor, ...",India,"November 30, 2019",2019,TV-14,135 min,"Comedies, Dramas, International Movies",A goofy copywriter unwittingly convinces the I...,Movie
3,81082007,Atlantics,Mati Diop,"Mama Sane, Amadou Mbow, Ibrahima Traore, Nicol...","France, Senegal, Belgium","November 29, 2019",2019,TV-14,106 min,"Dramas, Independent Movies, International Movies","Arranged to marry a rich man, young Ada is cru...",Movie
4,80213643,Chip and Potato,,"Abigail Oliver, Andrea Libman, Briana Buckmast...","Canada, United Kingdom",,2019,TV-Y,2 Seasons,Kids' TV,"Lovable pug Chip starts kindergarten, makes ne...",TV Show


In [3]:
def data_inv(df):
    print('Netflix Shows and Movies: ',df.shape[0])
    print('Dataset Variables: ',df.shape[1])
    print('-'*10)
    print('Dataset Columns: \n')
    print(df.columns)
    print('-'*10)
    print("Data type of each column: \n")
    print(df.dtypes)
    print('-'*10)
    print('missing rows in each column: \n')
    c=df.isnull().sum()      # number of NaN values per column
    print(c[c>0])            # show only the columns that have missing values
data_inv(df)

Netflix Shows and Movies:  5837
Dataset Variables:  12
----------
Dataset Columns: 

Index(['show_id', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'type'],
      dtype='object')
----------
Data type of each column: 

show_id          int64
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
type            object
dtype: object
----------
missing rows in each column: 

director      1901
cast           556
country        427
date_added     642
rating          10
dtype: int64


## Data Cleaning


- Drop the id column
- Drop duplicate shows
- Create a new column that shows the number of cast members in each row
- The rating column has 10 missing values; replace them with the mode
- For missing values in the date_added column, replace them with January 1, {release_year}
- I think we cannot replace missing values in the country column with arbitrary countries, but we can use the genre to infer the country,
  e.g. replace missing values with Japan for Anime
- Convert the date_added column from object type to datetime
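
A few of the steps above can be sketched on a tiny made-up frame. This is only an illustration under stated assumptions: the `toy` DataFrame and its values are invented, the date strings are assumed to follow the `"Month D, YYYY"` pattern seen in `df.head()`, and the Anime-to-Japan rule is the heuristic proposed in the list, not a verified fact about every title.

```python
import pandas as pd

# Made-up rows that mimic the relevant columns (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["India", None, None, "Japan"],
    "listed_in": ["Dramas", "Anime Series, International TV Shows",
                  "Comedies", "Anime Features"],
    "date_added": ["March 2, 2019", None, "November 30, 2019", None],
    "release_year": [2018, 2019, 2019, 2017],
    "rating": ["TV-MA", None, "TV-14", "TV-14"],
})

# Fill missing ratings with the most frequent rating (the mode).
toy["rating"] = toy["rating"].fillna(toy["rating"].mode()[0])

# Fill missing date_added with "January 1, <release_year>", then parse to datetime.
fallback = "January 1, " + toy["release_year"].astype(str)
toy["date_added"] = pd.to_datetime(
    toy["date_added"].fillna(fallback), format="%B %d, %Y"
)

# Heuristic from the list: rows with no country whose genre mentions Anime
# are assumed to be Japanese. Other missing countries are left untouched.
is_anime = toy["listed_in"].str.contains("Anime", na=False)
toy.loc[toy["country"].isna() & is_anime, "country"] = "Japan"

print(toy[["title", "country", "date_added", "rating"]])
```

If the real `date_added` strings carry stray whitespace, stripping them with `.str.strip()` before parsing may be needed.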

In [4]:
# df.duplicated returns a boolean Series marking duplicate rows.

dups=df.duplicated(['title','country','type','release_year'])
df[dups]

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,type
1134,80175351,Kakegurui,,"Saori Hayami, Minami Tanaka, Tatsuya Tokutake,...",Japan,,2019,TV-14,2 Seasons,"Anime Series, International TV Shows, TV Thril...",High roller Yumeko Jabami plans to clean house...,TV Show
1741,81072516,Sarkar,A.R. Murugadoss,"Vijay, Varalakshmi Sarathkumar, Keerthi Suresh...",India,"March 2, 2019",2018,TV-MA,162 min,"Action & Adventure, Dramas, International Movies",A ruthless businessman’s mission to expose ele...,Movie


In [5]:
df=df.drop_duplicates(['title','country','type','release_year'])
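
Both `duplicated` and `drop_duplicates` default to `keep='first'`, so only the later copies are flagged and dropped. The sketch below shows the difference between the default and `keep=False` on a made-up frame; the `toy` rows and the `key` variable are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: "Show A" appears twice with the same key columns.
toy = pd.DataFrame({
    "title": ["Show A", "Show A", "Show B"],
    "country": ["India", "India", "Japan"],
    "type": ["Movie", "Movie", "TV Show"],
    "release_year": [2018, 2018, 2019],
})

key = ["title", "country", "type", "release_year"]

later_copies = toy.duplicated(key)            # default keep='first': flags repeats only
all_copies = toy.duplicated(key, keep=False)  # flags every row of a duplicated group
deduped = toy.drop_duplicates(key)            # keeps the first row of each group

print(toy[all_copies])   # inspect both copies side by side before dropping
print(deduped)
```

Inspecting with `keep=False` before dropping makes it easier to check whether the copies differ in columns outside the key, such as `date_added`.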

In [6]:
df=df.drop('show_id',axis=1)

In [None]:
df['cast']=df['cast'].replace(np.nan,'Unknown')    # replace NaN values with 'Unknown'
def cast_counter(cast):
    if cast=='Unknown':
        return 0          # no cast listed, so the count is zero
    else:
        lst=cast.split(', ')   # names are separated by a comma and a space
        length=len(lst)
        return length
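
A cast counter that also tolerates missing values directly could look like the sketch below. The function name `count_cast_members` is hypothetical, and it assumes names in the `cast` column are separated by commas, as in the rows shown by `df.head()`.

```python
def count_cast_members(cast):
    """Return the number of names in a comma-separated cast string."""
    # Missing values arrive as NaN (a float) or None; 'Unknown' is the
    # placeholder used in this notebook. All of these count as zero.
    if not isinstance(cast, str) or cast.strip() in ("", "Unknown"):
        return 0
    # Split on commas and ignore empty fragments left by stray separators.
    return len([name for name in cast.split(",") if name.strip()])


print(count_cast_members("Vijay, Varalakshmi Sarathkumar, Keerthi Suresh"))
print(count_cast_members("Unknown"))
```

It could then be applied with something like `df['cast_count'] = df['cast'].apply(count_cast_members)`, where `cast_count` is an assumed column name. Note that `df.head()` truncates long strings for display only; the function operates on the full values.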