# Problem Statement

## About NETFLIX

Netflix is one of the most popular media and video streaming platforms. As of mid-2021, it has over 10,000 movies and TV shows available on its platform and over 222M subscribers globally. This tabular dataset lists the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.

## Business Problem

Analyze the data and generate insights that could help Netflix decide which types of shows and movies to produce and how to grow the business in different countries.

# Module Imports and Dataset

## Modules

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

## Dataset

In [2]:
df = pd.read_csv("netflix.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


| **`Attribute`**  | **`Description`**                                    |
|------------------|------------------------------------------------------|
| **Show_id**      | Unique ID for every movie / TV show                  |
| **Type**         | Identifier - A Movie or TV Show                      |
| **Title**        | Title of the movie / TV show                         |
| **Director**     | Director of the movie/show                           |
| **Cast**         | Actors involved in the movie/show                    |
| **Country**      | Country where the movie/show was produced            |
| **Date_added**   | Date it was added on Netflix                         |
| **Release_year** | Actual Release year of the movie/show                |
| **Rating**       | TV Rating of the movie/show                          |
| **Duration**     | Total Duration - in minutes or number of seasons     |
| **Listed_in**    | Genre                                                |
| **Description**  | The summary description                              |


# Data Cleaning and Preprocessing

## Basic Checks

In [3]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [4]:
df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

In [5]:
df.nunique()

show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64

In [6]:
df.duplicated().sum()

0

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


## Unnesting data

In [8]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


### `Unnesting Director Column`

In [9]:
df['director'].head(20)

0                                   Kirsten Johnson
1                                               NaN
2                                   Julien Leclercq
3                                               NaN
4                                               NaN
5                                     Mike Flanagan
6                     Robert Cullen, José Luis Ucha
7                                      Haile Gerima
8                                   Andy Devonshire
9                                    Theodore Melfi
10                                              NaN
11                                Kongkiat Komesiri
12                              Christian Schwochow
13                                    Bruno Garotti
14                                              NaN
15                                              NaN
16    Pedro de Echave García, Pablo Azorín Williams
17                                              NaN
18                                       Adam Salky
19          

- Rows 6 and 16 each list two directors in a single comma-separated string, so these values need to be unnested into one row per director.

In [10]:
temp = df['director'].apply(lambda s: str(s).split(', ')).tolist()
df_directors = pd.DataFrame(temp, index = df["title"])
df_directors = df_directors.stack().reset_index()
df_directors.rename(columns={0: "Director"}, inplace=True)
df_directors.drop(['level_1'], axis = 1, inplace=True)

df_directors.head(20)

Unnamed: 0,title,Director
0,Dick Johnson Is Dead,Kirsten Johnson
1,Blood & Water,
2,Ganglands,Julien Leclercq
3,Jailbirds New Orleans,
4,Kota Factory,
5,Midnight Mass,Mike Flanagan
6,My Little Pony: A New Generation,Robert Cullen
7,My Little Pony: A New Generation,José Luis Ucha
8,Sankofa,Haile Gerima
9,The Great British Baking Show,Andy Devonshire
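The same unnesting can be sketched more compactly with `Series.str.split` and `DataFrame.explode`. This is an alternative under stated assumptions, not the approach used in this notebook: the `toy` frame and its values are made up for illustration. One difference worth noting is that missing directors stay as real `NaN` values here, whereas `str(s).split(', ')` turns them into the literal string `'nan'`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw data: one comma-separated string per title.
toy = pd.DataFrame({
    "title": ["Film A", "Show B", "Film C"],
    "director": ["Ann Lee, Bo Kim", np.nan, "Cy Diaz"],
})

unnested = (
    toy.assign(Director=toy["director"].str.split(", "))  # string -> list (NaN stays NaN)
       .explode("Director")[["title", "Director"]]        # one row per list element
       .reset_index(drop=True)
)

print(unnested)
```

Because `explode` keeps the other columns aligned, there is no need to index by title, stack, and drop the `level_1` helper column.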


### `Unnesting Cast Column`

In [11]:
df['cast'].head(20)

0                                                   NaN
1     Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...
2     Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...
3                                                   NaN
4     Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...
5     Kate Siegel, Zach Gilford, Hamish Linklater, H...
6     Vanessa Hudgens, Kimiko Glenn, James Marsden, ...
7     Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...
8     Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...
9     Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...
10                                                  NaN
11    Sukollawat Kanarot, Sushar Manaying, Pavarit M...
12    Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
13    Klara Castanho, Lucca Picon, Júlia Gomes, Marc...
14                                                  NaN
15    Logan Browning, Brandon P. Bell, DeRon Horton,...
16                                                  NaN
17    Luis Ernesto Franco, Camila Sodi, Sergio G

- Rows 1, 2, and many others hold several cast members in a single string, so this column needs to be unnested as well.

In [12]:
temp = df['cast'].apply(lambda s: str(s).split(', ')).tolist()
df_casts = pd.DataFrame(temp, index = df["title"])
df_casts = df_casts.stack().reset_index()
df_casts.rename(columns={0: "Cast"}, inplace=True)
df_casts.drop(['level_1'], axis = 1, inplace=True)

df_casts.head(20)

Unnamed: 0,title,Cast
0,Dick Johnson Is Dead,
1,Blood & Water,Ama Qamata
2,Blood & Water,Khosi Ngema
3,Blood & Water,Gail Mabalane
4,Blood & Water,Thabang Molaba
5,Blood & Water,Dillon Windvogel
6,Blood & Water,Natasha Thahane
7,Blood & Water,Arno Greeff
8,Blood & Water,Xolile Tshabalala
9,Blood & Water,Getmore Sithole


### `Unnesting Country Column`

In [13]:
df['country'].head(20)

0                                         United States
1                                          South Africa
2                                                   NaN
3                                                   NaN
4                                                 India
5                                                   NaN
6                                                   NaN
7     United States, Ghana, Burkina Faso, United Kin...
8                                        United Kingdom
9                                         United States
10                                                  NaN
11                                                  NaN
12                              Germany, Czech Republic
13                                                  NaN
14                                                  NaN
15                                        United States
16                                                  NaN
17                                              

- Row 7 and many others list several countries in a single string, so this column needs to be unnested as well.

In [14]:
temp = df['country'].apply(lambda s: str(s).split(', ')).tolist()
df_countries = pd.DataFrame(temp, index = df["title"])
df_countries = df_countries.stack().reset_index()
df_countries.rename(columns={0: "Country"}, inplace=True)
df_countries.drop(['level_1'], axis = 1, inplace=True)

df_countries.head(20)

Unnamed: 0,title,Country
0,Dick Johnson Is Dead,United States
1,Blood & Water,South Africa
2,Ganglands,
3,Jailbirds New Orleans,
4,Kota Factory,India
5,Midnight Mass,
6,My Little Pony: A New Generation,
7,Sankofa,United States
8,Sankofa,Ghana
9,Sankofa,Burkina Faso


### `Unnesting listed_in Column`

In [15]:
df['listed_in'].head(20)

0                                         Documentaries
1       International TV Shows, TV Dramas, TV Mysteries
2     Crime TV Shows, International TV Shows, TV Act...
3                                Docuseries, Reality TV
4     International TV Shows, Romantic TV Shows, TV ...
5                    TV Dramas, TV Horror, TV Mysteries
6                              Children & Family Movies
7      Dramas, Independent Movies, International Movies
8                          British TV Shows, Reality TV
9                                      Comedies, Dramas
10    Crime TV Shows, Docuseries, International TV S...
11    Crime TV Shows, International TV Shows, TV Act...
12                         Dramas, International Movies
13                   Children & Family Movies, Comedies
14         British TV Shows, Crime TV Shows, Docuseries
15                               TV Comedies, TV Dramas
16                  Documentaries, International Movies
17    Crime TV Shows, Spanish-Language TV Shows,

- Rows 1, 2, and many others list several genres in a single string, so this column needs to be unnested as well.

In [16]:
temp = df['listed_in'].apply(lambda s: str(s).split(', ')).tolist()
df_genre = pd.DataFrame(temp, index = df["title"])
df_genre = df_genre.stack().reset_index()
df_genre.rename(columns={0: "Genre"}, inplace=True)
df_genre.drop(['level_1'], axis = 1, inplace=True)

df_genre.head(20)

Unnamed: 0,title,Genre
0,Dick Johnson Is Dead,Documentaries
1,Blood & Water,International TV Shows
2,Blood & Water,TV Dramas
3,Blood & Water,TV Mysteries
4,Ganglands,Crime TV Shows
5,Ganglands,International TV Shows
6,Ganglands,TV Action & Adventure
7,Jailbirds New Orleans,Docuseries
8,Jailbirds New Orleans,Reality TV
9,Kota Factory,International TV Shows


### Merging above dataframes

In [17]:
df_new = df_directors.merge(df_casts, on="title", how="inner")
df_new = df_new.merge(df_countries, on="title", how="inner")
df_new = df_new.merge(df_genre, on="title", how="inner")

df_new = df_new.merge(df[['show_id', 'type', 'title', 'date_added',
       'release_year', 'rating', 'duration', 'description']], on="title", how="inner")

df_new.head()

Unnamed: 0,title,Director,Cast,Country,Genre,show_id,type,date_added,release_year,rating,duration,description
0,Dick Johnson Is Dead,Kirsten Johnson,,United States,Documentaries,s1,Movie,"September 25, 2021",2020,PG-13,90 min,"As her father nears the end of his life, filmm..."
1,Blood & Water,,Ama Qamata,South Africa,International TV Shows,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
2,Blood & Water,,Ama Qamata,South Africa,TV Dramas,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
3,Blood & Water,,Ama Qamata,South Africa,TV Mysteries,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
4,Blood & Water,,Khosi Ngema,South Africa,International TV Shows,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
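Merging the unnested frames on `title` produces, for each title, every combination of its directors, cast members, countries, and genres, so the row count per title is the product of those four counts. The sketch below illustrates this on made-up frames (the names `directors`, `casts`, `genres` and their values are hypothetical) and shows why counts taken on the merged frame should use distinct titles rather than rows.

```python
import pandas as pd

# Hypothetical unnested frames: T1 has 2 directors, 3 cast members, 1 genre;
# T2 has 1 director, 1 cast member, 2 genres.
directors = pd.DataFrame({"title": ["T1", "T1", "T2"], "Director": ["D1", "D2", "D3"]})
casts = pd.DataFrame({"title": ["T1", "T1", "T1", "T2"], "Cast": ["C1", "C2", "C3", "C4"]})
genres = pd.DataFrame({"title": ["T1", "T2", "T2"], "Genre": ["G1", "G2", "G3"]})

merged = directors.merge(casts, on="title").merge(genres, on="title")

# Rows per title equal the product of the per-column counts: 2*3*1 and 1*1*2.
rows_per_title = merged.groupby("title").size()

# Counting rows inflates genres attached to titles with many directors or cast members...
naive_counts = merged["Genre"].value_counts()
# ...so count distinct titles instead.
titles_per_genre = merged.groupby("Genre")["title"].nunique()

print(rows_per_title, naive_counts, titles_per_genre, sep="\n\n")
```

The same caution applies to any statistic computed on the merged frame, including the modes used for imputation: titles with large casts contribute more rows and therefore more weight.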


## Handling Missing Values

In [18]:
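# str(NaN) produced the literal string 'nan' during unnesting, so replace that string.
# Assigning the result back avoids calling replace(..., inplace=True) on a column selection.
df_new['Cast'] = df_new['Cast'].replace('nan', 'Unknown Cast')  # missing cast -> 'Unknown Cast'
df_new['Director'] = df_new['Director'].replace('nan', 'Unknown Director')  # missing director -> 'Unknown Director'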

In [19]:
sorted(df_new['Country'].unique())

['',
 'Afghanistan',
 'Albania',
 'Algeria',
 'Angola',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bangladesh',
 'Belarus',
 'Belgium',
 'Bermuda',
 'Botswana',
 'Brazil',
 'Bulgaria',
 'Burkina Faso',
 'Cambodia',
 'Cambodia,',
 'Cameroon',
 'Canada',
 'Cayman Islands',
 'Chile',
 'China',
 'Colombia',
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Dominican Republic',
 'East Germany',
 'Ecuador',
 'Egypt',
 'Ethiopia',
 'Finland',
 'France',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Guatemala',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kuwait',
 'Latvia',
 'Lebanon',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Malawi',
 'Malaysia',
 'Malta',
 'Mauritius',
 'Mexico',
 'Mongolia',
 'Montenegro',
 'Morocco',
 'Mozambique',
 'Namibia',
 'Nepal',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'N

- `Cambodia,` is a duplicate of `Cambodia` caused by a trailing comma.
- Similarly, `Poland,`, `United Kingdom,`, and `United States,` duplicate `Poland`, `United Kingdom`, and `United States` respectively.
- So the first step is to remove this redundancy by stripping the stray commas.
- The empty string `''` is also replaced with `np.nan` so that it is counted as a missing value.

In [20]:
df_new['Country'] = df_new['Country'].str.replace(',', '', regex=False)  # drop stray commas such as 'Cambodia,'
df_new['Country'] = df_new['Country'].replace(['', 'nan'], np.nan)  # treat empty strings and 'nan' strings as missing
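As a small illustration of the cleanup described above, the sketch below normalizes a handful of made-up country labels. It is a variant under assumptions, not the notebook's exact code: it strips only trailing commas and surrounding whitespace (rather than removing every comma), and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical labels showing the three problems: trailing commas, '' and 'nan'.
countries = pd.Series(["Cambodia", "Cambodia,", "United States,", "", "nan", "Poland"])

cleaned = (
    countries.str.strip()        # remove surrounding whitespace
             .str.rstrip(",")    # remove trailing commas only
             .replace({"": np.nan, "nan": np.nan})  # mark placeholders as missing
)

print(sorted(cleaned.dropna().unique()))
print(cleaned.isna().sum())
```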

In [21]:
df_new.isna().sum()

title               0
Director            0
Cast                0
Country         11929
Genre               0
show_id             0
type                0
date_added        158
release_year        0
rating             67
duration            3
description         0
dtype: int64

### Filling Missing Values of rating and duration

In [22]:
df_new.loc[df_new['rating'].isin(['74 min', '66 min', '84 min'])]

Unnamed: 0,title,Director,Cast,Country,Genre,show_id,type,date_added,release_year,rating,duration,description
126537,Louis C.K. 2017,Louis C.K.,Louis C.K.,United States,Movies,s5542,Movie,"April 4, 2017",2017,74 min,,"Louis C.K. muses on religion, eternal love, gi..."
131603,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,Movies,s5795,Movie,"September 16, 2016",2010,84 min,,Emmy-winning comedy writer Louis C.K. brings h...
131737,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,Movies,s5814,Movie,"August 15, 2016",2015,66 min,,The comic puts his trademark hilarious/thought...


- For the 3 records in the last output, the duration value was clearly entered in the rating column by mistake, leaving duration empty.
- So the duration of these 3 records is filled with the corresponding rating value.
- The rating is then replaced with `NR`, i.e., Not Rated.

In [23]:
indices = df_new.loc[df_new['rating'].isin(['74 min', '66 min', '84 min'])].index # indices of these 3 records
df_new.loc[indices, "duration"] = df_new.loc[indices, "rating"] # filling corresponding duration value by rating value
df_new.loc[indices, "rating"] = 'NR'
df_new.loc[indices]

Unnamed: 0,title,Director,Cast,Country,Genre,show_id,type,date_added,release_year,rating,duration,description
126537,Louis C.K. 2017,Louis C.K.,Louis C.K.,United States,Movies,s5542,Movie,"April 4, 2017",2017,NR,74 min,"Louis C.K. muses on religion, eternal love, gi..."
131603,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,Movies,s5795,Movie,"September 16, 2016",2010,NR,84 min,Emmy-winning comedy writer Louis C.K. brings h...
131737,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,Movies,s5814,Movie,"August 15, 2016",2015,NR,66 min,The comic puts his trademark hilarious/thought...
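The three values above are matched by listing them explicitly. A slightly more general check, sketched below on made-up data, is to look for any rating that has the shape of a duration (`<number> min`) while the duration itself is missing. The `toy` frame is hypothetical; the pattern and the `NR` fallback follow the reasoning in the bullets above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two of them carry a duration in the rating column.
toy = pd.DataFrame({
    "rating": ["PG-13", "74 min", "TV-MA", "66 min"],
    "duration": ["90 min", np.nan, "2 Seasons", np.nan],
})

# A rating that looks like '<digits> min' with no duration is treated as misplaced.
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False) & toy["duration"].isna()

toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]  # move the value over
toy.loc[misplaced, "rating"] = "NR"                            # mark as Not Rated

print(toy)
```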


- The remaining missing ratings are also replaced with `NR`.

In [24]:
df_new['rating'] = df_new['rating'].fillna('NR')  # assign back instead of inplace=True on a column selection

### Filling missing values of Country

In [25]:
df_new['Country'].isna().sum()

11929

- First, a missing Country is imputed with the most frequent (mode) Country among the other rows that share the same director.

In [26]:
for director in df_new[df_new['Country'].isna()]['Director'].unique():
    if director in df_new[~ df_new['Country'].isna()]['Director'].unique():
        mode_country_value = df_new.loc[df_new['Director'] == director]['Country'].mode().values[0]
        df_new.loc[df_new['Director'] == director, 'Country'] = df_new.loc[df_new['Director'] == director, 'Country'].fillna(mode_country_value)

- Next, any Country still missing is imputed with the mode Country among the rows that share the same cast member.

In [27]:
for cast in df_new[df_new['Country'].isna()]['Cast'].unique():
    if cast in df_new[~ df_new['Country'].isna()]['Cast'].unique():
        mode_country_value = df_new.loc[df_new['Cast'] == cast]['Country'].mode().values[0]
        df_new.loc[df_new['Cast'] == cast, 'Country'] = df_new.loc[df_new['Cast'] == cast, 'Country'].fillna(mode_country_value)
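The two loops above can also be expressed with `groupby` and `map`, which avoids re-filtering the frame once per director or cast member. The sketch below is an illustration on made-up data, not a drop-in replacement: `fill_with_group_mode`, its `skip_groups` parameter, and the `toy` frame are all hypothetical. It also makes one behavior explicit: because missing directors were earlier relabelled `Unknown Director`, the loops treat all of those rows as a single director, and `skip_groups` is one possible way to exclude such placeholders.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(frame, group_col, target_col, skip_groups=()):
    """Fill missing target_col values with the most frequent value of their group."""
    known = frame[frame[target_col].notna() & ~frame[group_col].isin(skip_groups)]
    # Most frequent target per group; ties go to the first value returned by mode().
    group_mode = known.groupby(group_col)[target_col].agg(lambda s: s.mode().iloc[0])
    fallback = frame[group_col].map(group_mode)  # NaN where the group has no known value
    return frame[target_col].fillna(fallback)

toy = pd.DataFrame({
    "Director": ["A", "A", "A", "B", "Unknown Director", "Unknown Director"],
    "Country": ["India", "India", np.nan, np.nan, "Japan", np.nan],
})

toy["Country"] = fill_with_group_mode(
    toy, "Director", "Country", skip_groups=["Unknown Director"]
)
print(toy)
```

Director `B` has no known country and the placeholder group is skipped, so both of those rows remain missing and would fall through to the next imputation step.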

- Any Country values that are still missing are filled with `Unknown Country`.

In [28]:
df_new['Country'] = df_new['Country'].fillna('Unknown Country')  # assign back instead of inplace=True on a column selection

In [29]:
df_new['Country'].isna().sum()

0

### Filling missing date_added value

- Missing date_added values are filled using the release year.
- For example, a title released in 2015 with no date_added receives the most frequent (mode) date_added among all rows released in 2015.

In [30]:
for year in df_new[df_new['date_added'].isna()]['release_year'].unique():
    if year in df_new[~ df_new['date_added'].isna()]['release_year'].unique():
        mode_date_value = df_new.loc[df_new['release_year'] == year]['date_added'].mode().values[0]
        df_new.loc[df_new['release_year'] == year, 'date_added'] = df_new.loc[df_new['release_year'] == year, 'date_added'].fillna(mode_date_value)
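The per-year mode fill can be sketched with `groupby(...).transform`, which computes the mode once per release year and broadcasts it back to every row of that year. The `toy` frame and the helper `first_mode` are hypothetical and only illustrate the idea; a year with no known date_added is left missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "release_year": [2015, 2015, 2015, 2015, 2018],
    "date_added": ["July 1, 2016", "July 1, 2016", "May 3, 2017", np.nan, np.nan],
})

def first_mode(values):
    """Return the most frequent non-missing value, or NaN if there is none."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# One mode per release year, repeated for each row of that year.
year_mode = toy.groupby("release_year")["date_added"].transform(first_mode)
toy["date_added"] = toy["date_added"].fillna(year_mode)

print(toy)
```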

In [31]:
df_new.isna().sum()

title           0
Director        0
Cast            0
Country         0
Genre           0
show_id         0
type            0
date_added      0
release_year    0
rating          0
duration        0
description     0
dtype: int64

## Miscellaneous Checks

In [32]:
df_new.head()

Unnamed: 0,title,Director,Cast,Country,Genre,show_id,type,date_added,release_year,rating,duration,description
0,Dick Johnson Is Dead,Kirsten Johnson,Unknown Cast,United States,Documentaries,s1,Movie,"September 25, 2021",2020,PG-13,90 min,"As her father nears the end of his life, filmm..."
1,Blood & Water,Unknown Director,Ama Qamata,South Africa,International TV Shows,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
2,Blood & Water,Unknown Director,Ama Qamata,South Africa,TV Dramas,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
3,Blood & Water,Unknown Director,Ama Qamata,South Africa,TV Mysteries,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
4,Blood & Water,Unknown Director,Khosi Ngema,South Africa,International TV Shows,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."


### Handling duration column

In [33]:
df_new['duration'].unique()

array(['90 min', '2 Seasons', '1 Season', '91 min', '125 min',
       '9 Seasons', '104 min', '127 min', '4 Seasons', '67 min', '94 min',
       '5 Seasons', '161 min', '61 min', '166 min', '147 min', '103 min',
       '97 min', '106 min', '111 min', '3 Seasons', '110 min', '105 min',
       '96 min', '124 min', '116 min', '98 min', '23 min', '115 min',
       '122 min', '99 min', '88 min', '100 min', '6 Seasons', '102 min',
       '93 min', '95 min', '85 min', '83 min', '113 min', '13 min',
       '182 min', '48 min', '145 min', '87 min', '92 min', '80 min',
       '117 min', '128 min', '119 min', '143 min', '114 min', '118 min',
       '108 min', '63 min', '121 min', '142 min', '154 min', '120 min',
       '82 min', '109 min', '101 min', '86 min', '229 min', '76 min',
       '89 min', '156 min', '112 min', '107 min', '129 min', '135 min',
       '136 min', '165 min', '150 min', '133 min', '70 min', '84 min',
       '140 min', '78 min', '7 Seasons', '64 min', '59 min', '139 min',
    

In [34]:
def getDurationCategory(duration: str) -> str:
    if 'min' not in duration:
        return duration
    
    length = int(duration.split()[0])

    if length <= 40:
        return '0 to 40 minutes | Short-Films'
    elif length <= 180:
        return '40 - 180 minutes | Feature-Films'
    else:
        return '180+ minutes | Long-Films' 

In [35]:
df_new['duration'] = df_new['duration'].apply(getDurationCategory)
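For comparison, the same bucketing can be sketched in vectorized form with `pd.cut`. This is an alternative under assumptions rather than the notebook's method: the sample `durations` are made up, and the bin edges are chosen so that 40 and 180 minutes fall into the lower bucket, mirroring the `<=` comparisons in `getDurationCategory`. Season counts are passed through unchanged.

```python
import pandas as pd

durations = pd.Series(["23 min", "40 min", "90 min", "180 min", "229 min", "2 Seasons"])

# Extract the minute count; values such as '2 Seasons' do not match and become NaN.
minutes = pd.to_numeric(durations.str.extract(r"^(\d+) min$")[0])

# Right-closed bins: (-inf, 40], (40, 180], (180, inf)
buckets = pd.cut(
    minutes,
    bins=[float("-inf"), 40, 180, float("inf")],
    labels=[
        "0 to 40 minutes | Short-Films",
        "40 - 180 minutes | Feature-Films",
        "180+ minutes | Long-Films",
    ],
)

# Keep the bucket label for movies and the original text for TV shows.
categories = buckets.astype(object).where(minutes.notna(), durations)
print(categories.tolist())
```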

In [36]:
df_new.head()

Unnamed: 0,title,Director,Cast,Country,Genre,show_id,type,date_added,release_year,rating,duration,description
0,Dick Johnson Is Dead,Kirsten Johnson,Unknown Cast,United States,Documentaries,s1,Movie,"September 25, 2021",2020,PG-13,40 - 180 minutes | Feature-Films,"As her father nears the end of his life, filmm..."
1,Blood & Water,Unknown Director,Ama Qamata,South Africa,International TV Shows,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
2,Blood & Water,Unknown Director,Ama Qamata,South Africa,TV Dramas,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
3,Blood & Water,Unknown Director,Ama Qamata,South Africa,TV Mysteries,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
4,Blood & Water,Unknown Director,Khosi Ngema,South Africa,International TV Shows,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."


In [37]:
df_new.columns

Index(['title', 'Director', 'Cast', 'Country', 'Genre', 'show_id', 'type',
       'date_added', 'release_year', 'rating', 'duration', 'description'],
      dtype='object')

### Saving Clean Data

- Saving the preprocessed data to a CSV file.
- This file is loaded again later and used for the analysis.

In [38]:
df_new.to_csv('netflix_clean.csv', index=False)

# EDA

In [39]:
df = pd.read_csv('netflix_clean.csv')
df.head()

Unnamed: 0,title,Director,Cast,Country,Genre,show_id,type,date_added,release_year,rating,duration,description
0,Dick Johnson Is Dead,Kirsten Johnson,Unknown Cast,United States,Documentaries,s1,Movie,"September 25, 2021",2020,PG-13,40 - 180 minutes | Feature-Films,"As her father nears the end of his life, filmm..."
1,Blood & Water,Unknown Director,Ama Qamata,South Africa,International TV Shows,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
2,Blood & Water,Unknown Director,Ama Qamata,South Africa,TV Dramas,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
3,Blood & Water,Unknown Director,Ama Qamata,South Africa,TV Mysteries,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
4,Blood & Water,Unknown Director,Khosi Ngema,South Africa,International TV Shows,s2,TV Show,"September 24, 2021",2021,TV-MA,2 Seasons,"After crossing paths at a party, a Cape Town t..."
