![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Noelia's ML_OPS Project!  👻👻

###  Notebook: platform file data cleaning / data transformation



![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
### Import zone

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
### Utils functions

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [2]:
def load_dataframe_and_report(path, name):
    """
        Returns the dataframe and some useful metrics.
        Inputs: 
            - path: relative path to file (.csv)
            - name: name of the platform being analyzed

        Outputs:
            - df: DataFrame
            - n_rows: number of rows
            - n_cols: number of columns
            - missing_report: count of null values per column
            - columns: list of column names
    """
    df = pd.read_csv(path)
    n_rows, n_cols = df.shape      
    missing_report = df.isna().sum()
    columns = df.columns.tolist()
    sep = '---------------------------------------------------------------------------'
    print(sep)
    print(f"Dataset being analized: {name}")
    print(sep)
    print(f"N rows ({name})--> {n_rows}")
    print(f"N cols ({name})-->  {n_cols}")
    print(sep)
    print('Columns name')
    print(columns)
    print(sep)
    print("Is there missing data?:\n")
    print(missing_report)
    print(sep)
    display(df.head())


    return df, n_rows, n_cols, missing_report, columns

In [3]:
import calendar

month_dict = {calendar.month_name[num]: num for num in range(1,13)}


def date_parser(my_date):
    """
        parse date from (Month_name day, year ) to (yyyy/mm/dd)
        Input: 
            - my_date: string in the form (Month_name day, year)
        Output:
            - string in the form (yyyy/mm/dd)
    """
    
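    # Split "Month_name day, year" into "Month_name day" and "year"
    month_day, my_year = [part.strip() for part in my_date.split(",")]
    my_month, my_day = month_day.split()
    # Zero-pad both month and day so the result is always yyyy/mm/dd
    my_parsed_date = f"{my_year}/{month_dict[my_month]:02d}/{int(my_day):02d}"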
    return my_parsed_date



In [4]:
# preview of month dictionary
month_dict

{'January': 1,
 'February': 2,
 'March': 3,
 'April': 4,
 'May': 5,
 'June': 6,
 'July': 7,
 'August': 8,
 'September': 9,
 'October': 10,
 'November': 11,
 'December': 12}


![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### Data Exploration

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [5]:
# platform dictionary --> keys: name of platform, values: number from 0-3
p_dict = {
    "Amazon":0, 
    "Disney":1,
    "Hulu":2, 
    "Netflix":3
}

# Inverse platform dictionary --> keys: number from 0-3, values: name of platform
inverse_p_dict = {v:k for k,v in p_dict.items()}

# List with platform names, list with file names, empty list to save all the partial dataframes
plataform = ["Amazon","Disney","Hulu", "Netflix"]
file_name = ["amazon_prime_titles.csv", "disney_plus_titles.csv", "hulu_titles.csv", "netflix_titles.csv"]
dataframes = []

---
#### Amazon dataset

After invoking the function `load_dataframe_and_report`, the following information is obtained: 
* number of rows: 9668
* number of columns: 12
* missing data in columns: 
    * director: 2082
    * cast: 1233
    * country: 8996
    * date_added: 9513
    * rating: 337

`country` is missing in 8996 of the 9668 rows (about 93%) and `date_added` in 9513 (about 98%). With so little data, these two columns are unlikely to be useful for training our model.

**To do:** explore whether we must drop the `country` and `date_added` columns
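
The share of missing values per column can be read off the same `isna()` mask that `load_dataframe_and_report` uses. A minimal sketch, assuming a small made-up dataframe (`toy_df`) in place of the real file:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a platform dataframe (hypothetical values)
toy_df = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "country": [np.nan, np.nan, np.nan, "canada"],
    "date_added": [np.nan, "March 30, 2021", np.nan, np.nan],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction
missing_pct = (toy_df.isna().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```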

---

In [6]:
# path to amazon file
rel_path = '../data/'+file_name[0]

# call my util function 
df_amazon, n_rows_amazon, n_cols_amazon, missing_report_amazon, columns_amazon = load_dataframe_and_report(rel_path, plataform[0])

# append partial dataframe to list container
dataframes.append(df_amazon)


---------------------------------------------------------------------------
Dataset being analized: Amazon
---------------------------------------------------------------------------
N rows (Amazon)--> 9668
N cols (Amazon)-->  12
---------------------------------------------------------------------------
Columns name
['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
---------------------------------------------------------------------------
Is there missing data?:

show_id            0
type               0
title              0
director        2082
cast            1233
country         8996
date_added      9513
release_year       0
rating           337
duration           0
listed_in          0
description        0
dtype: int64
---------------------------------------------------------------------------


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...


---
#### Disney Plus dataset

After invoking the function `load_dataframe_and_report`, the following information is obtained: 
* number of rows: 1450
* number of columns: 12
* missing data in columns: 
    * director: 473
    * cast: 190
    * country: 219
    * date_added: 3
    * rating: 3

In this case, the amount of missing data in `country` (219 of 1450 rows, about 15%) and `date_added` (3 rows) is not very high. We should explore the other datasets before making a final decision about these two columns.

**To do:** explore what happens to the `country` and `date_added` columns in the Hulu and Netflix datasets before making a decision

---

In [7]:
rel_path = '../data/'+file_name[1]
df_disney, n_rows_disney, n_cols_disney, missing_report_disney, columns_disney = load_dataframe_and_report(rel_path,plataform[1])
dataframes.append(df_disney)


---------------------------------------------------------------------------
Dataset being analized: Disney
---------------------------------------------------------------------------
N rows (Disney)--> 1450
N cols (Disney)-->  12
---------------------------------------------------------------------------
Columns name
['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
---------------------------------------------------------------------------
Is there missing data?:

show_id           0
type              0
title             0
director        473
cast            190
country         219
date_added        3
release_year      0
rating            3
duration          0
listed_in         0
description       0
dtype: int64
---------------------------------------------------------------------------


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",,"November 26, 2021",2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",,"November 26, 2021",1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,"November 26, 2021",2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",,"November 26, 2021",2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!"
4,s5,TV Show,The Beatles: Get Back,,"John Lennon, Paul McCartney, George Harrison, ...",,"November 25, 2021",2021,,1 Season,"Docuseries, Historical, Music",A three-part documentary from Peter Jackson ca...


---
#### Hulu dataset

After invoking the function `load_dataframe_and_report`, the following information is obtained: 
* number of rows: 3073
* number of columns: 12
* missing data in columns: 
    * director: 3070
    * cast: 3073
    * country: 1453
    * date_added: 28
    * rating: 520
    * duration: 479
    * description: 4

Be careful with this dataset and explore it thoroughly: it has too many NaNs.

`director` and `cast` are missing in almost every row (3070 and 3073 of 3073). Can we get this info from the other datasets?<br>
The `country` column is missing in about 47% of the rows (1453 of 3073).

**To do:** explore what happens with this dataset


---

In [8]:
rel_path = '../data/'+file_name[2]
df_hulu, n_rows_hulu, n_cols_hulu, missing_report_hulu, columns_hulu = load_dataframe_and_report(rel_path,plataform[2])
dataframes.append(df_hulu)

---------------------------------------------------------------------------
Dataset being analized: Hulu
---------------------------------------------------------------------------
N rows (Hulu)--> 3073
N cols (Hulu)-->  12
---------------------------------------------------------------------------
Columns name
['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
---------------------------------------------------------------------------
Is there missing data?:

show_id            0
type               0
title              0
director        3070
cast            3073
country         1453
date_added        28
release_year       0
rating           520
duration         479
listed_in          0
description        4
dtype: int64
---------------------------------------------------------------------------


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...
1,s2,Movie,Silent Night,,,,"October 23, 2021",2020,,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r..."
2,s3,Movie,The Marksman,,,,"October 23, 2021",2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...
3,s4,Movie,Gaia,,,,"October 22, 2021",2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...
4,s5,Movie,Settlers,,,,"October 22, 2021",2021,,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...


---
#### Netflix dataset

After invoking the function `load_dataframe_and_report`, the following information is obtained: 
* number of rows: 8807
* number of columns: 12
* missing data in columns: 
    * director: 2634
    * cast: 825
    * country: 831
    * date_added: 10
    * rating: 4
    * duration: 3

We have missing data in `director`, `cast` and `country`. Can we get this info from another dataset?

**To do:** check whether the `director`, `cast` and `country` info can be recovered from the other datasets

---

In [9]:
rel_path = '../data/'+file_name[3]
df_netflix, n_rows_netflix, n_cols_netflix, missing_report_netflix, columns_netflix = load_dataframe_and_report(rel_path,plataform[3])
dataframes.append(df_netflix)


---------------------------------------------------------------------------
Dataset being analized: Netflix
---------------------------------------------------------------------------
N rows (Netflix)--> 8807
N cols (Netflix)-->  12
---------------------------------------------------------------------------
Columns name
['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
---------------------------------------------------------------------------
Is there missing data?:

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
---------------------------------------------------------------------------


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### Add id column to each dataframe

```python
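# id = lowercase first letter of the platform name + show_id
platform_name, show_id = "Netflix", "s1"
new_id = platform_name[0].lower() + show_id

# e.g. --> Platform: Netflix, show_id: s1 --> id = ns1
print(new_id)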
```

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [10]:
for i in range(len(dataframes)):
    first_letter = plataform[i][0].lower()
    dataframes[i]['id'] = dataframes[i]['show_id'].map(lambda x: first_letter+x)
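
The same prefixing can also be written without `map`, since adding a string to a string column concatenates element by element. A minimal sketch, assuming a toy dataframe (`toy_df`) and a hard-coded platform name:

```python
import pandas as pd

# Toy stand-in for one platform dataframe (hypothetical values)
toy_df = pd.DataFrame({"show_id": ["s1", "s2", "s3"]})
platform_name = "Netflix"

# Prefix every show_id with the lowercase first letter of the platform
toy_df["id"] = platform_name[0].lower() + toy_df["show_id"]
print(toy_df["id"].tolist())
```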


![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
### Null values in the rating column must be replaced with "G" --> General audience

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [11]:
for df in dataframes:
    # assign the result back instead of relying on inplace=True on a column selection
    df['rating'] = df['rating'].fillna('G')

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
### Dates must be in the format yyyy/mm/dd

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [12]:

for df in dataframes:
    df["date_added"] = df["date_added"].map(lambda x: date_parser(x) if isinstance(x,str) else x)
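
An alternative to the hand-written parser is to let pandas parse the whole column at once. This is only a sketch on a toy series (`dates`); it assumes English month names (the `%B` directive depends on the active locale) and strips stray whitespace before parsing:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a date_added column (hypothetical values)
dates = pd.Series(["March 30, 2021", " September 5, 2021", np.nan])

# Parse "Month_name day, year"; anything unparseable becomes NaT
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

# Render as yyyy/mm/dd; missing dates stay missing
formatted = parsed.dt.strftime("%Y/%m/%d")
print(formatted.tolist())
```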


![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
### Text fields must be lowercase



![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [13]:
for df in dataframes:
    # text columns of this dataframe (object or category dtype)
    cat = df.select_dtypes(include=['object','category']).columns.tolist()
    for c in cat:
        df[c] = df[c].map(lambda x: x.lower() if isinstance(x, str) else x)


![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
### Explore rating column for all dataframes

#### In the Hulu dataset, values from the duration column have been mistakenly included in the rating column

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [14]:
ratings_unique = {}
for idx, df in enumerate(dataframes):
    ratings_unique[inverse_p_dict[idx]] = dataframes[idx]['rating'].unique().tolist()



---
### Amazon unique values in rating column

---

In [15]:
i = 0
print(inverse_p_dict[i]+": ")
print(ratings_unique[inverse_p_dict[i]])

Amazon: 
['g', '13+', 'all', '18+', 'r', 'tv-y', 'tv-y7', 'nr', '16+', 'tv-pg', '7+', 'tv-14', 'tv-nr', 'tv-g', 'pg-13', 'tv-ma', 'pg', 'nc-17', 'unrated', '16', 'ages_16_', 'ages_18_', 'all_ages', 'not_rate']


---
### Disney unique values in rating column

---

In [16]:
i = 1
print(inverse_p_dict[i]+": ")
print(ratings_unique[inverse_p_dict[i]])

Disney: 
['tv-g', 'pg', 'tv-pg', 'g', 'pg-13', 'tv-14', 'tv-y7', 'tv-y', 'tv-y7-fv']


---
### Hulu unique values in rating column

#### *We have a problem here! Values from the duration column have been mistakenly included in the rating column*

---

In [17]:
i = 2
print(inverse_p_dict[i]+": ")
print(ratings_unique[inverse_p_dict[i]])

### Houston, we have a problem...

Hulu: 
['tv-ma', 'g', 'pg-13', 'r', 'tv-14', 'pg', 'tv-pg', 'not rated', 'tv-g', '2 seasons', 'tv-y', '93 min', '4 seasons', 'tv-y7', '136 min', '91 min', '85 min', '98 min', '89 min', '94 min', '86 min', '3 seasons', '121 min', '88 min', '101 min', '1 season', '83 min', '100 min', '95 min', '92 min', '96 min', '109 min', '99 min', '75 min', '87 min', '67 min', '104 min', '107 min', '84 min', '103 min', '105 min', '119 min', '114 min', '82 min', '90 min', '130 min', '110 min', '80 min', '6 seasons', '97 min', '111 min', '81 min', '49 min', '45 min', '41 min', '73 min', '40 min', '36 min', '39 min', '34 min', '47 min', '65 min', '37 min', '78 min', '102 min', '129 min', '115 min', '112 min', 'nr', '61 min', '106 min', '76 min', '77 min', '79 min', '157 min', '28 min', '64 min', '7 min', '5 min', '6 min', '127 min', '142 min', '108 min', '57 min', '118 min', '116 min', '12 seasons', '71 min']


---
### Netflix unique values in rating column

---

In [18]:
i = 3
print(inverse_p_dict[i]+": ")
print(ratings_unique[inverse_p_dict[i]])

Netflix: 
['pg-13', 'tv-ma', 'pg', 'tv-14', 'tv-pg', 'tv-y', 'tv-y7', 'r', 'tv-g', 'g', 'nc-17', '74 min', '84 min', '66 min', 'nr', 'tv-y7-fv', 'ur']


---
### We must clean the Hulu rating column

#### It can be seen that when there is a wrong value in the rating column, there is a NaN in the duration column

---

In [19]:

i = 2  # Hulu index
# rows whose rating looks like a duration ("93 min", "1 season", "2 seasons")
mask = dataframes[i]['rating'].str.contains('min|season', na=False)
dataframes[i][mask][['rating','duration']]

Unnamed: 0,rating,duration
50,2 seasons,
108,93 min,
263,4 seasons,
397,136 min,
471,2 seasons,
...,...,...
2951,12 seasons,
2955,3 seasons,
2958,93 min,
2959,6 seasons,


---
### Moving the wrong data in the rating column to the duration column

---

In [20]:
dataframes[i]['duration'] = np.where(mask, dataframes[i]['rating'], dataframes[i]['duration'])
# lowercase 'g' keeps the rating column consistent with the lowercased text fields
dataframes[i]['rating'] = np.where(mask, 'g', dataframes[i]['rating'])

---
### Check if it worked

#### Perfect! The rating and duration columns are no longer mixed

---

In [21]:
dataframes[i]['rating'].unique()

array(['tv-ma', 'g', 'pg-13', 'r', 'tv-14', 'pg', 'tv-pg', 'not rated',
       'tv-g', 'tv-y', 'tv-y7', 'nr'], dtype=object)
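
The Netflix list above also contains duration-like ratings (`'74 min'`, `'84 min'`, `'66 min'`), so the same move could be wrapped in a helper and applied to every dataframe. A minimal sketch, assuming a hypothetical helper name (`move_duration_out_of_rating`) and a toy dataframe rather than the real files:

```python
import numpy as np
import pandas as pd

def move_duration_out_of_rating(df, default_rating="g"):
    """Hypothetical helper: move duration-like values found in `rating`
    to `duration` and reset `rating` to a default value."""
    df = df.copy()
    # Matches whole strings such as "93 min", "1 season", "12 seasons"
    mask = df["rating"].str.fullmatch(r"\d+\s+(?:min|seasons?)", na=False)
    df.loc[mask, "duration"] = df.loc[mask, "rating"]
    df.loc[mask, "rating"] = default_rating
    return df

# Toy stand-in for a platform dataframe (hypothetical values)
toy_df = pd.DataFrame({
    "rating": ["tv-ma", "93 min", "2 seasons", "pg"],
    "duration": ["1 season", np.nan, np.nan, "90 min"],
})
fixed = move_duration_out_of_rating(toy_df)
print(fixed)
```

Matching the whole string, rather than searching for a substring, avoids touching a genuine rating that merely happens to contain the letters "min".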

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### Duration field must be split into two different columns.

First, replace all NaN values with `0 min`.

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


In [22]:
# replace nan with 0 min
for df in dataframes:    
    df['duration'] = df['duration'].fillna('0 min')



In [23]:
# Split duration into two columns
for df in dataframes:
    df[['duration_int', 'duration_type']] = df.duration.str.split(expand=True)
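
A regular expression gives a slightly stricter split than whitespace splitting, because it only accepts values shaped like "number unit". The sketch below works on a toy series; collapsing `seasons` into `season` is an assumption added for illustration, not something the notebook does:

```python
import pandas as pd

# Toy stand-in for a cleaned duration column (hypothetical values)
duration = pd.Series(["113 min", "2 seasons", "1 season", "0 min"])

# Capture the number and the unit in named groups
parts = duration.str.extract(r"^(?P<duration_int>\d+)\s+(?P<duration_type>\w+)$")
parts["duration_int"] = parts["duration_int"].astype(int)
# Collapse "seasons" into "season" so each unit has a single spelling
parts["duration_type"] = parts["duration_type"].replace({"seasons": "season"})
print(parts)
```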

In [24]:
# Cast duration_int to int (was object, i.e. string)
for df in dataframes:
    # assign back to the column so the cast persists in the dataframes held by the list
    df['duration_int'] = df['duration_int'].astype(int)
    print(df.info())
    print('----------------------------------------------------------------------')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9668 entries, 0 to 9667
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   show_id        9668 non-null   object
 1   type           9668 non-null   object
 2   title          9668 non-null   object
 3   director       7586 non-null   object
 4   cast           8435 non-null   object
 5   country        672 non-null    object
 6   date_added     155 non-null    object
 7   release_year   9668 non-null   int64 
 8   rating         9668 non-null   object
 9   duration       9668 non-null   object
 10  listed_in      9668 non-null   object
 11  description    9668 non-null   object
 12  id             9668 non-null   object
 13  duration_int   9668 non-null   int32 
 14  duration_type  9668 non-null   object
dtypes: int32(1), int64(1), object(13)
memory usage: 1.1+ MB
None
----------------------------------------------------------------------
<class 'pandas.core.frame

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
### Dataframes last inspection

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [25]:
sep = "----------------------------------------------------------------------------"
print(sep)
print(sep)
print("                       Looking for nan in the dataset")
print(sep)
cols_with_missing = set()
for idx, df in enumerate(dataframes):   
    print(sep)
    print(f"Platform: {inverse_p_dict[idx]}\n")    
    print(df.isna().sum())
    
print(sep)


----------------------------------------------------------------------------
----------------------------------------------------------------------------
                       Looking for nan in the dataset
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Platform: Amazon

show_id             0
type                0
title               0
director         2082
cast             1233
country          8996
date_added       9513
release_year        0
rating              0
duration            0
listed_in           0
description         0
id                  0
duration_int        0
duration_type       0
dtype: int64
----------------------------------------------------------------------------
Platform: Disney

show_id            0
type               0
title              0
director         473
cast             190
country          219
date_added         3
release_year       0
rating        

In [26]:
sep = "----------------------------------------------------------------------------"
print(sep)
print(sep)
print("                       Looking for nan in the dataset")
print(sep)
cols_with_missing = set()
for idx, df in enumerate(dataframes):   
    print(sep)
    print(f"Platform: {inverse_p_dict[idx]}")
    # print(df.isna().sum())
    col_msng = df.isna().sum()
    mask = col_msng!=0
    col_msng = col_msng[mask].index.tolist()
    print(col_msng)
    cols_with_missing.update(col_msng)
    print("cols with missing data --> ", cols_with_missing)
    print(sep)

cols_with_missing = list(cols_with_missing)

----------------------------------------------------------------------------
----------------------------------------------------------------------------
                       Looking for nan in the dataset
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Platform: Amazon
['director', 'cast', 'country', 'date_added']
cols with missing data -->  {'country', 'cast', 'director', 'date_added'}
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Platform: Disney
['director', 'cast', 'country', 'date_added']
cols with missing data -->  {'country', 'cast', 'director', 'date_added'}
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Platform: Hulu
['director', 'cast', 'country', 'date_add

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### All missing data will be replaced with "unknown " + column name (cast, country, ...)

> Many of the values missing from the `director`, `cast` and `country` columns of one dataset (for example, Netflix) could probably be found in another dataset (e.g. Amazon).<br>
> **I leave it as a future task.**

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
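
One possible shape for that future task is a title-based lookup. This is a rough sketch on made-up data, and it assumes that an identical title means the same work, which will not always hold; it would have to run before the "unknown" fill:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two platform dataframes (hypothetical values)
netflix = pd.DataFrame({
    "title": ["movie a", "movie b", "movie c"],
    "director": [np.nan, "jane doe", np.nan],
})
amazon = pd.DataFrame({
    "title": ["movie a", "movie d"],
    "director": ["john roe", "mary major"],
})

# Lookup table: title -> director, built from rows where director is known
lookup = (
    amazon.dropna(subset=["director"])
    .drop_duplicates(subset="title")
    .set_index("title")["director"]
)

# Fill only the gaps; titles absent from the lookup stay NaN
netflix["director"] = netflix["director"].fillna(netflix["title"].map(lookup))
print(netflix)
```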

In [27]:
for df in dataframes:
    for c in cols_with_missing:
        df[c] = df[c].fillna("unknown "+c)

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
### Save each individual cleaned dataframe to a separate CSV file, and save the concatenated version of all dataframes as a single huge CSV file.

The path to clean data is
`../data/clean/the_file.csv` 


![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

---
### Individual

---

In [28]:
import os

# Individual
root_path = '../data/clean/'
# Create the output directory if it does not exist yet
os.makedirs(root_path, exist_ok=True)
for idx, df in enumerate(dataframes):
    df.to_csv(root_path+f'{inverse_p_dict[idx]}_clean.csv',index=False)

---
### Huge

---

In [29]:
# Create a huge dataframe
# Concatenate all dataframes and rebuild the index from 0
mega_df = pd.concat(dataframes, ignore_index=True)
mega_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,id,duration_int,duration_type
0,s1,movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",canada,2021/03/30,2014,g,113 min,"comedy, drama",a small fishing village must procure a local d...,as1,113,min
1,s2,movie,take care good night,girish joshi,"mahesh manjrekar, abhay mahajan, sachin khedekar",india,2021/03/30,2018,13+,110 min,"drama, international",a metro family decides to fight a cyber crimin...,as2,110,min
2,s3,movie,secrets of deception,josh webber,"tom sizemore, lorenzo lamas, robert lasardo, r...",united states,2021/03/30,2017,g,74 min,"action, drama, suspense",after a man discovers his wife is cheating on ...,as3,74,min
3,s4,movie,pink: staying true,sonia anderson,"interviews with: pink, adele, beyoncé, britney...",united states,2021/03/30,2014,g,69 min,documentary,"pink breaks the mold once again, bringing her ...",as4,69,min
4,s5,movie,monster maker,giles foster,"harry dean stanton, kieran o'brien, george cos...",united kingdom,2021/03/30,1989,g,45 min,"drama, fantasy",teenage matt banting wants to work with a famo...,as5,45,min
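
After concatenation, the platform of a row can only be recovered from the first letter of `id`. If an explicit column is wanted, each dataframe could be tagged before stacking. A minimal sketch on toy frames; the `platform` column is an assumption and is not part of the notebook's output:

```python
import pandas as pd

# Toy stand-ins for the per-platform dataframes (hypothetical values)
names = ["Amazon", "Disney"]
frames = [
    pd.DataFrame({"id": ["as1", "as2"], "title": ["a", "b"]}),
    pd.DataFrame({"id": ["ds1"], "title": ["c"]}),
]

# Tag each frame with its platform, then stack them with a fresh index
tagged = [df.assign(platform=name.lower()) for df, name in zip(frames, names)]
combined = pd.concat(tagged, ignore_index=True)
print(combined)
```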


In [30]:
# Drop columns date_added, duration
mega_df.drop(columns=['date_added','duration'], inplace=True)


In [31]:
# reorder columns
mega_df = mega_df[['id','show_id','type','title','director','cast','country','release_year','rating','listed_in','description','duration_int','duration_type']]
mega_df.head()


Unnamed: 0,id,show_id,type,title,director,cast,country,release_year,rating,listed_in,description,duration_int,duration_type
0,as1,s1,movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",canada,2014,g,"comedy, drama",a small fishing village must procure a local d...,113,min
1,as2,s2,movie,take care good night,girish joshi,"mahesh manjrekar, abhay mahajan, sachin khedekar",india,2018,13+,"drama, international",a metro family decides to fight a cyber crimin...,110,min
2,as3,s3,movie,secrets of deception,josh webber,"tom sizemore, lorenzo lamas, robert lasardo, r...",united states,2017,g,"action, drama, suspense",after a man discovers his wife is cheating on ...,74,min
3,as4,s4,movie,pink: staying true,sonia anderson,"interviews with: pink, adele, beyoncé, britney...",united states,2014,g,documentary,"pink breaks the mold once again, bringing her ...",69,min
4,as5,s5,movie,monster maker,giles foster,"harry dean stanton, kieran o'brien, george cos...",united kingdom,1989,g,"drama, fantasy",teenage matt banting wants to work with a famo...,45,min


In [32]:
# Save huge df
mega_df.to_csv(root_path+"all_together_clean.csv", index=False)

![divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)