## Netflix Hackathon Preprocessing

In [None]:
# import zipfile
# with zipfile.ZipFile('archive (2).zip', 'r') as zip_ref:
#     zip_ref.extractall()

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('netflix_titles.csv', index_col='show_id')

In [None]:
df[df.title.str.startswith('The Lor')]

## CS directed Cleaning and Preprocessing Steps:

1. Convert Data Types:\
· Ensure that the 'date_added' column is in datetime format.\
· Ensure that the 'release_year' column is in the correct numeric format.

2. Create Additional Features:\
· Extract month and year from the 'date_added' column for time-based analysis.

3. Handle Missing Values:\
· For numeric columns, fill in missing values with the mean or median.\
· For the categorical columns 'rating', 'duration' and 'country', fill missing values with the mode or a placeholder such as "Unknown" or "Not Rated".

4. Clean Categorical Data:\
· Standardize capitalization in categorical columns 'type' and 'rating' for consistency.

5. Remove Duplicates:\
· Check for and remove any duplicate entries.

6. Handle Outliers (Optional):\
· Investigate and handle any outliers in colum

#### 1&2. dtypes and feature creation

**date_added to datetime**\
**release_year to float64**\
**month_added, year_added extracted from date_added**
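
A minimal sketch of the same conversion on toy rows, in case some `date_added` strings turn out not to parse. The toy values and the `"%B %d, %Y"` format are assumptions for illustration, not checked against the real file. `errors="coerce"` yields `NaT` instead of raising, and the nullable `Int64` dtype keeps the extracted month and year as integers even when dates are missing.

```python
import pandas as pd

# Toy rows standing in for netflix_titles.csv (illustrative values only).
toy = pd.DataFrame(
    {
        "date_added": [" September 25, 2021", "January 1, 2020 ", None, "not a date"],
        "release_year": [2020, 2019, 2013, 1998],
    }
)

# Strip stray whitespace, then parse; unparseable strings become NaT rather than raising.
toy["date_added"] = pd.to_datetime(
    toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
toy["release_year"] = toy["release_year"].astype("float64")

# .dt.month / .dt.year come back as floats when NaT is present;
# casting to nullable Int64 keeps them integral with <NA> for the gaps.
toy["month_added"] = toy["date_added"].dt.month.astype("Int64")
toy["year_added"] = toy["date_added"].dt.year.astype("Int64")

print(toy.dtypes)
```

With this variant, rows that failed to parse show up in `toy["date_added"].isna()` and can be inspected before deciding whether to look them up or drop them.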

In [None]:
#converting dtypes and extracting features

df['date_added'] = pd.to_datetime(df['date_added'].str.strip())
df['release_year'] = df['release_year'].astype('float64') 
df['month_added'] = df['date_added'].dt.month
df['year_added'] = df['date_added'].dt.year

#### 3. Dealing with nulls

**Imputed *'Unknown'* into 'director', 'cast', 'country'**\
**Moved duration entries out of the rating column (4)**\
**Googled the missing ratings and filled them in**

**Maybe ignore instead???**\
**Removed ten entries with missing date_added data**

##### still to do: rating and date/month/year added

In [None]:
#Prepping cols in case we wanted a different imputation strategy
catcols = [col for col in df.columns if df[col].dtype == 'object']
nancols = [col for col in df.columns if df[col].isna().sum() >0]
numnancols = [col for col in nancols if col not in catcols]
catnancols = [col for col in nancols if col in catcols]
print(catnancols, numnancols)

***Strategy***

##### country director cast 
    - fill Unknown
##### rating
    - look up actual values
##### duration 
    - These values were sitting in the rating column, see 4.
##### Not sure how to treat the date, month and year added cols 
    - look up / remove
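
For comparison, here is a rough sketch of the other imputation options named in the brief (mode for a categorical, median for a numeric), run on a toy frame with made-up values. It is not what this notebook does for `rating`, which is looked up by hand instead.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in one free-text, one low-cardinality and one numeric column.
toy = pd.DataFrame(
    {
        "country": ["India", None, "Japan", None],
        "rating": ["TV-MA", "TV-MA", None, "PG"],
        "release_year": [2019.0, np.nan, 2021.0, 2020.0],
    }
)

# Placeholder where no single value is a sensible default.
toy["country"] = toy["country"].fillna("Unknown")

# Mode for a low-cardinality categorical; mode() returns a Series, so take its first entry.
toy["rating"] = toy["rating"].fillna(toy["rating"].mode().iloc[0])

# Median for numerics: less sensitive to extreme values than the mean.
toy["release_year"] = toy["release_year"].fillna(toy["release_year"].median())

print(toy)
```

Mode imputation is quick but silently assigns the most common rating to titles it may not fit, which is one argument for the manual lookup used below.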

In [None]:
#filling cats with 'Unknown'

for col in ['director','cast','country']:
    df[col] = df[col].fillna('Unknown')

# df

In [None]:
df.isna().sum()

##### date, month and year added 
    to fix, remove/google
##### rating 
    look up
##### duration
    see below

##### There are a few duration values in rating:

In [None]:
# rows whose 'rating' actually holds a runtime
display(df.loc[df['rating'].isin(['74 min', '84 min', '66 min']), 'rating'])


In [None]:
weird_is = df.loc[['s5542','s5795','s5814'],:].index

for i in weird_is:
    df.loc[i,'duration'] = df.loc[i,'rating']
    df.loc[i,'rating'] = np.nan

df.loc[weird_is]
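
The same fix can be sketched without hard-coded index labels by matching the pattern of a runtime string. The toy frame and its index labels are hypothetical. The regex `\d+ min` is an assumption based on the three values seen above.

```python
import numpy as np
import pandas as pd

# Toy frame: two ratings are really runtimes, and their durations are empty.
toy = pd.DataFrame(
    {
        "rating": ["TV-MA", "74 min", "PG", "84 min", None],
        "duration": ["2 Seasons", None, "90 min", None, "100 min"],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Ratings that look like "74 min"; na=False treats missing ratings as non-matches.
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False)

# Only move the value where 'duration' is actually empty, then blank the rating.
move = misplaced & toy["duration"].isna()
toy.loc[move, "duration"] = toy.loc[move, "rating"]
toy.loc[move, "rating"] = np.nan

print(toy)
```

Guarding on an empty `duration` avoids overwriting a valid runtime if a row ever has a runtime-like rating and a filled duration at the same time.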

In [None]:
#double checking!!
# df['rating'].dropna()[df['rating'].dropna().str.endswith('min')]
# df['rating'].dropna()[df['rating'].dropna().str.endswith('ons')]

In [None]:
df.isna().sum()

##### Looking up rating values:

In [None]:
df.rating.value_counts()

In [None]:
missing_rating_i = df[df.rating.isna()].index

**Louis C.K. 2017** TV-MA\
[imdb](https://www.imdb.com/title/tt6736782/)\
**Louis C.K. Hilarious** Not Rated - NR\
[imdb](https://www.imdb.com/title/tt1421373/?ref_=fn_al_tt_1)\
**Louis C.K.: Live at the Comedy Store** TV-MA\
[netflix](https://www.netflix.com/gb/title/80114111)\
**13TH: A Conversation with Oprah Winfrey & Ava** TV-PG\
[netflix](https://www.netflix.com/gb/title/81460481)\
**Gargantia on the Verdurous Planet** TV-14\
[appletv](https://tv.apple.com/us/show/gargantia-on-the-verdurous-planet/umc.cmc.73qfq78p3omafbdadkgwlh7gm)\
**Little Lunch** U (recorded as G)\
[netflix](https://www.netflix.com/gb/title/80078037)\
**My Honor Was Loyalty** PG-13\
[imdb](https://www.imdb.com/title/tt4544696/)
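
Assigning the looked-up values as a positional list relies on the rows coming back in exactly the order listed above. A hedged alternative is to key the lookups by title, sketched here on a toy frame. The index labels and `Some Other Title` are hypothetical, and on the real data the dictionary keys would have to match the dataset's title strings exactly.

```python
import numpy as np
import pandas as pd

# Toy frame: two titles from the notes above plus one made-up row that already has a rating.
toy = pd.DataFrame(
    {
        "title": ["Little Lunch", "My Honor Was Loyalty", "Some Other Title"],
        "rating": [np.nan, np.nan, "TV-14"],
    },
    index=["a1", "a2", "a3"],
)

# Manual lookups keyed by title, so row order no longer matters.
looked_up = {
    "Little Lunch": "G",
    "My Honor Was Loyalty": "PG-13",
}

# Fill only the missing ratings; titles absent from the dict are left untouched.
toy["rating"] = toy["rating"].fillna(toy["title"].map(looked_up))

print(toy)
```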





In [None]:
df.loc[missing_rating_i,'rating'] = ['TV-MA','NR','TV-MA','TV-PG','TV-14','G','PG-13']

##### Only the rows with missing date_added are left -> keep or delete??
##### Let's delete.

In [None]:
df.isna().sum()

In [None]:
df.to_csv('netflix_clean.csv')

In [None]:
# missing_datetime_i = df[df.date_added.isna()].index
# final = df.drop(missing_datetime_i)

In [None]:
# final.isna().sum()

In [None]:
# final.to_csv('netflix_clean.csv' )

#### 4. Standardising Format

**No Action Taken**

##### Double check
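
One way to run that double check, sketched on made-up values: if the number of distinct values drops after stripping whitespace and normalising case, the column has inconsistent spellings. The canonical labels `Movie` and `TV Show` are assumed here.

```python
import pandas as pd

# Toy columns with deliberately inconsistent case and whitespace.
toy = pd.DataFrame(
    {
        "type": ["Movie", "movie ", "TV Show", "TV show"],
        "rating": ["TV-MA", "tv-ma", "PG", " PG"],
    }
)

# Compare distinct counts before and after normalising.
for col in ["type", "rating"]:
    before = toy[col].nunique()
    after = toy[col].str.strip().str.upper().nunique()
    print(col, "distinct before:", before, "after:", after)

# Ratings are conventionally upper case; 'type' is mapped back to assumed canonical labels.
toy["rating"] = toy["rating"].str.strip().str.upper()
toy["type"] = (
    toy["type"].str.strip().str.lower().map({"movie": "Movie", "tv show": "TV Show"})
)

print(toy)
```

If the two counts already agree on the real data, "No Action Taken" is justified and the normalisation step can be skipped.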

#### 5. Removing Duplicates
**No Action Taken**
##### Double check

In [None]:
print(len(df))
df.index.nunique()

In [None]:
for col in df.columns:
    print(col, df[col].nunique())
# df.description.nunique()

##### Remarks:

No duplicates in titles. Interestingly, there are duplicates in description; these turn out to be translated versions of the same film (examples below).

In [None]:
# df.description.value_counts()[df.description.value_counts() >1]

In [None]:
# df[df.description == 'A young Han Solo tries to settle an old score with the help of his new buddy Chewbacca, a crew of space smugglers and a cunning old friend.']

In [None]:
# df[df.description == 'Paranormal activity at a lush, abandoned property alarms a group eager to redevelop the site, but the eerie events may not be as unearthly as they think.']

##### Don't seem to be duplicates, just different translations
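
A small sketch of how fully duplicated rows can be told apart from rows that merely share a description, using `DataFrame.duplicated`. The toy titles and blurbs are invented for illustration.

```python
import pandas as pd

# Toy frame: one translated pair (same blurb, different title) and one true duplicate pair.
toy = pd.DataFrame(
    {
        "title": ["Film A", "Film A (Hindi Version)", "Film B", "Film B"],
        "description": ["Same blurb.", "Same blurb.", "Other blurb.", "Other blurb."],
        "type": ["Movie", "Movie", "Movie", "Movie"],
    }
)

# Rows identical in every column: true duplicates.
exact = toy.duplicated(keep="first")

# Rows sharing a description: candidates to inspect by eye, not to drop automatically.
same_desc = toy[toy.duplicated(subset="description", keep=False)].sort_values("description")

# Dropping only the exact duplicates keeps both language versions of Film A.
deduped = toy.drop_duplicates()

print(int(exact.sum()), len(same_desc), len(deduped))
```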

#### 6. Handling outliers

**No Action Taken**
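
If outliers were to be flagged numerically rather than read off the boxplots, the usual 1.5 × IQR rule could be sketched as follows. The toy `release_year` values are made up, and flagged rows are only candidates for inspection: an old release year is usually a legitimate value.

```python
import pandas as pd

# Toy release years with one value far from the rest.
release_year = pd.Series(
    [2015, 2016, 2017, 2018, 2019, 2020, 2021, 1925], name="release_year"
)

# Quartiles and interquartile range.
q1, q3 = release_year.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, the same rule boxplot whiskers use by default.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = release_year[(release_year < lower) | (release_year > upper)]
print(outliers)
```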

In [None]:
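# 'final' only exists in the commented-out cells above, so drop the rows with missing date_added here
df = df.dropna(subset=['date_added'])
df.isna().sum()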

In [None]:
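# one boxplot per non-object column; report any column that cannot be plotted and carry on
for col in df.select_dtypes(exclude='object').columns:
    try:
        plt.figure()
        sns.boxplot(data=df, y=col)
        plt.show()
    except Exception as err:
        print(f'could not plot {col}: {err}')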

In [None]:
df.year_added.min()