## CLEANING THE DATA FOR DISNEY PLUS DATASET

Import libraries

In [1]:
import pandas as pd
import numpy as np

Load data

Tip: Read the file with encoding 'utf-8' instead of 'latin-1' so that accented and other non-ASCII characters are decoded correctly rather than showing up as stray special characters.
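
To see why the encoding matters, here is a minimal sketch using a made-up string (not a value from the dataset). It decodes the same UTF-8 bytes in both ways:

```python
# Hypothetical sample text; any non-ASCII character behaves the same way.
raw_bytes = "café".encode("utf-8")

# Decoding UTF-8 bytes as latin-1 splits the two-byte 'é' into two wrong characters.
wrong = raw_bytes.decode("latin-1")

# Decoding with the matching encoding restores the original text.
right = raw_bytes.decode("utf-8")

print(wrong, right)
```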

In [2]:
df = pd.read_csv("../date raw/DataSchool_DisneyPlus.csv", encoding = 'utf-8')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,Alonso Ramirez Ramos| Dave Wasson,Chris Diamantopoulos| Tony Anselmo| Tress MacN...,,26-Nov-21,2016,TV-G,23 min,Animation| Family,Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,Jim Varney| Noelle Parker| Douglas Seale,,26-Nov-21,1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,Raymond Albert Romano| John Leguizamo| Denis L...,United States,26-Nov-21,2011,TV-G,23 min,Animation| Comedy| Family,Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,Darren Criss| Adam Lambert| Derek Hough| Alexa...,,26-Nov-21,2021,TV-PG,41 min,Musical,This is real life; not just fantasy!
4,s5,TV Show,The Beatles: Get Back,,John Lennon| Paul McCartney| George Harrison| ...,,25-Nov-21,2021,,1 Season,Docuseries| Historical| Music,A three-part documentary from Peter Jackson ca...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       1450 non-null   object
 1   type          1450 non-null   object
 2   title         1450 non-null   object
 3   director      977 non-null    object
 4   cast          1260 non-null   object
 5   country       1231 non-null   object
 6   date_added    1447 non-null   object
 7   release_year  1450 non-null   int64 
 8   rating        1447 non-null   object
 9   duration      1450 non-null   object
 10  listed_in     1450 non-null   object
 11  description   1450 non-null   object
dtypes: int64(1), object(11)
memory usage: 136.1+ KB


Step 1: Find the percentage of missing data to decide whether to drop or impute the values

Percentage of missing data for each feature
- For 'director' -> 32.62%
- For 'cast' -> 13.10%
- For 'country' -> 15.10%
- For 'date_added' & 'rating' -> 0.21% each (only 3 missing values per column)

In [4]:
df["director"].isnull().sum()/len(df["director"])*100

32.62068965517241

In [5]:
df["cast"].isnull().sum()/len(df["cast"])*100

13.10344827586207

In [6]:
df["country"].isnull().sum()/len(df["country"])*100

15.10344827586207

In [7]:
df["date_added"].isnull().sum()/len(df["date_added"])*100
df["rating"].isnull().sum()/len(df["rating"])*100

0.20689655172413793
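
The per-column cells above can probably be collapsed into a single expression. This is a sketch on a small made-up frame (the `toy` data is an assumption, not the Disney+ file):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "director": ["A", np.nan, "C", np.nan],
    "cast": ["x", "y", np.nan, "z"],
    "title": ["t1", "t2", "t3", "t4"],
})

# isnull() gives booleans; the mean of a boolean column is the share of True values.
missing_pct = toy.isnull().mean().mul(100).round(2)
print(missing_pct.sort_values(ascending=False))
```

On the real data the same line would be `df.isnull().mean().mul(100).round(2)`.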

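Step 2: Remove duplicates (first on all columns, then on title), drop rows with a missing date_added and fill in the three missing ratings manually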

In [8]:
df.drop_duplicates(inplace=True)
df.drop_duplicates(subset='title', inplace=True)
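
Before dropping anything it can help to count how many rows each call would remove. A small sketch on made-up data (the `toy` frame and its values are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "B", "B"],
    "release_year": [2000, 2000, 2010, 2011],
})

# Rows that are identical to an earlier row in every column.
full_dupes = toy.duplicated().sum()

# Rows that repeat an earlier title, even if other columns differ.
title_dupes = toy.duplicated(subset="title").sum()

# keep="first" (the default) keeps the first occurrence of each title.
deduped = toy.drop_duplicates(subset="title", keep="first")
print(full_dupes, title_dupes, len(deduped))
```

Note that deduplicating on title alone also removes rows that share a title but differ elsewhere, such as a remake with a different release year.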

In [9]:
df.dropna(subset = ['date_added'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1447 entries, 0 to 1449
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       1447 non-null   object
 1   type          1447 non-null   object
 2   title         1447 non-null   object
 3   director      977 non-null    object
 4   cast          1257 non-null   object
 5   country       1228 non-null   object
 6   date_added    1447 non-null   object
 7   release_year  1447 non-null   int64 
 8   rating        1444 non-null   object
 9   duration      1447 non-null   object
 10  listed_in     1447 non-null   object
 11  description   1447 non-null   object
dtypes: int64(1), object(11)
memory usage: 147.0+ KB


In [10]:
# Fill in the three missing ratings by hand (row labels 4, 276 and 280)
df.loc[4, 'rating'] = 'PG-13'
df.loc[276, 'rating'] = 'TV-14'
df.loc[280, 'rating'] = 'TV-14'
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,Alonso Ramirez Ramos| Dave Wasson,Chris Diamantopoulos| Tony Anselmo| Tress MacN...,,26-Nov-21,2016,TV-G,23 min,Animation| Family,Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,Jim Varney| Noelle Parker| Douglas Seale,,26-Nov-21,1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,Raymond Albert Romano| John Leguizamo| Denis L...,United States,26-Nov-21,2011,TV-G,23 min,Animation| Comedy| Family,Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,Darren Criss| Adam Lambert| Derek Hough| Alexa...,,26-Nov-21,2021,TV-PG,41 min,Musical,This is real life; not just fantasy!
4,s5,TV Show,The Beatles: Get Back,,John Lennon| Paul McCartney| George Harrison| ...,,25-Nov-21,2021,PG-13,1 Season,Docuseries| Historical| Music,A three-part documentary from Peter Jackson ca...


Step 3: Format date_added as YYYY-MM-DD (e.g. 2021-09-25)

In [11]:
df['date_added'].head(2)

0    26-Nov-21
1    26-Nov-21
Name: date_added, dtype: object

In [12]:
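# The raw values look like 26-Nov-21, so spell out the day-month-year format explicitly
df["date_added"] = pd.to_datetime(df["date_added"], format="%d-%b-%y").dt.date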
df["date_added"].head()

0    2021-11-26
1    2021-11-26
2    2021-11-26
3    2021-11-26
4    2021-11-25
Name: date_added, dtype: object
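
A minimal sketch of the same conversion with an explicit format and tolerant error handling. The sample values, including the deliberately bad one, are made up, and `%b` assumes English month abbreviations:

```python
import pandas as pd

raw = pd.Series(["26-Nov-21", "25-Nov-21", "not a date"])

# %d = day, %b = abbreviated month name, %y = two-digit year.
# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%d-%b-%y", errors="coerce")

# strftime renders the parsed dates as YYYY-MM-DD strings.
iso = parsed.dt.strftime("%Y-%m-%d")
print(iso)
```

Keeping the column as datetime64 (without `.dt.date`) makes later date arithmetic and filtering easier; the YYYY-MM-DD text form is only needed when exporting.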

In [13]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,Alonso Ramirez Ramos| Dave Wasson,Chris Diamantopoulos| Tony Anselmo| Tress MacN...,,2021-11-26,2016,TV-G,23 min,Animation| Family,Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,Jim Varney| Noelle Parker| Douglas Seale,,2021-11-26,1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,Raymond Albert Romano| John Leguizamo| Denis L...,United States,2021-11-26,2011,TV-G,23 min,Animation| Comedy| Family,Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,Darren Criss| Adam Lambert| Derek Hough| Alexa...,,2021-11-26,2021,TV-PG,41 min,Musical,This is real life; not just fantasy!
4,s5,TV Show,The Beatles: Get Back,,John Lennon| Paul McCartney| George Harrison| ...,,2021-11-25,2021,PG-13,1 Season,Docuseries| Historical| Music,A three-part documentary from Peter Jackson ca...


Step 4: Fix the problems with rating column

In [14]:
df['rating'].head(7)

0     TV-G
1       PG
2     TV-G
3    TV-PG
4    PG-13
5    PG-13
6    TV-14
Name: rating, dtype: object

In [15]:
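# Replace any remaining missing ratings with 'Not Rated'
# (assigning the result back avoids the in-place call on a column, and np.NaN no longer exists in NumPy 2.0)
df['rating'] = df['rating'].fillna('Not Rated')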
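
To check what the rating column contains before and after the fill, a tally that includes missing values is handy. A sketch on made-up ratings:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"rating": ["TV-G", np.nan, "PG", "TV-G"]})

# dropna=False keeps NaN as its own row, so missing ratings are visible in the tally.
before = toy["rating"].value_counts(dropna=False)

toy["rating"] = toy["rating"].fillna("Not Rated")
after = toy["rating"].value_counts(dropna=False)

print(before)
print(after)
```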

Step 5: Split the duration column into a number (duration_show) and a unit of measure (unit_measure): min for movies, Season(s) for TV shows

In [16]:
# split column and add new columns to df
df[['duration_show', 'unit_measure']] = df['duration'].str.split(' ', expand=True)
# display the dataframe
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration_show,unit_measure
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,Alonso Ramirez Ramos| Dave Wasson,Chris Diamantopoulos| Tony Anselmo| Tress MacN...,,2021-11-26,2016,TV-G,23 min,Animation| Family,Join Mickey and the gang as they duck the halls!,23,min
1,s2,Movie,Ernest Saves Christmas,John Cherry,Jim Varney| Noelle Parker| Douglas Seale,,2021-11-26,1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...,91,min


In [17]:
df.drop(['duration'], axis=1, inplace=True)

In [18]:
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_show,unit_measure
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,Alonso Ramirez Ramos| Dave Wasson,Chris Diamantopoulos| Tony Anselmo| Tress MacN...,,2021-11-26,2016,TV-G,Animation| Family,Join Mickey and the gang as they duck the halls!,23,min
1,s2,Movie,Ernest Saves Christmas,John Cherry,Jim Varney| Noelle Parker| Douglas Seale,,2021-11-26,1988,PG,Comedy,Santa Claus passes his magic bag to a new St. ...,91,min


Step 6: Replace 'Seasons' with 'Season' so that every TV show uses the same unit label

In [19]:
# Assign the result back instead of calling replace in place on a single column
df['unit_measure'] = df['unit_measure'].replace('Seasons', 'Season')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_show,unit_measure
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,Alonso Ramirez Ramos| Dave Wasson,Chris Diamantopoulos| Tony Anselmo| Tress MacN...,,2021-11-26,2016,TV-G,Animation| Family,Join Mickey and the gang as they duck the halls!,23,min
1,s2,Movie,Ernest Saves Christmas,John Cherry,Jim Varney| Noelle Parker| Douglas Seale,,2021-11-26,1988,PG,Comedy,Santa Claus passes his magic bag to a new St. ...,91,min
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,Raymond Albert Romano| John Leguizamo| Denis L...,United States,2021-11-26,2011,TV-G,Animation| Comedy| Family,Sid the Sloth is on Santa's naughty list.,23,min
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,Darren Criss| Adam Lambert| Derek Hough| Alexa...,,2021-11-26,2021,TV-PG,Musical,This is real life; not just fantasy!,41,min
4,s5,TV Show,The Beatles: Get Back,,John Lennon| Paul McCartney| George Harrison| ...,,2021-11-25,2021,PG-13,Docuseries| Historical| Music,A three-part documentary from Peter Jackson ca...,1,Season


Step 7: Format duration_show as a whole number (int)

In [20]:
df['duration_show'] = df['duration_show'].astype('int')
df['duration_show'].groupby(df['type']).describe()

type,count,mean,std,min,25%,50%,75%,max
Movie,1052.0,71.910646,40.595585,1.0,44.0,85.0,98.0,183.0
TV Show,395.0,2.113924,2.420237,1.0,1.0,1.0,2.0,32.0
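
Steps 5 to 7 could also be done in one pass with a regular expression. This is an alternative sketch, not the approach used above, and the sample durations are made up:

```python
import pandas as pd

toy = pd.DataFrame({"duration": ["23 min", "1 Season", "2 Seasons"]})

# One named group for the digits, one for the unit word.
pattern = r"^(?P<duration_show>\d+)\s+(?P<unit_measure>\w+)$"
parts = toy["duration"].str.extract(pattern)

# The extracted digits are still strings; convert them to integers.
parts["duration_show"] = parts["duration_show"].astype(int)

# Normalise the plural so every TV show uses the same unit label.
parts["unit_measure"] = parts["unit_measure"].replace("Seasons", "Season")
print(parts)
```

Rows that do not match the pattern come back as NaN, which makes malformed durations easy to spot before the integer conversion.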


Step 8: Fix the problem with 'duration' for the following TV shows:
- Obi-Wan Kenobi;
- Baymax!;
- The Proud Family: Louder and Prouder.

They are listed as movies with a duration of 1 min instead of TV shows with 1 season, according to what Google says :)

In [24]:
df.loc[df['title']=='Obi-Wan Kenobi']
df.loc[df['title']=='Baymax!']
df.loc[df['title']=='The Proud Family: Louder and Prouder']

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_show,unit_measure
37,s38,Movie,The Proud Family: Louder and Prouder,,Kyla Pratt| Tommy Davidson| Paula Jai Parker| ...,,2021-11-12,2021,TV-G,Animation| Comedy| Coming of Age,"""The Proud Family: Louder and Prouder"" follows...",1,min


In [27]:
# Relabel the three titles (row labels found with the lookups above) as one-season TV shows
df.loc[27, 'unit_measure']='Season'
df.loc[27, 'type']='TV Show'
df.loc[15, 'unit_measure']='Season'
df.loc[15, 'type']='TV Show'
df.loc[37, 'unit_measure']='Season'
df.loc[37, 'type']='TV Show'
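
Hard-coded row labels break if the file is re-exported in a different order. A sketch of the same fix keyed on the title instead, run on a made-up frame (the `toy` rows are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie"],
    "title": ["Baymax!", "Obi-Wan Kenobi", "Ernest Saves Christmas"],
    "duration_show": [1, 1, 91],
    "unit_measure": ["min", "min", "min"],
})

# Titles from Step 8 that are really one-season TV shows.
mislabelled = ["Obi-Wan Kenobi", "Baymax!", "The Proud Family: Louder and Prouder"]

# Boolean mask selecting only the affected rows, wherever they sit in the frame.
mask = toy["title"].isin(mislabelled)
toy.loc[mask, "type"] = "TV Show"
toy.loc[mask, "unit_measure"] = "Season"
print(toy)
```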

Step 9: Check for outliers (TV shows with more than 6 seasons); the cell below only lists them and does not change any rows

In [28]:
df.loc[(df['duration_show'] >6) & (df['type']=='TV Show')]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_show,unit_measure
13,s14,TV Show,Dr. Oakley| Yukon Vet,,Dr. Michelle Oakley| Zachary Fine,United States,2021-11-17,2013,TV-PG,Action-Adventure| Animals & Nature| Docuseries,Meet Dr. Michelle Oakley; vet to pretty much e...,10,Season
92,s93,TV Show,The Simpsons,,Dan Castellaneta| Julie Kavner| Nancy Cartwrig...,United States,2021-09-29,1989,TV-PG,Animation| Comedy,The world’s favorite nuclear family; in the aw...,32,Season
106,s107,TV Show,Life Below Zero,,Chip Hailstone| Agnes Hailstone| Sue Aikens| A...,United States,2021-09-15,2012,TV-14,Action-Adventure| Animals & Nature| Docuseries,Experience life deep in Alaska where the prima...,16,Season
115,s116,TV Show,The Incredible Dr. Pol,,Rick Robles| Dr. Pol,United States,2021-09-08,2011,TV-14,Animals & Nature| Docuseries| Family,Dr. Pol and his team handle challenging veteri...,19,Season
125,s126,TV Show,Dr. K's Exotic Animal ER,,Dr. Susan Kelleher| Art Edmonds,,2021-08-25,2014,TV-14,Animals & Nature| Docuseries| Family,Dedicated veterinarians treat a colorful array...,9,Season
179,s180,TV Show,When Sharks Attack,,Eric Meyers,United States,2021-07-09,2013,TV-14,Reality,National Geographic investigates shark attacks.,7,Season
216,s217,TV Show,Wicked Tuna,,Mike Rowe,United States,2021-05-28,2016,TV-14,Action-Adventure| Animals & Nature| Docuseries,Massachusetts fishermen make their living one ...,10,Season
285,s286,TV Show,Car SOS,,,United Kingdom,2021-02-26,2012,TV-PG,Buddy| Comedy| Docuseries,Decaying classic cars are revived by two exper...,8,Season
307,s308,TV Show,Wicked Tuna: Outer Banks,,Bill Ratner,,2021-02-05,2013,TV-14,Animals & Nature| Docuseries| Family,Fishermen venture to North Carolina’s Outer Ba...,7,Season
412,s413,TV Show,Once Upon a Time,,Ginnifer Goodwin| Jennifer Morrison| Robert Ca...,United States,2020-09-18,2011,TV-PG,Action-Adventure| Fantasy| Soap Opera / Melodrama,Fairy tale characters inhabit a land of good a...,7,Season
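
The cutoff of 6 seasons above is a manual choice. One possible alternative, not used in this notebook, is to derive the cutoff from the data with the common 1.5 × IQR rule. A sketch on made-up season counts:

```python
import pandas as pd

# Made-up titles and season counts, for illustration only.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f", "long-runner"],
    "type": ["TV Show"] * 7,
    "duration_show": [1, 1, 1, 2, 2, 3, 32],
})

shows = toy[toy["type"] == "TV Show"]

# Interquartile range of the season counts.
q1 = shows["duration_show"].quantile(0.25)
q3 = shows["duration_show"].quantile(0.75)

# Anything above q3 + 1.5 * IQR is flagged for manual review, not removed.
upper_fence = q3 + 1.5 * (q3 - q1)
flagged = shows[shows["duration_show"] > upper_fence]
print(upper_fence, flagged["title"].tolist())
```

A flagged row is only a candidate for checking: a long-running series can legitimately have dozens of seasons.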


Step 10: Make sure that title is a string (some movies have dates as title) and release_year is an integer

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1447 entries, 0 to 1449
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   show_id        1447 non-null   object
 1   type           1447 non-null   object
 2   title          1447 non-null   object
 3   director       977 non-null    object
 4   cast           1257 non-null   object
 5   country        1228 non-null   object
 6   date_added     1447 non-null   object
 7   release_year   1447 non-null   int64 
 8   rating         1447 non-null   object
 9   listed_in      1447 non-null   object
 10  description    1447 non-null   object
 11  duration_show  1447 non-null   int64 
 12  unit_measure   1447 non-null   object
dtypes: int64(2), object(11)
memory usage: 190.6+ KB


In [30]:
df['title'].describe()

count                                                 1447
unique                                                1447
top       Duck the Halls: A Mickey Mouse Christmas Special
freq                                                     1
Name: title, dtype: object

In [31]:
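# astype returns a new Series, so assign it back for the conversion to stick
df['release_year'] = df['release_year'].astype('int')
df['release_year']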

0       2016
1       1988
2       2011
3       2021
4       2021
        ... 
1445    2009
1446    2009
1447    2016
1448    2003
1449    2012
Name: release_year, Length: 1447, dtype: int64
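
An `object` dtype does not guarantee that every title is a string. A sketch of a stricter check and conversion, using made-up values that mix strings, a number and a date:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Cars", 1917, pd.Timestamp("2012-01-01")],
    "release_year": ["2006", "2019", "2012"],
})

# Count the Python types present in the column; anything other than str needs converting.
type_counts = toy["title"].map(type).value_counts()
print(type_counts)

# Convert every title to text and every release year to an integer.
toy["title"] = toy["title"].astype(str)
toy["release_year"] = toy["release_year"].astype(int)
```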

Step 11: Make sure empty cells for director, cast and country hold NaN (rows with a missing date_added were already dropped in Step 2)

In [32]:
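# Missing values are already NaN after read_csv, so filling NaN with NaN changes nothing;
# converting empty strings to NaN is the likely intent here
cols = ['director', 'cast', 'country']
df[cols] = df[cols].replace('', np.nan)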

Step 12: Save all changes into a new csv labeled "DataSchool_DisneyPlus_Clean.csv"

In [33]:
df.to_csv('DataSchool_DisneyPlus_Clean.csv')

In [34]:
# The encoding argument was removed from to_excel in pandas 2.0; .xlsx files store text as Unicode anyway
df.to_excel('DataSchool_DisneyPlus_Clean.xlsx')
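
By default `to_csv` also writes the row labels, which come back as an extra `Unnamed: 0` column when the file is read again. A sketch of the difference, using an in-memory buffer and a made-up frame instead of a real file:

```python
import io
import pandas as pd

# Non-consecutive row labels, as left behind after dropping rows.
toy = pd.DataFrame({"show_id": ["s1", "s2"], "title": ["A", "B"]}, index=[0, 5])

# Default: the index is written as a first, unnamed column.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# index=False writes only the data columns.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(list(reloaded_with_index.columns), list(reloaded.columns))
```

Whether to keep the index depends on whether the original row labels are needed downstream; if not, `index=False` gives a cleaner file.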