# PI 1: ETL

## Things to do:

* Trim string values. ✅
* Has duplicate values? NO
* Drop added_date column. ✅
* Manage null values. ✅
* Create a movies DF

## Import Libs

In [155]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


## Create DFs

In [156]:
df_amazon = pd.read_csv('Datasets/amazon_prime_titles.csv')
df_disney = pd.read_csv('Datasets/disney_plus_titles.csv')
df_hulu = pd.read_csv('Datasets/hulu_titles.csv')
df_netflix = pd.read_json('Datasets/netflix_titles.json')

df_amazon

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...
...,...,...,...,...,...,...,...,...,...,...,...,...
9663,s9664,Movie,Pride Of The Bowery,Joseph H. Lewis,"Leo Gorcey, Bobby Jordan",,,1940,7+,60 min,Comedy,New York City street principles get an East Si...
9664,s9665,TV Show,Planet Patrol,,"DICK VOSBURGH, RONNIE STEVENS, LIBBY MORRIS, M...",,,2018,13+,4 Seasons,TV Shows,"This is Earth, 2100AD - and these are the adve..."
9665,s9666,Movie,Outpost,Steve Barker,"Ray Stevenson, Julian Wadham, Richard Brake, M...",,,2008,R,90 min,Action,"In war-torn Eastern Europe, a world-weary grou..."
9666,s9667,TV Show,Maradona: Blessed Dream,,"Esteban Recagno, Ezequiel Stremiz, Luciano Vit...",,,2021,TV-MA,1 Season,"Drama, Sports","The series tells the story of Diego Maradona, ..."


## Trim values

In [157]:
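# Remove leading and trailing whitespace from every string cell.
# DataFrame.apply passes whole columns (Series), so each cell is stripped through Series.map.

df_amazon = df_amazon.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))
df_disney = df_disney.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))
df_hulu = df_hulu.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))
df_netflix = df_netflix.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))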

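# Normalize string values
# Note: DataFrame.apply passes each column as a Series, so type(x) == str is never true
# and the calls below leave the values unchanged (the output still shows 'TV Show').
# Applying str.title() per cell would turn 'TV Show' into 'Tv Show' and break the later filter on type.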

df_amazon[['type', 'title', 'director', 'cast', 'country', 'listed_in']] = df_amazon[['type', 'title', 'director', 'cast', 'country', 'listed_in']].apply(lambda x: x.title() if type(x) == str else x)

df_disney[['type', 'title', 'director', 'cast', 'country', 'listed_in']] = df_disney[['type', 'title', 'director', 'cast', 'country', 'listed_in']].apply(lambda x: x.title() if type(x) == str else x)

df_hulu[['type', 'title', 'director', 'cast', 'country', 'listed_in']] = df_hulu[['type', 'title', 'director', 'cast', 'country', 'listed_in']].apply(lambda x: x.title() if type(x) == str else x)

df_netflix[['type', 'title', 'director', 'cast', 'country', 'listed_in']] = df_netflix[['type', 'title', 'director', 'cast', 'country', 'listed_in']].apply(lambda x: x.title() if type(x) == str else x)

# Preview listed_in renamed to genre (rename returns a new DataFrame; df_netflix itself is unchanged)

df_netflix.rename(columns={'listed_in': 'genre'})


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,genre,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
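
The trimming step depends on how `DataFrame.apply` hands data to the lambda. The following is a minimal sketch on a toy DataFrame (the rows and the `strip_strings` helper are illustrative assumptions, not part of the original notebook). It compares a column-wise check with a cell-wise one.

```python
import pandas as pd

def strip_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame with whitespace stripped from every string cell."""
    def strip_cell(value):
        return value.strip() if isinstance(value, str) else value
    # apply() visits columns; map() then visits each cell of that column.
    return df.apply(lambda column: column.map(strip_cell))

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "title": ["  Zoom ", "Outpost"],
    "director": [" Peter Hewitt", None],
    "release_year": [2006, 2008],
})

# Column-wise check: x is a Series, never a str, so nothing is stripped.
unchanged = toy.apply(lambda x: x.strip() if type(x) == str else x)

# Cell-wise check: every string value is stripped.
cleaned = strip_strings(toy)

print(unchanged["title"].tolist())
print(cleaned["title"].tolist())
```

Non-string cells (numbers, missing values) pass through untouched, so the helper can be applied to a whole DataFrame without selecting columns first.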


## Drop date_added column

In [158]:
# Drop the date_added column from every DataFrame.
df_amazon.drop(columns='date_added', inplace=True)
df_disney.drop(columns='date_added', inplace=True)
df_hulu.drop(columns='date_added', inplace=True)
df_netflix.drop(columns='date_added', inplace=True)


## Create a movies DF

In [159]:
# Create the platform table.
df_platforms = pd.DataFrame(data= {'platform': ['Amazon', 'Disney', 'Hulu', 'Netflix']})

# Add a platform id column (the id is the row index in df_platforms).
df_amazon['platform'] = 0
df_disney['platform'] = 1
df_hulu['platform'] = 2
df_netflix['platform'] = 3

df_platforms

Unnamed: 0,platform
0,Amazon
1,Disney
2,Hulu
3,Netflix
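
Because the platform id is simply the row index of `df_platforms`, the name can be recovered later with a lookup. This is a small sketch under that assumption; the `titles` rows and the `platform_name` column are hypothetical.

```python
import pandas as pd

df_platforms = pd.DataFrame({"platform": ["Amazon", "Disney", "Hulu", "Netflix"]})

# Toy fact rows carrying only the numeric platform id.
titles = pd.DataFrame({"title": ["Zoom", "Outpost"], "platform": [3, 0]})

# Series.map with a Series looks each id up in that Series' index.
titles["platform_name"] = titles["platform"].map(df_platforms["platform"])

print(titles)
```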


## Manage null values.

In [160]:
# Set the title column as the index

df_amazon.set_index('title', drop=True, inplace=True)
df_disney.set_index('title', drop=True, inplace=True)
df_hulu.set_index('title', drop=True, inplace=True)
df_netflix.set_index('title', drop=True, inplace=True)

# Fill each DataFrame's null values from the other platforms with fillna.
# When given a DataFrame, fillna aligns on both the index (title) and the column names.

df_amazon.fillna(df_disney, inplace=True)
df_amazon.fillna(df_hulu, inplace=True)
df_amazon.fillna(df_netflix, inplace=True)

df_disney.fillna(df_amazon, inplace=True)
df_disney.fillna(df_hulu, inplace=True)
df_disney.fillna(df_netflix, inplace=True)

df_hulu.fillna(df_amazon, inplace=True)
df_hulu.fillna(df_disney, inplace=True)
df_hulu.fillna(df_netflix, inplace=True)

df_netflix.fillna(df_amazon, inplace=True)
df_netflix.fillna(df_disney, inplace=True)
df_netflix.fillna(df_hulu, inplace=True)

#Revert set_index.

df_amazon.reset_index(inplace=True)
df_disney.reset_index(inplace=True)
df_hulu.reset_index(inplace=True)
df_netflix.reset_index(inplace=True)
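
The alignment behaviour that the null-filling step relies on can be checked on a small case. The two frames below are toy data invented for illustration. Only cells that are missing on the left and present on the right under the same title and column are filled.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(
    {"director": [np.nan, "Steve Barker"], "country": ["United States", np.nan]},
    index=pd.Index(["Zoom", "Outpost"], name="title"),
)
# The other platform only knows "Zoom"; "Zodiac" has no counterpart on the left.
right = pd.DataFrame(
    {"director": ["Peter Hewitt", "David Fincher"], "country": ["Canada", "United States"]},
    index=pd.Index(["Zoom", "Zodiac"], name="title"),
)

filled = left.fillna(right)

print(filled)
```

Existing values are never overwritten, and a title that is absent from the other frame keeps its nulls.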

## Split the general DF into movies and TV shows

In [161]:
df_general = pd.concat([df_amazon, df_netflix, df_disney, df_hulu])

df_movies = df_general[df_general['type'] == "Movie"]
df_tv_shows = df_general[df_general['type'] == 'TV Show']

df_movies.reset_index(drop=True, inplace=True)
df_tv_shows.reset_index(drop=True, inplace=True)
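
Filtering with a boolean mask can leave pandas unsure whether the result is a view or a copy, which is what the `SettingWithCopyWarning` further down in this notebook is about. One common way to sidestep it, sketched here on toy rows (not the notebook's data), is to take an explicit copy when splitting.

```python
import pandas as pd

# Toy stand-in for df_general.
df_general = pd.DataFrame({
    "title": ["The Grand Seduction", "Planet Patrol", "Outpost"],
    "type": ["Movie", "TV Show", "Movie"],
    "director": ["Don McKellar", None, None],
})

# .copy() makes each subset an independent DataFrame; reset_index renumbers the rows.
df_movies = df_general[df_general["type"] == "Movie"].copy().reset_index(drop=True)
df_tv_shows = df_general[df_general["type"] == "TV Show"].copy().reset_index(drop=True)

# Assigning to a column of the copy does not touch df_general.
df_movies["director"] = df_movies["director"].fillna("Sin Dato")

print(df_movies)
```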

In [162]:
len(df_movies.director.unique().tolist())

9953

## Create Dimension Dataframes

In [163]:
# Directors

#error strings:
err_string_1 = 'CAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI ValidCAPI'


df_movies['director'] = df_movies.director.fillna('Sin Dato')  # Fill NaN ('Sin Dato' is Spanish for "no data")
df_movies['director'].replace(err_string_1, 'Sin Dato', inplace=True)  # Replace the garbage string
df_tv_shows['director'] = df_tv_shows.director.fillna('Sin Dato')  # Fill NaN
df_tv_shows['director'].replace(err_string_1, 'Sin Dato', inplace=True)  # Replace the garbage string



# List of unique director names
director_names = df_movies.director.unique().tolist()
director_names += df_tv_shows.director.unique().tolist()  # Append the TV show directors

# print(director_names)  # Used to spot the 3 bad entries near the end of the list

# Search for unusually long strings
# for ind, cont in enumerate(director_names):
#     if len(cont) > 100:
#         print("Index: ", ind, end='\n\n')
#         print(cont)

# Fix the 3 bad entries (positions found with the search above)
director_names[9949] = 'Jennifer Kent'
director_names[9950] = 'Gigi Saul Guerrero'
director_names[9951] = 'Alex Winter'


# Split multi-director strings on ", ", flatten into a 1D list, and trim each name
director_names = [name.strip() for entry in director_names for name in entry.split(", ")]

# Remove duplicated values
director_names = list(set(director_names))

# Create the dataframe
df_directors = pd.DataFrame({'name': director_names})

df_directors['name']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_movies['director'] = df_movies.director.fillna('Sin Dato')#Fill nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_movies['director'].replace(err_string_1, 'Sin Dato', inplace=True)# Mange error strings
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tv_shows['director'] = df_tv_shows.director.fillna('Sin Dato')#Fill nan
A value is trying to be s

0        Paul Andrew Williams
1                Kayoze Irani
2            Nanette Burstein
3               Augusto Matte
4                 Alan Gibson
                 ...         
10889           Chakri Toleti
10890          Rodrigo Triana
10891            Joachim Fest
10892       Franco Steffanino
10893      Milap Milan Zaveri
Name: name, Length: 10894, dtype: object
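
The same directors dimension can also be derived without manual list handling. This is a sketch of one possible alternative using `str.split` and `explode` on a toy Series; it is not the approach the notebook itself takes, and the sorting step is an added assumption to make the ids reproducible.

```python
import pandas as pd

# Toy director column: one multi-director entry, a repeated name, and a missing value.
directors = pd.Series(
    ["Thomas Kail, Alex Rudzinski", "Mark Knight", "Mark Knight", None],
    name="director",
)

names = (
    directors.dropna()      # ignore missing entries
    .str.split(",")         # "A, B" -> ["A", " B"]
    .explode()              # one row per name
    .str.strip()            # trim the leftover spaces
)

# Drop empty strings and duplicates; sort so the index is stable between runs.
names = names[names != ""].drop_duplicates().sort_values().reset_index(drop=True)

df_directors = pd.DataFrame({"name": names})

print(df_directors)
```

Unlike `list(set(...))`, whose ordering can vary between runs, sorting before resetting the index gives each director the same id every time.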

## Replace values of movies and tv show dataframes

In [164]:
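# Fix rows in the movies dataframe whose director field holds a title or description instead of a name.
# Each lookup locates the affected row; the .loc assignment below it writes the correct name.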

err_string_2 = "Director Gigi Saul Guerrero talks through Culture Shock’s themes – immigration, filmmaking, and latinidad – in this one-on-one chat."

err_string_3 = "Alex Winter goes inside the biggest global corruption scandal in history that was uncovered by hundreds of journalists, working in secret and at enormous risk."

df_movies[df_movies['director'].str.find("The Babadook") != -1]
df_movies.loc[16184, 'director'] = 'Jennifer Kent'

df_movies[df_movies['director'].str.find(err_string_2) != -1]
df_movies.loc[16213, 'director'] = 'Gigi Saul Guerrero'

df_movies[df_movies['director'].str.find(err_string_3) != -1]
df_movies.loc[16307, 'director'] = 'Alex Winter'



In [165]:
for ind, name in df_directors.iterrows():
    print(name.values[0])

Paul Andrew Williams
Kayoze Irani
Nanette Burstein
Augusto Matte
Alan Gibson
Lee Chung-hyun
White Trash Tyler
Sabu Varghese
Flor Salcedo
Lenard Fritz Krawinkel
Peter Farrelly
Jason Sussberg
Suresh Krishna
Tugçe Soysop
Marc Fouchard
Ricardo de Montreuil
Mike Jeffers
Timothy Armstrong
Sanjiv Kolate
Tony Vidal
Ryland Brickson Cole Tews
Peggy Holmes
Christian Baumeister
Predrag Antonijevic
Michael Lange
Antonio Chavarrías
Matthew O. Henderson
Fred Dekker
Ron Scalpello
A.R. Murugadoss
Sri Senthil
Nicholas Hytner
Michael Winner
Kyle Hedrick
Pang Ho-cheung
Kevin Willmott
Chris Eneaji
Lance Bangs
Rudranil Chaudhuri
Susan Glatzer
David Blair
Michael Carney
Lucía Puenzo
Paul D. Hannah
Alan Metter
Emil Ben-Shimon
William Eubank
Luis Libran
Heidi Brandenburg
Xiaoxing Yi
Bernard L. Kowalski
Anne Wheeler
Aleksey Sidorov
David van Eyssen
Pavel Lungin
Rajeev Patil
Bruce MacDonald
Tig Notaro
Toby Trackman
Otoja Abit
Amy Waddell
Nageshwara rao Anchula
Jeff Kennedy
Agustin Adba
Dana Doron
David Alvarado


In [166]:
str(df_directors[df_directors['name'].isin(['Rodrigo Triana'])].index[0])

'10890'
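
Looking up one index at a time with `isin` scales poorly when every row needs an id. A possible alternative, sketched on toy data with hypothetical names (`director_id`, `movie_directors`), is to build a name-to-id dictionary once and map the whole column through it.

```python
import pandas as pd

# Toy directors dimension; the row index plays the role of the director id.
df_directors = pd.DataFrame({"name": ["Alex Rudzinski", "Mark Knight", "Thomas Kail"]})

# Series.items() yields (index, value) pairs, i.e. (id, name).
director_id = {name: idx for idx, name in df_directors["name"].items()}

movie_directors = pd.Series(["Mark Knight", "Thomas Kail", "Unknown Person"])

# Names missing from the dimension become NaN instead of raising an error.
ids = movie_directors.map(director_id)

print(director_id["Mark Knight"])
print(ids)
```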

In [169]:
# Inspect the director value of each movie row
for ind, name in df_movies.iterrows():
    print(name['director'])

Don McKellar
Girish Joshi
Josh Webber
Sonia Anderson
Giles Foster
Paul Weiland
Fran Strine
Thomas Kail, Alex Rudzinski
Daniel Gilboy
Robert Allan Ackerman
Justin G. Dyck
Liz Tuccillo
Dominique Rocher
Jep Barcelona
Sonia Anderson
Becca Gleason
Glenn Miller
Drake Doremus
William Nigh
Sam Pillsbury
Dr. Rudolf Lammers
Ida Lupino
Sin Dato
Sin Dato
George C. Wolfe
Daisy Aitkens
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
Mark Knight
R. John Hugh
Mahi V Raghav
Mahi V Raghav
Allan Moyle
Boaz Davidson
Jeffrey Schwarz
Tim Gray
Baeble Music
Thor Freudenthal
Cannis Holder
Cannis Holder
Buzz Kulik
Robert Ginty
John Elbert Ferrer
Jenny Bowen
Oscar Micheaux
Alfred Santell
Mack V. Wright
Carroll Ballard
Frederic Compain
Jonathan Chase Cook
Brandon Jones
Brandon Jones
Alan Scales
Frank Hall Green
Jason Bourque
Glenn Gordon Caron
Andrew V. McLaglen
Mike Slee
Eugene Jarecki
Kreeti Gogia
Caryl Ebenezer, J. 

In [None]:
df_aux = df_movies['director'].apply(lambda x: x.split(", ") if isinstance(x, str) else x)

df_aux

df_aux.value_counts()

KeyboardInterrupt: 
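
Counting values in a Series whose cells are Python lists is problematic, since lists are not hashable. A hedged sketch of one way around this, using toy data rather than the real `df_movies`, is to explode the lists into one row per director before counting.

```python
import pandas as pd

# Toy director column with one multi-director entry.
directors = pd.Series(
    ["Thomas Kail, Alex Rudzinski", "Mark Knight", "Mark Knight", "Sin Dato"],
    name="director",
)

# Split into lists, then explode so every director gets a row of their own.
counts = directors.str.split(", ").explode().value_counts()

print(counts)
```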