# Merging, Cleaning, and Transforming Data (Movies Dataset)

## Introduction / Getting the Datasets

1. __Load__ and __inspect__ the datasets "movies_clean.csv" and "credits.csv". __Identify__ stringified/nested __json columns__ in the __credits__ dataset.

In [None]:
import pandas as pd

In [None]:
df=pd.read_csv('movies_clean.csv',parse_dates=["release_date"])

In [None]:
df.info()

In [None]:
#credits.csv contains information on each movie's cast and crew
credits=pd.read_csv("credits.csv")

In [None]:
credits

In [None]:
credits.info()

In [None]:
credits.cast[0]

In [None]:
credits.crew[0]
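
Inspecting single values by eye works for two columns. As a hedged sketch, the check can also be automated: the helper `looks_like_stringified_list` and the miniature `toy_credits` frame below are hypothetical stand-ins and not part of the dataset. They assume the nested columns hold strings that parse to Python lists.

```python
import ast
import pandas as pd

def looks_like_stringified_list(value):
    """Return True if value is a string that parses to a Python list."""
    if not isinstance(value, str):
        return False
    try:
        return isinstance(ast.literal_eval(value), list)
    except (ValueError, SyntaxError):
        return False

# Hypothetical miniature stand-in for credits.csv (ids are made up)
toy_credits = pd.DataFrame({
    "cast": ["[{'name': 'Tom Hanks', 'character': 'Woody (voice)'}]", "[]"],
    "crew": ["[{'name': 'John Lasseter', 'job': 'Director'}]", "[]"],
    "id": [1, 2],
})

# A column counts as nested if every value parses to a list
nested_columns = [
    col for col in toy_credits.columns
    if toy_credits[col].map(looks_like_stringified_list).all()
]
print(nested_columns)  # ['cast', 'crew']
```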

## Preparing the Data for Merge

2. __Drop Duplicates__ in the credits dataset. (similar to Project 3)

In [None]:
credits.id.value_counts()
#counts above 1 mean an id appears more than once, so we have duplicates

In [None]:
credits[credits.duplicated(subset=["id"],keep=False)].sort_values("id")

In [None]:
#keep only the first instance of each duplicated id
credits.drop_duplicates(subset="id",inplace=True)

In [None]:
credits.id.value_counts() 
#now we have only one instance of each id
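
The same inspect-then-drop pattern can be tried on a tiny frame first. This is a minimal sketch with made-up data (`toy` is hypothetical) that relies only on `duplicated` and `drop_duplicates`, whose default `keep="first"` retains the first row of each duplicated id.

```python
import pandas as pd

# Hypothetical toy data: id 2 appears twice
toy = pd.DataFrame({"id": [1, 2, 2, 3], "cast": ["a", "b", "b", "c"]})

# keep=False flags every row that belongs to a duplicated id
dupes = toy[toy.duplicated(subset="id", keep=False)]
print(len(dupes))  # 2

# keep="first" is the default, so the first row per id survives
deduped = toy.drop_duplicates(subset="id")
print(deduped["id"].is_unique)  # True
```

Checking `is_unique` afterwards is a compact alternative to scanning the `value_counts()` output.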

## Merging the Data

3. __Merge/Join__ the datasets movies_clean and credits. -> Add the features __cast__ and __crew__ to the movies_clean dataset.

In [None]:
#we merge with a left join: df is the left DataFrame and credits is the right DataFrame
#first, look for movies in df that have no cast and crew information in credits
df[~df.id.isin(credits.id)]
#an empty result means every movie in df will receive cast and crew data

In [None]:
#conversely, find movies that have cast and crew data in credits but no entry in df
credits[~credits.id.isin(df.id)]
#741 movies are in credits but not in df; the left join will not include them

In [None]:
#perform the left merge with df as the left DataFrame and credits as the right DataFrame
#the join key is the id column, which has the same name in both DataFrames
df=df.merge(credits,how="left",on="id")

In [None]:
df
#the merge added two more columns, cast and crew, to df

In [None]:
df.info()
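
As a hedged sketch of how the two `isin` checks and the merge fit together, the example below uses hypothetical toy frames (`movies`, `toy_credits`). It also passes two optional `merge` parameters that the notebook itself does not use: `validate="many_to_one"` raises an error if the right-hand ids are not unique, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical toy frames: movie 3 has no credits, credits 4 has no movie
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
toy_credits = pd.DataFrame({"id": [1, 2, 4], "cast": ["[]", "[]", "[]"]})

merged = movies.merge(
    toy_credits,
    how="left",
    on="id",
    validate="many_to_one",  # fails loudly if credits ids are duplicated
    indicator=True,          # adds the _merge column
)

# A left join keeps every row of the left frame and nothing else
print(len(merged) == len(movies))  # True

# 'left_only' marks movies without a match in credits
print((merged["_merge"] == "left_only").sum())  # 1
```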

## Cleaning and Transforming the new "Cast" Column

4. __Evaluate__ Python expressions in the stringified column "cast" and __remove quotes__ ("") where possible.

5. __Determine__ the __cast size__ for all movies (number of actors) and add the additional column "cast_size".

6. __Extract__ all __actor names__ from the column "cast" and __overwrite__ "cast". If a movie has more than one actor, __separate names by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wallace Shawn|John Ratzenberger|Annie Potts|John Morris|Erik von Detten|Laurie Metcalf|R. Lee Ermey|Sarah Freeman|Penn Jillette'.

7. __Inspect__ cast with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [None]:
import ast
import numpy as np

In [None]:
df.cast[0]   #the value is a stringified list of dictionaries (JSON-like), not a real list

In [None]:
df.cast=df.cast.apply(lambda x: ast.literal_eval(x) if isinstance(x,str) else np.nan)

In [None]:
df.cast[0]
#every element in the list is a dictionary with information on the character and the actor

In [None]:
pd.DataFrame(df.cast[0])
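
The parsing step can be sketched in plain Python. The helper name `parse_nested` is hypothetical, and the extra `try`/`except` for malformed strings is an assumption that goes beyond the lambda used above, which would raise on such input.

```python
import ast
import math

def parse_nested(value):
    """Turn a stringified Python literal into an object, or NaN on failure."""
    if not isinstance(value, str):
        return math.nan          # missing values stay missing
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return math.nan          # malformed strings become missing as well

parsed = parse_nested("[{'name': 'Tom Hanks', 'order': 0}]")
print(type(parsed).__name__)   # list
print(parsed[0]["name"])       # Tom Hanks
```

`ast.literal_eval` only evaluates literals (lists, dictionaries, strings, numbers and similar), so it is a safer choice than `eval` for data read from a file.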

In [None]:
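#determine the length of each cast list; rows without a parsed list get NaN instead of raising a TypeError
df["cast_size"]=df.cast.apply(lambda x:len(x) if isinstance(x,list) else np.nan)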

In [None]:
df.cast_size

In [None]:
df.cast_size.value_counts(dropna=False).head(50)
#most common cast size is 10

In [None]:
df.cast=df.cast.apply(lambda x:'|'.join(i['name'] for i in x) if isinstance(x,list) else np.nan)

In [None]:
df.cast

In [None]:
df.cast[0]
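
The two transformations above (cast size and pipe-separated names) can be illustrated on a single hand-written list. The functions `cast_size` and `cast_names` are hypothetical named versions of the lambdas, written as a sketch so that each step is easy to test in isolation.

```python
import math

def cast_size(cast):
    """Number of cast members, or NaN if cast is not a list."""
    return len(cast) if isinstance(cast, list) else math.nan

def cast_names(cast):
    """Pipe-separated actor names, or NaN if cast is not a list."""
    if not isinstance(cast, list):
        return math.nan
    return "|".join(member["name"] for member in cast)

toy_cast = [{"name": "Tom Hanks"}, {"name": "Tim Allen"}]
print(cast_size(toy_cast))    # 2
print(cast_names(toy_cast))   # Tom Hanks|Tim Allen
print(repr(cast_names([])))   # ''
```

The last line shows why empty strings can appear in the column: joining an empty list yields `''` rather than a missing value.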

In [None]:
df.cast.value_counts(dropna=False).head(50)

In [None]:
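#movies with an empty cast list end up as empty strings; turn them into missing values
#assign the result back, because an inplace replace on df.cast may not update df in newer pandas versions
df["cast"]=df["cast"].replace("",np.nan)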

In [None]:
df.cast.value_counts(dropna=False).head(50)
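
A minimal sketch of the clean-up on a hypothetical Series (`cast` below is made-up data): replacing the empty string with `np.nan` makes those rows count as missing, so that `isna()` and `value_counts(dropna=False)` report them correctly.

```python
import numpy as np
import pandas as pd

# Hypothetical cast column with one empty string
cast = pd.Series(["Tom Hanks|Tim Allen", "", "Robin Williams"])

cleaned = cast.replace("", np.nan)
print(cast.isna().sum())     # 0
print(cleaned.isna().sum())  # 1
```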

## Cleaning and Transforming the new "Crew" Column

8. __Evaluate__ Python expressions in the stringified column "crew" and __remove quotes__ ("") where possible.

9. __Determine__ the __crew size__ for all movies (size of the crew) and add the additional column "crew_size".

10. __Extract__ the __director name__ from the column "crew" and create the new column "director". <br> For example: The value in the first row (Toy Story) should be 'John Lasseter'.

In [None]:
df.crew[0]

In [None]:
#apply literal eval on each and every string to extract the data
df.crew=df.crew.apply(lambda x: ast.literal_eval(x) if isinstance(x,str) else np.nan)

In [None]:
df.crew[0]
#each movie now has a list of dictionaries
#each dictionary describes one crew member

In [None]:
pd.DataFrame(df.crew[0])
#Toy Story has 106 crew members (one row per crew member)

In [None]:
df["crew_size"]=df.crew.apply(lambda x:len(x) if isinstance(x,list) else np.nan)

In [None]:
df.crew_size.value_counts(dropna=False).head(50)
#the most frequent crew sizes are 2, 3 and 1

In [None]:
#extract the director
#each movie's crew is a list of dictionaries, one dictionary per crew member
#iterate through the list and return the name of the first member whose job is 'Director' (e.g. John Lasseter)
#return a missing value if crew is not a list or if no director is listed
def get_director(x):
    if not isinstance(x,list):
        return np.nan
    for i in x:
        if i['job']=='Director':
            return i['name']
    return np.nan

In [None]:
#we can apply this user-defined function to each and every element of crew column
df["director"]=df.crew.apply(get_director)

In [None]:
df.director

In [None]:
df.director.value_counts(dropna=False).head(50)
#over 700 missing values
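
The director lookup can be checked on a hand-written crew list. In the sketch below, `first_director` mirrors the function above in plain Python, while `all_directors` is a hypothetical variant (not part of the task) that keeps every director when a movie has more than one. The crew entries other than John Lasseter are made up.

```python
import math

def first_director(crew):
    """Name of the first crew member with job 'Director', else NaN."""
    if not isinstance(crew, list):
        return math.nan
    for member in crew:
        if member.get("job") == "Director":
            return member["name"]
    return math.nan

def all_directors(crew):
    """Hypothetical variant: pipe-separated names of all directors, else NaN."""
    if not isinstance(crew, list):
        return math.nan
    names = [m["name"] for m in crew if m.get("job") == "Director"]
    return "|".join(names) if names else math.nan

toy_crew = [
    {"job": "Editor", "name": "Jane Doe"},          # made-up entry
    {"job": "Director", "name": "John Lasseter"},
    {"job": "Director", "name": "Sam Example"},     # made-up co-director
]
print(first_director(toy_crew))  # John Lasseter
print(all_directors(toy_crew))   # John Lasseter|Sam Example
```

Returning only the first match keeps the column simple, at the cost of dropping co-directors.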

## Final Steps

11. __Drop__ the column "crew" and __save__ the dataset in a csv-file.

In [None]:
df.head(2)

In [None]:
df.info()

In [None]:
df.drop(columns="crew",inplace=True)

In [None]:
df.to_csv("movies_complete.csv",index=False)
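
To see what `index=False` changes, here is a small round-trip sketch on a hypothetical frame. It writes to an in-memory buffer instead of a file, and the helper `round_trip` is an illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({"id": [1, 2], "director": ["John Lasseter", None]})

def round_trip(frame, **to_csv_kwargs):
    """Write frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

without_index = round_trip(toy, index=False)
with_index = round_trip(toy)  # default index=True

print(list(without_index.columns))  # ['id', 'director']
print(list(with_index.columns))     # ['Unnamed: 0', 'id', 'director']
```

Without `index=False`, the row index is written as an extra unnamed column that reappears when the file is loaded again.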

In [None]:
print("The End")