# Data Cleaning - Tidy up messy Datasets (Movies Dataset)

## First Steps 

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified json data.

In [None]:
import pandas as pd
import numpy as np
pd.options.display.max_columns=30

In [None]:
df=pd.read_csv("movies_metadata.csv",low_memory=False)

In [None]:
df

In [None]:
df.info()

In [None]:
df["genres"]

In [None]:
df["genres"][0]  # the value is a string, not a list

In [None]:
df["belongs_to_collection"]

In [None]:
df["belongs_to_collection"][0]

## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [None]:
df.adult.value_counts()

In [None]:
df.drop(columns=["adult"],inplace=True)

In [None]:
df.drop(columns=["imdb_id"],inplace=True)

In [None]:
df.drop(columns=["original_title"],inplace=True)

In [None]:
df.drop(columns=["video"],inplace=True)

In [None]:
df.drop(columns=["homepage"],inplace=True)
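
The five separate `drop` calls above could also be written as a single call that takes a list of column names. The following is a minimal sketch on a made-up toy frame (the values and the names `toy` and `irrelevant` are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for the movies data (all values are made up).
toy = pd.DataFrame({
    "adult": ["False", "False"],
    "imdb_id": ["tt0000001", "tt0000002"],
    "original_title": ["A", "B"],
    "video": [False, False],
    "homepage": [None, None],
    "title": ["A", "B"],
})

# Drop all irrelevant columns in one call instead of five.
irrelevant = ["adult", "imdb_id", "original_title", "video", "homepage"]
toy = toy.drop(columns=irrelevant)

print(toy.columns.tolist())
```

Assigning the result back avoids `inplace=True` and keeps the step easy to rerun.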

## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [None]:
#cleaning up messy data
import json
import ast
#ast stands for abstract syntax trees

In [None]:
json_col=["belongs_to_collection","genres","production_countries","production_companies","spoken_languages"]


In [None]:
#the outer double quotes delimit the string; the single quotes inside belong to the dictionary keys and values
df.belongs_to_collection[0]

In [None]:
json2='{"dog":3,"cat":5}'
json.loads(json2)         #converted stringified json data into a dictionary

In [None]:
json1="{'dog':3,'cat':5}"  
#json.loads(json1) #raises an error: JSON requires double-quoted keys and strings

In [None]:
json1.replace("'",'"')     #replace single quotes with double quotes, which JSON requires

In [None]:
json.loads(json1.replace("'",'"'))

In [None]:
df.genres.apply(lambda x: json.loads(x.replace("'",'"')))
#in every element x, replace each single quote with a double quote, then parse the result
#converts the stringified json data into a list (this breaks if a value itself contains an apostrophe)

In [None]:
df["genres"][0]

In [None]:
ast.literal_eval(json1)

In [None]:
ast.literal_eval(json2)
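
The two approaches above differ once a value contains an apostrophe. The sketch below uses a made-up record (the string `record` is an assumption for illustration, not a row from the dataset) to show why blind quote replacement is fragile while `ast.literal_eval` still parses the string:

```python
import ast
import json

# Made-up record whose value contains an apostrophe.
record = "{'id': 1, 'name': \"Ocean's Collection\"}"

# Replacing every single quote also rewrites the apostrophe,
# which leaves an invalid JSON document.
try:
    json.loads(record.replace("'", '"'))
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# literal_eval parses the string as a Python literal, so mixed quotes are fine.
parsed = ast.literal_eval(record)

print(json_ok, parsed["name"])
```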

In [None]:
#apply literal_eval on each and every element of the genres column

In [None]:
df["genres"].apply(ast.literal_eval) 
#same result as the json.loads / replace approach, but more robust because no quote replacement is needed
#converts the stringified json data into a list

In [None]:
df.genres=df.genres.apply(ast.literal_eval)

In [None]:
# df.loc[:,json_col].apply(ast.literal_eval,axis=0) 
#-- raises an error: apply passes each whole column (a Series) to literal_eval, and the columns also hold non-string values (e.g. NaN in "belongs_to_collection")

In [None]:
# ast.literal_eval(0) #raises a ValueError because 0 is not a string
#literal_eval does not work if you pass objects other than strings
#so apply literal_eval to string elements only

In [None]:
##part 2

In [None]:
df.belongs_to_collection.apply(lambda x: isinstance(x,str)) #it will check whether element is a string or not

In [None]:
df.belongs_to_collection=df.belongs_to_collection.apply(lambda x: ast.literal_eval(x) if isinstance(x,str) else np.nan)
#checks whether each element of the column is a string
#if it is a string, apply literal_eval; otherwise set it to the missing value np.nan

In [None]:
df.belongs_to_collection

In [None]:
df.spoken_languages

In [None]:
df.spoken_languages=df.spoken_languages.apply(lambda x: ast.literal_eval(x) if isinstance(x,str) else np.nan)

In [None]:
df.production_countries

In [None]:
df.production_countries=df.production_countries.apply(lambda x: ast.literal_eval(x) if isinstance(x,str) else np.nan)

In [None]:
df.production_companies

In [None]:
df.production_companies=df.production_companies.apply(lambda x: ast.literal_eval(x) if isinstance(x,str) else np.nan)

In [None]:
df  # the columns are still nested, but they now hold lists and dictionaries instead of stringified json
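
The same `isinstance` guard is repeated for each column above. One possible way to reduce the repetition is a small helper applied in a loop. This is only a sketch: the helper name `parse_literal` and the toy frame are assumptions, and unlike the cells above it also turns malformed strings into NaN instead of raising an error.

```python
import ast

import numpy as np
import pandas as pd


def parse_literal(value):
    """Evaluate a stringified Python literal; return NaN for anything else."""
    if not isinstance(value, str):
        return np.nan
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Assumption: malformed strings are treated as missing data.
        return np.nan


# Toy frame with made-up values that mimic two of the nested columns.
toy = pd.DataFrame({
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[]"],
    "belongs_to_collection": ["{'id': 1, 'name': 'Toy Collection'}", np.nan],
})

for column in ["genres", "belongs_to_collection"]:
    toy[column] = toy[column].apply(parse_literal)

print(toy)
```

On the real data, the loop would run over `json_col` instead of the two toy columns.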

## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __separate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __separate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

7. __Extract__ all __production country names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __separate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

8. __Extract__ all __production company names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __separate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'.

9. __Inspect__ all columns above with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [None]:
df["belongs_to_collection"][0]

In [None]:
df["belongs_to_collection"]=df["belongs_to_collection"].apply(lambda x: x['name'] if isinstance(x,dict) else np.nan)
#if element is not a dictionary, then we should have a missing value

In [None]:
df["belongs_to_collection"]

In [None]:
df.belongs_to_collection.value_counts(dropna=False).head(20)
#40975 movies dont belong to a collection 

In [None]:
df.genres[0]

In [None]:
df.genres=df.genres.apply(lambda x:"|".join(i['name'] for i in x))
#just retrieve the names of genres

In [None]:
df.genres[0]

In [None]:
df.genres.value_counts(dropna=False).head(20)
#empty string with 2442 instances


In [None]:
#replace empty strings with a missing value
df.genres=df.genres.replace("",np.nan)

In [None]:
df.genres.value_counts(dropna=False).head(20)
#now we have 2442 NaN missing values

In [None]:
df.spoken_languages

In [None]:
df.spoken_languages=df.spoken_languages.apply(lambda x:'|'.join(i['name'] for i in x) if isinstance(x,list) else np.nan)

In [None]:
df.spoken_languages.value_counts(dropna=False).head(20)
#the second most frequent value is an empty string with 3952 occurrences

In [None]:
#replace empty strings with NaN (missing values)
df.spoken_languages=df.spoken_languages.replace("",np.nan)

In [None]:
df.spoken_languages

In [None]:
df.production_countries

In [None]:
df.production_countries=df.production_countries.apply(lambda x:'|'.join(i['name'] for i in x) if isinstance(x,list) else np.nan)

In [None]:
df.production_countries.value_counts(dropna=False).head(20)
#6283 empty strings (effectively missing values) --> convert them to NaN

In [None]:
df.production_countries=df.production_countries.replace("",np.nan)

In [None]:
df.production_countries.value_counts(dropna=False).head(20)

In [None]:
df.production_companies

In [None]:
df.production_companies=df.production_companies.apply(lambda x: '|'.join(i['name'] for i in x) if isinstance(x,list) else np.nan)

In [None]:
df.production_companies

In [None]:
df.production_companies.value_counts(dropna=False).head(20)
#11875 empty strings (effectively missing values)

In [None]:
df.production_companies=df.production_companies.replace("",np.nan)
#convert the empty strings to np.nan

In [None]:
df.production_companies.value_counts(dropna=False).head(20)
#11881 values are np.nan

In [None]:
#check the no of missing values in our columns
df.isna().sum()

In [None]:
#compare it with the original data frame that was uncleaned
pd.read_csv("movies_metadata.csv",low_memory=False).isna().sum()
#missing data doesnt mean uncleaned data
#in original, we have 3 production_companies and 3 production_countries missing values
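
The three pipe-joined columns above follow the same pattern: join the `name` entries, then turn empty results into NaN. A sketch of a reusable helper is shown below. The function name `join_names` and the toy Series are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def join_names(items, sep="|"):
    """Join the 'name' entries of a list of dicts; return NaN if nothing is left."""
    if not isinstance(items, list):
        return np.nan
    joined = sep.join(item["name"] for item in items)
    # An empty list yields "", which is treated as missing here.
    return joined if joined else np.nan


# Toy Series with made-up values: a filled list, an empty list, and NaN.
toy = pd.Series([
    [{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}],
    [],
    np.nan,
])

flat = toy.apply(join_names)
print(flat)
```

Folding the empty-string check into the helper removes the need for a separate `replace("", np.nan)` step per column.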

## Cleaning Numerical Columns

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

11. __Analyze__ the columns __"budget"__ and __"revenue"__ and __"runtime"__. Analyze movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__! 

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

13. __Analyze__ movies with a __vote_count of 0__. What's the __vote_average__ for those movies? Do you think this value is the most appropriate value? __Take reasonable measures__!

In [None]:
df.info() #budget is stored as object dtype and needs to be converted to numeric

In [None]:
#errors="coerce" --> invalid value will be set to nan
df.budget=pd.to_numeric(df.budget,errors="coerce")
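
To see what `errors="coerce"` does in isolation, here is a minimal sketch on a made-up Series (the values, including the stray file name, are assumptions chosen to mimic malformed entries):

```python
import pandas as pd

# Made-up raw values: two valid numbers, one stray string, one missing entry.
raw = pd.Series(["30000000", "0", "/poster.jpg", None])

# Values that cannot be parsed as numbers become NaN instead of raising.
budget = pd.to_numeric(raw, errors="coerce")

print(budget)
print(budget.dtype)
```

Note that the valid `0` is kept as a number at this stage; deciding whether it stands for a missing value is a separate step.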

In [None]:
df.budget.value_counts(dropna=False)
#budget of 0 is the most frequent value

In [None]:
#consider 0 as the missing value np.nan, because 0 is not a valid value
df.budget=df.budget.replace(0,np.nan)

In [None]:
#divide budget by 1 million and rewrite 
df.budget=df.budget.div(1000000)

In [None]:
df.info()
#budget is now float64 instead of object

In [None]:
df.revenue.value_counts(dropna=False)
#vast majority of movies has revenue 0.0
# 0 is a placeholder for missing value

In [None]:
df.revenue=df.revenue.replace(0,np.nan)

In [None]:
df.revenue=df.revenue.div(1000000) #revenue should be in million USD, so divide by 1 million

In [None]:
df.rename(columns={"revenue":"revenue_musd","budget":"budget_musd"},inplace=True)

In [None]:
df.info()
#revenue column has 7408 missing values

In [None]:
df.runtime.value_counts(dropna=False).head(20)
#most frequent value is 0

In [None]:
#0 may indicate missing value so replace 0 by missing value of np.nan
df.runtime=df.runtime.replace(0,np.nan)

In [None]:
df.info()

In [None]:
#convert id to numeric 
#pd.to_numeric(df.id) #it wont work --> gives error 
df.id=pd.to_numeric(df.id,errors="coerce") 

In [None]:
df.id.value_counts(dropna=False).head(20)
#id column has missing value and duplicated values

In [None]:
df.info()

In [None]:
#convert popularity column to numeric
#pd.to_numeric(df.popularity) #gives error 
df.popularity=pd.to_numeric(df.popularity,errors="coerce")

In [None]:
df.popularity.value_counts(dropna=False).head(20)

In [None]:
df.vote_count.value_counts(dropna=False).head(20)
#0 rating for 2899 movies

In [None]:
df.vote_average.value_counts(dropna=False).head(20)
#0 value for 2998 records

In [None]:
df.loc[df.vote_count==0,"vote_average"]

In [None]:
#set vote_average to a missing value where vote_count is 0
df.loc[df.vote_count==0,"vote_average"]=np.nan
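
When setting values through `.loc`, it matters whether `=` or `==` is used: a comparison only returns booleans and leaves the frame unchanged, and a comparison with NaN is always False. A small sketch on a made-up frame (the name `toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up frame: two movies without votes, one with votes.
toy = pd.DataFrame({"vote_count": [0, 12, 0], "vote_average": [0.0, 7.1, 0.0]})

# A comparison returns a boolean Series and modifies nothing;
# comparing against NaN is False for every element.
comparison = toy.loc[toy.vote_count == 0, "vote_average"] == np.nan

# An assignment actually overwrites the selected values.
toy.loc[toy.vote_count == 0, "vote_average"] = np.nan

print(comparison.tolist())
print(toy)
```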

In [None]:
df.info()

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values as NaN.

In [None]:
df.info()

In [None]:
#convert datatype of "release_date" to datetime from object
df.release_date

In [None]:
#pd.to_datetime(df.release_date) #raises an error 
df.release_date=pd.to_datetime(df.release_date,errors="coerce")
#errors="coerce" --> strings that cannot be converted to a datetime become NaT (missing values)
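
The effect of `errors="coerce"` on dates can be checked on a small made-up Series (the values below are assumptions for illustration):

```python
import pandas as pd

# Made-up raw values: one valid date, one unparseable string, one missing entry.
raw = pd.Series(["1995-10-30", "not a date", None])

# Unparseable strings become NaT, the datetime counterpart of NaN.
dates = pd.to_datetime(raw, errors="coerce")

print(dates)
```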

In [None]:
df.info()

In [None]:
df.release_date.value_counts(dropna=False).head(20)
#January 1st appears frequently; it is probably a placeholder for movies where only the release year is known, not the actual date

## Cleaning Text / String Columns

15. __Analyze__ the text columns "overview" and "tagline". Try to identify __missing data that is not represented by NaN__ (e.g. "No Data"). __Replace as NaN__ (np.nan)!

In [None]:
df.info()

In [None]:
df.original_language.value_counts(dropna=False).head(50)
# 50 most frequent languages
# Also have some missing values NaN

In [None]:
df.title

In [None]:
df.title.value_counts(dropna=False).head(20)

In [None]:
df.overview

In [None]:
df.overview.value_counts(dropna=False).head(20)
#954 NaN --> Missing values

In [None]:
#replace the placeholder 'No overview found' with a missing value (NaN)
df.overview=df.overview.replace('No overview found',np.nan)

In [None]:
df.overview=df.overview.replace('No overview',np.nan)

In [None]:
df.overview=df.overview.replace('No movie overview available',np.nan)

In [None]:
#replace a single whitespace character with a missing value
df.overview=df.overview.replace(" ",np.nan)

In [None]:
df.overview=df.overview.replace("No overview yet",np.nan)

In [None]:
df.tagline.value_counts(dropna=False).head(50)

In [None]:
df.tagline=df.tagline.replace("-",np.nan)
#now the dataset is a little bit cleaner 
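
Several placeholder strings can also be replaced in one call by passing a list to `replace`. A minimal sketch on a made-up Series (the list name `placeholders` and the sample texts are assumptions; the placeholder strings are the ones handled above):

```python
import numpy as np
import pandas as pd

# Placeholder strings that stand for a missing overview.
placeholders = [
    "No overview found",
    "No overview",
    "No movie overview available",
    " ",
    "No overview yet",
]

# Made-up overview column with one real text and three placeholders.
overview = pd.Series([
    "A cowboy doll is threatened.",
    "No overview",
    " ",
    "No overview found",
])

# Only exact matches are replaced; longer texts containing these words are kept.
overview = overview.replace(placeholders, np.nan)

print(overview)
```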

## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [None]:
df[df.duplicated(keep=False)]
#returns all duplicate rows
#check whether we have duplicates or not and then we are filtering df

In [None]:
df[df.duplicated(keep=False)].sort_values(by="id") 

In [None]:
#for each group of fully identical rows, keep the first instance and drop the rest
df.drop_duplicates(inplace=True)

In [None]:
df[df.duplicated(subset="id",keep=False)].sort_values(by="id")
#filter records that have an identical movie id

In [None]:
df.drop_duplicates(subset="id",inplace=True)
#some id values are still duplicated, so keep only the first row for each id

In [None]:
df.id.value_counts(dropna=False)
#now no duplicates
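
The two-stage approach above (first fully identical rows, then rows sharing an id) can be traced on a small made-up frame. The frame `toy` and its values are assumptions for illustration:

```python
import pandas as pd

# Rows 0 and 1 are fully identical; rows 2 and 3 share an id but differ in popularity.
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "title": ["A", "A", "B", "B", "C"],
    "popularity": [1.0, 1.0, 2.0, 2.5, 3.0],
})

# keep=False marks every member of a duplicate group, not just the repeats.
full_dupes = toy[toy.duplicated(keep=False)]

# Stage 1: remove fully identical rows (the first occurrence is kept).
toy = toy.drop_duplicates()

# Stage 2: find and remove rows that share an id (the first occurrence is kept).
id_dupes = toy[toy.duplicated(subset="id", keep=False)]
toy = toy.drop_duplicates(subset="id")

print(toy)
```

Dropping by `id` alone keeps whichever row comes first, so sorting beforehand would be needed if a specific row should survive.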

## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [None]:
df.info()

In [None]:
#for missing values
#1] do nothing or
#2] remove missing values or remove entire row/ or entire column that exceed certain no of missing values
#3] replacing missing value with replacement values [done for ML purposes]
df.isna().sum()   #directly gives the no of missing values
#40946 missing values for belongs_to_collection

In [None]:
df[df.title.isna()]

In [None]:
#drop all rows or movies where we have missing values for titles
df.dropna(subset=["id","title"],inplace=True)
#dropna drops every row that has at least one missing value in the subset, i.e. in either the id or the title column

In [None]:
#now we wont have any missing values in the id and title column
df.isna().sum()

In [None]:
df[df.title.isna()]

In [None]:
#convert the datatype value of the id column from float to integer
df.id=df.id.astype("int")

In [None]:
#check for movies (each records) how many non missing values ie valid values
df.notna().sum(axis=1)
#for Toy Story we have 18 non-missing values out of 18 columns


In [None]:
df.notna().sum(axis=1).value_counts().sort_values(ascending=False)
#6 movies have only 7 non missing values

In [None]:
df[df.notna().sum(axis=1)==7]   #filter those movies where we have 7 non-missing values

In [None]:
#set a threshold of at least 10 non-missing values
#remove all movies with fewer than 10 non-missing values
df.dropna(thresh=10,inplace=True)
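
The `thresh` argument counts non-missing values per row, which can be verified on a small made-up frame (the frame `toy` and the threshold of 2 are assumptions chosen to keep the example short):

```python
import numpy as np
import pandas as pd

# Made-up frame: the rows have 4, 2 and 1 non-missing values respectively.
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [1, 2, np.nan],
    "c": [1, 2, np.nan],
    "d": [1, np.nan, 4],
})

# Number of non-missing values per row.
non_missing = toy.notna().sum(axis=1)

# Keep only rows with at least 2 non-missing values.
kept = toy.dropna(thresh=2)

print(non_missing.tolist())
print(kept)
```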

In [None]:
df.info()

In [None]:
df.isna().sum()

## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

20. The Order of the columns should be as follows: 

In [None]:
["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget_musd", "revenue_musd", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

21. __Reset__ the Index and create a __RangeIndex__.

22. __Save__ the cleaned dataset in a __csv-file__.

In [None]:
df.status.value_counts()

In [None]:
#select only those rows with the value released 
df=df.loc[df["status"]=="Released"].copy()

In [None]:
df.drop(columns=["status"],inplace=True)

In [None]:
col=["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget_musd", "revenue_musd", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

In [None]:
df=df.loc[:,col]

In [None]:
df

In [None]:
df.reset_index(drop=True,inplace=True)
#discard the old index and create a new RangeIndex from 0 to 44810
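
The final steps (filter by status, drop the column, reorder, reset the index) can be chained. Below is a sketch on a made-up frame with only three columns; the frame `toy`, its values and the shortened column list are assumptions:

```python
import pandas as pd

# Made-up frame with one unreleased movie.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "status": ["Released", "Rumored", "Released"],
    "id": [10, 20, 30],
})

# Keep released movies only, then drop the now constant status column.
toy = toy.loc[toy["status"] == "Released"].drop(columns=["status"])

# Reorder the columns and replace the gappy index with a fresh RangeIndex.
toy = toy.loc[:, ["id", "title"]].reset_index(drop=True)

print(toy)
```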

In [None]:
df.info()

In [None]:
df.poster_path[0]

In [None]:
base_poster_url='http://image.tmdb.org/t/p/w185/'
df.poster_path="<img src='"+ base_poster_url + df.poster_path + "' style='height:100px;'>"


In [None]:
df.poster_path[0]

In [None]:
df.to_csv("movies_clean.csv",index=False)

In [None]:
pd.read_csv("movies_clean.csv").info()
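
A CSV file stores no dtype information, so a datetime column is read back as text unless it is parsed again. The sketch below demonstrates this with an in-memory buffer instead of a file (the frame `toy` and the buffer are assumptions; `parse_dates` is a regular `read_csv` parameter):

```python
import io

import pandas as pd

# Made-up one-row frame with a datetime column.
toy = pd.DataFrame({"id": [862], "release_date": pd.to_datetime(["1995-10-30"])})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: release_date comes back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates: release_date is converted back to datetime.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["release_date"])

print(plain.dtypes)
print(parsed.dtypes)
```

For the cleaned file, the equivalent call would be `pd.read_csv("movies_clean.csv", parse_dates=["release_date"])`.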

In [None]:
print("The end")