# Project 3: Data Cleaning - Tidy up messy Datasets (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 3 on your own. If you need any help or inspiration, have a look at the Videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__, not about finding identical code. Things can be coded in many different ways. Even if you come to the same conclusions, it's very unlikely that your code will be exactly the same as ours.

## First Steps 

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified JSON data.

In [393]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 30

In [394]:
df = pd.read_csv("movies_metadata.csv", low_memory=False)
df.head()

   adult                              belongs_to_collection    budget  \
0  False  {'id': 10194, 'name': 'Toy Story Collection', ...  30000000   
1  False                                                NaN  65000000   
2  False  {'id': 119050, 'name': 'Grumpy Old Men Collect...         0   
3  False                                                NaN  16000000   
4  False  {'id': 96871, 'name': 'Father of the Bride Col...         0   

In [395]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re
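
The nested columns can be spotted by eye in the output above, but a small check can also flag them. The following is only a sketch on a toy frame: `looks_nested` is a hypothetical helper (not part of pandas), and the sample values are illustrative stand-ins for the real file.

```python
import pandas as pd

# Toy frame standing in for movies_metadata.csv (values are illustrative only)
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[{'id': 12, 'name': 'Adventure'}]"],
    "belongs_to_collection": ["{'id': 10194, 'name': 'Toy Story Collection'}", None],
})

def looks_nested(series):
    """Return True if every non-missing value is a string starting with '[' or '{'."""
    values = series.dropna()
    if values.empty:
        return False
    flags = values.map(lambda v: isinstance(v, str) and v.lstrip().startswith(("[", "{")))
    return bool(flags.all())

# Collect the names of all columns that look like stringified lists/dicts
nested_cols = [col for col in toy.columns if looks_nested(toy[col])]
print(nested_cols)
```

On the real dataset you would run the same list comprehension over `df.columns`; a single malformed row would make a column fail this strict check, so treat the result as a hint rather than a guarantee.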

## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [396]:
df.drop(columns = ['adult', 'imdb_id', 'original_title', 'video', 'homepage'], inplace = True)

## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [397]:
import json
import ast

In [398]:
json_cols = ["belongs_to_collection", "genres", "production_countries", 
            "production_companies", "spoken_languages"]

In [399]:
df.genres.apply(lambda x: json.loads(x.replace("'", '"')))[0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [400]:
df.genres.apply(ast.literal_eval)[0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [401]:
df.genres = df.genres.apply(ast.literal_eval)
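
Both approaches above give the same result for "genres", but they are not equivalent in general. The sketch below uses a made-up value (the name "Director's Cut" is hypothetical) to show where the quote-replacement trick with `json.loads` can break, while `ast.literal_eval` still works.

```python
import ast
import json

# Hypothetical stringified value whose name contains an apostrophe
raw = "[{'id': 1, 'name': \"Director's Cut\"}]"

# ast.literal_eval reads the text as a Python literal, so mixed quotes are fine
parsed = ast.literal_eval(raw)
print(parsed[0]["name"])

# Blindly swapping ' for " turns the apostrophe into a stray quote -> invalid JSON
try:
    json.loads(raw.replace("'", '"'))
    json_ok = True
except json.JSONDecodeError:
    json_ok = False
print(json_ok)
```

This is one reason to prefer `ast.literal_eval` for columns such as "production_companies", where names with apostrophes are more likely.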

## How to handle stringified JSON columns (Part 2)

In [402]:
df.belongs_to_collection

0        {'id': 10194, 'name': 'Toy Story Collection', ...
1                                                      NaN
2        {'id': 119050, 'name': 'Grumpy Old Men Collect...
3                                                      NaN
4        {'id': 96871, 'name': 'Father of the Bride Col...
                               ...                        
45461                                                  NaN
45462                                                  NaN
45463                                                  NaN
45464                                                  NaN
45465                                                  NaN
Name: belongs_to_collection, Length: 45466, dtype: object

In [403]:
df.belongs_to_collection.apply(lambda x: isinstance(x, str))

0         True
1        False
2         True
3        False
4         True
         ...  
45461    False
45462    False
45463    False
45464    False
45465    False
Name: belongs_to_collection, Length: 45466, dtype: bool

In [404]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [405]:
df.belongs_to_collection[0]

{'id': 10194,
 'name': 'Toy Story Collection',
 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',
 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}

In [406]:
df.spoken_languages

0                 [{'iso_639_1': 'en', 'name': 'English'}]
1        [{'iso_639_1': 'en', 'name': 'English'}, {'iso...
2                 [{'iso_639_1': 'en', 'name': 'English'}]
3                 [{'iso_639_1': 'en', 'name': 'English'}]
4                 [{'iso_639_1': 'en', 'name': 'English'}]
                               ...                        
45461               [{'iso_639_1': 'fa', 'name': 'فارسی'}]
45462                    [{'iso_639_1': 'tl', 'name': ''}]
45463             [{'iso_639_1': 'en', 'name': 'English'}]
45464                                                   []
45465             [{'iso_639_1': 'en', 'name': 'English'}]
Name: spoken_languages, Length: 45466, dtype: object

In [407]:
df.spoken_languages = df.spoken_languages.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [408]:
df.production_countries = df.production_countries.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [409]:
df.production_companies = df.production_companies.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)
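
The same lambda is repeated for every column above. As a sketch, the logic could be moved into one helper and applied in a loop. `parse_literal` is a hypothetical name, and, unlike the cells above, it also returns NaN for strings that cannot be parsed (an assumption about what you want in that case).

```python
import ast

import numpy as np
import pandas as pd

def parse_literal(value):
    """Evaluate a stringified Python literal; return NaN for non-strings or unparsable text."""
    if not isinstance(value, str):
        return np.nan
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return np.nan

# Toy frame with two of the stringified columns (values are illustrative only)
toy = pd.DataFrame({
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[]"],
    "spoken_languages": ["[{'iso_639_1': 'en', 'name': 'English'}]", np.nan],
})

# Apply the same conversion to every nested column
for col in ["genres", "spoken_languages"]:
    toy[col] = toy[col].apply(parse_literal)

print(toy.loc[0, "genres"])
```

With the real data, the loop would run over `json_cols`.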

## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

In [410]:
df.belongs_to_collection[0]

{'id': 10194,
 'name': 'Toy Story Collection',
 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',
 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}

In [411]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x: x["name"] if isinstance(x, dict) else np.nan)

5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __separate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

In [412]:
df.genres[0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [413]:
df.genres = df.genres.apply(lambda x: "|".join(i["name"] for i in x))

In [414]:
df.genres.value_counts(dropna = False)

genres
Drama                              5000
Comedy                             3621
Documentary                        2723
                                   2442
Drama|Romance                      1301
                                   ... 
Action|Drama|Comedy|Documentary       1
War|Drama|History|Thriller            1
Horror|Drama|History|Thriller         1
Comedy|Crime|Action|Drama             1
Family|Animation|Romance|Comedy       1
Name: count, Length: 4069, dtype: int64

In [415]:
df.genres = df.genres.replace("", np.nan)

6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __separate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

In [416]:
df.spoken_languages

0                 [{'iso_639_1': 'en', 'name': 'English'}]
1        [{'iso_639_1': 'en', 'name': 'English'}, {'iso...
2                 [{'iso_639_1': 'en', 'name': 'English'}]
3                 [{'iso_639_1': 'en', 'name': 'English'}]
4                 [{'iso_639_1': 'en', 'name': 'English'}]
                               ...                        
45461               [{'iso_639_1': 'fa', 'name': 'فارسی'}]
45462                    [{'iso_639_1': 'tl', 'name': ''}]
45463             [{'iso_639_1': 'en', 'name': 'English'}]
45464                                                   []
45465             [{'iso_639_1': 'en', 'name': 'English'}]
Name: spoken_languages, Length: 45466, dtype: object

In [417]:
df.spoken_languages = df.spoken_languages.apply(lambda x: "|".join(
    i["name"] for i in x) if isinstance(x, list) else np.nan)

In [418]:
df.spoken_languages.value_counts(dropna = False)

spoken_languages
English                           22395
                                   3952
Français                           1853
日本語                                1289
Italiano                           1218
                                  ...  
English|日本語|Latin                     1
Deutsch||ελληνικά|English             1
English|suomi|Deutsch|svenska         1
English|Français|Deutsch|فارسی        1
Fulfulde|English                      1
Name: count, Length: 1843, dtype: int64

In [419]:
df.spoken_languages = df.spoken_languages.replace("", np.nan)

7. __Extract__ all __production country names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __separate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

In [420]:
df.production_countries

0        [{'iso_3166_1': 'US', 'name': 'United States o...
1        [{'iso_3166_1': 'US', 'name': 'United States o...
2        [{'iso_3166_1': 'US', 'name': 'United States o...
3        [{'iso_3166_1': 'US', 'name': 'United States o...
4        [{'iso_3166_1': 'US', 'name': 'United States o...
                               ...                        
45461               [{'iso_3166_1': 'IR', 'name': 'Iran'}]
45462        [{'iso_3166_1': 'PH', 'name': 'Philippines'}]
45463    [{'iso_3166_1': 'US', 'name': 'United States o...
45464             [{'iso_3166_1': 'RU', 'name': 'Russia'}]
45465     [{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]
Name: production_countries, Length: 45466, dtype: object

In [421]:
df.production_countries = df.production_countries.apply(lambda x: "|".join(i["name"] for i in x) if isinstance(x, list) else np.nan)

In [422]:
df.production_countries.value_counts(dropna = False)

production_countries
United States of America                  17851
                                           6282
United Kingdom                             2238
France                                     1654
Japan                                      1356
                                          ...  
Romania|United Kingdom|Canada                 1
Finland|Germany|Netherlands                   1
France|Denmark|Spain|Sweden                   1
France|United States of America|Canada        1
Egypt|Italy|United States of America          1
Name: count, Length: 2391, dtype: int64

In [423]:
df.production_countries = df.production_countries.replace("", np.nan)

8. __Extract__ all __production company names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __separate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'.

In [424]:
df.production_companies

0           [{'name': 'Pixar Animation Studios', 'id': 3}]
1        [{'name': 'TriStar Pictures', 'id': 559}, {'na...
2        [{'name': 'Warner Bros.', 'id': 6194}, {'name'...
3        [{'name': 'Twentieth Century Fox Film Corporat...
4        [{'name': 'Sandollar Productions', 'id': 5842}...
                               ...                        
45461                                                   []
45462               [{'name': 'Sine Olivia', 'id': 19653}]
45463    [{'name': 'American World Pictures', 'id': 6165}]
45464                 [{'name': 'Yermoliev', 'id': 88753}]
45465                                                   []
Name: production_companies, Length: 45466, dtype: object

In [425]:
df.production_companies = df.production_companies.apply(
    lambda x: "|".join(i["name"] for i in x) if isinstance(x, list) else np.nan)

In [426]:
df.production_companies.value_counts(dropna = False)

production_companies
                                                                                                                                              11875
Metro-Goldwyn-Mayer (MGM)                                                                                                                       742
Warner Bros.                                                                                                                                    540
Paramount Pictures                                                                                                                              505
Twentieth Century Fox Film Corporation                                                                                                          439
                                                                                                                                              ...  
HBO Films|Moving Pictures                                                                  

In [427]:
df.production_companies = df.production_companies.replace("", np.nan)
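
Steps 5 to 8 follow the same pattern: join the "name" entries with a pipe, then turn empty strings into NaN. A possible way to bundle this is sketched below. `join_names` is a hypothetical helper, and it makes one choice the cells above do not: it skips empty names, so values like 'Deutsch||ελληνικά|English' (see the spoken_languages output above) would not be produced.

```python
import numpy as np
import pandas as pd

def join_names(value, sep="|"):
    """Join the 'name' entries of a list of dicts; NaN for non-lists or empty results."""
    if not isinstance(value, list):
        return np.nan
    # Skip entries whose name is missing or an empty string
    joined = sep.join(item["name"] for item in value if item.get("name"))
    return joined if joined else np.nan

# Toy Series covering the cases seen above (values are illustrative only)
toy = pd.Series([
    [{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}],
    [],                                   # empty list
    [{"iso_639_1": "tl", "name": ""}],    # entry with an empty name
    np.nan,                               # missing value
])

result = toy.apply(join_names)
print(result.tolist())
```

Whether dropping empty names is appropriate depends on your analysis; keeping them, as the notebook cells do, is equally defensible.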

9. __Inspect__ all columns above with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [428]:
df.isna().sum()

belongs_to_collection    40975
budget                       0
genres                    2442
id                           0
original_language           11
overview                   954
popularity                   5
poster_path                386
production_companies     11881
production_countries      6288
release_date                87
revenue                      6
runtime                    263
spoken_languages          3958
status                      87
tagline                  25054
title                        6
vote_average                 6
vote_count                   6
dtype: int64

In [429]:
pd.read_csv("movies_metadata.csv", low_memory=False).isna().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64
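
Comparing the two listings above by eye works, but the difference can also be computed directly. This is a minimal sketch on toy frames (`before` and `after` are stand-ins for the raw and the processed DataFrame); columns that exist in only one frame are dropped from the comparison.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: "before" mimics the raw file, "after" the processed frame
before = pd.DataFrame({
    "adult": ["False", "False"],
    "genres": ["[]", "[{'name': 'Drama'}]"],
    "title": ["A", "B"],
})
after = pd.DataFrame({"genres": [np.nan, "Drama"], "title": ["A", "B"]})

# Missing values per column after minus before; unmatched columns become NaN and are dropped
added_nan = (after.isna().sum() - before.isna().sum()).dropna().astype(int)
print(added_nan[added_nan > 0])
```

Columns with a positive difference are the ones where the cleaning steps introduced new NaN values, for example by converting empty lists.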

## Cleaning Numerical Columns

In [430]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4491 non-null   object 
 1   budget                 45466 non-null  object 
 2   genres                 43024 non-null  object 
 3   id                     45466 non-null  object 
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45461 non-null  object 
 7   poster_path            45080 non-null  object 
 8   production_companies   33585 non-null  object 
 9   production_countries   39178 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue                45460 non-null  float64
 12  runtime                45203 non-null  float64
 13  spoken_languages       41508 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

In [431]:
df.budget = pd.to_numeric(df.budget, errors = "coerce")
df.id = pd.to_numeric(df.id, errors = "coerce")
df.popularity = pd.to_numeric(df.popularity, errors = "coerce")
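
To see what `errors = "coerce"` does, and which raw values it silently turns into NaN, a small toy example can help. The strings below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative raw values: two numeric strings, one junk string, one missing value
raw = pd.Series(["30000000", "0", "not_a_number.jpg", None])

# Invalid strings become NaN instead of raising an error
budget = pd.to_numeric(raw, errors="coerce")
print(budget)

# Rows that were present in the raw data but could not be converted
bad_rows = raw[budget.isna() & raw.notna()]
print(bad_rows)
```

Inspecting `bad_rows` before overwriting a column is a cheap way to check that only genuinely invalid entries are lost.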

11. __Analyze__ the columns __"budget"__, __"revenue"__ and __"runtime"__. Analyze movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__! 

In [432]:
df.budget = df.budget.replace(0, np.nan)
df.revenue = df.revenue.replace(0, np.nan)
df.runtime = df.runtime.replace(0, np.nan)
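
Before replacing the zeros, it can be worth checking how common they are. The sketch below uses a toy frame with made-up numbers; the idea is that a large share of zeros hints at "unknown" rather than a true value of 0.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up figures
toy = pd.DataFrame({
    "budget": [30000000.0, 0.0, 0.0, 65000000.0],
    "revenue": [373000000.0, 0.0, 81000000.0, 0.0],
    "runtime": [81.0, 0.0, 104.0, 101.0],
})

# Share of rows with the value 0 in each column
zero_share = (toy == 0).mean()
print(zero_share)

# Treat 0 as "unknown" and replace it with NaN in all three columns at once
cleaned = toy.replace(0, np.nan)
print(cleaned.isna().sum())
```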

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

In [433]:
df.budget = df.budget.div(1000000)
df.revenue = df.revenue.div(1000000)

In [434]:
df.rename(columns = {"revenue":"revenue_musd", "budget":"budget_musd"}, inplace = True)

13. __Analyze__ movies with a __vote_count of 0__. What's the __vote_average__ for those movies? Do you think this value is the most appropriate value? __Take reasonable measures__!

In [435]:
df.vote_count.value_counts(dropna = False)

vote_count
1.0       3264
2.0       3132
0.0       2899
3.0       2787
4.0       2480
          ... 
2755.0       1
1187.0       1
4200.0       1
3322.0       1
2712.0       1
Name: count, Length: 1821, dtype: int64

In [436]:
df.loc[df.vote_count == 0, "vote_average"] = np.nan
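
The analysis part of this step can be done by looking at the vote_average values of the unrated movies before overwriting them. A small sketch on toy data, assuming such rows carry a placeholder average of 0.0 (the numbers are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: two movies without votes, one with many votes
toy = pd.DataFrame({
    "vote_count": [0.0, 0.0, 5000.0],
    "vote_average": [0.0, 0.0, 7.7],
})

# Which averages are reported for movies that nobody voted on?
no_votes = toy.vote_count == 0
print(toy.loc[no_votes, "vote_average"].value_counts(dropna=False))

# An average over zero votes is undefined, so mark it as missing
toy.loc[no_votes, "vote_average"] = np.nan
print(toy)
```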

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values to NaT (the datetime counterpart of NaN).

In [437]:
df.release_date = pd.to_datetime(df.release_date, errors = "coerce")

In [438]:
df.release_date.value_counts(dropna = False)

release_date
2008-01-01    136
2009-01-01    121
2007-01-01    118
2005-01-01    111
2006-01-01    101
             ... 
1957-09-26      1
1938-11-21      1
1936-08-19      1
2010-01-27      1
1917-10-21      1
Name: count, Length: 17334, dtype: int64
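
As a quick illustration of `errors = "coerce"` for dates, the toy example below shows that unparsable strings and missing values both end up as NaT, and that the result supports the `.dt` accessor. The sample strings are made up.

```python
import pandas as pd

# One valid date, one junk string, one missing value
raw = pd.Series(["1995-10-30", "not a date", None])

# Unparsable entries become NaT instead of raising an error
dates = pd.to_datetime(raw, errors="coerce")
print(dates)

# Datetime columns give access to components such as the year
print(dates.dt.year)
```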

## Cleaning Text / String Columns

15. __Analyze__ the text columns "overview" and "tagline". Try to identify __missing data that is not represented by NaN__ (e.g. "No Data"). __Replace it with NaN__ (np.nan)!

In [439]:
df.overview.value_counts(dropna = False)

overview
NaN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      954
No overview found.                                                                                                                                                                                                                

In [440]:
placeholders = ["No overview found.", "No Overview", "No movie overview available.",
                " ", "No overview yet."]
df.overview = df.overview.replace(placeholders, np.nan)

In [441]:
df.tagline.value_counts(dropna = False)

tagline
NaN                                                           25054
Based on a true story.                                            7
Trust no one.                                                     4
Be careful what you wish for.                                     4
-                                                                 4
                                                              ...  
A special force in a special kind of hell!                        1
Play it. Sing it. Shout it. Feel it.                              1
If It's On TV, It Must Be The Truth.                              1
"I LOVE YOU BABY, BUT MY WIFE JUST REFUSES TO UNDERSTAND!"        1
A deadly game of wits.                                            1
Name: count, Length: 20284, dtype: int64

In [442]:
df.tagline = df.tagline.replace("-", np.nan)
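
Placeholder texts tend to show up either as values that repeat across many movies or as very short strings. The sketch below looks for both on a toy Series; the threshold of 5 characters is an arbitrary assumption, and the sample texts are illustrative.

```python
import numpy as np
import pandas as pd

overview = pd.Series([
    "A cowboy doll is threatened by a new spaceman figure.",
    "No overview found.",
    "No overview found.",
    " ",
    np.nan,
])

# Real descriptions are rarely identical, so repeated values are suspicious
counts = overview.value_counts()
repeated = counts[counts > 1]
print(repeated)

# Very short texts (after stripping whitespace) are suspicious as well
short = overview[overview.str.strip().str.len() < 5]
print(short)

# Replace the identified placeholders in one call
cleaned = overview.replace(["No overview found.", " "], np.nan)
print(cleaned.isna().sum())
```

Both checks only produce candidates; each value still needs a quick manual look before it is replaced.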

## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [443]:
# df[df.duplicated(keep = False)].sort_values("id")

In [444]:
df.drop_duplicates(inplace = True)

In [445]:
# df[df.duplicated(subset = "id", keep = False)].sort_values(by = "id")

In [446]:
df.drop_duplicates(subset="id", inplace=True)
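
The two passes above target different kinds of duplicates: fully identical rows, and rows that share an id but differ elsewhere. A toy example (made-up values) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "title": ["A", "A", "B", "B", "C"],
    "popularity": [1.0, 1.0, 2.0, 2.5, 3.0],
})

# Rows that are identical in every column (ids 1 and 1)
full_dupes = toy.duplicated(keep=False).sum()

# Rows that share an id, even if other columns differ (ids 1, 1, 2, 2)
id_dupes = toy.duplicated(subset="id", keep=False).sum()
print(full_dupes, id_dupes)

# First remove exact copies, then keep the first row per id
deduped = toy.drop_duplicates().drop_duplicates(subset="id")
print(deduped)
```

Note that `drop_duplicates(subset = "id")` keeps the first occurrence by default, so which of the conflicting rows survives depends on row order.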

## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

In [447]:
df.isna().sum()

belongs_to_collection    40946
budget_musd              36554
genres                    2442
id                           1
original_language           11
overview                  1104
popularity                   4
poster_path                386
production_companies     11872
production_countries      6283
release_date                88
revenue_musd             38036
runtime                   1819
spoken_languages          3954
status                      85
tagline                  25037
title                        4
vote_average              2900
vote_count                   4
dtype: int64

In [448]:
df.dropna(subset=["id", "title"], inplace=True)

In [449]:
df.id = df.id.astype("int")

In [450]:
df.notna().sum(axis = 1).value_counts().sort_values(ascending = False)

15    12522
16    11454
14     5424
17     4265
18     3859
13     3040
12     1891
19     1132
11     1020
10      511
9       184
8       104
7        20
6         4
Name: count, dtype: int64

18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [451]:
df.dropna(thresh = 10, inplace = True)
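
The `thresh` parameter counts non-missing values per row, which matches the `notna().sum(axis = 1)` check above. A small sketch with a toy frame and a threshold of 2 instead of 10:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [1, 2, np.nan],
    "c": [1, 2, np.nan],
    "d": [1, np.nan, 4],
})

# Number of non-missing values in each row
non_missing = toy.notna().sum(axis=1)
print(non_missing.tolist())

# Keep only rows with at least 2 non-missing values
kept = toy.dropna(thresh=2)
print(kept.index.tolist())
```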

In [452]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45118 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4487 non-null   object        
 1   budget_musd            8878 non-null   float64       
 2   genres                 42969 non-null  object        
 3   id                     45118 non-null  int32         
 4   original_language      45107 non-null  object        
 5   overview               44142 non-null  object        
 6   popularity             45118 non-null  float64       
 7   poster_path            44886 non-null  object        
 8   production_companies   33561 non-null  object        
 9   production_countries   39147 non-null  object        
 10  release_date           45078 non-null  datetime64[ns]
 11  revenue_musd           7398 non-null   float64       
 12  runtime                43552 non-null  float64       
 13  spoken

In [453]:
df.isna().sum()

belongs_to_collection    40631
budget_musd              36240
genres                    2149
id                           0
original_language           11
overview                   976
popularity                   0
poster_path                232
production_companies     11557
production_countries      5971
release_date                40
revenue_musd             37720
runtime                   1566
spoken_languages          3656
status                      66
tagline                  24722
title                        0
vote_average              2658
vote_count                   0
dtype: int64

## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

In [454]:
df.status.value_counts(dropna = False)

status
Released           44691
Rumored              226
Post Production       98
NaN                   66
In Production         20
Planned               15
Canceled               2
Name: count, dtype: int64

In [455]:
df = df.loc[df.status == "Released"].copy()

In [456]:
df.drop(columns = ["status"], inplace = True)

20. The Order of the columns should be as follows: 

In [457]:
col = ["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget_musd", "revenue_musd", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

df = df.loc[:, col]

21. __Reset__ the Index and create a __RangeIndex__.

In [458]:
df.reset_index(drop = True, inplace =True)

In [459]:
df.poster_path[0]

'/rhIRbceoE9lR4veEXuwCC2wARtG.jpg'

In [460]:
# base_poster_url = 'https://www.themoviedb.org/t/p/original'
base_poster_url = 'http://image.tmdb.org/t/p/w185/'
df.poster_path = "<img src='" + base_poster_url + df.poster_path + "' style='height:100px;'>"

22. __Save__ the cleaned dataset in a __csv-file__.

In [461]:
df.to_csv("movies_clean.csv", index = False)

In [462]:
pd.read_csv("movies_clean.csv").info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     44691 non-null  int64  
 1   title                  44691 non-null  object 
 2   tagline                20284 non-null  object 
 3   release_date           44657 non-null  object 
 4   genres                 42586 non-null  object 
 5   belongs_to_collection  4463 non-null   object 
 6   original_language      44681 non-null  object 
 7   budget_musd            8854 non-null   float64
 8   revenue_musd           7385 non-null   float64
 9   production_companies   33356 non-null  object 
 10  production_countries   38835 non-null  object 
 11  vote_count             44691 non-null  float64
 12  vote_average           42077 non-null  float64
 13  popularity             44691 non-null  float64
 14  runtime                43179 non-null  float64
 15  ov
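
Note that "release_date" comes back with dtype object in the output above: a csv-file stores dates as plain text. If you need datetime values after reloading, one option is the `parse_dates` parameter of `pd.read_csv`. The sketch below uses an in-memory buffer and a made-up row instead of the real file.

```python
import io

import pandas as pd

# Toy frame with a datetime column (values are illustrative only)
clean = pd.DataFrame({
    "id": [1],
    "title": ["Toy Story"],
    "release_date": pd.to_datetime(["1995-10-30"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Plain reload: the date column is read back as text
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain.dtypes)

# Reload with parse_dates: the column is converted to datetime again
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["release_date"])
print(parsed.dtypes)
```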