# Project 3: Data Cleaning - Tidy up messy Datasets (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 3 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__, not about reproducing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, it's very unlikely that your code matches ours exactly.

In [1]:
import pandas as pd
pd.options.display.max_columns = 30

## First Steps 

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified json data.

In [2]:
data = pd.read_csv("movies_metadata.csv",low_memory=False)
data.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


The nested columns are: <br>
- `genres`
- `production_companies`
- `production_countries`
- `spoken_languages`
- `belongs_to_collection`
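
The columns listed above can also be found programmatically instead of by eye. The following is a minimal sketch, not part of the original notebook: the helper names `looks_nested` and `nested_columns`, the sample size, and the 50% threshold are all assumptions, and the tiny `demo` frame only mimics a few values from the rows shown above.

```python
import ast
import pandas as pd

def looks_nested(value):
    """Return True if a string evaluates to a Python list or dict."""
    if not isinstance(value, str):
        return False
    text = value.strip()
    if not text.startswith(("[", "{")):
        return False
    try:
        return isinstance(ast.literal_eval(text), (list, dict))
    except (ValueError, SyntaxError):
        return False

def nested_columns(df, sample_size=100):
    """List columns whose first non-null values mostly look nested (hypothetical helper)."""
    found = []
    for col in df.columns:
        sample = df[col].dropna().head(sample_size)
        if len(sample) > 0 and sample.map(looks_nested).mean() > 0.5:
            found.append(col)
    return found

# Small stand-in for movies_metadata.csv
demo = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[{'id': 12, 'name': 'Adventure'}]"],
    "belongs_to_collection": ["{'id': 10194, 'name': 'Toy Story Collection'}", None],
})
print(nested_columns(demo))
```

On the real dataset the same call would be `nested_columns(data)`; only a sample of each column is parsed, so it stays fast.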



In [3]:
data['production_countries'][0]

"[{'iso_3166_1': 'US', 'name': 'United States of America'}]"

In [4]:
data['belongs_to_collection'][0]

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

Most of the columns are read as generic `object` dtype. For instance, `adult` should be of type `bool` and `budget` should be numeric. <br>
<span style="color:red">Note:</span> The `homepage` column has a lot of missing values (only 7782 of 45466 entries are non-null).

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [6]:
data.drop(columns=['adult', 'imdb_id', 'original_title', 'video', 'homepage'], inplace=True)


## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [7]:
import json
import ast
import numpy as np

In [8]:
data['belongs_to_collection'] = data['belongs_to_collection'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [9]:
data['spoken_languages'] = data['spoken_languages'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)
data['production_countries'] = data['production_countries'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)
data['production_companies'] = data['production_companies'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)
data['genres'] = data['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [10]:
data['spoken_languages'][0]

[{'iso_639_1': 'en', 'name': 'English'}]
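
The stringified columns use single quotes, so they are Python literals rather than valid JSON. That is why `ast.literal_eval` is used here instead of `json.loads`. The sketch below illustrates the difference; `safe_eval` is a hypothetical helper (not in the original notebook) that additionally returns NaN for strings that cannot be parsed at all.

```python
import ast
import json
import numpy as np

raw = "[{'iso_639_1': 'en', 'name': 'English'}]"

# json.loads rejects single-quoted keys and strings
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

def safe_eval(value):
    """Evaluate a Python literal; return NaN for non-strings and unparsable strings."""
    if not isinstance(value, str):
        return np.nan
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return np.nan

parsed = safe_eval(raw)
print(json_ok, parsed)
```

`safe_eval` could be passed to `.apply()` in place of the lambdas above if a column contains malformed strings.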

## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

In [11]:
data['belongs_to_collection'] = data['belongs_to_collection'].apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)

In [12]:
data['belongs_to_collection'].value_counts(dropna=False)

NaN                              40975
The Bowery Boys                     29
Totò Collection                     27
James Bond Collection               26
Zatôichi: The Blind Swordsman       26
                                 ...  
Glass Tiger collection               1
Kathleen Madigan Collection          1
The Big Bottom Box                   1
Joséphine - Saga                     1
Red Lotus Collection                 1
Name: belongs_to_collection, Length: 1696, dtype: int64

5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __separate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

In [13]:
data['genres'] = data['genres'].apply(lambda x: "|".join(genre['name'] for genre in x) if isinstance(x, list) else np.nan)

In [14]:
data['genres']

0         Animation|Comedy|Family
1        Adventure|Fantasy|Family
2                  Romance|Comedy
3            Comedy|Drama|Romance
4                          Comedy
                   ...           
45461                Drama|Family
45462                       Drama
45463       Action|Drama|Thriller
45464                            
45465                            
Name: genres, Length: 45466, dtype: object

6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __separate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

In [15]:
data['spoken_languages'][1200]

[{'iso_639_1': 'gd', 'name': ''},
 {'iso_639_1': 'it', 'name': 'Italiano'},
 {'iso_639_1': 'yi', 'name': ''},
 {'iso_639_1': 'en', 'name': 'English'}]

In [16]:
data['spoken_languages'] = data['spoken_languages'].apply(lambda x: "|".join(lang['name'] for lang in x) if isinstance(x,list) else np.nan)

In [17]:
data['spoken_languages'].value_counts(dropna=False)

English                           22395
                                   3952
Français                           1853
日本語                                1289
Italiano                           1218
                                  ...  
English|日本語|Latin                     1
Deutsch||ελληνικά|English             1
English|suomi|Deutsch|svenska         1
English|Français|Deutsch|فارسی        1
Fulfulde|English                      1
Name: spoken_languages, Length: 1843, dtype: int64

7. __Extract__ all __production country names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __separate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

In [18]:
data['production_countries'] = data['production_countries'].apply(lambda x: "|".join(country['name'] for country in x) if isinstance(x,list) else np.nan)

In [19]:
data['production_countries'].value_counts(dropna=False)

United States of America                  17851
                                           6282
United Kingdom                             2238
France                                     1654
Japan                                      1356
                                          ...  
Romania|United Kingdom|Canada                 1
Finland|Germany|Netherlands                   1
France|Denmark|Spain|Sweden                   1
France|United States of America|Canada        1
Egypt|Italy|United States of America          1
Name: production_countries, Length: 2391, dtype: int64

8. __Extract__ all __production company names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __separate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'.

In [20]:
data['production_companies'] = data['production_companies'].apply(lambda x: "|".join(company['name'] for company in x) if isinstance(x,list) else np.nan)
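
The four list-valued columns are flattened with the same pattern, so the lambdas can be folded into one reusable function. The sketch below is one possible variant, not the notebook's method: `join_names` is a hypothetical helper, and unlike the cells above it skips empty names and returns NaN when nothing is left, which is an assumption about the desired result.

```python
import numpy as np
import pandas as pd

def join_names(items, sep="|"):
    """Join the 'name' entries of a list of dicts; NaN for non-lists or empty results."""
    if not isinstance(items, list):
        return np.nan
    names = [d["name"] for d in items if isinstance(d, dict) and d.get("name")]
    return sep.join(names) if names else np.nan

# Made-up rows in the shape produced by ast.literal_eval
demo = pd.DataFrame({
    "genres": [[{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}], []],
    "spoken_languages": [
        [{"iso_639_1": "gd", "name": ""}, {"iso_639_1": "en", "name": "English"}],
        np.nan,
    ],
})

for col in ["genres", "spoken_languages"]:
    demo[col] = demo[col].apply(join_names)

print(demo)
```

Skipping empty names avoids entries with doubled pipes, at the cost of silently dropping languages whose name is missing in the source.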

9. __Inspect__ all columns above with value_counts(). Do you see anything strange? __Take reasonable measures__!

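Most values in the flattened columns are preserved, but the number of missing values increases slightly, as the two outputs below show: `belongs_to_collection` goes from 40972 to 40975 NaNs, and `production_companies` and `production_countries` go from 3 to 6 each. The extra NaNs come from entries that did not evaluate to a dict or list and were therefore replaced with NaN.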

In [21]:
data.isna().sum()

belongs_to_collection    40975
budget                       0
genres                       0
id                           0
original_language           11
overview                   954
popularity                   5
poster_path                386
production_companies         6
production_countries         6
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
vote_average                 6
vote_count                   6
dtype: int64

In [22]:
pd.read_csv('movies_metadata.csv',low_memory=False).isna().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64
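
The blank entries in the `value_counts()` outputs above are empty strings, which `isna()` does not count as missing. One reasonable measure is to replace them with NaN. A minimal sketch on made-up rows (the `demo` frame and the column selection are assumptions):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "genres": ["Animation|Comedy", "", "Drama"],
    "production_countries": ["United States of America", "", np.nan],
})
cols = ["genres", "production_countries"]

# Empty strings are invisible to isna(), so count them explicitly first
empty_before = (demo[cols] == "").sum()

demo[cols] = demo[cols].replace("", np.nan)
missing_after = demo[cols].isna().sum()

print(empty_before)
print(missing_after)
```

On the real data the same `replace` could be applied to `genres`, `spoken_languages`, `production_countries` and `production_companies`.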

## Cleaning Numerical Columns

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

In [23]:
data['budget'].dtype

dtype('O')

In [24]:
data['budget']= pd.to_numeric(data['budget'],errors='coerce')

In [25]:
data['id']= pd.to_numeric(data['id'],errors='coerce')
data['popularity']= pd.to_numeric(data['popularity'],errors='coerce')
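
With `errors='coerce'`, values that cannot be parsed become NaN instead of raising an error. Before overwriting a column it can help to look at which values would be lost. A minimal sketch; the Series below is made up, including the invalid entry:

```python
import pandas as pd

# Made-up sample: two valid numbers, one invalid string, one missing value
raw = pd.Series(["30000000", "0", "not_a_number.jpg", None])

numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but could not be converted
invalid = raw[numeric.isna() & raw.notna()]

print(numeric)
print(invalid)
```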

In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4491 non-null   object 
 1   budget                 45463 non-null  float64
 2   genres                 45466 non-null  object 
 3   id                     45463 non-null  float64
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45460 non-null  float64
 7   poster_path            45080 non-null  object 
 8   production_companies   45460 non-null  object 
 9   production_countries   45460 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue                45460 non-null  float64
 12  runtime                45203 non-null  float64
 13  spoken_languages       45460 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

11. __Analyze__ the columns __"budget"__, __"revenue"__ and __"runtime"__. Analyze movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__! 

In [27]:
data['budget'].value_counts(dropna=False)

0.0           36573
5000000.0       286
10000000.0      259
20000000.0      243
2000000.0       242
              ...  
9750000.0         1
7275000.0         1
78146652.0        1
280.0             1
1254040.0         1
Name: budget, Length: 1224, dtype: int64

In [28]:
data['budget'] = data['budget'].replace(0,np.nan)
data['revenue'] = data['revenue'].replace(0,np.nan)
data['runtime'] = data['runtime'].replace(0,np.nan)

In [29]:
data['budget'].head()

0    30000000.0
1    65000000.0
2           NaN
3    16000000.0
4           NaN
Name: budget, dtype: float64
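
Treating 0 as "unknown" matters because zeros are counted in aggregates while NaN is skipped. A small worked example using the budgets of the first five rows shown above:

```python
import numpy as np
import pandas as pd

budget = pd.Series([30000000.0, 65000000.0, 0.0, 16000000.0, 0.0])

mean_with_zeros = budget.mean()        # zeros are treated as real budgets
cleaned = budget.replace(0, np.nan)
mean_without_zeros = cleaned.mean()    # NaN values are skipped by default

print(mean_with_zeros, mean_without_zeros)
```

The mean rises from 22.2 million to 37 million once the placeholder zeros are excluded.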

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

In [30]:
data['budget'] = data['budget'].div(1000000)
data['revenue'] = data['revenue'].div(1000000)

In [31]:
data.rename(columns={'budget':'budget_musd', 'revenue':'revenue_musd'},inplace=True)

In [32]:
data.head()

Unnamed: 0,belongs_to_collection,budget_musd,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue_musd,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,15602.0,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16.0,Comedy|Drama|Romance,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,,Comedy,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0


13. __Analyze__ movies with a __vote_count of 0__. What's the __vote_average__ for those movies? Do you think this value is the most appropriate value? __Take reasonable measures__!

Movies with no votes have an average rating of 0, which is a placeholder rather than a real rating, so it is treated as missing.

In [33]:
data[data['vote_count'] == 0] = np.nan
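
The assignment above replaces every column of the matching rows with NaN, not only the rating. If only `vote_average` should be treated as missing, a column-specific assignment with `.loc` is one option. A minimal sketch on made-up rows (the second movie is hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "title": ["Toy Story", "Hypothetical Film"],
    "vote_average": [7.7, 0.0],
    "vote_count": [5415.0, 0.0],
})

no_votes = demo["vote_count"] == 0

# Only the rating is blanked; title and vote_count stay intact
demo.loc[no_votes, "vote_average"] = np.nan

print(demo)
```

Note that the outputs later in this notebook reflect the row-wide assignment, so counts would differ with this variant.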

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values as NaN.

In [34]:
data['release_date'].dtype

dtype('O')

In [35]:
data['release_date'] = pd.to_datetime(data['release_date'],errors='coerce')

In [36]:
data['release_date']

0       1995-10-30
1       1995-12-15
2       1995-12-22
3       1995-12-22
4       1995-02-10
           ...    
45461          NaT
45462   2011-11-17
45463   2003-08-01
45464          NaT
45465          NaT
Name: release_date, Length: 45466, dtype: datetime64[ns]
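
For datetimes, `errors='coerce'` produces `NaT` (the datetime counterpart of NaN) for values that cannot be parsed. A minimal sketch on a made-up Series:

```python
import pandas as pd

raw = pd.Series(["1995-10-30", "not a date", None])

dates = pd.to_datetime(raw, errors="coerce")

print(dates)
print(dates.isna().tolist())
```

`isna()` treats `NaT` as missing, so the usual missing-value tools keep working on the converted column.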

## Cleaning Text / String Columns

15. __Analyze__ the text columns "overview" and "tagline". Try to identify __missing data that is not represented by NaN__ (e.g. "No Data"). __Replace as NaN__ (np.nan)!

In [37]:
data['overview'].value_counts(dropna=False).head(50)

NaN    3652
No o

In [38]:
data['overview'] = data['overview'].replace(['No overview found.','No Overview','No movie overview available'], np.nan)

In [39]:
data['overview'].value_counts(dropna=False)

NaN    3775

In [40]:
data['tagline'] = data['tagline'].replace('-', np.nan)

In [41]:
data['tagline'].value_counts(dropna=False)

NaN                                                                         25689
Based on a true story.                                                          7
Be careful what you wish for.                                                   4
Trust no one.                                                                   4
The end is near.                                                                3
                                                                            ...  
For some men, the sky was the limit. For him, it was just the beginning.        1
The deeper you go, the weirder life gets.                                       1
"oh yeah?" "Oh yeah!"                                                           1
Willy Wonka is semi-sweet and nuts.                                             1
A deadly game of wits.                                                          1
Name: tagline, Length: 19657, dtype: int64
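
Placeholder texts tend to show up as values that repeat across many movies or that consist only of whitespace. The sketch below shows one way to collect candidates; the sample texts are made up, and the rule "anything that repeats is a placeholder" is an assumption that needs a manual check, since short taglines such as "Based on a true story." repeat legitimately.

```python
import numpy as np
import pandas as pd

overview = pd.Series([
    "A hypothetical plot summary.",
    "No overview found.",
    "No overview found.",
    " ",
    np.nan,
])

# Strip whitespace so that blank-looking strings become ""
stripped = overview.str.strip()

# Repeated values are candidates; review this list by hand before replacing
counts = stripped.value_counts()
repeated = counts[counts > 1].index.tolist()

cleaned = stripped.replace(repeated + [""], np.nan)

print(repeated)
print(cleaned)
```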

## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4433 non-null   object        
 1   budget_musd            8800 non-null   float64       
 2   genres                 42567 non-null  object        
 3   id                     42564 non-null  float64       
 4   original_language      42561 non-null  object        
 5   overview               41691 non-null  object        
 6   popularity             42561 non-null  float64       
 7   poster_path            42457 non-null  object        
 8   production_companies   42561 non-null  object        
 9   production_countries   42561 non-null  object        
 10  release_date           42527 non-null  datetime64[ns]
 11  revenue_musd           7371 non-null   float64       
 12  runtime                41190 non-null  float64       
 13  s

In [43]:
data[data.duplicated(keep=False)].sort_values('id')

Unnamed: 0,belongs_to_collection,budget_musd,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue_musd,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
7345,,,Crime|Drama|Thriller,5511.0,fr,Hitman Jef Costello is a perfectionist who alw...,9.091288,/cvNW8IXigbaMNo4gKEIps0NGnhA.jpg,Fida cinematografica|Compagnie Industrielle et...,France|Italy,1967-10-25,0.039481,105.0,Français,Released,There is no solitude greater than that of the ...,Le Samouraï,7.9,187.0
9165,,,Crime|Drama|Thriller,5511.0,fr,Hitman Jef Costello is a perfectionist who alw...,9.091288,/cvNW8IXigbaMNo4gKEIps0NGnhA.jpg,Fida cinematografica|Compagnie Industrielle et...,France|Italy,1967-10-25,0.039481,105.0,Français,Released,There is no solitude greater than that of the ...,Le Samouraï,7.9,187.0
24844,,,Comedy|Drama,11115.0,en,As an ex-gambler teaches a hot-shot college ki...,6.880365,/kHaBqrrozaG7rj6GJg3sUCiM29B.jpg,Andertainment Group|Crescent City Pictures|Tag...,United States of America,2008-01-29,,85.0,English,Released,,Deal,5.2,22.0
14012,,,Comedy|Drama,11115.0,en,As an ex-gambler teaches a hot-shot college ki...,6.880365,/kHaBqrrozaG7rj6GJg3sUCiM29B.jpg,Andertainment Group|Crescent City Pictures|Tag...,United States of America,2008-01-29,,85.0,English,Released,,Deal,5.2,22.0
22151,,,Action|Horror|Science Fiction,18440.0,en,When a comet strikes Earth and kicks up a clou...,1.436085,/tWCyKXHuSrQdLAvNeeVJBnhf1Yv.jpg,,United States of America,2007-01-01,,89.0,English,Released,,Days of Darkness,5.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45432,,,,,,,,,,,NaT,,,,,,,,
45434,,,,,,,,,,,NaT,,,,,,,,
45452,,,,,,,,,,,NaT,,,,,,,,
45464,,,,,,,,,,,NaT,,,,,,,,


In [44]:
data.drop_duplicates(subset='id',keep = 'first',inplace=True)

In [45]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42538 entries, 0 to 45463
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4430 non-null   object        
 1   budget_musd            8790 non-null   float64       
 2   genres                 42537 non-null  object        
 3   id                     42537 non-null  float64       
 4   original_language      42531 non-null  object        
 5   overview               41661 non-null  object        
 6   popularity             42534 non-null  float64       
 7   poster_path            42427 non-null  object        
 8   production_companies   42534 non-null  object        
 9   production_countries   42534 non-null  object        
 10  release_date           42500 non-null  datetime64[ns]
 11  revenue_musd           7361 non-null   float64       
 12  runtime                41163 non-null  float64       
 13  s

In [46]:
data[data.duplicated(keep=False)]

Unnamed: 0,belongs_to_collection,budget_musd,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue_musd,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
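
Two details of the duplicate handling are easy to miss: `duplicated(keep=False)` flags every copy rather than only the later ones, and pandas treats NaN values as equal to each other, so rows with a missing `id` collapse into a single row. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "id": [5511.0, 5511.0, 11115.0, np.nan, np.nan],
    "title": ["Le Samouraï", "Le Samouraï", "Deal", np.nan, np.nan],
})

# keep=False marks all members of a duplicate group
all_copies = demo[demo.duplicated(keep=False)]

# One row per id survives, including a single row with a missing id
deduped = demo.drop_duplicates(subset="id", keep="first")

print(all_copies)
print(deduped)
```

This matches the `info()` output above, where the frame has one more entry than it has non-null `id` values.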


## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

In [47]:
data['title'].isna().sum()

4

In [48]:
data.drop_duplicates(subset=['id','title'],keep=False,inplace=True)

In [49]:
data['title'].isna().sum()

4
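
In the output above the number of missing titles is still 4 after the call, because `drop_duplicates` only removes repeated rows and does not look at missing values. A direct way to drop rows with an unknown id or title is `dropna` with the `subset` argument. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "id": [862.0, np.nan, 8844.0],
    "title": ["Toy Story", np.nan, np.nan],
})

# A row is dropped if id OR title is missing
kept = demo.dropna(subset=["id", "title"])

print(kept)
print(kept["title"].isna().sum())
```

Applied to the real data this would be `data.dropna(subset=['id', 'title'], inplace=True)`; the later outputs in this notebook were produced without it.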

18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [50]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42538 entries, 0 to 45463
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4430 non-null   object        
 1   budget_musd            8790 non-null   float64       
 2   genres                 42537 non-null  object        
 3   id                     42537 non-null  float64       
 4   original_language      42531 non-null  object        
 5   overview               41661 non-null  object        
 6   popularity             42534 non-null  float64       
 7   poster_path            42427 non-null  object        
 8   production_companies   42534 non-null  object        
 9   production_countries   42534 non-null  object        
 10  release_date           42500 non-null  datetime64[ns]
 11  revenue_musd           7361 non-null   float64       
 12  runtime                41163 non-null  float64       
 13  s

In [51]:
data = data[data.notna().sum(axis=1) >= 10]
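
The boolean mask above can also be expressed with the `thresh` argument of `dropna`, which keeps rows that have at least that many non-NaN values. A minimal sketch on a made-up frame showing that both forms select the same rows:

```python
import numpy as np
import pandas as pd

# 3 rows, 12 columns, all filled
demo = pd.DataFrame(np.ones((3, 12)))
demo.iloc[1, :3] = np.nan   # row 1: 9 non-NaN values left
demo.iloc[2, :2] = np.nan   # row 2: 10 non-NaN values left

by_mask = demo[demo.notna().sum(axis=1) >= 10]
by_thresh = demo.dropna(thresh=10)

print(by_mask.index.tolist(), by_thresh.index.tolist())
```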

In [52]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42534 entries, 0 to 45463
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4429 non-null   object        
 1   budget_musd            8790 non-null   float64       
 2   genres                 42534 non-null  object        
 3   id                     42534 non-null  float64       
 4   original_language      42528 non-null  object        
 5   overview               41658 non-null  object        
 6   popularity             42534 non-null  float64       
 7   poster_path            42427 non-null  object        
 8   production_companies   42534 non-null  object        
 9   production_countries   42534 non-null  object        
 10  release_date           42500 non-null  datetime64[ns]
 11  revenue_musd           7361 non-null   float64       
 12  runtime                41163 non-null  float64       
 13  s

## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

In [53]:
data = data[data['status'] == 'Released'].copy()

In [54]:
data.drop(columns='status',inplace=True)

In [55]:
data.head()

Unnamed: 0,belongs_to_collection,budget_musd,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue_musd,runtime,spoken_languages,tagline,title,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,,Toy Story,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,15602.0,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16.0,Comedy|Drama|Romance,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,,Comedy,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0


20. The Order of the columns should be as follows: 

In [56]:
col = ["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget_musd", "revenue_musd", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

In [57]:
data = data.loc[:, col]

21. __Reset__ the Index and create a __RangeIndex__.

In [58]:
data.reset_index(drop=True,inplace=True)

In [59]:
data.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path
0,862.0,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,United States of America,5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,8844.0,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,15602.0,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,United States of America,92.0,6.5,11.7129,101.0,A family wedding reignites the ancient feud be...,English,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3,31357.0,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,United States of America,34.0,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4,11862.0,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,United States of America,173.0,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,/e64sOI48hQXyru7naBFyssKFxVd.jpg


In [60]:
base_poster_url = 'http://image.tmdb.org/t/p/w185/'
data['poster_path'] = "<img src='" + base_poster_url + data['poster_path'] + "' style='height:100px;'>"
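
Since `base_poster_url` ends with a slash and the values in `poster_path` start with one, the concatenation above yields a double slash in the URL. If that is not wanted, the leading slash can be stripped first. A minimal sketch; the second value is a made-up missing entry to show that NaN stays NaN:

```python
import numpy as np
import pandas as pd

base_poster_url = "http://image.tmdb.org/t/p/w185/"
poster_path = pd.Series(["/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", np.nan])

# Strip the leading slash so the joined URL contains a single separator
img_tag = (
    "<img src='" + base_poster_url
    + poster_path.str.lstrip("/")
    + "' style='height:100px;'>"
)

print(img_tag[0])
```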

22. __Save__ the cleaned dataset in a __csv-file__.

In [61]:
data.to_csv("clean_data.csv",index = False)
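
A CSV file stores only text, so the datetime dtype of `release_date` is not restored automatically when the cleaned file is loaded again. Passing `parse_dates` to `read_csv` is one way to get it back. A minimal sketch that uses an in-memory buffer instead of `clean_data.csv` so it runs on its own:

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "id": [862.0],
    "title": ["Toy Story"],
    "release_date": pd.to_datetime(["1995-10-30"]),
})

# Write to an in-memory buffer (stand-in for a file on disk)
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                                   # dates come back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["release_date"])   # dates restored

print(plain["release_date"].dtype, parsed["release_date"].dtype)
```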