# Project 3: Data Cleaning - Tidy up messy Datasets (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 3 on your own. If you need any help or inspiration, have a look at the Videos. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__. It's not about finding the identical code. Things can be coded in many different ways. Even if you come to the same conclusions, it's very unlikely that your code will match ours exactly.

## First Steps 

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified JSON data.

In [295]:
import pandas as pd
import numpy as np
import json
import ast
pd.options.display.max_columns = 30

In [296]:
df = pd.read_csv('movies_metadata.csv', low_memory=False)

In [297]:
df.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
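One way to spot the stringified columns without reading every cell by eye is to check whether the non-missing values of a column all start with `[` or `{`. The sketch below uses a tiny hand-made sample rather than the real file; the helper name `looks_stringified` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Tiny stand-in for movies_metadata.csv (hypothetical sample values).
sample = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[{'id': 12, 'name': 'Adventure'}]"],
    "belongs_to_collection": ["{'id': 10194, 'name': 'Toy Story Collection'}", None],
    "budget": ["30000000", "65000000"],
})

def looks_stringified(col: pd.Series) -> bool:
    """True if every non-missing value is a string starting with '[' or '{'."""
    values = col.dropna()
    if values.empty:
        return False
    return all(isinstance(v, str) and v.lstrip().startswith(("[", "{")) for v in values)

# Collect the names of all columns that look like stringified lists/dicts.
nested_cols = [name for name in sample.columns if looks_stringified(sample[name])]
print(nested_cols)
```

On the real DataFrame the same check could be run as `[c for c in df.columns if looks_stringified(df[c])]`. It is only a heuristic, so the result should still be compared with `df.head()`.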


## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [298]:
df.drop(columns = ['adult', 'imdb_id', 'original_title', 'video','homepage'], inplace = True)

## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [299]:
df.tail(3)

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
45463,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",67758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,0,[],227506,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,0.0,0.0
45465,,0,[],461257,en,50 years after decriminalisation of homosexual...,0.163015,/s5UkZt6NTsrS7ZF0Rh8nzupRlIU.jpg,[],"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",2017-06-09,0.0,75.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Queerama,0.0,0.0


In [300]:
#df.belongs_to_collection[0].replace("'", '"' )

In [301]:
#df.genres.apply(lambda x: json.loads(x.replace("'", '"' )))[0]

#### Parsing the genres column from strings into lists of dictionaries

In [302]:
df.genres = df.genres.apply(ast.literal_eval)

#### Parsing the belongs_to_collection column from strings into dictionaries

In [303]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [304]:
df.belongs_to_collection[0]

{'id': 10194,
 'name': 'Toy Story Collection',
 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',
 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}
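The commented-out `json.loads` attempts above swap single quotes for double quotes before parsing. That tends to break as soon as a value itself contains an apostrophe, which is one reason `ast.literal_eval` is the safer choice for these Python-style strings. A minimal sketch, using a made-up collection name (not taken from the dataset):

```python
import ast
import json

# Hypothetical raw value: a Python-style dict string whose name contains an apostrophe.
raw = "{'id': 1, 'name': \"Ocean's Eleven Collection\"}"

# Naive quote swapping turns the apostrophe into a stray double quote,
# so the result is no longer valid JSON.
try:
    json.loads(raw.replace("'", '"'))
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# literal_eval parses the string as a Python literal, so mixed quoting is fine.
parsed = ast.literal_eval(raw)

print(json_ok)
print(parsed["name"])
```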

#### Parsing the production_countries column from strings into lists of dictionaries

In [305]:
df.production_countries = df.production_countries.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [306]:
df.production_countries[0]

[{'iso_3166_1': 'US', 'name': 'United States of America'}]

#### Parsing the spoken_languages column from strings into lists of dictionaries

In [307]:
df.spoken_languages = df.spoken_languages.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [308]:
df.spoken_languages[0]

[{'iso_639_1': 'en', 'name': 'English'}]

#### Parsing the production_companies column from strings into lists of dictionaries

In [309]:
df.production_companies = df.production_companies.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [310]:
df.production_companies[0]

[{'name': 'Pixar Animation Studios', 'id': 3}]

In [311]:
df.head(2)

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0


## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

In [312]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)

5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __separate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

In [313]:
df.genres = df.genres.apply(lambda x: "|".join(i['name'] for i in x))

In [314]:
df.genres.head(2)

0     Animation|Comedy|Family
1    Adventure|Fantasy|Family
Name: genres, dtype: object
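Steps 5 to 8 all apply the same pattern, so the lambda can be factored into one small helper. The sketch below is one possible version; the name `join_names` and the sample values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def join_names(value, sep="|"):
    """Join the 'name' entries of a list of dicts; anything else becomes NaN."""
    if isinstance(value, list):
        return sep.join(item["name"] for item in value)
    return np.nan

# Hypothetical parsed column: a normal list, an empty list and a missing value.
nested = pd.Series([
    [{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}],
    [],
    np.nan,
])
flat = nested.apply(join_names)
print(flat)
```

Note that an empty list produces an empty string rather than NaN, which matters when the flattened columns are inspected with `value_counts()` later on.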

6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __separate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

In [315]:
df.spoken_languages = df.spoken_languages.apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x, list) else np.nan)

In [316]:
df.spoken_languages.head(2)

0             English
1    English|Français
Name: spoken_languages, dtype: object

7. __Extract__ all __production countries names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __separate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

In [317]:
df.production_countries = df.production_countries.apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x, list) else np.nan)

In [318]:
df.production_countries.head(2)

0    United States of America
1    United States of America
Name: production_countries, dtype: object

8. __Extract__ all __production companies names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __separate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'.

In [319]:
df.production_companies = df.production_companies.apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x, list) else np.nan)

In [320]:
df.production_companies.head(2)

0                              Pixar Animation Studios
1    TriStar Pictures|Teitler Film|Interscope Commu...
Name: production_companies, dtype: object

9. __Inspect__ all columns above with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [321]:
df.belongs_to_collection.value_counts(dropna = False).head(3)

NaN                40975
The Bowery Boys       29
Totò Collection       27
Name: belongs_to_collection, dtype: int64

In [322]:
df.genres.value_counts(dropna = False).head(5)

Drama            5000
Comedy           3621
Documentary      2723
                 2442
Drama|Romance    1301
Name: genres, dtype: int64

In [323]:
df.production_countries.value_counts(dropna = False).head(5)

United States of America    17851
                             6282
United Kingdom               2238
France                       1654
Japan                        1356
Name: production_countries, dtype: int64

In [324]:
df.spoken_languages.value_counts(dropna = False).head(5)

English     22395
             3952
Français     1853
日本語          1289
Italiano     1218
Name: spoken_languages, dtype: int64

In [325]:
df.production_companies.value_counts(dropna = False).head(5)

                                          11875
Metro-Goldwyn-Mayer (MGM)                   742
Warner Bros.                                540
Paramount Pictures                          505
Twentieth Century Fox Film Corporation      439
Name: production_companies, dtype: int64
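The blank labels in the outputs above are empty strings, which come from joining empty lists. Following the hint for step 9, a reasonable measure is to replace them with NaN. A minimal sketch on made-up data (the DataFrame `demo` is an assumption used only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical flattened columns containing empty strings.
demo = pd.DataFrame({
    "genres": ["Drama", "", "Comedy"],
    "spoken_languages": ["English", "", ""],
})

cols = ["genres", "spoken_languages"]
# Exact-match replacement: only cells equal to "" are turned into NaN.
demo[cols] = demo[cols].replace("", np.nan)

missing = demo[cols].isna().sum()
print(missing)
```

On the real DataFrame the column list would presumably be `["genres", "production_countries", "spoken_languages", "production_companies"]`.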

## Cleaning Numerical Columns

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

In [326]:
df.budget = pd.to_numeric(df.budget, errors = 'coerce')

In [352]:
df.id = pd.to_numeric(df.id, errors = 'coerce')

In [353]:
df.popularity = pd.to_numeric(df.popularity, errors = 'coerce')
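To see what `errors = 'coerce'` does, it can help to run `pd.to_numeric` on a few hand-picked values first. The values below are made up for illustration; the date-like string stands in for any entry that cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical raw id values: two valid numbers, one malformed entry, one missing.
raw_ids = pd.Series(["862", "8844", "1997-08-20", None])

# Unparseable entries become NaN instead of raising an error.
ids = pd.to_numeric(raw_ids, errors="coerce")
print(ids)
```

Because NaN is a float, a column with coerced values ends up as `float64` even when all valid entries are whole numbers.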

11. __Analyze__ the columns __"budget"__ and __"revenue"__ and __"runtime"__. Analyze movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__! 

#### Replacing budget entries of 0 with NaN

In [327]:
df.budget = df.budget.replace(0, np.nan)


In [328]:
df.budget.head()

0    30000000.0
1    65000000.0
2           NaN
3    16000000.0
4           NaN
Name: budget, dtype: float64

#### Replacing revenue entries of 0 with NaN

In [329]:
df.revenue = df.revenue.replace(0, np.nan)


In [330]:
df.revenue.head(4)

0    373554033.0
1    262797249.0
2            NaN
3     81452156.0
Name: revenue, dtype: float64

In [331]:
df.runtime = df.runtime.replace(0, np.nan)
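Before replacing, it is worth checking how common the value 0 is in each column, since a large share of zeros suggests that 0 means "unknown" rather than a true value. The sketch below does both steps on made-up numbers; `demo` and `zero_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical values; 0 stands in for "not reported".
demo = pd.DataFrame({
    "budget": [30000000.0, 0.0, 16000000.0],
    "revenue": [373554033.0, 0.0, 0.0],
    "runtime": [81.0, 0.0, 127.0],
})

# Fraction of zeros per column (mean of a boolean mask).
zero_share = (demo == 0).mean()

# Replace 0 with NaN in all three columns at once.
demo = demo.replace(0, np.nan)
print(zero_share)
print(demo.isna().sum())
```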

In [332]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
belongs_to_collection    4491 non-null object
budget                   8890 non-null float64
genres                   45466 non-null object
id                       45466 non-null object
original_language        45455 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45460 non-null object
production_countries     45460 non-null object
release_date             45379 non-null object
revenue                  7408 non-null float64
runtime                  43645 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object
tagline                  20412 non-null object
title                    45460 non-null object
vote_average             45460 non-null float64
vote_count               45460 non-null floa

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

#### Dividing budget values by one million

In [333]:
df.budget = df.budget.div(1_000_000)

#### Dividing revenue values by one million

In [334]:
df.revenue = df.revenue.div(1_000_000)

#### Inspecting the first budget and revenue values (now in million USD)

In [335]:
df[['budget', 'revenue']].head(5)

Unnamed: 0,budget,revenue
0,30.0,373.554033
1,65.0,262.797249
2,,
3,16.0,81.452156
4,,76.578911


13. __Analyze__ movies with a __vote_count of 0__. What's the __vote_average__ for those movies? Do you think this value is the most appropriate value? __Take reasonable measures__!

#### Movies with a vote_count of 0 have no ratings, so their vote_average of 0 is replaced with NaN

In [349]:
df.vote_average.value_counts(dropna = False).head(1)

0.0    2998
Name: vote_average, dtype: int64

In [354]:
df.loc[df.vote_count ==0, 'vote_average'] = np.nan
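The mask matters here: only averages that belong to movies without any votes are set to NaN, while a vote_average of 0 that is backed by actual votes is kept. A small sketch on made-up rows (the DataFrame `votes` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: row 2 has real votes but an average of 0.
votes = pd.DataFrame({
    "vote_count": [5415.0, 0.0, 3.0, 0.0],
    "vote_average": [7.7, 0.0, 0.0, 0.0],
})

# Boolean mask selecting movies that nobody has rated.
no_votes = votes["vote_count"] == 0
votes.loc[no_votes, "vote_average"] = np.nan
print(votes)
```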

In [355]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
belongs_to_collection    4491 non-null object
budget                   8890 non-null float64
genres                   45466 non-null object
id                       45463 non-null float64
original_language        45455 non-null object
overview                 44512 non-null object
popularity               45460 non-null float64
poster_path              45080 non-null object
production_companies     45460 non-null object
production_countries     45460 non-null object
release_date             45379 non-null object
revenue                  7408 non-null float64
runtime                  43645 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object
tagline                  20412 non-null object
title                    45460 non-null object
vote_average             42561 non-null float64
vote_count               45460 non-null fl

#### Now Budget, id and popularity are all numeric values 

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values as NaT (the datetime equivalent of NaN).

In [356]:
df.release_date = pd.to_datetime(df.release_date, errors = 'coerce')

In [360]:
df.release_date.head(3)

0   1995-10-30
1   1995-12-15
2   1995-12-22
Name: release_date, dtype: datetime64[ns]

In [363]:
df.release_date.value_counts(dropna=False).head(10)

2008-01-01    136
2009-01-01    121
2007-01-01    118
2005-01-01    111
2006-01-01    101
2002-01-01     96
2004-01-01     90
NaT            90
2001-01-01     84
2003-01-01     76
Name: release_date, dtype: int64
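The `NaT` entry in the output above is the datetime counterpart of NaN. The following sketch, using made-up input strings, shows how `errors = 'coerce'` produces it for both unparseable and missing values.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one invalid string, one missing.
raw_dates = pd.Series(["1995-10-30", "not a date", None])

# Invalid and missing entries both become NaT.
dates = pd.to_datetime(raw_dates, errors="coerce")
print(dates)
```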

## Cleaning Text / String Columns

15. __Analyze__ the text columns "overview" and "tagline". Try to identify __missing data that is not represented by NaN__ (e.g. "No Data"). __Replace as NaN__ (np.nan)!

In [365]:
df.overview = df.overview.replace("No overview found.", np.nan)

In [366]:
df.overview = df.overview.replace("No Overview.", np.nan)

In [367]:
df.overview = df.overview.replace("No movie overview available.", np.nan)

In [368]:
df.overview = df.overview.replace(" ", np.nan)

In [369]:
df.overview = df.overview.replace("No overview yet.", np.nan)

In [371]:
df.tagline.value_counts(dropna= False).head(10)

NaN                                         25054
Based on a true story.                          7
-                                               4
Trust no one.                                   4
Be careful what you wish for.                   4
Drama                                           3
Some doors should never be opened.              3
There are two sides to every love story.        3
Who is John Galt?                               3
There is no turning back                        3
Name: tagline, dtype: int64
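The overview placeholders handled above one call at a time can also be replaced in a single pass by handing `replace` a list. The sketch below runs on a made-up Series; the list `placeholders` simply collects the strings already used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical overview values, including three placeholder variants.
overview = pd.Series([
    "Led by Woody, Andy's toys live happily.",
    "No overview found.",
    " ",
    "No Overview.",
])

placeholders = [
    "No overview found.",
    "No Overview.",
    "No movie overview available.",
    "No overview yet.",
    " ",
]

# Every exact match of any placeholder becomes NaN.
overview = overview.replace(placeholders, np.nan)
print(overview.isna().sum())
```

Candidates for such a list are usually found by looking at the most frequent values with `value_counts()`, since genuine free text rarely repeats.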

In [373]:
df.tagline = df.tagline.replace("-", np.nan)

In [374]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
belongs_to_collection    4491 non-null object
budget                   8890 non-null float64
genres                   45466 non-null object
id                       45463 non-null float64
original_language        45455 non-null object
overview                 44369 non-null object
popularity               45460 non-null float64
poster_path              45080 non-null object
production_companies     45460 non-null object
production_countries     45460 non-null object
release_date             45376 non-null datetime64[ns]
revenue                  7408 non-null float64
runtime                  43645 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object
tagline                  20408 non-null object
title                    45460 non-null object
vote_average             42561 non-null float64
vote_count               45460 non

## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [375]:
df.drop_duplicates(inplace = True)

In [378]:
df.drop_duplicates(subset="id", inplace=True)

In [None]:
df.id.value_counts(dropna=False)
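Before dropping anything, the duplicated rows can be listed with `duplicated(subset="id", keep=False)`, which marks every row whose id occurs more than once. A minimal sketch on made-up rows (the DataFrame `movies` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data with one id appearing twice.
movies = pd.DataFrame({
    "id": [862, 8844, 8844, 15602],
    "title": ["Toy Story", "Jumanji", "Jumanji", "Grumpier Old Men"],
})

# keep=False flags all members of a duplicate group, not just the repeats.
dupes = movies[movies.duplicated(subset="id", keep=False)]

# Keep the first row per id and drop the rest.
deduped = movies.drop_duplicates(subset="id")
print(dupes)
print(deduped["id"].is_unique)
```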

## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

####  Checking the number of missing values in the data

In [379]:
df.isna().sum()

belongs_to_collection    40946
budget                   36554
genres                       0
id                           1
original_language           11
overview                  1097
popularity                   4
poster_path                386
production_companies         4
production_countries         4
release_date                88
revenue                  38036
runtime                   1819
spoken_languages             4
status                      85
tagline                  25037
title                        4
vote_average              2900
vote_count                   4
dtype: int64

#### Dropping movies whose id and titles are not known 

In [381]:
df.dropna(subset=["id", "title"], inplace=True)

#### Changing the id to an integer  

In [382]:
df.id = df.id.astype('int')

18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [383]:
df.dropna(thresh =10, inplace = True )
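The `thresh` parameter counts non-NaN values per row and keeps a row only if it reaches that number. The sketch below uses a threshold of 3 on a small made-up DataFrame so the effect is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with 4, 2 and 3 non-NaN values respectively.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "budget": [1.0, np.nan, np.nan],
    "revenue": [2.0, np.nan, 5.0],
})

# Keep rows with at least 3 non-NaN values; row 1 is dropped.
kept = demo.dropna(thresh=3)
print(kept)
```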

In [384]:
df.head()

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16.0,Comedy|Drama|Romance,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,,Comedy,11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0


## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

### Copying the rows of the DataFrame whose status is "Released"

In [385]:
df = df.loc[df.status=="Released"].copy()

In [389]:
df.status.head(10)

0    Released
1    Released
2    Released
3    Released
4    Released
5    Released
6    Released
7    Released
8    Released
9    Released
Name: status, dtype: object

### Dropping the status column

In [390]:
df.drop(columns=['status'], inplace= True)

In [391]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44985 entries, 0 to 45465
Data columns (total 18 columns):
belongs_to_collection    4463 non-null object
budget                   8855 non-null float64
genres                   44985 non-null object
id                       44985 non-null int32
original_language        44975 non-null object
overview                 43926 non-null object
popularity               44985 non-null float64
poster_path              44612 non-null object
production_companies     44985 non-null object
production_countries     44985 non-null object
release_date             44907 non-null datetime64[ns]
revenue                  7385 non-null float64
runtime                  43238 non-null float64
spoken_languages         44985 non-null object
tagline                  20285 non-null object
title                    44985 non-null object
vote_average             42145 non-null float64
vote_count               44985 non-null float64
dtypes: datetime64[ns](1), float64(

20. The Order of the columns should be as follows: 

In [394]:
col = ["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget_musd", "revenue_musd", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

In [400]:
# "budget" and "revenue" already hold values in million USD; rename them so they match the target column names
df = df.rename(columns={"budget": "budget_musd", "revenue": "revenue_musd"})
df = df.loc[:, col]

In [406]:
df.head(3)

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,United States of America,5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,United States of America,92.0,6.5,11.7129,101.0,A family wedding reignites the ancient feud be...,English,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg


21. __Reset__ the Index and create a __RangeIndex__.

In [407]:
df.reset_index(drop=True, inplace=True)

In [409]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44985 entries, 0 to 44984
Data columns (total 18 columns):
id                       44985 non-null int32
title                    44985 non-null object
tagline                  20285 non-null object
release_date             44907 non-null datetime64[ns]
genres                   44985 non-null object
belongs_to_collection    4463 non-null object
original_language        44975 non-null object
budget_musd              8855 non-null float64
revenue_musd             7385 non-null float64
production_companies     44985 non-null object
production_countries     44985 non-null object
vote_count               44985 non-null float64
vote_average             42145 non-null float64
popularity               44985 non-null float64
runtime                  43238 non-null float64
overview                 43926 non-null object
spoken_languages         44985 non-null object
poster_path              44612 non-null object
dtypes: datetime64[ns](1), float64(6), in

22. __Save__ the cleaned dataset in a __csv-file__.

In [410]:
df.to_csv('movies_clean.csv', index = False)
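A CSV file stores no dtype information, so the datetime column comes back as plain text unless it is parsed again on load. The sketch below demonstrates this round trip with an in-memory buffer instead of a real file; the small DataFrame `clean` is a made-up stand-in for the cleaned dataset.

```python
import io

import pandas as pd

# Hypothetical miniature version of the cleaned dataset.
clean = pd.DataFrame({
    "id": [862, 8844],
    "title": ["Toy Story", "Jumanji"],
    "release_date": pd.to_datetime(["1995-10-30", "1995-12-15"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Reading without parse_dates leaves release_date as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["release_date"])

print(plain["release_date"].dtype)
print(parsed["release_date"].dtype)
```

When reloading the saved file, the equivalent call would presumably be `pd.read_csv('movies_clean.csv', parse_dates=['release_date'])`.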

# +++++++++ See some Hints below +++++++++++++

# ++++++++++++++++ Hints++++++++++++++++++++

__Hints for 3.__ <br>
apply ast.literal_eval() on all stringified elements (you have to import ast):

In [411]:
# example (replace stringified_column with the name of a real column before running):
# df.stringified_column = df.stringified_column.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

__Hints for 4., 5., 6., 7., 8.__<br> 
apply an appropriate lambda function on all column elements

__Hints for 9.__<br>
Replace all __""__ (empty strings) in the above columns by NaN (__np.nan__)

__Hints for 10.__<br>
Use pd.to_numeric() and "coerce" errors

__Hints for 11.__<br>
Replace the value 0 by NaN (__np.nan__)

__Hints for 13.__<br>
Replace the value 0 by NaN (__np.nan__)

__Hints for 14.__<br>
Use pd.to_datetime() and "coerce" errors

__Hints for 16.__<br>
There cannot be two or more movies with the same movie id.