In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('movies_imdb_messy.csv')

In [3]:
df.head()

Unnamed: 0,movie,year,imdbID,certificate,genre,runtime,rating,metascore,director,votes,gross
0,The Dark Knight: Le chevalier noir,(2008),/title/tt0468569/,Tous publics,"\nAction, Crime, Drama",152 min,9.0,84.0,Christopher Nolan,2161024,$534.86M
1,Inception,(2010),/title/tt1375666/,Tous publics,"\nAction, Adventure, Sci-Fi",148 min,8.8,74.0,Christopher Nolan,1909234,$292.58M
2,Le seigneur des anneaux: La communauté de l'an...,(2001),/title/tt0120737/,Tous publics,"\nAdventure, Drama, Fantasy",178 min,8.8,92.0,Peter Jackson,1561138,$315.54M
3,Le seigneur des anneaux: Le retour du roi,(2003),/title/tt0167260/,Tous publics,"\nAdventure, Drama, Fantasy",201 min,8.9,94.0,Peter Jackson,1549252,$377.85M
4,The Dark Knight Rises,(2012),/title/tt1345836/,Tous publics,"\nAction, Thriller",164 min,8.4,78.0,Christopher Nolan,1431808,$448.14M


In [4]:
df.isna().sum()

movie             0
year              0
imdbID            0
certificate    2721
genre             0
runtime           1
rating            0
metascore      3339
director          0
votes             0
gross          4044
dtype: int64

### certificate

In [5]:
df.certificate.value_counts()

Tous publics                       62815
12                                 10739
PG-13                               2452
R                                   2332
16                                  1773
Tous publics avec avertissement     1587
PG                                   610
12 avec avertissement                307
Unrated                               38
13                                    33
18                                    21
7                                     20
G                                     19
10                                    16
Not Rated                             15
NC-17                                  1
Tous Publics                           1
Name: certificate, dtype: int64

- the certificate column has 2721 NaN values (about 3% of the 85500 rows, per the counts above)
- the scraped value is meant to be the French certificate, but it is not unique (for example, for https://www.imdb.com/title/tt0328828/parentalguide?ref_=tt_stry_pg#certification the scraped value is 'Unrated', but the page also lists 'Tous publics' for France)

- movies with no certificate (NaN) do have a certificate in other countries (for example,
https://www.imdb.com/title/tt0200465/parentalguide?ref_=tt_stry_pg#certification is R-rated in most countries, whereas
https://www.imdb.com/title/tt0780521/parentalguide?ref_=tt_stry_pg#certification is rated G (= 'Tous publics') in most countries)
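
The `value_counts()` output above lists both 'Tous publics' and 'Tous Publics', so some labels differ only by letter case. A minimal sketch of how the missing share and such case-only duplicates could be checked, using a small hypothetical `sample` frame (illustrative values, not the scraped data):

```python
import pandas as pd

# Hypothetical sample standing in for the scraped 'certificate' column.
sample = pd.DataFrame(
    {
        "certificate": [
            "Tous publics", "Tous Publics", "R", None,
            "12", "PG-13", "Unrated", "Not Rated",
        ]
    }
)

# Fraction of rows whose certificate is missing.
missing_share = sample["certificate"].isna().mean()

# Labels that differ only by letter case collapse after lower-casing.
labels = sample["certificate"].dropna()
distinct_raw = labels.nunique()
distinct_normalized = labels.str.lower().nunique()

print(f"missing share: {missing_share:.1%}")
print(f"distinct labels: {distinct_raw} raw, {distinct_normalized} after lower-casing")
```

On the real data, the same two checks would run on `df.certificate` directly.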



**The certificate column is not representative of a movie, so we drop it.**

In [6]:
df = df.drop(columns='certificate')

### metascore

In [7]:
df.metascore.describe()

count    82161.000000
mean        69.863402
std         14.120393
min          1.000000
25%         61.000000
50%         71.000000
75%         80.000000
max        100.000000
Name: metascore, dtype: float64

In [8]:
df[df.metascore.isna()].head()

Unnamed: 0,movie,year,imdbID,genre,runtime,rating,metascore,director,votes,gross
440,Boulevard de la mort,(2007),/title/tt1028528/,\nThriller,113 min,7.0,,Quentin Tarantino,254616,
507,Hatchi,(2009),/title/tt1028532/,"\nDrama, Family",93 min,8.1,,Lasse Hallström,233052,
658,Planète terreur,(2007),/title/tt1077258/,"\nAction, Comedy, Horror",105 min,7.1,,Robert Rodriguez,193129,
687,Le nombre 23,(2007),/title/tt0481369/,"\nCrime, Mystery, Thriller",98 min,6.4,,Joel Schumacher,187256,$35.19M
802,The Man from Earth,(2007),/title/tt0756683/,"\nDrama, Fantasy, Sci-Fi",87 min,7.9,,Richard Schenkman,164388,


- movies with a NaN metascore do not have a Metascore on Metacritic yet because they have too few reviews (at least 4 are needed)

**We drop the movies that have no metascore.**

In [9]:
df = df.dropna(subset=['metascore'])

In [10]:
df.isna().sum()

movie           0
year            0
imdbID          0
genre           0
runtime         0
rating          0
metascore       0
director        0
votes           0
gross        1160
dtype: int64
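
The counts above are consistent with `dropna(subset=...)` looking only at the listed column: rows that have a metascore but no `gross` are kept. A minimal sketch on a hypothetical `toy` frame (the titles and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows lack a metascore, two rows lack gross.
toy = pd.DataFrame(
    {
        "movie": ["A", "B", "C", "D"],
        "metascore": [84.0, np.nan, 74.0, np.nan],
        "gross": ["$534.86M", None, None, "$35.19M"],
    }
)

n_missing = int(toy["metascore"].isna().sum())

# Drop only the rows whose metascore is missing; other columns are ignored.
cleaned = toy.dropna(subset=["metascore"])

# The number of removed rows matches the number of missing metascores.
n_removed = len(toy) - len(cleaned)

# Missing values in other columns survive the drop.
gross_still_missing = int(cleaned["gross"].isna().sum())

print(n_missing, n_removed, gross_still_missing)
```

Comparing the number of removed rows with the number of missing metascores is a cheap sanity check to run right after the drop.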

### gross

- "Cumulative Worldwide Gross" is being scraped from each movie page: would it have fewer missing values?
- we leave 'gross' as is for now

In [11]:
df.to_csv('movies_imdb.csv', index=False)