# Cleaning the messy data
## Preparation
First, let's overview the data to see what needs to be fixed
### Imports

In [5]:
import pandas as pd
import re
from pycountry import countries # to check which countries are stored wrong

### Data overview

The file isn't read correctly when I use utf-8 encoding, so I tried latin-1, and ';' is used as separator.

In [6]:
imdb = pd.read_csv('messy_IMDB_dataset.csv', delimiter=';', encoding='latin1')
imdb

Unnamed: 0,IMBD title ID,Original titlÊ,Release year,Genrë¨,Duration,Country,Content Rating,Director,Unnamed: 8,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142,USA,R,Frank Darabont,,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,09 21 1972,"Crime, Drama",175,USA,R,Francis Ford Coppola,,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220,USA,R,Francis Ford Coppola,,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...,...
96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,,$ 156000000,236.285,7.5
97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,,$ 11487676,226.427,7.5
98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132,Italy,,Sergio Leone,,$ 15000000,226.039,7.4
99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123,Iran,PG-13,Asghar Farhadi,,$ 22926076,214.165,7.4


There are a lot of problems here, such as 
- wrong columns names
- *in the 'Original title'* encoding issues, 
- *in the 'Income' column* letter 'o' insead of number 0,
- *in the 'Release year' column* various data formats in the same column
- *in the 'Release year' column* there are random spaces in dates
- *in the 'Countries' column* some countries are written wrong (e.g. the USA are written as US in some cells, in others - USA)
- *the 'Unnamed: 8'* column seems to contain a lot of NaN
- there are missing values


In [7]:
imdb.dtypes

IMBD title ID      object
Original titlÊ     object
Release year       object
Genrë¨             object
Duration           object
Country            object
Content Rating     object
Director           object
Unnamed: 8        float64
Income             object
 Votes             object
Score              object
dtype: object

All the data types except 'Unnamed: 8' are `object`. I suppose 'Unnamed: 8' dtype is object because all its values are `NaN`. Let's check it later in the section [Column 'Unnamed: 8'](#column-unnamed-8).

## Data cleaning

Now each column is gonna be checked separately.

### Rename columns

Some columns are affected by encoding issues, so they should be renamed. Besides, the 'Votes' column has redurant spaces in the left side.

In [8]:
imdb.rename(columns={
    'IMBD title ID': 'title ID',
    'Original titlÊ': 'Original title',
    'Genrë¨': 'Genre',
    ' Votes ': 'Votes',
}, inplace=True)

# change columns' headers
imdb.columns

Index(['title ID', 'Original title', 'Release year', 'Genre', 'Duration',
       'Country', 'Content Rating', 'Director', 'Unnamed: 8', 'Income',
       'Votes', 'Score'],
      dtype='object')

### Column 'Unnamed: 8'

This column seems to have all/a lot of values equal to `NaN`. If all values are `NaN`, then it should be dropped.

In [9]:
# check if the column is redurant
if imdb['Unnamed: 8'].isna().any():
    imdb.drop(columns=['Unnamed: 8'], inplace=True) # if it is, then drop it
imdb

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,09 21 1972,"Crime, Drama",175,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...
96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


### Column 'title ID'

We can see that title id should be 9-character string, so let's make sure each id is 9-character object.

In [10]:
# check if each id has 9 symbols
imdb[imdb['title ID'].str.len() != 9]

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
13,,,,,,,,,,,


Only one id is not 9-character object, and it's `NaN`. All the row contains only `NaN`, so let't srop it.

In [11]:
# all the row has NaN values, so let's drop it
imdb.drop(index=13, inplace=True)

Each id should be unique value, so let's check if there is no duplicate.

In [12]:
# find duplicated id
imdb[imdb.duplicated(['title ID'])]

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score


### Column 'Release year'

Let's check if all dates contain only numbers, spaces and '-'.

In [13]:
# if `Release year` contains anything exept numbers, spaces and `-`
imdb[imdb['Release year'].str.contains(r'[^\d\w-]', regex=True)]

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
1,tt0068646,The Godfather,09 21 1972,"Crime, Drama",175,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
5,tt0167260,The Lord of the Rings: The Return of the King,22 Feb 04,"Action, Adventure, Drama",201,New Zealand,PG-13,Peter Jackson,$ 1142271098,1.604.280,08.9
12,tt0060196,"Il buono, il brutto, il cattivo",23rd December of 1966,Western,161,Italy,Approved,Sergio Leone,$ 25252481,672.499,8.8
15,tt0167261,The Lord of the Rings: The Two Towers,01/16-03,"Action, Adventure, Drama",179,New Zeland,PG-13,Peter Jackson,$ 951227416,1.449.778,8.7.
18,tt0073486,One Flew Over the Cuckoo's Nest,18/11/1976,Drama,-,USA,R,Milos Forman,$ 108997629,891.071,8.7
70,tt0043014,Sunset Blvd.,"The 6th of marzo, year 1951","Drama, Film-Noir",110,USA,,Billy Wilder,$ 299645,195.789,8.0


Some dates contain 
- redurant spaces
- '/' instead of '-', and it's ok
- combination of '-' and '/' (`01/16-03`)
- spaces instead of '-' (`09 21 1972`)
- months written as words insead of numbers
- redurant word 'of' and letters 'rd' (`23rd December of 1966`)
- various orders of yy-mm-dd

Let's check if some dates contain 1-digit or 3-digit or larger than 5-digit number.

In [14]:
imdb[imdb['Release year'].str.contains(r'([^\d]|^)([\d]{1}|[\d]{3}|[\d]{5,})[^\d]', regex=True)]

  imdb[imdb['Release year'].str.contains(r'([^\d]|^)([\d]{1}|[\d]{3}|[\d]{5,})[^\d]', regex=True)]


Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
70,tt0043014,Sunset Blvd.,"The 6th of marzo, year 1951","Drama, Film-Noir",110,USA,,Billy Wilder,$ 299645,195.789,8.0


One date contains the Spanish month name. 

In [23]:
imdb.loc[70, 'Release year'] = '1951-03-06'
imdb.loc[70]

title ID                 tt0043014
Original title        Sunset Blvd.
Release year            1951-03-06
Genre             Drama, Film-Noir
Duration                       110
Country                        USA
Content Rating                 NaN
Director              Billy Wilder
Income                    $ 299645
Votes                      195.789
Score                          8.0
Name: 70, dtype: object

In [4]:
imdb

NameError: name 'imdb' is not defined

To convert time I will create function, which takes numbers and words, which can be relevant to date. If we get the number, then it's a day or a month or a year. If we get a word, then it's a month. I use `format='mixed'` due to various date formats in the column and `errors='coerce'` to deal with errors.

In [13]:
# function for the time conversion
def fix_date(date_str):
    if not date_str:
        return date_str
    
    numbers = re.findall(r'\b\d{1,4}(?=(?:st|nd|rd|th)\b)?', date_str)
    word = re.findall(r'\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b', date_str, flags=re.IGNORECASE)
    r = '-'.join(numbers)
    if word:
        r = r + '-' + word[0]
    return r

imdb['Release year'] = imdb['Release year'].apply(fix_date)

imdb['Release year'] = pd.to_datetime(imdb['Release year'], errors='coerce', format='mixed')
imdb['Release year']

0     1995-02-10
1     1972-09-21
2     2008-07-23
3     1975-09-25
4     1994-10-28
         ...    
96    1974-03-21
97    1982-03-18
98    1965-12-20
99    2011-10-21
100   1953-02-05
Name: Release year, Length: 100, dtype: datetime64[ns]

### Column 'Genre'

In [14]:
imdb

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,1972-09-21,"Crime, Drama",175,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,2008-07-23,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...
96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


Let's check if there is any symbol different from letters, ',', '-' and spaces.

In [15]:
imdb['Genre'].str.contains(r'[^A-Za-z,\s-]').value_counts()

Genre
False    100
Name: count, dtype: int64

All symbols are valid.

### Column 'Duration'

Duration is either integer or null, so let's check if all symbols are numbers.

In [16]:
# check duration digits
imdb[
    ~(imdb['Duration'].isnull() | imdb['Duration'].str.isdecimal())
    ]['Duration']

4                   
6                Nan
9                Inf
11              178c
16    Not Applicable
18                 -
Name: Duration, dtype: object

In [17]:
# edit duration
imdb['Duration'] = imdb['Duration'].str.replace(r'[^\d]+', '', regex=True)
imdb['Duration'] = pd.to_numeric(imdb['Duration'], errors='coerce', downcast='integer')
imdb

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,1972-09-21,"Crime, Drama",175.0,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,2008-07-23,"Action, Crime, Drama",152.0,US,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...
96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129.0,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149.0,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132.0,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123.0,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


### Column 'Country'

In [18]:
imdb['Country'].isin(countries).value_counts()

Country
False    100
Name: count, dtype: int64

In [19]:
def iscountry(name):
    try:
        countries.lookup(name)
    except(LookupError):
        return False
    else:
        return True

In [20]:
countries_errors = imdb['Country'].apply(lambda x: not iscountry(x))
imdb[countries_errors]['Country']

9               UK
11     New Zesland
15      New Zeland
24             US.
27          Italy1
33              UK
42              UK
50              UK
54              UK
57              UK
61              UK
79              UK
81              UK
86              UK
88              UK
95              UK
97    West Germany
99            Iran
Name: Country, dtype: object

In [21]:
# edit countries names
imdb.loc[imdb['Country'] == 'Italy1', 'Country'] = 'Italy'
imdb.loc[imdb['Country'].str.contains('New Z'), 'Country'] = 'New Zealand'
imdb.loc[imdb['Country'].str.contains('US'), 'Country'] = 'USA'
imdb

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,1972-09-21,"Crime, Drama",175.0,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,2008-07-23,"Action, Crime, Drama",152.0,USA,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...
96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129.0,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149.0,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132.0,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123.0,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


In [22]:
imdb['Country'] = imdb['Country'].astype('category')
imdb['Country']

0               USA
1               USA
2               USA
3               USA
4               USA
           ...     
96              USA
97     West Germany
98            Italy
99             Iran
100             USA
Name: Country, Length: 100, dtype: category
Categories (13, object): ['Brazil', 'Denmark', 'France', 'Germany', ..., 'South Korea', 'UK', 'USA', 'West Germany']

In [23]:
imdb['Country'].cat.categories

Index(['Brazil', 'Denmark', 'France', 'Germany', 'India', 'Iran', 'Italy',
       'Japan', 'New Zealand', 'South Korea', 'UK', 'USA', 'West Germany'],
      dtype='object')

### Column 'Content Rating'

In [24]:
imdb['Content Rating'] = imdb['Content Rating'].astype('category')
imdb['Content Rating']

0          R
1          R
2      PG-13
3          R
4          R
       ...  
96        PG
97         R
98       NaN
99     PG-13
100      NaN
Name: Content Rating, Length: 100, dtype: category
Categories (7, object): ['Approved', 'G', 'Not Rated', 'PG', 'PG-13', 'R', 'Unrated']

In [25]:
imdb.loc[imdb['Content Rating'] == 'Not Rated', 'Content Rating'] = 'Unrated'
imdb['Content Rating'] = imdb['Content Rating'].cat.remove_unused_categories()
imdb['Content Rating']

0          R
1          R
2      PG-13
3          R
4          R
       ...  
96        PG
97         R
98       NaN
99     PG-13
100      NaN
Name: Content Rating, Length: 100, dtype: category
Categories (6, object): ['Approved', 'G', 'PG', 'PG-13', 'R', 'Unrated']

In [26]:
imdb['Content Rating'].cat.set_categories(['Unrated', 'Approved', 'G', 'PG', 'PG-13', 'R'], ordered=True)
imdb['Content Rating']

0          R
1          R
2      PG-13
3          R
4          R
       ...  
96        PG
97         R
98       NaN
99     PG-13
100      NaN
Name: Content Rating, Length: 100, dtype: category
Categories (6, object): ['Approved', 'G', 'PG', 'PG-13', 'R', 'Unrated']

### Column 'Director'

In [27]:
imdb

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,1972-09-21,"Crime, Drama",175.0,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,2008-07-23,"Action, Crime, Drama",152.0,USA,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...
96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129.0,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149.0,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132.0,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123.0,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


In [28]:
imdb[imdb['Director'].str.contains(r'[^A-Za-z,\s]')]['Director']

25    Fernando Meirelles, KÃ¡tia Lund
41    Olivier Nakache, Ãric Toledano
60                     Chan-wook Park
85                 Jean-Pierre Jeunet
Name: Director, dtype: object

In [29]:
imdb['Director'] = imdb['Director'].str.replace('KÃ¡tia Lund', 'Katia Lund')
imdb['Director'] = imdb['Director'].str.replace('Ãric Toledano', 'Eric Toledano')
imdb[imdb['Director'].str.contains(r'[^A-Za-z,\s]')]['Director']

60        Chan-wook Park
85    Jean-Pierre Jeunet
Name: Director, dtype: object

### Column 'Income'

In [30]:
# income check
imdb['Income'] = imdb['Income'].str.strip()
imdb['Income'] = imdb['Income'].str.strip('$')
imdb['Income'] = imdb['Income'].str.strip(',')
imdb[~(imdb['Income'].isnull() | imdb['Income'].str.isdecimal())]['Income']

0          28815245
1         246120974
2        1005455211
3       4o8,035,783
4         222831817
           ...     
96        156000000
97         11487676
98         15000000
99         22926076
100         1864182
Name: Income, Length: 100, dtype: object

In [31]:
imdb['Income'] = imdb['Income'].str.replace('o', '0')
imdb['Income'] = pd.to_numeric(imdb['Income'], errors='coerce', downcast='integer')
imdb

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,R,Frank Darabont,2.881524e+07,2.278.845,9.3
1,tt0068646,The Godfather,1972-09-21,"Crime, Drama",175.0,USA,R,Francis Ford Coppola,2.461210e+08,1.572.674,9.2
2,tt0468569,The Dark Knight,2008-07-23,"Action, Crime, Drama",152.0,USA,PG-13,Christopher Nolan,1.005455e+09,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,R,Francis Ford Coppola,,1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,2.228318e+08,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...
96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129.0,USA,PG,George Roy Hill,1.560000e+08,236.285,7.5
97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149.0,West Germany,R,Wolfgang Petersen,1.148768e+07,226.427,7.5
98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132.0,Italy,,Sergio Leone,1.500000e+07,226.039,7.4
99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123.0,Iran,PG-13,Asghar Farhadi,2.292608e+07,214.165,7.4


### Column 'Votes'

In [32]:
# income check
imdb['Votes'] = imdb['Votes'].str.strip()
imdb['Votes'] = imdb['Votes'].str.replace('.', '')
imdb[~(imdb['Votes'].isnull() | imdb['Votes'].str.isdecimal())]['Votes']

Series([], Name: Votes, dtype: object)

In [33]:
# convert votes to float
imdb['Votes'] = pd.to_numeric(imdb['Votes'], errors='coerce')

### Column Score

In [34]:
# check scores
imdb[~imdb['Score'].str.contains(r'^\d{1,2}.\d$')]['Score']

2         9.
3       9,.0
4       8,9f
8       8..8
14     ++8.7
15      8.7.
16    8,7e-0
Name: Score, dtype: object

In [35]:
imdb['Score'] = imdb['Score'].str.replace(r'[.,]+\b', '.', regex=True)
imdb['Score'] = imdb['Score'].str.replace(r'[^\d.,]', '', regex=True)
imdb['Score'] = imdb['Score'].str.strip('.')
imdb[~imdb['Score'].str.contains(r'^\d{1,2}.\d$')]['Score']

2        9
10      88
16    8.70
Name: Score, dtype: object

In [36]:
imdb['Score'] = pd.to_numeric(imdb['Score'], errors='coerce')
imdb.dtypes

title ID                  object
Original title            object
Release year      datetime64[ns]
Genre                     object
Duration                 float64
Country                 category
Content Rating          category
Director                  object
Income                   float64
Votes                      int64
Score                    float64
dtype: object

In [37]:
# check if all scores are less than 10.0
imdb[(imdb['Score'] > 10.) | (imdb['Score']< 0.)]

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
10,tt0109830,Forrest Gump,1994-10-06,"Drama, Romance",142.0,USA,PG-13,Robert Zemeckis,678229452.0,1755490,88.0


In [38]:
imdb['Score'] = imdb['Score'].apply(lambda x: x/10 if x > 10 else x)
imdb[(imdb['Score'] > 10.) | (imdb['Score']< 0.)]

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score


In [39]:
imdb.dtypes

title ID                  object
Original title            object
Release year      datetime64[ns]
Genre                     object
Duration                 float64
Country                 category
Content Rating          category
Director                  object
Income                   float64
Votes                      int64
Score                    float64
dtype: object

In [40]:
imdb

Unnamed: 0,title ID,Original title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,R,Frank Darabont,2.881524e+07,2278845,9.3
1,tt0068646,The Godfather,1972-09-21,"Crime, Drama",175.0,USA,R,Francis Ford Coppola,2.461210e+08,1572674,9.2
2,tt0468569,The Dark Knight,2008-07-23,"Action, Crime, Drama",152.0,USA,PG-13,Christopher Nolan,1.005455e+09,2241615,9.0
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,R,Francis Ford Coppola,,1098714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,2.228318e+08,1780147,8.9
...,...,...,...,...,...,...,...,...,...,...,...
96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129.0,USA,PG,George Roy Hill,1.560000e+08,236285,7.5
97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149.0,West Germany,R,Wolfgang Petersen,1.148768e+07,226427,7.5
98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132.0,Italy,,Sergio Leone,1.500000e+07,226039,7.4
99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123.0,Iran,PG-13,Asghar Farhadi,2.292608e+07,214165,7.4
