### Cleaning the messy IMDb Dataset found on Kaggle
https://www.kaggle.com/datasets/davidfuenteherraiz/messy-imdb-dataset

Welcome to my very first data cleaning project. This was my first time working completely on my own, looking up errors and fixing them without the guidance of a YouTube video or any other tutorial.

This dataset contains 100 movies from the IMDb database with various errors, which I will be correcting.

In [1]:
import pandas as pd
from pandas import DataFrame

# The file is semicolon-delimited and is read here as ISO-8859-1
df_messy = pd.read_csv(r"messy_IMDB_dataset.csv", encoding="ISO-8859-1", delimiter=";")
df_messy.head(5)

Unnamed: 0,IMBD title ID,Original titlÊ,Release year,Genrë¨,Duration,Country,Content Rating,Director,Unnamed: 8,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,R,Frank Darabont,,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,09 21 1972,"Crime, Drama",175.0,USA,R,Francis Ford Coppola,,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152.0,US,PG-13,Christopher Nolan,,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,R,Francis Ford Coppola,,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,,$ 222831817,1.780.147,"8,9f"


In [2]:
#Some basic information
print(df_messy.shape)
print(df_messy.columns)
print(df_messy.dtypes)

(101, 12)
Index(['IMBD title ID', 'Original titlÊ', 'Release year', 'Genrë¨', 'Duration',
       'Country', 'Content Rating', 'Director', 'Unnamed: 8', 'Income',
       ' Votes ', 'Score'],
      dtype='object')
IMBD title ID      object
Original titlÊ     object
Release year       object
Genrë¨             object
Duration           object
Country            object
Content Rating     object
Director           object
Unnamed: 8        float64
Income             object
 Votes             object
Score              object
dtype: object


Points of interest:
* Correcting the **title** and **genre** column
* Trimming the **votes** column
* Investigating the **unnamed** column
* Changing **score**, **votes**, **income**, and **duration** to numeric types 

In [3]:
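# Note: the key 'Genrë' does not match the actual header 'Genrë¨', so only Title and Votes are renamed here
df_messy.rename(columns={'Original titlÊ': 'Title', 'Genrë': 'Genre',' Votes ': 'Votes'}, inplace=True)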
df_messy.columns

Index(['IMBD title ID', 'Title', 'Release year', 'Genrë¨', 'Duration',
       'Country', 'Content Rating', 'Director', 'Unnamed: 8', 'Income',
       'Votes', 'Score'],
      dtype='object')

In [4]:
df_messy.columns = df_messy.columns.str.replace('ë', '')

In [5]:
df_messy.rename(columns={'Genr¨': 'Genre'}, inplace=True)
df_messy.columns

Index(['IMBD title ID', 'Title', 'Release year', 'Genre', 'Duration',
       'Country', 'Content Rating', 'Director', 'Unnamed: 8', 'Income',
       'Votes', 'Score'],
      dtype='object')
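
The three rename steps above could also be done in one pass. This is only a minimal sketch, assuming the damaged headers differ from the intended ones by nothing more than non-ASCII characters and surrounding whitespace. `clean_column_name` and the toy frame `demo` are hypothetical names that are not part of the notebook.

```python
import pandas as pd

def clean_column_name(name: str) -> str:
    """Drop non-ASCII characters and surrounding whitespace from a column name."""
    ascii_only = name.encode("ascii", errors="ignore").decode("ascii")
    return ascii_only.strip()

# Toy frame with the same kind of damaged headers as the raw file
demo = pd.DataFrame(columns=["Original titl\u00ca", "Genr\u00eb\u00a8", " Votes ", "Score"])
demo.columns = [clean_column_name(c) for c in demo.columns]

# The stripped names are still incomplete, so a final explicit rename is needed
demo = demo.rename(columns={"Original titl": "Title", "Genr": "Genre"})
print(list(demo.columns))
```

The explicit rename map at the end keeps the intended names visible instead of guessing them from the damaged text.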

There are quite a few missing values in the **unnamed** column.

In [6]:
df_messy.isnull().sum()

IMBD title ID       1
Title               1
Release year        1
Genre               1
Duration            2
Country             1
Content Rating     24
Director            1
Unnamed: 8        101
Income              1
Votes               1
Score               1
dtype: int64

I'm deleting the **unnamed** column, since it doesn't contain a single value.

In [7]:
df_messy.drop(['Unnamed: 8'], axis=1, inplace=True)
df_messy.columns

Index(['IMBD title ID', 'Title', 'Release year', 'Genre', 'Duration',
       'Country', 'Content Rating', 'Director', 'Income', 'Votes', 'Score'],
      dtype='object')

In [8]:
df_messy[df_messy['Duration'].isna()]

Unnamed: 0,IMBD title ID,Title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
13,,,,,,,,,,,
14,tt0133093,The Matrix,1999-05-07,"Action, Sci-Fi",,USA,R,"Lana Wachowski, Lilly Wachowski",$ 465718588,1.632.315,++8.7


I wanted to look at the *NaN* values in the **duration** column, because filling them in manually wouldn't be an issue: there are only 2 of them and the information is easy to look up. In the process I found that row 13 is completely empty. Dropping it removes the single missing value that showed up in most columns.
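
Fully empty rows and columns like these can also be removed without knowing their labels. A minimal sketch on a toy frame (the data in `demo` is made up for illustration), using `dropna` with `how="all"`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "Title": ["The Matrix", np.nan, "Pulp Fiction"],
        "Duration": [np.nan, np.nan, "154"],
        "Unnamed: 8": [np.nan, np.nan, np.nan],
    }
)

# Drop columns in which every value is missing, then rows in which every value is missing
cleaned = demo.dropna(axis="columns", how="all").dropna(axis="index", how="all")
print(cleaned)
```

Rows that are only partly empty, such as the one with a missing duration, are kept and can still be filled in by hand.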

In [9]:
df_messy.drop(index=13, inplace=True)
df_messy.isnull().sum()

IMBD title ID      0
Title              0
Release year       0
Genre              0
Duration           1
Country            0
Content Rating    23
Director           0
Income             0
Votes              0
Score              0
dtype: int64

In [10]:
df_messy.at[14, 'Duration'] = '136'

In [11]:
df_messy[df_messy['Title'] == 'The Matrix']

Unnamed: 0,IMBD title ID,Title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
14,tt0133093,The Matrix,1999-05-07,"Action, Sci-Fi",136,USA,R,"Lana Wachowski, Lilly Wachowski",$ 465718588,1.632.315,++8.7


In [12]:
df_messy[df_messy['Content Rating'].isna()]

Unnamed: 0,IMBD title ID,Title,Release year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
27,tt0118799,La vita B9 bella,1997-12-20,"Comedy, Drama, Romance",116,Italy1,,Roberto Benigni,$ 230098753,605.648,8.6
28,tt6751668,Gisaengchung,2019-11-07,"Comedy, Drama, Thriller",132,South Korea,,Bong Joon Ho,$ 257604912,470.931,8.6
36,tt0110413,LÃ©on,1995-04-07,"Action, Crime, Drama",110,France,,Luc Besson,$ 19552639,1.007.598,8.5
40,tt7286456,Joker,2019-10-03,"Crime, Drama, Thriller",122,USA,,Todd Phillips,$ 1074251311,855.097,8.4
41,tt1675434,Intouchables,2012-02-24,"Biography, Comedy, Drama",112,France,,"Olivier Nakache, Ãric Toledano",$ 426588510,736.691,8.4
47,tt0095327,Hotaru no haka,2015-10-11,"Animation, Drama, War",89,Japan,,Isao Takahata,$ 516962,225.438,8.3
48,tt0095765,Nuovo Cinema Paradiso,1988-11-17,Drama,155,Italy,,Giuseppe Tornatore,$ 13826605,223.050,8.3
56,tt4154756,Avengers: Infinity War,2018-04-25,"Action, Adventure, Sci-Fi",149,USA,,"Anthony Russo, Joe Russo",$ 2048359754,796.486,8.2
58,tt4154796,Avengers: Endgame,2019-04-24,"Action, Adventure, Drama",181,USA,,"Anthony Russo, Joe Russo",$ 2797800564,754.786,8.2
62,tt0047396,Rear Window,1955-04-14,"Mystery, Thriller",112,USA,,Alfred Hitchcock,$ 37032034,432.390,8.1


There are special characters in the **Title** column, the **Release year** column needs to be reformatted, and there are digits in the **Country** column. Since the **Content Rating** column is missing over 20% of its values (23 of 100), I will remove it. I'll also drop **IMBD title ID**, since it isn't needed for further analysis.
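
The share of missing values per column can be computed directly instead of read off the counts. A minimal sketch on made-up data; the 20% cut-off stored in `threshold` is an assumption taken from the reasoning above, not a general rule:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D", "E"],
        "Content Rating": ["R", np.nan, "PG-13", np.nan, "R"],
    }
)

# isna().mean() gives the fraction of missing values in each column
missing_share = demo.isna().mean()

# Hypothetical cut-off: drop every column with more than 20% missing values
threshold = 0.20
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = demo.drop(columns=to_drop)
print(missing_share)
print(to_drop)
```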

In [13]:
df_v2 = DataFrame(df_messy, columns=['Title', 'Release year', 'Genre', 'Duration',
       'Country', 'Director', 'Income', 'Votes', 'Score'])

In [14]:
df_v2.head(5)

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
0,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,Frank Darabont,$ 28815245,2.278.845,9.3
1,The Godfather,09 21 1972,"Crime, Drama",175.0,USA,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152.0,US,Christopher Nolan,$ 1005455211,2.241.615,9.
3,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"


In [15]:
#Small reminder
df_v2.dtypes

Title           object
Release year    object
Genre           object
Duration        object
Country         object
Director        object
Income          object
Votes           object
Score           object
dtype: object

In [16]:
#df_v2['Release year'] = pd.to_datetime(df_v2['Release year']) throws an error because of strings inside the column

I need to go look for the entry containing *The 6th of marzo, year 1951*. 

In [17]:
df_v2[df_v2['Release year'] == 'The 6th of marzo, year 1951']

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
70,Sunset Blvd.,"The 6th of marzo, year 1951","Drama, Film-Noir",110,USA,Billy Wilder,$ 299645,195.789,8.0


In [18]:
df_v2.at[70, 'Release year'] = '1951-03-06'
df_v2[df_v2['Title'] == 'Sunset Blvd.']

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
70,Sunset Blvd.,1951-03-06,"Drama, Film-Noir",110,USA,Billy Wilder,$ 299645,195.789,8.0


In [19]:
#df_v2['Release year'] = pd.to_datetime(df_v2['Release year']) another error to investigate

In [20]:
df_v2[df_v2['Release year'] == '1984-02-34']

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
83,Scarface,1984-02-34,"Crime, Drama",170,USA,Brian De Palma,$ 66023585,721.343,7.8


In [21]:
#looking up the correct date on IMDb
df_v2.at[83, 'Release year'] = '1983-12-01'
df_v2[df_v2['Title'] == 'Scarface']

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
83,Scarface,1983-12-01,"Crime, Drama",170,USA,Brian De Palma,$ 66023585,721.343,7.8


In [22]:
#df_v2['Release year'] = pd.to_datetime(df_v2['Release year']) another error to investigate

In [23]:
df_v2[df_v2['Release year'] == '1976-13-24']

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
84,Taxi Driver,1976-13-24,"Crime, Drama",114,USA,Martin Scorsese,$ 28441292,703.264,7.7


In [24]:
df_v2.at[84, 'Release year'] = '1976-02-08'
df_v2[df_v2['Title'] == 'Taxi Driver']

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
84,Taxi Driver,1976-02-08,"Crime, Drama",114,USA,Martin Scorsese,$ 28441292,703.264,7.7


In [25]:
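# Note: infer_datetime_format is deprecated as of pandas 2.0; newer versions may need format="mixed" for the non-ISO entries
df_v2['Release year'] = pd.to_datetime(df_v2['Release year'], infer_datetime_format=True)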



In [26]:
#Generating a few samples
df_v2['Release year'].sample(5)

68   1984-09-28
32   2006-10-27
30   1955-08-19
11   2002-01-18
31   2000-05-19
Name: Release year, dtype: datetime64[ns]
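
The invalid dates above were found one error at a time. As a possible shortcut, here is a minimal sketch that lists every entry that is not a valid `YYYY-MM-DD` date in one go. It assumes ISO is the target format, so valid dates written in another layout (such as `09 21 1972`) are flagged too. The sample values are copied from the outputs above.

```python
import pandas as pd

dates = pd.Series(
    ["1995-02-10", "09 21 1972", "The 6th of marzo, year 1951", "1984-02-34", "1976-13-24"]
)

# With an explicit format and errors="coerce", anything that does not match becomes NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Entries that failed to parse and need a manual look
bad_dates = dates[parsed.isna()]
print(bad_dates)
```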

In [27]:
#df_v2['Duration'] = pd.to_numeric(df_v2['Duration'])

In [28]:
df_v2[df_v2['Duration'] == ' ']

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
4,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"


In [29]:
df_v2.at[4, 'Duration'] = '154'

In [30]:
#df_v2['Duration'] = pd.to_numeric(df_v2['Duration'])

In [31]:
df_v2.loc[[6]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
6,Schindler's List,1994-03-11,"Biography, Drama, History",Nan,USA,Steven Spielberg,$ 322287794,1.183.248,8.9


In [32]:
df_v2.at[6, 'Duration'] = '195'
df_v2.loc[[6]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
6,Schindler's List,1994-03-11,"Biography, Drama, History",195,USA,Steven Spielberg,$ 322287794,1.183.248,8.9


In [33]:
#df_v2['Duration'] = pd.to_numeric(df_v2['Duration'])

In [34]:
df_v2.loc[[11]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
11,The Lord of the Rings: The Fellowship of the Ring,2002-01-18,"Action, Adventure, Drama",178c,New Zesland,Peter Jackson,$ 887934303,1.619.920,8.8


In [35]:
df_v2.at[11, 'Duration'] = '178'
df_v2.loc[[11]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
11,The Lord of the Rings: The Fellowship of the Ring,2002-01-18,"Action, Adventure, Drama",178,New Zesland,Peter Jackson,$ 887934303,1.619.920,8.8


In [36]:
df_v2.loc[[16]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
16,Star Wars: Episode V - The Empire Strikes Back,1980-09-19,"Action, Adventure, Fantasy",Not Applicable,USA,Irvin Kershner,$ 549265501,1.132.073,87.0


In [37]:
df_v2.at[16, 'Duration'] = '124'
df_v2.loc[[16]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
16,Star Wars: Episode V - The Empire Strikes Back,1980-09-19,"Action, Adventure, Fantasy",124,USA,Irvin Kershner,$ 549265501,1.132.073,87.0


In [38]:
#df_v2['Duration'] = pd.to_numeric(df_v2['Duration'])

I'm only at position 17 and the errors keep coming, so I searched for a way to find every faulty entry at once.
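
Besides inspecting the unique values, the faulty entries can be flagged by letting `pd.to_numeric` coerce instead of raise. This is a minimal sketch on sample values taken from the cells above, not the notebook's actual method. Since `Inf` did not raise an error earlier, it apparently parses as a number, so the check uses `np.isfinite` rather than only looking for NaN.

```python
import numpy as np
import pandas as pd

durations = pd.Series(["142", " ", "Nan", "178c", "Not Applicable", "Inf", "-", "136"])

# errors="coerce" turns unparseable strings into NaN instead of raising
numeric = pd.to_numeric(durations, errors="coerce")

# np.isfinite is False for NaN and for infinity, so "Inf" is flagged either way
faulty = durations[~np.isfinite(numeric)]
print(faulty)
```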

In [39]:
df_undur = df_v2['Duration'].unique()

In [40]:
print(sorted(df_undur))

['-', '102', '103', '104', '105', '106', '108', '109', '110', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '134', '136', '137', '142', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '160', '161', '164', '165', '169', '170', '175', '178', '179', '181', '189', '195', '201', '207', '220', '228', '229', '81', '87', '88', '89', '95', '96', '98', '99', 'Inf']


In [41]:
df_v2[(df_v2['Duration'] == '-') | (df_v2['Duration'] == 'Inf')]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
9,Fight Club,1999-10-29,Drama,Inf,UK,David Fincher,$ 101218804,1.807.440,8.8
18,One Flew Over the Cuckoo's Nest,1976-11-18,Drama,-,USA,Milos Forman,$ 108997629,891.071,8.7


In [42]:
df_v2.at[9, 'Duration'] = '139'
df_v2.at[18, 'Duration'] = '133'
df_v2.loc[[9,18]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
9,Fight Club,1999-10-29,Drama,139,UK,David Fincher,$ 101218804,1.807.440,8.8
18,One Flew Over the Cuckoo's Nest,1976-11-18,Drama,133,USA,Milos Forman,$ 108997629,891.071,8.7


In [43]:
df_v2['Duration'] = pd.to_numeric(df_v2['Duration'])

In [44]:
df_v2.dtypes

Title                   object
Release year    datetime64[ns]
Genre                   object
Duration                 int64
Country                 object
Director                object
Income                  object
Votes                   object
Score                   object
dtype: object

In [45]:
df_countries = df_v2['Country'].unique()
print(sorted(df_countries))

['Brazil', 'Denmark', 'France', 'Germany', 'India', 'Iran', 'Italy', 'Italy1', 'Japan', 'New Zealand', 'New Zeland', 'New Zesland', 'South Korea', 'UK', 'US', 'US.', 'USA', 'West Germany']


In [46]:
df_v2['Country'] = df_v2['Country'].replace(['US', 'US.', 'Italy1', 'New Zeland', 'New Zesland'],['USA','USA','Italy','New Zealand', 'New Zealand'])
df_v2['Country'].unique()

array(['USA', 'New Zealand', 'UK', 'Italy', 'Brazil', 'Japan',
       'South Korea', 'France', 'Germany', 'India', 'Denmark',
       'West Germany', 'Iran'], dtype=object)
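
The same replacement can be written with a dictionary, which keeps each misspelling next to its correction. A minimal sketch on a toy Series; `country_fixes` is a hypothetical name:

```python
import pandas as pd

countries = pd.Series(["USA", "US", "US.", "Italy1", "New Zeland", "New Zesland", "UK"])

# One "misspelling: correct name" pair per entry keeps the mapping easy to audit
country_fixes = {
    "US": "USA",
    "US.": "USA",
    "Italy1": "Italy",
    "New Zeland": "New Zealand",
    "New Zesland": "New Zealand",
}
countries = countries.replace(country_fixes)
print(sorted(countries.unique()))
```

With two parallel lists, a missing or misplaced item silently shifts every later pair. The dictionary form avoids that.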

In [47]:
df_v2['Director'].unique()

array(['Frank Darabont', 'Francis Ford Coppola', 'Christopher Nolan',
       'Quentin Tarantino', 'Peter Jackson', 'Steven Spielberg',
       'Sidney Lumet', 'David Fincher', 'Robert Zemeckis', 'Sergio Leone',
       'Lana Wachowski, Lilly Wachowski', 'Irvin Kershner',
       'Martin Scorsese', 'Milos Forman', 'Jonathan Demme',
       'George Lucas', 'Fernando Meirelles, KÃ¡tia Lund',
       'Hayao Miyazaki', 'Roberto Benigni', 'Bong Joon Ho', 'Frank Capra',
       'Akira Kurosawa', 'Ridley Scott', 'Tony Kaye', 'Luc Besson',
       'James Cameron', 'Bryan Singer', 'Roger Allers, Rob Minkoff',
       'Todd Phillips', 'Olivier Nakache, Ã\x89ric Toledano',
       'Roman Polanski', 'Damien Chazelle', 'Alfred Hitchcock',
       'Michael Curtiz', 'Isao Takahata', 'Giuseppe Tornatore',
       'Charles Chaplin', 'Andrew Stanton', 'Stanley Kubrick',
       'Anthony Russo, Joe Russo', 'Chan-wook Park',
       'Lee Unkrich, Adrian Molina', 'Florian Henckel von Donnersmarck',
       'Bob Persichet

Going through every column, I will try to correct errors just for the sake of having a clean dataframe.

In [48]:
df_v2[(df_v2['Director'].str.contains("Lund") | (df_v2['Director'].str.contains("Toledano")))]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
25,Cidade de Deus,2003-05-09,"Crime, Drama",130,Brazil,"Fernando Meirelles, KÃ¡tia Lund",$ 30680793,685.856,8.6
41,Intouchables,2012-02-24,"Biography, Comedy, Drama",112,France,"Olivier Nakache, Ãric Toledano",$ 426588510,736.691,8.4


In [49]:
df_v2.at[25, 'Director'] = 'Fernando Meirelles, Kátia Lund'
df_v2.at[41, 'Director'] = 'Olivier Nakache, Éric Toledano'
df_v2.loc[[25,41]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
25,Cidade de Deus,2003-05-09,"Crime, Drama",130,Brazil,"Fernando Meirelles, Kátia Lund",$ 30680793,685.856,8.6
41,Intouchables,2012-02-24,"Biography, Comedy, Drama",112,France,"Olivier Nakache, Éric Toledano",$ 426588510,736.691,8.4
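
Names like these look like UTF-8 text that was decoded as ISO-8859-1. If that assumption holds, the damage can be reversed programmatically instead of retyping each name. A minimal sketch; `fix_mojibake` is a hypothetical helper, and it leaves a string untouched whenever the round trip fails:

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was decoded as ISO-8859-1; return the input unchanged on failure."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# "K\u00c3\u00a1tia" is the garbled form of "Kátia" shown in the output above
print(fix_mojibake("Fernando Meirelles, K\u00c3\u00a1tia Lund"))
# "\u00c3\x89ric" is the garbled form of "Éric"
print(fix_mojibake("Olivier Nakache, \u00c3\x89ric Toledano"))
# Plain ASCII names pass through unchanged
print(fix_mojibake("Frank Darabont"))
```

Applied to a column, this would be something like `df_v2['Director'].map(fix_mojibake)`. The column headers suggest the file is not consistently encoded, so the result should still be checked by eye.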


In [50]:
# regex=False so that '$' is treated as a literal character, not as the end-of-string anchor
df_v2['Income'] = df_v2['Income'].str.replace('$', '', regex=False)
df_v2['Income'] = df_v2['Income'].str.replace(',', '', regex=False)
df_v2['Income'] = df_v2['Income'].str.replace('o', '', regex=False)

In [51]:
df_v2['Income'] = pd.to_numeric(df_v2['Income'])
df_v2.dtypes

Title                   object
Release year    datetime64[ns]
Genre                   object
Duration                 int64
Country                 object
Director                object
Income                   int64
Votes                   object
Score                   object
dtype: object

In [52]:
# regex=False so that '.' is a literal dot, not the "any character" wildcard
df_v2['Votes'] = df_v2['Votes'].str.replace('.', '', regex=False)
df_v2['Votes'] = df_v2['Votes'].str.replace(',', '', regex=False)

In [53]:
df_v2['Votes'] = pd.to_numeric(df_v2['Votes'])
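
Both **Income** and **Votes** end up as "digits only", so the chained replacements could be folded into one pattern. A minimal sketch with a hypothetical helper `digits_only`, using sample values from the outputs above. Like the cells above, it simply drops the stray `o` rather than turning it into a zero.

```python
import pandas as pd

income = pd.Series(["$ 28815245", "$ 4o8,035,783"])
votes = pd.Series(["2.278.845", "605.648"])

def digits_only(series: pd.Series) -> pd.Series:
    """Remove every non-digit character, then convert to integers."""
    return pd.to_numeric(series.str.replace(r"\D", "", regex=True))

print(digits_only(income).tolist())
print(digits_only(votes).tolist())
```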

In [54]:
df_v2['Score'].unique()

array(['9.3', '9.2', '9.', '9,.0', '8,9f', '08.9', '8.9', '8..8', '8.8',
       '8:8', '++8.7', '8.7.', '8,7e-0', '8.7', '8.6', '8,6', '8.5',
       '8.4', '8.3', '8.2', '8.1', '8.0', '7.9', '7.8', '7.7', '7.6',
       '7.5', '7.4'], dtype=object)

In [55]:
# Might be a bit barbaric: strip the separators and stray characters one by one
# (regex=False because '.' and '+' are regex metacharacters)
df_v2['Score'] = df_v2['Score'].str.replace(',', '', regex=False)
df_v2['Score'] = df_v2['Score'].str.replace('.', '', regex=False)
df_v2['Score'] = df_v2['Score'].str.replace(':', '', regex=False)
df_v2['Score'] = df_v2['Score'].str.replace('+', '', regex=False)
df_v2['Score'] = df_v2['Score'].str.replace('f', '', regex=False)

In [56]:
df_v2[(df_v2['Score'].str.contains("089") | (df_v2['Score'].str.contains("e")))]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
5,The Lord of the Rings: The Return of the King,2004-02-22,"Action, Adventure, Drama",201,New Zealand,Peter Jackson,1142271098,1604280,89.0
16,Star Wars: Episode V - The Empire Strikes Back,1980-09-19,"Action, Adventure, Fantasy",124,USA,Irvin Kershner,549265501,1132073,87.0


In [57]:
df_v2.at[5, 'Score'] = '89'
df_v2.at[16, 'Score'] = '87'
df_v2.at[2, 'Score'] = '90'
df_v2.loc[[5,16,2]]

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
5,The Lord of the Rings: The Return of the King,2004-02-22,"Action, Adventure, Drama",201,New Zealand,Peter Jackson,1142271098,1604280,89
16,Star Wars: Episode V - The Empire Strikes Back,1980-09-19,"Action, Adventure, Fantasy",124,USA,Irvin Kershner,549265501,1132073,87
2,The Dark Knight,2008-07-23,"Action, Crime, Drama",152,USA,Christopher Nolan,1005455211,2241615,90


In [58]:
df_v2['Score'] = df_v2['Score'].astype(int)

In [59]:
df_v2['Score'].sample(15)

44    83
4     89
85    77
56    82
28    86
62    81
76    79
11    88
90    76
66    81
95    75
45    83
6     89
77    79
40    84
Name: Score, dtype: int32
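
The scores are now stored as integers ten times the rating (89 for 8.9). As an alternative, here is a minimal sketch that parses the damaged strings straight into decimals. It assumes every score lies between 1.0 and 9.9 with one decimal place, which matches the unique values listed above but would break on a 10.0. `parse_score` is a hypothetical helper.

```python
import re

def parse_score(raw: str) -> float:
    """Turn a damaged score string such as '8,9f' or '++8.7' into a float."""
    digits = re.sub(r"\D", "", raw).lstrip("0")  # keep digits only, drop leading zeros
    digits = digits.ljust(2, "0")[:2]            # '9' -> '90', '870' -> '87'
    return int(digits) / 10

samples = ["9.3", "9.", "9,.0", "8,9f", "08.9", "8..8", "8:8", "++8.7", "8.7.", "8,7e-0", "8,6"]
print([parse_score(s) for s in samples])
```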

In [60]:
# .copy() so that later edits to df_final do not also change df_v2
df_final = df_v2.copy()
df_final.sample(10)

Unnamed: 0,Title,Release year,Genre,Duration,Country,Director,Income,Votes,Score
14,The Matrix,1999-05-07,"Action, Sci-Fi",136,USA,"Lana Wachowski, Lilly Wachowski",465718588,1632315,87
88,2001: A Space Odyssey,1968-12-12,"Adventure, Sci-Fi",149,UK,Stanley Kubrick,68989547,587866,76
46,C'era una volta il West,1968-12-21,Western,165,Italy,Sergio Leone,112911,295220,83
5,The Lord of the Rings: The Return of the King,2004-02-22,"Action, Adventure, Drama",201,New Zealand,Peter Jackson,1142271098,1604280,89
25,Cidade de Deus,2003-05-09,"Crime, Drama",130,Brazil,"Fernando Meirelles, Kátia Lund",30680793,685856,86
10,Forrest Gump,1994-10-06,"Drama, Romance",142,USA,Robert Zemeckis,678229452,1755490,88
61,Dr. Strangelove or: How I Learned to Stop Worr...,1964-04-03,Comedy,95,UK,Stanley Kubrick,9443876,441115,81
28,Gisaengchung,2019-11-07,"Comedy, Drama, Thriller",132,South Korea,Bong Joon Ho,257604912,470931,86
68,Once Upon a Time in America,1984-09-28,"Crime, Drama",229,USA,Sergio Leone,5472914,302317,80
72,American Beauty,2000-01-21,Drama,122,USA,Sam Mendes,356296601,1049009,80


In [61]:
# Last corrections to a few titles
df_final.at[36, 'Title'] = 'Léon'
df_final.at[85, 'Title'] = "Le fabuleux destin d'Amélie Poulain"
df_final.at[98, 'Title'] = 'Per qualche dollaro in più'

In [62]:
df_final.to_csv(r'imdb_clean.csv')

Thank you very much for going through my notebook!