### **DATA CLEANING** 

In [3]:
 # Importing the Necessary Libraries 
import pandas as pd 
import json 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

In [4]:
 # Loading the Datasets 
with open('IMDB_DATA.json', mode='r', encoding='utf-8') as file:  
    data_IMDB_movies = json.load(file)

with open('Rotten_Tomatoes_DAta.json', mode='r', encoding='utf-8') as file:   
    data_Rotten_Tomatoes_movies = json.load(file)

 # Converting to dataframe
IMDB_df = pd.DataFrame( data_IMDB_movies )
Rotten_Tomatoes_df = pd.DataFrame( data_Rotten_Tomatoes_movies )

In [5]:
 # Imdb data
display(IMDB_df)

print( IMDB_df.columns.to_list() ) # column Names


Unnamed: 0,Title,Link,Year,Rating_Classification,Runtime,Total_Ratings,Average_Rating,Genre,Director
0,The Shawshank Redemption,https://www.imdb.com/title/tt0111161/?ref_=cht...,1994,R,2h 22m,3M,9.3/10,Epic|Period Drama|Prison Drama,Frank Darabont
1,The Godfather,https://www.imdb.com/title/tt0068646/?ref_=cht...,1972,R,2h 55m,2.1M,9.2/10,Epic|Gangster|Tragedy,Francis Ford Coppola
2,The Dark Knight,https://www.imdb.com/title/tt0468569/?ref_=cht...,2008,PG-13,2h 32m,3M,9.0/10,Action Epic|Epic|Superhero,Christopher Nolan
3,The Godfather Part II,https://www.imdb.com/title/tt0071562/?ref_=cht...,1974,R,3h 22m,1.4M,9.0/10,Epic|Gangster|Tragedy,Francis Ford Coppola
4,12 Angry Men,https://www.imdb.com/title/tt0050083/?ref_=cht...,1957,Approved,1h 36m,902K,9.0/10,Legal Drama|Psychological Drama|Crime,Sidney Lumet
...,...,...,...,...,...,...,...,...,...
245,Amores Perros,https://www.imdb.com/title/tt0245712/?ref_=cht...,2000,R,2h 34m,258K,8.0/10,Tragedy|Drama|Thriller,Alejandro G. Iñárritu
246,The Help,https://www.imdb.com/title/tt1454029/?ref_=cht...,2011,PG-13,2h 26m,505K,8.1/10,Period Drama|Drama|Back to top,Tate Taylor
247,Rebecca,https://www.imdb.com/title/tt0032976/?ref_=cht...,1940,Approved,2h 10m,151K,8.1/10,Dark Romance|Psychological Thriller|Drama,Alfred Hitchcock
248,A Silent Voice: The Movie,https://www.imdb.com/title/tt5323662/?ref_=cht...,2016,Not Rated,2h 10m,112K,8.1/10,Anime|Coming-of-Age|Psychological Drama,Taichi Ishidate


['Title', 'Link', 'Year', 'Rating_Classification', 'Runtime', 'Total_Ratings', 'Average_Rating', 'Genre', 'Director']


From the output we can see that the data has 250 rows and 9 columns. The columns are:

1. **Title** (title of the movie)
2. **Link** (the IMDb link of the movie)
3. **Year**
4. **Rating_Classification**
5. **Runtime**
6. **Total_Ratings**
7. **Average_Rating**
8. **Genre**
9. **Director**


In [6]:
 # Getting the Datatypes
print( IMDB_df.dtypes )

 # Getting the number of null values 
print( IMDB_df.isnull().sum())

 # Getting the number of duplicated values
print(IMDB_df.duplicated().sum())

Title                    object
Link                     object
Year                     object
Rating_Classification    object
Runtime                  object
Total_Ratings            object
Average_Rating           object
Genre                    object
Director                 object
dtype: object
Title                    0
Link                     0
Year                     0
Rating_Classification    0
Runtime                  0
Total_Ratings            0
Average_Rating           0
Genre                    0
Director                 0
dtype: int64
0


We can see that every column has the datatype object, so we will need to change the datatype of some of them: Total_Ratings, Average_Rating, Year and Runtime.

We can also see that none of the columns has a missing value and that there are no duplicated rows.

In [7]:
 # checking the different unique values in year 
print(IMDB_df['Year'].nunique())
print(IMDB_df['Year'].unique()) 


92
['1994' '1972' '2008' '1974' '1957' '2003' '1993' '2001' '1966' '2002'
 '1999' '2010' '1980' '1990' '1975' '2014' '1995' '1946' '1954' '1991'
 '1998' '1997' '1977' '1985' '2000' '2019' '1960' '2006' '1988' '2023'
 '1962' '1942' '2011' '1936' '1979' '2024' '1968' '2012' '1931' '1981'
 '2018' '2023|2h 27m' '1950' '1940' '1986' '2009' '2017' '1984' '1964'
 '1981|2h 29m' '2016|1h 46m' '1963' '1952' '1983' '2004' '1992' '1944'
 '1959' '1941' '1958' '1987' '1971' '1973' '1989' '2007' '1927' '1948'
 '2016' '1976' '2020' '2013' '2005' '1965' '1961' '1921' '2022' '1982'
 '1939' '2015' '1996' '2021' '2021|1h 27m' '1925' '2024|2h 21m' '1978'
 '1924' '1926' '1953' '1949' '1928' '1956' '1967']


From the output we can see that there are 92 unique values, and that some of them contain both the year and the runtime, separated by **|**. We will explore this further by finding the rows whose Year value contains **|**.

In [8]:
cont = IMDB_df['Year'].str.contains( r'\|' )      # Checking the row that contains '|' in its string

display( IMDB_df[cont] )         # Displaying the data

Unnamed: 0,Title,Link,Year,Rating_Classification,Runtime,Total_Ratings,Average_Rating,Genre,Director
61,12th Fail,https://www.imdb.com/title/tt23849204/?ref_=ch...,2023|2h 27m,CHECK,CHECK,139K,8.8/10,Docudrama|Biography|Drama,Vidhu Vinod Chopra
80,Das Boot,https://www.imdb.com/title/tt0082096/?ref_=cht...,1981|2h 29m,CHECK,CHECK,274K,8.4/10,War Epic|Drama|War,Wolfgang Petersen
82,Your Name.,https://www.imdb.com/title/tt5311514/?ref_=cht...,2016|1h 46m,CHECK,CHECK,343K,8.4/10,Anime|Shōjo|Animation,Makoto Shinkai
179,Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansi...,https://www.imdb.com/title/tt16492678/?ref_=ch...,2021|1h 27m,CHECK,CHECK,28K,8.7/10,Anime|Action|Animation,Haruo Sotozaki
199,Maharaja,https://www.imdb.com/title/tt26548265/?ref_=ch...,2024|2h 21m,CHECK,CHECK,59K,8.4/10,One-Person Army Action|Action|Crime,Nithilan Saminathan


We can see a total of 5 rows whose Year column holds both the year and the runtime. In these rows the Rating_Classification and Runtime columns both contain **CHECK**. We will therefore display the rows that have **CHECK** in Rating_Classification and the rows that have **CHECK** in Runtime.

In [9]:
rating_crit = IMDB_df['Rating_Classification'] == 'CHECK'
runtime_crit = IMDB_df['Runtime'] == 'CHECK'

display( IMDB_df[rating_crit] )
display( IMDB_df[runtime_crit] )

Unnamed: 0,Title,Link,Year,Rating_Classification,Runtime,Total_Ratings,Average_Rating,Genre,Director
61,12th Fail,https://www.imdb.com/title/tt23849204/?ref_=ch...,2023|2h 27m,CHECK,CHECK,139K,8.8/10,Docudrama|Biography|Drama,Vidhu Vinod Chopra
80,Das Boot,https://www.imdb.com/title/tt0082096/?ref_=cht...,1981|2h 29m,CHECK,CHECK,274K,8.4/10,War Epic|Drama|War,Wolfgang Petersen
82,Your Name.,https://www.imdb.com/title/tt5311514/?ref_=cht...,2016|1h 46m,CHECK,CHECK,343K,8.4/10,Anime|Shōjo|Animation,Makoto Shinkai
179,Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansi...,https://www.imdb.com/title/tt16492678/?ref_=ch...,2021|1h 27m,CHECK,CHECK,28K,8.7/10,Anime|Action|Animation,Haruo Sotozaki
199,Maharaja,https://www.imdb.com/title/tt26548265/?ref_=ch...,2024|2h 21m,CHECK,CHECK,59K,8.4/10,One-Person Army Action|Action|Crime,Nithilan Saminathan


Unnamed: 0,Title,Link,Year,Rating_Classification,Runtime,Total_Ratings,Average_Rating,Genre,Director
61,12th Fail,https://www.imdb.com/title/tt23849204/?ref_=ch...,2023|2h 27m,CHECK,CHECK,139K,8.8/10,Docudrama|Biography|Drama,Vidhu Vinod Chopra
80,Das Boot,https://www.imdb.com/title/tt0082096/?ref_=cht...,1981|2h 29m,CHECK,CHECK,274K,8.4/10,War Epic|Drama|War,Wolfgang Petersen
82,Your Name.,https://www.imdb.com/title/tt5311514/?ref_=cht...,2016|1h 46m,CHECK,CHECK,343K,8.4/10,Anime|Shōjo|Animation,Makoto Shinkai
179,Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansi...,https://www.imdb.com/title/tt16492678/?ref_=ch...,2021|1h 27m,CHECK,CHECK,28K,8.7/10,Anime|Action|Animation,Haruo Sotozaki
199,Maharaja,https://www.imdb.com/title/tt26548265/?ref_=ch...,2024|2h 21m,CHECK,CHECK,59K,8.4/10,One-Person Army Action|Action|Crime,Nithilan Saminathan


From the output we can see that the same 5 rows that combine year and runtime in one column also have **CHECK** in both the Rating_Classification and Runtime columns. From the dataframe we can see that these films are not rated, so we will replace **CHECK** with **Not Rated**. We will then split the Year column at the **|** to separate the year from the runtime.

In [10]:
df = IMDB_df[rating_crit].copy()    # saving the rows in a separate dataframe (copy avoids SettingWithCopyWarning)

# Splitting the year column
year_split = df['Year'].str.split( '|' , expand = True )

 # Renaming the year split 
year_split.rename( columns = { 0 : 'Year' , 1 : 'Runtime'} , inplace = True ) 

 # Replacing CHECK in Rating classification with Not Rated
df.loc[:, 'Rating_Classification'] = df['Rating_Classification'].str.replace('CHECK', 'Not Rated')

 # Dropping the runtime and the year in df
df = df.drop( columns = ['Year' , 'Runtime'] )

 # Replacing the Dropped columns with the new year and Runtime
df['Year'] = year_split['Year']
df['Runtime'] = year_split['Runtime']

display(df)

Unnamed: 0,Title,Link,Rating_Classification,Total_Ratings,Average_Rating,Genre,Director,Year,Runtime
61,12th Fail,https://www.imdb.com/title/tt23849204/?ref_=ch...,Not Rated,139K,8.8/10,Docudrama|Biography|Drama,Vidhu Vinod Chopra,2023,2h 27m
80,Das Boot,https://www.imdb.com/title/tt0082096/?ref_=cht...,Not Rated,274K,8.4/10,War Epic|Drama|War,Wolfgang Petersen,1981,2h 29m
82,Your Name.,https://www.imdb.com/title/tt5311514/?ref_=cht...,Not Rated,343K,8.4/10,Anime|Shōjo|Animation,Makoto Shinkai,2016,1h 46m
179,Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansi...,https://www.imdb.com/title/tt16492678/?ref_=ch...,Not Rated,28K,8.7/10,Anime|Action|Animation,Haruo Sotozaki,2021,1h 27m
199,Maharaja,https://www.imdb.com/title/tt26548265/?ref_=ch...,Not Rated,59K,8.4/10,One-Person Army Action|Action|Crime,Nithilan Saminathan,2024,2h 21m


We will now drop the rows where the Rating classification is **CHECK** from the original data, then add the cleaned rows back in their place.

In [11]:
IMDB_df = IMDB_df.drop( df.index )

 # Concatenating the two dataframes and ensuring that the list remains as it was
IMDB_df = pd.concat( [ IMDB_df , df ] ).sort_index()

display(IMDB_df)

Unnamed: 0,Title,Link,Year,Rating_Classification,Runtime,Total_Ratings,Average_Rating,Genre,Director
0,The Shawshank Redemption,https://www.imdb.com/title/tt0111161/?ref_=cht...,1994,R,2h 22m,3M,9.3/10,Epic|Period Drama|Prison Drama,Frank Darabont
1,The Godfather,https://www.imdb.com/title/tt0068646/?ref_=cht...,1972,R,2h 55m,2.1M,9.2/10,Epic|Gangster|Tragedy,Francis Ford Coppola
2,The Dark Knight,https://www.imdb.com/title/tt0468569/?ref_=cht...,2008,PG-13,2h 32m,3M,9.0/10,Action Epic|Epic|Superhero,Christopher Nolan
3,The Godfather Part II,https://www.imdb.com/title/tt0071562/?ref_=cht...,1974,R,3h 22m,1.4M,9.0/10,Epic|Gangster|Tragedy,Francis Ford Coppola
4,12 Angry Men,https://www.imdb.com/title/tt0050083/?ref_=cht...,1957,Approved,1h 36m,902K,9.0/10,Legal Drama|Psychological Drama|Crime,Sidney Lumet
...,...,...,...,...,...,...,...,...,...
245,Amores Perros,https://www.imdb.com/title/tt0245712/?ref_=cht...,2000,R,2h 34m,258K,8.0/10,Tragedy|Drama|Thriller,Alejandro G. Iñárritu
246,The Help,https://www.imdb.com/title/tt1454029/?ref_=cht...,2011,PG-13,2h 26m,505K,8.1/10,Period Drama|Drama|Back to top,Tate Taylor
247,Rebecca,https://www.imdb.com/title/tt0032976/?ref_=cht...,1940,Approved,2h 10m,151K,8.1/10,Dark Romance|Psychological Thriller|Drama,Alfred Hitchcock
248,A Silent Voice: The Movie,https://www.imdb.com/title/tt5323662/?ref_=cht...,2016,Not Rated,2h 10m,112K,8.1/10,Anime|Coming-of-Age|Psychological Drama,Taichi Ishidate
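The drop-and-concatenate steps above could also be done in place with a boolean mask. The following is a minimal sketch on a toy dataframe (the two rows and the names `toy`, `mask` and `parts` are illustrative assumptions, not part of the notebook); it assumes the malformed rows are exactly those whose Year contains `|`.

```python
import pandas as pd

# Toy frame that mimics one clean row and one malformed row.
toy = pd.DataFrame({
    "Title": ["The Godfather", "Das Boot"],
    "Year": ["1972", "1981|2h 29m"],
    "Rating_Classification": ["R", "CHECK"],
    "Runtime": ["2h 55m", "CHECK"],
})

# Rows whose Year value also holds the runtime.
mask = toy["Year"].str.contains("|", regex=False)

# Split once on the separator: column 0 is the year, column 1 the runtime.
parts = toy.loc[mask, "Year"].str.split("|", n=1, expand=True)

# Write the pieces back only into the affected rows (aligned on the index).
toy.loc[mask, "Year"] = parts[0]
toy.loc[mask, "Runtime"] = parts[1]
toy.loc[mask, "Rating_Classification"] = "Not Rated"

print(toy)
```

Because the assignment is aligned on the index, the row order is preserved and no re-sorting is needed afterwards.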


I will now change the datatypes of certain columns. Year will be parsed as a datetime and reduced to an integer year, and since the Average_Rating column is a rating out of 10, we will remove the **/10** and change the datatype to a float.

In [12]:
 # Changing the year 
IMDB_df['Year'] = pd.to_datetime( IMDB_df['Year'] , format='%Y' )
IMDB_df['Year'] = IMDB_df['Year'].dt.year  # Extract the year only

 # Changing the Average Rating 
IMDB_df['Average_Rating'] = IMDB_df['Average_Rating'].str.replace( '/10' , '')

IMDB_df['Average_Rating'] = IMDB_df['Average_Rating'].astype('float')


In [13]:
print(IMDB_df['Total_Ratings'].nunique())
print(IMDB_df['Total_Ratings'].unique())

182
['3M' '2.1M' '1.4M' '902K' '2M' '1.5M' '2.3M' '837K' '1.8M' '2.4M' '2.6M'
 '1.3M' '1.1M' '2.2M' '1.9M' '519K' '378K' '1.6M' '824K' '765K' '1.2M'
 '887K' '947K' '1.7M' '1M' '741K' '334K' '427K' '76K' '623K' '959K' '295K'
 '268K' '997K' '538K' '570K' '361K' '204K' '731K' '422K' '139K' '245K'
 '709K' '222K' '146K' '797K' '629K' '442K' '662K' '532K' '274K' '450K'
 '343K' '58K' '389K' '113K' '107K' '924K' '917K' '379K' '94K' '742K'
 '324K' '218K' '946K' '172K' '356K' '476K' '174K' '439K' '811K' '810K'
 '832K' '901K' '340K' '266K' '976K' '286K' '828K' '214K' '192K' '932K'
 '634K' '182K' '384K' '220K' '958K' '123K' '283K' '612K' '290K' '89K'
 '206K' '143K' '659K' '581K' '749K' '141K' '717K' '487K' '447K' '136K'
 '848K' '265K' '105K' '471K' '195K' '186K' '342K' '360K' '229K' '847K'
 '207K' '572K' '390K' '239K' '743K' '508K' '978K' '830K' '28K' '737K'
 '397K' '260K' '84K' '756K' '189K' '843K' '352K' '911K' '621K' '575K'
 '231K' '193K' '122K' '863K' '224K' '59K' '827K' '683K' '371K' '60K'
 '

We can see that there are 182 unique values in the Total_Ratings column. Each value ends with either an M or a K, where M stands for a million and K stands for a thousand. We will therefore replace K with '000' and M with zeros, then change the datatype to an integer. Because some of the M values contain a decimal point (e.g. '2.1M') while others do not (e.g. '3M'), we need to differentiate between them: a value with a decimal point has the point removed and M replaced with '00000', while a value without a point has M replaced with '000000'.

In [14]:
IMDB_df['Total_Ratings'] = IMDB_df['Total_Ratings'].str.replace( 'K' , '000')   # Replacing the K with 000

def M_func(dataframe):
    def transform(value):
        if '.' in value:  # Check if the value contains a '.'
            value = value.replace('.', '')  # Remove the '.'
            value = value.replace('M', '00000')  # Replace 'M' with '00000'
        else:
            value = value.replace('M', '000000')  # Replace 'M' with '000000'
        return value
    
    dataframe['Total_Ratings'] = dataframe['Total_Ratings'].apply(transform)
    return dataframe

M_func(IMDB_df)

 # Changing the data type 
IMDB_df['Total_Ratings'] = IMDB_df['Total_Ratings'].astype('int64')

print( IMDB_df['Total_Ratings'].dtype)


int64
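The same conversion can be expressed numerically instead of by padding digits. This is a sketch under the assumption that every value is a number followed by an optional K or M suffix; the names `SUFFIX_MULTIPLIER` and `parse_count` are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# Multipliers for the abbreviated-count suffixes seen in the column.
SUFFIX_MULTIPLIER = {"K": 1_000, "M": 1_000_000}

def parse_count(text):
    """Convert strings such as '902K' or '2.1M' to an integer count."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in SUFFIX_MULTIPLIER:
        # round() guards against floating-point noise before the int cast.
        return int(round(float(text[:-1]) * SUFFIX_MULTIPLIER[suffix]))
    return int(text)

sample = pd.Series(["3M", "2.1M", "902K", "28K"])
counts = sample.apply(parse_count)
print(counts.tolist())
```

One reason to consider this variant: the digit-padding approach relies on there being exactly one digit after the decimal point, so a hypothetical value such as '2.15M' would be scaled incorrectly, whereas multiplying by the suffix value does not depend on the number of decimals.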


We will now check the rating classification and how many unique values it has.

In [15]:
print(IMDB_df['Rating_Classification'].nunique())
print(IMDB_df['Rating_Classification'].unique())

8
['R' 'PG-13' 'Approved' 'PG' 'Not Rated' 'G' 'NC-17' 'Passed']


We can see that the different rating classifications are:

 1. R - Restricted, under 17 requires accompanying parent or adult guardian

 2. PG-13 - Parents strongly cautioned, some material may be inappropriate for children under 13

 3. PG - Parental guidance suggested, some material may not be suitable for children

 4. Approved or Passed - Movies that were either approved or passed for release (these can be old movies that did not receive a rating, or foreign movies that are not necessarily from the US)

 5. Not Rated - Movies that have not been rated

 6. G - General audiences, suitable for all ages

 7. NC-17 - No one 17 and under admitted, content is clearly adult

Since Approved and Passed mean the same thing, we will change all of them to Approved for uniformity. We will then change the datatype to category, with the order Not Rated, G, PG, PG-13, Approved, R, NC-17.



In [16]:
IMDB_df['Rating_Classification'] = IMDB_df['Rating_Classification'].str.replace( 'Passed' , 'Approved')   # Replacing passed with Approved 

 # Changing the datatype to a category 
IMDB_df['Rating_Classification'] = IMDB_df['Rating_Classification'].astype('category')

 # Reordering the category 
IMDB_df['Rating_Classification'] = IMDB_df['Rating_Classification'].cat.reorder_categories(['Not Rated','G','PG','PG-13','Approved','R','NC-17'] , ordered = True)

print( IMDB_df['Rating_Classification'].dtype )


category
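To illustrate what the ordered category provides, here is a small sketch on made-up values (the series `ratings` is illustrative). It builds the ordering with `pd.CategoricalDtype`, which is an alternative to `reorder_categories` and does not require every category to be present in the data.

```python
import pandas as pd

order = ["Not Rated", "G", "PG", "PG-13", "Approved", "R", "NC-17"]

# Illustrative values; 'Passed' is folded into 'Approved' first.
ratings = pd.Series(["R", "Passed", "PG", "Not Rated"]).replace("Passed", "Approved")
ratings = ratings.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Ordered categories support comparisons and sort by the declared order.
stricter_than_pg = ratings > "PG"
sorted_ratings = ratings.sort_values()

print(stricter_than_pg.tolist())
print(sorted_ratings.tolist())
```

With an ordered category, filters such as "everything above PG" and sorted plots follow the declared order rather than alphabetical order.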


We will now convert the runtime into minutes. Since the values use h for hours and m for minutes (e.g. 2h 22m), we need a function that converts the hours to minutes and then adds them to the minutes.

In [17]:
# Function to convert runtime to total minutes
def convert_to_minutes(runtime):
    hours = 0
    minutes = 0
    if 'h' in runtime:
        hours = int(runtime.split('h')[0])  # Extract hours
        runtime = runtime.split('h')[1]    # Extract remaining part
    if 'm' in runtime:
        minutes = int(runtime.split('m')[0])  # Extract minutes
    return hours * 60 + minutes

IMDB_df['Runtime_in_Minutes'] = IMDB_df['Runtime'].apply(convert_to_minutes)

 # Dropping the Runtime column 
IMDB_df.drop( columns = 'Runtime' , inplace = True )

 # Renaming the Runtime by minutes back to runtime
IMDB_df.rename( columns = {'Runtime_in_Minutes':'Runtime'},inplace = True )
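As a cross-check of the conversion above, a regex-based variant can be sketched as follows. The names `RUNTIME_PATTERN` and `runtime_to_minutes` are hypothetical; the sketch assumes runtimes look like `2h 22m`, `1h` or `45m`. Unlike the function above, which returns 0 for a string with no lowercase h or m (such as CHECK), this version raises an error so that malformed values are noticed.

```python
import re

# Optional hours part followed by an optional minutes part.
RUNTIME_PATTERN = re.compile(r"(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?")

def runtime_to_minutes(runtime):
    """Convert '2h 22m', '1h' or '45m' to total minutes."""
    match = RUNTIME_PATTERN.fullmatch(runtime.strip())
    # Reject strings that match neither an hours nor a minutes part.
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognised runtime: {runtime!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes

print(runtime_to_minutes("2h 22m"))
print(runtime_to_minutes("1h"))
print(runtime_to_minutes("45m"))
```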


We will now examine the Genre column, where multiple genres are listed for some entries and separated by the | symbol (e.g., Legal Drama|Psychological Drama|Crime). To simplify our analysis, we will split these entries at the | symbol and retain only the first listed genre for each movie. This approach ensures that each item has a single, primary genre for easier categorization and comparison.

In [18]:
 # spliting the genre
genre_split = IMDB_df['Genre'].str.split( '|' , expand = True )

 # Renaming the first column 
genre_split = genre_split.rename( columns = { 0 : 'Genre' } )

 # Dropping the Genre column in the original dataframe
IMDB_df.drop( columns = 'Genre' , inplace = True )

 # Adding the new genre column 
IMDB_df['Genre'] = genre_split['Genre']

 # checking number of unique genres
print(IMDB_df['Genre'].nunique())

66
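The same "keep the first genre" step can be written more compactly with the `.str` accessor. This is a sketch on illustrative values (the series `genres` is made up from examples shown earlier in the data).

```python
import pandas as pd

genres = pd.Series([
    "Epic|Period Drama|Prison Drama",
    "Legal Drama|Psychological Drama|Crime",
    "Drama",
])

# Split each entry into a list and keep only the first element.
primary = genres.str.split("|").str[0]

print(primary.tolist())
```

Keeping only the first entry also sidesteps trailing scrape residue such as the `Back to top` item visible in one of the Genre values displayed earlier, assuming such residue only ever appears after the first genre.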


**ROTTEN TOMATOES CLEANING**

Next, we will clean the Rotten Tomatoes data, ensuring it is properly structured and consistent for analysis.

In [19]:
display(Rotten_Tomatoes_df)

print(Rotten_Tomatoes_df.columns.to_list())   # Getting the columns

print(Rotten_Tomatoes_df.isnull().sum())   # Getting the number of null values

print(Rotten_Tomatoes_df.duplicated().sum())  # Getting the number of duplicated values 

print(Rotten_Tomatoes_df.dtypes)

Unnamed: 0,Title,Link,Year,Runtime,Rating_Classification,Total_Ratings,Critic_Score,Audience_Score,Genre,Director
0,L.A. Confidential,https://www.rottentomatoes.com/m/la_confidential,"Released Sep 19, 1997,","2h 16m,","R,","\n 100,000+ Ratings\n",99%,94%,Crime/,Curtis Hanson
1,The Godfather,https://www.rottentomatoes.com/m/the_godfather,"Released Mar 15, 1972,","2h 57m,","R,","\n 250,000+ Ratings\n",97%,98%,Crime/,Francis Ford Coppola
2,Casablanca,https://www.rottentomatoes.com/m/1003707-casab...,"Released Jan 23, 1943,","1h 42m,","PG,","\n 250,000+ Ratings\n",99%,95%,Drama,Michael Curtiz
3,Seven Samurai,https://www.rottentomatoes.com/m/seven_samurai...,"Released Nov 19, 1956, |3h 28m,",CHECK,CHECK,"\n 50,000+ Ratings\n",100%,97%,Action,Akira Kurosawa
4,Parasite,https://www.rottentomatoes.com/m/parasite_2019,"Released Nov 1, 2019,","2h 12m,","R,","\n 5,000+ Verified Ratings\n",99%,90%,Comedy/,Bong Joon Ho
...,...,...,...,...,...,...,...,...,...,...
295,Beauty and the Beast,https://www.rottentomatoes.com/m/beauty_and_th...,"Released Jan 1, 1947, |1h 35m,",CHECK,CHECK,"\n 10,000+ Ratings\n",96%,90%,Fantasy,Jean Cocteau
296,The Killing,https://www.rottentomatoes.com/m/killing,"Released May 20, 1956, |1h 23m,",CHECK,CHECK,"\n 10,000+ Ratings\n",96%,92%,Crime/,Stanley Kubrick
297,The Rules of the Game,https://www.rottentomatoes.com/m/the_rules_of_...,"Released Jul 8, 1939, |1h 50m,",CHECK,CHECK,"\n 10,000+ Ratings\n",97%,89%,Comedy/,Jean Renoir
298,Eyes Without a Face,https://www.rottentomatoes.com/m/eyes_without_...,"Released Oct 31, 1962, |1h 30m,",CHECK,CHECK,"\n 5,000+ Ratings\n",97%,87%,Horror/,Georges Franju


['Title', 'Link', 'Year', 'Runtime', 'Rating_Classification', 'Total_Ratings', 'Critic_Score', 'Audience_Score', 'Genre', 'Director']
Title                    0
Link                     0
Year                     0
Runtime                  0
Rating_Classification    0
Total_Ratings            0
Critic_Score             0
Audience_Score           0
Genre                    0
Director                 0
dtype: int64
0
Title                    object
Link                     object
Year                     object
Runtime                  object
Rating_Classification    object
Total_Ratings            object
Critic_Score             object
Audience_Score           object
Genre                    object
Director                 object
dtype: object


We can see from the output that the dataframe contains 300 rows and 10 columns.

The columns are:

1. **Title**
2. **Link**
3. **Runtime**
4. **Rating_Classification** (the content rating of the movie, e.g. R or PG)
5. **Genre**
6. **Director**
7. **Total_Ratings**
8. **Critic_Score**
9. **Audience_Score**
10. **Year**

The Dataframe also does not have any null values nor does it have any duplicate values.

We can also see that all the columns have the datatype object, so we need to change the datatype of certain columns: Runtime, Rating_Classification, Total_Ratings, Critic_Score, Audience_Score and Year.

We will first check the critic score and audience score.

In [20]:
 # Getting the unique values of both critic and audience score 
print(Rotten_Tomatoes_df['Critic_Score'].unique())
print(Rotten_Tomatoes_df['Audience_Score'].unique())

['99%' '97%' '100%' '98%' '96%' '95%' '94%' '93%' '90%' '92%' '91%' '89%']
['94%' '98%' '95%' '97%' '90%' '99%' '87%' '93%' '92%' '91%' '96%' '86%'
 '89%' '88%' '82%' '84%' '79%' '78%' '83%' '85%' '80%']


From the output we can see that the values in the critic and audience score columns are numbers followed by %. We will therefore remove the percentage sign and then change the datatype to an integer.

In [21]:
 # Replacing the % with ''
Rotten_Tomatoes_df['Audience_Score'] = Rotten_Tomatoes_df['Audience_Score'].str.replace( '%' , '' )
Rotten_Tomatoes_df['Critic_Score'] = Rotten_Tomatoes_df['Critic_Score'].str.replace( '%' , '' )

 # Changing the datatype 
Rotten_Tomatoes_df['Audience_Score'] = Rotten_Tomatoes_df['Audience_Score'].astype('int64')
Rotten_Tomatoes_df['Critic_Score'] = Rotten_Tomatoes_df['Critic_Score'].astype('int64')

 # Checking the datatype 
print( Rotten_Tomatoes_df['Audience_Score'].dtype )
print( Rotten_Tomatoes_df['Critic_Score'].dtype )

int64
int64
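If the score columns might contain entries that are not percentages, the conversion can be made defensive. The sketch below uses made-up values (including a deliberately malformed `n/a`) and `pd.to_numeric` with `errors="coerce"` so that unparseable entries become NaN and can be inspected before casting; the names `scores`, `numeric` and `bad_entries` are illustrative.

```python
import pandas as pd

# Illustrative values; 'n/a' stands in for a malformed entry.
scores = pd.Series(["99%", "100%", "n/a"])

# Strip the trailing percent sign, then coerce anything unparseable to NaN.
numeric = pd.to_numeric(scores.str.rstrip("%"), errors="coerce")

# Entries that failed to parse, for manual inspection.
bad_entries = scores[numeric.isna()]

print(numeric.tolist())
print(bad_entries.tolist())
```

In the data shown above every value parses cleanly, so the direct `astype('int64')` cast succeeds; the coercing variant is only useful as a guard on less tidy input.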


We will then check the different unique values in Total ratings.

In [22]:
print( Rotten_Tomatoes_df['Total_Ratings'].unique() )

['\n            100,000+ Ratings\n        '
 '\n            250,000+ Ratings\n        '
 '\n            50,000+ Ratings\n        '
 '\n            5,000+ Verified Ratings\n        '
 '\n            50,000+ Verified Ratings\n        '
 '\n            10,000+ Ratings\n        '
 '\n            25,000+ Ratings\n        '
 '\n            1,000+ Verified Ratings\n        '
 '\n            25,000+ Verified Ratings\n        '
 '\n            5,000+ Ratings\n        '
 '\n            1,000+ Ratings\n        '
 '\n            250+ Verified Ratings\n        '
 '\n            10,000+ Verified Ratings\n        '
 '\n            2,500+ Ratings\n        '
 '\n            2,500+ Verified Ratings\n        '
 '\n            100+ Verified Ratings\n        ']


The `Total_Ratings` column contains unwanted characters such as `\n`, spaces, commas, the "+" symbol, and the words "Ratings" and "Verified Ratings". We will clean this column by removing these elements and converting the data type to integers. Since the "+" symbol means "more than", each value is really a lower bound; for consistency we will treat it as the exact number. For example, "100,000+" will be treated as `100,000`. This transformation ensures the data is standardized.

In [23]:
Rotten_Tomatoes_df['Total_Ratings'] = Rotten_Tomatoes_df['Total_Ratings'].str.replace( '\n' , '' ) 
Rotten_Tomatoes_df['Total_Ratings'] = Rotten_Tomatoes_df['Total_Ratings'].str.replace( ' ' , '' )
Rotten_Tomatoes_df['Total_Ratings'] = Rotten_Tomatoes_df['Total_Ratings'].str.replace( '+' , '' , regex = False )   # match '+' literally regardless of the pandas default for regex
Rotten_Tomatoes_df['Total_Ratings'] = Rotten_Tomatoes_df['Total_Ratings'].str.replace( ',' , '' )
Rotten_Tomatoes_df['Total_Ratings'] = Rotten_Tomatoes_df['Total_Ratings'].str.replace( 'Ratings' , '' )
Rotten_Tomatoes_df['Total_Ratings'] = Rotten_Tomatoes_df['Total_Ratings'].str.replace( 'Verified' , '' )

 # Changing the datatype to integer
Rotten_Tomatoes_df['Total_Ratings'] = Rotten_Tomatoes_df['Total_Ratings'].astype('int64')

print(Rotten_Tomatoes_df['Total_Ratings'].dtype)

int64
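The six replacements above can be collapsed into a single regular expression that removes every non-digit character. This is a sketch on three of the raw strings shown earlier; the names `raw` and `lower_bounds` are illustrative.

```python
import pandas as pd

raw = pd.Series([
    "\n            100,000+ Ratings\n        ",
    "\n            5,000+ Verified Ratings\n        ",
    "\n            250+ Verified Ratings\n        ",
])

# \D matches any character that is not a digit; drop them all, then cast.
lower_bounds = raw.str.replace(r"\D", "", regex=True).astype("int64")

print(lower_bounds.tolist())
```

This relies on the assumption that the only digits in each string belong to the rating count, which holds for the unique values printed above.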


We will then print the unique values of Rating classification 

In [24]:
print( Rotten_Tomatoes_df['Rating_Classification'].unique() )

['R,\xa0' 'PG,\xa0' 'CHECK' 'PG-13,\xa0' 'G,\xa0' 'TV-14,\xa0']


The Rating_Classification column contains values such as `R,\xa0`, `PG,\xa0`, `CHECK`, `PG-13,\xa0`, `G,\xa0` and `TV-14,\xa0`. As a first step, we will clean the column by removing the `\xa0` (a non-breaking space) and the trailing comma to make the data more uniform. Once this is done, we will inspect the column further.

In [25]:
Rotten_Tomatoes_df['Rating_Classification'] = Rotten_Tomatoes_df['Rating_Classification'].str.replace( '\xa0' , '' )
Rotten_Tomatoes_df['Rating_Classification'] = Rotten_Tomatoes_df['Rating_Classification'].str.replace( ',' , '' )

print( Rotten_Tomatoes_df['Rating_Classification'].unique() )

['R' 'PG' 'CHECK' 'PG-13' 'G' 'TV-14']
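Since the unwanted characters only appear at the end of each value, the two replacements can also be written as a single strip. A sketch on values taken from the output above (the names `raw` and `cleaned` are illustrative):

```python
import pandas as pd

raw = pd.Series(["R,\xa0", "PG-13,\xa0", "CHECK", "TV-14,\xa0"])

# Strip commas, non-breaking spaces and ordinary spaces from both ends only.
cleaned = raw.str.strip(",\xa0 ")

print(cleaned.tolist())
```

Stripping only the ends leaves any character inside the value untouched, which matters for labels such as PG-13 that contain punctuation.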


Since CHECK is not a rating classification, we will now look at all rows where the Rating classification is CHECK.

In [26]:
x = Rotten_Tomatoes_df['Rating_Classification'] == 'CHECK'    # Setting the parameter
display(Rotten_Tomatoes_df[x])

Unnamed: 0,Title,Link,Year,Runtime,Rating_Classification,Total_Ratings,Critic_Score,Audience_Score,Genre,Director
3,Seven Samurai,https://www.rottentomatoes.com/m/seven_samurai...,"Released Nov 19, 1956, |3h 28m,",CHECK,CHECK,50000,100,97,Action,Akira Kurosawa
9,On the Waterfront,https://www.rottentomatoes.com/m/on_the_waterf...,"Released Jul 28, 1954, |1h 48m,",CHECK,CHECK,50000,99,95,Drama,Elia Kazan
10,The Battle of Algiers,https://www.rottentomatoes.com/m/the_battle_of...,"Released Sep 20, 1967, |2h 5m,",CHECK,CHECK,10000,99,95,War/,Gillo Pontecorvo
15,All About Eve,https://www.rottentomatoes.com/m/1000626-all_a...,"Released Oct 13, 1950, |2h 18m,",CHECK,CHECK,25000,99,94,Drama,Joseph L. Mankiewicz
18,The Third Man,https://www.rottentomatoes.com/m/the_third_man,"Released Feb 1, 1949, |1h 44m,",CHECK,CHECK,50000,99,93,Mystery & Thriller,Carol Reed
...,...,...,...,...,...,...,...,...,...,...
295,Beauty and the Beast,https://www.rottentomatoes.com/m/beauty_and_th...,"Released Jan 1, 1947, |1h 35m,",CHECK,CHECK,10000,96,90,Fantasy,Jean Cocteau
296,The Killing,https://www.rottentomatoes.com/m/killing,"Released May 20, 1956, |1h 23m,",CHECK,CHECK,10000,96,92,Crime/,Stanley Kubrick
297,The Rules of the Game,https://www.rottentomatoes.com/m/the_rules_of_...,"Released Jul 8, 1939, |1h 50m,",CHECK,CHECK,10000,97,89,Comedy/,Jean Renoir
298,Eyes Without a Face,https://www.rottentomatoes.com/m/eyes_without_...,"Released Oct 31, 1962, |1h 30m,",CHECK,CHECK,5000,97,87,Horror/,Georges Franju


We noticed that wherever the `Rating` column has `CHECK`, the `Runtime` column also shows `CHECK`. On top of that, the `Year` column combines the release year and runtime, separated by a `|`. This makes it seem like the runtime and rating info might’ve been missing when the data was collected.

It looks as if, in those cases, the release year and runtime were merged into the `Year` column (using the `|` as a separator) and both `Runtime` and `Rating` were set to `CHECK`. To fix this, we’ll split them apart again. But first, let’s double-check this theory by looking at rows where the `Rating` isn’t marked as `CHECK`.

In [27]:
y = Rotten_Tomatoes_df['Rating_Classification'] != 'CHECK'    # Setting the parameter
display(Rotten_Tomatoes_df[y])

Unnamed: 0,Title,Link,Year,Runtime,Rating_Classification,Total_Ratings,Critic_Score,Audience_Score,Genre,Director
0,L.A. Confidential,https://www.rottentomatoes.com/m/la_confidential,"Released Sep 19, 1997,","2h 16m,",R,100000,99,94,Crime/,Curtis Hanson
1,The Godfather,https://www.rottentomatoes.com/m/the_godfather,"Released Mar 15, 1972,","2h 57m,",R,250000,97,98,Crime/,Francis Ford Coppola
2,Casablanca,https://www.rottentomatoes.com/m/1003707-casab...,"Released Jan 23, 1943,","1h 42m,",PG,250000,99,95,Drama,Michael Curtiz
4,Parasite,https://www.rottentomatoes.com/m/parasite_2019,"Released Nov 1, 2019,","2h 12m,",R,5000,99,90,Comedy/,Bong Joon Ho
5,Schindler's List,https://www.rottentomatoes.com/m/schindlers_list,"Released Dec 15, 1993,","3h 15m,",R,250000,98,97,History/,Steven Spielberg
...,...,...,...,...,...,...,...,...,...,...
288,Being There,https://www.rottentomatoes.com/m/being_there,"Released Dec 19, 1979,","2h 10m,",PG,25000,95,92,Comedy,Hal Ashby
290,Arrival,https://www.rottentomatoes.com/m/arrival_2016,"Released Nov 11, 2016,","1h 56m,",PG-13,50000,94,82,Sci-Fi/,Denis Villeneuve
291,Wings of Desire,https://www.rottentomatoes.com/m/wings_of_desire,"Released Jan 1, 1987,","2h 8m,",PG-13,25000,95,93,Fantasy,Wim Wenders
292,Raging Bull,https://www.rottentomatoes.com/m/raging_bull,"Released Jan 1, 1980,","2h 8m,",R,100000,92,93,Biography/,Martin Scorsese


From the output, we can confirm that our assumption is correct. There are 92 rows where the Rating classification is marked as CHECK and 208 rows where it is not. Together, these add up to 300, which matches the total number of rows in the dataframe. In the displayed rows without CHECK, the `Year` and `Runtime` columns are properly separated, which supports the idea that the merged values occur only where the Rating classification is CHECK.
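The assumption can also be tested directly rather than by counting rows. The sketch below, on a two-row toy dataframe with values copied from the output above, checks that the three conditions (Rating is CHECK, Runtime is CHECK, Year contains `|`) select exactly the same rows; the names `toy` and `masks_agree` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": ["Released Sep 19, 1997,", "Released Nov 19, 1956, |3h 28m,"],
    "Runtime": ["2h 16m,", "CHECK"],
    "Rating_Classification": ["R", "CHECK"],
})

rating_is_check = toy["Rating_Classification"] == "CHECK"
runtime_is_check = toy["Runtime"] == "CHECK"
year_has_runtime = toy["Year"].str.contains("|", regex=False)

# True only if all three masks flag exactly the same rows.
masks_agree = bool(
    (rating_is_check == runtime_is_check).all()
    and (rating_is_check == year_has_runtime).all()
)

print(masks_agree)
```

Running the same comparison on the full dataframe would confirm the pattern for every row, not only the ones visible in the truncated display.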

First, we'll create a new dataframe containing only the rows where the `Rating` classification is marked as `CHECK`. In this new dataframe, we'll split the `Year` column at the `|` separator to extract the release year and runtime as separate columns. After that, we'll drop the original `Year` and `Runtime` columns and replace them with the newly extracted ones. We'll also update the `Rating` classification from `CHECK` to `Not Rated` for consistency.  

Next, we'll remove the rows with `CHECK` in the `Rating` classification from the original dataframe. Finally, we'll concatenate the cleaned dataframe back into the original one, ensuring that all rows are properly structured.

In [28]:
 # Creating the new dataframe
rt_df = Rotten_Tomatoes_df[x]

rt_df = rt_df.copy()

 # splitting the year
yr_split = rt_df['Year'].str.split( '|' , expand = True )

 # Renaming the columns 
yr_split.rename( columns = { 0 : 'Year' , 1 : 'Runtime'} , inplace = True )

 # Dropping the Year and Runtime and adding the new Year and Runtime
rt_df.drop( columns = ['Year','Runtime'] , inplace = True )
rt_df.loc[:, 'Year'] = yr_split['Year']
rt_df.loc[:, 'Runtime'] = yr_split['Runtime']

 # Replacing 'CHECK' with Not Rated
rt_df.loc[:, 'Rating_Classification'] = rt_df['Rating_Classification'].str.replace('CHECK', 'Not Rated')

 # Dropping the rows with 'CHECK' in the original dataframe
Rotten_Tomatoes_df = Rotten_Tomatoes_df.drop( rt_df.index )

 # Concatenating the two dataframes 
Rotten_Tomatoes_df = pd.concat( [ Rotten_Tomatoes_df , rt_df ] ).sort_index()

display( Rotten_Tomatoes_df )

Unnamed: 0,Title,Link,Year,Runtime,Rating_Classification,Total_Ratings,Critic_Score,Audience_Score,Genre,Director
0,L.A. Confidential,https://www.rottentomatoes.com/m/la_confidential,"Released Sep 19, 1997,","2h 16m,",R,100000,99,94,Crime/,Curtis Hanson
1,The Godfather,https://www.rottentomatoes.com/m/the_godfather,"Released Mar 15, 1972,","2h 57m,",R,250000,97,98,Crime/,Francis Ford Coppola
2,Casablanca,https://www.rottentomatoes.com/m/1003707-casab...,"Released Jan 23, 1943,","1h 42m,",PG,250000,99,95,Drama,Michael Curtiz
3,Seven Samurai,https://www.rottentomatoes.com/m/seven_samurai...,"Released Nov 19, 1956,","3h 28m,",Not Rated,50000,100,97,Action,Akira Kurosawa
4,Parasite,https://www.rottentomatoes.com/m/parasite_2019,"Released Nov 1, 2019,","2h 12m,",R,5000,99,90,Comedy/,Bong Joon Ho
...,...,...,...,...,...,...,...,...,...,...
295,Beauty and the Beast,https://www.rottentomatoes.com/m/beauty_and_th...,"Released Jan 1, 1947,","1h 35m,",Not Rated,10000,96,90,Fantasy,Jean Cocteau
296,The Killing,https://www.rottentomatoes.com/m/killing,"Released May 20, 1956,","1h 23m,",Not Rated,10000,96,92,Crime/,Stanley Kubrick
297,The Rules of the Game,https://www.rottentomatoes.com/m/the_rules_of_...,"Released Jul 8, 1939,","1h 50m,",Not Rated,10000,97,89,Comedy/,Jean Renoir
298,Eyes Without a Face,https://www.rottentomatoes.com/m/eyes_without_...,"Released Oct 31, 1962,","1h 30m,",Not Rated,5000,97,87,Horror/,Georges Franju
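
The same repair can also be done in place with a boolean mask, without the drop-and-concatenate step. The sketch below is only an illustration on toy data: the layout of the misaligned `Year` value (`'1994, | 1h 40m,'`) and the names `toy_df`, `needs_fix` and `parts` are assumptions, not taken from the scraped dataset.

```python
import pandas as pd

# Toy data: row 1 mimics a scraped row where the rating is the placeholder
# 'CHECK' and the Year field holds both the year and the runtime.
# The exact layout of that field is an assumption for illustration.
toy_df = pd.DataFrame({
    'Title': ['Movie A', 'Movie B'],
    'Year': ['Released Sep 19, 1997,', '1994, | 1h 40m,'],
    'Runtime': ['2h 16m,', None],
    'Rating_Classification': ['R', 'CHECK'],
})

# Boolean mask selecting only the misaligned rows
needs_fix = toy_df['Rating_Classification'] == 'CHECK'

# Split the Year field of the affected rows into two columns (0 and 1)
parts = toy_df.loc[needs_fix, 'Year'].str.split('|', expand=True)

# Write the pieces back into the same rows; untouched rows stay as they are
toy_df.loc[needs_fix, 'Year'] = parts[0].str.strip()
toy_df.loc[needs_fix, 'Runtime'] = parts[1].str.strip()
toy_df.loc[needs_fix, 'Rating_Classification'] = 'Not Rated'

print(toy_df)
```

Because the assignments go through `.loc` with the same mask, the row order is preserved and no re-sorting is needed.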


Next, we'll convert `Runtime` to an integer number of minutes. First we remove the trailing comma from each value, then we convert the hours to minutes and add them to the remaining minutes.

In [29]:
Rotten_Tomatoes_df['Runtime'] = Rotten_Tomatoes_df['Runtime'].str.replace( ',' , '' )

Rotten_Tomatoes_df['Runtime_in_Minutes'] = Rotten_Tomatoes_df['Runtime'].apply(convert_to_minutes)

 # Dropping the Runtime column 
Rotten_Tomatoes_df.drop( columns = 'Runtime' , inplace = True )

 # Renaming Runtime_in_Minutes back to Runtime
Rotten_Tomatoes_df.rename( columns = {'Runtime_in_Minutes':'Runtime'},inplace = True )

 # Printing the datatype 
print(Rotten_Tomatoes_df['Runtime'].dtype)

int64
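
The `convert_to_minutes` helper used above is defined in an earlier cell that is not shown in this section. For readers following along, here is a minimal sketch of what such a helper might look like, assuming runtimes are strings such as `'2h 16m'`, `'2h'` or `'45m'`. The name `runtime_to_minutes_sketch` is hypothetical and is not the notebook's actual function.

```python
import re

def runtime_to_minutes_sketch(runtime):
    """Convert a string such as '2h 16m' into total minutes.

    Hypothetical stand-in for the notebook's convert_to_minutes helper.
    """
    hours = re.search(r'(\d+)\s*h', runtime)    # e.g. '2h' -> 2
    minutes = re.search(r'(\d+)\s*m', runtime)  # e.g. '16m' -> 16
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

print(runtime_to_minutes_sketch('2h 16m'))
```

Handling the hour and minute parts separately means values with only one of the two parts still convert cleanly.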


We will now check the unique values in the `Rating_Classification` column.

In [30]:
print(Rotten_Tomatoes_df['Rating_Classification'].unique())

['R' 'PG' 'Not Rated' 'PG-13' 'G' 'TV-14']


From the output, we can see that there are six unique rating classifications: **R, PG, PG-13, G, TV-14, and Not Rated**. We'll keep all six and store the column as an ordered category. This ensures that all movies and TV shows fall into a well-defined rating system, and it makes the column easier to sort and compare during analysis.

In [31]:
 # Changing the datatype to a category 
Rotten_Tomatoes_df['Rating_Classification'] = Rotten_Tomatoes_df['Rating_Classification'].astype('category')

 # Reordering the category 
Rotten_Tomatoes_df['Rating_Classification'] = Rotten_Tomatoes_df['Rating_Classification'].cat.reorder_categories(['Not Rated','G','PG','PG-13','TV-14','R'] , ordered = True)

print( Rotten_Tomatoes_df['Rating_Classification'].dtype )

category
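
To illustrate what the ordered category makes possible, the toy example below sorts and compares ratings by the order defined above rather than alphabetically. The series `ratings` is made-up sample data; only the category order comes from the cell above.

```python
import pandas as pd

# Same category order as used for Rotten_Tomatoes_df above
rating_order = ['Not Rated', 'G', 'PG', 'PG-13', 'TV-14', 'R']

# Made-up sample values, stored as an ordered categorical
ratings = pd.Series(
    ['R', 'PG', 'Not Rated', 'G'],
    dtype=pd.CategoricalDtype(categories=rating_order, ordered=True),
)

# Sorting follows the category order, not alphabetical order
sorted_ratings = ratings.sort_values().tolist()

# Comparisons against a category label are allowed for ordered categoricals
stricter_than_pg = (ratings > 'PG').tolist()

# min/max are also defined by the category order
highest = ratings.max()

print(sorted_ratings, stricter_than_pg, highest)
```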


Next, we'll examine the unique values in both the **Genre** and **Director** columns to identify any inconsistencies, such as extra spaces, unexpected characters, or variations in formatting. This will help ensure that our data is clean and standardized before moving forward with the analysis.

In [32]:
print(Rotten_Tomatoes_df['Director'].unique())
print(Rotten_Tomatoes_df['Genre'].unique())

['Curtis Hanson' 'Francis Ford Coppola' 'Michael Curtiz' 'Akira Kurosawa'
 'Bong Joon Ho' 'Steven Spielberg' 'Joseph Kosinski' 'Ash Brannon'
 'Roman Polanski' 'Elia Kazan' 'Gillo Pontecorvo' 'John Lasseter'
 'Alfred Hitchcock' 'Charlie Chaplin' 'Christopher Sanders'
 'Joseph L. Mankiewicz' 'Hayao Miyazaki' 'Pete Docter' 'Carol Reed'
 'Tom McCarthy' 'Bob Persichetti' 'George Cukor' 'Andrew Stanton'
 'Stanley Donen' 'Sidney Lumet' 'Lee Unkrich' 'Billy Wilder'
 'Krzysztof Kieslowski' 'Ava DuVernay' 'Byron Howard' 'Orson Welles'
 'Woody Allen' 'Stuart Rosenberg' 'Alexander Payne' 'Stanley Kubrick'
 'Tomas Alfredson' 'Peter Jackson' 'Rian Johnson' 'Fritz Lang'
 'Josh Cooley' 'Darren Aronofsky' 'Martin Scorsese' 'Victor Fleming'
 'Paul King' 'Richard Linklater' 'Christopher Nolan' 'John Huston'
 'Frank Capra' 'Henri-Georges Clouzot' 'Vittorio De Sica' 'Ridley Scott'
 'Ben Affleck' 'Jordan Peele' 'Christopher McQuarrie' 'Robert Hamer'
 'François Truffaut' 'Isao Takahata' 'Michael Showalter' '

Since the Director column has no inconsistencies, we can leave it as is. However, some values in the Genre column end with a trailing '/', so we will remove it to ensure consistency in our analysis.

In [33]:
 # Replacing the / with ''
Rotten_Tomatoes_df['Genre'] = Rotten_Tomatoes_df['Genre'].str.replace('/','')
 
 # Checking the unique values 
print(Rotten_Tomatoes_df['Genre'].unique())

['Crime' 'Drama' 'Action' 'Comedy' 'History' 'Kids & Family' 'War'
 'Mystery & Thriller' 'Fantasy' 'Romance' 'Musical' 'Holiday' 'Horror'
 'Sci-Fi' 'Adventure' 'Western' 'Biography' 'Documentary']
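
One design note on the cleaning step above: `str.replace('/', '')` removes every slash in a value, while `str.rstrip('/')` removes only trailing ones. For this dataset the two give the same result, but they differ if a value has a slash in the middle. The sketch below uses made-up values, and `'Action/Adventure/'` is a hypothetical case that does not appear in the data.

```python
import pandas as pd

# Made-up genre values; 'Action/Adventure/' is hypothetical
genres = pd.Series(['Crime/', 'Drama', 'Sci-Fi/', 'Action/Adventure/'])

# Removes every '/' in the string (literal match, no regex)
all_removed = genres.str.replace('/', '', regex=False)

# Removes only the slashes at the end of the string
trailing_only = genres.str.rstrip('/')

print(all_removed.tolist())
print(trailing_only.tolist())
```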


Next, we'll check the unique values in the `Year` column.

In [34]:
print(Rotten_Tomatoes_df['Year'].unique())

['Released Sep 19, 1997,\xa0' 'Released Mar 15, 1972,\xa0'
 'Released Jan 23, 1943,\xa0' 'Released Nov 19, 1956,\xa0'
 'Released Nov 1, 2019,\xa0' 'Released Dec 15, 1993,\xa0'
 'Released May 27, 2022,\xa0' 'Released Nov 24, 1999,\xa0'
 'Released Jun 20, 1974,\xa0' 'Released Jul 28, 1954,\xa0'
 'Released Sep 20, 1967,\xa0' 'Released Nov 22, 1995,\xa0'
 'Released Sep 1, 1954,\xa0' 'Released Feb 5, 1936,\xa0'
 'Released Mar 26, 2010,\xa0' 'Released Oct 13, 1950,\xa0'
 'Released Sep 20, 2002,\xa0' 'Released May 29, 2009,\xa0'
 'Released Feb 1, 1949,\xa0' 'Released Nov 20, 2015,\xa0'
 'Released Dec 14, 2018,\xa0' 'Released Dec 1, 1940,\xa0'
 'Released May 30, 2003,\xa0' 'Released Apr 10, 1952,\xa0'
 'Released Apr 20, 1957,\xa0' 'Released Jun 18, 2010,\xa0'
 'Released Aug 10, 1950,\xa0' 'Released Nov 22, 2017,\xa0'
 'Released Dec 12, 1974,\xa0' '1994,\xa0' 'Released Jan 9, 2015,\xa0'
 'Released Mar 4, 2016,\xa0' 'Released May 1, 1941,\xa0'
 'Released Jan 1, 1977,\xa0' 'Released Nov 1, 1967,\

To clean the `Year` column, we will extract the year from text that typically consists of the word "Released" followed by the month, day, and year, and ends with a non-breaking space (`\xa0`). To achieve this, we will use a regular expression (regex) that matches the first run of four consecutive digits, which is the year. After extracting it, we will parse it as a datetime and then keep only the year as a nullable integer (`Int64`).

In [35]:
 # Extract the year using regex and convert to datetime
Rotten_Tomatoes_df['Year'] = pd.to_datetime(Rotten_Tomatoes_df['Year'].str.extract(r'(\d{4})')[0], format='%Y')
 
 # Extracting only the Year
Rotten_Tomatoes_df['Year'] = Rotten_Tomatoes_df['Year'].dt.year.astype('Int64') 

print(Rotten_Tomatoes_df['Year'].unique()) 

<IntegerArray>
[1997, 1972, 1943, 1956, 2019, 1993, 2022, 1999, 1974, 1954, 1967, 1995, 1936,
 2010, 1950, 2002, 2009, 1949, 2015, 2018, 1940, 2003, 1952, 1957, 2017, 1994,
 2016, 1941, 1977, 2023, 1964, 2008, 1933, 1990, 1939, 1944, 1960, 1934, 1955,
 1959, 1948, 1979, 2012, 1921, 1988, 2021, 1971, 1986, 1953, 1927, 1998, 1925,
 2011, 1991, 1983, 1930, 2001, 1982, 1984, 2000, 2013, 1975, 2006, 1981, 1946,
 1970, 2007, 2014, 1992, 1938, 1980, 1928, 1985, 1923, 1935, 1945, 2004, 2024,
 1987, 1962, 1951, 1996, 1969, 1963, 1926, 1942, 1989, 1978, 1973, 1937, 1958,
 1961, 1931, 1968, 1929, <NA>, 1976, 1947]
Length: 98, dtype: Int64
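
As a small illustration of the regex step, the sketch below applies the same pattern to a few sample strings and converts the result straight to a nullable integer, skipping the datetime round trip. This is an alternative under stated assumptions, not the approach used in the cell above. The third value (`'Streaming soon'`) is made up to show what happens when no four-digit run is found.

```python
import pandas as pd

# Two values in the format seen above, plus one made-up value with no year
raw_years = pd.Series([
    'Released Sep 19, 1997,\xa0',
    '1994,\xa0',
    'Streaming soon',
])

# Capture the first run of four digits; rows without a match become NaN
extracted = raw_years.str.extract(r'(\d{4})')[0]

# Convert to numbers, then to the nullable Int64 dtype so missing stays <NA>
years = pd.to_numeric(extracted).astype('Int64')

print(years)
```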


We can see that the `Year` column contains null values, so we will check which rows are affected and fill in the correct release years.

In [36]:
 # Displaying the rows that have null values 
Rt_df = ( Rotten_Tomatoes_df[Rotten_Tomatoes_df.isna().any(axis=1)] )

Rt_df = Rt_df.copy()

display(Rt_df)

Unnamed: 0,Title,Link,Year,Rating_Classification,Total_Ratings,Critic_Score,Audience_Score,Genre,Director,Runtime
280,To Be or Not to Be,https://www.rottentomatoes.com/m/to_be_or_not_...,,Not Rated,5000,96,93,Comedy,Ernst Lubitsch,99
289,Aguirre: The Wrath of God,https://www.rottentomatoes.com/m/aguirre_the_w...,,Not Rated,10000,96,91,Adventure,Werner Herzog,94


We have identified that only 2 rows contain null values in the Year column. To address this, we will replace these null values with the correct release years for the respective movies. Specifically, "To Be or Not to Be" was released in 1942, and "Aguirre: The Wrath of God" was released in 1972. After filling in these values, we will drop the rows with null values from the original dataframe and concatenate it with the newly updated dataframe.

In [37]:
 # Manually replacing NaN values
Rt_df.loc[280, 'Year'] = 1942  # First row
Rt_df.loc[289, 'Year'] = 1972  # Second row

# Convert Year to integer type
Rt_df['Year'] = Rt_df['Year'].astype('Int64')  

 # Dropping the rows with null values from the original dataframe
Rotten_Tomatoes_df = Rotten_Tomatoes_df.drop( Rt_df.index )

 # Adding the rows back, now with the Year filled in
Rotten_Tomatoes_df = pd.concat( [ Rotten_Tomatoes_df , Rt_df ] ).sort_index()
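
An alternative to indexing by row number is to fill the missing years by title, which keeps working even if the row positions change after a new scrape. The sketch below runs on a toy dataframe. The names `toy_df` and `known_years` are illustrative, and the two release years are the ones stated above.

```python
import pandas as pd

# Toy dataframe with two missing years, mirroring the situation above
toy_df = pd.DataFrame({
    'Title': ['The Killing', 'To Be or Not to Be', 'Aguirre: The Wrath of God'],
    'Year': pd.array([1956, pd.NA, pd.NA], dtype='Int64'),
})

# Release years looked up manually, keyed by title
known_years = {
    'To Be or Not to Be': 1942,
    'Aguirre: The Wrath of God': 1972,
}

# Fill each year in place; no drop/concat step is needed
for title, year in known_years.items():
    toy_df.loc[toy_df['Title'] == title, 'Year'] = year

print(toy_df)
```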


In [38]:
display(Rotten_Tomatoes_df)

Unnamed: 0,Title,Link,Year,Rating_Classification,Total_Ratings,Critic_Score,Audience_Score,Genre,Director,Runtime
0,L.A. Confidential,https://www.rottentomatoes.com/m/la_confidential,1997,R,100000,99,94,Crime,Curtis Hanson,136
1,The Godfather,https://www.rottentomatoes.com/m/the_godfather,1972,R,250000,97,98,Crime,Francis Ford Coppola,177
2,Casablanca,https://www.rottentomatoes.com/m/1003707-casab...,1943,PG,250000,99,95,Drama,Michael Curtiz,102
3,Seven Samurai,https://www.rottentomatoes.com/m/seven_samurai...,1956,Not Rated,50000,100,97,Action,Akira Kurosawa,208
4,Parasite,https://www.rottentomatoes.com/m/parasite_2019,2019,R,5000,99,90,Comedy,Bong Joon Ho,132
...,...,...,...,...,...,...,...,...,...,...
295,Beauty and the Beast,https://www.rottentomatoes.com/m/beauty_and_th...,1947,Not Rated,10000,96,90,Fantasy,Jean Cocteau,95
296,The Killing,https://www.rottentomatoes.com/m/killing,1956,Not Rated,10000,96,92,Crime,Stanley Kubrick,83
297,The Rules of the Game,https://www.rottentomatoes.com/m/the_rules_of_...,1939,Not Rated,10000,97,89,Comedy,Jean Renoir,110
298,Eyes Without a Face,https://www.rottentomatoes.com/m/eyes_without_...,1962,Not Rated,5000,97,87,Horror,Georges Franju,90


In [40]:
 # saving the Dataframe as csv 
IMDB_df.to_csv('CLEANED_IMDB_DATA.csv', index=False)
Rotten_Tomatoes_df.to_csv('CLEAN_ROTTEN_TOMATOES_DATA.csv', index=False)
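
One thing to keep in mind when loading these files again: a CSV stores only text, so the `Int64` and ordered `category` dtypes set during cleaning are not preserved. The sketch below shows one way to restore them on reload, using an in-memory buffer and a toy dataframe in place of the real files. All names in it are illustrative.

```python
import io
import pandas as pd

rating_order = ['Not Rated', 'G', 'PG', 'PG-13', 'TV-14', 'R']

# Toy dataframe with the same dtypes as the cleaned columns
toy_df = pd.DataFrame({
    'Title': ['L.A. Confidential', 'Casablanca'],
    'Year': pd.array([1997, 1943], dtype='Int64'),
    'Rating_Classification': pd.Categorical(
        ['R', 'PG'], categories=rating_order, ordered=True
    ),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the nullable integer dtype while reading
reloaded = pd.read_csv(buffer, dtype={'Year': 'Int64'})

# Rebuild the ordered category, since the order is not stored in the CSV
reloaded['Rating_Classification'] = pd.Categorical(
    reloaded['Rating_Classification'], categories=rating_order, ordered=True
)

print(reloaded.dtypes)
```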

Now that the data has been cleaned and preprocessed, we can move on to the analysis phase. This step will help us explore relationships between different features, uncover trends, and perform statistical analyses to extract meaningful insights.  

Our focus areas will include:  

1. **Exploratory Data Analysis (EDA)** – Examining the distributions of key variables to understand patterns in the dataset.  
2. **Trend Identification** – Investigating how the number of ratings and average ratings vary across different genres, release years, and rating classifications.  
3. **Comparative Analysis** – Comparing the top-rated movies across **IMDb, Letterboxd, and Rotten Tomatoes** to identify commonalities and differences.  
4. **Visualizations** – Using graphs, pie charts, and other visual tools to present insights in a clear and engaging way. 