## Import libraries and create the dataframe

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import math

In [6]:
df = pd.read_csv('Desktop/Rotten_Tomatoes_Movies3.csv')
display(df.head())

Unnamed: 0,movie_title,movie_info,critics_consensus,rating,genre,directors,writers,cast,in_theaters_date,on_streaming_date,runtime_in_minutes,studio_name,tomatometer_status,tomatometer_rating,tomatometer_count,audience_rating
0,Percy Jackson & the Olympians: The Lightning T...,A teenager discovers he's the descendant of a ...,Though it may seem like just another Harry Pot...,PG,"Action & Adventure, Comedy, Drama, Science Fic...",Chris Columbus,Craig Titley,"Logan Lerman, Brandon T. Jackson, Alexandra Da...",2/12/2010,6/29/2010,83.0,20th Century Fox,Rotten,49,144,53.0
1,Please Give,Kate has a lot on her mind. There's the ethics...,Nicole Holofcener's newest might seem slight i...,R,Comedy,Nicole Holofcener,Nicole Holofcener,"Catherine Keener, Amanda Peet, Oliver Platt, R...",4/30/2010,10/19/2010,90.0,Sony Pictures Classics,Certified Fresh,86,140,64.0
2,10,Blake Edwards' 10 stars Dudley Moore as George...,,R,"Comedy, Romance",Blake Edwards,Blake Edwards,"Dudley Moore, Bo Derek, Julie Andrews, Robert ...",10/5/1979,8/27/1997,118.0,Waner Bros.,Fresh,68,22,53.0
3,12 Angry Men (Twelve Angry Men),"A Puerto Rican youth is on trial for murder, a...",Sidney Lumet's feature debut is a superbly wri...,NR,"Classics, Drama",Sidney Lumet,Reginald Rose,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",4/13/1957,3/6/2001,95.0,Criterion Collection,Certified Fresh,100,51,97.0
4,"20,000 Leagues Under The Sea","This 1954 Disney version of Jules Verne's 20,0...","One of Disney's finest live-action adventures,...",G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,Earl Felton,"James Mason, Kirk Douglas, Paul Lukas, Peter L...",1/1/1954,5/20/2003,127.0,Disney,Fresh,89,27,74.0


In [7]:
print(df.shape)

(16638, 16)


## Data preprocessing

### Check the data types as well as the null/missing values in the dataset

In [15]:
display(df.dtypes, df.isnull().sum())

movie_title            object
movie_info             object
rating                 object
genre                  object
directors              object
writers                object
cast                   object
in_theaters_date       object
on_streaming_date      object
runtime_in_minutes    float64
studio_name            object
tomatometer_status     object
tomatometer_rating      int64
tomatometer_count       int64
audience_rating       float64
dtype: object

movie_title              0
movie_info              24
rating                   0
genre                   17
directors              114
writers               1349
cast                   284
in_theaters_date       815
on_streaming_date        2
runtime_in_minutes       0
studio_name            416
tomatometer_status       0
tomatometer_rating       0
tomatometer_count        0
audience_rating          0
dtype: int64

The column 'critics_consensus' contains more than 50% missing data, so we can drop it.
For int or float columns that contain missing data, we can replace the missing values with the column mean. To do that, we create a function called null_check that handles the null values. It also drops the column critics_consensus (and any other column with more than 50% missing data that we may have missed).

In [10]:
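def null_check():
    # Operates on the global dataframe df and modifies it in place
    for i in df.columns:
        # Drop any column in which more than 50% of the values are missing
        if ((df[i].isnull().sum()*100/len(df[i]))>50):
            df.drop([i],axis=1,inplace=True)
        # For numeric (int/float) columns, fill missing values with the column mean
        elif pd.api.types.is_numeric_dtype(df[i]):
            df[i]=df[i].fillna(df[i].mean())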

In [11]:
null_check()
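
As a sketch of the same logic on a small scale, the function below takes the dataframe as an argument and returns a cleaned copy instead of modifying the global `df`. The name `handle_missing`, the `max_missing_pct` parameter and the toy columns are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd


def handle_missing(frame, max_missing_pct=50):
    """Drop mostly-empty columns and mean-fill numeric ones (hypothetical helper)."""
    cleaned = frame.copy()
    for col in frame.columns:
        # Share of missing values in this column, as a percentage
        missing_pct = frame[col].isnull().mean() * 100
        if missing_pct > max_missing_pct:
            cleaned = cleaned.drop(columns=col)
        elif pd.api.types.is_numeric_dtype(frame[col]):
            cleaned[col] = frame[col].fillna(frame[col].mean())
    return cleaned


# Toy data: one numeric column, one mostly-empty column, one text column
toy = pd.DataFrame({
    "runtime": [90.0, np.nan, 110.0, 100.0],
    "consensus": [None, None, None, "Great"],
    "title": ["A", "B", None, "D"],
})
result = handle_missing(toy)
print(result)
```

Here `consensus` is 75% missing and is dropped, the gap in `runtime` is filled with the column mean, and the text column `title` is left untouched, which mirrors what happens to the object columns in the notebook.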

In [16]:
display(df.dtypes, df.isnull().sum())

movie_title            object
movie_info             object
rating                 object
genre                  object
directors              object
writers                object
cast                   object
in_theaters_date       object
on_streaming_date      object
runtime_in_minutes    float64
studio_name            object
tomatometer_status     object
tomatometer_rating      int64
tomatometer_count       int64
audience_rating       float64
dtype: object

movie_title              0
movie_info              24
rating                   0
genre                   17
directors              114
writers               1349
cast                   284
in_theaters_date       815
on_streaming_date        2
runtime_in_minutes       0
studio_name            416
tomatometer_status       0
tomatometer_rating       0
tomatometer_count        0
audience_rating          0
dtype: int64

We can see that the missing data in all the int/float columns has been handled.

### Handling the object datatypes

The columns 'genre', 'directors' and 'writers' contain shared elements (i.e., different movies can belong to the same genre and may have the same directors or writers). In this case, we take the elements of one of those columns, e.g. genre, and convert the objects into integers or floats.

1) We first take all the genres present in the dataset and put them into a list called 'genre_values'
2) Then we assign a score to each movie based on its genre: the score is the index of the genre in the list. If a movie has multiple genres, we take the average of the indexes of all its genres and append it to a list called 'new_genre'
3) We convert new_genre to a dataframe

In [18]:
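# Build the list of distinct genre tokens.
# Note: split() with no argument splits on whitespace, so a genre such as
# "Action & Adventure," is broken into the tokens "Action", "&" and "Adventure,".
genre_values = []
for i in df['genre']:
    genre_values += str(i).split()
# set() removes duplicates; its ordering (and so each index) can differ between runs
genre_values = list(set(genre_values))
#print(genre_values)
# Score each movie as the average list index of its genre tokens
new_genre = []
for i in df['genre']:
    i = str(i).split()
    avg_value = 0
    for j in i:
        if j in genre_values:
            avg_value += genre_values.index(j)
    avg_value = avg_value/len(i)
    #print(avg_value)
    new_genre.append(avg_value)
#print(new_genre)
new_genre=pd.DataFrame(new_genre)
print(new_genre)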

               0
0      18.111111
1      50.000000
2      34.000000
3      20.500000
4      18.142857
...          ...
16633  37.800000
16634  27.600000
16635  18.333333
16636  20.500000
16637  17.750000

[16638 rows x 1 columns]
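
The genre cell above splits on whitespace, so multi-word genres are scored word by word, and the index of each token depends on the ordering of a set. A possible variant is sketched below: it splits on commas, trims spaces and sorts the vocabulary so the indexes are reproducible. The helper names `split_labels` and `average_index_encode`, the toy genre strings, and the choice to return NaN for missing cells are all assumptions for illustration, not the notebook's method.

```python
def split_labels(cell, sep=","):
    """Split one cell into trimmed labels; a missing cell gives an empty list."""
    if cell is None or cell != cell:  # NaN is the only value not equal to itself
        return []
    return [part.strip() for part in str(cell).split(sep) if part.strip()]


def average_index_encode(cells, sep=","):
    """Return (sorted vocabulary, average vocabulary index per row)."""
    labels_per_row = [split_labels(cell, sep) for cell in cells]
    # Sorting makes the label-to-index mapping the same on every run
    vocab = sorted({label for labels in labels_per_row for label in labels})
    scores = []
    for labels in labels_per_row:
        if labels:
            scores.append(sum(vocab.index(label) for label in labels) / len(labels))
        else:
            scores.append(float("nan"))  # assumption: keep missing rows as NaN
    return vocab, scores


# Toy genre strings in the same comma-separated style as the dataset
genres = ["Action & Adventure, Comedy", "Comedy", "Classics, Drama", None]
vocab, scores = average_index_encode(genres)
print(vocab)
print(scores)
```

With this variant "Action & Adventure" stays a single genre. The resulting scores would differ from the ones printed above, and the NaN rows would still need to be filled or dropped before modelling.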


We repeat the same process for the columns 'directors' and 'writers', this time splitting on commas.

In [20]:
directors_values = []
for i in df['directors']:
    directors_values += str(i).split(',')
directors_values = list(set(directors_values))
#print(directors_values)
new_directors = []
for i in df['directors']:
    i = str(i).split(',')
    avg_value = 0
    for j in i:
        if j in directors_values:
            avg_value += directors_values.index(j)
    avg_value = avg_value/len(i)
    #print(avg_value)
    new_directors.append(avg_value)
#print(new_directors)
new_directors=pd.DataFrame(new_directors)
print(new_directors)

                 0
0      6132.000000
1      4236.000000
2      7913.000000
3      4651.000000
4      4176.000000
...            ...
16633  2009.000000
16634  3079.333333
16635  8074.000000
16636  6035.500000
16637  6668.000000

[16638 rows x 1 columns]


In [21]:
writers_values = []
for i in df['writers']:
    writers_values += str(i).split(',')
writers_values = list(set(writers_values))
#print(writers_values)
new_writers = []
for i in df['writers']:
    i = str(i).split(',')
    avg_value = 0
    for j in i:
        if j in writers_values:
            avg_value += writers_values.index(j)
    avg_value = avg_value/len(i)
    #print(avg_value)
    new_writers.append(avg_value)
#print(new_writers)
new_writers=pd.DataFrame(new_writers)
print(new_writers)

             0
0      15506.0
1       7656.0
2      14134.0
3      15302.0
4      12749.0
...        ...
16633   3563.0
16634   7375.0
16635  14462.0
16636   5898.0
16637   8951.0

[16638 rows x 1 columns]


For the column 'cast', we could use a similar process as we did for genre, directors and writers. However, there are 204,103 unique cast members in total, so finding the averages and creating a separate dataframe from them would be time-consuming. It may therefore be better to drop that column. Similarly, the column 'studio_name' contains 2,887 unique names, so it may be better to drop it as well.
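
One likely reason the averaging gets slow for a large vocabulary is that `list.index` scans the list on every lookup. A dictionary from name to position gives the same averages with constant-time lookups. The sketch below compares the two on made-up data; `average_index_slow`, `average_index_fast` and the toy names are hypothetical, and whether this would make the cast column worth keeping was not tested here.

```python
def average_index_slow(rows, vocab):
    """Average vocabulary index per row using list.index (a linear scan per lookup)."""
    return [sum(vocab.index(name) for name in names) / len(names) for names in rows]


def average_index_fast(rows, vocab):
    """Same result, but each lookup goes through a precomputed dictionary."""
    position = {name: i for i, name in enumerate(vocab)}
    return [sum(position[name] for name in names) / len(names) for names in rows]


# Toy vocabulary and rows; every row is assumed to be non-empty
vocab = ["A", "B", "C"]
rows = [["A", "B"], ["C"], ["A", "C"]]

slow = average_index_slow(rows, vocab)
fast = average_index_fast(rows, vocab)
print(slow)
print(fast)
```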

The column 'movie_title' contains 16,106 unique titles out of a total of 16,638, so just over 500 titles are repeats. We do not know of a good way to handle these, so we drop that column too.

Apart from this column, the columns movie_info, in_theaters_date and on_streaming_date do not give us any important information for predicting audience_rating, so we drop those as well.

Along with these, we can drop the columns genre, directors and writers, since we have already worked on those and only need to add the new dataframes to the dataset.

In [22]:
df = df.drop(['movie_title', 'movie_info', 'in_theaters_date', 'on_streaming_date', 'studio_name', 'cast'], axis=1)
df = df.drop(['genre', 'writers', 'directors'], axis=1)

In [26]:
new_genre.columns = ['genre']
new_writers.columns = ['writers']
new_directors.columns = ['directors']
updated_df = pd.concat([df, new_genre, new_directors, new_writers], axis=1)

In [27]:
print(updated_df.columns)

Index(['rating', 'runtime_in_minutes', 'tomatometer_status',
       'tomatometer_rating', 'tomatometer_count', 'audience_rating', 'genre',
       'directors', 'writers'],
      dtype='object')


In [29]:
display(updated_df.dtypes, updated_df.isnull().sum())

rating                 object
runtime_in_minutes    float64
tomatometer_status     object
tomatometer_rating      int64
tomatometer_count       int64
audience_rating       float64
genre                 float64
directors             float64
writers               float64
dtype: object

rating                0
runtime_in_minutes    0
tomatometer_status    0
tomatometer_rating    0
tomatometer_count     0
audience_rating       0
genre                 0
directors             0
writers               0
dtype: int64

No missing values remain, and the string data has been converted to numerical format. Note that missing entries in genre, directors and writers were turned into the string 'nan' by str() and scored like any other value.

### Handling categorical data

Now we need to work on the columns rating and tomatometer_status. We can see that these are categorical values, so we use OneHotEncoder to convert these to numerical data.

In [32]:
categorical_columns = ['tomatometer_status', 'rating']
# Keep the non-categorical columns aside
encoded_df = updated_df.drop(categorical_columns, axis=1)
# sparse_output=False returns a dense NumPy array (requires scikit-learn 1.2 or newer)
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(updated_df[categorical_columns])
one_hot_df = pd.DataFrame(one_hot_encoded,
                          columns=encoder.get_feature_names_out(categorical_columns))
# Join the numeric columns with the one-hot columns
final_df = pd.concat([encoded_df, one_hot_df], axis=1)

In [33]:
print(final_df.columns) 

Index(['runtime_in_minutes', 'tomatometer_rating', 'tomatometer_count',
       'audience_rating', 'genre', 'directors', 'writers',
       'tomatometer_status_Certified Fresh', 'tomatometer_status_Fresh',
       'tomatometer_status_Rotten', 'rating_G', 'rating_NC17', 'rating_NR',
       'rating_PG', 'rating_PG-13', 'rating_PG-13)', 'rating_R', 'rating_R)'],
      dtype='object')
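
The column list above includes 'rating_PG-13)' and 'rating_R)', which suggests that some rating values carry a stray closing parenthesis. Assuming these are malformed variants of 'PG-13' and 'R' (a guess, since the raw values were not inspected here), one way to merge them before encoding is to strip that character. The toy series below is made up for illustration.

```python
import pandas as pd

# Toy ratings that mimic the stray ")" seen in the encoded column names
ratings = pd.Series(["PG", "PG-13)", "R", "R)", "NR"], name="rating")

# Remove surrounding whitespace, then any trailing ")" characters
cleaned = ratings.str.strip().str.rstrip(")")

print(sorted(cleaned.unique()))
```

In the notebook this would correspond to cleaning the 'rating' column before the OneHotEncoder cell, which should leave a single column each for PG-13 and R.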


In [35]:
print(final_df.dtypes)

runtime_in_minutes                    float64
tomatometer_rating                      int64
tomatometer_count                       int64
audience_rating                       float64
genre                                 float64
directors                             float64
writers                               float64
tomatometer_status_Certified Fresh    float64
tomatometer_status_Fresh              float64
tomatometer_status_Rotten             float64
rating_G                              float64
rating_NC17                           float64
rating_NR                             float64
rating_PG                             float64
rating_PG-13                          float64
rating_PG-13)                         float64
rating_R                              float64
rating_R)                             float64
dtype: object


In [36]:
print(final_df.isnull().sum())

runtime_in_minutes                    0
tomatometer_rating                    0
tomatometer_count                     0
audience_rating                       0
genre                                 0
directors                             0
writers                               0
tomatometer_status_Certified Fresh    0
tomatometer_status_Fresh              0
tomatometer_status_Rotten             0
rating_G                              0
rating_NC17                           0
rating_NR                             0
rating_PG                             0
rating_PG-13                          0
rating_PG-13)                         0
rating_R                              0
rating_R)                             0
dtype: int64


## Training and Testing the data

In [37]:
Y = np.array(final_df["audience_rating"])  
X = final_df.drop("audience_rating", axis=1)                        

x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
print(x_train, x_test, y_train, y_test)

       runtime_in_minutes  tomatometer_rating  tomatometer_count      genre  \
5814           106.000000                  74                 34  18.000000   
10782          120.000000                  85                 20  30.333333   
5932           121.000000                  92                 25  29.600000   
2257            93.000000                  44                  9  18.000000   
16545          105.000000                  17                 93  50.000000   
...                   ...                 ...                ...        ...   
1770            99.000000                  29                 17  52.500000   
16242           98.000000                   7                148  35.000000   
14464           90.000000                  55                131  15.375000   
1645           113.000000                  57                 44  16.000000   
12120          102.391494                  27                 22  50.000000   

       directors       writers  tomatometer_status_

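Let's use linear regression, ridge regression, decision trees and random forests on this data. As evaluation metrics we use the model's score on the training and test sets (labelled "accuracy" in the output; for scikit-learn regressors, score() returns the R2 score), mean absolute error, root mean square error and R2 score.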

In [38]:
lr = LinearRegression()
ridge = Ridge(alpha=0.5, fit_intercept=True, max_iter=1000)
dt = DecisionTreeRegressor(criterion='absolute_error', max_depth=1000)
rf = RandomForestRegressor(max_depth=1000)

These hyperparameters were chosen after a lot of trial and error. They are the best ones I could find.

In [39]:
models = [lr, ridge, dt, rf]
names = ["Linear Regressor", "Ridge Regressor", "Decision Tree Regressor", 
        "Random Forest Regressor"]

In [40]:
for name, model in zip(names, models):
    print(name)   
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    # For regressors, score() returns R^2, so the "Accuracy" values printed below are R^2 scores
    train_score = model.score(x_train, y_train)
    test_score = model.score(x_test, y_test)
    mae = mean_absolute_error(y_test, y_predict)
    rmse = math.sqrt(mean_squared_error(y_test, y_predict))
    r2 = r2_score(y_test, y_predict)
    print("Train Accuracy: {}, Test Accuracy: {}, Mean Absolute error: {}, Root Mean Square error: {}, R2 score: {}\n".format(train_score, test_score, mae, rmse, r2))

Linear Regressor
Train Accuracy: 0.46173979310477886, Test Accuracy: 0.47391400267357, Mean Absolute error: 11.82928184139104, Root Mean Square error: 14.837658975778513, R2 score: 0.47391400267357

Ridge Regressor
Train Accuracy: 0.46173259099176234, Test Accuracy: 0.4740879516705503, Mean Absolute error: 11.82768038619076, Root Mean Square error: 14.835205755872735, R2 score: 0.4740879516705503

Decision Tree Regressor
Train Accuracy: 1.0, Test Accuracy: -0.052304837991966835, Mean Absolute error: 16.291518938540403, Root Mean Square error: 20.984943172584856, R2 score: -0.052304837991966835

Random Forest Regressor
Train Accuracy: 0.92727744219255, Test Accuracy: 0.4865405703248201, Mean Absolute error: 11.572819939952257, Root Mean Square error: 14.658518551130843, R2 score: 0.4865405703248201



We can see that none of the models produces excellent results: none of them reaches a test R2 score of 0.5. Overfitting is clearest in the decision tree, which has a training score of 1.0 but a test score of about -0.05, meaning it does slightly worse on unseen data than always predicting the mean audience rating of the test set. The random forest also overfits (about 0.93 on the training set versus 0.49 on the test set). However, the scores may drop even further if we drop more columns, so this is the best shot we have.
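
To see why a score can be negative, recall that R2 is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The small sketch below computes it by hand; the function name `r2` and the numbers are made up for illustration and are not taken from the dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [50.0, 60.0, 70.0, 80.0]

# Always predicting the mean of y_true gives SS_res == SS_tot, so R2 is 0
mean_baseline = [65.0] * len(y_true)

# Predictions further from the truth than the mean give SS_res > SS_tot, so R2 < 0
poor_model = [80.0, 50.0, 90.0, 60.0]

baseline_score = r2(y_true, mean_baseline)
poor_score = r2(y_true, poor_model)
print(baseline_score, poor_score)
```

A perfect prediction gives a score of 1, the mean baseline gives 0, and anything worse than the baseline goes below 0, which is the situation of the decision tree on the test set.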

(Note: Trying standard scaling and PCA resulted in even lower scores, so I decided not to use them.)