# **Rotten Tomatoes Movies Rating Prediction**

**Assignment**

In this project, you are given large datasets from Rotten Tomatoes, a popular online review aggregator for film and television. Your task is to build a high-performing classification algorithm to predict whether a particular movie on Rotten Tomatoes is labeled 'Rotten', 'Fresh', or 'Certified-Fresh'.

## **Data Description**

**There are 2 datasets**

1. rotten_tomatoes_movies.csv - contains basic information about each movie listed on Rotten Tomatoes; each row represents one movie;
2. rotten_tomatoes_critic_reviews_50k.tsv - contains 50,000 individual reviews by Rotten Tomatoes critics; each row represents one review corresponding to a movie;

rotten_tomatoes_movies dataset contains the following columns:

- rotten_tomatoes_link - movie ID
- movie_title - title of the movie as displayed on the Rotten Tomatoes website
- movie_info - brief description of the movie
- critics_consensus - comment from Rotten Tomatoes
- content_rating - category based on the movie suitability for audience
- genres - movie genres separated by commas, if multiple
- directors - name of director(s)
- authors - name of author(s)
- actors - name of actors
- original_release_date - date on which the movie was released in
   theatres, in YYYY-MM-DD format
- streaming_release_date - date on which the movie was released on
   streaming platforms, in YYYY-MM-DD format
- runtime - duration of the movie in minutes
- production_company - name of a studio/company that produced the movie
- tomatometer_status - a label assigned by Rotten Tomatoes: "Fresh",
   "Certified-Fresh" or "Rotten"; this is the target variable in this challenge
- tomatometer_rating - percentage of positive critic ratings
- tomatometer_count - critic ratings counted for the calculation of the
   tomatometer status
- audience_status - a label assigned based on user ratings: "Spilled" or "Upright"
- audience_rating - percentage of positive user ratings
- audience_count - user ratings counted for the calculation of the audience
   status
- tomatometer_top_critics_count - number of ratings by top critics
- tomatometer_fresh_critics_count - number of critic ratings labeled "Fresh"
- tomatometer_rotten_critics_count - number of critic ratings labeled "Rotten"
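
Columns such as genres hold several comma-separated values in one string, so encoding the whole string as a single category would treat "Action, Comedy" and "Comedy" as unrelated values. A minimal sketch of one alternative, multi-hot encoding with pandas, on a made-up toy frame (the sample rows and the `genre_` prefix are illustrative assumptions, not values taken from the dataset):

```python
import pandas as pd

# Toy rows standing in for the real `genres` column (illustrative values only).
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "genres": ["Action, Comedy", "Comedy", None],
})

# Split on the ", " separator and build one 0/1 indicator column per genre.
# Missing values are replaced by an empty string, which yields an all-zero row.
genre_flags = toy["genres"].fillna("").str.get_dummies(sep=", ").add_prefix("genre_")

toy_encoded = pd.concat([toy.drop(columns="genres"), genre_flags], axis=1)
print(toy_encoded)
```

The same pattern could in principle be applied to directors, authors, and actors, although those columns are likely to have far more distinct values.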


rotten_tomatoes_critic_reviews_50k dataset contains the following columns:

- rotten_tomatoes_link - movie ID
- critic_name - name of critic who rated the movie
- top_critic - boolean value that clarifies whether the critic is a top critic or not
- publisher_name - name of the publisher for which the critic works
- review_type - was the review labeled "Fresh" or "Rotten"?
- review_score - review score provided by the critic
- review_date - date of the review in YYYY-MM-DD format
- review_content - text of the review

## **Practicalities**

Define, train and evaluate a predictive model that takes the provided data as input. You may want to split the data into training, validation and testing sets at your discretion. Do not use external data for this project. You may use any algorithm of your choice or compare multiple models.

Make sure that the solution reflects your entire thought process: how the code is structured matters more than the final metrics. You are expected to spend no more than 3 hours working on this project.

#### To download the dataset <a href="https://drive.google.com/drive/folders/1gfJwHeushdTfOeval0vceu0k1wOQ0kSv?usp=sharing"> Click here </a>

In [3]:
import pandas as pd

In [15]:
movies_df = pd.read_csv('rotten_tomatoes_movies.csv')
reviews_df = pd.read_csv('rotten_tomatoes_critic_reviews_50k.csv')

print("Movies Dataset:")
print(movies_df.head())
print("\nReviews Dataset:")
print(reviews_df.head())


Movies Dataset:
                    rotten_tomatoes_link  \
0                              m/0814255   
1                              m/0878835   
2                                   m/10   
3                 m/1000013-12_angry_men   
4  m/1000079-20000_leagues_under_the_sea   

                                         movie_title  \
0  Percy Jackson & the Olympians: The Lightning T...   
1                                        Please Give   
2                                                 10   
3                    12 Angry Men (Twelve Angry Men)   
4                       20,000 Leagues Under The Sea   

                                          movie_info  \
0  Always trouble-prone, the life of teenager Per...   
1  Kate (Catherine Keener) and her husband Alex (...   
2  A successful, middle-aged Hollywood songwriter...   
3  Following the closing arguments in a murder tr...   
4  In 1866, Professor Pierre M. Aronnax (Paul Luk...   

                                   critics_co

In [18]:
# Convert review scores such as '3.5/5' to numbers
def convert_review_score(score):
    try:
        # Extract the numeric part before the slash (e.g., 3.5 from '3.5/5')
        return float(score.split('/')[0])
    except (ValueError, AttributeError):
        # Return None (treated as missing) if conversion fails
        return None

# Apply the function to the review_score column
reviews_df['review_score'] = reviews_df['review_score'].apply(convert_review_score)


reviews_df.fillna({
    'critic_name': 'Unknown',
    'review_score': reviews_df['review_score'].median(),
    'review_content': 'No content'
}, inplace=True)


print("\nMissing values in Reviews Dataset after handling:")
print(reviews_df.isnull().sum())



Missing values in Reviews Dataset after handling:
rotten_tomatoes_link    0
critic_name             0
top_critic              0
publisher_name          0
review_type             0
review_score            0
review_date             0
review_content          0
dtype: int64
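
convert_review_score keeps only the numerator, so scores such as '3.5/5' and '7/10' end up on different scales, and any score that is not in numeric 'x/y' form (a letter grade, for instance, if the column contains them) becomes missing and is later filled with the median. A hedged sketch of a variant that rescales fractional scores to the 0 to 1 range; the helper name `normalize_review_score` and the sample values are assumptions made for illustration:

```python
import math
import pandas as pd

def normalize_review_score(score):
    """Return numerator / denominator for 'x/y' strings, otherwise NaN."""
    try:
        numerator, denominator = str(score).split("/")
        denominator = float(denominator)
        if denominator == 0:
            return math.nan
        return float(numerator) / denominator
    except ValueError:
        # Covers letter grades, missing values, and malformed strings.
        return math.nan

sample = pd.Series(["3.5/5", "7/10", "B+", None])
normalized = sample.apply(normalize_review_score)
print(normalized)
```

Whether to impute the remaining missing values or map letter grades to numbers is a separate modelling choice that this sketch leaves open.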


In [19]:
print("\nMerged Dataset Sample:")
print(merged_df.head())


Merged Dataset Sample:
  rotten_tomatoes_link      critic_name  top_critic           publisher_name  \
0            m/0814255  Andrew L. Urban       False           Urban Cinefile   
1            m/0814255    Louise Keller       False           Urban Cinefile   
2            m/0814255          Unknown       False      FILMINK (Australia)   
3            m/0814255     Ben McEachen       False  Sunday Mail (Australia)   
4            m/0814255      Ethan Alter        True       Hollywood Reporter   

  review_type  review_score review_date  \
0       Fresh           3.0  2010-02-06   
1       Fresh           3.0  2010-02-06   
2       Fresh           3.0  2010-02-09   
3       Fresh           3.5  2010-02-09   
4      Rotten           3.0  2010-02-10   

                                      review_content  \
0  A fantasy adventure that fuses Greek mythology...   
1  Uma Thurman as Medusa, the gorgon with a coiff...   
2  With a top-notch cast and dazzling special eff...   
3  Whether a
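
The cell that creates merged_df is not part of this export. A minimal sketch of how such a frame could be built, assuming a left join of the reviews onto the movies on the shared rotten_tomatoes_link key (the toy frames and the join type are assumptions, not the notebook's actual code):

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    "rotten_tomatoes_link": ["m/1", "m/1", "m/2"],
    "review_type": ["Fresh", "Rotten", "Fresh"],
})
toy_movies = pd.DataFrame({
    "rotten_tomatoes_link": ["m/1", "m/2"],
    "tomatometer_status": ["Fresh", "Rotten"],
})

# One output row per review; movie-level columns repeat for every review of
# the same movie. validate= raises if a movie ID is duplicated in toy_movies.
toy_merged = toy_reviews.merge(
    toy_movies, on="rotten_tomatoes_link", how="left", validate="many_to_one"
)
print(toy_merged)
```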

In [20]:
from sklearn.preprocessing import LabelEncoder

categorical_columns = ['review_type', 'top_critic', 'publisher_name', 'content_rating', 'genres', 'directors', 
                        'authors', 'actors', 'production_company', 'audience_status']

label_encoders = {}
for column in categorical_columns:
    le = LabelEncoder()
    final_df[column] = le.fit_transform(final_df[column].astype(str))
    label_encoders[column] = le


ValueError: y should be a 1d array, got an array of shape (49957, 2) instead.
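
LabelEncoder raises this error when it is handed a two-dimensional input. One way that can happen is a duplicated column name: selecting such a name returns a DataFrame rather than a Series. A small reproduction on a toy frame (the values are made up), together with a guard that keeps only the first column of each name:

```python
import pandas as pd

# Two columns share the name 'actors'; this can happen, for example, when
# derived feature columns are concatenated onto the original frame.
toy = pd.DataFrame(
    [["Logan Lerman", 0.0], ["Uma Thurman", 0.0]],
    columns=["actors", "actors"],
)

selected = toy["actors"]
print(type(selected).__name__, selected.shape)  # a DataFrame with 2 columns

# Keep only the first occurrence of each column name.
deduped = toy.loc[:, ~toy.columns.duplicated()]
print(type(deduped["actors"]).__name__, deduped["actors"].shape)
```

Checking `df.columns.duplicated().any()` before encoding is a cheap way to catch this early.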

In [21]:
print(final_df[categorical_columns].dtypes)

review_type             int32
top_critic              int32
publisher_name          int32
content_rating          int32
genres                  int32
directors               int32
authors                 int32
actors                 object
actors                float64
production_company     object
audience_status        object
dtype: object


In [22]:
final_df['actors'] = final_df['actors'].astype(str)
final_df['production_company'] = final_df['production_company'].astype(str)


In [24]:
print(final_df['actors'].head(10))

                                              actors actors
0  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
1  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
2  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
3  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
4  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
5  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
6  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
7  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
8  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0
9  Logan Lerman, Brandon T. Jackson, Alexandra Da...    0.0


In [25]:
final_df = final_df.loc[:, ~final_df.columns.duplicated()]

# Converting the 'actors' column to string, replacing NaNs with 'Unknown'
final_df['actors'] = final_df['actors'].astype(str).replace('nan', 'Unknown')

print(final_df['actors'].head(10))


0    Logan Lerman, Brandon T. Jackson, Alexandra Da...
1    Logan Lerman, Brandon T. Jackson, Alexandra Da...
2    Logan Lerman, Brandon T. Jackson, Alexandra Da...
3    Logan Lerman, Brandon T. Jackson, Alexandra Da...
4    Logan Lerman, Brandon T. Jackson, Alexandra Da...
5    Logan Lerman, Brandon T. Jackson, Alexandra Da...
6    Logan Lerman, Brandon T. Jackson, Alexandra Da...
7    Logan Lerman, Brandon T. Jackson, Alexandra Da...
8    Logan Lerman, Brandon T. Jackson, Alexandra Da...
9    Logan Lerman, Brandon T. Jackson, Alexandra Da...
Name: actors, dtype: object


In [26]:
# Converting the 'production_company' column to string, replacing NaNs with 'Unknown'
final_df['production_company'] = final_df['production_company'].astype(str).replace('nan', 'Unknown')

print(final_df['production_company'].head(10))

0    20th Century Fox
1    20th Century Fox
2    20th Century Fox
3    20th Century Fox
4    20th Century Fox
5    20th Century Fox
6    20th Century Fox
7    20th Century Fox
8    20th Century Fox
9    20th Century Fox
Name: production_company, dtype: object


In [27]:
from sklearn.preprocessing import LabelEncoder

# List of categorical columns
categorical_columns = [
    'review_type',
    'top_critic',
    'publisher_name',
    'content_rating',
    'genres',
    'directors',
    'authors',
    'actors',
    'production_company',
    'audience_status'
]


label_encoders = {}

# Converting categorical columns to numeric
for column in categorical_columns:
    le = LabelEncoder()
    final_df[column] = le.fit_transform(final_df[column].astype(str))
    label_encoders[column] = le

print(final_df[categorical_columns].head(10))


   review_type  top_critic  publisher_name  content_rating  genres  directors  \
0            0           0             927               3     195        434   
1            0           0             927               3     195        434   
2            0           0             516               3     195        434   
3            0           0              72               3     195        434   
4            1           1             658               3     195        434   
5            1           1             724               3     195        434   
6            1           0              26               3     195        434   
7            0           1             613               3     195        434   
8            0           0             772               3     195        434   
9            0           1              98               3     195        434   

   authors  actors  production_company  audience_status  
0      512     807                   0            

In [28]:
from sklearn.model_selection import train_test_split

target_column = 'tomatometer_status'
feature_columns = [col for col in final_df.columns if col != target_column]

X = final_df[feature_columns]
y = final_df[target_column]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")



Training set size: (39965, 1025)
Testing set size: (9992, 1025)
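
Because every review row carries its movie's attributes and label, a purely random row split can place reviews of the same movie in both sets. A hedged sketch of a group-aware alternative using scikit-learn's GroupShuffleSplit with rotten_tomatoes_link as the group key (toy data; the 0.25 test fraction is arbitrary):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

toy = pd.DataFrame({
    "rotten_tomatoes_link": ["m/1", "m/1", "m/2", "m/2", "m/3", "m/3", "m/4", "m/4"],
    "review_score": [3.0, 3.5, 2.0, 1.5, 4.0, 4.5, 2.5, 3.0],
    "tomatometer_status": ["Fresh", "Fresh", "Rotten", "Rotten",
                           "Certified-Fresh", "Certified-Fresh", "Rotten", "Rotten"],
})

# All rows that share a movie ID are assigned to the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["rotten_tomatoes_link"]))

train_movies = set(toy.iloc[train_idx]["rotten_tomatoes_link"])
test_movies = set(toy.iloc[test_idx]["rotten_tomatoes_link"])
print("Movies shared between train and test:", train_movies & test_movies)
```

An alternative with the same effect is to aggregate reviews to one row per movie before splitting.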


In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

model = RandomForestClassifier(n_estimators=100, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", report)

ValueError: could not convert string to float: 'm/1001567-bad_news_bears'

In [30]:
print(X_train.dtypes)
print(X_test.dtypes)

rotten_tomatoes_link     object
critic_name              object
top_critic                int32
publisher_name            int32
review_type               int32
                         ...   
year                    float64
years                   float64
york                    float64
young                   float64
younger                 float64
Length: 1025, dtype: object
rotten_tomatoes_link     object
critic_name              object
top_critic                int32
publisher_name            int32
review_type               int32
                         ...   
year                    float64
years                   float64
york                    float64
young                   float64
younger                 float64
Length: 1025, dtype: object


In [31]:
# Note: float64 columns are already numeric; only the object columns need encoding.
non_numeric_columns = X_train.select_dtypes(include=['object', 'float64']).columns
print("Non-numeric columns:", non_numeric_columns)

Non-numeric columns: Index(['rotten_tomatoes_link', 'critic_name', 'review_score', 'review_date',
       'movie_title', 'critics_consensus', 'original_release_date',
       'streaming_release_date', 'runtime', 'tomatometer_rating',
       ...
       'world', 'worse', 'worth', 'writer', 'wrong', 'year', 'years', 'york',
       'young', 'younger'],
      dtype='object', length=1012)
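
Only object columns are a problem for the classifier here, since float64 columns are already numeric. A minimal sketch that separates the two groups with select_dtypes on a toy frame (the column values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "rotten_tomatoes_link": ["m/1", "m/2"],
    "runtime": [119.0, 90.0],
    "top_critic": [0, 1],
})

# 'number' matches every integer and float dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, non_numeric_cols)
```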


In [33]:
combined = pd.concat([X_train, X_test], axis=0)

categorical_columns = combined.select_dtypes(include=['object']).columns

combined_encoded = pd.get_dummies(combined, columns=categorical_columns, drop_first=True)

X_train_encoded = combined_encoded.iloc[:X_train.shape[0], :]
X_test_encoded = combined_encoded.iloc[X_train.shape[0]:, :]

In [34]:
print("X_train_encoded shape:", X_train_encoded.shape)
print("X_test_encoded shape:", X_test_encoded.shape)


X_train_encoded shape: (39965, 16124)
X_test_encoded shape: (9992, 16124)
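
The jump from 1,025 to 16,124 columns comes from one-hot encoding every distinct string value, including IDs and individual dates. A hedged sketch of a more compact treatment for date columns: parse them and keep numeric year and month parts. The toy values and the derived column names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "streaming_release_date": ["2020-03-12", "2020-04-07", "Unknown"],
})

# Unparseable entries such as 'Unknown' become NaT instead of raising.
parsed = pd.to_datetime(
    toy["streaming_release_date"], format="%Y-%m-%d", errors="coerce"
)

toy["streaming_release_year"] = parsed.dt.year
toy["streaming_release_month"] = parsed.dt.month
toy = toy.drop(columns="streaming_release_date")
print(toy)
```

This replaces one indicator column per distinct date with two numeric columns; rows with an unparseable date still need an imputation strategy.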


In [35]:
print("Training features:", X_train_encoded.columns)
print("Testing features:", X_test_encoded.columns)


Training features: Index(['top_critic', 'publisher_name', 'review_type', 'review_score',
       'content_rating', 'genres', 'directors', 'authors', 'actors', 'runtime',
       ...
       'streaming_release_date_2020-03-12',
       'streaming_release_date_2020-03-13',
       'streaming_release_date_2020-04-07',
       'streaming_release_date_2020-04-17',
       'streaming_release_date_2020-04-27',
       'streaming_release_date_2020-05-24',
       'streaming_release_date_2020-09-20',
       'streaming_release_date_2020-10-13',
       'streaming_release_date_2020-10-15', 'streaming_release_date_Unknown'],
      dtype='object', length=16124)
Testing features: Index(['top_critic', 'publisher_name', 'review_type', 'review_score',
       'content_rating', 'genres', 'directors', 'authors', 'actors', 'runtime',
       ...
       'streaming_release_date_2020-03-12',
       'streaming_release_date_2020-03-13',
       'streaming_release_date_2020-04-07',
       'streaming_release_date_2020-04-17'

In [36]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

model.fit(X_train_encoded, y_train)


In [37]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model = RandomForestClassifier(random_state=42)

model.fit(X_train_encoded, y_train)

y_pred = model.predict(X_test_encoded)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)


Accuracy: 1.0
Classification Report:
                  precision    recall  f1-score   support

Certified-Fresh       1.00      1.00      1.00      2793
          Fresh       1.00      1.00      1.00      2913
         Rotten       1.00      1.00      1.00      4286

       accuracy                           1.00      9992
      macro avg       1.00      1.00      1.00      9992
   weighted avg       1.00      1.00      1.00      9992

Confusion Matrix:
 [[2793    0    0]
 [   0 2913    0]
 [   0    0 4286]]


Accuracy of 1.0: the model predicted every instance in the test set correctly.

**Classification Report:**

Precision, Recall, F1-score: all are 1.00 for each class (Certified-Fresh, Fresh, Rotten), so there are no false positives or false negatives.
Support: the number of true instances of each class in the test set (2793, 2913 and 4286), which matches the row totals of the confusion matrix.

**Confusion Matrix:**

All counts lie on the diagonal, so no instance was misclassified.

A perfect score on real data is more likely a sign of data leakage than of a perfect model. Two probable sources here: the features appear to include tomatometer_rating and the critic counts, from which tomatometer_status is derived, and the row-level split places reviews of the same movie, which share identical movie-level features and the same label, in both the training and test sets.
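
A perfect score is worth testing against leakage. A hedged sketch of one check: drop identifier columns and columns from which the label can be read off directly, then retrain and see whether the score changes. The column list is an assumption about which features are leaky, and the toy frame only illustrates the mechanics.

```python
import pandas as pd

# Assumed-leaky columns: movie identifiers and the critic aggregates that
# the data description ties to the tomatometer status.
suspect_columns = [
    "rotten_tomatoes_link",
    "movie_title",
    "tomatometer_rating",
    "tomatometer_count",
    "tomatometer_fresh_critics_count",
    "tomatometer_rotten_critics_count",
]

toy = pd.DataFrame({
    "rotten_tomatoes_link": ["m/1", "m/2"],
    "tomatometer_rating": [49.0, 87.0],
    "runtime": [119.0, 90.0],
    "tomatometer_status": ["Rotten", "Certified-Fresh"],
})

# errors='ignore' skips names that are absent from the frame.
features = toy.drop(columns=suspect_columns + ["tomatometer_status"], errors="ignore")
print(features.columns.tolist())
```

If accuracy stays at 1.0 after such a reduction, the train/test split itself is the next thing to inspect.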
