<a href="https://colab.research.google.com/github/ShaikSony-07/data-analytics-project/blob/main/Rotten_Tomatoes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step 1: Import Libraries**

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

**Step 2: Load and Explore the Dataset**

In [12]:
# Load the dataset
file_path = "/content/Rotten_Tomatoes_Movies3.xlsx"  # Adjust the file path
data = pd.read_excel("/content/Rotten_Tomatoes_Movies3.xlsx", engine='openpyxl')


# Display the structure of the dataset
print(data.head())
print(data.info())

                                         movie_title  \
0  Percy Jackson & the Olympians: The Lightning T...   
1                                        Please Give   
2                                                 10   
3                    12 Angry Men (Twelve Angry Men)   
4                       20,000 Leagues Under The Sea   

                                          movie_info  \
0  A teenager discovers he's the descendant of a ...   
1  Kate has a lot on her mind. There's the ethics...   
2  Blake Edwards' 10 stars Dudley Moore as George...   
3  A Puerto Rican youth is on trial for murder, a...   
4  This 1954 Disney version of Jules Verne's 20,0...   

                                   critics_consensus rating  \
0  Though it may seem like just another Harry Pot...     PG   
1  Nicole Holofcener's newest might seem slight i...      R   
2                                                NaN      R   
3  Sidney Lumet's feature debut is a superbly wri...     NR   
4  One of D

**Step 3: Preprocessing
Handle Missing Values: Replace or drop missing values.
Encode Categorical Variables: Use LabelEncoder or OneHotEncoder for non-numeric columns.
Normalize Features: Scale numeric features if required.**

In [15]:
# Handle missing values for numeric columns
numeric_cols = data.select_dtypes(include=[np.number]).columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

# For non-numeric columns, you can fill missing values with a placeholder or drop them
non_numeric_cols = data.select_dtypes(exclude=[np.number]).columns
data[non_numeric_cols] = data[non_numeric_cols].fillna("Unknown")  # Replace "Unknown" as appropriate

# Encode categorical columns
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    data['column'] = label_encoders[column].fit_transform(data[column].astype(str))

# Feature scaling
numeric_data = data.select_dtypes(include=['float64', 'int64'])

# Apply scaling only to numeric columns
scaler = StandardScaler()
scaled_features = scaler.fit_transform(numeric_data)
# Check for any inconsistent columns
X = data.drop(columns=['tomatometer_rating'])
print(X.dtypes)
print(X.columns)
if 'some_column' in X.columns:
    # Do something with 'some_column'
    X['some_column'] = pd.to_numeric(X['some_column'], errors='coerce')  # This will convert non-numeric values to NaN
else:
    print("'some_column' does not exist in the DataFrame")



movie_title            object
movie_info             object
critics_consensus      object
rating                 object
genre                  object
directors              object
writers                object
cast                   object
in_theaters_date       object
on_streaming_date      object
runtime_in_minutes    float64
studio_name            object
tomatometer_status     object
tomatometer_count       int64
audience_rating       float64
column                  int64
dtype: object
Index(['movie_title', 'movie_info', 'critics_consensus', 'rating', 'genre',
       'directors', 'writers', 'cast', 'in_theaters_date', 'on_streaming_date',
       'runtime_in_minutes', 'studio_name', 'tomatometer_status',
       'tomatometer_count', 'audience_rating', 'column'],
      dtype='object')
'some_column' does not exist in the DataFrame


**Step 4: Split the Data**

In [16]:
# Split into features and target
if 'datetime_column' in X.columns:
    X['datetime_column'] = pd.to_datetime(X['datetime_column']).astype(int) / 10**9  # Convert to timestamp in seconds

if 'time_column' in X.columns:
    X['time_column'] = pd.to_timedelta(X['time_column']).dt.total_seconds()  # Convert to total secon
X = data.drop(columns=['audience_rating'])
y = data['audience_rating']
categorical_cols = X.select_dtypes(include=['object']).columns

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


**Step 5: Build the Model**

In [25]:
# Example DataFrame (Replace with your dataset)
data = {
    'rating': ['PG', 'R', 'PG-13', 'R'],
    'genre': ['Action', 'Comedy', 'Action', 'Drama'],
    'directors': ['A', 'B', 'C', 'A'],
    'writers': ['X', 'Y', 'Z', 'X'],
    'cast': ['Actor1', 'Actor2', 'Actor3', 'Actor4'],
    'runtime_in_minutes': [120, 90, 150, 100],
    'tomatometer_status': ['Certified Fresh', 'Fresh', 'Rotten', 'Fresh'],
    'target': [3.5, 4.0, 2.5, 3.0]  # Replace 'target' with your actual target column
}
df = pd.DataFrame(data)

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Specify categorical and numeric columns
categorical_columns = ['rating', 'genre', 'directors', 'writers', 'cast', 'tomatometer_status']
numeric_columns = ['runtime_in_minutes']

# Define preprocessing for categorical and numeric features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
    ],
    remainder='passthrough'  # Leave numeric columns unchanged
)

# Create pipeline including preprocessing and regressor
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline
pipeline.fit(X_train, y_train)


The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



**Step 6: Validate the Model
Use metrics like RMSE and R² score.**

In [26]:
# Predictions and evaluation
y_pred = pipeline.predict(X_test)
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R-squared Score: {r2_score(y_test, y_pred):.2f}")


Mean Squared Error: 1.00
R-squared Score: nan


