This notebook implements a data preprocessing and modeling pipeline that predicts IMDb ratings for Indian movies.

Dataset: https://www.kaggle.com/datasets/adrianmcmahon/imdb-india-movies

### Importing Libraries
- The code imports pandas for data manipulation and, from scikit-learn, train_test_split, RandomForestRegressor, DecisionTreeRegressor, mean_squared_error, SimpleImputer, ColumnTransformer, and OneHotEncoder. It also mounts Google Drive in Colab so that the dataset can be read from it.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Step 1: Data Preprocessing
1. **Loading Data**: It loads the IMDb movies data from a CSV file into a pandas DataFrame using latin1 encoding, and treats the strings 'N/A', 'NA', and 'NaN' as missing values.
   
2. **Handling Non-Numeric Values in 'Year' and 'Duration'**: It extracts the first run of digits from each 'Year' and 'Duration' value and converts the result to float data type. Values without digits become missing.

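3. **Handling Missing Values**: It uses SimpleImputer from scikit-learn to fill missing values in the 'Year' and 'Duration' columns with their column means. The imputer is fitted on the full dataset, before the train/test split, so the means include test rows.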

4. **Defining Features and Target**: It defines the feature matrix (X) containing columns like 'Year', 'Duration', 'Genre', 'Director', and actors, and sets the target variable (y) as 'Rating', converted to float data type.

5. **Splitting Data**: It splits the data into training and testing sets using train_test_split from scikit-learn.

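6. **Encoding Categorical Columns**: It uses ColumnTransformer with OneHotEncoder to encode the categorical columns ('Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3') and passes the remaining numeric columns through unchanged. The encoder is fitted on the training set only; with handle_unknown='ignore', categories that appear only in the test set are encoded as all zeros.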
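
Steps 2 and 3 can be illustrated on a tiny made-up DataFrame. This is only a sketch: the rows are hypothetical, and the raw formats `"(2019)"` and `"109 min"` are assumptions about how the CSV stores these columns.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical rows; the "(2019)" and "109 min" formats are assumed
toy = pd.DataFrame({
    "Year": ["(2019)", "(2021)", None],
    "Duration": ["109 min", None, "90 min"],
})

# Keep the first run of digits in each value; missing values stay NaN
for col in ["Year", "Duration"]:
    toy[col] = toy[col].str.extract(r"(\d+)", expand=False).astype(float)

# Replace the remaining NaN values with the mean of each column
imputer = SimpleImputer(strategy="mean")
toy[["Year", "Duration"]] = imputer.fit_transform(toy[["Year", "Duration"]])

print(toy)
```

In this toy frame the missing year is filled with 2020.0 (the mean of 2019 and 2021) and the missing duration with 99.5 (the mean of 109 and 90).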

In [2]:
# Step 1: Data Preprocessing
file_path = '/content/drive/MyDrive/Colab Notebooks/Neuronexus Innovations/NeuroNexus Innovations - Data Science/Movie Rating Prediction/IMDb Movies India.csv'

# Load the data and handle non-numeric columns during data loading
movie_data = pd.read_csv(file_path, encoding='latin1', na_values=['N/A', 'NA', 'NaN'])

# Extract numeric values from 'Year' and 'Duration' columns
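# A raw string avoids the invalid '\d' escape warning; expand=False returns a Series
movie_data['Year'] = movie_data['Year'].str.extract(r'(\d+)', expand=False).astype(float)
movie_data['Duration'] = movie_data['Duration'].str.extract(r'(\d+)', expand=False).astype(float)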

# Handle missing values using SimpleImputer for numerical columns
numerical_cols = ['Year', 'Duration']
imputer = SimpleImputer(strategy='mean')
movie_data[numerical_cols] = imputer.fit_transform(movie_data[numerical_cols])

# Define the feature matrix (X) and the target variable (y)
X = movie_data[['Year', 'Duration', 'Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']]
y = movie_data['Rating'].astype(float)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use ColumnTransformer to apply OneHotEncoder to the categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3'])
    ],
    remainder='passthrough'
)

# Fit and transform the preprocessor on the training data
X_train_preprocessed = preprocessor.fit_transform(X_train)

# Transform the testing data using the preprocessor fitted on the training data
X_test_preprocessed = preprocessor.transform(X_test)
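
The effect of `handle_unknown='ignore'` can be seen in a small sketch. The two frames below are hypothetical and use only one categorical column; the helper `as_dense` is introduced here for illustration because ColumnTransformer may return either a sparse matrix or a dense array, depending on the density of the result.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Comedy" appears only in the test frame
train = pd.DataFrame({"Genre": ["Drama", "Action", "Drama"],
                      "Year": [2001.0, 2010.0, 2015.0]})
test = pd.DataFrame({"Genre": ["Comedy"], "Year": [2020.0]})

pre = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Genre"])],
    remainder="passthrough",
)

def as_dense(matrix):
    """Return a dense NumPy array whether the input is sparse or dense."""
    return matrix.toarray() if sparse.issparse(matrix) else np.asarray(matrix)

train_enc = as_dense(pre.fit_transform(train))  # columns: Action, Drama, Year
test_enc = as_dense(pre.transform(test))        # unseen genre -> zeros

print(train_enc)
print(test_enc)
```

The unseen genre produces zeros in both one-hot columns, while the passthrough `Year` value is kept. The model therefore gets no genre signal for that row instead of raising an error.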

### Step 2: Modeling
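1. **Initializing Models**: It initializes two regressors, Random Forest and Decision Tree, with default hyperparameters and no fixed random_state, so the results vary slightly between runs.

2. **Imputing the Target**: It fills missing 'Rating' values with the mean rating of the training set, and applies the same training mean to the test targets.

3. **Fitting and Evaluating Models**: It fits each model on the preprocessed training data, calculates the Mean Squared Error (MSE) of its predictions on the test data, and stores the MSE and the predictions in a results dictionary.

4. **Printing Results**: It prints the model names and their corresponding mean squared errors.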

In [3]:
# Step 2: Modeling
# Initialize the models (they are fitted in the loop below)
models = {
    'Random Forest': RandomForestRegressor(),
    'Decision Tree': DecisionTreeRegressor()
}

# Results dictionary to store model evaluation metrics
results = {}

# Handle missing values in the target variable using SimpleImputer
y_imputer = SimpleImputer(strategy='mean')
y_train_imputed = y_imputer.fit_transform(y_train.values.reshape(-1, 1)).flatten()
y_test_imputed = y_imputer.transform(y_test.values.reshape(-1, 1)).flatten()

# Fit each model on the imputed training target and evaluate it on the test set
for model_name, model in models.items():
    model.fit(X_train_preprocessed, y_train_imputed)
    y_pred = model.predict(X_test_preprocessed)

    mse = mean_squared_error(y_test_imputed, y_pred)

    # Store predictions and other relevant information for further analysis
    results[model_name] = {
        'mean_squared_error': mse,
        'predictions': y_pred
        # Additional information can be added here for comprehensive analysis
    }

# Print results for each model
for model_name, result in results.items():
    print(f"Results for {model_name}:")
    print("Mean Squared Error:", result['mean_squared_error'])
    print("\n")

Results for Random Forest:
Mean Squared Error: 0.8023782412474423


Results for Decision Tree:
Mean Squared Error: 1.1008900312837027
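
MSE is expressed in squared rating points, so its square root (RMSE) is easier to read as a typical error on the rating scale. A short sketch using the two MSE values printed above:

```python
import math

# MSE values copied from the notebook output above
mse_by_model = {
    "Random Forest": 0.8023782412474423,
    "Decision Tree": 1.1008900312837027,
}

# RMSE is the square root of MSE, in the same units as the rating
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.2f}")
```

This gives roughly 0.90 rating points for the Random Forest and 1.05 for the Decision Tree on this run.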




Overall, this code preprocesses the data, encodes the categorical variables, trains two models, and evaluates them using mean squared error as the metric. On this run, the Random Forest (MSE ≈ 0.80) outperformed the Decision Tree (MSE ≈ 1.10).
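
The notebook imputes missing ratings with the training mean. A different option, not used in the notebook, is to drop rows whose 'Rating' is missing before splitting, so that the models are trained and scored only on real ratings. A minimal sketch of that alternative on a hypothetical frame with made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for movie_data; all values are made up
toy = pd.DataFrame({
    "Year": [2001.0, 2005.0, 2010.0, 2015.0, 2019.0, 2021.0],
    "Rating": [7.1, np.nan, 5.4, 6.8, np.nan, 8.0],
})

# Keep only rows that actually have a rating
labeled = toy.dropna(subset=["Rating"])

X = labeled[["Year"]]
y = labeled["Rating"]

# Split after filtering, so no imputed targets enter either set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(len(toy), "rows before,", len(labeled), "rows after dropping missing ratings")
```

With this approach the test MSE reflects only movies with known ratings, at the cost of discarding the unlabeled rows.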