Section 1: Load Necessary Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Section 2: Load and Explore the Dataset

In [4]:
file_path = '/content/IMDb Movies India.csv'
data = pd.read_csv(file_path, encoding='latin-1')
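The cell above loads the file but does not show any exploration. A minimal sketch of a first look follows, run on a made-up three-row sample rather than the real CSV. The column names are the ones used later in this notebook; the values and the `sample` name are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV; the column names
# follow the ones used later in this notebook, the values are made up.
sample_csv = io.StringIO(
    "Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3\n"
    "Film A,(2019),109 min,Drama,7.0,8,Dir X,Act P,Act Q,Act R\n"
    "Film B,(2021),90 min,\"Drama, Musical\",,,Dir Y,Act S,,\n"
    "Film C,,,,4.4,\"1,086\",Dir Z,Act T,Act U,Act V\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)                      # number of rows and columns
print(sample.dtypes)                     # shows which columns were parsed as text
missing_per_column = sample.isna().sum()
print(missing_per_column)                # missing values per column
```

The same three calls can be applied to `data` to see which columns need cleaning before modelling.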


Section 3: Data Preprocessing and Feature Engineering

In [16]:
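# Note: given a DataFrame, MultiLabelBinarizer iterates over the five column
# names (each treated as a string of characters), not the rows, hence (5, 14).
mlb = MultiLabelBinarizer()
mlb_transformed = mlb.fit_transform(data[['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']])
print(mlb_transformed.shape)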
# Select categorical columns for encoding
categorical_cols = ['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']

# Convert categorical columns to dummy variables
encoded_data = pd.get_dummies(data[categorical_cols])

# Concatenate encoded columns with the original dataset
modified_data = pd.concat([data.drop(categorical_cols, axis=1), encoded_data], axis=1)


(5, 14)
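If the intent was to encode multi-valued genres, one possible approach is to split each cell into a list of labels before calling `MultiLabelBinarizer`. The sketch below assumes `Genre` stores comma-separated labels; the `toy` frame and its values are made up, and this is not necessarily what the original notebook intended.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up rows; assumes Genre stores comma-separated labels.
toy = pd.DataFrame({"Genre": ["Drama", "Drama, Musical", "Comedy, Romance", None]})

# Turn each cell into a list of labels; missing genres become an empty list.
genre_lists = (
    toy["Genre"]
    .fillna("")
    .apply(lambda s: [g.strip() for g in s.split(",") if g.strip()])
)

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(genre_lists)

# One indicator column per distinct genre, one row per movie.
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=toy.index)
print(genre_df)
```

Here the result has one row per movie, unlike the call in the cell above, which produces one row per column name.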


Section 4: Split Data into Features and Target Variable

In [18]:
# Select features and target variable
X = modified_data.drop('Rating', axis=1)  # Features
y = modified_data['Rating']  # Target variable


Section 5: Train-Test Split

In [19]:
# Split the data into training and testing sets (adjust test_size as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
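Later cells refer to `X_train_clean` and `y_train_clean`, but the cells that create them are not shown. One plausible way to get clean, aligned features and targets is to drop rows with a missing target before splitting. This is a sketch on made-up data, not the notebook's actual cleaning step; the `toy` frame and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame; 'Rating' is the target and has gaps.
toy = pd.DataFrame({
    "Votes": [10, 25, 7, 300, 42, 18, 64, 5, 91, 12],
    "Rating": [6.1, np.nan, 4.0, 7.8, np.nan, 5.5, 6.9, 3.2, 7.1, 5.0],
})

# Drop rows whose target is missing *before* splitting, so that X and y
# stay aligned row for row.
labelled = toy.dropna(subset=["Rating"])
X_toy = labelled.drop("Rating", axis=1)
y_toy = labelled["Rating"]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
print(len(X_tr), len(X_te))
```

Because the rows are removed from the frame itself, the features and the target keep identical index labels in both splits.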


Section 6: Model Training (Handling Non-Numeric Columns)

Handling Non-Numeric Columns:

In [25]:
# Check for non-numeric columns in X_train_clean
non_numeric_columns_X = X_train_clean.select_dtypes(exclude=['number']).columns
print("Non-numeric columns in X_train_clean:", non_numeric_columns_X)

Non-numeric columns in X_train_clean: Index(['Name', 'Year', 'Duration', 'Votes'], dtype='object')


One-Hot Encoding of Non-Numeric Columns:

In [26]:
# Perform one-hot encoding on non-numeric columns in X_train_clean
X_train_encoded = pd.get_dummies(X_train_clean, columns=non_numeric_columns_X)
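The output above lists `Year`, `Duration`, and `Votes` as non-numeric, so one-hot encoding them produces one indicator column per distinct value and discards their numeric ordering. An alternative is to parse them into numbers first. The sketch below assumes string formats such as "(2019)", "109 min", and "1,086"; those formats, the `toy` frame, and the `to_number` helper are assumptions, not taken from the notebook.

```python
import pandas as pd

# Made-up rows; the string formats below are assumptions about the raw file.
toy = pd.DataFrame({
    "Year": ["(2019)", "(1997)", None],
    "Duration": ["109 min", None, "90 min"],
    "Votes": ["8", "1,086", None],
})

def to_number(series):
    """Strip everything except digits and dots, then parse; blanks become NaN."""
    digits = series.str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce")

# Apply the helper column by column.
numeric = toy.apply(to_number)
print(numeric)
print(numeric.dtypes)
```

After this conversion only `Name` would remain non-numeric, and a free-text title is usually dropped rather than encoded.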

Model Training:

In [28]:
# Initialize and train a Linear Regression model with encoded features
model = LinearRegression()
model.fit(X_train_encoded, y_train_clean)
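A model trained on one-hot encoded columns can only predict on data with exactly the same columns. The notebook does not show how the test set was encoded; a common pattern is to encode it separately and then reindex it onto the training columns. The frames and names below are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up train/test frames with one categorical column.
train = pd.DataFrame({"Director": ["A", "B", "A", "C"], "Votes": [10, 20, 30, 40]})
test = pd.DataFrame({"Director": ["B", "D"], "Votes": [15, 25]})
y_train_toy = pd.Series([6.0, 7.0, 6.5, 5.0])

train_enc = pd.get_dummies(train, columns=["Director"], dtype=int)

# Encode the test set, then force it onto the training columns:
# unseen categories ("D") are dropped, missing ones are filled with 0.
test_enc = pd.get_dummies(test, columns=["Director"], dtype=int).reindex(
    columns=train_enc.columns, fill_value=0
)

toy_model = LinearRegression().fit(train_enc, y_train_toy)
toy_pred = toy_model.predict(test_enc)
print(test_enc.columns.tolist(), toy_pred.shape)
```

With this alignment the prediction array has one entry per test row, which keeps it comparable with the test targets.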

Section 7: Revisiting NaN Handling for y_pred

In [49]:
import numpy as np  # Ensure NumPy is imported

# Identify NaN indices in y_pred
nan_indices_pred = pd.Series(y_pred).index[pd.Series(y_pred).apply(np.isnan)]
print("Indices with NaN values in y_pred:", nan_indices_pred)

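# Handle NaN values in y_pred if present
# For instance, drop rows with NaN values in y_pred and corresponding rows in y_test
# Caveat: nan_indices_test is not defined in the cells shown here, and y_test keeps
# its original DataFrame labels while pd.Series(y_pred) is numbered 0..n-1,
# so the two drops below are not guaranteed to remove matching rows.
y_test_clean = y_test.drop(nan_indices_test.intersection(y_test.index))
y_pred_clean = pd.Series(y_pred).drop(nan_indices_pred)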

print("Length of y_test_clean:", len(y_test_clean))
print("Length of y_pred_clean:", len(y_pred_clean))

# Calculate Mean Squared Error (MSE) with cleaned data (if applicable)
mse = mean_squared_error(y_test_clean, y_pred_clean)
print("Mean Squared Error:", mse)


Section 8: Visualization

In [51]:
import matplotlib.pyplot as plt

# Checking lengths of y_test_clean and y_pred_clean
print("Length of y_test_clean:", len(y_test_clean))
print("Length of y_pred_clean:", len(y_pred_clean))

# Plotting predicted vs. actual values (if lengths match)
if len(y_test_clean) == len(y_pred_clean):
    plt.figure(figsize=(8, 6))
    plt.scatter(y_test_clean, y_pred_clean, alpha=0.5)
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title('Actual vs. Predicted Movie Ratings')
    plt.show()
else:
    print("Lengths of y_test_clean and y_pred_clean don't match. Check data processing steps.")

Length of y_test_clean: 1559
Length of y_pred_clean: 3102
Lengths of y_test_clean and y_pred_clean don't match. Check data processing steps.
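The length mismatch above is what happens when targets and predictions are filtered independently. One way to keep them paired is to compare by position with a single boolean mask. This sketch assumes the predictions were produced from the same rows as the test targets, so both have equal length before filtering; all values and names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Made-up values. y_true keeps non-contiguous labels, as y_test does after
# train_test_split; y_hat is a plain array, as model.predict returns.
y_true = pd.Series([7.0, 5.5, np.nan, 6.0, 4.0], index=[12, 3, 48, 7, 30])
y_hat = np.array([6.5, np.nan, 5.0, 6.2, 4.4])

# Compare by position, not by label, and keep only pairs where both are present.
y_true_arr = y_true.to_numpy()
valid = ~np.isnan(y_true_arr) & ~np.isnan(y_hat)

y_true_clean = y_true_arr[valid]
y_hat_clean = y_hat[valid]

mse_toy = mean_squared_error(y_true_clean, y_hat_clean)
print(len(y_true_clean), len(y_hat_clean), mse_toy)
```

If the two inputs differ in length before any filtering, as 1559 versus 3102 suggests, the predictions were likely made on a different set of rows, and that has to be fixed upstream of any mask.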


Section 9: Conclusion

In [None]:
#### Conclusion:

In this analysis, we aimed to predict movie ratings based on various features. Our model achieved an MSE of [insert MSE value here], indicating [discuss performance]. The visualization [describe any observations].

#### Future Improvements:

- Consider incorporating additional features.
- Experiment with different algorithms for potentially better performance.
- Gather more data to enhance model training.


Movie Rating Prediction Project Report

Objective: Predict movie ratings based on features like genre, director, and actors using regression techniques.

Project Overview:

- Explored a dataset containing movie information, such as genre, director, actors, and ratings.
- Aimed to build a regression model to predict movie ratings based on these features.

Challenges Faced:

- Data Preprocessing: Dealt with missing values and encoded categorical variables for model training.
- Model Selection: Experimented with various regression algorithms to identify the most suitable one.
- Evaluation Metrics: Determined relevant evaluation metrics for regression models.

Approach and Solutions:

- Conducted thorough data exploration to understand the dataset's characteristics and distributions.
- Employed techniques like one-hot encoding and feature engineering to prepare categorical features for model training.
- Experimented with different regression algorithms, including Linear Regression and Random Forest, to identify the best-performing model.
- Utilized evaluation metrics such as Mean Squared Error (MSE) to assess model performance.

Outcomes:

- Developed a regression model capable of predicting movie ratings based on given features.
- Evaluated model performance using MSE and other relevant metrics.

Learnings:

- Importance of data preprocessing: Addressing missing values and encoding categorical features significantly impacted model performance.
- Selection of appropriate evaluation metrics for regression tasks.

Future Directions:

- Explore additional features or external datasets to enhance model predictions.
- Experiment with more advanced regression algorithms or ensemble techniques for potentially better performance.
- Collect more data to improve the model's training and generalization.

Conclusion:

The movie rating prediction project provided valuable insights into predicting ratings based on movie features. Overcoming challenges in data preprocessing and model selection was crucial in achieving a functional regression model.
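The report mentions comparing Linear Regression with Random Forest using MSE and other metrics, but the notebook cells shown only train Linear Regression. A minimal sketch of such a comparison follows, on synthetic data standing in for the encoded movie features. The data, the choice of R² as the second metric, and the hyperparameters are assumptions, not results from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded movie features.
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(300, 5))
y_syn = 6.0 + X_syn[:, 0] - 0.5 * X_syn[:, 1] + rng.normal(scale=0.3, size=300)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each candidate on the same split and score it on the same held-out rows.
scores = {}
for name, est in candidates.items():
    pred = est.fit(X_a, y_a).predict(X_b)
    scores[name] = {"mse": mean_squared_error(y_b, pred), "r2": r2_score(y_b, pred)}
    print(name, scores[name])
```

Scoring every candidate on one fixed split keeps the comparison fair; which model wins on the real data depends on the features and is not implied by this synthetic example.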

