# Movie Recommendation and Rating Prediction System

## User-Based Collaborative Filtering

User-based collaborative filtering predicts a user's rating for an item from the ratings that similar users gave to that item. It builds a user-item matrix from the existing ratings, computes user similarities (Pearson correlation in the code below), weights each similar user's rating by that similarity, and normalizes by the sum of the absolute similarities. The result is a recommendation personalized by the preferences of like-minded users.
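
The weighting and normalization steps are easier to follow on a tiny matrix. The following is a minimal sketch, not the notebook's implementation: the data in `toy`, the helper name `predict_rating`, and the choice to keep only positively correlated neighbours are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy user-item matrix: rows are users, columns are items, NaN = not rated
toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0],
        "B": [3.0, 2.0, 5.0],
        "C": [4.0, 3.0, 2.0],
        "D": [np.nan, 5.0, 1.0],
    },
    index=["u1", "u2", "u3"],
)

def predict_rating(matrix, user, item):
    """Similarity-weighted average of neighbours' ratings for `item`."""
    target = matrix.loc[user]
    # Only other users who actually rated the item can contribute
    neighbours = matrix.drop(index=user)
    neighbours = neighbours[neighbours[item].notna()]
    if neighbours.empty:
        return np.nan
    # Pearson correlation with the target user over co-rated items
    sims = neighbours.apply(lambda row: row.corr(target), axis=1).dropna()
    # Assumption for this sketch: ignore negatively correlated users
    sims = sims[sims > 0]
    if sims.empty:
        return np.nan
    ratings = neighbours.loc[sims.index, item]
    return (sims * ratings).sum() / sims.sum()

prediction = predict_rating(toy, "u1", "D")
print(round(prediction, 2))
```

Here `u2` rates items A to C in the same order as `u1` (correlation 1), while `u3` is negatively correlated and is dropped, so the prediction for item D is `u2`'s rating of 5.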


In [None]:
import pandas as pd
import numpy as np

# Load data
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

# Merge movies and ratings data
movie_ratings = pd.merge(ratings, movies, on='movieId')

# User-item matrix
user_item_matrix = movie_ratings.pivot_table(index='userId', columns='movieId', values='rating')

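# User-based collaborative filtering
def user_based_cf(user_id, item_id):
    # Unknown user or item: there is nothing to base a prediction on
    if user_id not in user_item_matrix.index or item_id not in user_item_matrix.columns:
        return 0
    user_ratings = user_item_matrix.loc[user_id].dropna()
    if user_ratings.empty:
        return 0
    # Pearson correlation between the target user and every user (row-wise)
    similar_users = user_item_matrix.corrwith(user_ratings, axis=1)
    # Exclude the target user, whose self-correlation would dominate the weights
    similar_users = similar_users.drop(user_id, errors='ignore').dropna()
    # Keep only neighbours who rated the item, so the normalization
    # covers exactly the users that contribute to the weighted sum
    item_ratings = user_item_matrix.loc[similar_users.index, item_id].dropna()
    similar_users = similar_users.loc[item_ratings.index]
    if similar_users.empty:
        return 0
    weighted_sum = (item_ratings * similar_users).sum()
    sum_of_weights = similar_users.abs().sum()
    if sum_of_weights == 0:
        return 0
    return weighted_sum / sum_of_weights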

In [None]:
!pip install scikit-surprise



## Content-Based Movie Recommendation System

1. **Data Preparation**: Movies' genres are processed for consistency and completeness.

2. **Feature Extraction**: Genres are transformed into numerical vectors using TF-IDF.

3. **Similarity Calculation**: Cosine similarity is computed between movie vectors, measuring genre similarity.

4. **Recommendation Generation**: For a user, watched movies are used to find similar ones based on content.

5. **Presentation**: Top similar movies, not yet watched, are suggested to the user.
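
Steps 2 and 3 can be checked on a handful of genre strings. The sketch below uses made-up genre strings (`toy_genres` is hypothetical) and illustrates one detail worth knowing: because `TfidfVectorizer` L2-normalizes each row by default, the plain dot product from `linear_kernel` gives the same values as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings, already joined with spaces
toy_genres = [
    "Adventure Animation Children",
    "Adventure Children Fantasy",
    "Crime Thriller",
]

# Rows are L2-normalized by default (norm='l2')
toy_tfidf = TfidfVectorizer().fit_transform(toy_genres)

sim_dot = linear_kernel(toy_tfidf, toy_tfidf)   # dot products
sim_cos = cosine_similarity(toy_tfidf)          # explicit cosine similarity

print(np.round(sim_dot, 3))
print("Same result:", np.allclose(sim_dot, sim_cos))
```

The first two strings share two genres and get a positive score, while the third shares none with either and scores exactly zero.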

In [None]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Load data from CSV files
movies_df = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

# Merge movie genres into a single string
movies_df['genres'] = movies_df['genres'].fillna('')
movies_df['genres'] = movies_df['genres'].apply(lambda x: ' '.join(x.split('|')))

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['genres'])

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get recommendations based on content similarity
def get_content_based_recommendations(movie_title, cosine_sim=cosine_sim):
    idx = movies_df[movies_df['title'] == movie_title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movies_df['title'].iloc[movie_indices]

# Get user ID for recommendations (replace with actual user input)
user_id = 10

# Get top 10 movie recommendations for the user
user_ratings = ratings_df[ratings_df['userId'] == user_id]
watched_movies = set(user_ratings['movieId'])
recommended_movies = []

for movie_id in watched_movies:
    movie_title = movies_df[movies_df['movieId'] == movie_id]['title'].values[0]
    similar_movies = get_content_based_recommendations(movie_title)
    recommended_movies.extend(similar_movies)

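# Filter by title: recommended_movies holds titles, while watched_movies holds movie IDs
watched_titles = set(movies_df[movies_df['movieId'].isin(watched_movies)]['title'])
recommended_movies = [movie for movie in recommended_movies if movie not in watched_titles]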
top_10_recommendations = recommended_movies[:10]

# Print recommendations
print("Top 10 Content-Based Movie Recommendations for User", user_id)
for recommendation in top_10_recommendations:
    print(f"- {recommendation}")


Top 10 Content-Based Movie Recommendations for User 10
- Sabrina (1995)
- Clueless (1995)
- Two if by Sea (1996)
- French Twist (Gazon maudit) (1995)
- If Lucy Fell (1996)
- Boomerang (1992)
- Pie in the Sky (1996)
- Mallrats (1995)
- Nine Months (1995)
- Forget Paris (1995)


## Content-Based Movie Recommendations Based on Movie ID

1. **Data Loading**: The movie and ratings data are loaded from CSV files.

2. **Genres Preprocessing**: Movie genres are merged into a single string and processed for consistency.

3. **TF-IDF Vectorization**: The TF-IDF Vectorizer from scikit-learn is used to convert movie genres into numerical vectors.

4. **Cosine Similarity Calculation**: Cosine similarity is computed between the TF-IDF vectors of movies, resulting in a similarity matrix.

5. **Recommendation Function**: A function is defined to retrieve similar movies for a given movie ID. It takes the movie ID and the cosine similarity matrix as input and returns the titles of the ten most similar movies as a pandas Series.

6. **Movie ID Specification**: A specific movie ID is provided for which recommendations are sought.

7. **Recommendation Retrieval**: Using the defined function, similar movies are retrieved based on content similarity to the specified movie.

8. **Presentation**: The list of similar movies is printed to display the recommendations to the user.
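
One subtlety in step 5 is leaving the query movie out of its own results. With genre-only vectors, several movies can tie at a similarity of 1.0, so the query is not guaranteed to sort first. The following sketch shows one way to exclude it by index rather than by position; `top_k_similar` and the toy matrix are hypothetical names and data, not part of the notebook.

```python
import numpy as np

def top_k_similar(sim_matrix, query_idx, k=10):
    """Return the indices of the k most similar rows, never the query itself."""
    scores = np.asarray(sim_matrix[query_idx], dtype=float).copy()
    # Mask the query so it can never be returned, even when other rows tie with it
    scores[query_idx] = -np.inf
    k = min(k, len(scores) - 1)
    # Stable sort keeps the original order among tied scores
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Hypothetical similarity matrix in which rows 0 and 1 are identical in genre
toy_sim = [
    [1.0, 1.0, 0.2],
    [1.0, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]

print(top_k_similar(toy_sim, query_idx=1, k=2))
```

For row 1, a positional slice that skips the first sorted entry would drop row 0 and keep row 1 itself; masking by index returns rows 0 and 2 instead.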


In [None]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Load data from CSV files
movies_df = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

# Merge movie genres into a single string
movies_df['genres'] = movies_df['genres'].fillna('')
movies_df['genres'] = movies_df['genres'].apply(lambda x: ' '.join(x.split('|')))

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['genres'])

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get recommendations based on content similarity
def get_content_based_recommendations(movie_id, cosine_sim=cosine_sim):
    idx = movies_df[movies_df['movieId'] == movie_id].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movies_df['title'].iloc[movie_indices]

# Specify movie ID for which recommendations are needed
movie_id = 1

# Get similar movies based on content similarity
similar_movies = get_content_based_recommendations(movie_id)

# Print recommendations
print("Movies Similar to Movie with ID:", movie_id)
for movie_title in similar_movies:
    print("- ", movie_title)


Movies Similar to Movie with ID: 1
-  Antz (1998)
-  Toy Story 2 (1999)
-  Adventures of Rocky and Bullwinkle, The (2000)
-  Emperor's New Groove, The (2000)
-  Monsters, Inc. (2001)
-  Wild, The (2006)
-  Shrek the Third (2007)
-  Tale of Despereaux, The (2008)
-  Asterix and the Vikings (Astérix et les Vikings) (2006)
-  Turbo (2013)


## Combined Recommendation from Multiple Models

1. **Data Load**: Load ratings and movie data.

2. **Surprise Setup**: Configure Surprise library.

3. **Model Training**: Train SVD, KNNBasic, BaselineOnly, and CoClustering models.

4. **Recommendation Generation**: Predict ratings for unwatched movies.

5. **Top Recommendations**: Select top 10 recommendations based on average rating.

6. **Display**: Print top 10 movie recommendations for User 1.
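
Steps 4 and 5 amount to averaging each movie's estimates across models and ranking by that average. As a small sketch with invented numbers (the `toy_estimates` table is hypothetical and does not come from any trained model), the aggregation can be written as a pandas group-by:

```python
import pandas as pd

# Hypothetical per-model estimates for two unwatched movies
toy_estimates = pd.DataFrame(
    {
        "model": ["SVD", "SVD", "KNNBasic", "KNNBasic"],
        "movieId": [10, 20, 10, 20],
        "est": [4.0, 3.0, 5.0, 3.5],
    }
)

# Average the estimates per movie, then rank from highest to lowest
mean_est = (
    toy_estimates.groupby("movieId")["est"].mean().sort_values(ascending=False)
)
top_ids = mean_est.index.tolist()

print(mean_est)
```

A plain average gives every model the same weight; weighting by each model's validation error would be an alternative, but that is outside what this notebook does.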


In [None]:
from surprise import SVD, KNNBasic, CoClustering, BaselineOnly
from surprise import Dataset, Reader
import pandas as pd
import numpy as np

# Load data from CSV files
movies_df = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

# Create Surprise Reader object
reader = Reader(rating_scale=(1, 5))

# Create Surprise Dataset
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Build the training set from all available ratings (no held-out test set)
trainset = data.build_full_trainset()

# Initialize models
svd = SVD()
knn = KNNBasic()
baseline = BaselineOnly()
co_clustering = CoClustering()

# Train models
svd.fit(trainset)
knn.fit(trainset)
baseline.fit(trainset)
co_clustering.fit(trainset)

# Get the list of movies already watched by User 1
user_id = 1
watched_movies = set(ratings_df[ratings_df['userId'] == user_id]['movieId'])

# Generate recommendations excluding watched movies
all_movie_ids = set(movies_df['movieId'])
unwatched_movies = list(all_movie_ids - watched_movies)

# Predict ratings for unwatched movies using each model
predictions = []
for model in [svd, knn, baseline, co_clustering]:
    model_predictions = [(user_id, movie_id, model.predict(user_id, movie_id).est) for movie_id in unwatched_movies]
    predictions.extend(model_predictions)

# Aggregate predictions
combined_preds = {}
for _, movie_id, est in predictions:  # the user ID is the same for every entry
    if movie_id not in combined_preds:
        combined_preds[movie_id] = [est]
    else:
        combined_preds[movie_id].append(est)

# Take the average of predictions
for movie_id in combined_preds:
    combined_preds[movie_id] = np.mean(combined_preds[movie_id])

# Sort recommendations by the average estimated rating
sorted_recommendations = sorted(combined_preds.items(), key=lambda x: x[1], reverse=True)

# Get top 10 recommendations
top_10_recommendations = sorted_recommendations[:10]

# Print recommendations
print(f"Top 10 Movie Recommendations for User {user_id}:")
for movie_id, _ in top_10_recommendations:
    movie_title = movies_df[movies_df['movieId'] == movie_id]['title'].values[0]
    print(f"- {movie_title}")


Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Top 10 Movie Recommendations for User 1:
- Shawshank Redemption, The (1994)
- Three Billboards Outside Ebbing, Missouri (2017)
- Godfather, The (1972)
- Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
- Lawrence of Arabia (1962)
- Secrets & Lies (1996)
- Godfather: Part II, The (1974)
- Departed, The (2006)
- Streetcar Named Desire, A (1951)
- Guess Who's Coming to Dinner (1967)


### Movie Recommendation Model Evaluation

1. **Data Load**: Load ratings and movie data from CSV files.

2. **Surprise Setup**: Configure Surprise library and create a Reader object.

3. **Dataset Creation**: Create a Surprise Dataset from the ratings data.

4. **Train-Test Split**: Split the dataset into training and testing sets.

5. **Model Definition**: Define four recommendation models: KNNBasic, SVD, BaselineOnly, and CoClustering.

6. **Model Training and Evaluation**: Train each model on the training set and evaluate its performance using RMSE and MAE metrics on the test set.

7. **Results Presentation**: Display the RMSE and MAE for each model in a DataFrame.
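
For reference, the two metrics in step 6 are simple functions of the prediction errors. The worked example below uses four invented ratings (`actual` and `predicted` are hypothetical) to show the arithmetic that the library functions perform:

```python
import numpy as np

# Hypothetical true and predicted ratings
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.0, 4.0, 3.0, 2.0])

errors = predicted - actual                  # [-1, 1, -2, 0]
rmse_value = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 1 + 4 + 0) / 4)
mae_value = np.mean(np.abs(errors))          # (1 + 1 + 2 + 0) / 4

print(f"RMSE: {rmse_value:.4f}")
print(f"MAE:  {mae_value:.4f}")
```

RMSE squares the errors before averaging, so the single two-star miss contributes more to it than to MAE.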



In [None]:
import pandas as pd
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import KNNBasic, SVD, BaselineOnly, CoClustering
from surprise.accuracy import rmse, mae
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split

# Load data from CSV files
movies_df = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

# Create Surprise Reader object
reader = Reader(rating_scale=(1, 5))

# Create Surprise Dataset
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Train-test split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define models
models = {
    "KNNBasic": KNNBasic(),
    "SVD": SVD(),
    "BaselineOnly": BaselineOnly(),
    "CoClustering": CoClustering()
}

# Train and evaluate models
results = {}
for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    model.fit(trainset)
    predictions = model.test(testset)
    results[model_name] = {
        "RMSE": rmse(predictions),
        "MAE": mae(predictions)
    }

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Display results
print("\nPerformance Results:")
print(results_df)


Evaluating KNNBasic...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9561
MAE:  0.7325
Evaluating SVD...
RMSE: 0.8768
MAE:  0.6738
Evaluating BaselineOnly...
Estimating biases using als...
RMSE: 0.8785
MAE:  0.6778
Evaluating CoClustering...
RMSE: 0.9511
MAE:  0.7349

Performance Results:
      KNNBasic       SVD  BaselineOnly  CoClustering
RMSE  0.956073  0.876759      0.878510      0.951070
MAE   0.732520  0.673789      0.677786      0.734887


In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# Load data
ratings = pd.read_csv('ratings.csv')

# Train-test split
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

# Function to calculate RMSE and MAE
def evaluate(predictions):
    rmse = np.sqrt(mean_squared_error(predictions['rating'], predictions['prediction']))
    mae = mean_absolute_error(predictions['rating'], predictions['prediction'])
    return rmse, mae

# K-Nearest Neighbors regression on raw (userId, movieId) pairs; the IDs are treated as numeric features
knn_cf = KNeighborsRegressor(n_neighbors=10)
knn_cf.fit(train[['userId', 'movieId']], train['rating'])
knn_cf_preds = knn_cf.predict(test[['userId', 'movieId']])
knn_cf_predictions = pd.DataFrame({'rating': test['rating'], 'prediction': knn_cf_preds})
knn_cf_rmse, knn_cf_mae = evaluate(knn_cf_predictions)

# Support Vector Machine (SVM)
svm_regressor = SVR()
svm_regressor.fit(train[['userId', 'movieId']], train['rating'])
svm_preds = svm_regressor.predict(test[['userId', 'movieId']])
svm_predictions = pd.DataFrame({'rating': test['rating'], 'prediction': svm_preds})
svm_rmse, svm_mae = evaluate(svm_predictions)

# Decision Tree
dt_regressor = DecisionTreeRegressor()
dt_regressor.fit(train[['userId', 'movieId']], train['rating'])
dt_preds = dt_regressor.predict(test[['userId', 'movieId']])
dt_predictions = pd.DataFrame({'rating': test['rating'], 'prediction': dt_preds})
dt_rmse, dt_mae = evaluate(dt_predictions)

# Principal Component Analysis (PCA) of the two ID columns, followed by a Decision Tree
pca = PCA()
pca_train = pca.fit_transform(train[['userId', 'movieId']])
pca_test = pca.transform(test[['userId', 'movieId']])
pca_regressor = DecisionTreeRegressor()
pca_regressor.fit(pca_train, train['rating'])
pca_preds = pca_regressor.predict(pca_test)
pca_predictions = pd.DataFrame({'rating': test['rating'], 'prediction': pca_preds})
pca_rmse, pca_mae = evaluate(pca_predictions)

# Compare results
results = pd.DataFrame({
    'Model': ['KNN CF', 'SVM', 'Decision Tree', 'PCA'],
    'RMSE': [knn_cf_rmse, svm_rmse, dt_rmse, pca_rmse],
    'MAE': [knn_cf_mae, svm_mae, dt_mae, pca_mae]
})

print(results)


           Model      RMSE       MAE
0         KNN CF  1.042864  0.821442
1            SVM  1.065239  0.832667
2  Decision Tree  1.281258  0.946822
3            PCA  1.280799  0.948879


## Movie Rating Prediction with Random Forest Regression

1. **Data Load**: Load movie and ratings data from CSV files.

2. **Genre Processing**: Merge movie genres into a single string and preprocess.

3. **Data Merging**: Combine ratings with movies based on movieId.

4. **Feature Engineering**: Select the candidate feature columns (movieId, title, genres); only genres is passed to the model.

5. **TF-IDF Vectorization**: Convert movie genres into numerical vectors using TF-IDF.

6. **Train-Test Split**: Split the dataset into training and testing sets.

7. **Model Training**: Train a Random Forest Regressor with 100 estimators.

8. **Prediction**: Predict ratings for the test set.

9. **Model Evaluation**: Evaluate the model performance using Mean Squared Error (MSE).
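
Steps 5 to 8 can also be expressed as a single scikit-learn pipeline, which keeps the fitted vectorizer and the regressor together so that new genre strings are transformed the same way as the training data. This is a sketch on invented data (`toy_genres` and `toy_ratings` are hypothetical), not a replacement for the cell below:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical training data: one genre string and one rating per row
toy_genres = [
    "Comedy Romance",
    "Action Thriller",
    "Comedy",
    "Action Adventure",
    "Drama Romance",
    "Drama",
]
toy_ratings = [3.5, 4.0, 3.0, 4.5, 4.0, 3.5]

# Vectorizer and regressor are fitted together and applied together
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
model.fit(toy_genres, toy_ratings)

# A new genre string goes through the same fitted TF-IDF vocabulary
pred = model.predict(["Comedy Romance"])
print(pred)
```

Because the only input is the genre string, every movie with the same genres receives the same predicted rating, regardless of which user is asking.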


In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import TfidfVectorizer

# Load data from CSV files
movies_df = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

# Merge movie genres into a single string
movies_df['genres'] = movies_df['genres'].fillna('')
movies_df['genres'] = movies_df['genres'].apply(lambda x: ' '.join(x.split('|')))

# Merge ratings with movies
data = pd.merge(ratings_df, movies_df, on='movieId')

# Feature Engineering
X = data[['movieId', 'title', 'genres']]  # Columns of the merged frame; only 'genres' is vectorized below
y = data['rating']

# TF-IDF Vectorization
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(X['genres'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Train a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Predict ratings
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.9986157839587124


## Top Movie Recommendations for User 1 based on Highly Rated Movies

1. **Filtering User Ratings**: Extract ratings given by User 1 from the ratings dataframe.

2. **Merging Data**: Merge User 1's ratings with movie data to gather information about the rated movies.

3. **Filtering Highly Rated Movies**: Select movies highly rated by User 1, using a threshold (e.g., rating >= 4).

4. **Recommendation Generation**: For each highly rated movie, predict the rating using the trained Random Forest Regressor model.

5. **Sorting Recommendations**: Sort the recommendations by predicted rating in descending order.

6. **Top Recommendations**: Display the top 10 movie recommendations based on predicted ratings for User 1.


In [None]:
import numpy as np

# The 'userId' column in ratings_df holds the user IDs
# Let's filter ratings_df to get ratings by user 1
user_1_ratings = ratings_df[ratings_df['userId'] == 1]

# Let's merge user 1's ratings with movie data to get information about the movies
user_1_data = pd.merge(user_1_ratings, movies_df, on='movieId')

# Let's filter the movies that user 1 rated highly (you can define a threshold for what's considered highly rated)
highly_rated_movies = user_1_data[user_1_data['rating'] >= 4]

# Now, let's score these highly rated movies with the trained model
rows = []
for movie_id, title, genres in zip(highly_rated_movies['movieId'], highly_rated_movies['title'], highly_rated_movies['genres']):
    # Vectorize the movie's genres with the fitted TF-IDF vectorizer.
    # X_tfidf cannot be indexed by a movies_df row position: its rows follow
    # the merged ratings data, one row per rating.
    tfidf_vector = tfidf.transform([genres])

    # Predict the rating for this movie using the trained model
    predicted_rating = rf_regressor.predict(tfidf_vector)[0]

    rows.append({'movieId': movie_id, 'title': title, 'predicted_rating': predicted_rating})

# Collect the predictions in a DataFrame
recommendations = pd.DataFrame(rows, columns=['movieId', 'title', 'predicted_rating'])

# Sort the recommendations by predicted rating in descending order
recommendations = recommendations.sort_values(by='predicted_rating', ascending=False)

# Get the top 10 recommendations
top_10_recommendations = recommendations.head(10)

print("Top 10 Recommendations for User 1:")
print(top_10_recommendations[['title']].to_string(index=False, justify='left'))



Top 10 Recommendations for User 1:
title                                                         
                       Teenage Mutant Ninja Turtles III (1993)
                                              Ladyhawke (1985)
                                          Wayne's World (1992)
                                               Scream 3 (2000)
                                                    JFK (1991)
Teenage Mutant Ninja Turtles II: The Secret of the Ooze (1991)
                                               Red Dawn (1984)
                                  Good Morning, Vietnam (1987)
                                         Grumpy Old Men (1993)
                                                   Hook (1991)


## Comparison of Regression Models for Movie Rating Prediction

1. **Regressor Definition**: Five regression models (Random Forest Regressor, Gradient Boosting Regressor, Support Vector Regressor, Linear Regression, Ridge Regression) are initialized.

2. **Model Training and Evaluation**: Each regressor is trained on the training data and evaluated using Mean Squared Error (MSE) on the test set.

3. **MSE Comparison**: The MSE values for each regressor are computed and compared to assess their performance in predicting movie ratings.
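
MSE is in squared rating units, so it is not directly comparable with the RMSE values reported for the Surprise models earlier. Taking the square root puts both on the same scale. The snippet below converts the MSE values printed by the cell that follows. Note that the two experiments use different train-test splits, so the comparison is indicative only.

```python
import math

# MSE values as printed in this notebook's output
mse_by_model = {
    "Random Forest Regressor": 0.9986157839587124,
    "Gradient Boosting Regressor": 1.013223073678727,
    "Support Vector Regressor": 1.0195395514788514,
    "Linear Regression": 1.0317016766515907,
    "Ridge Regression": 1.0316982219610418,
}

# RMSE is the square root of MSE
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, value in rmse_by_model.items():
    print(f"{name}: RMSE = {value:.4f}")
```

All five genre-only regressors land near an RMSE of 1.0, above the 0.8768 reported for SVD in the evaluation section.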



In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge

# Define a dictionary to store the regressors
regressors = {
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge()
}

# Dictionary to store MSE values for each regressor
mse_scores = {}

# Iterate over each regressor
for name, regressor in regressors.items():
    # Train the regressor
    regressor.fit(X_train, y_train)

    # Predict ratings
    y_pred = regressor.predict(X_test)

    # Compute mean squared error
    mse = mean_squared_error(y_test, y_pred)

    # Store MSE in dictionary
    mse_scores[name] = mse

# Print MSE values for each regressor
for name, mse in mse_scores.items():
    print(f"{name}: MSE = {mse}")


Random Forest Regressor: MSE = 0.9986157839587124
Gradient Boosting Regressor: MSE = 1.013223073678727
Support Vector Regressor: MSE = 1.0195395514788514
Linear Regression: MSE = 1.0317016766515907
Ridge Regression: MSE = 1.0316982219610418


## Movie Rating Prediction using Bagging Ensemble with Multiple Base Regressors

1. **Data Loading and Preprocessing**: Load movie and ratings data from CSV files. Merge movie genres and ratings.

2. **Feature Engineering**: Extract features (movieId, title, genres) for prediction. Vectorize genres using TF-IDF.

3. **Train-Test Split**: Split the dataset into training and testing sets.

4. **Base Regressor Initialization**: Initialize five base regressor models (Random Forest, Linear Regression, Gradient Boosting, Support Vector, Ridge Regression).

5. **BaggingRegressor Initialization**: Initialize BaggingRegressor ensembles with each base regressor, using 10 estimators.

6. **Model Training**: Train each BaggingRegressor ensemble on the training data.

7. **Prediction**: Predict ratings for the test set using each BaggingRegressor ensemble.

8. **Ensemble Prediction**: Average the predictions from all BaggingRegressor ensembles.

9. **Model Evaluation**: Evaluate the ensemble model's performance using Mean Squared Error (MSE) on the test set.
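
Steps 5 to 8 are sketched below on synthetic data, to show the shape of the computation without the cost of fitting on the full ratings set. The arrays `X_toy` and `y_toy` are randomly generated and purely illustrative. Only two base regressors are used to keep the example short.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# One bagging ensemble per base regressor, 10 bootstrap estimators each
ensembles = [
    BaggingRegressor(LinearRegression(), n_estimators=10, random_state=42),
    BaggingRegressor(Ridge(), n_estimators=10, random_state=42),
]

# Fit on the first 50 rows, predict the last 10, and stack the results
per_model = np.stack(
    [m.fit(X_toy[:50], y_toy[:50]).predict(X_toy[50:]) for m in ensembles]
)

# Final prediction: unweighted mean across the ensembles
avg_pred = per_model.mean(axis=0)
print(avg_pred.shape)
```

Stacking the per-model predictions into one array and calling `mean(axis=0)` scales to any number of ensembles without writing out each term of the sum.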


In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import TfidfVectorizer

# Load data from CSV files
movies_df = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

# Merge movie genres into a single string
movies_df['genres'] = movies_df['genres'].fillna('')
movies_df['genres'] = movies_df['genres'].apply(lambda x: ' '.join(x.split('|')))

# Merge ratings with movies
data = pd.merge(ratings_df, movies_df, on='movieId')

# Feature Engineering
X = data[['movieId', 'title', 'genres']]  # Columns of the merged frame; only 'genres' is vectorized below
y = data['rating']

# TF-IDF Vectorization
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(X['genres'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Initialize base regressor models
base_regressor_1 = RandomForestRegressor(n_estimators=100, random_state=42)
base_regressor_2 = LinearRegression()
base_regressor_3 = GradientBoostingRegressor(random_state=42)
base_regressor_4 = SVR()
base_regressor_5 = Ridge()

# BaggingRegressor
bagged_regressor_1 = BaggingRegressor(base_regressor_1, n_estimators=10, random_state=42)
bagged_regressor_2 = BaggingRegressor(base_regressor_2, n_estimators=10, random_state=42)
bagged_regressor_3 = BaggingRegressor(base_regressor_3, n_estimators=10, random_state=42)
bagged_regressor_4 = BaggingRegressor(base_regressor_4, n_estimators=10, random_state=42)
bagged_regressor_5 = BaggingRegressor(base_regressor_5, n_estimators=10, random_state=42)

# Train the BaggingRegressors
bagged_regressor_1.fit(X_train, y_train)
bagged_regressor_2.fit(X_train, y_train)
bagged_regressor_3.fit(X_train, y_train)
bagged_regressor_4.fit(X_train, y_train)
bagged_regressor_5.fit(X_train, y_train)

# Predict ratings
y_pred_1 = bagged_regressor_1.predict(X_test)
y_pred_2 = bagged_regressor_2.predict(X_test)
y_pred_3 = bagged_regressor_3.predict(X_test)
y_pred_4 = bagged_regressor_4.predict(X_test)
y_pred_5 = bagged_regressor_5.predict(X_test)

# Average the predictions
y_pred_avg = (y_pred_1 + y_pred_2 + y_pred_3 + y_pred_4 + y_pred_5) / 5

# Evaluate the model
mse = mean_squared_error(y_test, y_pred_avg)
print("Mean Squared Error:", mse)


Mean Squared Error: 1.0043565952101163


In [None]:
# Initialize DataFrame to store predicted ratings
predictions = pd.DataFrame(columns=['movieId', 'title', 'predicted_rating'])

# Iterate over highly rated movies
bagged_regressors = [bagged_regressor_1, bagged_regressor_2, bagged_regressor_3,
                     bagged_regressor_4, bagged_regressor_5]
rows = []
for movie_id, title, genres in zip(highly_rated_movies['movieId'], highly_rated_movies['title'], highly_rated_movies['genres']):
    # Vectorize the movie's genres with the fitted TF-IDF vectorizer.
    # X_tfidf cannot be indexed by a movies_df row position: its rows follow
    # the merged ratings data, one row per rating.
    tfidf_vector = tfidf.transform([genres])

    # Predict rating as the mean over the bagged regressors
    predicted_rating_ensemble = sum(reg.predict(tfidf_vector)[0] for reg in bagged_regressors) / len(bagged_regressors)

    # Record the prediction
    rows.append({'movieId': movie_id, 'title': title, 'predicted_rating': predicted_rating_ensemble})

# Collect the predictions in the DataFrame
predictions = pd.DataFrame(rows, columns=['movieId', 'title', 'predicted_rating'])

# Sort predictions by predicted rating
predictions = predictions.sort_values(by='predicted_rating', ascending=False)

# Print top recommendations for user 1
print("Top Recommendations for User 1 (Bagged Ensemble Model):")
print(predictions[['title', 'predicted_rating']].head(10))


Top Recommendations for User 1 (Bagged Ensemble Model):
                                                 title  predicted_rating
179  Teenage Mutant Ninja Turtles II: The Secret of...          3.985773
177                                    Scream 3 (2000)          3.985773
190                             Blazing Saddles (1974)          3.985773
189                Man with the Golden Gun, The (1974)          3.985773
188                                   Road Trip (2000)          3.985773
187                                   Gladiator (2000)          3.985773
186                                    Predator (1987)          3.985773
185                                        Hook (1991)          3.985773
184                                   Ladyhawke (1985)          3.985773
183                              Grumpy Old Men (1993)          3.985773
