# Movie Ratings Prediction Project

## Introduction
Welcome to the Movie Ratings Prediction Project. In this analysis, we aim to build a recommender system using collaborative filtering. The goal of our algorithm is to accurately forecast user ratings for movies they have not yet seen, leveraging their historical preferences.

A well-crafted recommender system can enhance user experience by providing personalized content, increasing user engagement and, consequently, driving revenue for streaming platforms. We evaluate the model by the Root Mean Square Error (RMSE) between the predicted and the actual movie ratings, which gives a quantifiable measure of the recommender system's performance.
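For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The snippet below is a minimal sketch in plain Python; the function name `rmse` and the sample values are illustrative assumptions, not part of the competition code or data.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one rating is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only (not taken from the dataset)
actual_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]
example_rmse = rmse(actual_ratings, predicted_ratings)
print(f"RMSE: {example_rmse:.4f}")
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.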

***Model Development***

## Data Exploration
We begin our analysis with a thorough exploration of the dataset, comprising user rating histories. We identify patterns, anomalies, and the distribution of user interactions.
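As a rough sketch of what such an exploration could look like, the example below summarizes a small hypothetical ratings table. The column names mirror those used later in the notebook (`userId`, `movieId`, `rating`), but the values in `toy_ratings` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy ratings; the real data would come from train.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 40, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.0, 4.0],
})

# How often each rating value occurs
rating_counts = toy_ratings["rating"].value_counts().sort_index()

# Number of ratings per user and per movie (activity and popularity)
ratings_per_user = toy_ratings.groupby("userId")["rating"].count()
ratings_per_movie = toy_ratings.groupby("movieId")["rating"].count()

# Sparsity: share of the user-movie matrix that has no rating
n_users = toy_ratings["userId"].nunique()
n_movies = toy_ratings["movieId"].nunique()
sparsity = 1 - len(toy_ratings) / (n_users * n_movies)

print(rating_counts)
print(ratings_per_user)
print(f"Sparsity: {sparsity:.2%}")
```

The same calls should apply unchanged to `train_df` once it is loaded, assuming it has these three columns.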

In [None]:
import pandas as pd

**Load the dataset**

## Preprocessing
Data preprocessing steps ensure the quality of the dataset. These include handling missing values, normalizing data, and, where useful, feature engineering.
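The notebook does not spell out the exact preprocessing steps, so the following is only one possible sketch: it drops rows with missing ratings, removes exact duplicates, and mean-centers ratings per user as a simple form of normalization. The `raw` table and the `rating_centered` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with a missing value and a duplicate row
raw = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 20, 10, 30, 40],
    "rating":  [4.0, 2.0, 2.0, 5.0, np.nan, 3.0],
})

# Drop rows without a rating, then drop exact duplicate rows
clean = raw.dropna(subset=["rating"]).drop_duplicates().reset_index(drop=True)

# Mean-center each user's ratings so generous and strict raters are comparable
user_mean = clean.groupby("userId")["rating"].transform("mean")
clean["rating_centered"] = clean["rating"] - user_mean

print(clean)
```

Surprise's SVD learns user and item biases itself, so the centering step is optional there; it is shown only to illustrate what normalizing ratings can mean.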

In [None]:
genome_scores_df = pd.read_csv("/kaggle/input/ea-movie-recommendation-predict-2023-2024/genome_scores.csv")


In [None]:
genore_tags_df = pd.read_csv("/kaggle/input/ea-movie-recommendation-predict-2023-2024/genome_tags.csv")

In [None]:
train_df = pd.read_csv("/kaggle/input/ea-movie-recommendation-predict-2023-2024/train.csv")

In [None]:
test_df = pd.read_csv("/kaggle/input/ea-movie-recommendation-predict-2023-2024/test.csv")

In [None]:
movies_df = pd.read_csv("/kaggle/input/ea-movie-recommendation-predict-2023-2024/movies.csv")

In [None]:
tags_df = pd.read_csv("/kaggle/input/ea-movie-recommendation-predict-2023-2024/tags.csv")

In [None]:
links_df = pd.read_csv("/kaggle/input/ea-movie-recommendation-predict-2023-2024/links.csv")

In [None]:
My_submission_df = pd.read_csv("/kaggle/input/my-submission/my_submission.csv")

**Display the first few rows of each dataset to understand its structure**

In [None]:

print(links_df.head())

In [None]:

print(tags_df.head())

In [None]:

print(movies_df.head())

In [None]:

print(test_df.head())

In [None]:

print(train_df.head())

In [None]:

print(genome_scores_df.head())

In [None]:

print(genore_tags_df.head())


## Model Evaluation
Our model evaluation compares actual ratings with the SVD model's predictions. We split the data into training and test sets to validate the model's predictive power on ratings it has not seen.


1. Model Training:
We set up a collaborative filtering model with Surprise's SVD and fit it on the training ratings.

2. Predicting Ratings:
We use the trained model to make predictions on the user-movie pairs provided in the test set.

3. Preparing Submission File:
We write the predictions to a submission file in the format the competition requires.
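To make the prediction step more concrete: Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and by default clips the result to the rating scale. The sketch below reproduces that rule with NumPy; the function `predict_rating` and all parameter values are hypothetical and are not read from a trained model.

```python
import numpy as np

def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(0.5, 5.0)):
    """Biased matrix-factorization estimate mu + b_u + b_i + q_i . p_u, clipped to the scale."""
    estimate = global_mean + user_bias + item_bias + float(np.dot(item_factors, user_factors))
    low, high = rating_scale
    return min(high, max(low, estimate))

# Hypothetical learned parameters for one user and one movie
mu = 3.5                      # global mean rating
b_u = 0.2                     # this user rates slightly above average
b_i = -0.1                    # this movie is rated slightly below average
p_u = np.array([0.5, -0.2])   # user latent factors
q_i = np.array([0.4, 0.5])    # item latent factors

example_estimate = predict_rating(mu, b_u, b_i, p_u, q_i)
print(round(example_estimate, 2))
```

For a user or movie that never appears in the training data, the corresponding bias and factors are treated as zero, so the estimate falls back toward the global mean.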

In [None]:
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Now, set up the reader for Surprise, which specifies the scale of your rating scores
reader = Reader(rating_scale=(0.5, 5.0))

# Load the data from the DataFrame into the Surprise data structure
data = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']], reader)

# Instantiate the SVD algorithm from Surprise
algo = SVD()

# Train on the full dataset
trainset = data.build_full_trainset()
algo.fit(trainset)

# Load the test set
test_df = pd.read_csv("/kaggle/input/ea-movie-recommendation-predict-2023-2024/test.csv")

# Prepare the testset in the format that Surprise requires: a list of tuples (userId, movieId, rating)
# Since we don't have the ratings for the test set, we fill in a dummy value (the global mean)
# zip over the two columns avoids the per-row overhead of iterrows() on a large test set
testset = [(uid, iid, trainset.global_mean) for uid, iid in zip(test_df['userId'], test_df['movieId'])]

# Obtain the predictions
predictions = algo.test(testset)

# Assuming predictions is a list of Prediction objects from the Surprise library...
submission = pd.DataFrame(predictions, columns=['uid', 'iid', 'r_ui', 'est', 'details'])

# Construct the 'Id' according to the competition's specifications
submission['Id'] = submission['uid'].astype(str) + "_" + submission['iid'].astype(str)

# Select the predictions (estimates) and the 'Id' for the submission file
submission = submission[['Id', 'est']]

# Rename the 'est' column to 'rating'
submission.rename(columns={'est': 'rating'}, inplace=True)

# Viewing the first few entries of the submission file
print(submission.head())

# Save the DataFrame to a CSV file, formatted as per the submission requirements
submission.to_csv('my_submission.csv', index=False)

**Load the submission data**

In [None]:
My_submission_df = pd.read_csv("/kaggle/input/my-submission/my_submission.csv")

In [None]:

print(My_submission_df.head())

**Visualization**

The findings are visualised below:

In [None]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Provided 'Id' and 'rating' data
id_data = [
    # ... (other 'Id' strings were here)
    '3_858', '3_1198', '3_1201'
]
rating_data = [
    # ... (other ratings matched with 'Id' strings were here)
    3.9355, 4.5145, 4.3727
]

# Create a DataFrame from the data
submission_df = pd.DataFrame({
    'Id': id_data,
    'rating': rating_data
})

# Fit K-means clustering on the 'rating' column
# Reshaping 'rating' column to (-1, 1) because it's a single feature
kmeans = KMeans(n_clusters=3, random_state=42)
submission_df['cluster'] = kmeans.fit_predict(submission_df[['rating']].values.reshape(-1, 1))

# Visualize the clusters using a scatter plot
# Using index of the dataframe as a proxy for x-axis
plt.figure(figsize=(12, 8))
plt.scatter(submission_df.index, submission_df['rating'], c=submission_df['cluster'], cmap='viridis')
plt.title('K-Means Clustering of Ratings')
plt.xlabel('Index (as a proxy for Id)')
plt.xticks(ticks=submission_df.index, labels=submission_df['Id'], rotation=90)  # Set x-ticks as 'Id' labels
plt.ylabel('Rating')
plt.colorbar(label='Cluster Label')
plt.show()




In [None]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Provided 'Id' and 'rating' data
id_data = [
    '1_2011', '1_4144', '1_5767', '1_6711', '1_7318', '1_8405', '1_8786',
    '2_150', '2_356', '2_497', '2_588', '2_653', '2_1080', '2_1196',
    # ... (other 'Id' strings were here)
    '3_858', '3_1198', '3_1201'
]
rating_data = [
    3.7213, 4.3325, 3.9026, 4.0267, 2.6756, 3.9144, 4.1232,
    3.7162, 3.6696, 3.4718, 3.4780, 3.2578, 3.9144, 4.5850,
    # ... (other ratings matched with 'Id' strings were here)
    3.9355, 4.5145, 4.3727
]

# Create a DataFrame from the data
submission_df = pd.DataFrame({
    'Id': id_data,
    'rating': rating_data
})

# Fit K-means clustering on the 'rating' column
# The double brackets keep 'rating' as a 2-D (n_samples, 1) input, so no reshape is needed
kmeans = KMeans(n_clusters=3, random_state=42)
submission_df['cluster'] = kmeans.fit_predict(submission_df[['rating']])

# Visualize the clusters using a scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(submission_df['Id'], submission_df['rating'], c=submission_df['cluster'], cmap='viridis')
plt.title('K-Means Clustering of Ratings')
plt.xlabel('Id')
plt.ylabel('Rating')
plt.xticks(rotation=90)  # Rotate x-axis labels to make them readable
plt.show()

**RMSE**

In [None]:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Load your training dataset
# train_df = pd.read_csv('path_to_your_training_dataset.csv')

# Create the 'reader' object to interpret the data
reader = Reader(rating_scale=(0.5, 5.0))

# Load the dataset into Surprise's format
data = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']], reader)

# Instantiate the SVD algorithm
algo = SVD(verbose=True)

# Split the dataset for evaluation
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

# Here we fit the algorithm on the trainset and evaluate it
algo.fit(trainset)

# Make predictions on the testset
predictions = algo.test(testset)

# Calculate RMSE on the test set
rmse = accuracy.rmse(predictions)

The code above fits the SVD algorithm on the training subset and then evaluates the predictions on the test subset, using RMSE as the accuracy measure.

After obtaining predictions, we visualize the performance of the recommendation system by comparing the predicted ratings against the actual ratings from the test set and by examining the errors behind the RMSE.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Convert predictions to a DataFrame
preds = pd.DataFrame([(pred.uid, pred.iid, pred.r_ui, pred.est) for pred in predictions], 
                     columns=['UserID', 'MovieID', 'Actual', 'Predicted'])

# Scatter Plot
plt.figure(figsize=(10, 6))
plt.scatter(preds['Actual'], preds['Predicted'], alpha=0.3)
plt.plot([0, 5], [0, 5], 'r-', linewidth=2)  # plot a thin red line for 1-1 correlation
plt.title('Actual vs Predicted Ratings')
plt.xlabel('Actual Rating')
plt.ylabel('Predicted Rating')
plt.grid(True)
plt.show()

# Distribution of Residuals (Errors)
residuals = preds['Actual'] - preds['Predicted']

plt.figure(figsize=(10, 6))
plt.hist(residuals, bins=30, edgecolor='black')
plt.title('Distribution of Residuals')
plt.xlabel('Actual - Predicted Rating')
plt.ylabel('Frequency')
plt.show()

# RMSE Calculation for Test Set
rmse = np.sqrt(np.mean(residuals**2))
print(f'Test RMSE: {rmse:.2f}')

In these plots:

Scatter Plot: It shows the actual ratings compared to the predicted ratings. If the predictions were perfect, all points would fall on the red line where the actual rating equals the predicted rating.

Histogram of Residuals: It displays the distribution of errors (residuals). If the predictions are unbiased, the histogram should be centered around zero, and ideally, have a normal distribution.

Remember that visualizations are not just for presentation but for insight as well. They can show the nature of the errors your model makes and guide you on what to focus on to improve the model. For instance, because the residual is the actual rating minus the predicted rating, a long tail on the left side of the histogram (large negative residuals) suggests that the system sometimes overestimates the actual ratings by a wide margin.
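The same kind of insight can also be read off numerically. The sketch below, which uses a hypothetical `toy_preds` table in place of the notebook's `preds` DataFrame, computes the mean residual overall and per actual rating value.

```python
import pandas as pd

# Hypothetical actual/predicted pairs; in the notebook these would come from `preds`
toy_preds = pd.DataFrame({
    "Actual":    [1.0, 1.0, 3.0, 3.0, 5.0, 5.0],
    "Predicted": [2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
})
toy_preds["Residual"] = toy_preds["Actual"] - toy_preds["Predicted"]

# Overall bias: a negative mean residual means the model overestimates on average
mean_residual = toy_preds["Residual"].mean()

# Bias per actual rating value: shows where the errors concentrate
residual_by_actual = toy_preds.groupby("Actual")["Residual"].mean()

print(f"Mean residual: {mean_residual:.3f}")
print(residual_by_actual)
```

In this invented example the low ratings are overestimated and the high ratings are underestimated, a shrinkage toward the mean that is worth checking for in the real predictions.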


# Conclusion

## Summary of Findings
In summary, our recommender system, built on Singular Value Decomposition (SVD), predicts user ratings for movies, and its RMSE on the held-out test split quantifies the model's predictive accuracy and gives a benchmark for future enhancements. The visualizations show the distribution of residuals and the relationship between actual and predicted ratings, which indicates where the model is effective and where it can improve.

## Economic Significance
The economic implications of deploying a functional recommender system are far-reaching. The predictive power that our system holds can facilitate a transformative shift for streaming platforms, paving the way for higher user retention, increased user engagement, and an uptick in revenue through more personalized content delivery. As such, our model's deployment can serve as a strategic tool in curating user-specific content, thereby bolstering platform affinity and fostering a sophisticated content discovery experience.

## Future Work and Improvements
While the current model performs admirably, in the realm of recommendation engines, there is always room for refinement. Further tuning of hyperparameters, exploring alternative collaborative filtering algorithms, or integrating content-based methods may yield even more precise recommendations. Additional feature engineering, such as incorporating temporal dynamics or social networking effects, might also enhance the system's accuracy. Furthermore, scalability concerns and cold start problems associated with new users or items warrant strategic solutions as we envision our model's application at scale.

## Final Remarks
We are optimistic that with continual innovation and adaptation to user dynamics, recommendation systems like the one developed in this project will remain instrumental in shaping the vibrant landscape of digital media consumption. Our work is a stepping stone towards creating more intelligent, adaptive, and context-aware systems that could redefine the industry standards for content recommendation.

In [None]:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=['a', 'b', 'c', 'd', 'e'])
df.to_csv('/kaggle/working/df.csv',index=False)