### MOVIE RECOMMENDATION SYSTEM

In today’s technology-driven world, recommender systems play an important social and economic role in helping people choose the content they engage with every day. Movie recommendation is a clear example: intelligent algorithms help viewers find titles they will enjoy among tens of thousands of options.

Ever wondered how Netflix, Amazon Prime, Showmax, Disney and the like somehow know what to recommend to you?
It's not just a guess drawn out of a hat. There is an algorithm behind it.

With this context, we are challenging you to construct a recommendation algorithm, based on content or collaborative filtering, that accurately predicts how a user will rate a movie they have not yet viewed, given their historical preferences.

What value is achieved through building a functional recommender system?
An accurate and robust solution to this challenge has considerable economic potential: users are exposed to content they would like to view or purchase, which generates revenue and platform affinity.

## DATA

Supplied Files
genome_scores.csv - Relevance scores measuring how strongly each tag-related property applies to each movie.

genome_tags.csv - User-assigned tags for the genome-related scores.

imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.

links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.

sample_submission.csv - Sample of the submission format for the hackathon.

tags.csv - User-assigned tags for the movies within the dataset.

test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.

train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.

In [None]:
import pandas as pd  # needed here because this cell runs before the main import cell

ratings_df = pd.read_csv('/kaggle/input/alx-movie-recommendation-project-2024/train.csv')
movies_df = pd.read_csv('/kaggle/input/alx-movie-recommendation-project-2024/movies.csv')
imdb_df = pd.read_csv('/kaggle/input/alx-movie-recommendation-project-2024/imdb_data.csv')
test_df = pd.read_csv('/kaggle/input/alx-movie-recommendation-project-2024/test.csv')
links_df = pd.read_csv('/kaggle/input/alx-movie-recommendation-project-2024/links.csv')
tags = pd.read_csv('/kaggle/input/alx-movie-recommendation-project-2024/tags.csv')
genome_scores = pd.read_csv('/kaggle/input/alx-movie-recommendation-project-2024/genome_scores.csv')
genome_tags = pd.read_csv('/kaggle/input/alx-movie-recommendation-project-2024/genome_tags.csv')


In [None]:
# Install packages here
# Packages for data processing
import numpy as np
import pandas as pd
import datetime
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from scipy.sparse import csr_matrix
import scipy as sp


# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Packages for modeling
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans
from surprise import KNNBasic
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
import heapq

# Packages for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from time import time

# Package to suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Packages for saving models
import pickle

In [None]:
movies_df.info()

In [None]:
movies_df.describe()

In [None]:
# Report the number of missing values per column for each dataset
datasets = {
    "Train": ratings_df,
    "Test": test_df,
    "Movies": movies_df,
    "Links": links_df,
    "IMDB": imdb_df,
    "Genome scores": genome_scores,
    "Genome tags": genome_tags,
}
for i, (name, df) in enumerate(datasets.items()):
    if i:
        print("************")
    print(f"{name}: ")
    print(df.isnull().sum())

In [None]:
# Create dataframe containing only the movieId and genres
movies_genres = movies_df[['movieId', 'genres']].copy()

# Split genres separated by "|" into a list of the genres allocated to each movie
movies_genres['genres'] = movies_genres['genres'].apply(lambda x: x.split('|'))

# Create expanded dataframe where each movie-genre combination is in a separate row
movies_genres = pd.DataFrame([(tup.movieId, d) for tup in movies_genres.itertuples() for d in tup.genres],
                             columns=['movieId', 'genres'])

movies_genres.head()
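
The same one-row-per-genre expansion can also be written with pandas' built-in `explode`. The sketch below is only an illustration on a made-up two-movie frame (`toy_movies` is hypothetical, not part of the supplied data), assuming the `genres` column is a pipe-delimited string as in `movies.csv`.

```python
import pandas as pd

# Toy stand-in for movies_df; the IDs and genres are made up for illustration.
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "genres": ["Adventure|Comedy", "Drama"],
})

# Split the pipe-delimited string into a list, then emit one row per list element.
toy_genres = (
    toy_movies.assign(genres=toy_movies["genres"].str.split("|"))
    .explode("genres")
    .reset_index(drop=True)
)
print(toy_genres)
```

Each movie now appears once per genre, which is the shape `sns.countplot` needs to count genres.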

In [None]:
plot = plt.figure(figsize=(15, 10))
plt.title('Most common genres\n', fontsize=20)
sns.countplot(y="genres", data=movies_genres,
              order=movies_genres['genres'].value_counts(ascending=False).index,
              palette='Reds_r')
plt.show()

### ***Stage 1: Data Preparation***

In [None]:
# Sample data for efficiency (use 10%)
ratings_df = ratings_df.sample(frac=0.1, random_state=42)

# Merge genome_scores with genome_tags to get tag names
genome_scores = genome_scores.merge(genome_tags, on='tagId')

# Create a pivot table where rows are movies and columns are tags, values are relevance scores
movie_tag_matrix = genome_scores.pivot_table(index='movieId', columns='tag', values='relevance', fill_value=0)

# Normalize the movie_tag_matrix
scaler = StandardScaler()
movie_tag_matrix_scaled = scaler.fit_transform(movie_tag_matrix)


### ***ANALYSIS 1 (PCA)***

In [None]:
from sklearn.decomposition import PCA
# Apply PCA
pca = PCA(n_components=50)
movie_tag_matrix_pca = pca.fit_transform(movie_tag_matrix_scaled)

# Create a DataFrame for the PCA components
movie_tag_pca_df = pd.DataFrame(movie_tag_matrix_pca, index=movie_tag_matrix.index)

# Merge genres with ratings
movies_df = movies_df[['movieId', 'genres']]
ratings_with_genres = ratings_df.merge(movies_df, on='movieId')
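
The reduced tag features are not used again in this notebook. As a hedged sketch of how they could feed a content-based recommender, the example below runs the same scale-then-PCA steps on a small synthetic matrix (random numbers standing in for `movie_tag_matrix`), checks how much variance the kept components retain, and finds each movie's nearest neighbour by cosine similarity. The matrix size and the choice of 3 components are assumptions made only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for movie_tag_matrix: 6 movies x 8 tag relevance scores in [0, 1).
toy_matrix = rng.random((6, 8))

# Same pipeline as above: standardize each tag column, then project with PCA.
scaled = StandardScaler().fit_transform(toy_matrix)
pca = PCA(n_components=3)
components = pca.fit_transform(scaled)

# Fraction of the total variance retained by the kept components.
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 3 components: {retained:.2%}")

# Pairwise cosine similarity between movies in the reduced space.
similarity = cosine_similarity(components)
np.fill_diagonal(similarity, -np.inf)  # exclude each movie's match with itself
nearest = similarity.argmax(axis=1)
print("Most similar movie (row index) for each movie:", nearest)
```

On the real data, `pca.explained_variance_ratio_.sum()` is a quick way to judge whether 50 components keep enough of the tag information.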


### ***ANALYSIS 2 (HYPERPARAMETER TUNING)***

In [None]:
from surprise.model_selection import train_test_split, GridSearchCV
# Define the Reader and Dataset
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings_with_genres[['userId', 'movieId', 'rating']], reader)

# Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define SVD model
svd = SVD()

# Hyperparameter tuning with Grid Search
param_grid = {
    'n_epochs': [20, 30],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.6]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
gs.fit(data)

# Best SVD model
best_svd = gs.best_estimator['rmse']
print(f"Best RMSE: {gs.best_score['rmse']} with parameters: {gs.best_params['rmse']}")
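
The grid above tunes the number of SGD epochs (`n_epochs`), the learning rate (`lr_all`) and the regularization strength (`reg_all`) of Surprise's `SVD`, a biased matrix-factorization model. As a rough illustration of what the fitted model computes, the sketch below forms one estimate as global mean + user bias + item bias + dot product of the latent factors, clipped to the rating scale. All parameter values are made up for the example, and `predict` is a hypothetical helper, not a Surprise function.

```python
import numpy as np

# Hypothetical learned parameters for one user and one movie (values made up).
global_mean = 3.5
user_bias = 0.2
item_bias = -0.1
user_factors = np.array([0.3, -0.5, 0.1])
item_factors = np.array([0.4, 0.2, -0.6])

def predict(mu, b_u, b_i, p_u, q_i, low=0.5, high=5.0):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return min(high, max(low, raw))

estimate = predict(global_mean, user_bias, item_bias, user_factors, item_factors)
print(round(estimate, 2))  # 3.5 + 0.2 - 0.1 - 0.04 = 3.56
```

A larger `reg_all` shrinks the biases and factors toward zero, pulling estimates toward the global mean.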

In [None]:
from surprise import accuracy

# Estimate generalization error: fit on the 80% split and score the held-out 20%.
# test.csv has no ratings, so RMSE cannot be computed on it directly.
best_svd.fit(trainset)
holdout_rmse = accuracy.rmse(best_svd.test(testset))

# Retrain the best model on all sampled ratings before predicting
full_trainset = data.build_full_trainset()
best_svd.fit(full_trainset)

# Function to predict rating for a user-movie pair
def predict_rating(user_id, movie_id):
    prediction = best_svd.predict(user_id, movie_id)
    return prediction.est

# Generate predictions for test set
test_df['rating'] = [predict_rating(u, m) for u, m in zip(test_df['userId'], test_df['movieId'])]
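
Because `test.csv` carries no ratings, an RMSE can only be computed where the true ratings are known, for example on a split held out from `train.csv`. The sketch below shows the metric itself on four made-up rating pairs; `rmse` is a small helper written for this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up ratings: errors are 0.5, 0, 1 and -1, so the mean squared error is 0.5625.
true = [4.0, 3.0, 5.0, 2.0]
pred = [3.5, 3.0, 4.0, 3.0]
print(rmse(true, pred))  # 0.75
```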


In [None]:
# Prepare submission DataFrame
submission = test_df[['userId', 'movieId', 'rating']].copy()  # copy avoids SettingWithCopyWarning
submission['Id'] = submission['userId'].astype(int).astype(str) + '_' + submission['movieId'].astype(int).astype(str)
submission = submission[['Id', 'rating']]

# Save submission to CSV
submission.to_csv('movie_recommender.csv', index=False)
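
Before uploading, it can help to sanity-check the file. The sketch below builds the `Id` column on a made-up three-row frame (`toy_test` is hypothetical) and checks two properties assumed from this notebook: `Id` is `userId_movieId` and unique per row, and every rating lies in the 0.5 to 5.0 range given to the `Reader`. The authoritative format is whatever `sample_submission.csv` shows.

```python
import pandas as pd

# Made-up predictions standing in for test_df after the rating column is filled.
toy_test = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [3.5, 4.0, 2.5],
})

# Build the Id as "<userId>_<movieId>" with vectorized string concatenation.
toy_submission = pd.DataFrame({
    "Id": toy_test["userId"].astype(str) + "_" + toy_test["movieId"].astype(str),
    "rating": toy_test["rating"],
})

# Each user-movie pair should appear once, and ratings should stay on the scale.
ids_unique = toy_submission["Id"].is_unique
ratings_in_range = toy_submission["rating"].between(0.5, 5.0).all()
print(toy_submission)
print("Ids unique:", ids_unique, "| ratings in range:", ratings_in_range)
```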