#Data Overview#
This dataset consists of several million 5-star ratings obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has long been used by industry and academic researchers to improve the performance of explicitly-based recommender systems, and now you get to as well!

For this Predict, we'll be using a special version of the MovieLens dataset which has been enriched with additional data and resampled for fair evaluation purposes.

#Source#
The data for the MovieLens dataset is maintained by the GroupLens research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from IMDB.

#Supplied Files#
1. genome_scores.csv - A score mapping the strength between movies and tag-related properties.
2. genome_tags.csv - User-assigned tags for genome-related scores.
3. imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
4. links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
5. sample_submission.csv - Sample of the submission format for the hackathon.
6. tags.csv - User-assigned tags for the movies within the dataset.
7. test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
8. train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.






In [None]:
import numpy as np # linear algebra
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

#!pip install surprise  # run once if the package is not installed yet
import surprise
from surprise import Reader, Dataset, SVD

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error

from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

##BASIC EDA##
Our data comes from Kaggle. Before the basic exploratory data analysis, we follow these steps to download the files.


In [None]:
#install kaggle
#!pip install -q kaggle

In [None]:
#upload kaggle.json file from local system
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle (1).json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [None]:
#create a kaggle folder (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle

#copy kaggle.json to folder created
! cp kaggle.json ~/.kaggle/

#restrict permissions on the credentials file
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
!kaggle competitions download -c alx-movie-recommendation-project-2024

alx-movie-recommendation-project-2024.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
#unzip files
!unzip alx-movie-recommendation-project-2024.zip


Archive:  alx-movie-recommendation-project-2024.zip
  inflating: genome_scores.csv       
  inflating: genome_tags.csv         
  inflating: imdb_data.csv           
  inflating: links.csv               
  inflating: movies.csv              
  inflating: sample_submission.csv   
  inflating: tags.csv                
  inflating: test.csv                
  inflating: train.csv               


For each of the tables, we carry out basic exploratory data analysis for data cleaning. We check for cells with null values, wrong data types, and duplicate rows. We drop rows with null values and duplicate rows from our dataframes.
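The checks described above can be wrapped in a small helper. This is only a sketch: `summarize_quality` and the toy frame are hypothetical names and data, not part of the competition files.

```python
import pandas as pd

def summarize_quality(df):
    """Return the row count, total null cells, and fully duplicated rows."""
    return {
        "rows": len(df),
        "null_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical toy frame standing in for one of the CSV files
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "tmdbId": [862.0, 862.0, None, 15602.0],
})

before = summarize_quality(toy)
# Drop rows containing nulls, then drop exact duplicate rows
cleaned = toy.dropna().drop_duplicates()
after = summarize_quality(cleaned)

print(before)
print(after)
```

Running the same helper before and after cleaning makes it easy to confirm that the null and duplicate counts drop to zero.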

In [None]:
g_score_df = pd.read_csv('genome_scores.csv')
g_score_df.head()
#g_score_df.info()
#g_score_df.duplicated().sum()

   movieId  tagId  relevance
0        1      1    0.02875
1        1      2    0.02375
2        1      3     0.0625
3        1      4    0.07575
4        1      5    0.14075


In [None]:
g_tag_df  = pd.read_csv('genome_tags.csv')
g_tag_df.head()
#g_tag_df.info()
#g_tag_df.duplicated().sum()

   tagId           tag
0      1           007
1      2  007 (series)
2      3  18th century
3      4         1920s
4      5         1930s


In [None]:
imdb_df = pd.read_csv('imdb_data.csv')
imdb_df.head()
#imdb_df.info()
imdb_df.duplicated().sum()

0

In [None]:
link_df = pd.read_csv('links.csv')
link_df.head()
#link_df.duplicated().sum()
#link_df.isnull().sum()
# Drop rows with missing IDs, then inspect the remaining non-null counts
link_df.dropna(inplace=True)
print(link_df.info())


<class 'pandas.core.frame.DataFrame'>
Index: 62316 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  62316 non-null  int64  
 1   imdbId   62316 non-null  int64  
 2   tmdbId   62316 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 1.9 MB
None


We cleaned the above dataframe of null values.

In [None]:
movie_df = pd.read_csv('movies.csv')
movie_df.head()
#movie_df.info()
#movie_df.duplicated().sum()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [None]:
sample_df = pd.read_csv('sample_submission.csv')
sample_df.head()
#sample_df.info()
#sample_df.duplicated().sum()

       Id  rating
0  1_2011     1.0
1  1_4144     1.0
2  1_5767     1.0
3  1_6711     1.0
4  1_7318     1.0


In [None]:
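tags_df = pd.read_csv('tags.csv')
tags_df.head()
#tags_df.info()
#tags_df.describe()
#tags_df.duplicated().sum()
tags_df.isnull().sum()
# dropna(inplace=True) modifies tags_df directly and returns None,
# so its result is not assigned to a variable
tags_df.dropna(inplace=True)
tags_df.isnull().sum()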


userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

Running *isnull* again after dropping the rows with null values shows that the 16 null entries have been *dropped*.

In [None]:
test_df = pd.read_csv('test.csv')
print(test_df.head())
print('Shape:',  test_df.shape)
#test_df.info()
#test_df.duplicated().sum()
#test_df.isnull().sum()

   userId  movieId
0       1     2011
1       1     4144
2       1     5767
3       1     6711
4       1     7318
Shape: (5000019, 2)


In [None]:
train_df = pd.read_csv('train.csv')
print(train_df.head())
print('Shape:', train_df.shape)
train_df.info()  # info() prints its summary directly and returns None
print('No. of duplicate rows:', train_df.duplicated().sum())
print('No. of null values:', train_df.isnull().sum().sum())

   userId  movieId  rating   timestamp
0    5163    57669     4.0  1518349992
1  106343        5     4.5  1206238739
2  146790     5459     5.0  1076215539
3  106362    32296     2.0  1423042565
4    9041      366     3.0   833375837
Shape: (10000038, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000038 entries, 0 to 10000037
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 305.2 MB
No. of duplicate rows: 0
No. of null values: 0


From the above, we notice that link_df has 107 null values while tags_df has 16 null values. There are no duplicate rows in our datasets. We may need to standardize our data for uniformity. We clean our data using *dropna* to drop rows that contain null values.

For our recommendation system, we will use movie_df to build a content-based system. We then use train_df to estimate how well the model's recommendations work.


#Content Based Filtering#

Content-based filtering is a recommendation approach that suggests movies by analyzing their features and matching them with the user’s preferences. It leverages data like genres, directors, actors, and plot descriptions to find similar movies to those the user has liked or interacted with.

For example, if a user enjoys action movies with a specific actor, the system will recommend other action movies featuring the same actor or with similar themes. This method ensures personalized recommendations tailored to each user’s tastes without relying on other users' preferences.
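As a rough illustration of the idea, the sketch below scores movies by the overlap of their genre lists. It is not the method used later in this notebook (which vectorizes titles with TF-IDF); the multi-hot genre encoding and the `most_similar` helper are assumptions made for this example, and the four titles are the first rows of movies.csv shown earlier.

```python
import numpy as np

# Mini-catalogue in the same "genres" format as movies.csv
movies = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
    "Waiting to Exhale (1995)": "Comedy|Drama|Romance",
}

titles = list(movies)
genres = sorted({g for s in movies.values() for g in s.split("|")})

# One row per movie, one column per genre (1.0 if the movie has that genre)
features = np.array(
    [[1.0 if g in movies[t].split("|") else 0.0 for g in genres] for t in titles]
)

# Cosine similarity: dot product of L2-normalised rows
unit = features / np.linalg.norm(features, axis=1, keepdims=True)
similarity = unit @ unit.T

def most_similar(title, k=2):
    """Return the k titles closest to `title`, excluding the title itself."""
    i = titles.index(title)
    order = np.argsort(-similarity[i])
    return [titles[j] for j in order if j != i][:k]

print(most_similar("Toy Story (1995)"))
```

Here Jumanji ranks first for Toy Story because all three of its genres are shared, while the two comedies share only one genre with it.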



In [None]:
movie_df.tail()
#movie_df.shape


#Split the Dataframe#
For ease of computation, owing to our large dataset, we split the data into 3 sets:
1. Training set - used to train the model.
2. Validation set - used to tune the hyperparameters of the model.
3. Test set - used to evaluate the performance of the model.

In [None]:
# Define the test size and validation size (the remaining 30% is used for training)
test_size = 0.35  # 35% of the data
val_size = 0.35   # 35% of the data

# Split the data into training+validation and test sets
train_val_df, testing_df = train_test_split(movie_df, test_size=test_size, random_state=42)

# Calculate the proportion of validation size from the train_val set
val_size_adjusted = val_size / (1 - test_size)

# Split the train_val set into training and validation sets
training_df, val_df = train_test_split(train_val_df, test_size=val_size_adjusted, random_state=42)


In [None]:
print(f'Training set size: {len(training_df)}')
print(f'Validation set size: {len(val_df)}')
print(f'Test set size: {len(testing_df)}')


Training set size: 18726
Validation set size: 21848
Test set size: 21849


Now we can use training_df to train the recommendation system, val_df to tune it, and testing_df to evaluate its performance.
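The set sizes printed above follow from simple arithmetic. The sketch below reproduces them under the assumption that `train_test_split` rounds the test share up and gives the remainder to the other side when only a float `test_size` is passed; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_rows, test_size, val_size):
    """Row counts for a two-step train/validation/test split.

    Assumes the held-out share is rounded up at each step and the
    remaining rows go to the other side of the split.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train_val = n_rows - n_test
    # Rescale so the validation share is relative to the remaining rows
    val_size_adjusted = val_size / (1 - test_size)
    n_val = math.ceil(val_size_adjusted * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

# 62423 is the sum of the three set sizes printed above
sizes = split_sizes(62423, test_size=0.35, val_size=0.35)
print(sizes)
```

The rescaling step matters: without dividing by `1 - test_size`, the validation set would be 35% of the remaining rows rather than 35% of the whole dataset.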

In [None]:
Vectorizer = TfidfVectorizer(stop_words = 'english')
tfidf_matrix = Vectorizer.fit_transform(training_df["title"])
tfidf_matrix.shape

(18726, 15920)

In [None]:
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
# Function that takes in a movie title as input and outputs the most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):

    # Get the index label of the movie that matches the title in movie_df.
    # Note: cosine_sim was built from training_df, so this lookup assumes that
    # row idx of cosine_sim refers to the same movie as position idx in movie_df.
    idx = movie_df.index[movie_df['title'] == title].tolist()[0]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 highest-scoring entries after the first one
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movie_df['title'].iloc[movie_indices]

In [None]:
movie_df['title'].unique()

array(['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)',
       ..., 'Bad Poems (2018)', 'A Girl Thing (2001)',
       "Women of Devil's Island (1962)"], dtype=object)

We test the recommender on one of the movies in our dataset.

In [None]:
get_recommendations('Toy Story (1995)', cosine_sim)

9765                         This Sporting Life (1963)
4810                                     Midway (1976)
2555                      House of Frankenstein (1944)
6376                       Orphans of the Storm (1921)
6431                                Garage Days (2002)
7646                         Sodom and Gomorrah (1962)
9324                               'Salem's Lot (2004)
10168                                     Black (2005)
11590    One Nite In Mongkok (Wong gok hak yau) (2004)
12365                         One Hour with You (1932)
Name: title, dtype: object
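Because the similarity matrix is computed from training_df, a shuffled subset of movie_df, its rows follow the order in which the titles were vectorized. One way to keep titles and rows aligned is to look titles up by position in that same ordered list. The sketch below shows the idea on made-up data; `recommend_by_position`, the title list, and the similarity values are all hypothetical.

```python
import numpy as np

def recommend_by_position(title, titles, similarity, k=2):
    """Top-k neighbours of `title`, where row i of `similarity` describes titles[i]."""
    if title not in titles:
        raise KeyError(f"{title!r} is not among the vectorized titles")
    pos = titles.index(title)
    # Sort candidate positions from most to least similar
    order = np.argsort(-similarity[pos])
    return [titles[j] for j in order if j != pos][:k]

# Hypothetical titles, listed in the (shuffled) order they were vectorized
titles = ["Jumanji (1995)", "Toy Story (1995)", "Toy Story 2 (1999)"]
# Hypothetical similarity scores; rows and columns follow `titles`
similarity = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.8],
    [0.0, 0.8, 1.0],
])

top = recommend_by_position("Toy Story (1995)", titles, similarity, k=1)
print(top)
```

With a real split, `titles` would be something like `training_df['title'].tolist()`, so that position `i` in the list and row `i` of the matrix always describe the same movie.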

#Model Evaluation#

The evaluation process assesses the performance of the recommendation system to ensure its effectiveness and accuracy. Metrics such as Precision, Recall, and Mean Squared Error (MSE) are commonly used to measure how well a system predicts user preferences; in this notebook we report Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

We wish to estimate the rating a user will give to a movie they have not previously watched.
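The cross-validation later in this section reports RMSE and MAE. As a minimal worked example of what those two numbers measure, the sketch below computes them by hand; the predicted ratings are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

actual = [4.0, 4.5, 5.0, 2.0]      # example true ratings
predicted = [3.5, 4.5, 4.0, 3.0]   # hypothetical model outputs

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(rmse_value, mae_value)
```

Both metrics are in the same units as the ratings, so an MAE of 0.625 means the predictions are off by a little over half a star on average.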

In [None]:
train_df.head()
train_df.shape

(10000038, 4)

In [None]:
# Define the test size and validation size
test_size = 0.55  # 55% of the data
val_size = 0.45   # 45% of the data (not used by the splits below)

# Split the data into training+validation and test sets
train_tdf, test_tdf = train_test_split(train_df, test_size=test_size, random_state=42)

# Split the train_val set into training and validation sets.
# val_size_adjusted is reused from the movie_df split above (0.35 / 0.65).
training_trdf, val_df = train_test_split(train_tdf, test_size=val_size_adjusted, random_state=42)


In [None]:
train_tdf.shape

(3600013, 4)

In [None]:
# MovieLens ratings run from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(train_tdf[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()


In [None]:
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f7aa8553520>

In [None]:
kf = KFold(n_splits=5)

# Get the indices for the train and test splits for each fold
for train_index, test_index in kf.split(train_tdf):
    train_fold = train_tdf.iloc[train_index]
    # Index into train_tdf (not train_df): the fold indices refer to its rows
    test_fold = train_tdf.iloc[test_index]
    # Use these folds for training and evaluating your model


In [None]:
svd = SVD()
evaluate = surprise.model_selection.cross_validate(svd, data, measures=['RMSE', 'MAE'], cv= 5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8916  0.8909  0.8897  0.8909  0.8904  0.8907  0.0006  
MAE (testset)     0.6827  0.6822  0.6808  0.6817  0.6814  0.6818  0.0006  
Fit time          100.89  103.92  103.50  101.19  98.47   101.59  1.97    
Test time         20.33   21.79   19.84   21.70   14.67   19.67   2.61    


In [None]:
# Generate predictions for the provided test file
submission_test_df = pd.read_csv('test.csv')

# Create an empty list to store the predictions
pred_list = []

# Loop through the test dataframe and predict ratings
for _, row in submission_test_df.iterrows():
    user_id = row['userId']
    movie_id = row['movieId']
    prediction = svd.predict(user_id, movie_id)
    pred_list.append([f"{user_id}_{movie_id}", prediction.est])

# Convert the predictions to a DataFrame
predictions_df = pd.DataFrame(pred_list, columns=['Id', 'rating'])

# Verify the number of rows
assert len(predictions_df) == 5000019, "The output file must have 5000019 rows."

# Print the first few rows of the predictions
print(predictions_df.head())

# Save the predictions to a CSV file
predictions_df.to_csv('submission.csv', index=False)

       Id    rating
0  1_2011  3.528263
1  1_4144  3.879183
2  1_5767  3.985523
3  1_6711  3.400355
4  1_7318  3.012564
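Looping over five million rows with `iterrows` is slow. A possible alternative, sketched below, builds the `Id` column with vectorized string operations and only loops for the predictions themselves. `build_submission` and `constant_predictor` are hypothetical names; in the notebook the predictor would be something like `lambda u, m: svd.predict(u, m).est`.

```python
import pandas as pd

def build_submission(test_df, predict):
    """Build the Id/rating frame; `predict` maps (userId, movieId) to a float."""
    # Vectorized construction of the "userId_movieId" identifier
    ids = test_df["userId"].astype(str) + "_" + test_df["movieId"].astype(str)
    ratings = [predict(u, m) for u, m in zip(test_df["userId"], test_df["movieId"])]
    return pd.DataFrame({"Id": ids, "rating": ratings})

# Hypothetical stand-in for a fitted model's prediction function
def constant_predictor(user_id, movie_id):
    return 3.5

toy_test = pd.DataFrame({"userId": [1, 1], "movieId": [2011, 4144]})
submission = build_submission(toy_test, constant_predictor)

# Check the row count against the input rather than a hard-coded number
assert len(submission) == len(toy_test)
print(submission)
```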


In [None]:
from google.colab import files
files.download('submission.csv')
