# Syed Hamza Ali

# Task 4: Recommendation System - Movie Recommendation
## This notebook outlines the concepts involved in building a complete recommendation system that recommends movies to users
## Movie Recommender System - A very simple clone of Netflix

This notebook uses the **MovieLens dataset** to build a model that **recommends movies** to end users.

### Note: Because of the size of the dataset and the limited RAM left to me on Google Colab, I commented out the collaborative filtering method, as it immediately crashes the Colab session. It would only run with small sample sizes (for example, when testing with 1,000 samples), probably because of the low memory on Google Colab.

### I also tried to use turicreate, but it gave me a lot of problems on Windows 10, so I was not able to test it for building the recommenders. I could therefore only use Surprise.

In [1]:
! pip install scikit-surprise



### Import the libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, accuracy, SVD
from surprise.model_selection import cross_validate, train_test_split

### Download the dataset

In [3]:
# Step 1: Download the dataset
# !wget http://files.grouplens.org/datasets/movielens/ml-20m.zip

In [4]:
# Step 2: Unzip the dataset
# ! unzip ml-20m.zip

### Load the dataset
### Reading the movies, ratings, and tags files
- movies.csv, ratings.csv, tags.csv

- The ml-20m CSV files include a header row, so pandas reads the column names directly from each file
- The following columns are available
    - movies: 'movieId', 'title', 'genres'
    - ratings: 'userId', 'movieId', 'rating', 'timestamp'
    - tags: 'userId', 'movieId', 'tag', 'timestamp'

In [5]:
# Load CSV files into DataFrames
movies = pd.read_csv('ml-20m/movies.csv')
ratings = pd.read_csv('ml-20m/ratings.csv')
tags = pd.read_csv('ml-20m/tags.csv')
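
Since memory was the limiting factor on Colab, one option is to read only the columns the recommender needs and to use narrower dtypes than the pandas defaults. The sketch below uses a small in-memory CSV as a stand-in for `ratings.csv`; the name `ratings_small` and the dtype choices are assumptions for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for ml-20m/ratings.csv (same header, sample rows)
csv_text = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
2,2,4.0,1112484819
"""

# Read only the columns that are needed, with 32-bit dtypes instead of the
# 64-bit defaults. For the real file, pass the path instead of StringIO.
ratings_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["userId", "movieId", "rating"],
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

print(ratings_small.dtypes)
print(ratings_small.memory_usage(deep=True).sum(), "bytes")
```

Dropping `timestamp` and halving the width of the remaining columns reduces the footprint of the ratings table considerably, which may leave more headroom for the later steps.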

### Display the data

In [6]:
# Display the data for each DataFrame
print("\nMovies Data:")
print(movies.head())

print("\nRatings Data:")
print(ratings.head())

print("\nTags Data:")
print(tags.head())


Movies Data:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

Ratings Data:
   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580

Tags Data:
   userId  movieId            tag   timestamp
0      18     4141    Mark Waters  1240597180
1      65      208      dark hero  136

# Merge tags by movieId

In [7]:
movie_tags = tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(map(str, x))).reset_index()
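
To make the grouping step concrete, here is a minimal sketch on a toy table with the same columns as `tags.csv`. The values are illustrative. Note one deliberate difference: `map(str, x)` in the cell above turns a missing tag into the literal string `nan`, so this variant drops missing tags first; that choice is an assumption, not what the notebook does.

```python
import pandas as pd

# Toy tags table with the same columns as tags.csv (values are illustrative)
toy_tags = pd.DataFrame({
    "userId": [18, 65, 65, 70],
    "movieId": [1, 1, 2, 2],
    "tag": ["Pixar", "animation", "board game", None],
})

# Drop missing tags, then join each movie's tags into one
# whitespace-separated document
toy_movie_tags = (
    toy_tags.dropna(subset=["tag"])
    .groupby("movieId")["tag"]
    .agg(" ".join)
    .reset_index()
)

print(toy_movie_tags)
```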

# Merging movies and movie_tags

In [8]:
movies = pd.merge(movies, movie_tags, on='movieId', how='left')

# Content Filtering method

In [9]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(movies['tag'].fillna(''))
svd = TruncatedSVD(n_components=200)
latent_matrix_1 = svd.fit_transform(tfidf_matrix)
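
The same pipeline can be tried on a handful of made-up tag documents to see what the latent space captures. This is a small sketch, assuming four illustrative documents and two components; the notebook itself uses 200 components on the real tags.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative tag documents, one per movie (not taken from the dataset)
docs = [
    "pixar animation toys friendship",
    "pixar animation insects",
    "war drama history",
    "war history soldiers",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Two components are enough to separate the two themes in this toy corpus
latent = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Similarity of the first movie to every movie, computed in one call
sims = cosine_similarity(latent[:1], latent)[0]
print(np.round(sims, 2))
```

The first document should score high against the second (shared Pixar and animation tags) and close to zero against the two war films, which is the behaviour the recommender relies on.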

# Collaborative Filtering method

In [11]:
# ratings_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating')
# ratings_matrix = ratings_matrix.fillna(0)
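
The commented-out `pivot_table` builds a dense user-by-movie table, and `fillna(0)` then stores a value for every pair, rated or not. With roughly 138,000 users and 27,000 movies in ml-20m, that is on the order of tens of gigabytes as float64, which would explain the crash. One possible alternative, sketched here on toy data, is a SciPy sparse matrix that stores only the observed ratings. The names `toy_ratings` and `ratings_sparse` are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings with the same columns as ratings.csv (values are illustrative)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "movieId": [2, 29, 2, 32, 29],
    "rating": [3.5, 3.5, 4.0, 5.0, 2.0],
})

# Map raw ids to contiguous row/column positions
user_codes, user_ids = pd.factorize(toy_ratings["userId"])
movie_codes, movie_ids = pd.factorize(toy_ratings["movieId"])

# Only observed ratings are stored; missing entries are implicit zeros
ratings_sparse = csr_matrix(
    (toy_ratings["rating"].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
)

print(ratings_sparse.shape, ratings_sparse.nnz)
```

`user_ids` and `movie_ids` keep the original ids, so a row or column position can be translated back to a `userId` or `movieId` when needed.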

# Matrix Factorization using SVD

In [12]:
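# MovieLens 20M ratings run from 0.5 to 5.0 in half-star steps, while
# Reader() defaults to a 1-5 scale, so the scale is set explicitly
svd = SVD()
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()
svd.fit(trainset)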

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7d27f8eda4d0>

# Hybrid Model

In [21]:
def hybrid_model(title, top_n=10, n_candidates=100):
    # Map each title to its row position in `movies`; keep the first row
    # when a title appears more than once so the lookup returns a scalar
    indices = pd.Series(movies.index, index=movies['title'])
    indices = indices[~indices.index.duplicated(keep='first')]
    idx = indices[title]

    # Content-based filtering: cosine similarity between the query movie and
    # every movie in the latent tag space, computed in one vectorized call
    cb_score = cosine_similarity(latent_matrix_1[idx:idx + 1], latent_matrix_1)[0]

    # Rank by similarity (stable sort keeps row order on ties) and drop the
    # query movie itself before keeping the closest candidates
    order = np.argsort(-cb_score, kind='stable')
    candidates = order[order != idx][:n_candidates]

    # Combine scores: only the content-based score is used here; the Surprise
    # SVD model fitted above is not blended into the ranking
    hybrid_scores = cb_score[candidates]
    best = candidates[np.argsort(-hybrid_scores, kind='stable')[:top_n]]

    return movies.iloc[best]
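
The function above ranks candidates by content similarity only, so the fitted Surprise model does not yet influence the result. One way the two signals could be combined is a weighted average, sketched below. `blend_scores` and `alpha` are hypothetical names, the input values are illustrative, and the predicted ratings are assumed to come from something like `svd.predict(user_id, movie_id).est` for a chosen user.

```python
import numpy as np

def blend_scores(content_sim, predicted_rating, alpha=0.5, rating_scale=(0.5, 5.0)):
    """Weighted blend of content similarity and a predicted rating (hypothetical helper)."""
    content_sim = np.asarray(content_sim, dtype=float)
    predicted_rating = np.asarray(predicted_rating, dtype=float)
    low, high = rating_scale
    # Rescale predicted ratings to [0, 1] so both signals are comparable
    rating_norm = (predicted_rating - low) / (high - low)
    return alpha * content_sim + (1 - alpha) * rating_norm

# Illustrative values for three candidate movies
content_sim = [0.9, 0.8, 0.4]
predicted_rating = [2.0, 4.5, 5.0]

scores = blend_scores(content_sim, predicted_rating, alpha=0.5)
ranking = np.argsort(-scores)  # candidate positions, best first
print(scores.round(3), ranking)
```

With `alpha=1.0` the blend reduces to the content-only ranking used in the notebook; lowering `alpha` gives more weight to the collaborative signal.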

# Build movie title to index mapping

In [16]:
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
movie_id_map = dict(zip(movies['title'], movies['movieId']))

# Sample recommendation

In [22]:
movie_title = 'Toy Story (1995)'  # Sample movie title
recommended_movies = hybrid_model(movie_title)
print(recommended_movies[['title', 'tag']])

                         title  \
3027        Toy Story 2 (1999)   
2270      Bug's Life, A (1998)   
11614       Ratatouille (2007)   
4790     Monsters, Inc. (2001)   
5121            Ice Age (2002)   
15401       Toy Story 3 (2010)   
6271       Finding Nemo (2003)   
8278   Incredibles, The (2004)   
13767                Up (2009)   
2209               Antz (1998)   

                                                     tag  
3027   animation humorous Pixar animation cute fancif...  
2270   Watched computer animation Disney animated fea...  
11614  Pixar's Formula Starting To Get Stale Watched ...  
4790   Billy Crystal Jennifer Tilly John Goodman Pixa...  
5121   Watched Carlos Saldanha Chris Wedge animated r...  
15401  tense Alive toys adventure animation bitterswe...  
6271   animation Watched Disney animated feature miss...  
8278   animation powers Watched alter ego death/fatal...  
13767  emotional friendship Watched Bechdel Test:Fail...  
2209   Pixar animation Pixar anti-w