## **Certification Project for Data Science and Machine Learning Internship Program**

**Context:** 
Over the past two decades, there has been a monumental shift in how people access and consume video content. With universal access to broadband internet, numerous platforms such as YouTube, Netflix, and HBO Go emerged and steadily grew to prominence.

**Business Requirement:** “MyNextMovie” is a budding startup in the recommendation space. It works on top of various OTT (Over The Top) platforms, suggesting to its customers which movie to watch next. Its core business is to build a recommendation layer on top of these OTT platforms so that it can make suitable recommendations to its customers. Because the company is still in research mode, it wants to experiment with open-source data first to understand the depth of the models it can deliver.

### **Objective:**
1. Create a popularity-based recommender system at a genre level. 

    The user inputs a genre (g), a minimum review-count threshold (t) for a movie, and the number of recommendations (N). The system should return the top N most popular movies within genre (g), ordered by rating in descending order, where each movie has at least (t) reviews.

2. Create a content-based recommender system that recommends top N movies whose genres are similar to those of a given movie (m).

3. Create a collaborative recommender system that recommends top N movies based on “K” similar users for a target user “u”.

### **Data Description:** 
The data consists of 105339 ratings applied over 10329 movies. The average rating is about 3.5, and the minimum and maximum ratings are 0.5 and 5 respectively. There are 668 users who have given their ratings, and movie IDs range up to 149532.

There are two data files which are provided:

Movies.csv

● movieId: ID assigned to a movie
 
● title: Title of a movie

● genres: pipe-separated list of movie genres

Ratings.csv

● userId: ID assigned to a user

● movieId: ID assigned to a movie

● rating: rating by a user to a movie

● timestamp: time at which the rating was provided.


### **Steps and Tasks**

● Import libraries and load dataset

● Exploratory Data Analysis including:

    o Understanding of distribution of the features available

    o Finding unique users and movies

    o Average rating and Total movies at genre level.

    o Unique genres considered.

● Design the 3 different types of recommendation modules as mentioned in the objectives

● Additional/Optional: 

Create a GUI interface using Python libraries (ipywidgets etc.) to play around with the recommendation modules

### **Step 1:** *Import Libraries and Load Dataset*

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

In [2]:
# Load the datasets
movies = pd.read_csv('certif_project_dataset\\dataset\\movies.csv')
ratings = pd.read_csv('certif_project_dataset\\dataset\\ratings.csv')
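
The doubled backslashes make these paths Windows-specific. As a hedged alternative, the sketch below builds the same relative location with `pathlib`, which picks the right separator for the host OS. The constant `DATA_DIR` and the helper `load_datasets` are illustrative names, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Same relative folder as in the cell above, without hard-coded separators.
DATA_DIR = Path('certif_project_dataset') / 'dataset'


def load_datasets(data_dir=DATA_DIR):
    """Hypothetical helper: read movies.csv and ratings.csv from data_dir."""
    movies = pd.read_csv(data_dir / 'movies.csv')
    ratings = pd.read_csv(data_dir / 'ratings.csv')
    return movies, ratings
```

Calling `movies, ratings = load_datasets()` would then replace the two `read_csv` lines, assuming the files sit in that folder relative to the notebook.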

### **Step 2:** *Exploratory Data Analysis (EDA)*

Understanding the distribution of features and basic statistics.

*Understanding the distribution of the features*

*Check the first few rows of each dataset*

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


*Check for missing values*

In [5]:
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [6]:
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

*Check for duplicate values*

In [7]:
movies[movies['title'].duplicated()]

Unnamed: 0,movieId,title,genres
6270,26982,Men with Guns (1997),Drama
7963,64997,War of the Worlds (2005),Action|Sci-Fi


In [8]:
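# Removes only rows identical in every column; the repeated titles above have distinct movieIds and remain.
movies.drop_duplicates(inplace=True)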

*Summary statistics*

In [9]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,105339.0,105339.0,105339.0,105339.0
mean,364.924539,13381.312477,3.51685,1130424000.0
std,197.486905,26170.456869,1.044872,180266000.0
min,1.0,1.0,0.5,828565000.0
25%,192.0,1073.0,3.0,971100800.0
50%,383.0,2497.0,3.5,1115154000.0
75%,557.0,5991.0,4.0,1275496000.0
max,668.0,149532.0,5.0,1452405000.0


Check first few rows: Displays the first few rows of the datasets to understand their structure.

Check for missing values: Identifies any missing values in the datasets.

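Check for duplicate values: Identifies movies whose title appears more than once, then drops rows that are exact duplicates across all columns. Because the two repeated titles have different movieId values, they are not removed by this step.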

Summary statistics for ratings: Provides summary statistics (mean, std, min, max) for the ratings DataFrame.
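
To make the duplicate check concrete, here is a minimal sketch on a toy DataFrame. Only the repeated title is taken from the output above; the movieId values and genres in the toy rows are made up for illustration. It contrasts `drop_duplicates()` with and without `subset`.

```python
import pandas as pd

# Toy frame: movieIds and genres are invented; only the repeated title is real.
toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Men with Guns (1997)', 'Jumanji (1995)', 'Men with Guns (1997)'],
    'genres': ['Action|Drama', 'Adventure|Children|Fantasy', 'Drama'],
})

# keep=False flags every copy of a repeated title, not just the later ones.
repeated = toy[toy['title'].duplicated(keep=False)]

# No subset: rows must match in every column, so nothing is removed here.
full_row_dedup = toy.drop_duplicates()

# subset='title': keeps the first row for each title and drops the rest.
title_dedup = toy.drop_duplicates(subset='title')

print(len(repeated), len(full_row_dedup), len(title_dedup))
```

Whether repeated titles should be dropped at all is a modelling choice: two rows with the same title and different IDs may be distinct entries, so inspecting them with `keep=False` first seems the safer habit.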

*Finding unique users and movies*

In [10]:
num_unique_users = ratings['userId'].nunique()
num_unique_movies = ratings['movieId'].nunique()

print(f'Number of unique users: {num_unique_users}')
print(f'Number of unique movies: {num_unique_movies}')


Number of unique users: 668
Number of unique movies: 10325


Calculates the number of unique users and movies in the ratings DataFrame.
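
These counts also indicate how sparse a user-by-movie rating table would be, which matters for the collaborative recommender in Step 3. A quick back-of-the-envelope check, using only numbers already reported in this notebook:

```python
# Counts reported earlier in this notebook.
num_ratings = 105339
num_unique_users = 668
num_unique_movies = 10325

# Fraction of user-movie pairs that actually carry a rating.
density = num_ratings / (num_unique_users * num_unique_movies)
print(f'Density: {density:.2%}')
```

Roughly 1.5% of the possible user-movie pairs have a rating, so almost every cell of a pivoted user-item matrix will be empty before any fill value is applied.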

*Average rating and Total movies at genre level*

Merge DataFrames: Merges movies and ratings on movieId to get all relevant data in one DataFrame.

Extract genres: Splits the genres column into lists of genres.

Explode genres: Expands the DataFrame so each row has one genre, making it easier to analyze genres individually.

Calculate stats: Groups by genre to calculate average rating, total number of ratings, and unique movies for each genre.

In [11]:
# Merge movies and ratings dataframes
movie_ratings = pd.merge(ratings, movies, on='movieId')

# Extract genres
movie_ratings['genres'] = movie_ratings['genres'].str.split('|')

# Explode the genres to get one genre per row
movie_ratings = movie_ratings.explode('genres')

# Calculate average rating and total movies at genre level
genre_stats = movie_ratings.groupby('genres').agg({'rating': ['mean', 'count'], 'movieId': 'nunique'}).reset_index()
genre_stats.columns = ['Genre', 'Average Rating', 'Total Ratings', 'Unique Movies']
genre_stats


Unnamed: 0,Genre,Average Rating,Total Ratings,Unique Movies
0,(no genres listed),3.071429,7,7
1,Action,3.45145,31205,1737
2,Adventure,3.518027,23076,1164
3,Animation,3.63535,5966,400
4,Children,3.439429,8098,540
5,Comedy,3.420996,38055,3513
6,Crime,3.642392,18291,1440
7,Documentary,3.643035,1206,415
8,Drama,3.650266,46960,5218
9,Fantasy,3.500459,10889,670
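
The split, explode, and group steps can be easier to follow on a tiny example. The sketch below uses made-up ratings (not taken from the dataset) and named aggregation, which yields flat column names without a separate renaming step.

```python
import pandas as pd

# Made-up ratings for two movies; not taken from the real dataset.
toy = pd.DataFrame({
    'movieId': [1, 1, 2],
    'rating': [4.0, 5.0, 3.0],
    'genres': ['Comedy|Romance', 'Comedy|Romance', 'Comedy'],
})

# One row per (rating, genre) pair.
toy['genres'] = toy['genres'].str.split('|')
exploded = toy.explode('genres')

# Named aggregation gives flat column names directly.
stats = exploded.groupby('genres').agg(
    avg_rating=('rating', 'mean'),
    total_ratings=('rating', 'count'),
    unique_movies=('movieId', 'nunique'),
).reset_index()
print(stats)
```

Note that a rating for a multi-genre movie is counted once per genre, so the per-genre totals add up to more than the number of ratings.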


*Unique genres considered*

Splits the genres column and expands it so each row holds one genre, then identifies all unique genres present in the dataset.

In [12]:
unique_genres = movies['genres'].str.split('|').explode().unique()
print(f'Unique genres: {unique_genres}')


Unique genres: ['Adventure' 'Animation' 'Children' 'Comedy' 'Fantasy' 'Romance' 'Drama'
 'Action' 'Crime' 'Thriller' 'Horror' 'Mystery' 'Sci-Fi' 'IMAX' 'War'
 'Musical' 'Documentary' 'Western' 'Film-Noir' '(no genres listed)']


### **Step 3:** *Design the Recommendation Modules*

**Popularity-based Recommender System**

The system recommends top N movies from a specific genre with at least t reviews, ordered by rating.

In [13]:
def popularity_based_recommender(genre, min_reviews, top_n):
    genre_movies = movie_ratings[movie_ratings['genres'] == genre]
    genre_movies = genre_movies.groupby('movieId').agg({'rating': ['mean', 'count'], 'title': 'first'}).reset_index()
    genre_movies.columns = ['movieId', 'Average Rating', 'Total Reviews', 'Title']
    genre_movies = genre_movies[genre_movies['Total Reviews'] >= min_reviews]
    top_movies = genre_movies.sort_values(by='Average Rating', ascending=False).head(top_n)
    return top_movies[['Title', 'Average Rating', 'Total Reviews']]



Function definition: popularity_based_recommender recommends movies based on popularity within a specified genre.

Filter by genre: Filters movies to only those within the specified genre.

Group by movie: Groups movies by movieId and calculates average rating and total reviews.

Filter by minimum reviews: Keeps only movies with at least the specified number of reviews.

Sort and select top N: Sorts the movies by average rating in descending order and selects the top N movies.

In [14]:
popularity_based_recommender('Action', 100, 10)

Unnamed: 0,Title,Average Rating,Total Reviews
319,"Matrix, The (1999)",4.264368,261
137,Star Wars: Episode V - The Empire Strikes Back...,4.22807,228
139,Raiders of the Lost Ark (Indiana Jones and the...,4.212054,224
1395,Inception (2010),4.18932,103
36,Star Wars: Episode IV - A New Hope (1977),4.188645,273
370,Fight Club (1999),4.188406,207
82,Blade Runner (1982),4.169872,156
138,"Princess Bride, The (1987)",4.163743,171
140,Aliens (1986),4.146497,157
1212,"Dark Knight, The (2008)",4.141732,127
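
The effect of the minimum-review threshold is easier to see on a small input. The sketch below is a hypothetical variant, `top_by_genre`, that takes the exploded DataFrame as an argument instead of reading the global `movie_ratings`, and groups by title only because the toy data has no movieId column. All ratings in it are made up.

```python
import pandas as pd


def top_by_genre(exploded, genre, min_reviews, top_n):
    """Hypothetical variant of popularity_based_recommender taking the data as an argument."""
    subset = exploded[exploded['genres'] == genre]
    stats = subset.groupby('title').agg(
        avg_rating=('rating', 'mean'),
        total_reviews=('rating', 'count'),
    ).reset_index()
    # Apply the review-count threshold before ranking by average rating.
    stats = stats[stats['total_reviews'] >= min_reviews]
    return stats.sort_values('avg_rating', ascending=False).head(top_n)


# Made-up data: movie C has a perfect score but only one review.
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'D'],
    'genres': ['Action'] * 6 + ['Drama'] * 3,
    'rating': [5.0, 5.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0],
})
print(top_by_genre(toy, 'Action', min_reviews=2, top_n=2))
```

With `min_reviews=2`, movie C is filtered out despite its 5.0 average, which is the behaviour the threshold is meant to provide. Raising the threshold to 3 leaves only movie B.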



**Content-based Recommender System**

The system recommends top N movies based on similar genres to a given movie.

In [15]:
# Create a string of genres for each movie
movies['genre_str'] = movies['genres'].str.replace('|', ' ', regex=False)  # regex=False: treat '|' as a literal character

# Create a CountVectorizer to transform the genre strings into a genre matrix, transforms the text into numerical data
count_vectorizer = CountVectorizer()
genre_matrix = count_vectorizer.fit_transform(movies['genre_str'])

# Compute the cosine similarity matrix, find movies with similar genre patterns
cosine_sim = cosine_similarity(genre_matrix, genre_matrix)

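# Function to get movie index (assumes movies keeps its default 0..n-1 index)
def get_movie_index(title):
    idx = movies[movies['title'] == title].index[0]
    print(idx)  # debug output: index of the matched movie
    return idx

# Content-based recommender function
def content_based_recommender(movie_title, top_n):
    movie_idx = get_movie_index(movie_title)
    similarity_scores = list(enumerate(cosine_sim[movie_idx]))
    # Drop the query movie explicitly: movies with identical genres also score 1.0,
    # so the query is not guaranteed to come first after sorting.
    similarity_scores = [s for s in similarity_scores if s[0] != movie_idx]
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    top_movie_indices = [x[0] for x in similarity_scores[:top_n]]
    return movies.iloc[top_movie_indices][['title', 'genres']]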


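String of genres: Creates a single space-separated string of genres for each movie, which makes the genres ready for text vectorization.

CountVectorizer: Transforms the genre strings into a genre matrix using CountVectorizer. The matrix entries contain the frequency of each token in each movie's genre string (in this case either 0 or 1, since each movie either has or doesn’t have a genre). Note that the default tokenizer lowercases the text and splits on punctuation, so a hyphenated genre such as 'Sci-Fi' becomes the two tokens 'sci' and 'fi'.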

Cosine similarity matrix: Computes the cosine similarity between all movies based on their genres.

Get movie index: Function to get the index of a movie by title.

Content-based recommender function: Recommends top N movies similar to a given movie based on genre similarity.

In [16]:
content_based_recommender('Toy Story (1995)', 10)

0


Unnamed: 0,title,genres
1815,Antz (1998),Adventure|Animation|Children|Comedy|Fantasy
2496,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy
2967,"Adventures of Rocky and Bullwinkle, The (2000)",Adventure|Animation|Children|Comedy|Fantasy
3166,"Emperor's New Groove, The (2000)",Adventure|Animation|Children|Comedy|Fantasy
3811,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy
6617,DuckTales: The Movie - Treasure of the Lost La...,Adventure|Animation|Children|Comedy|Fantasy
6997,"Wild, The (2006)",Adventure|Animation|Children|Comedy|Fantasy
7382,Shrek the Third (2007),Adventure|Animation|Children|Comedy|Fantasy
7987,"Tale of Despereaux, The (2008)",Adventure|Animation|Children|Comedy|Fantasy
9215,Asterix and the Vikings (Astérix et les Viking...,Adventure|Animation|Children|Comedy|Fantasy
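
As an alternative to text vectorization, the genre matrix can be built directly from the pipe-separated strings. The sketch below is an assumption-laden illustration rather than the notebook's method: it uses pandas `str.get_dummies` so that hyphenated genres stay as one column, and computes cosine similarity with NumPy. The first three rows come from `movies.head()`; the fourth title is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)',
              'Grumpier Old Men (1995)', 'Hypothetical Sci-Fi Film'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance',
               'Action|Sci-Fi'],
})

# One 0/1 column per genre; 'Sci-Fi' stays a single column.
genre_flags = toy['genres'].str.get_dummies(sep='|')

# Cosine similarity = dot product of L2-normalised rows.
X = genre_flags.to_numpy(dtype=float)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X_unit @ X_unit.T


def similar_titles(title, top_n):
    """Hypothetical helper: titles most similar to `title`, excluding itself."""
    pos = int(np.flatnonzero((toy['title'] == title).to_numpy())[0])
    scores = sim[pos].copy()
    scores[pos] = -1.0  # exclude the query movie itself
    order = np.argsort(-scores, kind='stable')[:top_n]
    return toy['title'].iloc[order].tolist()


print(similar_titles('Toy Story (1995)', 2))
```

Jumanji shares three of Toy Story's five genres, so it ranks ahead of Grumpier Old Men, which shares only Comedy. The division by the row norm assumes every movie has at least one genre flag.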


**Collaborative-based Recommender System**

The system recommends top N movies based on K similar users for a target user.

In [17]:
# Pivot ratings dataframe to create a user-item matrix
user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0) # By filling with 0, you indicate that the user hasn't rated that movie.

# Fit NearestNeighbors model
knn = NearestNeighbors(metric='cosine', algorithm='brute') # The brute-force method is used to calculate distances (or similarities) between users
knn.fit(user_item_matrix)

# Function to get movie recommendations for a user
def collaborative_based_recommender(user_id, k, top_n):
    # finds the k+1 nearest neighbors for the given user (including the user themselves, which is why k+1 is used).
    distances, indices = knn.kneighbors(user_item_matrix.loc[user_id].values.reshape(1, -1), n_neighbors=k+1)
    user_indices = indices.flatten()[1:] # Since the first index corresponds to the target user themselves, it is excluded using [1:]
    similar_users_ratings = user_item_matrix.iloc[user_indices]
    mean_ratings = similar_users_ratings.mean(axis=0)
    top_movie_indices = mean_ratings.sort_values(ascending=False).head(top_n).index
    return movies[movies['movieId'].isin(top_movie_indices)][['title', 'genres']]


Collaborative Filtering: This method uses the idea that users who have rated items similarly in the past will continue to have similar preferences in the future. Instead of analyzing the content (e.g., genres, actors), it focuses on user behavior (ratings).

User-item matrix: Creates a pivot table where rows are users, columns are movies, and values are ratings.

Fit model: Fits a NearestNeighbors model using the user-item matrix.

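Collaborative recommender function: Recommends top N movies for a user based on K similar users. Unrated movies count as 0 when the neighbours' ratings are averaged, and the final isin filter returns the selected movies in the row order of movies rather than in ranking order.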

In [18]:
collaborative_based_recommender(1, 5, 10)

Unnamed: 0,title,genres
44,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
47,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
260,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
279,"Shawshank Redemption, The (1994)",Crime|Drama
471,Schindler's List (1993),Drama|War
525,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller
2056,"Matrix, The (1999)",Action|Sci-Fi|Thriller
2281,American Beauty (1999),Drama|Romance
3885,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
5206,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy
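
The list above can include movies the target user has already rated, and it is ordered by position in `movies` rather than by score. The sketch below shows one possible variant on a made-up 4x4 rating matrix: it finds neighbours by cosine similarity with NumPy (instead of `NearestNeighbors`), keeps only movies the target user has not rated, and preserves the ranking. The function name `recommend_unseen` and all ratings are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movie IDs, 0 means "not rated".
toy = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 4, 0],
     [4, 5, 0, 5],
     [0, 1, 5, 1]],
    index=pd.Index([1, 2, 3, 4], name='userId'),
    columns=pd.Index([10, 20, 30, 40], name='movieId'),
    dtype=float,
)


def recommend_unseen(matrix, user_id, k, top_n):
    """Hypothetical variant: mean rating of k nearest users, unseen movies only."""
    X = matrix.to_numpy()
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    pos = matrix.index.get_loc(user_id)
    sims = unit @ unit[pos]          # cosine similarity to every user
    sims[pos] = -np.inf              # never pick the target user as a neighbour
    neighbours = np.argsort(-sims, kind='stable')[:k]
    mean_ratings = matrix.iloc[neighbours].mean(axis=0)
    unseen = mean_ratings[matrix.loc[user_id] == 0]
    return unseen.sort_values(ascending=False).head(top_n)


print(recommend_unseen(toy, user_id=1, k=2, top_n=2))
```

For user 1 the two nearest neighbours are users 2 and 3, and the unseen movies 40 and 30 come back in that order. As in the original function, a neighbour's missing rating still counts as 0 in the mean, so averaging only over non-zero ratings would be a further refinement.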


### **Step 4:** *Optional: GUI Interface*

Create a GUI using a library such as ipywidgets so that users can interact with these recommendation systems.

In [19]:
import ipywidgets as widgets
from IPython.display import display

# Popularity-based Recommender GUI
def popularity_recommender_interface():
    genre = widgets.Text(value='Action', description='Genre:')
    min_reviews = widgets.IntSlider(value=5, min=5, max=100, step=1, description='Min Reviews:')
    top_n = widgets.IntSlider(value=10, min=1, max=20, description='Top N:')
    
    display(genre, min_reviews, top_n)
    
    def on_button_clicked(b):
        recommendations = popularity_based_recommender(genre.value, min_reviews.value, top_n.value)
        print(recommendations)
    
    button = widgets.Button(description='Get Recommendations')
    button.on_click(on_button_clicked)
    display(button)



Import libraries: Imports ipywidgets for creating interactive widgets and display from IPython.display for displaying them.

Function definition: Defines a function to create a GUI for the popularity-based recommender.

Create widgets: Creates text input for genre, sliders for minimum reviews and top N.

Display widgets: Displays the created widgets.

Button click event: Defines what happens when the button is clicked (calls the recommender function and displays results).

Show GUI: Calls the function to display the GUI.

In [20]:
popularity_recommender_interface()

Text(value='Action', description='Genre:')

IntSlider(value=5, description='Min Reviews:', min=5)

IntSlider(value=10, description='Top N:', max=20, min=1)

Button(description='Get Recommendations', style=ButtonStyle())