In [1]:
###########################################################################

In [2]:
# Introduction

# This Jupyter notebook is part of your assignment
# You will work with a data set that contains the ratings given to 10 movies by 100 unique users
# The overall task is to build a set of functions that will act as a simple movie recommendation system
# The recommender system must recommend the top three movies to a user based on the movie they watched most recently

# In this exercise, you will perform the following tasks:
# 1 - Load and study the data
# 2 - Create a function to calculate the angle between the user ratings vectors for two movies
# 3 - Create a function to print the names of the top three most similar movies based on the movie that a user watched recently

In [3]:
###########################################################################

In [4]:
# Task 1 - Load and study the data

# Load the data and study its features such as:
# The number of users
# The number of movies
# The ranges of ratings

In [5]:
# Load "numpy" and "pandas" for manipulating numbers, vectors and data frames
import numpy as np
import pandas as pd

In [7]:
# Read in the "User_Movie_Ratings_Graded.csv" file as a Pandas Data Frame
# Note: Make sure the code and the data are in the same folder or specify the appropriate path for the data
df = pd.read_csv('User_Movie_Ratings_Graded.csv', index_col=0)

In [7]:
# Study the description of the data
# Note: Make sure the code and the data description are in the same folder or specify the appropriate path for the data
with open('User_Movie_Ratings_Graded_Feature_Description.txt', 'r') as f:
    print(f.read())

The "User_Movie_Ratings_Graded.csv" data is a completely fabricated data set for use only on the upGrad platform.

Any resemblance to entities past, present or future is merely a coincidence.

Feature Description:
The data set contains ratings given by users to movies on a scale of 1 to 5.
Each row contains the ratings given to all the movies by a particular user.
Each column contains the ratings given to a particular movie by all the users.


In [8]:
# Take a brief look at the data frame using ".head()"
df.head()

User,Avengers: Endgame,King Kong,Wonder Woman,Star Wars: The Last Jedi,Thor: Ragnarok,The Lighthouse,The Babadook,A Quite Place,The Invisible Man,It Follows
Abhijit,5,4,5,5,4,1,2,1,3,2
Amanda,1,2,2,3,3,5,4,5,5,4
Arnold,2,2,3,1,2,5,5,5,4,5
Arvind,4,5,3,4,5,1,2,2,3,2
Azalea,3,2,4,3,5,1,3,1,3,1


In [9]:
# Check the dimensions of the data frame using ".shape"
########## CODE HERE ##########
df.shape

(100, 10)

In [10]:
# View basic information about the data frame using ".info()"
########## CODE HERE ##########
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, Abhijit to Zubair
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Avengers: Endgame         100 non-null    int64
 1   King Kong                 100 non-null    int64
 2   Wonder Woman              100 non-null    int64
 3   Star Wars: The Last Jedi  100 non-null    int64
 4   Thor: Ragnarok            100 non-null    int64
 5   The Lighthouse            100 non-null    int64
 6   The Babadook              100 non-null    int64
 7   A Quite Place             100 non-null    int64
 8   The Invisible Man         100 non-null    int64
 9   It Follows                100 non-null    int64
dtypes: int64(10)
memory usage: 8.6+ KB


In [11]:
# Observations

# There are 100 rows and 10 columns in the data
# Each row corresponds to the movie ratings given by a particular user
# Each column corresponds to the user ratings given to a particular movie
# The movie ratings range from 1 to 5 and they are whole numbers
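
# The observations above can be checked programmatically. The following is a minimal sketch, not part of the assignment solution.
# It assumes a small stand-in frame called "sample" (a hypothetical name) built from three columns of the "df.head()" output above,
# so the numbers it prints describe only those five users and three movies, not the full data set.

```python
import pandas as pd

# First five rows of three columns, copied from the df.head() output above
sample = pd.DataFrame(
    {
        "King Kong": [4, 2, 2, 5, 2],
        "Wonder Woman": [5, 2, 3, 3, 4],
        "The Lighthouse": [1, 5, 5, 1, 1],
    },
    index=pd.Index(["Abhijit", "Amanda", "Arnold", "Arvind", "Azalea"], name="User"),
)

# Number of users (rows) and movies (columns)
n_users, n_movies = sample.shape

# Smallest and largest rating across the whole frame
lowest = sample.min().min()
highest = sample.max().max()

# True if every column holds integers, i.e. the ratings are whole numbers
all_whole = all(pd.api.types.is_integer_dtype(t) for t in sample.dtypes)

print(n_users, n_movies, lowest, highest, all_whole)
```

# Running the same three checks on the full "df" should reproduce the observations listed above.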

In [12]:
###########################################################################

In [13]:
# Task 2 - Create a function to calculate the angle between two vectors

# The angle between any two vectors can be used as a similarity measure between those two vectors
# We will calculate the angle between two vectors using dot products

# The algorithm to find the angle between two vectors is as follows:
# Step 1 - Calculate the dot product of the two vectors
# Step 2 - Calculate the magnitudes of the two vectors
# Step 3 - Divide the value obtained in step 1 by the product of the values obtained in step 2
# Step 4 - Find the inverse cosine of the value obtained in step 3
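
# The four steps above can be sketched on two small vectors whose angle is known in advance.
# The vectors "u" and "v" are made-up illustrations, not ratings from the data set; the angle between them is 45 degrees by construction.

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

# Step 1 - dot product of the two vectors
dot = np.dot(u, v)

# Step 2 - magnitudes of the two vectors
norm_u = np.linalg.norm(u)
norm_v = np.linalg.norm(v)

# Step 3 - divide the dot product by the product of the magnitudes (the cosine)
cosine = dot / (norm_u * norm_v)

# Step 4 - inverse cosine gives the angle in radians
angle_rad = np.arccos(cosine)

# Optional: convert to degrees for readability
angle_deg = np.degrees(angle_rad)
print(angle_deg)  # approximately 45
```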

In [20]:
# Create a function called "ang()" which takes in two movie names and returns the angle between their user ratings vectors
# Note: Pass in only the movie names as inputs and access the user ratings vectors within the function
# Note: Use "np.dot()" to calculate dot products
# Note: Use "np.linalg.norm()" with default parameters to calculate magnitudes
# Note: Use "round()" to round the denominator part of the angle formula which contains the product of the two magnitudes
# Note: Use "np.arccos()" to calculate inverse cosines
########## CODE HERE ##########
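def ang(movie1, movie2):
    # Look up the user ratings vectors (columns of "df") for the two movies
    vec1 = df[movie1]
    vec2 = df[movie2]

    # Steps 1 and 2: dot product and magnitudes
    dot_product = np.dot(vec1, vec2)
    magnitude_vec1 = np.linalg.norm(vec1)
    magnitude_vec2 = np.linalg.norm(vec2)

    # Steps 3 and 4: cosine of the angle, then inverse cosine (in radians)
    # Note: floating-point error can push the ratio slightly above 1, in which case "np.arccos()" returns nan
    angle = np.arccos(dot_product / (magnitude_vec1 * magnitude_vec2))

    # Convert from radians to degrees before returning
    return np.degrees(angle)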


In [22]:
# Calculate the angle between the user ratings vectors for the movies "King Kong" and "Wonder Woman" using the function "ang()"
########## CODE HERE ##########
angle = ang('King Kong', 'Wonder Woman')
print(angle)  # angle in degrees
radians = (angle * 3.1415) / 180  # approximate conversion to radians
print(radians)

26.008495885526973
0.45392049902434994
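
# The conversion above uses 3.1415, a truncated value of pi, so the radian figure is low by about 0.003%.
# A minimal sketch of the same conversion with NumPy's built-in "np.radians()" follows; the variable names are illustrative only.

```python
import numpy as np

angle_deg = 26.008495885526973  # degrees, as printed by the cell above

# np.radians multiplies by pi / 180 using full double precision
angle_rad = np.radians(angle_deg)
print(angle_rad)

# np.degrees is the inverse conversion
print(np.degrees(angle_rad))
```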


In [29]:
# Calculate the angle between the user ratings vectors for the movies "King Kong" and "The Lighthouse" using the function "ang()"
########## CODE HERE ##########
angle = ang('King Kong', 'The Lighthouse')
print(angle)
radians = (angle * 3.1415) / 180
print(radians)

48.44797891336492
0.8455518097574216


In [30]:
# Calculate the angle between the user ratings vectors for the movie "King Kong" with itself using the function "ang()"
########## CODE HERE ##########
angle = ang('King Kong', 'King Kong')
print(angle)
radians = (angle * 3.1415) / 180
print(radians)

nan
nan


  angle = np.arccos(dot_product / (magnitude_vec1 * magnitude_vec2))


In [18]:
# Observations

# The dot product of two vectors can be used to calculate the angle between the two vectors
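# "np.arccos()" returns the angle in radians; "ang()" converts it to degrees with "np.degrees()" before returning it
# The angle between two vectors measures the similarity between those vectors
# Mathematically, the angle between a vector and itself is 0; the nan printed above for "King Kong" with itself is a floating-point artifact
# The smaller the angle between two vectors, the higher their similarity score; the larger the angle, the lower the score
# Floating-point rounding can push the cosine slightly above 1, where "np.arccos()" is undefined and returns nan
# Rounding the denominator of the angle formula, as the task notes suggest, is meant to overcome this precision limitation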
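
# One way to avoid the nan shown above is to clip the cosine to the interval [-1, 1] before calling "np.arccos()".
# This is a minimal sketch of that idea, not the method the task notes prescribe (they suggest rounding the denominator instead).
# The helper name "angle_between" is hypothetical, and the test vector holds only the first five "King Kong" ratings from "df.head()".

```python
import numpy as np

def angle_between(v1, v2):
    """Angle in degrees between two 1-D vectors (hypothetical helper)."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Guard against rounding error pushing the cosine just outside [-1, 1]
    cosine = np.clip(cosine, -1.0, 1.0)
    return np.degrees(np.arccos(cosine))

king_kong = [4, 2, 2, 5, 2]  # first five ratings from df.head()
self_angle = angle_between(king_kong, king_kong)
print(self_angle)  # a value at or extremely close to 0, never nan
```

# Clipping leaves every cosine that is already inside [-1, 1] unchanged, so angles between different movies are not affected.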

In [19]:
###########################################################################

In [20]:
# Task 3 - Create a function to print the names of the top three most similar movies based on the movie that a user watched recently

# The angle between two vectors can be used as a measure of similarity between the two vectors
# The smaller the angle between two vectors, the higher their similarity score; the larger the angle, the lower the score
# We will create a function that returns the names of the top three most similar movies based on a given input movie name

In [31]:
# Define a function "moviereco()" that takes in the name of a movie and returns the names of the top three most similar movies
# Note: Use the "ang()" function to measure similarities between movies (using their user ratings vectors)
# Note: You may create a temporary Pandas Series within the "moviereco()" function to store the angle values
# Note: The "index" parameter of the series can be set as "df.columns"
# Note: You may need to specify the "dtype" parameter of the series as "float64" to avoid some warnings
# Note: You may sort the entries in this series and return the second, third and fourth index names of the series
# Note: Use the ".sort_values()" function with the default value for the "ascending" parameter, which is "True"
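# Note: The first entry after sorting is expected to be the input movie itself (angle 0 with itself), so it is skipped; nan values, if any, are sorted last by default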
# Note: Return the three movie names as a list
########## CODE HERE ##########
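def moviereco(movie_name):
    # Angle between the input movie and every movie in "df" (including itself)
    similarities = pd.Series({movie: ang(movie_name, movie) for movie in df.columns}, dtype='float64')
    # Sort in ascending order and take positions 1 to 3, skipping position 0
    top_similarities = similarities.sort_values().iloc[1:4]
    return top_similarities.index.tolist()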
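
# Skipping position 0 assumes the input movie sorts first. If its self-angle evaluates to nan, pandas sorts it last and the slice drops the most similar movie instead.
# The sketch below is one possible alternative, not the assignment's required approach: it excludes the input movie by name and clips the cosine.
# The names "sample" and "recommend" are hypothetical, and the data is a five-user, five-movie excerpt of the "df.head()" output above.

```python
import numpy as np
import pandas as pd

# Five columns of the first five users, copied from the df.head() output above
sample = pd.DataFrame(
    {
        "Avengers: Endgame": [5, 1, 2, 4, 3],
        "King Kong": [4, 2, 2, 5, 2],
        "Wonder Woman": [5, 2, 3, 3, 4],
        "The Lighthouse": [1, 5, 5, 1, 1],
        "The Babadook": [2, 4, 5, 2, 3],
    },
    index=["Abhijit", "Amanda", "Arnold", "Arvind", "Azalea"],
)

def recommend(ratings, movie_name, k=3):
    """Return the k movies whose rating vectors make the smallest angle with movie_name."""
    target = ratings[movie_name].to_numpy(dtype=float)
    angles = {}
    for other in ratings.columns:
        if other == movie_name:
            continue  # exclude the input movie by name, not by sort position
        vec = ratings[other].to_numpy(dtype=float)
        cosine = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        # Clip so rounding error can never produce nan
        angles[other] = np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
    # nsmallest returns the k smallest angles in ascending order
    return pd.Series(angles, dtype="float64").nsmallest(k).index.tolist()

recs = recommend(sample, "King Kong")
print(recs)
```

# Because the excerpt covers only five users, its recommendations need not match those obtained from the full data set.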

In [34]:
# Use the function "moviereco()" to recommend the top three similar movies to a user who watched "Star Wars: The Last Jedi"
########## CODE HERE ##########
rec=moviereco('Star Wars: The Last Jedi')
print(rec)

['Wonder Woman', 'Thor: Ragnarok', 'King Kong']


In [35]:
# Use the function "moviereco()" to recommend the top three similar movies to a user who watched "The Babadook"
########## CODE HERE ##########
rec=moviereco('The Babadook')
print(rec)

['A Quite Place', 'The Lighthouse', 'It Follows']


In [24]:
# Observations

# The angle between the user ratings vectors for two particular movies is a measure of similarity between the movies
# The smaller the angle, the more similar the movies; the larger the angle, the less similar the movies
# Using the name of the last movie watched by a user, we can recommend a list of similar movies to them
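
# The pairwise angles between all movies can also be computed in one step with matrix operations.
# This is a minimal sketch under the same assumptions as before: "sample" is a hypothetical three-movie excerpt of the "df.head()" output above,
# and the cosine is clipped to [-1, 1] as a guard against rounding error.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "King Kong": [4, 2, 2, 5, 2],
        "Wonder Woman": [5, 2, 3, 3, 4],
        "The Lighthouse": [1, 5, 5, 1, 1],
    }
)

X = sample.to_numpy(dtype=float)

# Scale each movie column to unit length so dot products become cosines
unit = X / np.linalg.norm(X, axis=0)

# Entry (i, j) is the cosine of the angle between movie i and movie j
cosines = np.clip(unit.T @ unit, -1.0, 1.0)

# Convert to degrees and label rows and columns with the movie names
angles = pd.DataFrame(
    np.degrees(np.arccos(cosines)),
    index=sample.columns,
    columns=sample.columns,
)
print(angles.round(1))
```

# Each row of "angles" can then be sorted to rank the other movies by similarity to the movie named in that row.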

In [25]:
###########################################################################

In [26]:
# Conclusions

# We can use the angle between user ratings vectors of two movies to measure their similarity
# Using these measures, we can recommend similar movies to a user based on the movie they watched recently
# This is a very basic look at how recommender systems work
# Industrial movie recommendation systems build on this idea with many extensions and additions

In [27]:
###########################################################################