# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

## Learning Objective

At the end of this experiment, you will be able to:

* Recommend movies to the users

In [None]:
#@title Experiment Walkthrough Video

from IPython.display import HTML

HTML("""<video width="854" height="480" controls>
  <source src="https://cdn.talentsprint.com/talentsprint/archives/sc/aiml/aiml_labs_blr/movie_recommendation_system_knn.mp4" type="video/mp4">
</video>
""")

## Dataset

### Description

The dataset chosen for this experiment is a subset of the original MovieLens dataset.

Consider the problem of recommending movies to users. You have M users and N movies.
You want to predict whether a given test user $x$ will watch movie $y$.

User $x$ has seen some movies in the past and has not seen others. You will use $x$'s movie-watching history as the feature for the recommendation system.

Let us use KNN to find the K nearest neighbour users (users with similar taste) to $x$, and make predictions based on their entries for movie $y$.

A user has either seen a movie (1) or not seen it (0). You can represent this as a matrix of size M×N (M rows and N columns). In this notebook the matrix is stored as a dictionary keyed by (userId, movieId) pairs.

Each element of the matrix is either zero or one. If the $(u, m)$ entry of this matrix is 1, then the $u^{th}$ user has seen movie $m$.
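
To make the two representations concrete, here is a minimal sketch on made-up data. The names `toy_pairs`, `toy_seen` and `toy_matrix`, and the ids used, are illustrative assumptions and are not taken from the dataset.

```python
# Toy data: 3 users and 4 movies (hypothetical ids, not taken from the dataset)
toy_pairs = [(0, 1), (0, 3), (1, 1), (2, 0), (2, 3)]

# Sparse form: one dictionary key per (userId, movieId) pair that was seen
toy_seen = {pair: 1 for pair in toy_pairs}

# Dense form: the same information as an M x N list of lists
M, N = 3, 4
toy_matrix = [[toy_seen.get((u, m), 0) for m in range(N)] for u in range(M)]

for row in toy_matrix:
    print(row)
```

Each printed row is one user's N-dimensional binary vector. The dictionary holds the same information but only needs a key for the pairs that were actually seen.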

#### Training set

An M×N binary matrix indicating seen/not-seen.

#### Test set

L test cases, each an $(x, y)$ pair. $x$ is an N-dimensional binary vector whose $y^{th}$ entry is missing; that entry is what we want to predict.


### Data Source

* AIML_DS_MOVIE-TRAIN_SMALLSUBSETOFMOVIELENSDATASET.csv

*  AIML_DS_MOVIE-TEST_SMALLSUBSETOFMOVIELENSDATASET.csv

This is a small subset of the original MovieLens dataset:
https://grouplens.org/datasets/movielens/



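* The code below scores each pair of users by the number of movies both have seen (the dot product of their binary vectors). The notebook refers to this as a cosine distance; note that the score is not normalized by vector length, and that a higher value means the two users are closer.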
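
For comparison, the sketch below contrasts the raw count of shared movies with a normalized cosine similarity on made-up vectors. It is not the notebook's scoring function; the helper names `dot` and `cosine_similarity` and the three vectors are assumptions chosen for illustration.

```python
import math

def dot(a, b):
    """Number of positions where both binary vectors are 1."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A user with no watched movies has no defined direction
    return dot(a, b) / (norm_a * norm_b)

# Hypothetical viewing histories over six movies
target = [1, 1, 1, 0, 0, 0]
heavy = [1, 1, 1, 1, 1, 1]   # Has watched everything
light = [1, 1, 0, 0, 0, 0]   # Has watched only two movies

print(dot(target, heavy), dot(target, light))
print(cosine_similarity(target, heavy), cosine_similarity(target, light))
```

On this toy data the raw count ranks the heavy watcher as the closer neighbour, while the normalized score ranks the light watcher higher. Which behaviour is preferable depends on the application.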

In [None]:
! wget https://cdn.talentsprint.com/aiml/Experiment_related_data/AIML_DS_MOVIE-TEST_SMALLSUBSETOFMOVIELENSDATASET.csv
! wget https://cdn.talentsprint.com/aiml/Experiment_related_data/AIML_DS_MOVIE-TRAIN_SMALLSUBSETOFMOVIELENSDATASET.csv
    

### Importing required packages


In [None]:
import pandas as pd

### Setting up the files

In [None]:
Train_set = "AIML_DS_MOVIE-TRAIN_SMALLSUBSETOFMOVIELENSDATASET.csv"
Test_set = "AIML_DS_MOVIE-TEST_SMALLSUBSETOFMOVIELENSDATASET.csv"   

In [None]:
Train_set

### Loading the data from set up files


In [None]:
rated = pd.read_csv(Train_set, converters={"userId":int, "movieId":int})
rated.head()

In [None]:
rated.describe()

In [None]:
userCount = max(rated.userId)    # Largest userId in the training data
movieCount = max(rated.movieId)  # Largest movieId in the training data
print(userCount, movieCount)

In [None]:
# Every (user, movie) pair present in the data is stored with the value 1
seen = {}
for x in rated.values:
    seen[(int(x[0]), int(x[1]))] = 1     # Key: (userId, movieId), value: 1
len(seen)

In [None]:
# All candidate (user, movie) pairs.
# Note: range(n) yields 0..n-1, so ids equal to userCount or movieCount are not covered.
allUsersMovies = [(u, m) for u in range(userCount) for m in range(movieCount)]

# 670*9065 is the total number of (user, movie) pairs
len(allUsersMovies)

In [None]:
# A (user, movie) pair that is missing from the data means the user has not
# watched that movie, so it is stored as 0 in the dictionary
for x in allUsersMovies:
    if x not in seen:
        seen[x] = 0
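
The loop above adds a key for every unseen pair, which for the 670*9065 pairs mentioned earlier is roughly six million entries. As a possible alternative, the sketch below keeps only the watched pairs and treats a missing key as 0 at lookup time. The names `sparse_seen` and `has_seen` and the toy data are assumptions, not part of the notebook.

```python
# Hypothetical toy data: only the watched pairs are stored
sparse_seen = {(0, 1): 1, (0, 3): 1, (1, 1): 1}

def has_seen(user, movie, table=sparse_seen):
    """Return 1 if the pair is stored, otherwise 0, without adding a key."""
    return table.get((user, movie), 0)

print(has_seen(0, 1))    # A stored pair
print(has_seen(2, 2))    # A pair that was never stored
print(len(sparse_seen))  # The dictionary has not grown
```

The trade-off is that every lookup has to go through `has_seen` (or `dict.get`) instead of indexing the dictionary directly.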

Now that the data is loaded into a dictionary, let us recast the distance function to use it. Given two users, $u_1$ and $u_2$, and a movie $mx$ whose entry we want to predict, we must ignore the entries for $mx$ while computing the distance.
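
As a sketch of what this means, write $s_{u,m}$ for the dictionary entry of user $u$ and movie $m$, and $m_x$ for the movie $mx$ (this notation is introduced here for illustration only). The score used in the next cell can then be written as:

$$
d(u_1, u_2; m_x) \;=\; \sum_{m \neq m_x} s_{u_1,m}\, s_{u_2,m} \;=\; \Big(\sum_{m} s_{u_1,m}\, s_{u_2,m}\Big) \;-\; s_{u_1,m_x}\, s_{u_2,m_x}
$$

The right-hand form matches how the code is organised: it sums over all movies and subtracts the single term for $m_x$.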

In [None]:
# Similarity between user 1 and user 2: the number of movies both have seen, excluding movie mx
def distance(u1, u2, mx):
    d = 0 - seen[(u1, mx)] * seen[(u2, mx)]    # Cancel the term for mx that the loop below adds

    for m in range(movieCount):
        d += seen[(u1, m)] * seen[(u2, m)]      # Adds 1 for every movie that both users have seen
    return d

def kNN(k, givenUser, givenMovie):

    '''Compute the similarity between the given user and every other user,
    and return the k users with the highest scores (a higher score means a closer neighbour)'''

    distances = []
    for u in range(userCount):
        if u != givenUser:  
            distances.append([distance(u, givenUser, givenMovie), u])
    distances.sort()
    distances.reverse() # A higher score means closer, so put the largest scores first
    return distances[:k] 

def prediction(k, givenUser, givenMovie):

    '''Find the k nearest neighbours of the given user, ranked by the similarity score above.
       If more than half of them have seen the given movie, predict that the user is likely to watch it'''
       
    neighbours = kNN(k, givenUser, givenMovie)
    howmanySaw = sum([seen[(u, givenMovie)] for d, u in neighbours])

    return 2 * howmanySaw > k      # Predict 1 (True) if more than half of the similar users have seen this movie, otherwise 0 (False).    
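
The same idea can also be sketched with one set of watched movies per user, so that the overlap is a set intersection rather than a loop over every movie. This is a minimal self-contained illustration on made-up data, not the notebook's implementation; the names `movies_by_user`, `overlap` and `predict` are assumptions.

```python
from collections import defaultdict

# Hypothetical toy history: (userId, movieId) pairs that were seen
toy_pairs = [(0, 0), (0, 1), (0, 2),
             (1, 0), (1, 1), (1, 3),
             (2, 0), (2, 3),
             (3, 2), (3, 3)]

# One set of watched movies per user
movies_by_user = defaultdict(set)
for user, movie in toy_pairs:
    movies_by_user[user].add(movie)

def overlap(u1, u2, held_out_movie):
    """Count movies both users have seen, ignoring the held-out movie."""
    common = movies_by_user[u1] & movies_by_user[u2]
    return len(common - {held_out_movie})

def predict(k, given_user, given_movie):
    """Majority vote among the k users with the largest overlap."""
    others = [u for u in movies_by_user if u != given_user]
    ranked = sorted(others,
                    key=lambda u: overlap(u, given_user, given_movie),
                    reverse=True)
    votes = sum(given_movie in movies_by_user[u] for u in ranked[:k])
    return 2 * votes > k

print(predict(3, 0, 3))
print(predict(1, 0, 2))
```

Ties between equally similar users are broken here by the order in which users were first encountered, which may differ from the sort-and-reverse ordering used in the cell above.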

In [None]:
test_data = pd.read_csv(Test_set)
test_data.head()

In [None]:
# Example call with k=5 neighbours, user 0 and movie 4; substitute a (user, movie) pair from the test data
prediction(5,0,4)
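
To go beyond a single call, one could score a predictor over many cases. The sketch below assumes the test cases are available as (user, movie, label) triples, which the notebook does not confirm; `accuracy`, `toy_predict` and `toy_cases` are hypothetical names used only for illustration.

```python
def accuracy(predict_fn, labelled_cases, k):
    """Fraction of (user, movie, label) cases where predict_fn agrees with the label."""
    if not labelled_cases:
        return 0.0
    correct = sum(
        bool(predict_fn(k, user, movie)) == bool(label)
        for user, movie, label in labelled_cases
    )
    return correct / len(labelled_cases)

# Stand-in predictor for demonstration only: predicts "seen" for even movie ids
def toy_predict(k, user, movie):
    return movie % 2 == 0

# Hypothetical labelled cases: (user, movie, label)
toy_cases = [(0, 2, 1), (0, 3, 0), (1, 4, 0), (1, 5, 1)]
print(accuracy(toy_predict, toy_cases, k=5))
```

In the notebook, the `prediction` function defined above could be passed as `predict_fn`, provided labelled triples can be built from the test file.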

### Summary

In this experiment, we learnt how to build a recommendation system using a KNN classifier.