## Importing the Data into Python
We'll be using [pandas](https://pandas.pydata.org/docs/user_guide/index.html#user-guide), a data analysis package for Python.

The first step is to import the package and use it to load the data as [dataframes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). Since the data is stored as [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) files, we can use [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas-read-csv).

In [1]:
# original dataset from the GroupLens Research Project at the University of Minnesota
import pandas as pd
import numpy as np

In [2]:
# read the data into dataframes
df_movies = pd.read_csv(r'./movies.csv')
df_metrics = pd.read_csv(r'./metrics.csv')
df_ratings = pd.read_csv(r'./ratings_simplified.csv')
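As a small aside on `read_csv`, the sketch below reads a CSV from an in-memory string rather than from disk. The sample text is a hypothetical stand-in for a file that was saved together with its row index; whether the actual files were written that way is an assumption. In that situation pandas loads the index as an extra column named `Unnamed: 0`, and passing `index_col=0` is one way to avoid it.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file that was saved together with its row index.
csv_text = """,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
"""

# Default behaviour: the unnamed first column is kept as a data column.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())

# index_col=0 uses the first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())
```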

Let's take a look at the dataframes.

In [3]:
df_movies

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)
...,...,...
62418,209157,We (2018)
62419,209159,Window of the Soul (2001)
62420,209163,Bad Poems (2018)
62421,209169,A Girl Thing (2001)


In [4]:
df_metrics

Unnamed: 0,movieId,count,mean,deviation,title
0,356,81491,8.096023,1.877620,Forrest Gump (1994)
1,318,81482,8.827152,1.520888,"Shawshank Redemption, The (1994)"
2,296,79672,8.377824,1.917676,Pulp Fiction (1994)
3,593,74127,8.302683,1.723758,"Silence of the Lambs, The (1991)"
4,2571,72674,8.308198,1.825748,"Matrix, The (1999)"
...,...,...,...,...,...
583,2078,10052,7.300537,1.866745,"Jungle Book, The (1967)"
584,56782,10044,7.965452,1.876157,There Will Be Blood (2007)
585,529,10037,7.698416,1.644286,Searching for Bobby Fischer (1993)
586,96610,10012,7.310328,1.680528,Looper (2012)


In [5]:
df_ratings

Unnamed: 0,userId,movieId,rating
0,1,296,10
1,1,306,7
2,1,307,10
3,1,665,10
4,1,899,7
...,...,...,...
25000090,162541,50872,9
25000091,162541,55768,5
25000092,162541,56176,4
25000093,162541,58559,8


# Generating the Profile
Now, we need to query the client on these movies to generate their profile. Let's prompt the client until they provide 10 valid ratings. We need to keep track of each rating and its movieId, and we need to ensure that each rating is valid. Let's start by telling the client what we need from them.

In [6]:
# Prompt client
print("Hello viewer! Before we can start recommending movies,")
print("We need you to rate a few movies to compute your preference profile.")
print("Please rate the following movies on a scale of 1 to 10.")
print("If you wish to skip rating a movie, press ENTER.")

def prompt_till_valid(movie_title):
    client_prompt = "{}: ".format(movie_title)
    while True:
        client_response = input(client_prompt)
        # skip if necessary
        if (client_response == ''):
            break
        # else, convert input to float
        try:
            client_response = float(client_response)
        except ValueError:
            print("Sorry, we can only accept numerical ratings")
        else:
            # check if rating in 1-10
            if (1 <= client_response <= 10):
                break
            print("Sorry, ratings must be between 1 and 10")
    return client_response

Hello viewer! Before we can start recommending movies,
We need you to rate a few movies to compute your preference profile.
Please rate the following movies on a scale of 1 to 10.
If you wish to skip rating a movie, press ENTER.
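The validation inside `prompt_till_valid` can also be written as a pure function, which makes it easy to check without typing anything at a prompt. The sketch below is only an illustration: `parse_rating` and its `low`/`high` parameters are hypothetical names, not part of the notebook.

```python
def parse_rating(text, low=1, high=10):
    """Return the rating as a float, or None if the text is not a valid rating.

    Hypothetical helper: it mirrors the checks in prompt_till_valid
    (numeric input, value within [low, high]) without calling input().
    """
    try:
        value = float(text)
    except ValueError:
        # not a number at all (this also covers the empty string)
        return None
    # reject numbers outside the allowed range
    return value if low <= value <= high else None


for sample in ['8', '7.5', 'abc', '11', '']:
    print(repr(sample), '->', parse_rating(sample))
```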


Now, we need to write code that iterates over the movies we have curated and asks the client to rate them until we have 10 valid ratings. We can do this using the [iterrows](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html) method of dataframes. Let's store the ratings in a list, and then convert it to a dataframe after all the ratings are collected. Use the function given above to collect the client's response.

In [7]:
required_ratings = 10
df_client = []
for index, movie in df_metrics.iterrows():
    # call the function
    client_response = prompt_till_valid(movie['title'])
    # add to list if valid rating
    if (client_response != ''):
        df_client.append([movie['movieId'], client_response])
    # stop once we have enough ratings
    if (len(df_client) == required_ratings):
        break
df_client = pd.DataFrame(df_client, columns=['movieId', 'clientRating'])
df_client.set_index('movieId', inplace=True)

Forrest Gump (1994): 
Shawshank Redemption, The (1994): 
Pulp Fiction (1994): 
Silence of the Lambs, The (1991): 
Matrix, The (1999): 8
Star Wars: Episode IV - A New Hope (1977): 9
Jurassic Park (1993): 
Schindler's List (1993): 
Braveheart (1995): 
Fight Club (1999): 
Terminator 2: Judgment Day (1991): 
Star Wars: Episode V - The Empire Strikes Back (1980): 7
Toy Story (1995): 8
Lord of the Rings: The Fellowship of the Ring, The (2001): 
Usual Suspects, The (1995): 
Star Wars: Episode VI - Return of the Jedi (1983): 6
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981): 
American Beauty (1999): 
Godfather, The (1972): 9
Lord of the Rings: The Two Towers, The (2002): 
Lord of the Rings: The Return of the King, The (2003): 
Seven (a.k.a. Se7en) (1995): 
Fugitive, The (1993): 
Back to the Future (1985): 2
Independence Day (a.k.a. ID4) (1996): 6
Apollo 13 (1995): 7
Fargo (1996): 
Twelve Monkeys (a.k.a. 12 Monkeys) (1995): 
Saving Private Ryan (1998): 8
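To see the collection loop in isolation, here is a minimal sketch that replaces `input()` with a scripted list of answers. The names `toy_metrics`, `scripted`, and `toy_client` are hypothetical, and the toy table holds only four of the movies listed above.

```python
import pandas as pd

# Toy stand-in for df_metrics with just the columns the loop needs.
toy_metrics = pd.DataFrame({
    'movieId': [356, 318, 296, 593],
    'title': ['Forrest Gump (1994)', 'Shawshank Redemption, The (1994)',
              'Pulp Fiction (1994)', 'Silence of the Lambs, The (1991)'],
})

# Scripted answers stand in for input(); '' means the movie was skipped.
scripted = iter(['', '8', '', '9'])
required = 2

rows = []
for _, movie in toy_metrics.iterrows():
    answer = next(scripted)
    # keep the rating only if the movie was not skipped
    if answer != '':
        rows.append([movie['movieId'], float(answer)])
    # stop once enough ratings have been collected
    if len(rows) == required:
        break

toy_client = pd.DataFrame(rows, columns=['movieId', 'clientRating']).set_index('movieId')
print(toy_client)
```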


# Computing Cosine Similarity
Now we have enough ratings to start computing the similarity. First, let's create a simplified ratings dataframe by dropping the ratings of movies that the client has not rated.

Additionally, to make sure the similarity scores are meaningful, let's only consider users that have rated at least half of the movies the client has rated.

In [10]:
# get a list of the 10 movieIds from the client dataframe's index
client_movies = df_client.index.tolist()
# remove ratings of movies that are not among those 10
df_simple = df_ratings.drop(df_ratings[~df_ratings['movieId'].isin(client_movies)].index)
# remove ratings of users that have fewer than 5 ratings in common with the client
df_simple = df_simple[df_simple['userId'].map(df_simple['userId'].value_counts()) >= required_ratings / 2]
# merge client ratings
df_simple = df_simple.merge(df_client, left_on = 'movieId', right_index = True)
df_simple.sort_values('userId', inplace=True)
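The two filtering steps can be traced on a toy table. This is a sketch under assumed data: `toy_ratings`, `client_movie_ids`, and `min_common` are hypothetical, and boolean indexing is used in place of `drop`, which selects the same rows.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 40, 20],
    'rating':  [8, 6, 9, 7, 5, 4],
})
client_movie_ids = [10, 20]  # hypothetical movies rated by the client
min_common = 2               # hypothetical minimum number of shared movies

# keep only ratings of movies the client has rated
common = toy_ratings[toy_ratings['movieId'].isin(client_movie_ids)]
# for every remaining row, look up how many such ratings its user has
counts = common['userId'].map(common['userId'].value_counts())
# keep only users with enough movies in common
common = common[counts >= min_common]
print(common)
```

Only user 1 has rated both client movies, so only that user's two ratings remain.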

Now, let's create a structure to store the similarity between the client and every other user. We can use this to identify the most similar users.

In [11]:
# get all unique users and add columns to store the magnitude and dot product for each user
df_sim = pd.DataFrame(index=df_simple['userId'].unique(), 
                             columns = ['userMag', 'clientMag', 'dotProd'])

$$S(u, c) = \frac{\sum\limits_{m \in M} R(u, m) \times R(c, m)}{|u| \times |c|}$$

Recalling the formula for calculating cosine similarity, we need to compute user and client magnitudes, as well as pairwise dot products.
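As a quick numeric check of the formula, here is a sketch with made-up rating vectors over two shared movies (the numbers are illustrative only). Because ratings are positive, the similarity always falls between 0 and 1.

```python
import numpy as np

def cosine_similarity(u, c):
    """Dot product of the two rating vectors divided by the product of their magnitudes."""
    return np.dot(u, c) / (np.linalg.norm(u) * np.linalg.norm(c))

# The user rates both movies exactly twice as high as the client: same direction.
sim_same = cosine_similarity(np.array([8.0, 6.0]), np.array([4.0, 3.0]))
# The user and client have opposite preferences on the two movies.
sim_opposite = cosine_similarity(np.array([1.0, 10.0]), np.array([10.0, 1.0]))

print(sim_same, sim_opposite)
```

The first pair gives 50 / (10 × 5) = 1, and the second gives 20 / 101, roughly 0.198.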

The most direct way to compute these values is to go over every rating and update the running totals of the user who gave it. We can accumulate the squared sums first, and then take the square root at the end to obtain the magnitudes.

In [12]:
# fill the columns with 0
df_sim.fillna(0, inplace=True)
# for visualization
import progressbar
n = len(df_simple)
bar = progressbar.ProgressBar(maxval=n,
    widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
bar.start()
i = 0
for index, rating in df_simple.iterrows():
    bar.update(i)
    i+=1
    # update magnitudes and dotprod
bar.finish()
# take the square root to get the actual magnitudes
df_sim['userMag'] = df_sim['userMag'].pow(0.5)
df_sim['clientMag'] = df_sim['clientMag'].pow(0.5)
# check values
df_sim



Unnamed: 0,userMag,clientMag,dotProd
2,0.0,0.0,0
3,0.0,0.0,0
4,0.0,0.0,0
5,0.0,0.0,0
8,0.0,0.0,0
...,...,...,...
162530,0.0,0.0,0
162532,0.0,0.0,0
162533,0.0,0.0,0
162534,0.0,0.0,0
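The loop body in the cell above is left as a placeholder, which is why every value is still 0. One plausible way to fill it in is sketched below on a toy table; `toy_simple` and `toy_sim` are hypothetical stand-ins for `df_simple` and `df_sim`, and this is not necessarily the intended solution.

```python
import pandas as pd

# Toy stand-in for df_simple: two users, two shared movies each.
toy_simple = pd.DataFrame({
    'userId':       [1, 1, 2, 2],
    'rating':       [8, 6, 3, 4],
    'clientRating': [4, 3, 4, 3],
})
# One row per user, all running totals starting at 0.
toy_sim = pd.DataFrame(0.0, index=toy_simple['userId'].unique(),
                       columns=['userMag', 'clientMag', 'dotProd'])

for _, row in toy_simple.iterrows():
    user = row['userId']
    # accumulate squared ratings and the pairwise product for this user
    toy_sim.loc[user, 'userMag'] += row['rating'] ** 2
    toy_sim.loc[user, 'clientMag'] += row['clientRating'] ** 2
    toy_sim.loc[user, 'dotProd'] += row['rating'] * row['clientRating']

# take the square root to turn squared sums into magnitudes
toy_sim['userMag'] = toy_sim['userMag'].pow(0.5)
toy_sim['clientMag'] = toy_sim['clientMag'].pow(0.5)
print(toy_sim)
```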


Note how inefficient this would become on a larger scale, with tens of millions of users and billions of ratings. We can speed up this process by taking advantage of method chaining and the [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function. Use the helper function given below, together with [apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html#pandas.core.groupby.DataFrameGroupBy.apply), to compute the dot product. Note that a function used this way should return a single value for each group.

In [None]:
def computeDotProd(df):
    return (df['rating'] * df['clientRating']).sum()
# reset the dataframe
df_sim = pd.DataFrame(index=df_simple['userId'].unique(), 
                             columns = ['userMag', 'clientMag', 'dotProd'])
# method chain to get the magnitude
df_sim['userMag'] = df_simple['rating'].pow(2).groupby(df_simple['userId']).sum().pow(0.5)
df_sim['clientMag'] = df_simple['clientRating'].pow(2).groupby(df_simple['userId']).sum().pow(0.5)
# use helper function to compute dot product
df_sim['dotProd'] = df_
# check values
df_sim
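The dot-product line in the cell above is left unfinished. One plausible way to complete it, sketched on a toy table, is to group by `userId` and pass the helper to `apply`. The names `toy_simple`, `toy_sim`, and `dot_with_client` are hypothetical, and selecting the two rating columns before `apply` is a choice made here so that the helper only sees the columns it needs.

```python
import pandas as pd

def dot_with_client(group):
    """Sum of rating * clientRating within one user's group (a single value)."""
    return (group['rating'] * group['clientRating']).sum()

# Toy stand-in for df_simple: two users, two shared movies each.
toy_simple = pd.DataFrame({
    'userId':       [1, 1, 2, 2],
    'rating':       [8, 6, 3, 4],
    'clientRating': [4, 3, 4, 3],
})
toy_sim = pd.DataFrame(index=toy_simple['userId'].unique())

# magnitudes: square, sum per user, then take the square root
toy_sim['userMag'] = toy_simple['rating'].pow(2).groupby(toy_simple['userId']).sum().pow(0.5)
toy_sim['clientMag'] = toy_simple['clientRating'].pow(2).groupby(toy_simple['userId']).sum().pow(0.5)
# dot product: apply the helper once per user
toy_sim['dotProd'] = (toy_simple.groupby('userId')[['rating', 'clientRating']]
                      .apply(dot_with_client))
print(toy_sim)
```

Each assignment aligns on `userId`, so the results land in the right rows regardless of the order in which groups are processed.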

Now we have everything we need to compute the similarity.

In [None]:
# compute similarity and sort users by it (descending)
df_sim['similarity'] = np.divide(df_sim['dotProd'], np.multiply(df_sim['userMag'], df_sim['clientMag']))
df_sim.sort_values(by='similarity', ascending=False, inplace=True)
df_sim
# WHY ARE WE KEEPING THE MAGNITUDES AND DOTPROD COLUMNS?

Let's plot the similarities.

In [None]:
df_sim.reset_index().plot(y='similarity')

# Recommending Movies
Now that we have quantified the similarity between the client and users in the database, it is time to recommend movies to the client. Intuitively, you want to recommend movies that similar users have liked. There are a lot of ways of doing this, and the algorithm can be made as complicated as you want. 

For example, you could, for each movie $m$, compute a score based on every user $u$ that has rated $m$ and that user's similarity with the client $c$.
$$\text{score}(m) = \sum\limits_{u} \text{similarity}(c, u) \times \text{rating}(u, m)$$
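A minimal sketch of this similarity-weighted score on made-up data is shown below. `toy_similarity` and `toy_ratings` are hypothetical, and this is only one way the sum could be computed in pandas.

```python
import pandas as pd

# Hypothetical similarity of each user to the client, indexed by userId.
toy_similarity = pd.Series({1: 1.0, 2: 0.5}, name='similarity')
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2],
    'movieId': [10, 20, 10, 30],
    'rating':  [8, 6, 4, 10],
})

# look up each rating's weight from the similarity of the user who gave it
weights = toy_ratings['userId'].map(toy_similarity)
# sum similarity * rating over all users who rated each movie
scores = (weights * toy_ratings['rating']).groupby(toy_ratings['movieId']).sum()
print(scores.sort_values(ascending=False))
```

Movie 10 scores 1.0 × 8 + 0.5 × 4 = 10, ahead of movie 20 (6) and movie 30 (5), even though movie 30 received the highest single rating.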

Such functions are called utility or score functions. They are used to quantify the sentiments we wish to optimize for. You can take [CSCD84](http://www.cs.utoronto.ca/~strider/LectureNotes.html) if you want to learn more.

For now, we're going to use a simple algorithm: filter out users that have a similarity score of less than 0.95. Then, for each movie, we can compute a weighted average of its mean rating among similar users and its popularity among similar users.

In [None]:
# drop users below similarity threshold
sim_threshold = 0.95
df_sim_reduced = df_sim.drop(df_sim[df_sim['similarity'] < sim_threshold].index)
df_sim_reduced

In [None]:
# drop ratings not given by similar users
best_friends = df_sim_reduced.index.tolist()
friend_ratings = df_ratings.drop(df_ratings[~df_ratings['userId'].isin(best_friends)].index)
# calculate local average and popularity
df_mean = friend_ratings.groupby('movieId')['rating'].mean()
df_count = friend_ratings.groupby('movieId')['rating'].count()
# merge and drop already seen movies
df_metrics = pd.concat([df_count, df_mean], axis=1)
df_metrics.columns = ['count', 'mean']
# df_movies keeps movieId as a regular column so the merge below can join on it
df_metrics.reset_index(inplace=True)
df_metrics = df_metrics.drop(df_metrics[df_metrics['movieId'].isin(client_movies)].index)
# get movie titles
df_metrics = pd.merge(df_metrics, df_movies, on='movieId', how='left')

def scoreMovie(movie, alpha):
    return ((1-alpha) * movie['mean']) + (alpha * (movie['count'] / len(best_friends)))
df_metrics['score'] = df_metrics.apply(scoreMovie, alpha=0.95, axis=1)
# sort by score
df_metrics.sort_values('score', ascending=False, inplace=True)
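The same weighted score can be computed on whole columns at once instead of row by row. The sketch below uses made-up numbers; `toy_metrics` and `n_friends` are hypothetical stand-ins for `df_metrics` and `len(best_friends)`.

```python
import pandas as pd

toy_metrics = pd.DataFrame({
    'movieId': [10, 20],
    'count':   [4, 1],      # how many similar users rated the movie
    'mean':    [7.0, 9.0],  # mean rating among similar users (1 to 10 scale)
})
n_friends = 4  # hypothetical number of similar users
alpha = 0.95

# popularity is the fraction of similar users who rated the movie (0 to 1)
popularity = toy_metrics['count'] / n_friends
toy_metrics['score'] = (1 - alpha) * toy_metrics['mean'] + alpha * popularity
print(toy_metrics.sort_values('score', ascending=False))
```

Note that the mean is on a 1 to 10 scale while the popularity is a fraction between 0 and 1, so with `alpha = 0.95` the mean term contributes at most 0.5 and the popularity term at most 0.95. In this toy example the widely rated movie 10 (score 1.30) outranks the better-rated but less popular movie 20 (score 0.6875).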

In [None]:
df_metrics