# Machine Learning with Collaborative Filtering

## Contents
1. **Loading** movie dataset
2. Creating your movie **profile**
3. **Predicting** ratings
4. Finding **recommendations**

In this challenge exercise, we will implement user-item collaborative filtering. 
We will first make a prediction for the rating you would give to a certain movie. Then we will generate recommendations, based on the movies that people similar to you liked.

You will need two data files, which can be found in the `moviedata` folder. The data is separated with tabs.
* `ratings.data`, which contains 100,000 movie ratings by many users (4 fields: user_id, movie_id, rating, timestamp).
* `movies.data`, a mapping between movie IDs and titles.

## 1. Loading movie dataset

First, we will load the `ratings.data` and `movies.data`. We will use the Python library [Pandas](http://pandas.pydata.org/) for this and load the data into a [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

Find the function in Pandas that is used to read csv data and use it to load the ratings and titles. Don't forget to specify the separator!

(Hint: if you encounter encoding problems, set the encoding to 'cp1252'.)

In [None]:
import pandas as pd

ratings =  # TODO: load moviedata/ratings.data
movies =  # TODO: load moviedata/movies.data

In [2]:
### SOLUTION ###
import pandas as pd

# Both files live in the moviedata folder and are tab-separated
ratings = pd.read_csv("moviedata/ratings.data", sep="\t")
movies = pd.read_csv("moviedata/movies.data", sep="\t", encoding="cp1252")

We can look at our new dataframes with the `.head()` function. This gives us the first five rows of the dataframe (unless you put a different number between the brackets).

In [6]:
movies.head()

Unnamed: 0,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [3]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In our ratings dataframe, we don't really need the timestamp column. Remove this column from the dataframe.

In [None]:
ratings =  # TODO: remove timestamp column

In [4]:
### SOLUTION ###
ratings = ratings.drop(['timestamp'], axis=1)

Now we will try to combine the ratings with the movie titles. We will write a function that can turn a column with movie id's into a column with movie names. By turning this into a function, we can easily apply it to other dataframes later.

In [None]:
def add_titles(df, movie_titles_df):
    df =  # merge dataframes
    df =  # remove movie_id column
    return df

In [6]:
### SOLUTION ###
def add_titles(df, movie_titles_df):
    df = df.merge(movie_titles_df, on='movie_id')
    df = df.drop(['movie_id'], axis=1)
    return df

In [7]:
ratings = add_titles(ratings, movies)
ratings.head()

Unnamed: 0,user_id,rating,title
0,196,3,Kolya (1996)
1,63,3,Kolya (1996)
2,226,5,Kolya (1996)
3,154,3,Kolya (1996)
4,306,5,Kolya (1996)


You should not remove the movie id column without thinking though. This id could separate two movies with the same title (and year) that are not the same. However, in this case, this is not necessary so we can safely remove it.
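If you want to check for yourself whether titles are unique, pandas' `duplicated` method can help. The sketch below uses a small made-up dataframe (`toy_movies` is a hypothetical stand-in, not the real data); on the real data you would run the same filter on `movies`.

```python
import pandas as pd

# Toy stand-in for the movies dataframe (made-up rows, not the real data).
toy_movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "title": ["Sabrina (1995)", "Sabrina (1954)", "Remake (1997)", "Remake (1997)"],
})

# keep=False marks every row whose title occurs more than once,
# so both copies of a repeated title show up in the result.
duplicated_titles = toy_movies[toy_movies.title.duplicated(keep=False)]
print(duplicated_titles)
```

An empty result means every title maps to exactly one id, so dropping the id loses nothing.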

## 2. Creating your own movie profile

Now we have a dataframe that describes how a lot of users rate movies. The idea of collaborative filtering is to find users who are similar to you and look at the movies they like or don't like. Your own ratings can be predicted based on the ratings your "neighbours" give these movies.

So before we can find your neighbours, you need to rate a couple of movies. Let's create a movie profile for you.

### 2.1. Getting the most famous movies

The first step will be to filter our movies. To make sure you know the movies you're about to rate, we will find the 100 most reviewed movies.

Follow these steps:
- Group the ratings dataframe by title
- On these grouped items, apply the count function, this will give you a dataframe
- Select the rating column. (Hint: `df["colname"]`)
- Sort the values descending, so the most reviewed titles appear on top.
- Take the first 100 rows. (Hint: use `head()`)

You can do this all in one long chain and assign it to a new dataframe.

In [None]:
most_famous_movies =  # TODO: group, count, select, sort and take first 100

In [8]:
### SOLUTION ###
most_famous_movies = ratings.groupby("title").count().rating.sort_values(ascending=False).head(100)

Have a look at the result with `head()`. The most reviewed movie should have 583 reviews. Can you guess which one it is?

In [None]:
# TODO: check result

In [9]:
### SOLUTION ###
most_famous_movies.head()

title
Star Wars (1977)             583
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: rating, dtype: int64

You might also notice that this result is not displayed as a table anymore. That is because the result has only an index and one column, so pandas turned it into a [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) instead. Luckily for us, series work almost exactly the same as dataframes.
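To see the difference between the two types, here is a small sketch on made-up data (`toy_ratings` is hypothetical). It also shows `value_counts`, which should give the same counts in a single call, and `reset_index`, which turns the series back into a dataframe if you prefer a table.

```python
import pandas as pd

# Made-up ratings, only to illustrate the shapes involved.
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 1],
    "rating": [5, 4, 3, 2, 1, 4],
    "title": ["A", "A", "A", "B", "B", "C"],
})

# Selecting one column of the grouped counts yields a Series, not a DataFrame.
counts = toy_ratings.groupby("title").count()["rating"].sort_values(ascending=False)
print(type(counts).__name__)

# reset_index turns the Series back into a two-column DataFrame.
counts_df = counts.reset_index(name="n_ratings")
print(counts_df)

# value_counts counts the rows per title and sorts descending in one call.
shortcut = toy_ratings["title"].value_counts()
print(shortcut)
```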

### 2.2. Rating a few of the most famous movies

Now onto step 2: reviewing these movies. We will write a loop that presents a movie title to you and asks you to rate it. You don't want to rate all 100 movies though, so let's just rate five of them.

First, we want a small function that allows you to enter a number. So write a function that:
- Gets input from the user. (Hint: experiment with `input("Hello")`.)
- Then check if the entered text is a number.
- If the text is a number between 1 and 5: return the number
- Else: return None (python's null value)
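One possible way to structure this is to keep the validation in a separate function, so it can be tried out without typing anything. The sketch below is an assumption about how you might do it, not the only solution; `parse_rating` is a hypothetical helper name. It uses `try`/`except` rather than `isdigit`, because a few characters such as `"²"` count as digits for `isdigit` but cannot be converted by `int`.

```python
def parse_rating(text):
    """Return the rating as an int if text is a whole number from 1 to 5, else None."""
    try:
        value = int(text)
    except ValueError:
        # Not a whole number at all (empty string, letters, "3.5", ...).
        return None
    return value if 1 <= value <= 5 else None


# A few quick checks without needing keyboard input.
for candidate in ["3", "6", "", "abc", "²"]:
    print(repr(candidate), "->", parse_rating(candidate))
```

A `get_int_input` built on top of this would simply return `parse_rating(input(descr))`.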

In [None]:
def get_int_input(descr):
    # TODO: write function

In [10]:
### SOLUTION ###
def get_int_input(descr):
    s = input(descr)
    return int(s) if s.isdigit() and int(s) >= 1 and int(s) <= 5 else None

Now we write a loop that keeps showing you movie titles until you have rated 5 of them.

Here are the steps:
- Create a new, empty dataframe with columns "title" and "rating".
- Shuffle the most famous movies. (Hint: check pandas' `sample` function)
- Ask rating for every movie.
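- Append result to the new dataframe if it is not None. (Hint: wrap the dict in a one-row dataframe and add it with `pd.concat`, setting `ignore_index` to True. Older tutorials use `DataFrame.append`, which was removed in pandas 2.0.)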
- If we have enough ratings: break the loop.

When you are done you can run the cell and rate five movies. You can skip the ones you don't know by pressing enter, or anything other than a number.
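If growing a dataframe inside a loop feels awkward, an alternative is to collect the rows in a plain Python list and build the dataframe once at the end. The sketch below shows that pattern with simulated answers instead of keyboard input; the titles, the answers and the name `toy_user_ratings` are made up for illustration.

```python
import pandas as pd

# Simulated answers (hypothetical); None stands for a movie you skipped.
simulated_answers = [
    ("Movie A", None),
    ("Movie B", 4),
    ("Movie C", 2),
    ("Movie D", None),
    ("Movie E", 5),
]
ratings_needed = 2

rows = []
for title, rating in simulated_answers:
    if rating is not None:
        rows.append({"title": title, "rating": rating})
    # Stop as soon as enough movies have been rated.
    if len(rows) >= ratings_needed:
        break

# Build the dataframe in one go from the collected rows.
toy_user_ratings = pd.DataFrame(rows, columns=["title", "rating"])
print(toy_user_ratings)
```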

In [None]:
user_ratings =  # TODO: create new dataframe
ratings_needed = 5
shuffled_movies =  # TODO: shuffle most_famous_movies

for title, count in shuffled_movies.items():  # Series.iteritems() was removed in pandas 2.0
    rating = get_int_input("What rating (1-5) would you give to the movie " + title + "? ")
    
    # TODO: append rating to user_ratings if it is not None
    
    if len(user_ratings) >= ratings_needed:
        break

In [11]:
### SOLUTION ###
user_ratings = pd.DataFrame(columns=["title", "rating"])
ratings_needed = 5
shuffled_movies = most_famous_movies.sample(frac=1)

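for title, count in shuffled_movies.items():  # Series.iteritems() was removed in pandas 2.0
    rating = get_int_input("What rating (1-5) would you give to the movie " + title + "? ")
    
    if rating is not None:
        # DataFrame.append() was removed in pandas 2.0, so concatenate a one-row dataframe instead
        new_row = pd.DataFrame([{"title": title, "rating": rating}])
        user_ratings = pd.concat([user_ratings, new_row], ignore_index=True)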
    
    if len(user_ratings) >= ratings_needed:
        break

What rating (1-5) would you give to the movie Terminator 2: Judgment Day (1991)? 6
What rating (1-5) would you give to the movie Clockwork Orange, A (1971)? 1
What rating (1-5) would you give to the movie Game, The (1997)? 2
What rating (1-5) would you give to the movie Twelve Monkeys (1995)? 2
What rating (1-5) would you give to the movie Speed (1994)? 5
What rating (1-5) would you give to the movie Dances with Wolves (1990)? 4


In [12]:
most_famous_movies

title
Star Wars (1977)                                583
Contact (1997)                                  509
Fargo (1996)                                    508
Return of the Jedi (1983)                       507
Liar Liar (1997)                                485
English Patient, The (1996)                     481
Scream (1996)                                   478
Toy Story (1995)                                452
Air Force One (1997)                            431
Independence Day (ID4) (1996)                   429
Raiders of the Lost Ark (1981)                  420
Godfather, The (1972)                           413
Pulp Fiction (1994)                             394
Twelve Monkeys (1995)                           392
Silence of the Lambs, The (1991)                390
Jerry Maguire (1996)                            384
Chasing Amy (1997)                              379
Rock, The (1996)                                378
Empire Strikes Back, The (1980)                 367
Star T

Now you can view your profile:

In [13]:
user_ratings

Unnamed: 0,title,rating
0,"Clockwork Orange, A (1971)",1
1,"Game, The (1997)",2
2,Twelve Monkeys (1995),2
3,Speed (1994),5
4,Dances with Wolves (1990),4


## 3. Predicting ratings

Based on these ratings, we will predict your opinion about other movies. The general approach to predicting your rating for a certain movie is to look at all the users who have rated it and take the average of their ratings. That average is weighted, though: each rating is multiplied by a weight based on how similar that user's ratings are to your own.

Now let's do that step by step.

We begin with the most basic function: getting a movie title based on an id. This will make it easier for us to test our code, because we won't need to type the entire movie title every time.

Write a function that uses the `movies` dataframe to return the title of a movie id. You'll need some pandas for this.

In [None]:
def get_movie_title(movie_id):
    # TODO: complete function

In [14]:
### SOLUTION ###
def get_movie_title(movie_id):
    return movies[movies.movie_id==movie_id]["title"].iloc[0]

Now we can use the function to select a movie:

In [15]:
selected_movie = get_movie_title(1)
selected_movie

'Toy Story (1995)'

We will have a look at your predicted rating for this movie. If you rated this one, you can change the movie id.

### 3.1. Normalising ratings

The next important part is to normalise your ratings. Basically, you subtract your average rating from each of them. We do this because a five-star rating from someone who gives every movie five stars says less about their taste than the same rating from someone who rates some movies higher and some lower. Subtracting the average centers your ratings around zero: a negative value means you enjoyed the movie less than usual, a positive value means you liked it more than usual.
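As a small illustration of centring ratings around zero, the sketch below uses the five ratings from the example profile shown earlier in this notebook. The names `toy_profile` and `centred` are introduced only for this example.

```python
import pandas as pd

# The example profile from section 2.2 of this notebook.
toy_profile = pd.DataFrame({
    "title": [
        "Clockwork Orange, A (1971)",
        "Game, The (1997)",
        "Twelve Monkeys (1995)",
        "Speed (1994)",
        "Dances with Wolves (1990)",
    ],
    "rating": [1, 2, 2, 5, 4],
})

avg = toy_profile.rating.mean()

# Subtract the average so that liked movies become positive, disliked ones negative.
toy_profile["centred"] = toy_profile.rating - avg
print(toy_profile)
```

The centred values add up to zero (apart from floating point rounding), which is what "centred around zero" means here.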

From your user_ratings dataframe, calculate the average of your rating column. Store this as a variable, we will need it later.

In [None]:
user_avg_rating =  # TODO: calculate your rating average
print(user_avg_rating)

In [16]:
### SOLUTION ###
user_avg_rating = user_ratings.rating.mean()
print(user_avg_rating)

2.8


### 3.2. Finding voters

Now we will have a look at other people who rated our movie. We will refer to them as voters. 

Filter the ratings dataframe by our selected movie title and then get the id's of those users.

In [None]:
voters =  # TODO: find voters id's in rating df

In [17]:
### SOLUTION ###
voters = ratings[ratings['title']==selected_movie]['user_id']

In [18]:
print("Number of voters who rated {}: {}".format(selected_movie, len(voters)))

Number of voters who rated Toy Story (1995): 452


If you didn't change the selected movie, there should be 452 voters.

### 3.3. Comparing one voter to you

We will focus on one voter first and compare him to you. After that, we will generalise it for all voters. We will select the first voter for now. We will assign his `user_id` to a variable so we can use it later.

In [19]:
voter = voters.iloc[0]  # iloc[0] gets the element at index 0
voter

308

Now get all the voter's ratings from the ratings dataframe.

In [None]:
voter_ratings =  # TODO: get voter's ratings
voter_ratings.head()

In [20]:
### SOLUTION ###
voter_ratings = ratings[ratings.user_id==voter]
voter_ratings.head()

Unnamed: 0,user_id,rating,title
845,308,3,"Hunt for Red October, The (1990)"
1724,308,4,Men in Black (1997)
2601,308,3,Sabrina (1995)
2781,308,4,"Man Without a Face, The (1993)"
2813,308,4,Sabrina (1954)


We would like to know how similar the voter is to you. If you have rated no movies in common, there is no point in considering him. So let's compare your ratings to his.

Combine your ratings with pandas' `merge` function (or alternatives).

In [None]:
mutual_ratings =  # TODO: merge voter_ratings and user_ratings
mutual_ratings

In [21]:
### SOLUTION ###
mutual_ratings = pd.merge(voter_ratings, user_ratings, on="title")
mutual_ratings

Unnamed: 0,user_id,rating_x,title,rating_y
0,308,5,Speed (1994),5
1,308,4,Twelve Monkeys (1995),2
2,308,1,Dances with Wolves (1990),4
3,308,4,"Clockwork Orange, A (1971)",1


Do you and the voter have the same taste, or in other words: do you rate the movies you have in common in a similar way? We would like to put a number to that question. So let's calculate the correlation coefficient of your ratings. It measures how strongly two arrays move together, on a scale from -1 to 1, and is available as a function in numpy. Google a bit until you find the right function. This function returns a matrix of values; you need the element at `[0,1]`.

Hint: Get your ratings from the mutual_ratings dataframe and convert them to a list.

In [None]:
import numpy as np

corr =  # TODO: calculate correlation between your ratings and voter's ratings
corr

In [22]:
import numpy as np

### SOLUTION ###
corr = np.corrcoef(list(mutual_ratings.rating_x), list(mutual_ratings.rating_y))[0,1]
corr

-0.10540925533894599

If this number is NaN (not a number), then you have no movies in common with this voter or there is not enough distinction between your or his ratings. Just go back, pick another voter and run all the cells up to this point again. We will deal with these cases later.
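The sketch below illustrates, on made-up numbers, when `np.corrcoef` returns NaN: when one side never varies, and when there is only a single movie in common. The warnings numpy prints in those cases are silenced here just to keep the output readable.

```python
import warnings

import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # One side gave every movie the same rating: no variation to correlate with.
    constant = np.corrcoef([4, 4, 4], [1, 3, 5])[0, 1]
    # Only one movie in common: a single point has no spread at all.
    single = np.corrcoef([5], [3])[0, 1]

# Both sides vary and move in the same direction: a positive correlation.
varied = np.corrcoef([1, 2, 5], [2, 3, 5])[0, 1]

print(constant, single, varied)
```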

Last but not least, we need the voter's rating for our movie and his average rating. Calculate those two and store them in variables.

In [None]:
voter_movie_rating =  # TODO: find voter's rating for selected_movie
print(voter_movie_rating)

voter_avg_rating =  # TODO: calculate voter's average rating
print(voter_avg_rating)

In [23]:
### SOLUTION ###
voter_movie_rating = voter_ratings[voter_ratings.title==selected_movie]["rating"].iloc[0]
print(voter_movie_rating)

voter_avg_rating = voter_ratings.rating.mean()
print(voter_avg_rating)

4
3.7581863979848866


We're almost done with this step. Now comes the magic math: we multiply the voter's normalised rating with his correlation with you. This we divide by the absolute correlation. Then we add your average rating to un-normalise it for you and there is your rating, based on one single voter.

In [24]:
correlated_rating = corr * (voter_movie_rating - voter_avg_rating)
user_movie_rating = user_avg_rating + (correlated_rating / abs(corr))
user_movie_rating

2.5581863979848865
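With a single voter, multiplying by the correlation and dividing by its absolute value leaves only the sign of the correlation. The sketch below re-does the calculation with the numbers printed above to make that visible; `deviation` and `single_voter_prediction` are names introduced for this example.

```python
import numpy as np

# Values copied from the cell outputs above.
corr = -0.10540925533894599
voter_movie_rating = 4
voter_avg_rating = 3.7581863979848866
user_avg_rating = 2.8

# How far above the voter's own average this movie was rated.
deviation = voter_movie_rating - voter_avg_rating

# corr / abs(corr) is just the sign of the correlation (-1 here),
# so a negatively correlated voter flips the direction of the deviation.
single_voter_prediction = user_avg_rating + np.sign(corr) * deviation
print(single_voter_prediction)
```

The size of the correlation only starts to matter once several voters are combined, because then it decides how much each voter counts.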

### 3.4. Comparing all voters to you

We don't want to base our rating on one single voter though. We will edit the code a bit to base it on all users that have rated the movie.

We will:
- Loop over all the voters.
- Calculate the same things we did before for one voter.
- Ignore voters with a correlation of NaN.
- Sum the absolute correlation and the correlated rating for all voters and then divide them in the end.
- Turn this all into a function that we can use again later
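Before writing the loop, it can help to see the arithmetic on its own. The sketch below applies the steps above to four made-up voters, using arrays instead of dataframes; all numbers are invented for illustration and the fourth voter has a NaN correlation to show how it is ignored.

```python
import numpy as np

user_avg_rating = 3.0

# Made-up correlations with four voters; the last one is NaN and must be ignored.
corrs = np.array([0.9, -0.5, 0.2, np.nan])
# Each voter's rating for the movie minus that voter's own average.
deviations = np.array([1.0, 0.5, -1.0, 2.0])

usable = ~np.isnan(corrs)

# Numerator: deviations weighted by correlation.
corr_rating_sum = np.sum(corrs[usable] * deviations[usable])
# Denominator: total weight, using absolute values so negatives do not cancel out.
abs_corr_sum = np.sum(np.abs(corrs[usable]))

prediction = user_avg_rating + corr_rating_sum / abs_corr_sum
print(prediction)
```

Here the numerator is 0.9 - 0.25 - 0.2 = 0.45 and the denominator is 1.6, so the prediction lands a little above the user's average.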

In [None]:
def predicted_rating(user_ratings, ratings, selected_movie):
    abs_corr_sum = 0
    corr_rating_sum = 0
    
    voters =  # TODO: find voters id's in rating df
    
    for voter in voters:
        # get voter's movies
        voter_ratings =  # TODO: get voter's ratings

        # check similarity to user not nan
        mutual_ratings =  # TODO: merge voter_ratings and user_ratings
        corr =  # TODO: calculate correlation between your ratings and voter's ratings

        if not np.isnan(corr):
            # get selected movie rating
            voter_movie_rating =  # TODO: find voter's rating for selected_movie

            # get voter's average
            voter_avg_rating =  # TODO: calculate voter's average rating

            # sum similarity
            abs_corr_sum += abs(corr)
            corr_rating_sum += corr * (voter_movie_rating - voter_avg_rating)

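    # calculate predicted rating from the profile passed in (not a global variable);
    # fall back to the user's own average when no voter has a usable correlation
    user_avg_rating = user_ratings.rating.mean()
    if abs_corr_sum == 0:
        return user_avg_rating
    user_movie_rating = user_avg_rating + (corr_rating_sum / abs_corr_sum)
    return user_movie_rating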

In [25]:
### SOLUTION ###
def predicted_rating(user_ratings, ratings, selected_movie):
    abs_corr_sum = 0
    corr_rating_sum = 0
    
    voters = ratings[ratings['title']==selected_movie]['user_id']

    for voter in voters:
        # get voter's movies
        voter_ratings = ratings[ratings.user_id==voter]

        # check similarity to user not nan
        mutual_ratings = pd.merge(voter_ratings, user_ratings, on="title")
        corr = np.corrcoef(list(mutual_ratings.rating_x), list(mutual_ratings.rating_y))[0,1]

        if not np.isnan(corr):
            # get selected movie rating
            voter_movie_rating = ratings[(ratings.user_id==voter) & (ratings.title==selected_movie)]["rating"].iloc[0]

            # get voter's average
            voter_avg_rating = voter_ratings.rating.mean()

            # sum similarity
            abs_corr_sum += abs(corr)
            corr_rating_sum += corr * (voter_movie_rating - voter_avg_rating)

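    # calculate predicted rating from the profile passed in (not a global variable);
    # fall back to the user's own average when no voter has a usable correlation
    user_avg_rating = user_ratings.rating.mean()
    if abs_corr_sum == 0:
        return user_avg_rating
    user_movie_rating = user_avg_rating + (corr_rating_sum / abs_corr_sum)
    return user_movie_rating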

Now we can use our function to get the actual prediction for the movie:

In [26]:
print("Predicted rating for {}: {}".format(selected_movie, round(predicted_rating(user_ratings, ratings, selected_movie))))


Predicted rating for Toy Story (1995): 3.0


In [27]:
# If you get a lot of warnings (these come from the correlation function), you can run this cell to turn all warnings off.
import warnings
warnings.filterwarnings('ignore')

You are now ready to make predictions! Try it for different movies, and see if it accurately reflects your tastes. If it does, you can use it to decide which movie to watch tonight! (Bear in mind though that this rating is only based on five ratings of yours, so it might not always be accurate.)

In [29]:
movie = get_movie_title(4)
print("Predicted rating for {}: {}".format(movie, round(predicted_rating(user_ratings, ratings, movie))))

Predicted rating for Get Shorty (1995): 3.0


## 4. Finding recommendations

Now we can predict ratings for a specific movie, but when you're wondering what to watch you won't know which movie to look for. We could calculate the predicted rating for every single movie, but there is another solution as well. We could find the voters that are most similar to you, your neighbours, and see which movies they liked most.

### 4.1. Finding your neighbours

Calculate your correlation with every user in the ratings dataframe.

In [None]:
corr_list = []
for voter in ratings.user_id.unique():
    voter_ratings =  # TODO: get voter's ratings
    mutual_ratings =  # TODO: merge voter_ratings and user_ratings
    corr =  # TODO: calculate correlation between your ratings and voter's ratings
    corr_list.append([voter, len(mutual_ratings), corr])
voter_corr = pd.DataFrame(corr_list, columns=["user_id", "movies_in_common", "corr"])

In [30]:
### SOLUTION ###
corr_list = []
for voter in ratings.user_id.unique():
    voter_ratings = ratings[ratings.user_id==voter]
    mutual_ratings = pd.merge(voter_ratings, user_ratings, on="title")
    corr = np.corrcoef(list(mutual_ratings.rating_x), list(mutual_ratings.rating_y))[0,1]
    corr_list.append([voter, len(mutual_ratings), corr])
voter_corr = pd.DataFrame(corr_list, columns=["user_id", "movies_in_common", "corr"])

Now we have a dataframe of our similarity with all users.
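The loop above is easy to follow but does one merge per user. As an alternative sketch (not the approach used in the rest of this notebook), you could reshape the ratings into a title-by-user table and let pandas' `corrwith` compute all correlations at once. The data below is made up, and `toy_ratings`, `toy_user` and `toy_corr` are hypothetical names.

```python
import warnings

import pandas as pd

# Made-up ratings from three voters.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "title": ["A", "B", "C", "A", "B", "A"],
    "rating": [4, 2, 3, 1, 5, 3],
})
# Your own made-up profile, indexed by title.
toy_user = pd.Series({"A": 5, "B": 1, "C": 3})

# One row per title, one column per voter; unrated movies become NaN.
matrix = toy_ratings.pivot(index="title", columns="user_id", values="rating")

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # a single shared movie triggers numpy warnings
    # Correlate every voter column with your profile, using only shared titles.
    toy_corr = matrix.corrwith(toy_user)

print(toy_corr)
```

Voter 1 rates in the same direction as the profile, voter 2 in the opposite direction, and voter 3 shares only one movie, so that correlation is NaN.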

In [31]:
voter_corr.head()

Unnamed: 0,user_id,movies_in_common,corr
0,196,0,
1,63,1,
2,226,3,-0.944911
3,154,1,
4,306,0,


We want to find out which users are most similar to us. In this case, we want the ten users with the most movies in common and the highest correlation.

Create a list of similar voters:
- Sort the `voter_corr` dataframe, by `movies_in_common` and then by `corr`. Sort the values descending so the highest ones appear on top. 
- Take the first ten rows to get the most similar users.
- Select the `user_id` column. 
- Cast this result to a list.

In [None]:
similar_voters =  # TODO: create list of most similar user_id's
print(similar_voters)

In [32]:
### SOLUTION ###
similar_voters = list(voter_corr.sort_values(["movies_in_common", "corr"], ascending=False).head(10).user_id)
print(similar_voters)

[178, 13, 416, 301, 303, 655, 606, 64, 592, 363]


Those are the id's of your neighbours.

### 4.2. Finding your neighbours' favourite movies

Now we have a look at those users and the movies they gave high ratings.

Follow these steps:
- Start from the ratings dataframe.
- Filter out user_id's that don't appear in `similar_voters`.
- Filter out titles that appear in `user_ratings.title`.
- Group by title.
- Aggregate by both count and mean.
- Select rating column.
- Filter out movies with a mean rating of less than 4.
- Sort the result, on count and then on mean, descending.
- Take the top 10 rows.

In [None]:
recommendations =  # TODO: filter voters and already seen movies from ratings df
recommendations =  # TODO: group by title and aggregate
recommendations =  # TODO: filter mean ratings less than 4
recommendations =  # TODO: sort and take first 10
recommendations

In [33]:
### SOLUTION ###
recommendations = ratings[(ratings.user_id.isin(similar_voters)) & (~ratings.title.isin(user_ratings.title))]
recommendations = recommendations.groupby("title").agg(["count","mean"])["rating"]
recommendations = recommendations[recommendations["mean"] >= 4]
recommendations = recommendations.sort_values(["count","mean"], ascending=False).head(10)
recommendations

Unnamed: 0_level_0,count,mean
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Chasing Amy (1997),11,4.181818
Star Wars (1977),10,4.9
"Empire Strikes Back, The (1980)",10,4.7
"Godfather, The (1972)",10,4.7
Fargo (1996),10,4.6
Pulp Fiction (1994),10,4.6
Raiders of the Lost Ark (1981),10,4.6
Return of the Jedi (1983),10,4.5
Alien (1979),10,4.4
Field of Dreams (1989),10,4.4


There you go, you have just built your own recommendation engine from scratch. Time to start making money!