# CSCE 470 :: Information Storage and Retrieval :: Texas A&M University :: Fall 2018


# Homework 3 (and 4):  Recommenders

### 100 points [10% of your final grade; that's double!]

### Due: November 8, 2018

*Goals of this homework:* Put your knowledge of recommenders to work. 

*Submission Instructions (Google Classroom):* To submit your homework, rename this notebook as  `lastname_firstinitial_hw#.ipynb`. For example, my homework submission would be: `caverlee_j_hw3.ipynb`. Submit this notebook via **Google Classroom**. Your IPython notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so).

# Part 0: Movielens Data

For this first part, we're going to use part of the Movielens 100k dataset. Prior to the Netflix Prize, the Movielens data was **the** most important collection of movie ratings.

First off, we need to load the data (including u.user, u.item, and ua.base). Here, we provide you with some helper code to load the data using [Pandas](http://pandas.pydata.org/). Pandas is a nice package for Python data analytics.

You may need to install pandas first, for example with:

`conda install --name cs470 pandas`

In [None]:
import pandas as pd

# Load the user data
users_df = pd.read_csv('u.user', sep='|', names=['UserId', 'Age', 'Gender', 'Occupation', 'ZipCode'])

# Load the movies data: we will only use movie id and title for this homework
movies_df = pd.read_csv('u.item', sep='|', names=['MovieId', 'Title'], usecols=range(2), encoding='ISO-8859-1')

# Load the ratings data: ignore the timestamps
ratings_df = pd.read_csv('ua.base', sep='\t', names=['UserId', 'MovieId', 'Rating'],usecols=range(3))

# Working on three different data frames is a pain
# Let us create a single dataset by "joining" these three data frames
movie_ratings_df = pd.merge(movies_df, ratings_df)
movielens_df = pd.merge(movie_ratings_df, users_df)

movielens_df.head()

# Part 1. Let's find similar users [20 points]

Before we get to the actual task of building our recommender, let's familiarize ourselves with the Movielens data.

Pandas is really nice, since it lets us do simple aggregates. For example, we can find a specific user and take a look at that user's ratings. For the user with user ID = 363, we have:

In [None]:
gb = movielens_df.groupby('UserId')
User363 = gb.get_group(363)
#the information for the user
User363[:1][["UserId", "Age", "Gender","Occupation", "ZipCode"]]

In [None]:
# And then we can see his first 10 ratings:
User363[['Title', 'Rating']][:10]

Balderdash! Everyone agrees that Toy Story should be rated 5! Oh well, there's no accounting for taste.

Moving on, let's try our hand at finding similar users to this base user (user ID = 363). In each of the following, **find the top-10 most similar users** to this base user. You should use all of the user's ratings, not just the top-10 like we showed above. We're going to try different similarity methods and see what differences arise.

You should implement each of these similarity methods yourself!
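To make the three measures concrete, here is a minimal sketch on two made-up users, not a reference solution. It assumes each user is stored as a plain dict of `{MovieId: Rating}`; the function names and toy ratings are hypothetical. It also assumes unrated movies count as 0 for cosine and that Pearson is computed over co-rated movies only. Other conventions exist, so say which one you picked.

```python
import math

def jaccard(a, b):
    """Jaccard similarity over the sets of rated movie ids (rating values ignored)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine similarity, treating unrated movies as 0 (an assumption)."""
    dot = sum(a[m] * b[m] for m in set(a) & set(b))
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson(a, b):
    """Pearson correlation over co-rated movies only (an assumption)."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[m] for m in common) / len(common)
    mean_b = sum(b[m] for m in common) / len(common)
    cov = sum((a[m] - mean_a) * (b[m] - mean_b) for m in common)
    var_a = sum((a[m] - mean_a) ** 2 for m in common)
    var_b = sum((b[m] - mean_b) ** 2 for m in common)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0

# Made-up ratings, {MovieId: Rating}
user_a = {1: 5, 2: 3, 3: 4}
user_b = {1: 4, 2: 2, 4: 5}

print(jaccard(user_a, user_b), cosine(user_a, user_b), pearson(user_a, user_b))
```

Note how the two users share only two of four movies (Jaccard 0.5) yet agree perfectly on the ordering of the movies they both rated, which is what Pearson picks up.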

###     Top-10 Most Similar Users Using
#### Jaccard

In [None]:
# your code here

####     Cosine

In [None]:
# your code here

#### Pearson

In [None]:
# your code here

### What are the differences among these three similarity methods? Which one do you prefer, why?

 your answer here

# Part 2: User-User Collaborative Filtering: Similarity-Based Ratings Prediction [20 points]

Now let's estimate the rating of UserID 363 for the movie "Dances with Wolves (1990)" (MovieId 97) based on the similar users. Find the 10 nearest users (most similar by Pearson) who rated the movie "Dances with Wolves (1990)" and try different aggregate functions. Recall, there are many different ways to aggregate the ratings of the nearest neighbors. We'll try three popular methods:

### Method 1: Average. 
The first is to simply average the ratings of the $N$ nearest neighbors, where $\hat{C}$ is the set of those neighbors:
$r_{c,s} = \frac{1}{N}\sum_{c'\in \hat{C}}r_{c',s}$

In [None]:
# your code here

### Method 2: Weighted Average 1. 
The second is to take a weighted average, where we weight more "similar" neighbors higher: $r_{c,s} = k\sum_{c'\in \hat{C}}sim(c, c')\times r_{c',s}$

Choose a reasonable $k$ so that $r_{c,s}$ is between 1 and 5.
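One common choice, shown here only as a sketch, is to normalize by the total absolute similarity, $k = 1 / \sum_{c'\in \hat{C}}|sim(c, c')|$. The function name and the neighbor values below are made up for illustration. With non-negative similarities this keeps the estimate inside the range of the neighbors' ratings; with negative similarities it may not, so you may need to handle that case separately.

```python
def weighted_average(neighbors):
    """neighbors: list of (similarity, rating) pairs for users who rated the movie."""
    total = sum(abs(sim) for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbors; the caller must fall back to something else
    k = 1.0 / total  # assumed normalization, one of several reasonable choices
    return k * sum(sim * rating for sim, rating in neighbors)

# Made-up (similarity, rating) pairs
neighbors = [(0.9, 5), (0.6, 4), (0.3, 2)]
prediction = weighted_average(neighbors)
print(round(prediction, 3))
```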

In [None]:
# your code here

### Method 3: Weighted Average 2. 
An alternative weighted average is to weight the differences between their ratings and their average rating (in essence to reward movies that are above the mean): $r_{c,s} = \bar{r}_c + k\sum_{c'\in \hat{C}}sim(c, c')\times (r_{c',s} - \bar{r}_{c'})$

Choose a reasonable $k$ so that $r_{c,s}$ is between 1 and 5.
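A minimal sketch of this mean-centered variant follows, again assuming $k = 1 / \sum_{c'\in \hat{C}}|sim(c, c')|$. The final clipping to the 1 to 5 range is an extra assumption, not part of the formula, and the function name and numbers are hypothetical.

```python
def mean_centered_prediction(target_mean, neighbors):
    """neighbors: list of (similarity, rating_on_movie, neighbor_mean_rating) triples."""
    total = sum(abs(sim) for sim, _, _ in neighbors)
    if total == 0:
        return target_mean  # nothing to adjust with, so fall back to the user's own mean
    k = 1.0 / total  # assumed normalization
    offset = k * sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    # Clipping is an added safeguard (an assumption) to stay on the rating scale
    return min(5.0, max(1.0, target_mean + offset))

# Made-up values: the target user averages 3.5 stars
neighbors = [(0.8, 5, 4.0), (0.4, 3, 3.5)]
prediction = mean_centered_prediction(3.5, neighbors)
print(round(prediction, 3))
```

The first neighbor rated the movie one star above their own mean and the second half a star below, so the estimate moves above the target user's mean.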

In [None]:
# your code here

# Part 3: Baseline Recommendation (Global) [20 points]

OK, so far we've built the basics of a user-user collaborative filtering approach; that is, we take a user, find similar users and then aggregate their ratings. 

An alternative approach is to consider just basic statistics of the movies and users themselves. This is the essence of the "baseline" recommender we discussed in class:

$b_{xi} = \mu + b_x + b_i$

where $b_{xi}$ is the baseline estimate of the rating user x would give to item i, $\mu$ is the overall mean rating, $b_x$ is a user bias term, and $b_i$ is an item bias term.
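As a small worked example, the sketch below computes the baseline on six made-up ratings. It assumes the simplest definition of the bias terms: $b_x$ is user x's mean rating minus $\mu$, and $b_i$ is item i's mean rating minus $\mu$. The data and helper names are hypothetical, and the sketch assumes the user and the item each have at least one rating.

```python
# Made-up (UserId, MovieId, Rating) triples
ratings = [
    (1, 10, 5), (1, 20, 3),
    (2, 10, 4), (2, 30, 2),
    (3, 20, 4), (3, 30, 3),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

mu = mean(r for _, _, r in ratings)  # overall mean rating

def baseline(user, item):
    """Baseline estimate mu + b_x + b_i using mean-offset biases (an assumption)."""
    b_x = mean(r for u, _, r in ratings if u == user) - mu
    b_i = mean(r for _, i, r in ratings if i == item) - mu
    return mu + b_x + b_i

# User 1 never rated movie 30, so we estimate it
estimate = baseline(1, 30)
print(estimate)
```

Here $\mu = 3.5$, user 1 rates half a star above average ($b_x = 0.5$), and movie 30 is rated one star below average ($b_i = -1.0$), giving an estimate of 3.0.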

For this part, let's once again estimate the rating of UserID 363 for the movie "Dances with Wolves (1990)" (MovieId 97), but this time using the baseline recommender.

In [None]:
# your code here

# Part 4: Local + Global Recommendation (Baseline + Item-Item CF) [20 points]

Our final recommender combines the global baseline recommender with an item-item local recommender. 

$\hat{r}_{xi} = b_{xi} + \dfrac{\sum_{j \in N(i;x)}s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)}s_{ij}} $

where 
* $\hat{r}_{xi}$ is our estimated rating for what user x would give to item i.
* $s_{ij}$ is the similarity of items i and j.
* $r_{xj}$ is the rating of user x on item j.
* $N(i;x)$ is the set of items similar to item i that were rated by x.

You will need to make some design choices here about what similarity measure to use and what threshold to determine what items belong in $N(i;x)$.
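The formula above can be sketched as follows once the neighborhood has been chosen. This is an illustration under stated assumptions, not the required design: the function name and numbers are made up, the similarities are assumed positive (for example, after thresholding), and the estimate falls back to the baseline when $N(i;x)$ is empty.

```python
def local_global_prediction(b_xi, neighbors):
    """neighbors: list of (s_ij, r_xj, b_xj) triples for items j in N(i; x)."""
    denom = sum(s for s, _, _ in neighbors)
    if denom == 0:
        return b_xi  # empty neighborhood: fall back to the global baseline
    numer = sum(s * (r - b) for s, r, b in neighbors)
    return b_xi + numer / denom

# Made-up values: baseline 3.0 for the target item, two similar items rated by x
neighbors = [(0.8, 4, 3.5), (0.2, 2, 3.0)]
prediction = local_global_prediction(3.0, neighbors)
print(round(prediction, 3))
```

The user rated the more similar item half a star above its baseline, so the estimate is nudged above the baseline of 3.0.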

Now show us the rating this method estimates for UserID 363 on the movie "Dances with Wolves (1990)" (MovieId 97):

In [None]:
# your code here

# Part 5. Putting it all together! [20 points]

Finally, we're going to experiment with our different methods to see which one performs the best on our special test set of "hidden" ratings. We have three big "kinds" of recommenders:

* User-User Collaborative Filtering
* Baseline Recommendation (Global)
* Local + Global Recommender


But within each, we have lots of design choices. For example, do we try Jaccard+Average or Jaccard+WeightedAverage1? Do we try Jaccard or Cosine or Pearson? What choice of k? Etc.

For this part, you should train your methods on a special train set (the base set, see below). Then report your results over the test set (see below). You should use RMSE as your metric of choice. Which method performs best? You will need to experiment with many different approaches, but ultimately, you should tell us the best 2 or 3 methods and report the RMSE you get.
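As a reminder, RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predicted and actual must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions against made-up hidden ratings: errors of 1, 0, and 2 stars
score = rmse([4.0, 3.0, 5.0], [5, 3, 3])
print(round(score, 3))
```

Lower is better, and because the errors are squared, one prediction that is off by two stars costs as much as four predictions that are each off by one.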

In [None]:
train = pd.read_csv('ua.base', sep='\t', names=['UserId', 'MovieId', 'Rating'],usecols=range(3))
test = pd.read_csv('ua.test', sep='\t', names=['UserId', 'MovieId', 'Rating'],usecols=range(3))


In [None]:
# your code here 

*provide your best 2 or 3 methods, their RMSE, plus some discussion of why they did the best*

### BONUS: 
Can you do better? Find a way to improve the results!

In [None]:
# your code here