## INFS 770 - Assignment 2

**Note**: Created using Anaconda Python 3.7.1 (64-bit)

---

### Pre-task setup

In [1]:
# Imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sci-Py Packages
from scipy.spatial.distance import cosine, euclidean

In [2]:
# Load data
data_file = "DBbook_train_ratings.tsv"
df_data1 = pd.read_csv(data_file,
                       sep="\t",
                       header=None,
                       skiprows=1,
                       names=["userID", "itemID", "rate" ]
                       )

In [3]:
# Make sure we've got data in our dataframe
print(df_data1.head())

   userID  itemID  rate
0    6873    3201     4
1    6873    3098     4
2    6873    4198     4
3    6873    5950     4
4    6873     204     4


In [4]:
# Euclidean distance functions (copied from ml_100k notebook)

def top_k_users(user_number, k):
    # no need to transpose the matrix this time because the rows already represent users
    # remove the active user
    # (.copy() avoids pandas' SettingWithCopyWarning when the "distance" column is added)
    df_sim = dense_matrix.loc[dense_matrix.index != user_number].copy()
    # calculate the distance between the given user and each row
    df_sim["distance"] = df_sim.apply(lambda x: euclidean(dense_matrix.loc[user_number], x), axis=1)
    # return the top k from the sorted distances
    return df_sim.sort_values(by="distance").head(k)["distance"]

def top_k_items(item_number, k):
    # copy the dense matrix and transpose it so each row represents an item
    df_sim = dense_matrix.transpose()
    # remove the active item
    df_sim = df_sim.loc[df_sim.index != item_number]
    # calculate the distance between the given item and each row (apply the function to each row if axis = 1)
    df_sim["distance"] = df_sim.apply(lambda x: euclidean(dense_matrix[item_number], x), axis=1)
    # return the top k from the sorted distances
    return df_sim.sort_values(by="distance").head(k)["distance"]   
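
The two functions above call `euclidean` once per row through `apply`, which can be slow on a matrix this size. A minimal vectorized sketch is shown below, assuming SciPy's `cdist` is available; the helper name `top_k_rows` and the toy matrix are hypothetical and not part of the assignment.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def top_k_rows(matrix, label, k):
    """Return the k rows of `matrix` closest (Euclidean) to the row `label`."""
    target = matrix.loc[[label]].to_numpy()      # shape (1, n_columns)
    others = matrix.drop(index=label)            # new frame without the active row
    # cdist computes every row-to-target distance in one vectorized call
    distances = cdist(others.to_numpy(), target, metric="euclidean").ravel()
    return pd.Series(distances, index=others.index, name="distance").nsmallest(k)

# Toy utility matrix (hypothetical data): rows are users, columns are items
toy = pd.DataFrame(
    [[5, 0, 3], [4, 0, 3], [0, 2, 0], [5, 2, 3]],
    index=[10, 11, 12, 13],
    columns=[1, 2, 3],
)
nearest = top_k_rows(toy, 10, 2)
print(nearest)
```

For item-to-item distances the same helper can be applied to the transposed matrix, for example `top_k_rows(toy.T, 1, 1)`.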

### Question 1

In [5]:
# Q1: Print the number of unique users and the number of unique books in this data set
unique_users = len(df_data1['userID'].unique())
unique_books = len(df_data1['itemID'].unique())

print("Number of unique users: {}".format(unique_users))
print("Number of unique books: {}".format(unique_books))

Number of unique users: 6181
Number of unique books: 6166


### Question 2

In [6]:
# Q2.1: Create the utility matrix
dense_matrix = df_data1.pivot_table(values="rate", index=["userID"], columns=["itemID"])

In [7]:
# Q2.2: Print the first 10 rows of the matrix
print(dense_matrix.head(n=10))

itemID  1     2     3     5     7     8     9     11    12    13    ...  8157  \
userID                                                              ...         
1        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
2        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
3        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
4        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
5        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
6        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
7        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
8        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
9        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
10       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   

itemID  8160  8161  8162  8

In [8]:
# Q2.3: Print the number and the percentage of cells in the utility matrix that are not populated

num_rows, num_cols = dense_matrix.shape # Get the total size of our matrix
total_cells = num_rows * num_cols
num_unpopulated = dense_matrix.isna().sum().sum()
percent_unpopulated = num_unpopulated / total_cells # In Python 3, "/" is true division and already yields a float

print("Number of unpopulated cells: {}".format(num_unpopulated))
print("Percent of unpopulated cells: {:.1%}".format(percent_unpopulated))

Number of unpopulated cells: 38036488
Percent of unpopulated cells: 99.8%
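
As a cross-check, the matrix has 6181 × 6166 = 38,112,046 cells, so 38,036,488 unpopulated cells leaves 75,558 populated ones. The sketch below shows how the same figures could be derived from the long-format ratings without building the pivot table; the toy `ratings` frame is hypothetical.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item, rating)
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "itemID": [10, 11, 10, 12],
    "rate":   [4, 5, 3, 2],
})

n_users = ratings["userID"].nunique()
n_items = ratings["itemID"].nunique()
total_cells = n_users * n_items
# Each distinct (user, item) pair fills exactly one cell of the utility matrix
populated = len(ratings.drop_duplicates(subset=["userID", "itemID"]))
unpopulated = total_cells - populated

print("Unpopulated: {} of {} ({:.1%})".format(
    unpopulated, total_cells, unpopulated / total_cells))
```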


In [9]:
# Q2.4: Write code to fill these empty cells with 0s
dense_matrix = dense_matrix.fillna(0)
# Validate that we have 0s now rather than NaN
dense_matrix.head()

itemID  1     2     3     5     7     8     9     11    12    13    ...  8157  8160  8161  8162  8163  8164  8166  8167  8168  8169
userID                                                              ...
1        0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2        0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
3        0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
4        0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   4.0   0.0
5        0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0


### Question 3

In [10]:
# Q3: Print the top 5 similar users to userID 2 based on Euclidean distance. 
top_k_users(2, 5)

userID
3917    11.401754
6875    12.569805
986     12.806248
1983    12.884099
1156    12.922848
Name: distance, dtype: float64

### Question 4

In [11]:
# Q4.1: Please write code to print the Euclidean distance between itemID 18 and itemID 1

# copy the dense matrix and transpose it so each row represents an item
df_sim = dense_matrix.transpose()

distance = euclidean(df_sim.loc[18], df_sim.loc[1])
print("Euclidean distance between the two items is: {}".format(distance))

Euclidean distance between the two items is: 29.189039038652847


In [12]:
# Q4.2: Please write code to print the Euclidean distance between itemID 36 and itemID 1

distance = euclidean(df_sim.loc[36], df_sim.loc[1])
print("Euclidean distance between the two items is: {}".format(distance))

Euclidean distance between the two items is: 40.124805295477756


In [13]:
# Q4.3: Write a print statement that tells me, between itemID 36 and itemID 18, which is more similar to itemID 1 and why.

print("itemID 18 is more similar to itemID 1 because its Euclidean distance is shorter than the distance between itemID 36 and itemID 1")

itemID 18 is more similar to itemID 1 because its Euclidean distance is shorter than the distance between itemID 36 and itemID 1
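
The answer above is hard-coded from the two printed distances. A small sketch of how the comparison could be computed instead is shown below; the rating vectors are hypothetical stand-ins for rows of the transposed utility matrix.

```python
import numpy as np

# Hypothetical zero-filled rating vectors (one entry per user)
item_1 = np.array([5.0, 0.0, 3.0, 0.0])
candidates = {
    18: np.array([4.0, 0.0, 3.0, 1.0]),
    36: np.array([0.0, 5.0, 0.0, 2.0]),
}

# Euclidean distance: square root of the summed squared differences
distances = {
    item: float(np.sqrt(np.sum((vec - item_1) ** 2)))
    for item, vec in candidates.items()
}
# The smaller distance identifies the more similar item
closest = min(distances, key=distances.get)
print("itemID {} is more similar to itemID 1 (distance {:.3f})".format(
    closest, distances[closest]))
```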


### Question 5

In [14]:
# Q5: Please write code to print the top 5 similar items to itemID 8010
top_k_items(8010, 5)

itemID
3711    127.921851
4559    127.964839
330     129.715072
1311    129.722781
7328    129.761319
Name: distance, dtype: float64

### Question 6

In [15]:
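# Q6.1: Write code to remove books and users with fewer than 20 rating scores
#       from the utility matrix by copying and maybe modifying the following codes
# Note: the "> 20" filters below also drop books and users with exactly 20 ratings;
#       ">= 20" would keep them. The strict comparison is left unchanged so that
#       the printed shapes that follow still match this code.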
df_item_freq = df_data1.groupby("itemID").count()
df_user_freq = df_data1.groupby("userID").count()
selected_items = df_item_freq[df_item_freq["userID"]>20].index
dense_matrix = dense_matrix[selected_items]
selected_users = df_user_freq[df_user_freq["itemID"]>20].index
dense_matrix = dense_matrix.loc[selected_users]

#dense_matrix.head()

In [16]:
# Q6.2: Print the shape of the dataset
print("Matrix shape: {} x {}".format(dense_matrix.shape[0], dense_matrix.shape[1]))

Matrix shape: 766 x 776
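
The Q6.1 filter uses a strict `> 20` comparison, which treats a count of exactly 20 differently from `>= 20`. The sketch below illustrates the difference on hypothetical data, with a threshold of 2 standing in for 20.

```python
import pandas as pd

# Hypothetical ratings: item "a" has 3 ratings, "b" has 2, "c" has 1
ratings = pd.DataFrame({
    "userID": [1, 2, 3, 1, 2, 1],
    "itemID": ["a", "a", "a", "b", "b", "c"],
})
min_ratings = 2  # hypothetical threshold, standing in for 20

item_counts = ratings.groupby("itemID").size()
strictly_more = item_counts[item_counts > min_ratings].index.tolist()
at_least = item_counts[item_counts >= min_ratings].index.tolist()

print(strictly_more)  # ['a']
print(at_least)       # ['a', 'b']
```

Item "b" sits exactly on the threshold, so it survives only the `>=` filter.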


### Question 7

In [17]:
# Q7.1: Please use the dataset you obtained from Q6 and write code to remove users that haven’t rated itemID 8010
dense_matrix = dense_matrix[dense_matrix.loc[:,8010] != 0.0]

In [18]:
# Q7.2: Please write code to print the counts of the different rating scores of this item 
dense_matrix.loc[:,8010].value_counts()

4.0    68
5.0    58
3.0    27
2.0    13
1.0     8
Name: 8010, dtype: int64

In [19]:
# Q7.3: Print the shape of the data
print("Matrix shape: {} x {}".format(dense_matrix.shape[0], dense_matrix.shape[1]))

Matrix shape: 174 x 776


### Question 8

In [20]:
# Q8.1: Write code to partition the data set you obtained from Q7 for validating 
#       the performance on predicting rating on itemID 8010
from sklearn.model_selection import train_test_split

# create a data frame for the predictors
df_x = dense_matrix[[col for col in dense_matrix.columns if col != 8010]]
print(df_x.shape)

# create a single-column data frame for the outcome (the double brackets keep it 2-D)
df_y = dense_matrix[[8010]]
print(df_y.shape)

(174, 775)
(174, 1)


In [21]:
# Q8.2: Randomly select 25% of the users as the testing set and the others as the training set

# Create our split (train_test_split already returns DataFrames when given DataFrames)
df_train_x, df_test_x, df_train_y, df_test_y = train_test_split(df_x, df_y, test_size=0.25, random_state=0)

In [22]:
# Q8.3: Please print the dimensions of the training set and the testing set
print("shapes")
print("x training:", df_train_x.shape)
print("x testing:", df_test_x.shape)
print("y training:", df_train_y.shape)
print("y testing:", df_test_y.shape)
print() 
print("class counts")
print("y training:\n", df_train_y[8010].value_counts())
print("y testing:\n", df_test_y[8010].value_counts())

shapes
x training: (130, 775)
x testing: (44, 775)
y training: (130, 1)
y testing: (44, 1)

class counts
y training:
 4.0    51
5.0    43
3.0    23
2.0     8
1.0     5
Name: 8010, dtype: int64
y testing:
 4.0    17
5.0    15
2.0     5
3.0     4
1.0     3
Name: 8010, dtype: int64


In [23]:
# Q8.4: Please write code to print the mean rating of itemID 8010 in the training set and its mean rating in the testing set
train_mean = df_train_y[8010].mean()
test_mean = df_test_y[8010].mean()
print("Training set mean for itemID 8010: {}".format(train_mean))
print("Testing set mean for itemID 8010: {}".format(test_mean))

Training set mean for itemID 8010: 3.9153846153846152
Testing set mean for itemID 8010: 3.8181818181818183


### Question 9

In [24]:
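# Q9.1: Use the training and test dataset obtained in Q8 and write code to 
#         1) print the userID of the 5th user  (the user in the 5th row) in the test dataset
# Note: pandas positions are zero-based, so index[5] selects the row at position 5
#       (the sixth row); index[4] would select the fifth row.
uid = df_test_x.index[5]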
print("5th user's userID is: {}".format(uid))

5th user's userID is: 4521


In [25]:
# Q9.2: Predict this user’s rating of itemID 8010 based on the top 8 similar users in the training dataset

# specify the number of similar users to retrieve
k = 8

def user_based_predict(user_number):
    # retrieve the top k similar users
    # copy from all the training predictors
    df_sim = df_train_x.copy()
    # for each user, calculate the distance between this user and the active user
    df_sim["distance"] = df_sim.apply(lambda x: euclidean(df_test_x.loc[user_number], x), axis=1)
    # create a new data frame to store the top k similar users
    df_sim_users = df_sim.loc[df_sim.sort_values(by="distance").head(k).index]
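    # weight each similar user's rating on 8010 by that user's distance
    # Note: weighting by raw distance gives farther (less similar) users more influence;
    #       weighting by inverse distance would favor the closest users instead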
    df_sim_users["weighed_d"] = list(map(lambda x: df_sim_users.loc[x]["distance"]*df_train_y.loc[x][8010], df_sim_users.index))
    predicted = df_sim_users["weighed_d"].sum()/df_sim_users["distance"].sum()
    return predicted

predicted_rating = user_based_predict(uid)
actual_rating = df_test_y.loc[uid][8010]

In [26]:
# Q9.3: Print the user’s predicted rating and the actual rating of the book. 

print("Predicted rating on Item 8010: {}".format(predicted_rating))
print("Actual rating on Item 8010: {}".format(actual_rating))

Predicted rating on Item 8010: 3.135703468754046
Actual rating on Item 8010: 5.0
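
The Q9.2 function multiplies each neighbor's rating by its distance, so farther neighbors count more. A common alternative is to weight by inverse distance so that closer users dominate. The sketch below is a minimal NumPy version of that alternative, not the notebook's method; `knn_predict`, the `eps` guard, and the toy arrays are all hypothetical.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k, eps=1e-9):
    """Predict a rating as the inverse-distance-weighted mean of the k nearest rows."""
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / (distances[nearest] + eps)   # eps avoids division by zero
    return float(np.sum(weights * train_y[nearest]) / np.sum(weights))

# Hypothetical training users (rows) and their ratings of the target item
train_x = np.array([[5.0, 0.0], [4.0, 0.0], [0.0, 5.0]])
train_y = np.array([5.0, 4.0, 1.0])
query = np.array([5.0, 1.0])

prediction = knn_predict(train_x, train_y, query, k=2)
print("Predicted rating: {:.3f}".format(prediction))
```

With these toy values the two nearest users rated the item 5 and 4, and the prediction lands closer to 5 because that user is the nearer neighbor.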


### Question 10

1)	Use your own language (<= 6 sentences) to briefly describe model-based collaborative filtering. 


Model-based collaborative filtering is an approach to building recommender systems that predict a user's rating of an item.  You first extract patterns from the users' rating data, then use those patterns to build a model that the software uses to make its predictions.  The main idea behind collaborative filtering is that similar users have similar preferences, so the model can use those similarities to predict the preferences of a given user.
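
One common way to build such a model is latent-factor matrix factorization, where each user and each item gets a small vector and a rating is predicted by their dot product. The answer above does not name a specific model, so the sketch below is only one illustrative possibility; the toy matrix `R` and all hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 utility matrix; 0 marks an unrated cell
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
observed = R > 0

# Assumed hyperparameters: factor count, learning rate, L2 penalty, epochs
n_factors, lr, reg, n_epochs = 2, 0.01, 0.02, 500
P = rng.normal(scale=0.1, size=(R.shape[0], n_factors))  # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], n_factors))  # item factors

def rmse(P, Q):
    """Root-mean-square error over the observed cells only."""
    err = (R - P @ Q.T)[observed]
    return float(np.sqrt(np.mean(err ** 2)))

rmse_before = rmse(P, Q)
for _ in range(n_epochs):
    # Stochastic gradient descent over each observed (user, item) rating
    for u, i in zip(*np.nonzero(observed)):
        err = R[u, i] - P[u] @ Q[i]
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
rmse_after = rmse(P, Q)

predicted = P @ Q.T  # includes estimates for the unrated cells
print("RMSE before: {:.3f}, after: {:.3f}".format(rmse_before, rmse_after))
```

Once fitted, `predicted[u, i]` gives an estimated rating for any user and item pair, including cells that were empty in `R`.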

2)	Summarize the differences between content-based filtering and model-based collaborative filtering (at least two differences). How are they similar? 

Two of the differences between the two types of filtering are:
- Content-based filtering uses known features, whereas collaborative filtering finds and uses previously unknown features.  For example, in the lecture we learned that content-based filtering on movies requires knowing the name of the movie, the director, the genre, and so on.  Collaborative filtering doesn't need this information to construct the feature vector; it only needs the rating table to group similar types of users.
- Content-based filtering can deal with the cold start problem for new items (such as when we add a new item into our recommender system), whereas collaborative filtering doesn't deal well with cold starts on new items, because a new item has no ratings yet.

Both types of filtering build feature vectors from the data; they differ in which features of the data they use. As we learned in the lecture, in practice the two are frequently used together when building recommender systems.