# Operations Analytics - Building a recommendation system

This notebook provides an introduction to building a user-by-user recommendation system in Python. As an example, we use the MovieLens movie ratings dataset (the older 100k-ratings version), which is publicly available [here](https://grouplens.org/datasets/movielens/). To start coding, please download the dataset and unpack it into the working directory (e.g. the ADStudent folder on lab computers). The ratings are provided in the **u.data** file. It turns out to be a text file (tab-separated, not a comma-separated CSV), but because we do not know that yet, we will use the **read_table()** function from pandas. Its default separator is a tab, so it happens to read this file correctly without any extra arguments.

We will use the kNN (k-nearest neighbors) method to find neighbors (similar users) for each user and then provide recommendations based on them.
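To make "similar users" concrete, here is a minimal sketch of what a nearest-neighbor search measures. It assumes Euclidean distance, which is the default of scikit-learn's `NearestNeighbors` (Minkowski with p=2). The small `ratings` array is invented for illustration and is not part of the MovieLens data.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are movies, 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],
    [5, 5, 0, 1],
    [1, 0, 5, 4],
], dtype=float)

target = ratings[0]  # the user we want neighbors for

# Euclidean distance from the target user to every user (including himself)
distances = np.sqrt(((ratings - target) ** 2).sum(axis=1))

# Users sorted from closest to farthest; position 0 is the target himself
order = np.argsort(distances)
print(distances)
print(order)
```

The second user differs from the target by one point on a single movie, so he is the closest neighbor, while the third user has almost opposite tastes and ends up far away.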

In [1]:
# importing libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

In [2]:
# reading the data:
# read_table() is used, as it splits columns on tabs by default
# header=None, as the data has no column names provided
movies = pd.read_table("u.data",header=None)
movies.head()

,0,1,2,3
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
# creating column names
movies.columns = ['user_id', 'item_id', 'rating', 'timestamp']
movies.head()

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
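Once the format is known, reading the file and naming the columns can be done in one call. The sketch below is one possible way to do it, using `read_csv()` with an explicit tab separator. To stay self-contained it reads the first three rows shown above from an in-memory string; with the real file you would pass `"u.data"` instead of the `StringIO` object.

```python
import io
import pandas as pd

# First three rows of u.data, copied from the output above (tab-separated)
sample = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

# sep="\t" states the format explicitly; names= supplies the column names,
# and header=None tells pandas that the file itself has no header row
ratings = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["user_id", "item_id", "rating", "timestamp"],
)
print(ratings)
```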


In [4]:
# number of (unique) users
len(movies["user_id"].unique())

943

In [5]:
# number of (unique) movies
len(movies["item_id"].unique())

1682

In [6]:
# creating a (pivot) table from our data, where:
# columns are movies
# rows are users
# values are ratings (the default aggregation is the mean, which equals
# the rating itself when each user-movie pair occurs only once)
# missing values are filled with 0s
# (if a user did not rate a movie, then its value is missing)

df = pd.pivot_table(data=movies,columns="item_id",index="user_id",values="rating",fill_value=0)
df.head()

item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
user_id
1,5,3,4,3,3,5,4,1,5,3,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,4,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
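The reshaping step is easier to follow on a few rows. The sketch below applies the same `pivot_table()` call to a small invented long-format table (the `long_format` values are hypothetical) and also measures how many cells end up as 0, i.e. how sparse the user-by-movie table is.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
long_format = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "item_id": [10, 20, 10, 20, 30],
    "rating":  [5, 3, 4, 2, 1],
})

# Same call as in the notebook: users as rows, movies as columns, 0 for "not rated"
wide = pd.pivot_table(
    data=long_format,
    index="user_id",
    columns="item_id",
    values="rating",
    fill_value=0,
)

# Share of cells that are 0 (movies a user did not rate)
sparsity = (wide == 0).to_numpy().mean()
print(wide)
print(round(sparsity, 2))
```

Keep in mind that a 0 here means "no rating", not "a very bad rating", although the distance calculation treats it as an ordinary number.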


In [7]:
# specifying the model with 3 neighbors and fitting it to our data
model = NearestNeighbors(n_neighbors=3)
model.fit(df)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=3, p=2, radius=1.0)

In [8]:
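# finding neighbors for the user at row position 3 (the 4th row, user_id 4)
# double brackets keep the row 2-dimensional, as kneighbors() expects
print(model.kneighbors(df.iloc[[3]]))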

(array([[  0.        ,  20.90454496,  21.14237451]]), array([[  3, 569, 301]], dtype=int64))




As we can see, the output above provides the 3 closest users and their distances. The users are identified by their row position in the table, not by user_id. Among them the closest one is the user himself, as the distance between a user and himself is 0. We have this issue because we trained (fitted) the model on a dataset and then searched for neighbors in the same dataset. A better approach is to divide the data into 2 components (train and test), use the first one to build the model and the second one to test it. Let's do that with a 75-25 split (i.e. 75% train and 25% test).
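If you only need neighbors within a single dataset, scikit-learn offers another way around the "own neighbor" issue: when `kneighbors()` is called without an argument, it returns neighbors for the rows the model was fitted on and does not count a row as its own neighbor. The sketch below shows this on an invented table and also maps the returned row positions back to user ids. The `toy` data and its ids are hypothetical.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings table indexed by user_id
toy = pd.DataFrame(
    [[5, 4, 0, 1],
     [5, 5, 0, 1],
     [1, 0, 5, 4],
     [0, 1, 5, 5]],
    index=pd.Index([10, 20, 30, 40], name="user_id"),
)

knn = NearestNeighbors(n_neighbors=1).fit(toy.values)

# No argument: neighbors of the fitted rows, with each row itself excluded
_, positions = knn.kneighbors()

# positions holds row positions (0, 1, 2, ...); translate them to user ids
nearest_ids = toy.index[positions[:, 0]]
print(list(nearest_ids))
```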

In [9]:
train, test = train_test_split(df,test_size=0.25)

In [10]:
# as you can see, the length of the test component is 236 observations/rows
len(test)

236
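The split is random, so the rows that land in the test component change on every run. A minimal sketch of a repeatable split is given below. The `random_state` argument is an addition that the notebook does not use, and the `users` frame is only a placeholder with the same number of rows (943) as the user table.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with one row per user (943 users in the dataset)
users = pd.DataFrame({"dummy": np.arange(943)})

# random_state is an assumption added here: any fixed integer makes
# the split identical from run to run
train_part, test_part = train_test_split(users, test_size=0.25, random_state=42)
print(len(train_part), len(test_part))
```

The test size is rounded up (0.25 × 943 = 235.75), which is why the test component has 236 rows and the train component has the remaining 707.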

In [11]:
# let's now specify the model again but for 5 neighbors
# and fit it to train component only
model = NearestNeighbors(n_neighbors=5)
model.fit(train)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

In [12]:
# let's now find neighbors for the user at row position 3 of the test component
# (double brackets keep the row 2-dimensional)
print(model.kneighbors(test.iloc[[3]]))

(array([[ 55.87486018,  55.91958512,  56.00892786,  56.0178543 ,
         56.10704056]]), array([[428, 291, 252, 510,  29]], dtype=int64))




In [13]:
# keep only the neighbors (row positions in train), not the distances
print(model.kneighbors(test.iloc[[3]])[1][0])

[428 291 252 510  29]




In [14]:
# let's get the 5 closest users for everybody in the test component, inside a list
neighbor_list = []
for i in range(len(test)):
    neighbor_list.append(model.kneighbors(test.iloc[[i]])[1][0])



In [15]:
# "convert" this list into dataframe
neighbors = pd.DataFrame(neighbor_list)
neighbors.head()

,0,1,2,3,4
0,274,133,385,537,488
1,130,668,488,597,609
2,658,55,304,650,238
3,428,291,252,510,29
4,469,488,609,597,489
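The neighbor table above lists row positions in the train component, which still has to be turned into movie suggestions. One simple way to do that, shown here only as a sketch and not as the method of this notebook, is to average the neighbors' ratings and suggest the best-scoring movies the user has not rated yet. The toy data and the `recommend()` helper are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

cols = ["m1", "m2", "m3", "m4"]
# Hypothetical train and test components (0 = not rated)
toy_train = pd.DataFrame(
    [[5, 4, 0, 1], [5, 5, 3, 1], [1, 0, 5, 4], [0, 1, 5, 5]],
    index=[10, 20, 30, 40], columns=cols,
)
toy_test = pd.DataFrame([[5, 4, 0, 0]], index=[50], columns=cols)

knn = NearestNeighbors(n_neighbors=2).fit(toy_train.values)
_, positions = knn.kneighbors(toy_test.values)

def recommend(user_row, neighbor_positions, top_n=1):
    """Hypothetical helper: rank unseen movies by the neighbors' mean rating."""
    neighbor_ratings = toy_train.iloc[neighbor_positions]
    # Treat 0 as missing so that unrated movies do not pull the mean down
    mean_scores = neighbor_ratings.replace(0, np.nan).mean()
    unseen = user_row[user_row == 0].index
    return mean_scores[unseen].dropna().sort_values(ascending=False).head(top_n)

recs = recommend(toy_test.iloc[0], positions[0])
print(recs)
```

Here user 50 has not rated m3 and m4. His two nearest neighbors (users 10 and 20) rate m3 higher on average, so m3 is suggested first.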


In [17]:
# additional

# if you are interested in which movie a user gave the max rating,
# the following will do the job (and save it inside a dataframe);
# note that idxmax() returns only the first such movie for each user
max_ratings = pd.DataFrame(df.idxmax(axis=1))
max_ratings.head()

user_id,0
1,1
2,50
3,320
4,50
5,42
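Because `idxmax()` stops at the first maximum, a user who gave the top rating to several movies appears with only one of them. The sketch below, on an invented table, collects all top-rated movies per user. The `toy_df` values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings: rows are users, columns are movie ids
toy_df = pd.DataFrame(
    [[5, 3, 5],
     [0, 4, 2]],
    index=[1, 2], columns=[10, 20, 30],
)

# First top-rated movie per user, as idxmax() returns it
first_max = toy_df.idxmax(axis=1)

# Every movie that shares the user's maximum rating
all_max = {
    user: row.index[row == row.max()].tolist()
    for user, row in toy_df.iterrows()
}
print(first_max.to_dict())
print(all_max)
```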
