_The main focus of this assignment is building recommendation systems from both a theoretical and a practical perspective._

## Problem 1: Implementing Recommendation Systems

The goal of this task is to predict recommendation scores for products from user reviews. The data has users as rows and products as columns, where P refers to a product and U refers to a user. Each cell holds that user's review score (recommendation) for that product. The training data is given as follows.

|&nbsp; | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | 
|----|----|----|----|----|----|----|----|----|----|-----| 
| U1 | 3  | 7  | 4  | 9  | 9  | 7  | 6  | 7  | 8  | 8   | 
| U2 | 7  | 5  | 5  | 3  | 8  | 8  | 7  | 4  | 9  | 5   | 
| U3 | 7  | 5  | 5  | 0  | 8  | 4  | 8  | 6  | 7  | 9   | 
| U4 | 5  | 6  | 8  | 5  | 9  | 8  | 5  | 7  | 10 | 7   | 
| U5 | 5  | 8  | 8  | 8  | 10 | 9  | 7  | 4  | 9  | 8   | 
| U6 | 7  | 7  | 8  | 4  | 7  | 8  | 6  | 7  | 7  | 8   |


Consider the following test set of users. The missing values are the products that the corresponding users have not bought. Given this dataset, determine which products U7, U8 and U9 should buy. Show the recommendation scores for the top 3 products.

|  &nbsp; | P1  | P2  | P3  | P4  | P5  | P6  | P7  | P8  | P9  | P10 | 
|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----| 
| U7 |  ?  | 6   | 9   |   ? |   ? | 6   |   ? | 9   |   ? |   ? | 
| U8 | 7   |   ? | 9   |   ? | 4   |  ?  | 9   |  ?  | 7   |   ? | 
| U9 |   ? | 6   |   ? | 9   |   ? | 7   |   ? | 8   |   ? | 4   | 


In [1]:
# Load the Relevant libraries
import sklearn as sk
import pandas as pd
import numpy as np

train = pd.read_csv("ratingsDataTrain.csv",index_col=0)
test = pd.read_csv("ratingsDataTest.csv",index_col=0)
test = test.rename(index=dict(zip(['U1','U2','U3'],['U7','U8','U9'])))
test_original = test.copy()

In [2]:
test = test.apply(lambda x: x.str.strip() if x.dtype == "object" else x).replace('?',np.nan)
train = train.apply(lambda x: x.str.strip() if x.dtype == "object" else x).replace('?',np.nan)

test = test.astype(float)
train = train.astype(float)

In [3]:
nearest_neighbor = {}
# For each user in the test set, find their nearest neighbor in the training set
for u in test.index:
    # Add the test user as a new row (DataFrame.append was removed in pandas 2.0)
    j = pd.concat([train, test.loc[[u]]])
    # Pearson correlation between users; NaNs are skipped pairwise, so the test
    # user is compared only on the products they have rated
    corr_mat = j.T.corr()
    # Correlations of the test user with every training user (drop the self-correlation)
    similarity = corr_mat[u].drop(u)
    # The nearest neighbor is the training user with the highest correlation
    nearest_neighbor[u] = similarity.idxmax()
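
As a sanity check on the nearest-neighbor step, here is a minimal, self-contained sketch that rebuilds the training table from the problem statement by hand (so the CSV files are not needed) and computes the Pearson correlation between U7 and each training user over the products U7 has rated. The names `train_demo`, `u7` and `similarity` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

products = [f"P{k}" for k in range(1, 11)]

# Training ratings copied from the first table in the problem statement
train_demo = pd.DataFrame(
    [[3, 7, 4, 9, 9, 7, 6, 7, 8, 8],
     [7, 5, 5, 3, 8, 8, 7, 4, 9, 5],
     [7, 5, 5, 0, 8, 4, 8, 6, 7, 9],
     [5, 6, 8, 5, 9, 8, 5, 7, 10, 7],
     [5, 8, 8, 8, 10, 9, 7, 4, 9, 8],
     [7, 7, 8, 4, 7, 8, 6, 7, 7, 8]],
    index=[f"U{k}" for k in range(1, 7)],
    columns=products,
    dtype=float,
)

# U7's known ratings; NaN marks products U7 has not bought
u7 = pd.Series(
    [np.nan, 6, 9, np.nan, np.nan, 6, np.nan, 9, np.nan, np.nan],
    index=products,
    name="U7",
)

# Compare users only on the products U7 has actually rated
rated = u7.dropna().index
similarity = train_demo[rated].apply(lambda row: row.corr(u7[rated]), axis=1)

print(similarity.round(3))
print("Nearest neighbor of U7:", similarity.idxmax())
```

On this data the most similar training user is U3, which is consistent with the values filled in for U7 in the completed test table further down.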

In [4]:
test_users = list(test.index.values)

# For each of the users in the test set, fill their gaps with the review from their most similar user.
# Replace their missing data in the test set to fill out the test set
for u in test_users:
    best_nn = nearest_neighbor.get(u)
    test.loc[u] = test.loc[u].combine_first(train.loc[best_nn])

test = test.astype(int)

In [5]:
test

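| &nbsp; | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 |
|----|----|----|----|----|----|----|----|----|----|-----|
| U7 | 7  | 6  | 9  | 0  | 8  | 6  | 8  | 9  | 7  | 9   |
| U8 | 7  | 7  | 9  | 4  | 4  | 8  | 9  | 7  | 7  | 8   |
| U9 | 3  | 6  | 4  | 9  | 9  | 7  | 6  | 8  | 8  | 4   |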


In [6]:
test_original

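| &nbsp; | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 |
|----|----|----|----|----|----|----|----|----|----|-----|
| U7 | ?  | 6  | 9  | ?  | ?  | 6  | ?  | 9  | ?  | ?   |
| U8 | 7  | ?  | 9  | ?  | 4  | ?  | 9  | ?  | 7  | ?   |
| U9 | ?  | 6  | ?  | 9  | ?  | 7  | ?  | 8  | ?  | 4   |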


#### The highest rated products that our test users have not bought yet:

| User | Suggested Purchase | Anticipated Review Score |
|----|:---:|:-:|
| U7 | P10 | 9 |
| U8 | P6 or P10 | 8 |
| U9 | P5 | 9 |
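
The assignment asks for the top 3 products per user, while the table above lists only the single best one. The sketch below shows one way the top 3 could be read off. It uses the filled-in scores and the `?` positions copied by hand from the two outputs above; `filled`, `bought` and `top3` are names introduced here for illustration, and ties are resolved by `nlargest`'s default rule rather than by anything stated in the assignment.

```python
import pandas as pd

products = [f"P{k}" for k in range(1, 11)]

# Completed test set (output of In [5])
filled = pd.DataFrame(
    [[7, 6, 9, 0, 8, 6, 8, 9, 7, 9],
     [7, 7, 9, 4, 4, 8, 9, 7, 7, 8],
     [3, 6, 4, 9, 9, 7, 6, 8, 8, 4]],
    index=["U7", "U8", "U9"],
    columns=products,
)

# True where the user had already rated (bought) the product in the original test set
bought = pd.DataFrame(
    [[0, 1, 1, 0, 0, 1, 0, 1, 0, 0],
     [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
     [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
    index=filled.index,
    columns=products,
).astype(bool)

top3 = {}
for u in filled.index:
    # Keep only products the user has not bought, then take the 3 highest scores
    candidates = filled.loc[u][~bought.loc[u]]
    top3[u] = candidates.nlargest(3)
    print(u, top3[u].to_dict())
```

For U9 this yields P5 (9), P9 (8) and P7 (6). For U7 and U8 several products tie on score, so which product takes the last slot depends on the tie-breaking rule.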

## Problem 2: Social Recommendation Systems

Using the FilmTrust dataset, which contains historical likes information and social network connections, create recommendation systems. Pick any three pairs of randomly selected users and two randomly selected movies, and make recommendations with and without the social network information.


In [7]:
# Load the Relevant libraries
import sklearn as sk
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import Reader
from surprise.model_selection import GridSearchCV

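import random
# Seed both generators: the sampling below uses np.random.choice,
# which random.seed alone does not affect
random.seed(11235)
np.random.seed(11235)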
import matplotlib.pyplot as plt

%matplotlib inline

In [8]:
# File for the Social Network Data (UW Repository)
film_social_network = "FilmTrustSocialNetwork.txt"
colnames = ['trustor', 'trustee', 'trust-value']
sn = pd.read_table(film_social_network, sep=' ', header=0, names=colnames, dtype={'trustor':str,'trustee':str})

# File for the Ratings Data (UW Repository)
film_ratings = "FilmTrustRatings.txt"
colnames = ['uid', 'iid', 'rating']
rating = pd.read_table(film_ratings, sep=' ', header=0, names=colnames, dtype={'uid':str,'iid':str})

In [9]:
# Get list of items
items = list(rating['iid'].unique())

# Get list of users, trustors, and trustees
users = list(rating['uid'].unique())
trustors = list(sn['trustor'].unique())
trustees = list(sn['trustee'].unique())

# Only include users that are in the trustor list, as that will be required when using social network information
trustor_set = set(trustors)  # set membership avoids rescanning the list for every user
users = [u for u in users if u in trustor_set]

In [10]:
# Get 3 random users and 2 random items (without replacement, each should be unique values)
random_users = list(np.random.choice(users, 3, replace=False))
random_items = list(np.random.choice(items, 2, replace=False))
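
Because the rest of the notebook reports results for specific user and item ids, the sampling step needs to be repeatable. Seeding Python's `random` module has no effect on NumPy's generator, so the following is a small sketch of one way to make the draw reproducible with a dedicated NumPy generator. The id list `demo_users` is a hypothetical stand-in for `users`.

```python
import numpy as np

# Hypothetical stand-in for the list of user ids
demo_users = [str(n) for n in range(1, 101)]

def sample_ids(ids, k, seed):
    """Draw k distinct ids using a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return list(rng.choice(ids, k, replace=False))

picks_first = sample_ids(demo_users, 3, seed=11235)
picks_second = sample_ids(demo_users, 3, seed=11235)

print(picks_first)
print(picks_first == picks_second)  # same seed, same draw
```

Passing the generator (or the seed) explicitly keeps the user draw and the item draw independent of any other random calls made earlier in the notebook.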

In [11]:
# Used for getting predictions for all items below
df_items = pd.DataFrame(data=items, columns=['Item Id'], index=np.arange(len(items)))

### Without Including Social Network Information

In [12]:
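# Reader with the rating scale taken from the data (the default scale is 1 to 5,
# and predictions are clipped to the reader's scale)
reader = Reader(rating_scale=(rating['rating'].min(), rating['rating'].max()))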

# Load data from df, but do not include users for which we are doing prediction
cond = (~rating['uid'].isin(random_users))
data = Dataset.load_from_df(rating.loc[cond, ['uid', 'iid', 'rating']], reader)

# Use all data in training (does not include target random users)
train = data.build_full_trainset()

# Fit to training data
algo = SVD().fit(train)

In [13]:
# Loop over random users and random items. At end, give top five items for each user based on all ratings
for u in random_users:
    print('*' * 70)
    item_prediction = df_items.copy()
    for i in random_items:
        p = algo.predict(u,i).est
        print('User: %s, Item: %s, Predicted Score: %.2f' % (u,i,p))
    item_prediction['Predicted Score'] = item_prediction['Item Id'].apply(lambda x: algo.predict(u,x).est).round(2)
    item_prediction = item_prediction.sort_values('Predicted Score', ascending=False).reset_index(drop=True)
    print('User: %s, top 5 predictions:' % (u))
    print(item_prediction.iloc[:5])

**********************************************************************
User: 1336, Item: 1239, Predicted Score: 2.8
User: 1336, Item: 479, Predicted Score: 3.2
User: 1336, top 5 predictions:
  Item Id  Predicted Score
0     286         3.737533
1     805         3.651469
2     335         3.593501
3     218         3.530804
4     658         3.516248
**********************************************************************
User: 556, Item: 1239, Predicted Score: 2.8
User: 556, Item: 479, Predicted Score: 3.2
User: 556, top 5 predictions:
  Item Id  Predicted Score
0     286         3.737533
1     805         3.651469
2     335         3.593501
3     218         3.530804
4     658         3.516248
**********************************************************************
User: 628, Item: 1239, Predicted Score: 2.8
User: 628, Item: 479, Predicted Score: 3.2
User: 628, top 5 predictions:
  Item Id  Predicted Score
0     286         3.737533
1     805         3.651469
2     335         3.593501
3

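The output above shows the predicted scores for the 3 random users and 2 random items. For each user we also show the item ids with the top 5 predicted scores; the first item in that list is the number 1 recommendation for that user. The three users receive identical predictions because they were excluded from the training set: for a user it has never seen, the SVD model has no user bias or user factors to apply, so the prediction depends only on the global mean rating and the item's bias.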
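
To make that behaviour concrete, here is a minimal sketch of the biased matrix-factorization prediction rule, written out by hand rather than calling the library. The function `svd_estimate` and all numbers are illustrative assumptions, not values from the FilmTrust fit. The sketch mirrors how a biased SVD model is generally understood to handle unseen users: terms that depend on an unknown user are simply left out.

```python
import numpy as np

def svd_estimate(mu, b_u=None, b_i=None, p_u=None, q_i=None):
    """Prediction mu + b_u + b_i + q_i . p_u, skipping terms that are unknown (None)."""
    est = mu
    if b_u is not None:
        est += b_u
    if b_i is not None:
        est += b_i
    if p_u is not None and q_i is not None:
        est += float(np.dot(q_i, p_u))
    return est

# Illustrative parameters only
mu = 3.0
item_bias = {"A": 0.4, "B": -0.2}
item_factors = {"A": np.array([0.1, -0.3]), "B": np.array([0.2, 0.5])}

# Users that were not in the training data: no user bias, no user factors
cold = {}
for user in ["new_user_1", "new_user_2"]:
    cold[user] = {
        i: svd_estimate(mu, b_i=item_bias[i], q_i=item_factors[i])
        for i in item_bias
    }
    print(user, cold[user])

# A user seen during training gets a personalised score
known = svd_estimate(mu, b_u=0.3, b_i=item_bias["A"],
                     p_u=np.array([1.0, 0.5]), q_i=item_factors["A"])
print("known user, item A:", round(known, 2))
```

Both unseen users get exactly the same score for each item, so their top 5 lists coincide as well, which is the pattern in the output above.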

### Including Social Network Information

In [23]:
# Loop over all random users
for u in random_users:
    print('*' * 70)
    
    # Make sure user is in social network data
    cond = (sn['trustor']==u)
    if cond.sum()==0:
        print("User is not in social network data! Skipping...")
        continue
    
    # Get trustees for this user
    trustees = sn[cond]['trustee'].unique()
    print("User: %s, Number of Trustees = %d" % (u, trustees.shape[0]))
    
    # Skip if trustees have no ratings in data (social network doesn't rate movies)
    cond = (rating['uid'].isin(trustees))
    if cond.sum()==0:
        print("User does not have any trustees in movie ratings data! Skipping...")
        continue

    # Only load data from trustees for a given user
    data = Dataset.load_from_df(rating[cond][['uid', 'iid', 'rating']], reader)

    # Training data only includes people user trusts
    train = data.build_full_trainset()

    # Fit to training data
    algo = SVD().fit(train)

    item_prediction = df_items.copy()
    for i in random_items:
        p = algo.predict(u,i).est
        print('User: %s, Item: %s, Predicted Score: %.2f' % (u,i,p))
    item_prediction['Predicted Score'] = item_prediction['Item Id'].apply(lambda x: algo.predict(u,x).est).round(2)
    item_prediction = item_prediction.sort_values('Predicted Score', ascending=False).reset_index(drop=True)
    print('User: %s, top 5 predictions:' % (u))
    print(item_prediction.iloc[:5])

**********************************************************************
User: 1336, Number of Trustees = 3
User: 1336, Item: 1239, Predicted Score: 2.67
User: 1336, Item: 479, Predicted Score: 2.67
User: 1336, top 5 predictions:
  Item Id  Predicted Score
0       7             2.86
1      13             2.85
2       2             2.83
3     215             2.83
4     206             2.82
**********************************************************************
User: 556, Number of Trustees = 2
User: 556, Item: 1239, Predicted Score: 1.97
User: 556, Item: 479, Predicted Score: 1.97
User: 556, top 5 predictions:
  Item Id  Predicted Score
0     278             2.15
1       8             2.12
2     247             2.12
3      11             2.11
4     236             2.11
**********************************************************************
User: 628, Number of Trustees = 22
User: 628, Item: 1239, Predicted Score: 3.11
User: 628, Item: 479, Predicted Score: 3.11
User: 628, top 5 predictions:

The output above shows the predicted scores for the 3 random users and 2 random items, together with the item ids with the top 5 predicted scores for each user; the first item in that list is the number 1 recommendation for that user. Unlike the previous section, the model for each user is trained only on ratings from the people that user trusts. The variability in the predicted scores is therefore limited by the variability in the ratings given by that user's trustees. Both random items receive the same score for a given user, which is what we would expect if none of that user's trustees rated either item, since the prediction then falls back to the mean rating of the trustees. A user with a small number of trustees (one who connects to only a few other people) will likely get a more limited set of scores. In this run, the two users with only 2 or 3 trustees also received lower scores than the user with 22 trustees, and in general we expect a user with few trustees to have predicted scores that span a narrower range than a user with many trustees.
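
One way to check this reasoning is to summarise, for each user, how many ratings their trustees contributed and how widely those ratings are spread. The sketch below does this on a tiny hand-made example. The helper `trustee_rating_summary` and the frames `sn_demo` and `rating_demo` are hypothetical, and only the column names match the notebook.

```python
import pandas as pd

def trustee_rating_summary(rating, sn, user):
    """Summarise the ratings given by the people `user` trusts."""
    trustees = sn.loc[sn["trustor"] == user, "trustee"].unique()
    r = rating.loc[rating["uid"].isin(trustees), "rating"]
    return {
        "n_trustees": len(trustees),
        "n_ratings": len(r),
        "mean": r.mean(),
        "min": r.min(),
        "max": r.max(),
    }

# Toy social network: "a" trusts one person, "c" trusts three
sn_demo = pd.DataFrame({
    "trustor": ["a", "c", "c", "c"],
    "trustee": ["b", "b", "d", "e"],
})

# Toy ratings given by the trustees
rating_demo = pd.DataFrame({
    "uid": ["b", "b", "d", "d", "e"],
    "iid": ["1", "2", "1", "3", "2"],
    "rating": [2.0, 2.5, 4.0, 1.0, 3.5],
})

summary_a = trustee_rating_summary(rating_demo, sn_demo, "a")
summary_c = trustee_rating_summary(rating_demo, sn_demo, "c")
print("a:", summary_a)
print("c:", summary_c)
```

In this toy example the user with a single trustee sees ratings only between 2.0 and 2.5, while the user with three trustees sees ratings between 1.0 and 4.0. Running the same summary on `rating` and `sn` for the three sampled users would show whether the narrower score ranges in the output line up with narrower trustee ratings.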