# Master Thesis - Recommender System

The following code is based on Hu & Pu's (2011) Cascade Hybrid Collaborative Filtering technique integrating personality information with item ratings to come to recommendations.

### Set up

**Item**: Input constraint

**Item dimensions**:
- Informativeness {low, medium, high}
- Type {video, audio, textual}

**Item ratings**:
- 9 ratings, 1 for each combination of Complexity X Type

**Profile information**:
- Openness to Experience score
- Demographics {Education, Profession, Sex, Gender}

In [1]:
# imports
import pandas as pd
import numpy as np
import math
from random import *

### Data prep

#### Loading data

In [2]:
overview = pd.read_csv('./Data/Overview_All_updated.csv', sep = ';')

In [3]:
# Select appropriate columns
overview = overview[['ID', 'Education', 'Education.num', 'Employment', 'Employment.num', 'Openness', 
          'vs', 'vm', 'vc', 'as', 'am', 'ac', 'ts', 'tm', 'tc', 'avg_rating']].set_index('ID')
# Make average ratings floats, first dealing with , instead of . transform
overview['avg_rating'] = overview['avg_rating'].replace(',', '.', regex=True).astype(float)

### Recommender system [Rating-Based Collaborative Filtering]

**Pearson correlation coefficient**, the proximity between user *u* and user *v* is (looking at the rating sets of their overlapping item ratings) the summed total of the product of their individual ratings minus their average rating, divided by the absolute product over their summed totals ratings minus their average rating.

$\large sim(u,v) = \frac{\sum_{i \in I_u \cap I_v} (r_{u,i}-\bar{r}_u)(r_{v,i}-\bar{r}_v)}{\sqrt{\sum_{i \in I_u \cap I_v}(r_{u,i}-\bar{r}_u)^2\sum_{i \in I_u \cap I_v}(r_{u,i}-\bar{r}_u)^2}}$

In [4]:
def sim(data,u,v):
    """
    Calculates the pearson correlation coefficient as similarity measure between the 2 users {u,v}.
    - {u,v} are users represented by their 'ID' [int]
    - dependent on the 'overview' DataFrame
    """
    if(u == v):
        return 1
    
    # get rating matrix of overlapping items (every item since all ratings available)
    ratings_matrix = data.loc[[u,v]]
    # add columns of adjusted ratings (rating per item minus average rating) #lazyprogramming
    ratings_matrix_adj = pd.DataFrame({
        'vs_adj' : ratings_matrix['vs'] - ratings_matrix['avg_rating'], #visual[simple]
        'vm_adj' : ratings_matrix['vm'] - ratings_matrix['avg_rating'], #visual[medium]
        'vc_adj' : ratings_matrix['vc'] - ratings_matrix['avg_rating'], #visual[complex]
        'as_adj' : ratings_matrix['as'] - ratings_matrix['avg_rating'], #audio[simple]
        'am_adj' : ratings_matrix['am'] - ratings_matrix['avg_rating'], #audio[medium]
        'ac_adj' : ratings_matrix['ac'] - ratings_matrix['avg_rating'], #audio[complex]
        'ts_adj' : ratings_matrix['ts'] - ratings_matrix['avg_rating'], #text[simple]
        'tm_adj' : ratings_matrix['tm'] - ratings_matrix['avg_rating'], #text[medium]
        'tc_adj' : ratings_matrix['tc'] - ratings_matrix['avg_rating']  #text[complex]
    })
    
    # calculate proximity
    
    ## sum over (all adjusted ratings multiplied per item for both users)
    frac_top = sum([ratings_matrix_adj[i].product() for i in ratings_matrix_adj])
    
    ## sum over all adjusted item ratings
    frac_bottom = math.sqrt(sum(ratings_matrix_adj.loc[u]**2) * sum(ratings_matrix_adj.loc[v]**2))
    
    return(frac_top/frac_bottom)

Predicted rating $\tilde{r}_{u,i}$ is computed as

$\large \tilde{r}_{u,i} = \bar{r}_u + \kappa \sum_{v \in \Omega_u} sim(u,v) \times (r_{v,i}-\bar{r}_v)$,

where $\kappa$ is a normalising factor calculated as 

$\large \kappa = 1/\sum_{v \in \Omega_u} |sim(u,v)|$

, and $\Omega_u$ is the set of user $u$'s neighbours

In [6]:
def find_neighbours(u, num_neighbours, sim_matrix):
    """
    Finds the N nearest neighbours of user u in the similarity matrix.
    Returns a Pandas Series.
    """
    
    #looks at absolute values, sort, remove the last value (similarity with itself), then return closest neighbours
    return( sim_matrix.loc[u].abs().sort_values().iloc[:-1].tail(num_neighbours) )

In [7]:
def expect_rating(data, u, i, kappa):
    """
    Calculates the expected rating for a user u on an item (constraint) i, using kappa k.
    Returns the expected value [float].
    """
    sum_sim_v = 0
    # for every neighbour of u, named v
    for v in neighbours_index[u]:
        # sum as in the equation
        sum_sim_v += sim_matrix.loc[u,v] * (data.loc[v][i] - data.loc[v]['avg_rating'])
    # return expected value as in the full equation
    print(sum_sim_v * kappa[u] + data.loc[u]['avg_rating'])

### Recommender system [Personality-Based Collaborative Filtering]

The Pearson correlation coefficient is also used to compute the personality similarity between user $u$ and user $v$ as

$\large simp(u,v) = \frac{\sum_k (p^k_u - \bar{p}_u) (p^k_v - \bar{p}_v) }{\sqrt{\sum_k (p^k_u - \bar{p}_u)^2 \sum_k (p^k_v - \bar{p}_v)^2}}$



In [8]:
def normalise(data, min_value, max_value):
    """Normalisation function."""
    return(( data - min_value) / (max_value - min_value) )

#normalize personality values
overview['Openness_norm'] = normalise(overview['Openness'], 20, 120)
overview['Education_norm'] = normalise(overview['Education.num'], 1, 3) #1 vocational, 2 undergrads, 3 postgrads
overview['Employment_norm'] = normalise(overview['Employment.num'], 0, 3)
overview['avg_personality'] = (overview['Education_norm'] + 
                               overview['Openness_norm'] + 
                               overview['Employment_norm']) / 3 # equal weights

In [9]:
def sim_personality(data,u,v):
    """
    Calculates the pearson correlation coefficient as similarity measure between the 2 users {u,v} on personality.
    - {u,v} are users represented by their 'ID' [int]
    - dependent on the 'overview' DataFrame
    """
    if(u == v):
        return 1
    
    ratings_matrix_pers = pd.DataFrame({
        'Openness' : data['Openness_norm'][[u,v]] - data['avg_personality'][[u,v]],
        'Education' : data['Education_norm'][[u,v]] - data['avg_personality'][[u,v]],
        'Employment' : data['Employment_norm'][[u,v]] - data['avg_personality'][[u,v]]
    })
    
    # calculate proximity
    
    ## sum over (all adjusted ratings multiplied per item for both users)
    frac_top = sum([ratings_matrix_pers[k].product() for k in ratings_matrix_pers])
    
    ## sum over all adjusted item ratings
    frac_bottom = math.sqrt(sum(ratings_matrix_pers.loc[u]**2) * sum(ratings_matrix_pers.loc[v]**2))
    
    return(frac_top/frac_bottom)

### Recommender system [Linear Hybrid Collaborative Filtering]

$\large sim_{lhcf}(u,v) = \alpha * sim(u,v) + (1-\alpha) * simp(u,v)$

$\alpha = 0.8 * |I_u \cap I_v| / (|I_u \cap I_v| + 5)$

In [10]:
def sim_full(data, u, v):
    """
    Linear Hybrid Collaborative Filtering, linearly integrating personality and ratings.
    a = weight parameter
    """
    alpha = 0.8 * 9 / (9+5)
    
    return(alpha * sim(data,u,v) + (1-alpha) * sim_personality(data,u,v))

### Evaluation

parameters:
1. neighbours
    - k = 2/3/4
2. sampling ratings training data
    - 50/75/100%, or 5/7/9 constraint ratings
3. sampling new user data
    - 2/5 ratings given

In [11]:
# set all parameter options

num_neighbours = [2,3,4]
num_ratings_train_users = [5,7,9]
num_ratings_test_user = [1,2,5]

In [12]:
def LOO(overview, num_neighbours, num_ratings_train_users, num_ratings_test_user):
    """
    Implements Leave One Out cross-validation. 
    'overview' is full dataset, 
    'k' is number of neighbours, 
    'num_ratings_train_users' is number of ratings to use for training, 
    'num_ratings_test_user' is amount of ratings to grab randomly from new user in test.
    """
    #store results per round of LOO
    prelim_results = []

    for ID in overview.index:

        #split into train and test sets
        train = overview.drop(ID)
        test = overview.loc[ID]


        ### retain sample of ratings / remove set number of ratings


        # vs = 5, tc = 13
        cols = [5,6,7,8,9,10,11,12,13]

        ## for train set
        # amount of columns to set to zero
        num_col_rem = 9 - num_ratings_train_users
        # set number of columns to zero
        for row in range(0,len(train.index)):
            train.iloc[row, sample(cols, num_col_rem)] = 0
        # recalculate average ratings for training
        for row in range(0,len(train.index)):
            train.iloc[row, 14] = sum(train.iloc[row, cols])/num_ratings_train_users

        ## for test set
        # amount of columns to set to zero
        num_col_rem = 9 - num_ratings_test_user
        # set number of columns to zero
        test.iloc[sample(cols, num_col_rem)] = 0
        # recalculate average ratings for test
        test['avg_rating'] = sum(test.iloc[cols])/num_ratings_test_user

        ## paste data sets together again
        data = pd.concat([train, test.to_frame().T])


        ### create similarity list for new user


        # get similarity with other users, put into dataframe
        u_sim = pd.DataFrame(list({v: sim_full(data, ID, v) for v in train.index}.items()), columns = ['ID',ID]).set_index('ID').T

        # get top neighbours
        top_neighbours = u_sim.T.abs().sort_values(ID).tail(3)


        ### Calculate performance


        # calculate kappa
        kappa = 1/sum(top_neighbours.values)

        # calculate expected ratings
        sum_v = 0
        for v in top_neighbours.index:
            sum_v += u_sim.loc[ID, v] * ( data.loc[v][cols] - data.loc[v]['avg_rating'] )
        expected_ratings = sum_v * kappa + test['avg_rating']

        # calculate measures
        abs_error = (expected_ratings - overview.loc[ID][cols]).abs()
        MAE = np.mean(abs_error)
        RMSE = math.sqrt(np.mean(abs_error**2))

        # save preliminary performance information
        prelim_results.append({'ID': ID, 
                               'num_neighbours': num_neighbours, 
                               'num_ratings_train_users': num_ratings_train_users, 
                               'num_ratings_test_user': num_ratings_test_user,
                               'MAE': MAE,
                               'RMSE': RMSE})
    results = {'num_neighbours': num_neighbours, 
               'num_ratings_train_users': num_ratings_train_users, 
               'num_ratings_test_user': num_ratings_test_user,
               'MAE': np.mean(pd.DataFrame(prelim_results)['MAE']),
               'RMSE': np.mean(pd.DataFrame(prelim_results)['RMSE'])}
    return(results)

In [13]:
# run LOO for all parameter options

full_results = []

for k in num_neighbours:
    for l in num_ratings_train_users:
        for m in num_ratings_test_user:
            full_results.append(LOO(overview, k, l, m))
            
full_results_df = pd.DataFrame(full_results)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


### Results

In [15]:
full_results_df.sort_values('MAE')

Unnamed: 0,num_neighbours,num_ratings_train_users,num_ratings_test_user,MAE,RMSE
26,4,9,5,1.470995,1.77399
8,2,9,5,1.480185,1.790021
17,3,9,5,1.503676,1.826815
25,4,9,2,1.540844,1.83297
7,2,9,2,1.681711,2.038779
16,3,9,2,1.738367,2.074516
24,4,9,1,1.821611,2.208167
15,3,9,1,2.014873,2.391765
5,2,7,5,2.060978,2.565678
14,3,7,5,2.090686,2.702719


In [16]:
full_results_df.sort_values('RMSE')

Unnamed: 0,num_neighbours,num_ratings_train_users,num_ratings_test_user,MAE,RMSE
26,4,9,5,1.470995,1.77399
8,2,9,5,1.480185,1.790021
17,3,9,5,1.503676,1.826815
25,4,9,2,1.540844,1.83297
7,2,9,2,1.681711,2.038779
16,3,9,2,1.738367,2.074516
24,4,9,1,1.821611,2.208167
15,3,9,1,2.014873,2.391765
6,2,9,1,2.138763,2.531059
5,2,7,5,2.060978,2.565678


In [17]:
full_results_df

Unnamed: 0,num_neighbours,num_ratings_train_users,num_ratings_test_user,MAE,RMSE
0,2,5,1,3.704774,4.315251
1,2,5,2,3.893416,4.649557
2,2,5,5,3.52168,4.248474
3,2,7,1,2.412785,2.944037
4,2,7,2,2.331529,2.881682
5,2,7,5,2.060978,2.565678
6,2,9,1,2.138763,2.531059
7,2,9,2,1.681711,2.038779
8,2,9,5,1.480185,1.790021
9,3,5,1,3.568136,4.100894
