Àlex Escolà

- <h1> Jester Online Joke Recommender System </h1>

    - <h4> Traditional Neiborhood Collaborative Filtering </h4>
    In this first section, a matrix of similarity between all users is firstly defined. The Pearson correlation is used to define this similarity between users. Given that the matrix is symmetric, the main functions defined until a prediction for each test rating is obtained, only work with the lower triangular part of the matrix for computational time purposes. Having obbtained a similarity between all users  several prediction functions are tested in order to obtain an estimate of each rating in the test split, which will be later compared with the real value. The rating estimation in the user-user case is obtained as an average of the ratings of the k most similar users to the user in question times their similarity, obtained from the similarity matrix.
    
    The same process is repeated but defining an item-item similarity matrix. The process is very similar, but the prediction is performed by looking through similar items to predict a rating for a new item, instead of looking at similar users. Specifically, as before, the rating is estimated through the top k similar items. Results in this section show to be slightly better.
    - <h4> Graph-Based recommender system </h4>
    In this section a User-Item graph is built for each user independently using the Networkx library, from which the page rank is obtained and used as a measure of similarity. More detaliled in the section.
    - <h4>  Content-Based recommender </h4>
    In order to obtain the a measure of similarity between jokes, the process implemented here is as follows:
        - All jokes are modelled by several topics using LDA
        - The resulting vector of probabilites of the jokes belonging to each topic are obtained
        - These vectors are used to obtain similarity measures between the jokes, taking the Pearson correlation

In [2]:
import numpy as np
import pandas as pd
import random
from itertools import chain
import matplotlib.pyplot as plt

* <h3> Data preprocessing</h3>

For the following sections 1.000 samples from the original dataset are used, given that 20.000 reuquires a lot of computational time, and the result however does not vary significantly. 

In [3]:
df1 = pd.read_excel('data/jester/jester-data-1.xls',header=None)
df2 = pd.read_excel('data/jester/jester-data-2.xls',header=None)
df3 = pd.read_excel('data/jester/jester-data-3.xls',header=None)

In [17]:
df = pd.concat([df1,df2,df3])
df.index = range(len(df))
#df = df.drop([0], axis = 1)

In [18]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,74,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,...,2.82,99.0,99.0,99.0,99.0,99.0,-5.63,99.0,99.0,99.0
1,100,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,49,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,99.0,...,99.0,99.0,99.0,9.08,99.0,99.0,99.0,99.0,99.0,99.0
3,48,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21,99.0,...,99.0,99.0,99.0,0.53,99.0,99.0,99.0,99.0,99.0,99.0
4,91,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6


In [19]:
# Get a subsample datset of 20.000 samples
df = df.sample(1000)

# Divide the data into training and test set
training = []
test = []

training = pd.DataFrame(index=df.index,columns=['items','#items'])
test = pd.DataFrame(index=df.index,columns=['items','#items'])

random.seed(47)

for index,row in df.iterrows():
    row = row[1:]
    l = row[row <11]   
    training_indexes = random.sample(l.index,int(0.75*(len(l))))
    test_indexes = list(set(l.index) - set(training_indexes))
    test['items'].ix[index] = test_indexes
    test['#items'].ix[index] = len(test_indexes)
    training['items'].ix[index] = training_indexes
    training['#items'].ix[index] = len(training_indexes)
   

In [295]:
training.head()

Unnamed: 0,items,#items
54311,"[18, 20, 70, 17, 63, 57, 5, 50, 46, 7, 13, 27,...",16
72458,"[35, 16, 31, 19, 68, 5, 18, 32, 27, 20, 77, 13...",19
38730,"[18, 46, 61, 64, 57, 15, 49, 50, 39, 42, 94, 4...",53
44417,"[70, 49, 15, 65, 13, 68, 50, 22, 25, 18, 5, 11...",38
61433,"[27, 38, 82, 55, 7, 17, 18, 32, 15, 13, 8, 67,...",14


* <h3> Collaborative: Rating prediction though the similarity between users using Pearson Correlation </h3>

In [29]:
from scipy.stats import pearsonr

Here the Pearson correlation between all users is calculated, resulting in a symmetric matrix. For this reason and in order to save computing time only the lower triangular part of the matrix is calculated. 

In [313]:
Pearson = pd.DataFrame(index=training.index,columns=training.index)
cont = 1
for user1 in training.index:
    for user2 in training.index[cont:]:
        items1 = training['items'].loc[user1]
        items2 = training['items'].loc[user2]
        n_inters = set(items1).intersection(items2)
        Pearson[user1][user2]=pearsonr(df.loc[user1],df.loc[user2])[0]
    cont += 1

In [89]:
Pearson.head()

Unnamed: 0,50718,3722,42825,53055,61686,30845,58555,51496,2845,40671,...,58068,13889,51600,50801,65963,31193,15081,43462,54697,48399
50718,,,,,,,,,,,...,,,,,,,,,,
3722,0.41699,,,,,,,,,,...,,,,,,,,,,
42825,-0.00914689,-0.0311717,,,,,,,,,...,,,,,,,,,,
53055,0.574231,0.327561,0.029006,,,,,,,,...,,,,,,,,,,
61686,0.546881,0.241821,-0.0523102,0.594919,,,,,,,...,,,,,,,,,,


The following function selects for a specific user, all users which having rated the same item, have the highest Pearson correlation with this user. This function is called within the predict fuction.

In [90]:
def most_similar(user, item, Pearson, training, k=10):
    users_rated_item = [idx for idx in training.index if item in training['items'].loc[idx]]
    y = Pearson[user][Pearson[user].notnull()]
    x = Pearson.ix[user][Pearson.ix[user].notnull()]
    all_users = pd.concat([x,y], axis = 0)
    users_with_rating = all_users * all_users.index.isin(users_rated_item)
    sort = users_with_rating.iloc[users_with_rating.argsort()][::-1][:k]
    return sort

As an example:

In [91]:
most_similar(50718, 1, Pearson, training)

695      0.592028
48875    0.518501
880      0.491167
39462    0.456271
39018    0.412335
47376    0.365912
40613    0.363519
16572    0.355471
7644     0.337809
19659    0.329439
Name: 50718, dtype: object

Bellow are three different versions of prediction functions based on the similarity among the most similar users. These take into account the most similar users in order to predict the rating of this  user on a specific item.

The users who's ratings will be predicted are thoes within the test splint of the original dataset, in order to perform accuracy metrics further on. 

These ratings will be further on used in order to create a "prediction matrix", which will contain the ratings of all test elements, and will be then compared to the actual test data.

As a first prediction function, the standart method is implemented for predicting new user ratings from the similarity among other users is shown bellow:

\begin{equation}
\hat{r}_{u,j}=\frac{\sum_{v \in{P_u(j)}}{sim(u,v)\cdot r_{v,j}}}{\sum_{v \in{P_u(j)}}{sim(u,v)}}
\end{equation}

In [93]:
def predict_standart(user, item, Pearson, training, k=10):
    similars = most_similar(user, item, Pearson, training, k)
    numerator = [None]*len(similars)
    boolean = [None]*len(similars)
    for SimilarUser,similarity,i in zip(similars.index, similars.values, range(len(similars))):
        rating = df.ix[SimilarUser][item] 
        boolean[i] = rating < 11
        numerator[i] = rating * similarity if boolean[i] else 0
    return sum(numerator)/sum(similars.values * boolean) 

The following prediction function has also been implemented, which as an addition to the previous one, it also takes into account the different scales in the ratings, or the bias in different user rating behaviours:

\begin{equation}
\hat{r}_{u,j}=\bar{r}_u + \frac{\sum_{v \in{P_u(j)}}{sim(u,v)\cdot(r_{v,j}-\bar{r}_v})}{\sum_{v \in{P_u(j)}}{sim(u,v)}}
\end{equation}

In [94]:
def predict_rating_bias(user, item, Pearson, training, k=10):
    similars = most_similar(user, item, Pearson, training, k)
    numerator = [None]*len(similars)
    boolean = [None]*len(similars)
    for SimilarUser,similarity,i in zip(similars.index, similars.values, range(len(similars))):
        mean_rv = np.mean(df.ix[SimilarUser][df.ix[SimilarUser] != 99.00])
        rating = df.ix[SimilarUser][item] - mean_rv
        boolean[i] = rating < 11
        numerator[i] = rating * similarity if boolean[i] else 0
        mean_ru = np.mean(df.ix[user][df.ix[user] != 99.00])
    return (sum(numerator)/sum(similars.values * boolean)) + mean_ru

This estimator is similar to the latter, but also takes into ccount the standart deviation of the user ratings. This allow to compensate from the variance of the ratings from user to user, having this way a normalized rating for each user.

\begin{equation}
\hat{r}_{u,j}=\bar{r}_u + \sigma _u \frac{\sum_{v \in{P_u(j)}}{sim(u,v)\cdot z_{v,j}}}{\sum_{v \in{P_u(j)}}{\left | sim(u,v)\right |}}
\end{equation}

In [95]:
def predict_rating_bias_std(user, item, Pearson, training, k=10):
    similars = most_similar(user, item, Pearson, training, k)
    numerator = [None]*len(similars)
    boolean = [None]*len(similars)
    for SimilarUser,similarity,i in zip(similars.index, similars.values, range(len(similars))):
        mean_rv = np.mean(df.ix[SimilarUser][df.ix[SimilarUser] != 99.00])
        std_rv = np.std(df.ix[SimilarUser][df.ix[SimilarUser] != 99.00])
        rating = (df.ix[SimilarUser][item] - mean_rv)/std_rv
        boolean[i] = rating < 11
        numerator[i] = rating * similarity if boolean[i] else 0
        mean_ru = np.mean(df.ix[user][df.ix[user] != 99.00])
        std_ru = np.std(df.ix[user][df.ix[user] != 99.00])
    return (sum(numerator)/sum(abs(similars.values * boolean)))*std_ru + mean_ru

- Prediction matrix: 

    Here a matrix if the original matrix dimentions is generated, where a predicted rating is calculated for all users and each item in the test split:

In [96]:
def predition_matrix(prediction_function, test, similarity_mtx, training):
    PredMat = pd.DataFrame(index = test.index, columns = range(101))
    for user in test.index:
        for item in test.loc[user]['items']:
            PredMat[item][user] = prediction_function(user, item, similarity_mtx, training)
    return PredMat

Here the following funtion is defined, which will return the Mean Absolute Error, and it will receive as an input the different prediction matrices obtained though the different prediction methods:

In [97]:
def mae(PredMat):
    mae = 0.
    for user in PredMat.index:
        predicts = PredMat.loc[user][PredMat.loc[user].notnull()]
        df_values = df.loc[user][predicts.index]
        mae += np.mean(abs(predicts - df_values))
    return mae / len (PredMat.index)

* Results using standart rating prediction  function

In [314]:
pred_mat_rating_standart = predition_matrix(predict_standart, test, Pearson, training)

In [122]:
pred_mat_rating_standart.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
50718,,,,,,,,-3.71654,,,...,,,,,,-0.509791,,,,
3722,,,-2.27847,,,,2.19155,,0.227044,,...,,,,,,,,,,
42825,,,,,,,,1.48023,,,...,,4.16772,,2.01805,,4.49796,,3.98365,,
53055,,,,,,,,,,,...,,,,,,,,,,
61686,,,,,,,,,,,...,,,0.500709,,,,,,,


In [99]:
mae_rating_standart = mae(pred_mat_rating_standart)
print "MAE: {}".format(round(mae_rating_standart,3))
print "NMAE: {}".format(round(mae_rating_standart/20,3))

MAE: 4.077
NMAE: 0.204


* Results using the rating prediction "with bias" function

In [100]:
pred_mat_rating_bias = predition_matrix(predict_rating_bias, test, Pearson, training)

In [123]:
pred_mat_rating_bias.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
50718,,,,,,,,-4.47072,,,...,,,,,,-3.56866,,,,
3722,,,-3.25875,,,,1.10671,,-1.03517,,...,,,,,,,,,,
42825,,,,,,,,-1.56512,,,...,,2.08296,,0.29587,,2.2159,,1.83984,,
53055,,,,,,,,,,,...,,,,,,,,,,
61686,,,,,,,,,,,...,,,0.799581,,,,,,,


In [101]:
mae_rating_bias = mae(pred_mat_rating_bias)
print "MAE: {}".format(round(mae_rating_bias,3))
print "NMAE: {}".format(round(mae_rating_bias/20,3))

MAE: 3.377
NMAE: 0.169


* Results using the rating prediction "with bias and std" function

In [102]:
pred_mat_rating_bias_std = predition_matrix(predict_rating_bias_std, test, Pearson, training)

In [124]:
pred_mat_rating_bias_std.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
50718,,,,,,,,-3.75975,,,...,,,,,,-3.16134,,,,
3722,,,-3.54243,,,,1.14088,,-1.37988,,...,,,,,,,,,,
42825,,,,,,,,-2.19687,,,...,,2.10448,,0.108869,,2.29535,,1.82874,,
53055,,,,,,,,,,,...,,,,,,,,,,
61686,,,,,,,,,,,...,,,1.10479,,,,,,,


In [103]:
mae_rating_bias_std = mae(pred_mat_rating_bias_std)
print "MAE: {}".format(round(mae_rating_bias_std,3))
print "NMAE: {}".format(round(mae_rating_bias_std/20,3))

MAE: 3.326
NMAE: 0.166


* <h3> Collaborative: Rating prediction through the similarity between items using the Pearson Correlation </h3>

In this case, given that the prediction will be based on the similarity measured between items, in order to simplify further operations the training-test splits are performed on portions of sets of users who have rated each item. Also this way there will be the certainty that all items have several ratings, for when lower datasets are used.

In [20]:
# Divide the data into training and test set
training_items = []
test_items = []

training_items = pd.DataFrame(index=df.columns,columns=['users'])[1:]
test_items = pd.DataFrame(index=df.columns,columns=['users'])[1:]

random.seed(47)

for index,row in df.transpose()[1:].iterrows():
    row = row[1:]
    l = row[row <11]   
    training_indexes = random.sample(l.index,int(0.75*(len(l))))
    test_indexes = list(set(l.index) - set(training_indexes))
    test_items['users'].ix[index] = test_indexes
    training_items['users'].ix[index] = training_indexes   

In [106]:
training_items.head()

Unnamed: 0,users
1,"[45496, 7736, 34957, 23436, 32535, 17767, 1912..."
2,"[12316, 40598, 26954, 10143, 37723, 46273, 483..."
3,"[22614, 4040, 41830, 26609, 27523, 35409, 4742..."
4,"[21426, 40799, 4240, 7503, 44196, 45420, 26080..."
5,"[72078, 28243, 60633, 63020, 41085, 29371, 454..."


Here the Pearson correlation matrix between items is defined:

In [107]:
Pearson_items = pd.DataFrame(index=training_items.index, columns=training_items.index)
cont = 1
for item1 in training_items.index:
    for item2 in training_items.index[cont:]:
        users1 = training_items['users'].loc[item1]
        users2 = training_items['users'].loc[item2]
        n_inters = set(users1).intersection(users2)
        Pearson_items[item1][item2]=pearsonr(df.transpose().loc[item1],df.transpose().loc[item2])[0]
    cont += 1

In [108]:
Pearson_items.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
1,,,,,,,,,,,...,,,,,,,,,,
2,0.883079,,,,,,,,,,...,,,,,,,,,,
3,0.938265,0.873756,,,,,,,,,...,,,,,,,,,,
4,0.912433,0.835149,0.937821,,,,,,,,...,,,,,,,,,,
5,0.0110715,-0.0049216,-0.0173057,-0.0224777,,,,,,,...,,,,,,,,,,


Similarly to before, we want to obtaina list of "top k most similar" items in this case, which will be used in the prediciton functions to obtain the rating of a new item based on the most similar items.

In [33]:
def most_similar_items(item, user, Pearson, training, k=10):
    items_rated_by_same_user = [idx for idx in training.index if user in training['users'].loc[idx]]
    y = Pearson[item][Pearson[item].notnull()]
    x = Pearson.ix[item][Pearson.ix[item].notnull()]
    all_items = pd.concat([x,y], axis = 0)
    items_with_rating = all_items * all_items.index.isin(items_rated_by_same_user)
    sort = items_with_rating.iloc[items_with_rating.argsort()][::-1][:k]
    return sort

In [110]:
most_similar_items(2,45496,Pearson_items, training_items)

1     0.883079
3     0.873756
10    0.864833
23    0.860456
33    0.858797
30    0.853744
25    0.849792
24    0.849499
6      0.84623
4     0.835149
Name: 2, dtype: object

Several prediction functions are also tested in this case, in this case the standart prediction method, and the normalized rating prediction method:

In [34]:
def predict_standart_item_based(user, item, Pearson, training, k=10):
    similar_items = most_similar_items(item, user, Pearson, training)
    numerator = [None]*len(similar_items)
    boolean = [None]*len(similar_items)
    for SimilarItem,similarity,i in zip(similar_items.index, similar_items.values, range(len(similar_items))):
        rating = df.transpose().ix[SimilarItem][user] 
        boolean[i] = rating < 11
        numerator[i] = rating * similarity if boolean[i] else 0
    return sum(numerator)/sum(similar_items.values * boolean) 

In [35]:
def predict_rating_bias_std_item_based(user, item, Pearson, training, k=10):
    similar_items = most_similar_items(item, user, Pearson, training)
    numerator = [None]*len(similar_items)
    boolean = [None]*len(similar_items)
    df_t = df.transpose()
    for SimilarItem,similarity,i in zip(similar_items.index, similar_items.values, range(len(similar_items))):
        mean_rv = np.mean(df_t.ix[SimilarItem][df_t.ix[SimilarItem] != 99.00])
        std_rv = np.std(df_t.ix[SimilarItem][df_t.ix[SimilarItem] != 99.00])
        rating = (df_t.ix[SimilarItem][user] - mean_rv)/std_rv
        boolean[i] = rating < 11
        numerator[i] = rating * similarity if boolean[i] else 0
        mean_ru = np.mean(df_t.ix[item][df_t.ix[item] != 99.00])
        std_ru = np.std(df_t.ix[item][df_t.ix[item] != 99.00])
    return (sum(numerator)/sum(abs(similar_items.values * boolean)))*std_ru + mean_ru

In [36]:
def predition_matrix_item_based(prediction_function, test, similarity_mtx, training):
    
    # Prediction matrix columns only with the users that appear in the test split
    unique_users = np.unique(list(chain.from_iterable(test['users'].tolist())))
    
    PredMat = pd.DataFrame(index = test.index, columns = unique_users)
    for item in test.index[:96]:
        for user in test.loc[item]['users']:
            PredMat[user][item] = prediction_function(user, item, similarity_mtx, training)
    return PredMat

In [37]:
def mae_item_based(PredMat):
    mae = 0.
    for user in PredMat.transpose().index:
        predicts = PredMat.transpose().loc[user][PredMat.transpose().loc[user].notnull()]
        df_values = df.loc[user][predicts.index]
        mae += np.mean(abs(predicts - df_values))
    return mae / len (PredMat.transpose().index)

Predictions of the test split obtained through the standart prediction function:

In [115]:
pred_mat_item_based_standart = predition_matrix_item_based(predict_standart_item_based, test_items, Pearson_items, training_items)

In [116]:
pred_mat_item_based_standart.head()

Unnamed: 0,209,241,245,270,272,511,542,614,626,695,...,72930,73014,73071,73079,73096,73202,73256,73288,73322,73415
1,,6.91439,-0.163414,,,,,,,3.91533,...,,,,,,,,,,
2,,,1.81399,,,,,,,,...,,,,,,,,,,7.13351
3,,,-0.0975623,,,,,,,,...,,,,,,,,,,6.33523
4,,,,,,,,,-2.12175,,...,,,,,,,,,,
5,8.29473,5.20917,0.27836,0.981683,3.16564,,,,,,...,,2.8799,,,,-1.10646,,,-2.78748,


And through the normalized rating prediction function:

In [117]:
pred_mat_item_based_bias_std = predition_matrix_item_based(predict_rating_bias_std_item_based, test_items, Pearson_items, training_items)

In [118]:
pred_mat_item_based_bias_std.head()

Unnamed: 0,209,241,245,270,272,511,542,614,626,695,...,72930,73014,73071,73079,73096,73202,73256,73288,73322,73415
1,,9.04855,2.58153,,,,,,,4.00247,...,,,,,,,,,,
2,,,1.80175,,,,,,,,...,,,,,,,,,,8.1486
3,,,1.68966,,,,,,,,...,,,,,,,,,,7.4049
4,,,,,,,,,-2.3716,,...,,,,,,,,,,
5,9.86118,6.72701,1.6324,1.88806,4.7026,,,,,,...,,4.12169,,,,0.278095,,,-1.41028,


As it can be seen, the recomendation based on an item-item similarity matrix, sensibly improves results obtained using a user-user similarity matrix for this problem.

Results using the standart prediction function:

In [119]:
mae_item_based_standart = mae_item_based(pred_mat_item_based_standart)
print "MAE: {}".format(round(mae_item_based_standart,3))
print "NMAE: {}".format(round(mae_item_based_standart/20,3))

MAE: 3.407
NMAE: 0.17


And using the latter prediciton function, which takes into account possible biases in rating behaviours and also the standart deviation:

In [120]:
mae_item_based_bias_std = mae_item_based(pred_mat_item_based_bias_std)
print "MAE: {}".format(round(mae_item_based_bias_std,3))
print "NMAE: {}".format(round(mae_item_based_bias_std/20,3))

MAE: 3.326
NMAE: 0.166


* <h3> Collaborative: Similarity between users using personalized Page Rank </h3>

In this section a User-Item graph is defined for each user individually, through which the page rank is obtained and used as a similarity measure between the different users. The top ranked user are used as the most similar ones, and taken into account for the rating estimation.

Firstly a general graph is defined with the all the interactions of all users to their corresponding items, where each weight is set to be the corresponding rating, divided by the amount of ratings the corresponding user has made.

Afterwoods this same original graph is called for each user individually and links of weight 1/n are assigned between all items and all other items (avoiding items that are already connected) and also from each item to all other users, except for the user in question, to which a rating of 1/n is added to the already existing rating, personalizing this way the graph for each user.


Networkx is used to both define the graph and calculate the Page rank on each graph that is defined.

In [9]:
import networkx as nx

In [10]:
training.head()

Unnamed: 0,items,#items
31787,"[31, 35, 36, 29, 39, 96, 7, 69, 5, 13, 19, 56,...",33
24149,"[39, 54, 25, 63, 85, 89, 80, 21, 69, 71, 55, 6...",75
28035,"[53, 56, 1, 8, 81, 88, 32, 96, 93, 22, 6, 11, ...",75
7856,"[94, 51, 30, 93, 52, 35, 11, 87, 79, 68, 33, 3...",75
46938,"[44, 64, 35, 30, 56, 27, 33, 28, 58, 20, 52, 8...",53


In [11]:
g_original = nx.DiGraph()
g_original.add_nodes_from(training.index, bipartite=0)
g_original.add_nodes_from(range(1,100), bipartite=1)

In [12]:
for user in training.index:
    n = len(training.ix[user]['items'])
    for item in training.ix[user][0]:
        weight = df.ix[user].loc[item]
        g_original.add_weighted_edges_from([(user,item,weight/n)])

In [13]:
def page_rank(g, user, training, df):
    
    add_link_from_item = 1
    n = len(training.ix[user]['items'])
    for item in training.ix[user][0]:
        
        # Links of the user in question are redefined as the already existing ones + 1/n
        weight = df.ix[user].loc[item]
        g.add_weighted_edges_from([(user,item,weight/n + 1./n)])
        
        # Set of links to avoid dangling nodes
        # Links of weight 1/n from item in question to all users( except thoes previously added)
        set_of_users_to_link = [(item, users, 1./n) for users in training.index if users != user]
        g.add_weighted_edges_from(set_of_users_to_link)
        
        # Links of weight 1/n from item in question to all other items (excluding existing links)
        range_items_to_link = range(add_link_from_item,100)
        set_of_items_to_link = [(item, i, 1./n) for i in range_items_to_link]
        g.add_weighted_edges_from(set_of_items_to_link)
        add_link_from_item += 1
    return g

The following function builds the similarity matrix by calling the previous function for each user, and therefore building a User-Item graph for each user,and obtaining a similarity measure. 

After the graph (g) is returned from the previus function, the pagerank builtin function from Networkx is called (nx.pagerank(g)), and each Rank-User_2 couple returned for each User_1, are used to define a similarity of User_1 with User_2

In [329]:
similarity_mtx = pd.DataFrame(index=training.index,columns=training.index)
for user1 in training.index:
    g = page_rank(g_original, user1, training, df)
    pr = nx.pagerank(g)
    [similarity_mtx.set_value(user1, user2, rat) for user2, rat in zip(pr.keys(), pr.values()) if user2 > 100]           

The prediction matrix is built in this case the same way as before

In [330]:
pred_mat_rating_bias_page_rank = predition_matrix(predict_rating_bias, test, similarity_mtx, training)

Results are very similar to thoes obtained using a collaborative filtering approach. The main issue in this case on why results are not improved through this method, is that page rank techniques tend to work better when there is more sparsity in the data.

In [331]:
mae_page_rank = mae(pred_mat_rating_bias_page_rank)
print "MAE: {}".format(round(mae_page_rank,3))
print "NMAE: {}".format(round(mae_page_rank/20,3))

MAE: 3.488
NMAE: 0.174


* <h3> Content Based </h3>

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import lda
import gensim

In [22]:
import codecs
import re

- Obtaining the jokes:

In [23]:
jokes_read = [None]*100
for i in range(1,100):
    files = "jokes/init{}.html".format(i)
    f = codecs.open(files, 'r').read()
    jokes_read[i] = re.findall ( '-->\n(.*?)\n<!--', f, re.DOTALL)
del jokes_read[0]

In [24]:
jokes = [item for sublist in jokes_read for item in sublist]
[joke for joke in jokes[:5]]

['A man visits the doctor. The doctor says "I have bad news for you.You have\ncancer and Alzheimer\'s disease". <P>\nThe man replies "Well,thank God I don\'t have cancer!"',
 'This couple had an excellent relationship going until one day he came home\nfrom work to find his girlfriend packing. He asked her why she was leaving him\nand she told him that she had heard awful things about him. \n<P>\n"What could they possibly have said to make you move out?" \n<P>\n"They told me that you were a pedophile." \n<P>\nHe replied, "That\'s an awfully big word for a ten year old." ',
 "Q. What's 200 feet long and has 4 teeth? <P>\n\nA. The front row at a Willie Nelson Concert.",
 "Q. What's the difference between a man and a toilet? \n<P>\nA. A toilet doesn't follow you around after you use it.",
 "Q.\tWhat's O. J. Simpson's Internet address? <P>\nA.\tSlash, slash, backslash, slash, slash, escape."]

----------
In order to obtain the a measure of similarity between jokes, the process implemented here is as follows:
- All jokes are modelled by several topics using LDA
- The resulting vector of probabilites of the jokes belonging to each topic are obtained
- These vectors are used to obtain similarity measures between the jokes, by using the Pearson correlation

In [25]:
def topic_modeling(x, n_topics, n_iter):
    vectorizer =  CountVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = ["english"])
    matrix =  vectorizer.fit_transform(x)
    feature_names = vectorizer.get_feature_names()
    model = lda.LDA(n_topics = n_topics, n_iter = n_iter, random_state=1)
    model.fit(matrix.astype(int))
    topic_word = model.topic_word_
    doc_topic = model.doc_topic_
    return model, feature_names, topic_word, doc_topic

In [41]:
lda_model, feature_names, topic_word, doc_topic = topic_modeling(jokes, n_topics=10, n_iter=100)

INFO:lda:n_documents: 98
INFO:lda:vocab_size: 1550
INFO:lda:n_words: 5415
INFO:lda:n_topics: 10
INFO:lda:n_iter: 100
INFO:lda:<0> log likelihood: -56209
INFO:lda:<10> log likelihood: -42401
INFO:lda:<20> log likelihood: -41224
INFO:lda:<30> log likelihood: -40797
INFO:lda:<40> log likelihood: -40462
INFO:lda:<50> log likelihood: -40327
INFO:lda:<60> log likelihood: -40289
INFO:lda:<70> log likelihood: -40189
INFO:lda:<80> log likelihood: -39929
INFO:lda:<90> log likelihood: -39955
INFO:lda:<99> log likelihood: -39709


Though the top_modeling function, a vector of probabilities of each joke of belonging to each topic (in this case 10 topics) are obtained, which will be used as a measure of similarity between jokes:

In [42]:
# Shows topics at which the IDs belong.
doc_topic = lda_model.doc_topic_
for i in range(6,8):
    print "Joke: {}".format(jokes[i])
    print ""
    print "Doc_topic:{}".format(doc_topic[i])
    print ""

Joke: How many feminists does it take to screw in a light bulb?<P>
That's not funny.

Doc_topic:[ 0.07333333  0.00666667  0.00666667  0.00666667  0.00666667  0.00666667
  0.87333333  0.00666667  0.00666667  0.00666667]

Joke: Q. Did you hear about the dyslexic devil worshiper? 
<P>
A. He sold his soul to Santa.

Doc_topic:[ 0.34        0.00666667  0.27333333  0.00666667  0.00666667  0.00666667
  0.07333333  0.27333333  0.00666667  0.00666667]



-----------

In [30]:
Pearson_jokes = pd.DataFrame(index = range(1,97),columns = range(1,97))
cont = 1
for item1 in range(1,97):
    for item2 in range(1,97)[cont:]:
        Pearson_jokes[item1][item2]=pearsonr(doc_topic[item1],doc_topic[item2])[0]
    cont += 1

In [31]:
Pearson_jokes.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,87,88,89,90,91,92,93,94,95,96
1,,,,,,,,,,,...,,,,,,,,,,
2,-0.113262,,,,,,,,,,...,,,,,,,,,,
3,-0.0729681,0.0118106,,,,,,,,,...,,,,,,,,,,
4,-0.00999327,0.00555761,-0.0473207,,,,,,,,...,,,,,,,,,,
5,-0.0159041,-0.0316496,-0.0534512,-0.0590373,,,,,,,...,,,,,,,,,,


And the same prediction algorithms as before are used in this case to test the performance of the content based approach. As results show, for this problem the content based approach yields poorer results. However, it would be expectalbe that an improvement should be noticed for this case, when using the whole original dataset, as I just took a subsample due to computational time requirements.

In [38]:
pred_mat_content_based_bias_std = predition_matrix_item_based(predict_rating_bias_std_item_based, test_items, Pearson_jokes, training_items)

Results after topic modeling with 50 topics.

In [40]:
mae_content_based_bias_std = mae_item_based(pred_mat_content_based_bias_std)
print "MAE: {}".format(round(mae_content_based_bias_std,3))
print "NMAE: {}".format(round(mae_content_based_bias_std/20,3))

MAE: 3.685
NMAE: 0.184


Results after topic modeling with 10 topics:

In [346]:
mae_content_based_bias_std = mae_item_based(pred_mat_content_based_bias_std)
print "MAE: {}".format(round(mae_content_based_bias_std,3))
print "NMAE: {}".format(round(mae_content_based_bias_std/20,3))

MAE: 3.628
NMAE: 0.181


In [1]:
import numpy as np

In [6]:
2**30 +2**30 +2**30+2**30 ==2**32

True