## Overview
Provide a recommendation engine for a Quora-like question-and-answer website. For the first part of the assignment, the engine must be based on the users' binary feedback and the question topics (tags). For the second part, a set of hybrid methods has to be used to fine-tune the recommendations.

## Instructions
The solution is to be implemented in Python.
The proposed solutions need to be properly explained in a one-page PDF document so that the rationale behind them can be understood.
This is an individual assignment.

## 1. Collected Data
- Category: questions and answers
- Action types: explicit (binary ratings), implicit (popularity)
- Format: Excel
- Size: 20x10 matrix (20 questions, 10 topics)

## Dataset Preparation

In [56]:
import pandas as pd
import numpy as np

In [57]:
xls = pd.ExcelFile('./exerciseCB.xlsx')
df1 = pd.read_excel(xls, 'Sheet1', header = 0, index_col = 0)

In [58]:
df1 = df1.fillna(0)
df1

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes,User 1 - WA / DV,User 2 - WA / DV,User 3 - WA / DV,User 4 - WA / DV,User 1 - U/D,User 2 - U/D,User 3 - U/D,User 4 - U/D
question1,1,0,1,0,1,1,0,0,0,1,1.0,-1.0,0.0,0.0,15.0,0.0,0.0,0.0
question2,0,1,1,1,0,0,0,1,0,0,-1.0,1.0,0.0,0.0,0.0,0.0,40.0,0.0
question3,0,0,0,1,1,1,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
question4,0,0,1,1,0,0,1,1,0,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
question5,0,1,0,0,0,0,0,0,1,1,0.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0
question6,1,0,0,1,0,0,0,0,0,0,1.0,0.0,0.0,0.0,25.0,0.0,0.0,0.0
question7,0,0,0,0,0,0,0,1,0,1,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0
question8,0,0,1,1,0,0,1,0,0,1,0.0,0.0,1.0,0.0,0.0,-4.0,0.0,0.0
question9,0,0,0,0,0,1,0,0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
question10,0,1,0,0,1,0,1,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
feedback = df1[list(df1.columns[:10])]
user_wa_dv = df1[list(df1.columns[10:14])]
user_up_down = df1[list(df1.columns[14:18])]

In [60]:
np_feedback = feedback.to_numpy()
np_user_wa_dv = user_wa_dv.to_numpy()
np_user_up_down = user_up_down.to_numpy()

## 2. Building User Profiles for Content-based Filtering
### Simple Unary

Given a set of users and questions, infer each user's profile from the topics of the questions the user likes or dislikes, using a dot product between the question-topic matrix and the user ratings. Each user ends up with one numeric value per topic. With the user profiles, predict how likely each user is to like or dislike each question, and count the total number of like, dislike and neutral predictions. Finally, use the predictions to provide the top-5 recommended questions for each user.
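As a minimal sketch of the two steps described above, the snippet below builds the profiles with a dot product and scores every question with cosine similarity on a tiny made-up matrix (3 questions, 2 topics, 2 users). The names `toy_topics`, `toy_ratings` and `cosine_scores` are illustrative and not part of the assignment data. Returning NaN for an all-zero profile is an assumption, chosen because a user without ratings has no direction to compare against.

```python
import numpy as np

# Toy data (hypothetical): rows = questions, columns = topics
toy_topics = np.array([[1, 0],
                       [1, 1],
                       [0, 1]], dtype=float)
# rows = questions, columns = users; 1 = like, -1 = dislike, 0 = no rating
toy_ratings = np.array([[ 1, 0],
                        [ 0, 0],
                        [-1, 0]], dtype=float)

def cosine_scores(topics, ratings):
    """Return a (questions x users) matrix of cosine predictions."""
    profiles = topics.T @ ratings                        # (topics x users)
    num = topics @ profiles                              # (questions x users)
    q_norm = np.linalg.norm(topics, axis=1)[:, None]     # one norm per question
    u_norm = np.linalg.norm(profiles, axis=0)[None, :]   # one norm per user
    denom = q_norm * u_norm
    # Assumption: keep NaN where the denominator is zero (user without ratings)
    return np.divide(num, denom, out=np.full_like(num, np.nan), where=denom != 0)

toy_scores = cosine_scores(toy_topics, toy_ratings)
```

The first toy user likes a question about topic 1 and dislikes one about topic 2, so the scores for that user are positive, zero and negative respectively, while the second user (no ratings) gets NaN for every question.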

In [61]:
# Create user profiles

user_profile = np.dot(np_feedback.transpose(),np_user_wa_dv)
user_profile

array([[ 3., -2., -2.,  0.],
       [-2.,  2.,  1.,  0.],
       [-1.,  2.,  1.,  0.],
       [ 0.,  3.,  0.,  0.],
       [ 0., -1.,  0.,  0.],
       [ 2., -2., -3.,  0.],
       [-1.,  0., -1.,  0.],
       [-1.,  3., -2.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  0.]])

In [62]:
# Exemplary cosine prediction for 1 question and 1 user
from math import sqrt

num = np.dot(user_profile[:,0],np_feedback[0])
denom = sqrt(np.dot(np_feedback[0],np_feedback[0])) * sqrt(np.dot(user_profile[:,0],user_profile[:,0]))

num / denom

0.3903600291794133

In [63]:
# Computing all predictions for each of the users by applying the cosine prediction method
# Retrieve number of likes, dislikes and neutral ratings per user by counting over the array columns

predictions = []

for k in range(0,4):
    for i in range(0,20):
        num = np.dot(user_profile[:,k],np_feedback[i])
        denom = sqrt(np.dot(np_feedback[i],np_feedback[i])) * sqrt(np.dot(user_profile[:,k],user_profile[:,k]))
        predictions.append(num/denom)


In [64]:
x = np.reshape(predictions, (4, 20)).T
x

array([[ 0.39036003, -0.2981424 , -0.29277002,         nan],
       [-0.43643578,  0.83333333,  0.        ,         nan],
       [ 0.25197632,  0.        , -0.37796447,         nan],
       [-0.32732684,  0.66666667, -0.21821789,         nan],
       [-0.12598816,  0.09622504,  0.25197632,         nan],
       [ 0.46291005,  0.11785113, -0.3086067 ,         nan],
       [-0.15430335,  0.23570226, -0.15430335,         nan],
       [-0.21821789,  0.33333333,  0.10910895,         nan],
       [ 0.46291005, -0.23570226, -0.46291005,         nan],
       [-0.37796447,  0.09622504,  0.        ,         nan],
       [ 0.        ,  0.09622504,  0.12598816,         nan],
       [ 0.50395263, -0.38490018, -0.75592895,         nan],
       [-0.21821789,  0.58333333, -0.10910895,         nan],
       [-0.21821789,  0.58333333,  0.21821789,         nan],
       [ 0.        ,  0.33333333, -0.65465367,         nan],
       [ 0.75592895, -0.38490018, -0.62994079,         nan],
       [-0.43643578,  0.

In [65]:
# Count number of likes, dislikes and neutral predictions
likes = np.sum(np.array(x) > 0, axis=0)
dislikes = np.sum(np.array(x) < 0, axis=0)
neutral = np.sum(np.array(x) == 0, axis=0)

print("Number of Likes per user: " + str(likes))
print("Number of Dislikes per user: " + str(dislikes))
print("Number of Neutrals per user: " + str(neutral))

Number of Likes per user: [ 7 15  5  0]
Number of Dislikes per user: [11  4 10  0]
Number of Neutrals per user: [2 1 5 0]




In [66]:
# Retrieve top 5 rated questions for each user
predictions_df = pd.DataFrame(x,columns = ['User1','User2','User3','User4'],index=list(df1.index))

top5_pred_user1 = predictions_df['User1'].sort_values(ascending = False)[0:5]
top5_pred_user2 = predictions_df['User2'].sort_values(ascending = False)[0:5]
top5_pred_user3 = predictions_df['User3'].sort_values(ascending = False)[0:5]
top5_pred_user4 = predictions_df['User4'].sort_values(ascending = False)[0:5]

print(str(top5_pred_user1) + "\n",str(top5_pred_user2) + "\n",str(top5_pred_user3) + "\n",str(top5_pred_user4))

question16    0.755929
question12    0.503953
question6     0.462910
question9     0.462910
question1     0.390360
Name: User1, dtype: float64
 question17    0.833333
question2     0.833333
question4     0.666667
question14    0.583333
question13    0.583333
Name: User2, dtype: float64
 question5     0.251976
question14    0.218218
question19    0.195180
question11    0.125988
question8     0.109109
Name: User3, dtype: float64
 question1   NaN
question2   NaN
question3   NaN
question4   NaN
question5   NaN
Name: User4, dtype: float64


### Unit Weight
Questions that contain more topics have more influence on the result. Normalise the topic frequencies for each question and calculate the predictions again: divide each topic's occurrence by the total number of topics the question has. With the new predictions, provide the top-5 recommended questions for each user.
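The normalisation can be illustrated on a small made-up matrix. This is only a sketch: `toy_topics` is hypothetical, and keeping an all-zero row for a question without topics is an assumption to avoid dividing by zero.

```python
import numpy as np

# Toy data (hypothetical): rows = questions, columns = topics
toy_topics = np.array([[1, 0, 1, 1],
                       [0, 1, 0, 0]], dtype=float)

# Number of topics attached to each question, kept as a column vector
topic_counts = toy_topics.sum(axis=1, keepdims=True)

# Assumption: a question without topics keeps an all-zero row
toy_unit_weight = np.divide(toy_topics, topic_counts,
                            out=np.zeros_like(toy_topics),
                            where=topic_counts != 0)
```

After this step each row sums to 1, so a question with three topics contributes 1/3 per topic to a user profile, while a single-topic question contributes its full weight to that one topic.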

In [68]:
# Define unit weight feedback table and user profiles by dividing the number of keywords appearance by the total number of keywords the question has
np_feedback_uw = np_feedback/np_feedback.sum(1, keepdims=True)
user_profile_uw = np.dot(np_feedback_uw.transpose(),np_user_wa_dv)

In [69]:
user_profile_uw

array([[ 1.03333333, -0.53333333, -0.66666667,  0.        ],
       [-0.45      ,  0.5       ,  0.33333333,  0.        ],
       [-0.25      ,  0.55      ,  0.25      ,  0.        ],
       [ 0.25      ,  0.75      ,  0.        ,  0.        ],
       [ 0.        , -0.2       ,  0.        ,  0.        ],
       [ 0.53333333, -0.53333333, -0.91666667,  0.        ],
       [-0.2       , -0.08333333, -0.33333333,  0.        ],
       [-0.25      ,  0.75      , -0.75      ,  0.        ],
       [ 0.33333333,  0.        ,  0.        ,  0.        ],
       [ 0.        , -0.2       ,  0.08333333,  0.        ]])

In [70]:
# Computing all predictions for each of the users by applying the cosine prediction method
# Retrieve number of likes, dislikes and neutral ratings per user by counting over the array columns

predictions_uw = []

for k in range(0,4):
    for i in range(0,20):
        num = np.dot(user_profile_uw[:,k],np_feedback_uw[i])
        denom = sqrt(np.dot(np_feedback_uw[i],np_feedback_uw[i])) * sqrt(np.dot(user_profile_uw[:,k],user_profile_uw[:,k]))
        predictions_uw.append(num/denom)




In [71]:
# Count number of likes, dislikes and neutral predictions

x_uw = np.reshape(predictions_uw, (4, 20)).T
likes_uw = np.sum(np.array(x_uw) > 0, axis=0)
dislikes_uw = np.sum(np.array(x_uw) < 0, axis=0)
neutral_uw = np.sum(np.array(x_uw) == 0, axis=0)

print("Number of Likes per user: " + str(likes_uw))
print("Number of Dislikes per user: " + str(dislikes_uw))
print("Number of Neutrals per user: " + str(neutral_uw))

Number of Likes per user: [10 16  4  0]
Number of Dislikes per user: [10  4 13  0]
Number of Neutrals per user: [0 0 3 0]




In [72]:
# Retrieve top 5 rated questions for each user
predictions_uw_df = pd.DataFrame(x_uw,columns = ['User1','User2','User3','User4'],index=list(df1.index))

top5_pred_user_uw1 = predictions_uw_df['User1'].sort_values(ascending = False)[0:5]
top5_pred_user_uw2 = predictions_uw_df['User2'].sort_values(ascending = False)[0:5]
top5_pred_user_uw3 = predictions_uw_df['User3'].sort_values(ascending = False)[0:5]
top5_pred_user_uw4 = predictions_uw_df['User4'].sort_values(ascending = False)[0:5]

print(str(top5_pred_user_uw1) + "\n",str(top5_pred_user_uw2) + "\n",str(top5_pred_user_uw3) + "\n",str(top5_pred_user_uw4))

question16    0.797222
question6     0.659494
question12    0.573441
question9     0.445373
question1     0.427934
Name: User1, dtype: float64
 question17    0.834683
question2     0.834683
question4     0.643743
question13    0.605555
question14    0.589188
Name: User2, dtype: float64
 question14    0.199431
question5     0.164488
question19    0.101929
question11    0.098693
question10    0.000000
Name: User3, dtype: float64
 question1   NaN
question2   NaN
question3   NaN
question4   NaN
question5   NaN
Name: User4, dtype: float64


### IDF
With the unit weight applied, now evaluate the relevance of each topic using IDF (inverse document frequency). The more questions a topic appears in, the lower its relevance. Rare topics therefore receive more weight under IDF and become more relevant for the final prediction. With the new predictions, provide the top-5 recommended questions for each user.
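A small worked example of the IDF weighting, using a hypothetical 4-question, 3-topic matrix. It assumes every topic appears in at least one question, since a topic with zero occurrences would cause a division by zero.

```python
import numpy as np

# Toy data (hypothetical): rows = questions, columns = topics
toy_topics = np.array([[1, 1, 0],
                       [1, 0, 0],
                       [1, 0, 1],
                       [1, 1, 0]])

n_questions = toy_topics.shape[0]
doc_freq = np.count_nonzero(toy_topics, axis=0)   # questions per topic: 4, 2, 1
toy_idf = np.log(n_questions / doc_freq)          # natural log, as in the cells below
```

The topic that appears in every question gets a weight of log(4/4) = 0, while the topic that appears only once gets the largest weight, log(4/1).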

In [73]:
# Compute IDF vector for the 10 categories - count the total number of questions and divide it by the number of 
# mentions of the category and then finally taking the logarithm of the resulting value
idf = np.count_nonzero(np_feedback_uw.T, axis=1)
idf = np.log(20/idf)
idf

array([1.60943791, 1.2039728 , 0.69314718, 0.597837  , 1.2039728 ,
       1.2039728 , 1.04982212, 1.2039728 , 1.04982212, 1.38629436])

In [74]:
# Compute the IDF user profile: dot product of the unit-weight feedback table and the user ratings, with each topic row scaled by its IDF factor
user_profile_idf = np.dot(np_feedback_uw.transpose(),np_user_wa_dv).T * idf
user_profile_idf = user_profile_idf.T
user_profile_idf

array([[ 1.66308584, -0.85836689, -1.07295861,  0.        ],
       [-0.54178776,  0.6019864 ,  0.40132427,  0.        ],
       [-0.1732868 ,  0.38123095,  0.1732868 ,  0.        ],
       [ 0.14945925,  0.44837775,  0.        ,  0.        ],
       [ 0.        , -0.24079456,  0.        ,  0.        ],
       [ 0.64211883, -0.64211883, -1.10364174,  0.        ],
       [-0.20996442, -0.08748518, -0.34994071,  0.        ],
       [-0.3009932 ,  0.9029796 , -0.9029796 ,  0.        ],
       [ 0.34994071,  0.        ,  0.        ,  0.        ],
       [ 0.        , -0.27725887,  0.11552453,  0.        ]])

In [75]:
# Computing all predictions for each of the users by applying the cosine prediction method
# Retrieve number of likes, dislikes and neutral ratings per user by counting over the array columns
predictions_idf = []

for k in range(0,4):
    for i in range(0,20):
        num = np.dot(user_profile_idf[:,k],np_feedback_uw[i])
        denom = sqrt(np.dot(np_feedback_uw[i],np_feedback_uw[i])) * sqrt(np.dot(user_profile_idf[:,k],user_profile_idf[:,k]))
        predictions_idf.append(num/denom)

x_idf = np.reshape(predictions_idf, (4, 20)).T
likes_idf = np.sum(np.array(x_idf) > 0, axis=0)
dislikes_idf = np.sum(np.array(x_idf) < 0, axis=0)
neutral_idf = np.sum(np.array(x_idf) == 0, axis=0)

print(x_idf)
print("Number of Likes per user: " + str(likes_idf))
print("Number of Dislikes per user: " + str(dislikes_idf))
print("Number of Neutrals per user: " + str(neutral_idf))

[[ 0.49030911 -0.43636286 -0.45052633         nan]
 [-0.22283194  0.69563292 -0.08761597         nan]
 [ 0.23502693 -0.14950893 -0.34003155         nan]
 [-0.13750986  0.49019116 -0.28807004         nan]
 [-0.05696118  0.11172769  0.1592409          nan]
 [ 0.65911062 -0.17276656 -0.40487382         nan]
 [-0.10945262  0.26367435 -0.29714095         nan]
 [-0.06011517  0.13851566 -0.01631067         nan]
 [ 0.36075074 -0.2705844  -0.4164519          nan]
 [-0.22320225  0.09417315  0.01583125         nan]
 [ 0.0524502   0.04831938  0.05338959         nan]
 [ 0.6220964  -0.54636674 -0.77842623         nan]
 [-0.08352149  0.44450958 -0.19469814         nan]
 [-0.05545663  0.4265722   0.15331891         nan]
 [ 0.07215623  0.18526376 -0.62878274         nan]
 [ 0.78833747 -0.51626607 -0.67060964         nan]
 [-0.22283194  0.69563292 -0.08761597         nan]
 [ 0.1816009   0.18894326  0.                 nan]
 [-0.21274508  0.10065605  0.0811885          nan]
 [ 0.02986545  0.22113045 -0.04



In [76]:
# Computing the IDF top 5 predictions for each of the 4 users
predictions_idf_df = pd.DataFrame(x_idf,columns = ['User1','User2','User3','User4'],index=list(df1.index))

top5_pred_user_idf1 = predictions_idf_df['User1'].sort_values(ascending = False)[0:5]
top5_pred_user_idf2 = predictions_idf_df['User2'].sort_values(ascending = False)[0:5]
top5_pred_user_idf3 = predictions_idf_df['User3'].sort_values(ascending = False)[0:5]
top5_pred_user_idf4 = predictions_idf_df['User4'].sort_values(ascending = False)[0:5]

print(str(top5_pred_user_idf1) + "\n",str(top5_pred_user_idf2) + "\n",str(top5_pred_user_idf3) + "\n",str(top5_pred_user_idf4))

question16    0.788337
question6     0.659111
question12    0.622096
question1     0.490309
question9     0.360751
Name: User1, dtype: float64
 question17    0.695633
question2     0.695633
question4     0.490191
question13    0.444510
question14    0.426572
Name: User2, dtype: float64
 question5     0.159241
question14    0.153319
question19    0.081188
question11    0.053390
question10    0.015831
Name: User3, dtype: float64
 question1   NaN
question2   NaN
question3   NaN
question4   NaN
question5   NaN
Name: User4, dtype: float64


## 3. Building a Hybrid Recommendation Engine
### Switched Hybrid
Consider the case of User4. User4 is new to the website and does not have a defined profile yet. Solve the User4 cold-start problem by switching from content-based to non-personalised recommendations for users without any collected actions. Provide the top-5 recommended questions for each user.
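The switching rule can be sketched as a small function: use the content-based scores when a user has them, and fall back to a non-personalised ranking otherwise. The function `switched_top_k` and the toy data are hypothetical. The `popularity` series stands in for whatever non-personalised score is chosen (in this notebook, the mean over the other users), which is an assumption of the sketch.

```python
import numpy as np
import pandas as pd

def switched_top_k(scores, popularity, k=5):
    """Per user: content-based ranking if available, else the fallback ranking."""
    recommendations = {}
    for user in scores.columns:
        if scores[user].isna().all():        # no profile -> cold start
            ranked = popularity.nlargest(k)
        else:
            ranked = scores[user].nlargest(k)
        recommendations[user] = list(ranked.index)
    return recommendations

# Toy data (hypothetical): UserB has no predictions at all
toy_scores = pd.DataFrame(
    {"UserA": [0.9, -0.2, 0.4], "UserB": [np.nan, np.nan, np.nan]},
    index=["q1", "q2", "q3"],
)
toy_popularity = pd.Series([1.0, 5.0, 3.0], index=["q1", "q2", "q3"])
toy_recs = switched_top_k(toy_scores, toy_popularity, k=2)
```

Here `UserA` receives the two questions with the highest personal scores, while `UserB` receives the two highest-ranked questions of the fallback.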

In [77]:
# Non-personalised fallback: take the mean of the WA / DV ratings of users 1, 2 and 3
user4_unary_pre = user_wa_dv.copy()  # work on a copy to avoid SettingWithCopyWarning
user4_unary_pre['User4'] = user_wa_dv.iloc[:,[0,1,2]].mean(axis=1)
user4_unary_pre = user4_unary_pre.drop(columns = ['User 4 - WA / DV'])

user4_unary_pre['User4'].sort_values(ascending = False)[0:5]

question17    0.333333
question4     0.333333
question5     0.333333
question6     0.333333
question8     0.333333
Name: User4, dtype: float64

In [78]:
# Taking the mean of the IDF predictions of users 1, 2 and 3
user4_idf_pred = x_idf[:,[0,1,2]].mean(axis=1)
user4_idf_pred_df = pd.DataFrame(user4_idf_pred,columns = ['User4'],index=list(df1.index))
user4_idf_pred_df['User4'].sort_values(ascending = False)[0:5]

question14    0.174811
question17    0.128395
question2     0.128395
question18    0.123515
question5     0.071336
Name: User4, dtype: float64

### Hybrid Challenge
Define your own hybrid solution. Choose feature-weighted linear stacking, trust-aware CF, content-based similarity, or build your own. Provide the top-5 recommended questions for each user with your solution. It is key in this exercise to explain your solution in detail and with sound arguments. The "best explained" solution will receive the best grade.

In [147]:
# Import required libraries
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity

In [148]:
feedback = df1[list(df1.columns[:10])]
user_wa_dv = df1[list(df1.columns[10:14])]
user_up_down = df1[list(df1.columns[14:18])]

In [149]:
user_wa_dv.columns = ['User 1','User 2','User 3','User 4']
user_up_down.columns = ['User 1','User 2','User 3','User 4']

In [174]:
# Here, I tried to create a system that incorporates the IDF method, the cold-start approach
# and the votes given to the answers of the users, which were not considered in the previous approaches.

def hybrid(questions, userdata, relevance): 
    
    # Data table preparation for later use (work on copies so the inputs are not modified)
    new_feedback = questions.copy()
    rating = userdata.copy()
    relevance = relevance.copy()
    relevance.columns = rating.columns

    # Scale answer relevance (up/down votes) so it is comparable to the user feedback.
    # Standardisation turns the original 0's (no votes) into non-zero values, so build a
    # logical matrix (1-0) of the original non-zero entries and multiply by it to restore those 0's.
    # The scaled frame reuses the labels of `relevance` so the product is taken element by element.

    scaled = pd.DataFrame(preprocessing.scale(relevance), index = relevance.index, columns = relevance.columns)
    logical_matrix = (relevance != 0).astype(float)
    relevance = scaled * logical_matrix
    
    # Add another rating component: quality of user answers
    # If a user gives higher quality answers, the user should be recommended similar questions in the future
    # If a user gives a lower quality answer, the user should not be recommended similar questions in the future
    # This way, the quality of answers in the forum will stay higher
    
    for n in rating.columns:
        for p in rating.index:
            if (rating.loc[p,n] == 1):
                if (relevance.loc[p,n] != 0):
                    rating.loc[p,n] = rating.loc[p,n] * relevance.loc[p,n]
            elif(rating.loc[p,n] == -1):
                if (relevance.loc[p,n] < 0):
                    rating.loc[p,n] = (-1) * rating.loc[p,n] * relevance.loc[p,n]
                if(relevance.loc[p,n] > 0):
                    rating.loc[p,n] = (-1) * rating.loc[p,n] * relevance.loc[p,n]
            elif(rating.loc[p,n] == 0):
                if (relevance.loc[p,n] != 0):
                    rating.loc[p,n] = relevance.loc[p,n]  
    
    # Intermediate supporting list initiation for later use and determine question weight
    
    helper1 = []
    helper2 = []
    
    for i in questions.index:
        helper3 = 1 / sum(questions.loc[i,:])
        helper2.append(helper3)
    
    # Compute IDF for the topics
    
    for i in questions.columns:
        helper3 = np.log(len(questions) / sum(questions.loc[:,i]))
        helper1.append(helper3)
    helper1 = pd.DataFrame(helper1).transpose()
    helper1.columns = questions.columns
    j = 1
    store = pd.DataFrame()
    
    # Compute predictions for each user and question
    
    for i in rating.columns:
        
        # Handle Cold-start problem: assume mean rating of other users 
        if ((rating.loc[:,i].mean()) == 0 and (rating.loc[:,i].std()) == 0):
            store.loc[:,j] = store.loc[:, store.columns != j].mean(axis = 1)
        else:
            check_2 = questions.copy()
            
            # Multiply topics with user feedback considering the previously computed weights and IDF factor.
            for col in questions.columns:
                check_2[col] = np.where(questions.loc[:,col] == 0,questions.loc[:,col],rating.loc[:,i] * questions.loc[:,col] * helper2 * helper1.loc[0,col])
            
            # Build the user profile once all topic columns are weighted, then compute cosine predictions
            user_profile = pd.DataFrame(check_2.sum(axis = 0)).transpose()
            mult = cosine_similarity(new_feedback, user_profile).sum(axis = 1)
            store[j] = mult
            j = j + 1
    
    store.index = rating.index 
    
    # Summarize sorted question recommendations and retrieve the top-5 recommendations
    
    top = pd.DataFrame()
    for p in store.columns:
        rank = store.loc[:,p].sort_values(ascending = False)
        top[p] = rank.index
    top5 = top.loc[:4,:]
    top5.columns = rating.columns
    store.columns = rating.columns
    
    return top5

In [175]:
hybrid(feedback, user_wa_dv, user_up_down)

Unnamed: 0,User 1,User 2,User 3,User 4
0,question16,question18,question19,question14
1,question6,question9,question14,question5
2,question12,question5,question2,question18
3,question1,question14,question17,question20
4,question9,question20,question10,question11
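The scaling step used inside `hybrid` can be looked at in isolation. The sketch below uses plain NumPy rather than `preprocessing.scale`, on the assumption that column-wise standardisation with the population standard deviation is what is wanted. `scale_and_mask` and `toy_votes` are hypothetical names, and treating a 0 as "no votes" follows the comments in the function above.

```python
import numpy as np

def scale_and_mask(votes):
    """Standardise each column, then reset entries that were originally 0."""
    votes = np.asarray(votes, dtype=float)
    mean = votes.mean(axis=0)
    std = votes.std(axis=0)          # population standard deviation (ddof=0)
    std[std == 0] = 1.0              # constant columns: avoid division by zero
    scaled = (votes - mean) / std
    return scaled * (votes != 0)     # 0 in the input means "no votes"

# Toy data (hypothetical): rows = questions, columns = users
toy_votes = np.array([[15.0, 0.0],
                      [ 0.0, 0.0],
                      [25.0, 0.0],
                      [ 0.0, 0.0]])
toy_scaled = scale_and_mask(toy_votes)
```

Without the mask, the questions a user never answered would end up with a negative standardised value and be treated as poorly rated answers. With the mask they stay neutral, and a user with no votes at all keeps an all-zero column.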
