# Guide Recommendation Pipeline
The guide recommendation system suggests guides that are likely to interest each tourist, according to the tourist's preferences and requests.

Recommendation systems are decision-support systems that identify and learn patterns across sources of information. They tend to perform well when a sufficient amount of high-quality data is available.

The designed system relies upon datasets of:
- Ratings given by tourists to specific guides;
- Set of attributes about the guides;
- Set of attributes about the tourists.

## Define working environment

In [30]:
# Import libraries

import numpy as np
import pandas as pd
import scipy.sparse as sps
import ast

from tqdm import tqdm

## Load and preprocess data

In this part we load the previously generated dataframes:
- tourists_200.csv with the id, languages spoken, and keywords of the tourists.
- guides_40.csv with various attributes of the guides.
- guides_ratings_200_40.csv which contains the tourist-guide interactions.
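
The list-valued columns (languages, keywords) are stored in the CSV files as strings. The following is a minimal sketch of how such columns can be parsed into Python lists at load time. The two sample rows are made up for illustration and are not taken from the real files.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row sample mimicking the layout of the tourist file
sample_csv = io.StringIO(
    ";languages;keywords\n"
    "0;['italian'];['art', 'wine']\n"
    "1;['dutch'];[]\n"
)

sample_df = pd.read_csv(
    sample_csv,
    sep=';',
    header=0,
    # Parse the string representation of each list into a real list
    converters={'languages': ast.literal_eval, 'keywords': ast.literal_eval},
)

# The first column has no header, so it is renamed as in the cells of this notebook
sample_df.rename(columns={sample_df.columns[0]: 'id'}, inplace=True)
```

Without the converters, each cell would stay a plain string such as "['art', 'wine']" and would need to be parsed later.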


In [31]:
# Load dataframe for tourist attributes

# Pass the path directly, so that pandas opens and closes the file itself
tourist_df = pd.read_csv(
    filepath_or_buffer = 'Data/tourists_200.csv',
    sep = ';',
    header = 0
)

tourist_df.rename(columns={tourist_df.columns[0]: 'id'}, inplace=True)

tourist_df

Unnamed: 0,id,languages,keywords
0,0,['bulgarian'],"['beer', 'sport', 'cinema', 'tracking', 'rafti..."
1,1,['bulgarian'],"['history', 'food', 'tracking']"
2,2,['spanish'],"['tracking', 'cinema', 'wine', 'literature', '..."
3,3,['deutsche'],"['sport', 'tracking']"
4,4,['spanish'],"['beer', 'tracking', 'food', 'literature', 'sp..."
...,...,...,...
195,195,['italian'],"['museums', 'cinema', 'literature', 'history',..."
196,196,['dutch'],"['wine', 'countryside', 'museums', 'art', 'raf..."
197,197,['italian'],"['rafting', 'archeology', 'art', 'museums', 'b..."
198,198,['italian'],"['music', 'art', 'museums', 'history', 'food',..."


In [32]:
# Load dataframe for guide attributes

# Pass the path directly, so that pandas opens and closes the file itself
guide_df = pd.read_csv(
    filepath_or_buffer = 'Data/guides_40.csv',
    sep = ';',
    header = 0,

    # Process the columns containing lists: load the content of those columns as list
    converters = {'languages_spoken':ast.literal_eval, 'keywords':ast.literal_eval}
)

guide_df.rename(columns={guide_df.columns[0]: 'id'}, inplace=True)

guide_df

Unnamed: 0,id,gender,name,birth_date,now_available,languages_spoken,price,education,biography,keywords,current_location,experience
0,0,male,Gary Baker,1984-07-08,True,[english],25,high-school,Astronomer,[museums],"{'lat': 40.342693584880706, 'lon': 18.16438078...",13
1,1,female,Lisa Swanson,1982-09-27,True,"[italian, dutch]",36,middle-school,"Designer, jewellery","[cinema, rafting, history, wine]","{'lat': 40.3367551413547, 'lon': 18.1569120995...",14
2,2,female,Carla Dillon,1995-01-04,True,"[chinese, french, english]",34,phd,Travel agency manager,"[food, archeology, art]","{'lat': 40.362447049660204, 'lon': 18.14225129...",5
3,3,male,Jesse Schaefer,1990-11-12,True,[bulgarian],46,bachelor,Financial planner,"[countryside, rafting, art]","{'lat': 40.36584086550253, 'lon': 18.183002910...",1
4,4,male,David Davis,1998-03-06,True,"[deutsche, french]",30,master,"Pilot, airline","[countryside, tracking, beer]","{'lat': 40.354724889812935, 'lon': 18.20308322...",6
5,5,male,Joseph James,1974-05-30,True,"[deutsche, dutch, bulgarian]",27,middle-school,"Designer, exhibition/display",[],"{'lat': 40.35854623058413, 'lon': 18.183459879...",8
6,6,male,Steven Schneider,1980-08-19,True,"[english, chinese, french]",34,middle-school,"Engineer, land","[rafting, art, sport]","{'lat': 40.35774331848761, 'lon': 18.163273185...",2
7,7,male,Reginald Berg,1971-01-06,True,"[spanish, dutch, french, italian]",33,bachelor,Financial risk analyst,"[museums, art, countryside, wine]","{'lat': 40.365193043282794, 'lon': 18.18548169...",10
8,8,male,Gregory Lee,1973-12-11,True,"[bulgarian, english, chinese]",31,middle-school,Audiological scientist,"[literature, cinema, music]","{'lat': 40.35265552781134, 'lon': 18.151723396...",25
9,9,male,Alex Perkins,1963-07-30,True,"[chinese, bulgarian, dutch, english]",37,phd,"Engineer, materials",[museums],"{'lat': 40.33774657696426, 'lon': 18.182064664...",41


In [33]:
# Load dataframe for ratings

# Pass the path directly, so that pandas opens and closes the file itself
rating_df = pd.read_csv(
    filepath_or_buffer = 'Data/guides_ratings_200_40.csv',
    sep = ';',
    header = 0
)

rating_df

Unnamed: 0,0,1,2
0,0,10,4.0
1,0,5,4.0
2,1,9,3.0
3,1,5,4.0
4,2,28,4.0
...,...,...,...
293,197,27,5.0
294,198,34,5.0
295,198,27,3.0
296,199,7,5.0


In [34]:
# Rename the columns of the rating dataframe
rating_df.rename(columns={rating_df.columns[0]: 'tourist_id',
                          rating_df.columns[1]: 'guide_id',
                          rating_df.columns[2]: 'rating',
                         }, inplace=True)

# Remove the duplicated interactions, if any
rating_df.drop_duplicates(subset=['tourist_id','guide_id'],inplace=True)

rating_df

Unnamed: 0,tourist_id,guide_id,rating
0,0,10,4.0
1,0,5,4.0
2,1,9,3.0
3,1,5,4.0
4,2,28,4.0
...,...,...,...
293,197,27,5.0
294,198,34,5.0
295,198,27,3.0
296,199,7,5.0


## Print statistics

Some basic statistics about the loaded data are shown below.

In [35]:
# Statistics about data
arr_tourists = tourist_df["id"].unique()
arr_guides = guide_df["id"].unique()

n_tourists = len(arr_tourists)
n_guides = len(arr_guides)
n_interactions = len(rating_df)

print("Number of tourists: {:d}".format(n_tourists))
print("Number of guides: {:d}".format(n_guides))
print("Number of interactions: {:d}".format(n_interactions))

print("Average interaction per tourist: {:.2f}".format(n_interactions/n_tourists))
print("Average interaction per guide: {:.2f}".format(n_interactions/n_guides))
print("Sparsity: {:.2f} %".format((1-float(n_interactions)/(n_guides*n_tourists))*100))

Number of tourists: 200
Number of guides: 40
Number of interactions: 298
Average interaction per tourist: 1.49
Average interaction per guide: 7.45
Sparsity: 96.28 %
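
As a worked check of the sparsity figure, the sketch below recomputes it from the three counts printed above. The helper name `sparsity_percent` is hypothetical and not part of the notebook.

```python
def sparsity_percent(n_interactions, n_rows, n_cols):
    """Percentage of empty cells in an n_rows x n_cols interaction matrix."""
    density = n_interactions / (n_rows * n_cols)
    return (1.0 - density) * 100.0


# Counts printed above: 298 interactions, 200 tourists, 40 guides
value = sparsity_percent(298, 200, 40)
```

Only 298 of the 200 x 40 = 8000 possible tourist-guide pairs have a rating, so fewer than 4% of the cells are filled.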


In [36]:
# Statistics about the distribution of ratings
print("Average rating: {:.6f}".format(rating_df.loc[:, 'rating'].mean()))
print("Maximum rating: {:.6f}".format(rating_df.loc[:, 'rating'].max()))
print("Minimum rating: {:.6f}".format(rating_df.loc[:, 'rating'].min()))

Average rating: 3.929530
Maximum rating: 5.000000
Minimum rating: 2.000000


## Create the URM

The **User Rating Matrix** describes the past interactions between tourists and guides, where rows represent tourists and columns represent guides. The values in the cells can be defined by an implicit or explicit approach:
- Explicit ratings are given directly by the tourists to the guides, according to a rating scale.
- Implicit ratings are obtained according to specific criteria based on tourists' behaviour, without asking for an opinion explicitly. The corresponding value of the interaction is set to 1 if we think that the tourist could be interested in the guide, otherwise 0.

In our problem, we decide to use **explicit ratings** with a rating scale from 1 to 5.
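
To make the difference between the two approaches concrete, here is a minimal sketch on made-up interactions. It builds an explicit matrix and derives an implicit one from it; the variable names are illustrative only.

```python
import numpy as np
import scipy.sparse as sps

# Toy interactions (made up): tourist 0 rated guides 1 and 2, tourist 2 rated guide 0
tourist_ids = np.array([0, 0, 2])
guide_ids = np.array([1, 2, 0])
ratings = np.array([4.0, 3.0, 5.0])

# Explicit URM: passing the shape keeps rows/columns of ids that have no interactions
urm_explicit = sps.csr_matrix(
    (ratings, (tourist_ids, guide_ids)), shape=(3, 3)
)

# Implicit URM: same sparsity pattern, every stored interaction becomes 1
urm_implicit = urm_explicit.copy()
urm_implicit.data = np.ones_like(urm_implicit.data)
```

Passing `shape` explicitly is a design choice of this sketch: when the shape is inferred from the largest ids, a tourist or guide with the highest id and no ratings would be silently left out of the matrix.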

In [37]:
# Create the User Rating Matrix in CSR format to facilitate the computation
# Row: tourist_id, Column: guide_id, Value: rating
URM_all = sps.csr_matrix(
    (rating_df["rating"].values,
    (rating_df["tourist_id"].values, rating_df["guide_id"].values))
)

URM_all

<200x40 sparse matrix of type '<class 'numpy.float64'>'
	with 298 stored elements in Compressed Sparse Row format>

In [38]:
# Define the portion of data used for training the model
# Since we are using fake data, we train the model on all the available data
URM_train = URM_all

## Create the ICM

The **Item Content Matrix** describes the guides through their attributes. The rows represent the guides, and the columns represent the attributes. Each number in the ICM indicates how important an attribute is in characterizing a guide.

In this case we start from the simplest form: the cell value is equal to 1 if the guide has that specific attribute, and 0 otherwise.
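
As a small illustration of this binary encoding, the sketch below turns a list-valued column into indicator columns on a made-up table. It uses `pd.get_dummies` followed by a group-by, which is one possible alternative to the crosstab-based approach used in this notebook, not the notebook's own method.

```python
import pandas as pd

# Toy guide table (made up); guide 2 has no keywords
toy_guides = pd.DataFrame({
    'id': [0, 1, 2],
    'keywords': [['art', 'wine'], ['art'], []],
})

# One row per (guide, keyword) pair; an empty list becomes a single NaN row
exploded = toy_guides['keywords'].explode()

# One indicator column per keyword, then collapse back to one row per guide
one_hot = pd.get_dummies(exploded).groupby(level=0).sum()
```

A guide with an empty list keeps a row of zeros, so no separate fill step is needed in this variant.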

In [39]:
# Make a copy of the guide dataframe to build the ICM
icm_df = guide_df.copy(deep=True)

In [40]:
# Convert birth dates into categorical age labels
# Classification criterion: whether the guide was born after 1984 ('20-40') or not ('40+').
def replace_birth_year(x):
    if x > 1984:
        return '20-40'
    else:
        return '40+'

# Apply the conversion to the dataframe
icm_df['birth_date'] = icm_df['birth_date'].apply(
    lambda x: replace_birth_year(pd.to_datetime(x, format="%Y-%m-%d").year)
)

In [41]:
# Convert number of years of experience into categorical labels
# Classification criteria:
# - a junior guide has less than 5 years of experience;
# - an experienced guide has at least 5 and less than 10 years of experience;
# - a senior guide has at least 10 years of experience.
def replace_experience(x):
    if x < 5:
        return 'junior'
    elif x < 10:
        return 'experienced'
    else:
        return 'senior'

# Apply conversion to the dataframe
icm_df['experience'] = icm_df['experience'].apply(replace_experience)

In [42]:
# Compute the mean value of prices as a reference for proper ranges
icm_df[['price']].mean(axis=0)

price    29.175
dtype: float64

In [43]:
# Split the price information into three ranges
# - cost < 25: low cost
# - 25 <= cost < 35: medium
# - cost >= 35: high

def replace_price(x):
    if x < 25:
        return 'low_cost'
    elif x < 35:
        return 'medium_cost'
    else:
        return 'high_cost'

# Apply conversion to the dataframe
icm_df['price'] = icm_df['price'].apply(replace_price)
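
The same three price ranges can also be expressed with `pd.cut`. This is a minimal sketch on made-up prices, shown as an alternative to the `replace_price` function rather than as the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up prices, chosen to sit on and around the two thresholds
prices = pd.Series([20, 25, 34, 35])

# right=False makes each bin closed on the left: [-inf, 25), [25, 35), [35, inf)
price_labels = pd.cut(
    prices,
    bins=[-np.inf, 25, 35, np.inf],
    right=False,
    labels=['low_cost', 'medium_cost', 'high_cost'],
)
```

With `right=False` the boundary values 25 and 35 fall in the upper bin, which matches the `<` comparisons of the function above.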

In [44]:
# Show the dataframe
icm_df

Unnamed: 0,id,gender,name,birth_date,now_available,languages_spoken,price,education,biography,keywords,current_location,experience
0,0,male,Gary Baker,40+,True,[english],medium_cost,high-school,Astronomer,[museums],"{'lat': 40.342693584880706, 'lon': 18.16438078...",senior
1,1,female,Lisa Swanson,40+,True,"[italian, dutch]",high_cost,middle-school,"Designer, jewellery","[cinema, rafting, history, wine]","{'lat': 40.3367551413547, 'lon': 18.1569120995...",senior
2,2,female,Carla Dillon,20-40,True,"[chinese, french, english]",medium_cost,phd,Travel agency manager,"[food, archeology, art]","{'lat': 40.362447049660204, 'lon': 18.14225129...",experienced
3,3,male,Jesse Schaefer,20-40,True,[bulgarian],high_cost,bachelor,Financial planner,"[countryside, rafting, art]","{'lat': 40.36584086550253, 'lon': 18.183002910...",junior
4,4,male,David Davis,20-40,True,"[deutsche, french]",medium_cost,master,"Pilot, airline","[countryside, tracking, beer]","{'lat': 40.354724889812935, 'lon': 18.20308322...",experienced
5,5,male,Joseph James,40+,True,"[deutsche, dutch, bulgarian]",medium_cost,middle-school,"Designer, exhibition/display",[],"{'lat': 40.35854623058413, 'lon': 18.183459879...",experienced
6,6,male,Steven Schneider,40+,True,"[english, chinese, french]",medium_cost,middle-school,"Engineer, land","[rafting, art, sport]","{'lat': 40.35774331848761, 'lon': 18.163273185...",junior
7,7,male,Reginald Berg,40+,True,"[spanish, dutch, french, italian]",medium_cost,bachelor,Financial risk analyst,"[museums, art, countryside, wine]","{'lat': 40.365193043282794, 'lon': 18.18548169...",senior
8,8,male,Gregory Lee,40+,True,"[bulgarian, english, chinese]",medium_cost,middle-school,Audiological scientist,"[literature, cinema, music]","{'lat': 40.35265552781134, 'lon': 18.151723396...",senior
9,9,male,Alex Perkins,40+,True,"[chinese, bulgarian, dutch, english]",high_cost,phd,"Engineer, materials",[museums],"{'lat': 40.33774657696426, 'lon': 18.182064664...",senior


In [45]:
# Remove the columns that we would not consider as relevant static attributes for guides
icm_df.drop(labels=['name', 'now_available', 'current_location'], axis=1, inplace=True)

In [46]:
# Split the categorical attributes into separate columns
multiclass_attributes = ['gender', 'price', 'experience', 'birth_date', 'education', 'biography', 'languages_spoken', 'keywords']

# The result would have value 1 for each cell corresponding to an attribute owned by a guide
# and 0 otherwise
for n in multiclass_attributes:

    # Explode lists of attributes to a subset of columns
    s = icm_df[n].explode()

    # Augment the dataframe with one column per category found in this attribute.
    # The cell values are the counts of occurrences of each category for each guide
    # (1 if the guide has the category; missing cells are filled with 0)
    icm_df = icm_df.join(pd.crosstab(s.index, s).astype(object)).fillna(0)

    # Drop the original column
    icm_df.drop(labels=n,axis=1,inplace=True)

In [47]:
# Show the dataframe
icm_df

Unnamed: 0,id,female,male,high_cost,low_cost,medium_cost,experienced,junior,senior,20-40,...,countryside,food,history,literature,museums,music,rafting,sport,tracking,wine
0,0,0,1,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,1,1,0,1,0,0,0,0,1,0,...,0,0,1,0,0,0,1,0,0,1
2,2,1,0,0,0,1,1,0,0,1,...,0,1,0,0,0,0,0,0,0,0
3,3,0,1,1,0,0,0,1,0,1,...,1,0,0,0,0,0,1,0,0,0
4,4,0,1,0,0,1,1,0,0,1,...,1,0,0,0,0,0,0,0,1,0
5,5,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,6,0,1,0,0,1,0,1,0,0,...,0,0,0,0,0,0,1,1,0,0
7,7,0,1,0,0,1,0,0,1,0,...,1,0,0,0,1,0,0,0,0,1
8,8,0,1,0,0,1,0,0,1,0,...,0,0,0,1,0,1,0,0,0,0
9,9,0,1,1,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0


In [48]:
# Print the list of attributes
attribute_list = icm_df.columns.tolist()
attribute_list

['id',
 'female',
 'male',
 'high_cost',
 'low_cost',
 'medium_cost',
 'experienced',
 'junior',
 'senior',
 '20-40',
 '40+',
 'bachelor',
 'high-school',
 'master',
 'middle-school',
 'phd',
 'Administrator, sports',
 'Air broker',
 'Animal technologist',
 'Astronomer',
 'Audiological scientist',
 'Best boy',
 'Broadcast journalist',
 'Building services engineer',
 'Counsellor',
 'Designer, exhibition/display',
 'Designer, graphic',
 'Designer, industrial/product',
 'Designer, jewellery',
 'Editor, magazine features',
 'Emergency planning/management officer',
 'Engineer, aeronautical',
 'Engineer, land',
 'Engineer, materials',
 'Environmental consultant',
 'Exhibitions officer, museum/gallery',
 'Financial planner',
 'Financial risk analyst',
 'Geochemist',
 'Holiday representative',
 'Legal secretary',
 'Local government officer',
 'Multimedia programmer',
 'Newspaper journalist',
 'Office manager',
 'Pilot, airline',
 'Secretary/administrator',
 'Systems developer',
 'Television fl

In [49]:
# Convert the names of columns (attributes) into sequential numbers
def convert_index(x):
    # The first column contains the ids of the guide, keep its name
    if x == 'id':
        return x
    else:
        return attribute_list.index(x)

# Apply the transformation to the dataframe
icm_df.rename(mapper=convert_index, axis=1, inplace=True)
icm_df

Unnamed: 0,id,1,2,3,4,5,6,7,8,9,...,65,66,67,68,69,70,71,72,73,74
0,0,0,1,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,1,1,0,1,0,0,0,0,1,0,...,0,0,1,0,0,0,1,0,0,1
2,2,1,0,0,0,1,1,0,0,1,...,0,1,0,0,0,0,0,0,0,0
3,3,0,1,1,0,0,0,1,0,1,...,1,0,0,0,0,0,1,0,0,0
4,4,0,1,0,0,1,1,0,0,1,...,1,0,0,0,0,0,0,0,1,0
5,5,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,6,0,1,0,0,1,0,1,0,0,...,0,0,0,0,0,0,1,1,0,0
7,7,0,1,0,0,1,0,0,1,0,...,1,0,0,0,1,0,0,0,0,1
8,8,0,1,0,0,1,0,0,1,0,...,0,0,0,1,0,1,0,0,0,0
9,9,0,1,1,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0


In [50]:
# Re-organize data structure for building the ICM

# The dataframe is modeled such that the column 'id' is identifier variable,
# and all other columns are "unpivoted" to the row axis.
# The result has only two non-identifier columns: 'label' (attribute) and 'value'.
icm_df = pd.melt(icm_df, id_vars='id', var_name='label')

# Filter the values: only the cells with value 1 are kept in the new dataframe
icm_df = icm_df[icm_df["value"]==1]
icm_df

Unnamed: 0,id,label,value
1,1,1,1
2,2,1,1
10,10,1,1
11,11,1,1
12,12,1,1
...,...,...,...
2908,28,73,1
2909,29,73,1
2921,1,74,1
2927,7,74,1


In [51]:
# Create the Item Content Matrix in CSR format to facilitate the computation
# Row: guide_id, Column: attribute_id, Value: 1 if the attribute is present for the guide
ICM_all = sps.csr_matrix(
    (icm_df["value"].values,
    (icm_df["id"].values, icm_df["label"].values))
)

ICM_all

<40x75 sparse matrix of type '<class 'numpy.int64'>'
	with 412 stored elements in Compressed Sparse Row format>

In [52]:
# Print the matrix in dense format
print(ICM_all.todense())

[[0 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 1]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]


### Feature Engineering
It is possible to model the importance of the features by weighting them differently in the ICM: we can assign a higher value to the features that we consider more relevant for the recommendation problem, such as the languages spoken by the guides.

In the considered scenario, we assume that the **language** is a hard constraint with high priority in the matching process: to communicate, the tourist and the recommended guides should speak a common language. So we aim to satisfy this constraint first, by assigning the language attributes a higher weight than the other attributes.

Other attributes, such as the experience or the educational background of the guides, can be further highlighted whenever necessary.
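
The weighting idea can be sketched directly on a sparse matrix. The example below is a minimal illustration on a made-up ICM; scaling the columns with a diagonal matrix is an alternative to editing the melted dataframe, and the column index and weight are assumptions chosen for the example.

```python
import numpy as np
import scipy.sparse as sps

# Toy binary ICM (made up): 2 guides x 3 attributes; column 2 plays the role of a language
toy_icm = sps.csr_matrix(np.array([[1, 0, 1],
                                   [0, 1, 1]], dtype=np.float64))

# One weight per attribute column; the default weight is 1
column_weights = np.ones(toy_icm.shape[1])
column_weights[2] = 10.0

# Right-multiplying by a diagonal matrix scales each column by its weight
toy_icm_weighted = (toy_icm @ sps.diags(column_weights)).tocsr()
```

Because the similarity between guides is computed from these values, a heavily weighted attribute dominates the dot product between two guides that share it.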

In [53]:
# Print the full list of attributes with their indices
for l in attribute_list:
    print(attribute_list.index(l), l)

0 id
1 female
2 male
3 high_cost
4 low_cost
5 medium_cost
6 experienced
7 junior
8 senior
9 20-40
10 40+
11 bachelor
12 high-school
13 master
14 middle-school
15 phd
16 Administrator, sports
17 Air broker
18 Animal technologist
19 Astronomer
20 Audiological scientist
21 Best boy
22 Broadcast journalist
23 Building services engineer
24 Counsellor
25 Designer, exhibition/display
26 Designer, graphic
27 Designer, industrial/product
28 Designer, jewellery
29 Editor, magazine features
30 Emergency planning/management officer
31 Engineer, aeronautical
32 Engineer, land
33 Engineer, materials
34 Environmental consultant
35 Exhibitions officer, museum/gallery
36 Financial planner
37 Financial risk analyst
38 Geochemist
39 Holiday representative
40 Legal secretary
41 Local government officer
42 Multimedia programmer
43 Newspaper journalist
44 Office manager
45 Pilot, airline
46 Secretary/administrator
47 Systems developer
48 Television floor manager
49 Therapist, drama
50 Travel agency manager


In [54]:
# Define which features we would like to model
# Some examples could be 'low_cost' to consider the cost of the guide,
# 'experienced' for the experience, '20-40' for the age, etc.
features_to_model = ['low_cost', 'experienced', '20-40']

# Define the weight of each feature, in the same order as features_to_model
# Default weight: 1
importance_weights = [1, 1, 1]

# First and last column index of the language attributes (inclusive range),
# and the higher weight assigned to them
languages_columns = [54, 61]
languages_weight = 10

In [55]:
# Initialize the dictionaries for the chosen attributes and weights
feature_columns_dict = dict()
importance_weights_dict = dict()

# Add the language feature to the two dictionaries
feature_columns_dict.update({'language': languages_columns})
importance_weights_dict.update({'language': languages_weight})

In [56]:
# Build function to extract other features' indices and the corresponding weights to attribute
def extract_feature(name, i):
    # Update the dictionaries with indices and weights
    feature_columns_dict.update({name: attribute_list.index(name)})
    importance_weights_dict.update({name: importance_weights[i]})

# Execute the operation
for i in range(len(features_to_model)):
    extract_feature(features_to_model[i], i)

print(feature_columns_dict)
print(importance_weights_dict)

{'language': [54, 61], 'low_cost': 4, 'experienced': 6, '20-40': 9}
{'language': 10, 'low_cost': 1, 'experienced': 1, '20-40': 1}


In [57]:
# Create a copy of the original ICM to modify
new_icm_df = icm_df.copy(deep=True)

In [60]:
# Modify the cell values with respect to the weights we want to give
for feature in importance_weights_dict:

    # If the weight is not the default one, modify the values
    if importance_weights_dict[feature] != 1:

        # Trace all columns assigned to the languages
        if feature=='language':
            condition = (new_icm_df.label >= feature_columns_dict[feature][0]) & (new_icm_df.label <= feature_columns_dict[feature][1])

        # Trace the specific feature
        else:
            condition = (new_icm_df.label == feature_columns_dict[feature])

        # Find the rows with labels corresponding to the selected features, and update the values
        new_icm_df.loc[condition, 'value'] = importance_weights_dict[feature]

In [61]:
# Print the modified dataframe
new_icm_df

Unnamed: 0,id,label,value
1,1,1,1
2,2,1,1
10,10,1,1
11,11,1,1
12,12,1,1
...,...,...,...
2908,28,73,1
2909,29,73,1
2921,1,74,1
2927,7,74,1


In [107]:
# Build the modified ICM from the dataframe
# Like the URM, the ICM is also built in CSR format for computational purposes
# Row: guide_id, Column: attribute_id, Value: weight of the attribute if it is present for the guide

ICM_modified = sps.csr_matrix(
    (new_icm_df["value"].values,
    (new_icm_df["id"].values, new_icm_df["label"].values))
)

ICM_modified

<40x78 sparse matrix of type '<class 'numpy.int64'>'
	with 412 stored elements in Compressed Sparse Row format>

In [109]:
# Define the portion of ICM used for training the model: here we use the entire ICM
ICM_train = ICM_modified

## Build the model

At this point all the necessary input data are prepared, and we proceed to build the recommendation algorithm.


### Similarity Function
A recommendation system learns from the past ratings and/or the attributes of users and items. By evaluating the similarity between the guides, it is able to suggest guides that are similar to the ones that the tourist liked in the past, and also to recommend a guide liked by a tourist to similar tourists.

To determine the degree of similarity between guides, we implement the **Cosine Similarity** function for the models. It measures how much two vectors $\vec{i}$ and $\vec{j}$ representing guides have in common, and it can be computed with the **normalized dot product**:
$$    s_{ij}=\frac{\vec{i}\cdot \vec{j}}{\|\vec{i}\|_2\cdot \|\vec{j}\|_2}=\cos(\theta)   .$$

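For faster computation, we adopt a version of the cosine similarity based on vector products, with $M$ the reference matrix (URM or ICM):
$$ W_{i,I}
= \cos(v_i, M_{I})
= \frac{v_i \cdot M_{I}}{\| v_i \| \, IW_{I} + \mathrm{shrink}}  ,$$
where
$$ IW_{i} = \sqrt{\sum_{u \in U}{M_{u,i}^2}}  .$$
Here $v_i$ is the vector of $M$ associated with guide $i$, and $IW$ collects the Euclidean norms of these vectors, so that $\| v_i \| = IW_{i}$.

The shrink term is introduced to take into account the support of the vectors: vectors with a larger number of non-zero elements are statistically more significant. Adding a constant to the denominator reduces the similarity of pairs with small norms more strongly than that of pairs backed by many elements.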

Finally, the similarity values are stored in a **similarity matrix**, which establishes the pairwise correspondence between guides. This is fed to the models as weights.
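
The formula above can be checked on a tiny example. The sketch below is a vectorized variant written for illustration: the function name and the small `eps` constant are assumptions, and it is not the implementation used in the rest of this notebook.

```python
import numpy as np
import scipy.sparse as sps


def cosine_similarity_vectorized(m, shrink=0.0, eps=1e-6):
    """Pairwise cosine similarity between the columns of m, with a shrink term."""
    m = sps.csc_matrix(m, dtype=np.float64)

    # Euclidean norm of each column (the IW vector of the formula)
    norms = np.sqrt(np.asarray(m.power(2).sum(axis=0)).ravel())

    # All pairwise dot products between columns at once
    numerator = (m.T @ m).toarray()

    # Product of the norms of each pair, plus shrink; eps avoids division by zero
    denominator = np.outer(norms, norms) + shrink + eps

    weights = numerator / denominator

    # An item is not considered similar to itself
    np.fill_diagonal(weights, 0.0)
    return weights


# Two toy items as columns: (1, 1, 0) and (1, 0, 0)
toy = np.array([[1.0, 1.0],
                [1.0, 0.0],
                [0.0, 0.0]])

w_no_shrink = cosine_similarity_vectorized(toy, shrink=0.0)
w_shrink = cosine_similarity_vectorized(toy, shrink=1.0)
```

For these two columns the dot product is 1 and the norms are $\sqrt{2}$ and 1, so the similarity is about 0.707 without shrink and about 0.414 with a shrink of 1.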

In [110]:
# Define the cosine similarity function based on vector products
def vector_similarity(m: sps.csc_matrix, shrink: float):

    # Compute the Euclidean norm of each column of the matrix m
    item_weights = np.sqrt(
        np.asarray(m.power(2).sum(axis=0))
    ).flatten()

    # Find the number of items
    num_items = m.shape[1]

    # Get the transposed matrix
    matrix_t = m.T

    # Initialize the empty weight matrix
    weights = np.empty(shape=(num_items, num_items))

    # Compute the similarity values as mentioned in the previous formula
    # The results will be weights to give to the model
    for item_id in range(num_items):
        numerator = matrix_t.dot(m[:, item_id]).toarray().flatten()
        denominator = item_weights[item_id] * item_weights + shrink + 1e-6

        weights[item_id] = numerator / denominator

    # The elements on the diagonal of the similarity matrix are set to 0,
    # to force the fact that each element is not considered similar to itself!
    np.fill_diagonal(weights, 0.0)

    return weights

To solve our problem, we worked on two different approaches: **Collaborative Filtering** (CF) and **Content-Based Filtering** (CBF).

### Collaborative Filtering Model

**Collaborative Filtering** models rely on the opinions of a community of tourists, without requiring the guides' attributes. They focus on identifying patterns and similarities in the interactions between tourists and guides, based on the tourists' feedback.

In this case, the **item-based** collaborative filtering technique has been used. It computes the similarity between each pair of guides from the ratings they received: two guides are similar when the same tourists rated both of them.
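
The scoring step of an item-based model can be sketched with plain NumPy. The ratings and the similarity matrix below are made up, and `top_k_guides` is a hypothetical helper used only for this illustration.

```python
import numpy as np


def top_k_guides(user_ratings, similarity, k, exclude_seen=True):
    """Rank guides for one tourist from their ratings and an item-item similarity."""
    # Score of each guide: ratings of the tourist weighted by guide-guide similarity
    scores = user_ratings @ similarity

    # Guides already rated by the tourist are pushed to the bottom of the ranking
    if exclude_seen:
        scores = np.where(user_ratings > 0, -np.inf, scores)

    # Indices of the k highest scores, best first
    return np.argsort(-scores, kind="stable")[:k]


# Toy data: the tourist rated guide 0 with 5 and guide 3 with 3
toy_ratings = np.array([5.0, 0.0, 0.0, 3.0])
toy_similarity = np.array([
    [0.0, 0.5, 0.1, 0.0],
    [0.5, 0.0, 0.2, 0.3],
    [0.1, 0.2, 0.0, 0.4],
    [0.0, 0.3, 0.4, 0.0],
])

top_two = top_k_guides(toy_ratings, toy_similarity, k=2)
```

Guide 1 scores 5 x 0.5 + 3 x 0.3 = 3.4 and guide 2 scores 5 x 0.1 + 3 x 0.4 = 1.7, so guide 1 is ranked first.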

In [111]:
# Define the item-based collaborative filtering recommender
class ItemCFRecommender(object):

    # Initialize the input matrix
    def __init__(self, URM):
        self.URM = URM
        
    # Fit the recommender with a chosen shrink value
    def fit(self, shrink=1):
        # Compute the pairwise similarity between guides using the URM
        self.W_matrix = vector_similarity(self.URM.tocsc(), shrink=shrink)
        # Print the weight matrix
        # with np.printoptions(threshold=np.inf):
        #     print(self.W_matrix)

    # Generate recommendations for each tourist
    # at: number of guides that we want to recommend to each tourist
    # exclude_seen: if we want to avoid recommending the guides that the tourist has rated before
    def recommend(self, user_id, at=None, exclude_seen=True):
        user_profile = self.URM[user_id]
        # Compute the scores using the dot product
        scores = user_profile.dot(self.W_matrix).ravel()

        # Filter the guides
        if exclude_seen:
            scores = self.filter_seen(user_id, scores)

        # Rank the guides
        ranking = scores.argsort()[::-1]
            
        return ranking[:at]
    
    # Guides that have been already rated by the tourist will be excluded in the recommendation
    def filter_seen(self, user_id, scores):
        # Retrieve the cells regarding the interactions of that specific tourist
        start_pos = self.URM.indptr[user_id]
        end_pos = self.URM.indptr[user_id+1]

        # Extract the user profile
        user_profile = self.URM.indices[start_pos:end_pos]

        # Set the score of already rated guides to the minimum value
        scores[user_profile] = -np.inf

        return scores

### Content-Based Filtering Model

A **Content-Based Filtering** model recommends guides to tourists based on the attributes of the guides themselves. It relies on the assumption that a user will like items similar to the ones they liked in the past, and it requires a list of good-quality attributes for the guides to work properly.

In this case the **item-based** CBF technique has been used. It suggests guides to tourists according to the similarity weights computed on the guides: two guides are similar if they have a large share of attributes in common.

In [112]:
# Define the item-based content-based filtering recommender
class ItemCBFRecommender(object):

    # Initialize the input matrices
    def __init__(self, URM, ICM):
        self.URM = URM
        self.ICM = ICM

    # Fit the recommender with a chosen shrink value
    def fit(self, shrink=1):
        # Compute the pairwise similarity between guides using the transposed ICM
        self.W_matrix = vector_similarity(self.ICM.T.tocsc(), shrink=shrink)
        # with np.printoptions(threshold=np.inf):
        #     print(self.W_matrix)

    # Generate recommendations for each tourist
    # at: number of guides that we want to recommend to each tourist
    # exclude_seen: if we want to avoid recommending the guides that the tourist has rated before
    def recommend(self, user_id, at=None, exclude_seen=True):
        user_profile = self.URM[user_id]
        # Compute the scores using the dot product
        scores = user_profile.dot(self.W_matrix).ravel()

        # Filter the guides
        if exclude_seen:
            scores = self.filter_seen(user_id, scores)

        # Rank the guides
        ranking = scores.argsort()[::-1]

        return ranking[:at]

    # Guides that have been already rated by the tourist will be excluded in the recommendation
    def filter_seen(self, user_id, scores):

        # Retrieve the cells regarding the interactions of that specific tourist
        start_pos = self.URM.indptr[user_id]
        end_pos = self.URM.indptr[user_id+1]

        # Extract the user profile
        user_profile = self.URM.indices[start_pos:end_pos]

        # Set the score of already rated guides to the minimum value
        scores[user_profile] = -np.inf

        return scores

## Fit the model
The recommenders are built and fit on the available input data.

In [114]:
# Select whether to build a collaborative filtering model or a content-based filtering model
# Available options: 'cf', 'cbf'
model_type = 'cbf'

In [115]:
# Collaborative filtering model
# Input data: URM
# The shrink value is a non-negative number that can be tuned
if model_type == 'cf':
    recommender = ItemCFRecommender(URM_train)
    recommender.fit(shrink=0.5)

[[0.         0.         0.23278381 0.         0.         0.
  0.1480595  0.         0.200763   0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.16996397 0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.15917231 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.15410449
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.14959189
  0.         0.         0.         0.        ]
 [0.23278381 0.         0.         0.         0.         0.
  0.17766303 0.         0.         0.         0.         0.
  0.         0.         0.16463436 0.         0.  

In [116]:
# Content-based filtering model
# Input data: URM and ICM
# The shrink value is a non-negative number that can be tuned
if model_type == 'cbf':
    recommender = ItemCBFRecommender(URM_train, ICM_train)
    recommender.fit(shrink=0.5)

## Generate outputs

Here we generate the guide recommendations for each tourist.

In [117]:
# Set the number of guides to recommend to each tourist
n_recommendations_per_tourist = 5

In [118]:
# Generate recommendations for each tourist
recommendations = []

for _, tourist_id in tqdm(enumerate(arr_tourists)):
    # at: number of guides to recommend to each tourist
    # exclude_seen: if we want to exclude the guides already rated by this tourist
    rec_list = recommender.recommend(tourist_id, at=n_recommendations_per_tourist, exclude_seen=True)

    # Produce a string with the ids of the recommended guides
    rec_row = ' '.join(str(s) for s in rec_list)
    recommendations.append(rec_row)

200it [00:00, 16667.87it/s]


In [119]:
# Print recommendations for the first 10 users
for i in range(10):
    print("For user " + str(arr_tourists[i]) + " recommended guides: " + recommendations[i])

For user 0 recommended guides: 5 29 4 20 3
For user 1 recommended guides: 30 6 32 33 31
For user 2 recommended guides: 28 29 39 8 11
For user 3 recommended guides: 33 15 4 2 30
For user 4 recommended guides: 28 2 34 8 39
For user 5 recommended guides: 27 7 35 1 37
For user 6 recommended guides: 39 38 17 16 15
For user 7 recommended guides: 32 22 6 34 10
For user 8 recommended guides: 8 23 31 34 39
For user 9 recommended guides: 39 12 37 23 22


In [120]:
# Show the output dataframe
result_df = pd.DataFrame(
    data = {'tourist_id': arr_tourists,
            'guides': recommendations}
)

result_df

Unnamed: 0,tourist_id,guides
0,0,5 29 4 20 3
1,1,30 6 32 33 31
2,2,28 29 39 8 11
3,3,33 15 4 2 30
4,4,28 2 34 8 39
...,...,...
195,195,2 22 6 39 10
196,196,37 27 35 12 24
197,197,10 26 17 4 9
198,198,37 27 35 12 24


Let's pick a sample tourist and inspect the information of the recommended guides.

In [121]:
# Show some examples: select a tourist by id to visualize the received recommendations
sample_tourist = 50

# Show the selected tourist
pd.DataFrame(tourist_df.loc[sample_tourist,:])

Unnamed: 0,50
id,50
languages,['chinese']
keywords,['cinema']


In [122]:
# Extract the indices of the recommended guides
sample_guide_list = list(map(int, recommendations[sample_tourist].split(" ")))

# Show the list of recommended guides
guide_df.loc[sample_guide_list,:]

Unnamed: 0,id,gender,name,birth_date,now_available,languages_spoken,price,education,biography,keywords,current_location,experience
22,22,female,Shannon Oneal,1976-06-09,True,"[chinese, spanish]",25,high-school,Information officer,"[history, countryside, cinema, literature, music]","{'lat': 40.36846436096489, 'lon': 18.171263566...",6
29,29,female,Dana Cox,1988-11-28,True,"[bulgarian, chinese]",28,master,Market researcher,"[archeology, literature, tracking]","{'lat': 40.35465462623879, 'lon': 18.172666005...",10
11,11,female,Rhonda Hull,1960-12-06,True,[bulgarian],23,phd,Garment/textile technologist,"[cinema, beer, rafting]","{'lat': 40.367394960036485, 'lon': 18.17553586...",25
16,16,male,Steven Sims,1962-12-17,True,"[spanish, bulgarian, french]",32,middle-school,"Psychologist, counselling","[countryside, music, beer]","{'lat': 40.32347428820915, 'lon': 18.164571467...",2
39,39,female,Alice Lester,1973-03-02,True,"[spanish, dutch]",26,middle-school,Minerals surveyor,[history],"{'lat': 40.35901486939172, 'lon': 18.166553513...",1
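
Since the language is meant to act as a constraint, the sample can be checked against it. The sketch below copies the languages of tourist 50 and of the five recommended guides from the tables above; `shares_language` is a hypothetical helper, not part of the notebook.

```python
# Languages copied from the tables above
tourist_languages = ['chinese']
recommended_guides = {
    22: ['chinese', 'spanish'],
    29: ['bulgarian', 'chinese'],
    11: ['bulgarian'],
    16: ['spanish', 'bulgarian', 'french'],
    39: ['spanish', 'dutch'],
}


def shares_language(tourist_langs, guide_langs):
    """True if the tourist and the guide have at least one language in common."""
    return bool(set(tourist_langs) & set(guide_langs))


# Keep only the recommended guides that the tourist can communicate with
compatible_guides = [
    guide_id
    for guide_id, langs in recommended_guides.items()
    if shares_language(tourist_languages, langs)
]
```

For this sample only guides 22 and 29 list chinese among their languages, so a filter of this kind, applied after the ranking, is one possible way to enforce the constraint strictly.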
