# Tour Recommendation Pipeline
Guides can also publish pre-organized tours on the application, so we studied this case as well by adapting the guide recommendation pipeline to published tours.

The tour recommendation system therefore recommends tours published by guides to tourists.

The tour recommendation system uses two datasets:
- Ratings given by tourists to specific tours;
- A set of attributes describing the tours.

## Define working environment

In [1]:
# Import libraries

import numpy as np
import pandas as pd
import scipy.sparse as sps
import ast

from tqdm import tqdm

## Load and preprocess data

In this part we load the previously generated dataframes:
- tourists_500.csv with the id, spoken languages, and keywords of the tourists.
- tours_50.csv with various attributes of the tours.
- tours_ratings_500_50.csv which contains the tourist-tour interactions.


In [2]:
# Load dataframe for tourist attributes

# Pass the path directly, so that no file handle is left open
tourist_df = pd.read_csv(
    filepath_or_buffer = 'Data/tourists_500.csv',
    sep = ';',
    header = 0
)

tourist_df.rename(columns={tourist_df.columns[0]: 'id'}, inplace=True)

tourist_df

Unnamed: 0,id,languages,keywords
0,0,['spanish'],"['rafting', 'countryside', 'literature']"
1,1,['chinese'],"['art', 'wine', 'literature', 'sport']"
2,2,['italian'],"['sport', 'rafting', 'cinema', 'museums', 'mus..."
3,3,['spanish'],['beer']
4,4,['french'],"['tracking', 'food', 'art', 'countryside', 'ra..."
...,...,...,...
495,495,['french'],"['music', 'museums', 'literature', 'cinema', '..."
496,496,['bulgarian'],"['countryside', 'food', 'wine', 'music', 'hist..."
497,497,['french'],"['music', 'beer', 'rafting', 'museums', 'track..."
498,498,['italian'],['sport']


In [3]:
# Load dataframe for tour attributes

# Pass the path directly, so that no file handle is left open
tour_df = pd.read_csv(
    filepath_or_buffer = 'Data/tours_50.csv',
    sep = ';',
    header = 0,

    # Process the columns containing lists: load the content of those columns as list
    converters = {'languages':ast.literal_eval, 'keywords':ast.literal_eval, 'attractions':ast.literal_eval}
)

tour_df.rename(columns={tour_df.columns[0]: 'id'}, inplace=True)

tour_df

Unnamed: 0,id,guide,languages,city,attractions,keywords,price,date,duration
0,0,0,[english],Lecce,"[Celestine Convent, Roman Theatre]","[archeology, literature]",22,2024-06-30,12
1,1,1,"[italian, dutch]",Lecce,[Museo Faggiano],[museums],28,2024-06-26,9
2,2,2,"[chinese, french, english]",Lecce,"[Palazzo Carafa, Church of Santa Chiara]","[history, literature]",27,2024-06-03,11
3,3,3,[bulgarian],Lecce,"[Villa Comunale di Lecce, Church of San Matteo...","[countryside, literature, museums]",31,2024-06-29,11
4,4,4,"[deutsche, french]",Lecce,"[Torre del Parco, Palazzo dei Celestini]",[history],31,2024-06-17,12
5,5,5,"[deutsche, dutch, bulgarian]",Lecce,"[Museo Faggiano, Palazzo dei Celestini]","[history, museums]",26,2024-06-29,7
6,6,6,"[english, chinese, french]",Lecce,"[Piazza Sant'Oronzo, Villa Comunale di Lecce]","[art, countryside]",38,2024-06-28,10
7,7,7,"[spanish, dutch, french, italian]",Lecce,[Celestine Convent],[literature],34,2024-07-01,7
8,8,8,"[bulgarian, english, chinese]",Lecce,"[San Giovanni Battista Church, Roman Theatre, ...","[archeology, history, literature]",25,2024-06-10,11
9,9,9,"[chinese, bulgarian, dutch, english]",Lecce,[Palazzo dei Celestini],[history],29,2024-06-16,8


In [4]:
# Load dataframe for ratings

# Pass the path directly, so that no file handle is left open
rating_df = pd.read_csv(
    filepath_or_buffer = 'Data/tours_ratings_500_50.csv',
    sep = ';',
    header = 0
)

rating_df

Unnamed: 0,0,1,2
0,0,17,4.0
1,0,33,4.0
2,1,2,5.0
3,1,43,3.0
4,2,44,4.0
...,...,...,...
666,495,28,4.0
667,496,29,4.0
668,497,19,4.0
669,497,26,4.0


In [5]:
# Rename the columns of the rating dataframe
rating_df.rename(columns={rating_df.columns[0]: 'tourist_id',
                          rating_df.columns[1]: 'tour_id',
                          rating_df.columns[2]: 'rating',
                         }, inplace=True)

# Remove the duplicated interactions, if any
rating_df.drop_duplicates(subset=['tourist_id','tour_id'],inplace=True)

rating_df

Unnamed: 0,tourist_id,tour_id,rating
0,0,17,4.0
1,0,33,4.0
2,1,2,5.0
3,1,43,3.0
4,2,44,4.0
...,...,...,...
666,495,28,4.0
667,496,29,4.0
668,497,19,4.0
669,497,26,4.0


## Print statistics

Some statistics of the acquired data are shown.

In [6]:
# Statistics about data
arr_tourists = tourist_df["id"].unique()
arr_tours = tour_df["id"].unique()

n_tourists = len(arr_tourists)
n_tours = len(arr_tours)
n_interactions = len(rating_df)

print("Number of tourists: {:d}".format(n_tourists))
print("Number of tours: {:d}".format(n_tours))
print("Number of interactions: {:d}".format(n_interactions))

print("Average interaction per tourist: {:.2f}".format(n_interactions/n_tourists))
print("Average interaction per tour: {:.2f}".format(n_interactions/n_tours))
print("Sparsity: {:.2f} %".format((1-float(n_interactions)/(n_tours*n_tourists))*100))

Number of tourists: 500
Number of tours: 50
Number of interactions: 671
Average interaction per tourist: 1.34
Average interaction per tour: 13.42
Sparsity: 97.32 %
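
As a quick check of the last figure, the sparsity is the share of tourist-tour pairs without a rating, computed from the counts printed above:

$$ \text{sparsity} = 1 - \frac{n_{interactions}}{n_{tourists} \cdot n_{tours}} = 1 - \frac{671}{500 \cdot 50} = 1 - 0.02684 \approx 97.32\% .$$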


In [7]:
# Statistics about ratings

print("Average rating: {:.6f}".format(rating_df.loc[:, 'rating'].mean()))
print("Maximum rating: {:.6f}".format(rating_df.loc[:, 'rating'].max()))
print("Minimum rating: {:.6f}".format(rating_df.loc[:, 'rating'].min()))

Average rating: 3.985097
Maximum rating: 5.000000
Minimum rating: 2.000000


## Create the URM

The **User Rating Matrix** describes the interactions between tourists and organized tours, where rows represent tourists and columns represent tours. The values in the cells can be defined by an implicit or explicit approach:
- Explicit ratings are given directly by the tourists to the tours, according to a rating scale.
- Implicit ratings are obtained according to specific criteria based on tourists' behaviour, without asking for an opinion explicitly. The corresponding value of the interaction is set to 1 if we think that the tourist could be interested in the tour, otherwise 0.

In our problem, we decide to use **explicit ratings** with a rating scale from 1 to 5.
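
As a small illustration of the two options, the sketch below builds a toy explicit URM with the same `csr_matrix` constructor used in this notebook and derives an implicit version from it. The toy ratings and the "liked" threshold of 4 are assumptions made for the example, not values taken from the dataset.

```python
import numpy as np
import scipy.sparse as sps

# Toy interactions (hypothetical values): tourist_id, tour_id, rating
tourist_ids = np.array([0, 0, 1, 2])
tour_ids = np.array([1, 2, 0, 2])
ratings = np.array([4.0, 2.0, 5.0, 3.0])

# Explicit URM: each cell stores the rating given by the tourist
urm_explicit = sps.csr_matrix((ratings, (tourist_ids, tour_ids)), shape=(3, 3))

# Implicit URM: 1 where the rating reaches an assumed "liked" threshold
like_threshold = 4.0
liked = ratings >= like_threshold
urm_implicit = sps.csr_matrix(
    (np.ones(liked.sum()), (tourist_ids[liked], tour_ids[liked])),
    shape=(3, 3),
)

print(urm_explicit.toarray())
print(urm_implicit.toarray())
```

The explicit matrix keeps the strength of each opinion, while the implicit one only records whether an interaction is considered positive.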

In [8]:
# Create the User Rating Matrix in CSR format to facilitate the computation
# Row: tourist_id, Column: tour_id, Value: rating
URM_all = sps.csr_matrix(
    (rating_df["rating"].values,
    (rating_df["tourist_id"].values, rating_df["tour_id"].values))
)

URM_all

<500x50 sparse matrix of type '<class 'numpy.float64'>'
	with 671 stored elements in Compressed Sparse Row format>

In [9]:
# Define the portion of data used for training the model
# Since we are using fake data, we train the model on all the available data
URM_train = URM_all

## Create the ICM

The **Item Content Matrix** describes the tours with their attributes, with rows representing tours and columns representing attributes. Each number in the ICM indicates how important an attribute is in characterizing a tour.

In this case we started from the simplest form: the cell value is equal to 1 if the tour has that specific attribute, and 0 otherwise.
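
To make the binary encoding concrete, here is a minimal sketch on a toy dataframe, using the same explode-and-crosstab idiom applied later in this notebook. The three tours and their keywords are made up for illustration.

```python
import pandas as pd

# Toy tours (hypothetical): each tour has a list of keywords
toy_df = pd.DataFrame({
    "id": [0, 1, 2],
    "keywords": [["art", "history"], ["history"], ["art", "museums"]],
})

# One row per (tour, keyword) pair; the index still identifies the tour
exploded = toy_df["keywords"].explode()

# Count the occurrences of each keyword per tour: 1 if present, 0 otherwise
one_hot = pd.crosstab(exploded.index, exploded)

print(one_hot)
```

Each keyword becomes one column, and each row is the binary attribute vector of a tour.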

In [10]:
# Make a copy of the tour dataframe to build the ICM
icm_df = tour_df.copy(deep=True)

In [11]:
# Split the price information into three ranges
# - cost < 25: low cost
# - 25 <= cost < 35: medium
# - cost >= 35: high

def replace_price(x):
    if x < 25:
        return 'low_cost'
    elif x < 35:
        return 'medium_cost'
    else:
        return 'high_cost'

# Apply conversion to the dataframe
icm_df['price'] = icm_df['price'].apply(replace_price)

In [12]:
# Show the dataframe
icm_df

Unnamed: 0,id,guide,languages,city,attractions,keywords,price,date,duration
0,0,0,[english],Lecce,"[Celestine Convent, Roman Theatre]","[archeology, literature]",low_cost,2024-06-30,12
1,1,1,"[italian, dutch]",Lecce,[Museo Faggiano],[museums],medium_cost,2024-06-26,9
2,2,2,"[chinese, french, english]",Lecce,"[Palazzo Carafa, Church of Santa Chiara]","[history, literature]",medium_cost,2024-06-03,11
3,3,3,[bulgarian],Lecce,"[Villa Comunale di Lecce, Church of San Matteo...","[countryside, literature, museums]",medium_cost,2024-06-29,11
4,4,4,"[deutsche, french]",Lecce,"[Torre del Parco, Palazzo dei Celestini]",[history],medium_cost,2024-06-17,12
5,5,5,"[deutsche, dutch, bulgarian]",Lecce,"[Museo Faggiano, Palazzo dei Celestini]","[history, museums]",medium_cost,2024-06-29,7
6,6,6,"[english, chinese, french]",Lecce,"[Piazza Sant'Oronzo, Villa Comunale di Lecce]","[art, countryside]",high_cost,2024-06-28,10
7,7,7,"[spanish, dutch, french, italian]",Lecce,[Celestine Convent],[literature],medium_cost,2024-07-01,7
8,8,8,"[bulgarian, english, chinese]",Lecce,"[San Giovanni Battista Church, Roman Theatre, ...","[archeology, history, literature]",medium_cost,2024-06-10,11
9,9,9,"[chinese, bulgarian, dutch, english]",Lecce,[Palazzo dei Celestini],[history],medium_cost,2024-06-16,8


In [13]:
# Remove the column(s) that we would not consider as relevant static attributes
icm_df.drop(labels=['date', 'city', 'duration'], axis=1, inplace=True)

In [14]:
# Split the categorical attributes into separate columns
multiclass_attributes = ['languages', 'attractions', 'keywords', 'price', 'guide']

# The result would have value 1 for each cell corresponding to an attribute of the tour
# and 0 otherwise
for n in multiclass_attributes:

    # Explode lists of attributes to a subset of columns
    s = icm_df[n].explode()

    # Add one column per category found in this attribute.
    # pd.crosstab counts the occurrences of each category per tour (1 if the tour has it, 0 otherwise);
    # tours missing from the crosstab (e.g. with an empty list) get NaN after the join, which fillna replaces with 0
    icm_df = icm_df.join(pd.crosstab(s.index, s).astype(object)).fillna(0)

    # Drop the original column
    icm_df.drop(labels=n,axis=1,inplace=True)

In [15]:
# Show the dataframe
icm_df

Unnamed: 0,id,bulgarian,chinese,deutsche,dutch,english,french,italian,spanish,Basilica di Santa Croce,...,40,41,42,43,44,45,46,47,48,49
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,5,1,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,6,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,7,0,0,0,1,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
8,8,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,9,1,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Print the list of attributes
# The columns with a number as name indicate the ids of the guides
attribute_list = icm_df.columns.tolist()
attribute_list

['id',
 'bulgarian',
 'chinese',
 'deutsche',
 'dutch',
 'english',
 'french',
 'italian',
 'spanish',
 'Basilica di Santa Croce',
 'Castello di Carlo V',
 'Celestine Convent',
 'Church of San Francesco della Scarpa',
 'Church of San Matteo',
 "Church of San Niccolo' e Cataldo",
 'Church of Santa Chiara',
 "Colonna di Sant'Oronzo",
 'Lecce Cathedral',
 'Museo Faggiano',
 'Palazzo Carafa',
 'Palazzo dei Celestini',
 "Piazza Sant'Oronzo",
 'Piazza del Duomo',
 'Porta Napoli',
 'Roman Amphitheatre',
 'Roman Theatre',
 'San Giovanni Battista Church',
 'Torre del Parco',
 'Villa Comunale di Lecce',
 'archeology',
 'art',
 'countryside',
 'history',
 'literature',
 'museums',
 'high_cost',
 'low_cost',
 'medium_cost',
 0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49]

In [17]:
# Convert the names of columns (attributes) into sequential numbers
def convert_index(x):
    # The first column contains the ids of the tours, keep its name
    if x == 'id':
        return x
    else:
        return attribute_list.index(x)

# Apply the transformation to the dataframe
icm_df.rename(mapper=convert_index, axis=1, inplace=True)
icm_df

Unnamed: 0,id,1,2,3,4,5,6,7,8,9,...,78,79,80,81,82,83,84,85,86,87
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,5,1,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,6,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,7,0,0,0,1,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
8,8,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,9,1,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Re-organize data structure for building the ICM

# The dataframe is modeled such that the column 'id' is identifier variable,
# and all other columns are "unpivoted" to the row axis.
# The result has only two non-identifier columns: 'label' (attribute) and 'value'.
icm_df = pd.melt(icm_df, id_vars='id', var_name='label')

# Filter the values: only the cells with value 1 are kept in the new dataframe
icm_df = icm_df[icm_df["value"]==1]
icm_df

Unnamed: 0,id,label,value
3,3,1,1
5,5,1,1
8,8,1,1
9,9,1,1
10,10,1,1
...,...,...,...
4145,45,83,1
4196,46,84,1
4247,47,85,1
4298,48,86,1


In [19]:
# Create the Item Content Matrix in CSR format to facilitate the computation
# Row: tour_id, Column: attribute_id, Value: 1 if the attribute characterizes the tour
ICM_all = sps.csr_matrix(
    (icm_df["value"].values,
    (icm_df["id"].values, icm_df["label"].values))
)

ICM_all

<50x88 sparse matrix of type '<class 'numpy.int64'>'
	with 390 stored elements in Compressed Sparse Row format>

In [20]:
# Print the matrix in dense format
print(ICM_all.todense())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 1 0 0]
 [0 1 0 ... 0 1 0]
 [0 1 0 ... 0 0 1]]


### Feature Engineering
It is possible to model the importance of the features by weighting them differently in the ICM: we can assign a higher value to the features that we consider more relevant for the recommendation problem, such as the languages spoken by the guides offering the tours.

In the considered scenario, we assume that the **language** is a hard constraint with high priority in the matching process: for communication purposes, it is reasonable that the recommended tours are offered by guides who speak the same language(s) as the tourist. So we aim to satisfy this constraint first. Note that a higher weight makes tours that share languages more similar to each other, but it does not by itself exclude tours offered in other languages.

Other attributes can be further explored whenever necessary.
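
The effect of weighting can be seen on a toy example. The sketch below compares the plain cosine similarity of three hypothetical tours before and after multiplying the language columns by 10; the attribute layout and the tours are assumptions made for illustration only.

```python
import numpy as np

def cosine(a, b):
    """Plain cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Columns (hypothetical layout): [spanish, english, art, history]
tour_a = np.array([1, 0, 1, 0])   # spanish, art
tour_b = np.array([1, 0, 0, 1])   # spanish, history
tour_c = np.array([0, 1, 1, 0])   # english, art

# Weight 10 on the two language columns, default weight 1 elsewhere
weights = np.array([10, 10, 1, 1])

print("unweighted:", cosine(tour_a, tour_b), cosine(tour_a, tour_c))
print("weighted:  ", cosine(tour_a * weights, tour_b * weights),
      cosine(tour_a * weights, tour_c * weights))
```

Without weights, tour A is equally similar to B (same language) and C (same keyword). With the language weight, the pair sharing a language dominates, which is the behaviour the weighting is meant to encourage.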

In [21]:
# Print the full list of attributes with their indices
for l in attribute_list:
    print(attribute_list.index(l), l)

0 id
1 bulgarian
2 chinese
3 deutsche
4 dutch
5 english
6 french
7 italian
8 spanish
9 Basilica di Santa Croce
10 Castello di Carlo V
11 Celestine Convent
12 Church of San Francesco della Scarpa
13 Church of San Matteo
14 Church of San Niccolo' e Cataldo
15 Church of Santa Chiara
16 Colonna di Sant'Oronzo
17 Lecce Cathedral
18 Museo Faggiano
19 Palazzo Carafa
20 Palazzo dei Celestini
21 Piazza Sant'Oronzo
22 Piazza del Duomo
23 Porta Napoli
24 Roman Amphitheatre
25 Roman Theatre
26 San Giovanni Battista Church
27 Torre del Parco
28 Villa Comunale di Lecce
29 archeology
30 art
31 countryside
32 history
33 literature
34 museums
35 high_cost
36 low_cost
37 medium_cost
38 0
39 1
40 2
41 3
42 4
43 5
44 6
45 7
46 8
47 9
48 10
49 11
50 12
51 13
52 14
53 15
54 16
55 17
56 18
57 19
58 20
59 21
60 22
61 23
62 24
63 25
64 26
65 27
66 28
67 29
68 30
69 31
70 32
71 33
72 34
73 35
74 36
75 37
76 38
77 39
78 40
79 41
80 42
81 43
82 44
83 45
84 46
85 47
86 48
87 49


In [22]:
# Define which features we would like to model
# Some examples could be 'low_cost' to consider the cost of the tour,
# or keywords as 'art', etc.
features_to_model = ['low_cost', 'art']

# Define the importance of those features that we want to attribute in a positional way
# Default weight: 1
importance_weights = [1, 1]

# The language columns form a contiguous block in the attribute list:
# store the first and last index of the block (both inclusive) and attribute a higher weight
languages_columns = [1, 8]
languages_weight = 10

In [23]:
# Initialize the dictionaries for the chosen attributes and weights
feature_columns_dict = dict()
importance_weights_dict = dict()

# Add the language feature to the two dictionaries
feature_columns_dict.update({'language': languages_columns})
importance_weights_dict.update({'language': languages_weight})

In [24]:
# Build function to extract other features' indices and the corresponding weights to attribute
def extract_feature(name, i):
    # Update the dictionaries with indices and weights
    feature_columns_dict.update({name: attribute_list.index(name)})
    importance_weights_dict.update({name: importance_weights[i]})

# Execute the operation
for i in range(len(features_to_model)):
    extract_feature(features_to_model[i], i)

print(feature_columns_dict)
print(importance_weights_dict)

{'language': [1, 8], 'low_cost': 36, 'art': 30}
{'language': 10, 'low_cost': 1, 'art': 1}


In [25]:
# Create a copy of the original ICM to modify
new_icm_df = icm_df.copy(deep=True)

In [26]:
# Modify the cell values with respect to the weights we want to give
for feature in importance_weights_dict:

    # If the weight is not default, modify the values
    if importance_weights_dict[feature] > 1:

        # Trace all columns assigned to the languages
        if feature == 'language':
            condition = (new_icm_df.label >= feature_columns_dict[feature][0]) & (
                        new_icm_df.label <= feature_columns_dict[feature][1])

        # Trace the specific feature
        else:
            condition = (new_icm_df.label == feature_columns_dict[feature])

        # Find the rows with labels corresponding to the selected features, and update the values
        new_icm_df.loc[condition, 'value'] = importance_weights_dict[feature]

In [27]:
# Print the modified dataframe
new_icm_df

Unnamed: 0,id,label,value
3,3,1,10
5,5,1,10
8,8,1,10
9,9,1,10
10,10,1,10
...,...,...,...
4145,45,83,1
4196,46,84,1
4247,47,85,1
4298,48,86,1


In [28]:
# Build the modified ICM from the dataframe
# Like the URM, the ICM is built in CSR format for computational purposes
# Row: tour_id, Column: attribute_id, Value: weight of the attribute if it characterizes the tour

ICM_modified = sps.csr_matrix(
    (new_icm_df["value"].values,
    (new_icm_df["id"].values, new_icm_df["label"].values))
)

ICM_modified

<50x88 sparse matrix of type '<class 'numpy.int64'>'
	with 390 stored elements in Compressed Sparse Row format>

In [29]:
# Define the portion of ICM used for training the model: here we use the entire ICM
ICM_train = ICM_modified

## Build the model

At this point all the necessary input data are prepared, and we proceed to build the recommendation algorithm.

### Similarity Function
A recommendation system learns from past ratings and/or from the attributes of users and items. By evaluating the similarity between tours, it can suggest tours that are similar to the ones a tourist liked in the past, and it can also recommend a tour liked by a tourist to similar tourists.

To measure the degree of similarity between tours, we implement the **Cosine Similarity** function for the models. For binary vectors, the dot product counts the elements that two vectors $\vec{i}$ and $\vec{j}$ representing tours have in common; dividing it by the vector norms gives the **normalized dot product**:
$$    s_{ij}=\frac{\vec{i}\cdot \vec{j}}{\|\vec{i}\|_2\cdot \|\vec{j}\|_2}=\cos(\theta)   .$$

For faster computation, we adopt a version of the cosine similarity based on vector products, with $M$ the reference matrix (URM or ICM), $v_i$ the vector of item $i$, and $M_{I}$ the vectors of all items:
$$ W_{i,I}
= \cos(v_i, M_{I})
= \frac{v_i \cdot M_{I}}{\lVert v_i \rVert \, IW_{I} + \mathrm{shrink}}  ,$$
where
$$ IW_{i} = \sqrt{\sum_{u \in U} M_{u,i}^2}  .$$
The shrink term accounts for the support of the vectors: similarities computed from vectors with more non-zero elements are statistically more significant, so the shrink term penalizes the ones based on few elements.

Finally, the similarity values are stored in a **similarity matrix**, which establishes the pairwise correspondence between tours. This is fed to the models as weights.
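
The role of the shrink term can be checked on a toy case. The sketch below is a simplified dense version of the formula above (without the small constant that the notebook function adds for numerical safety); the rating vectors and the shrink value of 10 are assumptions chosen to make the effect visible.

```python
import numpy as np

def shrunk_cosine(a, b, shrink=0.0):
    """Cosine similarity with a shrink term added to the denominator."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + shrink))

# Two items rated identically by a single tourist (low support)
low_a = np.array([5.0, 0.0, 0.0, 0.0])
low_b = np.array([5.0, 0.0, 0.0, 0.0])

# Two items rated identically by four tourists (high support)
high_a = np.array([5.0, 5.0, 5.0, 5.0])
high_b = np.array([5.0, 5.0, 5.0, 5.0])

for shrink in (0.0, 10.0):
    print(shrink,
          shrunk_cosine(low_a, low_b, shrink),
          shrunk_cosine(high_a, high_b, shrink))
```

Without shrink both pairs have similarity 1. With shrink, the pair supported by a single rating drops more than the pair supported by four ratings, so better-supported similarities rank higher.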

In [30]:
# Define the cosine similarity function based on vector products
def vector_similarity(m: sps.csc_matrix, shrink: float):

    # Compute the Euclidean norm of each column of the matrix m
    item_weights = np.sqrt(
        np.asarray(m.power(2).sum(axis=0))
    ).flatten()

    # Find the number of items
    num_items = m.shape[1]

    # Get the transposed matrix
    matrix_t = m.T

    # Initialize the empty weight matrix
    weights = np.empty(shape=(num_items, num_items))

    # Compute the similarity values as mentioned in the previous formula
    # The results will be weights to give to the model
    for item_id in range(num_items):
        numerator = matrix_t.dot(m[:, item_id]).toarray().flatten()
        denominator = item_weights[item_id] * item_weights + shrink + 1e-6

        weights[item_id] = numerator / denominator

    # Set the diagonal of the similarity matrix to 0,
    # so that an item is never considered similar to itself
    np.fill_diagonal(weights, 0.0)

    return weights

As for the guide recommendation, we report two different approaches: **Collaborative Filtering** (CF) and **Content-Based Filtering** (CBF).

### Collaborative Filtering Model

**Collaborative Filtering** models rely on the opinions of a community of tourists, without needing the tours' attributes. They focus on identifying patterns and similarities in the interactions between tourists and tours, based on the tourists' feedback.

In this case, the **item-based** collaborative filtering technique is used. It computes the similarity between each pair of tours according to how many tourists have the same opinion on both.
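
The scoring step of item-based CF can be traced on a toy matrix. This is a dense, simplified sketch rather than the notebook's sparse implementation; the ratings and the shrink value are assumptions made for the example.

```python
import numpy as np

# Toy URM (hypothetical ratings): 4 tourists x 3 tours
urm = np.array([[5.0, 4.0, 0.0],
                [4.0, 5.0, 0.0],
                [0.0, 3.0, 4.0],
                [5.0, 0.0, 0.0]])

# Item-item similarity from the columns of the URM, with shrink
shrink = 0.5
norms = np.linalg.norm(urm, axis=0)
w = (urm.T @ urm) / (np.outer(norms, norms) + shrink)
np.fill_diagonal(w, 0.0)

# Score every tour for tourist 3, who has rated only tour 0
user_id = 3
scores = urm[user_id] @ w

# Exclude the tours already rated, then rank by descending score
scores[urm[user_id] > 0] = -np.inf
ranking = scores.argsort()[::-1]

print(ranking)
```

Tour 1 comes first because the tourists who rated tour 0 also rated tour 1, while tour 2 shares no raters with tour 0.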

In [31]:
# Define the item-based collaborative filtering recommender
class ItemCFRecommender(object):

    # Initialize the input matrix
    def __init__(self, URM):
        self.URM = URM

    # Fit the recommender with a chosen shrink value
    def fit(self, shrink=1):
        # Compute the pairwise similarity between tours using the URM
        self.W_matrix = vector_similarity(self.URM.tocsc(), shrink=shrink)
        # Print the weight matrix
        # with np.printoptions(threshold=np.inf):
        #     print(self.W_matrix)

    # Generate recommendations for each tourist
    # at: number of tours that we want to recommend to each tourist
    # exclude_seen: if we want to avoid recommending the tours that the tourist has rated before
    def recommend(self, user_id, at=None, exclude_seen=True):
        user_profile = self.URM[user_id]
        # Compute the scores using the dot product
        scores = user_profile.dot(self.W_matrix).ravel()

        # Filter the tours
        if exclude_seen:
            scores = self.filter_seen(user_id, scores)

        # Rank the tours
        ranking = scores.argsort()[::-1]

        return ranking[:at]

    # Tours that have been already rated by the tourist will be excluded in the recommendation
    def filter_seen(self, user_id, scores):
        # Retrieve the cells regarding the interactions of that specific tourist
        start_pos = self.URM.indptr[user_id]
        end_pos = self.URM.indptr[user_id+1]

        # Extract the user profile
        user_profile = self.URM.indices[start_pos:end_pos]

        # Set the score of already rated tours to the minimum value
        scores[user_profile] = -np.inf

        return scores

### Content-Based Filtering Model

A **Content-Based Filtering** model recommends tours to tourists based on the attributes of the tours themselves. It assumes that a tourist will like tours similar to the ones they liked in the past, and it requires a list of good quality attributes for the tours to work properly.

In this case the **item-based** CBF technique is used. It suggests tours to tourists according to the similarity weights computed on the tours: two tours are similar if they have many attributes in common.
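
One practical difference from collaborative filtering is that attribute-based similarity is defined even for a tour that nobody has rated yet. The dense sketch below illustrates this on toy data; the helper `similarity`, the matrices, and the shrink value are assumptions for illustration and not the notebook's implementation.

```python
import numpy as np

def similarity(item_vectors, shrink):
    """Cosine similarity between the rows of item_vectors, with shrink and zero diagonal."""
    norms = np.linalg.norm(item_vectors, axis=1)
    w = (item_vectors @ item_vectors.T) / (np.outer(norms, norms) + shrink)
    np.fill_diagonal(w, 0.0)
    return w

# Toy URM (hypothetical): 2 tourists x 3 tours; tour 2 has no ratings yet
urm = np.array([[5.0, 0.0, 0.0],
                [4.0, 3.0, 0.0]])

# Toy ICM (hypothetical): 3 tours x 3 attributes; tours 0 and 2 share their attributes
icm = np.array([[1.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])

shrink = 0.5
w_cf = similarity(urm.T, shrink)   # tours described by their ratings
w_cbf = similarity(icm, shrink)    # tours described by their attributes

# Scores for tourist 0, who rated only tour 0
cf_scores = urm[0] @ w_cf
cbf_scores = urm[0] @ w_cbf

print("CF scores: ", cf_scores)
print("CBF scores:", cbf_scores)
```

The unrated tour 2 gets a zero score from the rating-based similarity but a positive score from the attribute-based one, because it shares its attributes with a tour the tourist rated.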

In [32]:
# Define the item-based content-based filtering recommender
class ItemCBFRecommender(object):

    # Initialize the input matrices
    def __init__(self, URM, ICM):
        self.URM = URM
        self.ICM = ICM

    # Fit the recommender with a chosen shrink value
    def fit(self, shrink=1):
        # Compute the pairwise similarity between tours using the transposed ICM
        self.W_matrix = vector_similarity(self.ICM.T.tocsc(), shrink=shrink)
        # with np.printoptions(threshold=np.inf):
        #     print(self.W_matrix)

    # Generate recommendations for each tourist
    # at: number of tours that we want to recommend to each tourist
    # exclude_seen: if we want to avoid recommending the tours that the tourist has rated before
    def recommend(self, user_id, at=None, exclude_seen=True):
        user_profile = self.URM[user_id]
        # Compute the scores using the dot product
        scores = user_profile.dot(self.W_matrix).ravel()

        # Filter the tours
        if exclude_seen:
            scores = self.filter_seen(user_id, scores)

        # Rank the tours
        ranking = scores.argsort()[::-1]

        return ranking[:at]

    # Tours that have been already rated by the tourist will be excluded in the recommendation
    def filter_seen(self, user_id, scores):

        # Retrieve the cells regarding the interactions of that specific tourist
        start_pos = self.URM.indptr[user_id]
        end_pos = self.URM.indptr[user_id+1]

        # Extract the user profile
        user_profile = self.URM.indices[start_pos:end_pos]

        # Set the score of already rated tours to the minimum value
        scores[user_profile] = -np.inf

        return scores

## Fit the model

The recommenders are built and fit on the available input data.

In [33]:
# Select if building a collaborative filtering model or a content-based filtering model
# Available options: 'cf', 'cbf'
model_type = 'cbf'

In [34]:
# Collaborative filtering model
# Input data: URM
# The shrink value is a non-negative number that can be tuned
if model_type == 'cf':
    recommender = ItemCFRecommender(URM_train)
    recommender.fit(shrink=0.5)

In [35]:
# Content-based filtering model
# Input data: URM and ICM
# The shrink value is a non-negative number that can be tuned
if model_type == 'cbf':
    recommender = ItemCBFRecommender(URM_train, ICM_train)
    recommender.fit(shrink=0.5)

## Generate outputs

Here we generate the tour recommendations for the tourists.

In [36]:
# Set the number of tours to recommend to each tourist
n_recommendations_per_tourist = 3

In [37]:
# Generate recommendations for each tourist

recommendations = []

for _, tourist_id in tqdm(enumerate(arr_tourists)):
    # at: number of tours to recommend to each tourist
    # exclude_seen: if we want to exclude the tours already rated by this tourist
    rec_list = recommender.recommend(tourist_id, at=n_recommendations_per_tourist, exclude_seen=True)

    # Produce a string with the ids of the recommended tours
    rec_row = ' '.join(str(s) for s in rec_list)
    recommendations.append(rec_row)

500it [00:00, 18518.39it/s]


In [38]:
# print recommendations for the first 10 users
for i in range(10):
    print("User " + str(arr_tourists[i]) + " - recommended tours: " + recommendations[i])

User 0 - recommended tours: 12 45 37
User 1 - recommended tours: 31 30 6
User 2 - recommended tours: 34 48 29
User 3 - recommended tours: 49 12 22
User 4 - recommended tours: 37 26 39
User 5 - recommended tours: 8 32 49
User 6 - recommended tours: 11 42 3
User 7 - recommended tours: 49 12 22
User 8 - recommended tours: 13 8 23
User 9 - recommended tours: 38 27 35


In [39]:
# Show the output dataframe
result_df = pd.DataFrame(
    data = {'tourist_id': arr_tourists,
            'tours': recommendations}
)

result_df

Unnamed: 0,tourist_id,tours
0,0,12 45 37
1,1,31 30 6
2,2,34 48 29
3,3,49 12 22
4,4,37 26 39
...,...,...
495,495,41 36 39
496,496,8 44 9
497,497,5 9 17
498,498,49 12 22


Let's pick a tourist and check the recommendations.

In [72]:
# Show some examples: select a tourist by id to visualize the received recommendations
sample_tourist = 100

In [73]:
# Show the selected tourist
pd.DataFrame(tourist_df.loc[sample_tourist,:])

Unnamed: 0,100
id,100
languages,['spanish']
keywords,"['art', 'cinema', 'literature', 'rafting', 'mu..."


In [74]:
# Extract the indices of the recommended tours
sample_tour_list = list(map(int, recommendations[sample_tourist].split(" ")))

# Show the list of recommended tours
tour_df.loc[sample_tour_list,:]

Unnamed: 0,id,guide,languages,city,attractions,keywords,price,date,duration
22,22,22,"[chinese, spanish]",Lecce,"[Basilica di Santa Croce, Roman Theatre]","[archeology, history]",18,2024-06-29,12
6,6,6,"[english, chinese, french]",Lecce,"[Piazza Sant'Oronzo, Villa Comunale di Lecce]","[art, countryside]",38,2024-06-28,10
36,36,36,[spanish],Lecce,"[San Giovanni Battista Church, Piazza Sant'Oro...","[art, literature]",31,2024-06-09,11
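
In the sample above, tour 6 is offered in English, Chinese, and French, while tourist 100 speaks only Spanish, so weighting the language columns does not strictly enforce the language constraint. If a strict guarantee is wanted, one option is to mask the scores before ranking. The sketch below is a minimal illustration under assumptions: `mask_language_mismatch` is a hypothetical helper that is not part of this notebook, and the scores are made-up values.

```python
import numpy as np

def mask_language_mismatch(scores, tourist_languages, tours_languages):
    """Return a copy of scores where tours sharing no language with the tourist get -inf."""
    masked = np.array(scores, dtype=float)
    for tour_id, languages in enumerate(tours_languages):
        if not set(languages) & set(tourist_languages):
            masked[tour_id] = -np.inf
    return masked

# Toy data: languages mirror the three tours shown above, scores are made up
tours_languages = [["chinese", "spanish"],
                   ["english", "chinese", "french"],
                   ["spanish"]]
scores = np.array([0.9, 0.8, 0.7])

masked = mask_language_mismatch(scores, ["spanish"], tours_languages)
ranking = masked.argsort()[::-1]

print(masked)
print(ranking)
```

Applying this to the notebook data would also require parsing the languages column of the tourist dataframe, which is loaded as plain strings here (for example with `ast.literal_eval`, as done for the tour dataframe).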
