**Recommender systems**

Recommender systems are among the most widely used practical applications of data science. In this section, we focus on collaborative filtering, which relies on similarities between users. Based on users' past preferences, this type of recommender suggests items that similar users have liked or rated highly. For this task, we will use Surprise, a Python scikit for building and analyzing recommender systems.

We first need to set the rating scale of the dataset and load the merged df into a Surprise dataset:

In [None]:
from surprise import Dataset, Reader

# Set the rating scale of the dataset (ratings range from 0 to 2)
reader = Reader(rating_scale=(0, 2))

# Load the ratings; the columns must be in the order user, item, rating
data = Dataset.load_from_df(df[['userID', 'placeID', 'rating']], reader)

Now we are set to use the Surprise library. First, we benchmark several of the algorithms available in Surprise on this dataset. We run 5-fold cross-validation on the whole dataset and append the results to benchmark_scores. We also set the random seeds to 114 to get reproducible results:

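In [None]:
import random

import numpy as np
import pandas as pd
from surprise import (BaselineOnly, CoClustering, KNNWithMeans, KNNWithZScore,
                      NMF, SVD, SVDpp)
from surprise.model_selection import cross_validate

benchmark_scores = []

random.seed(114)
np.random.seed(114)

# Iterate over the selected algorithms
for algorithm in [BaselineOnly(), SVD(), SVDpp(), KNNWithMeans(),
                  KNNWithZScore(), CoClustering(), NMF()]:
    # 5-fold cross-validation
    cv = cross_validate(algorithm, data, cv=5, verbose=False)

    # Average the metrics over the folds and attach the algorithm name
    # (pd.concat replaces Series.append, which was removed in pandas 2.0)
    df_cv = pd.DataFrame.from_dict(cv).mean(axis=0)
    df_cv = pd.concat([df_cv, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark_scores.append(df_cv)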
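
To make the aggregation step concrete, here is a minimal, library-free sketch of the same idea: average each per-fold metric, attach the algorithm name, and rank by mean RMSE. The helper `summarize_cv`, the algorithm names, and the numbers are hypothetical and only mimic the shape of a cross-validation result:

```python
from statistics import mean


def summarize_cv(name, cv_results):
    """Average each per-fold metric and attach the algorithm name."""
    row = {metric: mean(values) for metric, values in cv_results.items()}
    row["Algorithm"] = name
    return row


# Hypothetical per-fold results, not taken from the restaurant dataset
toy_rows = [
    summarize_cv("AlgoA", {"test_rmse": [0.5, 0.75, 1.0], "fit_time": [0.25, 0.25, 0.5]}),
    summarize_cv("AlgoB", {"test_rmse": [0.5, 0.5, 0.5], "fit_time": [1.0, 1.0, 1.0]}),
]

# Sort so that the lowest mean RMSE comes first
ranking = sorted(toy_rows, key=lambda row: row["test_rmse"])
print([row["Algorithm"] for row in ranking])
```

Sorting on the mean RMSE is what puts the best-performing algorithm in the first row of the results table.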

Now, let's create a pandas DataFrame from the benchmark_scores list and see the results of each algorithm:

In [None]:
# Create results DataFrame from the benchmark scores
results = pd.DataFrame(benchmark_scores).set_index('Algorithm').sort_values('test_rmse')
results

The results are returned as a DataFrame with one row per algorithm, sorted by RMSE. On this particular dataset, SVDpp, a matrix factorization method within the collaborative filtering family, performs best, with a Root Mean Squared Error (RMSE) of 0.65. The next best algorithm is KNNWithMeans, which is derived directly from a basic nearest neighbors approach, with an RMSE of 0.66.
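
For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings, so lower is better. The following is a minimal, library-free sketch; the function name `rmse` and the toy ratings are illustrative assumptions and not part of Surprise:

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings on the 0-2 scale used in this section
toy_predictions = [(2, 1.5), (0, 0.5), (1, 1.0), (2, 1.0)]
toy_rmse = rmse(toy_predictions)
print(round(toy_rmse, 3))
```

Because the errors are squared before averaging, a single large miss (such as the last pair above) weighs more heavily than several small ones.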

We will apply these two algorithms, SVDpp and KNNWithMeans, to the dataset and compare how they perform in different scenarios. First, let's set up cross-validation with 5 splits:

In [None]:
from surprise.model_selection import KFold

# Define a cross-validation iterator with 5 folds
kf = KFold(n_splits=5)
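
Conceptually, a k-fold iterator partitions the ratings into k folds and uses each fold exactly once as the test set. The following pure-Python sketch only mimics that idea; the helper `k_fold_indices` is hypothetical, and Surprise's `KFold` works on its own dataset objects rather than on index lists:

```python
import random


def k_fold_indices(n_items, n_splits=5, seed=114):
    """Yield (train_idx, test_idx) lists; each index is in exactly one test fold."""
    if not 2 <= n_splits <= n_items:
        raise ValueError("n_splits must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # Every n_splits-th shuffled index, starting at `fold`, is held out
        test_idx = indices[fold::n_splits]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


folds = list(k_fold_indices(n_items=10, n_splits=5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

With 10 items and 5 splits, every iteration trains on 8 items and tests on the remaining 2.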

Then, we instantiate the two chosen algorithms, SVDpp and KNNWithMeans:

In [None]:
# Define algorithms 
algo_knnwithMeans = KNNWithMeans()
algo_svdpp = SVDpp()

Let's start with the KNNWithMeans algorithm and apply it to our dataset.

**KNNWithMeans**

KNNWithMeans is a basic collaborative filtering algorithm that takes into account the mean rating of each user. It is derived directly from the k-nearest neighbors approach, and its main tuning parameter is k, the maximum number of neighbors to consider. We will use the default of 40, but you can try other values and see how the results change.
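
The idea behind the prediction can be sketched as follows: the estimate is the user's own mean rating plus a similarity-weighted average of how far the neighbors' ratings of the item deviate from their own means. This is a simplified sketch under stated assumptions, not Surprise's implementation; the function name and the toy neighbors are hypothetical, and details such as a minimum number of neighbors or clipping to the rating scale are left out:

```python
def knn_with_means_predict(user_mean, neighbors, k=40):
    """Estimate a rating from mean-centered neighbor ratings.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples
    for the users who rated the item.
    """
    # Keep the k most similar neighbors that have a positive similarity
    positive = [n for n in neighbors if n[0] > 0]
    top_k = sorted(positive, key=lambda n: n[0], reverse=True)[:k]

    sim_total = sum(sim for sim, _, _ in top_k)
    if sim_total == 0:
        return user_mean  # no usable neighbors: fall back to the user's mean

    weighted_dev = sum(sim * (rating - mean) for sim, rating, mean in top_k)
    return user_mean + weighted_dev / sim_total


# Hypothetical neighbors: (similarity, rating of the item, mean rating)
toy_neighbors = [(0.9, 2, 1.0), (0.5, 1, 1.5), (0.1, 0, 1.0)]
estimate = knn_with_means_predict(user_mean=1.2, neighbors=toy_neighbors, k=2)
print(round(estimate, 3))
```

With k=2, the least similar neighbor is ignored, which illustrates how the choice of k changes the estimate.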

First, we set a seed to make the results reproducible. Then, we loop over the training and test splits produced by the kf cross-validation iterator we created previously. Inside the loop, we fit the algorithm on the training set, predict on the test set, and compute the RMSE, which we store in a list. Finally, we dump the predictions and the fitted algorithm to a file for later use. Because the dump is overwritten on every iteration, the file ends up holding the results of the last fold:

In [None]:
from surprise import accuracy, dump

random.seed(114)
np.random.seed(114)

rmse_knnwithmeans = []  # one RMSE per fold

for trainset, testset in kf.split(data):
    # Train and test the algorithm with KNNWithMeans
    algo_knnwithMeans.fit(trainset)
    predictions_knnwithmeans = algo_knnwithMeans.test(testset)

    # Compute and print the Root Mean Squared Error of this fold
    rmse_knnwithmeans.append(accuracy.rmse(predictions_knnwithmeans, verbose=True))

    # Save the predictions and the algorithm (overwritten on each fold)
    dump.dump('./dump_KNNWithMeans', predictions_knnwithmeans, algo_knnwithMeans)

The preceding code fits on the training set, predicts on the test set, and reports the RMSE for each iteration. We can get the mean RMSE over all iterations by averaging the per-fold values:

In [None]:
print("Average RMSE of the CV is: ", np.mean(rmse_knnwithmeans))

Now, let's load the dumped file of this algorithm and look closely at the predictions: 

In [None]:
# Load the dump file
predictions_knnwithmeans, algo_knnwithMeans = dump.load('./dump_KNNWithMeans')

After loading the dumped file, we can easily create a DataFrame from it like this and look at the first five rows:

In [None]:
df_knnithmeans = pd.DataFrame(predictions_knnwithmeans, columns=['uid', 'iid', 'rui', 'est', 'details']) 
df_knnithmeans.head()

This df_knnithmeans consists of five columns. The first one, uid, is the userID from our dataset; the second, iid, is the placeID of the restaurant. rui is the true rating that the user gave the item, est is the rating estimated by the algorithm, and details holds additional information about the prediction. To calculate the error of each row, we take the absolute difference between the est and rui columns. Let's create a column called err to store the result:

In [None]:
# Calculate the error
df_knnithmeans['err'] = abs(df_knnithmeans.est - df_knnithmeans.rui)
df_knnithmeans.head()

We now move on to the SVDpp algorithm and will come back later to compare the results of the two algorithms.

**SVDpp**

The SVDpp algorithm is an extension of the SVD algorithm, which was popularized by Simon Funk during the Netflix Prize competition. In addition to the explicit rating values, SVDpp takes implicit ratings into account, that is, the mere fact that a user rated an item, regardless of the value. We will carry out the same procedure as before, changing only the algorithm from KNNWithMeans to SVDpp.
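
For reference, the SVD++ prediction as described in the Surprise documentation can be written as follows. The notation is an assumption made here for illustration: $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the user and item factor vectors, $I_u$ is the set of items rated by user $u$, and $y_j$ are the item factors that capture the implicit ratings:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T}\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j\right)$$

Dropping the sum over $y_j$ recovers the plain SVD prediction, which is why SVDpp can be read as SVD plus an implicit-feedback term.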

In [None]:
random.seed(114)
np.random.seed(114)

rmse_svdpp = []  # one RMSE per fold

for trainset, testset in kf.split(data):
    # Train and test the algorithm with SVDpp
    algo_svdpp.fit(trainset)
    predictions_svdpp = algo_svdpp.test(testset)

    # Compute and print the Root Mean Squared Error of this fold
    rmse_svdpp.append(accuracy.rmse(predictions_svdpp, verbose=True))

    # Save the predictions and the algorithm (overwritten on each fold)
    dump.dump('./dump_SVDpp', predictions_svdpp, algo_svdpp)

Let's print out the average RMSE of this algorithm over all folds:

In [None]:
print("Average RMSE of the CV is: ", np.mean(rmse_svdpp))

This is an improvement over the KNNWithMeans result.

Let's also load the dumped file and create a DataFrame with the results. We will also calculate the error of the predictions:

In [None]:
# Load the dump file
predictions_svdpp, algo_svdpp = dump.load('./dump_SVDpp')

df_svdpp = pd.DataFrame(predictions_svdpp, columns=['uid', 'iid', 'rui', 'est', 'details'])

# Calculate the error
df_svdpp['err'] = abs(df_svdpp.est - df_svdpp.rui)
df_svdpp.head()

Although SVDpp performs better than KNNWithMeans overall in this case, it is worth comparing the two algorithms to find out where each one outperforms the other.

***Comparison and interpretations***

We can get the worst predictions of each algorithm by sorting on the err column. Let's first get the 10 worst predictions of df_knnithmeans:

In [None]:
df_knnithmeans.sort_values(by='err')[-10:]

We will do the same for df_svdpp to get the worst 10 predictions:

In [None]:
df_svdpp.sort_values(by='err')[-10:]

The output lists the 10 worst predictions for df_svdpp. Compared to the KNNWithMeans table, the errors are lower: the worst prediction error is 1.79, compared to 2.00 for KNNWithMeans.

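We can show the overall distribution of the prediction errors with seaborn. We will construct a 1 x 2 plot, with the prediction errors of df_svdpp on the left and those of df_knnithmeans on the right. The code uses histplot with a density curve, since distplot is deprecated in recent seaborn releases:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(10, 8))
sns.histplot(df_svdpp.err, kde=True, stat='density', ax=ax[0])
sns.histplot(df_knnithmeans.err, kde=True, stat='density', ax=ax[1])
ax[0].set_title('SVDpp')
ax[1].set_title('KNNWithMeans')
plt.show()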

Both error distributions are right-skewed, with most errors close to zero. So, what happens when a user has only a small number of ratings? Remember that we had a mean rating of 1.19. Let's look at users with fewer than 5 ratings. The following function, taken from the Surprise documentation, returns the number of items rated by a given user in the training set:

In [None]:
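def get_Iu(uid):
    """Return the number of items rated by the given user in trainset."""
    try:
        # trainset is the training set of the last fold of the preceding loop
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError:  # user was not part of the trainset
        return 0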
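
The quantity computed by get_Iu can also be illustrated without Surprise. In this small sketch, the helper `count_items_per_user` and the toy triples are hypothetical; a `Counter` returns 0 for unknown users, which mirrors the fallback for users that are not part of the training set:

```python
from collections import Counter


def count_items_per_user(ratings):
    """Map each user ID to the number of ratings they have in `ratings`."""
    return Counter(uid for uid, _iid, _rating in ratings)


# Hypothetical (userID, placeID, rating) triples
toy_train = [("U1", "P1", 2), ("U1", "P2", 1), ("U2", "P1", 0)]
items_per_user = count_items_per_user(toy_train)
print(items_per_user["U1"], items_per_user["U3"])
```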

We will use this function to calculate the number of rated items per user for both df_knnithmeans and df_svdpp:

In [None]:
df_knnithmeans['Iu'] = df_knnithmeans.uid.apply(get_Iu)
df_svdpp['Iu'] = df_svdpp.uid.apply(get_Iu)

We can now compare the errors of the two DataFrames for users with fewer than 5 ratings and see which of the algorithms performs better. We will use the mean error here:

In [None]:
# Wrap both means in one tuple so that the notebook displays them together
(df_knnithmeans[df_knnithmeans.Iu < 5].err.mean(),
 df_svdpp[df_svdpp.Iu < 5].err.mean())

The two values are the mean errors for df_knnithmeans and df_svdpp, respectively. It seems that KNNWithMeans performs much better than SVDpp when users have fewer ratings.
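
The same filter-then-average logic can be sketched without pandas. The helper `mean_error_below` and the toy records are hypothetical and are not taken from the dataset:

```python
from statistics import mean


def mean_error_below(records, max_ratings=5):
    """Mean absolute error over predictions whose user has fewer than
    `max_ratings` ratings in the training set; None if there are none."""
    errors = [abs(est - true) for true, est, n_rated in records if n_rated < max_ratings]
    return mean(errors) if errors else None


# Hypothetical (true_rating, estimate, Iu) triples
toy_records = [(2, 1.0, 3), (1, 1.5, 4), (0, 0.25, 12), (2, 2.0, 1)]
sparse_user_error = mean_error_below(toy_records)
print(sparse_user_error)
```

Varying `max_ratings` is a simple way to check whether the gap between the two algorithms depends on the chosen threshold.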

In the next section, we will cover location-based (LB) recommenders. 

***Location-based recommenders***

LB recommenders include an explicit location component to provide more relevant recommendations based on the location of users or items. We will build a simple LB recommender using the k-means clustering technique we covered in Chapter 4, Making Sense of Humongous Location Datasets.

We will first fit k-means on a small sample of df and get the cluster labels. Let's also print the number of clusters, k, together with silhouette_score, which is the metric we will use to assess the clustering:

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=24, init='k-means++')
X_sample = df[['longitude', 'latitude']].sample(frac=0.1)
kmeans.fit(X_sample)
y = kmeans.labels_

print("k = 24", " silhouette_score ", silhouette_score(X_sample, y, metric='euclidean'))

Here, k = 24 and silhouette_score is 0.5461956922155007. Because neither the sample nor k-means is seeded, the exact value will vary between runs.
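
To make the metric less opaque, here is a minimal pure-Python sketch of the mean silhouette coefficient with Euclidean distance. For each point, a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster; the score is (b - a) / max(a, b). The function name and the toy points are assumptions for illustration, and scikit-learn's implementation should be preferred in practice:

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_score = silhouette(toy_points, [0, 0, 1, 1])  # compact, well-separated
bad_score = silhouette(toy_points, [0, 1, 0, 1])   # clusters mixed together
print(good_score > bad_score)
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned points.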

Now, we will predict the cluster of every row in the whole dataset and store it in a new column, df['cluster']. Let's also look at a sample of 10 rows from df:

In [None]:
df['cluster'] = kmeans.predict(df[['longitude','latitude']])
df[['userID','latitude','longitude','placeID','cluster']].sample(10)

We will use top-rated venues for our recommendations. This could be changed to, for example, the number of ratings per restaurant, or tailored further, for instance by cuisine. Let's get the highest-rated restaurants. We sort not only by rating but also by food_rating and service_rating, since many restaurants have a rating of 2. We call the result topvenues_df:

In [None]:
topvenues_df = df.sort_values(by=['rating', 'food_rating', 'service_rating'], ascending=False)

We create a simple function that receives a df as well as a longitude and a latitude. The function first predicts the cluster of the coordinates provided. It then filters the df passed in, here topvenues_df, to this cluster and takes the name of the first, that is, the top-rated, restaurant. Finally, the function prints the recommendation:

In [None]:
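def recommend_restaurants(df, longitude, latitude):
    """Print the top-rated restaurant in the cluster of the given location.

    df is expected to be sorted with the best-rated venues first.
    """
    # Predict the cluster for the longitude and latitude provided
    cluster = kmeans.predict(np.array([longitude, latitude]).reshape(1, -1))[0]

    # Get the best restaurant in this cluster (the first row, as df is sorted)
    name = df[df['cluster'] == cluster].iloc[0]['name']

    print('"{}" is recommended'.format(name))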
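
The logic of this function can be sketched without scikit-learn or pandas: pick the cluster whose centroid is closest to the query location, then return the first venue of a best-first list that belongs to that cluster. All names and values below are hypothetical. Unlike the function above, this sketch returns the name and yields None when the nearest cluster contains no venue:

```python
import math


def recommend_from_clusters(centroids, ranked_venues, longitude, latitude):
    """Return the first venue in `ranked_venues` (best first) that lies in
    the cluster whose centroid is closest to the given coordinates."""
    nearest = min(
        range(len(centroids)),
        key=lambda idx: math.dist(centroids[idx], (longitude, latitude)),
    )
    for name, cluster in ranked_venues:
        if cluster == nearest:
            return name
    return None  # no venue falls in the nearest cluster


# Hypothetical centroids as (longitude, latitude) and venues as (name, cluster)
toy_centroids = [(-99.1, 23.7), (-100.9, 22.1)]
toy_ranked = [("Venue A", 1), ("Venue B", 0), ("Venue C", 1)]

pick_south = recommend_from_clusters(toy_centroids, toy_ranked, -100.94, 22.15)
pick_north = recommend_from_clusters(toy_centroids, toy_ranked, -99.15, 23.73)
print(pick_south, pick_north)
```

Returning the name instead of only printing it makes it easier to pass the recommendation on to other functions.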

Let's also create a function that plots a Folium map with both the user location and the restaurant location. This function takes df, the user's coordinates, and the name of the restaurant recommended by the preceding function:

In [None]:
def create_folium_map(df, user_coords, restaurant_name):
    # Folium expects locations as [latitude, longitude]
    m = folium.Map(
        location=user_coords,
        zoom_start=10,
        tiles='Stamen Terrain'
    )

    # Marker for the user
    folium.Marker(
        location=user_coords,
        popup='User Location',
        icon=folium.Icon(icon='cloud')
    ).add_to(m)

    # Marker for the first restaurant with the given name
    folium.Marker(
        location=list(df[df['name'] == restaurant_name][['latitude', 'longitude']].iloc[0]),
        popup='Restaurant Location',
        icon=folium.Icon(color='red', icon='info-sign')
    ).add_to(m)

    return m
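
Note that recommend_restaurants takes its arguments as (longitude, latitude), whereas Folium expects locations as [latitude, longitude]. A small guard can help avoid mixing up the two orders. The helper `to_folium_location` is a hypothetical convenience and not part of Folium:

```python
def to_folium_location(longitude, latitude):
    """Convert (longitude, latitude) to the [latitude, longitude] list order."""
    if not -180 <= longitude <= 180:
        raise ValueError("longitude out of range: {}".format(longitude))
    if not -90 <= latitude <= 90:
        raise ValueError("latitude out of range: {}".format(latitude))
    return [latitude, longitude]


location = to_folium_location(-100.939752, 22.150849)
print(location)
```

For locations like the ones used in this section, swapping the two arguments by mistake raises a ValueError, because a value around -100 is not a valid latitude.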

Let's use the recommend_restaurants function to recommend a restaurant for a given location:

In [None]:
recommend_restaurants(topvenues_df, -99.145185, 23.730925)

Here is another example with different locations:

In [None]:
recommend_restaurants(topvenues_df, -100.939752, 22.150849)

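We now use the create_folium_map function to display the user location and the restaurant location on a map for the last example, which recommended "Rincon Huasteco". Let's first create variables that hold user_coords and restaurant_name. Note that Folium expects coordinates in [latitude, longitude] order, the reverse of the argument order of recommend_restaurants: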

In [None]:
user_coords = [22.120849, -100.839752]
restaurant_name = "Rincon Huasteco"

Now, let's pass these variables to the create_folium_map function:

In [None]:
create_folium_map(df, user_coords, restaurant_name)

Congratulations! This is just the beginning of your journey in geospatial data science. While reading this book, you have been introduced to a broad and essential range of geospatial Python libraries, as well as real-world applications. I hope that this will be your inspiration to continue learning and working on the vast array of geospatial data science projects out there.