**Getting Location Recommender Systems**

Recommendation systems are primarily used to predict a user's preference for, or rating of, an item. They are widely used in commercial applications, including product and service recommendations, as well as content and friendship recommendations in social media. However, recommendation systems are not limited to products on Amazon or movies on Netflix; they can also recommend locations. Location-based recommenders incorporate the location of users to provide relevant and precise recommendations. These can be point-of-interest recommendations, such as restaurants, events in nearby locations, or posts and local trends in social media. In this chapter, we will cover different recommender systems, including collaborative filtering methods and location-based recommendation methods. As a running example, we will build a restaurant recommender system using a restaurant and consumer dataset from the UCI Machine Learning Repository.

The topics covered in this chapter include the following:

- Exploratory data analysis
- Collaborative filtering recommenders
- Location-based recommendation systems

**Exploratory data analysis**

Let's start by reading the data. We will be using two files: a CSV file with ratings and a GeoJSON file with restaurants and their locations. Let's first read the ratings CSV file.

**Rating data**

This file contains the final ratings of restaurants. It has userID and placeID columns alongside the rating columns, and we can use placeID to merge it with the GeoJSON dataset of restaurants. Let's read the data in pandas and look at the first five rows:

In [None]:
import pandas as pd
# Read the ratings CSV into a DataFrame and preview the first five rows
ratings = pd.read_csv('RCdata/rating_final.csv')
ratings.head()

User ratings

We have 1,161 rating rows. In the first five rows, the first three have a rating of 2, while the last two have a rating of 1. Let's get the mean of the rating column by using the pandas .mean() function:

In [None]:
print(ratings['rating'].mean())

The output of the preceding code shows that the average rating across the entire dataset is 1.20. Let's print out how many unique userID and placeID values we have:

In [None]:
print("There are {} unique userID in the dataset".format(ratings['userID'].nunique()))
print("There are {} unique placeID in the dataset".format(ratings['placeID'].nunique()))

The output of the preceding code shows that we have 138 unique userID values and 130 unique placeID values.
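These counts also tell us how sparse the user-place rating matrix is, which matters for collaborative filtering. The following is a minimal sketch, assuming a small hypothetical table (`toy_ratings`) with the same `userID`, `placeID`, and `rating` column names; the helper `rating_density` is our own illustration and not part of pandas:

```python
import pandas as pd

# Hypothetical toy data standing in for the ratings table (not the real dataset)
toy_ratings = pd.DataFrame({
    'userID':  ['U1', 'U1', 'U2', 'U3', 'U3', 'U3'],
    'placeID': [101, 102, 101, 101, 102, 103],
    'rating':  [2, 1, 0, 2, 2, 1],
})

def rating_density(ratings):
    """Return the share of possible user-place pairs that carry a rating."""
    n_users = ratings['userID'].nunique()
    n_places = ratings['placeID'].nunique()
    # Count distinct (user, place) pairs so repeated ratings are not double counted
    n_pairs = len(ratings[['userID', 'placeID']].drop_duplicates())
    return n_pairs / (n_users * n_places)

print(round(rating_density(toy_ratings), 3))
```

Applying the same arithmetic to the figures reported above, and assuming each of the 1,161 rows is a distinct user-place pair, gives 1,161 / (138 x 130), or roughly 0.065. In other words, only about 6.5% of the possible pairs would carry a rating.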

We can also go further and look at a countplot to get the distribution of the ratings. We will use seaborn for our data visualization in this chapter:



In [None]:
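import matplotlib.pyplot as plt
import seaborn as sns
# Count how often each rating value occurs
fig, ax = plt.subplots(figsize=(12,10))
sns.countplot(x='rating', data=ratings, ax=ax)
plt.show()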

The output of the preceding code is a countplot showing the number of rows for each rating value. The plot shows that ratings in this dataset range from 0 to 2. The most common rating is 2, while more than 250 ratings are 0.
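Reading exact numbers off a bar chart is imprecise, so it can help to tabulate the same distribution. The following is a minimal sketch on a hypothetical `toy` series of ratings (not the real data); on the actual dataset, the same calls would be applied to `ratings['rating']`:

```python
import pandas as pd

# Hypothetical ratings on the same 0 to 2 scale (not the real dataset)
toy = pd.Series([2, 2, 1, 0, 2, 1, 0, 2], name='rating')

# Absolute counts per rating value, ordered by rating rather than by frequency
counts = toy.value_counts().sort_index()
# Relative frequencies per rating value
shares = toy.value_counts(normalize=True).sort_index()

summary = pd.DataFrame({'count': counts, 'share': shares.round(3)})
print(summary)
```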

**Restaurants data**

The restaurant dataset is in GeoJSON format, so we do not need to convert it into a GeoDataFrame; we can read it directly with GeoPandas. The data comes with 22 columns, so it would be difficult to display the first five rows in the way that we normally do in this book. Instead, we will read only the first two rows and transpose the output to fit it on the screen:

In [None]:
import geopandas as gpd
# Read the data as a GeoDataFrame
geoplaces = gpd.read_file('RCdata/geoplaces.geojson')
# Show the first two rows, transposed so that columns appear as rows
geoplaces.head(2).T

Here is the output of the first two rows transposed. The columns are now displayed as rows and vice versa. The data has 130 rows, matching the 130 unique placeID values in the ratings dataset.
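Matching row counts do not by themselves guarantee that both files refer to the same places, so it can be worth comparing the identifiers directly. The following is a minimal sketch using two hypothetical toy tables (`toy_ratings` and `toy_places`) that share a `placeID` column; on the real data, the same checks would be run on `ratings` and `geoplaces`:

```python
import pandas as pd

# Hypothetical toy tables (not the real dataset)
toy_ratings = pd.DataFrame({'userID': ['U1', 'U2', 'U3'],
                            'placeID': [101, 102, 104]})
toy_places = pd.DataFrame({'placeID': [101, 102, 103],
                           'name': ['A', 'B', 'C']})

rated_ids = set(toy_ratings['placeID'])
known_ids = set(toy_places['placeID'])

# Places that were rated but have no restaurant record
missing_places = sorted(rated_ids - known_ids)
# Places that have a restaurant record but no rating
unrated_places = sorted(known_ids - rated_ids)
# Each restaurant should appear only once in the places table
has_duplicate_ids = toy_places['placeID'].duplicated().any()

print(missing_places, unrated_places, has_duplicate_ids)
```

If both lists come back empty and there are no duplicate identifiers, the two tables describe the same set of places.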

Since we read the data with GeoPandas and have a geometry column, we can plot this data as a map. Let's create a clustered map with folium:

In [None]:
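import numpy as np
import folium
from folium.plugins import FastMarkerCluster
lons = geoplaces['longitude']
lats = geoplaces['latitude']
# Center the map on the mean coordinates of all restaurants
m = folium.Map(
    location=[np.mean(lats), np.mean(lons)],
    tiles='CartoDB dark_matter',
    zoom_start=6
)
# Add all restaurants as clustered markers
FastMarkerCluster(data=list(zip(lats, lons))).add_to(m)
folium.LayerControl().add_to(m)
m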

Let's explore the data further and see how many unique cities are in the dataset:

In [None]:
print("There are {} unique cities in the dataset".format(geoplaces.city.nunique()))

Let's summarize the cities in the dataset with a countplot:

In [None]:
# Display cities in countplot
fig, ax = plt.subplots(figsize=(18,12))
sns.countplot(x="city",data=geoplaces, color="grey", order=geoplaces['city'].value_counts().index, ax=ax)
plt.show()

As the plot indicates, the dataset contains many duplicated and mistyped city names that need to be cleaned up. For example, Cuernavaca also appears as cuernavaca.
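Some of these variants are hard to see on a plot, especially names that differ only by case or by a trailing space. The following is a minimal sketch on a hypothetical `toy_city` series (not the real column) that groups raw spellings under a normalized key; the normalization rule (strip whitespace, lowercase) is our own assumption:

```python
import pandas as pd

# Hypothetical city values with the kinds of variants described above
toy_city = pd.Series(
    ['Cuernavaca', 'cuernavaca', 'victoria ', 'victoria', 's.l.p.', 'slp'],
    name='city')

# repr() makes leading and trailing spaces visible in the printed values
for value, count in toy_city.value_counts().items():
    print(repr(value), count)

# Group the raw spellings by a normalized key (stripped and lowercased)
key = toy_city.str.strip().str.lower()
variants = toy_city.groupby(key).unique()

# Keys with more than one raw spelling are candidates for cleaning
suspects = variants[variants.map(len) > 1]
print(suspects)
```

Note that this only catches differences in case and whitespace. Abbreviations such as s.l.p. and slp still end up under separate keys, so they have to be grouped by hand.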

Let's fix this and replace the mistakes with the correctly spelled city names. First, we group all of the cities with misspelled names like this:

In [None]:
cuer = ['Cuernavaca', 'cuernavaca']
slp = ['s.l.p.', 'San Luis Potosi', 'san luis potosi', 'slp', 'san luis potos', 'san luis potosi ', 's.l.p']
ciudad = ['victoria ', 'victoria', 'Cd Victoria', 'Ciudad Victoria', 'Cd. Victoria']

Then, we replace each mistyped name with the correct one like this:

In [None]:
geoplaces['city'] = geoplaces['city'].replace(slp, 'San Luis Potosi')
geoplaces['city'] = geoplaces['city'].replace(ciudad, 'Ciudad Victoria')
geoplaces['city'] = geoplaces['city'].replace(cuer, 'Cuernavaca')
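As an alternative to one `replace` call per city, the variant lists can be folded into a single lookup table. The following is a minimal sketch of that idea, not the approach used in this chapter; the names `groups`, `city_map`, and `toy_city` are hypothetical, and the toy series stands in for `geoplaces['city']`:

```python
import pandas as pd

# Variant spellings from the lists above, keyed by the canonical city name
groups = {
    'Cuernavaca': ['Cuernavaca', 'cuernavaca'],
    'San Luis Potosi': ['s.l.p.', 'San Luis Potosi', 'san luis potosi', 'slp',
                        'san luis potos', 'san luis potosi ', 's.l.p'],
    'Ciudad Victoria': ['victoria ', 'victoria', 'Cd Victoria',
                        'Ciudad Victoria', 'Cd. Victoria'],
}

# Invert the groups into a flat lookup: variant spelling -> canonical name
city_map = {variant: canonical
            for canonical, variants in groups.items()
            for variant in variants}

# Hypothetical sample values, including one city that is not in the lookup
toy_city = pd.Series(['cuernavaca', 's.l.p', 'victoria ', 'Cd. Victoria',
                      'slp', 'Jiutepec'])

# Map known variants; values without a mapping keep their original spelling
cleaned = toy_city.map(city_map).fillna(toy_city)

# Values that were not covered by the lookup and may need manual review
unmapped = toy_city[~toy_city.isin(list(city_map))]

print(cleaned.tolist())
print(unmapped.tolist())
```

Keeping the mapping in one dictionary makes it easier to review and extend, and the `unmapped` check highlights any spelling that the lists did not anticipate.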

The city names are now much cleaner. Let's explore some other features of this dataset. We will display a 2 x 2 grid of countplots for restaurant alcohol service, price, accessibility, and ambience:

In [None]:
# Subplots for four countplots
fig, ax = plt.subplots(2,2, figsize=(15,12))
sns.countplot(x="alcohol",data=geoplaces, color="grey", ax=ax[0][0])
ax[0][0].set_title('Alcohol Service')
sns.countplot(x="price",data=geoplaces, color="grey", ax=ax[0][1])
ax[0][1].set_title('Price Categories')
sns.countplot(x="accessibility",data=geoplaces, color="grey", ax=ax[1][0])
ax[1][0].set_title('Accessibility Categories')
sns.countplot(x="Rambience",data=geoplaces, color="grey", ax=ax[1][1])
ax[1][1].set_title('Rambience')
#fig.subplots_adjust(hspace=0.5)
plt.tight_layout()
plt.show()

The output of the preceding code is a 2 x 2 grid of countplots. The upper-left plot shows alcohol service; as you can see, most restaurants do not serve alcohol. The upper-right plot shows price categories, where medium-priced restaurants have the highest count. The lower-left plot shows accessibility, indicating that most restaurants do not have accessibility options. Finally, the lower-right plot shows restaurant ambience, which has two categories, familiar and quiet:


Countplots for alcohol service (upper-left), price (upper-right), accessibility (lower-left), and ambience (lower-right)

We are now ready to work on the recommendation algorithm, but we first need to merge the datasets. Since both the ratings and geoplaces datasets have a placeID column, we can use it to merge them. We simply call the merged dataset df:



In [None]:
df = pd.merge(ratings, geoplaces, on='placeID')
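A merge can silently drop or duplicate rows when keys are missing or repeated, so it can be useful to let pandas check the relationship. The following is a minimal sketch on hypothetical toy tables (`toy_ratings` and `toy_places`), using the `validate` and `indicator` options of `pd.merge`; the left join is our own choice for the check and differs from the default inner join used above:

```python
import pandas as pd

# Hypothetical toy tables (not the real dataset)
toy_ratings = pd.DataFrame({'userID': ['U1', 'U1', 'U2'],
                            'placeID': [101, 102, 101],
                            'rating': [2, 1, 0]})
toy_places = pd.DataFrame({'placeID': [101, 102, 103],
                           'name': ['A', 'B', 'C']})

# validate='many_to_one' raises an error if placeID repeats in the places table;
# indicator=True adds a '_merge' column recording where each row came from
merged = pd.merge(toy_ratings, toy_places, on='placeID', how='left',
                  validate='many_to_one', indicator=True)

# Ratings whose placeID has no matching restaurant record
unmatched = merged[merged['_merge'] == 'left_only']

print(len(merged), len(unmatched))
```

If `unmatched` is empty, every rating found its restaurant and an inner join would keep the same rows. Also note that, according to the GeoPandas documentation, `pd.merge` with a plain DataFrame on the left returns a plain DataFrame; calling `merge` on the GeoDataFrame itself is the documented way to keep the result as a GeoDataFrame.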