# **IBM Capstone Project**
Travel Agency Tour Recommendation

Levan Gvalia



**Introduction**

Since Georgian citizens gained visa-free entry to the EU, and with the arrival of cheap and popular
airlines, travel abroad has become far more accessible to the general public than it ever was. As an analyst at
a Travel Agency, I clearly see the result of visa-free travel and cheap airlines: more people favor
cheap and frequent trips.

The Travel Agency has focused on more expensive tours with client-tailored tour recommendations,
based on information that employees gathered manually through online searches and word of
mouth. The problem is that, with the recent changes, employees can't keep up with requests for cheaper
and more frequent travel, which causes the client churn rate to skyrocket. As the company is not willing to give
up its main advantage over the competition (client-tailored tour recommendations) or to miss the
opportunity offered by cheap and frequent flights, a solution has to be found.

**Business Problem**

This is where I come in: I plan to use Machine Learning and Location Data to cluster neighborhoods
by their venues on my own, a process that previously took several employees several
days. The scope of the project is to prove the viability of the proposed tool on one popular travel
destination, Barcelona. If I am able to cluster its neighborhoods appropriately, management will
approve the tool, which will then be used on other travel destinations.

**Data**

A combination of several sources will be the input data for the project:
1. Neighborhoods and PostCodes of Barcelona - will be collected manually and imported as a data source into the project
2. Latitude and Longitude of PostCodes - will be collected through the arcgis provider of the geocoder package
3. Venue Data of neighborhoods - the Foursquare API will be used to collect Points of Interest in proximity to each neighborhood's location

1. Neighborhoods and PostCodes of Barcelona 

I will use the pandas read_excel function to import a local file of Barcelona neighborhood postal codes.

In [1]:
import pandas as pd

In [2]:
file = "../input/capstoneproject/Barcelona Neighbourhoods v2.xlsx"
df = pd.read_excel(file)
df.head()

Unnamed: 0,PostCode,Neighborhood
0,8001,el Raval
1,8002,el Gòtic
2,8003,La Barceloneta
3,8004,el Poble-sec
4,8005,el Poblenou


As I am interested in the more central parts of Barcelona, I will remove the outskirts.

In [3]:
df = df[~df['PostCode'].isin([8042, 8040,8039,8035,8033,8017])]


In [4]:
df.head()

Unnamed: 0,PostCode,Neighborhood
0,8001,el Raval
1,8002,el Gòtic
2,8003,La Barceloneta
3,8004,el Poble-sec
4,8005,el Poblenou


2. Latitude and Longitude of PostCodes

The geocoder package will be used to determine latitudes and longitudes from the postal codes of Barcelona.

In [5]:
import numpy as np

In [6]:
# for latitude and longitude of neighborhoods
!pip install geocoder
import geocoder 

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.


To get the lat-long of a postal code I need the call geocoder.arcgis('{}, Barcelona'.format(PostCode)), so I define a get_latlon(PostCode) function around it.

In [7]:
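#define function for lat long fetching
def get_latlon(PostCode):
    lat_lng_coords = None
    # keep asking until the geocoder returns coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Barcelona'.format(PostCode))
        lat_lng_coords = g.latlng
    # return only after the loop, so a failed lookup is actually retried
    return lat_lng_coords 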
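
A loop that retries until the geocoder answers can run forever if a postal code never resolves. The following is a minimal sketch of a bounded retry, assuming a hypothetical helper named fetch_with_retries that is not part of the geocoder package; it accepts any zero-argument callable, so it can be tried without network access.

```python
import time

def fetch_with_retries(fetch, max_attempts=5, delay_seconds=1.0):
    """Call fetch() until it returns a non-None value or attempts run out.

    `fetch` is any zero-argument callable. This helper is hypothetical
    and only illustrates the idea of a bounded retry.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        # wait before the next attempt, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    # every attempt failed; the caller decides how to handle the gap
    return None
```

With the geocoder package it could be called as fetch_with_retries(lambda: geocoder.arcgis('8001, Barcelona').latlng), and a None result would mark a postal code that needs manual attention.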

The only thing remaining is to iterate the defined function over the postal codes.

In [8]:
#fetching lat long data
latlog = []
    
for i in df['PostCode']:
    a = get_latlon(i)
    latlog.append(a)
    


I'll merge the fetched location data into my main dataframe.

In [9]:
latlog = np.asarray(latlog)
df['Latitude'] = latlog[:,0]
df['Longitude'] = latlog[:,1]
#df_group.drop(['latitude','longitude'],axis=1,inplace=True)

In [10]:
df.head()

Unnamed: 0,PostCode,Neighborhood,Latitude,Longitude
0,8001,el Raval,41.380145,2.168721
1,8002,el Gòtic,41.38218,2.176718
2,8003,La Barceloneta,41.383205,2.18788
3,8004,el Poble-sec,41.370415,2.159972
4,8005,el Poblenou,41.396235,2.201622


Let's plot the locations on a map for a clearer view.

In [11]:
#library for map visualization
import folium 

In [12]:
Barcelona_latitude = 41.3851
Barcelona_longitude = 2.1734

In [13]:
# create map
Barca_map = folium.Map(location=[Barcelona_latitude, Barcelona_longitude], zoom_start=12.5)


# add markers to the map
markers_colors = []
for lat, lon, zipcode in zip(df['Latitude'], df['Longitude'], df['PostCode']):
    label = folium.Popup(str(zipcode) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label
      ).add_to(Barca_map)
       
Barca_map

3. Venue Data of neighborhoods

To get venue data by lat-long, I'm using the Foursquare API.

In [14]:
import requests #HTTP client for calling the Foursquare API, which returns JSON
from pandas import json_normalize #flatten JSON into a dataframe (pandas >= 1.0)

The Foursquare credentials are hidden; placeholders are shown instead.

In [15]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20200420' # Foursquare API version
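
One way to keep credentials out of the notebook itself is to read them from environment variables. The sketch below assumes two hypothetical variable names, FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET, and a hypothetical helper load_foursquare_credentials; neither is prescribed by Foursquare.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are assumptions chosen for this sketch.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    # fail early with a clear message instead of sending an empty key
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret
```

The two values can then be assigned with CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(), so the shared notebook never contains the keys.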

A limit of 100 venues within a radius of 500 meters seems appropriate.

In [16]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

I'll define a function that fetches venue data with a requests.get call.

Venue names and location data, but most importantly venue categories, will be fetched.

In [17]:
#define function to get POIs
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

I just have to run the defined function for all postal codes.

In [18]:
#get venue categories for neighborhoods
Barcelona_venues = getNearbyVenues(names=df['Neighborhood'],#[:14], #temp slicer                                   
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

el Raval


KeyError: 'groups'
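
The KeyError above means that the JSON returned for the first neighborhood had a 'response' entry without a 'groups' key, presumably because the API did not accept the request. A minimal sketch of more defensive parsing follows; extract_venue_rows is a hypothetical helper, and the payload shape is taken only from the way the function above indexes the response.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Turn one explore-style JSON payload into rows of venue data.

    Returns an empty list when the payload carries no 'groups' entry,
    instead of raising KeyError.
    """
    groups = payload.get('response', {}).get('groups') or []
    # like the notebook, only the first group is used
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # a venue without categories gets None rather than an IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows
```

A neighborhood that yields an empty list can then be logged and retried later rather than stopping the whole run.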

In [None]:
Barcelona_venues.head()

**Methodology**

Simple exploratory analysis of venue counts

In [None]:
Barcelona_venues.groupby('Neighborhood').Venue.count().reset_index().sort_values(by='Venue',ascending=False)

As seen above, more than half of the neighborhoods have more than 50 venues listed. I therefore plan to rank the venue categories within each neighborhood by frequency and consider only the top 10 categories per neighborhood, to make the data more manageable.

This kind of data seems a perfect fit for k-means unsupervised clustering, which groups the most similar neighborhoods by their venue categories. To find an appropriate number of clusters I'll use the elbow method. For these machine learning calculations I'll use the scikit-learn package, which provides ready-made implementations of such algorithms.

Prepare for Analysis

To be able to cluster neighborhoods, I first have to prepare the data for analysis. For that I'll use one-hot encoding, making sure that venue categories end up in columns and neighborhoods in rows.

In [None]:
# one hot encoding
Barcelona_onehot = pd.get_dummies(Barcelona_venues[['Venue Category']], prefix="", prefix_sep="")
Barcelona_onehot['Neighborhood'] = Barcelona_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Barcelona_onehot.columns[-1]] + list(Barcelona_onehot.columns[:-1])
Barcelona_onehot = Barcelona_onehot[fixed_columns]

Barcelona_onehot.head()

Then I'll group the rows by neighborhood and take the mean of each category column, which gives the share of each venue category within a neighborhood.

In [None]:
Barcelona_groupby = Barcelona_onehot.groupby('Neighborhood').mean().reset_index()
Barcelona_groupby.head()
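
Averaging one-hot rows per neighborhood amounts to computing the share of each venue category among that neighborhood's venues. The sketch below reproduces the idea with the standard library on made-up toy data; category_frequencies is a hypothetical helper. Unlike the dataframe version, it omits categories that do not occur in a neighborhood instead of storing 0 for them.

```python
from collections import Counter, defaultdict

def category_frequencies(pairs):
    """Map each neighborhood to {category: share of its venues}.

    `pairs` is an iterable of (neighborhood, venue_category) tuples.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for neighborhood, counter in counts.items()
    }

# made-up toy data, not taken from the Foursquare results
toy = [('A', 'Bar'), ('A', 'Bar'), ('A', 'Hotel'), ('A', 'Cafe'), ('B', 'Hotel')]
print(category_frequencies(toy))
# {'A': {'Bar': 0.5, 'Hotel': 0.25, 'Cafe': 0.25}, 'B': {'Hotel': 1.0}}
```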

Then I'll define a function to get the top 10 venue categories for each neighborhood.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions past the third fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Barcelona_groupby['Neighborhood']

for ind in np.arange(Barcelona_groupby.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Barcelona_groupby.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
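
The suffix lookup above is sufficient for ten columns, but it would label position 21 as "21th". If more columns were ever needed, a small helper along these lines could build the names; ordinal is a hypothetical function introduced only for this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# same column names as in the cell above, built for the top 10 positions
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
```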

Analysis

To find an appropriate number of clusters for k-means, I'll use the elbow method.

In [None]:
#libraries for clustering
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist

#library for plotting
import matplotlib.pyplot as plt

In [None]:
Barcelona_clustering = Barcelona_groupby.drop('Neighborhood', axis=1)

In [None]:
distortions = []
K = range(1,10)
for k in K:
    # fit once per candidate k; random_state keeps the curve reproducible
    kmeanModel = KMeans(n_clusters=k, random_state=0).fit(Barcelona_clustering)
    # distortion: mean distance from each neighborhood to its nearest cluster center
    distortions.append(sum(np.min(cdist(Barcelona_clustering, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / Barcelona_clustering.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
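
The distortion plotted here is the mean Euclidean distance from each point to its nearest cluster center. The sketch below computes the same quantity with the standard library on made-up toy points, to show why the curve drops sharply once k matches the number of natural groups; the points and centers are chosen by hand for illustration, not fitted by k-means.

```python
import math

def distortion(points, centers):
    """Mean Euclidean distance from each point to its nearest center."""
    def dist(p, c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
    return sum(min(dist(p, c) for c in centers) for p in points) / len(points)

# made-up toy points forming two tight groups
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(distortion(points, one_center))   # about 5.1
print(distortion(points, two_centers))  # 1.0
```

Adding further centers to this toy set would lower the value only slightly, and that flattening is the "elbow" the plot is read for.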

According to the elbow method, the optimal k can be set at 6.

K-means

In [None]:
# set number of clusters
kclusters = 6

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Barcelona_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

Cluster labels will be added back to the dataframe; the neighborhood names were removed earlier so as not to interfere with the k-means algorithm.

In [None]:
#neighborhoods_venues_sorted = neighborhoods_venues_sorted.drop('Cluster Labels',axis=1)

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
Barcelona_merged = df#[:14] #temp slicer
Barcelona_merged = Barcelona_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')


Barcelona_merged.head() 

**Results** & **Discussion**

Now that we have results, we have to dive into them to draw conclusions.
First I'll start with a visualization and then check the results by cluster label.

Visualization

For the visualization I'll use a folium map, which gives a clear view of the results.

In [None]:
#libraries for visualization
import matplotlib.cm as cm
import matplotlib.colors as colors

I'll start by zooming in on Barcelona and highlighting each cluster in its own color.

In [None]:
# create map
map_clusters = folium.Map(location=[Barcelona_latitude, Barcelona_longitude], zoom_start=12.5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Barcelona_merged['Latitude'], Barcelona_merged['Longitude'], Barcelona_merged['Neighborhood'], Barcelona_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

It seems that cluster 1 is concentrated in the uptown of Barcelona, with a few exceptions. Cluster 2 is more spread out across the city, but mostly located in the lower part of the city center and along the seaside. These two clusters occupy most of the city map.

Cluster 4 is the third most common cluster and is mostly located around the uptown of the city.

Cluster 5 is located in the surroundings of the city center.

Cluster 0 and Cluster 3 seem to be outliers in terms of map coverage, but it might be interesting to check them in more detail.

Clusters

Next I'll take a look at the clusters by their most common venue categories.
It is worth checking them one by one.

In [None]:
Barcelona_merged.loc[Barcelona_merged['Cluster Labels'] == 0, Barcelona_merged.columns[[0] + list(range(4, Barcelona_merged.shape[1]))]]

In [None]:
Barcelona_merged.loc[Barcelona_merged['Cluster Labels'] == 1, Barcelona_merged.columns[[0] + list(range(4, Barcelona_merged.shape[1]))]]

In [None]:
Barcelona_merged.loc[Barcelona_merged['Cluster Labels'] == 2, Barcelona_merged.columns[[0] + list(range(4, Barcelona_merged.shape[1]))]]

In [None]:
Barcelona_merged.loc[Barcelona_merged['Cluster Labels'] == 3, Barcelona_merged.columns[[0] + list(range(4, Barcelona_merged.shape[1]))]]

In [None]:
Barcelona_merged.loc[Barcelona_merged['Cluster Labels'] == 4, Barcelona_merged.columns[[0] + list(range(4, Barcelona_merged.shape[1]))]]

In [None]:
Barcelona_merged.loc[Barcelona_merged['Cluster Labels'] == 5, Barcelona_merged.columns[[0] + list(range(4, Barcelona_merged.shape[1]))]]

It is already possible to make assumptions from this view, but I prefer to add sorted counts of venue categories per cluster, which should give a clearer view.

To add the sorted view of venue categories, I merge the cluster data with the venue list and count the venues per cluster and venue category.

In [None]:
Cluster_data = neighborhoods_venues_sorted[['Neighborhood','Cluster Labels']].groupby(['Neighborhood','Cluster Labels']).count()
Cluster_data.head()

In [None]:
Clusters_Merged = Cluster_data.join(Barcelona_venues.set_index('Neighborhood'), on='Neighborhood')
Clusters_Merged.head()

In [None]:
Clusters_sorted = Clusters_Merged.groupby(['Cluster Labels', 'Venue Category']).Venue.count().reset_index().sort_values(by=['Cluster Labels','Venue'],ascending=False)
Clusters_sorted.head()

In [None]:
Clusters_sorted[Clusters_sorted['Cluster Labels']==1].head(5)

In [None]:
Clusters_sorted[Clusters_sorted['Cluster Labels']==2].head(5)

In [None]:
Clusters_sorted[Clusters_sorted['Cluster Labels']==3].head(5)

In [None]:
Clusters_sorted[Clusters_sorted['Cluster Labels']==4].head(5)

In [None]:
Clusters_sorted[Clusters_sorted['Cluster Labels']==5].head(5)

Now we are ready to discuss the results and draw conclusions.

Sorting the venue categories strengthened my opinions about the clusters:
1. Cluster 1 has a higher concentration of hotels/hostels, which should be useful information for a tourist who is searching for hotels or trying to avoid places with a high concentration of them.
2. Cluster 2 seems to be concentrated around Tapas restaurants, a specialty of Barcelona and of Spain itself. It should be an interesting location for tourists to check out while traveling to Barcelona.
3. Cluster 4 seems to be focused more on international food than on local cuisine. This might interest tourists who want to taste other foods after trying out the local cuisine.
4. Cluster 5 does not seem to be of particular interest, as it does not show any trend toward particular places. This cluster might be skipped altogether.
5. Cluster 0 and Cluster 3 seem to be concentrated on sport activities, for those who want to take a break from tasting local or international cuisine and burn a few calories on the way, or just relax.

**Conclusion**

As I mentioned in the introduction, the whole point of this project is to prove that machine learning can automate location/neighborhood recommendations for customers. All the steps executed required little human intervention, and the method can be used for other locations too.

There is room for improvement, of course. Other machine learning algorithms can be added to get better results, or the algorithms can be tailored to individual customers or specific groups who are more interested in particular areas of tourism.

Some manual steps can be reduced as well. As I used a local file for Barcelona districts, web scraping can be incorporated in this part. Paid features of the Foursquare API can also be used to enrich the venue data to better fit our needs.

The decision should be made by stakeholders, but to me this method has no competition. It still won't be easy to prove this to stakeholders, but with some other locations and the improvements mentioned above, success should be within reach.