# A good tourist city

## Introduction/Business Problem

Tourism is an important source of income for cities. Some cities are attractive because of their beaches, their buildings, their history and much more. But there are also cities that are less appealing for a vacation. In this project I will train a machine learning model that estimates how attractive a given city would be for tourists, based on data from other attractive cities.

The problem this project addresses is knowing which cities are good places to visit. The results can also help local governments make their cities more attractive.

## Data

The data used for this project will be obtained from Foursquare using the [Foursquare API](https://developer.foursquare.com/docs/places-api/). Only relevant venue categories will be requested, such as:

- Outdoors & Recreation
- Shop & Service
- Travel & Transport
    - Hotel
    - Taxi
    - Tourist Information Center
- Etc.

The various Foursquare venue categories are listed in [this link](https://developer.foursquare.com/docs/build-with-foursquare/categories/).

In addition, the cities categorized as "good tourist cities" will be chosen based on articles on the internet like [this one](https://en.wikipedia.org/wiki/List_of_cities_by_international_visitors). Venues data obtained from these cities will be fed into the machine learning algorithm to create the model.

## Methodology

There are many types of tourist destinations, as shown in [this article](https://tourismteacher.com/types-of-tourist-destinations/), and some of them could be treated as a single venue by the Foursquare API which would not be good for our model. That's the reason this project is aimed at predicting good **towns and cities**, considering these tend to have a variety of venues we can work with.

So now, the first step will be choosing the cities we'll be getting the data from. Let's start!

### Choosing good tourist cities

A good start for choosing our cities is looking for the most popular destinations in the world. According to [this Wikipedia article](https://en.wikipedia.org/wiki/List_of_cities_by_international_visitors) (on 12/14/2020), the 10 most visited cities are:

- Hong Kong, China
- Bangkok, Thailand
- London, United Kingdom
- Macau, China
- Singapore, Singapore
- Paris, France
- Dubai, United Arab Emirates
- New York City, United States
- Kuala Lumpur, Malaysia
- Istanbul, Turkey

It would be useful to visualize these cities on a world map. But first, we need the coordinates of each one. For this purpose I will query the OpenStreetMap Nominatim geocoding API.

In [1]:
import requests
import pandas as pd
import folium
import re
import json

In [2]:
def to_url_format(phrase):
    # Capitalize each word and join the words with underscores.
    phrase_words = phrase.strip().split(" ")
    return "_".join(word.capitalize() for word in phrase_words)

def find_city_coords(city, country):
    # Geocode the city with Nominatim; requests URL-encodes the query parameters.
    url = "https://nominatim.openstreetmap.org/search"
    params = {"format": "json", "city": city, "country": country}

    response = requests.get(url, params=params)

    if response.status_code == 200:
        results = response.json()
        if not results:
            return (None, None)  # no match found

        # Use the first match in the response.
        latitude = results[0]["lat"]
        longitude = results[0]["lon"]

        return (float(latitude), float(longitude))

    return (None, None)

In [3]:
# This list will later become a Dataframe containing data of the cities.
cities = [
    ["Hong Kong", "China"],
    ["Bangkok", "Thailand"],
    ["London", "United Kingdom"],
    ["Macau", "China"],
    ["Singapore", "Singapore"],
    ["Paris", "France"],
    ["Dubai", "United Arab Emirates"],
    ["New York City", "United States"],
    ["Kuala Lumpur", "Malaysia"],
    ["Istanbul", "Turkey"]
]

In [4]:
for row in cities:
    # Geocode each city once and append both coordinates.
    latitude, longitude = find_city_coords(row[0], row[1])
    row.extend([latitude, longitude])
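
Each lookup is a network request to a public service, and Nominatim's usage policy asks for no more than one request per second. As a minimal sketch (not part of the original notebook), a helper could geocode every city exactly once and pause between calls. The names `geocode_all` and `fake_geocode` are hypothetical, and the stand-in geocoder returns approximate, hand-typed coordinates so the example runs without network access.

```python
import time

def geocode_all(places, geocode, delay_seconds=1.0, sleep=time.sleep):
    """Return [city, country, lat, lon] rows, calling `geocode` once per place."""
    rows = []
    for index, (city, country) in enumerate(places):
        if index > 0:
            sleep(delay_seconds)  # pause between consecutive requests
        lat, lon = geocode(city, country)
        rows.append([city, country, lat, lon])
    return rows

# Hypothetical stand-in for find_city_coords, with approximate coordinates.
def fake_geocode(city, country):
    lookup = {("Paris", "France"): (48.85, 2.35)}
    return lookup.get((city, country), (None, None))

rows = geocode_all(
    [("Paris", "France"), ("Nowhere", "Atlantis")],
    fake_geocode,
    delay_seconds=0,  # no need to wait when no real request is made
)
print(rows)
```

In the notebook, `find_city_coords` could be passed as the `geocode` argument with the default one-second delay.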

In [5]:
cities = pd.DataFrame(cities, columns=['City', 'Country', 'Latitude', 'Longitude'])
cities.head()

| | City | Country | Latitude | Longitude |
| --- | --- | --- | --- | --- |
| 0 | Hong Kong | China | 22.279328 | 114.162813 |
| 1 | Bangkok | Thailand | 13.754424 | 100.49304 |
| 2 | London | United Kingdom | 51.507322 | -0.127647 |
| 3 | Macau | China | 22.189945 | 113.538045 |
| 4 | Singapore | Singapore | 1.290475 | 103.852036 |


In [6]:
world_map = folium.Map(location=[0, 0], zoom_start=1.5)

for row in cities.iterrows():
    row = row[1]
    marker = folium.Marker([row.Latitude, row.Longitude], popup=row.City, icon=folium.Icon(color='blue'))
    marker.add_to(world_map)

world_map

As we can see, most of the cities in our list are in Asia and Europe. But there are also nice tourist cities in North and South America, Africa, Russia and Australia. Let's fix that by adding some more cities to our list:

- **[Latin America](https://destinationlesstravel.com/best-latin-american-cities/)**
    - Havana, Cuba
    - Medellin, Colombia
    - Rio de Janeiro, Brazil
- **[North America](https://bigseventravel.com/2019/08/the-10-most-visited-cities-in-north-america/)**
    - Miami, USA
    - Los Angeles, USA
    - Toronto, Canada
    - Vancouver, Canada
- **[Africa](https://www.africatouroperators.org/africa/25-most-beautiful-cities-and-towns-in-africa)**
    - Cape Town, South Africa
    - Zanzibar City, Tanzania
    - Lamu, Kenya
    - Essaouira, Morocco
- **[Russia](https://www.touropia.com/best-cities-to-visit-in-russia/)**
    - Moscow, Russia
    - St Petersburg, Russia
    - Kazan, Russia
    - Yekaterinburg, Russia
- **[Australia](https://www.thrillist.com/travel/nation/australias-10-best-cities-ranked-by-an-impartial-american)**
    - Perth, Australia
    - Margaret River, Australia
    - Melbourne, Australia
    - Port Douglas, Australia

Perfect, let's add these cities to our database.

In [7]:
new_cities = [
    ["Havana", "Cuba"],
    ["Medellin", "Colombia"],
    ["Rio de Janeiro", "Brazil"],
    ["Miami", "United States"],
    ["Los Angeles", "United States"],
    ["Toronto", "Canada"],
    ["Vancouver", "Canada"],
    ["Cape Town", "South Africa"],
    ["Zanzibar City", "Tanzania"],
    ["Lamu", "Kenya"],
    ["Essaouira", "Morocco"],
    ["Moscow", "Russia"],
    ["St Petersburg", "Russia"],
    ["Kazan", "Russia"],
    ["Yekaterinburg", "Russia"],
    ["Perth", "Australia"],
    ["Margaret River", "Australia"],
    ["Melbourne", "Australia"],
    ["Port Douglas", "Australia"]
]

In [8]:
for row in new_cities:
    # Geocode each city once and append both coordinates.
    latitude, longitude = find_city_coords(row[0], row[1])
    row.extend([latitude, longitude])

In [9]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
cities = pd.concat([cities, pd.DataFrame(new_cities, columns=cities.columns)], ignore_index=True)

In [10]:
world_map = folium.Map(location=[0, 0], zoom_start=1.5)

for row in cities.iterrows():
    row = row[1]
    marker = folium.Marker([row.Latitude, row.Longitude], popup=row.City, icon=folium.Icon(color='blue'))
    marker.add_to(world_map)

world_map

Now our map is filled with nice tourist cities. I will export the data obtained to a .csv file so you can use it without running the code.

In [11]:
cities.to_csv("../data/capstone/city_coords.csv", index=False)

### Choosing bad tourist cities

Of course, our algorithm needs not only good tourist cities but also *bad* ones. Let's choose some from [this article](https://journalistontherun.com/2016/01/04/15-worst-travel-destinations/), [this one](https://www.mapquest.com/travel/the-worst-cities-to-visit-in-the-united-states/), [another one](https://www.smartertravel.com/9-boring-cities-world/) and [one more](https://leaveyourdailyhell.com/2019/10/14/most-boring-cities-in-the-world/).

- Cunnamulla, Australia
- Malabo, Equatorial Guinea
- Naples, Italy
- Potosí, Bolivia
- Flores, Indonesia
- Bratislava, Slovakia
- Mandalay, Myanmar
- Saigon, Vietnam
- Pyongyang, North Korea
- St Louis, United States
- Detroit, United States
- Oakland, United States
- Atlanta, United States
- Nagoya, Japan
- Casablanca, Morocco
- Ottawa, Canada
- Frankfurt, Germany
- Nassau, Bahamas
- Zurich, Switzerland
- Canberra, Australia
- Guayaquil, Ecuador
- Agra, India
- Brisbane, Australia
- Bucharest, Romania
- Haifa, Israel
- Mexico City, Mexico
- Oslo, Norway
- Vientiane, Laos

Now, let's do the same process again to obtain the coords for these cities.

In [12]:
boring_cities = [
    ["Cunnamulla", "Australia"],
    ["Malabo", "Equatorial Guinea"],
    ["Naples", "Italy"],
    ["Potosí", "Bolivia"],
    ["Flores", "Indonesia"],
    ["Bratislava", "Slovakia"],
    ["Mandalay", "Myanmar"],
    ["Saigon", "Vietnam"],
    ["Pyongyang", "North Korea"],
    ["St Louis", "United States"],
    ["Detroit", "United States"],
    ["Oakland", "United States"],
    ["Atlanta", "United States"],
    ["Nagoya", "Japan"],
    ["Casablanca", "Morocco"],
    ["Ottawa", "Canada"],
    ["Frankfurt", "Germany"],
    ["Nassau", "Bahamas"],
    ["Zurich", "Switzerland"],
    ["Canberra", "Australia"],
    ["Guayaquil", "Ecuador"],
    ["Agra", "India"],
    ["Brisbane", "Australia"],
    ["Bucharest", "Romania"],
    ["Haifa", "Israel"],
    ["Mexico City", "Mexico"],
    ["Oslo", "Norway"],
    ["Vientiane", "Laos"],
]

In [13]:
for row in boring_cities:
    # Geocode each city once and append both coordinates.
    latitude, longitude = find_city_coords(row[0], row[1])
    row.extend([latitude, longitude])

In [14]:
boring_cities = pd.DataFrame(boring_cities, columns=cities.columns)

In [15]:
boring_map = folium.Map(location=[0, 0], zoom_start=1.5)

for row in boring_cities.iterrows():
    row = row[1]
    marker = folium.Marker([row.Latitude, row.Longitude], popup=row.City, icon=folium.Icon(color='red'))
    marker.add_to(boring_map)

boring_map

### Merging in a single DataFrame

Now, it would be nice to have all the data we've collected so far merged in a single DataFrame. A new column will be added, indicating if the city is nice or not for tourism.

In [16]:
cities["Tourist"] = 1
boring_cities["Tourist"] = 0

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
cities = pd.concat([cities, boring_cities], ignore_index=True)

In [17]:
cities.head()

| | City | Country | Latitude | Longitude | Tourist |
| --- | --- | --- | --- | --- | --- |
| 0 | Hong Kong | China | 22.279328 | 114.162813 | 1 |
| 1 | Bangkok | Thailand | 13.754424 | 100.49304 | 1 |
| 2 | London | United Kingdom | 51.507322 | -0.127647 | 1 |
| 3 | Macau | China | 22.189945 | 113.538045 | 1 |
| 4 | Singapore | Singapore | 1.290475 | 103.852036 | 1 |


In [18]:
cities.tail()

| | City | Country | Latitude | Longitude | Tourist |
| --- | --- | --- | --- | --- | --- |
| 52 | Bucharest | Romania | 44.436141 | 26.10272 | 0 |
| 53 | Haifa | Israel | 32.819122 | 34.998386 | 0 |
| 54 | Mexico City | Mexico | 19.43263 | -99.133178 | 0 |
| 55 | Oslo | Norway | 59.91333 | 10.73897 | 0 |
| 56 | Vientiane | Laos | 17.964099 | 102.613371 | 0 |


### Obtaining venue data

The next step is obtaining venue data using the Foursquare API. I will use the `dotenv` package to keep my API credentials out of the notebook. To run this notebook yourself, save your own credentials as `FOURSQUARE_ID` and `FOURSQUARE_SECRET` in a `.env` file in the root directory.

In [19]:
# !pip install python-dotenv
from dotenv import load_dotenv
import os
load_dotenv(dotenv_path="../.env")

True

In [20]:
FOURSQUARE_VERSION = "20201216"  # API version date in YYYYMMDD format
FOURSQUARE_ID = os.getenv("FOURSQUARE_ID")
FOURSQUARE_SECRET = os.getenv("FOURSQUARE_SECRET")
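
`os.getenv` returns `None` when a variable is not set, so a missing `.env` entry would only surface later as a failed API request. A minimal sketch of an earlier check, assuming a hypothetical helper named `require_env` and a placeholder variable `EXAMPLE_API_ID` (neither is part of the original notebook):

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value

# Placeholder value set here only so the example runs on its own.
os.environ["EXAMPLE_API_ID"] = "abc123"
print(require_env("EXAMPLE_API_ID"))
```

In the notebook, the same helper could wrap the two `FOURSQUARE_*` lookups after `load_dotenv` has run.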

In [21]:
def get_category_tree(cat_list):
    categories = {}
    
    for cat in cat_list:
        if cat["categories"] != []:
            subcat = get_category_tree(cat["categories"])
            categories[cat["name"]] = subcat
        else:
            categories[cat["name"]] = {}  # leaf category: no subcategories
    
    return categories

request = requests.get(url="https://api.foursquare.com/v2/venues/categories", params={
    "client_id": FOURSQUARE_ID,
    "client_secret": FOURSQUARE_SECRET,
    "v": FOURSQUARE_VERSION
})

category_tree = get_category_tree(json.loads(request.text)["response"]["categories"])

In [22]:
def get_up_branch(category, cat_tree, name="root"):
    # Return the direct parent of `category`, or None if it is not in the tree.
    for key in cat_tree.keys():
        if key == category:
            return name
        parent = get_up_branch(category, cat_tree[key], key)
        if parent:
            return parent
    return None

def get_main_branch(category, cat_tree):
    # Walk up the tree until a top-level category is reached.
    if category in cat_tree:
        return category

    up_branch = get_up_branch(category, cat_tree)
    if up_branch is None:
        return None  # unknown category: stop instead of recursing forever
    return get_main_branch(up_branch, cat_tree)
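
To make the idea of mapping a venue's category to its top-level branch concrete, here is a small self-contained sketch. It uses a hand-written sample in the same nested `name`/`categories` shape, not the real Foursquare response, and the helper names (`build_tree`, `contains`, `top_level_category`) are hypothetical alternatives to the functions above.

```python
def build_tree(cat_list):
    """Nest categories as {name: {child_name: {...}}}."""
    return {cat["name"]: build_tree(cat.get("categories", [])) for cat in cat_list}

def contains(subtree, category):
    """Return True if `category` appears anywhere inside `subtree`."""
    return any(
        name == category or contains(children, category)
        for name, children in subtree.items()
    )

def top_level_category(category, tree):
    """Return the top-level category that `category` belongs to, or None."""
    for name, children in tree.items():
        if name == category or contains(children, category):
            return name
    return None

# Hand-written toy data for illustration only.
sample = [
    {"name": "Food", "categories": [
        {"name": "Asian Restaurant", "categories": [
            {"name": "Thai Restaurant", "categories": []},
        ]},
    ]},
    {"name": "Travel & Transport", "categories": [
        {"name": "Hotel", "categories": []},
    ]},
]

tree = build_tree(sample)
print(top_level_category("Thai Restaurant", tree))  # Food
print(top_level_category("Hotel", tree))            # Travel & Transport
print(top_level_category("Casino", tree))           # None
```

Returning `None` for an unknown name keeps the lookup from failing when a venue carries a category that is missing from the tree.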

In [23]:
def get_trending_venues(lat, lon):
    base_url = "https://api.foursquare.com/v2/venues/explore"
    params = dict(
        client_id=FOURSQUARE_ID,
        client_secret=FOURSQUARE_SECRET,
        v=FOURSQUARE_VERSION,
        ll=f'{lat},{lon}',
        limit=30
    )
    
    resp = requests.get(url=base_url, params=params)
    data = json.loads(resp.text)
    
    recommended = data["response"]["groups"][0]
    venues = []
    
    for item in recommended["items"]:
        venue_categories = item["venue"]["categories"]
        if not venue_categories:
            continue  # skip venues that come back without a category
        category = get_main_branch(venue_categories[0]["name"], category_tree)
        venues.append(category)
    
    return venues

def get_venue_dataframe(city_series):
    city = city_series["City"]
    latitude = city_series["Latitude"]
    longitude = city_series["Longitude"]
    
    venues = pd.Series(get_trending_venues(latitude,longitude)).value_counts(normalize=True)
    venues['City'] = city
    return pd.DataFrame(venues).T

In [24]:
venue_frames = []

for _, row in cities.iterrows():
    venue_frames.append(get_venue_dataframe(row))

# DataFrame.append was removed in pandas 2.0; pd.concat builds the same table.
venues = pd.concat(venue_frames)
venues = venues.fillna(0)

In [25]:
cities = pd.merge(cities, venues, on="City")  # join on the shared City column
cities.head()

| | City | Country | Latitude | Longitude | Tourist | Food | Travel & Transport | Outdoors & Recreation | Shop & Service | Nightlife Spot | Professional & Other Places | Arts & Entertainment | Residence | College & University |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Hong Kong | China | 22.279328 | 114.162813 | 1 | 0.366667 | 0.2 | 0.166667 | 0.1 | 0.066667 | 0.066667 | 0.033333 | 0.0 | 0.0 |
| 1 | Bangkok | Thailand | 13.754424 | 100.49304 | 1 | 0.4 | 0.0 | 0.166667 | 0.033333 | 0.033333 | 0.133333 | 0.233333 | 0.0 | 0.0 |
| 2 | London | United Kingdom | 51.507322 | -0.127647 | 1 | 0.433333 | 0.066667 | 0.2 | 0.066667 | 0.066667 | 0.033333 | 0.133333 | 0.0 | 0.0 |
| 3 | Macau | China | 22.189945 | 113.538045 | 1 | 0.666667 | 0.0 | 0.066667 | 0.0 | 0.066667 | 0.133333 | 0.066667 | 0.0 | 0.0 |
| 4 | Singapore | Singapore | 1.290475 | 103.852036 | 1 | 0.3 | 0.066667 | 0.1 | 0.066667 | 0.1 | 0.066667 | 0.3 | 0.0 | 0.0 |


In [26]:
cities.shape

(57, 14)

Now our dataset is complete and we can start working on our model!
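
Each venue column holds the share of a city's recommended venues that fall into that top-level category, which is what `value_counts(normalize=True)` computes in `get_venue_dataframe`. The same calculation can be sketched with the standard library alone. The function name `category_shares` and the sample venue list are hypothetical and used only for illustration.

```python
from collections import Counter

def category_shares(categories, all_categories):
    """Return the fraction of venues in each category; absent categories get 0.0."""
    counts = Counter(categories)
    total = len(categories)
    return {
        name: (counts[name] / total if total else 0.0)
        for name in all_categories
    }

# Made-up venue categories for a single city.
sample_venues = ["Food", "Food", "Arts & Entertainment", "Food", "Shop & Service"]
columns = ["Food", "Shop & Service", "Arts & Entertainment", "Residence"]

shares = category_shares(sample_venues, columns)
print(shares)
```

Because the values are shares, each row sums to 1 regardless of how many venues the API returned for that city.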

## Machine Learning Implementation

It's important to remember the objective of this project is finding out whether a city would be attractive for tourism or not. In other words, we are going to **classify** cities. There are a lot of classification algorithms, such as Support Vector Machines, Logistic Regression, Naive Bayes, and others.

As our dataset is limited to only 57 cities, I will choose an algorithm that doesn't require much data to work, such as a Support Vector Machine. If the evaluation of this model results in a low score, I'll move on to another algorithm. Also, I will use GridSearchCV to find the best hyperparameters for the model.

### SVM Classification

In [66]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [67]:
x = cities.drop(columns=["City", "Country", "Latitude", "Longitude", "Tourist"])
y = cities["Tourist"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=19)

In [214]:
hyperparams = {
    'kernel': ['linear'],  # 'poly', 'rbf' and 'sigmoid' are not searched here
    'gamma': ['scale', 'auto']  # has no effect with the linear kernel
}

svm = GridSearchCV(estimator=SVC(C=0.3, random_state=102), param_grid=hyperparams, cv=10)
svm.fit(x_train, y_train)
svm.best_params_

{'gamma': 'scale', 'kernel': 'linear'}
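
By default, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for `SVC` is accuracy, while the results in this notebook are reported as F1. The sketch below shows one possible variation, run on synthetic data from `make_classification` rather than the city dataset: it searches over `C` and two kernels, selects by F1 via `scoring="f1"`, and stratifies the train/test split. The grid values and the random seeds are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the city table: 57 rows, 9 numeric features.
X, y = make_classification(
    n_samples=57, n_features=9, n_informative=4, random_state=0
)

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

# scoring="f1" makes the search select by the metric reported afterwards.
search = GridSearchCV(SVC(), param_grid=param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_f1, 2))
```

With only a few dozen rows, the scores from any such search vary noticeably between random seeds, so a single run should be read with caution.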

In [215]:
svm_predicted = svm.predict(x_test)
f1_score(y_test, svm_predicted)

0.64

In [216]:
f1_score(y_train, svm.predict(x_train))

0.7037037037037037
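
For reference, the F1 score is the harmonic mean of precision and recall for the positive class (here, `Tourist == 1`). A minimal hand-rolled sketch on made-up labels shows the calculation, and also what a trivial classifier that labels every city as a tourist city would score on a perfectly balanced sample. The function name `f1_from_labels` and the label lists are hypothetical.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the F1 score for the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: three tourist cities and three non-tourist cities.
y_true = [1, 1, 1, 0, 0, 0]
y_model = [1, 1, 1, 1, 0, 0]      # one false positive
y_always_one = [1, 1, 1, 1, 1, 1]  # trivial baseline

print(f1_from_labels(y_true, y_model))
print(f1_from_labels(y_true, y_always_one))
```

On balanced labels the always-positive baseline has precision 0.5 and recall 1, which gives an F1 of 2/3. That makes it a useful reference point when reading F1 scores on a small test set.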

### Multinomial Naive Bayes

In [203]:
from sklearn.naive_bayes import MultinomialNB

In [204]:
nb = MultinomialNB()
nb.fit(x_train, y_train)

MultinomialNB()

In [205]:
nb_predicted = nb.predict(x_test)
f1_score(y_test, nb_predicted)

0.4210526315789474