# Applied Data Science Capstone Project
## Final Project of [IBM's Data Science Professional Certificate Course](https://www.coursera.org/professional-certificates/ibm-data-science)
## Part 2:
## A restaurant recommendation system for the city of Curitiba, Brazil

First, let's import all the libraries needed

In [1]:
import pandas as pd
import numpy as np 
import geocoder
import requests 
import folium

from pandas.io.html import read_html
from sklearn import preprocessing


## Data acquisition

Now we define a function that turns one Wikipedia table into a dataframe of the neighborhoods in a borough

In [2]:
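def getBoroughDataframe(wikitable):
    # The table title sits in the first cell; the borough name lies between "- " and "(IBGE-"
    tableTitle = wikitable.iloc[0,0]
    boroughName = tableTitle[(tableTitle.index("- ") + 2):tableTitle.index("(IBGE-")]

    # Drop the "Regional " prefix when present
    if 'Regional' in boroughName:
        boroughName = boroughName.replace('Regional ','')

    # Discard the first three rows so only the neighborhood rows remain
    df = wikitable.drop([0,1,2]).reset_index(drop = True)
    
    df['Borough'] = boroughName
    df['Neighborhood'] = df[0]
    # Dividing the raw value by 100 gives the area in km²
    df['Area'] = pd.to_numeric(df[1], downcast="float")/100 

    return df[['Borough', 'Neighborhood', 'Area']]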
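The title parsing is the most fragile step of the function above, so it helps to see it in isolation. The following is a minimal sketch, not the notebook's own code: `parse_borough_name` is a hypothetical helper, the example title is invented for illustration, and the `strip()` call is an addition that removes the space left before `(IBGE-`.

```python
def parse_borough_name(table_title):
    """Extract the borough name found between '- ' and '(IBGE-' in a table title."""
    start = table_title.index("- ") + 2
    end = table_title.index("(IBGE-")
    # strip() removes the space left before "(IBGE-"; the original function keeps it
    name = table_title[start:end].strip()
    return name.replace("Regional ", "")


# Hypothetical title, invented for illustration only
example_title = "Bairros - Regional Boa Vista (IBGE-2010)"
borough = parse_borough_name(example_title)
print(borough)
```

Note that `str.index` raises `ValueError` when either marker is missing, so a table with a differently formatted title would stop the loop rather than be skipped silently.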

Then we can build a dataset of all of Curitiba's neighborhoods, together with their respective borough (region) and area in km²

In [3]:
# Get a list of wiki tables from the following link 
page = 'https://pt.wikipedia.org/wiki/Lista_de_bairros_de_Curitiba'
wikitables = read_html(page,  attrs = {"class":"wikitable"})

# For each borough, build a sub dataframe of its neighborhoods and concatenate them all
df_curitiba = pd.concat([getBoroughDataframe(table) for table in wikitables])

# Remove any duplicate value 
df_curitiba.drop_duplicates(subset = 'Neighborhood', keep = 'last', inplace = True)

Now, let's use a geocoder to get the latitude and longitude of each neighborhood and append this information to the dataset

In [4]:
latitude = []
longitude = []

# For each neighborhood, find its coordinates and append it to the latitude and longitude lists
for neighborhood in df_curitiba['Neighborhood']:
    lat_lng_coords = None

    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Curitiba, Brasil'.format(neighborhood))
        lat_lng_coords = g.latlng

    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])

# Create new columns with the latitude and longitude lists
df_curitiba['Latitude'] = latitude
df_curitiba['Longitude'] = longitude

# reset_index returns a new dataframe, so the result must be assigned back
df_curitiba = df_curitiba.reset_index(drop = True)

# Print the dataframe 
df_curitiba.head(10)

Unnamed: 0,Borough,Neighborhood,Area,Latitude,Longitude
0,Bairro Novo,Ganchinho,11.2,-25.57523,-49.25502
1,Bairro Novo,Sitio Cercado,11.12,-25.54155,-49.26651
2,Bairro Novo,Umbará,22.469999,-25.58153,-49.28313
3,Boa Vista,Abranches,4.32,-25.37028,-49.27007
4,Boa Vista,Atuba,4.27,-25.43333,-49.23333
5,Boa Vista,Bacacheri,6.98,-25.39847,-49.23038
6,Boa Vista,Bairro Alto,7.02,-25.41102,-49.20442
7,Boa Vista,Barreirinha,3.73,-25.37337,-49.25943
8,Boa Vista,Boa Vista,5.14,-25.38704,-49.24761
9,Boa Vista,Cachoeira,3.07,-25.35376,-49.26428
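The geocoding loop above keeps retrying until coordinates come back, which never ends if the service cannot resolve a name. A bounded variant could look like the sketch below. `geocode_with_retry` and its parameters are hypothetical names, and the lookup is passed in as a callable so the example runs without network access; in the notebook it could be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out.

    lookup is any callable that returns [lat, lng] or None.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)
    return None


# Stand-in lookup that fails twice before answering, to exercise the retries
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [-25.57523, -49.25502]


coords = geocode_with_retry(flaky_lookup, "Ganchinho, Curitiba, Brasil")
print(coords)
```

A caller would then decide what to do with a `None` result, for example dropping the neighborhood or filling the coordinates in by hand.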


Now it is time to create a function that collects the list of food venues for each neighborhood using the Foursquare API, querying each of the four price tiers separately. Replace the client ID and secret with your own, since the ones shown here are no longer valid :)
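One way to swap in your own credentials without writing them into the notebook is to read them from environment variables. This is only a sketch under stated assumptions: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise."""
    try:
        # Variable names are an assumption made for this example
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a plain dict standing in for os.environ
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
```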

In [5]:
CLIENT_ID = 'IPUNF3UYRYA0XSGB4GLQP4AXLBZOFRFR2SMHDJMOLG25AV2L'
CLIENT_SECRET = 'S23LIPCBN2PZWFVZ3H5S5W1FO3OMOG2X2OVEF2QFHAG4XN1S' 
VERSION = '20180605' # Foursquare API version

def getNearbyVenues(names, latitudes, longitudes, radius):
    
    venues_list=[]

    for name, lat, lng, rad in zip(names, latitudes, longitudes, radius):

        for price in range(1,5):  
            # Create the API request URL
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&price={}&section=food'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                rad, 
                price)
            
            try:
                # Make the GET request
                results = requests.get(url).json()["response"]['groups'][0]['items']
                
                # Return only relevant information for each nearby venue
                venues_list.append([(
                    name, 
                    v['venue']['name'], 
                    v['venue']['categories'][0]['name'],
                    price) for v in results])  
            except Exception: 
                print('Error fetching Foursquare data for the neighborhood', name)    

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Venue', 
                  'Venue Category',
                  'Price']
    
    return nearby_venues
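To make the response handling easier to follow, here is a sketch that applies the same extraction to a hand-written payload. The payload mimics only the keys the function above reads (`response`, `groups`, `items`, `venue`, `name`, `categories`); `extract_venues` is a hypothetical helper. Unlike the function above, where one venue without a category raises inside the `try` and discards the whole batch for that request, this version falls back to `None`.

```python
def extract_venues(payload, neighborhood, price):
    """Flatten an explore-style payload into (neighborhood, name, category, price) rows."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, venue["name"], category, price))
    return rows


# Hand-written payload, invented for illustration only
mock_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Bakery", "categories": [{"name": "Bakery"}]}},
        {"venue": {"name": "Uncategorized Place", "categories": []}},
    ]}]}
}

rows = extract_venues(mock_payload, "Ganchinho", 1)
print(rows)
```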

Let's apply our function to each neighborhood and get the complete list of venues in the city

In [6]:
curitiba_venues = getNearbyVenues(names = df_curitiba['Neighborhood'],
                                   latitudes = df_curitiba['Latitude'],
                                   longitudes = df_curitiba['Longitude'],
                                   radius = np.sqrt(df_curitiba['Area']/(np.pi))*1000)

curitiba_venues.head(10)

Unnamed: 0,Neighborhood,Venue,Venue Category,Price
0,Ganchinho,panificadora e confeitaria tortas da vovó,Bakery,1
1,Ganchinho,Burgueira Grill Food Truck,Burger Joint,1
2,Ganchinho,Big Pao Panificadora,Bakery,1
3,Ganchinho,Burgueria Grill,Burger Joint,1
4,Ganchinho,Bistrô Lago Azul,Brazilian Restaurant,2
5,Ganchinho,Frigorífico Família Costa,Steakhouse,4
6,Ganchinho,Defumados Ganchinho,Steakhouse,4
7,Sitio Cercado,Hamburgueria Mothafocka Gourmet,Burger Joint,1
8,Sitio Cercado,Dina Pizza Expressa (Bairro Novo),Pizza Place,1
9,Sitio Cercado,Niltinho Espetinhos,Fast Food Restaurant,1
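The `radius` argument passed above treats each neighborhood as a circle with the same area as the neighborhood, then converts the radius from kilometers to meters. A small worked example of that formula, using the area listed for Ganchinho earlier (`equivalent_radius_m` is a hypothetical helper name):

```python
import math


def equivalent_radius_m(area_km2):
    """Radius in meters of a circle whose area equals area_km2."""
    return math.sqrt(area_km2 / math.pi) * 1000


# Ganchinho is listed above with an area of 11.2 km²
radius = equivalent_radius_m(11.2)
print(round(radius))  # a little under 1.9 km
```

Since real neighborhoods are not circular, this radius is only an approximation: the search area can miss corners of a neighborhood and overlap with its neighbors.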


## Pre-processing

To use our recommendation system, we apply one-hot encoding so that each restaurant category gets its own column. Then we group the rows by neighborhood, which gives the share of each category among a neighborhood's venues. It is also helpful to normalize the mean price column so that the values are easier to interpret at the end

In [7]:
# Apply one hot encoding
curitiba_onehot = pd.get_dummies(curitiba_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood and price columns back to dataframe
curitiba_onehot['Neighborhood'] = curitiba_venues['Neighborhood'] 
curitiba_onehot['Price'] = curitiba_venues['Price'] 

# Move the price column (the last one added) to the front; neighborhood becomes the first column after the groupby below
fixed_columns = [curitiba_onehot.columns[-1]] + list(curitiba_onehot.columns[:-1])
curitiba_onehot = curitiba_onehot[fixed_columns]

# Group rows by neighborhood and take the mean of the frequency of occurrence of each category
curitiba_grouped = curitiba_onehot.groupby('Neighborhood').mean().reset_index()

# Normalize the price column
prices = curitiba_grouped[['Price']].values
min_max_scaler = preprocessing.MinMaxScaler()
prices_norm = min_max_scaler.fit_transform(prices)

curitiba_grouped['Price'] = prices_norm

# Print the dataframe
curitiba_grouped.head(10)

Unnamed: 0,Neighborhood,Price,Afghan Restaurant,American Restaurant,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Brazilian Restaurant,...,Spanish Restaurant,Steakhouse,Sushi Restaurant,Swiss Restaurant,Taco Place,Tapas Restaurant,Tapiocaria,Thai Restaurant,Vegetarian / Vegan Restaurant,Wings Joint
0,Abranches,0.453333,0.0,0.0,0.0,0.0,0.04,0.0,0.2,0.08,...,0.0,0.12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ahú,0.592593,0.0,0.0,0.0,0.0,0.022222,0.0,0.088889,0.111111,...,0.0,0.088889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Alto Boqueirão,0.375758,0.0,0.036364,0.0,0.018182,0.0,0.0,0.2,0.090909,...,0.0,0.036364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Alto da Glória,0.266667,0.0,0.0,0.0,0.028571,0.028571,0.0,0.085714,0.142857,...,0.0,0.0,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Alto da XV,0.448485,0.0,0.0,0.018182,0.0,0.054545,0.0,0.072727,0.127273,...,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0
5,Atuba,0.480392,0.0,0.0,0.0,0.0,0.058824,0.0,0.117647,0.117647,...,0.0,0.044118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Augusta,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Bacacheri,0.430769,0.0,0.0,0.0,0.0,0.092308,0.0,0.138462,0.153846,...,0.0,0.030769,0.0,0.0,0.0,0.0,0.0,0.0,0.030769,0.0
8,Bairro Alto,0.36,0.0,0.0,0.0,0.0,0.02,0.0,0.22,0.06,...,0.0,0.06,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02
9,Barreirinha,0.275862,0.0,0.0,0.0,0.0,0.034483,0.0,0.275862,0.034483,...,0.0,0.068966,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
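The numbers in this table can be reproduced by hand on a tiny dataset. The sketch below uses plain Python instead of pandas and scikit-learn, with made-up rows, to show what the cell computes: the mean of a one-hot column is the share of that category in a neighborhood, and min-max scaling maps the lowest mean price to 0 and the highest to 1.

```python
from collections import Counter, defaultdict

# Made-up rows: (neighborhood, venue category, price tier)
venues = [
    ("A", "Bakery", 1), ("A", "Bakery", 1), ("A", "Steakhouse", 4),
    ("B", "Bakery", 1), ("B", "Pizza Place", 2),
    ("C", "Steakhouse", 4),
]

counts = defaultdict(Counter)
prices = defaultdict(list)
for neighborhood, category, price in venues:
    counts[neighborhood][category] += 1
    prices[neighborhood].append(price)

# Share of each category among a neighborhood's venues (mean of the one-hot columns)
frequencies = {
    n: {cat: k / sum(c.values()) for cat, k in c.items()}
    for n, c in counts.items()
}

# Mean price tier per neighborhood, then min-max scaling to [0, 1]
mean_price = {n: sum(p) / len(p) for n, p in prices.items()}
low, high = min(mean_price.values()), max(mean_price.values())
price_norm = {n: (m - low) / (high - low) for n, m in mean_price.items()}

print(frequencies["A"], price_norm)
```

Here neighborhood "B" has the lowest mean price and "C" the highest, so they end up at 0.0 and 1.0 respectively, with "A" in between.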


## Getting the results

Finally, we create a function that scores each neighborhood against an input list of desired restaurant categories and returns the neighborhoods sorted by grade, highest first

In [8]:
def createRecommendationList(restaurant_list):

    # Drop all data that is not related to restaurant categories
    curitiba_categories = curitiba_grouped.drop(columns = ['Neighborhood', 'Price'])
    columns = list(curitiba_categories.columns)

    # Build a binary vector with 1 for each category present in the input list and 0 otherwise
    matrix = [1 if restaurant_category in restaurant_list else 0 for restaurant_category in columns]

    # Multiply the curitiba categories by the binary matrix, applying the recommendation system algorithm
    result = curitiba_categories.to_numpy().dot(matrix)

    # Create a new dataset with the results of the recommendation and join it with the neighborhood data
    df_recommendation = pd.DataFrame(data=result, columns = ['Recomendation grade'])
    df_recommendation = df_recommendation.join(curitiba_grouped[['Neighborhood','Price']]).set_index('Neighborhood').sort_values(by = 'Recomendation grade', ascending = False)

    return df_recommendation
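Because the vector holds only zeros and ones, the dot product above reduces to adding up, for each neighborhood, the frequencies of the requested categories. The sketch below shows the same idea in plain Python; `recommendation_grades` is a hypothetical helper and the category shares are made up.

```python
def recommendation_grades(frequencies, wanted_categories):
    """Sum the frequencies of the wanted categories for each neighborhood."""
    grades = {
        neighborhood: sum(shares.get(cat, 0.0) for cat in wanted_categories)
        for neighborhood, shares in frequencies.items()
    }
    # Highest grade first, like sort_values(ascending=False)
    return sorted(grades.items(), key=lambda pair: pair[1], reverse=True)


# Made-up category shares per neighborhood
frequencies = {
    "A": {"Buffet": 0.10, "Bakery": 0.90},
    "B": {"Buffet": 0.05, "Vegetarian / Vegan Restaurant": 0.20, "Bakery": 0.75},
    "C": {"Steakhouse": 1.0},
}

ranking = recommendation_grades(frequencies, ["Buffet", "Vegetarian / Vegan Restaurant"])
print(ranking)
```

A consequence of this scoring is that a neighborhood scores well when any of the requested categories is common there; it does not require a single venue to match all of them.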

Let's test our function by finding the neighborhoods where we are most likely to find buffet or vegetarian/vegan restaurants. Notice that the function also returns a "Price" column, where 0.0 corresponds to the neighborhood with the cheapest restaurants on average and 1.0 to the one with the most expensive.

In [9]:
restaurant_list = ['Buffet', 'Vegetarian / Vegan Restaurant']

recommendation = createRecommendationList(restaurant_list)

recommendation.head(20)

Neighborhood,Recomendation grade,Price
Vista Alegre,0.071429,0.380952
Seminário,0.065217,0.449275
Jardim Botânico,0.04,0.386667
Bigorrilho,0.037037,0.617284
Hugo Lange,0.034483,0.597701
Bacacheri,0.030769,0.430769
Cabral,0.030303,0.424242
Mercês,0.029412,0.460784
Centro,0.022222,0.711111
Ahú,0.022222,0.592593


Lastly, we create a map to visualize the top 10 neighborhoods for buffet and vegetarian/vegan restaurants in Curitiba

In [10]:
# Append latitude and longitude, reset the index, and keep only the 10 highest-ranked neighborhoods
recommendation_lat_lon = recommendation.join(df_curitiba[['Neighborhood','Latitude','Longitude']].set_index('Neighborhood'), on = 'Neighborhood')
recommendation_lat_lon = recommendation_lat_lon.reset_index().head(10)

# Get Curitiba city coordinates
g = geocoder.arcgis('Curitiba, Brazil')
lat_lng_coords = g.latlng
curitiba_latitude = lat_lng_coords[0]
curitiba_longitude = lat_lng_coords[1]

# Create a map centered on Curitiba (named curitiba_map to avoid shadowing the built-in map)
curitiba_map = folium.Map(location=[curitiba_latitude, curitiba_longitude], zoom_start=12)

# Add markers to the map
for lat, lon, poi, grade, price in zip(recommendation_lat_lon['Latitude'], recommendation_lat_lon['Longitude'], recommendation_lat_lon['Neighborhood'], recommendation_lat_lon['Recomendation grade'], recommendation_lat_lon['Price']):
    label = folium.Popup(str(poi) + '\nGrade: ' + str(grade) + '\nPrice: ' + str(price), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(curitiba_map)

curitiba_map