
# Capstone Project : Little Pizza Store

## **Part 1 : The Idea**

### Problem **Description**

***Business Problem***

Buenos Aires, the capital city of Argentina, is home to people from many different cultures who live and work there.

It is a big city with a wide range of companies, from small startups to large multinationals.

Each day thousands of workers need to get lunch, and every night thousands of families need dinner.

Pizza is a very popular meal because it can be delivered easily and is easy to share among several people. It is also less expensive than other options.

Little Pizza group wants to open a new pizza store in Buenos Aires City and wants to improve the decision by using analytics to identify the most promising neighborhood.

<br>
<br>

***Analytic Approach***

As a first step, a clustering model will be built from neighborhood data to identify similar neighborhoods and group them.

The selection criteria favor neighborhoods where pizza stores are not the most popular venue and that lie closest to the areas where pizza stores are popular.
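
One possible reading of this selection rule is sketched below: keep the neighborhoods whose top venue category is not pizza, then rank them by great-circle distance to the nearest neighborhood whose top category is pizza. The neighborhood names, coordinates, and the helper names `haversine_km` and `rank_candidates` are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Hypothetical neighborhoods: name -> (lat, lng, top venue category)
hoods = {
    "North": (-34.56, -58.45, "Pizza Place"),
    "Center": (-34.60, -58.44, "Café"),
    "South": (-34.66, -58.42, "Soccer Field"),
}


def rank_candidates(hoods):
    """Rank non-pizza-led neighborhoods by distance to the nearest pizza-led one."""
    pizza_led = {n: v for n, v in hoods.items() if v[2] == "Pizza Place"}
    if not pizza_led:
        return []  # nothing to measure distance against
    ranked = []
    for name, (lat, lng, _top) in hoods.items():
        if name in pizza_led:
            continue
        nearest = min(haversine_km(lat, lng, p[0], p[1]) for p in pizza_led.values())
        ranked.append((name, round(nearest, 2)))
    return sorted(ranked, key=lambda item: item[1])


print(rank_candidates(hoods))
```

In this toy data "Center" ranks ahead of "South" because it sits closer to the pizza-led neighborhood.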
<br>
<br>

***Data description***

Two datasets are needed to obtain the features and insights the model requires.
<br>
<br>
1) City geographical data: This dataset identifies the neighborhoods and their coordinates (latitude and longitude). The neighborhood data is downloaded from a [public government site](https://data.buenosaires.gob.ar/api/files/barrios.csv/download/csv).


<img src="https://drive.google.com/file/d/14w2iroQQGpYvGywWi8XV97lYc-w2gVHK/view?usp=sharing"  alt="Neighborhood dataset">

Dataset fields:
-  WKT: The coordinates of each corner of the geographical polygon.
-  barrio: Name of the neighborhood
-  comuna: id of the borough
-  perimetro: perimeter of the neighborhood
-  area: area of the neighborhood

This dataset does not contain a single coordinate pair for each neighborhood; instead, it has all the coordinates that define the geographic polygon.

The approximate center of each neighborhood was obtained by parsing the polygon coordinates to identify the maximum and minimum latitude and longitude, and then applying:


```
center_lat = min_lat + (max_lat - min_lat)/2
center_lng = min_lng + (max_lng - min_lng)/2
```
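
As a small worked example of the formula above, the sketch below parses a WKT polygon string and returns the midpoint of its bounding box. The function name `bbox_center` and the square polygon are hypothetical; the square is not taken from the real dataset.

```python
def bbox_center(wkt):
    """Return (center_lat, center_lng) of the bounding box of a single-ring WKT POLYGON."""
    # Strip the "POLYGON" keyword and parentheses, leaving "lng lat,lng lat,..."
    body = wkt.replace("POLYGON", "").replace("(", "").replace(")", "")
    lngs, lats = [], []
    for pair in body.split(","):
        lng, lat = pair.split()  # WKT stores longitude first, then latitude
        lngs.append(float(lng))
        lats.append(float(lat))
    center_lat = min(lats) + (max(lats) - min(lats)) / 2
    center_lng = min(lngs) + (max(lngs) - min(lngs)) / 2
    return center_lat, center_lng


# Hypothetical square polygon used only for illustration
square = "POLYGON ((-58.5 -34.7,-58.3 -34.7,-58.3 -34.5,-58.5 -34.5,-58.5 -34.7))"
print(bbox_center(square))
```

The result is approximately (-34.6, -58.4), the middle of the square. For irregular shapes the bounding-box midpoint can differ from the true centroid, which is why it is described as an approximate center.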


Using the coordinates of the center, I can call the Foursquare API to retrieve the main venues in each neighborhood and then analyze them using a clustering model.
<br>
<br>


2) Venues by Neighborhood: This dataset lists the venues in each neighborhood. It will be built by calling the Foursquare API for each neighborhood.

<img src="https://drive.google.com/file/d/1PPZKHDqqsS0gyxNszSBO56BLAYfCQXbu/view?usp=sharing"  alt="Venues dataset">

Dataset fields:
-  Venue: Name of the venue
-  Venue Latitude: Latitude of the venue
-  Venue Longitude: Longitude of the venue
-  Venue Category: Category of the venue

Using the Venue Category feature, the neighborhoods can be compared with one another.
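
To make the comparison concrete, here is a minimal sketch in plain Python: each neighborhood is turned into a profile of category shares, and two profiles are compared with Euclidean distance, the same notion of closeness that k-means relies on. The venue rows and the helper names `category_profiles` and `profile_distance` are hypothetical.

```python
from collections import Counter
import math

# Hypothetical venue rows: (neighborhood, venue category)
venues = [
    ("A", "Pizza Place"), ("A", "Café"), ("A", "Pizza Place"), ("A", "Bakery"),
    ("B", "Café"), ("B", "Pizza Place"), ("B", "Bakery"), ("B", "Bakery"),
    ("C", "Soccer Field"), ("C", "Train Station"),
]


def category_profiles(rows):
    """Map each neighborhood to {category: share of its venues}."""
    counts = {}
    for hood, category in rows:
        counts.setdefault(hood, Counter())[category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


def profile_distance(p, q):
    """Euclidean distance between two profiles; a missing category counts as 0."""
    cats = set(p) | set(q)
    return math.sqrt(sum((p.get(c, 0.0) - q.get(c, 0.0)) ** 2 for c in cats))


profiles = category_profiles(venues)
print(profiles["A"]["Pizza Place"])  # 0.5
print(profile_distance(profiles["A"], profiles["B"]))
print(profile_distance(profiles["A"], profiles["C"]))
```

In this toy data, A is closer to B than to C because A and B share the same categories in different proportions, while C shares none with A.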






## **Part 2 : The Work**

### ***Imports***

In [0]:
# Required imports
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from bs4 import BeautifulSoup
import requests

import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans


### Uploading and formatting the data

The dataset is downloaded from the site of the "Ciudad Autonoma de Buenos Aires".

In [2]:
# Read the source file and apply basic cleaning
url = "https://data.buenosaires.gob.ar/api/files/barrios.csv/download/csv"
df_bsas = pd.read_csv(url, sep=',', low_memory=False, encoding="latin1")


df_bsas["polygon"] = df_bsas["WKT"].str.replace("POLYGON", "")
df_bsas["polygon"] = df_bsas["polygon"].apply(lambda x: str(x).replace("(", "").replace(")", ""))

df_bsas.head()

Unnamed: 0,WKT,barrio,comuna,perimetro,area,polygon
0,"POLYGON ((-58.4528200492791 -34.5959886570639,...",CHACARITA,15,7725.695228,3118101.0,"-58.4528200492791 -34.5959886570639,-58.45365..."
1,"POLYGON ((-58.4655768128541 -34.5965577078058,...",PATERNAL,15,7087.513295,2229829.0,"-58.4655768128541 -34.5965577078058,-58.46562..."
2,"POLYGON ((-58.4237529813037 -34.5978273383243,...",VILLA CRESPO,15,8132.699348,3613584.0,"-58.4237529813037 -34.5978273383243,-58.42495..."
3,"POLYGON ((-58.4946097568899 -34.6148652395239,...",VILLA DEL PARQUE,11,7705.389797,3399596.0,"-58.4946097568899 -34.6148652395239,-58.49478..."
4,"POLYGON ((-58.4128700313089 -34.6141162515854,...",ALMAGRO,5,8537.901368,4050752.0,"-58.4128700313089 -34.6141162515854,-58.41281..."


In [3]:
def get_data(row):
  """
  Parse the polygon field to estimate the center of the neighborhood.
  This simple bounding-box formula does not consider the Earth's curvature;
  given the small areas involved, the impact is not significant.
  """
  # Start from infinities so the result does not depend on the sign of the coordinates
  min_lat = min_lng = float("inf")
  max_lat = max_lng = float("-inf")

  coords = row["polygon"].split(",")

  for coord in coords:
    # WKT stores each point as "longitude latitude"; split() tolerates extra spaces
    c = coord.split()
    if len(c) == 2:
      lng, lat = float(c[0]), float(c[1])
      max_lng = max(max_lng, lng)
      min_lng = min(min_lng, lng)
      max_lat = max(max_lat, lat)
      min_lat = min(min_lat, lat)

  row["min_lat"] = min_lat
  row["max_lat"] = max_lat
  row["min_lng"] = min_lng
  row["max_lng"] = max_lng

  # Midpoint of the bounding box
  row["center_lat"] = min_lat + (max_lat - min_lat)/2
  row["center_lng"] = min_lng + (max_lng - min_lng)/2

  return row


# Get the center of each neighborhood polygon
df_bsas = df_bsas.apply(get_data, axis=1)
df_bsas.head()

Unnamed: 0,WKT,barrio,comuna,perimetro,area,polygon,min_lat,max_lat,min_lng,max_lng,center_lat,center_lng
0,"POLYGON ((-58.4528200492791 -34.5959886570639,...",CHACARITA,15,7725.695228,3118101.0,"-58.4528200492791 -34.5959886570639,-58.45365...",-34.597835,-34.578295,-58.466828,-58.438536,-34.588065,-58.452682
1,"POLYGON ((-58.4655768128541 -34.5965577078058,...",PATERNAL,15,7087.513295,2229829.0,"-58.4655768128541 -34.5965577078058,-58.46562...",-34.605311,-34.587445,-58.478831,-58.456236,-34.596378,-58.467534
2,"POLYGON ((-58.4237529813037 -34.5978273383243,...",VILLA CRESPO,15,8132.699348,3613584.0,"-58.4237529813037 -34.5978273383243,-58.42495...",-34.607616,-34.588668,-58.458935,-58.423367,-34.598142,-58.441151
3,"POLYGON ((-58.4946097568899 -34.6148652395239,...",VILLA DEL PARQUE,11,7705.389797,3399596.0,"-58.4946097568899 -34.6148652395239,-58.49478...",-34.615016,-34.596789,-58.506168,-58.474017,-34.605902,-58.490092
4,"POLYGON ((-58.4128700313089 -34.6141162515854,...",ALMAGRO,5,8537.901368,4050752.0,"-58.4128700313089 -34.6141162515854,-58.41281...",-34.622075,-34.597713,-58.433334,-58.411919,-34.609894,-58.422626


### **Drawing the map with the Neighborhoods**

In [4]:
# Get Lat & Lng of Buenos Aires City
address = 'Buenos Aires, AR'

# Recent geopy versions require a user_agent; the value below is a placeholder
geolocator = Nominatim(user_agent="little_pizza_store")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Buenos Aires City are {}, {}.'.format(latitude, longitude))



The geographical coordinates of Buenos Aires City are -34.6075616, -58.437076.


In [5]:
# create map of Buenos Aires using latitude and longitude values
map_bsas = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_bsas['center_lat'], df_bsas['center_lng'], df_bsas['comuna'], df_bsas['barrio']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill_color='#3186cc').add_to(map_bsas)  
    
map_bsas

### **Using the Foursquare API to get venues by neighborhood**

In [0]:
CLIENT_ID = "XXX"
CLIENT_SECRET = "XXX"
VERSION = '20180605' # Foursquare API version
LIMIT = 100
RADIUS = 500

Get venue data from Foursquare using lat and lng
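
The `get_venues` function in the next cell reads `response -> groups[0] -> items`, and for each item the venue's `name`, `location`, and first entry of `categories`. The sketch below shows that extraction on a hand-written payload that mirrors only those fields. It is not a recorded API response, and the helper name `parse_explore_items` and the "Uncategorized Venue" entry are hypothetical. It also shows one way to tolerate a venue with an empty `categories` list, assuming such venues can occur.

```python
def parse_explore_items(payload):
    """Extract (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Use None when a venue has no category instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category))
    return rows


# Hand-written payload for illustration; not a recorded API response
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {
                    "name": "El Imperio de la Pizza",
                    "location": {"lat": -34.58689, "lng": -58.454967},
                    "categories": [{"name": "Pizza Place"}]}},
                {"venue": {
                    "name": "Uncategorized Venue",
                    "location": {"lat": -34.588, "lng": -58.452},
                    "categories": []}},
            ]
        }]
    }
}

for row in parse_explore_items(sample):
    print(row)
```

Returning `None` for a missing category keeps the row count intact; whether to keep or drop such rows before one-hot encoding is a separate choice.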

In [7]:
def get_venues(name, lat, lng):
  """Return the venues Foursquare reports within RADIUS meters of (lat, lng)."""
  venues_list = []

  # build the API request; requests takes care of encoding the query parameters
  url = 'https://api.foursquare.com/v2/venues/explore'
  params = {
      'client_id': CLIENT_ID,
      'client_secret': CLIENT_SECRET,
      'v': VERSION,
      'll': '{},{}'.format(lat, lng),
      'radius': RADIUS,
      'limit': LIMIT}

  # make the GET request
  results = requests.get(url, params=params).json()["response"]['groups'][0]['items']

  # return only relevant information for each nearby venue
  venues_list.append([(
      name, 
      lat, 
      lng, 
      v['venue']['name'], 
      v['venue']['location']['lat'], 
      v['venue']['location']['lng'],  
      v['venue']['categories'][0]['name']) for v in results])
  
  return venues_list



# Get all Venues from FourSquare
all_venues = []

for index, row in df_bsas.iterrows():
  print("Processing {}".format(row["barrio"]))
  venues = get_venues(row["barrio"], row["center_lat"], row["center_lng"])
  all_venues.extend(venues)
  

# Create a DataFrame
df_venues = pd.DataFrame([item for venue_list in all_venues for item in venue_list])

df_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
  
  

Processing CHACARITA
Processing PATERNAL
Processing VILLA CRESPO
Processing VILLA DEL PARQUE
Processing ALMAGRO
Processing CABALLITO
Processing VILLA SANTA RITA
Processing MONTE CASTRO
Processing VILLA REAL
Processing FLORES
Processing FLORESTA
Processing CONSTITUCION
Processing SAN CRISTOBAL
Processing BOEDO
Processing VELEZ SARSFIELD
Processing VILLA LURO
Processing PARQUE PATRICIOS
Processing MATADEROS
Processing VILLA LUGANO
Processing SAN TELMO
Processing SAAVEDRA
Processing COGHLAN
Processing VILLA URQUIZA
Processing COLEGIALES
Processing BALVANERA
Processing VILLA GRAL. MITRE
Processing PARQUE CHAS
Processing AGRONOMIA
Processing VILLA ORTUZAR
Processing BARRACAS
Processing PARQUE AVELLANEDA
Processing PARQUE CHACABUCO
Processing NUEVA POMPEYA
Processing PALERMO
Processing VILLA RIACHUELO
Processing VILLA SOLDATI
Processing VILLA PUEYRREDON
Processing VILLA DEVOTO
Processing LINIERS
Processing VERSALLES
Processing PUERTO MADERO
Processing MONSERRAT
Processing SAN NICOLAS
Process

In [8]:
df_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,CHACARITA,-34.588065,-58.452682,El Imperio de la Pizza,-34.58689,-58.454967,Pizza Place
1,CHACARITA,-34.588065,-58.452682,Santos 4040,-34.588822,-58.449863,Theater
2,CHACARITA,-34.588065,-58.452682,Albamonte Ristorante,-34.587803,-58.453075,Argentinian Restaurant
3,CHACARITA,-34.588065,-58.452682,Fábrica de Churros Olleros,-34.586983,-58.45364,Bakery
4,CHACARITA,-34.588065,-58.452682,Pizzería Santa María,-34.587238,-58.454005,Pizza Place


In [9]:
df_venues.shape

(1035, 7)

### **Formatting Venue data**

In [10]:
# one hot encoding
df_bsas_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_bsas_onehot['Neighborhood'] = df_venues['Neighborhood'] 

	

# move neighborhood column to the first column
fixed_columns = ["Neighborhood"] + [col for col in df_bsas_onehot.columns.tolist() if col not in ["Neighborhood"]]
df_bsas_onehot = df_bsas_onehot[fixed_columns]

# Group by neighborhood and compute the mean frequency of each venue category
df_bsas_grouped = df_bsas_onehot.groupby(["Neighborhood"]).mean().reset_index()
df_bsas_grouped.head()

Unnamed: 0,Neighborhood,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Athletics & Sports,...,Toy / Game Store,Track,Train Station,Tunnel,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Veterinarian,Vietnamese Restaurant,Wine Shop,Women's Store
0,AGRONOMIA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0
1,ALMAGRO,0.0,0.0,0.15625,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,BALVANERA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,BARRACAS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,BELGRANO,0.0,0.0,0.054348,0.0,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.032609,0.0,0.0,0.0,0.0,0.0


### **Clustering Neighborhoods**

In [11]:
# set number of clusters
kclusters = 9

# pass axis by keyword; the positional form is not accepted by recent pandas versions
df_bsas_grouped_clustering = df_bsas_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_bsas_grouped_clustering)

# add clustering labels
df_bsas_grouped['Cluster Labels'] = kmeans.labels_

# merge df_bsas_grouped with df_bsas to add latitude/longitude for each neighborhood
df_bsas_merged = df_bsas.merge(df_bsas_grouped, how="left", left_on='barrio', right_on="Neighborhood")

df_bsas_merged['Cluster Labels'].value_counts()


0    26
6     7
7     5
1     5
8     1
5     1
4     1
3     1
2     1
Name: Cluster Labels, dtype: int64

In [12]:
df_bsas_merged.shape

(48, 192)

### **Drawing a map identifying the neighborhood cluster**

In [13]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_bsas_merged['center_lat'], df_bsas_merged['center_lng'], df_bsas_merged['barrio'], df_bsas_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster-1],
        fill_color=rainbow[cluster-1],
        fill_opacity=0.0).add_to(map_clusters)
       
map_clusters

In [14]:
df_bsas_merged[df_bsas_merged["Cluster Labels"]==3]

Unnamed: 0,WKT,barrio,comuna,perimetro,area,polygon,min_lat,max_lat,min_lng,max_lng,...,Track,Train Station,Tunnel,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Veterinarian,Vietnamese Restaurant,Wine Shop,Women's Store,Cluster Labels
29,"POLYGON ((-58.3703353711449 -34.6329258371189,...",BARRACAS,4,13018.210271,7961000.0,"-58.3703353711449 -34.6329258371189,-58.37027...",-34.662504,-34.626647,-58.405045,-58.367649,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3


In [72]:
df_bsas_grouped.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Lounge,Airport Terminal,American Restaurant,Arcade,Argentinian Restaurant,Art Museum,Arts & Entertainment,...,Thrift / Vintage Store,Toll Booth,Toy / Game Store,Train Station,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Veterinarian,Vietnamese Restaurant,Women's Store,Cluster Labels
0,AGRONOMIA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0
1,ALMAGRO,0.0,0.0,0.0,0.0,0.0,0.0,0.151515,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
2,BALVANERA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,BARRACAS,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
4,BELGRANO,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [15]:
num_top_venues = 5

for hood in df_bsas_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = df_bsas_grouped[df_bsas_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp[temp["venue"]!="Cluster Labels"]
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----AGRONOMIA----
           venue  freq
0   Soccer Field  0.33
1  Garden Center  0.17
2     Restaurant  0.17
3         Tunnel  0.17
4  Train Station  0.17


----ALMAGRO----
                    venue  freq
0  Argentinian Restaurant  0.16
1                    Café  0.09
2                     Bar  0.09
3          Ice Cream Shop  0.09
4             Pizza Place  0.06


----BALVANERA----
                  venue  freq
0                  Café  0.25
1           Pizza Place  0.19
2  Fast Food Restaurant  0.12
3              Bus Stop  0.06
4     Electronics Store  0.06


----BARRACAS----
                 venue  freq
0        Auto Workshop   0.5
1       Farmers Market   0.5
2  American Restaurant   0.0
3      Paintball Field   0.0
4      Nature Preserve   0.0


----BELGRANO----
                    venue  freq
0      Chinese Restaurant  0.09
1  Argentinian Restaurant  0.05
2             Pizza Place  0.05
3               BBQ Joint  0.04
4        Asian Restaurant  0.04


----BOCA----
               