# Helping small businesses decide where to start in a city

## Business Problem 
Small businesses such as restaurants and coffee shops might want to know how much demand they can expect in a new neighbourhood before they decide to open a new location there.

So the questions are: 
- If someone is looking to open a restaurant, where would you recommend that they open it based on similar venues around the neighbourhood? 
- Can we use Foursquare to map a list of top venues within a city's different neighbourhoods?

If that is possible, a restaurant owner can decide to open a new location in a specific set of neighbourhoods based on the amount of competition there.

## Data to be used
- We'll use Foursquare's API to get a list of top venues of each borough (localidad) in Bogota, Colombia
- A list of Bogota's boroughs (https://es.wikipedia.org/wiki/Anexo:Localidades_de_Bogot%C3%A1), which also includes population and population density figures that can help inform the decision

### Installing and importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import wikipedia as wp
from geopy.geocoders import Nominatim
import requests
from sklearn.cluster import KMeans

#!pip install wikipedia
#!pip install geopy

# Foursquare part
# Replace the placeholders with your own credentials; never publish real keys in a notebook
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # maximum number of venues returned per request
radius = 1000 # search radius in meters

### Downloading the boroughs lists from Wikipedia
We need to use the Spanish version of Wikipedia to get the list of localities for Bogota:

In [2]:
wp.set_lang("es")
html = wp.page("Bogotá").html().encode("UTF-8")
html_content = pd.read_html(html)
df = html_content[3]
df.head()

Unnamed: 0,№,Localidad,Códigos Postales,Superficie km²[62],Población[63],Densidad hab./km²
0,1,Usaquén,110111-110151,65.31,501 999,7 686.4
1,2,Chapinero,110211-110231,38.15,139 701,3 661.88
2,3,Santa Fe,110311-110321,45.17,110 048,2 436.3
3,4,San Cristóbal,110411-110441,49.09,404 697,8 243.98
4,5,Usme,110511-110571,215.06,457 302,2 126.39


### Cleaning the dataframe: renaming columns and rows to remove Spanish accents

In [3]:
df.columns = ['useless', 'Borough', 'PostalCode', 'SurfaceInKm2','Population','PopulationDensityPerKm2']
df = df.drop(['useless'], axis=1)
df

Unnamed: 0,Borough,PostalCode,SurfaceInKm2,Population,PopulationDensityPerKm2
0,Usaquén,110111-110151,65.31,501 999,7 686.4
1,Chapinero,110211-110231,38.15,139 701,3 661.88
2,Santa Fe,110311-110321,45.17,110 048,2 436.3
3,San Cristóbal,110411-110441,49.09,404 697,8 243.98
4,Usme,110511-110571,215.06,457 302,2 126.39
5,Tunjuelito,110611-110621,9.91,199 430,20 124.11
6,Bosa,110711-110741,23.93,673 077,28 126.91
7,Kennedy,110811-110881,38.59,1 088 443,28 205.31
8,Fontibón,110911-110931,33.28,394 648,11 858.41
9,Engativá,111011-111071,35.88,887 080,24 723.52


In [4]:
df.at[0,'Borough'] = "Usaquen"
df.at[3,'Borough'] = "San Cristobal"
df.at[8,'Borough'] = "Fontibon"
df.at[9,'Borough'] = "Engativa"
df.at[13,'Borough'] = "Los Martires"
df.at[14,'Borough'] = "Antonio Narino"
df.at[18,'Borough'] = "Ciudad Bolivar"
df

Unnamed: 0,Borough,PostalCode,SurfaceInKm2,Population,PopulationDensityPerKm2
0,Usaquen,110111-110151,65.31,501 999,7 686.4
1,Chapinero,110211-110231,38.15,139 701,3 661.88
2,Santa Fe,110311-110321,45.17,110 048,2 436.3
3,San Cristobal,110411-110441,49.09,404 697,8 243.98
4,Usme,110511-110571,215.06,457 302,2 126.39
5,Tunjuelito,110611-110621,9.91,199 430,20 124.11
6,Bosa,110711-110741,23.93,673 077,28 126.91
7,Kennedy,110811-110881,38.59,1 088 443,28 205.31
8,Fontibon,110911-110931,33.28,394 648,11 858.41
9,Engativa,111011-111071,35.88,887 080,24 723.52


### Changing column types
If we take a closer look at our dataframe, we can see that the values in Population and PopulationDensityPerKm2 contain spaces, so they are probably strings that need to be converted to numbers

In [5]:
df.dtypes

Borough                     object
PostalCode                  object
SurfaceInKm2               float64
Population                  object
PopulationDensityPerKm2     object
dtype: object

I needed an extra step to convert the strings to numbers.
The downloaded values had to be passed through unidecode first: I couldn't simply replace the spaces in 'Population' and 'PopulationDensityPerKm2' because they were not regular spaces but Unicode non-breaking spaces ('\xa0')

In [6]:
!pip install unidecode
import unidecode

df['Population'] = df['Population'].apply(unidecode.unidecode)
df['PopulationDensityPerKm2'] = df['PopulationDensityPerKm2'].apply(unidecode.unidecode)
df = df.replace(' ', '', regex=True)
df.head()



Unnamed: 0,Borough,PostalCode,SurfaceInKm2,Population,PopulationDensityPerKm2
0,Usaquen,110111-110151,65.31,501999,7686.4
1,Chapinero,110211-110231,38.15,139701,3661.88
2,SantaFe,110311-110321,45.17,110048,2436.3
3,SanCristobal,110411-110441,49.09,404697,8243.98
4,Usme,110511-110571,215.06,457302,2126.39
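
Note that the replacement above strips spaces from every string column, which is why 'Santa Fe' now appears as 'SantaFe'. A minimal sketch, assuming only the two numeric columns need cleaning, restricts the replacement to those columns. The sample rows and the `clean_spaced_number` helper are illustrative and not part of the original notebook:

```python
import pandas as pd

# Illustrative sample using non-breaking spaces ('\xa0') as thousands separators
sample = pd.DataFrame({
    'Borough': ['Santa Fe', 'San Cristobal'],
    'Population': ['110\xa0048', '404\xa0697'],
    'PopulationDensityPerKm2': ['2\xa0436.3', '8\xa0243.98'],
})

def clean_spaced_number(series):
    # Drop non-breaking and regular spaces, then convert the text to numbers
    cleaned = series.str.replace('\xa0', '', regex=False)
    cleaned = cleaned.str.replace(' ', '', regex=False)
    return pd.to_numeric(cleaned)

# Only the numeric columns are touched; borough names keep their spaces
for column in ['Population', 'PopulationDensityPerKm2']:
    sample[column] = clean_spaced_number(sample[column])

print(sample)
```

With this approach the borough names stay readable, which also helps later if they are used for geocoding.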


In [7]:
df['Population'] = pd.to_numeric(df["Population"])
df['PopulationDensityPerKm2'] = pd.to_numeric(df["PopulationDensityPerKm2"])
df.dtypes

Borough                     object
PostalCode                  object
SurfaceInKm2               float64
Population                   int64
PopulationDensityPerKm2    float64
dtype: object

### Splitting the postal codes?

In [8]:
df

Unnamed: 0,Borough,PostalCode,SurfaceInKm2,Population,PopulationDensityPerKm2
0,Usaquen,110111-110151,65.31,501999,7686.4
1,Chapinero,110211-110231,38.15,139701,3661.88
2,SantaFe,110311-110321,45.17,110048,2436.3
3,SanCristobal,110411-110441,49.09,404697,8243.98
4,Usme,110511-110571,215.06,457302,2126.39
5,Tunjuelito,110611-110621,9.91,199430,20124.11
6,Bosa,110711-110741,23.93,673077,28126.91
7,Kennedy,110811-110881,38.59,1088443,28205.31
8,Fontibon,110911-110931,33.28,394648,11858.41
9,Engativa,111011-111071,35.88,887080,24723.52


In [9]:
df.sum()

Borough                    UsaquenChapineroSantaFeSanCristobalUsmeTunjuel...
PostalCode                 110111-110151110211-110231110311-110321110411-...
SurfaceInKm2                                                         1636.57
Population                                                           8050444
PopulationDensityPerKm2                                               277278
dtype: object

Adding up the population of every borough, there are over 8 million people living in Bogota. That's a big opportunity for our restaurant owners.
But they might need to tackle each borough separately, so it is a good idea to have a FirstPostalCode and a LastPostalCode for each borough

In [10]:
postalCodes = df["PostalCode"].str.split("-", n = 1, expand = True) 
df["FirstPostalCode"]= postalCodes[0] 
df["LastPostalCode"]= postalCodes[1]
df['FirstPostalCode'] = pd.to_numeric(df["FirstPostalCode"])
df['LastPostalCode'] = pd.to_numeric(df["LastPostalCode"])
df

Unnamed: 0,Borough,PostalCode,SurfaceInKm2,Population,PopulationDensityPerKm2,FirstPostalCode,LastPostalCode
0,Usaquen,110111-110151,65.31,501999,7686.4,110111,110151.0
1,Chapinero,110211-110231,38.15,139701,3661.88,110211,110231.0
2,SantaFe,110311-110321,45.17,110048,2436.3,110311,110321.0
3,SanCristobal,110411-110441,49.09,404697,8243.98,110411,110441.0
4,Usme,110511-110571,215.06,457302,2126.39,110511,110571.0
5,Tunjuelito,110611-110621,9.91,199430,20124.11,110611,110621.0
6,Bosa,110711-110741,23.93,673077,28126.91,110711,110741.0
7,Kennedy,110811-110881,38.59,1088443,28205.31,110811,110881.0
8,Fontibon,110911-110931,33.28,394648,11858.41,110911,110931.0
9,Engativa,111011-111071,35.88,887080,24723.52,111011,111071.0


Time to remove the initial PostalCode column

In [11]:
df = df.drop(['PostalCode'], axis=1)
df

Unnamed: 0,Borough,SurfaceInKm2,Population,PopulationDensityPerKm2,FirstPostalCode,LastPostalCode
0,Usaquen,65.31,501999,7686.4,110111,110151.0
1,Chapinero,38.15,139701,3661.88,110211,110231.0
2,SantaFe,45.17,110048,2436.3,110311,110321.0
3,SanCristobal,49.09,404697,8243.98,110411,110441.0
4,Usme,215.06,457302,2126.39,110511,110571.0
5,Tunjuelito,9.91,199430,20124.11,110611,110621.0
6,Bosa,23.93,673077,28126.91,110711,110741.0
7,Kennedy,38.59,1088443,28205.31,110811,110881.0
8,Fontibon,33.28,394648,11858.41,110911,110931.0
9,Engativa,35.88,887080,24723.52,111011,111071.0


## Time to make important decisions
We now have a clean dataset. If we want the biggest impact based on population alone, we should focus on the Kennedy borough, which has over 1 million inhabitants and the highest population density among the boroughs shown above. We can then take a closer look at the neighbourhoods in this borough to see where our restaurants could get the best return

It's important to note that we are taking this approach based solely on population. In a real-world scenario, a data scientist would also talk with the restaurant owners and look at other data per borough, such as:
- Average income level
- Housing prices
- Education level
- Population overall
- Area of the borough in km²

among other figures that might be relevant to each specific restaurant. We might not want to open a high-end, five-star restaurant in a borough that doesn't have as many potential customers as we need.

> It's worth noting that Bogota postal codes generally increase in steps of 10, although in specific cases like Suba they increase in steps of 10 and then in steps of 5

In [14]:
kennedy_postalCodes = list(range(int(df.at[7,"FirstPostalCode"]), int(df.at[7,"LastPostalCode"])+1,10))
print("There are {} postal codes in Kennedy".format(len(kennedy_postalCodes)))

There are 8 postal codes in Kennedy


Let's create a Kennedy-specific dataframe with Postal Code, Latitude and Longitude

In [15]:
kennedy_df = pd.DataFrame(kennedy_postalCodes, columns = ['PostalCode'])
kennedy_df

Unnamed: 0,PostalCode
0,110811
1,110821
2,110831
3,110841
4,110851
5,110861
6,110871
7,110881


In [16]:
latitudes = []
longitudes = []
import pgeocode
import time
nomi = pgeocode.Nominatim('CO')

for x in kennedy_df['PostalCode']:
    info = nomi.query_postal_code(x)
    lat = info.latitude
    lon = info.longitude
    print("The '{}' postal code has coordinates: {},{}".format(x, lat, lon))
    latitudes.append(lat)
    longitudes.append(lon)

The '110811' postal code has coordinates: 4.6544,-74.154
The '110821' postal code has coordinates: 4.6397,-74.1426
The '110831' postal code has coordinates: 4.6292,-74.1301
The '110841' postal code has coordinates: 4.6075,-74.1458
The '110851' postal code has coordinates: 4.626,-74.157
The '110861' postal code has coordinates: 4.6144,-74.1718
The '110871' postal code has coordinates: 4.6455,-74.1675
The '110881' postal code has coordinates: 4.6388,-74.1757


In [17]:
kennedy_df['Latitude'] = latitudes
kennedy_df['Longitude'] = longitudes
kennedy_df

Unnamed: 0,PostalCode,Latitude,Longitude
0,110811,4.6544,-74.154
1,110821,4.6397,-74.1426
2,110831,4.6292,-74.1301
3,110841,4.6075,-74.1458
4,110851,4.626,-74.157
5,110861,4.6144,-74.1718
6,110871,4.6455,-74.1675
7,110881,4.6388,-74.1757


Now let's get the coordinates for Kennedy borough

In [47]:
from geopy.geocoders import Nominatim

address = 'Kennedy, Bogota'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
kennedy_latitude = location.latitude
kennedy_longitude = location.longitude
print('The geographical coordinates of Kennedy are {}, {}.'.format(kennedy_latitude, kennedy_longitude))

The geographical coordinates of Kennedy are 4.6315823, -74.1513187.


We'll reuse the getNearbyVenues function from previous labs to get a list of venues near each of Kennedy's postal codes

In [29]:
def getNearbyVenues(postalcodes, latitudes, longitudes, radius=1000):
    venues_list=[]
    for postalcode, lat, lng in zip(postalcodes, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            postalcode, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'PostalCode Latitude', 
                  'PostalCode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
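
The list comprehension in `getNearbyVenues` assumes every venue has at least one category. A minimal sketch of a more defensive row builder follows. The `extract_venue_rows` helper, the 'Unknown' fallback label, and the mock items are assumptions made for illustration; the dictionary shape simply mirrors the keys accessed in the function above:

```python
def extract_venue_rows(items, postalcode, lat, lng):
    """Build one tuple per venue, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to a placeholder instead of raising IndexError
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((
            postalcode, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Mock items shaped like the entries the function above iterates over
mock_items = [
    {'venue': {'name': 'Fitness Sports GYM',
               'location': {'lat': 4.648465, 'lng': -74.153374},
               'categories': [{'name': 'Gym'}]}},
    {'venue': {'name': 'Uncategorized place',
               'location': {'lat': 4.65, 'lng': -74.15},
               'categories': []}},
]

rows = extract_venue_rows(mock_items, 110811, 4.6544, -74.154)
for row in rows:
    print(row)
```

A helper like this could replace the inner comprehension so that a single venue without categories does not abort the whole request loop.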

Now we have a single function that returns all the venues in Kennedy, which can then be grouped by postal code

In [31]:
kennedy_venues = getNearbyVenues(postalcodes=kennedy_df['PostalCode'],
                                   latitudes=kennedy_df['Latitude'],
                                   longitudes=kennedy_df['Longitude'])
kennedy_venues.head()

Unnamed: 0,PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,110811,4.6544,-74.154,Homecenter y Constructor Tintal,4.651588,-74.146031,Hardware Store
1,110811,4.6544,-74.154,Pista De Patinaje El Tintal,4.652716,-74.160928,Skating Rink
2,110811,4.6544,-74.154,Fitness Sports GYM,4.648465,-74.153374,Gym
3,110811,4.6544,-74.154,Centro Comercial Ciudad Tintal,4.652983,-74.159557,Shopping Mall
4,110811,4.6544,-74.154,Merkacol,4.64841,-74.147728,Fruit & Vegetable Store


## How many venue types do we have in Kennedy?

In [32]:
print('There are {} unique categories.'.format(len(kennedy_venues['Venue Category'].unique())))

There are 45 unique categories.


## Analyzing each postal code in Kennedy borough

In [33]:
# one hot encoding
kennedy_onehot = pd.get_dummies(kennedy_venues[['Venue Category']], prefix="", prefix_sep="")

# add PostalCode column back to dataframe
kennedy_onehot['PostalCode'] = kennedy_venues['PostalCode'] 

# move PostalCode column to the first column
fixed_columns = [kennedy_onehot.columns[-1]] + list(kennedy_onehot.columns[:-1])
kennedy_onehot = kennedy_onehot[fixed_columns]

kennedy_onehot.head()

Unnamed: 0,PostalCode,Airport,Arcade,Arepa Restaurant,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Basketball Court,Breakfast Spot,...,Skating Rink,Soccer Field,Soccer Stadium,South American Restaurant,Spa,Sporting Goods Shop,Sports Club,Supermarket,Wings Joint,Women's Store
0,110811,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,110811,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,110811,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,110811,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,110811,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Grouping by PostalCode
It's time to group these venues by PostalCode so we can get the relative frequency of each venue category:

In [34]:
kennedy_grouped = kennedy_onehot.groupby('PostalCode').mean().reset_index()
kennedy_grouped

Unnamed: 0,PostalCode,Airport,Arcade,Arepa Restaurant,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Basketball Court,Breakfast Spot,...,Skating Rink,Soccer Field,Soccer Stadium,South American Restaurant,Spa,Sporting Goods Shop,Sports Club,Supermarket,Wings Joint,Women's Store
0,110811,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,110821,0.0,0.0,0.0,0.066667,0.0,0.133333,0.066667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,110831,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.052632,0.052632,...,0.0,0.052632,0.052632,0.052632,0.052632,0.0,0.052632,0.052632,0.052632,0.0
3,110841,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0
4,110851,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,110861,0.125,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0
6,110871,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0
7,110881,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25
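
Because each row of the one-hot table marks exactly one category, the per-group mean equals the share of venues in that category. A small sketch with made-up venues (the data below is illustrative only, and `dtype=int` is passed so the dummy columns are plain 0/1 integers):

```python
import pandas as pd

# Made-up venues: four in postal code 1, two in postal code 2
toy = pd.DataFrame({
    'PostalCode': [1, 1, 1, 1, 2, 2],
    'Venue Category': ['Park', 'Park', 'Gym', 'Bakery', 'Gym', 'Gym'],
})

# One column per category, with a 1 in the column matching each venue
onehot = pd.get_dummies(toy[['Venue Category']], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, 'PostalCode', toy['PostalCode'])

# The mean of the 0/1 indicators is the fraction of venues in each category
freq = onehot.groupby('PostalCode').mean()
print(freq)
```

Each row of `freq` sums to 1, so a value such as 0.5 reads as "half of the venues found around this postal code belong to this category".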


Let's see the top 5 venue types for each postal code:

In [36]:
num_top_venues = 5

for postalCode in kennedy_grouped['PostalCode']:
    print("PostalCode: {}".format(postalCode))
    temp = kennedy_grouped[kennedy_grouped['PostalCode'] == postalCode].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

PostalCode: 110811
                     venue  freq
0           Hardware Store  0.17
1             Skating Rink  0.17
2                   Market  0.17
3                      Gym  0.17
4  Fruit & Vegetable Store  0.17


PostalCode: 110821
                  venue  freq
0           Pizza Place  0.20
1             BBQ Joint  0.13
2  Fast Food Restaurant  0.13
3      Asian Restaurant  0.07
4                Bakery  0.07


PostalCode: 110831
                  venue  freq
0  Fast Food Restaurant  0.11
1        Soccer Stadium  0.05
2          Burger Joint  0.05
3                  Park  0.05
4           Pizza Place  0.05


PostalCode: 110841
                        venue  freq
0  Construction & Landscaping  0.17
1          Athletics & Sports  0.17
2                   BBQ Joint  0.17
3         Sporting Goods Shop  0.17
4                        Park  0.17


PostalCode: 110851
            venue  freq
0            Park  0.11
1     Pizza Place  0.11
2      Restaurant  0.11
3  Farmers Market  0.11
4  

### Converting the results to a dataframe

In [41]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond 3rd use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postalcodes_venues_sorted = pd.DataFrame(columns=columns)
postalcodes_venues_sorted['PostalCode'] = kennedy_grouped['PostalCode']

for ind in np.arange(kennedy_grouped.shape[0]):
    postalcodes_venues_sorted.iloc[ind, 1:] = return_most_common_venues(kennedy_grouped.iloc[ind, :], num_top_venues)

postalcodes_venues_sorted

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,110811,Hardware Store,Skating Rink,Market,Gym,Shopping Mall,Fruit & Vegetable Store,BBQ Joint,Bakery,Fried Chicken Joint,Fast Food Restaurant
1,110821,Pizza Place,Fast Food Restaurant,BBQ Joint,Gym,Gymnastics Gym,Asian Restaurant,Bakery,Brewery,Seafood Restaurant,Pub
2,110831,Fast Food Restaurant,Pizza Place,Burger Joint,Breakfast Spot,Basketball Court,Wings Joint,Park,Coffee Shop,BBQ Joint,Restaurant
3,110841,Restaurant,Sporting Goods Shop,Athletics & Sports,BBQ Joint,Construction & Landscaping,Park,Coffee Shop,Fried Chicken Joint,Fast Food Restaurant,Farmers Market
4,110851,Restaurant,Farmers Market,Park,Pizza Place,Burger Joint,Gym,Discount Store,Department Store,Historic Site,Ice Cream Shop
5,110861,Park,Airport,Supermarket,BBQ Joint,Convenience Store,Mobile Phone Shop,Shopping Mall,Coffee Shop,Fast Food Restaurant,Farmers Market
6,110871,Supermarket,Shopping Mall,Park,Women's Store,Burger Joint,Fried Chicken Joint,Fast Food Restaurant,Farmers Market,Discount Store,Department Store
7,110881,Women's Store,Arcade,Movie Theater,Park,Coffee Shop,Fried Chicken Joint,Fast Food Restaurant,Farmers Market,Discount Store,Department Store


## First conclusions
We got some results for each postal code!
- 110811 appears to have a mall, a market and a gym, but no restaurants whatsoever!
- 110821 has pizza places, BBQ joints and fast food restaurants in general
- 110831 has a stadium and a park, and while it has some fast food restaurants, it looks like it could use some variety, especially considering the number of people around parks or stadiums on any weekend
- 110841 looks like a place in the process of being rebuilt, and some sporting goods stores could be a hint that healthy food restaurants could attract customers
- 110851, with parks, farmers markets and a soccer field, also hints that healthy food restaurants are missing in the neighbourhood. While there are general restaurants and some pizza places, there's always room for some healthy competition
- 110861, with an airport and mobile phone shops(?) among the top 5 venues, appears to be a more industrial neighbourhood. When there are workers around, fast food restaurants are always welcome!
- 110871 has no restaurants! Only 3 top venues are identified: supermarkets, shopping malls and parks. This is THE hint a restaurant owner is looking for.
- 110881 looks like an interesting place to open a restaurant or a fast food restaurant, since the top venues are women's stores, movie theaters, arcades and parks.

## Clustering time
Now that we have some idea of what each PostalCode / neighbourhood in Kennedy looks like, can we cluster them and find similar patterns?

In [57]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

kennedy_grouped_clustering = kennedy_grouped.drop('PostalCode', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kennedy_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
postalcodes_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

kennedy_merged = kennedy_df
kennedy_merged = kennedy_merged.join(postalcodes_venues_sorted.set_index('PostalCode'), on='PostalCode')
kennedy_merged.head()

Unnamed: 0,PostalCode,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,110811,4.6544,-74.154,1,Hardware Store,Skating Rink,Market,Gym,Shopping Mall,Fruit & Vegetable Store,BBQ Joint,Bakery,Fried Chicken Joint,Fast Food Restaurant
1,110821,4.6397,-74.1426,1,Pizza Place,Fast Food Restaurant,BBQ Joint,Gym,Gymnastics Gym,Asian Restaurant,Bakery,Brewery,Seafood Restaurant,Pub
2,110831,4.6292,-74.1301,1,Fast Food Restaurant,Pizza Place,Burger Joint,Breakfast Spot,Basketball Court,Wings Joint,Park,Coffee Shop,BBQ Joint,Restaurant
3,110841,4.6075,-74.1458,1,Restaurant,Sporting Goods Shop,Athletics & Sports,BBQ Joint,Construction & Landscaping,Park,Coffee Shop,Fried Chicken Joint,Fast Food Restaurant,Farmers Market
4,110851,4.626,-74.157,1,Restaurant,Farmers Market,Park,Pizza Place,Burger Joint,Gym,Discount Store,Department Store,Historic Site,Ice Cream Shop


## A picture tells more than a thousand words
While we can identify some clustering among our postal codes in the table, it's easier to see it on a map

In [72]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[kennedy_latitude, kennedy_longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kennedy_merged['Latitude'], kennedy_merged['Longitude'], kennedy_merged['PostalCode'], kennedy_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=6,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.9).add_to(map_clusters)
       
map_clusters.save("Kennedy_Clusters.html")  # folium writes an HTML page, not an image
map_clusters

## Second conclusions:
While we asked for 5 clusters, it seems reasonable that, for 8 neighbourhoods, a total of 3 clusters were formed:
- Cluster One: 110861 and 110871 PostalCodes
- Cluster Two: 110881 PostalCode
- Cluster Three: 110811, 110821, 110831, 110841 and 110851 PostalCodes
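
The grouping above can be tallied directly from a label array. The sketch below reproduces the membership listed above with hypothetical label numbers (only the label 1 for the first five postal codes appears in the merged table); in the notebook, `kmeans.labels_` would take the place of the hard-coded list:

```python
from collections import defaultdict

postal_codes = [110811, 110821, 110831, 110841, 110851, 110861, 110871, 110881]
# Hypothetical labels reproducing the grouping described above
labels = [1, 1, 1, 1, 1, 0, 0, 2]

# Collect the postal codes that share each label
members = defaultdict(list)
for code, label in zip(postal_codes, labels):
    members[label].append(code)

for label in sorted(members):
    print("Cluster label {}: {} postal code(s) -> {}".format(
        label, len(members[label]), members[label]))
```

Counting members this way makes it easy to spot how many distinct labels were actually assigned and which clusters contain a single postal code.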

If we recall the top venues per postal code:
- Cluster One looks like an industrial, go-to-work cluster: not a lot of restaurants but plenty of supermarkets, mobile phone shops and parks. With an airport in the neighbourhood, it might be an option to open some upscale places that can also deliver food quickly.

- Cluster Two looks like it could be merged into Cluster One, since it has more stores, movie theaters and arcades.
 - From a restaurant owner's perspective, though, it makes sense as a separate cluster because it looks more like a residential zone than an industrial one. So family-oriented restaurants with places for kids to play look like a promising idea. Also, since there seems to be little work activity, customers will likely have more time to spare among the arcades, parks and stadiums around the area.
 
- Cluster Three has stadiums, sports venues, parks and gyms, but not enough restaurants apart from PostalCode 110821. Interestingly, 110821 is the 'heart' of this cluster. So this can be a message to our restaurant owners:
 - Bring competition to the 110821 neighbourhood or start opening smaller restaurants around the cluster
 - Recall that this looks like a 'healthy' cluster. Consider that when you start building your menus.
 - Also, note that there are places like stadiums and parks, and food-to-go options are a must around them. So convenience and fast service will be key to winning customers in this cluster
 
## Final conclusion
- From our findings, we were able to help local restaurant owners in Bogota!
- We only looked at the Kennedy borough, but we can easily analyze the rest of the boroughs in Bogota to help our customers expand their restaurants.
- It's worth noting again that we chose to analyze Kennedy based solely on its population and population density; there are many other factors that a restaurant owner should look at.