# Applied Data Science Capstone
This notebook contains the capstone project, which is described below.
## Moving to a similar neighborhood in Mexico City
Many people in Mexico City move to a new home every day. That is great, but many of them have trouble 
settling into the new neighborhood because they miss their old one and the venues 
they used to visit, for example their favorite taco stand or coffee shop.  
Instead of letting people spend many hours searching for a neighborhood with the same kind of venues they 
had near their old home, we can recommend neighborhoods whose venues closely match those of the 
old neighborhood.

### Import the libraries

In [105]:
import numpy as np
import pandas as pd
import folium 
import requests 
from pandas import json_normalize
from sklearn.cluster import KMeans

print('Libraries imported')

Libraries imported


Load the dataset

In [59]:
df = pd.read_csv('https://datos.cdmx.gob.mx/explore/dataset/coloniascdmx/download/?format=csv&timezone=America/Mexico_City&lang=es&use_labels_for_header=true&csv_separator=%2C')
df.head()

Unnamed: 0,COLONIA,ENTIDAD,Geo Point,Geo Shape,CVE_ALC,ALCALDIA,CVE_COL,SECC_COM,SECC_PAR
0,IRRIGACION,9.0,"19.4429549298,-99.2099357048","{""type"": ""Polygon"", ""coordinates"": [[[-99.2115...",16,MIGUEL HIDALGO,16-035,"5079, 5080, 5083, 5102","5068, 5082"
1,MARINA NACIONAL (U HAB),9.0,"19.4466319056,-99.1795110575","{""type"": ""Polygon"", ""coordinates"": [[[-99.1797...",16,MIGUEL HIDALGO,16-049,"5137, 5182",
2,PEDREGAL DE STO DOMINGO VI,9.0,"19.3234027183,-99.1654676133","{""type"": ""Polygon"", ""coordinates"": [[[-99.1622...",3,COYOACAN,03-144,"381, 394, 494, 416, 417, 439",
3,VILLA PANAMERICANA 7MA. SECCIN (U HAB),9.0,"19.304604269,-99.1677617231","{""type"": ""Polygon"", ""coordinates"": [[[-99.1676...",3,COYOACAN,03-121,,"474, 475"
4,VILLA PANAMERICANA 6TA. SECCIN (U HAB),9.0,"19.3112238873,-99.1696478642","{""type"": ""Polygon"", ""coordinates"": [[[-99.1702...",3,COYOACAN,03-120,,458


Drop the features that we don't need

In [60]:
df.drop(['ENTIDAD', 'Geo Shape', 'CVE_ALC', 'CVE_COL', 'SECC_COM', 'SECC_PAR'], axis=1, inplace=True)

In [61]:
df.head()

Unnamed: 0,COLONIA,Geo Point,ALCALDIA
0,IRRIGACION,"19.4429549298,-99.2099357048",MIGUEL HIDALGO
1,MARINA NACIONAL (U HAB),"19.4466319056,-99.1795110575",MIGUEL HIDALGO
2,PEDREGAL DE STO DOMINGO VI,"19.3234027183,-99.1654676133",COYOACAN
3,VILLA PANAMERICANA 7MA. SECCIN (U HAB),"19.304604269,-99.1677617231",COYOACAN
4,VILLA PANAMERICANA 6TA. SECCIN (U HAB),"19.3112238873,-99.1696478642",COYOACAN


Remove all rows that contain NaN values, including those with a missing Geo Point

In [62]:
df = df.dropna()
df = df.reset_index(drop=True)
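
Note that `dropna()` without arguments drops a row when any of its columns is missing, not only `Geo Point`. If only the coordinates matter, the check can be restricted with the `subset` parameter. A minimal sketch on a toy frame (the rows are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows, not taken from the real dataset)
toy = pd.DataFrame({
    'COLONIA': ['A', 'B', 'C'],
    'Geo Point': ['19.44,-99.20', np.nan, '19.32,-99.16'],
    'ALCALDIA': ['X', 'Y', np.nan],
})

# Drop only the rows whose 'Geo Point' is missing
only_geo = toy.dropna(subset=['Geo Point']).reset_index(drop=True)

# Drop rows with a missing value in any column
any_col = toy.dropna().reset_index(drop=True)

print(len(only_geo), len(any_col))
```

Row C survives the first call because its coordinates are present, even though its borough is missing.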

Split the 'Geo Point' feature into two new features: lat and lng

In [63]:
lat = []
lng = []
for lat_lng_str in df['Geo Point']:
    lat = np.append(lat, float(lat_lng_str.split(',')[0]))
    lng = np.append(lng, float(lat_lng_str.split(',')[1]))

df['lat'] = lat
df['lng'] = lng
df.head()

Unnamed: 0,COLONIA,Geo Point,ALCALDIA,lat,lng
0,IRRIGACION,"19.4429549298,-99.2099357048",MIGUEL HIDALGO,19.442955,-99.209936
1,MARINA NACIONAL (U HAB),"19.4466319056,-99.1795110575",MIGUEL HIDALGO,19.446632,-99.179511
2,PEDREGAL DE STO DOMINGO VI,"19.3234027183,-99.1654676133",COYOACAN,19.323403,-99.165468
3,VILLA PANAMERICANA 7MA. SECCIN (U HAB),"19.304604269,-99.1677617231",COYOACAN,19.304604,-99.167762
4,VILLA PANAMERICANA 6TA. SECCIN (U HAB),"19.3112238873,-99.1696478642",COYOACAN,19.311224,-99.169648
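
The loop above works, but it grows two arrays one element at a time. As an alternative sketch, the same split can probably be done in one step with the pandas string accessor. The two sample values are copied from the table above.

```python
import pandas as pd

# Two sample values copied from the table above
sample = pd.DataFrame({'Geo Point': ['19.4429549298,-99.2099357048',
                                     '19.4466319056,-99.1795110575']})

# Split on the comma into two columns and convert both to float
parts = sample['Geo Point'].str.split(',', expand=True).astype(float)
sample['lat'] = parts[0]
sample['lng'] = parts[1]

print(sample[['lat', 'lng']])
```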


Drop Geo Point column

In [64]:
df.drop(['Geo Point'], axis=1, inplace=True)

In [65]:
df.head()

Unnamed: 0,COLONIA,ALCALDIA,lat,lng
0,IRRIGACION,MIGUEL HIDALGO,19.442955,-99.209936
1,MARINA NACIONAL (U HAB),MIGUEL HIDALGO,19.446632,-99.179511
2,PEDREGAL DE STO DOMINGO VI,COYOACAN,19.323403,-99.165468
3,VILLA PANAMERICANA 7MA. SECCIN (U HAB),COYOACAN,19.304604,-99.167762
4,VILLA PANAMERICANA 6TA. SECCIN (U HAB),COYOACAN,19.311224,-99.169648


Rename the columns so they are more descriptive

In [66]:
df_cdmx = df.rename(columns={'COLONIA': 'neighborhood', 'ALCALDIA': 'borough'})
df_cdmx.head()

Unnamed: 0,neighborhood,borough,lat,lng
0,IRRIGACION,MIGUEL HIDALGO,19.442955,-99.209936
1,MARINA NACIONAL (U HAB),MIGUEL HIDALGO,19.446632,-99.179511
2,PEDREGAL DE STO DOMINGO VI,COYOACAN,19.323403,-99.165468
3,VILLA PANAMERICANA 7MA. SECCIN (U HAB),COYOACAN,19.304604,-99.167762
4,VILLA PANAMERICANA 6TA. SECCIN (U HAB),COYOACAN,19.311224,-99.169648



**Explore and cluster the neighborhoods in Mexico City.**

Set the latitude and longitude of Mexico City

In [72]:
lat_cdmx = 19.4284706
lng_cdmx = -99.1276627


Let's plot the neighborhoods to explore them

In [69]:
map_cdmx = folium.Map(location=[lat_cdmx, lng_cdmx], zoom_start=10)  
 
# add markers to map
for lat, lng, borough, neighborhood in zip(df_cdmx['lat'], df_cdmx['lng'], df_cdmx['borough'], df_cdmx['neighborhood']):  
    label = '{}, {}'.format(neighborhood, borough)  
    label = folium.Popup(label, parse_html=True)  
    folium.CircleMarker(
        [lat, lng],  
        radius=5,  
        popup=label,  
        color='blue',  
        fill=True,  
        fill_color='#3186cc',  
        fill_opacity=0.7,  
        parse_html=False).add_to(map_cdmx)
    
map_cdmx

<b>In case the map didn't appear</b>
<img src='https://github.com/Ivan-hdz/Coursera_Capstone/blob/master/imgs/cdmx_1.png?raw=true' />


Let's assign a color for each borough

In [73]:
# How many neighborhoods are there in each borough?
df_cdmx['borough'].value_counts()

IZTAPALAPA                293
ALVARO OBREGON            249
GUSTAVO A. MADERO         232
TLALPAN                   177
COYOACAN                  153
AZCAPOTZALCO              111
MIGUEL HIDALGO             88
VENUSTIANO CARRANZA        80
XOCHIMILCO                 79
BENITO JUAREZ              64
CUAUHTEMOC                 63
TLAHUAC                    58
IZTACALCO                  55
LA MAGDALENA CONTRERAS     52
CUAJIMALPA DE MORELOS      43
MILPA ALTA                 11
Name: borough, dtype: int64

In [87]:
# Helper that returns a random hex color such as '#10db4d'
import random

def random_color():
    random_number = random.randint(0, 0xFFFFFF)  # 0xFFFFFF == 16777215
    # Zero-pad to six hex digits so small values still form a valid color
    return '#{:06x}'.format(random_number)

colors = {}
for borough in df_cdmx['borough'].value_counts().index:
    colors[borough] = random_color()
colors

{'IZTAPALAPA': '#10db4d',
 'ALVARO OBREGON': '#6abc82',
 'GUSTAVO A. MADERO': '#ccc80b',
 'TLALPAN': '#ed7e1c',
 'COYOACAN': '#9f68f2',
 'AZCAPOTZALCO': '#572209',
 'MIGUEL HIDALGO': '#5066f6',
 'VENUSTIANO CARRANZA': '#ca85c0',
 'XOCHIMILCO': '#e748b0',
 'BENITO JUAREZ': '#de5b1c',
 'CUAUHTEMOC': '#891613',
 'TLAHUAC': '#71cf45',
 'IZTACALCO': '#f97402',
 'LA MAGDALENA CONTRERAS': '#e49e6f',
 'CUAJIMALPA DE MORELOS': '#a53f08',
 'MILPA ALTA': '#76f877'}

In [88]:
# Plot
map_cdmx = folium.Map(location=[lat_cdmx, lng_cdmx], zoom_start=10)  
 
# add markers to map
for lat, lng, borough, neighborhood in zip(df_cdmx['lat'], df_cdmx['lng'], df_cdmx['borough'], df_cdmx['neighborhood']):  
    label = '{}, {}'.format(neighborhood, borough)  
    label = folium.Popup(label, parse_html=True)  
    folium.CircleMarker(
        [lat, lng],  
        radius=5,  
        popup=label,  
        color=colors[borough],  
        fill=True,  
        fill_color=colors[borough],  
        fill_opacity=0.7,  
        parse_html=False).add_to(map_cdmx)
    
map_cdmx


<b>In case the map didn't appear</b>
<img src='https://github.com/Ivan-hdz/Coursera_Capstone/blob/master/imgs/cdmx_2.png?raw=true' />


Now let's get the venues for each neighborhood


In [89]:
# Foursquare config parameters
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20200404'
LIMIT = 100
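
Credentials are usually better kept out of the notebook itself. One possible approach, sketched here, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions chosen for this example, not names that Foursquare requires.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from the given mapping, or raise if either is missing."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

# Example with a stand-in mapping (hypothetical values)
demo_id, demo_secret = load_credentials({'FOURSQUARE_CLIENT_ID': 'abc',
                                         'FOURSQUARE_CLIENT_SECRET': 'xyz'})
print(demo_id, demo_secret)
```

In the notebook, calling `load_credentials()` without arguments would read from the real environment.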

Define the URL

In [90]:
# define URL with a sample latitude and longitude
latitude = df_cdmx['lat'][0]
longitude = df_cdmx['lng'][0]

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}'.format(
    CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION)
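
The query string can also be assembled from a dictionary, which keeps each parameter next to its value. The sketch below uses only the standard library. `build_search_url` is a hypothetical helper and the credentials are placeholders. Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is the standard encoding of the same value.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lng, version):
    """Build the venues/search URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# Placeholder credentials; the coordinates are those of the first neighborhood
demo_url = build_search_url('ID', 'SECRET', 19.442955, -99.209936, '20200404')
print(demo_url)
```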


### For each neighborhood, we perform the following steps

Get the venues near the neighborhood location

In [91]:

# send GET request and get the nearby venues
venues_json_dirty = requests.get(url).json()
print('Request sent')


Request sent


Process each venue and find out its category

In [92]:
if len(venues_json_dirty['response']['venues']) == 0:
    print('No venues are available at the moment!')

else:
    # assign relevant part of JSON to venues
    venues_json = venues_json_dirty['response']['venues']
    # Getting the name of the primary category
    for v in venues_json:
        if isinstance(v['categories'], list):
            if len( v['categories'] ) > 0:
                v['categories'] = v['categories'][0]['name']
            else:
                v['categories'] = 'Not assigned'
    # transform venues into a dataframe
    venues_df_dirty = json_normalize(venues_json)
    ## Preprocessing 
    venues_df = pd.DataFrame({
        'category': venues_df_dirty['categories'],
        'distance': venues_df_dirty['location.distance']
    })
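
The category handling above overwrites each venue's `categories` list in place. A small sketch of the same rule as a function that does not modify its input is shown below. `primary_category` is a hypothetical helper, and the two venues are made-up examples shaped like the entries of `response['venues']`.

```python
def primary_category(venue):
    """Return the name of the first category, or 'Not assigned' if there is none."""
    categories = venue.get('categories') or []
    return categories[0]['name'] if categories else 'Not assigned'

# Hypothetical venues shaped like the entries of response['venues']
venues = [
    {'name': 'Taqueria', 'categories': [{'name': 'Mexican Restaurant'}, {'name': 'Taco Place'}]},
    {'name': 'Unknown spot', 'categories': []},
]

names = [primary_category(v) for v in venues]
print(names)
```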
    




Venue categories for the neighborhood location

In [93]:
venues_df['category'].value_counts()

Mexican Restaurant                          5
Student Center                              2
Salon / Barbershop                          2
Café                                        2
Park                                        2
Not assigned                                1
Chiropractor                                1
Department Store                            1
Building                                    1
Church                                      1
Bus Stop                                    1
Ice Cream Shop                              1
Residential Building (Apartment / Condo)    1
Convenience Store                           1
Office                                      1
Veterinarian                                1
Gas Station                                 1
Coffee Shop                                 1
Vineyard                                    1
Paper / Office Supplies Store               1
Nursery School                              1
Community College                 

One-hot encode the categories

In [94]:
one_hot_categories_df = pd.get_dummies(venues_df['category'])
one_hot_categories_df.head()

Unnamed: 0,Building,Bus Stop,Café,Chiropractor,Church,Coffee Shop,Community College,Convenience Store,Department Store,Gas Station,...,Not assigned,Nursery School,Office,Paper / Office Supplies Store,Park,Residential Building (Apartment / Condo),Salon / Barbershop,Student Center,Veterinarian,Vineyard
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Get the mean value for each category

In [95]:
one_hot_categories_sers = one_hot_categories_df.mean()
one_hot_categories_sers

Building                                    0.033333
Bus Stop                                    0.033333
Café                                        0.066667
Chiropractor                                0.033333
Church                                      0.033333
Coffee Shop                                 0.033333
Community College                           0.033333
Convenience Store                           0.033333
Department Store                            0.033333
Gas Station                                 0.033333
Ice Cream Shop                              0.033333
Mexican Restaurant                          0.166667
Not assigned                                0.033333
Nursery School                              0.033333
Office                                      0.033333
Paper / Office Supplies Store               0.033333
Park                                        0.066667
Residential Building (Apartment / Condo)    0.033333
Salon / Barbershop                          0.
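
The mean of a one-hot column is simply the share of venues in that category. For example, the 0.166667 for Mexican Restaurant is consistent with 5 of 30 returned venues. The sketch below checks this equivalence on a reduced example of three categories, using their counts from the earlier output. With only 9 venues, the shares differ from the ones above.

```python
import pandas as pd

# Counts taken from the value_counts() output, limited to three categories
categories = pd.Series(['Mexican Restaurant'] * 5 + ['Café'] * 2 + ['Park'] * 2)

# Mean of the one-hot columns
one_hot_mean = pd.get_dummies(categories).mean()

# Relative frequency of each category
relative_freq = categories.value_counts(normalize=True)

# Both give the share of venues per category (here 5/9, 2/9 and 2/9)
print(one_hot_mean.sort_index())
print(relative_freq.sort_index())
```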

Create a one-row dataframe for this neighborhood (the full dataset will have one such row per neighborhood)

In [96]:
one_hot_categories_mean_df = pd.DataFrame( [one_hot_categories_sers.values], columns = one_hot_categories_sers.index)

one_hot_categories_mean_df.head()

Unnamed: 0,Building,Bus Stop,Café,Chiropractor,Church,Coffee Shop,Community College,Convenience Store,Department Store,Gas Station,...,Not assigned,Nursery School,Office,Paper / Office Supplies Store,Park,Residential Building (Apartment / Condo),Salon / Barbershop,Student Center,Veterinarian,Vineyard
0,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,...,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.066667,0.066667,0.033333,0.033333


Add the mean distance

In [97]:
one_hot_categories_mean_df['distance'] = venues_df['distance'].mean()
one_hot_categories_mean_df.head()

Unnamed: 0,Building,Bus Stop,Café,Chiropractor,Church,Coffee Shop,Community College,Convenience Store,Department Store,Gas Station,...,Nursery School,Office,Paper / Office Supplies Store,Park,Residential Building (Apartment / Condo),Salon / Barbershop,Student Center,Veterinarian,Vineyard,distance
0,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,...,0.033333,0.033333,0.033333,0.066667,0.033333,0.066667,0.066667,0.033333,0.033333,144.133333


### Let's do the above process for all of the neighborhoods 

In [98]:
# create a dataframe to store the result
cdmx_categories_mean_df = pd.DataFrame()

# for each neighborhood
print('Processing data ...')
for lat, lng, neighborhood in zip(df_cdmx['lat'], df_cdmx['lng'], df_cdmx['neighborhood']):
    # Define URL 
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}'.format(
        CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION)
    # Make the request
    venues_json_dirty = requests.get(url).json()
    # Preprocess the categories for each venue in this neighborhood
    if len(venues_json_dirty['response']['venues']) == 0:
        print('No venues are available at the moment!')

    else:
        # assign relevant part of JSON to venues
        venues_json = venues_json_dirty['response']['venues']
        # Getting the name of the primary category
        for v in venues_json:
            if isinstance(v['categories'], list):
                if len( v['categories'] ) > 0:
                    v['categories'] = v['categories'][0]['name']
                else:
                    v['categories'] = 'Not assigned'
        # transform venues into a dataframe
        venues_df_dirty = json_normalize(venues_json)
        ## Preprocessing 
        venues_df = pd.DataFrame({
            'category': venues_df_dirty['categories'],
            'distance': venues_df_dirty['location.distance']
        })
        # One-hot encoding
        one_hot_categories_df = pd.get_dummies(venues_df['category'])
        # Mean value for each category
        one_hot_categories_sers = one_hot_categories_df.mean()
        one_hot_categories_mean_df = pd.DataFrame( [one_hot_categories_sers.values], columns = one_hot_categories_sers.index)
        # Append mean distance
        one_hot_categories_mean_df['distance'] = venues_df['distance'].mean()
        # one_hot_categories_mean_df['neighborhood'] = neighborhood
        cdmx_categories_mean_df = pd.concat([cdmx_categories_mean_df, one_hot_categories_mean_df], axis=0, ignore_index=True)

    
print('... Data processed')


Processing data ...




... Data processed
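
Concatenating inside the loop copies the accumulated dataframe on every iteration. As an alternative sketch, the per-neighborhood profiles can be collected in a list and turned into a dataframe once at the end. `neighborhood_profile` is a hypothetical helper and the venue data is made up.

```python
import pandas as pd

def neighborhood_profile(categories, distances):
    """Return a dict of category shares plus the mean distance for one neighborhood."""
    profile = pd.Series(categories).value_counts(normalize=True).to_dict()
    profile['distance'] = sum(distances) / len(distances)
    return profile

# Hypothetical venue data for two neighborhoods
rows = [
    neighborhood_profile(['Park', 'Café', 'Park', 'Church'], [100, 150, 200, 150]),
    neighborhood_profile(['Café', 'Gas Station'], [80, 120]),
]

# Categories missing from a neighborhood become NaN and are then set to 0
profiles = pd.DataFrame(rows).fillna(0)
print(profiles)
```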


In [99]:
cdmx_categories_mean_df.head()

Unnamed: 0,Building,Bus Stop,Café,Chiropractor,Church,Community College,Convenience Store,Department Store,Gas Station,Ice Cream Shop,...,Child Care Service,Platform,Mongolian Restaurant,Airport Lounge,Halal Restaurant,Pakistani Restaurant,Badminton Court,Tiki Bar,Hockey Arena,Racecourse
0,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,...,,,,,,,,,,
1,0.1,,,,,,,,0.033333,,...,,,,,,,,,,
2,0.066667,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


Verify that all neighborhoods are in the dataframe

In [100]:
cdmx_categories_mean_df.shape

(1808, 614)

In [101]:
df_cdmx.shape

(1808, 4)

Deal with NaN values: a NaN means the category did not appear in that neighborhood, so it is replaced with 0

In [102]:
cdmx_categories_mean_df.replace(np.nan, 0, inplace=True)
cdmx_categories_mean_df.head()

Unnamed: 0,Building,Bus Stop,Café,Chiropractor,Church,Community College,Convenience Store,Department Store,Gas Station,Ice Cream Shop,...,Child Care Service,Platform,Mongolian Restaurant,Airport Lounge,Halal Restaurant,Pakistani Restaurant,Badminton Court,Tiki Bar,Hockey Arena,Racecourse
0,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Scaling was tried because of the distance column, but it did not improve the result, so the raw values are used

In [103]:
# Scaling the features didn't give me a better result
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X = scaler.fit_transform(cdmx_categories_mean_df.values)
# 
X = cdmx_categories_mean_df.values
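
The `distance` column is in the hundreds (144.133333 for the first neighborhood), while the category shares lie between 0 and 1. Unscaled Euclidean distances are therefore dominated by that one column. The sketch below shows one possible alternative to standardizing everything: min-max scaling only the distance column. The numbers are made up, and this is not the approach used in the notebook.

```python
import numpy as np

# Hypothetical feature matrix: two category shares and a mean distance (last column)
X_demo = np.array([
    [0.10, 0.00, 144.0],
    [0.00, 0.20, 150.0],
    [0.10, 0.00, 300.0],
])

# Min-max scale only the last column so it lies in [0, 1] like the shares
dist = X_demo[:, -1]
X_scaled = X_demo.copy()
X_scaled[:, -1] = (dist - dist.min()) / (dist.max() - dist.min())

print(X_scaled)
```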

Let's cluster neighborhoods

In [157]:
# set number of clusters
num_clusters = 20

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=1000)
# fit_predict fits the model and returns the cluster label of each row
labels = k_means.fit_predict(X)
print('Labels')
print(labels.shape)

Labels
(1808,)
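
The number of clusters above is chosen by hand. One common heuristic for picking it is to compare the inertia (within-cluster sum of squares) for several values of k and look for the point where it stops dropping sharply. The sketch below does this on synthetic data, not on the notebook's features. It uses a small `n_init` for speed and a fixed `random_state` so the run is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs (not the notebook's features)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for several k and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```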


Assign the predicted label to each neighborhood

In [158]:
# Drop a 'label' column left over from a previous run, if there is one
df_cdmx = df_cdmx.drop(columns=['label'], errors='ignore')
cdmx_labeled_df = df_cdmx.copy()
cdmx_labeled_df['label'] = labels
cdmx_labeled_df.head()

Unnamed: 0,neighborhood,borough,lat,lng,label
0,IRRIGACION,MIGUEL HIDALGO,19.442955,-99.209936,5
1,MARINA NACIONAL (U HAB),MIGUEL HIDALGO,19.446632,-99.179511,12
2,PEDREGAL DE STO DOMINGO VI,COYOACAN,19.323403,-99.165468,5
3,VILLA PANAMERICANA 7MA. SECCIN (U HAB),COYOACAN,19.304604,-99.167762,5
4,VILLA PANAMERICANA 6TA. SECCIN (U HAB),COYOACAN,19.311224,-99.169648,5
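
With the labels in place, recommending neighborhoods similar to someone's old one can be sketched as a lookup of the other members of the same cluster. `similar_neighborhoods` is a hypothetical helper, and the sample rows are the first five from the table above. The sketch assumes neighborhood names are unique, which may not hold across boroughs in the full dataset.

```python
import pandas as pd

# First five labeled rows from the table above
labeled = pd.DataFrame({
    'neighborhood': ['IRRIGACION', 'MARINA NACIONAL (U HAB)', 'PEDREGAL DE STO DOMINGO VI',
                     'VILLA PANAMERICANA 7MA. SECCIN (U HAB)',
                     'VILLA PANAMERICANA 6TA. SECCIN (U HAB)'],
    'borough': ['MIGUEL HIDALGO', 'MIGUEL HIDALGO', 'COYOACAN', 'COYOACAN', 'COYOACAN'],
    'label': [5, 12, 5, 5, 5],
})

def similar_neighborhoods(df, name):
    """Return the other neighborhoods that share a cluster label with `name`."""
    label = df.loc[df['neighborhood'] == name, 'label'].iloc[0]
    same = df[(df['label'] == label) & (df['neighborhood'] != name)]
    return same[['neighborhood', 'borough']].reset_index(drop=True)

result = similar_neighborhoods(labeled, 'IRRIGACION')
print(result)
```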


Visualize the result

In [176]:
# Assign a color to labels
colors = {}
random.seed(3)
for label in np.unique(labels):
    colors[label] = random_color()
colors

{0: '#79d67f',
 1: '#42c6c6',
 2: '#bd6ac3',
 3: '#f2b725',
 4: '#218cff',
 5: '#6bdf4',
 6: '#f03f38',
 7: '#84ca0c',
 8: '#77fa3a',
 9: '#622c48',
 10: '#f0c660',
 11: '#f3e491',
 12: '#cb5539',
 13: '#4d1d98',
 14: '#76be7b',
 15: '#4da172',
 16: '#c7a5c9',
 17: '#7c150',
 18: '#20c8ba',
 19: '#519cde'}
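
Two of the entries above, `#6bdf4` and `#7c150`, have only five hex digits. This happens when the random integer is below 0x100000 and the hex string is not zero-padded, and a five-digit string is not a valid hex color. A minimal sketch of the padded formatting follows (`to_hex_color` is a hypothetical helper):

```python
def to_hex_color(number):
    """Format an integer in [0, 0xFFFFFF] as a six-digit '#rrggbb' string."""
    return '#{:06x}'.format(number)

unpadded = '#' + format(0x6bdf4, 'x')   # five digits, not a valid color
padded = to_hex_color(0x6bdf4)          # six digits

print(unpadded, padded)
```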

In [177]:
# Plot
map_cdmx = folium.Map(location=[lat_cdmx, lng_cdmx], zoom_start=10)  
 
# add markers to map
for lat, lng, borough, neighborhood, cluster_label in zip(cdmx_labeled_df['lat'], cdmx_labeled_df['lng'], cdmx_labeled_df['borough'], cdmx_labeled_df['neighborhood'], cdmx_labeled_df['label']):  
    label = '{}, {}'.format(neighborhood, borough)  
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],  
        radius=5,  
        popup=label,  
        color=colors[cluster_label],  
        fill=True,  
        fill_color=colors[cluster_label],  
        fill_opacity=0.7,  
        parse_html=False).add_to(map_cdmx)
    
map_cdmx


<b>In case the map didn't appear</b>
<img src='https://github.com/Ivan-hdz/Coursera_Capstone/blob/master/imgs/cdmx_3.png?raw=true' />
