# Coursera Capstone Project

Load the essential libraries, installing them first if they are not already available.

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# install geopy if it is not already available (comment this line out once installed)
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')


## 1. Introduction to the Objective of this Capstone Project

This notebook analyzes the areas around a central focal point, arranged as a rectangular grid of 5 x 5 points spaced roughly 1 km apart. In latitude and longitude terms, the grid step is 0.01 degrees: about 1.1 km in the north-south direction and somewhat less in the east-west direction (about 0.9 km at the latitude of Athens).
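
As a rough check on the grid spacing, the sketch below converts a step of 0.01 degrees into kilometres using a spherical-Earth approximation. The mean Earth radius of 6371 km, the helper name `step_in_km`, and the sample latitude of 38 degrees north (roughly that of Athens) are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation, an assumption)

def step_in_km(step_deg, latitude_deg):
    """Return (north_south_km, east_west_km) covered by a step of step_deg degrees."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360  # about 111.2 km
    north_south = step_deg * km_per_degree
    # meridians converge towards the poles, so east-west distances shrink with cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns_km, ew_km = step_in_km(0.01, 38.0)  # 38.0 is a hypothetical sample latitude
print('0.01 degrees is about {:.2f} km north-south and {:.2f} km east-west'.format(ns_km, ew_km))
```

Under these assumptions the grid cells are slightly taller than they are wide, which should be acceptable for a coarse "what is near" analysis.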

The user chooses the center by setting the "Central_Address" variable to an address that the Google Geocoding API can recognize. The API call returns the coordinates of that address. A dataframe with the 25 grid points around this central point, and the neighborhoods they fall in, is then built.

The **objective** of this exercise is to cluster these 25 neighborhoods and to indicate how much the patterns and the profile of the city change as someone moves away from their original point of interest (POI). This POI might be the hotel they are going to stay in, a new home, their business address or anything else that they need to analyze from a "what is near" viewpoint.

As a first step, we set the address and retrieve its coordinates with a call to the Google Maps Geocoding API.

In [None]:
API_key = 'AIzaSyA57R4UMqyiVNiVnEOTF9rjYDcHgzLaSto'

Central_Address = 'Diovounioti 1, Chalandri 15231, Greece'
# construct the API call; passing params lets requests URL-encode the address
url = 'https://maps.googleapis.com/maps/api/geocode/json'

response = requests.get(url, params={'key': API_key, 'address': Central_Address}).json() # get response
# print(response)
try:
    geographical_data = response['results'][0]['geometry']['location'] # get geographical coordinates
    central_latitude = geographical_data['lat']
    central_longitude = geographical_data['lng']
    print('The central point has latitude {} and longitude {}.'.format(central_latitude, central_longitude))
except (KeyError, IndexError):
    # no usable result, e.g. an unrecognized address or a rejected API key
    print('Geocoding failed for {}!'.format(Central_Address))
    print(response)
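
The Geocoding API reports problems through a `status` field in the JSON body, so the result can be checked before indexing into it. The sketch below works on hand-written sample dictionaries rather than a live call; the helper name `extract_location` and the sample coordinates are hypothetical.

```python
def extract_location(response):
    """Return (lat, lng) from a geocoding response dict, or None if there is no result."""
    if response.get('status') != 'OK' or not response.get('results'):
        return None
    location = response['results'][0]['geometry']['location']
    return location['lat'], location['lng']

# hand-written samples that mimic the shape of the JSON body (values are made up)
ok_response = {'status': 'OK',
               'results': [{'geometry': {'location': {'lat': 38.02, 'lng': 23.80}}}]}
empty_response = {'status': 'ZERO_RESULTS', 'results': []}

print(extract_location(ok_response))
print(extract_location(empty_response))
```

Returning `None` instead of raising keeps the calling cell simple: it can stop early when no coordinates are available.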


Once the first call has completed successfully, we set up the virtual grid of 5x5 points around this conceptual pin on the map, moving roughly 1 km at a time from the central point.

In [None]:
# Create the grid around the central point
list_of_points = []
fm_deg_to_km = 0.01 # grid step in degrees; 0.01 degrees is roughly 1 km
for north_south in range(2,-3,-1):
    for east_west in range(2,-3,-1):
        list_of_points.append((central_latitude+north_south*fm_deg_to_km, central_longitude+east_west*fm_deg_to_km, north_south, east_west))
len(list_of_points)
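
The same grid logic can be tried offline. The sketch below uses a made-up central point (38.0, 23.8) instead of a geocoded one and a hypothetical helper `build_grid`, and confirms that the central point sits in the middle of the 25 entries.

```python
def build_grid(center_lat, center_lng, step_deg=0.01, half_width=2):
    """Return (lat, lng, north_south, east_west) tuples, ordered north to south and east to west."""
    points = []
    for north_south in range(half_width, -half_width - 1, -1):
        for east_west in range(half_width, -half_width - 1, -1):
            points.append((center_lat + north_south * step_deg,
                           center_lng + east_west * step_deg,
                           north_south, east_west))
    return points

grid = build_grid(38.0, 23.8)  # hypothetical coordinates, for illustration only
print(len(grid))   # 25
print(grid[12])    # the central point, with offsets (0, 0)
```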

Convert the list into a pandas DataFrame, which is easier to handle and enrich.

In [None]:
df = pd.DataFrame(list_of_points)
df.columns = ['Latitude', 'Longitude', 'North_South', 'East_West']

For each point of the 5x5 grid, we get the actual address (e.g. street, neighborhood, country etc.).

In [None]:
from geopy.geocoders import GoogleV3
geolocator = GoogleV3(api_key=API_key)
for latitude, longitude in zip(df['Latitude'], df['Longitude']):
    # request a single result explicitly, since the default of exactly_one differs across geopy versions
    location = geolocator.reverse('{}, {}'.format(latitude, longitude), exactly_one=True)
    # the formatted address is expected to look like 'street, neighborhood, country'
    address_fm_GPS = [part.strip() for part in location.address.split(',')]
    Street = address_fm_GPS[0]
    Neighborhood = address_fm_GPS[1]
    Country = address_fm_GPS[2]
    # select the row of the current grid point and fill in its address parts
    cond = ((df['Latitude'] == latitude) & (df['Longitude'] == longitude))
    df.loc[cond, 'Street'] = Street
    df.loc[cond, 'Neighborhood'] = Neighborhood
    df.loc[cond, 'Country'] = Country
df.head()

#### Explore the Dataframe

We can run a few basic statistics on the derived dataset.

In [None]:
df.groupby('Neighborhood')['Neighborhood'].count()

In [None]:
df[df['Neighborhood'].str.contains('152 31')]

## 2. Explore and cluster the neighborhoods around the central point

Methodology: At this point, we have the grid of 25 points around the central address selected at the beginning. For each point, we call Foursquare to retrieve up to 100 venues within a 500 m radius, together with their categories. We then compute summary statistics on the categories most common in each neighborhood and, finally, cluster the neighborhoods by the similarity of their venue types.

Before we start, we can take a quick look at how our grid looks on the map.

In [None]:
# create a map centered on the central point
map_of_POI = folium.Map(location=[central_latitude, central_longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood, street in zip(df['Latitude'], df['Longitude'], df['Neighborhood'], df['Street']):
    label = '{}, {}'.format(neighborhood, street)
    label = folium.Popup(label, parse_html=True)
    if lat == central_latitude and lng == central_longitude:
        color = 'red'
    else:
        color = 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color=color,
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_of_POI)  
    
map_of_POI

#### Define Foursquare credentials and Version

The following cells fetch the venue information from Foursquare.

In [None]:
# The code was removed by Watson Studio for sharing.
CLIENT_ID = 'YILUDYB0P1W1X4HNOK5C3AVVWFY3LK1NZAX4AJGVUESNOFPG'
CLIENT_SECRET = 'I2LT5WEAFU5PGXYCYALNWAWAXGGFSC5L2C1AZGWWW1YQDUSM'
LIMIT = 100
VERSION = 20180901

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # results = requests.get(url).json()
        # print(results)
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
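
A venue with an empty `categories` list would make the list comprehension above fail with an `IndexError`. The sketch below shows one possible way to flatten the items defensively; it runs on a hand-written sample that mimics the shape of the `items` list used above, and the helper name `flatten_items` and the fallback label `'Uncategorized'` are assumptions.

```python
def flatten_items(name, lat, lng, items):
    """Turn Foursquare-style items into flat rows, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# made-up sample items; the second venue has no category
sample_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 38.021, 'lng': 23.801},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Kiosk B', 'location': {'lat': 38.022, 'lng': 23.802},
               'categories': []}},
]
rows = flatten_items('Sample Neighborhood', 38.02, 23.80, sample_items)
for row in rows:
    print(row)
```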

In [None]:
All_around_venues = getNearbyVenues(df['Neighborhood'], df['Latitude'], df['Longitude'], radius=500)

In [None]:
print(All_around_venues.shape)
All_around_venues.head()

In [None]:
All_around_venues.loc[All_around_venues['Venue Category'] == 'ATM']

In [None]:
All_around_venues.groupby('Neighborhood')['Neighborhood'].count()

In [None]:
print('There are {} unique categories.'.format(All_around_venues['Venue Category'].nunique()))

In [None]:
print('There are {} unique neighborhoods.'.format(All_around_venues['Neighborhood'].nunique()))

#### 2a. Analyze each neighborhood

In [None]:
# one hot encoding
All_around_onehot = pd.get_dummies(All_around_venues[['Venue Category']], prefix="", prefix_sep="")

# add the neighborhood column back to the dataframe
All_around_onehot['Neighborhood'] = All_around_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [All_around_onehot.columns[-1]] + list(All_around_onehot.columns[:-1])
All_around_onehot = All_around_onehot[fixed_columns]

All_around_onehot.head()

In [None]:
All_around_onehot.shape

Given that some points of the grid fall into the same neighborhood, we recalculate the grid by averaging the coordinates of all the points within the same neighborhood.

In [None]:
All_around_by_Neighborhood = df[['Neighborhood','Latitude','Longitude']].groupby('Neighborhood').mean().reset_index()
All_around_by_Neighborhood

In [None]:
All_around_grouped = All_around_onehot.groupby('Neighborhood').mean().reset_index()
All_around_grouped

In [None]:
All_around_grouped.shape
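
To make the encoding and averaging steps concrete, the sketch below repeats them on a tiny made-up table. After grouping, each value is the share of a neighborhood's venues that fall in a category. The neighborhood and category names are invented for illustration.

```python
import pandas as pd

# made-up venues: neighborhood A has three venues, B has one
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Bakery', 'Park'],
})

# one column per category, with a 1 marking each venue's category
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# the mean of the indicator columns is the frequency of each category per neighborhood
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

Because every row of the grouped table sums to 1, neighborhoods with many venues and neighborhoods with few venues end up on the same scale before clustering.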

We take a quick look at the most common venues for each neighborhood.

In [None]:
num_top_venues = 5

for hood in All_around_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = All_around_grouped[All_around_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
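
As a quick illustration of what `return_most_common_venues` returns, the sketch below applies an equivalent function to one made-up row. The function is repeated here under the hypothetical name `top_categories` only so that the example runs on its own; the frequencies are invented.

```python
import pandas as pd

def top_categories(row, num_top_venues):
    # skip the first entry (the neighborhood name) and rank the remaining frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# made-up frequencies for a single neighborhood
toy_row = pd.Series({'Neighborhood': 'A', 'Bakery': 0.2, 'Cafe': 0.5, 'Park': 0.3})
print(top_categories(toy_row, 2))
```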

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    # use 1st, 2nd, 3rd for the first three columns and 'th' afterwards
    if ind <= 2:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    else:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = All_around_grouped['Neighborhood']

for ind in np.arange(All_around_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(All_around_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

#### 2b. Clustering Step

Given that the number of distinct neighborhoods cannot exceed 25, we consider 5 clusters to be both sufficient and simple enough to interpret.
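
One possible way to sanity-check the choice of k is to compare silhouette scores across several candidate values. This is not part of the original analysis; the sketch below uses synthetic two-dimensional points with three obvious groups, not the venue data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: three tight groups of four points each (not the notebook's data)
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.1 * rng.randn(4, 2) for c in centers])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)  # closer to 1 means better-separated clusters

best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

With real venue frequencies the scores may be far less clear-cut, so they are better read as a guide than as a rule.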

In [None]:
# set number of clusters
kclusters = 5

All_around_grouped_clustering = All_around_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(All_around_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
All_around_grouped.shape

In [None]:
# work on a copy so that All_around_by_Neighborhood itself is left unchanged
All_around_merged = All_around_by_Neighborhood.copy()

# add clustering labels (both dataframes are sorted by neighborhood, so the rows line up)
All_around_merged['Cluster Labels'] = kmeans.labels_

# add the most common venues for each neighborhood
All_around_merged = All_around_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

All_around_merged.head() # check the last columns!

In [None]:
# create map
map_clusters = folium.Map(location=[central_latitude, central_longitude], zoom_start=12)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(All_around_merged['Latitude'], All_around_merged['Longitude'], All_around_merged['Neighborhood'], All_around_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


#### 2c. Examine the generated Clusters

In [None]:
print(All_around_merged.groupby('Cluster Labels')['Cluster Labels'].count())

In [None]:
# columns to display when inspecting a cluster
ID_columns = ['Neighborhood', 'Cluster Labels']
show_columns = [x for x in All_around_merged.columns if 'Venue' in x]


We now examine each of the clusters in more detail.

In [None]:
All_around_merged.loc[All_around_merged['Cluster Labels'] == 0, ID_columns + show_columns]

In [None]:
All_around_merged.loc[All_around_merged['Cluster Labels'] == 1, ID_columns + show_columns]

In [None]:
All_around_merged.loc[All_around_merged['Cluster Labels'] == 2, ID_columns + show_columns]

In [None]:
All_around_merged.loc[All_around_merged['Cluster Labels'] == 3, ID_columns + show_columns]

In [None]:
All_around_merged.loc[All_around_merged['Cluster Labels'] == 4, ID_columns + show_columns]

### Conclusion

The clustering exercise identified 4 distinctive areas (differentiated mostly by their Souvlaki Shops, Scenic Lookouts, Bus Stops and Movie Theaters, respectively), where the mix of venues is clearly dissimilar to that of the remaining neighborhoods. Cluster 0 contains mostly mixed neighborhoods, which nonetheless consistently feature many coffee shops and restaurants.