# Capstone Project - The Battle of Neighborhoods 2


## Applied Data Science Capstone - IBM/Coursera

## Table of Contents

1. <a href="#item1">Introduction: Business Problem</a>
2. <a href="#item2">Data</a>  
3. <a href="#item3">Methodology</a> 
4. <a href="#item3">Analysis</a>
5. <a href="#item4">Results and Discussion</a>  
6. <a href="#item5">Conclusion</a>  

## Introduction: Business Problem

A real estate development and investment company is trying to identify and shortlist retail opportunities in the Greater Toronto area based on trends and popularity.

The company realizes the importance and relevance of social media in understanding the pulse of the market and seeks to use data as a key driver in decision making.

How can the company use social trends to select popular venues, understand and identify characteristics of the venues, and select new locations with similar characteristics which would have high growth potential?
In this study, as a Data Scientist, I provide a point of view on how data can be acquired, cleansed, curated and analyzed through machine learning techniques to better drive the decision-making process.


## Data


To drive the understanding and analysis in this data science project, I have used the following data sets:

1. Toronto neighborhoods data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, which was also used in the week 3 assignment. This data set includes the Postal Codes, Boroughs and Neighborhoods in the Toronto area, where postal codes start with the letter M.

2. The above data set was augmented with geo codes for each postal code from the data set provided by Cognitive Class at http://cocl.us/Geospatial_data. Upon merging the data sets, the resulting data set included geo coordinates, i.e. latitude and longitude, for each postal code.

3. Foursquare Places API for Venues: Foursquare provides various Regular and Premium API endpoints. Regular endpoints include basic venue firmographic data, category, and ID. Premium endpoints include rich content such as ratings, URLs, photos, tips, menus, etc. For the analysis, I have used the “explore” Regular API endpoint to get venue recommendations via https://developer.foursquare.com/docs/venues/explore.

The data sets used were already curated and did not require any additional preparation such as reformatting. The only preparation steps were merging and reshaping of the data frames during the analysis.


## Methodology

## Neighborhood Candidate Selection

In [None]:
# Import required libraries
!conda install -c conda-forge folium=0.5.0 --yes
import folium
print('Folium installed and imported!')



In [None]:
import numpy as np  # useful for scientific computing in Python
import pandas as pd # primary data structure library
import requests
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import folium
from sklearn.cluster import KMeans



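Fetch the neighborhood data from Wikipedia and read it into a DataFrame. Next, filter and transform the records: drop rows with an unassigned borough, combine neighborhoods that share a postal code, and use the borough name where the neighborhood is not assigned.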

In [None]:
# Read HTML content
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0][1:]

# Rename columns
df.rename(columns={0:'PostalCode',1:'Borough',2:'Neighborhood'},inplace=True)

# Filter dataframe: drop rows with Borough as 'Not assigned'
df.drop(df[df.Borough == 'Not assigned'].index, inplace=True)

# Combine neighborhoods that have the same PostalCode and Borough
gdf = df.groupby(['PostalCode','Borough']).agg(lambda col: ', '.join(col)).reset_index()

# Assign Borough value to Neighborhood that are 'Not assigned'
gdf.Neighborhood = gdf.Borough.where(gdf.Neighborhood == 'Not assigned',gdf.Neighborhood)

print('Total number of Neighborhoods: {}'.format(gdf.shape[0]))
gdf.head()
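
Because the live Wikipedia table can change over time, the three cleaning rules are easier to check on a small offline sample. The sketch below uses made-up rows (the sample values and the names `raw`, `clean` and `combined` are assumptions for illustration, not the real table) and applies the same filter, combine and fallback steps.

```python
import pandas as pd

# Hypothetical sample rows mimicking the structure of the postal code table
raw = pd.DataFrame({
    'PostalCode':   ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'North York', 'Downtown Toronto',
                     'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                     'Regent Park', 'Not assigned'],
})

# 1. Drop rows whose Borough is 'Not assigned'
clean = raw[raw['Borough'] != 'Not assigned']

# 2. Combine neighborhoods that share a PostalCode and Borough
combined = (clean.groupby(['PostalCode', 'Borough'])['Neighborhood']
                 .agg(', '.join)
                 .reset_index())

# 3. Fall back to the Borough name where the Neighborhood is 'Not assigned'
combined['Neighborhood'] = combined['Neighborhood'].where(
    combined['Neighborhood'] != 'Not assigned', combined['Borough'])

print(combined)
```

In this sample, the two M5A rows collapse into one record and the M7A record takes its borough name as the neighborhood.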

Fetch geocode file and read it into a DataFrame.

In [None]:
geocodes = pd.read_csv('http://cocl.us/Geospatial_data')
geocodes.rename(columns={'Postal Code': 'PostalCode'},inplace=True)
print('Total Geo Code entries: {}'.format(geocodes.shape[0]))
geocodes.head()

Merge the neighborhood and geocode DataFrames.

In [None]:
neighborhoods = gdf.merge(geocodes, how='left', on=['PostalCode'])
print('The dataframe has {} Boroughs and {} Neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)
neighborhoods.head()

Visualize the neighborhoods as markers overlaid on a map of Toronto created using Folium.

In [None]:
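# get the coordinates of Toronto (Nominatim requires a user_agent; the value here is arbitrary)
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode('Toronto, Ontario')
latitude, longitude = location.latitude, location.longitude

# create map of city using latitude and longitude values
map_city = folium.Map(location=[latitude, longitude], zoom_start=10)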

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], 
                                           neighborhoods['Longitude'], 
                                           neighborhoods['Borough'], 
                                           neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_city)  
map_city

Fetch venue data from Foursquare for each neighborhood using the “explore” API endpoint. Aggregate the data into a single DataFrame.

In [None]:
# Set Foursquare credentials and version
CLIENT_ID = 'OIJJKMXL2BGR44AP1EFFIGVDGUAL1FDUCNTYHNSI0CSS22NC' # your Foursquare ID
CLIENT_SECRET = 'OIJJKMXL2BGR44AP1EFFIGVDGUAL1FDUCNTYHNSI0CSS22NC' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)


In [None]:
# Create function to pull venues for a neighborhood using the "explore" API endpoint
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=50):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['id'], 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue ID',               
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
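
The function above indexes `categories[0]` directly, which raises an error if a venue comes back without a category. As a hedged sketch only, the helper below (the name `extract_venue_rows` and the sample items are hypothetical) shows one way to flatten items of the same shape while skipping such venues. It works on an in-memory list, so it makes no API call.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten 'explore' items into tuples, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        if not categories:
            continue  # no category available to use as a feature
        location = venue.get('location', {})
        rows.append((name, lat, lng,
                     venue.get('id'), venue.get('name'),
                     location.get('lat'), location.get('lng'),
                     categories[0]['name']))
    return rows

# Made-up fragment shaped like response['groups'][0]['items']
sample_items = [
    {'venue': {'id': 'a1', 'name': 'Cafe One',
               'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'id': 'b2', 'name': 'Unnamed Spot',
               'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]

rows = extract_venue_rows('Sample Neighborhood', 43.65, -79.38, sample_items)
print(rows)
```

Only the first sample venue is kept, because the second has an empty category list.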

In [None]:
# Execute the function to get the list of venues for each neighborhood.
# Aggregate the data into the city_venues DataFrame
city_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                 latitudes=neighborhoods['Latitude'],
                                 longitudes=neighborhoods['Longitude']
                                )
city_venues.sort_values(by=['Neighborhood'],inplace=True)

In [None]:
# OPTIONAL
# Export to CSV file
city_venues.to_csv('city_top_venues.csv')


In [None]:
# OPTIONAL
# Read from CSV file
city_venues = pd.read_csv('city_top_venues.csv',index_col=0)

In [None]:
# Print venue and neighborhood record counts
print('Pulled {} venues in {} neighborhoods.'.format(
    city_venues.shape[0],
    len(city_venues['Neighborhood'].unique())
))

city_venues.head()

# Exploratory Data Analysis

Generate statistics from the venue data such as: a) Venue counts by neighborhood b) Top 20 neighborhoods by venue count


In [None]:
# Group venues by neighborhood, aggregate and sort by venue count
grouped = city_venues.groupby('Neighborhood').size().reset_index(name='Venue Count by Neighborhood').sort_values(by='Venue Count by Neighborhood',ascending=False)
print('Neighborhood Count: {}'.format(neighborhoods.shape[0]))
print('Venue Count: {}'.format(city_venues.shape[0]))
print('Neighborhoods with venues: {}'.format(len(city_venues['Neighborhood'].unique())))
grouped.describe()

In [None]:
top_neighborhoods = pd.DataFrame(grouped['Neighborhood'][0:20])
top_neighborhoods.set_index('Neighborhood',inplace=True)
top_neighborhoods.head(20)

Visualize the venue data by plotting the number of neighborhoods for each venue count and range of venue counts


In [None]:
# Plot number of Neighborhoods for each Venue Count
vc='Venue Count by Neighborhood'
nc='# of Neighborhoods'
counted = grouped.groupby(vc).size().reset_index(name=nc).sort_values(by=vc,ascending=True)
ax = counted.plot(kind='bar', x=vc, y=nc, legend=False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('# of Neighborhoods for each Venue Count')
plt.xlabel('Venue Count')
plt.ylabel(nc)
plt.yticks([])
for p in ax.patches:
    ax.annotate("%i" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()

# Plot number of Neighborhoods for Venue Count range
bins=np.arange(0,60,10)
vc_ranged = counted.rename(columns={vc:'VCR'}).groupby(pd.cut(counted[vc],bins)).sum().drop(columns='VCR').reset_index()
ax = vc_ranged.plot(kind='bar', x=vc, y=nc, legend=False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('# of Neighborhoods for Venue Count Range')
plt.xlabel('Venue Count Range')
plt.ylabel(nc)
plt.yticks([])
for p in ax.patches:
    ax.annotate("%i" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()

Visualize the top 10 categories across the entire set of venues (all neighborhoods) as well as across the top 20 neighborhoods.


In [None]:

print('There are {} unique categories.'.format(len(city_venues['Venue Category'].unique())))

top_categories_all = city_venues.groupby('Venue Category').size().reset_index(name='count').sort_values(by='count', ascending=False).reset_index(drop=True)[0:10]
vc='Venue Category'
nv='count'
ax = top_categories_all.plot(kind='bar', x=vc, y=nv, legend=False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('# of Venues by Category across all Neighborhoods')
plt.xlabel('Category')
plt.ylabel('# of Venues')
plt.yticks([])
for p in ax.patches:
    ax.annotate("%i" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()

top_categories_top_neighborhoods = city_venues.join(top_neighborhoods,on='Neighborhood',how='inner').groupby('Venue Category').size().reset_index(name='count').sort_values(by='count', ascending=False).reset_index(drop=True)[0:10]
ax = top_categories_top_neighborhoods.plot(kind='bar', x=vc, y=nv, legend=False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('# of Venues by Category across top 20 Neighborhoods')
plt.xlabel('Category')
plt.ylabel('# of Venues')
plt.yticks([])
for p in ax.patches:
    ax.annotate("%i" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()

One hot encode the data.


In [None]:
# one hot encoding
city_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
city_onehot['Neighborhood'] = city_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [city_onehot.columns[-1]] + list(city_onehot.columns[:-1])
city_onehot = city_onehot[fixed_columns]

print('Dimensions of one hot encoded dataframe: {}'.format(city_onehot.shape))

city_grouped = city_onehot.groupby('Neighborhood').mean().reset_index()

print('Dimensions of one hot encoded dataframe grouped by Neighborhood: {}'.format(city_grouped.shape))
Dimensions of one hot encoded dataframe: (1692, 255)
Dimensions of one hot encoded dataframe grouped by Neighborhood: (99, 255)
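
To make the encoding step concrete, here is a minimal sketch on made-up data (the `toy` frames and their values are assumptions for illustration). It shows that taking the mean of the one hot columns per neighborhood yields the share of venues in each category, which is the feature used later for clustering.

```python
import pandas as pd

# Hypothetical venues for two neighborhoods
toy = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bakery', 'Park', 'Park'],
})

# One column per category, one row per venue
toy_onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='').astype(float)
toy_onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Mean per neighborhood = share of that neighborhood's venues in each category
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

Neighborhood A ends up with 0.50 for Coffee Shop and 0.25 each for Park and Bakery, while neighborhood B has 1.00 for Park.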

In [None]:
# Print each neighborhood along with its respective top 5 most common venues (category) by frequency.
num_top_venues = 5

for hood in city_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = city_grouped[city_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Create a DataFrame containing the top 10 venues by neighborhood based on the one hot encoded data and visualize the data.


In [None]:
# Create a function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
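
As a quick check of what this function returns, the sketch below applies it to a single made-up frequency row (the row values are hypothetical). The function is repeated only so the block runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and rank the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row: neighborhood name followed by category frequencies
row = pd.Series({'Neighborhood': 'A', 'Bakery': 0.25, 'Coffee Shop': 0.50,
                 'Park': 0.15, 'Gym': 0.10})

top3 = return_most_common_venues(row, 3)
print(list(top3))
```

The result lists category names ordered from most to least frequent, truncated to the requested number.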

In [None]:
# Call the function to populate the DataFrame with top 10 venues for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = city_grouped['Neighborhood']

for ind in np.arange(city_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
print("1st Most Common Venue: {}".format(neighborhoods_venues_sorted['1st Most Common Venue'].describe().top))
print("2nd Most Common Venue: {}".format(neighborhoods_venues_sorted['2nd Most Common Venue'].describe().top))
print("3rd Most Common Venue: {}".format(neighborhoods_venues_sorted['3rd Most Common Venue'].describe().top))
print("4th Most Common Venue: {}".format(neighborhoods_venues_sorted['4th Most Common Venue'].describe().top))
print("5th Most Common Venue: {}".format(neighborhoods_venues_sorted['5th Most Common Venue'].describe().top))
print("6th Most Common Venue: {}".format(neighborhoods_venues_sorted['6th Most Common Venue'].describe().top))
print("7th Most Common Venue: {}".format(neighborhoods_venues_sorted['7th Most Common Venue'].describe().top))
print("8th Most Common Venue: {}".format(neighborhoods_venues_sorted['8th Most Common Venue'].describe().top))
print("9th Most Common Venue: {}".format(neighborhoods_venues_sorted['9th Most Common Venue'].describe().top))
print("10th Most Common Venue: {}".format(neighborhoods_venues_sorted['10th Most Common Venue'].describe().top))

# Analysis (Machine Learning)

Based on the data available for neighborhoods and venues, we can define venue categories as features for machine learning. Given there are roughly 255 categories (the width of the one hot encoded DataFrame) across a data set of 99 neighborhoods, k-means clustering is a reasonable approach for grouping the neighborhoods. Each category maps naturally to a feature used in the k-means model. The initial value of ‘k’ was set to 7, the rounded square root of 49.5 (99 divided by 2). Fitting the model generates a cluster label for each neighborhood.
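
The starting value of k follows a common rule of thumb, k ≈ sqrt(n / 2), where n is the number of samples. The short sketch below reproduces the arithmetic; the variable names are illustrative, and the rule is only a heuristic for a first guess, not a guarantee of a good clustering.

```python
import math

n_neighborhoods = 99  # neighborhoods with venue data, as reported above

# Rule-of-thumb starting point: square root of half the sample count
k_initial = round(math.sqrt(n_neighborhoods / 2))
print(k_initial)  # 7
```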

In [None]:
# set number of clusters
kclusters = 7

city_grouped_clustering = city_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

city_merged = neighborhoods

# join neighborhoods with neighborhoods_venues_sorted to add cluster labels and top venues for each neighborhood
city_merged = city_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# drop rows with no assigned clusters
city_merged.dropna(inplace=True)

city_merged.head() # check the last columns!

Visualize the resulting clusters on a map of Toronto using Folium.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(city_merged['Latitude'], city_merged['Longitude'], city_merged['Neighborhood'], city_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters



# Results


In this section we examine the clusters and view cluster specific details.


In [None]:
# Plot the distribution of neighborhoods across clusters
cn = city_merged.groupby('Cluster Labels').size().reset_index(name='# of Neighborhoods')
ax = cn.plot(kind='bar', x='Cluster Labels', y='# of Neighborhoods', legend=False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('Distribution of Neighborhoods by cluster')
plt.xlabel('Cluster')
plt.ylabel('# of Neighborhoods')
plt.yticks([])
for p in ax.patches:
    ax.annotate("%i" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()

# Plot the distribution of venues across clusters
cv = city_venues.merge(city_merged, on='Neighborhood')[['Venue','Cluster Labels']].groupby('Cluster Labels').size().reset_index(name='# of Venues')
ax = cv.plot(kind='bar', x='Cluster Labels', y='# of Venues', legend=False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('Distribution of Venues by cluster')
plt.xlabel('Cluster')
plt.ylabel('# of Venues')
plt.yticks([])
for p in ax.patches:
    ax.annotate("%i" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()

# Display most common venue category in each cluster
city_venues.merge(city_merged, on='Neighborhood')[['Cluster Labels','Venue Category']].groupby(['Cluster Labels','Venue Category']).size().reset_index(name='Count').sort_values(by=['Cluster Labels','Count'],ascending=[True,False]).groupby(['Cluster Labels']).head(1)

The plots above present the distribution of the entire set of neighborhoods (99) and venues (1692) across the 7 clusters. Each cluster groups neighborhoods that share a similar mix of venue categories, since those categories form the feature set. A key point is the uneven distribution of neighborhoods and venues across the clusters: clusters 5 and 6 dominate, which indicates that the neighborhoods within each of these two clusters have similar venue profiles.

Next, we look at the most common venue in each cluster. Coffee Shops and Fast Food Restaurants are the most common venues in clusters 5 and 6 respectively.


In [None]:
city_merged.loc[city_merged['Cluster Labels'] == 0, city_merged.columns[[1] + [2] + list(range(5, city_merged.shape[1]))]]

Cluster 6 is the second largest cluster with 34 neighborhoods and includes 439 venues. Like cluster 5, it covers most of the boroughs. The most common venue category is Fast Food Restaurant. Apart from restaurants, this cluster includes a wide range of retail outlets.

# Discussion

I observed the following during the analysis of the results:

Two clusters predominate across the neighborhoods and venue categories, which indicates similarity or commonality of features among most neighborhoods. The remaining clusters had more distinguishing or unique features.

The k-means clustering approach relied on the frequency of each category across the 255 unique categories. The feature set may be large compared to the number of samples, i.e. the number of neighborhoods (99).

I tried multiple values of k in the k-means clustering. For lower values of k, the larger clusters coalesced into a single cluster. For higher values of k, the number of smaller clusters increased but the larger clusters did not break up noticeably any further.
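
One way to make this comparison of k values repeatable is to record the inertia and silhouette score for each candidate. The sketch below is an assumption about how that could be done, not the procedure used in this study. It runs on synthetic frequency-like data (`X` is a random stand-in for `city_grouped_clustering`), so its numbers say nothing about the Toronto results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: 99 rows of category frequencies
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(20), size=99)

scores = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    # inertia: within-cluster sum of squares; silhouette: separation in [-1, 1]
    scores[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (inertia, silhouette) in scores.items():
    print('k={:2d}  inertia={:.3f}  silhouette={:.3f}'.format(k, inertia, silhouette))
```

On real data, a bend in the inertia curve or a peak in the silhouette score can suggest a value of k worth inspecting more closely.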

The Foursquare data is primarily social and is crowdsourced. I noticed the API calls returned slightly different data sets when executed at various times of the day or day of the week.
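
The variation between pulls can be quantified by comparing the venue IDs returned at two different times. The sketch below uses the Jaccard similarity for this. The function name `snapshot_overlap` and the sample IDs are hypothetical, and this is just one possible measure.

```python
def snapshot_overlap(ids_a, ids_b):
    """Return the Jaccard similarity between two collections of venue IDs."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0  # two empty snapshots are treated as identical
    return len(a & b) / len(a | b)

# Made-up venue IDs from two pulls of the same neighborhood
morning = ['v1', 'v2', 'v3', 'v4']
evening = ['v2', 'v3', 'v4', 'v5']

print(snapshot_overlap(morning, evening))  # 3 shared of 5 distinct IDs -> 0.6
```

A value close to 1 means the two snapshots are nearly identical, while lower values flag neighborhoods whose venue lists are unstable.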

I had planned initially to use the Premium endpoints to fetch ratings but was unable to because of the daily limits on API calls. This extended data could have provided a social dimension, but the data would change frequently.

Based on the results, I have the following recommendations:

1. Run the analysis and compare results over a period, as opposed to a single snapshot, to stabilize the findings.
2. Consider other unsupervised learning methods for comparative analysis.
3. Augment the data set with demographic data for the neighborhoods to get additional insights.


# Conclusion

In conclusion, this study was a positive step for the stakeholders to understand how data from various sources can be used via powerful tools and visualization techniques to derive insights.

From a personal perspective, it gave me exposure to the data science methodology from end to end: framing the business problem, data acquisition and preparation, feature selection, model creation and fitting, and analysis of the results. The libraries for data acquisition, preparation, and visualization demonstrated the value of data science.