# Capstone Project - The Battle of Neighborhoods

## Introduction / Business Problem section

#### Circumstance: Accommodation problem in London, Ontario.

- In this capstone, we work on behalf of stakeholders who are looking for convenient, well-located areas with good public amenities and services in London, Ontario. Specifically, we explore the neighborhoods of London and identify the residential areas with a prosperous local economy.

- Our stakeholders do not yet know what kind of business they should start, so they also want us to suggest promising business types in the recommended areas. The most influential factors would be **the stakeholders' budget** for **the cost of living** and **the price of real estate in the desired areas**. Since this capstone focuses on using Foursquare data to find the most suitable living area, we assume these costs fall within our stakeholders' available budget.

- Working as data scientists, we use data to identify the most feasible and promising neighborhoods based on the criteria listed above. The upsides and downsides of each option are also listed so that our stakeholders can use the deliverables to make their final decision.

- This project targets stakeholders who want to settle down and run their own business in a residential area with good living conditions.

**Problem**
1. Which area of London, Ontario is the best to settle down in?
2. How close should the living area be to the surrounding public services?
3. What kind of business should be recommended?

## Data section

Based on the circumstance defined above, several factors affect our decision:
- The number of neighborhoods in London, Ontario that need to be examined.
- The distance from the living areas to the other venues within each neighborhood.
- The number of business types available within each neighborhood.
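
The distance factor can be made concrete with a great-circle distance between a neighborhood centre and a venue. The following is a minimal sketch, assuming the haversine formula on a spherical Earth with a mean radius of 6371 km; the function name `haversine_km` is hypothetical and is not used elsewhere in this notebook.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2, earth_radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    # Haversine term: squared half-chord length between the two points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))
```

Applied to the `Neighborhood Latitude`/`Neighborhood Longitude` and `Venue Latitude`/`Venue Longitude` columns collected later, a function like this would give a per-venue distance that can be averaged per neighborhood.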

Given these factors, the following data sources are needed:
- The location and coordinates of each neighborhood in London, Ontario are scraped from this **[webpage](http://www.geonames.org/postalcode-search.html?q=london&country=CA&adminCode1=ON&fbclid=IwAR2XipWkuSm3F9YSjjVvFqp7SfYCPl9_XaxiehoPnn-7XmsjtnJBrbKh31g)** using **pandas / BeautifulSoup**.
- The venues and their categories within each neighborhood are extracted with the **Foursquare API**.
- **The extracted venue categories** can be used as a **separate dataset** to build a **recommender system** that suggests business types.
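
One possible shape for such a recommender is sketched below. It is only an illustration under stated assumptions: the input is a table of category frequencies per neighborhood, neighborhoods are compared by cosine similarity, and categories that the target neighborhood lacks are ranked by their similarity-weighted frequency elsewhere. The function `suggest_categories` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def suggest_categories(freq, target, top_n=2):
    """Rank categories missing from `target` by similarity-weighted frequency in other rows."""
    matrix = freq.to_numpy(dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for rows without venues
    unit = matrix / norms[:, None]
    # Cosine similarity of every neighborhood to the target
    sims = pd.Series(unit @ unit[freq.index.get_loc(target)], index=freq.index)
    sims = sims.drop(target)
    # Similarity-weighted sum of the other neighborhoods' category frequencies
    scores = freq.drop(target).mul(sims, axis=0).sum()
    # Keep only categories the target neighborhood does not have yet
    scores = scores[freq.loc[target] == 0]
    return scores.sort_values(ascending=False).head(top_n)

# Made-up frequencies for three neighborhoods
toy = pd.DataFrame(
    {'Cafe': [0.5, 0.4, 0.0], 'Gym': [0.5, 0.3, 0.0],
     'Bakery': [0.0, 0.3, 0.2], 'Park': [0.0, 0.0, 0.8]},
    index=['A', 'B', 'C'],
)
result = suggest_categories(toy, 'A')
print(result)
```

In the toy data, neighborhood B resembles A and also has bakeries, so `Bakery` ranks first for A, while `Park` only appears in the dissimilar neighborhood C and scores zero.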

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests
import folium
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans

In [None]:
url = 'http://www.geonames.org/postalcode-search.html?q=london&country=CA&adminCode1=ON&fbclid=IwAR2XipWkuSm3F9YSjjVvFqp7SfYCPl9_XaxiehoPnn-7XmsjtnJBrbKh31g'
html_file = requests.get(url).text
soup = BeautifulSoup(html_file,'html.parser')

In [None]:
pc = []
borough = []
neigh = []
lat = []
lng = []

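# Walk every second row of the results table, starting after the header row
iterrows = soup.find('table', class_ = 'restable').find_all('tr')
for rows in iterrows[1::2]:
    if len(rows) > 1:
        temp = rows.find_all('td')[1:9]
        # Keep the place cell, the postal code cell, and the last cell (coordinates)
        lst_rows = temp[0:2] + [temp[-1]]
        pc.append(lst_rows[1].text)
        borough.append(lst_rows[0].text.split('(')[0].strip())
        try:
            # Place names look like 'Borough (Neighborhood)'
            neigh.append(lst_rows[0].text.split('(')[1].strip(')'))
        except IndexError:
            # No parentheses: reuse the borough name as the neighborhood
            neigh.append(lst_rows[0].text.split('(')[0].strip())
        # Coordinates are given as 'lat/lng'; parse with float() rather than eval()
        lat.append(float(lst_rows[-1].small.text.split('/')[0]))
        lng.append(float(lst_rows[-1].small.text.split('/')[1]))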
#         for i in lst_rows:
#             print(i.text)
#         print(lst_rows)
            
    
# This Code is used to test for the data getting from the webpage
# rows = soup.find('table', class_ = 'restable').find_all('tr')
# temp = rows[1].find_all('td')[1:9]
# lst = temp[0:2] + [temp[-1]]
# lst[0].text.split('(')
# eval(rows[2].small.text.split('/')[0])

# Check the total length in each rows
# for i in rows[1:]:
#     temp = i.find_all('td')
#     print(len(i.find_all('td')))

london_df = pd.DataFrame({'Postal Code': pc,'Borough': borough,'Neighborhood': neigh,'Latitude':lat,'Longitude':lng})
# london_df.to_csv('London_ON_Canada.csv')
london_df
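
The borough/neighborhood split used in the scraping loop can be isolated into a small helper, which makes it easy to check without hitting the website. This is a sketch only: `split_place_label` is a hypothetical name, and the example labels in the usage note are illustrative strings rather than values confirmed from the scraped table.

```python
def split_place_label(label):
    """Split a 'Borough (Neighborhood)' label into (borough, neighborhood)."""
    borough, sep, rest = label.partition('(')
    borough = borough.strip()
    # Without parentheses, the borough name doubles as the neighborhood name
    neighborhood = rest.strip().rstrip(')').strip() if sep else borough
    return borough, neighborhood
```

For example, `split_place_label('London North (UWO)')` gives `('London North', 'UWO')`, and a label without parentheses such as `'London Central'` is returned for both fields.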

In [None]:
address = 'London, ON, Canada'
geolocator = Nominatim(user_agent="lon_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London, Ontario are {}, {}.'.format(latitude, longitude))

In [None]:
london_map = folium.Map([latitude,longitude], zoom_start = 11)
for lat,lng,pc,neigh in zip(london_df.Latitude, london_df.Longitude,london_df['Postal Code'],london_df.Neighborhood):
    label = f'{neigh}\n({pc})'
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat,lng],
        radius = 5,
        popup = label,
        color = 'Blue',
        fill = True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(london_map)
    
london_map

In [None]:
import os

# Credentials are read from environment variables so that secrets stay out of the notebook.
# The variable names below are an assumption; use whatever names you export.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
ACCESS_TOKEN = os.environ['FOURSQUARE_ACCESS_TOKEN']
print('Your credentials are loaded.')

In [None]:
def getNearbyVenues(postal_codes, names, latitudes, longitudes, radius = 1000):
    """Query Foursquare for venues within `radius` metres of each neighborhood centre."""
    venues_lst = []
    for pc, name, lat, lng in zip(postal_codes, names, latitudes, longitudes):
        print(name)
        url = f'https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={lat},{lng}&oauth_token={ACCESS_TOKEN}&radius={radius}&limit={LIMIT}'
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # One tuple per venue, tagged with the neighborhood it was found in
        venues_lst.append([(
            pc,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    # Flatten the per-neighborhood lists into a single table
    nearby_venues = pd.DataFrame([item for venue_lst in venues_lst for item in venue_lst])
    nearby_venues.columns = ['Postal Code',
                            'Neighborhood',
                            'Neighborhood Latitude',
                            'Neighborhood Longitude',
                            'Venue',
                            'Venue Latitude',
                            'Venue Longitude',
                            'Venue Category']
    return nearby_venues

In [None]:
venues_df = getNearbyVenues(london_df['Postal Code'], london_df.Neighborhood, london_df.Latitude, london_df.Longitude)
venues_df.head()
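
The per-venue extraction can be tried offline on a hand-written payload. The sketch below assumes only the nesting that `getNearbyVenues` already indexes (`venue` → `name`, `location`, `categories`); the helper `venues_from_items` and the sample items are made up for illustration. Unlike the direct `[0]` lookup, it returns `None` when a venue has an empty category list instead of raising an `IndexError`.

```python
def venues_from_items(items):
    """Turn a list of explore-style items into (name, lat, lng, category) tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when the venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hypothetical sample data, not real API output
sample_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 42.98, 'lng': -81.25},
               'categories': [{'name': 'Cafe'}]}},
    {'venue': {'name': 'Spot B', 'location': {'lat': 42.99, 'lng': -81.24},
               'categories': []}},
]
sample_rows = venues_from_items(sample_items)
print(sample_rows)
```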

In [None]:
col=list(venues_df.columns.values[0:2])
del_col = list(venues_df.columns.values[2:4])
cate_df = venues_df.groupby(col).count().sort_values(by='Venue',ascending=False).drop(del_col,axis = 1).reset_index()
cate_df.to_csv('Venue_Categories_in_London.csv')
cate_df

In [None]:
# Outliers in neighborhood
london_df[~london_df.Neighborhood.isin((list(cate_df.Neighborhood.values)))]

In [None]:
print(f'Shape of venues_df for London Central: {venues_df[venues_df["Neighborhood"] == "London Central"].shape}')

In [None]:
print(f'Number of unique categories: {len(venues_df["Venue Category"].unique())}')

In [None]:
onehot = pd.get_dummies(venues_df[['Venue Category']], prefix = "", prefix_sep = "")
onehot[['Postal Code','Neighborhood']] = venues_df[['Postal Code','Neighborhood']]
fixed_col = list(onehot.columns[-2:]) + list(onehot.columns[0:-2])
onehot = onehot[fixed_col]
print(f'Shape: {onehot.shape[0]}x{onehot.shape[1]}')
onehot.head()

In [None]:
print(f'Shape of onehot: {onehot.shape[0]}x{onehot.shape[1]}')

In [None]:
onehot_group = onehot.groupby(['Postal Code','Neighborhood']).mean().reset_index()
onehot_group
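
To see what the one-hot encoding followed by `groupby(...).mean()` produces, here is a small sketch on made-up data. The names `toy_venues`, `toy_onehot` and `toy_freq` are hypothetical; the point is that each row of the result holds the share of a neighborhood's venues that fall into each category.

```python
import pandas as pd

# Made-up venues: three in neighborhood A, one in neighborhood B
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Gym', 'Gym'],
})

# One indicator column per category, cast to float so the mean is a frequency
toy_onehot = pd.get_dummies(toy_venues['Venue Category']).astype(float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# Mean of the indicators = fraction of venues per category in each neighborhood
toy_freq = toy_onehot.groupby('Neighborhood').mean()
print(toy_freq)
```

Here neighborhood A ends up with two thirds `Cafe` and one third `Gym`, and every row sums to 1.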

In [None]:
num_top_venues = 5
for pc, hood in zip(onehot_group['Postal Code'],onehot_group['Neighborhood']):
    print("----"+f'{hood} (PC:{pc})'+"----")
    temp = onehot_group[onehot_group['Postal Code'] == pc].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq':2})
    print(temp.sort_values('freq',ascending=False).reset_index(drop = True).head(num_top_venues))
    print('\n')

In [None]:
def return_most_common_venues(row, num_top_venues):
    # Skip the two identifier columns, then rank the categories by frequency
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending = False)[0:num_top_venues]
    # A category with zero frequency is not present in the neighborhood, so label it 'NaN'
    return np.array([name if freq != 0 else 'NaN'
                     for name, freq in row_categories_sorted.items()], dtype = object)

In [None]:
num_top_venues = 10
indicators = ['st','nd','rd']
columns = ['Postal Code', 'Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append(f'{ind+1}{indicators[ind]} Most Common Venue')
    except IndexError:
        # Beyond 'st', 'nd', 'rd' every rank up to 10 takes 'th'
        columns.append(f'{ind+1}th Most Common Venue')
neigh_venues_sorted = pd.DataFrame(columns = columns)
neigh_venues_sorted[['Postal Code', 'Neighborhood']] = onehot_group[['Postal Code','Neighborhood']]
for ind in np.arange(onehot_group.shape[0]):
    neigh_venues_sorted.iloc[ind,2:] = return_most_common_venues(onehot_group.iloc[ind,:],num_top_venues)
neigh_venues_sorted
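
The column labels above rely on a three-element suffix list plus a fallback, which is sufficient for the top 10. If more ranks were ever needed, a general ordinal helper could be used instead. This is a sketch with a hypothetical function name, `ordinal`, not something the notebook depends on.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the other teens always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'

labels = [f'{ordinal(i)} Most Common Venue' for i in range(1, 4)]
print(labels)
```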

In [None]:
feature = onehot_group.drop(['Postal Code','Neighborhood'], axis = 1)
ssq = []
kclusters = np.arange(1,12)
for k in kclusters:
    kmeans = KMeans(n_clusters = k, random_state = 0).fit(feature)
    ssq.append(kmeans.inertia_)
# ssq
plt.plot(kclusters, ssq)
plt.xlabel('K')
plt.ylabel('SSQ')
plt.show()
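
The SSQ value plotted here is scikit-learn's `inertia_`, the sum of squared distances from each sample to its closest cluster centre. The sketch below recomputes that quantity by hand with NumPy on made-up points; the function name `inertia` and the toy centres are assumptions for illustration, and the centres are given directly rather than fitted by k-means.

```python
import numpy as np

def inertia(points, centers):
    """Sum of squared distances from each point to its nearest centre."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_points, n_centers)
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dist.min(axis=1).sum())

toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
one_center = inertia(toy_points, [[5, 1]])             # 104.0
two_centers = inertia(toy_points, [[0, 1], [10, 1]])   # 4.0
print(one_center, two_centers)
```

The sharp drop from one centre to two is the kind of bend the elbow plot is meant to reveal; past the natural number of groups, additional centres reduce the SSQ only slightly.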

In [None]:
# The best k will be 4
kcluster = 4
kmeans = KMeans(n_clusters = kcluster, random_state = 0, n_init =  12).fit(feature)
print(f'Number of Cluster Labels: {len(kmeans.labels_)}')
print(f'Labels: {kmeans.labels_}')

In [None]:
print(f'Number of coordinates per cluster center: {kmeans.cluster_centers_[0].size}')
print(f'Number of features: {onehot.columns.values[2:].size}')

In [None]:
neigh_venues_sorted.insert(0,'Cluster Labels',kmeans.labels_)
df_merged = london_df.drop([6,14]).reset_index(drop=True)
df_merged = df_merged.merge(neigh_venues_sorted)
columns = [df_merged.columns[5]] + [df_merged.columns[1]] + [df_merged.columns[0]] + list(df_merged.columns[2:5]) + list(df_merged.columns[6:])
df_merged = df_merged[columns]
df_merged
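
The hard-coded `drop([6,14])` depends on the row positions of the neighborhoods that returned no venues, which can change if the scraped table changes. A possible alternative, sketched here on made-up data, is to let the merge itself report which rows have no match through pandas' `indicator=True` option. The frames `hoods` and `labels` are hypothetical stand-ins for `london_df` and `neigh_venues_sorted`.

```python
import pandas as pd

# Stand-in for london_df: three neighborhoods
hoods = pd.DataFrame({'Postal Code': ['P1', 'P2', 'P3'],
                      'Neighborhood': ['A', 'B', 'C']})
# Stand-in for neigh_venues_sorted: only two of them received cluster labels
labels = pd.DataFrame({'Postal Code': ['P1', 'P3'],
                       'Neighborhood': ['A', 'C'],
                       'Cluster Labels': [0, 1]})

# A left merge keeps every neighborhood; '_merge' tells where each row came from
merged = hoods.merge(labels, on=['Postal Code', 'Neighborhood'],
                     how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'Neighborhood'].tolist()

# Keep matched rows only and restore the integer label dtype lost to NaN
kept = merged[merged['_merge'] == 'both'].drop(columns='_merge')
kept = kept.astype({'Cluster Labels': int}).reset_index(drop=True)
print(missing)
print(kept)
```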

In [None]:
# temp1 = london_df.drop([6,14]).reset_index(drop = True)
# temp2 = neigh_venues_sorted
# # temp1['Postal Code'].isin(temp2['Postal Code'])
# join_df = temp1
# join_df = join_df.join(temp2.set_index('Postal Code'), on = 'Postal Code', lsuffix = '_left',rsuffix='_right')
# join_df

In [None]:
# Set color codes
x = np.arange(kcluster)
ys = [i+x+(i*x)**2 for i in range(kcluster)]
colors_array = cm.rainbow(np.linspace(0,1,len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
rainbow

In [None]:
map_cluster = folium.Map([latitude, longitude], zoom_start = 11)
for lat, lng, neigh, c in zip(df_merged.Latitude, df_merged.Longitude, df_merged.Neighborhood, df_merged['Cluster Labels']):
    label = f'Neighborhood: {neigh}\nCluster: {c}'
    label = folium.Popup(label,parse_html = True)
    folium.CircleMarker(
        [lat,lng],
        radius = 8,
        popup = label,
        color = rainbow[c],
        fill = True,
        fill_color = rainbow[c],
        fill_opacity = 0.7,).add_to(map_cluster)
map_cluster

In [None]:
cols = df_merged.columns.values[6:]
df_merged.loc[df_merged['Cluster Labels'] == 0, cols]

In [None]:
df_merged.loc[df_merged['Cluster Labels'] == 1, cols]

In [None]:
df_merged.loc[df_merged['Cluster Labels'] == 2, cols]

In [None]:
df_merged.loc[df_merged['Cluster Labels'] == 3, cols]

In [None]:
# https://www.postalpinzipcodes.com/Postcode-CAN-Canada-Postal-code-N6A-ZIP-Code
# http://zip-code.en.mapawi.com/canada/4/ontario/1/9/on/london-north-uwo-/n6a/1051/