# Capstone Project - Carol Sutton
#### 311 provides residents, businesses and visitors with easy access to non-emergency City services, programs and information 24 hours a day, seven days a week. 311 can offer assistance in more than 180 languages. The City of Toronto has been made aware that some of its residential areas (namely those near Downtown Toronto) may have hazardous materials buried. As a service to residents, 311 is offering to identify areas that are similar to the ones they currently live in (but without the buried hazardous material). Subsequent relocation would be free for those residents whose housing is paid for by Toronto. Other residents needing to be relocated will have their expenses subsidised.

#### This analysis is for Etobicoke

### Set up section
#### Import libraries required for the activities

In [98]:
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium=0.5.0 --yes 

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

GeoLocator = Nominatim(user_agent='My-IBMNotebook')

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


!conda install -c conda-forge folium=0.5.0 --yes 
import folium 
from urllib import request
import urllib.request
import time
from bs4 import BeautifulSoup
import bs4 as bs

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


### Scraping the web
#### This is the code to scrape the web page

#### Source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
#### I used Python BeautifulSoup with the lxml parser

In [99]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
rawpage = request.urlopen(url)

### This is to parse the table using BeautifulSoup (with the lxml parser)

In [100]:
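def scrape_table_bs4(cname, cols):
    # Download the page (uses the global `url`) and parse it with the lxml parser
    page = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(page, 'lxml')
    # Locate the first table with the given CSS class
    table = soup.find('table', class_=cname)
    # Column names come from the <th> cells, data from the <td> cells
    header = [head.findAll(text=True)[0].strip() for head in table.find_all("th")]
    data = [[td.findAll(text=True)[0].strip() for td in tr.find_all("td")] for tr in table.find_all("tr")]
    # Keep only complete rows (this also drops the header row, which has no <td> cells)
    data = [row for row in data if len(row) == cols]
    # Store the data temporarily in a dataframe
    temp_df = pd.DataFrame(data, columns=header)
    return temp_df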
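
To see what the scraper does without calling Wikipedia, here is a minimal sketch that applies similar BeautifulSoup logic to a small inline HTML table. The HTML snippet and the name `parse_wikitable` are hypothetical and made up for illustration. It uses Python's built-in `html.parser` instead of `lxml`, and `get_text` instead of `findAll(text=True)[0]`, so it is not an exact copy of the function above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page (assumption, not the real page).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park</td></tr>
</table>
"""

def parse_wikitable(html, cname, cols):
    """Return the first table with CSS class `cname` as a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=cname)
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    # The header row has no <td> cells, so the length check filters it out.
    rows = [row for row in rows if len(row) == cols]
    return pd.DataFrame(rows, columns=header)

sample_df = parse_wikitable(SAMPLE_HTML, "wikitable", 3)
print(sample_df)
```

Passing the HTML in as an argument, rather than reading a global `url`, makes the parsing step easy to check on its own.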

### Run the scraper on the postal code table

In [101]:
raw_Postcodes = scrape_table_bs4("wikitable",3)

### Inspect the structure of the scraped dataframe

In [102]:
print ("Postcodes")
print(raw_Postcodes.info(verbose = True))

Postcodes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 3 columns):
Postcode         287 non-null object
Borough          287 non-null object
Neighbourhood    287 non-null object
dtypes: object(3)
memory usage: 6.8+ KB
None


### Assumptions

#### The dataframe will consist of three columns: Postcode, Borough, and Neighbourhood

#### I will only process the cells that have an assigned borough. I will ignore cells with a borough that is Not assigned.

#### More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma.

#### If a cell has a borough but a Not assigned neighborhood, then I will make the neighborhood the same as the borough. For example, the Queen's Park row checked in the code below.

In [103]:
Postcodes = raw_Postcodes[~raw_Postcodes['Borough'].isin(['Not assigned'])]
                          
Postcodes=Postcodes.sort_values(by=['Postcode', 'Borough', 'Neighbourhood'], ascending =[1,1,1]).reset_index(drop=True)

In [104]:
Postcodes.loc[Postcodes['Neighbourhood'] == 'Not assigned', ['Neighbourhood']]=Postcodes['Borough']

check_unassigned_post_state_sample = Postcodes.loc[Postcodes['Borough'] =='Queen\'s Park']

In [105]:
Postcodes = Postcodes.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
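
The three cleaning steps above can be sketched on a tiny made-up table to make their effect visible. The rows below are assumptions that mirror the structure of the scraped data; they are not the real Wikipedia contents.

```python
import pandas as pd

# Toy rows (assumption: invented to mirror the structure of the scraped table).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Drop rows without an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 2. Where the neighbourhood is unassigned, fall back to the borough name.
unassigned = clean["Neighbourhood"] == "Not assigned"
clean.loc[unassigned, "Neighbourhood"] = clean.loc[unassigned, "Borough"]

# 3. Collapse duplicate postcodes into one comma-separated row.
clean = (clean.groupby(["Postcode", "Borough"])["Neighbourhood"]
              .apply(", ".join)
              .reset_index())
print(clean)
```

In this toy run, M5A ends up as a single row listing both neighbourhoods, and M7A takes its borough name as its neighbourhood.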

#### List of Postal Codes in Toronto Canada (starting M...)

In [106]:
Postcodes

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [107]:
Postcodes.shape

(103, 3)

## Neighbourhood Coordinates

#### Now that the dataframe holds the postal code, borough name and neighbourhood name of each neighbourhood, I need the latitude and longitude coordinates of each one in order to utilize the Foursquare location data.

#### I chose to use the provided csv file - http://cocl.us/Geospatial_data

In [108]:
lat_longcsv = 'http://cocl.us/Geospatial_data'
!wget -q -O 'Geospatial_coordinates.csv' $lat_longcsv
geopostcode_data=pd.read_csv(lat_longcsv).set_index('Postal Code')
geopostcode_data.head()

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [109]:
Postcodes.to_csv('postcode1_df.csv',index=False)

postcode_csv = 'postcode1_df.csv'

postcodes1 = pd.read_csv(postcode_csv).set_index('Postcode')
postcodes1.rename_axis('Postal Code', axis = 'index', inplace = True)
postcodes1.head()

Unnamed: 0_level_0,Borough,Neighbourhood
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,Scarborough,"Malvern, Rouge"
M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
M1E,Scarborough,"Guildwood, Morningside, West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


### Combine the two sets of data

In [110]:
Combined_data = postcodes1.join( geopostcode_data)
Combined_data.head()

Unnamed: 0_level_0,Borough,Neighbourhood,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.784535,-79.160497
M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
M1G,Scarborough,Woburn,43.770992,-79.216917
M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [111]:
Combined_data.shape

(103, 4)
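
An index join like the one above silently leaves `NaN` coordinates for any postcode that has no match, so a quick check can help. The sketch below uses made-up inputs; the postcode `M9Z` and the name `Hypothetical Area` are assumptions chosen to force an unmatched row.

```python
import pandas as pd

# Toy inputs (assumption: illustrative values; M9Z is deliberately unmatched).
areas = pd.DataFrame(
    {"Borough": ["Scarborough", "Etobicoke"],
     "Neighbourhood": ["Woburn", "Hypothetical Area"]},
    index=pd.Index(["M1G", "M9Z"], name="Postal Code"))
coords = pd.DataFrame(
    {"Latitude": [43.770992], "Longitude": [-79.216917]},
    index=pd.Index(["M1G"], name="Postal Code"))

# A left join keeps every postcode; unmatched rows get NaN coordinates.
joined = areas.join(coords, how="left")

# List the postcodes that found no coordinates so they can be reviewed.
missing = joined[joined["Latitude"].isna()].index.tolist()
print(joined)
print("Postcodes without coordinates:", missing)
```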

## Exploring and clustering the neighbourhoods in Etobicoke

#### To explore the neighbourhoods of the selected area I will use the Foursquare API.



### Use geopy to get the lat/long values of Toronto, Canada

In [112]:
address = 'Toronto, Ontario Canada'

# Identify the application to Nominatim with an explicit user_agent
geolocator = Nominatim(user_agent='My-IBMNotebook')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto Canada are 43.653963, -79.387207.


### Create a map of Toronto with the neighbourhoods superimposed

In [113]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)


for lat, lng, borough, neighborhood in zip(Combined_data['Latitude'], Combined_data['Longitude'], Combined_data['Borough'], Combined_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#87cefa',
        fill_opacity=0.5,
        parse_html=False).add_to(map_toronto)


In [114]:
map_toronto

## Now I will apply the same analysis to the Etobicoke area (as I did to Downtown Toronto) to start the assessment

### Assumption
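#### For the purpose of the exercise I intend to work only with boroughs that contain the word Etobicoke and then replicate the same analysis that I did with the New York City data. Note that the cell below, as run, filters on "Queen's" rather than "Etobicoke", so the outputs that follow cover only the Queen's Park row.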

In [115]:
Etob_data = Combined_data[Combined_data['Borough'].str.contains("Queen's")].reset_index(drop=True)
print(Etob_data.shape)
Etob_data.head()

(1, 4)


Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,Queen's Park,Queen's Park,43.662301,-79.389494
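
The stated intent is to keep boroughs whose name contains the word Etobicoke. A minimal sketch of that filter on made-up rows is shown below; the neighbourhood names `Hypothetical A` and `Hypothetical B` are assumptions, and no results from the real data are implied.

```python
import pandas as pd

# Toy data (assumption: a few illustrative rows, not the full Toronto table).
toy = pd.DataFrame({
    "Borough": ["Etobicoke", "Queen's Park", "Etobicoke", "Scarborough"],
    "Neighbourhood": ["Hypothetical A", "Queen's Park", "Hypothetical B", "Woburn"],
})

# regex=False treats the pattern as plain text; na=False skips missing values.
is_etobicoke = toy["Borough"].str.contains("Etobicoke", regex=False, na=False)
etobicoke_only = toy[is_etobicoke].reset_index(drop=True)
print(etobicoke_only)
```

Checking the shape of the filtered frame straight away is a cheap way to catch a filter string that selects the wrong borough.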


#### I will now recreate the map with the markers on it for the neighbourhoods

In [116]:

map_Et = folium.Map(location=[latitude, longitude], zoom_start=11)


for lat, lng, label in zip(Etob_data['Latitude'], Etob_data['Longitude'], Etob_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Et)
    
map_Et

### Now using the Foursquare API to explore and segment neighborhoods

In [117]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET


### To explore the neighbourhoods in Etobicoke
#### I will use the same kind of query as for the NY exercise, sent to the explore endpoint
#### https://api.foursquare.com/v2/venues/explore?client_id=CLIENT_ID&client_secret=CLIENT_SECRET&v=VERSION&ll=LATITUDE,LONGITUDE&radius=RADIUS&limit=LIMIT

In [118]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    # Collect one list of venue tuples per neighbourhood
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # Build the explore request for venues within `radius` metres of the point
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # Send the GET request and keep only the returned items
        results = requests.get(url).json()['response']['groups'][0]['items']

        # Record the neighbourhood alongside each venue's name, location and category
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # Flatten the per-neighbourhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
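
The function above indexes straight into the JSON response, which raises an error if a response has no groups or a venue has no category. The sketch below shows one way to flatten such a response more defensively, without calling the API. The payload is hypothetical: its field names mirror those indexed by the function above, the venues are invented, and `extract_venues` is a made-up helper name.

```python
import pandas as pd

# Hypothetical payload (assumption: shaped like the fields indexed above).
sample_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Cafe A",
                   "location": {"lat": 43.66, "lng": -79.39},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "No Category Venue",
                   "location": {"lat": 43.67, "lng": -79.38},
                   "categories": []}},
    ]}]}
}

def extract_venues(payload, name, lat, lng):
    """Flatten one response into rows, tolerating missing groups or categories."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        cats = venue.get("categories", [])
        category = cats[0]["name"] if cats else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

columns = ["Neighbourhood", "Neighborhood Latitude", "Neighborhood Longitude",
           "Venue", "Venue Latitude", "Venue Longitude", "Venue Category"]
rows = extract_venues(sample_response, "Queen's Park", 43.662301, -79.389494)
venues_df = pd.DataFrame(rows, columns=columns)
print(venues_df)
```

Separating the parsing from the HTTP call also means the parsing can be tested without spending API quota.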

In [119]:
Combined_data = Etob_data
Etob_venues = getNearbyVenues(names=Combined_data['Neighbourhood'],
                                   latitudes=Combined_data['Latitude'],
                                   longitudes=Combined_data['Longitude'])

Queen's Park


In [120]:
Combined_data.shape

(1, 4)

In [121]:
Etob_venues.head()

Unnamed: 0,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Queen's Park,43.662301,-79.389494,Queen's Park,43.663946,-79.39218,Park
1,Queen's Park,43.662301,-79.389494,Mercatto,43.660391,-79.387664,Italian Restaurant
2,Queen's Park,43.662301,-79.389494,Nando's,43.661617,-79.386095,Portuguese Restaurant
3,Queen's Park,43.662301,-79.389494,Coffee Public,43.660763,-79.386184,Coffee Shop
4,Queen's Park,43.662301,-79.389494,YMCA,43.662753,-79.384849,Gym


In [122]:
Etob_venues.shape

(30, 7)

### Noting that 30 venues have been returned, let's check to see how many venues are in each neighbourhood

In [123]:
Etob_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Queen's Park,30,30,30,30,30,30


### Checking on unique categories in each area

In [124]:
print('{} unique venue categories have been found.'.format(len(Etob_venues['Venue Category'].unique())))

23 unique venue categories have been found.


## Analysing each neighbourhood
### Using One Hot encoding
#### One hot encoding converts a categorical variable into one 0/1 indicator column per category, a form that ML algorithms can work with directly
#### I then move the Neighbourhood column to the front to tidy the presentation of the data

In [125]:
venues_oh = pd.get_dummies(Etob_venues['Venue Category'])


venues_oh['Neighbourhood'] = Etob_venues['Neighbourhood'] 


fixed_columns = [venues_oh.columns[-1]] + list(venues_oh.columns[:-1])
venues_oh =venues_oh[fixed_columns]

venues_oh.head()

Unnamed: 0,Neighbourhood,Arts & Crafts Store,Beer Bar,Burger Joint,Burrito Place,Coffee Shop,Creperie,Diner,Fried Chicken Joint,Gym,Hobby Shop,Italian Restaurant,Mexican Restaurant,Nightclub,Park,Persian Restaurant,Portuguese Restaurant,Sandwich Place,Seafood Restaurant,Smoothie Shop,Sushi Restaurant,Theater,Wings Joint,Yoga Studio
0,Queen's Park,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,Queen's Park,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,Queen's Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,Queen's Park,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Queen's Park,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Just confirm that no venues have been dropped (check figure is 30)

In [126]:
venues_oh.shape

(30, 24)

### Let's work out the mean frequency of each venue type for each neighbourhood to see which ones suit me the best

In [127]:
neighbourhoodsE_grouped = venues_oh.groupby('Neighbourhood').mean().reset_index()

neighbourhoodsE_grouped

Unnamed: 0,Neighbourhood,Arts & Crafts Store,Beer Bar,Burger Joint,Burrito Place,Coffee Shop,Creperie,Diner,Fried Chicken Joint,Gym,Hobby Shop,Italian Restaurant,Mexican Restaurant,Nightclub,Park,Persian Restaurant,Portuguese Restaurant,Sandwich Place,Seafood Restaurant,Smoothie Shop,Sushi Restaurant,Theater,Wings Joint,Yoga Studio
0,Queen's Park,0.033333,0.033333,0.033333,0.033333,0.133333,0.033333,0.066667,0.033333,0.066667,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333


### Check for new size in case items get dropped in the future

In [128]:
neighbourhoodsE_grouped.shape

(1, 24)
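
To make the encoding and grouping steps concrete, here is a small sketch on invented data. The neighbourhood labels `A` and `B` and their venues are assumptions for illustration only. Each row of the grouped result holds the share of a neighbourhood's venues that fall in each category, so every row sums to 1.

```python
import pandas as pd

# Toy venue list (assumption: invented neighbourhoods and categories).
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Gym", "Park", "Park"],
})

# One 0/1 indicator column per venue category.
one_hot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)

# Put the neighbourhood name in the first column.
one_hot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])

# The mean of the indicators is the frequency of each category per neighbourhood.
freq = one_hot.groupby("Neighbourhood").mean().reset_index()
print(freq)
```

Here neighbourhood A has two coffee shops out of four venues, so its Coffee Shop frequency is 0.5.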

### Print the five most common venue types for each neighbourhood

In [129]:
num_top = 5
for neigh in neighbourhoodsE_grouped['Neighbourhood']:
    print(neigh)
    temp = neighbourhoodsE_grouped[neighbourhoodsE_grouped['Neighbourhood'] == neigh].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top))
    print('\n')

Queen's Park
              venue  freq
0       Coffee Shop  0.13
1              Park  0.07
2  Sushi Restaurant  0.07
3             Diner  0.07
4               Gym  0.07




### Interesting but not very easy to understand
### Let's put them in descending order - so I can see the top 20

In [130]:
def return_most_common_venues(row, num_top):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top]

In [131]:
num_top = 20

indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']
for ind in np.arange(num_top):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))


nvs = pd.DataFrame(columns=columns)
nvs['Neighbourhood'] = neighbourhoodsE_grouped['Neighbourhood']

for ind in np.arange(neighbourhoodsE_grouped.shape[0]):
    nvs.iloc[ind, 1:] = return_most_common_venues(neighbourhoodsE_grouped.iloc[ind, :], num_top)

nvs.shape

(1, 21)
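
The column labels above rely on a three-item suffix list with a fallback to "th", which works up to 20 but would label a 21st column "21th". As a possible alternative, the sketch below builds the same labels with a small helper; `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, for example 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

num_top = 20
columns = ["Neighbourhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top + 1)
]
print(columns[:4])
```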

In [132]:
nvs.head(20)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Queen's Park,Coffee Shop,Sushi Restaurant,Diner,Park,Gym,Yoga Studio,Hobby Shop,Beer Bar,Burger Joint,Burrito Place,Creperie,Fried Chicken Joint,Mexican Restaurant,Italian Restaurant,Wings Joint,Nightclub,Persian Restaurant,Portuguese Restaurant,Sandwich Place,Seafood Restaurant


#### Looks like Albion, Alderwood and Bloordale are good areas for me.
### Let's see what other neighbourhoods are in that area.
## Clustering the Neighbourhoods
### Using the k means technique to see what other areas have a similar mix of venues
### Given that there are over 100 neighbourhoods - I'll make 11 clusters

In [133]:
kclusters = 11
neighbourhoodclusteringE = neighbourhoodsE_grouped.drop(columns='Neighbourhood')
kmeans = KMeans(n_clusters=kclusters, random_state = 1).fit(neighbourhoodclusteringE)
print(kmeans.labels_[0:12])
print(len(kmeans.labels_))

ValueError: n_samples=1 should be >= n_clusters=11
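
The error above arises because k-means cannot form more clusters than there are samples, and the grouped frame here holds a single neighbourhood. One possible guard, sketched below on synthetic data, caps the number of clusters at the number of rows. The feature matrix is randomly generated as an assumption and `requested_k` is a hypothetical name. With only one neighbourhood, clustering would still not be informative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the venue-frequency matrix (assumption: random values).
rng = np.random.default_rng(1)
features = rng.random((5, 4))  # 5 neighbourhoods, 4 venue categories

requested_k = 11
# k-means needs n_samples >= n_clusters, so cap k at the number of rows.
k = min(requested_k, len(features))

kmeans = KMeans(n_clusters=k, random_state=1, n_init=10).fit(features)
print(k, kmeans.labels_)
```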

#### Confirm data set to be used going forward

In [None]:
Combined_data.shape

In [None]:
finalised_data = Combined_data.copy()  # copy so that Combined_data is not modified

finalised_data['Cluster Labels'] = kmeans.labels_

finalised_data = finalised_data.join(nvs.set_index('Neighbourhood'), on='Neighbourhood')

finalised_data.head()

### Let's now see what this looks like on a map so I can see where things are

In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)


x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


markers_colors = []
for lat, lon, poi, cluster in zip(finalised_data['Latitude'], finalised_data['Longitude'], finalised_data['Neighbourhood'], finalised_data ['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters