***
# Capstone Project  
#### This notebook will be used for the Coursera IBM data science capstone project  
**Julia Kettle**  
2019-01-24  
***

In [34]:
import numpy as np
import pandas as pd
import requests
!pip install lxml
from lxml import html



In [15]:
print("Hello Capstone Project Course")

Hello Capstone Project Course


# Getting the data
Let's fetch the page using requests.  
I use BeautifulSoup from the bs4 module, with the lxml parser, to parse the page contents.  
Then I use its find and findAll methods (findAll is the older alias of find_all) to get first the table, then all occurrences of '<td> </td>', which contain the data.

In [35]:
#get data of postcodes 
wiki_postcodes = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

from bs4 import BeautifulSoup
content = BeautifulSoup(wiki_postcodes,'lxml')

table = content.find('table',{'class':'wikitable sortable'})
data  = table.findAll('td')
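The slicing in the next step relies on every table row holding exactly three cells. As a hedged alternative, the sketch below walks a table row by row and reads each cell's text with `get_text`. The HTML snippet is a made-up stand-in for the Wikipedia table, and `html.parser` is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the Wikipedia postcode table (assumption).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York" title="North York">North York</a></td>
      <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) != 3:
        # Skips the header row, which has <th> cells and no <td> cells
        continue
    # get_text(strip=True) drops the tags and the surrounding whitespace
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Reading the visible text this way also removes the need to strip `<td>` markers by hand, at the cost of relying on the link text rather than the `title` attribute.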

# Further processing

Convert to string and strip td markers  
Separate each column  


In [36]:
#convert list to list of strings
#strip <td> and </td> from data
for i,entry in enumerate(data):
    data[i] = str(data[i])[4:-5]

#separate 
postcodes = data[::3]
borough   = data[1::3]
neighbourhood = data[2::3]

    

Here I wrote a function to process the neighbourhood and borough data. We don't need the links, only the titles, which are formatted as 'title="The title"'.  

In [37]:
def find_titles(text_list):
    for i, entry in enumerate(text_list):
        
        #find index of the markers for title
        index1 = str.find(entry,'title="')
        index2 = str.find(entry,">")
        
        #find returns -1 if nothing found
        #so only slice data where "title=" found
        if(index1 != -1 and index2 != -1):
            text_list[i] = entry[index1+7:index2-1]
            
        #drop trailing newline character
        if(text_list[i].endswith('\n')):
            text_list[i]=text_list[i][:-1]
        
        #drop trailing ",Toronto" and "(Toronto)"
        text_list[i] = str.replace(text_list[i],", Toronto","")
        text_list[i] = str.replace(text_list[i],"(Toronto)","")
    return text_list
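As a possible alternative to the index arithmetic above, the same extraction can be sketched with a regular expression. The function name `extract_title` is hypothetical, and unlike `find_titles` this version only removes the Toronto suffixes at the end of the string and strips leftover whitespace.

```python
import re

# Captures whatever sits between title=" and the next double quote
TITLE_RE = re.compile(r'title="([^"]*)"')

def extract_title(cell):
    """Return the title attribute if present, otherwise the cell text itself."""
    match = TITLE_RE.search(cell)
    text = match.group(1) if match else cell
    text = text.strip()
    # Remove a trailing ", Toronto" or " (Toronto)" disambiguation suffix
    for suffix in (", Toronto", " (Toronto)"):
        if text.endswith(suffix):
            text = text[:-len(suffix)]
    return text.strip()

linked = '<a href="/wiki/Scarborough,_Toronto" title="Scarborough, Toronto">Scarborough</a>'
print(extract_title(linked))
print(extract_title("Not assigned\n"))
```

Because the suffix is removed together with its leading space, a title such as "Queen's Park (Toronto)" comes back without a trailing blank.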

1. Get the borough/neighbourhood names with the function above 

2. Recombine the columns and create a data frame   

3. Drop rows where the borough isn't assigned (assuming it is always written as 'Not assigned'); where only the neighbourhood isn't assigned, use the borough name instead

In [38]:
#process borough and neighbourhood to strip
borough=find_titles(borough)
neighbourhood=find_titles(neighbourhood)

#zip the lists then create dataframe
data_tuples = list(zip(postcodes,borough,neighbourhood))
labels = ['PostalCode','Borough','Neighbourhood']
df = pd.DataFrame.from_records(data_tuples,columns=labels)

#drop any rows where no borough is assigned (copy so later assignments are safe)
df=df[(df.Borough != 'Not assigned')].copy()
#assign borough name if neighbourhood not assigned
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']

Group by postal code, then combine the other columns as follows:  
Neighbourhood: concatenate, separated by commas  
Borough: keep the first name (I assume the same postal code never belongs to different boroughs)


In [39]:
#group rows that share a postal code
df_group = df.groupby('PostalCode')
#aggregate each group: join the neighbourhoods into one string
#and keep the first borough, using lambda functions
df = df_group.agg({'Borough': lambda x: x.iloc[0],
                   'Neighbourhood': lambda x: ', '.join(x)}).reset_index()

df

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek , Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Woburn |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
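The aggregation keeps only the first borough per postal code, which rests on the assumption that a postal code never spans two boroughs. A minimal sketch of how that assumption could be checked, using a toy frame with illustrative rows rather than the scraped data:

```python
import pandas as pd

# Toy rows with the same column names as the notebook (illustrative values).
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Count the distinct boroughs seen for each postal code
boroughs_per_code = toy.groupby("PostalCode")["Borough"].nunique()
single_borough = bool((boroughs_per_code == 1).all())

# Same aggregation idea as above: first borough, comma-joined neighbourhoods
grouped = (
    toy.groupby("PostalCode")
       .agg({"Borough": "first", "Neighbourhood": ", ".join})
       .reset_index()
)

print(single_borough)
print(grouped)
```

If `single_borough` came out `False` on the real data, keeping only the first borough would silently discard information.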


In [40]:
print("number of rows =",np.shape(df)[0])

number of rows = 103


In [42]:
import geocoder

In [51]:
#geocoder.google constantly returned None,
#so I'm going to use the csv file instead.
lat_lng_coords = None
count = 0
while((lat_lng_coords is None) and count <100):
    g = geocoder.google('{}, Toronto, Ontario'.format("M5G"))
    lat_lng_coords = g.latlng
    count=count+1
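The loop above retries immediately, up to 100 times. A gentler pattern is to pause between attempts and give up after a handful of tries. The sketch below illustrates that idea only: `retry_until_result` and its `lookup` argument are hypothetical names, and the flaky service is simulated rather than calling any geocoding API.

```python
import time

def retry_until_result(lookup, max_attempts=5, delay_s=0.5):
    """Call lookup() until it returns something other than None."""
    for attempt in range(1, max_attempts + 1):
        result = lookup()
        if result is not None:
            return result
        if attempt < max_attempts:
            # Wait a little longer after each failed attempt
            time.sleep(delay_s * attempt)
    return None

# Stand-in for a flaky call: fails twice, then succeeds (placeholder values).
responses = iter([None, None, [43.0, -79.0]])
coords = retry_until_result(lambda: next(responses), max_attempts=5, delay_s=0)
print(coords)
```

In the notebook, `lookup` could wrap the existing `geocoder.google(...).latlng` call. Whether retrying helps at all depends on why the service returns `None`.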

**read geo data from csv and display 5 rows**

In [43]:
df_geo = pd.read_csv("Geospatial_Coordinates.csv")
df_geo.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


**Left join the two dataframes on postal codes**  
Make sure the postal code column has the same name in both dataframes, then left join.  
A left join keeps every row of neighbourhood info even if no geo data is available for it, and keeps lat/long data only for postal codes that appear in the neighbourhood table.

In [49]:
df_geo.rename(columns={'Postal Code':'PostalCode'},inplace=True)
df_geo.head()
df_all = df.merge(df_geo,on=['PostalCode'],how='left')
df_all.head(10)

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek , Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Woburn | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
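With a left join, postal codes that have no match in the geo data end up with NaN coordinates. One way to spot them is the `indicator` option of `merge`, sketched here on toy frames. The postal code "M9Z" and its borough are made up to force an unmatched row.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],   # "M9Z" is a made-up code
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
geo = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
merged = left.merge(geo, on="PostalCode", how="left", indicator=True)

# Rows present only in the left frame have no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every neighbourhood row received coordinates before plotting.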


**Start by showing the neighbourhoods on a map with different colours for different boroughs**

In [194]:
#visualise the neighbourhoods on the maps
import folium
import matplotlib.colors as colors

#set up list of colours to use.
cols = ['red', 'blue', 'green', 'purple', 'orange', 'darkred',
             'lightred', 'beige', 'darkblue', 'darkgreen', 'cadetblue',
             'darkpurple', 'white', 'pink', 'lightblue', 'lightgreen',
             'gray', 'black', 'lightgray']

#set up list of unique Boroughs
Boroughs = df_all['Borough'].unique()

#create map
toronto_map = folium.Map(location=[43.6532,-79.3832],zoom_start=10)

#plot marker for each row
for index, row in df_all.iterrows():
    
    #set the colour based on the borough.
    icol = np.argwhere(Boroughs==row['Borough'])[0,0]
    
    folium.Marker([row['Latitude'],row['Longitude']] \
                  ,popup=row['Neighbourhood'],icon=folium.Icon(color=cols[icol])).add_to(toronto_map)

#show map
toronto_map

['Scarborough' 'North York' 'East York' 'East Toronto' 'Central Toronto'
 'Downtown Toronto' 'York' 'West Toronto' "Queen's Park " 'Mississauga'
 'Etobicoke']
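The plotting loop looks up each borough's position with `np.argwhere` on every row. A dictionary from borough name to colour is a simpler sketch of the same idea. The short lists below are illustrative, and `cycle` is an added safeguard (not in the notebook) for the case where there are more boroughs than colours.

```python
from itertools import cycle

boroughs = ["Scarborough", "North York", "East York"]   # example subset
palette = ["red", "blue"]                                # deliberately short

# cycle() repeats the palette, so zip never runs out of colours
borough_colour = dict(zip(boroughs, cycle(palette)))

print(borough_colour["East York"])
```

Inside the loop, `borough_colour[row['Borough']]` would then replace the `cols[icol]` lookup.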


In [None]:
#get the location data in array for training
X = df_all[['Latitude','Longitude']]
X.info()

Let's set up and train our model

In [169]:
from sklearn.cluster import KMeans

#let's try 8 clusters
k=8

#set up KMeans, which will run 12 times with different initial centroids
#and keep the best run: protection against finding only a local minimum
clstr = KMeans(n_clusters=k,n_init=12)
clstr.fit(X)

#display the labels
clstr.labels_

array([5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 3, 3, 5, 3, 3, 3, 5, 3, 3, 7, 7, 7, 7,
       7, 7, 3, 7, 1, 7, 2, 2, 2, 2, 2, 1, 1, 1, 1, 6, 1, 1, 6, 1, 6, 7, 7,
       7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 0, 6, 6, 6, 6,
       6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 6, 4, 1, 4, 4, 4, 4,
       4, 4, 4, 4, 2, 2, 2, 2, 4, 2, 2], dtype=int32)
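The value k=8 is chosen by hand. One common heuristic for picking k, not used in this notebook, is the elbow method: fit the model for several values of k and look for the point where the inertia (within-cluster sum of squared distances) stops dropping sharply. The sketch below runs on synthetic points, not the Toronto coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three synthetic blobs of (latitude, longitude)-like points (made-up data)
centres = np.array([[43.65, -79.38], [43.77, -79.25], [43.70, -79.55]])
points = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centres])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=12, random_state=0).fit(points)
    inertias[k] = model.inertia_   # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 4))
```

On data like this the drop in inertia flattens after k=3. On real coordinates the elbow is often less clear, so the result is a guide rather than a rule.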

Replot the map, this time with colours assigned based on cluster.

In [195]:


cols = ['red', 'blue', 'green', 'purple', 'orange', 'darkred',
             'lightred', 'beige', 'darkblue', 'darkgreen', 'cadetblue',
             'darkpurple', 'white', 'pink', 'lightblue', 'lightgreen',
             'gray', 'black', 'lightgray']

toronto_map_clstr = folium.Map(location=[43.6532,-79.3832],zoom_start=10)

for lat, long, nbrhd, clstr_lab in zip(df_all['Latitude'],df_all['Longitude'], df_all['Neighbourhood'],list(clstr.labels_)):
    folium.Marker(location=[lat,long] \
                  ,popup=nbrhd, \
                    icon=folium.Icon(color=cols[clstr_lab])).add_to(toronto_map_clstr)

toronto_map_clstr
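The two maps allow a visual comparison of boroughs and clusters. A contingency table makes the same comparison countable. The sketch below uses made-up labels. In the notebook the inputs would presumably be `df_all['Borough']` and `clstr.labels_`.

```python
import pandas as pd

# Toy labels (made up) standing in for the borough column and cluster labels
boroughs = pd.Series(
    ["Scarborough", "Scarborough", "North York", "North York", "York"],
    name="Borough",
)
clusters = pd.Series([5, 5, 1, 3, 3], name="Cluster")

# Rows are boroughs, columns are cluster ids, cells count shared members
table = pd.crosstab(boroughs, clusters)
print(table)
```

A borough whose row has a single non-zero cell falls entirely inside one cluster. A row spread over several columns is split across clusters.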

From the maps we can see that the boroughs do not represent the clusters, althog