<h1>Segmenting and Clustering Neighborhoods in Toronto - Assignment</h1>

<h2>Step 1: Load the postal code table from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M into a DataFrame and display it</h2>

In [35]:
import requests
import lxml.html as lh
import pandas as pd
import numpy as np # library to handle data in a vectorized manner
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)[0]
df.head(12) # displaying sample of data loaded


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


In [3]:
df.shape # initial number of records loaded

(288, 3)

<h2>Step 2: Ignore rows with Borough as 'Not assigned'</h2>

In [4]:
df = df.loc[df['Borough'] != 'Not assigned']

In [5]:
df.shape # checking reduced size

(211, 3)

<h2>Step 3: Where the borough is valid but the neighbourhood is 'Not assigned', use the borough as the neighbourhood</h2>

In [5]:
not_assigned = df['Neighbourhood'] == 'Not assigned'
df1 = df.loc[not_assigned].copy() # rows with a 'Not assigned' Neighbourhood; .copy() avoids SettingWithCopyWarning
df2 = df.loc[~not_assigned] # rows with a valid Neighbourhood
df1['Neighbourhood'] = df1['Borough'] # replace Neighbourhood with Borough
df1 = pd.concat((df1, df2), axis=0) # concatenate both DataFrames
df1


Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Queen's Park
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [6]:
df1.shape # Checking shape for correctness

(211, 3)
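
The same replacement can also be done without splitting the frame in two. The following is a minimal sketch, assuming the column names shown above; the three-row `sample` frame is made up for illustration. Unlike the concat approach, it keeps the original row order.

```python
import pandas as pd

# Toy rows mirroring the scraped table (illustrative data, not the full dataset)
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M7A', 'M9A'],
    'Borough': ['North York', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Parkwoods', 'Not assigned', 'Islington Avenue'],
})

# Where the neighbourhood is unassigned, take the borough name instead
unassigned = sample['Neighbourhood'] == 'Not assigned'
sample['Neighbourhood'] = sample['Neighbourhood'].mask(unassigned, sample['Borough'])

print(sample)
```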

<h2>Group by postal code and list neighbourhoods as comma-separated values</h2>

In [7]:
# Combine the neighbourhoods of each postal code into one comma-separated string
df_grp1 = df1.groupby(["Postcode", "Borough"])["Neighbourhood"].apply(', '.join).reset_index()
df_grp1.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


<h2>Use shape to print the number of rows in the final dataframe</h2>

In [9]:
df_grp1.shape

(103, 3)
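
A quick way to check the grouping is to confirm that each postal code now appears exactly once. The sketch below runs on a made-up three-row frame (`rows` is illustrative, not the scraped data) and assumes the same column names as above.

```python
import pandas as pd

# Toy input: one postcode with two neighbourhoods, one with a single neighbourhood
rows = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Join the neighbourhoods of each (Postcode, Borough) pair with ', '
grouped = (rows.groupby(['Postcode', 'Borough'])['Neighbourhood']
               .agg(', '.join)
               .reset_index())

# After grouping, every postcode should occur once
assert grouped['Postcode'].is_unique
assert len(grouped) == rows['Postcode'].nunique()
print(grouped)
```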

<h2>Load latitude and longitude from CSV</h2>

In [8]:
data = pd.read_csv("https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv") 
data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h2>Create the result dataframe 'neighborhoods'</h2>

In [21]:
# define the dataframe columns
column_names = ['Postcode','Borough', 'Neighborhood', 'Latitude', 'Longitude','Neighborhood_index'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)
index_ref = df_grp1['Borough'].unique().tolist() # unique Borough values, used as the index for the cluster map

<h2>Load data into 'neighborhoods'</h2>

In [23]:
# Collect one dict per postcode, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for _, row in df_grp1.iterrows():
    # look up the coordinates for this postcode
    neighborhood_latlon = data.loc[data['Postal Code'] == row['Postcode']]
    rows.append({'Postcode': row['Postcode'],
                 'Borough': row['Borough'],
                 'Neighborhood': row['Neighbourhood'],
                 'Latitude': neighborhood_latlon.iloc[0]['Latitude'],
                 'Longitude': neighborhood_latlon.iloc[0]['Longitude'],
                 'Neighborhood_index': index_ref.index(row['Borough'])})
neighborhoods = pd.DataFrame(rows, columns=column_names)

<h2>Expected result</h2>

In [24]:
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Neighborhood_index
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0
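
The row-by-row lookup can also be written as a single join. This is a sketch under assumptions, not the notebook's method: `grouped` and `coords` are small made-up stand-ins for `df_grp1` and the coordinates CSV, and `pd.factorize` is swapped in for the `index_ref.index(borough)` lookup (both number boroughs by order of first appearance).

```python
import pandas as pd

# Toy stand-ins for df_grp1 and the coordinates CSV (illustrative values)
grouped = pd.DataFrame({
    'Postcode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Woburn'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1G', 'M1B'],
    'Latitude': [43.770992, 43.806686],
    'Longitude': [-79.216917, -79.194353],
})

# Align the key column name, then join; validate= raises if a postcode repeats
merged = grouped.merge(
    coords.rename(columns={'Postal Code': 'Postcode'}),
    on='Postcode', how='left', validate='one_to_one',
).rename(columns={'Neighbourhood': 'Neighborhood'})

# Number boroughs by order of first appearance
merged['Neighborhood_index'] = pd.factorize(merged['Borough'])[0]
print(merged)
```

With `how='left'`, a postcode missing from the coordinates file yields NaN coordinates instead of raising, so checking `merged['Latitude'].isna().any()` afterwards is a reasonable guard.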


In [13]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       "Queen's Park", 'Mississauga', 'Etobicoke'], dtype=object)

In [30]:
#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

In [31]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


<h2>Create a map of Toronto with neighborhoods superimposed on top</h2>

In [27]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library


In [38]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add one marker per postcode, labelled with its neighborhoods and borough
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Toronto)
    
map_Toronto

<h1>Cluster Map Display</h1>

In [36]:
kclusters = 11 # one colour per borough
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: kclusters evenly spaced colours
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, coloured by borough index
for lat, lon, poi, cluster in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood'], neighborhoods['Neighborhood_index']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)], # Neighborhood_index is 0-based
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
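
The colour list only needs one distinct hex string per borough index. As a dependency-free illustration of that idea, the sketch below builds such a list with the standard-library `colorsys` module instead of matplotlib's `cm.rainbow`; `borough_palette` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys

def borough_palette(k):
    """Return k hex colours evenly spaced around the hue wheel (hypothetical helper)."""
    palette = []
    for i in range(k):
        # hue runs over [0, 1); saturation and value are fixed, arbitrary choices
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = borough_palette(11)
print(palette)
```

Indexing the result with the 0-based `Neighborhood_index` then gives each borough its own marker colour.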