<h1 align=center><font size = 5>--------------------------------------------------------------------------------------------------------------</font></h1>

<h1 align=center><font size = 5>Coursera Capstone in <br>IBM Data Science with Python</font></h1>

<h3 align=center><font size = 3>Lutz Wimmer</font></h3>

## Assignment 1 - Capstone Project Notebook

<br><br>This Notebook is part of the Capstone Project for the "IBM Data Science with Python" course on Coursera.

In [1]:
# import Python libraries
import pandas as pd
import numpy as np
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Assignment 2 - Segmenting and Clustering Neighborhoods in Toronto

### Part 1 - Web scraping list of postal codes of Toronto, Canada

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
# saving webpage to memory as text
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
webtext = requests.get(url).text
webpage = BeautifulSoup(webtext,"html5lib")

webtab = webpage.find("table",class_='wikitable sortable')

In [4]:
# creating pandas df
toronto = pd.DataFrame(columns=["PostalCode","Borough","Neighborhood"])

for row in webtab.tbody.find_all("tr"):
    col = row.find_all("td")
    if (col != []):
        pc = col[0].text.strip()
        bor = col[1].text.strip()
        nei = col[2].text.strip()
        toronto = toronto.append({"PostalCode": pc, "Borough": bor, "Neighborhood": nei}, ignore_index=True)

# filter unassigned values
fil = toronto.index[toronto['Borough'] == "Not assigned"].tolist()
toronto = toronto.drop(fil, axis = 0).reset_index(drop = True)

# checking for duplicate postal codes and "Not assigned" Neighborhoods
# toronto.iloc[:,0].duplicated().describe()
# toronto.iloc[:,2] == "Not assigned"
# none found
toronto

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [5]:
toronto.shape

(103, 3)

### Part 2 - finding lat/long coordinates

Using the link to the .csv file, the geocoder doesn't work.

In [6]:
coords = pd.read_csv('https://cocl.us/Geospatial_data')
coords.shape

(103, 3)

In [7]:
# merging the two dataframes by PostalCode
coords.rename(columns={'Postal Code':'PostalCode'}, inplace = True)
toronto = toronto.merge(coords, on = 'PostalCode')
toronto

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


### Part 3 - clustering neighborhoods

using ".*Toronto*." to filter out any boroughs outside the city.

In [8]:
city = toronto[toronto['Borough'].str.match(".*Toronto*.")]
city['Borough'].unique()

array(['Downtown Toronto', 'East Toronto', 'West Toronto',
       'Central Toronto'], dtype=object)

The number of neighborhoods in the city is 39 and thus, the number of clusters is 39.<br>


In [9]:
import numpy as np
from sklearn.cluster import KMeans

# set number of clusters = number of neighborhoods
kclusters = len(city['Neighborhood'].unique())

toronto_clustering = city.drop(columns={'PostalCode','Borough','Neighborhood'})

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([22, 35, 32, 34, 12, 36, 11, 10,  4, 15, 33,  6, 16, 28, 23,  3, 37,
        8, 14,  9, 30, 20, 21, 25, 26,  1,  5, 17,  7, 19, 31,  2, 13, 24,
       18,  0, 38, 27, 29])


<br>Constructing a new dataframe with Toronto neighborhood centers coordinates for mapping.

In [10]:
neighborhoods = pd.DataFrame(columns=['Neighborhood','Cluster Labels','Latitude','Longitude']) 
names = city['Neighborhood'].unique()

for i in range(len(names)):
    one = names[i]
    two = kmeans.cluster_centers_[i][0]
    three = kmeans.cluster_centers_[i][1]
    four = kmeans.labels_[i]
    neighborhoods = neighborhoods.append({'Neighborhood':one,'Cluster Labels':four,'Latitude':two,'Longitude':three}, ignore_index=True)

neighborhoods.head()

Unnamed: 0,Neighborhood,Cluster Labels,Latitude,Longitude
0,"Regent Park, Harbourfront",22,43.667967,-79.367675
1,"Queen's Park, Ontario Provincial Government",35,43.64896,-79.456325
2,"Garden District, Ryerson",32,43.686412,-79.400049
3,St. James Town,34,43.668999,-79.315572
4,The Beaches,12,43.650571,-79.384568


Creating a map with folium.

In [11]:
# !conda install -c conda-forge folium=0.5.0 --yes
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

In [12]:
# create map
latitude = neighborhoods['Latitude'][3]
longitude = neighborhoods['Longitude'][3]
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood'], neighborhoods['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### A map with the Neighborhood centers for all 39 Neighborhoods in Toronto, CA