# In this lab:
#### In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

#### For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

#### Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

## 1.Scrape the table from Wikipedia 

#### Import the libraries

In [1]:
#!pip install lxml
import requests # use to get the URL
import lxml.html as lh # for parsing the relevant fields
import pandas as pd 

#### Scrape the table cells

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Send browser-like headers so the request is not rejected as a crawler
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
          'cookie': 'WMF-Last-Access=25-Jun-2020; WMF-Last-Access-Global=25-Jun-2020; GeoIP=GB:SCT:Glasgow:55.87:-4.26:v4; enwikimwuser-sessionId=55e1fd10fbf0f75c3c6f'}
#Create a handle,page,to handle the contents of the website
page = requests.get(url,headers = headers)

#Store the contents of the website under doc
doc = lh.fromstring(page.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

#### As a sanity check, ensure that all the rows have the same width. If not, we probably got something more than just the table.

In [3]:
#Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

###### The first 12 rows all have exactly 3 columns. This suggests that the data collected in tr_elements comes from the table.
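
Only the first 12 rows were checked above. A minimal sketch of counting the width of every row, using a small inline HTML string as a stand-in for the downloaded page. The `SAMPLE_HTML` content and the `wikitable` class filter are assumptions for illustration, not taken from the page itself.

```python
from collections import Counter

import lxml.html as lh

# Hypothetical stand-in for page.content: one data table plus an unrelated table.
SAMPLE_HTML = """
<html><body>
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table class="navbox">
  <tr><td>Some navigation link</td></tr>
</table>
</body></html>
"""

doc = lh.fromstring(SAMPLE_HTML)

# '//tr' matches rows from every table on the page.
all_rows = doc.xpath('//tr')
width_counts = Counter(len(row) for row in all_rows)

# Restricting the XPath to one table avoids picking up unrelated rows.
table_rows = doc.xpath('//table[contains(@class, "wikitable")]//tr')
table_widths = Counter(len(row) for row in table_rows)

print(width_counts)
print(table_widths)
```

If the first counter shows more than one width, some rows come from outside the table and the XPath may need narrowing.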

#### Parse Table Header
###### Parse the first row as our header

In [4]:
#tr_elements = doc.xpath('//tr')

#Create empty list
col = []
i = 0

#For each row, store each first element(header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))


1:"Postal Code
"
2:"Borough
"
3:"Neighborhood
"


#### Creating Pandas DataFrame
###### Each header is appended to a tuple along with an empty list

In [5]:

#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T = tr_elements[j]
   
    #If row is not of size 3, the //tr data is not from our table
    if len(T) != 3:
        break
    
    #Iterate through each cell of the row; i is the index of its column
    for i, t in enumerate(T.iterchildren()):
        #Append the cell text to the list of the i'th column
        col[i][1].append(t.text_content())

###### Just to be sure, let's check the length of each column. Ideally, they should all be the same.

In [6]:
[len(C) for (title,C) in col]

[181, 181, 181]

#### Creating the DataFrame

In [7]:
#.strip() removes the trailing '\n' from each header
Dict = {title.strip():column for (title,column) in col}
df = pd.DataFrame(Dict)

In [8]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


#### Delete "\n" in the dataframe 

In [9]:
columns1 = df.columns[0]
type(columns1)

str

In [10]:
import numpy as np
df = df.replace('\n','',regex = True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
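
The regex replace above removes every newline anywhere in a cell. A minimal sketch of an alternative that strips only leading and trailing whitespace, shown on a small hand-made frame (`sample` is hypothetical data shaped like the scraped table):

```python
import pandas as pd

# Hypothetical rows shaped like the scraped table, with trailing newlines.
sample = pd.DataFrame({
    'Postal Code': ['M3A\n', 'M4A\n'],
    'Borough': ['North York\n', 'North York\n'],
    'Neighborhood': ['Parkwoods\n', 'Victoria Village\n'],
})

# Strip surrounding whitespace from every string column.
cleaned = sample.apply(lambda column: column.str.strip())

print(cleaned)
```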


 <span style="color:red">  Question: why does this method not work?</span>
###### # df = df.rename(columns={df.columns[0]:'Postal Code'，df.columns[1]:'Borough',df.columns[2]:'Neighborhood'})
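
One likely reason the commented-out line above fails: the first separator inside the dictionary is a full-width comma (`，`, U+FF0C) rather than an ASCII comma, which Python rejects as a syntax error. A minimal sketch of the same rename with ASCII commas, on a hypothetical frame whose headers still carry the trailing newline:

```python
import pandas as pd

# Hypothetical frame with unstripped headers, as they come out of the scrape.
raw = pd.DataFrame({
    'Postal Code\n': ['M3A'],
    'Borough\n': ['North York'],
    'Neighborhood\n': ['Parkwoods'],
})

# Same mapping as the commented-out line, but with ASCII commas throughout.
renamed = raw.rename(columns={
    raw.columns[0]: 'Postal Code',
    raw.columns[1]: 'Borough',
    raw.columns[2]: 'Neighborhood',
})

print(list(renamed.columns))
```

In this notebook the headers were already stripped when the dictionary was built, so the rename is not needed.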

## 2.Data cleanup

#### Requirements:
##### 1.Remove Boroughs that are 'Not assigned'
##### 2.More than one neighborhood can exist in one postal code area; combine these into one row with the neighborhoods separated by commas
##### 3.If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough
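
The scraped table already lists several neighborhoods per postal code in one cell, so requirement 2 needs no extra code here. If the source had one row per neighborhood instead, a sketch like the following could combine them (the `long_form` frame is hypothetical sample data):

```python
import pandas as pd

# Hypothetical long-form data: one row per neighborhood.
long_form = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Group by postal code and borough, joining neighborhood names with a comma.
combined = (
    long_form
    .groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(combined)
```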

In [11]:
#Remove Boroughs that are 'Not assigned'
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [12]:
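#If Neighborhood is 'Not assigned', replace it with the Borough name.
#Assigning to `row` inside iterrows() only changes a copy, so use a vectorized update.
not_assigned = df['Neighborhood'] == 'Not assigned'
df = df.assign(Neighborhood=df['Neighborhood'].where(~not_assigned, df['Borough']))
df.head()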

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [13]:
df.shape

(104, 3)

## 3.Add Location

#### We will use a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data to get the latitude and the longitude coordinates of each neighborhood.

In [14]:
df_location = pd.read_csv('https://cocl.us/Geospatial_data')
df_location.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


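#### Combine these two data frames on the Postal Code.

###### First change the type to str (the original concern was that the object type cannot be used for matching and raises an error). The dtypes below show that both key columns are already object, and the conversion is left commented out.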

In [15]:
df.dtypes

Postal Code     object
Borough         object
Neighborhood    object
dtype: object

In [16]:
#df['Postal Code'] = df['Postal Code'].astype('str')

In [17]:
df_location.dtypes

Postal Code     object
Latitude       float64
Longitude      float64
dtype: object

In [18]:
#df_location['Postal Code'] = df_location['Postal Code'].astype('str')

#### Use the merge method to combine these two data frames into one on the column named 'Postal Code'

In [19]:
df_all = pd.merge(df, df_location, on=df.columns[0])
df_all

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
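
`pd.merge` defaults to an inner join, so postal codes missing from either frame are dropped silently. A minimal sketch of checking for unmatched codes with a left join and the `indicator` flag (both sample frames are hypothetical):

```python
import pandas as pd

# Hypothetical samples: 'M9Z' has no coordinates in the location table.
neighborhoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
locations = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every neighborhood row; '_merge' records where each row matched.
checked = pd.merge(neighborhoods, locations, on='Postal Code', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()

print(unmatched)
```

An empty `unmatched` list would mean every postal code found its coordinates.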


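#### Delete the NaN rows. The dropna call is left commented out: the default inner merge keeps only postal codes present in both data frames.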

In [20]:
#df_all.dropna(subset = ["Borough"], inplace=True)
#df_all
df_all['Borough'] = df_all['Borough'].astype('str')

In [21]:
df_all.dtypes

Postal Code      object
Borough          object
Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

## 3.Explore and cluster the neighborhoods in Toronto.

#### You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

###### Just make sure:

###### 1.to add enough Markdown cells to explain what you decided to do and to report any observations you make.
###### 2.to generate maps to visualize your neighborhoods and how they cluster together.

#### Let's select the boroughs whose names contain 'Toronto' and make a new data frame

In [22]:
df_toronto = df.loc[df['Borough'].str.contains('Toronto')]
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
22,M5C,Downtown Toronto,St. James Town
30,M4E,East Toronto,The Beaches


In [23]:
df_toronto = pd.merge(df_toronto, df_location, on=df_toronto.columns[0])
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


## 4.Visualizing all the Neighbourhoods of the above data frame using Folium

#### Install the package 'Folium'

In [24]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium
print('Folium installed and imported!')

Folium installed and imported!


#### Create the map of Toronto

In [25]:
toronto_latitude = 43.651070
toronto_longitude = -79.347015
map_toronto = folium.Map(location=[toronto_latitude,toronto_longitude],zoom_start=8)
map_toronto

In [26]:
map_toronto = folium.Map(location=[toronto_latitude,toronto_longitude],zoom_start=10)

for lat,lng,borough,neighborhood in zip(df_toronto['Latitude'],df_toronto['Longitude'],df_toronto['Borough'],df_toronto['Neighborhood']):
    #Popup takes the label text first; parse_html=True treats it as plain text
    label = folium.Popup('{}, {}'.format(neighborhood, borough), parse_html=True)
    folium.CircleMarker(
        location=[lat,lng],
        radius=4,
        popup=label,
        color='pink',
        fill=True,
        fill_color='purple',
        fill_opacity=0.6,
    ).add_to(map_toronto)
    
map_toronto

## 5.Clustering 

#### Import KMeans package

In [27]:
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

#### Run k-means to cluster the neighborhood into 4 clusters.

In [28]:
# set number of clusters
k=4 
toronto_clustering = df_toronto.drop(['Postal Code','Borough','Neighborhood'],axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

# add clustering labels
df_toronto.insert(0, 'Cluster Labels', kmeans.labels_)

df_toronto.head()

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,1,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,1,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,3,M4E,East Toronto,The Beaches,43.676357,-79.293031
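
The choice of k=4 above is fixed by hand. One common way to sanity-check it is to compare the k-means inertia (within-cluster sum of squares) across several values of k. A minimal sketch on synthetic coordinates (the `points` array is generated data, not the Toronto coordinates):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic latitude/longitude-like points around four hypothetical centres.
rng = np.random.default_rng(0)
centres = np.array([[43.65, -79.38], [43.68, -79.30], [43.70, -79.40], [43.64, -79.45]])
points = np.vstack([centre + rng.normal(scale=0.005, size=(20, 2)) for centre in centres])

# Fit k-means for several k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 5))
```

A visible bend in the inertia values as k grows is often taken as a reasonable choice of k.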


#### Following the lab 'Segmenting and Clustering Neighborhoods in New York City', plot the clusters on a map.

In [29]:
# create map
map_clusters = folium.Map(location=[toronto_latitude,toronto_longitude],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, labelled with the neighborhood and its cluster
for lat, lon, neighborhood, cluster in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood'], df_toronto['Cluster Labels']):
    label = folium.Popup(str(neighborhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # labels run from 0 to k-1, so index the color list directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
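
The `x` and `ys` arrays above are only used for their length, so the color list depends on `k` alone. A minimal sketch of building one hex color per cluster and indexing it directly by label (the `labels` list is hypothetical):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 4

# One evenly spaced color from the rainbow colormap per cluster.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# Hypothetical cluster labels, as produced by kmeans.labels_ (values 0..k-1).
labels = [1, 1, 3, 0, 2]
marker_colors = [palette[label] for label in labels]

print(palette)
print(marker_colors)
```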

# THANKS