<a href="https://colab.research.google.com/github/BrandaoEid/IBM/blob/master/Segmenting_and_Clustering_Neighborhoods_in_Toronto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Segmenting and Clustering Neighborhoods in Toronto

In [193]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

#!pip install geocoder
import geocoder
import folium


print('Libraries imported!')

Libraries imported!


## Web Scraping

In [0]:
# Download the page; the variable holds the HTML text, not the URL
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [0]:
soup = BeautifulSoup(html, 'lxml')

In [0]:
tables = soup.find('table',{'class':'wikitable sortable'})

In [0]:
# Create lists to hold the data we extract
postalCodes = []
borough = []
neighborhood = []

rows = tables.find_all('tr')

for row in rows:
    cells = row.find_all('td')

    # Header rows contain only <th> cells; data rows need all three <td> cells
    if len(cells) >= 3:
        postalCodes.append(cells[0].text.replace("\n", "").strip())
        borough.append(cells[1].text.replace("\n", "").strip())
        neighborhood.append(cells[2].text.replace("\n", "").strip().replace(" / ", ", "))
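
The row-parsing logic can be tried offline on a small inline table. This is only a sketch: the HTML snippet, the `parse_rows` helper, and the sample values are assumptions made for illustration, not the live Wikipedia markup. It uses Python's built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park / Harbourfront
</td></tr>
</table>
"""


def parse_rows(html):
    """Return (postal_code, borough, neighborhood) tuples from the first wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Header rows hold <th> cells only, so they yield no <td> and are skipped.
        if len(cells) >= 3:
            code, boro, hood = (c.get_text(strip=True) for c in cells[:3])
            records.append((code, boro, hood.replace(" / ", ", ")))
    return records


records = parse_rows(SAMPLE_HTML)
print(records)
```

Collecting tuples in one list, rather than three parallel lists, keeps each row's fields together; either layout can be passed to `pd.DataFrame`.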

## Data manipulation

The dataframe consists of three columns: PostalCode, Borough, and Neighborhood.

In [0]:
loc_df = pd.DataFrame(
    {'PostalCode': postalCodes,
     'Borough': borough,
     'Neighborhood': neighborhood
    })

Keep only the rows that have an assigned borough; drop every row whose Borough is "Not assigned".

In [0]:
loc_df = loc_df[loc_df['Borough'] != 'Not assigned']

More than one neighborhood can exist in one postal code area. For example, M5A covers two neighborhoods: Regent Park and Harbourfront. The scraped table lists such neighborhoods in a single cell separated by " / ", which the scraping step rewrites as a comma-separated list, as the M5A row below shows.
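
If a source table instead lists a postal code on several rows, the rows can be merged with a group-by. The following is a minimal sketch on a hand-made frame; `sample_df` and its values are illustrative assumptions, not scraped data.

```python
import pandas as pd

# Hypothetical input where M5A appears on two separate rows.
sample_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Join the neighborhoods of each (PostalCode, Borough) pair with ", ".
# sort=False keeps the groups in order of first appearance.
combined = (
    sample_df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```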


In [0]:
loc_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


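If a row has a borough but no assigned neighborhood, the neighborhood should be set to the borough name. The cell below checks for such rows, treating both an empty string and "Not assigned" as missing.

In [0]:
loc_df[(loc_df['Borough'] != 'Not assigned') & (loc_df['Neighborhood'].isin(['', 'Not assigned']))]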

Unnamed: 0,PostalCode,Borough,Neighborhood
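
The check above returns no rows, so no fix is needed here. Had any rows matched, the rule could be applied as in this sketch; `sample_df` and its values are hypothetical and only illustrate the assignment.

```python
import pandas as pd

# Hypothetical rows: the first one has a borough but no neighborhood.
sample_df = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Islington Avenue"],
})

# Treat both an empty string and "Not assigned" as missing.
missing = sample_df["Neighborhood"].isin(["", "Not assigned"])

# Copy the borough name into the neighborhood column for those rows only.
sample_df.loc[missing, "Neighborhood"] = sample_df.loc[missing, "Borough"]
print(sample_df)
```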


Finally, use the .shape attribute to print the number of rows in the dataframe.

In [0]:
print('Rows: {}'.format(loc_df.shape[0]))

Rows: 103


## Latitude and longitude data

In [0]:
!wget -q -O 'lat_lon_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded')

Data downloaded


In [0]:
geo_df = pd.read_csv('lat_lon_data.csv')

In [0]:
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [0]:
geo_df[geo_df['Postal Code'] == 'M5G']

Unnamed: 0,Postal Code,Latitude,Longitude
57,M5G,43.657952,-79.387383


## Merge data

In [0]:
final_df = pd.merge(left=loc_df, right=geo_df, left_on='PostalCode', right_on='Postal Code')

In [0]:
final_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",M5A,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",M6A,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",M7A,43.662301,-79.389494


In [0]:
final_df.drop(columns='Postal Code', inplace=True)

In [0]:
final_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
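
An inner merge silently drops postal codes that have no coordinates. One way to check for this is a left merge with `indicator=True`, sketched here on made-up frames (`left`, `right`, and the code `M0X` are assumptions for illustration). Renaming the key column first also avoids carrying a duplicate `Postal Code` column.

```python
import pandas as pd

# Hypothetical neighborhood table; M0X has no matching coordinates.
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Hypothetical"],
})

# Hypothetical coordinate table using the CSV's column naming.
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Align the key names, keep every left row, and record where each row came from.
merged = left.merge(
    right.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    indicator=True,
)

# Rows marked "left_only" found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```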


## Borough processing

In [0]:
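def toronto(name):
    """Return the borough name if it contains the word 'Toronto', else an empty string."""
    return name if 'Toronto' in name.split() else ''

In [0]:
# Flag boroughs whose name includes 'Toronto'; other rows get an empty string
final_df['Toronto?'] = final_df['Borough'].apply(toronto)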

In [0]:
toronto_df = final_df[final_df['Toronto?'] != ''][['PostalCode','Borough','Neighborhood','Latitude','Longitude']]

In [202]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
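
The same filter can be written without a helper column by using pandas string methods. This is a sketch on a hypothetical `sample_df`. Note that `str.contains` matches substrings, whereas the helper function matches whole words, so the two can differ for names that merely embed "Toronto".

```python
import pandas as pd

# Hypothetical borough names for illustration.
sample_df = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "East Toronto", "Etobicoke"],
})

# regex=False treats "Toronto" as a literal substring rather than a pattern.
is_toronto = sample_df["Borough"].str.contains("Toronto", regex=False)

# Boolean indexing keeps only the flagged rows.
toronto_only = sample_df[is_toronto]
print(toronto_only)
```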


In [199]:
map_toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=11)

for lat, lng, label in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

## Cluster Neighborhoods

In [0]:
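from sklearn.cluster import KMeans

clusters_num = 4

# NOTE: the original cell referenced `manhattan_grouped`, a feature table that is
# never built in this notebook. As a stand-in (an assumption, not the original
# method), cluster on the coordinates already present in toronto_df.
toronto_grouped_clustering = toronto_df[['Latitude', 'Longitude']]

# run k-means clustering
kmeans = KMeans(n_clusters=clusters_num, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]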