


# Segmenting and Clustering Neighborhoods in Toronto

##### For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

*Assumptions*
1. If Neighbourhood is not assigned, use Borough
2. If Borough is not assigned, ignore data
3. Multiple neighborhoods may be assigned to a single postal code

## PART 1: Web Scrape 

In [1]:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url)

print(len(dfs))

3


In [2]:
df = dfs[0]

### Requirement 1: The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [3]:
# Rename only the columns whose names differ from the required ones
df.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"}, inplace=True)
print(df.columns)
print(df.shape)
df.head()



Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')
(180, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Requirement 2: Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [4]:
df = df[df.Borough != 'Not assigned']
print(df.shape)
df

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### Requirement 3: Combine duplicate postal code rows into one row with multiple neighborhoods
#### After checking for duplicate rows by postal code and finding none (as shown), no further action is required on this step

In [5]:
# Count rows whose postal code duplicates an earlier row
dups = df.duplicated(subset='PostalCode').sum()
print("There are", dups, 'duplicated rows')

# If duplicate postal code entries existed, they could be combined like this:
# df1 = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()


There are 0 duplicated rows
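The check found nothing to merge, but the combining step this requirement describes can be sketched on a small made-up frame. The data and the names `toy` and `combined` are illustrative assumptions, not part of the notebook's dataset.

```python
import pandas as pd

# Toy data (made up for illustration): postal code M1A appears twice.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M1A", "M2B"],
    "Borough": ["North York", "North York", "Etobicoke"],
    "Neighborhood": ["Alpha", "Beta", "Gamma"],
})

# Group rows sharing a postal code and borough, then join their
# neighborhood names into one comma-separated string.
combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .apply(", ".join)
       .reset_index()
)
print(combined)
```

Passing `sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.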


### Requirement 4: If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

#### After checking for "Not assigned" values in Neighborhood and finding none, no further action is required on this step.

In [6]:
# Check whether any "Not assigned" values exist in Neighborhood
not_assigned = df['Neighborhood'] == 'Not assigned'
exists = not_assigned.any()
print("There are values in NEIGHBORHOOD which are not assigned:")
print(exists)

# If any existed, they could be replaced with the borough name:
# df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']

There are values in NEIGHBORHOOD which are not assigned:
False


### Requirement 5: Summarize shape of updated dataframe

In [7]:
df.reset_index(drop=True, inplace=True)
print(df.head())
df.shape

  PostalCode           Borough                                 Neighborhood
0        M3A        North York                                    Parkwoods
1        M4A        North York                             Victoria Village
2        M5A  Downtown Toronto                    Regent Park, Harbourfront
3        M6A        North York             Lawrence Manor, Lawrence Heights
4        M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government


(103, 3)

## PART 2: Geographic coordinates by Postal Code

### Option 1: Geocoder  [NOT SELECTED]

#### Example code:

import geocoder
postal_code = "M5A"

#### initialize your variable to None
lat_lng_coords = None

#### Loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

### Option 2: Import CSV [SELECTED]

In [8]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
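The loading code was stripped from this cell, so the following is only a guess at its general shape, assuming the coordinates come from a CSV file with the three columns shown. The in-memory text stands in for that file and contains just the five rows displayed above; `csv_text` and `df_ll_sample` are illustrative names.

```python
import io

import pandas as pd

# Stand-in for the CSV file: the five rows shown in the output above.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
"""

# With a real file, the argument would be the path to that file instead
# of an in-memory buffer.
df_ll_sample = pd.read_csv(io.StringIO(csv_text))
print(df_ll_sample.head())
```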


In [9]:
df_ll.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
print(df_ll.columns)


Index(['PostalCode', 'Latitude', 'Longitude'], dtype='object')


In [10]:
df = pd.merge(df,df_ll, how='outer',on='PostalCode')
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
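An outer merge keeps postal codes that appear in only one of the two frames, which can introduce rows with missing values. One way to check for that, sketched on made-up frames, is the `indicator` and `validate` options of `pd.merge`. The frames `left` and `right` and the codes `M9Z` and `M0X` are invented for illustration.

```python
import pandas as pd

# Toy frames (made up): "M9Z" has no coordinates, "M0X" has no borough.
left = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M9Z"],
                     "Borough": ["North York", "North York", "Etobicoke"]})
right = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M0X"],
                      "Latitude": [43.753259, 43.725882, 43.0],
                      "Longitude": [-79.329656, -79.315572, -79.0]})

# indicator=True adds a "_merge" column saying where each row came from;
# validate="one_to_one" raises if a postal code repeats on either side.
checked = pd.merge(left, right, how="outer", on="PostalCode",
                   indicator=True, validate="one_to_one")
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])

# A left join keeps only the postal codes already present in the first frame.
left_joined = pd.merge(left, right, how="left", on="PostalCode")
print(left_joined)
```

If `unmatched` is empty, the outer and left joins give the same rows, so the choice of `how` does not matter for that data.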


## PART 3: Explore and Cluster Neighborhoods

*Assumptions*

1. Only boroughs whose name contains "Toronto" are explored; all other boroughs are ignored

In [11]:
# Flag the rows whose borough name contains "Toronto"
is_toronto = df['Borough'].str.contains("Toronto", na=False)

# Create new dataframe dfT of only Toronto neighborhoods
dfT = df[is_toronto].reset_index(drop=True)

print(dfT.shape)
dfT.head()



(39, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [19]:

neighborhoods = dfT.drop('PostalCode', axis=1)
neighborhoods.head()


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,Downtown Toronto,St. James Town,43.651494,-79.375418
4,East Toronto,The Beaches,43.676357,-79.293031


## Export neighborhood dataframe to CSV in Watson Studio

In [25]:
from project_lib import Project


# Install and initialize PySpark
!pip install pyspark
from pyspark import SparkContext, SparkConf

# Spark config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)

project = Project(sc, "PROJECT ID", "Token")

Collecting pyspark
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612244 sha256=58b07081e1324dc12b75b38da017b07aa82e234b9c045e70bab3b383db8ae9c9
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/5e/34/fa/b37b5cef503fc5148b478b2495043ba61b079120b7ff379f9b
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1


2021-01-12 18:22:01,659 - __PROJECT_LIB__ - ERROR - failed to initialize ibmos2spark integration
Traceback (most recent call last):
  File "/opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages/project_lib/storage/bcos.py", line 138, in _initialize_bcos2spark
    import ibmos2spark
ModuleNotFoundError: No module named 'ibmos2spark'


In [27]:
project.save_data("neighborhood.csv", neighborhoods.to_csv(index=False))

{'file_name': 'neighborhood.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'courseracapstonedatascience-donotdelete-pr-rdkbz4b0qough4',
 'asset_id': '56e6b62c-0998-4b4c-82b6-85903a0cc817'}