# Part 2 of Segmentation and Clustering of Neighborhoods in Toronto


We have built a dataframe containing the postal code, borough name, and neighborhood name for each neighborhood. To use the Foursquare location data, we also need the latitude and longitude coordinates of each neighborhood.

In [1]:
#importing libraries for scraping, cleaning and preparation of data
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
L_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(L_url).text

In [3]:
#the page is HTML, so parse it with the lxml HTML parser rather than the XML parser
soup = BeautifulSoup(source,'lxml')

In [4]:
table = soup.find('table')

In [5]:
#Assigning Column Names
column_names = ['PostalCode', 'Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

In [6]:
#Our dataframe headings
df

Unnamed: 0,PostalCode,Borough,Neighborhood


In [7]:
#searching for all the postal codes, boroughs and neighborhoods and adding them to our dataframe
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data

In [8]:
#This is how our Dataframe looks now
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
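
The row-by-row loop above can also be written so that rows are collected in a plain list and the dataframe is built once at the end, which avoids growing the dataframe on every iteration. The following is a minimal sketch of that idea, not the notebook's own code. It assumes a small hand-written HTML table (`sample_html`, a hypothetical stand-in for the Wikipedia page) so that it runs offline, and it uses Python's built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table, so the example runs offline
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table')

rows = []
for tr_cell in table.find_all('tr'):
    row_data = [td_cell.text.strip() for td_cell in tr_cell.find_all('td')]
    # The header row has only <th> cells, so row_data is empty and is skipped
    if len(row_data) == 3:
        rows.append(row_data)

# Build the dataframe once from the collected rows
sample_df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(sample_df)
```

The `len(row_data) == 3` check plays the same role as in the loop above: it drops the header row and any row that does not have exactly three data cells.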


## Cleaning data
For documentation, please see: https://github.com/CocaCola24/Coursera_Capstone/blob/master/Segmentation%20and%20Clustering.ipynb

In [9]:
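#drop rows that have no borough assigned
df = df[df.Borough != 'Not assigned'].copy()
#where a neighborhood is 'Not assigned', use the borough name instead
no_name = df['Neighborhood'] == 'Not assigned'
df.loc[no_name, 'Neighborhood'] = df.loc[no_name, 'Borough']
#combine neighborhoods that share a postal code into one comma-separated string
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
#final df after cleaning and preparation
df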

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [10]:
df.shape

(103, 3)
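
To make the three cleaning steps easier to follow, here is a small sketch that applies them to a toy dataframe. The rows in `raw` are hypothetical and only mimic the shape of the scraped table. The sketch assumes the same rules as the cleaning cell: drop rows without a borough, use the borough name when the neighborhood is not assigned, and join neighborhoods that share a postal code.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table
raw = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# 1. Drop rows with no borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Fall back to the borough name where the neighborhood is missing
missing = clean['Neighborhood'] == 'Not assigned'
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# 3. Combine neighborhoods that share a postal code into one string
clean = (
    clean.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(clean)
```

Using a boolean mask with `.loc` writes the change into the dataframe itself, whereas values assigned to the rows yielded by `iterrows()` are not guaranteed to be written back.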

#### The following function gets the geocode (latitude and longitude) for an input postal code

In [11]:
import geocoder  # third-party package (pip install geocoder)

def get_geocode(postal_code):
    # ll_coords holds the [latitude, longitude] pair; it stays None until the geocoder returns a result
    ll_coords = None

    # the geocoder can return None, so keep retrying until coordinates come back
    while ll_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        ll_coords = g.latlng
    latitude, longitude = ll_coords
    return latitude, longitude
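
The loop in `get_geocode` retries until the service answers, so it never finishes if the service keeps returning nothing. One possible variant caps the number of attempts. The sketch below is an assumption-laden illustration rather than part of the notebook: `get_geocode_with_retries`, `lookup`, `max_attempts`, and `delay_seconds` are hypothetical names, and the lookup is passed in as a callable so the retry logic can be exercised offline with a fake.

```python
import time

def get_geocode_with_retries(postal_code, lookup, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is a hypothetical callable that returns [lat, lng] or None.
    Returns a (latitude, longitude) tuple, or None if every attempt failed.
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for attempt in range(max_attempts):
        ll_coords = lookup(query)
        if ll_coords is not None:
            latitude, longitude = ll_coords
            return latitude, longitude
        # Pause before the next attempt (zero by default)
        time.sleep(delay_seconds)
    return None


# Fake lookup that fails twice before answering, to exercise the retry path offline
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.806686, -79.194353]

coords = get_geocode_with_retries('M1B', flaky_lookup)
print(coords)
```

With the `geocoder` package installed, the lookup could plausibly be `lambda q: geocoder.google(q).latlng`, which matches the call used in `get_geocode`.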

In [12]:
#Reading the file that maps postal codes to latitudes and longitudes into a dataframe, which we will merge later to obtain the final dataframe
geo_df=pd.read_csv('http://cocl.us/Geospatial_data')

In [13]:
#This is how our geospatial dataframe looks
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
#This is how our Neighborhood dataframe looks
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### Now we need to merge these two dataframes into a single dataframe containing a Postal code, a Borough, its neighborhoods, latitude and longitude

In [18]:
# The two dataframes name the postal code column differently, so rename it in geo_df to match
geo_df.rename(columns={'Postal Code':'PostalCode'},inplace=True)
geo_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [19]:
#Merging both the dataframes into one called geo_merged
geo_merged = pd.merge(geo_df, df, on='PostalCode')

In [20]:
geo_data=geo_merged[['PostalCode','Borough','Neighborhood','Latitude','Longitude']]

### Finally this is how our dataframe looks after merging coordinates and neighborhoods

In [21]:
geo_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
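
`pd.merge` performs an inner join by default, so a postal code that is missing from either dataframe is dropped without any warning. A quick way to check for that is sketched below on two toy dataframes. `neigh` and `coords` are hypothetical miniature versions of `df` and `geo_df`, and `M9Z` is a made-up postal code included only to produce an unmatched row.

```python
import pandas as pd

# Hypothetical miniature versions of df and geo_df
neigh = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Etobicoke'],
    'Neighborhood': ['Malvern, Rouge',
                     'Rouge Hill, Port Union, Highland Creek',
                     'Example'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every neighborhood row; indicator=True adds a _merge column;
# validate raises an error if a postal code appears more than once on either side
checked = pd.merge(neigh, coords, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If `unmatched` is empty for the real data, the inner join and the left join give the same rows, and the merged dataframe covers every postal code in `df`.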
