# Data Science Final Project

This Jupyter notebook is used to develop the Capstone project for the IBM data science course offered through the Coursera platform.

In [1]:
import pandas as pd
import numpy as np

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


***

## Data preparation

The Toronto postal code table is obtained from Wikipedia and is stored in the "df" variable.

In [3]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

All rows whose "Borough" value is "Not assigned" are dropped.

In [4]:
df = df[df.Borough != "Not assigned"].copy()

If any value in the "Neighbourhood" column is "Not assigned", it is replaced with the name of the borough in the same row.

In [5]:
mask = df['Neighbourhood'].str.contains('Not assigned')
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
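The two cleaning steps above can be tried on a small made-up table. This is only a sketch: the `raw` and `clean` names and the sample rows are illustrative assumptions, not the scraped Wikipedia data. It uses a boolean mask with `.loc`, which assigns to the selected rows in one step.

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative values, not the real data).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A", "M9A"],
    "Borough": ["Not assigned", "North York", "Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Islington Avenue"],
})

# Drop rows without a borough; .copy() makes the result an independent frame,
# so the assignment below does not act on a view of `raw`.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Boolean mask of the rows whose neighbourhood is missing.
missing = clean["Neighbourhood"] == "Not assigned"

# Fill only those rows with the borough name from the same row.
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

print(clean)
```

The mask here uses an exact comparison instead of `str.contains`, so a neighbourhood whose name merely includes the text "Not assigned" would be left alone.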

The "Borough" column is extracted from the table. The data in "df" is copied into a variable called "boroughColumn", and the following operations are applied to the copy:
* Sort all rows by the "Postcode" column.
* Drop rows with a duplicated "Postcode" value, keeping the first occurrence.
* Reset the index so that it runs from 0 and lines up with the index of the grouped table built in the next step.
* Finally, select the "Borough" column and assign it back to the boroughColumn variable.

In [6]:
boroughColumn = df.copy()

boroughColumn.sort_values(by=['Postcode'], inplace = True)
boroughColumn.drop_duplicates(subset ="Postcode", inplace = True)
boroughColumn.reset_index(drop=True, inplace=True)
boroughColumn = boroughColumn["Borough"]

Several operations are then applied to the main table in the variable "df":
* Neighbourhoods that share the same postal code are grouped and joined into a single row.
* The index, which the grouping replaced with the "Postcode" values, is reset.
* All rows are sorted by the "Postcode" column.
* Finally, the "Borough" column is inserted at position 1 using the data in the variable "boroughColumn".

In [7]:
df = df.groupby('Postcode')['Neighbourhood'].agg(', '.join).reset_index().sort_values(by=['Postcode'])
df.insert(loc = 1, column="Borough", value=boroughColumn, allow_duplicates=False)
df.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
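As a possible alternative to extracting "Borough" separately and inserting it back by position, the grouping can be done on both "Postcode" and "Borough" at once. The sketch below runs on made-up rows (the `rows` and `grouped` names are illustrative) and assumes that every postal code belongs to exactly one borough.

```python
import pandas as pd

# Toy rows: one line per (postal code, neighbourhood) pair, illustrative only.
rows = pd.DataFrame({
    "Postcode": ["M1C", "M1B", "M1B", "M1C", "M1G"],
    "Borough": ["Scarborough"] * 5,
    "Neighbourhood": ["Highland Creek", "Rouge", "Malvern", "Rouge Hill", "Woburn"],
})

# Grouping on both keys keeps "Borough" in the result, so it never has to be
# re-inserted afterwards. groupby sorts by the keys by default, which also
# leaves the rows ordered by "Postcode".
grouped = (
    rows.groupby(["Postcode", "Borough"])["Neighbourhood"]
        .agg(", ".join)   # join the neighbourhood names of each group
        .reset_index()    # turn the group keys back into ordinary columns
)

print(grouped)
```

If a postal code did appear under two boroughs, this version would produce two rows for it instead of one, which makes the inconsistency visible.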


In [8]:
print("The dimensionality of the DataFrame is", df.shape)

The dimensionality of the DataFrame is (103, 3)


The geospatial data of the postal codes are imported from a URL.

In [9]:
geoData = pd.read_csv("http://cocl.us/Geospatial_data")

The "Latitude" and "Longitude" columns from the variable "geoData" are inserted directly into the main table in the variable "df". This works only because the rows of the CSV file are already sorted by postal code, just like the main table, and because both tables have the same number of rows.

In [10]:
df["Latitude"], df["Longitude"] = geoData["Latitude"], geoData["Longitude"]
df.head(10)

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
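Assigning the coordinate columns by position depends on both tables being in the same order. A key-based merge is one way to remove that dependency. The sketch below uses made-up frames (`codes`, `geo`, and `merged` are illustrative names), and the CSV header "Postal Code" is an assumption that should be checked against `geoData.columns` before reuse.

```python
import pandas as pd

# Toy stand-in for the main table; coordinates below are taken from the
# first rows of the output shown above.
codes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})

# Deliberately shuffled to show that row order no longer matters.
# ASSUMPTION: the key column in the CSV is called "Postal Code".
geo = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

merged = codes.merge(
    geo.rename(columns={"Postal Code": "Postcode"}),  # align the key names
    on="Postcode",
    how="left",               # keep every row of the main table, in its order
    validate="one_to_one",    # raise if a postal code is duplicated on either side
)

# Every postal code should have found its coordinates.
assert merged["Latitude"].notna().all()

print(merged)
```

With `how="left"`, a postal code missing from the coordinate file would show up as NaN instead of silently shifting every later row.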
