# Capstone Project

This notebook is for the Capstone project. We will use pandas in Python, along with various machine learning techniques, to deliver the final outcome.

<b> Install the BeautifulSoup4 tool for data scraping

In [3]:
# Installing Dependencies

!pip install requests bs4 pandas

print("beautifulsoup4 is SUCCESSFULLY installed !")

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
beautifulsoup4 is SUCCESSFULLY installed !


In [4]:
from bs4 import BeautifulSoup # library for parsing HTML
from urllib.request import urlopen # standard-library helper for opening URLs

import requests # library for making HTTP requests
import json # for parsing JSON data
import pandas as pd # library for tabular data manipulation

<h2>1. Web Scraping and Data Preparation

<b> Extract the table data from the Wikipedia HTML page

In [6]:
# Request html page from our target URL

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

table_data = soup.find('table')
#print(table_data.prettify())

<b>Capture the table data cells from the HTML page in a DataFrame

In [7]:
# Collect every table row from the second row onward (skipping the header row) and place the cells under our own column headers.

data = []
for tr in table_data.find_all('tr')[1:]:
    row_data = tr.find_all('td')
    data.append([cell.text for cell in row_data])
df_data = pd.DataFrame(data, columns = ['PostalCode', 'Borough', 'Neighborhood'])
df_data.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


<b> Data cleanup: remove the newline character (\n) from the dataset.

In [8]:
df_data['PostalCode'] = df_data['PostalCode'].str.split('\n', expand = True)[0]
df_data['Borough'] = df_data['Borough'].str.split('\n', expand = True)[0]
df_data['Neighborhood'] = df_data['Neighborhood'].str.split('\n', expand = True)[0]

df_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
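
The cleanup above keeps only the text before the first `\n` in each cell. A minimal sketch of an alternative, assuming the only unwanted characters are leading or trailing whitespace: `Series.str.strip()` does the same job in one call per column. The `sample` frame is hypothetical and only mimics the shape of the scraped rows.

```python
import pandas as pd

# Hypothetical sample that mimics the scraped rows (trailing "\n" on each cell).
sample = pd.DataFrame(
    [["M1A\n", "Not assigned\n", "\n"],
     ["M3A\n", "North York\n", "Parkwoods\n"]],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Strip leading/trailing whitespace (including "\n") from every column.
for col in sample.columns:
    sample[col] = sample[col].str.strip()

print(sample.values.tolist())
```

Unlike the split approach, this would keep any text that follows an embedded newline, so the two are only equivalent when the newline is at the end of the cell.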


<b>Only process the cells that have an assigned borough (drop rows where Borough = "Not assigned")

In [9]:
# Clean the dataset: remove records whose Borough is "Not assigned"

df_cleaned_data = df_data[df_data.Borough != 'Not assigned']
df_cleaned_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


<b>Check whether any PostalCode has multiple records; if so, merge those records

In [10]:
# Check whether any PostalCode has multiple records.

duplicateRowsDF = df_cleaned_data[df_cleaned_data.duplicated(['PostalCode'])]

Multi_records_postalCode = duplicateRowsDF.shape[0]

# For a PostalCode with multiple records, concatenate the 'Neighborhood' values and keep only one record in the DataFrame

if Multi_records_postalCode == 0:
    print("No PostalCode has multiple records, so there is nothing to merge")
else:
    df_cleaned_data = df_cleaned_data.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()

No PostalCode has multiple records, so there is nothing to merge
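
Because the scraped table has no repeated postal codes, the merge branch above never runs. As a rough illustration of what it would do, here is the same `groupby` + `join` pattern applied to a hypothetical `toy` frame in which `M5A` appears twice. The rows are invented for the sketch and are not taken from the scraped data.

```python
import pandas as pd

# Hypothetical rows: postal code M5A appears twice with different neighborhoods.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group on the key columns and join the neighborhood names of each group.
merged = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
       .apply(", ".join)
       .reset_index()
)

print(merged)
```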


<b>Check whether a cell has a borough but a "Not assigned" neighborhood; if so, use the borough as the neighborhood.

In [11]:
# If Neighborhood == 'Not assigned' then Neighborhood = Borough

df_cleaned_data.loc[df_cleaned_data['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df_cleaned_data['Borough']

df_cleaned_data.head(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
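
The `SettingWithCopyWarning` in the output above appears because `df_cleaned_data` was produced by filtering `df_data`, so pandas cannot tell whether the assignment targets the original frame or a copy. A minimal sketch of one common way to avoid it, assuming an independent copy is acceptable: call `.copy()` on the filtered frame before writing to it. The `raw` frame is hypothetical.

```python
import pandas as pd

# Hypothetical data: one unassigned borough, one unassigned neighborhood.
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M7A", "M3A"],
    "Borough": ["Not assigned", "Queen's Park", "North York"],
    "Neighborhood": ["", "Not assigned", "Parkwoods"],
})

# .copy() makes the filtered frame independent of `raw`,
# so later writes are unambiguous and raise no warning.
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Where the neighborhood is unassigned, fall back to the borough name.
mask = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[mask, "Neighborhood"] = cleaned.loc[mask, "Borough"]

print(cleaned)
```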


<b>Save the cleaned data to a CSV file and report the shape of the DataFrame.

In [12]:
df_cleaned_data.to_csv('Final_Cleaned_dataset.csv', index = False)

df_cleaned_data.shape

(103, 3)

<h2>2. Get the Geo Coordinates for Each Neighborhood

<b> Let's install all the geo libraries

In [13]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming JSON data into a pandas DataFrame
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    ca-certificates-2020.4.5.1 |       hecc5488_0         146 KB  conda-forge
    geopy-1.22.0               |     pyh9f0ad1d_0          63 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    certifi-2020.4.5.1         |   py36h9f0ad1d_0         151 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0           conda-forge
    geopy:          

<b> The geocoder code below uses Nominatim with a "foursquare_agent" user agent, but it does not work for every neighborhood

In [14]:
geolocator = Nominatim(user_agent="foursquare_agent")

df_NeighData = pd.read_csv('Final_Cleaned_dataset.csv')

Rec_Count = df_NeighData.shape[0]

for i in range(Rec_Count):
    address = df_NeighData.Neighborhood[i]
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print(i,address,latitude,longitude)

0 Parkwoods 37.8567738 -122.22068778004532
1 Victoria Village 43.732658 -79.3111892
2 Regent Park, Harbourfront 43.64076885 -79.37989177980148
3 Lawrence Manor, Lawrence Heights 43.7227784 -79.4509332


AttributeError: 'NoneType' object has no attribute 'latitude'
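
The `AttributeError` above occurs because `geocode` returns `None` when it finds no match. The first row also looks wrong: "Parkwoods" resolved to roughly (37.86, -122.22), which is nowhere near Toronto, suggesting that bare neighborhood names are ambiguous. Below is a minimal sketch of a guard, under stated assumptions: `safe_geocode` is a hypothetical helper, the `", Toronto, Ontario"` suffix is an assumed way to narrow the query, and `fake_geocode` is an offline stand-in so the example runs without network access.

```python
from collections import namedtuple

# Stand-in for geopy's Location; only the two attributes used here.
Location = namedtuple("Location", ["latitude", "longitude"])

def fake_geocode(query):
    """Offline stand-in for geolocator.geocode: returns None when nothing matches."""
    # Placeholder coordinates reused from the notebook output above.
    known = {"Victoria Village, Toronto, Ontario": Location(43.732658, -79.3111892)}
    return known.get(query)

def safe_geocode(geocode, neighborhood, suffix=", Toronto, Ontario"):
    """Return (latitude, longitude), or (None, None) when the geocoder finds nothing."""
    location = geocode(neighborhood + suffix)
    if location is None:
        return None, None
    return location.latitude, location.longitude

for name in ["Victoria Village", "No Such Place"]:
    print(name, safe_geocode(fake_geocode, name))
```

With the real geocoder, `geolocator.geocode` would be passed in place of `fake_geocode`; rows that come back as `(None, None)` can then be inspected rather than stopping the loop.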

<b> Use the geospatial data CSV for the latitude and longitude instead

In [15]:
!wget -O GeoCord.csv http://cocl.us/Geospatial_data/

--2020-05-30 06:09:31--  http://cocl.us/Geospatial_data/
Resolving cocl.us (cocl.us)... 158.85.108.83, 169.48.113.194, 158.85.108.86
Connecting to cocl.us (cocl.us)|158.85.108.83|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data/ [following]
--2020-05-30 06:09:31--  https://cocl.us/Geospatial_data/
Connecting to cocl.us (cocl.us)|158.85.108.83|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-05-30 06:09:32--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 185.235.236.197
Connecting to ibm.box.com (ibm.box.com)|185.235.236.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-05-30 06:09:33--  https://ibm.box.com/p

In [16]:
df_geospatial = pd.read_csv('GeoCord.csv')

df_geospatial.head(10)

Final_Dataset = pd.merge(df_NeighData, df_geospatial, left_on='PostalCode', right_on='Postal Code').drop(['Postal Code'], axis = 1)
Final_Dataset.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
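
`pd.merge` defaults to an inner join, so any postal code missing from the geospatial file would be dropped without notice. The matching row counts in this notebook, (103, 3) before and (103, 5) after, suggest nothing was lost here. As a hedged sketch of how that could be checked explicitly, a left join with `indicator=True` flags unmatched rows, and `validate="one_to_one"` raises an error if either key column contains duplicates. The frames and the `M9Z` code below are hypothetical.

```python
import pandas as pd

# Hypothetical inputs; "M9Z" deliberately has no coordinates.
neigh = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
geo = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Left join keeps every neighborhood row; "_merge" records where each row matched.
checked = pd.merge(
    neigh, geo,
    left_on="PostalCode", right_on="Postal Code",
    how="left", indicator=True, validate="one_to_one",
)

# Postal codes that found no coordinates in the geospatial table.
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```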


<b> Save the final dataset with geo coordinates

In [17]:
Final_Dataset.to_csv("Final_Dataset_with_Geo_coordinates.csv", index = False)
Final_Dataset.shape

(103, 5)