<h1>Applied Data Science - Capstone Project - Assignment 2 </h1>
<p>This assignment scrapes data from Wikipedia and loads it into a pandas dataframe.</p>

<h2>First Step: Scraping Data into a Pandas Dataframe</h2>
<ol>
    <li>There are 3 tables on the page. However, we only need the first table for the dataframe.</li>
    <li>The first column, 'Postal Code', identifies each row of the dataframe.</li>
</ol>
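
Selecting the table by position works as long as the page layout does not change. The following is a minimal sketch of a more defensive choice, assuming the wanted table can be recognised by its column names. `pick_table` is a hypothetical helper, and the small dataframes only stand in for the list that `pd.read_html` returns.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table has columns {sorted(required)}")


# Stand-ins for the list returned by pd.read_html(url); the first entry is made up.
tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Postal Code": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Victoria Village"],
    }),
]

df_example = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
```

If no table matches, the helper raises an error instead of silently returning the wrong table.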

In [37]:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url)

df = dfs[0]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


<h2>Second Step: Cleaning the data in the dataframe</h2>
<ol>
    <li>Dropping rows whose Borough value is 'Not assigned'.</li>
    <li>Replacing the Neighbourhood value with the Borough value where Neighbourhood is 'Not assigned'.</li>
</ol>
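
The two cleaning steps can also be written as a function that leaves its input untouched. This is only a sketch on a tiny hand-made dataframe: `clean_postal_codes` is a hypothetical name, and the `M0X` row is made up to exercise the second step.

```python
import pandas as pd


def clean_postal_codes(raw):
    """Drop unassigned boroughs, then fill unassigned neighbourhoods."""
    # Step 1: keep only rows that have a borough
    cleaned = raw[raw["Borough"] != "Not assigned"].reset_index(drop=True)
    # Step 2: where the neighbourhood is unassigned, reuse the borough name
    unassigned = cleaned["Neighbourhood"] == "Not assigned"
    filled = cleaned["Neighbourhood"].mask(unassigned, cleaned["Borough"])
    return cleaned.assign(Neighbourhood=filled)


raw_example = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M0X"],  # M0X is a made-up code
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

cleaned_example = clean_postal_codes(raw_example)
```

Returning a new dataframe instead of using `inplace=True` keeps the raw scrape available for comparison.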

In [38]:
# Remove rows whose Borough is 'Not assigned', then renumber the index
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)
df.reset_index(drop=True, inplace=True)
df


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [39]:
# Where Neighbourhood is 'Not assigned', use the Borough name instead
df.loc[df.Neighbourhood == 'Not assigned', 'Neighbourhood'] = df.Borough
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [40]:
#printing shape of the dataframe
df.shape

(103, 3)

<h2>Third Step: Adding Longitude and Latitude Data to the Dataframe</h2>
<ol>
    <li>Reading longitude and latitude data from csv file</li>
    <li>Adding longitude and latitude data to dataframe</li>
</ol>
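
A default merge keeps only postal codes present in both dataframes, so codes without coordinates would disappear silently. The sketch below shows one way to surface them, using the `how`, `validate` and `indicator` parameters of `pd.merge`. The dataframes are small stand-ins, and `M0X` is a made-up code with no coordinates.

```python
import pandas as pd

postal = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0X"],  # M0X is a made-up code
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postal code; validate checks the key is unique on both
# sides; indicator adds a "_merge" column that records where each row came from.
merged = pd.merge(postal, coords, on="Postal Code", how="left",
                  validate="one_to_one", indicator=True)

# Postal codes that found no match in the coordinates file
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
```

An empty `missing` list means every postal code received coordinates.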

In [41]:
#read csv file
df_coord = pd.read_csv("http://cocl.us/Geospatial_data")
df_coord

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [43]:
# Merge the two dataframes on their shared 'Postal Code' column
neighborhoods = pd.merge(df, df_coord, on='Postal Code')
neighborhoods

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


<h2>Explore and Cluster Neighbourhood Data</h2>

In [45]:
#Get the latitude and longitude data for Toronto
from geopy.geocoders import Nominatim 
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [49]:
pip install folium

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Note: you may need to restart the kernel to use updated packages.


In [52]:
import folium

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for Latitude, Longitude, Borough, Neighbourhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood']):
    label = '{}, {}'.format(Neighbourhood, Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [Latitude, Longitude],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<h2>Using Foursquare to Explore Neighbourhoods</h2>

1. Setting the credentials for Foursquare

In [59]:
# Set the credentials used to call the Foursquare API
CLIENT_ID = 'RNJ0DDXZB2L4SWVE4SQB4GDO1CYQKQTIXDK1ACQ2F4W3MQJP' # your Foursquare ID
CLIENT_SECRET = 'XZ0KMUQHZN3YIRY4UL1OU3JFXIP5F1LH2NYG0XKKAHOY0BVA'
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: RNJ0DDXZB2L4SWVE4SQB4GDO1CYQKQTIXDK1ACQ2F4W3MQJP
CLIENT_SECRET:XZ0KMUQHZN3YIRY4UL1OU3JFXIP5F1LH2NYG0XKKAHOY0BVA
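
A common alternative to hard-coding keys in a notebook is to read them from environment variables and to print only a masked form. The following is a minimal sketch. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example, not names defined by Foursquare.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from the given mapping.

    The variable names are hypothetical and chosen for this example only.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + "*" * (len(secret) - visible)


# Example with a plain dict standing in for the real environment
example_id, example_secret = load_credentials({
    "FOURSQUARE_CLIENT_ID": "abc123",
    "FOURSQUARE_CLIENT_SECRET": "secret789",
})
print("CLIENT_SECRET: " + mask(example_secret))
```

Calling `load_credentials()` with no argument reads from the real process environment.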


2. Getting the neighbourhood at index 10 of the dataframe

In [70]:
neighborhoods.loc[10, 'Neighbourhood']

'Glencairn'

In [71]:
neighborhood_latitude = neighborhoods.loc[10, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[10, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods.loc[10, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Glencairn are 43.709577, -79.44507259999999.


3. Getting top 100 venues within a 500 m radius of Glencairn

In [72]:
import requests
#create the GET request for the URL
radius=500
LIMIT=100

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

#Getting results from the URL
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5fa6db3e841fee376f4cbe9d'},
 'response': {'headerLocation': 'Briar Hill - Belgravia',
  'headerFullLocation': 'Briar Hill - Belgravia, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 5,
  'suggestedBounds': {'ne': {'lat': 43.714077004500005,
    'lng': -79.43885887350777},
   'sw': {'lat': 43.7050769955, 'lng': -79.45128632649221}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4cc75fe53c40a35d09976e2e',
       'name': 'R Bakery - Delicious Cakes, Breads',
       'location': {'address': '326 Marlee Ave',
        'lat': 43.70741964331273,
        'lng': -79.44312584563261,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.70741964331273,
          'lng': -79.44312584563261}],
        'dist
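
The request URL can also be assembled from a dictionary of parameters, which avoids counting `{}` placeholders and takes care of escaping. This is a sketch using only the standard library. `build_explore_url` is a hypothetical helper, and `"ID"` and `"SECRET"` are placeholder values.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude and longitude as one comma-separated value
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


example_url = build_explore_url("ID", "SECRET", "20180605", 43.709577, -79.4450726)

# Parse the query string back to check what would be sent
query = parse_qs(urlsplit(example_url).query)
```

`requests.get` accepts the same kind of dictionary through its `params` argument, so the query string does not have to be built by hand.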

In [73]:
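def get_category_type(row):
    """Return the name of the venue's first category, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        # Before the columns are renamed, the key still has the 'venue.' prefix
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']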

In [74]:
# pd.json_normalize (pandas 1.0+) replaces the deprecated pandas.io.json.json_normalize
venues = results['response']['groups'][0]['items']

nearby_venues = pd.json_normalize(venues)  # flatten the nested JSON into a dataframe

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



Unnamed: 0,name,categories,lat,lng
0,"R Bakery - Delicious Cakes, Breads",Bakery,43.70742,-79.443126
1,Miyako Sushi Restaurant,Japanese Restaurant,43.709111,-79.44393
2,"Chalker's Pub, Billiards and Bistro",Pub,43.705747,-79.442378
3,Domino's Pizza,Pizza Place,43.70717,-79.442658
4,Fraserwood Park,Park,43.71355,-79.442482
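
The same name, category and coordinate fields can be pulled out of the response items with plain dictionaries, which makes the nesting of the JSON explicit. This is a minimal sketch. `summarize_items` is a hypothetical helper, the sample items are trimmed to the keys used here, and the second item is made up to show the case of an empty category list.

```python
def summarize_items(items):
    """Return one flat dict per venue with name, first category and coordinates."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        rows.append({
            "name": venue.get("name"),
            # Use the first listed category, or None when there is none
            "category": categories[0]["name"] if categories else None,
            "lat": location.get("lat"),
            "lng": location.get("lng"),
        })
    return rows


sample_items = [
    {"venue": {"name": "R Bakery - Delicious Cakes, Breads",
               "categories": [{"name": "Bakery"}],
               "location": {"lat": 43.70741964331273, "lng": -79.44312584563261}}},
    {"venue": {"name": "Venue without a category",  # made-up entry
               "categories": [],
               "location": {"lat": 43.7, "lng": -79.44}}},
]

summary = summarize_items(sample_items)
```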


In [75]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues were returned by Foursquare.
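
As a sanity check on the 500 m radius, the great-circle distance between the neighbourhood coordinates and a returned venue can be computed with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6,371 km. `haversine_m` is a hypothetical helper, and the venue coordinates are those listed for R Bakery in the table.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Glencairn coordinates and the first venue in the table
distance = haversine_m(43.709577, -79.4450726, 43.70742, -79.443126)
within_radius = distance <= 500
```

Here `within_radius` evaluates to `True`, which is consistent with the radius passed in the request.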
