![logo](IBM-logo.jpg)

![logo](coursera.png)

# IBM Capstone Data Science Project

**Table of contents**
1. [Data acquisition](#Chapter1) 
2. [Data cleaning](#Chapter2)
3. [Analysis](#Chapter3)

This notebook demonstrates the skills learned in the IBM Data Science course on Coursera.

In [129]:
import pandas as pd # Library for dataframe manipulations
import numpy as np # library for mathematical functions
import requests # library to handle requests
import random # library for random number generation
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import folium # plotting library
import geocoder # import geocoder

# libraries for displaying images
from IPython.display import Image
from IPython.display import HTML

# transforming json file into a pandas dataframe library
from pandas import json_normalize # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0
import json
import csv
from json import load

print('Libraries imported.')

Libraries imported.


In [130]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 1. Data acquisition <a name="Chapter1"></a>

To acquire the data for clustering Toronto's neighbourhoods, I scrape the postal code table from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) using the `read_html` function of [pandas](https://pandas.pydata.org/pandas-docs/stable/).

In [2]:
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0) # to access the Wikipedia page

# now select the table whose first three columns match the expected headings
headings = ['Postcode', 'Borough', 'Neighbourhood']
for table in tables:
    current_headings = table.columns.values[:3]
    if list(current_headings) == headings:
        break
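
The loop above leaves `table` pointing at the last table on the page if no headings match. A minimal sketch of a stricter selection, assuming a hypothetical helper `find_table` (not part of the original notebook) and toy DataFrames standing in for the list returned by `pd.read_html`:

```python
import pandas as pd

def find_table(tables, headings):
    """Return the first DataFrame whose leading columns equal `headings`.

    Hypothetical helper: raises instead of silently falling through.
    """
    n = len(headings)
    for candidate in tables:
        if list(candidate.columns[:n]) == list(headings):
            return candidate
    raise ValueError(f"No table starts with the columns {headings}")

# Toy stand-ins for the list that pd.read_html would return
toy_tables = [
    pd.DataFrame({"A": [1], "B": [2]}),
    pd.DataFrame({
        "Postcode": ["M3A"],
        "Borough": ["North York"],
        "Neighbourhood": ["Parkwoods"],
    }),
]

selected = find_table(toy_tables, ["Postcode", "Borough", "Neighbourhood"])
print(selected)
```

Raising an error makes a changed page layout visible immediately, rather than letting an unrelated table flow into the cleaning steps.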


In [3]:
table['Postcode'].replace({r'.*!(.*)': r'\1'}, regex=True, inplace=True)
table['Borough'].replace({r'.*!(.*)': r'\1'}, regex=True, inplace=True)
table['Neighbourhood'].replace({r'.*!(.*)': r'\1'}, regex=True, inplace=True)

table[headings].to_csv('test.txt', sep=',', header=False, index=False)

In [4]:
df = pd.DataFrame(table)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## 2. Data cleaning <a name="Chapter2"></a>

At this stage the dataset is assessed. Missing values, and other values that could distort the analysis, are removed or otherwise handled.

In [5]:
# Addressing the 'Not assigned' values
df['Borough'].replace('Not assigned', np.nan, inplace=True) # np.nan: the np.NaN alias was removed in NumPy 2.0
df.dropna(inplace=True)
# to verify that it worked
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
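
Note that `dropna()` without arguments drops a row when any column is missing, not only `Borough`. A small sketch of restricting the check to one column, using a toy frame whose values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data: one unassigned borough, one missing neighbourhood
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", None],
})

cleaned = (
    toy.assign(Borough=toy["Borough"].replace("Not assigned", np.nan))
       .dropna(subset=["Borough"])   # only a missing Borough removes a row
       .reset_index(drop=True)
)
print(cleaned)
```

Here the `M5A` row survives even though its neighbourhood is missing, which would not be the case with a plain `dropna()`.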


In [6]:
# grouping neighbourhoods that share a postcode into one comma-separated row
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()

In [7]:
# shape of the current dataframe
print(df.shape)

(103, 3)
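
To make the grouping step concrete, here is a toy illustration built from the rows shown in the earlier output, where `M6A` appears twice. It uses `agg` instead of `apply`, which is an equivalent choice for this case:

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
})

# Join all neighbourhood names that share a (Postcode, Borough) pair
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```

The two `M6A` rows collapse into a single row whose `Neighbourhood` value lists both names.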


###  Geospatial Coordinates 

In [8]:
df2 = pd.read_csv('Geospatial_Coordinates.csv') # read_csv already returns a DataFrame

In [9]:
df2.rename(columns = {'Postal Code': 'Postcode'}, inplace=True)

In [10]:
df_new = pd.merge(df, df2, on='Postcode')

In [11]:
df_new

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
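
`pd.merge` defaults to an inner join, so a postcode without coordinates would silently disappear. A hedged sketch of checking for that with `how='left'` and `indicator=True`, on toy data in which `M9Z` is a made-up postcode with no coordinates:

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],   # M9Z is hypothetical
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Keep every row of `left` and record where each row came from
checked = left.merge(coords, on="Postcode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postcode received coordinates.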


## 3. Analysis <a name="Chapter3"></a>

At this stage the different neighbourhoods will be clustered and visualized.

In [139]:
address = 'Groningen'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The geographical coordinates of Groningen are {latitude}, {longitude}.')

The geographical coordinates of Groningen are 53.2190652, 6.5680077.


### Creating a map of Toronto

The first step is to generate a map of Toronto and attach a marker for each postcode in the dataframe created earlier.

In [140]:
# collect the latitude/longitude pairs of the Toronto postcodes
locations = df_new[['Latitude', 'Longitude']]
locations_list = locations.values.tolist()

In [141]:
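# create map of Toronto, centred on the mean of the postcode coordinates
# (the latitude/longitude variables above hold the coordinates of Groningen, not Toronto)
map_toronto = folium.Map(location=[df_new['Latitude'].mean(), df_new['Longitude'].mean()], zoom_start=11)

# Get a nice layer over the folium map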
folium.TileLayer('cartodbdark_matter').add_to(map_toronto)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_new['Latitude'], df_new['Longitude'], df_new['Borough'], df_new['Neighbourhood']):
    label = f'{neighborhood}, {borough}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
# to get the output    
map_toronto
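
One way to derive a sensible map centre and extent directly from the data, rather than from a geocoded address, is sketched below. The toy coordinates are invented for illustration. The resulting `center` could be passed as `location` to `folium.Map`, and `bounds` to the map's `fit_bounds` method.

```python
import pandas as pd

# Toy coordinates standing in for df_new[['Latitude', 'Longitude']]
toy = pd.DataFrame({
    "Latitude": [43.80, 43.70, 43.60],
    "Longitude": [-79.20, -79.50, -79.40],
})

# Centre of the points: mean latitude and mean longitude
center = [float(toy["Latitude"].mean()), float(toy["Longitude"].mean())]

# South-west and north-east corners of the bounding box
bounds = [
    [float(toy["Latitude"].min()), float(toy["Longitude"].min())],
    [float(toy["Latitude"].max()), float(toy["Longitude"].max())],
]
print(center, bounds)
```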

## Using [Foursquare](https://foursquare.com/) to generate meaningful data points

In [39]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
search_query = 'Italian'
radius = 500

In [122]:
venues = []

for lat, long, post, borough, neighbourhood in zip(df_new['Latitude'], df_new['Longitude'], df_new['Postcode'], df_new['Borough'], df_new['Neighbourhood']):
    url = f"https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={lat},{long}&radius={radius}&limit={LIMIT}"

    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            post, 
            borough,
            neighbourhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
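
The loop above indexes straight into the response, so one venue without a category or one empty response raises an exception and discards the whole run. A defensive sketch follows, assuming the payload has the same shape the notebook relies on (`response` → `groups` → `items` → `venue`). The helper `extract_venues` and the sample payload are hypothetical and are not Foursquare's own API.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples.

    Missing keys yield None instead of raising.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]  # tolerate an empty list
        rows.append((
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name"),
        ))
    return rows

# Hand-written sample payload; the second venue has no category
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Wendy's",
                   "location": {"lat": 43.807448, "lng": -79.199056},
                   "categories": [{"name": "Fast Food Restaurant"}]}},
        {"venue": {"name": "Marina Spa",
                   "location": {"lat": 43.766, "lng": -79.191},
                   "categories": []}},
    ]}]}
}

rows = extract_venues(sample_payload)
print(rows)
```

Venues with a `None` category can then be filtered out or labelled after the DataFrame is built.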



In [123]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

In [133]:
venues_df.head()

# define the column names
venues_df.columns = ['PostalCode', 'Borough', 'Neighbourhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(1341, 9)


Unnamed: 0,PostalCode,Borough,Neighbourhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa


### First, let's see which venues are most prevalent in Toronto

In [138]:
venues_df.VenueCategory.value_counts()

Coffee Shop                      91
Café                             69
Park                             43
Pizza Place                      40
Restaurant                       33
Sandwich Place                   32
Italian Restaurant               28
Bakery                           27
Grocery Store                    22
Fast Food Restaurant             22
Japanese Restaurant              20
Bar                              20
Gym                              20
Sushi Restaurant                 18
Pub                              18
Pharmacy                         16
Clothing Store                   16
American Restaurant              14
Thai Restaurant                  14
Greek Restaurant                 14
Breakfast Spot                   14
Ice Cream Shop                   14
Liquor Store                     13
Chinese Restaurant               13
Hotel                            13
Gastropub                        13
Diner                            12
Dessert Shop                

At first glance, coffee shops and cafés are substantially more common in Toronto than any other venue category.

It would be interesting to see which neighbourhood has the most coffee shops and cafés in Toronto.
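
A possible way to answer that question is to filter on the two categories and count per neighbourhood. The sketch below uses a toy frame with placeholder neighbourhood names `A` and `B`. The column names follow `venues_df`, but the rows are invented for illustration.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B"],
    "VenueCategory": ["Coffee Shop", "Café", "Park", "Coffee Shop", "Bar"],
})

# Keep only coffee shops and cafés, then count them per neighbourhood
is_coffee = toy_venues["VenueCategory"].isin(["Coffee Shop", "Café"])
coffee_counts = (
    toy_venues[is_coffee]
    .groupby("Neighbourhood")
    .size()
    .sort_values(ascending=False)
)
print(coffee_counts)
```

The first entry of `coffee_counts` is the neighbourhood with the most coffee shops and cafés.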