# Coursera Capstone Data Science Project

### Applied Data Science Capstone
#### IBM

This capstone project course gives you a taste of what data scientists go through in real life when working with data.

You will learn about location data and different location data providers, such as Foursquare. You will learn how to make RESTful API calls to the Foursquare API to retrieve data about venues in different neighborhoods around the world. You will also learn how to be creative in situations where data are not readily available by scraping web data and parsing HTML code. You will utilize Python and its pandas library to manipulate data, which will help you refine your skills for exploring and analyzing data. 

Finally, you will be required to use the Folium library to create maps of geospatial data and to communicate your results and findings.

If you choose to take this course and earn the Coursera course certificate, you will also earn an IBM digital badge upon successful completion of the course.  

That's what Coursera and IBM claim about this course. Let's find out what is in store for us!

In [61]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

print("Hello Capstone Project Course!")

Hello Capstone Project Course!


The first task is to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform it into a pandas dataframe with the columns Postcode, Borough, and Neighborhood:

In [62]:
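from io import StringIO

# Download the page and parse the HTML
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content, 'lxml')

# The postal code table is the first <table> on the page
table = soup.find_all('table')[0]
# Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
df_temp = pd.read_html(StringIO(str(table)))
df = df_temp[0]
df.columns = ['Postcode', 'Borough', 'Neighborhood']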

# Display the first 15 rows
print(df.head(15))

   Postcode           Borough      Neighborhood
0       M1A      Not assigned      Not assigned
1       M2A      Not assigned      Not assigned
2       M3A        North York         Parkwoods
3       M4A        North York  Victoria Village
4       M5A  Downtown Toronto      Harbourfront
5       M6A        North York  Lawrence Heights
6       M6A        North York    Lawrence Manor
7       M7A  Downtown Toronto      Queen's Park
8       M8A      Not assigned      Not assigned
9       M9A      Queen's Park      Not assigned
10      M1B       Scarborough             Rouge
11      M1B       Scarborough           Malvern
12      M2B      Not assigned      Not assigned
13      M3B        North York   Don Mills North
14      M4B         East York  Woodbine Gardens
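
To see what the table extraction above involves, here is a minimal offline sketch that parses a small hard-coded HTML table with Python's standard-library `HTMLParser`. The `SAMPLE_HTML` excerpt and the `TableParser` class are illustrative assumptions, not part of the notebook, which relies on BeautifulSoup and `pd.read_html` instead.

```python
from html.parser import HTMLParser

# Hypothetical excerpt in the same shape as the Wikipedia postal code table.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(SAMPLE_HTML)

# The first row holds the column names, the rest hold the data.
header, *records = parser.rows
print(header)
for record in records:
    print(record)
```

The header row and the data rows come out as plain lists of strings, which is the raw material `pd.read_html` turns into a dataframe.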


## Creating the dataframe according to requirements

#### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [63]:
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)

# Display the first 15 rows
print(df.head(15))

   Postcode           Borough      Neighborhood
2       M3A        North York         Parkwoods
3       M4A        North York  Victoria Village
4       M5A  Downtown Toronto      Harbourfront
5       M6A        North York  Lawrence Heights
6       M6A        North York    Lawrence Manor
7       M7A  Downtown Toronto      Queen's Park
9       M9A      Queen's Park      Not assigned
10      M1B       Scarborough             Rouge
11      M1B       Scarborough           Malvern
13      M3B        North York   Don Mills North
14      M4B         East York  Woodbine Gardens
15      M4B         East York     Parkview Hill
16      M5B  Downtown Toronto           Ryerson
17      M5B  Downtown Toronto   Garden District
18      M6B        North York         Glencairn
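
The filtering step can also be written without modifying the dataframe in place. The sketch below runs on a small hypothetical sample (the `sample` and `assigned` names are assumptions for illustration) and keeps only rows with an assigned borough through a boolean mask.

```python
import pandas as pd

# Hypothetical sample rows mirroring the scraped table.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M9A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep rows whose borough is assigned; reset_index renumbers the rows from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Unlike `drop(..., inplace=True)`, this leaves `sample` untouched, which makes it easier to re-run a cell without side effects.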


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [64]:
df.loc[df.Neighborhood == 'Not assigned', "Neighborhood"] = df.Borough

# Display the first 15 rows
print(df.head(15))

   Postcode           Borough      Neighborhood
2       M3A        North York         Parkwoods
3       M4A        North York  Victoria Village
4       M5A  Downtown Toronto      Harbourfront
5       M6A        North York  Lawrence Heights
6       M6A        North York    Lawrence Manor
7       M7A  Downtown Toronto      Queen's Park
9       M9A      Queen's Park      Queen's Park
10      M1B       Scarborough             Rouge
11      M1B       Scarborough           Malvern
13      M3B        North York   Don Mills North
14      M4B         East York  Woodbine Gardens
15      M4B         East York     Parkview Hill
16      M5B  Downtown Toronto           Ryerson
17      M5B  Downtown Toronto   Garden District
18      M6B        North York         Glencairn
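
As a possible alternative to the `.loc` assignment, `Series.mask` expresses the same fallback rule: wherever the neighborhood is Not assigned, take the borough instead. This is a minimal sketch on hypothetical sample rows, not the notebook's own code.

```python
import pandas as pd

# Hypothetical sample rows; the second one has no assigned neighborhood.
sample = pd.DataFrame({
    "Postcode": ["M7A", "M9A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Queen's Park", "Not assigned"],
})

# mask() replaces values where the condition is True with the
# index-aligned value from the other Series (here, the borough).
missing = sample["Neighborhood"] == "Not assigned"
sample["Neighborhood"] = sample["Neighborhood"].mask(missing, sample["Borough"])
print(sample)
```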


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma.

In [65]:
df = df.groupby(['Postcode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

# Display the first 15 rows
print(df.head(15))

   Postcode      Borough                                       Neighborhood
0       M1B  Scarborough                                     Rouge, Malvern
1       M1C  Scarborough             Highland Creek, Rouge Hill, Port Union
2       M1E  Scarborough                  Guildwood, Morningside, West Hill
3       M1G  Scarborough                                             Woburn
4       M1H  Scarborough                                          Cedarbrae
5       M1J  Scarborough                                Scarborough Village
6       M1K  Scarborough        East Birchmount Park, Ionview, Kennedy Park
7       M1L  Scarborough                    Clairlea, Golden Mile, Oakridge
8       M1M  Scarborough    Cliffcrest, Cliffside, Scarborough Village West
9       M1N  Scarborough                        Birch Cliff, Cliffside West
10      M1P  Scarborough  Dorset Park, Scarborough Town Centre, Wexford ...
11      M1R  Scarborough                                  Maryvale, Wexford
12      M1S 

In [66]:
# Use the .shape attribute (not a method) to get the number of rows and columns of the dataframe.
df.shape

(103, 3)
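
To make the grouping step concrete, here is a small sketch on three hypothetical sample rows taken from the kind of data shown earlier (M6A appears twice). It assumes the same column names as the notebook and uses `agg` with `', '.join`, which behaves like the `apply` call above for this case.

```python
import pandas as pd

# Hypothetical sample: M6A is listed twice with different neighborhoods.
sample = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# One row per (Postcode, Borough); neighborhoods joined with ", ".
grouped = (
    sample.groupby(["Postcode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
print(grouped.shape)
```

Note that `groupby` sorts by the group keys by default, so the result is ordered by postal code rather than by the original row order.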

Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. 

In [68]:
# Download the CSV file that maps each postal code to its latitude and longitude
import io

url = "http://cocl.us/Geospatial_data"
s = requests.get(url).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')))

# rename the first column to allow merging dataframes on Postcode
c.columns = ['Postcode', 'Latitude', 'Longitude']
df = pd.merge(c, df, on='Postcode')

# reorder column names and show the dataframe
df = df[['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]
df

  Postcode      Borough                                     Neighborhood   Latitude   Longitude
0      M1B  Scarborough                                   Rouge, Malvern  43.806686  -79.194353
1      M1C  Scarborough           Highland Creek, Rouge Hill, Port Union  43.784535  -79.160497
2      M1E  Scarborough                Guildwood, Morningside, West Hill  43.763573  -79.188711
3      M1G  Scarborough                                           Woburn  43.770992  -79.216917
4      M1H  Scarborough                                        Cedarbrae  43.773136  -79.239476
5      M1J  Scarborough                              Scarborough Village  43.744734  -79.239476
6      M1K  Scarborough      East Birchmount Park, Ionview, Kennedy Park  43.727929  -79.262029
7      M1L  Scarborough                  Clairlea, Golden Mile, Oakridge  43.711112  -79.284577
8      M1M  Scarborough  Cliffcrest, Cliffside, Scarborough Village West  43.716316  -79.239476
9      M1N  Scarborough                      Birch Cliff, Cliffside West  43.692657  -79.264848
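
An inner merge silently drops postal codes that have no coordinates. As a hedged sketch of how one might check for that, the example below merges two hypothetical sample frames with `how="left"`, `validate="one_to_one"`, and `indicator=True`; the code `M9Z` and the borough "Example Borough" are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the grouped postal code table.
postcodes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],  # M9Z is a made-up code without coordinates
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
    "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union", "Nowhere"],
})

# Hypothetical sample of the coordinates file.
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises an error if a postal code is duplicated on either side;
# indicator adds a "_merge" column that records where each row came from.
merged = postcodes.merge(
    coords, on="Postcode", how="left", validate="one_to_one", indicator=True
)

# Postal codes that found no match in the coordinates table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge used in the notebook loses no rows.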


#### Generating maps to visualize your neighborhoods and how they cluster together. 

In [70]:
from geopy.geocoders import Nominatim

address = 'Toronto, Canada'

# Recent geopy releases require an explicit user_agent; "toronto_explorer" is an arbitrary placeholder
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
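
Geocoding depends on a network service that may be slow or unavailable. One possible fallback, shown here purely as an assumption and not as part of the course material, is to center the map on the mean of the coordinates already in the dataframe. The `sample` frame below is a hypothetical three-row excerpt.

```python
import pandas as pd

# Hypothetical excerpt of the merged dataframe's coordinate columns.
sample = pd.DataFrame({
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Use the average position of the postal codes as an approximate map center.
center = [sample["Latitude"].mean(), sample["Longitude"].mean()]
print("Fallback map center: {:.6f}, {:.6f}".format(*center))
```

The resulting `center` list has the same `[latitude, longitude]` form that `folium.Map` expects for its `location` argument.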




In [74]:
!conda install -c conda-forge folium --yes

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    folium-0.10.1              |             py_0          59 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    branca:          0.4.0-py_0        conda-forge
    folium:          

In [76]:
# create map of Toronto using latitude and longitude values
import folium

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add one circle marker per postal code, with a "neighborhood, borough" popup
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    # parse_html=True escapes the label text so apostrophes do not break the popup
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto