# Project_02a

### Introduction 
___
_Objective:_ __To explore, segment and group the neighbourhoods of the city of Toronto.__   
The information is obtained by scraping Wikipedia and is then transformed into a structured data format.   
The project is divided into three parts to make the code easier to understand and implement.

### Table of contents 
#### Notebook 1  
This notebook focuses on scraping the information, cleaning it up and setting up a data frame for later exploration.
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#ref1">Import</a></li>
        <li><a href="#ref2">Data Wrangling</a></li>
        <li><a href="#ref3">Resources</a></li>
    </ol>
</div>
<br>

<a id="ref1"></a>
# 1. Import 
This section installs and imports the packages the project needs.
***

In [1]:
!pip install beautifulsoup4 # HTML and XML data extraction library.
!pip install lxml # Parser used by BeautifulSoup and pandas.read_html
!pip install requests # HTTP library used to download the page

# Importing Packages
import pandas as pd # DataFrame
import numpy as np # Arrays
from bs4 import BeautifulSoup # HTML and XML data extraction library.
import requests # HTTP requests (with timeout support)
print("Ready")

Ready


<a id="ref2"></a>
# 2. Data Wrangling
* The page at the URL is downloaded and parsed into a BeautifulSoup object using the 'lxml' parser.
* The table with the requested information is extracted into a data frame with three columns: Postal code, Borough and Neighborhood.
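The two steps above can be tried offline on a small, made-up HTML table before pointing the code at Wikipedia. The sketch below is only an illustration under stated assumptions, not the notebook's own cell: the `sample_html` string and the `table_to_frame` helper are hypothetical, and it uses Python's built-in `html.parser` and a manual row loop instead of `lxml` and `pd.read_html`, so that it runs without extra parser dependencies.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not real Wikipedia markup).
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park / Harbourfront</td></tr>
</table>
"""


def table_to_frame(html):
    """Parse the first <table> in `html` into a DataFrame (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[0]  # same "first table" choice as the notebook
    rows = table.find_all("tr")
    # The first row holds the column names, the remaining rows hold the data.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    body = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]
    return pd.DataFrame(body, columns=header)


sample_df = table_to_frame(sample_html)
print(sample_df)
```

Working on a small inline table like this makes it easy to check the parsing logic without depending on the network or on the current layout of the Wikipedia page.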

In [2]:
# Loading the url data.
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_response = requests.get(url,timeout=5) # Download the page; give up after 5 seconds
soup = BeautifulSoup(page_response.content,'lxml') # Parse the HTML with the lxml parser
table = soup.find_all('table')[0] # Keep the first table on the page
df = pd.read_html(str(table))[0] # Convert the HTML table to a DataFrame
df.transpose() # Show the DataFrame transposed

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Postal code | M1A | M2A | M3A | M4A | M5A | M6A | M7A | M8A | M9A | M1B | ... | M9Y | M1Z | M2Z | M3Z | M4Z | M5Z | M6Z | M7Z | M8Z | M9Z |
| Borough | Not assigned | Not assigned | North York | North York | Downtown Toronto | North York | Downtown Toronto | Not assigned | Etobicoke | Scarborough | ... | Not assigned | Not assigned | Not assigned | Not assigned | Not assigned | Not assigned | Not assigned | Not assigned | Etobicoke | Not assigned |
| Neighborhood | | | Parkwoods | Victoria Village | Regent Park / Harbourfront | Lawrence Manor / Lawrence Heights | Queen's Park / Ontario Provincial Government | | Islington Avenue | Malvern / Rouge | ... | | | | | | | | | Mimico NW / The Queensway West / South of Bloo... | |


In [3]:
print("size: ",df.shape) # size
df.describe() # Categorical statistics

size:  (180, 3)


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| count | 180 | 180 | 103 |
| unique | 180 | 11 | 98 |
| top | M3A | Not assigned | Downsview |
| freq | 1 | 77 | 4 |


_The scraped Wikipedia table has 180 records, and 77 of them have no borough assigned._
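The same figures can be read off directly rather than from `describe()`. The following is a minimal sketch on a made-up frame; the `toy` data and the variable names are hypothetical and are not the scraped table.

```python
import pandas as pd

# Hypothetical miniature of the scraped table.
toy = pd.DataFrame({
    "Postal code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": [None, None, "Parkwoods", "Victoria Village"],
})

# Frequency of each borough label, including the "Not assigned" placeholder.
borough_counts = toy["Borough"].value_counts()

# Number of rows without a borough, and number of missing neighborhoods.
unassigned = int((toy["Borough"] == "Not assigned").sum())
missing_neighborhoods = int(toy["Neighborhood"].isnull().sum())

print(borough_counts)
print("Unassigned boroughs:", unassigned)
print("Missing neighborhoods:", missing_neighborhoods)
```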
***

### Filtering data   
* Only rows that have an assigned borough are kept.
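This kind of filter is usually written as a boolean mask. The sketch below shows the idea on a made-up frame (the `toy` data and the names `assigned_mask` and `assigned` are hypothetical). Calling `.copy()` on the result is an optional addition, assumed here because it gives an independent frame that can be modified later without pandas warning about writing to a slice.

```python
import pandas as pd

# Hypothetical miniature of the scraped table.
toy = pd.DataFrame({
    "Postal code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": [None, None, "Parkwoods", "Victoria Village"],
})

# True for every row whose borough is a real name, False for the placeholder.
assigned_mask = toy["Borough"] != "Not assigned"

# Keep only the rows where the mask is True, as an independent copy.
assigned = toy[assigned_mask].copy()

print(assigned)
print(assigned.isnull().sum())
```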

In [4]:
# Value filtering
na = df['Borough'] != 'Not assigned' # True for rows that have an assigned borough
df_na = df[na] # Keep only those rows
df_na.count()

Postal code     103
Borough         103
Neighborhood    103
dtype: int64

In [5]:
# Size, null values, categorical statistics
print("Size: ",df_na.shape)
print(df_na.isnull().sum())
df_na.describe()

Size:  (103, 3)
Postal code     0
Borough         0
Neighborhood    0
dtype: int64


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| count | 103 | 103 | 103 |
| unique | 103 | 10 | 98 |
| top | M3A | North York | Downsview |
| freq | 1 | 24 | 4 |


_Rows with an unassigned borough have been discarded, leaving 103 records and no null values._
*** 

### Replacing 
We restructure the data frame by resetting its index and replacing the slash (`/`) with a comma in the Neighborhood column.
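Both operations can be sketched on a made-up frame as follows. The `toy` data and the name `cleaned` are hypothetical. As an assumption that differs from the notebook's cell, this sketch replaces the whole `" / "` separator with `", "`, whereas the cell replaces only the `/` character and so leaves a space before the comma. It also uses `reset_index(drop=True)` as an alternative to assigning a NumPy range.

```python
import pandas as pd

# Hypothetical rows whose index still reflects their position before filtering.
toy = pd.DataFrame(
    {
        "Borough": ["Downtown Toronto", "North York"],
        "Neighborhood": ["Regent Park / Harbourfront", "Parkwoods"],
    },
    index=[4, 2],
)

cleaned = toy.copy()

# Literal (non-regex) replacement of the separator between neighborhood names.
cleaned["Neighborhood"] = cleaned["Neighborhood"].str.replace(" / ", ", ", regex=False)

# Renumber the rows from 0 and discard the old index values.
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```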

In [6]:
df_na.head() # Head Before

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |


In [7]:
# Replacing "/" with ","
new_data = df_na.copy() # Work on a copy to avoid pandas' SettingWithCopyWarning
new_data['Neighborhood'] = new_data['Neighborhood'].str.replace('/',',') 
new_data.index = np.arange(0, len(new_data)) # Reset the index so it starts at 0
new_data.head() # Head After


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |


In [8]:
print("Size: ",new_data.shape)

Size:  (103, 3)


<div class="alert alert-block alert-warning" style="margin-top: 20px"><b>To be continued in Notebook 2</b></div><br>

<a id="ref3"></a>
# 3. Resources
<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="https://www.coursera.org/learn/applied-data-science-capstone">Applied Data Science Capstone</a></li>
    <li><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup Documentation</a></li>
    <li><a href="https://requests.readthedocs.io/en/master/">Requests Documentation</a></li>
    <li><a href="https://pandas.pydata.org/docs/">Pandas Documentation</a></li>
    <li><a href="https://numpy.org/doc/">NumPy Documentation</a></li>
    <li><a href="https://github.com/Azhura/Coursera_Capstone">Labs</a></li>
</ol>
</div>

<div class="alert alert-block alert-info" style="margin-top: 20px">Links to the notebooks</div><br>
<a href="https://github.com/Azhura/Coursera_Capstone/blob/master/Project02a.ipynb">Github - Notebook 1: Data Wrangling</a><br>
<a href="https://github.com/Azhura/Coursera_Capstone/blob/master/Project02b.ipynb">Github - Notebook 2: Geolocation</a><br>
<a href="https://github.com/Azhura/Coursera_Capstone/blob/master/Project02c.ipynb">Github - Notebook 3: GeoData Exploration</a>

<a href="https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/d017ce96-2b8f-41fa-8773-30b2e775c682/view?access_token=bb0eef0b23c151212374a5663c38a1abe998f35f6bffe18ea841cd80479c272b">Display with map - Notebook 3: GeoData Exploration</a>

This notebook was created by [Carlos Alberto Gómez Prado](https://www.linkedin.com/in/carlospradobigdata/) as an assignment for the IBM Coursera course.   

This notebook is part of a course on **Coursera** called *Applied Data Science Capstone*. If you accessed this notebook outside the course, you can take this course online by clicking [here](https://www.coursera.org/professional-certificates/ibm-data-science).   
Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).

---