# Project_02a

### Introduction 
___
_Objective:_ __To explore, segment and group the neighbourhoods of the city of Toronto.__   
The data is scraped from Wikipedia and transformed into a structured format.

### Table of contents 
#### 1 / 3

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#ref1">Import Data</a></li>
        <li><a href="#ref2">Data Wrangling</a></li>
        <li><a href="#ref3">Geocoder</a></li>
        <li><a href="#ref4">Data Exploration</a></li>
        <li><a href="#ref5">Conclusion</a></li>
    </ol>
</div>
<br>

<a id="ref1"></a>
# Import Data
***

In [1]:
!pip install beautifulsoup4 # HTML and XML parsing library
!pip install requests # HTTP library used to download the page

In [2]:
# Importing packages
import pandas as pd # DataFrames
import numpy as np # Arrays
from bs4 import BeautifulSoup # HTML and XML parsing library
import requests # HTTP requests (with timeout support)

<a id="ref2"></a>
# Data Wrangling
***
* The page at the URL is downloaded and parsed into a BeautifulSoup object using the 'lxml' parser.
* The table with the requested information is extracted into a data frame with three columns: Postal code, Borough and Neighborhood.

In [3]:
# Download the Wikipedia page and extract the postal code table
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_response = requests.get(url, timeout=5) # Stop waiting after 5 seconds without a response
soup = BeautifulSoup(page_response.content, 'lxml') # Parse the HTML with the lxml parser
table = soup.find_all('table')[0] # Take the first table found on the page
df = pd.read_html(str(table))[0] # Convert the HTML table into a DataFrame
df.transpose() # Show the DataFrame transposed

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Postal code | M1A | M2A | M3A | M4A | M5A | M6A | M7A | M8A | M9A | M1B | ... | M9Y | M1Z | M2Z | M3Z | M4Z | M5Z | M6Z | M7Z | M8Z | M9Z |
| Borough | Not assigned | Not assigned | North York | North York | Downtown Toronto | North York | Downtown Toronto | Not assigned | Etobicoke | Scarborough | ... | Not assigned | Not assigned | Not assigned | Not assigned | Not assigned | Not assigned | Not assigned | Not assigned | Etobicoke | Not assigned |
| Neighborhood | | | Parkwoods | Victoria Village | Regent Park / Harbourfront | Lawrence Manor / Lawrence Heights | Queen's Park / Ontario Provincial Government | | Islington Avenue | Malvern / Rouge | ... | | | | | | | | | Mimico NW / The Queensway West / South of Bloo... | |
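
As a rough illustration of what the table extraction step does, the sketch below parses a tiny hand-written HTML table with Python's standard-library `html.parser` rather than BeautifulSoup and pandas. The `TableRows` class and the `sample_html` snippet are hypothetical and only mimic the shape of the Wikipedia table; this is not the code the notebook runs.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical miniature of the postal code table (not the real page)
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *body = parser.rows  # first row holds the column names
print(header)
print(body)
```

In the notebook itself, `pd.read_html` performs this row and cell collection and also builds the DataFrame, which is why a single call is enough there.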


In [4]:
print("size: ",df.shape) # size
df.describe() # Categorical statistics

size:  (180, 3)


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| count | 180 | 180 | 103 |
| unique | 180 | 11 | 98 |
| top | M8L | Not assigned | Downsview |
| freq | 1 | 77 | 4 |
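
For text columns, `describe()` reports `count` (non-missing values), `unique` (distinct values), `top` (the most frequent value) and `freq` (how often `top` occurs). The following minimal sketch shows this on a few made-up rows; the `toy` frame is an assumption for illustration and is not the scraped data.

```python
import pandas as pd

# Made-up rows that imitate the scraped columns (illustration only)
toy = pd.DataFrame({
    "Borough": ["Not assigned", "North York", "Not assigned", "Etobicoke"],
    "Neighborhood": [None, "Parkwoods", None, "Islington Avenue"],
})

summary = toy.describe()  # count / unique / top / freq for text columns
print(summary)

# Missing neighbourhoods are excluded from "count"
neighborhood_count = summary.loc["count", "Neighborhood"]
borough_top = summary.loc["top", "Borough"]
```

Read this way, a `count` of 103 for Neighborhood against 180 rows means 77 rows have no neighbourhood, which matches the 77 occurrences of "Not assigned" in Borough.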


### Filtering data   
***
* Only rows with an assigned borough are kept.

In [5]:
# Keep only the rows whose borough is assigned
assigned = df['Borough'] != 'Not assigned' # Boolean mask, True where a borough exists
df_na = df[assigned]
df_na.count()

Postal code     103
Borough         103
Neighborhood    103
dtype: int64
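
The filter above relies on boolean masking: comparing a column with a value yields a Series of `True`/`False`, and indexing the frame with that Series keeps only the `True` rows. Here is a small sketch on invented rows, where the names `toy`, `assigned` and `toy_assigned` are hypothetical. Taking `.copy()` is one common way to get an independent frame that can be modified later without pandas warning about writing to a slice.

```python
import pandas as pd

# Invented rows with the same column layout as the scraped table
toy = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A", "M8A"],
    "Borough": ["Not assigned", "North York", "North York", "Not assigned"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village", None],
})

assigned = toy["Borough"] != "Not assigned"  # boolean Series, one entry per row
toy_assigned = toy[assigned].copy()          # keep True rows as an independent copy
n_dropped = len(toy) - len(toy_assigned)     # rows removed by the filter

print(toy_assigned)
print("dropped:", n_dropped)
```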

In [6]:
# Size, null values, categorical statistics
print("Size: ",df_na.shape)
print(df_na.isnull().sum())
df_na.describe()

Size:  (103, 3)
Postal code     0
Borough         0
Neighborhood    0
dtype: int64


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| count | 103 | 103 | 103 |
| unique | 103 | 10 | 98 |
| top | M2J | North York | Downsview |
| freq | 1 | 24 | 4 |


### Replacing character "/" with ","

In [7]:
df_na.head() # Head Before

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |


In [8]:
# Replace "/" with "," on an explicit copy, so the assignment does not
# trigger pandas' SettingWithCopyWarning on the filtered slice
new_data = df_na.copy()
new_data['Neighborhood'] = new_data['Neighborhood'].str.replace('/', ',', regex=False)
new_data.index = np.arange(0, len(new_data)) # Restart the index at 0
new_data.head() # Head after


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
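
A plain character replacement leaves the space that preceded each "/" in place, so the names read "Regent Park , Harbourfront". If tidier separators are wanted, one possible variant is to split on the separator, trim each name and join again. The helper `join_neighborhoods` below is a hypothetical sketch of that idea and is not part of the notebook.

```python
def join_neighborhoods(raw, sep="/"):
    """Split on `sep`, trim whitespace around each name, and join with ', '."""
    parts = [part.strip() for part in raw.split(sep)]
    return ", ".join(part for part in parts if part)  # skip empty fragments


examples = [
    "Parkwoods",
    "Regent Park / Harbourfront",
    "Queen's Park / Ontario Provincial Government",
]
cleaned = [join_neighborhoods(name) for name in examples]
for name in cleaned:
    print(name)
```

With pandas, such a helper could be applied to the column with `new_data['Neighborhood'].apply(join_neighborhoods)`, assuming the column contains only strings.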


In [9]:
print("Size: ",new_data.shape)

Size:  (103, 3)


<div class="alert alert-block alert-warning" style="margin-top: 20px"><b>To be continued in part 2</b></div>

This notebook was created by [Carlos Alberto Gómez Prado](https://www.linkedin.com/in/carlospradobigdata/) as an assignment for Applied Data Science Capstone, an IBM course on Coursera.

---