# Canada locations - Web scraping and storing results in pandas dataframe

### The objectives of this project are to:

1 - Scrape the web for information on Canadian postal codes

2 - Transform the scraped table into a pandas dataframe (listing Postal code, Borough and Neighborhood)


_______________________________________________________________________________________________________________________________

## 1] Web scraping to obtain Canada postal codes

### 1a] Library installations/importing

In [6]:
# import requisite libraries 
import pandas as pd
import numpy as np

### 1b] Web scraping

In [7]:
# read every HTML table on the page into a list of pandas dataframes
# (pd.read_html needs an HTML parser such as lxml installed)

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

tables # a list of dataframes, one for each table on the page

[    Postal code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                           Neighborhood  
 0                                                  NaN  
 1                                                  NaN  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                           Regent Park / Harbourfront  
 ..                                                 ...  
 175                                                NaN  
 176                                                NaN  
 177                
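Relying on `tables[0]` assumes the postal code table is always the first one on the page. A minimal sketch of a more defensive selection is shown here; `pick_table` is a hypothetical helper (not part of pandas), and the small hand-made list stands in for the real result of `pd.read_html`.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame that contains every required column.

    Hypothetical helper for illustration; not part of pandas.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")


# Stand-in for the list returned by pd.read_html (made-up rows).
tables = [
    pd.DataFrame({"Region": ["Ontario"], "Prefix": ["M"]}),
    pd.DataFrame({
        "Postal code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": [None, "Parkwoods"],
    }),
]

postal_table = pick_table(tables, ["Postal code", "Borough", "Neighborhood"])
print(postal_table)
```

`pd.read_html` also accepts a `match` argument that keeps only tables containing text matching a string or regular expression, which may be enough on its own for a page like this.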

## 2] Transformation of scraped data into pandas dataframe

### 2a] Conversion into pandas dataframe

In [8]:
# Save the first dataframe to csv (the index is written as an extra column)
tables[0].to_csv('wikifile.csv')

In [14]:
# read csv using pandas
location_df=pd.read_csv('wikifile.csv')
location_df.head()

,Unnamed: 0,Postal code,Borough,Neighborhood
0,0,M1A,Not assigned,
1,1,M2A,Not assigned,
2,2,M3A,North York,Parkwoods
3,3,M4A,North York,Victoria Village
4,4,M5A,Downtown Toronto,Regent Park / Harbourfront
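The `Unnamed: 0` column appears because `to_csv` writes the dataframe index by default and `read_csv` then reads it back as a regular column. The sketch below illustrates this on a small hand-made frame (made-up stand-in for the scraped table), using in-memory buffers instead of a file on disk.

```python
import io

import pandas as pd

# Small stand-in for the scraped table.
scraped = pd.DataFrame({
    "Postal code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
scraped.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so the columns survive the round trip.
without_index = io.StringIO()
scraped.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

If the csv file is not needed for anything else, the round trip can be skipped entirely by working on `tables[0]` directly.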


### 2b] Cleaning dataset obtained from Web scraping

In [16]:
# keep only the needed columns (drops the 'Unnamed: 0' index column saved in the csv)
location_df=location_df[['Postal code','Borough','Neighborhood']]

# remove all rows with Borough unassigned; .copy() gives an independent dataframe,
# so the assignment below does not raise SettingWithCopyWarning
to_drop = ['Not assigned']
postalcode_df=location_df[~location_df['Borough'].isin(to_drop)].copy()

# replace forward slash character with comma separator
postalcode_df['Neighborhood'] = postalcode_df['Neighborhood'].replace('/', ',', regex=True)

# view cleaned postal code dataframe
postalcode_df.head()


,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
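Assigning to a column of a filtered dataframe can trigger pandas' `SettingWithCopyWarning`, and replacing only the slash leaves a stray space before each comma. The following is a sketch of a variant of the same cleaning steps on a small hand-made frame (made-up rows, not the scraped data): it takes an explicit copy after filtering and replaces the slash together with its surrounding spaces. Note that this produces `Regent Park, Harbourfront` rather than the spacing shown in the output above.

```python
import pandas as pd

# Small stand-in for location_df.
location_df = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": [None, "Parkwoods", "Regent Park / Harbourfront"],
})

# Filter, then copy so later assignments target an independent dataframe.
assigned = location_df[location_df["Borough"] != "Not assigned"].copy()

# Replace the slash and any spaces around it with ", ".
assigned["Neighborhood"] = assigned["Neighborhood"].str.replace(
    r"\s*/\s*", ", ", regex=True
)

# Optional: renumber the rows from 0 after dropping the unassigned ones.
assigned = assigned.reset_index(drop=True)
print(assigned)
```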


### 2c] Check dataframe dimensions

In [21]:
# check dataframe shape
postalcode_df.shape

(103, 3)
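The shape alone does not show whether the cleaning worked. A minimal sketch of a few extra sanity checks follows; `summarize` is a hypothetical helper, the checks it performs are assumptions about what a clean result should look like, and the sample frame is made up for illustration.

```python
import pandas as pd


def summarize(df):
    """Collect a few sanity-check counts for a cleaned postal code frame.

    Hypothetical helper; the chosen checks are illustrative assumptions.
    """
    return {
        "shape": df.shape,
        "duplicate_codes": int(df["Postal code"].duplicated().sum()),
        "unassigned_boroughs": int(df["Borough"].eq("Not assigned").sum()),
        "missing_neighborhoods": int(df["Neighborhood"].isna().sum()),
    }


# Small stand-in for postalcode_df.
sample = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park , Harbourfront"],
})

summary = summarize(sample)
print(summary)
```

On the real data, calling `summarize(postalcode_df)` would report the same `(103, 3)` shape alongside the three counts.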