# Capstone Project: Battle of Neighborhoods in Vancouver
### Webscraping Vancouver Geo Data

The first step of the Vancouver Battle of the Neighborhoods analysis is retrieving geo data for Vancouver. I retrieve it from *GeoNames*, a website that publishes postal code data for many countries, along with geospatial data such as latitude and longitude. The data can be downloaded manually as a zip file that contains a tab-separated text file; the file used here covers all of Canada. A more challenging way, and the one I use here, is to fetch the zip file from its URL in code, extract and read the text file, do some data cleaning, and finally save the result as a CSV file in my documents folder.

In [1]:
#import packages
import pandas as pd
import csv

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import requests # library to handle requests
import random # library for random number generation

# module to convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

import geocoder # import geocoder

import folium # plotting library

#for webscraping
from bs4 import BeautifulSoup 

from html_table_extractor.extractor import Extractor

# Transform a json file into a pandas data frame
# (pandas >= 1.0 exposes json_normalize at the top level; pandas.io.json.json_normalize is deprecated)
from pandas import json_normalize #function for flattening json in pandas df


# Import k-means for clustering
from sklearn.cluster import KMeans

# Matplotlib associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#url and zip file retrieval
import urllib.request
import zipfile
import chardet


print("Packages installed and libraries imported.")

Packages installed and libraries imported.


In [2]:
url = 'http://download.geonames.org/export/zip/CA.zip'
filename = 'CA.zip'
urllib.request.urlretrieve(url, '......../Coursera/IBM - Data Science/capestone project/capstone project week 4/data collection/postal codes/CA.zip')


('C:/Users/Lydia/Documents/Coursera/IBM - Data Science/capestone project/capstone project week 4/data collection/postal codes/CA.zip',
 <http.client.HTTPMessage at 0x1e1a5c794e0>)

In [3]:
with zipfile.ZipFile('......./Coursera/IBM - Data Science/capestone project/capstone project week 4/data collection/postal codes/CA.zip') as zf:
    zf.extractall('......../Coursera/IBM - Data Science/capestone project/capstone project week 4/data collection/postal codes/')
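
The download and extraction cells above write both the zip file and the extracted text file to disk. As an alternative, the archive could be kept in memory and the text file read straight out of it. The following is a minimal sketch under that assumption, not the method used in this notebook: the helper names `fetch_bytes` and `read_zip_member` are hypothetical, the member name `CA.txt` is taken from the extraction step, and UTF-8 is assumed as the file encoding.

```python
import io
import urllib.request
import zipfile


def fetch_bytes(url):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_zip_member(zip_bytes, member, encoding="utf-8"):
    """Return the decoded text of one file inside an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(member).decode(encoding)


# Usage (needs network access, so it is left commented out here):
# zip_bytes = fetch_bytes('http://download.geonames.org/export/zip/CA.zip')
# text = read_zip_member(zip_bytes, 'CA.txt')
# print(text[:500])  # preview the first 500 characters only
```

Keeping the archive in memory avoids hard-coded folder paths, at the cost of having to download the file again in every session.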


In [4]:
path = '......../Coursera/IBM - Data Science/capestone project/capstone project week 4/data collection/postal codes/CA.txt'
with open(path, 'r', encoding='utf-8') as f:  # GeoNames export files are UTF-8 encoded
    FileContent = f.read()
    print(FileContent)

CA	T0A	Eastern Alberta (St. Paul)	Alberta	AB					54.766	-111.7174	6
CA	T0B	Wainwright Region (Tofield)	Alberta	AB					53.0727	-111.5816	6
CA	T0C	Central Alberta (Stettler)	Alberta	AB					52.1431	-111.6941	5
CA	T0E	Western Alberta (Jasper)	Alberta	AB					53.6758	-115.0948	5
CA	T0G	North Central Alberta (Slave Lake)	Alberta	AB					55.6993	-114.4529	6
CA	T0H	Northwestern Alberta (High Level)	Alberta	AB					57.5403	-116.9153	6
CA	T0J	Southeastern Alberta (Drumheller)	Alberta	AB					50.9944	-111.4632	6
CA	T0K	International Border Region (Cardston)	Alberta	AB					49.4721	-112.2408	6
CA	T0L	Kananaskis Country (Claresholm)	Alberta	AB					50.6314	-114.4089	6
CA	T0M	Central Foothills (Sundre)	Alberta	AB					51.9552	-114.8691	6
CA	T0P	Northeastern Alberta (Fort Chipewyan)	Alberta	AB					58.2626	-110.7467	5
CA	T0V	Remote Northeast (Fitzgerald)	Alberta	AB					59.9049	-111.6717	6
CA	T1A	Medicine Hat Central	Alberta	AB	Medicine Hat 				50.0816	-110.5788	1
CA	T1B	Medicine Hat South	Alberta	AB	Medicine

In [5]:
#convert the file to a csv file
txt_file = '......./Coursera/IBM - Data Science/capestone project/capstone project week 4/data collection/postal codes/CA.txt'

column_names = ['1','2','3','4','5','6','7','8','9','10','11','12'] #add the column names
dataframe = pd.read_csv(txt_file,delimiter="\t",names=column_names)  #make a data frame from the text file

In [6]:
dataframe.head(5)

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CA | T0A | Eastern Alberta (St. Paul) | Alberta | AB |  |  |  |  | 54.766 | -111.7174 | 6.0 |
| 1 | CA | T0B | Wainwright Region (Tofield) | Alberta | AB |  |  |  |  | 53.0727 | -111.5816 | 6.0 |
| 2 | CA | T0C | Central Alberta (Stettler) | Alberta | AB |  |  |  |  | 52.1431 | -111.6941 | 5.0 |
| 3 | CA | T0E | Western Alberta (Jasper) | Alberta | AB |  |  |  |  | 53.6758 | -115.0948 | 5.0 |
| 4 | CA | T0G | North Central Alberta (Slave Lake) | Alberta | AB |  |  |  |  | 55.6993 | -114.4529 | 6.0 |
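
The placeholder names '1' to '12' work, but the columns can also be given descriptive names at read time. Below is a minimal sketch, assuming the column order described in the GeoNames postal code readme (country code, postal code, place name, three levels of admin name and code, latitude, longitude, accuracy). The list `GEONAMES_COLUMNS` and its snake_case names are illustrative choices, and the in-memory sample reuses two rows from the output above instead of the real file.

```python
import io

import pandas as pd

# Assumed column order, following the GeoNames postal code readme.
GEONAMES_COLUMNS = [
    'country_code', 'postal_code', 'place_name',
    'admin_name1', 'admin_code1', 'admin_name2', 'admin_code2',
    'admin_name3', 'admin_code3', 'latitude', 'longitude', 'accuracy',
]

# Two rows from the output above, tab-separated like CA.txt.
sample = (
    "CA\tT0A\tEastern Alberta (St. Paul)\tAlberta\tAB\t\t\t\t\t54.766\t-111.7174\t6\n"
    "CA\tT0B\tWainwright Region (Tofield)\tAlberta\tAB\t\t\t\t\t53.0727\t-111.5816\t6\n"
)

# Passing names=... tells pandas that the file has no header row.
named = pd.read_csv(io.StringIO(sample), sep='\t', names=GEONAMES_COLUMNS)
print(named[['postal_code', 'place_name', 'latitude', 'longitude']])
```

With names like these, the later rename step would reduce to selecting the columns of interest.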


In [7]:
# rename the columns that are kept
dataframe = dataframe.rename(columns={'2':'Postal_Code','3':'Neighborhood','4':'State','6':'City','10':'Latitude','11':'Longitude'})
# drop the redundant columns
dataframe.drop(columns=['1','5','7','8','9','12'], inplace=True)
# save the cleaned data as a csv file, without the row index
dataframe.to_csv(r'......../Coursera/IBM - Data Science/capestone project/capstone project week 4/data collection/postal codes/CanadaPostalCodes.csv', index=False)
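
The saved file still covers all of Canada, so a later step has to narrow it down to one city. Here is a minimal sketch of such a filter, assuming the `City` column holds the municipality name. The helper `select_city` is hypothetical, and the sample rows are copied from the file preview above; Medicine Hat stands in for Vancouver because no Vancouver rows appear in this section.

```python
import pandas as pd


def select_city(df, city):
    """Return the rows whose City equals `city`, ignoring case and padding."""
    cleaned = df['City'].str.strip().str.lower()
    # Rows with a missing City never match and are dropped by the mask.
    return df[cleaned == city.strip().lower()].reset_index(drop=True)


# Rows from the preview above; note the trailing space in 'Medicine Hat '.
sample = pd.DataFrame({
    'Postal_Code': ['T0A', 'T0B', 'T1A'],
    'Neighborhood': ['Eastern Alberta (St. Paul)',
                     'Wainwright Region (Tofield)',
                     'Medicine Hat Central'],
    'State': ['Alberta', 'Alberta', 'Alberta'],
    'City': [None, None, 'Medicine Hat '],
    'Latitude': [54.766, 53.0727, 50.0816],
    'Longitude': [-111.7174, -111.5816, -110.5788],
})

print(select_city(sample, 'Medicine Hat'))
```

Stripping whitespace before comparing matters here, since the raw file pads some city names with a trailing space.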


**This is the end of the webscraping part.**