# Street-level analysis of public transport options in Oslo

The project aims to:
* Extract the geo-coordinates of each street address
  * From the street's start address to its end address
* Use the Foursquare API to obtain:
  * Bus routes
  * Tram
  * Light rail
  * Trains
* Cluster the transport options down to the street level and draw conclusions
  

----
> Jump to:
* [Part 2](https://github.com/Niladri-B/Coursera_Captstone/blob/master/wk4/Capstone_part2.ipynb), *Extracting Foursquare Data*
* [Part 3](https://github.com/Niladri-B/Coursera_Captstone/blob/master/wk4/Capstone_part3.ipynb), *Exploratory Data Analysis*
* [Part 4](https://github.com/Niladri-B/Coursera_Captstone/blob/master/wk4/Capstone_part4.ipynb), *Clustering and Visualising*
* [Part 5](https://github.com/Niladri-B/Coursera_Captstone/blob/master/wk4/Capstone_part5.ipynb), *Conclusion & Discussion*

# Part I: Web Scraping of postcodes and streets

## 1. Set up environment

In [1]:
import pandas as pd
import numpy as np
import folium
import urllib.request, urllib.parse, urllib.error
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re

## 2. Obtain and parse URL

#### 2.1 Ignore SSL errors

In [2]:
cntxt = ssl.create_default_context()
cntxt.check_hostname = False
cntxt.verify_mode = ssl.CERT_NONE
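
The relaxed context applies to every request that is made with `cntxt`. For comparison, the sketch below shows what a default context enforces and how the relaxed one differs. The names `strict_ctx` and `relaxed_ctx` are illustrative and are not part of the original notebook.

```python
import ssl

# Default context: certificates and hostnames are verified
strict_ctx = ssl.create_default_context()

# Relaxed context, built the same way as cntxt above
relaxed_ctx = ssl.create_default_context()
relaxed_ctx.check_hostname = False       # must be disabled before verify_mode is relaxed
relaxed_ctx.verify_mode = ssl.CERT_NONE

print(strict_ctx.verify_mode == ssl.CERT_REQUIRED, strict_ctx.check_hostname)
print(relaxed_ctx.verify_mode == ssl.CERT_NONE, relaxed_ctx.check_hostname)
```

Keeping the relaxed context in its own variable, as the notebook does, means it is only used where it is passed explicitly.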

#### 2.2 Obtain URL

In [4]:
url = input('Please enter the website to obtain data from: ')
if len(url) < 1: url = 'https://www.erikbolstad.no/postnummer-koordinatar/kommune.php?kommunenummer=301'
print('You want data from >>>\n', url, '\n<<<')

Please enter the website to obtain data from:  


You want data from >>>
 https://www.erikbolstad.no/postnummer-koordinatar/kommune.php?kommunenummer=301 
<<<


#### 2.3 Parse URL

In [5]:
#Open the URL with a file-handle-like object
html = urlopen(url, context= cntxt).read() #The trailing read() slurps the whole response in one go

#Use BeautifulSoup to parse
soup = BeautifulSoup(html, 'html.parser')

## 3. Explore and obtain relevant data

For each postcode (postnummer):
* Ignore postcodes for PO boxes (postboksar), service postcodes (servicepostnummer), and postcodes shared by street addresses and PO boxes (gateadresser og postboksar)
* Obtain the latitude and longitude of each street-address postcode (Gate-/veg-adresse) and save them in table1
* Obtain all street names (Gate-/veg-adresse)
  * For each street name:
      * Obtain the coordinates (Koordinatar) of all addresses
          * Extract the coordinates of the start address (minimum)
          * Extract the coordinates of the end address (last)
          * ~~Estimate street length~~
          * Establish the mid-point
      * Save in table2
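
The per-street steps in this plan can be sketched without any scraping. The function below is a hypothetical helper, not code from the notebook. It assumes the coordinates arrive as strings, already ordered from the first to the last address, and it takes the mid-point to be the mean of all address coordinates.

```python
def summarise_street(coords):
    """Return (start, end, midpoint) for a list of (lat, lon) pairs.

    Assumes the list is already ordered from the first to the last address.
    """
    if not coords:
        raise ValueError('no coordinates for this street')
    lats = [float(lat) for lat, _ in coords]
    lons = [float(lon) for _, lon in coords]
    midpoint = (sum(lats) / len(lats), sum(lons) / len(lons))
    return (lats[0], lons[0]), (lats[-1], lons[-1]), midpoint

# Illustrative values, given as strings the way a scraped page would supply them
sample = [('59.8192', ' 10.8725'), ('59.8190', ' 10.8723'), ('59.8185', ' 10.8697')]
start, end, mid = summarise_street(sample)
print(start, end, mid)
```

Averaging all addresses rather than only the two end points keeps the mid-point on or near the street when the street bends.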

In [7]:
#Data is contained in the <tr> tags
#tags = soup('tr')

In [8]:
#Extract the rows (<tr> tags) for postcodes that are in use
tagsinUse = soup.find_all('tr', class_ = 'ibruk')

dictn = {} #Postcode as key, [district, latitude, longitude] as value

for i in tagsinUse:
    #Keep only postcodes that belong to a street address (Gate-/veg-adresse);
    #this ignores Postboksar, Gateadresser og postboksar and Servicepostnummer.
    #Converting the contents to str is needed to run the regex on them
    if not re.search('Gate-/veg-adresse', str(i.contents)):
        continue

    #The link text looks like '0018 OSLO', so split on the space to get the postcode
    pinCity = i.th.a.text.split(' ')

    #i.contents has a constant length of 9:
    #element 5 holds 'lat, lon' and element 7 holds the bydel (district)
    latlon = i.contents[5].a.text.split(',')
    dictn[pinCity[0]] = [i.contents[7].text, latlon[0], latlon[1]]

In [9]:
#Convert to dataframe
data = pd.DataFrame.from_dict(dictn, orient='index')
data.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 188 | Gamle Oslo | 59.9126 | 10.7608 |
| 159 | Sentrum | 59.9139 | 10.7416 |
| 584 | Bjerke | 59.939 | 10.818 |
| 281 | Ullern | 59.9208 | 10.6622 |
| 1275 | Søndre Nordstrand | 59.8267 | 10.8379 |


In [10]:
#Tidy the table: reset the index, then assign column names
#Reset index
data.reset_index(inplace= True)
data.head()

| | index | 0 | 1 | 2 |
|---|---|---|---|---|
| 0 | 188 | Gamle Oslo | 59.9126 | 10.7608 |
| 1 | 159 | Sentrum | 59.9139 | 10.7416 |
| 2 | 584 | Bjerke | 59.939 | 10.818 |
| 3 | 281 | Ullern | 59.9208 | 10.6622 |
| 4 | 1275 | Søndre Nordstrand | 59.8267 | 10.8379 |


In [11]:
#Change column names
data.columns = ['Postcode','Bydel/District','Latitude','Longitude'] #Assign the labels directly instead of mutating columns.values in place
data.head()

| | Postcode | Bydel/District | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 188 | Gamle Oslo | 59.9126 | 10.7608 |
| 1 | 159 | Sentrum | 59.9139 | 10.7416 |
| 2 | 584 | Bjerke | 59.939 | 10.818 |
| 3 | 281 | Ullern | 59.9208 | 10.6622 |
| 4 | 1275 | Søndre Nordstrand | 59.8267 | 10.8379 |


In [249]:
data.shape

(442, 4)

In [12]:
data.to_csv(path_or_buf= './postCode_Bydel.csv', index = False)
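
Postcodes such as 0018 begin with zeros, and those zeros are lost if the CSV is later read back with the column inferred as numeric. The sketch below shows the effect on a small made-up frame and one way around it. It writes to an in-memory buffer instead of `./postCode_Bydel.csv`, and the sample rows are illustrative only.

```python
import io
import pandas as pd

# Illustrative frame; postcodes are kept as four-character strings
sample = pd.DataFrame({'Postcode': ['0018', '0050', '1275'],
                       'Latitude': [59.91, 59.91, 59.83]})
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default type inference turns the postcodes into integers
buffer.seek(0)
as_numbers = pd.read_csv(buffer)

# Declaring the column as str preserves the leading zeros
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'Postcode': str})

print(as_numbers['Postcode'].tolist())
print(as_text['Postcode'].tolist())
```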

## 4. Obtain street-level information

In [14]:
def postCheck(pass_the_url):
    """Report whether the URL belongs to a postcode that should be skipped."""
    exceptions = ['0018','0045']
    if any(re.search(post, pass_the_url) for post in exceptions):
        print('Exceptional post found, skipping...')
        return True
    print(pass_the_url)
    return False
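
As written, the loops further down call `postCheck` without using its return value, so the message is printed but processing carries on. A minimal sketch of how a check like this could actually skip a postcode is shown below. `should_skip` is a hypothetical stand-in for `postCheck`, and the first URL is assumed to follow the same pattern as the postcode links printed in this notebook.

```python
import re

EXCEPTIONS = ['0018', '0045']  # same postcodes as in postCheck

def should_skip(post_url, exceptions=EXCEPTIONS):
    """Hypothetical helper: True if the URL mentions an excluded postcode."""
    return any(re.search(post, post_url) for post in exceptions)

urls = [
    'https://www.erikbolstad.no/postnummer-koordinatar/?postnummer=0018',
    'https://www.erikbolstad.no/postnummer-koordinatar/?postnummer=0050',
]
kept = []
for post_url in urls:
    if should_skip(post_url):
        continue  # nothing below runs for excluded postcodes
    kept.append(post_url)
print(kept)
```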
        

In [15]:
import requests
#requests handles URLs containing non-ASCII characters, such as those found in Norwegian, better than urllib does
#r = requests.get('https://no.wikipedia.org/wiki/Jonas_Gahr_Støre')
#print(r.text)
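
If `urllib` were used for the street pages instead, any non-ASCII characters in the URL would first need to be percent-encoded. The sketch below does this with the standard library. The base URL and the `veg`/`kommune` parameters follow the street links printed in this notebook, and nothing is fetched.

```python
from urllib.parse import quote, urlencode

base = 'https://www.erikbolstad.no/postnummer-koordinatar/veg.php'

# urlencode percent-encodes each value as UTF-8 and joins the pairs with '&'
query = urlencode({'veg': 'Blåskjellveien', 'kommune': 301})
street_url = base + '?' + query
print(street_url)

# quote does the same for a single component; spaces become %20
encoded_name = quote('Dronningens gate')
print(encoded_name)
```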

In [16]:
#Small version, to test things out

#Extract tags that are in use
tagsinUse = soup.find_all('tr', class_ = 'ibruk')

count = 0 #Initiate counter
midStreetDict = dict()
midStreetLoc = list()
failedStreet = list()

for i in tagsinUse:      
    if re.search('Gate-/veg-adresse', str(i.contents)): #i.e. print only if postbox belongs to gate/veg-adresse
        
        #For initial testing
        count = count +1
        if count > 6:
            break
        print('\nCount: {}'.format(count))
        print('Working on postcode {}...'.format(i.th.a.text))
        posturl = i.th.a.get('href', None )#Postcode link
        postCheck(posturl)#Custom function defined above
        
        #Parse Postcode link and open it
        html2 = urllib.request.urlopen(posturl, context=cntxt)#Open, similar to filehandle
        #print('HTML2 is {}'.format(html2))
        soup2 = BeautifulSoup(html2, 'html.parser')#Parse
        #print(soup2)
        
        #Establish canonical link that will combine with street address URL to form new URL
        canonical = 'https://www.erikbolstad.no/postnummer-koordinatar/'
        
        #Extract tag in the new link
        tagsinUse2 = soup2.find_all('table', style="margin-top: 4rem;")
        for j in tagsinUse2:
            #print('\n<<<<Printing all tags selected from soup2...>>>>')
            #print(j)
            td = j.find_all('td', itemtype="https://schema.org/StreetAddress")#Find the street addresses
            #print('<<<<Printing td tags...>>>>')
            #print(td,'\n')
            #print(j.td.a.text)#Gives only 1 street address#Early version of script
            for tag in td:
                streetUrl = canonical+tag.a.get('href',None)
                streetName = tag.a.text
                print(streetUrl)
                #print('Working on:', tag.a.text,'\n')
                print('Working on:', streetName)
                
                #Parse street url link and open it
                try:
                    #html3 = urllib.request.urlopen(streetUrl, context=cntxt)#Open, similar to filehandle
                    html3 = requests.get(streetUrl).text
                    #print('HTML3 is {}'.format(html3))
                    soup3 = BeautifulSoup(html3, 'html.parser')#Parse
                
                    #Extract all anchor tags since within them one finds the street address + geocoordinates
                    streets = soup3.find_all('a')
    
                    streetList = []
                    latlonList = []
                    #latlonTup = tuple()
                    for datas in streets:
                        text = datas.text
                        #print(text) 
                        #Only get the text that is street name or lat, long
                        if re.search(re.escape(streetName)+r'\s[0-9]+', text):#Escape the name so that it is matched literally; raw string for \s
                            #print(text)
                            streetList.append(text)
                     
                        elif re.search('[0-9]+,', text):
                            coordinates = text.split(',')
                            #print(tuple(coordinates)) #Make into latlon coordinates tuple so that it can be appended to list
                            latlonList.append(tuple(coordinates))
                        
                    #print(streetList)
                    #print(latlonList)
                    combined = dict(zip(streetList, latlonList))
                
                    #Establish mid-points of streets
                    lat2 = []
                    long2 = []
                    for l in latlonList :
                        lat2.append(float(l[0]))
                        long2.append(float(l[1]))

                    midStreetLat = sum(lat2)/len(lat2)
                    midStreetLon = sum(long2)/len(long2)
                    #midStreetLoc.append(midStreetLat,midStreetLoc)
                    print(streetName,midStreetLat, midStreetLon,'\n')
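                except Exception as err:#Catch errors only; a bare except would also swallow KeyboardInterrupt
                    print('Can\'t open URL ({}), possible UnicodeEncode error. Skipping...\n'.format(err))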
                    failedStreet.append(streetName)
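
The innermost part of this loop can be tested on its own once it is separated from the network calls. The sketch below is one possible arrangement, not the notebook's code. `pair_addresses` and `midpoint` are hypothetical helpers, the anchor texts are made up to resemble a street page, and coordinates are converted to floats straight away. The explicit empty check matters because dividing by `len(lat2)` raises `ZeroDivisionError` when no coordinates were found, which the `except` clause above would report as a URL problem.

```python
import re

def pair_addresses(street_name, anchor_texts):
    """Hypothetical helper: pair address texts with coordinate texts in order of appearance."""
    address_pattern = re.compile(re.escape(street_name) + r'\s[0-9]+')
    coord_pattern = re.compile(r'^\s*-?[0-9]+\.[0-9]+\s*,\s*-?[0-9]+\.[0-9]+\s*$')
    addresses, coords = [], []
    for text in anchor_texts:
        if address_pattern.search(text):
            addresses.append(text)
        elif coord_pattern.match(text):
            lat, lon = text.split(',')
            coords.append((float(lat), float(lon)))
    return dict(zip(addresses, coords))

def midpoint(coords):
    """Mean latitude and longitude, or None when there is nothing to average."""
    coords = list(coords)
    if not coords:
        return None
    lats, lons = zip(*coords)
    return sum(lats) / len(lats), sum(lons) / len(lons)

texts = ['Home', 'Rådhusgata 11', '59.9094, 10.7434', 'Rådhusgata 2', '59.9085, 10.7462']
combined_demo = pair_addresses('Rådhusgata', texts)
mid_demo = midpoint(combined_demo.values())
print(combined_demo)
print(mid_demo)
print(midpoint([]))
```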
                    



Count: 1
Working on postcode 0018 OSLO...
Exceptional post found, skipping...

Count: 2
Working on postcode 0045 OSLO...
Exceptional post found, skipping...

Count: 3
Working on postcode 0050 OSLO...
https://www.erikbolstad.no/postnummer-koordinatar/?postnummer=0050
https://www.erikbolstad.no/postnummer-koordinatar/veg.php?veg=Stenersgata&kommune=301
Working on: Stenersgata
Stenersgata 59.91311111111111 10.752633333333334 


Count: 4
Working on postcode 0139 OSLO...
https://www.erikbolstad.no/postnummer-koordinatar/?postnummer=0139
https://www.erikbolstad.no/postnummer-koordinatar/veg.php?veg=Blåskjellveien&kommune=301
Working on: Blåskjellveien
Blåskjellveien 59.870850000000004 10.771999999999998 

https://www.erikbolstad.no/postnummer-koordinatar/veg.php?veg=Fiskekroken&kommune=301
Working on: Fiskekroken
Fiskekroken 59.87151269841268 10.772650793650794 

https://www.erikbolstad.no/postnummer-koordinatar/veg.php?veg=Fjordveien&kommune=301
Working on: Fjordveien
Fjordveien 59.8690962

In [17]:
print(failedStreet)

['Dronningens gate']


#### NOTE: FOLLOWING CODE TAKES ~45 mins to RUN
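
Given the long run time, it may help to make the scrape resumable so that an interrupted run does not start from scratch. The sketch below illustrates the idea with a cache dictionary. `scrape_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real network call so that the example runs offline.

```python
def scrape_all(street_names, fetch_midpoint, cache):
    """Fill cache with one midpoint per street, skipping streets already present."""
    failed = []
    for name in street_names:
        if name in cache:
            continue  # already scraped in an earlier run
        try:
            cache[name] = fetch_midpoint(name)
        except Exception as err:  # keep going, but remember what failed and why
            failed.append((name, str(err)))
    return failed

# Stand-in for the real network call, so the sketch runs offline
calls = []
def fake_fetch(name):
    calls.append(name)
    if name == 'Dronningens gate':
        raise ValueError('no coordinates found')
    return [59.9, 10.7]

cache = {'Stenersgata': [59.91311111111111, 10.752633333333334]}
failed = scrape_all(['Stenersgata', 'Fiskekroken', 'Dronningens gate'], fake_fetch, cache)
print(sorted(cache), failed, calls)
```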

In [22]:
#LONG CODE # RUN ONLY IF ABSOLUTELY REQUIRED

#Extract tags that are in use
tagsinUse = soup.find_all('tr', class_ = 'ibruk')

count = 0 #Initiate counter
midStreetDict = dict()
midStreetLoc = list()

for i in tagsinUse:      
    if re.search('Gate-/veg-adresse', str(i.contents)): #i.e. print only if postbox belongs to gate/veg-adresse
        
        #For initial testing
        count = count +1
        #if count > 4:break
        print('\nCount: {}'.format(count))
        print('Working on postcode {}...'.format(i.th.a.text))
        posturl = i.th.a.get('href', None )#Postcode link
        postCheck(posturl)#Custom function defined above
        
        #Parse Postcode link and open it
        html2 = urllib.request.urlopen(posturl, context=cntxt)#Open, similar to filehandle
        #print('HTML2 is {}'.format(html2))
        soup2 = BeautifulSoup(html2, 'html.parser')#Parse
        #print(soup2)
        
        #Establish canonical link that will combine with street address URL to form new URL
        canonical = 'https://www.erikbolstad.no/postnummer-koordinatar/'
        
        #Extract tag in the new link
        tagsinUse2 = soup2.find_all('table', style="margin-top: 4rem;")
        for j in tagsinUse2:
            #print('\n<<<<Printing all tags selected from soup2...>>>>')
            #print(j)
            td = j.find_all('td', itemtype="https://schema.org/StreetAddress")#Find the street addresses
            #print('<<<<Printing td tags...>>>>')
            #print(td,'\n')
            #print(j.td.a.text)#Gives only 1 street address#Early version of script
            for tag in td:
                streetUrl = canonical+tag.a.get('href',None)
                streetName = tag.a.text
                print(streetUrl)
                #print('Working on:', tag.a.text,'\n')
                print('Working on:', streetName)
                
                #Parse street url link and open it
                try:
                    #html3 = urllib.request.urlopen(streetUrl, context=cntxt)#Open, similar to filehandle
                    html3 = requests.get(streetUrl).text
                    #print('HTML3 is {}'.format(html3))
                    soup3 = BeautifulSoup(html3, 'html.parser')#Parse
                
                    #Extract all anchor tags since within them one finds the street address + geocoordinates
                    streets = soup3.find_all('a')
    
                    streetList = []
                    latlonList = []
                    #latlonTup = tuple()
                    for datas in streets:
                        text = datas.text
                        #print(text) 
                        #Only get the text that is street name or lat, long
                        if re.search(re.escape(streetName)+r'\s[0-9]+', text):#Escape the name so that it is matched literally; raw string for \s
                            #print(text)
                            streetList.append(text)
                     
                        elif re.search('[0-9]+,', text):
                            coordinates = text.split(',')
                            #print(tuple(coordinates)) #Make into latlon coordinates tuple so that it can be appended to list
                            latlonList.append(tuple(coordinates))
                        
                        #print(streetList)
                        #print(latlonList)
                    combined = dict(zip(streetList, latlonList))
                
                    #Establish mid-points of streets
                    lat2 = []
                    long2 = []
                    for l in latlonList :
                        lat2.append(float(l[0]))
                        long2.append(float(l[1]))

                    midStreetLat = sum(lat2)/len(lat2)
                    midStreetLon = sum(long2)/len(long2)
                    #Store the mid-point as [latitude, longitude], keyed by street name
                    midStreetDict[streetName] = [midStreetLat, midStreetLon]
                    print(streetName,midStreetLat, midStreetLon,'\n')
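                except Exception as err:#Catch errors only; a bare except would also swallow KeyboardInterrupt
                    print('Can\'t open URL ({}), possible UnicodeEncode error. Skipping...\n'.format(err))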



Count: 1
Working on postcode 0018 OSLO...
Exceptional post found, skipping...

Count: 2
Working on postcode 0045 OSLO...
Exceptional post found, skipping...

Count: 3
Working on postcode 0050 OSLO...
https://www.erikbolstad.no/postnummer-koordinatar/?postnummer=0050
https://www.erikbolstad.no/postnummer-koordinatar/veg.php?veg=Stenersgata&kommune=301
Working on: Stenersgata
Stenersgata 59.91311111111111 10.752633333333334 


Count: 4
Working on postcode 0139 OSLO...
https://www.erikbolstad.no/postnummer-koordinatar/?postnummer=0139
https://www.erikbolstad.no/postnummer-koordinatar/veg.php?veg=Blåskjellveien&kommune=301
Working on: Blåskjellveien
Blåskjellveien 59.870850000000004 10.771999999999998 

https://www.erikbolstad.no/postnummer-koordinatar/veg.php?veg=Fiskekroken&kommune=301
Working on: Fiskekroken
Fiskekroken 59.87151269841268 10.772650793650794 

https://www.erikbolstad.no/postnummer-koordinatar/veg.php?veg=Fjordveien&kommune=301
Working on: Fjordveien
Fjordveien 59.8690962

In [20]:
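#Create backup version of the dictionary created above, just in case
#Plain assignment would only create a second name for the same dictionary, so copy the contents
midStreetDict_backUp = {k: list(v) for k, v in midStreetDict.items()}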

#Check backup dictionary formed
for k,v in midStreetDict_backUp.items():
    print(k,v)
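
An in-memory backup is lost when the kernel restarts, which matters for a result that takes this long to produce. The sketch below shows an independent copy made with `copy.deepcopy` and a round trip through a JSON file. The dictionary contents, the variable names and the file name `midStreetDict_demo.json` are illustrative assumptions.

```python
import copy
import json
import os
import tempfile

mid_points = {'Stenersgata': [59.91311111111111, 10.752633333333334]}

# An independent in-memory copy: later changes to mid_points do not affect it
backup = copy.deepcopy(mid_points)
mid_points['Stenersgata'].append('changed')
print(backup['Stenersgata'])

# A copy on disk survives a kernel restart; ensure_ascii=False keeps å, ø, æ readable
path = os.path.join(tempfile.gettempdir(), 'midStreetDict_demo.json')
with open(path, 'w', encoding='utf-8') as handle:
    json.dump(backup, handle, ensure_ascii=False)
with open(path, encoding='utf-8') as handle:
    restored = json.load(handle)
print(restored == backup)
```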

In [18]:
#Check a small sample of streets
streetDataSmall = pd.DataFrame.from_dict(combined, orient='index')

#Reset row index to numerical values
streetDataSmall.reset_index(inplace = True)

#Change column names
streetDataSmall.columns = ['StreetAddress','Latitude','Longitude'] #Assign the labels directly instead of mutating columns.values in place

#View first 5 rows
streetDataSmall.head()

| | StreetAddress | Latitude | Longitude |
|---|---|---|---|
| 0 | Rådhusgata 11 | 59.9094 | 10.7434 |
| 1 | Rådhusgata 2 | 59.9085 | 10.7462 |
| 2 | Rådhusgata 7B | 59.9092 | 10.7441 |
| 3 | Rådhusgata 20 | 59.9095 | 10.7421 |
| 4 | Rådhusgata 5 | 59.9089 | 10.7455 |


In [184]:
#Arrange above df in ascending order of address
#streetDataSmall.sort_values('StreetAddress', ascending= True, axis = 0)
from natsort import order_by_index, index_natsorted
streetDataSmall.reindex(index=order_by_index(streetDataSmall.index, index_natsorted(streetDataSmall.StreetAddress))).reset_index()

| | index | StreetAddress | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 31 | Vargveien 1A | 59.8721 | 10.7764 |
| 1 | 29 | Vargveien 1B | 59.8721 | 10.7764 |
| 2 | 3 | Vargveien 1C | 59.8719 | 10.7765 |
| 3 | 46 | Vargveien 1D | 59.8718 | 10.7765 |
| 4 | 18 | Vargveien 1E | 59.8716 | 10.7763 |
| 5 | 24 | Vargveien 1F | 59.8714 | 10.7763 |
| 6 | 40 | Vargveien 1G | 59.8714 | 10.7762 |
| 7 | 42 | Vargveien 1H | 59.8712 | 10.776 |
| 8 | 39 | Vargveien 2 | 59.8712 | 10.7767 |
| 9 | 38 | Vargveien 3A | 59.8707 | 10.7753 |


In [201]:
from natsort import natsorted
natsorted(combined.items())#Sorts naturally on the key (address); each value (coordinates) stays paired with its key

[('Stensrudsvingen 2', ('59.8192', ' 10.8725')),
 ('Stensrudsvingen 4', ('59.8190', ' 10.8723')),
 ('Stensrudsvingen 5', ('59.8205', ' 10.8700')),
 ('Stensrudsvingen 6', ('59.8174', ' 10.8703')),
 ('Stensrudsvingen 7', ('59.8193', ' 10.8703')),
 ('Stensrudsvingen 8', ('59.8171', ' 10.8703')),
 ('Stensrudsvingen 10', ('59.8185', ' 10.8707')),
 ('Stensrudsvingen 12', ('59.8183', ' 10.8702')),
 ('Stensrudsvingen 14', ('59.8185', ' 10.8697'))]
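
Natural sorting matters here because plain string comparison places "10" before "2". The sketch below shows the idea with a small standard-library key function. It is a simplified stand-in for illustration and not how `natsort` is implemented; `natural_key` is a hypothetical name.

```python
import re

def natural_key(text):
    """Split text into string and integer chunks so that '10' sorts after '2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', text)]

addresses = ['Stensrudsvingen 10', 'Stensrudsvingen 2', 'Stensrudsvingen 4']
plain_order = sorted(addresses)                     # character-by-character comparison
natural_order = sorted(addresses, key=natural_key)  # house numbers compared as integers
print(plain_order)
print(natural_order)
```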

In [227]:
#Obtain the first and last address of the naturally sorted street
sortedAddresses = natsorted(combined.items())

#Each item is (address, (lat, lon)); the geopy distance estimator takes in a tuple
coords1 = sortedAddresses[0][1] #Coordinates of the first address
coords2 = sortedAddresses[-1][1] #Coordinates of the last address
print(coords1)
print(coords2)

('59.8192', ' 10.8725')
('59.8185', ' 10.8697')


In [229]:
import geopy.distance

coords_1 = (52.2296756, 21.0122287)
print(type(coords_1))
coords_2 = (52.406374, 16.9251681)
print(type(coords_2))

#print (geopy.distance.vincenty(coords_1, coords_2).km) #Vincenty is deprecated; use either of the two below
#print (geopy.distance.geodesic(coords_1, coords_2).km)
#print (geopy.distance.distance(coords_1, coords_2).km) #Default
print( geopy.distance.distance(coords1, coords2).m)

<class 'tuple'>
<class 'tuple'>
175.3858930348763
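
As a rough cross-check of this figure, the distance can also be approximated with the haversine formula, which treats the Earth as a sphere. The sketch below is an independent approximation and not what `geopy` computes. `haversine_m` is a hypothetical helper and the mean Earth radius of 6371 km is an assumption, so a small difference from the value above is expected.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(point_a, point_b, radius_m=6371000.0):
    """Great-circle distance in metres between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = (radians(float(value)) for value in point_a)
    lat2, lon2 = (radians(float(value)) for value in point_b)
    half_chord = (sin((lat2 - lat1) / 2) ** 2
                  + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_m * asin(sqrt(half_chord))

# First and last address coordinates, as strings like the scraped values
first = ('59.8192', ' 10.8725')
last = ('59.8185', ' 10.8697')
street_length_m = haversine_m(first, last)
print(round(street_length_m, 1))
```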


In [21]:
midStreetDict

{}

In [19]:
streetData = pd.DataFrame.from_dict(midStreetDict, orient= 'index')

#Reset index
streetData.reset_index(inplace = True)

#Rename columns
streetData.columns.values[0:] = ['Street','MidLatitude','MidLongitude']
streetData.head()

ValueError: cannot copy sequence with size 3 to array axis with dimension 1

In [13]:
streetData.shape

NameError: name 'streetData' is not defined

In [17]:
streetData.isnull().sum()

NameError: name 'streetData' is not defined