# Coursera: Battle of Neighborhoods 

This notebook is part of the IBM Applied Data Science Capstone project.

I am interested in exploring schools in Sweden. In this notebook I will attempt to scrape school names and addresses from edarabia, request their coordinates, and explore the surrounding areas with the Foursquare API.

In [None]:
# get drivers ready for selenium scraping
from zipfile import ZipFile
import wget

wget.download('https://github.com/mozilla/geckodriver/releases/download/v0.27.0/geckodriver-v0.27.0-win64.zip', 'gecko.zip')
with ZipFile('gecko.zip', 'r') as z: z.extractall('.')

# Note: the file had been downloaded from previous project
# so I did not run this code.

In [16]:
# scrape https://www.edarabia.com/schools/sweden/ for Sweden top schools.
from selenium import webdriver

driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get('https://www.edarabia.com/schools/sweden/')

In [12]:
# Upon inspection, school names are part of a hyperlink under <h5> tags
links = driver.find_elements_by_xpath('//h5/a')
for i, link in enumerate(links):
    print(i, link.text)

0 
1 
2 
3 
4 Bladins International School
5 Deutsche Schule Stockholm
6 Engelska Skolan Norr AB
7 Fryshuset School
8 International IT College of Sweden
9 Lycee Francais Saint Louis in Stockholm
10 ProCivitas Private Gymnasium Stockholm
11 Swedish Finnish School
12 The Tanto International School
13 Thea Private Grundskola


In [14]:
# there seem to be empty links, let's remove them
schools = [link.text for link in links if link.text]
schools

['Bladins International School',
 'Deutsche Schule Stockholm',
 'Engelska Skolan Norr AB',
 'Fryshuset School',
 'International IT College of Sweden',
 'Lycee Francais Saint Louis in Stockholm',
 'ProCivitas Private Gymnasium Stockholm',
 'Swedish Finnish School',
 'The Tanto International School',
 'Thea Private Grundskola']

In [17]:
# Let's see if we can get their address.
address = driver.find_elements_by_xpath('//ul/li')
for i, addr in enumerate(address):
    print(i, addr.text)

0 
1 Universities
2 Schools
3 Nurseries
4 Courses
5 Jobs
6 Events
7 Login
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 
43 
44 
45 
46 
47 
48 
49 
50 
51 
52 
53 
54 
55 Home
56 Schools
57 Sweden
58 Preschool (förskola)
59 Compulsory schooling which is made up of Elementary school (lågstadiet), middle school (mellanstadiet) and junior high school (högstadiet).
60 Upper secondary school or senior high school (gymnasium)
61 Tertiary education
62 
63 
64 
65 
66 
67 
68 
69 
70 
71 
72 
73 
74 
75 Address: Sjallandstorget 1
76 Curriculum: IB
77 SEK 23,500
78 Address: Karlavagen 25
79 Address: Roslagstullsbacken 4
80 Founded: 1993
81 Address: Martensdalsgatan 2-8
82 Founded: 1984
83 Address: Halsobrunnsgatan 6
84 Founded: 2003
85 Address: Essingestraket 24
86 Founded: 1959
87 Curriculum: French
88 SEK 12,375
89 Address: Sandbacksgatan 10
90 Address: Ryttargatan 275, Planner
91 Address: Flintbacken 

In [None]:
# That should be all the data needed so let's close the driver
driver.close()

In [25]:
# We can isolate the addresses by looking for the string "Address:"
school_addresses = []
for addr in address:
    text = addr.text
    if "Address:" in text:
        # The ", Planner" in one of the addresses causes problems with geolocation,
        # so keep only the part before the first comma
        text = text.split(',')[0]
        # str.replace returns a new string, so use its result directly
        school_addresses.append(text.replace('Address: ', ''))
school_addresses


['Sjallandstorget 1',
 'Karlavagen 25',
 'Roslagstullsbacken 4',
 'Martensdalsgatan 2-8',
 'Halsobrunnsgatan 6',
 'Essingestraket 24',
 'Sandbacksgatan 10',
 'Ryttargatan 275',
 'Flintbacken 20',
 'Tullgardsgatan 10']
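
The cleaning step above can also be written as a small standalone function, which makes it easy to check against a few strings without a live browser session. This is only a sketch: `clean_address` and the `samples` list are names introduced here for illustration, and the sample strings are copied from the scraped output above.

```python
def clean_address(raw: str) -> str:
    """Strip the 'Address: ' label and drop anything after the first comma."""
    prefix = 'Address: '
    if raw.startswith(prefix):
        raw = raw[len(prefix):]
    return raw.split(',')[0].strip()


# Sample list items taken from the scraped <li> texts shown earlier
samples = [
    'Address: Ryttargatan 275, Planner',
    'Address: Karlavagen 25',
    'Curriculum: IB',
]

# Keep only the items that carry an address, then clean them
cleaned = [clean_address(s) for s in samples if s.startswith('Address: ')]
print(cleaned)  # ['Ryttargatan 275', 'Karlavagen 25']
```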

In [26]:
# check if we get same number of addresses and schools
print(len(schools))
print(len(school_addresses))

10
10


In [28]:
# see if we can get coordinates from address
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent='my_agent')
lat = []
long = []
for addr in school_addresses:
    loc = geolocator.geocode(f'{addr}, Sweden')
    lat.append(loc.latitude)
    long.append(loc.longitude)
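
The loop above assumes every address resolves. Nominatim's `geocode` returns `None` when it finds nothing, in which case `loc.latitude` would raise an `AttributeError`. A more defensive version might look like the sketch below. `geocode_all` is a hypothetical helper, and `fake_geocode` is a stand-in used only so the example runs offline; in the notebook you would pass `geolocator.geocode` and keep a delay of about one second between requests, in line with Nominatim's usage policy.

```python
import time
from types import SimpleNamespace


def geocode_all(addresses, geocode, country='Sweden', delay=1.0):
    """Geocode each address; return (found, missing).

    found maps address -> (latitude, longitude); missing lists the
    addresses for which the geocoder returned None.
    """
    found, missing = {}, []
    for i, addr in enumerate(addresses):
        if i:
            time.sleep(delay)  # pause between consecutive requests
        loc = geocode(f'{addr}, {country}')
        if loc is None:
            missing.append(addr)
        else:
            found[addr] = (loc.latitude, loc.longitude)
    return found, missing


def fake_geocode(query):
    """Offline stand-in for geolocator.geocode (illustration only)."""
    known = {
        'Karlavagen 25, Sweden': SimpleNamespace(latitude=59.345046,
                                                 longitude=18.063079),
    }
    return known.get(query)


found, missing = geocode_all(['Karlavagen 25', 'No Such Street 1'],
                             fake_geocode, delay=0)
print(found)
print(missing)
```

Collecting the misses separately keeps the school, address, and coordinate lists from silently drifting out of alignment.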


In [31]:
# check length of lat and long
print(len(lat))
print(len(long))

10
10


In [33]:
# All lists contain same number of elements
# We can zip them and create dataframe
import pandas as pd

df = pd.DataFrame(zip(schools, school_addresses, lat, long), columns=['School','Address','Lat','Long'])
df

Unnamed: 0,School,Address,Lat,Long
0,Bladins International School,Sjallandstorget 1,55.59273,12.987777
1,Deutsche Schule Stockholm,Karlavagen 25,59.345046,18.063079
2,Engelska Skolan Norr AB,Roslagstullsbacken 4,59.350678,18.059369
3,Fryshuset School,Martensdalsgatan 2-8,59.300544,18.088212
4,International IT College of Sweden,Halsobrunnsgatan 6,59.337121,18.047313
5,Lycee Francais Saint Louis in Stockholm,Essingestraket 24,59.319225,17.989403
6,ProCivitas Private Gymnasium Stockholm,Sandbacksgatan 10,59.316701,18.080628
7,Swedish Finnish School,Ryttargatan 275,59.507273,17.907965
8,The Tanto International School,Flintbacken 20,59.309693,18.049634
9,Thea Private Grundskola,Tullgardsgatan 10,59.305603,18.084816


In [10]:
# let's see if we can get the city name
import reverse_geocoder as rg

cities = [each['name'] for each in rg.search(list(zip(df['Lat'],df['Long'])))]
cities

['Malmoe',
 'Stockholm',
 'Stockholm',
 'Arsta',
 'Stockholm',
 'Solna',
 'Stockholm',
 'Upplands Vaesby',
 'Arsta',
 'OEstermalm']

In [11]:
# add cities as a column into our df
df['City'] = cities
df.head()

Unnamed: 0,School,Address,Lat,Long,City
0,Bladins International School,Sjallandstorget 1,55.59273,12.987777,Malmoe
1,Deutsche Schule Stockholm,Karlavagen 25,59.345046,18.063079,Stockholm
2,Engelska Skolan Norr AB,Roslagstullsbacken 4,59.350678,18.059369,Stockholm
3,Fryshuset School,Martensdalsgatan 2-8,59.300544,18.088212,Arsta
4,International IT College of Sweden,Halsobrunnsgatan 6,59.337121,18.047313,Stockholm


In [22]:
# Save the dataframe for future session
df.to_csv('schools_city.csv', index=False)

In [1]:
import pandas as pd
df = pd.read_csv('schools_city.csv')
df.head()

Unnamed: 0,School,Address,Lat,Long,City
0,Bladins International School,Sjallandstorget 1,55.59273,12.987777,Malmoe
1,Deutsche Schule Stockholm,Karlavagen 25,59.345046,18.063079,Stockholm
2,Engelska Skolan Norr AB,Roslagstullsbacken 4,59.350678,18.059369,Stockholm
3,Fryshuset School,Martensdalsgatan 2-8,59.300544,18.088212,Arsta
4,International IT College of Sweden,Halsobrunnsgatan 6,59.337121,18.047313,Stockholm


In [3]:
# let's visualize the location with folium
import folium

latitude = df.Lat.mean()
longitude = df.Long.mean()
# named school_map to avoid shadowing the built-in map()
school_map = folium.Map(location=[latitude, longitude], zoom_start=5)
for lat, lng, school, city in zip(df['Lat'], df['Long'], df['School'], df['City']):
    label = folium.Popup(f'{school}, {city}', parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(school_map)
school_map

From the map, most schools cluster around Stockholm, with a few in Arsta, Solna, and Upplands Vaesby, and one far away in Malmoe. I will narrow my exploration down to these five cities, create a new dataframe, and assign each city the mean coordinates of its schools from the previous dataframe.

In [2]:
cities_of_interest = ['Stockholm', 'Arsta', 'Solna', 'Upplands Vaesby', 'Malmoe']
lat = []
long = []
for city in cities_of_interest:
    lat.append(df[df['City'].str.contains(city)].Lat.mean())
    long.append(df[df['City'].str.contains(city)].Long.mean())
df_city = pd.DataFrame(zip(cities_of_interest, lat, long), columns=['City', 'Lat', 'Long'])
df_city


Unnamed: 0,City,Lat,Long
0,Stockholm,59.337386,18.062597
1,Arsta,59.305118,18.068923
2,Solna,59.319225,17.989403
3,Upplands Vaesby,59.507273,17.907965
4,Malmoe,55.59273,12.987777
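
Note that `str.contains` does substring matching, and that the school tagged `OEstermalm` is not part of any of the five cities. As a cross-check, the same per-city means can be reproduced with exact matching in plain Python. The sketch below is illustrative only: `rows` is copied by hand from the school dataframe above, and `centres` is a name introduced here.

```python
from collections import defaultdict
from statistics import mean

# (City, Lat, Long) copied from the school dataframe shown earlier
rows = [
    ('Malmoe', 55.59273, 12.987777),
    ('Stockholm', 59.345046, 18.063079),
    ('Stockholm', 59.350678, 18.059369),
    ('Arsta', 59.300544, 18.088212),
    ('Stockholm', 59.337121, 18.047313),
    ('Solna', 59.319225, 17.989403),
    ('Stockholm', 59.316701, 18.080628),
    ('Upplands Vaesby', 59.507273, 17.907965),
    ('Arsta', 59.309693, 18.049634),
    ('OEstermalm', 59.305603, 18.084816),
]

# Group coordinates by exact city name
grouped = defaultdict(list)
for city, lat, lon in rows:
    grouped[city].append((lat, lon))

# Mean latitude and longitude per city
centres = {
    city: (mean(p[0] for p in pts), mean(p[1] for p in pts))
    for city, pts in grouped.items()
}

for city, (lat, lon) in centres.items():
    print(f'{city}: {lat:.6f}, {lon:.6f}')
```

With pandas, `df.groupby('City')[['Lat', 'Long']].mean()` should give the same numbers in one line; filtering that result to `cities_of_interest` would then drop `OEstermalm` explicitly rather than implicitly.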


In [3]:
import folium

latitude = df_city.Lat.mean()
longitude = df_city.Long.mean()
# named city_map to avoid shadowing the built-in map()
city_map = folium.Map(location=[latitude, longitude], zoom_start=5)
for lat, lng, city in zip(df_city['Lat'], df_city['Long'], df_city['City']):
    label = folium.Popup(f'{city}', parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(city_map)
city_map

Stockholm, Arsta, and Solna are quite close to each other, so I will narrow the radius of their venue search to 10 km, while the other two can be searched as wide as 20 km without overlapping.
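
One way to check these radius choices is to compute the great-circle distance between each pair of city centres and compare it with the sum of the two radii. The sketch below uses the haversine formula with a mean Earth radius of 6371 km. It assumes the searches are centred on the `df_city` coordinates, which the `near=` query used later does not guarantee. `haversine_km`, `centres`, and `radius_km` are names introduced here.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# City centres copied from df_city above
centres = {
    'Stockholm': (59.337386, 18.062597),
    'Arsta': (59.305118, 18.068923),
    'Solna': (59.319225, 17.989403),
    'Upplands Vaesby': (59.507273, 17.907965),
    'Malmoe': (55.59273, 12.987777),
}
# Planned search radii in km, as described in the text
radius_km = {'Stockholm': 10, 'Arsta': 10, 'Solna': 10,
             'Upplands Vaesby': 20, 'Malmoe': 20}

for a, b in combinations(centres, 2):
    d = haversine_km(*centres[a], *centres[b])
    overlap = d < radius_km[a] + radius_km[b]
    print(f'{a} - {b}: {d:.1f} km, circles overlap: {overlap}')
```

Any pair flagged as overlapping may return some of the same venues for both cities, which is worth keeping in mind when comparing counts.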

In [4]:
# Foursquare credentials are kept in a text file for security
with open('fsquare.txt', 'r') as f:
    ID, SECRET = f.readlines()
ID = ID.rstrip()
SECRET = SECRET.rstrip()
VERSION = '20201101'

I want to see what the data looks like, so I request the Foursquare API and save the responses to JSON files.

In [7]:
import requests
import json

for city in cities_of_interest:
    radius = 10000 if city in ['Stockholm', 'Arsta', 'Solna'] else 20000
    query = f'https://api.foursquare.com/v2/venues/explore?client_id={ID}&client_secret={SECRET}&near={city},Sweden&radius={radius}&limit=200&v={VERSION}'
    results = requests.get(query).json()
    with open(f'{city}.json', 'w') as c:
        json.dump(results, c, indent=4)
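
The query string above is assembled by hand, so spaces in names such as `Upplands Vaesby` are left for the HTTP library to escape. A slightly more explicit alternative is to build the URL from a dictionary of parameters. The sketch below uses only the standard library; `build_explore_url` is a hypothetical helper, and `MY_ID` / `MY_SECRET` are placeholders rather than real credentials.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(city, client_id, client_secret,
                      version='20201101', limit=200):
    """Build the explore URL for one city with properly escaped parameters."""
    # Same radius rule as in the loop above: 10 km for the close cities
    radius = 10000 if city in ('Stockholm', 'Arsta', 'Solna') else 20000
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'near': f'{city},Sweden',
        'radius': radius,
        'limit': limit,
        'v': version,
    }
    return f'{BASE_URL}?{urlencode(params)}'


url = build_explore_url('Upplands Vaesby', 'MY_ID', 'MY_SECRET')
print(url)
```

With `requests`, passing the same dictionary as `requests.get(BASE_URL, params=params)` achieves the same encoding without building the string yourself.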


Counting venues of interest. For a city best suited to raising children I want lots of parks and, if possible, sports centers. Having Asian food and Thai food would be a big plus.

In [19]:
gyms = []
asian_foods = []
thai_foods = []
steaks = []
parks = []
bookstores = []

for city in cities_of_interest:
    with open(f'{city}.json', 'r') as c:
        data = json.load(c)
    data = data['response']['groups'][0]['items']
    gyms.append(sum(1 for d in data if '/gym' in d['venue']['categories'][0]['icon']['prefix']))
    parks.append(sum(1 for d in data if '/park' in d['venue']['categories'][0]['icon']['prefix']))
    # trying to combine all Asian food into 1 count based on json file exploration
    asian_foods.append(sum(1 for d in data 
        if '/food/asian' in d['venue']['categories'][0]['icon']['prefix'] 
        or '/food/sushi' in d['venue']['categories'][0]['icon']['prefix']
        or '/food/korean' in d['venue']['categories'][0]['icon']['prefix']
        or '/food/vietnam' in d['venue']['categories'][0]['icon']['prefix']
        or '/food/ramen' in d['venue']['categories'][0]['icon']['prefix']
    ))
    thai_foods.append(sum(1 for d in data if '/food/thai' in d['venue']['categories'][0]['icon']['prefix']))
    steaks.append(sum(1 for d in data if '/food/steak' in d['venue']['categories'][0]['icon']['prefix']))
    bookstores.append(sum(1 for d in data if 'bookstore' in d['venue']['categories'][0]['icon']['prefix']))
print(thai_foods)

[0, 1, 0, 2, 0]
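
The counting expressions above repeat the same lookup many times and would raise an `IndexError` for any venue with an empty `categories` list. A small helper can cover both points. This is a sketch only: `count_by_prefix` is a name introduced here, and the `sample` items with their icon prefixes are made up to mirror the structure the code above relies on, not taken from a real response.

```python
def count_by_prefix(items, *needles):
    """Count venues whose first category icon prefix contains any needle."""
    total = 0
    for item in items:
        categories = item['venue'].get('categories') or []
        if not categories:
            continue  # skip venues without a category
        prefix = categories[0]['icon']['prefix']
        if any(needle in prefix for needle in needles):
            total += 1
    return total


def _venue(prefix=None):
    """Build a made-up item in the shape used above (illustration only)."""
    cats = [{'icon': {'prefix': prefix}}] if prefix else []
    return {'venue': {'categories': cats}}


sample = [
    _venue('https://example.invalid/img/food/thai_'),
    _venue('https://example.invalid/img/food/sushi_'),
    _venue('https://example.invalid/img/food/ramen_'),
    _venue('https://example.invalid/img/parks_outdoors/park_'),
    _venue(),  # venue with no categories
]

print(count_by_prefix(sample, '/food/thai'))                 # 1
print(count_by_prefix(sample, '/food/sushi', '/food/ramen')) # 2
print(count_by_prefix(sample, '/park'))                      # 1
```

In the loop over cities, a line such as `thai_foods.append(count_by_prefix(data, '/food/thai'))` would then replace each generator expression.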
