<h1>Introduction/Business Problem</h1>
<p>Company A is a real estate agency that works for corporate clients.
One of its biggest clients (Company B) has offices in the following cities:
<ul>
    <li>Paris</li>
    <li>London</li>
    <li>Berlin</li>
    <li>New York</li>
    <li>Tokyo</li>
</ul>
Company B's employees rotate between offices, so they regularly need to find housing, and this house-hunting is Company A's job. People usually want to live in a district similar to the one they are leaving. For example, if I work in Tokyo today and my next assignment is in Berlin, I would like to find a place in Berlin similar to the one I have in Tokyo.
In this notebook, we will build a model that clusters the districts of the cities above to help Company A find the best places to live for its client's employees.</p>

In [40]:
#!conda install -c conda-forge geopy --yes 
import numpy as np
import pandas as pd
import json
from geopy.geocoders import Nominatim
import requests

<h1>Data</h1>
<p>Data for this project comes from the following sources:</p>
<ul>
    <li> Paris Data: https://www.data.gouv.fr/ </li>
    <li> Tokyo Data: https://en.wikipedia.org/wiki/Special_wards_of_Tokyo </li>
    <li> London Data: https://en.wikipedia.org/wiki/List_of_London_boroughs </li>
    <li> Berlin Data: https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin </li>
    <li> New York Data: https://en.wikipedia.org/wiki/Boroughs_of_New_York_City </li>
    
</ul>

<p>In this part of the notebook, we first get the borough data for each city together with its latitude and longitude. Next, we combine everything into final_df, which contains the following information:
    <ul>
        <li>City Name</li>
        <li>District Name</li>
        <li>Latitude</li>
        <li>Longitude</li>
    </ul>
    Then we will use the final_df dataframe to get data on nearby venues for each district from the Foursquare API.</p>
    <p>For Paris, I downloaded a ready-made JSON file; for the other cities, I scraped the data from Wikipedia pages and found the latitude and longitude using geopy.</p>

<h2>Paris Data</h2>

In [35]:
!wget -q -O 'paris_data.json' https://www.data.gouv.fr/en/datasets/r/871c68bf-c92b-42a9-8dcc-910f9be0b870

In [36]:
with open('paris_data.json') as json_data:
    paris_data = json.load(json_data)

In [37]:
paris_data[0]

{'datasetid': 'arrondissements',
 'recordid': 'ed26ad57eff28ccd16d66834c44723d942771958',
 'fields': {'n_sq_co': 750001537,
  'perimetre': 13678.7983149,
  'l_ar': '15ème Ardt',
  'surface': 8494994.08101075,
  'geom_x_y': [48.8400853759, 2.29282582242],
  'geom': {'type': 'Polygon',
   'coordinates': [[[2.299322310264648, 48.852174427333274],
     [2.300883913456113, 48.851176131084145],
     [2.300964661201576, 48.85123012651538],
     [2.303746047913422, 48.84943929154836],
     [2.306149310004038, 48.84789994583024],
     [2.307339709005096, 48.847139376795404],
     [2.3080419591723302, 48.84739225374763],
     [2.308219888290114, 48.847434925901986],
     [2.310378251508753, 48.84795218786006],
     [2.310525952151246, 48.84798758363206],
     [2.311155377296785, 48.84757938949668],
     [2.311260835819493, 48.847510996658556],
     [2.311486631207579, 48.847368855176626],
     [2.311625847644086, 48.847281216555366],
     [2.311767709159531, 48.84719033639172],
     [2.312091028

In [38]:
column_names = ['City', 'District', 'Latitude', 'Longitude']
# Collect one row per arrondissement, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0).
paris_rows = []
for data in paris_data:
    district_name = data['fields']['l_aroff']
    # coordinates are ordered [longitude, latitude]
    latlon = data['geometry']['coordinates']
    lon, lat = latlon[0], latlon[1]
    paris_rows.append({'City': 'Paris', 'District': district_name, 'Latitude': lat, 'Longitude': lon})
paris_district = pd.DataFrame(paris_rows, columns=column_names)
paris_district

Unnamed: 0,City,District,Latitude,Longitude
0,Paris,Vaugirard,48.840085,2.292826
1,Paris,Opéra,48.877164,2.337458
2,Paris,Buttes-Montmartre,48.892569,2.348161
3,Paris,Luxembourg,48.84913,2.332898
4,Paris,Reuilly,48.834974,2.421325
5,Paris,Batignolles-Monceau,48.887327,2.306777
6,Paris,Ménilmontant,48.863461,2.401188
7,Paris,Louvre,48.862563,2.336443
8,Paris,Bourse,48.868279,2.342803
9,Paris,Buttes-Chaumont,48.887076,2.384821
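<p>The sample record printed earlier also carries a <code>geom_x_y</code> field. Its values match the Vaugirard row in the table above, which suggests (as an assumption, not something the dataset documentation is quoted on here) that it stores the centroid as [latitude, longitude]. A minimal sketch of reading the coordinates from that field instead; <code>centroid_from_record</code> is a hypothetical helper name:</p>

```python
def centroid_from_record(record):
    """Return (latitude, longitude) read from a record's 'geom_x_y' field.

    Assumes 'geom_x_y' is ordered [latitude, longitude], as the sample
    record in this notebook suggests.
    """
    lat, lon = record['fields']['geom_x_y']
    return lat, lon


# Values copied from the sample record shown in the notebook output
sample = {'fields': {'l_ar': '15ème Ardt',
                     'geom_x_y': [48.8400853759, 2.29282582242]}}
print(centroid_from_record(sample))
```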


<h2>Tokyo Data</h2>

In [3]:
tokyo_data=pd.read_html('https://en.wikipedia.org/wiki/Special_wards_of_Tokyo')
tokyo_data[3]['Name']

0        Chiyoda
1           Chūō
2         Minato
3       Shinjuku
4         Bunkyō
5          Taitō
6         Sumida
7           Kōtō
8      Shinagawa
9         Meguro
10           Ōta
11      Setagaya
12       Shibuya
13        Nakano
14      Suginami
15       Toshima
16          Kita
17       Arakawa
18      Itabashi
19        Nerima
20        Adachi
21    Katsushika
22       Edogawa
23       Overall
Name: Name, dtype: object

In [6]:
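geolocator = Nominatim(user_agent="lat_lon_data")

# Look up the latitude and longitude for every name in name_list.
# Note: only the bare district name is sent to the geocoder, so
# ambiguous names can resolve to a place in another city or country.
def get_lon_lat(name_list, city_name, column_names):
    rows = []
    for singleName in name_list:
        location = geolocator.geocode(singleName)
        if location is None:  # the geocoder found no match; skip this name
            continue
        rows.append({'City': city_name, 'District': singleName,
                     'Latitude': location.latitude, 'Longitude': location.longitude})
    # Build the DataFrame once (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows, columns=column_names)

column_names = ['City', 'District', 'Latitude', 'Longitude']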

In [7]:
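# "Overall" is the table's summary row, not a ward, so exclude it
tokyo_wards = tokyo_data[3].loc[tokyo_data[3]['Name'] != 'Overall', 'Name']
tokyo_district=get_lon_lat(tokyo_wards, 'Tokyo', column_names)
tokyo_district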

Unnamed: 0,City,District,Latitude,Longitude
0,Tokyo,Chiyoda,35.69381,139.753216
1,Tokyo,Chūō,35.666255,139.775565
2,Tokyo,Minato,35.643227,139.740055
3,Tokyo,Shinjuku,35.693763,139.703632
4,Tokyo,Bunkyō,35.71881,139.744732
5,Tokyo,Taitō,35.71745,139.790859
6,Tokyo,Sumida,35.700429,139.805017
7,Tokyo,Kōtō,35.649154,139.81279
8,Tokyo,Shinagawa,35.599252,139.73891
9,Tokyo,Meguro,35.62125,139.688014


<h2>London Data</h2>

In [15]:
# Strip bracketed footnote markers from a borough name
def cleanData(x):
    if str(x).find("[")>-1:
        return x[:str(x).find("[")].strip()
    else:
        return x

london_data=pd.read_html('https://en.wikipedia.org/wiki/List_of_London_boroughs')
# Series.map applies cleanData to each borough name
london_data[0]['Borough']=london_data[0]['Borough'].map(cleanData)
london_data[0]['Borough']

0       Barking and Dagenham
1                     Barnet
2                     Bexley
3                      Brent
4                    Bromley
5                     Camden
6                    Croydon
7                     Ealing
8                    Enfield
9                  Greenwich
10                   Hackney
11    Hammersmith and Fulham
12                  Haringey
13                    Harrow
14                  Havering
15                Hillingdon
16                  Hounslow
17                 Islington
18    Kensington and Chelsea
19      Kingston upon Thames
20                   Lambeth
21                  Lewisham
22                    Merton
23                    Newham
24                 Redbridge
25      Richmond upon Thames
26                 Southwark
27                    Sutton
28             Tower Hamlets
29            Waltham Forest
30                Wandsworth
31               Westminster
Name: Borough, dtype: object

In [16]:
london_district=get_lon_lat(london_data[0]['Borough'], 'London', column_names)
london_district

Unnamed: 0,City,District,Latitude,Longitude
0,London,Barking and Dagenham,51.554117,0.150504
1,London,Barnet,51.65309,-0.200226
2,London,Bexley,39.969238,-82.936864
3,London,Brent,32.937346,-87.164718
4,London,Bromley,51.402805,0.014814
5,London,Camden,39.94484,-75.119891
6,London,Croydon,51.371305,-0.101957
7,London,Ealing,51.512655,-0.305195
8,London,Enfield,51.652085,-0.081018
9,London,Greenwich,51.482084,-0.004542


<h2>Berlin Data</h2>

In [17]:
berlin_data=pd.read_html('https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin')
berlin_data[0]

Unnamed: 0,Borough,Population 31 March 2010,Area in km²,Density per km²,Map
0,Charlottenburg-Wilmersdorf,319628,64.72,4878,
1,Friedrichshain-Kreuzberg,268225,20.16,13187,
2,Lichtenberg,259881,52.29,4952,
3,Marzahn-Hellersdorf,248264,61.74,4046,
4,Mitte,332919,39.47,8272,
5,Neukölln,310283,44.93,6804,
6,Pankow,366441,103.01,3476,
7,Reinickendorf,240454,89.46,2712,
8,Spandau,223962,91.91,2441,
9,Steglitz-Zehlendorf,293989,102.5,2818,


In [18]:
berlin_district=get_lon_lat(berlin_data[0]['Borough'], 'Berlin', column_names)
berlin_district

Unnamed: 0,City,District,Latitude,Longitude
0,Berlin,Charlottenburg-Wilmersdorf,52.507856,13.263952
1,Berlin,Friedrichshain-Kreuzberg,52.515306,13.461612
2,Berlin,Lichtenberg,48.921296,7.481227
3,Berlin,Marzahn-Hellersdorf,52.522523,13.587663
4,Berlin,Mitte,52.51769,13.402376
5,Berlin,Neukölln,52.48115,13.43535
6,Berlin,Pankow,52.597637,13.436374
7,Berlin,Reinickendorf,52.604763,13.295287
8,Berlin,Spandau,52.535788,13.197792
9,Berlin,Steglitz-Zehlendorf,52.429205,13.229974


<h2>New York Data</h2>

In [33]:
new_york_data=pd.read_html('https://en.wikipedia.org/wiki/Boroughs_of_New_York_City')
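# Keep the five borough rows; .copy() avoids modifying a slice of the scraped table
ny_data=new_york_data[0][0:5].copy()
# Only 'Borough' is used below; the other columns get placeholder names
ny_data.columns=['Borough', 'Val1', 'Val1', 'Val1', 'Val1', 'Val1', 'Val1', 'Val1', 'Val1']
ny_data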

Unnamed: 0,Borough,Val1,Val1.1,Val1.2,Val1.3,Val1.4,Val1.5,Val1.6,Val1.7
0,The Bronx,Bronx,1418207,42.695,30100,42.1,109.04,33867,13006
1,Brooklyn,Kings,2559903,91.559,35800,70.82,183.42,36147,13957
2,Manhattan,New York,1628706,600.244,368500,22.83,59.13,71341,27544
3,Queens,Queens,2253858,93.31,41400,108.53,281.09,20767,8018
4,Staten Island,Richmond,476143,14.514,30500,58.37,151.18,8157,3150


In [34]:
ny_district=get_lon_lat(ny_data['Borough'], 'NY', column_names)
ny_district

Unnamed: 0,City,District,Latitude,Longitude
0,NY,The Bronx,40.846651,-73.878594
1,NY,Brooklyn,40.650104,-73.949582
2,NY,Manhattan,40.789624,-73.959894
3,NY,Queens,40.749824,-73.797634
4,NY,Staten Island,40.583456,-74.149605


<h2>Gather Data Into One DataFrame (final_df)</h2>

In [81]:
frames=[paris_district, tokyo_district, london_district, berlin_district, ny_district]
final_df=pd.concat(frames)
final_df.reset_index(drop=True, inplace=True)
final_df

Unnamed: 0,City,District,Latitude,Longitude
0,Paris,Vaugirard,48.840085,2.292826
1,Paris,Opéra,48.877164,2.337458
2,Paris,Buttes-Montmartre,48.892569,2.348161
3,Paris,Luxembourg,48.849130,2.332898
4,Paris,Reuilly,48.834974,2.421325
5,Paris,Batignolles-Monceau,48.887327,2.306777
6,Paris,Ménilmontant,48.863461,2.401188
7,Paris,Louvre,48.862563,2.336443
8,Paris,Bourse,48.868279,2.342803
9,Paris,Buttes-Chaumont,48.887076,2.384821
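<p>Because every later step depends on these coordinates, a quick sanity check on the combined data may be useful. The sketch below flags districts that lie far from the median coordinate of their own city. The function names and the 100 km threshold are assumptions chosen for illustration, and the check only works when most of a city's districts were geocoded correctly.</p>

```python
import math
from statistics import median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def flag_outliers(rows, max_km=100.0):
    """Return the rows lying more than max_km from their city's median coordinate."""
    by_city = {}
    for row in rows:
        by_city.setdefault(row['City'], []).append(row)
    flagged = []
    for city_rows in by_city.values():
        mlat = median(r['Latitude'] for r in city_rows)
        mlon = median(r['Longitude'] for r in city_rows)
        flagged += [r for r in city_rows
                    if haversine_km(r['Latitude'], r['Longitude'], mlat, mlon) > max_km]
    return flagged
```

<p>With the dataframe built above, this could be called as <code>flag_outliers(final_df.to_dict('records'))</code>.</p>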


<h2>Get Nearby Venues From Foursquare</h2>

In [42]:
#Foursquare API:
CLIENT_ID = '' 
CLIENT_SECRET = ''
VERSION = '20180605' 
LIMIT = 100

In [77]:
def getNearbyPlaces(district_data, radius=1000):
    places_list=[]
    for index, singleRow in district_data.iterrows():
        try:
            # Pass the query as params so requests builds and encodes the URL
            params = {
                'client_id': CLIENT_ID,
                'client_secret': CLIENT_SECRET,
                'v': VERSION,
                'll': '{},{}'.format(float(singleRow['Latitude']), float(singleRow['Longitude'])),
                'radius': radius,
                'limit': LIMIT}
            response = requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
            results = response.json()["response"]['groups'][0]['items']
            places_list.append([(
                singleRow['City'],
                singleRow['District'],
                singleRow['Latitude'],
                singleRow['Longitude'],
                p['venue']['name'],
                p['venue']['location']['lat'],
                p['venue']['location']['lng'],
                p['venue']['categories'][0]['name']) for p in results])
        except (requests.RequestException, ValueError, KeyError, IndexError) as err:
            # Report the district that failed instead of silently skipping it
            print('Skipping {}: {!r}'.format(singleRow['District'], err))
    # Name the columns at construction so an empty result still has the expected schema
    nearby_places = pd.DataFrame(
        [item for place_list in places_list for item in place_list],
        columns=['City',
                 'District',
                 'District Latitude',
                 'District Longitude',
                 'Place',
                 'Place Latitude',
                 'Place Longitude',
                 'Place Category'])

    return nearby_places
    
            

In [78]:
near_places=getNearbyPlaces(final_df)

In [79]:
near_places

Unnamed: 0,City,District,District Latitude,District Longitude,Place,Place Latitude,Place Longitude,Place Category
0,Paris,Vaugirard,48.840085,2.292826,Indian Villa,48.841116,2.291621,Indian Restaurant
1,Paris,Vaugirard,48.840085,2.292826,Le Grand Venise,48.838276,2.294484,Italian Restaurant
2,Paris,Vaugirard,48.840085,2.292826,La Table Libanaise,48.841766,2.288607,Lebanese Restaurant
3,Paris,Vaugirard,48.840085,2.292826,Square Saint-Lambert,48.842343,2.297108,Park
4,Paris,Vaugirard,48.840085,2.292826,AlKaram,48.838379,2.297156,Lebanese Restaurant
5,Paris,Vaugirard,48.840085,2.292826,Amorino,48.844064,2.293377,Ice Cream Shop
6,Paris,Vaugirard,48.840085,2.292826,Afaria,48.836049,2.291783,Basque Restaurant
7,Paris,Vaugirard,48.840085,2.292826,Square Violet,48.844010,2.289383,Park
8,Paris,Vaugirard,48.840085,2.292826,CrossFit Lutèce,48.840888,2.292199,Gym
9,Paris,Vaugirard,48.840085,2.292826,The 3 Ducks Bar,48.843487,2.292473,Hotel Bar
