# Capstone Project - The Battle of the Neighborhoods (Week 1)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The aim of this Capstone project is to compare two capital cities: Paris and Prague, my hometown. Their city centres are very similar and attractive to tourists. The project should help visitors decide which city to visit, see how many tourist attractions each one offers, and judge how long to stay. It can also help residents who want to move to a different neighborhood within their city, as well as people thinking about relocating to either city. The idea is to look up venues in the different neighborhoods, cluster the neighborhoods by their venues, and compare the results.

## Data <a name="data"></a>

First, let's import all the necessary libraries.

In [3]:
!pip install beautifulsoup4
!pip install lxml
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and HTML
from IPython.display import Image, HTML, display_html

# transforming JSON data into a pandas DataFrame
# (json_normalize is exposed at the top level in pandas 1.0 and later)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

!pip install opendatasets
import opendatasets as od

print('Folium installed')
print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Folium installed
Libraries imported.


### Prague Dataset <a name="introduction"></a>

We need geographical coordinates for the neighborhoods of both Paris and Prague.
For the Prague neighborhoods I created a CSV dataset, scraped from Wikipedia. The dataset is available here: https://www.kaggle.com/konecfil/prague-neighborhoods-dataset. The first column is the name of the neighborhood; the second and third are Lat and Lon, respectively. We will use these coordinates as the centroids of the Prague neighborhoods.



We download it with od.download("https://www.kaggle.com/konecfil/prague-neighborhoods-dataset"). The call asks for your Kaggle username and key. Both can be found in your Kaggle account (you have to create an account first, then click "your account"). The download creates a new directory containing the file.
The opendatasets package must be installed and imported first, with "!pip install opendatasets" and "import opendatasets as od".



In [11]:
od.download("https://www.kaggle.com/konecfil/prague-neighborhoods-dataset")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: konecfil
Your Kaggle Key: ········


100%|██████████| 955/955 [00:00<00:00, 210kB/s]

Downloading prague-neighborhoods-dataset.zip to .\prague-neighborhoods-dataset
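Once the archive is unpacked, the CSV can be read into a DataFrame. The sketch below is only one possible way to do it: the directory name comes from the download log above, while the helper `load_first_csv` is hypothetical and simply picks the first CSV file it finds, because the file name inside the directory is not shown here.

```python
from pathlib import Path

import pandas as pd


def load_first_csv(directory):
    """Read the first CSV file found in `directory` (hypothetical helper)."""
    csv_files = sorted(Path(directory).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found in {directory!r}")
    return pd.read_csv(csv_files[0])


# Assumed usage, with the directory created by od.download:
# prague_df = load_first_csv("prague-neighborhoods-dataset")
# prague_df.head()
```

Searching the directory avoids hard-coding a file name that may differ between dataset versions.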






### Paris Dataset <a name="introduction"></a>

The Paris dataset is available here: https://www.data.gouv.fr/fr/datasets/r/e88c6fda-1d09-42a0-a069-606d3259114e. The JSON file covers the whole of France, so we have to limit it to Paris only. The columns we use are:
* postal_code: postal codes for France
* nom_comm: names of the neighborhoods in France
* nom_dept: names of the boroughs
* geo_point_2d: pair containing the latitude and longitude of each neighborhood

### Foursquare API <a name="introduction"></a>

For the locations of venues we will use the Foursquare API. It provides information about the venues found within a given distance of a point of interest. We will search within a radius of 800 metres around each neighborhood centroid. The Foursquare API is the only source of venue data used in this project.
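As an illustration of what such a request can look like, here is a minimal sketch that only builds the request URL and does not send it. It assumes the legacy Foursquare v2 `venues/explore` endpoint; this section does not state which endpoint the project uses. The function name, the version date, the result limit, and the placeholder credentials are all assumptions, while the 800 metre radius comes from the text above.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=800, limit=100):
    """Build a request URL for the (assumed) legacy v2 venues/explore endpoint."""
    params = {
        "client_id": client_id,          # placeholder credentials
        "client_secret": client_secret,
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lng}",            # centroid of the neighborhood
        "radius": radius,                # search radius in metres
        "limit": limit,                  # maximum number of venues (assumed value)
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Example for one Paris centroid, using placeholder credentials
url = build_explore_url(48.876896, 2.337460, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL could then be passed to `requests.get(url).json()` once real credentials are filled in.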

### Data preprocessing <a name="introduction"></a>

We download the JSON file and save it as france-data.json. The file name is written without extra quotes so that the shell command and pd.read_json refer to the same file on every platform.

In [5]:
!wget -q -O france-data.json https://www.data.gouv.fr/fr/datasets/r/e88c6fda-1d09-42a0-a069-606d3259114e
print("Data Downloaded!")
paris_raw = pd.read_json("france-data.json")
paris_raw.head()

Data Downloaded!


| | datasetid | recordid | fields | geometry | record_timestamp |
|---|---|---|---|---|---|
| 0 | correspondances-code-insee-code-postal | 2bf36b38314b6c39dfbcd09225f97fa532b1fc45 | {'code_comm': '645', 'nom_dept': 'ESSONNE', 's... | {'type': 'Point', 'coordinates': [2.2517129721... | 2016-09-21T00:29:06.175+02:00 |
| 1 | correspondances-code-insee-code-postal | 7ee82e74e059b443df18bb79fc5a19b1f05e5a88 | {'code_comm': '133', 'nom_dept': 'SEINE-ET-MAR... | {'type': 'Point', 'coordinates': [3.0529405055... | 2016-09-21T00:29:06.175+02:00 |
| 2 | correspondances-code-insee-code-postal | e2cd3186f07286705ed482a10b6aebd9de633c81 | {'code_comm': '378', 'nom_dept': 'ESSONNE', 's... | {'type': 'Point', 'coordinates': [2.1971816504... | 2016-09-21T00:29:06.175+02:00 |
| 3 | correspondances-code-insee-code-postal | 868bf03527a1d0a9defe5cf4e6fa0a730d725699 | {'code_comm': '243', 'nom_dept': 'SEINE-ET-MAR... | {'type': 'Point', 'coordinates': [2.7097808131... | 2016-09-21T00:29:06.175+02:00 |
| 4 | correspondances-code-insee-code-postal | 1bbcee92101fdb50f5f5fceb052681f2421ff961 | {'code_comm': '414', 'nom_dept': 'SEINE-ET-MAR... | {'type': 'Point', 'coordinates': [3.2582355268... | 2016-09-21T00:29:06.175+02:00 |


In [6]:
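# Build the DataFrame from the "fields" dictionaries in one step
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop);
# sort_index keeps the columns in alphabetical order
paris_field_data = pd.DataFrame(paris_raw['fields'].tolist()).sort_index(axis=1)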
 
paris_field_data.head()

| | code_arr | code_cant | code_comm | code_dept | code_reg | geo_point_2d | geo_shape | id_geofla | insee_com | nom_comm | nom_dept | nom_region | population | postal_code | statut | superficie | z_moyen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 645 | 91 | 11 | [48.750443119964764, 2.251712972144151] | {'type': 'Polygon', 'coordinates': [[[2.238024... | 16275 | 91645 | VERRIERES-LE-BUISSON | ESSONNE | ILE-DE-FRANCE | 15.5 | 91370 | Commune simple | 999.0 | 121.0 |
| 1 | 3 | 20 | 133 | 77 | 11 | [48.41256065214989, 3.052940505560729] | {'type': 'Polygon', 'coordinates': [[[3.076046... | 31428 | 77133 | COURCELLES-EN-BASSEE | SEINE-ET-MARNE | ILE-DE-FRANCE | 0.2 | 77126 | Commune simple | 1082.0 | 88.0 |
| 2 | 1 | 9 | 378 | 91 | 11 | [48.52726809075556, 2.19718165044305] | {'type': 'Polygon', 'coordinates': [[[2.203466... | 30975 | 91378 | MAUCHAMPS | ESSONNE | ILE-DE-FRANCE | 0.3 | 91730 | Commune simple | 313.0 | 150.0 |
| 3 | 5 | 14 | 243 | 77 | 11 | [48.87307018579678, 2.7097808131278462] | {'type': 'Polygon', 'coordinates': [[[2.727542... | 17000 | 77243 | LAGNY-SUR-MARNE | SEINE-ET-MARNE | ILE-DE-FRANCE | 20.2 | 77400 | Chef-lieu canton | 579.0 | 71.0 |
| 4 | 3 | 25 | 414 | 77 | 11 | [48.62891464105825, 3.2582355268439223] | {'type': 'Polygon', 'coordinates': [[[3.294591... | 34949 | 77414 | SAINT-HILLIERS | SEINE-ET-MARNE | ILE-DE-FRANCE | 0.4 | 77160 | Commune simple | 1907.0 | 158.0 |


In [7]:
df_2 = paris_field_data[['postal_code','nom_comm','nom_dept','geo_point_2d']]

Then we filter the dataset to keep only the rows whose nom_dept contains PARIS.

In [8]:
df_paris = df_2[df_2['nom_dept'].str.contains('PARIS')].reset_index(drop=True)
df_paris.head()

| | postal_code | nom_comm | nom_dept | geo_point_2d |
|---|---|---|---|---|
| 0 | 75009 | PARIS-9E-ARRONDISSEMENT | PARIS | [48.87689616237872, 2.337460241388529] |
| 1 | 75002 | PARIS-2E-ARRONDISSEMENT | PARIS | [48.86790337886785, 2.344107166658533] |
| 2 | 75011 | PARIS-11E-ARRONDISSEMENT | PARIS | [48.85941549762748, 2.378741060237548] |
| 3 | 75008 | PARIS-8E-ARRONDISSEMENT | PARIS | [48.87252726662346, 2.312582560420059] |
| 4 | 75013 | PARIS-13E-ARRONDISSEMENT | PARIS | [48.82871768452136, 2.362468228516128] |


Now we divide geo_point_2d into lat and lng. 

In [9]:
paris_lat = df_paris['geo_point_2d'].apply(lambda x: x[0])

paris_lng = df_paris['geo_point_2d'].apply(lambda x: x[1])



In [10]:
paris_combined_data = pd.concat([df_paris.drop('geo_point_2d', axis=1), paris_lat, paris_lng], axis=1)
paris_combined_data

| | postal_code | nom_comm | nom_dept | geo_point_2d | geo_point_2d.1 |
|---|---|---|---|---|---|
| 0 | 75009 | PARIS-9E-ARRONDISSEMENT | PARIS | 48.876896 | 2.33746 |
| 1 | 75002 | PARIS-2E-ARRONDISSEMENT | PARIS | 48.867903 | 2.344107 |
| 2 | 75011 | PARIS-11E-ARRONDISSEMENT | PARIS | 48.859415 | 2.378741 |
| 3 | 75008 | PARIS-8E-ARRONDISSEMENT | PARIS | 48.872527 | 2.312583 |
| 4 | 75013 | PARIS-13E-ARRONDISSEMENT | PARIS | 48.828718 | 2.362468 |
| 5 | 75012 | PARIS-12E-ARRONDISSEMENT | PARIS | 48.835156 | 2.419807 |
| 6 | 75003 | PARIS-3E-ARRONDISSEMENT | PARIS | 48.863054 | 2.359361 |
| 7 | 75006 | PARIS-6E-ARRONDISSEMENT | PARIS | 48.848968 | 2.332671 |
| 8 | 75004 | PARIS-4E-ARRONDISSEMENT | PARIS | 48.854228 | 2.357362 |
| 9 | 75010 | PARIS-10E-ARRONDISSEMENT | PARIS | 48.876029 | 2.361113 |


Now we have 19 neighborhoods of Paris. 
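In the combined table, both coordinate columns keep the name geo_point_2d, which makes them awkward to select later. One possible variant, sketched here on two rows copied from the output above, expands the pairs directly into named columns. The column names Latitude and Longitude and the variable names `sample`, `coords`, and `named` are assumptions chosen for illustration, not names used by the notebook.

```python
import pandas as pd

# Two rows copied from the df_paris output above
sample = pd.DataFrame({
    "postal_code": ["75009", "75002"],
    "nom_comm": ["PARIS-9E-ARRONDISSEMENT", "PARIS-2E-ARRONDISSEMENT"],
    "nom_dept": ["PARIS", "PARIS"],
    "geo_point_2d": [[48.87689616237872, 2.337460241388529],
                     [48.86790337886785, 2.344107166658533]],
})

# Expand each [lat, lng] pair into two explicitly named columns
coords = pd.DataFrame(sample["geo_point_2d"].tolist(),
                      columns=["Latitude", "Longitude"],
                      index=sample.index)

# Replace the original pair column with the two named columns
named = pd.concat([sample.drop(columns="geo_point_2d"), coords], axis=1)
print(named)
```

With distinct names, later steps can refer to `named["Latitude"]` and `named["Longitude"]` without ambiguity.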


## Visualization <a name="introduction"></a>