# This notebook covers the data extraction and manipulation for the capstone

In [2]:
print('importation: begin!')
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import re

import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium # -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

importation: begin!
Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Libraries imported.


**The cell below extracts data about Italy's provinces from a Wikipedia page. Since the English page lists some cities under their English names, we use the equivalent Italian Wikipedia page, whose names approximately match those in the infection spreadsheet.**

In [3]:
url = 'https://it.wikipedia.org/wiki/Province_d%27Italia'

df = pd.read_html(url)

**Since that page contains more than one table, `pd.read_html` returns a list of DataFrames, and we need to look through them to find the one we're interested in.**

It turns out we care about the first one, at position 0 in the list of DataFrames.

In [4]:
# There are 11 elements in df, which means 11 different dataframes! We care about the first one, position number 0
for i in range(len(df)):
    print('new dataframe\n')
    df_new = df[i]
    print(df_new.head(2))

new dataframe

                     Provincia Sigla   Regione Popolazione(ab.)  \
0  Agrigento (lib. cons. com.)    AG   Sicilia          434 870   
1                  Alessandria    AL  Piemonte          421 284   

  Superficie(km²) Densità(ab./km²) Comuni(N°)          Presidente  \
0        3 052,59              142         43  Roberto Barberi[6]   
1        3 558,83              118        187    Gianfranco Baldi   

                        Partito  
0     Commissario straordinario  
1  Indipendente (Centro-destra)  
new dataframe

   Pos.    Provincia Sigla              Regione  Superficie (km²)
0     1      Sassari    SS             Sardegna           7692090
1     2  Bolzano[21]    BZ  Trentino-Alto Adige           7398381
new dataframe

  V · D · M Suddivisioni dell'Italia  \
0                Regioni statistiche   
1                            Regioni   

                V · D · M Suddivisioni dell'Italia.1  
0  RegioniProvince, province autonome, città metr...  
1  Province, p
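Rather than eyeballing every table, the one we want can also be picked by its column names. The following is a minimal sketch, assuming the province table is the only one that carries both a `Provincia` and a `Popolazione(ab.)` column. The helper name `pick_table` and the toy tables are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError('no table contains columns {}'.format(sorted(required)))

# toy stand-ins for the list returned by pd.read_html(url)
tables = [
    pd.DataFrame({'Pos.': [1], 'Provincia': ['Sassari'], 'Sigla': ['SS']}),
    pd.DataFrame({'Provincia': ['Alessandria'], 'Popolazione(ab.)': ['421 284']}),
]
province_table = pick_table(tables, ['Provincia', 'Popolazione(ab.)'])
print(province_table.columns.tolist())
```

Selecting by column names keeps working if Wikipedia reorders the tables on the page, which a hard-coded position would not.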

In [5]:
# we extract the relevant table by accessing index 0 of the list
df1 = df[0]
df1.head()

Unnamed: 0,Provincia,Sigla,Regione,Popolazione(ab.),Superficie(km²),Densità(ab./km²),Comuni(N°),Presidente,Partito
0,Agrigento (lib. cons. com.),AG,Sicilia,434 870,"3 052,59",142,43,Roberto Barberi[6],Commissario straordinario
1,Alessandria,AL,Piemonte,421 284,"3 558,83",118,187,Gianfranco Baldi,Indipendente (Centro-destra)
2,Ancona,AN,Marche,471 228,"1 936,22",240,47,Luigi Cerioni,Indipendente (Centro-sinistra)
3,Aosta[7] (reg.autonoma),AO,Valle d'Aosta,125 666,"3 260,9",39,74,/,/
4,Arezzo,AR,Toscana,342 654,"3 233,08",106,36,Silvia Chiassai,Indipendente (Centro-destra)


**We can drop a few columns, since we're not really interested in many of them**

In [6]:
df2 = df1.drop(axis= 1, columns = ['Sigla', 'Regione', 'Superficie(km²)', 'Comuni(N°)', 'Partito', 'Presidente'])

**We then rename some columns to better fit our needs, using the English equivalents**

In [7]:
df2.rename(columns = {'Provincia':'Province','Popolazione(ab.)':'Population', 'Densità(ab./km²)':'Density'}, inplace = True)

In [8]:
df2.head(10)

Unnamed: 0,Province,Population,Density
0,Agrigento (lib. cons. com.),434 870,142
1,Alessandria,421 284,118
2,Ancona,471 228,240
3,Aosta[7] (reg.autonoma),125 666,39
4,Arezzo,342 654,106
5,Ascoli Piceno,207 179,169
6,Asti,214 638,142
7,Avellino,418 306,149
8,Bari,1 248 489,326
9,Barletta-Andria-Trani,390 011,253


**As we can see, this table needs some cleaning. Let's start with the names: many of them carry parenthetical notes and footnote markers in square brackets that we need to strip out**

In [9]:
regespr = re.compile(r' \(')
regespr1 = re.compile(r'\[')

new_list = list()
for i in df2['Province']:
    # cut the name at the first ' (' if there is one (search returns None when nothing matches)
    match = regespr.search(i)
    new_i = i[:match.start()] if match else i
    # then cut at the first '[' (footnote marker) if there is one
    match1 = regespr1.search(new_i)
    new_i1 = new_i[:match1.start()] if match1 else new_i
    lower_item = new_i1.lower()
    new_list.append(lower_item)
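The same cleaning can be expressed with pandas string methods instead of an explicit loop. This is only a sketch on a few sample names taken from the tables above; unlike the loop, the regular expression here does not require a space before the opening parenthesis, which is an assumption that this makes no difference for the province names.

```python
import pandas as pd

provinces = pd.Series(['Agrigento (lib. cons. com.)', 'Aosta[7] (reg.autonoma)',
                       'Bolzano[21]', 'Arezzo'])

cleaned = (
    provinces
    .str.replace(r'\s*[\(\[].*$', '', regex=True)  # drop everything from the first '(' or '['
    .str.strip()
    .str.lower()
)
print(cleaned.tolist())
```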

**There are still some unique records that don't match. We will replace them manually**

In [11]:
print(new_list)
for count,i in enumerate(new_list):
    if i == 'forlì-cesena':
        new_list[count] = 'forlì cesena'
    if i == 'massa-carrara':
        new_list[count] = 'massa carrara'
    if i == 'monza e brianza':
        new_list[count] = 'monza brianza'
    if i == 'pesaro e urbino':
        new_list[count] = 'pesaro'
print(new_list)

df2['Province adjusted'] = np.array(new_list) # adds column with new list to dataframe
df2.head()

['agrigento', 'alessandria', 'ancona', 'aosta', 'arezzo', 'ascoli piceno', 'asti', 'avellino', 'bari', 'barletta-andria-trani', 'belluno', 'benevento', 'bergamo', 'biella', 'bologna', 'bolzano', 'brescia', 'brindisi', 'cagliari', 'caltanissetta', 'campobasso', 'caserta', 'catania', 'catanzaro', 'chieti', 'como', 'cosenza', 'cremona', 'crotone', 'cuneo', 'enna', 'fermo', 'ferrara', 'firenze', 'foggia', 'forlì cesena', 'frosinone', 'genova', 'gorizia', 'grosseto', 'imperia', 'isernia', "l'aquila", 'la spezia', 'latina', 'lecce', 'lecco', 'livorno', 'lodi', 'lucca', 'macerata', 'mantova', 'massa carrara', 'matera', 'messina', 'milano', 'modena', 'monza brianza', 'napoli', 'novara', 'nuoro', 'oristano', 'palermo', 'padova', 'parma', 'pavia', 'perugia', 'pesaro', 'pescara', 'piacenza', 'pisa', 'pistoia', 'pordenone', 'potenza', 'prato', 'reggio calabria', 'ragusa', 'ravenna', 'reggio emilia', 'rieti', 'rimini', 'roma', 'rovigo', 'salerno', 'sassari', 'savona', 'siena', 'siracusa', 'sondrio'

Unnamed: 0,Province,Population,Density,Province adjusted
0,Agrigento (lib. cons. com.),434 870,142,agrigento
1,Alessandria,421 284,118,alessandria
2,Ancona,471 228,240,ancona
3,Aosta[7] (reg.autonoma),125 666,39,aosta
4,Arezzo,342 654,106,arezzo
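The manual replacements can also be kept in a single dictionary, which makes the list of exceptions easier to read and extend. A small sketch using the four name pairs from the cell above; the variable names are illustrative only.

```python
import pandas as pd

# spellings produced by the Wikipedia cleaning -> spellings used in the infection data
name_fixes = {
    'forlì-cesena': 'forlì cesena',
    'massa-carrara': 'massa carrara',
    'monza e brianza': 'monza brianza',
    'pesaro e urbino': 'pesaro',
}

names = pd.Series(['bari', 'forlì-cesena', 'pesaro e urbino'])
fixed = names.replace(name_fixes)  # replaces whole values; other names pass through unchanged
print(fixed.tolist())
```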


**We will then drop the first province column and then rename the new one to the old one**

In [12]:
df3 = df2.drop(axis= 1, columns = ['Province'])
df3.rename(columns = {'Province adjusted':'Province'}, inplace = True)
df3.head(150)

Unnamed: 0,Population,Density,Province
0,434 870,142,agrigento
1,421 284,118,alessandria
2,471 228,240,ancona
3,125 666,39,aosta
4,342 654,106,arezzo
5,207 179,169,ascoli piceno
6,214 638,142,asti
7,418 306,149,avellino
8,1 248 489,326,bari
9,390 011,253,barletta-andria-trani


**Now we need to remove the non-breaking spaces (`\xa0`) used as thousands separators in the Population column, so the values can be converted to integers**

In [13]:
#regespr2 = re.compile(r' ')

new_list1 = list()
for i in df3['Population']:
    new_list1.append(i)
print(new_list1)
new_list2 = list()
for i in new_list1:
    #print(i)
    #print(len(i))
    if len(i) > 3:
        new_i = i[:-4]+i[-3:]
    else:
        new_i = i
    print(new_i)
    if len(new_i) > 6:
        new_i = new_i[:-7]+new_i[-6:]
    else:
        new_i = new_i
    new_i = int(new_i)
    new_list2.append(new_i)
print(new_list2)

['434\xa0870', '421\xa0284', '471\xa0228', '125\xa0666', '342\xa0654', '207\xa0179', '214\xa0638', '418\xa0306', '1\xa0248\xa0489', '390\xa0011', '202\xa0950', '277\xa0018', '1\xa0114\xa0590', '177\xa0585', '1\xa0017\xa0196', '531\xa0178', '1\xa0265\xa0954', '392\xa0975', '430\xa0372', '262\xa0458', '221\xa0238', '922\xa0965', '1\xa0103\xa0917', '358\xa0316', '385\xa0588', '599\xa0204', '705\xa0753', '358\xa0955', '174\xa0980', '587\xa0098', '164\xa0788', '173\xa0800', '345\xa0691', '1\xa0012\xa0407', '622\xa0183', '394\xa0627', '489\xa0083', '837\xa0427', '139\xa0403', '221\xa0629', '213\xa0840', '84\xa0379', '299\xa0031', '219\xa0556', '575\xa0254', '795\xa0134', '337\xa0380', '334\xa0832', '230\xa0198', '387\xa0876', '314\xa0178', '412\xa0292', '194\xa0878', '197\xa0909', '622\xa0962', '3\xa0261\xa0873', '705\xa0393', '873\xa0935', '3\xa0072\xa0996', '369\xa0018', '208\xa0550', '157\xa0707', '1\xa0245\xa0826', '937\xa0908', '451\xa0631', '545\xa0888', '656\xa0382', '358\xa0886', '31
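As the printed list shows, the separators are non-breaking spaces (`\xa0`). A shorter way to clean such values is to strip all whitespace with a regular expression and cast the result to integers. This sketch runs on three sample values from the list above and relies on Python's `\s` matching the non-breaking space in text patterns.

```python
import pandas as pd

population = pd.Series(['434\xa0870', '1\xa0248\xa0489', '84\xa0379'])

# remove every whitespace character (including \xa0), then convert to int
population_int = population.str.replace(r'\s+', '', regex=True).astype(int)
print(population_int.tolist())
```

This does not depend on the length of the string, so it handles values with any number of separators.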

In [14]:
df3['Population adjusted'] = np.array(new_list2) # adds column with new list to dataframe
df4 = df3.drop(axis= 1, columns = ['Population'])
df4.rename(columns = {'Population adjusted':'Population'}, inplace = True)
df4.head(150)

Unnamed: 0,Density,Province,Population
0,142,agrigento,434870
1,118,alessandria,421284
2,240,ancona,471228
3,39,aosta,125666
4,106,arezzo,342654
5,169,ascoli piceno,207179
6,142,asti,214638
7,149,avellino,418306
8,326,bari,1248489
9,253,barletta-andria-trani,390011


### Now we wrangle the second data source. It is a PDF file (http://www.salute.gov.it/imgs/C_17_notizie_4702_1_file.pdf), and there are a couple of Python packages for extracting tables from PDFs: tabula and camelot. Unfortunately, neither works in this Jupyter environment: both install with the !pip command, but each throws a different traceback when trying to parse the file. So I had to download the file, convert it to CSV, and upload it to my GitHub page. Not ideal, but I currently have no other way around the problem.

In [15]:
csv_file = 'https://github.com/EmanueleLanzani/Coursera_Capstone/blob/master/infected_situation_7_may.csv'
inf_df = pd.read_html(csv_file)
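`pd.read_html` is used here on the GitHub page that displays the CSV, presumably because that page renders the file as an HTML table. An alternative would be to read the raw file with `pd.read_csv`. The raw URL in the comment below follows GitHub's usual `raw.githubusercontent.com` pattern and is an assumption that was not checked against the repository, so the sketch parses a small in-memory sample instead.

```python
import io
import pandas as pd

# Assumed raw URL (standard GitHub pattern, not verified against the repository):
# raw_url = 'https://raw.githubusercontent.com/EmanueleLanzani/Coursera_Capstone/master/infected_situation_7_may.csv'
# inf_df_raw = pd.read_csv(raw_url)

# the same call on an in-memory sample with the two columns shown in this notebook
sample_csv = io.StringIO('Province,Infected\nagrigento,135\nalessandria,3654\n')
inf_sample = pd.read_csv(sample_csv)
print(inf_sample)
```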

In [16]:
inf_df1 = inf_df[0]

In [17]:
inf_df1.head(8)

Unnamed: 0.1,Unnamed: 0,Province,Infected
0,,agrigento,135
1,,alessandria,3654
2,,ancona,1822
3,,aosta,1150
4,,arezzo,655
5,,ascoli piceno,286
6,,asti,1655
7,,avellino,474


In [18]:
#The first column is clearly a parsing error, so we proceed to drop it:

inf_df2 = inf_df1.drop(axis= 1, columns = ['Unnamed: 0'])
inf_df2.head(150)

Unnamed: 0,Province,Infected
0,agrigento,135
1,alessandria,3654
2,ancona,1822
3,aosta,1150
4,arezzo,655
5,ascoli piceno,286
6,asti,1655
7,avellino,474
8,bari,1362
9,bat,380


In [20]:
# we need to change a specific record which appears under its acronym in the dataset
inf_df3 = inf_df2.replace('bat','barletta-andria-trani')
inf_df3.head(40)

Unnamed: 0,Province,Infected
0,agrigento,135
1,alessandria,3654
2,ancona,1822
3,aosta,1150
4,arezzo,655
5,ascoli piceno,286
6,asti,1655
7,avellino,474
8,bari,1362
9,barletta-andria-trani,380


**We can now merge together the two datasets based on the province column**

In [22]:
df5 = pd.merge(inf_df3, df4, on='Province', how='inner')
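An inner merge silently drops every province whose name does not match exactly in both tables. One way to see which names fail to match is an outer merge with `indicator=True`. The two small frames below are toy data built from values shown earlier, with `bat` deliberately left unreplaced to show the effect.

```python
import pandas as pd

infected = pd.DataFrame({'Province': ['bari', 'bat', 'asti'],
                         'Infected': [1362, 380, 1655]})
provinces = pd.DataFrame({'Province': ['bari', 'barletta-andria-trani', 'asti'],
                          'Population': [1248489, 390011, 214638]})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
check = pd.merge(infected, provinces, on='Province', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['Province', '_merge']]
print(unmatched)
```

Any row listed in `unmatched` is a name that still needs a manual fix before the inner merge.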

In [23]:
df5.head(15)

Unnamed: 0,Province,Infected,Density,Population
0,agrigento,135,142,434870
1,alessandria,3654,118,421284
2,ancona,1822,240,471228
3,aosta,1150,39,125666
4,arezzo,655,106,342654
5,ascoli piceno,286,169,207179
6,asti,1655,142,214638
7,avellino,474,149,418306
8,bari,1362,326,1248489
9,barletta-andria-trani,380,253,390011


### Let's install the geocoder package! Last time it worked, while during a previous exercise it didn't. Fingers crossed!

In [24]:
!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [25]:
from geopy.geocoders import Nominatim 
import geocoder

In [26]:
# define a function to get coordinates
def get_latlng(province):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Italy'.format(province))
        lat_lng_coords = g.latlng
    return lat_lng_coords
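The loop above retries forever, so a province the geocoder can never resolve would hang the notebook. A bounded variant is sketched below. The function name `get_latlng_bounded`, its parameters, and the fake geocoder are hypothetical; the lookup is passed in as a function so the retry logic can be tried without network access.

```python
import time

def get_latlng_bounded(province, geocode, max_attempts=5, pause=1.0):
    """Call geocode(query) up to max_attempts times; return None if it never succeeds."""
    query = '{}, Italy'.format(province)
    for attempt in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            time.sleep(pause)  # short pause before the next attempt
    return None

# usage with a fake geocoder that fails twice before answering
calls = {'n': 0}

def flaky_geocode(query):
    calls['n'] += 1
    return [41.9, 12.5] if calls['n'] >= 3 else None

result = get_latlng_bounded('roma', flaky_geocode, pause=0)
print(result)
```

With the real package, the lookup could be passed as `lambda q: geocoder.arcgis(q).latlng`, the same call used in the cell above. Provinces that come back as `None` would then need to be handled before building the coordinates dataframe.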

The cell below may take a while to run, since the geocoder package is quite unreliable and each lookup is retried until it succeeds. We print a message at the start and at the end, so we can tell when the process has finished.

In [27]:
print('begin geocoding!')
coords = [ get_latlng(province) for province in df5["Province"].tolist() ]
print('geocoding finished!')

begin geocoding!
geocoding finished!


In [28]:
coords

[[37.31087000000008, 13.576500000000067],
 [44.90724000000006, 8.611560000000054],
 [43.618490000000065, 13.508980000000065],
 [45.73751000000004, 7.320720000000051],
 [43.46354000000008, 11.877650000000074],
 [42.853980000000035, 13.584410000000048],
 [44.90443000000005, 8.199940000000026],
 [40.91217000000006, 14.792880000000025],
 [41.12587000000008, 16.866660000000024],
 [41.17293777700007, 16.171158924000054],
 [46.14098000000007, 12.212750000000028],
 [41.129950000000065, 14.785520000000076],
 [45.69523000000004, 9.66951000000006],
 [45.56041000000005, 8.059780000000046],
 [44.50484000000006, 11.345070000000021],
 [46.49528000000004, 11.353460000000041],
 [45.53689000000003, 10.232000000000028],
 [40.634700000000066, 17.94025000000005],
 [39.214540000000056, 9.110490000000027],
 [37.49004000000008, 14.063220000000058],
 [41.55913000000004, 14.656990000000064],
 [41.07014000000004, 14.331610000000069],
 [37.511360000000025, 15.067520000000059],
 [38.91444000000007, 16.584360000000

In [28]:
len(coords)

108

In [29]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df5['Latitude'] = df_coords['Latitude']
df5['Longitude'] = df_coords['Longitude']
df5.head()

Unnamed: 0,Province,Infected,Density,Population,Latitude,Longitude
0,agrigento,135,142,434870,37.31087,13.5765
1,alessandria,3654,118,421284,44.90724,8.61156
2,ancona,1822,240,471228,43.61849,13.50898
3,aosta,1150,39,125666,45.73751,7.32072
4,arezzo,655,106,342654,43.46354,11.87765


**We're now creating the folium map to superimpose the different provinces**

In [34]:
# get the coordinates of Rome, Italy's capital
address = 'Rome, Italy'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
rome_latitude = location.latitude
rome_longitude = location.longitude
print('The geographical coordinates of Rome, Italy, are: {}, {}.'.format(rome_latitude, rome_longitude))

The geographical coordinates of Rome, Italy, are: 41.8933203, 12.4829321.


In [40]:
# Now on to the map creation!

# create map of Italy, centered on Rome, using latitude and longitude values
map_it = folium.Map(location=[rome_latitude, rome_longitude], zoom_start=6)

# add markers to map
for lat, lng, province in zip(df5['Latitude'], df5['Longitude'], df5['Province']):
    label = '{}'.format(province)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_it)  
    
map_it

## And now the dataset is almost complete! Unfortunately, since the exercise requires some Foursquare data, we need to include that in our dataframe as well. A pity, since it takes up a lot of time for little gain, but still, let's follow our teacher's wishes and do that!

The cell below records the variables with the client id and client secret info

In [30]:
# The code was removed by Watson Studio for sharing.

In [55]:
radius = 2000  # search radius in meters
"""
    we are limiting the venue search to a 2 km radius, since there would be too many results otherwise. Of course, this undermines the whole exercise,
    since there is little point in working out where it is not risky to open if the data is incomplete. Still, the process and the reasoning
    remain valid; it is just a matter of computational limits and API call quotas, which cannot be overridden.
"""

LIMIT = 100

venues = []

for lat, long, province in zip(df5['Latitude'], df5['Longitude'], df5['Province']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            province,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

KeyError: 'groups'
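The `KeyError` suggests that at least one response came back without a `groups` key, which would be consistent with an error or quota response rather than a normal result. A more tolerant way to read the payload is sketched below. The helper names and both sample payloads are hypothetical, modelled only on the keys the loop above reads.

```python
def extract_items(payload):
    """Return the venue items of an explore payload, or [] when 'groups' is missing or empty."""
    groups = payload.get('response', {}).get('groups') or []
    return groups[0].get('items', []) if groups else []

def venue_row(province, lat, lng, item):
    """Flatten one item into the tuple layout used for the venues list."""
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    return (province, lat, lng, venue['name'],
            venue['location']['lat'], venue['location']['lng'], category)

# hypothetical payloads: one without 'groups', one with a single venue
empty_payload = {'meta': {'code': 429}, 'response': {}}
ok_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Opera',
               'location': {'lat': 37.311663, 'lng': 13.579802},
               'categories': [{'name': 'Pub'}]}}]}]}}

print(extract_items(empty_payload))
rows = [venue_row('agrigento', 37.31087, 13.5765, item)
        for item in extract_items(ok_payload)]
print(rows)
```

With this approach a province without results contributes no rows instead of stopping the whole loop, so it is worth logging which provinces came back empty.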

In [71]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Province', 'Latitude', 'Longitude', 'Venue_Name', 'VenLatitude', 'VenLongitude', 'Category']

venues_df.head()

Unnamed: 0,Province,Latitude,Longitude,Venue_Name,VenLatitude,VenLongitude,Category
0,agrigento,37.31087,13.5765,Osteria Expanificio,37.311008,13.576509,Italian Restaurant
1,agrigento,37.31087,13.5765,Opera,37.311663,13.579802,Pub
2,agrigento,37.31087,13.5765,Teatro Luigi Pirandello,37.311168,13.577065,Theater
3,agrigento,37.31087,13.5765,Terra E Mare,37.311732,13.578333,Food
4,agrigento,37.31087,13.5765,Il Re di Girgenti,37.30763,13.58386,Sicilian Restaurant


In [72]:
venues_df.shape

(4036, 7)

In [73]:

print('There are {} unique categories.'.format(len(venues_df['Category'].unique())))

There are 265 unique categories.


In [74]:
venues_df['Category'].unique()[:]

array(['Italian Restaurant', 'Pub', 'Theater', 'Food',
       'Sicilian Restaurant', 'Café', 'Ice Cream Shop', 'Pizza Place',
       'Restaurant', 'Lounge', 'History Museum', 'Cocktail Bar',
       'Trattoria/Osteria', 'Bar', 'Hotel', 'Park', 'Train Station',
       'Dessert Shop', 'Breakfast Spot', 'Cupcake Shop', 'Creperie',
       'Japanese Restaurant', 'Supermarket', 'Soccer Field', 'Plaza',
       'Chinese Restaurant', 'Burger Joint', 'Gift Shop',
       'Brazilian Restaurant', 'Pool', 'Movie Theater', 'Food Court',
       'Juice Bar', 'Electronics Store', 'Historic Site', 'Diner',
       'Gastropub', 'Multiplex', 'Bookstore', 'Indian Restaurant',
       'Fast Food Restaurant', 'Soccer Stadium', 'Tennis Court',
       'Bowling Alley', 'Shopping Mall', 'Buffet', 'Hobby Shop',
       'Seafood Restaurant', 'Fountain', 'Monument / Landmark',
       'Sushi Restaurant', 'Clothing Store', 'Bistro',
       'Miscellaneous Shop', 'Sri Lankan Restaurant', 'Irish Pub',
       'Beer Bar', 'Poo

In [70]:
venues_df['Category'].unique()[:]
# let's pick some of the most relevant categories:
"""
Park
Train Station
Plaza
Pedestrian Plaza
Historic Site
Shopping Mall
Fountain
Monument / Landmark
Garden
Market
Field
Beach
Playground
"""

'\nPark\nTrain Station\nPlaza\nPedestrian Plaza\nHistoric Site\nShopping Mall\nFountain\nMonument / Landmark\nGarden\nMarket\nField\nBeach\nPlayground\n'

In [62]:
relevant_venues = ['Park',
'Train Station',
'Plaza',
'Pedestrian Plaza',
'Historic Site',
'Shopping Mall',
'Fountain',
'Monument / Landmark',
'Garden',
'Market',
'Field',
'Beach',
'Playground']

**So let's filter those venues to only include the relevant categories**

In [63]:
df6 = venues_df[venues_df.Category.isin(relevant_venues)].reset_index(drop=True)
df6.head()

Unnamed: 0,Province,Latitude,Longitude,Venue_Name,VenLatitude,VenLongitude,Category
0,agrigento,37.31087,13.5765,Villa Bonfiglio,37.306483,13.591025,Park
1,agrigento,37.31087,13.5765,Agrigento Bassa,37.319285,13.588298,Train Station
2,alessandria,44.90724,8.61156,Piazzetta Della Lega Lombarda,44.913718,8.613837,Plaza
3,alessandria,44.90724,8.61156,Piazza Garibaldi,44.908943,8.612926,Plaza
4,alessandria,44.90724,8.61156,Cittadella di Alessandria,44.920859,8.606586,Historic Site


In [56]:
df6.shape

(722, 7)

**Below we analyze the different provinces**

In [81]:
# one hot encoding
it_onehot = pd.get_dummies(df6[['Category']], prefix="", prefix_sep="")

# add province column back to dataframe
it_onehot['Province'] = df6['Province'] 

# move province column to the first column
fixed_columns = [it_onehot.columns[-1]] + list(it_onehot.columns[:-1])
it_onehot = it_onehot[fixed_columns]

print(it_onehot.shape)
it_onehot.head()

(418, 14)


Unnamed: 0,Province,Beach,Field,Fountain,Garden,Historic Site,Market,Monument / Landmark,Park,Pedestrian Plaza,Playground,Plaza,Shopping Mall,Train Station
0,agrigento,0,0,0,0,0,0,0,1,0,0,0,0,0
1,agrigento,0,0,0,0,0,0,0,0,0,0,0,0,1
2,alessandria,0,0,0,0,0,0,0,0,0,0,1,0,0
3,alessandria,0,0,0,0,0,0,0,0,0,0,1,0,0
4,alessandria,0,0,0,0,1,0,0,0,0,0,0,0,0


**And here we're grouping the rows by province, summing the occurrences of each venue category for that province**

In [88]:
it_grouped = it_onehot.groupby(["Province"]).sum().reset_index()

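# note: it_onehot.sum() sums down each column and returns a Series indexed by column name,
# so it does not align with it_grouped's row index and 'Total Venues' ends up NaN (empty below)
it_grouped['Total Venues'] = it_onehot.sum()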
print(it_grouped.shape)
it_grouped.head(70)

(62, 15)


Unnamed: 0,Province,Beach,Field,Fountain,Garden,Historic Site,Market,Monument / Landmark,Park,Pedestrian Plaza,Playground,Plaza,Shopping Mall,Train Station,Total Venues
0,agrigento,0,0,0,0,0,0,0,1,0,0,0,0,1,
1,alessandria,0,0,0,0,1,0,0,1,0,0,2,2,0,
2,ancona,0,0,1,0,0,0,4,2,0,0,7,0,0,
3,aosta,0,0,0,0,3,0,0,1,0,0,1,0,0,
4,arezzo,0,0,0,1,1,0,0,1,0,0,4,0,0,
5,ascoli piceno,0,0,0,0,3,0,0,0,0,0,3,0,0,
6,asti,0,0,0,0,0,0,0,0,0,0,5,0,1,
7,avellino,0,0,0,0,0,1,0,1,0,0,2,1,0,
8,bari,0,0,0,0,0,0,0,0,0,0,4,0,0,
9,belluno,0,0,0,0,0,0,0,0,0,0,2,0,0,
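The empty `Total Venues` column above is what one would expect when a column-wise sum (a Series indexed by column name) is assigned to a frame indexed by row number: no labels align, so every value is missing. A per-province total needs a row-wise sum instead. The sketch below shows this on toy one-hot data with three of the categories.

```python
import pandas as pd

onehot = pd.DataFrame({
    'Province': ['agrigento', 'agrigento', 'alessandria'],
    'Park': [1, 0, 0],
    'Plaza': [0, 0, 1],
    'Train Station': [0, 1, 0],
})

grouped = onehot.groupby('Province').sum().reset_index()
# sum across the category columns of each row (axis=1), leaving out the Province label
grouped['Total Venues'] = grouped.drop(columns=['Province']).sum(axis=1)
print(grouped)
```

Note that a total column added this way would also become a feature if the frame is later passed to the scaler, so it may be preferable to keep it in a separate variable.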


### Unfortunately, due to API restrictions, we're unable to retrieve all the data we would want to

In [83]:
it_grouped.shape

(62, 14)

**It turns out that not all provinces are covered by our API calls! Since the goal of the exercise is to demonstrate what we have learned, not to deliver a perfect project despite technical or product limitations, we will proceed with what we've got**

**Let's standardize the dataset, so that every feature has zero mean and unit variance!**
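To make the transformation concrete, here is a small NumPy sketch of what standardization does to each column: subtract the column mean and divide by the column's population standard deviation. As far as I know this matches the default behaviour of `StandardScaler`, including leaving constant columns unscaled; the toy matrix is made up for illustration.

```python
import numpy as np

# toy counts: rows are provinces, columns are venue categories
X = np.array([[0.0, 1.0],
              [2.0, 1.0],
              [4.0, 1.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)      # population standard deviation (ddof=0)
std[std == 0] = 1.0      # constant columns are left unscaled instead of dividing by zero
X_scaled = (X - mean) / std
print(X_scaled)
```

After this step each non-constant column has mean 0 and standard deviation 1, so categories with large raw counts do not dominate the distance computations in k-means.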

In [84]:
from sklearn.preprocessing import StandardScaler

X = it_grouped.values[:,1:]
X = np.nan_to_num(X)
cluster_dataset = StandardScaler().fit_transform(X)
cluster_dataset



array([[-1.28036880e-01, -1.28036880e-01, -2.88093711e-01,
        -5.00779797e-01, -7.14019435e-01, -2.42484797e-01,
        -4.25381099e-01,  1.57681568e+00, -2.02620511e-01,
        -1.82018567e-01, -1.77821670e+00, -4.31791489e-01,
         5.27192441e+00],
       [-1.28036880e-01, -1.28036880e-01, -2.88093711e-01,
        -5.00779797e-01,  3.07465744e-01, -2.42484797e-01,
        -4.25381099e-01, -2.12967498e-02, -2.02620511e-01,
        -1.82018567e-01, -5.45499734e-01,  1.12693859e+00,
        -3.43908672e-01],
       [-1.28036880e-01, -1.28036880e-01,  6.85275144e-01,
        -5.00779797e-01, -7.14019435e-01, -2.42484797e-01,
         3.84146400e+00, -1.35447638e-01, -2.02620511e-01,
        -1.82018567e-01,  7.08587474e-02, -4.31791489e-01,
        -3.43908672e-01],
       [-1.28036880e-01, -1.28036880e-01, -2.88093711e-01,
        -5.00779797e-01,  2.96332721e+00, -2.42484797e-01,
        -4.25381099e-01,  1.38514493e-01, -2.02620511e-01,
        -1.82018567e-01, -1.03858652e