# This notebook covers the data extraction and manipulation for the capstone

In [2]:
print('importation: begin!')
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import re

import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium # -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

importation: begin!
Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Libraries imported.


**The cell below extracts data about Italy's provinces from a Wikipedia page. Since the English page shows some cities under their English names, we use the equivalent Italian Wikipedia page, whose names approximately match those in the infection spreadsheet**

In [3]:
url = 'https://it.wikipedia.org/wiki/Province_d%27Italia'

df = pd.read_html(url)

**Since there is more than one dataframe on that page, we need to look through them to find the one we're interested in.**

It turns out we care about the first one, at position 0 in the list of dataframes.

In [166]:
# There are 11 elements in df, which means 11 different dataframes! We care about the first one, position number 0
"""for i in range(len(df)):
    print('new dataframe\n')
    df_new = df[i]
    print(df_new.head(2))"""

"for i in range(len(df)):\n    print('new dataframe\n')\n    df_new = df[i]\n    print(df_new.head(2))"

In [167]:
# we extract the relevant table by accessing index 0 of the list
df1 = df[0]
df1.head(2)

Unnamed: 0,Provincia,Sigla,Regione,Popolazione(ab.),Superficie(km²),Densità(ab./km²),Comuni(N°),Presidente,Partito
0,Agrigento (lib. cons. com.),AG,Sicilia,434 870,"3 052,59",142,43,Roberto Barberi[6],Commissario straordinario
1,Alessandria,AL,Piemonte,421 284,"3 558,83",118,187,Gianfranco Baldi,Indipendente (Centro-destra)


**We can drop a few columns, since we're not really interested in many of them**

In [6]:
df2 = df1.drop(axis= 1, columns = ['Sigla', 'Regione', 'Superficie(km²)', 'Comuni(N°)', 'Partito', 'Presidente'])

**We then rename some columns to better fit our needs, using their English equivalents**

In [7]:
df2.rename(columns = {'Provincia':'Province','Popolazione(ab.)':'Population', 'Densità(ab./km²)':'Density'}, inplace = True)

In [168]:
df2.head(2)

Unnamed: 0,Province,Population,Density,Province adjusted
0,Agrigento (lib. cons. com.),434 870,142,agrigento
1,Alessandria,421 284,118,alessandria


**As we can see, there is some data cleaning to do on this table. Let's start with the names: many of them carry unnecessary bracketed annotations that we need to strip out**

In [9]:
regespr = re.compile(r' \(')
regespr1 = re.compile(r'\[')

new_list = list()
for i in df2['Province']:
    #print(i)
    try:
        start_pos = re.search(regespr, i).start()
        new_i = i[:start_pos]
    except:
        start_pos = 0
        new_i = i
    try:
        start_pos1 = re.search(regespr1, new_i).start()
        new_i1 = new_i[:start_pos1]
    except:
        start_pos1 = 0
        new_i1 = new_i
    lower_item = new_i1.lower()
    new_list.append(lower_item)
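
The same clean-up can also be written as a single regular-expression substitution. The sketch below is an alternative, not the code used in this notebook: the helper name `clean_province_name` is made up for illustration, and it treats everything from the first `(` or `[` onwards as an annotation, which is slightly broader than the two patterns above.

```python
import re

# Assumption: anything from the first "(" or "[" to the end of the string
# is an annotation (e.g. " (lib. cons. com.)" or a footnote marker "[6]").
ANNOTATION = re.compile(r"\s*[(\[].*$")

def clean_province_name(name: str) -> str:
    """Strip bracketed annotations and lower-case what is left."""
    return ANNOTATION.sub("", name).strip().lower()

print(clean_province_name("Agrigento (lib. cons. com.)"))  # agrigento
print(clean_province_name("Alessandria"))                  # alessandria
```

With pandas, a helper like this could be applied through `df2['Province'].map(clean_province_name)`, which avoids building the intermediate list by hand.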

**There are still some unique records that don't match. We will replace them manually**

In [170]:
#print(new_list)
for count,i in enumerate(new_list):
    if i == 'forlì-cesena':
        new_list[count] = 'forlì cesena'
    if i == 'massa-carrara':
        new_list[count] = 'massa carrara'
    if i == 'monza e brianza':
        new_list[count] = 'monza brianza'
    if i == 'pesaro e urbino':
        new_list[count] = 'pesaro'
#print(new_list)

df2['Province adjusted'] = np.array(new_list) # adds column with new list to dataframe
df2.head(2)

Unnamed: 0,Province,Population,Density,Province adjusted
0,Agrigento (lib. cons. com.),434 870,142,agrigento
1,Alessandria,421 284,118,alessandria


**We will then drop the first province column and then rename the new one to the old one**

In [171]:
df3 = df2.drop(axis= 1, columns = ['Province'])
df3.rename(columns = {'Province adjusted':'Province'}, inplace = True)
df3.head(5)

Unnamed: 0,Population,Density,Province
0,434 870,142,agrigento
1,421 284,118,alessandria
2,471 228,240,ancona
3,125 666,39,aosta
4,342 654,106,arezzo


**Now we need to delete those annoying spaces used as thousands separators in the population column**

In [172]:
#regespr2 = re.compile(r' ')

new_list1 = list()
for i in df3['Population']:
    new_list1.append(i)
#print(new_list1)
new_list2 = list()
for i in new_list1:
    #print(i)
    #print(len(i))
    if len(i) > 3:
        new_i = i[:-4]+i[-3:]
    else:
        new_i = i
    #print(new_i)
    if len(new_i) > 6:
        new_i = new_i[:-7]+new_i[-6:]
    else:
        new_i = new_i
    new_i = int(new_i)
    new_list2.append(new_i)
#print(new_list2)
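
An alternative to slicing by position is to remove every whitespace character before converting, so the result does not depend on where the separator sits or on how many digits the number has. This is a minimal sketch; `parse_spaced_int` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
def parse_spaced_int(text: str) -> int:
    """Convert a string such as '434 870' to an int by dropping all whitespace."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the non-breaking space (\xa0) that some HTML tables use.
    return int("".join(text.split()))

for raw in ["434 870", "2\xa0071", "2156", "39"]:
    print(repr(raw), "->", parse_spaced_int(raw))
```

Applied to a column, this could look like `df3['Population'].map(parse_spaced_int)`.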

In [173]:
df3['Population adjusted'] = np.array(new_list2) # adds column with new list to dataframe
df4 = df3.drop(axis= 1, columns = ['Population'])
df4.rename(columns = {'Population adjusted':'Population'}, inplace = True)
df4.head()

Unnamed: 0,Density,Province,Population
0,142,agrigento,434870
1,118,alessandria,421284
2,240,ancona,471228
3,39,aosta,125666
4,106,arezzo,342654


### Now we wrangle the second data source. It is a PDF file (http://www.salute.gov.it/imgs/C_17_notizie_4702_1_file.pdf), and there are a couple of Python packages for extracting tables from PDFs: tabula and camelot. Unfortunately, neither of them works in this Jupyter environment: even though they install with the !pip command, they throw different traceback errors when trying to parse the file. So I had to download it, convert it to a CSV file, and upload it to my GitHub page. Not ideal, but I currently have no other way around the problem.

In [15]:
# The URL points to GitHub's HTML view of the CSV, so the rendered table is parsed with read_html
csv_file = 'https://github.com/EmanueleLanzani/Coursera_Capstone/blob/master/infected_situation_7_may.csv'
inf_df = pd.read_html(csv_file)

In [16]:
inf_df1 = inf_df[0]

In [174]:
inf_df1.head(5)

Unnamed: 0.1,Unnamed: 0,Province,Infected
0,,agrigento,135
1,,alessandria,3654
2,,ancona,1822
3,,aosta,1150
4,,arezzo,655


In [175]:
#The first column is clearly a parsing error, so we proceed to drop it:

inf_df2 = inf_df1.drop(axis= 1, columns = ['Unnamed: 0'])
inf_df2.head()

Unnamed: 0,Province,Infected
0,agrigento,135
1,alessandria,3654
2,ancona,1822
3,aosta,1150
4,arezzo,655


In [176]:
# we need to change a specific record which appears under its acronym in the dataset
inf_df3 = inf_df2.replace('bat','barletta-andria-trani')
inf_df3.head()

Unnamed: 0,Province,Infected
0,agrigento,135
1,alessandria,3654
2,ancona,1822
3,aosta,1150
4,arezzo,655


**We can now merge together the two datasets based on the province column**

In [22]:
df5 = pd.merge(inf_df3, df4, on='Province', how='inner')
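
An inner merge silently drops every province whose name differs between the two tables, so it can help to compare the key sets first. The sketch below uses plain Python sets; `unmatched_keys` is a hypothetical helper, and the two sample lists only stand in for `inf_df3['Province']` and `df4['Province']`.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys found on only one side of a join, as two sorted lists."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)

# Hypothetical samples standing in for the two 'Province' columns.
infected_names = ["agrigento", "alessandria", "bat"]
wiki_names = ["agrigento", "alessandria", "barletta-andria-trani"]

only_infected, only_wiki = unmatched_keys(infected_names, wiki_names)
print(only_infected)  # ['bat']
print(only_wiki)      # ['barletta-andria-trani']
```

Any name that shows up in either list is a candidate for the kind of manual replacement done above.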

In [177]:
df5.head()

Unnamed: 0,Province,Infected,Density,Population,Latitude,Longitude
0,agrigento,135,142,434870,37.31087,13.5765
1,alessandria,3654,118,421284,44.90724,8.61156
2,ancona,1822,240,471228,43.61849,13.50898
3,aosta,1150,39,125666,45.73751,7.32072
4,arezzo,655,106,342654,43.46354,11.87765


### Let's install the geocoder package! Last time it worked, while during a previous exercise it didn't. Fingers crossed!

In [24]:
!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [25]:
from geopy.geocoders import Nominatim 
import geocoder

In [26]:
# define a function to get coordinates
def get_latlng(province):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Italy'.format(province))
        lat_lng_coords = g.latlng
    return lat_lng_coords
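
The loop above keeps retrying forever if the service never answers for a given province. A bounded variant is sketched below as one possible alternative. The function name, the `max_attempts` and `delay` parameters, and the injected `geocode` callable are all assumptions made for illustration; with the geocoder package the callable could be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def get_latlng_with_retries(province, geocode, max_attempts=5, delay=1.0):
    """Call geocode(query) until it returns coordinates or attempts run out.

    `geocode` is any callable that takes a query string and returns
    [lat, lng] or None. Returns None if every attempt fails.
    """
    query = "{}, Italy".format(province)
    for attempt in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            time.sleep(delay)  # brief pause before trying again
    return None

# A fake geocoder that fails twice before answering, to exercise the retry path.
calls = {"n": 0}

def flaky_geocode(query):
    calls["n"] += 1
    return [37.31087, 13.5765] if calls["n"] >= 3 else None

print(get_latlng_with_retries("agrigento", flaky_geocode, delay=0))
```

Returning `None` after the last attempt leaves it to the caller to decide whether to skip the province or stop the run.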

The cell below may take a while to run, since the geocoder package is quite unreliable. We print a message at the start and at the end of the code to confirm that the process completed

In [27]:
print('begin geocoding!')
coords = [ get_latlng(province) for province in df5["Province"].tolist() ]
print('geocoding finished!')

begin geocoding!
geocoding finished!


In [178]:
coords[:5]

[[37.31087000000008, 13.576500000000067],
 [44.90724000000006, 8.611560000000054],
 [43.618490000000065, 13.508980000000065],
 [45.73751000000004, 7.320720000000051],
 [43.46354000000008, 11.877650000000074]]

In [28]:
len(coords)

108

In [29]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df5['Latitude'] = df_coords['Latitude']
df5['Longitude'] = df_coords['Longitude']
df5.head()

Unnamed: 0,Province,Infected,Density,Population,Latitude,Longitude
0,agrigento,135,142,434870,37.31087,13.5765
1,alessandria,3654,118,421284,44.90724,8.61156
2,ancona,1822,240,471228,43.61849,13.50898
3,aosta,1150,39,125666,45.73751,7.32072
4,arezzo,655,106,342654,43.46354,11.87765


**We're now creating the folium map to superimpose the different provinces**

In [34]:
# get the coordinates of Rome, Italy's capital
address = 'Rome, Italy'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
rome_latitude = location.latitude
rome_longitude = location.longitude
print('The geographical coordinates of Rome, Italy, are: {}, {}.'.format(rome_latitude, rome_longitude))

The geographical coordinates of Rome, Italy, are: 41.8933203, 12.4829321.


In [40]:
# Now on to the map creation!

# create a map of Italy centered on Rome using latitude and longitude values
map_it = folium.Map(location=[rome_latitude, rome_longitude], zoom_start=6)

# add markers to map
for lat, lng, province in zip(df5['Latitude'], df5['Longitude'], df5['Province']):
    label = '{}'.format(province)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_it)  
    
map_it

## And now the dataset is almost complete! Unfortunately, since the exercise requires some Foursquare data, we need to include that in our dataframe as well. A pity, since it takes up a lot of time for little gain, but still, let's follow our teacher's wishes and do it!

The cell below records the variables with the client id and client secret info

In [30]:
# The code was removed by Watson Studio for sharing.

In [55]:
radius = 2000
"""
    we are limiting the venue search to a radius of 2 km (2000 m), since there would be too many results otherwise. Of course, this invalidates the whole exercise,
    since there is no point in working out where it is not risky to open if the data is incomplete. Still, the process and the reasoning
    remain correct; it is just a matter of computational limits and API call quotas, which cannot be overridden.
"""

LIMIT = 100

venues = []

for lat, long, province in zip(df5['Latitude'], df5['Longitude'], df5['Province']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            province,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

KeyError: 'groups'
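
The `KeyError` above means the JSON came back without a `groups` entry under `response`, which is consistent with the API restrictions mentioned later in the notebook. One way to keep the loop running is to treat such a response as "no venues". This is a sketch only: `extract_items` is a hypothetical helper and the error payload shown is an assumed shape, not a documented one.

```python
def extract_items(payload):
    """Return the venue items from an explore-style payload, or [] if absent."""
    groups = payload.get("response", {}).get("groups") or []
    if not groups:
        return []
    return groups[0].get("items", [])

# Shape used by the loop above: response -> groups[0] -> items.
ok_payload = {"response": {"groups": [{"items": [{"venue": {"name": "Opera"}}]}]}}
# Hypothetical shape of a response that carries no groups.
error_payload = {"meta": {"code": 429}, "response": {}}

print(len(extract_items(ok_payload)))
print(extract_items(error_payload))
```

In the loop, `results = extract_items(requests.get(url).json())` would then skip provinces with no data instead of aborting the whole cell.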

In [71]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Province', 'Latitude', 'Longitude', 'Venue_Name', 'VenLatitude', 'VenLongitude', 'Category']

venues_df.head()

Unnamed: 0,Province,Latitude,Longitude,Venue_Name,VenLatitude,VenLongitude,Category
0,agrigento,37.31087,13.5765,Osteria Expanificio,37.311008,13.576509,Italian Restaurant
1,agrigento,37.31087,13.5765,Opera,37.311663,13.579802,Pub
2,agrigento,37.31087,13.5765,Teatro Luigi Pirandello,37.311168,13.577065,Theater
3,agrigento,37.31087,13.5765,Terra E Mare,37.311732,13.578333,Food
4,agrigento,37.31087,13.5765,Il Re di Girgenti,37.30763,13.58386,Sicilian Restaurant


In [72]:
venues_df.shape

(4036, 7)

In [73]:

print('There are {} unique categories.'.format(len(venues_df['Category'].unique())))

There are 265 unique categories.


In [179]:
venues_df['Category'].unique()[:5]

array(['Italian Restaurant', 'Pub', 'Theater', 'Food',
       'Sicilian Restaurant'], dtype=object)

In [70]:
venues_df['Category'].unique()[:]
# let's pick some of the most relevant categories:
"""
Park
Train Station
Plaza
Pedestrian Plaza
Historic Site
Shopping Mall
Fountain
Monument / Landmark
Garden
Market
Field
Beach
Playground
"""

'\nPark\nTrain Station\nPlaza\nPedestrian Plaza\nHistoric Site\nShopping Mall\nFountain\nMonument / Landmark\nGarden\nMarket\nField\nBeach\nPlayground\n'

In [62]:
relevant_venues = ['Park',
'Train Station',
'Plaza',
'Pedestrian Plaza',
'Historic Site',
'Shopping Mall',
'Fountain',
'Monument / Landmark',
'Garden',
'Market',
'Field',
'Beach',
'Playground']

**So let's filter those venues to only include the relevant categories**

In [63]:
df6 = venues_df[venues_df.Category.isin(relevant_venues)].reset_index(drop=True)
df6.head()

Unnamed: 0,Province,Latitude,Longitude,Venue_Name,VenLatitude,VenLongitude,Category
0,agrigento,37.31087,13.5765,Villa Bonfiglio,37.306483,13.591025,Park
1,agrigento,37.31087,13.5765,Agrigento Bassa,37.319285,13.588298,Train Station
2,alessandria,44.90724,8.61156,Piazzetta Della Lega Lombarda,44.913718,8.613837,Plaza
3,alessandria,44.90724,8.61156,Piazza Garibaldi,44.908943,8.612926,Plaza
4,alessandria,44.90724,8.61156,Cittadella di Alessandria,44.920859,8.606586,Historic Site


In [56]:
df6.shape

(722, 7)

**Below we analyze the different provinces**

In [81]:
# one hot encoding
it_onehot = pd.get_dummies(df6[['Category']], prefix="", prefix_sep="")

# add province column back to dataframe
it_onehot['Province'] = df6['Province'] 

# move province column to the first column
fixed_columns = [it_onehot.columns[-1]] + list(it_onehot.columns[:-1])
it_onehot = it_onehot[fixed_columns]

print(it_onehot.shape)
it_onehot.head()

(418, 14)


Unnamed: 0,Province,Beach,Field,Fountain,Garden,Historic Site,Market,Monument / Landmark,Park,Pedestrian Plaza,Playground,Plaza,Shopping Mall,Train Station
0,agrigento,0,0,0,0,0,0,0,1,0,0,0,0,0
1,agrigento,0,0,0,0,0,0,0,0,0,0,0,0,1
2,alessandria,0,0,0,0,0,0,0,0,0,0,1,0,0
3,alessandria,0,0,0,0,0,0,0,0,0,0,1,0,0
4,alessandria,0,0,0,0,1,0,0,0,0,0,0,0,0


**And here we're grouping the rows by province, summing the occurrences of each venue category for that province**

In [180]:
it_grouped = it_onehot.groupby(["Province"]).sum().reset_index()

#it_grouped['Total Venues'] = it_onehot.sum(axis=1)
print(it_grouped.shape)
it_grouped.head()

(62, 14)


Unnamed: 0,Province,Beach,Field,Fountain,Garden,Historic Site,Market,Monument / Landmark,Park,Pedestrian Plaza,Playground,Plaza,Shopping Mall,Train Station
0,agrigento,0,0,0,0,0,0,0,1,0,0,0,0,1
1,alessandria,0,0,0,0,1,0,0,1,0,0,2,2,0
2,ancona,0,0,1,0,0,0,4,2,0,0,7,0,0
3,aosta,0,0,0,0,3,0,0,1,0,0,1,0,0
4,arezzo,0,0,0,1,1,0,0,1,0,0,4,0,0


In [99]:
it_grouped['Total Venues'] = it_grouped.sum(axis=1, numeric_only=True) # sum only the numeric venue columns, skipping 'Province'
it_grouped.head()

Unnamed: 0,Province,Beach,Field,Fountain,Garden,Historic Site,Market,Monument / Landmark,Park,Pedestrian Plaza,Playground,Plaza,Shopping Mall,Train Station,Total Venues
0,agrigento,0,0,0,0,0,0,0,1,0,0,0,0,1,2
1,alessandria,0,0,0,0,1,0,0,1,0,0,2,2,0,6
2,ancona,0,0,1,0,0,0,4,2,0,0,7,0,0,14
3,aosta,0,0,0,0,3,0,0,1,0,0,1,0,0,5
4,arezzo,0,0,0,1,1,0,0,1,0,0,4,0,0,7
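
One-hot encoding followed by a grouped sum and a row total amounts to counting venues per province and per category. The sketch below shows the same idea with the standard library, using only the five filtered venues displayed earlier as sample data, so the totals are smaller than those in the full table.

```python
from collections import Counter, defaultdict

# (province, category) pairs from the first five filtered venues shown earlier.
venues = [
    ("agrigento", "Park"),
    ("agrigento", "Train Station"),
    ("alessandria", "Plaza"),
    ("alessandria", "Plaza"),
    ("alessandria", "Historic Site"),
]

# Count categories per province (the grouped one-hot sum).
per_province = defaultdict(Counter)
for province, category in venues:
    per_province[province][category] += 1

# Sum across categories (the 'Total Venues' column).
totals = {province: sum(counts.values()) for province, counts in per_province.items()}

print(dict(per_province["alessandria"]))  # {'Plaza': 2, 'Historic Site': 1}
print(totals)                             # {'agrigento': 2, 'alessandria': 3}
```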


### Unfortunately, due to API restrictions, we're unable to retrieve all the data we would want to

In [104]:
it_grouped1 = it_grouped[['Province','Total Venues']]
it_grouped1.head()

Unnamed: 0,Province,Total Venues
0,agrigento,2
1,alessandria,6
2,ancona,14
3,aosta,5
4,arezzo,7


**It turns out that not all provinces are covered by our API call! Since the goal of the exercise is to display our learnings, and not to get the perfect project if there are technical or product limitations, we will go ahead and proceed with what we've got**

**Let's merge the two datasets together**

In [181]:
df6 = pd.merge(it_grouped1, df5, on='Province', how='inner')
df6.drop(['Province','Latitude','Longitude'], axis=1, inplace=True) # drop province to get only numbers, otherwise we cannot standardize properly
df6.head(5)

#df6.dtypes
#df6 = df6.astype({"Density": int})

Unnamed: 0,Total Venues,Infected,Density,Population
0,2,135,142,434870
1,6,3654,118,421284
2,14,1822,240,471228
3,5,1150,39,125666
4,7,655,106,342654


**Further data wrangling: density is messy as well, since it's a string type instead of an int or float type**

In [119]:
new_list3 = list()
for i in df6['Density']:
    new_list3.append(i)
print(new_list3)
new_list4 = list()
for i in new_list3:
    #print(i)
    #print(len(i))
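    if len(i) > 3:
        # NOTE: assumes a separator before the last three digits; an unseparated value such as '2156' loses its leading digit (156 in the output below)
        new_i = i[:-4]+i[-3:]
    else:
        new_i = i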
    #print(new_i)
    new_i = int(new_i)
    new_list4.append(new_i)
print(new_list4)

['142', '118', '240', '39', '106', '169', '142', '149', '326', '56', '133', '405', '192', '274', '72', '265', '211', '344', '348', '310', '148', '148', '468', '105', '203', '101', '85', '64', '201', '131', '288', '89', '151', '455', '298', '49', '185', '55', '59', '249', '255', '284', '419', '276', '294', '219', '113', '176', '57', '191', '2\xa0071', '262', '2156', '2\xa0624', '275', '37', '53', '437', '249', '131', '184', '104']
[142, 118, 240, 39, 106, 169, 142, 149, 326, 56, 133, 405, 192, 274, 72, 265, 211, 344, 348, 310, 148, 148, 468, 105, 203, 101, 85, 64, 201, 131, 288, 89, 151, 455, 298, 49, 185, 55, 59, 249, 255, 284, 419, 276, 294, 219, 113, 176, 57, 191, 2071, 262, 156, 2624, 275, 37, 53, 437, 249, 131, 184, 104]


In [182]:
df6['Density adjusted'] = np.array(new_list4) # adds column with new list to dataframe
df7 = df6.drop(axis= 1, columns = ['Density'])
df7.rename(columns = {'Density adjusted':'Density'}, inplace = True)
df7.head(5)

Unnamed: 0,Total Venues,Infected,Population,Density
0,2,135,434870,142
1,6,3654,421284,118
2,14,1822,471228,240
3,5,1150,125666,39
4,7,655,342654,106


In [122]:
df7.dtypes

Total Venues    int64
Infected        int64
Population      int64
Density         int64
dtype: object

**Now we're finally, finally, finally good to go! Let's standardize the dataset!**

In [183]:
from sklearn.preprocessing import StandardScaler

X = df7.values[:,:]
X = np.nan_to_num(X)
cluster_dataset = StandardScaler().fit_transform(X)
cluster_dataset[:5]



array([[-1.13204266, -0.61190583, -0.25834577, -0.30882725],
       [-0.17712232,  0.4153513 , -0.28239856, -0.36900732],
       [ 1.73271835, -0.11944128, -0.19397723, -0.063092  ],
       [-0.4158524 , -0.31560973, -0.80576346, -0.56710002],
       [ 0.06160776, -0.46010882, -0.42160585, -0.39909735]])
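
For reference, `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation (dividing by n, not n - 1). The sketch below reproduces that formula for a single column with the standard library; `standardize` is a hypothetical helper, and because it only sees five sample values its numbers will not match the array above, which was computed over all 62 rows.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# 'Total Venues' for the first five provinces shown above.
z = standardize([2, 6, 14, 5, 7])
print([round(v, 3) for v in z])
```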

In [147]:
num_clusters = 6

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset)
labels = k_means.labels_

print(labels)

[4 0 2 4 4 4 4 4 5 4 4 5 4 5 2 5 4 2 5 5 4 4 0 4 0 4 0 4 4 2 5 4 4 2 4 4 4
 4 4 4 4 5 0 4 0 2 4 2 2 4 3 0 2 1 0 4 4 5 5 0 0 2]


In [184]:
df7["Labels"] = labels
df7.head()

Unnamed: 0,Total Venues,Infected,Population,Density,Labels
0,2,135,434870,142,4
1,6,3654,421284,118,0
2,14,1822,471228,240,2
3,5,1150,125666,39,4
4,7,655,342654,106,4


**The cell below checks whether the population column has 62 unique values, the same as the number of provinces left in the dataframe. If so, every population value identifies exactly one province and we can join the dataframes on it**

In [149]:
print('There are {} unique population values.'.format(len(df7['Population'].unique())))

There are 62 unique population values.


**There are indeed 62 different records, so we can merge the dataset on the Population column**

In [185]:
df8 = df5.drop(axis= 1, columns = ['Infected', 'Density'])
df9 = pd.merge(df7, df8, on='Population', how='inner')
df9.head()

Unnamed: 0,Total Venues,Infected,Population,Density,Labels,Province,Latitude,Longitude
0,2,135,434870,142,4,agrigento,37.31087,13.5765
1,6,3654,421284,118,0,alessandria,44.90724,8.61156
2,14,1822,471228,240,2,ancona,43.61849,13.50898
3,5,1150,125666,39,4,aosta,45.73751,7.32072
4,7,655,342654,106,4,arezzo,43.46354,11.87765


In [151]:
fixed_columns1 = [df9.columns[-3]] + list(df9.columns[:-3]) + list(df9.columns[-2:])
df9 = df9[fixed_columns1]
df9.head()

Unnamed: 0,Province,Total Venues,Infected,Population,Density,Labels,Latitude,Longitude
0,agrigento,2,135,434870,142,4,37.31087,13.5765
1,alessandria,6,3654,421284,118,0,44.90724,8.61156
2,ancona,14,1822,471228,240,2,43.61849,13.50898
3,aosta,5,1150,125666,39,4,45.73751,7.32072
4,arezzo,7,655,342654,106,4,43.46354,11.87765


In [152]:
# create map
map_clusters = folium.Map(location=[rome_latitude, rome_longitude], zoom_start=6) # centre on Rome, as computed earlier

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i+x+(i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, prov, cluster in zip(df9['Latitude'], df9['Longitude'], df9['Province'], df9['Labels']):
    label = folium.Popup(str(prov) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [153]:
map_clusters.save('map_clusters.html')

## Cluster investigation

In [160]:
df9.loc[df9['Labels'] == 0]
# many infected

Unnamed: 0,Province,Total Venues,Infected,Population,Density,Labels,Latitude,Longitude
1,alessandria,6,3654,421284,118,0,44.90724,8.61156
22,como,6,3440,599204,468,0,45.81076,9.0871
24,cremona,7,6178,358955,203,0,45.14241,10.01521
26,cuneo,3,2603,587098,85,0,44.38822,7.54833
42,lecco,3,2419,337380,419,0,45.85166,9.39187
44,lodi,6,3204,230198,294,0,45.312,9.49701
51,modena,6,3772,705393,262,0,44.64323,10.93381
54,novara,4,2434,369018,275,0,45.44678,8.61524
59,parma,6,3260,451631,131,0,44.79853,10.34022
60,pavia,7,4652,545888,184,0,45.19305,9.1513


In [161]:
df9.loc[df9['Labels'] == 1]
# many infected, very high density

Unnamed: 0,Province,Total Venues,Infected,Population,Density,Labels,Latitude,Longitude
53,napoli,14,2506,3072996,2624,1,40.84014,14.25226


In [162]:
df9.loc[df9['Labels'] == 2]
# many venues

Unnamed: 0,Province,Total Venues,Infected,Population,Density,Labels,Latitude,Longitude
2,ancona,14,1822,471228,240,2,43.61849,13.50898
14,bolzano,11,2552,531178,72,2,46.49528,11.35346
17,cagliari,12,243,430372,344,2,39.21454,9.11049
29,ferrara,11,960,345691,131,2,44.83839,11.61944
33,genova,15,5007,837427,455,2,44.41039,8.93898
45,lucca,14,1316,387876,219,2,43.84198,10.51531
47,mantova,10,3221,412292,176,2,45.16541,10.79242
48,matera,14,202,197909,57,2,40.66714,16.60445
52,monza brianza,11,4974,873935,156,2,45.597948,9.270323
61,perugia,12,994,656382,104,2,43.11136,12.38391


In [163]:
df9.loc[df9['Labels'] == 3]
# extremely high number of infected

Unnamed: 0,Province,Total Venues,Infected,Population,Density,Labels,Latitude,Longitude
50,milano,12,20893,3261873,2071,3,45.46796,9.18178


In [164]:
df9.loc[df9['Labels'] == 4]
# low density, low infected

Unnamed: 0,Province,Total Venues,Infected,Population,Density,Labels,Latitude,Longitude
0,agrigento,2,135,434870,142,4,37.31087,13.5765
3,aosta,5,1150,125666,39,4,45.73751,7.32072
4,arezzo,7,655,342654,106,4,43.46354,11.87765
5,ascoli piceno,6,286,207179,169,4,42.85398,13.58441
6,asti,6,1655,214638,142,4,44.90443,8.19994
7,avellino,5,474,418306,149,4,40.91217,14.79288
9,belluno,2,1142,202950,56,4,46.14098,12.21275
10,benevento,4,189,277018,133,4,41.12995,14.78552
12,biella,4,1002,177585,192,4,45.56041,8.05978
16,brindisi,4,600,392975,211,4,40.6347,17.94025


In [165]:
df9.loc[df9['Labels'] == 5]
# average density, mid-high venues, high population

Unnamed: 0,Province,Total Venues,Infected,Population,Density,Labels,Latitude,Longitude
8,bari,4,1362,1248489,326,5,41.12587,16.86666
11,bergamo,10,11622,1114590,405,5,45.69523,9.66951
13,bologna,11,4685,1017196,274,5,44.50484,11.34507
15,brescia,11,13391,1265954,265,5,45.53689,10.232
18,caserta,6,425,922965,348,5,41.07014,14.33161
19,catania,18,1027,1103917,310,5,37.51136,15.06752
30,firenze,11,3275,1012407,288,5,43.78237,11.25501
41,lecce,5,502,795134,284,5,40.35796,18.16802
57,padova,8,3885,937908,437,5,45.40944,11.87171
58,palermo,13,523,1245826,249,5,38.12207,13.36112


**The goal of the exercise is to group provinces by clustering them on certain values:**

- Infected
- Population
- Density
- Risky venues

**As we can see, if we aim to reopen provinces selectively based on the above criteria, we're looking for a cluster with few infected, low density and few risky venues. By far, the cluster that best satisfies these conditions is cluster 4.
So, if we were in a politician's shoes and had reliable data (impossible to collect here due to API restrictions and computing limitations), we would proceed to lift more movement restrictions in cluster 4 provinces**