# World Data League 2021
## Notebook Template

This notebook is one of the mandatory deliverables when you submit your solution (alongside the video pitch). Its structure follows the WDL evaluation criteria and it has dedicated cells where you can add descriptions. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work.

The notebook must:

*   💻 have all the code that you want the jury to evaluate
*   🧱 follow the predefined structure
*   📄 have markdown descriptions where you find necessary
*   👀 be saved with all the output that you want the jury to see
*   🏃‍♂️ be runnable


## Authors
- Nicholas Sistovaris
- Moritz Geiger
- Pravalika Myneni
- Sowmya Madela

## External links and resources

All external data and resources that were not provided by the WDL were acquired through the following links:

1. https://noise-planet.org/noisemodelling.html 
2. https://www.torinocitylab.it/en/asset-to/open-data 
3. https://www.officeholidays.com/countries/italy/turin/2018 
4. https://www.feiertagskalender.ch/index.php?geo=3815&jahr=2018&hl=en
5. http://webgis.arpa.piemonte.it/basicviewer_arpa_webapp/index.html?webmap=89aa175451d24ae0a1911e67957d9aec
6. http://aperto.comune.torino.it/dataset/zone-statistiche
7. https://openweathermap.org/history
8. https://developers.google.com/maps/documentation/places/web-service/details 

## Introduction

**Overview:**


_from challenge description_
<blockquote>

</blockquote>



**Research:**



## Development
Start coding here! 👩‍💻

Don't hesitate to create markdown cells to include descriptions of your work where you see fit, as well as commenting your code.

We know that you know exactly where to start when it comes to crunching data and building models, but don't forget that WDL is all about social impact...so take that into consideration as well.

### Imports (libraries) 📚

In [1]:
## TABULAR
import pandas as pd 
import numpy as np
import matplotlib

## GEO
import geopandas as gpd
import fiona
import folium
from folium.plugins import MarkerCluster, HeatMap, BeautifyIcon
from folium.map import LayerControl, Layer, FeatureGroup
from folium.vector_layers import Circle, CircleMarker
from shapely.geometry import LineString, Point
from shapely import wkt


## DATA
import os
import zipfile
from collections import Counter
import re
from datetime import datetime
import requests
from dotenv import load_dotenv, find_dotenv
import ast
import datetime as dt
from io import StringIO, BytesIO


## VIS
from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.tsa
import branca
import plotly.express as px

## TIME SERIES
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import mean_squared_error, mean_squared_log_error
import statsmodels.tsa.api as smt
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA  # ARIMA location in statsmodels >= 0.12
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pmdarima.arima import auto_arima 

In [2]:
df_sensors_def = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/noise_sensor_list.csv', sep=';')
df_sensors_def

Unnamed: 0,code,address,Lat,Long,streaming
0,s_01,"Via Saluzzo, 26 Torino",45059172,7678986,https://userportal.smartdatanet.it/userportal/...
1,s_02,"Via Principe Tommaso, 18bis Torino",45057837,7681555,https://userportal.smartdatanet.it/userportal/...
2,s_03,Largo Saluzzo Torino,45058518,7678854,https://userportal.smartdatanet.it/userportal/...
3,s_05,Via Principe Tommaso angolo via Baretti Torino,45057603,7681348,https://userportal.smartdatanet.it/userportal/...
4,s_06,"Corso Marconi, 27 Torino",45055554,768259,https://userportal.smartdatanet.it/userportal/...


The location of sensors was optimized to cover all significant features of the “Movida” area (Figure 3): one in a very crowded square (S_03, not active in daytime), three in narrow streets with pubs and bars (S_01, S_04, S_05), one in a boulevard for traffic noise measurement (S_06) and the last one in a quieter area with no crowd and low traffic (S_02), for global reference. The choice of installation points was also driven by the power supply, so light poles, public offices and bike-sharing stations were preferred.

Source: https://wdl-data.fra1.digitaloceanspaces.com/torino/120_Euronoise2018.pdf

In [3]:
df_wifi = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/WIFI%20Count.csv', sep=',')
df_wifi.head()

Unnamed: 0,Time,No. of Visitors
0,2018-10-24 17:00,47
1,2018-10-24 18:00,155
2,2018-10-24 19:00,181
3,2018-10-24 20:00,211
4,2018-10-24 21:00,239


In [4]:
df_businesses = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/businesses.csv', sep=';')
df_businesses.head()

Unnamed: 0,WKT,ADDRESS,OPEN YEAR,OPEN MONTH,TYPE,Description,Merchandise Type
0,POINT (1396322.217 4990301.69),VIA CLAUDIO LUIGI BERTHOLLET 24,1977,1,EXTRALIMENTARI,PICCOLE STRUTTURE,Extralimentari
1,POINT (1396322.217 4990301.69),VIA CLAUDIO LUIGI BERTHOLLET 24,1985,6,ALIMENTARI,PICCOLE STRUTTURE,Panificio
2,POINT (1396303.762 4990325.001),VIA CLAUDIO LUIGI BERTHOLLET 25/F,2017,9,ALTRO,DIA di somministrazione,Nessuna
3,POINT (1396434.395 4990540.6),CORSO VITTORIO EMANUELE II 21/A,2013,10,ALTRO,DIA di somministrazione,Nessuna
4,POINT (1396434.395 4990540.6),CORSO VITTORIO EMANUELE II 21/A,2009,2,ALTRO,DIA di somministrazione,Nessuna


In [5]:
df_sim_june = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/sim_count/SIM_count_04_100618.csv', sep=';', encoding='latin-1')
df_sim_june.head()

Unnamed: 0,cluster,data_da,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi)
0,Presenze,2018-06-10T21:00:00Z,2018-06-10T22:00:00Z,3278,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
1,Presenze,2018-06-10T20:00:00Z,2018-06-10T21:00:00Z,3324,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
2,Presenze,2018-06-10T19:00:00Z,2018-06-10T20:00:00Z,3318,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
3,Presenze,2018-06-10T18:00:00Z,2018-06-10T19:00:00Z,3187,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
4,Presenze,2018-06-10T17:00:00Z,2018-06-10T18:00:00Z,2980,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600


In [6]:
df_sim_jan = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/sim_count/SIM_count_15_210118.csv', sep=';', encoding='latin-1')
df_sim_jan.head()

Unnamed: 0,cluster,data_da,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi)
0,Presenze,2018-01-21T22:00:00Z,2018-01-21T23:00:00Z,3026,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
1,Presenze,2018-01-21T21:00:00Z,2018-01-21T22:00:00Z,3088,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
2,Presenze,2018-01-21T20:00:00Z,2018-01-21T21:00:00Z,3119,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
3,Presenze,2018-01-21T19:00:00Z,2018-01-21T20:00:00Z,3114,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
4,Presenze,2018-01-21T18:00:00Z,2018-01-21T19:00:00Z,2991,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600


In [7]:
df_sim_march = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/sim_count/SIM_count_19_250318.csv', sep=';', encoding='latin-1')
df_sim_march.head()

Unnamed: 0,cluster,data_da,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi)
0,Presenze,2018-03-25T21:00:00Z,2018-03-25T22:00:00Z,3267,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
1,Presenze,2018-03-25T20:00:00Z,2018-03-25T21:00:00Z,3373,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
2,Presenze,2018-03-25T19:00:00Z,2018-03-25T20:00:00Z,3410,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
3,Presenze,2018-03-25T18:00:00Z,2018-03-25T19:00:00Z,3358,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
4,Presenze,2018-03-25T17:00:00Z,2018-03-25T18:00:00Z,3229,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600


In [8]:
# Skip the 8 metadata lines that precede the header row
df_noise_2018 = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/noise_data/san_salvario_2018.csv', skiprows=8, sep=';')
df_noise_2018.head()

Unnamed: 0,Data,Ora,C1,C2,C3,C4,"C5,,,,,"
0,01-01-2018,00:00,687,,760,,"66,6,,"
1,01-01-2018,01:00,683,,682,,"65,4,,"
2,01-01-2018,02:00,598,,644,,"64,4,,"
3,01-01-2018,03:00,674,,675,,"61,8,,"
4,01-01-2018,04:00,680,,645,,"60,5,,"


In [9]:
df_noise_2018['date_hour'] = df_noise_2018['Data'] + ' ' + df_noise_2018['Ora']
df_noise_2018.drop(columns= ['Data', 'Ora'], inplace= True)
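
The last noise column arrives with trailing separators in both its name (`C5,,,,,`) and its values (for example `66,6,,`). The sketch below shows one way to tidy that pattern on a small hand-made sample. It assumes the remaining comma is a decimal comma, which the source does not confirm, and the helper name `clean_trailing_commas` is hypothetical.

```python
import pandas as pd

def clean_trailing_commas(series):
    """Strip trailing ',' characters, then read the remaining comma as a decimal mark (assumption)."""
    cleaned = series.astype(str).str.rstrip(",").str.replace(",", ".", regex=False)
    # Anything that still is not a number (e.g. 'nan') becomes NaN
    return pd.to_numeric(cleaned, errors="coerce")

# Hand-made sample mimicking the pattern seen in the noise file
sample = pd.DataFrame({"C5,,,,,": ["66,6,,", "65,4,,", "57,0,,,"]})

# Fix the column name first, then the values
sample = sample.rename(columns=lambda name: name.rstrip(","))
sample["C5"] = clean_trailing_commas(sample["C5"])
```

The same helper could be applied to `df_noise_2018` before merging, provided the decimal-comma assumption holds for the real file.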

In [10]:
df_police_1 = pd.read_excel('https://github.com/McNickSisto/world_data_league/blob/main/stage_final/data/police_complaints/OpenDataContact_Gennaio_Giugno_2018.xlsx?raw=true')
df_police_1.head()

Unnamed: 0,Categoria criminologa,Sottocategoria Criminologica,Circoscrizione,Localita,Area Verde,Data,Ora
0,Allarme Sociale,Altro,6.0,BELMONTE/(VIA) ...,,01/02/2018,
1,Allarme Sociale,Altro,6.0,DONATORE DI SANGUE/(PIAZZA DEL) ...,,12/02/2018,
2,Allarme Sociale,Altro,4.0,CIBRARIO/LUIGI (VIA) ...,,26/02/2018,
3,Allarme Sociale,Altro,1.0,ROMA/(VIA) ...,,02/03/2018,
4,Allarme Sociale,Altro,4.0,ZUMAGLIA/(VIA) ...,,05/03/2018,


In [11]:
df_police_2 = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/police_complaints/OpenDataContact_Luglio_Dicembre_2018.csv')
df_police_2.head()

Unnamed: 0,Categoria criminologa,Sottocategoria Criminologica,Circoscrizione,Localita,Area Verde,Data,Ora
0,Allarme Sociale,Altro,8.0,D'AZEGLIO/MASSIMO (CORSO) ...,,16/07/2018,
1,Allarme Sociale,Altro,1.0,REGINA MARGHERITA/(CORSO) ...,,17/07/2018,
2,Allarme Sociale,Altro,10.0,DUINO/(VIA) ...,,14/09/2018,
3,Allarme Sociale,Altro,,,,02/10/2018,9.4
4,Allarme Sociale,Altro,9.0,CARDUCCI/GIOSUE' (PIAZZA) ...,,27/11/2018,11.53


In [46]:
df_police = pd.concat([df_police_1,df_police_2])
df_police

Unnamed: 0,Categoria criminologa,Sottocategoria Criminologica,Circoscrizione,Localita,Area Verde,Data,Ora
0,Allarme Sociale,Altro,6.0,BELMONTE/(VIA) ...,,01/02/2018,
1,Allarme Sociale,Altro,6.0,DONATORE DI SANGUE/(PIAZZA DEL) ...,,12/02/2018,
2,Allarme Sociale,Altro,4.0,CIBRARIO/LUIGI (VIA) ...,,26/02/2018,
3,Allarme Sociale,Altro,1.0,ROMA/(VIA) ...,,02/03/2018,
4,Allarme Sociale,Altro,4.0,ZUMAGLIA/(VIA) ...,,05/03/2018,
...,...,...,...,...,...,...,...
990,Qualità Urbana,Decoro e degrado urbano,6.0,VERCELLI/(CORSO) ...,,31/12/2018,11.08
991,Qualità Urbana,Veicoli abbandonati,4.0,BOSELLI/PAOLO (VIA) ...,,17/09/2018,
992,Qualità Urbana,Veicoli abbandonati,4.0,PIFFETTI/PIETRO (VIA) ...,,22/09/2018,14.01
993,Qualità Urbana,Veicoli abbandonati,6.0,FOSSATA/(VIA) ...,,22/09/2018,9.55


In [48]:
df_police['Ora'].isna().sum()/ len(df_police)

0.7809040590405905

### Preparing the dataframes for merging

In [13]:
df_noise_2018.head()

Unnamed: 0,C1,C2,C3,C4,"C5,,,,,",date_hour
0,687,,760,,"66,6,,",01-01-2018 00:00
1,683,,682,,"65,4,,",01-01-2018 01:00
2,598,,644,,"64,4,,",01-01-2018 02:00
3,674,,675,,"61,8,,",01-01-2018 03:00
4,680,,645,,"60,5,,",01-01-2018 04:00


In [14]:
df_wifi.rename(columns = {'Time': 'date_time'}, inplace=True)

In [15]:
df_sim_all = pd.concat([df_sim_jan, df_sim_march, df_sim_june], axis=0)
df_sim_all.reset_index(inplace=True)

In [16]:
# Reformat ISO timestamps (YYYY-MM-DDTHH:MM:SSZ) to 'DD-MM-YYYY HH:MM' with vectorized string slicing
ts = df_sim_all['data_da']
df_sim_all['data_da'] = ts.str[8:10] + ts.str[4:7] + '-' + ts.str[0:4] + ' ' + ts.str[11:16]


In [17]:
df_sim_all.rename(columns= {'data_da' : 'date_time'}, inplace=True)

In [18]:
df_sim_all.head()

Unnamed: 0,index,cluster,date_time,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi)
0,0,Presenze,21-01-2018 22:00,2018-01-21T23:00:00Z,3026,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
1,1,Presenze,21-01-2018 21:00,2018-01-21T22:00:00Z,3088,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
2,2,Presenze,21-01-2018 20:00,2018-01-21T21:00:00Z,3119,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
3,3,Presenze,21-01-2018 19:00,2018-01-21T20:00:00Z,3114,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
4,4,Presenze,21-01-2018 18:00,2018-01-21T19:00:00Z,2991,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
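
The SIM timestamps end in `Z`, which marks them as UTC. If the other datasets are recorded in local Turin time (an assumption, since the source does not say), the hours would need converting before they can be compared. A minimal sketch on two values taken from the tables above:

```python
import pandas as pd

# Two SIM timestamps as they appear in the raw files (UTC, 'Z' suffix)
sim = pd.DataFrame({"data_da": ["2018-01-21T22:00:00Z", "2018-06-10T21:00:00Z"]})

# Parse as timezone-aware UTC, convert to Europe/Rome, then drop the tz info
utc = pd.to_datetime(sim["data_da"], utc=True)
sim["local_time"] = utc.dt.tz_convert("Europe/Rome").dt.tz_localize(None)
```

Converting through a named time zone also handles the switch between winter and summer time, which a fixed one-hour shift would miss.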


In [19]:
df_police[df_police['Ora'].isna()] #many complaints do not have hours associated with them 

Unnamed: 0,Categoria criminologa,Sottocategoria Criminologica,Circoscrizione,Localita,Area Verde,Data,Ora
0,Allarme Sociale,Altro,6.0,BELMONTE/(VIA) ...,,01/02/2018,
1,Allarme Sociale,Altro,6.0,DONATORE DI SANGUE/(PIAZZA DEL) ...,,12/02/2018,
2,Allarme Sociale,Altro,4.0,CIBRARIO/LUIGI (VIA) ...,,26/02/2018,
3,Allarme Sociale,Altro,1.0,ROMA/(VIA) ...,,02/03/2018,
4,Allarme Sociale,Altro,4.0,ZUMAGLIA/(VIA) ...,,05/03/2018,
...,...,...,...,...,...,...,...
985,Qualità Urbana,Decoro e degrado urbano,3.0,RACCONIGI/(CORSO) ...,,23/12/2018,
986,Qualità Urbana,Decoro e degrado urbano,3.0,MONTE CUCCO/(CORSO) ...,,24/12/2018,
989,Qualità Urbana,Decoro e degrado urbano,8.0,CARDINALE MAURIZIO/(VIA) ...,,28/12/2018,
991,Qualità Urbana,Veicoli abbandonati,4.0,BOSELLI/PAOLO (VIA) ...,,17/09/2018,


In [21]:
df_weather = pd.read_csv("https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/all_weather.csv")
df_weather = df_weather.drop(columns = ['Unnamed: 0'])
df_weather.head()

Unnamed: 0,time,temp,winds,rainfall_mm,snowfall_mm
0,2018-01-01 00:00:00,1.04,0.366667,-0.01,2.6
1,2018-01-01 01:00:00,1.09,0.59,0.009,2.6
2,2018-01-01 02:00:00,1.05,0.45,0.008,2.266667
3,2018-01-01 03:00:00,0.89,0.4,0.006,2.266667
4,2018-01-01 04:00:00,0.73,0.78,-0.011,2.3


In [23]:
df_holidays = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/holidays.csv')
df_holidays

Unnamed: 0,Date,Day,Holiday
0,01-01-2018,monday,New year's Day
1,06-01-2018,saturday,La Befana
2,19-03-2018,monday,Father's day
3,25-03-2018,sunday,Palm Sunday
4,01-04-2018,sunday,Easter
5,02-04-2018,monday,Easter Monday
6,25-04-2018,wednesday,liberation
7,01-05-2018,tuesday,Labour day
8,09-05-2018,wednesday,Europe day
9,13-05-2018,sunday,mother's day


### Merging All dataframes

Merging the noise, WiFi, SIM, weather and holiday data on the hourly timestamp. The police complaints are not merged here.

In [26]:
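# Timestamps are day-first ('01-01-2018 00:00'), so the format is given explicitly
df_noise_2018['date_hour'] = pd.to_datetime(df_noise_2018['date_hour'], format="%d-%m-%Y %H:%M")
df_noise_2018['date_hour'] = df_noise_2018['date_hour'].dt.strftime("%d-%m-%y %H:%M")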
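
The noise timestamps are day-first strings such as `01-01-2018 00:00`. Without an explicit format, a string like `02-01-2018` can be read as either 2 January or 1 February, depending on the parser. A minimal sketch on made-up values shows how an explicit `format` removes the ambiguity:

```python
import pandas as pd

# Made-up day-first timestamps; the first one is ambiguous without a format
raw = pd.Series(["02-01-2018 00:00", "13-01-2018 05:00"])

# An explicit format pins day and month positions for every row
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M")
```

Keeping the parsed `datetime64` values as the merge key, rather than converting them back to strings, would avoid re-parsing them later.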

In [27]:
df_wifi.columns

Index(['date_time', 'No. of Visitors'], dtype='object')

In [41]:
df_wifi['date_time'] = pd.to_datetime(df_wifi['date_time'])
df_wifi['date_time'] = df_wifi['date_time'].dt.strftime("%d-%m-%y %H:%M")

In [29]:
df_final = df_noise_2018.merge(df_wifi, left_on= 'date_hour', right_on= 'date_time', how='left')
df_final

Unnamed: 0,C1,C2,C3,C4,"C5,,,,,",date_hour,date_time,No. of Visitors
0,687,,760,,"66,6,,",01-01-18 00:00,,
1,683,,682,,"65,4,,",01-01-18 01:00,,
2,598,,644,,"64,4,,",01-01-18 02:00,,
3,674,,675,,"61,8,,",01-01-18 03:00,,
4,680,,645,,"60,5,,",01-01-18 04:00,,
...,...,...,...,...,...,...,...,...
8755,619,602,603,596,616,31-12-18 19:00,31-12-18 19:00,158.0
8756,625,589,582,616,616,31-12-18 20:00,31-12-18 20:00,171.0
8757,628,567,592,582,593,31-12-18 21:00,31-12-18 21:00,151.0
8758,605,572,589,581,572,31-12-18 22:00,31-12-18 22:00,125.0


In [30]:
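# 'date_time' holds day-first strings ('21-01-2018 22:00'), so the format is given explicitly
df_sim_all['date_time'] = pd.to_datetime(df_sim_all['date_time'], format="%d-%m-%Y %H:%M")
df_sim_all['date_time'] = df_sim_all['date_time'].dt.strftime("%d-%m-%y %H:%M")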

In [31]:
df_final_1 = df_final.merge(df_sim_all, left_on= 'date_hour', right_on= 'date_time', how='left')
df_final_1

Unnamed: 0,C1,C2,C3,C4,"C5,,,,,",date_hour,date_time_x,No. of Visitors,index,cluster,date_time_y,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi)
0,687,,760,,"66,6,,",01-01-18 00:00,,,,,,,,,,
1,683,,682,,"65,4,,",01-01-18 01:00,,,,,,,,,,
2,598,,644,,"64,4,,",01-01-18 02:00,,,,,,,,,,
3,674,,675,,"61,8,,",01-01-18 03:00,,,,,,,,,,
4,680,,645,,"60,5,,",01-01-18 04:00,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17306,619,602,603,596,616,31-12-18 19:00,31-12-18 19:00,158.0,,,,,,,,
17307,625,589,582,616,616,31-12-18 20:00,31-12-18 20:00,171.0,,,,,,,,
17308,628,567,592,582,593,31-12-18 21:00,31-12-18 21:00,151.0,,,,,,,,
17309,605,572,589,581,572,31-12-18 22:00,31-12-18 22:00,125.0,,,,,,,,
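
After this merge the last visible row index moves from 8758 to 17309, which suggests that several SIM rows share the same timestamp. The sketch below, on hand-made data, shows one way to detect duplicated keys and collapse them before a left merge. Summing `numero_presenze` across layers is an assumption about what the analysis needs, and `Area 2` is a hypothetical layer name.

```python
import pandas as pd

# Hand-made stand-ins for the noise table (left) and the SIM table (right)
left = pd.DataFrame({"date_hour": ["01-01-18 00:00", "01-01-18 01:00"],
                     "C1": [68.7, 68.3]})
right = pd.DataFrame({
    "date_time": ["01-01-18 00:00", "01-01-18 00:00", "01-01-18 01:00"],
    "layer_nome": ["Area 1", "Area 2", "Area 1"],  # 'Area 2' is hypothetical
    "numero_presenze": [3000, 1200, 3100],
})

# Count keys that appear more than once on the right-hand side
dup_keys = right["date_time"].duplicated().sum()

# Collapse to one row per timestamp (assumption: total presence is what we want)
hourly = right.groupby("date_time", as_index=False)["numero_presenze"].sum()

# validate raises an error if either side still has duplicated keys
merged = left.merge(hourly, left_on="date_hour", right_on="date_time",
                    how="left", validate="one_to_one")
```

With one row per hour on both sides, the merged frame keeps the same number of rows as the noise table.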


In [32]:
df_weather['time'] = pd.to_datetime(df_weather['time'])
df_weather['time'] = df_weather['time'].dt.strftime("%d-%m-%y %H:%M")

In [33]:
df_final_2 = df_final_1.merge(df_weather, left_on= 'date_hour', right_on= 'time', how='left')
df_final_2

Unnamed: 0,C1,C2,C3,C4,"C5,,,,,",date_hour,date_time_x,No. of Visitors,index,cluster,...,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi),time,temp,winds,rainfall_mm,snowfall_mm
0,687,,760,,"66,6,,",01-01-18 00:00,,,,,...,,,,,,01-01-18 00:00,1.04,0.366667,-0.010,2.600000
1,683,,682,,"65,4,,",01-01-18 01:00,,,,,...,,,,,,01-01-18 01:00,1.09,0.590000,0.009,2.600000
2,598,,644,,"64,4,,",01-01-18 02:00,,,,,...,,,,,,01-01-18 02:00,1.05,0.450000,0.008,2.266667
3,674,,675,,"61,8,,",01-01-18 03:00,,,,,...,,,,,,01-01-18 03:00,0.89,0.400000,0.006,2.266667
4,680,,645,,"60,5,,",01-01-18 04:00,,,,,...,,,,,,01-01-18 04:00,0.73,0.780000,-0.011,2.300000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17306,619,602,603,596,616,31-12-18 19:00,31-12-18 19:00,158.0,,,...,,,,,,31-12-18 19:00,5.27,,0.002,4.200000
17307,625,589,582,616,616,31-12-18 20:00,31-12-18 20:00,171.0,,,...,,,,,,31-12-18 20:00,4.99,,0.001,3.633333
17308,628,567,592,582,593,31-12-18 21:00,31-12-18 21:00,151.0,,,...,,,,,,31-12-18 21:00,4.53,,0.011,2.600000
17309,605,572,589,581,572,31-12-18 22:00,31-12-18 22:00,125.0,,,...,,,,,,31-12-18 22:00,4.06,,0.011,1.966667


In [34]:
df_final_2.columns

Index(['C1', 'C2', 'C3', 'C4', 'C5,,,,,', 'date_hour', 'date_time_x',
       'No. of Visitors', 'index', 'cluster', 'date_time_y', 'data_a',
       'numero_presenze', 'layer_id', 'layer_nome', 'dettaglio(secondi)',
       'time', 'temp', 'winds', 'rainfall_mm', 'snowfall_mm'],
      dtype='object')

In [35]:
df_final_3 = df_final_2.drop(columns = ['date_time_x','date_time_y', 'time'] )

In [36]:
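# 'date_hour' is a day-first string with a two-digit year ('01-01-18 00:00')
df_final_3['date_hour'] = pd.to_datetime(df_final_3['date_hour'], format="%d-%m-%y %H:%M")
df_final_3['date'] = df_final_3['date_hour'].dt.strftime("%d-%m-%y")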

In [37]:
df_final_3.head()

Unnamed: 0,C1,C2,C3,C4,"C5,,,,,",date_hour,No. of Visitors,index,cluster,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi),temp,winds,rainfall_mm,snowfall_mm,date
0,687,,760,,"66,6,,",2018-01-01 00:00:00,,,,,,,,,1.04,0.366667,-0.01,2.6,01-01-18
1,683,,682,,"65,4,,",2018-01-01 01:00:00,,,,,,,,,1.09,0.59,0.009,2.6,01-01-18
2,598,,644,,"64,4,,",2018-01-01 02:00:00,,,,,,,,,1.05,0.45,0.008,2.266667,01-01-18
3,674,,675,,"61,8,,",2018-01-01 03:00:00,,,,,,,,,0.89,0.4,0.006,2.266667,01-01-18
4,680,,645,,"60,5,,",2018-01-01 04:00:00,,,,,,,,,0.73,0.78,-0.011,2.3,01-01-18


In [38]:
df_finalized = df_final_3.merge(df_holidays, left_on='date', right_on='Date', how="left")
# 1 when a holiday name was matched, 0 otherwise
df_finalized['isHoliday'] = df_finalized['Holiday'].notna().astype(int)
df_finalized.head(30)

Unnamed: 0,C1,C2,C3,C4,"C5,,,,,",date_hour,No. of Visitors,index,cluster,data_a,...,dettaglio(secondi),temp,winds,rainfall_mm,snowfall_mm,date,Date,Day,Holiday,isHoliday
0,687,,760.0,,"66,6,,",2018-01-01 00:00:00,,,,,...,,1.04,0.366667,-0.01,2.6,01-01-18,,,,0
1,683,,682.0,,"65,4,,",2018-01-01 01:00:00,,,,,...,,1.09,0.59,0.009,2.6,01-01-18,,,,0
2,598,,644.0,,"64,4,,",2018-01-01 02:00:00,,,,,...,,1.05,0.45,0.008,2.266667,01-01-18,,,,0
3,674,,675.0,,"61,8,,",2018-01-01 03:00:00,,,,,...,,0.89,0.4,0.006,2.266667,01-01-18,,,,0
4,680,,645.0,,"60,5,,",2018-01-01 04:00:00,,,,,...,,0.73,0.78,-0.011,2.3,01-01-18,,,,0
5,554,,567.0,,"59,5,,",2018-01-01 05:00:00,,,,,...,,0.78,0.55,-0.014,2.133333,01-01-18,,,,0
6,575,,532.0,,"58,2,,",2018-01-01 06:00:00,,,,,...,,0.83,0.63,-0.011,2.166667,01-01-18,,,,0
7,518,,,,"57,0,,,",2018-01-01 07:00:00,,,,,...,,1.0,1.22,-0.014,2.333333,01-01-18,,,,0
8,630,,,,"55,8,,,",2018-01-01 08:00:00,,,,,...,,1.27,1.4,-0.012,2.0,01-01-18,,,,0
9,538,,,,"56,5,,,",2018-01-01 09:00:00,,,,,...,,1.28,0.95,-0.012,2.433333,01-01-18,,,,0
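
In the output above, 1 January has `isHoliday` equal to 0 even though `df_holidays` lists New Year's Day. The `date` key uses a two-digit year (`01-01-18`) while the holidays `Date` column uses four digits (`01-01-2018`), so the string keys cannot match. A minimal sketch of joining on real dates instead, using hand-made rows:

```python
import pandas as pd

# Hand-made hourly rows and a one-row holiday table in the formats seen above
hours = pd.DataFrame({"date_hour": pd.to_datetime(["2018-01-01 00:00", "2018-01-02 00:00"])})
holidays = pd.DataFrame({"Date": ["01-01-2018"], "Holiday": ["New year's Day"]})

# Parse the holiday dates and truncate the hourly timestamps to midnight
holidays["Date"] = pd.to_datetime(holidays["Date"], format="%d-%m-%Y")
hours["date"] = hours["date_hour"].dt.normalize()

# Join on datetime values, so string formatting no longer matters
flagged = hours.merge(holidays, left_on="date", right_on="Date", how="left")
flagged["isHoliday"] = flagged["Holiday"].notna().astype(int)
```

Joining on datetime values sidesteps any mismatch between two-digit and four-digit year formats.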


In [39]:
df_finalized = df_finalized.drop(columns= ['Date'])

In [40]:
df_finalized

Unnamed: 0,C1,C2,C3,C4,"C5,,,,,",date_hour,No. of Visitors,index,cluster,data_a,...,layer_nome,dettaglio(secondi),temp,winds,rainfall_mm,snowfall_mm,date,Day,Holiday,isHoliday
0,687,,760,,"66,6,,",2018-01-01 00:00:00,,,,,...,,,1.04,0.366667,-0.010,2.600000,01-01-18,,,0
1,683,,682,,"65,4,,",2018-01-01 01:00:00,,,,,...,,,1.09,0.590000,0.009,2.600000,01-01-18,,,0
2,598,,644,,"64,4,,",2018-01-01 02:00:00,,,,,...,,,1.05,0.450000,0.008,2.266667,01-01-18,,,0
3,674,,675,,"61,8,,",2018-01-01 03:00:00,,,,,...,,,0.89,0.400000,0.006,2.266667,01-01-18,,,0
4,680,,645,,"60,5,,",2018-01-01 04:00:00,,,,,...,,,0.73,0.780000,-0.011,2.300000,01-01-18,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17306,619,602,603,596,616,2018-12-31 19:00:00,158.0,,,,...,,,5.27,,0.002,4.200000,31-12-18,,,0
17307,625,589,582,616,616,2018-12-31 20:00:00,171.0,,,,...,,,4.99,,0.001,3.633333,31-12-18,,,0
17308,628,567,592,582,593,2018-12-31 21:00:00,151.0,,,,...,,,4.53,,0.011,2.600000,31-12-18,,,0
17309,605,572,589,581,572,2018-12-31 22:00:00,125.0,,,,...,,,4.06,,0.011,1.966667,31-12-18,,,0


In [42]:
df_finalized.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17311 entries, 0 to 17310
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   C1                  16588 non-null  object        
 1   C2                  11923 non-null  object        
 2   C3                  7003 non-null   object        
 3   C4                  10086 non-null  object        
 4   C5,,,,,             17311 non-null  object        
 5   date_hour           17311 non-null  datetime64[ns]
 6   No. of Visitors     1639 non-null   float64       
 7   index               9054 non-null   float64       
 8   cluster             9054 non-null   object        
 9   data_a              9054 non-null   object        
 10  numero_presenze     9054 non-null   float64       
 11  layer_id            9054 non-null   object        
 12  layer_nome          9054 non-null   object        
 13  dettaglio(secondi)  9054 non-null   float64   

In [45]:
df_finalized.to_csv('Noise_weather_wifi_sim_holidays.csv')

## Conclusions

### Scalability and Impact
Tell us how applicable and scalable your solution is if you were to implement it in a city. Identify possible limitations and measure the potential social impact of your solution.

### Future Work
Now picture the following scenario: imagine you could have access to any type of data that could help you solve this challenge even better. What would that data be and how would it improve your solution? 🚀