# World Data League 2021
## Notebook Template

This notebook is one of the mandatory deliverables when you submit your solution (alongside the video pitch). Its structure follows the WDL evaluation criteria and it has dedicated cells where you can add descriptions. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work.

The notebook must:

*   💻 have all the code that you want the jury to evaluate
*   🧱 follow the predefined structure
*   📄 have markdown descriptions where you find necessary
*   👀 be saved with all the output that you want the jury to see
*   🏃‍♂️ be runnable


## Authors
- Nicholas Sistovaris
- Moritz Geiger
- Pravalika Myneni
- Sowmya Madela

## External links and resources

All external data and resources that were not provided by the WDL were acquired through the following links:

1. https://noise-planet.org/noisemodelling.html 
2. https://www.torinocitylab.it/en/asset-to/open-data 
3. https://www.officeholidays.com/countries/italy/turin/2018 
4. https://www.feiertagskalender.ch/index.php?geo=3815&jahr=2018&hl=en
5. http://webgis.arpa.piemonte.it/basicviewer_arpa_webapp/index.html?webmap=89aa175451d24ae0a1911e67957d9aec
6. http://aperto.comune.torino.it/dataset/zone-statistiche
7. https://openweathermap.org/history
8. https://developers.google.com/maps/documentation/places/web-service/details 

## Introduction

**Overview:**


_from challenge description_
<blockquote>

</blockquote>



**Research:**



## Development
Start coding here! 👩‍💻

Don't hesitate to create markdown cells to include descriptions of your work where you see fit, as well as commenting your code.

We know that you know exactly where to start when it comes to crunching data and building models, but don't forget that WDL is all about social impact...so take that into consideration as well.

### Imports (libraries) 📚

In [None]:
## TABULAR
import pandas as pd 
import numpy as np
import matplotlib

## GEO
import geopandas as gpd
import fiona
import folium
from folium.plugins import MarkerCluster, HeatMap, BeautifyIcon
from folium.map import LayerControl, Layer, FeatureGroup
from folium.vector_layers import Circle, CircleMarker
from shapely.geometry import LineString, Point
from shapely import wkt


## DATA
import os
import zipfile
from collections import Counter
import re
from datetime import datetime
import requests
from dotenv import load_dotenv, find_dotenv
import ast
import datetime as dt
from io import StringIO, BytesIO


## VIS
from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.tsa
import branca
import plotly.express as px

## TIME SERIES
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import mean_squared_error, mean_squared_log_error
import statsmodels.tsa.api as smt
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pmdarima.arima import auto_arima 


## MODELLING
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor as rfr

## NEURAL NETWORKS
from tensorflow import keras
from tensorflow.keras import layers
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

### Importing Dataframes

After a first look at the dataframes provided by the WDL, we decided to build our model on data from **2018**. 

- First, we wanted to understand noise and complaints in the pre-COVID context. The years 2020 and 2021 would not have been representative of Turin's nightlife.

- Second, we wanted a feature representing the number of people outside on an hourly basis. The WiFi-based No. of Visitors data was the most complete and the most representative of the population outside. However, it only covers October, November and December 2018, which is why we picked 2018 for the rest of our data.

In [None]:
# location of the sensors
df_sensors_def = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/noise_sensor_list.csv', sep=';')
df_sensors_def

Unnamed: 0,code,address,Lat,Long,streaming
0,s_01,"Via Saluzzo, 26 Torino",45059172,7678986,https://userportal.smartdatanet.it/userportal/...
1,s_02,"Via Principe Tommaso, 18bis Torino",45057837,7681555,https://userportal.smartdatanet.it/userportal/...
2,s_03,Largo Saluzzo Torino,45058518,7678854,https://userportal.smartdatanet.it/userportal/...
3,s_05,Via Principe Tommaso angolo via Baretti Torino,45057603,7681348,https://userportal.smartdatanet.it/userportal/...
4,s_06,"Corso Marconi, 27 Torino",45055554,768259,https://userportal.smartdatanet.it/userportal/...


**Note** The location of the sensors was optimized to cover all significant features of the “Movida” area: one in a very crowded square (S_03, not active in daytime), three in narrow streets with pubs and bars (S_01, S_04, S_05), one in a boulevard for traffic noise measurement (S_06) and the last one in a quieter area with no crowd and low traffic (S_02), for global reference. The choice of installation points was also driven by the power supply, so light poles, public offices and bike sharing stations were preferred.

Source: https://wdl-data.fra1.digitaloceanspaces.com/torino/120_Euronoise2018.pdf

In [None]:
df_wifi = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/WIFI%20Count.csv', sep=',')
df_wifi

Unnamed: 0,Time,No. of Visitors
0,2018-10-24 17:00,47
1,2018-10-24 18:00,155
2,2018-10-24 19:00,181
3,2018-10-24 20:00,211
4,2018-10-24 21:00,239
...,...,...
1634,2018-12-31 19:00,158
1635,2018-12-31 20:00,171
1636,2018-12-31 21:00,151
1637,2018-12-31 22:00,125


**Note** The hourly visitor counts above give us an estimate of the number of people outside at different hours.

In [None]:
df_businesses = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/businesses.csv', sep=';')
df_businesses.head()

Unnamed: 0,WKT,ADDRESS,OPEN YEAR,OPEN MONTH,TYPE,Description,Merchandise Type
0,POINT (1396322.217 4990301.69),VIA CLAUDIO LUIGI BERTHOLLET 24,1977,1,EXTRALIMENTARI,PICCOLE STRUTTURE,Extralimentari
1,POINT (1396322.217 4990301.69),VIA CLAUDIO LUIGI BERTHOLLET 24,1985,6,ALIMENTARI,PICCOLE STRUTTURE,Panificio
2,POINT (1396303.762 4990325.001),VIA CLAUDIO LUIGI BERTHOLLET 25/F,2017,9,ALTRO,DIA di somministrazione,Nessuna
3,POINT (1396434.395 4990540.6),CORSO VITTORIO EMANUELE II 21/A,2013,10,ALTRO,DIA di somministrazione,Nessuna
4,POINT (1396434.395 4990540.6),CORSO VITTORIO EMANUELE II 21/A,2009,2,ALTRO,DIA di somministrazione,Nessuna


**Note** Location & Description of various businesses 

In [None]:
df_sim_june = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/sim_count/SIM_count_04_100618.csv', sep=';', encoding='latin-1')
df_sim_june.head()

Unnamed: 0,cluster,data_da,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi)
0,Presenze,2018-06-10T21:00:00Z,2018-06-10T22:00:00Z,3278,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
1,Presenze,2018-06-10T20:00:00Z,2018-06-10T21:00:00Z,3324,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
2,Presenze,2018-06-10T19:00:00Z,2018-06-10T20:00:00Z,3318,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
3,Presenze,2018-06-10T18:00:00Z,2018-06-10T19:00:00Z,3187,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
4,Presenze,2018-06-10T17:00:00Z,2018-06-10T18:00:00Z,2980,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600


In [None]:
df_sim_jan = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/sim_count/SIM_count_15_210118.csv', sep=';', encoding='latin-1')
df_sim_jan.head()

Unnamed: 0,cluster,data_da,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi)
0,Presenze,2018-01-21T22:00:00Z,2018-01-21T23:00:00Z,3026,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
1,Presenze,2018-01-21T21:00:00Z,2018-01-21T22:00:00Z,3088,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
2,Presenze,2018-01-21T20:00:00Z,2018-01-21T21:00:00Z,3119,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
3,Presenze,2018-01-21T19:00:00Z,2018-01-21T20:00:00Z,3114,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
4,Presenze,2018-01-21T18:00:00Z,2018-01-21T19:00:00Z,2991,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600


In [None]:
df_sim_march = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/sim_count/SIM_count_19_250318.csv', sep=';', encoding='latin-1')
df_sim_march.head()

Unnamed: 0,cluster,data_da,data_a,numero_presenze,layer_id,layer_nome,dettaglio(secondi)
0,Presenze,2018-03-25T21:00:00Z,2018-03-25T22:00:00Z,3267,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
1,Presenze,2018-03-25T20:00:00Z,2018-03-25T21:00:00Z,3373,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
2,Presenze,2018-03-25T19:00:00Z,2018-03-25T20:00:00Z,3410,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
3,Presenze,2018-03-25T18:00:00Z,2018-03-25T19:00:00Z,3358,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600
4,Presenze,2018-03-25T17:00:00Z,2018-03-25T18:00:00Z,3229,5491d6d2-0c9e-47b7-bfde-c84c632efacc,Area 1,3600


In [None]:
df_sim_all = pd.concat([df_sim_jan, df_sim_march, df_sim_june], axis=0)
df_sim_all.reset_index(inplace=True)

**Note** The SIM card dataframes offer another way to estimate the number of people outside at a given hour: they record the presence of SIM card users for each hour of the day. We have SIM card data for January, March and June 2018.

In [None]:
df_noise_2018 = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/noise_data/san_salvario_2018.csv', skiprows= [0,1,2,3,4,5,6,7], sep =';')
df_noise_2018.head()

Unnamed: 0,Data,Ora,C1,C2,C3,C4,"C5,,,,,"
0,01-01-2018,00:00,687,,760,,"66,6,,"
1,01-01-2018,01:00,683,,682,,"65,4,,"
2,01-01-2018,02:00,598,,644,,"64,4,,"
3,01-01-2018,03:00,674,,675,,"61,8,,"
4,01-01-2018,04:00,680,,645,,"60,5,,"


**Note** The noise data consists of hourly noise measurements from 5 different sensors spread across the San Salvario area. We will use this data as the target of our time series models. 

In [None]:
df_police_1 = pd.read_excel('https://github.com/McNickSisto/world_data_league/blob/main/stage_final/data/police_complaints/OpenDataContact_Gennaio_Giugno_2018.xlsx?raw=true')
df_police_1.head()

Unnamed: 0,Categoria criminologa,Sottocategoria Criminologica,Circoscrizione,Localita,Area Verde,Data,Ora
0,Allarme Sociale,Altro,6.0,BELMONTE/(VIA) ...,,01/02/2018,
1,Allarme Sociale,Altro,6.0,DONATORE DI SANGUE/(PIAZZA DEL) ...,,12/02/2018,
2,Allarme Sociale,Altro,4.0,CIBRARIO/LUIGI (VIA) ...,,26/02/2018,
3,Allarme Sociale,Altro,1.0,ROMA/(VIA) ...,,02/03/2018,
4,Allarme Sociale,Altro,4.0,ZUMAGLIA/(VIA) ...,,05/03/2018,


In [None]:
df_police_2 = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/police_complaints/OpenDataContact_Luglio_Dicembre_2018.csv')
df_police_2.head()

Unnamed: 0,Categoria criminologa,Sottocategoria Criminologica,Circoscrizione,Localita,Area Verde,Data,Ora
0,Allarme Sociale,Altro,8.0,D'AZEGLIO/MASSIMO (CORSO) ...,,16/07/2018,
1,Allarme Sociale,Altro,1.0,REGINA MARGHERITA/(CORSO) ...,,17/07/2018,
2,Allarme Sociale,Altro,10.0,DUINO/(VIA) ...,,14/09/2018,
3,Allarme Sociale,Altro,,,,02/10/2018,9.4
4,Allarme Sociale,Altro,9.0,CARDUCCI/GIOSUE' (PIAZZA) ...,,27/11/2018,11.53


In [None]:
df_police = pd.concat([df_police_1,df_police_2])
df_police

Unnamed: 0,Categoria criminologa,Sottocategoria Criminologica,Circoscrizione,Localita,Area Verde,Data,Ora
0,Allarme Sociale,Altro,6.0,BELMONTE/(VIA) ...,,01/02/2018,
1,Allarme Sociale,Altro,6.0,DONATORE DI SANGUE/(PIAZZA DEL) ...,,12/02/2018,
2,Allarme Sociale,Altro,4.0,CIBRARIO/LUIGI (VIA) ...,,26/02/2018,
3,Allarme Sociale,Altro,1.0,ROMA/(VIA) ...,,02/03/2018,
4,Allarme Sociale,Altro,4.0,ZUMAGLIA/(VIA) ...,,05/03/2018,
...,...,...,...,...,...,...,...
990,Qualità Urbana,Decoro e degrado urbano,6.0,VERCELLI/(CORSO) ...,,31/12/2018,11.08
991,Qualità Urbana,Veicoli abbandonati,4.0,BOSELLI/PAOLO (VIA) ...,,17/09/2018,
992,Qualità Urbana,Veicoli abbandonati,4.0,PIFFETTI/PIETRO (VIA) ...,,22/09/2018,14.01
993,Qualità Urbana,Veicoli abbandonati,6.0,FOSSATA/(VIA) ...,,22/09/2018,9.55


In [None]:
df_weather = pd.read_csv("https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/all_weather.csv")
df_weather = df_weather.drop(columns = ['Unnamed: 0'])
df_weather.head()

Unnamed: 0,time,temp,winds,rainfall_mm,snowfall_mm
0,2018-01-01 00:00:00,1.04,0.366667,-0.01,2.6
1,2018-01-01 01:00:00,1.09,0.59,0.009,2.6
2,2018-01-01 02:00:00,1.05,0.45,0.008,2.266667
3,2018-01-01 03:00:00,0.89,0.4,0.006,2.266667
4,2018-01-01 04:00:00,0.73,0.78,-0.011,2.3


<br><br>
See details in [Appendix](#Weather-Data)

In [None]:
df_holidays = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/holidays.csv')
df_holidays

Unnamed: 0,Date,Day,Holiday
0,01-01-2018,monday,New year's Day
1,06-01-2018,saturday,La Befana
2,19-03-2018,monday,Father's day
3,25-03-2018,sunday,Palm Sunday
4,01-04-2018,sunday,Easter
5,02-04-2018,monday,Easter Monday
6,25-04-2018,wednesday,liberation
7,01-05-2018,tuesday,Labour day
8,09-05-2018,wednesday,Europe day
9,13-05-2018,sunday,mother's day


In [None]:
df_matches = pd.read_csv('GET MATCHES')
df_matches

FileNotFoundError: [Errno 2] No such file or directory: 'GET MATCHES'

In [None]:
df_opening_hours = pd.read_csv('GET OPENING HOURS')

### Merging Dataframes

In [None]:
noise1 = df_noise_2018.copy()

In [None]:
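# Build the hourly timestamp from the separate date ('Data', day-month-year) and hour ('Ora') columns
df_noise_2018['date_hour'] = pd.to_datetime(df_noise_2018['Data'] + ' ' + df_noise_2018['Ora'], format='%d-%m-%Y %H:%M', errors='coerce')
df_noise_2018['date_hour'] = df_noise_2018['date_hour'].dt.strftime("%d-%m-%y %H:%M")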

In [None]:
df_noise_2018.head()

In [None]:
df_wifi.rename(columns = {'Time': 'date_time'}, inplace=True)
df_wifi.columns

In [None]:
df_wifi['date_time'] = pd.to_datetime(df_wifi['date_time'])
df_wifi['date_time'] = df_wifi['date_time'].dt.strftime("%d-%m-%y %H:%M")

In [None]:
df_weather['time'] = pd.to_datetime(df_weather['time'])
df_weather['time'] = df_weather['time'].dt.strftime("%d-%m-%y %H:%M")

In [None]:
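# Parse the ISO timestamps (e.g. '2018-06-10T21:00:00Z') in one vectorized call.
# Rebuilding day-first strings row by row risks pd.to_datetime reading them month-first later on.
df_sim_all['data_da'] = pd.to_datetime(df_sim_all['data_da'], format='%Y-%m-%dT%H:%M:%SZ')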

In [None]:
df_sim_all.rename(columns= {'data_da' : 'date_time'}, inplace=True)

In [None]:
df_sim_all

In [None]:
df_sim_all['date_time'] = pd.to_datetime(df_sim_all['date_time'])
df_sim_all['date_time'] = df_sim_all['date_time'].dt.strftime("%d-%m-%y %H:%M")

Merging noise, WiFi, SIM, weather and holiday data on the hourly timestamp

In [None]:
df_final = df_noise_2018.merge(df_wifi, left_on= 'date_hour', right_on= 'date_time', how='left')
df_final

In [None]:
df_final_1 = df_final.merge(df_sim_all, left_on= 'date_hour', right_on= 'date_time', how='left')
df_final_1

In [None]:
df_final_2 = df_final_1.merge(df_weather, left_on= 'date_hour', right_on= 'time', how='left')
df_final_2

In [None]:
df_final_2.columns

In [None]:
df_final_3 = df_final_2.drop(columns = ['date_time_x','date_time_y', 'time'] )

In [None]:
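# 'date_hour' is a day-first string, so state the format explicitly to avoid a month-first parse
df_final_3['date_hour'] = pd.to_datetime(df_final_3['date_hour'], format="%d-%m-%y %H:%M")
# Four-digit year so the key matches the 'Date' column of df_holidays (e.g. '01-01-2018')
df_final_3['date'] = df_final_3['date_hour'].dt.strftime("%d-%m-%Y")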

In [None]:
df_final_3.head()

In [None]:
df_finalized = df_final_3.merge(df_holidays, left_on='date', right_on = 'Date', how ="left")
df_finalized['isHoliday'] = df_finalized['Holiday'].notna().astype(int)
df_finalized.head(30)

In [None]:
df_finalized = df_finalized.drop(columns= ['Date'])

In [None]:
df_finalized

In [None]:
df_finalized.info()

In [None]:
df_finalized.to_csv('Noise_weather_wifi_sim_holidays.csv')
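
The merges in this section join on timestamps that were formatted as strings, which only works if every source ends up with the same day/month order. As a minimal sketch of an alternative, assuming toy frames with made-up values that mimic the noise and WiFi tables, each source can be parsed with its own explicit format and joined on the datetime column directly:

```python
import pandas as pd

# Toy frames standing in for the noise and WiFi tables (values are made up).
noise = pd.DataFrame({
    "Data": ["01-02-2018", "01-02-2018"],   # day-month-year, as in the noise file
    "Ora": ["00:00", "01:00"],
    "C1": [68.7, 68.3],
})
wifi = pd.DataFrame({
    "Time": ["2018-02-01 00:00", "2018-02-01 01:00"],  # year-month-day, as in the WiFi file
    "No. of Visitors": [47, 155],
})

# Parse each source with its own explicit format so day and month cannot be swapped.
noise["date_hour"] = pd.to_datetime(noise["Data"] + " " + noise["Ora"], format="%d-%m-%Y %H:%M")
wifi["date_hour"] = pd.to_datetime(wifi["Time"], format="%Y-%m-%d %H:%M")

# Join on the datetime column itself; a left join keeps every noise reading.
merged = noise.merge(wifi[["date_hour", "No. of Visitors"]], on="date_hour", how="left")
print(merged[["date_hour", "C1", "No. of Visitors"]])
```

Here 1 February is matched to 1 February in both tables, which would not be guaranteed if `01-02-2018` were parsed without a format.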

### Preprocessing Data

In [None]:
#noise_2018=pd.read_csv('/content/drive/MyDrive/finals/noise_data/san_salvario_2018.csv',skiprows=[0,1,2,3,4,5,6,7],delimiter=';')
df=pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/Noise_weather_wifi_sim_holidays.csv')
#Converting to date time
df['date_hour']=pd.to_datetime(df['date_hour'])
df=df.drop(columns=['Unnamed: 0','C1','C2','C3','C4','C5,,,,,','date','Day'])
df.info()

In [None]:
df['date']=df['date_hour'].dt.date
df['hour']=df['date_hour'].dt.hour
df['day']=df['date_hour'].dt.dayofweek
df.head(2)

In [None]:
set(df['day'])

In [None]:
df_noise_2018.head(2)

In [None]:
noise1

In [None]:
noise1['Ora']=pd.to_datetime(noise1['Ora']).dt.hour
# 'Data' is day-month-year; an explicit format avoids swapping day and month
noise1['Data']=pd.to_datetime(noise1['Data'], format='%d-%m-%Y').dt.date

In [None]:
# The last sensor column carries trailing separators in its header and values ('C5,,,,,', '66,6,,')
noise1 = noise1.rename(columns={'C5,,,,,': 'C5'})
# Convert the noise readings from decimal-comma strings to floats
for col in ['C1', 'C2', 'C3', 'C4', 'C5']:
    cleaned = noise1[col].astype(str).str.rstrip(',').str.replace(',', '.', regex=False)
    noise1[col] = pd.to_numeric(cleaned, errors='coerce')
noise1.head(2)

In [None]:
new_df = pd.merge(noise1, df,  how='inner', left_on=['Data','Ora'], right_on = ['date','hour'])
new_df.head()

In [None]:
new_df.columns

In [None]:
new_df=new_df.drop(columns=['Data','Ora'])

In [None]:
# Filling the null values with the mean for the same hour and day of the week
for col in ['C1', 'C2', 'C3', 'C4', 'C5']:
    new_df[col] = new_df.groupby(['hour', 'day'])[col].transform(lambda x: x.fillna(round(x.mean(), 1)))
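
To make the imputation step concrete, here is a small sketch on a toy frame (the values are made up and only the `C1` column is shown). Each missing reading is replaced by the mean of the readings that share its hour and day of week:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "hour": [22, 22, 22, 23, 23],
    "day":  [4, 4, 4, 4, 4],
    "C1":   [70.0, np.nan, 72.0, 65.0, np.nan],
})

# Each gap is filled with the mean of its own (hour, day) group, rounded to one decimal.
toy["C1"] = toy.groupby(["hour", "day"])["C1"].transform(
    lambda x: x.fillna(round(x.mean(), 1))
)
print(toy["C1"].tolist())  # [70.0, 71.0, 72.0, 65.0, 65.0]
```

A group with no valid reading at all would stay empty, since its mean is itself missing.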

In [None]:
new_df.head(2)

In [None]:
new_df.isnull().sum()

In [None]:
# Logarithmic (energetic) mean of the five sensors: average the linear energies, then convert back to dB
sensor_cols = ['C1', 'C2', 'C3', 'C4', 'C5']
new_df['Log_Avg'] = 10 * np.log10((10 ** (new_df[sensor_cols] / 10)).mean(axis=1, skipna=False))
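
`Log_Avg` is a logarithmic average: decibel values are converted to linear energy, averaged, and converted back to decibels. A minimal sketch of the same formula as a standalone function (the name `energetic_mean_db` is ours, for illustration only):

```python
import numpy as np

def energetic_mean_db(levels_db):
    """Average sound levels in dB by averaging their linear energies."""
    levels = np.asarray(levels_db, dtype=float)
    return 10 * np.log10(np.mean(10 ** (levels / 10)))

print(round(energetic_mean_db([60, 60]), 2))   # 60.0, identical levels are unchanged
print(round(energetic_mean_db([60, 70]), 2))   # 67.4, the louder source dominates
```

The second result is well above the arithmetic mean of 65, which is why a plain average of the sensor columns would understate loud hours.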

In [None]:
new_df.head(2)

In [None]:
# Correlations are only defined for numeric columns
correlation_mat = new_df.select_dtypes(include='number').corr()
sns.heatmap(correlation_mat, annot = True)
plt.show()

In [None]:
corr_pairs = correlation_mat.unstack()
print(corr_pairs)

In [None]:
sorted_pairs = corr_pairs.sort_values(kind="quicksort")
strong_pairs = sorted_pairs[abs(sorted_pairs) > 0.5]
strong_pairs

In [None]:
new_df['data_a']

### Data Exploration

In [None]:
df_police[df_police['Ora'].isna()] #many complaints do not have hours associated with them 

In [None]:
noise_2018=pd.read_csv('raw_data/noise_data/san_salvario_2018.csv',
                       skiprows=8,
                       delimiter=';',
                      decimal=',',
#                       parse_dates=[['Data', 'Ora']],
                      )

# # workaround for hour concat issue
noise_2018['Data'] = pd.to_datetime(noise_2018['Data'], format='%d-%m-%Y', errors='coerce')
noise_2018['date_hour'] = noise_2018.apply(lambda x: pd.to_datetime(str(x.Data) + ' ' + str(x.Ora), errors='coerce'), axis=1)
noise_2018 = noise_2018.drop(columns=['Data', 'Ora'])


noise_2018.info()

In [None]:
# plot matches with sensor data
match = pd.read_csv('raw_data/football/matches_2018.csv', index_col=0).set_index('Date')
match['is_match'] = match.is_match.apply(lambda x: x+80 if x == 1 else x)
fig = px.line(noise_2018.set_index('date_hour'))
fig.add_scatter(x=match.index, 
                y=match['is_match'], 
                mode='markers',
                name='football match'
               )


### Modelling

#### ARIMA

In [None]:
data=pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/Imputed_Data_Final.csv')
data=data.drop(columns='Unnamed: 0')
data.head(4)

In [None]:
data_i = data.set_index('date_hour')
data_i.head(2)

In [None]:
df=data_i['Log_Avg']
df.head(2)

In [None]:
df.plot(figsize=(20,5))

In [None]:
additive = seasonal_decompose(df, period=52, model='additive', extrapolate_trend='freq')

In [None]:
additive_df = pd.concat([additive.seasonal, additive.trend, additive.resid, additive.observed], axis=1)
additive_df.columns = ['seasonal', 'trend', 'resid', 'actual_values']
additive_df.head()

In [None]:
plt.rcParams.update({'figure.figsize': (20,10)})
additive.plot().suptitle('Additive Decompose')
# The trend and the residuals are interesting: both show periods of high variability.

In [None]:
trend = additive.trend
from statsmodels.tsa.stattools import adfuller
result = adfuller(trend.values)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Original Series
fig, axes = plt.subplots(3, 2, sharex=True)
axes[0, 0].plot(trend.values); axes[0, 0].set_title('Original Series')
plot_acf(trend.values, ax=axes[0, 1]).suptitle('Original Series', fontsize=0)
# 1st Differencing
diff1 = trend.diff().dropna()
axes[1, 0].plot(diff1.values)
axes[1, 0].set_title('1st Order Differencing')
plot_acf(diff1.values, ax=axes[1, 1]).suptitle('1st Order Differencing', fontsize=0)
# 2nd Differencing
diff2 = trend.diff().diff().dropna()
axes[2, 0].plot(diff2.values)
axes[2, 0].set_title('2nd Order Differencing')
plot_acf(diff2.values, ax=axes[2, 1]).suptitle('2nd Order Differencing', fontsize=0)

In [None]:
plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})
size = 100
fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(diff1.values[:size])
axes[0].set_title('1st Order Differencing')
axes[1].set(ylim=(0,5))
plot_pacf(diff1.values[:size], lags=50, ax=axes[1]).suptitle('1st Order Differencing', fontsize=0)

In [None]:
from statsmodels.tsa.arima_model import ARIMA
train = trend[:3000]
test  = trend[3000:]
# order = (p=1, d=1, q=1)
model = ARIMA(train, order=(1, 1, 1))  
model = model.fit(disp=0)  
print(model.summary())

In [None]:
# Plot residual errors
residuals = pd.DataFrame(model.resid)
fig, ax = plt.subplots(1,2)
residuals.plot(title="Residuals", ax=ax[0])
residuals.plot(kind='kde', title='Density', ax=ax[1])

In [None]:
# Forecast exactly as many steps as there are test observations
fc, se, conf = model.forecast(len(test), alpha=0.05)
# Make as pandas series
fc_series = pd.Series(fc, index=test.index)
lower_series = pd.Series(conf[:, 0], index=test.index)
upper_series = pd.Series(conf[:, 1], index=test.index)
# Plot
plt.figure(figsize=(12,5), dpi=100)
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)

#### Moving Average

In [None]:
df.head(2)

Moving Average Smoothing is a technique applied to time series to remove the fine-grained variation between time steps. The hope of smoothing is to remove noise and better expose the signal of the underlying causal processes.
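
As a small worked example of the idea (toy values, not taken from the noise data), a window of 3 replaces each point by the mean of itself and the two previous observations. Shifting that rolling mean by one step turns it into a naive forecast that only uses past values:

```python
import pandas as pd

series = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0])
window = 3

# Smoothing: each value is the mean of itself and the two previous observations.
smoothed = series.rolling(window=window).mean().dropna()
print(smoothed.tolist())   # [62.0, 64.0, 66.0]

# Forecasting: shift by one step so a prediction only uses past observations.
forecast = series.rolling(window=window).mean().shift(1).dropna()
print(forecast.tolist())   # [62.0, 64.0]
```
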

In [None]:
plt.rcParams["figure.figsize"] = (20,6)
df.plot()
plt.show()

In [None]:
# Tail-rolling average transform
rolling = df.rolling(window=3)
rolling_mean = rolling.mean()
rolling_mean.dropna(inplace= True)
print(rolling_mean.head())
# plot original and transformed dataset
df.plot()
rolling_mean.plot(color='lightgreen')
plt.show()

In [None]:
from numpy import mean
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
# seed the history with the first `window` observations and forecast the rest
X = df.values
window = 3
history = [X[i] for i in range(window)]
test = [X[i] for i in range(window, len(X))]
predictions = list()
# walk forward over time steps in test
for t in range(len(test)):
    length = len(history)
    # forecast = mean of the last `window` observed values
    yhat = mean([history[i] for i in range(length-window,length)])
    obs = test[t]
    predictions.append(yhat)
    # the true observation becomes available for the next step
    history.append(obs)
error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)
# plot
pyplot.plot(test)
pyplot.plot(predictions, color='lightcoral')
pyplot.show()
# zoom plot
pyplot.plot(test[0:100])
pyplot.plot(predictions[0:100], color='lightcoral')
pyplot.show()

## Conclusions

### Scalability and Impact
Tell us how applicable and scalable your solution is if you were to implement it in a city. Identify possible limitations and measure the potential social impact of your solution.

### Future Work
Now picture the following scenario: imagine you could have access to any type of data that could help you solve this challenge even better. What would that data be and how would it improve your solution? 🚀

# Appendix 

## Weather Data

### Data

In [None]:
weather = pd.read_csv('raw_data/weather/weather_1.csv',
#                       nrows=1000, #rm later
                      sep=';',
#                       decimal=',',
                      skiprows=4,
#                       parse_dates=[[0, 1]],
#                       dayfirst=True,
                      header=0,
                      names=['date', 'hour', 'rainfall_mm', 'snowfall_mm'],
                     )

# workaround for hour concat issue
weather['date'] = pd.to_datetime(weather['date'], format='%d-%m-%Y', errors='coerce')
weather['date_hour'] = weather.apply(lambda x: pd.to_datetime(str(x.date) + ' ' + str(x.hour), errors='coerce'), axis=1)

# workaround for decimal issue
weather['rainfall_mm'] = weather.rainfall_mm.apply(lambda x: str(x).replace(',','.'))
weather['snowfall_mm'] = weather.snowfall_mm.apply(lambda x: str(x).replace(',','.'))

In [None]:
weather2 = pd.read_csv('raw_data/weather/weather_2.csv', 
                 sep=';', 
                 skiprows=4, 
                 header=0, 
#                  decimal=',',
#                 converters={2:lambda x: x.replace(',', '.')},
#                 parse_dates=[[0, 1]],
                names=['date', 'hour', 'winds'],
                na_values={2:'',
                            3:''},
                dayfirst=True,
                )
# workaround for hour concat issue
weather2['date'] = pd.to_datetime(weather2['date'], format='%d-%m-%Y', errors='coerce')
weather2['date_hour'] = weather2.apply(lambda x: pd.to_datetime(str(x.date) + ' ' + str(x.hour), errors='coerce'), axis=1)

weather2['winds'] = weather2.winds.apply(lambda x: str(x).replace(',','.'))

In [None]:
# weather['date_hour'] = pd.to_datetime(weather['date_hour'], errors='coerce')
weather_1 = weather.dropna(subset=['date_hour'])

# weather2['date_hour'] = pd.to_datetime(weather2['date_hour'], errors='coerce')
weather_2 = weather2.dropna(subset=['date_hour'])

In [None]:
merged_weather = weather_2.merge(weather_1,
                                right_on='date_hour',
                                left_on='date_hour',
                                )

In [None]:
merged_weather.sort_values(by='date_hour').tail()

In [None]:
merged_weather['hourly_date'] = merged_weather.date_hour.apply(lambda x: x.floor('h'))

In [None]:
merged_weather = merged_weather.astype({'winds': float,
                      'rainfall_mm':float,
                      'snowfall_mm':float})

In [None]:
hourly_weather = merged_weather.groupby('hourly_date').mean()

In [None]:
hourly_weather.info()

In [None]:
hourly_weather.to_csv('hourly_weather.csv')

In [None]:
hourly_weather.head()

### Open Weather Map

In [None]:
# API KEY
load_dotenv(find_dotenv())
OWM_API = os.environ.get("OWM_API")

In [None]:
# init time range
range_2019 = pd.DataFrame(pd.date_range('2016-06-01', '2021-06-12', freq='h'), columns=['hour'])
range_2019.tail().hour

In [None]:
req = 'http://history.openweathermap.org/data/2.5/history/wdl'
start = range_2019.hour.min().value
inter = range_2019.hour.max().value
end = range_2019.hour.max().value
# tail1 = tail.min().value
# tail2 = tail.max().value
params = {
    'id':'3165524', # ID of Turin
    'type':'hour',
    'start':str(start)[:10], # unix time
    'end':str(end)[:10],
    'appid': OWM_API
}

r = requests.get(req, params=params)


# with open('data/weather.txt', 'w') as outfile:
#     json.dump(r.json(), outfile)
    
weather = r.json()
lst = weather.get('list')
dct = {x.get('dt'):x.get('weather')[0].get('main') for x in lst}
weather_df = pd.DataFrame.from_dict(dct, 
                                    orient='index', 
                                    columns=['weather']).reset_index().rename(columns={'index':'time'})
weather_df['rain'] = weather_df.weather == 'Rain'

In [None]:
lst = weather.get('list')
dct = {x.get('dt'):x.get('main').get('temp') for x in lst}

In [None]:
weather_df = pd.DataFrame.from_dict(dct, 
                                    orient='index', 
                                    columns=['temp']).reset_index().rename(columns={'index':'time'})
weather_df['temp'] = weather_df.temp-273.15
weather_df['time'] = pd.to_datetime(weather_df.time, unit='s')

In [None]:
weather_df.info()

In [None]:
merge_all = weather_df.merge(hourly_weather, left_on='time', right_index=True)

In [None]:
merge_all.to_csv('all_weather.csv')

In [None]:
merge_all

## Matches Data

In [None]:
# API KEY
load_dotenv(find_dotenv())
FOOTBALL = os.environ.get("FOOTBALL")

In [None]:
# headers = {'X-Auth-Token': FOOTBALL}
# url = 'https://api.football-data.org/v2/matches'
# params = {'dateFrom': '2018-04-14',
#          'dateTo': '2018-04-16'}
# r = requests.get(url, headers=headers, params=params)
# r.json()

In [None]:
root = 'raw_data/football/'
dfs = []
for i in os.listdir(root):
    if '.csv' in i:
        df = pd.read_csv(root+i)
        dfs.append(df)

In [None]:
# filter all by juve
juve1 = dfs[0][(dfs[0]['HomeTeam'] == 'Juventus') \
              | (dfs[0]['AwayTeam'] == 'Juventus')]['Date']
juve1 = pd.to_datetime(juve1, format='%d/%m/%Y')

juve2 = dfs[1][(dfs[1]['Home Team'] == 'Juventus') \
               | (dfs[1]['Away Team'] == 'Juventus')]['Date']
juve2 = pd.to_datetime(juve2.apply(lambda x: x[:10]), format="%d/%m/%Y")

juve3 = dfs[2][(dfs[2]['HomeTeam'] == 'Juventus') \
              | (dfs[2]['AwayTeam'] == 'Juventus')]['Date']
juve3 = pd.to_datetime(juve3, format='%d/%m/%y')

juve4 = dfs[3][(dfs[3]['Home Team'] == 'Juventus') \
               | (dfs[3]['Away Team'] == 'Juventus')]['Date']
juve4 = pd.to_datetime(juve4.apply(lambda x: x[:10]), format="%d/%m/%Y")

In [None]:
# concat all dates
all_concat = pd.DataFrame(pd.concat([juve1, juve2, juve3, juve4]))
# all_concat['Date'] = pd.to_datetime(all_concat.Date)
all_concat['is_match'] = 1

In [None]:
all_concat.sort_values(by='Date')

In [None]:
# get all 2018 matches
all_concat_2018 = all_concat[(all_concat.Date > '2018-01-01') \
                            & (all_concat.Date < '2018-12-31')]

In [None]:
# put in 2018 time series
r = pd.date_range('2018-01-01', '2018-12-31', freq='h')
matches = all_concat_2018.set_index('Date').reindex(r).rename_axis('Date').reset_index()

In [None]:
matches.head()

In [None]:
matches.to_csv('raw_data/football/matches_2018.csv')
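
Because the parsed match dates carry no time of day, the reindex above appears to set `is_match` only on the midnight row of each match date. That suits the marker plot, but if a flag on every hour of a match day is wanted as a model feature, one possible sketch looks like this (the `match_dates` values are hypothetical stand-ins for `all_concat_2018['Date']`):

```python
import pandas as pd

# Hypothetical match dates standing in for all_concat_2018['Date'].
match_dates = pd.to_datetime(["2018-01-06", "2018-01-22"])

hours = pd.date_range("2018-01-05", "2018-01-07 23:00", freq="h")
flags = pd.DataFrame({"Date": hours})

# normalize() drops the time of day, so every hour of a match day compares equal to its date.
flags["is_match"] = flags["Date"].dt.normalize().isin(match_dates).astype(int)

# Number of flagged hours per calendar day
print(flags.groupby(flags["Date"].dt.date)["is_match"].sum())
```
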

## Opening Hours Data

The steps in this part are spread over several cells, so here is an overview:
We use the ```nearbysearch``` endpoint [Link](https://developers.google.com/maps/documentation/places/web-service/search#PlaceSearchRequests) to collect all ```bars``` and ```restaurants``` around the sensor. 

Then we take the unique id ```reference``` of each business in the list and pass it to the ```place_details``` API [Link](https://developers.google.com/maps/documentation/places/web-service/details) to get its business hours.

From there we extract all the ```open``` and ```close``` time elements and stack them in a dataframe keyed by day of the week (0-6). 

Finally, we merge the resulting counts with an empty hourly time series for 2018. 
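
The extraction and counting steps described above can be sketched on a small example. This is only an illustration: `sample_periods` is hypothetical data shaped like the `opening_hours.periods` field of a Place Details response, and the helper names `to_day_hour` and `count_events` are not part of the pipeline below.

```python
from collections import Counter

# Hypothetical sample shaped like `opening_hours.periods`
# (Google numbers days 0-6 starting on Sunday).
sample_periods = [
    {"open": {"day": 0, "time": "1200"}, "close": {"day": 0, "time": "2300"}},
    {"open": {"day": 1, "time": "1130"}, "close": {"day": 1, "time": "1500"}},
    {"open": {"day": 1, "time": "1900"}, "close": {"day": 2, "time": "0030"}},
]


def to_day_hour(event):
    """Map a {day, time} element to a '<weekday>_<hour>' key with Monday = 0."""
    weekday = (event["day"] - 1) % 7  # Sunday (0) becomes 6, Monday (1) becomes 0
    hour = int(event["time"][:2])
    return f"{weekday}_{hour}"


def count_events(periods, kind):
    """Count 'open' or 'close' events per day_hour key, skipping missing ones."""
    return Counter(to_day_hour(p[kind]) for p in periods if p.get(kind))


open_counts = count_events(sample_periods, "open")
close_counts = count_events(sample_periods, "close")
```

The `'<weekday>_<hour>'` key has the same form as the `day_time` identifier used in the cells below, so the counts can be joined onto an hourly 2018 range in the same way.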


In [None]:
# API KEY
load_dotenv(find_dotenv())
GOOGLE = os.environ.get("GOOGLE")

In [None]:
# first find all places of the given type (here: restaurants) around the sensor

url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'
params = {            
            'location':'45.05917,7.67899', # sensor location (lat,lng)
            'radius':'200',
            'type':'restaurant',
            'key':GOOGLE,
            # the request parameter is `pagetoken`; `next_page_token` is the field name in the response
            'pagetoken':'Aap_uED24ODLIlOhPdAHG7xFrCg_OrsQ_jAruvTm3QSG4Qbnp5Q85Aa4K7ar-QgnGI7Xnl1epc9YIEj17piMfVpFUxQysBwi8XTzdWbtl6IBGKTKQwV_kxhaAUWr8JG6XVo-BVKHd8NJUwiTP-_uQvkKxc5vLZ4-v6T8ZBuS42zw5DE1L2KgNPCbm86EsPhPYOj8L1MXTRdEm_GhmQSdOt8nDxG4gKkbxiXvmHNTmuBLavqN-VrbpkRBBoVZz_t2P53_ShPgndMEwlt55EYlZHCYK2gHymy9WJjMjKn3VzS6CfcTQJ-TjgsxsrRjSqNXV4T5i2qusSJ__gsam11RBY8XRADB31i-ec_wYCh1529gNKKy9tdQbidVaQjAI72wQ-7yzTZXGzxpz8ob_DHkdVdyJLxijWoHqsXY7oQM-W3Db0u08SHwaooMyb3Da9Ij'
         } 

r = requests.get(url, params=params)
r.json()

In [None]:
results = r.json().get('results')
results2 = r.json().get('results')
results3 = r.json().get('results')
results4 = r.json().get('results')
results5 = r.json().get('results')
results6 = r.json().get('results')

In [None]:
bars = results + results2 + results3 + results4 + results5 + results6
restaurants = results + results2 + results3 + results4 + results5 + results6
len(restaurants)

In [None]:
# get specific opening hrs from fetched bars/restaurants
url = 'https://maps.googleapis.com/maps/api/place/details/json'
params = {
    'key':GOOGLE,
    'fields':'opening_hours'
         }
opening_hrs = []
for bar in restaurants:
    reference = bar.get('reference')
    params['place_id'] = reference
    r = requests.get(url, params=params)
    opening_hrs.append(r)

In [None]:
contents_hrs = [r.json() for r in opening_hrs]
periods = []
for x in contents_hrs:
    try:
        hr = x.get('result').get('opening_hours').get('periods')
        periods.append(hr)
    except AttributeError:
        # no 'result' or no 'opening_hours' for this place: skip it
        pass

In [None]:
# remove 24h open bars
new = [x for x in periods if len(x) > 1]

In [None]:
closing = []
for x in new:
    for i in x:
        _close = i.get('close')
        closing.append(_close)
opening = []
for x in new:
    for i in x:
        _open = i.get('open')
        opening.append(_open)

In [None]:
opening_times_rest = pd.DataFrame(opening)
closing_times_rest = pd.DataFrame(closing)
closing_times_rest['time'] = pd.to_datetime(closing_times_rest['time'], format='%H%M')
opening_times_rest['time'] = pd.to_datetime(opening_times_rest['time'], format='%H%M')
closing_times_rest['day'] = closing_times_rest.day.apply(lambda x: x-1 if x != 0 else 6)
opening_times_rest['day'] = opening_times_rest.day.apply(lambda x: x-1 if x != 0 else 6)

In [None]:
# create unique day_hr identifier
closing_times_rest['day_time'] = closing_times_rest.apply(lambda x: str(x.day) + '_' + str(x.time.hour), axis=1)
opening_times_rest['day_time'] = opening_times_rest.apply(lambda x: str(x.day) + '_' + str(x.time.hour), axis=1)

In [None]:
# put results in dataframe
opening_times = pd.DataFrame(opening)
closing_times = pd.DataFrame(closing)
closing_times['time'] = pd.to_datetime(closing_times['time'], format='%H%M')
opening_times['time'] = pd.to_datetime(opening_times['time'], format='%H%M')
closing_times['day'] = closing_times.day.apply(lambda x: x-1 if x != 0 else 6)
opening_times['day'] = opening_times.day.apply(lambda x: x-1 if x != 0 else 6)

In [None]:
# create unique day_hr identifier
closing_times['day_time'] = closing_times.apply(lambda x: str(x.day) + '_' + str(x.time.hour), axis=1)
opening_times['day_time'] = opening_times.apply(lambda x: str(x.day) + '_' + str(x.time.hour), axis=1)

In [None]:
closing_all = pd.concat([closing_times_rest, closing_times])
opening_all = pd.concat([opening_times_rest, opening_times])

In [None]:
# count all appearances of openings and closings per weekday and hour
agg_close = closing_all.groupby('day_time').agg({'day':'count'}).rename(columns={'day':'count_close'})
agg_open = opening_all.groupby('day_time').agg({'day':'count'}).rename(columns={'day':'count_open'})
agg_joint = agg_close.join(agg_open, how='outer')

In [None]:
# init range 2018
range_2018 = pd.DataFrame(pd.date_range('2018-01-01', '2018-12-31', freq='h'), columns=['hour'])
range_2018['day_time'] =  range_2018.apply(lambda x: str(x.hour.weekday()) + '_' + str(x.hour.hour), axis=1)

In [None]:
# join both

opening_count_2018 = range_2018.merge(agg_joint, 
                                    on='day_time',
                                    how='left').drop(columns='day_time')

In [None]:
opening_times

In [None]:
opening_count_2018.sort_values(by='count_open')

In [None]:
opening_count_2018.to_csv('raw_data/opening_count_2018.csv')

# Modelling 

## Preprocessing the final dataframe

In [None]:
df_merged = pd.read_csv('https://raw.githubusercontent.com/McNickSisto/world_data_league/main/stage_final/data/Noise_weather_wifi_sim_holidays_opencount_complaints_logvalues.csv')
df_merged = df_merged.drop(columns=['Unnamed: 0'])

For now, we consider the following variables: `Log_Avg` (the logarithmic average of C1, C2, C3, C4 and C5), temperature, wind, rainfall, snowfall, `isHoliday`, `Complaints_no` and `count_close`, together with the date and hour columns.
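
As a reminder of what a logarithmic average of sound levels means, here is a minimal sketch. It assumes that `Log_Avg` is the energetic (decibel) mean of the five channels, which is the usual convention for noise levels but is not shown in this notebook; the function name `log_average_db` is illustrative.

```python
import numpy as np


def log_average_db(levels_db):
    """Energetic mean of decibel values: 10 * log10(mean(10 ** (L / 10)))."""
    levels_db = np.asarray(levels_db, dtype=float)
    return 10.0 * np.log10(np.mean(10.0 ** (levels_db / 10.0), axis=-1))


# One louder channel pulls the logarithmic average above the arithmetic mean (62 dB).
example = log_average_db([60.0, 60.0, 60.0, 60.0, 70.0])
```

Because the averaging happens on the energy scale, a single loud channel dominates the result more than it would in a plain arithmetic mean.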

In [None]:
df_reg_1 = df_merged[['date_hour', 'temp', 'winds', 'rainfall_mm', 'snowfall_mm', 'isHoliday', 'date', 'hour', 'day', 'Log_Avg', 'Complaints_no', 'count_close']]
df_reg_1 = df_reg_1.fillna(value={'count_close' : 0, 'winds' : 0, 'snowfall_mm':0 , 'rainfall_mm' : 0})
df_reg_1['date_hour'] = pd.to_datetime(df_reg_1['date_hour'])
df_reg_1 =  df_reg_1.set_index('date_hour')

In [None]:
cor = df_reg_1.corr()
cor['Log_Avg'].sort_values()

In [None]:
df_close = df_reg_1.copy()

## Getting Log_Avg values of previous times
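
The idea of lagged regressors can be illustrated on a toy series. This sketch uses `Series.shift` rather than the helper function defined below, and the frame `toy` is made-up data; it assumes one row per hour with no gaps, so shifting by 24 rows means "same hour on the previous day".

```python
import numpy as np
import pandas as pd

# Toy hourly series: the value equals the row position, so lags are easy to check.
idx = pd.date_range("2018-01-01", periods=96, freq="h")
toy = pd.DataFrame({"Log_Avg": np.arange(96, dtype=float)}, index=idx)

lags = [24, 48]
for lag in lags:
    # shift(lag) moves values down by `lag` rows, so row t holds the value from t - lag
    toy[f"Log_Avg_(t-{lag})"] = toy["Log_Avg"].shift(lag)

# The first max(lags) rows have no complete history and are dropped.
toy_lagged = toy.dropna()
```

The column names follow the same `<attribute>_(t-<lag>)` pattern as the function below.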

In [None]:
def create_regressor_attributes(df, attributes, list_of_prev_t_instants):
    """Add lagged copies of each column in `attributes` as '<name>_(t-<lag>)' columns."""
    # sort a copy so the caller's list is not modified
    lags = sorted(list_of_prev_t_instants)
    start = lags[-1]  # largest lag: the first `start` rows have no complete history
    end = len(df)
    df['datetime'] = df.index

    df_copy = df[start:end].reset_index(drop=True)

    for attribute in attributes:
        lagged = pd.DataFrame()

        for prev_t in lags:
            # rows shifted back by `prev_t` positions, aligned with df_copy
            new_col = pd.DataFrame(df[attribute].iloc[(start - prev_t):(end - prev_t)])
            new_col.reset_index(drop=True, inplace=True)
            new_col.rename(columns={attribute: '{}_(t-{})'.format(attribute, prev_t)}, inplace=True)
            lagged = pd.concat([lagged, new_col], sort=False, axis=1)

        df_copy = pd.concat([df_copy, lagged], sort=False, axis=1)

    df_copy.set_index(['datetime'], drop=True, inplace=True)
    return df_copy

In [None]:
list_of_attributes = ['Log_Avg']

# daily lags from 24 h to 360 h; this list can be changed to the lags with the most impact
list_of_prev_t_instants = list(range(24, 361, 24))

list_of_prev_t_instants

In [None]:
df_new = create_regressor_attributes(df_close, list_of_attributes, list_of_prev_t_instants)
df_new.head()

In [None]:
df_new.corr()['Log_Avg'].sort_values()

More preprocessing is probably needed. For now we focus on the modelling and will revisit the preprocessing later.

In [None]:
df_new_1 = df_new[['Log_Avg', 'Log_Avg_(t-24)',
       'Log_Avg_(t-48)', 'Log_Avg_(t-72)', 'Log_Avg_(t-96)', 'Log_Avg_(t-120)',
       'Log_Avg_(t-144)', 'Log_Avg_(t-168)', 'Log_Avg_(t-192)','Log_Avg_(t-216)', 'Log_Avg_(t-240)',
       'temp', 'winds', 'rainfall_mm', 'snowfall_mm', 'isHoliday', 
       'hour', 'day',  'Complaints_no', 'count_close']]

## Splitting the dataset into train, validation and test sets
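
Since this is a time series, the split is chronological rather than random: the last rows form the test set and the rows just before them form the validation set. A minimal sketch of the index arithmetic, with a hypothetical helper `chronological_split` that is not used in the cells below:

```python
import numpy as np


def chronological_split(n_rows, test_size=0.05, valid_size=0.05):
    """Return slices for train, validation and test, in time order."""
    n_train_valid = int(np.floor(n_rows * (1 - test_size)))
    n_train = int(np.floor(n_train_valid * (1 - valid_size)))
    return (slice(0, n_train),
            slice(n_train, n_train_valid),
            slice(n_train_valid, n_rows))


train_idx, valid_idx, test_idx = chronological_split(1000)
```

Note that the validation fraction is taken from the remaining train-plus-validation rows, not from the full dataset.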

In [None]:
test_set_size = 0.05
valid_set_size= 0.05

df_copy = df_new_1.reset_index(drop=True)

df_test = df_copy.iloc[ int(np.floor(len(df_copy)*(1-test_set_size))) : ]
df_train_plus_valid = df_copy.iloc[ : int(np.floor(len(df_copy)*(1-test_set_size))) ]

df_train = df_train_plus_valid.iloc[ : int(np.floor(len(df_train_plus_valid)*(1-valid_set_size))) ]
df_valid = df_train_plus_valid.iloc[ int(np.floor(len(df_train_plus_valid)*(1-valid_set_size))) : ]


X_train, y_train = df_train.iloc[:, 1:], df_train.iloc[:, 0]
X_valid, y_valid = df_valid.iloc[:, 1:], df_valid.iloc[:, 0]
X_test, y_test = df_test.iloc[:, 1:], df_test.iloc[:, 0]

print('Shape of training inputs, training target:', X_train.shape, y_train.shape)
print('Shape of validation inputs, validation target:', X_valid.shape, y_valid.shape)
print('Shape of test inputs, test target:', X_test.shape, y_test.shape)

## Implementing a MinMaxScaler (optional step)
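
A scaler should learn its minimum and maximum from the training data only and then be applied unchanged to the other sets. The sketch below shows this on made-up arrays (`train` and `test` are illustrative, not the notebook's data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: the test range extends beyond the training range.
train = np.linspace(40, 60, 100).reshape(-1, 1)
test = np.linspace(50, 70, 20).reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(0.01, 0.99))
train_scaled = scaler.fit_transform(train)  # learn min/max from the training data
test_scaled = scaler.transform(test)        # reuse them, do not refit

# inverse_transform recovers the original units because the same fit is used
test_restored = scaler.inverse_transform(test_scaled)
```

Test values outside the training range end up outside (0.01, 0.99). That is expected and preferable to refitting the scaler on the test set, which would make the scaled values of the sets incomparable.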

In [None]:
Target_scaler = MinMaxScaler(feature_range=(0.01, 0.99)) 
Feature_scaler = MinMaxScaler(feature_range=(0.01, 0.99))

# fit the scalers on the training data only, then reuse them for validation and test
X_train_scaled = Feature_scaler.fit_transform(np.array(X_train))
X_valid_scaled = Feature_scaler.transform(np.array(X_valid))
X_test_scaled = Feature_scaler.transform(np.array(X_test))

y_train_scaled = Target_scaler.fit_transform(np.array(y_train).reshape(-1,1))
y_valid_scaled = Target_scaler.transform(np.array(y_valid).reshape(-1,1))
y_test_scaled = Target_scaler.transform(np.array(y_test).reshape(-1,1))

## Modelling

For now we include every model we tried. Based on the scores, we can keep the best model and move the rest to an appendix or delete them.
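
Picking the top model by score can be done in one loop. The sketch below runs on synthetic data (`X`, `y` and the `candidates` dictionary are illustrative and not the notebook's variables) and only shows the comparison pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a linear signal plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)

# Keep the row order when splitting, as in the chronological split above.
X_tr, X_te = X[:240], X[240:]
y_tr, y_te = y[:240], y[240:]

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.5),
    "tree": DecisionTreeRegressor(random_state=1),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, estimator.predict(X_te))

best_model = max(scores, key=scores.get)
```

In the notebook itself, the comparison would ideally use the validation set, keeping the test set for the final model only.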

### Linear Regression

In [None]:
Lin_reg = LinearRegression()
Lin_reg.fit(X_train_scaled, y_train_scaled)
y_pred = Lin_reg.predict(X_test_scaled)
y_pred_rescaled = Target_scaler.inverse_transform(y_pred)

y_test_rescaled =  Target_scaler.inverse_transform(y_test_scaled)
score = r2_score(y_test_rescaled, y_pred_rescaled)
print('R-squared score for the test set: ', round(score,4))


### Ridge Regression

In [None]:
ridge = Ridge(alpha=0.5)
ridge.fit(X_train_scaled, y_train_scaled)
y_pred = ridge.predict(X_test_scaled)
y_pred_rescaled = Target_scaler.inverse_transform(y_pred)

y_test_rescaled =  Target_scaler.inverse_transform(y_test_scaled)
score = r2_score(y_test_rescaled, y_pred_rescaled)
print('R-squared score for the test set: ', round(score,4))


### Lasso Regression

In [None]:

# use a lowercase name so the Lasso class is not overwritten; the remaining
# arguments of the original call were library defaults and are omitted
lasso = Lasso(alpha=0.2, max_iter=1000, tol=0.0001)
lasso.fit(X_train_scaled, y_train_scaled)
y_pred = lasso.predict(X_test_scaled)
y_pred_rescaled = Target_scaler.inverse_transform(y_pred.reshape(-1,1))

y_test_rescaled =  Target_scaler.inverse_transform(y_test_scaled)
score = r2_score(y_test_rescaled, y_pred_rescaled)
print('R-squared score for the test set: ', round(score,4))


### Decision Tree regression

In [None]:
tree_model = DecisionTreeRegressor()
tree_model.fit(X_train_scaled, y_train_scaled)
y_pred = tree_model.predict(X_test_scaled)
y_pred_rescaled = Target_scaler.inverse_transform(y_pred.reshape(-1,1))

y_test_rescaled =  Target_scaler.inverse_transform(y_test_scaled)
score = r2_score(y_test_rescaled, y_pred_rescaled)
print('R-squared score for the test set: ', round(score,4))

### Random Forest Regressor

In [None]:
# mean squared error is the default split criterion, so it is not passed explicitly
Rfr = rfr(n_estimators = 100,
                              random_state = 1,
                              n_jobs = -1)
Rfr.fit(X_train_scaled, y_train_scaled.ravel())
y_pred = Rfr.predict(X_test_scaled)
y_pred_rescaled = Target_scaler.inverse_transform(y_pred.reshape(-1,1))

y_test_rescaled =  Target_scaler.inverse_transform(y_test_scaled)
score = r2_score(y_test_rescaled, y_pred_rescaled)
print('R-squared score for the test set: ', round(score,4))


### Polynomial Regression

In [None]:
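pol = PolynomialFeatures(degree=2)
# expand the scaled features into degree-2 polynomial terms
X_train_pol = pol.fit_transform(X_train_scaled)
X_test_pol = pol.transform(X_test_scaled)
Pol_reg = LinearRegression()
Pol_reg.fit(X_train_pol, y_train_scaled)
y_pred = Pol_reg.predict(X_test_pol)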
y_pred_rescaled = Target_scaler.inverse_transform(y_pred.reshape(-1,1))

y_test_rescaled =  Target_scaler.inverse_transform(y_test_scaled)
score = r2_score(y_test_rescaled, y_pred_rescaled)
print('R-squared score for the test set: ', round(score,4))

### Basic Neural Network

In [None]:
model_nn = keras.Sequential([
    # the hidden layer (input_shape belongs on the first layer)
    layers.Dense(64, activation='sigmoid', input_shape=[X_train_scaled.shape[1]]),
    # the linear output layer
    layers.Dense(units=1)
])
model_nn.compile(loss= 'mean_squared_error', optimizer='adam')
history_nn = model_nn.fit(X_train_scaled, y_train_scaled, epochs=50)

y_pred = model_nn.predict(X_test_scaled)
y_pred_rescaled = Target_scaler.inverse_transform(y_pred.reshape(-1,1))

y_test_rescaled =  Target_scaler.inverse_transform(y_test_scaled)
score = r2_score(y_test_rescaled, y_pred_rescaled)
print('R-squared score for the test set: ', round(score,4))

### LSTM RNN

#### Preprocessing for LSTM RNN
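
Keras LSTM layers expect a 3D input of shape (samples, timesteps, features). The reshape below treats each of the feature columns as one timestep with a single feature; this is a minimal sketch on a made-up array `X_demo`.

```python
import numpy as np

# 3 samples with 4 columns each
X_demo = np.arange(12, dtype=float).reshape(3, 4)

# (samples, columns) -> (samples, timesteps=columns, features=1)
X_demo_lstm = X_demo.reshape(X_demo.shape[0], X_demo.shape[1], 1)
```

The values are unchanged; only a trailing axis of length 1 is added.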

In [None]:
X_train_lstm = np.reshape(X_train_scaled,(X_train_scaled.shape[0], X_train_scaled.shape[1],1) )
y_train_lstm = np.reshape(y_train_scaled, (y_train_scaled.shape[0]))

X_valid_lstm = np.reshape(X_valid_scaled,(X_valid_scaled.shape[0], X_valid_scaled.shape[1],1) )
y_valid_lstm = np.reshape(y_valid_scaled, (y_valid_scaled.shape[0]))

X_test_lstm = np.reshape(X_test_scaled,(X_test_scaled.shape[0], X_test_scaled.shape[1],1) )
y_test_lstm = np.reshape(y_test_scaled, (y_test_scaled.shape[0]))


In [None]:
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train_lstm.shape[1], 1)))

model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(units=50))
model.add(Dropout(0.2))

model.add(Dense(units = 1))

model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()

In [None]:
model.fit(x=X_train_lstm, y=y_train_lstm, batch_size=5, epochs=30, verbose=1, validation_data=(X_valid_lstm, y_valid_lstm), shuffle=True)

In [None]:
loss_per_epoch = model.history.history['loss']
val_loss_per_epoch = model.history.history['val_loss']

In [None]:
y_pred = model.predict(X_test_lstm)
y_pred_rescaled = Target_scaler.inverse_transform(y_pred)
y_test_rescaled =  Target_scaler.inverse_transform(y_test_scaled)
score = r2_score(y_test_rescaled, y_pred_rescaled)
print('R-squared score for the test set: ', round(score,4))

#### Plotting loss values (can be skipped if we are not using the LSTM)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,5))
plt.plot(loss_per_epoch);
plt.plot(val_loss_per_epoch);
plt.title("LSTM model loss in MSE");
plt.ylabel("loss");
plt.xlabel("Epochs");
plt.legend(['train', 'val']);

In [None]:
# x-axis labels: timestamps of the test rows (the last rows of df_new_1)
n_test = len(y_test_rescaled)
x_labels = list(df_new_1.index)[-n_test:]
y_actual = pd.DataFrame(y_test_rescaled, columns=['Actual'])
y_hat = pd.DataFrame(y_pred_rescaled, columns=['Predicted'])
# one tick every 100 rows
positions = list(range(0, n_test, 100))
selected_labels = [x_labels[i] for i in positions]

plt.figure(figsize=(18, 10))
plt.plot(y_actual, linestyle='solid', color='r')
plt.plot(y_hat, linestyle='dashed', color='b')
plt.xticks(positions, selected_labels)
plt.legend(['Actual','Predicted'], loc='best', prop={'size': 14})
plt.title('Log_Avg: actual vs. predicted (test set)', weight='bold', fontsize=16)
plt.grid(color = 'y', linewidth=0.5)
plt.show()