![](https://storage.googleapis.com/newagent-ahdsra.appspot.com/header.png)

# **Acea**
The Acea Group is one of the leading Italian multiutility operators. Listed on the Italian Stock Exchange since 1999, the company manages and develops water and electricity networks and environmental services. Acea is the foremost Italian operator in the water services sector supplying 9 million inhabitants in Lazio, Tuscany, Umbria, Molise, Campania.

# Introduction
In this competition we will focus only on the water sector to help Acea Group preserve precious waterbodies. As it is easy to imagine, a water supply company struggles with the need to forecast the water level in a waterbody (water spring, lake, river, or aquifer) to handle daily consumption. During fall and winter waterbodies are refilled, but during spring and summer they start to drain. To help preserve the health of these waterbodies it is important to predict the most efficient water availability, in terms of level and water flow for each day of the year.

## Data
The reality is that each waterbody has such unique characteristics that their attributes are not linked to each other. This analytics competition uses datasets that are completely independent from each other. However, it is critical to understand total availability in order to preserve water across the country.

Each dataset represents a different kind of waterbody. Because the waterbodies differ from one another, so do their features: a water spring, for instance, is described by different features than a lake. These differences are expected, given the unique behavior and characteristics of each waterbody. The Acea Group deals with four types of waterbodies: water springs, lakes, rivers and aquifers.

## Challenge
Can you build a story to predict the amount of water in each unique waterbody? The challenge is to determine how features influence the water availability of each presented waterbody. Put simply, a better understanding of volumes will allow Acea to ensure water availability for each time interval of the year.

The time interval is defined as a day or a month, depending on the available measures for each waterbody. Models should capture volumes for each waterbody (for instance, a model working on a monthly interval is expected to produce a forecast over the month).
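
As a minimal sketch of the daily versus monthly interval, the snippet below aggregates a synthetic daily series to monthly means with pandas. The series, its values and the choice of the mean as aggregation are assumptions made for illustration only; they are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (hypothetical values): 60 days starting on 2020-01-01
days = pd.date_range('2020-01-01', periods=60, freq='D')
daily = pd.Series(np.arange(60, dtype=float), index=days, name='level')

# Aggregate to one value per calendar month, labelled by the month start
monthly = daily.resample('MS').mean()
print(monthly)
```

A model working on a monthly interval would then be trained and evaluated on a series like `monthly` rather than on the daily one.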

The desired outcome is a notebook that can generate four mathematical models, one for each category of waterbody (aquifers, water springs, rivers, lakes), that might be applicable to each single waterbody.

![](https://storage.googleapis.com/newagent-ahdsra.appspot.com/inbox_6195295_cca952eecc1e49c54317daf97ca2cca7_Acea-Input.png)

# Evaluation
Can you build a model to predict the amount of water in each waterbody to help preserve this natural resource?
This is an Analytics competition where your task is to create a Notebook that best addresses the Evaluation criteria below. Submissions should be shared directly with host and will be judged by the Acea Group based on how well they address:

### Methodology/Completeness (min 0 points, max 5 points)
* Are the statistical models appropriate given the data?
* Did the author develop one or more machine learning models?
* Did the author provide a way of assessing the performance and accuracy of their solution?
* What is the Mean Absolute Error (MAE) of the models?
* What is the Root Mean Square Error (RMSE) of the models?
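
The two error metrics named in the criteria can be computed with a few lines of NumPy. This is a minimal sketch; the function names and the toy numbers are ours and only illustrate the definitions.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the average squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values (hypothetical): the errors are 1, 0 and -2
observed = [2.0, 4.0, 6.0]
predicted = [1.0, 4.0, 8.0]

mae_value = mae(observed, predicted)    # (1 + 0 + 2) / 3
rmse_value = rmse(observed, predicted)  # sqrt((1 + 0 + 4) / 3)
print(mae_value, rmse_value)
```

RMSE penalizes large errors more than MAE, so reporting both gives a rough idea of whether a few large misses dominate the error.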

### Presentation (min 0 points, max 5 points)
* Does the notebook have a compelling and coherent narrative?
* Does the notebook contain data visualizations that help to communicate the author’s main points?
* Did the author include a thorough discussion on the intersection between features and their prediction? For example between rainfall and amount/level of water.
* Was there discussion of automated insight generation, demonstrating what factors to take into account?
* Is the code documented in a way that makes it easy to understand and reproduce?
* Were all external sources of data made public and cited appropriately?

### Application (min 0 points, max 5 points)
* Is the provided model useful/able to forecast water availability in terms of level or water flow in a time interval of the year?
* Is the provided methodology also applicable to new datasets belonging to another waterbody?

# Some literature review

> ### *It's always good practice to start from a solid base*


## [Prediction of Water Level using Monthly Lagged Data in Lake Urmia, Iran.](https://link.springer.com/article/10.1007/s11269-016-1463-y#citeas)<br><sup>*Babak Vaheddoost, Hafzullah Aksoy, Hirad Abghari*</sup>
In this interesting work the authors use parametric and non-parametric models to predict monthly water level fluctuations in Lake Urmia. The eleven previous water levels, in the form of monthly lagged data, are the known independent variables of the model, while the lake water level at the twelfth month is the unknown dependent variable to be predicted. The parametric models are multi-linear regression (MLR), additive and multiplicative non-linear regression (ANLR and MNLR) and decision tree (DT), while a feed-forward back-propagation neural network (FFBP-NN), a generalized regression neural network (GR-NN) and a radial basis function neural network (RBF-NN) represent the non-parametric approach.
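
The lagged-data setup described above can be sketched with `pandas.Series.shift`. This is only an illustration of the idea, not the paper's code: the helper `make_lagged_frame`, the column names and the synthetic series are assumptions.

```python
import numpy as np
import pandas as pd

def make_lagged_frame(series, n_lags=11):
    """Build a frame with the current value as 'target' and the
    n_lags previous values as columns lag_1 ... lag_n."""
    frame = pd.DataFrame({'target': series})
    for lag in range(1, n_lags + 1):
        frame[f'lag_{lag}'] = series.shift(lag)
    # The first n_lags rows have incomplete history and are dropped
    return frame.dropna()

# Synthetic monthly water levels (hypothetical values)
months = pd.date_range('2000-01-01', periods=36, freq='MS')
levels = pd.Series(np.arange(36, dtype=float), index=months, name='level')

lagged = make_lagged_frame(levels, n_lags=11)
print(lagged.head())
```

Each row of `lagged` holds eleven known past levels and the twelfth-month level to predict, so any regression model can be fitted on the `lag_*` columns against `target`.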


# **Methodology**

Every waterbody category, i.e. **aquifers, water springs, lakes, rivers**, will be treated on its own. In this way it will be possible to tune every phase for every category in order to obtain the best result. In particular, the phases that will be developed are:
1. **Data exploration and cleaning;**
2. **Feature extraction;**
3. **Feature selection;**
4. **Model development;**
5. **Results evaluation;**
6. **Deployment.**

# Modules import

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import figure

# 0. Geographic overview of the sites.

In [None]:
"""# Installing and importing gmplot 
!pip install gmplot
    
import gmplot

## Installing and importing geopy 
!pip install geopy

import geopy
  
print('Please insert your Google Maps Api Key')
api_key = input()
gmap = gmplot.GoogleMapPlotter(41.9109, 12.4818, 6, apikey=api_key)

# Retrieving the lat and long 
from geopy.geocoders import Nominatim

locations = ['Auser, Italy', 'Doganella, Italy', 'Luco, Italy', 'Petrignano, Italy', 'Bilancino lake, Italy',
            'Amiata, Italy', 'Lupa, Italy', 'Arno River, Italy', 'Madonna di Canneto, Italy']

geolocator = Nominatim(user_agent="localizer")

latitudes = []
longitudes = []
labels = []
for loc in locations:
    # Geocode each location once and reuse the result for both coordinates
    point = geolocator.geocode(loc)
    latitudes.append(point.latitude)
    longitudes.append(point.longitude)
    labels.append(loc.split(',')[0])
gmap.scatter(latitudes, longitudes, color='#3B0B39', size=40, marker=True, title = labels)

# Draw the map:
gmap.draw('./map.html')"""

The result can be seen [here](https://storage.googleapis.com/newagent-ahdsra.appspot.com/map.html).

![](https://storage.googleapis.com/newagent-ahdsra.appspot.com/Immagine%202020-12-16%20160936.png)

# 1. Exploratory Data Analysis

In this section we are going to explore the data, one category at a time, in order to understand the variables and their distribution. After this we will analyze the missing data and find a way to fill and clean them for every category, so that the cleaning methodology can be replicated on other datasets belonging to the same category.
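
Since the same checks are repeated for every dataset, the missing-data summary can be wrapped in a small reusable function. The sketch below is one possible way to do it; the helper name `missing_summary` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and fraction of missing values per column,
    keeping only the columns with at least one missing value."""
    total = df.isnull().sum()
    percent = total / len(df)
    summary = pd.DataFrame({'Total': total, 'Percent': percent})
    summary = summary[summary['Total'] > 0]
    return summary.sort_values('Total', ascending=False)

# Toy frame (hypothetical columns and values)
toy = pd.DataFrame({
    'Date': ['01/01/2020', '02/01/2020', '03/01/2020', '04/01/2020'],
    'Rainfall_X': [0.0, np.nan, 1.2, np.nan],
    'Depth_X': [-5.1, -5.2, np.nan, -5.0],
})

summary = missing_summary(toy)
print(summary)
```

The same call can be applied unchanged to any dataframe of the same category.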

## Aquifers

In [None]:
# Reading and basic exploration of the data
auser = pd.read_csv('../input/acea-water-prediction/Aquifer_Auser.csv')
doganella = pd.read_csv('../input/acea-water-prediction/Aquifer_Doganella.csv')
luco = pd.read_csv('../input/acea-water-prediction/Aquifer_Luco.csv')
petrignano = pd.read_csv('../input/acea-water-prediction/Aquifer_Petrignano.csv')

### Auser

Description: This waterbody consists of two subsystems, called NORTH and SOUTH, where the former partly influences the behavior of the latter. Indeed, the north subsystem is a water table (or unconfined) aquifer while the south subsystem is an artesian (or confined) aquifer.

The levels of the NORTH sector are represented by the values of the SAL, PAG, CoS and DIEC wells, while the levels of the SOUTH sector by the LT2 well.

In [None]:
df = auser

In [None]:
# Printing the dataframe without NaNs
df.dropna().head(10)

In [None]:
# Retrieving the dataframe columns
columns = df.columns.tolist()

print('Number of variables: ' + str(len(columns)))
print('Variables type:')
print(df.dtypes)

# Descriptive statistics summary
df.describe()

In [None]:
# Histogram to understand the data distribution of some relevant features
# Number of columns
ncols = 5
# Number of rows
import math
nrows = math.ceil((len(columns) -1) / 5)
fig, axs = plt.subplots(ncols = ncols, nrows = nrows, figsize= (24, 20))
grid = []
for j in range(nrows):
    for h in range(ncols):
        grid.append([j,h])
[sns.distplot(df[columns[i+1]], ax=axs[grid[i][0], grid[i][1]]) for i in range(0,len(columns)-1)]
print('### AUSER ###')

In [None]:
x_plot = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

# Plotting the time series for every variable
fig, axs = plt.subplots(nrows = len(columns) - 1, figsize= (22, 8 * len(columns)))


for var in range(1,len(columns)):
    axs[var-1].plot(x_plot, df[columns[var]])
    
    axs[var-1].set_xlabel('Date')
    axs[var-1].set_ylabel(columns[var], fontsize = 25)
    
    axs[var-1].grid()

#### Missing data

The questions we want to answer by analyzing missing data are:
* How prevalent is the missing data?
* Is missing data random or does it have a pattern?

**How prevalent is the missing data?**

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
# We are going to consider just the features that have at least 1 missing value
missing_data = missing_data[missing_data['Percent'] > 0]
missing_data

In [None]:
# Plotting the percentage of the missing data.
figure(figsize= (22, 8))
sns.barplot(x = missing_data.index, y = missing_data['Percent'], palette="rocket")

ax = plt.gca()
ax.axhline(0, color="k", clip_on=False)
ax.set_ylabel("Percent", fontsize = 25)
ax.set_xlabel("Variable", fontsize = 25)
plt.xticks(fontsize=18, rotation=90)
plt.yticks(fontsize=18)

**Is missing data random or does it have a pattern?**

The gaps may be concentrated in the same periods rather than spread at random, so we count the missing values of each variable per year.
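
Besides the yearly counts, it can help to look at how long the individual gaps are, because a few long gaps and many isolated missing days call for different treatments. The following is a minimal sketch under that assumption; the helper `nan_runs` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def nan_runs(series):
    """Return one row per run of consecutive NaNs with its start, end and length."""
    is_nan = series.isnull()
    # A new run id starts every time the NaN flag changes value
    run_id = (is_nan != is_nan.shift()).cumsum()
    runs = []
    for _, block in series[is_nan].groupby(run_id[is_nan]):
        runs.append({'start': block.index[0],
                     'end': block.index[-1],
                     'length': len(block)})
    return pd.DataFrame(runs, columns=['start', 'end', 'length'])

# Toy daily series (hypothetical values) with two gaps
days = pd.date_range('2020-01-01', periods=8, freq='D')
values = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0], index=days)

runs = nan_runs(values)
print(runs)
```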

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

df['year'] = df['Date'].dt.year

nan_by_year = pd.DataFrame(index = df['year'].unique())

for col in missing_data.index.tolist():
    count = df[col].isnull().groupby(df['year']).sum().astype(int).reset_index(name='count')
    nan_by_year[col] = count['count'].values
nan_by_year

In [None]:
figure(figsize= (22, 12))
missing_percent = nan_by_year/365
sns.heatmap(missing_percent, annot = True,  linewidths=.5, cmap = 'rocket_r')

ax = plt.gca()
ax.set_ylabel("Year", fontsize = 25)
ax.set_xlabel("Variable", fontsize = 25)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18, rotation=0)

ax.set_title("% of Missing values per year", fontsize = 35, pad = 25)

plt.savefig('Auser_missing.png', bbox_inches='tight')

### Doganella

Description: The Doganella wells field is fed by two underground aquifers that are recharged by meteoric infiltration rather than by rivers or lakes. The upper aquifer is a water table with a thickness of about 30m. The lower aquifer is a semi-confined artesian aquifer with a thickness of 50m and is located inside lavas and tufa products. These aquifers are accessed through wells called Well 1, ..., Well 9. Approximately 80% of the drainage volumes come from the artesian aquifer. The aquifer levels are influenced by the following parameters: rainfall, humidity, subsoil, temperatures and drainage volumes.

In [None]:
df = doganella

In [None]:
# Printing the dataframe without NaNs
df.dropna().head(10)

In [None]:
# Retrieving the dataframe columns
columns = df.columns.tolist()

print('Number of variables: ' + str(len(columns)))
print('Variables type:')
print(df.dtypes)

# Descriptive statistics summary
df.describe()

In [None]:
# Histogram to understand the data distribution of some relevant features
# Number of columns
ncols = 5
# Number of rows
import math
nrows = math.ceil((len(columns) -1) / 5)
fig, axs = plt.subplots(ncols = ncols, nrows = nrows, figsize= (24, 20))
grid = []
for j in range(nrows):
    for h in range(ncols):
        grid.append([j,h])
[sns.distplot(df[columns[i+1]], ax=axs[grid[i][0], grid[i][1]]) for i in range(0,len(columns)-1)]
print('### DOGANELLA ###')

In [None]:
x_plot = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

# Plotting the time series for every variable
fig, axs = plt.subplots(nrows = len(columns) - 1, figsize= (22, 8 * len(columns)))


for var in range(1,len(columns)):
    axs[var-1].plot(x_plot, df[columns[var]])
    
    axs[var-1].set_xlabel('Date')
    axs[var-1].set_ylabel(columns[var], fontsize = 25)
    
    axs[var-1].grid()

#### Missing data

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
# We are going to consider just the features that have at least 1 missing value
missing_data = missing_data[missing_data['Percent'] > 0]
missing_data

In [None]:
# Plotting the percentage of the missing data.
figure(figsize= (22, 8))
sns.barplot(x = missing_data.index, y = missing_data['Percent'], palette="rocket")

ax = plt.gca()
ax.axhline(0, color="k", clip_on=False)
ax.set_ylabel("Percent", fontsize = 25)
ax.set_xlabel("Variable", fontsize = 25)
plt.xticks(fontsize=18, rotation=90)
plt.yticks(fontsize=18)

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

df['year'] = df['Date'].dt.year

nan_by_year = pd.DataFrame(index = df['year'].unique())

for col in missing_data.index.tolist():
    count = df[col].isnull().groupby(df['year']).sum().astype(int).reset_index(name='count')
    nan_by_year[col] = count['count'].values
nan_by_year

In [None]:
figure(figsize= (22, 12))
missing_percent = nan_by_year/365
sns.heatmap(missing_percent, annot = True,  linewidths=.5, cmap = 'rocket_r')

ax = plt.gca()
ax.set_ylabel("Year", fontsize = 25)
ax.set_xlabel("Variable", fontsize = 25)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18, rotation=0)

ax.set_title("% of Missing values per year", fontsize = 35, pad = 25)
plt.savefig('Doganella_missing.png', bbox_inches='tight')

### Luco

Description: The Luco wells field is fed by an underground aquifer. This aquifer is not fed by rivers or lakes but by meteoric infiltration at the extremes of the impermeable sedimentary layers. The aquifer is accessed through wells called Well 1, Well 3 and Well 4 and is influenced by the following parameters: rainfall, depth to groundwater, temperature and drainage volumes.

In [None]:
df = luco

In [None]:
# Printing the dataframe without NaNs
df.dropna().head(10)

In [None]:
# Retrieving the dataframe columns
columns = df.columns.tolist()

print('Number of variables: ' + str(len(columns)))
print('Variables type:')
print(df.dtypes)

# Descriptive statistics summary
df.describe()

In [None]:
# Histogram to understand the data distribution of some relevant features
# Number of columns
ncols = 5
# Number of rows
import math
nrows = math.ceil((len(columns) -1) / 5)
fig, axs = plt.subplots(ncols = ncols, nrows = nrows, figsize= (24, 20))
grid = []
for j in range(nrows):
    for h in range(ncols):
        grid.append([j,h])
[sns.distplot(df[columns[i+1]], ax=axs[grid[i][0], grid[i][1]], kde_kws = {'bw' : 1}) for i in range(0,len(columns)-1)]
print('### LUCO ###')

In [None]:
x_plot = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

# Plotting the time series for every variable
fig, axs = plt.subplots(nrows = len(columns) - 1, figsize= (22, 8 * len(columns)))


for var in range(1,len(columns)):
    axs[var-1].plot(x_plot, df[columns[var]])
    
    axs[var-1].set_xlabel('Date')
    axs[var-1].set_ylabel(columns[var], fontsize = 25)
    
    axs[var-1].grid()

#### Missing data

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
# We are going to consider just the features that have at least 1 missing value
missing_data = missing_data[missing_data['Percent'] > 0]
missing_data

In [None]:
# Plotting the percentage of the missing data.
figure(figsize= (22, 8))
sns.barplot(x = missing_data.index, y = missing_data['Percent'], palette="rocket")

ax = plt.gca()
ax.axhline(0, color="k", clip_on=False)
ax.set_ylabel("Percent", fontsize = 25)
ax.set_xlabel("Variable", fontsize = 25)
plt.xticks(fontsize=18, rotation=90)
plt.yticks(fontsize=18)

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

df['year'] = df['Date'].dt.year

nan_by_year = pd.DataFrame(index = df['year'].unique())

for col in missing_data.index.tolist():
    count = df[col].isnull().groupby(df['year']).sum().astype(int).reset_index(name='count')
    nan_by_year[col] = count['count'].values
nan_by_year

In [None]:
figure(figsize= (22, 12))
missing_percent = nan_by_year/365
sns.heatmap(missing_percent, annot = True,  linewidths=.5, cmap = 'rocket_r')

ax = plt.gca()
ax.set_ylabel("Year", fontsize = 25)
ax.set_xlabel("Variable", fontsize = 25)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18, rotation=0)

ax.set_title("% of Missing values per year", fontsize = 35, pad = 25)

plt.savefig('Luco_missing.png', bbox_inches='tight')

### Petrignano

Description: The wells field of the alluvial plain between Ospedalicchio di Bastia Umbra and Petrignano is fed by three underground aquifers separated by low permeability septa. The aquifer can be considered a water table groundwater and is also fed by the Chiascio river. The groundwater levels are influenced by the following parameters: rainfall, depth to groundwater, temperatures and drainage volumes, level of the Chiascio river.

In [None]:
df = petrignano

In [None]:
# Printing the dataframe without NaNs
df.dropna().head(10)

In [None]:
# Retrieving the dataframe columns
columns = df.columns.tolist()

print('Number of variables: ' + str(len(columns)))
print('Variables type:')
print(df.dtypes)

# Descriptive statistics summary
df.describe()

In [None]:
# Histogram to understand the data distribution of some relevant features
# Number of columns
ncols = 5
# Number of rows
import math
nrows = math.ceil((len(columns) -1) / 5)
fig, axs = plt.subplots(ncols = ncols, nrows = nrows, figsize= (24, 20))
grid = []
for j in range(nrows):
    for h in range(ncols):
        grid.append([j,h])
[sns.distplot(df[columns[i+1]], ax=axs[grid[i][0], grid[i][1]]) for i in range(0,len(columns)-1)]
print('### PETRIGNANO ###')

In [None]:
x_plot = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

# Plotting the time series for every variable
fig, axs = plt.subplots(nrows = len(columns) - 1, figsize= (22, 8 * len(columns)))


for var in range(1,len(columns)):
    axs[var-1].plot(x_plot, df[columns[var]])
    
    axs[var-1].set_xlabel('Date')
    axs[var-1].set_ylabel(columns[var], fontsize = 25)
    
    axs[var-1].grid()

#### Missing data

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
# We are going to consider just the features that have at least 1 missing value
missing_data = missing_data[missing_data['Percent'] > 0]
missing_data

In [None]:
# Plotting the percentage of the missing data.
figure(figsize= (22, 8))
sns.barplot(x = missing_data.index, y = missing_data['Percent'], palette="rocket")

ax = plt.gca()
ax.axhline(0, color="k", clip_on=False)
ax.set_ylabel("Percent", fontsize = 25)
ax.set_xlabel("Variable", fontsize = 25)
plt.xticks(fontsize=18, rotation=90)
plt.yticks(fontsize=18)

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

df['year'] = df['Date'].dt.year

nan_by_year = pd.DataFrame(index = df['year'].unique())

for col in missing_data.index.tolist():
    count = df[col].isnull().groupby(df['year']).sum().astype(int).reset_index(name='count')
    nan_by_year[col] = count['count'].values
nan_by_year

figure(figsize= (22, 12))
missing_percent = nan_by_year/365
sns.heatmap(missing_percent, annot = True,  linewidths=.5, cmap = 'rocket_r')

ax = plt.gca()
ax.set_ylabel("Year", fontsize = 25)
ax.set_xlabel("Variable", fontsize = 25)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18, rotation=0)

ax.set_title("% of Missing values per year", fontsize = 35, pad = 25)

plt.savefig('Petrignano_missing.png', bbox_inches='tight')

## Checkpoint
Let's wrap up what we've seen so far in order to understand how to go ahead.

The most important point to analyze seems to be the missing data. Indeed, we have seen that missing data is a serious problem for almost all the variables, considering that, as a common rule of thumb, when more than 15% of the data is missing we should delete the corresponding variable and pretend it never existed. The situation is even worse when the variable affected is the target variable; for this reason we must be careful in choosing the best approach to clean the data.
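
The 15% rule of thumb can be checked programmatically. The sketch below only flags the columns above the threshold and leaves the decision to drop them to us; the helper name, the default threshold argument and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold=0.15):
    """Return the columns whose fraction of missing values exceeds the threshold."""
    share = df.isnull().mean()
    return share[share > threshold].index.tolist()

# Toy frame (hypothetical): 10 rows per column
toy = pd.DataFrame({
    'A': [1.0, np.nan] + [1.0] * 8,     # 10% missing
    'B': [np.nan, np.nan] + [1.0] * 8,  # 20% missing
    'C': [1.0] * 10,                    # complete
})

flagged = columns_above_missing_threshold(toy)
print(flagged)
```

Because the missing values are concentrated in specific years, the result depends heavily on the period considered, so the check is more informative when applied to the date range actually used for modelling.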

For every aquifer we list here the target variables and the state of their missing data (considering that until 2005/06 all variables are unusable):
### Auser
* **Depth_to_Groundwater_SAL:** From 2010 to 2020 the percentage of missing data seems acceptable, except for 2016;
* **Depth_to_Groundwater_CoS:** The data does not look as good as the previous one; years 2011/13/14 have many missing values;
* **Depth_to_Groundwater_LT2:** It looks very similar to "Depth_to_Groundwater_SAL", but years 2011/12 have high rates of missing data.
![Auser](./Auser_missing.png)

### Doganella
In general we are going to consider the years from 2013 onward. Moreover, 2017 seems to be critical for most of the target variables, except for "Pozzo" 2 and 3.

* **Depth_to_Groundwater_Pozzo_1/2:** Do not look too bad;
* **Depth_to_Groundwater_Pozzo_3:** 2019/20 missing data rates a bit too high;
* **Depth_to_Groundwater_Pozzo_4:** Same as "Pozzo 3" but with 2016 very high;
* **Depth_to_Groundwater_Pozzo_5:** 2015/16 missing data rates too high;
* **Depth_to_Groundwater_Pozzo_6:** Looks good apart from 2016;
* **Depth_to_Groundwater_Pozzo_7:** 2014/16 missing data rates too high;
* **Depth_to_Groundwater_Pozzo_8:** 2019/20 missing data rates a bit too high;
* **Depth_to_Groundwater_Pozzo_9:** 2014/16 missing data rates too high;
![Doganella](./Doganella_missing.png)

### Luco

* **Depth_to_Groundwater_Podere_Casetta:** A very complicated situation, I'd say! Target data are present mainly from 2008 to 2015, but for this period we also have a lot of missing data in the other variables. Let's see what we'll be able to do here!

![Luco](./Luco_missing.png)

### Petrignano

**I Like it!!**
![Petrignano](./Petrignano_missing.png)


# 2. Data Cleaning

## Outliers

Before diving into missing data we are going to deal with outliers: we'll set them to NaN so that they are handled together with the other missing values during the data cleaning process.

For simplicity we'll treat the outliers in batch, considering all the aquifers together.

Let's visualize the data with box plots.

In [None]:
data_frames = { 'auser' : auser, 'doganella' :doganella,
               'luco' :luco, 'petrignano' :petrignano }

for key in data_frames.keys():
    
    df = data_frames[key].dropna()

    columns = df.columns.tolist()

    ncols = 5
    nrows = math.ceil((len(columns) -1) / 5)
    fig, axs = plt.subplots(ncols = ncols, nrows = nrows, figsize= (24, 30))
    
    fig.suptitle(key.capitalize(), fontsize = 35, va = 'center')

    grid = []

    for j in range(nrows):
        for h in range(ncols):
            grid.append([j,h])
    [df.boxplot(column = columns[i+1], ax=axs[grid[i][0], grid[i][1]]) for i in range(0,len(columns)-1)]
    

With some variables it seems easy to detect the outliers. Take for example **Depth_to_Groundwater_CoS** in the Auser aquifer: it is clear that there are some 0 values that must be removed.
![Depth_to_Groundwater_CoS - Auser](https://storage.googleapis.com/newagent-ahdsra.appspot.com/Auser_outlier.png)
The situation looks very different for the rainfall variables, for example. As stated in the dataset description, **Rainfall_X** indicates the quantity of rain falling, expressed in millimeters (mm), in the area X. In some isolated cases the quantity of rain fallen can legitimately be above the usual one. Therefore we should be careful in choosing how to treat these data.
![Rainfall - Auser](https://storage.googleapis.com/newagent-ahdsra.appspot.com/auser_rainfall.png)
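
The difference between the two cases can be made explicit in code: a zero is a suspicious reading for a depth-to-groundwater column, while it is a perfectly normal value for rainfall. The sketch below replaces zeros with NaN only in the columns whose name starts with a given prefix. Applying it to every depth column is our assumption, and the helper name and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def zeros_to_nan(df, prefix='Depth_to_Groundwater'):
    """Return a copy where zeros in the columns starting with `prefix` become NaN."""
    cleaned = df.copy()
    depth_cols = [c for c in cleaned.columns if c.startswith(prefix)]
    cleaned[depth_cols] = cleaned[depth_cols].replace(0, np.nan)
    return cleaned

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    'Depth_to_Groundwater_CoS': [-5.2, 0.0, -5.4],
    'Rainfall_X': [0.0, 3.1, 0.0],
})

cleaned = zeros_to_nan(toy)
print(cleaned)
```

The rainfall zeros are left untouched, and the original frame is not modified because the function works on a copy.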


### Z-Score
We're going to use the z-score to detect the outliers but we'll analyze the results before doing anything.

The z-score, also called the standard score, is an important concept in statistics. It tells whether a value is greater or smaller than the mean and how far from the mean it lies. More specifically, the z-score is the number of standard deviations that separate a data point from the mean.

**Z score = (x - mean) / std. deviation**
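
As a small worked example of the formula, the sketch below computes absolute z-scores for a toy series. It assumes the population standard deviation (`ddof=0`), which is also the default of `np.std`; the helper name and the values are hypothetical.

```python
import numpy as np
import pandas as pd

def abs_z_scores(series):
    """Absolute z-score of every value, using the population standard deviation."""
    mean = series.mean()
    std = series.std(ddof=0)
    return ((series - mean) / std).abs()

# Toy values (hypothetical): mean = 12, standard deviation = 4
values = pd.Series([10.0, 10.0, 10.0, 10.0, 20.0])

z = abs_z_scores(values)
print(z)
```

The four values equal to 10 lie half a standard deviation from the mean, while the value 20 lies two standard deviations away and would be the first candidate outlier.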

Let's create a table listing the 25 highest absolute z-scores for each variable, in order to better understand how to remove just the outliers.

In [None]:
init = True

for key in data_frames.keys():
    
    df = data_frames[key]
    
    for column in df.columns:
        
        # Skip the date and the helper 'year' column added during the EDA
        if column not in ('Date', 'year'):
            mean = np.mean(df[column]) 
            std = np.std(df[column])
            
            z = abs((df[column]-mean)/std)
            
            if init:
                z_scores = pd.DataFrame(z.sort_values(ascending = False)[0:25].values ,columns = [[key], [z.name]], index = ['Top ' + str(i) for i in range(1,26)])
                init = False
            else:
                z_scores[key,z.name] = z.sort_values(ascending = False)[0:25].values
                
        
z_scores

Let's analyze them one by one

### Auser

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)

In [None]:
z_scores['auser']

### Doganella

In [None]:
z_scores['doganella']

### Luco

In [None]:
z_scores['luco']

### Petrignano

In [None]:
z_scores['petrignano']

## Missing data

We'll use different approaches for each variable.

First of all we are going to consider just the periods where the target variables have an acceptable amount of missing data.
The options we'll consider are:

**Interpolation:** 
- Linear;
- Polynomial;
- Akima;

**Stochastic Methods:**
- Regression Methods;
- AutoRegressive Methods;
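
The three interpolation options can be compared with `pandas.Series.interpolate`. This is a minimal sketch on a toy series: the values and the polynomial order are assumptions, and the `polynomial` and `akima` methods require SciPy to be installed.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) lying on a straight line, with three gaps
series = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0, 7.0, 8.0])

# Linear: straight line between the neighbouring known values
linear = series.interpolate(method='linear')

# Polynomial: needs an explicit order (2 is an arbitrary choice here)
polynomial = series.interpolate(method='polynomial', order=2)

# Akima: piecewise cubic interpolation that tends to limit overshoot
akima = series.interpolate(method='akima')

comparison = pd.DataFrame({'linear': linear,
                           'polynomial': polynomial,
                           'akima': akima})
print(comparison)
```

On real data the three methods diverge mostly inside long gaps, so comparing them on a period where values are artificially masked gives a rough idea of which one suits each variable.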

### Auser