# Introduction

A water supply company needs to forecast the water level in a waterbody (water spring, lake, river, or aquifer) in order to handle daily consumption. During fall and winter waterbodies are refilled, but during spring and summer they start to drain. To help preserve the health of these waterbodies, it is important to predict water availability, in terms of level and water flow, for each day of the year.
## Data
Each waterbody has such unique characteristics that its attributes are not linked to those of the others, so this analytics competition uses datasets that are completely independent of each other. Even so, understanding total availability is critical to preserving water across the country.
Each dataset represents a different kind of waterbody, and because the waterbodies differ, so do their features. The features of a water spring, for instance, are different from those of a lake. These differences are expected given the unique behavior and characteristics of each waterbody. The Acea Group deals with four different types of waterbodies: water springs, lakes, rivers and aquifers.
## Challenge
Can you build a story to predict the amount of water in each unique waterbody? The challenge is to determine how features influence the water availability of each presented waterbody. Put simply, by gaining a better understanding of volumes, the Acea Group will be able to ensure water availability for each time interval of the year.
The time interval is defined as day or month, depending on the available measures for each waterbody. Models should capture volumes for each waterbody (for instance, a model working on a monthly interval is expected to produce a forecast over the month).
The desired outcome is a notebook that can generate four mathematical models, one for each category of waterbody (aquifers, water springs, rivers, lakes), that might be applicable to each single waterbody.
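
To make the day/month interval concrete, here is a minimal sketch of aggregating a daily series to a monthly one with pandas. The data are synthetic, the column names `Date` and `Flow_Rate` are assumptions made for illustration, and the monthly mean is only one possible aggregation.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (illustrative values only, not competition data).
days = pd.date_range("2020-01-01", "2020-03-31", freq="D")
daily = pd.DataFrame({"Date": days, "Flow_Rate": np.arange(len(days), dtype=float)})

# Aggregate to one value per calendar month, labelled by the month start.
monthly = daily.set_index("Date")["Flow_Rate"].resample("MS").mean()
print(monthly)
```

A model working on a monthly interval would then be trained and evaluated on a series like `monthly` rather than on the daily values.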

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F6195295%2Fcca952eecc1e49c54317daf97ca2cca7%2FAcea-Input.png?generation=1606932492951317&alt=media)




# Exploratory Data Analysis

This competition uses nine different datasets, completely independent and not linked to each other. Each dataset represents one kind of waterbody, and because each waterbody is different, the related features differ as well.

Let’s see how these nine datasets differ from each other.

In [None]:
import pandas as pd
from pandas.io.formats.style import Styler
from IPython.display import HTML
import numpy as np
import seaborn as sns
from datetime import datetime, date
import matplotlib.pyplot as plt
%matplotlib inline 
sns.set(color_codes=True)
import os
import warnings
warnings.filterwarnings('ignore')

aq_auser = pd.read_csv("../input/acea-water-prediction/Aquifer_Auser.csv")
aq_doganella = pd.read_csv("../input/acea-water-prediction/Aquifer_Doganella.csv")
aq_luco = pd.read_csv("../input/acea-water-prediction/Aquifer_Luco.csv")
aq_petrignano = pd.read_csv("../input/acea-water-prediction/Aquifer_Petrignano.csv")
lk_bilancino = pd.read_csv("../input/acea-water-prediction/Lake_Bilancino.csv")
rv_arno = pd.read_csv("../input/acea-water-prediction/River_Arno.csv")
ws_amiata = pd.read_csv("../input/acea-water-prediction/Water_Spring_Amiata.csv")
ws_lupa = pd.read_csv("../input/acea-water-prediction/Water_Spring_Lupa.csv")
ws_madonna = pd.read_csv("../input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv")

# Collect the names of all CSV files available in the Kaggle input directory.
files = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if filename.endswith('.csv'):
            files.append(filename)

datasets_df = pd.DataFrame(columns=['File_Name'], data=files)
# The waterbody type is the file-name prefix ('Water' is expanded to 'Water_Spring' below).
datasets_df['Waterbody_type'] = datasets_df.File_Name.apply(lambda x: x.split('_')[0])
# Read each file once and keep its (rows, cols) shape.
shapes = datasets_df.File_Name.apply(lambda x: pd.read_csv(f"../input/acea-water-prediction/{x}").shape)
datasets_df['Rows'] = shapes.apply(lambda s: s[0])
datasets_df['Cols'] = shapes.apply(lambda s: s[1])
datasets_df['Waterbody_type'] = datasets_df['Waterbody_type'].replace('Water', 'Water_Spring')
datasets_df = datasets_df.sort_values(by=['Waterbody_type', 'Rows'], ascending=[True, False]).reset_index(drop=True)
datasets_df.style.bar(subset=['Rows', 'Cols'], color='#118DFF')

## Relevant Variables info

In [None]:
def df_relinfo(df, target_var=()):
    """Summarise missing values, dtype and role (Target/Predictor) for each column."""
    na_counts = df.isna().sum()
    x = pd.DataFrame({
        'Feature': df.columns,
        '%Na': (na_counts / df.shape[0]).tolist(),  # fraction of missing rows, between 0 and 1
        'Na_qnt': na_counts.tolist(),               # absolute number of missing rows
        'dType': df.dtypes.tolist(),
    })
    x['Variable'] = x.Feature.apply(lambda col: 'Target' if col in target_var else 'Predictor')
    # Most incomplete features first, with a bar showing the missing fraction.
    return x.sort_values(by='%Na', ascending=False).\
            reset_index(drop=True).style.bar(subset=['%Na'], color='#118DFF')
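
As a small illustration of what this summary contains, the sketch below computes the same per-column missing-value fraction and Target/Predictor label on a toy frame, without the styling. The frame, its column names and the `NaFraction` column name are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up columns and values (not competition data).
toy = pd.DataFrame({
    "Rainfall_A": [1.0, np.nan, 3.0, np.nan],
    "Temperature_A": [10.0, 11.0, 12.0, 13.0],
    "Depth_to_Groundwater_X": [np.nan, -5.0, -5.1, -5.2],
})
toy_targets = ["Depth_to_Groundwater_X"]

summary = pd.DataFrame({
    "Feature": toy.columns,
    "NaFraction": toy.isna().mean().to_numpy(),  # share of missing rows per column
    "Na_qnt": toy.isna().sum().to_numpy(),       # number of missing rows per column
})
summary["Variable"] = np.where(summary["Feature"].isin(toy_targets), "Target", "Predictor")
summary = summary.sort_values("NaFraction", ascending=False).reset_index(drop=True)
print(summary)
```

Sorting by the missing fraction puts the least complete features at the top, which is where imputation or column-dropping decisions are most likely to be needed.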

### Aquifer Auser features

In [None]:
auser_targets = ['Depth_to_Groundwater_LT2', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS']
    
df_relinfo(aq_auser,auser_targets)

### Aquifer Doganella features

In [None]:
doganella_targets = ['Depth_to_Groundwater_Pozzo_1',
                     'Depth_to_Groundwater_Pozzo_2',
                     'Depth_to_Groundwater_Pozzo_3',
                     'Depth_to_Groundwater_Pozzo_4',
                     'Depth_to_Groundwater_Pozzo_5',
                     'Depth_to_Groundwater_Pozzo_6',
                     'Depth_to_Groundwater_Pozzo_7',
                     'Depth_to_Groundwater_Pozzo_8',
                     'Depth_to_Groundwater_Pozzo_9']
    
df_relinfo(aq_doganella,doganella_targets)

### Aquifer Luco features

In [None]:
luco_targets = ['Depth_to_Groundwater_Podere_Casetta']
    
df_relinfo(aq_luco,luco_targets)

### Aquifer Petrignano features

In [None]:
petrignano_targets = ['Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25']
    
df_relinfo(aq_petrignano,petrignano_targets)

### Lake Bilancino features 

In [None]:
bilancino_targets = ['Lake_Level', 'Flow_Rate']

df_relinfo(lk_bilancino, bilancino_targets)

### River Arno features

In [None]:
arno_targets = ['Hydrometry_Nave_di_Rosano']
df_relinfo(rv_arno, arno_targets)

### Water Spring Amiata features

In [None]:
amiata_targets = ['Flow_Rate_Bugnano', 
                  'Flow_Rate_Arbure', 
                  'Flow_Rate_Ermicciolo', 
                  'Flow_Rate_Galleria_Alta']
df_relinfo(ws_amiata, amiata_targets)

### Water Spring Lupa features

In [None]:
lupa_targets = ['Flow_Rate_Lupa']
df_relinfo(ws_lupa, lupa_targets)

### Water Spring Madonna di Canneto features

In [None]:
madonna_targets = ['Flow_Rate_Madonna_di_Canneto']
df_relinfo(ws_madonna, madonna_targets)
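
Since the models ultimately depend on the target columns, it can help to compare how complete the targets are across datasets. The sketch below shows one possible way to do that by keeping frames and target lists in two dictionaries. The toy frames, their column names and the dictionary keys are hypothetical stand-ins for the loaded datasets and the `*_targets` lists.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two loaded datasets (made-up names and values).
frames = {
    "toy_spring": pd.DataFrame({"Rainfall": [1.0, 2.0, np.nan, 4.0],
                                "Flow_Rate_Toy": [np.nan, np.nan, 0.3, 0.4]}),
    "toy_river": pd.DataFrame({"Rainfall": [0.0, 0.5, 0.2, 0.1],
                               "Hydrometry_Toy": [1.1, np.nan, 1.3, 1.2]}),
}
targets = {"toy_spring": ["Flow_Rate_Toy"], "toy_river": ["Hydrometry_Toy"]}

# One row per (dataset, target) with the fraction of missing target values.
rows = []
for name, frame in frames.items():
    for col in targets[name]:
        rows.append({"Dataset": name, "Target": col,
                     "NaFraction": frame[col].isna().mean()})

target_na = (pd.DataFrame(rows)
             .sort_values("NaFraction", ascending=False)
             .reset_index(drop=True))
print(target_na)
```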

# Work in progress...