The Acea Group is one of the leading Italian multiutility operators. Listed on the Italian Stock Exchange since 1999, the company manages and develops water and electricity networks and environmental services. Acea is the foremost Italian operator in the water services sector, supplying 9 million inhabitants in Lazio, Tuscany, Umbria, Molise, and Campania.

In this competition Kagglers were asked to focus on the water sector to help Acea Group preserve precious waterbodies such as water springs, lakes, rivers, and aquifers. To help preserve the health of these waterbodies it is important to predict the water availability in terms of level and water flow for each day/month of the year.

In the following sections, I present my predictions of water availability and their accuracy in terms of mean absolute error (MAE), mean squared error (MSE), and R^2 between observed and predicted values.
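For reference, the three metrics can be written out directly. The sketch below uses made-up observed and predicted values (not competition data) and plain NumPy; the function name `regression_metrics` is introduced here for illustration only.

```python
import numpy as np

def regression_metrics(y_obs, y_pred):
    """Return (MAE, MSE, R^2) for observed and predicted values."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_obs - y_pred
    mae = np.mean(np.abs(err))                     # mean absolute error
    mse = np.mean(err ** 2)                        # mean squared error
    ss_res = np.sum(err ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination
    return mae, mse, r2

# toy values, for illustration only
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
mae, mse, r2 = regression_metrics(y_obs, y_pred)
print(f'MAE={mae:.3f}, MSE={mse:.3f}, R^2={r2:.3f}')
```

These are the same quantities that `mean_absolute_error`, `mean_squared_error`, and `r2_score` from scikit-learn return in the prediction section.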

# Content:

1. Loading Data
2. Visualization of Data
3. Data imputation and missing value treatment
4. Exploratory Analysis and Feature Engineering
5. Predictive Accuracy and Feature Importance
6. Statistics of prediction

In [None]:
# import libraries that will be used in the analysis
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Loading Data

In [None]:
# loading all data into a list structure
files=os.listdir('../input/acea-water-prediction')
print(files)
# read csv data files
Aquifer_Doganella=pd.read_csv('../input/acea-water-prediction/'+'Aquifer_Doganella.csv')
Aquifer_Auser=pd.read_csv('../input/acea-water-prediction/'+'Aquifer_Auser.csv')
Water_Spring_Amiata=pd.read_csv('../input/acea-water-prediction/'+'Water_Spring_Amiata.csv')
Lake_Bilancino=pd.read_csv('../input/acea-water-prediction/'+'Lake_Bilancino.csv')
Water_Spring_Madonna_di_Canneto=pd.read_csv('../input/acea-water-prediction/'+'Water_Spring_Madonna_di_Canneto.csv')
Aquifer_Luco=pd.read_csv('../input/acea-water-prediction/'+'Aquifer_Luco.csv')
Aquifer_Petrignano=pd.read_csv('../input/acea-water-prediction/'+'Aquifer_Petrignano.csv')
Water_Spring_Lupa=pd.read_csv('../input/acea-water-prediction/'+'Water_Spring_Lupa.csv')
River_Arno=pd.read_csv('../input/acea-water-prediction/'+'River_Arno.csv')
# combine data into datasets
datasets = [Aquifer_Doganella,Aquifer_Auser,Water_Spring_Amiata,Lake_Bilancino,Water_Spring_Madonna_di_Canneto,
           Aquifer_Luco,Aquifer_Petrignano,Water_Spring_Lupa,River_Arno]
datasets_names=['Aquifer_Doganella.csv', 'Aquifer_Auser.csv', 'Water_Spring_Amiata.csv', 'Lake_Bilancino.csv', 'Water_Spring_Madonna_di_Canneto.csv', 'Aquifer_Luco.csv', 'Aquifer_Petrignano.csv', 'Water_Spring_Lupa.csv', 'River_Arno.csv']

# 2. Visualization of Data


In [None]:
#boxplot of all variables in each file to examine data range and distribution
fig,ax1 = plt.subplots(3,3,figsize=(15,15))
for i in range(len(datasets)):
    ax1.flatten()[i].set_title(datasets_names[i][:-4])
    datasets[i][datasets[i].columns[1:]].boxplot(ax=ax1.flatten()[i],rot=90)
plt.tight_layout()
    

**Observation 1:** The boxplots show which variables each data file contains and how the response and predictor variables are distributed. However, the variables are not on the same scale (e.g. rainfall in mm, flow rate in cubic meters per second, and volume in cubic meters), which makes it hard to view the distributions fully. Further analysis, including correlation, exploratory analysis, and feature engineering, is necessary.
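One possible way to put the variables on a common scale before plotting is to z-score each column. This is a minimal sketch on a made-up frame; the helper `standardize` and the toy column names are assumptions for illustration and are not part of the analysis above.

```python
import pandas as pd

def standardize(df):
    """Z-score every numeric column: subtract the column mean, divide by the column std."""
    num = df.select_dtypes('number')
    return (num - num.mean()) / num.std()

# toy frame with columns on very different scales (values are made up)
toy = pd.DataFrame({
    'Rainfall_mm': [0.0, 5.0, 10.0, 25.0],
    'Volume_m3': [-30000.0, -28000.0, -35000.0, -31000.0],
})
scaled = standardize(toy)
print(scaled.round(2))
```

The result could be passed to `.boxplot(...)` in the same way as the raw columns. Note that a constant column has zero standard deviation and would come out as NaN.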

# 3. Data imputation and missing value treatment
There are many missing values in the daily time series. For example, the water use volumes are mostly missing and are hard to replace with meaningful values. In this analysis, the daily values were first grouped into monthly means, and the remaining missing values were then replaced with the mean of each monthly series.
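The two steps can be illustrated on a small synthetic series. This is only a sketch: the values, the single `Rainfall` column, and the gap positions are made up for the example.

```python
import numpy as np
import pandas as pd

# toy daily series with gaps (values are made up)
daily = pd.DataFrame({
    'Date': pd.date_range('2020-01-01', periods=90, freq='D'),
    'Rainfall': np.arange(90, dtype=float),
})
daily.loc[10:20, 'Rainfall'] = np.nan   # a gap inside January
daily.loc[31:59, 'Rainfall'] = np.nan   # all of February missing

# step 1: monthly means; NaNs are skipped within each month
daily['Year-mon'] = daily['Date'].dt.strftime('%Y-%m')
monthly = daily.groupby('Year-mon').mean(numeric_only=True).reset_index()

# step 2: a month with no observations at all is still NaN; fill it with the column mean
filled = monthly.fillna(monthly.mean(numeric_only=True))
print(filled)
```

Short gaps disappear in the monthly averaging, while a month without any observation is filled with the overall mean of the monthly series.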

In [None]:
# define a function to plot features vs. target variables
def plot1(inputdata,features,target_var,ylabel1,ylabel2):
    fig, ax1= plt.subplots(figsize=(15,5))
    #inputdata['Year-mon']=pd.to_datetime(inputdata['Year-mon'])
    ax1.bar(inputdata['Year-mon'],inputdata[features])
    ax1.spines['left'].set_color('blue')
    ax1.spines['left'].set_linewidth(3)
    ax1.legend([features],loc=2)
    ax2 =ax1.twinx()
    ax2.plot(inputdata['Year-mon'],inputdata[target_var],color='red')
    ax2.spines['right'].set_color('red')
    ax2.spines['right'].set_linewidth(3)
    ax1.set_ylabel(ylabel1)
    ax1.set_xticks(range(0,len(inputdata),10))
    ax1.set_xticklabels(inputdata['Year-mon'][range(0,len(inputdata),10)],rotation=90)
    ax2.set_ylabel(ylabel2)
    ax2.set_xticks(range(0,len(inputdata),10))
    ax2.set_xticklabels(inputdata['Year-mon'][range(0,len(inputdata),10)],rotation=90)
    # label every target series, not only the first one
    ax2.legend(target_var,loc=1)


In [None]:
# transform daily data into monthly data using groupby
for i in range(len(datasets)):
    # drop rows without a date, then derive a 'YYYY-MM' label for each row
    datasets[i].drop(datasets[i][datasets[i].Date.isnull()].index,inplace = True,axis=0)
    datasets[i]['Year-mon']=pd.to_datetime(datasets[i].Date).apply(lambda x: x.strftime('%Y-%m'))
# numeric_only=True keeps the non-numeric Date column out of the monthly mean
datasets_monthly = [datasets[i].groupby('Year-mon').mean(numeric_only=True).reset_index() for i in range(len(datasets))]


In [None]:
target_var=['Depth_to_Groundwater_Pozzo_1', 'Depth_to_Groundwater_Pozzo_2',
       'Depth_to_Groundwater_Pozzo_3', 'Depth_to_Groundwater_Pozzo_4',
       'Depth_to_Groundwater_Pozzo_5', 'Depth_to_Groundwater_Pozzo_6',
       'Depth_to_Groundwater_Pozzo_7', 'Depth_to_Groundwater_Pozzo_8',
       'Depth_to_Groundwater_Pozzo_9', ]
plot1(datasets_monthly[0],'Rainfall_Monteporzio',target_var,'Rainfall, mm','Groundwater Level, m')

In [None]:
plot1(datasets_monthly[0],'Volume_Pozzo_9',['Depth_to_Groundwater_Pozzo_9'],'Volume, cubic meters','Groundwater Level, m')

**Observation 2:** Relatively dry years (less rainfall) occurred in 2015-2017, which seems to have caused well water levels to drop, especially at wells Pozzo_1 and Pozzo_9. Water usage, by contrast, does not seem to be a major driver of these drops: at well Pozzo_9, lower water usage coincides with some of the larger drops in water level. Of course, these are only observations for particular wells over particular periods. A fuller picture of which features matter is given later in the prediction and feature importance analysis.

# 4. Exploratory Analysis and Feature Engineering

In [None]:
# create a season variable to see if it can help improve prediction
for i in range(len(datasets_monthly)):
    datasets_monthly[i].loc[datasets_monthly[i]['Year-mon'].str.split('-').str[1].str.contains('12|01|02'),'season']=1
    datasets_monthly[i].loc[datasets_monthly[i]['Year-mon'].str.split('-').str[1].str.contains('03|04|05'),'season']=2
    datasets_monthly[i].loc[datasets_monthly[i]['Year-mon'].str.split('-').str[1].str.contains('06|07|08'),'season']=3
    datasets_monthly[i].loc[datasets_monthly[i]['Year-mon'].str.split('-').str[1].str.contains('09|10|11'),'season']=4
# create a dryness variable based on rainfall amount (column 1 is expected to be a rainfall series)
# note: larger values mean wetter months (4 = wettest, 1 = driest)
for df in datasets_monthly:
    rain = df.iloc[:,1]
    mean, std = rain.mean(), rain.std()
    df.loc[rain>(mean+std),'dryness']=4
    df.loc[(rain<=(mean+std)) & (rain>mean),'dryness']=3
    df.loc[(rain>=(mean-std)) & (rain<=mean),'dryness']=2
    df.loc[rain<(mean-std),'dryness']=1
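The same two features can also be built more compactly. The sketch below is an alternative formulation on a made-up table, not the code used in this analysis; it assumes a `'Year-mon'` column in `YYYY-MM` format and a single `Rainfall` column.

```python
import numpy as np
import pandas as pd

# toy monthly table (values are made up)
monthly = pd.DataFrame({
    'Year-mon': ['2020-01', '2020-04', '2020-07', '2020-10', '2020-12'],
    'Rainfall': [1.0, 2.0, 3.0, 4.0, 10.0],
})

# season: Dec-Feb -> 1, Mar-May -> 2, Jun-Aug -> 3, Sep-Nov -> 4
month = monthly['Year-mon'].str[-2:].astype(int)
monthly['season'] = month.map(lambda m: (m % 12) // 3 + 1)

# dryness: bin rainfall relative to its mean and standard deviation;
# np.select takes the first matching condition, so the order matters
rain = monthly['Rainfall']
mean, std = rain.mean(), rain.std()
conditions = [rain > mean + std, rain > mean, rain >= mean - std, rain < mean - std]
monthly['dryness'] = np.select(conditions, [4, 3, 2, 1], default=np.nan)
print(monthly)
```

Rows with missing rainfall match none of the conditions and keep `NaN` as their dryness value, as in the loop-based version.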

# 5. Predictive Accuracy and Feature Importance

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
f,ax3 = plt.subplots(9,2,figsize=(10,50))
# set up the list of target variables for each dataset
target_var = [[col for col in datasets_monthly[0].columns if 'Depth' in col]]
target_var.append([col for col in datasets_monthly[1].columns if 'Depth' in col])
target_var.append([col for col in datasets_monthly[2].columns if 'Flow_Rate' in col])
target_var.append([col for col in datasets_monthly[3].columns if 'Lake' in col])
target_var.append([col for col in datasets_monthly[4].columns if 'Flow_Rate' in col])
target_var.append([col for col in datasets_monthly[5].columns if 'Depth' in col])
target_var.append([col for col in datasets_monthly[6].columns if 'Depth' in col])
target_var.append([col for col in datasets_monthly[7].columns if 'Flow_Rate' in col])
target_var.append([col for col in datasets_monthly[8].columns if 'Hydrometry' in col])
# create a dataframe to store prediction results
stats_index=[datasets_names[j][:-4]+': '+item for j, sublist in enumerate(target_var) for item in sublist]
stats = pd.DataFrame(columns=['MSE','MAE','R-Squared','Top 3 important features'],index=stats_index)
for j in range(len(target_var)):
    name = datasets_names[j][:-4]
    for i,target in enumerate(target_var[j]):
        # keep rows where the response variable is observed and impute NaNs in the predictors with column means
        idx1 = datasets_monthly[j][target].notnull()
        data = datasets_monthly[j][idx1]
        data = data.fillna(data.mean(numeric_only=True))
        # predictors: everything except the target columns of this dataset and the month label
        x = data.drop(target_var[j],axis=1).drop('Year-mon',axis=1)
        y = data[target]
        # a fixed random_state makes the train/test split reproducible
        x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=1234)
        RFG = RandomForestRegressor(n_estimators=20,random_state=0)
        RFG.fit(x_train,y_train)
        importance = RFG.feature_importances_
        # predict once and reuse the result for all metrics
        y_pred = RFG.predict(x_test)
        mse = mean_squared_error(y_test,y_pred)
        mae = mean_absolute_error(y_test,y_pred)
        r2 = r2_score(y_test,y_pred)
        # output statistics of prediction
        row = f'{name}: {target}'
        stats.loc[row,'MSE']=f'{mse:.3f}'
        stats.loc[row,'MAE']=f'{mae:.3f}'
        stats.loc[row,'R-Squared']=f'{r2:.3f}'
        stats.loc[row,'Top 3 important features']=', '.join(x.columns[np.argsort(importance)[::-1][:3]])
        # visualize the predictions for the first dataset (Aquifer_Doganella) as an example
        if j == 0:
            ax3[i,0].plot(y_test,y_pred,'.')
            # 1:1 line: points on it are predicted perfectly
            ax3[i,0].plot(np.linspace(np.amin(y_test),np.amax(y_test),100),np.linspace(np.amin(y_test),np.amax(y_test),100),'k')
            ax3[i,0].set_ylabel('Predicted')
            ax3[i,0].set_xlabel('Observed')
            ax3[i,0].text(np.min(y_test),np.max(y_test)-1,f'MSE: {mse:.3f}')
            ax3[i,0].text(np.min(y_test),np.max(y_test)-2,f'MAE: {mae:.3f}')
            ax3[i,0].text(np.min(y_test),np.max(y_test)-3,f'R-Squared: {r2:.3f}')
            ax3[i,0].set_title(f'{target}, meter')
            ax3[i,1].bar(range(len(importance)),importance)
            ax3[i,1].set_xticks(range(len(importance)))
            ax3[i,1].set_xticklabels(x.columns,rotation=90)
            ax3[i,1].set_ylabel('Importance')
plt.tight_layout()

# 6. Statistics of prediction

In [None]:
stats


In [None]:
stats.to_csv('final_stats.csv')