# **Hi!ckathon #5: AI & Sustainability**

## **Group 21: Aqua Smart Solution**

The goal is to build an AI model that can predict the watertable/ground water levels of french piezometric stations, with a focus on the summer months. To build this model, you were given piezometric/watertable, weather, hydrology, water withdrawal and economic data. But beyond producing an AI model, the competition will ask you to realistically project your solution in a market / real-world context.

The full dataset contains over 3 million rows with 136 columns. It was split into a train/test set.
- Train set (`X_train_Hi5.csv`): The dataset has around 2 800 000 rows. It contains data between 2020 and 2023, excluding the summer months (june, july, august, september) of 2022 and 2023.
- Test set (`X_test_Hi5.csv`): The contains has around 600 000 rows. It contains data for the 2022 and 2023 summer months (june, july, august, september).
- Test submission example (`y_test_submission_example_Hi5.csv`): Please follow this example to submit results to the leaderboard. The "row_index" variable is a unique identifier of each row, to match the values

The target variable to predict is `piezo_groundwater_level_category`.

----

## **Import libraries**

In [1]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, RobustScaler, LabelEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 280)

----

## **Load the datasets**

In [2]:
train = pd.read_csv('data_processed_clipped_asymetry_symetry_upload2.csv')
test = pd.read_csv('test_preprocessed_clipped_asymetry_symetry.csv')

In [3]:
train['piezo_measurement_date'] = pd.to_datetime(train['piezo_measurement_date'])
test['piezo_measurement_date'] = pd.to_datetime(test['piezo_measurement_date'])

----

## **Feature Engineering**

In [4]:
to_passthrough = [] 

In [5]:
def sin_transformer(period):
    return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi))

def cos_transformer(period):
    return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi))


In [None]:
train['day_cos'] = train['piezo_measurement_date'].dt.day.astype(float)
train['day_sin'] = train['piezo_measurement_date'].dt.day.astype(float)
train['month_cos'] = train['piezo_measurement_date'].dt.month.astype(float)
train['month_sin'] = train['piezo_measurement_date'].dt.month.astype(float)
train['month'] = train['piezo_measurement_date'].dt.month.astype(int)
train['year'] = train['piezo_measurement_date'].dt.year.astype(int)

train["day_sin"] = sin_transformer(365).fit_transform(train["day_sin"])
train["day_cos"] = cos_transformer(365).fit_transform(train["day_cos"])
train["month_sin"] = sin_transformer(12).fit_transform(train["month_sin"])
train["month_cos"] = cos_transformer(12).fit_transform(train["month_cos"])
train['quarter_sin'] = sin_transformer(4).fit_transform(train["quarter"])
train['quarter_cos'] = sin_transformer(4).fit_transform(train['quarter'])

test['day_cos'] = test['piezo_measurement_date'].dt.day.astype(float)
test['day_sin'] = test['piezo_measurement_date'].dt.day.astype(float)
test['month_cos'] = test['piezo_measurement_date'].dt.month.astype(float)
test['month_sin'] = test['piezo_measurement_date'].dt.month.astype(float)
test['month'] = test['piezo_measurement_date'].dt.month.astype(int)
test['year'] = test['piezo_measurement_date'].dt.year.astype(int)

test["day_sin"] = sin_transformer(365).fit_transform(test["day_sin"])
test["day_cos"] = cos_transformer(365).fit_transform(test["day_cos"])
test["month_sin"] = sin_transformer(12).fit_transform(test["month_sin"])
test["month_cos"] = cos_transformer(12).fit_transform(test["month_cos"])
test['quarter_sin'] = sin_transformer(4).fit_transform(test["quarter"])
test['quarter_cos'] = sin_transformer(4).fit_transform(test['quarter'])

to_passthrough.extend(['day_cos', 'day_sin', 'month_cos', 'month_sin', 'quarter_sin', 'quarter_cos', 'quarter', 'month', "year"])

In [7]:
train = train.drop(columns=['piezo_measurement_date'])

In [None]:
train_categorical = train.select_dtypes(include=['object', 'category'])
test_categorical = test.select_dtypes(include=['object', 'category'])


train_numerical = train.select_dtypes(include=['int', 'float', 'number'])
test_numerical = test.select_dtypes(include=['int', 'float', 'number'])

----

## **Pipeline of the feature engineering process**

In [None]:
ordinal_order = {
    "hydro_qualification_label": ['Douteuse', 'Non qualifiée', 'Bonne'],
    "hydro_status_label": ["Donnée brute", 'Donnée corrigée', 'Donnée pré-validée', 'Donnée validée'],
    "piezo_measure_nature_code": ['N', '0', 'I', 'D', 'S'],
    "piezo_qualification": ['Incorrecte', 'Non qualifié', 'Incertaine', 'Correcte'],
    "piezo_status": ['Donnée brute', 'Donnée contrôlée niveau 1', 'Donnée contrôlée niveau 2', 'Donnée interprétée'],
    "piezo_obtention_mode": ["Mode d'obtention inconnu", 'Valeur mesurée', 'Valeur reconstituée'],
}

to_ordinal = list(ordinal_order.keys())
# to_one_hot = ['hydro_hydro_quantity_elab', 'piezo_station_commune_code_insee']
to_one_hot = ['hydro_hydro_quantity_elab']
# to_label_encode = ['piezo_station_commune_code_insee']

numerical_to_drop = [
    'hydro_method_label', 'hydro_observation_date_elab', 'piezo_station_department_code',
    'piezo_station_bss_code', 'piezo_station_commune_name', 'piezo_station_bss_id',
    'hydro_longitude', 'hydro_latitude', 'distance_piezo_meteo', 'distance_piezo_hydro',
    'piezo_bss_code', 'piezo_obtention_mode', 'piezo_qualification', 'piezo_continuity_name', 
    'piezo_producer_code', 'piezo_measure_nature_name', 'meteo_wind_direction_max_inst_2m', 
    'meteo_wind_speed_avg_2m', 'meteo_time_wind_max_2m', 'meteo_id', 'meteo_name', 
    'meteo_temperature_avg_threshold', 'meteo_temperature_min_50cm', 
    'prelev_structure_code_2', 'prelev_structure_code_1', 'prelev_structure_code_0', 
    'prelev_volume_obtention_mode_label_2', 'prelev_volume_obtention_mode_label_1', 
    'prelev_volume_obtention_mode_label_0', 'prelev_usage_label_2', 
    'prelev_usage_label_1', 'prelev_usage_label_0'
]

numerical_to_drop.extend(to_passthrough)

to_scale = [col for col in train_numerical.columns if col not in numerical_to_drop]

ordinal_pipeline = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(categories=[ordinal_order[col] for col in to_ordinal]))
])

one_hot_pipeline = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

label_encode_pipeline = Pipeline(steps=[
    ('label_encode', LabelEncoder())
])

robust_scaling_pipeline = Pipeline(steps=[
    ('scaler', RobustScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('ordinal', ordinal_pipeline, to_ordinal),
    ('onehot', one_hot_pipeline, to_one_hot),
    ('robust_scale', robust_scaling_pipeline, to_scale),
    ('passthrough', 'passthrough', to_passthrough) 
], remainder='drop')


In [None]:
summer_months = [6, 7, 8, 9]

summer_2021 = train[(train['year'] == 2021) & (train['month'].isin(summer_months))]
summer_2020 = train[(train['year'] == 2020) & (train['month'].isin(summer_months))]
summers = train[((train['year'] == 2020) | (train['year'] == 2021)) & (train['month'].isin(summer_months))]

train_data = train[~((train['year'] == 2021) & (train['month'].isin(summer_months)))]

train_final = summers.reset_index(drop=True)
validation_final = summer_2021.reset_index(drop=True)

X_train = train_final.drop('piezo_groundwater_level_category', axis=1)
y_train = train_final['piezo_groundwater_level_category']
X_val = validation_final.drop('piezo_groundwater_level_category', axis=1)
y_val = validation_final['piezo_groundwater_level_category']

In [None]:
X_train = X_train.drop(columns=['piezo_station_commune_code_insee'])
X_val = X_val.drop(columns=['piezo_station_commune_code_insee'])
test = test.drop(columns=['piezo_station_commune_code_insee'])

In [12]:
preprocessor.fit(X_train)

In [None]:
X_train_processed = preprocessor.transform(X_train)
X_val_processed = preprocessor.transform(X_val)

In [14]:
print(X_train_processed.shape)
print(X_val_processed.shape)

(614663, 99)
(309088, 99)


In [None]:
categories = ['Very Low', 'Low', 'Average', 'High', 'Very High']

encoder = OrdinalEncoder(categories=[categories]) 

y_train_encoded = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_val_encoded = encoder.transform(y_val.values.reshape(-1, 1))

---

## **Model Training & Validation**

In [None]:
clf = RandomForestClassifier(
    n_estimators=1,  
    max_depth=None,    
    random_state=42,   
    n_jobs=-1          
)

In [18]:
clf.fit(X_train_processed, y_train_encoded.ravel())

In [None]:
y_val_pred = clf.predict(X_val_processed)

In [None]:
f1 = f1_score(y_val_encoded, y_val_pred, average='weighted')
print(f"Weighted F1 Score sur l'ensemble de validation : {f1:.4f}")

Weighted F1 Score sur l'ensemble de validation : 0.9360


----

## **Model Training on the entire train dataset**

In [None]:
clf = RandomForestClassifier(
    n_estimators=200,  
    max_depth=None,    
    random_state=42,   
    n_jobs=-1          
)

In [None]:
X_train_total = train.drop('piezo_groundwater_level_category', axis=1)
y_train_total = train['piezo_groundwater_level_category']

preprocessor.fit(X_train_total)

X_train_total_processed = preprocessor.transform(X_train_total)
X_test_total_processed = preprocessor.transform(test)

In [None]:
categories = ['Very Low', 'Low', 'Average', 'High', 'Very High']

encoder = OrdinalEncoder(categories=[categories]) 

y_train_total_encoded = encoder.fit_transform(y_train_total.values.reshape(-1, 1))

In [None]:
clf.fit(X_train_total_processed, y_train_total_encoded.ravel())

In [None]:
y_test_pred = clf.predict(X_test_total_processed)

In [None]:
test = pd.read_csv('./annex/row_index_of_test.csv')

In [None]:
y_test_pred_labels = encoder.inverse_transform(y_test_pred.reshape(-1, 1))

In [None]:
test_predictions = pd.DataFrame({
    'row_index': test["row_index"],  
    'piezo_groundwater_level_category': y_test_pred_labels.ravel()
})

## **Submission to csv**

In [None]:
test_predictions.to_csv('y_test_Hi5_RF_summeronly_v_very_final_yes.csv', index=False)