# <div style="color:#fff;display:fill;border-radius:10px;background-color:#137aa4;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:40px;color:white;overflow:hidden;margin:0;font-size:100%"> Hackathon DataMining 2023 - Groupe 7 - 🤖🐦</div>

#### This notebook documents the process and results of our group's Data Mining hackathon project. It covers the steps we took, from brainstorming, research, and planning through execution and presentation, along with our results and conclusions. It serves as a record of the group's work for future reference and analysis.

### We decided to centralize our code on Kaggle notebooks to allow for greater adaptability and collaboration, as well as easy accessibility for corrections. 


# <div style="color:#fff;display:fill;border-radius:10px;background-color:#3c968b;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 📋 Table of Contents</div>

* [Libraries](#section-1)
* [Data Loading](#section-3)
* [EDA](#section-2)
* [Data Cleaning](#section-4) 
* [AutoML](#section-5)
* [Predictions](#section-6) 
* [Save](#section-7) 

<a id="section-1"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#3c968b;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 📖 Libraries and color palette</div> 

In [None]:
!pip install xlrd
!pip install autoviz --upgrade
!pip install dataprep
!pip install datacleaner

In [None]:
# import the necessary libraries
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt
import seaborn as sns

from pandas_profiling import ProfileReport
from autoviz.AutoViz_Class import AutoViz_Class
from autoviz import data_cleaning_suggestions
from dataprep.clean import clean_df
from datacleaner import autoclean, autoclean_cv

from sklearn.metrics import confusion_matrix
import h2o
from h2o.automl import H2OAutoML
from sklearn.metrics import classification_report

In [None]:
# Defining all our palette colours.
primary_blue = "#496595"
primary_blue2 = "#85a1c1"
primary_blue3 = "#3f4d63"
primary_grey = "#c6ccd8"
primary_black = "#202022"
primary_bgcolor = "#f4f0ea"

primary_green = px.colors.qualitative.Plotly[2]

plt.rcParams['axes.facecolor'] = primary_bgcolor

colors = [primary_blue, primary_blue2, primary_blue3, primary_grey, primary_black, primary_bgcolor, primary_green]
sns.palplot(sns.color_palette(colors))

<a id="section-3"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#3c968b;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">⬇️ Data Loading</div> 

As part of our research for the hackathon, we downloaded the relevant data from the CELENE server and loaded it into our notebook for analysis.

In [None]:
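# Paths
# NOTE: all three paths point to the same file, so `test` and `service` are loaded as copies of the training set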
train_path = '/kaggle/input/2022-dataset-hackathon/train_set_subgroup.csv'
test_path = '/kaggle/input/2022-dataset-hackathon/train_set_subgroup.csv'
service_path = '/kaggle/input/2022-dataset-hackathon/train_set_subgroup.csv'

# Read the semicolon-delimited CSV files into dataframes (sep=';')
train = pd.read_csv(train_path, sep=';')
test = pd.read_csv(test_path, sep=';')
service = pd.read_csv(service_path, sep=';')
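
The three path variables in the loading cell resolve to the same CSV, so `test` and `service` end up as copies of `train`. A small guard along the following lines could flag that before any file is read. It is only a sketch: `find_shared_paths` is a hypothetical helper, not part of the notebook, and the file names in the usage example are placeholders.

```python
from collections import defaultdict


def find_shared_paths(named_paths):
    """Return {path: [names]} for every path used by more than one name.

    `named_paths` maps a dataset name (e.g. "train") to its file path.
    Hypothetical helper for illustration; not part of the notebook.
    """
    names_by_path = defaultdict(list)
    for name, path in named_paths.items():
        names_by_path[path].append(name)
    return {path: names for path, names in names_by_path.items()
            if len(names) > 1}


# Placeholder file names, chosen only to exercise the helper.
shared = find_shared_paths({
    "train": "data/a.csv",
    "test": "data/a.csv",
    "service": "data/b.csv",
})
print(shared)
```
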

In [None]:
# Variable declaration
TARGET = 'IS_CO_REF'
ID = 'id_client'  # Reassigned to 'FILE_TEXT' in the Data Cleaning section

In [None]:
# Show the shape of the train and test dataframes
print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")

In [None]:
# Show columns of the train and test
print(f"Train columns: {train.columns}")
print(f"Test columns: {test.columns}")

<a id="section-2"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#3c968b;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">💹 Exploratory data analysis</div> 


In [None]:
%matplotlib inline
# AV = AutoViz_Class()
# # Use the AutoViz class to generate a report on the dataframe
# filename = AV.AutoViz(filename='', depVar=TARGET, dfte=train, chart_format='bokeh')
data_cleaning_suggestions(train)

In [None]:
data_cleaning_suggestions(test)

In [None]:
data_profile_train = ProfileReport(train, title='Hackathon Train data Profile Report',dark_mode=True, explorative=True)
data_profile_test = ProfileReport(test, title='Hackathon Test data Profile Report',dark_mode=True, explorative=True)
comparison_report = data_profile_train.compare(data_profile_test)

In [None]:
comparison_report

<a id="section-4"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#3c968b;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">🧹 Data Cleaning</div> 

In [None]:
# Extract the target from the train dataframe and drop the column
targets = train[TARGET]
train.drop(TARGET, axis=1, inplace=True)

In [None]:
# Same for the test frame (despite its name, `targets_train` holds the test targets)
targets_train = test[TARGET]
test.drop(TARGET, axis=1, inplace=True)

In [None]:
# Set the identifier column aside and drop it from both frames
ID = 'FILE_TEXT'
ids_train = train[ID]
train.drop(ID, axis=1, inplace=True)
ids_test = test[ID]
test.drop(ID, axis=1, inplace=True)


In [None]:
# Show the shape of the train and test dataframes
print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
# Show columns of the train and test
print(f"Train columns: {train.columns}")
print(f"Test columns: {test.columns}")

In [None]:
train_clean, test_clean = autoclean_cv(train, test)

In [None]:
# Show the shape of the train and test dataframes
print(f"Train shape: {train_clean.shape}")
print(f"Test shape: {test_clean.shape}")
# Show columns of the train and test
print(f"Train columns: {train_clean.columns}")
print(f"Test columns: {test_clean.columns}")

In [None]:
# Profile the cleaned frames (not the raw ones) so the report reflects the cleaning step
data_profile_train_clean = ProfileReport(train_clean, title='Hackathon Train cleaned data Profile Report', dark_mode=True, explorative=True)
data_profile_test_clean = ProfileReport(test_clean, title='Hackathon Test cleaned data Profile Report', dark_mode=True, explorative=True)
comparison_report_clean = data_profile_train_clean.compare(data_profile_test_clean)

In [None]:
data_cleaning_suggestions(train_clean)

In [None]:
data_cleaning_suggestions(test_clean)

In [None]:
comparison_report_clean

In [None]:
# Rejoin targets and train
train_clean = pd.concat([train_clean, targets], axis=1)

In [None]:
# Rejoin targets and test
test_clean = pd.concat([test_clean, targets_train], axis=1)
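
`pd.concat(..., axis=1)` aligns rows on the index rather than on position, so rejoining the targets only works if the cleaning step kept the original index. Below is a minimal sketch of that behaviour on toy data. The frame, the feature column name, and the simulated row drop are invented for the example; whether `autoclean_cv` changes the index is not something this notebook shows.

```python
import pandas as pd

# Toy stand-ins for the cleaned features and the saved target column.
features = pd.DataFrame({"f1": [10, 20, 30]}, index=[0, 1, 2])
target = pd.Series(["NO", "YES", "NO"], index=[0, 1, 2], name="IS_CO_REF")

# Same index on both sides: rows line up one to one.
aligned = pd.concat([features, target], axis=1)

# Simulate a cleaning step that dropped row 1 and renumbered the rest.
shrunk = features.drop(index=1).reset_index(drop=True)
misaligned = pd.concat([shrunk, target], axis=1)

print(aligned)
print(misaligned)  # the targets no longer sit next to their original rows
```

Comparing the length and index of both objects before concatenating is a cheap way to catch this.
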

<a id="section-5"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#3c968b;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">🤖 AutoML - H2O </div> 

In [None]:
# Start h2o
h2o.init()

In [None]:
# Convert the dataframe to h2o dataframe
df_h2o = h2o.H2OFrame(train_clean)
df_h2o_test = h2o.H2OFrame(test_clean)

In [None]:
x = df_h2o.columns  # All column names of the training frame
y = TARGET  # Name of the target column
x.remove(y)  # Keep only the predictors in x

# Drop the columns we do not want as predictors (identifiers, etc.)
drop_columns = ['RELATION_TYPE'] 
x = [i for i in x if i not in drop_columns]

In [None]:
# Convert the target column to a factor (categorical) so that H2O treats the task as classification
df_h2o[y] = df_h2o[y].asfactor()
df_h2o_test[y] = df_h2o_test[y].asfactor()

In [None]:
# Set the parameters
# Documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
aml = H2OAutoML(seed=2023,  # Fixed seed for reproducibility
                nfolds=0,  # 0 disables cross-validation
                max_models=30,  # Maximum number of models to train
                exclude_algos=["DeepLearning"],  # Algorithms to skip; other options include StackedEnsemble, GBM, XGBoost, GLM, DRF
                # include_algos=["XGBoost"],  # Alternative: restrict the search to the listed algorithms
                max_runtime_secs=300,  # Maximum total runtime of the AutoML run, in seconds
                # balance_classes=True,  # Oversample the minority classes to balance the class distribution
                max_runtime_secs_per_model=80,  # Maximum training time per model, in seconds
                )

# Train the model
aml.train(x=x, y=y, training_frame=df_h2o)

In [None]:
# Get the leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)

<a id="section-6"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#3c968b;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">🎲 Predictions</div> 


In [None]:
# Take the best model
model = aml.leader

# Predict the test data
pred = model.predict(df_h2o_test)

<a id="section-7"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#3c968b;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">💾 Save the work</div> 


#### Save the predictions

In [None]:
# Convert the h2o dataframe to pandas dataframe
pred_df = pred.as_data_frame()

# Get the predicted and true labels as 1-D arrays
y_pred_df = pred_df['predict'].values
y_test_df = df_h2o_test[TARGET].as_data_frame().values.ravel()

In [None]:
# Get the classification report in a dataframe
report = classification_report(y_test_df, y_pred_df, output_dict=True)
report = pd.DataFrame(report).transpose()
report

In [None]:
# Get the confusion matrix plot
cm = confusion_matrix(y_test_df, y_pred_df)
sns.heatmap(cm, annot=True, fmt='d')
plt.show()
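
For a binary target, the figures in the classification report follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand on made-up labels. The `'NO'`/`'YES'` label values are assumed from the export cell in this notebook, and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive="YES"):
    """Count TP/FP/FN/TN and derive precision, recall and F1 for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up labels, only to exercise the function.
y_true = ["YES", "NO", "YES", "NO", "YES"]
y_pred = ["YES", "YES", "NO", "NO", "YES"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```
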

In [None]:
# Export the predictions to a text file: the first line holds the ID ("700641"),
# the second line holds the predictions as 0 (negative) and 1 (positive),
# in the same order as the pairs in the CSV
# Print the first 10 predictions
print(y_pred_df[:10])
# Map 'NO' to 0 and every other label ('YES') to 1
y_pred_2 = np.where(y_pred_df == 'NO', 0, 1)
# Print the first 10 encoded predictions
print(y_pred_2[:10])
# Join into a single string with a space between elements
y_pred_2 = ' '.join(y_pred_2.astype(str))
# Print the first 10 characters of the joined string
print(y_pred_2[:10])
# Prepend the ID and a newline
y_pred_2 = '700641' + '\n' + y_pred_2
# save the predictions in a txt file
with open('/kaggle/working/prediction.txt', 'w') as f:
    f.write(y_pred_2)
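
The export logic can also be wrapped in one function, which makes the file format easier to check in isolation. This is a sketch, assuming the format described in the export cell's comment (ID on the first line, space-separated 0/1 values on the second) and that every label other than `'NO'` counts as positive, as in the `np.where` call. `format_submission` is a hypothetical name.

```python
def format_submission(team_id, labels, negative_label="NO"):
    """Build the two-line submission text: ID, then space-separated 0/1 flags."""
    flags = ["0" if label == negative_label else "1" for label in labels]
    return str(team_id) + "\n" + " ".join(flags)


# Made-up predictions, only to show the resulting layout.
text = format_submission("700641", ["NO", "YES", "YES", "NO"])
print(text)
```
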

#### Save the model

In [None]:
# h2o.save_model(aml.leader, path = "./product_backorders_model_bin")

Explain the best XGBoost model:

In [None]:
# get the best xgboost model
xgb = aml.get_best_model(algorithm="xgboost")

In [None]:
xgb.explain(df_h2o)