# Predicting failing cable connections with a classification model

# Introduction
Our power grid is an intricate network of cables, lines and stations that transports electricity from power plants, solar parks, wind parks, etc. to, for example, your home. Within this network the cables are joined using cable connections (or 'Moffen' in Dutch). A range of connection types is used in our grid, as these have improved over time. Older connections might fail due to a range of conditions. One of Alliander's main objectives is to have a reliable grid. Therefore, it is important to know which connection types are prone to failure in order to prevent power outages.

Today, Alliander is going to ask you to come up with a way to predict connection failures using classification models. We know that connections fail due to large temperature variations and that failing connections cause short circuits. The failure of cable connections is difficult to determine. However, we know that there is a relationship between a connection failure and the depth of a connection, the connection type, the soil type and the groundwater level. We also know that connection type is the dominant factor in connection failures.

Using the information above and the supplied dataset containing information on the cable connections present in our grid, try to come up with a classification model that predicts the failure of cable connections.

To do: add hypotheses

**Contents:**
1. [Import packages and load the data](#1)
1. [Data Exploration](#2)
1. [Data Preparation](#3)
1. [Analysis](#4)
1. [Split train- and testset](#5)
1. [Train models](#6)
1. [Model Validation](#7)
1. [Conclusion](#8)


<a id="1"></a> 

## 1. Import packages and load the data

In [None]:
import math
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection
import sklearn.tree
import xgboost as xgb

In [None]:
# Set plotly as the default plotting backend for pandas
pd.options.plotting.backend = "plotly"

In [None]:
# Load data as a DataFrame
df = pd.read_csv("data/dataset-2.csv")

<a id="2"></a> 


## 2. Data Exploration

#### 2.1 First insights into the dataset

In [None]:
# First view of the data
df.head(10)

In [None]:
# Print the number of rows and columns
print('nrow:', df.shape[0])
print('ncols:', df.shape[1])

In [None]:
# Show a short statistical summary for the numeric values in the dataframe
df.describe()

In [None]:
# Show data types for each column
df.dtypes

#### 2.2 Analyse the variable we try to predict

In [None]:
# Analyse the variable we try to predict
df["FAILURE"].value_counts()

#### 2.3 Analyse categorical variables

In [None]:
categorical_columns = [
    "CONSTRUCTION_ORIG",
    "CABLE_COX2"
]

for col in categorical_columns:
    df.plot(
        kind='hist',
        x=col,
        title=col
    ).show()

#### 2.4 Analyse the distribution of the numerical variables

In [None]:
df.plot(
    kind="hist",
    x="AGE"
)

In [None]:
df.plot(
    kind="box",
    x="AGE"
)

#### 2.5 Check for missing data

When training a model the quality of your data can be a limiting factor.
Therefore, it is wise to check your data early on for completeness.

In [None]:
# Get insights in how many missing values (NULL or NANs) are in the dataset
df_missing = pd.DataFrame(
    data={
        'NUM_MISSING': df.isna().sum(axis='rows'),
        'NUM_TOTAL': len(df)
    }
)

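# Compute the fraction of missing values per column (0.0 = none, 1.0 = all missing)
df_missing['PCT_MISSING'] = df_missing['NUM_MISSING'] / df_missing['NUM_TOTAL']

# Sort the dataframe so that the columns with the most missing values come first
df_missing = df_missing.sort_values(by='PCT_MISSING', ascending=False)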

# Show the 'missing' dataframe
df_missing

In [None]:
# Visualize the results
df_missing.plot(
    kind='bar',
    y='PCT_MISSING',
    title="Missing Values"
)

<a id="3"></a> 


## 3. Data Preparation

#### 3.1 Remove outliers

In [None]:
# EXERCISE:
# Based on your results in the exploration step, try to remove some absurd values (outliers) that might negatively impact a machine learning model.

df = df[~(df["DEWATERING_DEPTH_CM"] <= -30)]
df = df[df["YEAR_CONSTRUCTION"] > 1900]

#### 3.2 Fill missing values

Not all models can handle missing values (NaN/None/etc.) in their input variables.
In that case you have to come up with a strategy for replacing them if you want to be able to use all variables.
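
As a minimal sketch of one possible strategy (not necessarily the best one for this dataset): fill numeric columns with the median and give missing categories their own label. The column names `AGE` and `GROUND_TYPE` occur in the dataset, but the values and the `"UNKNOWN"` label below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "AGE": [10.0, np.nan, 30.0, 50.0],
    "GROUND_TYPE": ["clay", None, "sand", "clay"],
})

# Numeric columns: the median is less sensitive to outliers than the mean
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Other columns: keep "missing" as an explicit category (hypothetical label)
other_cols = toy.columns.difference(numeric_cols)
toy[other_cols] = toy[other_cols].fillna("UNKNOWN")

print(toy)
```

Whether the mean, the median or a separate category works best depends on why the values are missing, so it is worth trying more than one option.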

In [None]:
# Fill numeric missing values with average
df = df.fillna(df.mean(numeric_only=True))

#### 3.3 Apply one-hot encoding to categorical variables

In [None]:
categorical_columns = [
    "CABLE_COX1",
    "CABLE_COX2",
    "CONSTRUCTION_ORIG",
    "CONSTRUCTION_EXP",
    "CONSTRUCTION_COX",
    "GROUND_TYPE"
]

df_prepped = pd.get_dummies(df, columns=categorical_columns)
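
To see what one-hot encoding does, here is a small sketch on a made-up frame (the category values are invented): each category of `GROUND_TYPE` becomes its own indicator column, and the original column is dropped.

```python
import pandas as pd

# Toy data; the category values are invented for illustration only
toy = pd.DataFrame({
    "AGE": [10, 20, 30],
    "GROUND_TYPE": ["clay", "sand", "clay"],
})

# One indicator column per category, named <column>_<category>
toy_encoded = pd.get_dummies(toy, columns=["GROUND_TYPE"])

print(toy_encoded.columns.tolist())
```

This naming pattern is why column names such as `CONSTRUCTION_EXP_Nekaldietmof` appear in the prepped dataframe.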

In [None]:
# Check the prepped dataframe
df_prepped.head(10)

<a id="4"></a> 


## 4. Analysis

So far we have prepped the dataset and analysed the distributions of the variables.
Before we start to train some models, it is a good idea to analyse the relationships between the variables,
most importantly how they relate to the failure of cable-joints.

In [None]:
# EXERCISE: 
# Try to make some useful figures to investigate the relationships between age-related variables and the number of failures.
df_prepped.plot(
    kind="hist",
    x="AGE",
    color="FAILURE"
)

In [None]:
# EXERCISE:
# Now try to find a way to visualise the relations between the categorical variables and the failures of cable-joints.

columns = [
    "COX1==COX2",
    "SUBSIDENCE"
]

for col in columns:
    # Create one new figure per variable
    plt.figure()
    sns.barplot(
        x=df_prepped[col], 
        y=df_prepped["FAILURE"], 
        palette='Blues'
    )
    plt.show()
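
Next to the bar plots, the same relationship can be put into numbers. The sketch below computes the failure rate and the group size per category on a made-up frame; it assumes `FAILURE` is coded as 0/1 and that `SUBSIDENCE` takes a small number of distinct values, which may not hold for the real dataset.

```python
import pandas as pd

# Toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "SUBSIDENCE": [0, 0, 0, 1, 1, 1, 1, 0],
    "FAILURE": [0, 0, 1, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column is the failure rate; count shows how many rows support it
failure_rate = (
    toy.groupby("SUBSIDENCE")["FAILURE"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "FAILURE_RATE", "count": "N"})
)

print(failure_rate)
```

Looking at `N` alongside the rate helps to avoid reading too much into categories with only a handful of connections.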

In [None]:
# EXERCISE: 
# Compute a correlation matrix for all the variables in the dataset. Select subsets to have clear overview, everything in one plot becomes quite unreadable.
# What variables seem to be (strongly) correlated with "FAILURE" and with each other? Can you explain the relationships that you have found? And do they make sense?

# Select the columns here
columns = ['FAILURE'] + df_prepped.columns[2:20].to_list()

# Compute correlation matrix
corrmat = df_prepped[columns].corr().round(2)

In [None]:
# Correlation matrix figure (with seaborn)
sns.set(rc={'figure.figsize':(16, 16)})
sns.heatmap(corrmat, vmax=.8, square=True, annot=True, cmap='RdBu_r')

<a id="5"></a> 


## 5. Split train- and testset
In order to validate how well your machine learning model is able to predict on data it has not seen before, we set part of the dataset aside as a test set.

In [None]:
# EXERCISE: split the dataset into two dataframes one for training and the other for testing.
df_train, df_test = sklearn.model_selection.train_test_split(df_prepped, train_size=0.8)
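
If failures are rare, a purely random split can leave the test set with very few of them. A possible refinement, shown here as a sketch on made-up data, is to stratify on the target and to fix `random_state` so the split is reproducible. The 10% failure share below is an assumption for illustration, not a property of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 failures out of 100 rows (invented numbers)
toy = pd.DataFrame({
    "AGE": range(100),
    "FAILURE": [1] * 10 + [0] * 90,
})

toy_train, toy_test = train_test_split(
    toy,
    train_size=0.8,
    # Keep the share of failures (roughly) equal in both parts
    stratify=toy["FAILURE"],
    # Fixed seed so the same split is produced on every run
    random_state=42,
)

print("failure share train:", toy_train["FAILURE"].mean())
print("failure share test: ", toy_test["FAILURE"].mean())
```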

<a id="6"></a> 
## 6. Train models

In [None]:
# In this dictionary we will save the trained models
models = {}

In [None]:
# Here we define the target variable (the variable we want to predict)
y_var = "FAILURE"
y_train = df_train[y_var]
y_test = df_test[y_var]

<a id="6a"></a> 
## 6.1 Decision Tree

For more information see: https://en.wikipedia.org/wiki/Decision_tree_learning

For the documentation of the sklearn implementation see: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [None]:
# Select the input variables for the model
x_vars = [
    "CONSTRUCTION_EXP_Nekaldietmof"
]
x_vars = [col for col in df_train.columns if col != y_var]

x_train = df_train.loc[:, x_vars]
x_test = df_test.loc[:, x_vars]

In [None]:
# Fit model
models["decision_tree"] = sklearn.tree.DecisionTreeClassifier(
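    # Maximum depth of final tree (nr of levels)
    max_depth=16,
    # Minimum number of samples a node must contain before it may be split
    min_samples_split=8,
    # Minimum number of samples that must end up in each leaf
    min_samples_leaf=2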
).fit(x_train, y_train)

<a id="6b"></a> 
## 6.2 Random Forest 

In [None]:
# Select the input variables for the model
x_vars = [
    "CONSTRUCTION_EXP_Nekaldietmof"
]
x_vars = [col for col in df_train.columns if col != y_var]

x_train = df_train.loc[:, x_vars]
x_test = df_test.loc[:, x_vars]

In [None]:
models["random_forest"] = sklearn.ensemble.RandomForestClassifier(
    # Number of trees
    n_estimators=100,
    # Max depth of a tree
    max_depth=32
).fit(x_train, y_train)

<a id="6v"></a> 
## 6.3 XGBoost

In [None]:
# Select the input variables for the model
x_vars = [
    "CONSTRUCTION_EXP_Nekaldietmof"
]
x_vars = [col for col in df_train.columns if col != y_var]

x_train = df_train.loc[:, x_vars]
x_test = df_test.loc[:, x_vars]

In [None]:
models["xgboost"] = xgb.XGBClassifier(
    random_state=42
).fit(x_train, y_train)

<a id="7"></a> 
## 7. Model Validation
Now that we have trained one or more models, we want to know how well they can predict the failure of cable-joints.

In [None]:
# List fitted models
for model in models:
    print(model)

In [None]:
# We select the model we want to evaluate here
clf = models["decision_tree"]

#### 7.1 Prediction on test-set

In [None]:
# Predict on the test set (True/False for Failure/Non-failure)
y_test_pred = clf.predict(x_test)

In [None]:
# Predict probabilities as well
y_test_pred_proba = clf.predict_proba(x_test)

#### 7.2 Confusion Matrix

In [None]:
# Confusion matrix
cm = sklearn.metrics.confusion_matrix(
    y_true=y_test,
    y_pred=y_test_pred
).T

In [None]:
pd.DataFrame(cm).plot(
    title="Confusion Matrix",
    kind="imshow",
    x=["Negative", "Positive"],
    y=["Negative", "Positive"],
    labels={
        "x": "Actual Values",
        "y": "Predicted Values"
    },
    text_auto=True
)

#### 7.3 Accuracy

In [None]:
accuracy_score = sklearn.metrics.accuracy_score(y_test, y_test_pred)
print(accuracy_score)
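
Accuracy on its own can look better than it is when one class dominates. As a sketch, assuming failures are rare (the 5% below is invented), a baseline that always predicts "no failure" already reaches a high accuracy while finding no failures at all. `DummyClassifier` is scikit-learn's built-in baseline estimator.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 5 failures out of 100 connections
y = np.array([1] * 5 + [0] * 95)
# The baseline ignores the features, so a dummy column is enough
X = np.zeros((100, 1))

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)

print("baseline accuracy:", baseline_accuracy)
print("baseline recall:  ", baseline_recall)
```

Comparing your model's accuracy with such a baseline, and looking at precision and recall as well, gives a fairer picture.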

#### 7.4 Precision / Recall curve and F1 score

In [None]:
precision, recall, _ = sklearn.metrics.precision_recall_curve(y_test,  y_test_pred_proba[:, 1])
pr_display = sklearn.metrics.PrecisionRecallDisplay(precision=precision, recall=recall).plot()

In [None]:
# F1 score
f1_score = sklearn.metrics.f1_score(y_test, y_test_pred)

print("f1-score:", f1_score)

#### 7.5 ROC Curve and AUC score

In [None]:
# Compute ROC curve
fpr, tpr, t = sklearn.metrics.roc_curve(y_test, y_test_pred_proba[:, 1])

# Visualize ROC curve
sklearn.metrics.RocCurveDisplay(fpr=fpr, tpr=tpr).plot()

In [None]:
auc_score = sklearn.metrics.auc(fpr, tpr)
print("AUC score:", auc_score)

#### 7.6 Optimal Cut-off point

In [None]:
# Cut-off point optimisation?
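
One possible approach, sketched here on made-up scores, is to pick the probability threshold that maximises the F1 score on the precision/recall curve. Maximising F1 is only one choice of criterion; if a missed failure is much more costly than a false alarm, a different trade-off may be more appropriate.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up true labels and predicted failure probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more element than thresholds;
# the final point (precision=1, recall=0) has no threshold, so drop it
p, r = precision[:-1], recall[:-1]
denominator = p + r
f1 = np.divide(
    2 * p * r,
    denominator,
    out=np.zeros_like(denominator),
    where=denominator > 0,
)

# Threshold with the highest F1 score
best_threshold = thresholds[int(np.argmax(f1))]

# Apply the threshold instead of the default cut-off of 0.5
y_pred = (y_score >= best_threshold).astype(int)

print("best threshold:", best_threshold)
print("predictions:   ", y_pred)
```

In the notebook the same idea would use `y_test` and `y_test_pred_proba[:, 1]`; ideally the threshold is tuned on a separate validation set rather than on the test set itself.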

#### 7.7 Feature Importance

In [None]:
# Feature importance
df_feature_importances = (
    pd.DataFrame(
        data={
            "FEATURE": clf.feature_names_in_,
            "IMPORTANCE": clf.feature_importances_
        }
    )
    .sort_values(
        by="IMPORTANCE", 
        ignore_index=True,
        ascending=False
    )
    .head(10)
)

df_feature_importances.plot(
    title="Feature Importances (Top 10)",
    kind="bar", 
    x="IMPORTANCE", 
    y="FEATURE",
)

#### 7.8 Model Visualisation
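
A decision tree can be inspected directly. The sketch below fits a tiny tree on made-up data and prints its rules as text with `sklearn.tree.export_text`; `sklearn.tree.plot_tree` is the graphical alternative. For the tree trained above (`max_depth=16`) the full output would be very long, so limiting the depth shown is probably wise.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data; the values are invented for illustration only
toy_x = pd.DataFrame({"AGE": [5, 10, 15, 40, 50, 60]})
toy_y = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_x, toy_y)

# Text representation of the learned split rules
tree_rules = export_text(toy_tree, feature_names=list(toy_x.columns))
print(tree_rules)
```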

<a id="8"></a> 
## 8. Conclusion

In [None]:
# What are the top 200 cable-joints with the worst condition according to your model? How many of them actually failed?
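
A possible way to approach this question, sketched on made-up data: rank the connections by predicted failure probability and count the actual failures among the top N. The column name `P_FAILURE` and the toy model are hypothetical; in the notebook you would use your own fitted model and rank rows the model has not been trained on (for example `df_test`).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "AGE": [5, 10, 15, 20, 40, 45, 50, 60],
    "FAILURE": [0, 0, 0, 0, 1, 0, 1, 1],
})

toy_clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(
    toy[["AGE"]], toy["FAILURE"]
)

# Add the predicted failure probability and sort, highest risk first
ranked = toy.assign(
    P_FAILURE=toy_clf.predict_proba(toy[["AGE"]])[:, 1]
).sort_values(by="P_FAILURE", ascending=False)

# Select the top N and count how many of them actually failed
top_n = 4
top = ranked.head(top_n)
n_failed = int(top["FAILURE"].sum())

print(top)
print("actual failures in top", top_n, ":", n_failed)
```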

In [None]:
# Can you explain why some models seem to be able to predict better on this dataset than others?
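
To compare models fairly, it helps to score them all on the same test set with the same metric. The sketch below does this for two models on synthetic data generated with `make_classification`; the data, the hyperparameters and the choice of AUC as the metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

toy_models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model and compute the AUC on the same held-out test set
scores = {}
for name, model in toy_models.items():
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, score in scores.items():
    print(name, round(score, 3))
```

In the notebook, the same loop could run over the `models` dictionary with `x_test` and `y_test`.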