<a href="https://colab.research.google.com/github/BhekiMabheka/Data_Driven_Competions/blob/master/zindi_sasol_competetion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Description
The objective of this challenge is to create a machine-learning model that can forecast the probability of each customer becoming inactive and refraining from making any transactions for a period of 90 days.

An effective solution will enable a business to identify customers who may be on the verge of becoming inactive, allowing them to implement strategies in advance to retain these customers.

Sasol is looking for three data scientists (2 senior, 1 principal) with experience in communicating their discoveries and methodology to the business. Solutions will be requested from the top 15 users in this challenge, and 10 users residing in South Africa will be invited for a job interview at Sasol. When submitting your solution, please include your up-to-date CV.

## Evaluation
Prize-winning submissions are chosen using the following weightings: 60% F1 score, 20% approach & methodology, 20% verbal presentation.

The error metric for this competition is the F1 score, which ranges from 0 (total failure) to 1 (perfect score). Hence, the closer your score is to 1, the better your model.

F1 Score: A performance score that combines precision and recall. It is the harmonic mean of these two quantities: `2 * Precision * Recall / (Precision + Recall)`

Precision: The fraction of items identified as positive that are actually positive: `TP / (TP + FP)`

Recall / Sensitivity / True Positive Rate (TPR): The fraction of actual positives that are correctly identified as positive: `TP / (TP + FN)`

Where:

`TP=True Positive`
`FP=False Positive`
`TN=True Negative`
`FN=False Negative`
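
As a quick check of the formulas above, the sketch below computes precision, recall and F1 by hand on a small made-up set of labels and compares the result with scikit-learn's `f1_score`. The labels are illustrative only, and treating `1` as the "inactive" (positive) class is an assumption about how `Target` is encoded.

```python
from sklearn.metrics import f1_score

# Toy labels (illustrative only, not competition data).
# Assumption: 1 = inactive (positive class), 0 = active.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the confusion-matrix cells needed for precision and recall
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn's implementation should agree with the manual calculation
sk_f1 = f1_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} sklearn_f1={sk_f1:.2f}")
```

Note that true negatives do not appear in any of the three formulas, which is why F1 is less flattering than accuracy when the negative class dominates.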

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import xgboost as xgb
from google.colab import drive
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
warnings.filterwarnings("ignore")
drive.mount('/content/drive')

In [None]:
train_data   = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/zindi_competetions/sasol_competetion/data/Train.csv")
test_data    = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/zindi_competetions/sasol_competetion/data/Test.csv")
variables_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/zindi_competetions/sasol_competetion/data/VariableDescription.csv")

In [None]:
test_data.sample(3)

In [None]:
train_data.sample(3)

#### Examine classes and class imbalance

Class imbalance means that there are unequal numbers of cases for the categories of the label. Class imbalance can seriously bias the training of classifier algorithms. In many cases, the imbalance leads to a higher error rate for the minority class. Most real-world classification problems have class imbalance, sometimes severe class imbalance, so it is important to test for this before training any model.

Class imbalance can be addressed with sampling techniques such as [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling), [Synthetic Minority Over-sampling](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html#r001eabbe5dd7-1), [SVMSMOTE](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SVMSMOTE.html#imblearn.over_sampling.SVMSMOTE), etc. Imbalance also has a big impact on metric evaluation, especially if you're trying to optimize accuracy: a model that always predicts the majority class can score high accuracy while missing every minority case.
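
As a minimal sketch of the stratified-sampling idea, the example below splits a synthetic 10/90 dataset with `stratify=y` so both splits keep the same class ratio. It also uses `class_weight="balanced"`, which is a reweighting alternative to resampling that this notebook does not otherwise use. The data, the class ratio and the choice of `LogisticRegression` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): 10% positives, 90% negatives
rng = np.random.default_rng(0)
y = np.array([1] * 100 + [0] * 900)
X = rng.normal(size=(1000, 3))
X[y == 1] += 1.0  # shift the minority class so it is learnable

# stratify=y keeps the 10/90 class ratio in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency
model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)

val_f1 = f1_score(y_val, model.predict(X_val))
print(f"Positive rate (train): {y_train.mean():.2f}, (val): {y_val.mean():.2f}")
print(f"Validation F1: {val_f1:.3f}")
```

Oversamplers such as SMOTE should be fitted on the training split only, so that synthetic points never leak into the validation data.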

In [None]:
normalized_counts = train_data['Target'].value_counts(normalize=True).round(2)

# Create a bar plot using Plotly Express
fig = px.bar(x=normalized_counts.index, y=normalized_counts.values,
             labels={'x': 'Target', 'y': 'Normalized Count'},
             title='Distribution of the Target Label',
             text=normalized_counts.values,
             color=normalized_counts.index,
             color_discrete_sequence=px.colors.qualitative.Set1)
fig.show()

## Visualize class separation by categorical features

The primary goal of visualization for classification problems is to understand which features are useful for class separation.

In [None]:
def countplot(df, target_label):
    """
    df: Pandas DataFrame
    target_label: str, the column name for the target variable
    """
    cols = df.columns.tolist()
    for col in cols:
        if df[col].dtypes == 'object':
            fig = px.histogram(df, x=col, color=target_label, barmode='group',
                               category_orders={target_label: sorted(df[target_label].unique())},
                               labels={target_label: target_label}, width=800, height=500)
            fig.update_layout(title_text=f'Count Plot of {col} grouped by {target_label}')
            fig.show()

countplot(df=train_data, target_label="Target")
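
The count plots above cover only `object`-typed columns. For numeric columns, a comparable first check is to compare summary statistics per class. The helper below is a hypothetical sketch (`numeric_summary_by_class` is not part of the original notebook), demonstrated on a made-up frame with invented column names.

```python
import pandas as pd

def numeric_summary_by_class(df, target_label):
    """Return the mean and median of each numeric feature, grouped by the target."""
    numeric_cols = df.select_dtypes(include="number").columns
    # Exclude the target itself if it is stored as a number
    numeric_cols = [c for c in numeric_cols if c != target_label]
    return df.groupby(target_label)[numeric_cols].agg(["mean", "median"])

# Made-up example data; the column names are assumptions for illustration
toy = pd.DataFrame({
    "Target": [0, 0, 1, 1],
    "amount": [10.0, 20.0, 100.0, 200.0],
    "region": ["a", "b", "a", "b"],
})
summary = numeric_summary_by_class(toy, "Target")
print(summary)
```

A feature whose per-class means or medians differ widely is a candidate for separating the classes. On the real data the call would be `numeric_summary_by_class(train_data, "Target")`.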

### Data Preprocessing

In [None]:
def preprocess_data(data):
    """
    Impute missing values and one-hot encode categorical features.

    data: Pandas DataFrame
    Returns a new DataFrame; the input is left unchanged.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()

    # Impute missing values for numeric features with the median
    numeric_features = data.select_dtypes(include=['int64', 'float64']).columns
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].median())

    # Impute missing values for categorical features with the mode
    categorical_features = data.select_dtypes(include=['object']).columns
    if len(categorical_features) > 0:
        data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    # Convert categorical features to numerical using one-hot encoding
    data = pd.get_dummies(data, columns=categorical_features, drop_first=True)

    return data

# Invoke the function
train_data = preprocess_data(train_data)
test_data  = preprocess_data(test_data)
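
Because `preprocess_data` is applied to the training and test sets separately, `pd.get_dummies` can produce different columns for each whenever a category appears in only one of them. One possible way to reconcile them, sketched below, is to reindex the test frame to the training feature columns. `align_test_to_train` is a hypothetical helper, not part of the original notebook, and the toy frames are made up.

```python
import pandas as pd

def align_test_to_train(train_encoded, test_encoded, target_label="Target"):
    """Give the test frame exactly the feature columns of the training frame."""
    feature_cols = [c for c in train_encoded.columns if c != target_label]
    # Columns missing from the test set are filled with 0;
    # columns that exist only in the test set are dropped
    test_aligned = test_encoded.reindex(columns=feature_cols, fill_value=0)
    return train_encoded[feature_cols], test_aligned

# Made-up frames: category "d" appears only in test, "b" and "c" only in train
toy_train = pd.get_dummies(
    pd.DataFrame({"x": [1, 2, 3], "cat": ["a", "b", "c"], "Target": [0, 1, 0]}),
    columns=["cat"], drop_first=True,
)
toy_test = pd.get_dummies(
    pd.DataFrame({"x": [4, 5], "cat": ["a", "d"]}),
    columns=["cat"], drop_first=True,
)

X_train_toy, X_test_toy = align_test_to_train(toy_train, toy_test)
print(list(X_train_toy.columns))
print(list(X_test_toy.columns))
```

After alignment both frames have the same columns in the same order, which is what scikit-learn estimators expect at prediction time.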