[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/Jonas-Metz-verovis/verovis_Coding_Challenge/blob/main/03_Live_Production_Model.ipynb)

# Introduction - Coding Challenge #3 - Live Production Model

**Today's Coding Challenge focuses on a Live Production Model. The underlying case is based on data from a Portuguese bank. The data comes from a telephone marketing campaign and records whether each contacted person subscribed to a term deposit after the phone call. Telephone marketing campaigns remain one of the most effective ways to reach people. However, they require massive investment, because large call centers usually must be hired to execute them. Hence, it is crucial to identify the customers who are most likely to convert before the campaign starts, so that they can be targeted explicitly.**


**This challenge's main task is to deliver a smooth-running model that is ready to be used in a live production environment. For this purpose, you have access to two different datasets. The first dataset (Marketing.csv) contains the so-called raw data. The second dataset (Marketing_Live.csv) is assumed to contain the unknown data on which the model will be used in the live production environment. The big challenge is to deliver a full pipeline that performs well on your training data and is also usable in the live environment.**

**The Challenge will be scored based on:**

1.  The Prediction Model's Test Accuracy Score
1.  The verbal Explanations for specific Processing/Modeling Choices
1.  The Readability and Transferability of the submitted Code
1.  The Documentation of the submitted Code
1.  Optional (not scored): Explanation of the Model's learned Relationships (e.g. through the Feature Importances)

General Machine Learning Project Checklist (**Focus of this Challenge**) by [Aurélien Géron](https://github.com/ageron/handson-ml)

1. Frame the Problem and look at the Big Picture
1. Get the Data
1. Explore the Data to gain Insights
1. Prepare the Data to better expose the underlying Data Patterns to the used Machine Learning Algorithms
1. **Explore many different Models and short-list the best ones**
1. Fine-tune your Models and combine them into a great Solution
1. Present your Solution
1. **Launch, monitor, and maintain your Model/Service**

**INFO:** [Google Colab](https://colab.research.google.com/) is recommended because you can get started right away, and [Databricks](https://adb-7072220306909809.9.azuredatabricks.net/?o=7072220306909809) is recommended if you want to collaborate in real time. Alternatively, you can work with your own Development Environment (e.g. [Visual Studio Code](https://code.visualstudio.com/)): use [Git](https://git-scm.com/) to clone the [verovis Coding Challenge GitHub Repository](https://github.com/Jonas-Metz-verovis/verovis_Coding_Challenge) and collaborate e.g. by using [Microsoft Visual Studio Live Share](https://marketplace.visualstudio.com/items?itemName=MS-vsliveshare.vsliveshare-pack).



# Documentation and Support

#### The following Resources might be useful to complete this Challenge:

1.  [Scikit-Learn (Chi-square)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)
1.  [Scikit-Learn (ColumnTransformer)](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
1.  [Scikit-Learn (Pipelines)](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
1.  [Medium: Scikit-Learn Pipelines](https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf)
1.  Joblib [dump](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html) and [load](https://joblib.readthedocs.io/en/latest/generated/joblib.load.html) Documentation

<hr>

1.  [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html#api)
1.  [Numpy Documentation](https://numpy.org/doc/stable/)
1.  [Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/classes.html)
1.  [Category Encoders Documentation](https://contrib.scikit-learn.org/category_encoders/)
1.  [Imbalanced-Learn Documentation](https://imbalanced-learn.readthedocs.io/en/stable/api.html)
1.  [Seaborn Documentation](https://seaborn.pydata.org/api.html)
1.  [SHAP Documentation](https://shap.readthedocs.io/en/latest/api.html)
1.  [Pandas Data Wrangling Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
1.  [TowardsDataScience: Data Cleansing](https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d)
1.  [TowardsDataScience: Data Preprocessing](https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825)
1.  [TowardsDataScience: Feature Engineering](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)
1.  [Machine Learning Mastery: Feature Engineering](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
1.  [TowardsDataScience: Working with Numerical Variables](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b)
1.  [TowardsDataScience: Working with Categorical Variables](https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63)
1.  [TowardsDataScience: Categorical Variable Encoding](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)
1.  [TowardsDataScience: One-Hot-Encoding for tree-based Models](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769)
1.  [Stat Trek: One-Hot-Encoding (Dummy Variables)](https://stattrek.com/multiple-regression/dummy-variables.aspx)

#### If you don't know how to solve a given Problem, it often helps to simply search for the Problem online. Great Sources are:

1.  [TowardsDataScience](https://towardsdatascience.com/)
1.  [StackOverflow](https://stackoverflow.com/)
1.  [Machine Learning Mastery](https://machinelearningmastery.com/start-here/)
1.  [Python-Kurs.eu](https://www.python-kurs.eu/python3_kurs.php)
1.  [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)
1.  [The Hitchhiker's Guide to Python](https://docs.python-guide.org/)
1.  [Overview of Data Science YouTube Channels](https://towardsdatascience.com/top-20-youtube-channels-for-data-science-in-2020-2ef4fb0d3d5)
1.  [Introduction to Machine Learning with Python](https://github.com/amueller/introduction_to_ml_with_python) / [Buy the Book](https://www.amazon.de/Introduction-Machine-Learning-Python-Scientists/dp/1449369413)
1.  [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf)
1.  [Bayesian Reasoning and Machine Learning](http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/200620.pdf)
1.  [Deep Learning](https://www.deeplearningbook.org/)
1.  [Hyndman/Athanasopoulos, Forecasting: Principles and Practice](https://otexts.com/fpp2/)

#### This Challenge was created by [Tim Fritzsche](mailto:tfritzsche@verovis.com), [Jonas Metz](mailto:jmetz@verovis.com) and [Marcel Fynn Froboese](mailto:mfroboese@verovis.com). Please contact us anytime if you have any Questions! :-)


# Global Flags and Variables
Please use the given RANDOM_STATE for all your Models etc.

In [None]:
import os

RANDOM_STATE = 42

# TODO: Please choose a Team Name!
TEAM_NAME = 'AdminTeam'

DATABRICKS_INSTANCE = "https://adb-7072220306909809.9.azuredatabricks.net"
DATABRICKS_ORGANISATION = "7072220306909809"
DATABRICKS_BASE_DIRECTORY = os.path.join ("/dbfs/FileStore", TEAM_NAME)

MODELS = os.path.join (DATABRICKS_BASE_DIRECTORY, "Models")

SAVE_MODEL = True
SAVE_PIPELINE = True

# Databricks Specifics

[Databricks Filestore Documentation](https://docs.databricks.com/data/filestore.html)

In [None]:
dbutils.fs.rm ("/FileStore/" + TEAM_NAME, recurse=True)
dbutils.fs.mkdirs("/FileStore/" + TEAM_NAME + "/Models")
dbutils.fs.ls("/FileStore/" + TEAM_NAME)

# Imports

### Info (Google Colab)

If you are working in Google Colab, you can install necessary (and not already installed) Packages by running e.g.

```
!pip install shap
```

### Info (Databricks)

If you are working in [Databricks](https://docs.databricks.com/libraries/notebooks-python-libraries.html), you can install necessary (and not already installed) Packages by running e.g. this Command in the first Cell of your Notebook (the Kernel will reset after the Package has been installed):

```
%pip install shap
```

In [None]:
%matplotlib inline

import os
import calendar
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from joblib import dump, load
from pandas.api.types import CategoricalDtype
from sklearn_pandas import DataFrameMapper
from collections import Counter
from datetime import date, datetime
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm.notebook import tqdm
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix, plot_roc_curve, accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Helper Functions

In [None]:
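def change_dtypes(numeric_feat=None, ordinal_feat=None, nominal_feat=None):
    """
    Casts the given columns of the global "Marketing" DataFrame in place

    Parameter:
    ----------
    numeric_feat: list of column names to cast to 'int32'

    ordinal_feat: list of column names to cast to 'category'

    nominal_feat: list of column names to cast to 'object'

    """
    # Treat omitted lists as empty, so that the default value None does not raise a TypeError
    for num in numeric_feat or []:
        Marketing[num] = Marketing[num].astype('int32')

    # "ord_col" avoids shadowing the built-in function ord()
    for ord_col in ordinal_feat or []:
        Marketing[ord_col] = Marketing[ord_col].astype('category')

    for nom in nominal_feat or []:
        Marketing[nom] = Marketing[nom].astype('object')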


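def get_features_type(df):
    """
    Sorts the Features according to their dtype and returns the names of the columns
    
    Parameter:
    ----------
    df: pandas DataFrame
    
    Returns:
    ---------
    nominal_feat: list (columns with dtype 'object')

    ordinal_feat: list (columns with dtype 'category')

    numeric_feat: list (columns with dtype 'int32')

    passed_feat: list (columns with dtype 'float64', passed through without transformation)
    
    """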
    nominal_feat = [feat for feat in df.columns if df[feat].dtypes == 'object']
    ordinal_feat = df.select_dtypes(include=['category']).columns.tolist()
    numeric_feat = [feat for feat in df.columns if df[feat].dtypes == 'int32']
    passed_feat = [feat for feat in df.columns if df[feat].dtypes == 'float64']
    return nominal_feat, ordinal_feat, numeric_feat, passed_feat


def get_feat_imp(model_metrics:dict):
    """
    This function returns the feature importances with respect to the used transformation and the used estimator
    
    """

    transformation_names = {}
    for key in model_metrics['DecisionTreeClassifier']['Best_Model (GridSearch)']['columntransformer'].transformers_:
        if str(key[1]) == 'OneHotEncoder()':
            transformation_names[key[0]] = key[1].get_feature_names(input_features=key[-1]).tolist()
        # print(key[0], key[-1])
        else:
            transformation_names[key[0]] = key[-1] 

    complete_feature_names = []
    for value in transformation_names.values():
        complete_feature_names.append(value)

    complete_feature_names = list(itertools.chain(*complete_feature_names))

    feature_importances_dict = {'Feature Names': complete_feature_names}
    for key in model_metrics.keys():
        feature_importances_dict[key + '_Import.'] = model_metrics[str(key)]['Best_Model (GridSearch)'][str(key).lower()].feature_importances_.tolist()

    result = pd.DataFrame(feature_importances_dict)

    return result.sort_values(by='DecisionTreeClassifier_Import.', ascending=False).reset_index(drop=True)

# Data Loading

In [None]:
Marketing_File_Link = 'https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Marketing/Marketing.csv'
Marketing_Live_File_Link = 'https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Marketing/Marketing_Live.csv'

Marketing = pd.read_csv(Marketing_File_Link, sep=';')

In [None]:
Marketing.head()

In [None]:
nominal_feat = ['marital', 'default','housing','loan','contact','month', 'job','education']
ordinal_feat = ['poutcome'] 
numeric_feat = ['age', 'balance', 'day', 'duration', 'campaign','pdays','previous']

In [None]:
# Change Types of Features
change_dtypes(numeric_feat, ordinal_feat, nominal_feat)

In [None]:
# Prevent Target-Leakage and choose column "y" as target
target = Marketing['y']
Marketing = Marketing.drop(['y'], axis=1)

In [None]:
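# Train-Test-Split
# NOTE: With shuffle=False the original row order is kept (first 67 % for training, last 33 % for testing) and random_state has no effect
X_train, X_test, y_train, y_test = train_test_split(Marketing, target, test_size=0.33, random_state=RANDOM_STATE, shuffle=False)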

## Show Feature Distribution

In [None]:
# for col in X_train.columns:
#     sns.displot(X_train[col])
#     plt.xticks(rotation=45)

fig, axes = plt.subplots(ncols=4, nrows=2, figsize=(20, 8))

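for feat, ax in zip(nominal_feat, axes.flatten()):
    # Pass the column via the keyword "x", which works across Seaborn versions
    sns.countplot(x=X_train[feat], ax=ax)

# Set the figure title once instead of in every loop iteration
fig.suptitle('Nominal - Features')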

plt.tight_layout()

#display (fig)

In [None]:
fig, axes = plt.subplots(ncols=4, nrows=2, figsize=(20, 8))
for feat, ax in zip(numeric_feat, axes.flatten()):
    # sns.distplot is deprecated, sns.histplot with kde=True draws a comparable plot
    sns.histplot(x=X_train[feat], kde=True, ax=ax)
plt.tight_layout()

#display (fig)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(x=X_train['poutcome'], ax=axes[0])
sns.countplot(x=y_train, ax=axes[1])
plt.tight_layout()

#display (fig)

# Feature Engineering

**Note:**
You can use the additional feature engineering capabilities of the custom *CombineFeatureTransformer* Class. The Class can be added to a *Scikit-Learn Pipeline* and can be stored as part of the pipeline.

## Short Summary
### Features (numerical):
- **age:**
The variable *age* represents the age of the person.

- **balance:**
The variable *balance* represents the average yearly balance, in euros.

- **day:**
The variable *day* represents the last contact day of the month.

- **duration:**
The variable *duration* represents the last contact duration, in seconds.

- **pdays:**
The variable *pdays* represents the number of days that have passed since the client was last contacted in a previous campaign (-1 means the client was not previously contacted).

- **previous:** 
The variable *previous* represents the number of contacts performed before this campaign and for this client.

- **campaign:**
The variable *campaign* represents the number of contacts performed during this campaign and for this client.

### Features (categorical)
- **y:**
The variable *y* is the target. It represents the outcome of the campaign, i.e. whether the client has subscribed to a term deposit or not.

- **marital:**
The variable *marital* represents the marital-status of each person.

- **default:**
The variable *default* represents the presence of a credit default.

- **housing:**
The variable *housing* represents the presence of a housing loan.

- **loan:**
The variable *loan* represents the presence of a personal loan.

- **contact:**
The variable *contact* represents the type of communication.

- **month:**
The variable *month* represents the last contact month of the year.

- **job:**
The variable *job* represents the type of job.

- **education:**
The variable *education* represents the status of the education.

- **poutcome:**
The variable *poutcome* represents the outcome of the previous marketing campaign.

### Task (optional): Feature Engineering is optional for this challenge, but it may improve the performance of your model.

INFO: The CombineFeatureTransformer lets you apply additional transformations to specified features. Its advantage is that it plugs directly into the pipeline, so the stored pipeline contains all the steps needed to prepare new, unknown data for the corresponding model.
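
One point to keep in mind for a live pipeline: `pd.cut` with an integer number of bins derives the bin edges from whatever data is passed to it, so the same value can land in a different bin on the live data than on the training data. The following is a minimal sketch of one way around this, assuming a hypothetical `AgeBinTransformer` (not part of the challenge code) that learns its bin edges in `fit` and only reuses them in `transform`. The toy DataFrames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AgeBinTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: bins 'age' with edges learned from the training data."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        # Learn equal-width bin edges once, on the training data only
        self.bin_edges_ = np.linspace(X['age'].min(), X['age'].max(), self.n_bins + 1)
        return self

    def transform(self, X):
        # Protect the original DataFrame
        X_copy = X.copy()
        # Use the inner edges only, so out-of-range values fall into the first or last bin
        X_copy['age_bin'] = np.digitize(X_copy['age'], self.bin_edges_[1:-1]).astype('float')
        return X_copy


# Made-up toy data: the live data contains ages outside the training range
train = pd.DataFrame({'age': [20, 30, 40, 50, 60, 70]})
live = pd.DataFrame({'age': [18, 45, 95]})

binner = AgeBinTransformer(n_bins=5).fit(train)
live_binned = binner.transform(live)
print(live_binned)
```

Because the edges are stored on the fitted object, they are saved together with the pipeline and applied unchanged to new data.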

In [None]:
class CombineFeatureTransformer(BaseEstimator, TransformerMixin):

    def __init__(self,
                 add_Benefit_Deposit=False,
                 add_Affordable_Invest=False,
                 add_New_Income=False,
                 add_Contact_Before=False,
                 add_Passed_Time=False,
                 add_Risk_Protection=False,
                 dropping_non_valuable_feat=False):
        self.add_Benefit_Deposit = add_Benefit_Deposit
        self.add_Affordable_Invest = add_Affordable_Invest
        self.add_New_Income = add_New_Income
        self.add_Contact_Before = add_Contact_Before
        self.add_Passed_Time = add_Passed_Time
        self.add_Risk_Protection = add_Risk_Protection

        #Dropping non valuable features:
        self.dropping_non_valuable_feat = dropping_non_valuable_feat

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Protect the original DataFrame
        X_copy = X.copy()
        
        ##################################################################################################################################################
        # Feel free to manipulate or add new features as you like
        ##################################################################################################################################################
        ##################################################################################################################################################
        # Add your new Features here
        ##################################################################################################################################################
        
        # if self.<YourFeatureName>:
        #     <Your computation>



        # Info: If your new feature is ready for the estimator (model), please cast it to 'float', so that "get_features_type" lists it as a passthrough feature and the ColumnTransformer does not transform it again.

        ##################################################################################################################################################
        # End of your Features
        ##################################################################################################################################################

        # To ensure the ordinal scale
        ord_poutcome = pd.Categorical(X_copy['poutcome'], categories=['unknown','failure','other','success'], ordered=True)
        labels_poutcome, unique_poutcome = pd.factorize(ord_poutcome, sort=True)
        X_copy['poutcome'] = labels_poutcome
        X_copy['poutcome'] = X_copy['poutcome'].astype('float')

        ord_education = pd.Categorical(X_copy['education'], categories=['secondary','primary','unknown','tertiary'], ordered=True)
        labels_education, unique_education = pd.factorize(ord_education, sort=True)
        X_copy['education'] = labels_education
        X_copy['education'] = X_copy['education'].astype('float')

        ord_job = pd.Categorical(X_copy['job'], categories=['services','blue-collar','admin.','technician','student',
                                                            'housemaid','entrepreneur','unemployed','self-employed','management','unknown','retired'], ordered=True)
        labels_job, unique_job = pd.factorize(ord_job, sort=True)
        X_copy['job'] = labels_job
        X_copy['job'] = X_copy['job'].astype('float')

        if self.add_Benefit_Deposit:        
            # Benefits the most from term deposit
            X_copy['Benefit_Deposit'] = pd.cut(X_copy['age'], 5, labels=[-2, -1 ,0, 1, 2])
            X_copy['Benefit_Deposit'] = X_copy['Benefit_Deposit'].astype('float')

        if self.add_Affordable_Invest:
            # Term deposit possible if balance > 2000€
            X_copy['Affordable_Invest'] = np.where(X_copy['balance'] <=2000, 0, 1)
            X_copy['Affordable_Invest'] = X_copy['Affordable_Invest'].astype('float')

        if self.add_New_Income:
            # New Income is available for customer
            conditions_new_income = [((X['day'] >= 28) & (X['day'] <= 31)) | ((X['day'] >= 12) & (X['day'] <= 16)) | ((X['day'] >= 1) & (X['day'] <= 3))]
            X_copy['New Income'] = np.select(conditions_new_income, [1], default=0)
            X_copy['New Income'] = X_copy['New Income'].astype('float')

        if self.add_Contact_Before:
            # Customer was contacted before
            X_copy['Contact Before'] = np.where(X['pdays'] < 0, 0, 1)
            X_copy['Contact Before'] = X_copy['Contact Before'].astype('float')

        if self.add_Passed_Time:
            # Time passed after the last campaign
            X_copy['Passed Time'] = pd.cut(X['pdays'], bins=5, labels=[-2, -1, 0, 1, 2])
            X_copy['Passed Time'] = X_copy['Passed Time'].astype('float')

        if self.add_Risk_Protection:
            # Customers have enough risk protection based on marital status and balance to invest in a term deposit
            conditions = [(X['marital'] == 'married') & (X['balance'] > 2000),
                          (X['marital'] == 'single') & (X['balance'] > 2500),
                          (X['marital'] == 'divorced') & (X['balance'] > 3000)]
            choices = [1 ,1, 1]
            X_copy['Risk Protection'] = np.select(conditions, choices, default=0)
            X_copy['Risk Protection'] = X_copy['Risk Protection'].astype('float')

        ##################################################################################################################################################
        
        if self.dropping_non_valuable_feat:
            # Dropping Features that were identified as non-valuable by the Feature Importances
            droppings = ['month', 'poutcome', 'marital' ,'previous', 'default', 'loan', 'housing', 'pdays', 'day', 'job']
            X_copy.drop(droppings, axis=1, inplace=True)
        
        return X_copy

## Initialize the CombineFeatureTransformer
### Task (optional): If you perform your feature engineering via the "CombineFeatureTransformer-Approach", you can easily handle feature selection here.

**IMPORTANT: If you don't want to use the additional features of the CombineFeatureTransformer, set all of its arguments to False, but keep the transformer itself in place, because the following code still depends on it.**

In [None]:
# Create Custom Feature Transformer
add_custom_features = CombineFeatureTransformer(add_Benefit_Deposit=False, add_Affordable_Invest=False, add_New_Income=False,
                                                add_Contact_Before=False, add_Passed_Time=True, add_Risk_Protection=False, dropping_non_valuable_feat=True)
X_train_temp = add_custom_features.transform(X_train)

# INFO: The ColumnTransformer selects the columns of a Pandas DataFrame by name, so the feature lists must match the columns that actually exist. After adding or deleting features it is therefore necessary to update the feature lists. If the dtypes of the features are correct, you can use the function "get_features_type".
nominal_feat, ordinal_feat, numeric_feat, passed_feat = get_features_type(X_train_temp)

# The features 'duration' and 'campaign' need to be Min-Max-Scaled
minmax_feat = [feat for feat in numeric_feat if feat in ['duration', 'campaign']]
# Remove them from the numeric features (also works if one of them has been dropped)
numeric_feat = [feat for feat in numeric_feat if feat not in minmax_feat]

## Create ColumnTransformer
### Task (optional): Depending on your exploratory data analysis, some features need to be transformed. Alternative transformation steps can be implemented or omitted here. The task is optional.
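
As a hedged illustration of how a ColumnTransformer routes columns to different transformers, the sketch below uses two made-up toy DataFrames. The category `'fax'` is a hypothetical value that does not occur in the training data. `handle_unknown='ignore'` is one possible way to keep a one-hot encoder from failing on such unseen categories in a live environment, not necessarily the right choice for this challenge.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy data, not the challenge data
train = pd.DataFrame({'contact': ['cellular', 'telephone', 'cellular'],
                      'balance': [100, 2500, -50]})
live = pd.DataFrame({'contact': ['fax'],  # hypothetical category, never seen in training
                     'balance': [800]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Unseen categories are encoded as all zeros instead of raising an error
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['contact']),
        ('stdscaler', StandardScaler(), ['balance']),
    ],
    remainder='drop',
    sparse_threshold=0  # always return a dense array, which is easier to inspect
)

toy_preprocessor.fit(train)
live_encoded = toy_preprocessor.transform(live)
print(live_encoded)
```

Note that the helper function `get_feat_imp` in this notebook identifies the encoder by comparing its string representation with `'OneHotEncoder()'`, so changing the encoder's parameters would also require adjusting that check.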

In [None]:
# Create ColumnTransformer
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), nominal_feat),
        ('minmax', MinMaxScaler(), minmax_feat),
        ('stdscaler', StandardScaler(), numeric_feat),
        ('passed', 'passthrough', passed_feat)
    ],
    remainder='drop' 
)

## Counter imbalanced Data with Over- and Under-Sampling
### Task (optional): Within the exploratory data analysis, you might have noticed that the data is sometimes very unevenly distributed. One way to counteract such a distribution is to generate the data synthetically within the cross-validation. Since this task is optional, additional information concerning over- and under-sampling can be found in the following links:

1.  [SMOTE for imbalanced Classification with Python](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)
1.  [Imbalanced-Learn Documentation](https://imbalanced-learn.org/stable/)

In [None]:
# Counter imbalanced Data
sampling_over = None
sampling_under = None

## Models
### Task (optional): Since the main focus of this challenge lies on the live production pipeline, two classifiers are already given here. In addition, however, the list can be supplemented by other estimation methods.

In [None]:
# Create a list of classifiers
classifiers = [
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    # ...
    # <Your Additional Models>
    # ...
]

## GridSearch Parameter
### Task (optional): In order to find the best parameter configuration of the respective estimation method, the GridSearch approach is followed here. Additional parameters of the individual methods can supplement the following dictionary.
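
The naming scheme of the parameter grid can be sketched on toy data as follows. This is a minimal example under assumptions: it uses Scikit-Learn's `make_pipeline` (which names steps the same way, by the lowercased class name) and a synthetic dataset from `make_classification`. The grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data, not the challenge data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

toy_pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))

# The step names are the lowercased class names
step_names = list(toy_pipe.named_steps)
print(step_names)

# Each key has the form "<step name>__<parameter name>"
toy_grid = {
    'decisiontreeclassifier__max_depth': [2, 6],
    'decisiontreeclassifier__min_samples_leaf': [1, 10],
}

search = GridSearchCV(toy_pipe, param_grid=toy_grid, scoring='roc_auc', cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

A misspelled step name in a key raises an error when the search is fitted, which makes typos in the dictionary easy to spot.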

In [None]:
# Create Parameter for GridSearch

# For additional Hyperparameter-Tuning alter the dictionary accordingly
# INFO: To tune the hyperparameters dynamically, prefix each parameter with the lowercased name of its estimator, e.g. max_depth would be decisiontreeclassifier__max_depth

parameter = {
    'DecisionTreeClassifier':{
        'decisiontreeclassifier__max_depth': [2, 6, 8, 10],
        # 'decisiontreeclassifier__..........'
    },

    'RandomForestClassifier':{
        'randomforestclassifier__n_estimators': [2, 4, 8, 10],
        # 'randomforestclassifier__............'

    },
    # '<Your Additional Models>':{

    # }
 
}

## Training and Scoring of the Models
### Task: In the following code, you have to store both, the individual models and their respective pipelines as well as the best (final) pipeline. Please insert your code at the specified positions.
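
A minimal sketch of the save-and-reload round trip with Joblib is shown below. It runs on synthetic toy data and writes to a temporary directory. The file name pattern and the team name `ToyTeam` are assumptions for illustration; in the notebook, `MODELS` and `TEAM_NAME` could take their place.

```python
import os
import tempfile
from datetime import datetime

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data and a small fitted pipeline
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=42)
toy_pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=3, random_state=42))
toy_pipe.fit(X_toy, y_toy)

# Hypothetical target directory and file name
target_dir = tempfile.mkdtemp()
file_name = "ToyTeam_" + datetime.now().strftime("%Y%m%d_%H%M%S") + "_pipeline.joblib"
file_path = os.path.join(target_dir, file_name)

# Save the fitted pipeline and load it again
dump(toy_pipe, file_path)
restored_pipe = load(file_path)

# The restored pipeline should reproduce the predictions of the original one
same_predictions = np.array_equal(toy_pipe.predict(X_toy), restored_pipe.predict(X_toy))
print(same_predictions)
```

Joblib stores custom classes by reference, so a pipeline that contains a custom transformer can only be loaded in an environment where that class is defined or importable.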

In [None]:
%%time
"""
INFO: You can use this Code as Template for "how to save my Model in Databricks (and download it afterwards)"

model_name = TEAM_NAME + "_" + datetime.now().strftime("%Y%m%d_%H%M%S") + "_CC_03.joblib"
dump(model, os.path.join(MODELS, model_name))
print ("The fitted Model has been successfully saved, you can download it from:")
print (DATABRICKS_INSTANCE + "/files/" + TEAM_NAME + "/Models/" + model_name + "?o=" + DATABRICKS_ORGANISATION)

"""

# Create a Dictionary to Store metrics
model_metrics = {}

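# NOTE: Samplers such as SMOTE only work inside the Imbalanced-Learn Pipeline (created here with imblearn's "make_pipeline"), because the Scikit-Learn Pipeline does not support resampling steps.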
for classifier in classifiers:
    pipe_imb = make_pipeline(
        add_custom_features,
        preprocessor,
        sampling_over,
        sampling_under,
        classifier
    )

    # Store String of Classifier
    clf_str = str(classifier).replace('()','')

    # Evaluate Model with ROC-AUC and RepeatedStratifiedKFold and use GridSearch to find the best Parameters
    cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=RANDOM_STATE)

    # Create List for scorer
    scorer = ['accuracy', 'balanced_accuracy', 'roc_auc']

    # GridSearch
    grid_search = GridSearchCV(pipe_imb, param_grid=parameter.get(clf_str), scoring=scorer, n_jobs=-1, cv=cv, refit='roc_auc', verbose=1, return_train_score=True)

    # Fit grid Search
    grid_search.fit(X_train, y_train)

    #######################################################################################################################################################
    # Task: Save each estimator with best performance according to its hyperparameters
    #######################################################################################################################################################
    if SAVE_MODEL:
        # TODO: Add your code here ...
        pass  # Placeholder, so that the cell runs before your code has been added


    #######################################################################################################################################################
    # Task: Save each pipeline
    #######################################################################################################################################################
    if SAVE_PIPELINE:
        # TODO: Add your code here ...
        pass  # Placeholder, so that the cell runs before your code has been added
        

    # Save to metrics
    model_metrics[clf_str] = {'Accuracy': grid_search.cv_results_['mean_test_accuracy'].mean(),
                              'Acc_Std': grid_search.cv_results_['std_test_accuracy'].mean(),
                              'Balanced_Accuracy': grid_search.cv_results_['mean_test_balanced_accuracy'].mean(),
                              'Balanced_Accuracy_Std': grid_search.cv_results_['std_test_balanced_accuracy'].mean(),
                              'ROC_AUC': grid_search.cv_results_['mean_test_roc_auc'].mean(),
                              'ROC_AUC_Std': grid_search.cv_results_['std_test_roc_auc'].mean(),
                              'Best_Score (Grid Search)': grid_search.best_score_,
                              'Best_Params (GridSearch)': grid_search.best_params_,
                              'Best_Model (GridSearch)': grid_search.best_estimator_,
    }
    
# Save metrics to dataframe
model_metrics_df = pd.DataFrame(model_metrics).transpose()

# Choose the best model(pipeline) 
best_pipeline = model_metrics_df[model_metrics_df['Best_Score (Grid Search)'] == model_metrics_df['Best_Score (Grid Search)'].max()]['Best_Model (GridSearch)'].iloc[0]
best_pipeline_name = model_metrics_df[model_metrics_df['Best_Score (Grid Search)'] == model_metrics_df['Best_Score (Grid Search)'].max()].index[0]


#######################################################################################################################################################
# Task: Save the best (final) pipeline
#######################################################################################################################################################
if SAVE_PIPELINE:
    # TODO: Add your code here ...
    pass  # Placeholder, so that the cell runs before your code has been added
        


## Evaluate the Models
### Task: Look at the results. What do you notice?

INFO: If you want more detailed information, use 'detailed_Overview'.

In [None]:
# Metrics from used Models
model_metrics_df.transpose()

### Please insert your answer here:

In [None]:
# detailed_Overview = pd.DataFrame(grid_search.cv_results_)
# detailed_Overview

### Task: Next, let's visualize and analyze the performance. What do you notice in general? What was the actual goal of this prediction model? Does the model work as necessary? Briefly write down your observations:
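
When reading the plots, it may help to remember that accuracy alone can look good on unevenly distributed targets. The sketch below uses made-up labels (the 90/10 split is an assumption, not the challenge data) and a hypothetical model that always predicts `'no'`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up toy labels: 90 times 'no' and 10 times 'yes'
y_true = np.array(['no'] * 90 + ['yes'] * 10)

# Hypothetical model that never predicts a subscription
y_always_no = np.array(['no'] * 100)

acc = accuracy_score(y_true, y_always_no)
bal_acc = balanced_accuracy_score(y_true, y_always_no)
recall_yes = recall_score(y_true, y_always_no, pos_label='yes')

print(acc, bal_acc, recall_yes)
```

In this toy case the accuracy is high although not a single `'yes'` is found, which is why the notebook also tracks the balanced accuracy and the ROC-AUC.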

In [None]:
# Show Feature Importance
# get_feat_imp(model_metrics)

In [None]:
# Show confusion-matrix
fig, axes = plt.subplots(ncols=len(classifiers), figsize=(20, 8))
for title, pipe, ax in zip(classifiers, model_metrics_df['Best_Model (GridSearch)'], axes.flatten()):
    conf_mat_disp = plot_confusion_matrix(pipe, X_test, y_test, cmap=plt.cm.Blues, normalize='true', ax=ax)
    conf_mat_disp.ax_.set_title(title)
plt.tight_layout()

In [None]:
fig, axes = plt.subplots(ncols=len(classifiers), figsize=(20, 8))
for title, pipe, ax in zip(classifiers, model_metrics_df['Best_Model (GridSearch)'], axes.flatten()):
    roc_auc_disp = plot_roc_curve(pipe, X_test, y_test, ax=ax)
    roc_auc_disp.ax_.set_title(title)
plt.tight_layout()

### Please insert your answer here:

# Load Pipeline and predict on live Marketing Data
### Task: Now imagine you've transferred your pipeline to the production environment. The next task is to reload the pipeline which you've just saved. Next, please load the unknown data (Marketing_Live.csv). To determine your prediction quality on the live data set, you can additionally load the True Values (Marketing_Live_Actuals.csv).

In [None]:
# Loading the best Pipeline
# Best_Pipeline = load('...')

# Load unknown data (Marketing_Live) and Actuals
Marketing_Live = pd.read_csv(Marketing_Live_File_Link, sep=';')
#Marketing_Live_Actuals_File_Link = 'https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Marketing/Marketing_Live_Actuals.csv'
#y_actuals = pd.read_csv(Marketing_Live_Actuals_File_Link, sep=';')

### Task: Next, use your pipeline to create your predictions.
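
The prediction step can be sketched on toy data as follows. All DataFrames, labels and the small pipeline are made up for illustration. In the notebook, the reloaded pipeline, `Marketing_Live` and the actuals would take their place.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up training data and a small fitted pipeline
train = pd.DataFrame({'duration': [50, 60, 400, 500, 55, 450],
                      'balance': [0, 100, 2000, 3000, 50, 2500]})
y_toy = pd.Series(['no', 'no', 'yes', 'yes', 'no', 'yes'])
toy_pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42)).fit(train, y_toy)

# Made-up live data with the same feature columns as the training data
live = pd.DataFrame({'duration': [45, 480], 'balance': [20, 2800]})
live_actuals = pd.Series(['no', 'yes'])

# Predict first, then attach the results to a copy of the live data
live_pred = toy_pipe.predict(live)
live_result = live.assign(Prediction=live_pred, Actual=live_actuals.values)

live_accuracy = accuracy_score(live_actuals, live_pred)
print(live_result)
print(live_accuracy)
```

Predicting before any result columns are added keeps the input identical to the feature set the pipeline was trained on.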

In [None]:
# Create Predictions
# y_pred = '...'

In [None]:
# Marketing_Live['Prediction'] = '...'
# Marketing_Live.head()

# Data Saving

### Info (Google Colab)

If you are working in Google Colab, you can save the Results to your Google Drive by running

```
from google.colab import drive
drive.mount("/content/drive")
```

You will be requested to authenticate with your Google Account.

The Path to your Google Colab Notebooks Folder will be "/content/drive/My Drive/Colab Notebooks".

The Commands can then use this Path:

```
os.makedirs ("/content/drive/My Drive/Colab Notebooks/Results", exist_ok=True)
df_predictions.to_csv ("/content/drive/My Drive/Colab Notebooks/Results/Marketing_Live_Predictions.csv", index=False)
```

### Task: Save a DataFrame which contains the actual Live Targets as well as the corresponding Live Predictions to a CSV-File.
Please write the CSV-File in a way which can be read by a German Microsoft Excel without any necessary Modifications and submit the CSV-File together with your Solution Notebook.
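
A German Microsoft Excel expects a semicolon as column separator and a comma as decimal separator. The sketch below writes a made-up result table accordingly and reads it back. The column names, the file name and the choice of `utf-8-sig` (which adds a byte order mark that helps Excel recognize UTF-8) are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up results, not the challenge data
toy_results = pd.DataFrame({'Actual': ['no', 'yes'],
                            'Prediction': ['no', 'yes'],
                            'Probability': [0.1234567, 0.9876543]})

# Hypothetical file path in a temporary directory
csv_path = os.path.join(tempfile.mkdtemp(), "Toy_Live_Predictions.csv")

# Semicolon as separator and comma as decimal sign, floats rounded to 4 digits
toy_results.to_csv(csv_path, sep=";", decimal=",", index=False, encoding="utf-8-sig", float_format="%.4f")

# Inspect the raw text of the file
with open(csv_path, encoding="utf-8-sig") as csv_file:
    csv_text = csv_file.read()
print(csv_text)

# Reading the file back with the same settings restores the numeric column
round_trip = pd.read_csv(csv_path, sep=";", decimal=",", encoding="utf-8-sig")
print(round_trip.dtypes)
```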

In [None]:
# Marketing_Live.to_csv (os.path.join (DATABRICKS_BASE_DIRECTORY, TEAM_NAME + "_Marketing_Live_Predictions.csv"), sep=";", decimal=",", header=True, index=False, encoding="utf-8", float_format="%.4f")
# print ("The Predictions have been successfully saved to a CSV-File, you can download them from:")
# print (DATABRICKS_INSTANCE + "/files/" + TEAM_NAME + "/" + TEAM_NAME + "_Marketing_Live_Predictions.csv" + "?o=" + DATABRICKS_ORGANISATION)