[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/saschaschworm/big-data-and-data-science/blob/master/notebooks/prediction-challenge/minimum-working-example.ipynb)

# Prediction Challenge by Sebastian and Tim

Your job is to predict whether or not a person will become a customer of a bank. The data contains basic demographic information about numerous customers as well as data on phone-based marketing calls made during specific campaigns.

# Header

## Version 0.1


---


**2020-01-24** Started with the minimum working example

## Data Dictionary

<table style="width: 100%;">
    <thead>
        <tr>
            <th style="width: 30%; text-align: left;">Feature</th>
            <th style="width: 70%; text-align: left;">Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>date</td>
            <td>The last contact date</td>
        </tr>
        <tr>
            <td>age</td>
            <td>The age of the customer</td>
        </tr>
        <tr>
            <td>marital_status</td>
            <td>The marital status of the customer</td>
        </tr>
        <tr>
            <td>education</td>
            <td>The education level of the customer</td>
        </tr>
        <tr>
            <td>job</td>
            <td>The type of job of the customer</td>
        </tr>
        <tr>
            <td>credit_default</td>
            <td>Whether or not the customer has a credit in default</td>
        </tr>
        <tr>
            <td>housing_loan</td>
            <td>Whether or not the customer has a housing loan</td>
        </tr>
        <tr>
            <td>personal_loan</td>
            <td>Whether or not the customer has a personal loan</td>
        </tr>
        <tr>
            <td>communication_type</td>
            <td>The type of contact communication</td>
        </tr>
        <tr>
            <td>n_contacts_campaign</td>
            <td>The number of contacts with this customer during this marketing campaign</td>
        </tr>
        <tr>
            <td>days_since_last_contact</td>
            <td>The number of days that passed since the customer was last contacted in a previous campaign</td>
        </tr>
        <tr>
            <td>n_contacts_before</td>
            <td>The number of contacts with this customer before this marketing campaign</td>
        </tr>
        <tr>
            <td>previous_conversion</td>
            <td>Whether or not the customer has been a customer before</td>
        </tr>
        <tr>
            <td>success</td>
            <td>Whether or not the customer became an actual customer (target variable)</td>
        </tr>
    </tbody>   
</table>   

# Programming

## Package Import

In [0]:
import numpy as np
import pandas as pd
import xgboost as xgb
from scipy.stats import uniform, randint
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

## Data Import

In [0]:
dataset = pd.read_csv(
    'https://raw.githubusercontent.com/saschaschworm/big-data-and-data-science/' +
    'master/datasets/prediction-challenge/dataset.csv', 
    index_col='identifier', parse_dates=['date'])

prediction_dataset = pd.read_csv(
    'https://raw.githubusercontent.com/saschaschworm/big-data-and-data-science/' +
    'master/datasets/prediction-challenge/prediction-dataset.csv', 
    index_col='identifier', parse_dates=['date'])

## Feature Engineering

In [3]:
# Create some new features based on the given features
# or enrich the dataset with features from other datasets.
dataset.head(3)

identifier,date,age,marital_status,education,job,credit_default,housing_loan,personal_loan,communication_type,n_contacts_campaign,days_since_last_contact,n_contacts_before,previous_conversion,duration,success
34203,2009-05-06,36,divorced,High School,Service provider,No,No,Yes,Landline network,5,-1,0,Inexistent,992,No
1250,2008-05-08,32,single,University,Student,No,Yes,No,Landline network,1,-1,0,Inexistent,147,No
38130,2009-09-04,68,single,University,Pensioner,No,Yes,No,Celluar phone network,1,-1,0,Inexistent,139,No
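
As one possible starting point, the sketch below derives a few features from the columns shown above. The feature names (`contact_month`, `contact_weekday`, `contacted_before`) and the helper `add_features` are hypothetical, and treating `-1` in `days_since_last_contact` as "never contacted before" is an assumption based on the sample rows, not something the data dictionary states.

```python
import pandas as pd

# Tiny stand-in for `dataset`, using values from the rows shown above.
sample = pd.DataFrame(
    {
        'date': pd.to_datetime(['2009-05-06', '2008-05-08', '2009-09-04']),
        'days_since_last_contact': [-1, -1, -1],
        'n_contacts_before': [0, 0, 0],
    },
    index=pd.Index([34203, 1250, 38130], name='identifier'),
)


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a few illustrative derived columns."""
    enriched = frame.copy()
    enriched['contact_month'] = enriched['date'].dt.month
    enriched['contact_weekday'] = enriched['date'].dt.dayofweek  # Monday = 0
    # Assumption: -1 marks customers without a previous contact.
    enriched['contacted_before'] = (
        enriched['days_since_last_contact'] != -1).astype(int)
    return enriched


features = add_features(sample)
print(features[['contact_month', 'contact_weekday', 'contacted_before']])
```

The same function could be applied to both `dataset` and `prediction_dataset` so that training and prediction see identical columns.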


## Model, Pipeline and Scoring Initialization

In [4]:
X, y = dataset[['age', 'n_contacts_campaign']], dataset['success']
# data_dmatrix = xgb.DMatrix(data=X,label=y)
y

identifier
34203    No
1250     No
38130    No
19300    No
34497    No
         ..
33642    No
2079     No
36964    No
25883    No
7256     No
Name: success, Length: 37069, dtype: object
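
The target is stored as the strings `Yes` and `No`. Recent XGBoost releases, as far as we know, reject string class labels and expect integers starting at 0, so an explicit mapping may be needed before fitting. A minimal sketch on a stand-in series (the names `y_sample` and `y_encoded` are hypothetical):

```python
import pandas as pd

# Stand-in for dataset['success'].
y_sample = pd.Series(['No', 'No', 'Yes', 'No'], name='success')

# Map the positive class to 1 and everything else to 0.
y_encoded = (y_sample == 'Yes').astype(int)

print(y_encoded.tolist())
# Share of each class; useful to see how rare the positive class is.
print(y_sample.value_counts(normalize=True))
```

With integer labels, the scorer's `pos_label` would have to be `1` rather than `'Yes'`.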

In [0]:
hyperparams = {
    'loss': 'log', 'penalty': 'l2', 'alpha': 0.0001, 'max_iter': 1000, 'tol': 1e-3, 
    'n_jobs': -1, 'random_state': 1909, 'learning_rate': 'invscaling', 'eta0': 0.01}
# classifier = SGDClassifier(**hyperparams)
classifier = xgb.XGBClassifier(random_state=1909)

In [0]:
scorer = make_scorer(f1_score, pos_label='Yes')
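
For reference, F1 is the harmonic mean of precision and recall for the positive class (`Yes` here). The sketch below computes it by hand on made-up labels; the helper `f1_for_label` is hypothetical and only meant to show what the scorer measures.

```python
def f1_for_label(y_true, y_pred, pos_label='Yes'):
    """Compute F1 for one class from two equally long label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        return 0.0  # without true positives, F1 is defined as zero here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels, not taken from the dataset.
y_true = ['Yes', 'No', 'Yes', 'No', 'No', 'Yes']
y_pred = ['Yes', 'No', 'No', 'Yes', 'No', 'No']

score = f1_for_label(y_true, y_pred)
print(f'{score:.2f}')
```

A model that always predicts `No` gets an F1 of zero under this definition, regardless of how accurate it is overall.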

In [0]:
numeric_features = ['age']
numeric_transformer = Pipeline([
    ('scaler', MinMaxScaler()),
])

preprocessor = ColumnTransformer([
    ('numeric_transformer', numeric_transformer, numeric_features),
])

pipeline = Pipeline([
    ('preprocessor', preprocessor), 
    ('classifier', classifier)
])
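
Note that `ColumnTransformer` drops every column not assigned to a transformer by default (`remainder='drop'`), so with `numeric_features = ['age']` the classifier would see only the scaled `age` column. The sketch below shows the difference on made-up values. Passing `remainder='passthrough'` is one option; listing `n_contacts_campaign` in `numeric_features` would be another.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Made-up rows with the two columns selected for X.
X_sample = pd.DataFrame({'age': [20, 40, 60], 'n_contacts_campaign': [1, 5, 2]})

# Default behaviour: columns without a transformer are dropped.
dropping = ColumnTransformer([('scaler', MinMaxScaler(), ['age'])])
# Alternative: keep the remaining columns unchanged.
keeping = ColumnTransformer(
    [('scaler', MinMaxScaler(), ['age'])], remainder='passthrough')

dropped_out = dropping.fit_transform(X_sample)
kept_out = keeping.fit_transform(X_sample)

print(dropped_out.shape)  # only the scaled age column
print(kept_out.shape)     # scaled age plus the untouched remaining column
```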

In [0]:
n_estimators = randint(100, 500)
max_depth = randint(1, 10)
# scipy's uniform(loc, scale) samples from [loc, loc + scale], i.e. [0, 0.0001] here.
param_distributions = {
    'classifier__n_estimators': n_estimators,
    'classifier__max_depth': max_depth,
    'classifier__learning_rate': uniform(0, 0.0001),
}
# , 'classifier__eta0': uniform(0, 0.0001)
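
`scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, and `randint(low, high)` samples integers from `low` to `high - 1`. The `learning_rate` candidates above therefore never exceed `0.0001`, which is very small for a boosted model with a few hundred trees and may limit how much the model can learn. The sketch below compares that range with a wider one; the bounds `0.01` and `0.30` are an illustrative assumption, not values from the original notebook.

```python
from scipy.stats import randint, uniform

narrow = uniform(0, 0.0001)   # range used above: [0, 0.0001]
wider = uniform(0.01, 0.29)   # hypothetical alternative: [0.01, 0.30]
depth = randint(1, 10)        # integers 1..9 (upper bound is exclusive)

# Draw samples with a fixed seed to inspect the ranges.
narrow_draws = narrow.rvs(size=1000, random_state=1909)
wider_draws = wider.rvs(size=1000, random_state=1909)
depth_draws = depth.rvs(size=1000, random_state=1909)

print(narrow_draws.max(), wider_draws.min(), wider_draws.max())
print(depth_draws.min(), depth_draws.max())
```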

In [0]:
xgb.XGBClassifier?

In [0]:
search = RandomizedSearchCV(
    pipeline, param_distributions=param_distributions, n_iter=3, scoring=scorer, 
    n_jobs=-1, cv=10, random_state=1909, return_train_score=True)

In [0]:
search = search.fit(X, y)
# classifier.fit(X, y)

In [12]:
f'Optimal parameters: {search.best_params_}'

"Optimal parameters: {'classifier__learning_rate': 6.699737886943773e-05, 'classifier__max_depth': 6, 'classifier__n_estimators': 266}"

## Evaluation

In [0]:
training_score = search.cv_results_['mean_train_score'][search.best_index_] * 100
test_score = search.cv_results_['mean_test_score'][search.best_index_] * 100

In [28]:
f'Mean F1 Score (Training/Test): {training_score:.2f}%/{test_score:.2f}%'

'Mean F1 Score (Training/Test): 5.30%/4.28%'

## Prediction

In [0]:
predictions = search.best_estimator_.predict(prediction_dataset)

## Submission Dataset Preparation

Your upload to the Online-Campus should contain your written report (the actual seminar paper), this notebook as a file, and the generated submission dataset with your predictions.

In [0]:
submission = pd.DataFrame(
    predictions, index=prediction_dataset.index, columns=['prediction'])

In [0]:
matriculation_number = '12345678'

In [0]:
submission.to_csv(
    f'./submission-{matriculation_number}.csv', index_label='identifier')
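
Before uploading, it may be worth checking that the file reads back with the expected index and column. The sketch below does a round trip through an in-memory buffer instead of a file; the identifiers and predictions are made up.

```python
import io

import pandas as pd

# Made-up predictions and identifiers standing in for the real ones.
predictions_sample = ['No', 'Yes', 'No']
index = pd.Index([101, 102, 103], name='identifier')
submission_sample = pd.DataFrame(
    predictions_sample, index=index, columns=['prediction'])

# Write to a buffer exactly as the notebook writes to disk, then read it back.
buffer = io.StringIO()
submission_sample.to_csv(buffer, index_label='identifier')
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='identifier')

print(restored)
```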

# Useful Links
* XGBoost implementation: [https://www.datacamp.com/community/tutorials/xgboost-in-python](https://www.datacamp.com/community/tutorials/xgboost-in-python)