## Task: Predict whether or not someone will default on their loan.

In [14]:
import pandas as pd
import numpy as np

from pandas_profiling import ProfileReport
from sklearn.metrics import accuracy_score
from supervised import AutoML

# 1 - Pandas Profiling

In [5]:
df = pd.read_csv('train.csv')

profile = ProfileReport(df, title='Loan')
profile.to_file('eda_default.html')

- No missing values.
- Suspicious columns (dropped before modeling): Payment Plan, Accounts Delinquent, Loan Title, ID, Delinquency - two years, Inquires - six months
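
One way to back up a list like this with a check is to flag columns that are constant, unique per row, or dominated by a single value. The sketch below is only illustrative: `flag_suspicious_columns`, the 0.95 threshold, and the toy data are assumptions, not the criteria actually used to build the list above.

```python
import pandas as pd

def flag_suspicious_columns(frame: pd.DataFrame, dominance: float = 0.95) -> dict:
    """Return {column: reason} for columns that likely carry little signal."""
    flagged = {}
    n_rows = len(frame)
    for col in frame.columns:
        # value_counts sorts by frequency, so the first entry is the most common value
        counts = frame[col].value_counts(dropna=False)
        if len(counts) <= 1:
            flagged[col] = "constant"
        elif len(counts) == n_rows:
            flagged[col] = "unique per row"
        elif counts.iloc[0] / n_rows >= dominance:
            flagged[col] = "dominated by one value"
    return flagged

# Toy data, made up for illustration; this is not the real train.csv.
toy = pd.DataFrame({
    "ID": range(100),
    "Payment Plan": ["n"] * 100,
    "Accounts Delinquent": [0] * 100,
    "Delinquency - two years": [0] * 97 + [1, 2, 3],
    "Loan Amount": [1000 * (i % 7 + 1) for i in range(100)],
})
flags = flag_suspicious_columns(toy)
print(flags)
```

Note that the "unique per row" rule would also flag a continuous feature whose values never repeat, so the result is a list of candidates to inspect by hand, not a list to drop blindly.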

# 2 - AutoML

https://supervised.mljar.com/

In [16]:
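# Columns flagged as suspicious in the profiling report
cols_to_drop = ['Payment Plan', 'Accounts Delinquent', 'Loan Title', 'ID', 'Delinquency - two years', 'Inquires - six months']
# Features: every column except the last one (assumed to be the target) and the flagged ones
X = df.iloc[:, :-1].drop(columns=cols_to_drop)
# Target
y = df['Loan Status']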

In [23]:
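automl = AutoML(
    results_path="AutoML_results_2",  # directory where models and reports are saved
    start_random_models=3,            # random-parameter models tried in the not_so_random step
    golden_features=True,             # try engineered "golden" features
    features_selection=True,          # run the feature selection step
    hill_climbing_steps=2,            # rounds of hill-climbing tuning
)
automl.fit(X, y)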

Linear algorithm was disabled.
AutoML directory: AutoML_results_2
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble']
* Step simple_algorithms will try to check up to 4 models
1_Baseline logloss 0.30827 trained in 1.52 seconds
2_DecisionTree logloss 0.30853 trained in 15.07 seconds
3_DecisionTree logloss 0.314254 trained in 17.48 seconds
4_DecisionTree logloss 0.314254 trained in 16.37 seconds
* Step default_algorithms will try to check up to 3 models
5_Default_Xgboost logloss 0.309061 trained in 12.66 seconds
6_Default_NeuralNetwork logloss 0.308961 trained in 19.84 seconds
7_Default_RandomForest logloss 0.307872 trained in 52.92 seconds
* Step 

# AutoML Leaderboard

| Best model   | name                                                                                           | model_type     | metric_type   |   metric_value |   train_time |
|:-------------|:-----------------------------------------------------------------------------------------------|:---------------|:--------------|---------------:|-------------:|
|              | [1_Baseline](1_Baseline/README.md)                                                             | Baseline       | logloss       |       0.30827  |         2.26 |
|              | [2_DecisionTree](2_DecisionTree/README.md)                                                     | Decision Tree  | logloss       |       0.30853  |        17.11 |
|              | [3_DecisionTree](3_DecisionTree/README.md)                                                     | Decision Tree  | logloss       |       0.314254 |        18.63 |
|              | [4_DecisionTree](4_DecisionTree/README.md)                                                     | Decision Tree  | logloss       |       0.314254 |        17.55 |
|              | [5_Default_Xgboost](5_Default_Xgboost/README.md)                                               | Xgboost        | logloss       |       0.309061 |        14.8  |
|              | [6_Default_NeuralNetwork](6_Default_NeuralNetwork/README.md)                                   | Neural Network | logloss       |       0.308961 |        20.8  |
|              | [7_Default_RandomForest](7_Default_RandomForest/README.md)                                     | Random Forest  | logloss       |       0.307872 |        54.52 |
|              | [8_Xgboost](8_Xgboost/README.md)                                                               | Xgboost        | logloss       |       0.309361 |        14.93 |
|              | [10_RandomForest](10_RandomForest/README.md)                                                   | Random Forest  | logloss       |       0.307866 |        32.53 |
|              | [12_NeuralNetwork](12_NeuralNetwork/README.md)                                                 | Neural Network | logloss       |       0.318136 |        35.4  |
|              | [9_Xgboost](9_Xgboost/README.md)                                                               | Xgboost        | logloss       |       0.309572 |        28.87 |
|              | [11_RandomForest](11_RandomForest/README.md)                                                   | Random Forest  | logloss       |       0.307865 |        63.35 |
|              | [13_NeuralNetwork](13_NeuralNetwork/README.md)                                                 | Neural Network | logloss       |       0.314488 |        42.35 |
|              | [11_RandomForest_GoldenFeatures](11_RandomForest_GoldenFeatures/README.md)                     | Random Forest  | logloss       |       0.30808  |        50.12 |
|              | [10_RandomForest_GoldenFeatures](10_RandomForest_GoldenFeatures/README.md)                     | Random Forest  | logloss       |       0.307916 |        38.65 |
|              | [7_Default_RandomForest_GoldenFeatures](7_Default_RandomForest_GoldenFeatures/README.md)       | Random Forest  | logloss       |       0.30791  |        43.9  |
|              | [11_RandomForest_RandomFeature](11_RandomForest_RandomFeature/README.md)                       | Random Forest  | logloss       |       0.307976 |        56.52 |
|              | [11_RandomForest_SelectedFeatures](11_RandomForest_SelectedFeatures/README.md)                 | Random Forest  | logloss       |       0.307963 |        42.96 |
|              | [6_Default_NeuralNetwork_SelectedFeatures](6_Default_NeuralNetwork_SelectedFeatures/README.md) | Neural Network | logloss       |       0.312654 |        33.35 |
|              | [5_Default_Xgboost_SelectedFeatures](5_Default_Xgboost_SelectedFeatures/README.md)             | Xgboost        | logloss       |       0.309074 |        20.23 |
| **the best** | [Ensemble](Ensemble/README.md)                                                                 | Ensemble       | logloss       |       0.307693 |         9.16 |
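
To put the leaderboard above in context: if the Baseline model predicts the training class frequency for every row (an assumption about how mljar-supervised builds it, not something the log states), its log loss equals the binary entropy of the positive rate. The sketch below shows that formula and the relative gap between Baseline and Ensemble. `prior_logloss` is a hypothetical helper and the 10% rate is a made-up example; the two metric values are copied from the table.

```python
import math

def prior_logloss(positive_rate: float) -> float:
    """Log loss of predicting `positive_rate` for every row when that is
    also the true fraction of positives (binary entropy, natural log)."""
    p = positive_rate
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Hypothetical positive rate, for illustration only.
print(prior_logloss(0.10))

# Values copied from the leaderboard above.
baseline_logloss = 0.30827
ensemble_logloss = 0.307693
relative_gain = (baseline_logloss - ensemble_logloss) / baseline_logloss
print(f"Ensemble improves on Baseline by {relative_gain:.2%}")
```

The relative gain comes out at roughly 0.2%, which suggests that on this validation metric the tuned models add very little over a constant prediction.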


# 3 - Prediction

In [27]:
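df_predict = pd.read_csv('test.csv')
# Drop the last column, assuming test.csv has the same layout as train.csv (target last)
X_predict = df_predict.iloc[:, :-1]
# Drop the same flagged columns that were removed from the training features
X_predict = X_predict.drop(columns=cols_to_drop)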
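
Positional slicing only works if test.csv has exactly the same column layout as train.csv. A more defensive option is to select the test features by the training column names. This is a minimal sketch; `align_features` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def align_features(test_frame: pd.DataFrame, train_columns) -> pd.DataFrame:
    """Select the training columns from the test data, in training order."""
    missing = [c for c in train_columns if c not in test_frame.columns]
    if missing:
        raise KeyError(f"test data is missing training columns: {missing}")
    return test_frame.loc[:, list(train_columns)]

# Toy data, made up for illustration; this is not the real test.csv.
train_cols = ["Loan Amount", "Interest Rate"]
toy_test = pd.DataFrame({
    "ID": [1, 2],
    "Interest Rate": [10.5, 12.0],
    "Loan Amount": [5000, 7000],
    "Loan Status": [None, None],
})
aligned = align_features(toy_test, train_cols)
print(aligned.columns.tolist())
```

In this notebook the equivalent call would be `X_predict = align_features(df_predict, X.columns)`, which fails loudly if a training column is absent instead of silently shifting columns.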

In [30]:
y_predict = automl.predict(X_predict)

In [32]:
df_submit = pd.DataFrame({'Loan Status': y_predict})
df_submit.to_csv('submission.csv', index=False)