On an imbalanced dataset, we should choose a metric that takes the imbalance into account. Accuracy is not informative in this case, because a classifier that always predicts the majority class already obtains a high accuracy.
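As a minimal illustration of this point, the sketch below uses hypothetical toy labels (95 negatives and 5 positives, not our dataset) and a trivial classifier that always predicts the majority class. The variable names are made up for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels: 95 negatives, 5 positives (not the notebook's data)
y_true_toy = np.array([0] * 95 + [1] * 5)
# A trivial classifier that always predicts the majority class
y_majority = np.zeros_like(y_true_toy)

toy_accuracy = accuracy_score(y_true_toy, y_majority)
toy_recall = recall_score(y_true_toy, y_majority)
print(f"accuracy: {toy_accuracy:.2f}, recall: {toy_recall:.2f}")
```

The accuracy looks high even though no positive example is ever detected, which is why we look at other metrics.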

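We want a classifier that detects the positive class well without raising too many false alarms. For simplicity we chose the roc_auc_score for our problem. It calculates the area under the ROC curve, that is, the curve through the points (false positive rate, true positive rate) obtained for different thresholds. The true positive rate is the recall, and the false positive rate is 1 - specificity.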
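To make the definition concrete, here is a small sketch on hypothetical toy scores (not our dataset). It builds the ROC points with `roc_curve`, integrates them with the trapezoidal rule, and compares the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical toy labels and predicted scores (not the notebook's data)
y_true_roc = np.array([0, 0, 1, 1])
y_score_roc = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true_roc, y_score_roc)

# Trapezoidal area under the (FPR, TPR) points
auc_manual = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
auc_sklearn = roc_auc_score(y_true_roc, y_score_roc)
print(f"manual: {auc_manual:.4f}, sklearn: {auc_sklearn:.4f}")
```

Both values agree, since `roc_auc_score` is the area under the same curve.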

Other metrics, such as the F-score, are also suited to our case.
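For reference, the F1-score is the harmonic mean of precision and recall. The sketch below checks this on hypothetical toy predictions (not our dataset); the variable names are illustrative only.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy labels and hard predictions (not the notebook's data)
y_true_f1 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat_f1 = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

precision = precision_score(y_true_f1, y_hat_f1)  # TP / (TP + FP)
recall = recall_score(y_true_f1, y_hat_f1)        # TP / (TP + FN)
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true_f1, y_hat_f1)
print(f"precision: {precision:.2f}, recall: {recall:.2f}, F1: {f1_sklearn:.4f}")
```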

## MLP Classifier

In [None]:
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    solver="adam", hidden_layer_sizes=(30,15), max_iter=42,
    batch_size=100, random_state=57)

In [None]:
model.fit(X_train, Y_train)

In [None]:
y_pred = model.predict(X_test)
from sklearn.metrics import roc_auc_score

print(f"roc_auc_score: {roc_auc_score(Y_test.values[:,0],y_pred):.4f}")

We can see that the roc_auc_score is not good: it is close to $0.5$, the value obtained by a naive classifier.
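One thing worth checking, as an assumption rather than a diagnosis of our model: the score above is computed from hard labels returned by `predict`, which corresponds to a single threshold. The toy sketch below (hypothetical values, not our data) shows how scores that rank the classes perfectly can still give $0.5$ once they are cut at the default threshold of 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and predicted probabilities (not the notebook's data).
# The positives are ranked above the negatives, but no probability reaches 0.5.
y_true_demo = np.array([0, 0, 0, 1, 1, 1])
proba_demo = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45])

# Hard labels at the default 0.5 threshold: every sample is predicted negative
labels_demo = (proba_demo >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true_demo, proba_demo)
auc_from_labels = roc_auc_score(y_true_demo, labels_demo)
print(f"from probabilities: {auc_from_scores:.2f}, from hard labels: {auc_from_labels:.2f}")
```

With a fitted scikit-learn classifier, the probabilities of the positive class are available through `predict_proba(X_test)[:, 1]` and can be passed to `roc_auc_score` instead of the hard labels.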

## Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

In [None]:
rf_clf.fit(X_train, Y_train)

In [None]:
# Predict with the random forest (rf_clf), not the MLP (model) fitted earlier
y_pred = rf_clf.predict(X_test)
print(f"roc_auc_score: {roc_auc_score(Y_test.values[:,0],y_pred):.4f}")

The random forest model gives no real improvement in roc_auc_score. We should take the imbalance in the training dataset into account.

We will keep the random forest model and set the "class_weight" parameter to "balanced" to account for the imbalance in the training dataset and see the results.
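As a sketch of what the "balanced" setting does, scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`, so the rarer class gets a larger weight. The example below reproduces these weights on hypothetical toy labels (90 negatives and 10 positives, not our dataset) and compares them with `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 90 negatives, 10 positives (not the notebook's data)
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_toy)

# "balanced" heuristic: n_samples / (n_classes * count of each class)
weights_manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
weights_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
print(dict(zip(classes.tolist(), weights_manual.round(4).tolist())))
```

Errors on the minority class therefore weigh more when the trees are grown.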

In [9]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')

In [10]:
rf_clf.fit(X_train, Y_train)

RandomForestClassifier(class_weight='balanced', random_state=42)

In [None]:
y_pred = rf_clf.predict(X_test)

In [None]:
from sklearn.metrics import roc_auc_score

print(f"roc_auc_score: {roc_auc_score(Y_test.values[:,0],y_pred):.4f}")

We can see that by taking the data imbalance into account we obtain a better roc_auc_score on the test dataset. A good machine learning algorithm should account for this imbalance and should at least beat this score.