# Improvement of previous donor model

In [9]:
import pandas as pd
import numpy as np

In [5]:
data = pd.read_pickle('df.pkl')

We have chosen 3 models for the data that we have gathered:

- Logistic Regression
- Decision Tree
- Gradient Boosting

The earlier experiments with these models gave the following scores:

| Metric         | Benchmark Model | Unoptimized Model | Optimized Model |
| :------------: | :-------------: | :---------------: | :-------------: |
| Accuracy Score | 0.2478          | 0.8701            | 0.8704          |
| F-score        | 0.2917          | 0.7497            | 0.7510          |
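
For readers who want to see how the F-score is computed, here is a minimal sketch. It assumes the F-score in the table is the same F-beta with `beta = 0.5` that the scorer later in this notebook uses. The labels `y_true` and `y_pred` are toy values invented for illustration and are not taken from the donor data.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels, invented purely for illustration (not from the donor data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

beta = 0.5  # beta < 1 weights precision more heavily than recall

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)
manual_f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
library_f = fbeta_score(y_true, y_pred, beta=beta)

print("precision={:.4f} recall={:.4f}".format(precision, recall))
print("manual F-beta={:.4f} sklearn F-beta={:.4f}".format(manual_f, library_f))
```

The manual formula and `fbeta_score` should agree, which makes it easier to reason about why a model with higher precision can score better here even when its recall is lower.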

Our next objective is to improve these scores. We are adopting the following two methods:

- We run `GridSearchCV(...)` with LogisticRegression, as the reviewer suggested, to see if we can do something better.

- We use XGBoost to see whether it still gives the best result, as it usually does.

We may also perform some **Feature Engineering** with the best model above to see if we can boost the score even further.


# 1. GridSearchCV on Logistic Regression

In [25]:
label = data.income
features = data.drop('income', axis=1)

In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    label, 
                                                    test_size = 0.2, 
                                                    random_state = 0)
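
A quick sanity check on a split like the one above can be sketched as follows. Since `df.pkl` is not available here, the sketch uses synthetic stand-in data (`toy_features`, `toy_label` are hypothetical names). It also passes `stratify`, which the cell above does not use; that is an optional variant that keeps the class ratio the same in both parts, not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the donor data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
toy_features = pd.DataFrame(rng.rand(1000, 3), columns=['f1', 'f2', 'f3'])
toy_label = pd.Series((rng.rand(1000) < 0.25).astype(int), name='income')

# Same 80/20 split as above, with optional stratification on the label.
Xtr, Xte, ytr, yte = train_test_split(toy_features,
                                      toy_label,
                                      test_size=0.2,
                                      random_state=0,
                                      stratify=toy_label)

# Check the sizes and the share of positive labels in each part.
print("train shape: {}, test shape: {}".format(Xtr.shape, Xte.shape))
print("positive rate: train={:.3f} test={:.3f}".format(ytr.mean(), yte.mean()))
```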

In [45]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import fbeta_score, accuracy_score

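# Score candidates with F-beta (beta = 0.5), which weights precision more heavily than recall
scorer = make_scorer(fbeta_score, beta=0.5)

# Base classifier with fixed manual class weights
clf = LogisticRegression(class_weight={1: 0.45, 0: 0.56}, 
                         max_iter=5000,
                         solver='newton-cg')

# Candidate values for the inverse regularization strength C
params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# Cross-validated grid search over C on the training set
grid_obj = GridSearchCV(clf, params, scoring=scorer)
grid_fit = grid_obj.fit(X_train, y_train.ravel())

# Predictions from the unoptimized model (default C) and from the best grid-search model
predictions = (clf.fit(X_train, y_train.ravel())).predict(X_test)
best_clf = grid_fit.best_estimator_
best_predictions = best_clf.predict(X_test)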

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))

Unoptimized model
------
Accuracy score on testing data: 0.8411
F-score on testing data: 0.6877

Optimized Model
------
Final accuracy score on the testing data: 0.8432
Final F-score on the testing data: 0.6950
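
The output above shows only the final test scores, not which value of `C` the search selected. A minimal sketch of how one might inspect a fitted `GridSearchCV` object follows. It runs on synthetic data from `make_classification` (an assumption, since the donor data is not available here), and `cv=5` is chosen for the example rather than taken from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (hypothetical, for illustration only).
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   weights=[0.75, 0.25], random_state=0)

scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(LogisticRegression(max_iter=5000, solver='newton-cg'),
                    {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
                    scoring=scorer,
                    cv=5)
grid.fit(X_toy, y_toy)

# Which C won, and its mean cross-validated F-beta score.
print("Best C: {}".format(grid.best_params_['C']))
print("Best mean CV F-score: {:.4f}".format(grid.best_score_))

# Per-candidate scores, best first.
results = pd.DataFrame(grid.cv_results_)[['param_C', 'mean_test_score', 'std_test_score']]
print(results.sort_values('mean_test_score', ascending=False))
```

Looking at the whole table rather than only `best_params_` shows whether the score is flat across `C`, which would suggest that regularization strength is not the limiting factor for this model.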


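Both the **accuracy** and the **F-beta score** remain below those of Gradient Boosting, but they did improve on the previous Logistic Regression setup. Within this run, tuning `C` raised the accuracy from 0.8411 to 0.8432 and the F-score from 0.6877 to 0.6950. Let's check **XGBoost** next.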