# Project 1: Bank Credit
### adrianty & sondrewo

### The implementation of the functions for `name_banker.py` can be found in the source code folder. We used the TestLending.py code as a baseline for the model development.

We started our model development by inspecting the columns in the data set and identifying the numerical and categorical features. The Naive Bayes classifier handles categorical features natively and can be adapted to use numerical ones as well. We therefore formulated the following hypothesis:

H<sub>0</sub> : The Multinomial Naive Bayes classifier will provide a high accuracy

We then attempted to falsify this hypothesis (Exp 1) by comparing Multinomial Naive Bayes against other models: KNN, Bernoulli Naive Bayes, logistic regression, a neural network (MLP) and AdaBoost.

In [134]:
from sklearn.naive_bayes import MultinomialNB 
from sklearn.naive_bayes import BernoulliNB 
from sklearn.naive_bayes import GaussianNB 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
import pandas
import math
import matplotlib.pyplot as plt

In [110]:
PATH = "./data/credit/D_valid.csv"
features = ['checking account balance', 'duration', 'credit history',
            'purpose', 'amount', 'savings', 'employment', 'installment',
            'marital status', 'other debtors', 'residence time',
            'property', 'age', 'other installments', 'housing', 'credits',
            'job', 'persons', 'phone', 'foreign']
target = 'repaid'

df = pandas.read_csv(PATH, sep=' ',
                     names=features+[target])

In [111]:
numerical_features = ['duration', 'age', 'residence time', 'installment', 'amount', 'persons', 'credits']
# Every feature that is not numerical is treated as categorical
categorical_features = [f for f in features if f not in numerical_features]
# One-hot encode the categorical features; drop_first removes one redundant indicator per feature
X = pandas.get_dummies(df, columns=categorical_features, drop_first=True)
encoded_features = [c for c in X.columns if c != target]
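
To make the encoding step concrete, here is a minimal sketch of what `pandas.get_dummies` with `drop_first=True` does to one categorical column. The toy values (including the category codes) are made up for illustration and are not taken from the data set.

```python
import pandas

# Toy frame with hypothetical values; the column names mirror two real features.
toy = pandas.DataFrame({
    "amount": [1000, 2500, 400],
    "housing": ["A151", "A152", "A153"],
})

# drop_first=True drops the first level ("A151"), so a feature with
# k categories becomes k-1 indicator columns.
encoded = pandas.get_dummies(toy, columns=["housing"], drop_first=True)
print(list(encoded.columns))  # ['amount', 'housing_A152', 'housing_A153']
```

A row with all indicators equal to zero then represents the dropped category.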

In [117]:
def test_decision_maker(X_test, y_test, interest_rate, decision_maker):
    """Evaluate decision_maker on held-out data.

    This is only an unbiased test if the data has not been seen in training.
    """
    n_test_examples = len(X_test)
    utility = 0
    total_amount = 0
    total_utility = 0
    decision_maker.set_interest_rate(interest_rate)
    for t in range(n_test_examples):
        action = decision_maker.get_best_action(X_test.iloc[t])
        good_loan = y_test.iloc[t]  # assume the labels are correct
        duration = X_test['duration'].iloc[t]
        amount = X_test['amount'].iloc[t]
        # If we don't grant the loan then nothing happens
        if action == 1:
            if good_loan != 1:
                utility -= amount  # the principal is lost
            else:
                utility += amount * (pow(1 + interest_rate, duration) - 1)  # compound interest earned
        # Note: this adds the running utility after every example, so the
        # second return value is not simply utility / total_amount.
        total_utility += utility
        total_amount += amount
    return utility, total_utility / total_amount
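
The payoff rule inside `test_decision_maker` can be isolated as a small function. The sketch below only restates that rule for illustration: `loan_utility` and `break_even_probability` are hypothetical helper names, and the break-even threshold assumes a banker that maximises expected utility, which the cells shown here do not confirm.

```python
def loan_utility(amount, duration, interest_rate, repaid):
    """Utility of granting one loan, following the rule in test_decision_maker."""
    if repaid:
        # Compound interest earned over the loan duration.
        return amount * ((1 + interest_rate) ** duration - 1)
    # The principal is lost on default.
    return -amount


def break_even_probability(duration, interest_rate):
    """Repayment probability above which granting has positive expected utility.

    Solving p * gain - (1 - p) * amount > 0 for p gives p > (1 + r) ** (-duration).
    """
    return (1 + interest_rate) ** (-duration)


gain = loan_utility(1000, 12, 0.017, repaid=True)
loss = loan_utility(1000, 12, 0.017, repaid=False)
threshold = break_even_probability(12, 0.017)
print(round(gain, 1), loss, round(threshold, 3))
```

Under this assumption, a longer duration lowers the threshold, because the potential interest grows while the potential loss stays at the principal.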

In [119]:
import name_banker
from sklearn.model_selection import train_test_split

interest_rate = 0.017


n_tests = 100

# Run a number of preliminary tests on random train/test splits of the data
def run_test(models):
    '''
    args:
        models (dict): models to test; key=str (name of model), value=model
    returns:
        results (dict): per model, [utility, investment return], each averaged
            over n_tests random splits and truncated to two decimals
    '''
    results = {}
    for name, model in models.items():
        print(name)
        decision_maker = name_banker.NameBanker(model)
        utility = 0
        investment_return = 0
        for _ in range(n_tests):  # the loop index is unused; avoids shadowing the built-in iter
            X_train, X_test, y_train, y_test = train_test_split(X[encoded_features], X[target], test_size=0.2)
            decision_maker.set_interest_rate(interest_rate)
            decision_maker.fit(X_train, y_train)
            Ui, Ri = test_decision_maker(X_test, y_test, interest_rate, decision_maker)
            utility += Ui
            investment_return += Ri
        results[name] = [math.floor((utility / n_tests) * 100)/100.0, math.floor((investment_return / n_tests) * 100)/100.0]
    return results
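
The harness above relies on only three methods of the decision maker: `set_interest_rate`, `fit` and `get_best_action`. As a rough sketch of how a class with that interface could wrap a scikit-learn classifier, consider the following. `ExpectedUtilityBanker` is a hypothetical stand-in, not the `NameBanker` implementation from the source folder, and it assumes that label 1 means "repaid", as in `test_decision_maker`.

```python
import pandas
from sklearn.naive_bayes import MultinomialNB


class ExpectedUtilityBanker:
    """Hypothetical stand-in for NameBanker; not the authors' implementation."""

    def __init__(self, model):
        self.model = model  # any classifier exposing fit and predict_proba
        self.interest_rate = 0.0

    def set_interest_rate(self, rate):
        self.interest_rate = rate

    def fit(self, X, y):
        self.model.fit(X, y)

    def expected_utility(self, x, action):
        if action == 0:  # refusing a loan neither gains nor loses anything
            return 0.0
        # Probability of the class labelled 1 (assumed to mean "repaid").
        repaid_idx = list(self.model.classes_).index(1)
        p = self.model.predict_proba(x.to_frame().T)[0][repaid_idx]
        gain = x["amount"] * ((1 + self.interest_rate) ** x["duration"] - 1)
        return p * gain - (1 - p) * x["amount"]

    def get_best_action(self, x):
        # Grant the loan (1) only if its expected utility is positive.
        return int(self.expected_utility(x, 1) > 0)


# Tiny made-up training set, only to show the call sequence used by the harness.
X_toy = pandas.DataFrame({"duration": [6, 12, 24, 36],
                          "amount": [500, 1000, 3000, 5000]})
y_toy = pandas.Series([1, 1, 2, 2])

banker = ExpectedUtilityBanker(MultinomialNB())
banker.set_interest_rate(0.017)
banker.fit(X_toy, y_toy)
action = banker.get_best_action(X_toy.iloc[0])
print(action)
```

Any classifier that exposes `predict_proba` could be passed in place of `MultinomialNB` in this sketch.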

## Exp 1: Comparing different classification models:

In [135]:
results = run_test({"KNN": KNeighborsClassifier(n_neighbors=31),
                    "BernoulliNB": BernoulliNB(),
                    "MultinomialNB": MultinomialNB(),
                    "Log.regression": LogisticRegression(max_iter=1500),
                    "Neural Net": MLPClassifier(alpha=1, max_iter=1000),
                    "AdaBoost": AdaBoostClassifier()})

KNN
BernoulliNB
MultinomialNB
Log.regression
Neural Net
AdaBoost


In [136]:
pandas.DataFrame(results.items(), columns=["Model", "Total Utility, Avg Investment Return"])

| Model | Total Utility | Avg Investment Return |
|---|---|---|
| KNN | 375992.19 | 9.17 |
| BernoulliNB | 2194591.89 | 55.39 |
| MultinomialNB | 4410305.44 | 107.24 |
| Log.regression | 1388105.99 | 37.12 |
| Neural Net | 690783.72 | 18.38 |
| AdaBoost | 2005530.97 | 55.42 |


### Results of Exp 1:

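Multinomial NB achieved the highest utility and the highest average investment return of the six models tested. Based on these results, we chose to keep our hypothesis H<sub>0</sub> and continue the development using the Multinomial NB model.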

<b>Assumption 1</b>: Since the results for Multinomial NB were so much better than those for KNN, we assumed that changing the number of neighbours would not make KNN outperform NB, and we tested only `k=floor(sqrt(n))=31` (a common heuristic for choosing K in KNN).
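
The square-root heuristic can be checked in a couple of lines. The sketch assumes a data set of 1000 rows, which is consistent with `k=31` but is not stated explicitly in this report.

```python
import math

n_samples = 1000  # assumed data set size; the report only states the resulting k

# math.isqrt returns the integer floor of the square root.
k = math.isqrt(n_samples)
print(k)  # 31

# With the 80/20 split used in run_test, the training set alone would give a smaller k.
k_train = math.isqrt(n_samples * 8 // 10)
print(k_train)  # 28
```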

## Exp 2: Comparison with RandomBanker.py

In [123]:
def run_test_single_model(decision_maker):
    # Same procedure as run_test, for a single ready-made decision maker
    utility = 0
    investment_return = 0
    for _ in range(n_tests):  # the loop index is unused; avoids shadowing the built-in iter
        X_train, X_test, y_train, y_test = train_test_split(X[encoded_features], X[target], test_size=0.2)
        decision_maker.set_interest_rate(interest_rate)
        decision_maker.fit(X_train, y_train)
        Ui, Ri = test_decision_maker(X_test, y_test, interest_rate, decision_maker)
        utility += Ui
        investment_return += Ri

    # Average over the splits and truncate to two decimals
    return [math.floor((utility / n_tests) * 100)/100.0,
            math.floor((investment_return / n_tests) * 100)/100.0]

In [124]:
import random_banker

comp_test = {}
comp_test["Random banker"] = run_test_single_model(random_banker.RandomBanker())
comp_test["Name banker (our model)"] = run_test_single_model(name_banker.NameBanker(MultinomialNB()))

In [128]:
pandas.DataFrame(comp_test.items(), columns=["Model", "Total Utility, Avg Investment Return"])

| Model | Total Utility | Avg Investment Return |
|---|---|---|
| Random banker | 1830630.53 | 51.92 |
| Name banker (our model) | 4851005.15 | 123.77 |


### Results of Exp 2

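The table above shows that our model performed better than the random banker module on both measures: averaged over the 100 random splits, its utility was roughly 2.6 times higher (4851005.15 versus 1830630.53), and its average investment return was 123.77 versus 51.92.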