In [1]:
NAME = "Nathan Schaefer"
COLLABORATORS = "Nick Hageman"

In [3]:
# importing all the stuff needed

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import seaborn as sns
from sklearn import tree
from random import randint
from sklearn import (datasets, neighbors,
                     naive_bayes,
                     model_selection as skms,
                     linear_model, dummy,
                     metrics,
                     pipeline,
                     preprocessing as skpre)
from sklearn.preprocessing import OneHotEncoder
from mlwpy import *
from sklearn.ensemble import RandomForestClassifier

Pre-Analysis Data considerations:
1. There were no missing values in the Bank Churners Dataset, but there were columns that included strings that couldn't be processed by normal classification.
2. I addressed this by using a one-hot method to assign values to new columns that represented each unique value in the original columns.
3. I believe there was a non-linear relationship between features and targets because higher complexity classifiers seemed to perform better on the dataframes.
4. Yes, I converted them to one-hot values.
5. I didn't consider the number of rows, I can't think of a way that it would affect the classification
6. Click on the edit option to see the features formatted:


id: 1-6750                                                                                                      Not useful
Customer Age: seems to be around 30-50                                                                          Useful
Gender: M or F                                                                                                  Useful
Dependents: Around 0-5                                                                                          Useful
Education: Uneducated, high school, college, post-graduate, graduate, doctorate, unknown                        Useful
Marital Status: Single, Married, Divorced, Unknown                                                              Useful
Income category: Around 40000 - 1200000                                                                         Useful
Card Category: Blue, Gold, Silver, Platinum                                                                     Might not be useful
Months on book: Around 1-40                                                                                     Useful
Total relationship count: values around 1-7                                                                     Might not be useful
Months inactive: Around 1-5                                                                                     Useful
Contacts count 12 months: Around 0-5                                                                            Useful
Credit Limit: around 1000-30000                                                                                 Useful
Total revolving balance: 0-2500                                                                                 Useful
Average open to buy: 200-20000                                                                                  Useful
Total amount change Q4 Q1: 0-1.5                                                                                Useful
Total transaction amount: 1000-16000                                                                            Useful
Total transaction CT: 15-150                                                                                    Might not be useful
Total CT Chng Q4 Q1: 0-1.1                                                                                      Might not be useful
Avg Utilization Ratio: 0-1                                                                                      Useful
Target: 0 or 1                                                                                                  Target


For these features, it seems like most of them will be usefull for the outcome, it takes a little tinkering to find out which help and which don't. This dataset seems to fit complex classifiers better than simplistic ones. I'm not sure if the total relationship count might be helpful, or if the card category is helpful if it is related to the income category.


Feature Engineering
1. I chose features that would seem to affect if a person would want to stay with the bank. Things like their id would obviously not help. Other things like stuff that deals with the CT or their relationship count seem to be a toss-up with whether or not they affect the person's stay.
2. I believe that the evaluation of features is definitely helpful. Through train, test splits, I found which kNN n value was the best, but it may be a different value depending on how I evaluate it. For this particular project, it fluctuated between 3, 5, and 7.

In [9]:
#reading the data and forming feature and target dataframes, and dropping the columns that contain strings

data_train_df = pd.read_csv("BankChurners.train.csv") 

data_train_ft = data_train_df.drop(['id', 'Total_Ct_Chng_Q4_Q1', 'Card_Category','Marital_Status','Gender','Education_Level','Target'], axis=1)

data_train_tgt = data_train_df["Target"]
display(data_train_ft.head())

Unnamed: 0,Customer_Age,Dependent_count,Income_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Avg_Utilization_Ratio
0,51,2,40000,39,3,4,2,2581.0,1722,859.0,0.765,4431,79,0.667
1,50,2,120000,38,4,4,2,2123.0,995,1128.0,0.626,4516,78,0.469
2,44,5,120000,31,5,1,2,7567.0,2496,5071.0,0.709,4076,60,0.33
3,38,2,120000,29,4,1,2,2818.0,1656,1162.0,1.404,2916,45,0.588
4,32,1,50000,24,1,1,2,9711.0,972,8739.0,0.647,14926,115,0.1


In [4]:
#onehot function to create one-hot dataframes
def onehot(x,str):
    onehot = x[['id',str]]
    encoder = OneHotEncoder(handle_unknown='ignore')
    encoder_df = pd.DataFrame(encoder.fit_transform(onehot[[str]]).toarray())
    final_df = onehot.join(encoder_df)
    final_df.drop('id', axis=1, inplace=True)
    final_df.drop(str, axis=1, inplace=True)
    return(final_df)


# working the onehot function for each column
marriage = onehot(data_train_df, 'Marital_Status')
card = onehot(data_train_df, 'Card_Category')
gender = onehot(data_train_df, 'Gender')
education = onehot(data_train_df, 'Education_Level')

#Assigning values so I don't have to write it all out again
mar = ['Divorced', 'Married', 'Single', 'UnknownMAR']
car = ['Blue','Gold','Platinum','Silver']
gen = ['Female', 'Male']
edu = ['College', 'Doctorate', 'Graduate', 'High School', 'Post-Graduate', 'Uneducated', 'UnknownEDU']

marriage.columns = mar
card.columns = car
gender.columns =  gen
education.columns = edu


#Adding each column back to the original df
for col in mar:
    data_train_ft[col] = marriage[col]

for col in car:
    data_train_ft[col] = card[col]

for col in gen:
    data_train_ft[col] = gender[col]

for col in edu:
    data_train_ft[col] = education[col]

display(education.head())
display(data_train_ft.head())

Unnamed: 0,College,Doctorate,Graduate,High School,Post-Graduate,Uneducated,UnknownEDU
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0


Unnamed: 0,Customer_Age,Dependent_count,Income_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,...,Silver,Female,Male,College,Doctorate,Graduate,High School,Post-Graduate,Uneducated,UnknownEDU
0,51,2,40000,39,3,4,2,2581.0,1722,859.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,50,2,120000,38,4,4,2,2123.0,995,1128.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,44,5,120000,31,5,1,2,7567.0,2496,5071.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,38,2,120000,29,4,1,2,2818.0,1656,1162.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,32,1,50000,24,1,1,2,9711.0,972,8739.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [5]:
#Doing the train, test, split
(data_train_train_ftrs,
 data_train_test_ftrs,
 data_train_train_tgt,
 data_train_test_tgt) = skms.train_test_split(data_train_ft,
                                                      data_train_tgt,
                                                      test_size=.25,)


In [6]:
#Getting kNN and NB classifiers to test as pipelines
models_to_try = {'nb': naive_bayes.GaussianNB()}

for k in range(1,20,2):
    models_to_try[f'{k}-NN'] = neighbors.KNeighborsClassifier(n_neighbors=k)

# scaling each pipeline
scaler = skpre.StandardScaler()
pipelines_to_try = {}
for model_name in models_to_try:
    pipelines_to_try[f'std_{model_name}_pipe'] = pipeline.make_pipeline(scaler, 
                                                      models_to_try[model_name])

In [7]:
#Testing the accuract of each pipeline with scaled data
accuracy_scores = {}
for pipeline_name in pipelines_to_try:
    loo = skms.LeaveOneOut()
    scores = skms.cross_val_score(pipelines_to_try[pipeline_name],
                                  data_train_train_ftrs,
                                  data_train_train_tgt,
                                  cv=10,
                                  scoring='accuracy')
    mean_accuracy = scores.mean()
    accuracy_scores[pipeline_name] = mean_accuracy
    print(f'{pipeline_name}: {mean_accuracy:.3f}')

#Displaying and assigning the best pipeline
best_pipeline_name = max(accuracy_scores,key=accuracy_scores.get)
print(f'\nBest pipeline: {best_pipeline_name} (accuracy = {accuracy_scores[best_pipeline_name]:.3f})')

final_pipeline = pipelines_to_try[best_pipeline_name]

std_nb_pipe: 0.877
std_1-NN_pipe: 0.842
std_3-NN_pipe: 0.859
std_5-NN_pipe: 0.865
std_7-NN_pipe: 0.867
std_9-NN_pipe: 0.864
std_11-NN_pipe: 0.862
std_13-NN_pipe: 0.862
std_15-NN_pipe: 0.860
std_17-NN_pipe: 0.856
std_19-NN_pipe: 0.855

Best pipeline: std_nb_pipe (accuracy = 0.877)


In [8]:
#fitting the best pipeline again and checking the accuracy again(pretty unnecessary)
data_test_predictions = (final_pipeline.fit(data_train_ft, 
                                           data_train_tgt)
                                      .predict(data_train_ft))
test_accuracy = metrics.accuracy_score(data_train_tgt,
                                       data_test_predictions)
print(f'Test set accuracy: {test_accuracy:.2f}')


Test set accuracy: 0.88


In [12]:
#Using the classifier pyramid to check accuracy of various non-kNN classifiers

classifier_parade = \
    {'GNB' : naive_bayes.GaussianNB(),
     'SVC(2)' : svm.LinearSVC(),
     'DTC' : tree.DecisionTreeClassifier(),
     'DTC-10' : tree.DecisionTreeClassifier(max_depth=10),
     'RF': RandomForestClassifier()}


#Testing each classifier on the overall data
for name, model in classifier_parade.items():    
    cv_scores = skms.cross_val_score(model, 
                                     data_train_ft, data_train_tgt, 
                                     cv=10, 
                                     scoring='accuracy', 
                                     n_jobs=-1) # all CPUs
    print(f'model: {name} \tscores:{cv_scores} avg_scores:{round(sum(cv_scores) / len(cv_scores), 3)} ')

model: GNB 	scores:[0.8993 0.8933 0.8948 0.8785 0.8637 0.8919 0.8889 0.8696 0.8815 0.8889] avg_scores:0.885 
model: SVC(2) 	scores:[0.8415 0.6741 0.8593 0.8089 0.8356 0.4474 0.7704 0.8044 0.8385 0.8504] avg_scores:0.773 
model: DTC 	scores:[0.9259 0.9541 0.9407 0.9378 0.9437 0.9348 0.9407 0.9407 0.9452 0.923 ] avg_scores:0.939 
model: DTC-10 	scores:[0.9319 0.9541 0.9511 0.9348 0.9437 0.9393 0.9467 0.9541 0.9496 0.9333] avg_scores:0.944 
model: RF 	scores:[0.9541 0.957  0.9615 0.9674 0.9481 0.9452 0.9585 0.9556 0.9481 0.9437] avg_scores:0.954 


In [141]:
#Individually trying the decision tree
tree_classifiers = {'DTC' : tree.DecisionTreeClassifier(max_depth=10)}
dtc = tree.DecisionTreeClassifier()
skms.cross_val_score(dtc, 
                     data_train_ft, data_train_tgt, 
                     cv=10, scoring='accuracy') 

data_train_tree_predictions = (dtc.fit(data_train_ft, 
                                           data_train_tgt)
                                      .predict(data_train_ft))

tree_test_accuracy = metrics.accuracy_score(data_train_tgt,
                                       data_test_predictions)
print(f'Test set accuracy: {tree_test_accuracy:.4f}')

Test set accuracy: 0.8822


In [142]:
#reading the test data and making the features df
data_test_df = pd.read_csv("BankChurners.test.csv") 
data_test_ft = data_test_df.drop(['id','Total_Ct_Chng_Q4_Q1','Card_Category','Marital_Status','Education_Level','Gender'], axis=1)


In [143]:
#Going through the same onehot process with the test data

def onehot(x,str):
    onehot = x[['id',str]]
    encoder = OneHotEncoder(handle_unknown='ignore')
    encoder_df = pd.DataFrame(encoder.fit_transform(onehot[[str]]).toarray())
    final_df = onehot.join(encoder_df)
    final_df.drop('id', axis=1, inplace=True)
    final_df.drop(str, axis=1, inplace=True)
    return(final_df)



marriage = onehot(data_test_df, 'Marital_Status')
card = onehot(data_test_df, 'Card_Category')
gender = onehot(data_test_df, 'Gender')
education = onehot(data_test_df, 'Education_Level')

mar = ['Divorced', 'Married', 'Single', 'UnknownMAR']
car = ['Blue','Gold','Platinum','Silver']
gen = ['Female', 'Male']
edu = ['College', 'Doctorate', 'Graduate', 'High School', 'Post-Graduate', 'Uneducated', 'UnknownEDU']

marriage.columns = mar
card.columns = car
gender.columns =  gen
education.columns = edu

for col in mar:
    data_test_ft[col] = marriage[col]

for col in car:
    data_test_ft[col] = card[col]

for col in gen:
    data_test_ft[col] = gender[col]

for col in edu:
    data_test_ft[col] = education[col]

In [144]:

#I used these when applying the kNN classifiers
#fit = final_pipeline.fit(data_train_ft, data_train_tgt)
#predictions = fit.predict(data_test_ft)

#fitting and making predictions on the test data
fit = dtc.fit(data_train_ft, data_train_tgt)
predictions = fit.predict(data_test_ft)

# create submission dataframe
submission_df = data_test_df[['id']].copy()
submission_df['Target'] = predictions
display(submission_df.head())

# write to csv file
import csv
def writeSubmission(predictions):
   i=6751
   submissionList = []
   for prediction in predictions:
       submissionList.append([str(i), str(prediction)])
       i+=1
   with open('Bank_Churners_submission9.csv', 'w', newline='') as submission:
       writer = csv.writer(submission)
       writer.writerow(['id', 'Target'])
       for row in submissionList:
           writer.writerow(row)

writeSubmission(predictions)
# display message
print("Saved predictions to csv file.")

Unnamed: 0,id,Target
0,6751,1
1,6752,0
2,6753,1
3,6754,1
4,6755,1


Saved predictions to csv file.


Report:

I used kNN, NB, SV, and DT forms of classification. At the end of the day, it seemed that more complex classifiers would perform best. Because of this, the Decision Tree with a large depth would format the data best, since it has a high variance. It may not have ran the best when testing it, but it worked especially well for the test split that was hidden. 

I used one-hot encoding to use the conditional columns to my advantage. To do this, I took the unique values in the conditional columns, and turned them into their own columns. I then assigned a value of 1 to the corresponding feature to the id of the person. I then added these columns back to the original dataframe.

For my assumptions. I was right about the id number affecting it, but messing with the other features brought different results. There were three instances where I took a feature out(card category or Total CT Change Q4 Q1) and ended up with the same exact score(which was my highest score). When I removed both of them, it performed much worse. This was one of the factors that led me to trust overfitting more than underfitting with this data set. It seemed like the more factors I used, the better results I got. This makes sense, as most of the data seems useful to whether a person would stay with a bank.