### Create models to predict whether a client will default on their next payment. Your group should work together and produce an ensemble model and then analyze the effectiveness of your model.

### Write a short report explaining your best ensemble model and its evaluation. Consider which metrics you should use to evaluate your model. At a minimum you should report F1 scores, the ROC curve, and AUC, but you might also consider accuracy, confusion matrices, recall, precision, sensitivity, specificity, false positives, false negatives, the Matthews correlation coefficient, etc. Please justify the evaluation metric(s) that you use.
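
As a rough illustration of the metrics listed above, the following sketch computes them with `sklearn.metrics` on a small set of made-up labels and probabilities. The arrays `y_true` and `y_prob` and the 0.5 decision threshold are assumptions for illustration only, not values from the credit data.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score,
                             roc_curve)

# Hypothetical true labels (1 = default) and predicted default probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.4, 0.45, 0.7, 0.8, 0.9])

# Turn probabilities into class labels; the 0.5 cut-off is a choice, not a rule.
y_pred = (y_prob >= 0.5).astype(int)

# Confusion matrix entries for the binary case.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)        # also called sensitivity
specificity = tn / (tn + fp)
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

# ROC curve and AUC are computed from the probabilities, not the hard labels.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} specificity={specificity:.3f}")
print(f"F1={f1:.3f} MCC={mcc:.3f} AUC={auc:.3f}")
```

The threshold-dependent metrics (F1, precision, recall, MCC) change when the cut-off moves, while the ROC curve and AUC summarize performance across all cut-offs.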

### Submit your report as a pdf.

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

In [2]:
clients = pd.read_csv('credit_clients.csv')

In [4]:
clients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   ID                          30000 non-null  int64
 1   LIMIT_BAL                   30000 non-null  int64
 2   SEX                         30000 non-null  int64
 3   EDUCATION                   30000 non-null  int64
 4   MARRIAGE                    30000 non-null  int64
 5   AGE                         30000 non-null  int64
 6   PAY_0                       30000 non-null  int64
 7   PAY_2                       30000 non-null  int64
 8   PAY_3                       30000 non-null  int64
 9   PAY_4                       30000 non-null  int64
 10  PAY_5                       30000 non-null  int64
 11  PAY_6                       30000 non-null  int64
 12  BILL_AMT1                   30000 non-null  int64
 13  BILL_AMT2                   30000 non-null  int64
 14  BILL_A

ID: ID of each client 
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX: Gender (1=male, 2=female) 
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown) 
MARRIAGE: Marital status (1=married, 2=single, 3=others) 
AGE: Age in years 
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above) 
PAY_2: Repayment status in August, 2005 (scale same as above) 
PAY_3: Repayment status in July, 2005 (scale same as above) 
PAY_4: Repayment status in June, 2005 (scale same as above) 
PAY_5: Repayment status in May, 2005 (scale same as above) 
PAY_6: Repayment status in April, 2005 (scale same as above) 
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar) 
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar) 
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar) 
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar) 
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar) 
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar) 
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar) 
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar) 
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar) 
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar) 
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar) 
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar) 
default.payment.next.month: Default payment (1=yes, 0=no)
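
The sample rows printed by `clients.head()` contain the value 0 in the PAY columns, which the dictionary above does not list. One way to check for such codes is sketched below. The helper `undocumented_codes` and the toy frame are hypothetical; with the real data, `clients` would be passed instead.

```python
import pandas as pd

# Codes documented in the data dictionary above (subset of columns).
documented = {
    "EDUCATION": {1, 2, 3, 4, 5, 6},
    "MARRIAGE": {1, 2, 3},
    "PAY_0": {-1, 1, 2, 3, 4, 5, 6, 7, 8, 9},
}

def undocumented_codes(df, documented):
    """Return {column: sorted codes present in df but absent from the dictionary}."""
    found = {}
    for col, allowed in documented.items():
        extra = set(df[col].unique().tolist()) - allowed
        if extra:
            found[col] = sorted(extra)
    return found

# Hypothetical toy frame standing in for `clients`.
toy = pd.DataFrame({
    "EDUCATION": [2, 2, 1, 0],
    "MARRIAGE": [1, 2, 3, 1],
    "PAY_0": [2, -1, 0, 0],
})

result = undocumented_codes(toy, documented)
print(result)
```

How to treat any codes this reports (merge into an existing category, keep as a separate level, or drop) is a modelling decision that the report should state.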

In [5]:
clients.head()

   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  default payment next month
0   1      20000    2          2         1   24      2      2     -1     -1  ...          0          0          0         0       689         0         0         0         0                           1
1   2     120000    2          2         2   26     -1      2      0      0  ...       3272       3455       3261         0      1000      1000      1000         0      2000                           1
2   3      90000    2          2         2   34      0      0      0      0  ...      14331      14948      15549      1518      1500      1000      1000      1000      5000                           0
3   4      50000    2          2         1   37      0      0      0      0  ...      28314      28959      29547      2000      2019      1200      1100      1069      1000                           0
4   5      50000    1          2         1   57     -1      0     -1      0  ...      20940      19146      19131      2000     36681     10000      9000       689       679                           0


In [10]:
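# Features: columns 1-22 (LIMIT_BAL through PAY_AMT5). Note that this slice
# stops before column 23 (PAY_AMT6); use iloc[:, 1:24] to include it.
x = clients.iloc[:, 1:23]
# Target: column 24, the default indicator (1 = default, 0 = no default).
y = clients.iloc[:, 24]
# Add pairwise interaction terms (no squared terms) plus a bias column.
poly = PolynomialFeatures(interaction_only=True)
x = poly.fit_transform(x)
# Hold out 20% of the rows; no random_state is set, so the split varies per run.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)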
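
The cell above passes unscaled NT-dollar amounts and their products to the model, and `lbfgs` tends to converge more reliably on standardized inputs. A minimal sketch of one alternative is to scale inside a `Pipeline` and stratify the split. It runs on synthetic stand-in data from `make_classification`; the sample size, feature count, class weights, and `max_iter=1000` are assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the client features (hypothetical, ~20% positives).
X, y = make_classification(n_samples=1000, n_features=6,
                           weights=[0.8, 0.2], random_state=0)

# Stratify so train and test keep the same share of positives;
# a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# The scaler is fitted on the training rows only, then reused on the test rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("interactions", PolynomialFeatures(interaction_only=True,
                                        include_bias=False)),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

Keeping the preprocessing inside the pipeline also means `cross_val_score(pipe, X, y)` refits the scaler within each fold.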

In [11]:
lnr_model = LogisticRegression(solver='lbfgs')

In [13]:
lnr_model.fit(x_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [20]:
y_pred = lnr_model.predict(x_test)  # keep the predictions for later metrics
print(lnr_model.score(x_test, y_test))    # test accuracy
print(lnr_model.score(x_train, y_train))  # training accuracy

0.7708333333333334
0.7749166666666667
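
Accuracy alone can mislead when one class dominates, so it helps to compare against a majority-class baseline and to report F1 and AUC as the assignment requests. The sketch below does this for one possible ensemble, a soft-voting `VotingClassifier`. It uses synthetic stand-in data; the choice of base models, their hyperparameters, and the class weights are assumptions, not the group's final model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit data (hypothetical).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.78, 0.22], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
baseline_f1 = f1_score(y_test, baseline.predict(X_test), zero_division=0)

# Soft voting averages the predicted probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000))),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
y_prob = ensemble.predict_proba(X_test)[:, 1]

ensemble_acc = ensemble.score(X_test, y_test)
ensemble_f1 = f1_score(y_test, y_pred)
ensemble_auc = roc_auc_score(y_test, y_prob)

print(f"baseline: accuracy={baseline_acc:.3f} F1={baseline_f1:.3f}")
print(f"ensemble: accuracy={ensemble_acc:.3f} F1={ensemble_f1:.3f} AUC={ensemble_auc:.3f}")
```

The baseline never predicts the positive class, so its F1 is 0 even though its accuracy is high. A model whose accuracy only matches the baseline has learned little about defaults.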
