# default of credit card clients Data Set 

In the workshop for this week, you are to select a data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) and based on the recommended analysis type, wrangle the data into a fitted model, showing some model evaluation. In particular:

- Layout the data into a dataset X and targets y.
- Choose regression, classification, or clustering and build the best model you can from it. 
- Report an evaluation of the model built
- Visualize aspects of your model (optional)
- Compare and contrast different model families

When complete, I will review your code, so please submit your code via pull-request to the [Introduction to Machine Learning with Scikit-Learn](https://github.com/georgetown-analytics/machine-learning) repository!

## Wheat Kernel Example

Downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/seeds) on February 26, 2015. The first thing is to fully describe your data in a README file. The dataset description is as follows:

- Data Set: Multivariate
- Attribute: Real
- Tasks: Classification, Clustering
- Instances: 210
- Attributes: 7

### Data Set Information:

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.

### Attribute Information:

Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: 

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 

X2: Gender (1 = male; 2 = female). 

X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 

X4: Marital status (1 = married; 2 = single; 3 = others). 

X5: Age (year). 

X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 

X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. 
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005. 


### Relevant Papers:


## Data Exploration 

In this section we will begin to explore the dataset to determine relevant information.

In [1]:
%matplotlib inline

import os
import json
import time
import pickle
import requests


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#data = pd.read_csv('~/machine-learning/data/default/dataset.csv')

data = pd.read_csv('~/Documents/machine-learning/data/default/dataset.csv')

In [3]:
data.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


In [4]:
data.iloc[0]
df = data

new_header = df.iloc[0] #grab the first row for the header
df = df[1:] #take the data less the header row
df =df.rename(columns = new_header) #set the header row as the df header



In [5]:
#this forces all of the columns to become numeric

for c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')


In [6]:

# Describe the dataset
print(df.describe())
df.info()

                 ID       LIMIT_BAL           SEX     EDUCATION      MARRIAGE  \
count  30000.000000    30000.000000  30000.000000  30000.000000  30000.000000   
mean   15000.500000   167484.322667      1.603733      1.853133      1.551867   
std     8660.398374   129747.661567      0.489129      0.790349      0.521970   
min        1.000000    10000.000000      1.000000      0.000000      0.000000   
25%     7500.750000    50000.000000      1.000000      1.000000      1.000000   
50%    15000.500000   140000.000000      2.000000      2.000000      2.000000   
75%    22500.250000   240000.000000      2.000000      2.000000      2.000000   
max    30000.000000  1000000.000000      2.000000      6.000000      3.000000   

                AGE         PAY_0         PAY_2         PAY_3         PAY_4  \
count  30000.000000  30000.000000  30000.000000  30000.000000  30000.000000   
mean      35.485500     -0.016700     -0.133767     -0.166200     -0.220667   
std        9.217904      1.123802

In [7]:
#changing this so it is easier to work with

df.columns =['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'Default']

In [8]:
from sklearn import metrics
from sklearn import cross_validation
from sklearn.cross_validation import KFold
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split




In [9]:
feature_cols = ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']


In [10]:
X = df[feature_cols]


In [11]:
y = df.Default

In [12]:
import matplotlib.pyplot as plt

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split as tts
from sklearn import metrics
from sklearn import cross_validation
from sklearn.cross_validation import KFold

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier


In [14]:
estimator = RandomForestClassifier()

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


In [16]:
for train, test in KFold(len(X), n_folds=12, shuffle=True):
        X_train, X_test = X.iloc[train], X.iloc[test]
        y_train, y_test = y.iloc[train], y.iloc[test]   
        estimator

In [17]:
df.iloc[test]

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,Default
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
7,7,500000,1,1,2,29,0,0,0,0,...,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
31,31,230000,2,1,2,27,-1,-1,-1,-1,...,15339,14307,36923,17270,13281,15339,14307,37292,0,0
35,35,500000,1,1,1,58,-2,-2,-2,-2,...,3180,0,5293,5006,31178,3180,0,5293,768,0
40,40,280000,1,1,2,31,-1,-1,2,-1,...,9976,17976,9477,9075,0,9976,8000,9525,781,0
54,54,180000,2,1,2,25,1,2,0,0,...,43510,44420,45319,1300,2010,1762,1762,1790,1622,0
55,55,150000,2,1,2,29,2,0,0,0,...,26518,21042,16540,1600,1718,1049,1500,2000,5000,0
66,66,200000,1,1,1,57,-2,-2,-2,-1,...,8174,8198,7918,0,0,8222,300,0,1000,1
71,71,80000,1,1,2,31,-1,-1,-1,-1,...,390,390,390,0,390,390,390,390,390,0
105,105,60000,2,2,2,26,2,2,2,2,...,60218,55447,55305,0,5000,2511,6,3000,3000,0


In [19]:
label = "default data credit card random forest"
start  = time.time() # Start the clock! 
scores = {'precision':[], 'recall':[], 'accuracy':[], 'f1':[]}
    
for train, test in KFold(len(X), n_folds=12, shuffle=True):
    X_train, X_test = X.iloc[train], X.iloc[test]
    y_train, y_test = y.iloc[train], y.iloc[test] 
        
estimator = RandomForestClassifier()
estimator.fit(X_train, y_train)
        
expected  = y_test
predicted = estimator.predict(X_test)
        
# Append our scores to the tracker
scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

# Report
print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time()-start))
print("Validation scores are as follows:\n")
print(pd.DataFrame(scores).mean())
    
# Write official estimator to disk
estimator = RandomForestClassifier()
estimator.fit(X, y)
    
outpath = label.lower().replace(" ", "-") + ".pickle"
with open(outpath, 'wb') as f:
    pickle.dump(estimator, f)

#print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))

Build and Validation of default data credit card random forest took 1.222 seconds
Validation scores are as follows:

accuracy     0.812800
f1           0.788336
precision    0.792421
recall       0.812800
dtype: float64


In [19]:
label = "default data credit card SVC"
start  = time.time() # Start the clock! 
scores = {'precision':[], 'recall':[], 'accuracy':[], 'f1':[]}
    
for train, test in KFold(len(X), n_folds=12, shuffle=True):
    X_train, X_test = X.iloc[train], X.iloc[test]
    y_train, y_test = y.iloc[train], y.iloc[test] 
        
estimator = SVC()
estimator.fit(X_train, y_train)
        
expected  = y_test
predicted = estimator.predict(X_test)
        
# Append our scores to the tracker
scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

# Report
print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time()-start))
print("Validation scores are as follows:\n")
print(pd.DataFrame(scores).mean())
    
    
    
#outpath = label.lower().replace(" ", "-") + ".pickle"
#with open(outpath, 'wb') as f:
#    pickle.dump(estimator, f)

#print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))

Build and Validation of default data credit card SVC took 243.370 seconds
Validation scores are as follows:

accuracy     0.774400
f1           0.675942
precision    0.599695
recall       0.774400
dtype: float64


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [None]:
label = "default data credit card KNNclassifer "
start  = time.time() # Start the clock! 
scores = {'precision':[], 'recall':[], 'accuracy':[], 'f1':[]}
    
for train, test in KFold(len(X), n_folds=12, shuffle=True):
    X_train, X_test = X.iloc[train], X.iloc[test]
    y_train, y_test = y.iloc[train], y.iloc[test] 
        
estimator = KNeighborsClassifier(n_neighbors=12)
estimator.fit(X_train, y_train)
        
expected  = y_test
predicted = estimator.predict(X_test)
        
# Append our scores to the tracker
scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

# Report
print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time()-start))
print("Validation scores are as follows:\n")
print(pd.DataFrame(scores).mean())
    
    
    
#outpath = label.lower().replace(" ", "-") + ".pickle"
#with open(outpath, 'wb') as f:
#    pickle.dump(estimator, f)

#print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
1,1,20000,2,2,1,24,2,2,-1,-1,...,689,0,0,0,0,689,0,0,0,0
2,2,120000,2,2,2,26,-1,2,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
3,3,90000,2,2,2,34,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
4,4,50000,2,2,1,37,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
5,5,50000,1,2,1,57,-1,0,-1,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [None]:
# Write official estimator to disk
    estimator = model(**kwargs)
    estimator.fit(X, y)

In [None]:
estimator.fit(X_train, y_train)

In [None]:
estimator.score(X_train,y_train)

In [None]:
estimator.fit(X_train,y_train)

In [None]:
y_preds = estimator.predict(X_test)