# Credit One Regression


Updated: 2020.08.15


### Define a Data Science Process
Now that you have had a chance to understand the problem, you’ll need to define a Data Science process that outlines exactly how you’ll be using the data.

Given a new project, your first step towards a successful analysis should be to select, tailor, and instantiate a process framework appropriate to your project.          

Two of the most important factors in determining the success of an analysis are likely to be a clear definition of the goals of the analysis and exercising the discipline to follow the principled approach you have defined.

Consider the steps outlined below in either process (there are two alternatives, so read both first) and in your readings, then complete the following task.

### 1. Define the process that you will follow to thoroughly analyze the data found in the Credit One dataset. You may choose either framework, but you'll need to review all of the questions in each, based on your quick examination of the Credit One data.

Note that both of these process frameworks are iterative. A poor or unexpected outcome at any step might necessitate returning to previous steps. And if the problem is business critical, the process might be re-executed regularly.

### Framework One - Zumel and Mount, Practical Data Science with R, chapter 1:
Define the goal: The first step in a data science process is to define a measurable and quantifiable goal.

- Why do the stakeholders want to do the project?
* **They've seen an increase in loan defaults & they risk losing business**
- What do they need from it?
* **Bottom line: they need a better way to decide how much credit to allow someone and whether someone should be approved at all**
- Why is their current solution inadequate?
- What resources do you need?
- How will the result of your project be deployed?

Collect and manage data: This step includes identifying the data you need, then exploring and conditioning it. This is often the most time-consuming step.

- What data is available?
- Will it help to solve the problem? Is it enough?
- Is the data quality good enough?

Build the model: Here is where you try to extract useful insights from the data in order to achieve your goals.

- Which techniques might I apply to build the model?
- How many techniques should I apply?

Evaluate and critique the model: Once you have derived a model, you need to determine whether it meets your goals. If not, it’s time to loop back to the modeling step.

- Is the model accurate enough to meet the stakeholders’ needs?
- Does it perform better than "the obvious guess" and any techniques being used currently?
- Do the results of the model make sense in the context of the real-world problem domain?
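
The "obvious guess" comparison can be made concrete with a baseline classifier. The sketch below is only an illustration on synthetic data (the features, labels, and variable names are assumptions, not the Credit One data): it compares a shallow decision tree against scikit-learn's `DummyClassifier`, which always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (assumption: stands in for real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

On imbalanced data the majority-class baseline can already look accurate, so a model is only useful if it clearly beats that number.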

Present results and document: Once you have a model that meets your criteria, you will present your results to your project sponsor and other stakeholders.

- How should stakeholders interpret the model?
- How confident should they be in its predictions?
- When should they potentially overrule the model’s predictions?

Deploy and maintain the model: Finally, the model is put into production. You still need to ensure that the model runs smoothly, which in many cases means refining the requirements based on customer feedback or, in some cases, fixing bugs.

- How is the model to be handed off to "production"?
- How often, and under which circumstances, should the model be revised?

### Framework Two - BADIR (Jain and Sharma, Behind Every Good Decision, chapter 4):

Business question

- What is the stated business question?
- What is the intent underlying the question (e.g., what is the context, what is the impacted segment, and what are stakeholders’ current thoughts about the underlying reasons)?
- What business considerations (e.g., stakeholders, timeline, and cost) are likely to impact the analysis?

Analysis plan

- What is the analysis goal?
- What hypotheses are to be tested?
- What data is required/available to test the hypotheses?
- What methodology(-ies) will you employ?
- What is the project plan (timeline and milestones, risks, phasing, prioritization, …)?

Data collection

- From where can the data be obtained?
- How must the data be cleansed and validated?

Insights

- What patterns do you see in the data?
- Is each of the hypotheses proven or disproven?
- How much confidence should stakeholders place in the results?
- How do you rank your findings in terms of quantified impact on the business?

Recommendation

- How can you most effectively present the results of your analysis to your stakeholders (in terms they can understand and in alignment with information they’ll value)?

Note: A generic template for a recommendation presentation or report might include:

- Objective
- Background (optional)
- Scope (optional)
- Approach (optional)
- Recommendations
- Key insights with impact
- Next steps

# Import packages

In [93]:
# DS basics
import numpy as np
import pandas as pd
import scipy
import scipy.stats as stats
from math import sqrt
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

# Database access
from sqlalchemy import create_engine
import pymysql

# Estimators
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn import linear_model

# Model metrics
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, classification_report

# Cross validation
# sklearn.cross_validation was deprecated and later removed;
# train_test_split now lives in sklearn.model_selection
# https://stackoverflow.com/questions/54726125/no-module-named-sklearn-cross-validation
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Decision tree visualization
from io import StringIO
import pydotplus
from IPython.display import Image

# Import data

In [288]:
# Connect to data source 
db_connection_str = 'mysql+pymysql://deepanalytics:Sqltask1234!@34.73.222.197/deepanalytics'

# Perform select statement
db_connection = create_engine(db_connection_str)
df = pd.read_sql('SELECT * FROM credit', con=db_connection)

# Promote the first data row to the header
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header

# Replace spaces in column names with underscores and shorten the target name
df.columns = df.columns.str.replace(' ', '_')
df = df.rename({'default_payment_next_month': 'default'}, axis=1)

#Reference SEX 0=female, 1=male
#Convert SEX from Nominal to Numerical
def SEX_to_numeric(x):
    if x=='female':
        return 0
    if x=='male':
        return 1
    
#Reference (X3) Education  (1 = graduate school; 2 = university; 3 = high school; 0, 4, 5, 6 = others)
#Convert X3 from Nominal to Numerical
def EDUCATION_to_numeric(x):
    if x=='high school':
        return 3
    if x=='university':
        return 2
    if x=='graduate school':
        return 1
    # Membership test; any label not handled in this function returns None (stored as NaN)
    if x in ('nan', '4', '5', '6'):
        return 0
    
        
#Reference 0=no default, 1=default
#Convert default from Nominal to Numerical
def default_to_numeric(x):
    if x=='default':
        return 1
    if x=='not default':
        return 0

# Work on a copy so the conversions are not applied through an alias of df
transformedDf = df.copy()
transformedDf['SEX'] = transformedDf['SEX'].apply(SEX_to_numeric)
transformedDf['EDUCATION'] = transformedDf['EDUCATION'].apply(EDUCATION_to_numeric)
transformedDf['default'] = transformedDf['default'].apply(default_to_numeric)
df = transformedDf
transformedDf.head(3)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
1,1,20000,0.0,2.0,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1.0
2,2,120000,0.0,2.0,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1.0
3,3,90000,0.0,2.0,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0.0
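
The same label-to-code conversion can also be written with dictionary lookups instead of one function per column. This is a minimal sketch on a hypothetical three-row frame; the `'other'` label and the `*_map` names are assumptions made for illustration, not values confirmed in the Credit One table.

```python
import pandas as pd

# Hypothetical raw labels mirroring the strings handled by the functions above
raw = pd.DataFrame({
    "SEX": ["female", "male", "female"],
    "EDUCATION": ["university", "graduate school", "other"],  # 'other' is an assumed label
    "default": ["default", "not default", "not default"],
})

sex_map = {"female": 0, "male": 1}
education_map = {"graduate school": 1, "university": 2, "high school": 3}
default_map = {"not default": 0, "default": 1}

encoded = raw.copy()
encoded["SEX"] = raw["SEX"].map(sex_map)
# Labels missing from the dictionary become NaN; fillna(0) folds them into the "others" code
encoded["EDUCATION"] = raw["EDUCATION"].map(education_map).fillna(0).astype(int)
encoded["default"] = raw["default"].map(default_map)
print(encoded)
```

With `map`, unmatched labels show up as NaN, which makes it easy to decide explicitly whether to recode or drop them.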


In [287]:
!pwd
!ls

/Users/mikebauler/Desktop/Data.Analytics/C2
C2T1
C2T1 DataSourceUpdated5.18.pdf
C2T1 The-Five-Myths-of-Predictive-Analytics.pdf
C2T1.key
C2T1.zip
Credit One Regression WORKING.ipynb
Credit One Regression.ipynb
DataSourceUpdated5.18.pdf


## Evaluate data

In [316]:
#Types
df.dtypes

0
ID            object
LIMIT_BAL     object
SEX          float64
EDUCATION    float64
MARRIAGE      object
AGE           object
PAY_0         object
PAY_2         object
PAY_3         object
PAY_4         object
PAY_5         object
PAY_6         object
BILL_AMT1     object
BILL_AMT2     object
BILL_AMT3     object
BILL_AMT4     object
BILL_AMT5     object
BILL_AMT6     object
PAY_AMT1      object
PAY_AMT2      object
PAY_AMT3      object
PAY_AMT4      object
PAY_AMT5      object
PAY_AMT6      object
default      float64
dtype: object
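
Most columns above are stored as `object`, which usually means the numbers arrived as strings. A minimal sketch of one way to convert them, assuming a toy frame with two of the column names and a stray `'?'` value (the toy values are illustrative only):

```python
import pandas as pd

# Toy frame whose numeric values arrive as strings
toy = pd.DataFrame({
    "LIMIT_BAL": ["20000", "120000", "?"],
    "AGE": ["24", "26", "34"],
})
print(toy.dtypes)

converted = toy.copy()
for col in ["LIMIT_BAL", "AGE"]:
    # errors="coerce" turns values that cannot be parsed into NaN instead of raising
    converted[col] = pd.to_numeric(converted[col], errors="coerce")
print(converted.dtypes)
```

Checking `isna()` after the conversion shows how many values could not be parsed.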

In [315]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30002 entries, 1 to 30203
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         30002 non-null  object 
 1   LIMIT_BAL  30002 non-null  object 
 2   SEX        30000 non-null  float64
 3   EDUCATION  29532 non-null  float64
 4   MARRIAGE   30002 non-null  object 
 5   AGE        30002 non-null  object 
 6   PAY_0      30002 non-null  object 
 7   PAY_2      30002 non-null  object 
 8   PAY_3      30002 non-null  object 
 9   PAY_4      30002 non-null  object 
 10  PAY_5      30002 non-null  object 
 11  PAY_6      30002 non-null  object 
 12  BILL_AMT1  30002 non-null  object 
 13  BILL_AMT2  30002 non-null  object 
 14  BILL_AMT3  30002 non-null  object 
 15  BILL_AMT4  30002 non-null  object 
 16  BILL_AMT5  30002 non-null  object 
 17  BILL_AMT6  30002 non-null  object 
 18  PAY_AMT1   30002 non-null  object 
 19  PAY_AMT2   30002 non-null  object 
 20  PAY_AM

In [326]:
df.describe()

Unnamed: 0,SEX,EDUCATION,default
count,30000.0,29532.0,30000.0
mean,0.396267,1.808073,0.2212
std,0.489129,0.698643,0.415062
min,0.0,1.0,0.0
25%,0.0,1.0,0.0
50%,0.0,2.0,0.0
75%,1.0,2.0,0.0
max,1.0,3.0,1.0


In [272]:
df.shape

(30203, 25)

#### Reference

{NOTE: The following is updated information from the source’s author}
This research employed a binary variable, default payment (Yes = 1, No = 0), as the
response variable. This study reviewed the literature and used the following 23 variables
as explanatory variables:
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer
credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 0, 4, 5, 6 = others).
- X4: Marital status (1 = married; 2 = single; 3 = divorce; 0=others).
- X5: Age (year).
- X6 - X11: History of past payment. We tracked the past monthly payment records (from
April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7
= the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005.
The measurement scale for the repayment status is:
-2: No consumption; -1: Paid in full; 0: The use of revolving credit; 1 = payment delay
for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight
months; 9 = payment delay for nine months and above.
- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in
September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of
bill statement in April, 2005.
- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September,
2005; X19 = amount paid in August, 2005; . . .; X23 = amount paid in April, 2005.
Y: client's behavior:

* **Y=0 no default, Y=1 default**

## Preprocess

In [289]:
#DataCamp
# Replace the '?'s with NaN
df = df.replace('?', np.nan)

### Duplicates

In [290]:
df.duplicated().any()

True

In [317]:
#Delete / Drop Duplicates 
df = df.drop_duplicates()
print(df.duplicated().any())
df.shape

False


(30002, 25)

In [318]:
print(df[df.duplicated()].shape)
df[df.duplicated()]

(0, 25)


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
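
Before dropping duplicates it can help to count them, so the change in row count is expected rather than surprising. A small sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 2, 2, 3, 3, 3],
    "LIMIT_BAL": [20000, 120000, 120000, 90000, 90000, 90000],
})

n_before = len(toy)
n_dupes = int(toy.duplicated().sum())  # rows that repeat an earlier row exactly
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_before, n_dupes, len(deduped))
```

The row count after `drop_duplicates()` should equal `n_before - n_dupes`.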


In [319]:
#Unique values in Column
for val in df['EDUCATION'].unique():
    print(val)

2.0
1.0
3.0
nan


### Null values

In [285]:
#How many null values in a column
count = df["EDUCATION"].isna().sum()
print(count)

472


In [260]:
df.isnull().any()
df.isnull().sum()

0
ID             0
LIMIT_BAL      0
SEX            2
EDUCATION    470
MARRIAGE       0
AGE            0
PAY_0          0
PAY_2          0
PAY_3          0
PAY_4          0
PAY_5          0
PAY_6          0
BILL_AMT1      0
BILL_AMT2      0
BILL_AMT3      0
BILL_AMT4      0
BILL_AMT5      0
BILL_AMT6      0
PAY_AMT1       0
PAY_AMT2       0
PAY_AMT3       0
PAY_AMT4       0
PAY_AMT5       0
PAY_AMT6       0
default        2
dtype: int64

In [299]:
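#Remove nulls
# dropna() returns a new DataFrame; the result is not assigned here,
# so df itself is unchanged and still contains the null rows
df.dropna(how='any',axis=0) 
df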

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
1,1,20000,0.0,2.0,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1.0
2,2,120000,0.0,2.0,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1.0
3,3,90000,0.0,2.0,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0.0
4,4,50000,0.0,2.0,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0.0
5,5,50000,1.0,2.0,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30199,29996,220000,1.0,3.0,1,39,0,0,0,0,...,88004,31237,15980,8500,20000,5003,3047,5000,1000,0.0
30200,29997,150000,1.0,3.0,2,43,-1,-1,-1,-1,...,8979,5190,0,1837,3526,8998,129,0,0,0.0
30201,29998,30000,1.0,2.0,2,37,4,3,2,-1,...,20878,20582,19357,0,0,22000,4200,2000,3100,1.0
30202,29999,80000,1.0,3.0,1,41,1,-1,0,0,...,52774,11855,48944,85900,3409,1178,1926,52964,1804,1.0


In [300]:
df.isnull().any()

0
ID           False
LIMIT_BAL    False
SEX           True
EDUCATION     True
MARRIAGE     False
AGE          False
PAY_0        False
PAY_2        False
PAY_3        False
PAY_4        False
PAY_5        False
PAY_6        False
BILL_AMT1    False
BILL_AMT2    False
BILL_AMT3    False
BILL_AMT4    False
BILL_AMT5    False
BILL_AMT6    False
PAY_AMT1     False
PAY_AMT2     False
PAY_AMT3     False
PAY_AMT4     False
PAY_AMT5     False
PAY_AMT6     False
default       True
dtype: bool

In [301]:
df.isnull().sum()

0
ID             0
LIMIT_BAL      0
SEX            2
EDUCATION    470
MARRIAGE       0
AGE            0
PAY_0          0
PAY_2          0
PAY_3          0
PAY_4          0
PAY_5          0
PAY_6          0
BILL_AMT1      0
BILL_AMT2      0
BILL_AMT3      0
BILL_AMT4      0
BILL_AMT5      0
BILL_AMT6      0
PAY_AMT1       0
PAY_AMT2       0
PAY_AMT3       0
PAY_AMT4       0
PAY_AMT5       0
PAY_AMT6       0
default        2
dtype: int64
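
The null counts above are the same before and after the `dropna` cell. A likely reason is that `dropna` returns a new frame rather than modifying `df` in place. The sketch below shows the difference on a toy frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SEX": [0.0, np.nan, 1.0],
    "EDUCATION": [2.0, 1.0, np.nan],
    "default": [1.0, 0.0, 0.0],
})

toy.dropna(how="any", axis=0)            # result discarded: toy is unchanged
unchanged_nulls = int(toy.isnull().sum().sum())

cleaned = toy.dropna(how="any", axis=0)  # result kept in a new variable
remaining_nulls = int(cleaned.isnull().sum().sum())
print(unchanged_nulls, remaining_nulls, len(cleaned))
```

Assigning the result back (for example `df = df.dropna(...)`) is the usual way to make the removal stick.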

#### Correlation References

In [66]:
corr_mat = df.corr()
print(corr_mat)


Empty DataFrame
Columns: []
Index: []
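
An empty correlation matrix is what older pandas versions return when no column has a numeric dtype, which would be consistent with the `object` columns seen in `df.dtypes`. A hedged sketch, on toy string data, of casting first and then ranking features by absolute correlation with the target:

```python
import pandas as pd

# Toy values stored as strings (assumption: stands in for the object-typed columns)
toy = pd.DataFrame({
    "LIMIT_BAL": ["20000", "120000", "90000", "50000", "30000"],
    "PAY_0": ["2", "-1", "0", "0", "1"],
    "default": ["1", "0", "0", "0", "1"],
})

numeric = toy.astype(float)   # cast every column to a numeric dtype
corr_mat = numeric.corr()

# Absolute correlation of each feature with the target, strongest first
corr_with_default = (
    corr_mat["default"].drop("default").abs().sort_values(ascending=False)
)
print(corr_mat)
print(corr_with_default)
```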


#### Correlation

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(corr_mat, vmax=1.0, center=0, fmt='.2f',
square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .70})
plt.show();

In [None]:
#Top Correlations
print("Correlation Matrix")
print(df2.corr())
print()

def get_redundant_pairs(df2):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df2.columns
    for i in range(0, df2.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df2, n=5):
    au_corr = df2.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df2)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df2, 9))


In [None]:
df.shape

In [None]:
for val in df['EDUCATION'].unique():
    print(val)

In [None]:
n = 20
df['AGE'].value_counts()[:n].index.tolist()

In [None]:
n = 10
df['EDUCATION'].value_counts()[:n].index.tolist()

### Discretize

#### Packages (from article in resources)
* from sklearn.preprocessing import KBinsDiscretizer
* from feature_engine.discretisers import EqualWidthDiscretiser
* discretizer = EqualWidthDiscretiser(bins=3, variables=['amount'])



In [None]:
# Discretize amount - eg., 0-1000, 1001-2000, 2001+
amtBin = [your code goes here]


In [None]:
# Discretize age - eg., 18-33, 34-49, 50-64, 65+
ageBin = [your code goes here]

In [None]:
# add amtBin and ageBin to the dataset
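
One way to build bins like the ones described in the comments above is `pd.cut` with explicit edges. This is a sketch on a toy frame, not the assignment's solution; the `amount` and `age` values and the edge and label lists are assumptions chosen to match the example ranges.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "amount": [250, 1000, 1500, 2001, 5000],
    "age": [18, 33, 34, 64, 70],
})

# Intervals are right-inclusive: (0, 1000], (1000, 2000], (2000, inf)
amt_edges = [0, 1000, 2000, np.inf]
amt_labels = ["0-1000", "1001-2000", "2001+"]
toy["amtBin"] = pd.cut(toy["amount"], bins=amt_edges, labels=amt_labels,
                       include_lowest=True)

# include_lowest=True keeps the left-most edge (age 18) inside the first bin
age_edges = [18, 33, 49, 64, np.inf]
age_labels = ["18-33", "34-49", "50-64", "65+"]
toy["ageBin"] = pd.cut(toy["age"], bins=age_edges, labels=age_labels,
                       include_lowest=True)
print(toy)
```

`KBinsDiscretizer` is an alternative when the edges should be learned from the data rather than fixed by hand.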

# Analyze Data
### Statistical Analysis
* All statistical analyses in this section

### Visualizations
* All visualizations in this section

In [48]:
header = df.dtypes.index
print(header)

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default'],
      dtype='object', name=0)


# Feature Selection
For this task, you will not be selecting features. Instead, focus on answering the following questions:
* 2) Is there a relationship between the number of items purchased and amount spent?
* 4a) Is there any correlation between age of a customer and if the transaction was made online or in the store?

### Correlation

In [None]:
corr_mat = data.corr()
print(corr_mat)

# Train/Test Sets
* The modeling (predictive analytics) process begins with splitting data into train and test sets.
* Focus on building models to answer the following questions:
* 3b) Can we predict the age of a customer in a region based on other demographic data? (Decision tree.)
* 4a) Is there any correlation between age of a customer and if the transaction was made online or in the store? (In addition to correlation analysis, a decision tree can also provide insight.)
* 4b) Do any other factors predict if a customer will buy online or in our stores? (Decision tree.)


### Set random seed

In [None]:
seed = 123

### Split datasets into X (IVs) and y (DV)

In [None]:
data.columns

In [None]:
## region as dv
Y_oobReg = data['region']
X_oobReg = data[['in-store','age','items','amount']]

In [None]:
## in-store as dv 
Y_oobIns = 
X_oobIns = 

In [None]:
## ageBin as dv
Y_oobAgeB = 
X_oobAgeB = 

In [None]:
## amtBin as dv
Y_oobAmtB = 
X_oobAmtB = 

### Create train and test sets

In [None]:
## region as dv

X_trainReg, X_testReg, Y_trainReg, Y_testReg = train_test_split(X_oobReg, 
                                            Y_oobReg, 
                                            test_size = .30, 
                                            random_state = seed)

print(X_trainReg.shape, X_testReg.shape)
print(Y_trainReg.shape, Y_testReg.shape)
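
For a target as imbalanced as `default` (about 22% positives in the `describe()` output), a stratified split keeps the class ratio the same in both sets. The sketch below uses synthetic data with two of the Credit One column names; the values and the 20% positive rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 123
rng = np.random.default_rng(seed)

# Synthetic stand-in: 200 rows, 20% positives
toy = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10000, 500000, size=200),
    "AGE": rng.integers(21, 70, size=200),
    "default": np.repeat([0, 1], [160, 40]),
})
X = toy[["LIMIT_BAL", "AGE"]]
y = toy["default"]

# stratify=y preserves the share of defaults in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=seed, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```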

In [None]:
## in-store as dv

X_trainIns, X_testIns, Y_trainIns, Y_testIns = 


In [None]:
## ageBin as dv

X_trainAgeB, X_testAgeB, Y_trainAgeB, Y_testAgeB = 


In [None]:
## amtBin as dv

X_trainAmtB, X_testAmtB, Y_trainAmtB, Y_testAmtB = 


# Modeling
#### Two purposes of modeling:
* 1) Evaluate patterns in data
* 2) Make predictions
  

## Evaluate patterns in data using a Decision Tree (DT)

### dv = region
* 3b) Can we predict the age of a customer in a region based on other demographic data? (Evaluate DT output.)

In [None]:
#--- DT ---#

# dv = region

# select DT model for classification
dt = DecisionTreeClassifier(max_depth=3)

# train/fit 
dtModelReg = dt.fit(X_trainReg, Y_trainReg)

# make predictions
dtPredReg = dtModelReg.predict(X_testReg)

# performance metrics
print(accuracy_score(Y_testReg, dtPredReg))
print(classification_report(Y_testReg, dtPredReg))

In [None]:
X_trainReg.columns

In [None]:
#--- Visualize DT ---#

# this is just a list specifying the region classes
# region_values = ['0','1','2','3'] 
region_values = ['North','South','East','West'] 

dot_data = StringIO()

export_graphviz(dtModelReg,
                out_file=dot_data, 
                filled=True, 
                rounded=True,
                feature_names=X_trainReg.columns, 
                class_names=region_values,
                label='all',
                precision=1,
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())

* Evaluation question: From the above DT, is 'items' in the tree? What does it mean if it is, or is not, in the tree?
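
Whether a feature appears in a fitted tree can also be checked without Graphviz, using `export_text` and `feature_importances_`. The sketch below uses synthetic data in which the label depends on only one of two features; the `signal` and `noise` column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(123)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = (X["signal"] > 0).astype(int)  # the label depends on 'signal' only

tree = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X, y)

# Plain-text rendering of the fitted tree
print(export_text(tree, feature_names=list(X.columns)))

# A feature with zero importance was never used for a split
importances = pd.Series(tree.feature_importances_, index=X.columns)
used_features = importances[importances > 0].index.tolist()
print(used_features)
```

A feature that is absent from the tree did not improve any split at the chosen depth, which suggests it adds little beyond the features that were used.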

## Make Predictions
### dv = region

In [None]:
#--- Cross validation; identify top model ---#

# create empty list and then populate it with models to run
models = []
models.append(('DT', DecisionTreeClassifier(max_depth=3)))
models.append(('RF', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))


# create empty lists to hold results and model names
results = []
names = []

for name, model in models:
    kfold = KFold(n_splits=3, random_state=seed, shuffle=True)
    result = cross_val_score(model,
                             X_trainReg,
                             Y_trainReg,
                             cv=kfold,
                             scoring='accuracy')
    names.append(name)
    results.append(result)
    #msg = '%s: %.4f (%.4f)' % (name, result.mean(), result.std())
    #print(msg)

# print results
for i in range(len(names)):
    print(names[i],results[i].mean())


In [None]:
#--- Fit top model from CV ---#

# select top model for classification
gb = GradientBoostingClassifier()

# train/fit 
gbModelReg = gb.fit(X_trainReg, Y_trainReg)

# make predictions
gbPredReg = gbModelReg.predict(X_testReg)

# performance metrics
print(accuracy_score(Y_testReg, gbPredReg))
print(classification_report(Y_testReg, gbPredReg))

## DT
### dv = in-store
* 4a) Is there any correlation between age of a customer and if the transaction was made online or in the store? (Evaluate correlation matrix and DT output.)
* 4b) Do any other factors predict if a customer will buy online or in our stores? 

In [None]:
#--- DT ---#

# dv = in-store

# select DT model for classification
dt = 

# train/fit 
dtModelIns = 

# make predictions
dtPredIns = 

# performance metrics



In [None]:
X_trainIns.columns

In [None]:
#--- Visualize DT ---#

 

## Predictions
### dv = In-store

In [None]:
#--- Cross validation ---#

# create empty list and then populate it with models to run
 


# create empty lists to hold results and model names
 


# print results



In [None]:
#--- Fit model ---#

# select top model for classification
 

# train/fit 
gbModelIns =  

# make predictions
gbPredIns =  

# performance metrics
 
    

## DT
### dv = ageBin
* Analysis question: Discretize Age and use it as the dependent variable.
* Optional: Experiment with different numbers of bins.


In [None]:
#--- DT ---#

# dv = ageBin

# select top model for classification
 

# train/fit 
dtModelAgeB =  

# make predictions
dtPredAgeB =  

# performance metrics
 
    

In [None]:
Y_trainAgeB.value_counts()

In [None]:
#--- Visualize DT ---#





## Predictions
### dv = ageBin

In [None]:
#--- Cross validation ---#

# create empty list and then populate it with models to run
 


# create empty lists to hold results and model names
 


# print results
 


In [None]:
#--- Fit model ---#

# select top model for classification 


# train/fit 
gbModelAgeB =  

# make predictions
gbPredAgeB =  

# performance metrics
 


## DT
### dv = amtBin
* Analysis question: Discretize Amount and use it as the dependent variable. Can a useful model be constructed?


In [None]:
#--- DT ---#

# dv = amtBin

# select DT model for classification
 

# train/fit 
dtModelAmtB =  

# make predictions
dtPredAmtB =  

# performance metrics



In [None]:
Y_trainAmtB.value_counts()

In [None]:
#--- Visualize DT ---#

 

## Predictions
### dv = amtBin

In [None]:
#--- Cross validation ---# 

# create empty list and then populate it with models to run
 


# create empty lists to hold results and model names
 


# print results
 

In [None]:
#--- Fit model ---#

# select DT model for classification
 

# train/fit 
gbModelAmtB =  

# make predictions
gbPredAmtB =  

# performance metrics
 
   
