# Lending Tree

![](banner_lending_tree.jpg)

Exploratory Data Analysis (EDA), Kernel Density Estimation (KDE), Principal Component Analysis (PCA), Classification

In [1]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)                                                     
update_geom_defaults("point", list(size=0.01, colour=PALETTE[1]))
update_geom_defaults("vline", list(color="black", size=0.15))

## Situation

LendingTree is an online lending exchange that connects consumers with multiple lenders, banks, and credit partners who compete for business.  Since its founding in 1998, LendingTree has facilitated more than 32 million loan requests.

* **Role:** Banker. 
* **Business Decision:** Take on a new portfolio of loans?
* **Approach:** Use kernel density estimation and principal component analysis to look for features that distinguish known good vs. bad loans, and use that insight to inform decisions about taking on a new portfolio of loans.  
* **Dataset:** Lending Tree Loans 2007-2010

## Decision Model

### Influence Diagram

<img src="business-model_lending_tree.jpg" align=left width=600 />

## Data 

In [None]:
datax = read.csv("Lending Tree Loans.csv") # may take about 2 minutes
size(datax)
fmt(datax[1:3,], "First few observations ...", position="left")

## Prepare Data

### Focus Analysis on Data Subset

Note which loans are inactive.  Note which inactive loans were not paid back.

In [None]:
fmt(unique(datax$loan_status), "all labels")

In [None]:
good_labels = c("Fully Paid", "Does not meet the credit policy. Status:Fully Paid")
bad_labels  = c("Default", "Charged Off", "Does not meet the credit policy. Status:Charged Off")

layout(fmt(good_labels), fmt(bad_labels))

In [None]:
inactive.good = which(datax$loan_status %in% good_labels)
inactive.bad = which(datax$loan_status %in% bad_labels)
inactive = c(inactive.good, inactive.bad)
fmt(data.frame(inactive.good=length(inactive.good), inactive.bad=length(inactive.bad), inactive=length(inactive)), "observation count")

### Focus Analysis on Convenient Variables

Which variables are not IDs?

In [None]:
m = which(!(colnames(datax) %in% c("id","member_id")))
m

Which variables are numeric?

In [None]:
n = which(var_info.type(datax[inactive,]) %in% c("integer","numeric"))                  
n

Which variables are complete (no missing data)?

In [None]:
f = which(var_info.na_count(datax[inactive,], labels=FALSE) == 0)
f

Which variables have at least some variation (distribution of more than one value)?

In [None]:
v = which(var_info.unique(datax[inactive,], labels=FALSE) > 1)
v

These are the convenient variables:

In [None]:
convenient_variables = intersect(intersect(m, n), intersect(f, v))
convenient_variables
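
The `var_info.*` helpers come from `setup.R`, which is not shown here. As a rough sketch of the same four filters in base R (the data frame `toy` and its values are made up for illustration):

```r
# Toy data frame standing in for datax[inactive,] (hypothetical values)
toy = data.frame(id        = 1:4,
                 loan_amnt = c(5000, 2400, 10000, 7000),
                 int_rate  = c(10.65, NA, 13.49, 12.69),
                 term      = c("36 months", "60 months", "36 months", "36 months"),
                 policy    = c(1, 1, 1, 1))

m = which(!(colnames(toy) %in% c("id", "member_id")))        # not an ID
n = which(sapply(toy, is.numeric))                           # integer or numeric
f = which(colSums(is.na(toy)) == 0)                          # no missing values
v = which(sapply(toy, function(x) length(unique(x)) > 1))    # more than one distinct value

convenient = Reduce(intersect, list(m, n, f, v))
names(toy)[convenient]                                       # "loan_amnt"
```

Only `loan_amnt` passes all four filters here: `id` is an ID, `int_rate` has a missing value, `term` is not numeric, and `policy` never varies.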

### Prepared Data

In [None]:
data  = datax[inactive, convenient_variables]
class = c(rep("good", length(inactive.good)), rep("bad", length(inactive.bad)))

size(data)
fmt(head(data[class=="good",]), "First few good loan observations ...", position="left")
fmt(head(data[class=="bad",]),  "First few bad loan observations ...",  position="left")

## Explore Data

### Scree Plot of Variables

In [None]:
variable = names(data)
sdev = var_info.sd(data, labels=FALSE)
variance = var_info.var(data, labels=FALSE)
cum_variance = cumsum(variance)
relative_variance = variance / sum(variance)
cum_relative_variance = cumsum(relative_variance)

scree = data.frame(variable, sdev, variance, cum_variance, relative_variance, cum_relative_variance)
scree

In [None]:
ggplot(scree) + ylim(0,1) +
geom_col(aes(x=factor(variable, levels=variable), y=relative_variance, fill=variable)) +
theme.no_x_axis_title + theme.x_axis_45 + theme.no_legend
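
The scree table reports each variable's share of total variance. As a small sketch with made-up values (base R `var` is used here in place of the `var_info.*` helpers from `setup.R`), a variable measured in large units can take nearly the whole share, which is presumably why the later PCA step scales the variables first:

```r
# Hypothetical values, for illustration only
toy = data.frame(loan_amnt=c(5000, 2400, 10000, 7000, 12000),
                 int_rate =c(10.65, 15.27, 13.49, 12.69, 7.90))

variance          = sapply(toy, var)
relative_variance = variance / sum(variance)       # shares sum to 1
round(relative_variance, 6)
```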

### 1D Scatter Plots of 2 Variables

In [None]:
output_size(8,1)
p1 = ggplot(data) + geom_jitter(aes(x=loan_amnt, y=0)) + theme.x_axis_only
p2 = ggplot(data) + geom_jitter(aes(x=int_rate, y=0))  + theme.x_axis_only
grid.arrange(p1, p2, nrow=1)
output_size(restore)

### Probability Density Functions of 2 Variables

Use kernel density estimation to estimate probability densities.

In [None]:
p1 = ggplot(data) + ggtitle("PDF for loan_amnt") +
     geom_density(aes(x=loan_amnt), kernel="gaussian", bw=1000, fill=PALETTE[1]) +
     theme.no_axis_titles

p2 = ggplot(data) + ggtitle("PDF for int_rate") +
     geom_density(aes(x=int_rate), kernel="gaussian", bw=0.5, fill=PALETTE[1]) +
     theme.no_axis_titles

grid.arrange(p1, p2, nrow=1)
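
A Gaussian kernel density estimate is the average of normal curves, one centered on each observation, each with standard deviation equal to the bandwidth `bw`. The sketch below uses simulated data (not the loan data) to compute that average by hand and compare it with base R `density()`:

```r
set.seed(1)
x  = c(rnorm(200, mean=10, sd=2), rnorm(100, mean=18, sd=1.5))   # toy stand-in for int_rate
bw = 0.5

# Gaussian KDE by hand: average of normal bumps centered on each observation
kde = function(at) sapply(at, function(a) mean(dnorm(a, mean=x, sd=bw)))

d = density(x, kernel="gaussian", bw=bw, from=0, to=30, n=512)
max(abs(kde(d$x) - d$y))   # small; density() uses a binned approximation
```

A larger `bw` widens each bump and smooths the estimate; a smaller `bw` follows individual observations more closely.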

### 2D Scatter Plot of 2 Variables

In [None]:
ggplot(data) + geom_point(aes(x=loan_amnt, y=int_rate))

## Try to Distinguish Observations by Loan Class

### Probabilities of Variable Values in Certain Ranges

In [None]:
d.loan_amnt = density(data$loan_amnt, kernel="gaussian", bw=1000, from=0, to=40000)
d.int_rate  = density(data$int_rate,  kernel="gaussian", bw=0.5,  from=0, to=30)

pdf.loan_amnt = approxfun(d.loan_amnt)
pdf.int_rate = approxfun(d.int_rate)

data.frame(variable=c("loan_amnt","int_rate"),
           probability=c(integrate(pdf.loan_amnt, 0, 18000)$value, integrate(pdf.int_rate, 10, 20)$value),
           range_low=c(0, 10),
           range_high=c(18000, 20))

p1 = ggplot(data) + xlim(0, 40000) + ggtitle("PDF for loan_amnt") +
     stat_function(fun=pdf.loan_amnt, geom="area", n=2000, fill=PALETTE[1]) +
     stat_function(fun=pdf.loan_amnt, geom="area", n=2000, xlim=c(0,18000), fill=PALETTE[5]) +
     geom_vline(xintercept=0) + geom_vline(xintercept=18000) +
     theme.no_axis_titles

p2 = ggplot(data) + xlim(0, 30) + ggtitle("PDF for int_rate") +
     stat_function(fun=pdf.int_rate, geom="area", n=2000, fill=PALETTE[1]) +
     stat_function(fun=pdf.int_rate, geom="area", n=2000, xlim=c(10,20), fill=PALETTE[5]) +
     geom_vline(xintercept=10) + geom_vline(xintercept=20) +
     theme.no_axis_titles

grid.arrange(p1, p2, nrow=1)
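
The probability of a value falling in a range is the area under the estimated density over that range. A minimal sketch on simulated data (not the loan data), checking the integral against the empirical share of observations in the range:

```r
set.seed(2)
x = rnorm(5000, mean=12, sd=3)                     # toy stand-in for int_rate

d   = density(x, kernel="gaussian", bw=0.5, from=0, to=30)
pdf = approxfun(d)                                 # linear interpolation of the density grid

p.range = integrate(pdf, 10, 20)$value             # estimated P(10 <= x <= 20)
p.total = integrate(pdf, 0, 30)$value              # should be close to 1
c(p.range=p.range, empirical=mean(x >= 10 & x <= 20), p.total=p.total)
```

Note that `approxfun` returns `NA` outside the `from` and `to` limits given to `density()`, so integration limits should stay within them.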

### Probabilities of Variable Values in Certain Ranges: Good Loans

In [None]:
d.loan_amnt = density(data$loan_amnt[class=="good"], kernel="gaussian", bw=1000, from=0, to=40000)
d.int_rate  = density(data$int_rate[class=="good"],  kernel="gaussian", bw=0.5,  from=0, to=30)

pdf.loan_amnt = approxfun(d.loan_amnt)
pdf.int_rate = approxfun(d.int_rate)

data.frame(variable=c("loan_amnt","int_rate"),
           probability=c(integrate(pdf.loan_amnt, 0, 18000)$value, integrate(pdf.int_rate, 10, 20)$value),
           range_low=c(0, 10),
           range_high=c(18000, 20))

p1 = ggplot(data[class=="good",]) + xlim(0, 40000) + ggtitle("PDF for loan_amnt") +
     stat_function(fun=pdf.loan_amnt, geom="area", n=2000, fill=PALETTE[3]) +
     stat_function(fun=pdf.loan_amnt, geom="area", n=2000, xlim=c(0,18000), fill=PALETTE[5]) +
     geom_vline(xintercept=0) + geom_vline(xintercept=18000) +
     theme.no_axis_titles

p2 = ggplot(data[class=="good",]) + xlim(0, 30) + ggtitle("PDF for int_rate") +
     stat_function(fun=pdf.int_rate, geom="area", n=2000, fill=PALETTE[3]) +
     stat_function(fun=pdf.int_rate, geom="area", n=2000, xlim=c(10,20), fill=PALETTE[5]) +
     geom_vline(xintercept=10) + geom_vline(xintercept=20) +
     theme.no_axis_titles

grid.arrange(p1, p2, nrow=1)

### Probabilities of Variable Values in Certain Ranges: Bad Loans

In [None]:
d.loan_amnt = density(data$loan_amnt[class=="bad"], kernel="gaussian", bw=1000, from=0, to=40000)
d.int_rate  = density(data$int_rate[class=="bad"],  kernel="gaussian", bw=0.5,  from=0, to=30)

pdf.loan_amnt = approxfun(d.loan_amnt)
pdf.int_rate = approxfun(d.int_rate)

data.frame(variable=c("loan_amnt","int_rate"),
           probability=c(integrate(pdf.loan_amnt, 0, 18000)$value, integrate(pdf.int_rate, 10, 20)$value),
           range_low=c(0, 10),
           range_high=c(18000, 20))

p1 = ggplot(data[class=="bad",]) + xlim(0, 40000) + ggtitle("PDF for loan_amnt") +
     stat_function(fun=pdf.loan_amnt, geom="area", n=2000, fill=PALETTE[2]) +
     stat_function(fun=pdf.loan_amnt, geom="area", n=2000, xlim=c(0,18000), fill=PALETTE[5]) +
     geom_vline(xintercept=0) + geom_vline(xintercept=18000) +
     theme.no_axis_titles

p2 = ggplot(data[class=="bad",]) + xlim(0, 30) + ggtitle("PDF for int_rate") +
     stat_function(fun=pdf.int_rate, geom="area", n=2000, fill=PALETTE[2]) +
     stat_function(fun=pdf.int_rate, geom="area", n=2000, xlim=c(10,20), fill=PALETTE[5]) +
     geom_vline(xintercept=10) + geom_vline(xintercept=20) +
     theme.no_axis_titles

grid.arrange(p1, p2, nrow=1)

### 2D Scatter Plot of 2 Variables + Loan Class

Note that distinguishing characteristics are obscured.

In [None]:
ggplot(data) +
geom_point(aes(x=loan_amnt, y=int_rate, color=class), alpha=0.2) +
scale_color_manual(values=PALETTE[2:3]) + guides.standard + theme.legend_title

## Further Prepare Data: Principal Component Analysis

### Data Represented as Principal Components

In [None]:
pc = prcomp(data, scale=TRUE, retx=TRUE)
data.pc = as.data.frame(pc$x)

size(data.pc)
fmt(data.pc[1:6,], "First few observations ...", position="left")
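
As a sketch of what `prcomp` returns, using simulated data (not the loan data): the values in `x` are the centered, scaled observations projected onto the rotation matrix, and with scaling the component variances sum to the number of variables.

```r
set.seed(3)
toy = data.frame(a=rnorm(100), b=rnorm(100, sd=50))
toy$c = toy$a + toy$b/50 + rnorm(100, sd=0.1)      # correlated with a and b

pc.toy = prcomp(toy, scale.=TRUE, retx=TRUE)

# Scores are the centered, scaled data projected onto the rotation (loadings)
manual = scale(toy, center=pc.toy$center, scale=pc.toy$scale) %*% pc.toy$rotation
all.equal(unname(manual), unname(pc.toy$x))        # TRUE

# With scaling, each variable contributes unit variance, so variances sum to ncol(toy)
sum(pc.toy$sdev^2)
```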

### Qualitative Interpretation of Principal Components

Here, each column lists variable names sorted by weight applied to the principal component.

In [None]:
pc_constituents(pc)
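
`pc_constituents` is defined in `setup.R`, which is not shown here. A hypothetical stand-in, assuming "sorted by weight" means sorted by absolute loading in decreasing order, might look like this (the built-in `USArrests` dataset is used only for illustration):

```r
# Hypothetical stand-in for pc_constituents(); the real helper lives in setup.R
constituents = function(pc) {
  apply(pc$rotation, 2, function(w) rownames(pc$rotation)[order(abs(w), decreasing=TRUE)])
}

pc.toy = prcomp(USArrests, scale.=TRUE)            # built-in dataset, for illustration only
constituents(pc.toy)
```

Each column lists the original variable names, with the variables that weigh most heavily on that principal component at the top.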

### Scree Plot of Principal Components

In [None]:
variable = names(data.pc)
sdev = var_info.sd(data.pc, labels=FALSE)
variance = var_info.var(data.pc, labels=FALSE)
cum_variance = cumsum(variance)
relative_variance = variance / sum(variance)
cum_relative_variance = cumsum(relative_variance)

scree.pc = data.frame(variable, sdev, variance, cum_variance, relative_variance, cum_relative_variance)
scree.pc

In [None]:
ggplot(scree.pc) + ylim(0,1) + xlab("variable") +
geom_col(aes(x=factor(variable, levels=variable), y=relative_variance, fill=variable)) +
theme.no_legend

## Try to Distinguish Transformed Observations by Loan Class

### Probabilities of Principal Component Values in Certain Ranges: Good Loans

In [None]:
d.PC1 = density(data.pc$PC1[class=="good"], kernel="gaussian", bw=0.2, from=-5, to=5)
d.PC2 = density(data.pc$PC2[class=="good"], kernel="gaussian", bw=0.2, from=-5, to=5)

pdf.PC1 = approxfun(d.PC1)
pdf.PC2 = approxfun(d.PC2)

data.frame(variable=c("PC1","PC2"),
           probability=c(integrate(pdf.PC1, -5, 1)$value, integrate(pdf.PC2, -2, 0)$value),
           range_low=c(-5, -2),
           range_high=c(1, 0))


p1 = ggplot(data.pc[class=="good",]) + xlim(-5, 5) + ggtitle("PDF for PC1") +
     stat_function(fun=pdf.PC1, geom="area", n=2000, fill=PALETTE[3]) +
     stat_function(fun=pdf.PC1, geom="area", n=2000, xlim=c(-5,1), fill=PALETTE[5]) +
     geom_vline(xintercept=-5) + geom_vline(xintercept=1) +
     theme.no_axis_titles

p2 = ggplot(data.pc[class=="good",]) + xlim(-5, 5) + ggtitle("PDF for PC2") +
     stat_function(fun=pdf.PC2, geom="area", n=2000, fill=PALETTE[3]) +
     stat_function(fun=pdf.PC2, geom="area", n=2000, xlim=c(-2,0), fill=PALETTE[5]) +
     geom_vline(xintercept=-2) + geom_vline(xintercept=0) +
     theme.no_axis_titles

grid.arrange(p1, p2, nrow=1)

In [None]:
d.PC1 = density(data.pc$PC1[class=="good"], kernel="gaussian", bw=0.2, from=-5, to=5)
d.PC2 = density(data.pc$PC2[class=="good"], kernel="gaussian", bw=0.2, from=-5, to=5)

pdf.PC1 = approxfun(d.PC1)
pdf.PC2 = approxfun(d.PC2)

data.frame(variable=c("PC1","PC2"),
           probability=c(integrate(pdf.PC1, -5, 0)$value, integrate(pdf.PC2, 0, 2)$value),
           range_low=c(-5, 0),
           range_high=c(0, 2))

p1 = ggplot(data.pc[class=="good",]) + xlim(-5, 5) + ggtitle("PDF for PC1") +
     stat_function(fun=pdf.PC1, geom="area", n=2000, fill=PALETTE[3]) +
     stat_function(fun=pdf.PC1, geom="area", n=2000, xlim=c(-5,0), fill=PALETTE[5]) +
     geom_vline(xintercept=-5) + geom_vline(xintercept=0) +
     theme.no_axis_titles

p2 = ggplot(data.pc[class=="good",]) + xlim(-5, 5) + ggtitle("PDF for PC2") +
     stat_function(fun=pdf.PC2, geom="area", n=2000, fill=PALETTE[3]) +
     stat_function(fun=pdf.PC2, geom="area", n=2000, xlim=c(0,2), fill=PALETTE[5]) +
     geom_vline(xintercept=0) + geom_vline(xintercept=2) +
     theme.no_axis_titles

grid.arrange(p1, p2, nrow=1)

### Probabilities of Principal Component Values in Certain Ranges: Bad Loans

In [None]:
d.PC1 = density(data.pc$PC1[class=="bad"], kernel="gaussian", bw=0.2, from=-5, to=5)
d.PC2 = density(data.pc$PC2[class=="bad"], kernel="gaussian", bw=0.2, from=-5, to=5)

pdf.PC1 = approxfun(d.PC1)
pdf.PC2 = approxfun(d.PC2)

data.frame(variable=c("PC1","PC2"),
           probability=c(integrate(pdf.PC1, -5, 1)$value, integrate(pdf.PC2, -2, 0)$value),
           range_low=c(-5, -2),
           range_high=c(1, 0))

p1 = ggplot(data.pc[class=="bad",]) + xlim(-5, 5) + ggtitle("PDF for PC1") +
     stat_function(fun=pdf.PC1, geom="area", n=2000, fill=PALETTE[2]) +
     stat_function(fun=pdf.PC1, geom="area", n=2000, xlim=c(-5,1), fill=PALETTE[5]) +
     geom_vline(xintercept=-5) + geom_vline(xintercept=1) +
     theme.no_axis_titles

p2 = ggplot(data.pc[class=="bad",]) + xlim(-5, 5) + ggtitle("PDF for PC2") +
     stat_function(fun=pdf.PC2, geom="area", n=2000, fill=PALETTE[2]) +
     stat_function(fun=pdf.PC2, geom="area", n=2000, xlim=c(-2,0), fill=PALETTE[5]) +
     geom_vline(xintercept=-2) + geom_vline(xintercept=0) +
     theme.no_axis_titles

grid.arrange(p1, p2, nrow=1)

In [None]:
d.PC1 = density(data.pc$PC1[class=="bad"], kernel="gaussian", bw=0.2, from=-5, to=5)
d.PC2 = density(data.pc$PC2[class=="bad"], kernel="gaussian", bw=0.2, from=-5, to=5)

pdf.PC1 = approxfun(d.PC1)
pdf.PC2 = approxfun(d.PC2)

data.frame(variable=c("PC1","PC2"),
           probability=c(integrate(pdf.PC1, -5, 0)$value, integrate(pdf.PC2, 0, 2)$value),
           range_low=c(-5, 0),
           range_high=c(0, 2))

p1 = ggplot(data.pc[class=="bad",]) + xlim(-5, 5) + ggtitle("PDF for PC1") +
     stat_function(fun=pdf.PC1, geom="area", n=2000, fill=PALETTE[2]) +
     stat_function(fun=pdf.PC1, geom="area", n=2000, xlim=c(-5,0), fill=PALETTE[5]) +
     geom_vline(xintercept=-5) + geom_vline(xintercept=0) +
     theme.no_axis_titles

p2 = ggplot(data.pc[class=="bad",]) + xlim(-5, 5) + ggtitle("PDF for PC2") +
     stat_function(fun=pdf.PC2, geom="area", n=2000, fill=PALETTE[2]) +
     stat_function(fun=pdf.PC2, geom="area", n=2000, xlim=c(0,2), fill=PALETTE[5]) +
     geom_vline(xintercept=0) + geom_vline(xintercept=2) +
     theme.no_axis_titles

grid.arrange(p1, p2, nrow=1)

### 2D Scatter Plot of 2 Principal Components + Loan Class

Note that distinguishing characteristics are revealed.

In [None]:
ggplot(data.pc) + xlim(-5,20) + ylim(-5,50) +
geom_point(aes(x=PC1, y=PC2, color=class), alpha=0.2, na.rm=TRUE) +
scale_color_manual(values=PALETTE[2:3]) + guides.standard + theme.legend_title

## Predictive Model

Here is one of infinitely many predictive models.

### Homogeneous Spaces

#### Convex Hulls Around Observations by Loan Class

In [None]:
convex_hull.good = data.pc[class=="good", c("PC1","PC2")][chull(data.pc$PC1[class=="good"], data.pc$PC2[class=="good"]),]
convex_hull.bad  = data.pc[class=="bad",  c("PC1","PC2")][chull(data.pc$PC1[class=="bad"],  data.pc$PC2[class=="bad"]),]

ggplot(data.pc) + xlim(-5,20) + ylim(-5,50) +
geom_point(aes(x=PC1, y=PC2, color=class), alpha=0.2) +
geom_polygon(aes(x=PC1, y=PC2), data=convex_hull.good, fill=PALETTE[3], color="black", alpha=0.2) +
geom_polygon(aes(x=PC1, y=PC2), data=convex_hull.bad,  fill=PALETTE[2], color="black", alpha=0.2) +
scale_color_manual(values=PALETTE[2:3]) + guides.standard + theme.legend_title
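
For reference, `chull` returns the row indices of the points that form the convex hull, not the points themselves, which is why the code above uses its result to index the data frame. A small sketch with made-up points:

```r
pts = data.frame(PC1=c(0, 4, 4, 0, 2, 1, 3),
                 PC2=c(0, 0, 3, 3, 1, 2, 2))       # four corners plus three interior points

idx  = chull(pts$PC1, pts$PC2)                     # row indices of hull vertices, in clockwise order
hull = pts[idx,]
hull
```

Only the four corner points appear in `hull`; the interior points are enclosed by it.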

#### Homogeneous Spaces by Loan Class

A predictive model with high certainty ...

In [None]:
A = list(x=convex_hull.good$PC1, y=convex_hull.good$PC2)
B = list(x=convex_hull.bad$PC1,  y=convex_hull.bad$PC2)
X = polyclip(A, B, op="minus")
space.high.good = as.data.frame(X[[1]]); names(space.high.good) = c("PC1","PC2")
X = polyclip(B, A, op="minus")
space.high.bad = as.data.frame(X[[1]]); names(space.high.bad) = c("PC1","PC2")

ggplot() + ggtitle("High Certainty Spaces") + xlim(-5,20) + ylim(-5,50) +
geom_polygon(aes(x=PC1, y=PC2, fill="good"), data=space.high.good) +
geom_polygon(aes(x=PC1, y=PC2, fill="bad"),  data=space.high.bad) +
scale_fill_manual("class", values=c("good"=PALETTE[3], "bad"=PALETTE[2])) + theme.legend_title
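
As a sketch of how `polyclip` behaves, using two made-up overlapping squares: the result is a list of polygons, each a list of `x` and `y` vectors. The code above keeps only the first polygon (`X[[1]]`), which is sufficient when the operation yields a single polygon.

```r
library(polyclip)

A = list(x=c(0, 4, 4, 0), y=c(0, 0, 4, 4))
B = list(x=c(2, 6, 6, 2), y=c(2, 2, 6, 6))

only.A = polyclip(A, B, op="minus")          # part of A not covered by B
both   = polyclip(A, B, op="intersection")   # part covered by both

# Shoelace formula for the (unsigned) area of one polygon
area = function(p) abs(sum(p$x * c(p$y[-1], p$y[1]) - c(p$x[-1], p$x[1]) * p$y)) / 2

length(only.A)                               # number of polygons in the result
sapply(only.A, area)                         # 16 - 4 = 12
sapply(both, area)                           # 2 x 2 overlap = 4
```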

### Non-Homogeneous Space

A predictive model with medium certainty ...

In [None]:
A = list(x=convex_hull.good$PC1, y=convex_hull.good$PC2)
B = list(x=convex_hull.bad$PC1,  y=convex_hull.bad$PC2)
X = polyclip(A, B, op="intersection")
space.medium = as.data.frame(X[[1]]); names(space.medium) = c("PC1","PC2")

A = list(x=space.medium$PC1, y=space.medium$PC2)
B = list(x=c(-5,-5,20,20), y=c(0,-5,-5,0))
X = polyclip(A, B, op="intersection")
space.medium.good = as.data.frame(X[[1]]); names(space.medium.good) = c("PC1","PC2")

A = list(x=space.medium$PC1, y=space.medium$PC2)
B = list(x=c(-5,-5,20,20), y=c(0,50,50,0))
X = polyclip(A, B, op="intersection")
space.medium.bad = as.data.frame(X[[1]]); names(space.medium.bad) = c("PC1","PC2")

ggplot() + ggtitle("Medium Certainty Spaces", "PC2 threshold = 0.00") + xlim(-5,20) + ylim(-5,50) +
geom_polygon(aes(x=PC1, y=PC2, fill="good"), data=space.medium.good, alpha=0.5) +
geom_polygon(aes(x=PC1, y=PC2, fill="bad"),  data=space.medium.bad,  alpha=0.5) +
geom_hline(yintercept=0, linetype="dashed") +
scale_fill_manual("class", values=PALETTE[2:3]) + guides.standard + theme.legend_title

### Model Form & Parameterization

A predictive model with high and medium certainties ...

In [None]:
ggplot() + ggtitle("Predictive Model") + xlim(-5,20) + ylim(-5,50) +
geom_polygon(aes(x=PC1, y=PC2, fill="good", alpha="medium"), data=space.medium.good) +
geom_polygon(aes(x=PC1, y=PC2, fill="bad", alpha="medium"), data=space.medium.bad) +
geom_polygon(aes(x=PC1, y=PC2, fill="good", alpha="high"), data=space.high.good) +
geom_polygon(aes(x=PC1, y=PC2, fill="bad", alpha="high"), data=space.high.bad) +
scale_fill_manual("class", values=c("good"=PALETTE[3], "bad"=PALETTE[2])) +
scale_alpha_manual("certainty", values=c("medium"=0.5, "high"=1)) + 
theme.legend_title

#### Hyper-Parameters

You pick the hyper-parameter value(s).  This model's parameters were determined by form and 1 hyper-parameter:
1. PC2 threshold for medium non-homogeneous space

In [None]:
fmt(0.00, "PC2 threshold")

#### Parameters

This model is defined by these parameters:
1. Polygon vertices for high certainty good loans
1. Polygon vertices for high certainty bad loans

Note that polygon vertices for medium certainty good loans are implied by hyper-parameter value and parameter values.<br/>
Note that polygon vertices for medium certainty bad loans are implied by hyper-parameter value and parameter values.

In [None]:
fmt(data.frame(t(space.high.good)), "If new observation is within this space, then predict it's a good loan with high certainty:", row.names=TRUE, position="left")

In [None]:
fmt(data.frame(t(space.high.bad)), "If new observation is within this space, then predict it's a bad loan with high certainty:", position="left", row.names=TRUE)

In [None]:
fmt(data.frame(t(space.medium.good)), "If new observation is within this space, then predict it's a good loan with medium certainty:", position="left", row.names=TRUE)

In [None]:
fmt(data.frame(t(space.medium.bad)), "If new observation is within this space, then predict it's a bad loan with medium certainty:", position="left", row.names=TRUE)

## Prediction

Some new observations ...

In [None]:
new = datax[c(3000, 6000, 5014, 5017),]
fmt(new, title=NA)

Represent the new observations as principal components ...

In [None]:
new.pc = as.data.frame(predict(pc, new))
fmt(new.pc, title=NA)
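
Note that `predict` on a `prcomp` object reuses the center and scale learned from the original data, rather than recomputing them from the new observations. A small sketch on the built-in `USArrests` dataset (used only for illustration):

```r
train   = USArrests[1:40,]                         # built-in dataset, for illustration only
new.toy = USArrests[41:43,]

pc.toy     = prcomp(train, scale.=TRUE)
new.toy.pc = predict(pc.toy, new.toy)

# Same result by hand: reuse the training center and scale, then rotate
manual = scale(new.toy, center=pc.toy$center, scale=pc.toy$scale) %*% pc.toy$rotation
all.equal(unname(new.toy.pc), unname(manual))      # TRUE
```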

Visualize observations and model ...

In [None]:
ggplot() + ggtitle("New Observations & Model")+ xlim(-5,20) + ylim(-5,50) +
geom_polygon(aes(x=PC1, y=PC2, fill="good", alpha="medium"), data=space.medium.good) +
geom_polygon(aes(x=PC1, y=PC2, fill="bad", alpha="medium"), data=space.medium.bad) +
geom_polygon(aes(x=PC1, y=PC2, fill="good", alpha="high"), data=space.high.good) +
geom_polygon(aes(x=PC1, y=PC2, fill="bad", alpha="high"), data=space.high.bad) +
geom_point(aes(x=PC1, y=PC2), data=new.pc, size=3, color="black") +
scale_fill_manual("class", values=c("good"=PALETTE[3], "bad"=PALETTE[2])) +
scale_alpha_manual("certainty", values=c("medium"=0.5, "high"=1)) + 
theme.legend_title

Make predictions ...

In [None]:
prediction = data.frame(PC1=new.pc$PC1, PC2=new.pc$PC2,
                        high_good=inside(new.pc[,c("PC1","PC2")],   space.high.good),
                        high_bad=inside(new.pc[,c("PC1","PC2")],    space.high.bad),
                        medium_good=inside(new.pc[,c("PC1","PC2")], space.medium.good),
                        medium_bad=inside(new.pc[,c("PC1","PC2")],  space.medium.bad))

prediction$commit = factor(aaply(1:nrow(prediction), 1, function(i) which(as.logical(prediction[i,3:6]))),
                           levels=1:4,
                           labels=names(prediction)[3:6])

fmt(prediction)
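
The `inside` helper comes from `setup.R`, which is not shown here. One common way to test whether a point lies within a polygon is ray casting: count how many polygon edges a horizontal ray from the point crosses, and treat an odd count as inside. A hypothetical stand-in (the name `point_in_polygon` and the toy polygon are assumptions for illustration, not the actual implementation):

```r
# Hypothetical stand-in for inside(); the real helper lives in setup.R
point_in_polygon = function(points, polygon) {
  px = polygon$PC1; py = polygon$PC2
  n  = length(px)
  j  = c(n, 1:(n-1))                               # index of the previous vertex
  sapply(1:nrow(points), function(k) {
    x = points$PC1[k]; y = points$PC2[k]
    # Edges whose endpoints lie on opposite sides of the horizontal line through the point
    s = (py > y) != (py[j] > y)
    # x-coordinate where each such edge crosses that line
    x.cross = px[s] + (y - py[s]) * (px[j][s] - px[s]) / (py[j][s] - py[s])
    sum(x < x.cross) %% 2 == 1                     # odd number of crossings to the right => inside
  })
}

square  = data.frame(PC1=c(0, 4, 4, 0), PC2=c(0, 0, 4, 4))
new.toy = data.frame(PC1=c(1, 5, 2), PC2=c(1, 1, 3.5))
point_in_polygon(new.toy, square)                  # TRUE FALSE TRUE
```

Points exactly on a polygon edge are a boundary case that this sketch does not treat specially.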

Visualize observations, model, & predictions ...

In [None]:
ggplot() + ggtitle("New Observations, Model, & Predictions") + xlim(-5,20) + ylim(-5,50) +
geom_polygon(aes(x=PC1, y=PC2, fill="good", alpha="medium"), data=space.medium.good) +
geom_polygon(aes(x=PC1, y=PC2, fill="bad", alpha="medium"), data=space.medium.bad) +
geom_polygon(aes(x=PC1, y=PC2, fill="good", alpha="high"), data=space.high.good) +
geom_polygon(aes(x=PC1, y=PC2, fill="bad", alpha="high"), data=space.high.bad) +
geom_point(aes(x=PC1, y=PC2), data=new.pc[1,], size=3, fill=PALETTE[3], color="black", alpha=1.0, shape=21) +
geom_point(aes(x=PC1, y=PC2), data=new.pc[2,], size=3, fill=PALETTE[3], color="black", alpha=0.5, shape=21) +
geom_point(aes(x=PC1, y=PC2), data=new.pc[3,], size=3, fill=PALETTE[3], color="black", alpha=0.5, shape=21) +
geom_point(aes(x=PC1, y=PC2), data=new.pc[4,], size=3, fill=PALETTE[2], color="black", alpha=0.5, shape=21) +
scale_fill_manual("class", values=c("good"=PALETTE[3], "bad"=PALETTE[2])) +
scale_alpha_manual("certainty", values=c("medium"=0.5, "high"=1)) + 
theme.legend_title

## Evaluation

### Evaluate Model's Performance on New Observations

In [None]:
evaluation = prediction
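# class is indexed by rows of data (datax[inactive,]), not rows of datax, so derive the actual labels from loan_status
evaluation$actual = ifelse(new$loan_status %in% good_labels, "good", ifelse(new$loan_status %in% bad_labels, "bad", NA))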
evaluation

### Evaluate Model's Performance on All Data

In [None]:
data.pc.high.good = data.pc[inside(data.pc[,c("PC1","PC2")], space.high.good),]
class.high.good   = class[inside(data.pc[,c("PC1","PC2")], space.high.good)]

data.pc.high.bad = data.pc[inside(data.pc[,c("PC1","PC2")], space.high.bad), ]
class.high.bad   = class[inside(data.pc[,c("PC1","PC2")], space.high.bad)]

data.pc.medium.good = data.pc[inside(data.pc[,c("PC1","PC2")], space.medium.good),]
class.medium.good   = class[inside(data.pc[,c("PC1","PC2")], space.medium.good)]

data.pc.medium.bad = data.pc[inside(data.pc[,c("PC1","PC2")], space.medium.bad),]
class.medium.bad   = class[inside(data.pc[,c("PC1","PC2")], space.medium.bad)]

d = data.frame(high.good=100,
               medium.good=100*table(class.medium.good)["good"] / length(class.medium.good),
               high.bad=100,
               medium.bad=100*table(class.medium.bad)["bad"] / length(class.medium.bad))

fmt(d, "Correct predictions (%)")
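
The percentages above are the share of observations in each space whose actual class matches the class predicted for that space. A minimal sketch of the same calculation on made-up labels:

```r
# Hypothetical labels, for illustration only
actual    = c("good", "good", "bad",  "good", "bad", "bad", "good")
predicted = c("good", "good", "good", "good", "bad", "bad", "bad")

# Share of observations under each predicted class whose actual class matches
accuracy = tapply(actual == predicted, predicted, mean) * 100
accuracy

table(predicted, actual)                           # confusion matrix
```

Here 3 of the 4 observations predicted good are actually good (75%), and 2 of the 3 predicted bad are actually bad (about 67%).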

In [None]:
output_size(8,5)

p1 = ggplot(data.pc.high.good) + xlim(-5,20) + ylim(-5,50) +
     geom_point(aes(x=PC1, y=PC2, color=class.high.good), alpha=0.2) +
     scale_color_manual(values=PALETTE[2:3]) + theme.no_legend +
     ggtitle("High certainty that it's a good loan")

p2 = ggplot(data.pc.medium.good) + xlim(-5,20) + ylim(-5,50) +
     geom_point(aes(x=PC1, y=PC2, color=class.medium.good), alpha=0.2) +
     geom_point(aes(x=PC1, y=PC2), data=data.pc.medium.good[class.medium.good=="bad",], color="gray30", size=2, shape=4) +
     scale_color_manual(values=PALETTE[2:3]) + theme.no_legend +
     ggtitle("Medium certainty that it's a good loan")

p3 = ggplot(data.pc.high.bad) + xlim(-5,20) + ylim(-5,50) +
     geom_point(aes(x=PC1, y=PC2, color=class.high.bad), alpha=0.2) +
     scale_color_manual(values=PALETTE[2:3]) + theme.no_legend +
     ggtitle("High certainty that it's a bad loan")

p4 = ggplot(data.pc.medium.bad) + xlim(-5,20) + ylim(-5,50) +
     geom_point(aes(x=PC1, y=PC2, color=class.medium.bad), alpha=0.2) +
     geom_point(aes(x=PC1, y=PC2), data=data.pc.medium.bad[class.medium.bad=="good",], color="gray30", size=2, shape=4) +
     scale_color_manual(values=PALETTE[2:3]) + theme.no_legend +
     ggtitle("Medium certainty that it's a bad loan")

grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)

output_size(restore)

<p style="text-align:left; font-size:10px;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float:right;">
Document revised March 1, 2020
</span>
</p>