# Financial Risk Analysis and Visualization in Large Data

## Matt Triano

## Abstract

This research analyzed credit card debt defaults and explored five classifiers (AdaBoost, Logistic Regression, KNN, Random Forests, and Bagged Decision Trees, which are essentially random forests again) along with multiple data mining techniques, in pursuit of a reliable method for discerning between high and low risk credit card users in a large, imbalanced data set. The target feature of the data set, default_payment $(1=\text{yes}, \; 0=\text{no})$, is imbalanced, with about $22\%$ of the observed credit users defaulting in at least 1 of the 6 included months, so balancing techniques like SMOTE and SMOTETomek were explored with the five examined classifiers. For this business application, underestimating a cardholder's default risk can result in loss of liquidity or possible permanent loss of money, while overestimating a cardholder's default risk can result in missed profit. As loss is much more costly than missed profit, I used recall and precision as the primary metrics for evaluating classifier performance. I also explored techniques for visualizing classifier performance and improved on sklearn's ROC curve visualization to present threshold information, which enables an analyst to quickly choose a threshold for identifying credit risks.

Additionally, a classifier's performance metrics can vary greatly depending on the data preprocessing steps, so as a secondary goal, I sought to develop a more general framework for streamlining the combination of preprocessing steps, with the intention of adding that framework to my toolbox for use with future problems.

### Objectives

The accuracy and performance of a machine learning classifier, or in my case, an ensemble of classifiers, is highly dependent on the preprocessing steps applied to the data. With big data sets, inadequate data preprocessing or poor algorithm parameter selection can cause prohibitively long run times. My initial project focus was on exploring the performance of multiple base classifiers across multiple large ($n_{obs}>10,000$) data sets with faint signals concerning financial risk. The time constraints of the project and the time costs of tuning classifiers revealed a more interesting project, namely to build a framework to streamline the process of testing different preprocessing configurations over a number of different base classifiers, and I used one large data set to evaluate this framework. This involved experimenting with many different tools and attempting to adapt them to my use case.

Additionally, I am very interested in visualizing data (I took Dr. Eli Brown's CSC595 on Interactive Data Visualization last quarter and I loved it. I'm still proud of [my final project](http://bl.ocks.org/MattTriano/raw/154e9c142504d61985e2b65287e389be/)), so I spent a lot of time improving the visualizations for precision-recall relationships. This was another significant time consumer, but I'm pleased with the results produced. 

### Risk Analysis

For a financial entity to maintain adequate financial reserves, that entity must be able to accurately assess its exposure to risk. For a financial entity to make profitable loans, that entity must keep the proportion of bad loans (i.e. loans where the lendee fails to make payments on time) below some threshold proportion. Both of these goals require the financial entity to have a reasonably accurate method for discriminating between people who will default on a loan and people who will regularly make their payments on time.

### Data Set: Taiwanese Credit Card Data

This data set was collected by a Taiwanese credit issuing company and was initially published by I-Cheng Yeh and Che-hui Lien with a paper on their **Sorting Smoothing Method** for estimating the real probability of default$^{2}$. The data set includes 30000 total observations, where each observation represents a 6 month stretch of payment history for a single cardholder and consists of 23 explanatory features and a binary label (defaulted or did-not-default). From the distributions below, we see that cardholders who default are much more likely to be on the younger end of the age distribution, but as you move to the older end of the spectrum, the proportion of defaulters to non-defaulters increases significantly. We can also see that the 'Marriage' and 'Education' features include undefined data ($0.18\%$ and $1.15\%$ of observations are missing defined values for 'Marriage' and 'Education' respectively, with no overlap), although it is not indicated as missing in the data and the prior researchers make no mention of handling it.

<img src="final_proj/data_set.png" alt="Drawing" style="width: 900px;"/>

      Fig. 1.  Data Set Distributions

Eighteen of the 23 explanatory features describe six consecutive months of payment history. Six features describe the number of months delinquent (see Fig. 2 below), six describe the bill amounts for those six consecutive months, and the other six describe the amounts repaid in those six months. The numbers in the feature names correspond to the number of months prior to September 2005 (the last of the six months).

<img src="final_proj/pay_dists.png" alt="Drawing" style="width: 900px;"/>

      Fig. 2.  Payment Delinquency Distributions by Month

### Data Set Feature Descriptions

| Feature Name | Feature Description | Defined Values |
| :--------------: | --------------------: | :---------------- |
| $\text{default_payment_next_month}$ | Binary Target Feature indicating a Default on Payment | $0=\text{No Default}$ |
|  |  | $1=\text{Default}$ |
| $LIMIT\_BAL$ | Amount of available consumer and family credit [NT \$] | $LIMIT\_BAL \geq 0$ |
| $SEX$ | Gender  | $1=\text{Male}$ |
|  |  | $2=\text{Female}$ |
| $EDUCATION$ | Highest completed education level | $1=\text{Grad School}$ |
|  |  | $2=\text{University}$ |
|  |  | $3=\text{High School}$ |
|  |  | $4=\text{Other}$ |
| $MARRIAGE$ | Marital Status | $1=\text{Married}$ |
|  |  | $2=\text{Single}$ |
|  |  | $3=\text{Other}$ |
| $AGE$ | Age | $AGE \geq 0$ |
| $PAY\_x$ | Number of months delinquent 'x' months ago | $-1=\text{no delinquency}$ |
|  |  | $ 1\text{ up to }N=\text{Delinquent 1 to N months}$ |
| $BILL\_AMT\_x$ | Amount of bill statement 'x' months ago [NT \$] | Any real number (negative values are credits) |
| $PAY\_AMT\_x$ | Amount paid 'x' months ago [NT \$] | $PAY\_AMT\_x \geq 0$ |

Yeh and Lien explored using single K-nearest neighbor classifiers, logistic regression classifiers, discriminant analysis, naive Bayes classifiers, artificial neural networks, and classification trees to identify credit risks. They provided training and validation error rates in total accuracy and in the area ratio from the lift chart (the ratio of the area between the baseline and the classifier's curve over the area between the baseline and the Wizard's lift curve). Their KNN classifier had the lowest validation error at $16\%$.

<img src="final_proj/Yet_results.PNG" alt="Drawing" style="width: 400px;"/>

                                            Table 1. Yeh and Lien Classifier Accuracy

## My Methodology

### Preprocessing Dimensions

The data description provided by UCI indicates that defined values for $\text{EDUCATION}$ and $\text{MARRIAGE}$ are $[1: Grad School,2: University, 3: High School, 4: Other]$ and $[1: Married, 2: Single, 3: Other]$ respectively. However, $\text{EDUCATION}$ also includes the undefined values $[0,5,6]$ and $\text{MARRIAGE}$ includes the undefined value $[0]$.

I explored simply dropping the observations with undefined values to perform complete case analysis, and I explored leaving the values in and assuming they are Missing Not At Random (i.e. the fact that an observation has a missing value expresses useful information).

I also explored imputing those values through multivariate imputation by chained equations$^{5}$ (or **MICE**), which assumes that missing values are Missing At Random. MICE first fills every missing value with a placeholder, such as the feature mean. Then, for one feature at a time, the placeholders for that feature are removed and a regression fit on its observed values (using the other features as predictors) predicts replacements. This process cycles through all features with missing values a number of times until the change in the imputed values falls below a threshold. As this method uses information from the entire imputation set to calculate the values to impute, it must be fit after splitting the data into training and testing sets to avoid information leakage. I thought I found a way to correctly achieve this using sklearn's Pipeline tool, but I was unable to replicate the performance seen in the documentation and, unfortunately, the Pipeline tool fails silently. I started implementing similar functionality in the manual cross-validators I built, but this was error-prone and slow.
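
The leakage concern above can be illustrated with a minimal sketch. It assumes sklearn's experimental `IterativeImputer` (a MICE-inspired estimator, not necessarily the tool I used) and synthetic numeric data standing in for the credit features; the imputer is fit on the training rows only and then reused on the testing rows.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in data (an assumption): 200 rows, 4 correlated numeric features.
base = rng.normal(size=(200, 1))
X = base + 0.1 * rng.normal(size=(200, 4))

# Knock out about 5% of the entries to play the role of undefined values.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test = train_test_split(X_missing, test_size=0.25, random_state=0)

imputer = IterativeImputer(max_iter=10, random_state=0)
X_train_imputed = imputer.fit_transform(X_train)  # regressions learned from training rows only
X_test_imputed = imputer.transform(X_test)        # reused as-is; never refit on testing rows
```

Inside a cross-validation loop, the same fit-on-train, transform-on-test pattern would be repeated once per fold.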

As a general convention in my code, the bold strings itemized below are included (separated by underscores) in the names of variables holding data that received the corresponding preprocessing steps. Given more time, I would expand the set of preprocessing configurations to explore. The graphs included in my presentation were built using a data set that converted the **d0** set of features into categorical features (dummy variables), and the graphs included in the **results** section were generated using **raw** features as a baseline.

The **d0** options were chosen as the simplest way to encode the categorical difference between 'PAY_x' values $\leq 0$ (i.e. pays on time or has surplus account credit) and values $\geq 1$ (i.e. at least 1 month delinquent on payment), although inspection of the correlation matrix (see Fig. A1 in Appendix A) indicates there is multicollinearity between these features that may be problematic.

I was able to achieve surprisingly good performance (with consistently lower classification error rates than Yeh and Lien in their published results) treating these categorical features as continuous. I suspect that they would not perform as well if education had a different ordering that didn't map to reliable estimators of credit-worthiness (e.g. income), but that hypothesis would require explicit testing.  

### Explored Preprocessing Steps

**Feature Engineering**
* **raw**: leave the data in its original $[30000,23]$ form.
* **d0**:  dummy variables for columns ['SEX', 'EDUCATION','MARRIAGE','PAY_0', 'PAY_2','PAY_3','PAY_4','PAY_5', 'PAY_6']
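
As a rough illustration of the **d0** encoding, the sketch below applies `pandas.get_dummies` to a tiny hypothetical frame that borrows three of the data set's column names; the full **d0** set would pass all nine columns listed above.

```python
import pandas as pd

# Tiny hypothetical frame (values invented for illustration).
df = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "MARRIAGE": [1, 2, 3, 0],
    "LIMIT_BAL": [20000, 120000, 90000, 50000],
})

# Each listed column is replaced by one 0/1 indicator column per observed value.
categorical_cols = ["SEX", "MARRIAGE"]
d0 = pd.get_dummies(df, columns=categorical_cols, dtype=int)
print(d0.columns.tolist())
```

Note that the undefined value `0` in `MARRIAGE` simply becomes its own indicator column, which matches the "leave the values in" option described earlier.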

**Class Balancers**
* none: No balancing applied
* **sm** (**SMOTE**): Apply the **Synthetic Minority Over-Sampling Technique**. I continually ran into problems in my implementation of a manual cross-validator, which limited the number of oversampling and undersampling strategies I was able to integrate.
* <s>**as (ADASYN)**</s>: I was excited about using the **Adaptive Synthetic Sampling Approach** (**ADASYN**), and I was able to implement a pipeline replacement to handle it, but it was very inefficient. My best manual cross-validator used numpy arrays wherever possible, as numpy performs its calculations in compiled C code on contiguous blocks of memory rather than in interpreted Python. This is both a blessing (speed) and a curse (less flexibility), as it requires array sizes to be known in advance so that memory can be allocated once and then filled in. Without a known, fixed size, the system has to allocate a new block of memory whenever the limit of the prior array is reached and then copy the data into the new array. Per He, Bai, Garcia, and Li$^{6}$, the number of synthetic samples ADASYN generates for any data set (or by extension, any training set) depends on the density distribution of the minority class, in contrast to SMOTE, which produces an equal number of synthetic samples for the minority class in each data slice.
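
To make the interpolation idea behind SMOTE concrete, here is a minimal numpy sketch. It is not the Imbalanced-learn implementation; the function name `smote_sketch` and its parameters are hypothetical. Because the number of synthetic rows is known up front, the output array can be allocated once, which is the property discussed in the ADASYN item above.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    # Pairwise squared distances between minority samples.
    dists = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, d))  # size known in advance, so allocate once
    for i in range(n_new):
        a = rng.integers(n)                # pick a random minority sample
        b = neighbors[a, rng.integers(k)]  # pick one of its k nearest minority neighbors
        gap = rng.random()                 # position along the segment from a to b
        synthetic[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synthetic

# Hypothetical minority-class rows.
X_min = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote_sketch(X_min, n_new=30)
```

Every synthetic row lies on a line segment between two real minority rows, so the new points stay inside the region the minority class already occupies.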

**Data Scaler**
* none: No scaling applied
* **ss**: StandardScaler()

### Future Preprocessing Exploration Options

* **d1**:  dummy variables for columns ['SEX', 'EDUCATION','MARRIAGE'] 
* **d2**:  dummy variables for columns ['SEX', 'EDUCATION','MARRIAGE'], and re-bin the values of 'PAY_0' through 'PAY_6' into three groups: values at or below 0 (i.e. pays on time or has overpaid on an account), values from 1 to 3 months delinquent, and values greater than 3 months delinquent.

### Evaluation Metrics and Visualization Development

For this problem, I reason that False-Negatives and True-Positives matter far more than False-Positives and True-Negatives, as False-Negatives involve underestimating a potential lendee's default risk (thereby putting money in the hands of people less likely to pay it back) and True-Positives involve correctly identifying and avoiding bad loans. False-Positives involve overestimating the risk of a loan, and thereby forgoing the potential profit. This is also undesirable, but losing money is a greater business concern than failing to gain profit. These qualities are well expressed in the **ROC curve**$^{3}$, which essentially shows how well a classifier discriminates between, and ranks, positive and negative instances. ROC curves plot the True-Positive Rate (**TPR**) against the False-Positive Rate (**FPR**) for a classifier. For a binary classifier, the TPR and FPR can be calculated using values from that binary classifier's confusion matrix, which consists of True-Positives (**TP**, actually positive and predicted positive), False-Positives (**FP**, actually negative but predicted positive), True-Negatives (**TN**, actually negative and predicted negative), and False-Negatives (**FN**, actually positive but predicted negative).

$$\text{False Positive Rate} = \frac{FP}{FP+TN} \qquad \qquad \text{precision}=\frac{TP}{TP+FP}$$

$$ \text{True Positive Rate} = \text{recall} = \frac{TP}{TP+FN} \;$$
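
As a quick worked example of these formulas, the sketch below computes all three rates from a set of hypothetical confusion matrix counts (the numbers are invented for illustration, not taken from my results).

```python
def rates(tp, fp, tn, fn):
    """Return TPR (recall), FPR, and precision from confusion matrix counts."""
    return {
        "tpr": tp / (tp + fn),        # share of actual defaulters that were caught
        "fpr": fp / (fp + tn),        # share of non-defaulters flagged as risky
        "precision": tp / (tp + fp),  # share of flagged cardholders who actually default
    }

# Hypothetical counts: 100 actual defaulters and 200 actual non-defaulters.
m = rates(tp=60, fp=20, tn=180, fn=40)
print(m)
```

With these counts, recall is $60/100$, the FPR is $20/200$, and precision is $60/80$.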

To produce a curve, many points are connected together. The ROC curve is produced by sorting the list of (FPR, TPR) pairs and then plotting them from the lower left corner of the figure as a series of connected line segments. Technically, there's no information between those points, so there are three ways to join them: the optimistic curve goes straight up from each point and then straight right to the next point, the pessimistic curve goes right from each point and then up, and the average curve interpolates between the two. A single discrete classifier like a decision tree produces a single confusion matrix, which corresponds to a single (FPR, TPR) pair, so a curve has to be built from an ensemble of such classifiers.

Probabilistic classifiers, however, do not make the binary predictions needed for a confusion matrix. Rather, they predict the probability that an instance will fall into a specific class. However, these probabilities can be converted to discrete values by applying a decision threshold, where probabilities below the threshold are treated as predicted negative and probabilities above the threshold are predicted positive. The curve is generated by choosing a range of thresholds and calculating the new confusion matrix values for each threshold value.  
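
The threshold sweep described above can be sketched in a few lines of numpy. The labels and probabilities are hypothetical, and the sketch treats a probability equal to the threshold as a predicted positive, which is a convention I am assuming here.

```python
import numpy as np

# Hypothetical true labels and predicted default probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

def roc_points(y_true, y_prob, thresholds):
    """Return one (FPR, TPR) pair per threshold."""
    points = []
    for t in thresholds:
        y_pred = y_prob >= t  # apply the decision threshold
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        tn = np.sum(~y_pred & (y_true == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

points = roc_points(y_true, y_prob, thresholds=[0.0, 0.5, 1.01])
print(points)
```

The lowest threshold flags everyone (upper right corner of the ROC plot), the highest flags no one (lower left corner), and intermediate thresholds trace the curve between them.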

Fawcett discusses several other methods for producing curves from single discrete classifiers. With decision trees, one can look into the model and use the class proportions calculated at each leaf node as scores. Alternatively, an ensemble of classifiers can be built from bootstrapped samples and their predictions aggregated into a score.

The native sklearn implementation for generating ROC curves takes an array of true values and an array of probability estimates or decision_function scores and returns three arrays: the FPR points, the TPR points corresponding to the FPR points at the same index, and the decision thresholds that produced each (FPR, TPR) pair$^{4}$. The sklearn library didn't implement any functionality for easily including threshold information in the ROC curve, but this threshold information is simply a sequence of ordered values, which can easily be represented by a color gradient. In the March 12th presentations, one group included ROC curves (generated in R) that encoded threshold information as line color, but they used a rainbow color map, which is a generally poor choice for data visualization as it doesn't have a logical mapping to human expectation (i.e. given a gradient of 2 colors, it's easy to interpret the value across the spectrum from one extreme to the other, but across a spectrum of many colors, our visual intuition breaks down and interpretation requires thought). I implemented a visualization that I'm very pleased with (see Fig. 3). By presenting threshold information in this way, an analyst can use the ROC curve to determine the actual threshold value needed to achieve a certain True Positive Rate.
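
One way to encode thresholds as line color is sketched below. It is a simplified reconstruction under my own assumptions rather than the exact code behind Fig. 3: the scores are synthetic, the `viridis` color map and output file name are arbitrary choices, and the curve is drawn with matplotlib's `LineCollection` so that each segment can carry its own color.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Hypothetical scores: positives tend to score higher than negatives.
y_score = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# The first threshold is a sentinel above every score, so clip before coloring.
colors = np.clip(thresholds, 0, 1)

# Build one short segment per pair of consecutive (FPR, TPR) points.
pts = np.column_stack([fpr, tpr]).reshape(-1, 1, 2)
segments = np.concatenate([pts[:-1], pts[1:]], axis=1)

fig, ax = plt.subplots(figsize=(5, 5))
lc = LineCollection(segments, cmap="viridis", norm=plt.Normalize(0, 1))
lc.set_array(colors[:-1])  # color each segment by the threshold at its start
lc.set_linewidth(3)
ax.add_collection(lc)
ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
fig.colorbar(lc, ax=ax, label="Decision threshold")
fig.savefig("roc_threshold_colors.png")
```

A sequential color map keeps the ordering of thresholds readable at a glance, which is the property the rainbow map lacks.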

### Curve Interpretation

The ROC curve below describes the performance of a Random Forest classifier I trained. From the curve, we see that a threshold value of $0.9$ corresponds to a TPR of about $0.6$ and an FPR of about $0$. This classifier can return a decision score for a single case, and an analyst can compare that score to a threshold to determine a proper business reaction. The addition of color makes it much easier for someone using the classifier to understand the cost of different threshold choices.

<img src="final_proj/example_ROC_curve.PNG" alt="Drawing" style="width: 500px;"/>
    
                    Fig. 3.  ROC Curve with color-encoded threshold

To produce these visualizations and to capture the performance of all of the classifiers, I needed more control over the cross validation process than I could get from off-the-shelf sklearn cross validators. For this business problem, the ROC curve, Precision-Recall curve, and Precision and Recall vs threshold are useful visualizations, and confusion matrices are useful for understanding the performance of any classifier or ensemble of classifiers. To produce these visualizations across all of the folds of the cross validation, I had to collect and store data from each fold, and then pass appropriate data off to other functions to generate the visualizations. Implementing this in a performant way that avoided data leakage required better understanding of how sklearn works under the hood.

### Scikit-Learn Exploration:

The number of possible preprocessing dimensions quickly creates a large number of combinations to try. It quickly became apparent that a coherent framework would be necessary to both accelerate development of new models, as well as reduce the space for possible variations between models. I have been using Python's sklearn module (short for Scikit-Learn, part of the SciPy scientific computing stack) in school for a couple years now, but I haven't taken the time to really dig in and study the ecosystem. The developers that designed sklearn saw that this would be a massive project and put a lot of thought into designing the API so that the project would both be usable and maintainable.  They set the design principles that all objects should share a consistent interface, only learning algorithms will be represented by classes, it should be simple to inspect an object's parameters, default values should be reasonable (i.e. the value that generally performs best given other parameters), and wherever possible, objects should be modular so that they can be assembled to perform sequential or parallel tasks.

Sklearn's consistency is largely a result of its interfaces. 
* The **estimator interface** requires that all sklearn estimators implement a **fit()** method for training a learning algorithm and a **constructor** to initialize an instance of an estimator with hyperparameters.
* The **predictor interface** builds on estimators by requiring that estimators implement a **predict()** method that allows a fitted estimator instance to make predictions for new data.

A major hurdle was figuring out how to include preprocessing steps without introducing data leakage, where information from the testing folds contaminates the training process and taints the cross validation results. I found a native solution to this problem in sklearn's Pipeline interface, although ultimately I wasn't able to get the (allegedly) fully sklearn-compliant library Imbalanced-learn$^{1}$ to properly apply class balancing in a pipeline. This caused a number of headaches and was a very difficult bug to diagnose, but my 6th implementation of a manual cross-validator was able to integrate class balancing, simulate a pipeline, and apply a normalizing step (although I just discovered that the Imbalanced-learn library polluted the namespace and caused the incorrect pipeline implementation to be called, resulting in an aggregate of 8+ hours of high-load computing that produced results that weren't significantly different from other runs). Despite setbacks, bugs, redesigns, and workarounds, I was pleased with the visualizations produced by my framework, as well as the performance of the AdaBoost and RandomForest classifiers it produced.
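
The core idea of a leakage-free manual cross-validator can be sketched as follows. This is a simplified illustration, not my actual evaluator: the data is synthetic, and a plain random oversampler (the hypothetical `oversample_minority` helper) stands in for SMOTE so the example depends only on numpy and sklearn. The key point is that balancing and scaling are fit inside each fold using training rows only, and the per-fold curve data is stored for later plotting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def oversample_minority(X, y, rng):
    """Stand-in balancer: duplicate random minority rows until classes match."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

# Synthetic imbalanced data (about 22% positives, mirroring the class ratio).
X, y = make_classification(n_samples=600, weights=[0.78, 0.22], random_state=0)
rng = np.random.default_rng(0)
fold_curves = []  # one (fpr, tpr, thresholds) tuple per fold

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    # Balance and scale using the training fold only.
    X_train, y_train = oversample_minority(X[train_idx], y[train_idx], rng)
    scaler = StandardScaler().fit(X_train)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(scaler.transform(X_train), y_train)
    # The testing fold is only transformed and scored, never resampled or refit.
    scores = clf.predict_proba(scaler.transform(X[test_idx]))[:, 1]
    fold_curves.append(roc_curve(y[test_idx], scores))
```

Each entry of `fold_curves` can then be handed to a plotting function, giving one ROC line per fold.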

### My Framework:

1. Determine best model parameters and preprocessing steps for use with a version of a data set (semi-automated):
    1. Make a parameter grid containing a dictionary of hyperparameters and hyperparameter values to try.
    2. Select a combination of preprocessing steps and put them in a pipeline.
    3. Perform a GridSearch over the parameter grid using the preprocessing pipeline. Because I was unable to use the pipeline for class balancing, I made sample testing and training sets that are balanced by the methods explored in this paper. Fit the GridSearch using the training data and then check for overfitting by comparing the recall and precision scores produced by predictions from the training set and the corresponding testing set. (Tip: save the estimator returned by GridSearchCV; it will save you several minutes if you want to use it again.)
    4. Visualize the performance over the hyperparameter space. If there are promising trends that extend out of the selected hyperparameter window, adjust the parameter grid and run another GridSearch. When the scores stop varying significantly with hyperparameter values, select the least complex, quickest model that meets the chosen score threshold.
2. Instantiate a new classifier with the selected hyperparameters.
3. Pass that new classifier to my evaluator function with a corresponding data set and corresponding label set, enter the desired class balancer algorithm, the scaler function, and descriptors to use in graphs. 
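
Step 1 of the framework can be sketched with standard sklearn pieces. The data, the logistic regression classifier, and the grid values below are placeholder assumptions chosen to keep the example small; the pattern of a scaler and classifier in a `Pipeline`, searched with `GridSearchCV` and then checked for overfitting on held-out data, is the part that carries over.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the credit data.
X, y = make_classification(n_samples=500, weights=[0.78, 0.22], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Steps 1.1 and 1.2: a parameter grid plus a preprocessing pipeline.
pipe = Pipeline([("ss", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"clf__C": [0.01, 0.1, 1.0], "clf__class_weight": [None, "balanced"]}

# Step 1.3: grid search scored on recall, refit on the full training set.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

# Compare train and test scores to check for overfitting.
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = search.predict(X_part)
    print(name,
          round(recall_score(y_part, pred), 3),
          round(precision_score(y_part, pred, zero_division=0), 3))
print(search.best_params_)
```

Because the scaler lives inside the pipeline, it is refit on the training portion of every cross-validation split, so no testing-fold statistics leak into training.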

While I still want to develop more functionality into my framework, encapsulating so many steps into the evaluator made it much easier to batch out tests.

## Results

I've included the best classifiers developed for each of the listed combinations of classifier, feature engineering, scaling, and balancing. The best classifiers were determined by a GridSearch over at least 5 folds. Severe overfitting disqualified classifiers with the best testing accuracies unless overfitting was common to most of the top classifiers. The highest validation accuracy that Yeh and Lien achieved was $84\%$ with their K-Nearest Neighbors classifier. While my classifiers didn't achieve recall scores much higher than $80\%$, my AdaBoost, Random Forest, and Bagged Decision Trees classifiers achieved higher test accuracies (about $86\%$ to $87\%$) than their best model.

<img src="final_proj/Yet_results.PNG" alt="Drawing" style="width: 400px;"/>

                                            Table 1, again. Yeh and Lien Classifier Accuracy

In [1]:
import pandas as pd

In [2]:
params = pd.read_csv('results_params.csv')
params

| | Base Classifier | Feature Eng. | Balancing | Scaling | Best Params |
| --- | --- | --- | --- | --- | --- |
| 0 | Logistic | Raw | | | Solver: liblinear, C: 0.01, Penalty: L1 |
| 1 | Logistic | Raw | SMOTE | | Solver: Newton-CG, C: 1, Penalty: L2 |
| 2 | Logistic | Raw | SMOTE | StdScaler | Solver: Saga, C: 0.001, Penalty: L1 |
| 3 | KNN | Raw | | | Weight: Uniform, NNeighbors: 5, Dim: 1 |
| 4 | KNN | Raw | SMOTE | | Weight: Distance, NNeighbors: 3, Dim: 2 |
| 5 | KNN | Raw | SMOTE | StdScaler | Weight: Uniform, NNeighbors: 3, Dim: 1 |
| 6 | Random Forest | Raw | | | Entropy, MaxDepth: 10, N_estimators: 50, Max_F... |
| 7 | Random Forest | Raw | SMOTE | | Gini, MaxDepth: 30, N_estimators: 10, Max_Feat... |
| 8 | Random Forest | Raw | SMOTE | StdScaler | Gini, MaxDepth: 30, N_estimators: 50, Max_Fea... |
| 9 | Bagged Dtrees | Raw | | | Entropy, MaxDepth: 45, Min Samples to Split: 6... |


In [5]:
values = pd.read_csv('results_values.csv')
values.sort_values('Test Recall', ascending=False)

| | Base Classifier | Feature Eng. | Balancing | Scaling | Train Recall | Train Precision | Train Acc | Test Recall | Test Precision | Test Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 13 | AdaBoost | Raw | SMOTE | | 0.8005 | 0.9298 | 0.87 | 0.8006 | 0.9274 | 0.8689 |
| 14 | AdaBoost | Raw | SMOTE | StdScaler | 0.8005 | 0.9298 | 0.87 | 0.8006 | 0.9274 | 0.8689 |
| 11 | Bagged Dtrees | Raw | SMOTE | StdScaler | 0.8835 | 0.9626 | 0.9246 | 0.7952 | 0.916 | 0.8611 |
| 7 | Random Forest | Raw | SMOTE | | 0.9992 | 0.9997 | 0.9994 | 0.7949 | 0.9188 | 0.8623 |
| 8 | Random Forest | Raw | SMOTE | StdScaler | 0.9992 | 0.9997 | 0.9995 | 0.7947 | 0.919 | 0.8623 |
| 10 | Bagged Dtrees | Raw | SMOTE | | 0.8829 | 0.9628 | 0.9244 | 0.7937 | 0.9155 | 0.8602 |
| 1 | Logistic | Raw | SMOTE | | 0.6612 | 0.7108 | 0.6961 | 0.6577 | 0.7085 | 0.6936 |
| 2 | Logistic | Raw | SMOTE | StdScaler | 0.6005 | 0.8024 | 0.7263 | 0.5979 | 0.8025 | 0.7254 |
| 5 | KNN | Raw | SMOTE | StdScaler | 0.9582 | 0.9141 | 0.9341 | 0.595 | 0.7705 | 0.7089 |
| 4 | KNN | Raw | SMOTE | | 0.9994 | 1.0 | 0.9997 | 0.4953 | 0.594 | 0.5784 |


## Visualizations

I was startled by how clearly the Recall-Precision curves (on the right) display the difference between models using completely untransformed data and models using balanced data. Additionally, as one line was plotted for each cross validation fold, you get a good idea of the variance and noise in the predictions, and classifier performance really stabilizes when the data is balanced.

<img src="roc_rp_ADA_raw.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_rp_ADA_raw_sm.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_rp_ADA_raw_sm_ss.PNG" alt="Drawing" style="width: 800px;"/>
    
                    Fig. 4.  AdaBoost: ROC Curves and Recall-Precision Curves
                    top: [Raw Data], mid: [Raw Data, SMOTE], bot: [Raw Data, SMOTE, StdScaler]

<img src="roc_RF_raw.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_RF_raw_sm.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_RF_raw_sm_ss.PNG" alt="Drawing" style="width: 800px;"/>
    
                    Fig. 5.  Random Forest: ROC Curves and Recall-Precision Curves
                    top: [Raw Data], mid: [Raw Data, SMOTE], bot: [Raw Data, SMOTE, StdScaler]

<img src="roc_rp_LOG_raw.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_rp_LOG_raw_sm.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_rp_LOG_raw_sm_ss.PNG" alt="Drawing" style="width: 800px;"/>
    
                    Fig. 6.  Logistic Classifier: ROC Curves and Recall-Precision Curves
                    top: [Raw Data], mid: [Raw Data, SMOTE], bot: [Raw Data, SMOTE, StdScaler]

<img src="roc_rp_BAggDT_raw.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_rp_BAggDT_raw_sm.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_rp_BAggDT_raw_sm_ss.PNG" alt="Drawing" style="width: 800px;"/>
    
                    Fig. 7.  Bagged Decision Trees: ROC Curves and Recall-Precision Curves
                    top: [Raw Data], mid: [Raw Data, SMOTE], bot: [Raw Data, SMOTE, StdScaler]

<img src="roc_rp_KNN_raw.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_rp_KNN_raw_sm.PNG" alt="Drawing" style="width: 800px;"/> 
<img src="roc_rp_KNN_raw_sm_ss.PNG" alt="Drawing" style="width: 800px;"/>
    
                    Fig. 8.  K-Nearest Neighbors: ROC Curves and Recall-Precision Curves
                    top: [Raw Data], mid: [Raw Data, SMOTE], bot: [Raw Data, SMOTE, StdScaler]

### Recall and Precision vs Threshold Curves

These curves show how recall and precision vary with threshold. They provide a finer ability to resolve specific threshold values. Note that the X-axes have different scales. These images can be viewed in higher resolution in the code file.

It's interesting to see how noisy precision gets when the threshold reaches its upper limit, but this makes sense: a high threshold means the classifier predicts that most things are negatives, so precision is computed from very few predicted positives and becomes undefined when there are none.
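
The data behind a recall and precision vs threshold plot can be obtained from sklearn's `precision_recall_curve`. The sketch below uses synthetic scores and a hypothetical target recall of $0.8$ to show how an analyst might read off the strictest threshold that still meets a recall requirement.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)
# Hypothetical scores: positives tend to score higher than negatives.
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=300), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# precision and recall each carry one extra trailing entry with no threshold,
# so drop it to align them with the thresholds array.
target_recall = 0.8
meets_target = recall[:-1] >= target_recall
best_threshold = thresholds[meets_target].max()
print(best_threshold, precision[:-1][thresholds == best_threshold])
```

Raising the threshold beyond `best_threshold` would trade recall away for precision, which is the trade-off these plots are meant to expose.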

<img src="rpt_ADA_raw.PNG" alt="Drawing" style="width: 700px;"/> 
<img src="rpt_ADA_raw_sm.PNG" alt="Drawing" style="width: 700px;"/> 
<img src="rpt_ADA_raw_sm_ss.PNG" alt="Drawing" style="width: 700px;"/>
    
                    Fig. 9.  AdaBoost: Recall and Precision vs Threshold Curves
                    top: [Raw Data], mid: [Raw Data, SMOTE], bot: [Raw Data, SMOTE, StdScaler]

<table>
<tr>
    <td><img src="rpt_RF_raw.PNG" alt="Drawing" style="width: 450px;"/> </td>
    <td><img src="rpt_BAggDT_raw.PNG" alt="Drawing" style="width: 450px;"/> </td>
</tr>
<tr>
    <td><img src="rpt_RF_raw_sm.PNG" alt="Drawing" style="width: 450px;"/> </td>
    <td><img src="rpt_BAggDT_raw_sm.PNG" alt="Drawing" style="width: 450px;"/> </td>
</tr>
<tr>
    <td><img src="rpt_RF_raw_sm_ss.PNG" alt="Drawing" style="width: 450px;"/> </td>
    <td><img src="rpt_BAggDT_raw_sm_ss.PNG" alt="Drawing" style="width: 450px;"/> </td>
</tr>
</table>

         Fig. 10.  Random Forest and Bagged Decision Trees: Recall and Precision vs Threshold Curves
                    top: [Raw Data], mid: [Raw Data, SMOTE], bot: [Raw Data, SMOTE, StdScaler]

<table>
<tr>
    <td><img src="rpt_KNN_raw.PNG" alt="Drawing" style="width: 450px;"/> </td>
    <td><img src="rpt_LOG_raw.PNG" alt="Drawing" style="width: 450px;"/> </td>
</tr>
<tr>
    <td><img src="rpt_KNN_raw_sm.PNG" alt="Drawing" style="width: 450px;"/> </td>
    <td><img src="rpt_LOG_raw_sm.PNG" alt="Drawing" style="width: 450px;"/> </td>
</tr>
<tr>
    <td><img src="rpt_KNN_raw_sm_ss.PNG" alt="Drawing" style="width: 450px;"/> </td>
    <td><img src="rpt_LOG_raw_sm_ss.PNG" alt="Drawing" style="width: 450px;"/> </td>
</tr>
</table>

         Fig. 11.  K-Nearest Neighbors and Logistic Classifier: Recall and Precision vs Threshold Curves
                    top: [Raw Data], mid: [Raw Data, SMOTE], bot: [Raw Data, SMOTE, StdScaler]

# Conclusion

In this project, I set out to identify faint signals in large financial data sets, and I've produced classifiers that perform moderately well at that task. While working on this assignment, I discovered a much more ambitious goal: building a framework that could be used to survey and express the effects of preprocessing on classifier performance. I'm very pleased with the progress I've made, and I plan on continuing to work on this framework and exploring how visualization can be used to forcefully convey the operating characteristics of data mining techniques.

# Bibliography

* [1] Lemaitre, Guillaume, Fernando Nogueira, and Christos K. Aridas. "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning." Journal of Machine Learning Research 18, no. 17 (2017): 1-5.
* [2] Yeh, I-Cheng, and Che-Hui Lien. "The Comparisons of Data Mining Techniques for the Predictive Accuracy of Probability of Default of Credit Card Clients." Expert Systems with Applications 36, no. 2 (2009): 2473-480. doi:10.1016/j.eswa.2007.12.020.
* [3] Fawcett, Tom. "An introduction to ROC analysis." Pattern Recognition Letters 27, no. 8 (December 19, 2005): 861-74. doi:10.1016/j.patrec.2005.10.010.
* [4] Metz, Charles E. "Basic Principles of ROC Analysis." Seminars in Nuclear Medicine 8, no. 4 (1978): 283-98. doi:10.1016/s0001-2998(78)80014-2.
* [5] Azur, Melissa J., Elizabeth A. Stuart, Constantine Frangakis, and Philip J. Leaf. "Multiple imputation by chained equations: what is it and how does it work?" International Journal of Methods in Psychiatric Research 20, no. 1 (2011): 40-49. doi:10.1002/mpr.329.
* [6] He, Haibo, Yang Bai, Edwardo A. Garcia, and Shutao Li. "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning." 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008. doi:10.1109/ijcnn.2008.4633969.
* [7] Gustavo E. A. P. A. Batista, Ronaldo C. Prati, and Maria Carolina Monard. "A study of the behavior of several methods for balancing machine learning training data." ACM SIGKDD Explorations Newsletter 6, no. 1 (2004): 20-29. doi:10.1145/1007730.1007735.

## Appendix A

<img src="final_proj/variable_heatmap.png" alt="Drawing" style="width: 900px;"/>
         Fig. A1.  Correlation Matrix for Data Set Features

## Appendix B

See the other document included with my submission to view the generating code and other visualizations.

## Appendix C

I feel foolish for not checking my entire assignment #4 when I submitted it. I foolishly collected the graph images I had produced into a folder without updating the file locations in the HTML img tags, as they still rendered in Jupyter Notebook as they were in my browser's cache. I don't expect any points, but I want you to know I was still paying attention during the Bayesian Network lectures. 

### 3a:
$$P(X)*P(Y|X)*P(Z|X,Y)*P(W|X,Y,Z)$$
<img src="final_proj/3a.PNG" alt="Drawing" style="width: 400px;"/>

### 3b:
$$P(W)*P(X)*P(Y)*P(Z)$$
<img src="final_proj/3b.PNG" alt="Drawing" style="width: 400px;"/>

### 3c:
$$P(Y)*P(Z|Y)*P(X|Y)*P(W|Y)$$
<img src="final_proj/3c.PNG" alt="Drawing" style="width: 400px;"/>

### 3d:
$$P(Z|X,Y)*P(X)*P(Y)*P(W|X)$$
<img src="final_proj/3d.PNG" alt="Drawing" style="width: 400px;"/>

### 3e:
$$P(W|X)*P(X|Y)*P(Y|Z)*P(Z)$$
<img src="final_proj/3e.PNG" alt="Drawing" style="width: 400px;"/>

### 3f:
$$P(W|X)*P(Y|X)*P(Z|Y)*P(X)$$
<img src="final_proj/3f.PNG" alt="Drawing" style="width: 400px;"/>