
### Objectives

The accuracy and performance of a machine learning classifier are highly dependent on the preprocessing steps applied to the data. With big data sets, inadequate data preprocessing or poor algorithm parameter selection can cause prohibitively long run times. My initial project focus was on exploring the performance of multiple base classifiers across multiple large ($n_{obs}>10,000$) data sets with faint signals concerning financial risk. The time constraints of the project and the time costs of tuning classifiers revealed a more interesting project: building a framework to streamline the process of testing different preprocessing configurations over a number of different base classifiers. I used one large data set to evaluate this framework. This involved experimenting with many different tools, and I encountered a number of bugs in the current implementations of modules in the sklearn environment that I will try to patch at a later date.

Additionally, I am very interested in visualizing data (I took Dr. Eli Brown's CSC595 on Interactive Data Visualization last quarter and loved it; I'm still proud of [my final project](http://bl.ocks.org/MattTriano/raw/154e9c142504d61985e2b65287e389be/)), so I spent a lot of time improving the visualizations for precision-recall relationships. This was another significant time consumer, and I suspect there may be bugs in my implementations, as I observed different results between different iterations.

### Risk Analysis

For a financial entity to maintain adequate financial reserves, that entity must be able to accurately assess its exposure to risk. For a financial entity to make profitable loans, that entity must keep the proportion of bad loans (i.e., loans where the lendee fails to make payments on time) below some threshold proportion. Both of these goals require the financial entity to have a reasonably accurate method for discriminating between people who will default on a loan and people who will regularly make their payments on time.

### Data Set: Taiwanese Credit Card Data

The data set used for this assignment is the Taiwanese credit card data set described by UCI.

<img src="final_proj/data_set.png" alt="Drawing" style="width: 900px;"/>

      Fig. 1.  Data Set Distributions

<img src="final_proj/pay_dists.png" alt="Drawing" style="width: 900px;"/>

      Fig. 2.  Payment Delinquency Distributions by Month

### Data Set Feature Descriptions

| Feature Name | Feature Description | Defined Values |
| :--------------: | --------------------: | :---------------- |
| $\text{default_payment_next_month}$ | Binary Target Feature indicating a Default on Payment | $0=\text{No Default}$ |
|  |  | $1=\text{Default}$ |
| $LIMIT\_BAL$ | Amount of available consumer and family credit [NT \$] | $LIMIT\_BAL \geq 0$ |
| $SEX$ | Gender  | $1=\text{Male}$ |
|  |  | $2=\text{Female}$ |
| $EDUCATION$ | Highest completed education level | $1=\text{Grad School}$ |
|  |  | $2=\text{University}$ |
|  |  | $3=\text{High School}$ |
|  |  | $4=\text{Other}$ |
| $MARRIAGE$ | Marital Status | $1=\text{Married}$ |
|  |  | $2=\text{Single}$ |
|  |  | $3=\text{Other}$ |
| $AGE$ | Age | $AGE \geq 0$ |
| $PAY\_x$ | Number of months delinquent 'x' months ago | $-1=\text{no delinquency}$ |
|  |  | $ 1\text{ up to }N=\text{Delinquent 1 to N months}$ |
| $BILL\_AMT\_x$ | Amount of bill statement 'x' months ago [NT \$] | Any real number (negative values are credits) |
| $PAY\_AMT\_x$ | Amount paid 'x' months ago [NT \$] | $PAY\_AMT\_x \geq 0$ |

### Preprocessing Dimensions

The data description provided by UCI indicates that the defined values for $\text{EDUCATION}$ and $\text{MARRIAGE}$ are $[1,2,3,4]$ and $[1,2,3]$ respectively. However, $\text{EDUCATION}$ also includes the undefined values $[0,5,6]$ and $\text{MARRIAGE}$ includes the undefined value $[0]$. I explored both simply dropping the affected rows and replacing the undefined values with NA so that they can be imputed later.

As a general convention in my code, the bold strings itemized below are included (separated by underscores) in the names of data variables that received the corresponding preprocessing steps.

**Undefined Value Processing**
* **NA**: The undefined values are replaced with NA values and will be imputed at model building time.
* **no_NA**: Rows containing undefined values are simply dropped.
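
The two strategies can be sketched with pandas. This is a minimal illustration under stated assumptions, not the project's actual code: the `DEFINED_VALUES` mapping is copied from the feature description table, and the function names and toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Defined category codes, copied from the feature description table (assumed mapping).
DEFINED_VALUES = {"EDUCATION": [1, 2, 3, 4], "MARRIAGE": [1, 2, 3]}

def mark_undefined_as_na(df):
    """NA strategy: replace undefined category codes with NaN for later imputation."""
    out = df.copy()
    for col, allowed in DEFINED_VALUES.items():
        # Keep values that are in the allowed list, set everything else to NaN.
        out[col] = out[col].where(out[col].isin(allowed), np.nan)
    return out

def drop_undefined_rows(df):
    """no_NA strategy: drop every row that contains an undefined category code."""
    return mark_undefined_as_na(df).dropna(subset=list(DEFINED_VALUES))

# Toy data for illustration only, not rows from the real data set.
toy = pd.DataFrame({"EDUCATION": [1, 0, 2, 5], "MARRIAGE": [1, 2, 0, 3]})
toy_NA = mark_undefined_as_na(toy)
toy_no_NA = drop_undefined_rows(toy)
```

The NA variant keeps every row, at the cost of needing an imputation step later. The no_NA variant is simpler but discards observations.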

**Feature Engineering**
* **d0**: dummy variables for columns ['SEX', 'EDUCATION','MARRIAGE','PAY_0', 'PAY_2','PAY_3','PAY_4','PAY_5', 'PAY_6']
* **d1**: dummy variables for columns ['SEX', 'EDUCATION','MARRIAGE'] 
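
As a rough sketch, both dummy-variable configurations can be produced with `pandas.get_dummies`. The column lists are taken from the bullets above; the helper name `add_dummies` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# Column groups copied from the d0 and d1 definitions.
D1_COLS = ["SEX", "EDUCATION", "MARRIAGE"]
D0_COLS = D1_COLS + ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]

def add_dummies(df, columns):
    """One-hot encode the listed columns and leave all other columns untouched."""
    # Only encode columns that exist, so the sketch also runs on a reduced toy frame.
    present = [c for c in columns if c in df.columns]
    return pd.get_dummies(df, columns=present)

# Toy frame with a subset of the real columns (illustrative values only).
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000],
    "SEX": [2, 2, 1],
    "EDUCATION": [2, 2, 1],
    "MARRIAGE": [1, 2, 2],
    "PAY_0": [2, -1, 0],
})
toy_d1 = add_dummies(toy, D1_COLS)  # PAY_0 stays numeric
toy_d0 = add_dummies(toy, D0_COLS)  # PAY_0 is expanded into indicator columns
```

The difference between the two is whether the `PAY_x` delinquency codes are treated as ordered numbers (d1) or as unordered categories (d0).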

**Class Balancers**
* **sm**: SMOTE
* **as**: ADASYN
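
To make the idea behind the oversamplers concrete, here is a simplified NumPy sketch of SMOTE-style interpolation: each synthetic minority row is placed at a random position on the segment between a minority row and one of its nearest minority neighbours. This is an illustration only, not the library implementation used in the project, and it does not cover ADASYN. The function name and parameters are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating toward near neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbours than other rows
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))         # random minority row
        nb = neighbours[base, rng.integers(k)]  # one of its k nearest neighbours
        gap = rng.random()                      # position along the connecting segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority class with two features (illustrative values only).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_synthetic = smote_like_oversample(X_minority, n_new=6, k=2)
```

Because every synthetic row is a convex combination of two real minority rows, the new points stay inside the region already occupied by the minority class.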


Data variables named under this convention:

* `cc_df_d0`
* `cc_df_d1`
* `cc_df_d1_NA`
* `cc_df_d1_no_NA` :: `classes_d1_no_NA`
* `cc_df_no_NA` :: `classes_no_NA`

Imputation of NA values: `fancyimpute.MICE().complete(data matrix)`

### Framework:

1.  

### Evaluation Metrics and Visualization Development

For this problem (taking a default as the positive class), it's far more important to focus on False-Negatives and True-Positives than on False-Positives and True-Negatives. False-Negatives involve overestimating a potential lendee's creditworthiness (thereby putting money in the hands of people less likely to pay it back), and True-Positives involve correctly identifying and avoiding bad loans. False-Positives involve overestimating the risk of a loan, and thereby forgoing the potential profit. This is also undesirable, but losing money is more of a business concern than failing to gain profit. These qualities are decently expressed in the **ROC curve**$^{[ROCbib]}$, which plots the True-Positive Rate (**TPR**) against the False-Positive Rate (**FPR**) for a classifier. For a binary classifier, the TPR and FPR can be calculated using values from that binary classifier's confusion matrix, which consists of True-Positives (**TP**, actually positive and predicted positive), False-Positives (**FP**, actually negative but predicted positive), True-Negatives (**TN**, actually negative and predicted negative), and False-Negatives (**FN**, actually positive but predicted negative).

$$\text{True Positive Rate} = \text{recall} = \frac{\text{TP}}{\text{TP}+\text{FN}} \;\; :: \; \; \text{False Positive Rate} = \frac{\text{FP}}{\text{FP}+\text{TN}}$$
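
Both rates follow directly from the four confusion-matrix counts: TPR is TP / (TP + FN) and FPR is FP / (FP + TN). A minimal pure-Python sketch, with hypothetical function names and made-up labels (1 = default, 0 = no default):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """Return (TPR, FPR); a rate with an empty denominator is reported as 0.0."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr, fpr = rates(tp, fp, tn, fn)
```

Here two of the three actual defaults are caught (TPR = 2/3) and one of the five non-defaults is flagged (FPR = 0.2).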

To produce a curve, many points are connected together. The ROC curve is produced by sorting the list of (FPR, TPR) pairs and then plotting them from the lower left corner of the figure as a series of connected line segments. Technically, there's no information between those points, so there are three ways to connect them: the optimistic curve goes straight up from each point and then straight right to the next point, the pessimistic curve goes right from each point and then up, and the average curve interpolates between the points. A single discrete classifier like a decision tree produces a single confusion matrix, which corresponds to a single (FPR, TPR) pair, so a curve is built from an ensemble of such discrete classifiers.
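
The three ways of connecting the points can be compared by the area each one encloses. The sketch below is an illustration with hypothetical function names and made-up (FPR, TPR) pairs. It assumes the curve is anchored at (0, 0) and (1, 1), which the text above does not state explicitly.

```python
def roc_points(pairs):
    """Sort (FPR, TPR) pairs and anchor the curve at (0, 0) and (1, 1) (an assumption)."""
    return [(0.0, 0.0)] + sorted(pairs) + [(1.0, 1.0)]

def area_under(points, mode="average"):
    """Area under the curve for the optimistic, pessimistic, or average connection."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        width = x1 - x0
        if mode == "optimistic":     # up first, then right: the step sits at height y1
            area += width * y1
        elif mode == "pessimistic":  # right first, then up: the step sits at height y0
            area += width * y0
        else:                        # straight line between the points: trapezoid
            area += width * (y0 + y1) / 2
    return area

# Made-up (FPR, TPR) pairs, deliberately unsorted.
pairs = [(0.5, 0.9), (0.1, 0.6), (0.25, 0.8)]
points = roc_points(pairs)
auc_optimistic = area_under(points, "optimistic")
auc_pessimistic = area_under(points, "pessimistic")
auc_average = area_under(points, "average")
```

The average area always lies halfway between the other two, so the gap between the optimistic and pessimistic values gives a sense of how much the choice of connection matters for a given set of points.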

Probabilistic classifiers, however, do not make the binary predictions needed for a confusion matrix. Rather, they predict the probability that an instance will fall into a specific class. This can be converted to a binary prediction by applying a probability threshold.
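
One common way to get from predicted probabilities to ROC points is to sweep a decision threshold: every threshold yields one set of binary predictions and therefore one (FPR, TPR) pair. The following pure-Python sketch uses hypothetical function names and made-up probabilities, and it assumes both classes are present in `y_true`.

```python
def threshold_predictions(probabilities, threshold):
    """Predict the positive class (1) when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def roc_pairs_from_scores(y_true, probabilities):
    """Return one (FPR, TPR) pair per distinct threshold, from strictest to loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives  # both counts are assumed to be non-zero
    pairs = []
    for threshold in sorted(set(probabilities), reverse=True):
        y_pred = threshold_predictions(probabilities, threshold)
        tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
        pairs.append((fp / negatives, tp / positives))
    return pairs

# Made-up labels and predicted default probabilities.
y_true = [1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.6, 0.3, 0.1]
pairs = roc_pairs_from_scores(y_true, probs)
```

Lowering the threshold can only add positive predictions, so both rates are non-decreasing along the sweep and the pairs come out already ordered for plotting.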

Many of the classifiers I used, like Random Forests or the Bagging classifier built from Decision Tree stumps, only produce discrete classifications, but probabilities are needed to generate 

After seeing the ROC curves that encoded threshold information into 

### Results



| Base Estimator | Data Format | Class Balancing Strategy | Best Parameters |
| :--------------: | --------------------: | :---------------- | :---------------- |
