# Data to Decision

## Nathan Scheperle
## Monday, February 11

Note: students may work in groups on the problem but are responsible for submitting their own answers. Type answers directly in this Word document, rename the file with your name, and upload to Sakai Dropbox before noon on Monday.

## Applying the Threshold Model and Forecasting Future Error Rates

Computer chips that exceed certain size tolerances on the assembly line tend to be defective. You are given a size measure for each chip, with a binary outcome (defective/not defective), for 400 typical cases drawn at random from a stable production process.

The cost per FP classification is estimated at $\$50-100$ and the cost per FN classification is $\$150-300$.  

In [1]:
import pandas as pd


class Threshold():
    """Threshold model: predict positive when x >= threshold."""
    
    def __init__(self, fn_cost, fp_cost):
        # Cost per false negative and per false positive
        self.fn_cost = fn_cost
        self.fp_cost = fp_cost
    
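    def fit(self, x, y):
        # x: size measures; y: boolean outcomes (True = positive)
        df = pd.concat([x, y], axis=1)
        df.columns = ['x', 'y']
        pos = len(df[df['y']])   # number of positive cases
        neg = len(df) - pos      # number of negative cases
        # Sort descending: using row i as the cutoff predicts rows 0..i positive
        df = df.sort_values(by='x', ascending=False)
        # Initialize the counts for the cutoff at the largest x
        if df.iloc[0, df.columns.get_loc('y')]: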
            df['FP'] = 0
            df['FN'] = pos - 1
            df['TP'] = 1
            df['TN'] = neg
        else:
            df['FP'] = 1
            df['FN'] = pos
            df['TP'] = 0
            df['TN'] = neg  - 1
        for i in range(1,len(df)):
            if df.iloc[i,df.columns.get_loc('y')]:
                df.iloc[i, df.columns.get_loc('FN')] = df.iloc[i-1, df.columns.get_loc('FN')] - 1
                df.iloc[i, df.columns.get_loc('TP')] = df.iloc[i-1, df.columns.get_loc('TP')] + 1

                df.iloc[i, df.columns.get_loc('TN')] = df.iloc[i-1, df.columns.get_loc('TN')]
                df.iloc[i, df.columns.get_loc('FP')] = df.iloc[i-1, df.columns.get_loc('FP')]
            else:
                df.iloc[i, df.columns.get_loc('TN')] = df.iloc[i-1, df.columns.get_loc('TN')] - 1
                df.iloc[i, df.columns.get_loc('FP')] = df.iloc[i-1, df.columns.get_loc('FP')] + 1

                df.iloc[i, df.columns.get_loc('FN')] = df.iloc[i-1, df.columns.get_loc('FN')]
                df.iloc[i, df.columns.get_loc('TP')] = df.iloc[i-1, df.columns.get_loc('TP')]

        df['Cost'] = self.fn_cost*df['FN'] + self.fp_cost*df['FP']
        df['Error'] = (df['FN'] + df['FP'])/len(df)
        
        threshold_row = df.loc[df['Cost'].idxmin()]
        self.threshold = threshold_row['x']
        self.err_rate = threshold_row['Error']
        self.cost = threshold_row['Cost']
        self.n = len(df)
    
    def avg_cost(self):
        return self.cost/self.n
    
    def predict(self, x):
        # Classify as positive every case at or above the fitted threshold
        y_hat = x >= self.threshold
        return y_hat
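
The running counts built row by row in `fit` can also be obtained with cumulative sums. The following is only a sketch of that idea, not a replacement for the class above: the function name `threshold_costs` and the five-case toy data are made up for illustration, and it assumes the `x` values are distinct, as the row-by-row version implicitly does.

```python
import numpy as np

def threshold_costs(x, y, fn_cost, fp_cost):
    """Total cost and error rate when each observed x is used as the cutoff.

    A case is predicted positive when x >= cutoff (hypothetical helper).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)
    order = np.argsort(-x, kind="stable")   # indices that sort x in descending order
    x_sorted, y_sorted = x[order], y[order]
    tp = np.cumsum(y_sorted)                # positives at or above each cutoff
    fp = np.cumsum(~y_sorted)               # negatives at or above each cutoff
    fn = y.sum() - tp                       # positives left below each cutoff
    cost = fn_cost * fn + fp_cost * fp
    error = (fn + fp) / len(x)
    return x_sorted, cost, error

# Made-up toy data, not the assignment data
cutoffs, cost, error = threshold_costs(
    [5, 4, 3, 2, 1], [True, True, False, True, False], fn_cost=150, fp_cost=50
)
best = int(np.argmin(cost))
print(cutoffs[best], cost[best], error[best])  # 2.0 50 0.2
```

On the toy data the cheapest cutoff is 2.0, which leaves one false positive and no false negatives.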
    

Partition your $400$ cases into a *training set* and *test set*, based on ensuring that with $99\%$ confidence the error rate on future data will not exceed the error rate on test data by $.10$.  Partition the data by assigning individual cases at random to the two sets in your desired proportions.

1. How large is your test set? Why? 

By Hoeffding's Inequality, the number of test observations $n$ needed so that, with $99\%$ confidence ($\alpha = .01$), the error rate on future data exceeds the error rate on test data by no more than $t = .10$ is:
$$
n \geq \frac{\ln(2/\alpha)}{2t^2} = \frac{\ln(2/.01)}{2 \cdot .1^2} = \frac{\ln(200)}{.02} \approx 264.9
$$
Rounding up gives $n = 265$.
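
The arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name `hoeffding_test_size` is introduced here for illustration only.

```python
import math

def hoeffding_test_size(t, alpha):
    """Smallest n with ln(2/alpha) / (2 t^2) <= n (two-sided Hoeffding bound)."""
    return math.ceil(math.log(2 / alpha) / (2 * t ** 2))

n_test = hoeffding_test_size(t=0.10, alpha=0.01)
print(n_test)  # 265
```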

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read in dataset; the label column marks positive cases with 'POS'
df = pd.read_excel("./data/Data for Feb 11 Problem.xlsx", header=None, 
                   names=['x','y'])
df['y'] = df['y'] == 'POS'  # convert labels to booleans

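# n derived above
# As written, the n cases sampled below form the training set and the
# remaining len(df) - n cases form the test set.
n = 265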

train_idx = np.random.choice(range(len(df)), size=n, replace=False)

train_df = df.iloc[train_idx]
test_df = df.drop(train_idx)

2. Evaluate the threshold model on your training data only, and identify six parameters (thresholds) that minimize the average per-chip cost of misclassification at each of six combinations of costs. 

(A) Record these parameters. Also give (B) the average cost per event and (C) the error rate at each threshold on the training set.

In [14]:
def do_threshold(dataset, fn_cost, fp_cost):
    """Fit the threshold model for one cost pair and report the results."""
    thresh_model = Threshold(fn_cost, fp_cost)
    thresh_model.fit(dataset['x'], dataset['y'])
    print("At ${} per FN and ${} per FP, ".format(fn_cost, fp_cost))
    print("Threshold = {}".format(thresh_model.threshold))
    print("Error rate = {}".format(thresh_model.err_rate))
    print("Average cost per event = ${0:.4}".format(thresh_model.avg_cost()))
    print()
    return fn_cost, fp_cost, thresh_model.threshold, thresh_model.avg_cost(), thresh_model.err_rate 

m_150_50 = do_threshold(train_df, 150, 50)
m_225_50 = do_threshold(train_df, 225, 50)
m_350_50 = do_threshold(train_df, 350, 50)

m_150_100 = do_threshold(train_df, 150, 100)
m_225_100 = do_threshold(train_df, 225, 100)
m_350_100 = do_threshold(train_df, 350, 100)

At $150 per FN and $50 per FP, 
Threshold = 10210.77657077832
Error rate = 0.04150943396226415
Average cost per event = $5.094

At $225 per FN and $50 per FP, 
Threshold = 10183.438877965953
Error rate = 0.052830188679245285
Average cost per event = $7.264

At $350 per FN and $50 per FP, 
Threshold = 10183.438877965953
Error rate = 0.052830188679245285
Average cost per event = $10.57

At $150 per FN and $100 per FP, 
Threshold = 10241.889257347055
Error rate = 0.03773584905660377
Average cost per event = $5.472

At $225 per FN and $100 per FP, 
Threshold = 10210.77657077832
Error rate = 0.04150943396226415
Average cost per event = $7.925

At $350 per FN and $100 per FP, 
Threshold = 10210.77657077832
Error rate = 0.04150943396226415
Average cost per event = $11.7




In [15]:
from IPython.display import HTML, display
import tabulate

headers = ["Cost Per FN", "Cost Per FP", "Threshold", "Average Cost", "Error Rate"]
table = [m_150_50, m_225_50, m_350_50, 
        m_150_100, m_225_100, m_350_100]

# Render the training results as an HTML table in the notebook
display(HTML(tabulate.tabulate(table, headers=headers, tablefmt='html')))

# LaTeX version of the same table
ltx = tabulate.tabulate(table, headers, tablefmt="latex_raw")

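| Cost Per FN | Cost Per FP | Threshold | Average Cost | Error Rate |
|-------------|-------------|-----------|--------------|------------|
| 150         | 50          | 10210.8   | 5.09434      | 0.0415094  |
| 225         | 50          | 10183.4   | 7.26415      | 0.0528302  |
| 350         | 50          | 10183.4   | 10.566       | 0.0528302  |
| 150         | 100         | 10241.9   | 5.4717       | 0.0377358  |
| 225         | 100         | 10210.8   | 7.92453      | 0.0415094  |
| 350         | 100         | 10210.8   | 11.6981      | 0.0415094  |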







\begin{array}{c|ccc}
 & & Cost\ Per\ FN & \\
 \hline
Cost\ Per\ FP & \$ 150 & \$ 225 & \$ 350 \\
 \hline
\$ 50 & 10210.8 & 10183.4 & 10183.4 \\
\$ 100 & 10241.9 & 10210.8 & 10210.8 \\
\hline
\end{array}

3. Evaluate the six thresholds derived from training data on the test data and record (B) the average cost per event and (C) the error rate at each threshold. 

\begin{array}{c|ccc}
 & & Cost\ Per\ FN & \\
 \hline
Cost\ Per\ FP & \$ 150 & \$ 225 & \$ 350 \\
 \hline
\$ 50 & & & \\
\$ 100 & & & \\
\hline
\end{array}
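
Evaluating a threshold fitted on training data against held-out data amounts to applying the same rule, $x \geq$ threshold, and counting the two kinds of error. The sketch below illustrates this; the function name `evaluate_threshold` and the five test cases are made up and are not part of the assignment data.

```python
def evaluate_threshold(xs, ys, threshold, fn_cost, fp_cost):
    """Average cost per event and error rate of a fixed threshold on (xs, ys)."""
    fp = fn = 0
    for x, y in zip(xs, ys):
        predicted = x >= threshold      # same rule as Threshold.predict
        if predicted and not y:
            fp += 1                     # flagged positive, actually negative
        elif y and not predicted:
            fn += 1                     # missed positive
    n = len(xs)
    avg_cost = (fn_cost * fn + fp_cost * fp) / n
    error_rate = (fn + fp) / n
    return avg_cost, error_rate

# Made-up test cases
test_x = [10300.0, 10250.0, 10200.0, 10150.0, 10100.0]
test_y = [True, False, True, False, False]
avg_cost, error_rate = evaluate_threshold(test_x, test_y, threshold=10210.0,
                                          fn_cost=150, fp_cost=50)
print(avg_cost, error_rate)  # 40.0 0.4
```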


4. With 99% reliability, what is the expected maximum error rate on new data at each of the six thresholds? 

\begin{array}{c|ccc}
 & & Cost\ Per\ FN & \\
 \hline
Cost\ Per\ FP & \$ 150 & \$ 225 & \$ 350 \\
 \hline
\$ 50 & & & \\
\$ 100 & & & \\
\hline
\end{array}
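
Solving the Hoeffding bound for $t$ instead of $n$ gives a margin $t = \sqrt{\ln(2/\alpha)/(2n)}$, so with confidence $1-\alpha$ the error rate on new data is at most the test error rate plus $t$. The following sketch assumes that reading of the question; the function names and the example test error rate of $0.05$ are hypothetical.

```python
import math

def hoeffding_margin(n, alpha):
    """Margin t such that 2 * exp(-2 * n * t^2) = alpha."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

def max_error_rate(test_error, n, alpha=0.01):
    """Upper bound on the future error rate, capped at 1."""
    return min(1.0, test_error + hoeffding_margin(n, alpha))

margin = hoeffding_margin(n=265, alpha=0.01)   # just under 0.10 for n = 265
bound = max_error_rate(test_error=0.05, n=265) # made-up test error rate
print(round(margin, 4), round(bound, 4))
```

The margin depends only on the test set size and $\alpha$, so a smaller test set gives a wider margin at every threshold.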

Note that the Hoeffding inequality can be used not only to generalize the overall error rate from test data results, but also the separate overall portion of False Positives (not the FP rate, g/b, but the portion g/(a+b)) and the overall portion of False Negatives (not the FN rate, f/a, but the portion of False Negatives f/(a+b)).  

5. Calculate these portions for each of the six parameters on the test data. 

\begin{array}{c|ccc}
 & & Cost\ Per\ FN & \\
 \hline
Cost\ Per\ FP & \$ 150 & \$ 225 & \$ 350 \\
 \hline
\$ 50 & & & \\
\$ 100 & & & \\
\hline
\end{array}
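
The difference between a portion and a rate is only the denominator. The sketch below reads the note's symbols as $a$ = actual positives, $b$ = actual negatives, $g$ = false positives, and $f$ = false negatives, which is an interpretation rather than something stated in the assignment; it also assumes both classes are present. The function name and the data are made up.

```python
def error_portions(y_true, y_pred):
    """FP and FN counts expressed as portions of all cases and as rates."""
    n = len(y_true)                     # a + b
    positives = sum(y_true)             # a
    negatives = n - positives           # b
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)   # g
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)   # f
    return {
        "fp_portion": fp / n,           # g / (a + b)
        "fn_portion": fn / n,           # f / (a + b)
        "fp_rate": fp / negatives,      # g / b
        "fn_rate": fn / positives,      # f / a
    }

# Made-up labels and predictions
portions = error_portions([True, False, True, False, False],
                          [True, True, False, False, False])
print(portions)
```

The two portions sum to the overall error rate, which is why the same Hoeffding margin applies to each of them.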

6. With 99% confidence, what is the maximum portion of False Positives, and the maximum portion of False Negatives, that will be observed at each of the six thresholds? 

\begin{array}{c|ccc}
 & & Cost\ Per\ FN & \\
 \hline
Cost\ Per\ FP & \$ 150 & \$ 225 & \$ 350 \\
 \hline
\$ 50 & & & \\
\$ 100 & & & \\
\hline
\end{array}

7. With 99% confidence, what is the maximum cost per event that will be observed at each of the six thresholds?  

\begin{array}{c|ccc}
 & & Cost\ Per\ FN & \\
 \hline
Cost\ Per\ FP & \$ 150 & \$ 225 & \$ 350 \\
 \hline
\$ 50 & & & \\
\$ 100 & & & \\
\hline
\end{array}
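
One way to bound the cost per event is to bound the FP and FN portions separately and combine them with the two costs. The sketch below takes that route and splits $\alpha$ evenly between the two bounds (a union bound) so that both hold jointly with $99\%$ confidence. That split is an assumption of this sketch, not something the assignment specifies, and the function names and example portions are hypothetical.

```python
import math

def hoeffding_margin(n, alpha):
    """Margin t such that 2 * exp(-2 * n * t^2) = alpha."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

def max_cost_per_event(fp_portion, fn_portion, fp_cost, fn_cost, n, alpha=0.01):
    """Upper bound on the expected cost per event from the two portion bounds."""
    eps = hoeffding_margin(n, alpha / 2)   # alpha split across the two bounds
    return fn_cost * (fn_portion + eps) + fp_cost * (fp_portion + eps)

# Made-up test-set portions
cost_bound = max_cost_per_event(fp_portion=0.02, fn_portion=0.03,
                                fp_cost=50, fn_cost=150, n=265)
print(round(cost_bound, 2))
```

Using the full $\alpha$ for each portion gives a slightly smaller bound, but the joint confidence then drops to at least $1 - 2\alpha$.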



