### Calculating Errors

The two datasets below are based on examples you have seen in this lesson: one on the parachute example and one on the judicial example. Neither dataset is based on real people.

Use the exercises below to assist in answering the quiz questions at the bottom of this page.

In [21]:
import numpy as np
import pandas as pd

jud_data = pd.read_csv('../data/judicial_dataset_predictions.csv')
par_data = pd.read_csv('../data/parachute_dataset.csv')

In [22]:
jud_data.head()

Unnamed: 0,defendant_id,actual,predicted
0,22574,innocent,innocent
1,35637,innocent,innocent
2,39919,innocent,innocent
3,29610,guilty,guilty
4,38273,innocent,innocent


In [23]:
par_data.head()

Unnamed: 0,parachute_id,actual,predicted
0,3956,opens,opens
1,2147,opens,opens
2,2024,opens,opens
3,8325,opens,opens
4,6598,opens,opens


## Question 1
`1.` Above, you can see the actual and predicted columns for each of the datasets.  Using the **jud_data**, find the overall proportion of errors in the dataset, as well as the proportion of errors of each type.  Use the results to answer the questions in quiz 1 below.  

**Hint for quiz:** an error occurs any time the prediction doesn't match the actual value.  There are two kinds to consider: Type I and Type II errors.  We also know that we can minimize one type of error by maximizing the other.  If we predict all individuals as innocent, how many of the guilty are incorrectly labeled?  Similarly, if we predict all individuals as guilty, how many of the innocent are incorrectly labeled?

<strong>H0: </strong> innocent <br>
<strong>H1: </strong>guilty

<table>
    <tr>
        <th></th>
        <th></th>        
        <th halign="center" colspan=2>Actual</th>
    </tr>       
    <tr>
        <th></th>           
        <th></th>        
        <th>True</th>
        <th>False</th>        
    </tr>   
    <tr>
        <th rowspan=2>Decision</th>        
        <th>True</th>        
        <td>True Positive</td>
        <td>False Positive</td>        
    </tr>       
    <tr>
        <th>False</th>        
        <td>False Negative</td>
        <td>True Negative</td>        
    </tr>    
</table>    
<br>
False Positive is Error Type I<br>
False Negative is Error Type II
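
As a cross-check on counting each cell separately, all four cells of the table above can be tallied at once with `pd.crosstab`. The sketch below uses a small made-up frame (`toy` is hypothetical, not the lesson's CSV) with the same `actual` and `predicted` columns, and assumes `guilty` is the positive class, matching H1.

```python
import pandas as pd

# Hypothetical toy data with the same column names as jud_data (not the real file).
toy = pd.DataFrame({
    'actual':    ['innocent', 'innocent', 'guilty', 'guilty', 'guilty', 'innocent'],
    'predicted': ['innocent', 'guilty', 'guilty', 'innocent', 'guilty', 'innocent'],
})

# Rows are the decision (predicted) and columns are the truth (actual).
table = pd.crosstab(toy['predicted'], toy['actual'])

# Type I: innocent people predicted guilty (false positives).
type_1 = table.loc['guilty', 'innocent'] / len(toy)
# Type II: guilty people predicted innocent (false negatives).
type_2 = table.loc['innocent', 'guilty'] / len(toy)

print(table)
print(type_1, type_2)
```

Dividing by `len(toy)` gives each error type as a share of all rows, which is the same denominator the cells below use.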

In [24]:
nr_total = len(jud_data)
nr_total

7283

In [25]:
# True positives
true_positives = jud_data[(jud_data['actual'] == 'guilty')&(jud_data['predicted'] == 'guilty')]
nr_true_positives = len(true_positives)

nr_portion = nr_true_positives
print(nr_portion, '/', nr_total, '=', nr_portion/nr_total)

3698 / 7283 = 0.5077577921186325


In [26]:
# False positives (Type I errors)
false_positives = jud_data[(jud_data['actual'] == 'innocent')&(jud_data['predicted'] == 'guilty')]
nr_false_positives = len(false_positives)

nr_portion = nr_false_positives 
print(nr_portion, '/', nr_total, '=', nr_portion/nr_total)

11 / 7283 = 0.001510366607167376


In [27]:
# False negatives (Type II errors)
false_negatives = jud_data[(jud_data['actual'] == 'guilty')&(jud_data['predicted'] == 'innocent')]
nr_false_negatives = len(false_negatives)

nr_portion = nr_false_negatives
print(nr_portion, '/', nr_total, '=', nr_portion/nr_total)

296 / 7283 = 0.04064259233832212


In [28]:
# true negatives
true_negatives = jud_data[(jud_data['actual'] == 'innocent')&(jud_data['predicted'] == 'innocent')]
nr_true_negatives = len(true_negatives)

nr_portion = nr_true_negatives
print(nr_portion, '/', nr_total, '=', nr_portion/nr_total)

3278 / 7283 = 0.4500892489358781


In [29]:
# Overall proportion of errors (Type I plus Type II)
(nr_false_positives / nr_total) + (nr_false_negatives / nr_total)

0.042152958945489497

In [30]:
# If everyone were predicted guilty, what would the proportion of Type I errors be?
len(jud_data[(jud_data['actual'] == 'innocent')]) / len(jud_data)

0.45159961554304545

In [31]:
# If everyone were predicted guilty, what would the proportion of Type II errors be?
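
The two extreme scenarios from the hint can be checked by overwriting the predictions with a single label. This is a minimal sketch on made-up data; `toy`, `error_rates`, and the scenario variables are hypothetical names, and `guilty` is assumed to be the positive class.

```python
import pandas as pd

# Hypothetical toy data (not the lesson's CSV): 2 innocent and 3 guilty defendants.
toy = pd.DataFrame({'actual': ['innocent', 'guilty', 'guilty', 'innocent', 'guilty']})

def error_rates(actual, predicted):
    """Return (Type I, Type II) error proportions with 'guilty' as the positive class."""
    type_1 = ((actual == 'innocent') & (predicted == 'guilty')).mean()
    type_2 = ((actual == 'guilty') & (predicted == 'innocent')).mean()
    return type_1, type_2

# Scenario 1: everyone predicted guilty.
all_guilty = pd.Series('guilty', index=toy.index)
guilty_type_1, guilty_type_2 = error_rates(toy['actual'], all_guilty)

# Scenario 2: everyone predicted innocent.
all_innocent = pd.Series('innocent', index=toy.index)
innocent_type_1, innocent_type_2 = error_rates(toy['actual'], all_innocent)

print(guilty_type_1, guilty_type_2)
print(innocent_type_1, innocent_type_2)
```

Predicting everyone guilty leaves no one to be wrongly labeled innocent, so the Type II proportion is zero and the Type I proportion equals the share of innocent people. Predicting everyone innocent reverses this.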


## Question 2
`2.` Using the **par_data**, find the overall proportion of errors in the dataset, as well as the proportion of errors of each type.  Use the results to answer the questions in quiz 2 below.

These should be very similar operations to those you performed in the previous question.
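
Since the same four counts are needed for both datasets, one option is to wrap them in a small helper. The function below is a sketch, not part of the lesson: `error_summary` and its `positive`/`negative` parameters are hypothetical names, and it assumes the frame has `actual` and `predicted` columns holding exactly two labels.

```python
import pandas as pd

def error_summary(df, positive, negative):
    """Count each cell of the decision table and the overall error proportion."""
    actual, predicted = df['actual'], df['predicted']
    counts = {
        'true_positives':  int(((actual == positive) & (predicted == positive)).sum()),
        'false_positives': int(((actual == negative) & (predicted == positive)).sum()),  # Type I
        'false_negatives': int(((actual == positive) & (predicted == negative)).sum()),  # Type II
        'true_negatives':  int(((actual == negative) & (predicted == negative)).sum()),
    }
    errors = counts['false_positives'] + counts['false_negatives']
    counts['error_proportion'] = errors / len(df)
    return counts

# Hypothetical toy data with the same column names as par_data (not the real file).
toy = pd.DataFrame({
    'actual':    ['opens', 'opens', 'fails', 'opens'],
    'predicted': ['opens', 'fails', 'fails', 'opens'],
})
summary = error_summary(toy, positive='opens', negative='fails')
print(summary)
```

With the lesson's data, the equivalent calls would pass `'guilty'`/`'innocent'` for the judicial frame and `'opens'`/`'fails'` for the parachute frame.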

In [32]:
par_data.head()

Unnamed: 0,parachute_id,actual,predicted
0,3956,opens,opens
1,2147,opens,opens
2,2024,opens,opens
3,8325,opens,opens
4,6598,opens,opens


In [33]:
nr_total = len(par_data)
nr_total

5829

<strong>H0: </strong>fails <br>
<strong>H1: </strong>opens

<br>

<table>
    <tr>
        <th></th>
        <th></th>        
        <th halign="center" colspan=2>Actual</th>
    </tr>       
    <tr>
        <th></th>           
        <th></th>        
        <th>True</th>
        <th>False</th>        
    </tr>   
    <tr>
        <th rowspan=2>Decision</th>        
        <th>True</th>        
        <td>True Positive</td>
        <td>False Positive</td>        
    </tr>       
    <tr>
        <th>False</th>        
        <td>False Negative</td>
        <td>True Negative</td>        
    </tr>    
</table>    
<br>
False Positive is Error Type I<br>
False Negative is Error Type II

In [34]:
# True positives
true_positives = par_data[(par_data['actual'] == 'opens')&(par_data['predicted'] == 'opens')]
nr_true_positives = len(true_positives)

nr_portion = nr_true_positives
print(nr_portion, '/', nr_total, '=', nr_portion/nr_total)

5549 / 5829 = 0.951964316349288


In [35]:
# false positives (type I)
false_positives = par_data[(par_data['actual'] == 'fails')&(par_data['predicted'] == 'opens')]
nr_false_positives = len(false_positives)

nr_portion = nr_false_positives
print(nr_portion, '/', nr_total, '=', nr_portion/nr_total)

1 / 5829 = 0.00017155601303825698


In [36]:
# false negatives (type II)
false_negatives = par_data[(par_data['actual'] == 'opens')&(par_data['predicted'] == 'fails')]
nr_false_negatives = len(false_negatives)

nr_portion = nr_false_negatives
print(nr_portion, '/', nr_total, '=', nr_portion/nr_total)

232 / 5829 = 0.03980099502487562


In [37]:
# true negatives
true_negatives = par_data[(par_data['actual'] == 'fails')&(par_data['predicted']== 'fails')]
nr_true_negatives = len(true_negatives)

nr_portion = nr_true_negatives
print(nr_portion, '/', nr_total, '=', nr_portion/nr_total)

47 / 5829 = 0.008063132612798079


In [38]:
# Overall proportion of errors (Type I plus Type II), reusing the counts from the cells above
(nr_false_positives / nr_total) + (nr_false_negatives / nr_total)

0.039972551037913875

In [39]:
# All rows where the prediction does not match the actual outcome
par_data[par_data['actual'] != par_data['predicted']]

Unnamed: 0,parachute_id,actual,predicted
77,3716,opens,fails
112,7116,opens,fails
215,3520,opens,fails
244,3300,opens,fails
288,8146,opens,fails
...,...,...,...
5767,7145,opens,fails
5774,4008,opens,fails
5798,5965,opens,fails
5802,7802,opens,fails


In [40]:
# Same overall proportion of errors, computed directly from the mismatched rows
len(par_data[(par_data['actual'] != par_data['predicted'])]) / len(par_data)

0.039972551037913875
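
Because a boolean Series averages to the fraction of its `True` values, the mismatch proportion can also be written as a mean. The sketch below shows both forms on a made-up frame (`toy` is hypothetical, not the lesson's CSV).

```python
import pandas as pd

# Hypothetical toy data with the same column names as par_data (not the real file).
toy = pd.DataFrame({
    'actual':    ['opens', 'opens', 'fails', 'opens', 'fails'],
    'predicted': ['opens', 'fails', 'fails', 'opens', 'opens'],
})

# True wherever the prediction differs from the actual outcome.
mismatch = toy['actual'] != toy['predicted']

by_length = len(toy[mismatch]) / len(toy)  # count mismatched rows, divide by all rows
by_mean = mismatch.mean()                  # mean of booleans is the same fraction

print(by_length, by_mean)
```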