# Exercise 2

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## PART A

**(30 points)** Write a function named `data_split` that splits a given dataset into a specified number of subsets, as determined by the `portions` parameter. The function should look like:

```python
def data_split(X, y, portions, shuffle=True):
    # TODO
```

where `X` and `y` are both NumPy arrays. The portions should be given as a dictionary. For instance, invoking:

```python
data_split(X, y, portions={"training": .75, 'validation': .15, 'test': .10 })
```

will return a dictionary containing the divided sets. The returned dictionary's keys correspond to the indicated portions, while the values represent the input and output data of the splits. By default, the function should randomly shuffle the data before splitting.

**IMPORTANT:** If the sum of the requested portions exceeds 1, normalize them so that they total 1. If the sum falls below 1, add an extra entry to the portions dictionary. This entry's key should be named 'remaining', and its value should be one minus the sum.

In [2]:
def data_split(X, y, portions, shuffle=True):
    # X and y must contain the same number of samples
    if len(X) != len(y):
        raise ValueError('X and y are not the same length')

    # work on a copy so the caller's dictionary is not modified
    portions = dict(portions)

    # normalize the portions if they sum to more than 1
    portions_sum = sum(portions.values())
    if portions_sum > 1:
        print('Portions sum > 1...normalizing')
        for dataset in portions:
            portions[dataset] /= portions_sum

    # add a 'remaining' portion if they sum to less than 1
    # (floating-point rounding after normalizing can also trigger this)
    portions_sum = sum(portions.values())
    if portions_sum < 1:
        print('Portions sum < 1...adding "remaining" portion...')
        portions['remaining'] = 1 - portions_sum

    # shuffle X and y with the same random permutation
    if shuffle:
        permutation = np.random.permutation(len(X))
        X, y = X[permutation], y[permutation]

    # cumulative split boundaries; the last subset absorbs any rounding leftovers
    sizes = [int(portion * len(X)) for portion in portions.values()]
    splits = [0]
    for size in sizes[:-1]:
        splits.append(splits[-1] + size)
    splits.append(len(X))

    # slice each subset out of X and y
    divided = {}
    for i, name in enumerate(portions):
        divided[name] = [X[splits[i]:splits[i+1]], y[splits[i]:splits[i+1]]]

    return divided
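
The index bookkeeping in `data_split` can also be expressed with `np.cumsum` and `np.split`. The following is a minimal sketch rather than a replacement: the helper name `split_by_fractions` is hypothetical, it assumes the fractions already sum to 1, and it skips shuffling.

```python
import numpy as np

def split_by_fractions(X, y, fractions):
    """Hypothetical helper: split X and y by a dict of fractions summing to 1."""
    n = len(X)
    # cumulative fractions -> integer cut points (the final boundary n is implicit)
    cuts = (np.cumsum(list(fractions.values()))[:-1] * n).astype(int)
    # np.split cuts along axis 0 at the given indices
    X_parts = np.split(X, cuts)
    y_parts = np.split(y, cuts)
    return {name: [Xp, yp] for name, Xp, yp in zip(fractions, X_parts, y_parts)}

X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)
parts = split_by_fractions(X_demo, y_demo,
                           {'training': 0.5, 'validation': 0.25, 'test': 0.25})
for name, (Xp, yp) in parts.items():
    print(name, Xp.shape, yp.shape)
```

Because this sketch truncates the cumulative fractions instead of the individual subset sizes, a subset can end up one sample larger or smaller than with `data_split`.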

**(20 points)** Test your function on the IRIS dataset using the following four different split scenarios:

In [3]:
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

In [4]:
# Test scenario 1: A 70/30 training/testing split.
test1 = data_split(X, y, {'training': .7, 'testing': .3})
print(test1.keys())
print('trainingX shape:', test1['training'][0].shape)
print('trainingy shape:', test1['training'][1].shape)

print('testingX shape:', test1['testing'][0].shape)
print('testingy shape:', test1['testing'][1].shape)

dict_keys(['training', 'testing'])
trainingX shape: (105, 4)
trainingy shape: (105,)
testingX shape: (45, 4)
testingy shape: (45,)


In [5]:
# Test scenario 2: A 60/25/15 training/validation/testing split.
test2 = data_split(X, y, {'training': .60, 'validation': .25, 'testing': .15})
print(test2.keys())
print('trainingX shape:', test2['training'][0].shape)
print('trainingy shape:', test2['training'][1].shape)

print('validationX shape:', test2['validation'][0].shape)
print('validationy shape:', test2['validation'][1].shape)

print('testingX shape:', test2['testing'][0].shape)
print('testingy shape:', test2['testing'][1].shape)

dict_keys(['training', 'validation', 'testing'])
trainingX shape: (90, 4)
trainingy shape: (90,)
validationX shape: (37, 4)
validationy shape: (37,)
testingX shape: (23, 4)
testingy shape: (23,)


In [6]:
# Test scenario 3: A 50/30/40 training/validation/testing split.
test3 = data_split(X, y, {'training': .50, 'validation': .30, 'testing': .40})
print(test3.keys())
print('trainingX shape:', test3['training'][0].shape)
print('trainingy shape:', test3['training'][1].shape)

print('validationX shape:', test3['validation'][0].shape)
print('validationy shape:', test3['validation'][1].shape)

print('testingX shape:', test3['testing'][0].shape)
print('testingy shape:', test3['testing'][1].shape)

print('remainingX shape:', test3['remaining'][0].shape)
print('remainingy shape:', test3['remaining'][1].shape)

Portions sum > 1...normalizing
Portions sum < 1...adding "remaining" portion...
dict_keys(['training', 'validation', 'testing', 'remaining'])
trainingX shape: (62, 4)
trainingy shape: (62,)
validationX shape: (37, 4)
validationy shape: (37,)
testingX shape: (50, 4)
testingy shape: (50,)
remainingX shape: (1, 4)
remainingy shape: (1,)
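
The one-sample `remaining` subset in scenario 3 is most likely a floating-point artifact: after each portion is divided by the sum, the normalized values can add up to slightly less than 1, so the `< 1` branch fires as well. Below is a minimal sketch of a tolerance-based check. The helper name `normalize_portions` and the tolerance `1e-9` are assumptions for illustration, not part of the exercise specification.

```python
import math

def normalize_portions(portions, tol=1e-9):
    """Hypothetical helper: return a copy of `portions` that sums to 1."""
    portions = dict(portions)  # leave the caller's dictionary untouched
    total = sum(portions.values())
    if math.isclose(total, 1.0, abs_tol=tol):
        return portions  # already sums to 1 within tolerance
    if total > 1:
        # scale every portion down so the total is 1
        return {name: value / total for name, value in portions.items()}
    # total is meaningfully below 1: add the leftover as 'remaining'
    portions['remaining'] = 1 - total
    return portions

over = normalize_portions({'training': .50, 'validation': .30, 'testing': .40})
under = normalize_portions({'testing': .30})
print(list(over))   # no 'remaining' key is added after normalizing
print(list(under))  # 'remaining' is added for the missing portion
```

If `data_split` used a check like this, scenario 3 would yield only the three requested subsets, with the last one absorbing the sample that currently lands in `remaining`.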


In [7]:
# Test scenario 4: A single 30% testing split.
test4 = data_split(X, y, {'testing': .30})
print(test4.keys())
print('testingX shape:', test4['testing'][0].shape)
print('testingy shape:', test4['testing'][1].shape)

print('remainingX shape:', test4['remaining'][0].shape)
print('remainingy shape:', test4['remaining'][1].shape)

Portions sum < 1...adding "remaining" portion...
dict_keys(['testing', 'remaining'])
testingX shape: (45, 4)
testingy shape: (45,)
remainingX shape: (105, 4)
remainingy shape: (105,)


## PART B 

**(30 points)** Write a function that accepts a confusion matrix with more than two classes (for example, a confusion matrix with 3 classes: A, B, and C), alongside a class index (like 0, 1, or 2). The function should return the sensitivity (recall), specificity, precision, and F1 metrics for the specified class index.

**HINT**: Convert the confusion matrix into a $2 \times 2$ matrix that represents the designated class versus all other classes.
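
For reference, with $TP$, $FP$, $FN$, and $TN$ denoting the counts in that one-vs-rest matrix for the chosen class, the standard definitions of the four metrics are:

$$
\text{recall} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$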

In [8]:
#assuming columns are the actual classes and rows are the predicted classes
def cm_to_2x2(cm, i):
    new = np.zeros(shape=(2,2))
    
    #true positives: predicted i and actually i
    tp = cm[i,i]
    #false negatives: actually i (column i) but predicted as another class
    fn = sum(cm[:,i]) - tp
    #false positives: predicted i (row i) but actually another class
    fp = sum(cm[i,:]) - tp
    #true negatives: everything else
    tn = cm.sum() - tp - fn - fp
    
    #one-vs-rest layout: [[TP, FP], [FN, TN]]
    #set true positives for indicated index into 0,0
    new[0,0] = tp
    #set true negatives for all other classes into 1,1
    new[1,1] = tn
    #set false negatives for indicated index into 1,0
    new[1,0] = fn
    #set false positives for indicated index into 0,1
    new[0,1] = fp
    
    return new

def details_cm(cm, i):
    #convert to 2x2
    cm = cm_to_2x2(cm, i)
    
    #calc
    details = {
        'recall': cm[0,0] / (cm[0,0] + cm[1,0]),
        'specificity': cm[1,1] / (cm[1,1] + cm[0,1]),
        'precision': cm[0,0] / (cm[0,0] + cm[0, 1]),
        'cm_2x2': cm
    }
    details['f1'] = 2 * details['precision'] * details['recall'] / (details['precision'] + details['recall'])
    return details

**(5 points)** Test your implementation using the following confusion matrix:

In [9]:
cm = np.array([29, 1, 0, 2, 23, 1, 1, 3, 30]).reshape(3,3)
cm

array([[29,  1,  0],
       [ 2, 23,  1],
       [ 1,  3, 30]])

In [10]:
for i in range(len(cm)):
    print(f'----class {i}----')
    print(details_cm(cm, i))

----class 0----
{'recall': 0.90625, 'specificity': 0.9827586206896551, 'precision': 0.9666666666666667, 'cm_2x2': array([[29.,  1.],
       [ 3., 57.]]), 'f1': 0.9354838709677419}
----class 1----
{'recall': 0.8518518518518519, 'specificity': 0.9523809523809523, 'precision': 0.8846153846153846, 'cm_2x2': array([[23.,  3.],
       [ 4., 60.]]), 'f1': 0.8679245283018868}
----class 2----
{'recall': 0.967741935483871, 'specificity': 0.9322033898305084, 'precision': 0.8823529411764706, 'cm_2x2': array([[30.,  4.],
       [ 1., 55.]]), 'f1': 0.923076923076923}


**(15 points)** Using the function and confusion matrix above, calculate the micro and macro $F_1$ scores.

In [11]:
classes = len(cm)
classes

3

In [12]:
#micro F1: pool TP, FP, and FN over all classes (no division by the class count)
mf1_TP = sum([details_cm(cm,i)['cm_2x2'][0,0] for i in range(classes)])
mf1_FP = sum([details_cm(cm,i)['cm_2x2'][0,1] for i in range(classes)])
mf1_FN = sum([details_cm(cm,i)['cm_2x2'][1,0] for i in range(classes)])

mf1 = mf1_TP / (mf1_TP + 0.5 * (mf1_FP + mf1_FN))
mf1

0.9111111111111111

In [13]:
#macro F1
Mf1 = sum([details_cm(cm,i)['f1'] for i in range(classes)]) / classes
Mf1

0.9088284407821838
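
As a cross-check, both averages can be computed directly from the confusion matrix. The sketch below assumes the same convention as `cm_to_2x2` (rows are predictions, columns are actual classes) and uses hypothetical variable names. Micro averaging pools the counts over all classes before computing $F_1$, while macro averaging takes the unweighted mean of the per-class $F_1$ scores.

```python
import numpy as np

cm_check = np.array([[29, 1, 0],
                     [2, 23, 1],
                     [1, 3, 30]])

tp = np.diag(cm_check)            # correct predictions per class
fp = cm_check.sum(axis=1) - tp    # row totals minus TP: predicted as the class, but wrong
fn = cm_check.sum(axis=0) - tp    # column totals minus TP: the class was missed

# micro F1: pool TP, FP, FN over classes, then apply the F1 formula once
micro_f1 = tp.sum() / (tp.sum() + 0.5 * (fp.sum() + fn.sum()))

# macro F1: per-class F1 = 2TP / (2TP + FP + FN), then an unweighted mean
per_class_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_class_f1.mean()

print(round(micro_f1, 4), round(macro_f1, 4))
```

In a single-label problem like this one, every misclassification counts once as a false positive and once as a false negative, so the micro $F_1$ equals the overall accuracy, here $82/90 \approx 0.911$.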