# Casting Classification as Regression, Regressing to Probabilities
1. We can turn each classification label into a one-hot vector.
2. We can regress to that vector.
3. To produce an output class, we can take the index of the element with the highest value.
4. The regressed values can be interpreted as (approximate) probabilities.
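
The four steps above can be sketched on a toy example. This is only an illustration: the class ids and the `regressed` array are made-up values standing in for a real model's output, not results from the fish dataset. Raw regression outputs are not guaranteed to lie in [0, 1] or to sum to 1, which is why the probability reading is approximate; the clip-and-renormalise step below is one simple assumption for making them probability-like.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 3 classes (not the fish dataset).
class_ids = np.array([0, 2, 1, 2])
num_classes = 3

# Step 1: one-hot targets, one row per sample.
targets = np.eye(num_classes)[class_ids]

# Step 2 (stand-in): pretend a regressor produced these noisy outputs.
regressed = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.7],
    [0.3, 0.6, 0.1],
    [-0.1, 0.4, 0.8],  # regression outputs can fall outside [0, 1]
])

# Step 3: the predicted class is the index of the largest output.
predicted = regressed.argmax(axis=1)

# Step 4: clip to [0, 1] and renormalise each row so it sums to 1,
# giving values that can be read as approximate probabilities.
clipped = np.clip(regressed, 0.0, 1.0)
approx_probs = clipped / clipped.sum(axis=1, keepdims=True)

print(predicted)
print(approx_probs)
```
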

Regressing to probabilities is a useful trick, especially when we start thinking about confidences and unsupervised data analysis.

[Link to Fish Dataset Details](https://www.kaggle.com/aungpyaeap/fish-market)

In [11]:
import numpy as np
import csv

rows = []

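# utf-8-sig strips the byte-order mark at the start of the file, so the first header reads 'Species'
with open('Fish.csv', newline='', encoding='utf-8-sig') as csv_file: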
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        rows.append(row)

print(len(rows))
print(rows[0]) # first row is a header
print(rows[1])

rows = rows[1:]

labels = {} # Create a dictionary of label strings to numeric values
for row in rows:
    if row[0] not in labels:
        labels[row[0]]=len(labels)

print(labels)
        
inputs = np.array([row[1:] for row in rows])
outputs = np.array([labels[row[0]] for row in rows])
print(outputs)

160
['Species', 'Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width']
['Bream', '242', '23.2', '25.4', '30', '11.52', '4.02']
{'Bream': 0, 'Roach': 1, 'Whitefish': 2, 'Parkki': 3, 'Perch': 4, 'Pike': 5, 'Smelt': 6}
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6
 6 6 6 6 6 6 6 6 6 6 6]


In [15]:
def output_to_one_hot(categories, max_val):
    data = np.zeros((len(categories), max_val))
    data[np.arange(len(categories)), categories] = 1
    return data

encodings = output_to_one_hot(outputs, len(labels))
print(encodings[:10])
print(encodings[-10:])

[[1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]]


In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs, encodings)

# Assignment:
1. Define a network class that regresses to the 7 outputs.
2. Train a sufficiently large network to perform the categorization.
3. Measure the test accuracy of the model by counting the number of correctly predicted labels on the test set.
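
For step 3, one possible way to count correct labels is to compare the argmax of each predicted row with the argmax of its one-hot target. The helper name `one_hot_accuracy` and the example arrays below are hypothetical, chosen only for illustration; in the assignment, the predictions would come from your trained network evaluated on `X_test`, and the targets would be `y_test`.

```python
import numpy as np

def one_hot_accuracy(predictions, targets):
    """Fraction of rows whose largest predicted output matches the one-hot target."""
    predicted_classes = np.asarray(predictions).argmax(axis=1)
    true_classes = np.asarray(targets).argmax(axis=1)
    return float(np.mean(predicted_classes == true_classes))

# Hypothetical network outputs for 4 samples and 3 classes.
example_predictions = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],  # predicted class 0, true class 1: a miss
    [0.1, 0.7, 0.2],
])
example_targets = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
])

accuracy = one_hot_accuracy(example_predictions, example_targets)
print(accuracy)
```

Comparing class indices rather than raw vectors means the regressed values never need to equal exactly 0 or 1 to count as correct.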

# Stretch Goals:
- Test out different network architectures (depth, breadth) and examine training performance.