# Task:

- Implement batch gradient descent with early stopping for softmax regression without using Scikit-Learn, only NumPy. Use it on a classification task such as the iris dataset.

## Softmax Regression

Softmax regression generalizes *Logistic Regression* to support multiple classes.

Given an instance `x`, the model computes a score $s_k(x)$ for each class `k`, then estimates the probability of each class by applying the *softmax function* (normalized exponential) to the scores.

The equation to compute $s_k(x)$ is:

$$
s_k(x) = \left( \theta^{(k)} \right)^T  x
$$

Where:
- $s_k(x)$ is the score of class `k` for instance `x`
- $\theta^{(k)}$ is the parameter vector of class `k`
- $\Theta$ is the parameter matrix whose $k$-th row is $\theta^{(k)}$
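
As a small illustration of the score equation, the sketch below computes $s_k(x)$ for every class at once. The shapes and the values of `Theta` and `x` are made up for the example; they are not taken from the iris data.

```python
import numpy as np

# Hypothetical parameter matrix: one row per class (K = 3), one column per feature (n = 2)
Theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

# A single hypothetical instance with two features
x = np.array([2.0, 3.0])

# s_k(x) = (theta^(k))^T x for every class k, computed as one matrix-vector product
scores = Theta @ x
print(scores)
```

Each entry of `scores` is the dot product of one row of `Theta` with `x`, so the result has one score per class.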

Once we have every class score (also called *logits* or *log-odds*) for the instance `x`, we estimate the probability $\hat{p}_k$ that the instance belongs to class `k` using the softmax function:

$$
\hat{p}_k = \sigma(s(x))_k = \frac{ \exp(s_k(x)) }{ \sum_{j=1}^{K} \exp(s_j(x)) }
$$

Where:
- $K$ is the number of classes
- $s(x)$ is a vector containing the scores of each class for the instance `x`
- $\sigma(s(x))_k$ is the estimated probability that the instance `x` belongs to class `k`
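
To make the softmax equation concrete, here is a minimal sketch for a single score vector. The helper name `softmax_1d` and the score values are assumptions for illustration. Subtracting the maximum score before exponentiating does not change the result (the common factor cancels in the ratio) but avoids overflow for large scores.

```python
import numpy as np

def softmax_1d(scores):
    """Softmax for one score vector; subtracting the max avoids overflow in exp."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([2.0, 3.0, 5.0])  # hypothetical class scores
probs = softmax_1d(scores)
print(probs, probs.sum())
```

The outputs are strictly positive and sum to 1, so they can be read as estimated class probabilities.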

Now, the prediction $\hat{y}$ is the class with the highest estimated probability:

$$
\hat{y} = \underset{k}{\text{argmax}} \, \sigma(s(x))_k = \underset{k}{\text{argmax}} \, s_k(x) = \underset{k}{\text{argmax}} \, \left( \theta^{(k)} \right)^T  x
$$
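
The equality between the argmax of the probabilities and the argmax of the raw scores holds because softmax preserves the ordering of the scores. The sketch below checks this on randomly generated scores, which are an assumption for illustration and not model output.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 3))  # hypothetical scores: 5 instances, 3 classes

# Row-wise softmax (max subtracted for numerical stability)
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

# Softmax is order-preserving, so both argmax calls pick the same class
y_hat_from_probs = np.argmax(probs, axis=1)
y_hat_from_scores = np.argmax(scores, axis=1)
print(np.array_equal(y_hat_from_probs, y_hat_from_scores))
```

In practice this means the softmax can be skipped at prediction time when only the predicted class is needed.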

## Analysis of the problem

**Using only NumPy**

1. Load the iris dataset
1. Divide into train, test and validation sets
1. Scale the data
1. Implement the softmax regression model
1. Implement the batch gradient descent algorithm
1. Implement early stopping
1. Train the model
1. Evaluate the model
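
Steps 4 to 6 can be sketched in a few lines of NumPy. The version below is a minimal sketch, not the notebook's final implementation: the function name `fit`, the hyperparameters `eta`, `n_epochs` and `patience`, and the synthetic data are all assumptions chosen for illustration. It uses the standard cross-entropy gradient for softmax regression, $\nabla_{\Theta} J = \frac{1}{m} X^T (\hat{P} - Y)$, with `Theta` stored as (features, classes), which is the transpose of the row-per-class layout used in the equations above.

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_loss(Y, P, eps=1e-12):
    # Mean over instances of -sum_k y_k * log(p_k); eps guards against log(0)
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

def fit(X_train, Y_train, X_val, Y_val, eta=0.1, n_epochs=5000, patience=50):
    # Hyperparameter defaults are illustrative, not tuned
    m, n = X_train.shape
    Theta = np.zeros((n, Y_train.shape[1]))
    best_loss, best_Theta, wait = np.inf, Theta.copy(), 0
    for epoch in range(n_epochs):
        P = softmax(X_train @ Theta)
        grad = X_train.T @ (P - Y_train) / m   # gradient over the full batch
        Theta -= eta * grad
        val_loss = cross_entropy_loss(Y_val, softmax(X_val @ Theta))
        if val_loss < best_loss:
            best_loss, best_Theta, wait = val_loss, Theta.copy(), 0
        else:
            wait += 1
            if wait >= patience:   # no improvement for `patience` epochs: stop
                break
    return best_Theta, best_loss

# Synthetic, well-separated 3-class data (not the iris data)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = rng.integers(0, 3, size=150)
feats = centers[labels] + rng.normal(scale=0.5, size=(150, 2))
X = np.hstack([np.ones((150, 1)), feats])   # bias column first
Y = np.eye(3)[labels]                       # one-hot labels
Theta, val_loss = fit(X[:100], Y[:100], X[100:], Y[100:])
accuracy = np.mean(np.argmax(X[100:] @ Theta, axis=1) == labels[100:])
print(val_loss, accuracy)
```

Early stopping here keeps a copy of the parameters with the lowest validation loss and returns that copy, so the result does not depend on where the loop happens to end.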

In [2]:
import numpy as np

# Scikit-learn just to load the data
from sklearn.datasets import load_iris

In [98]:
import sys
import os
# Add the parent directory to the Python path
sys.path.append(os.path.abspath(os.path.join('..', 'MyClasses')))

# Import the class from the other directory
import ProcessData

In [67]:
iris = load_iris()

In [68]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [72]:
# Split the data into features and labels
X = iris.data
y = iris.target

In [126]:
# Process using my class
def preprocess(X, y):
	pd = ProcessData.ProcessData()

	# Scale the features (normalize)
	X_p = pd.normalize(X)

	# The dummy feature is a column of ones to account for the bias term
	# Done *after* scaling
	X_p = pd.add_dummy(X_p)

	# One-hot encode the labels
	# Each label becomes a probability vector: 1 for its class, 0 for the others
	y_p = pd.one_hot_encoder(y)

	return X_p, y_p

X_p, y_p = preprocess(X, y)

In [127]:
X_p.shape, y_p.shape

((150, 5), (150, 3))
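
`ProcessData` is a custom class whose source is not shown in this notebook. The functions below are a NumPy-only sketch of what `normalize`, `add_dummy` and `one_hot_encoder` might do. Their bodies are assumptions (standardization to zero mean and unit variance, bias column placed first), not the actual implementation, and the demo arrays are made up.

```python
import numpy as np

def normalize(X):
    # Assumed behaviour: standardize each feature to zero mean and unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)

def add_dummy(X):
    # Prepend a column of ones for the bias term (column position is an assumption)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def one_hot_encoder(y):
    # Row i has a 1 in column y[i] and 0 elsewhere
    n_classes = int(y.max()) + 1
    return np.eye(n_classes)[y]

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # made-up data
y_demo = np.array([0, 2, 1])
X_out = add_dummy(normalize(X_demo))
y_out = one_hot_encoder(y_demo)
print(X_out.shape, y_out.shape)
```

Adding the dummy column after scaling matters: a constant column has zero standard deviation, so scaling it would divide by zero. To avoid leaking information, the mean and standard deviation would ideally be computed on the training set only and reused for the validation and test sets.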

In [129]:
processor = ProcessData.ProcessData()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = processor.split_data(X_p, y_p)

IndexError: index 125 is out of bounds for axis 0 with size 120

In [99]:
def process_split(X, y, test_ratio=0.2, val_ratio=0.2, random_state=42):
	X, y = preprocess(X, y)
	pd = ProcessData.ProcessData()

	X_train, y_train, X_val, y_val, X_test, y_test = pd.split_data(X, y, test_ratio=test_ratio,
																val_ratio=val_ratio, random_state=random_state)
	
	return X_train, y_train, X_val, y_val, X_test, y_test
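
Since `ProcessData.split_data` is not shown, here is a hedged NumPy-only sketch of a three-way split with the same argument names and return order as the call above. The body is an assumption. All indices come from a single permutation and are applied to the original arrays, which avoids indexing an already reduced array with indices from the full range (one plausible cause of an `IndexError` like the one above, although the actual cause is not visible here).

```python
import numpy as np

def split_data(X, y, test_ratio=0.2, val_ratio=0.2, random_state=42):
    # Hypothetical stand-in for ProcessData.split_data
    rng = np.random.default_rng(random_state)
    m = len(X)
    idx = rng.permutation(m)              # one shuffle shared by X and y
    n_test = int(m * test_ratio)
    n_val = int(m * val_ratio)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])

# Made-up arrays with the same number of rows as the iris data
X_demo = np.arange(300, dtype=float).reshape(150, 2)
y_demo = np.arange(150)
X_tr, y_tr, X_va, y_va, X_te, y_te = split_data(X_demo, y_demo)
print(X_tr.shape, X_va.shape, X_te.shape)
```

With 150 rows and both ratios at 0.2, this yields 90 training, 30 validation and 30 test instances, with no row appearing in more than one set.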

In [None]:
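def softmax(logits):
	# Subtract the row-wise max before exponentiating to avoid overflow;
	# the shift cancels in the ratio, so the result is unchanged
	exp = np.exp(logits - np.max(logits, axis=1, keepdims=True))
	exp_sum = np.sum(exp, axis=1, keepdims=True)
	return exp / exp_sum

def cross_entropy_loss(y_k, p_k):
	# y_k: one-hot labels, p_k: predicted probabilities, both of shape (m, K)
	m = len(y_k)
	# The small epsilon guards against log(0)
	J = -1/m * np.sum(y_k * np.log(p_k + 1e-7))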
	