In [5]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

## Dataset
> The dataset is from: https://www.kaggle.com/datasets/uciml/mushroom-classification

In [20]:
df = pd.read_csv('mushrooms.csv')
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [124]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

#### One-hot encoding to ensure equal treatment of each category in categorical features

In [21]:
df_ohe = pd.get_dummies(df)
df_ohe.head()

Unnamed: 0,class_e,class_p,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0,1,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,1,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
4,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
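
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` data is invented for illustration and is not taken from the mushroom dataset). It also shows that the two `class_*` columns are complementary, so one of them is enough as a target.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns (not the real dataset)
toy = pd.DataFrame({"class": ["p", "e", "e"], "odor": ["p", "a", "l"]})

# dtype=int forces 0/1 columns; recent pandas versions default to booleans
toy_ohe = pd.get_dummies(toy, dtype=int)

# One column is created per (feature, category) pair
print(toy_ohe.columns.tolist())

# class_e and class_p always sum to 1, so each fully determines the other
print((toy_ohe["class_e"] + toy_ohe["class_p"]).tolist())
```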


#### Neural Network
1. Choose the number of layers and neurons, and create the corresponding weights and biases
2. Choose an activation function
3. Choose a cost function
4. Implement forward propagation
5. Implement back propagation

#### 1) Creating neural network 

In [99]:
input_layer_size  = 118  # 118 category features
hidden_layer_size = 20   # 20 neurons
output_layer_size = 1          # 2 labels (p = 0, e = 1), which correspond to 1 output neuron
np.random.seed(1)
w1 = np.random.randn(input_layer_size, hidden_layer_size)
b1 = np.random.randn(hidden_layer_size)
w2 = np.random.randn(hidden_layer_size, output_layer_size)
b2 = np.random.randn(output_layer_size)
print(w1.shape,b1.shape,w2.shape,b2.shape)

(118, 20) (20,) (20, 1) (1,)


![image-2.png](attachment:image-2.png)

#### 2) Choose activation function
> The sigmoid activation function is chosen
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

In [122]:
def sig(x):
    return 1/(1 + np.exp(-x))
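
A quick sanity check of the sigmoid can help before building on it. The sketch below redefines `sig` so it runs on its own and evaluates it at a few hand-picked points (the values in `x` are arbitrary choices for illustration).

```python
import numpy as np

def sig(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])  # arbitrary test points
s = sig(x)

# sig(0) is exactly 0.5, every output lies strictly between 0 and 1,
# and the curve is symmetric: sig(-x) == 1 - sig(x)
print(s)
print(np.allclose(sig(-x), 1 - s))
```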

#### 3) Choose cost function


>Various cost functions and Notation: 
https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications#:~:text=A%20cost%20function%20is%20a,network%20did%20as%20a%20whole.

>Mean squared error is chosen as the cost function
$$ \text{mse} = \frac{1}{2n} \sum_{i=1}^{n} (y_{i,act} - y_{i,pred})^2$$

In [123]:
def mse(y_pred,y_act):
    return 0.5 * np.mean((y_act - y_pred)**2)
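
As a worked example of the formula, take three hypothetical samples (the numbers are made up for illustration). The errors are 0.1, -0.2 and 0.4, so the squared errors sum to 0.21 and the cost is 0.21 / (2 · 3) = 0.035.

```python
import numpy as np

def mse(y_pred, y_act):
    return 0.5 * np.mean((y_act - y_pred) ** 2)

y_act = np.array([1.0, 0.0, 1.0])   # hypothetical labels
y_pred = np.array([0.9, 0.2, 0.6])  # hypothetical network outputs

loss = mse(y_pred, y_act)
print(loss)  # approximately 0.035
```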

#### 4) Batch Forward Propagation

> We can relate each layer's activations to those of the previous layer via the following relation:
$$a^i_j=\sigma\left(\sum_k(w^i_{jk} \cdot a^{i-1}_k)+b^i_j\right)$$

where

σ is the activation function,

$w^i_{jk}$ is the weight from the kth neuron in the (i−1)th layer to the jth neuron in the ith layer,

$b^i_j$ is the bias of the jth neuron in the ith layer, and

$a^i_j$ represents the activation value of the jth neuron in the ith layer.
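
The relation above can be sketched for a single layer in NumPy, once with the explicit sum over $k$ and once as a matrix product. The layer sizes and random values are assumptions chosen only for illustration. Here `W[j, k]` follows the index order of the formula; the notebook stores its weights transposed, as (inputs, outputs), which is why it multiplies with the data on the left.

```python
import numpy as np

def sig(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
a_prev = rng.random(3)            # a^{i-1}_k for a hypothetical 3-neuron layer
W = rng.standard_normal((2, 3))   # W[j, k]: weight from neuron k to neuron j
b = rng.standard_normal(2)        # b[j]: bias of neuron j

# Explicit form: one sum over k per output neuron j
a_loop = np.array([
    sig(sum(W[j, k] * a_prev[k] for k in range(3)) + b[j])
    for j in range(2)
])

# Vectorized form: the same sums expressed as a matrix-vector product
a_vec = sig(W @ a_prev + b)

print(np.allclose(a_loop, a_vec))
```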


In [125]:
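# forward propagation for a 2-layer NN
def forward_prob(df,w1,b1,w2,b2):
    # column 0 holds the label; the remaining columns form the feature
    # matrix X with shape (rows, columns-1)
    # caution: if df is df_ohe, column 1 is class_p, the complement of the
    # label class_e, so keeping it in X leaks the target
    X = df.iloc[:,1:].to_numpy(dtype=float)
    y_act = df.iloc[:,0].to_numpy(dtype=float).reshape(-1,1)
    # X (rows, columns-1) times w1 (columns-1, hidden_layer_size) gives the
    # hidden layer's linear input (rows, hidden_layer_size); the bias is
    # added before the activation function
    Z1 = X.dot(w1) + b1
    A1 = sig(Z1)
    # the output layer takes the hidden activations A1 (not Z1) as its input
    Z2 = A1.dot(w2) + b2
    y_pred = sig(Z2)
    # y_act was reshaped to (rows, 1) so it lines up with y_pred elementwise
    loss = mse(y_pred,y_act)
    return Z1,A1,Z2,y_pred, loss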

#### 5) Back Propagation

> Before implementing back propagation, the following derivatives must be calculated:
1. Derivative of the loss function with respect to the output activation of sample $i$ (note the sign: differentiating $(y_{i,act} - y_{i,pred})^2$ with respect to $y_{i,pred}$ introduces a factor of $-1$) $$\frac{\partial C}{\partial A_i} = \frac{1}{n}(y_{i,pred} - y_{i,act})$$

2. Derivative of the activation function with respect to the linear function $$\frac{dA}{dZ} = \frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z)) = A(1-A)$$
3. Derivative of the linear function with respect to the weights $$\frac{dZ}{dW} = X, \quad \text{since } Z = X \cdot W + b$$
4. Derivative of the loss function with respect to the linear function (delta) $$\delta_i = \frac{\partial C}{\partial Z_i} = \frac{\partial C}{\partial A_i} \cdot \frac{dA_i}{dZ_i} = \frac{1}{n}(y_{i,pred} - y_{i,act})\,\sigma(z_i)(1-\sigma(z_i))$$
*Note that this is only valid for the last layer, as other layers don't have $y_{pred}$ as output*

5. Derivative of the loss function with respect to the weights, summing the contributions of all $n$ samples $$\frac{\partial C}{\partial W^l} = \sum_{i=1}^n \delta_i \frac{dZ_i}{dW^l} = \frac{1}{n}\sum_{i=1}^n (\sigma(z^l_i) - y_{i,act})\,\sigma(z^l_i)(1-\sigma(z^l_i))\,a^{l-1}_i, \quad z^l_i = a^{l-1}_i w^l + b^l$$
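
One way to gain confidence in the output-layer gradient is to compare it against a finite-difference estimate. The sketch below is a minimal check under assumed toy shapes (5 samples, 3 inputs, 1 output); `cost`, `A_prev` and the random data are hypothetical names and values introduced for illustration. Note the sign used in `delta`: differentiating $(y_{act} - y_{pred})^2$ with respect to $y_{pred}$ yields $-(y_{act} - y_{pred}) = y_{pred} - y_{act}$.

```python
import numpy as np

def sig(x):
    return 1 / (1 + np.exp(-x))

def cost(W, b, A_prev, y_act):
    # Same cost as the notebook's mse: 1/(2n) * sum of squared errors
    y_pred = sig(A_prev @ W + b)
    return 0.5 * np.mean((y_act - y_pred) ** 2)

rng = np.random.default_rng(0)
n = 5
A_prev = rng.random((n, 3))                          # hypothetical layer input
y_act = rng.integers(0, 2, size=(n, 1)).astype(float)
W = rng.standard_normal((3, 1))
b = rng.standard_normal(1)

# Analytic gradient: delta = dC/dA * dA/dZ, then dC/dW = A_prev^T . delta
y_pred = sig(A_prev @ W + b)
delta = (y_pred - y_act) * y_pred * (1 - y_pred) / n
grad_W = A_prev.T @ delta

# Numerical gradient via central differences, one weight at a time
eps = 1e-6
grad_num = np.zeros_like(W)
for j in range(W.shape[0]):
    step = np.zeros_like(W)
    step[j, 0] = eps
    grad_num[j, 0] = (cost(W + step, b, A_prev, y_act)
                      - cost(W - step, b, A_prev, y_act)) / (2 * eps)

print(np.allclose(grad_W, grad_num, atol=1e-8))
```

If the two gradients disagree, the usual suspects are a flipped sign or a missing $1/n$ factor.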

Using Taylor's theorem, $\Delta C$ can be approximated with first-order derivatives only:
$$\Delta C \approx \sum_{j} \frac{\partial C}{\partial W_j} \Delta W_j = \nabla C \cdot \Delta W$$

Choosing $\Delta W = -\nabla C$ ensures that $\Delta C$ is never positive, since $\nabla C \cdot \nabla C = |\nabla C|^2$ is the squared magnitude, which is non-negative. However, the first-order approximation only holds for small steps and $\nabla C$ can be very large, so a learning rate hyperparameter $\eta$ is needed to keep $\Delta W$ small. With $\Delta W = -\eta \nabla C$, the change in cost becomes:
$$\Delta C \approx -\eta |\nabla C|^2$$

Then $$\Delta W = - \eta \nabla C = - \eta \frac{\partial C}{\partial W}$$
$$W_{k+1} = W_k + \Delta W_k = W_k - \eta \frac{\partial C}{\partial W_k}$$
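
The update rule can be sketched as a small training loop. This is a minimal illustration on a single sigmoid layer with synthetic data, not the notebook's full two-layer network; the feature matrix `X`, the toy target, the learning rate `eta = 0.5` and the step count are all assumptions chosen for the example.

```python
import numpy as np

def sig(x):
    return 1 / (1 + np.exp(-x))

def mse(y_pred, y_act):
    return 0.5 * np.mean((y_act - y_pred) ** 2)

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 4)).astype(float)  # hypothetical 0/1 features
y = X[:, [0]]                                         # toy target: copy of feature 0
W = rng.standard_normal((4, 1))
b = np.zeros(1)
eta = 0.5                                             # assumed learning rate

losses = []
for step in range(200):
    y_pred = sig(X @ W + b)
    losses.append(mse(y_pred, y))
    # delta = dC/dZ for the output layer (includes the 1/n factor)
    delta = (y_pred - y) * y_pred * (1 - y_pred) / len(X)
    # W_{k+1} = W_k - eta * dC/dW, and likewise for the bias
    W = W - eta * (X.T @ delta)
    b = b - eta * delta.sum(axis=0)

print(losses[0], losses[-1])
```

With a small enough learning rate the recorded loss should shrink from step to step, which is the behaviour the first-order argument above predicts.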