# Perceptron - MLP

<img src="images/Perceptron architecture.png" style="width:500px; height:200px;">

* **Input Data** : Input data consists of the information or features you want the perceptron to make decisions about. The different features of the input are represented as x₁, x₂, x₃, ..., xₘ. Each piece of input data is represented as a numerical value.
* **Weights** : Weights are coefficients assigned to each input. They represent the importance or impact of each input on the perceptron's decision. Weights can be positive or negative and are adjusted during the learning process. 
* **Weighted Summation / Net input function** : The perceptron multiplies each input by its corresponding weight and adds these products together. The result is the weighted sum, also called the net input.
    weighted_sum = (x₁ * w₁) + (x₂ * w₂) + (x₃ * w₃) + ... + (xₘ * wₘ)
* **Activation Function** : The weighted sum is passed through an activation function, which determines the perceptron’s output. Common activation functions include:

    * Step Function: Outputs 1 if the weighted sum is above a certain threshold, and 0 otherwise.
    * Sigmoid Function: Outputs a value between 0 and 1, representing the probability of activation.
    * Rectified Linear Unit (ReLU): Outputs the weighted sum if it’s positive, and 0 otherwise.
* **Decision** : Based on the result of the activation function, the perceptron makes a decision or prediction. For binary classification tasks, this decision is often binary, such as 0 or 1.
    * If weighted_sum ≥ θ, then y = 1; otherwise (weighted_sum < θ), y = 0. Here θ is the decision threshold.
* **Learning** : The perceptron can learn from its mistakes through a learning algorithm using an optimizer. If the decision is incorrect, the weights are adjusted to improve future decisions. Common learning algorithms include the Perceptron Learning Rule and the Delta Rule (Widrow-Hoff learning).
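
The steps above (weighted sum, step activation, decision, and weight update) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Keras model used later in this notebook: the function names `step`, `predict`, and `train`, the learning rate, the epoch count, and the AND-gate toy data are all chosen here for illustration. A bias term stands in for the threshold (bias = -θ), so the step function compares against 0.

```python
def step(weighted_sum):
    """Step activation: 1 if the weighted sum is at or above 0, else 0."""
    return 1 if weighted_sum >= 0 else 0


def predict(x, weights, bias):
    """Compute the weighted sum of the inputs plus the bias, then apply the step function."""
    weighted_sum = sum(x_i * w_i for x_i, w_i in zip(x, weights)) + bias
    return step(weighted_sum)


def train(samples, labels, learning_rate=1.0, epochs=10):
    """Perceptron Learning Rule: w <- w + learning_rate * (target - prediction) * x."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(x, weights, bias)  # 0 when the decision is correct
            weights = [w + learning_rate * error * x_i for w, x_i in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Toy data (assumed for illustration): logical AND of two binary inputs
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

weights, bias = train(X, y)
predictions = [predict(x, weights, bias) for x in X]
print("weights:", weights, "bias:", bias)
print("predictions:", predictions)
```

The weights only change when a prediction is wrong, so once every sample is classified correctly the updates stop. This works here because AND is linearly separable; a single perceptron cannot learn a problem such as XOR.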


In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

from keras.utils import to_categorical

# Step 1: Create a Sequential model
model = Sequential()

# Step 2: Add input layer (with 4 input features) and first hidden layer
model.add(Dense(units=8, activation='relu', input_dim=4))

# Step 3: Add a second hidden layer
model.add(Dense(units=6, activation='relu'))

# Step 4: Add the output layer (with 3 output classes and softmax activation for classification)
model.add(Dense(units=3, activation='softmax'))

# Step 5: Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Step 6: Generate some example data
X_train = np.random.rand(100, 4)  # Input features
y_train = np.random.randint(0, 3, 100)  # Output labels (3 classes)
y_train = to_categorical(y_train, num_classes=3)  # One-hot encode the labels

# Step 7: Train the model
model.fit(X_train, y_train, epochs=10, batch_size=10)

# Step 8: Save the trained model to a file
# model.save('my_trained_model.h5')

In [None]:
from keras.models import load_model

# model = load_model('my_trained_model.h5')  # Replace with the actual path to your saved model
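# Wrap the sample in a NumPy array of shape (1, 4): one sample, four features
output = model.predict(np.array([[1, 2, 3, 4]]))

# Get the class label with the highest probability for the single sample
predicted_class = np.argmax(output, axis=1)[0]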

# Print the predicted class label
print("Predicted Class Label:", predicted_class)

# Types of Activation Functions

* **Sigmoid function** :
    * also known as the logistic function; it is a non-linear function
    * formula : **σ(x) = 1 / (1 + e^(-x))**
    * The function maps any real number to a value between 0 and 1.
    * Graph : 
        * The sigmoid function produces an **S-shaped curve**. It approaches 0 as x → -∞, rises monotonically through 0.5 at x = 0, and approaches 1 as x → +∞.
        * The derivative, σ'(x) = σ(x)(1 - σ(x)), takes values in the range **(0, 0.25]**, with a maximum of exactly 0.25 at x = 0.
        * The derivative traces a **bell-shaped curve**: it is highest where the sigmoid is steepest (x = 0) and falls towards 0 as |x| grows.
        <img src="images/Perceptron architecture.png" style="width:500px; height:200px;">
    * Use Cases:
        * **Binary Classification** : It squashes the network's raw output to a probability-like value between 0 and 1
    * Advantages:
        * **Smooth Gradient**: It leads to more stable convergence during training.
        * **Output Range** : The output is bounded between 0 and 1, which is useful in the context of probabilities.
    * Disadvantages:
        * **Vanishing Gradients**: For inputs of large magnitude the derivative is close to 0, so gradients shrink as they are propagated back through many layers.
        * **Output Not Centered at Zero**: Because the outputs are always positive, the inputs passed to neurons in subsequent layers are biased in one direction, which can slow down learning in some cases.
        * **Not Sparse** : Sigmoid neurons always produce some non-zero activation regardless of the input, which may not be efficient for certain tasks.
    * Due to the vanishing gradient issue and the availability of better activation functions like ReLU and its variants, sigmoid functions are less commonly used in hidden layers of deep neural networks today.
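
The sigmoid properties listed above can be checked numerically with a short sketch. It uses only the standard library; the helper names `sigmoid` and `sigmoid_derivative` and the sample inputs are chosen here for illustration. The derivative uses the identity σ'(x) = σ(x)(1 - σ(x)).

```python
import math


def sigmoid(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid, computed as sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Evaluate at a few points to see the S-shaped output and the bell-shaped derivative
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  derivative={sigmoid_derivative(x):.5f}")
```

The derivative peaks at x = 0 and is nearly 0 at x = ±10, which is the vanishing gradient behaviour: a neuron whose input is far from 0 passes back almost no gradient.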
* **Tanh Function** :
    * squashes input values to a range between -1 and 1.
    * 