# Problem: Binary Classification of Iris Flowers

1. Dataset: Use the Iris dataset, a popular dataset in machine learning. It contains four features: sepal length, sepal width, petal length, and petal width, along with the species of iris (Setosa, Versicolor, or Virginica).

2. Objective: We will simplify the problem to binary classification by considering only two classes: Setosa and Versicolor. You will train the perceptron to distinguish between these two classes based on the provided features.

3. Steps:

    a) Load the Iris dataset.

    b) Preprocess the data: Since we're considering only two classes, Setosa and Versicolor, you can select only the corresponding rows from the dataset and use only the first two features (sepal length and sepal width) for simplicity.
    
    c) Implement the perceptron algorithm to learn a decision boundary that separates the two classes.
    Train the perceptron on a portion of the dataset.
    
    d) Test the perceptron on the remaining portion of the dataset and evaluate its performance (e.g., accuracy).
    
    e) Evaluation: You can evaluate the performance of your perceptron algorithm by calculating the accuracy, which is the proportion of correctly classified instances over the total number of instances.

4. Extension: Once you have a basic perceptron working, you can experiment with different aspects such as learning rate, number of epochs, feature scaling, or even extend it to handle multiple classes using techniques like one-vs-all or one-vs-one.

5. This exercise will allow you to test your perceptron algorithm in a simple binary classification task and assess its performance.
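The one-vs-all extension mentioned in step 4 could be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy data, and the use of labels in {-1, +1} are hypothetical choices made here, and they differ from the 0/1 labels used in the rest of this notebook.

```python
import numpy as np

def train_binary_perceptron(X, y_signed, epochs=100, learning_rate=0.1):
    """Train one perceptron on labels in {-1, +1}; returns weights with the bias first."""
    X_bias = np.column_stack((np.ones(len(X)), X))
    w = np.zeros(X_bias.shape[1])
    for _ in range(epochs):
        for x, y in zip(X_bias, y_signed):
            # Update only on mistakes (a zero score counts as a mistake).
            if y * np.dot(w, x) <= 0:
                w = w + learning_rate * y * x
    return w

def train_one_vs_all(X, y, n_classes, **kwargs):
    """Train one 'class k versus the rest' perceptron per class."""
    return np.array([
        train_binary_perceptron(X, np.where(y == k, 1, -1), **kwargs)
        for k in range(n_classes)
    ])

def predict_one_vs_all(X, weights):
    """Pick, for each sample, the class whose perceptron gives the highest score."""
    X_bias = np.column_stack((np.ones(len(X)), X))
    return np.argmax(X_bias @ weights.T, axis=1)

# Hypothetical toy data: three well-separated clusters in two dimensions.
X_toy = np.array([[0.0, 2.0], [0.5, 2.5], [-0.5, 2.2],
                  [-2.0, -1.0], [-2.5, -1.5], [-2.2, -0.8],
                  [2.0, -1.0], [2.5, -1.5], [2.2, -0.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

ovr_weights = train_one_vs_all(X_toy, y_toy, n_classes=3)
toy_predictions = predict_one_vs_all(X_toy, ovr_weights)
print(toy_predictions)
```

On this toy data each class is linearly separable from the other two, so every perceptron converges and the predictions reproduce `y_toy`. On the full Iris dataset, Versicolor and Virginica overlap, so the same approach should not be expected to reach perfect accuracy.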

## $1^{\underline{st}}$ stage: Preprocessing

In [52]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(dataset_path, column_names):

    # Load the dataset using Pandas.
    data = pd.read_csv(dataset_path, names=column_names)

    return data

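def map_labels_to_int(data, class_column_name, label_0, label_1):

    # Keep only the rows of the two selected classes, then map their labels to integers
    # (e.g., 'category1' -> 0, 'category2' -> 1). Without the filter, rows of any
    # other class would be mapped to NaN.
    class_mapping = {label_0: 0, label_1: 1}
    data = data[data[class_column_name].isin(list(class_mapping))].copy()
    data[class_column_name] = data[class_column_name].map(class_mapping)

    return data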

def extract_features_and_labels(data, class_column_name, features):
    
    # Extract features and labels.
    X = data[features].values
    y = data[class_column_name].values

    return X, y

def normalize_features(X):
    
    # Standardize each feature to zero mean and unit variance.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    return X_scaled

def split_test_train_sets(X_scaled, y, test_cardinality, randomness_status):

    # Split the dataset into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=test_cardinality, random_state=randomness_status)

    return X_train, X_test, y_train, y_test


def preprocess_data(dataset_path, column_names, class_column_name, label_0, label_1, features, test_cardinality, randomness_status):
    
    data = load_data(dataset_path, column_names)

    data = map_labels_to_int(data, class_column_name, label_0, label_1)

    X, y = extract_features_and_labels(data, class_column_name, features)

    X_scaled = normalize_features(X)

    X_train, X_test, y_train, y_test = split_test_train_sets(X_scaled, y, test_cardinality, randomness_status)

    # Print the shapes of training and testing sets
    print("X_train shape:", X_train.shape)
    print("X_test shape:", X_test.shape)
    print("y_train shape:", y_train.shape)
    print("y_test shape:", y_test.shape)

    return X_train, X_test, y_train, y_test


dataset_path = 'iris/iris.data'
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

class_column_name = 'class'
label_0 = 'Iris-versicolor'
label_1 = 'Iris-setosa'
features = ['sepal_length', 'sepal_width']

test_cardinality = 0.2
randomness_status = 42


X_train, X_test, y_train, y_test = preprocess_data(dataset_path, column_names, class_column_name, label_0, label_1, features, test_cardinality, randomness_status)

X_train shape: (80, 2)
X_test shape: (20, 2)
y_train shape: (80,)
y_test shape: (20,)
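The preprocessing above fits `StandardScaler` on the whole dataset before splitting. A common variant computes the scaling statistics on the training portion only, so that the test set does not influence them. The sketch below shows that idea with plain NumPy; the function name and the sample values are hypothetical and not part of the notebook.

```python
import numpy as np

def standardize_with_train_statistics(X_train_raw, X_test_raw):
    """Scale both sets using the mean and standard deviation of the training set only."""
    mean = X_train_raw.mean(axis=0)
    std = X_train_raw.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
    return (X_train_raw - mean) / std, (X_test_raw - mean) / std

# Hypothetical raw measurements (sepal length, sepal width), for illustration only.
X_train_raw = np.array([[5.0, 3.0], [6.0, 3.4], [7.0, 2.6]])
X_test_raw = np.array([[6.0, 3.0]])

X_train_std, X_test_std = standardize_with_train_statistics(X_train_raw, X_test_raw)
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

After scaling, each training column has mean 0 and standard deviation 1 (up to rounding). The test sample sits exactly at the training mean here, so it maps to approximately zero in both columns.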


## $2^{\underline{nd}}$ stage: Training
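We adopt the following decision rule, where $x$ is the feature vector extended with a leading 1 for the threshold:
 - if the dot product $weight \cdot x \geq 0$, then $y$ is predicted as versicolor (0)
 - if the dot product $weight \cdot x < 0$, then $y$ is predicted as setosa (1)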
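
Under this convention, the standard perceptron update can be written as below. The notation is introduced here for illustration: $w$ stands for `weight_vector`, $\tilde{x} = (1, x_1, x_2)$ for the input with the threshold element prepended, and $\eta$ for `learning_rate`.

$$
w \leftarrow
\begin{cases}
w + \eta\,\tilde{x} & \text{if } y = 0 \text{ and } w \cdot \tilde{x} < 0 \\
w - \eta\,\tilde{x} & \text{if } y = 1 \text{ and } w \cdot \tilde{x} \geq 0 \\
w & \text{otherwise}
\end{cases}
$$

Each update moves the score of the misclassified sample in the right direction, since $(w \pm \eta\,\tilde{x}) \cdot \tilde{x} = w \cdot \tilde{x} \pm \eta\,\lVert \tilde{x} \rVert^2$.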

In [53]:
def train_data(X_train, y_train, number_of_epochs=100, learning_rate=0.1):
    
    # Initialize weight_vector with one entry per input feature plus 1 element for the threshold.
    number_of_features = X_train.shape[1] + 1
    weight_vector = np.ones(number_of_features)


    for epoch in range(number_of_epochs):
        for x, y in zip(X_train, y_train):

            # Insertion of an element [1] in x in order to enable its inner product with weight_vector.
            x_with_threshold = np.concatenate(([1], x))
            dot_product = np.dot(weight_vector, x_with_threshold)

            # Misclassified observation of class 0 (score should be >= 0): move weight_vector towards x.
            if y == 0 and dot_product < 0:
                weight_vector = weight_vector + learning_rate*x_with_threshold

            # Misclassified observation of class 1 (score should be < 0): move weight_vector away from x.
            elif y == 1 and dot_product >= 0:
                weight_vector = weight_vector - learning_rate*x_with_threshold
        
    return weight_vector
    
trained_weight_vector = train_data(X_train, y_train)


## $3^{\underline{rd}}$ stage: Validation

In [54]:
def generate_hypothesis_image(X_test, trained_weight_vector):

    # Addition of threshold column to X_test
    X_test_with_threshold = np.column_stack((np.ones(len(X_test)), X_test))

    # Calculation of Inner products and putting them inside a set called inner_products_set
    inner_products_set = np.dot(X_test_with_threshold, trained_weight_vector)

    # Creation of set with labels according to hypothesis function (trained_weight_vector)
    hypothesis_image = np.where(inner_products_set >= 0, 0, 1)

    return hypothesis_image

def generate_accuracy(hypothesis_image, y_test):

    # Calculation of accuracy verifying percentage of correspondence between hypothesis_image and actual image (y_test).
    accuracy = np.mean(hypothesis_image == y_test)

    return accuracy

def validate_data(X_test, y_test, trained_weight_vector):

    hypothesis_image = generate_hypothesis_image(X_test, trained_weight_vector)

    accuracy = generate_accuracy(hypothesis_image, y_test)
    print("Perceptron accuracy:", accuracy)

    return hypothesis_image, accuracy


hypothesis_image, accuracy = validate_data(X_test, y_test, trained_weight_vector)

Perceptron accuracy: 1.0
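
With two features, the learned weights describe a straight line in the (standardized) feature plane: the points where $w_0 + w_1 x_1 + w_2 x_2 = 0$. A minimal sketch of how that line could be computed, for example to plot it over the test points, is shown below. The function name and the example weights are hypothetical and are not the weights learned above.

```python
def boundary_x2(weight_vector, x1):
    """Return the x2 value on the decision boundary w0 + w1*x1 + w2*x2 = 0 for a given x1."""
    w0, w1, w2 = weight_vector
    if w2 == 0:
        raise ValueError("The boundary is vertical (w2 == 0); solve for x1 instead.")
    return -(w0 + w1 * x1) / w2

# Hypothetical weights, chosen only to make the arithmetic easy to follow.
example_weights = [1.0, 2.0, -4.0]
x2_on_boundary = boundary_x2(example_weights, x1=1.0)
print(x2_on_boundary)  # 0.75
```

Substituting back gives $1 + 2 \cdot 1 - 4 \cdot 0.75 = 0$, which confirms that the point lies on the boundary.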


## $4^{\underline{th}}$ stage: Storage

In [55]:
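import os

def storage_weight_vector(weight_vector):

    # Save the model (trained weight vector) to the file 'perceptron_weights.npy' inside the directory 'model',
    # creating the directory first if it does not exist (np.save does not create it).
    os.makedirs('model', exist_ok=True)
    np.save('model/perceptron_weights.npy', weight_vector)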

storage_weight_vector(trained_weight_vector)

## $5^{\underline{th}}$ stage: Prediction
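
One possible shape for this stage is sketched below: load a stored weight vector with `np.load` and apply the same decision rule used in validation. The `predict` function, the example weights, and the sample values are hypothetical. The sketch also assumes the new samples are already standardized with the same statistics as the training data, which the notebook does not currently store.

```python
import os
import tempfile

import numpy as np

def predict(X_scaled, weight_vector):
    """Label samples with the notebook's rule: score >= 0 -> 0 (versicolor), score < 0 -> 1 (setosa)."""
    X_with_threshold = np.column_stack((np.ones(len(X_scaled)), X_scaled))
    scores = np.dot(X_with_threshold, weight_vector)
    return np.where(scores >= 0, 0, 1)

# Hypothetical weights, saved to a temporary file so that the example runs on its own;
# in the notebook the file would be 'model/perceptron_weights.npy'.
with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, 'perceptron_weights.npy')
    np.save(weights_path, np.array([0.0, 1.0, 0.0]))
    loaded_weights = np.load(weights_path)

# Hypothetical new samples, assumed to be already standardized.
new_samples = np.array([[1.0, 0.0], [-1.0, 0.0]])
predictions = predict(new_samples, loaded_weights)
print(predictions)
```

With these example weights the score equals the first feature, so the first sample is labeled 0 and the second is labeled 1. To predict from raw measurements, the scaler's mean and standard deviation would need to be saved alongside the weights.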



In [56]:
# TODO: write a script that predicts the class of new data using the weight vector found earlier.
# Open question: should it stay in this notebook, or go into a separate file (.ipynb or .py)?