# Perceptron

In [1]:
import pandas as pd
import numpy as np
from collections import Counter

## Implementation

Based on the implementation provided by the teacher.

In [2]:
class Perceptron:

    def __init__(self, learning_rate: float, input_size: int):
        self.learning_rate = learning_rate
        self.weights = np.zeros(input_size)
        # We chose to keep the bias as its own variable
        self.bias = 0

    def train(self, training_data: np.ndarray, binary_class: np.ndarray):
        # In this case we update the weights every time we evaluate a data vector
        for vector, binary in zip(training_data, binary_class):
            delta_error = (binary - self.predict(vector)) * self.learning_rate
            self.weights += vector * delta_error
            self.bias += delta_error

    def test(self, test_data, binary_class):
        # A simple test that prints the number of correct and wrong predictions on the
        # test data and the fraction of correct ones (the accuracy, called precision here)
        answer_evaluation = [int(self.predict(vector)) == binary
                             for vector, binary in zip(test_data, binary_class)]

        summarize = Counter(answer_evaluation)
        precision = sum(answer_evaluation) / len(answer_evaluation)

        print(f'The test gave {summarize[True]} correct prediction '
              + f'and {summarize[False]} wrong prediction '
              + f'having a precision of {precision}')

    def predict(self, vector):
        # The prediction is the dot product of the input and the weights, plus the bias,
        # thresholded at zero
        calculation = np.dot(vector, self.weights) + self.bias
        return 1 * (calculation > 0)
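
The `train` method above applies what is usually called the perceptron learning rule. As a sketch, in notation that is not part of the original notebook ($\eta$ for `learning_rate`, $\mathbf{x}$ for `vector`, $y$ for `binary`, $\mathbf{w}$ for `weights`, $b$ for `bias`), each sample updates the parameters as:

```latex
\hat{y} =
\begin{cases}
1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad
\mathbf{w} \leftarrow \mathbf{w} + \eta \, (y - \hat{y}) \, \mathbf{x}
\qquad
b \leftarrow b + \eta \, (y - \hat{y})
```

When the prediction is correct, $y - \hat{y} = 0$ and nothing changes; otherwise the weights move by $\pm \eta \, \mathbf{x}$ and the bias by $\pm \eta$.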


## Dataset Processing

We use the Titanic passenger survival dataset provided by [Kaggle](https://www.kaggle.com/competitions/titanic). The dataset includes a series of characteristics about each passenger and whether or not they survived.

In [3]:
df = pd.read_csv('titanic/train.csv')
df.set_index('PassengerId', inplace=True)
# We get rid of the name, ticket identification and cabin identification,
# as they are string characteristics that we don't know how to handle
df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
# We get rid of all the rows with missing values
df.dropna(inplace=True)
# We map sex to boolean values
df.Sex = df.Sex.apply(lambda x: int(x == 'female'))
# We map the port of embarkation to numeric values
df.Embarked = df.Embarked.apply(lambda x: {'S': 1, 'Q': 2, 'C': 3}[x])
# We turn age into an integer value
df.Age = df.Age.apply(int)
df

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 0 | 22 | 1 | 0 | 7.2500 | 1 |
| 2 | 1 | 1 | 1 | 38 | 1 | 0 | 71.2833 | 3 |
| 3 | 1 | 3 | 1 | 26 | 0 | 0 | 7.9250 | 1 |
| 4 | 1 | 1 | 1 | 35 | 1 | 0 | 53.1000 | 1 |
| 5 | 0 | 3 | 0 | 35 | 0 | 0 | 8.0500 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 3 | 1 | 39 | 0 | 5 | 29.1250 | 2 |
| 887 | 0 | 2 | 0 | 27 | 0 | 0 | 13.0000 | 1 |
| 888 | 1 | 1 | 1 | 19 | 0 | 0 | 30.0000 | 1 |
| 890 | 1 | 1 | 0 | 26 | 0 | 0 | 30.0000 | 3 |
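
The categorical encodings used in the cell above can also be written with a boolean comparison and `Series.map`. A minimal sketch on a small hypothetical frame (the `sample` rows are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Boolean comparison cast to int: 1 for female, 0 for male
sample['Sex'] = (sample['Sex'] == 'female').astype(int)
# Same port codes as the notebook; a port missing from the dict
# would become NaN here instead of raising a KeyError
sample['Embarked'] = sample['Embarked'].map({'S': 1, 'Q': 2, 'C': 3})

print(sample)
```

Note that the codes 1, 2 and 3 impose an order on the ports that has no real meaning; they are kept here only to match the notebook.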


In [4]:
# We randomly select 80% of the set as our train set and keep the remaining 20% as our test set
df_train = df.sample(frac=.8)
df_test = df[~ df.index.isin(df_train.index)]

# We separate the class labels (Survived) from the feature data
df_train_class = df_train.Survived.to_numpy()
df_train_data = df_train.drop(['Survived'], axis=1).to_numpy(dtype=np.float64)
df_test_class = df_test.Survived.to_numpy()
df_test_data = df_test.drop(['Survived'], axis=1).to_numpy(dtype=np.float64)
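
Because `df.sample` draws a different subset on every run, the train and test sets change between executions. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable (the seed `42` and the `toy` frame are arbitrary, hypothetical choices that do not come from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned Titanic data
toy = pd.DataFrame({'Survived': [0, 1] * 5, 'Age': np.arange(10)})

# random_state fixes the shuffle, so the same rows are selected on every run
toy_train = toy.sample(frac=0.8, random_state=42)
# Every row that was not sampled for training goes to the test set
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
```

With the seed fixed, repeated runs of the notebook would train and test on the same rows, which makes results comparable across runs.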


## Train and Test

In [5]:
# We train and test our perceptron
perceptron = Perceptron(0.1, df.shape[1] - 1)
perceptron.train(df_train_data, df_train_class)
perceptron.test(df_test_data, df_test_class)
print(perceptron.weights)

The test gave 72 correct prediction and 70 wrong prediction having a precision of 0.5070422535211268
[-6.       4.6     -2.5     -4.8     -4.2     10.98212 -0.7    ]


## Conclusions

The perceptron is likely an insufficient model for analyzing this dataset: when this test is repeated multiple times, the results show a large deviation, and most of the time it guesses only around 60% of the cases correctly.
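
The deviation between repeated runs can be quantified rather than observed by hand. The following is a rough sketch, not the notebook's own code: it reimplements the single-pass update as a standalone function (`train_perceptron` and `accuracy` are hypothetical helper names) and uses synthetic data in place of the Titanic set, so the printed numbers say nothing about the real dataset.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=0.1):
    """One pass of the perceptron rule, mirroring Perceptron.train."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for vector, label in zip(X, y):
        prediction = int(np.dot(vector, weights) + bias > 0)
        delta = (label - prediction) * learning_rate
        weights += vector * delta
        bias += delta
    return weights, bias

def accuracy(X, y, weights, bias):
    """Fraction of samples whose predicted class matches the label."""
    predictions = (X @ weights + bias > 0).astype(int)
    return float(np.mean(predictions == y))

# Synthetic stand-in data (hypothetical, NOT the Titanic set)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

accuracies = []
for run in range(10):
    # Draw a new random 80/20 split on every run
    order = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    train_idx, test_idx = order[:cut], order[cut:]
    w, b = train_perceptron(X[train_idx], y[train_idx])
    accuracies.append(accuracy(X[test_idx], y[test_idx], w, b))

print(f'mean accuracy: {np.mean(accuracies):.3f}, std: {np.std(accuracies):.3f}')
```

Running the same loop over the notebook's `df` would give a mean and standard deviation to back the "around 60%" estimate.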

## Complexity

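The perceptron algorithm has a complexity of O(n) in terms of the size n of its training dataset: training is a single pass that performs a fixed set of linear operations and one comparison per sample. Each dot product is a linear combination over the d features, so one pass costs O(n · d) in total, which is O(n) when the number of features is fixed.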