# Preliminaries

We download the data:

In [1]:
! wget https://gitlab.com/andras.simonyi/10_days_AI_training_data/raw/65de5908dccf120762b305238e02610a8c18a3f9/titanic_train.csv

--2020-10-05 20:54:34--  https://gitlab.com/andras.simonyi/10_days_AI_training_data/raw/65de5908dccf120762b305238e02610a8c18a3f9/titanic_train.csv
Resolving gitlab.com (gitlab.com)... 172.65.251.78, 2606:4700:90:0:f22e:fbec:5bed:a9b9
Connecting to gitlab.com (gitlab.com)|172.65.251.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘titanic_train.csv.12’

titanic_train.csv.1     [ <=>                ]  59.76K  --.-KB/s    in 0.08s   

2020-10-05 20:54:35 (769 KB/s) - ‘titanic_train.csv.12’ saved [61194]



# A perceptron implementation in NumPy

We create a Perceptron class, which mimics scikit-learn estimators by providing a `fit` and a `predict` method:
+ The `predict` method returns a vector of predictions for an array of samples, while 
+ the `fit` method initializes the model parameters and trains the model using the perceptron learning rule.
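
For reference, the perceptron learning rule can be written compactly as follows. The symbols are notation chosen here rather than taken from the text above: $\mathbf{w}$ is the weight vector, $b$ the bias, $\eta$ the learning rate, $\mathbf{x}$ one sample, and $\tilde{y} \in \{-1, +1\}$ its label after recoding from 0/1. Parameters change only when a sample is misclassified or lies on the decision boundary:

$$
\text{if } \tilde{y}\,(\mathbf{w}\cdot\mathbf{x} + b) \le 0:\qquad
\mathbf{w} \leftarrow \mathbf{w} + \eta\,\tilde{y}\,\mathbf{x},\qquad
b \leftarrow b + \eta\,\tilde{y}
$$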

In [2]:
import numpy as np
from numpy.random import permutation, seed
seed(50) # Fix the random seed

class Perceptron:
    """A simple implementation of the classic single-neuron perceptron model.

    Attrs:
        fitted (bool): Whether the model has been fitted.
        n_features (int): Number of input features.
        weights (numpy array): The model's weights.
        bias (float): The model's bias.
    """

    def __init__(self):
        """Create a perceptron model.

        Returns:
            A new perceptron instance.
        """
        self.fitted = False


    def fit(self, X, y, n_epochs = 20, lr = 0.005):
        """Fit the model to a data set.

        Args:
            X (numpy array of shape (n_samples, n_features)): Training data.
            y (numpy array of shape (n_samples,)): Target binary labels.
            n_epochs (int): Number of training epochs.
            lr (float): Learning rate.
        """
        n_samples, self.n_features = X.shape
        
        # Initialization
        # With the perceptron learning rule, the parameters can safely be
        # initialized to zero; networks with hidden layers, by contrast,
        # need random initialization to break the symmetry between units.
        # The weights form a 1-D vector of shape (n_features,),
        # not a matrix with a single row or column; the bias is a scalar.
        self.weights = np.zeros(self.n_features)
        self.bias = 0.00

        # Training
        # Main loop: one pass over the shuffled training data per epoch.
        for e in range(n_epochs):
            print("Starting epoch", e)

            # Shuffle the data: draw a random permutation of the sample
            # indices and use it to index into both X and y, so that
            # samples and labels are reordered consistently.
            perm = np.random.permutation(n_samples)

            # Run the epoch: zip pairs each shuffled sample with its label,
            # and we iterate over the resulting (x, label) tuples.
            for x, label in zip(X[perm], y[perm]):
                # Convert the 0/1 label to a -1/+1 label:
                # 1 stays 1, 0 becomes -1.
                y_ = 2 * label - 1
                
                # Update rule: the activation is the dot product of the weights
                # and the input, plus the bias. If the product of the activation
                # and the true label is less than or equal to zero, the sample
                # is misclassified (or lies on the decision boundary).
                if (np.dot(self.weights, x) + self.bias) * y_ <= 0:
                    # On an error, add the input vector multiplied by the true
                    # label and the learning rate to the weights.
                    self.weights += lr * y_ * x
                    # The bias has a "virtual input" of 1, so it only receives
                    # the true label multiplied by the learning rate.
                    self.bias += y_ * lr
                
                # Lo and behold, THAT'S IT!
                # We have implemented a perceptron!
        
        self.fitted = True
        print("Finished training.")
        
    def predict(self, X):
        """Predict labels for samples.
        
        Args:
            X (numpy array of shape (n_samples, n_features)):  Samples.

        Returns:
            A numpy array of shape (n_samples,) containing the predicted labels.
        """
        if not self.fitted:
            raise ValueError("Perceptron model is not fitted")
        elif X.shape[1] != self.n_features:
            raise ValueError(f"Incorrect number of input features (expected {self.n_features})")
        else:
            
            # X has shape (n_samples, n_features), so a single matrix-vector
            # product computes the dot product of every sample with the weights.
            activations = np.matmul(X, self.weights) + self.bias
            signs = np.sign(activations).astype(int) 
            # Convert signs to 0/1 labels: -1 and 0 map to 0, +1 maps to 1
            return (signs + signs * signs) // 2
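
The two less obvious steps in `predict` can be checked in isolation. The following is a minimal sketch with made-up toy values (`X`, `weights` and `bias` here are hypothetical, not taken from the trained model): it confirms that the matrix-vector product matches row-by-row dot products, and shows how the sign arithmetic maps activations to 0/1 labels.

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration.
X = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [-2.0, 1.0]])
weights = np.array([0.5, -0.5])
bias = 0.5

# One matrix-vector product computes the activation of every row at once...
activations = np.matmul(X, weights) + bias
# ...and gives the same result as computing the dot product row by row.
row_by_row = np.array([np.dot(x, weights) + bias for x in X])

# Map signs (-1, 0, +1) to labels (0, 0, 1), as done in `predict`.
signs = np.sign(activations).astype(int)
labels = (signs + signs * signs) // 2

# An equivalent and arguably more readable formulation.
labels_direct = (activations > 0).astype(int)

print(activations, labels, labels_direct)
```

Note that an activation of exactly zero is assigned label 0 by both formulations.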

# Trying it out -- on the Titanic data set

Short description of the data set:

> The titanic [...] data frames describe the survival status of individual passengers on the Titanic.

> Non-obvious variables:

>- Pclass -- Passenger Class  (1 = 1st; 2 = 2nd; 3 = 3rd)
>- Survived -- Survival  (0 = No; 1 = Yes)
>- SibSp -- Number of Siblings/Spouses Aboard
>- Parch -- Number of Parents/Children Aboard
>- Fare -- Ticket price in British pounds
>- Embarked -- Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split


df = pd.read_csv("titanic_train.csv")

For the sake of simplicity, we divide our data only into a training and a validation part:

In [4]:
df_train, df_valid = train_test_split(df, test_size = 0.1)
df_train.reset_index(inplace = True)
df_valid.reset_index(inplace = True)
print("train shape:", df_train.shape)
print("validation shape:", df_valid.shape)

train shape: (801, 13)
validation shape: (90, 13)
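
As a possible variation, the split can be tied to an explicit seed and stratified by the target. The sketch below uses a small made-up frame (`toy` is hypothetical and merely stands in for the Titanic data); `random_state` and `stratify` are standard `train_test_split` parameters, but using them is an assumption here, not something the original cell does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86, 21.07, 11.13, 30.07],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

toy_train, toy_valid = train_test_split(
    toy,
    test_size=0.2,
    random_state=50,           # explicit seed: the same split on every run
    stratify=toy["Survived"],  # keep the class ratio equal in both parts
)

# drop=True discards the old index instead of adding it as an "index" column.
toy_train = toy_train.reset_index(drop=True)
toy_valid = toy_valid.reset_index(drop=True)

print(toy_train.shape, toy_valid.shape)
```

With `drop=True`, no extra `index` column appears, so it would not need to be dropped afterwards.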


## Inspecting and cleaning the data

In [5]:
df_train.head()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,76,77,0,3,"Staneff, Mr. Ivan",male,,0,0,349208,7.8958,,S
1,483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
2,368,369,1,3,"Jermyn, Miss. Annie",female,,0,0,14313,7.75,,Q
3,846,847,0,3,"Sage, Mr. Douglas Bullen",male,,8,2,CA. 2343,69.55,,S
4,32,33,1,3,"Glynn, Miss. Mary Agatha",female,,0,0,335677,7.75,,Q


In [6]:
df_train.describe(include="all")

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,801.0,801.0,801.0,801.0,801,801,638.0,801.0,801.0,801.0,801.0,183,799
unique,,,,,801,2,,,,620.0,,136,3
top,,,,,"Frauenthal, Mrs. Henry William (Clara Heinshei...",male,,,,1601.0,,C23 C25 C27,S
freq,,,,,1,521,,,,7.0,,4,583
mean,439.128589,440.128589,0.373283,2.308365,,,30.03906,0.530587,0.377029,,32.5379,,
std,256.872424,256.872424,0.483979,0.843233,,,14.516676,1.124445,0.807881,,48.441166,,
min,0.0,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,214.0,215.0,0.0,2.0,,,21.0,0.0,0.0,,7.925,,
50%,435.0,436.0,0.0,3.0,,,28.75,0.0,0.0,,14.5,,
75%,657.0,658.0,1.0,3.0,,,38.75,1.0,0.0,,31.275,,


In [7]:
df_train.isna().sum()

index            0
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            163
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          618
Embarked         2
dtype: int64

Based on our inspection, we drop the Cabin, Ticket, PassengerId, Name and index columns, since they are unusable for the prediction task:

In [8]:
columns_to_drop = ["Cabin", "Ticket", "PassengerId", "Name", "index"]
df_train = df_train.drop(columns = columns_to_drop)
df_valid = df_valid.drop(columns = columns_to_drop)
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,,0,0,7.8958,S
1,1,3,female,63.0,0,0,9.5875,S
2,1,3,female,,0,0,7.75,Q
3,0,3,male,,8,2,69.55,S
4,1,3,female,,0,0,7.75,Q


We encode the gender of each passenger numerically and, as a primitive form of data imputation, replace missing age values with the mean age in the training data:

In [9]:
age_mean = df_train.Age.mean()
replacements = {"Sex": {"male": 1, "female": 0}, "Age": {np.nan: age_mean}}  # np.NaN alias was removed in NumPy 2.0
df_train.replace(replacements, inplace = True)
df_valid.replace(replacements, inplace = True)
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,30.03906,0,0,7.8958,S
1,1,3,0,63.0,0,0,9.5875,S
2,1,3,0,30.03906,0,0,7.75,Q
3,0,3,1,30.03906,8,2,69.55,S
4,1,3,0,30.03906,0,0,7.75,Q
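
The same encoding and imputation can also be expressed with `map` and `fillna`, which some readers find more explicit than a nested `replace` dictionary. This is a minimal sketch on made-up data; the helper `encode_and_impute` and the toy frames are hypothetical names introduced for illustration. The key point it shares with the cell above is that the mean is computed on the training data only and then reused for the validation data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for df_train and df_valid.
toy_train = pd.DataFrame({"Sex": ["male", "female", "male"],
                          "Age": [20.0, np.nan, 40.0]})
toy_valid = pd.DataFrame({"Sex": ["female", "male"],
                          "Age": [np.nan, 10.0]})

# The imputation value comes from the training data only.
age_mean = toy_train["Age"].mean()

def encode_and_impute(frame, age_fill):
    """Return a copy with Sex encoded as 1/0 and missing Age values filled."""
    out = frame.copy()
    out["Sex"] = out["Sex"].map({"male": 1, "female": 0})
    out["Age"] = out["Age"].fillna(age_fill)
    return out

toy_train_clean = encode_and_impute(toy_train, age_mean)
toy_valid_clean = encode_and_impute(toy_valid, age_mean)

print(toy_train_clean)
print(toy_valid_clean)
```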


We drop the remaining rows with missing data:

In [10]:
print("Train and validation length before dropping:", len(df_train), len(df_valid))
df_train.dropna(inplace=True)
# reset_index returns a new frame unless inplace=True; drop=True discards the old index
df_train.reset_index(drop=True, inplace=True)
df_valid.dropna(inplace=True)
df_valid.reset_index(drop=True, inplace=True)
print("Train and validation length after dropping:", len(df_train), len(df_valid))

Train and validation length before dropping: 801 90
Train and validation length after dropping: 799 90


Finally, we one-hot encode the Embarked column.

In [11]:
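# dtype=int keeps the dummy columns as 0/1; recent pandas versions default to booleans
df_train = pd.get_dummies(df_train, dtype=int)
df_valid = pd.get_dummies(df_valid, dtype=int)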
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
0,0,3,1,30.03906,0,0,7.8958,0,0,1
1,1,3,0,63.0,0,0,9.5875,0,0,1
2,1,3,0,30.03906,0,0,7.75,0,1,0
3,0,3,1,30.03906,8,2,69.55,0,0,1
4,1,3,0,30.03906,0,0,7.75,0,1,0
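
Encoding the two frames separately works here because both contain all three ports. If the validation part happened to lack a category, its dummy columns would be missing and the feature matrices would no longer line up. One way to guard against this, sketched below on made-up data (the toy frames are hypothetical), is to reindex the validation columns to match the training columns:

```python
import pandas as pd

# Hypothetical toy frames: the validation part contains only one port.
toy_train = pd.DataFrame({"Fare": [7.9, 9.6, 7.75],
                          "Embarked": ["S", "C", "Q"]})
toy_valid = pd.DataFrame({"Fare": [8.0, 30.0],
                          "Embarked": ["S", "S"]})

train_encoded = pd.get_dummies(toy_train, dtype=int)
valid_encoded = pd.get_dummies(toy_valid, dtype=int)

# valid_encoded lacks Embarked_C and Embarked_Q; add them as all-zero columns
# and put every column in the same order as in the training frame.
valid_aligned = valid_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(valid_encoded.columns))
print(list(valid_aligned.columns))
```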


## Fitting a perceptron

In [12]:
p = Perceptron()

input_cols = list(df_train.columns.values)[1:]

X_train = df_train[input_cols].to_numpy()
y_train = df_train.Survived.to_numpy()

p.fit(X_train, y_train, n_epochs=100, lr=0.001)

Starting epoch 0
Starting epoch 1
Starting epoch 2
Starting epoch 3
Starting epoch 4
Starting epoch 5
Starting epoch 6
Starting epoch 7
Starting epoch 8
Starting epoch 9
Starting epoch 10
Starting epoch 11
Starting epoch 12
Starting epoch 13
Starting epoch 14
Starting epoch 15
Starting epoch 16
Starting epoch 17
Starting epoch 18
Starting epoch 19
Starting epoch 20
Starting epoch 21
Starting epoch 22
Starting epoch 23
Starting epoch 24
Starting epoch 25
Starting epoch 26
Starting epoch 27
Starting epoch 28
Starting epoch 29
Starting epoch 30
Starting epoch 31
Starting epoch 32
Starting epoch 33
Starting epoch 34
Starting epoch 35
Starting epoch 36
Starting epoch 37
Starting epoch 38
Starting epoch 39
Starting epoch 40
Starting epoch 41
Starting epoch 42
Starting epoch 43
Starting epoch 44
Starting epoch 45
Starting epoch 46
Starting epoch 47
Starting epoch 48
Starting epoch 49
Starting epoch 50
Starting epoch 51
Starting epoch 52
Starting epoch 53
Starting epoch 54
Starting epoch 55
[...]

In [13]:
y_train_predicted = p.predict(X_train)

X_valid = df_valid[input_cols].to_numpy()
y_valid = df_valid.Survived.to_numpy()

y_valid_predicted = p.predict(X_valid)

Let's see the metrics on our training and validation data:

In [14]:
from sklearn.metrics import classification_report

print(classification_report(y_train, y_train_predicted, labels=[1], target_names=["Survivor"]))

              precision    recall  f1-score   support

    Survivor       0.69      0.73      0.71       297

   micro avg       0.69      0.73      0.71       297
   macro avg       0.69      0.73      0.71       297
weighted avg       0.69      0.73      0.71       297



In [15]:
print(classification_report(y_valid, y_valid_predicted, labels=[1], target_names=["Survivor"]))

              precision    recall  f1-score   support

    Survivor       0.84      0.74      0.79        43

   micro avg       0.84      0.74      0.79        43
   macro avg       0.84      0.74      0.79        43
weighted avg       0.84      0.74      0.79        43
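
To make the columns of the report concrete, precision, recall and F1 for the positive class can be computed by hand from the counts of true positives, false positives and false negatives. The sketch below uses made-up label vectors (`y_true` and `y_pred` are hypothetical and unrelated to the results above):

```python
import numpy as np

# Hypothetical labels and predictions, chosen only for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted survivor, did survive
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted survivor, did not survive
fn = np.sum((y_pred == 0) & (y_true == 1))  # missed survivor

precision = tp / (tp + fp)  # share of predicted survivors that are correct
recall = tp / (tp + fn)     # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

Because the report is restricted to a single label with `labels=[1]`, its average rows simply repeat the values of the `Survivor` row.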

