In [1]:
import numpy as np

Introduction. On April 15, 1912, the largest passenger liner ever built collided with an iceberg during its maiden voyage. When the Titanic sank, 1502 of the 2224 passengers and crew were killed. This sensational tragedy shocked the international community and led to improved safety regulations for ships. One of the reasons that the shipwreck resulted in such a loss of life was that there were not enough lifeboats for the passengers and crew. Although there was an element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

Data. The titanic.csv file in the directory data contains data for 887 of the real Titanic passengers. Each row represents one passenger. The columns describe attributes of the passengers, including whether they survived (S), their passenger class (C), their gender (G), their age (A), and the fare they paid (F).

Column S encodes survival as 1 and death as 0; column G encodes male as 0 and female as 1. Passenger classes C are 1 (top), 2 (middle), and 3 (bottom).

Task. Write a 1-nearest neighbor classifier using NumPy to predict whether a Titanic passenger survived or not. Follow the instructions in the notebook titanic.ipynb.

# Titanic Dataset

1. **S** survived (1 survived, 0 drowned)
2. **C** passenger-class (1 upper, 2 middle, 3 lower)
3. **G** gender (0 male, 1 female)
4. **A** age
5. **F** fare

The first column `S` of the Titanic dataset represents the class label. 


---
### Load Data

The next cell reads the Titanic dataset and stores the table in the 2D array `Z`. Rows represent examples and columns represent features. 

In [3]:
file = './data/titanic.csv'
Z = np.loadtxt(file, skiprows=1, delimiter=',')
print(Z)

[[ 0.      3.      0.     22.      7.25  ]
 [ 1.      1.      1.     38.     71.2833]
 [ 1.      3.      1.     26.      7.925 ]
 ...
 [ 0.      3.      1.      7.     23.45  ]
 [ 1.      1.      0.     26.     30.    ]
 [ 0.      3.      0.     32.      7.75  ]]
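Since the class label sits in column 0, later steps usually need the labels and the feature columns as separate arrays. The following is a minimal sketch of that slicing, assuming a toy array `Z_demo` (a hypothetical name) that holds only the first three rows printed above.

```python
import numpy as np

# Toy stand-in for Z: the first three rows printed above (columns S, C, G, A, F).
Z_demo = np.array([
    [0.0, 3.0, 0.0, 22.0, 7.25],
    [1.0, 1.0, 1.0, 38.0, 71.2833],
    [1.0, 3.0, 1.0, 26.0, 7.925],
])

y_demo = Z_demo[:, 0].astype(int)  # class labels S (column 0)
X_demo = Z_demo[:, 1:]             # features C, G, A, F (remaining columns)

print(y_demo.tolist())  # [0, 1, 1]
print(X_demo.shape)     # (3, 4)
```

The same two slices applied to the full array `Z` would yield the label vector and the feature matrix for all passengers.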


---
### Split Data

Write a function that accepts two arguments: a dataset `Z` and a parameter `test_size`. The function randomly splits the dataset `Z` into two sets: a training set and a test set. The `test_size` parameter specifies the proportion of the data in `Z` that should be allocated to the test set. The function returns both the training and test sets.

In [19]:
def split(Z, test_size=0.2):
    """Randomly split the rows of Z and return (train_data, test_data)."""
    num_test = int(test_size * Z.shape[0])
    # Draw the test rows without replacement; all remaining rows form the training set.
    test_idx = np.random.choice(Z.shape[0], num_test, replace=False)
    test_data = Z[test_idx]
    train_data = np.delete(Z, test_idx, axis=0)
    return train_data, test_data

train_data, test_data = split(Z)
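A random split changes on every run, which makes accuracy figures hard to compare. One possible remedy, sketched below, is to shuffle the row indices with a seeded generator. The function name `split_seeded` and its `seed` parameter are assumptions made for this illustration and are not part of the assignment; note that this variant returns the training set first.

```python
import numpy as np

def split_seeded(Z, test_size=0.2, seed=0):
    """Hypothetical reproducible split; returns (train, test)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(Z.shape[0])       # shuffled row indices
    num_test = int(test_size * Z.shape[0])   # number of test rows, rounded down
    test_idx, train_idx = perm[:num_test], perm[num_test:]
    return Z[train_idx], Z[test_idx]

# Toy dataset with 10 rows and 5 columns.
Z_demo = np.arange(50, dtype=float).reshape(10, 5)
train_demo, test_demo = split_seeded(Z_demo, test_size=0.2, seed=0)
print(train_demo.shape, test_demo.shape)  # (8, 5) (2, 5)
```

Calling `split_seeded` again with the same seed returns the same partition, while a different seed gives a different one.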

---
### Data Exploration

Write a function `info` that takes a dataset as a 2D array and the index of the class label as input and displays the following information:

+ Number of examples
+ Number of features (including class label)
+ Number of unique class labels
+ Total number of values
+ Type of elements in the dataset
+ Mean and standard deviation of each feature 

Additional parameters may be added if necessary.

In [59]:
def info(X, label_index):
    """Print summary statistics of X; label_index is the column holding the class label."""
    num_examples, num_features = X.shape
    unique_labels = len(np.unique(X[:, label_index]))
    total_vals = X.size
    type_elem = X.dtype
    # Column-wise statistics over all columns, including the label column.
    mean_values = np.mean(X, axis=0)
    std_values = np.std(X, axis=0)
    print(f'{"number of examples: ":25} {num_examples:<5}')
    print(f'{"number of features: ":25} {num_features:<5}')
    print(f'{"number of unique labels: ":25} {unique_labels:<5}')
    print(f'{"total number of values: ":25} {total_vals:<5}')
    print(f'{"type of all elements ":25} {type_elem}')
    print(f'{"Matrix of mean values: ":25} {mean_values}')
    print(f'{"Matrix of std: ":25} {std_values}')


info(Z, 0)

number of examples:       887  
number of features:       5    
number of unique labels:  2    
total number of values:   4435 
type of all elements      float64
Matrix of mean values:    [ 0.38556933  2.30552424  0.35400225 29.47144307 32.30542018]
Matrix of std:            [ 0.48672952  0.83619025  0.47820985 14.11394567 49.75397046]


---
### Standardization

Implement standardization to scale the feature matrix (excluding class labels) so that each column has a mean of zero and a standard deviation of one.

**Note:** Estimate the mean and standard deviation using only the training feature matrix. Then, scale both the training and test feature matrices using the estimated values.
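To make the note concrete, here is a small worked example on hypothetical toy matrices (`X_train` and `X_test` are assumed names, with labels already removed). The mean and standard deviation come from the training rows only and are then reused for the test row, so the test features need not end up with mean zero.

```python
import numpy as np

# Hypothetical toy feature matrices (one row per example, no label column).
X_train = np.array([[1.0, 10.0],
                    [3.0, 30.0]])
X_test = np.array([[5.0, 20.0]])

mu = X_train.mean(axis=0)    # per-column training mean: 2 and 20
sigma = X_train.std(axis=0)  # per-column population std (ddof=0): 1 and 10

X_train_std = (X_train - mu) / sigma  # rows become (-1, -1) and (1, 1)
X_test_std = (X_test - mu) / sigma    # row becomes (3, 0)

print(X_train_std)
print(X_test_std)
```

Estimating `mu` and `sigma` on the test set as well would leak information about the test data into the preprocessing step.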


In [66]:
# code
def standardization(train, test):
    """Standardize the feature columns (all but column 0) using training-set statistics."""
    mean_x = np.mean(train[:, 1:], axis=0)
    std_x = np.std(train[:, 1:], axis=0)

    # Keep the label column unchanged and scale only the feature columns.
    train_std = np.hstack((train[:, :1], (train[:, 1:] - mean_x) / std_x))
    test_std = np.hstack((test[:, :1], (test[:, 1:] - mean_x) / std_x))
    return train_std, test_std

train_std, test_std = standardization(train_data, test_data)


---
### Nearest-Neighbor Classifier

Write a score function that returns the classification accuracy on the test set using the training set. 

In [None]:
# code
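One possible shape for such a score function is sketched below. It assumes that column 0 of both arrays holds the class label and that Euclidean distance is the intended metric; neither the name `score` nor the argument order is prescribed by the notebook.

```python
import numpy as np

def score(train, test):
    """Accuracy of a 1-NN classifier; column 0 of each array holds the class label."""
    X_train, y_train = train[:, 1:], train[:, 0]
    X_test, y_test = test[:, 1:], test[:, 0]
    # Pairwise squared Euclidean distances, shape (n_test, n_train), via broadcasting.
    diff = X_test[:, None, :] - X_train[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    # Each test example takes the label of its closest training example.
    nearest = np.argmin(dist2, axis=1)
    y_pred = y_train[nearest]
    return np.mean(y_pred == y_test)

# Hypothetical toy data: label first, then two features.
train_demo = np.array([[0.0, 0.0, 0.0],
                       [1.0, 5.0, 5.0]])
test_demo = np.array([[0.0, 1.0, 0.0],
                      [1.0, 4.0, 6.0],
                      [1.0, 0.0, 1.0]])
print(score(train_demo, test_demo))  # 2 of 3 test labels are predicted correctly
```

The squared distance is enough here because taking the square root does not change which neighbor is closest. The broadcasted difference array holds `n_test * n_train * n_features` values, which is small for a dataset of this size.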

---
### Evaluation

Evaluate the 1-NN classifier with and without data standardization.


In [None]:
# code
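The comparison could be organized as in the following sketch. It runs on synthetic data rather than on `titanic.csv`, and the helper names `standardize` and `score_1nn` are assumptions for this illustration. The synthetic set has one informative feature on a small scale and one uninformative feature on a large scale, loosely mirroring how age and fare span much wider ranges than class and gender in the statistics printed earlier.

```python
import numpy as np

def standardize(train, test):
    """Scale feature columns (all but column 0) with training-set mean and std."""
    mu = train[:, 1:].mean(axis=0)
    sigma = train[:, 1:].std(axis=0)
    train_std = np.hstack((train[:, :1], (train[:, 1:] - mu) / sigma))
    test_std = np.hstack((test[:, :1], (test[:, 1:] - mu) / sigma))
    return train_std, test_std

def score_1nn(train, test):
    """1-NN accuracy with Euclidean distance; column 0 holds the class label."""
    diff = test[:, None, 1:] - train[None, :, 1:]
    nearest = np.argmin(np.sum(diff ** 2, axis=2), axis=1)
    return np.mean(train[nearest, 0] == test[:, 0])

# Synthetic data (not the Titanic set): label, informative feature, large-scale noise.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n).astype(float)
informative = labels + rng.normal(0.0, 0.3, size=n)
noisy = rng.normal(0.0, 50.0, size=n)
data = np.column_stack((labels, informative, noisy))
train_demo, test_demo = data[:160], data[160:]

acc_raw = score_1nn(train_demo, test_demo)
acc_std = score_1nn(*standardize(train_demo, test_demo))
print(f"accuracy without standardization: {acc_raw:.3f}")
print(f"accuracy with standardization:    {acc_std:.3f}")
```

Without scaling, the large-scale column tends to dominate the distance, so the two accuracies may differ noticeably. Since a single random split is noisy, averaging over several splits would give a steadier comparison.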