# Machine Learning

This is a Python notebook.

## Configuration

In [None]:
# Jupyter config
%matplotlib inline
%config InlineBackend.figure_format = 'svg'  # Or 'retina'

In [None]:
# Python imports
import gc
import os
import pickle
import sys
from collections import defaultdict
from itertools import chain, combinations, zip_longest, tee, islice
from typing import *

import numpy as np
import numpy.typing as npt
import pandas as pd
import matplotlib.pyplot as plt
from joblib import Parallel, delayed, Memory
from scipy.stats import *
from sklearn.preprocessing import *
from sklearn.mixture import *
from tqdm.notebook import tqdm


memory = Memory('./')
#plt.style.use('seaborn-whitegrid')  # Set the aesthetic style of the plots

## Preprocessing

Import data from csv:

In [None]:
training_data = pd.read_csv('train_processed.csv')
test_data = pd.read_csv('test_processed.csv')

training_data.dropna(inplace=True)
test_data.dropna(inplace=True)

In [None]:
training_data

Convert non-numeric labels to numbers:

In [None]:
label_encoders = {
    'Sex': LabelEncoder(),
    'Ticket': LabelEncoder(),
    'Embarked': LabelEncoder(),
    'NameTitle': LabelEncoder(),
    'FirstName': LabelEncoder(),
    'MiddleNames': LabelEncoder(),
    'LastName': LabelEncoder(),
    'Deck': LabelEncoder(),
    'WithFamily': LabelEncoder(),
}
for feature, label_encoder in label_encoders.items():
    label_encoder.fit(pd.concat((training_data[feature], test_data[feature])))
    training_data[feature] = label_encoder.transform(training_data[feature])
    test_data[feature] = label_encoder.transform(test_data[feature])
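
As a small illustration of why each encoder above is fitted on the training and test columns together, here is a minimal sketch on toy data. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins, not the real dataset. An encoder fitted on `toy_train` alone would raise a `ValueError` when asked to transform the unseen label `'Q'`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for training_data / test_data (hypothetical values)
toy_train = pd.DataFrame({'Embarked': ['S', 'C', 'S']})
toy_test = pd.DataFrame({'Embarked': ['Q', 'S']})  # 'Q' never appears in toy_train

encoder = LabelEncoder()
# Fit on the union so every label seen in either frame gets an integer code
encoder.fit(pd.concat((toy_train['Embarked'], toy_test['Embarked'])))

train_codes = encoder.transform(toy_train['Embarked'])
test_codes = encoder.transform(toy_test['Embarked'])

# classes_ is sorted, so the codes follow alphabetical order: C -> 0, Q -> 1, S -> 2
print(list(encoder.classes_))
print(train_codes, test_codes)
```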

In [None]:
training_data

In what follows, we detail our attempts at building a machine learning model.
We want this Jupyter notebook to contain only working code, so we present only our first and last attempts.
Readers curious about the intermediate attempts can consult older versions of this notebook.
They are available in the Git history at the following links:

- [Attempt 1]: `git checkout attempt-01`
- [Attempt 2]: `git checkout attempt-02`
- [Attempt 3]: `git checkout attempt-03`
- [Attempt 4]: `git checkout attempt-04`
- [Attempt 5]: `git checkout attempt-05`
- [Attempt 6]: `git checkout attempt-06`
- [Attempt 7]: `git checkout attempt-07`
- [Attempt 8]: `git checkout attempt-08`

[Attempt 1]: https://github.com/0x326/miami-university-cse-627-group-project/blob/attempt-01/03_Machine_Learning.ipynb
[Attempt 2]: https://github.com/0x326/miami-university-cse-627-group-project/blob/attempt-02/03_Machine_Learning.ipynb
[Attempt 3]: https://github.com/0x326/miami-university-cse-627-group-project/blob/attempt-03/03_Machine_Learning.ipynb
[Attempt 4]: https://github.com/0x326/miami-university-cse-627-group-project/blob/attempt-04/03_Machine_Learning.ipynb
[Attempt 5]: https://github.com/0x326/miami-university-cse-627-group-project/blob/attempt-05/03_Machine_Learning.ipynb
[Attempt 6]: https://github.com/0x326/miami-university-cse-627-group-project/blob/attempt-06/03_Machine_Learning.ipynb
[Attempt 7]: https://github.com/0x326/miami-university-cse-627-group-project/blob/attempt-07/03_Machine_Learning.ipynb
[Attempt 8]: https://github.com/0x326/miami-university-cse-627-group-project/blob/attempt-08/03_Machine_Learning.ipynb

## Attempt 1

### Selecting Features

We will initially select the features that we believe would most affect an individual's odds of survival aboard the Titanic.

We decide to keep the following features:

* **Pclass** - The class of the ticket. As is well known, this had a large say in deciding who got onto the lifeboats.
* **Age** - An older person is, on average, weaker than a younger one.
* **Fare** - Someone who paid much more money would be in a very different position from someone who did not.
* **Embarked** - The port where the passenger boarded might play a role. We are not sure, and may drop this feature in a later attempt.
* **Deck** - The deck of the boat on which the person was staying is important when a boat is floating.
* **FamilySize** - An individual with family aboard may have given up their spot on a lifeboat or attempted to rescue family members.
* **FarePerPerson** - The amount paid per person (based on family size) could indicate how they were treated.

### Classifier Decision

Drawing on <https://www.kaggle.com/mosleylm/titanic-data-set-exploration/execution#II.-Format-Data>, we decide to test many different classifiers and then select the highest-performing one based on its F1 score.
We will use stratified 10-fold cross-validation in order to train and test on all of our data.

We test the following classifiers:

* **Gradient Boosting**
* **Random Forest**
* **KNeighbors**
* **SVC**
* **Decision Tree**
* **Ada Boost**
* **GaussianNB**
* **Logistic Regression**
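
The selection procedure described above can be sketched as follows. This is a minimal illustration, not the code from Attempt 1: it uses synthetic data from `make_classification` in place of the Titanic features, only three of the eight classifiers, and hypothetical names (`X_demo`, `y_demo`, `demo_candidates`). It assumes `StratifiedKFold` is what "stratified 10-fold" refers to.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

demo_candidates = {
    'GaussianNB': GaussianNB(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}

# Stratified folds keep the class ratio roughly equal in every fold
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Mean F1 over the 10 folds for each candidate
mean_f1 = {
    name: cross_val_score(model, X_demo, y_demo, cv=folds, scoring='f1').mean()
    for name, model in demo_candidates.items()
}

best_name = max(mean_f1, key=mean_f1.get)
print(best_name, round(mean_f1[best_name], 3))
```

With 10 disjoint folds, every sample is used for testing exactly once, which is what lets us "train and test on all of our data".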

## Attempts 2-8

### Code definitions

In [None]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, accuracy_score, log_loss

The following helpers are based on the itertools recipes at <https://docs.python.org/3/library/itertools.html#itertools-recipes>:

In [None]:
def powerset(iterable):
    "powerset([1,2,3]) --> (1,2,3) (1,2) (1,3) (2,3) (1,) (2,) (3,) ()"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in reversed(range(len(s)+1)))


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
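
To make the behaviour of these two helpers concrete, here is a small self-contained usage sketch. The helper definitions are repeated so the snippet runs on its own, and the feature names are only examples.

```python
from itertools import chain, combinations, zip_longest


def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in reversed(range(len(s) + 1)))


def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


subsets = list(powerset(['Fare', 'Deck']))
# Largest subsets come first because r counts down from len(s) to 0:
# subsets == [('Fare', 'Deck'), ('Fare',), ('Deck',), ()]

chunks = list(grouper(range(5), 2))
# The last chunk is padded with the fill value (None by default):
# chunks == [(0, 1), (2, 3), (4, None)]
```

The `None` padding is why the parallel code further down filters out `None` entries before handing each chunk to `Parallel`.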

In [None]:
worker_number = '#WORKERS'  # Will be replaced with number using sed. See bash script later on
total_workers = os.cpu_count()

In [None]:
class Classifier(NamedTuple):
    name: str
    features: Sequence[str]


class ClassifierResult(NamedTuple):
    classifier: Any
    test_predictions: npt.ArrayLike
    f1_score: np.float64
    accuracy_score: np.float64


class AllClassifierResults(NamedTuple):
    classifiers: Dict[Classifier, Any]
    classifiers_by_f1_score: Dict[Classifier, np.float64]
    classifiers_by_accuracy_score: Dict[Classifier, np.float64]


classifiers = {
    'KNeighborsClassifier': lambda: KNeighborsClassifier(5),
    'SVC': lambda: SVC(probability=True),
    'DecisionTreeClassifier': lambda: DecisionTreeClassifier(),
    'RandomForestClassifier': lambda: RandomForestClassifier(),
    'AdaBoostClassifier': lambda: AdaBoostClassifier(),
    'GradientBoostingClassifier': lambda: GradientBoostingClassifier(),
    'GaussianNB': lambda: GaussianNB(),
    'LogisticRegression': lambda: LogisticRegression(),
}


@delayed
@memory.cache
def train_classifier(selected_features, train_true, classifier_name, num_splits=10) -> ClassifierResult:
    X = training_data[list(selected_features)]
    X = X.values
    y = training_data[list(train_true)]
    y = np.asarray(y).reshape(-1)

    splitter = StratifiedShuffleSplit(n_splits=num_splits, test_size=0.1, random_state=0)
    classifier = classifiers[classifier_name]()

    average_f1_score = np.float64()
    average_accuracy_score = np.float64()
    for train_idx, test_idx in splitter.split(X, y):  # num_splits stratified shuffle splits (not disjoint folds)
        X_train, X_test = X[train_idx], X[test_idx]
        Y_train, Y_test = y[train_idx], y[test_idx]
        
        classifier.fit(X_train, Y_train)

        test_predictions = classifier.predict(X_test)
        average_f1_score += f1_score(Y_test, test_predictions)
        average_accuracy_score += accuracy_score(Y_test, test_predictions)

    average_f1_score /= num_splits
    average_accuracy_score /= num_splits

    return ClassifierResult(classifier=classifier,
                            test_predictions=test_predictions,
                            f1_score=average_f1_score,
                            accuracy_score=average_accuracy_score)


def train_all_classifiers(n_jobs=os.cpu_count()) -> AllClassifierResults:
    train_true = ('Survived',)
    assumed_features = (
        'Sex',
        'Pclass',
        'Age',
    )
    excluded_features = (
        'NameTitle',
        'FirstName',
        'MiddleNames',
        'LastName',
    )
    candidate_features = iter(training_data)
    candidate_features = filter(lambda feature: feature not in train_true, candidate_features)
    candidate_features = filter(lambda feature: feature not in assumed_features, candidate_features)
    candidate_features = filter(lambda feature: feature not in excluded_features, candidate_features)
    candidate_features = tuple(candidate_features)

    all_classifiers: Dict[Classifier, Any] = {}
    classifiers_by_f1_score: Dict[Classifier, np.float64] = defaultdict(int)
    classifiers_by_accuracy_score: Dict[Classifier, np.float64] = defaultdict(int)

    candidate_features_subsets = ((*assumed_features, *selected_features)
                                  for selected_features in powerset(candidate_features))
    candidate_features_subsets = filter(lambda features: len(features) > 0, candidate_features_subsets)
    candidate_features_subsets = islice(candidate_features_subsets, worker_number - 1, None, total_workers)
    candidate_features_subsets_1, candidate_features_subsets_2 = tee(candidate_features_subsets)
    del candidate_features_subsets
    job_inputs = (Classifier(name=classifier_name, features=selected_features)
                  for selected_features in candidate_features_subsets_1
                  for classifier_name in classifiers.keys())
    job_results = (train_classifier(selected_features, train_true, classifier_name)
                   for selected_features in candidate_features_subsets_2
                   for classifier_name in classifiers.keys())
    if n_jobs == 1:
        job_results = Parallel(n_jobs=1)(job_results)
    else:
        # Group by every 100 * CPU Count
        job_results = chain.from_iterable(Parallel(n_jobs=n_jobs, verbose=10)(filter(lambda job: job is not None, partial_job_list))
                                          for partial_job_list in grouper(job_results, n_jobs * 100))
    
    for classifier, results in tqdm(zip(job_inputs, job_results),
                                    total=(2 ** len(candidate_features) - 1) * 8):
        all_classifiers[classifier] = results.classifier
        classifiers_by_f1_score[classifier] = results.f1_score
        classifiers_by_accuracy_score[classifier] = results.accuracy_score

    return AllClassifierResults(classifiers=all_classifiers,
                                classifiers_by_f1_score=classifiers_by_f1_score,
                                classifiers_by_accuracy_score=classifiers_by_accuracy_score)
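
The `islice(..., worker_number - 1, None, total_workers)` call is what divides the feature subsets among the worker processes. The following sketch isolates that idea on a plain list of integers. The function name `shard` is hypothetical and does not appear in the notebook.

```python
from itertools import islice


def shard(items, worker_number, total_workers):
    """Return every total_workers-th item, starting at index worker_number - 1."""
    return list(islice(items, worker_number - 1, None, total_workers))


jobs = list(range(10))
shards = [shard(jobs, worker, 3) for worker in (1, 2, 3)]
# shards == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

# Every job lands in exactly one shard
assert sorted(job for part in shards for job in part) == jobs
```

Because each worker strides through the same deterministic sequence with a different offset, no coordination between workers is needed.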


### Running Code

The following two cells export this Jupyter notebook to a simple IPython script.
Copies of this script are made for each worker and the value of `worker_number` is set to `'1'`, `'2'`, etc.

In [None]:
try:
    worker_number = int(worker_number)

except ValueError:
    print('Skipping since we are running in a Jupyter Notebook')
    
else:
    from tqdm import tqdm
    gc.collect()
    all_results = train_all_classifiers(n_jobs=1)
    with open(f'results_{worker_number}.pkl', 'wb') as file:
        pickle.dump(all_results, file, protocol=5)
    sys.exit(0)
    

In [None]:
%%script bash

NOTEBOOK=03_Machine_Learning

if [[ -e ".${NOTEBOOK}_running" ]]; then
    echo 'Test is already running. If you are sure this is not the case, run:' >&2
    echo "    rm .${NOTEBOOK}_running" >&2
    exit 1
fi
touch ".${NOTEBOOK}_running"

jupyter nbconvert --to script "${NOTEBOOK}.ipynb"
for WORKER_NUM in $(seq "$(nproc)"); do
    sed "s/#WORKERS/${WORKER_NUM}/g" "${NOTEBOOK}.py" > "_${NOTEBOOK}_${WORKER_NUM}.py"
    ipython "_${NOTEBOOK}_${WORKER_NUM}.py" &
done
wait

rm ".${NOTEBOOK}_running"
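
For readers less familiar with `sed`, the placeholder mechanism used by the two cells above can be mimicked in plain Python. This is only an illustrative sketch: `render_worker_script` and `parse_worker_number` are hypothetical names, and the one-line `template_source` stands in for the exported script.

```python
PLACEHOLDER = '#WORKERS'
# One-line stand-in for the script produced by `jupyter nbconvert --to script`
template_source = f"worker_number = '{PLACEHOLDER}'\n"


def render_worker_script(source: str, worker: int) -> str:
    """Mimic sed's s/#WORKERS/<n>/g on the exported script text."""
    return source.replace(PLACEHOLDER, str(worker))


def parse_worker_number(raw: str):
    """Return the worker index, or None when the placeholder was never replaced."""
    try:
        return int(raw)
    except ValueError:
        return None


rendered = render_worker_script(template_source, 2)
# rendered == "worker_number = '2'\n"

in_notebook = parse_worker_number(PLACEHOLDER)  # None: still running inside Jupyter
in_worker = parse_worker_number('2')            # 2: running as a worker script
```

The failed `int()` conversion is what lets the same source detect whether it is running interactively or as one of the generated worker scripts.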

### Gathering Results

In [None]:
all_classifiers: Dict[Classifier, Any] = {}
classifiers_by_f1_score: Dict[Classifier, np.float64] = {}
classifiers_by_accuracy_score: Dict[Classifier, np.float64] = {}

for worker_number in range(1, os.cpu_count() + 1):
    with open(f'results_{worker_number}.pkl', 'rb') as file:
        results = pickle.load(file)
        all_classifiers.update(results.classifiers)
        classifiers_by_f1_score.update(results.classifiers_by_f1_score)
        classifiers_by_accuracy_score.update(results.classifiers_by_accuracy_score)
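
The loop above assumes that every worker wrote its results file. If a worker crashed, a more tolerant merge might look like the sketch below. The function `merge_worker_results` is hypothetical, and plain dictionaries stand in for the `AllClassifierResults` objects.

```python
import pickle
import tempfile
from pathlib import Path


def merge_worker_results(result_dir, worker_count):
    """Merge per-worker dictionaries, recording which workers left no file."""
    merged = {}
    missing = []
    for worker in range(1, worker_count + 1):
        path = Path(result_dir) / f'results_{worker}.pkl'
        if not path.exists():
            missing.append(worker)
            continue
        with open(path, 'rb') as file:
            merged.update(pickle.load(file))
    return merged, missing


# Demonstration with toy data: worker 2 never wrote a file
with tempfile.TemporaryDirectory() as tmp:
    for worker, scores in ((1, {'a': 0.7}), (3, {'b': 0.8})):
        with open(Path(tmp) / f'results_{worker}.pkl', 'wb') as file:
            pickle.dump(scores, file)
    merged, missing = merge_worker_results(tmp, 3)
# merged == {'a': 0.7, 'b': 0.8}; missing == [2]
```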


In [None]:
classifier_ranks = pd.DataFrame(
    {
        'Classifier': [classifier.name for classifier in all_classifiers],
        'Feature Count': [len(classifier.features) for classifier in all_classifiers],
        'Features': [classifier.features for classifier in all_classifiers],
        'F1 Score': [classifiers_by_f1_score[classifier] for classifier in all_classifiers],
        'Accuracy Score': [classifiers_by_accuracy_score[classifier] for classifier in all_classifiers],
    },
    columns=['Classifier', 'Feature Count', 'Features', 'F1 Score', 'Accuracy Score'],
)
classifier_ranks.sort_values('F1 Score', ascending=False, inplace=True)
classifier_ranks.to_csv('classifier_ranks.csv', index=False)
classifier_ranks

In [None]:
# `f1s` was never defined in this version of the notebook. We assume it is meant
# to hold the best F1 score per classifier type, so we derive it from classifier_ranks.
f1s = classifier_ranks.groupby('Classifier')['F1 Score'].max()
max_score = f1s.max()
plt.figure(figsize=(15, 10))
plt.plot(list(f1s.index), f1s.values)
plt.plot(f1s.idxmax(), max_score, 'ro', label=f"max score: {max_score:.2f}")
plt.title("Classifier Investigation")
plt.legend(loc='upper left')

### Logistic regression seems to be the best?

Next, I want to normalize the features as in <https://www.kaggle.com/mosleylm/titanic-data-set-exploration/execution#II.-Format-Data>. Maybe the score will improve?
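
One way such a normalization experiment could be set up is sketched below. This is an assumption about how we might do it, not something already run in this notebook: it uses `StandardScaler` inside a pipeline, synthetic data instead of the Titanic features, and hypothetical names (`X_demo`, `raw_model`, `scaled_model`). No result is claimed here; the point is only the comparison structure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one column is inflated to imitate features on very different scales
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
X_demo[:, 0] *= 1000

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

raw_model = LogisticRegression(max_iter=1000)
# The pipeline fits the scaler on each training fold only, avoiding leakage into the test fold
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

raw_f1 = cross_val_score(raw_model, X_demo, y_demo, cv=folds, scoring='f1').mean()
scaled_f1 = cross_val_score(scaled_model, X_demo, y_demo, cv=folds, scoring='f1').mean()
print(f'raw: {raw_f1:.3f}  scaled: {scaled_f1:.3f}')
```

Putting the scaler inside the pipeline, rather than scaling the whole dataset up front, keeps the cross-validation estimate honest.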

Based on our results it appears that the