<h1>Importing Libraries</h1>

Please run "pip install -r requirements.txt" from the root of the project.

In [None]:
!pip install -r requirements.txt

In [None]:
import os
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, precision_score, accuracy_score, recall_score, ConfusionMatrixDisplay, confusion_matrix, classification_report
import warnings
# Disable all FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

<h1>NumPy vs Pandas</h1>

A Pandas DataFrame is a two-dimensional, mutable, tabular data structure in Python whose columns can each hold a different data type.

A NumPy array is a multi-dimensional data structure in Python whose elements all share a single data type.
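
As a small illustration of the difference, here is a sketch on made-up data (the column names and values are assumptions chosen for this example, not taken from the Titanic file). Each DataFrame column keeps its own dtype, while converting the whole frame to a NumPy array forces one dtype for every element:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every DataFrame column keeps its own dtype
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [22.0, 38.0], "Pclass": [3, 1]})
print(df.dtypes)

# A NumPy array has one dtype; mixing strings and numbers falls back to object
mixed = df.to_numpy()
print(mixed.dtype)

# Selecting only the numeric columns gives a homogeneous float array
numeric = df[["Age", "Pclass"]].to_numpy()
print(numeric.dtype)
```

This is one reason the notebook encodes the text columns as integers before handing the data to scikit-learn.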

<h1>Titanic Dataset</h1>

https://www.kaggle.com/c/titanic/data

<h2> Variable Notes </h2>

| Variable  | Definition                  | Key                                            |
|-----------|-----------------------------|------------------------------------------------|
| survival  | Survival                    | 0 = No, 1 = Yes                                |
| pclass    | Ticket class                | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex       | Sex                         |                                                |
| Age       | Age in years                |                                                |
| sibsp     | # of siblings / spouses aboard the Titanic |                                      |
| parch     | # of parents / children aboard the Titanic |                                      |
| ticket    | Ticket number               |                                                |
| fare      | Passenger fare              |                                                |
| cabin     | Cabin number                |                                                |
| embarked  | Port of Embarkation         | C = Cherbourg, Q = Queenstown, S = Southampton |


<h1>Data Preparation</h1>

The goal here is to manipulate the data into a form that the models will accept.
This includes filling in NaN (Not a Number) values in the pandas DataFrame.

In [None]:

label_encoder = preprocessing.LabelEncoder()
current_directory = os.getcwd()
# os.path.join builds a path that works on any operating system
titanic = pd.read_csv(os.path.join(current_directory, 'titanic.csv'))

# Fill missing values with a distinct placeholder that I know I can encode.
# fillna returns a new Series, so the result has to be assigned back.
for column in ["Cabin", "Embarked", "Sex", "Ticket", "Name"]:
    titanic[column] = titanic[column].fillna("Not Known")

# Encode the values in these columns, then replace each column with its encoded version
titanic["Cabin"] = label_encoder.fit_transform(titanic["Cabin"])
titanic["Name"] = label_encoder.fit_transform(titanic["Name"])
titanic["Sex"] = label_encoder.fit_transform(titanic["Sex"])
titanic["Ticket"] = label_encoder.fit_transform(titanic["Ticket"])
titanic["Embarked"] = label_encoder.fit_transform(titanic["Embarked"])

titanic_labels = titanic["Survived"]
# The Survived column can be discarded now that it is stored separately from the features.
titanic = titanic.drop("Survived", axis=1)
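
The cell above reuses a single `LabelEncoder`, so after the last `fit_transform` only the Embarked mapping remains on the object. If you want to look up which integer stands for which original value later, one option is to keep one encoder per column. The sketch below shows this on a made-up frame (the `toy` data and the `encoders` dictionary are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for two of the Titanic columns
toy = pd.DataFrame({"Sex": ["male", "female", None], "Embarked": ["S", None, "C"]})

encoders = {}
for column in ["Sex", "Embarked"]:
    # fillna returns a new Series, so the result is assigned back
    toy[column] = toy[column].fillna("Not Known")
    # One encoder per column keeps every mapping available afterwards
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

print(toy)
# classes_ lists the original values in the order of their integer codes
print(encoders["Sex"].classes_)
```

`encoders["Sex"].inverse_transform(...)` can then map the integers back to the original labels.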

<h1>Data Splitting and Final Data Manipulations </h1>

Any fillna that uses an aggregate (sum, mean, min, etc.) is done after the split, because computing the aggregate over the whole column beforehand would leak data between the training set and the test set!

In [None]:
# Play around and see what happens to the models when you change the test-training ratio
train_size = 0.8
test_size = 1 - train_size
# Compare with a tolerance, since exact float equality can fail for some ratios
assert abs(train_size + test_size - 1) < 1e-9

# I decided to drop the Name column as I believe it produces a unique value for almost every row when encoded
titanic.drop("Name", axis=1, inplace=True)

# train_test_split is a helper that scikit-learn provides
titanic_train, titanic_test, labels_train, labels_test = train_test_split(titanic, titanic_labels, train_size=train_size, random_state=42)

In [None]:
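# We do fillna here instead of above to avoid data leakage, since an average computed
# before splitting would use all the data in the column.
# The result is assigned back to the column: inplace=True on a column selected from a
# split frame triggers a chained-assignment warning in recent pandas and may not take effect.
titanic_train["Age"] = titanic_train["Age"].fillna(titanic_train["Age"].mean())

titanic_test["Age"] = titanic_test["Age"].fillna(titanic_test["Age"].mean())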
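
The cell above fills each split with its own mean. A common alternative, shown here only as a sketch on made-up numbers, is to learn the fill value from the training split and reuse it for the test split, so that no test-set statistic is involved at all. The `toy_train` and `toy_test` frames are assumptions for illustration; `SimpleImputer` is scikit-learn's imputation transformer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits standing in for titanic_train / titanic_test
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Learn the mean from the training split only ...
imputer = SimpleImputer(strategy="mean")
toy_train[["Age"]] = imputer.fit_transform(toy_train[["Age"]])
# ... and reuse that same training mean to fill the test split
toy_test[["Age"]] = imputer.transform(toy_test[["Age"]])

print(toy_train["Age"].tolist())
print(toy_test["Age"].tolist())
```

Here the training mean is 30, so the missing test value is filled with 30 rather than with the test split's own mean.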


<h1>Quick Check</h1>

It is always a good idea to validate that the results of your data manipulations match your expectations.

In [None]:
titanic.loc[titanic["Age"].isna()]
# rows can still appear here, since only the train and test splits were filled

In [None]:
titanic_train.loc[titanic_train["Age"].isna()]
# no rows should appear

In [None]:
titanic_test.loc[titanic_test["Age"].isna()]
# no rows should appear

In [None]:
def metrics(true_y, prediction_y):
    """Print accuracy, weighted precision/recall/F1, the classification report and the confusion matrix."""
    print(f"Accuracy: {accuracy_score(y_true=true_y, y_pred=prediction_y)}")
    print(f"Precision (Weighted): {precision_score(y_true=true_y, y_pred=prediction_y, average='weighted')}")
    print(f"Recall (Weighted): {recall_score(y_true=true_y, y_pred=prediction_y, average='weighted')}")
    print(f"F1 Score (Weighted): {f1_score(y_true=true_y, y_pred=prediction_y, average='weighted')}")
    print(classification_report(y_true=true_y, y_pred=prediction_y, zero_division='warn'))
    # Without print, the matrix computed inside the function would be discarded
    print(confusion_matrix(y_true=true_y, y_pred=prediction_y))
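
To make the "weighted" averages printed by `metrics` less opaque, here is a minimal worked example on made-up labels (`y_true` and `y_pred` are assumptions for illustration). The weighted score is the per-class score averaged with each class's number of true samples as its weight:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: three true 0s and two true 1s
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
precision_per_class = precision_score(y_true, y_pred, average=None)
precision_weighted = precision_score(y_true, y_pred, average="weighted")
recall_weighted = recall_score(y_true, y_pred, average="weighted")

print(precision_per_class)        # precision for class 0, then for class 1
print(precision_weighted)         # (3 * 1.0 + 2 * 2/3) / 5
print(accuracy, recall_weighted)  # weighted recall equals accuracy
```

Class 0 is predicted twice and both are correct (precision 1.0), while class 1 is predicted three times with two correct (precision 2/3). Note that weighted recall is always equal to accuracy, so the two lines printed by `metrics` will match.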


<h2>Multinomial Naive Bayes Classifier (naive_bayes.MultinomialNB) with the default parameters.</h2>

In [None]:
# MultinomialNB has no random_state parameter, so none is passed
classifier = MultinomialNB()
model_path = 'models/MNB1_model.pickle'
if os.path.exists(model_path):
    # Reuse the previously trained model
    with open(model_path, 'rb') as pickle_in:
        clf1 = pickle.load(pickle_in)
    print("pickle file used")
else:
    clf1 = classifier.fit(titanic_train, labels_train)
    os.makedirs('models', exist_ok=True)  # create the folder on first run
    with open(model_path, 'wb') as f:
        pickle.dump(clf1, f)

In [None]:
labels_titanic_predict = clf1.predict(titanic_test)

metrics(labels_test, labels_titanic_predict)
cmp = ConfusionMatrixDisplay(confusion_matrix(y_true=labels_test, y_pred= labels_titanic_predict))
fig, ax = plt.subplots(figsize=(10,10))
cmp.plot(ax=ax,cmap='magma')

<h2>Decision Tree (tree.DecisionTreeClassifier) with the default parameters.</h2>

In [None]:
# scikit-learn's default impurity criterion is gini; entropy is chosen here instead
classifier = DecisionTreeClassifier(criterion="entropy", random_state=19)
model_path = 'models/DT1_model.pickle'
if os.path.exists(model_path):
    # Reuse the previously trained model
    with open(model_path, 'rb') as pickle_in:
        clf1 = pickle.load(pickle_in)
    print("pickle file used")
else:
    clf1 = classifier.fit(titanic_train, labels_train)
    os.makedirs('models', exist_ok=True)  # create the folder on first run
    with open(model_path, 'wb') as f:
        pickle.dump(clf1, f)

In [None]:
labels_titanic_predict = clf1.predict(titanic_test)

metrics(labels_test, labels_titanic_predict)
cmp = ConfusionMatrixDisplay(confusion_matrix(y_true=labels_test, y_pred= labels_titanic_predict))
fig, ax = plt.subplots(figsize=(10,10))
cmp.plot(ax=ax,cmap='magma')


In [None]:
# Passing the column names as a plain list works across scikit-learn versions
plot_tree(clf1, feature_names=list(titanic_train.columns))

<h2>Multi-Layered Perceptron (neural_network.MLPClassifier) with the default parameters.</h2>

In [None]:
classifier = MLPClassifier(random_state=19)
model_path = 'models/MLP1_model.pickle'
if os.path.exists(model_path):
    # Reuse the previously trained model
    with open(model_path, 'rb') as pickle_in:
        clf1 = pickle.load(pickle_in)
    print("pickle file used")
else:
    clf1 = classifier.fit(titanic_train, labels_train)
    os.makedirs('models', exist_ok=True)  # create the folder on first run
    with open(model_path, 'wb') as f:
        pickle.dump(clf1, f)

In [None]:
# Uncomment to check that the features and labels have the same number of rows
# print(np.shape(titanic_train))
# print(np.shape(labels_train))

In [None]:
labels_titanic_predict = clf1.predict(titanic_test)
metrics(labels_test, labels_titanic_predict)
cmp = ConfusionMatrixDisplay(confusion_matrix(y_true=labels_test, y_pred= labels_titanic_predict))
fig, ax = plt.subplots(figsize=(10,10))
cmp.plot(ax=ax,cmap='magma')

<h1>Your Turn</h1>

Go find a model you want to use on this data<br>

Simple things you can try:
* manually selecting hyperparameters
* change the features
* try unsupervised learning 

If you don't know where to start, feel free to discuss with me or those around you!

All models and fancy things can be found here 
https://scikit-learn.org/stable/modules/classes.html

In [None]:
classifier = ModelClass # placeholder: replace with an instance of the model you picked
modelName = "UserModel" # name your model
model_path = f'models/{modelName}.pickle'

if os.path.exists(model_path):
    # Reuse the previously trained model
    with open(model_path, 'rb') as pickle_in:
        clf1 = pickle.load(pickle_in)
    print("pickle file used")
else:
    clf1 = classifier.fit(titanic_train, labels_train)
    os.makedirs('models', exist_ok=True)  # create the folder on first run
    with open(model_path, 'wb') as f:
        pickle.dump(clf1, f)
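
The load-or-train pattern is repeated for every model in this notebook, so it can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original notebook: the `load_or_train` function, the synthetic data and the choice of `KNeighborsClassifier` are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier


def load_or_train(model, path, features, labels):
    """Return (model, loaded_from_disk): load the pickle at `path` if present, else fit and save."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), True
    model.fit(features, labels)
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model, False


# Synthetic stand-in data; in the notebook this would be titanic_train / labels_train
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
path = os.path.join(tempfile.mkdtemp(), "models", "UserModel.pickle")

# The first call trains and saves; the second call finds the pickle and loads it
first, first_from_disk = load_or_train(KNeighborsClassifier(n_neighbors=3), path, X, y)
second, second_from_disk = load_or_train(KNeighborsClassifier(n_neighbors=3), path, X, y)
print(first_from_disk, second_from_disk)
```

Keep in mind that a cached pickle is reused even if you change the model's settings, so delete the file (or change the name) when you want to retrain.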

<h1>Model Refinement Through Hyperparameter Search</h1>

The idea is to find hyperparameter values that improve the model's predictive performance. Hyperparameters are settings of the model that you choose before training (e.g. max_depth of a Decision Tree).

https://scikit-learn.org/stable/modules/classes.html#hyper-parameter-optimizers 
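
As a minimal sketch of what a grid search does, the example below tries every combination in a small grid on synthetic data and reports the best one. The data, the grid values and the variable names are assumptions for illustration; the notebook's own searches use the Titanic splits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would use titanic_train / labels_train
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Every combination in the grid is scored with 5-fold cross-validation
param_grid = {"max_depth": [2, 4, 8], "min_samples_split": [2, 10]}
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # 3 * 2 = 6 combinations tried
```

After fitting, `search.best_estimator_` is a model refit on all the training data with the best combination, and calling `search.predict` uses it directly.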

<h2>Multinomial Naive Bayes Classifier found using GridSearchCV</h2>

Hyperparameters are listed under "Parameters" in the scikit-learn documentation; "Attributes" are the values learned during fitting. <br>
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB


In [None]:
parameters = {"alpha":(0.5,0,0.36,1)}
clfMNB = MultinomialNB()

# GridSearchCV takes no random_state argument, so none is passed
classifier = GridSearchCV(estimator=clfMNB, param_grid=parameters, n_jobs=-1, scoring="recall")
model_path = 'models/MNB1GridCV_model.pickle'

if os.path.exists(model_path):
    # Reuse the previously fitted search
    with open(model_path, 'rb') as pickle_in:
        clf1 = pickle.load(pickle_in)
    print("pickle file used")
else:
    clf1 = classifier.fit(titanic_train, labels_train)
    os.makedirs('models', exist_ok=True)  # create the folder on first run
    with open(model_path, 'wb') as f:
        pickle.dump(clf1, f)


In [None]:
labels_titanic_predict = clf1.predict(titanic_test)

print(f"Best estimator: {clf1.best_estimator_}")
metrics(labels_test, labels_titanic_predict)
cmp = ConfusionMatrixDisplay(confusion_matrix(y_true=labels_test, y_pred= labels_titanic_predict))
fig, ax = plt.subplots(figsize=(10,10))
cmp.plot(ax=ax,cmap='magma')

<h2>Decision Tree Classifier found using GridSearchCV</h2>

Hyperparameters are listed under "Parameters" in the scikit-learn documentation; "Attributes" are the values learned during fitting. <br>

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier 

In [None]:
parameters = {"criterion":("gini","entropy"),"max_depth":(100,3,10),"min_samples_split":(12,5,30)}
# random_state belongs on the estimator; GridSearchCV takes no random_state argument
clfDT = DecisionTreeClassifier(random_state=19)

classifier = GridSearchCV(estimator=clfDT, param_grid=parameters, n_jobs=-1, scoring="recall")
model_path = 'models/DT1GridCV_model.pickle'

if os.path.exists(model_path):
    # Reuse the previously fitted search
    with open(model_path, 'rb') as pickle_in:
        clf1 = pickle.load(pickle_in)
    print("pickle file used")
else:
    clf1 = classifier.fit(titanic_train, labels_train)
    os.makedirs('models', exist_ok=True)  # create the folder on first run
    with open(model_path, 'wb') as f:
        pickle.dump(clf1, f)

In [None]:
labels_titanic_predict = clf1.predict(titanic_test)

print(f"Best estimator: {clf1.best_estimator_}")
metrics(labels_test, labels_titanic_predict)
cmp = ConfusionMatrixDisplay(confusion_matrix(y_true=labels_test, y_pred= labels_titanic_predict))
fig, ax = plt.subplots(figsize=(10,10))
cmp.plot(ax=ax,cmap='magma')

In [None]:
# Passing the column names as a plain list works across scikit-learn versions
plot_tree(clf1.best_estimator_, feature_names=list(titanic_train.columns))

<h2>Multi-Layered Perceptron found using GridSearchCV</h2>

Hyperparameters are listed under "Parameters" in the scikit-learn documentation; "Attributes" are the values learned during fitting. <br>
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier 

In [None]:
parameters = {"activation":("identity", "logistic", "tanh", "relu"),"hidden_layer_sizes":((11,),(121,11),(6,12,2),(11,1,3,7)), "solver":("adam","sgd")}
# candidate hidden layers: one layer of 11 neurons, two layers (121, 11), three layers (6, 12, 2), four layers (11, 1, 3, 7)
# random_state belongs on the estimator; GridSearchCV takes no random_state argument
clfMLP = MLPClassifier(early_stopping=True, verbose=True, max_iter=50, random_state=19)
classifier = GridSearchCV(estimator=clfMLP, param_grid=parameters, n_jobs=-1, scoring="recall")
skip = False # when set to False this search will execute
model_path = 'models/MLP1GridCV_model.pickle'

if not skip:
    if os.path.exists(model_path):
        # Reuse the previously fitted search
        with open(model_path, 'rb') as pickle_in:
            clf1 = pickle.load(pickle_in)
        print("pickle file used")
    else:
        clf1 = classifier.fit(titanic_train.values, labels_train.values)
        os.makedirs('models', exist_ok=True)  # create the folder on first run
        with open(model_path, 'wb') as f:
            pickle.dump(clf1, f)
else:
    print("skipped")

In [None]:
if (os.path.exists('models/MLP1GridCV_model.pickle') and not skip):
    # Predict on .values to match how the search was fitted (plain arrays, no column names)
    labels_titanic_predict = clf1.predict(titanic_test.values)
    print(f"Best estimator: {clf1.best_estimator_}")
    metrics(labels_test, labels_titanic_predict)
    cmp = ConfusionMatrixDisplay(confusion_matrix(y_true=labels_test, y_pred= labels_titanic_predict))
    fig, ax = plt.subplots(figsize=(10,10))
    cmp.plot(ax=ax,cmap='magma')
else:
    print("skipped")

In [None]:
# make each model take independent names so we can graph recall or precision

<h1>More Resources</h1>


scikit-learn linear models [Link](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model).

scikit-learn clustering models [Link](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster).

scikit-learn ensemble models [Link](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble).