## Introduction

This notebook is my **first** data-processing project using basic machine learning methods.

Our task is to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

## Data gathering

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
plt.style.use("seaborn-whitegrid")
import seaborn as sns

from collections import Counter
import warnings
warnings.filterwarnings("ignore")
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

df = pd.read_csv("/kaggle/input/titanic/train.csv")
df_test = pd.read_csv("/kaggle/input/titanic/test.csv")
check = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

In [None]:
df.columns

## Data cleaning

Display the columns that contain *null values*, together with their counts.

In [None]:
# Count missing values per column and keep only the columns that have any
missing = df.isna().sum()
missing[missing > 0]

We can see that three columns have missing values and that one of them, **"Cabin"**, has a significant number of them, so I decided to delete it.
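
To judge whether a column is worth keeping, it can help to look at the share of missing values as well as the raw count. The snippet below is a minimal sketch on a hypothetical toy frame; the helper name `missing_summary` and the sample values are my own assumptions, not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})


def missing_summary(frame):
    """Return count and share of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(frame)})
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

A column whose share is close to 1 is a candidate for dropping, while a column with only a few gaps is usually better handled by dropping rows or imputing.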

In [None]:
df = df.drop(['Cabin'], axis=1)
df.columns

As we can see, the **"Cabin"** column has disappeared.

Let's also remove the two rows in which the **"Embarked"** value is missing.

In [None]:
df = df.dropna(subset=['Embarked'])
df['Embarked'].isna().sum()

Let's get more information from our columns.
Look for **unique items** for each column.

In [None]:
df.apply(lambda x: (len(x.unique()),x.unique())).T.rename(columns={0:"unique", 1:"elements"})

From this I can note the following.
The **"Pclass"** column takes the values 1, 2 and 3.
The **"Embarked"** column takes the values "S", "C" and "Q".
I will drop the **"Name"** and **"Ticket"** columns because they contain many distinct values that I won't use in this analysis.

In [None]:
df = df.drop(['Ticket', 'Name'], axis=1)
df.columns

## Exploring

In [None]:
df.describe()

As we can see from the **"Age"** column, the oldest person is *80* years old, the youngest is less than a year.
Max number of siblings / spouses is *8*.
Max number of parents / children is *6*.
The highest fare is *512* and the lowest is *0*.

Let us now turn our attention to the survival rate by sex.

In [None]:
import seaborn as sns

_ = sns.barplot(x='Sex', y='Survived', data=df)

Women had a much better chance of survival (about *74%*) than men (about *19%*).
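
The bar heights in such a plot are group means: because **"Survived"** holds only 0 and 1, its mean within a group is that group's survival rate. A minimal sketch on made-up rows (the values below are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical toy rows; 1 means survived, 0 means did not survive
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
rates = toy.groupby("Sex")["Survived"].mean()
print(rates.round(2))
```

Running the same `groupby` on `df` gives the exact numbers behind the bars.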

In [None]:
_ = sns.barplot(x='Pclass', y='Survived', data=df)

Passengers from **first** class were *more than twice* as likely to survive as passengers from **third** class.

## Correlation

In [None]:
import numpy as np

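# "Sex" and "Embarked" are still strings here, so correlate the numeric columns only
corr = df.select_dtypes(include="number").corr()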
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
_ = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Strongly correlated columns carry largely redundant information, so they add *little value* and can distort our interpretations.
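
Besides reading the heatmap by eye, the strongly correlated pairs can be listed directly. The following is a sketch under my own assumptions: the helper name `correlated_pairs`, the threshold of 0.8 and the toy columns are hypothetical and should be adapted to the data at hand.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair appears once, without the diagonal
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Hypothetical toy data: "b" is roughly 2 * "a", while "c" is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

pairs = correlated_pairs(toy)
print(pairs)
```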

In [None]:
sets = pd.cut(df['Age'], [0, 18, 80])
sex_age = pd.pivot_table(df, 'Survived', ['Sex', sets], 'Pclass')
_ = sns.heatmap(sex_age, cmap="RdBu_r", vmax=1, vmin=-1, annot=True,
                              square=True, linewidths=.5, fmt='.2f')

From the pivot table above, we can conclude that **women** in *first* and *second* class had a chance of survival of almost **90%**.
Among **men**, those aged *18* or under in first class had the **best chance** of survival.
On the other hand, **adult men** did the worst, as did many passengers in *third class*.
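
The grouping used for this heatmap can be tried on a small example. This is a sketch with invented rows; the column name `AgeGroup` and the bin labels are my own choices for illustration.

```python
import pandas as pd

# Hypothetical toy passengers
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Age": [10, 30, 12, 40, 50, 25],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# Bin ages into (0, 18] and (18, 80], the same edges as in the cell above
toy["AgeGroup"] = pd.cut(toy["Age"], [0, 18, 80], labels=["0-18", "19-80"])

# Mean of "Survived" per (Sex, AgeGroup) row and Pclass column
table = toy.pivot_table(
    values="Survived",
    index=["Sex", "AgeGroup"],
    columns="Pclass",
    aggfunc="mean",
    observed=True,
)
print(table)
```

Cells with no passengers in that combination show up as `NaN`, which is why some heatmap cells can be empty.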

I convert the **"Sex"** and **"Embarked"** string values to numbers so they can be used in a chart and, later, in the models.

In [None]:
from sklearn.preprocessing import LabelEncoder  

le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df['Embarked'] = le.fit_transform(df['Embarked'])
g = sns.barplot(x=df['Embarked'], y=df['Survived'])
# LabelEncoder sorts classes alphabetically: C -> 0, Q -> 1, S -> 2
g.set_xticklabels(['Cherbourg', 'Queenstown', 'Southampton'])
g.set_title('Chance of survival by port of embarkation')
_ = g.set_xlabel('Port of Embarkation')

Statistically, more than **50%** of the people who embarked at **Cherbourg** survived.
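
When labelling a chart built from encoded values, it is worth checking which code belongs to which original value. `LabelEncoder` stores the classes in sorted order in its `classes_` attribute, and the position of each class is its code. A small sketch (the `ports` list and the `port_names` lookup are written out by hand for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of embarkation codes
ports = ["S", "C", "Q", "S", "C"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ is sorted, so the index of each class is the integer it was given
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# Build tick labels in the same order as the codes 0, 1, 2
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
tick_labels = [port_names[label] for label in encoder.classes_]
print(tick_labels)
```

Deriving the tick labels from `classes_` keeps the chart labels in step with the encoding.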

In [None]:
hist = df['Age'].hist(bins=15)
hist.set_title('Age distribution of passengers')
hist.set_xlabel('Age')
_ = hist.set_ylabel('Number of people')

## Preparation of data for testing 

In [None]:
df_test.isna().sum()

In [None]:
PassengerId = df_test['PassengerId']
df_test = df_test.drop(['PassengerId', 'Name', 'Age', 'Ticket', 'Cabin'], axis=1)
df_test.columns

I check whether any **missing values** remain in the test data.

In [None]:
df_test.isna().sum()

In [None]:
median = df_test['Fare'].median()
df_test['Fare'] = df_test['Fare'].fillna(median)

In [None]:
df_test.isna().sum()

As before, we replace the string values with numeric codes.

In [None]:
df_test['Sex'] = le.fit_transform(df_test['Sex'])
df_test['Embarked'] = le.fit_transform(df_test['Embarked'])
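
Refitting the encoder on the test data gives the same codes only if the test column contains exactly the same set of values as the training column. A more defensive variant, sketched here on hypothetical toy frames, fits one encoder per column on the training data and reuses it for the test data; the `encoders` dictionary is my own naming.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames; the test frame happens to contain only "S"
train = pd.DataFrame({"Sex": ["male", "female", "male"], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Sex": ["female", "male"], "Embarked": ["S", "S"]})

encoders = {}
for column in ["Sex", "Embarked"]:
    # Learn the value -> code mapping from the training data only
    encoders[column] = LabelEncoder().fit(train[column])
    train[column] = encoders[column].transform(train[column])
    # Reuse the same mapping so codes mean the same thing in both frames
    test[column] = encoders[column].transform(test[column])

print(test)
```

In this toy case a fresh `fit_transform` on the test frame would encode "S" as 0, whereas the training mapping encodes it as 2.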

In [None]:
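X_test = df_test
# gender_submission.csv is Kaggle's sample submission (all women survive), not the true labels,
# so scores against y_test measure agreement with that baseline rather than real accuracy
y_test = check['Survived']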

## Preparation of data for training

In [None]:
X_train = df.iloc[:,2:].drop(['Age'], axis=1)
y_train = df.iloc[:,1]

In [None]:
print(f"Training data rows={X_train.shape[0]}, columns={X_train.shape[1]}")
print(f"{'Test':>8} data rows={X_test.shape[0]}, columns={X_test.shape[1]}")

### With the training and test sets ready, we can start modeling.

In [None]:
models = {}

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()
logistic_regression = logistic_regression.fit(X_train, y_train)
score = logistic_regression.score(X_test, y_test)
models["logistic_regression"] = score
score

In [None]:
from yellowbrick.model_selection import LearningCurve
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 4))
lc3_viz = LearningCurve(LogisticRegression(), cv=10)
lc3_viz.fit(X_train, y_train)
_ = lc3_viz.poof()

A learning curve shows how a model's score changes as the number of training samples grows.

It can be used to diagnose whether a model is underfit, overfit or well fit.

The training score tells us how well the model fits the data it has seen.

The cross-validation score gives us an estimate of how well the model generalizes to unseen data.

The plot suggests underfitting when the validation line stays flat or decreases as more data is added.
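
The numbers behind such a plot can also be computed directly with scikit-learn's `learning_curve` function. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, so the sample count, fold count and training sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data: 300 samples, 6 features
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# Score the model on 5 increasing training-set sizes, with 5-fold cross-validation
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
)

# Average over the folds to get one training and one validation score per size
for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"{n:>4} samples: train={tr:.3f}, validation={va:.3f}")
```

A large, persistent gap between the two columns points towards overfitting, while two low scores that sit close together point towards underfitting.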

**Confusion matrix: true vs. false and positive vs. negative predictions**

In [None]:
def c_matrix(matrix, title):
    fig, ax = plt.subplots(1,1)
    im1 = plt.imshow(matrix, cmap='Greens')
    fig.colorbar(im1)

    for i in range(2):
        for j in range(2):
            text = ax.text(j, i,matrix[i, j],
                           ha="center", va="center", color="black")

    plt.xticks([])
    plt.yticks([])
    plt.xlabel('Predicted Class')
    plt.ylabel('True Class')
    plt.title(f'Confusion Matrix\n{title}')
    plt.show()

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = logistic_regression.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
c_matrix(matrix, "LogisticRegression")

We want the values on the main diagonal (correct predictions) to be as large as possible and the remaining values (errors) to be as small as possible.
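
The four cells of a binary confusion matrix can be turned into the usual summary metrics. A minimal sketch with invented labels (`y_true` and `y_hat` are hypothetical, not model output from this notebook):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn orders the flattened cells as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```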

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest = random_forest.fit(X_train, y_train)
score = random_forest.score(X_test, y_test)
models["random_forest"] = score
score

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
lc3_viz = LearningCurve( RandomForestClassifier(n_estimators=100, random_state=42), cv=10)
lc3_viz.fit(X_train, y_train)
_ = lc3_viz.poof()

In [None]:
import matplotlib.pyplot as plt
data = random_forest.feature_importances_
column_names = X_train.columns.values
df1 = pd.DataFrame(index=column_names, data=data, columns=["feature importance"])
df1.plot(kind='bar')
plt.xlabel("Feature")
plt.ylabel('Importance')
plt.title('Which features are relevant')
plt.show()

In [None]:
y_pred = random_forest.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
c_matrix(matrix, "Random Forest Classifier")

## Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(random_state=42, max_depth=5)
decision_tree = decision_tree.fit(X_train, y_train)
score = decision_tree.score(X_test, y_test)
models["decision_tree"] = score
score

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
lc3_viz = LearningCurve(DecisionTreeClassifier(random_state=42, max_depth=5), cv=10)
lc3_viz.fit(X_train, y_train)
_ = lc3_viz.poof()

In [None]:
y_pred = decision_tree.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
c_matrix(matrix, "DecisionTreeClassifier")

## Support Vector Machine

In [None]:
from sklearn.svm import SVC
svc = SVC(random_state=42, probability=True)
svc = svc.fit(X_train, y_train)
score = svc.score(X_test, y_test)
models["svc"] = score
score

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
lc3_viz = LearningCurve(SVC(random_state=42, probability=True), cv=10)
lc3_viz.fit(X_train, y_train)
_ = lc3_viz.poof()

In [None]:
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
c_matrix(matrix, "Support Vector Machine")

## K-Nearest Neighbor

In [None]:
from sklearn.neighbors import KNeighborsClassifier
k_neigth = KNeighborsClassifier()
k_neigth = k_neigth.fit(X_train, y_train)
score = k_neigth.score(X_test, y_test)
models["k_neigth"] = score
score

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
lc3_viz = LearningCurve(KNeighborsClassifier(), cv=10)
lc3_viz.fit(X_train, y_train)
_ = lc3_viz.poof()

Although the score is not very high, the model appears to keep improving as the amount of training data increases.

In [None]:
y_pred = k_neigth.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
c_matrix(matrix, "K-Nearest Neighbor")

## Summary

In [None]:
dict(sorted(models.items(), key=lambda x: x[1], reverse=True))

Based on these scores, logistic regression stands out with the highest score, so I use it for the final predictions.
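
Since `y_test` here comes from `gender_submission.csv`, a complementary way to rank the models is cross-validation on the training data alone. The sketch below shows the idea on synthetic data; the two candidate models, the fold count and the generated data are assumptions for illustration, and the resulting order says nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

# Mean accuracy over 5 folds for each candidate model
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

# Sort from best to worst, as in the summary cell above
ranking = sorted(cv_scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```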

In [None]:
y_pred = logistic_regression.predict(X_test)

In [None]:
submission = pd.DataFrame({
    "PassengerId" : PassengerId.values,
    "Survived" : y_pred
})
submission

In [None]:
submission.to_csv('submission.csv', index=False)
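
Before uploading, a quick sanity check of the submission frame can catch simple mistakes. This is a sketch only: the helper `check_submission` and its rules are my own assumptions, based on the two columns built above and one row per test passenger.

```python
import pandas as pd


def check_submission(frame, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(frame.columns) != ["PassengerId", "Survived"]:
        return [f"unexpected columns: {list(frame.columns)}"]
    problems = []
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    if frame["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not frame["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems


# Hypothetical frames: one valid, one with a duplicate id and an invalid label
good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": [892, 892, 894], "Survived": [0, 2, 0]})

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

For this notebook the call would be `check_submission(submission, expected_rows=len(df_test))`.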