In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
from sklearn import metrics
# pip install graphviz
# conda install python-graphviz

In [None]:
matplotlib.rcParams.update({'font.size': 18,
                            'lines.linewidth' : 3,
                            'figure.figsize' : [15, 5],
                            'lines.markersize': 10})
pd.options.mode.chained_assignment = None

# Load our data

In [None]:
df_titanic = pd.read_csv('./00_data/titanic.csv')

In [None]:
df_titanic.head()

# Some quick data exploration

In [None]:
sns.countplot(x='Survived', data=df_titanic)

It is evident that most passengers did not survive the sinking.

Out of 891 passengers in the training set, only around 350 survived, i.e. only 38.4% of the training set survived. We need to dig deeper to get better insights from the data and see which categories of passengers survived and which did not.

In [None]:
sns.countplot(x='Sex', hue='Survived', data=df_titanic)

There were many more men than women on the ship. Still, the number of women saved is almost twice the number of men saved. The survival rate for **women on the ship is around 75%, while that for men is around 18-19%.**

I guess the saying **women and children first** did apply on the Titanic and was not just a Hollywood stunt.

In [None]:
sns.countplot(x='Pclass', hue='Survived', data=df_titanic)

People say **Money Can't Buy Everything**. But we can clearly see that passengers of Pclass 1 were given very high priority during the rescue. Even though the number of passengers in Pclass 3 was a lot higher, their survival rate is very low, somewhere around **25%**.

For Pclass 1 the share that survived is around **63%**, while for Pclass 2 it is around **48%**. So money and status matter.

In [None]:
sns.countplot(x='Embarked', hue='Survived', data=df_titanic)

Oh! So people who embarked at C (Cherbourg) and Q (Queenstown) were more likely to survive than those who boarded at S (Southampton).
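The percentages quoted in this section can be computed directly instead of being read off the plots. The snippet below is a minimal sketch: `survival_rate` is a hypothetical helper and `toy` is a small made-up frame standing in for `df_titanic`.

```python
import pandas as pd

def survival_rate(df, by):
    """Share of survivors per group, in percent (mean of the 0/1 Survived column)."""
    return (df.groupby(by)['Survived'].mean() * 100).round(1)

# Toy stand-in for df_titanic (made-up rows, for illustration only)
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1],
})

toy_rates = survival_rate(toy, 'Sex')
print(toy_rates)
```

On the real data, calls such as `survival_rate(df_titanic, 'Sex')` or `survival_rate(df_titanic, 'Pclass')` should give the exact figures behind the plots.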

# Pre-Processing


## Imputing missing data
Looking at our Titanic dataset, the **cabin** feature clearly contains `NaN` values. We need to handle these and any other columns with null values.

In [None]:
sns.heatmap(df_titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
def missingValues(data):
    total = data.isnull().sum().sort_values(ascending=False) # getting the sum of null values and ordering
    percent = (data.isnull().sum() / data.isnull().count() * 100).sort_values(ascending=False)  #getting the percent and order of null
    dt = pd.concat([total, percent], axis=1,keys=['Total','Percent'])  # Concatenating the total and percent
    return dt
missingValues(df_titanic)

So we have 3 types of missing information:
- **cabin** has 77% missing, so it is not clear what information it can give us
- **age** we can fill in with a typical value, such as the mean
- **embarked** & **fare**
    - we could delete these rows and not lose too much info (only 0.15% and 0.07%)
    - we could impute the most common embarkation port or the mean fare

In [None]:
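# drop cabin: with 77% of its values missing it is hard to impute reliably
df_titanic.drop(['Cabin'], axis=1, inplace=True)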

In [None]:
# impute mean age - 29.69 seems about right
np.mean(df_titanic['Age'])

In [None]:
df_titanic['Age'].head(8)

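In [None]:
# fill the missing ages with the mean age, then look at the same rows again
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].mean())
df_titanic['Age'].head(8)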

In [None]:
# impute mean fare (fillna avoids chained assignment, which may not update the frame)
df_titanic['Fare'] = df_titanic['Fare'].fillna(round(df_titanic['Fare'].mean()))

In [None]:
# impute most common embarkation port
from collections import Counter
Counter(df_titanic['Embarked'])

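In [None]:
# fill the missing ports with the most common port, then recount
df_titanic['Embarked'] = df_titanic['Embarked'].fillna(df_titanic['Embarked'].mode()[0])
Counter(df_titanic['Embarked'])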

In [None]:
missingValues(df_titanic)
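The manual imputation steps above could also be expressed with scikit-learn's `SimpleImputer`. This is only a sketch of that alternative, not what the notebook itself does, and the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the Age and Embarked columns (made-up values)
toy = pd.DataFrame({
    'Age':      pd.Series([22.0, np.nan, 30.0, 38.0]),
    'Embarked': pd.Series(['S', 'C', np.nan, 'S'], dtype=object),
})

mean_imputer = SimpleImputer(strategy='mean')           # numeric column -> column mean
mode_imputer = SimpleImputer(strategy='most_frequent')  # categorical column -> most common value

# fit_transform expects a 2D input and returns a 2D array, hence [[...]] and ravel()
toy['Age'] = mean_imputer.fit_transform(toy[['Age']]).ravel()
toy['Embarked'] = mode_imputer.fit_transform(toy[['Embarked']]).ravel()
print(toy)
```

An imputer fitted on the training data can later be applied to new data with `transform`, which keeps the fill values consistent between the two.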

## Data types
**Categorical Features:**
A categorical variable is one that has two or more categories, and each value of that feature belongs to one of them. For example, gender is a categorical variable with two categories (male and female). We cannot sort or give any ordering to such variables. They are also known as nominal variables.

Categorical Features in the dataset: **Sex, Embarked**.

**Ordinal Features:**
An ordinal variable is similar to a categorical variable, but the difference is that its values have a relative ordering. For example, if we have a feature like Height with values Tall, Medium, Short, then Height is an ordinal variable: its values can be sorted relative to one another.

Ordinal Features in the dataset: **Pclass**


**Continuous Features:**
A feature is said to be continuous if it can take any value between the minimum and maximum values in the feature's column.

Continuous Features in the dataset: **Age, Fare**

In [None]:
df_titanic.head()

## Categorical Variables
### Sex - Binary

In [None]:
Counter(df_titanic['Sex'])

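In [None]:
## use preprocessing LabelEncoder
from sklearn.preprocessing import LabelEncoder
df_titanic['Sex'] = LabelEncoder().fit_transform(df_titanic['Sex'])  # female -> 0, male -> 1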

### Embarked - Categorical

In [None]:
Counter(df_titanic['Embarked'])

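In [None]:
## use preprocessing OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)  # dense output; older scikit-learn versions use sparse=False
ohe_fitted = ohe.fit(df_titanic[['Embarked']])
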
In [None]:
ohe_fitted.categories_

In [None]:
df_embarked = pd.DataFrame(ohe_fitted.transform(df_titanic[['Embarked']]),
                           columns= [ 'Embarked_{}'.format(ii) for ii in ohe_fitted.categories_[0].tolist()]
                          )
df_embarked.head()

In [None]:
df_titanic = pd.concat([ df_titanic.drop(['Embarked'], axis=1), df_embarked], axis=1)
df_titanic.head()

## Ordinal Data
What about **pclass** ?
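One possible answer, stated here as an assumption rather than the notebook's own choice: **Pclass** is already stored as the integers 1, 2, 3, and that order matches the class ranking, so it can be left as it is. For ordinal features stored as strings, such as the Height example above, a minimal sketch with scikit-learn's `OrdinalEncoder` and made-up data could look like this:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordinal feature, following the Height example in the text
heights = pd.DataFrame({'Height': ['Tall', 'Short', 'Medium', 'Short']}, dtype=object)

# Pass the categories explicitly so the codes follow the intended order
# (the default would order them alphabetically instead)
encoder = OrdinalEncoder(categories=[['Short', 'Medium', 'Tall']])
codes = encoder.fit_transform(heights[['Height']]).ravel()
print(codes)
```

Here Short, Medium and Tall map to 0, 1 and 2, so comparisons such as "less than or equal to Medium" keep their meaning for a tree split.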

## Continuous Data

Sometimes our observations will be very unevenly distributed for a given feature. For example, *Fare* (the ticket price) is roughly exponentially distributed. In cases like these it can be useful to transform the values of our features or our target to better highlight trends or to allow the use of models that might not otherwise be applicable.

In [None]:
def plotDist(data):
    # histogram with a KDE overlay; histplot replaces the deprecated distplot
    sns.histplot(np.ravel(data), bins=20, kde=True, stat='density', color='darkblue',
                 edgecolor='black', line_kws={'linewidth': 4})
plotDist(df_titanic['Age'])

In [None]:
## use sklearn.preprocessing StandardScaler
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss_transformed = ss.fit_transform(df_titanic[['Age']])
ss_transformed[:10]

In [None]:
plotDist(ss_transformed)

In [None]:
df_titanic.head()

In [None]:
df_titanic['Age'] = ss_transformed.ravel()  # flatten the (n, 1) array to a single column

In [None]:
df_titanic.head()

## MinMaxScaler
Another technique used to scale data is [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
- scaling each feature to a given range
- *sensitive to outliers*

Formal definition for MinMaxScaler is given by: $$ x'_i = \frac{x_i-x_{min}}{x_{max} - x_{min}}$$
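To see that the formula and the scikit-learn class agree, here is a minimal sketch that compares `MinMaxScaler` with a by-hand calculation. The `fares` values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values with one large outlier
fares = np.array([[0.0], [10.0], [25.0], [100.0]])

# Default feature_range is (0, 1)
scaled = MinMaxScaler().fit_transform(fares)

# Same calculation by hand: (x - x_min) / (x_max - x_min)
manual = (fares - fares.min()) / (fares.max() - fares.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The single large value pushes the other three into the lower quarter of the range, which is the outlier sensitivity mentioned above.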

**In a Nutshell:**

**You probably won't go wrong if you use `StandardScaler` to scale your features.**

## Ready up for modelling

In [None]:
df_target = df_titanic['Survived']
# drop unwanted/redundant columns
df_titanic.drop(['PassengerId', 'Survived', 'Name', 'Ticket'], axis=1, inplace=True)
df_titanic.head()

In [None]:
X = np.array(df_titanic)
X

In [None]:
y = df_target
y[:20]

In [None]:
# save for later use
df_titanic.to_csv('./00_data/titanic_X.csv', index=False)
y.to_csv('./00_data/titanic_y.csv', header=False, index=False)

# Create a train-test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
print('X_train : {}'.format(X_train.shape))
print('y_train : {}'.format(y_train.shape))
print('X_test : {}'.format(X_test.shape))
print('y_test : {}'.format(y_test.shape))

# Intro to our first model - Decision Trees
- can be used for both classification and regression.
- also for outlier detection.

The trained models resemble a tree, complete with branches and nodes. 

The model is essentially a series of questions with yes or no answers,
- resulting tree structure contains all the combination of responses.

Tree based models are popular:
- mimic human decision making process, 
- work well for a large class of problems, 
- naturally handle multiclass classification, and 
- handle a mix of categorical and numerical data. 
- easy to understand and explain : good explainability

## Training decision tree classifiers
The best way to understand a decision tree is to construct one and visualize it. We'll train a decision tree classifier on the Titanic training set prepared above and visualize the tree with the Graphviz package. Each observation is described by the preprocessed passenger features (such as sex, passenger class, and age), and the target is whether the passenger survived.

In [None]:
import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz

## use DecisionTreeClassifier of depth 2. fit(X_train, y_train)

# train decision tree
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train, y_train)

# visualize tree
graphviz.Source(export_graphviz(tree, 
                                out_file=None,
                                feature_names=np.array(df_titanic.columns.tolist()),
                                class_names=np.array(['0','1'])))

Note how the model resembles an upside down tree and each box represents a node in the tree. Printed in each box is

* **samples**: the number of observations in the node.
* **Gini**: a measure of node purity.
* **value**: the distribution of observations in each class.
* **class**: the most common label in the node.

At the top of the tree is the __root node__. This node is _split_ to form two branches. Observations that satisfy the criterion printed at the top of the box are moved to one branch while the rest to the other. You can view a decision tree as a model that is making partitions in a space that contains your training data. The partitions are chosen to separate the different classes. For the tree displayed above, node splits were chosen to lead to an overall reduction of the Gini metric, discussed further in the next section. The nodes that do not branch off are called __terminal nodes__ or __leaves__.

With a trained tree, predictions are made on an observation by starting at the root and following the path determined by the criterion in each node. Once at a leaf, the predicted class is the class with the plurality. For example, if an observation has **sex** of 0 (<=0.5) and **pclass** of 1 or 2 (<=2.5), it will reside in the leftmost leaf in the figure. 

Our trained tree model only makes splits using three features, **sex**, **pclass** and **age**, making it easy to visualize our model.
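The same root-to-leaf path can be read as text rules. The sketch below uses scikit-learn's `export_text` on a tiny made-up data set (`X_toy`, `y_toy`) with only two columns, so the rules it prints illustrate the idea and are not the rules of the Titanic tree above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up observations: columns are [Sex, Pclass], with Sex encoded as 0/1
X_toy = np.array([[0, 1], [0, 2], [0, 3], [0, 3],
                  [1, 1], [1, 1], [1, 2], [1, 3], [1, 3]])
y_toy = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0])

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Print the tree as nested if/else rules, one line per node
print(export_text(toy_tree, feature_names=['Sex', 'Pclass']))

# Follow the path for one observation: Sex = 0, Pclass = 1
toy_pred = toy_tree.predict([[0, 1]])
print(toy_pred)
```

Each indentation level in the printed rules corresponds to one step down the tree, and the final `class` line is the plurality class of the leaf.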

### Gini impurity
- node impurity
- probability of misclassifying an observation if it were randomly labeled

A decision tree chooses the node splits that most reduce the Gini metric. The equation for the Gini impurity for node $m$ is

$$ G_m = \sum_k p_{mk} (1 - p_{mk}), $$

where $p_{mk}$ is the fraction of observations of class $k$ in node $m$. 

Consider two cases where a node has 10 observations belonging to two classes:
* Case 1: [5, 5]
$$ G = \frac{5}{10} \left(1 - \frac{5}{10}\right) + \frac{5}{10} \left(1 - \frac{5}{10}\right) = 0.5 $$
* Case 2: [10, 0]
$$ G = \frac{10}{10} \left(1 - \frac{10}{10}\right) + \frac{0}{10} \left(1 - \frac{0}{10}\right) = 0 $$
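The two cases can be checked numerically. The sketch below uses two hypothetical helpers, `gini_impurity` and `weighted_gini`. The second one weights each child node by its size, which is an assumption about how a candidate split would be scored, and the split `[4, 1]` / `[1, 4]` is made up for illustration.

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity of a node, given the number of observations per class."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()  # class fractions p_mk
    return float(np.sum(p * (1 - p)))

def weighted_gini(left_counts, right_counts):
    """Impurity after a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini_impurity(left_counts) + (n_right / n) * gini_impurity(right_counts)

gini_case_1 = gini_impurity([5, 5])    # maximally mixed node
gini_case_2 = gini_impurity([10, 0])   # pure node

# Splitting the [5, 5] node into [4, 1] and [1, 4] lowers the impurity
gini_after_split = weighted_gini([4, 1], [1, 4])
print(gini_case_1, gini_case_2, gini_after_split)
```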

The greater the node purity, the lower the Gini metric. See the plot below of how Gini varies with $p_{mk}$ when there are two classes.

In [None]:
p = np.linspace(1E-6, 1-1E-6, 100)
gini = p*(1-p) + (1-p)*p

plt.plot(p, gini)
plt.xlabel('$p$')
plt.ylabel('Gini')

### Entropy

In chemistry, entropy is a measure of the amount of disorder in your system. For a decision tree node, it measures how mixed the class labels are.

The equation for entropy of node $m$ is

$$ H_m = -\sum_{k} p_{mk} \log_2(p_{mk}).$$

Using the same two cases as before when calculating the Gini metric, the entropy is equal to

* Case 1: [5, 5]
$$ H = -\left[\frac{5}{10} \log_2 \left(\frac{5}{10}\right) + \frac{5}{10} \log_2 \left(\frac{5}{10}\right)\right] = 1  $$
* Case 2: [10, 0]
$$ H = -\left[\frac{10}{10} \log_2 \left(\frac{10}{10}\right) + \frac{0}{10} \log_2 \left(\frac{0}{10}\right)\right] = 0 $$
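The same check works for entropy. Case 2 contains the term $0 \cdot \log_2(0)$, which is taken to be 0 by convention. The hypothetical `entropy` helper below applies that convention by dropping empty classes before taking the logarithm.

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a node, given the number of observations per class."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]  # convention: 0 * log2(0) = 0, so empty classes are dropped
    h = -np.sum(p * np.log2(p))
    return float(h) + 0.0  # adding 0.0 turns a possible -0.0 into 0.0

entropy_case_1 = entropy([5, 5])    # maximally mixed node
entropy_case_2 = entropy([10, 0])   # pure node
print(entropy_case_1, entropy_case_2)
```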

Similar to the Gini impurity, a purer node will have lower entropy. Since entropy and Gini impurity are very similar metrics, using either will usually not make a substantial difference in your classifier. 

By default, the `DecisionTreeClassifier` class uses the Gini metric but can be switched to entropy by setting `criterion='entropy'`.

## Decision Tree Hyperparameters

| Decision Tree Hyperparameters | Description |
|:---:|---|
|max_depth| The maximum depth of the tree |
|max_features|The number of features to consider when deciding the best split|
|min_samples_split|Minimum number of samples required to split an internal node|
|min_samples_leaf|Minimum number of samples required for a leaf (terminal node)|

- see the `scikit-learn` documentation or the notebook documentation (e.g. `DecisionTreeClassifier?`) for the full list of hyperparameters
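To get a feel for what these hyperparameters do, the sketch below fits one unconstrained and one constrained tree and compares their size. The data comes from `make_classification`, so it is synthetic and unrelated to the Titanic data, and the parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
constrained = DecisionTreeClassifier(max_depth=3,
                                     min_samples_split=10,
                                     min_samples_leaf=20,
                                     random_state=0).fit(X_demo, y_demo)

for name, model in [('unconstrained', unconstrained), ('constrained', constrained)]:
    print('{:>13}: depth={}, leaves={}'.format(name, model.get_depth(), model.get_n_leaves()))

# Smallest leaf of the constrained tree (leaves have no left child, marked by -1)
is_leaf = constrained.tree_.children_left == -1
smallest_leaf = constrained.tree_.n_node_samples[is_leaf].min()
print('smallest leaf:', smallest_leaf)
```

Tighter limits give a smaller tree that is easier to read and less prone to memorizing the training data, at the cost of some flexibility.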

## Putting it all together

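In [None]:
## import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
## define max_depth 3, max_features=2, min_split=10, min_leaf=20
dtc = DecisionTreeClassifier(max_depth=3, max_features=2, min_samples_split=10, min_samples_leaf=20)
## dtc_fitted, dtc_pred, dtc_pred_proba
dtc_fitted = dtc.fit(X_train, y_train)
dtc_pred = dtc_fitted.predict(X_test)              # hard class labels
dtc_pred_proba = dtc_fitted.predict_proba(X_test)  # one probability column per class
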
In [None]:
dtc_pred

In [None]:
dtc_pred_proba

In [None]:
print('Accuracy : {}'.format(metrics.accuracy_score(y_true=y_test, y_pred=dtc_pred)))
print('Precision : {}'.format(metrics.precision_score(y_test, dtc_pred)))
print('Recall : {}'.format(metrics.recall_score(y_test, dtc_pred)))
print('F1-score : {}'.format(metrics.f1_score(y_test, dtc_pred)))
print("Classification Report:")
print(metrics.classification_report(y_test, dtc_pred))
precision, recall, threshold = metrics.precision_recall_curve(y_test, dtc_pred_proba[:,1])
print("Precision-Recall AUC: {}".format(metrics.auc(recall, precision)))
print("Receiver-Operator AUC: {}".format(metrics.roc_auc_score(y_test, dtc_pred_proba[:,1])))
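The headline metrics above all derive from the four cells of the confusion matrix. Here is a minimal sketch on made-up labels (`y_true_toy`, `y_pred_toy`) that recomputes them by hand and compares the results with `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Made-up labels, for illustration only
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 0, 0, 0, 1, 0, 1])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy : {:.3f}'.format(accuracy))
print('Precision : {:.3f}'.format(precision))
print('Recall : {:.3f}'.format(recall))
print('F1-score : {:.3f}'.format(f1))
```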

# Save and export your fitted model

In [None]:
import pickle
filename = 'finalized_model.pkl'
# the with-block closes the file even if pickling fails
with open(filename, 'wb') as f:
    pickle.dump(dtc_fitted, f)
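Loading the model back is the mirror image of saving it. The sketch below does a full round trip with a small stand-in model and a temporary file, so it runs on its own. The names `X_toy`, `y_toy` and `toy_model.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Small stand-in model (made-up data), in place of dtc_fitted
X_toy = np.array([[0], [1], [2], [3]])
y_toy = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model, f)        # save
    with open(path, 'rb') as f:
        restored = pickle.load(f)    # load

# The restored model should predict exactly like the original
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Only unpickle files from sources you trust, and ideally load a model with the same `scikit-learn` version that was used to save it.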