### Installing libs

### Column description

| Variable 	| Description 	|
|:-:	|:-	|
| PassengerId 	| passenger ID 	|
| Survived 	| whether the passenger survived (0 = no, 1 = yes) 	|
| Pclass 	| passenger class:<br>     * **1 = first**,<br>     * **2 = second**,<br>     * **3 = third** 	|
| name 	| passenger's name 	|
| sex 	| passenger's sex 	|
| age 	| passenger's age 	|
| Sibsp 	| number of siblings/spouses aboard 	|
| Parch 	| number of parents/children aboard 	|
| Ticket 	| ticket number 	|
| Fare 	| ticket fare 	|
| Cabin 	| cabin 	|
| Embarked 	| port where the passenger embarked:<br>     * **C = Cherbourg**,<br>     * **Q = Queenstown**,<br>     * **S = Southampton** 	|
| WikiId 	| passenger ID according to Wikipedia 	|
| Name_wiki 	| passenger's name 	|
| Age_wiki 	| passenger's age 	|
| Hometown 	| passenger's city of birth 	|
| Boarded 	| city of embarkation 	|
| Destination 	| destination of the trip 	|
| Lifeboat 	| lifeboat identification 	|
| Body 	| body identification number 	|


<font color='red'>**IMPORTANT**</font>

The new features (those after 'Embarked') closely mirror the original ones, but they are more up to date and have far fewer missing values. Users can therefore choose which version of each feature to use.

### Importing Libs

In [2]:
# data visualization
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
from seaborn import (
    jointplot,
    pairplot,
    boxplot,
    heatmap
)
from yellowbrick.features import (
    JointPlotVisualizer,
    Rank2D, 
    RadViz,
    ParallelCoordinates
)

# data manipulation
import numpy as np
import pandas as pd
from pandas.plotting import(
    radviz
)

from collections import (
    Counter,
)

import janitor as jn

from ydata_profiling import ProfileReport

# missing values
import missingno as msno

from sklearn.impute import (
    SimpleImputer
)

# machine learning models
from sklearn import (
    ensemble,
    preprocessing,
    tree,
    impute,
    model_selection
)

from sklearn.dummy import (
    DummyClassifier
)

from sklearn.model_selection import (
    train_test_split
)

from sklearn.experimental import (
    enable_iterative_imputer
)

from sklearn.linear_model import (
    LogisticRegression
)

from sklearn.tree import (
    DecisionTreeClassifier
)

from sklearn.neighbors import (
    KNeighborsClassifier
)

from sklearn.naive_bayes import (
    GaussianNB
)

from sklearn.svm import (
    SVC
)

from sklearn.ensemble import (
    RandomForestClassifier
)

import xgboost

# data model metrics
from sklearn.metrics import (
    auc,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    precision_score,
    recall_score
)

from yellowbrick.classifier import (
    ConfusionMatrix
)

from yellowbrick.model_selection import (
    LearningCurve
)

# data prep-model
from sklearn.model_selection import (
    StratifiedKFold,
    learning_curve
)

# model deploy
import pickle


### Reading the Titanic Dataset

In [None]:
df = pd.read_csv("titanic_dataset.csv", index_col=0)
df.head(5)

### Deleting the _Class_ feature at the end
We drop this column because it duplicates _Pclass_ (same values, same data).

In [None]:
df = df.drop('Class', axis = 'columns')
df

### Converting DataFrame Column Names to Lowercase snake_case

In [None]:
df.columns = (df.columns
                .str.replace('(?<=[a-z])(?=[A-Z])', '_', regex=True)
                .str.lower()
             )

df
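To see what the regular expression does, the sketch below applies the same pattern with Python's `re` module to a few column names from the description table. The zero-width pattern matches every position between a lowercase and an uppercase letter, so an underscore is inserted there before the name is lowercased. The helper name `to_snake_case` is an assumption made for illustration; it is not part of the notebook.

```python
import re

# Same pattern as in the cell above: a position preceded by a lowercase
# letter and followed by an uppercase letter.
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Insert '_' at lower/upper boundaries, then lowercase (hypothetical helper)."""
    return BOUNDARY.sub("_", name).lower()


columns = ["WikiId", "Name_wiki", "Pclass", "Survived"]
converted = [to_snake_case(c) for c in columns]
print(converted)
```

Names that already contain an underscore, or that have no lowercase-to-uppercase boundary, are only lowercased.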

### Preparing the dataset

In [None]:
# dropping columns that do not add value
df = df.drop(columns = ['name',
                        'name_wiki',
                        'wiki_id',
                        'hometown',
                        'destination',
                        'ticket',
                        'lifeboat',
                        'body',
                        'cabin',
                        'age'])

# using get_dummies function to convert object to float
df = pd.get_dummies(df, dtype = float)

# dropping redundant features
df = df.drop(columns = ['sex_male'])

# removing rows with any non-finite value (NaN or infinite)
df = df[np.isfinite(df).all(1)]

# first, we need to create a series of the target feature
y = df.survived

# then, we create a DataFrame with the attributes
X = df.drop(columns = ['survived'])

In [None]:
df
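The row filter `df[np.isfinite(df).all(1)]` can be hard to read at first, so here is a minimal sketch on a toy DataFrame. The values are made up for illustration. Note that `np.isfinite` only works on numeric columns, which is why the categorical columns are converted with `get_dummies` before the filter is applied.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one NaN and one infinite entry.
toy = pd.DataFrame(
    {
        "fare": [7.25, np.nan, 8.05, np.inf],
        "pclass": [3.0, 1.0, 3.0, 2.0],
    }
)

# np.isfinite marks NaN and +/-inf as False; .all(axis=1) keeps a row
# only when every column in that row is finite.
finite_rows = np.isfinite(toy).all(axis=1)
clean = toy[finite_rows]
print(clean)
```

Only the rows where every value is finite survive, so a single missing entry removes the whole row.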

### Standardizing the Data
Some Machine Learning algorithms perform better when the data is standardized, that is, when each feature is rescaled to have mean 0 and standard deviation 1.<br>
Here we use the `StandardScaler` from the sklearn lib.

In [None]:
# copying the dataset
df_std = df.copy()

# assigning StandardScaler to a variable
std = preprocessing.StandardScaler()

# applying the standard scaler
std.fit_transform(df_std)
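As a quick check of what standardization computes, the sketch below compares a manual z-score, `(x - mean) / std`, with `StandardScaler` on a toy matrix. The array values and the names `X_toy`, `manual`, and `scaled` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two columns on different scales.
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Manual z-score: subtract the column mean, then divide by the column
# standard deviation (population form, ddof=0, NumPy's default).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation with scikit-learn.
scaler = StandardScaler()
scaled = scaler.fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

`fit_transform` returns a new NumPy array rather than modifying its input, so the result has to be assigned to a variable if it is needed later.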

#### Some attributes of StandardScaler

In [None]:
# mean
std.mean_

In [None]:
# variance
std.var_

### Min-Max-Scaler
Min-Max Scaling, also known as scaling normalization, is another widely used Feature Scaling technique. In this approach, feature values are rescaled to a specific range, usually between 0 and 1.<br>
We are not going to use MinMaxScaler on the dataset, but the code is kept here for future reference.

In [None]:
# copying the dataset
df_mms = df.copy()

# assigning MinMaxScaler to a variable
mms = preprocessing.MinMaxScaler()

# applying the minimum and maximum scaler
mms.fit_transform(df_mms)
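For the default range of 0 to 1, min-max scaling amounts to `(x - min) / (max - min)` computed per column. The sketch below writes that formula out with NumPy on made-up values; it assumes every column has `max > min`, since a constant column would cause a division by zero in this simplified version.

```python
import numpy as np

# Toy feature matrix (made-up values): two columns on different scales.
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [5.0, 700.0]])

# Per-column minimum and maximum.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Rescale each column so its smallest value maps to 0 and its largest to 1.
scaled = (X_toy - col_min) / (col_max - col_min)
print(scaled)
```

Unlike standardization, this keeps the shape of each feature's distribution but makes the result sensitive to outliers, because a single extreme value sets the range.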

### Dummies features
Dummy variables are binary variables (0 or 1) created to represent a variable with two or more categories.<br>
Dummy variables must be used whenever we wish to include categorical variables in models that only accept numerical variables.

In [None]:
# using get_dummies function to convert object to boolean
df = pd.get_dummies(df)
df

Note that variables that were previously _object_ are now of type _boolean_.
We can request float columns instead by passing the `dtype` argument.

In [None]:
# using get_dummies function to convert object to float
df = pd.get_dummies(df, dtype=float)
df
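The effect of `get_dummies` is easier to see on a small frame that still has object columns. The sketch below uses made-up rows and also shows `drop_first=True`, an alternative to dropping a redundant indicator column by hand.

```python
import pandas as pd

# Toy frame (made-up values) with two categorical columns and one numeric.
toy = pd.DataFrame(
    {
        "sex": ["male", "female", "female"],
        "embarked": ["S", "C", "Q"],
        "fare": [7.25, 71.28, 7.75],
    }
)

# Categorical columns are expanded into one indicator column per
# category; numeric columns pass through unchanged.
dummies = pd.get_dummies(toy, dtype=float)
print(dummies.columns.tolist())

# drop_first=True drops the first category of each variable, removing
# the redundancy between e.g. sex_female and sex_male.
reduced = pd.get_dummies(toy, dtype=float, drop_first=True)
print(reduced.columns.tolist())
```

With `drop_first=True` the dropped indicator is the first category of each variable, so here `sex_female` is removed and `sex_male` is kept, which is the opposite of the manual choice made in the preparation cell.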

### Label Encoder
An alternative to dummy variable encoding is label encoding. In this case, each category is assigned an integer. It is a convenient method for data with high cardinality.

In [None]:
# copying the dataset
df_lbe = df['pclass'].copy()

# assigning LabelEncoder to a variable
labenc = preprocessing.LabelEncoder()

# applying the label encoder
labenc.fit_transform(df_lbe)

We can map the encoded values back to the original labels.

In [None]:
# decoding: mapping integer codes back to the original labels
labenc.inverse_transform([2, 1, 0])
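The sketch below shows the round trip on string categories, where the mapping is more visible than with `pclass`, whose values are already integers. The list `ports` is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy categorical values (embarkation ports as single letters).
ports = ["S", "C", "Q", "S", "C"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ holds the sorted unique categories; a category's code is
# its position in that array.
print(encoder.classes_)
print(codes)

# inverse_transform maps integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

Because the codes follow the sorted order of the categories, the integers imply an ordering that may not exist in the data. The scikit-learn documentation describes `LabelEncoder` as intended for target values; for input features, `OrdinalEncoder` from `sklearn.preprocessing` is the documented counterpart.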

### Extracting Categories from Strings
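One way to improve model accuracy is to extract the titles (Mr., Mrs., Miss and so on) from the passenger names.<br>
We can use the Counter class to count the most frequent three-character substrings and spot candidate titles.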

In [None]:
c = Counter()

def triples(val):
    # counting every three-character substring of the string
    for i in range(len(val) - 2):
        c[val[i : i + 3]] += 1

# reloading the raw dataset because the name column was dropped earlier
df = pd.read_csv("titanic_dataset.csv", index_col=0)

df.columns = (df.columns
                .str.replace('(?<=[a-z])(?=[A-Z])', '_', regex=True)
                .str.lower()
             )

df.name.apply(triples)
c.most_common(10)
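Once the substring counts suggest that titles such as "Mr." are common, the titles themselves can be pulled out with a regular expression. The sketch below assumes the names follow the "Surname, Title. Given names" layout; the sample names, the pattern `TITLE`, and the helper `extract_title` are assumptions made for illustration.

```python
import re
from collections import Counter

# Sample names in the "Surname, Title. Given names" layout.
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
]

# The title is the text between the comma and the first period.
TITLE = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title found in a name, or None when the layout differs."""
    match = TITLE.search(name)
    return match.group(1).strip() if match else None


title_counts = Counter(extract_title(n) for n in names)
print(title_counts.most_common())
```

On a DataFrame, the same pattern could be passed to `df.name.str.extract(...)` to create a new categorical column, which can then be encoded like any other.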

### Other Encodings
The _categorical_encoding_ lib (imported as `category_encoders`) is a set of scikit-learn transformers used to convert object data to numeric data.<br>
A good feature of this lib is that it returns pandas DataFrames.<br>
One of the algorithms implemented in this lib is a hash encoder.
Another, the _ordinal encoder_, converts categorical columns that have an order into a single column of numbers.

In [None]:
#
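To illustrate the idea behind a hash encoder without depending on the library, the sketch below implements the hashing trick with the standard library only. It is not the library's implementation: the functions `hash_bucket` and `hash_encode`, the choice of MD5, and the default of 8 buckets are all assumptions made for illustration.

```python
import hashlib


def hash_bucket(category: str, n_buckets: int = 8) -> int:
    """Map a category to a bucket index via a stable hash (hypothetical helper)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(values, n_buckets: int = 8):
    """Encode each value as a fixed-width row with a 1 in its bucket."""
    rows = []
    for value in values:
        row = [0] * n_buckets
        row[hash_bucket(value, n_buckets)] = 1
        rows.append(row)
    return rows


# Toy categorical values (made up for illustration).
towns = ["Southampton", "Cherbourg", "Queenstown", "Southampton"]
encoded = hash_encode(towns)
for town, row in zip(towns, encoded):
    print(town, row)
```

The output width is fixed by the number of buckets rather than by the number of categories, which is what makes hashing attractive for high-cardinality columns. The trade-off is that different categories can collide in the same bucket.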