**OVERVIEW**

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
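
As a rough illustration of that baseline, the sketch below applies the same rule to a tiny hand-made table. The three rows are invented for the example; the column names `PassengerId`, `Sex` and `Survived` follow the competition files.

```python
import pandas as pd

# Toy stand-in for test.csv (values invented for illustration)
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Baseline rule: predict 1 (survived) for women, 0 for everyone else
toy_submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})

print(toy_submission)
```

A real submission would presumably be written out with `toy_submission.to_csv(..., index=False)` so that only the two expected columns end up in the file.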

**INSTALLING PACKAGES**

In [None]:
## we start by importing the necessary libraries for data manipulation and viz
## !pip install seaborn
## !pip install statsmodels

from IPython.display import Image
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
# This command renders static plots directly below the cell.
# Other %matplotlib settings can show the plots in separate windows or as interactive plots.
# Since the Anaconda distribution with Python 3.7, inline is already the default %matplotlib setting.
# https://ipython.readthedocs.io/en/stable/interactive/plotting.html


In [None]:
## adding some ML capabilities with Scikit-learn
## !pip install scikit-learn

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.tree import plot_tree
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score, accuracy_score, auc
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score


**IMPORTING DATASET**

**COLUMNS DESCRIPTION**

. passengerid = Passenger ID (primary key).

. survived = 1 if the passenger survived the sinking, 0 (zero) if not.

. pclass = Ticket class (1 to 3): 1 = 1st, 2 = 2nd, 3 = 3rd.

. name = Passenger name.

. sex = Passenger gender (male or female).

. age = Passenger age at the time of the sinking.

. sibsp = Number of siblings / spouses aboard.

. parch = Number of parents / children aboard.

. ticket = Ticket code.

. fare = Ticket fare.

. cabin = Cabin identification code.

. embarked = Port where the passenger boarded the ship (C = Cherbourg, Q = Queenstown, S = Southampton).

In [None]:
# Importing the train dataset and verifying the first info
titanic = pd.read_csv('train.csv')
titanic.head(10)

In [None]:
#same thing for the test dataset
titanic_test = pd.read_csv('test.csv')
titanic_test.head(10)

**EXPLORATORY ANALYSIS**

In [None]:
#some detail on our columns
titanic.info()

In [None]:
#count the NAs
titanic.isna().sum()

#isna() returns True (counted as 1) where a value is missing (NaN), so .sum() gives the number of missing values per column


In [None]:
#or, to be more complete, let's see the proportion of NaN in each column
pd.DataFrame(
    zip(    ##zip pairs up the elements of the two Series, one (count, proportion) pair per column
        titanic.isna().sum(),               ##first column
        titanic.isna().sum()/len(titanic)   ##second column
    ),
    columns = ['Count', 'Proportion'],
    index = titanic.columns
)
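
A slightly shorter way to get the same table is sketched below, on a made-up frame rather than the Titanic data. It relies on the fact that the mean of a boolean column is the share of `True` values, so `isna().mean()` is the proportion of missing entries.

```python
import numpy as np
import pandas as pd

# Toy data (invented): two of the four ages are missing
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

na_summary = pd.DataFrame({
    "Count": toy.isna().sum(),        # number of NaN per column
    "Proportion": toy.isna().mean(),  # share of NaN per column
})

print(na_summary)
```

Building the frame from a dictionary of Series keeps the column names as the index automatically, so no explicit `index=` argument is needed.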

In [None]:
#Let's see some quantitative description of our dataset
titanic.describe()

In [None]:
#our target variable is the "Survived" column. Let's see how many people survived

titanic.Survived.value_counts()/len(titanic)*100
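
The same percentages can also be obtained with the `normalize` argument of `value_counts`. The snippet below is only a sketch on an invented five-row Series, not the real `Survived` column.

```python
import pandas as pd

# Toy target column (invented): 3 non-survivors, 2 survivors
toy_survived = pd.Series([0, 0, 1, 0, 1], name="Survived")

# normalize=True returns fractions instead of counts
shares = toy_survived.value_counts(normalize=True) * 100

print(shares)
```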

**INITIALIZING PRE-PROCESSING**

We start with the KDD process (Knowledge Discovery in Databases).

In [None]:
Image('kdd.png')

In [None]:
#I always like to start with a pairplot. It shows the distribution of some variables and lets us visualize possible correlations between them,
# but in this case it is not that impressive
# (note: pairplot only draws numeric columns, so the text column 'Sex' is left out of the grid)
sb.pairplot(titanic[['Survived','Age','Fare','Sex', 'Pclass']].dropna())

In [None]:
print("Seaborn version:", sb.__version__)

In [None]:
## let's check the variable Fare

##sb.histplot(data = titanic, x="Fare")
sb.histplot(titanic['Fare'])


## seaborn is a great library for statistical plotting

In [None]:
sb.boxplot(data = titanic, x = "Survived", y = "Fare")
plt.title("Fare distribution for survivors and non-survivors")
plt.show()

In [None]:
## Let's look at the outliers in the "Fare" column

titanic.loc[titanic['Fare']>=300]


In [None]:
## We can truncate these three values to the maximum of Fare = 300

titanic.loc[titanic['Fare']>=300, 'Fare'] = 300
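
An equivalent way to cap values is `Series.clip`. The sketch below uses an invented list of fares and the same threshold of 300; it is an alternative spelling of the idea, not a change to the step above.

```python
import pandas as pd

# Toy fares (invented); one value is far above the cap
toy_fares = pd.Series([7.25, 80.0, 512.33, 300.0], name="Fare")

# Every value above 300 is replaced by 300, the rest are untouched
capped = toy_fares.clip(upper=300)

print(capped.tolist())
```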

In [None]:
#repeating the same plot as before
sb.boxplot(data = titanic, x = "Survived", y = "Fare")
plt.title("Fare distribution for survivors and non-survivors - Truncating outliers")
plt.show()

In [None]:
#evaluating the age of passengers

sb.histplot(data = titanic, x = 'Age')

In [None]:
sb.boxplot(data = titanic, y = 'Age', x = 'Survived')

**SUBSTITUTING NaN VALUES**

In [None]:
## Completing NaN values
print('Age info:\nAverage= {} \nMedian = {}'.format(titanic['Age'].mean(), titanic['Age'].median()))

In [None]:
## evaluating for sex
C_median = titanic['Age'].groupby(by= titanic.Sex).median()

C_median

In [None]:
## evaluating for class
C_median = titanic['Age'].groupby(by= titanic.Pclass).median()

C_median

In [None]:
#We will find the mean age for each class/sex group

trainMeans = titanic.groupby(['Sex','Pclass'])['Age'].mean()

trainMeans

In [None]:
## Applying the averages

def age_estimate(x):
    if not np.isnan(x['Age']):                  ## if age is not NaN
        return x['Age']                         ## return itself (the age)
    return trainMeans[x["Sex"], x['Pclass']]    ## otherwise return the mean age of the passenger's Sex/Pclass group from trainMeans


titanic['Age'] = titanic.apply(age_estimate, axis=1)
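
The row-by-row `apply` works, but the same group-mean imputation can be written in a vectorized way with `groupby(...).transform`. The following is a minimal sketch on invented rows, assuming the same `Sex`, `Pclass` and `Age` column names.

```python
import numpy as np
import pandas as pd

# Toy passengers (invented): one missing age in each Sex/Pclass group
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan],
})

# transform("mean") returns one value per row: the mean age of that row's group
group_mean = toy.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# fillna only touches the rows where Age is missing
toy["Age"] = toy["Age"].fillna(group_mean)

print(toy)
```

On a dataset of this size the speed difference is negligible; the main benefit is that no helper function is needed.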

In [None]:
titanic.describe()

**QUALITATIVE PREDICTIVE VARIABLES**

In [None]:
#Evaluating the effect of "Sex"
titanic.groupby('Survived')['Sex'].value_counts().unstack(0).plot.bar()

In [None]:
#Evaluating the effect of where the passenger embarked
titanic.groupby('Survived')['Embarked'].value_counts().unstack(0).plot.bar()

In [None]:
# Fill all NAN (only 2 values) of the "Embarked" column with the Mode

titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.head()
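
Instead of hard-coding `'S'`, the mode can be computed from the column itself. The sketch below shows the idea on an invented Series; `mode()` ignores NaN by default and returns a Series, hence the `[0]`.

```python
import numpy as np
import pandas as pd

# Toy embarkation ports (invented), with one missing value
toy_embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

most_common = toy_embarked.mode()[0]        # most frequent non-missing value
filled = toy_embarked.fillna(most_common)   # replace NaN with it

print(most_common, filled.tolist())
```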

**CREATING DUMMIES FOR QUALITATIVE VARIABLES**

Some algorithms have difficulty handling categorical values,
especially when the category is represented as a number.
In this case, one strategy is to create so-called dummy columns, also known as one-hot encoding.

In [None]:
Image('hot encoding dummy.png')

In [None]:
#The 'get_dummies' function creates one column per category and flags the rows that belong to it (True/False, or 1/0 in older pandas versions)
pd.get_dummies(titanic['Sex'])

In [None]:
#The 'drop_first' option is commonly used because, if you have n possible categories,
#you can determine all the values with n-1 columns (i.e., if all remaining columns are False, the dropped column would be True)
pd.get_dummies(titanic['Sex'], drop_first=True)

In [None]:
# creating a column that says if the passenger is male or not
titanic['male'] = pd.get_dummies(
    titanic['Sex'],
    drop_first=True
)['male']   # select the single remaining dummy column explicitly as a Series

titanic.head()

In [None]:
#doing the same for the embarked place
embark_dummies = pd.get_dummies(
    titanic['Embarked'], #This time we will do the same for the embarked column
    drop_first=True, #dropping the first column (should be C)
    prefix='Embarked' #get_dummies adds the '_' separator itself, so we end up with Embarked_Q and Embarked_S columns
)

embark_dummies.head()

In [None]:
#adding these columns to my dataframe with the concat method
titanic = pd.concat(
    [ titanic , embark_dummies ],   #the two dataframes we want to unite
    axis=1                          #the axis=1 indicate we will concatenate the dataframes horizontally (add columns)
)

titanic.head()

In [None]:
#now we do not need the "Sex" or "Embarked" columns anymore
titanic.drop(
    ['Sex','Embarked'],     # the columns to be dropped
    axis = 1,               # the axis of drop (1 means column)
    inplace=True            # means we will substitute the original dataframe
)

#The inplace true means we will replace the original dataframe. It is the same as if we had typed:
#titanic = titanic.drop(['Sex','Embarked'],axis = 1)

titanic.head()

**WORKING ON OTHER QUALITATIVE VARIABLES**

In [None]:
#Let's review where we are so far

titanic.info()

In [None]:
#Pclass is stored as int64 because the value is numerical (1, 2 or 3)
#But we do not want our model to treat it as a quantitative value,
#so we convert it to a categorical type

titanic['Pclass'].value_counts()

In [None]:
titanic['Pclass'] = titanic['Pclass'].astype('category')
titanic['Survived'] = titanic['Survived'].astype('bool')
titanic.info()

In [None]:
#Now let's create our dummies for Pclass
pclass_dummies = pd.get_dummies(
    titanic['Pclass'],              #Create one dummy column per passenger class
    drop_first=True,                #Dropping the Pclass_1 column that will not be necessary
    prefix="Pclass"                 #get_dummies adds the '_' separator itself: Pclass_2 and Pclass_3
)

titanic = pd.concat(
    [ titanic , pclass_dummies ],   #the two dataframes we want to unite
    axis=1                          #the axis=1 indicate we will concatenate the dataframes horizontally (add columns)
)

titanic.drop(
    ['Pclass'],                     #with the dummies ready, we do not need the original Pclass column
    axis = 1,
    inplace=True
)
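
The get_dummies / concat / drop pattern used above can also be collapsed into a single call by passing the whole DataFrame together with the `columns` argument. This is only a sketch on invented rows; `dtype=int` is an optional choice made here to get 0/1 instead of booleans.

```python
import pandas as pd

# Toy frame (invented) with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Pclass": [3, 1, 2, 3],
    "Fare": [7.25, 71.28, 13.0, 8.05],
})

# The listed columns are encoded and replaced by their dummies in one step;
# each column name is used as the prefix automatically
encoded = pd.get_dummies(
    toy,
    columns=["Embarked", "Pclass"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
```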

In [None]:
titanic.head()

In [None]:
##finally, let's drop the identifier-like columns that will not contribute to the machine learning model
titanic_train = titanic.drop(['PassengerId','Name','Cabin', 'Ticket'], axis=1)
titanic_train.head()

In [None]:
titanic_train.info()

**SEPARATING TEST AND TRAIN DATA**

In this phase we divide our training data into two parts:

one part will be used to train the model

the other part will be used to evaluate the accuracy of the model

In [None]:
##Using the train_test_split from SKLEARN

'''
(function) def train_test_split(
    *arrays: Any,
    test_size: Float | None = None,
    train_size: Float | None = None,
    random_state: Int | RandomState | None = None,
    shuffle: bool = True,
    stratify: ArrayLike | None = None
) -> list

Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation, next(ShuffleSplit().split(X, y)), 
and application to input data into a single call for splitting (and optionally subsampling) data into a one-liner.
'''

X_train, X_test, y_train, y_test = train_test_split(
    titanic_train.drop('Survived',axis=1),          # independent values, will be the dataframe without the target column
    titanic_train['Survived'],                      # dependent value, target column
    test_size=0.10,                                 # how much of the dataframe will be used for testing (in this case 90% will be used for training)
    random_state=10                                 # including a random state just ensures the result of the shuffle will always be the same for this block
)
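
When the target classes are imbalanced, the `stratify` parameter listed in the signature above keeps the class proportions the same in both parts. The sketch below illustrates this on invented data (20 rows, 40% positives); it is not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (invented): 12 zeros and 8 ones
toy_X = pd.DataFrame({"Fare": range(20)})
toy_y = pd.Series([0] * 12 + [1] * 8, name="Survived")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X,
    toy_y,
    test_size=0.25,     # 5 of the 20 rows go to the test part
    random_state=10,    # fixed seed so the shuffle is reproducible
    stratify=toy_y,     # keep the 60/40 class ratio in both parts
)

print(len(X_tr), len(X_te), y_te.mean())
```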

In [None]:
##X_train is the matrix of values that the model will use to try and understand the behaviour of y_train
X_train

In [None]:
y_train

In [None]:
##after the training, the model applies what it learned to the X_test values, and we check whether its predictions
##match the y_test results
X_test
