**OVERVIEW**

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
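As a rough sketch of what that baseline amounts to, the snippet below builds a submission-shaped table from a tiny hand-made frame. The three passengers are invented for illustration; only the column names `PassengerId`, `Sex` and `Survived` follow the dataset.

```python
import pandas as pd

# Hypothetical three-row stand-in for test.csv (values invented for illustration).
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# "All and only female passengers survive": 1 for female, 0 otherwise.
submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

A submission file therefore has exactly two columns, one row per test passenger, and no index column.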

**INSTALLING PACKAGES**

In [None]:
## we start by importing the necessary libraries for data manipulation and viz
## !pip install seaborn
## !pip install statsmodels

from IPython.display import Image
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
# This magic command renders static plots directly below the cell.
# Other %matplotlib settings can show plots in separate windows or as interactive figures.
# In recent Jupyter versions, inline is normally already the default backend.
# https://ipython.readthedocs.io/en/stable/interactive/plotting.html


In [None]:
## adding some ML capabilities with Scikit-learn
## !pip install scikit-learn

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.tree import plot_tree
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score, accuracy_score, auc
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score


**IMPORTING DATASET**

**COLUMNS DESCRIPTION**

. passengerid = Passenger ID (primary key).

. survived = 1 if the passenger survived the sinking, 0 (zero) if not.

. pclass = Ticket class (1 to 3): 1 = 1st, 2 = 2nd, 3 = 3rd.

. name = Passenger's name.

. sex = Passenger's gender, male or female.

. age = Passenger's age at the time of the sinking.

. sibsp = Number of siblings / spouses aboard.

. parch = Number of parents / children aboard.

. ticket = Ticket code.

. fare = Ticket fare.

. cabin = Cabin identification code.

. embarked = Port where the passenger boarded the ship (C = Cherbourg, Q = Queenstown, S = Southampton).

In [None]:
# Importing the train dataset and verifying the first info
titanic = pd.read_csv('train.csv')
titanic.head(10)

In [None]:
#same thing for the test dataset
titanic_test = pd.read_csv('test.csv')
titanic_test.head(10)

**EXPLORATORY ANALYSIS**

In [None]:
#some detail on our columns
titanic.info()

In [None]:
#count the NAs
titanic.isna().sum()

# isna() returns True (counted as 1) for missing values (NaN), so .sum() gives the number of missing values in each column


In [None]:
#or, to be more complete, let's see the proportion of NaN in each column
pd.DataFrame(
    zip(    ## zip pairs the two Series element by element
        titanic.isna().sum(),               ##first column
        titanic.isna().sum()/len(titanic)   ##second column
    ),
    columns = ['Count', 'Proportion'],
    index = titanic.columns
)
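The same table can be built a little more directly, since the mean of a boolean column is the share of `True` values. This is a sketch on a small invented frame, not on the Titanic data.

```python
import numpy as np
import pandas as pd

# Small invented frame with a few missing values.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# isna().sum() counts the missing values; isna().mean() gives their proportion.
missing = pd.DataFrame({
    "Count": toy.isna().sum(),
    "Proportion": toy.isna().mean(),
})
print(missing)
```

Building the frame from a dict of Series keeps the column names as the index automatically, so no separate `index=` argument is needed.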

In [None]:
#Let's see some quantitative description of our dataset
titanic.describe()

In [None]:
#our target variable is the "Survived" column. Let's see how many people survived

titanic.Survived.value_counts()/len(titanic)*100

**INITIALIZING PRE-PROCESSING**

We start with the KDD (Knowledge Discovery in Databases) process.

In [None]:
Image('kdd.png')

In [None]:
# I always like to start with a pairplot: it shows the distribution of each variable and lets us visualize possible correlations between them,
# but in this case it is not that informative (non-numeric columns such as 'Sex' are left out of the plot)
sb.pairplot(titanic[['Survived','Age','Fare','Sex', 'Pclass']].dropna())

In [None]:
print("Seaborn version:", sb.__version__)

In [None]:
## lets check the variable fare

##sb.histplot(data = titanic, x="Fare")
sb.histplot(titanic['Fare'])


## seaborn is a great library for image plotting

In [None]:
sb.boxplot(data = titanic, x = "Survived", y = "Fare")
plt.title("Fare distribution for survivors and non-survivors")
plt.show()

In [None]:
## Let's inspect the outliers in the "Fare" column (fares of 300 or more)

titanic.loc[titanic['Fare']>=300]


In [None]:
## We can cap these three values at a maximum of Fare = 300

titanic.loc[titanic['Fare']>=300, 'Fare'] = 300
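An equivalent way to cap a column is `Series.clip`, sketched below on three invented fares. The threshold of 300 is the one used above.

```python
import pandas as pd

# Invented fares: two ordinary values and one extreme value.
toy = pd.DataFrame({"Fare": [7.25, 263.0, 512.33]})

# clip(upper=300) replaces every value above 300 with 300 and leaves the rest untouched.
toy["Fare"] = toy["Fare"].clip(upper=300)
print(toy)
```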

In [None]:
#repeating the same plot as before
sb.boxplot(data = titanic, x = "Survived", y = "Fare")
plt.title("Fare distribution for survivors and non-survivors - Truncating outliers")
plt.show()

In [None]:
#evaluating the age of passengers

sb.histplot(data = titanic, x = 'Age')

In [None]:
sb.boxplot(data = titanic, y = 'Age', x = 'Survived')

**SUBSTITUTING NaN VALUES**

In [None]:
## Completing NaN values
print('Age info:\nAverage= {} \nMedian = {}'.format(titanic['Age'].mean(), titanic['Age'].median()))

In [None]:
## evaluating for sex
C_median = titanic['Age'].groupby(by= titanic.Sex).median()

C_median

In [None]:
## evaluating for class
C_median = titanic['Age'].groupby(by= titanic.Pclass).median()

C_median

In [None]:
#We will find the mean age for each class/sex group

trainMeans = titanic.groupby(['Sex','Pclass'])['Age'].mean()

trainMeans

In [None]:
## Applying the averages

def age_estimate(x):
    if not np.isnan(x['Age']):                  ## if age is not NaN
        return x['Age']                         ## keep the recorded age
    return trainMeans[x["Sex"], x['Pclass']]    ## otherwise return the mean age of the passenger's sex/class group (from trainMeans)


titanic['Age'] = titanic.apply(age_estimate, axis=1)
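The row-by-row `apply` above can also be written in vectorized form with `groupby(...).transform`. The sketch below uses an invented five-row frame and should give the same kind of result: each missing age is replaced by the mean age of its sex/class group.

```python
import numpy as np
import pandas as pd

# Invented frame: one missing age in each sex/class group.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan],
})

# transform("mean") returns a Series aligned with the original rows,
# holding each row's group mean (NaN values are ignored when averaging).
group_mean = toy.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# fillna only touches the rows where Age is missing.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

On a dataset of this size the speed difference is negligible; the main gain is that no helper function is needed.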

In [None]:
titanic.describe()

**QUALITATIVE PREDICTIVE VARIABLES**

In [None]:
#Evaluating the effect of "Sex"
titanic.groupby('Survived')['Sex'].value_counts().unstack(0).plot.bar()

In [None]:
#Evaluating the effect of where the passenger embarked
titanic.groupby('Survived')['Embarked'].value_counts().unstack(0).plot.bar()

In [None]:
# Fill the NaNs (only 2 values) in the "Embarked" column with the mode, 'S'

titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.head()

**CREATING DUMMIES FOR QUALITATIVE VARIABLES**

Some algorithms have difficulty handling categorical values,
especially when the category is encoded as a number.
In that case, one strategy is to create dummy columns, also known as one-hot encoding.

In [None]:
Image('hot encoding dummy.png')

In [None]:
#The 'get_dummies' function creates one indicator column (True/False) for each distinct value in the selected column
pd.get_dummies(titanic['Sex'])

In [None]:
#Using 'drop_first' is common because, if you have n possible values,
#you can determine all of them with n-1 columns (i.e., if all remaining columns are False, the dropped column would be True)
pd.get_dummies(titanic['Sex'], drop_first=True)
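To make the n-1 argument concrete, the sketch below encodes a small invented series of ports with and without `drop_first`, then rebuilds the dropped column from the remaining ones.

```python
import pandas as pd

# Invented series with three possible values.
ports = pd.Series(["C", "Q", "S", "S"], name="Embarked")

full = pd.get_dummies(ports)                      # columns C, Q, S
reduced = pd.get_dummies(ports, drop_first=True)  # columns Q, S

# The dropped column is implied: a row is "C" exactly when Q and S are both 0/False.
recovered_c = reduced.sum(axis=1) == 0

print(reduced)
print(recovered_c)
```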

In [None]:
# creating a column that says whether the passenger is male or not
titanic['male'] = pd.get_dummies(
    titanic['Sex'],
    drop_first=True
)

titanic.head()

In [None]:
#doing the same for the port of embarkation
embark_dummies = pd.get_dummies(
    titanic['Embarked'], #this time we encode the Embarked column
    drop_first=True, #dropping the first column (should be C)
    prefix='Embarked_' #the default separator '_' is added after the prefix, so we end up with Embarked__Q and Embarked__S columns
)

embark_dummies.head()
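A note on the resulting column names: `get_dummies` joins the prefix and the value with `prefix_sep`, which defaults to `'_'`. A prefix that already ends in an underscore therefore yields a double underscore, which is why later cells refer to `Embarked__Q` and `Pclass__3`. A minimal check on an invented series:

```python
import pandas as pd

ports = pd.Series(["C", "Q", "S"])

# Prefix ending in "_" plus the default separator "_" gives a double underscore.
with_underscore = pd.get_dummies(ports, prefix="Embarked_", drop_first=True)

# A prefix without the trailing underscore gives the single-underscore names.
plain = pd.get_dummies(ports, prefix="Embarked", drop_first=True)

print(list(with_underscore.columns))
print(list(plain.columns))
```

Either naming works, as long as the training and test frames are encoded the same way.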

In [None]:
#adding these columns to my dataframe with the concat method
titanic = pd.concat(
    [ titanic , embark_dummies ],   #the two dataframes we want to join
    axis=1                          #axis=1 means we concatenate the dataframes horizontally (add columns)
)

titanic.head()

In [None]:
#now we do not need the "Sex" or "Embarked" columns anymore
titanic.drop(
    ['Sex','Embarked'],     # the columns to be dropped
    axis = 1,               # the axis of drop (1 means column)
    inplace=True            # means we will substitute the original dataframe
)

#The inplace true means we will replace the original dataframe. It is the same as if we had typed:
#titanic = titanic.drop(['Sex','Embarked'],axis = 1)

titanic.head()

**WORKING ON OTHER QUALITATIVE VARIABLES**

In [None]:
#Let's review where we are so far

titanic.info()

In [None]:
#Pclass is stored as int64 because its values are numeric (1, 2 or 3)
#But we do not want our model to treat it as a quantitative value,
#so we convert it to a categorical type

titanic['Pclass'].value_counts()

In [None]:
titanic['Pclass'] = titanic['Pclass'].astype('category')
titanic['Survived'] = titanic['Survived'].astype('bool')
titanic.info()

In [None]:
#Now let's create our dummies for Pclass
pclass_dummies = pd.get_dummies(
    titanic['Pclass'],              #Create one dummy column per passenger class
    drop_first=True,                #Dropping the first class column (Pclass__1), which is not needed
    prefix="Pclass_"                #The default separator '_' follows the prefix, giving Pclass__2 and Pclass__3
)

titanic = pd.concat(
    [ titanic , pclass_dummies ],   #the two dataframes we want to join
    axis=1                          #axis=1 means we concatenate the dataframes horizontally (add columns)
)

titanic.drop(
    ['Pclass'],                     #with the dummies ready, we do not need the original Pclass column
    axis = 1,
    inplace=True
)

In [None]:
titanic.head()

In [None]:
##finally, let's drop the per-passenger identifier columns that will not help the model
titanic_train = titanic.drop(['PassengerId','Name','Cabin', 'Ticket'], axis=1)
titanic_train.head()

In [None]:
titanic_train.info()

**SEPARATING TEST AND TRAIN DATA**

In this phase we split our training data into two parts:

- one part is used to train the model
- the other part is used to evaluate the model's accuracy on rows it has not seen

In [None]:
##Using the train_test_split from SKLEARN

'''
(function) def train_test_split(
    *arrays: Any,
    test_size: Float | None = None,
    train_size: Float | None = None,
    random_state: Int | RandomState | None = None,
    shuffle: bool = True,
    stratify: ArrayLike | None = None
) -> list

Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation, next(ShuffleSplit().split(X, y)), 
and application to input data into a single call for splitting (and optionally subsampling) data into a one-liner.
'''

X_train, X_test, y_train, y_test = train_test_split(
    titanic_train.drop('Survived',axis=1),          # independent values, will be the dataframe without the target column
    titanic_train['Survived'],                      # dependent value, target column
    test_size=0.10,                                 # how much of the dataframe will be used for testing (in this case 90% will be used for training)
    random_state=10                                 # including a random state just ensures the result of the shuffle will always be the same for this block
)
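The sketch below shows what the call returns, using invented arrays instead of the Titanic frame. It also passes `stratify`, one of the parameters listed in the signature above, which keeps the class balance the same in both parts. Whether stratifying is appropriate here is a modelling choice, not something the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 2 features, 60% class 0 and 40% class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.10,     # 10 rows go to the test part, 90 stay for training
    random_state=10,    # fixed seed, so the shuffle is reproducible
    stratify=y          # keep the 60/40 class balance in both parts
)

print(X_tr.shape, X_te.shape)
print(y_te.mean())      # share of class 1 in the test part
```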

In [None]:
##X_train is the matrix of values that the model will use to try and understand the behaviour of y_train
X_train

In [None]:
y_train

In [None]:
##after training, the model can be applied to the X_test values, and its predictions compared
##with the y_test results to see whether it got them right
X_test

**FIRST MODEL - DECISION TREE**

HYPERPARAMETERS

tree = DecisionTreeClassifier(

criterion='gini', # gini is the default, but we can choose entropy instead

splitter='best', # the strategy used to choose the split at each node; 'random' picks the split randomly

max_depth= None, # the maximum depth the decision tree can reach; if None, the tree grows until it reaches the highest possible purity

min_samples_split = 2, # the minimum number of samples required to split a node

min_samples_leaf = 1, # the minimum number of samples required in each leaf node (see the first image)

max_features = None, # the number of features considered at each split; None -> all features, 'sqrt' -> square root of the number of features, 'log2' -> base-2 log of the number of features

max_leaf_nodes=None, # the maximum number of leaf nodes the tree can have; if None, the number of leaf nodes is not limited

min_impurity_decrease=0.0, # a node is split only if the split decreases the impurity by at least this value

random_state= 17, # fixes the random seed so the notebook is reproducible

)

In [None]:
Image('Decision_Tree.png')

In [None]:
#Let's use the DecisionTreeClassifier class

'''
class DecisionTreeClassifier(
    *,
    criterion: Literal['gini', 'entropy', 'log_loss'] = "gini",
    splitter: Literal['best', 'random'] = "best",
    max_depth: Int | None = None,
    min_samples_split: float | int = 2,
    min_samples_leaf: float | int = 1,
    min_weight_fraction_leaf: Float = 0,
    max_features: float | int | Literal['auto', 'sqrt', 'log2'] | None = None,
    random_state: Int | RandomState | None = None,
    max_leaf_nodes: Int | None = None,
    min_impurity_decrease: Float = 0,
    class_weight: Mapping | str | Sequence[Mapping] | None = None,
    ccp_alpha: float = 0
)
'''

Classif_tree = DecisionTreeClassifier(random_state=10)

In [None]:
#training the decision tree with our X_train matrix and y_train results

classif = Classif_tree.fit(X_train,y_train)

In [None]:
#Lets check the most important features

classif.feature_importances_

In [None]:
#Damn! Which is which here? Let's find out.
X_train.columns

In [None]:
df = pd.DataFrame(X_train.columns)

df2 = pd.DataFrame(classif.feature_importances_)

df3 = pd.concat(
    [df, df2],
    axis =1
)

tree_importance = df3.set_axis(['Feature', 'Importance'], axis = 'columns').sort_values(by=['Importance'], ascending=False)

tree_importance.head(10)
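A more compact way to pair importances with feature names is a single `pd.Series` indexed by the column names. The sketch below fits a tree on an invented six-row frame in which `male` separates the classes perfectly, so it should receive all of the importance.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented data: 'male' alone separates the two classes, 'Age' does not.
toy_X = pd.DataFrame({
    "male": [1, 1, 1, 0, 0, 0],
    "Age": [22, 35, 54, 26, 38, 4],
})
toy_y = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=10).fit(toy_X, toy_y)

# One Series: values are the importances, index is the feature names.
importance = (
    pd.Series(toy_tree.feature_importances_, index=toy_X.columns, name="Importance")
    .sort_values(ascending=False)
)
print(importance)
```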

In [None]:
#predicting our y with the X_train and the classif model

y_pred_all = classif.predict(X_train)

**CONFUSION MATRIX**

In [None]:
pd.crosstab(y_train, y_pred_all)

In [None]:
# Or, using confusion_matrix from sklearn.metrics

# Confusion matrix
# (stored under a different name so the imported function is not overwritten)
cm = confusion_matrix(y_train, y_pred_all)
cm

In [None]:
print(classification_report(y_train, y_pred_all,digits=2))

# What classification_report shows (TP = true positives, FP = false positives, FN = false negatives):
# Precision score = TP/(TP+FP)
# Recall score = TP/(TP+FN)
# F1 Score = 2 * Precision Score * Recall Score / (Precision Score + Recall Score)
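As a worked example of these three formulas, the snippet below computes them from a set of hypothetical confusion-matrix counts. The counts are invented for illustration and are not results from this notebook.

```python
# Hypothetical confusion-matrix counts (invented for illustration).
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The true negatives (`tn`) do not enter any of the three scores, which is why they are reported per class in `classification_report`.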

**EVALUATING THE RESULTS**

Seems good, right? Very high precision, recall and F1 score...

Well, not really. These scores were computed on the same rows the tree was trained on. If we let a decision tree grow as deep as it wants, it ends up explaining almost every individual case with its own set of rules, which is overfitting.

As a rule of thumb, a precision above 0.8 on the training data should make us suspect that the result is too good to be true.

Let's plot the current tree so we can see what we are talking about.


In [None]:
# creating the fig and the axes; only the first few levels are drawn
fig, ax = plt.subplots(figsize=(16,12),dpi=130)
# creating the plot
plot_tree(classif, # the decision tree to be plotted
          feature_names=list(X_train.columns), # names of the features the tree was trained on, shown on the first line of each node
          ax=ax, # draw on the matplotlib axes created above
          precision=1, # number of decimal places for numeric values
          filled=True,
          max_depth=4, # how many levels of the tree to draw
          proportion = True, # show the sample counts as proportions
          fontsize = 12 # font size
        )
# adjusting the layout of the figure
plt.tight_layout();

In [None]:
#And down it goes... we should get the most important features and limit how much we want our decision tree to go down...

#Let's go back and do the DecisionTreeClassifier again... but now using the available parameters

'''
class DecisionTreeClassifier(
    *,
    criterion: Literal['gini', 'entropy', 'log_loss'] = "gini",
    splitter: Literal['best', 'random'] = "best",
    max_depth: Int | None = None,
    min_samples_split: float | int = 2,
    min_samples_leaf: float | int = 1,
    min_weight_fraction_leaf: Float = 0,
    max_features: float | int | Literal['auto', 'sqrt', 'log2'] | None = None,
    random_state: Int | RandomState | None = None,
    max_leaf_nodes: Int | None = None,
    min_impurity_decrease: Float = 0,
    class_weight: Mapping | str | Sequence[Mapping] | None = None,
    ccp_alpha: float = 0
)
'''

Classif_tree_optimized = DecisionTreeClassifier(
    random_state=10,
    max_depth = 4
)

classif2 = Classif_tree_optimized.fit(X_train, y_train)

df = pd.DataFrame(X_train.columns)

df2 = pd.DataFrame(classif2.feature_importances_)

df3 = pd.concat(
    [df, df2],
    axis =1
)

tree_importance = df3.set_axis(['Feature', 'Importance'], axis = 'columns').sort_values(by=['Importance'], ascending=False)

tree_importance.head(10)
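The depth limit of 4 is a judgment call. One way to compare candidate depths is k-fold cross-validation with `cross_val_score`, which is imported at the top of the notebook. The sketch below uses synthetic data from `make_classification` rather than the Titanic frame, and the list of candidate depths is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic frame).
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=1)

mean_scores = {}
for depth in (2, 4, 8, None):            # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=10)
    # 5-fold cross-validation: each fold is held out once for scoring.
    scores = cross_val_score(tree, X, y, cv=5, scoring="accuracy")
    mean_scores[depth] = scores.mean()

best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best depth:", best_depth)
```

With the notebook's data, `X_train` and `y_train` would take the place of `X` and `y`.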

**TESTING THE DECISION TREE IN OUR TEST.CSV**

In [None]:
#let's predict on the training data with our new model
y_pred2 = classif2.predict(X_train)

In [None]:
from sklearn.metrics import confusion_matrix  # re-import in case the name was overwritten earlier
cm2 = confusion_matrix(y_train, y_pred2)
cm2

In [None]:
print(classification_report(y_train, y_pred2,digits=2))
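Both reports so far score the tree on the rows it was fitted on. A held-out split, such as the `X_test`/`y_test` pair created earlier, gives a more honest estimate. The sketch below shows the comparison on synthetic data with some label noise. The data and the noise level are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of the labels flipped at random.
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           flip_y=0.10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# score() returns accuracy; compare training rows against held-out rows.
for name, model in [("unrestricted", full_tree), ("max_depth=4", shallow_tree)]:
    print(name,
          "train:", round(model.score(X_tr, y_tr), 3),
          "held-out:", round(model.score(X_te, y_te), 3))
```

The unrestricted tree memorizes the training rows, noise included, so its training accuracy says little about how it behaves on new passengers.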

In [None]:
#Let's test this model in our test.csv file

test_decision_tree = pd.read_csv('test.csv')

test_decision_tree.info()

In [None]:
#Filling up the NaN Ages
trainMeans2 = test_decision_tree.groupby(['Sex','Pclass'])['Age'].mean()

def age_estimate(x):
    if not np.isnan(x['Age']):                  ## if age is not NaN
        return x['Age']                         ## keep the recorded age
    return trainMeans2[x["Sex"], x['Pclass']]    ## otherwise return the mean age of the sex/class group (from trainMeans2)


test_decision_tree['Age'] = test_decision_tree.apply(age_estimate, axis=1)
test_decision_tree.info()

In [None]:
#filling the NaN Fare
trainfareAverage = test_decision_tree.groupby(['Sex','Pclass'])['Fare'].mean()

def fare_estimate(x):
    if not np.isnan(x['Fare']):                  ## if fare is not NaN
        return x['Fare']                         ## keep the recorded fare
    return trainfareAverage[x["Sex"], x['Pclass']]    ## otherwise return the mean fare of the sex/class group (from trainfareAverage)


test_decision_tree['Fare'] = test_decision_tree.apply(fare_estimate, axis=1)
test_decision_tree.info()

#fare = Average
#test_decision_tree['Fare'].fillna()

In [None]:
#Getting our dummies

## Sex Dummies
test_decision_tree['male'] = pd.get_dummies(
    test_decision_tree['Sex'],
    drop_first=True
)
## EMBARK DUMMIES
embark_dummies2 = pd.get_dummies(
    test_decision_tree['Embarked'],                 #this time we encode the Embarked column
    drop_first=True,                                #dropping the first column (should be C)
    prefix='Embarked_'                              #the default separator '_' follows the prefix, so we end up with Embarked__Q and Embarked__S columns
)

test_decision_tree = pd.concat(
    [ test_decision_tree , embark_dummies2 ],    #the two dataframes we want to join
    axis=1                                       #axis=1 means we concatenate the dataframes horizontally (add columns)
)

##P_Class Dummies
pclass_dummies2 = pd.get_dummies(
    test_decision_tree['Pclass'],              #Create one dummy column per passenger class
    drop_first=True,                #Dropping the first class column (Pclass__1), which is not needed
    prefix="Pclass_"                #Resulting columns: Pclass__2 and Pclass__3
)

test_decision_tree = pd.concat(
    [ test_decision_tree , pclass_dummies2 ],   #the two dataframes we want to join
    axis=1                          #axis=1 means we concatenate the dataframes horizontally (add columns)
)

test_decision_tree.info()

In [None]:
X_test.info()

In [None]:
test_df = test_decision_tree[['Age','SibSp','Parch','Fare','male','Embarked__Q','Embarked__S', 'Pclass__2','Pclass__3']]
test_df.info()

In [None]:
test_decision_tree['Survived'] = classif2.predict(test_decision_tree[['Age','SibSp','Parch','Fare','male','Embarked__Q','Embarked__S', 'Pclass__2','Pclass__3']])
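Typing the column list by hand works, but it has to match the training columns exactly, in the same order. One alternative, sketched below on an invented frame, is to reindex the new data against the training column list. Here `train_columns` is a hypothetical stand-in for `X_train.columns`.

```python
import pandas as pd

# Hypothetical stand-in for X_train.columns: the order seen at fit time.
train_columns = ["Age", "male", "Embarked__Q", "Embarked__S"]

# Invented test frame: different column order, an extra column,
# and no Embarked__Q column (no such passenger in this toy sample).
new_data = pd.DataFrame({
    "Embarked__S": [1, 0],
    "PassengerId": [892, 893],
    "male": [1, 0],
    "Age": [34.5, 47.0],
})

# reindex keeps only the training columns, in training order,
# and fills any dummy column that is missing with 0.
aligned = new_data.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Filling with 0 is only sensible for dummy columns; a missing numeric feature such as `Age` would need real imputation instead.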

In [None]:
test_decision_tree.head()

In [None]:
test_decision_tree['Survived'] = test_decision_tree['Survived'].astype(int)

kaggle_response = test_decision_tree[['PassengerId','Survived']].sort_values(by = 'PassengerId')

kaggle_response.head(418)

In [None]:
kaggle_response.to_csv('decision_tree_response.csv', index=False)

# SECOND MODEL - LOGISTIC REGRESSION

In [None]:
titanic2 = titanic.copy()   # work on a copy so the original dataframe is not modified
titanic2.head(10)

In [None]:
titanic2.info()

In [None]:
# We will fit the logistic regression on the most important variables, but first we convert the target and the dummy columns to integers
titanic2['Survived']= titanic2['Survived'].astype(int)
titanic2['Pclass__3']= titanic2['Pclass__3'].astype(int)
titanic2['Pclass__2']= titanic2['Pclass__2'].astype(int)
titanic2['male']= titanic2['male'].astype(int)

In [None]:
titanic2.info()

In [None]:
#smf is the statsmodels formula API; a GLM with a Binomial family is a logistic regression
modelo = smf.glm(formula='Survived ~ male + Pclass__3 + Age + Pclass__2', data=titanic2,
               family = sm.families.Binomial()).fit()
print(modelo.summary())

In [None]:
# good results for the p-values, which are all below 0.05 (the 5% significance level)
# these results show that, relative to the baselines of our dummies, all the selected variables have negative coefficients
# remember that in our case the target variable is 1 (survived) or 0 (did not survive)
# that means that being male, being in any class other than first, and being older decreased the odds of surviving
# the fact that both Pclass__3 and Pclass__2 have negative coefficients shows that both classes fared worse than first class

print(np.exp(modelo.params[1:]))

In [None]:
(np.exp(modelo.params[1:]) - 1) * 100

In [None]:
# now that's easier to understand
# we interpret that as (holding the other variables fixed):
# men had about 92% lower odds of surviving than women
# third-class passengers had about 92% lower odds than first-class passengers
# each additional year of age lowered the odds of surviving by about 3.7%
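The arithmetic behind these percentages is `(exp(coefficient) - 1) * 100`, and the result is a change in odds, not in probability. The sketch below walks through it with hypothetical coefficients and a hypothetical baseline probability, all chosen only to illustrate the calculation.

```python
import math

# Hypothetical coefficients, chosen only to illustrate the arithmetic.
coef_male = -2.5
coef_age = -0.04

def pct_change_in_odds(coef):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (math.exp(coef) - 1) * 100

print(round(pct_change_in_odds(coef_male), 1))
print(round(pct_change_in_odds(coef_age), 1))

# Odds and probability are different scales: odds = p / (1 - p).
p_baseline = 0.75                                   # hypothetical baseline probability
odds_baseline = p_baseline / (1 - p_baseline)       # 3 to 1
odds_male = odds_baseline * math.exp(coef_male)     # apply the odds ratio
p_male = odds_male / (1 + odds_male)                # convert back to a probability

print(round(p_male, 3))
```

A large drop in odds does not translate into the same percentage drop in probability, which is why the interpretation is phrased in terms of odds.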

## LOGISTIC REGRESSION PARAMETERS

In [None]:
log_reg_model = LogisticRegression(penalty=None, solver = 'newton-cg')

In [None]:
#let's separate only the main factors
baseline_df = titanic2[['Survived','Pclass__3', 'Pclass__2', 'Age', 'male']]
baseline_df

In [None]:
y = baseline_df['Survived']
X = baseline_df[['Pclass__3', 'Pclass__2', 'Age', 'male']]

In [None]:
log_reg_model.fit(X,y)

In [None]:
print(log_reg_model.coef_)

In [None]:
## The coefficients are the same values found before with statsmodels; now the accuracy on the training data
accuracy = accuracy_score(y, log_reg_model.predict(X))
print('The model got %0.4f of accuracy.' % accuracy)

In [None]:
print(classification_report(y, log_reg_model.predict(X)))

In [None]:
# Predicting the probabilities
yhat = log_reg_model.predict_proba(X) 
print('AUC: %0.2f' % roc_auc_score(y, yhat[:, 1]))   
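One way to read the AUC: it is the probability that a randomly chosen survivor gets a higher predicted score than a randomly chosen non-survivor. The sketch below computes that directly by comparing every positive/negative pair. The function `pairwise_auc` is a hypothetical helper written for illustration, and the labels and scores are invented.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(toy_y, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, regardless of the 0.5 threshold used by `predict`.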

In [None]:
def plot_roc_curve(y_true, y_score, figsize=(10,6)):
    fpr, tpr, _ = roc_curve(y_true, y_score)
    plt.figure(figsize=figsize)
    auc_value = roc_auc_score(y_true, y_score)
    plt.plot(fpr, tpr, color='orange', label='ROC curve (area = %0.2f)' % auc_value)
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

In [None]:
plot_roc_curve(y, yhat[:, 1])

## Predicting the test.csv file



In [None]:
#Let's test this model in our test.csv file------------------------------------------------

test_decision_tree = pd.read_csv('test.csv')

#Filling up the NaN Ages------------------------------------------------------------------
trainMeans2 = test_decision_tree.groupby(['Sex','Pclass'])['Age'].mean()

def age_estimate(x):
    if not np.isnan(x['Age']):                  ## if age is not NaN
        return x['Age']                         ## keep the recorded age
    return trainMeans2[x["Sex"], x['Pclass']]    ## otherwise return the mean age of the sex/class group (from trainMeans2)


test_decision_tree['Age'] = test_decision_tree.apply(age_estimate, axis=1)


#filling the NaN Fare-------------------------------------------------------------------------
trainfareAverage = test_decision_tree.groupby(['Sex','Pclass'])['Fare'].mean()

def fare_estimate(x):
    if not np.isnan(x['Fare']):                  ## if fare is not NaN
        return x['Fare']                         ## keep the recorded fare
    return trainfareAverage[x["Sex"], x['Pclass']]    ## otherwise return the mean fare of the sex/class group (from trainfareAverage)


test_decision_tree['Fare'] = test_decision_tree.apply(fare_estimate, axis=1)

#fare = Average
#test_decision_tree['Fare'].fillna()

#Getting our dummies-----------------------------------------------------------------------------

## Sex Dummies
test_decision_tree['male'] = pd.get_dummies(
    test_decision_tree['Sex'],
    drop_first=True
)
## EMBARK DUMMIES
embark_dummies2 = pd.get_dummies(
    test_decision_tree['Embarked'],                 #this time we encode the Embarked column
    drop_first=True,                                #dropping the first column (should be C)
    prefix='Embarked_'                              #the default separator '_' follows the prefix, so we end up with Embarked__Q and Embarked__S columns
)

test_decision_tree = pd.concat(
    [ test_decision_tree , embark_dummies2 ],    #the two dataframes we want to join
    axis=1                                       #axis=1 means we concatenate the dataframes horizontally (add columns)
)

##P_Class Dummies
pclass_dummies2 = pd.get_dummies(
    test_decision_tree['Pclass'],              #Create one dummy column per passenger class
    drop_first=True,                #Dropping the first class column (Pclass__1), which is not needed
    prefix="Pclass_"                #Resulting columns: Pclass__2 and Pclass__3
)

test_decision_tree = pd.concat(
    [ test_decision_tree , pclass_dummies2 ],   #the two dataframes we want to join
    axis=1                          #axis=1 means we concatenate the dataframes horizontally (add columns)
)

test_decision_tree.info()

In [None]:
test_log_reg = test_decision_tree

In [None]:
## Remember which columns our baseline used:

## X = baseline_df[['Pclass__3', 'Pclass__2', 'Age', 'male']]

test_log_reg_X = test_log_reg[['Pclass__3', 'Pclass__2', 'Age', 'male']] ## let's use this df that is already transformed for us



In [None]:
log_reg_model.predict(test_log_reg_X)

In [None]:
test_log_reg['Survived'] = log_reg_model.predict(test_log_reg_X).astype(int)

In [None]:
test_log_reg

In [None]:
kaggle_response2 = test_log_reg[['PassengerId','Survived']].sort_values(by = 'PassengerId')
kaggle_response2

In [None]:
kaggle_response2.to_csv('Logistic_regression_response.csv', index=False)