<a href="https://colab.research.google.com/github/LucasNatalePires/kaggle_titanic/blob/main/titanic_version4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
import pandas as pd

**The data is available to download via [Kaggle](https://www.kaggle.com/competitions/titanic/data)**

In [25]:
#Getting to know our dataset
test = pd.read_csv('/content/test.csv')
test.head(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q


In [26]:
train = pd.read_csv('/content/train.csv')
train.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


**The data processing below may seem long, but it is simply the set of treatments from the previous three versions: handling all null values and transforming or creating columns so that the algorithm can achieve better accuracy. More details are in the respective versions:**

[Version 1](https://github.com/LucasNatalePires/kaggle_titanic/blob/main/titanic_version1.ipynb)

[Version 2](https://github.com/LucasNatalePires/kaggle_titanic/blob/main/titanic_version2.ipynb)

[Version 3](https://github.com/LucasNatalePires/kaggle_titanic/blob/main/titanic_version3.ipynb)
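
Before applying these treatments, it can help to see which columns actually contain nulls. The sketch below uses a tiny hypothetical frame (`toy`, with made-up values) instead of the real CSV files, and then fills numeric columns with the mean and the categorical column with the mode, in the same spirit as the cells that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv; the values are made up
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Count nulls per column to decide which ones need treatment
null_counts = toy.isnull().sum()
print(null_counts)

# Numeric columns: fill with the mean; categorical column: fill with the mode
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].mean())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
```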

In [27]:
#Version 1

#Deleting these columns from train
train = train.drop(['Name','Ticket','Cabin'],axis=1)
#Deleting the same columns from test
test = test.drop(['Name','Ticket','Cabin'],axis=1)



# Now it's time to deal with null values, starting with age
train_age = train.Age.mean()
test_age = test.Age.mean()
train.loc[train.Age.isnull(),'Age'] = train_age
test.loc[test.Age.isnull(),'Age'] = test_age


#Treating null values in the fare column
test_fare = test.Fare.mean()
test.loc[test.Fare.isnull(),'Fare'] = test_fare


#Treating null values in the embarked column
train_embarked = train.Embarked.mode()[0]
train.loc[train.Embarked.isnull(),'Embarked'] = train_embarked

In [28]:
#Version 2

#Encode 'male' as 1 and 'female' as 0, so the model can use this column as numeric data.
train['MaleCheck'] = train.Sex.apply(lambda x:1 if x == 'male' else 0)

#Let's repeat the same thing with the test data
test['MaleCheck'] = test.Sex.apply(lambda x:1 if x == 'male' else 0)

**As I mentioned earlier about "wasted work" and mistakes: in this function that encodes the passenger's sex, I initially checked whether the passenger was female and then carried out the remaining steps. When I calculated the accuracy using the MLP Classifier, the value was 0.8135593220338984. Reviewing the code, I replaced the 'female' check with a 'male' check; despite being a simple change, the accuracy rose to 0.8305084745762712.**
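
With only two categories in the column, a 'male' flag is the exact complement of a 'female' flag, so both encodings carry the same information. The difference in accuracy is therefore presumably related to how the MLP trains on the inputs rather than to any new information (an assumption, not something verified here). A minimal sketch on a hypothetical sample of the `Sex` column, which also shows a vectorized alternative to `apply` with a lambda:

```python
import pandas as pd

# Hypothetical sample of the Sex column
sex = pd.Series(["male", "female", "female", "male"])

# Vectorized equivalent of the apply/lambda encoding
male_check = (sex == "male").astype(int)
female_check = (sex == "female").astype(int)

# With only two categories, one flag is the complement of the other
complement_holds = bool((male_check == 1 - female_check).all())
print(male_check.tolist(), female_check.tolist(), complement_holds)
```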

In [29]:
#Version 3

#Import
from sklearn.preprocessing import RobustScaler
#Creating the scaler and fitting it on the train data
transformer = RobustScaler().fit(train[['Age','Fare']])
#Transforming the train data
train[['Age', 'Fare']] = transformer.transform(train[['Age','Fare']])

#Creating a second scaler, fitted on the test data
transformer = RobustScaler().fit(test[['Age','Fare']])
#Transforming the test data
test[['Age', 'Fare']] = transformer.transform(test[['Age','Fare']])


#Create a function to check if the person is alone
def check_alone(x,y):
    if (x == 0 and y == 0):
        return 1
    else:
        return 0
#Create the column and apply the function
train['Alone'] = train.apply(lambda x:check_alone(x.SibSp, x.Parch), axis=1)
#Create the column and apply the function
test['Alone'] = test.apply(lambda x:check_alone(x.SibSp, x.Parch), axis=1)


#Creating the column 'relatives' on train
train['Relatives'] = train['SibSp'] + train['Parch']
#Doing the same thing on test
test['Relatives'] = test['SibSp'] + test['Parch']

#Treating the column 'Embarked'
#Import
from sklearn.preprocessing import OrdinalEncoder
#Creating the encoder with a fixed category order (S=0, C=1, Q=2)
ports = ['S','C','Q']
enc = OrdinalEncoder(categories=[ports],dtype = 'int32')
#Fit on the train data
enc = enc.fit(train[['Embarked']])
#Replace the column in train with the encoded values
train['Embarked'] = enc.transform(train[['Embarked']])
#Same for test: fit on the test data
enc = enc.fit(test[['Embarked']])
#Replace the column in test with the encoded values
test['Embarked'] = enc.transform(test[['Embarked']])


#Delete the column 'Sex'
test = test.drop('Sex', axis=1)
train = train.drop('Sex',axis=1)

In [30]:
#View the treated dataset
train.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,MaleCheck,Alone,Relatives
0,1,0,3,-0.59224,1,0,-0.312011,0,1,0,1
1,2,1,1,0.638529,1,0,2.461242,1,0,0,1
2,3,1,3,-0.284548,0,0,-0.282777,0,0,1,0
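
The scaled `Age` and `Fare` values above come from `RobustScaler` objects fitted separately on train and on test. A common alternative is to fit the scaler once on the training data and reuse it for the test data, so both sets are scaled with the same median and interquartile range. The sketch below illustrates that alternative on hypothetical toy frames (`toy_train`, `toy_test`, made-up values); it is not what this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data; the values are made up for illustration
toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0, 60.0],
                          "Fare": [5.0, 10.0, 15.0, 20.0, 100.0]})
toy_test = pd.DataFrame({"Age": [40.0, 80.0],
                         "Fare": [15.0, 30.0]})

# Fit once on train: learns the median and IQR of each column
scaler = RobustScaler().fit(toy_train[["Age", "Fare"]])

# Apply the same fitted scaler to both sets
train_scaled = scaler.transform(toy_train[["Age", "Fare"]])
test_scaled = scaler.transform(toy_test[["Age", "Fare"]])
print(test_scaled)
```

Here the train medians are 40 (Age) and 15 (Fare), so a test row equal to those medians maps to 0 in both columns.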


**Separating the data into train and validation sets. As this process has already been done several times, I will not go into detail again.**

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
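
As a quick reminder of what the split does, here is a minimal sketch on hypothetical toy arrays (`X_toy`, `y_toy`). It also passes the optional `stratify` argument, which keeps the class proportions the same in both parts; the notebook itself does not use `stratify`, so treat it as an illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 rows, 2 features, 8 zeros and 4 ones as labels
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0] * 8 + [1] * 4)

# Hold out 25% for validation, keeping the 2:1 class ratio in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_val), y_tr.sum(), y_val.sum())
```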

In [31]:
#Import
from sklearn.model_selection import train_test_split

#Separating the training base into X and y
X = train.drop(['PassengerId','Survived'],axis = 1)
y = train.Survived

#Separate train and validation
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.33, random_state=42)

**Based on the accuracy of the algorithms used so far, logistic regression performed best, so I will not include the K-Neighbors Classifier or the Decision Tree. I will include other algorithms to see whether there is any improvement.**

[Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

[Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

[MLP Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)

**After that, let's import the [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) and see the results**

**I will check the [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) and, finally, submit the predictions**
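
The same fit, predict and score steps can be written once and applied to each classifier in a loop. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the Titanic data, and the names `X_demo`, `y_demo`, `models` and `scores` are hypothetical. The hyperparameters are chosen just to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the treated dataset (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# One entry per algorithm compared in this notebook
models = {
    "logistic_regression": LogisticRegression(random_state=42, max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42, n_estimators=50),
    "mlp": MLPClassifier(random_state=42, max_iter=2000),
}

scores = {}
for name, model in models.items():
    # Fit on the training part, predict on the validation part
    y_pred = model.fit(X_tr, y_tr).predict(X_val)
    scores[name] = accuracy_score(y_val, y_pred)
    print(name, round(scores[name], 4))
    print(confusion_matrix(y_val, y_pred))
```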

In [32]:
from sklearn.linear_model import LogisticRegression

# Creating the classifier

clf_lr = LogisticRegression(random_state=42)

# Fitting with the data

clf_lr = clf_lr.fit(X_train,y_train)

# Making the prediction

y_pred_lr = clf_lr.predict(X_validation)

In [33]:
#Import
from sklearn.ensemble import RandomForestClassifier

#Creating the Classifier
clf_rf = RandomForestClassifier(random_state=42)

#Fitting with the data
clf_rf = clf_rf.fit(X_train, y_train)

#Making the prediction
y_pred_rf = clf_rf.predict(X_validation)


In [34]:
#Import
from sklearn.neural_network import MLPClassifier

#Creating the Classifier
clf_mlp = MLPClassifier(random_state=42, max_iter= 5000)

#Fitting with the data
clf_mlp = clf_mlp.fit(X_train, y_train)

#Making the prediction
y_pred_mlp = clf_mlp.predict(X_validation)

In [35]:
#Import
from sklearn.metrics import accuracy_score

# Logistic regression accuracy
accuracy_score(y_validation,y_pred_lr)

0.8067796610169492

In [36]:
# MLP accuracy
accuracy_score(y_validation,y_pred_mlp)

0.8305084745762712

In [37]:
#Import
from sklearn.metrics import confusion_matrix

In [38]:
#logistic regression confusion_matrix
confusion_matrix(y_validation, y_pred_lr)

array([[152,  23],
       [ 34,  86]])

In [39]:
#random forest confusion_matrix
confusion_matrix(y_validation, y_pred_rf)

array([[148,  27],
       [ 35,  85]])

In [40]:
# mlp Classifier confusion_matrix
confusion_matrix(y_validation, y_pred_mlp)

array([[160,  15],
       [ 35,  85]])
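
Accuracy can be recovered from each confusion matrix as the sum of the diagonal (correct predictions) divided by the total number of validation rows (295). This is useful here because the random forest accuracy was not printed above; from its matrix it is 233/295, about 0.7898. A short sketch using the three matrices shown above (the dictionary names are only for illustration):

```python
import numpy as np

# Confusion matrices copied from the outputs above
matrices = {
    "logistic_regression": np.array([[152, 23], [34, 86]]),
    "random_forest": np.array([[148, 27], [35, 85]]),
    "mlp": np.array([[160, 15], [35, 85]]),
}

# accuracy = correct predictions (diagonal) / all predictions
accuracies = {name: np.trace(cm) / cm.sum() for name, cm in matrices.items()}
for name, acc in accuracies.items():
    print(name, round(acc, 4))
```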

In [41]:
X_train.head(3)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Embarked,MaleCheck,Alone,Relatives
6,1,1.869299,0,0,1.620136,0,1,1,0
718,3,0.0,0,0,0.045293,2,1,1,0
685,2,-0.361471,1,2,1.174771,1,1,0,3


In [42]:
# It is necessary to delete the column 'PassengerId'
X_test = test.drop('PassengerId', axis=1)

In [43]:
#Using the MLP classifier on the test data
y_pred = clf_mlp.predict(X_test)

In [44]:
#Creating the column 'Survived'
test['Survived'] = y_pred

In [45]:
#Selecting just the columns 'PassengerId' and 'Survived' for the submission file
result = test[['PassengerId','Survived']]

In [46]:
#Exporting to .CSV
result.to_csv('result3.csv', index = False)

# The score was 0.7488, lower than the validation accuracy of about 0.8305. This points to overfitting, which is when a model performs very well on the data it was trained on but does not perform as well on unseen data. There are ways to reduce this discrepancy, and they will be addressed in version 5.
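
One common way to get a less optimistic estimate than a single train/validation split is k-fold cross-validation, which averages the accuracy over several different splits. The sketch below is only an illustration on synthetic data (`X_demo` and `y_demo` are hypothetical names, not the Titanic data), and it is not necessarily the approach taken in version 5.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the treated dataset (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, 5 times
clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

# The spread across folds gives a rough idea of how stable the estimate is
print(cv_scores.mean(), cv_scores.std())
```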