
**The first version scored below what I expected, so I had to rethink what could be done to improve it. It is very evident that sex influences the final result. Since the first attempt simply excluded that column, this time we will treat it and transform the object-type variable into a float or int, so that the Logistic Regression model can take more information into account and, consequently, reach a higher accuracy.**

**Because I have already treated the data and explained the reason for each step in version 1, version 2 focuses on converting this data, with the goal of reaching an accuracy of around 85%. I'll leave the link here in case you want to review version 1 in more detail:
[version1](https://github.com/LucasNatalePires/kaggle_titanic/blob/main/titanic_version1.ipynb)**

In [3]:
#Import
import pandas as pd

In [4]:
#Getting to know our dataset
test = pd.read_csv("/content/test.csv")
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [5]:
train = pd.read_csv("/content/train.csv")
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**As mentioned above, we already know the dataset: which values are null and how to treat them. This was explained in detail in the first version, so I perform all of these steps here without much explanation.**

In [6]:
#Dropping these columns from train
train = train.drop(['Name', 'Ticket', 'Cabin'],axis = 1)

In [7]:
#Dropping these columns from test
test = test.drop(['Name', 'Ticket' ,'Cabin'], axis = 1)

In [8]:
#Now it's time to deal with null values, starting with age
train_age = train.Age.mean()
train.loc[train.Age.isnull(),'Age'] = train_age

In [9]:
#Treating null values in the Embarked column
train_embarked = train.Embarked.mode()[0]
train.loc[train.Embarked.isnull(),'Embarked'] = train_embarked

In [10]:
#Now test age
test_age = test.Age.mean()
test.loc[test.Age.isnull(),'Age'] = test_age

In [11]:
#Treating null values in the Fare column
test_fare = test.Fare.mean()
test.loc[test.Fare.isnull(),'Fare'] = test_fare

**Let's just check if we finished with the null values**

In [12]:
#Checking train
pd.isnull(train).sum().sort_values(ascending = False)

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

In [13]:
#Checking test
pd.isnull(test).sum().sort_values(ascending = False)

PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
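
**The null handling above can also be written with `fillna`. This is only a minimal sketch on a tiny made-up frame (the values are illustrative and not taken from the Titanic files): numeric gaps get the column mean and categorical gaps get the mode, as in the cells above.**

```python
import pandas as pd

# Toy frame standing in for train; values are made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill missing values with the column mean
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill missing values with the most frequent value
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Total count of remaining nulls across the whole frame
remaining_nulls = int(toy.isnull().sum().sum())
print(toy)
print(remaining_nulls)
```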

**The data is ok! In the first project, the approach to object-type data was to remove it from the table, but this reduced accuracy. Now let's treat these columns so that they can be used.**

**Sex is either male or female, so we can replace it with 0 or 1. With only two categories, it does not matter which one gets the 1, because no hierarchy is implied.**

In [14]:
#Map 'female' to 1 and everything else to 0, so the model can use this column
train['FemaleCheck'] = train.Sex.apply(lambda x:1 if x =='female' else 0)

In [15]:
#Checking the new values
train[['Sex', 'FemaleCheck']].value_counts()

Sex     FemaleCheck
male    0              577
female  1              314
dtype: int64

In [16]:
#Let's repeat the same thing with the data from test
test['FemaleCheck'] = test.Sex.apply(lambda x:1 if x =='female' else 0)

In [17]:
#Checking the new values
test[['Sex', 'FemaleCheck']].value_counts()

Sex     FemaleCheck
male    0              266
female  1              152
dtype: int64
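
**The same 0/1 column can be built without `apply`, by comparing the whole column at once. A minimal sketch on a made-up column (not the real data), assuming every value other than 'female' should become 0:**

```python
import pandas as pd

# Toy column standing in for train.Sex; values are made up for illustration
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# The comparison yields booleans; astype(int) turns True/False into 1/0
toy["FemaleCheck"] = (toy["Sex"] == "female").astype(int)

print(toy["FemaleCheck"].tolist())
```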

**Now that we have dealt with the 'Sex' column, let's deal with the 'Embarked' column**

In [18]:
#Checking all possible values in the column 'Embarked'
train['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

**The natural thought is: since in the problem above we replaced the values with 0 or 1, here I should just add one more value, as there are 3 possible categories, giving 0, 1, 2. But this can cause a problem, because the model may interpret the numbers as a hierarchy (an order between the ports), which does not exist.**

**For this reason, we will use the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)**
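
**To make the hierarchy concern concrete, here is a small sketch on a made-up column. It uses pandas `get_dummies` only as a quick stand-in for one-hot encoding (the notebook itself uses OneHotEncoder), and compares it with plain integer codes:**

```python
import pandas as pd

# Toy column; values are made up for illustration
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# Integer codes: categories are sorted, so C=0, Q=1, S=2, which suggests an order
codes = ports.astype("category").cat.codes

# One-hot: one indicator column per port, with no implied order
one_hot = pd.get_dummies(ports, prefix="Embarked", dtype=int)

print(codes.tolist())
print(one_hot)
```

**With the integer codes a linear model would treat 'S' as "twice 'Q'", while the one-hot columns only say which port applies.**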

In [19]:
#Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

In [20]:
#Create the encoder. I chose dtype = int32 to make the display friendlier, but there is no problem with choosing float
ohe = OneHotEncoder(handle_unknown= 'ignore', dtype = 'int32')

In [21]:
#Fit on the data
ohe = ohe.fit(train[['Embarked']])

In [22]:
#Transforming the data
ohe.transform(train[['Embarked']]).toarray()

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       ...,
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]], dtype=int32)

In [23]:
#Transforming the array into a DataFrame
ohe_df = pd.DataFrame(ohe.transform(train[['Embarked']]).toarray(), columns = ohe.get_feature_names_out())
ohe_df.head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [24]:
#Now I will add these columns to train
train = pd.concat([train,ohe_df],axis = 1)

In [25]:
#I can see separately how OneHotEncoder worked compared to the Embarked column
train[['Embarked', 'Embarked_C','Embarked_Q','Embarked_S']].value_counts()

Embarked  Embarked_C  Embarked_Q  Embarked_S
S         0           0           1             646
C         1           0           0             168
Q         0           1           0              77
dtype: int64

**Now let's do the same with test**

In [26]:
#Create the encoder
ohe = OneHotEncoder(handle_unknown= 'ignore', dtype = 'int32')

In [27]:
#Fit on the data
ohe = ohe.fit(test[['Embarked']])

In [28]:
#Transforming the data
ohe.transform(test[['Embarked']]).toarray()

array([[0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       ...,
       [0, 0, 1],
       [0, 0, 1],
       [1, 0, 0]], dtype=int32)

In [29]:
#Transforming the array into a DataFrame
ohe_df = pd.DataFrame(ohe.transform(test[['Embarked']]).toarray(), columns = ohe.get_feature_names_out())

In [30]:
#Now I will add these columns to test
test = pd.concat([test, ohe_df], axis = 1)

In [31]:
#Note: this cell selects from train again, so the counts below are the train counts; use test[[...]] to check the test set
train[['Embarked', 'Embarked_C','Embarked_Q','Embarked_S']].value_counts()

Embarked  Embarked_C  Embarked_Q  Embarked_S
S         0           0           1             646
C         1           0           0             168
Q         0           1           0              77
dtype: int64

In [32]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,FemaleCheck,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,male,22.0,1,0,7.25,S,0,0,0,1
1,2,1,1,female,38.0,1,0,71.2833,C,1,1,0,0
2,3,1,3,female,26.0,0,0,7.925,S,1,0,0,1
3,4,1,1,female,35.0,1,0,53.1,S,1,0,0,1
4,5,0,3,male,35.0,0,0,8.05,S,0,0,0,1


In [33]:
test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,FemaleCheck,Embarked_C,Embarked_Q,Embarked_S
0,892,3,male,34.5,0,0,7.8292,Q,0,0,1,0
1,893,3,female,47.0,1,0,7.0,S,1,0,0,1
2,894,2,male,62.0,0,0,9.6875,Q,0,0,1,0
3,895,3,male,27.0,0,0,8.6625,S,0,0,0,1
4,896,3,female,22.0,1,1,12.2875,S,1,0,0,1


In [34]:
#Now I can delete the columns 'Sex' and 'Embarked'
test = test.drop(['Sex', 'Embarked'], axis = 1)
train = train.drop(['Sex', 'Embarked'], axis = 1)

**Now that we have processed the data, let's split the training set into training and validation sets, so we can finally fit the algorithms. For this part we'll use [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**

In [35]:
#Import train_test_split
from sklearn.model_selection import train_test_split

In [36]:
#Separating the training base into X and y
X = train.drop(['PassengerId','Survived'], axis = 1)
y = train.Survived

In [37]:
#Separate train and validation
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.33, random_state=42)
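
**To see what `test_size=0.33` means in rows, the sketch below splits placeholder arrays with the same number of rows as train (891). The contents are dummy values; only the resulting sizes are of interest.**

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with as many rows as train (891); contents are dummy values
X_toy = np.arange(891).reshape(-1, 1)
y_toy = np.zeros(891, dtype=int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

# Number of rows kept for training and held out for validation
print(len(X_tr), len(X_val))
```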

**The competition's evaluation metric is:**

**Metric: Your score is the percentage of passengers you correctly predict. This is known as accuracy.**

**For this reason, we will test the accuracy of these models:**

[Decision Trees](https://scikit-learn.org/stable/modules/tree.html#classification)


[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)


[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)


In [38]:
#Import
from sklearn import tree

In [39]:
#Creating the classifier
clf_dt = tree.DecisionTreeClassifier(random_state = 42)

In [40]:
#Fit on the data
clf_dt = clf_dt.fit(X_train, y_train)

In [41]:
#Making the prediction
y_pred_dt = clf_dt.predict(X_validation)

**Let's do the same job with the KNeighborsClassifier and Logistic Regression**

In [42]:
#Import
from sklearn.neighbors import KNeighborsClassifier

In [43]:
#Creating the classifier
clf_knc = KNeighborsClassifier(n_neighbors=3)

In [44]:
#Fit on the data
clf_knc = clf_knc.fit(X_train, y_train)

In [45]:
#Making the prediction
y_pred_knc = clf_knc.predict(X_validation)

**Now Logistic Regression**

In [46]:
#Import
from sklearn.linear_model import LogisticRegression

In [47]:
#Creating the classifier
clf_lr = LogisticRegression(random_state = 42, max_iter= 1000)

In [48]:
#Fit on the data
clf_lr = clf_lr.fit(X_train, y_train)

In [49]:
#Making the prediction
y_pred_lr = clf_lr.predict(X_validation)

**As I said, the metric is accuracy, so it's time to compare the models and find the best one:**

[Accuracy](https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score)

In [50]:
#Import Accuracy
from sklearn.metrics import accuracy_score

In [51]:
#Decision Tree Accuracy
accuracy_score(y_validation, y_pred_dt)

0.7457627118644068

In [52]:
#KNeighborsClassifier accuracy
accuracy_score(y_validation, y_pred_knc)

0.7152542372881356

In [53]:
#Logistic Regression Accuracy
accuracy_score(y_validation, y_pred_lr)

0.8169491525423729
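
**The fit, predict and score steps above repeat for each model, so they could be written as one loop. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption made to keep it self-contained), so its scores say nothing about the Titanic results.**

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Titanic dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42
)

# Same three classifiers and settings as in the cells above
models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logistic_regression": LogisticRegression(random_state=42, max_iter=1000),
}

# Fit each model and store its validation accuracy
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

for name, score in scores.items():
    print(name, round(score, 4))
```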

**I can also use the confusion_matrix to see how the errors are distributed between the two classes, which complements the accuracy score.**


[Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

In [54]:
#Import
from sklearn.metrics import confusion_matrix

In [55]:
#Decision Tree confusion_matrix
confusion_matrix(y_validation, y_pred_dt)

array([[137,  38],
       [ 37,  83]])

In [56]:
#KNeighborsClassifier confusion_matrix
confusion_matrix(y_validation, y_pred_knc)

array([[147,  28],
       [ 56,  64]])

In [57]:
#Logistic Regression confusion_matrix
confusion_matrix(y_validation, y_pred_lr)

array([[153,  22],
       [ 32,  88]])
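
**The accuracy values reported earlier can be recovered from these matrices: the correct predictions sit on the diagonal, and dividing their sum by the total gives the accuracy. A short sketch using the three matrices printed above (the helper name `accuracy_from_matrix` is my own):**

```python
# Confusion matrices copied from the outputs above: rows are true classes,
# columns are predicted classes, in the order [0, 1]
matrices = {
    "decision_tree": [[137, 38], [37, 83]],
    "knn": [[147, 28], [56, 64]],
    "logistic_regression": [[153, 22], [32, 88]],
}


def accuracy_from_matrix(cm):
    # Correct predictions are the diagonal entries
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total


accuracies = {name: accuracy_from_matrix(cm) for name, cm in matrices.items()}
for name, acc in accuracies.items():
    print(name, acc)
```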

**Logistic Regression is still the best model to use!**

In [58]:
#The training features and the test file differ in their columns, let's check
X_train.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,FemaleCheck,Embarked_C,Embarked_Q,Embarked_S
6,1,54.0,0,0,51.8625,0,0,0,1
718,3,29.699118,0,0,15.5,0,0,1,0
685,2,25.0,1,2,41.5792,0,1,0,0
73,3,26.0,1,0,14.4542,0,1,0,0
882,3,22.0,0,0,10.5167,1,0,0,1


In [59]:
test.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,FemaleCheck,Embarked_C,Embarked_Q,Embarked_S
0,892,3,34.5,0,0,7.8292,0,0,1,0
1,893,3,47.0,1,0,7.0,1,0,0,1
2,894,2,62.0,0,0,9.6875,0,0,1,0
3,895,3,27.0,0,0,8.6625,0,0,0,1
4,896,3,22.0,1,1,12.2875,1,0,0,1


In [60]:
#It is necessary to drop the column 'PassengerId' so the test features match the training features
X_test = test.drop(['PassengerId'], axis = 1)
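
**Before predicting, it may be worth confirming that the test features have the same columns, in the same order, as the features the model was trained on. A minimal sketch on made-up frames (the `_toy` names are hypothetical; the column names mirror the notebook):**

```python
import pandas as pd

# Toy frames; the column names mirror the notebook, the values are made up
X_train_toy = pd.DataFrame({"Pclass": [3, 1], "Age": [22.0, 38.0], "FemaleCheck": [0, 1]})
X_test_toy = pd.DataFrame({"FemaleCheck": [1], "Pclass": [2], "Age": [30.0]})

# Same set of columns, regardless of order
same_columns = set(X_train_toy.columns) == set(X_test_toy.columns)

# Reorder the test columns to match the order used for training
X_test_aligned = X_test_toy[X_train_toy.columns]

print(same_columns)
print(list(X_test_aligned.columns))
```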

In [61]:
#Using logistic regression on the test data
y_pred = clf_lr.predict(X_test)

In [62]:
#Creating the column 'Survived'
test['Survived'] = y_pred

In [63]:
#Selecting just the columns 'PassengerId' and 'Survived' for submission, and keeping the updated datasets
result = test[['PassengerId','Survived']]
train_clean_upd = train
test_clean_upd = test

In [64]:
#Exporting to .CSV
result.to_csv('result2.csv', index = False)
train_clean_upd.to_csv('train_clean_upd.csv', index = False)
test_clean_upd.to_csv('test_clean_upd.csv', index = False)
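
**A quick check of the exported file can catch format problems before uploading. The sketch below writes a made-up result frame to a temporary folder and reads it back. The checks are my own assumptions about what a valid file should look like (exactly the two columns, no nulls, labels 0 or 1).**

```python
import os
import tempfile

import pandas as pd

# Toy predictions; the ids and labels are made up for illustration
result_toy = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# Write to a temporary folder and read the file back
path = os.path.join(tempfile.mkdtemp(), "result_toy.csv")
result_toy.to_csv(path, index=False)
check = pd.read_csv(path)

# Checks: exact columns, no missing values, labels restricted to 0 and 1
columns_ok = list(check.columns) == ["PassengerId", "Survived"]
no_nulls = not check.isnull().any().any()
labels_ok = bool(check["Survived"].isin([0, 1]).all())
print(columns_ok, no_nulls, labels_ok)
```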