This repository was created to publish my solution to the Titanic challenge.
The code is divided into 5 steps, each trying a different alternative in search of the best possible result. Each step is explained below: what was done and why.
- Some columns showed high cardinality, detected with the nunique() function. At this stage I chose to exclude them, since no pattern was apparent initially.
- I excluded the 'Embarked' column because it had string values. At first, I tested the model's accuracy without treating it.
- I created 3 different models using KNC, Random Forest and Logistic Regression. I also checked the accuracy and confusion matrix of each model.
Score: 0.66746
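
A rough sketch of this first step, not the repository's actual code. The DataFrame is synthetic (made-up values), the 50% cardinality threshold is an assumption, and "KNC" is assumed to mean scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Titanic training data (all values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Name": [f"Passenger {i}" for i in range(n)],   # one distinct value per row
    "Ticket": [f"T{i:04d}" for i in range(n)],      # one distinct value per row
    "Pclass": rng.integers(1, 4, n),
    "Age": rng.integers(1, 80, n),
    "Fare": rng.uniform(5, 100, n).round(2),
    "Embarked": rng.choice(["S", "C", "Q"], n),
})
# Made-up target loosely tied to Pclass so the models have something to learn.
df["Survived"] = ((df["Pclass"] == 1) | (rng.random(n) < 0.2)).astype(int)

# nunique() counts the distinct values in each column.
cardinality = df.nunique()
text_columns = ["Name", "Ticket", "Embarked"]
# Assumption: a text column is "high cardinality" when more than half of
# its rows hold distinct values.
high_cardinality = [c for c in text_columns if cardinality[c] > 0.5 * len(df)]

# Drop the high-cardinality columns and, in this first step, 'Embarked' too.
X = df.drop(columns=high_cardinality + ["Embarked", "Survived"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    cm = confusion_matrix(y_test, pred, labels=[0, 1])
    results[name] = (acc, cm)
    print(name, "accuracy:", acc)
    print(cm)
```

Each confusion matrix has the true classes as rows and the predicted classes as columns, so the diagonal holds the correct predictions.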
- In addition to everything that was done in the first code, the only change was:
- I treated the 'Embarked' column with One Hot Encoder, since its values were strings and the models would not work with them.
Score: 0.76555
- I used Robust Scaler to scale the 'Age' and 'Fare' columns, which contain very discrepant values (outliers).
- These values can be easily detected using Matplotlib.
- I created new columns from the 'SibSp' and 'Parch' columns, seeking better accuracy.
- I checked the correlation between variables to understand which columns could be created or deleted.
Score: 0.76555
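
A minimal sketch of these treatments, under some assumptions: the rows are made up, and the `FamilySize` and `IsAlone` column names are hypothetical examples of what can be derived from 'SibSp' and 'Parch', not necessarily the columns created in this repository.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up rows with a few extreme 'Age' and 'Fare' values.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 4.0, 80.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 16.70, 512.33],
    "SibSp": [1, 1, 0, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 1, 0],
})

# RobustScaler centers each column on its median and divides by the
# interquartile range, so extreme values have less influence on the scale.
scaler = RobustScaler()
df[["Age", "Fare"]] = scaler.fit_transform(df[["Age", "Fare"]])

# Hypothetical engineered columns built from 'SibSp' and 'Parch'.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1   # +1 counts the passenger
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Correlation of every column with the target, strongest positive first.
correlations = df.corr()["Survived"].sort_values(ascending=False)
print(correlations)
```

Columns with a correlation close to zero are candidates for removal, and a new column is worth keeping if it correlates with the target more strongly than the columns it was built from.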
- In this stage, all treatments already carried out in the previous stage were applied.
- In addition to Random Forest, I applied MLP Classifier (a neural network) and tried to select the best parameters.
- Despite the apparent improvement, there was overfitting (basically, when the algorithm works very well on the training data but does not perform as well on the test data).
Score: 0.69856
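
One common way to spot overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below assumes scikit-learn's `MLPClassifier`, and uses synthetic data from `make_classification` with arbitrary hyperparameters; it is not the configuration used in this repository.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic, noisy data standing in for the prepared Titanic features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A fairly large network for such a small dataset (arbitrary choice).
mlp = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

train_acc = mlp.score(X_train, y_train)   # accuracy on data seen in training
test_acc = mlp.score(X_test, y_test)      # accuracy on held-out data
gap = train_acc - test_acc

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {gap:.3f}")
```

A large positive gap suggests the model has memorized the training data instead of learning a pattern that generalizes.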
- To solve the overfitting problem, I used Grid Search CV to find the best parameters.
- In the end, I used Random Forest to make the submission and got an improvement over the previous code.
Score: 0.7799
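
A parameter search of this kind might look like the sketch below, assuming scikit-learn's `GridSearchCV` around a `RandomForestClassifier`. The data is synthetic and the parameter grid is an illustrative assumption, not the grid used for the submission.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the prepared Titanic features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Illustrative grid: shallower trees and larger leaves limit model
# complexity, which is one way to reduce overfitting.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# Every combination is scored with 5-fold cross-validation on the training
# set, so the choice of parameters never sees the held-out test data.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
print("held-out accuracy:", search.score(X_test, y_test))
```

By default `GridSearchCV` refits the best combination on the whole training set, so `search` can be used directly to predict the submission data.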