LucasNatalePires/kaggle_titanic

This repository publishes my solution to the Kaggle Titanic challenge.

The code is divided into 5 steps, each trying different alternatives in search of the best possible result. Each step is described below, covering what was done and why.

  • Handled null values using mean() and mode()

  • Some columns had high cardinality, detected with the nunique() function. At this stage I chose to exclude them, since no pattern was apparent initially

  • I excluded the 'Embarked' column because it had string values; at first, I tested the models' accuracy without treating it

  • I created 3 different models using K-Neighbors Classifier (KNC), Random Forest and Logistic Regression. I also checked the accuracy and confusion matrix of each model
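
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the DataFrame is a made-up toy stand-in for the Titanic data, the 50% cardinality threshold and the train/test split are assumptions, and "KNC" is taken to mean scikit-learn's KNeighborsClassifier.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the Titanic training data (values are made up).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 2, 3, 3, 2, 1, 3, 2, 3],
    "Age":      [22, 38, None, 35, 27, None, 54, 14, 58, 20, 4, 39],
    "Fare":     [7.25, 71.28, 7.92, 53.1, 11.13, 8.46,
                 51.86, 30.07, 26.55, 8.05, 16.7, 31.27],
    "Embarked": ["S", "C", "S", "S", None, "Q", "S", "C", "S", "S", "S", "S"],
    "Ticket":   [f"T{i}" for i in range(12)],  # unique per row: high cardinality
})

# Fill numeric gaps with the mean and categorical gaps with the mode.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# nunique() counts distinct values per column. Drop non-numeric columns whose
# distinct count exceeds half the rows (the 0.5 threshold is an assumption).
cardinality = df.nunique()
text_cols = df.select_dtypes(exclude="number").columns
high_card = [c for c in text_cols if cardinality[c] > 0.5 * len(df)]
df = df.drop(columns=high_card)

# In this first step 'Embarked' is left out because it holds strings.
X = df.drop(columns=["Survived", "Embarked"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "KNC": KNeighborsClassifier(n_neighbors=3),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model, then record its accuracy and confusion matrix on held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (
        accuracy_score(y_test, pred),
        confusion_matrix(y_test, pred, labels=[0, 1]),
    )
    print(name, results[name][0])
    print(results[name][1])
```

On the real data, X and y would come from Kaggle's train.csv, and the Kaggle score would come from predictions on test.csv rather than from this local split.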

Score: 0.66746

  • This step keeps everything from the first code, with one addition:
    • I treated the 'Embarked' column with One Hot Encoder, since its values are strings and the models would not work with them as-is
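
As a hedged illustration of this encoding, the sketch below applies scikit-learn's OneHotEncoder to a small made-up DataFrame. The output column names (Embarked_C and so on) are a naming choice for this example, not something taken from the repository.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up); only 'Embarked' needs encoding.
df = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 8.46, 8.05],
})

# Each distinct port becomes its own 0/1 column.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["Embarked"]]).toarray()

# categories_[0] holds the sorted categories found in the first (only) column.
columns = [f"Embarked_{cat}" for cat in encoder.categories_[0]]
embarked_ohe = pd.DataFrame(encoded, columns=columns, index=df.index)

# Replace the string column with its numeric one-hot columns.
df_encoded = pd.concat([df.drop(columns="Embarked"), embarked_ohe], axis=1)
print(df_encoded)
```

The same fitted encoder should be reused with transform() on the test set, so that both sets end up with identical columns.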

Score: 0.76555

  • I used Robust Scaler to scale the 'Age' and 'Fare' columns, which contain very discrepant values (outliers)

  • These values can be easily detected by plotting the columns with Matplotlib

  • Created new columns from the 'SibSp' and 'Parch' columns, seeking better accuracy

  • Checked the correlation between variables to understand which columns could be created or deleted
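
A rough sketch of these three ideas on made-up data follows. The README does not say which columns were derived from 'SibSp' and 'Parch', so FamilySize and IsAlone below are hypothetical examples of such features, not the author's actual choice.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy data (made up), including one extreme Age and one extreme Fare.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 2.0, 80.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 21.07, 512.33],
    "SibSp": [1, 1, 0, 1, 3, 0],
    "Parch": [0, 0, 0, 0, 1, 0],
})

# RobustScaler subtracts the median and divides by the interquartile range,
# so a few extreme values have little influence on the scaling.
scaler = RobustScaler()
df[["Age", "Fare"]] = scaler.fit_transform(df[["Age", "Fare"]])

# Hypothetical derived features: siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Correlation of every column with the target, strongest positive first.
corr_with_target = df.corr()["Survived"].sort_values(ascending=False)
print(corr_with_target)
```

Columns whose correlation with 'Survived' stays close to zero are candidates for removal, although correlation only captures linear relationships.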

Score: 0.76555

  • In this stage, all treatments already carried out in the previous stage were applied.

  • In addition to Random Forest, I applied MLP Classifier (neural networks) to select the best parameters

  • Despite the apparent improvement, there was overfitting (basically, when the algorithm works very well on the training data but does not perform as well on the test data)
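
One common way to spot overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below does this for a Random Forest and an MLPClassifier on synthetic data from make_classification. The dataset and the hyperparameters are assumptions for illustration only, not the settings used in this repository.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic, noisy classification data (not the Titanic data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
}

# A large positive gap between train and test accuracy suggests overfitting.
gaps = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    gaps[name] = train_acc - test_acc
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gaps[name]:.3f}")
```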

Score: 0.69856

  • To solve the overfitting problem, I used Grid Search CV to find the best parameters

  • In the end, I used Random Forest to make the submission and got an improvement compared to the previous code
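
A minimal sketch of a cross-validated grid search over Random Forest parameters is shown below. The synthetic data and the parameter grid are assumptions chosen to keep the example small, and the grid actually searched in this repository may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the preprocessed Titanic features.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid: limiting depth and raising the minimum leaf size
# both restrict how closely each tree can fit the training data.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# Each combination is scored with 5-fold cross-validation on the training set,
# so the choice of parameters never sees the held-out test rows.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_acc = best_model.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out accuracy:", test_acc)
```

Because GridSearchCV refits the best combination on the whole training set by default, best_estimator_ can be used directly to predict the submission file.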

Score: 0.7799