# <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2; text-align: center;">Progress Task 2 (Machine Learning Model Applications and Analysis)</div>

This notebook summarizes the work done in this second task. We briefly describe the preprocessing steps, the models used and their performance in the competition, and the lessons we learned.

<div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #0096FF; text-align: center;">FINAL POSITION</div>

We tested several models for this task: a neural network, a Random Forest and a KNN model. More specifically, the implementations used were:
* For the neural network:
  * A plain Keras model with 3 hidden layers.
  * A Keras model combined with `scikit-learn`'s `RandomizedSearchCV`, to search for good hyperparameters for the network.
  * `MLPClassifier`, the neural network implementation that ships with `scikit-learn`, rather than one from an external library like Keras.
* A Random Forest model, developed using `scikit-learn`'s implementation. This model was the one with the best performance.
* KNN model <!-- TODO Not read this one, fill this in -->


## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Methodology</div>

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">**First attempt.** Random Forest 0.8520</div>

The first thing we did was inspect the dataset, and we realized that it had many null values. We decided to delete the rows containing them, since the dataset was large enough that deleting a few rows wouldn't affect the training.  

**We were wrong.**

When we tried to do the same with the test dataset, we realized that we couldn't simply ignore the null values and that we needed to handle them properly. For nulls in categorical variables, we assigned a new category called 'missing', and for numerical variables, we used -1.
Later, we removed some variables that seemed useless, such as `employment_industry` or `hhs_geo`, and encoded both the training and test sets using `OrdinalEncoder`. This is how we obtained the first dataset. After that, we needed to choose a model.
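The null handling and encoding steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact notebook code: the toy data, the `age_group` and `household_adults` columns, the `clean` helper and the `handle_unknown` setting are all hypothetical choices made for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column groups; the real dataset has many more columns.
CAT_COLS = ["age_group"]
NUM_COLS = ["household_adults"]
DROPPED = ["employment_industry"]

# Toy frames standing in for the real training and test sets.
train = pd.DataFrame({
    "age_group": ["18-34", None, "65+", "35-44"],
    "household_adults": [1.0, None, 0.0, 2.0],
    "employment_industry": ["a", None, "b", None],
})
test = pd.DataFrame({
    "age_group": ["65+", "45-54", None],
    "household_adults": [None, 3.0, 1.0],
    "employment_industry": [None, "a", "c"],
})


def clean(df):
    """Drop unused columns and fill nulls: 'missing' for categories, -1 for numbers."""
    df = df.drop(columns=DROPPED)
    df[CAT_COLS] = df[CAT_COLS].fillna("missing")
    df[NUM_COLS] = df[NUM_COLS].fillna(-1)
    return df


train_clean, test_clean = clean(train), clean(test)

# Fit the encoder on the training set only and reuse it on the test set.
# Assumption: categories unseen during fitting are mapped to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_clean[CAT_COLS] = encoder.fit_transform(train_clean[CAT_COLS])
test_clean[CAT_COLS] = encoder.transform(test_clean[CAT_COLS])
```

Fitting the encoder once and reusing it keeps the integer codes consistent between the two sets, so the same category gets the same code in both.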

Based on the results from the first task, we decided to try Random Forest with hyperparameter search, as it was the best among all the classifiers. 
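A hyperparameter search of this kind could look like the sketch below. It assumes a randomized search with `RandomizedSearchCV`; the synthetic data, the two binary targets, the search ranges and the fixed seeds are illustrative assumptions, not the values we used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data: 300 rows, 6 features, two binary targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
Y = np.column_stack([
    (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int),
    (X[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int),
])

# Hypothetical search space; the ranges are examples only.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}

# RandomForestClassifier accepts a 2D target, so one model predicts both targets.
# With the default scoring, each candidate is ranked by subset accuracy.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X, Y)

best_model = search.best_estimator_
```

After fitting, `search.best_params_` holds the selected combination and `best_model` is already refit on the full data.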

**Result on competition: 0.8520**, around 800th place, which was roughly the middle of the leaderboard. Not bad for a first attempt.


### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">**Trying to improve our ranking.** Gradient Boosting 0.8558</div>

To improve our ranking, we decided to follow two approaches: one would explore Random Forest further, and the other would investigate other models.

**1. Other models**

One teammate tried `MLPClassifier` as an initial approach to neural networks, while another implemented a complete pipeline using Keras. The third tried KNN, but it didn't improve on the Random Forest results. In addition, KNN is sensitive to class imbalance and outliers, and we hadn't addressed these issues in the dataset.

After some time, we thought about trying other ensemble models. Looking through scikit-learn, we found its Gradient Boosting models.
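As a rough sketch of how a scikit-learn Gradient Boosting model can be applied to two targets: `GradientBoostingClassifier` handles a single target, so the example wraps it in `MultiOutputClassifier`. The wrapper, the synthetic data and the settings are assumptions for illustration, not necessarily what we submitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data with two binary targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([
    (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int),
    (X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int),
])

# One boosted model is fitted per target column.
model = MultiOutputClassifier(
    GradientBoostingClassifier(n_estimators=50, random_state=0)
)
model.fit(X, Y)

# predict_proba returns one (n_samples, 2) array per target;
# keep the probability of the positive class for each.
positive_proba = np.column_stack([p[:, 1] for p in model.predict_proba(X)])
```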

**Result on competition: 0.8487 for MLPClassifier**  
**Result on competition: 0.8464 for Neural Networks** (we had some issues with this model)  
**Result on competition: NOT tried for KNN**  
**Result on competition: 0.8558 for Gradient Boosting**  


**2. Random Forest**

We already had a satisfactory result with Random Forest, but we wondered if it could be improved. Since we decided not to fix any random states, we were able to run the tests multiple times and obtain very different results, which we saved in MLflow.

We tried several approaches:

- Running two separate Random Forest models, one for each target, and optimizing them independently instead of using a multi-output model.
- Filtering the dataset for each target, keeping only the features relevant to that target. However, this approach performed much worse, as we lost relationships between the features.
- Finally, since all of this was insufficient, applying a "retraining" method to our initial approach. This improved performance, but only slightly.
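The first idea in the list, one independent model per target, could be sketched as below. The target names `target_a` and `target_b`, the per-target settings and the synthetic data are hypothetical. The seeds are fixed here only to make the example reproducible, whereas our own runs left them unset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; each target depends on a different feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
targets = {
    "target_a": (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int),
    "target_b": (X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int),
}

# Hypothetical hyperparameters, tuned separately for each target.
settings = {
    "target_a": {"n_estimators": 50, "max_depth": 4},
    "target_b": {"n_estimators": 80, "max_depth": 6},
}

# Fit one single-output forest per target instead of one multi-output model.
models = {
    name: RandomForestClassifier(random_state=0, **settings[name]).fit(X, y)
    for name, y in targets.items()
}

# Probability of the positive class, per target.
probabilities = {
    name: model.predict_proba(X)[:, 1] for name, model in models.items()
}
```

The trade-off is that each model can use its own hyperparameters, but nothing is shared between the two targets.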

We stopped working with Random Forest at this point, as we had run out of ideas.

**Result on competition: 0.8538** with the best local Random Forest. 


### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">**What's next?: Use AI**</div>

We thought we had reached the limit with our approach, but to our surprise, the other teams were making significant improvements. We realized that we could also improve our ranking by using all the tools available to us: ChatGPT and Claude.

We decided to try some prompt engineering by explaining to ChatGPT the context of our project, the problem we were tackling, and the models we had already tested. We asked for suggestions on other models we could use to improve our ranking score. After "debating" a bit with the chat, we came to the following conclusions:

- Modify the dataset: for example, by addressing class imbalance or eliminating correlations.  
- Use basic models: such as logistic regression or decision trees. However, we ruled this out completely, since it wasn't aligned with our goals.  
- **Try more complex models:** such as Gradient Boosting Machines (e.g., XGBoost, LightGBM, CatBoost).  
- Optimize hyperparameters: using tools like Optuna or `GridSearchCV`.

Extracts of the conversations:

1. Is Optuna the same as Random Grid Search in sklearn?

🤖: Not exactly. Although Optuna and Grid/RandomizedSearchCV share the idea of searching for optimal hyperparameters, Optuna has several key differences that make it more flexible and efficient.

2. Explain briefly XGBoost, LightGBM, CatBoost

🤖: None of them is native to sklearn.   
**XGBoost**: Highly optimized for speed and performance, but more complex to configure and tune.  
**LightGBM**: Faster than XGBoost in many cases, especially for large and high-dimensional datasets. It handles categorical data natively (no encoding required), but it can be sensitive to imbalanced data.   
**CatBoost**: Handles categorical features directly, often reduces overfitting automatically but has slower initial training compared to LightGBM.

**Therefore we decided to implement CatBoost with Optuna.**

**Result on competition: 0.8590**

------------

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Implementation</div>