# AutoPrep

---

## Table of contents 
1. [Autoprep's objective](#autopreps-objective)  
2. [Specifications](#specifications)  
3. [Existing solutions](#existing-solutions)  
4. [Preprocessing](#preprocessing)
5. [Modelling](#modelling)
6. [Report](#report)
7. [Example usage](#example-usage)
8. [Conclusion](#conclusion)

---

## About AutoPrep
---

## Autoprep's objective
The goal of the Autoprep project is to provide users with a fully automated machine learning package that handles most tasks for them. We aim to highlight the importance of preprocessing in machine learning workflows (hence Autoprep's name). We deliver extensive preprocessing as well as detailed reporting in researchers' beloved LaTeX. Hyperparameter tuning and modelling are definitely *not* neglected either. Since this is an *auto*-ML package, the system itself determines the task type (regression, binary classification, or multiclass classification). With the AI Act in mind, Autoprep delivers explainable solutions using Shapley plots.
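To make the automatic task detection more concrete, here is a minimal sketch of how a task type could be inferred from the target column. The function name `infer_task`, the `max_classes` cut-off, and the heuristic itself are assumptions made for illustration; they are not taken from Autoprep's code base.

```python
import pandas as pd


def infer_task(target: pd.Series, max_classes: int = 20) -> str:
    """Guess the task type from the target column (hypothetical heuristic)."""
    n_unique = target.nunique(dropna=True)
    if n_unique == 2:
        return "binary"
    is_numeric = pd.api.types.is_numeric_dtype(target)
    is_float = pd.api.types.is_float_dtype(target)
    # Assumption: float targets, or numeric targets with many distinct
    # values, are treated as regression; everything else is multiclass.
    if is_numeric and (is_float or n_unique > max_classes):
        return "regression"
    return "multiclass"


print(infer_task(pd.Series([0, 1, 1, 0])))       # two distinct values
print(infer_task(pd.Series(["a", "b", "c"])))    # non-numeric labels
print(infer_task(pd.Series([0.5, 1.2, 3.3])))    # float target
```

A real implementation would likely need extra rules, for example for integer-coded classes with many levels.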


## Specifications
Our package provides an automated machine learning system for **tabular** data. Users specify which column is the target, and the process begins. Autoprep emphasizes **preprocessing** by choosing from up to 48 possible pipelines (when the non-required steps are enabled). Besides optimizing the chosen metric, we also generate an extensive (around 20 pages, depending on the dataset) LaTeX report consisting of:
- dataset overview,  
- exploratory data analysis,  
- preprocessing, hyperparameter tuning, and modeling steps details,  
- best model's interpretations with Shapley Plots.


## Existing solutions
It is nearly impossible to create something truly *new* these days, yet Autoprep still stands out from the crowd. Let's take a look at existing automated ML solutions and how they differ from Autoprep.

1. **Auto-sklearn**: Focuses on model selection and hyperparameter tuning but lacks extensive preprocessing capabilities and detailed reporting.
2. **TPOT**: Automates the entire machine learning pipeline but does not provide detailed LaTeX reports or emphasize preprocessing as much as Autoprep.
3. **H2O.ai**: Offers a comprehensive suite of tools for automated machine learning but does not focus specifically on preprocessing or detailed LaTeX reporting.
4. **PyCaret**: Focuses on simplifying the machine learning process with a user-friendly interface. While PyCaret provides fantastic prototyping, model comparison, and blending, it does not offer the same level of preprocessing options or detailed LaTeX reporting as Autoprep.
5. **MLJAR**: Focuses on automating the machine learning pipeline with a range of models for both classification and regression. While it offers solid preprocessing capabilities like handling missing values, scaling, and feature importance-based reduction, it lacks some of the advanced preprocessing techniques, such as VIF or UMAP, that Autoprep provides.
6. **Hyperopt-Sklearn**: Lacks advanced preprocessing capabilities, requiring manual setup for scaling, imputation, and feature selection, and does not support dimensionality reduction methods like PCA or UMAP. In contrast, Autoprep offers a more comprehensive preprocessing pipeline, including advanced techniques like VIF for feature selection and UMAP for dimensionality reduction, along with automated handling of missing data, scaling, and a detailed LaTeX report.
7. **Google AutoML Tables**: Offers automated preprocessing but lacks fine-grained customization and advanced techniques like VIF for feature selection or UMAP for dimensionality reduction. In contrast, Autoprep provides more flexibility with advanced preprocessing methods and better control over feature engineering and dimensionality reduction.

While these solutions are powerful, Autoprep aims to differentiate itself by focusing extensively on the preprocessing steps and providing detailed LaTeX reports. Our goal is to offer a comprehensive and explainable automated machine learning package that meets the needs of both novice and experienced users.

---

## Technical details 
---

## Preprocessing  
As the name suggests, preprocessing is Autoprep's core component.

In Autoprep, we distinguish between **required** and **additional** (a.k.a. non-required) steps.  

The obligatory phases consist of:
- **Missing data imputation**: for numerical data, we impute the median, and for categorical data, we impute the most frequent value or "Missing" string if NAs dominate in the column.
- **Removing columns with 100% unique categorical values**
- **Categorical features encoding**: if a feature has fewer than 5 unique values, One Hot Encoding is used; otherwise, Label Encoding is applied.
- **Scaling**: three scalers are possible: min-max, robust, and standard scaler.
- **Removing columns with 0 variance**
- **Detecting highly correlated features**: if features are highly correlated (default threshold = 0.8), one of them is removed.
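
Several of the required steps above can be sketched with plain pandas. This is an illustrative sketch, not Autoprep's implementation: the function names are hypothetical, "NAs dominate" is read as "more than half of the values are missing", and the choice of which feature to drop from a correlated pair (the later column) is an assumption.

```python
import numpy as np
import pandas as pd


def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Median for numeric columns; mode or the string "Missing" for categorical ones."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        elif out[col].isna().mean() > 0.5:
            # Assumption: NAs "dominate" when more than half of the values are missing.
            out[col] = out[col].fillna("Missing")
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def encode(df: pd.DataFrame, one_hot_below: int = 5) -> pd.DataFrame:
    """One-hot encode columns with fewer than `one_hot_below` values, label encode the rest."""
    out = df.copy()
    cat_cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    for col in cat_cols:
        if out[col].nunique() < one_hot_below:
            dummies = pd.get_dummies(out[col], prefix=col, dtype=int)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
        else:
            # Label encoding: map each category to an integer code.
            out[col] = out[col].astype("category").cat.codes
    return out


def drop_constant(df: pd.DataFrame) -> pd.DataFrame:
    """Remove columns that hold a single value (zero variance)."""
    return df.loc[:, df.nunique() > 1]


def drop_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop the later feature of every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)
```

Scaling is left out here; scikit-learn's `MinMaxScaler`, `RobustScaler`, and `StandardScaler` correspond to the three scalers named in the list.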

The additional phases consist of:
- **Feature selection**: features may be selected based on their correlation with the target (default threshold = 0.7) or on Random Forest feature importance (default threshold: top 70%). 
- **Dimension reduction**: using Principal Component Analysis (PCA) (threshold = 0.95), Uniform Manifold Approximation and Projection (UMAP) (50 components for datasets with over 100 columns or 50% features otherwise), or Variance Inflation Factor (VIF).
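
As a rough illustration of two of the additional steps, the sketch below selects features by their correlation with the target and reduces dimensionality with PCA. The function names are hypothetical, and reading the 0.7 threshold as "keep features whose absolute correlation is at least 0.7" and the 0.95 threshold as "explained variance ratio" are assumptions; the PCA call uses scikit-learn's documented behaviour for a fractional `n_components`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def select_by_target_correlation(X: pd.DataFrame, y: pd.Series, threshold: float = 0.7) -> pd.DataFrame:
    """Keep features whose absolute Pearson correlation with the target reaches the threshold."""
    corr = X.corrwith(y).abs()
    return X.loc[:, corr >= threshold]


def reduce_with_pca(X: pd.DataFrame, explained_variance: float = 0.95) -> np.ndarray:
    """Keep the smallest number of components explaining the requested share of variance."""
    pca = PCA(n_components=explained_variance, svd_solver="full")
    return pca.fit_transform(X)


# Toy data: "copy" nearly duplicates "signal", "noise" is unrelated to the target.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
X = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=200),
    "copy": 2 * signal + rng.normal(scale=0.01, size=200),
})
y = pd.Series(3 * signal + rng.normal(scale=0.1, size=200))

print(list(select_by_target_correlation(X, y).columns))
print(reduce_with_pca(X).shape)
```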

Multiple pipelines are generated, from which we choose up to 16 (to save time). They are then scored using a Random Forest (Classifier or Regressor) model: the model is fit on the preprocessed data, the AUC/MSE score is calculated, and each pipeline receives a rank. Subsequently, the 3 best pipelines are saved to a .joblib file and passed on to modelling.

*Note:* We choose the top 3 best pipelines instead of just 1, since the Random Forest results might not differ significantly. Presenting this information in the report is beneficial to our business objective.
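
The ranking step described above can be sketched with scikit-learn as follows. This is a simplified sketch for a binary classification task: the candidate pipelines, the function name `rank_pipelines`, and the 75/25 split are assumptions for illustration, not Autoprep's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler


def rank_pipelines(pipelines, X, y, top_k=3, random_state=0):
    """Score each preprocessing pipeline with a Random Forest and return the top_k by AUC."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    scores = []
    for name, prep in pipelines.items():
        # Fit the preprocessing on the training part only, then reuse it on the test part.
        Xt_train = prep.fit_transform(X_train, y_train)
        Xt_test = prep.transform(X_test)
        model = RandomForestClassifier(n_estimators=100, random_state=random_state)
        model.fit(Xt_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(Xt_test)[:, 1])
        scores.append((name, auc))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Hypothetical candidates; Autoprep builds its own from the steps listed above.
candidates = {
    "standard": Pipeline([("scale", StandardScaler())]),
    "minmax": Pipeline([("scale", MinMaxScaler())]),
    "robust": Pipeline([("scale", RobustScaler())]),
    "standard+pca": Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=0.95, svd_solver="full")),
    ]),
}
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
top3 = rank_pipelines(candidates, X, y)
for name, auc in top3:
    print(f"{name}: AUC = {auc:.3f}")
```

The selected pipelines could then be persisted with `joblib.dump(candidates[name], f"{name}.joblib")`. For regression, the same loop would use `RandomForestRegressor` and `mean_squared_error`, ranking in ascending order.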


## Modelling

For classification tasks, there are 5 implemented models:
- K Neighbors Classifier,
- Logistic Regression,
- Gaussian Naive Bayes,
- Support Vector Machine (Classifier),
- Decision Tree Classifier.

For regression tasks, Autoprep has 6 models:
- Linear Support Vector Machine (Regressor),
- K Neighbors Regressor,
- Random Forest Regressor,
- Bayesian Ridge,
- Gradient Boosting Regressor,
- Linear Regression.

We have chosen simple models for two reasons: they consume less time and are easier to explain.

At this stage, Autoprep uses the three best pipelines (see the [Preprocessing](#preprocessing) section), so 3 different datasets are generated. Each model is fit on each of them, always respecting the train-test split. In total, 15 (classification: 5 models × 3 pipelines) or 18 (regression: 6 models × 3 pipelines) models are evaluated to ensure the best performance.

All models' hyperparameters are tuned using Randomized Search CV with 10 iterations (again, because of time). 

Based on the AUC/MSE score on the test datasets, the three best models are selected and presented in the report.
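
A minimal sketch of this tuning and selection loop, using scikit-learn's `RandomizedSearchCV` with `n_iter=10`, is shown below. Only three of the classifiers listed above are included to keep it short, and the search spaces, the 5-fold cross-validation, and the function name `tune_and_rank` are assumptions; the source does not list Autoprep's actual hyperparameter grids.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical search spaces, chosen only for illustration.
SEARCH_SPACES = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": loguniform(1e-3, 1e2)},
    ),
    "k_neighbors": (
        KNeighborsClassifier(),
        {"n_neighbors": randint(1, 30)},
    ),
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": randint(1, 20), "min_samples_leaf": randint(1, 20)},
    ),
}


def tune_and_rank(X_train, y_train, X_test, y_test, scoring="roc_auc", top_k=3):
    """Tune every model with 10 random draws, then rank by score on the test set."""
    scorer = get_scorer(scoring)
    results = []
    for name, (estimator, space) in SEARCH_SPACES.items():
        search = RandomizedSearchCV(
            estimator, space, n_iter=10, scoring=scoring, cv=5, random_state=0
        )
        search.fit(X_train, y_train)
        test_score = scorer(search.best_estimator_, X_test, y_test)
        results.append((name, search.best_params_, test_score))
    results.sort(key=lambda item: item[2], reverse=True)
    return results[:top_k]


X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
ranked = tune_and_rank(X_train, y_train, X_test, y_test)
for name, params, score in ranked:
    print(f"{name}: test AUC = {score:.3f}, params = {params}")
```

For regression, passing `scoring="neg_mean_squared_error"` keeps the "higher is better" ordering that the sort relies on.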

---

## Report 


### Why LaTeX reports?

Using PyLaTeX, we provide users with extensive (approx. 20 pages) LaTeX reports. We have chosen LaTeX, since:
- it is highly customizable,
- it supports complex mathematical notation,
- it provides clear, transparent and adaptive style,
- it allows for high-quality typesetting.


### What is in the report?

The report includes the following sections:
1. **Overview**: Provides a summary of the dataset.
    1. **system information**: results may differ on different soft- and hardware (Table 1),
    2. **dataset overview**: number of samples, number of features (categorical or numerical) (Table 2),
    3. *[classification only]* **target class distribution**: presented in a table,
    4. **missing values**: counts and percentage of missing values for each feature (from here we do not provide exact Table/Plot identifiers, as they differ depending on task type),
    5. **description of all features in the dataset**: types, memory usage,
    6. **description of numerical features in the dataset**: basic statistics like mean, sd, count, min, max, etc.,
    7. **description of categorical features in the dataset**: count, number of unique instances, most frequent.

2. **Exploratory Data Analysis**: Includes visualizations and statistical summaries to understand the data distribution and relationships between features.
    1. **target variable**: barplot (incl. class distribution) for *classification* or histogram (incl. mean and median) for *regression*,
    2. **missing values distribution**: on a barplot,
    3. **distribution of all features**: presented on histograms for numerical and on barplots for categorical features,
    4. **correlation heatmap**: for numerical features,
    5. **boxplots**: for numerical features.

3. **Preprocessing**: Details the preprocessing steps applied to the data, including missing data imputation, encoding, scaling, and feature selection.
    1. **list of preprocessing steps**: all possible steps (required and non-required)
    2. **pipelines**: 16 chosen for examination, with all steps listed,
    3. **best pipelines**: 3 best pipelines with respect to scoring function, with fit time,
    4. **best pipelines' details**: the best pipelines' description and parameters,
    5. **best pipelines' output overview**: lets the user see how the data has changed,
    6. **preprocessing pipelines runtime statistics**: pipelines fit time and scoring statistics.


4. **Modelling**: Describes the models used, their hyperparameters, and the performance metrics.
    1. **examined models list**
    2. **hyperparameter grids**
    3. **best models and pipelines along with their hyperparameters**: (after tuning) information about mean fit time, hyperparameters and test score.
5. **Model Interpretations**: Uses Shapley Plots to explain the predictions of the best models. Waterfall, bar and summary plots are presented (for each class if task is *classification*).

The report is generated automatically and saved as a .pdf file, providing users with a comprehensive overview of the entire machine learning process.
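
To give some intuition for what the Shapley plots in the report convey, the sketch below computes exact Shapley values for a tiny model by averaging each feature's marginal contribution over all feature orderings. It is a brute-force illustration of the concept, not how Autoprep or any plotting library computes the values; the function name, the toy linear model, and the all-zero baseline are assumptions.

```python
from itertools import permutations

import numpy as np


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating all feature orderings (feasible for few features only)."""
    n = len(x)
    phi = np.zeros(n)
    orders = list(permutations(range(n)))
    for order in orders:
        current = baseline.copy()
        previous = predict(current)
        for i in order:
            # Switch feature i from its baseline value to its actual value.
            current[i] = x[i]
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return phi / len(orders)


# Toy linear model: prediction = 2*x0 - 1*x1 + 0.5*x2 + 1
weights = np.array([2.0, -1.0, 0.5])


def predict(v):
    return float(v @ weights) + 1.0


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(predict, x, baseline)
print(phi)
# The contributions add up to the gap between the prediction and the baseline prediction.
print(phi.sum(), predict(x) - predict(baseline))
```

A waterfall plot visualizes exactly this decomposition for a single prediction: it starts at the baseline output and adds one feature contribution at a time.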

---

## Example usage
This section shows how to use Autoprep.

---

```python
import openml
import numpy as np

from auto_prep.utils.config import config
from auto_prep.prep import AutoPrep
```
### Binary classification

Dataset: titanic

```python
config.set(
    raport_name="titanic",
    root_dir="raports"
)
```

```python
data = openml.datasets.get_dataset(40945).get_data()[0]
data["survived"] = data["survived"].astype(np.uint8)

pipeline = AutoPrep()
pipeline.run(data, target_column="survived")
```



---

### Multiclass classification

Dataset:


```python
config.update(
    raport_name="cpu"
)
```

```python
data = openml.datasets.get_dataset(338).get_data(dataset_format="dataframe")[0]

pipeline = AutoPrep()
pipeline.run(data, target_column="GG_new")
```


---

### Regression

Dataset:


```python
config.update(
    raport_name="ailerons"
)
```

```python
data = openml.datasets.get_dataset(540).get_data(dataset_format="dataframe")[0]

pipeline = AutoPrep()
pipeline.run(data, target_column="CL")
```

---

## Conclusion
Autoprep is designed to simplify the machine learning process by automating the preprocessing, modelling, and reporting steps. By focusing on extensive preprocessing and providing detailed LaTeX reports, Autoprep aims to deliver high-quality, explainable machine learning solutions for both novice and experienced users. Note that this is only the beginning of the Autoprep project. In future releases we plan to add extra preprocessing functionality, such as outlier detection and binning.