# AutoML MLJar
The mljar-supervised library is an automated machine learning (AutoML) tool for tabular datasets in Python. It streamlines a data scientist's workflow by automating data preprocessing, model construction, and hyperparameter tuning to find the best-performing model. It is not a black box: it shows how each ML pipeline is built and produces a detailed Markdown report for every model it creates.

## Setup

In [None]:
import sys
import os

# Get the current working directory
current_working_directory = os.getcwd()

# Go up one level from the current working directory
parent_directory = os.path.join(current_working_directory, '..')

# Add the parent directory to sys.path
sys.path.append(parent_directory)

os.getcwd()

In [None]:
%pip install mljar-supervised
%pip install scikit-learn
%pip install pandas

In [None]:
%load_ext autoreload

In [None]:
%autoreload 

# Import the necessary libraries
%matplotlib inline
import warnings
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report 
from supervised.automl import AutoML

pd.set_option('display.max_columns', 200)
warnings.filterwarnings('ignore')

from src.ml_service import prepare_data, prepare_test_data, save_predictions

## Load data

In [None]:
x_train, _, x_test, y_train, _, y_test = prepare_data(validation_size=0.0, test_size=0.1)

## Train model

**Evaluation metrics:**
- for binary classification: `logloss`, `auc`, `f1`, `average_precision`, `accuracy` - default is `logloss` (if `eval_metric` is left as `"auto"`)
- for multiclass classification: `logloss`, `f1`, `accuracy` - default is `logloss` (if `eval_metric` is left as `"auto"`)
- for regression: `rmse`, `mse`, `mae`, `r2`, `mape`, `spearman`, `pearson` - default is `rmse` (if `eval_metric` is left as `"auto"`)
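
To illustrate why a probability-based metric such as `logloss` is a sensible default for classification, the sketch below compares it with accuracy on two sets of predictions that get every label right but differ in confidence. The helper functions `log_loss` and `accuracy` are written here for illustration only and are not MLJAR internals.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
cautious = [0.6, 0.4, 0.6, 0.4]   # correct side of 0.5, low confidence
confident = [0.9, 0.1, 0.9, 0.1]  # correct side of 0.5, high confidence

acc_cautious = accuracy(y_true, cautious)
acc_confident = accuracy(y_true, confident)
ll_cautious = log_loss(y_true, cautious)
ll_confident = log_loss(y_true, confident)

print(f"accuracy: {acc_cautious:.2f} vs {acc_confident:.2f}")
print(f"logloss:  {ll_cautious:.4f} vs {ll_confident:.4f}")
```

Accuracy cannot tell the two models apart, while logloss rewards the better-calibrated, more confident one. A different metric can be requested through the `eval_metric` argument of `AutoML`.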

**Explain level:**
Specifies how much interpretability detail is generated for the trained models, ranging from 0 (minimal) to 2 (extensive), so users can balance simplicity against depth of insight into how the models make their decisions.

**Golden features:**
Activates the creation of new features from existing ones by exploring their pairwise interactions, which can uncover patterns that improve model accuracy.
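
The idea of building features from interactions can be sketched in a few lines. This is a simplified illustration, not MLJAR's implementation: the helper `candidate_pair_features`, the column names, and the choice of difference and ratio as operations are all assumptions made for the example. A real search would also score each candidate and keep only the most predictive ones.

```python
from itertools import combinations

def candidate_pair_features(row):
    """Hypothetical helper: difference and ratio for every pair of columns in one row."""
    new = {}
    for a, b in combinations(sorted(row), 2):
        new[f"{a}_minus_{b}"] = row[a] - row[b]
        if row[b] != 0:  # skip ratios that would divide by zero
            new[f"{a}_div_{b}"] = row[a] / row[b]
    return new

# Made-up example row with three numeric columns
row = {"height": 180.0, "weight": 75.0, "age": 30.0}
features = candidate_pair_features(row)

for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Three columns yield three pairs and therefore six candidate features here; the number of pairs grows quadratically with the number of columns, which is why candidates need to be filtered.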

**n_jobs:**
Determines the number of CPU cores used for parallel processing, with -1 utilizing all available cores to speed up the training process.
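
As a rough sketch of what the `-1` convention means, the hypothetical helper below maps an `n_jobs` value to a worker count using only the standard library. It is a deliberately simplified assumption for illustration (for example, it rejects other negative values and caps the result at the number of available cores) and does not reproduce how MLJAR or its underlying libraries resolve the setting.

```python
import os

def resolve_n_jobs(n_jobs):
    """Hypothetical helper: translate an n_jobs setting into a worker count."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if n_jobs == -1:
        return available  # -1 means "use every available core"
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return min(n_jobs, available)  # assumption: never exceed available cores

workers_all = resolve_n_jobs(-1)
workers_one = resolve_n_jobs(1)
print(f"n_jobs=-1 -> {workers_all} worker(s), n_jobs=1 -> {workers_one} worker(s)")
```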

**stack_models:**
Enables stacking: the predictions of several models are used as inputs to a final model, which can improve overall accuracy by combining the strengths of the individual models.
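
The stacking idea can be tried in isolation with scikit-learn's `StackingClassifier`. This is a stand-in for illustration, not the mechanism MLJAR uses internally; the synthetic dataset, the choice of base models, and the variable names are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Base models produce out-of-fold predictions (cv=5) that become
# the input features of the final logistic regression.
stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
print(f"Stacked test accuracy: {stack_accuracy:.3f}")
```

Training the final model on out-of-fold predictions rather than in-sample ones keeps it from simply trusting whichever base model overfits the most.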

In [None]:
# Initialize MLJAR AutoML
time_limit = 4 * 60  # in seconds (4 minutes); use 24 * 60 * 60 for a full-day run
predictor = AutoML(mode="Explain", 
    random_state=42,
    total_time_limit=time_limit,
    n_jobs=-1, 
    golden_features=True,
    features_selection=True,
    stack_models=True,
    explain_level=2,
    )

# Train the model
predictor.fit(x_train, y_train)


## Make predictions

In [None]:
# Evaluate on the test set
y_test_pred = predictor.predict(x_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy: ", test_accuracy)
print("Test Classification Report:\n", classification_report(y_test, y_test_pred))
# MLJAR also provides a leaderboard with model performance
predictor.report()

## Save predictions

In [None]:
x_test = prepare_test_data()
final_predictions = pd.DataFrame(predictor.predict(x_test))

save_predictions(final_predictions, 'mljar_automl')