# Part 2: Training

In this part you learn how to use MLRun's **Feature Store** to define a **Feature Vector** and create the dataset you need for the training process.  
By the end of this tutorial you’ll know how to:
- Combine multiple data sources into a single feature vector
- Create a training dataset
- Create a model using an MLRun hub function

In [1]:
project_name = "fraud-demo"

In [2]:
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)

## Step 1 - Create a feature vector  
In this section you create a feature vector.  
The feature vector has a `name`, so you can reference it later via its URI or from your serving function, and a list of 
`features` from the available feature sets. You can add a single feature by adding `<FeatureSet>.<Feature>` to 
the list, or add `<FeatureSet>.*` to include all of the feature set's available features.  

By default, the first feature set in the feature list acts as the spine, meaning that all the other features are joined to it.  
For example, in this instance the `events` feature set comes first and is therefore the spine, so each event produces a row in the resulting feature vector.
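
The spine behavior described above can be pictured as a left join: every spine row is kept, and feature values from the other feature sets are attached where a match exists. The sketch below is plain Python, not MLRun's join engine. The entity key `source`, the row values, and the helper `join_to_spine` are assumptions made up for illustration, and the sketch ignores timestamps.

```python
# Illustrative only: a plain-Python left join, not MLRun's actual join logic.
# All rows and values below are invented for the example.
events = [  # the spine: one output row per event
    {"source": "C1", "event": "login"},
    {"source": "C2", "event": "details_change"},
    {"source": "C3", "event": "login"},
]
transactions = {  # aggregated features keyed by the (assumed) entity "source"
    "C1": {"amount_max_2h": 120.0},
    "C2": {"amount_max_2h": 35.5},
}


def join_to_spine(spine_rows, features_by_key, key, feature_names):
    """Attach feature values to every spine row; missing values become None."""
    joined = []
    for row in spine_rows:
        values = features_by_key.get(row[key], {})
        joined.append({**row, **{name: values.get(name) for name in feature_names}})
    return joined


vector_rows = join_to_spine(events, transactions, "source", ["amount_max_2h"])
for row in vector_rows:
    print(row)
```

The result has exactly one row per event. The event for `C3` is kept even though it has no matching transaction features.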

In [3]:
# Define the list of features to use
features = [
    "events.*",
    "transactions.amount_max_2h",
    "transactions.amount_sum_2h",
    "transactions.amount_count_2h",
    "transactions.amount_avg_2h",
    "transactions.amount_max_12h",
    "transactions.amount_sum_12h",
    "transactions.amount_count_12h",
    "transactions.amount_avg_12h",
    "transactions.amount_max_24h",
    "transactions.amount_sum_24h",
    "transactions.amount_count_24h",
    "transactions.amount_avg_24h",
    "transactions.es_transportation_sum_14d",
    "transactions.es_health_sum_14d",
    "transactions.es_otherservices_sum_14d",
    "transactions.es_food_sum_14d",
    "transactions.es_hotelservices_sum_14d",
    "transactions.es_barsandrestaurants_sum_14d",
    "transactions.es_tech_sum_14d",
    "transactions.es_sportsandtoys_sum_14d",
    "transactions.es_wellnessandbeauty_sum_14d",
    "transactions.es_hyper_sum_14d",
    "transactions.es_fashion_sum_14d",
    "transactions.es_home_sum_14d",
    "transactions.es_travel_sum_14d",
    "transactions.es_leisure_sum_14d",
    "transactions.gender_F",
    "transactions.gender_M",
    "transactions.step",
    "transactions.amount",
    "transactions.timestamp_hour",
    "transactions.timestamp_day_of_week",
]
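
The twelve `transactions.amount_*` entries in the list above follow a regular `<aggregation>_<window>` pattern, so they can also be generated rather than typed by hand. This is only a convenience sketch. The names `windows`, `aggregations`, and `amount_features` are illustrative and not part of the tutorial.

```python
# Build the rolling-window aggregate names from their two components.
windows = ["2h", "12h", "24h"]
aggregations = ["max", "sum", "count", "avg"]

amount_features = [
    f"transactions.amount_{agg}_{window}"
    for window in windows
    for agg in aggregations
]

print(len(amount_features))
print(amount_features[0], amount_features[-1])
```

The generated names appear in the same order as in the hand-written list: all aggregations for `2h`, then `12h`, then `24h`.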

In [4]:
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the feature vector name for future reference
fv_name = "transactions-fraud"

# Define the feature vector using the feature store (fstore)
transactions_fv = fstore.FeatureVector(
    fv_name,
    features,
    label_feature="labels.label",
    description="Predicting a fraudulent transaction",
)

# Save the feature vector in the feature store
transactions_fv.save()

## Step 2 - Preview the feature vector data

Obtain the values of the features in the feature vector, to ensure the data appears as expected.

In [5]:
# Import the Parquet Target so you can directly save your dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())

In [6]:
# Preview your dataset
train_dataset.to_dataframe().tail(5)

## Step 3 - Train models and choose the highest accuracy

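With MLRun, you can easily train different models and compare the results. In the code below, you train three different models.
Each one uses a different algorithm (random forest, gradient boosting, and AdaBoost), and you choose the model with the highest accuracy. Note that the model named `transaction_fraud_xgboost` uses scikit-learn's `GradientBoostingClassifier`, not the XGBoost library.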

In [7]:
# Import the Sklearn classifier function from the functions hub
classifier_fn = mlrun.import_function("hub://auto_trainer")

In [8]:
# Prepare the parameter lists for the training function.
# Three models are trained; the lists are paired by index (name i goes with class i)
training_params = {
    "model_name": [
        "transaction_fraud_rf",
        "transaction_fraud_xgboost",
        "transaction_fraud_adaboost",
    ],
    "model_class": [
        "sklearn.ensemble.RandomForestClassifier",
        "sklearn.ensemble.GradientBoostingClassifier",
        "sklearn.ensemble.AdaBoostClassifier",
    ],
}

# Define the training task, including your feature vector, label and hyperparams definitions
train_task = mlrun.new_task(
    "training",
    inputs={"dataset": transactions_fv.uri},
    params={"label_columns": "label"},
)

train_task.with_hyper_params(training_params, strategy="list", selector="max.accuracy")

# Specify your cluster image
classifier_fn.spec.image = "mlrun/mlrun"

# Run training
classifier_fn.run(train_task, local=False)
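
To make the `strategy="list"` and `selector="max.accuracy"` arguments above more concrete, here is a minimal plain-Python sketch of the idea: the parameter lists are paired by index into one run each, and the selector picks the run with the best value of the named result. The helpers `expand_list_strategy` and `select_run` are hypothetical and are not MLRun functions. The accuracy values are invented.

```python
training_params = {
    "model_name": [
        "transaction_fraud_rf",
        "transaction_fraud_xgboost",
        "transaction_fraud_adaboost",
    ],
    "model_class": [
        "sklearn.ensemble.RandomForestClassifier",
        "sklearn.ensemble.GradientBoostingClassifier",
        "sklearn.ensemble.AdaBoostClassifier",
    ],
}


def expand_list_strategy(hyper_params):
    """Pair the i-th value of every parameter into the i-th run (no cross product)."""
    names = list(hyper_params)
    return [dict(zip(names, values)) for values in zip(*hyper_params.values())]


def select_run(results, selector):
    """Return the index of the best run for a '<max|min>.<result key>' selector."""
    op, key = selector.split(".", 1)
    choose = max if op == "max" else min
    return choose(range(len(results)), key=lambda i: results[i][key])


runs = expand_list_strategy(training_params)

# Hypothetical accuracies, made up purely to demonstrate the selection step
fake_results = [{"accuracy": 0.91}, {"accuracy": 0.94}, {"accuracy": 0.89}]
best = runs[select_run(fake_results, "max.accuracy")]
print(len(runs), best["model_name"])
```

With a list strategy, three names and three classes give three runs, not nine, which is why each model name lines up with exactly one model class.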

## Step 4 - Perform feature selection

As part of the data science process, try to reduce the training dataset's size by removing bad or unhelpful features, which also saves computation time.

Use the ready-made feature selection function from MLRun's [`hub://feature_selection`](https://github.com/mlrun/functions/blob/development/feature_selection/feature_selection.ipynb) to select the best features to keep, based on a sample from your dataset.


In [9]:
feature_selection_fn = mlrun.import_function("hub://feature_selection")

feature_selection_run = feature_selection_fn.run(
    params={
        "k": 18,
        "min_votes": 2,
        "label_column": "label",
        "output_vector_name": fv_name + "-short",
        "ignore_type_errors": True,
    },
    inputs={"df_artifact": transactions_fv.uri},
    name="feature_extraction",
    handler="feature_selection",
    local=False,
)

In [10]:
mlrun.get_dataitem(feature_selection_run.outputs["top_features_vector"]).as_df().tail(5)
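
The `k` and `min_votes` parameters above suggest a voting scheme: several selectors each nominate their top `k` features, and a feature is kept if enough selectors nominate it. The sketch below illustrates that idea in plain Python under this assumption. It is not the hub function's implementation. The selector names, scores, and helper functions are all hypothetical.

```python
from collections import Counter


def top_k(scores, k):
    """Return the k feature names with the highest score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def vote_features(scores_by_selector, k, min_votes):
    """Keep features ranked in the top k by at least min_votes selectors."""
    votes = Counter()
    for scores in scores_by_selector.values():
        votes.update(top_k(scores, k))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical scores from three hypothetical selectors
scores_by_selector = {
    "selector_a": {"amount": 0.9, "step": 0.1, "gender_F": 0.4, "amount_max_2h": 0.8},
    "selector_b": {"amount": 0.7, "step": 0.6, "gender_F": 0.2, "amount_max_2h": 0.9},
    "selector_c": {"amount": 0.8, "step": 0.3, "gender_F": 0.5, "amount_max_2h": 0.1},
}

selected = vote_features(scores_by_selector, k=2, min_votes=2)
print(selected)
```

In this toy data, `amount` gets three votes and `amount_max_2h` gets two, so both pass `min_votes=2`. `gender_F` gets one vote and is dropped.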

## Step 5 - Train your models with top features

Following the feature selection, you train new models using the resulting features. You can observe that the accuracy 
and other results remain high,
meaning you get a model that needs fewer features to be accurate and is therefore less error-prone.

In [11]:
# Define your training task, including your feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task(
    "training",
    inputs={"dataset": feature_selection_run.outputs["top_features_vector"]},
    params={"label_columns": "label"},
)
ensemble_train_task.with_hyper_params(
    training_params, strategy="list", selector="max.accuracy"
)

classifier_fn.run(ensemble_train_task)

## Done!

You've completed Part 2: training models with the feature store.
Proceed to [Part 3](03-deploy-serving-model.html) to learn how to deploy and monitor the model.