# Package training jobs 📦


## What you will learn in this course 🧐🧐

Now that you know a little more about MLFlow projects, you might wonder: *why would we want to package ML models?* This is a good question. Packaging models solves several problems, including:

1. No more *"I don't understand, it works on my machine"*: your model runs the same way everywhere
2. You can easily run training jobs on more powerful remote machines, which can reduce training time significantly

In this course, we will especially focus on the second aspect. 

## Remote training 📺


Just as the `MLModel` file standardizes our models, we will create an `MLProject` file, which is yet another `yaml` file that describes the whole configuration of our project. Our working directory will therefore contain the following files:

```shell 
├── MLProject # Configuration of our training job
└── train.py # Our python file containing our training script
```

### Entry Point file 🚪🚶

`train.py` is called the entry point file. The entry point has to be a Python script or an `.sh` file because notebooks are not accepted, so your whole training process needs to live in that file. Its content looks like this:

```python
import argparse
import pandas as pd
import time
import mlflow
from mlflow.models.signature import infer_signature
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import  StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline


if __name__ == "__main__":

    # Set your variables for your environment
    EXPERIMENT_NAME="Default"

    # Set experiment's info 
    mlflow.set_experiment(EXPERIMENT_NAME)
    # Get our experiment info
    experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

    print("training model...")
    
    # Time execution
    start_time = time.time()

    # Call mlflow autolog
    mlflow.sklearn.autolog(log_models=False) # We won't log models right away

    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators")
    parser.add_argument("--min_samples_split")
    args = parser.parse_args()

    # Import dataset
    df = pd.read_csv("https://julie-2-next-resources.s3.eu-west-3.amazonaws.com/full-stack-full-time/linear-regression-ft/californian-housing-market-ft/california_housing_market.csv")

    # X, y split 
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]

    # Train / test split 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

    # Pipeline 
    n_estimators = int(args.n_estimators)
    min_samples_split=int(args.min_samples_split)

    model = Pipeline(steps=[
        ("standard_scaler", StandardScaler()),
        ("Regressor",RandomForestRegressor(n_estimators=n_estimators, min_samples_split=min_samples_split))
    ])

    # Log experiment to MLFlow

    with mlflow.start_run(experiment_id = experiment.experiment_id):
        model.fit(X_train, y_train)
        predictions = model.predict(X_train)

        # Log model separately to have more flexibility on setup 
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="modeling_housing_market",
            registered_model_name="random_forest",
            signature=infer_signature(X_train, predictions)
        )
        
    print("...Done!")
    print(f"---Total training time: {time.time()-start_time}")

```
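
The script above reads its hyperparameters as strings and converts them with `int()` later on. As a minimal sketch of an alternative, `argparse` can do the conversion itself and supply defaults. The `build_parser` helper is a hypothetical name introduced for this illustration, and the default values are simply copied from the `MLProject` file shown later in this course:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: parser for the two hyperparameters of train.py."""
    parser = argparse.ArgumentParser()
    # type=int converts the command-line strings to integers at parse time,
    # so no int(...) call is needed afterwards.
    parser.add_argument("--n_estimators", type=int, default=15)
    parser.add_argument("--min_samples_split", type=int, default=3)
    return parser


if __name__ == "__main__":
    # Parse an explicit list instead of sys.argv so the sketch runs anywhere
    args = build_parser().parse_args(["--n_estimators", "80"])
    print(args.n_estimators, args.min_samples_split)  # prints: 80 3
```

With this approach, a non-numeric value such as `--n_estimators abc` is rejected by `argparse` with a usage message before any training starts.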

### `MLProject`

Now let's configure our `MLProject` file. We'll use Docker to standardize our environment, which we specify by writing:

```yaml
docker_env:
  image: IMAGE_NAME
```

> If you haven't already, you will need to create an image with the required environment. In our case, we will be using `jedha/sample-mlflow-server` that we already created (you can use it as well if you don't want to bother building another image).

Let's write the full file:

```yaml
# Name of the project (and of the new docker image that MLFlow will create when running the file)
name: californian_housing_market 

docker_env:
  # Name of the image we'll be running a container from
  image: sample-mlflow-server
  # Volume binding
  volumes: ["$(pwd):/home/app"]
  # Set environment variables
  environment: [
      "MLFLOW_TRACKING_URI", 
      "AWS_ACCESS_KEY_ID",
      "AWS_SECRET_ACCESS_KEY",
      "BACKEND_STORE_URI",
      "ARTIFACT_ROOT"
    ]
    
entry_points:
  main:
    parameters:
      # Parameter for our training
      n_estimators: {type: int, default: 15} 
      # Parameter for our training
      min_samples_split: {type: int, default: 3} 
    # Command that will be run when running that file 
    command: "python train.py --n_estimators {n_estimators} --min_samples_split {min_samples_split}" 
```


As you can see, in the `entry_points:` section we specified some `parameters` and provided the command `python train.py --n_estimators {n_estimators} --min_samples_split {min_samples_split}`. The placeholders in curly braces are replaced by the parameter values (or their defaults) when we run our training job.
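
To make the parameter substitution more concrete, here is a small sketch that mimics it with Python's `str.format`. This only illustrates the idea and is not MLFlow's internal implementation. `COMMAND_TEMPLATE`, `DEFAULTS` and `render_command` are hypothetical names, with values copied from the `MLProject` file above:

```python
# Illustration only: str.format stands in for MLFlow's own substitution logic.
COMMAND_TEMPLATE = (
    "python train.py --n_estimators {n_estimators} "
    "--min_samples_split {min_samples_split}"
)

# Default values as declared under `parameters` in the MLProject file
DEFAULTS = {"n_estimators": 15, "min_samples_split": 3}


def render_command(overrides=None):
    """Fill the command template, letting overrides replace the defaults."""
    params = {**DEFAULTS, **(overrides or {})}
    return COMMAND_TEMPLATE.format(**params)


if __name__ == "__main__":
    print(render_command({"n_estimators": 80}))
    # prints: python train.py --n_estimators 80 --min_samples_split 3
```

Any parameter you do not override keeps its default, which is why `min_samples_split` stays at `3` here.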

We also declared a list of **environment variables**. We haven't given them any values: when we run the project, MLFlow will look for these variables in our own shell environment and pass them to the Docker container.

### Set environment variables 

Last piece of config: *environment variables*. Remember, environment variables usually hold configuration that needs to stay secret. **You should not share the values of your environment variables with anybody** 🔐

Here, we need to set up some variables. To do so, we will create a separate file that we will call `secrets.sh`. It will contain the following bash commands:

```bash 
export MLFLOW_TRACKING_URI="REPLACE_ME_WITH_YOUR_MLFLOW_TRACKING_URI"
export AWS_ACCESS_KEY_ID="REPLACE_ME_WITH_YOUR_AWS_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="REPLACE_ME_WITH_YOUR_AWS_SECRET_ACCESS_KEY"
export BACKEND_STORE_URI="REPLACE_ME_WITH_YOUR_BACKEND_STORE_URI"
export ARTIFACT_ROOT="REPLACE_ME_WITH_YOUR_ARTIFACT_ROOT"
```

Once you are done, you will need to load these variables into your current shell by running:

* `source secrets.sh`
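
Before launching a job, it can be handy to check that every variable listed in `MLProject` is actually set in your shell. The following is a minimal sketch of such a check. `REQUIRED_VARIABLES` and `missing_variables` are hypothetical names introduced for this example, and the script is not part of MLFlow:

```python
import os

# Same names as in the `environment` list of the MLProject file
REQUIRED_VARIABLES = [
    "MLFLOW_TRACKING_URI",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "BACKEND_STORE_URI",
    "ARTIFACT_ROOT",
]


def missing_variables(environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Run it in the same terminal where you ran `source secrets.sh`, since exported variables only exist in that shell session.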

## Run a training process 🏃‍♀️

Alright, now that we have everything configured, we can launch our training job:

```shell 
$ mlflow run path_to_your_project
```

> NB: if you are already in the project folder, you can simply do:<br />
> ```shell 
> $ mlflow run . 
> ```

To override a parameter's default value, use the `-P` flag:
```shell
$ mlflow run path_to_your_project -P n_estimators=80
```
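
If you want to launch several jobs with different parameters, you could build the same command from Python. The sketch below assumes that `mlflow` is available on your `PATH`. `build_mlflow_run_command` and `run_training` are hypothetical helper names, and the sketch only assembles the `mlflow run ... -P name=value` command shown above:

```python
import subprocess


def build_mlflow_run_command(project_path, parameters=None):
    """Build the `mlflow run` command as a list of arguments."""
    command = ["mlflow", "run", project_path]
    # Each parameter is passed with its own -P name=value flag
    for name, value in (parameters or {}).items():
        command += ["-P", f"{name}={value}"]
    return command


def run_training(project_path, parameters=None, dry_run=True):
    """Print the command and only execute it when dry_run is False."""
    command = build_mlflow_run_command(project_path, parameters)
    print(" ".join(command))
    if not dry_run:
        # Requires mlflow (and Docker, for a docker_env project) to be installed
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    run_training(".", {"n_estimators": 80, "min_samples_split": 5})
    # prints: mlflow run . -P n_estimators=80 -P min_samples_split=5
```

Keeping `dry_run=True` by default lets you check the command before spending time on an actual training run.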

> 👋 Make sure that mlflow is installed on your computer (it might not be the case if you only worked in Docker containers) 👉 `pip install mlflow`

## Resources 📚📚

* [MLFlow Project](https://mlflow.org/docs/latest/projects.html#project-environments)