# 0 - Create Model

In this tutorial we will create a model with the following MLflow setup:
- Tracking server: remote server (EC2)
- Backend store: postgresql database
- Artifacts store: s3 bucket

[Tutorial Link](https://www.youtube.com/watch?v=1ykg4YmbFVA&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK&index=16)

## 1. Create Cloud Infrastructure

### 1.1. Create an `IAM` user in AWS

- IAM (Identity and Access Management) provides fine-grained access control across all of AWS. You can specify who can access which services and resources, and under which conditions. With IAM policies, you manage the permissions of your workforce and systems to enforce least-privilege access.

- [Tutorial](https://mlbookcamp.com/article/aws) on how to create one


### 1.2. EC2 Instance

- [Tutorial](https://www.youtube.com/watch?v=1ykg4YmbFVA&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK&index=16)
- Amazon Elastic Compute Cloud (EC2) - virtual server for running applications on the AWS infrastructure - [further explanation](https://www.techtarget.com/searchaws/definition/Amazon-EC2-instances)

Configurations:
- Name: `mlflow-track-server`;
- OS image: `Amazon Linux 2 - Kernel 5.10 SSD` (free tier);
- Architecture: `64-bit`;
- Instance type: `t2.micro` (free tier);
- Create a key pair (required to connect via SSH):
    - Name: `mlflow-key-pair`
    - Save the `.pem` file
- Leave everything else at the default values;
- Launch the instance

Security Configurations:
- Go to security groups of the instance created;
- Security group: `launch-wizard-1` (Important to select the same for the 
- Add a new inbound rule to allow HTTP connections to the tracking server;
- Type: Custom TCP, port: `5000`, source: Anywhere or My IP
- Save the rules
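
The same inbound rule can also be added from Python. This is only a minimal sketch, assuming `boto3` is installed and credentials are configured. The function names, the group ID and the example address are hypothetical and not part of the original tutorial.

```python
def build_mlflow_ingress_rule(cidr: str, port: int = 5000) -> dict:
    """Return one IpPermissions entry allowing TCP traffic on `port` from `cidr`."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": "MLflow tracking server"}],
    }


def open_mlflow_port(ec2_client, group_id: str, cidr: str, port: int = 5000):
    """Add the rule to a security group.

    `ec2_client` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    return ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[build_mlflow_ingress_rule(cidr, port)],
    )


# "Anywhere" in the console corresponds to 0.0.0.0/0; a single address ("My IP")
# is written as ADDRESS/32. 203.0.113.10 is a documentation-only example address.
rule = build_mlflow_ingress_rule("203.0.113.10/32")
print(rule)
```

Restricting the source to your own IP is the safer choice, since the tracking server in this setup has no authentication.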

### 1.3. Create an S3 bucket - location `Ireland`

- Bucket name: `mlflow-artifacts-rmt` (remote);
- AWS region: `eu-west-1`;
- Everything else as default;
- Create bucket

**Note:** Check how much AWS charges for the bucket before storing data in it
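
For reference, the bucket could also be created with `boto3`. The sketch below assumes an S3 client created with `boto3.client("s3", region_name="eu-west-1")`; the helper names are hypothetical. Outside `us-east-1`, the region is passed as a `LocationConstraint`.

```python
def bucket_request(bucket_name: str, region: str) -> dict:
    """Build the keyword arguments for boto3's S3 `create_bucket` call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_artifact_bucket(s3_client, bucket_name: str, region: str):
    """`s3_client` is assumed to be a boto3 S3 client for the same region."""
    return s3_client.create_bucket(**bucket_request(bucket_name, region))


print(bucket_request("mlflow-artifacts-rmt", "eu-west-1"))
```

Bucket names are global across AWS, so the name from this tutorial may already be taken and need a suffix.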

### 1.4. Create a PostgreSQL DB - Location: `Ireland`

- RDS service;
- Create a Database (not an RDS Multi-AZ deployment);
- Standard create;
- Engine options: `PostgreSQL (13.4-R1)`
- Templates: Free tier
- DB instance identifier: `mlflow-db`
- Master username: `mlflow_user`
- Autogenerate a password;
- Instance configuration: `db.t3.micro`
- Port: `5432`


Additional Configurations:
- Initial db name: `mlflow_initial_db`;
- Create Database

Important details:
- Save the password (only time you will see it)
- Save the endpoint and port
- Change the security group:
    - Edit inbound rules
    - Add a new rule to allow the `EC2 instance` to connect to the database;
    - Type: PostgreSQL, source: the same security group as the `EC2 instance`

**Note:** Keep public access disabled, because only the `EC2 instance` should reach this database

### 1.5. Connect to `EC2 instance` - `Ireland`
- Connecting opens a command line on the machine
- Update the machine: `sudo yum update`
- Install the required packages: `pip3 install mlflow boto3 psycopg2-binary` (`mlflow` runs the tracking server, `boto3` accesses AWS and the artifact store, and `psycopg2-binary` connects to PostgreSQL)
- Configure the AWS account with `aws configure`:
    - Use your own credentials; `region name` and `output format` can be left unchanged
- List the available buckets with `aws s3 ls`; the `s3 bucket` created in step 1.3 should appear
- Run the server:
    - `mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWORD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME`
- To access the tracking server, copy the `public IPv4 DNS` of the `EC2 instance` and append the port, which is `5000` in our case (`http://PUBLIC_IPV4_DNS:5000`).
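
An autogenerated database password can contain characters such as `@`, `:` or `/`, which break the backend store URI if pasted in as-is. The sketch below percent-encodes the user and password before assembling the `mlflow server` command. The helper names, the endpoint and the password are hypothetical placeholders.

```python
import shlex
from urllib.parse import quote


def backend_store_uri(user, password, endpoint, db_name, port=5432):
    """Assemble the PostgreSQL URI, percent-encoding the user and password."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{endpoint}:{port}/{db_name}"
    )


def mlflow_server_command(uri, bucket, host="0.0.0.0", port=5000):
    """Return the server command from this section as an argument list."""
    return [
        "mlflow", "server",
        "-h", host,
        "-p", str(port),
        "--backend-store-uri", uri,
        "--default-artifact-root", f"s3://{bucket}",
    ]


uri = backend_store_uri(
    user="mlflow_user",
    password="p@ss:w/rd",          # placeholder, not a real password
    endpoint="db.example.com",     # placeholder for the RDS endpoint
    db_name="mlflow_initial_db",
)
# shlex.join quotes each argument so the line can be pasted into a shell
print(shlex.join(mlflow_server_command(uri, "mlflow-artifacts-rmt")))
```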

### 1.6 Create an AWS configuration file on your local computer

- Follow [this tutorial](https://docs.aws.amazon.com/rekognition/latest/dg/setup-awscli-sdk.html), which also explains how to create the files
- Create a named profile; the code below uses `ml-in-action`
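
Before pointing `AWS_PROFILE` at a profile, it can help to confirm that the profile is actually defined. A minimal sketch, assuming the profile lives in the shared credentials file at `~/.aws/credentials`; `profile_exists` is a hypothetical helper.

```python
import configparser
from pathlib import Path


def profile_exists(profile: str, credentials_path=None) -> bool:
    """Return True if `profile` is a section in the AWS credentials file."""
    if credentials_path is None:
        credentials_path = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    # read() silently skips a file that does not exist
    parser.read(credentials_path)
    return parser.has_section(profile)
```

Calling `profile_exists("ml-in-action")` should return `True` once the profile has been created. Profiles defined only in `~/.aws/config` use section names of the form `[profile NAME]` and are not covered by this check.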

## 2. Access MLflow

In [1]:
import os
import mlflow


# Configuration variables
os.environ["AWS_PROFILE"] = "ml-in-action" # points to the profile created on your local computer

TRACKING_SERVER_HOST = "ec2-18-212-215-30.compute-1.amazonaws.com" # changes every time the instance is stopped and restarted
mlflow.set_tracking_uri(f"http://{TRACKING_SERVER_HOST}:5000")
print(f"tracking URL: {mlflow.get_tracking_uri()}")

tracking URL: http://ec2-18-212-215-30.compute-1.amazonaws.com:5000


## 3. Train the model and track it with mlflow

- If optimization is needed, see this [code](https://github.com/FDelca/mlops_datatalks_notes/blob/main/2-ExperimentTracking/Week2-LearningExercises.ipynb)

In [5]:
mlflow.list_experiments()

[<Experiment: artifact_location='s3://mlflow-artifacts-rmt1/0', experiment_id='0', lifecycle_stage='active', name='Default', tags={}>]
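
Newer MLflow releases (2.x) no longer provide `mlflow.list_experiments()` and offer `mlflow.search_experiments()` instead. The sketch below picks whichever function the installed version has. `get_experiments` is a hypothetical helper; it takes the module as an argument so the snippet itself does not depend on MLflow.

```python
def get_experiments(mlflow_module):
    """Return all experiments using the API the installed MLflow offers."""
    search = getattr(mlflow_module, "search_experiments", None)
    if search is not None:
        return search()
    # Fall back to the older call used in this notebook
    return mlflow_module.list_experiments()
```

In the notebook this would be called as `get_experiments(mlflow)` after the tracking URI has been set.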

### 3.1 Data Prerequisites

- [data source](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

In [6]:
# Create a data directory if it does not exist
path = 'data'

os.makedirs(path, exist_ok=True)

for data_ in ['2021-01', '2021-02']:
    data_path = f"{path}/green_tripdata_{data_}.parquet"
    if not os.path.exists(data_path):
        url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_{data_}.parquet"
        # os.system does not raise on failure; it returns the command's exit status
        exit_status = os.system(f"wget {url} -P {path}")
        if exit_status != 0:
            print(f"{url} not available")
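
If `wget` is not installed, the same download can be done with the standard library alone. A minimal sketch under that assumption; `parquet_url` and `download_month` are hypothetical helper names, and the URL pattern is the one used in the cell above.

```python
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlretrieve

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def parquet_url(month: str) -> str:
    """Build the download URL for one month of green taxi data."""
    return f"{BASE_URL}/green_tripdata_{month}.parquet"


def download_month(month: str, data_dir: str = "data"):
    """Download one month unless the file already exists; return its path or None."""
    target = Path(data_dir) / f"green_tripdata_{month}.parquet"
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        urlretrieve(parquet_url(month), target)
    except URLError as error:  # HTTPError is a subclass of URLError
        print(f"{parquet_url(month)} not available: {error}")
        return None
    return target
```

Usage would be `for month in ['2021-01', '2021-02']: download_month(month)`.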

### 3.2 Train model

In [10]:
import pickle

import pandas as pd

from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

from sklearn.pipeline import make_pipeline

In [11]:
class TrainModel():
    
    def data_treatment(self, filename: str) -> pd.DataFrame:
        """Load one month of trips and keep rides lasting 1 to 60 minutes."""
        data = pd.read_parquet(filename)
        data['duration'] = data.lpep_dropoff_datetime - data.lpep_pickup_datetime
        data.duration = data.duration.dt.total_seconds() / 60
        # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
        data = data[(data.duration >= 1) & (data.duration <= 60)].copy()

        categorical = ['PULocationID', 'DOLocationID']
        data[categorical] = data[categorical].astype(str)
        return data

    def prepare_features(self, data: pd.DataFrame) -> list:
        """Turn each trip into a feature dict for DictVectorizer."""
        # Combine pickup and dropoff zones into a single categorical feature
        data['PU_DO'] = data['PULocationID'] + '_' + data['DOLocationID']
        categorical = ['PU_DO']
        numerical = ['trip_distance']
        dicts = data[categorical + numerical].to_dict(orient='records')
        return dicts
    
    def fit(self, dict_train, y_train, dict_val, y_val):
        
        with mlflow.start_run():
            params = dict(max_depth=20, n_estimators=100, min_samples_leaf=10, random_state=0)
            mlflow.log_params(params)
            
            pipeline = make_pipeline(
                DictVectorizer(),
                RandomForestRegressor(**params, n_jobs=-1)
            )
            
            pipeline.fit(dict_train, y_train)
            
            y_pred_train = pipeline.predict(dict_train)
            y_pred_val = pipeline.predict(dict_val)
            
            # scikit-learn's signature is (y_true, y_pred); squared=False returns the RMSE
            rmse_train = mean_squared_error(y_train, y_pred_train, squared=False)
            rmse_val = mean_squared_error(y_val, y_pred_val, squared=False)
            
            print(f"Params: {params}")
            print(f"rmse_train: {round(rmse_train, 2)}, rmse_val: {round(rmse_val, 2)}")
            
            mlflow.log_metric('rmse_train', rmse_train)
            mlflow.log_metric('rmse_val', rmse_val)
            
            mlflow.sklearn.log_model(pipeline, artifact_path="model")
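
The metric logged above is the root mean squared error, which is what `squared=False` asks `mean_squared_error` for. Recent scikit-learn releases deprecate that argument in favour of `root_mean_squared_error`, so the call may need adjusting depending on the installed version. The plain-Python sketch below shows the computation on made-up durations (in minutes); the numbers are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are -2, 2 and -3 minutes, so the mean squared error is 17 / 3
print(round(rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]), 2))  # 2.38
```

Because the errors are squared, swapping the two arguments gives the same value, which is why the argument order does not change the reported metric.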

### 3.3 Run to Train

In [9]:
model_train = TrainModel()

df_train = model_train.data_treatment('data/green_tripdata_2021-01.parquet')
df_val = model_train.data_treatment('data/green_tripdata_2021-02.parquet')

target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

dict_train = model_train.prepare_features(df_train)
dict_val = model_train.prepare_features(df_val)

model_train.fit(
    dict_train=dict_train, 
    y_train=y_train, 
    dict_val=dict_val, 
    y_val=y_val)

Params: {'max_depth': 20, 'n_estimators': 100, 'min_samples_leaf': 10, 'random_state': 0}
rmse_train: 5.75, rmse_val: 6.76




## 4. Load model from the `s3 bucket` and validate

In [5]:
import os
import mlflow

# Configuration variables
os.environ["AWS_PROFILE"] = "ml-in-action"


def load_model(run_id):
    # Artifacts are stored under BUCKET/EXPERIMENT_ID/RUN_ID/artifacts/ARTIFACT_PATH
    logged_model = f"s3://mlflow-artifacts-rmt/0/{run_id}/artifacts/model"
    model = mlflow.pyfunc.load_model(logged_model)
    return model

In [6]:
RUN_ID = '8b4afe073de2423cad4b858170ac574f'
model = load_model(run_id=RUN_ID)

In [7]:
model

mlflow.pyfunc.loaded_model:
  artifact_path: model
  flavor: mlflow.sklearn
  run_id: 8b4afe073de2423cad4b858170ac574f
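
MLflow accepts two URI forms for the same logged model. The sketch below builds both; the helper names are hypothetical. The `s3://` form reads straight from the bucket, so it still works when the tracking server is stopped. The `runs:/` form is resolved through the tracking server, which therefore has to be reachable.

```python
def s3_model_uri(bucket, run_id, experiment_id="0", artifact_path="model"):
    """Direct path to the model inside the artifact bucket."""
    return f"s3://{bucket}/{experiment_id}/{run_id}/artifacts/{artifact_path}"


def runs_model_uri(run_id, artifact_path="model"):
    """Run-relative URI that MLflow resolves via the tracking server."""
    return f"runs:/{run_id}/{artifact_path}"


run_id = "8b4afe073de2423cad4b858170ac574f"
print(s3_model_uri("mlflow-artifacts-rmt", run_id))
print(runs_model_uri(run_id))
```

Either string can be passed to `mlflow.pyfunc.load_model`.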

In [12]:
model_train = TrainModel()
df_val = model_train.data_treatment('data/green_tripdata_2021-02.parquet')

target = 'duration'
y_val = df_val[target].values
dict_val = model_train.prepare_features(df_val)

In [13]:
y_pred_val = model.predict(dict_val)
rmse_val = mean_squared_error(y_val, y_pred_val, squared=False)
print(f"rmse_val: {round(rmse_val, 2)}")

rmse_val: 6.76


## Cost Savings

After finishing this tutorial, if you only want to keep the model, you can stop every resource except the `s3 bucket`, which is what actually stores the model artifacts.
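
Stopping the two compute resources could be scripted as below. This is only a sketch, assuming `boto3` clients created with `boto3.client("ec2")` and `boto3.client("rds")`. The function name is hypothetical, and the EC2 instance ID has to be looked up in the console.

```python
def stop_tracking_resources(ec2_client, rds_client, instance_id, db_identifier):
    """Stop the EC2 tracking server and the RDS backend store.

    The S3 bucket is left untouched, so the logged models stay available.
    """
    ec2_client.stop_instances(InstanceIds=[instance_id])
    rds_client.stop_db_instance(DBInstanceIdentifier=db_identifier)
```

For this tutorial the database identifier would be `mlflow-db`. Stopped resources can still incur storage charges, and AWS restarts a stopped RDS instance automatically after seven days, so delete the database if it is no longer needed.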