# MLflow on GCP for Experiment Tracking
Source: https://kargarisaac.github.io/blog/mlops/data%20engineering/2022/06/15/MLFlow-on-GCP.html

- MLflow setup:
    - Tracking server: Virtual Machine in GCP
    - Backend store: PostgreSQL database
    - Artifacts store: Google Cloud Storage Bucket

## Steps
### Virtual Machine as The Tracking Server

#### Create a firewall rule
```gcloud_command
    gcloud compute firewall-rules create mlflow-tracking-server \
        --network default \
        --priority 1000 \
        --direction ingress \
        --action allow \
        --target-tags mlflow-tracking-server-tag \
        --source-ranges 0.0.0.0/0 \
        --rules tcp:5000 \
        --enable-logging
```
##### Understanding the code:
1. `gcloud compute firewall-rules create mlflow-tracking-server`  
    - **Purpose:** Creates a new firewall rule in GCP.
    - **mlflow-tracking-server:** The name of the firewall rule. You can choose any name, but it should be descriptive.

2. `--network default`  
    - **Purpose:** Specifies the network to which this firewall rule applies.
    - **default:** Refers to the default VPC (Virtual Private Cloud) network in your GCP project. If you're using a custom network, replace default with the name of your network.

3. `--priority 1000`  
    - **Purpose:** Sets the priority of the firewall rule.
    - **1000:** The priority value. Lower numbers indicate higher priority. GCP evaluates firewall rules in order of priority, so a rule with priority 1000 will be evaluated before a rule with priority 2000.

    Note: If multiple rules match the same traffic, the rule with the lowest priority number takes precedence.

4. `--direction ingress`  
    - **Purpose:** Specifies the direction of traffic to which the rule applies.
    - **ingress:** This rule applies to incoming traffic (traffic coming into your VM). For outgoing traffic, you would use egress.

5. `--action allow`  
    - **Purpose:** Specifies the action to take when traffic matches this rule.
    - **allow:** Allows traffic that matches the rule. You could also use deny to block traffic.

6. `--target-tags mlflow-tracking-server-tag`  
    - **Purpose:** Applies the firewall rule to specific VM instances.
    - **mlflow-tracking-server-tag:** A network tag that you assign to the VM(s) running the MLflow Tracking Server. Only VMs with this tag will be affected by this firewall rule.  
    **How to Assign Tags:**
    When creating a VM, you can add the tag **mlflow-tracking-server-tag** in the Networking section.
    For an existing VM, go to the VM's details page, click Edit, and add the tag under Network tags.

7. `--source-ranges 0.0.0.0/0`  
    - **Purpose:** Specifies the source IP ranges for the traffic.
    - **0.0.0.0/0:** Allows traffic from any IP address (i.e., the entire internet). This is useful for public-facing services but can be a security risk. For better security, restrict this to specific IP ranges (e.g., your office IP or VPN IP).

8. `--rules tcp:5000`  
    - **Purpose:** Specifies the protocol and port to which the rule applies.
    - **tcp:5000:** Allows TCP traffic on port 5000. This is the default port for the MLflow Tracking Server.

9. `--enable-logging`  
    - **Purpose:** Enables logging for this firewall rule.
    - **What It Does:** 
        - Logs all traffic that matches this rule to Cloud Logging.
        - Useful for monitoring and debugging traffic to your MLflow Tracking Server.

    - **Where to View Logs:**
    Go to Logging > Logs Explorer in the Google Cloud Console and filter by the firewall rule name (mlflow-tracking-server).

    **Note:** Enabling logging generates additional log data, which can increase Cloud Logging costs.

##### What Does This Firewall Rule Do?
- Allows incoming TCP traffic on port 5000 to any VM in the default network that has the tag mlflow-tracking-server-tag.
- Source: Any IP address (0.0.0.0/0).
- Action: Allows the traffic.
- Logging: Logs all allowed traffic for monitoring.

##### Example Use Case
If you have a VM running the MLflow Tracking Server:
- Assign the tag mlflow-tracking-server-tag to the VM.
- This firewall rule will allow external users to access the MLflow UI and API at http://<VM_PUBLIC_IP>:5000.
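
For a VM that already exists, the tag can also be attached from the command line rather than the console. The following is a minimal sketch, assuming the VM name and zone used elsewhere in this guide; the `run` helper and the `tag_tracking_server` function are hypothetical names introduced only for illustration. By default the helper only prints the command; set `DRY_RUN=0` to execute it.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi; }

# Attach the network tag targeted by the firewall rule to an existing VM.
# $1 = VM name, $2 = zone of the VM
tag_tracking_server() {
  run gcloud compute instances add-tags "$1" \
    --zone="$2" \
    --tags=mlflow-tracking-server-tag
}

tag_tracking_server mlflow-tracking-server us-central1-a
```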

##### Security Considerations
- **Restrict Source IPs:** Instead of 0.0.0.0/0, specify a smaller range of trusted IPs (e.g., your office IP or VPN IP).
- **Use HTTPS:** If exposing the server to the internet, use a reverse proxy (e.g., Nginx) with SSL/TLS to encrypt traffic.
- **Authentication:** Add authentication to the MLflow Tracking Server to prevent unauthorized access.
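
As a sketch of the first recommendation, the existing rule can be narrowed with `gcloud compute firewall-rules update`. The CIDR range `203.0.113.0/24` is a documentation-only placeholder, not a real office or VPN range, and the `run` helper and `restrict_source` function are hypothetical names used for illustration. The command is only printed unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi; }

# Replace the rule's allowed source ranges with a trusted CIDR block.
# $1 = CIDR range that should be allowed to reach port 5000
restrict_source() {
  run gcloud compute firewall-rules update mlflow-tracking-server \
    --source-ranges="$1"
}

# 203.0.113.0/24 is a placeholder; substitute your own trusted range.
restrict_source 203.0.113.0/24
```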

##### How to Delete the Firewall Rule
If you no longer need the rule, you can delete it with:  
`gcloud compute firewall-rules delete mlflow-tracking-server`


#### Create a VM instance as the tracking server
```gcloud_command
    gcloud compute instances create mlflow-tracking-server \
        --project=terraform-demo-project-412717 \
        --zone=us-central1-a \
        --machine-type=e2-standard-2 \
        --network-interface=network-tier=PREMIUM,subnet=default \
        --maintenance-policy=TERMINATE \
        --provisioning-model=STANDARD \
        --service-account=terraform-runner@terraform-demo-project-412717.iam.gserviceaccount.com \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --tags=mlflow-tracking-server-tag \
        --create-disk=auto-delete=yes,boot=yes,device-name=mlflow-tracking-server,image=projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20250213,mode=rw,size=10,type=projects/terraform-demo-project-412717/zones/us-central1-a/diskTypes/pd-balanced \
        --no-shielded-secure-boot \
        --shielded-vtpm \
        --shielded-integrity-monitoring \
        --reservation-affinity=any
```

##### Understanding the code
1. `gcloud compute instances create mlflow-tracking-server`
- **Purpose:** Creates a new Compute Engine instance.
- **mlflow-tracking-server:** The name of the VM instance. You can choose any name, but it should be descriptive.

2. `--project=<PROJECT_ID>`
- **Purpose:** Specifies the GCP project where the VM will be created.
- **<PROJECT_ID>:** Replace this with your GCP project ID.

3. `--zone=us-central1-a`
- **Purpose:** Specifies the zone where the VM will be created.
- **us-central1-a:** The zone in the us-central1 region. Replace this with your preferred zone.

4. `--machine-type=e2-standard-2`
- **Purpose:** Specifies the machine type for the VM.
- **e2-standard-2:** A machine type with `2 vCPUs` and `8 GB of RAM`. You can choose a different machine type based on your workload.

5. `--network-interface=network-tier=PREMIUM,subnet=default`
- **Purpose:** Configures the network interface for the VM.
- **network-tier=PREMIUM:** Uses the Premium network tier for better performance and global load balancing.
- **subnet=default:** Connects the VM to the default subnet in your VPC network. If you're using a custom subnet, replace default with the name of your subnet.

6. `--maintenance-policy=TERMINATE`
- **Purpose:** Specifies the maintenance policy for the VM.
- **TERMINATE:** Stops the VM when host maintenance is required. The alternative is MIGRATE, which migrates the VM to another host without downtime.

7. `--provisioning-model=STANDARD`
- **Purpose:** Specifies the provisioning model for the VM.
- **STANDARD:** The VM is provisioned with guaranteed resources. The alternative is SPOT, which uses preemptible VMs at a lower cost but with the risk of termination.

8. `--service-account=<SERVICE_ACCOUNT_EMAIL>`
- **Purpose:** Assigns a service account to the VM.
- **<SERVICE_ACCOUNT_EMAIL>:** The service account the VM runs as. The command above uses a custom account (`terraform-runner@terraform-demo-project-412717.iam.gserviceaccount.com`); replace it with your own. To use the default Compute Engine service account instead, specify `<PROJECT_NUMBER>-compute@developer.gserviceaccount.com` with your project number. This service account allows the VM to interact with other GCP services.

9. `--scopes=https://www.googleapis.com/auth/cloud-platform`
- **Purpose:** Specifies the access scopes for the service account.
- **https://www.googleapis.com/auth/cloud-platform:** Allows the VM to request access to all Google Cloud APIs. Effective access is still limited by the IAM roles granted to the service account. You can restrict this to specific APIs if needed.

10. `--tags=mlflow-tracking-server-tag`
- **Purpose:** Assigns network tags to the VM.
- **mlflow-tracking-server-tag:** A tag that can be used to apply firewall rules or route traffic. For example, you can create a firewall rule that allows traffic only to VMs with this tag.

11. `--create-disk=auto-delete=yes,boot=yes,device-name=mlflow-tracking-server,image=projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20250213,mode=rw,size=10,type=projects/<PROJECT_ID>/zones/us-central1-a/diskTypes/pd-balanced`
- **Purpose:** Configures the boot disk for the VM.
- **auto-delete=yes:** The disk will be automatically deleted when the VM is deleted.
- **boot=yes:** This is the boot disk for the VM.
- **device-name=mlflow-tracking-server:** The device name under which the disk is exposed to the guest OS.
- **image=projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20250213:** The OS image for the disk. In this case, it's Ubuntu 20.04 LTS.
- **mode=rw:** The disk is mounted in read-write mode.
- **size=10:** The disk size in GB.
- **type=projects/<PROJECT_ID>/zones/us-central1-a/diskTypes/pd-balanced:** The disk type. pd-balanced is a `balanced persistent disk type`. Replace <PROJECT_ID> with your project ID, and keep the zone in this path the same as the VM's zone.

12. `--no-shielded-secure-boot`
- **Purpose:** Disables Shielded Secure Boot.
- **Shielded Secure Boot:** A security feature that ensures the VM boots only with trusted software. Disabling it may be necessary for certain custom images or workloads.

13. `--shielded-vtpm`
- **Purpose:** Enables virtual Trusted Platform Module (vTPM).
- **vTPM:** A virtualized hardware-based security feature for storing encryption keys and performing cryptographic operations.

14. `--shielded-integrity-monitoring`
- **Purpose:** Enables Shielded Integrity Monitoring.
- **Shielded Integrity Monitoring:** Monitors the boot process for unauthorized changes.

15. `--reservation-affinity=any`
- **Purpose:** Specifies the reservation affinity for the VM.
- **any:** The VM can consume any matching reservation that is available. You can also use `specific` to target a named reservation.


##### What Does This Command Do?
- Creates a VM named `mlflow-tracking-server` in the specified project and zone.
- Uses an `e2-standard-2` machine type with 2 vCPUs and 8 GB of RAM.
- Connects the VM to the `default subnet` in the `default VPC network`.
- Assigns the specified service account with the `cloud-platform` scope, so its API access is governed by its IAM roles.
- Creates a 10 GB boot disk with Ubuntu 20.04 LTS.
- Enables security features like vTPM and Shielded Integrity Monitoring.
- Tags the VM with `mlflow-tracking-server-tag` for firewall rules.

#### Connect to the tracking server
- Syntax: `ssh -i ~/.ssh/<ssh_key> <user>@<ip>`  
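
As an alternative to managing SSH keys by hand, `gcloud compute ssh` can open the session and handle key distribution. This is a minimal sketch, assuming the VM name and zone from the create command above; the `run` helper and `ssh_tracking_server` function are hypothetical names. The command is only printed unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi; }

# Open an SSH session to the tracking server using gcloud-managed keys.
# $1 = VM name, $2 = zone of the VM
ssh_tracking_server() {
  run gcloud compute ssh "$1" --zone="$2"
}

ssh_tracking_server mlflow-tracking-server us-central1-a
```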

#### Reserving an external IP for the VM
- **static address name:** mlflow-tracking-server
- **address**: 34.55.209.168
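
The reservation can also be made from the command line. The sketch below assumes the region `us-central1` (the region of the VM's zone) and reserves a new regional address, then prints it; attaching the address to the VM is not shown. The `run` helper and `reserve_static_ip` function are hypothetical names, and commands are only printed unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi; }

# Reserve a regional static external IP, then look up the assigned address.
# $1 = address name, $2 = region (should match the VM's region)
reserve_static_ip() {
  run gcloud compute addresses create "$1" --region="$2"
  run gcloud compute addresses describe "$1" --region="$2" --format="get(address)"
}

reserve_static_ip mlflow-tracking-server us-central1
```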

### Database as the Backend Store
#### Creating database instance

- **Database version:** PostgreSQL 13
- **Instance ID:** mlflow-database
- **Password:** postgres
- **vCPUs**: 2 vCPU
- **RAM**: 8 GB
- **Storage**: 10 GB
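
The same instance settings can be expressed with `gcloud sql instances create`. This is a sketch under stated assumptions: the region `us-central1` is assumed to match the VM, the password is read from a hypothetical `POSTGRES_PASSWORD` variable, and the private IP networking used later in this guide is not configured here. The `run` helper and `create_sql_instance` function are hypothetical names; the command is only printed unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi; }

# Create a PostgreSQL 13 Cloud SQL instance with 2 vCPUs, 8 GiB RAM, 10 GB storage.
# $1 = region, $2 = password for the default `postgres` user
create_sql_instance() {
  run gcloud sql instances create mlflow-database \
    --database-version=POSTGRES_13 \
    --cpu=2 \
    --memory=8GiB \
    --storage-size=10GB \
    --region="$1" \
    --root-password="$2"
}

# POSTGRES_PASSWORD is an assumed variable; avoid hard-coding real passwords.
create_sql_instance us-central1 "${POSTGRES_PASSWORD:-change-me}"
```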


#### Creating database
Cloud SQL creates a default database named `postgres`; let's create a new one.
- **database_name:** mlflow_db


#### Adding User account
Cloud SQL creates a default user named `postgres`; let's create a new one.
- **user_name:** mlflow
- **password:** mlflow
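
Both the database and the user can be created with `gcloud sql` instead of the console. A minimal sketch, assuming the instance, database, and user names listed above; the `MLFLOW_DB_PASSWORD` variable, the `run` helper, and the `create_db_and_user` function are hypothetical names. Commands are only printed unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi; }

# Create the MLflow database and a dedicated user on the Cloud SQL instance.
# $1 = instance ID, $2 = database name, $3 = user name, $4 = user password
create_db_and_user() {
  run gcloud sql databases create "$2" --instance="$1"
  run gcloud sql users create "$3" --instance="$1" --password="$4"
}

# MLFLOW_DB_PASSWORD is an assumed variable; avoid hard-coding real passwords.
create_db_and_user mlflow-database mlflow_db mlflow "${MLFLOW_DB_PASSWORD:-change-me}"
```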

#### Connecting the tracking server to database
- Install dependencies
    - `sudo apt-get update`
    - `sudo apt-get install postgresql-client`

- Connect to database  
`psql -h CLOUD_SQL_PRIVATE_IP_ADDRESS -U USERNAME DATABASENAME`  
Example: `psql -h 10.56.80.6 -U mlflow mlflow_db`

- A few useful psql commands:
    - show all available commands: `\?`
    - show all available databases: `\l`
    - switching to another database: `\c <database_name>`
    - show available tables in current database: `\dt`
    - describe a particular table: `\d <table_name>`
    - show the PostgreSQL version: `select version();`
    - show the syntax of a SQL statement: `\h DROP TABLE`
    - select statement: `select * from <table_name> limit 10;`
    - quit psql: `\q`

### Google Cloud Storage Bucket as Artifact Store
Bucket name: mlflow-test-bucket-2025
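
The bucket can be created with `gcloud storage buckets create`. This sketch assumes the location `us-central1` to keep the bucket near the VM; the `run` helper and `create_artifact_bucket` function are hypothetical names. The command is only printed unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi; }

# Create the GCS bucket that will hold MLflow artifacts.
# $1 = bucket name (globally unique), $2 = bucket location
create_artifact_bucket() {
  run gcloud storage buckets create "gs://$1" --location="$2"
}

create_artifact_bucket mlflow-test-bucket-2025 us-central1
```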


### Run the MLflow Server on Tracking Server

- Install dependencies
```
sudo apt install python3.8-venv
python3 -m venv mlflow
source mlflow/bin/activate
pip install mlflow boto3 google-cloud-storage psycopg2-binary
```

These packages provide the necessary functionality:
- **mlflow:** Core MLflow functionality.
- **boto3:** The AWS SDK for Python. MLflow uses it for S3 artifact stores, so it is not required for a GCS-only setup.
- **google-cloud-storage:** Enables MLflow to read and write artifacts in GCS (`gs://` URIs).
- **psycopg2-binary:** Enables MLflow to connect to a PostgreSQL database for metadata storage.


Then run the mlflow server:
```
# Syntax:
mlflow server \
    -h 0.0.0.0 \
    -p 5000 \
    --backend-store-uri postgresql://<user>:<pass>@<db private ip>:5432/<db name> \
    --default-artifact-root gs://<bucket name>/<folder name>

# Example:
mlflow server \
    -h 0.0.0.0 \
    -p 5000 \
    --backend-store-uri postgresql://mlflow:mlflow@10.56.80.6:5432/mlflow_db \
    --default-artifact-root gs://mlflow-test-bucket-2025/mlruns \
    --gunicorn-opts "--log-level debug"
```

To fix an "address already in use" error on port 5000:
1. Use `lsof -i :5000` to check what processes are using the port
2. Use `kill -9 <PID>` to terminate specific processes
3. Or use `pkill -f ".*0.0.0.0:5000.*"` to kill all processes matching the port pattern


##### MLflow UI: 
Syntax: `http://<tracking server external IP>:5000`  
Example: `http://34.55.209.168:5000` 
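
To point MLflow clients at the server and check that it responds, one option is to export `MLFLOW_TRACKING_URI` and probe the server over HTTP. This is a sketch, assuming the external IP reserved earlier and that the installed MLflow version exposes a `/health` endpoint; the `tracking_uri` function and `run` helper are hypothetical names. The `curl` call is only printed unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi; }

# Build the tracking URI from the server's external IP and port (default 5000).
# $1 = external IP, $2 = optional port
tracking_uri() { echo "http://$1:${2:-5000}"; }

# MLflow clients and the mlflow CLI read the server address from this variable.
MLFLOW_TRACKING_URI="$(tracking_uri 34.55.209.168)"
export MLFLOW_TRACKING_URI

# Probe the server; -f makes curl exit non-zero on HTTP errors.
# The /health path is an assumption about the running MLflow version.
run curl -fsS "$MLFLOW_TRACKING_URI/health"
```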


Facing issue with configuration