# Orchestration

Set up Airflow with:

In [None]:
!export AIRFLOW_HOME=$(pwd) && export DAGS_FOLDER=$(pwd)/dags
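
Each `!` command in a notebook runs in its own subshell, so variables exported this way do not persist into later cells. If the goal is to make them visible to processes launched from the notebook kernel, one option is to set them through `os.environ`. The sketch below is an assumption about that intent, not part of the original setup. It keeps the `DAGS_FOLDER` name used above and also sets `AIRFLOW__CORE__DAGS_FOLDER`, the environment-variable form of the `dags_folder` option in Airflow's `[core]` configuration section.

```python
import os
from pathlib import Path

# Assumption: the notebook's working directory is the intended Airflow home.
project_dir = Path.cwd()

# AIRFLOW_HOME tells Airflow where to keep its config, logs and database.
os.environ["AIRFLOW_HOME"] = str(project_dir)

# Same variable name as the shell command above.
os.environ["DAGS_FOLDER"] = str(project_dir / "dags")

# Environment override for the [core] dags_folder setting.
os.environ["AIRFLOW__CORE__DAGS_FOLDER"] = str(project_dir / "dags")

print(os.environ["AIRFLOW_HOME"])
print(os.environ["AIRFLOW__CORE__DAGS_FOLDER"])
```

Servers started from a separate terminal do not inherit these values, so the `export` commands still need to be run in that terminal.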

Run the following commands to start Airflow (including its web server) and the MLflow server. Both run in the foreground, so start each one in its own terminal:

```bash
airflow standalone
mlflow server \
  --backend-store-uri sqlite:///$(pwd)/mlflow.db \
  --default-artifact-root mlruns
```

The MLflow UI is available at `http://127.0.0.1:5000` and the Airflow UI at `http://127.0.0.1:8080`.
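
Before moving on, it can help to confirm that both UIs actually respond. The following is a minimal sketch using only the standard library. The helper name `is_up` and the timeout value are illustrative choices, and the ports are the ones listed above.

```python
import urllib.error
import urllib.request

# Bypass any configured HTTP proxy: these are local addresses.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with _opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except OSError:
        # Connection refused, timeout, unreachable host, and similar.
        return False


if __name__ == "__main__":
    services = {
        "MLflow": "http://127.0.0.1:5000",
        "Airflow": "http://127.0.0.1:8080",
    }
    for name, url in services.items():
        status = "up" if is_up(url) else "not reachable"
        print(f"{name}: {status} ({url})")
```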

## Downloading the Data

Download the January 2023 Green Taxi data to use as training data and the February 2023 Green Taxi data to use as validation data. The cell below also fetches the March 2023 file.
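
The monthly files follow a `green_tripdata_YYYY-MM.parquet` naming pattern, so the training and validation file names can be derived from a year and month instead of being typed out. A small sketch of that idea follows. The helper `trip_file_name` and the constants `TRAIN_FILE` and `VAL_FILE` are hypothetical names introduced here for illustration.

```python
def trip_file_name(year: int, month: int, color: str = "green") -> str:
    """Build a monthly trip-data file name such as green_tripdata_2023-01.parquet."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    # Months are zero-padded to two digits in the published file names.
    return f"{color}_tripdata_{year}-{month:02d}.parquet"


TRAIN_FILE = trip_file_name(2023, 1)  # January 2023: training data
VAL_FILE = trip_file_name(2023, 2)    # February 2023: validation data

print(TRAIN_FILE)
print(VAL_FILE)
```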

In [1]:
import subprocess
from pathlib import Path


PROJECT_DIR = Path().absolute().parent.parent
DATA_DIR = PROJECT_DIR / "data"
S3_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/"
FILE_NAMES = [
    "green_tripdata_2023-01.parquet",
    "green_tripdata_2023-02.parquet",
    "green_tripdata_2023-03.parquet",
]


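def download_data(file_name: str) -> None:
    """Download `file_name` into DATA_DIR unless it is already there."""
    file_path = DATA_DIR / file_name
    url = S3_URL + file_name

    if not file_path.is_file():
        print("File does not exist, downloading from S3 bucket.")
        # Create the data directory (and any parents) if it is missing.
        file_path.parent.mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if wget exits with an error,
        # so the success message is printed only when the download worked.
        subprocess.run(["wget", "-O", str(file_path), url], check=True)
        print(f"File downloaded successfully and saved at {file_path}")
    else:
        print("File already exists.")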


for file_name in FILE_NAMES:
    download_data(file_name)

File already exists.
File already exists.
File does not exist, downloading from S3 bucket.


--2023-06-24 16:59:32--  https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-03.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 13.225.242.37, 13.225.242.89, 13.225.242.58, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|13.225.242.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1730999 (1.7M) [binary/octet-stream]
Saving to: ‘/home/fernando/code/mlops-zoomcamp/data/green_tripdata_2023-03.parquet’

2023-06-24 16:59:33 (3.35 MB/s) - ‘/home/fernando/code/mlops-zoomcamp/data/green_tripdata_2023-03.parquet’ saved [1730999/1730999]

File downloaded successfully and saved at /home/fernando/code/mlops-zoomcamp/data/green_tripdata_2023-03.parquet
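
The download cell depends on `wget` being installed, and `wget -O` can leave an empty or partial file behind when a download fails, which a later run would then report as already existing. As an alternative, here is a sketch that uses only the standard library and writes to a temporary `.part` file before renaming it. The function name `download_file` and the `.part` suffix are illustrative choices, not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path) -> bool:
    """Download `url` to `dest`. Return False if `dest` already exists."""
    dest = Path(dest)
    if dest.is_file():
        return False

    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so a failed download never leaves
    # a partial file at the final path.
    tmp = dest.with_name(dest.name + ".part")
    try:
        # urlopen raises an exception on HTTP error responses.
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        tmp.replace(dest)
    finally:
        # Remove the temporary file if the download did not complete.
        tmp.unlink(missing_ok=True)
    return True
```

With the names from the cell above, a call would look like `download_file(S3_URL + file_name, DATA_DIR / file_name)`.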

