# How to run ETL
> This notebook shows how to run ETL. There are two steps:

1. Prepare the config
2. Pass the config to ETLPipeline

## 🌌 1. Prepare the config

In [1]:
import os
from pathlib import Path
from dataverse.config import Config
from omegaconf import OmegaConf

# E = Extract, T = Transform, L = Load
main_path = Path(os.path.abspath('../..'))
E_path = main_path / "./dataverse/config/etl/sample/data_ingestion___sampling.yaml"
T_path = main_path / "./dataverse/config/etl/sample/data_preprocess___dedup.yaml"
L_path = main_path / "./dataverse/config/etl/sample/data_load___hf_obj.yaml"

E_config = Config.load(E_path)
T_config = Config.load(T_path)
L_config = Config.load(L_path)

### 🌠 Extract Config

- generate fake UFL data for testing
- randomly sample 10% of the data to reduce the dataset size
- save to parquet `dataverse/sample/sample_ufl.parquet`
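To make the 10% sampling step concrete, here is a minimal pure-Python sketch of fractional random sampling. It is only an illustration: the function name `sample_fraction`, the `seed` parameter, and the toy rows are assumptions, and the actual `utils___sampling___random` step runs on Spark and may behave differently (for example, it may not return an exact row count).

```python
import random


def sample_fraction(rows, frac, seed=42):
    """Return a random subset holding roughly `frac` of `rows` (hypothetical helper)."""
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must be between 0 and 1")
    k = round(len(rows) * frac)
    # A seeded generator keeps the sample reproducible between runs.
    return random.Random(seed).sample(rows, k)


# Toy rows standing in for UFL records (made up for this example).
rows = [{"id": i, "text": f"document {i}"} for i in range(100)]
sampled = sample_fraction(rows, 0.1)
print(len(sampled))
```

Sampling early in the pipeline keeps the later deduplication step cheap while you are still checking that the configs work.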

In [2]:
print(OmegaConf.to_yaml(E_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___test___generate_fake_ufl
- name: utils___sampling___random
  args:
    sample_n_or_frac: 0.1
- name: data_load___parquet___ufl2parquet
  args:
    save_path: ./sample/sample_ufl.parquet
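The step names in the `etl` list above all consist of three parts joined by triple underscores. As a small sketch of how such a name can be taken apart, the helper below splits it into its parts. The helper `split_etl_name` and the labels `category`, `subcategory`, and `process` are assumptions made for this example, not names confirmed by dataverse.

```python
def split_etl_name(name):
    """Split an ETL step name on '___' into three labelled parts (hypothetical helper)."""
    parts = name.split("___")
    if len(parts) != 3:
        raise ValueError(f"expected three parts separated by '___', got: {name!r}")
    category, subcategory, process = parts
    return {"category": category, "subcategory": subcategory, "process": process}


# Step names copied from the Extract config printed above.
for step in [
    "data_ingestion___test___generate_fake_ufl",
    "utils___sampling___random",
    "data_load___parquet___ufl2parquet",
]:
    print(split_etl_name(step))
```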



### 🌠 Transform Config

- load parquet `./sample/sample_ufl.parquet`
- deduplicate on the `text` column using MinHash LSH with Jaccard similarity over 15-grams
- save to parquet `./sample/preprocess_ufl.parquet`
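For readers new to MinHash, the sketch below shows the idea behind the deduplication step: break each text into 15-grams, then estimate the Jaccard similarity of two shingle sets by comparing MinHash signatures. This is a simplified illustration, not the dataverse implementation. It assumes character-level 15-grams, uses seeded BLAKE2 hashes in place of random permutations, and leaves out the LSH banding that makes the real method scale. All function names are hypothetical.

```python
import hashlib


def char_ngrams(text, n=15):
    """Return the set of character n-grams (shingles) of `text`."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    """Seeded 64-bit hash; each seed acts as one pseudo-random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingles, num_perm=128):
    """Keep the minimum hash value of the set under each seeded hash."""
    return [min(_hash(s, seed) for s in shingles) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = "Ever difference protect seat trial argue man. Pm area clear from."
b = a + " Station speak road."
c = "Need big home north trouble affect, and nothing else in common."

exact = jaccard(char_ngrams(a), char_ngrams(b))
estimate = estimate_jaccard(
    minhash_signature(char_ngrams(a)), minhash_signature(char_ngrams(b))
)
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

Pairs whose estimated similarity exceeds a chosen threshold are treated as near-duplicates, and only one document from each group is kept.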

In [3]:
print(OmegaConf.to_yaml(T_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___parquet___pq2ufl
  args:
    input_paths:
    - ./sample/sample_ufl.parquet
- name: deduplication___minhash___lsh_jaccard
- name: data_load___parquet___ufl2parquet
  args:
    save_path: ./sample/preprocess_ufl.parquet



### 🌠 Load Config

- load parquet `./sample/preprocess_ufl.parquet`
- convert to a Hugging Face dataset and return the object

In [4]:
print(OmegaConf.to_yaml(L_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___parquet___pq2ufl
  args:
    input_paths:
    - ./sample/preprocess_ufl.parquet
- name: data_load___huggingface___ufl2hf_obj
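The three configs are linked only through parquet paths: each stage reads what the previous stage saved. The sketch below checks that link on plain dictionaries that mirror the YAML printed above. The helpers `stage_output`, `stage_inputs`, and `check_chain` are hypothetical and written for this example; they are not part of dataverse.

```python
def stage_output(config):
    """Return the `save_path` of the last step, or None if it has none."""
    last_step = config["etl"][-1]
    return last_step.get("args", {}).get("save_path")


def stage_inputs(config):
    """Return the `input_paths` of the first step, or an empty list."""
    first_step = config["etl"][0]
    return first_step.get("args", {}).get("input_paths", [])


def check_chain(configs):
    """Raise ValueError if a stage does not read the previous stage's output."""
    for prev, nxt in zip(configs, configs[1:]):
        out = stage_output(prev)
        if out not in stage_inputs(nxt):
            raise ValueError(f"{out!r} is not read by the next stage")
    return True


# Plain-dict copies of the `etl` sections shown in the three configs.
E = {"etl": [
    {"name": "data_ingestion___test___generate_fake_ufl"},
    {"name": "utils___sampling___random", "args": {"sample_n_or_frac": 0.1}},
    {"name": "data_load___parquet___ufl2parquet",
     "args": {"save_path": "./sample/sample_ufl.parquet"}},
]}
T = {"etl": [
    {"name": "data_ingestion___parquet___pq2ufl",
     "args": {"input_paths": ["./sample/sample_ufl.parquet"]}},
    {"name": "deduplication___minhash___lsh_jaccard"},
    {"name": "data_load___parquet___ufl2parquet",
     "args": {"save_path": "./sample/preprocess_ufl.parquet"}},
]}
L = {"etl": [
    {"name": "data_ingestion___parquet___pq2ufl",
     "args": {"input_paths": ["./sample/preprocess_ufl.parquet"]}},
    {"name": "data_load___huggingface___ufl2hf_obj"},
]}

print(check_chain([E, T, L]))
```

If the loaded configs are OmegaConf objects, as the `OmegaConf.to_yaml` calls suggest, `OmegaConf.to_container(E_config)` should give a comparable plain dictionary to pass to a check like this.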



## 🌌 2. Pass the config to ETLPipeline

In [5]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

In [6]:
# raw -> ufl
etl_pipeline.run(E_config)

# ufl -> dedup -> ufl
etl_pipeline.run(T_config)

# ufl -> hf_obj
spark, dataset = etl_pipeline.run(L_config)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/14 18:59:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/11/14 18:59:47 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/11/14 18:59:53 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/11/14 18:59:59 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).


Downloading and preparing dataset spark/-1386837710 to /root/.cache/huggingface/datasets/spark/-1386837710/0.0.0...

Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-1386837710/0.0.0. Subsequent calls will reuse this data.


In [7]:
dataset

Dataset({
    features: ['id', 'meta', 'name', 'text'],
    num_rows: 14
})

In [8]:
dataset[0]

{'id': '1329ca9c-ce67-449b-b4ba-348ffb70432f',
 'meta': '{"name": "Mr. Javier Johnson MD", "age": 47, "address": "450 Jackson Track Apt. 402\\nLake Robertchester, AK 89060", "job": "Tree surgeon"}',
 'name': 'test_fake_ufl',
 'text': 'Ever difference protect seat trial argue man. Pm area clear from. Station speak road cut short fish theory. Need big home north trouble affect.'}
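Note that `meta` in the record above is a JSON string, not a dictionary. A short sketch of decoding it with the standard library follows; the record literal is copied from the output above so the example runs without Spark or dataverse.

```python
import json

# Record copied from the `dataset[0]` output above.
record = {
    'id': '1329ca9c-ce67-449b-b4ba-348ffb70432f',
    'meta': '{"name": "Mr. Javier Johnson MD", "age": 47, "address": "450 Jackson Track Apt. 402\\nLake Robertchester, AK 89060", "job": "Tree surgeon"}',
    'name': 'test_fake_ufl',
    'text': 'Ever difference protect seat trial argue man. Pm area clear from. Station speak road cut short fish theory. Need big home north trouble affect.',
}

# Decode the JSON string into a regular dictionary.
meta = json.loads(record["meta"])
print(meta["job"])
print(meta["age"])
```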