# ETL one cycle
> ETL normally runs as three separate steps, E, T, and L :) but we can also run all of it in one cycle, ETL.

We are going to take the 3 configs from `ETL_how_to_run.ipynb` and merge them into one config file.

## 🌌 1. prepare config

In [1]:
import os
from pathlib import Path
from dataverse.config import Config 
from omegaconf import OmegaConf

# E = Extract, T = Transform, L = Load
main_path = Path(os.path.abspath('../..'))
ETL_path = main_path / "./dataverse/config/etl/sample/ETL___one_cycle.yaml"

ETL_config = Config.load(ETL_path)

#### Wait! If you haven't cloned the repository, run the cell below instead.

In [2]:
ETL_config = OmegaConf.create({
    'spark': { 
        'appname': 'dataverse_etl_sample',
        'driver': {'memory': '16g'}  
    },
    'etl': [
        {
            'name': 'data_ingestion___test___generate_fake_ufl'
        },
        {
            'name': 'utils___sampling___random',
            'args': {'sample_n_or_frac': 0.1}
        },
        {
            'name': 'deduplication___minhash___lsh_jaccard'
        },
        {
            'name': 'data_save___huggingface___ufl2hf_obj'
        }
    ]
})
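
As a rough sketch of what merging per-stage configs into one cycle amounts to, the snippet below concatenates the `etl` step lists of three stage configs and keeps a single `spark` section. The `extract`, `transform`, and `load` dicts and the `merge_into_one_cycle` helper are hypothetical illustrations, not part of dataverse, and the real configs in `ETL_how_to_run.ipynb` may contain additional steps. Plain dicts are used to keep the example dependency-free.

```python
from copy import deepcopy


def merge_into_one_cycle(*stage_configs: dict) -> dict:
    """Concatenate the `etl` lists of several configs, in order.

    The `spark` section is taken from the first config only (an assumption
    made for this sketch).
    """
    merged = {"spark": deepcopy(stage_configs[0].get("spark", {})), "etl": []}
    for cfg in stage_configs:
        merged["etl"].extend(deepcopy(cfg.get("etl", [])))
    return merged


# Hypothetical per-stage configs, using only step names shown in this notebook.
extract = {
    "spark": {"appname": "dataverse_etl_sample", "driver": {"memory": "16g"}},
    "etl": [{"name": "data_ingestion___test___generate_fake_ufl"}],
}
transform = {
    "etl": [
        {"name": "utils___sampling___random", "args": {"sample_n_or_frac": 0.1}},
        {"name": "deduplication___minhash___lsh_jaccard"},
    ]
}
load = {"etl": [{"name": "data_save___huggingface___ufl2hf_obj"}]}

one_cycle = merge_into_one_cycle(extract, transform, load)
step_names = [step["name"] for step in one_cycle["etl"]]
print(step_names)
```

The resulting dict has the same shape as the config above and could be wrapped with `OmegaConf.create(one_cycle)`.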

### 🌠 ETL Config
> One cycle from raw data to a Hugging Face dataset

- generate fake UFL data
- sample 10% of the data to reduce the size of the dataset
- deduplicate on the `text` column using 15-gram MinHash Jaccard similarity
- convert to a Hugging Face dataset and return the object
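
MinHash LSH approximates the Jaccard similarity between the n-gram sets of two texts. As a minimal sketch of the quantity being approximated, the snippet below computes the exact Jaccard similarity over word n-grams. It is not the dataverse implementation: the word-level tokenization, the small `n=3`, and the helper names `ngrams` and `jaccard` are assumptions chosen for illustration.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams in `text` (whitespace tokenization)."""
    tokens = text.split()
    if len(tokens) < n:
        # Text shorter than n words: treat the whole text as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "public mission far program tough economic talk few minute"
doc_b = "public mission far program tough economic talk few hours"

# 7 trigrams each, 6 shared, 8 in the union
sim = jaccard(ngrams(doc_a, 3), ngrams(doc_b, 3))
print(sim)  # 0.75
```

A deduplication step would typically drop one of two documents whose similarity exceeds some threshold; the threshold used by `deduplication___minhash___lsh_jaccard` is not stated here.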

In [3]:
print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___test___generate_fake_ufl
- name: utils___sampling___random
  args:
    sample_n_or_frac: 0.1
- name: deduplication___minhash___lsh_jaccard
- name: data_save___huggingface___ufl2hf_obj
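
Each step name in the config appears to follow a three-part pattern separated by triple underscores. The sketch below splits the names accordingly. The field labels `category`, `sub_category`, and `name`, as well as the `parse_etl_name` helper, are assumptions made here for illustration and are not dataverse API.

```python
from typing import NamedTuple


class ETLName(NamedTuple):
    category: str
    sub_category: str
    name: str


def parse_etl_name(full_name: str) -> ETLName:
    """Split 'a___b___c' into its three parts (assumed naming pattern)."""
    parts = full_name.split("___")
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts separated by '___', got {parts!r}")
    return ETLName(*parts)


step_names = [
    "data_ingestion___test___generate_fake_ufl",
    "utils___sampling___random",
    "deduplication___minhash___lsh_jaccard",
    "data_save___huggingface___ufl2hf_obj",
]

parsed = [parse_etl_name(n) for n in step_names]
for p in parsed:
    print(f"{p.category:15} {p.sub_category:12} {p.name}")
```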



## 🌌 2. pass the config to ETLPipeline

In [4]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

In [5]:
# raw -> hf_obj: run every step in one cycle and get the dataset object back
spark, dataset = etl_pipeline.run(ETL_config)

[ No AWS Credentials Found] - Failed to set spark conf for S3


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/15 22:26:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/15 22:26:20 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
                                                                                

In [6]:
dataset

Dataset({
    features: ['id', 'meta', 'name', 'text'],
    num_rows: 14
})

In [7]:
dataset[0]

{'id': '32ff39e5-2a88-45dc-a69e-b59b05f51216',
 'meta': '{"name": "Laura White", "age": 49, "address": "126 Javier Islands Apt. 925\\nPort Jasonshire, UT 60978", "job": "Mining engineer"}',
 'name': 'test_fake_ufl',
 'text': 'Your whose admit ask herself. Public mission far program tough.\nEconomic talk few minute. Budget face yeah along difference. Evening heart throughout.'}
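
The `meta` field in the record above is a JSON-encoded string rather than a nested dict, so it needs decoding before its fields can be used. A minimal sketch with the standard `json` module, using the `meta` value copied from the output above (the variable names `record` and `meta` are illustrative):

```python
import json

# The `meta` value from dataset[0] above, stored as a JSON string.
record = {
    "id": "32ff39e5-2a88-45dc-a69e-b59b05f51216",
    "meta": '{"name": "Laura White", "age": 49, "address": "126 Javier Islands Apt. 925\\nPort Jasonshire, UT 60978", "job": "Mining engineer"}',
    "name": "test_fake_ufl",
}

meta = json.loads(record["meta"])  # str -> dict
print(meta["name"], meta["age"], meta["job"])
print(meta["address"].splitlines())  # the address contains an embedded newline
```

With the actual pipeline output, `record` would simply be `dataset[0]`.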