# ETL: test your ETL process
> When you want `test`(sample) data to quickly test your ETL process, or need `data from a certain point` in the pipeline to test it, you can check how to do it here.

## 🌌 Get `test`(sample) data

### 🌠 get `test`(sample) data `w/o config`
> When you have created an ETL process and don't want to set up a config from scratch, here is a quick way to get sample data.

In [1]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()
spark, data = etl_pipeline.sample()

# default sampling will return 100 `ufl` data
print(f"total data # : {data.count()}")
print(f"sample data :")
data.take(1)



[ SAMPLE MODE ]
This is a quick way to get the sample data for testing or debugging w/o config.
If you want to test the ETL pipeline with your own data, please use `run` w/ config.
=> spark, data = etl_pipeline.sample()
=> data = data.map(add awesome duck to column)



                                                                                

total data # : 100
sample data :


[{'id': '1aad997c-c6d6-4780-93e5-f653c4182243',
  'name': 'test_fake_ufl',
  'text': 'Early group family ahead movie. Reveal west us he heart board trip foot. Less else when pick compare while.\nNew relate daughter eat idea. Road professional social trade ten.',
  'meta': '{"name": "Brendan Cunningham", "age": 78, "address": "76915 Tanya Gateway\\nNew Shelbymouth, ME 17878", "job": "Nurse, adult"}'}]
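Note that in the sample record above, `meta` is a JSON string, not a nested dict. A minimal sketch of decoding it, run on a plain Python dict copied from the output (the `text` value is shortened, and `parse_meta` is a hypothetical helper name, not part of dataverse):

```python
import json

# One record copied from the sample output above (text shortened for brevity).
record = {
    "id": "1aad997c-c6d6-4780-93e5-f653c4182243",
    "name": "test_fake_ufl",
    "text": "Early group family ahead movie.",
    "meta": '{"name": "Brendan Cunningham", "age": 78, "address": "76915 Tanya Gateway\\nNew Shelbymouth, ME 17878", "job": "Nurse, adult"}',
}


def parse_meta(row):
    """Return a copy of the row with `meta` decoded from a JSON string into a dict."""
    return {**row, "meta": json.loads(row["meta"])}


parsed = parse_meta(record)
print(parsed["meta"]["age"])  # 78
```

Since the notebook already calls `data.map(...)` with functions over dict records, the same helper could presumably be applied as `data.map(parse_meta)`.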

To increase the sample size, pass the number of rows either by keyword or positionally:
```python
# keyword form
spark, data = etl_pipeline.sample(n=10000)
# equivalent positional form
spark, data = etl_pipeline.sample(10000)
```

In [2]:
spark, data = etl_pipeline.sample(10000)
print(f"total data # : {data.count()}")
print(f"sample data :")
data.take(1)

[ SAMPLE MODE ]
This is a quick way to get the sample data for testing or debugging w/o config.
If you want to test the ETL pipeline with your own data, please use `run` w/ config.
=> spark, data = etl_pipeline.sample()
=> data = data.map(add awesome duck to column)



                                                                                

total data # : 10000
sample data :


[{'id': 'a225e2e1-4051-4d36-bf61-bb00fa36e2d2',
  'name': 'test_fake_ufl',
  'text': 'Couple could impact approach agency fund day clear. Wife drop surface discover project.\nWord source reveal country. Community population method on. Kitchen standard between six enough government.',
  'meta': '{"name": "Michael Sawyer", "age": 32, "address": "60245 Charles Spurs Apt. 866\\nPort Felicia, MP 61824", "job": "Civil Service fast streamer"}'}]

### 🌠 get `test`(sample) data `w/ config`
> This might take some time to get the data, but you can choose your own data.
- this was also introduced in `ETL_03_create_new_etl_process.ipynb`

Getting sample data `you want`

In [3]:
from omegaconf import OmegaConf

# load from dict
ETL_config = OmegaConf.create({
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '16g'},
    },
    'etl': [
        {
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {'name': 'utils___sampling___random'}
    ]
})

print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: ETL
  driver:
    memory: 16g
etl:
- name: data_ingestion___huggingface___hf2raw
  args:
    name_or_path:
    - ai2_arc
    - ARC-Challenge
- name: utils___sampling___random



In [4]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()
spark, data = etl_pipeline.run(ETL_config)
print(f"total data # : {data.count()}")
print(f"sample data :")
data.take(1)

Dataset already exists at /home/vscode/.cache/dataverse/dataset/huggingface_66b1e70af513110c.parquet


                                                                                

[ DEBUG MODE ]
Last ETL process was assigned for [ utils ]
Spark session will not be stopped and will be returned
If this is not intended, please assign [ data_load ] at the end.
Example:
=> spark, data = etl_pipeline.run(config)
=> data = data.map(add awesome duck to column)



                                                                                

total data # : 280
sample data :


[{'id': 'Mercury_7029645',
  'question': 'Metal atoms will most likely form ions by the',
  'choices': Row(text=['loss of electrons.', 'loss of protons.', 'gain of electrons.', 'gain of protons.'], label=['A', 'B', 'C', 'D']),
  'answerKey': 'A'}]
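The `[ DEBUG MODE ]` message above reports that the last ETL process was assigned for `[ utils ]`, which matches the first part of the name `utils___sampling___random`. The following is a rough sketch of that naming pattern, assuming names are three parts joined by `___`. The labels `category`, `sub_category`, `process` and the helper `split_etl_name` are illustrative guesses, not dataverse API:

```python
def split_etl_name(name):
    """Split 'category___sub_category___process' into its three parts."""
    parts = name.split("___")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 parts separated by '___', got {len(parts)}: {name!r}"
        )
    category, sub_category, process = parts
    return category, sub_category, process


# ETL names taken from the config above, in pipeline order.
etl_names = [
    "data_ingestion___huggingface___hf2raw",
    "utils___sampling___random",
]

# The category of the last step is what the debug message reports.
last_category = split_etl_name(etl_names[-1])[0]
print(last_category)  # utils
```

Per the message, ending the pipeline with a `data_load` step instead would stop the Spark session rather than return it.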

## 🌌 Test your ETL process
> It's time to test your ETL process with the sample data. Define an ETL process and run it.

In [5]:
from dataverse.etl import ETLPipeline
from dataverse.etl import register_etl

etl_pipeline = ETLPipeline()

# get sample data
spark, data = etl_pipeline.sample()
print(f"total data # : {data.count()}")
print(f"sample data :")
data.take(1)



[ SAMPLE MODE ]
This is a quick way to get the sample data for testing or debugging w/o config.
If you want to test the ETL pipeline with your own data, please use `run` w/ config.
=> spark, data = etl_pipeline.sample()
=> data = data.map(add awesome duck to column)



                                                                                

total data # : 100
sample data :


[{'id': 'b9276a8f-3a0a-4474-9fe5-0e454b3f2449',
  'name': 'test_fake_ufl',
  'text': 'Year main scene husband grow carry range. Tonight himself sell since across.',
  'meta': '{"name": "Kimberly Fields", "age": 41, "address": "625 Cynthia Expressway Suite 971\\nAllisonmouth, MH 13154", "job": "Civil engineer, contracting"}'}]

In [6]:
@register_etl
def test___your___etl_process(spark, data, *args, **kwargs):
    # add your custom process here
    # here we are going to simply remove 'id' key
    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'id'})

    return data

In [7]:
# test right away
# - successfully removed `id` key
etl = test___your___etl_process
etl()(spark, data).take(1)

[{'name': 'test_fake_ufl',
  'text': 'Year main scene husband grow carry range. Tonight himself sell since across.',
  'meta': '{"name": "Kimberly Fields", "age": 41, "address": "625 Cynthia Expressway Suite 971\\nAllisonmouth, MH 13154", "job": "Civil engineer, contracting"}'}]

In [8]:
# check that it is registered by fetching it from etl_pipeline
# - successfully removed `id` key
etl = etl_pipeline.get('test___your___etl_process')
etl()(spark, data).take(1)

[{'name': 'test_fake_ufl',
  'text': 'Year main scene husband grow carry range. Tonight himself sell since across.',
  'meta': '{"name": "Kimberly Fields", "age": 41, "address": "625 Cynthia Expressway Suite 971\\nAllisonmouth, MH 13154", "job": "Civil engineer, contracting"}'}]
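Before involving Spark at all, the per-record logic can be checked on plain dicts. This is a minimal sketch that pulls the lambda from the ETL above into a named function (`drop_id` is a hypothetical name) and asserts its behavior on two hand-written rows:

```python
def drop_id(row):
    """Same per-record logic as the lambda in the ETL above: drop the 'id' key."""
    return {k: v for k, v in row.items() if k != "id"}


rows = [
    {"id": "a", "name": "test_fake_ufl", "text": "hello"},
    {"name": "no_id_here", "text": "world"},  # rows without 'id' pass through
]

cleaned = [drop_id(r) for r in rows]

# No cleaned row keeps the 'id' key, and the input rows are not mutated.
assert all("id" not in r for r in cleaned)
assert rows[0]["id"] == "a"
print(cleaned)
```

Inside the registered ETL the body would then be `data.map(drop_id)`, which keeps the Spark wiring separate from the logic under test.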

## 🌌 Experiments on the data itself
> There is no prescribed way to use this `test`(sample) data. You can do whatever you want with it. Here are some examples.

In [9]:
data.map(lambda x: {**x, 'duck': 'is quarking (physics)'}).take(1)

[{'id': 'b9276a8f-3a0a-4474-9fe5-0e454b3f2449',
  'name': 'test_fake_ufl',
  'text': 'Year main scene husband grow carry range. Tonight himself sell since across.',
  'meta': '{"name": "Kimberly Fields", "age": 41, "address": "625 Cynthia Expressway Suite 971\\nAllisonmouth, MH 13154", "job": "Civil engineer, contracting"}',
  'duck': 'is quarking (physics)'}]
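Another experiment, sketched here on plain dicts so it runs without Spark, is filtering records by a value inside the JSON `meta` string. The helper `younger_than` and the shortened `meta` values are illustrative, not from dataverse:

```python
import json


def younger_than(row, limit):
    """True if the age stored in the row's JSON `meta` string is below `limit`."""
    return json.loads(row["meta"])["age"] < limit


# Two rows modeled on the sample output, with `meta` reduced to name and age.
rows = [
    {"name": "test_fake_ufl", "meta": '{"name": "Kimberly Fields", "age": 41}'},
    {"name": "test_fake_ufl", "meta": '{"name": "Brendan Cunningham", "age": 78}'},
]

kept = [r for r in rows if younger_than(r, 65)]
print(len(kept))  # 1
```

Assuming `data` is a PySpark RDD of dicts, as the `.map`, `.take`, and `.count` calls in this notebook suggest, the equivalent would be `data.filter(lambda x: younger_than(x, 65))`.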