### Data Preparation

##### This notebook prepares the data for model training and inference. We load the [Rotten Tomatoes dataset](https://huggingface.co/datasets/rotten_tomatoes) with Hugging Face's *datasets* library, then create features from it using the [*Tokenizer*](https://huggingface.co/docs/transformers/main_classes/tokenizer) class from the [*transformers*](https://huggingface.co/docs/transformers/index) library. The featurized dataset is then saved as parquet files in the [Microsoft Fabric Lakehouse](https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview).

##### Import the necessary packages.

In [1]:
import os
import pandas as pd

from pyspark.sql.functions import pandas_udf, monotonically_increasing_id
from pyspark.sql.types import IntegerType, ArrayType, StructType, StructField

from transformers import AutoConfig, AutoTokenizer
from datasets import load_dataset

from notebookutils import mssparkutils

##### Here we mount a *Lakehouse* for local access through Python. This would not be necessary if the *Lakehouse* were set as the default for this Notebook, because the *Lakehouse* could then be accessed through its *File API path*.

In [2]:
mssparkutils.fs.mount(
    'abfss://<YOUR FABRIC WORKSPACE NAME>@msit-onelake.dfs.fabric.microsoft.com/<YOUR FABRIC LAKEHOUSE NAME>.Lakehouse/Files/',
    '/lakehouse'
)
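
##### The mount source above follows the pattern `abfss://<workspace>@<host>/<lakehouse>.Lakehouse/Files/`. As a minimal sketch, the hypothetical helper `onelake_files_uri` below (it is not part of `mssparkutils`) assembles that string from its parts, so the same workspace and Lakehouse names can be reused for the Spark paths later in the notebook. The default host is simply the one shown in this notebook and is an assumption; other environments may use a different endpoint.

```python
def onelake_files_uri(workspace: str, lakehouse: str, subdir: str = '',
                      host: str = 'msit-onelake.dfs.fabric.microsoft.com') -> str:
    """Build an abfss URI pointing into the Files area of a Lakehouse.

    Hypothetical helper: it only formats a string following the pattern
    used in this notebook and does not contact OneLake.
    """
    base = f'abfss://{workspace}@{host}/{lakehouse}.Lakehouse/Files/'
    # Strip slashes so 'raw_data', '/raw_data' and 'raw_data/' all behave the same
    return base + subdir.strip('/')


# Placeholder names, to be replaced with real workspace and Lakehouse names
files_root = onelake_files_uri('my_workspace', 'my_lakehouse')
raw_data_uri = onelake_files_uri('my_workspace', 'my_lakehouse', 'raw_data')
print(files_root)
print(raw_data_uri)
```

##### In the notebook, `files_root` would be passed as the first argument to `mssparkutils.fs.mount`.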

##### We load the training, validation, and testing partitions from the *Rotten Tomatoes* dataset and save the corresponding *Dataset* objects as temporary parquet files in the *Lakehouse*. This makes it easier to load them as *Spark DataFrame* objects later. 

In [3]:
ds_train = load_dataset('rotten_tomatoes', split='train')
ds_val = load_dataset('rotten_tomatoes', split='validation')
ds_test = load_dataset('rotten_tomatoes', split='test')

# raw_data_dir = '/lakehouse/default/Files/raw_data' # we could use this default mounted path, if the Lakehouse was set as default for the Notebook
raw_data_dir = os.path.join(mssparkutils.fs.getMountPath('/lakehouse'), 'raw_data')
os.makedirs(raw_data_dir, exist_ok=True)

ds_train.to_parquet(os.path.join(raw_data_dir, 'train_data.parquet'))
ds_val.to_parquet(os.path.join(raw_data_dir, 'val_data.parquet'))
ds_test.to_parquet(os.path.join(raw_data_dir, 'test_data.parquet'))

Generating train split: 8530 examples
Generating validation split: 1066 examples
Generating test split: 1066 examples

135968
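
##### The three load-and-save pairs above can also be written as a loop. The sketch below is one possible arrangement, not the notebook's own code: `SPLIT_FILES`, `split_paths`, and `export_splits` are hypothetical names, and the file names simply mirror those used above. The *datasets* import is deferred so that the path logic can be used on its own.

```python
import os

# Maps each Hugging Face split name to the file name used in this notebook
SPLIT_FILES = {
    'train': 'train_data.parquet',
    'validation': 'val_data.parquet',
    'test': 'test_data.parquet',
}


def split_paths(raw_data_dir: str) -> dict:
    """Return the target parquet path for every split."""
    return {split: os.path.join(raw_data_dir, name) for split, name in SPLIT_FILES.items()}


def export_splits(raw_data_dir: str, dataset_name: str = 'rotten_tomatoes') -> dict:
    """Download each split and write it to parquet.

    Returns whatever `to_parquet` returns for each split (the notebook
    output above shows an integer).
    """
    from datasets import load_dataset  # imported lazily: only needed when exporting

    os.makedirs(raw_data_dir, exist_ok=True)
    written = {}
    for split, path in split_paths(raw_data_dir).items():
        written[split] = load_dataset(dataset_name, split=split).to_parquet(path)
    return written
```

##### With the mount in place, `export_splits(raw_data_dir)` would then replace the six load and save statements.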

##### Here we load the training, validation, and testing partitions as *Spark DataFrame* objects. Because the data is relatively small, we force a repartition into eight partitions so that the *Pandas UDF* applied later has several partitions to process in parallel.

In [4]:
num_partitions = 8

# raw_data_dir_spark = 'Files/raw_data' # we could use this relative path for Spark access, if the Lakehouse was set as default for the Notebook
raw_data_dir_spark = 'abfss://<YOUR FABRIC WORKSPACE NAME>@msit-onelake.dfs.fabric.microsoft.com/<YOUR FABRIC LAKEHOUSE NAME>.Lakehouse/Files/raw_data'

sdf_train = spark.read.parquet(os.path.join(raw_data_dir_spark, 'train_data.parquet')).repartition(num_partitions)
sdf_val = spark.read.parquet(os.path.join(raw_data_dir_spark, 'val_data.parquet')).repartition(num_partitions)
sdf_test = spark.read.parquet(os.path.join(raw_data_dir_spark, 'test_data.parquet')).repartition(num_partitions)

display(sdf_train.limit(10))
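
##### To get a feel for what the repartition does to a dataset of this size, the sketch below models an even, round-robin style distribution of rows in plain Python. It is a simplified illustration and not Spark's implementation; `round_robin_sizes` is a hypothetical helper, and the row count is the size of the Rotten Tomatoes training split.

```python
def round_robin_sizes(num_rows: int, num_partitions: int) -> list:
    """Rows per partition when rows are dealt out one at a time, in turn."""
    base, extra = divmod(num_rows, num_partitions)
    # The first `extra` partitions receive one additional row
    return [base + 1 if i < extra else base for i in range(num_partitions)]


# 8530 is the size of the Rotten Tomatoes training split
sizes = round_robin_sizes(8530, 8)
print(sizes)  # [1067, 1067, 1066, 1066, 1066, 1066, 1066, 1066]
```

##### Each partition therefore holds roughly a thousand reviews. In the notebook itself, `sdf_train.rdd.getNumPartitions()` reports the actual partition count.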

##### Next we define a function to create the features needed by the model. We choose the [*RoBERTa*](https://huggingface.co/docs/transformers/model_doc/roberta) model, which is [available as a pre-trained model](https://huggingface.co/roberta-base) from the Hugging Face model catalog. We use Hugging Face's [*Auto Classes*](https://huggingface.co/docs/transformers/model_doc/auto#auto-classes) to instantiate the appropriate *Tokenizer* object for the chosen model.

##### The *tokenization* process is performed in parallel, over the *Spark DataFrame*, using Spark's [*Pandas UDF*](https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs) functionality.

In [5]:
model_type = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_type)
max_length = 128  # every review is padded or truncated to this many tokens

# The UDF returns a struct column with one integer array per model input
schema = StructType([
    StructField('input_ids', ArrayType(IntegerType())),
    StructField('attention_mask', ArrayType(IntegerType()))
])

@pandas_udf(schema)
def tokenize(text: pd.Series) -> pd.DataFrame:
    # Called once per batch of rows; `tokenizer` and `max_length` are captured from the enclosing scope
    tokens = tokenizer(text.tolist(), max_length=max_length, truncation=True, padding='max_length')
    return pd.DataFrame({'input_ids': tokens['input_ids'], 'attention_mask': tokens['attention_mask']})
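
##### The effect of `truncation=True` together with `padding='max_length'` can be illustrated without the tokenizer. The sketch below is a simplified pure-Python model of single-sentence encoding, not the *transformers* implementation: `pad_or_truncate` is a hypothetical helper, the special token ids are those of the `roberta-base` vocabulary, and the content ids are arbitrary numbers chosen for illustration.

```python
# Special token ids of the roberta-base vocabulary: <s>, <pad>, </s>
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def pad_or_truncate(content_ids: list, max_length: int = 128) -> dict:
    """Mimic single-sentence encoding with truncation=True, padding='max_length'."""
    # Reserve two positions for the <s> and </s> special tokens
    kept = content_ids[:max_length - 2]
    input_ids = [BOS_ID] + kept + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    # Right-pad up to max_length; padded positions get mask 0
    n_pad = max_length - len(input_ids)
    return {
        'input_ids': input_ids + [PAD_ID] * n_pad,
        'attention_mask': attention_mask + [0] * n_pad,
    }


short = pad_or_truncate([100, 200, 300], max_length=8)
print(short['input_ids'])       # [0, 100, 200, 300, 2, 1, 1, 1]
print(short['attention_mask'])  # [1, 1, 1, 1, 1, 0, 0, 0]

long = pad_or_truncate(list(range(10, 30)), max_length=8)
print(long['input_ids'])        # [0, 10, 11, 12, 13, 14, 15, 2]
```

##### Every row thus ends up with two arrays of identical, fixed length, which is what allows the features to be stored in regular array columns.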

##### The *tokenization* function defined above is then applied to the *text* column of each *Spark DataFrame*. It returns the two inputs needed by the model, *input_ids* and *attention_mask*, packed into a single struct column named *tokens*.

In [6]:
sdf_tokens_train = sdf_train.select('label', tokenize('text').alias('tokens'))
sdf_tokens_val = sdf_val.select('label', tokenize('text').alias('tokens'))
sdf_tokens_test = sdf_test.select('label', tokenize('text').alias('tokens'))

display(sdf_tokens_train.limit(10))

##### We split the *tokenization* results into separate columns in the corresponding *Spark DataFrame* objects. This is our *featurized* data.

In [7]:
sdf_tokens_train = sdf_tokens_train.withColumns({'input_ids': sdf_tokens_train['tokens'].getItem('input_ids'),
                                                 'attention_mask': sdf_tokens_train['tokens'].getItem('attention_mask')}).drop('tokens')

sdf_tokens_val = sdf_tokens_val.withColumns({'input_ids': sdf_tokens_val['tokens'].getItem('input_ids'),
                                             'attention_mask': sdf_tokens_val['tokens'].getItem('attention_mask')}).drop('tokens')

sdf_tokens_test = sdf_tokens_test.withColumns({'input_ids': sdf_tokens_test['tokens'].getItem('input_ids'),
                                               'attention_mask': sdf_tokens_test['tokens'].getItem('attention_mask')}).drop('tokens')

display(sdf_tokens_train.limit(10))
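
##### As an alternative to the `withColumns` calls above (a method available from Spark 3.3), the struct fields can be promoted with a single `select`, which also avoids repeating the same expression three times. This is a sketch rather than the notebook's own approach; `flatten_tokens` is a hypothetical helper and must be applied to a *DataFrame* that still contains the *tokens* struct column, that is, in place of the cell above rather than after it.

```python
def flatten_tokens(sdf):
    """Promote the fields of the 'tokens' struct column to top-level columns."""
    # 'tokens.input_ids' addresses a struct field; the resulting column is named 'input_ids'
    return sdf.select('label', 'tokens.input_ids', 'tokens.attention_mask')
```

##### Usage would look like `sdf_tokens_train = flatten_tokens(sdf_tokens_train)`, and likewise for the validation and testing partitions.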

##### We then write the *featurized* data as parquet files in the *Lakehouse*.

In [8]:
# prepared_data_dir_spark = 'Files/prepared_data' # we could use this relative path for Spark access, if the Lakehouse was set as default for the Notebook
prepared_data_dir_spark = 'abfss://<YOUR FABRIC WORKSPACE NAME>@msit-onelake.dfs.fabric.microsoft.com/<YOUR FABRIC LAKEHOUSE NAME>.Lakehouse/Files/prepared_data'

sdf_tokens_train.write.parquet(os.path.join(prepared_data_dir_spark, 'tokens_train_data'), mode='overwrite')
sdf_tokens_val.write.parquet(os.path.join(prepared_data_dir_spark, 'tokens_val_data'), mode='overwrite')
sdf_tokens_test.write.parquet(os.path.join(prepared_data_dir_spark, 'tokens_test_data'), mode='overwrite')
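
##### A quick sanity check on a sample of rows can catch a malformed feature before it reaches training. The sketch below assumes the layout produced in this notebook (a binary *label* plus two arrays of length 128); `check_features` is a hypothetical helper, and the example row is hand-made with arbitrary token ids.

```python
def check_features(row, max_length: int = 128) -> None:
    """Raise ValueError if a featurized row does not have the expected shape."""
    ids, mask = row['input_ids'], row['attention_mask']
    if len(ids) != max_length or len(mask) != max_length:
        raise ValueError(f'expected {max_length} tokens, got {len(ids)} ids and {len(mask)} mask entries')
    if set(mask) - {0, 1}:
        raise ValueError('attention_mask must contain only 0 and 1')
    if row['label'] not in (0, 1):
        raise ValueError(f"unexpected label {row['label']!r}")


# A hand-made row in the same layout as the featurized DataFrame
example = {'label': 1, 'input_ids': [0, 713, 2] + [1] * 125, 'attention_mask': [1] * 3 + [0] * 125}
check_features(example)  # passes silently
```

##### In the notebook, the same function could be run over a small sample, for example `for row in sdf_tokens_train.limit(100).collect(): check_features(row)`.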

##### Unmount the *Lakehouse* local path.

In [9]:
mssparkutils.fs.unmount('/lakehouse')

True
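
##### Because the mount and unmount calls sit in separate cells, a failure in between leaves the *Lakehouse* mounted. One way to guard against that, sketched below under the assumption that all file work happens in one place, is a small context manager. `mounted` is a hypothetical helper; it only relies on the `mount`, `getMountPath`, and `unmount` calls already used in this notebook, with `fs` standing for `mssparkutils.fs`.

```python
from contextlib import contextmanager


@contextmanager
def mounted(fs, source: str, mount_point: str):
    """Mount `source` at `mount_point` and always unmount on exit."""
    fs.mount(source, mount_point)
    try:
        # Hand the local path of the mount back to the caller
        yield fs.getMountPath(mount_point)
    finally:
        # Runs even if the body of the `with` block raises
        fs.unmount(mount_point)
```

##### Usage would look like `with mounted(mssparkutils.fs, source_uri, '/lakehouse') as local_root:`, where `source_uri` is the abfss URI of the *Lakehouse* and the block writes the raw parquet files under `local_root`.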