## Importing Libraries and Modules

In this cell, we import the libraries and modules required for the task:

- **pandas**: For data manipulation and analysis.
- **transformers**: Provides the `T5Tokenizer` and `T5ForConditionalGeneration` classes for tokenizing text and generating predictions with the T5 model, plus `Trainer`, `TrainingArguments`, and `TrainerCallback` for fine-tuning.
- **datasets**: Provides the `Dataset` and `DatasetDict` classes for handling datasets.
- **numpy**: For numerical operations.

These libraries and modules are used for data processing, model training, and inference.


In [6]:
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, TrainerCallback
from datasets import Dataset, DatasetDict
import numpy as np

## Reading Data from JSONL Files

In this cell, we read the JSON Lines (JSONL) files into pandas DataFrames with `pd.read_json(..., lines=True)`. Each file is loaded in full from `./input_data/`:

- **Training Data**: `attribute_train.data` and `attribute_train.solution`.
- **Testing Data**: `attribute_test.data` (no solution file is read for the test split).
- **Validation Data**: `attribute_val.data` and `attribute_val.solution`.

The `.info()` calls that follow show that every split has the same four columns (`indoml_id`, `title`, `store`, `details_Manufacturer`) and that `details_Manufacturer` is the only column with missing values.
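
As a minimal sketch of how `pd.read_json` treats JSON Lines input, the snippet below parses two records from an in-memory string instead of a file. The record values are hypothetical; only the column names are taken from this notebook.

```python
import io
import pandas as pd

# Two hypothetical JSONL records using the column names seen in this notebook.
jsonl_text = (
    '{"indoml_id": 0, "title": "Example Pot Hook", "store": "Acme", "details_Manufacturer": "Acme"}\n'
    '{"indoml_id": 1, "title": "Example Helmet", "store": "Acme", "details_Manufacturer": null}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(df.shape)                                   # rows, columns
print(df["details_Manufacturer"].isna().sum())    # JSON null becomes a missing value
```

A JSON `null` is read as a missing value, which is why `details_Manufacturer` reports fewer non-null entries than the other columns.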


In [7]:
train_data = pd.read_json('./input_data/attribute_train.data', lines=True)
train_solution = pd.read_json('./input_data/attribute_train.solution', lines=True)
test_data = pd.read_json('./input_data/attribute_test.data', lines=True)
val_data = pd.read_json('./input_data/attribute_val.data', lines=True)
val_solution = pd.read_json('./input_data/attribute_val.solution', lines=True)

In [8]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 443499 entries, 0 to 443498
Data columns (total 4 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   indoml_id             443499 non-null  int64 
 1   title                 443499 non-null  object
 2   store                 443499 non-null  object
 3   details_Manufacturer  437775 non-null  object
dtypes: int64(1), object(3)
memory usage: 13.5+ MB


In [9]:
val_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95035 entries, 0 to 95034
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   indoml_id             95035 non-null  int64 
 1   title                 95035 non-null  object
 2   store                 95035 non-null  object
 3   details_Manufacturer  93802 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.9+ MB


In [10]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95036 entries, 0 to 95035
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   indoml_id             95036 non-null  int64 
 1   title                 95036 non-null  object
 2   store                 95036 non-null  object
 3   details_Manufacturer  93816 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.9+ MB


In [11]:
total_data = pd.concat([train_data, val_data, test_data], axis=0)
total_data = total_data.reset_index(drop=True)
total_data

Unnamed: 0,indoml_id,title,store,details_Manufacturer
0,0,"Enclume Angled Pot Hook, Set of 6, Use with Po...",Enclume,Enclume
1,1,Schutt Vengeance DCT Hybrid Youth Football H,Schutt,Schutt
2,2,Easton 2014 MAKO SL14MK9 Baseball Bat (-9),Easton,"Easton Sports, Inc."
3,3,Bilstein B46-0929 Heavy-Duty Gas Shock Absorber,Bilstein,Bilstein
4,4,Apple Red Cardstock - 8.5 x 11 inch - 65Lb Cov...,Clear Path Paper,Clear Path Paper
...,...,...,...,...
633565,95031,Discraft Avenger SS Elite Z Golf Disc,Discraft,Discraft
633566,95032,ProLume Prolumeme F30T12DL 9236 F30 T12 Daylight,Prolume,Halco Lighting Technologies
633567,95033,"Nearly Natural 4842-S3 Colorful Cactus, Set of 3",Nearly Natural,Nearly Natural
633568,95034,Gorilla Automotive 90038 Acorn Bulge Open End ...,GORILLA,Gorilla Automotive


In [12]:
nonan_data = total_data.dropna()
nonan_data = nonan_data.reset_index(drop=True)
nan_data = total_data[total_data.isna().any(axis=1)]
nan_data = nan_data.reset_index(drop=True)

In [13]:
print(len(nonan_data))
print(len(nan_data))
print(len(nonan_data)+len(nan_data))

625393
8177
633570
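
The three counts above show that `nonan_data` and `nan_data` partition `total_data`: every row lands in exactly one of them. The same split can be sketched on a hypothetical three-row frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one missing manufacturer.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "store": ["s1", "s2", "s3"],
    "details_Manufacturer": ["m1", None, "m3"],
})

# Rows with no missing values at all.
complete = toy.dropna().reset_index(drop=True)
# Rows with at least one missing value (the complement of dropna()).
missing = toy[toy.isna().any(axis=1)].reset_index(drop=True)

print(len(complete), len(missing), len(complete) + len(missing))
```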


In [14]:
nonan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 625393 entries, 0 to 625392
Data columns (total 4 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   indoml_id             625393 non-null  int64 
 1   title                 625393 non-null  object
 2   store                 625393 non-null  object
 3   details_Manufacturer  625393 non-null  object
dtypes: int64(1), object(3)
memory usage: 19.1+ MB


In [15]:
nan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8177 entries, 0 to 8176
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   indoml_id             8177 non-null   int64 
 1   title                 8177 non-null   object
 2   store                 8177 non-null   object
 3   details_Manufacturer  0 non-null      object
dtypes: int64(1), object(3)
memory usage: 255.7+ KB


In [16]:
tstdata = nan_data.drop(['details_Manufacturer'], axis=1)

In [17]:
tstdata

Unnamed: 0,indoml_id,title,store
0,27,Munchkin Dora the Explorer Click Lock Insulate...,Munchkin
1,78,"Carter's Forest Friends Fitted Sheet, Tan/Choc...",Carter's
2,139,VP-Zoom Sportscope - 4x9 Magnification - Manuf...,XGATML
3,192,Seventh Generation Baby Free & Clear Overnight...,Seventh Generation
4,303,Operandi 12 Pack Melon Cradle Supports Durable...,Aeiniwer
...,...,...,...
8172,94854,BIBS Baby Pacifier | BPA-Free Natural Rubber |...,Bibs
8173,94893,Jackson Pro Series Soloist SL2P MAH Electric G...,Jackson
8174,94915,Wallmonkeys WM249565 Baby Two-Toed Sloth 4 Mon...,Wallmonkeys
8175,94917,"Dr. Brown's Training Cup, Blue",Dr. Brown's


In [18]:
soln = pd.DataFrame(nonan_data['details_Manufacturer'].copy())
data = nonan_data.drop(['details_Manufacturer', 'indoml_id'], axis=1)

In [19]:
from sklearn.model_selection import train_test_split

trdata, valdata, trsoln, valsoln = train_test_split(data, soln, test_size=0.02)

In [20]:
trdata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 612885 entries, 549316 to 156852
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   title   612885 non-null  object
 1   store   612885 non-null  object
dtypes: object(2)
memory usage: 14.0+ MB


In [21]:
len(trdata)

612885

In [22]:
valdata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12508 entries, 299645 to 218401
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   12508 non-null  object
 1   store   12508 non-null  object
dtypes: object(2)
memory usage: 293.2+ KB


In [23]:
len(valdata)

12508

In [24]:
trsoln.info()

<class 'pandas.core.frame.DataFrame'>
Index: 612885 entries, 549316 to 156852
Data columns (total 1 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   details_Manufacturer  612885 non-null  object
dtypes: object(1)
memory usage: 9.4+ MB


In [25]:
len(trsoln)

612885

In [26]:
valsoln.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12508 entries, 299645 to 218401
Data columns (total 1 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   details_Manufacturer  12508 non-null  object
dtypes: object(1)
memory usage: 195.4+ KB


In [27]:
len(valsoln)

12508

In [28]:
trdata.iloc[0]

title    BOSCH BP1397 QuietCast Premium Semi-Metallic D...
store                                                BOSCH
Name: 549316, dtype: object

## Data Preprocessing and Formatting

In this cell, we define the helpers `combine_infos`, `preprocess_data`, and `preprocess_target`, which turn the feature and label DataFrames into `input_text` and `target_text` columns. Both `preprocess_*` functions also print the per-column null counts as a sanity check:

- **`input_text`**: Built by `combine_infos` from the product title and store. The manufacturer is not part of the input because it is the value to predict.
- **`target_text`**: Built by `preprocess_target` in the form `details_Manufacturer: <value>`.

### Data Processing

We apply `preprocess_data` to the training (`trdata`), validation (`valdata`), and test (`tstdata`) features, and `preprocess_target` to the training and validation labels. The test set consists of the rows whose `details_Manufacturer` is missing, so it has no `target_text`.

Finally, the processed data is converted into the Hugging Face `Dataset` format using `Dataset.from_pandas` for model training and inference.
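
The text pairs can also be built with column-wise string concatenation rather than a row-wise `apply`, which is typically faster on a frame of several hundred thousand rows. The sketch below is an alternative to the notebook's helpers, not the code it runs, and the single row is hypothetical:

```python
import pandas as pd

# Hypothetical one-row frame with the same columns as the notebook's data.
toy = pd.DataFrame({
    "title": ["Acme Mug"],
    "store": ["Acme Store"],
    "details_Manufacturer": ["Acme Corp"],
})

# Vectorized equivalent of combine_infos / preprocess_target.
pairs = pd.DataFrame({
    "input_text": "The title of the product is " + toy["title"]
                  + ". This product was bought at the store " + toy["store"] + ".",
    "target_text": "details_Manufacturer: " + toy["details_Manufacturer"],
})

print(pairs.iloc[0]["input_text"])
print(pairs.iloc[0]["target_text"])
```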


In [29]:
def combine_infos(row):
    text = 'The title of the product is ' + row['title'] + '. This product was bought at the store ' + row['store'] + '.'
    return text

def preprocess_data(data):
    print(data.isnull().sum())
    df = pd.DataFrame()
    df['input_text'] = data.apply(combine_infos, axis=1)

    return df

def preprocess_target(solution):
    print(solution.isnull().sum())
    df = pd.DataFrame()
    df['target_text'] = solution.apply(lambda row: f"details_Manufacturer: {row['details_Manufacturer']}", axis=1)

    return df

trftrs_processed = preprocess_data(trdata)
trtgt_processed = preprocess_target(trsoln)
valftrs_processed = preprocess_data(valdata)
valtgt_processed = preprocess_target(valsoln)
tstftrs_processed = preprocess_data(tstdata)

title    0
store    0
dtype: int64
details_Manufacturer    0
dtype: int64
title    0
store    0
dtype: int64
details_Manufacturer    0
dtype: int64
indoml_id    0
title        0
store        0
dtype: int64


In [30]:
trftrs_processed

Unnamed: 0,input_text
549316,The title of the product is BOSCH BP1397 Quiet...
512557,The title of the product is Power Stop K1797-3...
258568,The title of the product is Lexmark 10N0227 In...
289041,The title of the product is Skinit Decal Audio...
548613,The title of the product is Rainlemon Thanksgi...
...,...
593474,"The title of the product is Shimano STC Spin, ..."
572174,The title of the product is Shark Navigator Up...
245751,The title of the product is Fisch Brad Point D...
434984,The title of the product is Lamy Logo M + 204 ...


In [31]:
trftrs_processed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 612885 entries, 549316 to 156852
Data columns (total 1 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   input_text  612885 non-null  object
dtypes: object(1)
memory usage: 9.4+ MB


In [32]:
trtgt_processed

Unnamed: 0,target_text
549316,details_Manufacturer: Bosch
512557,details_Manufacturer: Power Stop
258568,"details_Manufacturer: LEXMARK INT'L, INC."
289041,details_Manufacturer: Skinit
548613,details_Manufacturer: Rainlemon
...,...
593474,details_Manufacturer: Shimano
572174,details_Manufacturer: SharkNinja
245751,details_Manufacturer: Affinity Tools
434984,details_Manufacturer: Lamy GmbH


In [33]:
trtgt_processed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 612885 entries, 549316 to 156852
Data columns (total 1 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   target_text  612885 non-null  object
dtypes: object(1)
memory usage: 9.4+ MB


In [34]:
# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(pd.concat([trftrs_processed, trtgt_processed], axis=1))
val_dataset = Dataset.from_pandas(pd.concat([valftrs_processed, valtgt_processed], axis=1))
test_dataset = Dataset.from_pandas(tstftrs_processed)

In [35]:
train_dataset[:5]

{'input_text': ['The title of the product is BOSCH BP1397 QuietCast Premium Semi-Metallic Disc Brake Pad Set - Compatible With Select Hyundai Elantra; Kia Forte, Forte Koup, Forte5, Soul; FRONT. This product was bought at the store BOSCH.',
  'The title of the product is Power Stop K1797-36 Front and Rear Z36 Truck & Tow Brake Kit, Carbon Fiber Ceramic Brake Pads and Drilled/Slotted Brake Rotors. This product was bought at the store Power Stop.',
  'The title of the product is Lexmark 10N0227 Ink Cartridge (27), Tri-Color - in Retail Packaging. This product was bought at the store Lexmark.',
  'The title of the product is Skinit Decal Audio Skin Compatible with Apple AirPods with Lightning Charging Case - Officially Licensed NFL Kansas City Chiefs Red Performance Series Design. This product was bought at the store Skinit.',
  'The title of the product is Rainlemon Thanksgiving Day Happy Fall Halloween Harvest Pumpkin Burlap Banner Garland Bunting Home Party Decoration. This product was

## Creating Dataset Dictionary

In this cell, we create a `DatasetDict` to organize the processed datasets for training and validation. The `DatasetDict` is a convenient way to manage multiple datasets in Hugging Face's `datasets` library.

- **`train`**: Contains the training dataset (`train_dataset`).
- **`validation`**: Contains the validation dataset (`val_dataset`).

The test dataset (`test_dataset`) is kept as a separate `Dataset` because it has no `target_text`. Because the source DataFrames keep their shuffled pandas index, each split also carries an `__index_level_0__` column, visible in the output of the next cell.


In [36]:
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset
})

In [37]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['input_text', 'target_text', '__index_level_0__'],
        num_rows: 612885
    })
    validation: Dataset({
        features: ['input_text', 'target_text', '__index_level_0__'],
        num_rows: 12508
    })
})

## Loading the T5 Model and Tokenizer

In this cell, we load the T5 model and tokenizer from the Hugging Face `transformers` library:

- **`T5Tokenizer`**: Tokenizer for converting text into tokens and vice versa, loaded from the `google/flan-t5-small` pre-trained checkpoint.
- **`T5ForConditionalGeneration`**: T5 model for sequence-to-sequence tasks, loaded from the same `google/flan-t5-small` checkpoint.

These components will be used for encoding the input text, generating predictions, and decoding the output text.


In [38]:
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

## Tokenizing the Dataset

In this cell, we define `preprocess_trval` to tokenize the `input_text` and `target_text` using the T5 tokenizer:

- **`inputs`**: The raw input texts of the batch.
- **`targets`**: The raw target texts of the batch.
- **`model_inputs`**: The tokenized inputs, padded or truncated to 128 tokens, with the token IDs of the targets (also padded or truncated to 128 tokens) stored under `labels`.

`preprocess_trval` is applied to the whole `DatasetDict` using the `map` method with `batched=True`, so the tokenizer processes many examples per call.

The result, `tokenized_datasets`, is a `DatasetDict` containing the tokenized train and validation splits. It is saved to disk and reloaded so that tokenization does not have to be repeated in a later session.
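
One detail worth noting about `preprocess_trval`: the labels are padded to a fixed length, and the padded positions keep the pad token ID, so they count toward the loss. Hugging Face sequence-to-sequence models ignore label positions equal to `-100`, and a common convention is to replace label padding with that value. The sketch below shows the replacement on plain lists. It is an optional variation, not something the notebook does, and `pad_token_id=0` is an assumption here (in practice one would pass `tokenizer.pad_token_id`).

```python
def mask_label_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding positions in each label sequence with ignore_index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Hypothetical label IDs, padded with 0 to a common length.
labels = [[4, 17, 1, 0, 0], [9, 1, 0, 0, 0]]
masked = mask_label_padding(labels, pad_token_id=0)
print(masked)
```

Inside a tokenization function, the masked lists would be assigned to `model_inputs['labels']` in place of `labels['input_ids']`.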


In [39]:
def preprocess_trval(examples):
    inputs = examples['input_text']
    targets = examples['target_text']
    
    model_inputs = tokenizer(inputs, max_length=128, padding='max_length', truncation=True)
    labels = tokenizer(targets, max_length=128, padding='max_length', truncation=True)
    model_inputs['labels'] = labels['input_ids']

    return model_inputs

tokenized_datasets = dataset_dict.map(preprocess_trval, batched=True)

In [40]:
tokenized_datasets

In [41]:
tokenized_datasets.save_to_disk('./impute_Input_tokens/')

In [42]:
from datasets import load_from_disk

tokenized_datasets = load_from_disk('./impute_Input_tokens/')

In [43]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_text', 'target_text', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 612885
    })
    validation: Dataset({
        features: ['input_text', 'target_text', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 12508
    })
})

## Configuring Training Arguments

In this cell, we set up the `TrainingArguments` for training the T5 model using the Hugging Face `Trainer`:

- **`output_dir`**: Directory to save the model checkpoints and results, set to `./impute_fT5small`.
- **`eval_strategy`**: Strategy for evaluation, set to `'epoch'`, meaning evaluation occurs at the end of each epoch.
- **`eval_accumulation_steps`**: Number of evaluation steps whose outputs are accumulated before being moved to the CPU, set to `32`.
- **`save_strategy`**: Strategy for checkpointing, set to `'steps'`, so checkpoints are written at a fixed step interval.
- **`save_total_limit`**: Limit on the number of checkpoints to keep, set to `1`.
- **`learning_rate`**: Learning rate for optimization, set to `2e-3`.
- **`per_device_train_batch_size`**: Batch size for training, set to `16`.
- **`per_device_eval_batch_size`**: Batch size for evaluation, set to `16`.
- **`num_train_epochs`**: Number of training epochs, set to `2`.
- **`fp16`**: Enables 16-bit mixed-precision training, set to `True`.
- **`weight_decay`**: Weight decay for regularization, set to `0.01`.
- **`logging_dir`**: Directory for logging information, set to `./logs`.
- **`logging_first_step`**: Whether to log the first training step, set to `True`.
- **`logging_steps`**: Frequency of logging, set to every 20 steps.
- **`report_to`**: Reporting target, set to `'tensorboard'`.

These arguments control checkpointing, evaluation, optimization, and logging during training.
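
To get a feel for the scale these settings imply, the following sketch works out the number of optimizer steps and log events from the dataset sizes reported earlier (612,885 training rows, 12,508 validation rows). It assumes a single device, no gradient accumulation, and that the last partial batch is kept; the variable names are illustrative only.

```python
import math

train_rows, val_rows = 612_885, 12_508
batch_size, epochs, logging_steps = 16, 2, 20

# Batches per epoch, keeping the final partial batch.
train_steps_per_epoch = math.ceil(train_rows / batch_size)
total_train_steps = train_steps_per_epoch * epochs

# Evaluation batches per pass over the validation split.
eval_steps = math.ceil(val_rows / batch_size)

# Log events triggered by logging_steps alone (logging_first_step adds one more).
log_events = total_train_steps // logging_steps

print(train_steps_per_epoch, total_train_steps, eval_steps, log_events)
```

Under those assumptions the run takes 38,306 steps per epoch and 76,612 steps in total, with 782 evaluation batches per epoch.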


In [44]:
training_args = TrainingArguments(
    output_dir='./impute_fT5small',
    eval_strategy='epoch',
    eval_accumulation_steps=32,
    save_strategy='steps',
    save_total_limit=1,
    learning_rate=2e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    fp16=True,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_first_step=True,
    logging_steps=20,
    report_to='tensorboard'
)

## Defining a Custom Callback for Logging

In this cell, we define a custom callback class `CustomCallback` that extends `TrainerCallback` from the Hugging Face `transformers` library:

- **`on_log` Method**: This method is triggered during the training process whenever logging occurs. It prints:
  - The current training step (`state.global_step`).
  - Each key-value pair in the `logs` dictionary.

This custom callback allows for detailed logging of training progress and metrics directly to the console, providing real-time feedback during the training process.


In [45]:
class CustomCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            print(f"Step: {state.global_step}")
            for key, value in logs.items():
                print(f"{key}: {value}")
            print("\n")

## Training the Model

In this cell, we initialize and run the `Trainer` for training the T5 model:

- **`model`**: The T5 model to be trained.
- **`args`**: The `TrainingArguments` specified in the previous cell.
- **`train_dataset`**: The tokenized training dataset.
- **`eval_dataset`**: The tokenized validation dataset.
- **`callbacks`**: The list of callbacks to use during training, including the custom `CustomCallback` defined earlier.

After setting up the `Trainer`, we call `trainer.train()` to start the training process. The custom callback will print detailed logging information during training.


In [46]:
import warnings

warnings.filterwarnings("ignore")

In [47]:
import torch

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[CustomCallback()]
)

torch.cuda.empty_cache()
trainer.train()

## Saving the Fine-Tuned Imputer Model

In this cell, we save the fine-tuned T5 model and tokenizer to a specified directory:

- **`model.save_pretrained('./finetuned_fT5_impute')`**: Saves the trained T5 model to the directory `./finetuned_fT5_impute`. This allows you to load the model later without retraining.

- **`tokenizer.save_pretrained('./finetuned_fT5_impute')`**: Saves the tokenizer associated with the T5 model to the same directory. This ensures that you can use the same tokenizer for encoding and decoding text during inference.

Saving both the model and tokenizer ensures that you can resume work or deploy the model in the future with consistent results.


In [49]:
model.save_pretrained('./finetuned_fT5_impute')
tokenizer.save_pretrained('./finetuned_fT5_impute')

## Loading the Fine-Tuned Model and Tokenizer

In this cell, we load the fine-tuned T5 model and tokenizer from `./finetuned_fT5_impute` and set up the environment for inference on the test set:

- **`device`**: Determines whether to use a GPU (`cuda`) or CPU for computation based on availability.

- **`model`**: Loads the fine-tuned T5 model and moves it to the appropriate device (`cuda` or `cpu`).

- **`tokenizer`**: Loads the tokenizer associated with the fine-tuned T5 model.

- **`test_data`**: Rebound to the list of `input_text` strings from `test_dataset`; it no longer refers to the raw test DataFrame loaded earlier.

The model is set to evaluation mode with `model.eval()`, preparing it for generating predictions.

### Functions

- **`generate_text(inputs)`**: Takes a batch of input texts, tokenizes them with padding and truncation to at most 352 tokens, and generates up to 128 tokens per example with the fine-tuned model. It returns the generated texts after decoding them from token IDs. Note that training truncated inputs at 128 tokens, so the inference limit is longer than what the model saw during fine-tuning.

- **`extract_details(text)`**: Extracts the value that follows `details_Manufacturer:` in the generated text using a regular expression. It returns `'na'` if the pattern is not found.

- **`clean_repeated_patterns(text)`**: Cuts the text at the first occurrence of ` L4_category`. It is not called in the cells shown here.

`generate_text` and `extract_details` are used in the prediction loop that follows.
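
As a quick illustration of how the `extract_details` function from the code cell behaves, the sketch below reproduces it in isolation and runs it on two strings: one in the expected `details_Manufacturer: ...` format and one without the prefix.

```python
import re

def extract_details(text):
    # Capture everything after the prefix up to the end of the line.
    match = re.search(r'details_Manufacturer:\s*(.+)', text)
    if match:
        return match.group(1).strip()
    return 'na'

with_prefix = extract_details("details_Manufacturer: Dr. Brown's")
without_prefix = extract_details("Dr. Brown's")

print(with_prefix)
print(without_prefix)
```

If the model omits the prefix, the prediction is discarded as `'na'` even when the remaining text is a plausible manufacturer name, so the share of `'na'` results is worth checking after generation.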


In [71]:
import re
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = T5ForConditionalGeneration.from_pretrained('./finetuned_fT5_impute').to(device)
tokenizer = T5Tokenizer.from_pretrained('./finetuned_fT5_impute')

model.eval()

test_data = test_dataset['input_text']

def generate_text(inputs):
    inputs = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True, truncation=True, max_length=352)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128)

    generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return generated_texts

def extract_details(text):
    pattern = r'details_Manufacturer:\s*(.+)'
    match = re.search(pattern, text)
    if match:
        return match.group(1).strip()
    return 'na'

def clean_repeated_patterns(text):
    cleaned_data = text.split(' L4_category')[0]
    return cleaned_data

In [72]:
test_data

['The title of the product is Munchkin Dora the Explorer Click Lock Insulated Straw Cups BPA Free 9oz 266ml - 3pk. This product was bought at the store Munchkin.',
 "The title of the product is Carter's Forest Friends Fitted Sheet, Tan/Choc, 28 X 52 (Discontinued by Manufacturer). This product was bought at the store Carter's.",
 'The title of the product is VP-Zoom Sportscope - 4x9 Magnification - Manufactured by Mickelson Group - Patent Protected. This product was bought at the store XGATML.',
 'The title of the product is Seventh Generation Baby Free & Clear Overnight Diapers, Size 6 (68 Count). This product was bought at the store Seventh Generation.',
 'The title of the product is Operandi 12 Pack Melon Cradle Supports Durable Plastic with 3 Sturdy Legs 21.5 X 20 cm Includes 5 Metres of Soft Garden Plant Ties. This product was bought at the store Aeiniwer.',
 'The title of the product is Halo Pink Dot Butterfly Sleepsack Wearable Baby Blanket, Micro-Fleece, Small. This product was

## Generating Predictions and Extracting Details

In this cell, we process the test data in batches to generate predictions and extract the predicted manufacturer:

- **`batch_size`**: The number of samples processed in each batch, set to `64`.

- **`generated_details`**: List to store the details extracted from the generated texts.

### Processing Loop

We iterate over the test data in batches:
1. **Batch Extraction**: For each batch of inputs, we generate predictions using the `generate_text` function.
2. **Details Extraction**: For each generated text, we extract the manufacturer using the `extract_details` function and append it to `generated_details`.

**Note**: The test set has no labels, so no target details are extracted or compared in this cell. With 8,177 test rows and a batch size of 64, the loop runs 128 iterations, which matches the progress bar in the output.

Finally, a message is printed to indicate that the extraction of generated information is complete.
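
The slicing pattern behind the loop can be sketched without the model. The helper `iter_batches` below is a hypothetical name introduced for illustration; it uses the `batch_size = 64` from the code cell and the 8,177 test rows reported earlier.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for the 8,177 test inputs.
items = list(range(8177))
batches = list(iter_batches(items, 64))

print(len(batches))       # number of loop iterations
print(len(batches[-1]))   # size of the final, partial batch
```

The last batch holds the 49 leftover rows; slicing past the end of a list simply returns the remaining items, so no special handling is needed.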


In [73]:
batch_size = 64
generated_details = []

for i in tqdm(range(0, len(test_data), batch_size), desc="Processing test data"):
    batch_inputs = test_data[i:i+batch_size]

    generated_texts = generate_text(batch_inputs)

    for generated_text in generated_texts:
        generated_details.append(extract_details(generated_text))

print('Generated info extracted.............')

Processing test data: 100%|██████████| 128/128 [00:43<00:00,  2.92it/s]

Generated info extracted.............





In [74]:
generated_text

"details_Manufacturer: Dr. Brown's"

In [75]:
generated_details

['The Holster Store',
 "Carter's",
 'XGATML',
 'SAMSUNG',
 'Aomeiwer',
 'EATON',
 'Seymour Duncan',
 'Odyssey',
 'SwaddleDesigns',
 'Huggies',
 'Stupell Industries',
 'AIRWAY',
 'CHAUVET DJ',
 'decalmile',
 'Aden + Anais',
 'Pampers',
 'EATON',
 'EPtech',
 'Nuby',
 'Jackson',
 'Hohner',
 'Pampers',
 'Jackson',
 'GRAPHICS & MORE',
 'OMG SPORTS',
 'Thirsties',
 'LOP',
 'Gibson',
 'XGATML',
 'SwaddleDesigns',
 'Fender',
 'Baby Jog',
 'Kala',
 'Fender',
 'Jojo Designs, LLC',
 'JIM DUNLOP',
 'Seymour Duncan',
 'decalmile',
 'Seymour Duncan',
 'The First Success',
 'Seymour Duncan',
 'Graco',
 'Pirasty',
 'EATON',
 'ELECTRONICS',
 'Aomeiwer',
 'The Regatta Group DBA Beauty Depot',
 'XGATML',
 'Stupell Industries',
 'Pampers',
 'aden',
 'Ibanez',
 'Prins',
 'Inktastic',
 'The HONEST COMPANY',
 'EATON',
 'K-MuskuloAABBCC',
 'Jojo Designs, LLC',
 'PreSonus',
 'BRITAX',
 'DECOWALL',
 'Pampers',
 'Jojo Designs, LLC',
 'Hohner',
 'Dean',
 'MyVolts',
 'Woombie',
 'Graco',
 'OUTLOOK GROUP CORP',
 'F