
### 1. Required Libraries
The necessary libraries for text generation and dataset management are:

- **Pandas**: For handling tabular data and DataFrame manipulation.
- **Transformers**: Provides models and tokenizers for various NLP tasks, such as text generation and sequence-to-sequence learning.
- **Datasets**: Stores datasets in a memory-mapped Arrow format, which suits large-scale data, and plugs directly into the Hugging Face `Trainer`.
- **NumPy**: For numerical operations and data manipulation.

In [1]:
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, TrainerCallback
from datasets import Dataset, DatasetDict
import numpy as np

### 2. Loading and Preparing Data

In this section, we load the training, validation, and test sets with `pandas`. The `.data` files hold the input attributes (title, store, manufacturer) and the `.solution` files hold the corresponding labels; the test set has no solution file. All files are in JSON Lines format, hence `lines=True`.


In [2]:
train_data = pd.read_json('./input_data/attribute_train.data', lines=True)
train_solution = pd.read_json('./input_data/attribute_train.solution', lines=True)
test_data = pd.read_json('./input_data/attribute_test.data', lines=True)
val_data = pd.read_json('./input_data/attribute_val.data', lines=True)
val_solution = pd.read_json('./input_data/attribute_val.solution', lines=True)
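
With `lines=True`, `pd.read_json` expects one JSON object per line rather than a single top-level array. The sketch below mimics that parsing with the standard library only. The helper name `read_json_lines` is hypothetical, and the two records are illustrative rows that reuse the column names shown in the preview of `train_data`.

```python
import io
import json

def read_json_lines(handle):
    """Parse one JSON object per non-empty line (the JSON Lines layout)."""
    return [json.loads(line) for line in handle if line.strip()]

# Two illustrative records using the column names from the training preview.
sample = io.StringIO(
    '{"indoml_id": 1, "title": "Schutt Vengeance DCT Hybrid Youth Football H", '
    '"store": "Schutt", "details_Manufacturer": "Schutt"}\n'
    '{"indoml_id": 3, "title": "Bilstein B46-0929 Heavy-Duty Gas Shock Absorber", '
    '"store": "Bilstein", "details_Manufacturer": "Bilstein"}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["store"])
```

A file that fails to load with `lines=True` is often a regular JSON array instead, in which case the flag should be dropped.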

In [3]:
train_data

Unnamed: 0,indoml_id,title,store,details_Manufacturer
0,0,"Enclume Angled Pot Hook, Set of 6, Use with Po...",Enclume,Enclume
1,1,Schutt Vengeance DCT Hybrid Youth Football H,Schutt,Schutt
2,2,Easton 2014 MAKO SL14MK9 Baseball Bat (-9),Easton,"Easton Sports, Inc."
3,3,Bilstein B46-0929 Heavy-Duty Gas Shock Absorber,Bilstein,Bilstein
4,4,Apple Red Cardstock - 8.5 x 11 inch - 65Lb Cov...,Clear Path Paper,Clear Path Paper
...,...,...,...,...
443494,443494,Sony DCR-HC32 MiniDV Handycam Camcorder with 2...,Sony,Sony
443495,443495,Monster Truck Parking Sign Rally Owner Driver ...,SignMission,Mighty Skins -- Dropship
443496,443496,"3dRose Pyrenees Dog Dad Mug, 11 oz, Black",3dRose,3EROS
443497,443497,adidas F50 Lesto Shin Guard,adidas,adidas


### 3. Preprocessing Data and Targets

We define two functions, `preprocess_data` and `preprocess_target`, to transform the raw data and solution files into a structured format that can be used for model training. The goal is to generate `input_text` for the features and `target_text` for the labels, both as plain strings suitable for a text-to-text model. Before that, missing manufacturer values are replaced with the placeholder `'Unknown'`.


In [None]:
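# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which pandas treats as chained assignment and may not apply to the original frame.
train_data['details_Manufacturer'] = train_data['details_Manufacturer'].fillna('Unknown')
val_data['details_Manufacturer'] = val_data['details_Manufacturer'].fillna('Unknown')
test_data['details_Manufacturer'] = test_data['details_Manufacturer'].fillna('Unknown')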


#### 3.1 `combine_infos`: Generating Descriptive Input Text

The `combine_infos` function combines multiple pieces of information (e.g., product title, store, and manufacturer) from each row of the data into a single descriptive text.

In [18]:
def combine_infos(row):
    # Build one natural-language prompt from the three product fields.
    return (
        f"The title of the product is {row['title']}. "
        f"This product was bought at the store {row['store']}. "
        f"The manufacturer of this product is {row['details_Manufacturer']}."
    )
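
Only `details_Manufacturer` is filled with a placeholder above, so a row with a missing `title` or `store` would still carry pandas' missing-value marker (the float `NaN`) into the prompt. The following is a minimal sketch of a more defensive variant, not the function used in this run; `combine_infos_safe` and `is_missing` are hypothetical names, and the example row is made up.

```python
import math

def is_missing(value):
    """True for None and for float NaN, the two usual 'missing' markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def combine_infos_safe(row, placeholder="Unknown"):
    """Same sentence template as combine_infos, but tolerant of missing fields."""
    title, store, maker = (
        placeholder if is_missing(row[key]) else str(row[key])
        for key in ("title", "store", "details_Manufacturer")
    )
    return (
        f"The title of the product is {title}. "
        f"This product was bought at the store {store}. "
        f"The manufacturer of this product is {maker}."
    )

# Made-up row with a missing store; works for dicts and pandas rows alike.
example = combine_infos_safe(
    {"title": "adidas F50 Lesto Shin Guard", "store": float("nan"),
     "details_Manufacturer": "adidas"}
)
print(example)
```

Because the helper only indexes `row[key]`, it could be passed to `data.apply(..., axis=1)` in the same way as `combine_infos`.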

#### 3.2 `preprocess_data`: Processing Feature Data

The `preprocess_data` function creates a new DataFrame that holds the `indoml_id` and the combined `input_text` for each row.

In [None]:

def preprocess_data(data):
    df = pd.DataFrame()

    df['indoml_id'] = data['indoml_id']
    df['input_text'] = data.apply(combine_infos, axis=1)

    return df

#### 3.3 `preprocess_target`: Processing Target Data

The `preprocess_target` function transforms the solution dataset by concatenating various category labels (brand, category hierarchy) into a single string format, which will be the model's target output.

In [None]:
def preprocess_target(solution):

    df = pd.DataFrame()

    df['indoml_id'] = solution['indoml_id']
    df['target_text'] = solution.apply(lambda row: f"details_Brand: {row['details_Brand']} L0_category: {row['L0_category']} L1_category: {row['L1_category']} L2_category: {row['L2_category']} L3_category: {row['L3_category']} L4_category: {row['L4_category']}", axis=1)

    return df

#### 3.4 Applying Preprocessing to Datasets

We apply the `preprocess_data` and `preprocess_target` functions to the training, validation, and test datasets to create the processed features and targets.

In [19]:

trftrs_processed = preprocess_data(train_data)
trtgt_processed = preprocess_target(train_solution)
valftrs_processed = preprocess_data(val_data)
valtgt_processed = preprocess_target(val_solution)
tstftrs_processed = preprocess_data(test_data)

In [20]:
trftrs_processed

Unnamed: 0,indoml_id,input_text
0,0,The title of the product is Enclume Angled Pot...
1,1,The title of the product is Schutt Vengeance D...
2,2,The title of the product is Easton 2014 MAKO S...
3,3,The title of the product is Bilstein B46-0929 ...
4,4,The title of the product is Apple Red Cardstoc...
...,...,...
443494,443494,The title of the product is Sony DCR-HC32 Mini...
443495,443495,The title of the product is Monster Truck Park...
443496,443496,The title of the product is 3dRose Pyrenees Do...
443497,443497,The title of the product is adidas F50 Lesto S...


In [21]:
trtgt_processed

Unnamed: 0,indoml_id,target_text
0,0,details_Brand: Enclume L0_category: Home & Kit...
1,1,details_Brand: Schutt L0_category: Sports & Ou...
2,2,details_Brand: Easton L0_category: Sports & Ou...
3,3,details_Brand: Bilstein L0_category: Automotiv...
4,4,details_Brand: Clear Path Paper L0_category: A...
...,...,...
443494,443494,details_Brand: Sony L0_category: Electronics L...
443495,443495,details_Brand: SignMission L0_category: Home &...
443496,443496,details_Brand: 3dRose L0_category: Home & Kitc...
443497,443497,details_Brand: adidas L0_category: Sports & Ou...


### 4. Converting Data to Hugging Face `Dataset` Format

Once the data has been preprocessed and structured, the next step is to convert it into the `Dataset` format from the Hugging Face `datasets` library. The `Trainer` API accepts this format directly, and `Dataset.map` can tokenize it in batches.

#### 4.1 Merging Processed Features and Targets

We first merge the preprocessed features and target datasets on the `indoml_id` column to create a complete dataset for both training and validation sets.

In [22]:
# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(pd.merge(trftrs_processed, trtgt_processed, on='indoml_id'))
val_dataset = Dataset.from_pandas(pd.merge(valftrs_processed, valtgt_processed, on='indoml_id'))
test_dataset = Dataset.from_pandas(tstftrs_processed)
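
`pd.merge` performs an inner join by default, so an `indoml_id` that appears in only one of the two frames is dropped without any warning. A quick check before merging can make such mismatches visible. The helper below is a hypothetical sketch with made-up ids; it accepts any iterable of ids, including a pandas column.

```python
def check_id_alignment(feature_ids, target_ids):
    """Report ids present on only one side and whether feature ids repeat."""
    feature_ids = list(feature_ids)
    feature_set, target_set = set(feature_ids), set(target_ids)
    return {
        "missing_targets": sorted(feature_set - target_set),
        "missing_features": sorted(target_set - feature_set),
        "has_duplicates": len(feature_ids) != len(feature_set),
    }

# Made-up ids: 2 has no target and 4 has no features.
report = check_id_alignment([0, 1, 2, 3], [0, 1, 3, 4])
print(report)
```

In this notebook the call would look like `check_id_alignment(trftrs_processed['indoml_id'], trtgt_processed['indoml_id'])`; empty lists and `False` mean the merge keeps every row.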

In [23]:
train_dataset[:5]

{'indoml_id': [0, 1, 2, 3, 4],
 'input_text': ['The title of the product is Enclume Angled Pot Hook, Set of 6, Use with Pot Racks, Copper Plated. This product was bought at the store Enclume. The manufacturer of this product is Enclume.',
  'The title of the product is Schutt Vengeance DCT Hybrid Youth Football H. This product was bought at the store Schutt. The manufacturer of this product is Schutt.',
  'The title of the product is Easton 2014 MAKO SL14MK9 Baseball Bat (-9). This product was bought at the store Easton. The manufacturer of this product is Easton Sports, Inc..',
  'The title of the product is Bilstein B46-0929 Heavy-Duty Gas Shock Absorber. This product was bought at the store Bilstein. The manufacturer of this product is Bilstein.',
  'The title of the product is Apple Red Cardstock - 8.5 x 11 inch - 65Lb Cover - 100 Sheets - Clear Path Paper. This product was bought at the store Clear Path Paper. The manufacturer of this product is Clear Path Paper.'],
 'target_text'

In [59]:
test_dataset

Dataset({
    features: ['indoml_id', 'input_text'],
    num_rows: 95036
})

#### 4.2 Constructing the `DatasetDict`

We create a `DatasetDict` by assigning the previously converted `train_dataset` and `val_dataset` to the corresponding keys, `train` and `validation`.


In [24]:
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset
})

In [25]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['indoml_id', 'input_text', 'target_text'],
        num_rows: 443499
    })
    validation: Dataset({
        features: ['indoml_id', 'input_text', 'target_text'],
        num_rows: 95035
    })
})

In [26]:
dataset_dict['train']

Dataset({
    features: ['indoml_id', 'input_text', 'target_text'],
    num_rows: 443499
})

In [27]:
dataset_dict['validation']

Dataset({
    features: ['indoml_id', 'input_text', 'target_text'],
    num_rows: 95035
})

### 5. Loading the Tokenizer and Model

For this task, we use **FLAN-T5**, an instruction-fine-tuned variant of T5 (Text-to-Text Transfer Transformer) released by Google. We start from the pre-trained checkpoint and fine-tune it for our task with the Hugging Face Transformers library.

#### Loading the Tokenizer

The tokenizer is responsible for converting the input text into tokens, which are numerical representations that the model can understand. We use the tokenizer from the `google/flan-t5-base` checkpoint.

#### Loading the Model

The model class is `T5ForConditionalGeneration`, the T5 encoder-decoder with a language-modeling head used for text generation. It is loaded from the same `google/flan-t5-base` checkpoint.

In [28]:
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


#### Preprocessing and Tokenizing the Dataset

In this step, we will preprocess and tokenize the datasets to convert the raw text data into a format suitable for the model. The tokenization process converts the input and target text into token IDs, which are numerical representations of the text that the model can process.

##### Preprocessing Function: `preprocess_trval`

The `preprocess_trval` function tokenizes both the `input_text` (the product descriptions) and the `target_text` (the labels we want the model to generate). It truncates or pads every sequence to a fixed length of 128 tokens.

In [29]:
def preprocess_trval(examples):
    inputs = examples['input_text']
    targets = examples['target_text']
    model_inputs = tokenizer(inputs, max_length=128, padding='max_length', truncation=True)
    labels = tokenizer(targets, max_length=128, padding='max_length', truncation=True)
    model_inputs['labels'] = labels['input_ids']

    return model_inputs

tokenized_datasets = dataset_dict.map(preprocess_trval, batched=True)


In [30]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['indoml_id', 'input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 443499
    })
    validation: Dataset({
        features: ['indoml_id', 'input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 95035
    })
})
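
Because the targets are tokenized with `padding='max_length'`, the `labels` column contains the pad token id wherever a target is shorter than 128 tokens, and the loss is then computed over those padding positions as well. A common remedy is to replace pad ids in the labels with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that replacement under the assumption that the pad id is 0 (the value of `tokenizer.pad_token_id` for T5). The helper name is hypothetical, the token ids are made up, and this step was not part of the run reported in this notebook.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

def mask_padding(label_ids, pad_token_id=0):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]

# Made-up token ids: 1 is T5's end-of-sequence id, 0 is padding.
batch = [[4350, 834, 1, 0, 0], [71, 1, 0, 0, 0]]
print(mask_padding(batch))
```

Inside `preprocess_trval` this would amount to `model_inputs['labels'] = mask_padding(labels['input_ids'], tokenizer.pad_token_id)`. Losses computed with masked labels are not directly comparable to the values reported here.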

Save the tokenized datasets to disk so that tokenization does not have to be repeated in later sessions:

In [31]:
tokenized_datasets.save_to_disk('./flanT5base_Input_tokens/')


Reload the tokenized datasets from disk:

In [32]:
from datasets import load_from_disk

tokenized_datasets = load_from_disk('./flanT5base_Input_tokens/')

In [33]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['indoml_id', 'input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 443499
    })
    validation: Dataset({
        features: ['indoml_id', 'input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 95035
    })
})

### 6. Setting Up Training Arguments and Custom Logging

In this section, we configure the **training arguments** and set up a custom **logging callback** to monitor and log the training process. We use the `Trainer` class from Hugging Face's Transformers library, which streamlines the process of training and evaluating models.

#### 6.1 Training Arguments

The `TrainingArguments` object contains various hyperparameters that dictate how the model is trained. These arguments are essential for controlling the behavior of the training loop, logging, evaluation, and checkpointing.


In [34]:
training_args = TrainingArguments(
    output_dir='./tuned_fT5base',
    eval_strategy='epoch',
    eval_accumulation_steps=32,
    save_strategy='steps',
    save_total_limit=1,
    learning_rate=2e-2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    fp16=True,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_first_step=True,
    logging_steps=100,
    report_to='tensorboard'
)

#### 6.2 Custom Logging Callback

To enhance logging and keep track of the training progress, we define a custom `LoggingCallback` class that writes the training progress to a text file and outputs information to the console.

In [36]:
import os

class LoggingCallback(TrainerCallback):
    def __init__(self, log_dir='./logs', log_file='training_log(new).txt'):
        super().__init__()
        self.log_dir = log_dir
        self.log_file = log_file
        os.makedirs(self.log_dir, exist_ok=True)
        self.log_path = os.path.join(self.log_dir, self.log_file)
        with open(self.log_path, 'w') as f:
            f.write("Dataset Information:\n")
            f.write(f"Number of Training Datapoints: {len(tokenized_datasets['train'])}\n")
            f.write(f"Number of Validation Datapoints: {len(tokenized_datasets['validation'])}\n\n")
            
            f.write("Training Hyperparameters:\n")
            for arg, value in vars(training_args).items():
                f.write(f"{arg}: {value}\n")
            f.write("\n")
            
            f.write("Training Logs:\n\n")
    
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            with open(self.log_path, 'a') as f:
                f.write(f"Step: {state.global_step}\n")
                print(f"Step: {state.global_step}\n")
                for key, value in logs.items():
                    f.write(f"{key}: {value}\n")
                    print(f"{key}: {value}\n")
                f.write("\n")
                print("\n")
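
The text file written by `LoggingCallback` has a simple layout: after the `Training Logs:` header, each logging event is a `Step: N` line followed by `key: value` lines and a blank line. A hypothetical helper such as the one below could read those blocks back into dictionaries for plotting or inspection. It is a sketch that assumes exactly this layout; the sample values are the first two logged steps of this run.

```python
def parse_training_log(text):
    """Turn the 'Training Logs' part of the log file into a list of dicts."""
    _, _, body = text.partition("Training Logs:\n")
    records = []
    for block in body.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            key, sep, value = line.partition(": ")
            if not sep:
                continue  # skip lines that are not "key: value" pairs
            if key == "Step":
                record["step"] = int(value)
                continue
            try:
                record[key] = float(value)
            except ValueError:
                record[key] = value  # keep non-numeric values as text
        if record:
            records.append(record)
    return records

sample_log = (
    "Training Logs:\n\n"
    "Step: 1\nloss: 26.2212\ngrad_norm: 5323933.5\n\n"
    "Step: 100\nloss: 1.2074\ngrad_norm: 36275.3671875\n\n"
)
records = parse_training_log(sample_log)
print(records)
```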

#### 6.3 Setting Up the Trainer

We initialize the `Trainer` with the model, training arguments, tokenized datasets, and the custom logging callback.


In [37]:
import warnings

warnings.filterwarnings("ignore")

In [38]:
import torch

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[LoggingCallback()]
)

torch.cuda.empty_cache()
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.7428,0.612218
2,0.6602,0.575476
3,0.4451,0.363491


Step: 1

loss: 26.2212

grad_norm: 5323933.5

learning_rate: 0.019999518999519

epoch: 7.215007215007215e-05



Step: 100

loss: 1.2074

grad_norm: 36275.3671875

learning_rate: 0.019951899951899953

epoch: 0.007215007215007215



Step: 200

loss: 0.656

grad_norm: 42530.078125

learning_rate: 0.019903799903799903

epoch: 0.01443001443001443



Step: 300

loss: 0.7907

grad_norm: 327186.8125

learning_rate: 0.019855699855699856

epoch: 0.021645021645021644



Step: 400

loss: 1.1233

grad_norm: 144540.46875

learning_rate: 0.01980759980759981

epoch: 0.02886002886002886



Step: 500

loss: 0.9654

grad_norm: 73057.1484375

learning_rate: 0.019759499759499758

epoch: 0.03607503607503607



Step: 600

loss: 1.0009

grad_norm: 42994.625

learning_rate: 0.01971139971139971

epoch: 0.04329004329004329



Step: 700

loss: 0.9493

grad_norm: 51563.22265625

learning_rate: 0.019663299663299664

epoch: 0.050505050505050504



Step: 800

loss: 0.9191

grad_norm: 133501.828125

learning_rate: 0.0

TrainOutput(global_step=41580, training_loss=0.6780548685614222, metrics={'train_runtime': 13201.2257, 'train_samples_per_second': 100.786, 'train_steps_per_second': 3.15, 'total_flos': 6.183168970928947e+16, 'train_loss': 0.6780548685614222, 'epoch': 3.0})
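
The logged numbers can be cross-checked with a little arithmetic. `global_step=41580` over 3 epochs implies 13,860 optimizer steps per epoch, which matches an effective batch size of 32, that is, the per-device batch size of 16 on two devices. The device count is an inference from the logs, not something the notebook states (gradient accumulation is not set in `TrainingArguments`). The sketch also assumes the `Trainer` default of a linear learning-rate schedule without warmup.

```python
import math

train_rows = 443_499        # size of the training split
per_device_batch = 16       # per_device_train_batch_size
num_devices = 2             # ASSUMPTION: inferred from the logs, not stated in the notebook
epochs = 3                  # num_train_epochs
peak_lr = 2e-2              # learning_rate

steps_per_epoch = math.ceil(train_rows / (per_device_batch * num_devices))
total_steps = steps_per_epoch * epochs

def linear_lr(step):
    """Linear decay from peak_lr to 0 over total_steps, with no warmup."""
    return peak_lr * (total_steps - step) / total_steps

print(steps_per_epoch, total_steps)
print(linear_lr(1), linear_lr(100))
```

Under these assumptions the computed total equals the reported `global_step`, and the learning rates at steps 1 and 100 agree with the logged values to within floating-point rounding.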

### 7. Evaluating the Model on the Validation Set

After the model has been trained, the next step is to evaluate its performance on the validation dataset. This step helps us assess how well the model generalizes to unseen data.

#### 7.1 Evaluating the Model

We use `trainer.evaluate()` to run the evaluation loop on the validation dataset without updating the model's weights. Because no `compute_metrics` function was passed to the `Trainer`, the result contains only the loss plus runtime and throughput statistics.

In [39]:
val_results = trainer.evaluate(eval_dataset=tokenized_datasets['validation'])
print(f"Validation Loss: {val_results['eval_loss']}")

Step: 41580

eval_loss: 0.3634907901287079

eval_runtime: 347.1605

eval_samples_per_second: 273.749

eval_steps_per_second: 8.555

epoch: 3.0



Validation Loss: 0.3634907901287079
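
The validation loss does not say how often each predicted attribute is exactly right. One way to get that picture is to generate predictions for validation rows, parse them into six-field tuples (as `extract_details` does in section 8), and compare them field by field with the reference tuples. The helper below is a hypothetical sketch of such a comparison with made-up reference rows; it is not the competition's official metric.

```python
FIELDS = ['details_Brand', 'L0_category', 'L1_category',
          'L2_category', 'L3_category', 'L4_category']

def per_field_accuracy(predicted, expected):
    """Fraction of rows where each field matches exactly, keyed by field name."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equally long")
    hits = {field: 0 for field in FIELDS}
    for pred_row, true_row in zip(predicted, expected):
        for field, pred, true in zip(FIELDS, pred_row, true_row):
            hits[field] += pred == true
    return {field: count / len(expected) for field, count in hits.items()}

# Made-up example: the second prediction gets the deepest category wrong.
expected = [
    ('CURT', 'Automotive', 'Exterior Accessories',
     'Towing Products & Winches', 'Hitch Accessories', 'Wiring'),
    ('Dorman', 'Automotive', 'Replacement Parts',
     'Windshield Wipers & Washers', 'Wipers', 'Arms'),
]
predicted = [expected[0], expected[1][:5] + ('na',)]
scores = per_field_accuracy(predicted, expected)
print(scores)
```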


#### 7.2 Saving the Model

To save the fine-tuned model, call `save_pretrained()` on both the model and the tokenizer. This stores the weights, the configuration, and the tokenizer files in one directory so that they can be reloaded together for inference.

In [40]:
model.save_pretrained('./finetuned_flanT5base_3E')
tokenizer.save_pretrained('./finetuned_flanT5base_3E')

('./finetuned_flanT5small_3E(imputed)/tokenizer_config.json',
 './finetuned_flanT5small_3E(imputed)/special_tokens_map.json',
 './finetuned_flanT5small_3E(imputed)/spiece.model',
 './finetuned_flanT5small_3E(imputed)/added_tokens.json')

### 8. Inference and Text Generation

After saving the model and tokenizer, we proceed to use the fine-tuned model for generating predictions on new or test data. This involves loading the model and tokenizer, generating text from inputs, and extracting relevant details from the generated text.

#### 8.1 Loading the Model and Tokenizer

First, load the fine-tuned model and tokenizer from the saved directory:

In [60]:
import re
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = T5ForConditionalGeneration.from_pretrained('./finetuned_flanT5base_3E').to(device)
tokenizer = T5Tokenizer.from_pretrained('./finetuned_flanT5base_3E')

model.eval()

In [None]:
# Materialize the test prompts as a plain list of strings.
test_data = list(test_dataset['input_text'])

def generate_text(inputs):
    # Tokenize the batch, padding to the longest prompt in it.
    inputs = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=352)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128)

    generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return generated_texts

def extract_details(text):
    # The pattern mirrors the target_text template built in preprocess_target.
    pattern = r'details_Brand: (.*?) L0_category: (.*?) L1_category: (.*?) L2_category: (.*?) L3_category: (.*?) L4_category: (.*)'
    match = re.match(pattern, text)
    if match:
        return tuple(item if item is not None else 'na' for item in match.groups())
    # Output that does not follow the template yields 'na' for every field.
    return 'na', 'na', 'na', 'na', 'na', 'na'
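
The parsing step is strict: the regular expression must find all six field names in order, otherwise every field falls back to `'na'`, including a brand that was generated correctly. The sketch below restates the same pattern in a self-contained form and runs it on one complete generated string taken from this notebook's output and on one that has been cut short for illustration.

```python
import re

# Same template as in extract_details, split over two lines for readability.
PATTERN = re.compile(
    r'details_Brand: (.*?) L0_category: (.*?) L1_category: (.*?) '
    r'L2_category: (.*?) L3_category: (.*?) L4_category: (.*)'
)

def extract_details(text):
    """Return the six captured fields, or six 'na' values if the template is broken."""
    match = PATTERN.match(text)
    return match.groups() if match else ('na',) * 6

complete = ('details_Brand: Raybestos L0_category: Automotive '
            'L1_category: Replacement Parts L2_category: Brake System '
            'L3_category: Rotors L4_category: na')
# Cut short for illustration: the L3 and L4 fields are missing.
cut_short = ('details_Brand: Hogue L0_category: Sports & Outdoors '
             'L1_category: Hunting & Fishing L2_category: Shooting')

print(extract_details(complete))
print(extract_details(cut_short))
```

If partial credit for the leading fields matters, a more lenient parser that extracts each field separately may be worth considering; that would be a change to the approach used here.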

In [61]:
test_data

['The title of the product is CURT 58180 Trailer-Side 7-Pin Round Wiring Harness Socket with Spring. This product was bought at the store CURT. The manufacturer of this product is Curt Manufacturing.',
 'The title of the product is CafePress Andrew Jackson Quote Sticker (Bumper) 10x3 Rectangle Vinyl Bumper Sticker Car Decal. This product was bought at the store CafePress. The manufacturer of this product is CafePress.',
 'The title of the product is Garage-Pro Driver Side Mirror Glass Compatible with 2007-2014 Chevrolet Silverado 2500 HD, Tahoe, 2007-2013 Chevrolet Avalanche Heated, with Signal Light and auto dimming. This product was bought at the store Garage-Pro. The manufacturer of this product is Garage-Pro.',
 'The title of the product is Husky Liners Front & 2nd Seat Floor Liners Fits 12-17 Prius V. This product was bought at the store Husky Liners. The manufacturer of this product is Husky Liners.',
 'The title of the product is Nearly Natural 1306-YL Rose and Calla with Vase A

### 9. Extraction and Submission

In this section, we process the test data in batches to generate predictions using the fine-tuned model. The generated texts are then parsed to extract relevant details.

#### 9.1 Batch Processing and Text Generation

To keep memory use bounded, the test inputs are sent to the model 64 at a time. Each generated string is parsed with `extract_details` as soon as it is produced, and the resulting tuples are collected in `generated_details`.


In [62]:
batch_size = 64
generated_details = []

for i in tqdm(range(0, len(test_data), batch_size), desc="Processing test data"):
    batch_inputs = test_data[i:i+batch_size]

    generated_texts = generate_text(batch_inputs)

    for generated_text in generated_texts:
        generated_details.append(extract_details(generated_text))

print('Generated info extracted.............')

Processing test data: 100%|██████████| 1485/1485 [51:00<00:00,  2.06s/it]

Generated info extracted.............
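
The batch count in the progress bar follows directly from the dataset size: 95,036 test rows in batches of 64 give 1,485 batches, the last of which is only partly filled. A minimal sketch of the same slicing logic, using a hypothetical `iter_batches` helper and a `range` as a stand-in for the list of prompts:

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

test_rows = 95_036   # num_rows of test_dataset
batch_size = 64

# A range stands in for the list of prompts; only the batch sizes matter here.
sizes = [len(batch) for batch in iter_batches(range(test_rows), batch_size)]
print(len(sizes), sizes[-1], math.ceil(test_rows / batch_size))
```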





In [65]:
generated_texts

['details_Brand: Microsoft L0_category: Electronics L1_category: Computers & Accessories L2_category: Computer Accessories & Peripherals L3_category: Keyboards, Mice & Accessories L4_category: Mice',
 'details_Brand: Raybestos L0_category: Automotive L1_category: Replacement Parts L2_category: Brake System L3_category: Rotors L4_category: na',
 'details_Brand: LCL L0_category: Office Products L1_category: Office Electronics L2_category: Printers & Accessories L3_category: Printer Parts & Accessories L4_category: Printer Ink & Toner',
 'details_Brand: myCartridge L0_category: Office Products L1_category: Office Electronics L2_category: Printers & Accessories L3_category: Printer Parts & Accessories L4_category: Printer Ink & Toner',
 'details_Brand: Alice Peterson L0_category: Arts, Crafts & Sewing L1_category: Needlework L2_category: Needlepoint L3_category: Kits L4_category: na',
 'details_Brand: Hogue L0_category: Sports & Outdoors L1_category: Hunting & Fishing L2_category: Shooting

In [66]:
generated_details

[('CURT',
  'Automotive',
  'Exterior Accessories',
  'Towing Products & Winches',
  'Hitch Accessories',
  'Wiring'),
 ('CafePress',
  'Automotive',
  'Exterior Accessories',
  'Bumper Stickers, Decals & Magnets',
  'Bumper Stickers',
  'na'),
 ('Garage-Pro',
  'Automotive',
  'Replacement Parts',
  'Body & Trim',
  'Body',
  'Mirrors & Parts'),
 ('Husky Liners',
  'Automotive',
  'Interior Accessories',
  'Floor Mats & Cargo Liners',
  'Cargo Liners',
  'na'),
 ('Nearly Natural',
  'Home & Kitchen',
  'Home Dcor Products',
  'Artificial Plants & Flowers',
  'Artificial Flowers',
  'na'),
 ('Dorman',
  'Automotive',
  'Replacement Parts',
  'Windshield Wipers & Washers',
  'Wipers',
  'Arms'),
 ('Eurosport Daytona',
  'Automotive',
  'Exterior Accessories',
  'License Plate Covers & Frames',
  'Covers',
  'na'),
 ('Military Vet Shop',
  'Automotive',
  'Exterior Accessories',
  'Bumper Stickers, Decals & Magnets',
  'Decals',
  'na'),
 ('Gates',
  'Automotive',
  'Replacement Parts',


#### 9.2 Saving Predictions to a File

We save the predictions in JSON Lines format: each line of the file is one JSON object with the `indoml_id` and the six predicted fields.

In [67]:
import json
categories = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']

with open('attribute_test_V8.predict', 'w') as file:

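    # Pair each prediction with its actual id rather than its position in the list.
    for indoml_id, details in zip(test_dataset['indoml_id'], generated_details):
        result = {"indoml_id": indoml_id}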
        for category, value in zip(categories, details):
            result[category] = value

        file.write(json.dumps(result) + '\n')
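
Before submitting, it can be worth reading the file back and checking that every line parses and carries the expected keys. The helper below is a hypothetical sketch of such a check, demonstrated on a temporary file holding the first prediction shown above. Which keys the scoring system actually requires is an assumption based on the fields written in this notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {'indoml_id', 'details_Brand', 'L0_category', 'L1_category',
                 'L2_category', 'L3_category', 'L4_category'}

def validate_predictions(path, expected_rows):
    """Check that every line is a JSON object with exactly the required keys."""
    with open(path) as handle:
        rows = [json.loads(line) for line in handle if line.strip()]
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    for row in rows:
        if set(row) != REQUIRED_KEYS:
            raise ValueError(f"unexpected keys in row {row.get('indoml_id')}")
    return len(rows)

# Self-contained demo on a temporary file with a single prediction.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.predict')
    with open(demo_path, 'w') as handle:
        handle.write(json.dumps({
            'indoml_id': 0, 'details_Brand': 'CURT', 'L0_category': 'Automotive',
            'L1_category': 'Exterior Accessories',
            'L2_category': 'Towing Products & Winches',
            'L3_category': 'Hitch Accessories', 'L4_category': 'Wiring'}) + '\n')
    checked = validate_predictions(demo_path, expected_rows=1)
print(checked)
```

For the real file the call would be `validate_predictions('attribute_test_V8.predict', len(generated_details))`.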

#### 9.3 Compressing Predictions into a Zip File

The predictions file is compressed into a zip archive.

In [68]:
import zipfile

file_to_zip = 'attribute_test_V8.predict'
zip_file_name = 'submission_V8.zip'

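# ZipFile stores files uncompressed by default; request DEFLATE to actually compress.
with zipfile.ZipFile(zip_file_name, 'w', compression=zipfile.ZIP_DEFLATED) as zipf: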
    zipf.write(file_to_zip, arcname=file_to_zip)
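
A quick way to confirm that the archive is usable is to reopen it, run the built-in CRC check, and list its members. The sketch below does this for a temporary demo file; `zip_single_file` is a hypothetical helper, and the repeated demo line only serves to give the compressor something to shrink.

```python
import os
import tempfile
import zipfile

def zip_single_file(source_path, zip_path):
    """Write one file into a DEFLATE-compressed archive and return its member names."""
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(source_path, arcname=os.path.basename(source_path))
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        if archive.testzip() is not None:
            raise zipfile.BadZipFile("archive failed its CRC check")
        return archive.namelist()

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'demo.predict')
    with open(source, 'w') as handle:
        handle.write('{"indoml_id": 0}\n' * 1000)
    target = os.path.join(tmp, 'demo.zip')
    members = zip_single_file(source, target)
    shrunk = os.path.getsize(target) < os.path.getsize(source)
print(members, shrunk)
```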