# POS TECH - IA PARA DEVS
### Tech Challenge - Phase 03

**Student:** Inacio Ribeiro - RM362328

The goal of this Tech Challenge is to fine-tune a foundation model (BERT) on the "The Amazon Titles-1.3MM" dataset.

In [1]:
!pip install transformers datasets pandas torch accelerate



In [2]:
import pandas as pd
import json
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    pipeline
)
import os

### 1. Importing the Dataset

In [3]:
file_path = '/content/drive/MyDrive/trn.json'

# Upper bound on how many records to load from the JSON Lines file
sample_size_limit = 20000

data = []
print(f"Starting to read the file. Loading at most {sample_size_limit} records.")

try:
    with open(file_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            # Stop once the limit is reached
            if i >= sample_size_limit:
                print(f"Limit of {sample_size_limit} records reached. Stopping the read.")
                break

            # Parse the line (a JSON string) into a Python dictionary
            data.append(json.loads(line))

            # Print a status message every 1000 lines to show progress
            if (i + 1) % 1000 == 0:
                print(f"Read {i + 1} records...")

    # Build the DataFrame from the sampled records only
    df = pd.DataFrame(data)
    print("\nDataset successfully loaded into a DataFrame!")
    print("Sample shape:", df.shape)
    display(df.head())

except FileNotFoundError:
    print(f"File {file_path} not found!")
    print("Check the path and make sure your Google Drive is mounted.")


Starting to read the file. Loading at most 20000 records.
Read 1000 records...
Read 2000 records...
Read 3000 records...
Read 4000 records...
Read 5000 records...
Read 6000 records...
Read 7000 records...
Read 8000 records...
Read 9000 records...
Read 10000 records...
Read 11000 records...
Read 12000 records...
Read 13000 records...
Read 14000 records...
Read 15000 records...
Read 16000 records...
Read 17000 records...
Read 18000 records...
Read 19000 records...
Read 20000 records...
Limit of 20000 records reached. Stopping the read.

Dataset successfully loaded into a DataFrame!
Sample shape: (20000, 5)


| | uid | title | content | target_ind | target_rel |
|---|---|---|---|---|---|
| 0 | 31909 | Girls Ballet Tutu Neon Pink | High quality 3 layer ballet tutu. 12 inches in... | [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2... | [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ... |
| 1 | 32034 | Adult Ballet Tutu Yellow | | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 16, 33, 36, 37,... | [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ... |
| 2 | 913154 | The Way Things Work: An Illustrated Encycloped... | | [116, 117, 118, 119, 120, 121, 122] | [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] |
| 3 | 1360000 | Mog's Kittens | Judith Kerr&#8217;s best&#8211;selling adventu... | [146, 147, 148, 149, 495] | [1.0, 1.0, 1.0, 1.0, 1.0] |
| 4 | 1381245 | Misty of Chincoteague | | [151] | [1.0] |
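
The loading loop above reads a JSON Lines file one record at a time and stops at a fixed limit. As a minimal sketch of the same idea, the helper below uses `itertools.islice` to cap the number of lines read. The name `read_jsonl_head` and the in-memory rows are hypothetical and exist only for illustration; they are not part of the dataset or the notebook.

```python
import io
import json
from itertools import islice


def read_jsonl_head(handle, limit):
    """Parse at most `limit` lines of a JSON Lines stream, skipping blank lines."""
    records = []
    for line in islice(handle, limit):
        line = line.strip()
        if line:  # a blank line would make json.loads raise, so skip it
            records.append(json.loads(line))
    return records


# Toy in-memory stand-in for trn.json (made-up rows, for illustration only).
fake_file = io.StringIO(
    '{"uid": "a1", "title": "Item A", "content": "First description."}\n'
    '{"uid": "b2", "title": "Item B", "content": ""}\n'
    '{"uid": "c3", "title": "Item C", "content": "Third description."}\n'
)
head = read_jsonl_head(fake_file, limit=2)
print([record["uid"] for record in head])
```

With a real file, the same helper could be called as `read_jsonl_head(open(file_path, encoding='utf-8'), sample_size_limit)` and the result passed to `pd.DataFrame`.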


### 2. Testing the model's current answer before any fine-tuning on our data

In [9]:
from transformers import pipeline
import textwrap

# Baseline: a BERT checkpoint already fine-tuned for extractive QA on SQuAD 2.0
model_name_baseline = "deepset/bert-base-cased-squad2"

qa_pipeline_baseline = pipeline("question-answering", model=model_name_baseline, tokenizer=model_name_baseline)
print("Pipeline loaded successfully!")

if 'df' in locals() and not df.empty:

    # Keep only rows whose description is non-empty (missing values count as empty)
    df_valid = df[df['content'].fillna('').str.strip().astype(bool)]

    if not df_valid.empty:
        # Pick a fixed example (position 10) from the already-filtered DataFrame
        indice_fixo = 10
        test_sample = df_valid.iloc[indice_fixo]

        product_title = test_sample['title']
        product_context = test_sample['content']

        question = f"What are the features of the product '{product_title}'?"

        print("\n" + "="*50)
        print("--- RUNNING TEST WITH THE PRE-TRAINED MODEL ---")
        print("="*50)
        print(f"\nQUESTION: {question}")
        print("\nCONTEXT (Product Description):")
        print(textwrap.fill(product_context, width=120))

        result_baseline = qa_pipeline_baseline(question=question, context=product_context)

        print("\n" + "="*50)
        print("--- CURRENT MODEL RESULT (BEFORE FINE-TUNING) ---")
        print("="*50)
        print(f"\nEXTRACTED ANSWER: '{result_baseline['answer']}'")
        print(f"CONFIDENCE (SCORE): {result_baseline['score']:.4f}")
    else:
        print("ERROR: No rows with valid descriptions were found in your sample DataFrame.")
else:
    print("ERROR: DataFrame 'df' was not found. Run the data loading cell first.")

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


Pipeline loaded successfully!

--- RUNNING TEST WITH THE PRE-TRAINED MODEL ---

QUESTION: What are the features of the product 'The Book of Revelation'?

CONTEXT (Product Description):
American Baptist pastor, Bible teacher, and writer, Clarence Larkin was born October 28, 1850, in Chester, Delaware
County, Pennsylvania. He was converted to Christ at the age of 19 and then felt called to the Gospel ministry, but the
doors of opportunity for study and ministry did not open immediately. He then got a job in a bank. When he was 21 years
old, he left the bank and went to college, graduating as a mechanical engineer. He continued as a professional draftsman
for a while, then he became a teacher of the blind. Later, failing health compelled him to give up his teaching career.
After a prolonged rest, he became a manufacturer. When he was converted he had become a member of the Episcopal Church,
but in 1882,he became a Baptist and was ordained as a Baptist minister two years later. He w
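
An extractive QA model such as the one used in this test does not generate text. It scores every token as a possible start and a possible end of the answer, and the answer is the best-scoring span. The sketch below illustrates that selection step with made-up scores. The function `best_span`, the token list, and the numbers are all hypothetical, and the real pipeline does more work (score normalization, long-context handling) than this simplified version.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end, score) maximizing start_scores[s] + end_scores[e], with s <= e."""
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length budget
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Made-up tokens and scores, for illustration only.
tokens = ["Clarence", "Larkin", "was", "a", "Baptist", "pastor"]
start_scores = [0.1, 0.2, 0.0, 0.3, 2.5, 0.4]
end_scores = [0.0, 0.3, 0.1, 0.0, 0.6, 2.0]

s, e, score = best_span(start_scores, end_scores)
print(" ".join(tokens[s:e + 1]), score)
```

The `max_answer_len` cap keeps the search from picking implausibly long spans.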

### 3. Preparing the data for training

In [11]:
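# Keep only the two text columns used to build the QA examples
df_prepared = df[['title', 'content']].copy()

# Drop rows with a missing or empty (whitespace-only) title or description
df_prepared.dropna(subset=['title', 'content'], inplace=True)
df_prepared = df_prepared[df_prepared['content'].str.strip() != '']

# Draw a reproducible random sample for fine-tuning
DATASET_SAMPLE_SIZE = 10000
df_sample = df_prepared.sample(n=min(DATASET_SAMPLE_SIZE, len(df_prepared)), random_state=42)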

In [19]:
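def format_dataset(df):
    """Convert title/content rows into SQuAD-style question/context/answers records."""
    formatted_data = []
    for _, row in df.iterrows():
        context = str(row['content'])

        question = f"What are the features of {row['title']} ?"

        # The dataset has no annotated answer spans, so the whole description
        # is used as the answer, starting at character 0 of the context.
        answer_text = context
        answer_start = 0

        formatted_data.append({
            'question': question,
            'context': context,
            'answers': {
                'text': [answer_text],
                'answer_start': [answer_start]
            }
        })
    return formatted_data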
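
A SQuAD-style record is only usable if each answer text really appears in the context at its `answer_start` offset. The sketch below is one way to check that invariant. The helper name `check_qa_record` is hypothetical, and the context is shortened to the first sentence of the sample description shown in section 1.

```python
def check_qa_record(record):
    """Return True if every answer text is found in the context at its stated offset."""
    context = record["context"]
    answers = record["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        if context[start:start + len(text)] != text:
            return False
    return True


# Shortened example record in the same layout that format_dataset produces.
good = {
    "question": "What are the features of Girls Ballet Tutu Neon Pink ?",
    "context": "High quality 3 layer ballet tutu.",
    "answers": {"text": ["High quality 3 layer ballet tutu."], "answer_start": [0]},
}
# Same record, but the answer text does not start at offset 0 of the context.
bad = {**good, "answers": {"text": ["ballet tutu"], "answer_start": [0]}}

print(check_qa_record(good), check_qa_record(bad))
```

Because `format_dataset` always uses the full context with offset 0, its records pass this check by construction. The check becomes more useful if shorter answer spans are introduced later.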

In [20]:
qa_data = format_dataset(df_sample)
hf_dataset = Dataset.from_list(qa_data)

hf_dataset_split = hf_dataset.train_test_split(test_size=0.1, seed=42)  # fixed seed for a reproducible split
train_dataset = hf_dataset_split['train']
eval_dataset = hf_dataset_split['test']

print(train_dataset[0])



### 4. Running the Fine-Tuning

In [28]:
model_checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

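        # No overlap between the answer and this context window: label with index 0 ([CLS])
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Last token whose start offset is <= start_char, clamped to the context window
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(max(idx - 1, context_start))

            # First token whose end offset is >= end_char; clamp so a truncated answer
            # ends on the last context token rather than on the trailing [SEP]
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(min(idx + 1, context_end))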

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
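
The core of the preprocessing is converting a character span into token indices using the tokenizer's offset mapping. The sketch below isolates that step with hand-written offsets so it can run without a tokenizer. The function name `char_span_to_token_span` and the toy offsets are hypothetical. The clamping to the context window is an assumption added here so that a truncated answer ends on the last context token.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token) inside the context window.

    offsets holds one (char_start, char_end) pair per token. Returns (0, 0)
    when the answer lies entirely outside the window.
    """
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0

    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = max(idx - 1, context_start)

    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = min(idx + 1, context_end)
    return start_token, end_token


# Toy layout: [CLS] What ? [SEP] High quality tutu [SEP]
# Context "High quality tutu" occupies tokens 4..6; offsets are relative to the context.
offsets = [(0, 0), (0, 4), (4, 5), (0, 0), (0, 4), (5, 12), (13, 17), (0, 0)]

print(char_span_to_token_span(offsets, 4, 6, 0, 17))   # whole context as the answer
print(char_span_to_token_span(offsets, 4, 6, 5, 12))   # only the word "quality"
print(char_span_to_token_span(offsets, 4, 6, 0, 30))   # answer runs past the window
print(char_span_to_token_span(offsets, 4, 6, 20, 30))  # answer fully outside the window
```

The third call models a description longer than `max_length`: the answer is cut off, so the end index falls back to the last token that is still part of the context.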

In [30]:
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)
tokenized_eval_dataset = eval_dataset.map(preprocess_function, batched=True, remove_columns=eval_dataset.column_names)


Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [35]:
output_dir_drive = '/content/drive/MyDrive/TechChallenge/results'
final_model_path_drive = '/content/drive/MyDrive/TechChallenge/fine_tuned_bert_amazon_qa'

training_args = TrainingArguments(
    output_dir=output_dir_drive,
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",  # disable W&B logging so training does not block on an API key prompt
)
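
To get a feel for how long the run will take, it helps to estimate the number of optimizer steps implied by these settings. The sketch below is a rough approximation, assuming a single device, no gradient accumulation, and a final partial batch that is kept. The helper `optimizer_steps` is hypothetical, and the 9000 training examples assume the 10,000-row sample was fully available before the 90/10 split.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                    num_devices=1, grad_accum=1):
    """Approximate the total optimizer steps for a training run."""
    effective_batch = per_device_batch_size * num_devices * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# 9000 training examples, batch size 8, 1 epoch (values from the cells above).
print(optimizer_steps(9000, 8, 1))  # 1125
```

Doubling the batch size or adding gradient accumulation halves the step count, which is worth keeping in mind when adjusting the learning rate or schedule.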

In [36]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer,
)



### 5. Running the training

In [None]:
print("Starting the fine-tuning process...")
trainer.train()
print("Fine-tuning complete!")

Starting the fine-tuning process...



### 6. Saving the model for later use

In [None]:
trainer.save_model(final_model_path_drive)
print(f"Final model successfully saved to: {final_model_path_drive}")

### 7. Creating a pipeline with our new model


In [None]:
# Load the fine-tuned model from the same path it was saved to in step 6
qa_pipeline_finetuned = pipeline("question-answering", model=final_model_path_drive, tokenizer=final_model_path_drive)

print("\n--- Post-Fine-Tuning Test ---")
print(f"Question: {question}")

# Reuse the question and context from the baseline test (step 2) for a direct comparison
result_finetuned = qa_pipeline_finetuned(question=question, context=product_context)

print(f"\nBaseline Model Answer: '{result_baseline['answer']}' (Score: {result_baseline['score']:.4f})")
print(f"Fine-Tuned Model Answer: '{result_finetuned['answer']}' (Score: {result_finetuned['score']:.4f})")

# Try a second example taken from the held-out evaluation split
new_sample = eval_dataset[10]
new_question = new_sample['question']
new_context = new_sample['context']

print(f"Question: {new_question}")
result_new = qa_pipeline_finetuned(question=new_question, context=new_context)
print(f"Extracted Answer: '{result_new['answer']}'")
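
Comparing two answers by eye only goes so far. One possible way to quantify the comparison is SQuAD-style exact match and token-level F1 against a reference answer. The sketch below is a simplified version of those metrics, not something the notebook itself computes. The function names and the example strings are assumptions for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "High quality 3 layer ballet tutu."
prediction = "3 layer ballet tutu"
print(exact_match(prediction, reference), round(token_f1(prediction, reference), 2))
```

Averaging these scores over `eval_dataset`, with each record's `answers['text'][0]` as the reference, would give a single number per model. Because the reference here is the entire description, a high score would mostly reflect how much of the context the model returns.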