# Tech Challenge 3

- Fine-tune a foundation model (Llama, BERT, Mistral, etc.) using the "The AmazonTitles-1.3MM" dataset, a JSON file of Amazon books.

The trained model must:
- Receive questions with context obtained from the "trn.json" file contained in the dataset.
- Given a prompt formed from the user's question about a product title, generate an answer that draws on the product description learned during fine-tuning.

# Connecting to Drive

In [1]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Cleaning the Dataset

In [None]:
import re
import unicodedata

def clean_text(text):
    """Remove unwanted characters and accents, and normalize the text."""
    # Strip accents by decomposing characters and dropping non-ASCII bytes
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    # Remove special characters, keeping word characters, whitespace and .,!?
    text = re.sub(r'[^\w\s\.,!?]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()
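
The sample output further down shows descriptions such as `judith kerr8217s best8211selling`, which suggests the raw text contains HTML character references (for example `&#8217;`) whose punctuation is stripped while the digits survive. As a possible refinement, not what the notebook ran, the sketch below decodes entities with the standard-library `html.unescape` before cleaning. The helper name `clean_text_unescaped` and the sample string are hypothetical.

```python
import html
import re
import unicodedata

def clean_text(text):
    """Same cleaning steps as the notebook's clean_text."""
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    text = re.sub(r'[^\w\s\.,!?]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

def clean_text_unescaped(text):
    """Hypothetical variant: decode HTML entities first, then clean."""
    return clean_text(html.unescape(text))

# Made-up input that mimics an entity-encoded description
sample = "Judith Kerr&#8217;s best&#8211;selling adventures"
print(clean_text(sample))            # entity digits leak into the text
print(clean_text_unescaped(sample))  # entities are decoded, then dropped as non-ASCII
```

Whether this matters depends on how often entities occur in `trn.json`; checking a few raw lines first would confirm the assumption.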

In [None]:
import json
from hashlib import md5

def load_and_clean_dataset_streaming(file_path, output_path=None, return_data=False):
    """
    Load, clean, and save the data efficiently, one line at a time.

    Args:
        file_path (str): Path to the input JSON Lines file.
        output_path (str): Path where the cleaned data is saved. Optional.
        return_data (bool): Whether the cleaned data should be returned.

    Returns:
        list: Cleaned records if return_data is True, otherwise None.
    """
    seen_hashes = set()
    total_lines = 0
    unique_lines = 0
    cleaned_data = [] if return_data else None

    with open(file_path, 'r', encoding='utf-8') as infile:
        outfile = open(output_path, 'w', encoding='utf-8') if output_path else None

        for line in infile:
            total_lines += 1
            try:
                # Parse the JSON record
                item = json.loads(line)
                title = clean_text(item.get("title", ""))
                content = clean_text(item.get("content", ""))

                if title and content:
                    # Hash of title + content, used to detect duplicates
                    entry_hash = md5(f"{title}{content}".encode('utf-8')).hexdigest()
                    if entry_hash not in seen_hashes:
                        record = {"title": title, "content": content}
                        if return_data:
                            cleaned_data.append(record)
                        if outfile:
                            json.dump(record, outfile)
                            outfile.write("\n")
                        seen_hashes.add(entry_hash)
                        unique_lines += 1

                # Report progress every 10,000 lines
                if total_lines % 10000 == 0:
                    print(f"Processed {total_lines} lines, {unique_lines} unique...")

            except json.JSONDecodeError:
                print(f"Could not decode line {total_lines}. Skipping...")
            except Exception as e:
                print(f"Unexpected error on line {total_lines}: {e}")

        if outfile:
            outfile.close()

    print(f"Total lines processed: {total_lines}")
    print(f"Total unique records: {unique_lines}")

    return cleaned_data

# Reducing the Dataset

With the dataset cleaned, we chose to use one quarter of it to train the model.

In [None]:
# Adjusting the dataset

# Import libraries
import json

# Input and output paths
arquivo_json = '/content/drive/MyDrive/cleaned_data.json'
arquivo_reduzido = '/content/drive/MyDrive/reduced_data.json'

# Count the number of lines in the file
with open(arquivo_json, 'r', encoding='utf-8') as file:
    total_linhas = sum(1 for _ in file)

# Load the JSON Lines file
with open(arquivo_json, 'r', encoding='utf-8') as file:
    dados = [json.loads(line) for line in file]

# Determine the size and select one quarter
total_elementos = len(dados)
um_quarto = total_elementos // 4

# Use only the first quarter of the file
dados_reduzidos = dados[:um_quarto]

# Compute the size of the reduced file in bytes
dados_reduzidos_json = json.dumps(dados_reduzidos, indent=4)  # Serialize to a formatted JSON string
tamanho_bytes = len(dados_reduzidos_json.encode('utf-8'))  # Size in bytes
tamanho_mb = tamanho_bytes / (1024 * 1024)  # Convert to MB

# Save the reduced file
with open(arquivo_reduzido, 'w', encoding='utf-8') as file:
    file.write(dados_reduzidos_json)

# Print summary information
print(f"Total number of lines in the file: {total_linhas}")
print(f"Total number of records loaded: {total_elementos}")
print(f"Records selected: {um_quarto}")
print(f"Size of the reduced file: {tamanho_mb:.2f} MB")
print(f"Reduced file saved to: {arquivo_reduzido}")
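
Taking the first quarter keeps whatever ordering the source file has. If a random quarter is preferred, a seeded sample is one option. This is a sketch of an alternative, not what the notebook ran; `sample_fraction` and the toy records are hypothetical.

```python
import random

def sample_fraction(records, fraction=0.25, seed=3407):
    """Return a reproducible random subset holding `fraction` of the records."""
    k = int(len(records) * fraction)
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(records, k)

# Toy records standing in for the cleaned dataset
records = [{"title": f"book {i}", "content": f"description {i}"} for i in range(20)]
subset = sample_fraction(records)
print(len(subset))  # 5
```

Reusing the same seed returns the same subset, which keeps the experiment repeatable.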


-----

# Installing Unsloth

In [2]:
%%capture
# Install dependencies (%%capture must be the first line of the cell)
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [3]:
# Install dependencies
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118


In [4]:
# Import dependencies
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True   # Load the weights with 4-bit quantization to reduce memory use

fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",
    "unsloth/Mistral-Small-Instruct-2409",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",

    "unsloth/Llama-3.2-1B-bnb-4bit",
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit"
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

# Adding LoRA

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.12.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
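
The number of trainable parameters follows from the LoRA settings: each adapted linear layer with `in` inputs and `out` outputs gains two low-rank matrices holding `r * (in + out)` parameters in total. The sketch below redoes that arithmetic. The layer dimensions (hidden size 3072, MLP size 8192, key/value projection width 1024) are assumptions about Llama 3.2 3B taken from its published configuration, not from this notebook; with them, the total agrees with the 24,313,856 trainable parameters reported in the training log.

```python
# LoRA adds r * (in_features + out_features) parameters per adapted linear layer.
r = 16
num_layers = 28  # matches "patched 28 layers" in the Unsloth output

# Assumed Llama 3.2 3B dimensions (not stated in the notebook)
hidden = 3072
kv_width = 1024  # 8 key/value heads * head dimension 128
mlp = 8192

# (in_features, out_features) for each module listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_width),
    "v_proj": (hidden, kv_width),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * num_layers
print(per_layer, total)
```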


# Preparing the JSON
Converting the dataset (JSON) into the conversation format expected by the Llama chat template.

In [6]:
# Import dependencies
import pandas as pd
import json

# Load the JSON file
with open('/content/drive/MyDrive/reduced_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to a DataFrame
df = pd.DataFrame(data)

# Build a ShareGPT-style conversation: the title is the question, the description is the answer
def create_conversation(row):
    return [
        {"from": "human", "value": row['title']},
        {"from": "gpt", "value": row['content']}
    ]

# Apply the function to create the new column
df['conversation'] = df.apply(create_conversation, axis=1)

# Show the updated DataFrame
print(df)

# Save the updated DataFrame to a new JSON Lines file
df.to_json('/content/drive/MyDrive/output_with_conversations.json', orient='records', lines=True, force_ascii=False)

                                                    title  \
0                             girls ballet tutu neon pink   
1                                            mogs kittens   
2                             girls ballet tutu neon blue   
3                                             the prophet   
4                               rightly dividing the word   
...                                                   ...   
341671                            doc savage skull island   
341672  mexican spanish accelerated 8 one hour audio c...   
341673  the nonprofit business plan a leaders guide to...   
341674        gracious wild a shamanic journey with hawks   
341675                           love and spirit medicine   

                                                  content  \
0       high quality 3 layer ballet tutu. 12 inches in...   
1       judith kerr8217s best8211selling adventures of...   
2       dance tutu for girls ages 28 years. perfect fo...   
3       in a distant, t

# Preparing the Dataset

In [7]:
!pip install datasets



In [8]:
# Import dependencies
from unsloth.chat_templates import get_chat_template
from datasets import Dataset

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset

# Load the custom dataset
dataset = Dataset.from_json('/content/drive/MyDrive/output_with_conversations.json')

#dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

Generating train split: 0 examples [00:00, ? examples/s]

Next, we use the `standardize_sharegpt` function to convert the ShareGPT-style dataset into the generic Hugging Face format. This changes each record from this form:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to this form:
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
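
The conversion can be pictured as a key and role rename. The sketch below is a plain-Python illustration of that idea, not Unsloth's implementation of `standardize_sharegpt`; the helper `to_role_content` and the shortened sample text are hypothetical.

```python
# ShareGPT speaker names mapped to the generic chat roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    """Rename 'from'/'value' keys to 'role'/'content' and map the speaker names."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]

# One record in the shape produced by create_conversation (description shortened)
sharegpt = [
    {"from": "human", "value": "girls ballet tutu neon pink"},
    {"from": "gpt", "value": "high quality 3 layer ballet tutu."},
]
converted = to_role_content(sharegpt)
print(converted)
```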

In [9]:
# Import dependencies
from unsloth.chat_templates import standardize_sharegpt

# Rename the 'conversation' column to 'conversations'
dataset = dataset.rename_column('conversation', 'conversations')

# Apply standardize_sharegpt, then render each conversation with the chat template
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched=True)

Standardizing format:   0%|          | 0/341676 [00:00<?, ? examples/s]

Map:   0%|          | 0/341676 [00:00<?, ? examples/s]

**[Note]** The default Llama 3.1 Instruct chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` to the chat.

<a name="Train"></a>
### Train the model

In [10]:
# Import dependencies
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

# Trainer configuration
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 50,
        #num_train_epochs = 5,
        max_steps = 500,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Map (num_proc=2):   0%|          | 0/341676 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [11]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/341676 [00:00<?, ? examples/s]

In [12]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nworship with don moen vhs<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nworship with don moen vhs<|eot_id|>'

In [13]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                          \n\nworship with don moen vhs<|eot_id|>'
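
The decoded labels above show the effect of `train_on_responses_only`: every position before the assistant's reply is set to -100, which the loss ignores, so only the response tokens contribute to training. A toy sketch of that masking idea with made-up token ids follows; it illustrates the concept and is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_prompt_tokens(input_ids, response_start):
    """Copy input_ids into labels, hiding everything before the response."""
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])

# Hypothetical ids: 5 prompt tokens followed by 3 response tokens
input_ids = [1, 11, 12, 13, 2, 21, 22, 3]
labels = mask_prompt_tokens(input_ids, response_start=5)
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 3]
```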

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.635 GB of memory reserved.


In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 341,676 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 500
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,3.2119
2,3.5511
3,3.2053
4,2.9116
5,3.6089
6,3.417
7,3.3493
8,3.3333
9,3.3097
10,4.1297


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

446.5262 seconds used for training.
7.44 minutes used for training.
Peak reserved memory = 6.531 GB.
Peak reserved memory for training = 3.896 GB.
Peak reserved memory % of max memory = 44.284 %.
Peak reserved memory for training % of max memory = 26.417 %.


<a name="Inference"></a>
### Inference
Let's run the model!

In [15]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "description of girls ballet tutu neon pink"},
    {"role": "user", "content": "description of harry potter"},
    {"role": "user", "content": "description of the prophet"},
    {"role": "user", "content": "description of girls ballet tutu neon blue"},
    {"role": "user", "content": "description of rightly dividing the word 	"},
    {"role": "user", "content": "description of doc savage skull island"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ndescription of girls ballet tutu neon pink<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ndescription of harry potter<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ndescription of the prophet<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ndescription of girls ballet tutu neon blue<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ndescription of rightly dividing the word \t<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ndescription of doc savage skull island<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\ndescription of 100 things to make your house look amazing<|eot_id|>']
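
Note that the `messages` list above places six consecutive user turns in a single conversation, so the model produces one reply for the whole sequence, as the decoded output shows. To get one answer per title, each question can go in its own conversation. A minimal sketch of that restructuring follows; `build_single_turn_conversations` is a hypothetical helper, and each resulting conversation would then be passed separately to `tokenizer.apply_chat_template` and `model.generate`.

```python
def build_single_turn_conversations(questions):
    """Wrap each question in its own one-message conversation."""
    return [[{"role": "user", "content": q.strip()}] for q in questions]

questions = [
    "description of girls ballet tutu neon pink",
    "description of harry potter",
    "description of rightly dividing the word \t",  # stray tab is stripped
]
conversations = build_single_turn_conversations(questions)
for convo in conversations:
    print(convo)
```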

<a name="Save"></a>
# Saving and loading the fine-tuned model
To save the final model as LoRA adapters, you can use Hugging Face's `push_to_hub` to save it online or `save_pretrained` to save it locally.

In [16]:
model.save_pretrained("lora_model") # save locally
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [17]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "description of girls ballet tutu neon pink"},
    {"role": "user", "content": "description of harry potter"},
    {"role": "user", "content": "description of the prophet"},
    {"role": "user", "content": "description of girls ballet tutu neon blue"},
    {"role": "user", "content": "description of rightly dividing the word 	"},
    {"role": "user", "content": "description of doc savage skull island"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

description of girls ballet tutu pink<|eot_id|>


In [18]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

# Generating Answers

In [None]:
# Function that generates an answer for an arbitrary question
def generate_response(question, max_new_tokens=128, temperature=1.5, min_p=0.1):
    """
    Generate an answer to the given question using the fine-tuned model.

    Args:
        question (str): The user's question.
        max_new_tokens (int): Maximum number of tokens in the generated answer.
        temperature (float): Controls the randomness of the generated text.
        min_p (float): Minimum token probability, scaled by the probability of
            the most likely token; tokens below it are filtered out.

    Returns:
        str: The text decoded from the model output.
    """
    # Build the message structure from the user's input
    messages = [
        {"role": "user", "content": question},
    ]

    # Prepare the model input using the tokenizer's chat template
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Append the header that starts the assistant turn
        return_tensors="pt",
    ).to("cuda")

    # Generate the answer with the model
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        min_p=min_p,
        use_cache=True
    )

    # Decode the generated sequence
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return response.strip()

# Example usage: interactive question loop
if __name__ == "__main__":
    print("Welcome to the Amazon book question system!")
    while True:
        user_question = input("Ask your question (or type 'exit' to quit): ")
        if user_question.lower() == 'exit':
            print("Shutting down. See you later!")
            break
        try:
            answer = generate_response(user_question)
            print(f"Answer: {answer}")
        except Exception as e:
            print(f"Error generating the answer: {e}")


Welcome to the Amazon book question system!
Answer: system

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

user

harry potterassistant

j.k. rowling was born and raised in england and lives in scotland with her husband and her son.
Answer: system

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

user

doc savage skull islandassistant

this is one of the best adventures ever written, says author gregory king. with its fast moving story and the action and excitement of savage adventures, this volume should find its way onto many book lists for the years top 100 fantasy adventure stories.... the great thing about a savage story is that it will make you feel good, feel inspired, and maybe even urge you on to a little adventure of your own. he takes a well known idea and makes it his own...this story is a fun ride for anyone who likes a little action, adventure, and fun. a wild time, says robert n. buie. in
Answer: system

Cutting Knowledg
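
The answers above repeat the system header and the question because the whole output sequence is decoded, prompt included. One common way to return only the reply is to slice off the prompt tokens before decoding. The sketch below shows the slicing on plain lists with made-up token ids; with tensors the equivalent would be `outputs[0][inputs.shape[1]:]`, assuming `inputs` is the 2-D `input_ids` tensor as in `generate_response`.

```python
def strip_prompt(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]

prompt_ids = [101, 7, 8, 9]            # hypothetical prompt token ids
output_ids = prompt_ids + [42, 43, 44]  # generate() returns prompt + new tokens
new_tokens = strip_prompt(output_ids, len(prompt_ids))
print(new_tokens)  # [42, 43, 44]
```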

In [21]:
# Display the full dataset as a DataFrame

import pandas as pd
df = pd.DataFrame(dataset) # Create a DataFrame from the 'dataset'

display(df)

Unnamed: 0,title,content,conversations,text
0,girls ballet tutu neon pink,high quality 3 layer ballet tutu. 12 inches in...,"[{'content': 'girls ballet tutu neon pink', 'r...",<|begin_of_text|><|start_header_id|>system<|en...
1,mogs kittens,judith kerr8217s best8211selling adventures of...,"[{'content': 'mogs kittens', 'role': 'user'}, ...",<|begin_of_text|><|start_header_id|>system<|en...
2,girls ballet tutu neon blue,dance tutu for girls ages 28 years. perfect fo...,"[{'content': 'girls ballet tutu neon blue', 'r...",<|begin_of_text|><|start_header_id|>system<|en...
3,the prophet,"in a distant, timeless place, a mysterious pro...","[{'content': 'the prophet', 'role': 'user'}, {...",<|begin_of_text|><|start_header_id|>system<|en...
4,rightly dividing the word,this text refers to thepaperbackedition.,"[{'content': 'rightly dividing the word', 'rol...",<|begin_of_text|><|start_header_id|>system<|en...
...,...,...,...,...
341671,doc savage skull island,will murray is the author of more than 50 nove...,"[{'content': 'doc savage skull island', 'role'...",<|begin_of_text|><|start_header_id|>system<|en...
341672,mexican spanish accelerated 8 one hour audio c...,"if you travel to mexico, eat mexican food, or ...",[{'content': 'mexican spanish accelerated 8 on...,<|begin_of_text|><|start_header_id|>system<|en...
341673,the nonprofit business plan a leaders guide to...,"ldquono matter who you aremdashnonprofit ceo, ...",[{'content': 'the nonprofit business plan a le...,<|begin_of_text|><|start_header_id|>system<|en...
341674,gracious wild a shamanic journey with hawks,"simultaneously realistic and mystical,gracious...",[{'content': 'gracious wild a shamanic journey...,<|begin_of_text|><|start_header_id|>system<|en...


# Conclusions and Lessons Learned

- The features used were 'title' (the book title) and 'content' (the book description).

- The model receives questions about a book and generates an answer based on the knowledge it acquired during training.

## Steps followed:

- We started by analyzing the dataset and cleaning it, removing rows with null or duplicated data.

- We selected one quarter of the cleaned data to train the model (about 340K records).

- We reshaped the dataset into the conversation format expected by the Llama 3.1 chat template.

- We chose the Llama 3.2 3B Instruct model together with Unsloth and LoRA to make training more efficient, since the dataset is relatively large.

- We trained the model for 500 steps with 50 warmup steps and the 8-bit AdamW optimizer (`adamw_8bit`).
With these parameters, we were able to train the model and obtain coherent answers.


Initially, we tried to train a model using GPT-2, but with this volume of data, training was very slow and we did not obtain coherent answers.