## 10. Large Language Models (LLM)

In this notebook we explore one of the most widely used large language models, Google's Flan T5 (https://huggingface.co/google-t5/t5-base), to see how to extract information from text and how to generate it with the HuggingFace libraries. The link points to the T5 base model card; the code below loads the `google/flan-t5-base` checkpoint.


Exercise:

* Compare the results with the small and large versions of T5.
* Explore other open-source models available on HuggingFace (https://huggingface.co/models?other=LLM).
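
One possible way to organize the first exercise, once the rest of the notebook has been run, is sketched below. It assumes the Flan-T5 family and the Hub naming pattern `google/flan-t5-<size>`; the helper names `summarize_with` and `compare_checkpoints` are hypothetical and not part of the original notebook.

```python
# Checkpoint names assumed to follow the Hub pattern "google/flan-t5-<size>".
SIZES = ["small", "base", "large"]
CHECKPOINTS = [f"google/flan-t5-{size}" for size in SIZES]


def summarize_with(checkpoint, prompt, max_new_tokens=50):
    """Load one checkpoint and generate text for a single prompt."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def compare_checkpoints(prompt, checkpoints=CHECKPOINTS):
    """Return {checkpoint: generated text}; each model is downloaded on first use."""
    return {name: summarize_with(name, prompt) for name in checkpoints}
```

Calling `compare_checkpoints(prompt)` with one of the prompts built later in the notebook would give one generated summary per model size, which can then be read side by side.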

## LLM: Flan T5

**Installing the Transformers library (HuggingFace)**

In [None]:
# install the required libraries
!pip install transformers
!pip install datasets


In [None]:
# import the libraries
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

### 1. Loading and Exploring the Dataset

As a reference we will use the dialogue dataset `knkarthick/dialogsum`, published by the HuggingFace user knkarthick, who has compiled thousands of conversations, each with its summary and topic.

In [None]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

`load_dataset` already returns the data split into training, validation, and test sets.

In [None]:
print('Training size: ', len(dataset['train']))
print('Validation size: ', len(dataset['validation']))
print('Test size: ', len(dataset['test']))

Training size:  12460
Validation size:  500
Test size:  1500


Let's look at a few examples from the data.

In [None]:
# print the dialogue and the reference summary for each requested index
def print_text(indices, data):
    for i, idx in enumerate(indices):
        print('Example ', i + 1)
        print('\n')
        print('DIALOGUE:')
        # index one record at a time; indexing with the whole list
        # and taking [0] would always print the first record
        print(data[idx]['dialogue'])
        print('\n')
        print('SUMMARY:')
        print(data[idx]['summary'])
        print('\n')

In [None]:
print_text([42, 200, 1300], dataset['test'])

Example  1


DIALOGUE:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.


SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.


Example  2


DIALOGUE:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program
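
The dialogues printed above follow a one-turn-per-line layout of the form `#PersonN#: utterance`. As a small exploration aid, the sketch below splits a dialogue string into `(speaker, utterance)` pairs. The helper name `split_turns` and the regular expression are assumptions based only on the examples shown here, not something the dataset documents.

```python
import re

# Pattern inferred from the printed examples: "#Person1#: some text".
TURN_RE = re.compile(r"^#(Person\d+)#:\s*(.*)$")


def split_turns(dialogue):
    """Split a dialogue string into a list of (speaker, utterance) tuples."""
    turns = []
    for raw_line in dialogue.splitlines():
        line = raw_line.strip()
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns and line:
            # A line without a speaker tag is treated as a continuation
            # of the previous utterance.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}")
    return turns


sample = (
    "#Person1#: I drink a lot of wine.\n"
    "#Person2#: If I were you, I wouldn't drink too much."
)
for speaker, utterance in split_turns(sample):
    print(f"{speaker}: {utterance}")
```

With the turns separated, it becomes straightforward to count turns per dialogue or measure how much each speaker says before summarizing.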

### 2. Summarizing with the LLM - Prompt Engineering

In [None]:
# for simplicity, this example uses only the test split
data = dataset['test']

In [None]:
# load the base model
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')

In [None]:
# load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base', use_fast=True)

We can inspect the tokenizer's output.

In [None]:
sentence = "Hello, today is a good day"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0],
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([8774,    6,  469,   19,    3,    9,  207,  239,    1])

DECODED SENTENCE:
Hello, today is a good day
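
To read the tensor above: each integer is the ID of a sub-word piece, and the trailing `1` is T5's end-of-sequence token `</s>`, which `skip_special_tokens=True` drops when decoding. The toy decoder below illustrates the idea. Its vocabulary is an assumption inferred from the nine IDs printed above (the real mapping lives in the tokenizer's SentencePiece model), and `toy_decode` is a hypothetical helper, not a library function.

```python
# Toy vocabulary: the piece strings are inferred from the output above and are
# an assumption; "▁" marks a piece that starts a new word.
TOY_VOCAB = {
    8774: "▁Hello", 6: ",", 469: "▁today", 19: "▁is",
    3: "▁", 9: "a", 207: "▁good", 239: "▁day", 1: "</s>",
}
SPECIAL_TOKENS = {"</s>", "<pad>"}


def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to pieces, optionally drop special tokens, and join them."""
    pieces = [TOY_VOCAB[i] for i in ids]
    if skip_special_tokens:
        pieces = [p for p in pieces if p not in SPECIAL_TOKENS]
    # Joining the pieces and turning "▁" into spaces restores the sentence.
    return "".join(pieces).replace("▁", " ").strip()


ids = [8774, 6, 469, 19, 3, 9, 207, 239, 1]
print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=False))
```

The second call keeps the `</s>` marker at the end of the string, which is the token that tells the model where the input stops.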


Let's first try to get a baseline summary from the model by giving it only the dialogue, with no instruction about what to generate.

In [None]:
for i, index in enumerate([42, 200]):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    inputs = tokenizer(dialogue, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print('\n')
    print('Example ', i + 1)
    print('\n')
    print(f'DIALOGUE:\n{dialogue}')
    print('\n')
    print(f'ORIGINAL SUMMARY:\n{summary}')
    print('\n')
    print(f'GENERATED SUMMARY:\n{output}\n')



Example  1


DIALOGUE:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.


ORIGINAL SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.


GENERATED SUMMARY:
Person1: I'm worried about my future.



Example  2


DIALOGUE:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting progra

The result is poor: the model has no guidance on what it should generate. We can improve it with *prompt engineering*.
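
To make "poor" a little less subjective when comparing prompts, one option is a crude word-overlap score between the reference summary and the generated one. The sketch below computes a unigram F1 in the spirit of ROUGE-1. It is a simplified stand-in, not the official metric, and `unigram_f1` is a hypothetical helper that the original notebook does not use.

```python
import re
from collections import Counter


def unigram_f1(reference, candidate):
    """Word-overlap F1 between two strings (case-insensitive, \\w+ tokens)."""
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = (
    "#Person1# wants to adjust #Person1#'s life and #Person2# suggests "
    "#Person1# be positive and stay healthy."
)
generated = "Person1: I'm worried about my future."
print(round(unigram_f1(reference, generated), 3))
```

A score near 0 means the two summaries share almost no words and a score of 1 means they use exactly the same words. Word overlap ignores meaning, so it complements reading the outputs rather than replacing it.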

In [None]:
for i, index in enumerate([42,200]):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Dialogue:

{dialogue}

What was going on?
"""

    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print('\n')
    print('Example ', i + 1)
    print('\n')
    print(f'DIALOGUE:\n{dialogue}')
    print('\n')
    print(f'ORIGINAL SUMMARY:\n{summary}')
    print('\n')
    print(f'GENERATED SUMMARY:\n{output}\n')



Example  1


DIALOGUE:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.


ORIGINAL SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.


GENERATED SUMMARY:
Person1 is worried about his future.



Example  2


DIALOGUE:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program

We can improve the result further by including one solved example in the prompt, that is, by using *one-shot* prompting.
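
A practical caveat when adding solved examples is that every example lengthens the prompt. The sketch below assembles a prompt from any number of `(dialogue, summary)` pairs and stops adding examples once a token budget would be exceeded. The helper `build_few_shot_prompt`, the default budget of 512 (the input length commonly associated with T5 checkpoints; `tokenizer.model_max_length` gives the actual value) and the whitespace-based token count are all assumptions made for illustration.

```python
def build_few_shot_prompt(examples, test_dialogue, max_tokens=512,
                          count_tokens=lambda text: len(text.split())):
    """Assemble solved examples followed by the unanswered test dialogue.

    examples: list of (dialogue, summary) pairs.
    count_tokens: hypothetical stand-in for a real tokenizer's length.
    """
    question = "What was going on?"
    test_block = f"\nDialogue:\n\n{test_dialogue}\n\n{question}\n"
    used = count_tokens(test_block)
    blocks = []
    for dialogue, summary in examples:
        block = f"\nDialogue:\n\n{dialogue}\n\n{question}\n{summary}\n\n"
        cost = count_tokens(block)
        if used + cost > max_tokens:
            break  # stop adding examples once the budget would be exceeded
        blocks.append(block)
        used += cost
    # The test dialogue always comes last and is left without an answer.
    return "".join(blocks) + test_block


examples = [
    ("#Person1#: Hi.", "They greet."),
    ("#Person1#: Bye.", "They part."),
]
print(build_few_shot_prompt(examples, "#Person1#: Test."))
```

With the notebook's tokenizer loaded, passing `count_tokens=lambda t: len(tokenizer(t)["input_ids"])` would measure the budget in real tokens instead of whitespace-separated words.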

In [None]:
# build the one-shot prompt: solved example(s) first, then the test dialogue
prompt = ''
for index in [10]:
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    # example dialogue together with its reference summary
    prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""

# test dialogue, appended after the example instead of overwriting it
dialogue_test = dataset['test'][100]['dialogue']
summary_test = dataset['test'][100]['summary']

prompt += f"""
Dialogue:

{dialogue_test}

What was going on?
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print('\n')
print('Example ', 1)
print('\n')
print(f'DIALOGUE:\n{dialogue_test}')
print('\n')
print(f'ORIGINAL SUMMARY:\n{summary_test}')
print('\n')
print(f'GENERATED SUMMARY:\n{output}\n')



Example  1


DIALOGUE:
#Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.
#Person2#: I'm not so sure about that.
#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.


ORIGINAL SUMMARY:
#Person1# and Mike have a disagreement on how to act out a scene. #Person1# proposes that Mike can try to act in #Person1#'s way.


RESUMEN