# Notebook for summarizing a long text by splitting it repeatedly

## Loading the already fine-tuned model

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoConfig, TrainingArguments, Trainer
from datasets import load_dataset
from peft import LoraConfig
import torch
from torch.cuda.amp import autocast, GradScaler

In [2]:
peft_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_rslora=False)

In [3]:
quantization_config = BitsAndBytesConfig(
        # load_in_8bit=True,
        # load_in_4bit=True,
        llm_int8_enable_fp32_cpu_offload=True,
        # llm_int8_has_fp16_weight=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16"
)

In [4]:
device = 'cuda'

In [5]:
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", quantization_config=quantization_config, device_map="auto")


In [6]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

### Defining the expected output

In [25]:
messages = [
        {
            "role": "user",
            "content": "summarize harry potter 1"
        }
]

In [26]:
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
# model.to(device)

# Generate from the tensor that was moved to the model's device
generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=10000)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] summarize harry potter 1 [/INST] "Harry Potter and the Philosopher's Stone" is the first book in J.K. Rowling's popular Harry Potter series. The story follows Harry Potter, an eleven-year-old orphan who discovers that he is a wizard and is invited to attend Hogwarts School of Witchcraft and Wizardry.

At Hogwarts, Harry learns that a mysterious entity called the "Philosopher's Stone" is capable of turning any substance into gold and creating an Elixir of Life, which can extend a person's lifetime indefinitely. Harry also makes new friends like Hermione Granger and Ron Weasley, and they form a close bond as they navigate their magical studies and face various challenges, including attacks from a three-headed dog named Fluffy and a troll in the school.

However, Harry also attracts unwanted attention, as the villainous Lord Voldemort has returned and is seeking to obtain the Philosopher's Stone for himself, believing it will make him immortal once again. In the climax of the s

### Testing on a single request

In [28]:
# # Chat Completion API
# 
# summary = llm.create_chat_completion(
#     messages = [
#         {"role": "system", "content": "Your job is to make summaries of long texts."},
#         {
#             "role": "user",
#             "content": "summarize harry potter 1"
#         }
#     ]
# )
# 


llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                

#### Extracting only the part of the response we are interested in

In [40]:
# liste_dictionnaires = summary["choices"][0]['message']['content']
# 
# print(liste_dictionnaires)

 Title: Harry Potter and the Philosopher's Stone (Sorcerer's Stone in some editions)

Summary:

Harry Potter, an orphan boy living with his abusive aunt and uncle, receives a letter on his eleventh birthday inviting him to attend Hogwarts School of Witchcraft and Wizardry. Unbeknownst to Harry, he is a famous survivor of Voldemort, a dark wizard who had tried to kill him as an infant. At Hogwarts, Harry learns about his magical abilities from his kind mentor, Professor Albus Dumbledore, and makes new friends, Hermione Granger and Ron Weasley.

The trio hears rumors of a legendary Philosopher's (Sorcerer's) Stone that can turn any metal into gold and produce the Elixir of Life, granting immortality. Harry learns that Voldemort is after this stone to regain his youth and power. With Dumbledore's guidance, they embark on a quest to protect the stone from falling into Voldemort's hands.

The trio faces numerous challenges, including battling trolls, cheating in exams, and dealing with a mi

## Functions

In [34]:
# Split the text into segments of roughly `segment_length` characters (default 1000)
def split_text_into_segments(text, segment_length=1000, instruction="make me a short summary of the following text:"):
    # Note: `instruction` is currently unused; the prompt is built in summarize_text_segments
    segments = []
    current_segment = ""

    words = text.split()
    for word in words:
        # Append the word to the current segment
        current_segment += word + " "

        # Close the segment once it reaches the maximum length
        if len(current_segment) >= segment_length:
            # Wrap the segment in a one-element list
            segment_with_instruction = [f"{current_segment.strip()}"]
            # Add the finished segment to the list of segments
            segments.append(segment_with_instruction)
            # Start a new segment
            current_segment = ""

    # Add the final segment if it has not been added yet
    if current_segment:
        # Wrap the final segment in a one-element list as well
        segment_with_instruction = [f"{current_segment.strip()}"]
        segments.append(segment_with_instruction)

    return segments
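
The function above wraps every segment in a one-element list, and that list's `repr` (the `[\'...\']` visible in the printed prompts further down) ends up inside the prompt. A minimal sketch of a variant that returns plain strings instead, keeping the same length rule; the name `split_text_into_strings` is hypothetical and not part of the original notebook.

```python
def split_text_into_strings(text, segment_length=1000):
    """Split text on whitespace into chunks of roughly segment_length characters."""
    segments = []
    current_words = []
    current_length = 0

    for word in text.split():
        current_words.append(word)
        # Count the word plus the space that follows it
        current_length += len(word) + 1

        # Close the chunk once it reaches the target length
        if current_length >= segment_length:
            segments.append(" ".join(current_words))
            current_words = []
            current_length = 0

    # Keep whatever is left over as the final chunk
    if current_words:
        segments.append(" ".join(current_words))

    return segments


sample = "one two three four five six seven eight nine ten"
for chunk in split_text_into_strings(sample, segment_length=15):
    print(repr(chunk))
```

Because a chunk is only closed after the word that crosses the limit has been added, chunks can be slightly longer than `segment_length`; words are never cut in half.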


In [28]:
# Summarize each segment
def summarize_text_segments(segments):
    summaries = []
    for segment in segments:

        message = [
            {
                "role": "user",
                "content": f"summarize this text : "
                           f"{segment}"
            }
        ]

        encodeds = tokenizer.apply_chat_template(message, return_tensors="pt")

        # Move the prompt to the same device as the model before generating
        model_inputs = encodeds.to(device)

        generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=10000)
        decoded = tokenizer.batch_decode(generated_ids)

        summaries.append(decoded[0])
    return summaries


In [13]:
# # Summarize each segment
# def summarize_text_segments(segments):
#     summaries = []
#     for segment in segments:
#         print("start of segment")
#         requete = llm.create_chat_completion(
#                 messages = [
#                     {"role": "system", "content": "Your job is to make summaries of long texts."},
#                     {
#                         "role": "user",
#                         "content": f"{segment}"
#                     }
#                     ]
#                 )
#         
#         summary = [requete["choices"][0]['message']['content']]
#         summaries.append(summary)
#     return summaries


In [29]:
# Gather the summaries and summarize them once more
def summarize_summaries(summaries):
    concatenated_summaries = ""
    for summary in summaries:
        concatenated_summaries += str(summary) + " "

    message = [
        {
            "role": "user",
            "content": f"summarize this text : "
                       f"{concatenated_summaries}"
        }
    ]

    encodeds = tokenizer.apply_chat_template(message, return_tensors="pt")

    # Move the prompt to the same device as the model before generating
    model_inputs = encodeds.to(device)

    generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=10000)
    decoded = tokenizer.batch_decode(generated_ids)

    return decoded[0]

## Application to a long text taken from a book

### Loading a dataset of book chapters

In [30]:
from datasets import load_dataset
dataset = load_dataset("pszemraj/booksum-short")

In [35]:
# Your very long text
long_text = dataset["train"]["chapter"][0]

#### Splitting the long text into segments

In [36]:
# Split the text into segments
text_segments = split_text_into_segments(long_text)

In [37]:
print(text_segments[0])

['"Before these fields were shorn and tilled, Full to the brim our rivers flowed; The melody of waters filled The fresh and boundless wood; And torrents dashed, and rivulets played, And fountains spouted in the shade." BRYANT. Leaving the unsuspecting Heyward and his confiding companions to penetrate still deeper into a forest that contained such treacherous inmates, we must use an author\'s privilege, and shift the scene a few miles to the westward of the place where we have last seen them. On that day, two men were lingering on the banks of a small but rapid stream, within an hour\'s journey of the encampment of Webb, like those who awaited the appearance of an absent person, or the approach of some expected event. The vast canopy of woods spread itself to the margin of the river overhanging the water, and shadowing its dark current with a deeper hue. The rays of the sun were beginning to grow less fierce, and the intense heat of the day was lessened, as the cooler vapors of the spri

### Summarizing the different parts of this chapter

In [38]:
# Summarize each segment
segment_summaries = summarize_text_segments(text_segments)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [39]:
print(segment_summaries)

[['<s> [INST] summarize this text : [\'"Before these fields were shorn and tilled, Full to the brim our rivers flowed; The melody of waters filled The fresh and boundless wood; And torrents dashed, and rivulets played, And fountains spouted in the shade." BRYANT. Leaving the unsuspecting Heyward and his confiding companions to penetrate still deeper into a forest that contained such treacherous inmates, we must use an author\\\'s privilege, and shift the scene a few miles to the westward of the place where we have last seen them. On that day, two men were lingering on the banks of a small but rapid stream, within an hour\\\'s journey of the encampment of Webb, like those who awaited the appearance of an absent person, or the approach of some expected event. The vast canopy of woods spread itself to the margin of the river overhanging the water, and shadowing its dark current with a deeper hue. The rays of the sun were beginning to grow less fierce, and the intense heat of the day was l
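
The printed summaries above still contain the whole prompt (`<s> [INST] summarize this text : ... [/INST]`) in front of the model's answer, because `batch_decode` is applied to the full generated sequence. Each later round therefore feeds the earlier prompts back into the model. A minimal sketch of a post-processing step, assuming the Mistral instruct layout seen in the output where the answer follows the last `[/INST]`; `extract_answer` is a hypothetical helper, not part of the original notebook.

```python
def extract_answer(decoded, end_marker="[/INST]", special_tokens=("<s>", "</s>")):
    """Return the text after the last end_marker, without special tokens."""
    # Keep only what follows the last instruction block, if there is one
    _, found, tail = decoded.rpartition(end_marker)
    answer = tail if found else decoded

    # Drop the begin/end-of-sequence markers left in the decoded string
    for token in special_tokens:
        answer = answer.replace(token, "")

    return answer.strip()


decoded = "<s> [INST] summarize this text : some long passage [/INST] A short summary.</s>"
print(extract_answer(decoded))
```

Applying this to `decoded[0]` before appending it to `summaries` would keep each round's input limited to the actual summaries. String matching is a simple option; it would misfire if the answer itself contained the marker.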

## Adding a function to group the summaries together and summarize the summaries again

In [40]:
def group_summaries(summaries, group_size=3):
    grouped_summaries = []
    current_group = []

    for summary_list in summaries:
        # Turn each entry into a single string (an entry may be a string or a list of strings)
        if isinstance(summary_list, str):
            summary_string = summary_list
        else:
            summary_string = ' '.join(summary_list)
        current_group.append(summary_string)

        if len(current_group) == group_size:
            grouped_summaries.append(current_group)
            current_group = []

    # Add the last group if it is incomplete
    if current_group:
        grouped_summaries.append(current_group)

    return grouped_summaries
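
Grouping by `group_size` divides the number of items by about three at each round, so a handful of rounds is enough to get down to a single summary. The small calculation below illustrates this; `rounds_of_grouping` is a hypothetical helper, and the starting count of 20 is an illustrative value, not a number taken from the notebook.

```python
import math


def rounds_of_grouping(n_items, group_size=3):
    """Return the number of groups produced at each round until one remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    counts = []
    while n_items > 1:
        # Each round packs the items into groups of at most group_size
        n_items = math.ceil(n_items / group_size)
        counts.append(n_items)
    return counts


print(rounds_of_grouping(20))
```

With 20 items this gives 7 groups, then 3, then 1.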

In [41]:
summary_step_2 = group_summaries(segment_summaries)

In [42]:
print(summary_step_2[6])

['<s> [INST] summarize this text : [\'the other, raising himself with dignity, and resuming his seat on the log with his former composure. "Hawkeye, they are your brothers; speak to them." "That will I, and in English that the king needn\\\'t be ashamed to answer," returned the hunter, speaking in the language of which he boasted; "but I see nothing, nor do I hear the sounds of man or beast; \\\'tis strange that an Indian should understand white sounds better than a man who, his very enemies will own, has no cross in his blood, although he may have lived with the redskins long enough to be suspected! Ha! there goes something like the cracking of a dry stick, too--now I hear the bushes move--yes, yes, there is a trampling that I mistook for the falls--and--but here they come themselves; God keep them from the Iroquois!"\'] [/INST] The text describes a situation where Hawkeye, a white man who has lived among the Native Americans for a long time, encounters a group of Native Americans. Th

In [43]:
# Summarize each group of summaries
segment_summaries_2 = summarize_text_segments(summary_step_2)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [44]:
print(segment_summaries_2)

[['<s> [INST] summarize this text : [\'<s>  [INST] summarize this text : [\\\'"Before these fields were shorn and tilled, Full to the brim our rivers flowed; The melody of waters filled The fresh and boundless wood; And torrents dashed, and rivulets played, And fountains spouted in the shade." BRYANT. Leaving the unsuspecting Heyward and his confiding companions to penetrate still deeper into a forest that contained such treacherous inmates, we must use an author\\\\\\\'s privilege, and shift the scene a few miles to the westward of the place where we have last seen them. On that day, two men were lingering on the banks of a small but rapid stream, within an hour\\\\\\\'s journey of the encampment of Webb, like those who awaited the appearance of an absent person, or the approach of some expected event. The vast canopy of woods spread itself to the margin of the river overhanging the water, and shadowing its dark current with a deeper hue. The rays of the sun were beginning to grow les

#### Our summary is still too long for the max tokens we started with, so let's reduce it again

In [45]:
# Group the second-round summaries
summaries_step_3 = group_summaries(segment_summaries_2)

In [46]:
print(summaries_step_3[2])

['<s> [INST] summarize this text : [\'<s>  [INST] summarize this text : [\\\'the other, raising himself with dignity, and resuming his seat on the log with his former composure. "Hawkeye, they are your brothers; speak to them." "That will I, and in English that the king needn\\\\\\\'t be ashamed to answer," returned the hunter, speaking in the language of which he boasted; "but I see nothing, nor do I hear the sounds of man or beast; \\\\\\\'tis strange that an Indian should understand white sounds better than a man who, his very enemies will own, has no cross in his blood, although he may have lived with the redskins long enough to be suspected! Ha! there goes something like the cracking of a dry stick, too--now I hear the bushes move--yes, yes, there is a trampling that I mistook for the falls--and--but here they come themselves; God keep them from the Iroquois!"\\\'] [/INST] The text describes a situation where Hawkeye, a white man who has lived among the Native Americans for a long

In [47]:
# Summarize each group of summaries
segment_summaries_3 = summarize_text_segments(summaries_step_3)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [48]:
# Summarize the summaries
final_summary = summarize_summaries(segment_summaries_3)

print(final_summary)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
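
The rounds above repeat the same two steps by hand: group the current texts, then summarize each group. A minimal sketch of the same idea as a loop, under the assumption that a summarizer takes one string and returns one string; `reduce_summaries` and `summarize_fn` are hypothetical names, and `first_words` is only a stand-in for the model call so that the example runs without loading the model.

```python
def reduce_summaries(texts, summarize_fn, group_size=3):
    """Repeatedly group texts and summarize each group until one text remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not texts:
        return ""

    current = list(texts)
    while len(current) > 1:
        next_round = []
        for start in range(0, len(current), group_size):
            # Join one group of neighbouring texts and summarize it as a whole
            group = " ".join(current[start:start + group_size])
            next_round.append(summarize_fn(group))
        current = next_round

    return current[0]


def first_words(text, limit=5):
    """Stand-in for the model call: keeps only the first few words."""
    return " ".join(text.split()[:limit])


segments = [f"segment {i} text" for i in range(7)]
print(reduce_summaries(segments, first_words))
```

In the notebook, `summarize_fn` would wrap the `tokenizer.apply_chat_template` and `model.generate` calls used in `summarize_text_segments`. Note that a single input text is returned as-is, without being summarized.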




In [49]:
print(long_text)


  "Before these fields were shorn and tilled,
    Full to the brim our rivers flowed;
  The melody of waters filled
    The fresh and boundless wood;
  And torrents dashed, and rivulets played,
    And fountains spouted in the shade."

  BRYANT.


Leaving the unsuspecting Heyward and his confiding companions to
penetrate still deeper into a forest that contained such treacherous
inmates, we must use an author's privilege, and shift the scene a few
miles to the westward of the place where we have last seen them.

On that day, two men were lingering on the banks of a small but rapid
stream, within an hour's journey of the encampment of Webb, like those
who awaited the appearance of an absent person, or the approach of some
expected event. The vast canopy of woods spread itself to the margin of
the river overhanging the water, and shadowing its dark current with a
deeper hue. The rays of the sun were beginning to grow less fierce, and
the intense heat of the day was lessened, as the cool