<a href="https://colab.research.google.com/github/Chahinezehallaci/Chahinezehallaci/blob/main/CBoT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Transformers, question answering
We want to extract relevant answers to questions about the text “story.txt”. To do so, we can use the Hugging Face “transformers” library:
https://huggingface.co/transformers/
- Write a script that can handle any number of questions.
- When the user has no more questions, the program must terminate.
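
The two requirements above (any number of questions, stop when the user has none left) can be sketched independently of the model. This is a minimal sketch, not the notebook's final code: `run_question_loop`, `read_line`, `answer_fn` and the list of stop words are hypothetical names chosen for the illustration, and the answerer is a placeholder.

```python
from typing import Callable, Iterable, List


def run_question_loop(
    read_line: Callable[[], str],
    answer_fn: Callable[[str], str],
    stop_words: Iterable[str] = ("no", "quit", "exit"),
) -> List[str]:
    """Answer questions until the user enters a stop word or an empty line."""
    stop = {word.lower() for word in stop_words}
    answers = []
    while True:  # no fixed upper bound on the number of questions
        question = read_line().strip()
        if not question or question.lower() in stop:
            break
        answers.append(answer_fn(question))
    return answers


# Scripted demo: a fake "user" and a placeholder answerer stand in for
# input() and for the real question-answering model.
scripted = iter(["Who are you?", "What did you find?", "no"])
demo_answers = run_question_loop(lambda: next(scripted), lambda q: f"(answer to: {q})")
print(demo_answers)
```

In a notebook, `read_line` could be `lambda: input(">> You: ")` and `answer_fn` a function that calls the chosen model.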

In [None]:
 !pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing

# 1) Model selection
We compare several pretrained models to get the best possible performance from our conversational agent.


In [None]:
# torch (PyTorch) is the tensor library used to run the pretrained models
import torch
# AutoTokenizer and AutoModelForQuestionAnswering load the tokenizer (a "fast" one when available)
# and the PyTorch question-answering model that match a given checkpoint name
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
# pipeline wraps tokenization, inference and answer decoding in a single call
from transformers import pipeline

## Model 1

In [None]:
# The architecture to use is inferred from the name or path of the pretrained model passed to from_pretrained
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
I am Amelie Poulain.  I was born in June 1974.   I lived alone with my father when I was a child. 
Now I live in Montmartre. I work in a small café whose name is Les Deux Moulins. I am single, and I used to feel very lonely. 
I like dipping my hand into grain sacks and throwing stones on the Saint-Martin canal. One day, I dropped a plastic perfume-stopper, which dislodged a wall tile. 
I discovered an old metal box of childhood memorabilia. This box was hidden by a boy who lived in my apartment decades earlier.  
I decide to track down the boy and return the box to him. If you know this boy, you need to come to see me in Montmartre.
"""

questions = [
    "Who are you?",
    "When was you born?",
    "What did you find?",
    "What did you decide?"
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: Who are you?
Answer: amelie poulain
Question: When was you born?
Answer: june 1974
Question: What did you find?
Answer: an old metal box of childhood memorabilia
Question: What did you decide?
Answer: to track down the boy and return the box to him
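
The decoding step used in the loop above (argmax of the start scores, argmax of the end scores, then a slice of the tokens) can be illustrated without downloading a model. The tokens and scores below are made up for the illustration and are not real model output; `argmax` is a small helper written for this sketch.

```python
def argmax(values):
    """Index of the largest value (the first one in case of ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy tokens and invented scores, for illustration only.
tokens = ["[CLS]", "who", "are", "you", "?", "[SEP]",
          "i", "am", "amelie", "poulain", ".", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.2, 6.0, 1.0, 0.0, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 1.5, 7.0, 0.3, 0.0]

answer_start = argmax(start_scores)   # most likely first token of the answer
answer_end = argmax(end_scores) + 1   # +1 because Python slices exclude the end
answer = " ".join(tokens[answer_start:answer_end])
print(answer)
```

The `+ 1` matters: without it the slice would stop one token early and drop the last word of the answer.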


Problem with model 1: it does not render Amelie Poulain as a proper noun (no capital letters), because this checkpoint is uncased and its tokenizer lowercases the input.
We look for another model to fix this problem.
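
Besides switching to a cased model, one possible workaround is to look the lowercased answer up in the original text and return the matching span with its original capitalization. This is only a sketch under assumptions: `restore_casing` is a hypothetical helper, it assumes that lowercasing does not change the length of the text, and it will not find a match when the tokenizer altered the answer in other ways (for example stripped accents or extra spaces around hyphens).

```python
def restore_casing(answer: str, context: str) -> str:
    """Return the span of `context` that matches `answer` case-insensitively.

    Falls back to `answer` unchanged when no match is found.
    """
    position = context.lower().find(answer.lower())
    if position == -1:
        return answer
    return context[position:position + len(answer)]


context = "I am Amelie Poulain.  I was born in June 1974."
print(restore_casing("amelie poulain", context))
print(restore_casing("june 1974", context))
```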

## Model 2

Building a basic question-answering feature with the Transformers library.

In [None]:
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

In [None]:
tokenizer = AutoTokenizer.from_pretrained("mrm8488/bert-tiny-5-finetuned-squadv2")
model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/bert-tiny-5-finetuned-squadv2")

text = r"""
I am Amelie Poulain.  I was born in June 1974.   I lived alone with my father when I was a child. 
Now I live in Montmartre. I work in a small café whose name is Les Deux Moulins. I am single, and I used to feel very lonely. 
I like dipping my hand into grain sacks and throwing stones on the Saint-Martin canal. One day, I dropped a plastic perfume-stopper, which dislodged a wall tile. 
I discovered an old metal box of childhood memorabilia. This box was hidden by a boy who lived in my apartment decades earlier.  
I decide to track down the boy and return the box to him. If you know this boy, you need to come to see me in Montmartre.
"""

questions = [
    "Who are you?",
    "When was you born?",
    "What did you find?",
    "What did you decide?"
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: Who are you?
Answer: amelie poulain
Question: When was you born?
Answer: june 1974
Question: What did you find?
Answer: an old metal box of childhood memorabilia. this box was hidden by a boy who lived in my apartment decades earlier. i decide to track down the boy and return the box to him. if you know this boy, you need to come to see me in montmartre
Question: What did you decide?
Answer: to track down the boy


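Problem with model 2: same lowercase issue as model 1, and it also answers the last two questions poorly. The answer to "What did you find?" runs on to the end of the text, and the answer to "What did you decide?" is cut short. (Since this is the worst model, it should be presented first, as model 1.)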
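
Part of the overly long answer comes from the decoding rather than from the model alone: the start and end positions are chosen independently, so nothing prevents a span of 40 tokens, or even an end that falls before the start. A common remedy is to search for the best (start, end) pair under a length limit, similar in spirit to the `max_answer_len` argument of the question-answering pipeline. The sketch below uses invented scores; `best_span` is a hypothetical helper and the exhaustive search is a simplification.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_answer_len: int = 15,
) -> Tuple[int, int]:
    """Return (start, end), end inclusive, maximising
    start_scores[start] + end_scores[end]
    subject to start <= end < start + max_answer_len."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Invented scores: the best end (index 7) is far from the best start (index 1).
toy_start = [0.0, 5.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
toy_end = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 4.0]

print(best_span(toy_start, toy_end))                    # unconstrained in practice
print(best_span(toy_start, toy_end, max_answer_len=3))  # short answers only
```

With the notebook's models, the two lists could be taken from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens would be `input_ids[start:end + 1]`.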

## Model 3

In [None]:
modelname = 'distilbert-base-cased-distilled-squad'
tokenizer = AutoTokenizer.from_pretrained(modelname)
model = AutoModelForQuestionAnswering.from_pretrained(modelname)

text = r"""
I am Amelie Poulain.  I was born in June 1974.   I lived alone with my father when I was a child. 
Now I live in Montmartre. I work in a small café whose name is Les Deux Moulins. I am single, and I used to feel very lonely. 
I like dipping my hand into grain sacks and throwing stones on the Saint-Martin canal. One day, I dropped a plastic perfume-stopper, which dislodged a wall tile. 
I discovered an old metal box of childhood memorabilia. This box was hidden by a boy who lived in my apartment decades earlier.  
I decide to track down the boy and return the box to him. If you know this boy, you need to come to see me in Montmartre.
"""

questions = [
    "Who are you?",
    "When was you born?",
    "What did you find?",
    "What did you decide?"
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: Who are you?
Answer: Amelie Poulain
Question: When was you born?
Answer: June 1974
Question: What did you find?
Answer: an old metal box of childhood memorabilia
Question: What did you decide?
Answer: track down the boy and return the box to him


Good model: capitalization is preserved and all four answers are correct.

## Model 4

We test one last model to see whether the scores can be improved compared to model 3.
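
The loops in this notebook print only the answer text, so there is no numeric score to compare yet. One way to obtain one, sketched here as an assumption rather than as the notebook's method, is to turn the start and end logits into probabilities with a softmax and multiply the probability of the chosen start by the probability of the chosen end. `softmax` and `span_confidence` are helpers written for this sketch, and the logits are invented.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def span_confidence(
    start_logits: List[float], end_logits: List[float], start: int, end: int
) -> float:
    """Probability-like score P(start) * P(end); `end` is inclusive."""
    return softmax(start_logits)[start] * softmax(end_logits)[end]


# Invented logits: a "confident" model and a "hesitant" one picking the same span.
confident = span_confidence([0.0, 8.0, 0.0], [0.0, 0.0, 8.0], start=1, end=2)
hesitant = span_confidence([0.0, 1.0, 0.5], [0.2, 0.0, 1.0], start=1, end=2)
print(f"confident: {confident:.3f}, hesitant: {hesitant:.3f}")
```

Note that `end` is inclusive here, whereas the notebook's `answer_end` is the argmax plus one.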

In [None]:
modelname = 'distilbert-base-cased-distilled-squad'
tokenizer = AutoTokenizer.from_pretrained(modelname)
model = AutoModelForQuestionAnswering.from_pretrained(modelname)

text = r"""
I am Amelie Poulain.  I was born in June 1974.   I lived alone with my father when I was a child. 
Now I live in Montmartre. I work in a small café whose name is Les Deux Moulins. I am single, and I used to feel very lonely. 
I like dipping my hand into grain sacks and throwing stones on the Saint-Martin canal. One day, I dropped a plastic perfume-stopper, which dislodged a wall tile. 
I discovered an old metal box of childhood memorabilia. This box was hidden by a boy who lived in my apartment decades earlier.  
I decide to track down the boy and return the box to him. If you know this boy, you need to come to see me in Montmartre.
"""

questions = [
    "Who are you?",
    "When was you born?",
    "What did you find?",
    "What did you decide?"
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: Who are you?
Answer: Amelie Poulain
Question: When was you born?
Answer: June 1974
Question: What did you find?
Answer: an old metal box of childhood memorabilia
Question: What did you decide?
Answer: track down the boy and return the box to him


Same answers as model 3, which is expected because this cell loads the same checkpoint (distilbert-base-cased-distilled-squad).

*Is the score improved?*
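
To put a number on this question, the answers printed by the cells above can be compared with reference answers. The sketch below copies the recorded answers from the notebook outputs; the reference answers are written by hand for this illustration and are an assumption, as are the helper names `token_f1` and `summarize`. The F1 is a simplified SQuAD-style token overlap.

```python
import re
from collections import Counter
from typing import Dict, List

# Hand-written reference answers: an assumption made for this sketch.
references = [
    "Amelie Poulain",
    "June 1974",
    "an old metal box of childhood memorabilia",
    "track down the boy and return the box to him",
]

# Answers copied from the outputs printed by the notebook cells.
recorded: Dict[str, List[str]] = {
    "bert-large-uncased-whole-word-masking-finetuned-squad": [
        "amelie poulain",
        "june 1974",
        "an old metal box of childhood memorabilia",
        "to track down the boy and return the box to him",
    ],
    "mrm8488/bert-tiny-5-finetuned-squadv2": [
        "amelie poulain",
        "june 1974",
        "an old metal box of childhood memorabilia. this box was hidden by a boy "
        "who lived in my apartment decades earlier. i decide to track down the boy "
        "and return the box to him. if you know this boy, you need to come to see "
        "me in montmartre",
        "to track down the boy",
    ],
    "distilbert-base-cased-distilled-squad": [
        "Amelie Poulain",
        "June 1974",
        "an old metal box of childhood memorabilia",
        "track down the boy and return the box to him",
    ],
}


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, ignoring case and punctuation."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def summarize(answers: List[str]) -> Dict[str, float]:
    """Case-sensitive exact matches and mean token F1 against the references."""
    exact = sum(a == r for a, r in zip(answers, references))
    mean_f1 = sum(token_f1(a, r) for a, r in zip(answers, references)) / len(references)
    return {"exact_match": exact, "mean_f1": mean_f1}


summary = {name: summarize(answers) for name, answers in recorded.items()}
for name, stats in summary.items():
    print(f"{name}: exact={stats['exact_match']}/4, mean F1={stats['mean_f1']:.3f}")
```

The exact-match count is case-sensitive on purpose, so that the lowercase answers of the uncased models are penalized, while the F1 ignores case and rewards partially correct spans.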

# 2) Chatbot

For our chatbot, we use the pretrained model that gave the best results in the question-answering tests above:

In [None]:
modelname = 'distilbert-base-cased-distilled-squad'
tokenizer = AutoTokenizer.from_pretrained(modelname)
model = AutoModelForQuestionAnswering.from_pretrained(modelname)

# No model is passed here, so the pipeline falls back to its default question-answering checkpoint
question_answerer = pipeline("question-answering")

context = r"""
I am Amelie Poulain.  I was born in June 1974.   I lived alone with my father when I was a child. 
Now I live in Montmartre. I work in a small café whose name is Les Deux Moulins. I am single, and I used to feel very lonely. 
I like dipping my hand into grain sacks and throwing stones on the Saint-Martin canal. One day, I dropped a plastic perfume-stopper, which dislodged a wall tile. 
I discovered an old metal box of childhood memorabilia. This box was hidden by a boy who lived in my apartment decades earlier.  
I decide to track down the boy and return the box to him. If you know this boy, you need to come to see me in Montmartre.
"""

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


In [None]:
result = question_answerer(question="Who are you?", context=context)
print(
    f"Do you have any question?",
    f"\nQuestion: Who are you?",
    f"\nAnswer: {result['answer']}"
)

Do you have any question? 
Question: Who are you? 
Answer: Amelie Poulain
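
The dictionary returned by the question-answering pipeline contains, besides `answer`, a `score` and the character positions `start` and `end` of the answer in the context. A chatbot can use the score to avoid replying with a span it is unsure about. The sketch below works on hand-written dictionaries with invented numbers; `format_reply` and the 0.3 threshold are hypothetical choices made for the illustration.

```python
from typing import Mapping


def format_reply(result: Mapping[str, object], min_score: float = 0.3) -> str:
    """Turn a question-answering result into a chatbot reply.

    `min_score` is an arbitrary threshold chosen for this illustration.
    """
    if float(result["score"]) < min_score:
        return "Sorry, I am not sure about that."
    return str(result["answer"])


# Hand-written dictionaries in the shape returned by the pipeline;
# the numbers are invented for the example.
confident_result = {"score": 0.98, "start": 6, "end": 20, "answer": "Amelie Poulain"}
unsure_result = {"score": 0.04, "start": 51, "end": 56, "answer": "alone"}

print(format_reply(confident_result))
print(format_reply(unsure_result))
```

With the notebook's objects, the call would look like `format_reply(question_answerer(question=text, context=context))`.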


In [None]:
from transformers import AutoModelForCausalLM

In [None]:
model_name = "microsoft/DialoGPT-large"
# model_name = "microsoft/DialoGPT-medium"
# model_name = "microsoft/DialoGPT-small"
tokenizertest = AutoTokenizer.from_pretrained(model_name)
modeltest = AutoModelForCausalLM.from_pretrained(model_name)

In [None]:
step = 0
while True:  # keep going until the user answers 'no'
  # take user input
  if step == 0:
    print("Do you have any questions?")
  else:
    print("Any other questions?")
  text = input(">> You:")
  if text.strip().lower() == 'no':
    print("Good bye!")
    break
  inputs = tokenizer(text, context, add_special_tokens=True, return_tensors="pt")
  input_ids = inputs["input_ids"].tolist()[0]
  with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
  answer_start_scores = outputs.start_logits
  answer_end_scores = outputs.end_logits
  # Get the most likely beginning of answer with the argmax of the score
  answer_start = torch.argmax(answer_start_scores)
  # Get the most likely end of answer with the argmax of the score
  answer_end = torch.argmax(answer_end_scores) + 1
  answer = tokenizer.convert_tokens_to_string(
      tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
  )
  print(answer)
  step += 1

Do you have any questions?


## Chatbot Yoda

In [None]:
# chatting 5 times with Top K sampling & tweaking temperature
for step in range(5):
  if step==0:
    print("Yoda: Do you have any questions?")
  else: 
    print("Yoda: Any other questions?")
  # take user input
  text = input(">> You:")
  # encode the input and add end of string token
  input_ids = tokenizertest.encode(text + tokenizertest.eos_token, return_tensors="pt")
  # concatenate the new user input with the chat history (if there is any)
  bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
  # generate a bot response
  chat_history_ids = modeltest.generate(
      bot_input_ids,
      max_length=1000,
      do_sample=True,
      top_k=100,
      temperature=0.75,
      pad_token_id=tokenizertest.eos_token_id
      )
  #print the output
  output = tokenizertest.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
  print(f"Yoda: {output}")
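
The `top_k` and `temperature` arguments passed to `generate` above control how the next token is drawn. Conceptually, only the `top_k` highest-scoring tokens are kept, their scores are divided by the temperature, and one token is sampled from the resulting distribution. The following is a simplified pure-Python sketch of that idea, not the library's implementation; `sample_top_k` and the toy logits are made up for the illustration.

```python
import math
import random
from typing import List, Optional


def sample_top_k(
    logits: List[float],
    top_k: int,
    temperature: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index among the `top_k` highest logits after temperature scaling."""
    if rng is None:
        rng = random.Random()
    # Keep only the indices of the top_k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # A temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy logits for a five-token vocabulary (invented values).
demo_logits = [2.0, 0.5, 1.0, -1.0, 3.0]
demo_rng = random.Random(0)
draws = [sample_top_k(demo_logits, top_k=2, temperature=0.75, rng=demo_rng) for _ in range(200)]
print(sorted(set(draws)))
```

With `top_k=1` the sampling becomes greedy decoding, since only the best token survives the filter.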

## Combining the code


In [None]:
modelname = 'distilbert-base-cased-distilled-squad'
tokenizer = AutoTokenizer.from_pretrained(modelname)
model = AutoModelForQuestionAnswering.from_pretrained(modelname)

In [None]:
context = r"""
I am Amelie Poulain.  I was born in June 1974.   I lived alone with my father when I was a child. 
Now I live in Montmartre. I work in a small café whose name is Les Deux Moulins. I am single, and I used to feel very lonely. 
I like dipping my hand into grain sacks and throwing stones on the Saint-Martin canal. One day, I dropped a plastic perfume-stopper, which dislodged a wall tile. 
I discovered an old metal box of childhood memorabilia. This box was hidden by a boy who lived in my apartment decades earlier.  
I decide to track down the boy and return the box to him. If you know this boy, you need to come to see me in Montmartre.
"""

In [None]:
step = 0
while True:  # keep going until the user answers 'no'
  # take user input
  if step == 0:
    print("Bobby: Do you have any questions?")
  else:
    print("Bobby: Any other questions?")
  text = input(">> You:")
  if text.strip().lower() == 'no':
    print("Bobby: Good bye!")
    break
  inputs = tokenizer(text, context, add_special_tokens=True, return_tensors="pt")
  input_ids = inputs["input_ids"].tolist()[0]
  with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
  answer_start_scores = outputs.start_logits
  answer_end_scores = outputs.end_logits
  # Get the most likely beginning of answer with the argmax of the score
  answer_start = torch.argmax(answer_start_scores)
  # Get the most likely end of answer with the argmax of the score
  answer_end = torch.argmax(answer_end_scores) + 1
  answer = tokenizer.convert_tokens_to_string(
      tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
  )
  print(f"Bobby: {answer}")
  step += 1

Bobby: Do you have any questions?
>> You:who are you?
Bobby: Amelie Poulain
Bobby: Any other questions?
>> You:What did you find?
Bobby: an old metal box of childhood memorabilia
Bobby: Any other questions?
>> You:What did you decide?
Bobby: track down the boy and return the box to him
Bobby: Any other questions?
>> You:no
Bobby: Good bye!
