# Building a chatbot with generative 🤗 Transformers models

__Task author: Blokhin N.V. (NVBlokhin@fa.ru)__

Materials:
* https://huggingface.co/docs/transformers/generation_strategies
* https://huggingface.co/docs/transformers/main/tasks/prompting
* https://huggingface.co/docs/huggingface_hub/guides/inference
* https://huggingface.co/settings/tokens
* https://huggingface.co/tiiuae/falcon-7b-instruct
* https://python.langchain.com/
* https://www.youtube.com/watch?v=cKjh5ZOWqus
* https://docs.chainlit.io/get-started/overview

## Tasks to work through together

1\. Explore working with generative 🤗 Transformers models through the Inference API

In [1]:
!pip install huggingface_hub



In [2]:
api_token = "hf_..."  # your own access token from https://huggingface.co/settings/tokens

In [35]:
# prompt: create InferenceClient object from this package

from huggingface_hub import InferenceClient

# Create an InferenceClient object
client = InferenceClient(
    model="gpt2",
    token=api_token
)

In [37]:
client.text_generation('Who directed The Shawshank Redemption?')

{'generated_text': "Who directed The Shawshank Redemption?\n\n\nThat makes sense! I've been wanting to do a remake of The Shawshank Redemption in the past few weeks, so I think one of the best things about The Shawshank Redemption is that"}

In [6]:
client.text_generation(
    prompt="Students were upset because",
    return_full_text=True,
)

'Students were upset because they were told that the school would not be able to provide them with a place to stay.\n'

## Tasks for independent work

In [1]:
from huggingface_hub import InferenceClient
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<p class="task" id="1"></p>

1\. Load any large language model for Russian text generation from 🤗 Transformers. Use it to continue the text in `prompt`. Study how the following parameters affect the result:

* max_new_tokens;
* do_sample;
* num_beams;
* num_beam_groups.

Print several examples of text generated with different settings.
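
To build intuition for `num_beams` and `max_new_tokens` before running a real model, here is a minimal sketch of beam search over a toy next-token table. The table `TOY_LM` and the function `generate` are hypothetical illustrations, not part of 🤗 Transformers; the sketch only assumes that beam search keeps the `num_beams` hypotheses with the highest cumulative log-probability at each step, and that `num_beams=1` reduces to greedy search.

```python
import math

# Hypothetical toy "language model": probability of the next token given the last one.
TOY_LM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def generate(num_beams: int, max_new_tokens: int) -> list[str]:
    """Return the highest-scoring continuation found with `num_beams` hypotheses."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":  # finished hypotheses are carried over unchanged
                candidates.append((score, tokens))
                continue
            for token, prob in TOY_LM[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the `num_beams` best hypotheses; num_beams=1 is greedy search.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][1][1:]  # drop the "<s>" start marker


print(generate(num_beams=1, max_new_tokens=2))  # ['the', 'cat']
print(generate(num_beams=2, max_new_tokens=2))  # ['a', 'cat']
```

Greedy search commits to "the" because it is the most likely first token, while two beams also keep "a" alive and discover that "a cat" (probability 0.36) beats "the cat" (0.30). `max_new_tokens` simply bounds the number of loop iterations.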


- [ ] Checked at the seminar

In [2]:
model_name = "ai-forever/rugpt3large_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

generator = pipeline(
                "text-generation",
                model=model,
                tokenizer=tokenizer,
                device='cuda' if torch.cuda.is_available() else 'cpu'
                )

prompt = "Студенты были расстроены, потому что"  # "Students were upset because"


In [3]:
generator(
        prompt
    )[0]['generated_text']



'Студенты были расстроены, потому что не знали, что делать.\n\n—\xa0Я'

In [4]:
generator(
        prompt,
        max_new_tokens=8,
    )[0]['generated_text']

'Студенты были расстроены, потому что не знали, что делать.\n\n'

In [5]:
generator(
        prompt,
        do_sample=True,
    )[0]['generated_text']



'Студенты были расстроены, потому что хотели принять участие в обсуждении. Но к удивлению всех,'

In [6]:
generator(
        prompt,
        num_beams=100,
    )[0]['generated_text']

'Студенты были расстроены, потому что не знали, что делать дальше.\n\n—\xa0'

In [7]:
generator(
        prompt,
        num_beams=4,
        num_beam_groups=2,
        diversity_penalty=0.5
    )[0]['generated_text']

'Студенты были расстроены, потому что не знали, что им делать.\n\n—\xa0'

In [8]:
generator(
        prompt,
        max_new_tokens=12,
        num_beams=6,
        num_beam_groups=2,
        diversity_penalty=0.5
    )[0]['generated_text']

'Студенты были расстроены, потому что не знали, что им делать дальше.\n\n—\xa0'

<p class="task" id="2"></p>

2\. Load any large language model for English text generation from 🤗 Transformers. Design a prompt that, given the text of a question from the files in the `qst_eng_txt/questions` directory, extracts the named entity and determines the intent of the question. Demonstrate the model on examples with different intents.

If your machine lacks the resources to run a language model locally, you can use the 🤗 Transformers Inference API instead.

Prompt-writing tips:
* describe as precisely as possible what you want to get as output;
* the prompt can include several example questions together with the correct answers and the answer format.
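
The second tip (showing examples and the answer format) can be sketched as a small helper that assembles a few-shot prompt from labeled examples. The function name `build_few_shot_prompt` and the `sentence / entity / intent` layout are assumptions chosen for illustration; any consistent format should work as long as the last line is left open for the model to complete.

```python
def build_few_shot_prompt(examples, question):
    """Build a numbered few-shot prompt that ends with an unanswered question.

    `examples` is a list of (sentence, entity, intent) tuples.
    """
    lines = []
    for number, (sentence, entity, intent) in enumerate(examples, start=1):
        lines.append(f"{number}. sentence: {sentence} entity: {entity}, intent: {intent}.")
    # The final line stops right after "entity:" so the model fills in the rest.
    lines.append(f"{len(examples) + 1}. sentence: {question} entity:")
    return "\n".join(lines)


examples = [
    ("who played in Dark Knight?", "Dark Knight", "actor"),
    ("who directed Home alone?", "Home alone", "director"),
]
print(build_few_shot_prompt(examples, "list cameramen Gravity."))
```

Generating the prompt from data keeps every example in exactly the same format, which makes the model's completion easier to parse afterwards.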

- [ ] Checked at the seminar

In [4]:
actor = pd.read_csv('/content/drive/MyDrive/Учеба/nlp/06/qst_eng_txt/questions/film_actors.csv')
actor['intent'] = 'actor'
cameraman = pd.read_csv('/content/drive/MyDrive/Учеба/nlp/06/qst_eng_txt/questions/film_cameraman.csv')
cameraman['intent'] = 'cameraman'
director = pd.read_csv('/content/drive/MyDrive/Учеба/nlp/06/qst_eng_txt/questions/film_director.csv')
director['intent'] = 'director'

df = pd.concat([actor, cameraman, director], axis=0)
df.sample(5)

Unnamed: 0,question,intent
25,who operated the camera for mad max fury road?,cameraman
27,Who directed Forrest Gump?,director
14,forrest gump director?,director
6,inception main actors?,actor
16,who operated the camera for la la land?,cameraman


In [10]:
entities = ['actor', 'cameraman', 'director']
prompt = '''
1. sentense: who played in Dark Knight? entity - Dark Knight, intent - actor.
2. sentence: who directed Home alone? entity - Home alone, intent - director.
3. sentence: who directed the movie Inception?", entity - the movie Inception, intent - director.
4. sentence: which actor played the lead role in The Godfather? entity - The Godfather, intent: actor.
5. sentence: who was the cameraman for the film Interstellar? entity - film Interstellar, intent - cameraman.
6. sentence: list cameramen The Revenant. entity - The Revenant, intent: cameraman.
7. sentence: {} entity -
'''

for quest, intent in df.sample(3).values:
  ans = generator(prompt.format(quest),
                  max_new_tokens=12,
                  num_beams=12,
                  num_beam_groups=3,
                  diversity_penalty=0.3
                  )
  # entity_pred = set([w for w in ans.lower().split() if w in entities])
  print(quest)
  print(ans[0]['generated_text'].split('7. ')[-1].strip())
  # print(intent)
  print('----------')



Can you name the Cameraman of Gravity?
sentence: Can you name the Cameraman of Gravity? entity - 
 Cameraman of Gravity, intent:
----------
Who played the lead roles in the matrix?
sentence: Who played the lead roles in the matrix? entity - 
 The Matrix, intent: actor.
8.
----------
List cameramen Gravity.
sentence: List cameramen Gravity. entity - 
 Gravity, intent: cameraman.
----------


In [12]:
entities = ['actor', 'cameraman', 'director']
prompt = '''
1. sentense: who played in Dark Knight? entity - Dark Knight, intent - actor.
2. sentence: who directed Home alone? entity - Home alone, intent - director.
3. sentence: who directed the movie Inception?", entity - the movie Inception, intent - director.
4. sentence: which actor played the lead role in The Godfather? entity - The Godfather, intent: actor.
5. sentence: who was the cameraman for the film Interstellar? entity - film Interstellar, intent - cameraman.
6. sentence: list cameramen The Revenant. entity - The Revenant, intent: cameraman.
7. sentence: {} entity -
'''

for quest, intent in df.sample(3).values:
  ans = generator(prompt.format(quest),
                  max_new_tokens=15,
                  num_beams=24,
                  num_beam_groups=4,
                  diversity_penalty=0.3
                  )
  # entity_pred = set([w for w in ans.lower().split() if w in entities])
  print(quest)
  print(ans[0]['generated_text'].split('7. ')[-1].strip())
  # print(intent)
  print('----------')

blade runner 2049 cameraman?
sentence: blade runner 2049 cameraman? entity - 
 The Revenant, intent: cameraman.
8
----------
who operated the camera for la la land?
sentence: who operated the camera for la la land? entity - 
 la la land, intent: cameraman.
8.
----------
Can you name the Main Actors of The Matrix?
sentence: Can you name the Main Actors of The Matrix? entity - 
 The Matrix, intent: actor.
8. sentence:
----------


In [15]:
entities = ['actor', 'cameraman', 'director']
prompt = '''
1. sentense: who played in Dark Knight? entity: Dark Knight, intent: actor.
2. sentence: who directed Home alone? entity: Home alone, intent: director.
3. sentence: who directed the movie Inception?", entity: the movie Inception, intent: director.
4. sentence: which actor played the lead role in The Godfather? entity: The Godfather, intent: actor.
5. sentence: who was the cameraman for the film Interstellar? entity: film Interstellar, intent: cameraman.
6. sentence: list cameramen The Revenant. entity: The Revenant, intent: cameraman.
7. sentence: {} entity:
'''

for quest, intent in df.sample(3).values:
  ans = generator(prompt.format(quest),
                  max_new_tokens=15,
                  num_beams=24,
                  num_beam_groups=4,
                  diversity_penalty=0.5
                  )
  # entity_pred = set([w for w in ans.lower().split() if w in entities])
  print(quest)
  print(ans[0]['generated_text'].split('7. ')[-1].strip())
  # print(intent)
  print('----------')



gravity cameraman?
sentence: gravity cameraman? entity: 
 The Revenant, intent: cameraman.
8
----------
shawshank redemption director?
sentence: shawshank redemption director? entity: 
 shawshank redemption director, intent: director.
----------
inception main actors?
sentence: inception main actors? entity: 
 The Revenant, intent: cameraman.
8
----------


<p class="task" id="3"></p>

3\. Create a dictionary `db` of the following form:
```
{
    "film_director": {
        "The Shawshank Redemption": "Frank Darabont",
        ...
    },
    ...
}
```

Write the functions `parse_llm_result` and `find_answer`. Demonstrate that they work on several examples.
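
One possible way to make the parsing step tolerant of incomplete completions is a regular expression that returns `None` when the expected fields are missing. This is only a sketch: the helper name `parse_completion` is hypothetical, and it assumes the model's completion follows the layout `entity: <entity>, intent: <intent>`.

```python
import re

# Assumed completion layout: "... entity: <entity>, intent: <intent>."
COMPLETION_PATTERN = re.compile(
    r"entity:\s*(?P<entity>.+?),\s*intent:\s*(?P<intent>\w+)",
    re.DOTALL,  # the entity may start on the line after "entity:"
)


def parse_completion(text: str):
    """Return (entity, intent), or None if the completion does not match the layout."""
    match = COMPLETION_PATTERN.search(text)
    if match is None:
        return None
    return match.group("entity").strip(), match.group("intent")


print(parse_completion("sentence: List cameramen Gravity. entity: \n Gravity, intent: cameraman.\n8"))
print(parse_completion("sentence: gravity cameraman? entity: Gravity, intent:"))  # None
```

Returning `None` instead of raising lets the calling code decide how to react when the model stops before producing an intent.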


- [ ] Checked at the seminar

In [21]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [22]:
!pip install thefuzz

Collecting thefuzz
  Downloading thefuzz-0.20.0-py3-none-any.whl (15 kB)
Collecting rapidfuzz<4.0.0,>=3.0.0 (from thefuzz)
  Downloading rapidfuzz-3.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
Installing collected packages: rapidfuzz, thefuzz
Successfully installed rapidfuzz-3.6.0 thefuzz-0.20.0


In [23]:
from thefuzz import process

In [61]:
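def parse_llm_result(result: list[dict]) -> tuple[str, str]:
    """Return the entity name and the intent parsed from the raw pipeline output."""
    # Keep only the completion of the 7th (last) example of the few-shot prompt.
    text = result[0]['generated_text'].split('7. ')[-1].strip()
    # Drop a spurious 8th example; splitting on '\n8' keeps titles that contain the digit 8.
    text = text.split('\n8')[0].strip()
    # Expected layout: "sentence: <question> entity: <entity>, intent: <intent>."
    _, _, entity_part, intent_part = text.split(':')
    entity = entity_part.split(',')[0].strip()
    intent = intent_part.replace('.', '').strip()
    return entity, intent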

In [53]:
ans  # from the previous task

[{'generated_text': '\n1. sentense: who played in Dark Knight? entity: Dark Knight, intent: actor. \n2. sentence: who directed Home alone? entity: Home alone, intent: director. \n3. sentence: who directed the movie Inception?", entity: the movie Inception, intent: director.\n4. sentence: which actor played the lead role in The Godfather? entity: The Godfather, intent: actor.\n5. sentence: who was the cameraman for the film Interstellar? entity: film Interstellar, intent: cameraman.\n6. sentence: list cameramen The Revenant. entity: The Revenant, intent: cameraman.\n7. sentence: inception main actors? entity: \n The Revenant, intent: cameraman.\n8'}]

In [55]:
ans[0]['generated_text'].split('7. ')[-1]

'sentence: inception main actors? entity: \n The Revenant, intent: cameraman.\n8'

In [56]:
parse_llm_result(ans)

('The Revenant', 'cameraman')

In [60]:
film_actors_ans = pd.read_csv('/content/drive/MyDrive/Учеба/nlp/06/qst_eng_txt/answers/film_actors.csv')
film_cameraman_ans = pd.read_csv('/content/drive/MyDrive/Учеба/nlp/06/qst_eng_txt/answers/film_cameraman.csv')
film_director_ans = pd.read_csv('/content/drive/MyDrive/Учеба/nlp/06/qst_eng_txt/answers/film_director.csv')

db = {
    'film_actors': dict(film_actors_ans.values),
    'film_cameraman': dict(film_cameraman_ans.values),
    'film_director': dict(film_director_ans.values),
}
db

{'film_actors': {'The Matrix': 'Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, and others.',
  'The Dark Knight': 'Christian Bale and Heath Ledger.',
  'Pulp Fiction': 'John Travolta, Uma Thurman, and Samuel L. Jackson.',
  'Inception': 'Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, and others.',
  'Forrest Gump': 'Tom Hanks, Robin Wright, Gary Sinise, and others.'},
 'film_cameraman': {'The Revenant': 'Emmanuel Lubezki.',
  'Mad Max': 'John Seale.',
  'Blade Runner 2049': 'Roger Deakins.',
  'La La Land': 'Linus Sandgren.',
  'Dunkirk': 'Hoyte van Hoytema.',
  'Gravity': 'Emmanuel Lubezki.'},
 'film_director': {'The Shawshank Redemption': 'Frank Darabont.',
  'Inception': 'Christopher Nolan.',
  'The Godfather': 'Francis Ford Coppola.',
  'The Dark Knight': 'Christopher Nolan.',
  'Forrest Gump': 'Robert Zemeckis.'}}

In [26]:
def find_answer(entity: str, intent: str, db: dict) -> dict:
    """Look up the answer for `entity` in `db[intent]`.

    `entity` and `intent` are the result of `parse_llm_result`.
    The key in `db[intent]` is matched fuzzily with `process.extractOne` from the `thefuzz` package.
    """
    answers = db[intent]
    if entity != '':
        # answer[0] is the best-matching key, answer[1] is its similarity score (0-100)
        answer = process.extractOne(entity, answers.keys())
        return {'answer': answers[answer[0]], 'score': answer[1]}
    return {'answer': '', 'score': 0}

In [62]:
entity = 'La La Land'
intent = 'film_actors'
print('entity:', entity)
print('intent:', intent)
print('answer:', find_answer(entity, intent, db))

entity: La La Land
intent: film_actors
answer: {'answer': 'Christian Bale and Heath Ledger.', 'score': 37}
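
The score of 37 above shows that a fuzzy lookup always returns its best candidate, even when nothing in the dictionary really matches. A minimal sketch of rejecting weak matches follows. To stay dependency-free it uses `difflib` from the standard library as a stand-in for `thefuzz`, so its scores are not identical to those of `process.extractOne`; the helper names and the threshold of 80 are assumptions made for illustration.

```python
import difflib


def similarity(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (stdlib stand-in for thefuzz)."""
    return round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio())


def find_answer_or_none(entity: str, answers: dict, min_score: int = 80):
    """Return the answer for the best-matching key, or None if the match is too weak."""
    if not entity or not answers:
        return None
    best_key = max(answers, key=lambda key: similarity(entity, key))
    if similarity(entity, best_key) < min_score:
        return None
    return answers[best_key]


actors = {
    "The Dark Knight": "Christian Bale and Heath Ledger.",
    "Inception": "Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, and others.",
}
print(find_answer_or_none("La La Land", actors))  # None
print(find_answer_or_none("inception", actors))
```

With a threshold in place, the bot can answer "I don't know" for films that are missing from `db` instead of returning an unrelated cast.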



In [63]:
entity = 'Inception'
intent = 'film_director'
print('entity:', entity)
print('intent:', intent)
print('answer:', find_answer(entity, intent, db))

entity: Inception
intent: film_director
answer: {'answer': 'Christopher Nolan.', 'score': 100}


# Example of using both functions

In [76]:
# response of the LLM
quest = df.sample(1)['question'].values[0]
ans1 = generator(prompt.format(quest),
                 max_new_tokens=15,
                 num_beams=24,
                 num_beam_groups=4,
                 diversity_penalty=0.5
                 )

print(quest)
print(parse_llm_result(ans1))

who was the director of forrest gump?
('forrest gump', 'director')


In [78]:
d = {'director': 'film_director', 'actor': 'film_actors', 'cameraman': 'film_cameraman'}
entity, intent = parse_llm_result(ans1)
intent = d[intent]  # map the intent to the corresponding key in db

print('entity:', entity)
print('intent:', intent)
print('answer:', find_answer(entity, intent, db))

entity: forrest gump
intent: film_director
answer: {'answer': 'Robert Zemeckis.', 'score': 100}


<p class="task" id="4"></p>

4\. Implement a function for holding a dialogue with an LLM that works as follows.

You are given the initial dialogue context `context`.

```
Some context
```

You come up with a message and extend the context string:
```
Some context

User: some message
```

Next, you pass the resulting string to the LLM and extend the context with the answer the model generates:

```
Some context

User: some message

AI: some answer
```

Exchange several messages with the language model in this style and show that the model manages to extract information from the accumulated context.
- [ ] Checked at the seminar
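
The context-extension scheme above can be sketched independently of any particular model. In the sketch below, `generate` is a placeholder for whatever call produces the continuation (for example a wrapper around `client.text_generation`); the helper names `add_turn` and `chat_turn`, and the trick of cutting the reply at `"User:"`, are assumptions for illustration.

```python
def add_turn(context: str, speaker: str, text: str) -> str:
    """Append one "Speaker: text" turn to the context, separated by a blank line."""
    return f"{context.rstrip()}\n\n{speaker}: {text.strip()}"


def chat_turn(context: str, message: str, generate) -> tuple[str, str]:
    """Run one User/AI exchange and return (updated context, AI reply).

    `generate` is any callable that maps a prompt string to generated text.
    """
    context = add_turn(context, "User", message)
    # End the prompt with "AI:" so the model continues in the AI's voice.
    reply = generate(context + "\n\nAI:")
    # If the model starts writing the user's next message, discard that part.
    reply = reply.split("User:")[0].strip()
    return add_turn(context, "AI", reply), reply


# Stand-in for a real model, used only to show the resulting context layout.
def fake_generate(prompt: str) -> str:
    return " some answer\n\nUser: a line the model invented"


context, reply = chat_turn("Some context", "some message", fake_generate)
print(context)
```

Because each call returns the updated context, several exchanges can be chained by feeding the returned string into the next `chat_turn` call.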

In [83]:
context = """
AI love machine learning and remembers a lot of information about it. He will be happy to help you to take the upcoming exam. The user will interact with the AI, the AI must give an answer and wait for the next replica of the person. AI does not generate replica for a human.

"""

In [86]:
from huggingface_hub import InferenceClient

# Create an InferenceClient object
client = InferenceClient(
    model="gpt2",
    token="hf_..."  # your own access token from https://huggingface.co/settings/tokens
)
client.text_generation(
    prompt=context,
    return_full_text=False,
).strip()



"The AI will be able to learn from the user's experience. The user will be able to"

In [97]:
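def chat(context):
  """Interactive dialogue loop; type 'stop' to finish. Returns the accumulated context."""
  while True:
    message = input()
    if message == 'stop':
      break  # do not send the stop word to the model
    print(f"User: {message}")
    # Extend the context with the user's message exactly once
    context = f"{context}\n\nUser: {message}\n\nAI:"
    respond = client.text_generation(
              prompt=context,
              return_full_text=False,
              do_sample=True
          ).strip()
    print(f"AI: {respond}")
    print('----------------')
    # Append the model's reply so that later turns can refer to it
    context = f"{context} {respond}"
  return context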

In [96]:
chat(context)

How can AI help a human?
User: How can AI help a human?




How can AI help human? There are several different ways:

How can an AI

Can it help with exam?
User: Can it help with exam?
How do algorithms work? Which algorithms do algorithms do algorithm? How algorithms work?

stop
User: stop
stop stop make turn turn turn turn turn all turn turn turn turn turn turn turn turn turn turn all



In [98]:
chat("I had a bad day")

User: how do you encourage people?
AI: what are you talking about?

Do you believe that they should give themselves away? Do you
----------------
User: Of course, what do you think about it?
AI: Do you think it needs improvement, should it be taken away?

Can you give
----------------
improvements are always good
User: improvements are always good
AI: improvement are always good improvements are always good

What is your opinion as to what should be done
----------------


KeyboardInterrupt: ignored

## Feedback
- [ ] I would like to receive feedback on my solution