<a href="https://colab.research.google.com/github/AIAlchemy1/Generative-AI/blob/main/01_LLM_Models/BookSummarisation_Extraction_TTS_ASR_Math.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1. Book summarisation

3 points

The OpenAI API is a great tool for summarization, but it can only process relatively short texts, because every model has a maximum token limit (MAX TOKENS). So what if we want to summarize a whole book? Let's build a workaround for this.

To test our solutions we will use the BookSum dataset for book summarization (cited below). Let's start by downloading a sample of the dataset from Hugging Face.

```
@article{kryscinski2021booksum,
      title={BookSum: A Collection of Datasets for Long-form Narrative Summarization},
      author={Wojciech Kry{\'s}ci{\'n}ski and Nazneen Rajani and Divyansh Agarwal and Caiming Xiong and Dragomir Radev},
      year={2021},
      eprint={2105.08209},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

In [None]:
!curl -X GET \
     "https://datasets-server.huggingface.co/rows?dataset=kmfoda%2Fbooksum&config=default&split=train&offset=0&limit=100" > book_sum_dataset.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3771k  100 3771k    0     0  4973k      0 --:--:-- --:--:-- --:--:-- 4968k


In [None]:
import json

book_dataset = json.load(open("book_sum_dataset.txt"))

Let's look at one row of our dataset. Beware: a large text will get printed.

In [None]:
book_dataset['rows'][0]

{'row_idx': 0,
 'row': {'bid': 27681,
  'is_aggregate': True,
  'source': 'cliffnotes',
  'chapter_path': 'all_chapterized_books/27681-chapters/chapters_1_to_2.txt',
  'summary_path': 'finished_summaries/cliffnotes/The Last of the Mohicans/section_1_part_0.txt',
  'book_id': 'The Last of the Mohicans.chapters 1-2',
  'summary_id': 'chapters 1-2',
  'content': None,
  'summary': '{"name": "Chapters 1-2", "url": "https://web.archive.org/web/20201101053205/https://www.cliffsnotes.com/literature/l/the-last-of-the-mohicans/summary-and-analysis/chapters-12", "summary": "Before any characters appear, the time and geography are made clear. Though it is the last war that England and France waged for a country that neither would retain, the wilderness between the forces still has to be overcome first. Thus it is in 1757, in the New York area between the head waters of the Hudson River and Lake George to the north. Because only two years earlier General Braddock was disgracefully routed by a hand

In [None]:
len(book_dataset['rows'][0]['row']['chapter'])

40844

As you can see, the chapters are pretty long. Let's see how many tokens they contain. Following OpenAI's guidance, we will use the `tiktoken` package.

We'll be following these instructions:

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

In [None]:
!pip install tiktoken



In [None]:
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [None]:
encoder.encode("Hello World!")

[9906, 4435, 0]

Please write a small function to count tokens:

In [None]:
def count_chatgpt_tokens(text: str, tokenizer: tiktoken.Encoding) -> int:
    pass

In [None]:
def count_chatgpt_tokens(text: str, tokenizer: tiktoken.Encoding) -> int:
    # Use the tokenizer passed in as an argument rather than the global `encoder`
    return len(tokenizer.encode(text))

Let's check how many tokens there are in a very simple string:

In [None]:
count_chatgpt_tokens("Hello world!", tokenizer=encoder)

3

Now that we have this function, let's find the maximum token length of a chapter:

In [None]:
max(
    count_chatgpt_tokens(row['row']['chapter'], encoder)
    for row in
    book_dataset['rows']
)

13237

As we can see on the API reference page, `gpt-3.5-turbo` has a context length of only 4096 tokens. Our chapters are longer than that. We could still use `gpt-3.5-turbo-16k`, but it's easy to imagine texts that are longer than even that, so let's learn how to do it with a smaller-context LLM.

An obvious way to cope with the problem is to:
1. Split the text into chunks of sentences that can fit into the context window.
2. Summarize each of the chunks.
3. Concatenate all the summaries. If the total length is still too big, repeat the steps 1 and 2 until it's ok.
4. Summarize the concatenations of the summaries.
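
The four steps above can be sketched without any API calls. This is only a sketch under stated assumptions: `summarize_fn` and `count_tokens` are hypothetical callables that you would supply yourself (for example a ChatGPT call and a `tiktoken`-based counter), and the regex sentence splitter is a dependency-free stand-in for a proper sentence tokenizer.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_tokens: int,
                      count_tokens: Callable[[str], int]) -> List[str]:
    """Step 1: greedily pack whole sentences into chunks of at most max_tokens."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            # A single sentence longer than max_tokens still becomes its own chunk.
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_recursively(text: str,
                          summarize_fn: Callable[[str], str],
                          count_tokens: Callable[[str], int],
                          max_tokens: int) -> str:
    """Steps 1-3 in a loop until the text fits, then step 4 once."""
    round_no = 0
    while count_tokens(text) > max_tokens:
        chunks = split_into_chunks(text, max_tokens, count_tokens)
        new_text = " ".join(summarize_fn(chunk) for chunk in chunks)
        round_no += 1
        print(f"round {round_no}: {count_tokens(text)} -> {count_tokens(new_text)} tokens")
        if count_tokens(new_text) >= count_tokens(text):
            raise RuntimeError("summaries did not shrink the text; stopping")
        text = new_text
    # Step 4: one last pass over the concatenated summaries.
    return summarize_fn(text)
```

The shrink check guards against an endless loop when the summarizer returns text that is no shorter than its input.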

## Your task

Write a function

```summarize_long_text_with_chatgpt(chapter: str) -> str```

implementing the above method of summarization.

Don't forget to log the text lengths at each iteration to see how much the text shrinks.

For a given example please analyse your intermediate and final results. Is it indeed a good summary of the text?

**Hints and suggestions**:
- Keep in mind that the MAX TOKENS restriction covers both the request and the model's answer, so you also need to leave some tokens for the response. We suggest a chapter-to-summary token ratio of at least 2:1.
- You can control the length of the summary with prompts.
- If you just use `split(".")`, you won't get a proper splitting into sentences. Luckily we have convenient Python libraries for text processing. We recommend using `sent_tokenize` or `split_into_sentences` from the `nltk` library. You can also try splitting the text into chunks of paragraphs instead.
- It's difficult to measure the quality of summarization, but please analyze at least two examples. Are the summaries coherent?
- If you need inspiration in prompt building, take a look at this [paper](https://arxiv.org/pdf/2312.16171v1.pdf)
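
To make the 2:1 hint concrete, here is a small budget calculation. It is a sketch, not a prescribed method: the function name and the `prompt_overhead` allowance for the system message and instruction text are assumptions introduced for illustration.

```python
from typing import Tuple


def chunk_token_budget(context_window: int,
                       chapter_to_summary_ratio: float = 2.0,
                       prompt_overhead: int = 50) -> Tuple[int, int]:
    """Split a context window into (chunk_tokens, summary_tokens)."""
    # Tokens left once the fixed prompt text is accounted for (assumed overhead).
    usable = context_window - prompt_overhead
    # Reserve one share for the summary and `ratio` shares for the chunk.
    summary_tokens = int(usable / (chapter_to_summary_ratio + 1))
    chunk_tokens = usable - summary_tokens
    return chunk_tokens, summary_tokens


chunk_tokens, summary_tokens = chunk_token_budget(4096)
print(chunk_tokens, summary_tokens)
```

With a 4096-token window this leaves roughly two thirds of the window for the chunk and one third for the reply. The reply length itself can then be steered through the prompt or capped with the `max_tokens` request parameter.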

**Bonus parts:**

- Summarized text often starts with something like "This text is about", and after merging the partial summaries you'll probably have things like that all over the text. You may wish to get rid of such introductory phrases either by tuning a prompt or by post editing.
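
For the post-editing route, a regular expression may be enough. The pattern below is a minimal sketch that covers only a handful of openers; the list of nouns and verbs is an assumption and would need extending after looking at real model output.

```python
import re

# Hypothetical list of openers; extend it based on what the model actually produces.
INTRO_PATTERN = re.compile(
    r"^\s*(?:this|the)\s+(?:text|passage|chapter|excerpt|summary)\s+"
    r"(?:is about|describes|discusses|summarizes|talks about)\s+",
    flags=re.IGNORECASE,
)


def strip_intro(summary: str) -> str:
    """Remove a leading 'This text is about ...' style opener, if present."""
    stripped = INTRO_PATTERN.sub("", summary, count=1)
    if stripped != summary and stripped:
        # Re-capitalize the first letter of what remains.
        stripped = stripped[0].upper() + stripped[1:]
    return stripped


print(strip_intro("This text is about a journey through the wilderness."))
```

Applying `strip_intro` to each partial summary before concatenation keeps such openers from piling up in the merged text.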

In [None]:
!pip install openai==0.28
!pip install nltk



In [None]:
def summarize_long_text_with_chatgpt(chapter: str) -> str:
    pass

In [None]:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
import openai

from google.colab import drive
drive.mount('/content/drive')

openai.api_key = open("/content/drive/MyDrive/.open-ai-api-key.txt").read().strip()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from nltk.tokenize import sent_tokenize
import logging

def summarize_long_text_with_chatgpt(chapter: str) -> str:
    MAX_TOKENS = 4096  # Max tokens for GPT-3.5-turbo
    SUMMARY_RATIO = 2  # Ratio of chapter to summary length

    def summarize(text: str) -> str:
        # Adjust the prompt to control the summary length
        prompt = f"Summarize the following text:\n\n{text}"
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": "You are a summarization assistant."},
                      {"role": "user", "content": prompt}]
        )
        return response.choices[0].message['content'].strip()

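    def split_text(text: str, max_length: int) -> list:
        # Greedily pack whole sentences into chunks.
        # Note: lengths here are measured in characters, not tokens.
        sentences = sent_tokenize(text)
        chunks = []
        current_chunk = ""

        for sentence in sentences:
            if len(current_chunk) + len(sentence) < max_length:
                current_chunk += sentence + " "
            else:
                if current_chunk:  # skip the empty chunk left by an overlong first sentence
                    chunks.append(current_chunk)
                current_chunk = sentence + " "
        if current_chunk:
            chunks.append(current_chunk)  # Add the last chunk
        return chunks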

    logging.basicConfig(level=logging.INFO)

    # Initial split and summarize
    chunks = split_text(chapter, MAX_TOKENS)
    summaries = [summarize(chunk) for chunk in chunks]
    combined_summary = " ".join(summaries)

    # Iteratively summarize if the text is still too long
    iteration = 1
    while len(combined_summary.split()) > MAX_TOKENS / SUMMARY_RATIO:
        logging.info(f"Iteration {iteration}, length: {len(combined_summary.split())}")
        chunks = split_text(combined_summary, MAX_TOKENS)
        summaries = [summarize(chunk) for chunk in chunks]
        combined_summary = " ".join(summaries)
        iteration += 1

    logging.info(f"Final iteration {iteration}, length: {len(combined_summary.split())}")
    return combined_summary


In [None]:
sample_chapter = book_dataset['rows'][0]['row']['chapter']
summarized_chapter = summarize_long_text_with_chatgpt(sample_chapter)
print(summarized_chapter[:200])
print(len(summarized_chapter))
assert len(summarized_chapter) < len(sample_chapter) // 2

During the colonial wars in North America, the toils and dangers of the wilderness had to be overcome before the opposing forces could engage in conflict. The boundaries between the French and English
7270


# Task 2. Extracting information with LLMs

4 points

In the practice session we were usually happy if we got something coherent. However, in real applications we often need to obtain concrete answers. Let's explore how to do it with LLMs.

Let's imagine that you work for a marketing agency, and you need to gather analytics about past events dedicated to AI and Machine Learning. For that, you need to process press releases and extract:
- Event name,
- Event date,
- Number of participants,
- Number of speakers,
- Attendance price.

Of course, you can do it manually, but it's much more fun to use Generative AI! So, your task will be to write a function that does this with only one request to OpenAI API.

Below is an example of a press release (generated by ChatGPT, of course, so both the event and the people in it are fictional). All the press releases are in the press_releases.zip archive in the week 1 homework folder.

<blockquote>
<p>PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence</p>

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact:
Jane Cipher
Director of Communications, InnovAI Summit
Email: jane.cipher@innovai.org
Phone: +123-4567-8910
</blockquote>

More specifically, you should write a function

```python
parse_press_release(pr: str) -> dict
```

where the output should be in the format

```python
{
  'name': 'InnovAI Summit 2023',
  'date': '05.11.2023',
  'n_participants': 3500,
  'n_speakers': 4,
  'price': None
}
```

If any of the five fields is not mentioned in the text, put `None` in the respective field.

At the end, calculate the statistics of right answers and analyse what kind of mistakes your "model" makes the most.

**Hints and suggestions:**
- It will be more convenient to experiment in the OpenAI chat interface https://chat.openai.com/. Plus, this doesn't spend money on API requests.
- You need to be very accurate with what you want from the model.
- It will help if you specify in the prompt that the output should be in JSON format, this way you will spend less time parsing the output.
- Please be careful with the details. For example, Jane Cipher in the text above is not a speaker and shouldn't be counted as such (how to get rid of a contact person?). Also pay attention to the date format.
- If the model is too wilful with the output format, don't hesitate to show some examples. Decreasing the temperature of predictions can help reduce the creativity of the answer, which is what we want for such a task.
- Debugging an LLM-powered application may become a tough business. When you think that you've polished it, an LLM can still surprise you. So, we don't expect 100% accuracy in this task, but we expect that you do your best to achieve high quality results.
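
Even when asked for JSON, a chat model sometimes wraps the object in extra prose. One defensive option, sketched here with a hypothetical helper name, is to cut out the outermost `{...}` span before parsing.

```python
import json
import re


def extract_json_object(reply: str) -> dict:
    """Return the first-to-last-brace span of `reply` parsed as JSON, or {} on failure."""
    # Greedy match from the first '{' to the last '}' across newlines.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}


print(extract_json_object('Sure! {"name": "X", "n_speakers": 4} Hope it helps.'))
```

This does not repair malformed JSON (single quotes, trailing commas); it only tolerates text before and after a valid object.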

In [1]:
press_release = """PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact: Jane Cipher Director of Communications, InnovAI Summit Email: jane.cipher@innovai.org Phone: +123-4567-8910"""

In [2]:
!pip install openai==0.28
import openai

from google.colab import drive
drive.mount('/content/drive')

openai.api_key = open("/content/drive/MyDrive/.open-ai-api-key.txt").read().strip()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
def parse_press_release(pr: str) -> dict:
    pass

In [7]:
import json

def parse_press_release(pr: str) -> dict:
    # Adjacent string literals inside the parentheses are concatenated automatically.
    prompt = (
        f"Here's a press release\n{pr}\n\nExtract from it the following json:\n"
        '{"name": NAME_OF_EVENT, "date": DATE_OF_EVENT, "n_participants": NUM_PARTICIPANTS, "n_speakers": NUM_SPEAKERS, "price": PRICE}\n'
        "NAME_OF_EVENT should be the name of event advertised,\n"
        "DATE_OF_EVENT should be the date of event mentioned in format DD.MM.YYYY or DD.MM.YYYY-DD.MM.YYYY if the event lasted for several days,\n"
        "NUM_PARTICIPANTS should be the estimated amount of participants of said event in a format like 200 or 1000 or 10000, do not write it like 2,000,\n"
        "NUM_SPEAKERS is a number, corresponding to amount of names of speakers and hosts mentioned\n"
        "PRICE should be the price of event in the format EUR 100 or USD 1000 or GBP 100 depending on currency. Do not write currency symbol, instead write an abbreviation.\n"
        "If any information needed for JSON is not available, write 'No information mentioned' instead."
    )

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # a low temperature keeps the output format more predictable
        messages=[
            {"role": "system", "content": "You are a data extraction assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    # Extract and parse the response
    try:
        extracted_info = response.choices[0].message['content'].strip()
        # Assumes the model returns a bare JSON object with no surrounding text
        return json.loads(extracted_info)
    except (json.JSONDecodeError, KeyError, IndexError) as e:
        print(f"Error in parsing response: {e}")
        return {}


In [8]:
parse_press_release(press_release)

{'name': 'InnovAI Summit 2023',
 'date': '05.11.2023',
 'n_participants': '3500',
 'n_speakers': '6',
 'price': 'No information mentioned'}

### Testing
We prepared a small dataset for you to test your prompt on.
Provided you've written your function, try running the following code.
At the end you also have an opportunity to look at the results in a table side-by-side in `with_results.csv`.
Your goal is to get at least 60% accuracy, or 26 fields right.

Please don't forget to output these metrics, they will be used for grading.

In [9]:
import pandas

from google.colab import drive
drive.mount('/content/drive')

pr_df = pandas.read_csv("/content/drive/MyDrive/press_release_extraction.csv")
pr_df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,pr_text,pr_parsed
0,InnovAI Summit 2023: A Glimpse into the Future...,"{\n ""name"": ""InnovAI Summit 2023"",\n ""date"":..."
1,Press Dispatch: 'Artificial Mariners: Navigati...,"{""name"": ""Artificial Mariners: Navigatin' the ..."
2,FOR IMMEDIATE RELEASE\n\nAI Innovators Convene...,"{""name"": ""Annual Machine Learning Symposium 20..."
3,Press Release: Cutting-Edge Innovations Debute...,"{""name"": ""AI Advancements Summit"",\n ""date"": ""..."
4,"Press Release: Innovative Minds Gather at ""AI ...","{""name"": ""AI Horizon 2023"",\n ""date"": ""October..."


In [10]:
import json

parsed_list = []
fields = {
    "name": str,
    "date": str,
    "n_speakers": int,
    "n_participants": int,
    "price": str
}
correct_fields = 0
for row in pr_df.itertuples():
    parsed_release = parse_press_release(row.pr_text)
    parsed_list.append(json.dumps(parsed_release, indent=4))
    golden = json.loads(row.pr_parsed)
    for field, field_type in fields.items():
        golden_field = golden[field]
        parsed_field = parsed_release.get(field)
        try:
            parsed_field = field_type(parsed_field)
        except (ValueError, TypeError):
            pass
        if golden_field == parsed_field:
            correct_fields += 1
        else:
            print(f"For {golden['name']} {field} {parsed_release.get(field)} doesn't seem the same as {golden[field]}")

print(correct_fields)

For InnovAI Summit 2023 n_speakers 5 doesn't seem the same as 4
For Artificial Mariners: Navigatin' the AI Seas date 8.10.2023-9.10.2023 doesn't seem the same as 08.10.2023-09.10.2023
For Annual Machine Learning Symposium 2023 date 14.10.2023-16.10.2023 doesn't seem the same as October 14-16, 2023
For Annual Machine Learning Symposium 2023 n_participants 2,000 doesn't seem the same as 2000
For Annual Machine Learning Symposium 2023 price USD 1,450 doesn't seem the same as USD 1450
For AI Advancements Summit name AI Advancements Summit 2023 doesn't seem the same as AI Advancements Summit
For AI Horizon 2023 date 15.10.2023 doesn't seem the same as October 15, 2023
For Generative Intelligence Conclave, Spain 2023 date 08.10.2023 doesn't seem the same as 8.10.2023
For Generative Intelligence Conclave, Spain 2023 price €180 doesn't seem the same as EUR 180
26
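
Several of the mismatches listed above are formatting differences rather than extraction errors (`2,000` vs `2000`, `€180` vs `EUR 180`, unpadded day numbers). A post-processing step can absorb some of them. The helpers below are a sketch with hypothetical names; note that the reference answers are not fully consistent themselves (one expects `8.10.2023`, another `08.10.2023-09.10.2023`), so no single normalization will match every row.

```python
import re

# Assumed symbol-to-code mapping; extend as needed.
CURRENCY_SYMBOLS = {"€": "EUR", "$": "USD", "£": "GBP"}


def normalize_int(value):
    """'2,000' -> 2000; values that are not plain numbers are returned unchanged."""
    if isinstance(value, int):
        return value
    digits = re.sub(r"[,\s]", "", str(value))
    return int(digits) if digits.isdigit() else value


def normalize_date(value: str) -> str:
    """Zero-pad day and month in every D.M.YYYY date found in the string."""
    return re.sub(
        r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b",
        lambda m: f"{int(m.group(1)):02d}.{int(m.group(2)):02d}.{m.group(3)}",
        value,
    )


def normalize_price(value: str) -> str:
    """Replace a leading currency symbol with its code and drop thousands separators."""
    value = value.strip()
    for symbol, code in CURRENCY_SYMBOLS.items():
        if value.startswith(symbol):
            value = f"{code} {value[len(symbol):].strip()}"
    return re.sub(r"(?<=\d),(?=\d{3})", "", value)


print(normalize_int("2,000"), normalize_date("8.10.2023-9.10.2023"), normalize_price("€180"))
```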


In [11]:
pr_df['results'] = parsed_list
pr_df.to_csv("with_results.csv")

In [12]:
parsed_list

['{\n    "name": "InnovAI Summit 2023",\n    "date": "05.11.2023",\n    "n_participants": 3500,\n    "n_speakers": 5,\n    "price": "No information mentioned"\n}',
 '{\n    "name": "Artificial Mariners: Navigatin\' the AI Seas",\n    "date": "8.10.2023-9.10.2023",\n    "n_participants": "2000",\n    "n_speakers": 5,\n    "price": "No information mentioned"\n}',
 '{\n    "name": "Annual Machine Learning Symposium 2023",\n    "date": "14.10.2023-16.10.2023",\n    "n_participants": "2,000",\n    "n_speakers": 4,\n    "price": "USD 1,450"\n}',
 '{\n    "name": "AI Advancements Summit 2023",\n    "date": "16.10.2023",\n    "n_participants": "800",\n    "n_speakers": "2",\n    "price": "USD 950"\n}',
 '{\n    "name": "AI Horizon 2023",\n    "date": "15.10.2023",\n    "n_participants": 2000,\n    "n_speakers": "No information mentioned",\n    "price": "No information mentioned"\n}',
 '{\n    "name": "AI for Equity Summit",\n    "date": "15.10.2023",\n    "n_participants": "3000",\n    "n_spea

# Task 3. Broken telephone

3 points

In the practice session we saw how to do text-to-speech (TTS) with play.ht. The OpenAI API also supports speech-to-text (ASR, automatic speech recognition) using the Whisper model. Let's write a broken telephone function

```python
broken_telephone(message: str, iterations: int = 5) -> str:
```

which runs TTS followed by ASR `iterations` times and returns the final text.

Check it with several initial phrases.

In [13]:
!pip install tiktoken



In [None]:
# Example of work
broken_telephone(
    "A tutor who tooted the flute tried to teach two young tooters to toot. "
    "Said the two to the tutor, Is it harder to toot, or to tutor two tooters to toot? "
    "This year I asked Santa to gift me a little cutie kitten. And my dream came true!"
)

In [15]:
import requests
import json
import openai

openai.api_key = open("/content/drive/MyDrive/.open-ai-api-key.txt").read().strip()

playht_key = open("/content/drive/MyDrive/.playht-key.txt").read().strip()
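playht_user_id = open("/content/drive/MyDrive/.playht-user-id.txt").read().strip()

def generate_speech(text: str) -> str:
    """Send text to the play.ht TTS endpoint and return the URL of the generated audio."""
    response = requests.post(
        url="https://play.ht/api/v2/tts",
        headers = {
            "AUTHORIZATION": f"Bearer {playht_key}",
            "X-USER-ID": playht_user_id,
            "accept": "text/event-stream",
            "content-type": "application/json"
        },
        json = {
            "text": text,
            "voice": "larry"
        }
    )
    response.raise_for_status()  # fail early on HTTP errors
    # This code assumes the second-to-last line of the event stream holds the final "data: {...}" payload
    return json.loads(response.text.splitlines()[-2].replace("data: ", ""))['url']

def transcribe_speech(audio_path: str) -> str:
    """Transcribe an audio file with Whisper and return the recognized text."""
    with open(audio_path, "rb") as audio_file:  # close the file once transcription is done
        transcript = openai.Audio.transcribe("whisper-1", audio_file)
    return transcript['text']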

def broken_telephone(message: str, iterations: int = 5) -> str:
    pass

In [16]:
from tqdm.auto import tqdm

def broken_telephone(message: str, iterations: int = 5) -> str:
    current_message = message
    for _ in tqdm(range(iterations)):
        generated_audio_link = generate_speech(current_message)
        with open("audio.mp3", 'wb') as audio_file:
            audio_file.write(requests.get(generated_audio_link).content)

        transcribed = transcribe_speech("audio.mp3")
        if transcribed == current_message:
            print("No change this iteration")
            break

        current_message = transcribed

    return current_message

In [17]:
broken_telephone("A tutor who tooted the flute tried to teach two young tooters to toot. Said the two to the tutor, ‘Is it harder to toot, or to tutor two tooters to toot?This year I asked Santa to gift me a little cutie kitten. And my dream came true!")

  0%|          | 0/5 [00:00<?, ?it/s]

'A tutor who tooted the flute tried to teach two young tutors to toot, said the tutor to the tutor. Is it harder to tutor? Is it harder to tutor? Is it harder to tutor? Is it harder to tutor? Is it tutor tutors to tooth? This year, I asked Santa to gift me a little kitty kitten, and my dream came true.'

**Note**. Because both algorithms might be too good, you can actually see no change, depending on the sentence you've passed.
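
To compare several initial phrases, it helps to put a number on how far the message drifted. One simple option, sketched here with a hypothetical helper, is a word-level similarity ratio from the standard library's `difflib`; it is a rough proxy and not a standard ASR metric such as word error rate.

```python
import difflib


def word_similarity(original: str, final: str) -> float:
    """Similarity in [0, 1] between two messages, compared word by word (case-insensitive)."""
    a = original.lower().split()
    b = final.lower().split()
    # ratio() is 2 * matches / (len(a) + len(b))
    return difflib.SequenceMatcher(None, a, b).ratio()


print(word_similarity("a b c d", "a b x d"))  # 3 of 4 words survive -> 0.75
```

A value of 1.0 means the message came back with the same words in the same order; lower values mean more words were altered, dropped, or inserted.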

**Bonus** (1 point). Do the same with the text <-> image translation. You can skip the text <-> audio part if you choose this. However, the authors of the homework don't know any good image captioning (image -> text) API, so you'll have to work with a model locally. It may be done quite conveniently with the [Hugging Face](https://huggingface.co/) 🤗 infrastructure, but this is beyond the scope of the first part of the course, so we leave it as an optional exercise.

# Bonus task.

1 point

It's quite important to understand the current limitations of the technology. As for LLMs, there are still plenty of weak spots. They can struggle even with such an example:

In [None]:
!pip install openai
!pip install tiktoken



In [None]:
import tiktoken
import openai

from google.colab import drive
drive.mount('/content/drive')

openai.api_key = open("/content/drive/MyDrive/.open-ai-api-key.txt").read().strip()

encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")

def get_chatgpt_answer(message: str) -> str:
    chat_completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": message}]
    )
    return chat_completion.choices[0].message.content

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
sentence = "How many tokens are in this sentence?"
print(f"Actual token count: {len(encoder.encode(sentence))}")
print(f"ChatGPT thinks it's: {get_chatgpt_answer(sentence)}")

Actual token count: 8
ChatGPT thinks it's: There are six tokens in the sentence "How many tokens are in this sentence?".


Or with this one:

In [None]:
# example
sentence = "Reverse this sentence character by character"
reversed_sentence = get_chatgpt_answer(sentence)
print(reversed_sentence[::-1])

character by character. sentence this reverse this sentence.


### LLM doing math

Math is not easy for LLMs either. They are getting better at counting and mathematical reasoning thanks to chain-of-thought generation, but they can still struggle with symbolic algebra.

Let us look at an example:

In [None]:
sentence = "Let's define a mathematical operation x * y := xy + x + y. Is it associative?"
print(get_chatgpt_answer(sentence))

To determine if the operation x * y is associative, we need to check if (x * y) * z = x * (y * z) holds true for all values of x, y, and z.

Using the definition of the operation x * y := xy + x + y, we can calculate:

(x * y) * z = ((xy + x + y) * z) = (xyz + xz + yz + x + y + z)

x * (y * z) = (x * (yz + y + z)) = (x(yz + y + z) + x + yz + y + z)

Expanding both expressions, we get:

(x * y) * z = (xyz + xz + yz + x + y + z)

x * (y * z) = (xyz + xz + yz + x + y + z)

Since (x * y) * z = x * (y * z) for all values of x, y, and z, the operation x * y is associative.


**The solution by ChatGPT-4**


To determine if the operation $*$ is associative, we need to check if:

$$(a * b) * c = a * (b * c)$$

for all real numbers $a$, $b$, and $c$. If this equality holds for all real numbers, then the operation is associative.

Given the operation $x∗y:=xy+x+y$:

Calculating $(a∗b)∗c$:

First,

$$a∗b=ab+a+b.$$

Using this result:

$$(a∗b)∗c=(ab+a+b)∗c=$$
$$=(ab+a+b)c+(ab+a+b)+c=$$
$$=abc+ac+bc+ab+a+b+c$$

Calculating $a∗(b∗c)$:

First,

$$b∗c=bc+b+c.$$

Using this result:

$$a∗(b∗c)=a∗(bc+b+c)=$$
$$=a(bc+b+c)+a+bc+b+c=$$
$$=abc+ab+ac+a+bc+b+c$$

Comparing the two results:

$$(a∗b)∗c=abc+ac+bc+ab+a+b+c$$

$$a∗(b∗c)=abc+ab+ac+a+bc+b+c$$

The two expressions are not equal for all real numbers $a$, $b$, and $c$. Therefore, the operation $*$ is not associative.

**End**

Let's analyze this solution. You can see that ChatGPT knows definitions and does well with logic, but fails at the very last stage where it can't understand that

$$abc+ac+bc+ab+a+b+c$$

and

$$abc+ab+ac+a+bc+b+c$$

are the same expression with permuted summands.
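
A quick way to double-check the claim is to note that $xy + x + y = (x+1)(y+1) - 1$, so both groupings equal $(a+1)(b+1)(c+1) - 1$, which is symmetric in $a$, $b$, $c$. The sketch below verifies both the factored form and associativity numerically on a small integer grid; the grid check is a sanity test, not a proof, and the function names are introduced here for illustration.

```python
import itertools


def op(x, y):
    """The operation from the example: x * y := xy + x + y."""
    return x * y + x + y


def is_associative_on(values) -> bool:
    """Check (a*b)*c == a*(b*c) for every triple drawn from `values`."""
    return all(
        op(op(a, b), c) == op(a, op(b, c))
        for a, b, c in itertools.product(values, repeat=3)
    )


grid = range(-5, 6)
# Factored form: x * y == (x + 1)(y + 1) - 1
print(all(op(a, b) == (a + 1) * (b + 1) - 1 for a, b in itertools.product(grid, repeat=2)))
print(is_associative_on(grid))
```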

## Task 4*

Find out what it is that you are good at but ChatGPT cannot do. Please try to be objective. Bonus points for analyzing the stability of ChatGPT's failures and their dependence on prompt formulation.

ChatGPT might struggle with certain types of math problems, especially those that require iterative or computational approaches. Here are a few examples of math problems that I can solve using Python, but which might pose a challenge for ChatGPT due to their computational nature:

In [None]:
def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

# Example usage
print(factorial(5))  # Output: 120

120


In [None]:
sentence = "I would like to calculate factorial(5)"
print(get_chatgpt_answer(sentence))

Factorial is the product of an integer and all the positive integers below it. 

Factorial of 5 (written as 5!) can be calculated as:
5! = 5 * 4 * 3 * 2 * 1
   = 120

So, factorial(5) = 120


In [None]:
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Example usage
print(fibonacci(10))  # Output: 55

55


In [None]:
sentence = "I would like to calculate fibonacci(10)"
print(get_chatgpt_answer(sentence))

To calculate the Fibonacci number at position 10, we can use a simple recursive function or iteration. 

Using a recursive function:
```python
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

result = fibonacci(10)
print(result)
```

Using iteration:
```python
def fibonacci(n):
    fib = [0, 1]
    for i in range(2, n+1):
        fib.append(fib[i-1] + fib[i-2])
    return fib[n]

result = fibonacci(10)
print(result)
```

Both methods will give you the same result, which is 55.


In [None]:
import random

def estimate_pi(num_samples):
    inside_circle = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x**2 + y**2 <= 1:
            inside_circle += 1
    return 4 * inside_circle / num_samples

# Example usage
print(estimate_pi(10000))

3.1264


In [None]:
sentence = "I would like to calculate Monte Carlo Estimation of π"
print(get_chatgpt_answer(sentence))

To calculate the Monte Carlo estimation of π, you can follow these steps:

1. Generate random points within a square: Create a loop that generates a large number of random points within a square with side length 2 (centered at the origin). You can use a random number generator to generate the x and y coordinates of each point. Save the number of points generated as N.

2. Count points within the unit circle: For each random point generated in step 1, check if it falls within the unit circle centered at the origin. You can calculate the distance of each point from the origin using the Pythagorean theorem (sqrt(x^2 + y^2)). If the distance is less than or equal to 1, count it as a hit (i.e., within the unit circle).

3. Estimate π: After generating N random points and counting the number of hits (points within the unit circle), estimate the value of π using the formula:
  π ≈ 4 * (number of hits / N)

4. Repeat for increased accuracy: To improve the accuracy of the estimation, increase t