**This notebook walks through the process of building a RAG (retrieval-augmented generation) application.**

**Goal of the project: build a dedicated assistant for a group, based on the information its members exchange in their conversations. The chatbot should be able to answer questions asked in the group.**

**This project was written specifically for Howsam Academy, and the main focus was achieving the desired result for this task.**
**Minor changes may therefore be required to use it for another task.**

# 0. Setting up

### 0.1 Installing the dependencies

In [None]:
!pip install telethon pandas tiktoken scipy openai

### 0.2 Setting the environment variables

In [None]:
import os
from telethon import TelegramClient
from google.colab import userdata
from openai import OpenAI
TELETHON_API_KEY = userdata.get('TELEGRAM_API_ID')
TELETHON_API_HASH = userdata.get('TELEGRAM_HASH_ID')
TELETHON_PHONE_NUMBER = userdata.get('TELEGRAM_PHONE_NUMBER')
TELETHON_GROUP_ID = userdata.get('TELEGRAM_GROUP_ID')
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

### 0.3 Setting up the clients

In [None]:
from openai import OpenAI
openai_client = OpenAI()

from telethon import TelegramClient
telethon_client = TelegramClient('session0', TELETHON_API_KEY, TELETHON_API_HASH)

# 1. Collecting Data

In [1]:
import sys
import re

### Utils

#### 1.1 Embedding & Text cleaning utils

This section defines two functions:
- clean_text:
  * applies the necessary transformations to a given text:
    * removing emojis
    * removing URLs and @mentions
    * removing punctuation and other special characters
    * collapsing extra whitespace

- get_embedding:
  * generates the embedding vector for a given text using the OpenAI API


In [None]:
def clean_text(text):
    # Remove emojis
    text = emoji_pattern.sub(r'', text)
    # Remove URLs and @mentions
    text = url_pattern.sub(r'', text)
    # Remove any remaining special characters or excessive whitespace
    text = re.sub(r'[^\w\s]', '', text)
    text = ' '.join(text.split())  # Remove extra spaces
    return text

def get_embedding(text):
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"  # Choose the model you want to use
    )
    return response.data[0].embedding

##### Patterns

###### 1.1.1 Emoji Patterns

Here I define the Unicode ranges of the emojis so they can be removed later.

In [None]:
emoji_pattern = re.compile(
    u'[\U0001F600-\U0001F64F'  # emoticons
    u'\U0001F300-\U0001F5FF'  # symbols & pictographs
    u'\U0001F680-\U0001F6FF'  # transport & map symbols
    u'\U0001F700-\U0001F77F'  # alchemical symbols
    u'\U0001F780-\U0001F7FF'  # Geometric Shapes Extended
    u'\U0001F800-\U0001F8FF'  # Supplemental Arrows-C
    u'\U0001F900-\U0001F9FF'  # Supplemental Symbols and Pictographs
    u'\U0001FA00-\U0001FA6F'  # Chess Symbols
    u'\U0001FA70-\U0001FAFF'  # Symbols and Pictographs Extended-A
    u'\U00002702-\U000027B0'  # Dingbats
    u'\U000024C2'  # Enclosed Characters: circled M
    u'\U0001F200-\U0001F251'  # Enclosed Ideographic Supplement
    ']+', re.UNICODE)


###### 1.1.2 URL Pattern

Here I define the pattern for URLs and @mentions so they can be removed later.

In [4]:
url_pattern = re.compile(r'https?://\S+|www\.\S+|@\w+', re.UNICODE)
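
To see what the cleaning steps do together, here is a minimal, self-contained sketch. It assumes a reduced emoji pattern (only two of the ranges above, to keep the example short) and a made-up sample string; `demo_clean_text` is a hypothetical name that mirrors the steps of `clean_text`.

```python
import re

# Reduced emoji pattern for illustration only (the notebook's pattern covers more ranges).
demo_emoji_pattern = re.compile('[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF]+')
# Same URL/@mention pattern as in the notebook.
demo_url_pattern = re.compile(r'https?://\S+|www\.\S+|@\w+')

def demo_clean_text(text: str) -> str:
    text = demo_emoji_pattern.sub('', text)  # drop emojis
    text = demo_url_pattern.sub('', text)    # drop URLs and @mentions
    text = re.sub(r'[^\w\s]', '', text)      # drop punctuation and other symbols
    return ' '.join(text.split())            # collapse runs of whitespace

# Made-up sample message.
sample = "Hi @ali 😀  see https://example.com/page?x=1   for details!"
cleaned = demo_clean_text(sample)
print(cleaned)
```

Note that the `@\w+` branch matches any `@` followed by word characters, so it would also cut into e-mail addresses if the chat contains them.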

### 1.2 Message Crawling

In [6]:
import pandas as pd

#### 1.2.1 Message fetching function

This function crawls a public or private Telegram group chat and extracts its conversations. In this case the group is 'دانش آموختگان هوسم' (Howsam Graduates).

The function uses the group ID to access the group chat.
It should:
  * iterate through the group messages
  * pick out the messages that have at least one reply
  * collect the replies to each of those messages
  * sort the replies to each message by date
  * save the conversations in a CSV file
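
The reply-ordering and joining steps can be sketched offline, without Telethon. This is a minimal illustration under stated assumptions: `build_conversation` is a hypothetical helper, and the sample replies are made up. Sorting matters because Telethon's `iter_messages` yields the newest messages first by default.

```python
from datetime import datetime

def build_conversation(message_text, replies):
    """Join a message and its replies (oldest first) into one '-> ' prefixed block."""
    if not replies:
        return None  # messages without replies are skipped
    ordered = sorted(replies, key=lambda r: r['date'])
    lines = [f"-> {message_text}"] + [f"-> {r['text']}" for r in ordered]
    return "\n".join(lines)

# Hypothetical sample data; in the notebook these values come from Telethon messages.
replies = [
    {'date': datetime(2024, 1, 1, 12, 30), 'text': 'second reply'},
    {'date': datetime(2024, 1, 1, 12, 5), 'text': 'first reply'},
]
conversation = build_conversation('original question', replies)
print(conversation)
```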



  This is how you can find the group IDs:


```
from telethon import TelegramClient
import os

TELETHON_API_ID = ''
TELETHON_HASH_ID = ''
TELETHON_PHONE_NUMBER = '+'

client = TelegramClient('session', TELETHON_API_ID, TELETHON_HASH_ID)

async def main():
    await client.start(TELETHON_PHONE_NUMBER)
    print("Client started successfully")
    
    found_group = False
    async for dialog in client.iter_dialogs():
        if dialog.is_group:
            print(f"Group: {dialog.name}, ID: {dialog.id}")
            found_group = True

    if not found_group:
        print("No groups found.")

with client:
    client.loop.run_until_complete(main())
```

---



In [None]:
async def fetch_messages():
    try:
        await telethon_client.start(TELETHON_PHONE_NUMBER)
        print('Client started.')

        group = await telethon_client.get_entity(int(TELETHON_GROUP_ID))
        conversations = []

        print("Fetching group messages.")
        message_count = 0

        async for message in telethon_client.iter_messages(group, limit=50):
            try:
                if message.text:
                    message_count += 1

                    sys.stdout.write(f"\rFetching message {message_count}...")
                    sys.stdout.flush()

                    replies = []

                    if message.replies and message.id:
                        try:
                            async for reply in telethon_client.iter_messages(group, reply_to=message.id):
                                try:
                                    cleaned_reply = clean_text(reply.text.strip())

                                    replies.append({
                                        'date': reply.date,
                                        'text': cleaned_reply
                                    })
                                except Exception as e:
                                    print(e)
                                    continue

                            if replies:
                                cleaned_message = clean_text(message.text.strip())
                                conversation = [f'-> {cleaned_message}']

                                replies.sort(key=lambda r: r['date'])

                                for reply in replies:
                                    conversation.append(f"-> {reply['text']}")

                                conversations.append("\n".join(conversation))

                        except Exception as e:
                            print(e)
                            continue

            except Exception as e:
                print(e)
                continue

        print("\nAll messages fetched.")

        try:
            df = pd.DataFrame({'conv': conversations})
            df.to_csv('conversations.csv', index=False)
            print('Done')
        except Exception as e:
            print(e)

    except Exception as e:
        print(e)

In [None]:
await fetch_messages()

Client started.
Fetching group messages.
Fetching message 48...
All messages fetched.
Done


#### 1.2.2 Generating the embeddings

In [None]:
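# I generated the embeddings manually for all the conversations in the data with code of this form.
# Note: `test_df` and `test_conv_df` are not defined in this notebook; load them before running this cell.
test_df['embedding'] = test_conv_df['conv'].apply(lambda x: get_embedding(x))
test_df.to_pickle('test_embeddings.pkl')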

In [None]:
conversation_df = pd.read_csv('conversations.csv')
embeddings_df = pd.DataFrame(columns=['embedding'])
embeddings_df['embedding'] = conversation_df['conv'].apply(lambda x: get_embedding(x))
embeddings_df.to_pickle('embeddings.pkl')
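
The cell above sends one API request per conversation. The OpenAI embeddings endpoint also accepts a list of inputs, so requests could be batched. The following is only a sketch of that idea: `embed_in_batches`, `fake_embed_batch`, and the batch size of 100 are assumptions for illustration, and the fake embedder stands in for the real API so the example runs offline.

```python
from typing import Callable, List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,  # arbitrary choice for this sketch
) -> List[List[float]]:
    """Embed texts in chunks of batch_size, preserving the input order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        embeddings.extend(embed_batch(chunk))
    return embeddings

# Stand-in embedder so the sketch runs offline: one fake 1-D vector per text.
def fake_embed_batch(chunk: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in chunk]

vectors = embed_in_batches(['a', 'bb', 'ccc'], fake_embed_batch, batch_size=2)
print(vectors)
```

With the real client, `embed_batch` could be something like `lambda chunk: [d.embedding for d in openai_client.embeddings.create(input=chunk, model="text-embedding-ada-002").data]`.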

#### 1.2.3 Gathering the conversations and embeddings together

In [None]:
df = pd.concat([conversation_df, embeddings_df], axis=1)
df.to_pickle('conversations_embeddings.pkl')

# 2. RAG

### 2.0 Set up

In [None]:
from scipy import spatial # for similarity search
import tiktoken # for counting tokens in a string

In [None]:
embedding_model = 'text-embedding-ada-002'
gpt_model = 'gpt-4o-mini'

### 2.1 Search
Now I'll define a search function that:

*   Takes a user query and a dataframe with text & embedding columns
*   Embeds the user query with the OpenAI API
*   Uses the distance between the query embedding and the text embeddings to rank the texts
*   Returns two lists:
    * The top N texts, ranked by relevance
    * Their corresponding relevance scores
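
As a small worked example of the ranking step, the sketch below computes cosine relatedness in pure Python and ranks three made-up 2-D vectors against a query vector. `cosine_relatedness` is a hypothetical stand-in for `1 - spatial.distance.cosine(x, y)`, and the vectors are toy data (real `text-embedding-ada-002` vectors have 1536 dimensions).

```python
import math

def cosine_relatedness(a, b):
    """Cosine similarity of two non-zero vectors (1.0 = same direction, 0.0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2-D "embeddings" for illustration.
query_vec = [1.0, 0.0]
docs = {
    'same direction': [2.0, 0.0],
    'orthogonal': [0.0, 3.0],
    'in between': [1.0, 1.0],
}
# Rank document names from most to least related to the query.
ranked = sorted(
    docs,
    key=lambda name: cosine_relatedness(query_vec, docs[name]),
    reverse=True,
)
print(ranked)
```

Because cosine relatedness ignores vector length, `[2.0, 0.0]` scores the same as `[1.0, 0.0]` would.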

In [None]:
def strings_ranked_by_relatedness(query: str, df: pd.DataFrame, relatedness_function = lambda x, y: 1 - spatial.distance.cosine(x, y), top_n: int = 3):
  q_embedding = openai_client.embeddings.create(
      input = query,
      model = embedding_model
  ).data[0].embedding

  strings_and_relatedness = [
      (row['conv'], relatedness_function(q_embedding, row['embedding']))
      for i, row in df.iterrows()
  ]

  strings_and_relatedness.sort(key = lambda x: x[1], reverse = True)

  strings, relatednesses = zip(*strings_and_relatedness)

  return strings[:top_n], relatednesses[:top_n]

In [None]:
# strings, relatedness = strings_ranked_by_relatedness('دوره بینایی کامپیوتر', df)
# for stri, relatei in zip(strings, relatedness):
#   print(f"{relatei=:.3f}")
#   display(stri)

### 2.3 Ask
Below, I define a function `ask` that:

 * Takes a user query
 * Searches for text relevant to the query
 * Stuffs that text into a message for GPT
 * Sends the message to GPT
 * Returns GPT's answer


#### 2.3.1 Counting Tokens

In [None]:
def num_tokens(text: str, model: str = gpt_model):
  """
  This function returns the number of tokens in a text.
  """
  encoding = tiktoken.encoding_for_model(model)
  num_tokens = len(encoding.encode(text))
  return num_tokens

#### 2.3.2 Making the input prompt ready for GPT

In [None]:
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
):
  """
  This returns a message ready for GPT.
  """
  strings, relatedness = strings_ranked_by_relatedness(
      query,
      df
  )

  introduction = 'Use the below conversation to answer the subsequent question. If the answer cannot be found in the text below, write "I could not find an answer."'

  question = f'\n\nQuestion: {query}'

  for string in strings:
    next_conv = f'\n\nConversation: """{string}"""'

    if (
        num_tokens(introduction + next_conv + question, model=model)
        > token_budget
    ):
      break
    else:
      introduction += next_conv

  return introduction + question
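
The budget logic can be tried out without `tiktoken` or the API. This sketch uses the same stop-at-first-overflow loop as `query_message`, but with a hypothetical word counter in place of real token counting; `pack_within_budget`, `word_count`, and the sample strings are made up for illustration.

```python
def pack_within_budget(intro, pieces, question, budget, count_tokens):
    """Append pieces to intro in order until adding the next one would exceed budget."""
    message = intro
    for piece in pieces:
        if count_tokens(message + piece + question) > budget:
            break  # stop at the first piece that does not fit
        message += piece
    return message + question

# Hypothetical counter: whitespace-separated words instead of real tokens.
def word_count(text):
    return len(text.split())

packed = pack_within_budget(
    intro='Use the conversations below.',
    pieces=[
        ' Conversation: one two three.',
        ' Conversation: four five six seven eight.',
    ],
    question=' Question: what now?',
    budget=12,
    count_tokens=word_count,
)
print(packed)
```

Since the conversations arrive sorted by relatedness, stopping at the first one that does not fit drops the less relevant ones. A shorter conversation further down the list is not tried.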

#### 2.3.3 Ask

In [None]:
def ask(
    query: str,
    df: pd.DataFrame,
    model: str = gpt_model,
    client: OpenAI = openai_client,
    token_budget: int = 5000,
    print_message: bool = False,
):
  """
  Answers a query using GPT and a df of relevant text and embeddings
  """

  message = query_message(query, df, model, token_budget)

  if print_message:
    print(message)

  messages = [
      {'role':'system', 'content': 'You answer question in farsi about The Howsam Academy, An AI Academy with multiple courses. you will use the info you will be given to answer questions.'},
      {'role':'user', 'content': message},
  ]

  response = client.chat.completions.create(
    model = model,
    messages = messages
  ).choices[0].message.content

  return response

In [None]:
# Query (English): "When will the new computer vision course be ready?"
ask('دوره جدید بینایی کامپیوتر چه زمانی اماده میشه', df)

'دوره جدید بینایی کامپیوتر هنوز زمان دقیقی برای آماده شدن ندارد، اما استاد اشاره کرده\u200cاند که در حال حاضر درگیر دوره پردازش تصویر هستند و امیدوارند که بتوانند به زودی از آن شروع کنند.'

(English translation of the answer: "The new computer vision course does not yet have an exact date for being ready, but the instructor has mentioned that they are currently busy with the image processing course and hope to be able to start on it soon.")