<a href="https://colab.research.google.com/github/TollanBerhanu/Semantic-Search-on-PPP-Discord/blob/main/Discord_semantic_search_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing Semantic Search in a Sample Discord Chat

***This notebook provides an example implementation of semantic search over sample Discord chat data. The sample contains 500 messages extracted from the "Plutus Pioneer Program" Discord channel.***


*   *This implementation utilizes the following tools:*

    1.   'Pandas' - to load and extract relevant information from the dataset
    2.   'RecursiveCharacterTextSplitter from langchain' - to chunk the data
    3.   'SentenceTransformers embedding model' - to generate embeddings for each chunk of data
    4.   'Pinecone' - to store and query the vector embeddings with some metadata
    5.   'Alpaca / LLaMA model' - to present the results in natural language


---



*Install a specific version of 'transformers' for importing LLaMA*

In [1]:
!pip install -q datasets loralib sentencepiece
!pip uninstall -y transformers
!pip install -q git+https://github.com/zphang/transformers@c3dc391
!pip -q install git+https://github.com/huggingface/peft.git
!pip -q install bitsandbytes


In [2]:
from peft import PeftModel
from transformers import LLaMATokenizer, LLaMAForCausalLM, GenerationConfig
import textwrap


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...




## **1. Loading the dataset from drive**

In [5]:
import pandas as pd

In [6]:
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/dataset/discord_chatlogs.xlsx')
# df['Content']

In [7]:
# Cast the values of the column 'Content' into strings
df['Content'] = df['Content'].astype(str)

# Join the string values of all the rows in 'Content' into one large corpus of text
conversations = ' '.join(df['Content'])

## **2. Splitting the giant string into chunks**

In [8]:
!pip install --upgrade langchain  -q


In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [10]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,           # Usually chunk sizes are much larger than this
    chunk_overlap  = 20,        # Overlap is needed in case the text is split in odd places
    length_function = len,
)

In [11]:
# chunks = text_splitter.create_documents([conversations])
# print(chunks[:2]) # 1st two Document chunks

chunks = text_splitter.split_text(conversations)
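
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries separators such as paragraph breaks, line breaks and spaces before falling back to smaller pieces). The helper `chunk_with_overlap` and the sample string are hypothetical and only illustrate what `chunk_size` and `chunk_overlap` control.

```python
def chunk_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]


# Hypothetical sample: small sizes make the overlap easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk when the overlap is large enough.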

## **3. Generating vector embeddings for the chunks**

In [12]:
!pip install sentence_transformers > /dev/null

In [13]:
from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

In [14]:
# model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# ... is equivalent to ...
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

embeddings = embedding_model.embed_documents(chunks)
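
Semantic search works by comparing these vectors with a similarity metric. The sketch below assumes cosine similarity, which I believe is the default metric when `pinecone.create_index` is called without a `metric` argument, as it is later in this notebook. The function `cosine_similarity` and the three-dimensional toy vectors are hypothetical; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for a query and two stored chunks
toy_query = [1.0, 0.0, 1.0]
toy_close = [0.9, 0.1, 0.8]   # points in nearly the same direction as the query
toy_far = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(toy_query, toy_close) > cosine_similarity(toy_query, toy_far))  # True
```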


## **4. Storing the embeddings in a vector database**

In [15]:
pip install pinecone-client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pinecone-client
  Downloading pinecone_client-2.2.2-py3-none-any.whl (179 kB)
Collecting loguru>=0.5.0 (from pinecone-client)
  Downloading loguru-0.7.0-py3-none-any.whl (59 kB)
Collecting dnspython>=2.0.0 (from pinecone-client)
  Downloading dnspython-2.3.0-py3-none-any.whl (283 kB)
Installing collected packages: loguru, dnspython, pinecone-client
Successfully installed dnspython-2.3.0 loguru-0.7.0 pinecone-client-2.2.2


In [16]:
import getpass    # To prompt the user for a password without echoing.
# from langchain.vectorstores import Pinecone

# PINECONE_ENV = getpass.getpass("Your environment name: ")   # Enter your Pinecone environment name
PINECONE_API_KEY = getpass.getpass("Your API key: ")    # Enter your Pinecone API key

PINECONE_ENV = "us-west1-gcp-free"
# PINECONE_API_KEY = "----------------"

Your API key: ··········


In [18]:
import pinecone
from langchain.vectorstores import Pinecone

# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_ENV,  # next to api key in console
)

# all_indices = pinecone.list_indexes() # List all the indexes in our Pinecone workspace
index_name = "discord-embeddings"
index_dimension = len(embeddings[0])

# Create a pinecone index
print('Creating an index of dimension "'+ str(index_dimension) +'" ...')
pinecone.create_index(index_name, index_dimension)

pinecone.describe_index(index_name)
print('Pinecone index created!')

Creating an index of dimension "384" ...
Pinecone index created!


In [20]:
# Connect to the index
index = pinecone.Index(index_name)
# Current index statistics
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [21]:
# The total number of embedded chunks (if this exceeds 1000, the vectors must be upserted in batches because Pinecone can't handle them in a single request)
no_embeddings = len(chunks)

# This will be the size of the batch of vectors sent to pinecone at a time
step = 100

In [22]:
# Upsert the embeddings into Pinecone batch by batch
for start in range(0, no_embeddings, step):
  # End of the current batch; the last batch stops at the total number of vectors
  end = min(no_embeddings, start + step)    # [0..99], [100..199], ... , [1600..1678]

  # Create string IDs for the vectors in this batch ... ['0', '1', ..., '1678'] overall
  ids = [str(x) for x in range(start, end)]

  # Create metadata for each vector. Ideally this is kept as small as possible
  # (e.g. a link to the middle message of the chunk); here the chunk text itself is stored
  metadatas = [{'messages': chunk} for chunk in chunks[start:end]]

  # Build the records for the current batch as (id, values, metadata) tuples
  # vectors = [ ( "id1", [0.1,0.2,..], {metadata1} )  ,  ( "id2", [0.4,0.6,..], {metadata2} )  , .. ]
  records = list(zip(ids, embeddings[start:end], metadatas))

  # Upsert the batch into the "first-upsert" namespace
  index.upsert(vectors=records, namespace="first-upsert")

  # Report progress after each batch
  print('Batch no. ' + str(start // step + 1))

# Index stats after all batches have been upserted
print('Completed upserting all batches: ')
index.describe_index_stats()

Batch no. 1
Batch no. 2
Batch no. 3
Batch no. 4
Batch no. 5
Batch no. 6
Batch no. 7
Batch no. 8
Batch no. 9
Batch no. 10
Batch no. 11
Batch no. 12
Batch no. 13
Batch no. 14
Batch no. 15
Batch no. 16
Batch no. 17
Completed upserting all batches: 


{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'first-upsert': {'vector_count': 1679}},
 'total_vector_count': 1679}
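
The batch boundaries used in the loop above can be checked in isolation. This is a small sketch with a hypothetical helper, `batch_ranges`, that reproduces the same `start`/`end` arithmetic for the 1679 vectors and the batch size of 100 used in this notebook; it does not touch Pinecone.

```python
def batch_ranges(total, step):
    """Yield (start, end) pairs that cover range(total) in batches of at most `step` items."""
    if step <= 0:
        raise ValueError("step must be positive")
    for start in range(0, total, step):
        yield start, min(total, start + step)


ranges = list(batch_ranges(1679, 100))
print(len(ranges))            # 17
print(ranges[0], ranges[-1])  # (0, 100) (1600, 1679)
```

The `end` value is exclusive, so the last batch covers IDs 1600 through 1678, which matches the 17 batches printed above.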

**Querying semantically related data from the vector database**

In [31]:
# Prompt a question
query = input('Question: ')

Question: how to make a burger


In [33]:
# Generate embeddings for the query
embedded_query = embedding_model.embed_query(query)

# Query the database
query_response = index.query(
    namespace="first-upsert",
    top_k=10,
    include_values=False,
    include_metadata=True,
    vector=embedded_query
)

# Keep the similarity score of the best match; it is used later to decide whether the topic was discussed
score = query_response['matches'][0]['score']

In [27]:
# Append the top 10 semantically related messages to define the context
context = ' \n '.join( [msg['metadata']['messages'] for msg in query_response['matches'][:10]] )
print(context)

using version 4 or later of the PPP image, you should already have it. Hey guys, I'm trying to do 
 From the Ubuntu command line, I removed the PPP directory from Lesson 2. 
 Look in the PPP Docker container in the scripts folder for a script to query UTxOs. 
 out. I am at the PPP040202.  There are too many problems. The bash scripts aren't working. I 
 Hello. I am working through the PPP 040202 lecture but am stuck at running the command 
 And I changed mode of directory root, so it can be changed

And my PPP repository is up to date 
 example. Hope this helps! Question for git-savvy people. How do I sync my github fork of the ppp 
 installing it with docker. i've been using a different method that doesn't use docker...sorry not 
 there is any error messages? hi im new to the ppp program, i tried to the run the kuber the demo, 
 --protocol-params-file "$pp" \
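
The cell above joins all ten matches regardless of how closely they relate to the question. One possible refinement, sketched here under assumptions, is to keep only matches above a minimum score. The helper `build_context` and the `mock_matches` list are hypothetical; the plain dictionaries only imitate the `score` and `metadata` fields accessed in this notebook, and the 0.4 cutoff is borrowed from the threshold check used later on.

```python
def build_context(matches, min_score=0.4, separator=" \n "):
    """Join the message text of every match whose score is at least min_score."""
    relevant = [m["metadata"]["messages"] for m in matches if m["score"] >= min_score]
    return separator.join(relevant)


# Hypothetical matches shaped like the entries of query_response['matches']
mock_matches = [
    {"id": "12", "score": 0.71, "metadata": {"messages": "Look in the PPP Docker container"}},
    {"id": "40", "score": 0.18, "metadata": {"messages": "unrelated chatter"}},
]

print(build_context(mock_matches))  # Look in the PPP Docker container
```

An empty result signals that nothing relevant was found, which could be used to skip the language model entirely.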


## **5. Displaying the answer in natural language**

In [None]:
# Load the model
tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")

model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "samwit/alpaca7B-lora")

In [None]:
# Define a function that runs the model
def alpaca_talk(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
    )
    input_ids = inputs["input_ids"].cuda()

    generation_config = GenerationConfig(
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.2,
    )
    print("Generating...")
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
    )
    for s in generation_output.sequences:
        print(tokenizer.decode(s))
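
Because the decoded sequence contains the prompt followed by the generated text, printing it repeats the whole prompt. A minimal sketch for keeping only the answer is shown below. The helper `extract_response` is hypothetical and assumes the prompt ends with a `### Response:` marker that appears only once before the generated text, as in the template used in this notebook.

```python
def extract_response(decoded, marker="### Response:"):
    """Return the text after the first occurrence of marker, or the whole text if it is absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()


# Hypothetical decoded output: prompt sections followed by the model's answer
sample_output = (
    "### Messages:\nLook in the scripts folder.\n\n"
    "### Question:\nWhere is the script?\n\n"
    "### Response:\nIt is in the scripts folder.\n"
)
print(extract_response(sample_output))  # It is in the scripts folder.
```

Inside `alpaca_talk`, the helper could be applied to each decoded sequence, for example `print(extract_response(tokenizer.decode(s)))`.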

In [29]:
# Provide the model with some instructions along with the context and query
query_and_context = '''
Below is sequence of chat messages related to a certain topic. Write a response that answers the question below based on
what is discussed in the messages. Do not mention anything outside of what is discussed below. If there isn't enough
context, simply reply "This topic was not discussed previously"

### Messages:
{context}

### Question:
{query}

### Response:
'''.format(context=context, query=query)
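
Since the prompt is built from notebook-level variables, rerunning cells out of order can pair a new question with the context of an earlier one. A hedged sketch of one way to avoid this is to wrap the formatting step in a function that takes both values explicitly. `PROMPT_TEMPLATE` below is an abbreviated, hypothetical version of the prompt above, and `build_prompt` is an illustrative name.

```python
# Abbreviated, hypothetical template with the same three sections as the prompt above
PROMPT_TEMPLATE = (
    "Answer the question using only the chat messages below.\n\n"
    "### Messages:\n{context}\n\n"
    "### Question:\n{query}\n\n"
    "### Response:\n"
)


def build_prompt(context, query):
    """Fill the template for one question so context and query always travel together."""
    return PROMPT_TEMPLATE.format(context=context, query=query)


demo_prompt = build_prompt("Look in the scripts folder.", "Where is the UTxO script?")
print(demo_prompt)
```

Curly braces inside the retrieved messages are safe here, because `str.format` only interprets placeholders in the template and not in the substituted values.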

In [34]:
# Watch the magic happen... it may not be as magical as you'd expect

# If even the best match is only weakly related, skip the model and print the fallback reply
if score < 0.4:
  print(query_and_context + 'This topic was not discussed previously')
else:
  alpaca_talk(query_and_context)


Below is sequence of chat messages related to a certain topic. Write a response that answers the question below based on
what is discussed in the messages. Do not mention anything outside of what is discussed below. If there isn't enough
context, simply reply "This topic was not discussed previously"

### Messages:
using version 4 or later of the PPP image, you should already have it. Hey guys, I'm trying to do 
 From the Ubuntu command line, I removed the PPP directory from Lesson 2. 
 Look in the PPP Docker container in the scripts folder for a script to query UTxOs. 
 out. I am at the PPP040202.  There are too many problems. The bash scripts aren't working. I 
 Hello. I am working through the PPP 040202 lecture but am stuck at running the command 
 And I changed mode of directory root, so it can be changed

And my PPP repository is up to date 
 example. Hope this helps! Question for git-savvy people. How do I sync my github fork of the ppp 
 installing it with docker. i've been usi