###
CHATBOT TEST #1 

This notebook shows our first attempt at developing a text-based, closed-domain Q&A chatbot using the 'glove-wiki-gigaword-100' embedding model.
We explain why this model did not work for our case and what we used instead. This notebook contains only the code for the model that did not work; the final code we used to build and test the chatbot is in the Studentia-Chatbot repository.

###
Problem: 
glove-wiki-gigaword Embedding Model
- glove-wiki-gigaword-100 has to be downloaded and stored locally before it can be used. The chatbot is hosted on Firebase Functions, and we found that this model can NOT be hosted there as well because it exceeds the upload file size limit.
- It is not very accurate for our use case, as each embedding is limited to 100 floating-point numbers. Even its sibling model, 'glove-wiki-gigaword-300', which produces larger 300-dimensional embeddings, did not give high accuracy.
  
Pros: 
- It's free.
- Decently efficient.
  
Cons:
- Occupies a lot of storage and memory because it has to be downloaded locally.
- Small embedding size: 100 floating-point numbers per vector (300 for the sibling model).
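
To make the size concern concrete, the following back-of-the-envelope sketch estimates the raw size of the embedding table for the two GloVe variants. The vocabulary size of 400,000 words is an assumption taken from the published GloVe 6B release, and `embedding_table_megabytes` is a hypothetical helper, not part of gensim.

```python
def embedding_table_megabytes(vocab_size: int, dimensions: int, bytes_per_float: int = 4) -> float:
    """Raw size of a dense float32 embedding table, in megabytes (10**6 bytes)."""
    return vocab_size * dimensions * bytes_per_float / 1_000_000

# Assumption: vocabulary size of the GloVe 6B release (not measured in this notebook)
ASSUMED_VOCAB_SIZE = 400_000

for dims in (100, 300):
    size_mb = embedding_table_megabytes(ASSUMED_VOCAB_SIZE, dims)
    print(f"{dims} dimensions: ~{size_mb:.0f} MB")
```

This counts only the vectors themselves; the vocabulary strings and the on-disk file format add to it.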

###
Solution: 
OpenAI's Curie Embedding Model
- We opted for a model that does not take up any storage locally or in Firebase Functions. Instead of GloVe, we decided to use OpenAI's embedding models: text-search-curie-doc-001 and text-search-curie-query-001. These models are produced and hosted by OpenAI, so they can be used online through its API. Moreover, each embedding has 4096 floating-point numbers, which leaves much more room to represent meaning and to reach high similarity for relevant sections.

Pros:
- Enhanced portability (does not need to be downloaded or stored anywhere)
- 4096 floating-point vector embedding size.
- Much more efficient.
  
Cons:
- It costs money.

Consequently, we decided that OpenAI's 'Curie' model has more advantages for our specific case and for the overall development of Studentia.


In [3]:
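# Install required libraries
%pip install transformers
%pip install gensim
# The code below calls openai.Completion, which exists only in openai versions before 1.0
%pip install "openai<1"
%pip install pandas
%pip install numpy
%pip install matplotlib
# typing is part of the Python standard library and does not need to be installed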


In [1]:
# Dictionary to cache responses
response_cache = {}

In [2]:
# Import installed libraries
import pandas as pd
import numpy as np
import gensim
from transformers import GPT2TokenizerFast
from gensim.models import Word2Vec
from typing import Dict, List, Tuple



In [3]:
# Import Embedding Library 
import gensim.downloader as api

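# Load the pre-trained GloVe vectors 'glove-wiki-gigaword-100' (gensim returns them as KeyedVectors).
# The variable keeps the name word2vec_model even though these are GloVe, not Word2Vec, vectors.
word2vec_model = api.load("glove-wiki-gigaword-100")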

In [4]:
# Check out the database to be embedded
df = pd.read_csv('ewha_database2.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(4)

5 rows in the data.


title,heading,content,token
Undergraduate Admission Requirements,Requirements,Applicants must have a high school diploma or ...,69
International Exchange Affairs Team,Location,"B334 ECC, Ewha Womans University, 52 Ewhayeoda...",39
International Exchange Affairs Team Office,Contact List,"- Title/Responsibility: Agreement, Protocol, P...",129
Undergraduate Graduation Requirements,Requirements,Completed Semesters : Minimum eight Semesters ...,91


In [5]:
# Define a function to get the mean-pooled word-vector embedding for a text
def get_embedding(text: str, model) -> List[float]:
    # Split on whitespace only. The GloVe vocabulary is lowercase, so capitalized
    # tokens or tokens with attached punctuation are not found and get skipped below.
    tokens = text.split()
    embedding = np.zeros(model.vector_size)  # Start from a zero vector
    
    # Sum the vectors of all tokens that are in the model's vocabulary
    num_tokens = 0
    for token in tokens:
        if token in model:
            embedding += model[token]
            num_tokens += 1
    
    if num_tokens > 0:
        embedding /= num_tokens  # Take the mean of the summed vectors
    
    return embedding.tolist()


def compute_doc_embeddings(df: pd.DataFrame, model) -> Dict[Tuple[str, str], List[float]]:
    """
    Create an embedding for each row in the dataframe using Word2Vec.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_embedding(r.content.replace("\n", " "), model) for idx, r in df.iterrows()
    }

# Use Word2Vec model to compute context embeddings
context_embeddings = compute_doc_embeddings(df, word2vec_model)

# Convert the context_embeddings dictionary to a DataFrame
new_df_embeddings = pd.DataFrame.from_dict(context_embeddings, orient='index')

# Reset the index to get "title" and "heading" as columns
new_df_embeddings = new_df_embeddings.reset_index()

# Add column names to the new DataFrame
column_names = ["title", "heading"] + [str(i) for i in range(new_df_embeddings.shape[1] - 2)]
new_df_embeddings.columns = column_names

# Save the DataFrame to the document_embeddings1.csv file
new_df_embeddings.to_csv("document_embeddings1.csv", index=False)
doc_embeddings = new_df_embeddings
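
The mean-pooling step in get_embedding can be illustrated without downloading the model. The sketch below uses a tiny hand-made vocabulary (the words and 3-dimensional vectors in `toy_vectors` are invented for illustration) and a hypothetical `mean_pool` helper that follows the same logic: tokens missing from the vocabulary are skipped, and the remaining vectors are averaged.

```python
import numpy as np

# Invented 3-dimensional vectors standing in for a real embedding model
toy_vectors = {
    "topik": np.array([1.0, 0.0, 0.0]),
    "level": np.array([0.0, 1.0, 0.0]),
    "required": np.array([0.0, 0.0, 1.0]),
}

def mean_pool(text: str, vectors: dict, size: int = 3) -> np.ndarray:
    """Average the vectors of all whitespace-separated tokens found in `vectors`."""
    total = np.zeros(size)
    found = 0
    for token in text.split():
        if token in vectors:
            total += vectors[token]
            found += 1
    # With no known tokens, return the zero vector instead of dividing by zero
    return total / found if found else total

print(mean_pool("topik level required", toy_vectors))   # all three tokens are found
print(mean_pool("TOPIK level required?", toy_vectors))  # "TOPIK" and "required?" are skipped
```

The second call shows why whitespace splitting is fragile: a capitalized or punctuated token silently drops out of the average unless the text is normalized first.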

In [6]:
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors
from typing import List, Dict, Tuple

# Define a function to get the mean-pooled word-vector embedding for a text (same as in the previous cell)
def get_embedding(text: str, model) -> List[float]:
    tokens = text.split()  # Split the text into tokens
    embedding = np.zeros(model.vector_size)  # Initialize an empty embedding vector
    
    # Compute the sum of Word2Vec embeddings for each token in the text
    num_tokens = 0
    for token in tokens:
        if token in model:
            embedding += model[token]
            num_tokens += 1
    
    if num_tokens > 0:
        embedding /= num_tokens  # Take the mean of embeddings
    
    return embedding.tolist()

def compute_doc_embeddings(df: pd.DataFrame, model) -> pd.DataFrame:
    """
    Create an embedding for each row in the dataframe using the word-vector model.
    
    Return a DataFrame with embeddings.
    """
    embeddings = []
    for idx, row in df.iterrows():
        # "title" and "heading" form the DataFrame index, so read them from idx;
        # row.get("title", "") would always return the empty default here
        title, heading = idx
        content = row.get("content", "").replace("\n", " ")

        # Use the word-vector model to compute the document embedding
        embedding = get_embedding(content, model)

        # Create a dictionary with column names and values
        entry_dict = {"title": title, "heading": heading}
        entry_dict.update({str(i): emb_i for i, emb_i in enumerate(embedding)})

        embeddings.append(entry_dict)

    # Convert the embeddings list to a DataFrame
    df_embeddings = pd.DataFrame(embeddings)

    return df_embeddings

# Use Word2Vec model to compute context embeddings
df_embeddings = compute_doc_embeddings(df, word2vec_model)

# Keep only the embedding columns "0".."99"; this drops "title" and "heading"
df_embeddings = df_embeddings[[str(i) for i in range(len(df_embeddings.columns)-2)]]

# Save the DataFrame to the document_embeddings1.csv file (overwrites the file written in the previous cell)
df_embeddings.to_csv("document_embeddings1.csv", index=False)


In [7]:
# Check out the embedding result
result = pd.read_csv('document_embeddings1.csv')
result.sample(4)

index,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
1,-0.030419,0.209494,0.212253,-0.007221,0.052763,0.187372,-0.054599,0.259133,-0.301261,0.299786,...,-0.183965,-0.002947,-0.200739,0.296096,-0.511797,0.06911,-0.199801,-0.423748,0.631294,-0.011376
4,-0.031818,0.296204,0.221709,-0.126252,0.183296,0.064396,-0.059982,0.220088,-0.370262,0.353595,...,-0.152434,-0.009781,-0.069801,0.172032,-0.500879,0.062178,-0.129026,-0.324059,0.595406,-0.089915
3,-0.144195,0.131625,0.401107,0.016035,0.579145,0.40198,0.032005,-0.04649,-1.153765,0.386165,...,0.23221,0.134376,-0.05939,0.327628,-0.887215,-0.007225,-0.304517,-0.273666,0.512358,-0.038035
2,-1.2557,0.61036,0.56793,-0.96596,-0.45249,-0.071696,0.57122,-0.31292,-0.43814,0.90622,...,-0.05854,0.28253,-0.083276,-0.022234,-0.55914,0.24586,0.36052,-1.5877,0.76984,-0.64998


In [8]:
from transformers import GPT2TokenizerFast

MAX_SECTION_LEN = 300
SEPARATOR = "\n* "

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
separator_len = len(tokenizer.tokenize(SEPARATOR))

In [9]:
def get_query_embedding(text: str) -> List[float]:
    return get_embedding(text, word2vec_model)

In [10]:
def vector_similarity(x: List[float], y: List[float]) -> float:
    """
    Return the cosine similarity between two vectors.
    We could also use the plain dot product; in practice, we have found it makes little difference.
    """
    denominator = np.linalg.norm(x) * np.linalg.norm(y)
    if denominator == 0:
        # An all-zero vector (e.g. a text with no in-vocabulary tokens) has no direction
        return 0.0
    return float(np.dot(x, y) / denominator)
    #return np.dot(np.array(x), np.array(y))
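
As a small illustration of the two options named in the docstring, the sketch below compares cosine similarity and the dot product on invented vectors. The two agree when both vectors have unit length, but averaged word vectors are generally not unit length, so the dot product also reflects vector magnitude. The vectors and the `cosine` helper name are assumptions for illustration.

```python
import numpy as np

def cosine(x, y) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

query = np.array([1.0, 1.0])
short_doc = np.array([1.0, 1.0])  # same direction as the query
long_doc = np.array([3.0, 0.0])   # different direction, larger magnitude

for name, doc in (("short_doc", short_doc), ("long_doc", long_doc)):
    print(name, "cosine:", round(cosine(query, doc), 3), "dot:", float(np.dot(query, doc)))
```

Here cosine similarity ranks short_doc first while the dot product ranks long_doc first, because the dot product rewards magnitude as well as direction.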

In [15]:
import matplotlib.pyplot as plt

def order_document_sections_by_query_similarity(query: str, contexts: Dict[Tuple[str, str], np.array]) -> List[Tuple[float, Tuple[str, str]]]:
    query_embedding = get_query_embedding(query)

    document_similarities = [
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ]

    # Sort by similarity in descending order
    document_similarities.sort(reverse=True)

    # Extract the top 5 most similar documents
    top_5_documents = document_similarities[:5]

    # Reverse the order of the top 5 documents so that the most similar document is at the top
    top_5_documents.reverse()

    # Extract the similarity scores and document names
    similarity_scores = [score * 100 for score, _ in top_5_documents]
    similarity_percentages = [f"{score:.2f}%" for score in similarity_scores]
    doc_names = [f"{index[0]} - {index[1]}" for _, index in top_5_documents]

    # Visualize the similarity scores with exact similarity percentages
    #plt.figure(figsize=(10, 6))
    #plt.barh(doc_names, similarity_scores, color='#a074ff')
    #plt.xlabel('Similarity Score %')
    #plt.ylabel('Document')
    #plt.title('Cosine Similarity Between Query and Document Sections')

    # Annotate the similarity percentages on top of each bar
    #for i, (similarity_score, similarity_percentage, doc_name) in enumerate(zip(similarity_scores, similarity_percentages, doc_names)):
        #plt.text(similarity_score + 0.1, i, similarity_percentage, ha='center', va='center')

    #plt.show()

    return document_similarities

# Call the function to rank the document sections (the plotting code above is commented out)
order_document_sections_by_query_similarity("Do i need topik to enter ewha?", context_embeddings)

[(0.8770458840868364,
  ('Undergraduate Admission Requirements', 'Requirements')),
 (0.8452865117676231,
  ('Language Admission Requirements', 'Language Proficiency')),
 (0.8149613608495258,
  ('Undergraduate Graduation Requirements', 'Requirements')),
 (0.5698017441161497, ('International Exchange Affairs Team', 'Location')),
 (0.5464696262076335,
  ('International Exchange Affairs Team Office', 'Contact List'))]
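
The ranking step can be sketched on its own with made-up scores. The example below sorts (score, index) pairs by score only, using `key=`, so ties never fall back to comparing the index tuples. The scores, the shortened section names, and the `top_k` helper are hypothetical.

```python
from typing import List, Tuple

Index = Tuple[str, str]

def top_k(scored: List[Tuple[float, Index]], k: int) -> List[Tuple[float, Index]]:
    """Return the k highest-scoring (score, index) pairs, best first."""
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Made-up similarity scores for illustration
scored_sections = [
    (0.57, ("Team", "Location")),
    (0.88, ("Admission", "Requirements")),
    (0.85, ("Language", "Proficiency")),
]

for score, (title, heading) in top_k(scored_sections, k=2):
    print(f"{score:.2f}  {title} - {heading}")
```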

In [16]:
def count_tokens(text: str, tokenizer) -> int:
    tokens = tokenizer.tokenize(text)
    return len(tokens)

In [17]:
def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []

    
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]

        # Use the precomputed 'token' column from the CSV as the section's token count
        chosen_sections_len += document_section['token'] + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break

        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))
    
    header = """Answer the question as factual as possible using the provided context, and if the answer is not undoubtedly contained within the text below, absolutely don't answer anything except for saying "Sorry, I don't have that information. Please visit Ewha Womans University official website at https://www.ewha.ac.kr/ewhaen/index.do for more information."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"
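
The selection loop in construct_prompt is a greedy token budget: sections are added in ranked order until the running total exceeds the limit. The sketch below reproduces that logic on invented numbers. The section names, token counts, and separator length are hypothetical, and `select_within_budget` is an illustrative helper rather than part of the notebook.

```python
from typing import List, Tuple

def select_within_budget(
    ranked_sections: List[Tuple[str, int]],
    max_tokens: int,
    separator_tokens: int,
) -> List[str]:
    """Greedily keep ranked sections until the next one would exceed max_tokens."""
    chosen = []
    used = 0
    for name, token_count in ranked_sections:
        used += token_count + separator_tokens
        if used > max_tokens:
            break  # stop at the first section that does not fit
        chosen.append(name)
    return chosen

# Hypothetical (name, token count) pairs, already sorted by similarity
ranked = [("admission", 70), ("language", 120), ("graduation", 110), ("location", 40)]
print(select_within_budget(ranked, max_tokens=300, separator_tokens=3))
```

Because the loop stops at the first section that does not fit, the smaller "location" section is never considered even though it would fit in the remaining budget. Skipping an oversized section instead of stopping would be an alternative design.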

In [18]:
prompt = construct_prompt(
    "Do i need topik to graduate?",
    context_embeddings,
    df
)

print("===\n", prompt)

Selected 2 document sections:
('Undergraduate Admission Requirements', 'Requirements')
('Language Admission Requirements', 'Language Proficiency')
===
 Answer the question as factual as possible using the provided context, and if the answer is not undoubtedly contained within the text below, absolutely don't answer anything except for saying "Sorry, I don't have that information. Please visit Ewha Womans University official website at https://www.ewha.ac.kr/ewhaen/index.do for more information."

Context:

* Applicants must have a high school diploma or equivalent. Applicants must submit their academic transcripts, standardized test scores (such as SAT or ACT), and letters of recommendation. Applicants must also write an essay and submit a personal statement. The specific requirements for each program may vary, so it is important to check with the university for more information.
* A. Undergraduate Freshman Applicants: Test of Proficiency in Korean(TOPIK) Level 3 or above. Undergraduat

In [19]:
# Dictionary to cache responses
response_cache = {}

In [20]:
import openai

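# Placeholder: set your own key here (preferably loaded from an environment variable rather than hard-coded)
openai.api_key = "PRIVATE OPENAI KEY"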
COMPLETIONS_MODEL = "text-davinci-002"

In [21]:
COMPLETIONS_API_PARAMS = {
    "temperature": 0.0,
    "max_tokens": 200,
    "model": COMPLETIONS_MODEL,
}

In [22]:
def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: Dict[Tuple[str, str], np.array],
    show_prompt: bool = False
) -> str:
    # Check if the response is cached
    if query in response_cache:
        cached_response = response_cache[query]
        print(f"Response for '{query}' obtained from cache.")
        return cached_response

    prompt = construct_prompt(query, document_embeddings, df)

    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
        prompt=prompt,
        **COMPLETIONS_API_PARAMS
    )

    # Extract and store the generated text
    generated_text = response.choices[0].text.strip()
    response_cache[query] = generated_text
    print(f"Response for '{query}' obtained from API and cached.\n")

    return generated_text
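
The cache above is keyed on the exact query string, so two queries that differ only in capitalization or spacing would trigger two API calls. A minimal sketch of one possible refinement follows: it normalizes the key before the lookup. The `answer_fn` parameter stands in for the real API call, and `normalize_query`, `cached_answer`, and `fake_answer` are hypothetical names.

```python
from typing import Callable, Dict

def normalize_query(query: str) -> str:
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())

def cached_answer(query: str, cache: Dict[str, str], answer_fn: Callable[[str], str]) -> str:
    """Return a cached answer when available; otherwise call answer_fn and store the result."""
    key = normalize_query(query)
    if key not in cache:
        cache[key] = answer_fn(query)
    return cache[key]

calls = []

def fake_answer(query: str) -> str:
    # Stand-in for the API call; records each invocation
    calls.append(query)
    return f"answer to: {normalize_query(query)}"

cache: Dict[str, str] = {}
cached_answer("What are the requirements?", cache, fake_answer)
cached_answer("  what are   the requirements? ", cache, fake_answer)
print(len(calls))  # the second lookup is served from the cache
```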

In [25]:
answer = answer_query_with_context("what are the undergrad requirements?", df, context_embeddings)
print(answer)

Selected 2 document sections:
('Undergraduate Admission Requirements', 'Requirements')
('Language Admission Requirements', 'Language Proficiency')
Response for 'what are the undergrad requirements?' obtained from API and cached.

Applicants must have a high school diploma or equivalent. Applicants must submit their academic transcripts, standardized test scores (such as SAT or ACT), and letters of recommendation. Applicants must also write an essay and submit a personal statement. The specific requirements for each program may vary, so it is important to check with the university for more information.
