# Business Problem:
Build an AI Q&A chatbot that can keep up with recent advances across diverse research domains.

# Uses:
* **Efficient Research Assistance:** The chatbot helps researchers during writing by supplying citations and suggesting relevant coursework materials. This speeds up literature review and helps keep recent findings from being missed.

* **Time and Effort Savings:** By automating the search for relevant research materials, the chatbot reduces the time spent gathering information, so researchers can spend more of it on analysis and interpretation.

* **Practical Application for Industry Experts:** Because the chatbot draws on recent research, industry experts can use it to find new findings and apply them in their own work.

* **Interactive Learning:** Researchers and industry experts can ask the chatbot specific questions about recent research developments, which supports a deeper understanding of complex topics and continuous learning.

# Implementation:
1. Access an OpenAI model through the API as the base model.
2. Create text embeddings with LangChain's OpenAI embeddings wrapper.
3. Store the embeddings in a Pinecone vector database.
4. Query the index and retrieve the top 3 results to extend the LLM's knowledge base.
5. Run inference and inspect the results.
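
Steps 2 to 4 amount to a nearest-neighbour lookup over embedding vectors. The sketch below shows that idea in plain Python. It is only an illustration: hand-made 3-dimensional vectors stand in for the real 1536-dimensional embeddings, an in-memory list stands in for Pinecone, and the names `dot`, `top_k`, and `store` are assumptions introduced here, not part of the notebook.

```python
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: List[float],
          store: List[Tuple[str, List[float]]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (text, score) pairs with the highest dot-product score."""
    scored = [(text, dot(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "vector database": (text, embedding) pairs. All values are made up.
store = [
    ("chunk about CNNs",        [0.9, 0.1, 0.0]),
    ("chunk about max-pooling", [0.8, 0.2, 0.1]),
    ("chunk about cooking",     [0.0, 0.1, 0.9]),
    ("chunk about GPUs",        [0.7, 0.3, 0.0]),
]

query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the user's question
results = top_k(query_vec, store, k=3)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

The unrelated "cooking" chunk scores lowest and is dropped; the three remaining texts are what would be pasted into the prompt as context.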

# Installing Dependencies

In [None]:
!pip install -qU \
langchain==0.0.354 \
openai==1.6.1 \
datasets==2.10.1 \
pinecone-client==3.0.0 \
tiktoken==0.5.2

# Accessing API Keys

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import json

with open("drive/MyDrive/keys.json", "r") as keys:
  data = json.load(keys)
  OPEN_API_KEY, PINECONE_API_KEY = data["OPEN_API_KEY"], data["PINECONE_API_KEY"]

# Initializing Model

In [None]:
import os
from langchain.chat_models import ChatOpenAI

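# The environment variable must be addressed by its name ("OPENAI_API_KEY"),
# not by the key's value. Fall back to the key loaded from keys.json if unset.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or OPEN_API_KEY

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],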
    model="gpt-3.5-turbo"
)
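
The key lookup above follows a common pattern: prefer a value already present in the environment, otherwise fall back to one read from a file. A minimal sketch of that pattern follows; the helper name `resolve_key`, the variable name `DEMO_API_KEY`, and the placeholder strings are hypothetical and used only so the example does not touch a real key.

```python
import os


def resolve_key(env_name: str, fallback: str) -> str:
    """Return the value of env_name if it is set and non-empty, else fallback."""
    return os.getenv(env_name) or fallback


# Hypothetical placeholder standing in for a value read from keys.json.
key_from_file = "placeholder-from-file"

# Case 1: variable not set, so the fallback is used.
os.environ.pop("DEMO_API_KEY", None)
print(resolve_key("DEMO_API_KEY", key_from_file))

# Case 2: variable set, so the environment wins.
os.environ["DEMO_API_KEY"] = "placeholder-from-env"
print(resolve_key("DEMO_API_KEY", key_from_file))
```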

# System Prompting

In [None]:
from langchain.schema import SystemMessage, HumanMessage, AIMessage

message = [
    SystemMessage(content="[Context] You are the AI behind a cutting-edge Q&A bot dedicated\
     to providing in-depth STEM knowledge to users for better understanding. \
     [Clarity] Craft clear and concise responses that deliver accurate information. \
     [Context Establishment] Assume users are seeking explanations, clarifications, \
     or insights on various STEM topics. [Examples] Include illustrative examples to \
     enhance the explanatory nature of the responses. [Gradual Refinement] If initial \
     responses lack depth, guide the model to delve deeper into specific concepts or \
     provide additional details. [Control Tokens] Use system messages to emphasize the \
     importance of accuracy and user-friendly explanations. [Temperature and Max Tokens] \
     Set temperature to 0.7 for a balanced blend of creativity and accuracy. Limit \
     responses to 150 tokens to ensure concise yet informative answers. [Experimentation] \
     Periodically experiment with variations of prompts to fine-tune the model's responses."),
]

In [None]:
res = chat(message)
print(res.content)

To calculate the volume of a rectangular prism, you need to multiply its length, width, and height. The formula for the volume of a rectangular prism is V = lwh, where V represents the volume, l represents the length, w represents the width, and h represents the height.

For example, let's say you have a rectangular prism with a length of 5 units, a width of 3 units, and a height of 2 units. To find its volume, you can use the formula V = 5 * 3 * 2 = 30 cubic units.

Remember to use consistent units when calculating volume. If the dimensions are given in different units, you may need to convert them to the same unit before multiplying them together.

Keep in mind that the volume of a rectangular prism represents the amount of space it occupies in three dimensions.


# Loading Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "jamescalam/llama-2-arxiv-papers-chunked",
    split="train"
)

dataset



Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 4838
})

In [None]:
dataset[0]

{'doi': '1102.0183',
 'chunk-id': '0',
 'chunk': 'High-Performance Neural Networks\nfor Visual Object Classi\x0ccation\nDan C. Cire\x18 san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J\x7f urgen Schmidhuber\nTechnical Report No. IDSIA-01-11\nJanuary 2011\nIDSIA / USI-SUPSI\nDalle Molle Institute for Arti\x0ccial Intelligence\nGalleria 2, 6928 Manno, Switzerland\nIDSIA is a joint institute of both University of Lugano (USI) and University of Applied Sciences of Southern Switzerland (SUPSI),\nand was founded in 1988 by the Dalle Molle Foundation which promoted quality of life.\nThis work was partially supported by the Swiss Commission for Technology and Innovation (CTI), Project n. 9688.1 IFF:\nIntelligent Fill in Form.arXiv:1102.0183v1  [cs.AI]  1 Feb 2011\nTechnical Report No. IDSIA-01-11 1\nHigh-Performance Neural Networks\nfor Visual Object Classi\x0ccation\nDan C. Cire\x18 san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J\x7f urgen Schmidhuber\nJanuary 2011\nAbs

# Creating Embeddings

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=OPEN_API_KEY)

In [None]:
texts = [
    'This is a test example for verifying shape of embedded texts!',
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])  # (number of texts, embedding dimension)

(1, 1536)

# Initializing Vector Knowledge Base "Pinecone"

In [None]:
from pinecone import Pinecone
from pinecone import ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)

spec = ServerlessSpec(
    cloud="GCP", region="us-central1"
)

In [None]:
import time

index_name = 'rag-index-api' # Name of index in pinecone
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

In [None]:
if index_name not in existing_indexes:
    pc.create_index(
        index_name,
        dimension=1536,
        metric='dotproduct',
        spec=spec
    )
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)  # Poll once per second until the index reports ready

index = pc.Index(index_name)
time.sleep(1)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.04838,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

# Viewing the Data in a pandas DataFrame

In [None]:
from tqdm.auto import tqdm

data = dataset.to_pandas()
data[:10]

Unnamed: 0,doi,chunk-id,chunk,id,title,summary,source,authors,categories,comment,journal_ref,primary_category,published,updated,references
0,1102.0183,0,High-Performance Neural Networks\nfor Visual O...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
1,1102.0183,1,"January 2011\nAbstract\nWe present a fast, ful...",1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
2,1102.0183,2,promising architectures for such tasks. The mo...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
3,1102.0183,3,"Mutch and Lowe, 2008), whose lters are xed, ...",1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
4,1102.0183,4,We evaluate various networks on the handwritte...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
5,1102.0183,5,"(Fukushima, 2003) helps to improve the recogni...",1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
6,1102.0183,6,2.3 Max-pooling layer\nThe biggest architectur...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
7,1102.0183,7,into a 1D feature vector. The top layer is alw...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
8,1102.0183,8,strategy is fast enough. We use the following ...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
9,1102.0183,9,Weights (AW) CUDA kernels. The second column s...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]


# Creating and Adding Embeddings to Vector Database

In [None]:
delay = 20        # seconds to pause between batches
batch_size = 100  # rows embedded and upserted per request

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i + batch_size)
    batch = data.iloc[i:i_end]
    # Unique ID per chunk: "<doi>-<chunk-id>"
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # Metadata stored with each vector; 'text' is read back at query time
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    index.upsert(vectors=list(zip(ids, embeds, metadata)))

    time.sleep(delay)
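
To make the batching arithmetic concrete, the sketch below reproduces only the index bookkeeping of the loop above for the dataset's 4838 rows, without calling OpenAI or Pinecone. The helper names `batch_bounds` and `make_id` are illustrative assumptions.

```python
from typing import Iterator, Tuple


def batch_bounds(n_rows: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)


def make_id(doi: str, chunk_id: str) -> str:
    """Build a vector ID in the same "<doi>-<chunk-id>" form as the loop above."""
    return f"{doi}-{chunk_id}"


bounds = list(batch_bounds(4838, 100))
print(len(bounds))               # 49 batches
print(bounds[0])                 # (0, 100)
print(bounds[-1])                # (4800, 4838), the final batch has 38 rows
print(make_id("1102.0183", "0"))  # 1102.0183-0
```

With 49 batches and `delay = 20`, the pauses alone account for 980 seconds of the upload.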

In [None]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.04838,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

# Finding Result Similarity

In [None]:
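# Alias the LangChain wrapper so it does not shadow pinecone.Pinecone imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # Metadata key that holds the chunk text (set during upsert)

# Initializing the vector store object
vectorstore = PineconeVectorStore(index, embed_model.embed_query, text_field)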



# Modifying Prompt

In [None]:
def modified_prompt(query):
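  # Retrieve the 3 chunks most similar to the query
  results = vectorstore.similarity_search(query, k=3)

  # Join the retrieved chunk texts into a single context block
  knowledge_base = "\n".join([x.page_content for x in results])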

  new_prompt = f"""Using the contexts below, answer the query.

  Contexts:
  {knowledge_base}

  Query:
  {query}"""

  return new_prompt
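
The prompt assembly can be tried without a live index by swapping the vector store for a stub. The sketch below is a hedged stand-in, not the notebook's code: `Doc`, `build_prompt`, and `fake_search` are hypothetical names, and the canned chunks are placeholders. Only the `page_content` attribute and the prompt layout mirror the function above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document; only page_content is used."""
    page_content: str


def build_prompt(query: str,
                 search: Callable[[str, int], List[Doc]],
                 k: int = 3) -> str:
    """Fetch k context chunks via `search` and wrap them around the query."""
    docs = search(query, k)
    knowledge_base = "\n".join(d.page_content for d in docs)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{knowledge_base}\n\n"
        f"Query:\n{query}"
    )


def fake_search(query: str, k: int) -> List[Doc]:
    """Hypothetical retriever returning canned chunks instead of querying Pinecone."""
    canned = [Doc("chunk A"), Doc("chunk B"), Doc("chunk C"), Doc("chunk D")]
    return canned[:k]


prompt = build_prompt("What is max-pooling?", fake_search)
print(prompt)
```

Passing the retriever in as a function keeps the prompt format testable on its own, separate from embedding and index behaviour.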

In [None]:
def conversate(query: str):
  # Human prompt
  prompt = HumanMessage(
      content=modified_prompt(query)
      )

  message.append(prompt)

  res = chat(message)

  # AI's response
  response = AIMessage(
      content=res.content
      )

  message.append(response)

  return res.content

# Running Inference

In [None]:
print(conversate("What is new hyper performant Vision Object Classification task? Provide references."))

A new hyper-performant vision object classification task refers to the development of advanced techniques and models that achieve exceptional performance in accurately identifying and classifying objects in images or videos. While the provided contexts do not directly mention a specific new hyper-performant vision object classification task, they do contain relevant information on computer vision research.

Some references that can provide insights into advancements in this field include:

1. "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics" by Kendall, A., Gal, Y., and Cipolla, R. (2018) in the Computer Vision and Pattern Recognition conference.

2. "Classifying and segmenting microscopy images using convolutional multiple instance learning" by Kraus, O.Z., Ba, L.J., and Frey, B. (2015) in the arXiv preprint arXiv:1511.05286.

3. "Retrieving actions in movies" by Laptev, I., and Pérez, P. (2007) in the International Conference on Computer Vision.