# SCI-BOT

This project builds a chatbot for discussing scientific topics using retrieval-augmented generation (RAG). It first configures API keys for the OpenAI and Pinecone services. OpenAI's `gpt-3.5-turbo` model generates the responses. The knowledge base comes from the `ronaldahmed/scitechnews` dataset, whose press-release summaries are converted to vectors by an OpenAI embedding model and stored in a Pinecone index named `science-papers-1`. At query time, a prompt augmentation function retrieves the most similar summary from the index and adds it to the user's question as context. A Gradio chat interface lets users interact with the chatbot.

### Installing Packages 

In [None]:
!python -m pip install --upgrade pip

In [None]:
!pip install -qU \
    langchain==0.0.354 \
    openai==1.10.0 \
    datasets==2.10.1 \
    pinecone-client==3.0.0 \
    tiktoken==0.5.2

### OpenAI API

The OpenAI API key is stored in an environment variable to authenticate requests to OpenAI's services, and the `gpt-3.5-turbo` model is used to generate responses. The `SystemMessage`, `HumanMessage`, and `AIMessage` classes are imported from LangChain's schema module and collected in a `messages` list that seeds the conversation with a system prompt, a short example exchange, and a first question.
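
Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported in the shell beforehand; `require_env` is a hypothetical helper, not part of the OpenAI or LangChain APIs, and `SCI_BOT_DEMO_KEY` is a placeholder variable used only so the sketch runs on its own:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return value


# Placeholder variable (an assumption for this demo, not a real credential).
os.environ["SCI_BOT_DEMO_KEY"] = "placeholder-value"
print(require_env("SCI_BOT_DEMO_KEY"))
```

With such a helper, the cell below could pass `require_env("OPENAI_API_KEY")` to `ChatOpenAI` instead of assigning a literal string.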

In [None]:
!pip install -U langchain-openai

In [None]:
import os 
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "ENTER_YOUR_API_KEY"

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)

In [None]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand about New York City's Vaccine Passport Plan Renews Online Privacy Debate.")
]

### Loading Dataset 

The Hugging Face Datasets library is used to load the `ronaldahmed/scitechnews` dataset.

Dataset overview: 
The SciTechNews dataset consists of scientific papers paired with their corresponding press release snippets mined from ACM TechNews. ACM TechNews is a news aggregator that provides regular news digests about scientific achievements and technology in areas such as Computer Science, Engineering, Astrophysics, and Biology.

In [None]:
!pip install -U datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset("ronaldahmed/scitechnews")

dataset_sci_data_1 = dataset['train']

# Print information about the dataset
print(dataset_sci_data_1)

### Creating a Vector Database index

The code below prepares the Pinecone index that stores the embedded summaries. It first checks whether an index named `science-papers-1` already exists and, if not, creates one with 1536 dimensions (the output size of `text-embedding-ada-002`) and the cosine similarity metric, then polls until the index reports that it is ready. It then connects to the index and prints its statistics. Finally, it loads OpenAI's `text-embedding-ada-002` embedding model, embeds the `pr-summary` field of the training split in batches of 80, and upserts each vector into the index together with its id, title, and summary text as metadata.
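
To make the `metric='cosine'` setting concrete: cosine similarity scores two vectors by the angle between them and ignores their lengths. The pure-Python sketch below is for illustration only, with toy 3-dimensional vectors standing in for the 1536-dimensional embeddings. `cosine_similarity` and the sample vectors are hypothetical; Pinecone computes the metric on the server side.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b (close to 1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors (invented for the example).
query = [1.0, 0.0, 1.0]
docs = {
    "same-direction": [2.0, 0.0, 2.0],
    "partly-related": [1.0, 1.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
}

# Rank document names by similarity to the query, best first.
ranked = sorted(
    docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True
)
print(ranked)
```

Note that `same-direction` ranks first even though it is twice as long as the query, which is why cosine is a common choice for comparing text embeddings.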

In [None]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = "ENTER_YOUR_API_KEY"

# configure client
pc = Pinecone(api_key=api_key)

In [None]:
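from pinecone import PodSpec

# "gcp-starter" is a pod-based environment name, not a serverless cloud/region pair,
# so it is passed through PodSpec. For a serverless project, use instead
# ServerlessSpec(cloud=..., region=...) with the values shown in the Pinecone console.
spec = PodSpec(environment="gcp-starter")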

In [None]:
import time

index_name = 'science-papers-1'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

In [None]:
!pip install -U sqlalchemy

In [None]:
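# langchain-openai was installed earlier in the notebook
from langchain_openai import OpenAIEmbeddings

# text-embedding-ada-002 returns 1536-dimensional vectors, matching the index dimension
embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")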

In [None]:
from tqdm.auto import tqdm  # for progress bar

# Assuming 'train' is the split you want to work with
train_data = dataset_sci_data_1.to_pandas()  # Convert 'train' split to pandas DataFrame

batch_size = 80

for i in tqdm(range(0, len(train_data), batch_size)):
    i_end = min(len(train_data), i + batch_size)
    # get batch of data
    batch = train_data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [f"{x['id']}" for _, x in batch.iterrows()]
    # get text to embed
    texts = [x['pr-summary'] for _, x in batch.iterrows()]
    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['pr-summary'],
         'id': x['id'],
         'title': x['pr-title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))

In [None]:
index.describe_index_stats()
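
The upsert loop above follows a common pattern: slice the records into fixed-size batches, embed each batch, and build one `(id, vector, metadata)` tuple per record. A self-contained sketch of that pattern follows. The `batched` helper, the `fake_embed` function standing in for the OpenAI embedding call, and the plain list standing in for the Pinecone index are all assumptions for illustration; the record values are invented, and only the field names come from the loop above.

```python
from typing import Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Stand-in for an embedding model: one tiny 2-dimensional vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


# Invented records that reuse the dataset's field names.
records = [
    {"id": f"doc-{n}", "pr-title": f"Title {n}", "pr-summary": f"Summary number {n}"}
    for n in range(5)
]

stored = []  # stand-in for the vector index
for batch in batched(records, batch_size=2):
    vectors = fake_embed([r["pr-summary"] for r in batch])
    stored.extend(
        (
            str(r["id"]),
            vec,
            {"text": r["pr-summary"], "id": r["id"], "title": r["pr-title"]},
        )
        for r, vec in zip(batch, vectors)
    )

print(len(stored))
```

Embedding a whole batch per call, rather than one text at a time, keeps the number of API requests proportional to the number of batches instead of the number of records.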

### Similarity search and RAG

The code below adds retrieval-augmented generation (RAG) to the chatbot. It first names the metadata field, `text`, that holds each paper's press-release summary, then wraps the Pinecone index and the embedding model in a LangChain vector store that reads from that field. The `augment_prompt` function runs a similarity search for the user's query, takes the top result (`k=1`), and builds an augmented prompt that places the retrieved text ahead of the original query. The `input_message` function passes each user message through `augment_prompt`, appends the result to the conversation, and sends the conversation to the chat model, so that answers are grounded in the retrieved context.
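
The retrieve-then-augment flow can be tried without any external service. The sketch below swaps the Pinecone similarity search for a deliberately crude word-overlap ranking over an in-memory list. `retrieve`, `build_prompt`, and the two sample passages are hypothetical stand-ins, not the project's code or data.

```python
def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages sharing the most lowercase words with the query."""
    query_words = set(query.lower().split())

    def overlap(passage: str) -> int:
        return len(query_words & set(passage.lower().split()))

    return sorted(passages, key=overlap, reverse=True)[:k]


def build_prompt(query: str, passages: list[str], k: int = 1) -> str:
    """Place the retrieved passages ahead of the query in a single prompt."""
    contexts = "\n".join(retrieve(query, passages, k))
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{contexts}\n\n"
        f"Query: {query}"
    )


# Invented sample passages for the demo.
passages = [
    "Researchers tested a protocol for quantum information transfer.",
    "A city introduced a vaccine passport and privacy advocates objected.",
]
print(build_prompt("Who tested the quantum protocol?", passages))
```

An embedding-based search plays the same role as `retrieve` here, but it matches on meaning rather than on exact shared words.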

In [None]:
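# alias the LangChain wrapper so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index, embed_model.embed_query, text_field
)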

In [None]:
def augment_prompt(query: str):
    # get top 1 result from knowledge base
    results = vectorstore.similarity_search(query, k=1)

    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""

    print("augment_prompt: ", augmented_prompt)
    return augmented_prompt


In [None]:
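def input_message(message):
    # augment the user's actual message (not the literal string "message")
    prompt = HumanMessage(content=augment_prompt(message))

    # add the prompt to the conversation once, then query the model
    messages.append(prompt)
    res = chat.invoke(messages)

    # keep the reply in the history so later turns see the full conversation
    messages.append(res)

    return res.content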

### Creating an interface using Gradio

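The code below sets up a chat interface for the chatbot using the Gradio library. It wraps `input_message` in a callback, registers three example questions, and launches the app with `share=True`, which creates a temporary public link.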

In [None]:
! pip install --upgrade pyqt5 pyqtwebengine pydantic
! pip install --upgrade jinja2
! pip install --upgrade gradio

In [None]:
import gradio as gr

def respond(message, history):
    # Gradio also passes the chat history; the bot keeps its own in `messages`
    return input_message(message)

demo = gr.ChatInterface(
    fn=respond,
    examples=[
        "Who developed a quantum information transfer protocol?",
        "Which organization launched the Joint Cyber Defense Collaborative (JCDC)?",
        "Who is doing research about disease progression and response to treatment for brain disorders?",
    ],
    title="Sci-Bot",
)
demo.launch(share=True)