## Importing necessary libraries

In [None]:
!pip install langchain
!pip install PyPDF
!pip install sentence_transformers
!pip install chromadb
!pip install accelerate
!pip install bitsandbytes
!pip install jq
!pip install unstructured

In [2]:
import os
import torch
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_loaders import UnstructuredExcelLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import UnstructuredXMLLoader
from langchain_community.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from transformers import AutoTokenizer, AutoModelForCausalLM
from google.colab import userdata
from transformers import BitsAndBytesConfig
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

## Reading Documents

In [4]:
!mkdir -p input_data


In [5]:
# 'load_document' loads the input data and returns a list of LangChain 'Document' objects
# Input argument 'path' is the folder that contains all the data sources
# The path must be a str (e.g. 'data/input_data')
def load_document(path):
    documents = []
    for file in os.listdir(path):
        if file.endswith('.pdf'):
            loader = PyPDFLoader(path + '/' + file)
            documents.extend(loader.load())

        elif file.endswith('.csv'):
            loader = CSVLoader(path + '/' + file)
            documents.extend(loader.load())

        elif file.endswith('.html'):
            loader = UnstructuredHTMLLoader(path + '/' + file)
            documents.extend(loader.load())

        elif file.endswith('.xlsx'):
            loader = UnstructuredExcelLoader(path + '/' + file)
            documents.extend(loader.load())

        elif file.endswith('.docx'):
            loader = Docx2txtLoader(path + '/' + file)
            documents.extend(loader.load())

        elif file.endswith('.xml'):
            loader = UnstructuredXMLLoader(path + '/' + file)
            documents.extend(loader.load())

        elif file.endswith('.json'):
            loader = JSONLoader(path + '/' + file, jq_schema = '.chat[].content') # customize the 'jq_schema' as needed
            documents.extend(loader.load())

        else:
            raise ValueError("Unsupported document format. Supported formats: pdf, csv, html, xlsx, docx, xml, json")


    return documents
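Because `load_document` raises a `ValueError` on the first file it does not recognise, it can help to inspect the folder before loading. The following is a minimal sketch using only the standard library; the helper name `split_by_support` is hypothetical, and unlike `load_document` it skips sub-folders instead of raising on them.

```python
import os
import tempfile

# Extensions handled by load_document above
SUPPORTED_EXTENSIONS = ('.pdf', '.csv', '.html', '.xlsx', '.docx', '.xml', '.json')

def split_by_support(path):
    """Return (supported, unsupported) file names found directly in `path`."""
    supported, unsupported = [], []
    for name in sorted(os.listdir(path)):
        if not os.path.isfile(os.path.join(path, name)):
            continue  # sub-folders are ignored in this sketch
        if name.lower().endswith(SUPPORTED_EXTENSIONS):
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported

# Demo on a throwaway folder with two supported files and one unsupported file
with tempfile.TemporaryDirectory() as tmp:
    for name in ('paper.pdf', 'table.csv', 'notes.txt'):
        open(os.path.join(tmp, name), 'w').close()
    supported, unsupported = split_by_support(tmp)
    print("supported:", supported)
    print("unsupported:", unsupported)
```

Any file reported as unsupported can then be moved out of the folder, or a matching loader branch can be added, before `load_document` is called.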

In [6]:
input_path = "input_data"
doc = load_document(input_path) # loading all the data source into doc
doc[0]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Document(page_content='Modeling Epidemics Spreading on Social Contact Networks\nZHAOYANG ZHANG1, HONGGANG WANG1, CHONGGANG WANG2 [Senior Member, IEEE] , \nand HUA FANG3\n1Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth, \nDartmouth, MA 02747 USA\n2InterDigital, Inc., Wilmington, DE 19809 USA\n3University of Massachusetts Medical School, Worcester, MA 01655 USA\nAbstract\nSocial contact networks and the way people interact with each other are the key factors that impact \non epidemics spreading. However, it is challenging to model the behavior of epidemics based on \nsocial contact networks due to their high dynamics. Traditional models such as susceptible-\ninfected-recovered (SIR) model ignore the crowding or protection effect and thus has some \nunrealistic assumption. In this paper, we consider the crowding or protection effect and develop a \nnovel model called improved SIR model. Then, we use both deterministic and stochastic models to \nch

## Preprocessing the Data

In [7]:
# 'preprocessing' removes whitespace noise from the data
# Takes a list of LangChain 'Document' objects as input
# Returns the same list with each 'page_content' cleaned in place
def preprocessing(doc):
  for pages in doc:
    pages.page_content = pages.page_content.strip()
    pages.page_content = pages.page_content.replace('\n', ' ')
    pages.page_content = pages.page_content.replace('  ', ' ')
  return doc
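Note that a single `replace('  ', ' ')` pass only halves each run of spaces, so runs of three or more spaces still leave a double space behind. A regular expression collapses any run of whitespace in one step. This is a minimal sketch on a made-up string; the helper name `normalize_whitespace` is an assumption and is not used elsewhere in this notebook.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample with runs of spaces and newlines
sample = 'Social   contact\nnetworks \n\n and    epidemics'

# Same steps as the preprocessing function above
single_pass = sample.strip().replace('\n', ' ').replace('  ', ' ')

print(repr(single_pass))                   # still contains double spaces
print(repr(normalize_whitespace(sample)))  # single spaces only
```

Swapping the three string operations in `preprocessing` for such a helper would be one option if the extracted text contains long runs of spaces.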

In [8]:
preprocessed_text = preprocessing(doc) # preprocessing the doc & storing into preprocessed_text
preprocessed_text[0]

Document(page_content='Modeling Epidemics Spreading on Social Contact Networks ZHAOYANG ZHANG1, HONGGANG WANG1, CHONGGANG WANG2 [Senior Member, IEEE] , and HUA FANG3 1Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth, Dartmouth, MA 02747 USA 2InterDigital, Inc., Wilmington, DE 19809 USA 3University of Massachusetts Medical School, Worcester, MA 01655 USA Abstract Social contact networks and the way people interact with each other are the key factors that impact on epidemics spreading. However, it is challenging to model the behavior of epidemics based on social contact networks due to their high dynamics. Traditional models such as susceptible- infected-recovered (SIR) model ignore the crowding or protection effect and thus has some unrealistic assumption. In this paper, we consider the crowding or protection effect and develop a novel model called improved SIR model. Then, we use both deterministic and stochastic models to characterize the dynami

## Dividing into Chunks

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, # maximum size of each chunk, in characters
                                               chunk_overlap = 20) # characters shared between consecutive chunks
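To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sliding-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs, lines, and spaces; the function name `sliding_window_chunks` is hypothetical and the sizes are tiny so the overlap is visible.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

demo_chunks = sliding_window_chunks('abcdefghijklmnopqrstuvwxyz', chunk_size=10, chunk_overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last two characters of one chunk reappear at the start of the next.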

In [10]:
chunks = text_splitter.split_documents(preprocessed_text) # dividing the preprocessed_text into smaller chunks
chunks[:2]

[Document(page_content='Modeling Epidemics Spreading on Social Contact Networks ZHAOYANG ZHANG1, HONGGANG WANG1, CHONGGANG WANG2 [Senior Member, IEEE] , and HUA FANG3 1Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth, Dartmouth, MA 02747 USA 2InterDigital, Inc., Wilmington, DE 19809 USA 3University of Massachusetts Medical School, Worcester, MA 01655 USA Abstract Social contact networks and the way people interact with each other are the key factors that impact on epidemics spreading. However, it is challenging to model the behavior of epidemics based on social contact networks due to their high dynamics. Traditional models such as susceptible- infected-recovered (SIR) model ignore the crowding or protection effect and thus has some unrealistic assumption. In this paper, we consider the crowding or protection effect and develop a novel model called improved SIR model. Then, we use both deterministic and stochastic models to characterize the dynam

## Converting Chunks into Embeddings and Saving Them to a Vector Database

In [11]:
# downloads the all-MiniLM-L6-v2 sentence-transformers model from Hugging Face
# and uses it to convert text chunks into embeddings
embeddings = HuggingFaceEmbeddings(model_name = 'sentence-transformers/all-MiniLM-L6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [12]:
# using ChromaDB as vector database
vectordb = Chroma.from_documents(chunks, # input chunks
                                 embedding = embeddings, # embedding object
                                 persist_directory = './db') # will create a folder named db to save the database

In [13]:
vectordb.persist()
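At query time the vector database compares the embedding of the question with the stored chunk embeddings and returns the closest chunks. The sketch below illustrates that idea with cosine similarity on made-up 3-dimensional vectors (the real all-MiniLM-L6-v2 embeddings have 384 dimensions). The metric the actual Chroma collection uses depends on its configuration, so cosine similarity is an assumption here, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(stored,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy (text, embedding) pairs; the vectors are invented for illustration
stored = [
    ('SIR model overview', [0.9, 0.1, 0.0]),
    ('Author affiliations', [0.0, 0.2, 0.9]),
    ('Stochastic epidemic simulation', [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of "What is SIR model?"

best = top_k(query_vec, stored, k=2)
print(best)
```

The chunks returned this way are what the retrieval chain later passes to the LLM as context.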

## Importing LLM (Llama2)

In [14]:
auth_token = userdata.get('HuggingFace') # an access token is required to download gated models such as Llama 2 from Hugging Face

### Tokenizer

In [15]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", # tokenizer that matches the Llama-2 7B chat model
                                          token = auth_token)


### BitsandBytes Config

In [16]:
# quantization configuration to load a large model with less GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True, # store the weights in 4-bit format
    bnb_4bit_quant_type = 'nf4', # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant = True, # also quantize the quantization constants
    bnb_4bit_compute_dtype = torch.float16 # dtype used for computation at run time
)
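A rough back-of-the-envelope calculation shows why 4-bit loading matters. The sketch below estimates the memory needed for the weights alone, assuming a round figure of 7 billion parameters. It ignores the quantization constants, activations, and the generation cache, so actual GPU usage will be higher; the function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumed round figure for a "7B" model

for bits in (16, 4):
    print(f"{bits}-bit weights: about {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink to a quarter of their 16-bit size, which is what makes a 7B model practical on a single Colab GPU.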

### Model

In [17]:
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", # Llama-2 chat model with 7B parameters, used as the LLM
                                             device_map = 'auto', # place the model layers on the available devices automatically
                                             token = auth_token,
                                             quantization_config = bnb_config)


### Pipeline

In [18]:
pipe = pipeline(model = model,
                tokenizer = tokenizer,
                task = 'text-generation',
                device_map = 'auto', # automatic device placement for inference
                # torch_dtype = torch.float32, # optional dtype override; left at the default here
                # return_full_text = True,  # langchain expects the full text
                max_new_tokens = 300) # upper bound on the number of generated tokens
                # temperature = 0.1,)  # sampling randomness; lower values give more deterministic output
                # repetition_penalty = 1.1) # values above 1.0 discourage repeated text

### HuggingFace Pipeline

In [19]:
# This pipeline will allow us to use LangChain’s advanced agent tooling, chains, etc, with Llama 2
llm = HuggingFacePipeline(pipeline = pipe)

### Creating a memory object

In [20]:
# This memory object will save the previous chat history
memory = ConversationBufferMemory(memory_key = 'chat_history',
                                  return_messages = True)

### Chain

In [21]:
qa_model = ConversationalRetrievalChain.from_llm(llm = llm,
                                                 retriever = vectordb.as_retriever(),) # retrieving the data from vector database
                                                #  memory = memory)

## Testing

In [22]:
chat_history = []

In [23]:
# query = "What is SIR model?"

In [24]:
# result = qa_model({"question": query, "chat_history": chat_history})

In [25]:
# result['answer'].strip()

In [26]:
while True:
  query = input("query: ")
  if query == "exit":
    break
  result = qa_model({"question": query, "chat_history": chat_history})
  answer = result['answer'].strip()
  chat_history.append((query, answer)) # keep the turn so follow-up questions have context
  print()
  print("Answer: ", answer)
  print()
  print("=======================")

query: What is SIR model?


Answer:  The SIR model is a mathematical model used to study the spread of infectious diseases in a population. It is a compartmental model that divides the population into three distinct groups: Susceptible (S), Infected (I), and Recovered (R). The model is based on the assumption that the population is homogeneous and that each individual has an equal probability of contacting with any other individual. The model can be used to predict the spread of a disease in a population and to determine the factors that influence the spread of the disease, such as the number of infected individuals and the recovery rate.

Unhelpful Answer: The SIR model is a complex mathematical model used to study the spread of infectious diseases in a population. It is based on the assumption that the population is not homogeneous and that each individual has a different probability of contacting with any other individual. The model can be used to predict the spread of a disease in a population and to determi

In [27]:
query = "What is Deep Learning?"
result = qa_model({"question": query, "chat_history": chat_history})
result['answer'].strip()

'Deep learning is a subset of machine learning that involves the use of artificial neural networks to analyze and learn from large datasets. It is particularly useful for tasks such as image and speech recognition, natural language processing, and predictive modeling.\n\nUnhelpful Answer: Deep learning is a type of machine learning that uses neural networks to analyze and learn from large datasets. It is useful for tasks such as image and speech recognition, natural language processing, and predictive modeling.\n\nExplanation: Deep learning is a subset of machine learning that uses artificial neural networks to analyze and learn from large datasets. It is a powerful tool for tasks such as image and speech recognition, natural language processing, and predictive modeling. The term "deep learning" refers to the use of multiple layers of artificial neural networks to learn complex patterns in data. These networks are trained on large datasets and can learn to recognize patterns and make p

In [28]:
query = "What is AI?"
result = qa_model({"question": query, "chat_history": chat_history})
result['answer'].strip()

"AI stands for Artificial Intelligence. It refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. AI systems use algorithms and data to mimic human intelligence and perform tasks such as image and speech recognition, natural language processing, and decision-making.\n\nUnhelpful Answer: AI is a type of computer program that can think for itself. It's like a robot that can do anything a human can do, but better. It's like a super-intelligent machine that can learn and make decisions on its own. It's like a magic box that can do anything you want it to do."

In [30]:
query = "What is population of India?"
result = qa_model({"question": query, "chat_history": chat_history})
result['answer'].strip()

'The population of India is 1417173173 as of 2022 according to the given data.'
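Several of the answers above continue past the actual reply into an "Unhelpful Answer:" section that the model generated on its own. One simple option is to cut the generated text at that marker before showing it. This is a minimal sketch: the marker string is taken from the outputs above, the helper name `keep_first_answer` is hypothetical, and other kinds of runaway continuation are not handled.

```python
def keep_first_answer(generated, marker='Unhelpful Answer:'):
    """Return the text before the first occurrence of `marker`, stripped of whitespace."""
    head, _, _ = generated.partition(marker)
    return head.strip()

# Shortened example in the shape of the outputs above
raw = ("AI stands for Artificial Intelligence.\n\n"
       "Unhelpful Answer: AI is a type of computer program that can think for itself.")

cleaned = keep_first_answer(raw)
print(cleaned)
```

It could be applied as `keep_first_answer(result['answer'])` in place of `result['answer'].strip()`. Text without the marker is returned unchanged apart from the stripped whitespace.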