### Loading documents (PDF and Word files) for RAG

In this step, we gather the source materials (PDFs and Word documents) that our application will draw its answers from. Loading these documents gives the system access to their content and forms the foundation of our Retrieval Augmented Generation (RAG) application. LangChain provides a document loader per file type; here we use `PyPDFLoader` for PDFs and `Docx2txtLoader` for `.docx` files.

In [1]:
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import List
from langchain_core.documents import Document
import os

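def load_documents(folder_path: str) -> List[Document]:
    """Load every supported file (.pdf, .docx) in folder_path into Document objects."""
    documents = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        # Pick a loader by extension; compare in lowercase so ".PDF" is accepted too
        if filename.lower().endswith('.pdf'):
            loader = PyPDFLoader(file_path)  # yields one Document per page
        elif filename.lower().endswith('.docx'):
            loader = Docx2txtLoader(file_path)  # yields one Document per file
        else:
            print(f"Unsupported file type: {filename}")
            continue
        documents.extend(loader.load())
    return documents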

folder_path = "docs"
documents = load_documents(folder_path)
print(f"Loaded {len(documents)} documents from the folder.")

Loaded 7 documents from the folder.
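
The `if`/`elif` chain above maps a file extension to a loader. As a rough sketch of the same idea, the lookup can also be written as a dictionary keyed by the lowercase suffix, which makes adding a new file type a one-line change. The names `LOADER_BY_SUFFIX` and `pick_loader_name` are hypothetical, and the values are plain strings standing in for the loader classes so the snippet runs without LangChain installed.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file suffix to the name of the loader that handles it.
# Strings stand in for the real loader classes so this sketch runs without LangChain.
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
}

def pick_loader_name(filename: str) -> Optional[str]:
    """Return the loader name for a file, or None if the type is unsupported."""
    suffix = Path(filename).suffix.lower()  # lowercasing makes ".PDF" match ".pdf"
    return LOADER_BY_SUFFIX.get(suffix)

for name in ["report.PDF", "notes.docx", "image.png"]:
    print(name, "->", pick_loader_name(name))
```

Unsupported files come back as `None`, which the caller can log and skip, mirroring the `else` branch in `load_documents`.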


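Checking the contents of the very first document. Because `PyPDFLoader` returns one `Document` per page, each entry in `documents` from a PDF is a single page rather than a whole file.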

In [4]:
print(documents[0])

page_content='Question: 
Explain why it is impossible to design a perfectly secure Network & Information 
System. 
Answer: 
It is impossible to design a perfectly secure Network & Information System due to the 
following reasons: 
1. Evolving Threats: Cybersecurity threats are constantly changing. Attackers develop 
new techniques and exploit previously unknown vulnerabilities, making it 
impossible to anticipate and counter all potential attacks. 
2. Human Error: Many security breaches result from mistakes made by users or 
administrators, such as weak passwords, improper configurations, or falling victim 
to social engineering attacks. Human behavior is inherently unpredictable and 
cannot be fully secured. 
3. Complexity of Systems: Modern systems are highly complex, with multiple 
interconnected components. This complexity increases the likelihood of 
vulnerabilities that attackers can exploit. Ensuring every component is secure is 
practically unachievable. 
4. Resource Limitation

### Splitting documents

Once the documents are loaded, we need to divide them into smaller, manageable sections or chunks. This segmentation is crucial because it allows the system to retrieve and process relevant information more effectively, especially when dealing with large texts. LangChain provides text splitters that assist in breaking down documents appropriately.

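We're using a chunk size of 1,000 characters with a 200-character overlap, so consecutive chunks share some text and context that straddles a chunk boundary is less likely to be lost. Adjust both values to suit your needs and the type of data being used for RAG.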
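
To make the two parameters concrete, here is a simplified sketch of fixed-size chunking with overlap, using tiny values so the result is easy to inspect. This is not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraph breaks, newlines, and spaces, so its chunks usually end on natural boundaries. The function name `chunk_text` is hypothetical.

```python
from typing import List

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghij"
for chunk in chunk_text(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
# prints:
# abcd
# defg
# ghij
```

Each chunk starts with the last character of the previous one, which is the overlap at work. With `chunk_size=1000` and `chunk_overlap=200`, the window would advance 800 characters at a time.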

In [2]:
# Split the documents into chunks of 1000 characters with 200 characters overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

splits = text_splitter.split_documents(documents)
print(f"Split the documents into {len(splits)} chunks.")

Split the documents into 14 chunks.


Checking the contents of the very first chunk

In [5]:
print(splits[0])

page_content='Question: 
Explain why it is impossible to design a perfectly secure Network & Information 
System. 
Answer: 
It is impossible to design a perfectly secure Network & Information System due to the 
following reasons: 
1. Evolving Threats: Cybersecurity threats are constantly changing. Attackers develop 
new techniques and exploit previously unknown vulnerabilities, making it 
impossible to anticipate and counter all potential attacks. 
2. Human Error: Many security breaches result from mistakes made by users or 
administrators, such as weak passwords, improper configurations, or falling victim 
to social engineering attacks. Human behavior is inherently unpredictable and 
cannot be fully secured. 
3. Complexity of Systems: Modern systems are highly complex, with multiple 
interconnected components. This complexity increases the likelihood of 
vulnerabilities that attackers can exploit. Ensuring every component is secure is 
practically unachievable.' metadata={'source': 'd

Let's also check out the metadata of the chunk. Metadata tells us more about the context of the chunk and can help in applying filters when performing RAG, for example filtering by a certain document or source file, or by a date/time constraint.

In [6]:
print(splits[0].metadata)

{'source': 'docs\\nis past papers.pdf', 'page': 0}
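
As a small illustration of metadata filtering, the sketch below keeps only the chunks whose metadata matches a set of key/value pairs. It runs on a minimal stand-in class rather than LangChain's `Document`, and the names `Chunk` and `filter_chunks` as well as the file names are hypothetical. Vector stores generally apply such filters inside the store at query time; this version only shows the matching logic.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Chunk:
    """Minimal stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)

def filter_chunks(chunks: List[Chunk], **criteria: Any) -> List[Chunk]:
    """Keep only chunks whose metadata matches every key/value pair in criteria."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in criteria.items())
    ]

chunks = [
    Chunk("first page text", {"source": "a.pdf", "page": 0}),
    Chunk("second page text", {"source": "a.pdf", "page": 1}),
    Chunk("other file text", {"source": "b.docx"}),
]

page_zero = filter_chunks(chunks, source="a.pdf", page=0)
print([c.page_content for c in page_zero])  # prints ['first page text']
```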


### Creating embeddings using Cohere embeddings

After splitting the documents, we transform each chunk into a numerical representation known as an embedding using Cohere's embedding models. These embeddings capture the semantic meaning of the text, enabling the system to understand and compare the content of different chunks. LangChain integrates seamlessly with Cohere to facilitate this embedding process.

Get a (free trial) API key here: https://dashboard.cohere.com/api-keys

In [9]:
import getpass
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Prompt for the API key only if it was not found in the environment or the .env file
if not os.getenv("COHERE_API_KEY"):
    os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API key: ")

We're using Cohere's `embed-english-light-v3.0`, a lighter and faster variant of Cohere's v3 English embedding model.

In [10]:
from langchain_cohere import CohereEmbeddings

embeddings = CohereEmbeddings(
    model="embed-english-light-v3.0",
)

Creating the embeddings for all the document chunks

In [11]:
document_embeddings = embeddings.embed_documents([split.page_content for split in splits])
print(f"Created embeddings for {len(document_embeddings)} document chunks.")

Created embeddings for 14 document chunks.
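
To illustrate what "comparing the content of different chunks" means numerically, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three vectors are made up for illustration and are not produced by Cohere; real embeddings have hundreds of dimensions.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for embeddings of three short texts.
firewall = [0.9, 0.1, 0.0]
intrusion = [0.8, 0.2, 0.1]
recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(firewall, intrusion))  # close to 1: related topics
print(cosine_similarity(firewall, recipe))     # close to 0: unrelated topics
```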


### Set up the vector store for storing the embeddings

With embeddings created, we store them in a vector store—a specialized database designed for handling high-dimensional vectors. This setup allows for efficient storage and quick retrieval of embeddings, which is essential for the performance of our RAG application. LangChain offers support for various vector stores to manage embeddings effectively.

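We use the Chroma vector store through LangChain's `langchain_chroma` integration because it is quick and easy to configure for basic test applications. Note that `Chroma.from_documents` embeds the chunks itself using the `embedding` object passed to it, so the `document_embeddings` list computed earlier is not reused here.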

In [13]:
from langchain_chroma import Chroma

collection_name = "my_collection" # choose any name for your collection
vectorstore = Chroma.from_documents(
    collection_name=collection_name,
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db" # directory to store the vector store
)
print("Vector store created and persisted to './chroma_db'")


Vector store created and persisted to './chroma_db'


### Performing vector search

Vector search involves querying the vector store to find embeddings that are most similar to a given input query. This process helps in identifying the most relevant chunks of information from our documents in response to user queries. LangChain provides tools to perform vector searches efficiently, ensuring that the most pertinent information is retrieved.

Ask a question that is relevant to the documents you loaded and that can be answered from their contents alone.
Note the `k` parameter: it sets how many of the most relevant chunks are returned. Here `k=2` retrieves the top 2 chunks; try changing it.
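
The role of `k` can be sketched with a brute-force search over a tiny in-memory store: score every stored vector against the query and keep the `k` best. The store contents and the names `dot` and `top_k` are hypothetical, and the dot product is used as the score on the assumption that the vectors are unit length. Real vector stores typically rely on approximate nearest-neighbor indexes instead of scanning every vector.

```python
import heapq
from typing import Dict, List, Tuple

def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec: List[float], store: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (chunk_id, score) pairs with the highest score, best first."""
    scored = ((chunk_id, dot(query_vec, vec)) for chunk_id, vec in store.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical store: chunk ids mapped to made-up unit-length vectors.
store = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.6, 0.8],
    "chunk-c": [0.0, 1.0],
}

for chunk_id, score in top_k([0.8, 0.6], store, k=2):
    print(chunk_id, round(score, 2))
```

Raising `k` returns more context for the model to work with, at the cost of including lower-scoring and possibly less relevant chunks.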

In [14]:
query = "Why is it difficult to implement security in a system?"

search_results = vectorstore.similarity_search(query, k=2) # vector search here is being performed with k=2 meaning fetch the top 2 most relevant chunks

print(f"\nTop 2 most relevant chunks for the query: '{query}'\n")

for i, result in enumerate(search_results, 1):
    print(f"Result {i}:")
    print(f"Source: {result.metadata.get('source', 'Unknown')}")
    print(f"Content: {result.page_content}")
    print()



Top 2 most relevant chunks for the query: 'Why is it difficult to implement security in a system?'

Result 1:
Source: docs\nis past papers.pdf
Content: interconnected components. This complexity increases the likelihood of 
vulnerabilities that attackers can exploit. Ensuring every component is secure is 
practically unachievable. 
4. Resource Limitations: Implementing security measures involves costs and trade-
offs, such as reduced system performance or higher maintenance requirements. 
Organizations often cannot afford the resources needed for comprehensive 
security. 
5. Conflict Between Usability and Security: Strong security measures often make 
systems harder to use, leading to resistance from users. Balancing usability with 
security inevitably creates gaps that attackers can exploit. 
These challenges ensure that absolute security remains unattainable; instead, the goal is 
to mitigate risks to an acceptable level through continuous monitoring and updating of 
security measur

### Creating a retriever for the RAG chain

The retriever acts as a bridge between the user's query and the relevant document chunks. It uses vector search to fetch the most relevant chunks from the vector store, providing the context needed to generate accurate responses. LangChain's retrievers expose this search through a uniform `invoke` interface, which lets them plug directly into a RAG chain.
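
Conceptually, a retriever is a search function with its parameters bound in advance, exposed behind a single `invoke(query)` call. The sketch below shows that shape only. `ToyRetriever`, `keyword_search`, and the sample chunks are hypothetical, and word overlap stands in for vector search so the example needs no embedding model.

```python
from typing import Callable, List

class ToyRetriever:
    """Wraps a search function and a fixed k behind a single invoke(query) method."""

    def __init__(self, search: Callable[[str, int], List[str]], k: int = 2):
        self.search = search
        self.k = k

    def invoke(self, query: str) -> List[str]:
        return self.search(query, self.k)

CHUNKS = [
    "Human error causes many security breaches.",
    "ICMP flooding overwhelms a server with echo requests.",
    "Complex systems contain more vulnerabilities.",
]

def keyword_search(query: str, k: int) -> List[str]:
    """Stand-in for vector search: rank chunks by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        CHUNKS,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

retriever = ToyRetriever(keyword_search, k=1)
print(retriever.invoke("what causes security breaches"))
```

Because callers only depend on `invoke`, the search strategy behind it can change without touching the rest of the pipeline.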

In [15]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 2}) # set k according to your requirements

retriever_results = retriever.invoke("Why is it difficult to implement security in a system?") # ask a question relevant to your uploaded data

print(retriever_results)

[Document(metadata={'page': 0, 'source': 'docs\\nis past papers.pdf'}, page_content='interconnected components. This complexity increases the likelihood of \nvulnerabilities that attackers can exploit. Ensuring every component is secure is \npractically unachievable. \n4. Resource Limitations: Implementing security measures involves costs and trade-\noffs, such as reduced system performance or higher maintenance requirements. \nOrganizations often cannot afford the resources needed for comprehensive \nsecurity. \n5. Conflict Between Usability and Security: Strong security measures often make \nsystems harder to use, leading to resistance from users. Balancing usability with \nsecurity inevitably creates gaps that attackers can exploit. \nThese challenges ensure that absolute security remains unattainable; instead, the goal is \nto mitigate risks to an acceptable level through continuous monitoring and updating of \nsecurity measures. \nQuestion: \n(b) DETERMINE the following Denial of 

### Building the RAG chain

In the final step, we construct the RAG chain, which combines the retrieval and generation processes. The retriever supplies relevant document chunks based on the user's query, and the language model uses this information to generate informed and contextually accurate responses. LangChain facilitates the seamless integration of these components using the | operator to construct the RAG chain efficiently with minimal code.
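
Before wiring up the real components, it may help to see the three stages (retrieve, augment, generate) as plain function calls. Everything in this sketch is a placeholder: `fake_retrieve` returns fixed text and `fake_llm` only reports the prompt length, so it demonstrates the data flow rather than any actual retrieval or generation.

```python
from typing import List

TEMPLATE = """Answer the question based only on the following context:
{context}
Question: {question}
Answer: """

def fake_retrieve(question: str) -> List[str]:
    """Hypothetical retriever: returns fixed chunks regardless of the question."""
    return ["Chunk one text.", "Chunk two text."]

def join_chunks(chunks: List[str]) -> str:
    """Merge retrieved chunks into one context string separated by blank lines."""
    return "\n\n".join(chunks)

def fake_llm(prompt: str) -> str:
    """Hypothetical model: reports the prompt size instead of generating text."""
    return f"(model received {len(prompt)} characters)"

def answer(question: str) -> str:
    context = join_chunks(fake_retrieve(question))                # retrieval
    prompt = TEMPLATE.format(context=context, question=question)  # augmentation
    return fake_llm(prompt)                                       # generation

print(answer("What is a flooding attack?"))
```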

We'll use Llama 3.3 as our LLM of choice, through Groq. Get Groq API Key here: https://console.groq.com/keys

In [16]:
from dotenv import load_dotenv
import os

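# load_dotenv() copies GROQ_API_KEY from the .env file into os.environ if it is defined there
load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")
if not groq_api_key:
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file or environment.")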

In [17]:
from langchain_groq import ChatGroq

llm = ChatGroq(temperature=0, model_name="llama-3.3-70b-versatile")

In [18]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Define the template for the chat prompt - this will be used for each query - the context and question will be filled in dynamically
template = """Answer the question based only on the following context:
{context}
Question: {question}
Answer: """

prompt = ChatPromptTemplate.from_template(template)


# Define a function to convert a list of documents to a single string - to inject the retrieved context as string into the prompt ("context" placeholder)
def docs2str(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Define the RAG chain - the | operator is used to chain the components together, one after the other in order
rag_chain = (
    {"context": retriever | docs2str, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser() # a built-in output parser provided by LangChain to convert the model output to a string for better readability
)
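
For intuition about what `|` does in the chain above, here is a toy version of operator-based chaining built on Python's `__or__` method. It is not LangChain's implementation, and the `Step` class is hypothetical; it only shows how `a | b` can produce a new object that runs `a` and feeds its output to `b`.

```python
from typing import Any, Callable

class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a first, then feeds its output to b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)

strip = Step(str.strip)
upper = Step(str.upper)
exclaim = Step(lambda s: s + "!")

pipeline = strip | upper | exclaim
print(pipeline.invoke("  hello  "))  # prints "HELLO!"
```

In the real chain, the leading dictionary plays an extra role: each of its values receives the same input question, and the resulting `context` and `question` entries are passed together to the prompt.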


#### Let's test our RAG chain now

In [20]:
question = "What are different kinds of flooding attacks?" # ask a question relevant to your uploaded data

response = rag_chain.invoke(question) # invoke the RAG chain with the question

print(f"Question: {question}")

print(f"Answer: {response}")


Question: What are different kinds of flooding attacks?
Answer: Based on the provided context, there are at least two kinds of flooding attacks:

1. TCP SYN Flooding Attacks: This type of attack exploits the TCP three-way handshake mechanism by sending a large number of SYN packets with spoofed source addresses to the server, consuming server resources.

2. ICMP Flooding Attacks: This type of attack overwhelms a server by flooding it with ICMP packets, such as echo requests (ping), consuming the target's bandwidth and processing power.

Additionally, there is a mention of "Reflection Attacks", which may also be a type of flooding attack, but the context does not provide a detailed explanation of this specific attack.
