# <span style="color:#4682B4">**Statewise Insurance Data Analysis with LangChain and HuggingFace**</span>

## <span style="color:#4682B4">**Introduction**</span>

This notebook demonstrates how to leverage LangChain's retrieval-based pipelines and HuggingFace models for dynamic analysis of statewise insurance data. By integrating document loaders, vector databases, and language models, it enables efficient querying and contextual responses from a dataset of PDF documents.

Key highlights of this project include:
- **PDF Document Processing**: Load and split PDF files into manageable chunks for embedding and retrieval.
- **HuggingFace Embeddings and Models**: Generate embeddings with `BAAI/bge-small-en-v1.5` and responses with pre-trained models such as GPT-2.
- **Similarity-Based Search**: Perform document retrieval using FAISS vector databases to identify relevant content.
- **Custom Prompting**: Define templates to ensure focused, context-aware responses.
- **Question-Answering Pipeline**: Combine retrieval and LLM capabilities to answer user-defined queries.

This project shows one approach to querying unstructured PDF content, demonstrating how LangChain and HuggingFace can be combined for document intelligence tasks.

1. [<span style="color:#4682B4">Import Libraries</span>](#import-libraries)
2. [<span style="color:#4682B4">Load Environment Variables</span>](#load-env-vars)
3. [<span style="color:#4682B4">Load PDF Documents</span>](#load-pdf-documents)
4. [<span style="color:#4682B4">Split Documents into Chunks</span>](#split-documents)
5. [<span style="color:#4682B4">Generate Embeddings with HuggingFace</span>](#generate-embeddings)
6. [<span style="color:#4682B4">Create VectorStore with FAISS</span>](#create-vectorstore)
7. [<span style="color:#4682B4">Query Using Similarity Search</span>](#query-similarity-search)
8. [<span style="color:#4682B4">Use HuggingFaceHub for Querying</span>](#huggingfacehub-querying)
9. [<span style="color:#4682B4">Use HuggingFacePipeline for Querying</span>](#huggingfacepipeline-querying)
10. [<span style="color:#4682B4">Create a Prompt Template</span>](#create-prompt-template)
11. [<span style="color:#4682B4">Create a Retrieval-Based Question-Answering Chain</span>](#create-retrieval-qa)
12. [<span style="color:#4682B4">Execute the Question-Answering Query</span>](#execute-query)

## <span style="color:#4682B4">**1. Import Libraries**</span> <a id="import-libraries"></a>

This section includes the necessary imports for building the pipeline:

In [1]:
# Import libraries for environment variable management and additional utilities
import os                                                               # For file and path operations
from dotenv import load_dotenv                                          # Securely load environment variables
import requests                                                         # For making HTTP requests
import numpy as np                                                      # For numerical operations

# Import LangChain document loaders for processing PDF files
from langchain_community.document_loaders import PyPDFLoader            # For loading individual PDF files
from langchain_community.document_loaders import PyPDFDirectoryLoader   # For loading all PDFs in a directory

# Import LangChain utility for splitting documents into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter      # Splits documents into manageable chunks

# Import LangChain vector database for storing embeddings
from langchain_community.vectorstores import FAISS                      # Facebook's FAISS for similarity search

# Import HuggingFace utilities for embeddings and endpoints
from langchain_huggingface import HuggingFaceEmbeddings                 # Generate embeddings using HuggingFace models
from langchain_huggingface import HuggingFaceEndpoint                   # For accessing HuggingFace endpoints

# Import LangChain tools for prompting and chains
from langchain.prompts import PromptTemplate                            # To define custom prompts
from langchain.chains import RetrievalQA                                # Combine retriever and LLM for Q&A

# Import LangChain integrations for HuggingFace models
from langchain_community.llms import HuggingFaceHub                     # Interact with models hosted on HuggingFace Hub
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline  # Use HuggingFace pipelines

## <span style="color:#4682B4">**2. Load Environment Variables**</span> <a id="load-env-vars"></a>

In this section, we load the Hugging Face API token from a `.env` file. This keeps sensitive information, such as API tokens, out of the notebook itself. The `dotenv` library reads the file and exposes its entries as environment variables.

In [2]:
# Load environment variables from the `.env` file
load_dotenv()

# Access the Hugging Face token
hf_api_token = os.getenv('HUGGINGFACEHUB_API_TOKEN')
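
`os.getenv` returns `None` when the variable is missing, and the failure would then only surface later when the model is called. A minimal sketch of an early check follows; `require_env` is a hypothetical helper written for this illustration, not part of `dotenv` or LangChain, and the demo token is a made-up placeholder.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file before running the notebook."
        )
    return value


# Demonstration with an explicit mapping so the sketch runs without a real token.
demo_env = {"HUGGINGFACEHUB_API_TOKEN": "hf_example_token"}
token = require_env("HUGGINGFACEHUB_API_TOKEN", demo_env)
print("Token loaded:", bool(token))
```

In the notebook itself, the call would be `require_env("HUGGINGFACEHUB_API_TOKEN")` placed after `load_dotenv()`.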

## <span style="color:#4682B4">**3. Load PDF Documents**</span> <a id="load-pdf-documents"></a>

In this section, we use the `PyPDFDirectoryLoader` to load all PDF files from a specified directory. The loader processes the documents and stores them in a format suitable for further processing, such as text splitting and embedding.
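
Before loading, it can help to confirm which files the directory actually contains. The sketch below uses only the standard library; `list_pdf_files` is a hypothetical helper, and the `./us_census` path is the one used in this notebook. Note that the PyPDF-based loaders typically return one document per page, so the document count reported after loading may be a page count rather than a file count.

```python
from pathlib import Path


def list_pdf_files(directory):
    """Return the sorted PDF paths directly inside `directory` (hypothetical helper)."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")


# Path taken from the notebook; an empty list is returned if it does not exist.
pdf_paths = list_pdf_files("./us_census")
print(f"PDF files found: {len(pdf_paths)}")
```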

In [3]:
# Load all PDF documents from the specified directory
pdf_loader = PyPDFDirectoryLoader("./us_census")

# Load the content of the PDFs into a list of documents
pdf_documents = pdf_loader.load()

# Display the number of loaded documents to verify
print(f"Number of documents loaded: {len(pdf_documents)}")

Number of documents loaded: 63


## <span style="color:#4682B4">**4. Split Documents into Chunks**</span> <a id="split-documents"></a>

To process the PDF documents effectively, we split them into smaller chunks using the `RecursiveCharacterTextSplitter`. Because `chunk_size` is measured in characters rather than tokens, it only approximates the model's token limit; the overlap between consecutive chunks preserves context across chunk boundaries.

### Parameters:
- **`chunk_size`**: The maximum size of each chunk (in characters).
- **`chunk_overlap`**: The overlap between consecutive chunks to retain context.
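
To make the two parameters concrete, here is a simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers natural separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical function that only illustrates how size and overlap interact.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so consecutive windows share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

With a size of 10 and an overlap of 3, each chunk repeats the last three characters of the previous one, which is how context is carried across chunk boundaries.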

In [4]:
# Initialise the RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,                    # Maximum size of each document chunk (in characters)
    chunk_overlap=200                   # Overlap between consecutive chunks to preserve context
)

# Split the loaded PDF documents into smaller chunks
final_documents = text_splitter.split_documents(pdf_documents)

# Display the first chunk to verify the output
print("Sample Chunk:\n", final_documents[0])

# Display the total number of document chunks created
print(f"Total number of chunks: {len(final_documents)}")

Sample Chunk:
 page_content='Health Insurance Coverage Status and Type 
by Geography: 2021 and 2022
American Community Survey Briefs
ACSBR-015
Issued September 2023
Douglas Conway and Breauna Branch
INTRODUCTION
Demographic shifts as well as economic and govern-
ment policy changes can affect people’s access to 
health coverage. For example, between 2021 and 2022, 
the labor market continued to improve, which may 
have affected private coverage in the United States 
during that time.1 Public policy changes included 
the renewal of the Public Health Emergency, which 
allowed Medicaid enrollees to remain covered under 
the Continuous Enrollment Provision.2 The American 
Rescue Plan (ARP) enhanced Marketplace premium 
subsidies for those with incomes above 400 percent 
of the poverty level as well as for unemployed people.3
In addition to national policies, individual states and 
the District of Columbia can affect health insurance 
coverage by making Marketplace or Medicaid more' metadat

## <span style="color:#4682B4">**5. Generate Embeddings with HuggingFace**</span> <a id="generate-embeddings"></a>

In this section, we use the `HuggingFaceEmbeddings` module to create vector embeddings for the document chunks. These embeddings represent the semantic meaning of the text and can be used for similarity-based retrieval tasks.

### Parameters:
- **`model_name`**: The HuggingFace model used for embedding text.
- **`model_kwargs`**: Additional arguments for the embedding model (e.g., device configuration).
- **`encode_kwargs`**: Arguments for embedding behavior, such as normalization.
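
The effect of normalization can be shown with plain Python. This is a sketch with made-up two-dimensional vectors, not output from the embedding model: once vectors are scaled to unit length, their dot product equals their cosine similarity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors chosen for illustration only.
a = [3.0, 4.0]
b = [4.0, 3.0]

print("Cosine similarity:        ", cosine_similarity(a, b))
print("Dot product of unit forms:", dot(l2_normalize(a), l2_normalize(b)))
```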

In [5]:
# Initialize HuggingFaceEmbeddings for vectorization
huggingface_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",        # Pretrained HuggingFace model for embeddings
    model_kwargs={
        'device': 'cpu'                         # Specify the device for embedding (e.g., CPU or GPU)
    },
    encode_kwargs={
        'normalize_embeddings': True            # Normalize the embeddings for better similarity comparison
    }
)

# Generate embeddings for the first document chunk
chunk_embedding = np.array(huggingface_embeddings.embed_query(final_documents[0].page_content))
print("Embedding Vector:\n", chunk_embedding)

# Display the shape of the embedding vector
print(f"Embedding Vector Shape: {chunk_embedding.shape}")

Embedding Vector:
 [-8.30601528e-02 -1.45066781e-02 -2.10276209e-02  2.72682514e-02
  4.53647189e-02  5.28341569e-02 -2.53759008e-02  3.61303873e-02
 -9.08312052e-02 -2.77017541e-02  7.97398016e-02  6.42475039e-02
 -3.54004540e-02 -4.04245965e-02 -1.13771809e-02  4.45296392e-02
 -3.88549455e-03 -3.79062863e-03 -4.54510041e-02  2.67047286e-02
 -2.05681734e-02  2.87432708e-02 -2.41201706e-02 -3.69412228e-02
  1.92781072e-02  1.06194830e-02  3.21828108e-03  2.33252533e-03
 -4.29321937e-02 -1.64999187e-01  2.77012540e-03  2.68276725e-02
 -4.12894450e-02 -1.88446417e-02  1.58918686e-02  9.22324881e-03
 -2.00687312e-02  8.16561207e-02  3.89413238e-02  5.52223697e-02
 -3.69984321e-02  1.75319184e-02 -1.28967147e-02  2.80576147e-04
 -2.51581222e-02  4.59336163e-03 -2.39579082e-02 -5.76565601e-03
  6.02953788e-03 -3.61177884e-02  3.84415388e-02 -1.75466656e-03
  5.05656078e-02  6.02408350e-02  4.52067479e-02 -4.91435602e-02
  1.82053614e-02 -1.46668823e-02 -2.53130607e-02  3.18243839e-02
  5.15

## <span style="color:#4682B4">**6. Create VectorStore with FAISS**</span> <a id="create-vectorstore"></a>

In this section, we create a FAISS vector store to store the embeddings of the document chunks. FAISS is a library for efficient similarity search and clustering of dense vectors, enabling fast retrieval of relevant document chunks.

- **`final_documents[:120]`**: The first 120 document chunks used to create the vector store.
- **`huggingface_embeddings`**: The embedding model used to generate the vector representations of the chunks.

In [6]:
# Create a FAISS vector store for document embeddings
vectorstore = FAISS.from_documents(
    final_documents[:120],               # Use the first 120 document chunks
    huggingface_embeddings               # Pretrained HuggingFace embedding model
)

# Print confirmation of VectorStore creation
print("FAISS VectorStore successfully created.")

FAISS VectorStore successfully created.


## <span style="color:#4682B4">**7. Query Using Similarity Search**</span> <a id="query-similarity-search"></a>

In this section, we perform a similarity search on the vector store using a natural language query. The query retrieves the most relevant document chunks based on their embeddings.

### Steps:
1. **Perform Similarity Search**: Retrieve relevant documents based on the query.
2. **Initialise a Retriever**: Convert the vector store into a retriever for more advanced retrieval tasks.
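
Conceptually, a top-`k` similarity search ranks stored vectors by their distance to the query vector. The sketch below is a brute-force illustration in plain Python, assuming an exact index that compares the query against every stored vector by Euclidean distance; the vectors are made up, and `top_k` is a hypothetical helper, not a FAISS or LangChain function.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the `k` stored vectors closest to `query_vec`."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: squared_l2(query_vec, doc_vecs[i]),
    )
    return ranked[:k]


# Toy unit-length vectors standing in for chunk embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]
query_vec = [1.0, 0.0]

print("Closest chunk indices:", top_k(query_vec, doc_vecs, k=3))
```

For unit-length vectors, ranking by Euclidean distance gives the same order as ranking by cosine similarity, which is one reason the embeddings are normalized earlier.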

In [7]:
# Define a natural language query
query = "What is health insurance coverage?"  # User's query for retrieving relevant documents

# Perform a similarity search on the vector store
relevant_documents = vectorstore.similarity_search(query)

# Display the content of the most relevant document chunk
print("Most Relevant Document:\n", relevant_documents[0].page_content)

# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",             # Specify similarity-based retrieval
    search_kwargs={"k": 3}                # Retrieve the top 3 most similar documents
)

# Display the retriever object to verify setup
print("Retriever Object:\n", retriever)

Most Relevant Document:
 2 U.S. Census Bureau
WHAT IS HEALTH INSURANCE COVERAGE?
This brief presents state-level estimates of health insurance coverage 
using data from the American Community Survey (ACS). The  
U.S. Census Bureau conducts the ACS throughout the year; the 
survey asks respondents to report their coverage at the time of 
interview. The resulting measure of health insurance coverage, 
therefore, reflects an annual average of current comprehensive 
health insurance coverage status.* This uninsured rate measures a 
different concept than the measure based on the Current Population 
Survey Annual Social and Economic Supplement (CPS ASEC). 
For reporting purposes, the ACS broadly classifies health insurance 
coverage as private insurance or public insurance. The ACS defines 
private health insurance as a plan provided through an employer 
or a union, coverage purchased directly by an individual from an 
insurance company or through an exchange (such as healthcare.
Retriever 

## <span style="color:#4682B4">**8. Use HuggingFaceHub for Querying**</span> <a id="huggingfacehub-querying"></a>

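In this section, we use the `HuggingFaceHub` class to send a query to a pre-trained model hosted on Hugging Face. At this point the model receives only the raw question, with no retrieved context, so the response reflects what the model itself generates. Recent LangChain releases mark `HuggingFaceHub` as deprecated in favor of `HuggingFaceEndpoint`, which is imported above but not used here.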

- **`repo_id`**: The Hugging Face repository ID for the model to use (e.g., `gpt2`).
- **`model_kwargs`**: Additional arguments for model behavior, such as temperature and maximum response length.
- **`huggingfacehub_api_token`**: The API token for authenticating with Hugging Face's Hub.
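
The effect of the `temperature` setting can be illustrated with a small worked example. This sketch applies temperature scaling to made-up scores for three candidate tokens; it is not code from LangChain or the model, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

A low temperature concentrates almost all probability on the highest-scoring token, which is why small values make sampled output closer to deterministic.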

In [8]:
# Initialise HuggingFaceHub with the specified model and API token
hf = HuggingFaceHub(
    repo_id="gpt2",                                 # Hugging Face model repository ID (e.g., GPT-2)
    model_kwargs={
        "temperature": 0.1,                         # Sampling temperature for more deterministic responses
        "max_length": 500                           # Maximum length of the generated response
    },
    huggingfacehub_api_token=hf_api_token           # API token for authenticating with Hugging Face Hub
)

# Define the query for the model
query = "What is the health insurance coverage?"    # User's query to be processed by the Hugging Face model

# Invoke the Hugging Face model with the query
response = hf.invoke(query)

# Display the model's response
print("Model Response:\n", response)

Model Response:
 What is the health insurance coverage?

The Affordable Care Act (ACA) provides coverage for people with pre-existing conditions. The ACA also provides coverage for people with pre-existing conditions that are covered under Medicaid.

What is the cost of coverage?

The cost of coverage is determined by the individual's income, the cost of health insurance, and the cost of the coverage.

What is the cost of coverage for a family of four?

The cost of coverage for a family of four is determined by the individual's income, the cost of health insurance, and the cost of the coverage.

What is the cost of coverage for a family of four with a dependent child?

The cost of coverage for a family of four with a dependent child is determined by the individual's income, the cost of health insurance, and the cost of the coverage.

What is the cost of coverage for a family of four with a dependent child with a pre-existing condition?

The cost of coverage for a family of four with a 

## <span style="color:#4682B4">**9. Use HuggingFacePipeline for Querying**</span> <a id="huggingfacepipeline-querying"></a>

In this section, we use the `HuggingFacePipeline` class to run a pre-trained Hugging Face model locally through a `transformers` pipeline. The pipeline handles tasks such as text generation, so the model can respond to queries without calling a hosted endpoint.

- **`model_id`**: The identifier of the model to use (e.g., `gpt2`).
- **`task`**: Specifies the type of pipeline task (e.g., `text-generation`).
- **`pipeline_kwargs`**: Additional arguments for pipeline configuration, such as temperature and maximum tokens.

In [9]:
# Initialise the HuggingFacePipeline for text generation
hf = HuggingFacePipeline.from_model_id(
    model_id="gpt2",                           # Hugging Face model ID (e.g., GPT-2)
    task="text-generation",                    # Specify the task (e.g., text generation)
    pipeline_kwargs={
        "temperature": 0.1,                    # Sampling temperature; typically only applies when sampling is enabled (do_sample=True)
        "max_new_tokens": 100                  # Maximum number of new tokens to generate
    }
)

# Invoke the pipeline model with the query
response = hf.invoke(query)                    # Process the query using the initialised model

# Display the generated response
print("Model Response:\n", response)

Device set to use cpu


Model Response:
 What is the health insurance coverage?

The Affordable Care Act (ACA) requires that all Americans have health insurance. The ACA also requires that all Americans have health insurance.

What is the cost of health insurance?

The cost of health insurance is the cost of the cost of health insurance.

What is the cost of health insurance?

The cost of health insurance is the cost of the cost of health insurance.

What is the cost of health insurance?

The cost of health insurance is


## <span style="color:#4682B4">**10. Create a Prompt Template**</span> <a id="create-prompt-template"></a>

In this section, we define a `PromptTemplate` to structure the input for the language model. The template includes placeholders for the context and the user's question, and it instructs the model to answer from the provided context only. This is an instruction rather than a guarantee, so a small model such as GPT-2 may still drift from the context.

- **`template`**: The format of the prompt, including placeholders for input variables.
- **`input_variables`**: A list of placeholders (e.g., `context`, `question`) that the template will populate dynamically.
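
Filling the placeholders works much like Python's built-in string formatting. The sketch below uses `str.format` on the same template text to show the prompt the model would receive; the context sentence and question are illustrative stand-ins, not retrieved output.

```python
# Same template text as in this section, filled with plain str.format.
template = """
Use the following piece of context to answer the question asked.
Please try to provide the answer only based on the context.

{context}
Question: {question}

Helpful Answers:
"""

filled = template.format(
    context="From 2021 to 2022, uninsured rates decreased across 27 states.",
    question="In how many states did the uninsured rate decrease?",
)
print(filled)
```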

In [10]:
# Define the prompt template for interacting with the language model
prompt_template = """
Use the following piece of context to answer the question asked.
Please try to provide the answer only based on the context.

{context}
Question: {question}

Helpful Answers:
"""

# Initialise the PromptTemplate with the defined template
prompt = PromptTemplate(
    template=prompt_template,                       # The template string with placeholders
    input_variables=["context", "question"]         # List of placeholders in the template
)

# Display the initialised prompt to verify its structure
print("Initialized Prompt Template:\n", prompt)

Initialized Prompt Template:
 input_variables=['context', 'question'] input_types={} partial_variables={} template='\nUse the following piece of context to answer the question asked.\nPlease try to provide the answer only based on the context.\n\n{context}\nQuestion: {question}\n\nHelpful Answers:\n'


## <span style="color:#4682B4">**11. Create a Retrieval-Based Question-Answering Chain**</span> <a id="create-retrieval-qa"></a>

In this section, we create a `RetrievalQA` chain to combine a retriever and a language model for question answering. The chain retrieves relevant documents based on the query and uses the provided prompt to generate responses.

- **`llm`**: The language model used to generate answers.
- **`chain_type`**: Specifies the type of chain to use (e.g., `"stuff"` places all retrieved documents into a single prompt).
- **`retriever`**: The retriever object for fetching relevant documents.
- **`return_source_documents`**: If `True`, returns the documents used to generate the response.
- **`chain_type_kwargs`**: Additional parameters, such as the prompt template.
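
The `"stuff"` strategy can be pictured as plain string assembly. This is a sketch of the idea, not LangChain's implementation: `stuff_documents` is a hypothetical function, the chunk texts are placeholders, and the blank-line separator is an assumption.

```python
def stuff_documents(chunks, question, template, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Shortened template and placeholder chunks for illustration.
template = "{context}\nQuestion: {question}\n\nHelpful Answers:\n"
retrieved = ["chunk A text", "chunk B text", "chunk C text"]

final_prompt = stuff_documents(retrieved, "What changed?", template)
print(final_prompt)
```

Because every retrieved chunk goes into one prompt, the combined length of the chunks plus the question has to fit within the model's context window.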

In [11]:
# Create a RetrievalQA chain for question answering
retrievalQA = RetrievalQA.from_chain_type(
    llm=hf,                                 # The language model for generating answers
    chain_type="stuff",                     # Combine all retrieved documents into a single context
    retriever=retriever,                    # Retriever object for fetching relevant documents
    return_source_documents=True,           # Include the source documents in the response
    chain_type_kwargs={"prompt": prompt}    # Use the defined prompt template for formatting queries
)

# Display the RetrievalQA chain to verify setup
print("Initialized RetrievalQA Chain:\n", retrievalQA)

Initialized RetrievalQA Chain:
 verbose=False combine_documents_chain=StuffDocumentsChain(verbose=False, llm_chain=LLMChain(verbose=False, prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='\nUse the following piece of context to answer the question asked.\nPlease try to provide the answer only based on the context.\n\n{context}\nQuestion: {question}\n\nHelpful Answers:\n'), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x0000014469954340>, model_kwargs={}, pipeline_kwargs={'temperature': 0.1, 'max_new_tokens': 100}), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'), document_variable_name='context') return_source_documents=True retriever=VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorst

## <span style="color:#4682B4">**12. Execute the Question-Answering Query**</span> <a id="execute-query"></a>

In this section, we use the `RetrievalQA` chain to execute a query and retrieve context-aware responses. The query is processed using the retriever to fetch relevant documents, and the language model generates the final answer based on the prompt template.

### Steps:
1. Define the query.
2. Invoke the `RetrievalQA` chain with the query.
3. Display the result.
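
In the output recorded in this notebook, the result text begins with the prompt itself, because the text-generation pipeline returns the prompt followed by the continuation. A minimal sketch for keeping only the generated part follows; `extract_answer` is a hypothetical helper that assumes the `Helpful Answers:` marker from the prompt template, and the sample string is made up.

```python
def extract_answer(generated_text, marker="Helpful Answers:"):
    """Return the text after the first `marker`, or the whole text if it is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Made-up example of a result that echoes the prompt before the answer.
raw_result = (
    "Use the following piece of context to answer the question asked.\n\n"
    "some retrieved context\n"
    "Question: What changed?\n\n"
    "Helpful Answers:\n"
    "The uninsured rate decreased in 27 states."
)

print(extract_answer(raw_result))
```

In the notebook, this could be applied to `result['result']` after invoking the chain.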

In [12]:
# Define the query to retrieve information
query = """
What are the differences in
uninsured rate by state in 2022
"""

# Call the RetrievalQA chain with the query
result = retrievalQA.invoke({"query": query})

# Display the result generated by the QA chain
print("QA Chain Result:\n", result['result'])

QA Chain Result:
 
Use the following piece of context to answer the question asked.
Please try to provide the answer only based on the context.

percent (Appendix Table B-5). 
Medicaid coverage accounted 
for a portion of that difference. 
Medicaid coverage was 22.7 per-
cent in the group of states that 
expanded Medicaid eligibility and 
18.0 percent in the group of nonex-
pansion states.
CHANGES IN THE UNINSURED 
RATE BY STATE FROM 2021 
TO 2022
From 2021 to 2022, uninsured rates 
decreased across 27 states, while 
only Maine had an increase. The 
uninsured rate in Maine increased 
from 5.7 percent to 6.6 percent, 
although it remained below the 
national average. Maine’s uninsured 
rate was still below 8.0 percent, 
21 Douglas Conway and Breauna Branch, 
“Health Insurance Coverage Status and Type 
by Geography: 2019 and 2021,” 2022, <www.
census.gov/content/dam/Census/library/
publications/2022/acs/acsbr-013.pdf>.

10 U.S. Census Bureau
SUMMARY
The uninsured rate fell in 27 states 
