#### Let's load the text files

In [45]:
import os

# Set the folder path where your TXT files are stored
folder_path = r"C:\Users\User\Documents\10K_Filings"

# List all TXT files in the folder
txt_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]

# Read and store the content of each file
documents = {}

for file in txt_files:
    file_path = os.path.join(folder_path, file)
    with open(file_path, "r", encoding="utf-8") as f:
        documents[file] = f.read()

# Print the names of the loaded files to confirm
print("Loaded files:", txt_files)


Loaded files: ['AllState.txt', 'Chubb.txt', 'Progressive.txt', 'Travelers.txt']
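The loading cell above relies on `os.listdir`, which returns names in arbitrary order, and on strict UTF-8 decoding, which raises `UnicodeDecodeError` on the first invalid byte. Below is a minimal sketch of a more forgiving loader. The helper name `load_text_files` and the choice of `errors="replace"` are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

def load_text_files(folder, encoding="utf-8"):
    """Return {filename: text} for every .txt file directly inside `folder`."""
    documents = {}
    # sorted() gives a stable file order across runs and platforms
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
        documents[path.name] = path.read_text(encoding=encoding, errors="replace")
    return documents
```

Because the result is a dict keyed by filename, it could stand in for `documents`, with `list(documents)` playing the role of `txt_files`.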


In [46]:
# Print the first 1000 characters of one document
sample_file = txt_files[0]  # Pick the first file
print(documents[sample_file][:1000])  # Print first 1000 characters


<SEC-DOCUMENT>0000899051-25-000015.txt : 20250224
<SEC-HEADER>0000899051-25-000015.hdr.sgml : 20250224
<ACCEPTANCE-DATETIME>20250224122327
ACCESSION NUMBER:		0000899051-25-000015
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		263
CONFORMED PERIOD OF REPORT:	20241231
FILED AS OF DATE:		20250224
DATE AS OF CHANGE:		20250224

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			ALLSTATE CORP
		CENTRAL INDEX KEY:			0000899051
		STANDARD INDUSTRIAL CLASSIFICATION:	FIRE, MARINE & CASUALTY INSURANCE [6331]
		ORGANIZATION NAME:           	02 Finance
		IRS NUMBER:				363871531
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-11840
		FILM NUMBER:		25655406

	BUSINESS ADDRESS:	
		STREET 1:		3100 SANDERS ROAD
		CITY:			NORTHBROOK
		STATE:			IL
		ZIP:			60062
		BUSINESS PHONE:		8474025000

	MAIL ADDRESS:	
		STREET 1:		3100 SANDERS ROAD
		CITY:			NORTHBROOK
		STATE:			IL
		ZIP:			60062
</SEC-HEADER>
<DOCU

#### Let's extract only the relevant sections from each document and ignore unnecessary metadata.

In [47]:
import re

def extract_relevant_sections(text):
    """
    Extracts key sections from 10-K filings:
    - Business Overview
    - Risk Factors
    - Management Discussion & Analysis (MD&A)
    - Financial Statements
    """
    sections = {}

    # Define section titles we care about
    section_patterns = {
        "Business Overview": r"(?i)Item\s*1\.\s*Business(.*?)(?=Item\s*\d)",
        "Risk Factors": r"(?i)Item\s*1A\.\s*Risk Factors(.*?)(?=Item\s*\d)",
        "MD&A": r"(?i)Item\s*7\.\s*Management(?:’s|')? Discussion and Analysis(.*?)(?=Item\s*\d)",
        "Financial Statements": r"(?i)Item\s*8\.\s*Financial Statements(.*?)(?=Item\s*\d)"
    }

    # Extract each section
    for section, pattern in section_patterns.items():
        match = re.search(pattern, text, re.DOTALL)
        sections[section] = match.group(1).strip() if match else ""

    return sections

# Apply extraction to all documents
extracted_documents = {fname: extract_relevant_sections(content) for fname, content in documents.items()}

# Print a sample from the "Business Overview" section of one document
sample_file = txt_files[0]
print(f"Business Overview Section of {sample_file}:\n")
print(extracted_documents[sample_file]["Business Overview"][:500])  # Print first 500 characters


Business Overview Section of AllState.txt:

</span></div></div><div style="margin-bottom:3pt;text-align:center"><span style="color:#0033a0;font-family:'Allstate Sans',sans-serif;font-size:11pt;font-weight:700;line-height:120%">Part&#160;I</span></div><div id="i6bfbd41c4e984319aea05b97650e64a4_13"></div><div style="margin-bottom:6pt"><span style="color:#0033a0;font-family:'Allstate Sans',sans-serif;font-size:11pt;font-weight:700;line-height:120%">Item&#160;1.&#160;Business</span></div><div id="i6bfbd41c4e984319aea05b97650e64a4_16"></div><d


In [48]:
import re
from bs4 import BeautifulSoup  # Install this with `pip install beautifulsoup4`
import html

def clean_html(text):
    """
    Cleans extracted sections by:
    - Removing HTML tags
    - Decoding HTML entities (e.g., &#160; → space)
    - Removing excessive whitespace
    """
    soup = BeautifulSoup(text, "html.parser")  # Parse the HTML fragment
    text = soup.get_text(separator=" ")  # Drop tags; entities are decoded during parsing
    text = html.unescape(text)  # Decode any entities that were double-escaped in the source
    text = re.sub(r"\s+", " ", text).strip()  # Collapse whitespace, including non-breaking spaces
    return text

def extract_relevant_sections(text):
    """
    Extracts and cleans key sections from 10-K filings:
    - Business Overview
    - Risk Factors
    - Management Discussion & Analysis (MD&A)
    - Financial Statements
    """
    sections = {}

    # Define section titles we care about
    section_patterns = {
        "Business Overview": r"(?i)Item\s*1\.\s*Business(.*?)(?=Item\s*\d)",
        "Risk Factors": r"(?i)Item\s*1A\.\s*Risk Factors(.*?)(?=Item\s*\d)",
        "MD&A": r"(?i)Item\s*7\.\s*Management(?:’s|')? Discussion and Analysis(.*?)(?=Item\s*\d)",
        "Financial Statements": r"(?i)Item\s*8\.\s*Financial Statements(.*?)(?=Item\s*\d)"
    }

    # Extract and clean each section
    for section, pattern in section_patterns.items():
        match = re.search(pattern, text, re.DOTALL)
        raw_text = match.group(1).strip() if match else ""
        sections[section] = clean_html(raw_text)  # Apply HTML cleaning

    return sections

# Apply extraction & cleaning to all documents
extracted_documents = {fname: extract_relevant_sections(content) for fname, content in documents.items()}

# Print a sample from the "Business Overview" section of one document
sample_file = txt_files[0]
print(f"Cleaned Business Overview Section of {sample_file}:\n")
print(extracted_documents[sample_file]["Business Overview"][:500])  # Print first 500 characters


Cleaned Business Overview Section of AllState.txt:

Part I Item 1. Business The Allstate Corporation was incorporated under the laws of the State of Delaware on November 5, 1992, to serve as the holding company for Allstate Insurance Company. Its business is conducted principally through Allstate Insurance Company and other subsidiaries (collectively, including The Allstate Corporation, “Allstate”). The Allstate Corporation is one of the largest publicly held personal lines insurers in the United States. Allstate’s strategy is to increase market 
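The cleaned output still begins with "Part I Item 1. Business", which shows that the first regex match started before the real section heading. `re.search` returns only the first match, and in a 10-K that can be a table-of-contents entry or a cross-reference rather than the section itself. One common heuristic, sketched below, is to run the patterns on text that has already been cleaned, look at every match, and keep the longest capture. The helper name `extract_longest_section`, the stricter end pattern, and the sample text are assumptions for illustration.

```python
import re

def extract_longest_section(clean_text, start_pattern, end_pattern=r"Item\s*\d+[A-C]?\."):
    """
    Return the longest text found between a match of start_pattern and the
    next match of end_pattern (or the end of the text if there is none).
    """
    end_re = re.compile(end_pattern, re.IGNORECASE)
    best = ""
    for start in re.finditer(start_pattern, clean_text, flags=re.IGNORECASE):
        end = end_re.search(clean_text, start.end())
        stop = end.start() if end else len(clean_text)
        candidate = clean_text[start.end():stop].strip()
        if len(candidate) > len(best):
            best = candidate
    return best

# Tiny made-up example: a table-of-contents entry followed by the real section
sample = (
    "Item 1. Business 3 Item 1A. Risk Factors 20 "
    "Part I Item 1. Business Acme sells insurance. Item 1A. Risk Factors Weather."
)
print(extract_longest_section(sample, r"Item\s*1\.\s*Business"))
```

The end pattern requires a period after the item number, so a phrase such as "see Item 7 for details" does not cut the section short. A sentence that ends in "Item 7." still would, so the results are worth spot-checking per filing.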


#### Let's implement chunking for RAG

In [25]:
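def chunk_text(text, chunk_size=500, overlap=100):
    """
    Splits text into overlapping chunks.
    - chunk_size: Max words per chunk
    - overlap: Number of overlapping words between chunks
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # Later windows would only repeat words already in this chunk
    return chunks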

# Apply chunking to all extracted sections
chunked_documents = {}

for fname, sections in extracted_documents.items():
    chunked_sections = {}
    for section_name, content in sections.items():
        chunked_sections[section_name] = chunk_text(content)
    chunked_documents[fname] = chunked_sections

# Print sample chunk from Business Overview
sample_file = txt_files[0]
print(f"First chunk from Business Overview of {sample_file}:\n")
print(chunked_documents[sample_file]["Business Overview"][0])  # Print first chunk


First chunk from Business Overview of AllState.txt:

Part I Item 1. Business The Allstate Corporation was incorporated under the laws of the State of Delaware on November 5, 1992, to serve as the holding company for Allstate Insurance Company. Its business is conducted principally through Allstate Insurance Company and other subsidiaries (collectively, including The Allstate Corporation, “Allstate”). The Allstate Corporation is one of the largest publicly held personal lines insurers in the United States. Allstate’s strategy is to increase market share in personal property-liability and broaden protection offerings. The Allstate brand is widely known through the “You’re In Good Hands With Allstate ® ” slogan. Allstate at a Glance 208 million policies in force (“PIF”) 55,000 employees 3 rd largest personal property and casualty insurer in the United States (1) $72.61 billion investment portfolio (1) Based on 2023 statutory direct premiums written according to A.M. Best We empower custom
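To see how the window arithmetic works, it helps to look at word offsets alone. With the defaults, each window starts `chunk_size - overlap = 400` words after the previous one. The sketch below uses a hypothetical helper, `chunk_spans`, that returns only the `(start, end)` offsets and stops as soon as a window reaches the end of the text, so no trailing window consists solely of words that were already covered.

```python
def chunk_spans(n_words, chunk_size=500, overlap=100):
    """Return (start, end) word offsets of each chunk for a text of n_words words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    for start in range(0, n_words, step):
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:
            break  # any later window would only repeat words already covered
    return spans

print(chunk_spans(1200))  # → [(0, 500), (400, 900), (800, 1200)]
print(chunk_spans(900))   # → [(0, 500), (400, 900)]
```

In the 900-word case a third window starting at word 800 would hold only words 800 to 899, all of which are already in the second chunk, so it is skipped.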

#### Let's enable retrieval-augmented generation (RAG)

In [49]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import pickle

# Load a local embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # Small, efficient model

def get_embedding(text):
    """
    Generates an embedding vector for a given text chunk using a local model.
    """
    return embedding_model.encode(text, convert_to_numpy=True)

# Prepare data for FAISS storage
all_chunks = []
chunk_metadata = []
index = 0

for fname, sections in chunked_documents.items():
    for section_name, chunks in sections.items():
        for chunk in chunks:
            embedding = get_embedding(chunk)  # Convert chunk to embedding
            all_chunks.append(embedding)  # Store embedding
            chunk_metadata.append((fname, section_name, index, chunk))  # Store metadata
            index += 1

# Convert embeddings into a FAISS index
embedding_dim = len(all_chunks[0])  # Check embedding size
faiss_index = faiss.IndexFlatL2(embedding_dim)  # Create FAISS index
faiss_index.add(np.array(all_chunks, dtype="float32"))  # Store embeddings (FAISS expects a float32 matrix)

# Save FAISS index and metadata for retrieval
faiss.write_index(faiss_index, "10k_faiss_index")

with open("chunk_metadata.pkl", "wb") as f:
    pickle.dump(chunk_metadata, f)

print("✅ FAISS index and metadata saved successfully using a local model!")


✅ FAISS index and metadata saved successfully using a local model!
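The query cells that follow call `retrieve_relevant_chunks`, whose definition does not appear in this export. Below is a minimal sketch of what such a function might look like, assuming it embeds the query, asks the FAISS index for the nearest vectors, and maps the returned row numbers back to `chunk_metadata`. The name `search_chunks` and its parameters are hypothetical. The 3-tuple return shape is inferred from how `ask_cohere` unpacks its input.

```python
def search_chunks(query, embed, index, metadata, top_k=3):
    """
    Return up to top_k (filename, section_name, chunk_text) tuples for a query.
    - embed: callable mapping a list of strings to a (n, dim) float32 matrix
    - index: object with a FAISS-style search(vectors, k) -> (distances, ids)
    - metadata: list of (filename, section_name, position, chunk_text) tuples,
      stored in the same order as the vectors were added to the index
    """
    query_vectors = embed([query])  # one row per query string
    distances, ids = index.search(query_vectors, top_k)
    results = []
    for idx in ids[0]:
        if idx < 0:
            continue  # FAISS pads with -1 when fewer than top_k vectors exist
        fname, section_name, _, chunk = metadata[idx]
        results.append((fname, section_name, chunk))
    return results
```

In this notebook it could be wired to the objects created above, for example `retrieve_relevant_chunks = lambda q: search_chunks(q, embedding_model.encode, faiss_index, chunk_metadata)`. Passing the dependencies as arguments keeps the function easy to exercise with a stub index.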


In [50]:
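import os
import cohere

# Set up the Cohere client. Read the API key from an environment variable
# (the name COHERE_API_KEY is our choice) instead of hardcoding it in the notebook.
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])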

def ask_cohere(question, retrieved_chunks):
    """
    Uses Cohere's API to generate an answer based on retrieved chunks.
    """
    context = "\n\n".join([text for _, _, text in retrieved_chunks])
    prompt = f"Answer the following question based on the context:\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

    response = cohere_client.generate(
        model="command-light",  # ✅ Use 'command-light' instead of 'command-r'
        prompt=prompt,
        max_tokens=300,
        temperature=0.3
    )
    return response.generations[0].text.strip()

# Example Query
query = "What are Allstate's strategic initiatives?"
retrieved_chunks = retrieve_relevant_chunks(query)

# Get AI-generated Answer from Cohere
answer = ask_cohere(query, retrieved_chunks)

# Print Results
print("\n🔹 **AI Answer (Cohere):**\n")
print(answer)  # Print with better formatting




🔹 **AI Answer (Cohere):**

Allstate's strategic initiatives revolve around increasing market share in personal property-liability protection and broadening its product offerings. The company aims to be recognized through its well-known slogan “You’re in Good Hands With Allstate” and provides affordable, simple protection solutions to consumers. These strategic goals include the particular to its product offerings, expanding into new states and offering more diverse protection types. 

Additionally, the company is focused on certain segments, like auto telematics information, roadside assistance, protection plans, identity protection, and mobile data collection services. They also aim to create economic value for shareholders and improve community impact.
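`ask_cohere` joins the retrieved chunks into one block of context and discards the filename and section that come with each chunk. For questions that span several carriers, keeping those labels in the prompt may help the model attribute statements to the right company. Below is a minimal sketch of such a prompt builder. The name `build_prompt`, the label format, and the per-chunk word cap are assumptions for illustration.

```python
def build_prompt(question, retrieved_chunks, max_words_per_chunk=300):
    """
    Build a prompt in which every chunk is labelled with its source.
    retrieved_chunks is a list of (filename, section_name, chunk_text) tuples.
    """
    blocks = []
    for fname, section_name, text in retrieved_chunks:
        # Cap each chunk so that several chunks still fit in the model's context
        snippet = " ".join(text.split()[:max_words_per_chunk])
        blocks.append(f"[Source: {fname} | Section: {section_name}]\n{snippet}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "For each claim, name the source file it comes from.\n\n"
        f"Context:\n{context}\n\nQuestion: {question.strip()}\nAnswer:"
    )
```

The returned string could be passed as `prompt` in place of the f-string inside `ask_cohere`.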


In [51]:
# Example Query
query = "What are the causes driving the largest amount of losses across all carriers?"
retrieved_chunks = retrieve_relevant_chunks(query)

# Get AI-generated Answer from Cohere
answer = ask_cohere(query, retrieved_chunks)

print("\n🔹 **AI Answer (Cohere):**\n")
print(answer)  # Print with better formatting


🔹 **AI Answer (Cohere):**

The causes of the largest amount of losses across all carriers are the result of inherent unpredictability and complexity in predicting and managing the factors that lead to property and casualty claims. Some of these specific carriers may most be impacted by severe weather and other catastrophe events, including hurricanes, tornadoes, windstorms, floods, earthquakes, hailstorms, severe winter weather, and fires. Additionally, climate change may also be exacerbating these events and their impacts. 

These events are inherently unpredictable and can change in frequency, severity, duration, geographic location, and scope. They can have a significant impact on the profitability of our property and other businesses more than they affect the profitability of our other businesses. The extent of insured losses from a catastrophe event is a function of our total insured exposure in the area affected by the event, the nature, severity, and duration of the event, and 

In [52]:
# Example Query
query = "Compare Travelers’ strategic initiatives with Chubb’s."
retrieved_chunks = retrieve_relevant_chunks(query)

# Get AI-generated Answer from Cohere
answer = ask_cohere(query, retrieved_chunks)

print("\n🔹 **AI Answer (Cohere):**\n")
print(answer)  # Print with better formatting


🔹 **AI Answer (Cohere):**

Traveler's strategic initiatives focus heavily on expanding its insurance services, particularly in the auto industry. These initiatives aim to create innovative products and solutions, aiming to provide clients with superior protection benefits. In contrast, Chub's strategic initiatives are primarily centered around auto insurance and are aimed at delivering improved products and services through innovative means. 

The two companies have different approaches, with Traveler focusing on expanded coverage options and innovative products, while Chubb emphasizes the creation of unique solutions that cater to the needs of individuals and businesses. Both companies are working to achieve success in their respective market segments through their respective offerings.

The strategic initiatives of both companies are designed to provide customers with excellent protection services and innovative solutions. They are unique selling points and comprehensive coverage op