# 📘 Speech Analyzer: Data Ingestion and Vector Store Preparation

---

## 📌 Introduction

This notebook documents the initial phase of the **Speech Analyzer** project. The goal is to collect speech data from the web, preprocess it, and prepare it for efficient semantic search using vector embeddings. We do this by scraping speeches from a web page, splitting them into manageable chunks, and storing their vector representations in a **FAISS** index.

---

## 🔹 1. Web Scraping and Data Storage

In this section, we use **BeautifulSoup (bs4)** to scrape speech text from a web page. The scraped content is cleaned and structured, then stored as **JSON** for easy downstream processing.

**Steps:**
- Use `requests` to fetch the webpage content  
- Parse HTML using `BeautifulSoup`  
- Extract meaningful speech content  
- Save the data as JSON in `data/speeches.json`  
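
As a rough illustration of the last step, the sketch below writes a list of speech records to JSON and reads it back. The field names (`title`, `text`, `background`) match those used later in this notebook; the sample record and the temporary file path are made up for the example.

```python
import json
import os
import tempfile

# Hypothetical sample record; the real notebook fills this list from the scraped page.
records = [
    {
        "title": "Sample Speech",
        "text": "\u201cWe choose to try.\u201d",  # curly quotes, common in scraped text
        "background": "",  # empty string when no background paragraph was found
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "speeches.json")

    # indent=2 only affects readability. ensure_ascii defaults to True,
    # so non-ASCII characters are written as \uXXXX escapes.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

    with open(path, encoding="utf-8") as f:
        raw = f.read()

    loaded = json.loads(raw)

print(loaded == records)
```

The escaped characters are restored on load, so the round trip preserves the original text.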

---

## 🔹 2. Data Splitting Using RecursiveCharacterTextSplitter

To keep related text together while chunking, we use **RecursiveCharacterTextSplitter** from LangChain. It tries the largest separator first (paragraph breaks, then line breaks, then spaces) and produces chunks, optionally overlapping, that are small enough to embed.

**Key Features:**
- Prefers paragraph and line boundaries over mid-word cuts where possible  
- Handles large documents efficiently  
- Outputs manageable text chunks for embeddings  

---

## 🔹 3. Embedding Creation and FAISS Vector Database Ingestion

We convert each text chunk into a high-dimensional vector using a pre-trained **embedding model** served locally through Ollama (`OllamaEmbeddings`). These vectors are then stored in a **FAISS (Facebook AI Similarity Search)** vector database for efficient similarity search.

**Tasks:**
- Initialize the embedding model  
- Transform text chunks to embeddings  
- Push the vectors into FAISS  
- Save the FAISS index for later use  

---


## 🧾 Imports and Setup

In this section, we import all the necessary libraries and modules required for the workflow of the Speech Analyzer project:

- **`requests`** and **`BeautifulSoup`**: For scraping and parsing HTML content from the web.
- **`json`** and **`os`**: For handling JSON file operations and managing file paths.
- **`RecursiveCharacterTextSplitter`**: From LangChain, used to split long texts into chunks along paragraph, line, and word boundaries.
- **`FAISS`**: A high-performance vector store from the LangChain community module, used for storing and querying vector embeddings.
- **`OllamaEmbeddings`**: LangChain's wrapper around a locally running Ollama model, used to convert text into vector representations.
- **`Document`**: A utility class from LangChain used to create standardized document objects for downstream processing.


In [2]:
import requests
from bs4 import BeautifulSoup
import json
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.docstore.document import Document


In [3]:
os.makedirs("data", exist_ok=True)
os.makedirs("vectorstore", exist_ok=True)


## 🌐 Web Scraping Famous Persuasive Speeches

In this section, we scrape persuasive speech content from the website [HighSpark](https://highspark.co/famous-persuasive-speeches/) using `requests` and `BeautifulSoup`. The data is then structured and stored in a JSON file for further processing.

### 🔄 Steps Involved:

1. **Fetch the Webpage Content**
   - The URL of the page containing famous persuasive speeches is stored in the variable `url`.
   - The `requests.get()` function is used to retrieve the HTML content of the page.
   - The HTML is parsed using `BeautifulSoup` with the `html.parser`.

2. **Extracting Speech Data**
   - The loop `for h2 in soup.find_all("h2")` iterates over all `<h2>` tags, which generally hold the speech titles.
   - For each `<h2>`, the following elements are extracted:
     - **Title**: The text inside the `<h2>` tag.
     - **Speech Content**: The next `<blockquote>` tag after the `<h2>` is assumed to contain the actual speech. Because `find_next` does not stop at the next heading, an `<h2>` without its own quote would pick up the quote of a later section.
     - **Background Information**: The first `<p>` returned by `blockquote.find_next("p")` is taken as background or context for the speech. `find_next` searches forward in document order, so a `<p>` nested inside the `<blockquote>` would be matched before one that follows it.
   - Whenever a `<blockquote>` is found, a dictionary with `title`, `text`, and `background` is appended to the `speeches` list. `background` is an empty string when no paragraph is found.

3. **Storing the Data**
   - The `speeches` list is saved in JSON format to a file named `speeches.json` inside a `data/` directory.
   - The JSON is formatted with indentation for better readability using `json.dump(..., indent=2)`.

4. **Logging**
   - A simple print statement displays how many speeches were successfully scraped and saved.

> ✅ After this step, we have a clean and structured dataset of famous persuasive speeches ready for chunking and embedding.
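
The extraction logic can be tried offline on a small inline HTML fragment before running it against the live page. The sketch below assumes `beautifulsoup4` is installed; the markup and speech titles are invented for illustration and are not taken from the HighSpark page, whose real structure may differ.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the assumed heading / quote / paragraph layout.
SAMPLE_HTML = """
<h2>1. First Speech</h2>
<blockquote>First quote.</blockquote>
<p>First background.</p>
<h2>2. Second Speech</h2>
<blockquote>Second quote.</blockquote>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_speeches = []

for h2 in sample_soup.find_all("h2"):
    blockquote = h2.find_next("blockquote")
    if blockquote is None:
        continue  # no quote anywhere after this heading

    # First <p> after the start of the quote, in document order.
    background_p = blockquote.find_next("p")
    sample_speeches.append({
        "title": h2.get_text(strip=True),
        "text": blockquote.get_text(strip=True),
        "background": background_p.get_text(strip=True) if background_p else "",
    })

for record in sample_speeches:
    print(record)
```

The second record ends up with an empty `background` because no `<p>` follows its quote in the sample markup.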


In [6]:
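url = "https://highspark.co/famous-persuasive-speeches/"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.content, "html.parser")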

speeches = []

for h2 in soup.find_all("h2"):
    title = h2.get_text(strip=True)
    blockquote = h2.find_next("blockquote")
    background_p = blockquote.find_next("p") if blockquote else None

    if blockquote:
        speech_text = blockquote.get_text(strip=True)
        background = background_p.get_text(strip=True) if background_p else ""
        speeches.append({
            "title": title,
            "text": speech_text,
            "background": background
        })

with open("data/speeches.json", "w") as f:
    json.dump(speeches, f, indent=2)

print(f"Saved {len(speeches)} speeches.")


Saved 40 speeches.


## ✂️ Splitting Speech Data into Chunks with Metadata

After scraping and saving the speeches, we now proceed to **load the JSON data** and **split the speech content into smaller chunks** using LangChain’s `RecursiveCharacterTextSplitter`. These chunks are then wrapped into `Document` objects along with relevant metadata.

### 📂 Step-by-Step Breakdown:

1. **Load the JSON Data**
   - The file `data/speeches.json` is opened and read using `json.load()`.
   - The result is a list of speech dictionaries, each containing `title`, `text`, and `background`.

2. **Initialize Text Splitter**
   - We use `RecursiveCharacterTextSplitter` with the following parameters:
     - `chunk_size=500`: Each chunk will contain up to 500 characters.
     - `chunk_overlap=100`: Chunks will overlap by 100 characters to maintain context continuity.
   - The splitter prefers to break at paragraph breaks, line breaks, and spaces, in that order. It has no notion of sentences, so a chunk can still begin or end mid-sentence.

3. **Prepare Chunks with Metadata**
   - For each speech in the list:
     - Combine the speech body and background information into one string (`full_text`).
     - Use the `splitter` to divide the `full_text` into smaller segments.
     - For each resulting chunk:
       - Create a `Document` object with:
         - `page_content`: the chunked text.
         - `metadata`: a dictionary containing the `title` of the speech.
     - All `Document` objects are stored in the `documents` list.

This prepares the dataset in a format that is compatible with vector embedding models and vector databases like FAISS, while preserving context and metadata for traceability.

> ✅ Result: A list of text chunks with associated metadata, ready to be embedded and stored in a vector database.
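
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch that cuts fixed-width character windows. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` also looks for separators before cutting, so its chunks are often shorter than the limit. The function name `sliding_chunks` and the sample text are hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 1200 characters of filler text, purely for illustration.
sample_text = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = sliding_chunks(sample_text, chunk_size=500, chunk_overlap=100)

print([len(c) for c in chunks])  # [500, 500, 400]
print(chunks[0][-100:] == chunks[1][:100])  # True: neighbours share 100 characters
```

With a size of 500 and an overlap of 100, each new window starts 400 characters after the previous one, which is why the tail of one chunk reappears at the head of the next.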


In [7]:
with open("data/speeches.json") as f:
    speeches = json.load(f)

documents = []
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

for speech in speeches:
    full_text = f"{speech['text']}\n\nBackground:\n{speech['background']}"
    chunks = splitter.split_text(full_text)
    for chunk in chunks:
        documents.append(Document(page_content=chunk, metadata={"title": speech["title"]}))


## 🧠 Creating and Saving FAISS Vector Store using Ollama Mistral Embeddings

In this final step, we embed the chunked documents and store them in a **FAISS vector database** for efficient similarity search. We use the `mistral` model served by **Ollama** to generate a high-dimensional vector for each chunk.

### 🛠️ Step-by-Step Explanation:

1. **Initialize Embedding Model**
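   - We use `OllamaEmbeddings` with the model name `"mistral"`. Mistral is a general-purpose language model rather than a dedicated embedding model, so retrieval quality may differ from that of a purpose-built embedding model.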
   - This model runs **locally**, so it may consume significant compute resources and take time to generate embeddings depending on your system.

2. **Create FAISS Vector Store**
   - We build the FAISS index using `FAISS.from_documents()`, which:
     - Passes each chunked `Document` through the embedding model.
     - Stores the resulting vectors in an efficient FAISS index structure.

3. **Save FAISS Index Locally**
   - The index is saved to a local directory `vectorstore/faiss_index` using `db.save_local()` so it can be reloaded later without recomputing embeddings.

4. **Print Confirmation**
   - A simple message is printed once the process completes.

> ⚠️ **Note**: This cell may take a significant amount of time to run, especially if the Mistral model is being initialized for the first time or if many documents are being processed. Ensure sufficient CPU/GPU resources are available when running this step.

> ✅ After this step, we have a persistent, searchable vector store containing all the embedded chunks of the speech data.
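
Conceptually, a similarity search compares the query vector against the stored vectors and returns the closest ones. The brute-force sketch below uses made-up 3-dimensional vectors and squared Euclidean (L2) distance. It illustrates what an exact L2 search computes and is not how this notebook calls FAISS; the names `squared_l2`, `top_k`, and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k labels with the smallest distance to the query."""
    scored = [(label, squared_l2(query, vec)) for label, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.5, 0.5, 0.0],
}
toy_query = [1.0, 0.0, 0.0]

ranking = top_k(toy_query, toy_vectors, k=2)
print(ranking)  # [('chunk_a', 0.0), ('chunk_c', 0.5)]
```

With a distance metric like this, a lower score means a closer match, and the absolute size of the score depends on the scale and dimensionality of the embedding vectors.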


In [8]:
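# FAISS and OllamaEmbeddings are already imported from langchain_community in the setup cell.
# Re-importing them from the older langchain.vectorstores / langchain.embeddings paths is
# unnecessary, and those paths are deprecated in recent LangChain releases.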

# Initialize Ollama Mistral embedding
embedding = OllamaEmbeddings(model="mistral")

# Build FAISS vector store from documents
db = FAISS.from_documents(documents, embedding)

# Save vector store locally
db.save_local("vectorstore/faiss_index")

print("Vector store created and saved successfully.")



Vector store created and saved successfully.


In [12]:
query = "What did Queen Elizabeth I say about her role in the war against Spain?"
# With LangChain's default FAISS index the score is an L2 distance: lower means more similar.
results = db.similarity_search_with_score(query, k=3)

for doc, score in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Content: {doc.page_content}\n")
    print(f"Score: {score}\n")


Title: 30. Black Power Address at UC Berkeley by Stokely Carmichael
Content: to sanction Black Power. We’re tired waiting; every time black people move in this country, they’re forced to defend their position before they move. It’s time that the people who are supposed to be defending their position do that. That’s white people. They ought to start defending themselves as to why they have oppressed and exploited us.”

Score: 133587.265625

Title: 35. Questioning the Universe by Stephen Hawking
Content: show that we have made remarkable progress in the last hundred years. But if we want to continue beyond the next hundred years, our future is in space. That is why I am in favor of manned — or should I say, personned — space flight.”

Score: 135536.296875

Title: 21. June 9 Speech to Martial Law Units by Deng Xiaoping
Content: Perhaps this bad thing will enable us to go ahead with reform and the open policy at a steadier and better — even a faster — pace, more speedily correct our mistak