# 1. This Jupyter Notebook covers the full life cycle of the first phase of the project: the ETL process and "memory building"

## 1.1 Scrape and extract textual content

In this step, we will extract the needed data from the "Witch Cult Translations" site.

Because every arc is divided into n chapters, we need to loop over the chapter links on the table-of-contents page and extract the text of every chapter.

In [None]:
# Import the needed libraries for the step

import requests
from bs4 import BeautifulSoup
import time

# Define the object of BeautifulSoup
URL = "https://witchculttranslation.com/table-of-content/"
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")

# Define the "route" of where the table of contents is saved on the main page

principal_container = soup.find("div", class_="entry-content")

# Define the "route" where the links of every chapter are saved

chapters_links = principal_container.find_all("a")

# Extract all the URLs found

chapters_urls = []  # Stores the URLs of the chapters

for link in chapters_links:

    chapter_link = link.get('href')

    chapters_urls.append(chapter_link)

## Optimized version of the code above
## chapters_urls = [link['href'] for link in chapters_links]

# The URLs of the chapters follow this pattern (at least in the first arc):
# https://witchculttranslation.com/yyyy/mm/dd/arc-n-chapter-n-title/
# So it is a good idea to filter the extracted URLs by the string "arc-1" to avoid all the "unnecessary" URLs.

cleaned_chapters_urls = []

for url in chapters_urls:
    # If we want to extract the URLs of all the chapters of all arcs,
    # we should use just "arc" instead of "arc-1"
    if "arc-1" in url:
        cleaned_chapters_urls.append(url)
    else:
        pass

# Loop

print(f"Starting download of {len(cleaned_chapters_urls)} chapters...")

for url in cleaned_chapters_urls:

    try:
        # Add a timer to avoid a ban from the server
        time.sleep(1)

        # Download the page
        headers = {'User-Agent': 'Mozilla/5.0'}
        page = requests.get(url, headers=headers)

        # Parse the HTML
        soup_parser = BeautifulSoup(page.content, "html.parser")

        # Find the text container
        text_container = soup_parser.find("div", class_="entry-content")

        # Extract the text
        if text_container:
            chapter_text = text_container.get_text(separator="\n\n", strip=True)

            # Save the text of the files
            import re

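            # Match only the chapters of the first arc, whose URLs follow
            # the "arc-1-chapter-N" naming pattern.

            match = re.search(r'arc-1-chapter-\d+', url)

            # Skip URLs that contain "arc-1" but do not follow the chapter pattern;
            # otherwise match.group(0) would fail on None.
            if match is None:
                print(f"Skipping {url}: it does not match the chapter pattern.")
                continue

            filename = f"{match.group(0)}.txt"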

            # To extract more data to feed the model in future phases of the project,
            # we can use the following code to get the arc and chapter from the links:
            # arc_match = re.search(r'arc-\d+', url)
            # chapter_match = re.search(r'chapter-\d+', url)
            # filename = f"{arc_match.group(0)}-{chapter_match.group(0)}.txt"

            # Import os to save the data in a specific folder
            import os

            folder = r"your_path"

            full_path = os.path.join(folder, filename)

            with open(full_path, "w", encoding="utf-8") as file:
                file.write(chapter_text)
            print(f"File {filename} saved correctly!")

        else:
            print(f"Text not found in {url}. Review your selector.")
    
    except Exception as e:
        print(f"Error downloading {url}: {e}")

print("Download completed!")

Starting download of 23 chapters...
File arc-1-chapter-1.txt saved correctly!
File arc-1-chapter-2.txt saved correctly!
File arc-1-chapter-3.txt saved correctly!
File arc-1-chapter-4.txt saved correctly!
File arc-1-chapter-5.txt saved correctly!
File arc-1-chapter-6.txt saved correctly!
File arc-1-chapter-7.txt saved correctly!
File arc-1-chapter-8.txt saved correctly!
File arc-1-chapter-9.txt saved correctly!
File arc-1-chapter-10.txt saved correctly!
File arc-1-chapter-11.txt saved correctly!
File arc-1-chapter-12.txt saved correctly!
File arc-1-chapter-13.txt saved correctly!
File arc-1-chapter-14.txt saved correctly!
File arc-1-chapter-15.txt saved correctly!
File arc-1-chapter-16.txt saved correctly!
File arc-1-chapter-17.txt saved correctly!
File arc-1-chapter-18.txt saved correctly!
File arc-1-chapter-19.txt saved correctly!
File arc-1-chapter-20.txt saved correctly!
File arc-1-chapter-21.txt saved correctly!
File arc-1-chapter-22.txt saved correctly!
Errror downloading https://
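
The filter `"arc-1" in url` is looser than the regular expression used to build the file name, so a URL can pass the filter and still fail the pattern. That is one possible cause of an error like the one above, although the truncated output does not confirm it. The following is a minimal sketch of a naming helper that returns `None` instead of raising; the helper name `chapter_filename` and the sample URLs are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical helper: derive a file name from a chapter URL,
# or return None when the URL does not follow the chapter pattern.
CHAPTER_PATTERN = re.compile(r"arc-(\d+)-chapter-(\d+)")


def chapter_filename(url):
    match = CHAPTER_PATTERN.search(url)
    if match is None:
        return None
    arc, chapter = match.groups()
    return f"arc-{arc}-chapter-{chapter}.txt"


# Illustrative URLs only; the real links come from the table of contents.
sample_urls = [
    "https://witchculttranslation.com/yyyy/mm/dd/arc-1-chapter-1-title/",
    "https://witchculttranslation.com/arc-1/",
]

for url in sample_urls:
    print(url, "->", chapter_filename(url))
```

Keeping the naming logic in a small function makes it possible to check the URL list before any download starts.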

## 1.2 Clean and segment the text into episodes/scenes (chunking)

Because context is important for RAG architectures, cleaning (noise removal) methods such as stopword removal, case normalization, stemming, or lemmatization will not be used in the current phase of the project.

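Usually, a "scene" in the novel has about 500 words. Note that the splitter measures chunk_size in characters (length_function = len), not in words, so a size of 1000 captures about 6 paragraphs (in the chunking example, three paragraphs take 416 characters), which is a fragment of a scene rather than 2 full scenes. The overlap of 200 characters preserves some shared context between consecutive chunks for memory building.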
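
To make the roles of the chunk size and the overlap concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification for illustration only: RecursiveCharacterTextSplitter also tries to split on separators such as paragraph breaks, which this sketch does not do. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts chunk_size - chunk_overlap characters later
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Synthetic text of 2500 characters, used only to show the arithmetic
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)

print([len(c) for c in chunks])  # [1000, 1000, 900]
# The last 200 characters of one chunk open the next one
print(chunks[0][-200:] == chunks[1][:200])  # True
```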

Because we want our RAG architecture to answer in which part of the novel a scene happened, we will add metadata to label the chunks with the arc and chapter they belong to.
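
The labeling idea can be sketched as a small function that turns a file name into a metadata dictionary. This is an assumption about how the labels could look: the function name `chapter_metadata` and the keys "arc" and "chapter" are hypothetical, and the file names are assumed to follow the "arc-N-chapter-N.txt" pattern used when saving the chapters.

```python
import re


def chapter_metadata(filename):
    """Build chunk metadata from a file name such as 'arc-1-chapter-10.txt'."""
    match = re.fullmatch(r"arc-(\d+)-chapter-(\d+)\.txt", filename)
    if match is None:
        # Fall back to the file name alone when the pattern is not followed
        return {"source": filename}
    return {
        "source": filename,
        "arc": int(match.group(1)),
        "chapter": int(match.group(2)),
    }


print(chapter_metadata("arc-1-chapter-10.txt"))
```

Storing the arc and chapter as integers, instead of only the file name, would make it possible to sort or filter retrieved chunks by their position in the novel.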

In [4]:
## TEST OF CHUNKING

# The text below is an excerpt from arc 1, chapter 11 (the first three paragraphs)

text = """There were drops of blood running down the glass’s sharp shards, and following them upward one would find Rom’s throat.

He had lost his arm and his throat was ripped apart which caused a large amount of foamy blood to pour out of his mouth, after which the light left his gray eyes as he collapsed to the ground.

His twitching body had already lost its vitality, and there was no doubt that it no longer held life."""

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the splitter object

splitter = RecursiveCharacterTextSplitter(chunk_size = 1000,
                                          chunk_overlap = 200,
                                          length_function = len,
                                          is_separator_regex=False)

# Create the chunks

chunks = splitter.split_text(text)

print(f"Number of chunks: {len(chunks)}")

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print(f"Length: {len(chunk)}\n")

Number of chunks: 1
--- Chunk 0 ---
There were drops of blood running down the glass’s sharp shards, and following them upward one would find Rom’s throat.

He had lost his arm and his throat was ripped apart which caused a large amount of foamy blood to pour out of his mouth, after which the light left his gray eyes as he collapsed to the ground.

His twitching body had already lost its vitality, and there was no doubt that it no longer held life.
Length: 416



In [5]:
# Chunk and label (transform) the data extracted into the txt files

from langchain.docstore.document import Document
import os

# Hold the chunked and labeled data

all_chunks = []

# Define the path where the documents were downloaded

chapters_path = r"C:\Users\lonel\OneDrive\Escritorio\Re Zero NLP Project\chapters_files"

chapters = os.listdir(chapters_path)

# Loop through every file and chunk and label the data

for chapter in chapters:
    if chapter.endswith(".txt"):
        chapter_path = os.path.join(chapters_path, chapter)

        print(f"{chapter} is about to be transformed!")

        # Open the files

        with open(chapter_path, "r", encoding="utf-8") as f:
            chapter_text = f.readlines()

            # Drop the first 27 lines because they contain data that is not useful for the RAG system
            useful_text = chapter_text[27:]

            # Join the text to use the splitter

            chapter_text_joined = "".join(useful_text)

            print(f"{chapter} correctly cleaned!")

        # Split the text

        chunks = splitter.split_text(chapter_text_joined)

        # Add the labeled data to our predefined list

        for chunk in chunks:
            labeled_chunk = Document(page_content=chunk,
                                     metadata = {"source": chapter})
            all_chunks.append(labeled_chunk)

            print(f"{chapter} correctly chunked!")

print(f"The data of the txt files was correctly transformed! There are {len(all_chunks)} chunks.")

print("all_chunks output example:")
print(all_chunks[0]) # First chunk

arc-1-chapter-1.txt is about to be transformed!
arc-1-chapter-1.txt correctly cleaned!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-1.txt correctly chunked!
arc-1-chapter-10.txt is about to be transformed!
arc-1-chapter-10.txt correctly cleaned!
arc-1-chapter-10.txt correctly chunked!
arc-1-chapter-10.txt correctly chunked!
arc-1-chapter-10.txt correctly chunked!
arc-1-chapter-10.txt correctly chunked!
arc-1-chapter-10.txt correctly chunked!
arc-1-chapter-10.txt correctly chunked!
arc-1-chapter-10.txt correctly chunked!
arc-1-chapter-10.txt correctly chunked!
arc-1-chapter-10.txt correctly chunk

## 1.3 Generate embeddings for scenes or episodes and store them in a vector database

For this step, we will use the libraries sentence-transformers (provides the model for embedding generation), langchain-huggingface (integrates Hugging Face models into LangChain pipelines), chromadb (stores the embeddings), and langchain-chroma (connects LangChain to the Chroma database so we can store and retrieve data), because we will "create" the "brain" of our RAG system locally.

In short, sentence-transformers and langchain-huggingface will create the embeddings of the chunks, while chromadb and langchain-chroma will store the vectors in a local database.

For this part, it is important to clarify that embeddings are numerical "representations" (like coordinates) of words, sentences (as in this project), or documents that capture the semantic meaning of those texts. For example, sentences about "love" will be close to each other in the vector space generated by the embeddings. Generating an embedding requires a model that can encode the text and capture its semantic meaning.

With that in mind, vectors are the arrays of numbers that represent the embeddings.

The database built from these embeddings is key: when we ask a question like "What did Subaru feel about Rem?", the question is embedded with the same model and the database finds the vectors closest to it, retrieving the relevant chunks.
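
The idea of "closest vectors" can be illustrated with a minimal sketch that ranks toy vectors by cosine similarity. Everything here is an assumption made for illustration: the three-dimensional vectors and chunk names are invented (real embeddings from all-MiniLM-L6-v2 have 384 dimensions), and cosine similarity is only one common measure; the measure a vector database actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored_vectors, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy vectors, invented for this example
stored = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

print(top_k(query, stored))  # ['chunk_a', 'chunk_b']
```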

In [6]:
# Test whether the needed packages are working

from langchain_huggingface import HuggingFaceEmbeddings

# Initialization of the Embedding Model

embedding_model = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")

test_vector = embedding_model.embed_query(text)

print(f"Vector len: {len(test_vector)}")
print(f"First five numbers: {test_vector[:5]}")

'(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 934b25ce-e1bc-48c8-9bcc-ca60dd73297f)')' thrown while requesting HEAD https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/./modules.json
Retrying in 1s [Retry 1/5].


Vector len: 384
First five numbers: [-0.0564187653362751, 0.05009940266609192, 0.0574469193816185, -0.019749840721488, 0.027888907119631767]


In [7]:
from langchain_chroma import Chroma

# Define the directory where the database will be saved

database_directory = r"C:\Users\lonel\OneDrive\Escritorio\Re Zero NLP Project\vector_database"

# Create the database

vector_database = Chroma.from_documents(documents = all_chunks,
                                        embedding = embedding_model,
                                        persist_directory = database_directory)

print("Vector Database correctly created!")

Vector Database correctly created!
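
Once the database exists, a question can be matched against the stored chunks. The following is a minimal sketch, not part of the original notebook: the helper name `retrieve` and the 200-character preview are hypothetical, and the helper only assumes that the store exposes the `similarity_search(query, k=...)` method that LangChain vector stores provide and that each result has `page_content` and `metadata`.

```python
def retrieve(vector_store, question, k=3):
    """Return (source, text preview) pairs for the k chunks closest to the question."""
    # similarity_search embeds the question and returns the k nearest documents
    hits = vector_store.similarity_search(question, k=k)
    return [(hit.metadata.get("source"), hit.page_content[:200]) for hit in hits]
```

With the objects created above, a call such as `retrieve(vector_database, "What did Subaru feel about Rem?")` would list the source file and the beginning of each retrieved chunk; the actual results depend on the stored data.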
