# The current Jupyter Notebook will cover the full life cylce of the First Phase of the project: ETL process and "memory building"

## Scrape and extract textual content

In this step we will extract the needed data from the "Witch Cult Translations" site.

Because every ARC is divided into n characters, it is necessary to loop the main page to extract the text of every chapter.

In [6]:
# Import the needed libraries for the step

import requests
from bs4 import BeautifulSoup
import time

# Define the object of BeautifulSoup
URL = "https://witchculttranslation.com/table-of-content/"
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")

# Define the "route" of where the table of contents is saved on the main page

principal_container = soup.find("div", class_="entry-content")

# Define the "route" where the links of every chapter are saved

chapters_links = principal_container.find_all("a")

# Extract all the URLs found

chapters_urls = [] # Use to save the URLs of the chapters

for link in chapters_links:

    chapter_link = link.get('href')

    chapters_urls.append(chapter_link)

## Optimized version of the code above
## chapters_urls = [link['href'] for link in chapters_links]

# The urls of the chapters follows the next "pattern": https://witchculttranslation.com/aaaa/mm/dd/arc-n-chapter-n-title/
# So it is a good idea to filter the extracted ULRs by the word "arc" so we avoid all the "unnecessary" URLs.

cleaned_chapters_urls = []

for url in chapters_urls:
    if "arc" in url:
        cleaned_chapters_urls.append(url)
    else:
        pass

# Loop

print(f"Starting download of {len(cleaned_chapters_urls)} chapters...")

for url in cleaned_chapters_urls:

    try:
        # Add a timer to avoid a ban from the server
        time.sleep(1)

        # Download the page
        headers = {'User-Agent': 'Mozilla/5.0'}
        page = requests.get(url, headers=headers)

        # Parse the HTML
        soup_parser = BeautifulSoup(page.content, "html.parser")

        # Find the text container
        text_container = soup_parser.find("div", class_="entry-content")

        # Extract the text
        if text_container:
            chapter_text = text_container.get_text(separator="\n\n", strip=True)

            # Save the text of the files
            import re

            match = re.search(r'arc-\d+-chapter-\d+', url)
            filename = f"{match.group(0)}.txt"

            # Import os to save the data in a specific folder
            import os

            folder = r"C:\Users\lonel\OneDrive\Escritorio\Re Zero NLP Project\chapters_files"

            full_path = os.path.join(folder, filename)

            with open(full_path, "w", encoding="utf-8") as file:
                file.write(chapter_text)
            print(f"File {filename} saved correctly!")

        else:
            print(f"Text not found in {url}. Review your selector.")
    
    except Exception as e:
        print(f"Errror downloading {url}: {e}")

print("Download completed!")

Starting download of 476 chapters...
File arc-1-chapter-1.txt saved correctly!
File arc-1-chapter-2.txt saved correctly!
File arc-1-chapter-3.txt saved correctly!
File arc-1-chapter-4.txt saved correctly!
File arc-1-chapter-5.txt saved correctly!
File arc-1-chapter-6.txt saved correctly!
File arc-1-chapter-7.txt saved correctly!
File arc-1-chapter-8.txt saved correctly!
File arc-1-chapter-9.txt saved correctly!
File arc-1-chapter-10.txt saved correctly!
File arc-1-chapter-11.txt saved correctly!
File arc-1-chapter-12.txt saved correctly!
File arc-1-chapter-13.txt saved correctly!
File arc-1-chapter-14.txt saved correctly!
File arc-1-chapter-15.txt saved correctly!
File arc-1-chapter-16.txt saved correctly!
File arc-1-chapter-17.txt saved correctly!
File arc-1-chapter-18.txt saved correctly!
File arc-1-chapter-19.txt saved correctly!
File arc-1-chapter-20.txt saved correctly!
File arc-1-chapter-21.txt saved correctly!
File arc-1-chapter-22.txt saved correctly!
Errror downloading https:/

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Text not found in https://witchculttranslation.com/wp-content/uploads/2019/01/a4-post-read-before-arc-5.pdf. Review your selector.


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Text not found in https://witchculttranslation.com/wp-content/uploads/2019/01/a4-post-read-before-arc-5.pdf. Review your selector.


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Text not found in https://witchculttranslation.com/wp-content/uploads/2019/01/a4-post-read-before-arc-5.pdf. Review your selector.


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Text not found in https://witchculttranslation.com/wp-content/uploads/2019/01/a4-post-read-before-arc-5.pdf. Review your selector.


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Text not found in https://witchculttranslation.com/wp-content/uploads/2019/01/a4-post-read-before-arc-5.pdf. Review your selector.


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Text not found in https://witchculttranslation.com/wp-content/uploads/2019/01/a4-post-read-before-arc-5.pdf. Review your selector.


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Text not found in https://witchculttranslation.com/wp-content/uploads/2019/01/a4-post-read-before-arc-5.pdf. Review your selector.
File arc-5-chapter-1.txt saved correctly!
File arc-5-chapter-2.txt saved correctly!
File arc-5-chapter-3.txt saved correctly!
File arc-5-chapter-4.txt saved correctly!
File arc-5-chapter-5.txt saved correctly!
File arc-5-chapter-6.txt saved correctly!
File arc-5-chapter-7.txt saved correctly!
File arc-5-chapter-8.txt saved correctly!
File arc-5-chapter-9.txt saved correctly!
File arc-5-chapter-10.txt saved correctly!
File arc-5-chapter-11.txt saved correctly!
File arc-5-chapter-12.txt saved correctly!
File arc-5-chapter-13.txt saved correctly!
File arc-5-chapter-14.txt saved correctly!
File arc-5-chapter-15.txt saved correctly!
File arc-5-chapter-16.txt saved correctly!
File arc-5-chapter-17.txt saved correctly!
File arc-5-chapter-18.txt saved correctly!
File arc-5-chapter-19.txt saved correctly!
File arc-5-chapter-20.txt saved correctly!
File arc-5-chapter