# Latin Texts Web Scraping
This file contains the code used to scrape PHI Latin Texts and produce the resulting parquet file used in the final Power BI report.
Because all the code in this file came out of experimenting with and becoming familiar with the Beautiful Soup library, the program is currently very inefficient.
No AI was used in the making of this project.

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Headers for device used to scrape - may need to update if run on other devices
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36'}

### Initial Request
This cell makes a request to PHI Latin Texts' author listing page (`/browse`) to scrape the list of authors whose pages are scraped in the later cells.

In [None]:
# Base request to the page containing the list of authors and links to their pages
response = requests.get('https://latin.packhum.org/browse', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Getting every link on the page and saving all links that point to authors' pages to a list
homepage_links = soup.find_all('a')
homepage_links_references = []
for tag in homepage_links:
    homepage_link_reference = tag.get('href')
    # Skipping anchors without an href (tag.get returns None for those)
    if homepage_link_reference and 'author' in homepage_link_reference:
        homepage_links_references.append(homepage_link_reference)

# Getting every author's name associated with each link scraped above
author_tags = soup.find_all(attrs={"class": "srch"})
author_names = []
for tag in author_tags:
    author_names.append(tag.text)
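
The link filtering above can be tried without a network call. This is a minimal sketch rather than the notebook's own method: `author_links` is a hypothetical helper, the sample `href` values are made up for illustration, and `urljoin` from the standard library stands in for the string concatenation used in the later cells.

```python
from urllib.parse import urljoin

BASE_URL = 'https://latin.packhum.org'

def author_links(hrefs):
    """Keep hrefs that mention 'author' and turn them into absolute URLs."""
    # Anchors without an href yield None, so falsy values are skipped first
    return [urljoin(BASE_URL, href) for href in hrefs if href and 'author' in href]

# Hypothetical href values, shaped like what tag.get('href') might return
sample_hrefs = ['/author/1', None, '/about', '/author/2']
print(author_links(sample_hrefs))
```

Building absolute URLs at this point would also remove the need to prepend the base URL again when formatting the results.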

### Scraping Authors for their Works
This cell loops over the list of authors and their links relative to the base URL to get a list of the works, or texts, that will be scraped later.

In [None]:
# A list that will store the author information from above, alongside each author's work's title and link
works = []

# Loop going over links/names scraped above
for author_index in range(len(homepage_links_references)):

    # Request to each author's respective page
    response = requests.get('https://latin.packhum.org' + homepage_links_references[author_index], headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Getting every link on the page and saving all links that point to unique works to a list
    works_links = soup.find_all('a')
    works_links_references = []
    for tag in works_links:
        works_links_reference = tag.get('href')
        # Skipping anchors without an href (tag.get returns None for those)
        if works_links_reference and 'loc' in works_links_reference:
            works_links_references.append(works_links_reference)

    # Getting every work's name associated with each link scraped above
    works_tags = soup.find_all(attrs={"class":"wnam"})
    works_names = []
    for tag in works_tags:
        works_names.append(tag.text)

    # Saving the results of the scraping for the current author to the works list for future reference
    for work_index in range(len(works_links_references)):
        works.append([author_names[author_index], works_links_references[work_index], works_names[work_index]])
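
The cell above matches each link to a title purely by list position, so a page where the two lists differ in length could raise an `IndexError` or attach a title to the wrong link. A minimal sketch of a guard, assuming a hypothetical helper `pair_by_position` and made-up sample values:

```python
def pair_by_position(links, names):
    """Pair links with names by position, failing loudly on a length mismatch."""
    if len(links) != len(names):
        raise ValueError(f'{len(links)} links but {len(names)} names')
    return list(zip(links, names))

# Made-up values for illustration only
pairs = pair_by_position(['/loc/1/1/0', '/loc/1/2/0'], ['Work A', 'Work B'])
print(pairs)
```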

### Scraping Texts
This cell loops over all pages containing raw Latin texts and saves each work's entire text as a string. Currently, this cell takes ~3 hours to run on my PC. Since this project was about experimenting, I can now see that this is wildly inefficient and can likely be done much faster (although I assume the bulk of that time is unavoidable because the cell makes 800+ requests).

In [None]:
# A list that will store the author and work information from above, alongside each work's raw text as a string
texts = []

# Loop going over links/works scraped above
for work_index in range(len(works)):

    """
    Some authors' works are so large that they are stored on multiple pages on the website being scraped.
    This sub-loop goes through every sub-page of such a multi-page work.
    It appends each page's text to a single cumulative text and breaks once the current page's text is already in the cumulative text (final page reached).
    """

    # Defining cumulative text and increment used to iterate through multiple sub-pages' urls
    cumulative_text = ""
    page_increment = 0

    # Multi-page loop
    while True:

        # Request to current page/sub-page
        response = requests.get('https://latin.packhum.org' + works[work_index][1][:-1] + str(page_increment), headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        """
        Each page containing text has the text split into multiple sections, with all sections with the "td" tag being the raw text.
        This loop gets all "td" tags and joins their individual texts into a single string containing the full page's text (which is then added to the cumulative text).
        """

        # Defining full text and finding all "td" tags
        full_text = ""
        text_sections = soup.find_all('td')

        # Loop combining all "td" tag texts
        for text in text_sections:
            full_text = full_text + " " + text.text.strip()

        # Check for if the current page's text is already in the cumulative text - indicates the previous iteration was the final page for the work and that the inner loop should be terminated
        if full_text in cumulative_text:
            break

        # Updating cumulative text with full text and incrementing the page for next iteration
        cumulative_text += " " + full_text
        page_increment += 1

    # Saving current work's full (potentially multi-page) text to the texts list with the scraped data from the previous cell
    current_work_copy = works[work_index].copy()
    current_work_copy.append(cumulative_text)
    texts.append(current_work_copy)
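
The multi-page loop can be separated from the network code so that its stop condition is testable offline. The sketch below is a variation, not the cell's exact logic: `collect_pages` and `fetch_page` are hypothetical names, it stops on an exact repeat of an earlier page (or an empty page) instead of a substring check, and it adds a `max_pages` cap as a safety net. Like the cell above, it assumes that requesting a page past the end returns text that was already seen.

```python
def collect_pages(fetch_page, max_pages=1000):
    """Join page texts until a page repeats, comes back empty, or the cap is hit."""
    pages = []
    seen = set()
    for page_number in range(max_pages):
        page_text = fetch_page(page_number).strip()
        # A repeated or empty page means the previous page was the last one
        if not page_text or page_text in seen:
            break
        seen.add(page_text)
        pages.append(page_text)
    return " ".join(pages)

# Stand-in for the site: out-of-range page numbers return the final page again
fake_pages = ["arma virumque cano", "troiae qui primus", "ab oris"]

def fake_fetch(page_number):
    return fake_pages[min(page_number, len(fake_pages) - 1)]

print(collect_pages(fake_fetch))
```

In the notebook, `fetch_page` would wrap the `requests.get` call and the "td" extraction for a single work.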

### Creating Word Dictionary
This cell loops through each saved text and creates a dictionary containing every unique word in that text and the number of times it appears there. These per-text counts will be the basis for most of the final Power BI report.

In [None]:
# Loop to turn raw text stored in list above into dictionary containing each unique word and their respective counts for each text
for text in texts:

    # Adjusting formatting of scraped data stored in current text's array (author names and links for readability/usage, words for formatting as dictionary)
    text[0] = re.sub(r'\s+', ' ', re.sub('[^a-zA-Z]', ' ', text[0])).strip()
    text[1] = 'https://latin.packhum.org' + text[1]
    word_list = re.sub(r'\s+', ' ', re.sub('[^a-z]', ' ', text[3].lower())).strip().split()

    # Creating word dictionary and adding each word in current text, as well as updating dictionary's count if the current word was already added in current text
    word_dictionary = {}
    for word in word_list:
        if word in word_dictionary:
            word_dictionary[word] += 1
            continue
        word_dictionary[word] = 1

    # Adding dictionary to current text's array of scraped data
    text.append(word_dictionary)
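
The counting step can also be written with `collections.Counter` from the standard library. This is a minimal sketch with a hypothetical helper name, `count_words`. It lowercases the text and treats every run of the letters a-z as a word, which gives the same tokens as the substitution and split above.

```python
import re
from collections import Counter

def count_words(raw_text):
    """Count each run of the letters a-z in the lowercased text."""
    words = re.findall(r'[a-z]+', raw_text.lower())
    return Counter(words)

counts = count_words("Arma virumque cano, Troiae qui primus ab oris; arma!")
print(counts['arma'])  # → 2
```

A `Counter` is a `dict` subclass, so it can be appended to each text's list in place of `word_dictionary`.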

### Manipulating/Formatting Results
This cell uses pandas to format the data stored in the texts list into a DataFrame that can be easily exported (or analyzed).

In [None]:
# Creating final DataFrame to store all scraped data
df = pd.DataFrame(texts, columns=['author', 'source', 'title', 'text', 'word'])

# Turning column containing dictionary of words and their counts into two distinct columns - one for the word in the given text, and the other for the count of that word in the given text
df['word'] = df['word'].apply(lambda row: list(row.items()))
df = df.explode('word')
df[['word', 'count']] = df['word'].apply(pd.Series)

# Dropping full raw text - WHOLE PROGRAM COULD BE IMPROVED IF THIS WAS DONE EARLIER (leaving as is because scraping theoretically only needs to be done once)
textless_df = df.drop(columns='text')
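
As the comment above notes, the raw text could be dropped earlier. One way is to flatten the records into one row per word before building the DataFrame. This is a sketch under assumptions, not the notebook's method: `to_long_rows` is a hypothetical helper, the sample record is made up, and each record is assumed to have the same five-item layout as the entries of `texts`.

```python
def to_long_rows(records):
    """Flatten [author, source, title, text, word_counts] records into one row per word."""
    rows = []
    for author, source, title, _text, word_counts in records:
        # The raw text is ignored here, so it never reaches the final table
        for word, count in word_counts.items():
            rows.append((author, source, title, word, count))
    return rows

# Made-up record for illustration only
sample = [['Vergilius', 'https://example.org/w/1', 'Sample Work',
           'arma arma cano', {'arma': 2, 'cano': 1}]]
rows = to_long_rows(sample)
print(rows)
```

The rows could then be passed to `pd.DataFrame(rows, columns=['author', 'source', 'title', 'word', 'count'])`, which avoids the `explode` step and the later `drop`.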

### Exporting
This cell has two ways of exporting the resulting DataFrame, both commented out. To save the results to either csv or parquet, uncomment the respective line.

In [None]:
# Save result to csv (too large to push to GitHub)
# textless_df.to_csv('../bin/scraped_latin_texts_words.csv', index=False)

# Save result to compressed Parquet for GitHub push
# textless_df.to_parquet('../bin/scraped_latin_texts_words.parquet', index=False, compression="gzip", engine="pyarrow")