Attached you will find my code for a Digital Humanities project that:

- Collected and digitized historic texts on Guam and CHamoru culture
- Created TEI files for analysis of the corpus
- Sought to use TF-IDF, co-occurrence, sentiment analysis, and topic modeling to explore differences between how insiders (those who are from Guam and identify as CHamoru) and outsiders (those who write ABOUT Guam but are not from the island or of the culture) talk about the island.

This is a starting point and uses distant reading to point towards potential future close-reading and digitization efforts.

In [3]:
# import packages

import os
import re
import spacy
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs 
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

**STEP 1: Prepare texts for analysis**

In [4]:
# load model

nlp = spacy.load("en_core_web_sm")

# def a function for pre-processing
def pre_process(text):
    doc = nlp(text)
    processed_tokens = [token.lemma_ for token in doc if token.pos_ in ['NOUN', 'ADJ']]
    return " ".join(processed_tokens)   # joined processed tokens consisting of pre-processed nouns and adjectives

# use os.scandir to apply function to files in path
filepath = "/Users/ricky/digital_texts/corpus/files/0_tei_files/finalized_tei"
metadata_list = []
text_list = []
raw_text_list = []

for entry in os.scandir(filepath):
    # control for hidden files (realized this was providing an issue for processing)
    if entry.name.startswith('.'):
        continue
    print(f'Processing: {entry.path}')  # track processing time
    try:
        with open(entry.path, encoding='utf8') as file:
            xml_content = file.read()
            # parsing file w/ BeautifulSoup to extract metadata
            soup = bs(xml_content, "xml")
            author = soup.author.text.strip() if soup.author else "Unknown"
            title = soup.title.text.strip() if soup.title else "Untitled"
            pub_date = soup.date.text.strip() if soup.date else "Unknown"
            in_out = soup.affiliation.text.strip() if soup.affiliation else "Unknown" 
            # store metadata
            metadata = {"author": author, "title": title, "pub_date": pub_date, "insider/outsider": in_out,}
            metadata_list.append(metadata)
            
            # pre processing continued with text extraction
            text = soup.body.text if soup.body else ""
            raw_text_list.append(text)  # append raw text to a list for future processing
            processed_text = pre_process(text)
            text_list.append(processed_text)    # append processed text list for future processing
    except UnicodeDecodeError:  # more control measures for hidden files
        print(f'UnicodeDecodeError in: {entry.path}')
        continue
print('Processing done.')

Processing: /Users/ricky/digital_texts/corpus/files/0_tei_files/finalized_tei/marianas_mosiac.xml
Processing: /Users/ricky/digital_texts/corpus/files/0_tei_files/finalized_tei/guam_two_invasions_and_three_military_occupations.xml
Processing: /Users/ricky/digital_texts/corpus/files/0_tei_files/finalized_tei/legacy_of_a_political_union.xml
Processing: /Users/ricky/digital_texts/corpus/files/0_tei_files/finalized_tei/eng_chamoru_legends.xml
Processing: /Users/ricky/digital_texts/corpus/files/0_tei_files/finalized_tei/destinys_landfall.xml
Processing: /Users/ricky/digital_texts/corpus/files/0_tei_files/finalized_tei/history_of_the_chamorro_people.xml
Processing done.


**STEP 2: Analysis of text**

NOTES ON VECTORIZING:
- Vectorizing is the process of converting raw text data into numerical representations (vectors) that machine learning algorithms can understand and process. Since algorithms work with numbers, not words, this step is essential for tasks like text classification, clustering, or analysis.
- CountVectorizer (Bag-Of-Words) counts how many times each word appears in a document
- TF-IDF reflects term importance (see below)
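
The bag-of-words idea in the notes above can be sketched without any library. This is a minimal illustration, not the notebook's method: `bag_of_words` and `toy_counts` are hypothetical names, and the regular expression is assumed to match CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep tokens of two or more word characters, and count them."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

# hypothetical toy sentence, the same one used in the TF-IDF notes
toy_counts = bag_of_words("The cat sat on the mat")
print(toy_counts)
```

Word order is discarded entirely; only the counts survive, which is why this representation is called a "bag" of words.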

NOTES ON TF-IDF:
- Converting CountVectorizer output to TF-IDF transforms raw word counts into a weighted representation that reflects how important each term is across the documents
- CountVectorizer outputs a term-frequency matrix (e.g. "The cat sat on the mat" > {'the':2, 'cat':1, 'sat':1, ...})
- TF-IDF adjusts these raw counts using Term Frequency (how often a word appears in a document) and Inverse Document Frequency (which penalizes terms that appear in many documents)
- The result boosts rare, meaningful terms and downweights common ones
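
The weighting described above can be worked through by hand on two toy documents. This is a sketch under stated assumptions: `toy_docs` and `tfidf_weights` are hypothetical, and the formula (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document) is what scikit-learn's TfidfTransformer is understood to use with its default settings.

```python
import math
from collections import Counter

# hypothetical toy documents, already tokenized
toy_docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

def tfidf_weights(docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: the number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # raw count multiplied by the smoothed inverse document frequency
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalise so documents of different lengths are comparable
        norm = math.sqrt(sum(value * value for value in raw.values()))
        weighted.append({term: value / norm for term, value in raw.items()})
    return weighted

toy_weights = tfidf_weights(toy_docs)
print(toy_weights[0])
```

In the first toy document, "cat" and "sat" each occur once, but "cat" receives the higher weight because it appears in only one of the two documents while "sat" appears in both.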

METHOD 1: TF-IDF

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer()  # counting how many times each word appears in docs
# converts a list of text documents into a numerical matrix suitable for machine learning
# fit() learns the vocabulary from text_list
# transform() converts the data into a numerical matrix based on the learned vocabulary
# C is a sparse matrix where each row represents a document and each column represents a word/term
C = vectorizer.fit_transform(text_list) 
feature_names = vectorizer.get_feature_names_out()  # for mapping column indices to actual words/terms

# converting to TF-IDF
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(C)

# convert to a DataFrame (rows = documents, columns = terms)
titles = [metadata['title'] for metadata in metadata_list]
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=feature_names, index=titles)

# display top words per document
top_n = 5   # adjust for number of words to show
for title in titles:
    print(f"Top words for '{title}':")
    top_words = df_tfidf.loc[title].nlargest(top_n)
    print(top_words, '\n')

Top words for 'A Marianas Mosaic: Signs and Shifts in Contemporary Island Life':
generation    0.259778
other         0.216253
people        0.212301
culture       0.203745
island        0.180117
Name: A Marianas Mosaic: Signs and Shifts in Contemporary Island Life, dtype: float64 

Top words for 'GUAM: TWO INVASIONS AND THREE MILITARY OCCUPATIONS':
japanese     0.430121
naval        0.224926
gun          0.202889
island       0.197352
guamanian    0.186958
Name: GUAM: TWO INVASIONS AND THREE MILITARY OCCUPATIONS, dtype: float64 

Top words for 'Legacy of a Political Union: A Founding Father's Memoir':
negotiation    0.353061
people         0.331279
political      0.221373
member         0.191123
citizen        0.158975
Name: Legacy of a Political Union: A Founding Father's Memoir, dtype: float64 

Top words for 'Chamoru Legends: A Gathering of Stories':
tree       0.292939
brother    0.204505
child      0.190687
man        0.165218
time       0.155640
Name: Chamoru Legends: A Gatherin

METHOD 1 (continued): Co-occurrences

In [7]:
corpus = raw_text_list
co_occurrences = [] # create a list of co-occurrences for target_noun
target_noun = 'guam'    # could be anything

for doc in corpus:
    spacy_doc = nlp(doc)
    for token in spacy_doc:
        if token.text.lower() == target_noun:
            for child in token.children:
                if child.pos_ == 'ADJ': # looking for adjectives that directly modify our target noun
                    print(f"Adjective near '{target_noun}': {child.text}")  # print the adjective itself, not the noun's head
                    co_occurrences.append(child.text)
            if token.head.pos_ == 'ADJ':
                print(f"Adjective governing '{target_noun}': {token.head.text}")    # the target noun's syntactic head is itself an adjective
                co_occurrences.append(token.head.text)

print('Co-occurrences:', co_occurrences)

Adjective governing 'guam': past
Adjective near 'guam': modern
Adjective near 'guam': postwar
Adjective near 'guam': modern
Adjective near 'guam': strong
Adjective near 'guam': fortified
Adjective near 'guam': vulnerable
Adjective near 'guam': defenseless
Adjective near 'guam': roadless
Adjective near 'guam': southern
Adjective near 'guam': central
Adjective near 'guam': northern
Adjective near 'guam': Northern
Adjective near 'guam': southern
Adjective near 'guam': northern
Adjective near 'guam': northern
Adjective near 'guam': central
Co-occurrences: ['past', 'modern', 'postwar', 'modern', 'strong', 'fortified', 'vulnerable', 'defenseless', 'roadless', 'southern', 'central', 'northern', 'Northern', 'southern', 'northern', 'northern', 'central']
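
One way to summarise the co-occurrence list is to tally it with `Counter`. The sketch below copies the adjectives from the 'Co-occurrences' output printed above so that it runs on its own; `recorded_adjectives` and `adjective_counts` are hypothetical names, and lowercasing (so that 'Northern' and 'northern' are merged) is an assumption about what is wanted. Inside the notebook, the `co_occurrences` list could be passed in directly.

```python
from collections import Counter

# adjectives copied from the recorded 'Co-occurrences' output
recorded_adjectives = [
    'past', 'modern', 'postwar', 'modern', 'strong', 'fortified',
    'vulnerable', 'defenseless', 'roadless', 'southern', 'central',
    'northern', 'Northern', 'southern', 'northern', 'northern', 'central',
]

# lowercase before counting so capitalised variants are merged
adjective_counts = Counter(adj.lower() for adj in recorded_adjectives)

for adjective, count in adjective_counts.most_common(5):
    print(f"{adjective}: {count}")
```

Geographic adjectives dominate this tally, while evaluative ones (e.g. 'vulnerable', 'defenseless') each appear once, which may be a useful pointer for close reading.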


METHOD 2: Sentiment Analysis

Notes on Sentiment Analysis:
- Sentiment analysis places each text on a spectrum from negative (-) through neutral (0) to positive (+), based on the words used throughout the text. TextBlob reports this as a polarity score between -1 and +1. In this case, most of the texts are neutral to neutral-positive, which could tell us something about the nature/purpose of the texts being written.
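
To make the idea of a word-based polarity score concrete, here is a toy sketch. It is not TextBlob's implementation: `TOY_LEXICON` and its scores are invented for illustration, and `toy_polarity` simply averages the scores of the words it recognises. TextBlob's default analyzer is understood to be lexicon-based as well, but with a far larger lexicon and more elaborate rules.

```python
# Hypothetical mini-lexicon: these words and scores are invented for illustration
TOY_LEXICON = {
    "beautiful": 0.8,
    "peaceful": 0.6,
    "strong": 0.4,
    "vulnerable": -0.5,
    "terrible": -1.0,
}

def toy_polarity(text):
    """Average the scores of words found in the lexicon; return 0.0 if none are found."""
    scores = [TOY_LEXICON[word] for word in text.lower().split() if word in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("a beautiful peaceful island"))
print(toy_polarity("the island was vulnerable"))
print(toy_polarity("the island"))
```

Averaging in this way suggests one reason why long, largely descriptive texts can land close to zero: positive and negative words offset one another, and text with no scored words counts as neutral.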

In [8]:
from textblob import TextBlob

# creating a dataframe of metadata to identify parts of the text we're analyzing
df_corpus = pd.DataFrame(metadata_list)

df_corpus['sentiment'] = [TextBlob(text).sentiment.polarity for text in text_list]
print(df_corpus[["title", "sentiment"]])

                                               title  sentiment
0  A Marianas Mosaic: Signs and Shifts in Contemp...   0.087177
1  GUAM: TWO INVASIONS AND THREE MILITARY OCCUPAT...   0.026302
2  Legacy of a Political Union: A Founding Father...   0.112685
3            Chamoru Legends: A Gathering of Stories   0.097129
4              Destiny’s Landfall: A History of Guam   0.038031
5  The Hale'-Ta Series: HeStorian Taotao Tano': H...   0.100835


METHOD 3: Topic Modeling

NOTES ON TOPIC MODELING:

When to INCREASE n_topics (e.g., from 6 to 8 or 10):
- If topics seem too broad or contain mixed themes in one topic.
- If words from different subjects are appearing together in a single topic.
- If you suspect there are more distinct themes in your documents.


When to DECREASE n_topics (e.g., from 6 to 4 or 5):
- If topics seem too fragmented, with very specific themes that might not be useful.
- If some topics repeat similar themes with slight variations.
- If you get many topics that don’t seem meaningful.
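
Alongside the qualitative checks above, a rough numerical comparison can be made by fitting the model with several topic counts and comparing perplexity (lower is generally better). This is only a sketch: `toy_docs` are invented placeholder documents so the example runs on its own, and `perplexity_by_topic_count` is a hypothetical helper. In the notebook, the document-term matrix built in the cell below could be passed in instead.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented placeholder documents, only here so the sketch is self-contained
toy_docs = [
    "navy army invasion landing guns",
    "army navy guns invasion harbor",
    "legend tree brother child village",
    "village legend child tree river",
    "missionaries padre society ancestors church",
    "church padre ancestors missionaries society",
]
toy_matrix = CountVectorizer().fit_transform(toy_docs)

def perplexity_by_topic_count(matrix, candidates, seed=1):
    """Fit one LDA model per candidate topic count and record its perplexity."""
    scores = {}
    for n_topics in candidates:
        model = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
        model.fit(matrix)
        # perplexity is measured on the same data the model was fitted on
        scores[n_topics] = model.perplexity(matrix)
    return scores

toy_scores = perplexity_by_topic_count(toy_matrix, [2, 3, 4])
for n_topics, score in toy_scores.items():
    print(f"n_topics={n_topics}: perplexity={score:.1f}")
```

With a corpus as small as six documents, perplexity is at best a loose guide; whether the topics are interpretable remains the deciding criterion.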

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import LatentDirichletAllocation
from joblib import dump


# set pre-work to load XML files
folderpath = "/Users/ricky/digital_texts/corpus/files/0_tei_files/finalized_tei"
texts = []
filenames = []

# step 1: iterate over all files within folder
for entry in os.scandir(folderpath):
    # control for hidden files
    if entry.name.startswith('.') or not entry.is_file():
        continue
    try:
        # name the file handle separately so it does not shadow the directory entry
        with open(entry.path, encoding='utf8') as fh:
            xml_content = fh.read()
            # parse xml and extract <body> text
            soup = bs(xml_content, "xml")
            text = soup.body.get_text() if soup.body else soup.get_text()
            if text.strip():
                texts.append(text)
                filenames.append(entry.name)
            else:
                print(f'skipping empty document: {entry.path}')
    except UnicodeDecodeError:
        print(f'skipping unreadable file: {entry.path}')
        
# step 2: convert to dataframe
df = pd.DataFrame({'filename': filenames, 'text': texts})
print('Done extracting text from files. \n')

# step 3: vectorize texts
print('Vectorizing texts...', end=' ', flush=True)
vectorizer = CountVectorizer(min_df=0.01, max_df=0.6, stop_words='english')
vectorized_data = vectorizer.fit_transform(df.text)
print('done.\n')

# step 4: training lda model
print('Building lda model using training set...', end=' ', flush=True)
n_topics = 4 # adjust as needed depending on the output
lda = LatentDirichletAllocation(n_components=n_topics, learning_decay=0.8, random_state=1)
doc_topic_distribution = lda.fit_transform(vectorized_data)
print('done.\n')

# step 5: display topic models
print('Top words per topic:')
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-10 - 1:-1]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")
    
# step 6: assign each document its dominant topic (0-based index; the topic labels printed above are 1-based)
df['topic'] = doc_topic_distribution.argmax(axis=1)

# step 7: save model and vectorizer
dump(lda, 'lda_model.joblib')
dump(vectorizer, 'vectorizer.joblib')

print('\n model and vectorizer saved.')



Done extracting text from files. 

Vectorizing texts... done.

Building lda model using training set... done.

Top words per topic:
Topic 1: guamanians, japan, naval, navy, army, pp, guns, lvt, landing, invasion
Topic 2: ancient, vitores, spaniards, maga, missionaries, society, lahi, padre, ancestors, think
Topic 3: ko, hilitai, elena, nåna, sirena, carabao, cow, skin, fruit, maybe
Topic 4: chamoru, generation, cnmi, halo, filipino, chamorus, healers, refaluwasch, 2017, art

 model and vectorizer saved.
