
**Classification**

Last week we talked about the process of classification in basic outline: taking two (or more, but let's start with two) groups of texts, labeled accordingly (type 1, type 2), in some quantity (1000 is a benchmark); selecting "features" for a classifier to use to distinguish them (e.g., in the simplest version, single words/word counts); selecting a type of classification model (e.g., logistic regression, naive Bayes); and then producing/analyzing an output that might be used in different ways.

Oh, and we also mentioned the necessity of splitting up data into training and testing sets, or training, testing, and validation sets (or, as will be modeled below, using a slightly more complicated process, such as 10-fold cross-validation, to guard against "overfitting"); and we considered, too, the ways in which we verify the functionality of a classification model (accuracy score, etc.).

Now let's put this all into practice through an example. We'll use texts by James Joyce, as in prior classes, and we'll use a package called scikit-learn (imported in code as "sklearn") to build/run classification models. You've seen this package before. Does anyone remember what we used scikit-learn for in the past?


First we need some texts. Let's start by taking two Joyce texts, Dubliners and Portrait, and splitting them up into mini chunks (let's say of 300 words each or so). Then let's see if a classifier can learn the difference between the chunks (i.e., classify chunks as coming from one text or the other with some accuracy). We'd expect this to work, even if our chunks are not super nicely divided (e.g., by paragraph). In a larger experiment, we could have the full books be single documents for classification, but then we'd need a lot more books!

In [None]:
import requests

# URLs to the texts
portrait_url = "https://www.gutenberg.org/cache/epub/4217/pg4217.txt"
dubliners_url = "https://www.gutenberg.org/cache/epub/2814/pg2814.txt"

# Filenames to save
portrait_filename = "portrait.txt"
dubliners_filename = "dubliners.txt"

# Download and save Portrait
response_portrait = requests.get(portrait_url)
with open(portrait_filename, "w", encoding="utf-8") as f:
    f.write(response_portrait.text)

# Download and save Dubliners
response_dubliners = requests.get(dubliners_url)
with open(dubliners_filename, "w", encoding="utf-8") as f:
    f.write(response_dubliners.text)

print("Texts downloaded and saved as 'portrait.txt' and 'dubliners.txt'")


Now let's clean out the Gutenberg front and back matter of each text by deleting everything before the book's first line and after its last.

Find first line of portrait:

In [None]:
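import re

# Load the raw Portrait text saved in the previous cell
with open("portrait.txt", "r", encoding="utf-8") as f:
    portrait_raw = f.read()

# Search for "once upon a time" in portrait_raw, case-insensitive
matches = list(re.finditer(r"once upon a time", portrait_raw, flags=re.IGNORECASE))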

# Print up to 3 matches with context
for i, match in enumerate(matches[:3]):
    start = max(0, match.start() - 100)
    end = match.end() + 100
    print(f"\n--- Match {i+1} ---\n")
    print(portrait_raw[start:end])


Trim everything before that first line:

In [None]:
# Step 1: Trim everything before "Once upon a time"
start_phrase = "Once upon a time"
start_idx = portrait_raw.lower().find(start_phrase.lower())

if start_idx == -1:
    raise ValueError("Start phrase not found in Portrait.")

portrait_trimmed = portrait_raw[start_idx:]

# Save intermediate result to inspect or continue cleaning later
with open("portrait_trimmed_start.txt", "w", encoding="utf-8") as f:
    f.write(portrait_trimmed)

print("Trimmed Portrait from start phrase. Saved as 'portrait_trimmed_start.txt'")


In [None]:
# Load the trimmed file and preview beginning
with open("portrait_trimmed_start.txt", "r", encoding="utf-8") as f:
    portrait_trimmed = f.read()

print("\n--- PORTRAIT (START TRIMMED) BEGINNING ---\n")
print(portrait_trimmed[:1000])  # Print first 1000 characters


Find last line of portrait:

In [None]:
import re

# Define the full ending snippet, as it appears across lines
end_snippet_pattern = r"old father.*?stand me now and ever in good\s+stead"

# Use regex to match across line breaks in portrait_trimmed
match = re.search(end_snippet_pattern, portrait_trimmed, flags=re.IGNORECASE | re.DOTALL)

if match:
    trimmed_portrait_final = portrait_trimmed[:match.end()]

    # Save final cleaned version
    with open("portrait_clean.txt", "w", encoding="utf-8") as f:
        f.write(trimmed_portrait_final)

    print("✅ Successfully matched and trimmed Portrait to correct ending.")
else:
    print("❌ Could not find ending snippet in portrait_trimmed.")


Check that it worked:

In [None]:
# Load the final cleaned Portrait text
with open("portrait_clean.txt", "r", encoding="utf-8") as f:
    portrait_clean = f.read()

# Preview function
def preview_text(name, text, num_chars=500):
    print(f"\n--- {name} BEGINNING ---\n")
    print(text[:num_chars])
    print(f"\n--- {name} END ---\n")
    print(text[-num_chars:])

# Check cleaned Portrait
preview_text("PORTRAIT (FINAL TRIM)", portrait_clean)


Now let's clean dubliners in the same way:

In [None]:
import re

# Load full raw Dubliners text
with open("dubliners.txt", "r", encoding="utf-8") as f:
    dubliners_raw = f.read()

# --- Trim start ---
start_snippet = "There was no hope for him this time"
start_idx = dubliners_raw.lower().find(start_snippet.lower())

if start_idx == -1:
    raise ValueError("❌ Start phrase not found in Dubliners.")
else:
    dubliners_trimmed_start = dubliners_raw[start_idx:]

# --- Trim end ---
# We'll use a fuzzy multiline match to capture the final sentence
end_pattern = (
    r"his soul swooned slowly.*?the descent of their last end, upon all the living and the dead"
)

match = re.search(end_pattern, dubliners_trimmed_start, flags=re.IGNORECASE | re.DOTALL)

if match:
    dubliners_clean = dubliners_trimmed_start[:match.end()]

    # Save cleaned version
    with open("dubliners_clean.txt", "w", encoding="utf-8") as f:
        f.write(dubliners_clean)

    print("✅ Successfully cleaned and saved 'dubliners_clean.txt'")
else:
    raise ValueError("❌ Could not find ending phrase in Dubliners.")
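
The two cleaning passes above follow the same pattern (find a start phrase, then regex-match an ending), so they could be folded into one helper. Here's a minimal sketch of that idea. The function name `trim_between` and the toy text are my own inventions for illustration, not part of the notebook's existing code:

```python
import re

def trim_between(text, start_phrase, end_pattern):
    """Keep only the span from start_phrase through the end of end_pattern."""
    # Case-insensitive search for the opening words of the book
    start_idx = text.lower().find(start_phrase.lower())
    if start_idx == -1:
        raise ValueError(f"Start phrase not found: {start_phrase!r}")
    trimmed = text[start_idx:]

    # DOTALL lets ".*?" run across line breaks, as in the cells above
    match = re.search(end_pattern, trimmed, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError(f"End pattern not found: {end_pattern!r}")
    return trimmed[:match.end()]

# Toy stand-in for a Gutenberg file (not real book text)
toy = "LICENSE HEADER\nOnce upon a time there was a cow.\nThe\nend of it.\nLICENSE FOOTER"
clean = trim_between(toy, "once upon a time", r"the\s+end of it")
print(clean)
```

With the real files, the same call would presumably take the start phrases and end patterns used in the cells above.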


OK, now let's split each text into a bunch of subdocuments: chunks of 300 words. We'll put the chunks in two separate dataframes and give them index numbers to keep them in order.

In [None]:
import pandas as pd

def chunk_text(text, chunk_size=300):
    # Tokenize text into words
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks

# Load the cleaned texts
with open("portrait_clean.txt", "r", encoding="utf-8") as f:
    portrait_clean = f.read()

with open("dubliners_clean.txt", "r", encoding="utf-8") as f:
    dubliners_clean = f.read()

# Chunk the texts
portrait_chunks = chunk_text(portrait_clean, chunk_size=300)
dubliners_chunks = chunk_text(dubliners_clean, chunk_size=300)

# Create DataFrames with index numbers
portrait_df = pd.DataFrame({
    "chunk_number": range(len(portrait_chunks)),
    "text": portrait_chunks
})

dubliners_df = pd.DataFrame({
    "chunk_number": range(len(dubliners_chunks)),
    "text": dubliners_chunks
})

# Preview
print("✅ Chunking complete. Portrait chunks:", len(portrait_df))
print("✅ Chunking complete. Dubliners chunks:", len(dubliners_df))

# Optional: save to CSVs for inspection
# portrait_df.to_csv("portrait_chunks.csv", index=False)
# dubliners_df.to_csv("dubliners_chunks.csv", index=False)


Cool. We'll notice that this isn't a huge number of documents to work with to build a classifier. In an ideal world, we'd have 1000 of each. To get closer to that, we could make smaller chunks of about 100 words. But let's soldier on and see how this works.
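
To see how chunk size trades off against the number of documents, here's a small sketch on a toy text of 1,000 made-up "words". The helper `chunk_words` and its `drop_short_tail` option are hypothetical additions (the notebook's own `chunk_text` keeps the short final chunk):

```python
def chunk_words(text, chunk_size=300, drop_short_tail=False):
    """Split text into chunks of chunk_size words (same idea as chunk_text above)."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    # The final chunk is usually shorter than the rest; optionally discard it
    if drop_short_tail and chunks and len(chunks[-1].split()) < chunk_size:
        chunks.pop()
    return chunks

# Toy text of 1,000 "words" standing in for a book
toy_text = " ".join(f"w{i}" for i in range(1000))

n_300 = len(chunk_words(toy_text, 300))
n_100 = len(chunk_words(toy_text, 100))
n_300_trim = len(chunk_words(toy_text, 300, drop_short_tail=True))
print(n_300, n_100, n_300_trim)
```

Smaller chunks mean more documents, but each one gives the classifier fewer words to go on, so it's a judgment call.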

In [None]:
portrait_df.head()

In [None]:
dubliners_df.head()

First, we need to know what format this data needs to be in for scikit-learn's classifier.

We also need equal numbers of texts of each class. So, let's say we're going to take the first 225 chunks from each text: 225 chunks of Portrait and 225 chunks of Dubliners. What format do we need these in?

To use the scikit-learn classifier, we need the text formatted this way:

One big list of all the chunks: so, that's 225 Portrait chunks followed by 225 Dubliners chunks. And then, one big list of all the labels for the texts, in the same order. The labels should be 0 and 1, where 0 represents one class (Portrait) and 1 represents the other (Dubliners).
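
If we didn't want to hard-code 225, one option would be to cut both classes down to the size of the smaller one. This is only a sketch of that idea; `balanced_dataset` is a hypothetical helper, and the tiny lists stand in for our chunk lists:

```python
def balanced_dataset(class0_texts, class1_texts):
    """Return (X, y) with the same number of items from each class."""
    n = min(len(class0_texts), len(class1_texts))  # size of the smaller class
    X = class0_texts[:n] + class1_texts[:n]
    y = [0] * n + [1] * n
    return X, y

# Toy stand-ins: three "Portrait" chunks and two "Dubliners" chunks
X_toy, y_toy = balanced_dataset(["p1", "p2", "p3"], ["d1", "d2"])
print(X_toy)
print(y_toy)
```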

In [None]:
# Get the first 225 chunks from each (if not already sliced)
portrait_texts = portrait_chunks[:225]
dubliners_texts = dubliners_chunks[:225]

# Combine into one list of features
X = portrait_texts + dubliners_texts

# Create corresponding list of binary labels
# 0 for Portrait, 1 for Dubliners
y = [0] * len(portrait_texts) + [1] * len(dubliners_texts)

# Sanity check
print(f"✅ Created feature list X with {len(X)} entries")
print(f"✅ Created label list y with {len(y)} entries")
print(f"Example:\nLabel: {y[0]} | Text: {X[0][:150]}...")


In [None]:
# First, our list of chunks, called X; let's check it out:
X

In [None]:
# And let's check out our list of labels, y
y

OK, so our data is now ready to be classified.
We still have to do a few things.
First, we might need, in some cases, to split this data up into train and test sets.

Instead, though, we're going to use the method that I suggest for classification, which is tenfold cross-validation.

**pause for lecture on that
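
To make the idea concrete, here's a rough sketch of what a tenfold split does to 450 items: shuffle the indices, deal them into ten folds, and let each fold take one turn as the test set. The function `k_fold_indices` is a made-up illustration, and unlike scikit-learn's StratifiedKFold it does not balance the classes within each fold:

```python
import random

def k_fold_indices(n_items, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(450, k=10))
for fold, (train_idx, test_idx) in enumerate(splits, 1):
    print(f"Fold {fold}: train on {len(train_idx)}, test on {len(test_idx)}")
```

The point is that every chunk gets used for testing once, and never by a model that saw it during training.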

Also, we need to select our features. Here, we're going to use the simple feature of single word counts. And we need to select what kind of classifier to use. We'll start with a logistic regression classifier. In a moment, we'll talk about what both these things mean (especially the second) and how we can substitute them. BUT let's start by selecting them and moving forward.
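
What do "single word counts" look like as data? Here's a hand-rolled sketch of a bag-of-words table. It only approximates what CountVectorizer does by default (lowercasing, and keeping tokens of two or more word characters); the two sentences are invented examples:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens of two or more characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

docs = ["The boy saw a cow. The cow saw him.", "She saw the street and the houses."]
counts = [bag_of_words(d) for d in docs]

# Shared, sorted vocabulary: one column per distinct word
vocab = sorted(set().union(*counts))
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column is one word, so word order is thrown away entirely; that's the "bag" in bag of words.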

The code below will train a logistic regression classification model on our data using bag-of-words features (single words/unigrams). It will use tenfold cross-validation, meaning it trains 10 different models on different train and test set splits. BUT it will then SAVE the model with the highest accuracy score. So our saved output will be the best model.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

# ---------------------------------------------
# Setup: use 10-fold cross-validation manually
# ---------------------------------------------

# We'll split the data into 10 folds
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Track scores and best model
fold_accuracies = []
best_accuracy = 0
best_model = None

# ---------------------------------------------
# Loop through each fold
# ---------------------------------------------
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    # Split the data into train and test sets for this fold
    X_train = [X[i] for i in train_index]
    X_test = [X[i] for i in test_index]
    y_train = [y[i] for i in train_index]
    y_test = [y[i] for i in test_index]

    # -------------------------------
    # Build the pipeline
    # -------------------------------
    # CountVectorizer turns text into Bag-of-Words features (one column per word)
    # LogisticRegression assigns weights to each word to predict class (0 = Portrait, 1 = Dubliners)
    model = make_pipeline(
        CountVectorizer(),
        LogisticRegression(max_iter=1000)
    )

    # Train the model on this fold's training set
    model.fit(X_train, y_train)

    # Evaluate on this fold's test set
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    fold_accuracies.append(acc)

    print(f"Fold {fold} accuracy: {acc:.3f}")

    # Keep track of the best-performing model
    if acc > best_accuracy:
        best_accuracy = acc
        best_model = model  # Save the pipeline

# ---------------------------------------------
# Report cross-validation results
# ---------------------------------------------
print("\n✅ 10-Fold Cross-Validation Summary:")
print(f"Accuracy scores: {np.round(fold_accuracies, 3)}")
print(f"Average accuracy: {np.mean(fold_accuracies):.3f}")
print(f"Best fold accuracy: {best_accuracy:.3f}")


Whoa, these accuracies are really high! The fact that some folds scored perfectly might suggest "overfitting", and so, in a real-life experiment, we might want to investigate further by testing on different sets of data. Or we might want to get rid of some input features that make the task too "easy", if we're hoping our classifier will apply broadly to other texts. However, the tenfold strategy tests each model on chunks it never saw in training, which guards against being fooled by overfitting, so we may be able to just trust that this works really well (as it should).

But now that we have our best model (or one of them), let's look at a few things we can do with this classifier we just built, before thinking about changing the chosen features or type.

First, let's say we want to examine the features the classifier ended up using to make its distinctions. Which words were most strongly associated with Portrait (0)? Which ones with Dubliners (1)?

In [None]:
# Get feature names and weights from the best model
vectorizer = best_model.named_steps['countvectorizer']
classifier = best_model.named_steps['logisticregression']

feature_names = vectorizer.get_feature_names_out()
coefficients = classifier.coef_[0]

# Get top words for each class
word_weights = list(zip(feature_names, coefficients))
top_portrait = sorted(word_weights, key=lambda x: x[1])[:20]
top_dubliners = sorted(word_weights, key=lambda x: x[1], reverse=True)[:20]

print("\n🟦 Top 20 words for *Portrait* (class 0):")
for word, weight in top_portrait:
    print(f"{word:15} {weight:.4f}")

print("\n🟥 Top 20 words for *Dubliners* (class 1):")
for word, weight in top_dubliners:
    print(f"{word:15} {weight:.4f}")


Well, that makes sense. We'd have guessed character names would be key to distinguishing the books. Still, there are some other words that feel distinctive, too!

OK, so we used a logistic regression classifier. What does this mean? Well, this is one type of process the model can use to build a classifier, as compared with some others, such as naive Bayes and support vector machines. You can read about these in detail. But here are some short descriptions:

Logistic Regression

Logistic Regression is a linear classifier used for binary and multi-class classification.
It learns weights for each feature (e.g., word in a document) and uses them to calculate the probability that an input belongs to a class.
It's called "logistic" because it uses the logistic (sigmoid) function to squeeze predictions between 0 and 1.
It’s interpretable — you can see which words “push” the model toward a class.
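
As a rough sketch of that description: multiply each word's count by its weight, add everything up, and squeeze the total through the sigmoid. The weights below are hypothetical numbers chosen for illustration, not values learned from our texts:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_class1(word_counts, weights, bias=0.0):
    """P(class 1) = sigmoid(bias + sum of weight * count over known words)."""
    z = bias + sum(weights.get(word, 0.0) * count for word, count in word_counts.items())
    return sigmoid(z)

# Hypothetical weights: negative pushes toward class 0, positive toward class 1
toy_weights = {"stephen": -1.5, "gabriel": 1.2, "the": 0.0}

p_a = predict_proba_class1({"stephen": 2, "the": 5}, toy_weights)
p_b = predict_proba_class1({"gabriel": 1, "the": 4}, toy_weights)
print(f"{p_a:.3f} {p_b:.3f}")
```

A word with weight 0 (like "the" here) has no effect no matter how often it appears, which is why very common words tend not to show up among the top features.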

Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' Theorem.
It assumes that features (like words) are independent of one another — an assumption that’s rarely true, but works surprisingly well for text.
It calculates the probability of a class given the words in the document and chooses the most likely class.
It’s fast, simple, and often strong for bag-of-words models.
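
Here's a from-scratch sketch of that calculation on four tiny made-up documents, assuming add-one smoothing so that unseen words don't get a probability of zero. It illustrates the idea and is not a copy of scikit-learn's MultinomialNB:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [d for d, lab in zip(docs, labels) if lab == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest log prior + summed word log-likelihoods."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Invented mini-corpus: 0 = Portrait-ish, 1 = Dubliners-ish
toy_docs = ["stephen school prayer", "stephen father school",
            "gabriel party snow", "gabriel aunt party"]
toy_labels = [0, 0, 1, 1]

prior, likelihood = train_nb(toy_docs, toy_labels)
pred_a = predict_nb("stephen prayer", prior, likelihood)
pred_b = predict_nb("snow party", prior, likelihood)
print(pred_a, pred_b)
```

Notice that each word's contribution is just added in, with no attention to which other words are nearby. That's the "naive" independence assumption at work.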

Support Vector Machine (SVM)

SVM is a margin-based classifier that tries to find the best boundary (hyperplane) between classes.
It looks for the decision boundary that maximizes the distance (margin) between the closest examples of each class.
It’s powerful in high-dimensional spaces like text, but less interpretable than logistic regression or Naive Bayes.
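
To give "margin" a concrete meaning, here's a sketch that measures the margin of one candidate boundary on four toy 2-D points. A real SVM would search over many boundaries for the one with the largest margin; this only evaluates a single hand-picked line, and all the numbers are made up:

```python
import math

def signed_distance(point, w, b):
    """Signed distance from point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

# Toy 2-D points (think: counts of two words), with labels -1 and +1
points = [(1.0, 4.0), (2.0, 5.0), (4.0, 1.0), (5.0, 2.0)]
labels = [-1, -1, 1, 1]

# Hypothetical candidate boundary: the line x1 - x2 = 0
w, b = (1.0, -1.0), 0.0

# label * distance is positive when a point is on its correct side;
# the smallest such value is this boundary's margin
margins = [lab * signed_distance(p, w, b) for p, lab in zip(points, labels)]
margin = min(margins)
print(f"margin = {margin:.3f}")
```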



TLDR - log reg is probably the most conceptually simple and I like to start with it, but naive bayes is also commonly used. When in doubt, the proof is in the pudding. Use the type that gets you the highest accuracy scores

Let's see what happens if we redo our classification model on dubliners vs portrait but use naive bayes instead

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

# ---------------------------------------------
# Setup: 10-fold cross-validation
# ---------------------------------------------
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_accuracies = []
best_accuracy = 0
best_model = None

# ---------------------------------------------
# Loop through folds
# ---------------------------------------------
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = [X[i] for i in train_index]
    X_test = [X[i] for i in test_index]
    y_train = [y[i] for i in train_index]
    y_test = [y[i] for i in test_index]

    # Build pipeline: CountVectorizer + Naive Bayes
    model = make_pipeline(
        CountVectorizer(),
        MultinomialNB()
    )

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    fold_accuracies.append(acc)

    print(f"Fold {fold} accuracy: {acc:.3f}")

    if acc > best_accuracy:
        best_accuracy = acc
        best_model = model

# ---------------------------------------------
# Report results
# ---------------------------------------------
print("\n✅ 10-Fold Cross-Validation Summary (Naive Bayes):")
print(f"Accuracy scores: {np.round(fold_accuracies, 3)}")
print(f"Average accuracy: {np.mean(fold_accuracies):.3f}")
print(f"Best fold accuracy: {best_accuracy:.3f}")


Again our most accurate was a 1.0, but there was only one. Let's compare the words used:

In [None]:
# Extract the trained vectorizer and classifier from the best Naive Bayes pipeline
vectorizer = best_model.named_steps['countvectorizer']
classifier = best_model.named_steps['multinomialnb']

# Get feature names (i.e., vocabulary) and log probabilities
feature_names = vectorizer.get_feature_names_out()
log_probs = classifier.feature_log_prob_  # shape: (n_classes, n_features)

# Pair each word with its log-prob for each class
portrait_word_probs = list(zip(feature_names, log_probs[0]))
dubliners_word_probs = list(zip(feature_names, log_probs[1]))

# Sort and get top 20 for each
top_portrait = sorted(portrait_word_probs, key=lambda x: x[1], reverse=True)[:20]
top_dubliners = sorted(dubliners_word_probs, key=lambda x: x[1], reverse=True)[:20]

# Display results
print("\n🟦 Top 20 words strongly associated with *Portrait* (class 0):\n")
for word, logprob in top_portrait:
    print(f"{word:15} {logprob:.4f}")

print("\n🟥 Top 20 words strongly associated with *Dubliners* (class 1):\n")
for word, logprob in top_dubliners:
    print(f"{word:15} {logprob:.4f}")


wait a second - this is way different!!! The words for one thing are super generic. and the lists are the same. why is this?

Here's Why:
💡 Naive Bayes ranks words by how frequent they are in each class — independently
It doesn't compare word importance between classes.

It just learns:

"In Portrait, these are the most common words"
"In Dubliners, these are the most common words"

So if the same high-frequency words appear a lot in both books (e.g., "said", "father", "time", etc.), they’ll rank high for both classes.

✅ It's frequency-based, not contrast-based.

⚖️ Logistic Regression, on the other hand:
Looks at which words help separate the classes.

It says:

“This word pushes the model toward predicting Portrait.”
“That word pushes toward Dubliners.”

So you get different top words — because it's not about overall frequency, but discriminative power.

✅ It's contrast-based — it focuses on what’s unique about each class.

If we wanted to, we could use the naive bayes output to get the most distinctive words across the lists. But it would take an extra step!
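
That extra step could be as simple as subtracting one class's log probability from the other's for each word, so that words common to both books cancel out. A sketch, with a hypothetical helper `most_distinctive` and made-up log probabilities:

```python
def most_distinctive(feature_names, log_probs_class0, log_probs_class1, n=3):
    """Rank words by log P(word | class 1) - log P(word | class 0)."""
    ratios = [
        (word, lp1 - lp0)
        for word, lp0, lp1 in zip(feature_names, log_probs_class0, log_probs_class1)
    ]
    ranked = sorted(ratios, key=lambda pair: pair[1])
    # Most negative ratios lean toward class 0, most positive toward class 1
    return ranked[:n], ranked[-n:][::-1]

# Made-up log probabilities for illustration only
names = ["the", "said", "stephen", "gabriel"]
lp_portrait = [-2.0, -4.0, -5.0, -12.0]
lp_dubliners = [-2.0, -4.1, -11.0, -6.0]

top_class0, top_class1 = most_distinctive(names, lp_portrait, lp_dubliners, n=1)
print(top_class0, top_class1)
```

In the notebook, the inputs would presumably be `vectorizer.get_feature_names_out()`, `classifier.feature_log_prob_[0]`, and `classifier.feature_log_prob_[1]` from the Naive Bayes cell above.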

OK, let's go back to logistic regression:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

# ---------------------------------------------
# Setup: 10-fold cross-validation
# ---------------------------------------------
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_accuracies = []
best_accuracy = 0
best_model_logreg = None  # store best-performing model

# ---------------------------------------------
# Loop through folds
# ---------------------------------------------
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = [X[i] for i in train_index]
    X_test = [X[i] for i in test_index]
    y_train = [y[i] for i in train_index]
    y_test = [y[i] for i in test_index]

    # Build pipeline: CountVectorizer + LogisticRegression
    model = make_pipeline(
        CountVectorizer(),
        LogisticRegression(max_iter=1000)
    )

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    fold_accuracies.append(acc)

    print(f"Fold {fold} accuracy: {acc:.3f}")

    if acc > best_accuracy:
        best_accuracy = acc
        best_model_logreg = model

# ---------------------------------------------
# Report results
# ---------------------------------------------
print("\n✅ 10-Fold Cross-Validation Summary (Logistic Regression):")
print(f"Accuracy scores: {np.round(fold_accuracies, 3)}")
print(f"Average accuracy: {np.mean(fold_accuracies):.3f}")
print(f"Best fold accuracy: {best_accuracy:.3f}")


Another thing we can do with this model, now, is take a look at how "well" it classifies different chunks of the text into the categories, which can be interesting. Remember Ryan Cordell's comment, when he visited our class, that one nice thing about how classifiers label genres is that they see them as "probabilities" rather than just binary yes-or-no judgments. The classification model won't just tell us a chunk of text is a 0 or 1 (Portrait or Dubliners). It will tell us its judged probability of the label.

Let's take a look at the first 20 chunks, each, from Portrait and Dubliners, and judge how likely they are, using our best-performing logistic regression classifier, to be labeled as one or the other. In a way, it's like asking which chunks of Portrait are most "Portrait-y" (relative to Dubliners), and which chunks of Dubliners are most "Dubliners-y" (relative to Portrait). This is what Underwood and So mean by using classifiers to measure textual distance in a "perspectival" way. I.e., it's almost like each text gives us a perspective on the other.

In [None]:
# Take the first 20 chunks from each text
test_chunks = portrait_chunks[:20] + dubliners_chunks[:20]
true_labels = [0]*20 + [1]*20  # 0 = Portrait, 1 = Dubliners

# Use the best logistic regression model to predict probabilities
probs = best_model_logreg.predict_proba(test_chunks)

# Display results
print(f"{'Index':<5} {'True':<6} {'P(Portrait)':<15} {'P(Dubliners)':<15}")
print("-" * 45)
for i, (true_label, prob) in enumerate(zip(true_labels, probs)):
    p_portrait = prob[0]
    p_dubliners = prob[1]
    print(f"{i:<5} {true_label:<6} {p_portrait:<15.3f} {p_dubliners:<15.3f}")


Now let's redo this, but look only at the three chunks from each book that the model gives the highest, and the three it gives the lowest, probability of belonging to their own class.

In [None]:
# First 20 chunks from each text
portrait_test_chunks = portrait_chunks[:20]
dubliners_test_chunks = dubliners_chunks[:20]

# Get probabilities from the model
portrait_probs = best_model_logreg.predict_proba(portrait_test_chunks)
dubliners_probs = best_model_logreg.predict_proba(dubliners_test_chunks)

# For Portrait: extract P(Portrait) = prob[:, 0]
portrait_scores = [(i, chunk, prob[0]) for i, (chunk, prob) in enumerate(zip(portrait_test_chunks, portrait_probs))]
portrait_sorted = sorted(portrait_scores, key=lambda x: x[2], reverse=True)

# For Dubliners: extract P(Dubliners) = prob[:, 1]
dubliners_scores = [(i, chunk, prob[1]) for i, (chunk, prob) in enumerate(zip(dubliners_test_chunks, dubliners_probs))]
dubliners_sorted = sorted(dubliners_scores, key=lambda x: x[2], reverse=True)

# Display Portrait results
print("\n🎨 Portrait chunks with HIGHEST model confidence (P=Portrait):\n")
for i, text, prob in portrait_sorted[:3]:
    print(f"Chunk {i} | P(Portrait) = {prob:.3f}\n{text[:300]}...\n")

print("\n🎨 Portrait chunks with LOWEST model confidence (P=Portrait):\n")
for i, text, prob in portrait_sorted[-3:]:
    print(f"Chunk {i} | P(Portrait) = {prob:.3f}\n{text[:300]}...\n")

# Display Dubliners results
print("\n📘 Dubliners chunks with HIGHEST model confidence (P=Dubliners):\n")
for i, text, prob in dubliners_sorted[:3]:
    print(f"Chunk {i} | P(Dubliners) = {prob:.3f}\n{text[:300]}...\n")

print("\n📘 Dubliners chunks with LOWEST model confidence (P=Dubliners):\n")
for i, text, prob in dubliners_sorted[-3:]:
    print(f"Chunk {i} | P(Dubliners) = {prob:.3f}\n{text[:300]}...\n")


What if we then used our classifier to classify chunks from a wholly different text? Like, say, Ulysses? Would the model classify its chunks as Portrait or Dubliners? It's an interesting question, because Ulysses has elements in common with both books. In early stages, e.g., it follows Stephen Dedalus closely, like Portrait. Later on, it moves more widely around Dublin and its Dubliners. However, Stephen is present throughout. Will the presence of his very name give a "Portrait" signal to the whole text? Or will it not? Let's see...

Remember, we've got our model saved, as best_model_logreg

so all we need is new text chunks to use it on

In [None]:
import requests

# Download the text of Ulysses
url = "https://www.gutenberg.org/cache/epub/4300/pg4300.txt"
response = requests.get(url)

# Store the raw text
ulysses_raw = response.text

# Preview the beginning
print(ulysses_raw[:1000])


In [None]:
# Update markers to match this specific version of the file
start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK ULYSSES ***"
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK ULYSSES ***"

# Find where the real book starts and ends
start_pos = ulysses_raw.find(start_marker)
end_idx = ulysses_raw.find(end_marker)

# str.find returns -1 on a miss, which would silently slice the wrong span
if start_pos == -1 or end_idx == -1:
    raise ValueError("Gutenberg start/end marker not found in Ulysses.")
start_idx = start_pos + len(start_marker)

# Extract and clean
ulysses_cleaned = ulysses_raw[start_idx:end_idx].strip()

# Preview to confirm it worked
print("\n--- Cleaned Ulysses Start ---\n")
print(ulysses_cleaned[:1000])


Now let's chunk the text

In [None]:
# Function to chunk text by number of words
def chunk_text(text, chunk_size=300):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Chunk the cleaned Ulysses text
ulysses_chunks = chunk_text(ulysses_cleaned, chunk_size=300)

# Preview: show number of chunks and the first one
print(f"✅ Total chunks created: {len(ulysses_chunks)}")
print("\n--- First Ulysses Chunk ---\n")
print(ulysses_chunks[0])


In [None]:
import matplotlib.pyplot as plt

# Step 1: Get predicted class labels and probabilities
ulysses_probs = best_model_logreg.predict_proba(ulysses_chunks)
ulysses_preds = best_model_logreg.predict(ulysses_chunks)

# Step 2: Count number of chunks classified as each
num_portrait = sum(1 for label in ulysses_preds if label == 0)
num_dubliners = sum(1 for label in ulysses_preds if label == 1)

print("📊 Ulysses Classification Summary:")
print(f"Portrait-like chunks:  {num_portrait}")
print(f"Dubliners-like chunks: {num_dubliners}")
print(f"Total chunks:          {len(ulysses_preds)}")

# Step 3: Get P(Portrait) for each chunk
portrait_probs = [prob[0] for prob in ulysses_probs]
x_vals = list(range(len(portrait_probs)))

# Step 4: Plot
plt.figure(figsize=(14, 5))
plt.plot(x_vals, portrait_probs, marker='o', linestyle='-', color='darkblue', label="P(Portrait)")
plt.title("Stylistic Progression of *Ulysses*: Portrait-Like Probability per Chunk")
plt.xlabel("Chunk Index (progress through book)")
plt.ylabel("P(Portrait)")
plt.ylim(0, 1)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()


Chunks 750-800 are "giving" strong Portrait (lol), even though they are later in the text. What part of the text is this? Maybe this is the "Circe" episode, where Stephen shows up prominently again? Or another late episode, like "Ithaca", where Bloom and Stephen are cavorting at night, since that's where Stephen shows up again toward the end of the text (despite Bloom being the central figure for much of the book's second half)? Ithaca is also highly academic in style, akin to Portrait. So, if I had to guess, I'd guess that's what's being labeled as Portrait-like here. Let's see:

In [None]:
# Define the range
start_idx = 750
end_idx = 800

# Make sure we don't exceed the number of chunks
end_idx = min(end_idx, len(ulysses_chunks))

print(f"\n🔍 Examining Ulysses chunks {start_idx} to {end_idx}:\n")

# Loop through the selected range
for i in range(start_idx, end_idx):
    # Named chunk_preview so it doesn't overwrite the chunk_text() function
    chunk_preview = ulysses_chunks[i]
    p_portrait = ulysses_probs[i][0]
    p_dubliners = ulysses_probs[i][1]
    predicted_label = "Portrait" if ulysses_preds[i] == 0 else "Dubliners"

    print(f"📌 Chunk {i}")
    print(f"Predicted: {predicted_label} | P(Portrait) = {p_portrait:.3f} | P(Dubliners) = {p_dubliners:.3f}")
    print(f"Text preview:\n{chunk_preview[:300]}...\n{'-'*80}\n")


That's Ithaca! This makes a reassuring amount of sense.
That said, we don't learn too much by discovering that parts of Ulysses with Stephen Dedalus are more likely to be labeled as "Portrait"-like. Maybe in a new experiment we might train our classifier without the use of character names, to see if we can pull out more unexpected patterns of similarity?
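
One way to set up that experiment would be to strip the names out of every chunk before training. Here's a minimal sketch; the helper `remove_names` and the short name list are hypothetical, and a real list would probably be built from the top-weighted words we printed earlier:

```python
import re

def remove_names(chunks, names):
    """Delete whole-word, case-insensitive occurrences of each name."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    # Collapse the double spaces left behind by deletions
    return [re.sub(r"\s+", " ", pattern.sub("", chunk)).strip() for chunk in chunks]

# Hypothetical, hand-picked list of names to remove
character_names = ["Stephen", "Dedalus", "Gabriel", "Cranly"]

sample = ["Stephen Dedalus walked on.", "Then Gabriel spoke to stephen."]
stripped = remove_names(sample, character_names)
print(stripped)
```

An alternative that avoids touching the text would be to pass a lowercase list of names to CountVectorizer through its `stop_words` parameter, so those words never become features in the first place.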

**OCR**

We may not have time for this in class, but I wanted to supply some sample code for those who want to use OCR, including with non-English languages.

My understanding is that the program to use is Tesseract, and it can do OCR on PDFs in many non-English languages, including Georgian (looking at you, Megi).

Let's test it out:

First we need to load in Tesseract, the OCR program, with support packages for the language we want. Here's how we'd load it in along with the support package for Georgian (with hashed-out lines showing how we'd load in, say, Turkish or Russian).

In [None]:
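# Install Tesseract and Georgian language support
!apt-get update
!apt-get install -y tesseract-ocr tesseract-ocr-kat

# For Turkish or Russian, we'd install these language packs instead:
# !apt-get install -y tesseract-ocr-tur
# !apt-get install -y tesseract-ocr-rus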


In [None]:
# Install Python OCR tools
!pip install pytesseract pdf2image pillow


In [None]:
# Import everything we just installed

import pytesseract
from pdf2image import convert_from_path
from PIL import Image


OK now we need a pdf! I'm going to take the sample georgian text from this link:

https://www.language-museum.com/encyclopedia/g/georgian.php

and make it a pdf and then load it in. You can do the same!




In [None]:
from google.colab import files

# Upload your Georgian PDF file (image-based, not digital text)
uploaded = files.upload()


In Colab specifically, before we can process this PDF, we'll also need a program called poppler:

In [None]:
!apt-get install -y poppler-utils


In [None]:


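# files.upload() returns a dict keyed by filename; take the first uploaded file
pdf_path = next(iter(uploaded))

# Convert all pages to images
pages = convert_from_path(pdf_path)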


In [None]:


# Process each page
for i, page in enumerate(pages):
    text = pytesseract.image_to_string(page, lang='kat')  # 'kat' = Georgian
    print(f"\n📄 --- Page {i+1} ---\n")
    print(text[:1000])  # Print first 1000 characters of OCR output


cool.