# Math 120 Final Project

Dani Corum

December 19, 2025

## Setup

In [1]:
import os
import sys

# Check if running in Google Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")

    # Clone repository if in Colab
    if not os.path.exists('/content/Dani_Math120_Final_Project/'):
        !git clone https://github.com/DaniCor3/Dani_Math120_Final_Project.git

    # Change to project directory
    os.chdir('/content/Dani_Math120_Final_Project/')

except ImportError:
    IN_COLAB = False
    print("Running locally")

# Add src directory to Python path
if 'src' not in sys.path:
    sys.path.append('src')

print(f"Current working directory: {os.getcwd()}")

Running in Google Colab
Cloning into 'Dani_Math120_Final_Project'...
remote: Enumerating objects: 45, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 45 (delta 15), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (45/45), 1.21 MiB | 4.91 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Current working directory: /content/Dani_Math120_Final_Project


In [2]:
import re
import pandas as pd
import plotly.express as px

import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


# Custom functions:
from dataProcessing import loadRawData, makeCleanTokens, separateAndTokenize, makeCleanData, saveDataFrame
from analysis import countWords, sortDict, chWordCounts, top5Chapter, bookWordCounts, bookTopX, findUniqueWords, findUniqueTop5, makeChSentiments, makeDuneData, makeWordVis, makeSentVis

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Cleaning the Data

In [3]:
# Loading raw data
# Relative path: works in Colab after the chdir in Setup, and locally when run from the project root
text=loadRawData("data_raw/Dune.txt")

# Separating by chapter and tokenizing each
# Tokenizing also cleans out stop words and "said"
separated, tokenized = separateAndTokenize(text)

# Merging cleaned data
cleanData=makeCleanData(separated, tokenized)
cleanData.head(12)

Unnamed: 0,Chapter,Text,Clean Tokens
0,1,\nA beginning is the time for taking the most ...,"[beginning, time, taking, delicate, care, bala..."
1,2,\nTo attempt an understanding of Muad'Dib with...,"[attempt, understanding, muad, dib, without, u..."
2,3,"\nThus spoke St. Alia-of-the-Knife: ""The Rever...","[thus, spoke, st, alia, knife, reverend, mothe..."
3,4,\nYou have read that Muad'Dib had no playmates...,"[read, muad, dib, playmates, age, caladan, dan..."
4,5,"\nYUEH (yu'e), Wellington (weling-tun), Stdrd ...","[yueh, yu, e, wellington, weling, tun, stdrd, ..."
5,6,\nHow do we approach the study of Muad'Dib's f...,"[approach, study, muad, dib, father, man, surp..."
6,7,"\nWith the Lady Jessica and Arrakis, the Bene ...","[lady, jessica, arrakis, bene, gesserit, syste..."
7,8,"\n""Yueh! Yueh! Yueh!"" goes the refrain. ""A mil...","[yueh, yueh, yueh, goes, refrain, million, dea..."
8,9,\nMany have marked the speed with which Muad'D...,"[many, marked, speed, muad, dib, learned, nece..."
9,10,\nWhat had the Lady Jessica to sustain her in ...,"[lady, jessica, sustain, time, trial, think, c..."


In [4]:
# Save cleaned data to a csv
saveDataFrame(cleanData,'data/separated_and_tokenized.csv')

## Counting Words

### Per Chapter

In [5]:
# Make a list of sorted dicts including every word and their counts for each chapter
chCounter=chWordCounts(tokenized)

# Find the top 5 words and their counts for each chapter
# Output is two separate lists, one with the words, the other with the counts
chTop5Words, chTop5Nums = top5Chapter(chCounter)

# Show as a DataFrame
dfTop5 = pd.DataFrame({
    'Chapter': list(range(1,49)),
    'Top 5 Words': chTop5Words,
    'Word Counts': chTop5Nums
})

dfTop5.head(12)

Unnamed: 0,Chapter,Top 5 Words,Word Counts
0,1,"[paul, mother, old, hand, woman]","[47, 41, 32, 32, 27]"
1,2,"[baron, piter, feyd, rautha, duke]","[71, 63, 29, 19, 16]"
2,3,"[jessica, paul, mother, old, woman]","[33, 23, 21, 21, 17]"
3,4,"[paul, halleck, hawat, thought, gurney]","[69, 35, 31, 18, 17]"
4,5,"[paul, yueh, book, fremen, thought]","[27, 26, 11, 8, 7]"
5,6,"[duke, paul, son, father, arrakis]","[29, 28, 14, 12, 10]"
6,7,"[jessica, mapes, lady, thought, know]","[57, 37, 21, 19, 16]"
7,8,"[jessica, yueh, people, like, wellington]","[23, 17, 13, 12, 12]"
8,9,"[paul, room, could, would, seeker]","[19, 14, 10, 9, 9]"
9,10,"[jessica, door, room, paul, hawat]","[32, 26, 24, 23, 19]"
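
The per-chapter counting step can be sketched with `collections.Counter`. This is only an illustration of the idea: `top_n_per_chapter` and the toy token lists are hypothetical stand-ins, not the project's `chWordCounts` or `top5Chapter`, whose source is not shown here.

```python
from collections import Counter

def top_n_per_chapter(tokenized_chapters, n=5):
    """Return two parallel lists: the top-n words and their counts per chapter."""
    top_words, top_counts = [], []
    for tokens in tokenized_chapters:
        # most_common sorts by count; ties keep first-seen order
        ranked = Counter(tokens).most_common(n)
        top_words.append([word for word, _ in ranked])
        top_counts.append([count for _, count in ranked])
    return top_words, top_counts

# Made-up tokens standing in for the project's `tokenized` list
toy_chapters = [
    ["paul", "mother", "paul", "hand", "paul", "mother"],
    ["baron", "piter", "baron", "duke"],
]
words, counts = top_n_per_chapter(toy_chapters, n=2)
print(words)
print(counts)
```

Returning two parallel lists mirrors the shape of `chTop5Words` and `chTop5Nums` above, which makes it easy to drop both into a DataFrame as separate columns.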


### In Entire Book

In [6]:
# Make a dict containing every word count in the book
totalCount=bookWordCounts(chCounter)
print(f'Total word counts: {totalCount}')

# Find the top 10 words in the book
top10Book=bookTopX(totalCount, x=10)

# Show as a DataFrame with rank
dfBookTop10=pd.DataFrame({
    'Rank': list(range(1,11)),
    'Word': top10Book.keys(),
    'Count': top10Book.values()
})

print('Top 10 words:')

dfBookTop10

Top 10 words:


Unnamed: 0,Rank,Word,Count
0,1,paul,1711
1,2,jessica,895
2,3,one,626
3,4,thought,617
4,5,baron,582
5,6,duke,568
6,7,could,527
7,8,man,493
8,9,hawat,429
9,10,fremen,413


#### Analysis

In these results, we can see that 5 of the top 10 words refer to characters: Paul, Jessica, Baron (Baron Harkonnen), Duke (Duke Leto), and Hawat. Most of these are very important characters, although Thufir Hawat stands out as a surprising appearance by a side character. Another surprising entry is "one," at rank 3. The simplest explanation for its prevalence is that Frank Herbert uses it frequently and that it survived the stop-word filter during cleaning.

### Uniqueness

In [7]:
# Words that are only used once in the text
uniqueWords=findUniqueWords(totalCount)

print(f'One-time words: {uniqueWords}')

# Number of unique words, since there are too many to show
print(f'Number of one-time words: {len(uniqueWords)}')

One-time words: ['balances', '57th', 'locate', 'deceived', 'visit', 'vaulted', 'bulky', 'spiderwebs', 'untuned', 'slyness', 'faculties', 'thump', 'mouthed', 'quasi', 'arouses', 'predictions', 'rigidly', 'caid', 'census', 'regate', 'aortal', 'enriched', 'flooding', 'overload', 'obtain', 'extend', 'victims', 'extinct', 'destroys', 'perceptual', 'bodily', 'strive', 'permanence', 'bustle', 'emotionless', 'semiformal', 'dressing', 'syubi', 'tapestried', 'farmlands', 'petulant', 'curtsy', 'pulsed', 'gestalten', 'maternal', 'ingredient', 'enjoin', 'politeness', 'solidly', 'cube', 'compulsions', 'prickling', 'gums', 'silvery', 'musky', 'aumas', 'manageable', 'denying', 'sirra', 'headsman', 'axe', 'trapper', 'flex', 'burns', 'gasps', 'agonized', 'crisping', 'charred', 'bathing', 'withstood', 'withdrawing', 'lightless', 'volition', 'stump', 'astonished', 'induction', 'maiming', 'uncurled', 'bitten', 'axis', 'hating', 'interrupts', 'enslave', 'butlerian', 'counterfeit', 'revolt', 'emphasizes', 'm

In [14]:
# top5Appearances tracks the number of times a word ranked in the top 5 of a chapter
# uniqueTop5 contains words that only appear in the top 5 of a single chapter
top5Appearances, uniqueTop5=findUniqueTop5(chCounter)
print(top5Appearances)
print(uniqueTop5)
print(f'{len(uniqueTop5)} words only rank for one chapter.')

{'paul': 28, 'jessica': 19, 'duke': 12, 'stilgar': 10, 'hawat': 9, 'baron': 8, 'thought': 8, 'gurney': 6, 'sand': 6, 'mother': 5, 'one': 5, 'feyd': 4, 'rautha': 4, 'halleck': 4, 'yueh': 4, 'fremen': 4, 'men': 4, 'leto': 4, 'kynes': 4, 'water': 4, 'old': 3, 'woman': 3, 'piter': 3, 'father': 3, 'room': 3, 'could': 3, 'man': 3, 'chani': 3, 'emperor': 3, 'hand': 2, 'arrakis': 2, 'lady': 2, 'know': 2, 'like': 2, 'door': 2, 'lord': 2, 'rabban': 2, 'jamis': 2, 'harah': 2, 'time': 2, 'maker': 2, 'alia': 2, 'book': 1, 'son': 1, 'mapes': 1, 'people': 1, 'wellington': 1, 'would': 1, 'seeker': 1, 'must': 1, 'tooth': 1, 'sardaukar': 1, 'house': 1, 'mind': 1, 'way': 1, 'tent': 1, 'nefud': 1, 'rock': 1, 'desert': 1, 'tuek': 1, 'harkonnens': 1, 'lump': 1, 'across': 1, 'voice': 1, 'saw': 1, 'us': 1, 'basin': 1, 'troop': 1, 'count': 1, 'h': 1, 'reverend': 1, 'uncle': 1, 'worm': 1, 'life': 1, 'wall': 1, 'storm': 1, 'child': 1}
['book', 'son', 'mapes', 'people', 'wellington', 'would', 'seeker', 'must', 't

#### Analysis

Looking at the results in `uniqueWords`, it makes sense that a text as long as *Dune* would contain over 4 thousand words that are used only once.
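
Finding one-time words amounts to filtering a word-count dict for entries with a count of 1. A minimal sketch, assuming the counts are stored as a plain word-to-count mapping; `find_one_time_words` and the toy data are hypothetical and not the project's `findUniqueWords`:

```python
from collections import Counter

def find_one_time_words(total_counts):
    """Return the words whose total count is exactly 1."""
    return [word for word, count in total_counts.items() if count == 1]

# Made-up tokens standing in for the whole-book counts
toy_total = Counter(["paul", "spice", "paul", "axe", "spice", "curtsy"])
one_time = find_one_time_words(toy_total)
print(one_time)
print(f"Number of one-time words: {len(one_time)}")
```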

As for the results in `top5Appearances`, we can see that the most common high-ranking words are character names or titles, likely due to dialogue and action descriptions.
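
The appearance tally can be sketched the same way: count how many chapters each word ranks in, then keep the words that rank in exactly one. The function name and toy lists below are hypothetical; the project's `findUniqueTop5` may be implemented differently.

```python
from collections import Counter

def count_top5_appearances(chapter_top_words):
    """Tally how many chapters each word ranks in, and list words ranking only once."""
    appearances = Counter()
    for top_words in chapter_top_words:
        # Each chapter's top list holds distinct words, so each adds 1 per chapter
        appearances.update(top_words)
    ranked_once = [word for word, n in appearances.items() if n == 1]
    return appearances, ranked_once

# Shortened, made-up top lists for three chapters
toy_top = [
    ["paul", "mother", "old"],
    ["baron", "piter", "duke"],
    ["jessica", "paul", "mother"],
]
appearances, ranked_once = count_top5_appearances(toy_top)
print(appearances.most_common())
print(ranked_once)
```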

With the words we found in `uniqueTop5`, there are a few interesting outliers:

* In chapter 9, the fifth most common word is "seeker," since this chapter covers an assassination attempt on Paul via a device called a hunter-seeker.

* The fourth most common word in chapter 18 is "tooth," which is in reference to a poisoned tooth Dr. Yueh implants in Duke Leto.

* In chapter 35, the fifth most common word is "h." At first, I thought it was a miscount of some kind, but it was not. It is actually related to the fourth most common word, "count," as in Count Fenring. Whenever he speaks, he adds a fake stutter of sorts (e.g. "ah-h-h," "um-mm-m," and "mm-m-m-ah"). Since the tokenization only keeps alphanumeric characters, the hyphens were cut out, each stray "h" became its own token, and "h" was counted 42 times.
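
The "h" effect is easy to reproduce. The sketch below assumes the tokenizer behaves like a regex that keeps runs of alphanumeric characters; the project's `makeCleanTokens` is not shown, so both patterns here are illustrative assumptions. The second pattern is one possible way to keep hyphenated stutters together.

```python
import re

def alnum_tokens(text):
    """Keep only runs of letters and digits, so hyphens act as separators."""
    return re.findall(r"[a-z0-9]+", text.lower())

def hyphen_aware_tokens(text):
    """Keep hyphen-joined runs such as 'ah-h-h' as a single token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

# The three stutters quoted in the bullet above
line = "ah-h-h, um-mm-m, mm-m-m-ah"
split_tokens = alnum_tokens(line)
joined_tokens = hyphen_aware_tokens(line)
print(split_tokens)
print(joined_tokens)
```

With the first pattern, "ah-h-h" alone contributes two separate "h" tokens, which is how a single letter can climb into a chapter's top 5.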

## Sentiment Analysis

In [9]:
# Sentiment per chapter
chSentiments=makeChSentiments(separated)

# Show as a DataFrame
dfSent=pd.DataFrame({
    'Chapter': list(range(1,49)),
    'Sentiment': chSentiments
})

dfSent.head(12)

Unnamed: 0,Chapter,Sentiment
0,1,-0.9985
1,2,0.9997
2,3,-0.9608
3,4,0.9848
4,5,0.9151
5,6,0.9928
6,7,0.9922
7,8,0.9399
8,9,-0.6572
9,10,0.9952
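
Most of these scores sit very close to +1 or -1. One likely reason, assuming `makeChSentiments` scores each whole chapter in a single call and reports VADER's compound value (the function's source is not shown here): the compound score squashes a summed word valence into the range -1 to 1, and a chapter-length text produces a large sum. The sketch below reimplements that squashing step as it appears in VADER's reference implementation, with its default `alpha=15`; the raw sums are made-up inputs.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash a summed valence into (-1, 1), following VADER's normalization."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Made-up raw sums: a short sentence versus progressively longer passages
for raw in (1, 5, 20, 100, -100):
    print(f"raw sum {raw:>5} -> compound {normalize(raw):+.4f}")
```

Under that assumption, the sign of a chapter's score is more informative than its magnitude, since almost any long passage saturates toward one extreme.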


## Data

In [10]:
# Compiling data for sentiments and top 5 words for each chapter
duneData=makeDuneData(text)

# Show data from the first 4 chapters
# Chapter numbers and sentiments have been duplicated so we can show 5 separate word rankings
duneData.head(20)

Unnamed: 0,Chapter,Sentiment,Rank,Word,Count
0,1,-0.9985,1,paul,47
1,1,-0.9985,2,mother,41
2,1,-0.9985,3,old,32
3,1,-0.9985,4,hand,32
4,1,-0.9985,5,woman,27
5,2,0.9997,1,baron,71
6,2,0.9997,2,piter,63
7,2,0.9997,3,feyd,29
8,2,0.9997,4,rautha,19
9,2,0.9997,5,duke,16
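
The long layout above, with one row per chapter and rank, can be sketched as a nested loop over the per-chapter lists. `make_long_rows` is a hypothetical helper, not the project's `makeDuneData`; the sample values are the chapter 1 and 2 results shown earlier.

```python
def make_long_rows(sentiments, top_words, top_counts):
    """Build one row per (chapter, rank), repeating the chapter's sentiment."""
    rows = []
    chapters = zip(sentiments, top_words, top_counts)
    for chapter, (sentiment, words, counts) in enumerate(chapters, start=1):
        for rank, (word, count) in enumerate(zip(words, counts), start=1):
            rows.append({
                "Chapter": chapter,
                "Sentiment": sentiment,
                "Rank": rank,
                "Word": word,
                "Count": count,
            })
    return rows

rows = make_long_rows(
    sentiments=[-0.9985, 0.9997],
    top_words=[["paul", "mother", "old", "hand", "woman"],
               ["baron", "piter", "feyd", "rautha", "duke"]],
    top_counts=[[47, 41, 32, 32, 27], [71, 63, 29, 19, 16]],
)
for row in rows[:3]:
    print(row)
```

A list of dicts like this can be passed directly to `pd.DataFrame(rows)`, and the one-row-per-observation shape is what plotting libraries such as Plotly Express generally expect.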


In [11]:
# Save duneData to a csv
saveDataFrame(duneData,'data/sentiment_and_word_count.csv')

## Visualizations & Analysis

### Word Counts

In [12]:
# Bar graph of the top 5 data
figWords=makeWordVis(duneData)
figWords.show()

#### Analysis

As expected from our results for the overall text, Paul's name is frequently seen in the top 5 per chapter. This makes sense, since he is the main character and the main driver of the plot.

This also tracks for other characters, where names or titles like Jessica, Duke, Stilgar, and Baron rank highly in chapters involving those characters.

### Chapter Sentiments

In [13]:
# Bar graph of the sentiment data
figSent=makeSentVis(duneData)
figSent.show()

#### Analysis

Comparing the swings between positive and negative scores to the events of the plot, we can see that the sentiment tracks the story with decent accuracy.

**Chapter 1 (-):** Paul is given a painful, life-or-death test.

**Chapter 2 (+):** Harkonnens discuss a scheme that is going according to plan.

**Chapter 3 (-):** Jessica is scolded for having a son.

**Chapter 4-8 (+):** Plot and worldbuilding are set up as House Atreides arrives on Arrakis.

**Chapter 9 (-):** Paul is nearly assassinated.

**Chapter 10-13 (+):** The Atreides situate themselves on Arrakis, engage with the local politics, and try to win some favor with the Fremen.

**Chapter 14-15 (-):** Lady Jessica is suspected to be a traitor, and a spice crawler is eaten by a sandworm.

**Chapter 16 (+):** The Atreides hold a dinner party and engage in political discussions with other important factions in the broader universe.

**Chapter 17-19 (-):** The Harkonnens begin their attack on the Atreides.

**Chapter 20 (+):** Dr. Yueh successfully hides his deception and the false tooth he implanted in Duke Leto.

**Chapter 21-24 (-):** Duke Leto's attempt to assassinate the Baron fails, Paul learns of his father's death, Paul has a terrifying vision, and Hawat is defeated.

**Chapter 25 (+):** Duncan Idaho sacrifices himself so Paul and Jessica can escape.

**Chapter 26-27 (-):** The Baron is informed of Paul and Jessica's supposed deaths. Meanwhile, the two narrowly escape a sandstorm.

**Chapter 28 (+):** Gurney Halleck bides his time among smugglers.

**Chapter 29 (-):** Paul and Jessica escape a sandworm, but encounter hostile Fremen.

**Chapter 30-32 (+):** Dr. Kynes experiences a vision of his father before dying. Jessica uses religious manipulation to convince the Fremen that she and Paul are saviors.

**Chapter 33-34 (-):** Paul is challenged to a duel and mourns the man he kills.

**Chapter 35 (+):** Feyd-Rautha fights in a gladiator match for his birthday.

**Chapter 36-37 (-):** Lady Jessica drinks the poisonous Water of Life, which affects her unborn daughter and grants her massive amounts of political power as a religious leader of the Fremen.

**Chapter 38 (+):** The Baron tells Feyd-Rautha about a plan that could make him emperor.

**Chapter 39-40 (-):** Hawat tells the Baron how strong the Fremen are. Meanwhile, Alia is regarded as a demon by the Fremen, and Paul reflects on the inevitability of the terrible future he has seen.

**Chapter 41-42 (+):** Jessica speaks with Fremen women she is close with and Paul successfully rides a massive sandworm.

**Chapter 43-48 (-):** The novel reaches its climax, as Paul leads an attack with the Fremen, becomes emperor, and locks himself onto a path that will turn him into a dictator.

## References

Herbert, Frank. *Dune*. Ace Books, 2005.