# LING 380 - Assignment 1 (Group 13)

### Group Members & Responsibilities

| Name (Last, First) | Student ID | Section Contributed | Section Edited | Other Contributions |
|---|---|---|---|---|
| Miguel, Matthew | 301422631 | Data collection, cleaning | Code review | Set up GitHub repo |
| Intanon, Supamongkol | 301541005 | Analysis code (token count, diversity, frequency) | Final formatting | Group communication |
| Member 3 | (student number) | absent | absent | absent |


## Import Libraries

In [45]:
!pip install nltk
import os
import csv
import re
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk import FreqDist
nltk.download('all')
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000




## Data Collection Process

| Genre | Source | Link |
|---|---|---|
| Fiction | Project Gutenberg - Pride and Prejudice, Jane Austen | [https://www.gutenberg.org/ebooks/1342](https://www.gutenberg.org/ebooks/1342) |
| Academic Essays | Project Gutenberg - Humanistic Studies of the University of Kansas | [https://www.gutenberg.org/ebooks/51685](https://www.gutenberg.org/ebooks/51685) |
| Autobiographies | Project Gutenberg - My Life, Volume 1, Richard Wagner | [https://www.gutenberg.org/ebooks/5197](https://www.gutenberg.org/ebooks/5197) |

## Data Cleaning Functions

In [47]:
print(os.path.exists("./data/pg1342.txt"))
print(os.path.exists("./data/pg51685.txt"))
print(os.path.exists("./data/pg5197.txt"))

True
True
True


In [49]:
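def remove_gutenberg_metadata(text):
    """Strip the Project Gutenberg header and footer around the book text."""
    # Drop everything up to and including the "*** START OF ... ***" marker.
    text = re.sub(r'^.*?\*\*\* START OF.*?\*\*\*', '', text, flags=re.DOTALL)
    # Drop the "*** END OF ... ***" marker and the licence text that follows it.
    text = re.sub(r'\*\*\* END OF.*?\*\*\*.*$', '', text, flags=re.DOTALL)
    return text.strip()

def load_and_clean_file(filepath):
    """Read a UTF-8 text file and return its contents without Gutenberg metadata."""
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
    return remove_gutenberg_metadata(text)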
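
To check what the cleaning step actually removes, the function can be run on a tiny made-up file. The sketch below repeats the same two substitutions so that it runs on its own; the `sample` string is invented for illustration and is not taken from any of the three books.

```python
import re

def remove_gutenberg_metadata(text):
    # Same two substitutions as the notebook cell above.
    text = re.sub(r'^.*?\*\*\* START OF.*?\*\*\*', '', text, flags=re.DOTALL)
    text = re.sub(r'\*\*\* END OF.*?\*\*\*.*$', '', text, flags=re.DOTALL)
    return text.strip()

# Hypothetical miniature file, invented for illustration only.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\n"
    "It is a truth universally acknowledged.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows here.\n"
)

cleaned = remove_gutenberg_metadata(sample)
print(cleaned)
```

Only the text outside the two markers is removed. Title pages, prefaces, and transcriber notes that sit between the markers stay in the cleaned text, which is consistent with the front-matter entities that appear in the analysis output.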


#### File paths and loading data

In [51]:
data_folder = "./data/"
files = {
    "fiction": os.path.join(data_folder, "pg1342.txt"),
    "academic": os.path.join(data_folder, "pg51685.txt"),
    "autobiography": os.path.join(data_folder, "pg5197.txt")
}

texts = {genre: load_and_clean_file(path) for genre, path in files.items()}


## Analysis Functions

In [53]:
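def token_count(text):
    """Return the number of NLTK tokens (words and punctuation) in the text."""
    tokens = word_tokenize(text)
    return len(tokens)

def lexical_diversity(text):
    """Return the type-token ratio: unique tokens divided by total tokens."""
    # Tokens are case-sensitive and include punctuation.
    tokens = word_tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def top_10_words(text):
    """Return the ten most frequent lowercased alphabetic tokens with their counts."""
    tokens = word_tokenize(text)
    words_only = [word.lower() for word in tokens if word.isalpha()]  # only words, no punctuation
    freq_dist = FreqDist(words_only)
    return freq_dist.most_common(10)

def extract_named_entities(text):
    """Return the text of every entity span spaCy finds, in document order."""
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return entities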


#### Apply analysis to each genre

In [55]:
analysis_results = {}

for genre, text in texts.items():
    analysis_results[genre] = {
        "token_count": token_count(text),
        "lexical_diversity": lexical_diversity(text),
        "top_10_words": top_10_words(text),
        "named_entities": extract_named_entities(text)
    }

for genre, result in analysis_results.items():
    print(f"\n=== Analysis for {genre} ===")
    print(f"Token Count: {result['token_count']}")
    print(f"Lexical Diversity: {result['lexical_diversity']:.4f}")
    print(f"Top 10 Words: {result['top_10_words']}")
    print(f"Sample Named Entities: {result['named_entities'][:10]}")



=== Analysis for fiction ===
Token Count: 151020
Lexical Diversity: 0.0540
Top 10 Words: [('the', 4654), ('to', 4296), ('of', 3836), ('and', 3751), ('her', 2248), ('i', 2097), ('a', 2033), ('in', 1975), ('was', 1868), ('she', 1732)]
Sample Named Entities: ['GEORGE ALLEN\n                               ', '156', 'LONDON', 'Reading Jane’s Letters', '34', 'Jane Austen', 'Preface', 'George Saintsbury', 'Hugh Thomson', '1894']

=== Analysis for academic ===
Token Count: 131641
Lexical Diversity: 0.1090
Top 10 Words: [('the', 6896), ('of', 5179), ('in', 2946), ('and', 2648), ('is', 2600), ('to', 2195), ('a', 2180), ('that', 1192), ('it', 1156), ('as', 1092)]
Sample Named Entities: ['TRANSCRIBER', 'four', 'two', 'three', 'four', 'two', 'the Table of Contents', 'Indexes', 'KANSAS\n  PUBLISHED BY THE UNIVERSITY', '1915']

=== Analysis for autobiography ===
Token Count: 245267
Lexical Diversity: 0.0597
Top 10 Words: [('the', 13584), ('of', 8521), ('to', 8076), ('and', 5338), ('i', 5301), ('in',

## Final Reflection

### a) Most Frequent Words - What do they tell you?
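- Fiction: Alongside common function words, the pronouns "her" (2,248) and "she" (1,732) appear in the top ten, showing a focus on characters and narration.
- Academic: The top ten consists entirely of function words ("the", "of", "in", "is", "that"), with "it" as the only pronoun, which fits a formal, analytical style.
- Autobiography: "I" ranks fifth with 5,301 occurrences and "my" is also frequent, emphasizing personal reflection.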
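
Because function words dominate all three lists, one way to bring out genre-specific vocabulary is to filter them before counting. The following is a minimal sketch, not the method used in this notebook: `STOP_WORDS` is a deliberately tiny, hypothetical list, `top_content_words` is an illustrative helper, and a simple regex stands in for `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "to", "of", "and", "a", "in", "is", "it", "that", "as", "was"}

def top_content_words(text, n=5):
    """Count lowercased alphabetic words that are not in STOP_WORDS."""
    words = re.findall(r"[a-z]+", text.lower())  # regex stand-in for word_tokenize
    content = [w for w in words if w not in STOP_WORDS]
    return Counter(content).most_common(n)

# Invented sentence, for illustration only.
sample = "The letter was in the hall, and the letter was read to the family in the hall."
result = top_content_words(sample, n=2)
print(result)
```

Pronouns such as "her" and "I" are kept here on purpose, since they carry the genre signal discussed above; whether to treat them as stop words is a design choice.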

### b) Named Entities - What do they reveal?
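- Fiction: The sample captured author and publication details (e.g., "Jane Austen", "George Saintsbury", "1894") but no characters, because the first ten entities all come from the front matter.
- Academic: The sample consists of publication metadata and bare numerals ("TRANSCRIBER", "four", "the Table of Contents") instead of key academic terms.
- Autobiography: Captured names, dates, and events, which fits the personal nature of the genre.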

### c) Named Entity Accuracy
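- Fiction: Partly correct. "Jane Austen" and "LONDON" are valid entities, but spaCy also tags front-matter fragments such as "Reading Jane’s Letters" and page numbers ("156", "34").
- Academic: Partly correct. It catches publication details but misses key academic terms and authors, and it tags headings such as "TRANSCRIBER" and "the Table of Contents" as entities, which suggests the spaCy model isn’t well tuned for academic texts.
- Autobiography: Mostly correct, as names and historical periods were identified well.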
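
Some of this noise can be reduced after extraction. The sketch below is one possible post-processing step, not part of the original analysis: `clean_entities` is a hypothetical helper that collapses whitespace, drops purely numeric strings, and removes duplicates. The sample strings are copied from the printed fiction and academic outputs.

```python
def clean_entities(entities):
    """Normalise whitespace, drop purely numeric strings, de-duplicate in order."""
    seen = set()
    cleaned = []
    for ent in entities:
        ent = " ".join(ent.split())      # collapse newlines and padding
        if not ent or ent.isdigit():     # skip empty strings and bare numbers
            continue
        if ent not in seen:
            seen.add(ent)
            cleaned.append(ent)
    return cleaned

# Entity strings taken from the printed samples above.
sample = [
    'GEORGE ALLEN\n                               ', '156', 'LONDON', '34',
    'Jane Austen', '1894', 'TRANSCRIBER', 'four', 'two', 'three', 'four', 'two',
]
result = clean_entities(sample)
print(result)
```

Dropping every numeric string also discards years such as "1894", and spelled-out numbers like "four" survive. Filtering on spaCy's entity label (`ent.label_`) instead of the raw text would likely be more precise, at the cost of keeping the label alongside each entity.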

### d) Interpreting Lexical Diversity
- Fiction (0.0540): Lowest of the three, consistent with narrative prose that keeps returning to the same characters and everyday vocabulary.
- Academic (0.1090): Highest diversity, roughly double the other two, due to technical terminology.
- Autobiography (0.0597): Similar to fiction, as personal stories often repeat key names and places.
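
These ratios should be compared with some caution: type-token ratio tends to fall as a text gets longer, and the three texts range from 131,641 to 245,267 tokens. A common way to control for length is to average the ratio over fixed-size chunks. The sketch below illustrates that idea under stated assumptions: `standardized_ttr` and `chunk_size` are illustrative names, a regex stands in for `word_tokenize`, and the repeated sentence is invented.

```python
import re

def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def standardized_ttr(text, chunk_size=1000):
    """Mean type-token ratio over consecutive full chunks of chunk_size tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())  # regex stand-in for word_tokenize
    chunks = [
        tokens[i:i + chunk_size]
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:  # text shorter than one chunk: fall back to the plain ratio
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

# Invented example: the same sentence once, and repeated 50 times.
short_text = "the cat sat on the mat"
long_text = " ".join([short_text] * 50)

raw_short = type_token_ratio(short_text.split())
raw_long = type_token_ratio(long_text.split())
std_short = standardized_ttr(short_text, chunk_size=6)
std_long = standardized_ttr(long_text, chunk_size=6)
print(raw_short, raw_long, std_short, std_long)
```

The plain ratio drops sharply for the repeated text even though its vocabulary is unchanged, while the chunked version stays the same for both. Applied to the three books, a chunked measure would show how much of the gap between genres remains once length is held constant.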
