| Name (Last, First)     | Student ID | Section contributed                               | Section edited    | Other contributions                 |
|------------------------|------------|---------------------------------------------------|-------------------|-------------------------------------|
| Tjokrowardojo, Michael | 301416843  | Data collection, GitHub syncing                   | Code and write-up | Troubleshooting spaCy               |
| Ng, Ryan               | 3014423701 | Data collection, function for most frequent words | Code              | Communication and task coordination |


## References

- Podcast: https://www.youtube.com/watch?v=Ex3OQhsVaiY
- Reading: https://www.gutenberg.org/ebooks/2160
- Bee Movie Script: https://gist.github.com/MattIPv4/045239bc27b16b2bcf7a3a9a4648c08a

## Import Libraries

In [24]:
import os
import pandas as pd

import re

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import FreqDist

import spacy
nlp = spacy.load("en_core_web_sm")

## Functions to read and process the files

In [25]:
def process_dir(path):
    """
    Scans a directory for .txt files and parses the content of each one with spaCy.

    Args:
        path (str): The directory path containing text files.

    Returns:
        dict: A dictionary mapping each file name to the spaCy Doc object built from that file's text.
    
    """
    file_info = {}

    for filename in os.listdir(path):
        if filename.endswith(".txt"):    
            file_path = os.path.join(path, filename)      
            with open(file_path, 'r', encoding="utf-8") as f:
                text = f.read()
                file_info[filename] = nlp(text)
    return file_info

## Read and Process Files

In [26]:
path = './data'

files_in_dir_info = process_dir(path)

## Count Tokens and Lexical Diversity

In [27]:
# Loop through the dictionary to process tokens and types for each file
for filename, doc in files_in_dir_info.items():
    # Extract tokens and types (spaCy tokens include punctuation and whitespace tokens)
    tokens = [token.text for token in doc]
    types = set(tokens)
    
    # Compute lexical diversity
    lexical_diversity = len(types) / len(tokens) if len(tokens) > 0 else 0  # Avoid division by zero

    # Print file name
    print(f"File: {filename}")

    # Print number of tokens and lexical diversity
    print("\nNumber of tokens:", len(tokens))
    print("Lexical Diversity:", lexical_diversity)
    print("\n" + "="*40 + "\n")


File: Podcast.txt

Number of tokens: 28052
Lexical Diversity: 0.11678311706830173


File: BeeMovieScript.txt

Number of tokens: 13727
Lexical Diversity: 0.15130764187367962


File: TheExpeditionofHumphryClinker.txt

Number of tokens: 188341
Lexical Diversity: 0.07439166193234611
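
The three files differ greatly in length, and a plain type-token ratio tends to fall as a text gets longer, so the figures above are not directly comparable. One possible way to make them more comparable is to average the ratio over fixed-size windows. The block below is a minimal sketch of that idea, not part of the original analysis; the function names, the `window` parameter, and the toy sentence are all assumptions made for illustration.

```python
def type_token_ratio(tokens):
    """Return distinct tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_segmental_ttr(tokens, window=1000):
    """Average the type-token ratio over consecutive, non-overlapping windows.

    Only full windows are used. If the text is shorter than one window,
    fall back to the plain type-token ratio.
    """
    segments = [
        tokens[i:i + window]
        for i in range(0, len(tokens) - window + 1, window)
    ]
    if not segments:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(seg) for seg in segments) / len(segments)


# Toy example (hypothetical sentence, not taken from the three files)
sample = "the bee flew to the flower and the bee flew home".split()
print(type_token_ratio(sample))              # 7 types over 11 tokens
print(mean_segmental_ttr(sample, window=5))  # mean over two 5-token windows
```

To try this on the notebook's data, one option would be to pass a list such as `[t.lower_ for t in doc if t.is_alpha]`, which also leaves out punctuation and case differences; whether that is the right normalization for this assignment is a judgment call.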




## 10 Most Frequent Words

In [28]:
# Define a function to turn all the words into lowercase
def clean_text(text):
    text = text.lower() # lowercase
    return text

In [32]:
for filename, doc in files_in_dir_info.items():
    # lowercase text
    cleaned_text = clean_text(doc.text)

    # tokenize on whitespace with split() so contractions are not split apart (don't, I'm, etc.)
    tokens = cleaned_text.split()

    # count all word frequencies including punctuation
    fq = FreqDist(tokens)

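    # keep only tokens made entirely of word characters; besides bare punctuation,
    # this also drops words with punctuation attached ("word,") and contractions ("don't")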
    filtered_fq = {word: count for word, count in fq.items() if re.match(r"^\w+$", word)}

    # get top 10 most common words
    most_common_words = sorted(filtered_fq.items(), key=lambda x: x[1], reverse=True)[:10]

    # print result for each file
    print(f"File: {filename}")
    print("10 Most Common Words:")
    for word, count in most_common_words:
        print(f"{word}: {count}")

    print("\n" + "=" * 40 + "\n")

File: Podcast.txt
10 Most Common Words:
the: 948
to: 760
you: 712
and: 681
i: 570
that: 515
a: 509
of: 391
is: 379
this: 340


File: BeeMovieScript.txt
10 Most Common Words:
a: 258
the: 258
you: 235
i: 235
to: 189
of: 134
and: 107
is: 104
this: 94
it: 92


File: TheExpeditionofHumphryClinker.txt
10 Most Common Words:
the: 8331
of: 5540
and: 4750
to: 4325
a: 3580
in: 3085
i: 2367
his: 1993
he: 1895
that: 1793
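
Because the filter in the cell above discards any token that contains a non-word character, words followed by a comma or period and contractions are not counted. The block below is a minimal sketch of one alternative that extracts words with a regular expression and counts them with `collections.Counter`. The pattern `WORD_RE`, the helper `top_words`, and the sample sentence are illustrative assumptions, not the method used to produce the counts above.

```python
import re
from collections import Counter

# Letters/digits, optionally followed by an apostrophe part ("don't", "i'm").
# This pattern is an illustrative assumption, not the tokenizer used above.
WORD_RE = re.compile(r"[a-z0-9]+(?:['’][a-z]+)?")


def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    words = WORD_RE.findall(text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample sentence
sample = "Don't stop. I'm sure you don't mind, do you?"
for word, count in top_words(sample, n=3):
    print(f"{word}: {count}")
```

Applied to the notebook's data, this would be called as `top_words(doc.text)` inside the loop; the resulting counts would likely differ from those printed above because punctuation-attached words are now included.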




## List of Named Entities Using SpaCy


In [31]:
for filename, doc in files_in_dir_info.items():
    # Extract named entities; doc is already a processed spaCy Doc,
    # so there is no need to run nlp() on it again
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    # Print results for each file
    print(f"File: {filename}")
    print("\nNamed Entities Found:")
    
    for entity, label in entities:
        print(f"{entity} → {label}")

    print("\n" + "=" * 40 + "\n")

File: Podcast.txt

Named Entities Found:
today → DATE
Mexico Canada → ORG
Trump → ORG
Dow → ORG
650 → CARDINAL
yesterday → DATE
Bloomberg → ORG
122,000 → MONEY
Trump → ORG
Mexico → GPE
Trump → ORG
Trump → ORG
Pardon Pete Rose Pete Rose → ORG
Pete → PERSON
the Hall of Fame → FAC
Pardon → GPE
Trump → ORG
zilinsky → PERSON
Ukraine → GPE
Trump → ORG
Clash → GPE
British → NORP
Kier → PERSON
London → GPE
Trump → ORG
Ukraine → GPE
Bill Mah → PERSON
Democrat → NORP
Democrat → NORP
AOC → ORG
NPR → ORG
Andrew KO → PERSON
New York City → GPE
California → GPE
Harris → PERSON
days → DATE
Tim Waltz → PERSON
2028 → DATE
Tim Waltz → PERSON
2028 → DATE
America → GPE
April → DATE
2028 → CARDINAL
Harris → ORG
Rah Emanuel → PERSON
Chicago → GPE
one → CARDINAL
Buffett → PERSON
Trump → PERSON
the Tooth Fairy → ORG
Microsoft → ORG
Pioneer → ORG
Skype → ORG
Skype → ORG
Vinnie → PERSON
one → CARDINAL
Sergey Brin → PERSON
60 hours → TIME
Jamie Diamond → PERSON
a couple weeks ago → DATE
Florida → GPE
California 
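
A long flat list like the one above is hard to read at a glance. As a rough sketch, the `(text, label)` pairs could be summarized by counting how often each label and each entity occurs. The small `entities` list below copies the first six lines of the Podcast.txt output purely for illustration; in the notebook it would be the `entities` list built inside the loop.

```python
from collections import Counter

# (text, label) pairs in the same shape as the `entities` list built above;
# these six values are copied from the first lines of the Podcast.txt output.
entities = [
    ("today", "DATE"),
    ("Mexico Canada", "ORG"),
    ("Trump", "ORG"),
    ("Dow", "ORG"),
    ("650", "CARDINAL"),
    ("yesterday", "DATE"),
]

# How many entities carry each label
label_counts = Counter(label for _, label in entities)

# How many times each distinct (text, label) pair occurs
entity_counts = Counter(entities)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

A summary like this also makes labeling problems easier to spot, such as the same name appearing under more than one label.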

## Reflection