| Name (Last, First)     | Student ID | Section contributed                               | Section edited    | Other contributions                 |
|------------------------|------------|---------------------------------------------------|-------------------|-------------------------------------|
| Tjokrowardojo, Michael | 301416843  | Data collection, GitHub syncing                   | Code and write-up | Troubleshooting spaCy               |
| Ng, Ryan               | 3014423701 | Data collection, function for most frequent words | Code              | Communication and task coordination |


## References

- Podcast: https://www.youtube.com/watch?v=Ex3OQhsVaiY
- Reading: https://www.gutenberg.org/ebooks/2160
- Bee Movie Script: https://gist.github.com/MattIPv4/045239bc27b16b2bcf7a3a9a4648c08a

## Import Libraries

In [2]:
import os
import pandas as pd

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import FreqDist

import spacy
nlp = spacy.load("en_core_web_sm")

## Functions to read and process the files

In [4]:
def process_dir(path):
    """
    Scan a directory for .txt files and parse the content of each one with spaCy.

    Args:
        path (str): The directory path containing text files.

    Returns:
        dict: A dictionary where file names are keys and the spaCy Doc objects produced from each file's text are values.
    """
    file_info = {}

    for filename in os.listdir(path):
        if filename.endswith(".txt"):    
            file_path = os.path.join(path, filename)      
            with open(file_path, 'r', encoding="utf-8") as f:
                text = f.read()
                file_info[filename] = nlp(text)
    return file_info

## Read and Process Files

In [5]:
path = './data'

files_in_dir_info = process_dir(path)

## Count Tokens and Lexical Diversity

In [None]:
# Loop through the dictionary to process tokens and types for each file
for filename, doc in files_in_dir_info.items():
    # Extract tokens (every spaCy token, punctuation included) and types (unique, case-sensitive token strings)
    tokens = [token.text for token in doc]
    types = set(tokens)
    
    # Compute lexical diversity
    lexical_diversity = len(types) / len(tokens) if len(tokens) > 0 else 0  # Avoid division by zero

    # Print file name
    print(f"File: {filename}")

    # Print number of tokens and lexical diversity
    print("\nNumber of tokens:", len(tokens))
    print("Lexical Diversity:", lexical_diversity)
    print("\n" + "="*40 + "\n")


File: BeeMovieScript.txt

Number of tokens: 12687
Lexical Diversity: 0.16347442263734532


File: Podcast.txt

Number of tokens: 28052
Lexical Diversity: 0.11678311706830173


File: TheExpeditionofHumphryClinker.txt

Number of tokens: 191955
Lexical Diversity: 0.07493943893099946




## 10 Most Frequent Words (Not Complete)

In [9]:
# Define a function to remove punctuation
def clean_word(word):
    return ''.join(char for char in word if char.isalnum())  # Keeps only letters and numbers

# Loop through the dictionary to count word frequencies in each file
for filename, doc in files_in_dir_info.items():
    # Extract tokens and clean them (remove punctuation)
    tokens = [clean_word(token.text.lower()) for token in doc if clean_word(token.text)]  # Convert to lowercase and remove empty strings

    # Count occurrences of each word using a dictionary
    word_counts = {}
    for word in tokens:
        word_counts[word] = word_counts.get(word, 0) + 1  # Increment count

    # Get the 10 most common words by sorting
    most_common_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:10]

    # Print file name
    print(f"File: {filename}")

    # Print 10 most common words
    print("\n10 Most Common Words (No Punctuation):")
    for word, count in most_common_words:
        print(f"{word}: {count}")

    print("\n" + "="*40 + "\n")


File: BeeMovieScript.txt

10 Most Common Words (No Punctuation):
i: 349
you: 327
a: 259
the: 259
it: 239
s: 214
to: 191
nt: 139
that: 138
of: 134


File: Podcast.txt

10 Most Common Words (No Punctuation):
the: 948
you: 831
to: 760
i: 746
s: 705
and: 681
that: 661
a: 509
it: 480
of: 391


File: TheExpeditionofHumphryClinker.txt

10 Most Common Words (No Punctuation):
the: 8611
of: 5674
and: 5066
to: 4442
a: 3737
in: 3205
i: 2675
he: 2109
his: 2009
that: 1897
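
The entries `s` and `nt` in the lists above most likely come from contractions: spaCy splits forms such as "it's" and "isn't" into separate tokens (`'s`, `n't`), and `clean_word` then strips the apostrophe. Below is a minimal sketch, assuming the words have already been lowercased and cleaned, of counting with `collections.Counter` while skipping such fragments. The name `top_words` and the `CONTRACTION_FRAGMENTS` set are hypothetical choices for illustration, and the set itself is an assumption that may need adjusting for a given text.

```python
from collections import Counter

# Assumed set of leftovers from contraction tokens once apostrophes are removed
# (e.g. "'s" -> "s", "n't" -> "nt"); this list is illustrative, not exhaustive.
CONTRACTION_FRAGMENTS = {"s", "nt", "re", "ve", "ll", "m", "d"}


def top_words(words, n=10, skip=CONTRACTION_FRAGMENTS):
    """Return the n most common words, ignoring empty strings and skipped fragments."""
    counts = Counter(w for w in words if w and w not in skip)
    return counts.most_common(n)


# Hypothetical toy input standing in for the cleaned, lowercased tokens
sample = ["it", "s", "a", "bee", "is", "nt", "it", "a", "bee", "bee"]

for word, count in top_words(sample, n=3):
    print(f"{word}: {count}")
```

`Counter.most_common` replaces the manual dictionary and `sorted` call. Whether to drop the fragments or merge them back into their host words is a design choice this sketch leaves open.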




## List of Named Entities Using SpaCy


## Reflection