# **Assignment 1**

|  Name | Student ID | Section Contributed | Section Edited | Other Contributions |
| --- | --- | --- | --- | --- |
| Dueck, Ellie | 301462367 | Word Freq, Named entities, Reflection | all | Notebook Formatting |
| Flett, Iain | 301581520 | Data Collection, Tokens, Lexical Diversity | all | Notebook Creation |

**References**
<br>
Meditations: https://www.gutenberg.org/cache/epub/2680/pg2680.txt
<br>
Star Trek: https://www.scifiscripts.com/scripts/startrek2_wrathofkhan.txt
<br>
Winnie the Pooh: https://www.gutenberg.org/cache/epub/67098/pg67098.txt

In [49]:
import os
import nltk
import numpy
import re
import matplotlib
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import FreqDist
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")

### **Length and Lexical Diversity:**

In [43]:
def get_text_info(text):
    """
    Uses NLTK to count the tokens and types in a text.
    Lexical diversity (types / tokens) is derived from these counts in a later cell.
    
    Args:
        text (str): a string containing the file or text
        
    Returns: 
        dict: a dictionary containing the token count and the type count
    """
    tokens = nltk.word_tokenize(text)
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    return {
        'tokens': n_tokens,
        'types': n_types,
    }

def process_dir(path):
    """
    Reads all the files in a directory. Processes them using the 'get_text_info' function
    
    Args: 
        path (str): path to the directory where the files are
        
    Returns:
        dict: a dictionary with file names as keys and the token and type counts as values
    
    """
    file_info = {}

    for filename in os.listdir(path):
        if filename.endswith(".txt"):    
            file_path = os.path.join(path, filename)      
            with open(file_path, 'r', encoding="utf-8") as f:
                text = f.read()
                file_info[filename] = get_text_info(text)
    return file_info

In [45]:
path = './data'

filesInfo = process_dir(path)


In [47]:
df = pd.DataFrame.from_dict(filesInfo, orient='index')
df

| | tokens | types |
| --- | --- | --- |
| Meditations_Marcus_Aurelius.txt | 81803 | 6602 |
| StarTrekII.txt | 22065 | 3673 |
| Winnie_the_Pooh_AA_Milne.txt | 29936 | 2459 |


In [49]:
df['lex_div'] = df['types']/df['tokens']
df

| | tokens | types | lex_div |
| --- | --- | --- | --- |
| Meditations_Marcus_Aurelius.txt | 81803 | 6602 | 0.080706 |
| StarTrekII.txt | 22065 | 3673 | 0.166463 |
| Winnie_the_Pooh_AA_Milne.txt | 29936 | 2459 | 0.082142 |


### **The top 10 most frequent words and their counts:**

In [52]:
def text_cleaner(text):
    """Removes unwanted punctuation from text."""
    return re.sub(r'[,."”“*)(\-!?]', '', text)

path = './data'
word_freq = {}

# read each file, clean it, and count its 10 most frequent tokens
for filename in os.listdir(path):
    if filename.endswith(".txt"):
        file_path = os.path.join(path, filename)
        with open(file_path, 'r', encoding="utf-8") as f:
            text = text_cleaner(f.read())
        tokens = word_tokenize(text)
        word_freq[filename] = FreqDist(tokens).most_common(10)

In [54]:
print("\n".join(f"\n{filename}: {words}" for filename, words in word_freq.items()))


Meditations_Marcus_Aurelius.txt: [('and', 3049), ('the', 2400), ('of', 2281), ('to', 1916), ('that', 1882), ('is', 1394), ('it', 1118), ('in', 1060), ('a', 993), ('be', 963)]

StarTrekII.txt: [('the', 564), ('to', 354), ('KIRK', 271), ('and', 263), ('a', 233), ('of', 227), ('I', 206), ('is', 203), ('you', 191), ("'s", 172)]

Winnie_the_Pooh_AA_Milne.txt: [('and', 792), ('the', 652), ('he', 576), ('said', 539), ('to', 537), ('a', 502), ('it', 479), ('I', 453), ('of', 377), ('Pooh', 351)]
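
The top-10 lists above are dominated by function words, and case is not normalised before counting. As a minimal sketch of one way to look past that, the snippet below lowercases the text and skips a small set of function words before counting. The `FUNCTION_WORDS` set and the `top_content_words` helper are hypothetical names introduced here for illustration, the word list is hand-picked rather than taken from any library, and the regex tokeniser is a stand-in for `word_tokenize` so that the example runs without NLTK data.

```python
import re
from collections import Counter

# Hypothetical, hand-picked function-word list for illustration only;
# a fuller stop-word list would be needed for real analysis.
FUNCTION_WORDS = {
    "a", "an", "and", "the", "of", "to", "that", "is", "it", "in", "be",
    "i", "you", "he", "she", "they", "we", "was", "for", "with", "as",
}

def top_content_words(text, n=10):
    """Return the n most frequent lowercased words not in FUNCTION_WORDS."""
    # Simple regex tokenisation (an assumption; the notebook uses word_tokenize)
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in FUNCTION_WORDS)
    return counts.most_common(n)

sample = "The bear said to the bee that the honey is his. The bee said no."
print(top_content_words(sample, n=3))
```

Applied to each file in `./data`, a filter like this would likely surface more genre-specific vocabulary than the raw counts do, although the result depends heavily on which words are treated as function words.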


### **Named Entities:**

In [86]:
ent_dfs = {}

path = './data'

for filename in os.listdir(path):
    if filename.endswith(".txt"):
        file_path = os.path.join(path, filename)
        with open(file_path, 'r', encoding="utf-8") as f:
            text = f.read()
        doc = nlp(text)

        # collect each entity's text and label
        named_ents = [(ent.text, ent.label_) for ent in doc.ents]

        ent_dfs[filename] = pd.DataFrame(named_ents, columns=['entity', 'label'])
            
for filename, df in ent_dfs.items():
    print(f"DataFrame for {filename}:")
    print(df)
    print("\n")

DataFrame for Meditations_Marcus_Aurelius.txt:
                                      entity   label
0     NTH BOOK\n\n     TWELFTH BOOK\n\n          ORG
1                  MARCUS AURELIUS ANTONINUS     ORG
2                                   April 26    DATE
3                                   A.D. 121    DATE
4                            M. Annius Verus     ORG
...                                      ...     ...
1496                                   Roman    NORP
1497                                   Roman     ORG
1498                                  Marcus     ORG
1499                                  Fronto  PERSON
1500                                  Fronto     ORG

[1501 rows x 2 columns]


DataFrame for StarTrekII.txt:
               entity     label
0                KHAN    PERSON
1     Jack B. Sowards    PERSON
2               Story    PERSON
3       Harve Bennett    PERSON
4     Jack B. Sowards    PERSON
...               ...       ...
2567              274  CARDINAL
2568

### **Reflection**
&emsp;Our dataset comprises a screenplay from 1982, _Star Trek II_; a children’s novel from 1926, _Winnie-the-Pooh_; and a nearly 2000-year-old book originally written in Koine Greek, _Meditations_. 
<br>
&emsp;Notably, across all genres the most frequent words in each text are function words. These words carry little lexical meaning of their own but supply the grammatical structure of a phrase. It makes sense that they are the most common, as they are present in almost all sentences regardless of genre or medium. The only other type of word to appear in the top-10 lists is character names ("KIRK" and "Pooh"). These appear only in Winnie the Pooh and Star Trek, which is unsurprising: both are narratives, whereas Meditations is closer to a journal of the author's musings, so no name is mentioned often enough to reach its top 10. Similarly, in the Named Entities section, the narratives showed a higher prevalence of names in the output than Meditations did. More errors also seem to have been made with Meditations. This could be due to the age of the text, as language has changed significantly since it was written. For example, the output labels the author's name, Marcus Aurelius Antoninus, as an organisation. This could be explained by the fact that a name like this is no longer common and therefore might not match the patterns the model learned from its training data. In truth, all three texts have mistakes in their named entities, even in the few examples shown in the output. One error that occurred multiple times is the same token being assigned two different entity labels (for example, "Fronto" as both PERSON and ORG). This shows that, while named entity recognition can have major benefits, its results must be examined critically before the data can be used.
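
One way to check how often the same text receives more than one entity label is sketched below. It assumes a list of `(text, label)` tuples shaped like the `named_ents` list built in the Named Entities cell; the `conflicting_labels` helper is a hypothetical name introduced here, and the sample rows are copied from the Meditations output shown earlier.

```python
from collections import defaultdict

def conflicting_labels(named_ents):
    """Map each entity text to its labels, keeping only texts with 2+ labels."""
    labels_by_text = defaultdict(set)
    for text, label in named_ents:
        labels_by_text[text].add(label)
    return {text: labels for text, labels in labels_by_text.items()
            if len(labels) > 1}

# Sample rows copied from the Meditations output shown earlier
sample_ents = [
    ("Roman", "NORP"), ("Roman", "ORG"), ("Marcus", "ORG"),
    ("Fronto", "PERSON"), ("Fronto", "ORG"),
]
print(conflicting_labels(sample_ents))
```

A text with two labels is not always an error, since the same string can refer to different things in different contexts, so the output is best read as a list of candidates for manual review.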
<br>
&emsp;Star Trek was the most lexically diverse file, with a score of 0.166. Meditations and Winnie the Pooh had very similar scores of 0.081 and 0.082 respectively. This difference may come down to genre. Star Trek is a screenplay, meaning that every utterance is likely more or less unique, without the need for all the “she/he/they said” and other repetitive descriptors used to describe a scene. Text length may also play a part: the type-token ratio tends to fall as a text gets longer, and Star Trek is the shortest of the three files (22,065 tokens). The place where a screenplay would likely have a fair bit of repetition is in the names used to show who is speaking. This is evidenced by “KIRK” being the third most frequent token in the Star Trek screenplay. 
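
Because the three files differ considerably in token count, a length-controlled measure may give a fairer comparison than the plain types/tokens ratio. The sketch below computes a mean segmental type-token ratio: the ratio is taken over consecutive windows of equal size and then averaged. This is not the method used in the notebook; `mean_segmental_ttr` and the default window of 1000 tokens are assumptions made here for illustration.

```python
def mean_segmental_ttr(tokens, window=1000):
    """Average type-token ratio over consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    ratios = []
    # Only full windows are used, so a trailing partial window is dropped
    for start in range(0, len(tokens) - window + 1, window):
        segment = tokens[start:start + window]
        ratios.append(len(set(segment)) / window)
    if not ratios:
        raise ValueError("text is shorter than one window")
    return sum(ratios) / len(ratios)

# Toy example: the plain ratio is 4 types / 8 tokens, but each window of
# four tokens contains four distinct types
toy_tokens = "a b c d a b c d".split()
print(len(set(toy_tokens)) / len(toy_tokens))
print(mean_segmental_ttr(toy_tokens, window=4))
```

Passing the output of `nltk.word_tokenize` for each file with the same `window` would put all three texts on an equal footing with respect to length, at the cost of ignoring repetition that spans windows.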
