#**Project Information**
Title: Assessment_9: NLP AI Applications
Author: Darien Prall
Date: 01-04-2024 (mmddyyyy)

**Project Description**
For this project, I explore the usage of modal verbs in different texts from the Gutenberg Corpus and compare how often each modal appears, using Natural Language Processing (NLP) techniques from NLTK.

**Project Steps**
Task 1: Modals in the Gutenberg Corpus
- Task 1.1: Install NLTK
- Task 1.2: Download the Gutenberg Corpus
- Task 1.3: Define Each Modal Group
- Task 1.4: Count Relative Frequencies of Each Modal Verb
- Task 1.5: Find the Texts with the Largest Span of Modal Frequencies
- Task 1.6: Compare Usage of Modals in the Two Texts

Task 2: Inaugural Corpus - Analyzing Kennedy's 1961 Speech
- Task 2.1: Download the Inaugural Corpus
- Task 2.2: Load the 1961 Kennedy Speech
- Task 2.3: Identify the 10 Most Frequently Used Long Words
- Task 2.4: Use WordNet to Find Synonyms and Hyponyms
- Task 2.5: Reflect on the Results

##**Task 1: Modals in the Gutenberg Corpus**

###**Task 1.1: Install NLTK**
**What this code is doing:** Importing the NLTK library to allow the download of the gutenberg texts

In [1]:
import nltk
from nltk.corpus import gutenberg

###**Task 1.2: Download the Gutenberg Corpus**
**What this code is doing:** Storing the file IDs of all Gutenberg texts in a variable, then looping through them to print the name of each file. This assumes the Gutenberg corpus has already been downloaded, for example with nltk.download('gutenberg').

In [2]:
files = gutenberg.fileids()
for file in files:
    print(file)

austen-emma.txt
austen-persuasion.txt
austen-sense.txt
bible-kjv.txt
blake-poems.txt
bryant-stories.txt
burgess-busterbrown.txt
carroll-alice.txt
chesterton-ball.txt
chesterton-brown.txt
chesterton-thursday.txt
edgeworth-parents.txt
melville-moby_dick.txt
milton-paradise.txt
shakespeare-caesar.txt
shakespeare-hamlet.txt
shakespeare-macbeth.txt
whitman-leaves.txt


###**Task 1.3: Define Each Modal Group**
**What this code is doing:** Creating a list of modal verbs I want to look for within the texts

In [3]:
list_of_modals = ['can', 'could', 'may', 'might', 'will', 'would', 'should']

###**Task 1.4: Count Relative Frequencies of Each Modal Verb**
**What this code is doing:** Creating an empty dictionary, frequency_of_modals_by_file, that will hold each text file as the key and a dictionary of modal counts as the value. The for loop goes through each file and converts its contents to a list of lowercase words. For each file, it creates another empty dictionary that stores each modal as the key and the count of that modal in the text as the value, then stores that modal:count dictionary as the file's value in frequency_of_modals_by_file. These are raw counts; dividing by the length of each text to get relative frequencies happens in Task 1.6. As a best practice, I like to comment the key:value pair of each dictionary being created.

In [4]:
frequency_of_modals_by_file = {
# KEY  : VALUE
# FILE : {modal: count}

}

from tabulate import tabulate

for file in files:
    file_words = gutenberg.words(file)
    lowercase_file_words = []
    for word in file_words:
        lowercase_file_words.append(word.lower())
    count_of_each_modal_in_file = {
        # KEY   :  VALUE
        # modal : count of modal
    }
    for modal in list_of_modals:
        count_of_each_modal_in_file[modal] = lowercase_file_words.count(modal)
    frequency_of_modals_by_file[file] = count_of_each_modal_in_file

# Printing the results in table format (one row per file)
frequency_table = [(file, modal_counts) for file, modal_counts in frequency_of_modals_by_file.items()]
print(tabulate(frequency_table, headers=["File", "Modal Counts"], tablefmt="grid"))

+-------------------------+--------------------------------------------------------------------------------------------------+
| File                    | Modal Counts                                                                                     |
| austen-emma.txt         | {'can': 284, 'could': 837, 'may': 221, 'might': 326, 'will': 570, 'would': 820, 'should': 369}   |
+-------------------------+--------------------------------------------------------------------------------------------------+
| austen-persuasion.txt   | {'can': 107, 'could': 451, 'may': 87, 'might': 166, 'will': 167, 'would': 355, 'should': 188}    |
+-------------------------+--------------------------------------------------------------------------------------------------+
| austen-sense.txt        | {'can': 218, 'could': 578, 'may': 175, 'might': 215, 'will': 363, 'would': 515, 'should': 236}   |
+-------------------------+------------------------------------------------------------------------------------
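
The counts above are raw totals, so longer books naturally score higher. A minimal sketch of turning them into relative frequencies follows, assuming a relative frequency is defined as a modal's count divided by the text's total token count. The helper name `relative_modal_frequencies` and the toy token list are illustrative and not part of the notebook; with NLTK available, the tokens would come from `gutenberg.words(file)`.

```python
from collections import Counter

list_of_modals = ['can', 'could', 'may', 'might', 'will', 'would', 'should']

def relative_modal_frequencies(tokens, modals):
    """Return {modal: count / total_tokens} for one tokenized text (a sequence)."""
    # Lowercase first so 'Can' and 'can' are counted together
    counts = Counter(token.lower() for token in tokens)
    total = len(tokens)
    if total == 0:
        return {modal: 0.0 for modal in modals}
    return {modal: counts[modal] / total for modal in modals}

# Toy stand-in for gutenberg.words(file); not real corpus data
toy_tokens = ['Can', 'you', 'see', 'what', 'I', 'can', 'see', '?', 'You', 'may', '.']
toy_frequencies = relative_modal_frequencies(toy_tokens, list_of_modals)
for modal, frequency in toy_frequencies.items():
    print(f"{modal:<7}{frequency:.4f}")
```

Dividing by the token count makes texts of very different lengths comparable, which raw counts alone do not.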

###**Task 1.5: Find the Texts with the Largest Span of Modal Frequencies**
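**What this code is doing:** Creating an empty dictionary to store modal_span_information. For each modal, that information is the most_frequent_file, least_frequent_file, most_frequent_file_count, and least_frequent_file_count. The second loop goes through the counts from Task 1.4 and updates these entries whenever a file has a higher or lower count for a modal than any file seen so far.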


In [5]:
modal_span_information = {
    # KEY   : VALUE
    # modal : {modal_span_information}
}
for modal in list_of_modals:
    modal_span_information[modal] = {
        "most_frequent_file": "", 
        "least_frequent_file": "", 
        "most_frequent_file_count": 0, 
        "least_frequent_file_count": float('inf')
        }

for file, dic_modal_counts in frequency_of_modals_by_file.items():
    for modal, count_of_modal in dic_modal_counts.items():
        if count_of_modal > modal_span_information[modal]["most_frequent_file_count"]:
            modal_span_information[modal]["most_frequent_file_count"] = count_of_modal
            modal_span_information[modal]["most_frequent_file"] = file
        if count_of_modal < modal_span_information[modal]["least_frequent_file_count"]:
            modal_span_information[modal]["least_frequent_file_count"] = count_of_modal
            modal_span_information[modal]["least_frequent_file"] = file
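
The same search can be sketched more compactly with `max()` and `min()`. The version below is an illustration, not the notebook's method: it assumes a total token count is also available for each file, and it ranks files by relative frequency rather than raw count so that short texts are not automatically the "least frequent". The helper `modal_span` and the numbers in `toy_counts` and `toy_totals` are made up for the example.

```python
def modal_span(counts_by_file, totals_by_file, modal):
    """Return (most_file, least_file) for one modal, ranked by relative frequency."""
    def relative_frequency(file):
        return counts_by_file[file][modal] / totals_by_file[file]

    most_frequent_file = max(counts_by_file, key=relative_frequency)
    least_frequent_file = min(counts_by_file, key=relative_frequency)
    return most_frequent_file, least_frequent_file

# Made-up counts and text lengths, for illustration only
toy_counts = {
    'long-novel.txt': {'can': 300, 'will': 500},
    'short-play.txt': {'can': 40, 'will': 90},
    'poems.txt': {'can': 5, 'will': 2},
}
toy_totals = {'long-novel.txt': 200000, 'short-play.txt': 20000, 'poems.txt': 10000}

can_span = modal_span(toy_counts, toy_totals, 'can')
will_span = modal_span(toy_counts, toy_totals, 'will')
print(can_span)
print(will_span)
```

In the toy data the long novel has the highest raw count of 'can', yet the short play uses it more often per token, which is the distinction a relative-frequency ranking is meant to capture.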

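###**Task 1.6: Compare Usage of Modals in the Two Texts**
**What this code is doing:** Looping through each key:value pair in modal_span_information. For each modal, it gets the files where the modal appears the most and the least, recounts the modal in each file, and divides by the file's total word count to get the relative frequency. The results are then printed as a table. Because the recount uses the original tokens rather than the lowercased ones, the counts are case sensitive and can be slightly lower than in Task 1.4 (for example, 825 instead of 837 for 'could' in austen-emma.txt).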

In [12]:
# Collect one row per modal for the comparison table
table_data = []
headers = ["Modal", "Most Frequent File", "Most Frequent Count", "Most Frequent Total Words", "Most Frequent Frequency", 
           "Least Frequent File", "Least Frequent Count", "Least Frequent Total Words", "Least Frequent Frequency"]


for modal, span in modal_span_information.items():
    most_modal_count_file = gutenberg.words(span['most_frequent_file'])
    least_modal_count_file = gutenberg.words(span['least_frequent_file'])
    
    most_modal_count = most_modal_count_file.count(modal)
    least_modal_count = least_modal_count_file.count(modal)
    
    max_length = len(most_modal_count_file)
    min_length = len(least_modal_count_file)
    
    most_modal_frequency = most_modal_count / max_length
    least_modal_frequency = least_modal_count / min_length
    
    table_data.append([
        modal, 
        span['most_frequent_file'], 
        most_modal_count, 
        max_length, 
        f"{most_modal_frequency:.6f}",
        span['least_frequent_file'], 
        least_modal_count, 
        min_length, 
        f"{least_modal_frequency:.6f}"
    ])

print(tabulate(table_data, headers=headers, tablefmt="grid"))

+---------+-----------------------+-----------------------+-----------------------------+---------------------------+-------------------------+------------------------+------------------------------+----------------------------+
| Modal   | Most Frequent File    |   Most Frequent Count |   Most Frequent Total Words |   Most Frequent Frequency | Least Frequent File     |   Least Frequent Count |   Least Frequent Total Words |   Least Frequent Frequency |
| can     | edgeworth-parents.txt |                   340 |                      210663 |                  0.001614 | shakespeare-caesar.txt  |                     16 |                        25833 |                   0.000619 |
+---------+-----------------------+-----------------------+-----------------------------+---------------------------+-------------------------+------------------------+------------------------------+----------------------------+
| could   | austen-emma.txt       |                   825 |                      192

####**Explanation of Why the Modal Counts are Different Between Two Texts**
Saying that the books are simply different does not explain why modal usage varies between them; it is more useful to look at the period and tone of each text.
- **Formality**: Some texts are more formal than others, so words like 'may' and 'shall' appear more often. The author may be indicating obligation or necessity.
- **Type of Storytelling**: A narrative or descriptive text tends to use hypothetical and conditional modals more often, because it explores the potential outcomes of character choices.
- **Audience**: The intended audience can determine how casually modals are used; 'may' and 'could', for example, fit a more conversational tone.
- **Technicality**: A technical document may focus on precision and clarity, with modals like 'must'.


##**Task 2: Inaugural Corpus - Analyzing Kennedy's 1961 Speech**

###**Task 2.1: Download the Inaugural Corpus**
**What this code is doing:** Importing the inaugural corpus reader and downloading the Inaugural Corpus.

In [7]:
from nltk.corpus import inaugural
nltk.download('inaugural')

[nltk_data] Downloading package inaugural to
[nltk_data]     /Users/darienprall/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


True

###**Task 2.2: Load the 1961 Kennedy Speech**
**What this code is doing:** Getting all of the file IDs in the inaugural corpus so I can see the name of the Kennedy speech file (even though it was provided). It then loads the contents of the speech, split into words.

In [8]:
all_files = inaugural.fileids()
#for file in all_files:
#    print(file)

kennedys_speech = inaugural.words('1961-Kennedy.txt')

###**Task 2.3: Identify the 10 Most Frequently Used Long Words**
**What this code is doing:** Looking for any words that are longer than 7 characters and appending them, in lowercase, to a list. The words are then added to a dictionary with the word as the key and its count as the value. Finally, the dictionary items are sorted by count in descending order to get the top 10 long words.

In [9]:
long_words = []
for word in kennedys_speech:
    if len(word) > 7:
        long_words.append(word.lower())

frequency_of_long_words = {}

for word in long_words:
    if word in frequency_of_long_words:
        frequency_of_long_words[word] += 1
    else:
        frequency_of_long_words[word] = 1

sorted_long_words_frequency = sorted(frequency_of_long_words.items(), key = lambda x: x[1], reverse = True)
top_10_long_words = sorted_long_words_frequency[:10]
print(f"The top ten long words in Kennedy's 1961 Speech are: {top_10_long_words}")

The top ten long words in Kennedy's 1961 Speech are: [('citizens', 5), ('president', 4), ('americans', 4), ('generation', 3), ('forebears', 2), ('revolution', 2), ('committed', 2), ('powerful', 2), ('supporting', 2), ('themselves', 2)]
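
The counting loop can also be written with `collections.Counter` from the standard library, whose `most_common(n)` method returns the `n` highest counts. This is an alternative sketch, not the code that produced the output above; the `top_long_words` helper and the sample tokens are illustrative assumptions. The `isalpha()` check is an added filter that skips tokens containing punctuation or digits.

```python
from collections import Counter

def top_long_words(tokens, min_length=8, n=10):
    """Return the n most frequent lowercase words with at least min_length letters."""
    long_words = (
        token.lower()
        for token in tokens
        if len(token) >= min_length and token.isalpha()
    )
    return Counter(long_words).most_common(n)

# Illustrative tokens; with NLTK these would be inaugural.words('1961-Kennedy.txt')
sample_tokens = ['Citizens', 'of', 'the', 'world', ',', 'fellow', 'citizens', ':',
                 'a', 'new', 'generation', 'of', 'Americans']
top_words = top_long_words(sample_tokens, n=3)
print(top_words)
```

A minimum length of 8 matches the notebook's `len(word) > 7` condition.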


###**Task 2.4: Use WordNet to Find Synonyms and Hyponyms**
**What this code is doing:** Importing WordNet from NLTK and downloading its data. It then loops through top_10_long_words and extracts the unique synonyms and hyponyms for each word. These are stored in two separate lists.

In [10]:
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

for word, _ in top_10_long_words:
    synonyms = []
    hyponyms = []
    
    for synset in wn.synsets(word):

        for lemma in synset.lemma_names():
            if lemma not in synonyms:
                synonyms.append(lemma)
        
        for hypo in synset.hyponyms():
            for lemma in hypo.lemma_names():
                if lemma not in hyponyms:
                    hyponyms.append(lemma)
    
    print(f"Word: {word}")
    print(f"  Synonyms: {', '.join(synonyms)}")
    print(f"  Hyponyms: {', '.join(hyponyms)}")
    print()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/darienprall/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Word: citizens
  Synonyms: citizen
  Hyponyms: voter, elector, freeman, freewoman, thane, private_citizen, active_citizen, civilian, repatriate

Word: president
  Synonyms: president, President_of_the_United_States, United_States_President, President, Chief_Executive, chairman, chairwoman, chair, chairperson, prexy
  Hyponyms: ex-president, Kalon_Tripa, vice_chairman

Word: americans
  Synonyms: American, American_English, American_language
  Hyponyms: New_Mexican, North_Carolinian, Tarheel, New_Englander, Yankee, Rhode_Islander, Wyomingite, Carolinian, Kentuckian, Bluegrass_Stater, Illinoisan, Bay_Stater, Californian, Mississippian, South_Dakotan, Wisconsinite, Badger, Marylander, Ohioan, Buckeye, Puerto_Rican, Anglo-American, Arkansan, Arkansawyer, South_Carolinian, Texan, Vermonter, African-American, African_American, Afro-American, Black_American, Delawarean, Delawarian, Kansan, Coloradan, West_Virginian, Indianan, Hoosier, Bostonian, Idahoan, Franco-American, Arizonan, Arizonian, 
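
The duplicate checks with `not in` can be replaced by `dict.fromkeys`, which keeps the first occurrence of each item in order. The sketch below relies only on objects that offer `lemma_names()` and `hyponyms()`, the two synset methods the cell above calls. `StubSynset` is a hypothetical stand-in with made-up lemmas, so the example runs without the WordNet data; in the notebook, `wn.synsets(word)` would be passed instead.

```python
class StubSynset:
    """Hypothetical stand-in for an NLTK synset (illustration only)."""

    def __init__(self, lemmas, hyponyms=()):
        self._lemmas = list(lemmas)
        self._hyponyms = list(hyponyms)

    def lemma_names(self):
        return self._lemmas

    def hyponyms(self):
        return self._hyponyms


def unique_synonyms_and_hyponyms(synsets):
    """Collect unique lemma names of the synsets and of their direct hyponyms."""
    synonyms = {}
    hyponyms = {}
    for synset in synsets:
        # Dictionary keys are unique and keep insertion order
        synonyms.update(dict.fromkeys(synset.lemma_names()))
        for hypo in synset.hyponyms():
            hyponyms.update(dict.fromkeys(hypo.lemma_names()))
    return list(synonyms), list(hyponyms)


# Made-up lemmas, loosely modelled on the 'citizens' output above
voter = StubSynset(['voter', 'elector'])
civilian = StubSynset(['civilian'])
citizen = StubSynset(['citizen'], hyponyms=[voter, civilian, voter])

synonyms, hyponyms = unique_synonyms_and_hyponyms([citizen, citizen])
print(synonyms)
print(hyponyms)
```

Dictionary membership checks stay fast as the lists grow, whereas `lemma not in synonyms` scans the whole list each time.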

###**Task 2.5: Reflect on the Results**

In this project, I learned several techniques for measuring the usage and frequency of modal verbs. The two sources used were the Gutenberg Corpus and Kennedy's 1961 Inaugural Speech. Here are some of the noteworthy observations I made while completing this assignment:

- 1) Corpus libraries and their texts require different handling from regular text documents on my machine. For a standard text file, I would need to supply the file path, open the file, read it, and then close it. With NLTK's corpus readers, that file management happens in the background when a text is accessed.

- 2) When accessing text files, the best way to get all the words of a file into a structure (list, dictionary) is to use a splitter. In this case, gutenberg.words() takes care of this easily, allowing me to loop through each word in the list.

- 3) I've seen the benefit of using lists to dynamically create keys in a dictionary. This can be seen in the count_of_each_modal_in_file dictionary and others.

- 4) String comparisons are case sensitive: 'Can' is different from 'can', so a normalized format should be used. In this case, I converted all words to lowercase to match the list_of_modals list.

- 5) Conducting a modal span analysis to find the texts with the highest and lowest counts of each modal helps reveal not just differences in counts but also differences in writing style.

- 6) WordNet was new to me as a programmer. Counting the long words and keeping the top 10 in a list allowed me to loop through them and pass each word to the wn.synsets() function. From the synsets it returned, I extracted the synonyms and hyponyms of each long word. I noticed that 'powerful' and 'themselves' did not return any hyponyms, and 'themselves' also didn't return any synonyms. At first I thought this was an error in the code, but it turns out these words are not in WordNet's hyponym hierarchy: 'powerful' is an adjective, and the hyponym relation covers nouns and verbs, while 'themselves' is a pronoun, which WordNet does not include. Not all code is perfect, and just because something works doesn't mean it can't be improved.
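
To illustrate the first observation, here is roughly what the manual route looks like for a plain text file. This is a sketch under assumptions: the temporary file and its contents are invented for the example, and the regular expression is a simple stand-in for a tokenizer, not necessarily the one NLTK's corpus readers use.

```python
import os
import re
import tempfile

def read_words(path):
    """Open a text file, read it, and split it into word and punctuation tokens."""
    # The 'with' block closes the file automatically, even if reading fails
    with open(path, encoding='utf-8') as handle:
        text = handle.read()
    # \w+ matches runs of letters/digits; [^\w\s] matches single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

# Write a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("We can do this. Can we?")
    temp_path = tmp.name

words = read_words(temp_path)
os.remove(temp_path)
print(words)
```

With a corpus reader, the path handling, opening, and splitting are all hidden behind a single call such as `gutenberg.words(file)`.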

Overall, I understand the importance of NLTK for making text processing efficient and logical.