
- Question from Homework on Text Analysis:
    - (a) Tokenize the texts into uni-grams (single words). Count the number of total tokens and the number of unique tokens in each text.
    - (b) Now remove stopwords from the tokens and repeat the same count.
    - (c) Count the number of times the token “america” appears in the text.
    - (d) Count the number of times the token “union” appears in the text.
    - (e) Count the number of times the token “freedom” appears in the text.
    - (f) Count the number of times the token “constitution” appears in the text.
    - (g) Now tokenize the texts into bi-grams (pairs of words). Count the number of times “united states” appears in the texts.
    - (h) Based on the token counts, can we conclude anything about the theme of these presidential speeches?
    - (i) Compute the FOG index for readability for each text. Comment on the result.
    - (j) Construct the document feature matrix (DFM) of all the texts. Which are the top 5 features for each text?
    - (k) Compute the cosine distance of texts in the DFM. Which two texts have the highest similarity? Which ones the lowest?
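
Parts (a) through (g) reduce to tokenizing, optionally filtering stopwords, and counting. The notebook delegates this to `TextAnalyzer`, whose source is not shown, so the following is only a minimal standard-library sketch of the idea. The regex tokenizer, the tiny `STOPWORDS` set, the helper names, and the `sample` sentence are all assumptions made for illustration, not the homework data or the class's actual behavior.

```python
import re
from collections import Counter

# Hypothetical mini stopword list, for illustration only; real lists are much longer.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "we", "our"}

def tokenize(text):
    """Lowercase the text and extract runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one space-separated n-gram."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def token_summary(text, remove_stopwords=False):
    """Return (total tokens, unique tokens), optionally dropping stopwords first."""
    tokens = tokenize(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return len(tokens), len(set(tokens))

# Invented sample text, not taken from any of the speeches.
sample = "The people of the United States love the union. The United States is our union."

unigram_counts = Counter(tokenize(sample))
bigram_counts = Counter(ngrams(tokenize(sample), 2))

print(token_summary(sample))                         # (a) total and unique tokens
print(token_summary(sample, remove_stopwords=True))  # (b) same, without stopwords
print(unigram_counts["union"])                       # (c)-(f) single-token counts
print(bigram_counts["united states"])                # (g) bi-gram count
```

Because the bi-grams here are built from the full token stream, pairs can span sentence boundaries; whether that matters depends on how the assignment defines a bi-gram.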

In [1]:
import os
import re
import numpy as np
import pandas as pd
import spacy
from functools import lru_cache
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances
import syllapy
import tabulate
from TextAnalyzer import TextAnalyzer

nlp = spacy.load("en_core_web_sm")
folder_path = 'NLP_FIles/'

# Read texts from the folder and initialize the analyzer
texts = TextAnalyzer.read_texts_from_folder(folder_path)
analyzer = TextAnalyzer(texts)

# (a) Tokenization with uni-grams, without removing stopwords
analyzer.tokenize(remove_stopwords=False)
analyzer.display_token_counts(header="(a). Token Counts")

# (b) Tokenization with uni-grams, removing stopwords
analyzer.tokenize(remove_stopwords=True)
analyzer.display_token_counts(header="(b). Token Counts with no Stopwords")

# (c)~(f) Count specific tokens
tokens_to_count = ["america", "union", "freedom", "constitution"]
analyzer.display_specific_token_counts(tokens_to_count, header="(c)~(f). Specific Token Counts")

# (g) Tokenization with bi-grams
analyzer.tokenize(ngram_range=(2, 2), dfm_key='bigrams')
united_states_counts = analyzer.dfms['bigrams'].get('united states', pd.Series(0, index=analyzer.texts.keys()))
data = list(zip(analyzer.texts.keys(), united_states_counts))
analyzer.display_data(data, ["Text", "united states"], "(g). Bi-gram Counts for 'united states'")

# (h) Analyze Token Counts to infer the theme of each text
def print_theme_analysis():
    divider = "=" * 50  # Extended for longer lines

    print("\n(h). Themes of Each Text")
    print(divider)

    # Biden & Trump
    print("\nBiden & Trump:")
    print("- Tokens of Interest: 'America'")
    print("- Inference: Both use 'America' frequently, suggesting themes related to U.S.\n  internal politics and the nation's future outlook. Their focus seems centered on\n  evolving visions and potential trajectories for the United States.")
    print(divider)

    # Lincoln
    print("\nLincoln:")
    print("- Tokens of Interest: 'Union' and 'Constitution'")
    print("- Inference: Lincoln's emphasis on 'Union' and 'Constitution' points to themes\n  around the secession crisis that led to the American Civil War. The national mood\n  was shaped by constitutional and legal questions, with tensions over the\n  secession of southern states and an overarching concern for national unity.")
    print(divider)

    # JFK's Inaugural Address
    print("\nJFK's Inaugural Address:")
    print("- Tokens of Interest: 'freedom'")
    print("- Inference: JFK's emphasis on 'freedom' places the speech in the Cold War era,\n  framing the period as a contest between freedom and communism and\n  stressing the pivotal role of freedom globally.")
    print(divider)

    # Washington
    print("\nWashington:")
    print("- Tokens of Interest: 'union' and 'constitution'")
    print("- Inference: Washington's focus on 'union' and 'constitution' hints at themes\n  of nation-building. Given the country's infancy and the ongoing constitution\n  ratification, there was an emphasis on fostering unity and establishing\n  a robust constitutional groundwork.")
    print(divider + "\n")

# Call the function
print_theme_analysis()

# (i) Compute FOG index
fog_indexes = analyzer.compute_fog_indexes()
data = [(text, "{:.2f}".format(fog)) for text, fog in fog_indexes.items()]
analyzer.display_data(data, ["Text", "FOG Index"], "(i). FOG Index for Each Text")
spacy_fog_indexes = analyzer.compute_fog_indexes(use_spacy=True)
data = [(text, "{:.2f}".format(fog)) for text, fog in spacy_fog_indexes.items()]
analyzer.display_data(data, ["Text", "FOG Index"], "(i). FOG Index for Each Text (spaCy)")

# (j) Top 5 features for each text and display the DFM
analyzer.top_features()
analyzer.display_overall_top_features()

# (k) Cosine similarity analysis
most_similar_texts, least_similar_texts = analyzer.cosine_similarity_analysis()
data_most_similar = [(text, most_similar_texts[text]) for text in most_similar_texts]
data_least_similar = [(text, least_similar_texts[text]) for text in least_similar_texts]
analyzer.display_data(data_most_similar, ["Text", "Most Similar Text"], "(k). Most Similar Texts")
analyzer.display_data(data_least_similar, ["Text", "Least Similar Text"], "(k). Least Similar Texts")

# Supporting evidence: frequency matrix with features sorted by total count (descending)
freq_matrix = analyzer.generate_freq_matrix(dfm_key='base')
column_sums = freq_matrix.sum(axis=0)
sorted_columns = column_sums.sort_values(ascending=False).index
freq_matrix_sorted = freq_matrix[sorted_columns]

freq_matrix_sorted
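
Step (i) relies on `analyzer.compute_fog_indexes()`, whose implementation is not shown. As a rough illustration of what such a method might compute, the Gunning FOG index is 0.4 × (average words per sentence + percentage of words with three or more syllables). The sketch below assumes a regex sentence splitter and a vowel-group syllable heuristic, both hypothetical stand-ins for the NLTK/spaCy tokenizers and `syllapy` imported above, and it skips the refinements of the full definition (for example excluding proper nouns and common suffixes).

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def fog_index(text):
    """Simplified Gunning FOG index: 0.4 * (words/sentences + 100 * complex/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # assumption: treat empty input as score 0 rather than raising
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))

# Invented sample text, not taken from any of the speeches.
fog_sample = "We love the union. The constitution protects freedom."
print(round(fog_index(fog_sample), 2))
```

For this sample the heuristic finds 8 words in 2 sentences and one complex word ("constitution"), so the score is 0.4 × (4 + 12.5) = 6.6. Scores from a heuristic syllable counter will differ somewhat from those based on a dictionary lookup, which may explain part of any gap between the two FOG tables printed by the notebook.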


(a). Token Counts
+-----------------+----------------+-----------------+
| Text            |  Total Tokens  |  Unique Tokens  |
+=================+================+=================+
| Biden_2021      |      2300      |       722       |
+-----------------+----------------+-----------------+
| Kennedy_1961    |      1332      |       530       |
+-----------------+----------------+-----------------+
| Lincoln_1861    |      3539      |      1007       |
+-----------------+----------------+-----------------+
| Trump_2017      |      1431      |       533       |
+-----------------+----------------+-----------------+
| Washington_1789 |      1394      |       592       |
+-----------------+----------------+-----------------+


(b). Token Counts with no Stopwords
+-----------------+----------------+-----------------+
| Text            |  Total Tokens  |  Unique Tokens  |
+=================+================+=================+
| Biden_2021      |      972       |       577       |
+-----------------+----------------+-----------------+
| Kennedy_1961    |      612       |       412       |
+-------

Text,people,america,government,constitution,shall,nation,states,let,country,union,...,crimes,culture,lower,crises,lot,crucial,loss,crucible,cultural,invisible
Biden_2021,9,20,0,3,2,14,1,6,4,2,...,0,1,1,1,1,0,0,1,0,0
Kennedy_1961,1,2,0,0,5,2,2,16,4,0,...,0,0,0,0,0,0,0,0,1,0
Lincoln_1861,20,0,18,24,17,0,19,0,3,20,...,1,0,0,0,0,0,1,0,0,0
Trump_2017,10,19,3,0,0,8,2,3,9,0,...,0,0,0,0,0,1,0,0,0,0
Washington_1789,4,0,8,1,3,2,2,0,5,2,...,0,0,0,0,0,0,0,0,0,1
Total,44,41,29,28,27,26,26,25,25,24,...,1,1,1,1,1,1,1,1,1,1
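
For part (k), each row of a matrix like the one above is treated as a vector of feature counts, and the cosine distance between two rows is 1 minus the cosine of the angle between them, so the pair with the smallest distance has the highest similarity. The following is a minimal pure-Python sketch on an invented three-document toy matrix. The names `toy_dfm` and `cosine_distance` are hypothetical, the counts are not the speech data, and the `TextAnalyzer.cosine_similarity_analysis` method used in the notebook may work differently.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """Return 1 - cos(angle) between two equal-length, non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy document-feature matrix with invented counts (rows: documents, columns: features).
toy_dfm = {
    "doc_a": [2, 0, 1],
    "doc_b": [3, 1, 1],
    "doc_c": [0, 3, 0],
}

# Distance for every unordered pair of documents.
distances = {
    (x, y): cosine_distance(toy_dfm[x], toy_dfm[y])
    for x, y in combinations(toy_dfm, 2)
}

most_similar = min(distances, key=distances.get)   # smallest distance
least_similar = max(distances, key=distances.get)  # largest distance

for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(dist, 3))
```

On a full matrix, `sklearn.metrics.pairwise.cosine_distances`, which the notebook already imports, returns all pairwise distances at once. A "Total" row, if present, should be dropped before comparing documents.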
