<a href="https://colab.research.google.com/github/R-802/LING-226-Assignments/blob/main/Assignment_One.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Shemaiah Rangitaawa Assignment One LING226 2023 T3 `300601546`**
- Attempting Challenge

### **Text Preprocessing `preprocess_text`**
**Remove Punctuation**
- The function strips all punctuation from the text.

**Remove Stopwords**
- Stopwords, like "the", "is", "at" are removed from the text.

**Lowercase All Words**
- The text is converted to lowercase. This standardization is important as it prevents the same words in different cases from being counted as different words (e.g., "Hello" and "hello").

**Remove Words At or Above a Certain Frequency (Inclusive)**
- Words that occur at least `removal_frequency` times in the text are removed, because very common words often carry little useful information. Rare words are not filtered by this function.


In [48]:
import re
from collections import Counter

def preprocess_text(text, stopwords, removal_frequency):
    """Return the cleaned text and the list of removed words.

    Words are removed if they are stopwords or occur at least
    removal_frequency times (inclusive).
    """

    # Lowercase and remove punctuation
    text = re.sub(r'[^\w\s]', '', text).lower()

    # Split text into words
    words = text.split()

    # Count the frequency of each word
    word_frequency = Counter(words)

    # Identify words to remove: stopwords and words at or above the frequency threshold
    removed_words = set(stopwords).union(
        {word for word, freq in word_frequency.items() if freq >= removal_frequency}
    )

    # Filter out removed words
    processed_words = [word for word in words if word not in removed_words]

    return ' '.join(processed_words), list(removed_words)
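
The inclusive threshold can be checked on a toy string. This is a minimal sketch, not part of the assignment: the sample text, the tiny stopword set, and the threshold of 3 are made up for illustration, and the logic mirrors the counting and filtering steps of `preprocess_text` above.

```python
import re
from collections import Counter

# Hypothetical inputs chosen for illustration only
sample = "The cat sat. The cat ran! A cat slept, the dog barked."
toy_stopwords = {"the", "a"}
threshold = 3  # words occurring 3 or more times are removed (inclusive)

# Same cleaning steps as preprocess_text: strip punctuation, lowercase, split
words = re.sub(r'[^\w\s]', '', sample).lower().split()
counts = Counter(words)

# Words at or above the threshold, plus the stopwords
too_frequent = {word for word, freq in counts.items() if freq >= threshold}
removed = toy_stopwords | too_frequent

kept = [word for word in words if word not in removed]
print(sorted(too_frequent))
print(kept)
```

Here `cat` occurs exactly three times, so it is removed even though it is not a stopword. With a threshold of 4 it would be kept.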

### **Text Metric Function `text_metrics`**
**Total Words:** The total count of words in the text.

**Overall Lexical Diversity:** The ratio of unique words to the total number of words, providing a measure of the text's vocabulary variety.

**Average Sentence Lexical Diversity:** The average diversity of vocabulary used across all sentences in the text.

**Top Ten Most Frequent Words:** A list of the ten most commonly used words in the text, along with their frequencies.

**Total Number of Sentences:** The total sentence count of the text. When analysing processed text, this metric is always `1`, because preprocessing leaves no punctuation to split the text on.

In [46]:
def text_metrics(text):
    # Tokenizing the text into words
    words = re.findall(r'\b\w+\b', text.lower())
    total_words = len(words)

    # Overall lexical diversity (unique words / total words)
    unique_words = len(set(words))
    overall_lexical_diversity = unique_words / total_words if total_words > 0 else 0

    # Tokenizing the text into sentences and calculating diversity
    sentences = re.split(r'[.!?]', text)
    sentence_diversities = []
    for sentence in sentences:
        sentence_words = re.findall(r'\b\w+\b', sentence.lower())
        unique_in_sentence = len(set(sentence_words))
        total_in_sentence = len(sentence_words)
        if total_in_sentence > 0:
            sentence_diversities.append(unique_in_sentence / total_in_sentence)

    # Average lexical diversity of text sentences
    avg_sentence_lexical_diversity = sum(sentence_diversities) / len(sentence_diversities) if sentence_diversities else 0

    # Top ten most frequent words
    word_frequencies = Counter(words)
    top_ten_words = word_frequencies.most_common(10)

    # Number of sentences (counts every fragment of the split, including empty ones)
    num_sentences = len(sentences)

    return total_words, overall_lexical_diversity, avg_sentence_lexical_diversity, top_ten_words, num_sentences
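
The sentence count returned above comes from `len(re.split(...))`, which also counts empty fragments. A small sketch on made-up strings shows why punctuation-free text always reports one sentence. The `count_nonempty` helper is hypothetical and only illustrates one possible alternative; it is not what the notebook uses.

```python
import re

def split_count(text):
    # Mirrors text_metrics: every fragment from the split is counted
    return len(re.split(r'[.!?]', text))

def count_nonempty(text):
    # Hypothetical alternative: ignore fragments with no word characters
    return sum(1 for part in re.split(r'[.!?]', text) if re.search(r'\w', part))

raw = "It rained. We stayed in!"
stripped = "it rained we stayed in"  # punctuation removed, as after preprocessing

print(split_count(raw))       # includes the empty fragment after the final '!'
print(count_nonempty(raw))
print(split_count(stripped))  # no delimiter found, so the text is one fragment
```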


### **Formatting for Text Metrics**

In [6]:
def format_metrics(title, metrics):
    formatted_top_words = ', '.join([word for word, _ in metrics[3]])
    # Extracting the highest frequency word and its frequency (empty text yields a placeholder)
    highest_word, highest_freq = metrics[3][0] if metrics[3] else ('', 0)

    # Formatting the diversities as percentages
    overall_diversity_percentage = metrics[1] * 100
    avg_sentence_diversity_percentage = metrics[2] * 100

    return (f"--------- Text Metrics for {title} ---------\n"
            f"Total Words: {metrics[0]}\n"
            f"Total Sentences: {metrics[4]}\n"
            f"Overall Lexical Diversity: {overall_diversity_percentage:.2f}%\n"
            f"Average Lexical Diversity of Sentences: {avg_sentence_diversity_percentage:.2f}%\n"
            f"Top Ten Most Frequent Words: {formatted_top_words}\n"
            f"Highest Frequency Word: '{highest_word}' (Frequency: {highest_freq})")


## **Importing and Reading `TP001.txt` from URL and `austen-emma.txt` from NLTK corpora**

In [7]:
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp001.txt'

--2023-11-16 22:49:07--  https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp001.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 220746 (216K) [text/plain]
Saving to: ‘tp001.txt’


2023-11-16 22:49:08 (1.77 MB/s) - ‘tp001.txt’ saved [220746/220746]



In [8]:
# Open the file and read its lines
with open('tp001.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Concatenate all comments into a single text string
tp001_text = ""
for line in lines:
    if '\t' in line:
        comment = line.split('\t')[1].strip()  # Extract and strip the comment
        tp001_text += comment + " "  # Add the comment to the text string

# Optionally, display the first part of the concatenated text
print("First part of tp001_text:", tp001_text[:500])

First part of tp001_text: ... we need to work hard to make it happen 3d is better than other bands in the whole country a ban on sales of new petrol vehicles would be more sensible than an outright ban .  an outright ban is itself wasteful A carless life is much more fun A good idea in theory but would have to change a lot of infrastructure. Not to mention industry and jobs. a good idea to protect our earth ! A good opportunity to reduce harm to the environment A N G E R Y A s part of many other changes A STEP IN THE RIG


In [9]:
import nltk
from nltk.corpus import gutenberg

# Download gutenberg corpus
nltk.download('gutenberg')

# Using Emma by Jane Austen 1816
emma_text = gutenberg.raw('austen-emma.txt')
print(emma_text[:290])

[nltk_data] Downloading package gutenberg to /root/nltk_data...


[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.



[nltk_data]   Unzipping corpora/gutenberg.zip.


# **Experimentation**
The following experimentation section includes:
- An analysis and overview of metrics from both sample texts.  
- Visualization of the top ten words before and after processing.
- Analysis of Emma's overall lexical diversity before and after processing.

**Notes:** I have chosen to use NLTK's stopword list for preprocessing. I have used 'Emma by Jane Austen 1816' from NLTK corpora and 'TP001 (Petrol cars should be banned by 2030)' from The Current.

### **Importing libraries and initializing stopwords set**
Required for preprocessing and visualization.

In [24]:
import random

import plotly.graph_objects as go
from plotly.subplots import make_subplots

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

# Tokenizer divides a text into a list of sentences
nltk.download('punkt')

# Download the stopwords from NLTK
nltk.download('stopwords')

# Create a set of English stopwords
stop_words = set(stopwords.words('english'))

# For some reason 'n' was the top word in tp001
stop_words.add('n')

print(stop_words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

{'him', 'through', 'any', 'down', 'just', 'in', 'both', 'no', 'mightn', 'yourself', 'yourselves', 'against', 'above', 'needn', 'how', "should've", 'too', 'over', 'shouldn', 'for', 'n', 'but', 'd', "hasn't", 'theirs', 'weren', "don't", 'haven', "mightn't", "needn't", 've', 'same', 'm', "mustn't", 'being', 'once', 'there', 'were', 'of', 'was', 'did', 'are', 'why', "wasn't", 'mustn', 'while', 'to', 'so', 'if', 'off', 'wasn', 'who', 'themselves', 'them', 'out', "hadn't", 'she', 'won', 'be', 'do', 'his', 's', "isn't", 'don', 'didn', 'we', "aren't", 'all', 'these', "doesn't", 'our', 'on', 'her', 'has', 'hers', 'is', 'some', 'only', 'hasn', 'again', 'until', 'ourselves', 'should', 'here', 'i', 'which', 'itself', 'after', 'had', 'those', "couldn't", 'now', 'can', 'further', "weren't", 't', 'into', 'wouldn', 'the', 'ma', 'between', "you've", 'more', 'each', "that'll", 'y', "shan't", 'doing', 'ours', 'when', 'herself', 'isn', 'yours', 'below', 're', 'very', 'with', 'not', 'by', 'doesn', 'will', 


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## **Analysis and overview of metrics from both sample texts**

### Text Metrics for "Emma" by Jane Austen
#### Unprocessed Text

- **Total Words:** `161,983`
  
  Indicates the substantial length typical of 19th-century novels.

- **Total Sentences:** `10,567`

- **Overall Lexical Diversity:** `4.48%`

  A relatively small proportion of unique words, common in longer texts with repetitive usage of certain words.

- **Average Lexical Diversity of Sentences:** `94.43%`

  Most sentences contain a high proportion of unique words, which is expected for short spans of text where few words have a chance to repeat.

- **Top Ten Most Frequent Words:** `to, the, and, of, i, a, it, her, was, she`

  Common in unprocessed English texts, indicating frequent use of linguistic connectors and pronouns.

- **Highest Frequency Word:** `'to'` (Frequency: `5,239`)

  Highlights its prevalent role in sentence construction.
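
The gap between the overall and per-sentence diversity figures is easier to see on a toy text. In this sketch (the three sentences are invented), no sentence repeats a word internally, so each scores 100%, yet the same words recur across sentences and pull the overall ratio down.

```python
import re

text = "The dog barked. The cat ran. The dog ran."

def diversity(words):
    # Unique words divided by total words (0 for empty input)
    return len(set(words)) / len(words) if words else 0

# Overall diversity across the whole text
all_words = re.findall(r'\b\w+\b', text.lower())
overall = diversity(all_words)

# Per-sentence diversity, skipping empty fragments
per_sentence = []
for sentence in re.split(r'[.!?]', text):
    words = re.findall(r'\b\w+\b', sentence.lower())
    if words:
        per_sentence.append(diversity(words))
average = sum(per_sentence) / len(per_sentence)

print(f"overall: {overall:.2f}, average per sentence: {average:.2f}")
```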

#### Processed Text

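- **Total Words:** `56,909`

  About 65% lower than the unprocessed count, because stopwords and every word at or above the frequency threshold are removed. The threshold is drawn at random on each run, so the processed figures in this section come from one particular run and differ from the outputs printed elsewhere in the notebook.

- **Total Sentences:** `1`

  An artifact of `text_metrics`: preprocessing strips the punctuation that the sentence split relies on, so the whole text is treated as one sentence.

- **Overall Lexical Diversity:** value not recorded for this run; the two printed runs report `15.21%` and `16.22%`

  Higher than the unprocessed `4.48%`, because removing the most repeated words shrinks the total word count far more than the number of unique words.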

- **Top Ten Most Frequent Words:** `frank, ever, young, churchill, two, though, indeed, better, come, oh`
  
  More specific and thematic, focusing on characters and narrative elements.

- **Highest Frequency Word:** `'frank'` (Frequency: `192`)
  
  Indicates a shift towards content-specific vocabulary.

#### Insights and Implications

- **Impact of Processing:** The processing significantly alters the text's word count and frequency distribution, emphasizing content-specific words.

- **Narrative Focus:** The shift in frequent words from common linguistic elements to character names and thematic words reflects the novel's narrative focus.

- **Lexical Diversity:** The increase in lexical diversity in the processed text highlights the contribution of unique, content-specific words to the richness of vocabulary, as opposed to common structural words.
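
The processed figures depend on `random_frequency`, which the notebook draws with `random.randint(125, 300)`, so they change from run to run. One way to make a run repeatable is to draw the threshold from a seeded generator. A minimal sketch, assuming an arbitrary seed of 226 (not a value used in the original notebook):

```python
import random

# Hypothetical seed; any fixed integer works
rng = random.Random(226)
random_frequency = rng.randint(125, 300)

# A second generator with the same seed draws the same threshold
repeat = random.Random(226).randint(125, 300)
print(random_frequency == repeat)
```

Using a dedicated `random.Random` instance keeps the seed local, so other uses of the `random` module are unaffected.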

---

### Text Metrics for `tp001`

#### Unprocessed Text

- **Total Words:** `39,065`
- **Total Sentences:** `2,037`
- **Overall Lexical Diversity:** `12.06%`

  Shows a reasonable variety of vocabulary.

- **Average Lexical Diversity of Sentences:** `89.43%`

  Most sentences have a high proportion of unique words, indicating diverse sentence construction.

- **Top Ten Most Frequent Words:** `the, to, we, and, i, it, a, is, be, for`

  Typical of unprocessed English texts, these are common structural words.

- **Highest Frequency Word:** `'the'` (Frequency: `1,507`)

  Common in English texts, often used for grammatical structure.

#### Processed Text

- **Total Words:** `18,059`

  About 54% lower than the unprocessed count, because stopwords and words at or above the frequency threshold are removed.

- **Overall Lexical Diversity:** `25.80%`

  A noticeable increase from the unprocessed text, indicating a higher proportion of unique words after processing.

- **Top Ten Most Frequent Words:** `environment, future, would, people, bad, world, make, dont, transport, climate`

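  These words reveal the main focus of the text. `dont` survives because punctuation is stripped before stopword matching, so it no longer matches the stopword `don't`.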

- **Highest Frequency Word:** `'environment'` (Frequency: `189`)

  Indicates a strong emphasis on environmental themes in the text.

#### Insights and Implications

- **Impact of Processing:** The reduction in total word count and the shift in word frequency distribution underscore the effect of text processing.

- **Thematic Focus:** Processing the text highlights specific themes like the environment and future, which are less apparent in the unprocessed text due to the prevalence of common structural words.

- **Lexical Diversity:** The increase in lexical diversity post-processing reflects the removal of common words, leaving a text rich in unique, content-specific vocabulary.
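
The appearance of `dont` among the top processed words follows from the order of operations: `preprocess_text` strips punctuation before comparing against the stopword list, so `don't` becomes `dont`, which no longer matches the list entry `don't`. A sketch with a tiny, made-up stopword set shows the effect and one possible workaround (normalising the stopwords the same way). The workaround is an illustration, not what the notebook does.

```python
import re

def strip_punctuation(text):
    # Same pattern as preprocess_text
    return re.sub(r'[^\w\s]', '', text).lower()

toy_stopwords = {"i", "don't", "we"}  # illustrative subset, not the NLTK list
words = strip_punctuation("I don't drive. We don't fly.").split()

# Filtering against the raw stopwords leaves 'dont' behind
kept_raw = [w for w in words if w not in toy_stopwords]

# Normalising the stopwords with the same function removes it
normalised = {strip_punctuation(s) for s in toy_stopwords}
kept_normalised = [w for w in words if w not in normalised]

print(kept_raw)
print(kept_normalised)
```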

In [49]:
# Get text metrics for raw unprocessed text
emma_metrics = text_metrics(emma_text)
tp001_metrics = text_metrics(tp001_text)

# Extracting top ten words and their frequencies for plotting for both texts
emma_top_ten_words, emma_frequencies = zip(*emma_metrics[3])
tp001_top_ten_words, tp001_frequencies = zip(*tp001_metrics[3])

# Extract the number of sentences
emma_num_sentences = emma_metrics[4]
tp001_num_sentences = tp001_metrics[4]

# Generate a random text occurrence removal frequency
random_frequency = random.randint(125, 300)

# Preprocess the texts
preprocessed_emma, _ = preprocess_text(emma_text, stop_words, random_frequency)
preprocessed_tp001, _ = preprocess_text(tp001_text, stop_words, random_frequency)

# Get metrics for preprocessed texts
preprocessed_emma_metrics = text_metrics(preprocessed_emma)
preprocessed_tp001_metrics = text_metrics(preprocessed_tp001)

# Extracting top ten words and their frequencies for preprocessed texts
preprocessed_emma_top_ten, preprocessed_emma_freq = zip(*preprocessed_emma_metrics[3])
preprocessed_tp001_top_ten, preprocessed_tp001_freq = zip(*preprocessed_tp001_metrics[3])

# Extract the number of sentences for preprocessed texts
preprocessed_emma_num_sentences = preprocessed_emma_metrics[4]
preprocessed_tp001_num_sentences = preprocessed_tp001_metrics[4]


print(format_metrics("Emma (Unprocessed)", emma_metrics) + "\n")
print(format_metrics("Emma (Processed)", preprocessed_emma_metrics) + "\n")
print(format_metrics("tp001.txt (Unprocessed)", tp001_metrics) + "\n")
print(format_metrics("tp001.txt (Processed)", preprocessed_tp001_metrics) + "\n")
print(f"\nRemoved words that occurred at least {random_frequency} times.")

--------- Text Metrics for Emma (Unprocessed) ---------
Total Words: 161983
Total Sentences: 10567
Overall Lexical Diversity: 4.48%
Average Lexical Diversity of Sentences: 94.43%
Top Ten Most Frequent Words: to, the, and, of, i, a, it, her, was, she
Highest Frequency Word: 'to' (Frequency: 5239)

--------- Text Metrics for Emma (Processed) ---------
Total Words: 61193
Total Sentences: 1
Overall Lexical Diversity: 15.21%
Average Lexical Diversity of Sentences: 15.21%
Top Ten Most Frequent Words: jane, time, great, woodhouse, nothing, dear, always, soon, may, thought
Highest Frequency Word: 'jane' (Frequency: 272)

--------- Text Metrics for tp001.txt (Unprocessed) ---------
Total Words: 39065
Total Sentences: 2037
Overall Lexical Diversity: 12.06%
Average Lexical Diversity of Sentences: 89.43%
Top Ten Most Frequent Words: the, to, we, and, i, it, a, is, be, for
Highest Frequency Word: 'the' (Frequency: 1507)

--------- Text Metrics for tp001.txt (Processed) ---------
Total Words: 19384


# **Visualization of The Top Ten Words with Their Frequencies Before and After Processing**

The visual comparison of word frequencies before and after text processing illustrates the shift from generic to specific language elements, informing the thematic interpretation of the text.

**Before Processing:**
The initial charts for "Emma" and "TP001" show a high frequency of stop words—common elements such as "to," "the," "and," "of," among others, which serve basic grammatical functions but do not distinguish the text's unique content. The analysis highlights the need to eliminate these stop words in preprocessing to pave the way for a more focused examination of the text's themes and subjects.

**After Processing:**
After stop words and words at or above the frequency threshold are removed, "Processed Emma" displays words that are central to the text's narrative. Terms such as "dear," "always," "soon," "may," "thought," "see," "shall," "without," "man," and "first" emerge, which suggest the narrative's focus on personal relationships and internal contemplation. These words carry thematic weight, pointing to emotion, time, and reflection in "Emma" by Jane Austen (1816). The exact list depends on the randomly drawn threshold, so it changes between runs.

In the "TP001" chart, the removal of stop words brings forward terms such as "electric," "change," "better," "good," "planet," "environment," "future," "would," "people," and "bad." These terms indicate a focus on environmental issues, progress, and evaluative commentary, signaling discussions around ecological responsibility, advancements, and societal impacts.

The transition from unprocessed to processed words is crucial in text analysis. By filtering out non-informative words and focusing on content-rich terms, the analysis can more accurately identify the main themes, sentiments, and discussion points within the text. The processing enables extraction of the core topics and narratives, providing a clearer view of the text's intent and subject matter.

In [53]:
# Create a subplot figure with 2 rows and 2 columns
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Unprocessed Emma', 'Unprocessed TP001', 'Processed Emma', 'Processed TP001')
)

# Original Emma
fig.add_trace(
    go.Bar(x=emma_top_ten_words, y=emma_frequencies),
    row=1, col=1
)

# Original TP001
fig.add_trace(
    go.Bar(x=tp001_top_ten_words, y=tp001_frequencies),
    row=1, col=2
)

# Preprocessed Emma
fig.add_trace(
    go.Bar(x=preprocessed_emma_top_ten, y=preprocessed_emma_freq),
    row=2, col=1
)

# Preprocessed TP001
fig.add_trace(
    go.Bar(x=preprocessed_tp001_top_ten, y=preprocessed_tp001_freq),
    row=2, col=2
)

# Update layout
fig.update_layout(
    title_text='Top Ten Words and Their Frequencies',
    showlegend=False,
    height=800, width=1200
)

# Customize axis labels (update_xaxes without row/col applies to every subplot)
fig.update_xaxes(title_text='Words')
fig.update_yaxes(title_text='Occurrence Frequency', col=1)

# Show the figure
fig.show()

# Print removed words note (the threshold is inclusive)
print(f"\nRemoved words that occur at least {random_frequency} times.")


Removed words that occur at least 275 times.


## **Comparative Analysis of Overall Lexical Diversity in Processed and Unprocessed Versions of "Emma" by Jane Austen.**

The results of the analysis below show the overall lexical diversity of Jane Austen's "Emma" in both its processed and unprocessed forms as the batch size increases. Each batch is cumulative: it contains the first N sentences of the novel, with N growing in steps of 100.

**Processed Overall Lexical Diversity (Blue):** As the batch size increases, lexical diversity gradually decreases. The more of the text is analysed together, the less lexically diverse the processed version appears, because words already seen recur more often than new words are introduced.

**Unprocessed Overall Lexical Diversity (Red):** This line represents the lexical diversity of the original, unprocessed text, which decreases with batch size in the same way.
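
The downward trend is expected for any ratio of unique words to total words: as more text is added, already-seen words recur faster than new ones appear. A toy sketch (the word list is invented) computes the ratio over growing prefixes, in the same cumulative way as the analysis below.

```python
words = "the cat saw the dog and the dog saw the cat".split()

ratios = []
for size in range(2, len(words) + 1, 3):
    prefix = words[:size]  # cumulative: always starts from the first word
    ratios.append(len(set(prefix)) / len(prefix))

# Each ratio is lower than the one before it
print([round(r, 2) for r in ratios])
```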

In [29]:
increment = 100  # n sentences per increment

# Split the text into sentences
sentences = sent_tokenize(emma_text)

# Incrementally increase batch size, bounded by the sent_tokenize count
# (emma_num_sentences comes from a regex split and may not match it)
batch_sizes = list(range(1, len(sentences), increment))

overall_lex_div_unprocessed = []
overall_lex_div_processed = []

# Calculate lexical diversities
for batch_size in batch_sizes:
    # Concatenate all sentences up to the current batch size
    concatenated_unprocessed = ' '.join(sentences[:batch_size])
    concatenated_processed = preprocess_text(concatenated_unprocessed, stop_words, random_frequency)[0]

    # Calculate overall lexical diversity for the concatenated text
    overall_lex_div_unprocessed.append(text_metrics(concatenated_unprocessed)[1])
    overall_lex_div_processed.append(text_metrics(concatenated_processed)[1])

In [31]:
# Batch size labels (number of sentences in each batch)
batch_size_labels = list(batch_sizes)

# Convert lexical diversity to percentages
processed_lex_div = [ld * 100 for ld in overall_lex_div_processed]
unprocessed_lex_div = [ld * 100 for ld in overall_lex_div_unprocessed]

# Create traces
trace1 = go.Scatter(
    x=batch_size_labels,
    y=processed_lex_div,
    mode='lines+markers',
    name='Overall Lexical Diversity (Processed)',
)
trace2 = go.Scatter(
    x=batch_size_labels,
    y=unprocessed_lex_div,
    mode='lines+markers',
    name='Overall Lexical Diversity (Unprocessed)',
)

# Layout
layout = go.Layout(
    title='Overall Lexical Diversity over Batch Size Increments',
    xaxis=dict(title='Number of Sentences'),
    yaxis=dict(title='Lexical Diversity (%)'),
)

# Figure
fig = go.Figure(data=[trace1, trace2], layout=layout)

# Show plot
fig.show()

# Print metrics for comparison with the plot
print("\n" + format_metrics("Emma (Unprocessed)", emma_metrics) + "\n")
print(format_metrics("Emma (Processed)", preprocessed_emma_metrics) + "\n")



--------- Text Metrics for Emma (Unprocessed) ---------
Total Words: 161983
Total Sentences: 10567
Overall Lexical Diversity: 4.48%
Average Lexical Diversity of Sentences: 94.43%
Top Ten Most Frequent Words: to, the, and, of, i, a, it, her, was, she
Highest Frequency Word: 'to' (Frequency: 5239)

--------- Text Metrics for Emma (Processed) ---------
Total Words: 57300
Total Sentences: 1
Overall Lexical Diversity: 16.22%
Average Lexical Diversity of Sentences: 16.22%
Top Ten Most Frequent Words: made, like, frank, ever, young, churchill, two, though, indeed, better
Highest Frequency Word: 'made' (Frequency: 198)

