A function to preprocess text data which can:

1. remove punctuation
2. remove stopwords
3. lowercase all words
4. remove words below/above a certain frequency

@Madison Kremmer
ID:300523256

In [23]:
#working

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text, min_freq=1, max_freq=float('inf')):
    # Split on whitespace, then drop any '/tag' suffix (Brown raw text is 'word/tag' pairs)
    pairs = text.split()
    words = [pair.split('/')[0] for pair in pairs]

    # Lowercase conversion
    words = [word.lower() for word in words]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Keep only purely alphanumeric tokens. This drops punctuation-only tokens and any
    # token containing punctuation (e.g. 'well-known', or 'cars,' in untokenized text);
    # numbers are kept
    words = [word for word in words if word.isalnum()]

    # Calculate word frequencies
    word_freq = FreqDist(words)

    # Keep words whose frequency lies in [min_freq, max_freq] (both bounds inclusive)
    words = [word for word in words if min_freq <= word_freq[word] <= max_freq]

    # Print some information for inspection
    print("Number of words after preprocessing:", len(words))
    print("Words after preprocessing:", words[:10])
    print("Word frequencies:", word_freq)
    print('------------------------------------------------------------------')

    return words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
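
The frequency filter in `preprocess_text` can be checked on a toy token list without downloading any NLTK data. The sketch below uses a hypothetical stand-alone helper, `filter_by_frequency` (not part of the notebook), with `collections.Counter` standing in for `FreqDist`. It assumes, as in the function above, that both thresholds are inclusive.

```python
from collections import Counter

def filter_by_frequency(tokens, min_freq=1, max_freq=float('inf')):
    """Keep tokens whose count in `tokens` lies in [min_freq, max_freq] (inclusive)."""
    counts = Counter(tokens)
    return [tok for tok in tokens if min_freq <= counts[tok] <= max_freq]

# Toy data: 'jury' appears 3 times, 'said' twice, the rest once
tokens = ['jury', 'said', 'jury', 'county', 'said', 'jury', 'election']

# Only tokens seen at least twice survive
print(filter_by_frequency(tokens, min_freq=2))

# Tokens seen more than twice ('jury') are dropped
print(filter_by_frequency(tokens, min_freq=1, max_freq=2))
```

Because the filter removes every occurrence of a word that falls outside the range, a low `max_freq` removes the most repeated words entirely, which is worth keeping in mind when reading the lexical diversity figures in the experiments.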


A function (or functions) to calculate these text metrics:

1. total number of words
2. overall lexical diversity of the text
3. average lexical diversity of text sentences
4. top ten most frequent words

In [27]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
nltk.download('punkt')

def calculate_text_metrics(text):
    # `text` is a list of tokens (e.g. the output of preprocess_text); join it into one string
    combined_text = ' '.join(text)

    # Tokenize the text into words
    words = word_tokenize(combined_text.lower())

    # Calculate total number of words
    total_words = len(words)

    # Calculate overall lexical diversity (unique words / total words); 0 for empty input
    overall_lexical_diversity = len(set(words)) / total_words if total_words > 0 else 0

    # Tokenize the text into sentences
    # Note: if punctuation was already removed by preprocessing, sent_tokenize has no
    # sentence boundaries to find and will usually return the whole text as one sentence
    sentences = sent_tokenize(combined_text)

    # Calculate average lexical diversity of text sentences
    sentence_lexical_diversities = []
    for sentence in sentences:
        # Lowercase, tokenize, and drop non-alphabetic tokens
        sentence_words = [word for word in word_tokenize(sentence.lower()) if word.isalpha()]

        # Check if the length of sentence_words is greater than zero before division
        if len(sentence_words) > 0:
            sentence_lexical_diversity = len(set(sentence_words)) / len(sentence_words)
            sentence_lexical_diversities.append(sentence_lexical_diversity)

    average_sentence_lexical_diversity = (
        sum(sentence_lexical_diversities) / len(sentence_lexical_diversities)
        if len(sentence_lexical_diversities) > 0
        else 0  # Handle the case when there are no sentences
    )

    # Calculate the frequency distribution of words
    word_freq = FreqDist(words)

    # Get the top ten most frequent words
    top_ten_words = word_freq.most_common(10)

    # Display the results
    print("Total number of words:", total_words)
    print("Overall lexical diversity of the text:", overall_lexical_diversity)
    print("Average lexical diversity of text sentences:", average_sentence_lexical_diversity)
    print("Top ten most frequent words:", top_ten_words)
    print('------------------------------------------------------------------')

    # Results are printed rather than returned, so the function returns None
    return


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
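
As a cross-check on the two lexical diversity metrics, here is a minimal pure-Python sketch. The helper names (`lexical_diversity`, `mean_sentence_diversity`) are hypothetical, and the sentence splitter is a crude regular expression on `.`, `!` and `?` rather than NLTK's `sent_tokenize`, so the numbers will not match the notebook's exactly.

```python
import re

def lexical_diversity(tokens):
    """Type-token ratio: unique tokens / total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_diversity(text):
    """Average type-token ratio over sentences split on '.', '!' or '?'."""
    # Crude sentence split (an assumption of this sketch, not NLTK's algorithm)
    token_lists = [re.findall(r'[a-z]+', s.lower()) for s in re.split(r'[.!?]+', text)]
    # Skip fragments with no alphabetic tokens, mirroring the zero-length check above
    scores = [lexical_diversity(toks) for toks in token_lists if toks]
    return sum(scores) / len(scores) if scores else 0.0

text = "The cat sat. The cat sat on the mat. The cat!"
tokens = re.findall(r'[a-z]+', text.lower())

print("overall:", lexical_diversity(tokens))
print("mean per sentence:", mean_sentence_diversity(text))

# With punctuation removed there is only one "sentence", so both metrics coincide
print("no punctuation:", mean_sentence_diversity("the cat sat the cat"))
```

Short sentences repeat few words, so the per-sentence average is normally much higher than the overall ratio. When the two values are almost identical, that usually means the text reached the metric function as a single sentence.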


Using this data, look for trends and consistent effects that preprocessing has on various text metrics.
Also look to see if there are any texts more or less immune to the effects of preprocessing. After
conducting your experiment, write a short report (500-600 words) reflecting on your results.

You should detail the comparisons and analyses that you conducted, what results you found, and your
interpretation of the results. Specifically, you should focus on what happens to these metrics under
different preprocessing conditions, and focus on making conclusions about their implications for text
analysis in general.
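
One way to organise the comparison is to compute the same metrics on the raw and the preprocessed version of a text and put them side by side. The sketch below is illustrative only: `STOPWORDS` is a deliberately tiny hypothetical list, not NLTK's, and `simple_preprocess` strips surrounding punctuation instead of dropping the whole token, so it is not a copy of `preprocess_text`.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {'the', 'a', 'of', 'and', 'to', 'is', 'in', 'on'}

def ttr(tokens):
    """Type-token ratio (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def simple_preprocess(text):
    """Lowercase, strip surrounding punctuation, drop stopwords and empty tokens."""
    tokens = [tok.strip(string.punctuation).lower() for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

def compare(text):
    """Return token counts and type-token ratios before and after preprocessing."""
    raw = text.split()
    processed = simple_preprocess(text)
    return {
        'raw_tokens': len(raw),
        'processed_tokens': len(processed),
        'raw_ttr': ttr(raw),
        'processed_ttr': ttr(processed),
    }

sample = "The jury said the election is fair, and the jury praised the city."
result = compare(sample)
for name, value in result.items():
    print(f"{name}: {value}")
```

Running the same `compare`-style function over every text in the experiment gives one row per text, which makes it easier to see which texts change most under preprocessing.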

In [28]:
#Experiment 1 - Brown

import nltk
nltk.download('brown')
from nltk.corpus import brown

document_name = "ca01"  # Replace with any document from the corpus
text = brown.raw(document_name)

# Using the function for the 'brown' corpus format
result_brown = preprocess_text(text)
print(result_brown)

calculate_text_metrics(result_brown)

# Only words that appear between 1 and 4 times (max_freq is inclusive)
print("Results of words that appear at least 1 times and less than 4:")
result_brown_1_4 = preprocess_text(text, min_freq=1, max_freq=4)
calculate_text_metrics(result_brown_1_4)


Number of words after preprocessing: 1111
Words after preprocessing: ['fulton', 'county', 'grand', 'jury', 'said', 'friday', 'investigation', 'recent', 'primary', 'election']
Word frequencies: <FreqDist with 657 samples and 1111 outcomes>
------------------------------------------------------------------
Total number of words: 1111
Overall lexical diversity of the text: 0.5913591359135913
Average lexical diversity of text sentences: 0.5880733944954128
Top ten most frequent words: [('said', 24), ('jury', 18), ('county', 15), ('fulton', 14), ('election', 14), ('state', 12), ('city', 9), ('department', 9), ('would', 9), ('resolution', 9)]
------------------------------------------------------------------
Results of words that appear at least 1 times and less than 4:
Number of words after preprocessing: 885
Words after preprocessing: ['grand', 'friday', 'investigation', 'recent', 'primary', 'produced', 'evidence', 'irregularities', 'took', 'place']
Word frequencies: <FreqDist with 657 samp

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [32]:
#Experiment 2 - The Current Topic Question: Petrol cars should be banned by 2030.

!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp001.txt'


# Each line is tab-separated; the comment text is the second field
with open('tp001.txt') as f:
    tp001 = f.read().rstrip().split('\n')
tp001_comments = [comment.split('\t')[1] for comment in tp001]
tp001_combined = ' '.join(tp001_comments)
tp001_tokens = nltk.word_tokenize(tp001_combined)

result_tp001 = preprocess_text(tp001_combined)
calculate_text_metrics(result_tp001)

print('==Unprocessed==')
calculate_text_metrics(tp001_tokens)

--2023-11-23 21:35:04--  https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp001.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 220746 (216K) [text/plain]
Saving to: ‘tp001.txt.3’


2023-11-23 21:35:04 (5.45 MB/s) - ‘tp001.txt.3’ saved [220746/220746]

Number of words after preprocessing: 18987
Words after preprocessing: ['need', 'work', 'hard', 'make', 'happen', '3d', 'better', 'bands', 'whole', 'country']
Word frequencies: <FreqDist with 4286 samples and 18987 outcomes>
------------------------------------------------------------------
Total number of words: 19015
Overall lexical diversity of the text: 0.22550617933210623
Average lexical diversity of text sentences: 0.22432575356953993
Top ten most frequent words: [('cars', 49

In [None]:
#Experiment 3 - The Current Topic Question: Nature helps us get through lockdowns
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp008.txt'

# Each line is tab-separated; the comment text is the second field
with open('tp008.txt') as f:
    tp008 = f.read().rstrip().split('\n')
tp008_comments = [comment.split('\t')[1] for comment in tp008]
tp008_combined = ' '.join(tp008_comments)
tp008_tokens = nltk.word_tokenize(tp008_combined)

result_tp008 = preprocess_text(tp008_combined)
calculate_text_metrics(result_tp008)

--2023-11-23 01:18:34--  https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp008.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 150802 (147K) [text/plain]
Saving to: ‘tp008.txt.3’


2023-11-23 01:18:34 (1.72 MB/s) - ‘tp008.txt.3’ saved [150802/150802]

Number of words after preprocessing: 13276
Words after preprocessing: ['santana', 'stink', 'bum', 'would', 'good', 'coing', 'stress', 'reconnect', 'really', 'nature']
Word frequencies: <FreqDist with 3631 samples and 13276 outcomes>
------------------------------------------------------------------
Total number of words: 13284
Overall lexical diversity of the text: 0.27341162300511895
Average lexical diversity of text sentences: 0.2720888083371092
Top ten most frequent words: [('n

In [None]:
#Experiment 4 - The Great Gatsby

import requests
r = requests.get(r'https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
great_gatsby = r.text

# Subset for the book text
# (removing the Project Gutenberg introduction/footnotes)
great_gatsby = great_gatsby[1433:277912]
#print(great_gatsby)

result_gatsby = preprocess_text(great_gatsby)
calculate_text_metrics(result_gatsby)


Number of words after preprocessing: 16803
Words after preprocessing: ['people', 'world', 'advantages', 'say', 'always', 'unusually', 'communicative', 'reserved', 'understood', 'meant']
Word frequencies: <FreqDist with 4725 samples and 16803 outcomes>
------------------------------------------------------------------
Total number of words: 16806
Overall lexical diversity of the text: 0.28126859454956565
Average lexical diversity of text sentences: 0.281183051654368
Top ten most frequent words: [('said', 164), ('one', 130), ('like', 115), ('tom', 115), ('gatsby', 105), ('came', 103), ('daisy', 99), ('little', 91), ('went', 90), ('back', 89)]
------------------------------------------------------------------
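
The hard-coded slice `[1433:277912]` depends on the exact length of the downloaded file, so it may cut in the wrong place if the file is updated. A more robust sketch is to cut between the start and end marker lines. This assumes the file contains lines beginning with `*** START OF` and `*** END OF`, which is an assumption about the file's layout that should be checked against the actual download. `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the notebook.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Falls back to the full text if the markers are missing or out of order.
    """
    start = text.find('*** START OF')
    end = text.find('*** END OF')
    if start == -1 or end == -1 or end <= start:
        return text
    # Move past the rest of the START marker line before slicing
    body_start = text.find('\n', start)
    if body_start == -1 or body_start > end:
        return text
    return text[body_start:end].strip()

# Made-up miniature file for illustration
sample = (
    "Header and licence text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body of the book.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Footer text\n"
)
print(strip_gutenberg_boilerplate(sample))
```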


In [39]:
import requests
r = requests.get(r'https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
great_gatsby = r.text

# Subset for the book text
# (removing the Project Gutenberg introduction/footnotes)
great_gatsby = great_gatsby[1433:277912]
#print(great_gatsby)

result_gatsby = preprocess_text(great_gatsby)
Metrics_gatsby = calculate_text_metrics(result_gatsby)
print(Metrics_gatsby)

# max_freq is inclusive, so this keeps words that appear between 1 and 4 times
print("Results of words that appear at least 1 times and less than 4:")
gatsby_1_4 = preprocess_text(great_gatsby, min_freq=1, max_freq=4)
calculate_text_metrics(gatsby_1_4)



Number of words after preprocessing: 16803
Words after preprocessing: ['people', 'world', 'advantages', 'say', 'always', 'unusually', 'communicative', 'reserved', 'understood', 'meant']
Word frequencies: <FreqDist with 4725 samples and 16803 outcomes>
------------------------------------------------------------------
Total number of words: 16806
Overall lexical diversity of the text: 0.28126859454956565
Average lexical diversity of text sentences: 0.281183051654368
Top ten most frequent words: [('said', 164), ('one', 130), ('like', 115), ('tom', 115), ('gatsby', 105), ('came', 103), ('daisy', 99), ('little', 91), ('went', 90), ('back', 89)]
------------------------------------------------------------------
None
Results of words that appear at least 1 times and less than 4:
Number of words after preprocessing: 6018
Words after preprocessing: ['advantages', 'unusually', 'communicative', 'reserved', 'meant', 'deal', 'inclined', 'reserve', 'habit', 'natures']
Word frequencies: <FreqDist wi
