# Task 1

In Moodle you will find the file trump.xml, which contains Donald Trump’s speeches
from the campaign rallies of his 2016 presidential election campaign.

The file also contains meta information about the place and date of each speech. We are, however,
only interested in the speeches themselves. Read the XML file into your console as if it were a
simple text file and then use regular expressions (regex) to extract the speeches.


In [17]:
import re
from collections import Counter

# Step 1: Extracting speeches
# Read the XML file as plain text into 'xml_content'
with open('trump.xml', 'r', encoding='utf-8') as file:
    xml_content = file.read()

speeches = re.findall(r'<Speech>(.*?)</Speech>', xml_content, re.DOTALL)
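
To illustrate why the pattern uses `re.DOTALL` together with the non-greedy `.*?`, here is a minimal sketch on an inline string. The surrounding tags (`<Rally>`, `<Place>`, `<Date>`) and the sample text are made up for illustration; only the `<Speech>` tag is taken from the pattern above, and the real trump.xml may be structured differently.

```python
import re

# Hypothetical miniature of the XML file (assumed layout, invented content).
sample_xml = """<Rally>
  <Place>Example City</Place>
  <Date>2016-01-01</Date>
  <Speech>Thank you.
Thank you very much.</Speech>
</Rally>
<Rally>
  <Place>Another Town</Place>
  <Date>2016-01-02</Date>
  <Speech>We are going to win.</Speech>
</Rally>"""

# re.DOTALL lets "." match newlines, so speeches spanning several lines are captured.
# The non-greedy ".*?" stops at the first closing tag instead of running on to the last one.
sample_speeches = re.findall(r'<Speech>(.*?)</Speech>', sample_xml, re.DOTALL)

# Without re.DOTALL, the multi-line first speech is silently skipped.
single_line_only = re.findall(r'<Speech>(.*?)</Speech>', sample_xml)

print(len(sample_speeches))   # 2
print(len(single_line_only))  # 1
```

A quick sanity check like `len(speeches)` on the real file is a cheap way to confirm that the number of extracted speeches matches expectations.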


# Task 2

Apply elementary tokenization steps. That is, within each speech:
* Remove punctuation, numbers and special characters
* Turn all letters into lower case
* Tokenize the text into individual words

The result should be a list of lists (list of vectors for R). Each inner list represents a speech as
a list of words.
Count how often each word occurs in this text corpus and display the 10 most common words.


In [22]:
from collections import Counter
import string

# Function to clean and tokenize each speech
def clean_and_tokenize(speech):
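    # Remove punctuation and digits, then convert all letters to lower case.
    # re.escape keeps characters such as "\", "]" and "^" literal inside the character class.
    cleaned_text = re.sub('[' + re.escape(string.punctuation + string.digits) + ']', '', speech).lower()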
    # Tokenize the text into individual words
    tokens = cleaned_text.split()
    return tokens

# Applying the cleaning and tokenization function to each speech
tokenized_speeches = [clean_and_tokenize(speech) for speech in speeches]

# Flattening the list of lists to count word frequencies across all speeches
all_words = [word for speech in tokenized_speeches for word in speech]
word_counts = Counter(all_words)

# Displaying the 10 most common words
most_common_words = word_counts.most_common(10)
most_common_words


[('the', 14144),
 ('and', 11282),
 ('to', 9333),
 ('a', 7888),
 ('you', 7694),
 ('i', 7623),
 ('of', 6818),
 ('we', 6526),
 ('it', 5584),
 ('they', 5520)]
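
The three tokenization steps and the frequency count can be traced on a tiny example. The sketch below is a stand-alone variant of the cell above, run on two invented sentences rather than on the speeches. The character class is built with `re.escape`, so that characters with a special meaning inside `[...]` are treated literally.

```python
import re
import string
from collections import Counter

# Everything to strip: ASCII punctuation and digits, escaped for safe use in a character class.
STRIP_PATTERN = re.compile('[' + re.escape(string.punctuation + string.digits) + ']')

def tokenize(text):
    """Remove ASCII punctuation and digits, lower-case, and split on whitespace."""
    return STRIP_PATTERN.sub('', text).lower().split()

# Made-up sample corpus (not taken from the speeches).
toy_speeches = ["Thank you! Thank you, Ohio.", "We don't lose in 2016. We win."]

# List of lists: one inner list of words per speech.
toy_tokens = [tokenize(s) for s in toy_speeches]

# Flatten and count across the whole toy corpus.
toy_counts = Counter(word for speech in toy_tokens for word in speech)

print(toy_tokens)
print(toy_counts.most_common(2))
```

Note that `don't` becomes `dont` because the apostrophe is deleted rather than replaced by a space. Also note that `string.punctuation` only covers ASCII characters, so typographic quotes such as `’` would pass through unchanged.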

# Task 3

Use one automated stemming method and one automated lemmatization method available for your
programming language. Apply both to the corpus resulting from Task 2 and compare the resulting
texts. Which of the two approaches would you prefer?

In [26]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk

# Downloading required NLTK resources if not already present
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initializing stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Applying stemming and lemmatization to a sample from the corpus
# To keep the computation manageable, we use a subset of the corpus for demonstration
sample_speech_words = tokenized_speeches[0][:2000]  # Using the first 2,000 words of the first speech

# Stemming
stemmed_words = [stemmer.stem(word) for word in sample_speech_words]

# Lemmatization
lemmatized_words = [lemmatizer.lemmatize(word) for word in sample_speech_words]

# Counting the unique words after stemming and lemmatization
stemmed_unique_words = len(set(stemmed_words))
lemmatized_unique_words = len(set(lemmatized_words))

stemmed_unique_words, lemmatized_unique_words, stemmed_words[:10], lemmatized_words[:10]


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


(450,
 479,
 ['thank',
  'you',
  'thank',
  'you',
  'thank',
  'you',
  'to',
  'vice',
  'presid',
  'penc'],
 ['thank',
  'you',
  'thank',
  'you',
  'thank',
  'you',
  'to',
  'vice',
  'president',
  'penny'])

### Which of the two approaches would you prefer?

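I would prefer lemmatization. In the sample, the Porter stemmer truncates words to stems that are not real words ("presid", "penc"), whereas the WordNet lemmatizer returns dictionary forms ("president"). Stemming merges word forms more aggressively (450 unique tokens versus 479 after lemmatization), but at the cost of readability, so lemmatization is better suited to analyses that need to preserve the meaning of the original text. It is not error-free, though: the output shows the surname "pence" being mapped to "penny".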
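
A comparison like this can be made more systematic by listing every word on which the two methods disagree, instead of inspecting only the first ten tokens. The following is a minimal sketch of such a helper. The name `compare_normalizers` is hypothetical, and `toy_stem` and `toy_lemmatize` are crude stand-ins written only so that the example runs without NLTK data; they are not the Porter or WordNet algorithms.

```python
def compare_normalizers(words, normalize_a, normalize_b):
    """Return both vocabulary sizes and the (word, a, b) triples where the outputs differ."""
    out_a = [normalize_a(w) for w in words]
    out_b = [normalize_b(w) for w in words]
    disagreements = sorted({(w, a, b) for w, a, b in zip(words, out_a, out_b) if a != b})
    return len(set(out_a)), len(set(out_b)), disagreements

def toy_stem(word):
    """Toy stand-in for a stemmer: strips the first matching suffix from a fixed list."""
    for suffix in ('ing', 'ent', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Toy stand-in for a lemmatizer: a tiny lookup table of dictionary forms.
TOY_LEMMAS = {'countries': 'country', 'jobs': 'job'}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

sample = ['president', 'jobs', 'countries', 'winning', 'great']
size_a, size_b, diffs = compare_normalizers(sample, toy_stem, toy_lemmatize)

for word, stemmed, lemma in diffs:
    print(f"{word}: stem={stemmed!r}, lemma={lemma!r}")
```

In the notebook, the same helper could be called as `compare_normalizers(sample_speech_words, stemmer.stem, lemmatizer.lemmatize)` to list all disagreements in the sample.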

# Task 4

Use your "best" corpus from Task 3 and apply stop word removal. That is, remove every word
that appears in a stop word list from your text. Note that you have to apply the same pre-processing
to your stop words as to your text, such as removing the apostrophe from "don’t".
Compare the most common words with the results from Task 2. What do you notice?


In [27]:
# Task 3 favoured lemmatization, but it was only applied to a sample there.
# The stop word removal below therefore runs on `tokenized_speeches`,
# the cleaned but un-lemmatized corpus from Task 2.
# First, we need to obtain a list of English stop words.

# NLTK provides a comprehensive list of English stop words.
nltk.download('stopwords')
from nltk.corpus import stopwords

# Getting English stop words
stop_words = set(stopwords.words('english'))

# The stop words should receive the same pre-processing as the corpus.
# Caveat: the list is used unmodified here, so entries containing an apostrophe
# (e.g. "don't") do not match the cleaned corpus tokens (e.g. "dont").

# Removing stop words from the lemmatized corpus
lemmatized_text_without_stopwords = [[word for word in speech if word not in stop_words]
                                     for speech in tokenized_speeches]

# Flattening the list of lists to count word frequencies
all_words_without_stopwords = [word for speech in lemmatized_text_without_stopwords for word in speech]
word_counts_without_stopwords = Counter(all_words_without_stopwords)

# Displaying the 10 most common words after stop word removal
most_common_words_without_stopwords = word_counts_without_stopwords.most_common(10)
most_common_words_without_stopwords


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('going', 2319),
 ('said', 2217),
 ('people', 2144),
 ('know', 2039),
 ('great', 2005),
 ('dont', 1915),
 ('right', 1665),
 ('like', 1504),
 ('thats', 1490),
 ('want', 1489)]

Result from Task 2: 

[('the', 14144),
 ('and', 11282),
 ('to', 9333),
 ('a', 7888),
 ('you', 7694),
 ('i', 7623),
 ('of', 6818),
 ('we', 6526),
 ('it', 5584),
 ('they', 5520)]

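Comparing the two lists shows why stop word removal matters. The Task 2 top 10 consists entirely of function words ("the", "and", "to", ...) that say nothing about the content of the speeches, whereas the Task 4 top 10 is dominated by content words such as "going", "said", "people", "know" and "great". Contracted forms such as "dont" and "thats" still rank in the top 10, however. At least for "dont" the cause is visible in the code: the corpus had its apostrophes removed in Task 2, but the stop word list was used unmodified, so the entry "don't" no longer matches the token "dont". Applying the same pre-processing to the stop words, as the task asks, would remove it.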
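
The effect of pre-processing the stop words can be sketched on a small example. The stop word list below is a short hand-written stand-in (an assumption made so that the example runs without downloading anything; NLTK's list is much longer), and `clean` is a hypothetical helper that mirrors the punctuation and digit removal from Task 2.

```python
import re
import string

# Same stripping rule as for the corpus: drop ASCII punctuation and digits.
STRIP_PATTERN = re.compile('[' + re.escape(string.punctuation + string.digits) + ']')

def clean(word):
    """Apply the corpus pre-processing to a single stop word."""
    return STRIP_PATTERN.sub('', word).lower()

# Hand-written stand-in for a stop word list (assumption, for illustration only).
raw_stop_words = ["the", "and", "don't", "it's", "i"]

# Pre-processed stop words; discard anything that becomes empty after cleaning.
cleaned_stop_words = {clean(w) for w in raw_stop_words} - {''}

# Tokens as they would look after the Task 2 cleaning (apostrophes already removed).
tokens = ['i', 'dont', 'know', 'the', 'people', 'its', 'great']

kept_raw = [t for t in tokens if t not in set(raw_stop_words)]
kept_cleaned = [t for t in tokens if t not in cleaned_stop_words]

print(kept_raw)      # contractions slip through the unmodified list
print(kept_cleaned)  # contractions are removed once the list is cleaned too
```

In the notebook, the corresponding change would be to build the set as `{clean(w) for w in stopwords.words('english')}` before filtering. The resulting top 10 would then differ from the one shown above.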

### Thank you