<img src="https://communications.univie.ac.at/fileadmin/_processed_/csm_Uni_Logo_2016_2f47aacf37.jpg" 
     alt="Logo Universität Wien" 
     width="200"/>

# Practical Machine Learning for Natural Language Processing - 2023 SS  

### Assignment 1 - Python for Poets  

This assignment adapts the original exercise ["Unix for Poets"](https://www.cs.upc.edu/~padro/Unixforpoets.pdf) to Python.

***

### Loading the document

In [None]:
with open("C:/Users/accou/OneDrive/Dokumente/GitHub/Python_Course/Data/txt/nyt_200811.txt", "r") as f:
    text = f.read()

print(text[0:500])

***

### You will solve the following exercises using **Pure Python**  
### (only packages "string" and "re" are allowed).  

1. Count words in a text  
2. Sort a list of words in various ways  
   • ascii order   
   • "rhyming" order   
3. Extract useful info for a dictionary  
4. Compute ngram statistics  
5. Make a Concordance  

***

#### 1. Count words in a text

a. Output a list of words in the file along with their frequency counts (ignoring case).   
b. Count how many unique words there are (ignoring case).    
c. Check how common each distinct sequence of vowels is (e.g. the sequences "ieu" or just "e" in "lieutenant").
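
As a small illustration of what task c means by a vowel sequence, the sketch below splits the example word from the task into its maximal runs of vowels. It assumes that "y" counts as a consonant; `sample_word` and `vowel_runs` are illustrative names only.

```python
import re

# Example word taken from the task description
sample_word = "lieutenant"

# Each maximal run of the vowels a, e, i, o, u is one sequence
# (assumption: "y" is treated as a consonant)
vowel_runs = re.findall(r"[aeiou]+", sample_word.lower())

print(vowel_runs)  # ['ieu', 'e', 'a']
```

Counting these runs over every word in the text gives the frequency of each vowel sequence.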

In [None]:
# a)
import string
# Create list of lower-case words
list_of_words = text.lower().translate(str.maketrans('', '', string.punctuation)).split()
# Create dictionary mapping each word (key) to its frequency (value)
freq_dic = {}
for word in list_of_words:
    if word in freq_dic:
        freq_dic[word] += 1
    else:
        freq_dic[word] = 1
# print dictionary
for word, freq in freq_dic.items():
    print(f'{word}: {freq}')

In [None]:
# b) 
# Count the unique words, ignoring case (list_of_words is already lower-cased)
len(set(list_of_words))

In [None]:
# c)
"""y is considered as a consonant in this exercise"""
import re

# Define regular expression to match vowels
vowel_pattern = re.compile('[aeiou]+')

# Initialize dictionary to store the frequency of vowel sequences
vowel_freq = {}

# Iterate over each word
for word in list_of_words:
    # Find all maximal vowel sequences in the word
    for vowel_seq in vowel_pattern.findall(word):
        # Increment frequency of this vowel sequence
        if vowel_seq in vowel_freq:
            vowel_freq[vowel_seq] += 1
        else:
            vowel_freq[vowel_seq] = 1

# Print dictionary
for vowel_seq, freq in vowel_freq.items():
    print(f'{vowel_seq}: {freq}')

#### 2. Sorting and reversing lines of text

a. Sort each line alphabetically (ignoring case).  
b. Sort in numeric ([ascii](https://python-reference.readthedocs.io/en/latest/docs/str/ASCII.html)) order.  
c. Alphabetically reverse sort (ignoring case).  
d. Sort in reverse numeric ([ascii](https://python-reference.readthedocs.io/en/latest/docs/str/ASCII.html)) order.  
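
The difference between these orders is easiest to see on a short list. The sketch below uses a made-up `sample_words` list rather than the NYT text. It also shows one common reading of the "rhyming" order from the exercise overview, namely sorting by the reversed word; that reading is an assumption based on the original Unix exercise.

```python
# Hypothetical sample data, not taken from the NYT file
sample_words = ["banana", "Apple", "cherry", "Zebra", "date"]

# ASCII (numeric) order: all uppercase letters sort before lowercase ones
ascii_sorted = sorted(sample_words)

# Alphabetical order ignoring case: compare lower-cased copies
alpha_sorted = sorted(sample_words, key=str.lower)

# Reverse variants simply flip the comparison
ascii_reversed = sorted(sample_words, reverse=True)
alpha_reversed = sorted(sample_words, key=str.lower, reverse=True)

# "Rhyming" order (assumed meaning): sort by the word read backwards
rhyme_sorted = sorted(sample_words, key=lambda w: w.lower()[::-1])

print(ascii_sorted)  # ['Apple', 'Zebra', 'banana', 'cherry', 'date']
print(alpha_sorted)  # ['Apple', 'banana', 'cherry', 'date', 'Zebra']
print(rhyme_sorted)  # ['banana', 'Zebra', 'Apple', 'date', 'cherry']
```

The same `key` and `reverse` arguments work whether the items being sorted are words, lines, or the characters of a line.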

In [None]:
# a)
# Remove punctuation and split the text into lines
lines = text.translate(str.maketrans('', '', string.punctuation)).splitlines()
# Remove empty lines
strip_lines = [line for line in lines if line.strip()]
# Sort the characters in each line alphabetically, ignoring case
sorted_alpha = [''.join(sorted(line, key=str.lower)) for line in strip_lines]
# Remove the whitespace that sorting moved to the front of each line
sorted_alpha_final = [line.strip() for line in sorted_alpha]
# Display the sorted lines
sorted_alpha_final

In [None]:
# b)
# Order numerically (default sort key, i.e. by character code)
# Remove punctuation and split the text into lines
lines = text.translate(str.maketrans('', '', string.punctuation)).splitlines()
# Remove empty lines
strip_lines = [line for line in lines if line.strip()]
# Sort the characters in each line in ASCII order
sorted_ascii = [''.join(sorted(line)) for line in strip_lines]
# Remove the whitespace that sorting moved to the front of each line
sorted_ascii_final = [line.strip() for line in sorted_ascii]
# Display the sorted lines
sorted_ascii_final

In [None]:
# c)
# Reverse each alphabetically sorted line by stepping from its last to its first character
sorted_alpha_final_reversed = [line[::-1] for line in sorted_alpha_final]
# Print sorted list
sorted_alpha_final_reversed

In [None]:
# d)
# Reverse each ASCII-sorted line by stepping from its last to its first character
sorted_ascii_reversed = [line[::-1] for line in sorted_ascii_final]
# Print sorted list
sorted_ascii_reversed

#### 3. Computing basic statistics

a. Find the 50 most common words  
b. Find the words in the NYT that end in "zz"  
c. Count the lines, the words, and the characters  
d. How many all uppercase words are there in this NYT file?  
e. How many 4-letter words?  
f. How many different words are there with no vowels?  
g. **tricky:** How many “1 syllable” words are there?  

In [None]:
# a) 
# Sort freq_dic with respect to values, and reverse sorting order
sorted_freq = sorted(freq_dic.items(),  key = lambda x: x[1], reverse = True)
# Print top 50 words
for key, value in sorted_freq[0:50]:
    print(f'{key}: {value}')

In [None]:
# b)
# Find the words in the NYT that end in "zz" (similar to exercise 1a)
zz_dic = {}
for word in list_of_words:
    # use 'string.endswith' to find words that end with zz
    if word.endswith("zz"):
        if word in zz_dic:
            zz_dic[word] += 1
        else:
            zz_dic[word] = 1
# Print dictionary 
for word, freq in zz_dic.items():
    print(f'{word}: {freq}')

In [None]:
# c)
# Number of lines
print("Count lines:", len(text.splitlines()))
# Number of words
word_count = len(text.translate(str.maketrans('', '', string.punctuation)).split())
print("Count words:", word_count)
# Number of characters
print("Count characters:", len(text))

In [None]:
# d)
# Remove digits from list of words
all_words = text.translate(str.maketrans('', '', string.punctuation)).translate(str.maketrans('', '', string.digits)).split()
# Count upper case words
count_u = 0
for word in all_words:
    if word.isupper():
        count_u += 1
print('The number of uppercase words is', count_u)

In [None]:
# e) 
# Count 4-letter words
count_4 = 0
for word in all_words:
    if len(word) == 4:
        count_4 += 1

print(f'There are {count_4} 4-letter words in the text file.')

In [None]:
# f)
"""y is considered as a consonant"""
count = 0
# Iterate over the set of words so that each different word is counted once
for word in set(list_of_words):
    # Skip tokens that are not purely alphabetic (e.g. numbers)
    if not word.isalpha():
        continue
    # Count the word if it contains none of the vowels a, e, i, o, u
    if not any(letter in 'aeiou' for letter in word):
        count += 1

print(f'There are {count} different words without vowels in the text file.')

In [None]:
# g)
# Idea (rough heuristic): a word with exactly one vowel group has one syllable
import re
count = 0
for word in all_words:
    # Count maximal runs of vowels, so "boat" has one group and "radio" has two
    num_vowel_groups = len(re.findall('[aeiou]+', word.lower()))
    if num_vowel_groups == 1:
        # Silent vowels (e.g. the final "e" in "make") are not handled
        count += 1

# Print the result
print(f'There are {count} words with one syllable in the text file.')

#### 4. Compute ngrams  

a. Find the 10 most common bigrams (2 words)  
b. Find the 10 most common trigrams (3 words)  
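
Bigrams and trigrams differ only in the window size, so both tasks can share one helper. The following is a minimal sketch in pure Python; `count_ngrams`, `most_common`, and the sample sentence are illustrative names and data, not part of the assignment.

```python
def count_ngrams(words, n):
    """Return a dict mapping each n-gram (space-joined string) to its count."""
    counts = {}
    # Zipping n shifted copies of the list pairs every word with its successors
    for gram in zip(*(words[i:] for i in range(n))):
        key = " ".join(gram)
        counts[key] = counts.get(key, 0) + 1
    return counts


def most_common(counts, k):
    """Return the k (item, count) pairs with the highest counts."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical sample sentence, not taken from the NYT file
sample = "the cat sat on the mat and the cat slept".split()

print(most_common(count_ngrams(sample, 2), 1))  # [('the cat', 2)]
```

Calling `count_ngrams(words, 3)` on the same list yields the trigram counts without any further code.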

In [None]:
# a)
# Compute bigrams by matching all adjacent words
bigrams = [w1 + ' ' + w2 for w1, w2 in zip(list_of_words, list_of_words[1:])]
# Create dictionary of bigrams and their frequencies
bigram_dic = {}
for word in bigrams:
    if word in bigram_dic:
        bigram_dic[word] += 1
    else:
        bigram_dic[word] = 1
# Sort dictionary with respect to values, and reverse sorting order
bigram_freq = sorted(bigram_dic.items(), key = lambda x: x[1], reverse = True)
# Print top 10 bigrams
for key, value in bigram_freq[0:10]:
    print(f'{key}: {value}')


In [None]:
# b)
# Compute trigrams by matching all adjacent words
trigrams = [w1 + ' ' + w2 + ' ' + w3 for w1, w2, w3 in zip(list_of_words, list_of_words[1:], list_of_words[2:])]
# Create dictionary of trigrams and their frequencies
trigram_dic = {}
for word in trigrams:
    if word in trigram_dic:
        trigram_dic[word] += 1
    else:
        trigram_dic[word] = 1
# Sort dictionary with respect to values, and reverse sorting order
trigram_freq = sorted(trigram_dic.items(), key = lambda x: x[1], reverse = True)
# Print top 10 trigrams
for key, value in trigram_freq[0:10]:
    print(f'{key}: {value}')

#### 5. Make a Concordance

a. Create a concordance display for an arbitrary word. See the example below  

![](../../Data/figs/Sample-concordance-lines-of-actually.png)
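
A concordance display of this kind usually lines up the search term in one column, with a few words of context on either side. The sketch below shows one possible way to produce such aligned lines; `kwic_lines`, its `window` and `width` parameters, and the sample sentence are assumptions made for illustration.

```python
def kwic_lines(term, words, window=3, width=30):
    """Return one aligned context line per occurrence of term in words."""
    lines = []
    for i, word in enumerate(words):
        if word == term:
            # Up to `window` words before and after the match
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            # Right-align the left context so the term starts in the same column
            lines.append(f"{left[-width:]:>{width}}  {word}  {right[:width]}")
    return lines


# Hypothetical sample sentence, not taken from the NYT file
sample = "it was actually raining but we actually went out".split()

for line in kwic_lines("actually", sample, window=2, width=12):
    print(line)
```

Truncating the left context from the right end and padding it to a fixed width keeps the term column stable even when a match occurs near the start of the text.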

In [None]:
def get_concordance(term, term_list):
    # Create dictionary for concordances
    concordance = {}
    # Key = word, value = index of word
    for i, word in enumerate(term_list):
        # If word is not yet in concordance dictionary
        if word not in concordance:
            # Create empty list for word
            concordance[word] = []
        # Append index to list to store all indices of this key as values
        concordance[word].append(i)

    # Print the concordance
    for word in concordance:
        # Retrieve list of indices
        indices = concordance[word]
        # Create list to store context of term
        context = []
        # Iterate over indices
        for i in indices:
            # Start index = current index - 2 (minimally 0)
            start_index = max(0, i-2)
            # End index = current index + 3 (slice end is exclusive; capped at the length of term_list)
            end_index = min(len(term_list), i+3)
            # Slice context words from term_list and store them in the list 'context'
            context.append(' '.join(term_list[start_index:end_index]))
        # Print concordance when the searched term matches the word in the for loop
        if word == term:
            print(f" {word}:\n\n", '\n '.join(context))

# Call function and enter 'word' and 'list'
get_concordance('million', list_of_words)

***

##### Extra Credit – Secret Message
+ The answers to the extra credit exercises will reveal a secret message.  
+ We will be working with the following text file for these exercises:  
[Link to Text](https://web.stanford.edu/class/cs124/lec/secret_ec.txt)  
(No starter code in the Extra Credit)  

##### Extra Credit Exercise 1
• Find the 2 most common words in secret_ec.txt containing the letter e.  
• Your answer will correspond to the first two words of the secret message.  

##### Extra Credit Exercise 2
• Find the 2 most common bigrams in secret_ec.txt where the second word in the bigram ends with a consonant.  
• Your answer will correspond to the next four words of the secret message.  

##### Extra Credit Exercise 3
• Find all 5-letter-long words that only appear once in secret_ec.txt.   
• Concatenate your result. This will be the final word of the secret message.  

What is the secret message?  