## Submission Instructions

You must submit two materials for HW1 on KLMS **by March 28 at 23:59:59**. Late submissions are accepted only within two days of the deadline, with a penalty \(-20%, -40%\).

<br>

### A. ```.ipynb``` file containing your Python code

- Provide your code below each problem. After completing each problem, make sure to **run the code and print the results**. Your submitted code must be executable.
- You are allowed to use more than one code block per problem.

### B. Report

- Submit a report that describes the process and results of solving each problem (either with screenshots or text).
- There is no page limit, but the report must be in PDF format.
- Avoid unnecessary explanations for simple problems, but make sure to include essential details on how you solved each problem.
- The report must be written **in English only**. Other languages are not allowed.

### C. File Naming Convention
Your file names must follow the format:
- ```CS372_HW1_{studentID}.ipynb```
- ```CS372_HW1_{studentID}.pdf```

<br>

Any form of plagiarism will be penalized.

If you have any questions, please use the KLMS Q&A board or contact cs372@nlp.kaist.ac.kr.

## Example of Using Colab

To use Colab, write your code in a code cell and click the play button on the left to execute it. Below is a simple example; click the play button to see how it prints the result.

You can also add code cells or text by clicking ```+code``` or ```+text``` between blocks.

Feel free to modify the code below to get familiar with Colab.

The execution order of code cells may affect subsequent code, so make sure that different problems do not interfere with each other.

In [1]:
a = [1, 2, 3, 4, 5]
a[0]

1

## Environment Setting

The downloads and imports below are baseline settings for the assignment. Once you are connected to the server, you must download these files; otherwise, you may encounter errors such as "Resource words not found."

If you need additional libraries or packages for solving the problems, you are allowed to use them. However, you **must** mention why you used them in your report.

In [2]:
import nltk, re, pprint
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('words')
nltk.download('cmudict')
nltk.download('wordnet')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
nltk.download('udhr')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/chloeyamtai/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     /Users/chloeyamtai/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/chloeyamtai/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package cmudict to
[nltk_data]     /Users/chloeyamtai/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/chloeyamtai/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package genesis to
[nltk_data]     /Users/chloeyamtai/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     /Users/chloeyamtai/nltk_data...
[nltk_data]   Package inaugural is already up-t

True

For Problems 1, 2 and 3, you need to import ```nltk.book``` as follows:

In [3]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [4]:
sent1

['Call', 'me', 'Ishmael', '.']

In [5]:
text1

<Text: Moby Dick by Herman Melville 1851>

## Problems

### Problem 1 (2 points)


Write Python code that prints a sorted list of unique words \(excluding special characters\) from the sentences ```sent1```, ```sent2```, ..., ```sent9``` combined, using list addition and the ```sorted()``` function. Additionally, print the size of the sorted list of unique words.


In [6]:
# Problem 1

# Step 1: Combine all sentences using list addition
all_sents = sent1 + sent2 + sent3 + sent4 + sent5 + sent6 + sent7 + sent8 + sent9

# Step 2: Remove special characters, keep alphabetic and numeric words
# Here we assume COVID19 is a valid word
all_words = [word.lower() for word in all_sents if word.isalnum()]

# Step 3: Get the unique words and convert it to a list
unique_words = list(set(all_words))

# Step 4: Sort the unique words in alphabetical order
sorted_unique_words = sorted(unique_words)

# Step 5: Print the sorted list of unique words
print(f"Sorted list of unique words: {sorted_unique_words}")

# Step 6: Print the size of the sorted list
print(f"Size of the sorted list: {len(sorted_unique_words)}")

Sorted list of unique words: ['1', '25', '29', '61', 'a', 'and', 'arthur', 'as', 'attrac', 'been', 'beginning', 'board', 'call', 'citizens', 'clop', 'cloud', 'created', 'dashwood', 'director', 'discreet', 'earth', 'encounters', 'family', 'fellow', 'for', 'god', 'had', 'have', 'heaven', 'house', 'i', 'in', 'ishmael', 'join', 'king', 'lady', 'lay', 'lol', 'london', 'long', 'male', 'me', 'nonexecutive', 'of', 'old', 'older', 'on', 'park', 'people', 'pierre', 'pming', 'problem', 'ragged', 'red', 'representatives', 'saffron', 'scene', 'seeks', 'senate', 'settled', 'sexy', 'side', 'single', 'suburb', 'sunset', 'sussex', 'the', 'there', 'to', 'vinken', 'whoa', 'will', 'wind', 'with', 'years']
Size of the sorted list: 75
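The same pipeline can be checked on a tiny hand-made input where the expected result is easy to verify by eye. The two toy sentences below are hypothetical stand-ins for ```sent1```, ..., ```sent9```; this is only a sketch of the list-addition, filtering, and ```sorted()``` steps, not the NLTK data.

```python
# Hypothetical toy sentences standing in for sent1, ..., sent9
toy_sent1 = ['Call', 'me', 'Ishmael', '.']
toy_sent2 = ['Call', 'me', 'back', ',', 'please', '!']

# List addition concatenates the token lists
combined = toy_sent1 + toy_sent2

# Keep alphanumeric tokens only, lowercase them, deduplicate, then sort
toy_unique = sorted({w.lower() for w in combined if w.isalnum()})

print(toy_unique)       # ['back', 'call', 'ishmael', 'me', 'please']
print(len(toy_unique))  # 5
```

Lowercasing before deduplication is a design choice: without it, ```'Call'``` and ```'call'``` would count as two different words.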


### Problem 2 (2 points)

Write Python code that extracts the last two words of ```text2``` using the slice expression.


In [7]:
# Problem 2

# Step 1: Extract the last two words of text2 using the slice expression
last2_words = text2[-2:]

# Step 2: Print the last two words
# Optional for debugging, since the problem does not require any output
# print(last2_words)
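Negative indices count from the end of a sequence, so ```[-2:]``` reads as "from the second-to-last item through the end". A minimal sketch on a hypothetical plain list (the token values are made up for illustration):

```python
# Hypothetical token list; any Python sequence supports the same slice
tokens = ['a', 'tale', 'of', 'two', 'cities']

# Slice from the second-to-last element to the end
last_two = tokens[-2:]
print(last_two)  # ['two', 'cities']

# Slicing never raises on short sequences; it returns what is available
short = ['only']
print(short[-2:])  # ['only']
```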

### Problem 3 (4 points)

Define a Python function called ```vocab_size(text)``` that takes a single parameter for the text and returns the vocabulary size \(excluding special characters\) of the input text. As input, use a list of words that concatenates ```sent1```, ```sent2```, ..., ```sent9```.

Report the result of the function for two cases:
- The result when the function is applied to the input data in its original state.
- The result when the function is applied to the input data after changing all words to lowercase.


In [8]:
# Problem 3

# Step 1: Concatenate all sents as input data
all_sents = sent1 + sent2 + sent3 + sent4 + sent5 + sent6 + sent7 + sent8 + sent9

# Step 2: Define a helper function to remove special characters in a word
# Here we assume a token such as COVID-19 is a valid word (kept as COVID19), but %^&* is not
import re

def remove_special_chars(token):
    return re.sub(r'[^a-zA-Z0-9]', '', token)

# Step 3: Define vocab_size(text) function
def vocab_size(text):
    # Declare a set to store the unique words
    vocab = set()
    for word in text:
        # Preprocess the word by removing special characters
        cleaned = remove_special_chars(word)

        # Skip tokens that are empty after cleaning; anything left
        # contains only ASCII letters and digits
        if cleaned:
            vocab.add(cleaned)

    # Return the size of the vocabulary
    return len(vocab)

# Step 4: Apply function to the input data in its original state
vocab_size_original = vocab_size(all_sents)
print(f"Vocabulary size in original state: {vocab_size_original}")

# Step 5: Apply function to the input data after changing all words to lowercase
all_sents_lower = [word.lower() for word in all_sents]
vocab_size_lower = vocab_size(all_sents_lower)
print(f"Vocabulary size after changing all words to lowercase: {vocab_size_lower}")

Vocabulary size in original state: 80
Vocabulary size after changing all words to lowercase: 76
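Problem 1 drops any token that is not purely alphanumeric, while ```vocab_size``` strips special characters and keeps what remains, so the two approaches can count differently on the same input. The sketch below shows the difference on a few hypothetical tokens chosen for illustration (they are not taken from the assignment data).

```python
import re

# Hypothetical tokens, chosen to show where the two strategies differ
tokens = ['Nov.', '29', '.', 'the', 'The', "don't"]

# Strategy 1: drop any token that is not purely alphanumeric
filtered = {t for t in tokens if t.isalnum()}

# Strategy 2: strip non-alphanumeric characters, then drop empty results
stripped = {re.sub(r'[^a-zA-Z0-9]', '', t) for t in tokens} - {''}

print(sorted(filtered))  # ['29', 'The', 'the']
print(sorted(stripped))  # ['29', 'Nov', 'The', 'dont', 'the']
```

Whichever strategy is used, it is worth stating the choice in the report, since it changes the reported vocabulary size.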


### Problem 4 (4 points)

Assume that the following list of words is given:

`['chair', 'teacher', 'chapter', 'echo', 'cider', 'china', 'camera', 'character', 'beach', 'approach']`

Write Python code to perform the following tasks:

a. Print all words that begin with ```"ch"```. (2 points)

b. Print all words that are longer than four characters. (2 points)

In [9]:
# Problem 4

word_list = ['chair', 'teacher', 'chapter', 'echo', 'cider', 'china', 'camera', 'character', 'beach', 'approach']

# Step 1: Declare two lists to store the results for each task
start_with_ch_words = []
long_words = []

for word in word_list:
    # Step 2: Find words that start with 'ch'
    if word.startswith('ch'):
        start_with_ch_words.append(word)
    # Step 3: Find the words that are longer than 4 characters
    if len(word) > 4:
        long_words.append(word)

# Step 4: Print the results for each task
print(f"Words that start with 'ch': {start_with_ch_words}")
print(f"Words that are longer than 4 characters: {long_words}")

Words that start with 'ch': ['chair', 'chapter', 'china', 'character']
Words that are longer than 4 characters: ['chair', 'teacher', 'chapter', 'cider', 'china', 'camera', 'character', 'beach', 'approach']
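The same two filters can also be written as list comprehensions, which is a common idiom for building a list from a condition. This is an alternative sketch, not a replacement for the loop above; the variable names are illustrative.

```python
word_list = ['chair', 'teacher', 'chapter', 'echo', 'cider', 'china',
             'camera', 'character', 'beach', 'approach']

# a. Words that begin with "ch"
starts_with_ch = [w for w in word_list if w.startswith('ch')]

# b. Words that are longer than four characters
longer_than_four = [w for w in word_list if len(w) > 4]

print(starts_with_ch)
print(longer_than_four)
```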


### Problem 5 (3 points)

We want to identify special bigrams: pairs of words that occur in the same sentence with exactly two words between them. For example, in the sentence "Success comes to those who work.", the following three pairs are special bigrams:
- \(```Success```, ```those```\)
- \(```comes```, ```who```\)
- \(```to```, ```work```\)

Find the 10 most frequent special bigrams in the News category of the Brown Corpus and show the results.


In [10]:
# Problem 5

# Step 1: Load Brown Corpus's News Category
from nltk.corpus import brown
news_sents = brown.sents(categories='news')

# Step 2: Declare a dictionary to store the bigram frequency
bigram_freq = {}

# Step 3: Find special bigrams in which all four words are alphanumeric
for sent in news_sents:
    for i in range(len(sent) - 3):
        # Convert the words to lowercase
        w1 = sent[i].lower()
        w2 = sent[i+1].lower()
        w3 = sent[i+2].lower()
        w4 = sent[i+3].lower()
        
        # Check if the words are all in alphanumeric format
        if w1.isalnum() and w2.isalnum() and w3.isalnum() and w4.isalnum():
            bigram = (w1, w4)
            bigram_freq[bigram] = bigram_freq.get(bigram, 0) + 1
            
# Step 4: Sort the bigrams by frequency
sorted_bigrams = sorted(bigram_freq.items(), key=lambda x: x[1], reverse=True)

# Step 5: Print top 10 most frequent special bigrams
print("Top 10 most frequent special bigrams in the News category of the Brown Corpus:\n", sorted_bigrams[:10])


Top 10 most frequent special bigrams in the News category of the Brown Corpus:
 [(('the', 'the'), 405), (('the', 'of'), 254), (('a', 'the'), 137), (('the', 'a'), 126), (('the', 'in'), 108), (('in', 'of'), 103), (('and', 'of'), 101), (('to', 'the'), 101), (('and', 'the'), 95), (('the', 'to'), 85)]
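The pairing rule can be sketched compactly with ```zip```: pairing a sentence with itself shifted by three positions yields exactly the words that have two words between them. The helper name ```special_bigrams``` and its ```gap``` parameter are hypothetical, and the example sentence is tokenized without its final period so that it reproduces the three pairs listed in the problem statement.

```python
from collections import Counter

def special_bigrams(sentence, gap=2):
    """Pair each token with the token that follows it after `gap` intervening tokens."""
    return list(zip(sentence, sentence[gap + 1:]))

example = ['Success', 'comes', 'to', 'those', 'who', 'work']
pairs = special_bigrams(example)
print(pairs)
# [('Success', 'those'), ('comes', 'who'), ('to', 'work')]

# Counting over several sentences: pairs never cross a sentence boundary
counts = Counter()
for sent in [example, example]:
    counts.update(special_bigrams(sent))
print(counts[('to', 'work')])  # 2
```

Because ```zip``` stops at the shorter input, sentences with fewer than four tokens simply contribute no pairs.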


### Problem 6 (3 points)

Show how many words in the Gutenberg shakespeare-macbeth corpus have two or more pronunciations according to the CMU Pronouncing Dictionary. Additionally, show the list of words with the most diverse pronunciations.


In [14]:
# Problem 6

# Step 1: Load Shakespeare Macbeth Corpus and preprocess the words
from nltk.corpus import gutenberg
# Use set to store unique words
macbeth_words = set(
    # Convert to lowercase and remove non-alphabetic words
    w.lower()
    for w in gutenberg.words('shakespeare-macbeth.txt')
    if w.isalpha()
)

# Step 2: Load CMU Pronouncing Dictionary
from nltk.corpus import cmudict
cmu_dict = cmudict.dict()

# Step 3: Find words with >= 2 pronunciations
multi_pron_words = []
for word in macbeth_words:
    if word in cmu_dict and len(cmu_dict[word]) >= 2:
        multi_pron_words.append(word)
        
# Step 4: Print the total number of words with >= 2 pronunciations
print(f"Number of words with two or more pronunciations: {len(multi_pron_words)}")

# Step 5: Find the words with the most diverse pronunciations
max_pronunciation_word = []
max_pronunciation_count = 0
for word in multi_pron_words:
    # Check if the word has more pronunciations than the current max
    if len(cmu_dict[word]) > max_pronunciation_count:
        # Update the max count
        max_pronunciation_count = len(cmu_dict[word])
        # Reset the list with the new word
        max_pronunciation_word = [word]
    # If the word has the same number of pronunciations as the current max
    elif len(cmu_dict[word]) == max_pronunciation_count:
        # Append the word to the list
        max_pronunciation_word.append(word)

# Step 6: Print the list of words with the most diverse pronunciations
print(f"Words with the most diverse pronunciations:\n {max_pronunciation_word} ({max_pronunciation_count} pronunciations)")

Number of words with two or more pronunciations: 304
Words with the most diverse pronunciations:
 ['direction', 'directly', 'was', 'effects', 'painted', 'interest', 'ne', 'with', 'when', 'planted'] (4 pronunciations)
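The counting logic can be tried without downloading any corpus by using a small dictionary in the same shape as ```cmudict.dict()```, which maps a word to a list of pronunciations. The entries below are hypothetical and only illustrate the structure; they are not meant to be accurate CMU entries.

```python
# Hypothetical dictionary: word -> list of pronunciations (each a list of phones)
toy_dict = {
    'was': [['W', 'AA1', 'Z'], ['W', 'AH1', 'Z'], ['W', 'AH0', 'Z']],
    'the': [['DH', 'AH0'], ['DH', 'IY0']],
    'cat': [['K', 'AE1', 'T']],
    'with': [['W', 'IH1', 'DH'], ['W', 'IH1', 'TH'], ['W', 'IH0', 'TH']],
}
# Hypothetical corpus vocabulary; 'macbeth' is deliberately missing from toy_dict
words = {'was', 'the', 'cat', 'with', 'macbeth'}

# Number of pronunciations for every word the dictionary knows
pron_counts = {w: len(toy_dict[w]) for w in words if w in toy_dict}

# Words with two or more pronunciations
multi = sorted(w for w, c in pron_counts.items() if c >= 2)

# All words tied for the largest number of pronunciations
max_count = max(pron_counts.values())
most_diverse = sorted(w for w, c in pron_counts.items() if c == max_count)

print(len(multi), multi)        # 3 ['the', 'was', 'with']
print(max_count, most_diverse)  # 3 ['was', 'with']
```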


### Problem 7 (4 points)


Provide a brief explanation for each of the following questions.

a. Retrieve all noun synsets for ```duck``` from WordNet. For each synset, print all hypernyms and the lemma names of those hypernyms. (2 points)

b. Using ```wn.all_synsets('v')```, write Python code that prints 5 random lemmas from the first 10 verb synsets. (The output lemma will be in the format: ```Lemma('run.v.02.run')```) (2 points)


#### Problem 7-a

In [13]:
# Problem 7
import random
random.seed(42)

# Step 1: Load WordNet
from nltk.corpus import wordnet as wn

# Step 2: Retrieve all noun synsets for duck
duck_synsets = wn.synsets('duck', pos='n')

# Step 3: Print hypernyms and lemma names of hypernyms
for synset in duck_synsets:
    print(f"Synset: {synset.name()} - {synset.definition()}")
    hypernyms = synset.hypernyms()
    # Defensive check in case there are no hypernyms
    if not hypernyms:
        print("No hypernyms for this synset.\n")
        continue
        
    for hypernym in hypernyms:
        print(f"Hypernym: {hypernym.name()} Lemma names - {hypernym.lemma_names()}")
    print()

Synset: duck.n.01 - small wild or domesticated web-footed broad-billed swimming bird usually having a depressed body and short legs
Hypernym: anseriform_bird.n.01 Lemma names - ['anseriform_bird']

Synset: duck.n.02 - (cricket) a score of nothing by a batsman
Hypernym: score.n.03 Lemma names - ['score']

Synset: duck.n.03 - flesh of a duck (domestic or wild)
Hypernym: poultry.n.02 Lemma names - ['poultry']

Synset: duck.n.04 - a heavy cotton fabric of plain weave; used for clothing and tents
Hypernym: fabric.n.01 Lemma names - ['fabric', 'cloth', 'material', 'textile']



#### Problem 7-b

In [14]:
# Step 1: Retrieve all verb synsets
verb_synsets = list(wn.all_synsets('v'))

# Step 2: Set up a counter so that only 10 synsets are printed
counter = 0
limit = 10

# Step 3: Iterate over the verb synsets, skipping those with fewer than 5 lemmas,
# and print 5 random lemmas for each of the first 10 that qualify
for idx, synset in enumerate(verb_synsets, start=1):
    all_lemmas = synset.lemmas()
    
    if len(all_lemmas) < 5:
        # print(f"Synset {idx} has less than 5 lemmas.")
        continue
    
    else:
        random_lemmas = random.sample(all_lemmas, 5)
    
    print(f"Synset #{idx}: {synset.name()} -> definition: {synset.definition()}")
    for lemma in random_lemmas:
        print(f"  {lemma}")
    print()
    
    counter += 1
    if counter == limit:
        break

Synset #20: expectorate.v.02 -> definition: discharge (phlegm or sputum) from the lungs and out of the mouth
  Lemma('expectorate.v.02.expectorate')
  Lemma('expectorate.v.02.spit_out')
  Lemma('expectorate.v.02.cough_out')
  Lemma('expectorate.v.02.cough_up')
  Lemma('expectorate.v.02.spit_up')

Synset #36: shed.v.04 -> definition: cast off hair, skin, horn, or feathers
  Lemma('shed.v.04.molt')
  Lemma('shed.v.04.slough')
  Lemma('shed.v.04.exuviate')
  Lemma('shed.v.04.shed')
  Lemma('shed.v.04.moult')

Synset #63: sleep.v.01 -> definition: be asleep
  Lemma('sleep.v.01.catch_some_Z's')
  Lemma('sleep.v.01.log_Z's')
  Lemma('sleep.v.01.sleep')
  Lemma('sleep.v.01.slumber')
  Lemma('sleep.v.01.kip')

Synset #76: fall_asleep.v.01 -> definition: change from a waking to a sleeping state
  Lemma('fall_asleep.v.01.drift_off')
  Lemma('fall_asleep.v.01.dope_off')
  Lemma('fall_asleep.v.01.nod_off')
  Lemma('fall_asleep.v.01.drop_off')
  Lemma('fall_asleep.v.01.fall_asleep')

Synset #79: go
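The problem statement can also be read differently: pool the lemmas of the first 10 verb synsets and draw 5 of them at random. A minimal sketch of that reading follows, using plain lists of strings as hypothetical stand-ins for the lemma lists; the helper ```sample_from_first``` and the toy data are assumptions made for illustration.

```python
import random
from itertools import islice

def sample_from_first(groups, first_n=10, k=5, seed=42):
    """Pool the items of the first `first_n` groups and draw `k` of them without replacement."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    pooled = [item for group in islice(groups, first_n) for item in group]
    return rng.sample(pooled, k)

def make_toy_groups():
    # Hypothetical stand-in for the lemma lists of successive synsets
    return ([f"lemma_{i}_{j}" for j in range(1 + i % 3)] for i in range(100))

picked = sample_from_first(make_toy_groups())
print(picked)
```

With WordNet, ```groups``` would presumably be something like ```(s.lemmas() for s in wn.all_synsets('v'))```; ```islice``` stops after 10 synsets, so the full generator is never materialized.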

### Problem 8 (3 points)

Describe the class of strings that match each of the following regular expressions.
1. ```[a-zA]+```
2. ```[a-z][A-Z]*[a-z]```
3. ```[BCLT]r[aeiou]{,2}t[a-z]```
4. ```([aeiou][^aeiou])*```
5. ```\W+|[^\W\s]+```

You can test your answers using ```nltk.re_show()```.
Your answers should be included in the report.


In [15]:
# Problem 8

# Step 1: Import nltk
import nltk

# Step 2: Define the regular expressions
regexps = [
    r"[a-zA]+",
    r"[a-z][A-Z]*[a-z]",
    r"[BCLT]r[aeiou]{,2}t[a-z]",
    r"([aeiou][^aeiou])*",
    r"\W+|[^\W\s]+"
]

# Step 3: Test the regular expressions
for regexp in regexps:
    print(f"Regular Expression: {regexp}")
    nltk.re_show(regexp, "Your answers should be included in the report.")
    print()
    
# Step 4: Describe the class of strings
# 1. [a-zA]+: One or more characters, each of which is a lowercase letter (a-z) or the uppercase letter 'A'
# 2. [a-z][A-Z]*[a-z]: One lowercase letter, followed by zero or more uppercase letters, and then another lowercase letter
# 3. [BCLT]r[aeiou]{,2}t[a-z]: One of the uppercase letters B, C, L, or T, followed by the letter 'r', then zero to two lowercase vowels, then the letter 't', and finally one lowercase letter
# 4. ([aeiou][^aeiou])*: Zero or more repetitions of a lowercase vowel followed by any character that is not a lowercase vowel (so the empty string also matches)
# 5. \W+|[^\W\s]+: Either one or more non-word characters (anything other than letters, digits, and underscore, such as whitespace and punctuation), or one or more word characters (letters, digits, and underscore)

Regular Expression: [a-zA]+
Y{our} {answers} {should} {be} {included} {in} {the} {report}.

Regular Expression: [a-z][A-Z]*[a-z]
Y{ou}r {an}{sw}{er}s {sh}{ou}{ld} {be} {in}{cl}{ud}{ed} {in} {th}e {re}{po}{rt}.

Regular Expression: [BCLT]r[aeiou]{,2}t[a-z]
Your answers should be included in the report.

Regular Expression: ([aeiou][^aeiou])*
{}Y{}o{ur}{} {an}{}s{}w{er}{}s{} {}s{}h{}o{ul}{}d{} {}b{e in}{}c{}l{uded}{} {in}{} {}t{}h{e }{}r{epor}{}t{}.{}

Regular Expression: \W+|[^\W\s]+
{Your}{ }{answers}{ }{should}{ }{be}{ }{included}{ }{in}{ }{the}{ }{report}{.}



### Problem 9 (5 points)

Pig Latin is a simple transformation of English text. Each word is converted as follows:
- Move any consonant (or consonant cluster) at the start of a word to the end.
- Append ```ay``` to the word.
- Example:
  - ```string``` → ```ingstray```
  - ```language``` → ```anguagelay```
  - ```idle``` → ```idleay```
  - For more details, refer to http://en.wikipedia.org/wiki/Pig_Latin.

This time, we will modify the Pig Latin rules:
- After moving the consonant(s) of the word, change the first vowel in the word:
    - ```a``` → ```e```
    - ```e``` → ```i```
    - ```i``` → ```o```
    - ```o``` → ```u```
    - ```u``` → ```a```
- Append ```ay``` to the word.
- Example:
    - ```string``` → ```ongstray```
    - ```idle``` → ```odleay```

a. Write a Python function to convert a word into its equivalent in the modified Pig Latin. (3 points)

b. Write Python code that converts an entire text (a sequence of words), instead of just individual words. (1 point)

#### Problem 9-a

In [16]:
# Problem 9

# Step 1: Define a function to convert a word into the modified Pig Latin
def pig_latin(word):
    vowels = 'aeiou'
    # Locate the first vowel; everything before it is the leading consonant cluster
    for i, char in enumerate(word):
        if char in vowels:
            # Shift the vowel to the next one in the cycle a -> e -> i -> o -> u -> a
            shifted = vowels[(vowels.index(char) + 1) % len(vowels)]
            # New vowel + rest of the word + moved consonants + 'ay'
            return shifted + word[i+1:] + word[:i] + 'ay'

    # Assumption: a word without any lowercase vowel (e.g. 'by') has no vowel
    # to shift, so only 'ay' is appended
    return word + 'ay'

# Testing    
print(pig_latin('string')) # ongstray
print(pig_latin('language')) # enguagelay
print(pig_latin('idle')) # odleay

ongstray
enguagelay
odleay


#### Problem 9-b

In [17]:
text = "Write a Python code that converts an entire text (a sequence of words), instead of just individual words."

for word in text.split():
    print(pig_latin(word), end=' ')

oteWray eay unPythay udecay etthay unvertscay enay intireay ixttay e(ay iquencesay ufay urds),way onsteaday ufay astjay ondividualay urds.way 
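Splitting on whitespace leaves punctuation attached to the words, which is why tokens such as ```(a``` and ```words),``` come out with the punctuation moved into the middle. One possible refinement, sketched below under the assumption that only runs of ASCII letters should be converted, is to substitute each alphabetic run in place with ```re.sub```. The helper ```convert_text``` is hypothetical, and ```pig_latin``` is redefined compactly here so that the block is self-contained.

```python
import re

def pig_latin(word):
    """Modified Pig Latin for one word (lowercase vowels only, as in Problem 9-a)."""
    vowels = 'aeiou'
    for i, char in enumerate(word):
        if char in vowels:
            shifted = vowels[(vowels.index(char) + 1) % len(vowels)]
            return shifted + word[i + 1:] + word[:i] + 'ay'
    return word + 'ay'  # assumption: no vowel means nothing to shift

def convert_text(text):
    """Convert each run of ASCII letters, leaving punctuation and spacing untouched."""
    return re.sub(r'[A-Za-z]+', lambda m: pig_latin(m.group()), text)

converted = convert_text("(a sequence of words), instead")
print(converted)  # (eay iquencesay ufay urdsway), onsteaday
```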

### Problem 10 (4 points)

Using a multilingual corpus such as the Universal Declaration of Human Rights Corpus (```nltk.corpus.udhr```), along with NLTK’s frequency distribution (```nltk.FreqDist```), develop a system that guesses the language of a previously unseen text.

You can estimate the language of the unseen text by comparing:
- How often each word appears in each language of the corpus
- How often each word appears in the unseen text.

To achieve this, use rank correlation (```nltk.spearman_correlation```).

For simplicity, you can work at the character level rather than the word level and choose languages (at least two) of your choice.


In [18]:
# Problem 10
from nltk.metrics.spearman import spearman_correlation
from nltk.corpus import udhr
from nltk.probability import FreqDist

languages = [
    "English-Latin1",
    "French_Francais-Latin1",
    "German_Deutsch-Latin1",
    "Italian_Italiano-Latin1",
    "Korean_Hankuko-UTF8",
    "Japanese_Nihongo-UTF8",
    "Spanish-Latin1",
    "Swedish_Svenska-Latin1"
]

unseen_text = "You never really understand a person until you consider things from his point of view… Until you climb inside of his skin and walk around in it."


def get_char_ranks(text, n=20):
    # Create a frequency distribution of alphabetic characters
    freqdist = FreqDist(ch for ch in text if ch.isalpha())
    # Sort chars by descending frequency, take top n
    most_common_chars = freqdist.most_common(n)
    # Return just the chars in order
    return [char for (char, _) in most_common_chars]

def compute_spearman_rank_correlation(ranks1, ranks2):
    # Create dictionaries: char -> rank_index
    dict1 = {ch: i for i, ch in enumerate(ranks1)}
    dict2 = {ch: i for i, ch in enumerate(ranks2)}
    
    # We only want to compare characters common to both sets
    common_chars = set(dict1.keys()) & set(dict2.keys())
    
    # Build sequences of (char, rank) pairs for the intersection
    seq1 = [(ch, dict1[ch]) for ch in common_chars]
    seq2 = [(ch, dict2[ch]) for ch in common_chars]
    
    if len(common_chars) < 2:
        # Not enough overlap to compute correlation
        return 0.0
    
    # Now pass these sequences to spearman_correlation
    return spearman_correlation(seq1, seq2)

# 1) Compute ranks for the unseen text
unseen_ranks = get_char_ranks(unseen_text, n=20)
print("Unseen text top-20 character ranks:")
print(unseen_ranks)
print()

# 2) For each language, compute top-n char ranks & measure correlation
results = []
for lang in languages:
    lang_text = udhr.raw(lang)
    lang_ranks = get_char_ranks(lang_text, n=20)
    
    corr_value = compute_spearman_rank_correlation(unseen_ranks, lang_ranks)
    results.append((lang, corr_value))

# 3) Print results
print("Language Rank Correlations:")
for lang, corr in results:
    print(f"{lang}: Spearman correlation = {corr:.4f}")

# Identify the language with the highest correlation
best_guess, best_corr = max(results, key=lambda x: x[1])
print(f"\nBest guess for unseen text language: {best_guess} (corr = {best_corr:.4f})")


Unseen text top-20 character ranks:
['n', 'i', 'o', 'e', 's', 'r', 'u', 'a', 'l', 'd', 't', 'y', 'h', 'f', 'v', 'p', 'c', 'm', 'w', 'k']

Language Rank Correlations:
English-Latin1: Spearman correlation = 0.6966
French_Francais-Latin1: Spearman correlation = 0.7221
German_Deutsch-Latin1: Spearman correlation = 0.6581
Italian_Italiano-Latin1: Spearman correlation = 0.7169
Korean_Hankuko-UTF8: Spearman correlation = 0.0000
Japanese_Nihongo-UTF8: Spearman correlation = 0.0000
Spanish-Latin1: Spearman correlation = 0.7108
Swedish_Svenska-Latin1: Spearman correlation = 0.3787

Best guess for unseen text language: French_Francais-Latin1 (corr = 0.7221)
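The docstring of ```nltk.spearman_correlation``` appears to recommend removing keys that occur in only one ranking before the ranks are assigned. The sketch below follows that advice in plain Python: it restricts both rankings to their common characters, re-ranks them from 0, and then applies the standard formula $1 - \frac{6 \sum d^2}{n(n^2 - 1)}$. The helper names are hypothetical, lowercasing the text first is an added assumption, and this is not claimed to reproduce the numbers above.

```python
from collections import Counter

def char_ranking(text, n=20):
    """Return the n most frequent alphabetic characters, most frequent first."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return [ch for ch, _ in counts.most_common(n)]

def spearman_on_common(ranking1, ranking2):
    """Spearman correlation computed after re-ranking the characters common to both lists."""
    common = set(ranking1) & set(ranking2)
    n = len(common)
    if n < 2:
        return 0.0  # not enough overlap to compare
    # Re-rank from 0 using only the shared characters, preserving their order
    r1 = {ch: i for i, ch in enumerate(c for c in ranking1 if c in common)}
    r2 = {ch: i for i, ch in enumerate(c for c in ranking2 if c in common)}
    d_squared = sum((r1[ch] - r2[ch]) ** 2 for ch in common)
    return 1 - 6 * d_squared / (n * (n * n - 1))

print(char_ranking("aaabbc"))                                      # ['a', 'b', 'c']
print(spearman_on_common(list('abc'), list('cba')))                # -1.0
print(spearman_on_common(['a', 'x', 'b', 'c'], ['a', 'b', 'y', 'c']))  # 1.0
```

In the last call the shared characters appear in the same relative order in both lists, so re-ranking gives a perfect correlation; keeping the original positions would count the unshared characters ```x``` and ```y``` as a rank difference.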


### Problem 11 (6 points)

Write a Python program that processes a text and discovers cases where a word is used with a novel sense.

One simple approach is using WordNet:
For each word, compute the WordNet similarity between all synsets of the word and all synsets of the words in its context.

Note: this is a crude approach. Also, it is important to mention that implementing this effectively is a difficult, open research problem.

Define what a “novel sense” of the word is in your own words, and describe the detailed design choice you made for your implementation.


In [19]:
# Problem 11
import nltk
from nltk.corpus import wordnet as wn

def avg_similarity(target_synset, context_synsets):
    """
    Computes the average WordNet similarity between the target_synset
    and all synsets in context_synsets. Returns 0.0 if no valid similarities.
    """
    similarities = []
    for c_syn in context_synsets:
        sim = target_synset.path_similarity(c_syn)
        # You could also try wup_similarity or another measure
        if sim is not None:
            similarities.append(sim)
    if similarities:
        return sum(similarities) / len(similarities)
    return 0.0

def best_synset_for_context(word, context_words):
    """
    Among all synsets for `word`, returns the one with highest average similarity
    to the synsets of `context_words`.
    """
    # Collect all synsets for each context word
    context_syns = [syn for cw in context_words for syn in wn.synsets(cw)]
    target_syns = wn.synsets(word)
    
    if not target_syns or not context_syns:
        return None  # No synsets or no context synsets
    
    best_synset = None
    best_score = 0.0
    
    for ts in target_syns:
        score = avg_similarity(ts, context_syns)
        if score > best_score:
            best_score = score
            best_synset = ts
    return best_synset

def detect_novel_sense(text, window_size=2):
    """
    A toy function that processes each word in the text and flags a
    usage as possibly 'novel' when the sense that best fits the local
    context differs from the word's first WordNet sense.
    
    text: list of words (tokenized).
    """
    novel_usages = []
    
    for i, word in enumerate(text):
        # We'll assume that punctuation etc. is removed or skip them if no synsets
        if not word.isalpha():
            continue
        
        # Grab the local context
        start = max(0, i - window_size)
        end = min(len(text), i + window_size + 1)
        context_words = [w for j, w in enumerate(text[start:end]) if j != i and w.isalpha()]
        
        # Calculate best sense for that context
        best_sense = best_synset_for_context(word, context_words)
        
        if best_sense is None:
            continue  # Could not find or evaluate senses for the context
        
        # Let's define 'novel' if the best sense is actually not the first sense in WordNet,
        # or if there's some other naive test. We'll do something simplistic:
        # Compare best_sense to the "WordNet's default sense" = word's first synset
        # If they're different, let's mark it as 'novel' (Crude!)
        
        first_syn = wn.synsets(word)[0]  # WordNet's default sense
        if best_sense != first_syn:
            # We'll store (word, best_sense, first_syn)
            novel_usages.append((word, best_sense, first_syn))
    
    return novel_usages


# Example usage:
if __name__ == "__main__":
    text = nltk.word_tokenize(
        "The bank of the river was very steep. I decided to deposit money at the bank. "
        "Such a bank shot in billiards is too risky."
    )
    
    novel_cases = detect_novel_sense(text, window_size=2)
    print("Possible 'novel sense' usage (toy approach):")
    for (word, best_sense, first_sense) in novel_cases:
        print(f"Word '{word}' best-sense: {best_sense.name()} vs. WordNet-first: {first_sense.name()}")

Possible 'novel sense' usage (toy approach):
Word 'was' best-sense: be.v.01 vs. WordNet-first: washington.n.02
Word 'steep' best-sense: steep.a.01 vs. WordNet-first: steep.n.01
Word 'I' best-sense: one.s.01 vs. WordNet-first: iodine.n.01
Word 'deposit' best-sense: sediment.n.01 vs. WordNet-first: deposit.n.01
Word 'bank' best-sense: bank.v.05 vs. WordNet-first: bank.n.01
Word 'bank' best-sense: bank.v.05 vs. WordNet-first: bank.n.01
Word 'shot' best-sense: changeable.s.04 vs. WordNet-first: shooting.n.01
Word 'in' best-sense: in.s.01 vs. WordNet-first: inch.n.01
