
# **Exercise set 06**

In [None]:
import nltk
nltk.download('book')
from nltk.corpus import wordnet as wn

5. ☼ Investigate the holonym-meronym relations for some nouns. Remember that there are three kinds of holonym-meronym relation, so you need to use: `member_meronyms()`, `part_meronyms()`, `substance_meronyms()`, `member_holonyms()`, `part_holonyms()`, and `substance_holonyms()`.



In [None]:
print(wn.synsets('seed'))
my_noun = wn.synset('seed.n.01')

In [None]:
print(my_noun.member_holonyms(), my_noun.member_meronyms())

In [None]:
print(my_noun.part_meronyms(), my_noun.part_holonyms())

In [None]:
print(my_noun.substance_holonyms(), my_noun.substance_meronyms())
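
The three cells above can also be folded into a single loop over the six method names listed in the exercise. This is a minimal sketch: `all_part_whole_relations` is a hypothetical helper name, and `FakeSynset` is a stand-in class (not WordNet data) that exists only so the block runs without downloading anything.

```python
RELATIONS = [
    'member_meronyms', 'part_meronyms', 'substance_meronyms',
    'member_holonyms', 'part_holonyms', 'substance_holonyms',
]

def all_part_whole_relations(synset):
    """Return {relation name: result list} for the six holonym/meronym methods."""
    return {name: getattr(synset, name)() for name in RELATIONS}


class FakeSynset:
    """Stand-in for a WordNet synset (illustration only, not real WordNet data)."""
    def member_meronyms(self): return []
    def part_meronyms(self): return ['toy_part']
    def substance_meronyms(self): return []
    def member_holonyms(self): return []
    def part_holonyms(self): return ['toy_whole']
    def substance_holonyms(self): return []


relations = all_part_whole_relations(FakeSynset())
for name, result in relations.items():
    print(name, result)
```

With WordNet loaded, calling `all_part_whole_relations(wn.synset('seed.n.01'))` should give the same lists as the three cells above, since it calls the same methods.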

8. ◑ Define a conditional frequency distribution over the Names corpus that allows you to see which initial letters are more frequent for males vs. females (cf. 4.4).


In [None]:
names = nltk.corpus.names

cfd = nltk.ConditionalFreqDist(
    (fileid, name[0]) 
    for fileid in names.fileids()
    for name in names.words(fileid))

import matplotlib.pyplot as plt
plt.figure(figsize = (10, 5))
cfd.plot()
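
The plot above shows raw counts, so if the two name lists differ in size the longer one will tend to sit higher for most letters. One possible follow-up is to compare shares within each list instead. The sketch below uses `collections.Counter` and made-up names so that it runs without the corpus; `initial_letter_shares` is a hypothetical helper name.

```python
from collections import Counter

def initial_letter_shares(names_by_group):
    """Map each group to {initial letter: share of that group's names}."""
    shares = {}
    for group, names in names_by_group.items():
        counts = Counter(name[0].upper() for name in names if name)
        total = sum(counts.values())
        shares[group] = {letter: n / total for letter, n in counts.items()}
    return shares

# toy data for illustration only, not taken from the Names corpus
toy_names = {
    'female.txt': ['Anna', 'Alice', 'Beth', 'Cara'],
    'male.txt': ['Adam', 'Bob'],
}
shares = initial_letter_shares(toy_names)
print(shares['female.txt']['A'], shares['male.txt']['A'])  # 0.5 0.5
```

To try it on the real corpus, something like `initial_letter_shares({f: names.words(f) for f in names.fileids()})` should work, assuming `names` is defined as in the cell above.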

12. ◑ The CMU Pronouncing Dictionary contains multiple pronunciations for certain words. How many distinct words does it contain? What fraction of words in this dictionary have more than one possible pronunciation?


In [None]:
entries = nltk.corpus.cmudict.entries()

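# peek at a few (word, pronunciation) entries; print() is needed because this is not the last line of the cell
print(entries[1:5])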

cmu_words = [w[0] for w in entries]

len(set(cmu_words))
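
The cell above answers only the first half of the question. For the second half, one possible approach is to count how many entries each word has, on the assumption that a word with several pronunciations appears once per pronunciation in `cmudict.entries()`. The sketch below uses toy entries so it runs without the corpus; `multi_pronunciation_fraction` is a hypothetical helper name.

```python
from collections import Counter

def multi_pronunciation_fraction(entries):
    """Return (number of distinct words, fraction of them with more than one entry)."""
    counts = Counter(word for word, pron in entries)
    multiple = sum(1 for n in counts.values() if n > 1)
    return len(counts), multiple / len(counts)

# toy entries in the same (word, phones) shape as cmudict.entries(); illustration only
toy_entries = [
    ('am', ['AE1', 'M']),
    ('am', ['EY1', 'EH1', 'M']),
    ('at', ['AE1', 'T']),
    ('do', ['D', 'UW1']),
    ('who', ['HH', 'UW1']),
]
n_words, fraction = multi_pronunciation_fraction(toy_entries)
print(n_words, fraction)  # 4 0.25
```

Passing the real `entries` list to the same function should give the two numbers the exercise asks for.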

13. ◑ What percentage of noun synsets have no hyponyms? You can get all noun synsets using `wn.all_synsets('n')`.


In [None]:
# wn.all_synsets() returns a generator, so wrap it in list() to be able to slice it and reuse it
all_noun_synsets = list(wn.all_synsets('n'))

all_noun_synsets[:5]

In [None]:
# you need to count the number of synsets returning nothing (an empty list) for .hyponyms()
no_hyponyms = [s for s in all_noun_synsets if s.hyponyms() == []]
yes_hyponyms = [s for s in all_noun_synsets if len(s.hyponyms()) > 0]
print(len(no_hyponyms), len(yes_hyponyms))

In [None]:
# the majority of nouns do not have hyponyms. does that make sense?
len(no_hyponyms)/len(all_noun_synsets) * 100

In [None]:
# explore the ones with and without to get a better understanding. 
no_hyponyms[:5]

In [None]:
yes_hyponyms[:5]
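
One way to see why a majority is plausible: in a branching hierarchy the most specific items sit at the bottom and have nothing below them, and there are usually more of those than of the categories above them. The toy table below is made up for illustration (it is not WordNet data), and `percent_without_hyponyms` is a hypothetical helper name.

```python
# toy hyponym table (illustration only, not WordNet data): node -> direct hyponyms
toy_hyponyms = {
    'animal': ['dog', 'cat'],
    'dog': ['puppy', 'poodle', 'corgi'],
    'cat': ['kitten', 'tabby'],
    'puppy': [], 'poodle': [], 'corgi': [], 'kitten': [], 'tabby': [],
}

def percent_without_hyponyms(hyponym_table):
    """Percentage of nodes whose hyponym list is empty."""
    leaves = [node for node, children in hyponym_table.items() if not children]
    return len(leaves) / len(hyponym_table) * 100

print(percent_without_hyponyms(toy_hyponyms))  # 62.5
```

Here three nodes have hyponyms and five do not, which mirrors the pattern in the real counts on a much smaller scale.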

14. ◑ Define a function supergloss(s) that takes a synset s as its argument and returns a string consisting of the concatenation of the definition of s, and the definitions of all the hypernyms and hyponyms of s.


In [None]:
def supergloss(s):
  """Return the definition of s plus the definitions of its hypernyms and hyponyms as one string."""
  lines = ['Original Word:', s.name(), s.definition(), '', 'Hypernyms:']
  lines += [i.name() + ': ' + i.definition() for i in s.hypernyms()]
  lines += ['', 'Hyponyms:']
  lines += [i.name() + ': ' + i.definition() for i in s.hyponyms()]
  return '\n'.join(lines)

In [None]:
print(supergloss(wn.synset('dog.n.01')))

In [None]:
print(supergloss(wn.synset('beer.n.01')))

17. ◑ Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.


In [None]:
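def freq_no_stop(text):
  # a set makes the membership test below much faster than a list
  stopwords = set(nltk.corpus.stopwords.words('english'))

  no_stopwords = [w for w in text if w.lower() not in stopwords]
  # proportion of distinct word types that survive stopword removal
  print(len(set(no_stopwords)) / len(set(text)))
  print(nltk.FreqDist(no_stopwords).most_common(50))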

In [None]:
freq_no_stop(nltk.corpus.gutenberg.words('austen-emma.txt'))

In [None]:
freq_no_stop(open('/content/drive/MyDrive/texts/mood_ring.txt').read().split())
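
Punctuation tokens are not in the stopword list, so they can show up among the most frequent "words" in the output above. A possible variant, sketched here with `collections.Counter` and a tiny made-up stopword list so it runs without the corpus, also drops non-alphabetic tokens, lowercases everything, and returns the list instead of printing it. The name `top_content_words` is hypothetical.

```python
from collections import Counter

def top_content_words(tokens, stopwords, n=50):
    """Return the n most common alphabetic tokens that are not stopwords."""
    stopwords = set(stopwords)  # set membership is faster than scanning a list
    content = [w.lower() for w in tokens
               if w.isalpha() and w.lower() not in stopwords]
    return Counter(content).most_common(n)

# tiny made-up stopword list and text, for illustration only
toy_stopwords = ['the', 'a', 'and', 'is']
toy_tokens = 'The cat and the dog . The cat is a cat !'.split()
top = top_content_words(toy_tokens, toy_stopwords, n=2)
print(top)  # [('cat', 3), ('dog', 1)]
```

With NLTK loaded, the second argument could be `nltk.corpus.stopwords.words('english')`.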

18. ◑ Write a program to print the 50 most frequent bigrams (pairs of adjacent words) of a text, omitting bigrams that contain stopwords.


In [None]:
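def freq_bigrams(text):
  stopwords = set(nltk.corpus.stopwords.words('english'))
  # build bigrams from the original token sequence first, so every pair is truly adjacent,
  # then drop any pair that contains a stopword or a non-alphabetic token
  text_bigrams = [(w1, w2) for w1, w2 in nltk.bigrams(text)
                  if w1.isalpha() and w2.isalpha()
                  and w1.lower() not in stopwords and w2.lower() not in stopwords]
  fd = nltk.FreqDist(text_bigrams)

  print(fd.most_common(50))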

In [None]:
freq_bigrams(nltk.corpus.gutenberg.words('austen-emma.txt'))

In [None]:
freq_bigrams(nltk.word_tokenize(open('/content/drive/MyDrive/texts/mood_ring.txt').read()))
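
The order of the two steps matters for this exercise: removing stopwords before pairing produces "bigrams" of words that were never next to each other in the text. The toy comparison below is a sketch with made-up input; `adjacent_content_bigrams` is a hypothetical helper name, and `zip(tokens, tokens[1:])` stands in for `nltk.bigrams` so the block runs without NLTK.

```python
def adjacent_content_bigrams(tokens, stopwords):
    """Bigrams of truly adjacent tokens, neither of which is a stopword."""
    stopwords = set(stopwords)
    pairs = zip(tokens, tokens[1:])  # pairs of neighbouring tokens
    return [(a, b) for a, b in pairs
            if a.lower() not in stopwords and b.lower() not in stopwords]

# toy input for illustration only
toy_stopwords = ['the', 'of', 'a']
toy_tokens = 'queen of the night sky'.split()

# filter first, then pair: 'queen night' never occurs in the text
filtered_first = [w for w in toy_tokens if w not in toy_stopwords]
filter_then_pair = list(zip(filtered_first, filtered_first[1:]))
print(filter_then_pair)  # [('queen', 'night'), ('night', 'sky')]

# pair first, then filter: only genuinely adjacent content words remain
pair_then_filter = adjacent_content_bigrams(toy_tokens, toy_stopwords)
print(pair_then_filter)  # [('night', 'sky')]
```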

21. ◑ Write a program to guess the number of syllables contained in a text, making use of the CMU Pronouncing Dictionary.


In [None]:
# Probably easiest to convert this to a dictionary, though there are surely other ways.
# The first thing to do is define what a syllable is. We'll be lazy and say it's any vowel sound in a word.

# So, to count syllables, we want to count vowels.
# You could list all the CMU codes for vowels, but a list of AEIOU should also work.
# Alternatively, we can just check whether there is a digit in the phone, since CMU marks each vowel with a stress digit.

# Loop over each word in the text, then over the dictionary entry for that word, then over the phones in the entry.

In [None]:
# version looking for vowels. 
def guess_syllables(text):
  """process syllables from raw text"""

  # each CMU vowel phone starts with one of these letters
  cmu_vowels = ['A', 'E', 'I', 'O', 'U']

  # tokenize the text and initialize the CMU entries
  text2 = [w.lower() for w in nltk.word_tokenize(text)]
  entries = nltk.corpus.cmudict.entries()
  entries_dict = dict(entries)
  
  # grab the pron for each word (what other ways could people check this?)
  # note: dict(entries) keeps only the last pronunciation listed for a word
  text_sounds = [entries_dict[w] for w in text2 if w in entries_dict]
  print('full text sounds:', text_sounds)

  # now count how many sounds have vowels and approximate that to be the syllables
  # first loop over the entry, which is a list
  # then loop over the members of the list (the sounds)
  num_syllables = [sound for entry in text_sounds for sound in entry if sound[0] in cmu_vowels]

  # you could do the same thing using len == 3 (vowel phones are two letters plus a stress digit)
  # num_syllables = [sound for entry in text_sounds for sound in entry if len(sound) == 3]
  print(num_syllables, len(num_syllables))

In [None]:
guess_syllables('Super Diet Onion')

In [None]:
# this treats "am" as having two syllables. do you think that's correct?
guess_syllables('I do not know who I am, am I anyone at all?')
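
One possible reason a word can come out with more syllables than expected is that `dict(entries)` keeps only the last pronunciation listed for a word that appears more than once. The sketch below takes a different route: it expects a dictionary in the shape returned by `nltk.corpus.cmudict.dict()` (word mapped to a list of pronunciations) and counts vowels in the first one. The toy dictionary is for illustration only, and `count_syllables` is a hypothetical name.

```python
def count_syllables(words, pron_dict):
    """Sum vowel phones over the first listed pronunciation of each known word."""
    total = 0
    for word in words:
        prons = pron_dict.get(word.lower())
        if not prons:
            continue  # word not in the dictionary: skipped, so the total is a lower bound
        # CMU vowel phones end in a stress digit (0, 1 or 2); consonants do not
        total += sum(1 for phone in prons[0] if phone[-1].isdigit())
    return total

# toy dictionary in the shape of nltk.corpus.cmudict.dict(); illustration only
toy_prons = {
    'am': [['AE1', 'M'], ['EY1', 'EH1', 'M']],
    'i': [['AY1']],
    'super': [['S', 'UW1', 'P', 'ER0']],
}
total = count_syllables(['I', 'am', 'super', 'xyzzy'], toy_prons)
print(total)  # 4
```

Choosing the first pronunciation is itself a guess; averaging over all listed pronunciations would be another defensible option.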

22. ◑ Define a function `hedge(text)` which processes a text and produces a new version with the word 'like' between every third word.


In [None]:
# I keep forcing myself to use list comprehension but these for loops are easier to read. 
# this function is hilarious
def hedge(text):
  
  text2 = nltk.word_tokenize(text.lower())
  output = []

  # I suppose you can use a counter for this. 
  counter = 0
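  for position, word in enumerate(text2):
    counter = counter + 1
    # compare positions, not text2.index(word), which returns the first occurrence of a repeated word
    if counter == 3 and position != len(text2)-1: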
      counter = 0
      output.append(word + ' like')
    else:
      output.append(word)
  return ' '.join(output)

In [None]:
hedge('these pretzels are making me thirsty!')

In [None]:
def hedge2(text):
  
  text2 = nltk.word_tokenize(text.lower())
  output = []

  # without a counter.
  for index, word in enumerate(text2):
    if index > 0 and index%3 == 0 and index != len(text2)-1:
      output.append('like')
      output.append(word)
      
    else:
      output.append(word)
  return ' '.join(output)

In [None]:
hedge2('these pretzels are making me thirsty!')

In [None]:
# like, mood like ring. 
hedge2(open('/content/drive/MyDrive/texts/mood_ring.txt').read())
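
Another reading of the exercise is to cut the text into groups of three words and join the groups with 'like', which can never leave a trailing 'like' at the end. This is a sketch only: it uses plain `str.split()` rather than `nltk.word_tokenize` so it runs without NLTK data (punctuation therefore stays attached to words), and `hedge_chunks` is a hypothetical name.

```python
def hedge_chunks(text, every=3, filler='like'):
    """Insert filler between consecutive groups of `every` whitespace-separated words."""
    words = text.split()
    chunks = [' '.join(words[i:i + every]) for i in range(0, len(words), every)]
    return f' {filler} '.join(chunks)

hedged = hedge_chunks('these pretzels are making me thirsty')
print(hedged)  # these pretzels are like making me thirsty
```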

27. ★ The polysemy of a word is the number of senses it has. Using WordNet, we can determine that the noun dog has 7 senses with: `len(wn.synsets('dog', 'n'))`. Compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet.

Note: `n` == noun, `v` == verb, `a` == adjective, `r` == adverb

In [None]:
def avg_polysemy(pos):
  # grab all of the lemma names for each synset in the particular pos
  all_lemmas = [i.lemma_names() for i in wn.all_synsets(pos)]

  # flatten into one list
  all_lemmas2 = [i for synset in all_lemmas for i in synset]

  # set to remove duplicates
  all_lemmas3 = set(all_lemmas2)

  # get lengths
  synset_lengths = [len(wn.synsets(i, pos)) for i in all_lemmas3]

  avg_poly = sum(synset_lengths)/len(synset_lengths)
  return avg_poly

In [None]:
avg_polysemy('n')

In [None]:
avg_polysemy('v')

In [None]:
avg_polysemy('a')

In [None]:
avg_polysemy('r')
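
The calculation in `avg_polysemy` boils down to "total number of senses divided by number of distinct lemmas". A toy version with a made-up sense inventory (not WordNet counts) makes the arithmetic easy to check by hand; `average_polysemy` is a hypothetical name.

```python
def average_polysemy(senses_by_lemma):
    """Mean number of senses per distinct lemma."""
    counts = [len(senses) for senses in senses_by_lemma.values()]
    return sum(counts) / len(counts)

# made-up sense inventory, for illustration only (not WordNet counts)
toy_senses = {
    'bank': ['bank.n.01', 'bank.n.02', 'bank.n.03'],
    'seed': ['seed.n.01', 'seed.n.02'],
    'corgi': ['corgi.n.01'],
}
print(average_polysemy(toy_senses))  # 2.0
```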