1. Search the Web for “spoof newspaper headlines,” to find such gems as: British
Left Waffles on Falkland Islands, and Juvenile Court to Try Shooting Defendant.
Manually tag these headlines to see whether knowledge of the part-of-speech tags
removes the ambiguity.

In [4]:
import nltk

headline = 'Juvenile/NOUN Court/NOUN to/PRT Try/VERB Shooting/ADJ Defendant/NOUN'
[nltk.tag.str2tuple(t) for t in headline.split()]

[('Juvenile', 'NOUN'),
 ('Court', 'NOUN'),
 ('to', 'PRT'),
 ('Try', 'VERB'),
 ('Shooting', 'ADJ'),
 ('Defendant', 'NOUN')]

2. Working with someone else, take turns picking a word that can be either a noun
or a verb (e.g., contest); the opponent has to predict which one is likely to be the
most frequent in the Brown Corpus. Check the opponent’s prediction, and tally
the score over several turns.

In [6]:
import nltk
from nltk.corpus import brown
from nltk import FreqDist

# Ensure required resources are downloaded
nltk.download('brown')

# Load Brown Corpus
def get_word_frequencies():
    """Get frequencies of nouns and verbs in the Brown Corpus."""
    # Get words and their tags from the Brown Corpus
    words = brown.tagged_words()
    # Create frequency distributions for nouns and verbs
    fdist_nouns = FreqDist(word.lower() for word, tag in words if tag.startswith('NN'))
    fdist_verbs = FreqDist(word.lower() for word, tag in words if tag.startswith('VB'))
    
    return fdist_nouns, fdist_verbs

def word_frequencies(word, fdist_nouns, fdist_verbs):
    """Get frequencies for a given word."""
    word = word.lower()
    noun_freq = fdist_nouns.get(word, 0)
    verb_freq = fdist_verbs.get(word, 0)
    
    return noun_freq, verb_freq

def play_game():
    fdist_nouns, fdist_verbs = get_word_frequencies()
    
    print("Welcome to the Word Frequency Game!")
    print("Pick a word that can be either a noun or a verb.")
    print("Your opponent will predict which form (noun or verb) is more frequent.")
    print("Let's start!")
    
    score = 0
    while True:
        word = input("Enter a word (or 'quit' to end the game): ")
        if word.lower() == 'quit':
            break
        
        # Get frequencies for the word
        noun_freq, verb_freq = word_frequencies(word, fdist_nouns, fdist_verbs)
        
        # Get opponent's prediction
        prediction = input(f"Is '{word}' more frequent as a noun or a verb? (Enter 'noun' or 'verb'): ").strip().lower()
        
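        # Determine the correct answer. Note that a tie (including a word that
        # never occurs in the corpus, where both counts are 0) is reported as 'verb'.
        correct = 'noun' if noun_freq > verb_freq else 'verb'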
        
        # Check the prediction
        if prediction == correct:
            print(f"Correct! '{word}' is more frequent as a {correct}.")
            score += 1
        else:
            print(f"Incorrect. '{word}' is more frequent as a {correct}.")
        
        print(f"Current Score: {score}\n")
    
    print(f"Game over! Final Score: {score}")

if __name__ == "__main__":
    play_game()


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Welcome to the Word Frequency Game!
Pick a word that can be either a noun or a verb.
Your opponent will predict which form (noun or verb) is more frequent.
Let's start!
Incorrect. 'noun' is more frequent as a noun.
Current Score: 0


3. Tokenize and tag the following sentence: They wind back the clock, while we
chase after the wind. What different pronunciations and parts-of-speech are
involved?

In [6]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Ensure required resources are downloaded
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sentence to be analyzed
sentence = "They wind back the clock, while we chase after the wind."

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Tag the tokens with parts of speech
tagged_tokens = pos_tag(tokens)

# Print the tokenized and tagged output
print("Tokenized and Tagged Sentence:")
for token, tag in tagged_tokens:
    print(f"{token}: {tag}")

# Print the tags again under a second heading (pos_tag gives tags only, not pronunciations)
print("\nParts of Speech and Pronunciations:")
for token, tag in tagged_tokens:
    print(f"{token}: {tag}")

# Example of analyzing specific words
# (requires the WordNet data: nltk.download('wordnet'))
from nltk.corpus import wordnet as wn

def get_synsets(word):
    """Get synsets for a word."""
    return wn.synsets(word)

def get_word_info(word):
    """Return a dict mapping each part of speech to one synset name for the word.

    Later synsets overwrite earlier ones, so only the last synset listed
    for each part of speech is kept.
    """
    synsets = get_synsets(word)
    pos_info = {synset.pos(): synset.name() for synset in synsets}
    return pos_info

words_of_interest = ['wind', 'chase']
for word in words_of_interest:
    print(f"\nWord: {word}")
    pos_info = get_word_info(word)
    for pos, synset in pos_info.items():
        print(f"POS: {pos}, Synset: {synset}")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


Tokenized and Tagged Sentence:
They: PRP
wind: VBP
back: RB
the: DT
clock: NN
,: ,
while: IN
we: PRP
chase: VBP
after: IN
the: DT
wind: NN
.: .

Parts of Speech and Pronunciations:
They: PRP
wind: VBP
back: RB
the: DT
clock: NN
,: ,
while: IN
we: PRP
chase: VBP
after: IN
the: DT
wind: NN
.: .

Word: wind
POS: n, Synset: wind.n.08
POS: v, Synset: hoist.v.01

Word: chase
POS: n, Synset: chase.n.03
POS: v, Synset: furrow.v.03
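
The tagger output above separates the two uses of wind by tag (VBP for the verb, NN for the noun) but says nothing about pronunciation. As a minimal sketch, assuming a small hand-written table (the `PRONUNCIATIONS` dictionary, its IPA strings, and the helper names are illustrative and do not come from an NLTK resource), the pronunciation can be looked up from the pair (word, coarse part of speech):

```python
# Hand-written pronunciation table keyed on (word, coarse part of speech).
# The entries are an illustrative assumption, not data from NLTK.
PRONUNCIATIONS = {
    ('wind', 'V'): '/waɪnd/',   # to wind a clock: rhymes with "find"
    ('wind', 'N'): '/wɪnd/',    # moving air: rhymes with "pinned"
}

def coarse_pos(tag):
    """Collapse a Penn Treebank tag to 'V', 'N', or None."""
    if tag.startswith('VB'):
        return 'V'
    if tag.startswith('NN'):
        return 'N'
    return None

def pronounce(tagged_tokens):
    """Return (word, tag, pronunciation) for every token found in the table."""
    results = []
    for word, tag in tagged_tokens:
        key = (word.lower(), coarse_pos(tag))
        if key in PRONUNCIATIONS:
            results.append((word, tag, PRONUNCIATIONS[key]))
    return results

# Tags copied from the pos_tag output above.
tagged = [('They', 'PRP'), ('wind', 'VBP'), ('back', 'RB'), ('the', 'DT'),
          ('clock', 'NN'), (',', ','), ('while', 'IN'), ('we', 'PRP'),
          ('chase', 'VBP'), ('after', 'IN'), ('the', 'DT'), ('wind', 'NN'),
          ('.', '.')]
for word, tag, ipa in pronounce(tagged):
    print(f"{word} ({tag}): {ipa}")
```

So the sentence contains one spelling with two parts of speech and two pronunciations: the verb wind and the noun wind.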


5. Using the Python interpreter in interactive mode, experiment with the dictionary
examples in this chapter. Create a dictionary d, and add some entries. What happens if you try to access a non-existent entry, e.g., d['xyz']?

In [7]:
# Create a dictionary
d = {'apple': 'a fruit', 'car': 'a vehicle', 'python': 'a programming language'}

# Add some entries
d['banana'] = 'a yellow fruit'
d['dog'] = 'a domestic animal'

# Print the dictionary
print("Dictionary:", d)

# Access an existing entry
print("Accessing 'apple':", d['apple'])

# Access a non-existent entry using the get method (safe way)
print("Accessing 'xyz' using get method:", d.get('xyz', 'Not found'))

# Access a non-existent entry directly (will raise a KeyError)
try:
    print("Accessing 'xyz' directly:", d['xyz'])
except KeyError:
    print("KeyError: 'xyz' not found in dictionary")


Dictionary: {'apple': 'a fruit', 'car': 'a vehicle', 'python': 'a programming language', 'banana': 'a yellow fruit', 'dog': 'a domestic animal'}
Accessing 'apple': a fruit
Accessing 'xyz' using get method: Not found
KeyError: 'xyz' not found in dictionary


6.  Try deleting an element from a dictionary d, using the syntax del d['abc']. Check
that the item was deleted.

In [8]:
# Create a dictionary
d = {'apple': 'a fruit', 'car': 'a vehicle', 'python': 'a programming language'}

# Add some entries
d['banana'] = 'a yellow fruit'
d['dog'] = 'a domestic animal'

# Print the dictionary before deletion
print("Dictionary before deletion:", d)

# Delete an entry
del d['banana']

# Print the dictionary after deletion
print("Dictionary after deletion:", d)

# Verify that the entry was deleted
# Accessing the deleted key directly will raise a KeyError
try:
    print("Accessing 'banana' directly:", d['banana'])
except KeyError:
    print("KeyError: 'banana' not found in dictionary")


Dictionary before deletion: {'apple': 'a fruit', 'car': 'a vehicle', 'python': 'a programming language', 'banana': 'a yellow fruit', 'dog': 'a domestic animal'}
Dictionary after deletion: {'apple': 'a fruit', 'car': 'a vehicle', 'python': 'a programming language', 'dog': 'a domestic animal'}
KeyError: 'banana' not found in dictionary


7. Create two dictionaries, d1 and d2, and add some entries to each. Now issue the
command d1.update(d2). What did this do? What might it be useful for?

In [9]:
d1 = {'hello': 1, 'world': 2, 'natural': 0}
d2 = {'natural': 3, 'language': 4, 'processing': 5}
d1.update(d2)
d1

{'hello': 1, 'world': 2, 'natural': 3, 'language': 4, 'processing': 5}

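d1.update(d2) copies the key/value pairs from d2 into d1 in place, overwriting the value of any key that d1 already has (here 'natural' goes from 0 to 3). d2 is left unchanged.

Useful when merging two dictionaries, e.g., applying a set of corrections to a base lexicon.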
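
Because update() modifies d1 in place, it is sometimes handier to build a merged copy instead. The sketch below shows one way to do that with dictionary unpacking, plus a variant for the case where the values are counts and should be added rather than overwritten. The `counts_a` and `counts_b` dictionaries are made-up examples.

```python
from collections import Counter

d1 = {'hello': 1, 'world': 2, 'natural': 0}
d2 = {'natural': 3, 'language': 4, 'processing': 5}

# Build a merged copy; on duplicate keys the later mapping wins, as with update().
merged = {**d1, **d2}
print(merged)
print(d1)  # d1 itself is unchanged

# When the values are counts, adding Counters sums them instead of overwriting.
counts_a = {'the': 3, 'dog': 1}
counts_b = {'the': 2, 'cat': 1}
summed = Counter(counts_a) + Counter(counts_b)
print(dict(summed))
```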

8. Create a dictionary e, to represent a single lexical entry for some word of your
choice. Define keys such as headword, part-of-speech, sense, and example, and assign them suitable values.

In [11]:
e = {}
e['headword'] = ['NOUN', 'a word or term placed at the beginning (as of a chapter or an entry in an encyclopedia)']
e['part-of-speech'] = ['PHRASE', 'a traditional class of words distinguished according to the kind of idea denoted and the function performed in a sentence']
e['sense'] = ['NOUN', 'a meaning conveyed or intended']
e['example'] = ['NOUN', 'one that serves as a pattern to be imitated or not to be imitated']

import pprint
pprint.pprint(e)

{'example': ['NOUN',
             'one that serves as a pattern to be imitated or not to be '
             'imitated'],
 'headword': ['NOUN',
              'a word or term placed at the beginning (as of a chapter or an '
              'entry in an encyclopedia)'],
 'part-of-speech': ['PHRASE',
                    'a traditional class of words distinguished according to '
                    'the kind of idea denoted and the function performed in a '
                    'sentence'],
 'sense': ['NOUN', 'a meaning conveyed or intended']}


In [12]:
# Define a dictionary representing a lexical entry
e = {
    'headword': 'run',
    'part-of-speech': 'verb',
    'senses': [
        {
            'sense_id': 1,
            'definition': 'To move swiftly on foot.',
            'example': 'She can run a mile in under six minutes.'
        },
        {
            'sense_id': 2,
            'definition': 'To operate or function.',
            'example': 'The machine runs on electricity.'
        },
        {
            'sense_id': 3,
            'definition': 'To manage or be in charge of.',
            'example': 'He runs a small business from home.'
        }
    ]
}

# Print the dictionary
import pprint
pprint.pprint(e)


{'headword': 'run',
 'part-of-speech': 'verb',
 'senses': [{'definition': 'To move swiftly on foot.',
             'example': 'She can run a mile in under six minutes.',
             'sense_id': 1},
            {'definition': 'To operate or function.',
             'example': 'The machine runs on electricity.',
             'sense_id': 2},
            {'definition': 'To manage or be in charge of.',
             'example': 'He runs a small business from home.',
             'sense_id': 3}]}


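9. Satisfy yourself that there are restrictions on the distribution of **go** and **went**, in the sense that they cannot be freely interchanged in the kinds of contexts illustrated in (3d) in 7.

'We went on the excursion.' is in the simple past tense.

'We go on the excursion.' is grammatical only as a present-tense (habitual) statement, so swapping the forms changes the meaning. With a past-time adverbial such as 'yesterday', only went is acceptable, so the two forms are not freely interchangeable.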

10. Train a unigram tagger and run it on some new text. Observe that some words
are not assigned a tag. Why not?

In [19]:
from nltk.corpus import brown
import nltk

brown_tagged_sents = brown.tagged_sents(categories='news')

fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.most_common(100)
# Lookup model: the most likely tag for each of the 100 most frequent words
likely_tags = dict((word, cfd[word].max()) for (word, __) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
# accuracy() replaces the deprecated evaluate()
baseline_tagger.accuracy(brown_tagged_sents)


0.45578495136941344

In [15]:
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tokenize import word_tokenize

nltk.download('treebank')

# Load training data
train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:3500]

# Train a unigram tagger
tagger = UnigramTagger(train_sents)

# Define new text for tagging
new_text = "The quick brown fox jumps over the lazy dog"

# Tokenize new text
tokens = word_tokenize(new_text)

# Tag the tokens
tagged = tagger.tag(tokens)

# Print the tagged tokens
print(tagged)


[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.


[('The', 'DT'), ('quick', 'JJ'), ('brown', None), ('fox', None), ('jumps', None), ('over', 'IN'), ('the', 'DT'), ('lazy', None), ('dog', None)]


In [16]:
from nltk.tag import DefaultTagger, BigramTagger

# Default tagger for unseen words. Note that it is created here but never passed as a
# backoff, so words unseen in training still come out as None below.
default_tagger = DefaultTagger('NN')

# Train a bigram tagger with a backoff to the unigram tagger
bigram_tagger = BigramTagger(train_sents, backoff=tagger)

# Use the bigram tagger to tag new text
tagged_with_bigrams = bigram_tagger.tag(tokens)

print(tagged_with_bigrams)


[('The', 'DT'), ('quick', 'JJ'), ('brown', None), ('fox', None), ('jumps', None), ('over', 'IN'), ('the', 'DT'), ('lazy', None), ('dog', None)]


These words do not appear in the training text, so the tagger has no basis for assigning them a tag.

A unigram tagger assigns each word the tag it most often carried in the training data. A word that never occurred there has no entry in the tagger's table, so it is returned with the tag None. Adding the bigram tagger in In [16] does not help, because its backoff chain ends at the same unigram tagger and the default tagger is never used.
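
To make the mechanism concrete, here is a minimal plain-Python sketch of a unigram lookup tagger. It is not NLTK's implementation; the function names and the tiny training set are invented for illustration. It shows why an unseen word gets None and how a default tag fills the gap.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word seen in training to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, tokens, default=None):
    """Tag tokens by table lookup; words missing from the table get `default`."""
    return [(tok, model.get(tok, default)) for tok in tokens]

# Tiny hand-made training set (illustrative, not taken from a corpus).
train = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('quick', 'JJ'), ('dog', 'NN')],
]
model = train_unigram(train)
tokens = ['the', 'quick', 'fox']
print(tag_unigram(model, tokens))                # 'fox' was never seen in training
print(tag_unigram(model, tokens, default='NN'))  # fall back to a default tag
```

In NLTK the same effect would come from passing a default tagger as the backoff of the unigram tagger.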

11. Learn about the affix tagger (type help(nltk.AffixTagger)). Train an affix tagger
and run it on some new text. Experiment with different settings for the affix length
and the minimum word length. Discuss your findings.

In [20]:
import nltk
from nltk.corpus import brown
from nltk.corpus import gutenberg

text = gutenberg.words('austen-persuasion.txt')

brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')
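# Note: a positive affix_length selects prefixes; use a negative value for suffixes.
affix_tagger = nltk.AffixTagger(train=brown_tagged_sents, affix_length=1, min_stem_length=3)
# Tag only the first 20 tokens; tagging the whole novel prints an unmanageably long list.
print(affix_tagger.tag(list(text[:20])))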




In [18]:
affix_tagger = nltk.AffixTagger(brown_tagged_sents, affix_length=3, min_stem_length=4)
test_text = 'Experiment with different settings for the affix length and the minimum word length'.split()
affix_tagger.tag(test_text)

[('Experiment', 'NN-TL'),
 ('with', None),
 ('different', 'JJ'),
 ('settings', 'NN'),
 ('for', None),
 ('the', None),
 ('affix', None),
 ('length', None),
 ('and', None),
 ('the', None),
 ('minimum', 'NNS'),
 ('word', None),
 ('length', None)]
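
The pattern in this output follows from how the two settings interact. As far as I understand the NLTK source, a word is considered only if it has at least min_stem_length + |affix_length| characters, and a positive affix_length picks a prefix while a negative one picks a suffix. The function below is a plain-Python sketch of that rule (the name `affix_context` is hypothetical, not an NLTK function), applied to part of the test sentence.

```python
def affix_context(token, affix_length, min_stem_length):
    """Return the affix used as the lookup key, or None if the word is too short.

    A positive affix_length takes a prefix, a negative one takes a suffix.
    """
    min_word_length = min_stem_length + abs(affix_length)
    if len(token) < min_word_length:
        return None
    if affix_length > 0:
        return token[:affix_length]
    return token[affix_length:]

words = 'Experiment with different settings for the affix length'.split()
for affix_length, min_stem_length in [(3, 4), (-3, 4), (-2, 2)]:
    keys = [affix_context(w, affix_length, min_stem_length) for w in words]
    print(affix_length, min_stem_length, keys)
```

With affix_length=3 and min_stem_length=4, every word shorter than seven characters is skipped, which matches the None tags for with, for, the, affix, and length above. The remaining words are tagged from their first three letters, which is a weak cue in English; a negative affix_length (suffixes such as -ing or -ed) is usually the more informative setting, and a smaller min_stem_length lets more short words receive a tag.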

12.  Train a bigram tagger with no backoff tagger, and run it on some of the training data. Next, run it on some new data. What happens to the performance of the tagger? Why?

In [21]:
from nltk.corpus import gutenberg
from nltk.corpus import brown
import nltk

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.7)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

bigram_tagger = nltk.BigramTagger(train_sents)
# accuracy() replaces the deprecated evaluate()
bigram_tagger.accuracy(train_sents)  # not displayed: a notebook cell shows only its last expression
bigram_tagger.accuracy(test_sents)

# Performance drops considerably on new data. A bigram tagger chooses a tag from the
# pair (previous tag, current word). When it meets a pair it never saw in training it
# has nothing to go on and assigns None; the next word's context then includes None,
# which was never seen in training either, so the rest of the sentence is usually
# tagged None as well and the accuracy plummets.

0.09032237553252369

In [22]:
import nltk
from nltk.corpus import treebank
from nltk.tag import BigramTagger
from nltk.tokenize import word_tokenize

# Load training and test data
train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:3500]

# Train a bigram tagger with no backoff
bigram_tagger = BigramTagger(train_sents)

# Evaluate on training data (accuracy() replaces the deprecated evaluate())
train_accuracy = bigram_tagger.accuracy(train_sents)
print(f"Training accuracy: {train_accuracy:.2f}")

# Tag a new sentence
test_text = "The quick brown fox jumps over the lazy dog"
test_tokens = word_tokenize(test_text)
tagged_tokens = bigram_tagger.tag(test_tokens)
print("Tagged tokens:", tagged_tokens)

# Evaluate on held-out test data
test_accuracy = bigram_tagger.accuracy(test_sents)
print(f"Test accuracy: {test_accuracy:.2f}")


Training accuracy: 0.91
Tagged tokens: [('The', 'DT'), ('quick', None), ('brown', None), ('fox', None), ('jumps', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]
Test accuracy: 0.11
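
The run of None tags after the first unknown word can be reproduced with a few lines of plain Python. This is a simplified sketch of a bigram tagger without backoff, not NLTK's implementation; the function names, the `START` marker, and the one-sentence training set are assumptions made for illustration.

```python
from collections import Counter, defaultdict

START = '<s>'  # marker for "no previous tag" at the start of a sentence

def train_bigram(tagged_sents):
    """Map (previous tag, word) to the most frequent tag seen in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

def tag_bigram(model, tokens):
    """Tag left to right; a context never seen in training yields None."""
    tagged, prev = [], START
    for tok in tokens:
        tag = model.get((prev, tok))
        tagged.append((tok, tag))
        prev = tag  # a None tag becomes the next word's context
    return tagged

train = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]]
model = train_bigram(train)
print(tag_bigram(model, ['the', 'dog', 'barks']))  # every context was seen
print(tag_bigram(model, ['the', 'fox', 'barks']))  # 'fox' breaks the chain
```

In the second call, barks is a known word, but its context (None, 'barks') was never observed, so it is tagged None too. This is the same cascade seen after 'The' in the output above.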


13. We can use a dictionary to specify the values to be substituted into a formatting string. Read Python's library documentation for formatting strings http://docs.python.org/lib/typesseq-strings.html (this link now returns 404 NOT FOUND) and use this method to display today's date in two different formats.

In [25]:
from datetime import datetime

# Get today's date
today = datetime.now()

# Format using f-strings
formatted_date1 = f"Today's date is {today:%Y-%m-%d}"
formatted_date2 = f"Today's date is {today:%B %d, %Y}"

print(formatted_date1)
print(formatted_date2)


Today's date is 2024-08-21
Today's date is August 21, 2024


In [26]:
from datetime import datetime

# Get today's date
today = datetime.now()

# Format using str.format()
formatted_date1 = "Today's date is {:%Y-%m-%d}".format(today)
formatted_date2 = "Today's date is {:%B %d, %Y}".format(today)

print(formatted_date1)
print(formatted_date2)


Today's date is 2024-08-21
Today's date is August 21, 2024
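
The two cells above format the datetime object directly. Since the exercise asks for a dictionary to supply the substituted values, here is a minimal sketch of that variant. The `date_fields` helper and its key names are my own choices, not part of any library.

```python
from datetime import date

def date_fields(d):
    """Collect the pieces of a date in a dictionary for use in format strings."""
    return {
        'year': d.year,
        'month': d.month,
        'day': d.day,
        'month_name': d.strftime('%B'),  # month name in the current locale
    }

fields = date_fields(date.today())

# Old-style % formatting with named fields drawn from the dictionary.
print('%(year)04d-%(month)02d-%(day)02d' % fields)
# str.format() with the dictionary unpacked into keyword arguments.
print('{month_name} {day}, {year}'.format(**fields))
```

The `%(name)s` form is the dictionary-based style that the exercise's original documentation link described; `str.format(**fields)` is the newer equivalent.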


14. Use **sorted()** and **set()** to get a sorted list of tags used in the Brown corpus, removing duplicates.

In [28]:
from nltk.corpus import brown

tags = brown.tagged_words()
# Keep only the tag from each (word, tag) pair, then de-duplicate and sort
vals = [tag for (word, tag) in tags]
print(sorted(set(vals)))

["'", "''", '(', '(-HL', ')', ')-HL', '*', '*-HL', '*-NC', '*-TL', ',', ',-HL', ',-NC', ',-TL', '--', '---HL', '.', '.-HL', '.-NC', '.-TL', ':', ':-HL', ':-TL', 'ABL', 'ABN', 'ABN-HL', 'ABN-NC', 'ABN-TL', 'ABX', 'AP', 'AP$', 'AP+AP-NC', 'AP-HL', 'AP-NC', 'AP-TL', 'AT', 'AT-HL', 'AT-NC', 'AT-TL', 'AT-TL-HL', 'BE', 'BE-HL', 'BE-TL', 'BED', 'BED*', 'BED-NC', 'BEDZ', 'BEDZ*', 'BEDZ-HL', 'BEDZ-NC', 'BEG', 'BEM', 'BEM*', 'BEM-NC', 'BEN', 'BEN-TL', 'BER', 'BER*', 'BER*-NC', 'BER-HL', 'BER-NC', 'BER-TL', 'BEZ', 'BEZ*', 'BEZ-HL', 'BEZ-NC', 'BEZ-TL', 'CC', 'CC-HL', 'CC-NC', 'CC-TL', 'CC-TL-HL', 'CD', 'CD$', 'CD-HL', 'CD-NC', 'CD-TL', 'CD-TL-HL', 'CS', 'CS-HL', 'CS-NC', 'CS-TL', 'DO', 'DO*', 'DO*-HL', 'DO+PPSS', 'DO-HL', 'DO-NC', 'DO-TL', 'DOD', 'DOD*', 'DOD*-TL', 'DOD-NC', 'DOZ', 'DOZ*', 'DOZ*-TL', 'DOZ-HL', 'DOZ-TL', 'DT', 'DT$', 'DT+BEZ', 'DT+BEZ-NC', 'DT+MD', 'DT-HL', 'DT-NC', 'DT-TL', 'DTI', 'DTI-HL', 'DTI-TL', 'DTS', 'DTS+BEZ', 'DTS-HL', 'DTX', 'EX', 'EX+BEZ', 'EX+HVD', 'EX+HVZ', 'EX+MD', '

In [31]:
list_of_tags = sorted(set([tag for (_, tag) in brown.tagged_words()]))
print(list_of_tags)

["'", "''", '(', '(-HL', ')', ')-HL', '*', '*-HL', '*-NC', '*-TL', ',', ',-HL', ',-NC', ',-TL', '--', '---HL', '.', '.-HL', '.-NC', '.-TL', ':', ':-HL', ':-TL', 'ABL', 'ABN', 'ABN-HL', 'ABN-NC', 'ABN-TL', 'ABX', 'AP', 'AP$', 'AP+AP-NC', 'AP-HL', 'AP-NC', 'AP-TL', 'AT', 'AT-HL', 'AT-NC', 'AT-TL', 'AT-TL-HL', 'BE', 'BE-HL', 'BE-TL', 'BED', 'BED*', 'BED-NC', 'BEDZ', 'BEDZ*', 'BEDZ-HL', 'BEDZ-NC', 'BEG', 'BEM', 'BEM*', 'BEM-NC', 'BEN', 'BEN-TL', 'BER', 'BER*', 'BER*-NC', 'BER-HL', 'BER-NC', 'BER-TL', 'BEZ', 'BEZ*', 'BEZ-HL', 'BEZ-NC', 'BEZ-TL', 'CC', 'CC-HL', 'CC-NC', 'CC-TL', 'CC-TL-HL', 'CD', 'CD$', 'CD-HL', 'CD-NC', 'CD-TL', 'CD-TL-HL', 'CS', 'CS-HL', 'CS-NC', 'CS-TL', 'DO', 'DO*', 'DO*-HL', 'DO+PPSS', 'DO-HL', 'DO-NC', 'DO-TL', 'DOD', 'DOD*', 'DOD*-TL', 'DOD-NC', 'DOZ', 'DOZ*', 'DOZ*-TL', 'DOZ-HL', 'DOZ-TL', 'DT', 'DT$', 'DT+BEZ', 'DT+BEZ-NC', 'DT+MD', 'DT-HL', 'DT-NC', 'DT-TL', 'DTI', 'DTI-HL', 'DTI-TL', 'DTS', 'DTS+BEZ', 'DTS-HL', 'DTX', 'EX', 'EX+BEZ', 'EX+HVD', 'EX+HVZ', 'EX+MD', '

15. Write programs to process the Brown Corpus and find answers to the following
questions:

a. Which nouns are more common in their plural form, rather than their singular
form? (Only consider regular plurals, formed with the -s suffix.)

b. Which word has the greatest number of distinct tags? What are they, and what
do they represent?

c. List tags in order of decreasing frequency. What do the 20 most frequent tags
represent?

d. Which tags are nouns most commonly found after? What do these tags
represent?

In [32]:
from nltk.corpus import brown

import nltk

cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
tags = brown.tagged_words(categories='news')
pos_tags = [val for key, val in tags]

# (c) tags in decreasing order of frequency; common_tags holds the 20 most frequent

fd = nltk.FreqDist(pos_tags)
common_tags = fd.most_common(20)

# pulls out all the different words

conditions = cfd.conditions()
number_of_tags = []

# (b) creates a new list with each word followed by the number of distinct tags that it has

for condition in conditions:
    number_of_tags.append((condition, len(cfd[condition])))

# Sorts the words in decreasing order of how many distinct tags they have.
# The answer is 'to', with six tags: 'TO': 1222, 'IN': 880, 'TO-HL': 6, 'IN-HL': 5, 'IN-TL': 2, 'NPS': 1.
# By these counts it is most often the infinitive marker (TO), then a preposition (IN).

number_of_tags = list(reversed(sorted(number_of_tags, key=lambda x: x[1])))
print(number_of_tags[0:19])

# (a) intended to list nouns that are more common in their plural form than in their singular form.
# Note that each condition is a single word form, so this compares the NNS and NN counts of the
# same spelling. A plural such as 'reports' is almost never tagged NN, so the loop prints nearly
# every plural noun instead of comparing 'report' with 'reports'.

for condition in conditions:
    if cfd[condition]['NNS'] > cfd[condition]['NN']:
        print(condition)

# (d) makes bigrams of the (word, tag) pairs in the text
word_tag_pairs = nltk.bigrams(tags)
# pulls out the tags of the words that precede singular common nouns (NN)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NN']
# makes a frequency distribution
fdist = nltk.FreqDist(noun_preceders)
# lists the tags from most to least common
[tag for (tag, _) in fdist.most_common()]

[('to', 6), ('French', 5), ('cut', 5), ('3', 5), ('like', 5), ('set', 5), ('near', 5), ('for', 5), ('that', 5), ('Congolese', 4), ('First', 4), ('Place', 4), ('hit', 4), ('No', 4), ('To', 4), ('open', 4), ('half', 4), ('For', 4), ('St.', 4)]
irregularities
presentments
thanks
reports
voters
laws
legislators
topics
departments
practices
governments
offices
personnel
policies
steps
funds
services
homes
items
counties
jurors
taxpayers
appraisers
guardians
administrators
fees
procedures
recommendations
juries
citizens
actions
wards
costs
servants
criticisms
influences
concessionaires
prices
deputies
matters
officials
employes
Police
farms
Attorneys
responses
petitions
precincts
signatures
dissents
courses
names
candidates
relations
years
starts
areas
adjustments
chambers
$100
bonds
$30
courts
sales
contracts
highways
$3
$4
ones
authorities
$50
roads
plans
$10
allowances
details
raises
sessions
members
congressmen
votes
polls
calls
threats
days
protests
bankers
rules
questions
witnesses
dol

['AT',
 'JJ',
 'NN',
 'IN',
 'PP$',
 'DT',
 'CC',
 'AP',
 'NP',
 ',',
 'VBG',
 'CD',
 'NN-TL',
 'VBN',
 'OD',
 '.',
 'VB',
 'NP$',
 'NN$',
 'NNS',
 'DTI',
 'VBD',
 'CS',
 'NR',
 'JJR',
 '``',
 'JJ-TL',
 'JJT',
 'NNS$',
 'NN$-TL',
 "''",
 'RB',
 'JJS',
 'BEZ',
 'NP-TL',
 'VBZ',
 'NNS-TL',
 'BEDZ',
 'WDT',
 'NR$',
 'RP',
 '--',
 'WP$',
 ')',
 'ABN',
 'NPS$',
 'HV',
 'PPO',
 'WRB',
 'BE',
 'VBN-TL',
 'CD-TL',
 'BER',
 'NN-HL',
 'ABX',
 'FW-NN',
 'NNS$-TL',
 '(',
 'HVZ',
 'DTS',
 'BED',
 'QL',
 'BEN',
 ':',
 '*',
 'DO',
 'NNS-HL',
 'FW-DT',
 'DTX',
 "'",
 'RB-HL',
 'PPS+BEZ',
 'HVD',
 'DOZ',
 'AP$',
 'NP$-TL',
 'DOD',
 'AT-TL',
 'RB$',
 'NR-HL',
 'VBN-HL',
 'TO',
 'CD$',
 'PN$',
 'HVN',
 'OD-TL',
 'FW-NN-TL',
 'NR$-TL',
 'NPS$-TL',
 ':-HL',
 '.-HL',
 'HVG']
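
For part (a), the loop in the cell above compares the NNS and NN counts of one spelling, so it cannot compare a singular with its plural. The sketch below is one way to pair them up, assuming regular plurals only (singular + "s") and requiring that the singular was seen as NN at least once. The function name and the sample data are illustrative.

```python
from collections import Counter

def plural_dominant(tagged_words):
    """Return singular nouns whose regular -s plural is more frequent."""
    tagged_words = list(tagged_words)
    singular = Counter(w.lower() for w, t in tagged_words if t == 'NN')
    plural = Counter(w.lower() for w, t in tagged_words if t == 'NNS')
    result = []
    for form, n_plural in plural.items():
        if not form.endswith('s'):
            continue
        stem = form[:-1]  # regular plural: stem + 's'
        # Require the stem to occur as a singular noun, then compare counts.
        if singular[stem] > 0 and n_plural > singular[stem]:
            result.append(stem)
    return sorted(result)

# Hand-made sample (not corpus data): 'years' outnumbers 'year', 'dogs' does not
# outnumber 'dog', and 'thanks' has no singular noun form in the sample.
sample = [('year', 'NN'), ('years', 'NNS'), ('years', 'NNS'),
          ('dog', 'NN'), ('dog', 'NN'), ('dogs', 'NNS'),
          ('thanks', 'NNS')]
print(plural_dominant(sample))
```

On the real data the same function could be called as plural_dominant(brown.tagged_words(categories='news')).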

16. Explore the following issues that arise in connection with the lookup tagger:
# What happens to the tagger performance for the various model sizes when a backoff tagger is omitted?

"""Without a backoff tagger, every word outside the lookup table is tagged None and scored as an error, so accuracy is lower at every model size. The gap is widest for small models, where most tokens fall outside the table (the 100-word model in exercise 10 scores about 0.46), and it narrows as the table grows."""

# Consider the curve in 4.2; suggest a good size for a lookup tagger that balances memory and performance. Can you come up with scenarios where it would be preferable to minimize memory usage, or to maximize performance with no regard for memory usage?

"""A reasonable size is the point where the curve flattens, since entries beyond it cost memory but add little accuracy. Minimizing memory makes sense on a weak server or machine; maximizing performance with no regard for memory makes sense when POS tagging accuracy is hugely important to the application."""

17. What is the upper limit of performance for a lookup tagger, assuming no limit
to the size of its table? (Hint: write a program to work out what percentage of tokens
of a word are assigned the most likely tag for that word, on average.)

In [33]:
import nltk
from nltk.corpus import brown


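cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

avgs = []
best_total = 0   # tokens that carry their word's most likely tag
token_total = 0  # all tokens
for (word, freqs) in cfd.items():
    n = freqs.N()
    best = freqs[freqs.max()]
    avg = best / n
    print('likelihood for {} = {}'.format(word, avg))
    avgs.append(avg)
    best_total += best
    token_total += n

print('')
# Unweighted mean over word types
print('average correct = {}'.format(sum(avgs) / len(avgs)))
# Token-weighted figure: the accuracy ceiling for a lookup tagger on this data
print('upper limit = {}'.format(best_total / token_total))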

likelihood for The = 0.9615384615384616
likelihood for Fulton = 0.7142857142857143
likelihood for County = 1.0
likelihood for Grand = 0.8333333333333334
likelihood for Jury = 1.0
likelihood for said = 0.9502487562189055
likelihood for Friday = 1.0
likelihood for an = 1.0
likelihood for investigation = 1.0
likelihood for of = 0.9533169533169533
likelihood for Atlanta's = 1.0
likelihood for recent = 1.0
likelihood for primary = 0.7647058823529411
likelihood for election = 1.0
likelihood for produced = 0.8333333333333334
likelihood for `` = 1.0
likelihood for no = 0.9541284403669725
likelihood for evidence = 1.0
likelihood for '' = 1.0
likelihood for that = 0.6795511221945137
likelihood for any = 1.0
likelihood for irregularities = 1.0
likelihood for took = 1.0
likelihood for place = 0.84
likelihood for . = 0.9955334987593052
likelihood for jury = 0.9772727272727273
likelihood for further = 0.625
likelihood for in = 0.965662968832541
likelihood for term-end = 1.0
likelihood for presentmen

18. Generate some statistics for tagged data to answer the following questions:

a. What proportion of word types are always assigned the same part-of-speech tag?

b. How many words are ambiguous, in the sense that they appear with at least two tags?

c. What percentage of word ** tokens ** in the Brown Corpus involve these ambiguous words?

In [38]:
from nltk.corpus import brown
import nltk

#setting things up
brown_tagged_words = brown.tagged_words(categories='news')
cfd = nltk.ConditionalFreqDist(brown_tagged_words)
conditions = cfd.conditions()

# creates a list of word types that have only one distinct tag
mono_tags = [condition for condition in conditions if len(cfd[condition]) == 1]

# (a) the proportion of word types that are always assigned the same tag
proportion_mono_tags = len(mono_tags) / len(conditions)
print(proportion_mono_tags)

# (b) the number of ambiguous word types (two or more tags)
number_mono_tags = len(conditions) - len(mono_tags)
print(number_mono_tags)

# (c) caution: this divides the number of unambiguous news word types by the token count
# of the whole Brown Corpus, so the result (about 1%) is not the percentage of tokens that
# involve ambiguous words. Answering (c) means summing the token counts of the ambiguous
# words and dividing by the total number of tokens.

total_brown_words = set(brown.words())
number_shared_words = [word for word in mono_tags if word in total_brown_words]
percen_brown = len(number_shared_words) / len(brown.words())
print(percen_brown)

0.8802973461164374
1723
0.010912062776870663
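
The third figure above counts word types, so it does not measure tokens. As a minimal sketch of how all three statistics could be computed from token counts (the function name, the dictionary keys, and the eight-token sample are invented for illustration):

```python
from collections import Counter, defaultdict

def ambiguity_stats(tagged_words):
    """Compute type- and token-level ambiguity statistics for tagged words."""
    tags_by_word = defaultdict(set)
    token_counts = Counter()
    for word, tag in tagged_words:
        tags_by_word[word].add(tag)
        token_counts[word] += 1
    # A word type is ambiguous if it appears with at least two tags.
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    n_types = len(tags_by_word)
    n_tokens = sum(token_counts.values())
    ambiguous_tokens = sum(token_counts[w] for w in ambiguous)
    return {
        'unambiguous_type_proportion': (n_types - len(ambiguous)) / n_types,
        'ambiguous_types': len(ambiguous),
        'ambiguous_token_percentage': 100 * ambiguous_tokens / n_tokens,
    }

# Hand-made sample: 'race' and 'to' each appear with two tags.
sample = [('the', 'AT'), ('race', 'NN'), ('to', 'TO'), ('race', 'VB'),
          ('to', 'IN'), ('the', 'AT'), ('end', 'NN'), ('.', '.')]
print(ambiguity_stats(sample))
```

Passing brown.tagged_words() instead of the sample would give the corpus-wide answers to (a), (b), and (c).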


19. The evaluate() method works out how accurately the tagger performs on this
text. For example, if the supplied tagged text was [('the', 'DT'), ('dog',
'NN')] and the tagger produced the output [('the', 'NN'), ('dog', 'NN')], then
the score would be 0.5. Let’s try to figure out how the evaluation method works:

a. A tagger t takes a list of words as input, and produces a list of tagged words
as output. However, t.evaluate() is given correctly tagged text as its only
parameter. What must it do with this input before performing the tagging?

b. Once the tagger has created newly tagged text, how might the evaluate()
method go about comparing it with the original tagged text and computing
the accuracy score?

c. Now examine the source code to see how the method is implemented. Inspect
nltk.tag.api.__file__ to discover the location of the source code, and open
this file using an editor (be sure to use the api.py file and not the compiled
api.pyc binary file).
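
For parts (a) and (b), a plausible answer is that the method first strips the tags from the gold text, runs the tagger on the bare words, and then counts the positions where the predicted tag equals the gold tag. The sketch below illustrates that idea in plain Python. It is not NLTK's source code, and the names `untag`, `accuracy`, and `always_nn` are chosen here for illustration.

```python
def untag(tagged_sent):
    """Strip the tags, leaving the list of words the tagger takes as input."""
    return [word for word, _tag in tagged_sent]

def accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        predicted = tag_fn(untag(gold))
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            if gold_tag == pred_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0

def always_nn(tokens):
    """A stand-in tagger that labels every token NN."""
    return [(tok, 'NN') for tok in tokens]

# The example from the exercise: one of the two tags is right.
gold = [[('the', 'DT'), ('dog', 'NN')]]
print(accuracy(always_nn, gold))  # 0.5
```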

20. Write code to search the Brown Corpus for particular words and phrases according to tags, to answer the following questions:

a. Produce an alphabetically sorted list of the distinct words tagged as MD.

b. Identify words that can be plural nouns or third person singular verbs (e.g.,
deals, flies).

c. Identify three-word prepositional phrases of the form IN + DET + NN (e.g.,
in the lab).

d. What is the ratio of masculine to feminine pronouns?

In [39]:
import nltk
nltk.download('brown')
nltk.download('universal_tagset')


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\universal_tagset.zip.


True

In [40]:
from nltk.corpus import brown

# Extract words tagged as MD. MD is a Brown tag, so use the corpus's native tags;
# the universal tagset has no MD tag, which is why filtering on it returns nothing.
md_words = set(word.lower() for word, tag in brown.tagged_words() if tag == 'MD')

# Sort alphabetically
sorted_md_words = sorted(md_words)

print("Distinct words tagged as MD:")
print(sorted_md_words)


In [41]:
# Extract words tagged as NNS (plural nouns) or VBZ (third person singular verbs),
# using the native Brown tags (the universal tagset has neither tag)
tagged = brown.tagged_words()
plural_nouns = set(word.lower() for word, tag in tagged if tag == 'NNS')
third_person_singular_verbs = set(word.lower() for word, tag in tagged if tag == 'VBZ')

# Find common words
common_words = plural_nouns.intersection(third_person_singular_verbs)

print("Words that can be plural nouns or third person singular verbs:")
print(sorted(common_words))


In [42]:
from nltk.corpus import brown

# Use the tags already in the Brown Corpus instead of re-tagging with pos_tag
sentences = brown.tagged_sents()

prepositional_phrases = []

# Loop through sentences to find IN + DET + NN patterns.
# Brown tags articles as AT and determiners as DT; both are treated as DET here.
for tagged_sentence in sentences:
    for i in range(len(tagged_sentence) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged_sentence[i:i+3]
        if t1 == 'IN' and t2 in ('AT', 'DT') and t3 == 'NN':
            prepositional_phrases.append(' '.join((w1, w2, w3)))

print("Three-word prepositional phrases (IN + DET + NN):")
print(prepositional_phrases[:10])  # Print a sample


In [43]:
# Extract masculine and feminine pronouns
masculine_pronouns = set(['he', 'him', 'his', 'himself'])
feminine_pronouns = set(['she', 'her', 'hers', 'herself'])

# Count occurrences
masculine_count = sum(1 for word, tag in brown.tagged_words(tagset='universal') if word.lower() in masculine_pronouns)
feminine_count = sum(1 for word, tag in brown.tagged_words(tagset='universal') if word.lower() in feminine_pronouns)

# Calculate ratio
ratio = masculine_count / (feminine_count if feminine_count > 0 else 1)  # Avoid division by zero

print(f"Masculine pronouns count: {masculine_count}")
print(f"Feminine pronouns count: {feminine_count}")
print(f"Ratio of masculine to feminine pronouns: {ratio:.2f}")


Masculine pronouns count: 19766
Feminine pronouns count: 6037
Ratio of masculine to feminine pronouns: 3.27


22. We defined the regexp_tagger that can be used as a fall-back tagger for unknown
words. This tagger only checks for cardinal numbers. By testing for particular prefix
or suffix strings, it should be possible to guess other tags. For example, we could
tag any word that ends with -s as a plural noun. Define a regular expression tagger
(using RegexpTagger()) that tests for at least five other patterns in the spelling of
words. (Use inline documentation to explain the rules.)

In [44]:
import nltk
from nltk.tag import RegexpTagger
from nltk.corpus import brown

# Define regular expression patterns and their corresponding tags
patterns = [
    # Pattern to match words ending in -s (plural nouns)
    (r'.*s$', 'NNS'),  # NNS - plural noun

    # Pattern to match words ending in -ed (past tense verbs)
    (r'.*ed$', 'VBD'),  # VBD - past tense verb

    # Pattern to match words ending in -ing (present participles)
    (r'.*ing$', 'VBG'),  # VBG - gerund or present participle

    # Pattern to match words starting with un- (prefix indicating negation)
    (r'^un.*$', 'JJ'),  # JJ - adjective

    # Pattern to match words ending in -ly (adverbs)
    (r'.*ly$', 'RB')    # RB - adverb
]

# Create the regular expression tagger with the defined patterns
regexp_tagger = RegexpTagger(patterns)

# Example sentences for testing
sentences = [
    'He walks to the park quickly.',
    'The children were playing happily in the garden.',
    'She unlocked the door yesterday.',
    'The birds were flying high in the sky.',
    'This is an unusual problem.'
]

# Tokenize sentences and apply the regular expression tagger
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    tagged_sentence = regexp_tagger.tag(tokens)
    print("Original Sentence:", sentence)
    print("Tagged Sentence:", tagged_sentence)
    print()


Original Sentence: He walks to the park quickly.
Tagged Sentence: [('He', None), ('walks', 'NNS'), ('to', None), ('the', None), ('park', None), ('quickly', 'RB'), ('.', None)]

Original Sentence: The children were playing happily in the garden.
Tagged Sentence: [('The', None), ('children', None), ('were', None), ('playing', 'VBG'), ('happily', 'RB'), ('in', None), ('the', None), ('garden', None), ('.', None)]

Original Sentence: She unlocked the door yesterday.
Tagged Sentence: [('She', None), ('unlocked', 'VBD'), ('the', None), ('door', None), ('yesterday', None), ('.', None)]

Original Sentence: The birds were flying high in the sky.
Tagged Sentence: [('The', None), ('birds', 'NNS'), ('were', None), ('flying', 'VBG'), ('high', None), ('in', None), ('the', None), ('sky', None), ('.', None)]

Original Sentence: This is an unusual problem.
Tagged Sentence: [('This', 'NNS'), ('is', 'NNS'), ('an', None), ('unusual', 'JJ'), ('problem', None), ('.', None)]
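
In the output above, This and is are tagged NNS because RegexpTagger tries its patterns in order and the first match wins, so the broad .*s$ rule fires on any word ending in s. The sketch below mimics that first-match behaviour with only the re module; first_match_tag and both pattern lists are illustrative names, not part of NLTK, and the BEZ and DT rules are assumptions based on the Brown tagset. It suggests that placing narrow rules ahead of broad ones may remove such errors.

```python
import re

def first_match_tag(word, patterns):
    """Return the tag of the first pattern that matches word, or None."""
    for regexp, tag in patterns:
        if re.match(regexp, word):
            return tag
    return None

# Order used in the cell above: the broad plural rule comes first.
broad_first = [
    (r'.*s$', 'NNS'),
    (r'.*ed$', 'VBD'),
    (r'.*ing$', 'VBG'),
    (r'^un.*$', 'JJ'),
    (r'.*ly$', 'RB'),
]

# Hypothetical reordering: closed-class exceptions first, broad suffix rules last.
specific_first = [
    (r'^is$', 'BEZ'),            # assumed Brown tag for "is"
    (r'^[Tt]h(is|at)$', 'DT'),   # assumed Brown tag for "this" / "that"
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*ly$', 'RB'),
    (r'^un.*$', 'JJ'),
    (r'.*s$', 'NNS'),
]

for word in ['This', 'is', 'walks', 'unlocked', 'unusual']:
    print(word, first_match_tag(word, broad_first), first_match_tag(word, specific_first))
```

With the reordered list, walks is still tagged NNS although it is a verb in the test sentence, so spelling rules alone cannot resolve that ambiguity.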



23. Consider the regular expression tagger developed in the exercises in the previous
section. Evaluate the tagger using its accuracy() method, and try to come up with
ways to improve its performance. Discuss your findings. How does objective evaluation help in the development process?

In [45]:
import nltk
from nltk.tag import RegexpTagger
from nltk.corpus import brown
from nltk import accuracy

# Define regular expression patterns and their corresponding tags
patterns = [
    (r'.*s$', 'NNS'),  # Plural nouns
    (r'.*ed$', 'VBD'), # Past tense verbs
    (r'.*ing$', 'VBG'),# Present participles
    (r'^un.*$', 'JJ'), # Adjectives with prefix 'un'
    (r'.*ly$', 'RB')   # Adverbs
]

# Create the regular expression tagger with the defined patterns
regexp_tagger = RegexpTagger(patterns)

# Use Brown Corpus for evaluation
# Extract tagged words from the Brown Corpus
train_sents = brown.tagged_sents(categories='news')
test_sents = brown.tagged_sents(categories='editorial')

# Evaluate the tagger on the test set
accuracy_score = regexp_tagger.accuracy(test_sents)
print(f"Accuracy: {accuracy_score:.2f}")

# For detailed evaluation, print some example taggings
for sent in test_sents[:3]:  # Print a few sentences for manual inspection
    tokens, true_tags = zip(*sent)
    tagged_sentence = regexp_tagger.tag(tokens)
    print("Original Tags:", true_tags)
    print("Predicted Tags:", [tag for _, tag in tagged_sentence])
    print()


Accuracy: 0.08
Original Tags: ('NN-HL', 'NN-HL', 'VBD-HL', 'AP-HL', 'NN-HL')
Predicted Tags: ['RB', None, None, None, None]

Original Tags: ('AT', 'JJ-TL', 'NN-TL', ',', 'WDT', 'VBZ', 'NR', ',', 'HVZ', 'VBN', 'IN', 'AT', 'NN', 'IN', 'NN', 'CC', 'NN', 'IN', 'AT', 'NN', 'PPS', 'VBD', '.')
Predicted Tags: [None, None, 'RB', None, None, 'NNS', None, None, 'NNS', 'VBD', None, None, None, None, 'NNS', None, None, None, None, None, None, 'VBD', None]

Original Tags: ('PPS', 'BEDZ', 'VBN', 'RB', 'IN', 'AT', 'NN', 'IN', 'AT', 'NNS', ',', 'AT', 'NN', 'WDT', 'BEDZ', 'VBN', 'RB', 'IN', 'NN', 'IN', 'AT', 'NN', 'IN', 'AT', 'NN', '*', 'TO', 'VB', 'VBG', 'NN', 'NN', '.')
Predicted Tags: [None, 'NNS', 'VBD', 'RB', None, None, None, None, None, 'NNS', None, None, None, None, 'NNS', None, 'RB', None, None, None, None, None, None, None, None, None, None, None, 'VBG', None, None, None]
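
Two things stand out in the printed examples: most tokens receive None because no pattern matches them, and the gold tags often carry Brown suffixes such as -HL that the patterns never produce. A minimal sketch of separating these effects, assuming a hypothetical helper accuracy_and_coverage (not an NLTK function), is to report coverage (the share of tokens that get any tag) alongside accuracy. The tag lists are copied from the first test sentence above.

```python
def accuracy_and_coverage(gold_tags, predicted_tags):
    """Return (accuracy, coverage) for two equal-length tag sequences."""
    total = len(gold_tags)
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    tagged = sum(1 for p in predicted_tags if p is not None)
    return correct / total, tagged / total

# Tags copied from the first test sentence printed above.
gold = ['NN-HL', 'NN-HL', 'VBD-HL', 'AP-HL', 'NN-HL']
predicted = ['RB', None, None, None, None]

acc, cov = accuracy_and_coverage(gold, predicted)
print(f"accuracy={acc:.2f} coverage={cov:.2f}")
```

Low coverage points toward adding a catch-all final pattern or a backoff tagger, while low accuracy on covered tokens points toward revising individual rules. Measuring both after each change is one way objective evaluation can guide development.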



24. How serious is the sparse data problem? Investigate the performance of n-gram
taggers as n increases from 1 to 6. Tabulate the accuracy score. Estimate the training
data required for these taggers, assuming a vocabulary size of 10^5 and a tagset size
of 10^2.

In [47]:
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger

# Load and prepare data
train_sents = brown.tagged_sents(categories='news')
test_sents = brown.tagged_sents(categories='editorial')

# Define a function to evaluate n-gram taggers (each trained without a backoff tagger)
def evaluate_ngram_tagger(n, train_sents, test_sents):
    if n == 1:
        tagger = UnigramTagger(train_sents)
    elif n == 2:
        tagger = BigramTagger(train_sents)
    elif n == 3:
        tagger = TrigramTagger(train_sents)
    else:
        raise ValueError("n must be between 1 and 3")

    # accuracy() replaces the deprecated evaluate() method
    accuracy_score = tagger.accuracy(test_sents)
    return accuracy_score

# Evaluate taggers for n from 1 to 3
results = []
for n in range(1, 4):
    accuracy_score = evaluate_ngram_tagger(n, train_sents, test_sents)
    results.append((n, accuracy_score))

# Print results
print(f"{'N-gram Order':<12} {'Accuracy':<10}")
for n, acc in results:
    print(f"{n:<12} {acc:<10.4f}")


N-gram Order Accuracy  
1            0.8137    
2            0.1139    
3            0.0673    
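
The sharp drop from n = 1 to n = 2 reflects sparse data: an n-gram tagger conditions on the current word plus the n-1 preceding tags, so the number of distinct contexts it could meet grows as V × T^(n-1). The sketch below works this out for the exercise's figures, read here as a vocabulary size V = 10^5 and a tagset size T = 10^2. Treating one observation per context as the minimum requirement is an assumption that gives only a rough upper estimate, since many contexts never occur in real text; possible_contexts is an illustrative name.

```python
VOCAB_SIZE = 10**5    # V, from the exercise
TAGSET_SIZE = 10**2   # T, from the exercise

def possible_contexts(n, vocab_size=VOCAB_SIZE, tagset_size=TAGSET_SIZE):
    """Number of (n-1 previous tags, current word) contexts for an n-gram tagger."""
    return vocab_size * tagset_size ** (n - 1)

print(f"{'N-gram Order':<12} {'Possible contexts':<18}")
for n in range(1, 7):
    print(f"{n:<12} {possible_contexts(n):<18.0e}")
```

Each increment of n multiplies the context space by the tagset size. For comparison, the whole Brown Corpus contains roughly a million words.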


25. Obtain some tagged data for another language, and train and evaluate a variety
of taggers on it. If the language is morphologically complex, or if there are any
orthographic clues (e.g., capitalization) to word classes, consider developing a regular expression tagger for it (ordered after the unigram tagger, and before the default tagger). How does the accuracy of your tagger(s) compare with the same
taggers run on English data? Discuss any issues you encounter in applying these
methods to the language.

In [49]:
!pip3 install nltk udapi-python




ERROR: Could not find a version that satisfies the requirement udapi-python (from versions: none)
ERROR: No matching distribution found for udapi-python


In [1]:
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger, RegexpTagger

# 1. Obtain tagged data (for demonstration, we will use NLTK's treebank corpus)
# In practice, replace treebank.tagged_sents() with your own tagged data
tagged_sents = treebank.tagged_sents()

# Split the data into training and testing sets
train_data = tagged_sents[:3000]
test_data = tagged_sents[3000:]

# 2. Train and evaluate taggers
# accuracy() replaces the deprecated evaluate() method

# Unigram Tagger
unigram_tagger = UnigramTagger(train_data)
unigram_accuracy = unigram_tagger.accuracy(test_data)
print(f'Unigram Tagger Accuracy: {unigram_accuracy:.4f}')

# Bigram Tagger (without backoff)
bigram_tagger = BigramTagger(train_data)
bigram_accuracy = bigram_tagger.accuracy(test_data)
print(f'Bigram Tagger Accuracy: {bigram_accuracy:.4f}')

# Regular Expression Tagger
# Note: the patterns are tried first and the unigram tagger is consulted only when no
# pattern matches, which is the reverse of the ordering suggested in the exercise.
patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # past tense verbs
    (r'.*es$', 'VBZ'),   # 3rd person singular present verbs
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*\'s$', 'POS'),  # possessive nouns
]

regexp_tagger = RegexpTagger(patterns, backoff=unigram_tagger)
regexp_accuracy = regexp_tagger.accuracy(test_data)
print(f'Regexp Tagger Accuracy: {regexp_accuracy:.4f}')

# Evaluate a custom regular expression tagger that checks for prefixes or suffixes.
# These are Spanish-style patterns; they are applied to English text here only as a placeholder.
custom_patterns = [
    (r'.*mente$', 'RB'),   # adverbs ending in 'mente' (for Romance languages like Spanish)
    (r'.*ción$', 'NN'),    # nouns ending in 'ción'
    (r'.*idad$', 'NN'),    # nouns ending in 'idad'
    (r'.*ar$', 'VB'),      # infinitive verbs ending in 'ar'
    (r'^[A-Z].*', 'NNP'),  # proper nouns (capitalized words)
]

custom_regexp_tagger = RegexpTagger(custom_patterns, backoff=unigram_tagger)
custom_regexp_accuracy = custom_regexp_tagger.accuracy(test_data)
print(f'Custom Regexp Tagger Accuracy: {custom_regexp_accuracy:.4f}')

# 3. Compare with the taggers trained on English data
# Here, you would compare the accuracy of these taggers with their performance on another language dataset.

# 4. Discuss the results
# This part would involve analyzing the results, discussing issues like sparse data, and any challenges encountered.

# Note: Replace `treebank.tagged_sents()` with your own tagged data in practice.


Unigram Tagger Accuracy: 0.8572
Bigram Tagger Accuracy: 0.1132
Regexp Tagger Accuracy: 0.8360
Custom Regexp Tagger Accuracy: 0.8555


28. Experiment with taggers using the simplified tagset (or make one of your own
by discarding all but the first character of each tag name). Such a tagger has fewer
distinctions to make, but much less information on which to base its work. Discuss
your findings.
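
A minimal sketch of the "make one of your own" option, assuming hypothetical helpers simplify_tag and simplify_sents (neither is part of NLTK): each tag is cut down to its first character before training. The sample sentence uses Brown tags that also appear in this notebook's output.

```python
def simplify_tag(tag):
    """Keep only the first character of a tag name, e.g. 'NN-TL' -> 'N'."""
    return tag[:1]

def simplify_sents(tagged_sents):
    """Apply simplify_tag to every (word, tag) pair in a list of tagged sentences."""
    return [[(word, simplify_tag(tag)) for word, tag in sent] for sent in tagged_sents]

# Toy sample in Brown tags; in practice pass brown.tagged_sents(...) instead.
sample = [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
           ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD')]]

simplified = simplify_sents(sample)
print(simplified[0])

original_tags = {tag for sent in sample for _, tag in sent}
simplified_tags = {tag for sent in simplified for _, tag in sent}
print(len(original_tags), "tags reduced to", len(simplified_tags))
```

The simplified sentences can be split into training and test sets and passed to UnigramTagger or BigramTagger in place of the original data. Note that first-character truncation merges unrelated Brown tags, for example CC (coordinating conjunction), CD (cardinal number), and CS (subordinating conjunction) all become C, which is part of the information loss the exercise asks about.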

29. Recall the example of a bigram tagger which encountered a word it hadn’t seen
during training, and tagged the rest of the sentence as None. It is possible for a
bigram tagger to fail partway through a sentence even if it contains no unseen words
(even if the sentence was used during training). In what circumstance can this
happen? Can you write a program to find some examples of this?

In [2]:
import nltk
from nltk.corpus import brown
from nltk.tag import BigramTagger, UnigramTagger
from nltk.data import load

# Load the Brown corpus and split it into training and test sets
brown_tagged_sents = brown.tagged_sents(categories='news')
train_sents = brown_tagged_sents[:4000]
test_sents = brown_tagged_sents[4000:4050]

# Train a BigramTagger
bigram_tagger = BigramTagger(train_sents, backoff=UnigramTagger(train_sents))

# Function to find examples where the tagger fails
def find_broken_sentences(tagger, test_sents):
    broken_sentences = []
    for sent in test_sents:
        tagged_sent = tagger.tag([word for word, tag in sent])
        if None in [tag for word, tag in tagged_sent]:
            broken_sentences.append(sent)
    return broken_sentences

# Find and print examples where the bigram tagger fails
broken_sentences = find_broken_sentences(bigram_tagger, test_sents)
for sent in broken_sentences:
    print("Original:", sent)
    print("Tagged:  ", bigram_tagger.tag([word for word, tag in sent]))
    print()



Original: [('In', 'IN'), ("Ruth's", 'NP$'), ('day', 'NN'), ('--', '--'), ('and', 'CC'), ('until', 'IN'), ('this', 'DT'), ('year', 'NN'), ('--', '--'), ('the', 'AT'), ('schedule', 'NN'), ('was', 'BEDZ'), ('154', 'CD'), ('games', 'NNS'), ('.', '.')]
Tagged:   [('In', 'IN'), ("Ruth's", 'NP$'), ('day', 'NN'), ('--', '--'), ('and', 'CC'), ('until', 'CS'), ('this', 'DT'), ('year', 'NN'), ('--', '--'), ('the', 'AT'), ('schedule', 'NN'), ('was', 'BEDZ'), ('154', None), ('games', 'NNS'), ('.', '.')]

Original: [('Baseball', 'NN'), ('commissioner', 'NN'), ('Ford', 'NP'), ('Frick', 'NP'), ('has', 'HVZ'), ('ruled', 'VBN'), ('that', 'CS'), ("Ruth's", 'NP$'), ('record', 'NN'), ('will', 'MD'), ('remain', 'VB'), ('official', 'JJ'), ('unless', 'CS'), ('it', 'PPS'), ('is', 'BEZ'), ('broken', 'VBN'), ('in', 'IN'), ('154', 'CD'), ('games', 'NNS'), ('.', '.')]
Tagged:   [('Baseball', 'NN-TL'), ('commissioner', 'NN'), ('Ford', 'NP'), ('Frick', 'NP'), ('has', 'HVZ'), ('ruled', 'VBN'), ('that', 'CS'), ("Ruth'
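
The run above pairs the bigram tagger with a unigram backoff, so the None tags it reports come from unseen words such as 154. The case the exercise asks about arises without backoff: a bigram tagger looks up the pair (previous tag, current word), and the previous tag is the one it predicted itself. If that prediction differs from the gold tag, the next lookup can miss even though every word, and the sentence itself, was in the training data. The toy tagger below is a stdlib-only sketch of this mechanism, not NLTK's implementation; train_bigram, tag_bigram, and the three training sentences are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: tags.most_common(1)[0][0] for context, tags in counts.items()}

def tag_bigram(model, words):
    """Tag words left to right, using the tagger's own previous prediction as context."""
    tagged, prev_tag = [], None
    for word in words:
        tag = model.get((prev_tag, word))  # None when the context was never seen
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

train = [
    [('the', 'AT'), ('back', 'NN'), ('hurts', 'VBZ')],
    [('the', 'AT'), ('back', 'NN'), ('hurts', 'VBZ')],
    [('the', 'AT'), ('back', 'JJ'), ('door', 'NN')],
]
model = train_bigram(train)

# The third training sentence: 'back' is predicted as NN (the majority tag after AT),
# so 'door' is looked up in the context ('NN', 'door'), which never occurred.
result = tag_bigram(model, ['the', 'back', 'door'])
print(result)
```

To search for real examples, the same check could be run with a BigramTagger trained without backoff, restricted to sentences whose words all occur in the training set.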

30. Preprocess the Brown News data by replacing low-frequency words with UNK,
but leaving the tags untouched. Now train and evaluate a bigram tagger on this
data. How much does this help? What is the contribution of the unigram tagger
and default tagger now?

In [3]:
import nltk
from nltk.corpus import brown
from nltk.probability import FreqDist
from nltk.tag import BigramTagger, UnigramTagger, DefaultTagger

# Step 1: Load the Brown corpus (News category)
brown_tagged_sents = brown.tagged_sents(categories='news')

# Step 2: Identify low-frequency words
word_freq = FreqDist(word.lower() for sent in brown_tagged_sents for word, tag in sent)
low_freq_threshold = 3  # Set a threshold for low-frequency words
low_freq_words = {word for word in word_freq if word_freq[word] < low_freq_threshold}

# Step 3: Replace low-frequency words with "UNK"
def replace_low_freq_words(tagged_sent, low_freq_words):
    return [(word if word.lower() not in low_freq_words else 'UNK', tag) for word, tag in tagged_sent]

brown_tagged_sents_unk = [replace_low_freq_words(sent, low_freq_words) for sent in brown_tagged_sents]

# Step 4: Split data into training and testing sets
train_sents = brown_tagged_sents_unk[:4000]
test_sents = brown_tagged_sents_unk[4000:4050]

# Step 5: Train taggers
default_tagger = DefaultTagger('NN')
unigram_tagger = UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

# Step 6: Evaluate the tagger (accuracy() replaces the deprecated evaluate() method)
accuracy = bigram_tagger.accuracy(test_sents)
print(f"Bigram Tagger Accuracy with UNK: {accuracy:.4f}")


Bigram Tagger Accuracy with UNK: 0.8538


32. Consult the documentation for the Brill tagger demo function, using
help(nltk.tag.brill.demo). Experiment with the tagger by setting different values
for the parameters. Is there any trade-off between training time (corpus size) and
performance?

In [7]:
import nltk
from nltk.tag import UnigramTagger, BigramTagger, brill, brill_trainer
from nltk.corpus import brown

# Load the Brown corpus (a subset for demonstration purposes)
train_sents = brown.tagged_sents(categories='news')[:2000]
test_sents = brown.tagged_sents(categories='news')[2000:2050]

# Train a backoff tagger (Unigram + Bigram)
unigram_tagger = UnigramTagger(train_sents)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

# Set up templates for the Brill tagger
templates = brill.fntbl37()

# Train the Brill tagger
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger=bigram_tagger, templates=templates)
brill_tagger = trainer.train(train_sents, max_rules=200, min_score=2, min_acc=0.7)

# Evaluate the Brill tagger (accuracy() replaces the deprecated evaluate() method)
accuracy = brill_tagger.accuracy(test_sents)
print(f"Brill Tagger Accuracy: {accuracy:.4f}")


Brill Tagger Accuracy: 0.7455


33. Write code that builds a dictionary of dictionaries of sets. Use it to store the set
of POS tags that can follow a given word having a given POS tag, i.e., word_i → tag_i →
tag_{i+1}.

In [8]:
import nltk
from nltk.corpus import brown
from collections import defaultdict

# Ensure you have the Brown corpus
nltk.download('brown')

def build_pos_tag_dict(tagged_sents):
    """
    Build a dictionary of dictionaries of sets to store the POS tags that can follow
    a given word with a given POS tag.

    :param tagged_sents: List of tagged sentences
    :return: Dictionary of dictionaries of sets
    """
    pos_tag_dict = defaultdict(lambda: defaultdict(set))

    # Iterate over each sentence in the tagged corpus
    for sent in tagged_sents:
        # Iterate over each word and its tag in the sentence
        for i in range(len(sent) - 1):
            word1, tag1 = sent[i]
            word2, tag2 = sent[i + 1]
            pos_tag_dict[word1][tag1].add(tag2)

    return pos_tag_dict

# Load the Brown corpus tagged sentences
tagged_sents = brown.tagged_sents(categories='news')

# Build the dictionary
pos_tag_dict = build_pos_tag_dict(tagged_sents)

# Print some examples to verify
for word in list(pos_tag_dict.keys())[:5]:  # Just print the first 5 words for brevity
    print(f"Word: {word}")
    for tag in pos_tag_dict[word]:
        print(f"  Tag: {tag} -> Follows Tags: {pos_tag_dict[word][tag]}")


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Word: The
  Tag: AT -> Follows Tags: {'NP-TL', 'NPS', 'VBN-TL', 'JJ', 'CD', '``', 'NN', 'QL', 'NNS', 'NNS-TL', 'JJS', 'RB', 'NN-TL', 'NPS$', 'JJT', 'NN$-TL', 'NN$', 'NNS$', 'OD', 'VBG', 'VBN', 'AP', 'JJR', 'JJ-TL', 'NP', 'OD-TL', 'FW-JJ-TL', 'RB-TL', 'VBG-TL'}
  Tag: AT-TL -> Follows Tags: {'NN-TL', 'NP-TL', 'NR-TL', 'JJ-TL', 'AP-TL', 'NN$-TL', 'NN', 'NP', 'NNS-TL'}
  Tag: AT-HL -> Follows Tags: {'NP-HL', 'NN-HL'}
Word: Fulton
  Tag: NP-TL -> Follows Tags: {'NN-TL', 'JJ-TL'}
  Tag: NP -> Follows Tags: {'NNS', 'NN$'}
Word: County
  Tag: NN-TL -> Follows Tags: {'NN-TL', 'NP-TL', ',', 'JJ', '.', 'JJ-TL', 'MD', 'VBD', 'NN', 'HVZ', 'VBG', 'CC', 'NNS', 'BEZ', 'NNS-TL', 'IN', 'RB', 'HV'}
Word: Grand
  Tag: JJ-TL -> Follows Tags: {'NN-TL', 'NNS-TL'}
  Tag: FW-JJ-TL -> Follows Tags: {'FW-NN-TL'}
Word: Jury
  Tag: NN-TL -> Follows Tags: {'VBD', 'NNS'}
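
One caveat when querying such a structure: indexing a nested defaultdict with an unknown word or tag silently creates an empty entry. The sketch below rebuilds the table on toy data and adds a hypothetical following_tags helper that looks entries up with get() so the table is left unchanged; the names and the two sample sentences are illustrative.

```python
from collections import defaultdict

def build_following_tags(tagged_sents):
    """Build word -> tag -> set of tags that follow that (word, tag) pair."""
    table = defaultdict(lambda: defaultdict(set))
    for sent in tagged_sents:
        # Pair each token with its successor in the same sentence.
        for (word, tag), (_, next_tag) in zip(sent, sent[1:]):
            table[word][tag].add(next_tag)
    return table

def following_tags(table, word, tag):
    """Look up the set of following tags without creating empty entries."""
    return table.get(word, {}).get(tag, set())

toy_sents = [
    [('the', 'AT'), ('run', 'NN'), ('ended', 'VBD')],
    [('they', 'PPSS'), ('run', 'VB'), ('fast', 'RB')],
]
table = build_following_tags(toy_sents)

print(following_tags(table, 'run', 'NN'))
print(following_tags(table, 'run', 'VB'))
print(following_tags(table, 'walk', 'VB'))
print('walk' in table)
```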


34. There are 264 distinct words in the Brown Corpus having exactly three possible
tags.

a. Print a table with the integers 1..10 in one column, and the number of distinct
words in the corpus having 1..10 distinct tags in the other column.

b. For the word with the greatest number of distinct tags, print out sentences
from the corpus containing the word, one for each possible tag.

In [9]:
import nltk
from nltk.corpus import brown
from collections import defaultdict

# Ensure you have the Brown corpus
nltk.download('brown')

def count_distinct_tags(tagged_sents):
    """
    Count distinct POS tags for each word and summarize counts for different numbers of distinct tags.

    :param tagged_sents: List of tagged sentences
    :return: A dictionary mapping number of distinct tags to count of words
    """
    word_tags = defaultdict(set)

    # Iterate over each sentence in the tagged corpus
    for sent in tagged_sents:
        # Iterate over each word and its tag in the sentence
        for word, tag in sent:
            word_tags[word].add(tag)

    # Count how many words have exactly n distinct tags
    tag_counts = defaultdict(int)
    for tags in word_tags.values():
        tag_counts[len(tags)] += 1

    return tag_counts, word_tags

def find_word_with_max_tags(word_tags):
    """
    Find the word with the maximum number of distinct tags.

    :param word_tags: Dictionary mapping words to their distinct tags
    :return: Word with the maximum number of distinct tags
    """
    max_word = max(word_tags, key=lambda w: len(word_tags[w]))
    return max_word

def print_sentences_for_word(word, tagged_sents):
    """
    Print sentences from the corpus containing the word for each of its possible tags.

    :param word: The word to find in the sentences
    :param tagged_sents: List of tagged sentences
    """
    tags_for_word = set()
    sentences_with_word = defaultdict(list)

    # Collect sentences containing the word and their tags
    for sent in tagged_sents:
        if any(w == word for w, _ in sent):
            for w, tag in sent:
                if w == word:
                    tags_for_word.add(tag)
                    sentences_with_word[tag].append(sent)

    # Print sentences for each tag
    for tag in tags_for_word:
        print(f"Tag: {tag}")
        for sent in sentences_with_word[tag]:
            print(' '.join([w for w, t in sent]))
        print()

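# Load the tagged sentences from the news category only. This is a subset of the
# Brown Corpus, so the counts differ from the 264 words quoted in the exercise.
tagged_sents = brown.tagged_sents(categories='news')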

# Count distinct tags and get the mapping
tag_counts, word_tags = count_distinct_tags(tagged_sents)

# Print the table for integers 1..10
print("Number of Distinct Tags | Count of Words")
for i in range(1, 11):
    print(f"{i:22} | {tag_counts.get(i, 0)}")

# Get the word with the maximum number of distinct tags
max_word = find_word_with_max_tags(word_tags)

# Print sentences for the word with the maximum number of distinct tags
print(f"Word with the greatest number of distinct tags: {max_word}")
print_sentences_for_word(max_word, tagged_sents)


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Number of Distinct Tags | Count of Words
                     1 | 12671
                     2 | 1478
                     3 | 204
                     4 | 32
                     5 | 8
                     6 | 1
                     7 | 0
                     8 | 0
                     9 | 0
                    10 | 0
Word with the greatest number of distinct tags: to
Tag: NPS
Also noted are the marriages of Elizabeth Browning , daughter of the George L. Brownings , to Austin C. Smith Jr. ; ;

Tag: TO-HL
Three groups to meet
' know enough to sue '
Refuses to grant bail
' had to know peddlers '
' intend to attend '
Nothing to fear

Tag: TO
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
It recommended that Fulton legislators act `` to have these laws studied and revised to the end of modernizing and improving the

35. Write a program to classify contexts involving the word must according to the
tag of the following word. Can this be used to discriminate between the epistemic
and deontic uses of must?

In [10]:
import nltk
from nltk.corpus import brown
from collections import defaultdict, Counter

# Ensure you have the Brown corpus
nltk.download('brown')

def classify_must_contexts(tagged_sents):
    """
    Classify contexts involving the word 'must' based on the tag of the following word.

    :param tagged_sents: List of tagged sentences
    :return: A Counter object with tags of the word following 'must'
    """
    following_tags = Counter()

    for sent in tagged_sents:
        # Iterate over each word and its tag in the sentence
        for i, (word, tag) in enumerate(sent):
            if word.lower() == 'must' and i + 1 < len(sent):
                next_word, next_tag = sent[i + 1]
                following_tags[next_tag] += 1

    return following_tags

def analyze_and_classify(tags_count):
    """
    Analyze the tags that follow 'must' and classify the contexts.
    
    :param tags_count: Counter object with tags of the word following 'must'
    :return: Classification of the contexts
    """
    # Rough heuristic in Brown tags, not a validated linguistic test:
    # 'must' + be/have (BE, HV) often signals an inference ("must be", "must have been"),
    # while 'must' + a bare main verb (VB) usually expresses an obligation.
    epistemic_tags = {'BE', 'HV'}
    deontic_tags = {'VB', 'VB-HL', 'VB-TL'}

    # Analyze and classify
    classification = {
        'Epistemic': 0,
        'Deontic': 0,
        'Unclassified': 0
    }

    for tag, count in tags_count.items():
        if tag in epistemic_tags:
            classification['Epistemic'] += count
        elif tag in deontic_tags:
            classification['Deontic'] += count
        else:
            classification['Unclassified'] += count

    return classification

# Load the Brown corpus tagged sentences
tagged_sents = brown.tagged_sents(categories='news')

# Classify contexts involving 'must'
tags_count = classify_must_contexts(tagged_sents)

# Analyze and classify the contexts
classification = analyze_and_classify(tags_count)

# Print the results
print("Tag Counts Following 'must':")
for tag, count in tags_count.items():
    print(f"{tag}: {count}")

print("\nClassification of Contexts:")
for context, count in classification.items():
    print(f"{context}: {count}")


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Tag Counts Following 'must':
BE: 17
VB: 23
VB-HL: 1
HV: 5
*: 3
VB-TL: 1
IN: 1
'': 1
RB: 1

Classification of Contexts:
Epistemic: 22
Deontic: 25
Unclassified: 6


36. Create a regular expression tagger and various unigram and n-gram taggers,
incorporating backoff, and train them on part of the Brown Corpus.

a. Create three different combinations of the taggers. Test the accuracy of each
combined tagger. Which combination works best?

b. Try varying the size of the training corpus. How does it affect your results?

In [11]:
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger, RegexpTagger, DefaultTagger

# Ensure you have the Brown corpus
nltk.download('brown')

# Load the Brown Corpus and split into train and test sets
tagged_sents = brown.tagged_sents(categories='news')
train_sents = tagged_sents[:int(len(tagged_sents) * 0.8)]
test_sents = tagged_sents[int(len(tagged_sents) * 0.8):]

# Define different taggers
def create_taggers(train_sents):
    # Regular Expression Tagger
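    regexp_tagger = RegexpTagger(
        [('.*ed$', 'VBD'),    # past tense verbs
         ('.*es$', 'VBZ'),    # 3rd person singular present verbs
         ('.*ing$', 'VBG'),   # gerunds
         ('.*s$', 'NNS'),     # plural nouns
         ('.*\'s$', 'VBZ'),   # never reached: '.*s$' above already matches these words
         ('^.*$', 'NN')],     # catch-all: everything else is tagged as a noun
        backoff=DefaultTagger('NN')  # never consulted, since the catch-all always matches
    )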
    
    # Unigram Tagger
    unigram_tagger = UnigramTagger(train_sents, backoff=regexp_tagger)
    
    # Bigram Tagger
    bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)
    
    # Trigram Tagger
    trigram_tagger = TrigramTagger(train_sents, backoff=bigram_tagger)
    
    return regexp_tagger, unigram_tagger, bigram_tagger, trigram_tagger

# Function to evaluate a tagger
def evaluate_tagger(tagger, test_sents):
    # accuracy() replaces the deprecated evaluate() method
    return tagger.accuracy(test_sents)

# Train and evaluate different tagger combinations
def evaluate_combinations(train_sents, test_sents):
    # Create taggers
    regexp_tagger, unigram_tagger, bigram_tagger, trigram_tagger = create_taggers(train_sents)

    # Evaluate each tagger (all but the regexp tagger already include their backoff chain)
    print("Evaluating individual taggers:")
    print(f"Regexp Tagger Accuracy: {evaluate_tagger(regexp_tagger, test_sents)}")
    print(f"Unigram Tagger Accuracy: {evaluate_tagger(unigram_tagger, test_sents)}")
    print(f"Bigram Tagger Accuracy: {evaluate_tagger(bigram_tagger, test_sents)}")
    print(f"Trigram Tagger Accuracy: {evaluate_tagger(trigram_tagger, test_sents)}")

    # Combined taggers. These rebuild the same backoff chains as above,
    # so their scores are identical to the corresponding results already printed.
    combined_tagger_1 = TrigramTagger(train_sents, backoff=bigram_tagger)
    combined_tagger_2 = BigramTagger(train_sents, backoff=unigram_tagger)
    combined_tagger_3 = UnigramTagger(train_sents, backoff=regexp_tagger)

    print("\nEvaluating combined taggers:")
    print(f"Combined Tagger 1 (Trigram with Bigram Backoff) Accuracy: {evaluate_tagger(combined_tagger_1, test_sents)}")
    print(f"Combined Tagger 2 (Bigram with Unigram Backoff) Accuracy: {evaluate_tagger(combined_tagger_2, test_sents)}")
    print(f"Combined Tagger 3 (Unigram with Regexp Backoff) Accuracy: {evaluate_tagger(combined_tagger_3, test_sents)}")

# Run evaluation
evaluate_combinations(train_sents, test_sents)

# Vary training corpus size
def evaluate_with_varying_sizes(tagged_sents):
    sizes = [0.1, 0.2, 0.4, 0.6, 0.8]

    for size in sizes:
        split_index = int(len(tagged_sents) * size)
        train_sents = tagged_sents[:split_index]
        test_sents = tagged_sents[split_index:]

        print(f"\nEvaluating with training size: {size * 100}%")
        evaluate_combinations(train_sents, test_sents)

# Run evaluation with varying training corpus sizes
evaluate_with_varying_sizes(tagged_sents)


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Evaluating individual taggers:
Regexp Tagger Accuracy: 0.1797774459270678
Unigram Tagger Accuracy: 0.8451274146153476
Bigram Tagger Accuracy: 0.8544727588034106
Trigram Tagger Accuracy: 0.8521123368177658

Evaluating combined taggers:
Combined Tagger 1 (Trigram with Bigram Backoff) Accuracy: 0.8521123368177658
Combined Tagger 2 (Bigram with Unigram Backoff) Accuracy: 0.8544727588034106
Combined Tagger 3 (Unigram with Regexp Backoff) Accuracy: 0.8451274146153476

Evaluating with training size: 10.0%
Evaluating individual taggers:
Regexp Tagger Accuracy: 0.18318127638717152
Unigram Tagger Accuracy: 0.7541578941505419
Bigram Tagger Accuracy: 0.7576112020853524
Trigram Tagger Accuracy: 0.7574552462431352

Evaluating combined taggers:
Combined Tagger 1 (Trigram with Bigram Backoff) Accuracy: 0.7574552462431352
Combined Tagger 2 (Bigram with Unigram Backoff) Accuracy: 0.7576112020853524
Combined Tagger 3 (Unigram with Regexp Backoff) Accuracy: 0.7541578941505419

Evaluating with training size: 20.0%
Evaluating individual taggers:
Regexp Tagger Accuracy: 0.18221883

37. Our approach for tagging an unknown word has been to consider the letters of
the word (using RegexpTagger()), or to ignore the word altogether and tag it as a
noun (using nltk.DefaultTagger()). These methods will not do well for texts having
new words that are not nouns. Consider the sentence I like to blog on Kim’s
blog. If blog is a new word, then looking at the previous tag (TO versus NP$) would
probably be helpful, i.e., we need a default tagger that is sensitive to the preceding
tag.

a. Create a new kind of unigram tagger that looks at the tag of the previous word,
and ignores the current word. (The best way to do this is to modify the source
code for UnigramTagger(), which presumes knowledge of object-oriented
programming in Python.)

b. Add this tagger to the sequence of backoff taggers (including ordinary trigram
and bigram taggers that look at words), right before the usual default tagger.

c. Evaluate the contribution of this new unigram tagger.

In [19]:
import nltk
from nltk.tag import UnigramTagger, DefaultTagger, BigramTagger, TrigramTagger, RegexpTagger
from nltk.corpus import brown

# Custom PreviousTagUnigramTagger class
class PreviousTagUnigramTagger:
    """Lookup tagger keyed on (previous tag, current word).

    Note that the current word is part of the lookup key, and that when a
    backoff tagger is supplied, tag() returns the backoff's output in place
    of the tags looked up here.
    """

    def __init__(self, train_sents, backoff=None):
        self.backoff = backoff
        self.tagdict = self._train(train_sents)

    def _train(self, train_sents):
        # Record the first tag seen for each (previous tag, word) pair
        tagdict = {}
        for sent in train_sents:
            prev_tag = 'START'
            for word, tag in sent:
                tagdict.setdefault((prev_tag, word), tag)
                prev_tag = tag
        return tagdict

    def tag(self, sentence):
        tagged = []
        prev_tag = 'START'
        for word in sentence:
            tag = self.tagdict.get((prev_tag, word), 'NN')  # Default to 'NN' if not in tagdict
            tagged.append((word, tag))
            prev_tag = tag
        if self.backoff:
            # The backoff retags the whole sentence, replacing the tags above
            tagged = self.backoff.tag([word for word, _ in tagged])
        return tagged

    def accuracy(self, test_sents):
        """Return the fraction of tokens whose predicted tag matches the gold tag."""
        correct = 0
        total = 0
        for sent in test_sents:
            ref_tags = [tag for _, tag in sent]
            tagged = self.tag([word for word, _ in sent])
            tagged_tags = [tag for _, tag in tagged]
            correct += sum(ref == tag for ref, tag in zip(ref_tags, tagged_tags))
            total += len(sent)
        return correct / total

# Load the Brown Corpus and split into train and test sets (80% / 20%)
tagged_sents = brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.8)
train_sents = tagged_sents[:split]
test_sents = tagged_sents[split:]

# Create other taggers
regexp_tagger = RegexpTagger(
    [(r'.*ed$', 'VBD'),
     (r'.*es$', 'VBZ'),
     (r'.*ing$', 'VBG'),
     (r'.*s$', 'NNS')],
    backoff=DefaultTagger('NN')
)

unigram_tagger = UnigramTagger(train_sents, backoff=regexp_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)
trigram_tagger = TrigramTagger(train_sents, backoff=bigram_tagger)

# Create the custom PreviousTagUnigramTagger and combine it with other taggers
prev_tag_tagger = PreviousTagUnigramTagger(train_sents, backoff=trigram_tagger)

# Evaluate the taggers
def evaluate_tagger(tagger, test_sents):
    # accuracy() replaces the deprecated evaluate() on NLTK taggers
    return tagger.accuracy(test_sents)

# Evaluate with the combined tagger including PreviousTagUnigramTagger
print("Evaluating combined taggers with PreviousTagUnigramTagger included:")
print(f"Accuracy: {evaluate_tagger(prev_tag_tagger, test_sents)}")

# Evaluate with just the existing taggers (without PreviousTagUnigramTagger)
basic_tagger = trigram_tagger  # same trigram -> bigram -> unigram -> regexp chain
print("\nEvaluating with existing taggers (without PreviousTagUnigramTagger):")
print(f"Accuracy: {evaluate_tagger(basic_tagger, test_sents)}")


Evaluating combined taggers with PreviousTagUnigramTagger included:
Accuracy: 0.8521123368177658

Evaluating with existing taggers (without PreviousTagUnigramTagger):
Accuracy: 0.8521123368177658
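
Part (a) asks for a tagger whose only context is the previous tag. One possible sketch is to subclass UnigramTagger and override its context() method so that it returns the last assigned tag and ignores the current word. The class name PreviousTagTagger, the "<START>" marker, and the toy training sentences below are assumptions made for illustration; they are not part of NLTK or of the exercise.

```python
from nltk.tag import DefaultTagger, UnigramTagger


class PreviousTagTagger(UnigramTagger):
    """Hypothetical unigram-style tagger conditioned on the previous tag."""

    def context(self, tokens, index, history):
        # history holds the tags already assigned to tokens[:index];
        # the current word is deliberately ignored.
        return history[index - 1] if index > 0 else "<START>"


# Toy training data, invented for illustration (Brown-style tags)
train = [
    [("I", "PPSS"), ("like", "VB"), ("to", "TO"), ("swim", "VB")],
    [("we", "PPSS"), ("want", "VB"), ("to", "TO"), ("sing", "VB")],
    [("I", "PPSS"), ("sit", "VB"), ("on", "IN"), ("Kim's", "NP$"), ("boat", "NN")],
    [("we", "PPSS"), ("ride", "VB"), ("in", "IN"), ("Lee's", "NP$"), ("car", "NN")],
    [("I", "PPSS"), ("hope", "VB"), ("to", "TO"), ("win", "VB")],
]

default = DefaultTagger("NN")

# Baseline: word-based unigram tagger that falls straight back to 'NN'
baseline = UnigramTagger(train, backoff=default)

# Part (b): the previous-tag tagger sits right before the default tagger
prev_tag = PreviousTagTagger(train, backoff=default)
with_prev = UnigramTagger(train, backoff=prev_tag)

sentence = "I like to blog on Kim's blog".split()
baseline_tags = baseline.tag(sentence)
with_prev_tags = with_prev.tag(sentence)
print(baseline_tags)
print(with_prev_tags)
```

On this toy data the baseline tags both occurrences of the unseen word blog as NN, while the chain with the previous-tag tagger tags the one after to as VB and the one after Kim's as NN. For part (c), the same two chains could be trained on the Brown split and compared with accuracy().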


39. Use some of the estimation techniques in nltk.probability, such as Lidstone or
Laplace estimation, to develop a statistical tagger that does a better job than
n-gram backoff taggers in cases where contexts encountered during testing were not
seen during training.

In [28]:
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.probability import LidstoneProbDist, LaplaceProbDist, ConditionalFreqDist, ConditionalProbDist

# Load treebank data
train_sents = treebank.tagged_sents()[:3000]  # Adjust as needed
test_sents = treebank.tagged_sents()[3000:]   # Adjust as needed

# Bins for smoothing: every distinct training word, plus one for unseen words
num_bins = len({word for sent in train_sents for word, _ in sent}) + 1

# Create smoothed P(word | tag) distributions
def create_conditional_prob_dist(tagged_sents, estimator, *estimator_args):
    cfd = ConditionalFreqDist(
        (tag, word) for sent in tagged_sents for word, tag in sent
    )
    cpfd = ConditionalProbDist(
        cfd, estimator, *estimator_args, bins=num_bins
    )
    return cpfd

# Define smoothing methods
def lidstone_estimator():
    # LidstoneProbDist takes the smoothing parameter gamma (0.2 here)
    return create_conditional_prob_dist(train_sents, LidstoneProbDist, 0.2)

def laplace_estimator():
    # Laplace estimation is Lidstone estimation with gamma = 1
    return create_conditional_prob_dist(train_sents, LaplaceProbDist)

# Define and train taggers.
# Both functions build the same n-gram backoff chain; the estimators defined
# above are not passed to the taggers, so the two sets of accuracies coincide.
def train_lidstone_taggers():
    unigram_tagger = UnigramTagger(train_sents)
    bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)
    trigram_tagger = TrigramTagger(train_sents, backoff=bigram_tagger)
    return unigram_tagger, bigram_tagger, trigram_tagger

def train_laplace_taggers():
    unigram_tagger = UnigramTagger(train_sents)
    bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)
    trigram_tagger = TrigramTagger(train_sents, backoff=bigram_tagger)
    return unigram_tagger, bigram_tagger, trigram_tagger

# Train both sets of taggers
lidstone_taggers = train_lidstone_taggers()
laplace_taggers = train_laplace_taggers()

# Evaluate individual taggers
def evaluate_taggers(tagger_list):
    accuracies = {}
    for tagger_name, tagger in tagger_list.items():
        # accuracy() replaces the deprecated evaluate()
        accuracies[tagger_name] = tagger.accuracy(test_sents)
    return accuracies

# Evaluate Lidstone and Laplace taggers
lidstone_accuracies = evaluate_taggers({
    "Lidstone Unigram": lidstone_taggers[0],
    "Lidstone Bigram": lidstone_taggers[1],
    "Lidstone Trigram": lidstone_taggers[2]
})

laplace_accuracies = evaluate_taggers({
    "Laplace Unigram": laplace_taggers[0],
    "Laplace Bigram": laplace_taggers[1],
    "Laplace Trigram": laplace_taggers[2]
})

# Print accuracies
print("Lidstone Tagger Accuracies:")
for name, accuracy in lidstone_accuracies.items():
    print(f"{name} Accuracy: {accuracy:.4f}")

print("\nLaplace Tagger Accuracies:")
for name, accuracy in laplace_accuracies.items():
    print(f"{name} Accuracy: {accuracy:.4f}")


Lidstone Tagger Accuracies:
Lidstone Unigram Accuracy: 0.8572
Lidstone Bigram Accuracy: 0.8648
Lidstone Trigram Accuracy: 0.8647

Laplace Tagger Accuracies:
Laplace Unigram Accuracy: 0.8572
Laplace Bigram Accuracy: 0.8648
Laplace Trigram Accuracy: 0.8647
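
One way to let the nltk.probability estimators drive the tagging decision itself is to score each candidate tag by a smoothed P(tag | previous tag) times a smoothed P(word | tag). The sketch below does this with LidstoneProbDist and a greedy left-to-right choice (not a full Viterbi search). The function names, the "<START>" marker, the gamma value, and the toy training data are all assumptions made for illustration, and the sketch has not been evaluated on the treebank split above.

```python
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, LidstoneProbDist

START = "<START>"  # hypothetical sentence-start marker


def train_smoothed(tagged_sents, gamma=0.1):
    """Estimate Lidstone-smoothed P(tag | previous tag) and P(word | tag)."""
    tags = sorted({tag for sent in tagged_sents for _, tag in sent})
    vocab = {word for sent in tagged_sents for word, _ in sent}
    transitions = ConditionalFreqDist()
    emissions = ConditionalFreqDist()
    for context in [START] + tags:
        transitions[context]  # ensure every context has a (possibly empty) distribution
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    # bins = number of possible outcomes, so unseen events keep some probability
    trans_pd = ConditionalProbDist(transitions, LidstoneProbDist, gamma, bins=len(tags))
    emit_pd = ConditionalProbDist(emissions, LidstoneProbDist, gamma, bins=len(vocab) + 1)
    return tags, trans_pd, emit_pd


def greedy_tag(words, model):
    """Pick, word by word, the tag with the highest smoothed score."""
    tags, trans_pd, emit_pd = model
    tagged, prev = [], START
    for word in words:
        best = max(tags, key=lambda t: trans_pd[prev].prob(t) * emit_pd[t].prob(word))
        tagged.append((word, best))
        prev = best
    return tagged


# Toy training data, invented for illustration (Brown-style tags)
train = [
    [("I", "PPSS"), ("like", "VB"), ("to", "TO"), ("swim", "VB")],
    [("we", "PPSS"), ("want", "VB"), ("to", "TO"), ("sing", "VB")],
    [("I", "PPSS"), ("sit", "VB"), ("on", "IN"), ("Kim's", "NP$"), ("boat", "NN")],
    [("we", "PPSS"), ("ride", "VB"), ("in", "IN"), ("Lee's", "NP$"), ("car", "NN")],
    [("I", "PPSS"), ("hope", "VB"), ("to", "TO"), ("win", "VB")],
]

model = train_smoothed(train)
tagged = greedy_tag("I like to blog on Kim's blog".split(), model)
print(tagged)
```

Because every (tag, word) and (previous tag, tag) pair receives non-zero probability, an unseen word such as blog is still scored in context rather than handed to a fixed default tag. Swapping LidstoneProbDist with gamma for LaplaceProbDist (and dropping the gamma argument) would give the Laplace variant.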
