#### Name : Sai Kumar Gandham
#### Student ID : IG45378

#### 11. Learn about the affix tagger (type **help(nltk.AffixTagger)**). Train an affix tagger and run it on some new text. Experiment with different settings for the affix length and the minimum word length. Discuss your findings.

In [1]:
import nltk
from nltk.tag import AffixTagger
from nltk.corpus import brown

help(nltk.AffixTagger)

Help on class AffixTagger in module nltk.tag.sequential:

class AffixTagger(ContextTagger)
 |  AffixTagger(train=None, model=None, affix_length=-3, min_stem_length=2, backoff=None, cutoff=0, verbose=False)
 |  
 |  A tagger that chooses a token's tag based on a leading or trailing
 |  substring of its word string.  (It is important to note that these
 |  substrings are not necessarily "true" morphological affixes).  In
 |  particular, a fixed-length substring of the word is looked up in a
 |  table, and the corresponding tag is returned.  Affix taggers are
 |  typically constructed by training them on a tagged corpus.
 |  
 |  Construct a new affix tagger.
 |  
 |  :param affix_length: The length of the affixes that should be
 |      considered during training and tagging.  Use negative
 |      numbers for suffixes.
 |  :param min_stem_length: Any words whose length is less than
 |      min_stem_length+abs(affix_length) will be assigned a
 |      tag of None by this tagger.
 |  
 |  Me

In [2]:
# Here we are training the AffixTagger
train_sents = brown.tagged_sents(categories='news')[:500]  # Using a portion of the Brown corpus for training
affix_tagger = AffixTagger(train_sents, affix_length=-3, min_stem_length=2)

# Example of running the tagger on a different text
test_text = "She sells seashells by the seashore."
tagged_text = affix_tagger.tag(nltk.word_tokenize(test_text))
print(tagged_text)

[('She', None), ('sells', 'NNS'), ('seashells', 'NNS'), ('by', None), ('the', None), ('seashore', 'IN'), ('.', None)]


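- "She" has no tag because it is only 3 characters long. With `affix_length=-3` and `min_stem_length=2`, any word shorter than 2 + 3 = 5 characters is assigned `None`.
- "sells" and "seashells" are both tagged as plural nouns (NNS) because they share the suffix "lls". This is right for "seashells" but wrong for "sells", which is a verb in this sentence.
- "by" and "the" have no tags for the same reason as "She": both are shorter than 5 characters.
- "seashore" is mistakenly tagged as a preposition (IN). The tagger only sees the suffix "ore", which was presumably most often tagged IN in the training data (for example in "before").
- "." has no tag because it is a single character, again below the 5-character minimum.

Overall, the AffixTagger tagged only one of the seven tokens correctly ("seashells"). It cannot tag short words at all with these settings, and it confuses words that share a three-letter suffix but belong to different parts of speech. A shorter affix or a smaller `min_stem_length` lowers the length threshold so that more words receive a tag, but a shorter suffix also carries less information about the part of speech.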
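
The effect of the two settings can be explored without the corpus. The following is a minimal pure-Python sketch of suffix lookup, assuming the length rule stated in the help text above. It is a simplified stand-in, not NLTK's implementation: `train_suffix_table`, `tag_with_suffix`, and the tiny `toy_training` list are hypothetical names and data made up for illustration.

```python
from collections import Counter, defaultdict

def train_suffix_table(tagged_words, affix_length=-3, min_stem_length=2):
    """Map each suffix to its most frequent tag (simplified stand-in, not NLTK code)."""
    min_word_length = min_stem_length + abs(affix_length)
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        # Words below the length threshold contribute nothing to the table
        if len(word) >= min_word_length:
            counts[word[affix_length:]][tag] += 1
    return {suffix: tags.most_common(1)[0][0] for suffix, tags in counts.items()}

def tag_with_suffix(words, table, affix_length=-3, min_stem_length=2):
    """Tag each word by looking up its suffix; short or unseen words get None."""
    min_word_length = min_stem_length + abs(affix_length)
    tagged = []
    for word in words:
        if len(word) < min_word_length:
            tagged.append((word, None))
        else:
            tagged.append((word, table.get(word[affix_length:])))
    return tagged

# Hypothetical training data, chosen only to illustrate the mechanism
toy_training = [("shells", "NNS"), ("bells", "NNS"), ("sells", "VBZ"),
                ("the", "AT"), ("by", "IN"), ("she", "PPS"), ("shore", "NN")]
sentence = ["She", "sells", "seashells", "by", "the", "seashore", "."]

# Compare a few (affix_length, min_stem_length) settings on the same sentence
for affix_length, min_stem_length in [(-3, 2), (-2, 1), (-1, 0)]:
    table = train_suffix_table(toy_training, affix_length, min_stem_length)
    print(affix_length, min_stem_length,
          tag_with_suffix(sentence, table, affix_length, min_stem_length))
```

With NLTK itself, the same comparison could be made by constructing `AffixTagger(train_sents, affix_length=..., min_stem_length=...)` for each setting and scoring each tagger on held-out tagged sentences with the tagger's accuracy method (`evaluate` in older NLTK releases, `accuracy` in newer ones).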

#### 14. Use **sorted()** and **set()** to get a sorted list of tags used in the Brown corpus, removing duplicates.

In [3]:
# Get all tagged words from the Brown Corpus
tagged_words = brown.tagged_words()

# Extract tags and convert them to a set to remove duplicates
tags_set = set(tag for word, tag in tagged_words)

# Sort the set of tags
sorted_tags = sorted(tags_set)

print(sorted_tags)

["'", "''", '(', '(-HL', ')', ')-HL', '*', '*-HL', '*-NC', '*-TL', ',', ',-HL', ',-NC', ',-TL', '--', '---HL', '.', '.-HL', '.-NC', '.-TL', ':', ':-HL', ':-TL', 'ABL', 'ABN', 'ABN-HL', 'ABN-NC', 'ABN-TL', 'ABX', 'AP', 'AP$', 'AP+AP-NC', 'AP-HL', 'AP-NC', 'AP-TL', 'AT', 'AT-HL', 'AT-NC', 'AT-TL', 'AT-TL-HL', 'BE', 'BE-HL', 'BE-TL', 'BED', 'BED*', 'BED-NC', 'BEDZ', 'BEDZ*', 'BEDZ-HL', 'BEDZ-NC', 'BEG', 'BEM', 'BEM*', 'BEM-NC', 'BEN', 'BEN-TL', 'BER', 'BER*', 'BER*-NC', 'BER-HL', 'BER-NC', 'BER-TL', 'BEZ', 'BEZ*', 'BEZ-HL', 'BEZ-NC', 'BEZ-TL', 'CC', 'CC-HL', 'CC-NC', 'CC-TL', 'CC-TL-HL', 'CD', 'CD$', 'CD-HL', 'CD-NC', 'CD-TL', 'CD-TL-HL', 'CS', 'CS-HL', 'CS-NC', 'CS-TL', 'DO', 'DO*', 'DO*-HL', 'DO+PPSS', 'DO-HL', 'DO-NC', 'DO-TL', 'DOD', 'DOD*', 'DOD*-TL', 'DOD-NC', 'DOZ', 'DOZ*', 'DOZ*-TL', 'DOZ-HL', 'DOZ-TL', 'DT', 'DT$', 'DT+BEZ', 'DT+BEZ-NC', 'DT+MD', 'DT-HL', 'DT-NC', 'DT-TL', 'DTI', 'DTI-HL', 'DTI-TL', 'DTS', 'DTS+BEZ', 'DTS-HL', 'DTX', 'EX', 'EX+BEZ', 'EX+HVD', 'EX+HVZ', 'EX+MD', '

#### 15. Write programs to process the Brown Corpus and find answers to the following questions:
a. Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the **-s** suffix.)


In [4]:
from collections import defaultdict

In [5]:
# Get tagged words from the Brown Corpus
brown_tagged = brown.tagged_words()

# Build a ConditionalFreqDist that maps each word to the frequencies of its tags
cfd = nltk.ConditionalFreqDist(brown_tagged)

# Set of words whose regular plural (word + 's', tagged NNS) outnumbers the singular (tagged NN)
common_plural_nouns = set()

# Iterate over each distinct word form in the Brown Corpus
for word in set(brown.words()):

    # cfd[word]['NN'] is 0 for any word never tagged NN, so non-noun forms
    # can also pass this test whenever word + 's' occurs as a plural noun
    if cfd[word+'s']['NNS'] > cfd[word]['NN']:
        common_plural_nouns.add(word)

print("Nouns more common in plural form:", common_plural_nouns)


Nouns more common in plural form: {'Camera', 'Investor', 'cartoon', 'turnpike', 'metaphysical', 'defender', 'duct', 'ailment', 'nude', 'working', 'scholar', 'Idea', 'psychiatrist', 'Teacher', 'alcoholic', 'publisher', 'intermediate', 'omission', '1950', 'carrier', 'banker', 'urging', 'riddle', 'Demon', 'hit', 'referral', 'obligation', 'batten', 'polymer', 'Superintendent', 'Lot', 'singer', 'finding', 'symptom', 'geologist', 'parasite', 'movie', '45-degree', 'consonant', 'lip', 'Loan', 'conservative', 'compulsive', 'runner', 'up', 'four', 'lark', 'ballistic', 'tranquilizer', 'cardinal', 'swell', 'Corporation', 'headline', 'fee', 'guest', 'Girl', 'lashing', 'aberration', 'intangible', 'neighbor', 'sensitive', 'dropping', 'grouping', 'rule', 'nuance', 'believer', 'plastic', 'no', 'bronchiole', 'Song', 'regulation', 'norm', 'aborigine', 'adviser', 'modification', 'implication', 'lobule', 'cop', 'painting', 'franc', '2-year-old', 'sneaker', 'official', 'bound', 'cleaner', 'fathom', 'Farmer'
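
The printed set contains entries such as 'up', 'no', and '1950' as well as capitalized variants like 'Teacher', because the comparison also succeeds when the singular count is low or zero and because case is not normalized. A stricter variant, sketched below under the assumption that a noun should be attested at least once with the tag NN, lowercases the words and only reports such nouns. The function name `plural_dominant_nouns` and the toy data are hypothetical; with the corpus, the input would be `brown.tagged_words()`.

```python
from collections import Counter

def plural_dominant_nouns(tagged_words):
    """Return {noun: (singular_count, plural_count)} for nouns whose regular
    -s plural (tag NNS) is more frequent than the singular (tag NN)."""
    singular = Counter()
    plural = Counter()
    for word, tag in tagged_words:
        word = word.lower()  # merge 'Teacher' and 'teacher'
        if tag == "NN":
            singular[word] += 1
        elif tag == "NNS" and word.endswith("s"):
            plural[word] += 1

    result = {}
    # Only consider words seen at least once as a singular noun
    for noun, n_singular in singular.items():
        n_plural = plural[noun + "s"]
        if n_plural > n_singular:
            result[noun] = (n_singular, n_plural)
    return result

# Toy (word, tag) pairs standing in for brown.tagged_words()
toy_tagged = [
    ("symptom", "NN"), ("symptoms", "NNS"), ("symptoms", "NNS"),
    ("Rule", "NN"), ("rules", "NNS"), ("rules", "NNS"),
    ("up", "RP"), ("ups", "NNS"),
    ("cat", "NN"), ("cat", "NN"), ("cats", "NNS"),
]
print(plural_dominant_nouns(toy_tagged))
```

Requiring an attested singular is a design choice: it drops nouns that never appear in the singular, so it trades some recall for a cleaner list.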

b. Which word has the greatest number of distinct tags? What are they, and what do they represent?

In [6]:
word_tags = defaultdict(set)
for word, tag in brown.tagged_words():
    word_tags[word].add(tag)

# max() returns the (word, set_of_tags) pair whose tag set is largest
word_with_most_tags, tags = max(word_tags.items(), key=lambda x: len(x[1]))
print("Word with the greatest number of distinct tags:", word_with_most_tags)
print("Tags for the word:", tags)

Word with the greatest number of distinct tags: that
Tags for the word: {'CS-NC', 'WPS-HL', 'WPS', 'NIL', 'CS', 'DT', 'WPO', 'WPS-NC', 'WPO-NC', 'CS-HL', 'DT-NC', 'QL'}


The word with the greatest number of distinct tags in the Brown Corpus is "that", with 12 tags. In the Brown tagset, the suffix -HL marks a word that occurs in a headline and -NC marks a cited word, so several of the 12 tags are variants of the same base tag. Here are the tags and what they represent:

1. NIL: No tag was assigned to this token in the corpus.
2. QL: Qualifier, a word that modifies the degree of an adjective or adverb (e.g., "not that big").
3. WPS-NC: Nominative wh-pronoun, as a cited word.
4. WPO-NC: Objective wh-pronoun, as a cited word.
5. CS-NC: Subordinating conjunction, as a cited word.
6. WPO: Objective wh-pronoun, the object of a relative clause (e.g., "the book that I read").
7. WPS-HL: Nominative wh-pronoun, in a headline.
8. CS-HL: Subordinating conjunction, in a headline.
9. DT-NC: Singular determiner, as a cited word.
10. DT: Singular determiner (e.g., "that book"). Note that "the" is an article (AT) in the Brown tagset, not a DT.
11. WPS: Nominative wh-pronoun, the subject of a relative clause (e.g., "the dog that barked").
12. CS: Subordinating conjunction (e.g., "He said that it was late").

Ignoring the -HL and -NC variants, "that" therefore has six base tags (CS, DT, QL, WPO, WPS, and NIL), which reflect its roles as a conjunction, a determiner, a qualifier, and a relative pronoun.
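
Several of the twelve tags differ only by a trailing modifier. As a rough sketch, assuming the Brown suffix modifiers -HL (headline), -TL (title), and -NC (cited word), the helper below strips those suffixes to recover the base tags. `split_brown_tag` is a hypothetical helper written for this illustration, not part of NLTK.

```python
SUFFIX_MODIFIERS = ("-HL", "-TL", "-NC")

def split_brown_tag(tag):
    """Split a Brown tag into (base_tag, [modifiers]), e.g. 'AT-TL-HL' -> ('AT', ['TL', 'HL'])."""
    modifiers = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIX_MODIFIERS:
            # The length check keeps a bare suffix from being reduced to an empty tag
            if tag.endswith(suffix) and len(tag) > len(suffix):
                modifiers.insert(0, suffix[1:])
                tag = tag[:-len(suffix)]
                stripped = True
    return tag, modifiers

# The tag set printed for "that" in the cell above
that_tags = {'CS-NC', 'WPS-HL', 'WPS', 'NIL', 'CS', 'DT', 'WPO',
             'WPS-NC', 'WPO-NC', 'CS-HL', 'DT-NC', 'QL'}

base_tags = sorted({split_brown_tag(t)[0] for t in that_tags})
print(base_tags)
```

Stripping from the end rather than splitting on every hyphen keeps punctuation tags such as `--` and `---HL` intact.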

c. List tags in order of decreasing frequency. What do the 20 most frequent tags represent?


In [7]:
tag_freq = nltk.FreqDist(tag for _, tag in brown.tagged_words())
top_20_tags = tag_freq.most_common(20)
print("Top 20 most frequent tags:")
for tag, freq in top_20_tags:
    print(tag, ":", freq)

Top 20 most frequent tags:
NN : 152470
IN : 120557
AT : 97959
JJ : 64028
. : 60638
, : 58156
NNS : 55110
CC : 37718
RB : 36464
NP : 34476
VB : 33693
VBN : 29186
VBD : 26167
CS : 22143
PPS : 18253
VBG : 17893
PP$ : 16872
TO : 14918
PPSS : 13802
CD : 13510


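These tags label the part of speech of each word in a sentence. The 20 most frequent tags in the Brown Corpus represent:

1. NN: Singular or mass noun
2. IN: Preposition
3. AT: Article (e.g., "the", "a")
4. JJ: Adjective
5. .: Sentence-final punctuation
6. ,: Comma
7. NNS: Plural noun
8. CC: Coordinating conjunction (e.g., "and", "or")
9. RB: Adverb
10. NP: Proper noun
11. VB: Verb, base form
12. VBN: Verb, past participle
13. VBD: Verb, past tense
14. CS: Subordinating conjunction (e.g., "that", "if")
15. PPS: Third person singular nominative pronoun (e.g., "he", "she", "it")
16. VBG: Verb, present participle or gerund
17. PP$: Possessive personal pronoun (e.g., "my", "his")
18. TO: Infinitive marker "to"
19. PPSS: Other nominative personal pronoun (e.g., "I", "we", "they")
20. CD: Cardinal numeral

Nouns, prepositions, articles, and adjectives dominate the list, followed by punctuation and the main verb forms. Labels like these let programs identify parts of speech and analyze the structure of a sentence.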

d. Which tags are nouns most commonly found after? What do these tags represent?

In [8]:
noun_followed_by_tags = defaultdict(int)
prev_tag = None

for _, tag in brown.tagged_words():
    if prev_tag and prev_tag.startswith('NN'):
        noun_followed_by_tags[tag] += 1
    prev_tag = tag

# Here we are getting the top 10 most common tags found after nouns
top_10_after_noun = sorted(noun_followed_by_tags.items(), key=lambda x: x[1], reverse=True)[:10]

print("Top 10 tags most commonly found after nouns:")
for tag, count in top_10_after_noun:
    print(tag, ":", count)

Top 10 tags most commonly found after nouns:
IN : 57873
. : 28988
, : 27676
CC : 13811
NN : 13774
NNS : 6795
VBD : 5210
CS : 4521
MD : 4291
BEZ : 4281


These are the top 10 tags that most commonly come immediately after a noun in the Brown Corpus. Note that the loop counts the tag that follows each noun; read literally, the question asks for the tags that nouns themselves follow, which would require counting `prev_tag` whenever the current tag is a noun.

1. IN: Preposition
2. .: Sentence-final punctuation (period)
3. ,: Comma
4. CC: Coordinating conjunction
5. NN: Singular noun
6. NNS: Plural noun
7. VBD: Past tense verb
8. CS: Subordinating conjunction
9. MD: Modal auxiliary (e.g., "can", "will")
10. BEZ: Third person singular present tense of "be" ("is")

These tags show that a noun is most often followed by a preposition, by punctuation, or by another noun, and less often directly by a verb.
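
The loop above counts the tag that follows each noun. The other reading of the question, which tags most often precede a noun, can be sketched as follows. This is an illustrative variant, not the original solution: `tags_before_nouns` and the toy sentences are hypothetical, and working sentence by sentence (for example over `brown.tagged_sents()`) is an assumption made so that pairs do not span sentence boundaries.

```python
from collections import Counter

def tags_before_nouns(tagged_sents, noun_prefix="NN"):
    """Count the tag of the word immediately before each noun.
    A noun here is any tag starting with noun_prefix (NN, NNS, NN$, ...)."""
    counts = Counter()
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        # Pair each tag with its successor within the same sentence
        for prev_tag, tag in zip(tags, tags[1:]):
            if tag.startswith(noun_prefix):
                counts[prev_tag] += 1
    return counts

# Toy tagged sentences standing in for brown.tagged_sents()
toy_sents = [
    [("the", "AT"), ("old", "JJ"), ("dog", "NN"), ("barked", "VBD"), (".", ".")],
    [("a", "AT"), ("cat", "NN"), ("saw", "VBD"), ("the", "AT"), ("birds", "NNS"), (".", ".")],
]

for tag, count in tags_before_nouns(toy_sents).most_common(10):
    print(tag, ":", count)
```

As in the original loop, the `NN` prefix does not match proper nouns (NP); passing a different prefix or extending the test would be needed to include them.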