# UNIT 3: Learning to Classify Text

Detecting patterns is a central part of Natural Language Processing. Words ending in **-ed** tend to be past tense
verbs. Frequent use of will is indicative of news text. These observable patterns — word structure and
word frequency — happen to correlate with particular aspects of meaning, such as tense and topic. But how
did we know where to start looking, which aspects of form to associate with which aspects of meaning?
The goal of this chapter is to answer the following questions:

1. How can we identify particular features of language data that are salient for classifying it?

2. How can we construct models of language that can be used to perform language processing tasks
automatically?

3. What can we learn about language from these models?

Along the way we will study some important machine learning techniques. We will gloss over the mathematical and statistical
underpinnings of these techniques, focusing instead on how and when to use them.

# 1 Supervised Classification

Classification is the task of choosing the correct class label for a given input. In basic classification tasks,
each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some
examples of classification tasks are:

- Deciding whether an email is spam or not.
- Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
- Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

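The basic classification task has a number of interesting variants. For example, in multi-class classification, each instance may be assigned multiple labels (a setup more often called multi-label classification today); in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is jointly classified.

A classifier is called supervised if it is built based on training corpora containing the correct label for each input. In outline, a feature extractor converts each input into a feature set. During training, pairs of feature sets and labels are fed to a learning algorithm to produce a model; during prediction, the same feature extractor is applied to unseen inputs and the model predicts their labels.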

In the rest of this section, we will look at how classifiers can be employed to solve a wide variety of tasks.
Our discussion is not intended to be comprehensive, but to give a representative sample of tasks that can be
performed with the help of text classifiers.


# 1.1 Gender Identification

**Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male.**

Let's build a classifier to model these differences more precisely.

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode
those features. For this example, we'll start by just looking at the final letter of a given name.

The following
feature extractor function builds a dictionary containing relevant information about a given name:

In [1]:
def gender_features(word):
    # Return a feature set with a single feature: the key is 'last_letter'
    # and the value is the last letter of the input word (word[-1]).
    return {'last_letter': word[-1]}

gender_features('Shrek')

{'last_letter': 'k'}

The returned dictionary, known as a feature set, maps from feature names to their values. Feature names are
case-sensitive strings that typically provide a short human-readable description of the feature, as in the
example 'last_letter'. Feature values are values with simple types, such as booleans, numbers, and strings.
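As a minimal sketch of the point about value types, the hypothetical extractor below (not part of the original example) returns a feature set that mixes a string, a number, and a boolean. The feature names `length` and `starts_with_vowel` are assumptions chosen for illustration.

```python
def example_features(word):
    """Hypothetical extractor illustrating string, numeric, and boolean feature values."""
    return {
        'last_letter': word[-1],                          # string value
        'length': len(word),                              # numeric value
        'starts_with_vowel': word[0].lower() in 'aeiou',  # boolean value
    }

features = example_features('Shrek')
print(features)
```

Because feature names are case-sensitive, `'Last_letter'` would be a different feature from `'last_letter'`.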

Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class
labels.

Thus, the following code creates a list of labeled names by pairing each name from the 'male.txt' file with the label 'male' and each name from the 'female.txt' file with the label 'female'. The resulting list is then shuffled randomly.


In [3]:
import random  # provides shuffle(), used below to randomize the example order

from nltk.corpus import names

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

If the names corpus has not been installed, this cell raises a `LookupError` reporting that the resource `names` was not found (NLTK searches locations such as `/root/nltk_data` and `/usr/share/nltk_data`). In that case, use the NLTK Downloader to obtain the resource by running `import nltk` followed by `nltk.download('names')`, then rerun the cell. For more information see: https://www.nltk.org/data.html


Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a
training set and a test set. The training set is used to train a new "naive Bayes" classifier.


In [None]:
import nltk
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)


We will learn more about the naive Bayes classifier later in the chapter. For now, let's just test it out on some
names that did not appear in its training data:

In [None]:
classifier.classify(gender_features('Neo'))




In [None]:
classifier.classify(gender_features('Trinity'))

Observe that these character names from The Matrix are correctly classified. Although this science fiction
movie is set in 2199, it still conforms with our expectations about names and genders. We can systematically
evaluate the classifier on a much larger quantity of unseen data:

In [None]:
print(nltk.classify.accuracy(classifier, test_set))
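Accuracy here is simply the fraction of test items whose predicted label matches the correct label. The sketch below spells that out in plain Python; `toy_classify` and `toy_test_set` are made-up stand-ins, not the trained classifier or the names corpus.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (featureset, label) pairs for which classify(featureset) == label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

def toy_classify(features):
    # Hypothetical rule: names ending in 'a' are guessed female, everything else male.
    return 'female' if features['last_letter'] == 'a' else 'male'

toy_test_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'e'}, 'female'),   # the toy rule gets this one wrong
    ({'last_letter': 'o'}, 'male'),
]
toy_accuracy = accuracy(toy_classify, toy_test_set)
print(toy_accuracy)
```

With three of the four toy items labeled correctly, the result is 0.75.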


Finally, we can examine the classifier to determine which features it found most effective for distinguishing
the names' genders:


In [None]:
classifier.show_most_informative_features(10)

NOTE:

In this example, the classifier makes predictions based on the last letter of each name. The output lists the most informative features: for each one, it shows the last letter, the two labels being compared, and the likelihood ratio between them. In one run, for instance, the last letter 'a' was 35.7 times more likely to occur with the label 'female' than with 'male', and the last letter 'k' was 31.2 times more likely to occur with 'male' than with 'female'. The exact figures vary from run to run because the names are shuffled randomly before they are split.

In other words, the listing shows that names in the training set that end in "a" are female more than 30 times as often as they are male, while names that end in "k" are male more than 30 times as often as they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
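To make the idea of a likelihood ratio concrete, here is a small sketch that computes one from raw counts. The counts are invented for illustration, and the calculation is unsmoothed, whereas NLTK's naive Bayes classifier smooths its probability estimates, so the figures it reports will differ.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """P(feature | label A) / P(feature | label B), estimated from raw counts."""
    return (count_a / total_a) / (count_b / total_b)

# Hypothetical counts: how many names with each label end in 'a'.
female_ending_a, female_total = 1800, 5000
male_ending_a, male_total = 30, 3000

ratio = likelihood_ratio(female_ending_a, female_total, male_ending_a, male_total)
print(round(ratio, 1))
```

With these made-up counts, the ending "a" occurs in 36% of female names and 1% of male names, giving a ratio of about 36 to 1.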

When working with large corpora, constructing a single list that contains the features of every instance can
use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which
returns an object that acts like a list but does not store all the feature sets in memory:

In [None]:
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])
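The idea behind such a lazy list can be sketched in a few lines. The class below is a simplified, hypothetical stand-in written for illustration; it is not NLTK's implementation, and it only supports `len()`, integer indexing, and iteration.

```python
class LazyFeatureList:
    """Hypothetical list-like view that computes feature sets on demand."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._items = labeled_tokens       # only the raw (token, label) pairs are stored

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        token, label = self._items[index]
        return (self._func(token), label)  # the feature set is built only when requested

def gender_features(word):
    return {'last_letter': word[-1]}

toy_names = [('Neo', 'male'), ('Trinity', 'female')]  # made-up sample data
lazy_set = LazyFeatureList(gender_features, toy_names)
print(len(lazy_set), lazy_set[1])
```

The trade-off is that each feature set is recomputed every time it is accessed, which exchanges extra computation for lower memory use.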


# 1.2 Choosing The Right Features
Selecting relevant features and deciding how to encode them for a learning method can have an enormous
impact on the learning method's ability to extract a good model. Much of the interesting work in building a
classifier is deciding what features might be relevant, and how we can represent them. Although it's often
possible to get decent performance by using a fairly simple and obvious set of features, there are usually
significant gains to be had by using carefully constructed features based on a thorough understanding of the
task at hand.

Typically, feature extractors are built through a process of trial-and-error, guided by intuitions about what
information is relevant to the problem. It's common to start with a "kitchen sink" approach, including all the
features that you can think of, and then checking to see which features actually are helpful.

In [None]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [None]:
gender_features2('Suhana')

However, there are usually limits to the number of features that you should use with a given learning algorithm. If you provide too many features, the algorithm is more likely to rely on idiosyncrasies (quirks) of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets.

For example, if we train a naive Bayes classifier using the `gender_features2` extractor shown above, it will overfit the relatively small training set, resulting in a system whose accuracy is about 1% lower than the accuracy of a classifier that only pays attention to the final letter of each name:

In [None]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))


Once an initial set of features has been chosen, a very productive method for refining the feature set is **error
analysis**. First, we select a development set, containing the corpus data for creating the model. This
development set is then subdivided into the training set and the dev-test set.

In [None]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set
serves in our final evaluation of the system. For reasons discussed below, it is important that we employ a
separate dev-test set for error analysis, rather than just using the test set.
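A three-way split like this can be wrapped in a small helper. The function name `split_corpus` and its parameters are assumptions made for this sketch; it relies on the data having been shuffled beforehand, as `labeled_names` was earlier.

```python
def split_corpus(items, test_size, devtest_size):
    """Split a pre-shuffled list into (train, devtest, test) without overlap."""
    test = items[:test_size]
    devtest = items[test_size:test_size + devtest_size]
    train = items[test_size + devtest_size:]
    return train, devtest, test

toy_items = list(range(20))   # stand-in for a shuffled list of labeled names
train, devtest, test = split_corpus(toy_items, test_size=5, devtest_size=5)
print(len(train), len(devtest), len(test))
```

Calling `split_corpus(labeled_names, 500, 1000)` would reproduce the slices used in the cell above.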

![Untitled.png](attachment:Untitled.png)

Having divided the corpus into appropriate datasets, we train a model using the training set, and then run it
on the dev-test set.

In [None]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name
genders:


In [None]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )
print(errors)

We can then examine individual error cases where the model predicted the wrong label, and try to determine
what additional pieces of information would allow it to make the right decision (or which existing pieces of
information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly.
The names classifier that we have built makes roughly 250 errors on the 1,000-name dev-test corpus (about one name in four, given a dev-test accuracy near 75%):


In [None]:
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))


**Detailed Explanation:**

For each error tuple, the loop prints one line using `print()` and the format string `'correct={:<8} guess={:<8s} name={:<30}'`:

- `{:<8}`: left-aligns the first value (the correct label, `tag`) in a field 8 characters wide.
- `{:<8s}`: left-aligns the second value (the predicted label, `guess`) in a field 8 characters wide; the `s` marks it as a string.
- `{:<30}`: left-aligns the third value (`name`) in a field 30 characters wide.

The `format` method substitutes the values from each error tuple into these placeholders. Because `errors` is sorted first, the output is grouped by correct label, then by guess, then by name, and the fixed-width columns make it easy to scan for patterns in the classifier's mistakes.
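One way to spot patterns in such a list is to count how often each two-letter suffix occurs among the misclassified names. The sketch below does this with `collections.Counter`; the `toy_errors` list is invented for illustration and uses the same `(tag, guess, name)` layout as `errors`.

```python
from collections import Counter

# Hypothetical misclassifications in the (correct tag, guess, name) format.
toy_errors = [
    ('female', 'male', 'Carolyn'),
    ('female', 'male', 'Jocelyn'),
    ('female', 'male', 'Kathryn'),
    ('male', 'female', 'Mitch'),
    ('male', 'female', 'Rich'),
]

# Count (suffix, correct tag) pairs so that frequent error patterns stand out.
suffix_counts = Counter((name[-2:].lower(), tag) for tag, guess, name in toy_errors)
for (suffix, tag), count in suffix_counts.most_common():
    print('suffix={:<4} correct={:<8} errors={}'.format(suffix, tag, count))
```

A suffix that appears many times with the same correct label is a candidate for a new feature.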

Looking through this list of errors makes it clear that some suffixes that are more than one letter can be
indicative of name genders. For example, names ending in **yn appear to be predominantly female, despite the
fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that
end in h tend to be female**. We therefore adjust our feature extractor to include features for two-letter
suffixes:


In [None]:
def gender_features(word):
    # Redefines gender_features to use one- and two-letter suffixes.
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}


**Rebuilding the classifier with the new feature extractor, we see that the performance on the dev-test dataset
improves by 2 percentage points (from 75.2% to 77.2%):**

In [None]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

This error analysis procedure can then be repeated, checking for patterns in the errors that are made by the
newly improved classifier. Each time the error analysis procedure is repeated, we should select a different
dev-test/training split, to ensure that the classifier does not start to reflect idiosyncrasies in the dev-test set.
But once we've used the dev-test set to help us develop the model, we can no longer trust that it will give us
an accurate idea of how well the model would perform on new data. It is therefore important to keep the test
set separate, and unused, until our model development is complete. At that point, we can use the test set to
evaluate how well our model will perform on new input values.


# 1.3 Document Classification

Some corpora label each document with a category. Using such corpora, we can build classifiers that will
automatically tag new documents with appropriate category labels.


First, we construct a list of documents, labeled with the appropriate categories. For this example, we've
chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

In [None]:
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)


Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it
should pay attention to. For document topic identification, we can define a feature for each word,
indicating whether the document contains that word. To limit the number of features that the classifier needs
to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus. We can
then define a feature extractor that simply checks whether each of these words is present in a given
document.


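In [None]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# most_common(2000) returns (word, count) pairs sorted by frequency;
# iterating over the FreqDist directly does not guarantee frequency order.
word_features = [word for (word, count) in all_words.most_common(2000)]
def document_features(document):
    document_words = set(document)  # set lookup is faster than list lookup
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features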

In [None]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
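The same recipe can be tried on a toy corpus without downloading any data. In this sketch, `toy_docs` and the vocabulary size of 3 are made up, and `collections.Counter` stands in for `nltk.FreqDist`, which offers the same `most_common` method.

```python
from collections import Counter

toy_docs = [
    (['a', 'great', 'fun', 'film'], 'pos'),
    (['a', 'dull', 'film'], 'neg'),
    (['great', 'cast', 'a', 'great', 'film'], 'pos'),
]

# Pick the most frequent words across all documents as the feature vocabulary.
word_counts = Counter(w.lower() for words, _ in toy_docs for w in words)
toy_word_features = [w for w, _ in word_counts.most_common(3)]

def toy_document_features(document):
    document_words = set(document)   # set membership is faster than list membership
    return {'contains({})'.format(w): (w in document_words) for w in toy_word_features}

print(toy_word_features)
print(toy_document_features(['a', 'dull', 'plot']))
```

Words outside the chosen vocabulary, such as "dull" and "plot" here, produce no feature at all, which is how the vocabulary limit keeps the feature set small.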

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews.
To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once
again, we can use show_most_informative_features() to find out which features the classifier found to be
most informative:

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
classifier.show_most_informative_features(5)

Apparently in this corpus, a review that mentions "Seagal" is almost 8 times more likely to be negative than
positive, while a review that mentions "Damon" is about 6 times more likely to be positive.


# 1.4 Part-of-Speech Tagging
We can train a classifier to work out which suffixes are most informative for choosing a word's part-of-speech tag.
Let's begin by finding out what the most common suffixes are:

In [None]:
import nltk
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [None]:
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

Next, we'll define a feature extractor function which checks a given word for these suffixes:


In [None]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

Feature extraction functions behave like tinted glasses, highlighting some of the properties (colors) in our
data and making it impossible to see other properties. The classifier will rely exclusively on these highlighted
properties when determining how to label inputs. In this case, the classifier will make its decisions based only
on information about which of the common suffixes (if any) a given word has:

In [None]:
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

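In [None]:
# Hold out the first 10% of the feature sets for testing and train on the rest.
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)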


In [None]:
classifier.classify(pos_features('good'))

One nice feature of decision tree models is that they are often fairly easy to interpret — we can even instruct
NLTK to print them out as pseudocode:


In [None]:
print(classifier.pseudocode(depth=4))

Here, we can see that the classifier begins by checking whether a word ends with a comma — if so, then it
will receive the special tag ",". Next, the classifier checks if the word ends in "the", in which case it's almost
certainly a determiner. This "suffix" gets used early by the decision tree because the word "the" is so
common. Continuing on, the classifier checks if the word ends in "s". If so, then it's most likely to receive the
verb tag VBZ (unless it's the word "is", which has a special tag BEZ), and if not, then it's most likely a noun
(unless it's the punctuation mark "."). The actual classifier contains further nested if-then statements below
the ones shown here, but the depth=4 argument just displays the top portion of the decision tree.
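Read as ordinary code, the top of the tree described above amounts to a chain of nested tests. The function below is a hand-written paraphrase of that description, not output from NLTK: the real tree is deeper, its exact branches depend on the training data, and the Brown tags `AT` (article) and `NN` (noun) are filled in here as assumed labels for "determiner" and "noun".

```python
def top_of_tree(word):
    """Hand-written paraphrase of the first few decisions described in the text."""
    w = word.lower()
    if w.endswith(','):
        return ','
    if w.endswith('the'):
        return 'AT'                           # Brown tag for the article "the"
    if w.endswith('s'):
        return 'BEZ' if w == 'is' else 'VBZ'  # "is" has its own tag
    return '.' if w == '.' else 'NN'

for sample in [',', 'the', 'is', 'runs', '.', 'dog']:
    print(sample, top_of_tree(sample))
```

Seeing the rules in this form also makes their weaknesses obvious: any plural noun ending in "s" would be tagged as a verb by these top-level tests alone.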


# 1.5 Exploiting Context

By augmenting the feature extraction function, we could modify this part-of-speech tagger to leverage a
variety of other word-internal features, such as the **length of the word, the number of syllables it contains, or
its prefix**. However, as long as the feature extractor just looks at the target word, we have no way to add
features that depend on the context that the word appears in. But contextual features often provide powerful
clues about the correct tag. For example, when tagging the word "fly," knowing that the previous word is
"a" will allow us to determine that it is functioning as a noun, not a verb.

In order to accommodate features that depend on a word's context, we must revise the pattern that we used to
define our feature extractor. Instead of just passing in the word to be tagged, we will pass in a complete
(untagged) sentence, along with the index of the target word. This approach is demonstrated below, using
a context-dependent feature extractor to define a part-of-speech tag classifier.


In [None]:
def pos_features(sentence, i):
    # Start with the suffix features of the target word
    features = {
        "suffix(1)": sentence[i][-1:],  # Last character of the word at index i
        "suffix(2)": sentence[i][-2:],  # Last two characters of the word at index i
        "suffix(3)": sentence[i][-3:]   # Last three characters of the word at index i
    }

    # Check if the current word is the first word in the sentence
    if i == 0:
        features["prev-word"] = "<START>"  # Special tag for the first word
    else:
        features["prev-word"] = sentence[i - 1]  # Previous word in the sentence

    # Return the generated features
    return features



In [None]:
pos_features(brown.sents()[0], 8)

{'suffix(3)': 'ion', 'prev-word': 'an', 'suffix(2)': 'on', 'suffix(1)': 'n'}

In [None]:
tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )

# Split, train, and evaluate once, after all feature sets have been collected.
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)


**Detailed Explanation**

The code is an implementation of a part-of-speech (POS) tagging model using the Brown Corpus from the Natural Language Toolkit (nltk) in Python. Let's break down the code step by step:

1. **pos_features(brown.sents()[0], 8)**:
   - This calls the `pos_features` function with the first sentence (`brown.sents()[0]`) of the Brown Corpus and the index `8`. The resulting features for the word at index 8 are:
     - 'suffix(1)': 'n' (last character)
     - 'suffix(2)': 'on' (last two characters)
     - 'suffix(3)': 'ion' (last three characters)
     - 'prev-word': 'an' (previous word)

2. **tagged_sents = brown.tagged_sents(categories='news')**:
   - This retrieves the tagged sentences from the Brown Corpus, specifically those in the 'news' category. Each sentence is a list of tuples where each tuple contains a word and its corresponding part-of-speech tag.

3. **featuresets = []**:
   - Initializes an empty list to store feature sets.

4. **for tagged_sent in tagged_sents:**
   - Iterates over each tagged sentence in the 'news' category.

5. **untagged_sent = nltk.tag.untag(tagged_sent)**:
   - Removes the part-of-speech tags, leaving only a list of words from the current sentence.

6. **for i, (word, tag) in enumerate(tagged_sent):**:
   - Iterates over each word-tag pair in the current tagged sentence.

7. **featuresets.append((pos_features(untagged_sent, i), tag))**:
   - Calls `pos_features` to extract features for the word at index `i` in the untagged sentence and appends a tuple `(features, tag)` to the `featuresets` list.

8. **size = int(len(featuresets) * 0.1)**:
   - Calculates 10% of the total number of feature sets.

9. **train_set, test_set = featuresets[size:], featuresets[:size]**:
   - Splits the featuresets into training and testing sets.

10. **classifier = nltk.NaiveBayesClassifier.train(train_set)**:
   - Trains a Naive Bayes classifier using the training set.

11. **nltk.classify.accuracy(classifier, test_set)**:
   - Evaluates the accuracy of the trained classifier on the test set.

Note: The split, training, and evaluation steps (steps 8 to 11) must run once, after the loop over the tagged sentences has finished. If they are indented inside the loop, the classifier is retrained and re-evaluated for every word, which is needlessly slow.

It is clear that exploiting contextual features improves the performance of our part-of-speech tagger. For
example, the classifier learns that a word is likely to be a noun if it comes immediately after the word "large"
or the word "gubernatorial". However, it is unable to learn the generalization that a word is probably a noun
if it follows an adjective, because it doesn't have access to the previous word's part-of-speech tag. In general,
simple classifiers always treat each input as independent from all other inputs. In many contexts, this makes
perfect sense. For example, decisions about whether names tend to be male or female can be made on a case-by-case basis. However, there are often cases, such as part-of-speech tagging, where we are interested in
solving classification problems that are closely related to one another.
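To see what the contextual feature contributes, the sketch below applies a copy of the context-dependent extractor to a made-up sentence in which "fly" appears twice. The sentence is invented for illustration, and `context_features` mirrors `pos_features(sentence, i)` from this section.

```python
def context_features(sentence, i):
    """Suffix features plus the previous word, as in pos_features(sentence, i)."""
    features = {
        "suffix(1)": sentence[i][-1:],
        "suffix(2)": sentence[i][-2:],
        "suffix(3)": sentence[i][-3:],
    }
    features["prev-word"] = "<START>" if i == 0 else sentence[i - 1]
    return features

toy_sentence = ['Birds', 'fly', 'past', 'a', 'fly']   # invented example
verb_use = context_features(toy_sentence, 1)
noun_use = context_features(toy_sentence, 4)
print(verb_use["prev-word"], noun_use["prev-word"])
```

The suffix features of the two occurrences are identical, so only the `prev-word` feature gives a classifier any basis for tagging them differently.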


# 1.6 Sequence Classification
In order to capture the dependencies between related classification tasks, we can use joint classifier models,
which choose an appropriate labeling for a collection of related inputs. **In the case of part-of-speech tagging,
a variety of different sequence classifier models can be used to jointly choose part-of-speech tags for all the
words in a given sentence.**

**One sequence classification strategy, known as consecutive classification or greedy sequence classification,
is to find the most likely class label for the first input, then to use that answer to help find the best label for
the next input. The process can then be repeated until all of the inputs have been labeled.**

In [None]:
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],"suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print(tagger.evaluate(test_sents))



**Detailed Explanation**

The provided code defines a part-of-speech (POS) tagger using a consecutive approach, where the prediction of the current tag is influenced by the previous word's tag. This approach is often used for sequence labeling tasks like POS tagging. Let's break down the code:

1. **pos_features function**:
   ```python
   def pos_features(sentence, i, history):
       features = {
           "suffix(1)": sentence[i][-1:],
           "suffix(2)": sentence[i][-2:],
           "suffix(3)": sentence[i][-3:]
       }
       if i == 0:
           features["prev-word"] = "<START>"
           features["prev-tag"] = "<START>"
       else:
           features["prev-word"] = sentence[i-1]
           features["prev-tag"] = history[i-1]
       return features
   ```
   - This function takes a sentence, an index `i`, and a history of tags as input.
   - It extracts features for the word at index `i` in the sentence, including the last three characters of the word, the previous word, and the previous tag.
   - If the word is the first word in the sentence, special tags `<START>` are used for both the previous word and tag.

2. **ConsecutivePosTagger class**:
   ```python
   class ConsecutivePosTagger(nltk.TaggerI):
       def __init__(self, train_sents):
           train_set = []
           for tagged_sent in train_sents:
               untagged_sent = nltk.tag.untag(tagged_sent)
               history = []
               for i, (word, tag) in enumerate(tagged_sent):
                   featureset = pos_features(untagged_sent, i, history)
                   train_set.append((featureset, tag))
                   history.append(tag)
           self.classifier = nltk.NaiveBayesClassifier.train(train_set)
       
       def tag(self, sentence):
           history = []
           for i, word in enumerate(sentence):
               featureset = pos_features(sentence, i, history)
               tag = self.classifier.classify(featureset)
               history.append(tag)
           return zip(sentence, history)
   ```
   - This class inherits from `nltk.TaggerI` and represents a consecutive POS tagger.
   - The `__init__` method trains the tagger using a training set. It iterates over tagged sentences, extracts features using `pos_features`, and builds a training set.
   - The `tag` method tags a given input sentence using the trained classifier. It maintains a history of predicted tags as it iterates through the words in the sentence.

3. **Testing the tagger**:
   ```python
   tagged_sents = brown.tagged_sents(categories='news')
   size = int(len(tagged_sents) * 0.1)
   train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
   tagger = ConsecutivePosTagger(train_sents)
   print(tagger.evaluate(test_sents))
   ```
   - It tests the tagger on a test set and prints the evaluation result.

Note: In `__init__`, the statements that build `untagged_sent` and `history`, along with the inner loop over `tagged_sent`, must all be indented under `for tagged_sent in train_sents:`, as in the listing in this explanation.
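The greedy left-to-right loop can be isolated from NLTK entirely. In the sketch below, `toy_classify` is a hypothetical hand-written rule standing in for the trained classifier, and the tags `DET`, `ADJ`, `NOUN`, and `VERB` are simplified labels rather than Brown tags. The point is only that each decision can consult the tags already chosen.

```python
def toy_classify(word, prev_tag):
    """Hypothetical stand-in for a trained classifier that sees the previous tag."""
    if word.lower() in ('a', 'the'):
        return 'DET'
    if word.lower() in ('large', 'small'):
        return 'ADJ'
    # A word right after a determiner or adjective is guessed to be a noun.
    if prev_tag in ('DET', 'ADJ'):
        return 'NOUN'
    return 'VERB'

def greedy_tag(sentence):
    history = []                                   # tags chosen so far
    for i, word in enumerate(sentence):
        prev_tag = history[i - 1] if i > 0 else '<START>'
        history.append(toy_classify(word, prev_tag))
    return list(zip(sentence, history))

print(greedy_tag(['the', 'large', 'fly', 'landed']))
```

Because each tag is fixed as soon as it is chosen, an early mistake can propagate to later words; that is the price of the greedy strategy.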

# 2 Further Examples of Supervised Classification

# 2.1 Sentence Segmentation

Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a
symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it
terminates the preceding sentence.

The first step is to obtain some data that has already been segmented into sentences and convert it into a form
that is suitable for extracting features:

In [None]:
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

**Detailed Explanation**

The provided code processes sentences from the Treebank corpus in the Natural Language Toolkit (nltk). Let's break down the code step by step:

```python
sents = nltk.corpus.treebank_raw.sents()
```

This line retrieves the sentences from the Treebank corpus using `nltk.corpus.treebank_raw.sents()`. The variable `sents` now contains a list of sentences, where each sentence is represented as a list of words.

```python
tokens = []
boundaries = set()
offset = 0
```

These lines initialize three variables:
- `tokens`: An empty list to store individual words.
- `boundaries`: An empty set to store indices where sentence boundaries occur.
- `offset`: A variable to keep track of the cumulative length of sentences.

```python
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)
```

This loop iterates through each sentence in `sents`:
- `tokens.extend(sent)`: It appends all the words from the current sentence to the `tokens` list, effectively concatenating all the sentences into a single list of words.
- `offset += len(sent)`: It updates the `offset` variable by adding the length of the current sentence. This keeps track of the cumulative length of all the sentences processed so far.
- `boundaries.add(offset-1)`: It adds the current offset (minus 1) to the `boundaries` set. Subtracting 1 converts the running token count into the zero-based index of the last token in the sentence, which is the token that marks the sentence boundary.

After this loop, `tokens` contains all the words from the sentences concatenated into a single list, and `boundaries` contains indices that mark the boundaries between sentences.

In summary, this code snippet is a pre-processing step that processes sentences from the Treebank corpus, concatenates all the words into a single list (`tokens`), and identifies boundaries between sentences using the `boundaries` set.
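
To see concretely what `tokens` and `boundaries` end up holding, the same loop can be run on a tiny hand-made input. This is only a sketch: the two toy sentences and the `toy_` variable names are assumptions made for illustration and are not taken from the Treebank corpus.

```python
# Two hypothetical, already-tokenized sentences (not from the corpus).
toy_sents = [['Hello', 'there', '.'], ['How', 'are', 'you', '?']]

toy_tokens = []
toy_boundaries = set()
offset = 0
for sent in toy_sents:
    toy_tokens.extend(sent)          # flatten the sentences into one token list
    offset += len(sent)              # number of tokens seen so far
    toy_boundaries.add(offset - 1)   # index of this sentence's last token

print(toy_tokens)
print(sorted(toy_boundaries))        # [2, 6]
```

Indices 2 and 6 are the positions of `.` and `?` in the merged list, so every index in the set points at a token that ends a sentence.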

Here, tokens is a merged list of tokens from the individual sentences, and boundaries is a set containing the
indexes of all sentence-boundary tokens. Next, we need to specify the features of the data that will be used in
order to decide whether punctuation indicates a sentence-boundary:


In [None]:
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}
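
As a quick check of what the extractor returns, it can be applied to a short hand-made token list. The sketch below repeats the function so that it runs on its own; the toy tokens are an assumption chosen to contrast a period after an abbreviation with a period that ends a sentence.

```python
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

# Hypothetical tokens: the first period follows an abbreviation,
# the second one really ends a sentence.
toy_tokens = ['Mr', '.', 'Smith', 'left', '.', 'He', 'was', 'late', '.']

abbrev_period = punct_features(toy_tokens, 1)
final_period = punct_features(toy_tokens, 4)
print(abbrev_period)
print(final_period)
```

Both periods are followed by a capitalized word, so that feature alone cannot separate them; the `prev-word` feature (`'mr'` versus `'left'`) is what gives a classifier a chance to tell the two cases apart.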


Based on this feature extractor, we can create a list of labeled featuresets by selecting all the punctuation
tokens, and tagging whether they are boundary tokens or not:


In [None]:
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']


Using these feature sets, we can train and evaluate a punctuation classifier:


In [None]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

To use this classifier to perform sentence segmentation, we simply check each punctuation mark to see
whether it's labeled as a boundary; and divide the list of words at the boundary marks. The listing below
shows how this can be done.

In [None]:
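def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # punct_features() looks at words[i-1] and words[i+1], so only
        # classify punctuation that has a token on both sides
        if (0 < i < len(words) - 1 and word in '.?!'
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i+1
    # whatever follows the last detected boundary forms the final sentence
    if start < len(words):
        sents.append(words[start:])
    return sents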


# 2.2 Identifying Dialogue Act Types

When processing dialogue, it can be useful to think of utterances as a type of action performed by the
speaker. This interpretation is most straightforward for performative statements such as "I forgive you" or "I
bet you can't climb that hill." But greetings, questions, answers, assertions, and clarifications can all be
thought of as types of speech-based actions. Recognizing the dialogue acts underlying the utterances in a
dialogue can be an important first step in understanding the conversation.


The NPS Chat Corpus, which was demonstrated in 1, consists of over 10,000 posts from instant messaging
sessions. These posts have all been labeled with one of 15 dialogue act types, such as "Statement,"
"Emotion," "ynQuestion", and "Continuer." We can therefore use this data to build a classifier that can
identify the dialogue act types for new instant messaging posts. The first step is to extract the basic
messaging data. We will call xml_posts() to get a data structure representing the XML annotation for each
post:

In [None]:
import nltk
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


**Detailed Explanation**

`posts = nltk.corpus.nps_chat.xml_posts()[:10000]`

This line retrieves a subset of posts from the NPS Chat Corpus. Here's a breakdown of each part:

- `nltk.corpus.nps_chat`: This accesses the NPS Chat Corpus included in the NLTK library. The NPS Chat Corpus contains instant messaging chats.

- `.xml_posts()`: This method specifically retrieves the posts from the corpus. Each post typically represents a single message in a chat.

- `[:10000]`: This slice notation is used to select the first 10,000 posts from the NPS Chat Corpus. It limits the number of posts to the first 10,000 in the dataset.

After running this code, the variable `posts` contains a list of XML elements, one per post. As used later in this section, the text of a message is available as `post.text`, and its dialogue act label is available through `post.get('class')`.

Next, we'll define a simple feature extractor that checks what words the post contains:


In [None]:
import nltk

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features


**Detailed Explanation**

This code defines a function called `dialogue_act_features` that extracts features from a given text post for the purpose of dialogue act classification. Dialogue act classification involves categorizing a piece of text (e.g., a sentence or message) based on the type of discourse act it represents, such as a question, statement, request, etc.

Let's break down the function:

```python
import nltk
```
This line imports the Natural Language Toolkit (NLTK) library, which is widely used for natural language processing tasks in Python.

```python
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
```
- **`def dialogue_act_features(post):`**: This line defines the `dialogue_act_features` function, which takes a single argument `post` representing a text post.

- **`features = {}`**: Initializes an empty dictionary called `features` to store the extracted features.

- **`for word in nltk.word_tokenize(post):`**: This line tokenizes the input `post` into words using NLTK's `word_tokenize` function, and then iterates through each word.

- **`features['contains({})'.format(word.lower())] = True`**: For each word in the tokenized post, it creates a feature key in the `features` dictionary using the format `'contains({})'.format(word.lower())`. This format represents whether the word is present in the post, and it converts the word to lowercase for case-insensitive matching. The value associated with each feature is set to `True`.

- **`return features`**: After iterating through all words in the post, the function returns the dictionary of features.

The purpose of this function is to generate a simple bag-of-words representation of the input text post. Each word in the post becomes a feature, and the feature value is set to `True` if the word is present. This type of feature extraction is often used as a basic representation for text classification tasks, including dialogue act classification. The resulting features can be used as input to a machine learning model to train a classifier for predicting dialogue acts based on the words present in a text post.
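
The effect of this representation can be seen on a single made-up post. The sketch below is a variant of the extractor, named `bag_of_words_features` here, that takes the tokenizer as a parameter and defaults to plain whitespace splitting; both the name and the default are assumptions made so that the example runs without downloading NLTK's tokenizer data, and the output will differ from `nltk.word_tokenize` on text that contains punctuation.

```python
def bag_of_words_features(post, tokenize=str.split):
    """Variant of dialogue_act_features with a pluggable tokenizer (assumption)."""
    features = {}
    for word in tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

# Hypothetical post; 'Hi' and 'hi' collapse into one feature after lowercasing.
feats = bag_of_words_features('Hi hi how are you')
print(feats)
```

Because the features only record presence, a word that occurs twice contributes a single entry, and word order is discarded.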

Finally, we construct the training and testing data by applying the feature extractor to each post (using
post.get('class') to get a post's dialogue act type), and create a new classifier:

In [None]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

# 3 Evaluation

In order to decide whether a classification model is accurately capturing a pattern, we must evaluate that
model. The result of this evaluation is important for deciding how trustworthy the model is, and for what
purposes we can use it. Evaluation can also be an effective tool for guiding us in making future
improvements to the model.

# 3.1 The Test Set

Most evaluation techniques calculate a score for a model by comparing the labels that it generates for the
inputs in a test set (or evaluation set) with the correct labels for those inputs. This test set typically has the
same format as the training set. However, it is very important that the test set be distinct from the training
corpus: if we simply re-used the training set as the test set, then a model that simply memorized its input,
without learning how to generalize to new examples, would receive misleadingly high scores.

When building the test set, there is often a trade-off between the amount of data available for testing and the
amount available for training. For classification tasks that have a small number of well-balanced labels and a
diverse test set, a meaningful evaluation can be performed with as few as 100 evaluation instances. But if a
classification task has a large number of labels, or includes very infrequent labels, then the size of the test set
should be chosen to ensure that the least frequent label occurs at least 50 times. Additionally, if the test set
contains many closely related instances — such as instances drawn from a single document — then the size
of the test set should be increased to ensure that this lack of diversity does not skew the evaluation results.
When large amounts of annotated data are available, it is common to err on the side of safety by using 10%
of the overall data for evaluation.

Another consideration when choosing the test set is the degree of similarity between instances in the test set
and those in the development set. The more similar these two datasets are, the less confident we can be that
evaluation results will generalize to other datasets. For example, consider the part-of-speech tagging task. At
one extreme, we could create the training set and test set by randomly assigning sentences from a data source
that reflects a single genre (news):

In [None]:
import random
from nltk.corpus import brown
tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

In this case, our test set will be very similar to our training set. The training set and test set are taken from the
same genre, and so we cannot be confident that evaluation results would generalize to other genres. What's
worse, because of the call to random.shuffle(), the test set contains sentences that are taken from the same
documents that were used for training. If there is any consistent pattern within a document — say, if a given
word appears with a particular part-of-speech tag especially frequently — then that difference will be
reflected in both the development set and the test set. A somewhat better approach is to ensure that the
training set and test set are taken from different documents:


In [None]:
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_set = brown.tagged_sents(file_ids[size:])
test_set = brown.tagged_sents(file_ids[:size])


If we want to perform a more stringent evaluation, we can draw the test set from documents that are less
closely related to those in the training set:

In [None]:
train_set = brown.tagged_sents(categories='news')
test_set = brown.tagged_sents(categories='fiction')


If we build a classifier that performs well on this test set, then we can be confident that it has the power to
generalize well beyond the data that it was trained on.


# 3.2 Accuracy

The simplest metric that can be used to evaluate a classifier, accuracy, measures the percentage of inputs in
the test set that the classifier correctly labeled. For example, a name gender classifier that predicts the correct
gender 60 times in a test set containing 80 names would have an accuracy of 60/80 = 75%. The function
nltk.classify.accuracy() will calculate the accuracy of a classifier model on a given test set:


In [None]:
import nltk

# Example features (replace this with your actual feature extraction logic)
def extract_features(sentence):
    return {'contains_word': 'word' in sentence.lower()}

# Example dataset (toy data only; in a real evaluation the test set must be distinct from the training set)
train_set = [({'contains_word': True}, 'positive'), ({'contains_word': False}, 'negative')]
test_set = [({'contains_word': True}, 'positive'), ({'contains_word': False}, 'negative')]

# Training the Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluating and printing the accuracy
accuracy = nltk.classify.accuracy(classifier, test_set)
print('Accuracy: {:4.2f}'.format(accuracy))



Note:
When interpreting the accuracy score of a classifier, it is important to take into consideration the frequencies of the individual class labels in the test set.
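
A small sketch makes the point about label frequencies concrete. The label counts below are hypothetical, and the "classifier" is deliberately trivial: it ignores its input and always predicts the most frequent label.

```python
from collections import Counter

# Hypothetical test-set labels: 90 'ham' messages and 10 'spam' messages.
gold_labels = ['ham'] * 90 + ['spam'] * 10

# A baseline that always predicts the most common label.
majority_label = Counter(gold_labels).most_common(1)[0][0]
predictions = [majority_label] * len(gold_labels)

correct = sum(1 for g, p in zip(gold_labels, predictions) if g == p)
baseline_accuracy = correct / len(gold_labels)
print('Baseline accuracy: {:4.2f}'.format(baseline_accuracy))   # 0.90
```

This baseline reaches 90% accuracy without ever detecting a single spam message, so an accuracy score is best read against the score that the majority label alone would achieve.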

# 3.3 Precision and Recall

Another instance where accuracy scores can be misleading is in "search" tasks, such as information retrieval,
where we are attempting to find documents that are relevant to a particular task. Since the number of
irrelevant documents far outweighs the number of relevant documents, the accuracy score for a model that
labels every document as irrelevant would be very close to 100%.

![prec.png](attachment:prec.png)

It is therefore conventional to employ a different set of measures for search tasks, based on the number of
items in each of the four categories shown in 3.1:
    
True positives are relevant items that we correctly identified as relevant.

True negatives are irrelevant items that we correctly identified as irrelevant.

False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant.

False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.

Given these four numbers, we can define the following metrics:
    
Precision, which indicates how many of the items that we identified were relevant, is TP/(TP+FP).

Recall, which indicates how many of the relevant items we identified, is TP/(TP+FN).

The F-Measure (or F-Score), which combines the precision and recall to give a single score, is
defined to be the harmonic mean of the precision and recall: (2 × Precision × Recall) / (Precision +
Recall).
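
The three definitions above can be sketched in a few lines of plain Python. The helper name `precision_recall_f` and the document identifiers are assumptions made for illustration; the sets stand for the relevant items and for the items the model identified as relevant.

```python
def precision_recall_f(relevant, retrieved):
    """Compute precision, recall and F-measure from two sets of item ids."""
    tp = len(relevant & retrieved)    # relevant and identified
    fp = len(retrieved - relevant)    # identified but not relevant
    fn = len(relevant - retrieved)    # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical search result: 4 relevant documents, 3 returned, 2 of them correct.
relevant = {'d1', 'd2', 'd3', 'd4'}
retrieved = {'d1', 'd2', 'd5'}
precision, recall, f_measure = precision_recall_f(relevant, retrieved)
print('P={:.3f} R={:.3f} F={:.3f}'.format(precision, recall, f_measure))
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3, recall is 1/2, and the F-measure is 4/7. True negatives do not appear in any of the three formulas, which is why these measures are not inflated by a large pool of irrelevant documents.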


# 3.4 Confusion Matrices

When performing classification tasks with three or more labels, it can be informative to subdivide the errors
made by the model based on which types of mistake it made. A confusion matrix is a table where each cell
[i,j] indicates how often label j was predicted when the correct label was i. Thus, the diagonal entries (i.e.,
cells [i,i]) indicate labels that were correctly predicted, and the off-diagonal entries indicate errors.
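
Before turning to a real tagger, the definition can be illustrated with a hand-made pair of label sequences. This sketch counts (correct, predicted) pairs with a `Counter` rather than using NLTK, and the two short tag lists are assumptions for illustration.

```python
from collections import Counter

# Hypothetical gold-standard tags and predicted tags for six words.
gold = ['NN', 'JJ', 'NN', 'VB', 'JJ', 'NN']
pred = ['NN', 'NN', 'NN', 'VB', 'JJ', 'JJ']

# confusion[(i, j)] = how often label j was predicted when the correct label was i
confusion = Counter(zip(gold, pred))
labels = sorted(set(gold) | set(pred))

for i in labels:
    print(i, [confusion[(i, j)] for j in labels])

# The diagonal cells hold the correct predictions.
num_correct = sum(confusion[(label, label)] for label in labels)
print('correct:', num_correct, 'of', len(gold))
```

The off-diagonal cell `confusion[('JJ', 'NN')]` records the one adjective that was mistaken for a noun, which is exactly the kind of error breakdown a single accuracy number hides.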

In [4]:
import nltk

def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]

def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

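# Train a UnigramTagger on the 'news' category so that the 'editorial'
# sentences used for evaluation below are not part of the training data
# (requires the Brown Corpus: nltk.download('brown'))
brown_corpus = nltk.corpus.brown
train_sents = brown_corpus.tagged_sents(categories='news')
t2 = nltk.UnigramTagger(train_sents)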

gold = tag_list(brown_corpus.tagged_sents(categories='editorial'))
test = tag_list(apply_tagger(t2, brown_corpus.tagged_sents(categories='editorial')))

cm = nltk.ConfusionMatrix(gold, test)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))





The confusion matrix indicates that common errors include a substitution of NN for JJ (for 1.6% of words),
and of NN for NNS (for 1.5% of words). Note that periods (.) indicate cells whose value is 0, and that the
diagonal entries, which correspond to correct classifications, are marked with angle brackets.


# 3.5 Cross-Validation

In order to evaluate our models, we must reserve a portion of the annotated data for the test set. As we
already mentioned, if the test set is too small, then our evaluation may not be accurate. However, making the
test set larger usually means making the training set smaller, which can have a significant impact on
performance if a limited amount of annotated data is available.

One solution to this problem is to perform multiple evaluations on different test sets, then to combine the
scores from those evaluations, a technique known as cross-validation. In particular, we subdivide the
original corpus into N subsets called folds. For each of these folds, we train a model using all of the data
except the data in that fold, and then test that model on the fold. Even though the individual folds might be
too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large
amount of data, and is therefore quite reliable.

A second, and equally important, advantage of using cross-validation is that it allows us to examine how
widely the performance varies across different training sets. If we get very similar scores for all N training
sets, then we can be fairly confident that the score is accurate. On the other hand, if scores vary widely across
the N training sets, then we should probably be skeptical about the accuracy of the evaluation score.
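
The procedure can be sketched without any particular learning algorithm. In the example below the "model" is just the most frequent training label, and the data set, the function names, and the choice of 5 folds are all assumptions made to keep the sketch self-contained; a real experiment would plug in an actual classifier.

```python
from collections import Counter

def train_majority(train_data):
    """Toy 'model': remember the most frequent label in the training data."""
    return Counter(label for _, label in train_data).most_common(1)[0][0]

def accuracy(model_label, test_data):
    """Fraction of test items whose label equals the remembered label."""
    return sum(1 for _, label in test_data if label == model_label) / len(test_data)

def cross_validate(data, n_folds):
    """Train on all folds but one, test on the held-out fold, for every fold."""
    scores = []
    for k in range(n_folds):
        test_fold = [x for i, x in enumerate(data) if i % n_folds == k]
        train_folds = [x for i, x in enumerate(data) if i % n_folds != k]
        model = train_majority(train_folds)
        scores.append(accuracy(model, test_fold))
    return scores

# Hypothetical labeled data: 30 items, every fourth one labeled 'neg'.
data = [({'id': i}, 'neg' if i % 4 == 0 else 'pos') for i in range(30)]

scores = cross_validate(data, 5)
mean_score = sum(scores) / len(scores)
print(['{:.2f}'.format(s) for s in scores])
print('mean = {:.2f}, spread = {:.2f}'.format(mean_score, max(scores) - min(scores)))
```

The mean is the combined evaluation score, and the spread between the best and worst fold gives a rough sense of how much the score depends on which data was held out.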


# 4 Naive Bayes Classifiers

In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given
input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior
probability of each label, which is determined by checking the frequency of each label in the training set. The
contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate
for each label. The label whose likelihood estimate is the highest is then assigned to the input value.

![NB.png](attachment:NB.png)



NOTE:

An abstract illustration of the procedure used by the naive Bayes classifier to choose the
topic for a document. In the training corpus, most documents are automotive, so the classifier starts out
at a point closer to the "automotive" label. But it then considers the effect of each feature. In this
example, the input document contains the word "dark," which is a weak indicator for murder mysteries,
but it also contains the word "football," which is a strong indicator for sports documents. After every
feature has made its contribution, the classifier checks which label it is closest to, and assigns that label
to the input.

Individual features make their contribution to the overall decision by "voting against" labels that don't occur
with that feature very often. In particular, the likelihood score for each label is reduced by multiplying it by
the probability that an input value with that label would have the feature. For example, if the word run occurs
in 12% of the sports documents, 10% of the murder mystery documents, and 2% of the automotive
documents, then the likelihood score for the sports label will be multiplied by 0.12; the likelihood score for
the murder mystery label will be multiplied by 0.1, and the likelihood score for the automotive label will be
multiplied by 0.02. The overall effect will be to reduce the score of the murder mystery label slightly more
than the score of the sports label, and to significantly reduce the automotive label with respect to the other
two labels. This process is illustrated below:
    
    

![NB2.png](attachment:NB2.png)

Calculating label likelihoods with naive Bayes. Naive Bayes begins by calculating the prior
probability of each label, based on how frequently each label occurs in the training data. Every feature
then contributes to the likelihood estimate for each label, by multiplying it by the probability that input
values with that label will have that feature. The resulting likelihood score can be thought of as an
estimate of the probability that a randomly selected value from the training set would have both the
given label and the set of features, assuming that the feature probabilities are all independent.
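
The arithmetic for the word "run" can be worked through directly. The feature probabilities below are the ones given in the text; the prior probabilities are hypothetical values assumed for illustration, chosen so that "automotive" starts out as the most likely label.

```python
# Hypothetical prior probabilities (assumed for illustration; they sum to 1).
priors = {'sports': 0.3, 'murder mystery': 0.2, 'automotive': 0.5}

# P(the word "run" occurs | label), as given in the text.
p_run = {'sports': 0.12, 'murder mystery': 0.10, 'automotive': 0.02}

# Each label's score is its prior multiplied by the feature probability.
likelihoods = {label: priors[label] * p_run[label] for label in priors}
for label, score in likelihoods.items():
    print('{:15s} {:.3f}'.format(label, score))

best_label = max(likelihoods, key=likelihoods.get)
print('best label:', best_label)
```

Although "automotive" has the highest prior under these assumptions, multiplying by 0.02 reduces its score to 0.010, below "sports" at 0.036 and "murder mystery" at 0.020, so a single feature is enough to overturn the prior here.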


Another way of understanding the naive Bayes classifier is that it chooses the most likely label for an input,
under the assumption that every input value is generated by first choosing a class label for that input value,
and then generating each feature, entirely independent of every other feature. Of course, this assumption is
unrealistic; features are often highly dependent on one another. We'll return to some of the consequences of
this assumption at the end of this section. This simplifying assumption, known as the naive Bayes
assumption (or independence assumption) makes it much easier to combine the contributions of the
different features, since we don't need to worry about how they should interact with one another.

![NB%20classifier.png](attachment:NB%20classifier.png)

The diagram above is a Bayesian network graph illustrating the generative process that is assumed by the naive
Bayes classifier. To generate a labeled input, the model first chooses a label for the input, then it
generates each of the input's features based on that label. Every feature is assumed to be entirely
independent of every other feature, given the label.


# Underlying Probabilistic Model


Based on this assumption, we can calculate an expression for P(label|features), the probability that an input
will have a particular label given that it has a particular set of features. To choose a label for a new input, we
can then simply pick the label l that maximizes P(l|features).

To begin, we note that P(label|features) is equal to the probability that an input has a particular label and the
specified set of features, divided by the probability that it has the specified set of features:
    
(2) $P(\text{label} \mid \text{features}) = \dfrac{P(\text{features}, \text{label})}{P(\text{features})}$

Next, we note that P(features) will be the same for every choice of label, so if we are simply interested in
finding the most likely label, it suffices to calculate P(features, label), which we'll call the label likelihood.

Note
If we want to generate a probability estimate for each label, rather than just choosing the
most likely label, then the easiest way to compute P(features) is to simply calculate the sum
over labels of P(features, label):
(3) $P(\text{features}) = \sum_{l \in \text{labels}} P(\text{features}, l)$


The label likelihood can be expanded out as the probability of the label times the probability of the features
given the label:
(4) $P(\text{features}, \text{label}) = P(\text{label}) \times P(\text{features} \mid \text{label})$

Furthermore, since the features are all independent of one another (given the label), we can separate out the
probability of each individual feature:

(5) $P(\text{features}, \text{label}) = P(\text{label}) \times \prod_{f \in \text{features}} P(f \mid \text{label})$

This is exactly the equation we discussed above for calculating the label likelihood: P(label) is the prior
probability for a given label, and each P(f|label) is the contribution of a single feature to the label likelihood.
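
Equations (2), (3) and (5) translate almost line for line into code. The following is a minimal sketch, not NLTK's implementation: the function names, the two labels, and all probability values are assumptions made for illustration, and no smoothing is applied.

```python
from math import prod

def label_likelihood(label, features, prior, feature_prob):
    """Equation (5): P(label) times the product of P(f|label) over the features."""
    return prior[label] * prod(feature_prob[label][f] for f in features)

def label_posteriors(features, prior, feature_prob):
    """Equations (2) and (3): divide each label likelihood by their sum."""
    likelihoods = {label: label_likelihood(label, features, prior, feature_prob)
                   for label in prior}
    p_features = sum(likelihoods.values())   # equation (3)
    return {label: value / p_features for label, value in likelihoods.items()}

# Hypothetical model: two equally likely labels and two word features.
prior = {'sports': 0.5, 'automotive': 0.5}
feature_prob = {
    'sports':     {'football': 0.4, 'dark': 0.1},
    'automotive': {'football': 0.1, 'dark': 0.1},
}

posteriors = label_posteriors(['football', 'dark'], prior, feature_prob)
print(posteriors)
```

With these numbers the label likelihoods are 0.02 for "sports" and 0.005 for "automotive", so normalizing gives probabilities of 0.8 and 0.2. The feature "dark" has the same probability under both labels and therefore does not change their ratio.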
