<a href="https://colab.research.google.com/github/IgnatiusEzeani/NLP-Lecture/blob/main/Week_18_Lab_Text_Feature_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Credit**: The example code below was taken from [Chapter 6 of the NLTK book](https://www.nltk.org/book/ch06.html).

# **Section 1**

## Gender Identification

NLTK has a wordlist corpus, `Names`, containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find names which appear in both files, i.e. names that are ambiguous for gender:

###**Import `nltk` and download the `names` corpus**

In [None]:
# !pip install nltk matplotlib ## Uncomment to install
import nltk
import random
nltk.download('names')
names = nltk.corpus.names 

###**Names in both the male and female lists**

In [None]:
print(names.fileids())
male_names = names.words('male.txt')
female_names = names.words('female.txt')
female_set = set(female_names)  # set membership is much faster than scanning a list
male_female = [w for w in male_names if w in female_set]
print(len(male_female))
for name in male_female[:20]:
  print(name)


###**Distribution of last letters**
The [NLTK book](https://www.nltk.org/book/ch02.html#sec-lexical-resources) suggests that male and female names have some distinctive characteristics: names ending in `a`, `e` and `i` are likely to be female, while names ending in `k`, `o`, `r`, `s` and `t` are likely to be male. Let's see whether the corpus bears this out by plotting how often each final letter occurs in each file.

In [None]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()
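
To make the structure behind `ConditionalFreqDist` concrete, here is a minimal pure-Python sketch that counts final letters per condition. The tiny `toy_names` dictionary is made up for illustration and is not the NLTK corpus; `nltk.ConditionalFreqDist` does the same kind of bookkeeping and adds plotting on top.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for the two corpus files.
toy_names = {
    'female.txt': ['Anna', 'Maria', 'Sophie', 'Julie'],
    'male.txt': ['Mark', 'Peter', 'Hugo', 'Boris'],
}

# One Counter of final letters per condition (here, per file id).
last_letter_counts = defaultdict(Counter)
for fileid, name_list in toy_names.items():
    for name in name_list:
        last_letter_counts[fileid][name[-1]] += 1

for fileid, counts in last_letter_counts.items():
    print(fileid, dict(counts))
```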

###**Feature extractor functions**
Let's build a classifier to model these differences more precisely. The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name.

Each of the following feature extractor functions builds a dictionary containing relevant information about a given name:

In [None]:
# feature extractor 1
def gender_features(word):
  return {'last_letter': word[-1]}

# feature extractor 2
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

# feature extractor 3
def gender_features3(word):
  return {'suffix1': word[-1:], 'suffix2': word[-2:]}
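
A quick way to check a feature extractor is to call it on a single name and inspect the dictionary it returns. The sketch below repeats the two shortest extractors so that it runs on its own; the example names are arbitrary.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

def gender_features3(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

print(gender_features('Shrek'))    # {'last_letter': 'k'}
print(gender_features3('Shrek'))   # {'suffix1': 'k', 'suffix2': 'ek'}

# Slicing is safe on very short names: word[-2:] returns the whole string.
print(gender_features3('A'))       # {'suffix1': 'A', 'suffix2': 'A'}
```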

###**Compiling the training instances**

In [None]:
# Building the training instances
labeled_names = ([(name, 'male') for name in names.words('male.txt')] 
                 + [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
# len(labeled_names)

###**Train-DevTest-Test Split**

In [None]:
# train-devtest-test split
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]
print(len(train_names), len(devtest_names), len(test_names))
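
Because `random.shuffle` reorders the data differently on every run, the accuracies reported later will vary slightly between runs. If reproducible numbers are wanted, one option is to shuffle with a seeded generator. The sketch below shows the idea on a made-up list (`toy_labeled`); the seed value 42 is arbitrary.

```python
import random

# Hypothetical stand-in for labeled_names: 20 (name, label) pairs.
toy_labeled = [('name{}'.format(i), 'male' if i % 2 else 'female')
               for i in range(20)]

# A dedicated, seeded generator shuffles the same way on every run.
rng = random.Random(42)
rng.shuffle(toy_labeled)

# Same slicing pattern as above, scaled down: 5 test, 5 dev-test, rest training.
toy_test = toy_labeled[:5]
toy_devtest = toy_labeled[5:10]
toy_train = toy_labeled[10:]

print(len(toy_train), len(toy_devtest), len(toy_test))  # 10 5 5
```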

###**Extracting the features**

In [None]:
# Extracting the features
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

###**Training and Testing the Classifier**

In [None]:
# Training the classifier
random.shuffle(train_set)
classifier = nltk.NaiveBayesClassifier.train(train_set)

# apply the classifier to the development test
print("Accuracy = ", nltk.classify.accuracy(classifier, devtest_set))
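
Accuracy here is simply the fraction of items whose predicted label matches the gold label. The sketch below computes it by hand for a hypothetical rule-based predictor (`toy_classify`) on made-up data, so it runs without a trained model; `nltk.classify.accuracy` reports the same kind of figure for a real classifier.

```python
def toy_classify(features):
    # Hypothetical rule: names ending in a vowel are guessed 'female'.
    return 'female' if features['last_letter'] in 'aeiou' else 'male'

# Made-up (featureset, gold label) pairs in the same shape as devtest_set.
toy_devtest_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'e'}, 'male'),    # the rule gets this one wrong
    ({'last_letter': 's'}, 'male'),
]

correct = sum(1 for features, gold in toy_devtest_set
              if toy_classify(features) == gold)
manual_accuracy = correct / len(toy_devtest_set)
print("Accuracy = {:.2f} ({:.0f}%)".format(manual_accuracy, 100 * manual_accuracy))
```

Multiplying by 100 gives the percentage accuracy asked for in Task 1.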

###**Building the Error List**

In [None]:
# error analysis
errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name))
  if guess != tag:
    errors.append((tag, guess, name))

###**Show errors**

In [None]:
# Error list
print("Errors:", len(errors))
for (tag, guess, name) in sorted(errors[:20]):
  print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
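
Reading a long error list by eye is slow. One possible next step is to group the errors by a candidate feature, such as the final two letters, to see which endings the classifier gets wrong most often. The `toy_errors` list below is invented for illustration and only mirrors the `(tag, guess, name)` shape of `errors`.

```python
from collections import Counter

# Made-up errors, not real classifier output.
toy_errors = [
    ('female', 'male', 'Carolyn'),
    ('female', 'male', 'Kathryn'),
    ('male', 'female', 'Aldrich'),
    ('female', 'male', 'Evelyn'),
    ('male', 'female', 'Mitch'),
]

# Count how many misclassified names share each two-letter ending.
suffix_counts = Counter(name[-2:].lower() for (_tag, _guess, name) in toy_errors)
for suffix, count in suffix_counts.most_common():
    print('{:<4} {}'.format(suffix, count))
```

Endings that dominate the list are candidates for new features, which is the motivation for a suffix-based extractor such as `gender_features3()`.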

###**Most informative features**

In [None]:
# Most informative features
classifier.show_most_informative_features(10)

###**Classifying other names**

In [None]:
print(classifier.classify(gender_features('Neo')))
# Output: 'male'
print(classifier.classify(gender_features('Trinity')))
# Output: 'female'

###**Classifying your name**

In [None]:
## Uncomment and modify below to classify your name with your best classifier
# print(classifier.classify(gender_features('<your name>')))  # replace <your name> with your own name

##**Task 1**

Write code that trains three different classifiers (`classifier1`, `classifier2` and `classifier3`) using the three feature extractor functions defined above: `gender_features()`, `gender_features2()` and `gender_features3()`.

1. Apply the three classifiers to the dev-test data (`devtest_names`, with features extracted by the matching function) and, for each, report the *percentage accuracy*, *error count* and *error list*. **Which of the feature extraction methods performed best on the dev-test data? Can you explain why?**

2. Apply the best-performing classifier to the test data (`test_names`, with features extracted by the matching function). **What is the classification accuracy, and what does the error list look like?**

3. Modify your feature extractor, or any other part of the code, to see if you can improve the accuracy score.

In [None]:
# 1. Your code here

In [None]:
# 2. Your code here

In [None]:
# 3. Your code here

---
# **Section 2**

## Document Classification
First, we construct a list of documents, labeled with the appropriate categories. For this example, we've chosen the Movie Reviews Corpus, which categorizes each review as positive (`pos`) or negative (`neg`).

In [None]:
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category) 
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. For document topic identification, we can define a feature for each word, indicating whether the document contains that word.

To limit the number of features that the classifier needs to process, we begin by constructing a list of the _2000 most frequent words_ in the overall corpus:

In [None]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
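# most_common() sorts by frequency; iterating over a FreqDist directly does not
word_features = [word for (word, _count) in all_words.most_common(2000)]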
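
One detail worth checking: in NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so iterating over it yields words in first-seen order rather than frequency order, and `most_common(n)` is what sorts by count. The sketch below shows the difference with a plain `Counter` on a made-up token list.

```python
from collections import Counter

# Made-up tokens; 'the' occurs 3 times, 'was' twice, the rest once.
toy_tokens = ['the', 'plot', 'was', 'thin', 'but',
              'the', 'cast', 'was', 'great', 'the']
toy_counts = Counter(toy_tokens)

# Plain iteration follows insertion (first-seen) order.
first_seen = list(toy_counts)[:3]

# most_common() orders by descending count.
most_frequent = [word for word, _count in toy_counts.most_common(2)]

print(first_seen)      # ['the', 'plot', 'was']
print(most_frequent)   # ['the', 'was']
```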

We can then define a feature extractor `document_features()` that simply checks whether each of these words is present in a given document.

In [None]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

The reason for computing the set of all words in a document on the second line of the function, rather than just checking `word in document`, is that checking whether a word occurs in a set is much faster than checking whether it occurs in a list.
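
The speed difference is easy to measure. The sketch below times membership tests on a made-up 50,000-token document using the standard `timeit` module; the exact timings depend on the machine, so none are shown here.

```python
import timeit

# Hypothetical document: 50,000 distinct tokens.
toy_document = ['w{}'.format(i) for i in range(50000)]
toy_document_words = set(toy_document)

probe = 'w49999'  # last element: the worst case for a list scan

# Repeat each membership test 200 times and compare the total time.
list_time = timeit.timeit(lambda: probe in toy_document, number=200)
set_time = timeit.timeit(lambda: probe in toy_document_words, number=200)

print('list: {:.5f}s  set: {:.5f}s'.format(list_time, set_time))
```

A list is scanned element by element, while a set uses hashing, so the gap grows with document length and with the 2000 lookups that `document_features()` performs per document.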

Now, let's test our feature extractor by looking at the words that appeared in this positive review file `pos/cv957_8737.txt`

In [None]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt'))) 

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews.

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

To check how reliable the resulting classifier is, we compute its accuracy on the test set. 

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

Again, we can use `show_most_informative_features()` to find out which features the classifier found to be most informative.

In [None]:
classifier.show_most_informative_features()

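In one run on this corpus, for example, a review that mentions **shoddy** was almost 7 times more likely to be negative than positive, while a review that mentions **singers** was about 6 times more likely to be positive. Because the documents are shuffled randomly before the split, the features and ratios you see may differ.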
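
The ratios printed by `show_most_informative_features()` compare how likely a feature value is under each label. As a rough sketch with invented counts (not taken from the corpus), the calculation looks like this. NLTK's classifier also smooths its probability estimates, so its reported ratios will differ somewhat from raw-count ratios.

```python
# Hypothetical counts: reviews containing a word, out of 1000 reviews per label.
reviews_per_label = {'neg': 1000, 'pos': 1000}
reviews_with_word = {'neg': 21, 'pos': 3}

# P(contains(word) = True | label), estimated from raw counts without smoothing.
p_given_label = {
    label: reviews_with_word[label] / reviews_per_label[label]
    for label in reviews_per_label
}

ratio = p_given_label['neg'] / p_given_label['pos']
print('neg : pos = {:.1f} : 1.0'.format(ratio))
```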

##**Task 2**

The document feature extractor checks whether each word is present in a given document. Can you create the other feature extractors described below?

1. `document_features2()`: uses the word frequency counts (rather than word presence) as features.

2. `document_features3()`: extracts the bigrams present in the document and uses them as features.

3. `document_features4()`: combines unigram (word) presence and bigram presence as features.

Train and evaluate a classifier with each of these and share your observations.
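
As a starting hint for the bigram-based extractors, a bigram is just a pair of adjacent tokens. The sketch below builds them with plain Python on a made-up token list; `nltk.bigrams()` yields the same pairs for a token sequence. How to turn them into features is left for the task.

```python
# Made-up tokens for illustration.
toy_tokens = ['a', 'truly', 'great', 'film']

# Pair each token with its successor.
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))
print(toy_bigrams)
# [('a', 'truly'), ('truly', 'great'), ('great', 'film')]
```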

In [None]:
# 1. Your code here

In [None]:
# 2. Your code here

In [None]:
# 3. Your code here