# NLTK Basics - Gender Classification and POS Tagging

## Gender Classification

**Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance.**

**A classifier is called supervised if it is built from training corpora that contain the correct label for each input.**

## Implementation

**The function below extracts the feature used to classify names by gender. It returns a dictionary containing the last letter of the input name, which is the sole basis for classification.**

In [1]:
def gender_features(word):
    return {'last_letter':word[-1]}
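
As a quick check of what the classifier will actually see, the sketch below calls `gender_features` on a few names. The sample names are only illustrative. Note that a full name is treated as a single string, so only its final character is used.

```python
def gender_features(word):
    # Same extractor as above: the only feature is the final character.
    return {'last_letter': word[-1]}

# Illustrative inputs; a name with a space is still one string.
for name in ['Abigail', 'Ada', 'Rahul Gopinath']:
    print(name, '->', gender_features(name))
```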

**We import the names corpus, a list of first names that serves as a simple dataset.**

In [2]:
from nltk.corpus import names

In [3]:
names.words()

['Abagael',
 'Abagail',
 'Abbe',
 'Abbey',
 'Abbi',
 'Abbie',
 'Abby',
 'Abigael',
 'Abigail',
 'Abigale',
 'Abra',
 'Acacia',
 'Ada',
 'Adah',
 'Adaline',
 'Adara',
 'Addie',
 'Addis',
 'Adel',
 'Adela',
 'Adelaide',
 'Adele',
 'Adelice',
 'Adelina',
 'Adelind',
 'Adeline',
 'Adella',
 'Adelle',
 'Adena',
 'Adey',
 'Adi',
 'Adiana',
 'Adina',
 'Adora',
 'Adore',
 'Adoree',
 'Adorne',
 'Adrea',
 'Adria',
 'Adriaens',
 'Adrian',
 'Adriana',
 'Adriane',
 'Adrianna',
 'Adrianne',
 'Adrien',
 'Adriena',
 'Adrienne',
 'Aeriel',
 'Aeriela',
 'Aeriell',
 'Ag',
 'Agace',
 'Agata',
 'Agatha',
 'Agathe',
 'Aggi',
 'Aggie',
 'Aggy',
 'Agna',
 'Agnella',
 'Agnes',
 'Agnese',
 'Agnesse',
 'Agneta',
 'Agnola',
 'Agretha',
 'Aida',
 'Aidan',
 'Aigneis',
 'Aila',
 'Aile',
 'Ailee',
 'Aileen',
 'Ailene',
 'Ailey',
 'Aili',
 'Ailina',
 'Ailyn',
 'Aime',
 'Aimee',
 'Aimil',
 'Aina',
 'Aindrea',
 'Ainslee',
 'Ainsley',
 'Ainslie',
 'Ajay',
 'Alaine',
 'Alameda',
 'Alana',
 'Alanah',
 'Alane',
 'Alanna',
 

**The following code prints the number of names in the dataset.**

In [4]:
print(len(names.words()))

7944


**Our aim is to predict whether a given name is male or female. To train a classifier for this, every name in the list first needs a gender label.**

**We use two labels, "male" and "female".**

**We build a new list that stores each name together with its gender as a tuple. The gender comes from the file in which the name is listed: names in "male.txt" are labeled 'male' and names in "female.txt" are labeled 'female'.**

In [5]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
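
Before training, it can help to check how the labels are distributed. The sketch below does this with `collections.Counter` on a small hand-written stand-in for `labeled_names`, so it runs without the corpus. The variable `labeled_sample` is an assumption made for illustration; with the real list, the same counting line applies.

```python
from collections import Counter

# Small stand-in for labeled_names: (name, gender) tuples.
labeled_sample = [('Pooh', 'male'), ('Nadia', 'female'), ('Kania', 'female'),
                  ('Erasmus', 'male'), ('Lishe', 'female')]

# Count how many names carry each label.
label_counts = Counter(gender for _, gender in labeled_sample)
print(label_counts)
```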

In [10]:
print(labeled_names)

[('Pooh', 'male'), ('Nadia', 'female'), ('Kania', 'female'), ('Jacenta', 'female'), ('Erasmus', 'male'), ('Lishe', 'female'), ('Annabell', 'female'), ('Mohammad', 'male'), ('Samaria', 'female'), ('Jaime', 'male'), ('Aime', 'female'), ('Reta', 'female'), ('Christof', 'male'), ('Umeko', 'female'), ('Hadleigh', 'male'), ('Cletus', 'male'), ('Debbra', 'female'), ('Pattie', 'male'), ('Shea', 'male'), ('Hari', 'male'), ('Wilie', 'female'), ('Garrett', 'male'), ('Trixie', 'female'), ('Corry', 'female'), ('Harlie', 'female'), ('Martynne', 'female'), ('Louis', 'male'), ('Rutger', 'male'), ('Carena', 'female'), ('Jude', 'male'), ('Len', 'male'), ('Clancy', 'male'), ('Mattie', 'male'), ('Estell', 'female'), ('Srinivas', 'male'), ('Guthrie', 'male'), ('Shamit', 'female'), ('Rozina', 'female'), ('Enrique', 'male'), ('Tatum', 'female'), ('Shanan', 'male'), ('Stephanie', 'female'), ('Tad', 'male'), ('Codee', 'female'), ('Nelie', 'female'), ('Jolie', 'female'), ('Gracie', 'female'), ('Marybeth', 'fema

**The following code shuffles the list in place, so that male and female names are mixed rather than grouped by file.**

In [6]:
import random
random.shuffle(labeled_names)
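
Because `random.shuffle` produces a different order on every run, the split and the accuracy reported later will vary between runs. One way to make a run repeatable, shown here as a sketch rather than what the notebook did, is to shuffle with a seeded `random.Random` instance. The helper `seeded_shuffle`, the seed value 42, and the sample list are arbitrary choices for illustration.

```python
import random

sample = [('Pooh', 'male'), ('Nadia', 'female'), ('Kania', 'female'),
          ('Erasmus', 'male'), ('Lishe', 'female')]

def seeded_shuffle(items, seed=42):
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled

first = seeded_shuffle(sample)
second = seeded_shuffle(sample)
print(first == second)  # True: both calls use the same seed
```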

**Next we build another list of tuples, each pairing the feature dictionary of a name with its gender label. This is the input format the classifier is trained on.**

In [7]:
featuresets = [(gender_features(name), gender) for (name, gender) in labeled_names]

In [8]:
print(featuresets)

[({'last_letter': 'h'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 's'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'l'}, 'female'), ({'last_letter': 'd'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'f'}, 'male'), ({'last_letter': 'o'}, 'female'), ({'last_letter': 'h'}, 'male'), ({'last_letter': 's'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'male'), ({'last_letter': 'a'}, 'male'), ({'last_letter': 'i'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 't'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'y'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 's'}, 'male'), ({'last_letter': 'r'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'male'), ({'last

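**We split featuresets into a training set and a test set. Of the 7944 entries, the last 2944 (featuresets[5000:]) are used for training and the first 2944 (featuresets[:2944]) for testing. The two slices do not overlap, and the 2056 entries between them are not used.**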

In [11]:
train_set, test_set = featuresets[5000:], featuresets[:2944]
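
The slices above leave part of the data unused. A common alternative, sketched here with a hypothetical helper `split_at`, cuts the shuffled list at a single index so that every item lands in exactly one of the two sets. The cut point of 5000 is an assumption for illustration, not the split the notebook used, and the stand-in list simply has the same length as `featuresets`.

```python
def split_at(items, cut):
    """Split a list into (first `cut` items, remaining items)."""
    return items[:cut], items[cut:]

# Stand-in for the 7944 shuffled feature sets.
items = list(range(7944))
train_items, test_items = split_at(items, 5000)

print(len(train_items), len(test_items))  # 5000 2944
```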

**We use the training set to train a "naive Bayes" classifier.**

**Naive Bayes classifiers are a family of classification algorithms based on Bayes’ theorem. They share one principle: each feature is assumed to be independent of every other feature, given the class label.**

**The assumption is called naive because real features are rarely fully independent. In mathematical terms, if two events E1 and E2 are independent, then P(E1,E2) = P(E1)P(E2).**
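
Applied to classification, the independence assumption lets the classifier score each label by multiplying simple per-feature probabilities. The symbols here are introduced for illustration: $y$ is a label and $f_1, \dots, f_n$ are the feature values of one input. Bayes' theorem together with the naive assumption gives:

$$
P(y \mid f_1, \dots, f_n) \;\propto\; P(y) \prod_{i=1}^{n} P(f_i \mid y)
$$

The predicted label is the one with the highest score. In this notebook $n = 1$, since last_letter is the only feature.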

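**The show_most_informative_features() method lists the feature values that best separate the two genders. Each row is a likelihood ratio: it shows how many times more likely that last letter is under one label than under the other. In the output below, for example, a final 'm' is 32 times more likely among male names than among female names.**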

In [16]:
import nltk
classifier=nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)

Most Informative Features
             last_letter = 'm'              male : female =     32.0 : 1.0
             last_letter = 'a'            female : male   =     28.5 : 1.0
             last_letter = 'f'              male : female =     15.1 : 1.0
             last_letter = 'o'              male : female =      8.0 : 1.0
             last_letter = 'p'              male : female =      7.3 : 1.0
             last_letter = 'd'              male : female =      7.1 : 1.0
             last_letter = 'r'              male : female =      6.9 : 1.0
             last_letter = 'b'              male : female =      5.7 : 1.0
             last_letter = 'i'            female : male   =      5.3 : 1.0
             last_letter = 's'              male : female =      5.0 : 1.0
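
To see where such ratios come from, the sketch below computes them by hand from made-up counts. It estimates P(last_letter | label) for each label and divides the larger estimate by the smaller. The counts are invented, and the add-0.5 smoothing is an assumption chosen to keep zero counts from breaking the division. NLTK's internal estimate may differ in detail, so the numbers are not meant to match the table above.

```python
from collections import Counter

# Invented training data: (last_letter, label) pairs.
observations = ([('a', 'female')] * 40 + [('e', 'female')] * 50 + [('n', 'female')] * 10
                + [('a', 'male')] * 2 + [('e', 'male')] * 18 + [('n', 'male')] * 80)

pair_counts = Counter(observations)
label_counts = Counter(label for _, label in observations)
letters = sorted({letter for letter, _ in observations})

def smoothed_prob(letter, label, gamma=0.5):
    """Estimate P(letter | label) with add-gamma smoothing (an assumption)."""
    return ((pair_counts[(letter, label)] + gamma)
            / (label_counts[label] + gamma * len(letters)))

def likelihood_ratio(letter):
    """Return (more likely label, less likely label, ratio) for one letter."""
    probs = {label: smoothed_prob(letter, label) for label in label_counts}
    high = max(probs, key=probs.get)
    low = min(probs, key=probs.get)
    return high, low, probs[high] / probs[low]

for letter in letters:
    high, low, ratio = likelihood_ratio(letter)
    print(f"last_letter = {letter!r}  {high} : {low} = {ratio:.1f} : 1.0")
```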


**Testing the model**

In [19]:
classifier.classify(gender_features('Rahul Gopinath'))

'male'

**We obtain the accuracy of the model by passing the classifier and the test set to the nltk.classify.accuracy() function. It returns the accuracy as a fraction between 0 and 1.**

In [20]:
print(nltk.classify.accuracy(classifier,test_set))

0.7516983695652174
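
Accuracy here is the share of test items whose predicted label matches the true label. The sketch below computes it by hand for a hypothetical rule-based predictor, `predict_by_rule`, which guesses 'female' for names ending in a vowel. Both the rule and the five-item sample are for illustration only and are not the trained classifier.

```python
def predict_by_rule(features):
    # Hypothetical stand-in for a trained classifier.
    return 'female' if features['last_letter'] in 'aeiou' else 'male'

def accuracy(predict, labeled_featuresets):
    """Fraction of (features, label) pairs that `predict` labels correctly."""
    correct = sum(1 for features, label in labeled_featuresets
                  if predict(features) == label)
    return correct / len(labeled_featuresets)

sample_test_set = [({'last_letter': 'h'}, 'male'), ({'last_letter': 'a'}, 'female'),
                   ({'last_letter': 's'}, 'male'), ({'last_letter': 'e'}, 'male'),
                   ({'last_letter': 'o'}, 'female')]

print(accuracy(predict_by_rule, sample_test_set))  # 0.8 (4 of 5 correct)
```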


## POS Tagging

In [6]:
import nltk

In [7]:
text='''William Shakespeare was an English poet, playwright, and actor, widely regarded as the greatest writer in the English language and the world's greatest dramatist. He is often called England's national poet. Shakespeare was born and raised in Stratford-upon-Avon, Warwickshire.'''

**The input text must first be broken into tokens, and each token must then be assigned a part-of-speech tag.**

**We tokenize with the RegexpTokenizer class. The pattern \w+ keeps only runs of word characters, so punctuation is dropped and a word such as "world's" is split into 'world' and 's'.**

In [8]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['William', 'Shakespeare', 'was', 'an', 'English', 'poet', 'playwright', 'and', 'actor', 'widely', 'regarded', 'as', 'the', 'greatest', 'writer', 'in', 'the', 'English', 'language', 'and', 'the', 'world', 's', 'greatest', 'dramatist', 'He', 'is', 'often', 'called', 'England', 's', 'national', 'poet', 'Shakespeare', 'was', 'born', 'and', 'raised', 'in', 'Stratford', 'upon', 'Avon', 'Warwickshire']
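
The effect of the pattern can be reproduced with the standard library. The sketch below applies `re.findall` with `\w+` to one sentence of the input, then tries a second pattern that keeps apostrophes and hyphens inside a word. The second pattern is an illustrative alternative, not something the notebook uses.

```python
import re

sentence = "He is often called England's national poet."

# Runs of word characters only: the apostrophe splits "England's" in two.
plain_tokens = re.findall(r"\w+", sentence)
print(plain_tokens)

# Alternative: allow an apostrophe or hyphen between word-character runs.
joined_tokens = re.findall(r"\w+(?:['-]\w+)*", sentence)
print(joined_tokens)
```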


**Next we assign a part-of-speech tag to each token. Note that the stray 's' tokens produced by the tokenizer receive tags of their own ('NN' and 'VBD').**

In [10]:
result = nltk.pos_tag(tokenized_text)
print(result)

[('William', 'NNP'), ('Shakespeare', 'NNP'), ('was', 'VBD'), ('an', 'DT'), ('English', 'JJ'), ('poet', 'NN'), ('playwright', 'NN'), ('and', 'CC'), ('actor', 'NN'), ('widely', 'RB'), ('regarded', 'VBD'), ('as', 'IN'), ('the', 'DT'), ('greatest', 'JJS'), ('writer', 'NN'), ('in', 'IN'), ('the', 'DT'), ('English', 'JJ'), ('language', 'NN'), ('and', 'CC'), ('the', 'DT'), ('world', 'NN'), ('s', 'NN'), ('greatest', 'JJS'), ('dramatist', 'NN'), ('He', 'PRP'), ('is', 'VBZ'), ('often', 'RB'), ('called', 'VBN'), ('England', 'NNP'), ('s', 'VBD'), ('national', 'JJ'), ('poet', 'NN'), ('Shakespeare', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('and', 'CC'), ('raised', 'VBN'), ('in', 'IN'), ('Stratford', 'NNP'), ('upon', 'IN'), ('Avon', 'NNP'), ('Warwickshire', 'NNP')]
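
The tagger returns a list of (token, tag) tuples, which makes it easy to filter or count by tag. As a sketch, the code below takes the first few tuples from the output above, pulls out the proper nouns (tag 'NNP'), and counts how often each tag occurs. The variable names are chosen for this example.

```python
from collections import Counter

# First nine (token, tag) tuples copied from the tagger output.
tagged = [('William', 'NNP'), ('Shakespeare', 'NNP'), ('was', 'VBD'), ('an', 'DT'),
          ('English', 'JJ'), ('poet', 'NN'), ('playwright', 'NN'), ('and', 'CC'),
          ('actor', 'NN')]

# Keep only tokens tagged as singular proper nouns.
proper_nouns = [token for token, tag in tagged if tag == 'NNP']
print(proper_nouns)  # ['William', 'Shakespeare']

# Count occurrences of each tag.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(1))  # [('NN', 3)]
```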
