# <center> <font size = 24 color = 'steelblue'> <b>Text Classification using Naive Bayes Classifier

## Overview: 

The goal is to develop an NLP model by downloading and preparing a text corpus from NLTK. It involves acquiring data, extracting features, and building a model for a specific task. Finally, the model's performance is evaluated to ensure its effectiveness.

<div class="alert alert-block alert-info">
    
<font size = 4> 

**By the end of this notebook you will be able to:**
- Extract features from text
- Train a Naive Bayes classifier for basic text classification
- Evaluate the resulting text classification model

## Problem Description:
In this classification task, the goal is to predict the gender based on a person's given name. Names often carry certain linguistic patterns or features that can be indicative of gender, making it possible to develop a model that leverages these patterns to classify names accordingly. This problem is common in fields like social media analysis, marketing, or customer personalization, where predicting gender can help in better segmentation and targeted communication.

## Problem Statement:
You are provided with a dataset containing a list of names along with their corresponding gender labels. The objective is to build a machine learning model that can accurately predict the gender of a name that it has not seen before.

# <a id= 'c0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#c1)<br>
[2. Download the necessary corpus from NLTK](#c2)<br>
[3. Data acquisition](#c3)<br>
[4. Feature extraction](#c4)<br>
[5. Model development](#c5)<br>
[6. Evaluation](#c6)<br>
    

##### <a id = 'c1'>
<font size = 10 color = 'midnightblue'> <b>Import necessary packages

In [2]:
# We begin by importing the required libraries.
import nltk
import string
import random
import pandas as pd

##### <a id = 'c2'>
<font size = 10 color = 'midnightblue'> <b>Download necessary corpus and models from nltk

- We will use the "names" corpus from NLTK to build a simple model for gender classification based on names.
- The "names" corpus contains two text files: male.txt and female.txt, listing names commonly used for males and females.    

In [3]:
nltk.download("names")
nltk.download('product_reviews_1')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package names to /voc/work/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
[nltk_data] Downloading package product_reviews_1 to
[nltk_data]     /voc/work/nltk_data...
[nltk_data]   Unzipping corpora/product_reviews_1.zip.
[nltk_data] Downloading package punkt to /voc/work/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /voc/work/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
print(nltk.corpus.names.fileids())

['female.txt', 'male.txt']


[top](#c0)

##### <a id = 'c3'>
<font size = 10 color = 'midnightblue'> <b>Data acquisition

<div class="alert alert-block alert-success">
    
<font size = 4> 
    
- The names corpus contains two text files.
- `male.txt` contains a list of names most commonly used for males.
- `female.txt` contains a list of names most commonly used for females.

<font size = 5 color = seagreen> Here, we extract names from the NLTK corpus:

In [5]:
female_names = nltk.corpus.names.words('female.txt')
male_names = nltk.corpus.names.words('male.txt')

<font size = 5 color = seagreen> We then label the data as either **female** or **male** and combine them into a list of tuples:

In [6]:
labeled_data = ([(name, 'female') for name in female_names] +
                    [(name, 'male') for name in male_names])
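
Before extracting features, it can help to check how many examples carry each label, since a skewed class balance affects how accuracy should be read later. The sketch below does this on a small made-up list; `sample_labeled_data` is a hypothetical stand-in for `labeled_data`, not data from the corpus.

```python
from collections import Counter

# Hypothetical stand-in for the (name, label) tuples built above
sample_labeled_data = [('Ada', 'female'), ('Bob', 'male'),
                       ('Cara', 'female'), ('Dan', 'male'), ('Eve', 'female')]

# Count how many examples carry each label
label_counts = Counter(label for _, label in sample_labeled_data)
for label, count in label_counts.items():
    share = count / len(sample_labeled_data)
    print(f"{label}: {count} names ({share:.0%})")
```

Running the same counting step on `labeled_data` would show the balance of the real corpus.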

[top](#c0)

##### <a id = 'c4'>
<font size = 10 color = 'midnightblue'> <b> Feature Extraction

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Text data is unstructured and features need to be extracted in order to use it in ML models.
- Here features are identified manually as ***length***, ***first letter***, ***last letter***, ***count of each letter*** and ***count of vowels*** in the name.
- The function below extracts these features and returns them as a dictionary.

In [7]:
def getFeatures(name):
    # Lower casing
    name = name.lower()
    feature_dict = {}

    # Getting the features like length, first_letter, last_letter
    feature_dict['length'] = len(name)
    feature_dict['first_letter'] = name[0]
    feature_dict['last_letter'] = name[-1]

    feature_dict['vowels_count'] = 0

    # Count each letter of the alphabet and add up the vowels
    for char in string.ascii_lowercase:
        feature_dict[f'count_{char}'] = name.count(char)
        if char in 'aeiou':
            feature_dict['vowels_count'] += name.count(char)

    return feature_dict
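
To make the shape of the feature dictionary concrete, the sketch below applies the same logic to a single name and prints a few entries. `extract_features` is a hypothetical stand-alone copy written for this illustration, so the block runs without the rest of the notebook.

```python
import string

def extract_features(name):
    """Hypothetical stand-alone copy of the notebook's feature logic."""
    name = name.lower()
    features = {
        'length': len(name),
        'first_letter': name[0],
        'last_letter': name[-1],
        # Total number of vowel characters in the name
        'vowels_count': sum(name.count(v) for v in 'aeiou'),
    }
    # One count feature per letter of the alphabet
    for char in string.ascii_lowercase:
        features[f'count_{char}'] = name.count(char)
    return features

features = extract_features('Anna')
print(features['length'], features['first_letter'], features['last_letter'])
print(features['vowels_count'], features['count_n'], features['count_z'])
```

Every name yields the same set of keys (4 summary features plus 26 letter counts), which keeps the feature dictionaries comparable across names.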



<font size = 5 color = seagreen> <b> Transform names in the labeled data to these features using the above function.

In [8]:

new_lab_data= []
for name, label in labeled_data:
    features = getFeatures(name)
    new_lab_data.append((features, label))


##### <a id = 'c5'>
<font size = 10 color = 'midnightblue'> <b> Model development 

<font size = 5 color = seagreen> Before splitting the data into training and test sets, shuffle the labeled dataset:

In [9]:
random.shuffle(new_lab_data)

<font size = 5 color = seagreen> Select the first 1000 records for testing, and the remaining for training:

In [10]:
test_data = new_lab_data[:1000]
train_data = new_lab_data[1000:]
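
Because `random.shuffle` is unseeded here, every run produces a different split and therefore slightly different scores. If reproducibility matters, one option is a seeded shuffle, sketched below. The helper name `shuffle_and_split` and the seed value `42` are assumptions made for this illustration, and the toy list stands in for `new_lab_data`.

```python
import random

def shuffle_and_split(records, test_size, seed=42):
    """Return (test, train) after a seeded shuffle; the input list is left unchanged."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private generator with a fixed seed
    return shuffled[:test_size], shuffled[test_size:]

records = list(range(10))  # toy stand-in for new_lab_data
test_a, train_a = shuffle_and_split(records, test_size=3)
test_b, train_b = shuffle_and_split(records, test_size=3)
print(test_a == test_b)  # same seed, same split
```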

<font size = 5 color = seagreen> Preview of the first 5 rows of training data:

In [11]:
pd.DataFrame(train_data[:5], columns=['Features', 'Label'])

| | Features | Label |
|---|---|---|
| 0 | {'length': 10, 'first_letter': 'f', 'last_lett... | female |
| 1 | {'length': 8, 'first_letter': 'w', 'last_lette... | male |
| 2 | {'length': 4, 'first_letter': 'd', 'last_lette... | female |
| 3 | {'length': 6, 'first_letter': 'g', 'last_lette... | male |
| 4 | {'length': 5, 'first_letter': 'm', 'last_lette... | female |


<font size = 5 color = seagreen> Now, we can train a Naive Bayes classifier using NLTK’s built-in method:

In [12]:
# NaiveBayesClassifier is exposed at the top level of the nltk package
classifier = nltk.NaiveBayesClassifier.train(train_data)
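
To give a feel for what training does, here is a minimal pure-Python sketch of the Naive Bayes idea: estimate a prior for each label and a per-feature likelihood, then pick the label with the highest combined score. It is not NLTK's implementation; the function names, the add-one smoothing, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Collect the counts a Naive Bayes classifier needs from (features, label) pairs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature name) -> counts per value
    seen_values = defaultdict(set)        # feature name -> all values observed
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            seen_values[fname].add(fval)
    return label_counts, value_counts, seen_values

def classify(model, features):
    """Return the label with the highest log posterior score."""
    label_counts, value_counts, seen_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)                 # log prior
        for fname, fval in features.items():
            count = value_counts[(label, fname)][fval]
            n_values = len(seen_values[fname]) or 1
            # Add-one smoothing keeps unseen values from zeroing the product
            score += math.log((count + 1) / (n_label + n_values))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (hypothetical, not drawn from the names corpus)
toy_train = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'e'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'o'}, 'male'),
]
model = train_naive_bayes(toy_train)
print(classify(model, {'last_letter': 'a'}))
print(classify(model, {'last_letter': 'k'}))
```

The "naive" part is the assumption that features are independent given the label, which is why the per-feature log likelihoods can simply be added.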

##### <a id = 'c6'>
<font size = 6 color = 'midnightblue'> <b> Evaluation

<font size = 3 color = seagreen> To see how well the model performs, we first classify a single name and then the names held out in the test set.

**Classify a single input:**

In [13]:
classifier.classify(getFeatures('Johnny'))

'male'

<div class="alert alert-block alert-info">
<font size = 4> 
    
**Note :**
  - Before classification, the input text must be converted into the same feature format as the training data.
  - The same feature extraction function can be reused for this transformation.
    


<font size = 5 color = seagreen> <b> This classifier object can also be used to classify multiple text inputs at the same time.

<div class="alert alert-block alert-success">
<font size = 4> 

- To do so, pass the unlabeled feature sets to the classifier's `classify_many` method.
- The snippet below separates the labels from the feature-extracted list, leaving a list of feature sets that can be passed to `classify_many`.

In [14]:
test_features = []
test_labels = []
for feature_set, label in test_data:
    test_features.append(feature_set)
    test_labels.append(label)

<font size = 5 color = seagreen> <b> Obtain the classes for the test input.

In [15]:
test_labels_pred = classifier.classify_many(test_features)

[top](#c0)

## Preview of the actual and predicted labels for the test data:

In [22]:
# The Name column is omitted: labeled_data is still in its original order,
# while test_labels and test_labels_pred follow the shuffled new_lab_data,
# so indexing labeled_data would pair names with the wrong rows.
pd.DataFrame({'Actual Label': test_labels,
              'Predicted Label': test_labels_pred})


| | Actual Label | Predicted Label |
|---|---|---|
| 0 | female | female |
| 1 | female | female |
| 2 | male | male |
| 3 | female | female |
| 4 | female | female |
| ... | ... | ... |
| 995 | male | male |
| 996 | female | male |
| 997 | female | female |
| 998 | male | female |
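
Because the feature dictionaries do not store the original name, a preview that lists names next to labels needs the names to travel with the data through the shuffle. Indexing the unshuffled `labeled_data` would pair each name with another row's labels. A minimal sketch of one way to keep them aligned, assuming the `(name, label)` pairs are shuffled before feature extraction; `sample_labeled_data` and `last_letter_features` are hypothetical stand-ins.

```python
import random

# Hypothetical stand-ins for labeled_data and the feature function
sample_labeled_data = [('Anna', 'female'), ('Mark', 'male'),
                       ('Clara', 'female'), ('Otto', 'male')]

def last_letter_features(name):
    return {'last_letter': name[-1].lower()}

# Shuffle the (name, label) pairs first, so names stay attached to their labels
shuffled_pairs = list(sample_labeled_data)
random.Random(0).shuffle(shuffled_pairs)

# Derive names, features, and labels from the same shuffled order
names = [name for name, _ in shuffled_pairs]
feature_sets = [last_letter_features(name) for name, _ in shuffled_pairs]
labels = [label for _, label in shuffled_pairs]

for name, features, label in zip(names, feature_sets, labels):
    print(name, features, label)
```

With this ordering, slicing `names`, `feature_sets`, and `labels` at the same index gives test and training subsets whose rows still correspond.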


<font size = 5 color = seagreen> <b> Assess the model with standard classification metrics, such as the confusion matrix and accuracy.

<font size = 5 color = powderblue> <b> Confusion Matrix

In [16]:
for_matrix = pd.DataFrame({'pred' : test_labels_pred, 'act' : test_labels})

In [17]:
confusion_mat = pd.crosstab(for_matrix.pred, for_matrix.act)
confusion_mat

| pred \ act | female | male |
|---|---|---|
| female | 515 | 112 |
| male | 113 | 260 |


In [18]:
# Extract true/false positives and negatives, treating 'female' as the positive class.
# Rows of confusion_mat are predicted labels; columns are actual labels.
TP = confusion_mat.iloc[0,0]
TN = confusion_mat.iloc[1,1]
FP = confusion_mat.iloc[0,1]
FN = confusion_mat.iloc[1,0]

In [19]:
Accuracy = (TP + TN) / sum([TP, TN, FP, FN]) * 100
print(f"Accuracy : {Accuracy:0.2f} %")

Accuracy : 77.50 %
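
Accuracy alone does not show how the errors are spread across the two classes. As a sketch, precision, recall, and F1 for the 'female' class can be derived from the same four counts. The numbers below are copied from the confusion matrix output above, so they apply only to that particular run.

```python
# Counts read from the confusion matrix above, with 'female' as the positive class
TP, FP = 515, 112   # predicted female: actually female / actually male
FN, TN = 113, 260   # predicted male:   actually female / actually male

precision = TP / (TP + FP)   # share of 'female' predictions that are correct
recall = TP / (TP + FN)      # share of actual females that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision : {precision:.3f}")
print(f"Recall    : {recall:.3f}")
print(f"F1 score  : {f1:.3f}")
```

Swapping the roles of the two labels gives the corresponding scores for the 'male' class.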


<font size = 5 color = seagreen> <b> NLTK also provides `nltk.classify.accuracy`, which returns the accuracy as a fraction rather than a percentage.

In [20]:
## Accuracy on test data :
nltk.classify.accuracy(classifier, test_data)

0.775

<font size = 5 color = seagreen> <b> The NLTK Naive Bayes classifier can also list the `n` most informative features, i.e. the feature values that best separate the classes.

In [21]:
classifier.show_most_informative_features(n = 15)

Most Informative Features
             last_letter = 'a'            female : male   =     42.3 : 1.0
             last_letter = 'k'              male : female =     29.3 : 1.0
             last_letter = 'f'              male : female =     16.0 : 1.0
             last_letter = 'v'              male : female =     15.3 : 1.0
             last_letter = 'p'              male : female =     11.2 : 1.0
             last_letter = 'd'              male : female =     10.0 : 1.0
             last_letter = 'm'              male : female =      8.9 : 1.0
             last_letter = 'o'              male : female =      8.9 : 1.0
                 count_v = 2              female : male   =      8.4 : 1.0
             last_letter = 'r'              male : female =      7.2 : 1.0
                 count_w = 2                male : female =      5.1 : 1.0
             last_letter = 'w'              male : female =      5.1 : 1.0
                 count_a = 3              female : male   =      4.8 : 1.0
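
Each row compares how likely a feature value is under one label versus the other. For example, `last_letter = 'a'` with `female : male = 42.3 : 1.0` means that this value is about 42 times more likely for a female name than for a male name in the training data. The sketch below computes such a ratio from raw counts. The counts are hypothetical, and NLTK works with smoothed probability estimates, so its figures will not match a raw count ratio exactly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(value | label A) to P(value | label B), from raw counts."""
    return (count_a / total_a) / (count_b / total_b)

# Hypothetical counts, not taken from the names corpus:
# 400 of 1000 female names and 10 of 500 male names end in 'a'
ratio = likelihood_ratio(400, 1000, 10, 500)
print(f"last_letter = 'a'   female : male = {ratio:.1f} : 1.0")
```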

<div class="alert alert-block alert-info">
<font size = 4> 
    
**Note :**

**This model can be modified to be used for any labeled data with required data cleaning and preprocessing.**

- The NLTK Naive Bayes classifier expects data in a specific format: a list of tuples, each holding a feature dictionary and its label.
- Transform the data into this format before training or evaluating the classifier.
- `sklearn` classifiers may also be used, but they require the text data to be converted to a numerical format (discussed in later chapters).
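
As a rough illustration of what "numerical format" means for the feature dictionaries used here, the sketch below one-hot encodes string-valued features and passes numeric ones through unchanged. The helper names `build_vocabulary` and `vectorize` are hypothetical; scikit-learn provides `DictVectorizer` for this kind of conversion, and this pure-Python version only imitates the general idea.

```python
def build_vocabulary(feature_dicts):
    """Map each numeric feature and each (feature, string value) pair to a column index."""
    columns = set()
    for features in feature_dicts:
        for name, value in features.items():
            # String values get one column per distinct value (one-hot encoding)
            columns.add(f"{name}={value}" if isinstance(value, str) else name)
    return {column: index for index, column in enumerate(sorted(columns))}

def vectorize(features, vocabulary):
    """Turn one feature dictionary into a fixed-length list of numbers."""
    vector = [0] * len(vocabulary)
    for name, value in features.items():
        if isinstance(value, str):
            column = f"{name}={value}"
            if column in vocabulary:          # unseen values are simply ignored
                vector[vocabulary[column]] = 1
        elif name in vocabulary:
            vector[vocabulary[name]] = value
    return vector

feature_dicts = [{'length': 4, 'last_letter': 'a'},
                 {'length': 3, 'last_letter': 'k'}]
vocabulary = build_vocabulary(feature_dicts)
for features in feature_dicts:
    print(vectorize(features, vocabulary))
```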
