1. Corpus Loading and Splitting
We’ll use the Universal Declaration of Human Rights (UDHR) corpus that ships with NLTK. English is our target language and Farsi is our non-English sample. We split each text into sentences, label the sentences accordingly, and then perform an 80/20 random split for training and testing.

In [4]:
import nltk
import random
from nltk.corpus import udhr
from nltk.tokenize import sent_tokenize

# Download required resources if not already available
nltk.download('udhr')
nltk.download('punkt')
nltk.download('punkt_tab')

# Load the UDHR texts for English and Farsi (Persian)
english_text = udhr.raw('English-Latin1')
farsi_text = udhr.raw('Farsi_Persian-v2-UTF8')

# Split texts into sentences
# (sent_tokenize defaults to the English Punkt model, so Farsi boundaries are approximate)
english_sentences = sent_tokenize(english_text)
farsi_sentences = sent_tokenize(farsi_text)

# Label the sentences: English as 'English' and Farsi as 'Non-English'
data = [(sent, 'English') for sent in english_sentences] + [(sent, 'Non-English') for sent in farsi_sentences]

# Shuffle the data to randomize the order
random.shuffle(data)

# Split data into 80% training and 20% testing
split_point = int(0.8 * len(data))
train_data = data[:split_point]
test_data = data[split_point:]


[nltk_data] Downloading package udhr to
[nltk_data]     /Users/majidtavakoli/nltk_data...
[nltk_data]   Package udhr is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/majidtavakoli/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/majidtavakoli/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
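
Because the shuffle above is unseeded, each run produces a different split, and with a corpus this small a plain random split can leave very few Non-English sentences in the test set. A minimal sketch of a seeded, per-label (stratified) split follows. The function name `stratified_split`, its parameters, and the toy data are illustrative assumptions, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(data, train_frac=0.8, seed=42):
    """Split (text, label) pairs so each label keeps roughly train_frac in training."""
    rng = random.Random(seed)  # local generator: leaves the global random state alone
    by_label = defaultdict(list)
    for text, label in data:
        by_label[label].append((text, label))

    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = int(train_frac * len(items))
        train.extend(items[:cut])
        test.extend(items[cut:])

    # Mix the labels again so neither list is grouped by class
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy stand-in for the labelled UDHR sentences (hypothetical data)
toy_data = [(f"en {i}", "English") for i in range(10)] + \
           [(f"fa {i}", "Non-English") for i in range(5)]
toy_train, toy_test = stratified_split(toy_data)
print(len(toy_train), len(toy_test))
```

Passing the notebook's `data` list instead of `toy_data` would give a repeatable split in which both labels appear in the test set, provided each label has enough sentences.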


2. Preprocessing and Feature Extraction
We use a simple binary bag-of-words representation. The function below tokenizes the text and builds a dictionary whose keys are the lowercased tokens, each mapped to True to mark its presence. Word counts and word order are discarded.

In [5]:
def extract_features(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)
    # Create a feature dictionary: each word is a feature with value True
    return {word.lower(): True for word in words}

# Create feature sets for training and testing data
train_features = [(extract_features(text), label) for (text, label) in train_data]
test_features = [(extract_features(text), label) for (text, label) in test_data]
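
To make the feature representation concrete, here is a small sketch of the dictionary produced for one sentence. It swaps in a regular-expression tokenizer for `nltk.word_tokenize` (an assumption made so the snippet runs without NLTK data), so token boundaries may differ slightly from the notebook's. The name `extract_features_sketch` is hypothetical.

```python
import re

def extract_features_sketch(text):
    # Stand-in tokenizer: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercasing means "The" and "the" collapse into a single feature
    return {token.lower(): True for token in tokens}

features = extract_features_sketch("The people and the State.")
print(features)
```

Repeated tokens contribute only one key, which is why this representation records presence rather than frequency.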


3. Training the Naïve Bayes Classifier
We now train the classifier using NLTK’s built-in Naïve Bayes implementation.

In [6]:
# Train the Naïve Bayes classifier on the training features
classifier = nltk.NaiveBayesClassifier.train(train_features)

# Optionally, display the most informative features
classifier.show_most_informative_features(10)


Most Informative Features
                     the = None           Non-En : Englis =      6.8 : 1.0
                      to = None           Non-En : Englis =      6.8 : 1.0
                       ، = None           Englis : Non-En =      3.1 : 1.0
                      به = None           Englis : Non-En =      3.1 : 1.0
                       و = None           Englis : Non-En =      3.1 : 1.0
                      که = None           Englis : Non-En =      2.8 : 1.0
                      of = None           Non-En : Englis =      2.7 : 1.0
                     and = None           Non-En : Englis =      2.4 : 1.0
                      در = None           Englis : Non-En =      2.3 : 1.0
                everyone = None           Non-En : Englis =      2.2 : 1.0
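
In this output, `the = None` refers to sentences in which the feature `the` is absent, and the ratio compares how likely that absence is under each label. The sketch below reproduces that kind of ratio from hypothetical counts. It assumes add-0.5 (expected likelihood) smoothing over two possible feature values, which we take to be NLTK's default estimator; the counts are invented for illustration and are not taken from the notebook's data.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """P(value | label) with add-gamma smoothing over `bins` possible values."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical training counts (illustrative only, not the UDHR numbers)
english_total, english_without_the = 60, 8
non_english_total, non_english_without_the = 20, 20

p_absent_english = smoothed_prob(english_without_the, english_total)
p_absent_non_english = smoothed_prob(non_english_without_the, non_english_total)

# How much more likely is "the is absent" under Non-English than under English?
ratio = p_absent_non_english / p_absent_english
print(f"the = None   Non-En : Englis = {ratio:.1f} : 1.0")
```

Read this way, a high ratio for an absent English word is evidence for the Non-English label, and a high ratio for an absent Farsi word is evidence for English.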


4. Evaluation: Accuracy, Confusion Matrix, Precision, and Recall
After training, we evaluate the classifier’s performance on the test set. We calculate accuracy, generate a confusion matrix, and compute precision and recall for each class.

In [7]:
from nltk import ConfusionMatrix
from nltk.metrics import precision, recall

# Evaluate accuracy on the test set
accuracy = nltk.classify.accuracy(classifier, test_features)
print(f'Accuracy: {accuracy:.2f}')

# Get the actual and predicted labels for the test set
actual = [label for (_, label) in test_features]
predicted = [classifier.classify(features) for (features, _) in test_features]

# Create and print the confusion matrix
cm = ConfusionMatrix(actual, predicted)
print("Confusion Matrix:")
print(cm)

# Prepare sets for calculating precision and recall
# Each set contains the indices of test samples for a given label
actual_sets = {'English': set(), 'Non-English': set()}
predicted_sets = {'English': set(), 'Non-English': set()}

for index, (act, pred) in enumerate(zip(actual, predicted)):
    actual_sets[act].add(index)
    predicted_sets[pred].add(index)

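# Calculate precision and recall for each label
# (NLTK returns None when a label is never predicted or never occurs)
for label in ['English', 'Non-English']:
    p = precision(actual_sets[label], predicted_sets[label])
    r = recall(actual_sets[label], predicted_sets[label])
    p_text = 'n/a' if p is None else f'{p:.2f}'
    r_text = 'n/a' if r is None else f'{r:.2f}'
    print(f'{label} - Precision: {p_text}, Recall: {r_text}')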


Accuracy: 1.00
Confusion Matrix:
            |     N |
            |     o |
            |     n |
            |     - |
            |  E  E |
            |  n  n |
            |  g  g |
            |  l  l |
            |  i  i |
            |  s  s |
            |  h  h |
------------+-------+
    English |<14> . |
Non-English |  . <4>|
------------+-------+
(row = reference; col = test)

English - Precision: 1.00, Recall: 1.00
Non-English - Precision: 1.00, Recall: 1.00
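
As a cross-check, precision and recall can also be read directly off the confusion matrix counts. The sketch below recomputes them in plain Python from label lists matching the matrix above (14 English and 4 Non-English test sentences, all classified correctly). The helper name `precision_recall` is an assumption for illustration.

```python
def precision_recall(actual, predicted, label):
    """Return (precision, recall) for one label; None where a value is undefined."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)  # true positives
    fp = sum(a != label and p == label for a, p in pairs)  # false positives
    fn = sum(a == label and p != label for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# Label lists reconstructed from the confusion matrix counts
actual_labels = ['English'] * 14 + ['Non-English'] * 4
predicted_labels = list(actual_labels)  # every prediction was correct

results = {label: precision_recall(actual_labels, predicted_labels, label)
           for label in ['English', 'Non-English']}
print(results)
```

With only 18 test sentences, each misclassification would move accuracy by more than five percentage points, so the perfect scores here are best read alongside the size of the test set.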
