### Introduction
Naive Bayes classifiers are a family of simple probabilistic classifiers that often achieve acceptable accuracy. They apply Bayes' theorem under the assumption that the features (here, the words of a message) are conditionally independent given the class. In this project, we write a program that is trained on messages labeled as spam or non-spam (ham) and then decides, for each message in a list of unlabeled messages, whether it is spam.
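
For reference, the decision rule that a classifier of this kind typically applies can be written as follows. The notation is introduced here for illustration only: $c$ ranges over the two classes, $w_1, \dots, w_n$ are the words of a message, and the sum of logarithms follows from the independence assumption.

$$
\hat{c} = \arg\max_{c \in \{\text{spam},\, \text{ham}\}} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
$$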

### Importing Necessary Libraries

In [12]:
import math
import re
from collections import defaultdict

### Training the Classifier

The `train` function is responsible for training the text classifier on a given training dataset. It takes the path to the training file as an input and returns several variables that store information about the training data and the learned model.

Here's how the function works:

- Initialize variables to keep track of spam and ham counts, word counts, and the vocabulary.
- Open the training file and read each line.
- Split the line into the label (spam or ham) and the text.
- Extract individual words from the text using regular expressions.
- Update the spam or ham count based on the label and increment the word counts for the respective category.
- Update the vocabulary with the unique words found in the text.
- Return the obtained counts, word dictionaries, and vocabulary.

These counts are all the classifier needs: the class priors and the per-word probabilities used at prediction time are derived from them.
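
The steps above can be sketched on two in-memory lines instead of a file. The sample messages and the names `sample_lines`, `word_counts`, and `message_counts` are made up for illustration; only the tab-separated layout and the `\w+` tokenization come from the `train` function.

```python
import re
from collections import defaultdict

# Hypothetical training lines in the "label<TAB>text" layout that `train` reads.
sample_lines = [
    "spam\tWin a free prize now",
    "ham\tAre we still meeting now",
]

word_counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
message_counts = defaultdict(int)
vocabulary = set()

for line in sample_lines:
    label, text = line.strip().split("\t", 1)  # split on the first tab only
    words = re.findall(r"\w+", text)           # runs of letters, digits, underscores
    message_counts[label] += 1
    vocabulary.update(words)
    for word in words:
        word_counts[label][word] += 1

print(dict(message_counts))        # {'spam': 1, 'ham': 1}
print(word_counts["spam"]["now"])  # 1
print(len(vocabulary))             # 9
```

Note that this tokenization is case-sensitive, so `Win` and `win` would be counted as different words.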

In [3]:
def train(train_file):
    spam_count = 0
    ham_count = 0
    word_counts_spam = defaultdict(int)
    word_counts_ham = defaultdict(int)
    vocabulary = set()

    with open(train_file, 'r', encoding='utf-8') as file:
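        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so tabs inside the message are kept
            label, text = line.split('\t', 1)
            words = re.findall(r'\w+', text)
            vocabulary.update(words)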

            if label == 'spam':
                spam_count += 1
                for word in words:
                    word_counts_spam[word] += 1
            else:
                ham_count += 1
                for word in words:
                    word_counts_ham[word] += 1

    return spam_count, ham_count, word_counts_spam, word_counts_ham, vocabulary

### Calculating Probability

The `calculate_probability` function calculates the probability of a word occurring in a given category (spam or ham). It takes the word, the word counts dictionary for the relevant category, the total number of words in the category, the smoothing parameter alpha, and the vocabulary size as inputs.

Here's how the probability calculation is performed:

- Retrieve the count of the word from the word counts dictionary.
- Calculate the probability using the formula: `(word_count + alpha) / (total_words + alpha * vocabulary_size)`.
- Return the calculated probability.

This function is used during the classification process to estimate the likelihood of a word belonging to a specific category.
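
A small worked example of the formula, using made-up numbers, shows what the smoothing term does. The helper name `smoothed_probability` is hypothetical and takes the count directly rather than a dictionary.

```python
def smoothed_probability(word_count, total_words, alpha, vocabulary_size):
    # Same formula as in the text: (count + alpha) / (total + alpha * |V|)
    return (word_count + alpha) / (total_words + alpha * vocabulary_size)

# Hypothetical numbers: 10 word occurrences in the category, 5 distinct words overall.
seen = smoothed_probability(3, 10, 1, 5)    # (3 + 1) / (10 + 5)
unseen = smoothed_probability(0, 10, 1, 5)  # (0 + 1) / (10 + 5)

print(round(seen, 4))    # 0.2667
print(round(unseen, 4))  # 0.0667
```

With `alpha = 0`, the unseen word would get probability 0, and taking its logarithm with `math.log` would raise a `ValueError`.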

In [4]:
def calculate_probability(word, word_counts, total_words, alpha, vocabulary_size):
    # .get avoids inserting unseen words as new keys when word_counts is a defaultdict
    word_count = word_counts.get(word, 0)
    return (word_count + alpha) / (total_words + alpha * vocabulary_size)

### Classifying Text

The `classify_text` function is responsible for classifying a given text as spam or ham. It takes the text, the spam and ham counts, the word counts dictionaries, the vocabulary, and an optional smoothing parameter alpha as inputs.

Here's an overview of the classification process:

- Extract individual words from the text using regular expressions.
- Calculate the prior probabilities of spam and ham from the spam and ham message counts.
- Initialize the log probabilities for spam and ham to zero.
- For each word in the text:
   - If the word is present in the vocabulary:
     - Add the log of the word's smoothed probability in each category to that category's log probability.
- Add the log of each prior probability to the corresponding log probability.
- Choose the label with the higher log probability (a tie is labeled ham).
- Return the predicted label.

This function uses the trained model obtained from the training process to classify new texts.
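
The reason for summing logarithms rather than multiplying probabilities can be illustrated with a short sketch. The probability values below are made up; the point is only that a long product of small numbers underflows in floating point, while the sum of logs stays representable.

```python
import math

# Hypothetical per-word probabilities for a 200-word message.
word_probabilities = [0.01] * 200

product = 1.0
for p in word_probabilities:
    product *= p  # underflows to 0.0 before the loop ends

log_sum = sum(math.log(p) for p in word_probabilities)

print(product)            # 0.0
print(round(log_sum, 2))  # -921.03
```

Because the logarithm is monotonic, comparing the two log scores picks the same label as comparing the underlying products would.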

In [5]:
def classify_text(text, spam_count, ham_count, word_counts_spam, word_counts_ham, vocabulary, alpha=1):
    words = re.findall(r'\w+', text)
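    vocabulary_size = len(vocabulary)
    # Prior probability of each class, from the message counts
    spam_probability = spam_count / (spam_count + ham_count)
    ham_probability = ham_count / (spam_count + ham_count)
    # Total word occurrences per class: the denominator of the smoothed estimate
    total_spam_words = sum(word_counts_spam.values())
    total_ham_words = sum(word_counts_ham.values())
    spam_log_probability = 0
    ham_log_probability = 0

    for word in words:
        if word in vocabulary:
            spam_log_probability += math.log(calculate_probability(word, word_counts_spam, total_spam_words, alpha, vocabulary_size))
            ham_log_probability += math.log(calculate_probability(word, word_counts_ham, total_ham_words, alpha, vocabulary_size))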

    spam_log_probability += math.log(spam_probability)
    ham_log_probability += math.log(ham_probability)

    if spam_log_probability > ham_log_probability:
        return 'spam'
    else:
        return 'ham'

### Testing the Classifier

The `test` function applies the trained text classifier to a given test dataset. It takes the path to the test file, the output file, the spam and ham counts, the word counts dictionaries, and the vocabulary as inputs.

Here's an outline of the testing process:

1. Open the test file for reading and the output file for writing.
2. For each line in the test file:
   - Strip the line and extract the text.
   - Use the `classify_text` function to predict the label for the text.
   - Write the predicted label and the original text to the output file.
3. Both files are closed automatically when the `with` block exits.

This function produces predictions for new, unseen data. Comparing them with the true labels, where those are available, gives a measure of the classifier's performance.
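
Measuring performance requires known labels, which the test file described here does not contain. The following is a minimal sketch, assuming a separate list of true labels is available; `accuracy` is a hypothetical helper and not part of the original code.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the known labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Hypothetical labels, for illustration only.
predicted = ["spam", "ham", "ham", "spam"]
expected = ["spam", "ham", "spam", "spam"]
print(accuracy(predicted, expected))  # 0.75
```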

In [16]:
def test(test_file, output_file, spam_count, ham_count, word_counts_spam, word_counts_ham, vocabulary):
    # Distinct handle names avoid shadowing the test_file/output_file path arguments
    with open(test_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
        for line in infile:
            text = line.strip()
            label = classify_text(text, spam_count, ham_count, word_counts_spam, word_counts_ham, vocabulary)
            outfile.write(f'{label}\t{text}\n')

### Main Execution

The main execution code allows you to run the text classification pipeline by specifying the paths to the training file, test file, and output file.

Here's how the main execution works:

- Specify the paths to the training file, test file, and output file.
- Call the `train` function to train the text classifier on the training data.
- Call the `test` function to classify the texts in the test file and write the predicted labels to the output file.

Make sure to replace the placeholder file paths with the actual file paths before running the code.

This main execution code provides a convenient way to run the text classification pipeline with minimal setup.
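
The file layouts implied by the code can be sketched as follows: the training file holds one `label<TAB>text` pair per line, and the test file holds one unlabeled message per line. The messages and the temporary directory are assumptions made for illustration; only the file names match the placeholders used in this notebook.

```python
import tempfile
from pathlib import Path

# Hypothetical contents, written to a temporary directory.
workdir = Path(tempfile.mkdtemp())
training_path = workdir / "train_data"
test_path = workdir / "sample_test_data"

# Training file: one message per line, as "label<TAB>text".
training_path.write_text(
    "spam\tWin a free prize now\nham\tSee you at lunch\n", encoding="utf-8"
)
# Test file: one unlabeled message per line.
test_path.write_text("Free prize inside\nLunch tomorrow?\n", encoding="utf-8")

for line in training_path.read_text(encoding="utf-8").splitlines():
    label, text = line.split("\t", 1)
    print(label, "|", text)
```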

In [17]:
training_file = "train_data"
test_file = "sample_test_data"
output_file = "output_file"

# Training the classifier
spam_counts, ham_counts, spam_word_counts, ham_word_counts, vocabulary = train(training_file)

# Testing the classifier
test(test_file, output_file, spam_counts, ham_counts, spam_word_counts, ham_word_counts, vocabulary)

### Results
In evaluation, the classifier received a score of 100/100, which indicates that its **accuracy was 90 percent or higher**.