# Notebook E-tivity 3 CE4021 Task 2

Student name: Guillermo Alcantara Gonzalez

Student ID: 23123982

<hr style="border:2px solid gray">

## Imports

In [1]:
from CE4021.NaiveBayes import Classifier, confusion_matrix
import json

## Data

In [2]:
# Load the datasets from a JSON file.
# The top-level keys select a dataset ('previous', 'new'); within 'previous',
# the keys are the labels and the values are lists of emails.
with open('dataset.json') as f:
    dataset = json.load(f)

data_previous = dataset['previous']
data_new = dataset['new']
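
The loading cell relies on `dataset.json` having a particular shape. As a minimal sketch, a shape check could look like the following. The layout is an assumption inferred from how `data_previous['ham']` and `data_previous['spam']` are used later, and `check_labelled_emails` is a hypothetical helper, not part of `CE4021.NaiveBayes`.

```python
import json


def check_labelled_emails(data, labels=("ham", "spam")):
    """Raise ValueError unless data maps each label to a list of strings."""
    if not isinstance(data, dict):
        raise ValueError("expected a dict of label -> list of emails")
    for label in labels:
        emails = data.get(label)
        if not isinstance(emails, list) or not all(isinstance(e, str) for e in emails):
            raise ValueError(f"'{label}' must map to a list of strings")
    return True


# Hypothetical file content, reusing two emails that appear in the printed output.
sample = json.loads(
    '{"previous": {"ham": ["Your invoice is attached."], '
    '"spam": ["You are eligible for a loan."]}}'
)
check_labelled_emails(sample["previous"])
```

Running a check like this right after `json.load` surfaces a malformed file before the classifier is constructed.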

<hr style="border:2px solid gray">

# Task 2

Use the information below to create a Naive Bayes SPAM filter. Test your filter using the messages in new_emails. You may add as many cells as you require to complete the task.

# Naive Bayes Email Classifier
## Outline
1. **Data Preparation**: Define the `previous_ham` and `previous_spam` email datasets.
2. **Feature Extraction**: Tokenize the emails and create a vocabulary.
3. **Model Training**: Calculate the probabilities required for the Naive Bayes Classifier.
4. **Classification**: Use Bayes' Rule to classify emails in the `new_emails` dictionary.
5. **Evaluation**: Compare the classifier's decisions with the actual labels in `new_emails`.
6. **Learning from New Data**: Update the model based on the `new_emails`.
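
Steps 2 to 4 of the outline can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation in `CE4021.NaiveBayes`: the function names `tokenize`, `train`, and `classify`, the regex tokenizer, and the add-one smoothing constant `alpha` are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def train(ham, spam):
    """Count words per class, build the shared vocabulary, and compute priors."""
    counts = {"ham": Counter(), "spam": Counter()}
    for label, emails in (("ham", ham), ("spam", spam)):
        for email in emails:
            counts[label].update(tokenize(email))
    vocab = set(counts["ham"]) | set(counts["spam"])
    total = len(ham) + len(spam)
    priors = {"ham": len(ham) / total, "spam": len(spam) / total}
    return counts, vocab, priors


def classify(email, counts, vocab, priors, alpha=1.0):
    """Return the label with the larger log-posterior under Bayes' Rule."""
    scores = {}
    for label in ("ham", "spam"):
        n_words = sum(counts[label].values())
        score = math.log(priors[label])
        for word in tokenize(email):
            # Add-alpha smoothing keeps unseen words from zeroing the product.
            p = (counts[label][word] + alpha) / (n_words + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


ham = ["Are you coming to the meeting?", "Your invoice is attached."]
spam = ["Congratulations, you won a prize!", "You are eligible for a loan."]
model = train(ham, spam)
print(classify("You won a loan", *model))  # spam
```

Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on longer emails; the ranking of the two classes is unchanged.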

In [3]:
def main():
    # 1. Data preparation and training: the constructor receives the
    #    labelled ham and spam emails from the previous dataset.
    try:
        classifier = Classifier(data_previous['ham'], data_previous['spam'])
    except ValueError as e:
        print(f"Error initializing classifier: {e}")
        return

    # 2. Containers for the evaluation results.
    labels = ['ham', 'spam']
    actual_labels = []
    predicted_labels = []
    classifications = {label: [] for label in labels}

    # 3. Classification: predict a label for every email.
    #    Note: this loop iterates over data_previous, i.e. the emails the
    #    classifier was built from, not over data_new.
    for label, emails in data_previous.items():
        for email in emails:
            predicted_label = classifier.classify(email)
            actual_labels.append(label)
            predicted_labels.append(predicted_label)
            classifications[label].append(predicted_label)
            print(f"Email: {email}\n\t"
                  f"Actual: {label}\n\t"
                  f"Predicted: {predicted_label}\n")

    # 4. Evaluation: confusion matrix and overall accuracy.
    cm = confusion_matrix(actual_labels, predicted_labels, labels=labels)
    correct_classifications = sum(
        classified.upper() == label.upper()
        for label, results in classifications.items()
        for classified in results
    )
    total_classifications = sum(len(results) for results in classifications.values())
    accuracy = correct_classifications / total_classifications

    print(f"Classification Accuracy: {accuracy * 100:.2f}%")
    print(f"Confusion Matrix:\n{cm}")

    # 5. Learning from new data: update the classifier with data_new.
    try:
        classifier.update_and_learn(data_new)
    except (TypeError, ValueError) as e:
        print(f"Error updating classifier: {e}")
        return

    return classifications, accuracy, cm
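
The evaluation step in `main()` relies on `confusion_matrix` from `CE4021.NaiveBayes`, whose internals are not shown here. A minimal sketch of how such a matrix and the accuracy might be computed follows. The helper names `confusion_counts` and `accuracy_from` are hypothetical, and the rows-are-actual, columns-are-predicted convention is an assumption.

```python
def confusion_counts(actual, predicted, labels):
    """Return a matrix whose rows are actual labels and columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy_from(matrix):
    """Accuracy is the diagonal sum divided by the total number of predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


actual = ["ham", "ham", "spam", "spam"]
predicted = ["ham", "spam", "spam", "spam"]
cm = confusion_counts(actual, predicted, labels=["ham", "spam"])
print(cm)                 # [[1, 1], [0, 2]]
print(accuracy_from(cm))  # 0.75
```

Deriving accuracy from the matrix keeps the two reported numbers consistent with each other by construction.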

In [4]:
if __name__ == "__main__":
    main()

Email: Hey, how are you?
	Actual: ham
	Predicted: ham

Email: Are you coming to the meeting?
	Actual: ham
	Predicted: ham

Email: The project deadline is approaching.
	Actual: ham
	Predicted: ham

Email: Let's catch up soon!
	Actual: ham
	Predicted: ham

Email: Your invoice is attached.
	Actual: ham
	Predicted: ham

Email: Your activity report
	Actual: ham
	Predicted: ham

Email: benefits physical activity
	Actual: ham
	Predicted: ham

Email: the importance vows
	Actual: ham
	Predicted: ham

Email: Hey, how are you?
	Actual: ham
	Predicted: ham

Email: Are you coming to the meeting?
	Actual: ham
	Predicted: ham

Email: The project deadline is approaching.
	Actual: ham
	Predicted: ham

Email: Let's catch up soon!
	Actual: ham
	Predicted: ham

Email: Your invoice is attached.
	Actual: ham
	Predicted: ham

Email: Your activity report
	Actual: ham
	Predicted: ham

Email: Congratulations, you won a prize!
	Actual: spam
	Predicted: spam

Email: You are eligible for a loan.
	Actual: spam
	Pre

## Reflection

Reflections on the Naive Bayes classifier code:

### Design and Implementation:

- The overall design using OOP with a NaiveBayesClassifier class is clean and intuitive. The class encapsulates the data and methods nicely.
- The use of static methods such as `_validate_input` and `_count_words` improves readability by grouping utility functions.
- Docstrings are descriptive and follow PEP 257 and the Google docstring style.
- Type hints are included, which improves understandability. Please don't deduct marks for this import.
- Validation of inputs is done to fail fast on bad data.
- Word probabilities are smoothed using Laplace smoothing to avoid 0 probabilities.
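
To make the smoothing point concrete, the sketch below compares the raw frequency estimate with the add-one (Laplace) estimate for a word never seen in spam. The function name `word_probability`, the counts, and the vocabulary size are illustrative assumptions, not values taken from the classifier.

```python
from collections import Counter


def word_probability(word, counts, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(word | class); alpha=0 gives the raw frequency."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * vocab_size)


spam_counts = Counter({"prize": 2, "loan": 1, "you": 3})
vocab_size = 5  # hypothetical: the three spam words plus two seen only in ham

unsmoothed = word_probability("meeting", spam_counts, vocab_size, alpha=0.0)
smoothed = word_probability("meeting", spam_counts, vocab_size, alpha=1.0)
print(unsmoothed)  # 0.0
print(smoothed)    # 1/11, roughly 0.0909
```

Without smoothing, a single unseen word forces the whole class likelihood to zero, regardless of how strongly the remaining words point to that class.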

### Opportunities:
- Parts of the code should be separated into files.
    - `_update_word_count` could be moved outside the class as a standalone utility function.
    - Validation functions could be moved to a separate `utils.py` module to avoid cluttering the classifier code.
- A training method such as `fit()` could be added to encapsulate the training logic separately from initialization.
- Saving the trained model parameters to disk could be added to persist the learned model.
- Input validation could be made more rigorous.
- Tests, especially around edge cases, could be added to improve robustness.
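
Two of these ideas, a separate `fit()` method and persisting the model, can be sketched together. `TinyModel` is a hypothetical stand-in that only stores word counts; it does not mirror the real classifier's attributes, and JSON is just one possible storage format.

```python
import json
import os
import tempfile
from collections import Counter


class TinyModel:
    """Hypothetical stand-in for the classifier: it only holds word counts."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}

    def fit(self, ham, spam):
        """Training lives here, so construction stays cheap and re-fitting is explicit."""
        self.counts = {"ham": Counter(), "spam": Counter()}
        for label, emails in (("ham", ham), ("spam", spam)):
            for email in emails:
                self.counts[label].update(email.lower().split())
        return self

    def save(self, path):
        """Write the learned counts to disk as JSON."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.counts, f)

    @classmethod
    def load(cls, path):
        """Rebuild a model from counts previously written by save()."""
        model = cls()
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)
        model.counts = {label: Counter(words) for label, words in raw.items()}
        return model


model = TinyModel().fit(["see you soon"], ["win a prize"])
path = os.path.join(tempfile.mkdtemp(), "model.json")
model.save(path)
restored = TinyModel.load(path)
print(restored.counts == model.counts)  # True
```

JSON keeps the saved file human-readable and avoids the risks of unpickling untrusted data, at the cost of converting the counts back to `Counter` objects on load.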

### Future Considerations:

- Offer alternative smoothing techniques, such as Good-Turing smoothing, chosen based on the data.
- Declare configuration values in a config file.
- Compare performance against other classification algorithms.
- Add support for multi-class classification, not just binary ham/spam.
- Explore optimizations like removing stopwords to improve speed and memory usage for large datasets.
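
As an illustration of the stopword idea, a filter could be applied to the tokens before counting. The `STOPWORDS` set and the `remove_stopwords` helper below are hypothetical and deliberately tiny; a real list would be longer and should be validated against accuracy, since some common words may still help separate ham from spam.

```python
STOPWORDS = {"the", "a", "is", "to", "you", "are", "for"}  # illustrative, not exhaustive


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that are assumed to carry little class information."""
    return [t for t in tokens if t not in stopwords]


tokens = "you are eligible for a loan".split()
print(remove_stopwords(tokens))  # ['eligible', 'loan']
```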

Overall, the implementation follows sound OOP principles and has a good foundation. A few tweaks like the ones mentioned above could make it more robust and production-ready.

## Peer advice
I declare that no peer material was used. 