# **OPEN-ARC**
---

### Project 12: Spam Mail Classification Model:
**Challenge:** Create an AI model capable of classifying mail as either spam or ham (legitimate mail).


### Terms and Use:
Learn more about the project's [LICENSE](https://github.com/Infinitode/OPEN-ARC/blob/main/LICENSE) and read our [CODE_OF_CONDUCT](https://github.com/Infinitode/OPEN-ARC/blob/main/CODE_OF_CONDUCT) before contributing to the project. You can contribute to this project from here: [https://github.com/Infinitode/OPEN-ARC/](https://github.com/Infinitode/OPEN-ARC/).

---

Please fill out this performance sheet to help others quickly see your model's performance **(optional)**:

### Performance Sheet:
| Contributor | Architecture Type | Platform | Base Model | Dataset | Accuracy | Link |
|-------------|-------------------|----------|------------|---------|----------|------|
| Infinitode  | MultinomialNB  | Kaggle   | ✔  | Spam Mail Classifier Dataset | 98.5%    | [Notebook](https://github.com/Infinitode/OPEN-ARC/blob/main/Project-12/notebook.ipynb) |
| Username  | Unknown  | Kaggle   | ✗/✔  | Unknown | Score    | [Notebook](https://github.com) |

---

## 1. Data Processing

Drop rows with missing values, separate the features (message text) from the target (label), and create the train/test split. The text is then vectorized into token counts.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Load the dataset
# Make sure the file path is correct for your environment
try:
    df = pd.read_csv('/kaggle/input/spam-mail-classifier/sms_spam.csv')
except FileNotFoundError:
    print("Dataset not found. Please check the file path.")
    raise  # Re-raise: every step below needs `df`, so do not continue without it

# Handle missing values by dropping them
df.dropna(inplace=True)

# Separate features (text) and target (label)
X = df['text']
y = df['type']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text data using CountVectorizer
# This converts text documents into a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
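
To make the "matrix of token counts" concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is a simplification, not `CountVectorizer` itself: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy messages are made up for illustration, and no stop-word removal is applied. It does show why the vocabulary is learned from the training data only and then reused on the test data.

```python
import re
from collections import Counter

# Lowercase words of two or more characters (an assumption for this sketch).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split a message into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map every token seen in the given documents to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a row of token counts; unseen tokens are ignored."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for tok, count in Counter(tokenize(doc)).items():
            if tok in vocabulary:
                row[vocabulary[tok]] = count
        rows.append(row)
    return rows


# Toy data, invented for this example.
train_docs = ["free prize free entry", "call me later"]
test_docs = ["free call tomorrow"]

vocab = fit_vocabulary(train_docs)          # learned from training data only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)   # "tomorrow" is not in the vocabulary

print(sorted(vocab))   # ['call', 'entry', 'free', 'later', 'me', 'prize']
print(train_matrix)    # [[0, 1, 2, 0, 0, 1], [1, 0, 0, 1, 1, 0]]
print(test_matrix)     # [[1, 0, 1, 0, 0, 0]]
```

Fitting the vocabulary on the training split alone keeps information from the test split out of the features, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.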

## 2. Model Training

Train a simple Multinomial Naive Bayes model (`MultinomialNB`), which suits discrete features such as the token counts produced by `CountVectorizer`.

In [4]:
from sklearn.naive_bayes import MultinomialNB

# Initialize and train the model
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
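
For intuition about what the classifier learns, the following is a rough pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the function names (`train_nb`, `predict_nb`), the tokenizer, and the four toy messages are assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens of two or more characters (sketch assumption)."""
    return re.findall(r"\b\w\w+\b", text.lower())


def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        token_counts[label].update(tokenize(text))

    vocab = sorted({tok for counts in token_counts.values() for tok in counts})
    model = {"vocab": set(vocab), "log_prior": {}, "log_likelihood": {}}
    for label, doc_count in class_doc_counts.items():
        model["log_prior"][label] = math.log(doc_count / len(labels))
        # Smoothed denominator: class token total plus alpha per vocabulary entry.
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            tok: math.log((token_counts[label][tok] + alpha) / total)
            for tok in vocab
        }
    return model


def predict_nb(model, text):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    tokens = [tok for tok in tokenize(text) if tok in model["vocab"]]
    scores = {
        label: prior + sum(model["log_likelihood"][label][tok] for tok in tokens)
        for label, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


# Toy data, invented for this example.
texts = [
    "win a free prize now",
    "free cash claim now",
    "meeting tomorrow at ten",
    "see you at lunch",
]
labels = ["spam", "spam", "ham", "ham"]

toy_model = train_nb(texts, labels)
print(predict_nb(toy_model, "claim your free prize"))   # spam
print(predict_nb(toy_model, "lunch meeting tomorrow"))  # ham
```

Smoothing matters because a token that never appeared in one class during training would otherwise have probability zero and rule that class out entirely.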

## 3. Evaluation & Scores

Evaluate and score the trained model on the held-out test set.

In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Make predictions on the test data
y_pred = model.predict(X_test_vectorized)

# Define 'spam' as the positive label for metrics calculation
positive_label = 'spam'

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=positive_label, average='binary')
recall = recall_score(y_test, y_pred, pos_label=positive_label, average='binary')
f1 = f1_score(y_test, y_pred, pos_label=positive_label, average='binary')
conf_matrix = confusion_matrix(y_test, y_pred, labels=['ham', 'spam'])  # Fixed label order so rows/columns match the note printed below

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (for '{positive_label}'): {precision:.4f}")
print(f"Recall (for '{positive_label}'): {recall:.4f}")
print(f"F1 Score (for '{positive_label}'): {f1:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("Rows represent true labels (ham, spam), columns represent predicted labels (ham, spam).")

Accuracy: 0.9848
Precision (for 'spam'): 0.9615
Recall (for 'spam'): 0.9317
F1 Score (for 'spam'): 0.9464

Confusion Matrix:
[[948   6]
 [ 11 150]]
Rows represent true labels (ham, spam), columns represent predicted labels (ham, spam).


The model reaches an accuracy of 0.9848 on the held-out test set. According to the confusion matrix, it misses 11 of the 161 spam messages (recall 0.9317) and wrongly flags 6 of the 954 ham messages as spam.
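
As a cross-check, the reported scores can be recomputed by hand from the confusion matrix printed above. The short sketch below uses only those four counts, with spam as the positive class. The last line is an extra comparison not reported in the notebook: the accuracy of a baseline that labels every message as ham.

```python
# Counts taken from the confusion matrix above (rows: true ham, true spam).
tn, fp = 948, 6    # true ham:  predicted ham, predicted spam
fn, tp = 11, 150   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Baseline for comparison: always predict "ham".
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.4f}")           # 0.9848
print(f"Precision: {precision:.4f}")          # 0.9615
print(f"Recall:    {recall:.4f}")             # 0.9317
print(f"F1 Score:  {f1:.4f}")                 # 0.9464
print(f"Baseline:  {baseline_accuracy:.4f}")  # 0.8556
```

Because ham makes up most of the test set, precision and recall for the spam class say more about the model's usefulness than accuracy alone.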

## 4. Inference & Testing

Run the trained model on a small, predefined list of example messages.

In [10]:
# Example of a new email to classify
new_emails = [
    "Congratulations! You've won a free prize. Click the link to claim.", # Likely spam
    "Hi, just confirming our meeting for tomorrow at 10 AM. Thanks." # Likely not spam
]

# Vectorize the new emails using the same fitted vectorizer
new_emails_vectorized = vectorizer.transform(new_emails)

# Make predictions
predictions = model.predict(new_emails_vectorized)

for i, email in enumerate(new_emails):
    print(f"\nEmail: '{email}'")
    print(f"Prediction: {predictions[i]}")


Email: 'Congratulations! You've won a free prize. Click the link to claim.'
Prediction: spam

Email: 'Hi, just confirming our meeting for tomorrow at 10 AM. Thanks.'
Prediction: ham


## 5. Saving

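We use `joblib` to save the trained model and the fitted vectorizer to `.joblib` files (a binary, pickle-based format, not JSON). Both files are needed later, because new text must be vectorized with the same vocabulary before the model can classify it.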

In [11]:
import joblib

# Save the model to a file
joblib.dump(model, 'spam_classifier_model.joblib')
print("Model saved as 'spam_classifier_model.joblib'")

# Save the vectorizer to a file
joblib.dump(vectorizer, 'vectorizer.joblib')
print("Vectorizer saved as 'vectorizer.joblib'")

Model saved as 'spam_classifier_model.joblib'
Vectorizer saved as 'vectorizer.joblib'
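
Loading the saved files back is the mirror image of saving them. The sketch below is a self-contained round trip under stated assumptions: it fits a throwaway model on four made-up messages and writes to a temporary directory, so it does not touch the files saved above. In practice you would call `joblib.load` on `'spam_classifier_model.joblib'` and `'vectorizer.joblib'` directly. Since these files are pickle-based, only load ones from a source you trust.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus, only to give the sketch something to fit.
texts = [
    "win a free prize now",
    "free cash claim now",
    "meeting tomorrow at ten",
    "see you at lunch",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = str(Path(tmp_dir) / "spam_classifier_model.joblib")
    vectorizer_path = str(Path(tmp_dir) / "vectorizer.joblib")

    # Save both objects, then load them back as a fresh process would.
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

new_messages = ["claim your free prize"]

# The loaded pair should behave exactly like the original pair.
original_predictions = model.predict(vectorizer.transform(new_messages))
restored_predictions = loaded_model.predict(loaded_vectorizer.transform(new_messages))

print(restored_predictions[0])
```

Loading the vectorizer together with the model is essential: a vectorizer refitted on different text would assign different column indices, and the model's learned weights would no longer line up with the features.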


### The End:

This is the end of this project notebook. Experiment with it, and contribute to help improve the model and implementation. You can browse more of our free, open-source projects in our GitHub repository: https://github.com/Infinitode/OPEN-ARC. If you like this project, star the repo and contribute your own implementation, or help others in the community.

~ Infinitode