
# Task
Build a spam detection model using a synthetic dataset, apply text vectorization, train a Multinomial Naive Bayes classifier, evaluate its performance, and create a function to predict whether custom email messages are 'spam' or 'ham', then demonstrate it with example messages and summarize the project.

## Create Synthetic Dataset

### Subtask:
Generate a synthetic dataset containing example 'spam' and 'ham' email messages, as the 'fetch_20newsgroups' dataset is not directly suitable for spam detection. This will involve creating a Pandas DataFrame with 'text' and 'label' columns.


**Reasoning**:
The subtask requires generating a synthetic dataset with 'ham' and 'spam' email messages and storing them in a pandas DataFrame. This code block will import pandas, define example messages, create the DataFrame, and display its head.



In [10]:
import pandas as pd

# Create example 'ham' (non-spam) messages
ham_messages = [
    "Hi team, please find the Q3 report attached.",
    "Meeting at 10 AM in conference room B. Don't be late!",
    "Could you review my pull request by end of day? Thanks!",
    "Your Amazon order #12345 has been shipped.",
    "Recipe for a delicious chocolate cake: 2 cups flour, 1 cup sugar, etc.",
    "Reminder: Your doctor's appointment is scheduled for tomorrow at 2 PM.",
    "Monthly newsletter from your favorite tech blog. Read the latest news!",
    "Project update: We've successfully integrated the new module.",
    "Family dinner this Sunday at my place, hope you can make it!",
    "Regarding your query, please refer to the attached documentation."
]

# Create example 'spam' messages
spam_messages = [
    "Win a free iPhone now! Click this link to claim your prize!",
    "URGENT: Your account has been suspended. Verify your details here.",
    "Congratulations! You've won a cash prize of $1,000,000. Reply to claim.",
    "Enlarge your manhood by 5 inches with our revolutionary product!",
    "Nigerian Prince needs your help to transfer funds. Great returns guaranteed!",
    "Limited time offer: Get rich quick! Invest in our cryptocurrency scheme.",
    "You have a new message from a secret admirer. Click here to find out who!",
    "Claim your free vacation to a tropical island. Just pay handling fees.",
    "Your credit card has been compromised. Update your information immediately.",
    "Unlock exclusive discounts and coupons! Click this phishing link now!"
]

# Combine messages and create labels
texts = ham_messages + spam_messages
labels = ['ham'] * len(ham_messages) + ['spam'] * len(spam_messages)

# Create a pandas DataFrame
df = pd.DataFrame({
    'text': texts,
    'label': labels
})

# Display the first few rows of the DataFrame
print("Synthetic dataset created:")
print(df.head())

Synthetic dataset created:
                                                text label
0       Hi team, please find the Q3 report attached.   ham
1  Meeting at 10 AM in conference room B. Don't b...   ham
2  Could you review my pull request by end of day...   ham
3         Your Amazon order #12345 has been shipped.   ham
4  Recipe for a delicious chocolate cake: 2 cups ...   ham
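
Because the ham rows are stacked ahead of the spam rows, `df.head()` shows only 'ham' examples. A minimal sketch of a class-balance check and a reproducible shuffle follows; `toy_df` is a hypothetical miniature stand-in for the notebook's DataFrame, not the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the notebook's DataFrame (illustrative only)
toy_df = pd.DataFrame({
    "text": [
        "Meeting at 10 AM",
        "Q3 report attached",
        "Win a free iPhone now!",
        "Claim your prize!",
    ],
    "label": ["ham", "ham", "spam", "spam"],
})

# Count how many rows belong to each class
label_counts = toy_df["label"].value_counts()
print(label_counts)

# Shuffle the rows reproducibly so that head() shows a mix of classes
shuffled = toy_df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled.head())
```

The same two calls should work unchanged on `df`, since they rely only on the 'label' column.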


## Text Vectorization

### Subtask:
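Initialize and apply Scikit-Learn's `CountVectorizer` to transform the text data into numerical feature vectors. Note that `fit_transform` is applied here to the full dataset, before the train/test split, so the vocabulary is learned from all 20 messages rather than from the training data alone.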


**Reasoning**:
To perform text vectorization as per the subtask, I need to import `CountVectorizer` from `sklearn.feature_extraction.text`, initialize it, then use its `fit_transform` method on the 'text' column of the DataFrame to create numerical feature vectors, and finally, print the shape of the resulting matrix.



In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
vectorizer = CountVectorizer()

# Apply fit_transform to the 'text' column to convert text data into numerical feature vectors
X = vectorizer.fit_transform(df['text'])

# Print the shape of X to see the dimensions of the vectorized data
print("Shape of feature matrix X:", X.shape)

Shape of feature matrix X: (20, 152)


## Split Data

### Subtask:
Divide the vectorized dataset into training and testing sets using `train_test_split` from Scikit-Learn, typically with an 80/20 split for training and testing respectively.


**Reasoning**:
To prepare for model training, I will split the vectorized data and labels into training and testing sets using `train_test_split` with an 80/20 ratio for evaluation.



In [12]:
from sklearn.model_selection import train_test_split

# Define the target variable y
y = df['label']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets to verify the split
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (16, 152)
X_test shape: (4, 152)
y_train shape: (16,)
y_test shape: (4,)
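
With only four test rows, a plain random split can easily leave one class out of the test set. One option, sketched here on placeholder data rather than the notebook's `X` and `y`, is to pass `stratify` so that each side keeps the original class ratio.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the same 10/10 class balance as the notebook's dataset
texts = [f"ham message {i}" for i in range(10)] + [f"spam message {i}" for i in range(10)]
labels = ["ham"] * 10 + ["spam"] * 10

# stratify=labels preserves the 50/50 class ratio in both splits
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Train size:", len(train_texts))
print("Test ham:", test_labels.count("ham"), "| Test spam:", test_labels.count("spam"))
```

This sketch splits the raw text rather than an already vectorized matrix, which also allows a vectorizer to be fitted on the training portion only.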


## Train Multinomial Naive Bayes Model

### Subtask:
Instantiate and train a `MultinomialNB` classifier from Scikit-Learn using the training data.


**Reasoning**:
To train a Multinomial Naive Bayes classifier, I will import `MultinomialNB` from `sklearn.naive_bayes`, instantiate the model, and then fit it using the `X_train` (features) and `y_train` (labels) datasets.




In [13]:
from sklearn.naive_bayes import MultinomialNB

# Instantiate Multinomial Naive Bayes classifier
model = MultinomialNB()

# Train the model using the training data
model.fit(X_train, y_train)

print("Multinomial Naive Bayes model trained successfully.")

Multinomial Naive Bayes model trained successfully.
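
An alternative arrangement, sketched below under the assumption that the raw text is split before vectorization, chains `CountVectorizer` and `MultinomialNB` in a scikit-learn pipeline. Fitting the pipeline learns the vocabulary from the training messages only. The training messages here are illustrative, not the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative training messages (not the notebook's dataset)
train_texts = [
    "Please review the attached report",
    "Team meeting tomorrow morning",
    "Win a free prize now",
    "Claim your free cash prize",
]
train_labels = ["ham", "ham", "spam", "spam"]

# The pipeline vectorizes and classifies in one object
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

# Raw strings go in directly; words unseen during fit are ignored
predicted = spam_pipeline.predict(["free prize inside"])
print(predicted[0])
```

A pipeline also removes the need to keep a separate `vectorizer` variable in sync with the model at prediction time.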


## Evaluate Model

### Subtask:
Evaluate the trained model's performance on the test data by calculating and printing the accuracy score and generating a confusion matrix that counts true positives, true negatives, false positives, and false negatives.


**Reasoning**:
To evaluate the model's performance as per the subtask, I will first make predictions on the test set, then calculate the accuracy score and generate a confusion matrix using `sklearn.metrics` to assess the model's effectiveness.



In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix

# 1. Make predictions on the test data
y_pred = model.predict(X_test)

# 2. Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {accuracy:.2f}")

# 3. Generate and print the confusion matrix
# Passing labels explicitly fixes the row/column order to ['ham', 'spam'] and
# keeps the matrix 2x2 even if one class is missing from the test set.
cm = confusion_matrix(y_test, y_pred, labels=['ham', 'spam'])
print("\nConfusion Matrix:")
print(cm)

# Rows are actual labels and columns are predicted labels.
# Treating 'ham' as the negative class (index 0) and 'spam' as the positive class (index 1):
# cm[0,0]: True Negatives (Correctly predicted 'ham')
# cm[0,1]: False Positives (Actual 'ham' predicted as 'spam')
# cm[1,0]: False Negatives (Actual 'spam' predicted as 'ham')
# cm[1,1]: True Positives (Correctly predicted 'spam')

print("\nInterpretation of Confusion Matrix:")
print(f"True Negatives (Actual Ham, Predicted Ham): {cm[0,0]}")
print(f"False Positives (Actual Ham, Predicted Spam): {cm[0,1]}")
print(f"False Negatives (Actual Spam, Predicted Ham): {cm[1,0]}")
print(f"True Positives (Actual Spam, Predicted Spam): {cm[1,1]}")

Accuracy Score: 1.00

Confusion Matrix:
[[2 0]
 [0 2]]

Interpretation of Confusion Matrix:
True Negatives (Actual Ham, Predicted Ham): 2
False Positives (Actual Ham, Predicted Spam): 0
False Negatives (Actual Spam, Predicted Ham): 0
True Positives (Actual Spam, Predicted Spam): 2
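
An accuracy of 1.00 on four test messages says little about how the model generalizes. As a rough sketch, stratified k-fold cross-validation reuses every message for testing once; the ten messages and the choice of five folds below are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative messages (not the notebook's dataset): 5 ham, 5 spam
texts = [
    "Please review the attached report",
    "Team meeting tomorrow at 10 AM",
    "Your order has been shipped",
    "Family dinner this Sunday",
    "Project update for the new module",
    "Win a free prize now",
    "Claim your free cash prize",
    "Urgent: verify your account here",
    "Get rich quick with this offer",
    "Click here for exclusive discounts",
]
labels = ["ham"] * 5 + ["spam"] * 5

# Vectorizer and classifier are refit inside each fold, so no test text leaks into training
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

On a dataset this small the per-fold scores will vary widely, which is itself a useful signal that the single-split accuracy should be read with caution.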


## Create Prediction Function

### Subtask:
Develop a Python function that takes a custom email message as input, vectorizes it using the previously fitted `CountVectorizer`, and then uses the trained `MultinomialNB` model to predict whether it is 'spam' or 'ham'.


**Reasoning**:
To create the prediction function, I will define a Python function that takes an email message, uses the pre-fitted `CountVectorizer` to transform the message, and then uses the trained `MultinomialNB` model to predict its label ('spam' or 'ham').



In [15]:
def predict_spam_ham(message):
    # Vectorize the input message using the fitted CountVectorizer
    message_vectorized = vectorizer.transform([message])

    # Predict the label using the trained Multinomial Naive Bayes model
    prediction = model.predict(message_vectorized)

    # Return the predicted label
    return prediction[0]

print("Prediction function 'predict_spam_ham' created successfully.")

Prediction function 'predict_spam_ham' created successfully.
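
A possible variant, sketched here with a hypothetical helper named `predict_with_confidence`, also returns the model's estimated probability for the predicted class and takes the vectorizer and model as arguments instead of reading globals. The toy training data is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def predict_with_confidence(message, vectorizer, model):
    """Return (label, probability) for a single message (hypothetical helper)."""
    features = vectorizer.transform([message])
    probabilities = model.predict_proba(features)[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])


# Illustrative training data (not the notebook's dataset)
train_texts = [
    "Please review the attached report",
    "Team meeting tomorrow morning",
    "Win a free prize now",
    "Claim your free cash prize",
]
train_labels = ["ham", "ham", "spam", "spam"]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(train_texts), train_labels)

label, confidence = predict_with_confidence("free prize", toy_vectorizer, toy_model)
print(label, round(confidence, 2))

# A message with no known words carries no evidence, so the model falls back to the class priors
unknown_label, unknown_confidence = predict_with_confidence("zzz qqq", toy_vectorizer, toy_model)
print(unknown_label, round(unknown_confidence, 2))
```

Returning the probability makes it visible when a prediction rests on little or no vocabulary overlap with the training messages.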


## Test with Custom Messages

### Subtask:
Demonstrate the prediction function by testing it with custom-written spam and ham messages to show how it handles text that is not part of the dataset.


**Reasoning**:
To demonstrate the prediction function, I define one custom 'ham' message and one custom 'spam' message, pass each to `predict_spam_ham`, and print the predicted label. The earlier version that looped over several messages is kept commented out for reference.



In [23]:
print("Demonstrating the prediction function with single custom messages:")

# Test with a single custom 'ham' message
single_ham_message = "Please confirm, ASAP, so we don't loose the window. NOthing is changed yet. Reply when you see this."
prediction_ham = predict_spam_ham(single_ham_message)
print(f"\nMessage: '{single_ham_message}' -> Predicted: {prediction_ham}")

# Test with a single custom 'spam' message
single_spam_message = "Your account has been suspended, verify imediatelly to avoid permanent lock. Bit.dy/secure-now msg & data rates may apply. Stop to cancel."
prediction_spam = predict_spam_ham(single_spam_message)
print(f"Message: '{single_spam_message}' -> Predicted: {prediction_spam}")



# The original code for multiple custom messages is commented out below:
# custom_ham_messages = [
#     "Hello, please review the attached document at your convenience.",
#     "Reminder: Team meeting is scheduled for tomorrow at 10 AM."
# ]
# custom_spam_messages = [
#     "URGENT! You've won a lottery! Click here to claim your prize!",
#     "Free money! Limited time offer. Act now!"
# ]
# print("\n--- Testing with custom HAM messages ---")
# for message in custom_ham_messages:
#     prediction = predict_spam_ham(message)
#     print(f"Message: '{message}' -> Predicted: {prediction}")
# print("\n--- Testing with custom SPAM messages ---")
# for message in custom_spam_messages:
#     prediction = predict_spam_ham(message)
#     print(f"Message: '{message}' -> Predicted: {prediction}")

Demonstrating the prediction function with single custom messages:

Message: 'Please confirm, ASAP, so we don't loose the window. NOthing is changed yet. Reply when you see this.' -> Predicted: ham
Message: 'Your account has been suspended, verify imediatelly to avoid permanent lock. Bit.dy/secure-now msg & data rates may apply. Stop to cancel.' -> Predicted: spam
