## Machine learning
### We train a model on labeled spam and ham emails so it can learn patterns and classify new messages.
#### This is supervised learning for text classification, using the Multinomial Naive Bayes algorithm.
#### Observations: each email/message in the `texts` list.
#### Labels/target: each value in `labels` (spam or ham).
#### Features: TF-IDF vectors computed from the text.


In [109]:
from sklearn.model_selection import train_test_split  # Split data into train/test sets
from sklearn.feature_extraction.text import TfidfVectorizer  # Turn text into TF-IDF features
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes classifier for text
from sklearn.pipeline import Pipeline  # Chain vectorizer + model
from sklearn.metrics import classification_report  # Metrics summary for predictions
import pandas as pd  # Tabular display


In [110]:
texts = ["win a free prize now", "congratulations you have won a prize",
    "The IRS is Trying to Contact You", " You Have a Refund Coming", "Fake Subscription Renewal",
    "lets have launch tomorrow", "sara can you send me the report please",
    " Low-Interest Credit Card Offers", "Verify Your Apple iCloud ID",
    "meeting moved to 3pm today", "can you review the draft report?",
    "lunch at 12?", "please see attached invoice for your records"]  # All 13 text samples (split into train/test later)
labels = ["spam", "spam",
          "spam", "spam", "spam",
          "ham", "ham",
          "spam", "spam",
          "ham", "ham",
          "ham", "ham"]  # Label for each text, in the same order as texts


In [111]:
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)  # Stratified split: train and test keep roughly the same spam/ham ratio

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # Convert text to TF-IDF features
    ('clf', MultinomialNB()),  # Train Naive Bayes classifier
])


In [112]:
model = pipeline.fit(X_train, y_train)  # Train the pipeline on the training split
pred = model.predict(X_test)  # Predict labels for the test split
print(classification_report(y_test, pred))  # Show precision/recall/F1 for each class


              precision    recall  f1-score   support

         ham       0.50      1.00      0.67         2
        spam       1.00      0.33      0.50         3

    accuracy                           0.60         5
   macro avg       0.75      0.67      0.58         5
weighted avg       0.80      0.60      0.57         5
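
The per-class numbers in this report can be traced back to a confusion matrix. The sketch below is a reconstruction, not the notebook's actual `y_test` and `pred`: the ordering of `y_true` and `y_hat` is hypothetical and chosen only so that the counts agree with the report (both ham messages predicted as ham, one of the three spam messages predicted as spam).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical ordering of the 5 test messages, consistent with the report's counts
y_true = ["ham", "ham", "spam", "spam", "spam"]
y_hat = ["ham", "ham", "spam", "ham", "ham"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=["ham", "spam"])
print(cm)

ham_precision = precision_score(y_true, y_hat, pos_label="ham")  # 2 correct of 4 predicted ham
spam_recall = recall_score(y_true, y_hat, pos_label="spam")      # 1 found of 3 true spam
accuracy = accuracy_score(y_true, y_hat)                         # 3 correct of 5
print(ham_precision, spam_recall, accuracy)
```

Two of the three spam messages are predicted as ham, which explains both the low spam recall (1/3) and the low ham precision (2/4). With only five test messages, each single prediction moves these scores substantially.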



In [113]:
for text in texts:  # Loop through every example (training and test messages alike)
    print(text, "->", model.predict([text])[0])  # Predict and print its label


win a free prize now -> spam
congratulations you have won a prize -> spam
The IRS is Trying to Contact You -> spam
 You Have a Refund Coming -> spam
Fake Subscription Renewal -> ham
lets have launch tomorrow -> ham
sara can you send me the report please -> ham
 Low-Interest Credit Card Offers -> spam
Verify Your Apple iCloud ID -> ham
meeting moved to 3pm today -> ham
can you review the draft report? -> ham
lunch at 12? -> ham
please see attached invoice for your records -> ham
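
Most of the messages in this listing were also used for training, so it tends to overstate how well the model generalizes. One alternative, sketched here as an assumption rather than something the notebook does, is stratified k-fold cross-validation, where every message is predicted by a model that never saw it. The names `cv_pipeline`, `cv`, and `cv_pred` and the choice of 3 folds are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same 13 messages and labels as in the notebook, repeated so the sketch runs alone
texts = ["win a free prize now", "congratulations you have won a prize",
         "The IRS is Trying to Contact You", " You Have a Refund Coming",
         "Fake Subscription Renewal", "lets have launch tomorrow",
         "sara can you send me the report please",
         " Low-Interest Credit Card Offers", "Verify Your Apple iCloud ID",
         "meeting moved to 3pm today", "can you review the draft report?",
         "lunch at 12?", "please see attached invoice for your records"]
labels = ["spam", "spam", "spam", "spam", "spam", "ham", "ham",
          "spam", "spam", "ham", "ham", "ham", "ham"]

cv_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # refit on each training fold, so no test leakage
    ("clf", MultinomialNB()),
])

# 3 folds (an assumed choice); stratification keeps both classes in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_pred = cross_val_predict(cv_pipeline, texts, labels, cv=cv)

for text, true_label, predicted in zip(texts, labels, cv_pred):
    print(f"{text!r}: true={true_label}, predicted={predicted}")
print("cross-validated accuracy:", accuracy_score(labels, cv_pred))
```

Because the vectorizer sits inside the pipeline, its vocabulary is learned only from each training fold, which keeps the held-out messages unseen.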


In [114]:
new_texts = [
"Congratulations! You've won a free ticket to Bahamas. Click here to claim your prize.",
"Reminder: Your appointment with Dr. Smith is scheduled for tomorrow at 10 AM.",
"Get paid to work from home! Sign up now and start earning money quickly.",
"Don't forget to submit your project report by the end of the week.",
]  # New messages to classify
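
The notebook defines `new_texts` but does not show a cell that classifies them. A minimal sketch of that step follows, assuming the same data, split, and pipeline as the earlier cells, all repeated here so the block runs on its own. The names `new_pred` and `new_proba` are illustrative, and no output is shown because the predictions depend on the fitted model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["win a free prize now", "congratulations you have won a prize",
         "The IRS is Trying to Contact You", " You Have a Refund Coming",
         "Fake Subscription Renewal", "lets have launch tomorrow",
         "sara can you send me the report please",
         " Low-Interest Credit Card Offers", "Verify Your Apple iCloud ID",
         "meeting moved to 3pm today", "can you review the draft report?",
         "lunch at 12?", "please see attached invoice for your records"]
labels = ["spam", "spam", "spam", "spam", "spam", "ham", "ham",
          "spam", "spam", "ham", "ham", "ham", "ham"]

# Same split parameters as the notebook's training cell
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
]).fit(X_train, y_train)

new_texts = [
    "Congratulations! You've won a free ticket to Bahamas. Click here to claim your prize.",
    "Reminder: Your appointment with Dr. Smith is scheduled for tomorrow at 10 AM.",
    "Get paid to work from home! Sign up now and start earning money quickly.",
    "Don't forget to submit your project report by the end of the week.",
]

new_pred = model.predict(new_texts)         # one label per new message
new_proba = model.predict_proba(new_texts)  # one probability per class per message
classes = model.named_steps["clf"].classes_  # column order of new_proba

for text, label, probs in zip(new_texts, new_pred, new_proba):
    scores = ", ".join(f"{c}={p:.2f}" for c, p in zip(classes, probs))
    print(f"{label:>4} ({scores}) <- {text}")
```

Words in the new messages that never appeared in the training split are ignored by the fitted vectorizer, so with a training set this small the predictions lean heavily on a handful of shared words.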


In [120]:
summary.Description


0                          Each email/message in texts
1                   Each entry in labels (spam or ham)
2    TF-IDF vectors from TfidfVectorizer (term weig...
Name: Description, dtype: object

In [124]:
summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Item         3 non-null      object
 1   Description  3 non-null      object
dtypes: object(2)
memory usage: 180.0+ bytes
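
The cell that creates `summary` is not shown above. The sketch below is one way such a frame could be built. It assumes the `Item` values mirror the terms used at the top of the notebook (observations, labels/target, features); those values are hypothetical, since the outputs show only the `Description` column. The third description is truncated in the output, so only its visible part is used.

```python
import pandas as pd

summary = pd.DataFrame({
    # Assumed item names; the original values are not visible in the outputs
    "Item": ["Observations", "Labels/target", "Features"],
    "Description": [
        "Each email/message in texts",
        "Each entry in labels (spam or ham)",
        # The original text continues past this point but is cut off in the output
        "TF-IDF vectors from TfidfVectorizer",
    ],
})

print(summary.Description)
summary.info()
```

If long descriptions are cut off with `...` when displayed, calling `pd.set_option("display.max_colwidth", None)` before printing shows them in full.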
