# Implementation: Spam Filter (TF-IDF)

**Goal**: Build a small spam/ham classifier with classic NLP tools: TF-IDF features fed into a Multinomial Naive Bayes model.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# 1. Mock Data
corpus = [
    "Win a free iPhone now! Click here.",
    "Hey, are we still meeting for lunch?",
    "URGENT! You have won a lottery.",
    "Can you review the code PR?"
]
labels = [1, 0, 1, 0]  # 1 = Spam, 0 = Ham

# 2. Vectorize (Text -> Numbers)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Inspect the learned vocabulary (lowercased tokens) and the matrix shape (documents x terms)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Shape:", X.shape)

# 3. Train Classifier (Multinomial Naive Bayes works well with sparse text features)
clf = MultinomialNB()
clf.fit(X, labels)

# 4. Predict (reuse the fitted vectorizer: transform, not fit_transform)
new_email = ["Free lottery ticket inside"]
X_new = vectorizer.transform(new_email)
pred = clf.predict(X_new)
print(f"Prediction for '{new_email[0]}': {'Spam' if pred[0]==1 else 'Ham'}")
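
The vectorize-then-classify steps above can also be bundled into a single estimator. The following is a minimal sketch, assuming scikit-learn's `make_pipeline` helper and the same mock data; the name `spam_model` is illustrative and not part of the original notebook. It uses `predict_proba` to show how confident the model is, rather than only the hard label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same mock data as in the cell above
corpus = [
    "Win a free iPhone now! Click here.",
    "Hey, are we still meeting for lunch?",
    "URGENT! You have won a lottery.",
    "Can you review the code PR?",
]
labels = [1, 0, 1, 0]  # 1 = Spam, 0 = Ham

# The pipeline takes raw strings and applies TF-IDF before Naive Bayes
spam_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_model.fit(corpus, labels)

new_email = ["Free lottery ticket inside"]
proba = spam_model.predict_proba(new_email)[0]

# Probability columns follow spam_model.classes_ (here: 0 = Ham, 1 = Spam)
for cls, p in zip(spam_model.classes_, proba):
    print(f"P(class={cls}) = {p:.3f}")
```

Keeping both steps in one object means new text always passes through the vocabulary fitted on the training data, which avoids accidentally calling `fit_transform` on unseen emails.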

## Conclusion
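The tokens "free" and "lottery" (the vectorizer lowercases text by default) appear only in the spam training messages, so Multinomial Naive Bayes assigns them a higher likelihood under the Spam class than under the Ham class and predicts Spam for the new email. TF-IDF weights are computed per document, not per class; the class association comes from the classifier. The remaining words, "ticket" and "inside", are not in the training vocabulary and are ignored at prediction time.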
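
One way to check this claim is to inspect the fitted model directly. The sketch below assumes the same mock data as above and compares the per-class log-likelihoods that `MultinomialNB` stores in `feature_log_prob_`. The helper name `class_log_probs` is hypothetical, introduced only for illustration. Tokens missing from the training vocabulary come back as `None`, since the vectorizer drops them at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same mock data as in the notebook cell
corpus = [
    "Win a free iPhone now! Click here.",
    "Hey, are we still meeting for lunch?",
    "URGENT! You have won a lottery.",
    "Can you review the code PR?",
]
labels = [1, 0, 1, 0]  # 1 = Spam, 0 = Ham

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(corpus), labels)


def class_log_probs(token):
    """Return {class_label: log P(token | class)}, or None if out of vocabulary."""
    idx = vectorizer.vocabulary_.get(token)
    if idx is None:
        return None
    return {
        int(cls): float(clf.feature_log_prob_[row, idx])
        for row, cls in enumerate(clf.classes_)
    }


# "free" and "lottery" should score higher under class 1 (Spam);
# "ticket" and "inside" were never seen during training.
for token in ["free", "lottery", "ticket", "inside"]:
    print(token, class_log_probs(token))
```

A larger value for class 1 than for class 0 means the token pushes the prediction toward Spam.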