# **SMS Spam Collection**

Build a model that classifies SMS messages as spam or legitimate
(ham). Represent the text with TF-IDF or word embeddings, then train
classifiers such as Naive Bayes, Logistic Regression, or Support Vector
Machines to identify spam messages.

**Dataset URL -** https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv

**1. Import Libraries and Load data**

In [1]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [26]:
url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
data = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

**2. Encode Labels and Fill Missing values**

In [27]:
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
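
One detail of `Series.map` with a dictionary is worth checking: any label that is not a key in the dictionary becomes `NaN` instead of raising an error. The sketch below illustrates this on a small toy frame. The frame and its mis-cased `'Spam'` row are hypothetical and are not taken from the SMS dataset.

```python
import pandas as pd

# Toy frame standing in for the SMS data. The 'Spam' row is a deliberately
# mis-cased, hypothetical label used to show what map() does with unknown keys.
toy = pd.DataFrame({
    'label': ['ham', 'spam', 'Spam', 'ham'],
    'message': ['see you soon', 'WIN a prize now', 'free entry', 'ok'],
})

mapped = toy['label'].map({'ham': 0, 'spam': 1})

# Labels missing from the dictionary are silently turned into NaN,
# so counting NaN values is a cheap sanity check after encoding.
n_unmapped = int(mapped.isna().sum())
print(f"unmapped labels: {n_unmapped}")
```

Running the same `isna().sum()` check on `data['label']` after the mapping would confirm that every row received a 0 or 1.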

In [28]:
data['message'] = data['message'].fillna('')

**3. Split the data**

In [29]:
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size=0.2, random_state=42)
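
The split above is random and not stratified. Because spam is the minority class, one possible variant is to pass `stratify` so that both splits keep the same class ratio. The sketch below shows the effect on hypothetical toy data (80 ham, 20 spam). It is not the split used in this notebook, and switching to it would change the reported scores.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 ham (0) and 20 spam (1).
toy = pd.DataFrame({
    'message': [f'msg {i}' for i in range(100)],
    'label': [0] * 80 + [1] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy['message'], toy['label'],
    test_size=0.2, random_state=42,
    stratify=toy['label'],  # preserve the 80/20 class ratio in both splits
)

# With stratification, the spam share is the same in train and test.
print(y_tr.mean(), y_te.mean())
```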

**4. Convert all messages to string type**

In [30]:
X_train = X_train.astype(str)
X_test = X_test.astype(str)

**5. Text preprocessing and TF-IDF feature extraction**

In [31]:
tfidf = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=1)

In [32]:
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
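
The vectorizer is fitted on the training messages only and then reused to transform the test messages, so the vocabulary and IDF weights never see test data. The following minimal sketch, using made-up toy messages, shows two consequences: both matrices share the same columns, and a test message made only of unseen words becomes an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, not taken from the SMS dataset.
train_msgs = ['free prize tonight', 'dinner tonight maybe', 'prize winner announced']
test_msgs = ['claim your free prize', 'unseenword']

vec = TfidfVectorizer(stop_words='english')
train_mat = vec.fit_transform(train_msgs)  # learns vocabulary and IDF weights
test_mat = vec.transform(test_msgs)        # reuses them, learns nothing new

# Same number of columns: one per term learned from the training messages.
print(train_mat.shape, test_mat.shape)

# A message containing only out-of-vocabulary words has no non-zero features.
unseen_row_sum = float(test_mat.toarray()[1].sum())
print(unseen_row_sum)
```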

**6. Define classifiers**

In [33]:
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Support Vector Machine': SVC()
}
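
As an alternative to managing the vectorizer and classifier separately, the two can be chained with scikit-learn's `make_pipeline`, which accepts raw text directly at prediction time. This is a sketch on hypothetical toy messages using default settings, not the configuration evaluated in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages: 1 = spam, 0 = ham.
msgs = [
    'win free cash prize now',
    'free prize claim now',
    'cash prize winner call now',
    'see you at lunch',
    'are we meeting tomorrow',
    'lunch tomorrow sounds good',
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(msgs, labels)

# Raw strings go in; the pipeline applies TF-IDF before classifying.
pred = pipe.predict(['free cash prize', 'lunch tomorrow'])
print(list(pred))
```

Any estimator from the `classifiers` dictionary could take the place of `MultinomialNB()` in the pipeline.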

**7. Train, Evaluate models and Print Results**

In [34]:
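results = {}
for name, clf in classifiers.items():
    # Fit each classifier on the TF-IDF features of the training messages
    clf.fit(X_train_tfidf, y_train)
    # Predict labels for the held-out test messages
    y_pred = clf.predict(X_test_tfidf)
    # Store the report as a nested dict so it can be printed or compared later
    report = classification_report(y_test, y_pred, output_dict=True)
    results[name] = report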
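
Because each report is a nested dictionary, the spam-class metrics can be pulled into a single table for side-by-side comparison. The sketch below assumes integer labels with 1 meaning spam, which `classification_report` exposes under the string key `'1'`. The model names and predictions are hypothetical toy values, not results from this notebook.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical toy labels and predictions (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_preds = {
    'model_a': [0, 0, 0, 0, 1, 1, 0, 0],  # no false alarms, misses two spam
    'model_b': [0, 0, 0, 1, 1, 1, 1, 0],  # one false alarm, misses one spam
}

rows = {}
for name, y_pred in toy_preds.items():
    report = classification_report(y_true, y_pred, output_dict=True)
    # Integer class labels appear as string keys in the report dict.
    rows[name] = {
        'accuracy': report['accuracy'],
        'spam_precision': report['1']['precision'],
        'spam_recall': report['1']['recall'],
        'spam_f1': report['1']['f1-score'],
    }

# One row per model, one column per metric.
summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary.round(3))
```

In this toy case both models have the same accuracy but different spam recall, which is why the per-class rows are worth reading alongside the overall accuracy.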

In [36]:
for model_name, metrics in results.items():
    print(f"Results for {model_name}:")
    for metric, values in metrics.items():
        print(f"{metric}:")
        if isinstance(values, dict):
            for sub_metric, score in values.items():
                print(f"  {sub_metric}: {score}")
        else:
            print(f"  {values}")
    print("\n")

Results for Naive Bayes:
0:
  precision: 0.9757575757575757
  recall: 1.0
  f1-score: 0.9877300613496933
  support: 966
1:
  precision: 1.0
  recall: 0.8389261744966443
  f1-score: 0.9124087591240876
  support: 149
accuracy:
  0.97847533632287
macro avg:
  precision: 0.9878787878787878
  recall: 0.9194630872483222
  f1-score: 0.9500694102368905
  support: 1115
weighted avg:
  precision: 0.9789971463514063
  recall: 0.97847533632287
  f1-score: 0.9776647034738053
  support: 1115


Results for Logistic Regression:
0:
  precision: 0.966
  recall: 1.0
  f1-score: 0.982706002034588
  support: 966
1:
  precision: 1.0
  recall: 0.7718120805369127
  f1-score: 0.8712121212121211
  support: 149
accuracy:
  0.9695067264573991
macro avg:
  precision: 0.983
  recall: 0.8859060402684564
  f1-score: 0.9269590616233545
  support: 1115
weighted avg:
  precision: 0.9705434977578475
  recall: 0.9695067264573991
  f1-score: 0.9678068197542763
  support: 1115


Results for Support Vector Machine:
0:
  prec