## SPAM SMS DETECTION

Build an AI model that classifies SMS messages as spam or
legitimate (ham). Use techniques such as TF-IDF or word embeddings with
classifiers such as Naive Bayes, Logistic Regression, or Support Vector
Machines to identify spam messages.

Dataset: [SMS Spam Collection Dataset](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

## Importing necessary libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import re

## Data Retrieval

In [2]:
# Load the dataset
df = pd.read_csv("C:\\Users\\rishi\\Downloads\\archive (15)\\spam.csv", encoding="latin-1")
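
The load step can be wrapped in a small helper so the notebook does not depend on one machine's absolute path. The sketch below is only an illustration: `load_sms` is a hypothetical helper name, and the in-memory rows are made up to mimic the `v1`/`v2` column layout, with one extra unnamed column of the kind the Kaggle CSV is assumed to contain.

```python
import io

import pandas as pd


def load_sms(source):
    """Read the SMS CSV and keep only the label and text columns.

    `source` can be a file path or a file-like object.
    """
    df = pd.read_csv(source, encoding="latin-1", usecols=["v1", "v2"])
    return df.rename(columns={"v1": "label", "v2": "text"})


# Tiny in-memory stand-in for spam.csv (made-up rows, same column layout).
sample_csv = io.BytesIO(
    (
        "v1,v2,Unnamed: 2\n"
        "ham,See you at lunch,\n"
        "spam,WIN a free prize now,\n"
    ).encode("latin-1")
)

sample_df = load_sms(sample_csv)
print(sample_df)
```

With the real file, the same helper could be called with a relative path such as `load_sms("spam.csv")`, assuming the CSV sits next to the notebook.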

## Data Preprocessing

In [3]:
# Rename columns and map labels to numerical values
df = df.rename(columns={'v1': 'label', 'v2': 'text'})
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
X = df['text']
y = df['label']
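
Before training, it can help to look at how the two classes are distributed, because accuracy alone is easy to misread on an imbalanced dataset. The following is a minimal sketch on made-up labels standing in for `df['label']`; the 8-to-2 split is an assumption chosen for illustration, not the real class ratio.

```python
import pandas as pd

# Made-up labels standing in for df['label'] (0 = ham, 1 = spam).
labels = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="label")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction per class
print(counts)
print(shares)

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model worth keeping should beat this majority-class baseline by a clear margin.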

## Data Splitting

In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
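
The split above does not pass `stratify`, so the spam share in the two subsets can drift slightly from the overall share. A possible variant, shown here on made-up data rather than the real messages, passes the labels to `stratify` so both subsets keep the same class ratio. This is a sketch of an alternative, not the split used to produce the results in this notebook.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 20 messages, every fifth one labelled spam (4 spam, 16 ham).
texts = [f"message {i}" for i in range(20)]
labels = [1 if i % 5 == 0 else 0 for i in range(20)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,
    random_state=42,
    stratify=labels,  # keep the ham/spam ratio equal in both subsets
)

print("spam in train:", sum(y_tr), "of", len(y_tr))
print("spam in test: ", sum(y_te), "of", len(y_te))
```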

## Vectorize the text data using TF-IDF

In [5]:
# Convert text data into numerical vectors using TF-IDF vectorization
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
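
To make the fit/transform distinction concrete, here is a small sketch on made-up messages. The vocabulary is learned only from the training messages, so a word that first appears at transform time (here `tomorrow`) is simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages for illustration only.
train_msgs = ["free prize now", "see you at lunch", "free lunch"]
new_msgs = ["free prize tomorrow"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_msgs)  # learns vocabulary and IDF weights
new_matrix = vec.transform(new_msgs)          # reuses them, learns nothing new

print(sorted(vec.vocabulary_))                # words the vectorizer knows
print(train_matrix.shape, new_matrix.shape)   # same number of columns
print("tomorrow" in vec.vocabulary_)          # unseen word is not a feature
```

Fitting on the training set only, as the cell above does, keeps information from the test messages out of the learned vocabulary and weights.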

## Model Training

### Logistic Regression

In [12]:
# Train Logistic Regression model
lr_model = LogisticRegression()
lr_model.fit(X_train_tfidf, y_train)

### Random Forest

In [7]:
# Train Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train_tfidf, y_train)

### Support Vector Machine

In [8]:
# Train Support Vector Machine model
svm_model = SVC()
svm_model.fit(X_train_tfidf, y_train)
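
The introduction also names Naive Bayes, which this notebook does not train. As a hedged sketch of how it could be added, the snippet below chains `TfidfVectorizer` and `MultinomialNB` in a scikit-learn `Pipeline`. The messages are made up, and the name `nb_pipeline` is introduced here for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training messages (1 = spam, 0 = ham).
train_texts = [
    "win a free prize now",
    "free cash claim now",
    "are we still on for lunch",
    "call me when you get home",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and classifies it in one object.
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
nb_pipeline.fit(train_texts, train_labels)

predictions = nb_pipeline.predict(["claim your free prize", "lunch at home"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, raw strings can be passed straight to `fit` and `predict` without a separate transform step.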

In [9]:
# Function to evaluate model performance
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return accuracy, report
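
Accuracy and the text report can be complemented by a confusion matrix, which shows directly how many spam messages slip through. The sketch below uses made-up true and predicted labels; in the notebook the inputs would be `y_test` and a model's predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (0 = ham, 1 = spam).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0]

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

spam_recall = tp / (tp + fn)      # share of real spam that was caught
spam_precision = tp / (tp + fp)   # share of spam predictions that were right

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"spam recall={spam_recall:.2f} spam precision={spam_precision:.2f}")
```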

## Evaluation

In [10]:
# Evaluate models
lr_accuracy, lr_report = evaluate_model(lr_model, X_test_tfidf, y_test)
rf_accuracy, rf_report = evaluate_model(rf_model, X_test_tfidf, y_test)
svm_accuracy, svm_report = evaluate_model(svm_model, X_test_tfidf, y_test)
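
The three evaluation calls follow the same pattern, so they could also be driven from a dictionary of models. The following is a minimal self-contained sketch on made-up messages. The `models` and `summary` names are hypothetical, and the scores it prints say nothing about performance on the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.svm import SVC

# Made-up messages (1 = spam, 0 = ham).
train_texts = [
    "win a free prize now", "free cash claim now",
    "urgent claim your reward", "free entry win cash",
    "are we still on for lunch", "call me when you get home",
    "see you at the meeting", "thanks for the lift today",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["claim free cash", "see you at lunch"]
test_labels = [1, 0]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Support Vector Machine": SVC(),
}

# Fit each model and collect the same metrics for all of them.
summary = {}
for name, model in models.items():
    model.fit(X_tr, train_labels)
    pred = model.predict(X_te)
    summary[name] = {
        "accuracy": accuracy_score(test_labels, pred),
        "spam_recall": recall_score(test_labels, pred, pos_label=1, zero_division=0),
    }

for name, scores in summary.items():
    print(f"{name}: {scores}")
```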

## Results

In [11]:
# Print results
print("Logistic Regression:")
print(f"Accuracy: {lr_accuracy}")
print("Classification Report:")
print(lr_report)
print("\nRandom Forest:")
print(f"Accuracy: {rf_accuracy}")
print("Classification Report:")
print(rf_report)
print("\nSupport Vector Machine:")
print(f"Accuracy: {svm_accuracy}")
print("Classification Report:")
print(svm_report)

Logistic Regression:
Accuracy: 0.9659192825112107
Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       0.99      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115


Random Forest:
Accuracy: 0.9730941704035875
Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       965
           1       1.00      0.80      0.89       150

    accuracy                           0.97      1115
   macro avg       0.98      0.90      0.94      1115
weighted avg       0.97      0.97      0.97      1115


Support Vector Machine:
Accuracy: 0.9820627802690582
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       