# 📘 Logistic Regression & SVM in NLP (Spam Detection)

---

## 🔹 1. Introduction
In the **statistical era of NLP**, algorithms like **Naive Bayes, Logistic Regression, and SVM** were widely used for text classification before deep learning.  

Here, we’ll use **Logistic Regression** and **Support Vector Machines (SVM)** to detect spam messages from the classic **SMS Spam dataset (spam.csv from Kaggle)**.

---

## 🔹 2. Dataset
- Dataset: **SMS Spam Collection** (spam.csv)  
- Source: Kaggle  
- Classes:  
  - `ham` → normal message  
  - `spam` → unwanted / promotional message  

In [2]:
# 📦 Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [3]:
# 📂 Load dataset
df = pd.read_csv("spam.csv", encoding="latin-1")

# Keep only useful columns
df = df[['v1', 'v2']]
df.columns = ['label', 'message']

# Map labels: spam=1, ham=0
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

df.head()

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


---

## 🔹 3. Preprocessing
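We’ll use **TF-IDF Vectorization** to convert text into numerical features. The vectorizer is fitted on the training split only, drops English stop words, and keeps at most 3000 terms; the test split is transformed with that same fitted vocabulary.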

---


In [4]:
# ✨ Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42
)

# 🔡 Convert text → TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print("Training data shape:", X_train_tfidf.shape)


Training data shape: (4457, 3000)


---

## 🔹 4. Logistic Regression Model
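Logistic Regression learns a weight for each TF-IDF feature and passes the weighted sum through the **sigmoid function**, which turns it into a **probability** that a message is spam. By default, `predict` labels a message as spam when that probability exceeds 0.5.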
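
The link between the linear score and the predicted probability can be checked directly. The sketch below fits a model on a tiny invented 1-D dataset (the `X_toy`, `y_toy`, and `sigmoid` names are assumptions for this example) and recomputes the probabilities by hand from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny invented 1-D problem: larger x means class 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression().fit(X_toy, y_toy)

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear score z = w * x + b, then squash it with the sigmoid
z = X_toy @ toy_clf.coef_.ravel() + toy_clf.intercept_[0]
manual_proba = sigmoid(z)

# Probability of class 1 as reported by scikit-learn
sklearn_proba = toy_clf.predict_proba(X_toy)[:, 1]

print("Manual :", np.round(manual_proba, 3))
print("sklearn:", np.round(sklearn_proba, 3))
print("Match  :", np.allclose(manual_proba, sklearn_proba))
```

The same relationship holds for the spam model, only with 3000 weights instead of one.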

---


In [5]:
# Train Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_tfidf, y_train)

# 📊 Predictions
y_pred_log = log_reg.predict(X_test_tfidf)

# 📝 Evaluation
print("🔹 Logistic Regression Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print("\nClassification Report:\n", classification_report(y_test, y_pred_log))


🔹 Logistic Regression Results:
Accuracy: 0.9641255605381166

Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       0.97      0.75      0.85       150

    accuracy                           0.96      1115
   macro avg       0.97      0.88      0.91      1115
weighted avg       0.96      0.96      0.96      1115



---

## 🔹 5. Support Vector Machine (SVM) Model
SVM looks for the **separating boundary** (hyperplane) with the largest margin, that is, the greatest distance to the nearest training points of each class.  
It works well in high-dimensional, sparse spaces like TF-IDF text features.
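
As a small illustration of the hyperplane idea, the sketch below trains a linear `SVC` on six invented 2-D points and reads the boundary back from the fitted model. The data and the `toy_*` names are assumptions for this example only; with a linear kernel, the decision score is `w · x + b` and the margin width is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Six invented, linearly separable 2-D points
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_svm = SVC(kernel="linear").fit(X_toy, y_toy)

# For a linear kernel the hyperplane is w · x + b = 0
w = toy_svm.coef_.ravel()
b = toy_svm.intercept_[0]
scores = X_toy @ w + b

# The hand-computed scores match decision_function;
# a positive score means class 1, a negative score class 0
print("Scores match:", np.allclose(scores, toy_svm.decision_function(X_toy)))

# Only the support vectors (points nearest the boundary) define it
print("Support vectors:\n", toy_svm.support_vectors_)
print("Margin width:", 2.0 / np.linalg.norm(w))
```

`coef_` is only available for the linear kernel; with `rbf` or `poly` the boundary lives in an implicit feature space and cannot be read off as a single weight vector.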

---


In [7]:
# Train Support Vector Machine
svm = SVC(kernel='linear')  # linear kernel works well for text
svm.fit(X_train_tfidf, y_train)

# 📊 Predictions
y_pred_svm = svm.predict(X_test_tfidf)

# 📝 Evaluation
print("🔹 SVM Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm))


🔹 SVM Results:
Accuracy: 0.97847533632287

Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.97      0.87      0.92       150

    accuracy                           0.98      1115
   macro avg       0.97      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115



---

## 🔹 6. Results & Comparison

- **Logistic Regression** → probability-based, usually faster to train, and interpretable through its per-term coefficients.  
- **SVM** → margin-based, strong in high-dimensional data.  

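In the run above, Logistic Regression reached about **96.4% accuracy** and the linear SVM about **97.8%**. The clearer difference is in spam recall: **0.75** for Logistic Regression versus **0.87** for the SVM, so the SVM missed fewer spam messages on this test split.  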

---

✅ You can now experiment by:
- Changing `max_features` in TF-IDF.  
- Trying `ngram_range=(1,2)` for bigram features.  
- Using other kernels in SVM (`rbf`, `poly`).  
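
The last two suggestions can be tried together in a few lines. The sketch below uses invented toy messages and hypothetical names (`unigram_vec`, `bigram_vec`, `rbf_pipeline`); it only shows how the options are wired up, not how they perform on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy data: 1 = spam, 0 = ham
toy_messages = [
    "win a free prize now",
    "free cash prize claim now",
    "claim your free entry",
    "are we meeting for lunch",
    "see you at lunch tomorrow",
    "call me when you are home",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# ngram_range=(1, 2) adds word pairs such as "free prize"
# on top of the single words, so the vocabulary grows
unigram_vec = TfidfVectorizer(ngram_range=(1, 1)).fit(toy_messages)
bigram_vec = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_messages)
print("Unigram vocabulary size:", len(unigram_vec.vocabulary_))
print("Uni+bigram vocabulary size:", len(bigram_vec.vocabulary_))

# A pipeline keeps vectorizer and classifier together,
# which makes swapping the kernel a one-word change
rbf_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SVC(kernel="rbf"),
)
rbf_pipeline.fit(toy_messages, toy_labels)
print("Prediction:", rbf_pipeline.predict(["free prize now"]))
```

On the real data, compare each variant on the same test split; a larger vocabulary or a non-linear kernel is not guaranteed to improve on the linear results above.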

---
