## SAMAROHA CHATTERJEE
## ROLL: MDS202342

In [4]:
# General Libraries
import os
import pandas as pd
import joblib

# Machine Learning Libraries
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Define dataset paths in Google Drive
drive_path = "/content/drive/MyDrive/AppliedMachineLearning/Assignment_1"
train_path = f"{drive_path}/train.csv"
val_path = f"{drive_path}/validation.csv"
test_path = f"{drive_path}/test.csv"

# Load datasets
train_df = pd.read_csv(train_path)
val_df = pd.read_csv(val_path)
test_df = pd.read_csv(test_path)

# Verify data
print(train_df.head())
print(f"✅ Data Loaded: Train ({len(train_df)}), Validation ({len(val_df)}), Test ({len(test_df)})")



   label                                            message
0      0                                          guy close
1      0  please come imin towndontmatter urgoin outlrju...
2      0                          ok ksry knw sivatats askd
3      0                                ill see prolly yeah
4      0        ill see swing bit got thing take care firsg
✅ Data Loaded: Train (4457), Validation (557), Test (558)


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Handle missing values
train_df['message'] = train_df['message'].fillna("")
val_df['message'] = val_df['message'].fillna("")
test_df['message'] = test_df['message'].fillna("")

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)  # Limit to 5000 features

# Transform data
X_train = vectorizer.fit_transform(train_df['message'])
X_val = vectorizer.transform(val_df['message'])
X_test = vectorizer.transform(test_df['message'])

y_train = train_df['label']
y_val = val_df['label']
y_test = test_df['label']

print("✅ Missing values handled and TF-IDF Vectorization Complete!")


✅ Missing values handled and TF-IDF Vectorization Complete!


In [8]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def create_pipeline(model):
    """
    Creates a pipeline with the classifier.
    """
    return Pipeline([
        ('clf', model)
    ])

def train_models(X_train, y_train, X_val, y_val):
    """
    Trains multiple models and selects the best one based on validation accuracy.
    """
    models = {
        "Naive Bayes": create_pipeline(MultinomialNB()),
        "Logistic Regression": create_pipeline(LogisticRegression(max_iter=1000)),
        "SVM": create_pipeline(SVC(kernel='linear', probability=True))
    }

    model_scores = {}
    for name, model in models.items():
        print(f"\n🔹 Training {name}...")
        model.fit(X_train, y_train)

        # Evaluate on train and validation sets
        train_acc = accuracy_score(y_train, model.predict(X_train))
        val_acc = accuracy_score(y_val, model.predict(X_val))

        print(f"✅ {name} - Train Accuracy: {train_acc:.4f}, Validation Accuracy: {val_acc:.4f}")
        model_scores[name] = val_acc

    # Select Best Model
    best_model_name = max(model_scores, key=model_scores.get)
    best_model = models[best_model_name]
    print(f"\n🏆 Best Model Selected: {best_model_name}")
    return best_model_name, best_model

best_model_name, best_model = train_models(X_train, y_train, X_val, y_val)



🔹 Training Naive Bayes...
✅ Naive Bayes - Train Accuracy: 0.9800, Validation Accuracy: 0.9659

🔹 Training Logistic Regression...
✅ Logistic Regression - Train Accuracy: 0.9686, Validation Accuracy: 0.9605

🔹 Training SVM...
✅ SVM - Train Accuracy: 0.9937, Validation Accuracy: 0.9838

🏆 Best Model Selected: SVM


In [10]:
from sklearn.model_selection import GridSearchCV

def fine_tune_hyperparameters(best_model_name, best_model, X_train, y_train):
    """
    Fine-tunes hyperparameters using Grid Search.
    """
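    # One grid per candidate, so the lookup below works whichever model wins selection
    param_grid = {
        "Naive Bayes": {"clf__alpha": [0.1, 0.5, 1.0]},
        "Logistic Regression": {"clf__C": [0.1, 1, 10]},
        "SVM": {"clf__C": [0.1, 1, 10]}
    }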

    grid_search = GridSearchCV(best_model, param_grid[best_model_name], scoring='accuracy', cv=5)
    grid_search.fit(X_train, y_train)

    print(f"\n🔍 Best hyperparameters: {grid_search.best_params_}")
    return grid_search.best_estimator_

best_model = fine_tune_hyperparameters(best_model_name, best_model, X_train, y_train)



🔍 Best hyperparameters: {'clf__C': 10}


In [11]:
from sklearn.metrics import classification_report

def evaluate_model(model, X_test, y_test):
    """
    Evaluates the model on the test dataset.
    """
    test_preds = model.predict(X_test)
    test_acc = accuracy_score(y_test, test_preds)

    print(f"\n✅ Test Accuracy: {test_acc:.4f}")
    print("\n📊 Test Classification Report:\n", classification_report(y_test, test_preds))

evaluate_model(best_model, X_test, y_test)



✅ Test Accuracy: 0.9713

📊 Test Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98       483
           1       0.89      0.89      0.89        75

    accuracy                           0.97       558
   macro avg       0.94      0.94      0.94       558
weighted avg       0.97      0.97      0.97       558



In [12]:
import joblib

# Save the trained model
model_path = os.path.join(drive_path, "SVM_model.pkl")
joblib.dump(best_model, model_path)

print(f"\n✅ Model saved successfully at {model_path}")



✅ Model saved successfully at /content/drive/MyDrive/AppliedMachineLearning/Assignment_1/SVM_model.pkl
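
The classifier saved above expects TF-IDF features, so raw messages can only be scored later if the fitted `TfidfVectorizer` is available too. Below is a minimal sketch of one way to handle this: bundle the vectorizer and classifier in a single `Pipeline` and persist that object. The toy corpus, the temporary directory, and the file name `spam_pipeline.pkl` are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus standing in for the real training data (illustrative only)
messages = [
    "win free prize now", "free cash claim now", "urgent prize waiting claim",
    "lunch tomorrow maybe", "running late sorry", "home soon dinner ready",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# Vectorizer and classifier travel together, so the saved file accepts raw text
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", SVC(kernel="linear", C=10)),
])
text_clf.fit(messages, labels)

# Round-trip through disk; a temporary directory replaces the Drive path here
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "spam_pipeline.pkl")
    joblib.dump(text_clf, bundle_path)
    restored = joblib.load(bundle_path)

new_messages = ["claim your free prize", "dinner tomorrow"]
original_preds = text_clf.predict(new_messages).tolist()
restored_preds = restored.predict(new_messages).tolist()
print(original_preds == restored_preds)
```

An alternative that keeps the notebook's structure would be to dump the existing `vectorizer` object to a second file next to the model.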


## **📌 SMS Spam Classification: Model Training & Evaluation**

### **1️⃣ Objective**
The goal of this project was to build an **SMS spam classifier** that distinguishes **ham (non-spam) messages** from **spam messages**. We implemented a **machine learning pipeline** covering **feature extraction, model training, hyperparameter tuning, and evaluation**.

---

### **2️⃣ Methodology**
The workflow had five steps:

1. **Data Loading**  
   - A preprocessed, cleaned dataset was loaded from Google Drive.
   - Separate train (4,457 messages), validation (557), and test (558) splits were used for training, model selection, and final evaluation.

2. **Feature Extraction using TF-IDF**  
   - SMS text was converted into numerical features using **Term Frequency - Inverse Document Frequency (TF-IDF)**, with English stop words removed.
   - The vocabulary was capped at the **5,000 most frequent terms** for efficiency.

3. **Model Training & Selection**  
   - Three models were trained:  
     - **Naïve Bayes**  
     - **Logistic Regression**  
     - **Support Vector Machine (SVM)**
   - Model performance was evaluated using **accuracy on the validation set**.

| Model  | Train Accuracy | Validation Accuracy |
|--------|---------------|---------------------|
| **Naïve Bayes** | 98.00% | 96.59% |
| **Logistic Regression** | 96.86% | 96.05% |
| **SVM** | 99.37% | 98.38% |

🏆 **Best Model Selected: Support Vector Machine (SVM)**

4. **Hyperparameter Tuning (Grid Search)**  
   - The SVM regularization parameter `C` was tuned with a 5-fold cross-validated grid search over the values 0.1, 1, and 10 on the training set.
   - **Best hyperparameter found:** `C = 10`

5. **Model Evaluation on Test Set**  
   - The **fine-tuned SVM model** was evaluated on the unseen test set.
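
The steps above can be sketched end to end on a toy corpus. This is a hedged illustration, not the notebook's exact code: the messages and labels are made up, `cv=3` is used only because the toy set is tiny, and the vectorizer is placed inside the `Pipeline` so that each cross-validation fold fits TF-IDF on its own training portion (the notebook fits the vectorizer once, before the grid search).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus (illustrative only, not the assignment data): 1 = spam, 0 = ham
messages = [
    "win free prize now", "free cash claim prize", "urgent claim free cash",
    "free entry win cash", "claim urgent prize now", "win cash prize today",
    "lunch tomorrow maybe", "running late sorry", "meeting moved tomorrow",
    "home soon dinner ready", "thanks talk later", "sorry missed lunch today",
]
labels = [1] * 6 + [0] * 6

# TF-IDF settings mirror the notebook; the linear SVM is the selected model
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("clf", SVC(kernel="linear")),
])

# Same C grid as the notebook; fewer folds because the toy set is small
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1, 10]}, scoring="accuracy", cv=3)
search.fit(messages, labels)

print(search.best_params_)
sample_preds = search.predict(["free prize claim", "lunch later"])
print(sample_preds)
```

Because the fitted search object wraps the whole pipeline, `search.predict` accepts raw strings directly.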

---

### **3️⃣ Results & Final Model Performance**
#### **🔹 Test Accuracy:**
**97.13%**

#### **🔹 Classification Report:**
| Class  | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| **Ham (0)** | 98% | 98% | 98% | 483 |
| **Spam (1)** | 89% | 89% | 89% | 75 |
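
The underlying confusion counts are not printed in the notebook, but they can be recovered from the reported figures. The sketch below is an inference from the rounded report values (supports, test accuracy, spam recall), not output produced by the model; all variable names are illustrative.

```python
# Figures taken from the test report above
support_ham, support_spam = 483, 75
reported_accuracy = 0.9713
reported_spam_recall = 0.89

total = support_ham + support_spam
errors = round(total * (1 - reported_accuracy))  # number of misclassified messages

# True-positive counts whose recall rounds to the reported value
tp_candidates = [tp for tp in range(support_spam + 1)
                 if round(tp / support_spam, 2) == reported_spam_recall]
assert len(tp_candidates) == 1  # the rounding pins down a single count
tp = tp_candidates[0]

fn = support_spam - tp   # spam predicted as ham
fp = errors - fn         # ham predicted as spam
tn = support_ham - fp    # ham predicted as ham

spam_precision = tp / (tp + fp)
ham_recall = tn / support_ham
accuracy = (tp + tn) / total
majority_baseline = support_ham / total  # accuracy of always predicting ham

print(tp, fn, fp, tn)
print(round(spam_precision, 2), round(ham_recall, 2), round(accuracy, 4))
print(round(majority_baseline, 4))
```

Under these assumptions the counts come out to 67 spam caught, 8 spam missed, 8 ham flagged as spam, and 475 ham passed, which reproduces the reported precision, recall, and accuracy. A classifier that always predicts ham would already score about 86.6% accuracy on this split, which is why the per-class figures matter more than accuracy alone.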

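✅ **The model reaches 97.13% test accuracy. On the spam class, precision and recall are both 89%, so roughly one in nine spam messages is still missed.**  
📉 **Possible improvements:** Address the class imbalance (75 spam vs. 483 ham messages in the test set) with **oversampling techniques such as SMOTE**, or try **ensemble learning methods**.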
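
As a rough illustration of the oversampling idea, the sketch below uses plain random oversampling via `sklearn.utils.resample`. This is a simpler stand-in for SMOTE, which lives in the separate imbalanced-learn package and synthesizes new minority points instead of duplicating existing ones. The helper name `oversample_minority` and the toy arrays are hypothetical; in practice only the training split should be resampled.

```python
import numpy as np
from sklearn.utils import resample

def oversample_minority(X, y, random_state=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = int(counts.max())
    selected = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        if count < target:
            # Sample row indices with replacement up to the majority-class size
            idx = resample(idx, replace=True, n_samples=target, random_state=random_state)
        selected.append(idx)
    order = np.concatenate(selected)
    return X[order], y[order]

# Toy feature matrix (illustrative only): 8 ham rows, 2 spam rows
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(np.bincount(y_bal))
```

Row indexing with an integer array also works on the SciPy CSR matrices that `TfidfVectorizer` returns, so the same helper could in principle be applied to `X_train`.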

---

### **4️⃣ Model Deployment & Future Work**
- The **trained SVM model was saved** to Google Drive for future use:
  ```text
  SVM_model.pkl
  ```
