# Task
Develop and evaluate an SVM classifier for SMS spam detection using the dataset at "/content/sample_data/SMSSpamCollection", exploring different kernels and analyzing the impact of word frequency, class distribution, and hyperparameters on performance.

## Load the dataset

### Subtask:
Load the SMS dataset from the provided path `/content/sample_data/SMSSpamCollection` into a pandas DataFrame.


**Reasoning**:
Load the dataset into a pandas DataFrame with specified separator and no header, then assign column names.



In [8]:
import pandas as pd

df = pd.read_csv('/content/sample_data/SMSSpamCollection', sep='\t', header=None)
df.columns = ['label', 'message']
display(df.head())

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
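
The file contains free-form text, and under pandas' default quoting a message that starts with an unbalanced double quote may be read as the start of a quoted field and absorb the lines that follow. The sketch below is a hedged alternative, not the notebook's method: it reads a made-up two-line sample with `quoting=csv.QUOTE_NONE` and sets the column names through `names=`. The sample text and the name `sample_df` are assumptions for illustration.

```python
import csv
import io

import pandas as pd

# Two tab-separated records in the same layout as the dataset file.
# The first message starts with an unbalanced double quote (made-up sample text).
sample = 'ham\t"Wen u miss someone\nspam\tFree entry in 2 a wkly comp\n'

# names= sets the column names at read time; QUOTE_NONE treats '"' as ordinary text.
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep='\t',
    header=None,
    names=['label', 'message'],
    quoting=csv.QUOTE_NONE,
)

print(sample_df.shape)
print(sample_df.loc[0, 'message'])
```

Whether this changes the row count for the real file depends on its contents, so it is worth comparing `df.shape` with and without the option.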


## Explore the data

### Subtask:
Perform exploratory data analysis to understand the structure of the dataset, class distribution (spam vs. ham), and potentially identify key features or words.


**Reasoning**:
Perform exploratory data analysis by printing the shape, getting info, checking value counts, and calculating percentages of the 'label' column.



In [None]:
print("Shape of the DataFrame:")
print(df.shape)

print("\nInfo of the DataFrame:")
df.info()

print("\nValue counts of 'label' column:")
print(df['label'].value_counts())

print("\nPercentage of spam and ham messages:")
print(df['label'].value_counts(normalize=True) * 100)

Shape of the DataFrame:
(5572, 2)

Info of the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB

Value counts of 'label' column:
label
ham     4825
spam     747
Name: count, dtype: int64

Percentage of spam and ham messages:
label
ham     86.593683
spam    13.406317
Name: proportion, dtype: float64
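
Because about 87% of the messages are ham, accuracy alone can look strong for a model that never detects spam. The short sketch below uses the class counts printed above to compute the accuracy of a trivial majority-class predictor; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {'ham': 4825, 'spam': 747}
total = sum(counts.values())

# A trivial classifier that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

# It never predicts 'spam', so it finds none of the spam messages.
baseline_spam_recall = 0 / counts['spam']

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline spam recall: {baseline_spam_recall:.4f}")
```

Any model evaluated on this data should therefore be judged against roughly 0.866 accuracy, and spam precision, recall, and F1 are more informative than accuracy.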


**Reasoning**:
The exploratory data analysis steps including shape, info, value counts, and percentages have been completed. Now, examine the first few messages for both spam and ham to get a qualitative feel for the data.



In [None]:
print("\nFirst 5 ham messages:")
display(df[df['label'] == 'ham']['message'].head())

print("\nFirst 5 spam messages:")
display(df[df['label'] == 'spam']['message'].head())


First 5 ham messages:


| | message |
|---|---|
| 0 | Go until jurong point, crazy.. Available only ... |
| 1 | Ok lar... Joking wif u oni... |
| 3 | U dun say so early hor... U c already then say... |
| 4 | Nah I don't think he goes to usf, he lives aro... |
| 6 | Even my brother is not like to speak with me. ... |



First 5 spam messages:


| | message |
|---|---|
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 5 | FreeMsg Hey there darling it's been 3 week's n... |
| 8 | WINNER!! As a valued network customer you have... |
| 9 | Had your mobile 11 months or more? U R entitle... |
| 11 | SIX chances to win CASH! From 100 to 20,000 po... |


## Preprocess the data

### Subtask:
Clean and preprocess the text data. This may involve removing punctuation, stop words, and converting text to lowercase.


**Reasoning**:
Define a function to preprocess the text data by converting to lowercase, removing punctuation, and removing stop words, then apply this function to the 'message' column to create a new 'cleaned_message' column.



In [None]:
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

df['cleaned_message'] = df['message'].apply(preprocess_text)
display(df.head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


| | label | message | cleaned_message |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | ham | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | nah dont think goes usf lives around though |
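
The order of the cleaning steps matters: punctuation is stripped before stop words are removed, so a contraction such as "don't" becomes "dont" and no longer matches a stop-word entry spelled with an apostrophe. That is consistent with "dont" surviving in the output above. The sketch below illustrates the effect with `str.translate`; `demo_stop_words` is a hypothetical, tiny stand-in for NLTK's list, and `preprocess_text_demo` is an illustrative name.

```python
import string

# Hypothetical, tiny stop-word set used only to keep this sketch self-contained;
# the notebook uses NLTK's full English list.
demo_stop_words = {'i', 'he', 'to', 'the', 'a', 'is', "don't"}

# Translation table that deletes every ASCII punctuation character.
strip_punctuation = str.maketrans('', '', string.punctuation)

def preprocess_text_demo(text):
    # Lowercase and drop punctuation first, then filter stop words.
    text = text.lower().translate(strip_punctuation)
    return ' '.join(word for word in text.split() if word not in demo_stop_words)

# "don't" turns into "dont" before the stop-word check, so it is kept.
print(preprocess_text_demo("Nah I don't think he goes to usf"))
```

If contractions should be dropped, one option is to filter stop words before stripping punctuation, at the cost of changing the features used in the rest of the notebook.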


## Vectorize the text data

### Subtask:
Convert the text data into numerical feature vectors using techniques like TF-IDF or Count Vectorization.


**Reasoning**:
Convert the cleaned text messages into numerical feature vectors using TF-IDF.



In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df['cleaned_message'])
print("Shape of the TF-IDF matrix:")
print(X.shape)

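KeyError: 'cleaned_message'

This `KeyError` is raised when the cell runs in a session where the preprocessing cell has not been executed, so `df` has no `cleaned_message` column. Run the preprocessing cell first. The later cells assume a TF-IDF matrix with 5572 rows and 5000 columns, which is consistent with the split shapes reported in the next section.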
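
To make the effect of word frequency concrete, the sketch below fits `TfidfVectorizer` with default settings on three made-up, already-cleaned messages. It shows that a term appearing in fewer documents receives a larger weight than one appearing in more, and that each row is L2-normalised by default. The documents and the names `demo_vectorizer` and `demo_X` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-cleaned messages for illustration.
docs = [
    "free entry win cash",
    "ok lar joking",
    "win free prize now",
]

demo_vectorizer = TfidfVectorizer()
demo_X = demo_vectorizer.fit_transform(docs)

vocab = demo_vectorizer.vocabulary_   # maps each term to its column index
row0 = demo_X[0].toarray().ravel()    # dense TF-IDF weights of the first message

print(demo_X.shape)
print(sorted(vocab))
# 'free' occurs in two documents and 'entry' in one, so 'entry' gets the larger weight.
print(row0[vocab['entry']] > row0[vocab['free']])
# Each row has unit Euclidean length under the default norm='l2'.
print(np.allclose(np.linalg.norm(demo_X.toarray(), axis=1), 1.0))
```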

## Split the data

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
Split the data into training and testing sets.



In [None]:
from sklearn.model_selection import train_test_split

y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (4457, 5000)
Shape of X_test: (1115, 5000)
Shape of y_train: (4457,)
Shape of y_test: (1115,)
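
The split above is random and not stratified, so the spam share of the test set can drift from the overall 13.4%. A hedged variation, shown here on made-up toy labels and not on the SMS data, passes `stratify=` so that both splits keep the class proportions.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with an 80/20 imbalance (made-up data, not the SMS dataset).
toy_y = ['ham'] * 80 + ['spam'] * 20
toy_X = list(range(len(toy_y)))

# stratify= keeps the ham/spam ratio the same in the training and test splits.
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(Counter(toy_y_train))
print(Counter(toy_y_test))
```

Applying `stratify=y` to the notebook's split would change which messages land in the test set, so the metrics reported in the following sections would need to be regenerated.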


## Train SVM models

### Subtask:
Train SVM models with different kernels (linear and RBF) on the training data.


**Reasoning**:
Train SVM models with linear and RBF kernels on the training data as per the instructions.



In [None]:
from sklearn.svm import SVC

svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)

svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
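
In this notebook the TF-IDF vectorizer is fitted on all messages before the split, so its vocabulary and IDF weights also reflect the test messages. One common way to avoid that, sketched here under the assumption that the raw text is split first, is to chain the vectorizer and the SVM in a scikit-learn pipeline. The messages and the name `text_svm` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up messages standing in for the raw training and test text.
train_text = ["free prize win cash now", "win free entry today",
              "ok see you later", "are you coming home later"]
train_labels = ["spam", "spam", "ham", "ham"]
test_text = ["free cash prize", "see you at home"]

# The vectorizer is fitted only on the text passed to fit(), i.e. the training split.
text_svm = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
text_svm.fit(train_text, train_labels)

# predict() reuses the training vocabulary and IDF weights on unseen text.
demo_pred = text_svm.predict(test_text)
print(demo_pred)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the vectorizer is refitted inside each cross-validation fold.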

## Evaluate the models

### Subtask:
Evaluate the performance of the trained SVM models using appropriate metrics such as accuracy, precision, recall, and F1-score on the testing data.


**Reasoning**:
Evaluate the performance of the trained SVM models using appropriate metrics.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)
precision_linear = precision_score(y_test, y_pred_linear, pos_label='spam')
recall_linear = recall_score(y_test, y_pred_linear, pos_label='spam')
f1_linear = f1_score(y_test, y_pred_linear, pos_label='spam')

print("SVM with Linear Kernel:")
print(f"  Accuracy: {accuracy_linear:.4f}")
print(f"  Precision: {precision_linear:.4f}")
print(f"  Recall: {recall_linear:.4f}")
print(f"  F1-score: {f1_linear:.4f}")

y_pred_rbf = svm_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)
precision_rbf = precision_score(y_test, y_pred_rbf, pos_label='spam')
recall_rbf = recall_score(y_test, y_pred_rbf, pos_label='spam')
f1_rbf = f1_score(y_test, y_pred_rbf, pos_label='spam')

print("\nSVM with RBF Kernel:")
print(f"  Accuracy: {accuracy_rbf:.4f}")
print(f"  Precision: {precision_rbf:.4f}")
print(f"  Recall: {recall_rbf:.4f}")
print(f"  F1-score: {f1_rbf:.4f}")

SVM with Linear Kernel:
  Accuracy: 0.9830
  Precision: 0.9924
  Recall: 0.8792
  F1-score: 0.9324

SVM with RBF Kernel:
  Accuracy: 0.9794
  Precision: 1.0000
  Recall: 0.8456
  F1-score: 0.9164
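
The four metrics follow directly from the confusion-matrix counts. The sketch below recomputes them from counts that are an assumption: they were inferred back from the rounded linear-kernel results above with 'spam' as the positive class, and were not printed by the notebook. Under that assumption the linear model misses 18 spam messages and flags 1 ham message as spam.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Assumed counts for 'spam' as the positive class, inferred back from the rounded
# linear-kernel metrics above; they are not an output of the notebook.
linear_counts = dict(tp=131, fp=1, fn=18, tn=965)

for name, value in zip(("Accuracy", "Precision", "Recall", "F1-score"),
                       metrics_from_counts(**linear_counts)):
    print(f"{name}: {value:.4f}")
```

Calling `sklearn.metrics.confusion_matrix(y_test, y_pred_linear, labels=['ham', 'spam'])` in the notebook would give the actual counts.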


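## Tune hyperparameters

### Subtask:
Tune the hyperparameters of the linear and RBF SVM models with `GridSearchCV`.


**Reasoning**:
Define the parameter grids for both kernels: `C` for the linear kernel, and `C` and `gamma` for the RBF kernel.



In [None]: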
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid_rbf = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']}
param_grid_linear = {'C': [0.1, 1, 10, 100], 'kernel': ['linear']}
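
The cost of a grid search is the number of parameter combinations times the number of cross-validation folds. As a quick check, the sketch below counts the candidates in each grid with scikit-learn's `ParameterGrid`, assuming 3-fold cross-validation; `n_folds` is an illustrative name.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the cell above, repeated here so the sketch runs on its own.
param_grid_rbf = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']}
param_grid_linear = {'C': [0.1, 1, 10, 100], 'kernel': ['linear']}

n_folds = 3  # assumed number of cross-validation folds
for name, grid in (("rbf", param_grid_rbf), ("linear", param_grid_linear)):
    n_candidates = len(ParameterGrid(grid))
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_folds} fits")
```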

**Reasoning**:
Instantiate and fit GridSearchCV with the RBF SVM model and its parameter grid to find the best hyperparameters using cross-validation, then print the best parameters and score.



In [None]:
grid_search_rbf = GridSearchCV(SVC(), param_grid_rbf, scoring='f1', cv=3)
grid_search_rbf.fit(X_train, y_train)

print("Best parameters for RBF kernel:", grid_search_rbf.best_params_)
print("Best cross-validation F1-score for RBF kernel:", grid_search_rbf.best_score_)

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_scorer.py", line 380, in _score
    y_pred = method_caller(
             ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_scorer.py", line 90, in _cached_call
    result, _ = _get_response_values(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_response.py", line 207, in _get_response_values
    raise ValueErro

Best parameters for RBF kernel: {'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}
Best cross-validation F1-score for RBF kernel: nan


**Reasoning**:
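The previous GridSearchCV run for the RBF kernel returned `nan` scores: the string scorer `'f1'` assumes `pos_label=1`, which is not valid when the labels are the strings 'ham' and 'spam', and the default `error_score` recorded the failure as `nan`. Rerun the search with `error_score='raise'` and `verbose=1` to surface the underlying error.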



In [None]:
grid_search_rbf = GridSearchCV(SVC(), param_grid_rbf, scoring='f1', cv=3, error_score='raise', verbose=1)
grid_search_rbf.fit(X_train, y_train)

print("Best parameters for RBF kernel:", grid_search_rbf.best_params_)
print("Best cross-validation F1-score for RBF kernel:", grid_search_rbf.best_score_)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


ValueError: pos_label=1 is not a valid label: It should be one of ['ham' 'spam']

**Reasoning**:
The error confirms that the built-in `'f1'` scorer uses the default `pos_label=1`, which does not exist among the string labels. Build a custom scorer with `make_scorer(f1_score, pos_label='spam')` and pass it to the `scoring` parameter of `GridSearchCV`.



In [None]:
from sklearn.metrics import make_scorer, f1_score

f1_scorer = make_scorer(f1_score, pos_label='spam')

grid_search_rbf = GridSearchCV(SVC(), param_grid_rbf, scoring=f1_scorer, cv=3, error_score='raise', verbose=1)
grid_search_rbf.fit(X_train, y_train)

print("Best parameters for RBF kernel:", grid_search_rbf.best_params_)
print("Best cross-validation F1-score for RBF kernel:", grid_search_rbf.best_score_)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
Best parameters for RBF kernel: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
Best cross-validation F1-score for RBF kernel: 0.917454472912809
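
An alternative to the custom scorer, sketched here on made-up labels, is to encode the target as integers so that 'spam' becomes class 1. With 0/1 labels the default `pos_label=1` is valid and the plain `'f1'` scoring string can be used. The mapping `label_to_int` and the sample series are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Made-up labels and predictions for illustration.
y_true_str = pd.Series(['ham', 'spam', 'spam', 'ham', 'spam'])
y_pred_str = pd.Series(['ham', 'spam', 'ham', 'ham', 'spam'])

# Map the string labels to integers so that 'spam' becomes the positive class 1.
label_to_int = {'ham': 0, 'spam': 1}
y_true_bin = y_true_str.map(label_to_int)
y_pred_bin = y_pred_str.map(label_to_int)

# With 0/1 labels the default pos_label=1 is valid, so no custom scorer is needed.
f1_encoded = f1_score(y_true_bin, y_pred_bin)
# With string labels the positive class has to be named explicitly.
f1_explicit = f1_score(y_true_str, y_pred_str, pos_label='spam')
print(f1_encoded, f1_explicit)
```

Both calls score the same predictions, so the two values agree; the choice is mainly about whether the rest of the notebook should keep the readable string labels.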


**Reasoning**:
Now that I have tuned the RBF kernel, I need to tune the linear kernel using GridSearchCV as specified in the instructions.



In [None]:
grid_search_linear = GridSearchCV(SVC(), param_grid_linear, scoring=f1_scorer, cv=3, error_score='raise', verbose=1)
grid_search_linear.fit(X_train, y_train)

print("Best parameters for Linear kernel:", grid_search_linear.best_params_)
print("Best cross-validation F1-score for Linear kernel:", grid_search_linear.best_score_)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best parameters for Linear kernel: {'C': 1, 'kernel': 'linear'}
Best cross-validation F1-score for Linear kernel: 0.9078026261020926


## Analyze the results

### Subtask:
Analyze the performance of different models and kernels, and discuss insights into word frequency, class distribution, and SVM hyperparameters.

**Reasoning**:
Now that I have the best hyperparameters for both kernels, I will train the final models using these parameters and evaluate them on the test set to compare their performance.

In [None]:
# Train and evaluate the best RBF model
best_svm_rbf = SVC(**grid_search_rbf.best_params_)
best_svm_rbf.fit(X_train, y_train)
y_pred_best_rbf = best_svm_rbf.predict(X_test)

accuracy_best_rbf = accuracy_score(y_test, y_pred_best_rbf)
precision_best_rbf = precision_score(y_test, y_pred_best_rbf, pos_label='spam')
recall_best_rbf = recall_score(y_test, y_pred_best_rbf, pos_label='spam')
f1_best_rbf = f1_score(y_test, y_pred_best_rbf, pos_label='spam')

print("Best SVM with RBF Kernel on Test Set:")
print(f"  Accuracy: {accuracy_best_rbf:.4f}")
print(f"  Precision: {precision_best_rbf:.4f}")
print(f"  Recall: {recall_best_rbf:.4f}")
print(f"  F1-score: {f1_best_rbf:.4f}")

# Train and evaluate the best Linear model
best_svm_linear = SVC(**grid_search_linear.best_params_)
best_svm_linear.fit(X_train, y_train)
y_pred_best_linear = best_svm_linear.predict(X_test)

accuracy_best_linear = accuracy_score(y_test, y_pred_best_linear)
precision_best_linear = precision_score(y_test, y_pred_best_linear, pos_label='spam')
recall_best_linear = recall_score(y_test, y_pred_best_linear, pos_label='spam')
f1_best_linear = f1_score(y_test, y_pred_best_linear, pos_label='spam')

print("\nBest SVM with Linear Kernel on Test Set:")
print(f"  Accuracy: {accuracy_best_linear:.4f}")
print(f"  Precision: {precision_best_linear:.4f}")
print(f"  Recall: {recall_best_linear:.4f}")
print(f"  F1-score: {f1_best_linear:.4f}")

Best SVM with RBF Kernel on Test Set:
  Accuracy: 0.9865
  Precision: 0.9855
  Recall: 0.9128
  F1-score: 0.9477

Best SVM with Linear Kernel on Test Set:
  Accuracy: 0.9830
  Precision: 0.9924
  Recall: 0.8792
  F1-score: 0.9324
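
The evaluation code repeats the same four metric calls for every model. A small helper, sketched below with the hypothetical name `summarize`, collects them in one place; it is demonstrated on made-up labels, not on the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(y_true, y_pred, pos_label='spam'):
    """Return accuracy, precision, recall and F1 for one model as a dict."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label=pos_label, average='binary'
    )
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

# Made-up labels and predictions, only to show the helper's output format.
demo_true = ['spam', 'ham', 'spam', 'ham']
demo_pred = ['spam', 'ham', 'ham', 'ham']
for metric, value in summarize(demo_true, demo_pred).items():
    print(f"{metric}: {value:.4f}")
```

In the notebook this could be called as `summarize(y_test, grid_search_rbf.best_estimator_.predict(X_test))`, since `GridSearchCV` refits the best configuration on the full training set by default, which makes the separate retraining step optional.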
