## Spam detection 
#### Adel Ahmed

## Part A: Dataset and Libraries
#### • Download "spam.csv" dataset and load it into a pandas DataFrame.
#### • Import the required libraries: pandas, scikit-learn, and numpy.


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

spam_df = pd.read_csv('C:/Users/adela/Downloads/spam.csv')

spam_df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


### Part B - CountVectorizer with SVM
#### • Use the CountVectorizer method from the scikit-learn library to define features from the email text.
#### • Split the dataset into training and testing sets.
#### • Train an SVM (Support Vector Machine) classifier on the training data.
#### • Evaluate the performance of the classifier by calculating accuracy, recall, precision, and F1-score on
#### the test set.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
#using the countvectorizer method from scikit-learn to define features
v = CountVectorizer()
features = v.fit_transform(spam_df['v2'])


In [3]:
mapping = {'spam': 1, 'ham': 0}

# Replace string values with numeric values using the map function
spam_df['v1'] = spam_df['v1'].map(mapping)
spam_df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,0,"Go until jurong point, crazy.. Available only ...",,,
1,0,Ok lar... Joking wif u oni...,,,
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,0,U dun say so early hor... U c already then say...,,,
4,0,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
#SVM supervised learning
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

x = features[:, 1:]

y = spam_df['v1']

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


#Creating SVM classifier object with linear Kernels
svm_classifier_linear = SVC(kernel='linear')
svm_classifier_linear.fit(x_train,y_train)

#Make predictions on the test set using each classifier
y_pred_linear = svm_classifier_linear.predict(x_test)

#calculating accuracy, recall, precision and F-Measure
accuracy_linear = accuracy_score(y_test, y_pred_linear)
recall_linear = recall_score(y_test, y_pred_linear)
precision_linear = precision_score(y_test, y_pred_linear)
f1_linear = f1_score(y_test, y_pred_linear)



print("Accuracy:", accuracy_linear)
print("Recall:",recall_linear)
print("Precision:",precision_linear)
print("F-measure:",f1_linear)

Accuracy: 0.979372197309417
Recall: 0.8733333333333333
Precision: 0.9703703703703703
F-measure: 0.9192982456140351


### Part C - Cleaning Text and SVM
#### • Preprocess email text by performing cleaning-up techniques such as removing stopwords,
#### punctuation, and performing stemming or lemmatization.
#### • Use the cleaned text as features and repeat steps as in Part B: splitting, training an SVM classifier,
#### and evaluating performance metrics.


In [5]:
from sklearn.feature_extraction import _stop_words
#preprocessing email text to remove stopwords
def remove_stopwords(text):
    non_stopwords = []
    for word in text.split():
        if word not in _stop_words.ENGLISH_STOP_WORDS:
            non_stopwords.append(word)
    cleaned_text = ' '.join(non_stopwords)
    return cleaned_text

# Apply the stop words function to each value in column number 2 which contains email
spam_df["v2"] = spam_df["v2"].apply(remove_stopwords)
spam_df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,0,"Go jurong point, crazy.. Available bugis n gre...",,,
1,0,Ok lar... Joking wif u oni...,,,
2,1,Free entry 2 wkly comp win FA Cup final tkts 2...,,,
3,0,U dun say early hor... U c say...,,,
4,0,"Nah I don't think goes usf, lives",,,


In [6]:
from nltk.stem import SnowballStemmer
#preprocessing email to perform stemming
# Create a SnowballStemmer instance for English
stemmer = SnowballStemmer('english')


def stem_column(text):
    stemmed_words = []
    for word in text.split():
        stemmed_words.append(stemmer.stem(word))
    return ' '.join(stemmed_words)

# Apply the stem_column function to each value in column number 2 which contains email
spam_df["v2"] = spam_df["v2"].apply(stem_column)
spam_df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,0,"go jurong point, crazy.. avail bugi n great wo...",,,
1,0,ok lar... joke wif u oni...,,,
2,1,free entri 2 wkli comp win fa cup final tkts 2...,,,
3,0,u dun say earli hor... u c say...,,,
4,0,"nah i don't think goe usf, live",,,


In [7]:
x = features[:, 1:]

y = spam_df['v1']

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


#Creating SVM classifier object with linear Kernels
svm_classifier_linear = SVC(kernel='linear')
svm_classifier_linear.fit(x_train,y_train)

#Make predictions on the test set using each classifier
y_pred_linear = svm_classifier_linear.predict(x_test)

#calculating accuracy, recall, precision and F-Measure
accuracy_linear = accuracy_score(y_test, y_pred_linear)
recall_linear = recall_score(y_test, y_pred_linear)
precision_linear = precision_score(y_test, y_pred_linear)
f1_linear = f1_score(y_test, y_pred_linear)



print("Accuracy:", accuracy_linear)
print("Recall:",recall_linear)
print("Precision:",precision_linear)
print("F-measure:",f1_linear)

Accuracy: 0.979372197309417
Recall: 0.8733333333333333
Precision: 0.9703703703703703
F-measure: 0.9192982456140351


### Part D - TF-IDF Vectors and SVM
#### • Use the TF-IDF (Term Frequency-Inverse Document Frequency) vectors of the email text as features.
#### • Split the data, train an SVM classifier, and evaluate the performance metrics as in the previous
#### approaches.


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
#Using the TF-IDF (Term Frequency-Inverse Document Frequency) vectors of the email text as features
v = TfidfVectorizer()
feastures = v.fit_transform(spam_df["v2"])

In [9]:
x = features[:, 1:]

y = spam_df['v1']

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


#Creating SVM classifier object with linear Kernels
svm_classifier_linear = SVC(kernel='linear')
svm_classifier_linear.fit(x_train,y_train)

#Make predictions on the test set using each classifier
y_pred_linear = svm_classifier_linear.predict(x_test)

#calculating accuracy, recall, precision and F-Measure
accuracy_linear = accuracy_score(y_test, y_pred_linear)
recall_linear = recall_score(y_test, y_pred_linear)
precision_linear = precision_score(y_test, y_pred_linear)
f1_linear = f1_score(y_test, y_pred_linear)



print("Accuracy:", accuracy_linear)
print("Recall:",recall_linear)
print("Precision:",precision_linear)
print("F-measure:",f1_linear)

Accuracy: 0.979372197309417
Recall: 0.8733333333333333
Precision: 0.9703703703703703
F-measure: 0.9192982456140351


### Part E: Compare Performance
#### • Compare the accuracy, recall, precision, and F1-score obtained from approaches used in Part B, Part
#### C, and Part D.

In [10]:
'''CountVectorizer with SVM:

Accuracy: 0.979372197309417
Recall: 0.8733333333333333
Precision: 0.9703703703703703
F-measure: 0.9192982456140351

Cleaning Text and SVM:
Accuracy: 0.9802690582959641
Recall: 0.8733333333333333
Precision: 0.9776119402985075
F-measure: 0.9225352112676056

TF-IDF Vectors and SVM:
Accuracy: 0.979372197309417
Recall: 0.8733333333333333
Precision: 0.9703703703703703
F-measure: 0.9192982456140351

All three approaches seem to have very similar accuracy, recall, precision, and F-measure values. if we consider the precision metric, Approach 2 ("Cleaning Text and SVM") has the highest precision value of 0.9776.
This indicates that it has a relatively low false positive rate, which might be desirable in a spam detection scenario.
'''

'CountVectorizer with SVM:\n\nAccuracy: 0.979372197309417\nRecall: 0.8733333333333333\nPrecision: 0.9703703703703703\nF-measure: 0.9192982456140351\n\nCleaning Text and SVM:\nAccuracy: 0.9802690582959641\nRecall: 0.8733333333333333\nPrecision: 0.9776119402985075\nF-measure: 0.9225352112676056\n\nTF-IDF Vectors and SVM:\nAccuracy: 0.9802690582959641\nRecall: 0.8733333333333333\nPrecision: 0.9776119402985075\nF-measure: 0.9225352112676056\n\n'