# SPAM SMS CLASSIFIER

## STEPS

1. Data Collection

    First, you need a dataset of SMS messages labeled as spam or ham (non-spam). The SMS Spam Collection is a widely used dataset of this kind; the copy used here is a tab-separated text file with one label and one message per line. [DATASET](http://localhost:8888/lab/tree/Downloads/SMSSpamCollection.txt)

2. Data Preprocessing

    Preprocessing the text data involves cleaning and preparing it for model training. 
    This includes:
    
    - **Tokenization**: Splitting the SMS into individual words or tokens.
    - **Lowercasing**: Converting all text to lowercase.
    - **Removing Punctuation and Special Characters**: Filtering out symbols that don't add meaning.
    - **Removing Stop Words**: Excluding common words that don't contribute much to the classification.
    - **Stemming/Lemmatization**: Reducing words to their base or root form.


3. Feature Extraction

    Transform the text data into numerical features using techniques like:

    - **Bag of Words (BoW)**: Counting the frequency of each word in the SMS.
    - **Term Frequency-Inverse Document Frequency (TF-IDF)**: Weighting each word by how often it appears in a message, discounted by how many messages in the dataset contain it.
    - **Word2Vec**: Representing words as dense vectors in a continuous vector space. The Word2Vec model from the gensim library can be used to generate these word embeddings.


4. Train-Test Split

    Split the dataset into a training set and a testing set so that the model can be evaluated on messages it was not trained on. The code below holds out 20% of the messages for testing.

5. Model Training

    Train the classifier with the Naive Bayes algorithm. Of its variants, Multinomial Naive Bayes is commonly used for text classification because it models word counts or frequencies.

6. Model Evaluation
 
    Evaluate the model using metrics like accuracy, precision, recall, and F1 score to check how well it performs on unseen data.
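
To make the difference between the Bag of Words and TF-IDF representations from step 3 concrete, here is a minimal sketch on three hypothetical messages (not taken from the dataset). It relies on scikit-learn's default tokenizer, which keeps only tokens of two or more word characters, so SMS shorthand such as "u" is dropped unless `token_pattern` is changed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy messages, used only for illustration.
messages = [
    "free entry win free prize",
    "u free tonight",
    "call me tonight",
]

# Bag of Words: raw term counts per message.
bow = CountVectorizer()
X_bow = bow.fit_transform(messages)
print(sorted(bow.vocabulary_))  # learned vocabulary, in column order
print(X_bow.toarray())

# TF-IDF: counts are reweighted so that terms found in many messages count
# less, then each row is scaled to unit length (scikit-learn's defaults).
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(messages)
print(X_tfidf.toarray().round(2))
```

In the third message, "call" and "tonight" both have a count of 1 under Bag of Words, but TF-IDF gives "call" the larger weight because "tonight" also occurs in another message.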

In [1]:
# install necessary libraries
#!pip install nltk
#!pip install gensim

In [2]:
# import necessary libraries
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc,
)

In [3]:
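# download the NLTK resources used below (tokenizer models, stop-word list, WordNet)
# (recent NLTK releases may additionally need nltk.download('punkt_tab'))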
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jenis\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jenis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jenis\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
# load the dataset
df = pd.read_csv(r'C:\Users\jenis\Downloads\SMSSpamCollection.txt', sep='\t', header=None)
df.columns = ['label', 'text']
print(df.head())

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [5]:
# pre-processing
# convert text to lowercase
df['text'] = df['text'].str.lower()
df['label'] = df['label'].str.lower()

# tokenization
df['tokens'] = df['text'].apply(word_tokenize)

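# removing punctuation and special characters
# note: `in` on a string is a substring test, so single punctuation marks are
# removed, but multi-character tokens such as '..' or "n't" are kept
punctuation = string.punctuation
df['tokens'] = df['tokens'].apply(lambda tokens: [token for token in tokens if token not in punctuation])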

# remove stopwords
stop_words = set(stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda tokens: [token for token in tokens if token not in stop_words])

# stemming
stemmer = PorterStemmer()
df['tokens'] = df['tokens'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])

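# lemmatization
# note: this runs on the already-stemmed tokens, so it has little further
# effect; typically only one of stemming or lemmatization is applied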
lemmatizer = WordNetLemmatizer()
df['tokens'] = df['tokens'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])

# pre-processed data
print(df.head())

  label                                               text  \
0   ham  go until jurong point, crazy.. available only ...   
1   ham                      ok lar... joking wif u oni...   
2  spam  free entry in 2 a wkly comp to win fa cup fina...   
3   ham  u dun say so early hor... u c already then say...   
4   ham  nah i don't think he goes to usf, he lives aro...   

                                              tokens  
0  [go, jurong, point, crazi, .., avail, bugi, n,...  
1             [ok, lar, ..., joke, wif, u, oni, ...]  
2  [free, entri, 2, wkli, comp, win, fa, cup, fin...  
3  [u, dun, say, earli, hor, ..., u, c, alreadi, ...  
4  [nah, n't, think, goe, usf, live, around, though]  
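
The `..` and `...` tokens in the output above survive because `token not in punctuation` tests whether the token is a substring of `string.punctuation`, and a run of dots is not. One possible alternative, shown here on hypothetical pre-tokenized input, keeps a token only if it contains at least one letter or digit. The helper name `has_alnum` is illustrative, and this variant still keeps contraction pieces such as `n't`.

```python
import string

# Hypothetical pre-tokenized message, mimicking word_tokenize output.
tokens = ["ok", "lar", "...", "joke", "wif", "u", "oni", "...", "!", "n't"]

# Approach used in the cell above: substring test against string.punctuation.
substring_filtered = [t for t in tokens if t not in string.punctuation]


def has_alnum(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


# Alternative: drop tokens made up entirely of non-alphanumeric characters.
alnum_filtered = [t for t in tokens if has_alnum(t)]

print(substring_filtered)
print(alnum_filtered)
```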


In [6]:
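# feature extraction
# note: BoW and TF-IDF are built from the lowercased df['text'] using
# scikit-learn's own tokenizer; df['tokens'] is only used for Word2Vec
# BoW
cv = CountVectorizer()  # initialize
X_bow = cv.fit_transform(df['text'])  # fit and transform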

# TF-IDF
tfidf_vectorizer = TfidfVectorizer() # initialize
X_tfidf = tfidf_vectorizer.fit_transform(df['text']) # fit and transform

# Word2Vec
# sg: 1 for skip-gram, 0 for CBoW
# hs (not set here, so the library default applies): 1 for hierarchical softmax, 0 for negative sampling
word2vec_model = Word2Vec(
    sentences=df['tokens'],
    vector_size=100,
    window=5,
    min_count=1,
    workers=4,
    sg=1,
)
word_vectors = word2vec_model.wv
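
The Word2Vec model above yields one vector per word, whereas a classifier needs one vector per message. A common way to bridge that gap is to average the vectors of a message's tokens. The sketch below assumes this mean-pooling approach, which the notebook itself does not use; `message_vector` is a hypothetical helper, and the two-dimensional toy vectors stand in for `word2vec_model.wv`. Averaged embeddings can be negative, so they would not be valid input for `MultinomialNB`, which expects non-negative features.

```python
import numpy as np


def message_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy stand-in for word2vec_model.wv (any token-to-vector mapping works).
toy_vectors = {
    "free": np.array([1.0, 0.0]),
    "win": np.array([0.0, 1.0]),
}

print(message_vector(["free", "win", "unknown"], toy_vectors, dim=2))
print(message_vector(["unknown"], toy_vectors, dim=2))
```

With the notebook's model, `vectors` would be `word_vectors` and `dim` would be 100, matching `vector_size`.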

In [7]:
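# train test split
# note: this recomputes the same matrix as X_tfidf from the previous cell; the
# vectorizer is fitted on all messages before the split, so its vocabulary and
# IDF weights also reflect the test messages
X = tfidf_vectorizer.fit_transform(df['text'])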
y = df['label']

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
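
The cell above vectorizes every message before splitting. A stricter setup, sketched here as one option rather than as the notebook's method, splits the raw text first and wraps the vectorizer and classifier in a `Pipeline`, so that TF-IDF is fitted on the training portion only. The toy messages are hypothetical, and `stratify` is an added assumption that keeps the spam/ham ratio the same in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in the notebook these would be df['text'] and df['label'].
texts = [
    "win a free prize now", "free entry to win cash", "claim your free reward",
    "urgent call to claim prize", "free cash waiting for you",
    "are you coming home tonight", "see you at lunch", "ok talk later",
    "call me when you get home", "thanks for the lift today",
]
labels = ["spam"] * 5 + ["ham"] * 5

# Split the raw text first, keeping the class ratio equal in both parts.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
pipeline.fit(text_train, label_train)

# At prediction time the already-fitted vectorizer is applied to unseen text.
predictions = pipeline.predict(text_test)
print(predictions)
```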

In [8]:
# train the model
model = MultinomialNB()
model.fit(X_train, y_train)

In [9]:
# make predictions
y_pred = model.predict(X_test)
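
`model.predict` returns hard labels. The `roc_curve` and `auc` functions imported in cell 2 are never used in the notebook; they need a score per message instead, which `MultinomialNB` exposes through `predict_proba`. The following is a minimal sketch on hypothetical toy data, not the notebook's own evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import auc, roc_curve
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, only to keep the sketch self-contained.
train_texts = [
    "win a free prize now", "free entry to win cash", "claim your free reward",
    "are you coming home tonight", "see you at lunch", "ok talk later",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free prize waiting", "call me at lunch", "win cash now", "coming home later"]
test_labels = ["spam", "ham", "spam", "ham"]

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# predict_proba columns follow clf.classes_, so look up the 'spam' column.
spam_col = list(clf.classes_).index("spam")
spam_scores = clf.predict_proba(vectorizer.transform(test_texts))[:, spam_col]

# ROC curve over all score thresholds, and the area under it.
fpr, tpr, thresholds = roc_curve(test_labels, spam_scores, pos_label="spam")
roc_auc = auc(fpr, tpr)
print("ROC AUC:", roc_auc)
```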

In [10]:
# evaluate the model
# accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

# precision
precision = precision_score(y_test, y_pred, pos_label='spam')
print("Precision: ", precision)

# recall
recall = recall_score(y_test, y_pred, pos_label='spam')
print("Recall: ", recall)

# f1 score
f1 = f1_score(y_test, y_pred, pos_label='spam')
print("F1 Score: ", f1)

# confusion matrix
# rows are true labels, columns are predicted labels, in sorted order (ham, spam)
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix: \n", cm)

# classification report
cls_report = classification_report(y_test, y_pred)
print("Classification report: \n", cls_report)

Accuracy:  0.9668161434977578
Precision:  1.0
Recall:  0.7516778523489933
F1 Score:  0.8582375478927203
Confusion matrix: 
 [[966   0]
 [ 37 112]]
Classification report: 
               precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       1.00      0.75      0.86       149

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115
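
The headline metrics can be recomputed by hand from the confusion matrix above, which helps show why accuracy looks strong while recall lags. scikit-learn orders rows and columns by sorted label, so row 0 is ham and row 1 is spam; the variable names below are illustrative.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, ordered ham then spam.
cm = [[966, 0],
      [37, 112]]

tn, fp = cm[0]  # true ham: predicted ham, predicted spam
fn, tp = cm[1]  # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

No ham message was flagged as spam, hence the precision of 1.0, but 37 of the 149 spam messages were missed, which is where the recall of about 0.75 comes from.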

