# **SMS Spam Classification using NLP**




SMS spam classification using NLP is a text analysis technique that identifies unwanted spam messages and separates them from legitimate text messages. By applying Natural Language Processing (NLP) methods such as tokenization, stemming, and TF-IDF vectorization, it detects spam in SMS messages automatically. Such filtering improves mobile communication security by screening out unsolicited and potentially harmful content.

In [1]:
import pandas as pd
df_train=pd.read_csv("/content/drive/MyDrive/luminar dataset/SMS_train.csv",encoding='ISO-8859-1')
df_train

Unnamed: 0,S. No.,Message_body,Label
0,1,Rofl. Its true to its name,Non-Spam
1,2,The guy did some bitching but I acted like i'd...,Non-Spam
2,3,"Pity, * was in mood for that. So...any other s...",Non-Spam
3,4,Will ü b going to esplanade fr home?,Non-Spam
4,5,This is the 2nd time we have tried 2 contact u...,Spam
...,...,...,...
952,953,hows my favourite person today? r u workin har...,Non-Spam
953,954,How much you got for cleaning,Non-Spam
954,955,Sorry da. I gone mad so many pending works wha...,Non-Spam
955,956,Wat time ü finish?,Non-Spam
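
A likely reason for passing `encoding='ISO-8859-1'` is that some messages contain non-ASCII characters such as `ü`. The sketch below uses an in-memory string instead of the actual CSV file and shows that bytes written in ISO-8859-1 do not decode as UTF-8:

```python
# Illustrative bytes only; this does not read SMS_train.csv.
raw = "Will ü b going to esplanade fr home?".encode("ISO-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so decoding succeeds.
text = raw.decode("ISO-8859-1")
print(text)

# The same bytes are not valid UTF-8: 0xFC cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("decodes as UTF-8:", utf8_ok)
```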


In [2]:
df_test=pd.read_csv("/content/drive/MyDrive/luminar dataset/SMS_test.csv",encoding='ISO-8859-1')
df_test

Unnamed: 0,S. No.,Message_body,Label
0,1,"UpgrdCentre Orange customer, you may now claim...",Spam
1,2,"Loan for any purpose £500 - £75,000. Homeowner...",Spam
2,3,Congrats! Nokia 3650 video camera phone is you...,Spam
3,4,URGENT! Your Mobile number has been awarded wi...,Spam
4,5,Someone has contacted our dating service and e...,Spam
...,...,...,...
120,121,7 wonders in My WORLD 7th You 6th Ur style 5th...,Non-Spam
121,122,Try to do something dear. You read something f...,Non-Spam
122,123,Sun ah... Thk mayb can if dun have anythin on....,Non-Spam
123,124,"SYMPTOMS when U are in love: ""1.U like listeni...",Non-Spam


In [3]:
df=pd.concat([df_train,df_test],ignore_index=True,axis=0)
df

Unnamed: 0,S. No.,Message_body,Label
0,1,Rofl. Its true to its name,Non-Spam
1,2,The guy did some bitching but I acted like i'd...,Non-Spam
2,3,"Pity, * was in mood for that. So...any other s...",Non-Spam
3,4,Will ü b going to esplanade fr home?,Non-Spam
4,5,This is the 2nd time we have tried 2 contact u...,Spam
...,...,...,...
1077,121,7 wonders in My WORLD 7th You 6th Ur style 5th...,Non-Spam
1078,122,Try to do something dear. You read something f...,Non-Spam
1079,123,Sun ah... Thk mayb can if dun have anythin on....,Non-Spam
1080,124,"SYMPTOMS when U are in love: ""1.U like listeni...",Non-Spam


In [4]:
df.drop(['S. No.'],axis=1,inplace=True)
df

Unnamed: 0,Message_body,Label
0,Rofl. Its true to its name,Non-Spam
1,The guy did some bitching but I acted like i'd...,Non-Spam
2,"Pity, * was in mood for that. So...any other s...",Non-Spam
3,Will ü b going to esplanade fr home?,Non-Spam
4,This is the 2nd time we have tried 2 contact u...,Spam
...,...,...
1077,7 wonders in My WORLD 7th You 6th Ur style 5th...,Non-Spam
1078,Try to do something dear. You read something f...,Non-Spam
1079,Sun ah... Thk mayb can if dun have anythin on....,Non-Spam
1080,"SYMPTOMS when U are in love: ""1.U like listeni...",Non-Spam


In [5]:
df['Label'].unique()

array(['Non-Spam', 'Spam'], dtype=object)

In [6]:
df['Label']=df['Label'].map({'Spam':1,'Non-Spam':0})
df.head()

Unnamed: 0,Message_body,Label
0,Rofl. Its true to its name,0
1,The guy did some bitching but I acted like i'd...,0
2,"Pity, * was in mood for that. So...any other s...",0
3,Will ü b going to esplanade fr home?,0
4,This is the 2nd time we have tried 2 contact u...,1
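
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a label with different casing or stray whitespace would silently become missing. A minimal check, using made-up labels rather than the dataset, might look like this:

```python
import pandas as pd

# Hypothetical labels, including one that does not match the mapping exactly.
labels = pd.Series(["Spam", "Non-Spam", "Non-Spam", "spam "])

mapping = {"Spam": 1, "Non-Spam": 0}
encoded = labels.map(mapping)

# Values missing from the dictionary come back as NaN.
unmapped = labels[encoded.isna()].tolist()
print("unmapped labels:", unmapped)

# Class counts give a quick view of how imbalanced the data is.
counts = encoded.value_counts().to_dict()
print(counts)
```

In this notebook `df['Label'].unique()` already shows only the two expected values, so the check is a safeguard for other data rather than a fix.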


In [7]:
msg=df.Message_body
msg

0                              Rofl. Its true to its name
1       The guy did some bitching but I acted like i'd...
2       Pity, * was in mood for that. So...any other s...
3                    Will ü b going to esplanade fr home?
4       This is the 2nd time we have tried 2 contact u...
                              ...                        
1077    7 wonders in My WORLD 7th You 6th Ur style 5th...
1078    Try to do something dear. You read something f...
1079    Sun ah... Thk mayb can if dun have anythin on....
1080    SYMPTOMS when U are in love: "1.U like listeni...
1081    Great. Have a safe trip. Dont panic surrender ...
Name: Message_body, Length: 1082, dtype: object

In [8]:
msg=msg.str.replace('[^a-zA-Z0-9]+'," ",regex=True)
msg


0                               Rofl Its true to its name
1       The guy did some bitching but I acted like i d...
2       Pity was in mood for that So any other suggest...
3                      Will b going to esplanade fr home 
4       This is the 2nd time we have tried 2 contact u...
                              ...                        
1077    7 wonders in My WORLD 7th You 6th Ur style 5th...
1078    Try to do something dear You read something fo...
1079    Sun ah Thk mayb can if dun have anythin on Thk...
1080    SYMPTOMS when U are in love 1 U like listening...
1081     Great Have a safe trip Dont panic surrender all 
Name: Message_body, Length: 1082, dtype: object
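
The same pattern can be tried on a single string with `re.sub`, which makes its effect easier to see. The helper name `clean` is an assumption for illustration. In pandas 2.0 and later, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed.

```python
import re

def clean(text):
    # Replace every run of characters outside a-z, A-Z, 0-9 with one space.
    return re.sub(r"[^a-zA-Z0-9]+", " ", text)

cleaned = clean("Will ü b going to esplanade fr home?")
print(repr(cleaned))  # 'Will b going to esplanade fr home '
```

Non-ASCII letters such as `ü` and symbols such as `£` are removed along with punctuation, and a trailing space is left where the final `?` was.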

In [9]:
import nltk
from nltk.stem import PorterStemmer
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [10]:
stemmer=PorterStemmer()
# PorterStemmer is English-only, so it takes no language argument (unlike SnowballStemmer)
msg=msg.apply(lambda line:[stemmer.stem(token.lower()) for token in word_tokenize(line)]).apply(lambda tokens:" ".join(tokens))
msg

0                                 rofl it true to it name
1       the guy did some bitch but i act like i d be i...
2           piti wa in mood for that so ani other suggest
3                           will b go to esplanad fr home
4       thi is the 2nd time we have tri 2 contact u u ...
                              ...                        
1077    7 wonder in my world 7th you 6th ur style 5th ...
1078       tri to do someth dear you read someth for exam
1079    sun ah thk mayb can if dun have anythin on thk...
1080    symptom when u are in love 1 u like listen son...
1081        great have a safe trip dont panic surrend all
Name: Message_body, Length: 1082, dtype: object

In [11]:
#removing stop words
from nltk.corpus import stopwords
nltk.download('stopwords')
sw=stopwords.words('english')
msg=msg.apply(lambda line:[token for token in word_tokenize(line) if token not in sw]).apply(lambda token:" ".join(token))
msg

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


0                                          rofl true name
1       guy bitch act like interest buy someth els nex...
2                                piti wa mood ani suggest
3                                   b go esplanad fr home
4       thi 2nd time tri 2 contact u u 750 pound prize...
                              ...                        
1077    7 wonder world 7th 6th ur style 5th ur smile 4...
1078                     tri someth dear read someth exam
1079    sun ah thk mayb dun anythin thk book e lesson ...
1080    symptom u love 1 u like listen song 2 u get st...
1081                   great safe trip dont panic surrend
Name: Message_body, Length: 1082, dtype: object
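
Because stemming runs before stop-word removal here, some stop words are altered before the filter sees them, which is why tokens such as `thi` and `wa` survive in the output. The sketch below compares the two orders on one phrase. The small `stop_words` set is an assumption for illustration, not NLTK's full English list.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Hand-picked stop words for illustration only.
stop_words = {"this", "is", "the", "was", "any"}
tokens = "this is the 2nd time".split()

# Order used in the notebook: stem first, then filter.
stem_then_filter = [s for s in (stemmer.stem(t) for t in tokens)
                    if s not in stop_words]

# Alternative order: filter first, then stem.
filter_then_stem = [stemmer.stem(t) for t in tokens if t not in stop_words]

print(stem_then_filter)
print(filter_then_stem)
```

Filtering first removes `this` before it can be stemmed to `thi`. Whether that changes classifier performance on this dataset is not tested here.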

In [12]:
msg=msg.apply(lambda line:[token for token in word_tokenize(line) if len(token)>2]).apply(lambda token:" ".join(token))
msg

0                                          rofl true name
1       guy bitch act like interest buy someth els nex...
2                                   piti mood ani suggest
3                                           esplanad home
4       thi 2nd time tri contact 750 pound prize claim...
                              ...                        
1077    wonder world 7th 6th style 5th smile 4th perso...
1078                     tri someth dear read someth exam
1079    sun thk mayb dun anythin thk book lesson pilat...
1080    symptom love like listen song get stop see nam...
1081                   great safe trip dont panic surrend
Name: Message_body, Length: 1082, dtype: object

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec=TfidfVectorizer()
data_vec=vec.fit_transform(msg)
print(data_vec)

  (0, 1738)	0.5086856793431559
  (0, 2596)	0.5352804139572925
  (0, 2142)	0.6743246681420617
  (1, 1116)	0.1975505629026879
  (1, 1159)	0.32775007986689586
  (1, 2732)	0.232360173562465
  (1, 1771)	0.26486371982881535
  (1, 964)	0.30359792197566066
  (1, 2326)	0.27484084856681323
  (1, 614)	0.27484084856681323
  (1, 1385)	0.3183810904534948
  (1, 1525)	0.21717710288602465
  (1, 331)	0.3539999646600926
  (1, 540)	0.37483567038885635
  (1, 1230)	0.2679790477690628
  (2, 2425)	0.5165656915002457
  (2, 399)	0.36716239650585775
  (2, 1697)	0.5469696796701571
  (2, 1921)	0.5469696796701571
  (3, 1303)	0.5461172911588754
  (3, 991)	0.8377087228251189
  (4, 2055)	0.23243263023633923
  (4, 1747)	0.26941298641228506
  (4, 1666)	0.2401340087334995
  (4, 1893)	0.23610804613087077
  :	:
  (1079, 1844)	0.2706898396742334
  (1079, 1713)	0.2706898396742334
  (1079, 2431)	0.26153957460502464
  (1079, 1514)	0.24774213284441401
  (1079, 561)	0.2374366876682329
  (1079, 2521)	0.49548426568882803
  (1079, 

In [14]:
data_vec.shape

(1082, 2844)
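
Each line of the sparse output above reads `(row, column) weight`: the message index, the vocabulary index, and the TF-IDF weight. The sketch below reimplements the weighting in plain Python, assuming scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). It splits on whitespace, which is simpler than the vectorizer's own token pattern, and the toy messages are not taken from the dataset.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalization."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Scale the row to unit Euclidean length (guarding empty documents).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Toy messages for illustration.
docs = ["free prize claim", "claim prize now", "see you at home"]
rows = tfidf(docs)
for row in rows:
    print({term: round(w, 3) for term, w in row.items()})
```

The L2 normalization explains why the three weights printed for row 0 (about 0.509, 0.535, and 0.674) have squares that sum to 1.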

In [15]:
y=df['Label'].values
y

array([0, 0, 0, ..., 0, 0, 0])

In [16]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(data_vec,y,test_size=0.3,random_state=1)
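
In the cells above, the vectorizer is fitted on all messages before the split, so the vocabulary and IDF values also reflect the test messages. One alternative, sketched here on toy data, is to split the raw text first and let a scikit-learn pipeline fit TF-IDF on the training part only. The `stratify` argument keeps the spam ratio the same in both parts. The messages and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages and labels for illustration (1 = spam, 0 = non-spam).
texts = ["win free prize now", "claim your free cash", "urgent prize claim",
         "free entry win cash", "see you at home", "call me later",
         "lunch at noon", "how was your day", "are you coming home",
         "meeting moved to monday", "thanks for the ride", "good night"]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Split the raw text first, keeping the class ratio in both parts.
train_txt, test_txt, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels)

# The pipeline fits TF-IDF on the training text only, then the classifier.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_txt, y_tr)
pred = clf.predict(test_txt)
print(pred)
```

A pipeline also applies the fitted vectorizer automatically when predicting on new text, so `clf.predict(["some new message"])` needs no separate `transform` call.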

In [17]:
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
sv=SVC()
nb=MultinomialNB()
rf=RandomForestClassifier()
ab=AdaBoostClassifier()
models=[sv,nb,rf,ab]
for model in models:
  print(model)
  model.fit(X_train,y_train)
  y_pred=model.predict(X_test)
  # Classify one unseen message; label 1 means Spam and 0 means Non-Spam
  y_new=model.predict(vec.transform(["Bought a fraction of Microsoft today Small wins"]))
  if y_new[0]==1:
      print('Spam')
  else:
      print('Non-Spam')
  print(classification_report(y_test,y_pred))

SVC()
Non-Spam
              precision    recall  f1-score   support

           0       0.88      1.00      0.93       260
           1       0.97      0.45      0.61        65

    accuracy                           0.89       325
   macro avg       0.92      0.72      0.77       325
weighted avg       0.90      0.89      0.87       325

MultinomialNB()
Non-Spam
              precision    recall  f1-score   support

           0       0.88      1.00      0.94       260
           1       1.00      0.46      0.63        65

    accuracy                           0.89       325
   macro avg       0.94      0.73      0.78       325
weighted avg       0.91      0.89      0.88       325

RandomForestClassifier()
Non-Spam
              precision    recall  f1-score   support

           0       0.92      0.99      0.96       260
           1       0.96      0.68      0.79        65

    accuracy                           0.93       325
   macro avg       0.94      0.83      0.88       325
wei

# **Conclusion**

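This SMS spam classification experiment compared four classifiers: SVC, MultinomialNB, RandomForestClassifier, and AdaBoostClassifier. AdaBoost achieved the highest accuracy, 94%, making it the most effective of the four at distinguishing spam from non-spam messages. SVC and MultinomialNB each reached 89% accuracy and Random Forest reached 93%. Recall on the spam class was much lower than those accuracy figures suggest (0.45, 0.46, and 0.68 for those three models), so accuracy alone overstates performance on this imbalanced test split of 260 non-spam and 65 spam messages.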
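
To see how accuracy and spam recall can diverge on an imbalanced test set, the sketch below recomputes the main metrics from a confusion matrix. The four counts are hypothetical: they are not printed anywhere in the notebook and were chosen only because they are consistent with the rounded SVC report (260 non-spam and 65 spam test messages).

```python
# Hypothetical confusion-matrix counts, consistent with the rounded SVC report.
tp, fn = 29, 36   # spam predicted as spam / spam missed
fp, tn = 1, 259   # non-spam flagged as spam / non-spam kept

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.97 recall=0.45 f1=0.61 accuracy=0.89
```

With these counts, more than half of the spam messages are missed even though accuracy is 0.89, because the 260 non-spam messages dominate the total.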
