# MACHINE LEARNING INTERN - CODSOFT

# TASK 4 SPAM SMS DETECTION

OBJECTIVE: 

Build an AI model that classifies SMS messages as spam or legitimate (ham). Use techniques such as TF-IDF or word embeddings with classifiers such as Logistic Regression, Naive Bayes, or Support Vector Machines to identify spam messages.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
ssd = pd.read_csv('spam.csv',encoding = "latin1")
ssd.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [3]:
# checking size of the Dataset
ssd.shape

(5572, 5)

In [4]:
# checking Null values
ssd.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [5]:
# Checking duplicate values
ssd.duplicated().sum()

403

In [6]:
# checking descriptive statistics of the dataset
ssd.describe()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | MK17 92H. 450Ppw 16" | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |


In [7]:
# checking information of the dataset
ssd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


The three `Unnamed` columns are more than 99% null (5522, 5560, and 5566 missing values out of 5572 rows), so we drop them. They carry almost no information and would only add noise when building features.
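
The same decision can be made programmatically rather than by reading the null counts by eye. The sketch below is only an illustration: the `toy` frame, the `threshold` value, and the variable names are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loaded data (hypothetical values, not the real file).
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["hi", "win cash", "ok", "later"],
    "Unnamed: 2": [np.nan, np.nan, np.nan, "x"],
    "Unnamed: 3": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column.
null_fraction = toy.isnull().mean()

# Assumed cut-off: drop any column that is more than half empty.
threshold = 0.5
sparse_cols = null_fraction[null_fraction > threshold].index.tolist()

cleaned = toy.drop(columns=sparse_cols)
print(sparse_cols)
print(cleaned.columns.tolist())
```

On the real dataset a much stricter threshold (for example 0.99) would select the same three columns, since each of them is more than 99% null.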

In [8]:
# Drop the near-empty columns by name
ssd.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

In [9]:
ssd.head()

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [10]:
ssd.shape

(5572, 2)

In [11]:
# Rename the columns so they are easier to understand (df.rename would also work)
ssd.columns=['spam/ham','sms']

In [12]:
# Encode the labels numerically: spam -> 0, ham -> 1
ssd.loc[ssd['spam/ham'] == 'spam', 'spam/ham'] = 0
ssd.loc[ssd['spam/ham'] == 'ham', 'spam/ham'] = 1
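
Assigning through `.loc` leaves the column with `object` dtype, which is why the labels are cast with `astype('int')` later in the notebook. As an alternative sketch (not what the notebook does), `Series.map` with a dictionary can encode the labels and produce an integer column in one step. The toy `labels` series below is an assumption for illustration.

```python
import pandas as pd

# Toy labels standing in for the 'spam/ham' column (hypothetical values).
labels = pd.Series(["ham", "spam", "ham"], name="spam/ham")

# Same convention as the notebook: spam -> 0, ham -> 1.
encoded = labels.map({"spam": 0, "ham": 1})

print(encoded.tolist())
print(encoded.dtype)
```

Any label missing from the dictionary would become `NaN`, so this approach assumes the column contains only the two expected values.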

# Separating Input Features and Target Column:

In [13]:
x = ssd.sms
x.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: sms, dtype: object

In [14]:
y = ssd['spam/ham']
y.head()

0    1
1    1
2    0
3    1
4    1
Name: spam/ham, dtype: object

In [15]:
# Divide the whole dataset into training and testing set for model training
from sklearn.model_selection import train_test_split

In [16]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=42)
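
The `describe()` output shows that 4825 of the 5572 messages are ham, so the classes are imbalanced. One optional variation, which the notebook does not use, is to pass `stratify` to `train_test_split` so that both splits keep roughly the same spam/ham ratio. The sketch below uses made-up toy arrays (`x_toy`, `y_toy`) purely to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 20 spam (0) and 80 ham (1).
y_toy = np.array([0] * 20 + [1] * 80)
x_toy = np.arange(len(y_toy))

# stratify=y_toy asks for the class proportions to be preserved in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Share of ham in each split; both should be close to the overall 0.8.
print(y_tr.mean(), y_te.mean())
```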

In [17]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(3900,)
(1672,)
(3900,)
(1672,)


In [18]:
x_train

708     To review and KEEP the fantastic Nokia N-Gage ...
4338                   Just got outta class gonna go gym.
5029    Is there coming friday is leave for pongal?do ...
4921    Hi Dear Call me its urgnt. I don't know whats ...
2592    My friend just got here and says he's upping h...
                              ...                        
3772    I came hostel. I m going to sleep. Plz call me...
5191                               Sorry, I'll call later
5226        Prabha..i'm soryda..realy..frm heart i'm sory
5390                           Nt joking seriously i told
860                   In work now. Going have in few min.
Name: sms, Length: 3900, dtype: object

In [19]:
x_test

3245    Funny fact Nobody teaches volcanoes 2 erupt, t...
944     I sent my scores to sophas and i had to do sec...
1044    We know someone who you know that fancies you....
2484    Only if you promise your getting out as SOON a...
812     Congratulations ur awarded either £500 of CD ...
                              ...                        
2505                 Congrats kano..whr s the treat maga?
2525    Say this slowly.? GOD,I LOVE YOU &amp; I NEED ...
4975    You are gorgeous! keep those pix cumming :) th...
650     Thats cool! Sometimes slow and gentle. Sonetim...
4463         Ranjith cal drpd Deeraj and deepak 5min hold
Name: sms, Length: 1672, dtype: object

In [20]:
y_train

708     0
4338    1
5029    1
4921    1
2592    1
       ..
3772    1
5191    1
5226    1
5390    1
860     1
Name: spam/ham, Length: 3900, dtype: object

In [21]:
y_test

3245    1
944     1
1044    0
2484    1
812     0
       ..
2505    1
2525    1
4975    1
650     1
4463    1
Name: spam/ham, Length: 1672, dtype: object

# Text to Vector Conversion:

Observation:
Machine learning algorithms operate on numbers, not raw text, so the messages must be converted into numeric features. To do this I use the `TfidfVectorizer` technique from `sklearn.feature_extraction.text`.
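
To make the idea concrete, here is a minimal sketch of TF-IDF on three made-up messages (the `docs` list is an assumption, not data from `spam.csv`). Each message becomes one row of a sparse matrix and each remaining vocabulary word becomes one column. English stop words and single-character tokens are dropped by the settings used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages for illustration only.
docs = [
    "Free entry to win a prize",
    "Ok see you later",
    "Win a free prize now",
]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
matrix = vec.fit_transform(docs)

# vocabulary_ maps each kept word to its column index.
vocab = sorted(vec.vocabulary_)
print(vocab)
print(matrix.shape)               # (number of messages, vocabulary size)
print(matrix.toarray().round(2))  # dense view, only sensible for tiny examples
```

Words that appear in many messages receive lower weights than words concentrated in a few, which is what lets rare, spam-typical terms stand out.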

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
feat_vect = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
feat_vect

In [24]:
y_train = y_train.astype('int')
y_test = y_test.astype('int')

In [25]:
x_train_vec = feat_vect.fit_transform(x_train)
x_train_vec

<3900x6946 sparse matrix of type '<class 'numpy.float64'>'
	with 30413 stored elements in Compressed Sparse Row format>

In [26]:
print(x_train_vec)

  (0, 6784)	0.2036674061100652
  (0, 5127)	0.16069009765718537
  (0, 832)	0.2942029533221057
  (0, 6425)	0.2299888748870684
  (0, 4294)	0.2942029533221057
  (0, 1730)	0.1830967933231891
  (0, 1701)	0.2942029533221057
  (0, 6835)	0.1703267398384584
  (0, 1687)	0.23571983525661494
  (0, 2013)	0.2942029533221057
  (0, 2795)	0.2274465254896455
  (0, 2788)	0.2942029533221057
  (0, 4325)	0.3863819557282538
  (0, 2533)	0.24260404766807295
  (0, 5179)	0.26276866734255794
  (1, 2988)	0.48547999128245667
  (1, 2895)	0.38594294491815584
  (1, 1661)	0.40369967264325335
  (1, 4507)	0.5411887811303221
  (1, 2914)	0.29947657175915554
  (1, 3510)	0.2642201388730181
  (2, 4689)	0.33702897982631685
  (2, 6786)	0.3126942228233435
  (2, 4293)	0.4056944099354124
  (2, 4758)	0.46869821034881815
  :	:
  (3894, 5333)	0.3355206697396443
  (3894, 4373)	0.34776563110407227
  (3895, 3201)	0.49796356031820593
  (3895, 3180)	0.444758353553002
  (3895, 4724)	0.377219004761111
  (3895, 1471)	0.3587277158940319

In [27]:
x_test_vec = feat_vect.transform(x_test)
x_test_vec

<1672x6946 sparse matrix of type '<class 'numpy.float64'>'
	with 11442 stored elements in Compressed Sparse Row format>

In [28]:
print(x_test_vec)

  (0, 6720)	0.3047058203677519
  (0, 6332)	0.39828278711386805
  (0, 4244)	0.37945200739329527
  (0, 3510)	0.17367416497787852
  (0, 3234)	0.3472605768360241
  (0, 3028)	0.3472605768360241
  (0, 2781)	0.3401014466481085
  (0, 2503)	0.3472605768360241
  (0, 1638)	0.31135176945323223
  (1, 6136)	0.2724042527070941
  (1, 6133)	0.1950483936798134
  (1, 5399)	0.220101857716323
  (1, 5336)	0.3523559459587206
  (1, 5332)	0.3523559459587206
  (1, 5331)	0.2557448905246011
  (1, 5139)	0.3356965837762275
  (1, 4442)	0.30088362046477946
  (1, 3475)	0.32387657820103527
  (1, 2477)	0.32387657820103527
  (1, 1840)	0.2485812254740083
  (1, 1805)	0.2253541529012992
  (2, 4728)	0.4021320697253737
  (2, 3828)	0.5083401006088638
  (2, 3581)	0.4886099291402999
  (2, 2529)	0.48430581952163576
  :	:
  (1668, 4717)	0.3776751165484962
  (1668, 4614)	0.21959656170278494
  (1668, 4265)	0.1729987063210788
  (1668, 4063)	0.29289917468482934
  (1668, 3809)	0.1695055149823706
  (1668, 3066)	0.21959656170278494

In [29]:
print(x_train)

708     To review and KEEP the fantastic Nokia N-Gage ...
4338                   Just got outta class gonna go gym.
5029    Is there coming friday is leave for pongal?do ...
4921    Hi Dear Call me its urgnt. I don't know whats ...
2592    My friend just got here and says he's upping h...
                              ...                        
3772    I came hostel. I m going to sleep. Plz call me...
5191                               Sorry, I'll call later
5226        Prabha..i'm soryda..realy..frm heart i'm sory
5390                           Nt joking seriously i told
860                   In work now. Going have in few min.
Name: sms, Length: 3900, dtype: object


# Logistic Regression Algorithm:

In [30]:
from sklearn.linear_model import LogisticRegression

In [31]:
log_reg = LogisticRegression()
log_reg.fit(x_train_vec,y_train)
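
For classifying new, unseen messages it can be convenient to bundle the vectorizer and the classifier so that raw text goes in and a label comes out. The following is a sketch under stated assumptions, not the notebook's own code: the training messages, `spam_clf`, and `new_messages` are invented for illustration, and four examples are far too few for a meaningful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; labels follow the notebook: spam = 0, ham = 1.
messages = [
    "Win a free prize now, call today",
    "Free entry, claim your cash prize",
    "Are we still meeting for lunch?",
    "Ok, see you at home later",
]
labels = [0, 0, 1, 1]

# The pipeline applies TF-IDF first, then logistic regression.
spam_clf = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_clf.fit(messages, labels)

# Raw strings can be passed directly; the fitted vectorizer is reused internally.
new_messages = ["Claim your free prize", "See you at lunch"]
predictions = spam_clf.predict(new_messages)
print(predictions)
```

A pipeline also ensures the vectorizer is fitted only on training data, mirroring the `fit_transform` on `x_train` and `transform` on `x_test` used in this notebook.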

In [32]:
log_reg.score(x_train_vec,y_train)

0.9669230769230769

In [33]:
log_reg.score(x_test_vec,y_test)

0.9527511961722488

In [34]:
pred_log_reg=log_reg.predict(x_test_vec)
pred_log_reg

array([1, 1, 1, ..., 1, 1, 1])

In [35]:
# Evaluation metrics:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,f1_score,recall_score,precision_score

In [36]:
accuracy_score(y_test,pred_log_reg)

0.9527511961722488

In [37]:
confusion_matrix(y_test,pred_log_reg)

array([[ 143,   76],
       [   3, 1450]], dtype=int64)

In [38]:
print(classification_report(y_test,pred_log_reg))

              precision    recall  f1-score   support

           0       0.98      0.65      0.78       219
           1       0.95      1.00      0.97      1453

    accuracy                           0.95      1672
   macro avg       0.96      0.83      0.88      1672
weighted avg       0.95      0.95      0.95      1672
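
The spam-class figures in the report can be recomputed by hand from the confusion matrix, which helps in reading it. The sketch below uses only the four counts printed above; rows are true classes and columns are predicted classes, with class 0 = spam and class 1 = ham as encoded earlier.

```python
# Confusion matrix from the notebook output: rows = true class, columns = predicted class.
cm = [[143, 76],
      [3, 1450]]

spam_tp = cm[0][0]  # spam correctly flagged as spam
spam_fn = cm[0][1]  # spam that slipped through as ham
spam_fp = cm[1][0]  # ham wrongly flagged as spam
ham_tp = cm[1][1]   # ham correctly kept as ham

spam_precision = spam_tp / (spam_tp + spam_fp)
spam_recall = spam_tp / (spam_tp + spam_fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
accuracy = (spam_tp + ham_tp) / (spam_tp + spam_fn + spam_fp + ham_tp)

print(round(spam_precision, 2))  # 0.98
print(round(spam_recall, 2))     # 0.65
print(round(spam_f1, 2))         # 0.78
print(round(accuracy, 4))        # 0.9528
```

The overall accuracy of about 95% hides the fact that 76 of the 219 spam messages in the test set were classified as ham, which is what the 0.65 recall for class 0 reflects.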



Concluding Remarks:


1. Proper data tagging is crucial for training models effectively and making accurate predictions based on new data.
   
2. Incorrectly labeled data can significantly impact the performance of machine learning models, potentially leading to misleading outcomes.

3. When encountering suspicious emails in your inbox, such as potential spam, consider reporting them accurately to help improve email filtering systems.

4. Reporting spam emails not only helps you but also benefits other users who might receive similar emails in the future.

5. Exercise caution when labeling emails as spam to avoid mistakenly classifying genuine emails as spam, which could lead to important messages being overlooked.

6. Being mindful and accurate in your email tagging decisions can enhance the efficiency of spam filters and improve email categorization for all users.