<a href="https://colab.research.google.com/github/SameerR007/Spam_Nonspam-Classifier/blob/main/Spam_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Building a Spam Classifier using Logistic Regression

#Importing libraries

In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#Loading the data from a CSV file into a pandas DataFrame

In [4]:
import pandas as pd
raw_mail_data = pd.read_csv('mail_data1.csv')

#Data Preprocessing

In [5]:
#checking for null values
raw_mail_data.isnull().sum()

Category    0
Message     0
dtype: int64

1. There are no null values in the dataset.

In [6]:
# work on a copy so the raw dataframe stays unchanged
mail_data = raw_mail_data.copy()

In [7]:
# printing the first 5 rows of the dataframe
mail_data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [8]:
# checking the number of rows and columns in the dataframe
mail_data.shape

(5572, 2)

1. The dataset has 5572 records, each with two columns: Category and Message.

In [9]:
# label spam mail as 0;  ham mail as 1;

mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

In [10]:
mail_data.head()

  Category                                            Message
0        1  Go until jurong point, crazy.. Available only ...
1        1                      Ok lar... Joking wif u oni...
2        0  Free entry in 2 a wkly comp to win FA Cup fina...
3        1  U dun say so early hor... U c already then say...
4        1  Nah I don't think he goes to usf, he lives aro...
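The two `.loc` assignments above typically leave the `Category` column with an object dtype, which is why the labels are cast to integers later in the notebook. As a minimal sketch of an alternative, assuming the column only ever contains the strings `'spam'` and `'ham'`, the same encoding can be written with `Series.map`. The toy frame `toy` and the `label_map` dictionary are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical rows, not from mail_data1.csv)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at lunch", "win a free prize now", "ok thanks"],
})

# Same convention as the notebook: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Category"] = toy["Category"].map(label_map)

print(toy["Category"].tolist())
```

Any category string missing from `label_map` would become `NaN`, so it is worth re-checking for nulls after mapping.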


In [11]:
# convert every message to lowercase
mail_data['Message'] = mail_data['Message'].str.lower()

In [12]:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [13]:
# stem each whitespace-separated token with the Porter stemmer
def stem(a):
    return " ".join(ps.stem(i) for i in a.split())

In [14]:
mail_data['Message']=mail_data['Message'].apply(stem)

In [15]:
# separating the data as texts and label

X = mail_data['Message']

Y = mail_data['Category']

#Splitting the data into training and test sets

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

In [17]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(4457,)
(1115,)
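The sizes printed above follow from `test_size=0.2`. As a worked check, assuming scikit-learn rounds the test share up and gives the remainder to the training set (which matches the shapes shown), the arithmetic is:

```python
import math

n_samples = 5572   # rows in mail_data
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(n_samples * test_size)   # 1114.4 rounded up
n_train = n_samples - n_test

print(n_test, n_train)
```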


In [18]:
X_train.head()

3075                  don know. i did't msg him recently.
1787    do you know whi god creat gap between your fin...
1614                          thnx dude. u guy out 2nite?
4304                                      yup i'm free...
3266    44 7732584351, do you want a new nokia 3510i c...
Name: Message, dtype: object

In [19]:
# median message length in characters, number of messages, and their product
print(mail_data['Message'].apply(len).median())
print(mail_data.shape[0])
print(mail_data.shape[0]*mail_data['Message'].apply(len).median())

59.0
5572
328748.0


In [20]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000,stop_words='english')

In [21]:
# transform the text data to feature vectors that can be used as input to the Logistic regression
X_train_features = cv.fit_transform(X_train)
X_test_features = cv.transform(X_test)

In [22]:
X_train_features.shape

(4457, 5000)
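To see what `fit_transform` and `transform` do here, the following is a small sketch on made-up messages. The `train_docs` and `test_docs` lists and the `toy_cv` name are illustrative, and `stop_words` and `max_features` are left at their defaults to keep the vocabulary easy to read. The vocabulary is learned from the training texts only, so a word that appears only in the test text is ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free entry to win", "ok see you later", "win free cash"]
test_docs = ["free pizza later"]  # "pizza" never appears in train_docs

toy_cv = CountVectorizer()
train_features = toy_cv.fit_transform(train_docs)  # learns the vocabulary
test_features = toy_cv.transform(test_docs)        # reuses that vocabulary

print(sorted(toy_cv.vocabulary_))
print(train_features.shape, test_features.shape)
print(test_features.toarray())
```

Fitting the vectorizer on the training split alone, as the notebook does, keeps information from the test messages out of the feature space.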

In [23]:
# convert Y_train and Y_test values as integers

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [24]:
X_train_features

<4457x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 33584 stored elements in Compressed Sparse Row format>
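The summary above implies that the feature matrix is almost entirely zeros, which is why it is kept in sparse form. A quick worked calculation from the numbers in that output:

```python
rows, cols = 4457, 5000   # shape of X_train_features
stored = 33584            # stored elements reported above

density = stored / (rows * cols)   # fraction of cells actually stored
avg_per_message = stored / rows    # stored entries per training message

print(f"density: {density:.4%}")
print(f"average stored entries per message: {avg_per_message:.2f}")
```

Roughly 0.15% of the cells are stored, or about 7.5 entries per message.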

#Implementing Logistic Regression

In [25]:
model = LogisticRegression()

In [26]:
# training the Logistic Regression model with the training data
model.fit(X_train_features, Y_train)

In [27]:
# prediction on training data

prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [28]:
print('Accuracy on training data : ', accuracy_on_training_data)

Accuracy on training data :  0.9946152120260264


1. The accuracy on the training data is about 99.46%.
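Because accuracy is the fraction of correctly classified samples, the printed value can be turned back into a count. A small sketch, using the training-set size of 4457 shown earlier (the helper name `correct_count` is illustrative):

```python
def correct_count(accuracy, n_samples):
    """Recover the number of correct predictions from an accuracy score."""
    return round(accuracy * n_samples)

n_train = 4457
train_accuracy = 0.9946152120260264

n_correct = correct_count(train_accuracy, n_train)
print(n_correct, n_train - n_correct)
```

That is 4433 correct predictions and 24 errors out of 4457 training messages.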

In [29]:
# prediction on test data

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

In [30]:
print('Accuracy on test data : ', accuracy_on_test_data)

Accuracy on test data :  0.97847533632287


1. The accuracy on the test data is about 97.85%.
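Accuracy alone does not show which kind of mistake the model makes, such as spam that slips through versus ham that is flagged. As a hedged sketch on made-up labels (the `y_true` and `y_pred` lists are not the notebook's results), scikit-learn's `confusion_matrix` breaks the errors down per class, using the same encoding of 0 for spam and 1 for ham:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0]  # one spam message predicted as ham

# rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
spam_recall = cm[0, 0] / cm[0].sum()  # share of spam that was caught

print(cm)
print(accuracy_score(y_true, y_pred), spam_recall)
```

In the notebook, the same call could be applied to `Y_test` and `prediction_on_test_data`.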

#Inputting the message to be classified as spam or non-spam

In [31]:
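# read a message and apply the same lowercasing and stemming used on the training data
input_mail = [stem(input().lower())]
# convert text to feature vectors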
input_data_features = cv.transform(input_mail)

# making prediction

prediction = model.predict(input_data_features)
print(prediction)


if (prediction[0]==1):
  print('Ham mail')

else:
  print('Spam mail')

[1]
Ham mail
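For repeated use, the steps in the last cell can be wrapped in a function. The sketch below is self-contained, so it trains a tiny pipeline on made-up messages. The names `classify_message`, `preprocess`, `toy_model`, and the toy data are illustrative, and `preprocess` only lowercases. In the notebook it would also need the stemming step so that new input matches the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data, only to keep the example runnable on its own
toy_messages = [
    "win a free prize now", "free cash claim your prize",
    "are we still meeting for lunch", "ok see you at home later",
]
toy_labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham

def preprocess(text):
    # placeholder for the notebook's lowercasing + stemming
    return text.lower()

toy_model = make_pipeline(CountVectorizer(), LogisticRegression())
toy_model.fit([preprocess(m) for m in toy_messages], toy_labels)

def classify_message(text, model=toy_model):
    """Return 'Ham mail' or 'Spam mail' for a single message."""
    prediction = model.predict([preprocess(text)])
    return "Ham mail" if prediction[0] == 1 else "Spam mail"

print(classify_message("Claim your FREE prize"))
```

Bundling the vectorizer and the classifier in one pipeline keeps the preprocessing applied at prediction time consistent with what was used for training.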
