<a href="https://colab.research.google.com/github/AJ-0504/SpamOrHamDetection/blob/main/spam_or_ham.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Importing libraries

In [101]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer  # converts text into numerical feature vectors
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Data Collection and Preprocessing

In [102]:
df = pd.read_csv("/content/mail_data.csv")  # load the raw mail data

In [103]:
print(df)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
6448     spam            get vouchers for free in the given link
6449     spam  congratulations on winning the lottery click t...
6450     spam  Your credits have been topped up for http://ww...
6451     spam  your credit card has been blocked click on the...
6452     spam  your bills are due and connection will be cut ...

[6453 rows x 2 columns]


In [104]:
# replace null values with an empty string
mailData = df.where(pd.notnull(df), "")  # keep non-null cells, fill the rest with ""

In [105]:
mailData.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [106]:
mailData.shape  # (rows, columns)

(6453, 2)

In [107]:
# encode the labels: spam = 0, ham = 1
mailData.loc[mailData["Category"] == "spam", "Category"] = 0
mailData.loc[mailData["Category"] == "ham", "Category"] = 1

The labels are now encoded as spam = 0 and ham = 1.
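
The same null handling and label encoding can also be written with `fillna` and `Series.map`, which produces an integer column directly and makes the later `astype('int')` cast unnecessary. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical and stands in for `mail_data.csv`):

```python
import pandas as pd

# Hypothetical toy frame standing in for mail_data.csv
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at lunch", "win a free prize now", None],
})

# Replace missing values with empty strings
toy = toy.fillna("")

# Map the text labels to integers: spam -> 0, ham -> 1
toy["Category"] = toy["Category"].map({"spam": 0, "ham": 1})

print(toy)
print(toy["Category"].dtype)
```

Any label that is not in the mapping would become `NaN`, so this variant also surfaces unexpected category values instead of leaving them as strings.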

In [108]:
mailData.head()

Unnamed: 0,Category,Message
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."


In [109]:
X=mailData['Message']
Y=mailData['Category']

In [110]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
6448              get vouchers for free in the given link
6449    congratulations on winning the lottery click t...
6450    Your credits have been topped up for http://ww...
6451    your credit card has been blocked click on the...
6452    your bills are due and connection will be cut ...
Name: Message, Length: 6453, dtype: object


In [111]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
6448    0
6449    0
6450    0
6451    0
6452    0
Name: Category, Length: 6453, dtype: object


## Train Test Splitting

In [112]:
x_train,x_test,y_train,y_test= train_test_split(X,Y,test_size=0.2,random_state=3)

In [113]:
print(X.shape)
print(x_test.shape)
print(x_train.shape)

(6453,)
(1291,)
(5162,)
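
The split above is purely random. If spam and ham are unevenly represented, passing `stratify` to `train_test_split` keeps the class ratio the same in both subsets. This is a sketch on made-up labels, not the split used in this notebook; `messages` and `labels` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham (1) and 2 spam (0) messages
messages = pd.Series([f"message {i}" for i in range(10)])
labels = pd.Series([1, 1, 1, 1, 0, 1, 1, 1, 1, 0])

# stratify=labels keeps the spam/ham ratio the same in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.5, random_state=3, stratify=labels
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```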


## Feature Extraction

In [114]:
#transforming text data to feature vectors that can be used as input to the Logistic Regression
feature_extraction=TfidfVectorizer(min_df = 1,stop_words='english',lowercase=True)

x_train_features = feature_extraction.fit_transform(x_train)
x_test_features=feature_extraction.transform(x_test)
#converting object data of y to integer
y_train=y_train.astype('int')
y_test=y_test.astype('int')

In [115]:
print(x_train_features)  # sparse matrix: (row, column) index followed by the TF-IDF weight

  (0, 2893)	0.17065451898369613
  (0, 2991)	0.35372161243744893
  (0, 7225)	0.23963814558465055
  (0, 1549)	0.36004477236910265
  (0, 3600)	0.36004477236910265
  (0, 4000)	0.20550348724168935
  (0, 798)	0.4001671718713782
  (0, 1334)	0.41936730657290416
  (0, 2249)	0.23654878775133237
  (0, 3185)	0.30802179059984897
  (1, 1719)	0.31512291357344685
  (1, 843)	0.3597503447898352
  (1, 3824)	0.2779016827236604
  (1, 3143)	0.31512291357344685
  (1, 4308)	0.21960619795378242
  (1, 1270)	0.33159357060995215
  (1, 5080)	0.3088610603865591
  (1, 7352)	0.30343679643006904
  (1, 6838)	0.2662155655802826
  (1, 6117)	0.3597503447898352
  (1, 3506)	0.2209166870357592
  (2, 6768)	0.4405905975644386
  (2, 3919)	0.3778073340192466
  (2, 1868)	0.33771787850736656
  (2, 5075)	0.4856873830142803
  :	:
  (5159, 5716)	0.33358889460407515
  (5159, 4498)	0.25423540594187677
  (5159, 6160)	0.2352927540701108
  (5159, 6991)	0.20086390580430105
  (5160, 149)	0.293981484112151
  (5160, 650)	0.30224362947923195
 

# Training the Model

## Logistic Regression

In [116]:
model = LogisticRegression()  # create the logistic regression model

In [117]:
#training the logistic regression model with the training data
model.fit(x_train_features,y_train)

## Evaluating the trained model

In [118]:
#prediction on training data
prediction_on_training_data=model.predict(x_train_features)
accuracy_on_training_data=accuracy_score(y_train,prediction_on_training_data)

In [119]:
print("Accuracy on training data: ",accuracy_on_training_data)

Accuracy on training data:  0.9742347927160016


In [120]:
#prediction on testing data
prediction_on_testing_data=model.predict(x_test_features)
accuracy_on_testing_data=accuracy_score(y_test,prediction_on_testing_data)

In [121]:
print("Accuracy on testing data: ",accuracy_on_testing_data)

Accuracy on testing data:  0.9697908597986057
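
Accuracy alone does not show how the errors are distributed between the two classes. If ham messages dominate the data, a model can score well while still missing a large share of spam. A confusion matrix and per-class precision and recall make this visible. The labels and predictions below are made up for illustration; in the notebook they would be `y_test` and `prediction_on_testing_data`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions (0 = spam, 1 = ham)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, ordered [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_true, y_pred, labels=[0, 1],
                            target_names=["spam", "ham"]))
```

In this toy case the accuracy is 0.8, yet half of the spam messages are predicted as ham, which only the per-class view reveals.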


## Building a predictive system

In [123]:
input_mail=[input("Enter email message: ")]

#converting text to feature vectors
input_data_features=feature_extraction.transform(input_mail) #text to TF-IDF weights (floats)

#making predictions
prediction=model.predict(input_data_features)


if prediction[0] == 1:
  print("ham")
else:
  print("spam")

Enter email message: your bill is due recharge immedieately
ham
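
The vectorizer and the classifier always have to be applied together, so one option is to bundle them in a scikit-learn `Pipeline` and wrap the prediction in a small function. This is a sketch under assumptions, not the notebook's code: the training texts are made up, and `spam_pipeline` and `classify` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (0 = spam, 1 = ham)
texts = [
    "win a free prize now", "free vouchers click the link",
    "claim your lottery winnings", "are we still meeting for lunch",
    "call me when you get home", "see you at the office tomorrow",
]
labels = [0, 0, 0, 1, 1, 1]

# Vectorizer and classifier bundled: raw text goes in, labels come out
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(texts, labels)


def classify(message: str) -> str:
    """Return 'ham' or 'spam' for a single message (hypothetical helper)."""
    return "ham" if spam_pipeline.predict([message])[0] == 1 else "spam"


print(classify("free prize waiting for you"))
```

With a pipeline, `fit` and `predict` accept raw strings, so the test data can never be transformed with a different vectorizer than the one used for training.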
