
#Importing Dependencies

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#Data Collection and Analysis

In [None]:
data=pd.read_csv("/content/mail_data.csv")
data.head() # printing first five rows of the dataset

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# printing the stats of the dataset
data.describe()

Unnamed: 0,Category,Message
count,5572,5572
unique,2,5157
top,ham,"Sorry, I'll call later"
freq,4825,30


In [None]:
data.dtypes # datatypes of the columns

Category    object
Message     object
dtype: object

The describe() output shows that the Category column has two unique values, ham and spam, and that ham is by far the more frequent of the two (4825 of 5572 messages).
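Because ham dominates the dataset, it helps to know what accuracy a trivial classifier would reach before judging any model. A minimal sketch, using only the counts reported by data.describe() above; the variable names are illustrative and not part of the original notebook:

```python
# Counts taken from the data.describe() output above.
total_messages = 5572
ham_messages = 4825

# A classifier that always predicts "ham" is correct on every ham message
# and wrong on every spam message.
majority_baseline = ham_messages / total_messages
spam_fraction = 1 - majority_baseline

print(f"Majority-class baseline accuracy: {majority_baseline * 100:.2f} %")
print(f"Fraction of spam messages: {spam_fraction * 100:.2f} %")
```

Accuracy figures for this dataset are easier to interpret when compared against this baseline rather than against 0 %.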

#Splitting the features and target values

**Label Encoding**

In [None]:
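X=data['Message']
# Encode the labels in one vectorized step instead of assigning element by element,
# which would modify the DataFrame column in place: spam -> 0, anything else (ham) -> 1
Y=(data['Category']!='spam').astype('int')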

In [None]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: int64


<h3>Vectorizing the Textual Data</h3>

In [None]:
vectorizer=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)
vectorized_X=vectorizer.fit_transform(X)
vectorized_X

<5572x8440 sparse matrix of type '<class 'numpy.float64'>'
	with 43529 stored elements in Compressed Sparse Row format>

The first line of the cell creates an instance of the TfidfVectorizer class, a scikit-learn feature extractor that converts raw text into TF-IDF weighted term vectors. The three parameters passed to the constructor are:

<li><b>min_df=1:</b> Ignores terms whose document frequency is lower than the given value. With a value of 1, every term that appears in at least one message is kept, so this parameter filters nothing out.

<li><b>stop_words='english':</b> Removes stop words, which are common words that carry little information, such as "the", "is", and "an". The value 'english' selects scikit-learn's built-in English stop word list.

<li><b>lowercase=True:</b> Converts all characters to lowercase before tokenizing the text, so that words such as "Machine" and "machine" are treated as the same term. The value should be the boolean True rather than the string 'True'; the string only works because it is truthy, and recent scikit-learn versions validate parameter types and may reject it.

The fitted vectorizer is reused later to transform new input text into a matrix of TF-IDF features, which the logistic regression model trained below uses to classify a message as spam or ham.

<h4>Train_Test_Split</h4>

In [None]:
x_train,x_test,y_train,y_test=train_test_split(vectorized_X,Y,test_size=0.2,random_state=2,stratify=Y)
print(y_train)

5426    1
4724    1
536     1
3488    1
2551    1
       ..
1697    1
422     0
4007    1
3474    1
3074    1
Name: Category, Length: 4457, dtype: int64
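Note that the vectorizer above is fitted on the full dataset before the split, so vocabulary and IDF statistics from the test messages influence the training features. A common alternative, sketched here as an assumption and not as this notebook's method, is to split the raw text first and fit the vectorizer on the training part only, for example with a scikit-learn pipeline. The toy messages and all variable names below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up messages; 0 = spam, 1 = ham, matching the encoding used above.
toy_texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free reward today", "urgent prize waiting claim now",
    "are we meeting for lunch", "call me when you get home",
    "see you at the office tomorrow", "thanks for the help yesterday",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, keeping the class ratio in both parts.
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=2, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training text only and
# applies the same fitted vocabulary to anything passed to predict().
text_clf = make_pipeline(
    TfidfVectorizer(stop_words='english', lowercase=True),
    LogisticRegression(),
)
text_clf.fit(txt_train, lab_train)
toy_predictions = text_clf.predict(txt_test)
print(toy_predictions)
```

With this arrangement the test messages stay unseen until prediction time, and the pipeline object can be applied directly to raw strings.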


#Model Initialization and Training

In [None]:
model=LogisticRegression()
model.fit(x_train,y_train)

LogisticRegression()

#Model Evaluation

<h2>Training Data</h2>

In [None]:
y_train_predicted=model.predict(x_train)
# accuracy_score expects (y_true, y_pred); accuracy itself is the same either way
print(f"{accuracy_score(y_train,y_train_predicted)*100} % accuracy on the training data.")

96.67938074938299 % accuracy on the training data.


<h2>Test Data</h2>

In [None]:
y_test_predicted=model.predict(x_test)
# accuracy_score expects (y_true, y_pred); accuracy itself is the same either way
print(f"{accuracy_score(y_test,y_test_predicted)*100} % accuracy on the test data.")

96.32286995515696 % accuracy on the test data.
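Accuracy alone can hide how the model treats the minority class, since most messages are ham. A confusion matrix and the recall for spam give a per-class view. The sketch below uses small made-up label lists (toy_true and toy_pred are hypothetical, not results from this notebook) with the same encoding, 0 for spam and 1 for ham:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: four ham messages and two spam messages.
toy_true = [1, 1, 1, 1, 0, 0]
# Hypothetical predictions: one spam message is misclassified as ham.
toy_pred = [1, 1, 1, 1, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes, ordered [spam, ham].
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])

# Share of actual spam messages that were predicted as spam.
spam_recall = recall_score(toy_true, toy_pred, pos_label=0)

print(f"accuracy: {toy_accuracy:.3f}")
print(toy_cm)
print(f"spam recall: {spam_recall:.2f}")
```

In this toy case accuracy is 5/6 even though only half of the spam is caught. The same calls can be applied to y_test and y_test_predicted from the cell above.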


#Predictive System

In [None]:
new_mail=input("Write any mail dialogues: ")
new_features=[new_mail]
extracted_features=vectorizer.transform(new_features)
index_prediction=model.predict(extracted_features)
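categories=['spam','ham'] # index 0 = spam, index 1 = ham, matching the label encoding
# predict() returns an array, so take its single element before converting to int
print(f"Prediction is:\n{categories[int(index_prediction[0])]}")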

Write any mail dialogues: Subject: Exclusive Offer - Limited Time Only!  Dear valued customer,  We are excited to offer you an exclusive opportunity to invest in our new and innovative stock trading program. With our cutting-edge technology and expert market analysis, you can make huge returns on your investment in no time!  But hurry, this offer is only available for a limited time. Don't miss out on your chance to make a fortune!  To get started, simply click the link below and enter your information. We'll take care of the rest.  Best regards,  The Stock Trading Team  P.S. This is not a spam, this is a legitimate offer with a money back guarantee.  Click here to start making money now: [Insert link to fake website]  Disclaimer: Any investment carries a risk, please read the offer and consult a financial advisor before making any decision.
Prediction is:
ham
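The promotional message above was labelled ham, so it can be useful to look at the predicted class probabilities and not just the final label. The sketch below wraps the prediction steps in a helper that also returns the probabilities from predict_proba. The function name classify_message, the toy training messages, and the other variable names are assumptions made for illustration; in the notebook the already fitted vectorizer and model would be passed in instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_message(text, fitted_vectorizer, fitted_model, categories=('spam', 'ham')):
    """Return the predicted category and a dict of per-category probabilities."""
    features = fitted_vectorizer.transform([text])
    probabilities = fitted_model.predict_proba(features)[0]
    # Columns of predict_proba follow fitted_model.classes_ (here 0 and 1).
    per_category = {
        categories[int(cls)]: float(p)
        for cls, p in zip(fitted_model.classes_, probabilities)
    }
    best_class = int(fitted_model.classes_[probabilities.argmax()])
    return categories[best_class], per_category


# Made-up training data so the sketch runs on its own; 0 = spam, 1 = ham.
demo_texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free reward today", "urgent prize waiting claim now",
    "are we meeting for lunch", "call me when you get home",
    "see you at the office tomorrow", "thanks for the help yesterday",
]
demo_labels = [0, 0, 0, 0, 1, 1, 1, 1]

demo_vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
demo_model = LogisticRegression().fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

demo_label, demo_probabilities = classify_message(
    "claim your free prize", demo_vectorizer, demo_model
)
print(demo_label, demo_probabilities)
```

A probability close to 0.5 indicates that the model was unsure about the message, which the bare label does not show.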
