#Importing Dependencies

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#Data Collection and Analysis

In [28]:
data=pd.read_csv("/content/mail_data.csv")
data.head() # printing first five rows of the dataset

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
# printing the stats of the dataset
data.describe()

Unnamed: 0,Category,Message
count,5572,5572
unique,2,5157
top,ham,"Sorry, I'll call later"
freq,4825,30


In [5]:
data.dtypes # datatypes of the columns

Category    object
Message     object
dtype: object

From the unique count of 2 for Category in the describe() output above, we can infer that the mail falls into two categories, namely ham and spam.

In [6]:
data.isnull().value_counts()

Category  Message
False     False      5572
dtype: int64

Hence, the dataset contains no NULL values.

#Splitting the features and target values

**Label Encoding**

In [7]:
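X=data['Message']
# Encode the labels: spam -> 0, everything else (ham) -> 1.
# A vectorized comparison replaces the element-wise loop, which wrote into a
# column of data through chained assignment (Y[i]=...).
Y=(data['Category']!='spam').astype('int')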

In [8]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: int64


<h3>Vectorizing the Textual Data</h3>

In [10]:
vectorizer=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)
vectorized_X=vectorizer.fit_transform(X)
vectorized_X

<5572x8440 sparse matrix of type '<class 'numpy.float64'>'
	with 43529 stored elements in Compressed Sparse Row format>

The first line of the cell creates an instance of the TfidfVectorizer class, which implements TF-IDF, a feature extraction technique used in natural language processing and machine learning. The three parameters passed to the constructor are:

<ul>
<li><b>min_df=1:</b> This parameter is used to ignore terms that have a document frequency lower than the given value. With a value of 1 (the default), every term that appears in at least one message is included in the feature matrix.</li>
<li><b>stop_words='english':</b> This parameter is used to remove stop words, which are commonly used words that do not provide much information, such as "the", "is", "an", etc. The value 'english' tells the vectorizer to remove English stop words.</li>
<li><b>lowercase=True:</b> This parameter is used to convert all the characters to lowercase before tokenizing the text, so that words such as "Machine" and "machine" are considered the same. It takes the boolean True (the default), not the string 'True'.</li>
</ul>

The fitted vectorizer variable can be used later to transform new input text into a matrix of TF-IDF features, which serves as input for machine learning models, such as the logistic regression model trained below, to classify the text as spam or not spam.
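To make the weighting concrete, here is a minimal pure-Python sketch of how TF-IDF scores can be computed. It assumes the formula scikit-learn documents for its default settings (smoothed IDF, then L2 normalization of each row). The function name tfidf and the toy messages are made up for illustration, and tokenization is simplified to lowercasing and whitespace splitting with no stop-word removal, so the numbers will not match TfidfVectorizer exactly on real text.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    Assumes smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by
    L2 normalization of each document's weights.
    """
    # Simplified tokenization: lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale the row to unit length; an empty document stays empty.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Toy messages, not taken from the dataset.
rows = tfidf(["free entry win prize", "call later", "win free prize now"])
for row in rows:
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

In the first toy message, "entry" occurs in only one document while "free" occurs in two, so "entry" receives the larger weight. That is the intuition behind TF-IDF: terms that are rare across the corpus are more informative.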

<h4>Train_Test_Split</h4>

In [11]:
x_train,x_test,y_train,y_test=train_test_split(vectorized_X,Y,test_size=0.2,random_state=2,stratify=Y)
print(y_train)

5426    1
4724    1
536     1
3488    1
2551    1
       ..
1697    1
422     0
4007    1
3474    1
3074    1
Name: Category, Length: 4457, dtype: int64


#Model Initialization and Training

In [13]:
model=LogisticRegression()
model.fit(x_train,y_train)

#Model Evaluation

<h2>Training data</h2>

In [14]:
y_train_predicted=model.predict(x_train)
# accuracy_score expects (y_true, y_pred); the value is the same either way.
print(f"{accuracy_score(y_train,y_train_predicted)*100} % accuracy on the training data.")

96.67938074938299 % accuracy on the training data.


<h2>Test Data</h2>

In [15]:
y_test_predicted=model.predict(x_test)
# accuracy_score expects (y_true, y_pred); the value is the same either way.
print(f"{accuracy_score(y_test,y_test_predicted)*100} % accuracy on the test data.")

96.32286995515696 % accuracy on the test data.
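Accuracy alone can be flattering on this dataset: according to the describe() output, 4825 of the 5572 messages are ham, so a classifier that always answers "ham" would already reach roughly 86.6 % accuracy. A minimal sketch of how precision and recall for the spam class could be computed alongside accuracy is shown here. The function names and the toy label lists are assumptions for illustration, not results from this notebook, and the encoding follows the one used above (0 = spam, 1 = ham).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def spam_precision_recall(y_true, y_pred, spam_label=0):
    """Precision and recall for the spam class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == spam_label and p == spam_label for t, p in pairs)
    fp = sum(t != spam_label and p == spam_label for t, p in pairs)
    fn = sum(t == spam_label and p != spam_label for t, p in pairs)
    # Guard against division by zero when spam is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: 8 ham (1) and 2 spam (0), scored against an "always ham" model.
toy_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_ham = [1] * len(toy_true)

print("accuracy:", accuracy(toy_true, always_ham))
print("spam precision, recall:", spam_precision_recall(toy_true, always_ham))
```

The "always ham" baseline scores well on accuracy yet catches no spam at all, which is why per-class metrics are worth checking for an imbalanced problem like this one. With the real data, the same functions could be called on list(y_test) and list(y_test_predicted).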


#Predictive System

In [16]:
new_mail=input("Write any mail dialogues: ")
new_features=[new_mail]
extracted_features=vectorizer.transform(new_features)
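index_prediction=model.predict(extracted_features)
# Label encoding from above: index 0 -> spam, index 1 -> ham.
categories=['spam','ham']
# predict() returns a one-element array, so take its first entry as the class index.
print(f"Prediction is:\n{categories[int(index_prediction[0])]}")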

Write any mail dialogues: Hi Rahul     Imagine being part of NASA's thrilling new quest for innovation and exploration! Well, guess what? You can be! NASA, the trailblazer of space exploration, is reaching out to the brightest minds like you, and they need your help.  Here's the deal: NASA and our challenge partner Blue Clarity wants you to uncover groundbreaking emerging technologies and research from all corners of the globe. They're on a mission to investigate three exciting areas:  Safety in Advanced Air Mobility Vehicles: Think about redefining the safety standards in the future of air travel. We're talking about innovations that make air mobility not just efficient but also unbelievably safe.  Innovative Lightweight Protection against Galactic Cosmic Rays: As we aim for the stars, we need to protect our explorers from cosmic rays. Your mission? Find lightweight solutions that shield astronauts on their cosmic journey.  Revolutionary Sensors for Atmospheric Characterization: The a

#Trying a Neural Network to increase Accuracy

In [19]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

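# Small feed-forward network: two hidden ReLU layers and a sigmoid output
# that estimates the probability of class 1 (ham).
# Note: this rebinds model, replacing the logistic regression model from above.
model = Sequential([
    Dense(32,activation='relu',input_dim=x_train.shape[1]),
    Dense(16,activation='relu'),
    Dense(1,activation='sigmoid'),
])
# Binary cross-entropy matches the single sigmoid output unit.
model.compile(loss='binary_crossentropy',optimizer=Adam(), metrics=['accuracy'])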

In [23]:
model.fit(x_train.toarray(),y_train,epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x796d53c0a2f0>

In [24]:
loss, accuracy = model.evaluate(x_test.toarray(),y_test)
print("Test Set Accuracy: ",accuracy*100)

Test Set Accuracy:  98.92376661300659


In [25]:
new_mail=input("Write any mail dialogues: ")
new_features=[new_mail]
extracted_features=vectorizer.transform(new_features)
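# The network outputs a probability of class 1 (ham) with shape (1, 1).
# Threshold it at 0.5; int() on the raw value would truncate almost every
# probability to 0 and report spam.
ham_probability=model.predict(extracted_features.toarray())[0][0]
index_prediction=int(ham_probability>0.5)
categories=['spam','ham']
print(f"Prediction is:\n{categories[index_prediction]}")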

Write any mail dialogues: Hi Rahul     Imagine being part of NASA's thrilling new quest for innovation and exploration! Well, guess what? You can be! NASA, the trailblazer of space exploration, is reaching out to the brightest minds like you, and they need your help.  Here's the deal: NASA and our challenge partner Blue Clarity wants you to uncover groundbreaking emerging technologies and research from all corners of the globe. They're on a mission to investigate three exciting areas:  Safety in Advanced Air Mobility Vehicles: Think about redefining the safety standards in the future of air travel. We're talking about innovations that make air mobility not just efficient but also unbelievably safe.  Innovative Lightweight Protection against Galactic Cosmic Rays: As we aim for the stars, we need to protect our explorers from cosmic rays. Your mission? Find lightweight solutions that shield astronauts on their cosmic journey.  Revolutionary Sensors for Atmospheric Characterization: The a