<a href="https://colab.research.google.com/github/Saketsaurav4/Email-Spam-Detection/blob/main/Email_spam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [115]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

* **pandas** is a powerful library for data manipulation and analysis.
* **numpy** is used for numerical operations and handling arrays.
* **TfidfVectorizer** converts a collection of raw documents to a matrix of TF-IDF features, which is useful for text processing.

  * Term Frequency (TF): This measures how frequently a term appears in a document. The more a term appears, the higher its TF value.

  * Inverse Document Frequency (IDF): This measures how rare a term is across the whole collection. It decreases the weight of terms that appear in many documents and increases the weight of terms that appear in only a few.
  * TF-IDF: This is the product of TF and IDF. It helps to highlight words that are important to a document but not too common across all documents.

* **LogisticRegression** is a machine learning algorithm used for binary classification tasks.
* **train_test_split** is used to split a dataset into training and testing sets.
* **accuracy_score** is used to calculate the accuracy of a classification model.
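
The TF, IDF, and TF-IDF definitions above can be worked through by hand on a tiny corpus. The sketch below is illustrative only: the three documents are made up, and it assumes the smoothed IDF variant, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what scikit-learn documents as the TfidfVectorizer defaults. Other TF-IDF variants give different numbers.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset.
docs = [
    "free prize free entry",
    "meeting agenda attached",
    "free meeting room",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term):
    # Assumed variant: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    tf = Counter(tokens)  # raw term counts within one document
    weights = {t: count * smoothed_idf(t) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalized


first = tfidf(tokenized[0])
for term, weight in sorted(first.items()):
    print(f"{term}: {weight:.3f}")
```

"prize" appears in one document and "free" in two, so "prize" gets the larger IDF. Within the first document "free" still ends up with the larger final weight because it occurs twice there.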

In [116]:
!pip install kaggle



In [117]:
!kaggle datasets download -d ozlerhakan/spam-or-not-spam-dataset

Dataset URL: https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset
License(s): other
spam-or-not-spam-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


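This command downloads the ozlerhakan/spam-or-not-spam-dataset dataset from Kaggle. In the run shown above the download was skipped because a more recently modified local copy already existed.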

In [118]:
!unzip spam-or-not-spam-dataset.zip

Archive:  spam-or-not-spam-dataset.zip
replace spam_or_not_spam.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

This command extracts the contents of the ZIP file. Because spam_or_not_spam.csv already exists from an earlier run, unzip asks whether to replace it.

In [119]:
data = pd.read_csv("/content/spam_or_not_spam.csv")

This line is used to read a CSV file into a pandas DataFrame.

In [120]:
data.head()

Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0


In [121]:
data.tail()

Unnamed: 0,email,label
2995,abc s good morning america ranks it the NUMBE...,1
2996,hyperlink hyperlink hyperlink let mortgage le...,1
2997,thank you for shopping with us gifts for all ...,1
2998,the famous ebay marketing e course learn to s...,1
2999,hello this is chinese traditional 子 件 NUMBER世...,1


In [122]:
data.shape

(3000, 2)

This returns (3000, 2), which means the DataFrame has 3000 rows and 2 columns.

In [123]:
data.isna().sum()

email    1
label    0
dtype: int64

In [124]:
data = data.dropna(subset=['email'])

This line is used to remove rows from the DataFrame data that have missing values in the **email** column.
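
A small sketch of the same dropna call on a made-up three-row frame (the rows are hypothetical, not from the dataset). It shows that only rows with a missing email are removed and that the surviving rows keep their original index labels.

```python
import pandas as pd

# Hypothetical toy frame with one missing email.
toy = pd.DataFrame({
    "email": ["win a prize", None, "see you at lunch"],
    "label": [1, 0, 0],
})

# Drop rows only when the email column is missing.
cleaned = toy.dropna(subset=["email"])

print(cleaned.shape)
print(cleaned.index.tolist())  # original labels are kept, so there is a gap
```

If a contiguous index is needed afterwards, `cleaned.reset_index(drop=True)` renumbers the rows.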




In [125]:
data.isna().sum()

email    0
label    0
dtype: int64

In [126]:
data.shape

(2999, 2)

In [127]:
data["label"].value_counts()

label
0    2500
1     499
Name: count, dtype: int64

This line counts the occurrences of each unique value in the **label** column of the DataFrame. The classes are imbalanced: 2500 emails are labeled 0 (not spam) and 499 are labeled 1 (spam).
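
The counts above are worth turning into a baseline before any model is trained. The short calculation below uses only the two counts printed by value_counts() and shows the accuracy a trivial classifier would reach by always predicting the majority class; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 2500, 1: 499}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"total emails: {total}")
print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy figure for a trained model is best read against this baseline rather than against 0.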

In [128]:
X = data['email']
Y = data['label']

These lines separate the features and the target variable from the DataFrame.

X stores the features, which in this case are the email texts.

Y stores the target variable: the labels indicating whether an email is spam (1) or not spam (0).

In [129]:
print(X)

0        date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...
1       martin a posted tassos papadopoulos the greek ...
2       man threatens explosion in moscow thursday aug...
3       klez the virus that won t die already the most...
4        in adding cream to spaghetti carbonara which ...
                              ...                        
2995     abc s good morning america ranks it the NUMBE...
2996     hyperlink hyperlink hyperlink let mortgage le...
2997     thank you for shopping with us gifts for all ...
2998     the famous ebay marketing e course learn to s...
2999     hello this is chinese traditional 子 件 NUMBER世...
Name: email, Length: 2999, dtype: object


In [130]:
print(Y)

0       0
1       0
2       0
3       0
4       0
       ..
2995    1
2996    1
2997    1
2998    1
2999    1
Name: label, Length: 2999, dtype: int64


In [131]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

This line splits the dataset into training and testing sets. The function shuffles the data randomly before splitting, and because no random_state is passed, the exact split changes from run to run.

test_size=0.3: This specifies the proportion of the dataset to include in the test split. Here, 30% of the data is used for testing and the remaining 70% for training.
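
As a hedged variation on the split above, the sketch below passes two extra train_test_split parameters that the notebook does not use: random_state, which makes the split repeatable, and stratify, which keeps the class ratio the same in both parts. The toy texts and labels are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 non-spam (0) and 4 spam (1) examples.
texts = [f"message {i}" for i in range(20)]
labels = [0] * 16 + [1] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,
    random_state=42,  # fixes the shuffle so the split is repeatable
    stratify=labels,  # keeps the 4:1 class ratio in both parts
)

print(len(X_tr), len(X_te))  # sizes of the two parts
print(sum(y_tr), sum(y_te))  # number of spam examples in each part
```

Stratifying matters most when one class is rare, as spam is here, because a purely random split can leave the test set with too few examples of the rare class.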



In [132]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(2999,)
(2099,)
(900,)


In [133]:
print(Y.shape)
print(Y_train.shape)
print(Y_test.shape)

(2999,)
(2099,)
(900,)


In [134]:
feature_extraction = TfidfVectorizer(min_df = 1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

* **Vectorizes Text Data**: Converts the raw text data into numerical features that can be used by machine learning algorithms.

* **TfidfVectorizer**: This initializes the TfidfVectorizer with specific parameters.

  * min_df=1: This means that a term must appear in at least one document to be included in the vocabulary. This is the default value and filters nothing out.

  * stop_words='english': This removes common English stop words (e.g., 'the', 'is') from the text.
  * lowercase=True: This converts all characters to lowercase before processing.

* **fit_transform**(X_train): This fits the TfidfVectorizer to the training data (X_train) and transforms the training data into a TF-IDF matrix.
  * fit: Learns the vocabulary and IDF from the training data.

  * transform: Transforms the training data into a TF-IDF matrix based on the learned vocabulary and IDF.

* **transform**(X_test): This transforms the test data (X_test) into a TF-IDF matrix using the vocabulary and IDF learned from the training data. It ensures that the test data is transformed in the same way as the training data.
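
The difference between fit_transform and transform can be seen on a toy example. In the sketch below (made-up sentences, default TfidfVectorizer settings rather than the parameters used above), the vocabulary is learned from the training texts only, so a word that appears only in the test text is ignored and both matrices have the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts.
train_texts = ["win free prize", "project meeting tomorrow"]
test_texts = ["free lunch tomorrow"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_texts)  # learns vocabulary and IDF
test_matrix = vec.transform(test_texts)        # reuses them unchanged

print(sorted(vec.vocabulary_))        # terms seen during fit
print(train_matrix.shape, test_matrix.shape)
print("lunch" in vec.vocabulary_)     # unseen test word is not in the vocabulary
print(test_matrix.nnz)                # non-zero entries in the test row
```

Calling fit_transform on the test data instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.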



In [145]:
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

These lines convert Y_train and Y_test to an integer dtype. The labels printed earlier are already int64, so here the cast acts as a safeguard rather than a change; it matters when labels have been read as strings or generic objects.

In [136]:
print(X_train)

805     at NUMBER NUMBER am NUMBER NUMBER NUMBER NUMBE...
2898     an additional income stream from your current...
205     hi i ve got an normal NUMBER NUMBER cd rw ide ...
597      gary lawrence murphy said s stephen d william...
34      hi thank you for the useful replies i have fou...
                              ...                        
472     on fri NUMBER sep NUMBER russell turpin wrote ...
2317    url URL date not supplied plastic discs design...
878     draw the pain ken never wrote about this metho...
1646     guido takers how is esr s bogofilter packaged...
1590    gary funck gary intrepid com NUMBER NUMBER NUM...
Name: email, Length: 2099, dtype: object


In [137]:
print(X_train_features)


  (0, 5822)	0.17422148238747767
  (0, 24158)	0.13248905399720048
  (0, 4001)	0.1616042551530536
  (0, 16502)	0.1525567177644922
  (0, 19655)	0.16970966666510703
  (0, 22625)	0.5091289999953211
  (0, 14214)	0.10560968678501835
  (0, 13246)	0.17974349854875288
  (0, 23059)	0.02900505500392935
  (0, 17022)	0.08160977650950955
  (0, 21457)	0.17974349854875288
  (0, 8314)	0.17422148238747767
  (0, 11811)	0.07177219049426495
  (0, 21543)	0.13645696926897133
  (0, 21143)	0.16970966666510703
  (0, 2134)	0.08184545899871208
  (0, 20259)	0.08695649223773386
  (0, 15290)	0.18686261556572187
  (0, 8437)	0.06666341985899485
  (0, 17854)	0.15706853348686284
  (0, 14278)	0.07606431508422054
  (0, 2729)	0.16259054964813804
  (0, 24292)	0.08784568278115233
  (0, 10093)	0.11627004515367319
  (0, 18762)	0.15470994257032572
  :	:
  (2098, 3012)	0.06129820228451742
  (2098, 18905)	0.038563889110742114
  (2098, 17760)	0.03275884006166619
  (2098, 1672)	0.0536878321864542
  (2098, 14207)	0.04716395600732664


In [138]:
model = LogisticRegression()

This line initializes a logistic regression model. Logistic regression is a linear model used for binary classification tasks, where the goal is to predict one of two possible outcomes.
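
To make "linear model" concrete, here is a minimal sketch of the prediction step of binary logistic regression: a weighted sum of the features is passed through the sigmoid function to get a probability, which is then thresholded at 0.5. The weights, bias, and feature values are invented for illustration and are not the ones the notebook's model learns.

```python
import math


def sigmoid(z):
    # Maps any real number to the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def spam_probability(features, weights, bias):
    # Linear part: weighted sum of the features plus a bias term.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical values for a three-feature example.
weights = [2.0, -1.5, 0.5]
bias = -0.25
features = [0.6, 0.1, 0.0]

p = spam_probability(features, weights, bias)
label = 1 if p >= 0.5 else 0
print(f"probability of spam: {p:.3f}, predicted label: {label}")
```

Training consists of finding the weights and bias that make these probabilities match the training labels as closely as possible.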

In [139]:
model.fit(X_train_features, Y_train)

This line is used to train the logistic regression model on the training data.

In [140]:
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

These lines of code are used to make predictions on the training data and calculate the accuracy of the model on this data.

* **model.predict(X_train_features)**: This method uses the trained logistic regression model to make predictions on the training data features (X_train_features).

* **accuracy_score(Y_train, prediction_on_training_data)**: This function calculates the accuracy of the model by comparing the true labels (Y_train) with the predicted labels (prediction_on_training_data).

In [141]:
print('Accuracy on training data : ', accuracy_on_training_data)

Accuracy on training data :  0.9647451167222487


In [142]:
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

These lines of code are used to make predictions on the test data and calculate the accuracy of the model on this data.

* **model.predict(X_test_features)**: This method uses the trained logistic regression model to make predictions on the test data features (X_test_features).

* **accuracy_score(Y_test, prediction_on_test_data)**: This function calculates the accuracy of the model by comparing the true labels (Y_test) with the predicted labels (prediction_on_test_data).
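
Accuracy alone can hide errors on the rarer spam class. The sketch below uses invented labels (not the notebook's predictions) and scikit-learn's confusion_matrix, precision_score, and recall_score to show how a result with reasonable accuracy can still miss half of the spam.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 8 non-spam (0) and 2 spam (1) emails.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # of emails flagged as spam, share that are spam
rec = recall_score(y_true, y_pred)      # of actual spam, share that was flagged
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(acc, prec, rec)
print(cm)
```

The same three calls can be applied to Y_test and prediction_on_test_data to get a fuller picture than accuracy alone.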

In [143]:
print('Accuracy on test data : ', accuracy_on_test_data)

Accuracy on test data :  0.9488888888888889


In [144]:
input_mail = ["Dear Students,This is a final reminder to deposit your hostel fees (first installment) along with the penalty by 12 noon today.Please note that the hostel room rejection process will start at 12:01 PM for those who fail to make the payment by the specified time and you will have to vacate the hostel room immediately."]

input_data_features = feature_extraction.transform(input_mail)

prediction = model.predict(input_data_features)
print(prediction)


if prediction[0] == 1:
    print('Spam mail')
else:
    print('Ham mail')

[0]
Ham mail


input_mail is a list containing a single email message. This email is used as input for the spam detection model; the output [0] means it was classified as not spam (ham).

* **input_data_features = feature_extraction.transform(input_mail)**:
This transforms the input email into TF-IDF features using the TfidfVectorizer that was previously fitted on the training data. This ensures that the input email is represented in the same way as the training data.

* **prediction = model.predict(input_data_features)**:
This uses the trained logistic regression model to predict whether the input email is spam or not. The prediction will be an array containing the predicted label (1 for spam, 0 for not spam).
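
One possible way to package the transform and predict steps is a scikit-learn Pipeline, which also makes it easy to look at class probabilities instead of only the hard label. This is a sketch under stated assumptions, not the notebook's method: the four training texts are made up, and default TfidfVectorizer and LogisticRegression settings are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 1 = spam, 0 = not spam.
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "project meeting moved to monday",
    "please review the attached report",
]
train_labels = [1, 1, 0, 0]

# The pipeline applies the vectorizer and then the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

new_mail = ["claim your free prize"]
proba = pipeline.predict_proba(new_mail)  # one row per email, one column per class
classes = pipeline.named_steps["logisticregression"].classes_

print(classes)
print(proba)
print(pipeline.predict(new_mail))
```

The columns of proba follow the order of classes, so the second column is the estimated probability of spam. With a pipeline, raw text can be passed directly and the fitted vectorizer is applied automatically.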
