**Email Spam Classification**

1. **Import Libraries**: 
    - `numpy` (as `np`): Numerical computing library.
    - `pandas` (as `pd`): Data manipulation and analysis library.
    - `nltk`: Natural Language Toolkit library for text processing.
    - `stopwords` from `nltk.corpus`: Stopwords are common words that are often filtered out in natural language processing tasks.
    - `string`: A module that provides common string operations.

In [20]:

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string


2. **Read Data**: 
    - Read a CSV file named "emails.csv" into a pandas DataFrame `df`.
    - Display the first few rows using `df.head()`.
    - Check the shape of the DataFrame using `df.shape`.
    - Display column names using `df.columns`.
    - Remove duplicate rows in place using `df.drop_duplicates(inplace=True)`, then print the shape of the deduplicated DataFrame.

In [21]:
df = pd.read_csv("emails.csv")
df.head()
df.shape
df.columns
df.drop_duplicates(inplace=True)
print(df.shape)

(5695, 2)


3. **Data Preprocessing**:
    - Check for missing data by printing the sum of missing values for each column using `print(df.isnull().sum())`.
    - Download the stopwords package from NLTK using `nltk.download("stopwords")`.
    - Define a function `process(text)` to preprocess text data:
        - Remove punctuation from the text using list comprehension.
        - Join the characters back into a string.
        - Tokenize the text by splitting it into words and removing stopwords.
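    - Apply the `process` function to the first five rows of the 'text' column (`df['text'].head()`) to preview the tokenized output. The full column is tokenized later, when `process` is passed to `CountVectorizer`.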

In [22]:
print(df.isnull().sum())
nltk.download("stopwords")
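def process(text):
    # Drop punctuation characters, then rejoin what is left into a single string.
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    # Load the stopword list once per call; a set avoids re-reading the list for every word.
    stop_words = set(stopwords.words('english'))
    clean = [word for word in nopunc.split() if word.lower() not in stop_words]
    return clean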

df['text'].head().apply(process)

text    0
spam    0
dtype: int64


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\darsh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
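
To make the behavior of `process` concrete, here is a minimal sketch that runs without the NLTK download. It assumes a tiny hand-picked stopword set (`TOY_STOPWORDS`, hypothetical and far smaller than NLTK's English list) and a made-up example sentence; `process_toy` is an illustrative name, not part of the notebook.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
TOY_STOPWORDS = {"a", "the", "is", "to", "now"}

def process_toy(text, stop_words=TOY_STOPWORDS):
    # Same steps as process(): drop punctuation, split on whitespace, filter stopwords.
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in stop_words]

tokens = process_toy("Subject: Win a FREE prize now!!!")
print(tokens)  # ['Subject', 'Win', 'FREE', 'prize']
```

Lowercasing is used only for the stopword comparison, so the returned tokens keep their original case. `FREE` and `free` would therefore end up as separate features when the tokens are counted.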

4. **Feature Engineering**:
    - Use `CountVectorizer` from `sklearn.feature_extraction.text`, with `process` as its `analyzer`, to convert the text data into a sparse matrix of token counts (one row per email, one column per distinct token).

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
message = CountVectorizer(analyzer=process).fit_transform(df['text'])
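
As a rough illustration of what a token-count matrix contains, the sketch below builds one by hand with the standard library. It is not how `CountVectorizer` is implemented (that returns a sparse matrix); `count_matrix` and the three toy documents are made up for this example.

```python
from collections import Counter

def count_matrix(token_lists):
    # One column per distinct token, in sorted order.
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # One row per document: how often each vocabulary term occurs in it.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Toy, already-tokenized documents (hypothetical data).
docs = [["free", "prize", "free"], ["meeting", "agenda"], ["free", "meeting"]]
vocabulary, matrix = count_matrix(docs)
print(vocabulary)  # ['agenda', 'free', 'meeting', 'prize']
print(matrix)      # [[0, 2, 0, 1], [1, 0, 1, 0], [0, 1, 1, 0]]
```

Most entries are zero even in this tiny case, which is why a sparse representation is the practical choice for a full email corpus.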

5. **Split Data**:
    - Split the data into training (80%) and testing (20%) sets using `train_test_split` from `sklearn.model_selection`, with `random_state=0` for reproducibility.

In [24]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
print(message.shape)

(5695, 37229)
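
The split sizes can be checked with a small standard-library sketch. `split_indices` is a hypothetical helper that shuffles row indices and holds out 20% of them; it does not reproduce scikit-learn's shuffling, so the specific rows chosen will differ and only the set sizes are comparable.

```python
import math
import random

def split_indices(n_samples, test_size=0.20, seed=0):
    # Shuffle the row indices with a fixed seed so the split is repeatable.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Hold out the first n_test shuffled indices for testing.
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5695)
print(len(train_idx), len(test_idx))  # 4556 1139
```

These sizes agree with the `support` totals (4556 and 1139) reported in the evaluation output.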


6. **Model Training**:
    - Create and train a Naive Bayes classifier using `MultinomialNB` from `sklearn.naive_bayes`, then print its predictions on the training set alongside the true labels.

In [25]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(xtrain, ytrain)
print(classifier.predict(xtrain))
print(ytrain.values)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
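
For intuition about what a multinomial Naive Bayes model computes, here is a from-scratch sketch in plain Python. It is an illustration under stated assumptions, not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical helpers, the toy emails are invented, `alpha=1.0` is additive (Laplace) smoothing, and label 1 is assumed to mark spam.

```python
import math
from collections import Counter

def train_nb(token_lists, labels, alpha=1.0):
    """Fit per-class log priors and smoothed per-token log likelihoods."""
    vocabulary = {tok for tokens in token_lists for tok in tokens}
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        docs_c = [tokens for tokens, y in zip(token_lists, labels) if y == c]
        log_prior[c] = math.log(len(docs_c) / len(labels))
        counts = Counter(tok for tokens in docs_c for tok in tokens)
        # Additive smoothing keeps unseen token/class pairs from getting zero probability.
        denom = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {t: math.log((counts[t] + alpha) / denom) for t in vocabulary}
    return log_prior, log_likelihood

def predict_nb(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior; out-of-vocabulary tokens are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Invented toy data: 1 = spam, 0 = not spam (assumed label convention).
train_docs = [["free", "prize", "click"], ["free", "offer"],
              ["meeting", "agenda"], ["project", "meeting", "notes"]]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb(["free", "click", "now"], log_prior, log_likelihood))  # 1
print(predict_nb(["meeting", "notes"], log_prior, log_likelihood))      # 0
```

Working in log space turns the product of per-token probabilities into a sum, which avoids numerical underflow on long emails.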


7. **Model Evaluation**:
    - Evaluate the model on the training dataset:
        - Print classification report, confusion matrix, and accuracy score using `classification_report`, `confusion_matrix`, and `accuracy_score` from `sklearn.metrics`.
    - Evaluate the model on the testing dataset similarly.

In [26]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtrain)
print(classification_report(ytrain, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: \n", accuracy_score(ytrain, pred))

print(classifier.predict(xtest))
print(ytest.values)

pred = classifier.predict(xtest)
print(classification_report(ytest, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))
print("Accuracy: \n", accuracy_score(ytest, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556


Confusion Matrix: 
 [[3445   12]
 [   1 1098]]
Accuracy: 
 0.9971466198419666
[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139


Confusion Matrix: 
 [[862   8]
 [  1 268]]
Accuracy: 
 0.9920983318700615
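
The reported test-set metrics can be recomputed by hand from the confusion matrix. The sketch below copies the four test-set counts from the output above and treats class 1 as the positive class; the variable names are only for illustration.

```python
# Test-set confusion matrix from the output (rows: true class, columns: predicted class).
tn, fp = 862, 8    # true class 0: predicted 0, predicted 1
fn, tp = 1, 268    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the emails predicted as class 1, the share that truly are
recall = tp / (tp + fn)      # of the true class-1 emails, the share that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.9921 precision=0.97 recall=1.00 f1=0.98
```

These values match the class 1 row of the test classification report. Read this way, the model misclassified 8 class-0 emails as class 1 and missed a single class-1 email.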
