## Problem Statement
- The goal of this project is to build a classifier that accurately distinguishes spam from non-spam (ham) messages. This involves preprocessing the message text, extracting relevant features, training a classification model, and evaluating its performance.

### Approach
- Use CountVectorizer to convert the text data into numerical features.
- Build the model with the Naive Bayes algorithm.
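
The two steps above can be sketched end to end on a handful of made-up messages. The toy texts, labels, and variable names below are assumptions for illustration only and do not come from the dataset; the notebook itself fits the vectorizer and the model in separate cells rather than in a pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages invented for illustration; they are not taken from the dataset.
toy_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are we meeting for lunch today",
    "see you at home tonight",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# Step 1: CountVectorizer turns each message into word counts.
# Step 2: MultinomialNB learns per-class word frequencies from those counts.
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_messages, toy_labels)

toy_prediction = toy_pipeline.predict(["claim your free prize"])[0]
print(toy_prediction)
```

Every word in the query appears only in the toy spam messages, so the model labels it as spam.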

## Importing Necessary Libraries

In [162]:
# for dataframe analysis and manipulation.
import pandas as pd
import numpy as np
import re

# for data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer

# for model Building
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

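# for removing stopwords (run nltk.download('stopwords') once if the corpus is missing)
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests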

## Loading the Dataset

In [55]:
df = pd.read_table('SMSSpamCollection',names =['Label','Message'])
df.head()

|   | Label | Message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### About the dataset
- The dataset contains labeled SMS messages, each classified as spam or ham. It has both the message text and the corresponding labels.

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


#### Analysis
- There are no missing values in the dataset.
- The target variable is categorical (stored as object), so it is encoded as integers in the next step.

## Data preparation

### Encoding Target variable

In [59]:
le = LabelEncoder()
df['Label'] = le.fit_transform(df['Label'])
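
As a quick check of what the encoding produces, the sketch below fits a separate encoder on a tiny hand-written label list (the names `demo_le` and `demo_codes` are hypothetical). LabelEncoder assigns codes in sorted order of the class names, so ham should map to 0 and spam to 1.

```python
from sklearn.preprocessing import LabelEncoder

# A separate encoder on hand-written labels, so the notebook's `le` is untouched.
demo_le = LabelEncoder()
demo_codes = demo_le.fit_transform(["ham", "spam", "ham"])

print(list(demo_le.classes_))                    # class names in sorted order
print(list(demo_codes))                          # integer code for each input label
print(list(demo_le.inverse_transform([1, 0])))   # codes mapped back to names
```

The same inverse_transform call is what turns a numeric prediction back into a readable label later on.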

### Data cleaning

In [57]:
def data_cleaning(text):
    # lowercase, replace every non-letter with a space, then drop stopwords
    text = str(text).lower()
    text = re.sub('[^a-z]', ' ', text)
    words = text.split()
    imp_words = [w for w in words if w not in stop]
    return ' '.join(imp_words)

In [58]:
df['Message'] = df['Message'].apply(data_cleaning)
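
To see the effect of the cleaning step on a single message, here is a minimal sketch. It swaps in a small hand-written stopword set (`demo_stop`, an assumption, much shorter than NLTK's English list) so that it runs without downloading any corpus; `demo_cleaning` is a hypothetical stand-in that mirrors the logic of data_cleaning.

```python
import re

# Hypothetical mini stopword set, a stand-in for NLTK's English list.
demo_stop = {"in", "a", "to", "the", "is"}

def demo_cleaning(text):
    # Same steps as data_cleaning: lowercase, letters only, drop stopwords.
    text = str(text).lower()
    text = re.sub('[^a-z]', ' ', text)
    return ' '.join(w for w in text.split() if w not in demo_stop)

cleaned = demo_cleaning("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)  # free entry wkly comp win fa cup
```

Digits and punctuation disappear along with the stopwords, which means phone numbers and prize amounts are not available to the model as features.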

### Train test split

In [61]:
X = df['Message']
y = df['Label']

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

### Feature extraction

- Convert the text data into numerical features using CountVectorizer. The vocabulary is fitted on the training set only and then reused to transform the test set.

In [63]:
cv = CountVectorizer()
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)
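
The fit/transform split can be illustrated on two made-up training messages and one made-up test message (all names prefixed with `toy_` are assumptions). Words that were not seen during fitting, such as "tomorrow" here, are ignored when the test message is transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration only.
toy_train = ["free prize call now", "call me later"]
toy_test = ["free call tomorrow"]

toy_cv = CountVectorizer()
train_counts = toy_cv.fit_transform(toy_train)  # learns the vocabulary and counts words
test_counts = toy_cv.transform(toy_test)        # reuses the learned vocabulary

vocab = sorted(toy_cv.vocabulary_)  # columns are ordered alphabetically
print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())
```

Each row is one message and each column is one vocabulary word, so the test row has non-zero counts only for "call" and "free".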

## Model Building

### Model training using Naive Bayes classifier

In [None]:
# model training
multi_nb = MultinomialNB()
multi_nb.fit(X_train,y_train)

In [149]:
# model evaluation
y_pred = multi_nb.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
conf_matrix = confusion_matrix(y_test,y_pred)
report = classification_report(y_test,y_pred)

# printing scores
print("Accuracy:", round(accuracy,4))
print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n",report)

Accuracy: 0.9919
Confusion Matrix:
 [[966   4]
 [  5 140]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00       970
           1       0.97      0.97      0.97       145

    accuracy                           0.99      1115
   macro avg       0.98      0.98      0.98      1115
weighted avg       0.99      0.99      0.99      1115



### Observation:
- The model reaches an accuracy of 99.19% on the 1,115 test messages.
- The confusion matrix shows that 966 of 970 ham messages and 140 of 145 spam messages are classified correctly, which matches the high precision and recall in the report (0.97 each for spam).
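
The reported scores can be re-derived by hand from the confusion matrix. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predicted labels) and that label 0 is ham and label 1 is spam, which follows from LabelEncoder's sorted ordering; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above.
tn, fp = 966, 4    # ham kept as ham, ham wrongly flagged as spam
fn, tp = 5, 140    # spam missed, spam caught

acc_check = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of the actual spam messages, how many were caught

print(round(acc_check, 4), round(spam_precision, 4), round(spam_recall, 4))
# 0.9919 0.9722 0.9655
```

Both spam scores round to the 0.97 shown in the classification report.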

## Conclusion
- The spam classification model performs well on both spam and ham messages.
- However, the evaluation is based on a single dataset, so the results may not generalize to all spam classification scenarios.
- To address this, we could try other feature extraction techniques such as TF-IDF or a tokenizer-based representation, other training algorithms such as SVM or Random Forest, and deep sequence models such as RNNs and LSTMs.
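
One of the alternatives mentioned above, TF-IDF features with a linear SVM, could be tried along the following lines. This is only a sketch on made-up messages, not a result from this notebook: the toy data, the choice of LinearSVC as the SVM variant, and the variable names are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy messages invented for illustration; they are not taken from the dataset.
toy_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are we meeting for lunch today",
    "see you at home tonight",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# TfidfVectorizer down-weights words that occur in many messages;
# LinearSVC fits a linear decision boundary on those weighted features.
alt_model = make_pipeline(TfidfVectorizer(), LinearSVC())
alt_model.fit(toy_messages, toy_labels)

alt_prediction = alt_model.predict(["free prize waiting claim now"])[0]
print(alt_prediction)
```

On the real data, the pipeline would be fitted on the training split and compared against the Naive Bayes scores on the same test split.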

### Predictions on sample data

In [17]:
sample_data1 = ['Your Mobile number has been awarded with a rs2000 Bonus Caller Prize.Call 09058095201 from land line. Valid 12hrs only']
sample_data2 = ['I think I‘m waiting for the same bus! Inform me when you get there, if you ever get there.']
sample_data3 = ['PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points.To claim call 08719180248 Identifier Code: 45239 Expires']

In [157]:
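def predictor(data):
    # clean the raw text, vectorize it with the fitted CountVectorizer,
    # then decode the predicted label back to 'ham' or 'spam'
    test = pd.Series(data)
    cleaned_data = test.apply(data_cleaning)
    transformed_data = cv.transform(cleaned_data)
    result = le.inverse_transform(multi_nb.predict(transformed_data))[0]
    return result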

In [158]:
predictor(sample_data1)

'spam'

In [159]:
predictor(sample_data2)

'ham'

In [160]:
predictor(sample_data3)

'spam'