### BOW and TF-IDF Spam/Ham Classification
Earlier, we followed these steps:
1. Preprocessing and cleaning 
2. **BOW and TF-IDF (Sentences -> Vectors)**
3. **Train Test Split**
4. Train Models

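The correction is to swap steps 2 and 3. Splitting first means the vectorizer's vocabulary (and, for TF-IDF, its IDF weights) is learned from the training set only, so no information from the test set leaks into the features. This prevents data leakage.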

1. Preprocessing and cleaning 
2. **Train Test Split**
3. **BOW and TF-IDF (Sentences -> Vectors) => {Prevent Data Leakage}**
4. Train Models
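
To make the reordering concrete, here is a minimal pure-Python sketch of what "fit on train, transform on test" means for a bag-of-words model. The helper names `fit_vocabulary` and `transform` and the toy sentences are hypothetical and only mimic the idea behind scikit-learn's `fit_transform`/`transform`; they are not the library's implementation.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a word -> column index mapping from the given documents only."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """Turn documents into count vectors; words outside vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

# Toy data (hypothetical): "voucher" never appears in the training set
train_docs = ["free prize now", "see you at lunch"]
test_docs = ["free lunch voucher"]

vocab = fit_vocabulary(train_docs)      # fit on training data only
X_train = transform(train_docs, vocab)
X_test = transform(test_docs, vocab)    # reuse the training vocabulary

print(sorted(vocab))
print(X_test)
```

Because the vocabulary is built before the test sentence is ever seen, the test-only word "voucher" gets no column. Had the vocabulary been fitted on all sentences before splitting, the test set would have shaped the feature space.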

In [30]:
import pandas as pd

messages = pd.read_csv("./Bag_Of_Words/SMSSpamCollection.txt",sep="\t",names=["Label","Message"])

messages

| | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [31]:
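import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used below if they are not already present
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

corpus = []

def create_corpus():
    lemtzr = WordNetLemmatizer()
    # Build the stopword set once instead of reloading the list for every word
    stop_words = set(stopwords.words("english"))
    for message in messages["Message"]:
        # Keep letters only; digits and punctuation become spaces
        message = re.sub("[^a-zA-Z]", " ", str(message))
        words = message.split()
        # The stopword check is case-sensitive, so capitalized stopwords such as "The" are kept
        words = [lemtzr.lemmatize(word) for word in words if word not in stop_words]
        corpus.append(" ".join(words))

create_corpus()
print(corpus[:5])  # preview only; the full corpus is very long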




In [32]:
from sklearn.model_selection import train_test_split

y = messages["Label"].map({"ham":0,"spam":1}).values

X_train, X_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2)
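
The split above has no fixed seed, so the exact train/test rows (and the scores further down) change from run to run. Since spam is the minority class, one option is a seeded, stratified split. The sketch below uses made-up data with an 80/20 class ratio; `random_state=42` and `stratify` are assumptions for illustration and were not part of the original run.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 ham (0) and 20 spam (1)
texts = [f"message {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# random_state makes the split repeatable; stratify keeps the class ratio
# the same in the training and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```

With stratification, the 20-message test part holds the same 4:1 ham-to-spam ratio as the full toy dataset.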

In [36]:
# TF-IDF features + Multinomial Naive Bayes

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB

# BOW alternative (used in the next cell):
# vec = CountVectorizer(max_features=2500, ngram_range=(1,2), lowercase=True)
vec = TfidfVectorizer(max_features=2500, ngram_range=(1,2), lowercase=True)

# Fit the vectorizer on the training set only, then reuse it on the test set
X_train_new = vec.fit_transform(X_train).toarray()
X_test_new = vec.transform(X_test).toarray()
model = MultinomialNB().fit(X_train_new, y_train)

y_pred = model.predict(X_test_new)

ac=accuracy_score(y_test,y_pred)
cm = confusion_matrix(y_test,y_pred)
cr=classification_report(y_test,y_pred)
print(ac)
print(cm)
print(cr)




0.9820627802690582
[[983   0]
 [ 20 112]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       983
           1       1.00      0.85      0.92       132

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115
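
The spam-class figures in the report can be recomputed by hand from the confusion matrix printed above (rows are true labels, columns are predicted labels, with 0 = ham and 1 = spam). A short sketch of the arithmetic:

```python
# Confusion matrix from the TF-IDF run above:
# rows = true label, columns = predicted label
cm = [[983, 0],
      [20, 112]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The model never flags a ham message as spam (precision 1.00), but it misses 20 of the 132 spam messages, which is why recall for class 1 is 0.85.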



In [35]:
# BOW (CountVectorizer) features + Multinomial Naive Bayes

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB

vec = CountVectorizer(max_features=2500, ngram_range=(1,2), lowercase=True)

# Fit the vectorizer on the training set only, then reuse it on the test set
X_train_new = vec.fit_transform(X_train).toarray()
X_test_new = vec.transform(X_test).toarray()
model = MultinomialNB().fit(X_train_new, y_train)

y_pred = model.predict(X_test_new)

ac=accuracy_score(y_test,y_pred)
cm = confusion_matrix(y_test,y_pred)
cr=classification_report(y_test,y_pred)
print(ac)
print(cm)
print(cr)




0.9874439461883409
[[977   6]
 [  8 124]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       983
           1       0.95      0.94      0.95       132

    accuracy                           0.99      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
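
Both cells repeat the same fit/transform/predict steps by hand. One way to make the "fit on train only" rule harder to break is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline`. The sketch below runs on a few made-up messages rather than the SMS dataset, and the step names `"vec"` and `"nb"` are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 1 = spam, 0 = ham
train_texts = [
    "win free prize now",
    "free cash offer",
    "see you at lunch",
    "call me later tonight",
]
train_labels = [1, 1, 0, 0]
test_texts = ["free prize voucher", "lunch later"]

clf = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])

# fit() fits the vectorizer and the model on the training texts only;
# predict() only transforms the test texts with the fitted vocabulary
clf.fit(train_texts, train_labels)
pred = clf.predict(test_texts)

vocab = clf.named_steps["vec"].vocabulary_
print(pred)
print("voucher" in vocab)
```

The same pipeline object can also be passed to cross-validation utilities, in which case the vectorizer is refitted inside each training fold.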

