# Why not perform bag of words before the train-test split?

Performing a train-test split before applying the Bag of Words (BoW) model is crucial for several reasons:

Preventing Data Leakage: If you fit BoW on the entire dataset before splitting, the vocabulary (and any `max_features` selection) is learned from the test messages as well. Information from the test set then leaks into training, which can lead to overly optimistic performance metrics. This is akin to studying the exam questions beforehand rather than preparing from the syllabus (the training data).

Generalization: The training set is the data the model learns from, while the test set stands in for unseen data. Splitting first keeps the test set untouched, so the reported score measures how well the model generalizes rather than how much it has already seen.

Preprocessing: It's fine to preprocess and clean the text data before the train-test split, because steps such as lowercasing, stop-word removal, and stemming treat each message independently. Once you split the data, however, any transformation that learns from the data, such as BoW, should be fitted only on the training set and then used to transform the test set. This ensures that nothing is learned from the test set.

In summary, performing the train-test split before applying the BoW model helps maintain the integrity of the evaluation process and ensures that your model's performance metrics reflect its true capabilities on unseen data.
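
The leakage described above can be made concrete on a toy corpus. The sketch below is illustrative only: the three messages and the names `leaky_cv` and `clean_cv` are made up for this example and are not part of the spam dataset. It fits one `CountVectorizer` on all messages and another on the training messages alone, then compares the two vocabularies.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (hypothetical, not from spam.csv)
train_msgs = ["free prize call now", "see you at home tonight"]
test_msgs = ["claim your jackpot prize"]

# Leaky: the vocabulary is learned from training AND test messages
leaky_cv = CountVectorizer().fit(train_msgs + test_msgs)

# Clean: the vocabulary is learned from the training messages only
clean_cv = CountVectorizer().fit(train_msgs)

# Words that entered the feature space only because the test set was seen
leaked_words = sorted(set(leaky_cv.vocabulary_) - set(clean_cv.vocabulary_))
print(leaked_words)  # ['claim', 'jackpot', 'your']
```

In the leaky version the feature columns already know which words appear in the test messages, which is exactly the information a model should not have before evaluation.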

In [32]:
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\KIIT0001\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\KIIT0001\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [33]:
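# Note: passing names= without header= treats the first line of the file as data.
# If spam.csv has its own header row, add header=0 to skip it.
messages = pd.read_csv('spam.csv', sep='\t', names=["label", "message"])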

## Preprocessing the data and storing it in a corpus

In [34]:
corpus = []

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

for msg in messages['message']:
    # Keep letters only, then lowercase and tokenize on whitespace
    review = re.sub('[^a-zA-Z]', ' ', msg)
    review = review.lower()
    review = review.split()
    # Drop stop words and stem the remaining tokens
    review = [stemmer.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

corpus

['messag',
 'go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though',
 'freemsg hey darl week word back like fun still tb ok xxx std chg send rcv',
 'even brother like speak treat like aid patent',
 'per request mell mell oru minnaminungint nurungu vettam set callertun caller press copi friend callertun',
 'winner valu network custom select receivea prize reward claim call claim code kl valid hour',
 'mobil month u r entitl updat latest colour mobil camera free call mobil updat co free',
 'gonna home soon want talk stuff anymor tonight k cri enough today',
 'six chanc win cash pound txt csh send cost p day day tsandc appli repli hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw',
 'final match head towa

### Selecting the dependent and independent variables

In [35]:
y = pd.get_dummies(messages['label'])
y = y.iloc[:, 1].values  # keep the second dummy column as the target
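
`pd.get_dummies` orders its columns by sorted label, so `iloc[:, 1]` selects whichever label sorts second. A minimal sketch, assuming the labels are the strings `ham` and `spam` (an assumption, since the label values are not shown in this notebook), makes the mapping explicit:

```python
import pandas as pd

# Hypothetical labels; the real values in spam.csv are assumed, not confirmed
labels = pd.Series(["ham", "spam", "ham", "ham", "spam"])

dummies = pd.get_dummies(labels)
print(list(dummies.columns))  # columns come out in sorted order

# Column 1 is the second label in sorted order ("spam" here)
y_toy = dummies.iloc[:, 1].values.astype(int)

# An equivalent encoding that names the positive class explicitly
y_explicit = (labels == "spam").astype(int).values
print(y_toy, y_explicit)
```

Naming the positive class directly avoids depending on column order, which changes if the label column contains an unexpected extra value.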

## Performing the train-test split (before bag of words)

In [36]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(corpus,y,test_size=0.2,random_state=0)
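
`train_test_split` shuffles the texts and labels together, so each message keeps its label, and a fixed `random_state` makes the split reproducible. The sketch below checks the pairing on a made-up corpus. The `stratify` argument is an optional addition that the cell above does not use; it is shown here only because it keeps the class ratio the same in both parts, which can matter when one class is rare.

```python
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical): 8 ham messages and 2 spam messages
texts = [f"ham message {i}" for i in range(8)] + ["spam offer 0", "spam offer 1"]
labels = [0] * 8 + [1] * 2
label_of = dict(zip(texts, labels))

xtr, xte, ytr, yte = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

# Each text is still paired with its original label after shuffling
aligned = all(label_of[t] == y for t, y in zip(xtr + xte, ytr + yte))

# With stratify, each half receives one of the two spam messages
print(aligned, sum(ytr), sum(yte))
```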

In [37]:
x_train

['meant calcul lt gt unit lt gt school realli expens start practic accent import decid year dental school nmde exam',
 'hi hope good day better night',
 'lol enjoy role play much',
 'print oh lt gt come upstair',
 'pick variou point go yeovil motor project hour u take home max easi',
 'aiyah ok wat long got improv alreadi wat',
 'messag free welcom new improv sex dog club unsubscrib servic repli stop msg p',
 'nope juz work',
 'think sent text home phone cant display text still want send number',
 'hi babe thank come even though didnt go well want bed hope see soon love kiss xxx',
 'well obvious peopl cool colleg life went home',
 'pride almost lt gt year old takin money kid suppos deal stuff grownup stuff tell',
 'six chanc win cash pound txt csh send cost p day day tsandc appli repli hl info',
 'u go back urself lor',
 'yup shd haf ard page add figur got mani page',
 'seem unnecessarili affection',
 'shall fine avalarr hollalat',
 'sinc side fever vomitin',
 'contract mobil mnth late

## Applying bag of words to the training and test sets

In [38]:
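from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=10000)
# Learn the vocabulary from the training set only
x_train = cv.fit_transform(x_train).toarray()
# Reuse that vocabulary for the test set (no fitting)
x_test = cv.transform(x_test).toarray()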
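
Because the vocabulary is fixed when the vectorizer is fitted on the training set, `transform` ignores any test word it has not seen. A small sketch on made-up messages (not taken from the dataset) shows this. It also keeps the sparse output instead of calling `.toarray()`, an optional variation that saves memory on larger corpora.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_msgs = ["free prize call now", "call me tonight"]  # toy data
test_msgs = ["free jackpot call"]                         # "jackpot" is unseen

toy_cv = CountVectorizer()
X_train_toy = toy_cv.fit_transform(train_msgs)  # learns the vocabulary
X_test_toy = toy_cv.transform(test_msgs)        # reuses it, learns nothing

# Feature names in column order
features = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(features)  # ['call', 'free', 'me', 'now', 'prize', 'tonight']

# Only "call" and "free" are counted; "jackpot" has no column
print(X_test_toy.toarray())  # [[1 1 0 0 0 0]]
```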

In [39]:
x_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [40]:
cv.vocabulary_

{'meant': 864,
 'calcul': 201,
 'lt': 830,
 'gt': 594,
 'unit': 1519,
 'school': 1216,
 'realli': 1137,
 'expens': 476,
 'start': 1350,
 'practic': 1089,
 'accent': 5,
 'import': 692,
 'decid': 357,
 'year': 1658,
 'dental': 366,
 'nmde': 964,
 'exam': 470,
 'hi': 636,
 'hope': 652,
 'good': 575,
 'day': 350,
 'better': 134,
 'night': 962,
 'lol': 814,
 'enjoy': 447,
 'role': 1183,
 'play': 1061,
 'much': 923,
 'print': 1097,
 'oh': 991,
 'come': 285,
 'upstair': 1529,
 'pick': 1049,
 'variou': 1545,
 'point': 1073,
 'go': 569,
 'yeovil': 1660,
 'motor': 913,
 'project': 1106,
 'hour': 659,
 'take': 1400,
 'home': 650,
 'max': 858,
 'easi': 425,
 'aiyah': 30,
 'ok': 993,
 'wat': 1574,
 'long': 817,
 'got': 579,
 'improv': 693,
 'alreadi': 38,
 'messag': 876,
 'free': 527,
 'welcom': 1590,
 'new': 955,
 'sex': 1245,
 'dog': 394,
 'club': 271,
 'unsubscrib': 1525,
 'servic': 1242,
 'repli': 1163,
 'stop': 1362,
 'msg': 918,
 'nope': 969,
 'juz': 742,
 'work': 1628,
 'think': 1437,
 'sent

## Training the model with Naive Bayes

In [41]:
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()
spam_detect=nb.fit(x_train,y_train)
y_pred=spam_detect.predict(x_test)
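
One way to make the fit-on-train-only rule hard to break is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`. This is a sketch of an alternative arrangement, not the approach used in the cells above, and the toy messages and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 1 = spam, 0 = ham
toy_train = ["free prize call now", "win cash prize",
             "see you tonight", "call me later"]
toy_labels = [1, 1, 0, 0]
toy_test = ["free cash prize", "see you later"]

spam_pipeline = Pipeline([
    ("bow", CountVectorizer()),  # fitted on training text only
    ("nb", MultinomialNB()),     # trained on the resulting counts
])

spam_pipeline.fit(toy_train, toy_labels)    # fits both steps on training data
toy_pred = spam_pipeline.predict(toy_test)  # test text is only transformed
print(toy_pred)  # [1 0]
```

With this arrangement the raw text is split first and the pipeline handles the vectorizing, so there is no separate step in which the vectorizer could accidentally be fitted on test messages.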

## Printing the metrics

In [42]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[ 13   2]
 [  1 116]]
0.9772727272727273
              precision    recall  f1-score   support

       False       0.93      0.87      0.90        15
        True       0.98      0.99      0.99       117

    accuracy                           0.98       132
   macro avg       0.96      0.93      0.94       132
weighted avg       0.98      0.98      0.98       132
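
The figures in the report follow directly from the confusion matrix printed above, where rows are the actual classes and columns are the predicted classes, both in the order `False`, `True`. A short check in plain Python reproduces them:

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 13, 2    # actual False: 13 predicted False, 2 predicted True
fn, tp = 1, 116   # actual True: 1 predicted False, 116 predicted True

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_true = tp / (tp + fp)
recall_true = tp / (tp + fn)
precision_false = tn / (tn + fn)
recall_false = tn / (tn + fp)

print(round(accuracy, 4))                                  # 0.9773
print(round(precision_true, 2), round(recall_true, 2))     # 0.98 0.99
print(round(precision_false, 2), round(recall_false, 2))   # 0.93 0.87
```

The `False` class has only 15 test samples, so its precision and recall move by several points for each additional misclassified message.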

