
#### **NAIVE BAYES CLASSIFIER**
A probabilistic machine learning algorithm based on Bayes' theorem, assuming that features are conditionally independent of each other given the class label. Despite the "naive" assumption, it works well for text classification, spam detection, and sentiment analysis. It calculates the probability of a class given the input features and selects the class with the highest probability.

#### **Text feature extraction**
The process of converting raw text data into numerical features suitable for machine learning algorithms. This involves techniques such as tokenization, stemming, lemmatization, and vectorization (e.g., Bag of Words, TF-IDF) to represent text in a structured form while retaining as much of its meaning as possible.

#### **Bag of Words Approach**
A text feature extraction technique in which a document is represented as a collection of its words, without considering word order or context. It involves:

i. Tokenizing the text into words.

ii. Creating a vocabulary of unique words across all documents.

iii. Representing each document as a vector of word frequencies or binary values (indicating presence/absence of words).
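
The three steps above can be sketched in plain Python. This is only a minimal illustration, not how scikit-learn implements it: the helper names `tokenize` and `bag_of_words` and the two sample documents are made up for this example, and tokenization is assumed to be a simple lowercase match on runs of letters and digits.

```python
import re
from collections import Counter

def tokenize(text):
    # Step i: lowercase the text and keep runs of letters/digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(documents):
    tokenized = [tokenize(doc) for doc in documents]
    # Step ii: vocabulary of unique words across all documents,
    # sorted so that the column order is stable.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # Step iii: one count vector per document, aligned with the vocabulary.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample documents, for illustration only.
docs = ["Free entry, win a free prize", "Ok see you at dinner"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Replacing each count with `min(count, 1)` would give the binary presence/absence variant instead of word frequencies.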

### **N-grams:**

N-grams are contiguous sequences of $N$ words or tokens from a given text. They are used in natural language processing to capture context and relationships between words.

Example (for text "I love NLP"):

1-gram (unigram): ["I", "love", "NLP"]

2-gram (bigram): ["I love", "love NLP"]

3-gram: ["I love NLP"]
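
As a rough sketch of how such lists can be produced, the function below slides a window of width `n` over the tokens. The name `ngrams` is hypothetical, and splitting on whitespace is an assumption made to keep the example short; a real pipeline would usually tokenize more carefully.

```python
def ngrams(text, n):
    # Split on whitespace to get the tokens.
    tokens = text.split()
    # Slide a window of width n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love NLP"
print(ngrams(sentence, 1))  # ['I', 'love', 'NLP']
print(ngrams(sentence, 2))  # ['I love', 'love NLP']
print(ngrams(sentence, 3))  # ['I love NLP']
```

If `n` is larger than the number of tokens, the function returns an empty list.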


### **Unigram:**
A single word or token extracted from the text, representing the smallest unit of N-grams ($N = 1$).

Example (for text "I love NLP"): ["I", "love", "NLP"]

### **Bigram:**

A sequence of two consecutive words or tokens ($N = 2$) in the text, used to capture simple word relationships.

Example (for text "I love NLP"): ["I love", "love NLP"]

### **Naive Bayes Model for Spam Filter Detection**

#### Important Points to Consider in the Naive Bayes Classifier
 1. Naive Bayes classification assumes that all features of the model are independent of one another given the class.
 i.e. in the context of a spam filter, we assume that every word in a message is independent of all the other words, and we count words without regard to their context.

 2. The classification algorithm then computes the probability that a message is spam and the probability that it is not spam.

 3. These probabilities are estimated using Bayes' formula.

 4. The components of the formula are determined from the word frequencies across the whole set of messages.

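### **Naive Bayes Classifier equation:**

### $P(C \mid X_{1}, X_{2}, \ldots, X_{N}) \propto P(X_{1} \mid C) \, P(X_{2} \mid C) \cdots P(X_{N} \mid C) \, P(C)$

The right-hand side omits the normalizing constant $P(X_{1}, X_{2}, \ldots, X_{N})$, which is the same for every class and therefore does not affect which class has the highest probability.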
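
To make the formula concrete, here is a small from-scratch sketch that scores a message for each class as $P(C)$ times the product of the per-word terms $P(X_{i} \mid C)$. The four training messages and the function names are invented for illustration. Two details are assumptions added here rather than taken from the text above: add-one (Laplace) smoothing, so that an unseen word does not force the product to zero, and summing logarithms instead of multiplying probabilities, to avoid numerical underflow.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label) with 1 = spam, 0 = ham.
training = [
    ("win free prize now", 1),
    ("free entry win cash", 1),
    ("are you free for dinner", 0),
    ("see you at dinner tonight", 0),
]

def train(examples):
    class_counts = Counter()                    # number of messages per class
    word_counts = {0: Counter(), 1: Counter()}  # word frequencies per class
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set(word_counts[0]) | set(word_counts[1])
    return class_counts, word_counts, vocabulary

def log_scores(text, class_counts, word_counts, vocabulary):
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(C): the fraction of training messages in this class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # log P(X_i | C) with add-one (Laplace) smoothing.
            numerator = word_counts[label][word] + 1
            denominator = total_words + len(vocabulary)
            score += math.log(numerator / denominator)
        scores[label] = score
    return scores

def predict(text, class_counts, word_counts, vocabulary):
    scores = log_scores(text, class_counts, word_counts, vocabulary)
    # Select the class with the highest (log) score.
    return max(scores, key=scores.get)

model = train(training)
print(predict("win a free prize", *model))  # 1 (spam)
print(predict("dinner tonight", *model))    # 0 (ham)
```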

In [None]:
import pandas as pd

In [None]:
# Load the dataset, decoding the file as latin1.
df = pd.read_csv('/content/spam_email.csv', encoding='latin1')
df.head()



Unnamed: 0,Category,Msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### **Convert categorical data into numeric data and find the message length**

#### In previous labs we used label encoding or one-hot encoding

#### Let's try a very simple alternative, shown below, that maps the contents of the Category column to numbers

### i.e. for a ham (legitimate) message, label = 0, and for a spam message, label = 1



In [None]:
df['label'] = df.Category.map({'ham':0, 'spam':1})
df['msg_len'] = df.Msg.apply(len)
df.head()

Unnamed: 0,Category,Msg,label,msg_len
0,ham,"Go until jurong point, crazy.. Available only ...",0,111
1,ham,Ok lar... Joking wif u oni...,0,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1,155
3,ham,U dun say so early hor... U c already then say...,0,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",0,61


### **Split the overall dataset into a training set (80%) and a test set (20%, i.e. test_size=0.2)**

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the messages for testing; the remaining 80% are used for training.
X_train, X_test, Y_train, Y_test = train_test_split(df.Msg, df.label, test_size=0.2)
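
Because the split above is random, the training and test sets (and therefore the accuracy reported later) change from run to run. One possible refinement, sketched here on made-up data rather than on the notebook's DataFrame, is to pass `random_state` for a repeatable split and `stratify` to keep the ham/spam ratio the same in both parts. Both are parameters of `train_test_split`; the toy messages and variable names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 ham (0) and 10 spam (1) messages.
messages = [f"message {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# random_state fixes the shuffle so the split is the same on every run;
# stratify keeps the ham/spam ratio the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))  # 16 4
print(sum(y_te))             # 2 spam messages in the test part
```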

#### **BOW Model** --> Convert all texts in the data set to a matrix representation, which is called the BOW model; the ordering of words doesn't matter.

### In Python's scikit-learn library this is done with the `CountVectorizer` class. Using the Bag of Words (BOW) model, we convert the text messages of the Msg column into numbers

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

word_freq_count = CountVectorizer()
X_train_count = word_freq_count.fit_transform(X_train.values)

print(word_freq_count.get_feature_names_out())


['00' '000' '000pes' ... 'zyada' 'ãº1' 'ã¼']


### **N-grams** - the size of the word groups into which we divide the text of each document
### For the sentence

#### "It was a great hilarious day", convert to unigrams, bigrams and 3-grams

#### **unigram** = {'It', 'was', 'a', 'great', 'hilarious', 'day'}
#### **bigram** = {'It was', 'was a', 'a great', 'great hilarious', 'hilarious day'}
#### **3grams** = {'It was a', 'was a great', 'a great hilarious', 'great hilarious day'}

### In our spam detection, unigrams often offer better distinction, because spam is recognized more readily by single words such as 'Sale', 'Discount', 'Fine', etc. than by bigrams or 3-grams

### **Feed the processed data into the machine learning model --> in this case, use Multinomial Naive Bayes, which suits discrete data such as word counts**

#### using the scikit-learn `naive_bayes` module's class ---> `MultinomialNB()`

In [None]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count, Y_train)


### **Test the trained model above on your own texts**

In [None]:
mail_text = ['Get the children ready we will go to dinner',
             'Congratulations you got a massive offer']

mail = word_freq_count.transform(mail_text)
model.predict(mail)


array([0, 0])

### **Measure the accuracy of the trained Bayesian classifier on the test data**

In [None]:
X_test_count = word_freq_count.transform(X_test)
model.score(X_test_count, Y_test)

0.9847533632286996
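
Accuracy alone can look high when one class is much more common than the other, so it may be worth also checking how many spam messages are caught and how many ham messages are wrongly flagged. The sketch below uses scikit-learn's `confusion_matrix`, `precision_score` and `recall_score` on made-up labels. In the notebook, `y_true` would correspond to `Y_test` and `y_pred` to `model.predict(X_test_count)`, but the numbers here are purely illustrative and are not results from the dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (0 = ham, 1 = spam).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are the true classes (ham, spam); columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Precision: of the messages flagged as spam, the fraction that really are spam.
precision = precision_score(y_true, y_pred)
# Recall: of the real spam messages, the fraction that were caught.
recall = recall_score(y_true, y_pred)

print(cm)
print(precision, recall)  # 0.75 0.75
```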