## Building a NB Classifier from scratch



The CountVectorizer is quite useful. It counts the words and represents:

- each sentence / email as a row
- each word / term as a column

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["blue house and blue window", "window window green house", "green house and blue blue"]

vectorizer = CountVectorizer()
term_count_matrix = vectorizer.fit_transform(texts)

# The unique words / column names
print(vectorizer.get_feature_names()) # our sklearn version uses get_feature_names; newer versions use get_feature_names_out

# Number of occurrences of each word in each text
print(term_count_matrix.toarray())



['and', 'blue', 'green', 'house', 'window']
[[1 2 0 1 1]
 [0 0 1 1 2]
 [1 2 1 1 0]]
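
To make the row/column layout concrete, here is a minimal pure-Python sketch of the counting step. It only approximates what CountVectorizer does (it splits on whitespace and skips tokenizer details such as lowercasing and dropping one-letter tokens), and `count_terms` is a hypothetical helper name:

```python
from collections import Counter

def count_terms(texts):
    """Return (vocabulary, matrix): sorted unique words and per-text word counts."""
    tokenized = [text.split() for text in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

texts = ["blue house and blue window", "window window green house", "green house and blue blue"]
vocabulary, matrix = count_terms(texts)
print(vocabulary)
for row in matrix:
    print(row)
```

For these three texts the sketch yields the same vocabulary and the same counts as the output above.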


**Our data in data_cooked.csv looks something like this (example):**


urgent contact last weekend draw show prize call claim code valid,1

get dump heap mom decid come low bore,0

ok lor soni ericsson salesman ask shuhui say quit gd use consid,0

privat account statement show un redeem point call identifi code expir,1


In [4]:
# First, get the data
import numpy as np


# I took this from our prepared data and adjusted it a bit to illustrate the idea better
data = np.array([
("free claim call code", 1),
("heap mom bore",0),
("ok salesman quit",0),
("urgent call free identifi code",1)
])

mails = data[:,0] # all rows, 0th column
labels = data[:,1] # all rows, 1st column (NumPy stores these labels as strings because the array mixes text and numbers)


# Second: use the vectorizer on the mails
vectorizer = CountVectorizer()
word_count_matrix = vectorizer.fit_transform(mails)
print(vectorizer.get_feature_names()) # Column Names 
print('Word Count Matrix looks like this:\n', word_count_matrix.toarray())




['bore', 'call', 'claim', 'code', 'free', 'heap', 'identifi', 'mom', 'ok', 'quit', 'salesman', 'urgent']
Word Count Matrix looks like this:
 [[0 1 1 1 1 0 0 0 0 0 0 0]
 [1 0 0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0 0 0 0 1]]


As a next step, we can calculate a simple probability for each word (that is, for each column): count in how many mails the word appears, then divide by the number of all mails:

Mails containing the word  /  Number of all mails

Note that this first version does not use the spam labels yet.

In [5]:
probabilities = {}


number_all_mails = len(word_count_matrix.toarray())
for index,word in enumerate(vectorizer.get_feature_names()):
    column = word_count_matrix[:,index]
    count = np.count_nonzero(column[column>0]) # column[column>0] keeps only the entries greater than 0, so this counts the mails that contain the word
    probabilities[word] = count / number_all_mails

probabilities


{'bore': 0.25,
 'call': 0.5,
 'claim': 0.25,
 'code': 0.5,
 'free': 0.5,
 'heap': 0.25,
 'identifi': 0.25,
 'mom': 0.25,
 'ok': 0.25,
 'quit': 0.25,
 'salesman': 0.25,
 'urgent': 0.25}

Note how words like call, code, and free get a higher value than the other words, because each of them appears in two of the four mails.

This is still very simple, but it could be a start.
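
A natural next step, sketched here in plain Python under the assumption that we only care whether a word appears in a mail, is to compute the same share separately per class. `presence_share` is a hypothetical helper, and the toy mails are the ones from the cell above:

```python
def presence_share(mails, labels, target_label):
    """For each word, the share of mails with `target_label` that contain it."""
    selected = [set(mail.split()) for mail, label in zip(mails, labels) if label == target_label]
    vocabulary = sorted({word for mail in mails for word in mail.split()})
    return {word: sum(word in mail for mail in selected) / len(selected) for word in vocabulary}

mails = ["free claim call code", "heap mom bore", "ok salesman quit", "urgent call free identifi code"]
labels = [1, 0, 0, 1]  # 1 = spam, 0 = not spam

share_in_spam = presence_share(mails, labels, target_label=1)
share_in_not_spam = presence_share(mails, labels, target_label=0)
print(share_in_spam["free"], share_in_not_spam["free"])
```

Separating the two classes is what lets a word such as free point towards spam instead of just being frequent overall.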

In the following code cells I tried to code the Naive Bayes Classifier from our lecture, starting at slide 28 in "Reasoning with uncertainty.pptx":


In [6]:
#Start by getting some data and making it usable:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

data_path = "../Data/Preprocessing/data_cooked.csv"

dataframe = pd.read_csv(data_path)

# there are missing values somewhere; may have to check Preprocessing.ipynb again
# resetting the index is very important: otherwise pandas keeps the old index labels, which breaks the positional row access that follows
dataframe = dataframe.dropna(subset=['Message']).reset_index(drop=True)

mails = dataframe['Message']
labels = dataframe['Category']

cv = CountVectorizer()
word_count_matrix = cv.fit_transform(mails.to_numpy()).toarray()
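
Calling .toarray() turns the sparse count matrix into a dense one, which can get large. A back-of-the-envelope estimate, assuming the figures reported elsewhere in this notebook (9172 + 2115 mails in the Category summary, 28166 unique words) and 8-byte int64 counts:

```python
n_mails = 9172 + 2115        # rows: mails per class from the Category summary
n_words = 28166              # columns: unique words reported for the full dataset
bytes_per_count = 8          # assumption: counts are stored as int64

dense_bytes = n_mails * n_words * bytes_per_count
print(f"Dense matrix: about {dense_bytes / 1024**3:.2f} GiB")
```

Under these assumptions the dense matrix needs more than 2 GiB, which may be enough to cause memory trouble on a laptop.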


In [7]:
#1st attempt
#CAUTION: This was my initial attempt, which follows exactly what we did in the lecture on slide 32,
#but it takes a long time to execute. See the cell further down that uses NumPy matrix operations instead


#Calculating P(SPAM) and P(not SPAM)
spam_count_total = len(labels.loc[labels==1])
not_spam_count_total = len(labels.loc[labels==0])
total = spam_count_total + not_spam_count_total

P_spam = spam_count_total / total
P_not_spam = not_spam_count_total / total

#Calculating the Conditional Probability for each word:


# Number of unique words in the corpus (= number of elements in one row)
unique_word_count = len(word_count_matrix[0])

words = cv.get_feature_names()
word_proba_array = np.zeros((2,unique_word_count)) 
#2 rows:
# FIRST ROW for the word's conditional probability given spam=yes
# SECOND ROW for the word's conditional probability given spam=no
#(see slide 32 in lecture)

#Columns: as many columns as there are unique words in the data


#Laplace smoothing constant to avoid zero probabilities (defined here, but not applied in this first attempt)
alpha = 1

#Get the indices of all spam rows and of all non-spam rows
spam_rows_indices = labels[labels == 1].index.tolist()
not_spam_rows_indices = labels[labels == 0].index.tolist()

# Calculate the two conditional probabilities for each word
for word_idx in range(len(word_count_matrix[0])):
    
    #calculating for spam=yes rows
    for row_index in spam_rows_indices:
        word_count = word_count_matrix[row_index][word_idx]
        if(word_count > 0): #only checks whether the word appears, not how often
            word_proba_array[0][word_idx] += 1 / (spam_count_total)
            #-> word_proba_array[0] is the row that holds the conditional probabilities for spam=yes
            #-> word_proba_array[0][word_idx] is the entry of the word we're looking at right now
            #-> "+= 1 / spam_count_total" raises the probability by 1/spam_count_total for each spam=yes mail that contains the word


    #calculating for spam=no rows
    for row_index in not_spam_rows_indices:
        word_count = word_count_matrix[row_index][word_idx]
        if(word_count > 0): #only checks whether the word appears, not how often
            word_proba_array[1][word_idx] += 1 / (not_spam_count_total)
            #-> same as before but for spam=no mails



#Concatenating the words with their probabilities

probabilities_df = pd.DataFrame({
    'Word': words,
    'P(Word|Spam)': word_proba_array[0],
    'P(Word|NotSpam)': word_proba_array[1]
})

KeyboardInterrupt: 
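
The nested loops in the first attempt touch every (word, mail) pair in Python, which is why that cell is so slow. The same presence-based estimate can probably be computed with NumPy boolean operations. This is a sketch on a tiny made-up matrix, not a drop-in replacement for the cell:

```python
import numpy as np

# Toy word-count matrix: 4 mails (rows) x 3 words (columns), made-up numbers
word_count_matrix = np.array([
    [2, 0, 1],
    [0, 1, 0],
    [0, 3, 0],
    [1, 0, 0],
])
labels = np.array([1, 0, 0, 1])  # 1 = spam, 0 = not spam

present = word_count_matrix > 0  # True where a word occurs in a mail at least once

# Share of spam / non-spam mails that contain each word
p_word_given_spam = present[labels == 1].mean(axis=0)
p_word_given_not_spam = present[labels == 0].mean(axis=0)

print(p_word_given_spam)
print(p_word_given_not_spam)
```

Taking the mean of a boolean column gives the fraction of True values, which is exactly the "add 1 / number of mails per hit" logic of the loop.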

(The second attempt in the next cell once crashed with a MemoryError for me. After I rebooted my laptop, it ran without problems.)

In [8]:
# 2nd Attempt 

spam_count_total = np.sum(labels == 1)
not_spam_count_total = np.sum(labels == 0)
total = spam_count_total + not_spam_count_total

P_spam = spam_count_total / total
P_not_spam = not_spam_count_total / total

alpha = 1
unique_word_count = len(cv.get_feature_names())
words = np.array(cv.get_feature_names())

#Instead of looping through the whole dataset, counting word appearances and adding up probabilities,
#build an array that sums each word's counts over the whole dataset. (This also takes into account how often
#a word appears in a mail, not only whether it appears at all, as in the 1st attempt.)
spam_word_counts = word_count_matrix[labels == 1].sum(axis=0)
not_spam_word_counts = word_count_matrix[labels == 0].sum(axis=0)

#Calculating the conditional probabilities as spam_word_counts / total number of words in spam mails,
#with Laplace smoothing added as described here: https://towardsdatascience.com/laplace-smoothing-in-na%C3%AFve-bayes-algorithm-9c237a8bdece
word_proba_spam = (spam_word_counts + alpha) / (spam_word_counts.sum() + alpha * unique_word_count)
word_proba_not_spam = (not_spam_word_counts + alpha) / (not_spam_word_counts.sum() + alpha * unique_word_count)



probabilities_df = pd.DataFrame({
    'Word': words,
    'P(Word|Spam)': word_proba_spam,
    'P(Word|NotSpam)': word_proba_not_spam
})
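
To see what the smoothing does, here is a small worked example with made-up counts for a three-word vocabulary. It applies the same formula as the cell above, (count + alpha) / (total count + alpha * vocabulary size):

```python
alpha = 1
spam_word_counts = [3, 0, 1]            # made-up counts of three words in spam mails
vocabulary_size = len(spam_word_counts)
total = sum(spam_word_counts)

unsmoothed = [count / total for count in spam_word_counts]
smoothed = [(count + alpha) / (total + alpha * vocabulary_size) for count in spam_word_counts]

print(unsmoothed)  # the second word has probability 0
print(smoothed)    # every word gets a non-zero probability
```

Without smoothing, a single word that never occurred in one class would force the whole product for that class to zero. The smoothed values still sum to 1.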




Now write this as a function:

In [9]:
def get_prior_probas(mails:pd.Series, labels:pd.Series):

    spam_count_total = np.sum(labels == 1)
    not_spam_count_total = np.sum(labels == 0)
    total = spam_count_total + not_spam_count_total

    P_spam = spam_count_total / total
    P_not_spam = not_spam_count_total / total

    ret_dict ={
        "P_spam" : P_spam, 
        "P_not_spam": P_not_spam
    } 
    return ret_dict

def get_cond_probas(mails:pd.Series, labels:pd.Series):

    cv = CountVectorizer()
    word_count_matrix = cv.fit_transform(mails.to_numpy()).toarray()

    alpha = 1
    unique_word_count = len(cv.get_feature_names())
    words = np.array(cv.get_feature_names())

    #Instead of looping through the whole dataset, counting word appearances and adding up probabilities,
    #build an array that sums each word's counts over the whole dataset. (This also takes into account how often
    #a word appears in a mail, not only whether it appears at all, as in the 1st attempt.)
    spam_word_counts = word_count_matrix[labels == 1].sum(axis=0)
    not_spam_word_counts = word_count_matrix[labels == 0].sum(axis=0)

    #Calculating the conditional probabilities as spam_word_counts / total number of words in spam mails,
    #with Laplace smoothing added as described here: https://towardsdatascience.com/laplace-smoothing-in-na%C3%AFve-bayes-algorithm-9c237a8bdece
    word_proba_spam = (spam_word_counts + alpha) / (spam_word_counts.sum() + alpha * unique_word_count)
    word_proba_not_spam = (not_spam_word_counts + alpha) / (not_spam_word_counts.sum() + alpha * unique_word_count)

    probabilities_df = pd.DataFrame({
        'Word': words,
        'P(Word|Spam)': word_proba_spam,
        'P(Word|NotSpam)': word_proba_not_spam
    })

    return probabilities_df


In [10]:
#Give it a test
data_path = "../Data/Preprocessing/data_cooked.csv"

dataframe = pd.read_csv(data_path)

# there are missing values somewhere; may have to check Preprocessing.ipynb again
# resetting the index is very important: otherwise pandas keeps the old index labels instead of positions 0..n-1
dataframe = dataframe.dropna(subset=['Message']).reset_index(drop=True)

mails = dataframe['Message']
labels = dataframe['Category']

cond_probas = get_cond_probas(mails=mails,labels=labels)

cond_probas.head()

         Word  P(Word|Spam)  P(Word|NotSpam)
0          aa      0.000015         0.000124
1         aaa      0.000005         0.000024
2  aaaenerfax      0.000005         0.000003
3    aadedeji      0.000005         0.000003
4      aagraw      0.000005         0.000003


In [11]:
#Just out of curiosity, some sample conditional probabilities
examples = ['free', 'call', 'friend', "tomorrow", 'signup', 'money', 'win', 'verif', 'confirm']

selected_probabilities = probabilities_df[probabilities_df['Word'].isin(examples)]

print(selected_probabilities)



           Word  P(Word|Spam)  P(Word|NotSpam)
3417       call      0.002998         0.002650
4867    confirm      0.000252         0.000737
9150       free      0.004293         0.000895
9216     friend      0.000524         0.000376
15768     money      0.003429         0.000252
22408    signup      0.000098         0.000003
24994  tomorrow      0.000113         0.000531
26319     verif      0.000118         0.000019
27228       win      0.000776         0.000094


Now on to slide 45: classifying spam:

In [12]:
#Method for classifying a single message

def classify_message(message, cond_probas, prior_probas):
    

    #the next line makes a CountVectorizer that uses the same 
    #column names as in training (column names = unique words)
    #this is done by passing the vocabulary stored in the Word column
    cv = CountVectorizer(vocabulary=cond_probas['Word'].values)

    #with this CountVectorizer you can make a vector from the message
    #same as the word_count_matrix, but only for one mail text
    message_vector = cv.transform([message]).toarray().flatten()

    #the calculation adds up the logs of the probabilities instead of multiplying the probabilities:
    #multiplying many very small numbers can underflow to 0 in floating point
    log_prob_spam = np.log(prior_probas['P_spam'])
    log_prob_not_spam = np.log(prior_probas['P_not_spam'])

    #in this loop the calculation for the vector is done for spam=yes as well as for spam=no
    for word, count in zip(cv.get_feature_names(), message_vector): 
        if count > 0:  
            if word in cond_probas['Word'].values: #safeguard: words that haven't been in the training data are already ignored by cv.transform
                word_data = cond_probas[cond_probas['Word'] == word] # get the word's conditional probabilities
                log_prob_spam += count * np.log(word_data['P(Word|Spam)'].values[0]) 
                log_prob_not_spam += count * np.log(word_data['P(Word|NotSpam)'].values[0])

    #make the choice by comparing the two values
    if log_prob_spam > log_prob_not_spam:
        return 1
    else:
        return 0
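
Looking up every word in the DataFrame inside a Python loop is slow for a large vocabulary. Assuming the conditional probabilities are available as NumPy arrays in the same order as the count vector, the two log scores can be written as dot products. `classify_counts` and all numbers in this sketch are hypothetical:

```python
import numpy as np

def classify_counts(count_vector, log_p_word_spam, log_p_word_not_spam, p_spam):
    """Return 1 (spam) or 0 (not spam) for one message given its word counts."""
    # log prior + sum over words of count * log P(word | class)
    log_spam = np.log(p_spam) + count_vector @ log_p_word_spam
    log_not_spam = np.log(1.0 - p_spam) + count_vector @ log_p_word_not_spam
    return int(log_spam > log_not_spam)

# Toy vocabulary ["call", "free", "mom"] with made-up conditional probabilities
log_p_word_spam = np.log(np.array([0.4, 0.5, 0.1]))
log_p_word_not_spam = np.log(np.array([0.3, 0.1, 0.6]))

spam_like = np.array([1, 2, 0])      # counts for "call free free"
not_spam_like = np.array([0, 0, 2])  # counts for "mom mom"

print(classify_counts(spam_like, log_p_word_spam, log_p_word_not_spam, p_spam=0.2))
print(classify_counts(not_spam_like, log_p_word_spam, log_p_word_not_spam, p_spam=0.2))
```

Words with a count of 0 contribute nothing to the dot product, so no explicit `if count > 0` check is needed.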

In [13]:
#Warning
# This cell may take a few minutes: 
# it loads the data, trains the model, makes predictions and calculates a score of the predictions


#using train_test_split to make a train and a test set
from sklearn.model_selection import train_test_split

data_path = "../Data/Preprocessing/data_cooked.csv"
dataframe = pd.read_csv(data_path)
dataframe = dataframe.dropna(subset=['Message']).reset_index(drop=True)


mails = dataframe['Message']
labels = dataframe['Category']


#making a train set, which is used to calculate the probabilities, and a test set to test afterwards on unseen cases
mails_train, mails_test, labels_train, labels_test = train_test_split(
    mails, labels, test_size=0.25, random_state=42,stratify=labels) #random_state fixes the seed of the shuffle so the split is reproducible; stratify keeps the spam / not-spam ratio the same in both sets, which matters because there's a lot more spam=no than spam=yes in our data


#calling our functions from earlier
prior_probas = get_prior_probas(mails_train, labels_train)
cond_probas = get_cond_probas(mails_train, labels_train)

#creating the result array with a list comprehension for each test mail in mails_test
results = [classify_message(msg, cond_probas, prior_probas) for msg in mails_test]


#calculate the mean of a list that looks like this: [True,False,False,True,True,True...]
#the list comprehension zips together the model's predictions in results and the actual
#labels in labels_test
#pred == real then gives a True/False value that indicates whether the prediction was right or wrong
#finally, np.mean turns this into the share of correct predictions
accuracy = np.mean([pred == real for pred, real in zip(results, labels_test)])
print(f"Accuracy is: {accuracy:.2%}")



KeyboardInterrupt: 
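
Because there are far more spam=no than spam=yes mails, accuracy alone can hide poor spam detection. As a possible complement, precision and recall for the spam class can be computed from the same two lists. The predictions in this sketch are made up for illustration, and `precision_recall` is a hypothetical helper:

```python
def precision_recall(predictions, actual, positive=1):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(predictions, actual))
    true_pos = sum(p == positive and a == positive for p, a in pairs)
    false_pos = sum(p == positive and a != positive for p, a in pairs)
    false_neg = sum(p != positive and a == positive for p, a in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

results = [1, 0, 0, 1, 0, 0, 0, 1]       # made-up predictions
labels_test = [1, 0, 0, 0, 1, 0, 0, 1]   # made-up true labels
precision, recall = precision_recall(results, labels_test)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

Precision says how many of the mails flagged as spam really are spam; recall says how many of the actual spam mails were caught.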

## Another NBC example

In [14]:
#importing the packages and data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 

data_path = "../Data/Preprocessing/data_cooked.csv"
dataframe = pd.read_csv(data_path)

# There were some null values in the Message column, so I dropped these 
dataframe = dataframe.dropna(subset=['Message']) 

dataframe.groupby('Category').describe() 

          Message
          count unique                                                top freq
Category
0          9172   8592                                   sorri call later   30
1          2115   1942  privat account statement show un redeem point ...    9


In [15]:
# create train/test split

x_train, x_test, y_train, y_test = train_test_split(dataframe.Message, dataframe.Category)


# get word count and store the data as a matrix 
cv = CountVectorizer()
x_train_count = cv.fit_transform(x_train.values)

x_train_count 

<8465x24774 sparse matrix of type '<class 'numpy.int64'>'
	with 386131 stored elements in Compressed Sparse Row format>
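
This summary shows why the sparse format pays off. A quick calculation with the numbers from that output:

```python
rows, cols = 8465, 24774     # shape of x_train_count from the output above
stored = 386131              # stored elements from the output above

density = stored / (rows * cols)
print(f"{density:.4%} of the entries are stored")
```

Fewer than 0.2% of the cells hold a count, so a dense array of the same shape would consist almost entirely of zeros.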

In [16]:
# train model
naiveBayes = MultinomialNB()
naiveBayes.fit(x_train_count, y_train) 

MultinomialNB()

In [17]:
#pre-test for ham
email_ham = ["hey wanna meet up for dinner tonight?"]
email_ham_count = cv.transform(email_ham)
naiveBayes.predict(email_ham_count) 

array([0], dtype=int64)

In [18]:
#pre-test for spam
email_spam = ["reward money insurance click win"]
email_spam_count = cv.transform(email_spam)
naiveBayes.predict(email_spam_count) 

array([1], dtype=int64)

In [19]:
# test model 
x_test_count = cv.transform(x_test)
naiveBayes.score(x_test_count, y_test)

0.9386959603118356
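
To put this score in context, it helps to compare it with a trivial classifier that always predicts the majority class. The sketch uses the class counts from the Category summary earlier in this example; they cover the whole dataset, so the share in the test split may differ slightly:

```python
not_spam_count = 9172   # Category 0 count from the summary table
spam_count = 2115       # Category 1 count from the summary table

baseline_accuracy = not_spam_count / (not_spam_count + spam_count)
print(f"Always predicting not spam: {baseline_accuracy:.2%}")
```

Any useful model has to beat this baseline, not just 50%.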

## Deep Learning Model with TensorFlow

Important: to use TensorFlow, you'll have to:

1. conda remove --name spam_classifier_env --all (to remove the old environment)
2. conda env create -f environment.yml (to create the new environment containing keras)
3. conda activate spam_classifier_env (to activate the new environment)
4. restart the notebook



In [1]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)


#importing the packages and data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer 


TensorFlow version: 2.13.0


In [4]:
#Load data and prepare
data_path = "../Data/Preprocessing/data_cooked.csv"
dataframe = pd.read_csv(data_path)
dataframe = dataframe.dropna(subset=['Message']).reset_index(drop=True)


X = dataframe['Message']
y = dataframe['Category']

cv = CountVectorizer()
X = cv.fit_transform(X).toarray()  # convert the sparse matrix into a dense array that the network can handle


#this tells us what the input shape of the network needs to be
print(len(cv.get_feature_names()))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


28166


In [5]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, input_shape=[X_train.shape[1]], activation="relu"),
    tf.keras.layers.Dense(128,  activation="relu"),
    tf.keras.layers.Dense(128,  activation="relu"),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam', 
    loss='binary_crossentropy', 
    metrics=['accuracy'])
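
This network is fairly large because the first layer connects every word in the vocabulary to 512 units. A rough parameter count, assuming the input size of 28166 printed above and that each Dense layer has inputs * units weights plus one bias per unit:

```python
layer_sizes = [28166, 512, 128, 128, 1]   # input size followed by the units of each Dense layer

# Each Dense layer has (inputs * units) weights plus one bias per unit
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)
```

Almost all parameters sit in the first layer, so shrinking the vocabulary is likely the most effective way to make the model smaller. Calling model.summary() should report the same totals.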


In [6]:
#Takes a while; reduce epochs to do less training...

#10 epochs got me 97% accuracy and
#3 epochs got me 96.9% accuracy
#there's no big difference right now. Only with more fine-tuning would there be more improvement
#over the course of 10 epochs

# Model training
history = model.fit(
    X_train, 
    y_train, 
    epochs=3, 
    batch_size=32, 
    validation_data=(X_test, y_test)
)

# Evaluation
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Accuracy on test data: {accuracy*100:.2f}%")



Epoch 1/3
Epoch 2/3
Epoch 3/3
Accuracy on test data: 98.54%
