## Title: Naive Bayes Model
#### Name: Sunmi Kim
#### Date: 06/07/2023

#### **Machine Learning: 7 major steps**

1. Collect and explore data
1. Prepare the data
1. Choose a model
1. Train your model
1. Evaluate your model
1. Parameter tuning
1. Make predictions

- **Data preparation** is the process of transforming raw data into a format suitable for machine learning algorithms. It involves cleaning, filtering, imputing, scaling, encoding, and splitting.
- **Data exploration** helps reveal the patterns and problems in a dataset and informs the choice of model or algorithm. The data are reorganized and presented in an understandable form through visualizing, summarizing, testing, and modeling.
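
The splitting part of data preparation can be sketched as follows. This is only an illustration on made-up rows and is not part of this notebook's pipeline; the function name `split_rows`, the 80/20 ratio, and the toy emails are all assumptions.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy (text, label) rows; 1 = spam, 0 = ham (hypothetical data)
toy_rows = [("win money now", 1), ("meeting at noon", 0),
            ("cheap lottery", 1), ("see you tomorrow", 0),
            ("free prize", 1)]
train_rows, test_rows = split_rows(toy_rows)
print(len(train_rows), len(test_rows))
```

Holding out a test set this way makes it possible to measure the model on emails it did not see during training.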

## Step 1: Data Exploration and Preparation

#### raw data location: 
/kaggle/input/naive-bayes-chapter-8/emails.csv

In [2]:
import numpy as np
import re
import pandas as pd
import pathlib

path = 'emails.csv'
emails = pd.read_csv(path) # ('/kaggle/input/naive-bayes-chapter-8/emails.csv')
emails[0:10]


Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1
8,Subject: undeliverable : home based business f...,1
9,Subject: save your money buy getting this thin...,1


In [3]:
# The emails DataFrame has a column for the text of each email
# A list is ordered and keeps duplicates; a set is unordered and removes duplicates

def process_email(text):
    '''takes text string and splits into a list of words removing duplicates'''
    text = text.lower() # make strings lower case
    # The words are passed to a set function to remove duplicates
    return list(set(text.split()))

In [4]:
# apply process_email function to 'text' column and create 'words' column
emails['words'] = emails['text'].apply(process_email)
emails[0:10]

Unnamed: 0,text,spam,words
0,Subject: naturally irresistible your corporate...,1,"[system, at, interested, ordered, is, good, io..."
1,Subject: the stock trading gunslinger fanny i...,1,"[superior, incredible, fanny, mcdougall, cloth..."
2,Subject: unbelievable new homes made easy im ...,1,"[homes, fixed, -, that, complete, this, wantin..."
3,Subject: 4 color printing special request add...,1,"[/, -, graphix, 91706, special, advertisement,..."
4,"Subject: do not have money , get software cds ...",1,"[', with, by, it, finish, me, compatibility, o..."
5,"Subject: great nnews hello , welcome to medzo...",1,"[ntiaiity, introduce, -, leisure, hlpplng, ov,..."
6,Subject: here ' s a hot play in motion homela...,1,"[why, advertisement, omit, term, o, toois, ban..."
7,Subject: save your money buy getting this thin...,1,"[with, getting, cannot, real, -, that, country..."
8,Subject: undeliverable : home based business f...,1,"[telecom, s, fjt, 75, sun, is, :, unknown, mtp..."
9,Subject: save your money buy getting this thin...,1,"[with, getting, cannot, real, -, that, lasts, ..."
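
The `words` column above still contains punctuation tokens such as `-`, `/`, and `:`. One possible refinement, sketched here, uses the `re` module (imported earlier but not used) to keep only alphanumeric tokens. The function name `process_email_regex` is hypothetical, and the pattern is just one assumption about what counts as a word; switching to it would change the word counts shown later in this notebook.

```python
import re

def process_email_regex(text):
    """Lower-case the text and return its unique alphanumeric tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(set(tokens))  # sorted only to make the output deterministic

print(process_email_regex("Subject: 4 color printing - special request !"))
```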


#### Testing to check if Data Cleaning and Processing are done correctly

In [5]:
raw_data_line1 = emails['text'][0]
print(raw_data_line1)

cleaned_data_line1 = emails['words'][0]
print(cleaned_data_line1)

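num_chars1a = len(emails['text'][0])   # len() of a string counts characters, not words
num_words1b = len(emails['words'][0])  # number of unique tokens after processing
print()
print("Raw_data_line1: ", num_chars1a, "characters") # 1484
print("Cleaned_data_line1: ", num_words1b, "unique words") # 139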

Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  mar

In [6]:
num_emails = len(emails) # count the number of emails
#num_spam = sum(emails['spam']) # sum function takes spam column and get total
num_spam = emails['spam'][emails['spam']==1].count()

print("Number of emails: ", num_emails) # 5728
print("Number of spam emails: ", num_spam) # 1368
print()

# Calculating the prior probability that an email is spam
print("Probability of spam:", num_spam/num_emails) # 0.23882...

Number of emails:  5728
Number of spam emails:  1368

Probability of spam: 0.2388268156424581


## Step 2: Calculate the priors

In [7]:
def calculate_frequency_average(data, label_value):
    total = len(data)
    num_label = data[data == label_value].count()
    frequency = num_label / total
    return frequency

label_spam = 1
label_ham = 0

# meta data
num_emails = len(emails)
counts_label = emails['spam'].value_counts()
num_spam = counts_label[label_spam]
print(counts_label)

print("Number of emails: ", num_emails)
print("Number of spam emails: ", num_spam)

# calculating the prior probability an email is spam
dict_priors = calculate_frequency_average(emails['spam'], label_spam)
# dict_priors = num_spam/num_emails
print("Probability of spam: ", dict_priors)


0    4360
1    1368
Name: spam, dtype: int64
Number of emails:  5728
Number of spam emails:  1368
Probability of spam:  0.2388268156424581
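
Written out with the counts printed above, the prior is simply the fraction of emails carrying each label. The symbols $N$, $N_{\text{spam}}$, and $N_{\text{ham}}$ are introduced here only for illustration.

$$
P(\text{spam}) = \frac{N_{\text{spam}}}{N} = \frac{1368}{5728} \approx 0.2388,
\qquad
P(\text{ham}) = \frac{N_{\text{ham}}}{N} = \frac{4360}{5728} \approx 0.7612
$$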


## Step 3: Training the Model

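This step counts how often each word appears across all emails, then trains the model by counting each word's occurrences separately for each label (spam and ham). Because `process_email` removes duplicates within an email, each count is the number of emails that contain the word.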

In [8]:
def construct_frequency_dict_from_series(data_series):
    frequency_dict = {}
    for text in data_series:
        if isinstance(text, list):
            for word in text:
                if word in frequency_dict:
                    frequency_dict[word] += 1
                else:
                    frequency_dict[word] = 1
    return frequency_dict

In [9]:
def calculate_labeled_frequencies(frequency_dict, emails):
    labeled_frequencies = {}
    unique_labels = emails['spam'].unique()
    for label in unique_labels:
        subset = emails[emails['spam'] == label]
        label_frequency_dict = construct_frequency_dict_from_series(subset['words'])
        labeled_frequencies[label] = label_frequency_dict
    return labeled_frequencies

In [10]:
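# Use the 'words' column (lists of tokens). The raw 'text' column holds strings,
# which construct_frequency_dict_from_series skips, leaving the dictionary empty.
dict_frequencies_whole_text = construct_frequency_dict_from_series(emails['words'])
# Note: calculate_labeled_frequencies does not use its first argument
dict_model = calculate_labeled_frequencies(dict_frequencies_whole_text, emails)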

print(dict_model[1].get('lottery', 0))
print(dict_model[1].get('sale', 0))
print(dict_model[1].get('already', 0))

8
38
64
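
The counting loop in `construct_frequency_dict_from_series` could also be written with `collections.Counter` from the standard library. The sketch below runs on made-up token lists rather than the `emails` DataFrame, and the name `count_words` is hypothetical.

```python
from collections import Counter

def count_words(list_of_word_lists):
    """Count in how many lists each word appears (each list holds unique words)."""
    counts = Counter()
    for words in list_of_word_lists:
        if isinstance(words, list):  # mirror the notebook's type check
            counts.update(words)
    return counts

# Toy token lists standing in for the 'words' column (hypothetical data)
toy_words = [["buy", "cheap", "lottery"], ["lottery", "win"], ["meeting", "buy"]]
toy_counts = count_words(toy_words)
print(toy_counts["lottery"], toy_counts["buy"], toy_counts["missing"])
```

A `Counter` returns 0 for words it has never seen, so lookups do not need `.get(word, 0)`.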


In [16]:
words_to_check = ['lottery', 'sale', 'already', 'buy', 'free', 'cheap']
for word in words_to_check:
    word_frequencies = {}
    for label, frequencies in dict_model.items():
        if word in frequencies:
            word_frequencies[label] = frequencies[word]
        else:
            word_frequencies[label] = 0
    print(f"Frequencies for '{word}': {word_frequencies}") # 1 = spam 0 = ham

Frequencies for 'lottery': {1: 8, 0: 0}
Frequencies for 'sale': {1: 38, 0: 41}
Frequencies for 'already': {1: 64, 0: 317}
Frequencies for 'buy': {1: 119, 0: 131}
Frequencies for 'free': {1: 239, 0: 442}
Frequencies for 'cheap': {1: 48, 0: 13}


#### Check dict_model

In [17]:
#print(dict_model)
print(dict_model.keys())

dict_keys([1, 0])


In [21]:
#print(dict_model.values())

## Step 4: Using the model to make predictions

- Build a dictionary that records, for every word, a pair of counts: the number of spam emails and the number of ham emails in which it occurs.

In [22]:
model = {}

# Training process
for index, email in emails.iterrows():
    for word in email['words']:
        if word not in model:
            # Start both counts at 1 so that no word has a zero count (smoothing)
            model[word] = {'spam': 1, 'ham': 1}
        if email['spam']:
            model[word]['spam'] += 1
        else:
            model[word]['ham'] += 1


In [23]:
model['lottery'] # probably spam 9 out of 10

{'spam': 9, 'ham': 1}

In [24]:
model['sale'] # boundary between ham and spam

{'spam': 39, 'ham': 42}

In [25]:
model['subscribe'] # unlikely to be spam

{'spam': 8, 'ham': 28}

In [26]:
model['already'] # most likely not spam

{'spam': 65, 'ham': 318}

In [27]:
def predict_bayes(word):
    word = word.lower()
    num_spam_with_word = model[word]['spam']
    num_ham_with_word = model[word]['ham']
    return 1.0 * num_spam_with_word / (num_spam_with_word + num_ham_with_word)

In [28]:
predict_bayes('lottery')

0.9

In [29]:
predict_bayes('sale')

0.48148148148148145

In [30]:
predict_bayes('already')

0.16971279373368145
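
The ratio of counts in `predict_bayes` can be read as Bayes' theorem for a single word. The notation is introduced here for illustration: $n_s(w)$ and $n_h(w)$ are the stored spam and ham counts for word $w$, and $N_s$, $N_h$, and $N$ are the numbers of spam emails, ham emails, and all emails. Estimating $P(w \mid \text{spam}) \approx n_s(w)/N_s$ and $P(\text{spam}) = N_s/N$ (and likewise for ham), the class totals cancel:

$$
P(\text{spam} \mid w)
= \frac{P(w \mid \text{spam})\,P(\text{spam})}{P(w \mid \text{spam})\,P(\text{spam}) + P(w \mid \text{ham})\,P(\text{ham})}
\approx \frac{\frac{n_s(w)}{N_s}\cdot\frac{N_s}{N}}{\frac{n_s(w)}{N_s}\cdot\frac{N_s}{N} + \frac{n_h(w)}{N_h}\cdot\frac{N_h}{N}}
= \frac{n_s(w)}{n_s(w) + n_h(w)}
$$

For `lottery`, with stored counts of 9 and 1, this gives $9 / (9 + 1) = 0.9$, matching the output above.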

## Step 5: Predicting with the Naive Bayes Model

In [31]:
def predict_naive_bayes(email):
    total = len(emails)
    num_spam = sum(emails['spam'])
    num_ham = total - num_spam
    email = email.lower()
    words = set(email.split())
    spams = [1.0]
    hams = [1.0]
    for word in words:
        if word in model:
            spams.append(model[word]['spam']/num_spam * total)
            hams.append(model[word]['ham']/num_ham * total)
    # np.compat.long was an alias for Python's int and is not available in newer NumPy releases
    prod_spams = int(np.prod(spams) * num_spam)
    prod_hams = int(np.prod(hams) * num_ham)
    return prod_spams/(prod_spams + prod_hams)
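
Read as a formula, the function above multiplies one factor per known word for each class. The notation is introduced here for illustration: $n_s(w)$ and $n_h(w)$ are the stored spam and ham counts of word $w$, and $N_s$, $N_h$, and $N$ are the numbers of spam emails, ham emails, and all emails. For an email with known words $w_1, \dots, w_k$, the function computes, before the integer conversion:

$$
S = N_s \prod_{i=1}^{k} \frac{n_s(w_i)}{N_s}\,N,
\qquad
H = N_h \prod_{i=1}^{k} \frac{n_h(w_i)}{N_h}\,N,
\qquad
\text{result} = \frac{S}{S + H}
$$

Both $S$ and $H$ contain the same factor $N^{k}$, so dividing each by $N^{k+1}$ leaves the usual naive Bayes form, with $\hat{P}(w \mid \text{spam}) = n_s(w)/N_s$ and $\hat{P}(w \mid \text{ham}) = n_h(w)/N_h$:

$$
\frac{S}{S + H}
= \frac{P(\text{spam}) \prod_{i} \hat{P}(w_i \mid \text{spam})}
{P(\text{spam}) \prod_{i} \hat{P}(w_i \mid \text{spam}) + P(\text{ham}) \prod_{i} \hat{P}(w_i \mid \text{ham})}
$$

The "naive" part is treating the words as independent given the label, which is what allows the per-word factors to be multiplied.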

In [32]:
predict_naive_bayes('lottery sale') # probably spam 96%

0.9638144992048691

In [33]:
predict_naive_bayes('Hi mom how are you?') # spam unlikely 14%

0.13743544730963977

In [34]:
predict_naive_bayes('enter the lottery to win three million dollars') # probably spam 99%

0.9995234218677428

In [35]:
predict_naive_bayes('buy cheap lottery easy money now') # spam 99%

0.999973472265966
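
For long emails, a product of many per-word factors can grow or shrink beyond what a float holds comfortably. A common alternative is to add logarithms instead of multiplying. The sketch below is one way to do that, not this notebook's method: the function name `predict_naive_bayes_log` is hypothetical, and the small dictionary only copies the `lottery` and `sale` counts and the class totals printed earlier.

```python
import math

# Counts copied from the notebook's printed output for two words
toy_model = {
    'lottery': {'spam': 9, 'ham': 1},
    'sale': {'spam': 39, 'ham': 42},
}
NUM_SPAM, NUM_HAM = 1368, 4360

def predict_naive_bayes_log(email, model, num_spam, num_ham):
    """Naive Bayes spam probability computed with sums of logarithms."""
    total = num_spam + num_ham
    log_spam = math.log(num_spam / total)  # log prior for spam
    log_ham = math.log(num_ham / total)    # log prior for ham
    for word in set(email.lower().split()):
        if word in model:                  # unseen words are skipped
            log_spam += math.log(model[word]['spam'] / num_spam)
            log_ham += math.log(model[word]['ham'] / num_ham)
    diff = log_spam - log_ham              # log-odds of spam versus ham
    # Numerically stable logistic function of the log-odds
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    odds = math.exp(diff)
    return odds / (1.0 + odds)

print(round(predict_naive_bayes_log('lottery sale', toy_model, NUM_SPAM, NUM_HAM), 4))
print(round(predict_naive_bayes_log('asdfghjkl', toy_model, NUM_SPAM, NUM_HAM), 4))
```

When none of the words are known, only the priors remain, so the result falls back to the prior probability of spam.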

In [36]:
list_email1 = [
    "By joining, you've agreed to our Terms of Use and Privacy Statement.",
    "Netflix will automatically continue your membership and charge the membership fee",
    " to your payment method on a monthly basis until you cancel.",
    "To cancel, go to Your Account and click on Cancel membership.",
    "There are no refunds or credits for partial months."
]

# Calculate probability for each email in list_email1
for email in list_email1:
    probability_spam = predict_naive_bayes(email)
    print("Probability email is spam:", probability_spam)

Probability email is spam: 0.922827245754548
Probability email is spam: 0.9341463637917197
Probability email is spam: 0.15453530205676974
Probability email is spam: 0.7408112065416577
Probability email is spam: 0.0550578729994848


In [37]:
# testing emails that include words not in the dictionary (e.g. 'asdfghjkl')
list_email2 = [
    "You can review the submission details using the link below, or", 
    "can reply to this comment by responding to this message.",
    "When allowed, if you need to include an attachment,",
    "please log in to Canvas and reply to the submission.", 
    "asdfghjkl"
]

# Calculate probability for each email in list_email2
for email in list_email2:
    probability_spam = predict_naive_bayes(email)
    print("Probability email is spam:", probability_spam)


Probability email is spam: 0.12542974139861737
Probability email is spam: 0.00847912113320933
Probability email is spam: 0.04468788165156035
Probability email is spam: 0.1314133675366208
Probability email is spam: 0.2388268156424581


## Step 6: Do the results make sense?

In [38]:
print("privacy", model['privacy'])
print("membership", model['membership'])
print("payment", model['payment'])
print("cancel", model['cancel'])
print("refunds", model['refunds'])
print("review", model['review'])
print("comment", model['comment'])
print("attachment", model['attachment'])
print("reply", model['reply'])
print("please", model['please'])
print()
print("Bayes_Predict: privacy", predict_bayes('privacy'))
print("Bayes_Predict: membership", predict_bayes('membership'))
print("Bayes_Predict: payment", predict_bayes('payment'))
print("Bayes_Predict: cancel", predict_bayes('cancel'))
print("Bayes_Predict: refunds", predict_bayes('refunds'))
print("Bayes_Predict: review", predict_bayes('review'))
print("Bayes_Predict: comment", predict_bayes('comment'))
print("Bayes_Predict: attachment", predict_bayes('attachment'))
print("Bayes_Predict: reply", predict_bayes('reply'))
print("Bayes_Predict: please", predict_bayes('please'))

privacy {'spam': 50, 'ham': 3}
membership {'spam': 31, 'ham': 25}
payment {'spam': 41, 'ham': 62}
cancel {'spam': 8, 'ham': 38}
refunds {'spam': 1, 'ham': 6}
review {'spam': 47, 'ham': 422}
comment {'spam': 2, 'ham': 40}
attachment {'spam': 3, 'ham': 55}
reply {'spam': 135, 'ham': 174}
please {'spam': 350, 'ham': 2460}

Bayes_Predict: privacy 0.9433962264150944
Bayes_Predict: membership 0.5535714285714286
Bayes_Predict: payment 0.39805825242718446
Bayes_Predict: cancel 0.17391304347826086
Bayes_Predict: refunds 0.14285714285714285
Bayes_Predict: review 0.10021321961620469
Bayes_Predict: comment 0.047619047619047616
Bayes_Predict: attachment 0.05172413793103448
Bayes_Predict: reply 0.4368932038834951
Bayes_Predict: please 0.12455516014234876


How accurate do you feel the naive Bayes model is at predicting spam vs. ham? What could improve the accuracy?

Naive Bayes is effective at classifying text data, but its accuracy varies with the specific dataset and its characteristics. Some aspects that could improve the accuracy:

1. **Quality and Quantity of Training Data**: The performance of any machine learning model, including Naive Bayes, heavily relies on the quality and quantity of the training data. A larger and more diverse dataset that accurately represents the characteristics of both spam and ham emails can improve the model's accuracy.
2. **Feature Selection and Engineering**: Choosing relevant features or creating new ones can significantly impact the model's accuracy. In the case of email classification, features such as word frequencies, presence of specific keywords, or structural information like email headers and metadata can be valuable in distinguishing between spam and ham.
3. **Handling Imbalanced Data**: In spam classification tasks, the number of spam emails is often significantly smaller than the number of ham emails, resulting in imbalanced data. Techniques such as oversampling the minority class (spam) or undersampling the majority class (ham) can help address this issue and improve model accuracy.
4. **Text Preprocessing**: Properly preprocessing the text data can enhance the accuracy of the Naive Bayes model. Techniques such as removing stop words, handling special characters, and performing lowercase normalization can help improve the quality of the text representation.
5. **Model Selection and Evaluation**: Although Naive Bayes is a good choice for text classification, it's worth exploring other algorithms as well. Different models may exhibit varying accuracies depending on the specific dataset. Evaluating multiple algorithms and selecting the one that performs best on the given data can improve accuracy.
6. **Hyperparameter Tuning**: Naive Bayes has a few hyperparameters, such as the smoothing parameter for handling zero probabilities. Tuning these hyperparameters using techniques like cross-validation can help optimize the model's performance.
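
The points about evaluation and smoothing can be made concrete by holding out some emails and measuring how often the classifier is right. The sketch below uses made-up toy emails rather than `emails.csv`. The names `train`, `predict`, `evaluate`, the `alpha` parameter, and the 0.5 threshold are assumptions introduced for illustration. The counting follows the same convention as the `model` dictionary above, with `alpha` as the starting count.

```python
def train(rows, alpha=1):
    """Return (model, num_spam, num_ham); alpha is the starting (smoothing) count."""
    model, num_spam, num_ham = {}, 0, 0
    for text, is_spam in rows:
        if is_spam:
            num_spam += 1
        else:
            num_ham += 1
        for word in set(text.lower().split()):
            counts = model.setdefault(word, {'spam': alpha, 'ham': alpha})
            counts['spam' if is_spam else 'ham'] += 1
    return model, num_spam, num_ham

def predict(text, model, num_spam, num_ham):
    """Spam probability: class count times one factor per known word."""
    spam_score, ham_score = float(num_spam), float(num_ham)
    for word in set(text.lower().split()):
        if word in model:
            spam_score *= model[word]['spam'] / num_spam
            ham_score *= model[word]['ham'] / num_ham
    return spam_score / (spam_score + ham_score)

def evaluate(rows, model, num_spam, num_ham, threshold=0.5):
    """Return (accuracy, precision, recall) for the spam class."""
    tp = fp = tn = fn = 0
    for text, is_spam in rows:
        predicted_spam = predict(text, model, num_spam, num_ham) >= threshold
        if predicted_spam and is_spam:
            tp += 1
        elif predicted_spam and not is_spam:
            fp += 1
        elif not predicted_spam and is_spam:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(rows)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical toy data; 1 = spam, 0 = ham
toy_train = [("win cheap lottery money", 1), ("cheap money now", 1),
             ("meeting at noon", 0), ("see you at the meeting", 0),
             ("lunch at noon tomorrow", 0)]
toy_test = [("cheap lottery", 1), ("meeting tomorrow", 0)]

toy_nb, n_spam, n_ham = train(toy_train, alpha=1)
print(evaluate(toy_test, toy_nb, n_spam, n_ham))
```

On real data, the same `evaluate` call could be repeated for several `alpha` values on a validation split to see which smoothing setting works best.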
