# <center>CSC 723 Final Project</center>
Project Code 1: Naive Bayes

&emsp; Version 1.0<br> 
&emsp; March 2023

&emsp; CSC 723<br>
&emsp; Machine Learning for Cyber Security<br>
&emsp; Dakota State University

Robert Chavez<br>
Kiera Conway

--------

## Import Data
### Libraries

In [None]:
import numpy as np   # array mathematical operations library
import pandas as pd  # data analysis library

####  <a id=install>Installing chardet</a>

<div class="alert alert-block alert-info">
    <b>[1]:</b>

Unless chardet was previously installed, the `import chardet` cell will raise a `ModuleNotFoundError`. If that error occurs, uncomment and run the following cell to install the module. If it is already installed, skip ahead to [Importing chardet](#import).
</div>

In [None]:
# !pip install chardet

#### <a id=import>Importing chardet</a>

In [None]:
import chardet       # universal character encoding detector

<div class="alert alert-block alert-info">
    <b>[2]:</b>

Possible Warnings and Errors:
    
* `ModuleNotFoundError: No module named 'chardet'`
    * Fix: Go to [Installing chardet](#install), uncomment cell, and run it. 
    * Note: It will be using sudo permissions to install a library. For information about the library, please visit [the official site](https://pypi.org/project/chardet/).


* `ERROR: Could not find a version that satisfies the requirement chardet (from versions: none)`
    * Fix: Go to Notebook Settings (right-hand side), navigate to Notebook Options, and ensure internet access is enabled.


* `ERROR: No matching distribution found for chardet`
    * Fix: Go to Notebook Settings (right-hand side), navigate to Notebook Options, and ensure internet access is enabled.


* `WARNING: There was an error checking the latest version of pip.`
    * Fix: Go to Notebook Settings (right-hand side), navigate to Notebook Options, and ensure internet access is enabled.
</div>

### Data Set

In [None]:
# Set File Path
file_path = '/kaggle/input/spam-or-ham/SMSCollection.csv'

# Obtain Data from File Path
sms_data = pd.read_csv(file_path)

## Review Dataset
### Dataset Information

In [None]:
# General Information
sms_data.info()

In [None]:
# Check for null values
sms_data.isnull().sum()

In [None]:
# View first and last 5 Observations
sms_data

In [None]:
print(sms_data.Class[42])      # view variable 1 (spam/ham) of the message at index 42
print(sms_data.sms[42])        # view variable 2 (message text) of the message at index 42

In [None]:
# Statistical Information
sms_data.describe()

### Analyze Information

#### .describe() Key

| Title  | Definition                                 | 
| ------ | --------                                   |
| Count  | Number of non-null entries in each feature | 
| Unique | The number of possible unique observations |
| Top    | The most frequent value                    | 
| Freq   | The frequency of the top value             | 

#### Data Analysis

The features of this dataset are 'Class' and 'sms'. 'Class' indicates whether the message is `spam` or a valid SMS message (`ham`), and 'sms' contains the corresponding message text.

The count values above show us there are 5572 non-null data entries in each feature. As each feature contains the same count value, we can conclude there are no missing data points that we need to trim. 

The unique value of 2 under the Class feature verifies all messages are either `spam` or `ham`, and contain no erroneous values. Since the sms feature contains a unique value of 5169, which is less than 5572, we can conclude that some messages are identical.

The top and freq values under Class show us that most messages are categorized as `ham`, with 4825 occurrences. We can therefore determine that the remaining 747 messages are categorized as `spam`. The top and freq values under sms confirm the previous conclusion that some messages are identical; we can see that the most frequent message, occurring 30 times, contains the text "Sorry, I'll call later".

Using this information, we can identify the format of our data, determine its completeness, and verify the values contained are expected.
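
The counts quoted in this analysis can be cross-checked with plain arithmetic. The sketch below hard-codes the figures from the `.describe()` output discussed above rather than reading the dataset, and its variable names are illustrative only.

```python
# Figures copied from the .describe() output discussed above
total_messages = 5572      # count (same for both features)
unique_messages = 5169     # unique value under 'sms'
ham_messages = 4825        # freq of the top Class value ('ham')

spam_messages = total_messages - ham_messages        # everything that is not ham
repeated_rows = total_messages - unique_messages     # rows that repeat an earlier message
spam_share = spam_messages / total_messages

print(f"spam messages: {spam_messages}")
print(f"rows repeating an earlier message: {repeated_rows}")
print(f"spam share: {spam_share:.1%}")
```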

## Modify Data
### Create Column: Numerical Representation for Spam/Ham

In [None]:
# View first 5 Observations
print("Before Modification:\n") 
sms_data.head()

In [None]:
# Create New Column
sms_data['Class_num'] = sms_data.Class.map({'ham':0, 'spam':1})   # Ham becomes 0, Spam becomes 1

# View first 5 Observations
print("After Modification:\n") 
sms_data.head()

In [None]:
"""# Create Mapping Dictionary
scale_Class = {"spam":1, "ham": 0,}

# Execute In-Place Mapping
sms_data['Class'].replace(scale_Class, inplace=True)

# View New Data
sms_data"""

### Create Column: Message Lengths

In [None]:
# View first 5 Observations
print("Before Modification:\n") 
sms_data.head()

In [None]:
# Create New Column
sms_data['sms_len'] = sms_data.sms.apply(len)   #apply length counter to each message 

# View first 5 Observations
print("After Modification:\n") 
sms_data.head()

## Graph Data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns              #statistical data visualization

sns.set_style('whitegrid')          #set visual style
plt.style.use('fivethirtyeight')   #set plot visual style
plt.figure(figsize=(12,8))         #set plot size

# Plot Ham/ Spam Message Length as Histogram
sms_data[sms_data.Class=='ham'].sms_len.plot(bins=35, kind='hist', color='blue', label='Ham Messages', alpha=0.5)
sms_data[sms_data.Class=='spam'].sms_len.plot(kind='hist', color='red', label='Spam Messages', alpha=0.5)

plt.legend()
plt.xlabel("Message Length")

## Analyze Data
### Complete Data Set

In [None]:
sms_data.describe()

#### Analysis


* Data set includes Ham (0) and Spam (1) combined into Class_num
* A Class_num mean of 0.134 means that 13.4% of data is spam
    * Inversely, 86.6% is Ham
* SMS messages average 80.48 characters
* The shortest message length is 2 characters
* The longest message length is 910 characters
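
The reason the Class_num mean can be read as the spam percentage is that the mean of a 0/1 column equals the fraction of ones. A minimal sketch on made-up labels (not the real data) shows the two quantities agree:

```python
from statistics import mean

# Toy labels: 0 = ham, 1 = spam (illustrative, not drawn from the dataset)
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

spam_fraction = labels.count(1) / len(labels)   # share of entries equal to 1
label_mean = mean(labels)                       # sum of labels divided by their count

print(label_mean, spam_fraction)
```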


### Ham Data Set

In [None]:
# Analyze data labeled 'ham'
sms_data[sms_data.Class=='ham'].describe()

#### Analysis


* This data set includes Ham (0) only
* There are 4,825 ham messages
* Remember, ham is 0, so all other stats in Class_num will == 0
* Ham messages average 71.48 characters
* The shortest ham message length is 2 characters
* The longest ham message length is 910 characters

### Spam Data Set

In [None]:
# Analyze data labeled 'spam'
sms_data[sms_data.Class=='spam'].describe()

#### Analysis


* This data set includes Spam (1) only
* There are 747 spam messages
* Remember, spam is 1, so all other stats in Class_num will == 1
    * except standard deviation, which is 0, as there is no deviation between 1 and 1
* Spam messages average 138.67 characters
* The shortest spam message length is 13 characters
* The longest spam message length is 223 characters

## Prepare Data using Natural Language Processing
### Create Function to Clean up Messages

In [None]:
import string
from nltk.corpus import stopwords    #requires the NLTK 'stopwords' corpus; if it is missing, run nltk.download('stopwords') once

# List of common abbreviations
abrv = ['rofl', 'stfu', 'icymi', 'tldr', 'ok', 'tmi', 'afaik', 'lmk', 'nvm', 'ftw', 'byob', 'rt', 'bogo', 'jk', 'jw', 'im', 'pm', 'ig', 'tgif', 'bh', 'tbf', 'rn', 'fubar', 'brb', 'iso', 'brt', 'btw', 'ftfy', 'gg', 'bfd', 'irl', 'dae', 'lol', 'smh', 'ngl', 'bts', 'ikr', 'ttyl', 'hmu', 'fwiw', 'imo', 'wyd', 'imho', 'idk', 'idc', 'idgaf', 'nbd', 'tba', 'tbd', 'afk', 'abt', 'iykyk', 'b4', 'bc', 'jic', 'fomo', 'snafu', 'gtg', 'g2g', 'h8', 'lmao', 'iykwim', 'myob', 'pov', 'tlc', 'bd', 'w/e', 'wtf', 'wysiwyg', 'fwif', 'tw', 'eod', 'faq', 'aka', 'asap', 'diy', 'lmgtfy', 'np', 'n/a', 'ooo', 'ia', 'cob', 'fyi', 'nsfw', 'wfh', 'omw', 'wdyt', 'wygam', 'smp', 'dm', 'fb', 'ig', 'li', 'yt', 'ff', 'im', 'pm', 'op', 'qotd', 'ootd', 'rt', 'tbt', 'til', 'ama', 'eli5', 'fbf', 'mfw', 'hmu', 'ily', 'mcm', 'wcw', 'bf', 'gf', 'ae', 'lysm', 'pda', 'ltr', 'dtr', 'xoxo', 'otp', 'loml']

def Process_Tweet(sms):
    
    STOPWORDS = stopwords.words('english')+abrv                           #nltk English stopwords (SW) plus the abbreviations in abrv
    
    nopunc = [char for char in sms if char not in string.punctuation]     #remove punctuation
    
    nopunc = ''.join(nopunc)                                              #join every item in list using '' as a separator
    
    nopunc = ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])    #remove Stopwords

    return nopunc

#### Code Breakdown
    
`nopunc = [char for char in sms if char not in string.punctuation]` <br>
" for every character in the message, <br>
if the character is not in the list of punctuation, <br>
save that char into the list 'nopunc' "
* removes punctuation
* Essentially, nopunc is the same as sms, just without the punctuation


`nopunc = ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])` <br>
" for every word in the 'nopunc' string (split on whitespace),  <br>
if the word [changed to lowercase] is not in 'STOPWORDS',  <br>
keep it, then join the kept words with spaces and save the result back into 'nopunc'"

* remove Stopwords
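
The two steps described in this breakdown can be tried in isolation. The sketch below mirrors the same logic but swaps in a tiny hand-written stopword set (`DEMO_STOPWORDS`, an assumption made so the example runs without NLTK); the notebook itself uses NLTK's English list plus `abrv`. The function name `clean_message` is also illustrative.

```python
import string

# Tiny stand-in stopword set (assumption); not the NLTK list used in the notebook
DEMO_STOPWORDS = {"i", "a", "the", "is", "to", "you", "lol", "btw"}

def clean_message(text, stopwords=DEMO_STOPWORDS):
    """Remove punctuation, then drop stopwords (compared in lowercase)."""
    # Step 1: keep only the characters that are not punctuation
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: split on whitespace and keep the words that are not stopwords
    kept = [word for word in no_punct.split() if word.lower() not in stopwords]
    return " ".join(kept)

print(clean_message("LOL, the prize is WAITING... claim it now!"))
print(clean_message("I'll call you"))
```

Two details are worth noticing: kept words retain their original case, and stripping the apostrophe turns "I'll" into "Ill", which no longer matches a stopword.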


### Create Column: Save Cleaned Messages

In [None]:
# View first 5 Observations
print("Before Modification:\n") 
sms_data.head()

In [None]:
# Create New Column
sms_data['sms_clean'] = sms_data.sms.apply(Process_Tweet)   #send each message to function 'Process_Tweet'

# View first 5 Observations
print("After Modification:\n") 
sms_data.head()

### Extract Words
#### Ham Messages

In [None]:
ham_words = sms_data[sms_data.Class_num==0].sms_clean.apply(lambda x: [word.lower() for word in x.split()])    #Save messages as lowercase list

'''
for each ham message, 
split words into a list, 
convert to lowercase, 
and save to 'ham_words'
'''

ham_words    #remaining words in ham messages

#### Spam Messages

In [None]:
spam_words = sms_data[sms_data.Class_num==1].sms_clean.apply(lambda x: [word.lower() for word in x.split()])    #Save messages as lowercase list

'''
for each spam message, 
split words into a list, 
convert to lowercase, 
and save to 'spam_words'
'''

spam_words    #remaining words in spam messages

## Create Frequency Tables
### Ham Word Frequencies

In [None]:
from collections import Counter
ham_word_count = Counter()

for message_words in ham_words:            #each item is one message's list of words
    ham_word_count.update(message_words)   #add that message's words to the running counts
    
print(ham_word_count.most_common(50))      #print 50 most common words

#### Analysis
<i>This is a good place to check for additional stopwords.<br>
    For example, 2 of the top 3 most common words here are 'U' and '2' - these would be great additions to the stopword list.</i>
    
<i>If unsure about adding a specific word to the stopwords list, ask whether the word adds any context - if not, it would likely work well as a stopword.<br> You can also check the most common words in spam (shown below) to see whether that word appears there as well.</i>
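
One way to make that cross-check concrete is to intersect the most frequent words of each class: a word that ranks highly in both carries little signal about the class. The sketch below uses made-up token lists and an arbitrary cut-off of two words per class; `top_words` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Toy tokenised messages (illustrative, not drawn from the dataset)
ham_demo = [["u", "home", "2"], ["call", "u", "2"], ["u", "late"]]
spam_demo = [["free", "u", "prize"], ["free", "u", "call"], ["free", "claim"]]

def top_words(messages, n):
    """Return the set of the n most frequent words across all messages."""
    counts = Counter()
    for words in messages:
        counts.update(words)
    return {word for word, _ in counts.most_common(n)}

# Words that are frequent in both classes are candidate stopwords
shared = top_words(ham_demo, 2) & top_words(spam_demo, 2)
print(sorted(shared))
```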

### Spam Word Frequencies

In [None]:
spam_word_count = Counter()

for message_words in spam_words:            #each item is one message's list of words
    spam_word_count.update(message_words)   #add that message's words to the running counts
    
print(spam_word_count.most_common(50))      #print 50 most common words

## Train Naive Bayes Classifier

In [None]:
X = sms_data.sms_clean    #define feature set
y = sms_data.Class_num    #define dependent variable

print(X.shape)       #print shape (Observations/ Rows, Features/ Columns)
print(y.shape)       #print shape (Observations/ Rows, Features/ Columns)

### Split Training and Testing Data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42)

In [None]:
# Verify Training/ Testing Data
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

#### Verification Analysis

* X_train: 4179 observations, 1 feature
* X_test: 1393 observations, 1 feature
* y_train: 4179 observations, 1 feature
* y_test: 1393 observations, 1 feature

Since the training and testing observation counts match between X and y, and each shape lists only a row count (X and y are one-dimensional, so a missing column count means 1 feature), the training and testing data were split correctly.
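
The split sizes follow from `train_test_split` holding out 25% of the rows by default when neither `test_size` nor `train_size` is passed. The arithmetic below reproduces the reported counts without calling scikit-learn; it assumes the test count is the held-out fraction rounded up, with the remainder going to training.

```python
import math

n_rows = 5572          # total observations reported earlier
test_fraction = 0.25   # default hold-out share when no sizes are given

n_test = math.ceil(n_rows * test_fraction)   # rows held out for testing
n_train = n_rows - n_test                    # everything else is used for training

print(n_train, n_test)
```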

### Obtain Count Vectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer    #Convert a collection of text documents to a matrix of token counts.

# Fit Data
vect = CountVectorizer()
vect.fit(X_train)

#Transform Data
X_train_dtm = vect.transform(X_train)    #transform train data, dtm = document-term matrix
X_test_dtm = vect.transform(X_test)      #transform test data, dtm = document-term matrix



#### Error Check

In [None]:
#Verify Vectorizers Completed
print(X_train_dtm.toarray())             #print train vectorizer
print("\n\n")
print(X_test_dtm.toarray())              #print test vectorizer

In [None]:
X_test_dtm

In [None]:
X_train_dtm

#### Transformation Analysis

<i>I can verify the data transformation was successful, as both X_train_dtm and X_test_dtm<br>
have the same number of columns [rows x columns]</i>
    
* X_test_dtm
    * 1393 x **8011**
* X_train_dtm
    * 4179 x **8011**
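
The column counts match because the vocabulary is learned from the training messages only and then reused for the test messages. The sketch below is a simplified stand-in for that idea, not scikit-learn's `CountVectorizer` (it tokenises on whitespace only, and `fit_vocabulary` and `transform` are hypothetical helpers). Words that appear only in the test data have no column and are dropped.

```python
def fit_vocabulary(documents):
    """Map each distinct lowercase token in the training documents to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Turn documents into rows of token counts; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train_docs = ["free prize call now", "call me later", "free free entry"]
test_docs = ["call now for free cash"]   # 'for' and 'cash' never appear in training

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)

print(len(train_matrix), "x", len(train_matrix[0]))
print(len(test_matrix), "x", len(test_matrix[0]))
```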

## Create Naive Bayes Model

In [None]:
from sklearn.naive_bayes import MultinomialNB 

nb = MultinomialNB()          #create instance
nb.fit(X_train_dtm, y_train)  #fit model
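
For intuition about what a multinomial Naive Bayes model estimates, here is a hand-rolled sketch with add-one (Laplace) smoothing on toy messages. It is a simplified illustration under stated assumptions, not scikit-learn's implementation; `train_nb`, `predict_nb`, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    vocab = {w for doc in docs for w in doc.split()}
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)   # smoothed denominator
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

docs = ["free prize claim now", "win free cash", "see you later", "call me later tonight"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

log_prior, log_like = train_nb(docs, labels)
print(predict_nb("free cash prize", log_prior, log_like))
print(predict_nb("see you tonight", log_prior, log_like))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.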

## Make Predictions

In [None]:
y_pred_class = nb.predict(X_test_dtm)     #make prediction for entire testing set

y_pred_class[:15]

### Prediction Analysis [1 of 2]

Reminder: 0 is Ham, 1 is Spam

According to this prediction, the fifteenth message (at array location 14) should be spam. We can check this by printing the fifteenth message:

In [None]:
X_test[14:15]

### Prediction Analysis [2 of 2]

Judging from this short snippet, it appears that the prediction was correct - this message is Spam.

## Check Accuracy

In [None]:
from sklearn import metrics

metrics.accuracy_score(y_test, y_pred_class)

In [None]:
metrics.confusion_matrix(y_test, y_pred_class)

### Confusion Matrix Analysis

|                 | Predicted Value<br>[0] | Predicted Value<br>[1] |
| --------------- | ---------------------- | ---------------------- |
| True Value [0]  | Prediction Correct     | Prediction Incorrect   |
| True Value [1]  | Prediction Incorrect   | Prediction Correct     |

| P, T | 0           | 1           |
| ---- | ----------- | ----------- |
| 0    | <b>0, 0</b> | 1, 0        |
| 1    | 0, 1        | <b>1, 1</b> |

Therefore, the stats above state the following:

| | |
| ---------------------------------------------------- | -------------------------------------------------- |
| 1200 Predicted HAM Correctly                         | 7 Predicted SPAM incorrectly,<br> was actually HAM |
| 14 Predicted HAM incorrectly,<br> was actually SPAM  | 172 Predicted SPAM Correctly                       |
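
The four counts in the table are enough to recompute the accuracy score and to derive per-class measures for spam. The sketch below hard-codes the counts from the table; the variable names are illustrative, and precision and recall are extra measures that the notebook does not report.

```python
# Counts taken from the confusion matrix table above
true_ham, false_spam = 1200, 7     # actually ham: predicted ham / predicted spam
false_ham, true_spam = 14, 172     # actually spam: predicted ham / predicted spam

total = true_ham + false_spam + false_ham + true_spam
accuracy = (true_ham + true_spam) / total
spam_precision = true_spam / (true_spam + false_spam)   # of messages flagged as spam, the share that were spam
spam_recall = true_spam / (true_spam + false_ham)       # of the actual spam, the share that was caught

print(f"total={total} accuracy={accuracy:.4f}")
print(f"precision={spam_precision:.4f} recall={spam_recall:.4f}")
```

The total matches the 1393 test observations, which is a quick check that the four cells were read correctly.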


### Verify Specific Predictions

In [None]:
X_test[y_pred_class > y_test]   #view all predictions of SPAM (1) where it was actually HAM (0)

In [None]:
X_test[y_pred_class < y_test]   #view all predictions of HAM (0) where it was actually SPAM (1)
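
The comparisons work because the labels are 0 and 1: `prediction > truth` can only hold when the prediction is 1 (spam) and the truth is 0 (ham), and `prediction < truth` only in the opposite case. A minimal sketch with plain lists and made-up messages shows the same selection without pandas:

```python
# Toy data (illustrative only): 0 = ham, 1 = spam
messages = ["see you soon", "win cash now", "call me", "free entry today"]
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 0, 0]

# Predicted spam but actually ham
false_positives = [m for m, p, t in zip(messages, y_pred, y_true) if p > t]
# Predicted ham but actually spam
false_negatives = [m for m, p, t in zip(messages, y_pred, y_true) if p < t]

print(false_positives)
print(false_negatives)
```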