# Naive Bayes Spam Classification

## Data Description

The dataset comes from the UCI Machine Learning Repository and contains YouTube comments from several very popular music videos. Each comment is labeled as either spam or ham (legitimate), and the data is used here to train a Naive Bayes classifier for YouTube comment spam.

In [1]:
# Import modules
# For data manipulation
import pandas as pd
# For matrix operations
import numpy as np
# For regular expression (text cleaning)
import re

In [2]:
# Loading data set
data_comments = pd.read_csv('YoutubeCommentsSpam.csv')

# Create column labels
data_comments.columns = ["content","label"]
data_comments.head()

Unnamed: 0,content,label
0,+447935454150 lovely girl talk to me xxx,1
1,I always end up coming back to this song<br />,0
2,"my sister just received over 6,500 new <a rel=...",1
3,Cool,0
4,Hello I am from Palastine,1


In [3]:
# Show spam comments in data
print(data_comments["content"][data_comments["label"] == 1])

0                +447935454150 lovely girl talk to me xxx
2       my sister just received over 6,500 new <a rel=...
4                               Hello I am from Palastine
6       Go check out my rapping video called Four Whee...
8                           Aslamu Lykum... From Pakistan
10                            Help me get 50 subs please 
12      Alright ladies, if you like this song, then ch...
15      <a href="https://www.facebook.com/groups/10087...
16                  Take a look at this video on YouTube:
17                 Check out our Channel for nice Beats!!
19                    Check out this playlist on YouTube:
21                                            like please
24      I shared my first song &quot;I Want You&quot;,...
25      Come and check out my music!Im spamming on loa...
26                    Check out this playlist on YouTube:
27      HUH HYUCK HYUCK IM SPECIAL WHO S WATCHING THIS...
30      Check out this video on YouTube:<br /><br />Lo...
33            

In [4]:
# Add another column with corresponding comment length
data_comments['length'] = data_comments['content'].map(lambda text: len(text))

#Number of comments
print("Number of comments ", len(data_comments))

Number of comments  1959


Randomly select about $75\%$ of the data for training and the remaining $25\%$ for testing.

In [5]:
# Set seed so we get same random allocation on each run of code
np.random.seed(0)

# Add a column of random numbers drawn from U[0,1]
data_comments["uniform"] = np.random.uniform(0,1,len(data_comments.index)) 

# About 75% of these numbers should be less than 0.75
data_comments_train = data_comments[data_comments["uniform"] < 0.75]

# The remaining rows (about 25%) have values of at least 0.75
data_comments_test = data_comments[data_comments["uniform"] >= 0.75]

# Check that both training and test data have both spam and ham comments
data_comments_train["label"].describe()

count    1443.000000
mean        0.511435
std         0.500043
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: label, dtype: float64

In [6]:
# Test data summary statistics
data_comments_test["label"].describe()

count    516.000000
mean       0.515504
std        0.500245
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: label, dtype: float64

In [7]:
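# Join all the comments into one string
# Note: joining with "" fuses the last word of each comment with the first word
# of the next one; " ".join(...) would keep them apart
training_list_words = "".join(data_comments_train.iloc[:,0].values)

# Split the joined string on spaces and keep the set of unique words
train_unique_words = set(training_list_words.split(' '))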

# Number of unique words in training 
vocab_size_train = len(train_unique_words)

# Description of comments in training data
print('Unique words in training data: %s' % vocab_size_train)
print('First 5 words in our unique set of words: \n % s' % list(train_unique_words)[1:6])

Unique words in training data: 5704
First 5 words in our unique set of words: 
 ['thundering', 'someone', 'Irish', 'your', 'much!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!check']


Currently "now!!" and "now!!!!", as well as "DOES", "DoEs", and "does", are all treated as distinct words. For spam classification, it is probably better to normalize the text slightly so that such variants count as the same word, which should help accuracy.

In [8]:
# Only keep letters and numbers
train_unique_words = [re.sub(r'[^a-zA-Z0-9]','', words) for words in train_unique_words]

# Convert to lower case and get unique set of words
train_unique_words = set([words.lower() for words in train_unique_words])

# Number of unique words in training 
vocab_size_train = len(train_unique_words)

# Description of summarized comments in training data
print('Unique words in processed training data: %s' % vocab_size_train)
print('First 5 words in our processed unique set of words: \n % s' % list(train_unique_words)[1:6])

Unique words in processed training data: 4063
First 5 words in our processed unique set of words: 
 ['sooooooooooooong', 'thundering', 'see', 'wish', 'accidental']
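Note that the cleaning above is applied only to the vocabulary set; the counting and classification functions later in this notebook split raw comments on spaces. A minimal sketch of how the same normalization could be applied to each comment is shown below. The helper name `tokenize` is hypothetical and is not part of the original notebook.

```python
import re

def tokenize(comment):
    """Split a comment into lowercase alphanumeric tokens (hypothetical helper)."""
    # Strip everything except letters and digits from each whitespace-separated piece
    cleaned = (re.sub(r'[^a-zA-Z0-9]', '', piece).lower() for piece in comment.split())
    # Drop pieces that became empty (e.g. "!!!" or stray spaces)
    return [token for token in cleaned if token]

# "now!!" and "now!!!!" collapse to "now"; "DoEs" becomes "does"
print(tokenize("Check it out now!! DoEs it work now!!!!"))
```

Using one tokenizer for both the vocabulary and the per-comment counts would keep the two consistent, at the cost of changing the counts reported in this notebook.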


## Naive Bayes for Spam Classification

In [9]:
# Equation 1 (Bayes' rule)
# P(spam|vocab) = P(vocab|spam) * P(spam) / P(vocab)
# P(ham|vocab)  = P(vocab|ham) * P(ham) / P(vocab)

# Equation 2 (naive independence assumption)
# P(vocab|spam) = P(word1|spam) * P(word2|spam) * P(word3|spam) ...
# P(vocab|ham)  = P(word1|ham) * P(word2|ham) * P(word3|ham) ...

# Equation 3 (estimate from counts)
# P(word1|spam) = (count of word1 in spam) / (total count of words in spam)
# P(word1|ham)  = (count of word1 in ham) / (total count of words in ham)

# Equation 4 (estimate with Laplace smoothing)
# P(word1|ham)  = (count of word1 in ham + 1) /
#                 (total count of words in ham + number of distinct words in training data)

# P(word1|spam) = (count of word1 in spam + 1) /
#                 (total count of words in spam + number of distinct words in training data)
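To make equation 4 concrete, here is a small worked example of the Laplace-smoothed estimate. The function name `smoothed_probability` and all counts are made up for illustration and are not taken from the dataset.

```python
def smoothed_probability(word_count, total_words_in_class, vocab_size, alpha=1):
    """Laplace-smoothed estimate of P(word | class), as in equation 4."""
    return (word_count + alpha) / (total_words_in_class + alpha * vocab_size)

# Toy class: 10 words in total, vocabulary of 5 distinct words (illustrative numbers)
toy_counts = [3, 2, 5, 0, 0]

p_seen = smoothed_probability(3, 10, 5)    # (3 + 1) / (10 + 5)
p_unseen = smoothed_probability(0, 10, 5)  # (0 + 1) / (10 + 5), non-zero thanks to smoothing

# The smoothed probabilities over the whole vocabulary still sum to 1
total = sum(smoothed_probability(c, 10, 5) for c in toy_counts)
print(p_seen, p_unseen, total)
```

Without smoothing, a single word never seen in a class would force the whole product in equation 2 to zero.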

In [10]:
# Dictionaries with words as keys and their counts in spam (positive)
# and ham (negative) comments as values
trainPositive = dict()
trainNegative = dict()

# Initialize total word counts for spam and ham
positiveTotal = 0
negativeTotal = 0

# Initialize prior probabilities P(spam) and P(ham)
pSpam = 0.0
pNotSpam = 0.0

# Laplace smoothing
alpha = 1

In [11]:
# Initialize the word counts for every word in the vocabulary
for word in train_unique_words:
    
    # No occurrences counted yet in spam or ham comments
    trainPositive[word] = 0
    trainNegative[word] = 0

In [12]:
# Count how many times each word of a comment appears in spam and ham comments
def processComment(comment,label):
    global positiveTotal
    global negativeTotal
    
    # Split comments into words
    comment = comment.split(' ')
    
    # Go over each word in comment
    for word in comment:
        
        # ham comments
        # Note: split(' ') never yields ' ', so the second condition is always true
        if(label == 0 and word != ' '):
            
            # Increment number of times word appears in ham comments
            trainNegative[word] = trainNegative.get(word,0)+1
            negativeTotal += 1
            
        # spam comments
        elif(label == 1 and word != ' '):
            
            # Increment number of times word appears in spam comments
            trainPositive[word] = trainPositive.get(word,0)+1
            positiveTotal += 1

In [13]:
# Prob(word|spam) and Prob(word|ham)
def conditionalWord(word,label):
    
    # Laplace smoothing parameter
    global alpha
    
    # word in ham comment
    if(label == 0):
        # Compute Prob(word|ham)
        return (trainNegative.get(word,0)+alpha)/float(negativeTotal+alpha*vocab_size_train)
    
    # word in spam comment
    else:
        
        # Compute Prob(word|spam)
        return (trainPositive.get(word,0)+alpha)/float(positiveTotal+alpha*vocab_size_train)

In [14]:
# Prob(comment|spam) or Prob(comment|ham)
def conditionalComment(comment,label):
    
    # Initialize conditional probability
    prob_label_comment = 1.0
    
    # Split comments into list of words
    comment = comment.split(' ')
    
    # Go through all words in comments
    for word in comment:
        
        # Multiply in Prob(word|label) (naive independence assumption)
        prob_label_comment *= conditionalWord(word,label)
    
    return prob_label_comment
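Multiplying many probabilities smaller than 1 can underflow to 0.0 for long comments, in which case both classes score zero and the comparison becomes meaningless. A common remedy is to sum log-probabilities instead. The sketch below is a hypothetical variant, not the notebook's method; the name `log_conditional_comment`, its parameters, and the toy counts are assumptions for illustration.

```python
import math

def log_conditional_comment(words, word_counts, total_words, vocab_size, alpha=1):
    """Sum of log P(word | class) with Laplace smoothing (hypothetical variant)."""
    log_prob = 0.0
    for word in words:
        p = (word_counts.get(word, 0) + alpha) / (total_words + alpha * vocab_size)
        log_prob += math.log(p)
    return log_prob

# Toy counts for one class (illustrative, not from the dataset)
toy_counts = {"check": 4, "video": 2}

# Short comment: exponentiating the log-sum recovers the plain product
log_p = log_conditional_comment(["check", "video", "unseen"], toy_counts,
                                total_words=6, vocab_size=3)

# Long comment: the plain product underflows to 0.0, the log-sum stays finite
long_words = ["unseen"] * 400
plain_product = (1 / 9) ** 400
log_p_long = log_conditional_comment(long_words, toy_counts, total_words=6, vocab_size=3)
print(plain_product, log_p_long)
```

With this variant, classification would compare `log(prior) + log-likelihood` for the two classes rather than the products themselves.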

In [15]:
# Train naive bayes by computing several conditional probabilities in training data
def train():
    
    print('Starting training')
    global pSpam
    global pNotSpam

    # Initialize comment counters
    total = 0
    numNegative = 0
    
    # Go over each comment in training data
    for idx, comment in data_comments_train.iterrows():
        
        # Comment is ham 
        if comment.label == 0:
            
            # Increment ham comment counter
            numNegative += 1
        
        # Increment comment number
        total += 1
        
        # Update dictionary of ham and spam comments
        processComment(comment.content,comment.label)
    
    # Compute prior probabilities, P(spam), P(ham)
    pNotSpam = numNegative/float(total)
    pSpam = (total - numNegative)/float(total)
    
    print('Training is now finished')

In [16]:
# Run naive bayes
train()

Starting training
Training is now finished


In [17]:
# Classify a comment as spam or ham
def classify(comment):
    
    global pSpam
    global pNotSpam
    
    # Compute value proportional to Pr(ham|comment)
    isNegative = pNotSpam * conditionalComment(comment,0)
    
    # Compute value proportional to Pr(spam|comment)
    isPositive = pSpam * conditionalComment(comment,1)
    
    # Output True = spam, False = ham
    return (isNegative < isPositive)

In [18]:
# Initialize spam prediction in test data
prediction_test = []

# Get prediction accuracy on test data
for comment in data_comments_test["content"]:

    # Classify comment 
    prediction_test.append(classify(comment))

# Check accuracy
test_accuracy = np.mean(np.equal(prediction_test, data_comments_test["label"]))

#print prediction_test
print("Proportion of comments classified correctly on test set:", test_accuracy)

Proportion of comments classified correctly on test set: 0.8565891472868217
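Accuracy alone does not show which kind of error the classifier makes. A minimal sketch of a confusion-count helper follows; the name `confusion_counts` and the toy predictions and labels are hypothetical and are not results from this notebook.

```python
def confusion_counts(predictions, labels):
    """Count true/false positives and negatives, treating spam (label 1) as positive."""
    tp = fp = tn = fn = 0
    for predicted_spam, label in zip(predictions, labels):
        if predicted_spam and label == 1:
            tp += 1   # spam correctly flagged
        elif predicted_spam and label == 0:
            fp += 1   # ham wrongly flagged as spam
        elif not predicted_spam and label == 0:
            tn += 1   # ham correctly passed
        else:
            fn += 1   # spam missed
    return tp, fp, tn, fn

# Toy data for illustration only
toy_predictions = [True, True, False, False, True]
toy_labels = [1, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(toy_predictions, toy_labels)
precision = tp / (tp + fp)  # share of flagged comments that are really spam
recall = tp / (tp + fn)     # share of spam comments that were flagged
print(tp, fp, tn, fn, precision, recall)
```

On the real data, the same helper could be called with `prediction_test` and `data_comments_test["label"]`.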


`classify` returns "True" for spam comments and "False" for ham comments.

In [19]:
# spam
classify("Guys check out my new chanell")

True

In [20]:
# spam
classify("I have solved P vs. NP, check my video https://www.youtube.com/watch?v=dQw4w9WgXcQ")

True

In [21]:
# ham
classify("I liked the video")

False

In [22]:
# ham
classify("Its great that this video has so many views")

False

In [23]:
pNotSpam

0.4885654885654886