# Name : Jeremy Liu
# CS 167 Project D

# Introduction to the program
The purpose of this program is to find the best algorithm for detecting spam comments on YouTube based on the words they contain.

## What is spam?
According to dictionary.com (http://www.dictionary.com/browse/spam), spam refers to unimportant and irrelevant messages that are sent to a large number of people on the Internet. On YouTube, spam appears mostly in the comment section. Spam can mislead viewers into clicking malicious links, it clutters the comment section, and it takes up unnecessary space on the YouTube servers. Thus, by detecting certain keywords, we could detect spam and remove it.

## Data
The data is obtained from: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection.
The data was collected using the YouTube API. The comments come from videos on popular YouTube channels, namely Psy, Katy Perry, LMFAO, Eminem, and Shakira. The data came in 5 separate parts, so I combined them myself.
There are 1957 rows and 5 columns of data.
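
Combining the five parts can be sketched as follows. This is a minimal illustration and not necessarily the code used for this project: the five small frames below are made-up stand-ins for the five downloaded parts, which in practice would each be loaded with `pandas.read_csv`.

```python
import pandas

# Hypothetical stand-ins for the five downloaded parts; in practice each
# frame would come from pandas.read_csv on one of the downloaded files.
columns = ["COMMENT_ID", "AUTHOR", "DATE", "CONTENT", "CLASS"]
parts = [
    pandas.DataFrame(
        [[f"id{i}", f"author{i}", "2015-01-01", f"comment {i}", i % 2]],
        columns=columns,
    )
    for i in range(5)
]

# Stack the parts on top of each other and renumber the rows from 0
combined = pandas.concat(parts, ignore_index=True)
print(combined.shape)

# The combined table could then be saved under the file name used later on:
# combined.to_csv("YoutubeComments.csv", index=False)
```

Passing `ignore_index=True` avoids repeated row labels, since each part would otherwise keep its own numbering starting at 0.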

### Predictors
There are 5 columns in the dataset, namely:
1. COMMENT_ID: The ID of a certain YouTube comment.
2. AUTHOR: The person who wrote the YouTube comment.
3. DATE: The date on which the comment was written.
4. CONTENT: The comment that they wrote. This column is where all of the text is stored.
5. CLASS: Indicates whether the comment is spam or not. This is the target variable. If it is 0, the comment is not spam. If it is 1, the comment is classified as spam.

However, we will not use all of these columns, because the only ones that matter for this project are CONTENT (the predictor) and CLASS (the target).

An example of a spam and a non-spam YouTube comment:
<table>
  <tr>
    <th>Comment</th>
    <th>Spam?</th> 
  </tr>
  <tr>
    <td>"Behold the most viewed youtube video in the history of ever!"</td>
    <td>0 (No)</td> 
  </tr>
  <tr>
    <td>"guys please subscribe me to help my channel grow please guys"</td>
    <td>1 (Yes)</td> 
  </tr>
</table>

In [4]:
# Import important modules
import pandas
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Steps to get data suitable for machine learning
1. Remove all HTML markup using Beautiful Soup. Some comments contain HTML tags, for example, one of the comments was "Super awesome video `<br/>`".
2. Remove all punctuation (and any other non-letter characters, such as digits) so that only words remain. This allows the algorithm to look at words only.
3. Convert the words to lower case and remove all stop words (very common words, such as "this" and "you", that carry little information for the algorithm).
4. Join the remaining words back into a sentence for use with CountVectorizer.


In [5]:
data = pandas.read_csv("YoutubeComments.csv")
# Print out the first comment
print(data["CONTENT"][0])
# Use the Beautiful Soup package to remove HTML markup.
# Naming the parser explicitly avoids the "no parser specified" warning.
rev_soup = BeautifulSoup(data["CONTENT"][0], "html.parser")
print(rev_soup.get_text())

Huh, anyway check out this you[tube] channel: kobyoshi02
Huh, anyway check out this you[tube] channel: kobyoshi02






In [6]:
# Replace every non-letter character (punctuation, digits) with a space using a regex
letters_only = re.sub("[^a-zA-Z]"," ",rev_soup.get_text())
print(letters_only)


Huh  anyway check out this you tube  channel  kobyoshi  


In [7]:
# Split them to different words
lower_case = letters_only.lower()
words = lower_case.split()
print(words)


['huh', 'anyway', 'check', 'out', 'this', 'you', 'tube', 'channel', 'kobyoshi']


In [8]:
# Load the English stop word list from NLTK
# (run nltk.download("stopwords") once if the list is not installed yet)
stop_words = set(stopwords.words("english"))

# Add each word which is not a stop word to the list named word_list
word_list = [word for word in words if word not in stop_words]
        


In [9]:
# An example of a cleaned comment
print(word_list)

['huh', 'anyway', 'check', 'tube', 'channel', 'kobyoshi']


In [10]:
# Join them into a sentence again
clean_text = " ".join(word_list)
print(clean_text)

huh anyway check tube channel kobyoshi


## Now that we have seen how a single comment is cleaned, we can do the same for all of the YouTube comments:

In [6]:
# Build the stop word set once so it is not reloaded for every comment
stop_words = set(stopwords.words("english"))

# Wrap the cleaning steps in a function:
def clean_content(content):
    # Strip HTML markup (parser named explicitly to avoid the missing-parser warning)
    rev_soup = BeautifulSoup(content, "html.parser")
    # Keep letters only, then lower-case and split into words
    letters_only = re.sub("[^a-zA-Z]", " ", rev_soup.get_text())
    words = letters_only.lower().split()
    # Drop stop words and join the rest back into one string
    word_list = [word for word in words if word not in stop_words]
    return " ".join(word_list)

# Clean every comment in the dataset
cleaned_content = [clean_content(content) for content in data["CONTENT"]]

print(cleaned_content)







In [7]:
# Train test split (80% train, 20% test)
(train_content, test_content, train_target, test_target) = train_test_split(
    cleaned_content, data["CLASS"], test_size=0.2)

# Bag of Words with the 5000 most common words
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
# Learn the vocabulary from the training comments only
vectorizer.fit(train_content)
# Print out the words that it has found
# (get_feature_names_out() replaces get_feature_names() in newer scikit-learn)
print(vectorizer.get_feature_names_out())





In [11]:
train_word_columns = vectorizer.transform(train_content).toarray()
test_word_columns = vectorizer.transform(test_content).toarray()
# Take a look at what these data sets look like now
print("Train word data:")
print(train_word_columns)
print("Test word data:")
print(test_word_columns)

Train word data:
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
Test word data:
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
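
Because the real matrices are mostly zeros, the printout above does not show much. To make the bag-of-words step more concrete, here is a small sketch on three made-up cleaned comments (invented for illustration, not taken from the dataset). Each row of the resulting array is one comment, and each column counts how often one vocabulary word appears in it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up comments, already cleaned
toy_comments = [
    "check channel",
    "love song",
    "check check video",
]

toy_vectorizer = CountVectorizer(analyzer="word", max_features=5000)
# fit learns the vocabulary, transform counts the words in each comment
toy_counts = toy_vectorizer.fit_transform(toy_comments).toarray()

# vocabulary_ maps each word to its column index (columns are alphabetical)
print(sorted(toy_vectorizer.vocabulary_))
print(toy_counts)
```

The word "check" appears twice in the third comment, so its column holds a 2 in that row.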


# Machine Learning Experiment
Now that we have converted the comments into word-count vectors, we can try two classifiers: Gaussian Naive Bayes first, and a Support Vector Machine second.

The results of both algorithms are shown below:

In [13]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(train_word_columns,train_target)
preds = gnb.predict(test_word_columns)
# accuracy_score expects the true labels first, then the predictions
print("Gaussian: ", accuracy_score(test_target,preds))

# SVM
svm = SVC()
svm.fit(train_word_columns,train_target)
preds = svm.predict(test_word_columns)
print("SVC: " , accuracy_score(test_target,preds))

Gaussian:  0.762755102041
SVC:  0.489795918367


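The results show that Gaussian Naive Bayes is more accurate than the Support Vector Machine on the raw word counts. Note that Gaussian Naive Bayes models each feature as a continuous, normally distributed value, so it is not a natural fit for word counts, which are discrete and mostly zero. It still reaches about 76% accuracy here. The SVC accuracy of about 49% is roughly what guessing would give if the two classes are about equally common, which suggests that the SVC with its default settings did not find a useful decision boundary in the high-dimensional, sparse count data.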
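
One way to put these accuracies in context is to compare them with a baseline that always predicts the most common class. The sketch below uses made-up labels rather than the real `train_target` and `test_target`, and `DummyClassifier` is a scikit-learn helper that was not part of the original experiment.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 1 = spam, 0 = not spam
toy_train_target = np.array([1, 1, 1, 0, 0])
toy_test_target = np.array([1, 0, 1, 0])

# The baseline ignores the features, so a placeholder column is enough
toy_train_features = np.zeros((len(toy_train_target), 1))
toy_test_features = np.zeros((len(toy_test_target), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_train_features, toy_train_target)
baseline_preds = baseline.predict(toy_test_features)

# A useful model should score clearly above this number
baseline_accuracy = accuracy_score(toy_test_target, baseline_preds)
print("Baseline: ", baseline_accuracy)
```

On the real data, the same idea would apply with `train_word_columns`, `train_target`, `test_word_columns`, and `test_target` in place of the toy arrays.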

## Principal Component Analysis
Next, we apply Principal Component Analysis (PCA) to the word-count features before training, in the hope of improving the accuracy of both models.

In [14]:
from sklearn.decomposition import PCA
# Set n_components to 300
extractor = PCA(n_components=300, whiten=True)
# Fitting with PCA 
extractor.fit(train_word_columns)
# Transform train data
train_transformed = extractor.transform(train_word_columns)

# Transforming test data as well
test_transformed = extractor.transform(test_word_columns)

# PCA transformed GNB
gnbP = GaussianNB()
gnbP.fit(train_transformed,train_target)
preds = gnbP.predict(test_transformed)
# accuracy_score expects the true labels first, then the predictions
print("PCA transformed Gaussian: ", accuracy_score(test_target,preds))

# PCA transformed SVM
svmP = SVC()
svmP.fit(train_transformed,train_target)
preds = svmP.predict(test_transformed)
print("PCA transformed SVM: ", accuracy_score(test_target,preds))


PCA transformed Gaussian:  0.686224489796
PCA transformed SVM:  0.938775510204


PCA transformed SVM worked as intended: the accuracy of the SVM increased dramatically, from about 49% to about 94%. This is pretty good.

Surprisingly, using Gaussian Naive Bayes after the PCA transformation made the results worse (about 69%, down from about 76%). This may be because PCA replaces the word counts with linear combinations of them, so the data is no longer a bag of words.
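
The sketch below illustrates this on a tiny made-up count matrix (not the real data): after PCA, the features are centered, real-valued combinations of the original counts, so they can be negative and are no longer word counts. The `explained_variance_ratio_` attribute shows how much of the variance each component keeps, which is one possible way to judge a choice such as `n_components=300`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up word counts: 4 comments, 3 vocabulary words
toy_counts = np.array([
    [2, 0, 1],
    [0, 1, 0],
    [1, 0, 3],
    [0, 2, 0],
])

toy_extractor = PCA(n_components=2, whiten=True)
toy_transformed = toy_extractor.fit_transform(toy_counts)

# Counts are non-negative integers; PCA output is centered around zero
print(toy_transformed)
print("Any negative values: ", (toy_transformed < 0).any())

# Fraction of the total variance kept by each component
print(toy_extractor.explained_variance_ratio_)
```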

## Using Multinomial Naive Bayes
I expect Multinomial Naive Bayes to do well here because, according to the scikit-learn documentation (http://scikit-learn.org/stable/modules/naive_bayes.html), it is one of the classic Naive Bayes variants used in text classification.

In [17]:
# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(train_word_columns,train_target)
preds = mnb.predict(test_word_columns)
print("Multinomial: ", accuracy_score(test_target,preds))

Multinomial:  0.905612244898


It did considerably better than the original Gaussian Naive Bayes and SVM (about 91% versus about 76% and 49%), and it is not far behind the PCA transformed SVM (about 94%).
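
Because these accuracies come from a single random train/test split, small differences between models may change from run to run. One possible way to get a steadier comparison is k-fold cross-validation, sketched below on a made-up toy corpus. `make_pipeline` and `cross_val_score` are scikit-learn utilities that the original experiment did not use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up cleaned comments and labels (1 = spam, 0 = not spam)
toy_comments = [
    "subscribe channel please", "check channel subscribe",
    "please subscribe check video", "subscribe please please",
    "love song", "great video love", "best song ever", "love great song",
]
toy_target = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits the vocabulary inside each fold, so no test words leak
model = make_pipeline(CountVectorizer(), MultinomialNB())

# With cv=4 and a classifier, the folds are stratified: each fold holds out
# one spam and one non-spam comment
scores = cross_val_score(model, toy_comments, toy_target, cv=4)
print("Fold accuracies: ", scores)
print("Mean accuracy: ", scores.mean())
```

The same pattern could be applied to `cleaned_content` and `data["CLASS"]`, swapping in each model to compare their mean accuracies.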

## Conclusion
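1. On the raw word counts, Gaussian Naive Bayes provides higher accuracy than the Support Vector Machine.
2. However, PCA greatly increases the accuracy of the Support Vector Machine.
3. PCA does not work well with Gaussian Naive Bayes.
4. Multinomial Naive Bayes is a strong choice for text classification: it reached about 91% accuracy on the raw word counts, second only to the PCA transformed SVM (about 94%).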