# IST736 Text Mining
## Homework 6
### Martin Alonso
### 2019-02-23

### Objectives
For this assignment, we will revisit the week 4 homework and classify customer reviews using both Bernoulli and Multinomial Naive Bayes from the sklearn package, to determine whether a customer is truthful or not and whether the review is good or bad. 

In [48]:
# Import the necessary packages
import os
import re
import arff
import numpy as np
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [3]:
# We'll first start by reading the arff data and extracting the sentiment, veracity, and text from the data
with open('C:/Users/malon/Documents/Syracuse University/IST 736 Text Mining/IST736/IST736/Week4/deception_data_converted_arff.arff') as f:
    data = arff.load(f)

# Extract the data rows into a list (the 'data' entry of the parsed arff dictionary).
data_texts = data['data']

In [4]:
# Initiate empty DataFrame
corpus = pd.DataFrame()

# Process the data and convert to pandas DataFrame
for i in range(0, len(data_texts)):
    text = data_texts[i]
    text_processed = pd.DataFrame(np.array(text).reshape(-1,3))
    corpus = pd.concat((corpus, text_processed), axis=0, ignore_index=True)

# Keep column names from original data and print first 5 observations
corpus.columns = ['lie', 'sentiment', 'review']
print(corpus.head())

    lie sentiment                                             review
0  fake  negative  Mike's Pizza High Point, NY Service was very s...
1  fake  negative  i really like this buffet restaurant in Marsha...
2  fake  negative  After I went shopping with some of my friend, ...
3  fake  negative  Olive Oil Garden was very disappointing. I exp...
4  fake  negative  The Seven Heaven restaurant was never known fo...
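Because each row is already a three-element list, the same frame can probably be built in one call instead of concatenating inside a loop. The following is a minimal sketch under that assumption; `rows` and `corpus_sketch` are hypothetical names, and the two rows are made up rather than taken from the arff file.

```python
import pandas as pd

# Hypothetical rows in the same [lie, sentiment, review] layout as data_texts
rows = [
    ['fake', 'negative', 'Service was very slow.'],
    ['true', 'positive', 'Great pizza and friendly staff!'],
]

# Build the frame in one step, naming the columns as in the original data
corpus_sketch = pd.DataFrame(rows, columns=['lie', 'sentiment', 'review'])
print(corpus_sketch.head())
```

Building the frame once avoids repeated copies of the growing DataFrame, which matters little for this data set but scales better.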


Now that we have the data in a more readable format, we'll do some additional cleaning. We'll first replace any special characters in the review column with spaces. Then we'll convert every review to lower case. Once this is completed, we'll start working with the Bernoulli and Multinomial Naive Bayes models.

In [5]:
# Remove special characters and lower case words 
for i in range(0, len(corpus)): 
    corpus.iloc[i, 2] = re.sub(r'[\W_]', ' ', corpus.iloc[i, 2]).lower()

print(corpus.head())

    lie sentiment                                             review
0  fake  negative  mike s pizza high point  ny service was very s...
1  fake  negative  i really like this buffet restaurant in marsha...
2  fake  negative  after i went shopping with some of my friend  ...
3  fake  negative  olive oil garden was very disappointing  i exp...
4  fake  negative  the seven heaven restaurant was never known fo...
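The same cleaning step can also be expressed with pandas string methods rather than a row-by-row loop. This is only a sketch on a hypothetical two-row frame (`sketch` is an illustrative name, and the second review is made up); it applies the same pattern and lower-casing as the cell above.

```python
import pandas as pd

# Hypothetical frame with the same 'review' column as corpus
sketch = pd.DataFrame({'review': ["Mike's Pizza High Point, NY", "Olive_Oil Garden!"]})

# Replace every non-word character and underscore with a space, then lower-case
sketch['review'] = (
    sketch['review']
    .str.replace(r'[\W_]', ' ', regex=True)
    .str.lower()
)
print(sketch['review'].tolist())
```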


### Analysis and Modeling
#### Bernoulli Naive Bayes
Now that the data has been loaded, we'll build three Bernoulli Naive Bayes models with the following parameters: 
1. Stop words removed
2. Stop words removed and 1-2 ngrams
3. Stop words removed and stemming (n-grams dropped)  

But before we do that, we must prepare the data by converting the count vectors into booleans; BernoulliNB does this through its binarize parameter. 
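To illustrate what the boolean conversion does, here is a small sketch using `sklearn.preprocessing.binarize`, which applies the same thresholding rule as the `binarize` parameter of BernoulliNB: values strictly greater than the threshold become 1, everything else becomes 0. The `counts` array is a made-up example, not part of the review data.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical term counts for one document: a word seen 0, 1, 2 and 3 times
counts = np.array([[0.0, 1.0, 2.0, 3.0]])

# Threshold 0: any word that occurs at all is marked present
at_zero = binarize(counts, threshold=0.0)

# Threshold 1: a word must occur more than once to be marked present
at_one = binarize(counts, threshold=1.0)

print(at_zero)
print(at_one)
```

With a threshold of 1, a word that appears exactly once in a review is treated as absent, which is worth keeping in mind when reading the results of the models that follow.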

In [65]:
# 1. We create a count vectorizer that removes stop words
cv_1 = CountVectorizer(stop_words='english')

X_train = corpus.review
y_train = corpus.sentiment

X_train = cv_1.fit_transform(X_train)

# Initiate the BernoulliNB classifier, fit it, and print a confusion matrix comparing predictions on the training set with the actual labels
# Note: binarize=1 marks a word as present only when its count is greater than 1
bnb = BernoulliNB(binarize=1, alpha=1)
bnb.fit(X_train, y_train)

pred = bnb.predict(X_train)
confusion_matrix(y_train, pred, labels=['positive', 'negative'])

array([[46,  0],
       [20, 26]], dtype=int64)

A good first attempt. The model correctly identifies all 46 positive reviews, and overall accuracy on the training data is 78.3 percent (72 of 92). However, specificity is low: only 26 of the 46 negative reviews (56.5 percent) are classified as negative. We'll try to improve upon this value. 
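The accuracy and specificity figures can be checked directly from the confusion matrix. The sketch below copies the matrix from the output above and treats "positive" as the positive class; the variable names are illustrative only.

```python
# Confusion matrix copied from the output above.
# Rows are actual labels, columns are predicted labels, ordered [positive, negative].
cm = [[46, 0],
      [20, 26]]

tp, fn = cm[0]  # actual positive: predicted positive, predicted negative
fp, tn = cm[1]  # actual negative: predicted positive, predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)
sensitivity = tp / (tp + fn)   # share of positive reviews found
specificity = tn / (tn + fp)   # share of negative reviews found

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

The same three lines of arithmetic apply to every confusion matrix in this notebook, since they all use the same label order.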

In [67]:
# 2. Include 1-2 n-grams in the count vectorizer and drop terms found in more than 10 percent of reviews
cv_2 = CountVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.1)

X_train = corpus.review
y_train = corpus.sentiment

X_train = cv_2.fit_transform(X_train)

# Refit the BernoulliNB classifier and print a confusion matrix comparing predictions on the training set with the actual labels
bnb.fit(X_train, y_train)

pred = bnb.predict(X_train)
confusion_matrix(y_train, pred, labels=['positive', 'negative'])

array([[46,  0],
       [28, 18]], dtype=int64)

No success. The model still classifies all 46 positive reviews correctly, but both specificity (39.1 percent) and accuracy (69.6 percent) are worse after adding n-grams and the max_df=0.1 cutoff. We'll try to correct this in the last model by stemming the reviews and dropping n-grams altogether. 
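To make the 1-2 n-gram setting used in the second model concrete, the sketch below fits two vectorizers on a single made-up sentence (not taken from the corpus) and lists the features each one learns. Note that sklearn removes stop words before forming bigrams, so a bigram can join words that were not adjacent in the original text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-sentence corpus, not taken from the review data
docs = ["the service was very slow"]

# Unigrams only, stop words removed
unigrams = CountVectorizer(stop_words='english').fit(docs)

# Unigrams and bigrams, stop words removed
uni_bi = CountVectorizer(stop_words='english', ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

Adding bigrams enlarges the feature space considerably, which on a corpus of this size may spread the evidence for each class more thinly.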

In [50]:
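# Initiate stemmer and run through reviews
# Note: stem() expects a single word; called on a whole review string it treats
# the review as one token, so the text comes back largely unchanged (see output below)
englishStemmer = SnowballStemmer("english", ignore_stopwords=True)

for i in range(0, len(corpus)): 
    corpus.loc[i, 'stemmed_review'] = englishStemmer.stem(corpus.iloc[i, 2])

print(corpus.head())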

    lie sentiment                                             review  \
0  fake  negative  mike s pizza high point  ny service was very s...   
1  fake  negative  i really like this buffet restaurant in marsha...   
2  fake  negative  after i went shopping with some of my friend  ...   
3  fake  negative  olive oil garden was very disappointing  i exp...   
4  fake  negative  the seven heaven restaurant was never known fo...   

                                      stemmed_review  
0  mike s pizza high point  ny service was very s...  
1  i really like this buffet restaurant in marsha...  
2  after i went shopping with some of my friend  ...  
3  olive oil garden was very disappointing  i exp...  
4  the seven heaven restaurant was never known fo...  
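The stemmed column printed above looks identical to the original reviews, which is consistent with the stemmer receiving each review as one long string rather than word by word. A possible alternative, sketched here under the assumption that whitespace tokenization is good enough for the cleaned text, is to stem each token and rejoin them. `stem_review` is a hypothetical helper, and `ignore_stopwords` is left at its default so the sketch does not depend on the NLTK stop word corpus being downloaded.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_review(text):
    """Stem every whitespace-separated token of a review and rejoin them."""
    return " ".join(stemmer.stem(token) for token in text.split())

# Hypothetical cleaned review fragment
example = "waiters running"
print(stem_review(example))
```

Applying it to the frame would look like `corpus['stemmed_review'] = corpus['review'].apply(stem_review)`; the confusion matrices reported in this notebook were not produced with this variant.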


In [75]:
# 3. Classify over stemmed reviews
cv_3 = CountVectorizer(stop_words='english', max_df=0.1)

X_train = corpus.stemmed_review
y_train = corpus.sentiment

X_train = cv_3.fit_transform(X_train)

# Refit the BernoulliNB classifier and print a confusion matrix comparing predictions on the training set with the actual labels
bnb.fit(X_train, y_train)

pred = bnb.predict(X_train)
confusion_matrix(y_train, pred, labels=['positive', 'negative'])

array([[46,  0],
       [24, 22]], dtype=int64)

The model improves slightly: accuracy rises to 73.9 percent, up from 69.6 percent for the second model, but it is still below the 78.3 percent of the first. The model continues to struggle with negative reviews, predicting 24 of the 46 as positive. Perhaps this can be corrected by using the Multinomial Naive Bayes algorithm. 

#### Multinomial Naive Bayes
We'll run the same three models, this time using the MultinomialNB class with TF-IDF features instead of raw counts. 

In [69]:
# 1. We create a TF-IDF vectorizer that removes stop words
tfidf_1 = TfidfVectorizer(stop_words='english')

X_train = corpus.review
y_train = corpus.sentiment

X_train = tfidf_1.fit_transform(X_train)

# Initiate the MultinomialNB classifier, fit it, and print a confusion matrix comparing predictions on the training set with the actual labels
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

pred = mnb.predict(X_train)
confusion_matrix(y_train, pred, labels=['positive', 'negative'])

array([[44,  2],
       [ 0, 46]], dtype=int64)

Right off the bat the Multinomial Naive Bayes algorithm delivers a training accuracy of 97.8 percent (90 of 92 reviews), much better than any of the results we've seen from the Bernoulli algorithm. Let's try to improve this model by adding n-grams.

In [76]:
# 2. Include 1-2 n-grams in the TF-IDF vectorizer and drop terms found in more than 10 percent of reviews
tfidf_2 = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.1)

X_train = corpus.review
y_train = corpus.sentiment

X_train = tfidf_2.fit_transform(X_train)

# Refit the MultinomialNB classifier and print a confusion matrix comparing predictions on the training set with the actual labels
mnb.fit(X_train, y_train)

pred = mnb.predict(X_train)
confusion_matrix(y_train, pred, labels=['positive', 'negative'])

array([[44,  2],
       [ 0, 46]], dtype=int64)

The model shows no improvement when compared to the initial one. Let's try the third option after stemming the reviews. 

In [78]:
# 3. Classify over stemmed reviews
tfidf_3 = TfidfVectorizer(stop_words='english', max_df=0.1)

X_train = corpus.stemmed_review
y_train = corpus.sentiment

X_train = tfidf_3.fit_transform(X_train)

# Refit the MultinomialNB classifier and print a confusion matrix comparing predictions on the training set with the actual labels
mnb.fit(X_train, y_train)

pred = mnb.predict(X_train)
confusion_matrix(y_train, pred, labels=['positive', 'negative'])

array([[44,  2],
       [ 0, 46]], dtype=int64)

### Conclusions

The Multinomial Naive Bayes model proved very difficult to improve after its initial attempt: all three variations produced the same confusion matrix. Even so, it outperformed all three Bernoulli Naive Bayes models. This result is not conclusive, however, because every model was evaluated on the same data it was trained on, given that we are working with a very limited data set of 92 reviews. Before we draw any conclusions as to which of the two is the better one, we should try both with different parameters on a larger data set with separate training and testing sets; this would give us a better idea of which of the Bernoulli or Multinomial variations of Naive Bayes is better.
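As a starting point for that follow-up, a held-out evaluation could look roughly like the sketch below. It uses `train_test_split`, which is imported at the top of the notebook but never used, together with a pipeline so the vectorizer is fitted on the training split only. The `reviews` and `labels` lists are made-up stand-ins for `corpus.review` and `corpus.sentiment`, and the split size and random seed are arbitrary assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for corpus.review and corpus.sentiment
reviews = [
    "great pizza friendly staff", "delicious pasta wonderful service",
    "amazing dessert lovely atmosphere", "excellent burger quick service",
    "terrible soup rude waiter", "cold food slow service",
    "awful steak dirty table", "bland noodles horrible smell",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Hold out a quarter of the reviews, keeping the class balance in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training reviews only, then the classifier
model = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
model.fit(X_tr, y_tr)

held_out_cm = confusion_matrix(y_te, model.predict(X_te), labels=['positive', 'negative'])
print(held_out_cm)
```

Swapping `MultinomialNB()` for `BernoulliNB()` in the pipeline would give the matching held-out comparison for the other model.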