# IST736 Text Mining
## Homework 7
### Martin Alonso
### 2019-03-01

### Objectives
For this assignment, there are three tasks that will be completed using Sentiment Analysis data from [Kaggle](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data). 
1. Build a MultinomialNB and an SVM model using unigram features. Compare the top ten words for both the positive and negative categories. Report the confusion matrix, recall, and precision for both models. Both models must use the same feature parameters. 
2. Build a second MultinomialNB and SVM model using CountVectorizers. Split the training set 60/40, keeping the original parameters from task 1, except this time build a bigram model. Again build a confusion matrix and compare the recall and precision of both models. 
3. Revise the model script and build an SVM using the full training set. Tune the model to gain the maximum possible accuracy, reporting parameter tuning and cross-validation accuracy. Use this model to predict the Kaggle test set and submit the predictions to the Kaggle competition. 

### Analysis
We'll start by first importing the necessary libraries, specifically `pandas` and `sklearn`. We'll then load the data and do some quick exploration before diving into the individual tasks.

In [1]:
# Load libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from nltk.stem.snowball import SnowballStemmer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
import pandas as pd 
import numpy as np
import re 

In [2]:
# Load data set and view first five rows
df = pd.read_csv('C:/Users/malon/Documents/Syracuse University/IST 736 Text Mining/IST736/IST736/Week6/kaggle-sentiment/kaggle-sentiment/train.tsv', delimiter='\t')
df.head()

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |


In [3]:
# Show shape of data set and count number of items per sentiment
print(df.shape)

num_items = df.groupby('Sentiment')['SentenceId'].count()
print(num_items)

(156060, 4)
Sentiment
0     7072
1    27273
2    79582
3    32927
4     9206
Name: SentenceId, dtype: int64


We have 156,060 observations, split up among five different sentiment classifications, ranging from very negative (0) to very positive (4).  
We'll now identify the X and y columns which will be the Phrase and Sentiment columns; then we'll start with the tasks. 
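
Before modeling, it helps to know what accuracy a trivial classifier would reach. The sketch below is an addition to the analysis, not one of its steps; it uses only the class counts printed above to compute the accuracy of always predicting the most common class.

```python
# Class counts copied from the groupby output above
counts = {0: 7072, 1: 27273, 2: 79582, 3: 32927, 4: 9206}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the most common class
baseline_accuracy = counts[majority_class] / total

for label, n in sorted(counts.items()):
    print("class {}: {:6d} phrases ({:.1%})".format(label, n, n / total))
print("majority class: {}, baseline accuracy: {:.3f}".format(majority_class, baseline_accuracy))
```

Since roughly half of the phrases carry label 2, the accuracy scores in the tasks below are best read against this baseline of about 51 percent rather than against zero.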

#### Task 1
##### Multinomial Naive Bayes
We'll start by lowercasing every phrase and stemming it with the Snowball Stemmer from the `nltk` package. We'll then split the data 60/40 into training and test sets and build a TF-IDF vectorizer on unigrams, using Latin-1 encoding and removing English stop words. 
Once the model is trained on the 60 percent, we'll evaluate it on the remaining 40 percent of the data. 
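
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the calculation that `TfidfVectorizer` performs with its default settings (`smooth_idf=True`, `norm='l2'`, raw term counts). The toy corpus is hypothetical, and the sketch leaves out stop-word removal and `min_df`, so it illustrates the formula rather than reproducing the vectorizer.

```python
import math
from collections import Counter

# Toy corpus (hypothetical phrases, not taken from the Kaggle data)
docs = [["good", "film"], ["bad", "film"], ["good", "good", "fun"]]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Raw term count times IDF, then L2-normalise the document vector
    weights = {t: c * smoothed_idf(t) for t, c in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print(doc, {t: round(w, 3) for t, w in tfidf(doc).items()})
```

In the second toy document, "bad" receives a higher weight than "film" because it appears in fewer documents, which is the effect the IDF term is meant to produce.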

In [4]:
# Work on a copy so the original data frame stays intact if the data needs to be reprocessed
dat = df.copy()

# Replace special characters with spaces and set the text to lower case
dat['Phrase'] = dat['Phrase'].str.replace(r'[\W_]', ' ', regex=True).str.lower()

In [5]:
# Initiate English stemmer for data set
englishStemmer = SnowballStemmer("english", ignore_stopwords=True)

# Note: stem() expects a single word, so passing a whole phrase only stems
# the end of the string (see 'a seri' in the output below)
dat['stemmed_review'] = dat['Phrase'].apply(englishStemmer.stem)
    
# Review first five observations. 
print(dat[['Phrase', 'stemmed_review']].head())

                                              Phrase  \
0  a series of escapades demonstrating the adage ...   
1  a series of escapades demonstrating the adage ...   
2                                           a series   
3                                                  a   
4                                             series   

                                      stemmed_review  
0  a series of escapades demonstrating the adage ...  
1  a series of escapades demonstrating the adage ...  
2                                             a seri  
3                                                  a  
4                                               seri  


Now that the data has been transformed, we'll separate it into training and test sets and build the Multinomial Naive Bayes model using a TF-IDF Vectorizer.

In [6]:
# Identify X and y columns
X = dat['stemmed_review'].values
y = dat['Sentiment'].values

In [7]:
# Split the data into training and testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Build the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, min_df=2, stop_words='english')

In [8]:
# Initiate the Multinomial Naive Bayes
mnb = MultinomialNB()

# Fit transform the training set with the tf_idf vectorizer
X_train_mnb_task1 = tfidf_vectorizer.fit_transform(X_train)

# Fit the model on the data 
mnb.fit(X_train_mnb_task1, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [9]:
# Transform X_test using tfidf_vectorizer
X_test_mnb_task1 = tfidf_vectorizer.transform(X_test)

# Run the model against the test set
mnb_task1_pred = mnb.predict(X_test_mnb_task1)

# Compare the predictions with the actual outcomes; print a confusion matrix
mnb_task1_results = confusion_matrix(y_test, mnb_task1_pred)

# Correct predictions sit on the diagonal of the confusion matrix
mnb_task1_correct = np.trace(mnb_task1_results)

print(mnb_task1_results)
# Accuracy as a percentage, rounded to the nearest whole number
print('The model has a {} percent accuracy score'.format(np.round(mnb_task1_correct / len(X_test) * 100)))
print("The model has a precision score of {}".format(precision_score(y_test, mnb_task1_pred, average='weighted')))
print("The model has a recall score of {}".format(recall_score(y_test, mnb_task1_pred, average='weighted')))

[[   79  1138  1539    69     0]
 [   29  2616  7901   377     2]
 [    7  1078 28723  1774    15]
 [    0   166  8190  4898    65]
 [    0    21  1414  2179   144]]
The model has a 58.0 percent accuracy score
The model has a precision score of 0.5774402400505637
The model has a recall score of 0.5840702293989491


The ten highest-weighted words for the very negative and very positive classes are the following.

In [10]:
# Print top 10 very positive and very negative words.
# Note: MultinomialNB.coef_ and get_feature_names() were removed in later scikit-learn
# releases; feature_log_prob_ and get_feature_names_out() are the replacements there.
mnb_task1_neg = sorted(zip(mnb.coef_[0], tfidf_vectorizer.get_feature_names()))
mnb_task1_very_negative_features = mnb_task1_neg[-10:]
print("The top ten most negative words: ")
print(mnb_task1_very_negative_features)

mnb_task1_pos = sorted(zip(mnb.coef_[4], tfidf_vectorizer.get_feature_names()))
mnb_task1_very_positive_features = mnb_task1_pos[-10:]
print("\r\n The top ten most positive words: ")
print(mnb_task1_very_positive_features)

The top ten most negative words: 
[(-6.9399772891086, 'plot'), (-6.920909330556597, 'does'), (-6.916481813146056, 'long'), (-6.614610163989215, 'movi'), (-6.469718645676306, 'just'), (-6.4612640564650885, 'worst'), (-6.36007690841749, 'like'), (-5.9943176509807214, 'film'), (-5.657573643770104, 'movie'), (-5.487203932578288, 'bad')]

 The top ten most positive words: 
[(-6.426636412219698, 'movi'), (-6.374529025683264, 'fun'), (-6.334336216735293, 'year'), (-6.290101220903835, 'perform'), (-6.257090422266336, 'great'), (-6.125897821477724, 'movie'), (-6.091946010331331, 'good'), (-6.05447637421946, 'funny'), (-5.667597087531533, 'best'), (-5.303799225452078, 'film')]


##### Support Vector Machine
Now let's compare the results from the Multinomial Naive Bayes to those of the SVM. Given that the data has already been prepared, we'll skip ahead to building the model and reporting the results. 

In [11]:
# Initiate the Support Vector Classifier
svm = LinearSVC()

# Fit transform the training set with the tf_idf vectorizer
X_train_svm_task1 = tfidf_vectorizer.fit_transform(X_train)

# Fit the model on the data 
svm.fit(X_train_svm_task1, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [12]:
# Transform X_test using tfidf_vectorizer
X_test_svm_task1 = tfidf_vectorizer.transform(X_test)

# Run the model against the test set
svm_task1_pred = svm.predict(X_test_svm_task1)

# Compare the predictions with the actual outcomes; print a confusion matrix
svm_task1_results = confusion_matrix(y_test, svm_task1_pred)

# Correct predictions sit on the diagonal of the confusion matrix
svm_task1_correct = np.trace(svm_task1_results)

print(svm_task1_results)
# Accuracy as a percentage, rounded to the nearest whole number
print('The model has a {} percent accuracy score'.format(np.round(svm_task1_correct / len(X_test) * 100)))
print("The model has a precision score of {}".format(precision_score(y_test, svm_task1_pred, average='weighted')))
print("The model has a recall score of {}".format(recall_score(y_test, svm_task1_pred, average='weighted')))

[[  752  1424   545    98     6]
 [  624  4498  5182   582    39]
 [  176  2381 26305  2598   137]
 [   28   445  5674  6296   876]
 [    5    57   544  2013  1139]]
The model has a 62.0 percent accuracy score
The model has a precision score of 0.6061185913896154
The model has a recall score of 0.6245995130078175


Overall, this model does a better job of identifying the sentiment correctly, especially for the very negative and very positive reviews. Let's compare the ten most positive and negative words with those of the Multinomial Naive Bayes. 
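
The difference on the extreme classes is easier to see per class than in the weighted averages. The sketch below is an addition that re-derives per-class recall and precision directly from the two Task 1 confusion matrices printed above (rows are true labels, columns are predictions), without needing the fitted models. It also shows that the `average='weighted'` recall coincides with overall accuracy, because weighting each class's recall by its size recovers the overall fraction of correct predictions.

```python
# Confusion matrices copied from the Task 1 outputs (rows = true, columns = predicted)
mnb_cm = [
    [79, 1138, 1539, 69, 0],
    [29, 2616, 7901, 377, 2],
    [7, 1078, 28723, 1774, 15],
    [0, 166, 8190, 4898, 65],
    [0, 21, 1414, 2179, 144],
]
svm_cm = [
    [752, 1424, 545, 98, 6],
    [624, 4498, 5182, 582, 39],
    [176, 2381, 26305, 2598, 137],
    [28, 445, 5674, 6296, 876],
    [5, 57, 544, 2013, 1139],
]

def per_class_recall(cm):
    # Recall for class i = correct predictions / all true members of class i
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def per_class_precision(cm):
    # Precision for class i = correct predictions / all predictions of class i
    col_totals = [sum(col) for col in zip(*cm)]
    return [cm[i][i] / col_totals[i] if col_totals[i] else 0.0 for i in range(len(cm))]

def accuracy(cm):
    # Fraction of all phrases that fall on the diagonal
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for name, cm in [("MNB", mnb_cm), ("SVM", svm_cm)]:
    print(name, "accuracy: {:.4f}".format(accuracy(cm)))
    print("  recall:   ", [round(r, 3) for r in per_class_recall(cm)])
    print("  precision:", [round(p, 3) for p in per_class_precision(cm)])
```

For the very negative class, recall rises from about 3 percent (79 of 2,825) with the MNB to about 27 percent (752 of 2,825) with the SVM; for the very positive class it rises from about 4 percent (144 of 3,758) to about 30 percent (1,139 of 3,758).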

In [13]:
# Print top 10 very positive and very negative words.
svm_task1_neg = sorted(zip(svm.coef_[0], tfidf_vectorizer.get_feature_names()))
svm_task1_very_negative_features = svm_task1_neg[-10:]
print("The top ten most negative words: ")
print(svm_task1_very_negative_features)

svm_task1_pos = sorted(zip(svm.coef_[4], tfidf_vectorizer.get_feature_names()))
svm_task1_very_positive_features = svm_task1_pos[-10:]
print("\r\n The top ten most positive words: ")
print(svm_task1_very_positive_features)

The top ten most negative words: 
[(2.159152060488495, 'insignificance'), (2.1743451558033158, 'kills'), (2.1931265090210226, 'turd'), (2.2012723848818396, 'unwatch'), (2.2241261223823208, 'repulsive'), (2.2894162987291153, 'stinks'), (2.2922790258349983, 'disappointment'), (2.2944575677619246, 'puddle'), (2.4196833814226357, 'stinker'), (2.477082732539301, 'awfulness')]

 The top ten most positive words: 
[(2.1369176857495886, 'standout'), (2.1585644244037265, 'rivet'), (2.208045800356681, 'enthrall'), (2.229379219828903, 'tremendous'), (2.2317849951722732, 'majestic'), (2.2701573831943125, 'masterful'), (2.306380816767369, 'astoundingly'), (2.3104892741875984, 'terrific'), (2.509939336621655, 'refreshes'), (2.591685388063509, 'zings')]


The words selected by the Multinomial Naive Bayes model don't make much sense as sentiment markers. Generic words like plot, movie, and film are associated negatively, along with bad and long, and movie and film appear in the positive list as well. The SVM, however, identifies insignificance, disappointment, stinker, and turd as negative words related to poor movie reviews.  
Regarding positive reviews, the MNB picks out words like fun, best, and good, while the SVM identifies more strongly positive words such as standout, enthrall, and majestic. This makes more sense and suggests that the SVM does a better job of discriminating between positive and negative reviews. 
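
One likely reason the MNB lists look so generic is that the values being sorted are per-class log probabilities, log P(word | class), which rank words by how common they are within a class rather than by how strongly they separate it from the other classes. The sketch below illustrates the difference on made-up counts: the same words are ranked first by smoothed log probability and then by the log ratio against another class. The counts, class names, and the choice of a log ratio are assumptions for illustration, not something computed in this assignment.

```python
import math

# Hypothetical word counts per class, for illustration only
counts = {
    "very_negative": {"film": 50, "movie": 40, "bad": 30, "stinker": 5},
    "very_positive": {"film": 60, "movie": 45, "bad": 2, "stinker": 0},
}
vocab = sorted({w for c in counts.values() for w in c})
alpha = 1.0  # Laplace smoothing, the same value as the MultinomialNB default

def log_prob(word, label):
    # Smoothed log P(word | class)
    total = sum(counts[label].values()) + alpha * len(vocab)
    return math.log((counts[label].get(word, 0) + alpha) / total)

def log_ratio(word, label, other):
    # How much more likely the word is in `label` than in `other`
    return log_prob(word, label) - log_prob(word, other)

by_prob = sorted(vocab, key=lambda w: log_prob(w, "very_negative"), reverse=True)
by_ratio = sorted(vocab, key=lambda w: log_ratio(w, "very_negative", "very_positive"), reverse=True)

print("ranked by log P(word | very_negative):", by_prob)
print("ranked by log ratio vs very_positive: ", by_ratio)
```

In this toy example the frequent but neutral words "film" and "movie" lead the first ranking, while "bad" and "stinker" lead the second, which mirrors the contrast between the MNB and SVM word lists above.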

#### Task 2

Task 1 showed that, although the accuracy scores are similar (58 percent vs 62 percent), the SVM does a better job classifying sentiment: the words it uses to discriminate sentiment stand out much more than those chosen by the MNB model. 
Building on these models, we'll use a Count Vectorizer with both unigrams and bigrams to try to improve both models' accuracy scores.
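
To show what adding bigrams does to the feature set, the sketch below mimics in plain Python how word n-grams are produced for `ngram_range=(1, 2)`. To our understanding, scikit-learn's word analyzer drops stop words before assembling n-grams, so a bigram can join two words that were not adjacent in the original phrase. The function name, the example phrase, and the tiny stop-word list are hypothetical stand-ins, not the library's own code or its English stop-word list.

```python
def word_ngrams(tokens, ngram_range=(1, 2), stop_words=frozenset()):
    # Drop stop words first, then slide a window of each size over what remains
    kept = [t for t in tokens if t not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(kept) - n + 1):
            grams.append(" ".join(kept[i:i + n]))
    return grams

# Hypothetical phrase and a tiny stand-in stop-word list
tokens = "keeps you on the edge of your seat".split()
stop_words = frozenset({"you", "on", "the", "of", "your"})

print(word_ngrams(tokens, (1, 1), stop_words))  # unigrams only
print(word_ngrams(tokens, (1, 2), stop_words))  # unigrams plus bigrams
```

The bigram "edge seat" never occurs as two adjacent words in the phrase; it only appears once the stop words between them have been removed.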

##### Multinomial Naive Bayes

We'll start by building the Count Vectorizer, fitting it on the training set, applying it to the test set, and creating the models. 

In [14]:
# Build the Count Vectorizer using both uni- and bigram.
cv_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1, 2), min_df=2, stop_words='english')

# Initiate the Multinomial Naive Bayes
mnb = MultinomialNB()

# Fit transform the training set with the count vectorizer
X_train_mnb_task2 = cv_vectorizer.fit_transform(X_train)

# Fit the model on the data 
mnb.fit(X_train_mnb_task2, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
# Transform X_test using cv_vectorizer
X_test_mnb_task2 = cv_vectorizer.transform(X_test)

# Run the model against the test set
mnb_task2_pred = mnb.predict(X_test_mnb_task2)

# Compare the predictions with the actual outcomes; print a confusion matrix
mnb_task2_results = confusion_matrix(y_test, mnb_task2_pred)

# Correct predictions sit on the diagonal of the confusion matrix
mnb_task2_correct = np.trace(mnb_task2_results)

print(mnb_task2_results)
# Accuracy as a percentage, rounded to the nearest whole number
print('The model has a {} percent accuracy score'.format(np.round(mnb_task2_correct / len(X_test) * 100)))
print("The model has a precision score of {}".format(precision_score(y_test, mnb_task2_pred, average='weighted')))
print("The model has a recall score of {}".format(recall_score(y_test, mnb_task2_pred, average='weighted')))

[[  916  1315   530    54    10]
 [  868  4692  4825   494    46]
 [  309  2897 24715  3350   326]
 [   44   493  5154  6510  1118]
 [    4    52   592  1849  1261]]
The model has a 61.0 percent accuracy score
The model has a precision score of 0.5963672059380526
The model has a recall score of 0.6102460592079969


In [16]:
# Print top 10 very positive and very negative words.
mnb_task2_neg = sorted(zip(mnb.coef_[0], cv_vectorizer.get_feature_names()))
mnb_task2_very_negative_features = mnb_task2_neg[-10:]
print("The top ten most negative words: ")
print(mnb_task2_very_negative_features)

mnb_task2_pos = sorted(zip(mnb.coef_[4], cv_vectorizer.get_feature_names()))
mnb_task2_very_positive_features = mnb_task2_pos[-10:]
print("\r\n The top ten most positive words: ")
print(mnb_task2_very_positive_features)

The top ten most negative words: 
[(-7.151599930249553, 'plot'), (-7.151599930249553, 'worst'), (-7.140426629651428, 'does'), (-7.129376793464843, 'characters'), (-7.06552532147831, 'movi'), (-6.757434377395322, 'just'), (-6.351969269287157, 'like'), (-5.998329229043579, 'bad'), (-5.933126035232817, 'film'), (-5.684398930516862, 'movie')]

 The top ten most positive words: 
[(-6.992071807890141, 'comedy'), (-6.992071807890141, 'fun'), (-6.874288772233757, 'work'), (-6.866383592726644, 'great'), (-6.812730879234324, 'year'), (-6.5010845263438135, 'funny'), (-6.468823664125592, 'good'), (-6.234107127253724, 'best'), (-6.169307134026809, 'movie'), (-5.332144495533648, 'film')]


This new model scores better, with an overall accuracy of 61 percent. Precision and recall also improve, from 57.7 and 58.4 percent to 59.6 and 61.0 percent. The confusion matrix shows that the gain comes from the non-neutral classes, where many more phrases now fall on the diagonal, while the neutral class loses some correct predictions. 
We'll now build a new SVM. 

##### Support Vector Machine

In [17]:
# Initiate the SVM
svm = LinearSVC()

# Fit transform the training set with the count vectorizer
X_train_svm_task2 = cv_vectorizer.fit_transform(X_train)

# Fit the model on the data 
svm.fit(X_train_svm_task2, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [18]:
# Transform X_test using count_vectorizer
X_test_svm_task2 = cv_vectorizer.transform(X_test)

# Run the model against the test set
svm_task2_pred = svm.predict(X_test_svm_task2)

# Compare the predictions with the actual outcomes; print a confusion matrix
svm_task2_results = confusion_matrix(y_test, svm_task2_pred)

# Correct predictions sit on the diagonal of the confusion matrix
svm_task2_correct = np.trace(svm_task2_results)

print(svm_task2_results)
# Accuracy as a percentage, rounded to the nearest whole number
print('The model has a {} percent accuracy score'.format(np.round(svm_task2_correct / len(X_test) * 100)))
print("The model has a precision score of {}".format(precision_score(y_test, svm_task2_pred, average='weighted')))
print("The model has a recall score of {}".format(recall_score(y_test, svm_task2_pred, average='weighted')))

[[  981  1344   431    58    11]
 [  980  4846  4605   456    38]
 [  263  2716 25554  2894   170]
 [   33   366  5098  6407  1415]
 [    8    34   433  1775  1508]]
The model has a 63.0 percent accuracy score
The model has a precision score of 0.6156692667969537
The model has a recall score of 0.6295014737921313


This model also improves upon the previous SVM but not at the same rate that the MNB improved between tasks. The first SVM had an accuracy score of 62 percent, while this second model has an accuracy score of 63 percent. 
Similarly, both the precision and recall scores improve but again, this improvement is marginal at best. 
Let's check the top ten positive and negative words. 

In [19]:
# Print top 10 very positive and very negative words.
svm_task2_neg = sorted(zip(svm.coef_[0], cv_vectorizer.get_feature_names()))
svm_task2_very_negative_features = svm_task2_neg[-10:]
print("The top ten most negative words: ")
print(svm_task2_very_negative_features)

svm_task2_pos = sorted(zip(svm.coef_[4], cv_vectorizer.get_feature_names()))
svm_task2_very_positive_features = svm_task2_pos[-10:]
print("\r\n The top ten most positive words: ")
print(svm_task2_very_positive_features)

The top ten most negative words: 
[(1.6174420625040353, 'zzzzzzzzz'), (1.624393411570868, 'hated'), (1.649875858250587, 'lam'), (1.6597472124616677, 'stinks'), (1.6718817551537646, 'snoozer'), (1.7652745846712279, 'dread'), (1.7753912565100816, 'baaaaaaaaad'), (1.8425306482789476, 'unbear'), (1.894373281044505, 'grotesqu'), (1.9648968974993428, 'unwatch')]

 The top ten most positive words: 
[(1.6005508300412497, 'phenomenal'), (1.607459722815741, 'masterful'), (1.6210873328459603, 'splendid'), (1.626030518362409, 'superb'), (1.6483348401730091, 'masterpiec'), (1.7326633755420633, 'topnotch'), (1.7485043562142197, 'dreamy'), (1.7612453450823295, 'rivet'), (1.772829856378459, 'terrif'), (1.9672066529713204, 'seat tense')]


Once again, the word lists show that the support vector machine does a better job of identifying words with positive and negative connotations than the multinomial Naive Bayes model does. Words like 'zzzzzzzzz', 'hated', and 'grotesqu' (the stem of grotesque) are more clearly negative than words like 'bad', 'worst', and 'plot'. The same can be said about the positive words, though the MNB's positive list (great, good, best, funny) is easier to interpret than its negative one. 

#### Task 3 

Now that we've worked on both tasks, we'll train an SVM using the entire training set. We'll then load the real test data from the Kaggle competition, predict its labels, export those results, and submit them to Kaggle. First, however, we'll have to process the test data in the same way that the training set was processed. 

In [20]:
# Load the test data set
test = pd.read_csv('C:/Users/malon/Documents/Syracuse University/IST 736 Text Mining/IST736/IST736/Week6/kaggle-sentiment/kaggle-sentiment/test.tsv', delimiter='\t')
test.head()

| | PhraseId | SentenceId | Phrase |
|---|---|---|---|
| 0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... |
| 1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... |
| 2 | 156063 | 8545 | An |
| 3 | 156064 | 8545 | intermittently pleasing but mostly routine effort |
| 4 | 156065 | 8545 | intermittently pleasing but mostly routine |


In [21]:
# Process the test set: remove special characters, lower the case, and pass it through the stemmer.
dat_test = test.copy()

# Replace special characters with spaces and set the text to lower case
dat_test['Phrase'] = dat_test['Phrase'].str.replace(r'[\W_]', ' ', regex=True).str.lower()

In [22]:
# Initiate English stemmer for testing set (same settings as for the training set)
englishStemmer = SnowballStemmer("english", ignore_stopwords=True)

dat_test['stemmed_review'] = dat_test['Phrase'].apply(englishStemmer.stem)
    
# Review first five observations. 
print(dat_test[['Phrase', 'stemmed_review']].head())

                                              Phrase  \
0  an intermittently pleasing but mostly routine ...   
1  an intermittently pleasing but mostly routine ...   
2                                                 an   
3  intermittently pleasing but mostly routine effort   
4         intermittently pleasing but mostly routine   

                                      stemmed_review  
0  an intermittently pleasing but mostly routine ...  
1  an intermittently pleasing but mostly routine ...  
2                                                 an  
3  intermittently pleasing but mostly routine effort  
4          intermittently pleasing but mostly routin  


Now that the testing data set has been cleaned, we'll set the training variable and label, along with the test variable. 

In [23]:
# Indicate the train and test sets
X_train = dat['stemmed_review'].values
y_train = dat['Sentiment'].values
X_test = dat_test['stemmed_review'].values

Because of its superior accuracy, we'll keep the count vectorizer. This time, however, we'll also adjust the model's parameters, setting the regularization parameter `C` to 0.5 and the random state to 42, and run 5-fold cross-validation to estimate how well the model generalizes.

In [68]:
# Build the Count Vectorizer using both uni- and bigram.
cv_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1, 2), min_df=2, stop_words='english')

# Initiate the model.
svm = LinearSVC(C=0.5, random_state=42)

# Fit transform X_train and X_test
task3_train = cv_vectorizer.fit_transform(X_train)
task3_test = cv_vectorizer.transform(X_test)

In [36]:
# Estimate accuracy with 5-fold cross-validation
cv_scores = cross_val_score(svm, task3_train, y_train, cv=5)
cv_scores

array([0.58000961, 0.55574422, 0.54698664, 0.55751362, 0.56706184])

In [69]:
# Fit the model 
svm.fit(task3_train, y_train)

# Check the model score
svm.score(task3_train, y_train)

0.8249327181853133

The cross-validation scores average 56.1 percent, while the model fit on the full training set scores 82.5 percent on that same data. The training-set figure is optimistic because the model has already seen every phrase it is scored on, so the cross-validation average is the better estimate of performance on new data. The model will now be used to predict scores for the transformed test set; these will then be submitted to the Kaggle competition for a final score. 
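
The 56.1 percent figure can be checked directly from the fold scores. The short sketch below is an addition that uses only the numbers printed above: it computes the mean and spread of the five cross-validation folds and the gap to the training-set accuracy.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above
cv_scores = [0.58000961, 0.55574422, 0.54698664, 0.55751362, 0.56706184]
train_accuracy = 0.8249327181853133  # svm.score on the data the model was fit on

cv_mean = mean(cv_scores)
cv_spread = stdev(cv_scores)  # sample standard deviation across the five folds

print("cross-validation accuracy: {:.3f} +/- {:.3f}".format(cv_mean, cv_spread))
print("gap to training accuracy:  {:.3f}".format(train_accuracy - cv_mean))
```

A gap of this size between training and cross-validation accuracy is a sign that the model fits the training phrases much more closely than it generalizes.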

In [70]:
# Predict sentiment scores and print first five values
pred_scores = svm.predict(task3_test)
print(pred_scores[:5])

# Transform the resulting array into a one-column data frame, merge with the phraseId column of the test set, and export it to a csv file
transformed_scores = pd.DataFrame(pred_scores.reshape(-1,1))
score_id = dat_test['PhraseId']

final_scores = pd.concat([score_id, transformed_scores], ignore_index=True, axis=1)
final_scores.columns = ['PhraseId', 'Sentiment']

final_scores.to_csv('Kaggle Submission.csv', index=False)

[3 3 2 3 3]


### Results and Conclusions

As we can see, both the Multinomial Naive Bayes and Support Vector Machine models leave room for improvement. Though they have given promising results, improving them requires more careful calibration of their parameters, as simple parameter changes might not increase model accuracy sufficiently. 

As to the Kaggle competition, the submitted model produced a score of 0.60645, which is not bad for a first attempt but far from the top scores in the competition, which currently sit at around 0.70. Adding lemmatization and analyzing other stop-word choices would likely improve the model further. For the time being, this model provides an acceptable baseline to build upon. 