# Sentiment Analysis

Name:

[Learning Objectives](#Learning-Objectives)

[Background](#Background)

[Rule-Based Sentiment](#Rule-Based-Sentiment-Prediction)

[Test Yourself](#Test-Yourself)

[Limitations](#Limitations-and-Introduction-to-Machine-Learning)

## Learning Objectives
Experience the full data science workflow, from data acquisition and pre-processing to building a model and presenting the results.

* Work with free-form text data.
* Learn and understand two approaches to sentiment analysis.
* Get to know how to use an API to scrape Twitter data.

## Background

Today we want to look at ways to help businesses make the most of their customers' feedback, which often comes as textual reviews or comments. Sentiment analysis, the task of categorizing attitudes towards something, is highly relevant today. Amazon, for example, sells products of all sorts, and those who purchase these items can leave reviews and comments. Besides the ratings that are given, how would a company be able to tell which products are well liked and which ones should be removed?

An easy way is through sentiment analysis, where the goal is to predict the sentiment or positivity/negativity of a product or service solely based on the text provided as comments and reviews. Throughout this lab, we will explore two different ways to predict and understand the sentiment of text data. First, we will work through a simple rule-based algorithm, looking at positive and negative words to determine the classification of reviews. Following this, we will work through a more sophisticated machine learning based approach, allowing us to scale our classification to much larger datasets.

## Rule-Based Sentiment Prediction

Rule-based sentiment prediction is the easier of the two algorithms to learn and implement. In short, we have a list of positive words and a list of negative words, both of which will be used to calculate a "sentiment score" for the review.

For example, let's say we have two sets of words, positive_words and negative_words:

In [125]:
positive_words = ['great', 'awesome', 'happy', 'good', 'exciting', 'love']
negative_words = ['bad', 'dislike', 'sad', 'boring', 'awful', 'poor']

We can also have a set of reviews or text that we want to analyze:

In [126]:
reviews = ['I thought the movie was great! I was very happy I could see it.',
           'I did not like the movie; boring acting, poor attitudes, bad lighting.',
           'The movie was pretty exciting overall, but the sound quality was bad.']

We then go through each review and add or subtract to the sentiment score based on the number of positive or negative words. If there is a positive word, then we add one to the score; a negative word subtracts one.

In [127]:
sentiment_scores = []
for review in reviews:
    sentiment_score = 0
    for word in review.split(' '):
        if word in positive_words:
            sentiment_score += 1
        if word in negative_words:
            sentiment_score -= 1
    sentiment_scores.append(sentiment_score)

We can print out these results to see the overall scores in order of the reviews.

In [128]:
print(sentiment_scores)

[1, -3, 1]


If we do this by hand, we see that the scores don't add up correctly. Why is this? The words of the reviews are split by spaces. Take the first review for example. If we split it by spaces and look at the words, we see that the word great still has the exclamation point with it!

In [129]:
first_review = 'I thought the show was great! I was very happy I could see it.'
first_review_words = first_review.split(' ')
print(first_review_words)

['I', 'thought', 'the', 'show', 'was', 'great!', 'I', 'was', 'very', 'happy', 'I', 'could', 'see', 'it.']


Having the words split only by spaces causes some words to include punctuation, which is something we don't want. We won't touch on this too much, but preprocessing data to make sure words or numbers are functioning correctly can increase performance and accuracy greatly. Making sure that punctuation is removed as well as standardizing to lowercase gives much more control over the text data at hand.

In [130]:
import string

# .lower() changes first_review to all lowercase
# .translate(str.maketrans(input, output, delete)) will replace characters from input with respective 
#      characters in output and deletes what's in delete. 
#      --> for example: translate(str.maketrans("aeiou", "12345", "!")) will replace vowels with their respective 
#          numbers and deletes all exclamation marks
# .split(' ') splits the words into an array based on ' ', or a space
# other functions include:
# .replace(target, new), which will replace all matches of the target string with the new string

new_first_words = first_review.lower().translate(str.maketrans("", "", string.punctuation)).split(" ")
print(new_first_words)

['i', 'thought', 'the', 'show', 'was', 'great', 'i', 'was', 'very', 'happy', 'i', 'could', 'see', 'it']


We can now re-run this code on the reviews to see the appropriate scores that should be allocated.

In [131]:
sentiment_scores = []
for review in reviews:
    sentiment_score = 0
    for word in review.lower().translate(str.maketrans("", "", string.punctuation)).split(" "):
        if word in positive_words:
            sentiment_score += 1
        if word in negative_words:
            sentiment_score -= 1
    sentiment_scores.append(sentiment_score)

print(sentiment_scores)

[2, -3, 0]
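
The loop above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original lab: the names `tokenize` and `score_review` are hypothetical, the word lists are the small example lists from earlier in this section, and sets are used instead of lists purely for faster lookups. It uses `elif`, which behaves like the two `if` statements above as long as no word appears in both lists.

```python
import string

# Small illustrative word lists (the same words used earlier in this section).
POSITIVE_WORDS = {'great', 'awesome', 'happy', 'good', 'exciting', 'love'}
NEGATIVE_WORDS = {'bad', 'dislike', 'sad', 'boring', 'awful', 'poor'}

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    return text.lower().translate(STRIP_PUNCTUATION).split()


def score_review(text, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Return (number of positive words) - (number of negative words)."""
    score = 0
    for word in tokenize(text):
        if word in positive:
            score += 1
        elif word in negative:
            score -= 1
    return score


reviews = [
    'I thought the movie was great! I was very happy I could see it.',
    'I did not like the movie; boring acting, poor attitudes, bad lighting.',
    'The movie was pretty exciting overall, but the sound quality was bad.',
]
print([score_review(review) for review in reviews])  # [2, -3, 0]
```

Calling `.split()` without an argument splits on any run of whitespace, so double spaces or line breaks do not produce empty "words".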


Great! We now have working code that assigns sentiment scores to reviews. The final step is to turn each score into a sentiment label. There are several ways to approach this, depending on the goal. We could treat this as binary classification, where each review is either positive or negative and nothing else. For this, we would assign "Negative" to any review with a score less than zero, and "Positive" to every other review.

In [132]:
review_sentiments = []

for score in sentiment_scores:
    if score >= 0:
        review_sentiments.append("Positive")
    if score < 0:
        review_sentiments.append("Negative")
        
print(review_sentiments)

['Positive', 'Negative', 'Positive']


However, we could also use Multi-class classification, including a "Neutral" class for the reviews that have a score of zero.

In [133]:
review_sentiments = []

for score in sentiment_scores:
    if score > 0:
        review_sentiments.append("Positive")
    if score < 0:
        review_sentiments.append("Negative")
    if score == 0:
        review_sentiments.append("Neutral")
        
print(review_sentiments)

['Positive', 'Negative', 'Neutral']


With all of this in mind, there are no limits to the number of classes or splits that could be made for text data. We could adjust the range for neutral to be any reviews between -1 and 1, or perhaps add in more classes ("Slightly Positive", "Slightly Negative", "Very Positive", "Very Negative", etc...). As long as the data is preprocessed correctly and you have a good set of positive and negative words, you will be able to run sentiment analysis on the majority of text files.
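
As a rough illustration of the adjustable ranges and extra classes described above, here is a minimal sketch. The function name `label_score` and its parameters `neutral_band` and `strong_threshold` are hypothetical and chosen only for this example.

```python
def label_score(score, neutral_band=0, strong_threshold=None):
    """Map a sentiment score to a class label.

    neutral_band:     scores with abs(score) <= neutral_band are "Neutral".
    strong_threshold: if given, scores with abs(score) >= strong_threshold
                      get a "Very " prefix.
    """
    if abs(score) <= neutral_band:
        return "Neutral"
    label = "Positive" if score > 0 else "Negative"
    if strong_threshold is not None and abs(score) >= strong_threshold:
        label = "Very " + label
    return label


scores = [2, -3, 0]

# Three classes, neutral only at exactly zero.
print([label_score(s) for s in scores])
# ['Positive', 'Negative', 'Neutral']

# Neutral for scores between -1 and 1, "Very ..." from an absolute score of 3.
print([label_score(s, neutral_band=1, strong_threshold=3) for s in scores])
# ['Positive', 'Very Negative', 'Neutral']
```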

### Caution!

Though rule-based sentiment analysis is quick and relatively easy to implement, it has several drawbacks that can make it inaccurate. It does not account for misspellings, nor does it take context into account. Take the following two reviews, for example.

"The movie was not good, it was bad" and "The movie was not bad, it was good". 

Both of these reviews end up with the same sentiment score (each contains one positive and one negative word), but they clearly express opposite opinions. This is partly due to the nature of the method: we only look at one word at a time, not at pairs of words. We will not explore this here, but looking at pairs of words or groups of three words (called bi-grams and tri-grams, or n-grams in general) can help reduce such mistakes.
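
To make the idea of looking at word pairs concrete, here is a minimal sketch of one very simple variant: flipping the polarity of a word when the word directly before it is a negation. This is an assumption-laden illustration, not the method used in this lab; the `NEGATIONS` set is deliberately tiny and all function names are hypothetical.

```python
import string

POSITIVE_WORDS = {'good', 'great'}
NEGATIVE_WORDS = {'bad', 'awful'}
NEGATIONS = {'not', 'no', 'never'}  # hypothetical, deliberately tiny list


def tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def word_polarity(word):
    """Return +1 for a positive word, -1 for a negative word, 0 otherwise."""
    if word in POSITIVE_WORDS:
        return 1
    if word in NEGATIVE_WORDS:
        return -1
    return 0


def unigram_score(text):
    """Score each word on its own, as in the rule-based approach above."""
    return sum(word_polarity(word) for word in tokenize(text))


def negation_aware_score(text):
    """Score word pairs: flip a word's polarity if the previous word negates it."""
    words = tokenize(text)
    score = 0
    # Pair every word with its predecessor (None for the first word).
    for previous, current in zip([None] + words, words):
        polarity = word_polarity(current)
        if previous in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


review_a = "The movie was not good, it was bad"
review_b = "The movie was not bad, it was good"
print(unigram_score(review_a), unigram_score(review_b))                # 0 0
print(negation_aware_score(review_a), negation_aware_score(review_b))  # -2 2
```

This only catches a negation that sits immediately before the sentiment word; a phrase such as "not very good" would still be scored as positive.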

Rule-Based Sentiment Analysis also does not take into account the length of the review. If we have a very long review that uses a mix of positive and negative words, it may end up being classified as something it is not. Likewise, a short but very strongly opinionated review may not receive the same sentiment as a longer, equally opinionated review.
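
One possible way to soften the length problem is to divide the score by the number of words in the review. The sketch below assumes the small example word lists from earlier in this section; `raw_score` and `normalized_score` are hypothetical names, and the two reviews are made up for illustration.

```python
import string

POSITIVE_WORDS = {'great', 'awesome', 'happy', 'good', 'exciting', 'love'}
NEGATIVE_WORDS = {'bad', 'dislike', 'sad', 'boring', 'awful', 'poor'}


def tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def raw_score(text):
    """Number of positive words minus number of negative words."""
    return sum((w in POSITIVE_WORDS) - (w in NEGATIVE_WORDS) for w in tokenize(text))


def normalized_score(text):
    """Raw score divided by the number of words; 0.0 for an empty review."""
    words = tokenize(text)
    if not words:
        return 0.0
    return raw_score(text) / len(words)


short_review = "Awful."
long_review = ("The acting was good and the music was good but the plot was boring "
               "and the ending was bad and the pacing was poor")

# Both reviews have the same raw score, but very different normalized scores.
for review in (short_review, long_review):
    print(raw_score(review), round(normalized_score(review), 3))
```

With normalization, the short but emphatic review gets a much stronger negative score than the long, mixed one, even though their raw scores are identical.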

## Test Yourself

In the following code blocks, work through to analyze **real-life movie reviews**! Some of the code is written for you, some will have to be filled in yourself.


In [134]:
# Setup - This cell block is needed to set up everything for this testing section
# No need to edit this cell

import os
import string
import zipfile
import shutil

if not os.path.exists('utility/data/neg'):
    zip_ref = zipfile.ZipFile('utility/data/neg.zip', 'r')
    zip_ref.extractall('utility/data/')
    zip_ref.close()
    print('Unzipped Negative')


if not os.path.exists('utility/data/pos'):
    zip_ref = zipfile.ZipFile('utility/data/pos.zip', 'r')
    zip_ref.extractall('utility/data/')
    zip_ref.close()
    print('Unzipped Positive')
    
with open('utility/data/negative-words.txt') as f:
    negative_words = [word.strip() for word in f.readlines() if word[0] not in [';', '\n']]
    print('Created list of negative words: negative_words')

with open('utility/data/positive-words.txt') as f:
    positive_words = [word.strip() for word in f.readlines() if word[0] not in [';', '\n']]
    print('Created list of positive words: positive_words')

Unzipped Negative
Unzipped Positive
Created list of negative words: negative_words
Created list of positive words: positive_words


The bulk of the code will be executed in the following cell. Fill in what needs to be filled in to perform rule-based sentiment prediction on the **_positive reviews_**! Running this cell will take a little while, as it needs to go through all the reviews and count the positive and negative words in order to get the sentiment score.

In [135]:
# Create a blank sentiment_scores list
import csv


sentiment_scores = []

# Create a variable called file_path for the data folder
# Hint: Data is stored in the folder 'data', with two subfolders being 'neg' or 'pos'

file_path = "utility/data/pos"

for file in os.listdir(file_path):
    file_start = file_path + '/'
    
    # Create a sentiment_score variable for this review, and set it to zero
    sentiment_score = 0
        
    with open(file_start + file, encoding="utf8") as f:

        # Pull the words into a words array
        # The reviews include the string \"<br />\" quite a few times; the data looks cleaner if replaced
        # with a space!

        my_words = f.read()
        my_words = my_words.replace('<br />', ' ')

        # Hint: Remember to read, lower, replace, translate, and split!
        words = my_words.lower().translate(str.maketrans("", "", string.punctuation)).split(" ")
        
        # Loop through the words to generate the sentiment score
        
        for word in words:
            if word in positive_words:
                sentiment_score += 1
            if word in negative_words:
                sentiment_score -= 1
                
       
        # Append the sentiment_score to the sentiment_scores array!
        sentiment_scores.append(sentiment_score)

                                

print('Done Running')

Done Running
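
One reason the cell above takes a while is that `word in positive_words` scans a Python list from start to finish for every word. A possible speed-up, sketched below with tiny stand-in lists rather than the lexicon files from the lab, is to convert the lists to sets once, since set membership tests take roughly constant time on average. The name `score_words` is hypothetical.

```python
# Stand-in word lists; in the lab these are loaded from the lexicon files.
positive_words = ['great', 'good', 'love']
negative_words = ['bad', 'boring', 'poor']

# Convert once, outside the loop over reviews.
positive_set = set(positive_words)
negative_set = set(negative_words)


def score_words(words):
    """Add 1 for each positive word and subtract 1 for each negative word."""
    score = 0
    for word in words:
        if word in positive_set:
            score += 1
        if word in negative_set:
            score -= 1
    return score


print(score_words(['good', 'plot', 'bad', 'boring']))  # -1
```

The scores are the same as with lists; only the lookup cost changes.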



### Analyzing the Results

Now we can see how well our approach predicts the sentiment of those reviews. This phase is a crucial part of the data science workflow and is called model evaluation.

In [136]:
# Positive Reviews:
percent_pos = sum([1 for score in sentiment_scores if score >= 0]) / len(sentiment_scores)*100
print("%.2f%% positive reviews" % (percent_pos))
positive_percent_pos = percent_pos

# Negative Reviews:
percent_neg = sum([1 for score in sentiment_scores if score < 0]) / len(sentiment_scores)*100
print("%.2f%% negative reviews" % (percent_neg))
positive_percent_neg = percent_neg

84.40% positive reviews
15.60% negative reviews


What do these numbers mean? Explain whether our approach works well or not.

84.40% positive reviews means that, of the reviews in the pos folder (our positive examples), the classifier correctly labeled 84.40% as positive. 
15.60% negative reviews means that the classifier wrongly labeled 15.60% of the reviews in the pos folder as negative.

In my opinion, this approach works decently, since it classifies more than 80% of the positive reviews correctly. However, some reviews are wrongly predicted as negative because the approach ignores context: it assigns the sentiment score based on single words, so it treats a phrase like 'not bad' as negative. Hence, it is a decent approach that still leaves considerable room for error.

Repeat the above computations for the **_negative reviews_** and compare the results with the ones above. Is our approach better at predicting positive reviews correctly or negative ones?

When the same analysis is run on the negative reviews, the classifier labels 72.70% of them correctly as negative. However, it gets 27.30% of them wrong and classifies them as positive. 
Since this share of correct negative classifications is lower than the share of correct positive classifications (84.40%) obtained on the positive reviews, we can conclude that the approach predicts positive reviews better than negative ones. Part of this gap may come from the threshold itself, because a score of exactly zero is counted as positive.

In [137]:
# Create a blank sentiment_scores list
import csv


sentiment_scores = []

# Create a variable called file_path for the data folder
# Hint: Data is stored in the folder 'data', with two subfolders being 'neg' or 'pos'

file_path = "utility/data/neg"

for file in os.listdir(file_path):
    file_start = file_path + '/'
    
    # Create a sentiment_score variable for this review, and set it to zero
    sentiment_score = 0
    
    with open(file_start + file, encoding="utf8") as f:

        # Pull the words into a words array
        my_words = f.read()
        my_words = my_words.replace('<br />', ' ')

        words = my_words.lower().translate(str.maketrans("", "", string.punctuation)).split(" ")
        
        # Loop through the words to generate the sentiment score
        for word in words:
            if word in positive_words:
                sentiment_score += 1
            if word in negative_words:
                sentiment_score -= 1
                
       
        # Append the sentiment_score to the sentiment_scores array!
        sentiment_scores.append(sentiment_score)                          

print('Done Running')

percent_pos = sum([1 for score in sentiment_scores if score >= 0]) / len(sentiment_scores)*100
print("%.2f%% positive reviews" % (percent_pos))
negative_percent_pos = percent_pos

# Negative Reviews:
percent_neg = sum([1 for score in sentiment_scores if score < 0]) / len(sentiment_scores)*100
print("%.2f%% negative reviews" % (percent_neg))
negative_percent_neg = percent_neg

Done Running
27.30% positive reviews
72.70% negative reviews


What is the overall performance of our rule-based sentiment predictor? Compute the percentage of correctly predicted reviews (this measure is also called _accuracy_) and the percentage of incorrectly predicted reviews (this measure is also called _error rate_). Hint: both measures should add up to 100%.

In [138]:
accuracy = (positive_percent_pos + negative_percent_neg)/2
print("Accuracy ",accuracy)
error = (positive_percent_neg + negative_percent_pos)/2
print("Error rate ",error)

Accuracy  78.55
Error rate  21.45
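
Averaging the two per-class percentages, as in the cell above, gives the overall accuracy only when both classes contain the same number of reviews. The sketch below illustrates the difference with made-up counts (they are not taken from the lab data); both function names are hypothetical.

```python
def overall_accuracy(correct_pos, total_pos, correct_neg, total_neg):
    """Percentage of all reviews that were classified correctly."""
    return 100 * (correct_pos + correct_neg) / (total_pos + total_neg)


def mean_class_accuracy(correct_pos, total_pos, correct_neg, total_neg):
    """Average of the two per-class percentages."""
    return (100 * correct_pos / total_pos + 100 * correct_neg / total_neg) / 2


# Made-up counts with balanced classes: both measures agree.
balanced = (844, 1000, 727, 1000)
print(overall_accuracy(*balanced), mean_class_accuracy(*balanced))

# Made-up counts with imbalanced classes: the two measures differ.
imbalanced = (844, 1000, 70, 100)
print(overall_accuracy(*imbalanced), mean_class_accuracy(*imbalanced))
```

When the classes are imbalanced, the overall accuracy is dominated by the larger class, so it helps to report the per-class percentages alongside it.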


## Limitations and Introduction to Machine Learning
The rule-based sentiment predictor has many advantages, including that it is so simple to implement. With just a couple of extensions to our version (such as negation handling), we could actually make this production ready. However, the main drawback of this approach is that we need **hand-engineered** lists of positive and negative expressions, which are non-trivial to create and also static. That means they don't adapt automatically to the domain they are being used for. For example, expressions might have a different meaning in formal language than in a colloquial context.

How can we overcome this problem? Can we maybe learn what expressions are used in a positive versus a negative review? The answer is '_yes - we can!_'

### Quick Introduction to Sentiment Classification

Instead of working with lists of positive and negative expressions, we will now look at reviews with known ratings and use those to learn what distinguishes positive from negative reviews. With that known set of positive and negative reviews, we can build a model and then use it on new comments and reviews to determine a customer's attitude. This is a **machine learning** approach commonly known as _classification_.

Okay, let's do it: 

In [139]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data_folder = "utility/data/"
dataset = load_files(data_folder, shuffle=False)
docs_raw = dataset.data

## Text preprocessing
docs_all = []
for doc in docs_raw:
    docs_all.append(doc.decode('utf-8', errors='replace')) # prevent UnicodeDecodeError
y_all = dataset.target

# Tokenize the text and drop rare words (min_df=5 keeps only words that occur in at least 5 documents)
count_vect = CountVectorizer(min_df=5)  
X_all_counts = count_vect.fit_transform(docs_all)

# number of docs and number of words
print("Number of documents: " + str(X_all_counts.shape[0])) 
print("Number of words: " + str(X_all_counts.shape[1])) 
    # X_all_counts data representation (* = occurrence count):
    #    - - - - -
    #  |
    #  |  *        <- document
    #  |
    #  |
    #     ^
    #    word index

Number of documents: 4000
Number of words: 8870
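
To see what the count matrix looks like, here is a minimal sketch on three made-up documents (not the lab data). It uses `CountVectorizer` with default settings and then with `min_df=2`, a lower threshold than the `min_df=5` above because the toy corpus is so small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus for illustration only.
toy_docs = ["the movie was great", "the movie was bad", "great great acting"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index.
for word, column in sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]):
    print(column, word)

# Each entry is how often the word (column) occurs in the document (row).
print(counts.toarray())

# min_df=2 drops every word that occurs in fewer than 2 documents.
filtered = CountVectorizer(min_df=2).fit_transform(toy_docs)
print(counts.shape, filtered.shape)  # (3, 6) (3, 4)
```

In the toy corpus, "acting" and "bad" each occur in only one document, so `min_df=2` removes them and two columns disappear.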


After **preprocessing** the text documents, we **split** our data into two parts: one for building the model (_training set_) 
and one for testing/evaluating it (_test/evaluation set_). Then we **build the model** using the _training set_ and use the model to **predict** the sentiment of the documents in the _test set_. 

In [140]:
# Split the data into two parts 
X_train, X_test, y_train, y_test = train_test_split(X_all_counts, y_all, train_size = .8, test_size = .2, random_state = 16)

print("Size of the training set: " + str(X_train.shape[0]))
print("Size of the test/evaluation set: " + str(X_test.shape[0]))

#Build the model using a linear classification model
model = LogisticRegression(max_iter = 10000).fit(X_train,y_train)

#Use the classification model for predictions
predicted_target = model.predict(X_test)

Size of the training set: 3200
Size of the test/evaluation set: 800


### Coding Task
Write a function that goes through all the test data and compares the predicted class with the actual class. If the model puts an entry into the wrong class, the function adds one to the respective variable: fneg_error_count if it is a _false negative_, fpos_error_count if it is a _false positive_. From these values you can compute the total _number of mistakes made_, the _error rate_, and the _accuracy_ of the machine learning approach. 


Then, this function will print out the total number of errors, the number of false negatives, and the number of false positives. The two parameters for this function are the predicted classification for each review generated by the model and the actual classification from the dataset.

In [141]:
def test_predictions(predictions, actual):
    
    fneg_error_count = 0
    fpos_error_count = 0
    mistakes = 0
    error_rate = 0
    accuracy = 0

    for idx, x in enumerate(predictions):
        if(x != actual[idx] and x == 0):
            fneg_error_count = fneg_error_count + 1
            mistakes = mistakes + 1
        elif (x!= actual[idx] and x == 1):
            fpos_error_count = fpos_error_count + 1
            mistakes = mistakes + 1
    
    error_rate = (mistakes /len(predictions)) * 100
    accuracy = 100 - error_rate
 
    print("There were a total of " + str(mistakes) + " errors out of " + str(len(predictions)) + " testpoints.")
    print("There were " + str(fneg_error_count) + " false negative errors")
    print("There were " + str(fpos_error_count) + " false positive errors")
    print("The algorithm was wrong in " + str(error_rate) + "% of the test cases." )
    print("The algorithm was correct in " + str(accuracy) + "% of the test cases." )

### Evaluate the Sentiment Classifier
Now, we can call this function using our predicted sentiments and the ground truth sentiments as input: 

In [142]:
test_predictions(predicted_target, y_test)

There were a total of 77 errors out of 800 testpoints.
There were 30 false negative errors
There were 47 false positive errors
The algorithm was wrong in 9.625% of the test cases.
The algorithm was correct in 90.375% of the test cases.
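
The same counts can be cross-checked with scikit-learn's built-in metrics. The sketch below uses made-up labels instead of the lab's predictions and assumes, as the function above does, that 0 stands for negative and 1 for positive.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 0 = negative, 1 = positive.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows of the confusion matrix are actual classes, columns are predicted classes.
# For the label order [0, 1], ravel() yields: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()

print("false negatives:", fn)  # 1
print("false positives:", fp)  # 1
print("accuracy:", accuracy_score(actual, predicted))  # 0.75
```

In the notebook, the equivalent call would pass `y_test` as the actual labels and `predicted_target` as the predictions.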


### Things to try
* Play with the train/test split sizes. We used an 80/20 split, but you can change this and see if it has an effect on the results. 
* Play with the random seed to create different train/test splits. How does this affect the results? 
* Use a different classifier, for example, Naive Bayes or a Support Vector Machine (SVM). Code examples are below; replace the model computation in the cell above with the respective lines to compute these different models. Do these models produce different results?

In [143]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train, y_train)

from sklearn.svm import LinearSVC
model = LinearSVC(max_iter = 10000).fit(X_train,y_train)

In [144]:
# USING DIFFERENT TRAIN/TEST SPLIT SIZE

# Use case 1 : train/test split size is 60/40 
X_train, X_test, y_train, y_test = train_test_split(X_all_counts, y_all, train_size = .6, test_size = .4, random_state = 16)

print("Size of the training set: " + str(X_train.shape[0]))
print("Size of the test/evaluation set: " + str(X_test.shape[0]))

#Build the model using a linear classification model
model = LogisticRegression(max_iter = 10000).fit(X_train,y_train)

#Use the classification model for predictions
predicted_target = model.predict(X_test)

test_predictions(predicted_target, y_test)


# Use case 2 : train/test split size is 90/10
X_train, X_test, y_train, y_test = train_test_split(X_all_counts, y_all, train_size = .9, test_size = .1, random_state = 16)

print("Size of the training set: " + str(X_train.shape[0]))
print("Size of the test/evaluation set: " + str(X_test.shape[0]))

#Build the model using a linear classification model
model = LogisticRegression(max_iter = 10000).fit(X_train,y_train)

#Use the classification model for predictions
predicted_target = model.predict(X_test)

test_predictions(predicted_target, y_test)

Size of the training set: 2400
Size of the test/evaluation set: 1600
There were a total of 163 errors out of 1600 testpoints.
There were 68 false negative errors
There were 95 false positive errors
The algorithm was wrong in 10.1875% of the test cases.
The algorithm was correct in 89.8125% of the test cases.
Size of the training set: 3600
Size of the test/evaluation set: 400
There were a total of 39 errors out of 400 testpoints.
There were 14 false negative errors
There were 25 false positive errors
The algorithm was wrong in 9.75% of the test cases.
The algorithm was correct in 90.25% of the test cases.


As we can observe, the accuracy changes with the size of the training set, but only slightly. With a 60/40 train/test split the accuracy (89.81%) is lower than with the 80/20 split (90.375%), which suggests that more training data helps. The 90/10 split, however, reaches 90.25%, marginally below the 80/20 result. The differences are small, and the 90/10 test set contains only 400 reviews, so a handful of errors shifts the percentage noticeably; these runs alone do not show that more training data always gives better accuracy.

In [145]:
# USING DIFFERENT RANDOM SEED

# Split the data into two parts 
X_train, X_test, y_train, y_test = train_test_split(X_all_counts, y_all, train_size = .8, test_size = .2, random_state = 4)

print("Size of the training set: " + str(X_train.shape[0]))
print("Size of the test/evaluation set: " + str(X_test.shape[0]))

#Build the model using a linear classification model
model = LogisticRegression(max_iter = 10000).fit(X_train,y_train)

#Use the classification model for predictions
predicted_target = model.predict(X_test)
test_predictions(predicted_target, y_test)

Size of the training set: 3200
Size of the test/evaluation set: 800
There were a total of 67 errors out of 800 testpoints.
There were 30 false negative errors
There were 37 false positive errors
The algorithm was wrong in 8.375% of the test cases.
The algorithm was correct in 91.625% of the test cases.


The random state is not a property of the model itself; here it seeds the random shuffling that train_test_split performs before dividing the data. Setting random_state to a different integer puts different reviews into the training and test sets, so the result changes even though the data and the model type stay the same.

In the above example, with random_state = 16 the accuracy was 90.375%, whereas with random_state = 4 it was 91.625%.
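
Because a single split can be lucky or unlucky, one common way to get a steadier estimate is to repeat the split with several seeds and look at the mean and spread of the accuracy. The sketch below is self-contained and therefore uses synthetic stand-in data from `make_classification` rather than the review counts; in the notebook, `X_all_counts` and `y_all` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the movie reviews).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

accuracies = []
for seed in range(10):
    # A different seed produces a different train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=seed)
    model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))  # fraction of correct predictions

print("mean accuracy: %.3f" % np.mean(accuracies))
print("std of accuracy: %.3f" % np.std(accuracies))
```

If two settings differ by less than the spread across seeds, the difference is hard to distinguish from the randomness of the split.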

In [146]:
# USING DIFFERENT CLASSIFIER

# Split the data into two parts 
X_train, X_test, y_train, y_test = train_test_split(X_all_counts, y_all, train_size = .8, test_size = .2, random_state = 16)

print("Size of the training set: " + str(X_train.shape[0]))
print("Size of the test/evaluation set: " + str(X_test.shape[0]))

#Build the model using a NaiveBayes model
model = MultinomialNB().fit(X_train, y_train)

#Use the classification model for predictions
predicted_target = model.predict(X_test)
test_predictions(predicted_target, y_test)


# Split the data into two parts 
X_train, X_test, y_train, y_test = train_test_split(X_all_counts, y_all, train_size = .8, test_size = .2, random_state = 16)

print("Size of the training set: " + str(X_train.shape[0]))
print("Size of the test/evaluation set: " + str(X_test.shape[0]))

#Build the model using Linear SVC
model = LinearSVC(max_iter = 10000).fit(X_train,y_train)

#Use the classification model for predictions
predicted_target = model.predict(X_test)
test_predictions(predicted_target, y_test)

Size of the training set: 3200
Size of the test/evaluation set: 800
There were a total of 75 errors out of 800 testpoints.
There were 50 false negative errors
There were 25 false positive errors
The algorithm was wrong in 9.375% of the test cases.
The algorithm was correct in 90.625% of the test cases.
Size of the training set: 3200
Size of the test/evaluation set: 800
There were a total of 79 errors out of 800 testpoints.
There were 29 false negative errors
There were 50 false positive errors
The algorithm was wrong in 9.875% of the test cases.
The algorithm was correct in 90.125% of the test cases.


As we can see, on the same 80/20 split (random_state = 16):

MultinomialNB : algorithm was correct in 90.625% of the test cases

LinearSVC : algorithm was correct in 90.125% of the test cases

LogisticRegression : algorithm was correct in 90.375% of the test cases

All three models produce slightly different results, but they lie within about half a percentage point of each other.

So, it turns out that this performs quite well. Of course, we can do more fancy things with the text data, instead of only counting word occurrences. In practice, people use for instance the counts of _pairs of words_ (so-called _bi-grams_) or even _n-grams_ (counts of tuples of n words), or a feature called _TF-IDF_, which is very powerful in practice. This is a nice tutorial explaining how to compute those: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html. 
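
As a rough sketch of what bi-grams and TF-IDF look like in code, the example below builds a small pipeline on four made-up reviews. The toy documents and labels are assumptions for illustration only; in the notebook, `docs_all` and `y_all` would be used instead, and the resulting accuracy would have to be measured as before.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy reviews: 0 = negative, 1 = positive.
toy_docs = [
    "the movie was not good it was bad",
    "the movie was not bad it was good",
    "not good at all and quite boring",
    "not bad at all and quite exciting",
]
toy_labels = [0, 1, 0, 1]

# ngram_range=(1, 2) uses single words and pairs of neighbouring words as features;
# TfidfVectorizer weights the counts by TF-IDF instead of using raw counts.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=10000),
)
pipeline.fit(toy_docs, toy_labels)

# make_pipeline names each step after its lowercased class name.
vocabulary = pipeline.named_steps["tfidfvectorizer"].vocabulary_
print("not good" in vocabulary, "not bad" in vocabulary)  # True True
```

With bi-grams in the vocabulary, "not good" and "not bad" become features of their own, so the model can learn different weights for them than for "good" and "bad" alone.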

## Conclusion
Compare the two approaches, **rule-based sentiment prediction** versus **sentiment classification**. What are the main differences in terms of... 
* required data?

Rule-based sentiment analysis requires a hand-engineered list of categorized words (e.g., the lists of positive and negative words in our case) in order to predict the sentiment.
Sentiment classification, on the other hand, requires a set of already classified data to serve as the training set for the machine learning model. To be trained properly and make accurate predictions, the model requires a lot of data.

* quality of the results?

The quality of the results is generally better with sentiment classification than with rule-based sentiment prediction, for several reasons. In our experiments the rule-based predictor reached 78.55% accuracy, while the classifiers reached about 90%. One main reason is that the rule-based approach does not adapt automatically to the domain it is used in, so a term that means one thing in formal language may mean something different in colloquial language. Another reason is the static, hand-engineered word lists: since the words used for classification are pre-defined, any new word that is not on one of the lists is disregarded. Sentiment classification, on the other hand, adapts to the domain of the training data. Since it learns from the words in the training set, it is more likely to make accurate predictions on test data from the same domain.

* efficiency of the computation?

When the volume of data is low and the rules are relatively simple, rule-based classification can be more efficient in terms of computation and time than sentiment classification, since it needs no training step. With large volumes of data, however, sentiment classification tends to be the more efficient of the two.

* possibilities to extend the basic algorithms? 

The basic algorithms can definitely be extended and improved. One way is to use n-grams to better capture the sentiment of a sentence that mixes positive and negative words. Take the sentence 'I do not like coffee'. It contains the positive word 'like', and if 'not' is counted as a negative word the sentiment score becomes 0, i.e. neutral. We know that this sentence has a negative connotation, but the classifier labels it as neutral, which is incorrect. With bi-grams, 'not like' can be correctly classified as negative. Another way to extend the algorithm is to include varying levels of sentiment such as 'positive', 'highly positive', 'negative', 'highly negative', and 'neutral'. This allows the data to be classified at a finer granularity and brings the labels closer to the real sentiment.

Make a list of _pros_ and _cons_ for both approaches and also think of use cases/applications for either technique. 

Rule-based Prediction:

pros : 
    1) Very easy to implement
    2) Training data not required
    3) A good technique to gather data since one can set up the system with rules and let data accumulate spontaneously when users interact with it. 

cons :
    1) Does not take into account context/domain or misspellings
    2) Does not take into account the length of the review. A long review that mixes positive and negative words can be classified incorrectly, and a short, strongly worded review does not receive the same score as a long, equally strong one.
    3) With huge amounts of data, this approach can become inefficient.

use case : Cold-start scenarios. When there is no data to begin with (the cold start problem in machine learning), a rule-based approach makes sense because it does not require any training data. It is also useful when only a low volume of data is available and we need to make predictions.


Sentiment classification:

pros : 
    1) Does not require pre-defined hand-engineered list of rules
    2) Takes into account domain of text and is dynamic based on the training data.
    3) Useful for large amount of data

cons :
    1) Adaptability and speed of machine learning systems come at a cost
    2) Training data set is required
    3) To improve the accuracy of its predictions, the machine learning algorithm requires a lot of data

use case: This approach is well suited to complex problems in a variable environment/domain. It is also particularly useful for high-volume use cases.

## Clean-up
Please run the following cell in order to clean up some of the files on your computer. While not mandatory, it will certainly save some space (over 4000 files are already unzipped, this will clear space).

In [147]:
# Run this to clean folders (unless you want to keep several thousand text files on your computer!)

if os.path.exists('utility/data/neg'):
    shutil.rmtree('utility/data/neg')
if os.path.exists('utility/data/pos'):
    shutil.rmtree('utility/data/pos')