# Naive Bayes Lab
## Text Classification with Naive Bayes



<h3>1. Reading files and storing their contents in two lists of strings</h3>

In [1]:
data_dir = "../../data_ml_2020/movies_reviews"

In [2]:
import os

# list the contents of the data directory to confirm the expected folders exist
print(os.listdir(data_dir))

['.DS_Store', 'new_movies_reviews', 'neg', 'pos']


First, we read every review in the `neg` folder into the list `neg_lst` and every review in the `pos` folder into the list `pos_lst`.

In [3]:
neg_lst = []
neg_path = data_dir + "/neg"
for path in os.listdir(neg_path):
    if path.endswith('.txt'):
        with open(neg_path + '/' + path) as f:
            neg_lst.append(f.read())

print("Num of negative reviews: {}".format(len(neg_lst)))

pos_lst = []
pos_path = data_dir + "/pos"
for path in os.listdir(pos_path):
    if path.endswith('.txt'):
        with open(pos_path + '/' + path) as f:
            pos_lst.append(f.read())

print("Num of positive reviews: {}".format(len(pos_lst)))

Num of negative reviews: 1000
Num of positive reviews: 1005
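
The two loops above differ only in the folder name, so they could be folded into one helper. The following is a minimal sketch, not the code that produced the output above; `read_reviews` is a hypothetical name, and the explicit `encoding="utf-8"` is an assumption about how the review files are stored.

```python
from pathlib import Path


def read_reviews(folder):
    """Return the text of every .txt file in `folder`, in sorted filename order."""
    # sorted() makes the order reproducible; os.listdir returns files in arbitrary order.
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]
```

It would be called as `neg_lst = read_reviews(data_dir + "/neg")`. Because sorting changes the order of the reviews, a later shuffled split would contain different reviews and the accuracies could differ slightly from the ones reported in this lab.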


<h3>2. Assigning class labels based on directory and combining small lists into one big list</h3>

We assign the label 1 to negative reviews and 0 to positive reviews. We then concatenate the two lists, neg_lst and pos_lst, into a single array X, and concatenate the labels in the same order into a single array Y, so that Y[i] is the class label of X[i] (i.e. 1 for negative and 0 for positive).

In [4]:
import numpy as np

Y_neg = np.ones((len(neg_lst),)) # ones for negative
Y = np.concatenate((Y_neg, np.zeros((len(pos_lst),)))) # zeros for positive

X = np.concatenate((neg_lst,pos_lst))
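
A quick sanity check helps confirm that texts and labels stayed aligned after concatenation. The sketch below uses two tiny stand-in lists (the review strings are made up for illustration) rather than the real data.

```python
import numpy as np

# Made-up stand-ins for the real review lists.
neg_lst = ["dull plot", "bad acting", "a waste of time"]
pos_lst = ["great fun", "a moving story"]

# Same convention as the lab: 1 = negative, 0 = positive.
Y = np.concatenate((np.ones(len(neg_lst)), np.zeros(len(pos_lst))))
X = np.concatenate((neg_lst, pos_lst))

assert X.shape == Y.shape  # exactly one label per review

# Count how many reviews carry each label.
labels, counts = np.unique(Y, return_counts=True)
class_counts = {int(label): int(count) for label, count in zip(labels, counts)}
print(class_counts)  # {0: 2, 1: 3}
```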



<h3>3. Split Data</h3>

We then split X and Y into training and test sets, holding out 20% of the reviews for testing.

In [5]:
from sklearn.model_selection import train_test_split

# split the dataset into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=20)
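
With `test_size=0.2` and 2005 reviews, this split holds out 401 reviews for testing and keeps 1604 for training. If preserving the class ratio in both parts matters, `train_test_split` also accepts a `stratify` argument. A minimal sketch on placeholder data (the strings are stand-ins, not real reviews):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same class sizes as the lab: 1000 negative, 1005 positive.
X = np.array(["review {}".format(i) for i in range(2005)])
Y = np.concatenate((np.ones(1000), np.zeros(1005)))

# stratify=Y keeps the negative/positive ratio the same in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=20, stratify=Y
)

print(len(X_train), len(X_test))      # 1604 401
print(Y_train.mean(), Y_test.mean())  # share of negative reviews in each part
```

Stratifying changes which reviews land in each part, so accuracies obtained this way would not match the numbers reported in this lab exactly.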



<h3>4. Transformation</h3>

Next we turn the text into numeric features. CountVectorizer assigns each unique word an index and counts how often each word occurs in each document. The result is a matrix in which each row represents a document and each column represents a unique word. The value in a cell tells us how many times that word (the column) occurs in that document (the row).

If we weighted all words equally, very common words such as 'they', which occur in almost every text, would look important even though they say little about sentiment. So we need a way to down-weight them.

For that we can use TfidfTransformer.
We pass the count matrix from CountVectorizer to the Term Frequency Inverse Document Frequency transformer (TfidfTransformer), which scales each count by how rare the word is across all documents. A word such as "the" therefore receives a low weight even though it occurs many times in many documents.
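
To make the two steps concrete, the sketch below works through them by hand on a made-up three-document corpus. It assumes scikit-learn's default settings for `TfidfTransformer` (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The tokenizer is a simplified whitespace split, not `CountVectorizer`'s actual tokenization.

```python
import math
from collections import Counter

docs = ["they loved the film", "they hated the film", "they left early"]

# Step 1: bag-of-words counts, one Counter per document.
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(word for c in counts for word in c))

# Step 2: smoothed inverse document frequency for each word.
n_docs = len(docs)
doc_freq = {w: sum(1 for c in counts if w in c) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


# Step 3: weight counts by idf, then scale each row to unit length.
def tfidf_row(c):
    raw = [c[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


tfidf = [tfidf_row(c) for c in counts]

# Compare the weight of a word found in every document with a rarer one.
first = dict(zip(vocab, tfidf[0]))
print(first["they"] < first["loved"])  # True
```

In the first document, "they" (present in all three documents) ends up with a smaller weight than "loved" (present in only one), which is the down-weighting effect described above.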



In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

count_vect = CountVectorizer()
X_train_count = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_count)

<h3>5. Training the Model</h3>

We then build a model using the training set. I use ComplementNB because it works well for text classification and is designed to cope with imbalanced classes. Our dataset is slightly imbalanced, with 1000 negative reviews and 1005 positive reviews. On imbalanced datasets, ComplementNB often gives higher accuracy than MultinomialNB or GaussianNB.

ComplementNB is an adapted form of MultinomialNB. MultinomialNB can perform poorly on imbalanced datasets, that is, datasets where one class has many more examples than another. Here the number of positive reviews exceeds the number of negative reviews by only 5, so the imbalance is very small and MultinomialNB would also be a reasonable choice. I still considered ComplementNB worth trying.

I tried both MultinomialNB and ComplementNB and found that ComplementNB produced slightly better accuracy (results are shown below), so I use ComplementNB.
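
For reference, the ComplementNB decision rule can be stated compactly. The formulas below follow the description in scikit-learn's user guide and are a summary, not a derivation. The notation is introduced here for illustration: $d_{ij}$ is the count (or tf-idf value) of term $i$ in training document $j$, $y_j$ is that document's class, $\alpha_i$ is the smoothing value for term $i$ with $\alpha = \sum_i \alpha_i$, and $t_i$ is the value of term $i$ in the document being classified.

$$
\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j : y_j \neq c} d_{ij}}{\alpha + \sum_{j : y_j \neq c} \sum_{k} d_{kj}},
\qquad
w_{ci} = \log \hat{\theta}_{ci},
\qquad
\hat{c} = \arg\min_{c} \sum_{i} t_i \, w_{ci}
$$

In words: for each class, ComplementNB estimates term weights from all documents that do not belong to that class, then picks the class whose complement matches the document worst. MultinomialNB instead estimates its weights from the documents inside each class.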

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB


#clf = MultinomialNB().fit(X_train_tfidf, Y_train)
clf = ComplementNB().fit(X_train_tfidf, Y_train)


In [8]:
from sklearn.metrics import accuracy_score

# accuracy on the training set
acc_train = clf.score(X_train_tfidf, Y_train)
print("train accuracy: {} ".format(acc_train))


# cross-check with accuracy_score
predicted = clf.predict(X_train_tfidf)
print("train accuracy", accuracy_score(Y_train, predicted))


train accuracy: 0.9706982543640897 
train accuracy 0.9706982543640897


In [9]:
# transform the test set with the vocabulary and idf weights fitted on the training set (no refitting)
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# to get accuracy for testing set
acc_test = clf.score(X_test_tfidf, Y_test)
print("test accuracy: {} ".format(acc_test))


# testing with accuracy_score
predicted = clf.predict(X_test_tfidf)

print("test accuracy", accuracy_score(Y_test, predicted))


test accuracy: 0.8354114713216958 
test accuracy 0.8354114713216958
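
Because the same `transform` steps must be repeated for every new batch of text, the vectorizer, the tf-idf step, and the classifier can be bundled into one scikit-learn pipeline. A minimal sketch, using made-up sentences in place of the real reviews; it is an alternative arrangement, not the code that produced the numbers above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Made-up training sentences; 1 = negative, 0 = positive (the lab's convention).
train_texts = ["terrible boring film", "awful waste of time",
               "wonderful moving story", "great fun to watch"]
train_labels = [1, 1, 0, 0]

model = make_pipeline(CountVectorizer(), TfidfTransformer(), ComplementNB())
model.fit(train_texts, train_labels)  # fits all three steps on the training text only

# predict() applies the fitted count and tf-idf transforms before classifying.
predicted = model.predict(["boring and awful", "great story"])
print(predicted)
```

With a pipeline, there is no separate `count_vect.transform` / `tfidf_transformer.transform` call to forget when scoring test data or new reviews.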


<h4>Note:</h4>
I tried both MultinomialNB and ComplementNB. As shown below, ComplementNB gives slightly better test accuracy than MultinomialNB: about one percentage point, which corresponds to 4 more of the 401 test reviews classified correctly.

With MultinomialNB:

    train accuracy: 0.9675810473815462
    test accuracy: 0.8254364089775561


With ComplementNB:

    train accuracy: 0.9706982543640897
    test accuracy: 0.8354114713216958
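
A single train/test split can favor one model by chance, so cross-validation gives a steadier comparison. The sketch below shows the mechanics on a small made-up corpus, so its scores say nothing about the real data; `compare_models` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_models(texts, labels, cv=5):
    """Return the mean cross-validated accuracy of each Naive Bayes variant."""
    results = {}
    for name, clf in [("MultinomialNB", MultinomialNB()),
                      ("ComplementNB", ComplementNB())]:
        # TfidfVectorizer combines counting and tf-idf weighting in one step.
        # Inside the pipeline it is refit on the training part of every fold.
        pipe = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
        results[name] = scores.mean()
    return results


# Made-up corpus: 10 negative (1) and 10 positive (0) snippets.
texts = ["bad boring film number {}".format(i) for i in range(10)] + \
        ["good exciting film number {}".format(i) for i in range(10)]
labels = [1] * 10 + [0] * 10

results = compare_models(texts, labels)
print(results)
```

On the real data, the call would be `compare_models(X, Y)` with the raw review texts, since the pipeline handles vectorization itself.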


<h3>6. Testing with New Reviews</h3>

<p>
    I took 5 new movie reviews from IMDB, each of which comes with a rating out of 10,
    and used them to see how our classifier performs on reviews it has not seen before.
    The new reviews are available
    <a href="https://drive.google.com/drive/folders/1XRjgANGakdyKKMXL9zZFz67mtwIt7lCr?usp=sharing">here</a>.
</p>

In [10]:
new_data_dir = "../../data_ml_2020/movies_reviews/new_movies_reviews"

In [11]:
new_reviews_lst = []
for path in sorted(os.listdir(new_data_dir)):
    if path.endswith('.txt'):
        with open(new_data_dir + '/' + path) as f:
            new_reviews_lst.append(f.read())
        

In [12]:
X_new_counts = count_vect.transform(new_reviews_lst)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

print(predicted)

# 0's for positive and 1's for negative

[0. 0. 0. 1. 0.]


<h4>Note:</h4> 
<p>
    I treat ratings above 5/10 as positive and all other ratings as negative.
</p>

<h4>Result</h4>

[0. 0. 0. 1. 0.]

<ul>
<li>
    1st file -> Rating: 7/10; predicted: <b>positive</b>; actual: <b>positive</b>
</li>
<li>
    2nd file -> Rating: 10/10; predicted: <b>positive</b>; actual: <b>positive</b>
</li>
<li>
    3rd file -> Rating: 3/10; predicted: <b>positive</b>; actual: <b>negative</b>
</li>
<li>
    4th file -> Rating: 3/10; predicted: <b>negative</b>; actual: <b>negative</b>
</li>
<li>
    5th file -> Rating: 8/10; predicted: <b>positive</b>; actual: <b>positive</b>
</li>
</ul>

<p>The classifier predicted 4 of the 5 new reviews correctly.</p>
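
The labeling rule and the resulting score can be written out explicitly. In this sketch, `rating_to_label` is a hypothetical helper; the ratings and predictions are the five values reported in this section.

```python
def rating_to_label(rating):
    """Map a rating out of 10 to the lab's labels: 0 = positive, 1 = negative."""
    return 0 if rating > 5 else 1


ratings = [7, 10, 3, 3, 8]      # ratings of the five new reviews
predicted = [0, 0, 0, 1, 0]     # classifier output for the same reviews

actual = [rating_to_label(r) for r in ratings]
n_correct = sum(p == a for p, a in zip(predicted, actual))

print(actual)                    # [0, 0, 1, 1, 0]
print(n_correct / len(actual))   # 0.8
```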

<p>
    The classifier misclassified the 3rd text file. That review contains sentences such as "A 3-star is probably the best rating that I can give here." I think the classifier treats 'best' as a positive signal even though the reviewer gives only 3 out of 10 stars. The reviewer then explains why 3 stars were given, and that explanation contains some positive words, which may be why the classifier labels the review positive when it is actually negative. The reviewer also states "IMDB won't allow me to give a ZERO star for the worst movie of all time".
</p>
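
One way to see why such a review is hard for this model: a bag-of-words representation discards word order, so phrases with opposite meanings can produce identical features. A small sketch with made-up sentences, using a plain whitespace split as a stand-in for `CountVectorizer`:

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


a = bag_of_words("not the worst but the best")
b = bag_of_words("not the best but the worst")

print(a == b)  # True: both sentences map to the same counts
```

A word like "best" contributes the same feature whether it appears in praise or in a sentence such as the one quoted above.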




<h3>7. References</h3>

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html#sklearn.naive_bayes.ComplementNB

https://heartbeat.fritz.ai/understanding-naive-bayes-its-applications-in-text-classification-part-1-ec9caea4baae

https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3

https://www.geeksforgeeks.org/complement-naive-bayes-cnb-algorithm/

For new review 1: https://www.imdb.com/review/rw0893947/?ref_=ur_urv

For new review 2: https://www.imdb.com/review/rw2764760/?ref_=tt_urv

For new review 3: https://www.imdb.com/review/rw6094710/?ref_=ur_urv

For new review 4: https://www.imdb.com/review/rw3315212/?ref_=tt_urv

For new review 5: https://www.imdb.com/review/rw2770824/?ref_=tt_urv