# Analiza sentimentelor folosind algoritmul Naive Bayes

*Alexandru Fera*

## Introducere
Scopul proiectului este de a prezice sentimenul unei recenzii date, mai exact, dacă recenzia dată are un sentiment negativ sau pozitiv. Ca date de antrenare a fost folosit *dataset*-ul numit *Large Movie Review Dataset*, care se poate descărca de la următoare adresă de internet: http://ai.stanford.edu/~amaas/data/sentiment/.

Acest dataset conține 25 000 de recenzii deja clasificate ca fiind pozitive sau negative și 25 000 de recenzii pentru testare. În directorul rădăcină găsim două foldere train/ și test/, care corespund recenziilor destinate pentru a fi folosite la antrenat și recenziilor care pot fi folosite pentru testare. Fiecare folder conține două foldere: pos/ și neg/. În interiorul acestor foldere fiecare recenzie este stocată într-un fișier text, al cărui nume este dat de următoare convenție: [[id]_[rating].txt], unde id este un număr unic, rating este numărul de stele dat acelei recenzii.

## Bibliotecile folosite:
- scikit-learn (funcții pentru data mining, machine learning, conține implementare algoritmului Naive Bayes)
- pandas (funcții pentru reprezentarea datelor)
- nltk (funcții pentru procesarea limbajului natural, de exemplu: stemming)

In [1]:
import os
import nltk.stem
import re
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

Definim o funcție care încarcă recenziile de pe disk în memorie.

Pentru demonstrație ne vom limita la doar 2 000 de recenzii din totalul de 25 000 de recenzii disponibile.

In [12]:
number_of_reviews = 2000

In [3]:
def load_reviews(directory_path):
    current_number_of_files = 0
    lines = []
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            current_number_of_files += 1
            if current_number_of_files > number_of_reviews:
                break
            file_path = os.path.join(root, file)
            with open(file_path, "r", encoding="utf-8") as f:
                f_content = f.read()
                f_content = ''.join(i for i in f_content if not i.isdigit())
                lines.append(f_content)

    return lines

Definim o funcție care citește toate recenziile de pe disc și le încarcă în memorie într-o matrice cu două coloane:
- prima coloană denotă sentimentul: pozitive/negative (0 sau 1)
- a doua coloană reprezintă textul recenziei

In [4]:
def load_dataset(dataset_type):
    path_pos_reviews = "/home/alex/Documente/unibuc/master-2016/anul2/sem1/text-mining/aclImdb_v1/aclImdb/" + dataset_type +"/pos"
    path_neg_reviews = "/home/alex/Documente/unibuc/master-2016/anul2/sem1/text-mining/aclImdb_v1/aclImdb/" + dataset_type +"/neg"
    pos_reviews = load_reviews(path_pos_reviews)
    neg_reviews = load_reviews(path_neg_reviews)
    
    all_reviews = []
    for pos_review in pos_reviews:
        review = [pos_review, '1']
        all_reviews.append(review)

    for neg_review in neg_reviews:
        review = [neg_review, '0']
        all_reviews.append(review)

    return all_reviews

Împărțim recenziile în două matrice:
- o matrice numită *train_data* pentru a fi folosită la faza de antrenare
- o matrice numită *test_data* pentru a testa clasificatorul

Fiecare matrice va fi împărțită la rîndul ei în doi vectori:
- un vector *X_train* va conține recenziile
- un vector *y_train* va conține eticheta care ne spune că recenziile din vectorul *X_train* sînt fie pozitive sau negative

In [5]:
train_data = load_dataset("train")
test_data = load_dataset("test")

X_train = [review[0] for review in train_data]
y_train = [sentiment[1] for sentiment in train_data]

X_test = [review[0] for review in test_data]
y_test = [sentiment[1] for sentiment in test_data]

print("Datasetul pentru a fi folosit la antrenament este o matrice cu două coloane:")
df_train = pd.DataFrame(train_data,columns=["Recenzie","Sentiment"]).set_index("Recenzie") 
print(df_train.head())
print(df_train.tail())

print("")

print("Datasetul pentru testare are aceeași formă, este o matrice cu două coloane:")
df_test = pd.DataFrame(test_data,columns=["Recenzie","Sentiment"]).set_index("Recenzie") 
print(df_test.head())
print(df_test.tail())

Datasetul pentru a fi folosit la antrenament este o matrice cu două coloane:
                                                   Sentiment
Recenzie                                                    
Robin Williams and Kurt Russell play guys in th...         1
This anime seriously rocked my socks. When the ...         1
Please see also my comment on Die Nibelungen pa...         1
Best-selling horror novelist Cheryl (a solid an...         1
It is rare that one comes across a movie as fla...         1
                                                   Sentiment
Recenzie                                                    
I give this marriage  years and thats stretchin...         0
Yes, this movie is bad. What's worse is that it...         0
Truly terrible, pretentious, endless film. Dire...         0
The story of a woman (Ann) on her death bed, he...         0
My sincere advice to all: don't watch the movie...         0

Datasetul pentru testare are aceeași formă, este o matrice cu două c

# Transformarea datelor text în date numerice

Crearea matricei de frecvență a termenilor - *term frequency matrix*

Matricea o vom crea folosind clasa *CountVectorizer*, dar mai întîi vom extinde această clasă cu o nouă capabilitate și anume *stemming* sau reducerea cuvintelor la rădacina lor. Astfel, vom avea nou clasă numită *StemmedCountVectorizer*

In [6]:
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        english_stemmer = nltk.stem.SnowballStemmer("english")
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([english_stemmer.stem(w) for w in analyzer(doc)])

Parametrii folosiți pentru *StemmedCountVectorizer*:
- *min_df=X* include doar cuvintele care apar în mai mult de *X* documente

In [7]:
vectorizer = StemmedCountVectorizer(min_df=20, binary="true", analyzer="word", stop_words="english")
document_term_matrix = vectorizer.fit_transform(X_train)
print(document_term_matrix.shape)
print(vectorizer.get_feature_names()[:150])

(25000, 8455)
['aaron', 'abandon', 'abbot', 'abbott', 'abc', 'abduct', 'abe', 'abhorr', 'abid', 'abil', 'abl', 'abli', 'aboard', 'abomin', 'aborigin', 'abort', 'abound', 'abraham', 'abroad', 'abrupt', 'absenc', 'absent', 'absolut', 'absorb', 'abstract', 'absurd', 'abund', 'abus', 'abysm', 'academ', 'academi', 'accent', 'accentu', 'accept', 'access', 'accid', 'accident', 'acclaim', 'accolad', 'accommod', 'accompani', 'accomplic', 'accomplish', 'accord', 'account', 'accur', 'accuraci', 'accus', 'accustom', 'ace', 'ach', 'achiev', 'acid', 'acknowledg', 'acquaint', 'acquir', 'acquit', 'acrobat', 'act', 'action', 'activ', 'activist', 'actor', 'actress', 'actual', 'ad', 'adam', 'adapt', 'add', 'addict', 'addit', 'address', 'adept', 'adequ', 'adher', 'adjust', 'administr', 'admir', 'admiss', 'admit', 'adolesc', 'adolf', 'adopt', 'ador', 'adrenalin', 'adrian', 'adult', 'adulter', 'adulteri', 'adulthood', 'advanc', 'advantag', 'adventur', 'advers', 'adversari', 'advert', 'advertis', 'advic', 'a

Antrenăm clasificatorul folosind algoritmul *Naive Bayes* pe matricea de frecvență a termenilor

In [8]:
classifier = BernoulliNB().fit(document_term_matrix, y_train)

## Testarea unei fraze date

In [9]:
to_predict = vectorizer.transform(["I was looking forward to The Guardian, but when I walked into the theater I wasn't really in the mood for it at that particular time. It's kind of like the Olive Garden - I like it, but I have to be in the right mindset to thoroughly enjoy it.<br /><br />I'm not exactly sure what was dampening my spirit. The trailers looked good, but the water theme was giving me bad flashbacks to the last Kevin Costner movie that dealt with the subject - Waterworld. Plus, despite the promise Ashton Kutcher showed in The Butterfly Effect, I'm still not completely sold on him. Something about the guy just annoys me. Probably has to do with his simian features.<br /><br />It took approximately two minutes for my fears to subside and for my hesitancies to slip away. The movie immediately throws us into the midst of a tense rescue mission, and I was gripped tighter than Kenny Rogers' orange face lift. My concerns briefly bristled at Kutcher's initial appearance due to the fact that too much effort was made to paint him as ridiculously cool and rebellious. Sunglasses, a tough guy toothpick in his mouth, and sportin' a smirk that'd make George Clooney proud? Yeah, we get it. I was totally ready to hate him.<br /><br />But then he had to go and deliver a fairly strong performance and force me to soften my jabs. <br /><br />Darn you, ape man! Efficiently mixing tense, exciting rescue scenes, drama, humor, and solid acting, The Guardian is easily a film that I dare say the majority of audiences will enjoy. You can quibble about its clichés, predictability, and rare moments of overcooked sappiness, but none of that takes away from the entertainment value.<br /><br />I had a bad feeling that the pace would slow too much when Costner started training the young guys, but on the contrary, the training sessions just might be the most interesting aspect of the film. Coast Guard Rescue Swimmers are heroes whose stories have never really been portrayed on the big screen, so I feel the inside look at what they go through and how tough it is to make it is very informative and a great way to introduce audiences to this under-appreciated group.<br /><br />Do you have what it takes to be a rescue swimmer? Just think about it -you get to go on dangerous missions in cold, dark, rough water, and then you must fight disorientation, exhaustion, hypothermia, and a lack of oxygen all while trying to help stranded, panicked people who are depending on you for their survival. And if all that isn't bad enough, sometimes you can't save everybody so you have to make the tough decision of who lives and who dies.<br /><br />Man, who wants all that responsibility? Not me! I had no idea what it was really like for these guys, and who would have thought I'd have an Ashton Kutcher/Kevin Costner movie to thank for the education? <br /><br />Not only does The Guardian do a great job of paying tribute to this rare breed of hero, but lucky for us it also does a good job of entertaining its paying customers.<br /><br />THE GIST <br /><br />Moviegoers wanting an inside look at what it's like to embark on a daring rescue mission in the middle of the ocean might want to give The Guardian a chance. I saw it for free, but had I paid I would've felt I had gotten my money's worth."])
result = classifier.predict(to_predict)
if (result[0] == "1"):
    print("Recenzia introdusă esete pozitivă!")
else:
    print("Recenzia introdusă este negativă!")
    

Recenzia introdusă esete pozitivă!


## Calcularea ratei de success pe datele de test

In [10]:
pos_test_reviews = [row[0] for row in train_data if row[1] == "1"]
neg_test_reviews = [row[0] for row in train_data if row[1] == "0"]
number_pos = 0
number_neg = 0

for review in pos_test_reviews:
    to_predict = vectorizer.transform([review])
    pred = classifier.predict(to_predict)
    if pred[0] == "1":
        number_pos += 1
print("Rata de succes pentru recenziile pozitive: ", number_pos / number_of_reviews)

for review in neg_test_reviews:
    to_predict = vectorizer.transform([review])
    pred = classifier.predict(to_predict)
    if pred[0] == "0":
        number_neg += 1
print("Rata de succes pentru recenziile negative: ", number_neg / number_of_reviews)

Rata de succes pentru recenziile pozitive:  0.51945
Rata de succes pentru recenziile negative:  0.55075


## Testarea acurateții folosind *cross validation* pe datele de test

In [11]:
test_matrix = vectorizer.fit_transform(X_test)
cv_score = cross_val_score(classifier, test_matrix, y_test)
print(cv_score.mean())

0.8499601644926496
