# Introduction
Sentiment analysis is a Natural Language Processing (NLP) technique that extracts the emotions expressed in raw text.<br><br>It is commonly applied to social media posts and customer reviews to determine automatically whether users are positive or negative, and why.


### Here we use a dataset of Arabic reviews.
>+ The dataset combines reviews of hotels, books, movies, products and a few airlines.<br> 
>+ It has three classes (Mixed, Negative and Positive).<br>
>+ Most labels were mapped from reviewers' ratings: 3 is mixed, above 3 is positive and below 3 is negative.<br> 
>+ Each row has a label and a text separated by a tab (tsv).<br>
>+ The texts (reviews) were cleaned by removing Arabic diacritics and non-Arabic characters.<br>
>+ The dataset has no duplicate reviews.
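
The rating-to-label rule described in the list above can be sketched as follows. The function name `rating_to_label` is hypothetical and a numeric star rating is an assumption; the dataset itself already ships with the labels, so this only illustrates the mapping.

```python
def rating_to_label(rating):
    """Map a numeric rating to a sentiment class (illustrative helper, not part of the dataset tooling)."""
    if rating > 3:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Mixed"  # a rating of exactly 3

# Example ratings (assumed 1-5 scale)
example_labels = [rating_to_label(r) for r in (1, 3, 5)]
print(example_labels)
```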

## Import libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer


from nltk.stem import ISRIStemmer
from nltk.stem import PorterStemmer
import re
import warnings
warnings.filterwarnings("ignore")


## Load data

In [2]:
df=pd.read_csv("ar_reviews_100k.tsv",delimiter="\t")
df

Unnamed: 0,label,text
0,Positive,ممتاز نوعا ما . النظافة والموقع والتجهيز والشا...
1,Positive,أحد أسباب نجاح الإمارات أن كل شخص في هذه الدول...
2,Positive,هادفة .. وقوية. تنقلك من صخب شوارع القاهرة الى...
3,Positive,خلصنا .. مبدئيا اللي مستني ابهار زي الفيل الاز...
4,Positive,ياسات جلوريا جزء لا يتجزأ من دبي . فندق متكامل...
...,...,...
99994,Negative,معرفش ليه كنت عاوزة أكملها وهي مش عاجباني من ا...
99995,Negative,لا يستحق ان يكون في بوكنق لانه سيئ . لا شي. لا...
99996,Negative,كتاب ضعيف جدا ولم استمتع به. فى كل قصه سرد لحا...
99997,Negative,مملة جدا. محمد حسن علوان فنان بالكلمات، والوصف...


In [3]:
df["label"].value_counts()

Positive    33333
Negative    33333
Mixed       33333
Name: label, dtype: int64

In [4]:
df["label"]=df["label"].apply(lambda x:-1 if x=="Negative" else(0 if x=="Mixed" else 1))
df["label"].value_counts()

-1    33333
 1    33333
 0    33333
Name: label, dtype: int64

## Cleaning data

To clean the textual data, I define a `cleaning` function that performs several transformations:<br>


>+ replace every non-Arabic character (digits, punctuation, Latin letters) with a space.
>+ remove stop words.
>+ pass each remaining word through the stemmer.


In [5]:
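# NOTE: PorterStemmer targets English, so it leaves the Arabic tokens below essentially unchanged.
# ISRIStemmer (imported above) is NLTK's Arabic stemmer, but switching to it would change the results that follow.
sm = PorterStemmer()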

In [6]:
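# Load the stop-word list once into a set for fast membership tests
with open("stopwords_ar.txt", 'r', encoding="utf-8") as f:
    stopwords = set(f.read().split('\n'))

def cleaning(text):
    # Replace every character outside the Arabic letter range with a space
    text = re.sub('[^ء-ي]', ' ', text)
    # Drop stop words and stem the remaining words; split() also collapses repeated spaces
    filtered_words = [sm.stem(word) for word in text.split() if word not in stopwords]
    return ' '.join(filtered_words)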
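
To make the effect of the cleaning step concrete, here is a minimal, self-contained sketch on a single made-up review. The stop-word set `demo_stopwords` and the helper `demo_clean` are hypothetical stand-ins (the notebook reads its list from `stopwords_ar.txt`), the character range is assumed to be U+0621 to U+064A, and stemming is left out.

```python
import re

# Hypothetical two-word stop list; the notebook loads the real one from stopwords_ar.txt
demo_stopwords = {"في", "من"}

def demo_clean(text):
    # Keep only characters in the basic Arabic letter range (assumed U+0621..U+064A);
    # digits, punctuation and Latin letters become spaces
    text = re.sub('[^\u0621-\u064A]', ' ', text)
    # split() discards the extra whitespace; stop words are filtered out
    return [word for word in text.split() if word not in demo_stopwords]

tokens = demo_clean("فندق ممتاز في دبي 5 stars!")
print(tokens)
```

Only the Arabic content words survive: the digit, the Latin word, the punctuation and the stop word are all dropped.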

In [7]:
df["clean text"]=df["text"].apply(cleaning)

In [8]:
df["clean text"]

0        ممتاز نوعا النظافة والموقع والتجهيز والشاطيء ا...
1        أسباب نجاح الإمارات شخص الدولة يعشق ترابها نحب...
2        هادفة وقوية تنقلك صخب شوارع القاهرة هدوء جبال ...
3        خلصنا مبدئيا اللي مستني ابهار زي الفيل الازرق ...
4        ياسات جلوريا جزء يتجزأ دبي فندق متكامل الخدمات...
                               ...                        
99994    معرفش ليه كنت عاوزة أكملها مش عاجباني البداية ...
99995    يستحق بوكنق لانه سيئ شي يوجد خدمة افطار صباحي ...
99996    كتاب ضعيف استمتع قصه سرد لحاله مشهد بدون فكره ...
99997    مملة محمد حسن علوان فنان بالكلمات والوصف عندة ...
99998    ارجع إليه مرة قربه البحر المكان قديم توجد خدما...
Name: clean text, Length: 99999, dtype: object

In [9]:
df.drop(df[df["clean text"]==""].index,inplace=True)
df[df["clean text"]==""]


Unnamed: 0,label,text,clean text




### The TF-IDF metric:
##### We compute a TF-IDF (Term Frequency - Inverse Document Frequency) value for every word.

>+ TF counts how many times the word appears in the text.
>+ IDF measures the relative importance of the word: the more texts the word appears in, the lower its weight.
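
The two quantities above can be worked through by hand on a toy corpus. The sketch below is pure Python and uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which to my understanding is what `TfidfVectorizer` does with its default settings. The corpus and the helper names (`idf`, `tfidf_vector`) are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (English stand-ins for the cleaned reviews)
docs = ["clean room clean staff", "dirty room", "friendly staff"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many texts does each term occur?
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get a larger weight
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_vector(tokens):
    tf = Counter(tokens)  # raw term counts
    weights = {t: count * idf(t) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first text, "clean" occurs twice and appears in no other text, so it ends up with a larger weight than "room" or "staff", which each occur once and are shared with another text.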

In [11]:

tfidf = TfidfVectorizer()
x = tfidf.fit_transform(df["clean text"])
x.shape

(99979, 309626)

In [12]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,df.label,random_state=55,test_size = 0.2)

In [13]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=55)
clf.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(random_state=55)
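
The message above is the solver reporting that it stopped at the iteration limit (`max_iter` defaults to 100) before converging. One option the warning itself suggests is to raise `max_iter`. The sketch below shows this on synthetic data; the value 1000 and the toy dataset are assumptions, not settings taken from this notebook, and the accuracies reported here would likely shift if the model were refitted this way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the TF-IDF matrix
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=55
)

# Allow more iterations than the default of 100 (1000 is an arbitrary choice)
clf_demo = LogisticRegression(max_iter=1000, random_state=55)
clf_demo.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations actually used;
# a value below max_iter means the solver converged
print(clf_demo.n_iter_)
```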

In [14]:
clf.score(x_train,y_train)


0.8360026505632447

In [15]:
y_pred = clf.predict(x_test)
accuracy_score(y_test, y_pred)

0.6627825565113022

In [16]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)

MultinomialNB()

In [19]:
model.score(x_train,y_train)


0.8367903179425628

In [20]:
y_pred = model.predict(x_test)
accuracy_score(y_test, y_pred)

0.635377075415083

In [21]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(x_train, y_train)

KNeighborsClassifier()

In [22]:
model.score(x_train,y_train)


0.6978607954190266

In [23]:
y_pred = model.predict(x_test)
accuracy_score(y_test, y_pred)

0.5363072614522905

In [45]:
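# Split the cleaned text itself this time, so the vectorizer inside the pipeline is fitted on the training data only
x = df["clean text"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, df.label, random_state=55, test_size=0.2)

from sklearn.pipeline import Pipeline

# Chain the vectorizer and the classifier so that a single object maps text to a label
pipelin = Pipeline([('tfidf', TfidfVectorizer()),
                    ('lr_classifier', LogisticRegression())])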

In [46]:
pipelin.fit(x_train,y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('lr_classifier', LogisticRegression())])

In [47]:
y_pred = pipelin.predict(x_test)
accuracy_score(y_test, y_pred)

0.6609321864372875

In [48]:
import pickle
# Save the fitted pipeline; the context manager closes the file handle
with open('hotel_review.pkl', 'wb') as f:
    pickle.dump(pipelin, f)


In [49]:
# Reload the fitted pipeline from disk
with open('hotel_review.pkl', 'rb') as f:
    model = pickle.load(f)
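
Because the pipeline bundles the vectorizer with the classifier, a reloaded copy can score new text directly, provided the text goes through the same cleaning first. Here is a minimal sketch of that round trip, using a toy pipeline and an in-memory pickle instead of the file on disk. All texts, labels and variable names below are made up.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative training set (1 = positive, -1 = negative)
texts = ["good clean room", "great friendly staff", "bad dirty room", "terrible rude staff"]
labels = [1, 1, -1, -1]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                         ("lr_classifier", LogisticRegression())])
toy_pipeline.fit(texts, labels)

# Serialise and restore in memory; the notebook writes to hotel_review.pkl instead
restored = pickle.loads(pickle.dumps(toy_pipeline))

# The restored pipeline accepts raw strings and vectorises them itself
new_reviews = ["clean room and friendly staff", "dirty room and rude staff"]
restored_pred = list(restored.predict(new_reviews))
same_predictions = restored_pred == list(toy_pipeline.predict(new_reviews))
print(restored_pred, same_predictions)
```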

In [50]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score, fbeta_score

In [51]:
confusion_matrix(y_test, y_pred)

array([[4753, 1293,  600],
       [1403, 3654, 1613],
       [ 582, 1289, 4809]], dtype=int64)
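
The matrix above can be read per class. Rows are the true labels and columns the predicted ones, and scikit-learn orders them by sorted label value, so here the order is -1 (Negative), 0 (Mixed), 1 (Positive). The sketch below recomputes precision, recall and accuracy from the printed counts, hard-coding the matrix so that it runs on its own.

```python
import numpy as np

# Counts copied from the confusion matrix output above
cm = np.array([[4753, 1293,  600],
               [1403, 3654, 1613],
               [ 582, 1289, 4809]])
class_labels = [-1, 0, 1]  # Negative, Mixed, Positive

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = cm.trace() / cm.sum()            # correct / all test samples

for label, p, r in zip(class_labels, precision, recall):
    print(f"label {label:>2}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The Mixed class (label 0) has the lowest recall of the three. Most of the errors involve reviews being confused between Mixed and one of the two polar classes, rather than between Negative and Positive.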

In [52]:
accuracy_score(y_test, y_pred)

0.6609321864372875