# Sentiment analysis

## Set up

In [5]:
#%pip install numpy pandas nltk scikit-learn gensim

In [8]:
import numpy as np 
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re

## Load dataset

In [2]:
df = pd.read_csv(r'../resources/IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [105]:
for i in range (len(df['review'])):
    sentence = df['review'][i]
    arr = sentence.split()
    for word in arr: 
        if word == "entertainedif":
            print(i, end = " : ")
            print(sentence)

In [106]:
df['review'][16123]

'I am writing this review having watched it several months ago....the trailer looked promising enough for me to buy this lame excuse for a movie. It is a complete joke....and literally a spit in the face of real classics of the early generation of horror like Texas Chainsaw Massacre (1974) which they even had the gall to compare itself to on the back of the cover art. The producer who played Brandon should go flip burgers and serve up greasy hamburgers....hell he might not even be good at that either! The lighting was bad bad bad and a big annoyance through out the film you couldn\'t even see the actor\'s faces sometimes. I don\'t even remember the rest of the cast members which is sad really, bad they never do anything to impress you to make them memorable. That\'s all the time I will waste on this review PLEASE stay as far away as you can from this pile of junk even if you get it for 25 cents don\'t do it buy s piece of gum at least IT would keep you entertained!<br /><br />If you wa

16123
12853

## Data preprocessing

### Data cleaning
- Remove HTML tags
- Remove extra whitespace and punctuation
- Convert uppercase to lowercase
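
The three cleaning steps listed above can also be combined into one helper. The following is a minimal sketch rather than the notebook's own code: `clean_review` is a hypothetical name, and it assumes that replacing each tag with a space (instead of an empty string) is acceptable, which keeps the words on either side of a tag from being fused together.

```python
import re

def clean_review(text: str) -> str:
    """Strip HTML tags and punctuation, collapse whitespace, lowercase."""
    # Replace any HTML tag (for example <br />) with a space
    text = re.sub(r'<[^>]+>', ' ', text)
    # Replace every character that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

cleaned = clean_review("A wonderful little production. <br /><br />The filming")
print(cleaned)  # a wonderful little production the filming
```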

#### Remove HTML tags

In [3]:
# Note: replacing the tag with an empty string fuses the words on either side of it
df['review'] = df['review'].str.replace("<br />", "", regex=False)
df['review'][2]

'I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.This was the most I\'d laughed at one of Woody\'s comedies in years (dare I say a decade?). While I\'ve never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.'

#### Remove extra whitespace and punctuation

In [9]:
# Replace special characters with spaces and collapse extra whitespace
def remove_punctuation(text):
    # Replace every character that is neither a word character nor whitespace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text


In [10]:
df['review'] = df['review'].apply(remove_punctuation)

In [11]:
df['review'][16123]

'I am writing this review having watched it several months ago the trailer looked promising enough for me to buy this lame excuse for a movie It is a complete joke and literally a spit in the face of real classics of the early generation of horror like Texas Chainsaw Massacre 1974 which they even had the gall to compare itself to on the back of the cover art The producer who played Brandon should go flip burgers and serve up greasy hamburgers hell he might not even be good at that either The lighting was bad bad bad and a big annoyance through out the film you couldn t even see the actor s faces sometimes I don t even remember the rest of the cast members which is sad really bad they never do anything to impress you to make them memorable That s all the time I will waste on this review PLEASE stay as far away as you can from this pile of junk even if you get it for 25 cents don t do it buy s piece of gum at least IT would keep you entertained If you want good quality low budget fun far

#### Remove stop words

In [12]:
# Download the stop words from nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PAVT\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
stop_words.remove('no')
stop_words.remove('not')

In [14]:
def remove_stop_words(text):
    words = text.split()
    filter_words = [word for word in words if word not in stop_words]
    return ' '.join(filter_words)
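
Dropping `no` and `not` from the stop-word set matters because negations can flip the sentiment of a phrase. The sketch below illustrates this with a tiny, made-up stop-word set (an assumption for illustration only; the notebook uses NLTK's English list) and a hypothetical helper `filter_stop_words` that mirrors `remove_stop_words` but takes the set as an argument.

```python
def filter_stop_words(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

# Toy stop-word set for illustration only
toy_stop_words = {'the', 'was', 'a', 'no', 'not'}
review = 'the movie was not good'

# With the negation treated as a stop word, the review looks positive
with_negations_removed = filter_stop_words(review, toy_stop_words)
# With 'no' and 'not' kept, the negation survives
with_negations_kept = filter_stop_words(review, toy_stop_words - {'no', 'not'})

print(with_negations_removed)  # movie good
print(with_negations_kept)     # movie not good
```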

In [9]:
# df['review'] = df['review'].apply(remove_stop_words)
# df['review'][2]

#### Convert uppercase to lowercase

In [15]:
df['review'] = df['review'].str.lower()
df['review'][16123]

'i am writing this review having watched it several months ago the trailer looked promising enough for me to buy this lame excuse for a movie it is a complete joke and literally a spit in the face of real classics of the early generation of horror like texas chainsaw massacre 1974 which they even had the gall to compare itself to on the back of the cover art the producer who played brandon should go flip burgers and serve up greasy hamburgers hell he might not even be good at that either the lighting was bad bad bad and a big annoyance through out the film you couldn t even see the actor s faces sometimes i don t even remember the rest of the cast members which is sad really bad they never do anything to impress you to make them memorable that s all the time i will waste on this review please stay as far away as you can from this pile of junk even if you get it for 25 cents don t do it buy s piece of gum at least it would keep you entertained if you want good quality low budget fun far

### Split the data

In [16]:
data_removed = df['review'].apply(remove_stop_words)
# Split the data into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_removed, df['sentiment'], test_size=0.3, random_state=42)

## Word embeddings

In [17]:
from gensim.models import Word2Vec

In [18]:
sentences = []
for sentence in X_train:
    sentences.append(sentence.split())

In [20]:
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers = 80)

#### Sentence embeddings

In [21]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)
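
`cosine_similarity` divides by the product of the two norms, so it yields `nan` (with a runtime warning) if either input is an all-zero vector. A guarded variant could look like the sketch below; `safe_cosine_similarity` and its `eps` parameter are hypothetical names, and returning `0.0` for a zero vector is an assumed convention, not something the notebook specifies.

```python
import numpy as np

def safe_cosine_similarity(vec1, vec2, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector has (near) zero norm."""
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if norm_product < eps:
        # Assumed convention: a zero vector is treated as similar to nothing
        return 0.0
    return float(np.dot(vec1, vec2) / norm_product)

same_direction = safe_cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
with_zero_vector = safe_cosine_similarity(np.zeros(3), np.ones(3))
```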

In [22]:
def sum_weights(vectors, w):
    res = np.zeros(100)
    for i in range(len(vectors)):
        res += w[i]*vectors[i]
    return res
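
The loop in `sum_weights` computes a weighted sum of vectors, which is the same as multiplying the weight vector by the matrix whose rows are the vectors. The sketch below shows that equivalent form; `sum_weights_vectorized` is a hypothetical name, and unlike the original it does not hard-code a dimension of 100.

```python
import numpy as np

def sum_weights_vectorized(vectors, w):
    """Weighted sum of row vectors: sum_i w[i] * vectors[i]."""
    vectors = np.asarray(vectors, dtype=float)  # shape (n, d)
    w = np.asarray(w, dtype=float)              # shape (n,)
    return w @ vectors                          # shape (d,)

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = [1, 2, 3]
weighted_sum = sum_weights_vectorized(toy_vectors, toy_weights)
print(weighted_sum)  # [4. 5.]
```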

Movie evaluation criteria:
- Script:
    + coherent / incoherent
    + unpredictable / predictable
- Meaning of the film: meaningful / meaningless
- Effects: impressive / unimpressive
- Scenes: heartfelt / insincere

In [23]:
def leaky_ReLU_prob(x):
    if x < np.sqrt(3)/2:
        return 0.4*x
    else:
        return 2*x
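
`leaky_ReLU_prob` damps inputs below `sqrt(3)/2` (the cosine of a 30-degree angle) by a factor of 0.4 and doubles the rest. If the function is ever applied to a whole array of similarities at once, a vectorized form along these lines could be used; `leaky_relu_prob_vectorized` and `THRESHOLD` are hypothetical names introduced for this sketch.

```python
import numpy as np

THRESHOLD = np.sqrt(3) / 2  # about 0.866, the same cut-off as leaky_ReLU_prob

def leaky_relu_prob_vectorized(x):
    """Scale values below THRESHOLD by 0.4 and all other values by 2."""
    x = np.asarray(x, dtype=float)
    return np.where(x < THRESHOLD, 0.4 * x, 2.0 * x)

similarities = np.array([-0.5, 0.5, 0.9])
weights = leaky_relu_prob_vectorized(similarities)
print(weights)  # [-0.2  0.2  1.8]
```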

In [24]:
def id_v(x):
    return x

In [25]:
list_positive_words = ['heartfelt', 'gripping', 'impressive', 'meaningful', 'coherent']
list_negative_words = ['insincere', 'predictable', 'soporific', 'illogical', 'uninteresting']
# incoherent == uninteresting
# meaningless == illogical
# unimpressive == soporific
# unpredictable == gripping


In [88]:
# Not needed yet
def find_most_dissimilar_words(model, word, top_n=10):
    if word not in model:
        raise ValueError(f"Word '{word}' not in the model vocabulary.")
    
    # Get the vector of the query word
    word_vector = model[word]
    
    # Compute the cosine similarity between the query word and every other word in the model
    similarities = []
    for other_word in model.index_to_key:
        if other_word != word:
            other_word_vector = model[other_word]
            similarity = np.dot(word_vector, other_word_vector) / (np.linalg.norm(word_vector) * np.linalg.norm(other_word_vector))
            similarities.append((other_word, similarity))
    
    # Sort the words by cosine similarity in ascending order
    similarities.sort(key=lambda x: x[1])
    
    # Take the top_n words with the lowest cosine similarity (the most dissimilar ones)
    most_dissimilar_words = similarities[:top_n]
    
    return most_dissimilar_words
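
The per-word loop above can be slow on a large vocabulary. The same ranking can be computed with one matrix product once all vectors are normalized to unit length. The sketch below uses a made-up three-word vocabulary and 2-dimensional vectors purely for illustration; `most_dissimilar`, `toy_vocab`, and `toy_matrix` are hypothetical names, not part of the notebook or of gensim.

```python
import numpy as np

def most_dissimilar(vocab, matrix, word, top_n=3):
    """Return the top_n (word, cosine similarity) pairs with the lowest similarity."""
    idx = vocab.index(word)
    # Normalize every row so that a dot product equals the cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    # argsort is ascending, so the least similar words come first
    order = np.argsort(sims)
    return [(vocab[i], float(sims[i])) for i in order if i != idx][:top_n]

toy_vocab = ['good', 'great', 'bad']
toy_matrix = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
result = most_dissimilar(toy_vocab, toy_matrix, 'good', top_n=2)
```

In this toy setup, `bad` points in the opposite direction to `good`, so it is ranked first.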

In [44]:
# Example: find the words most dissimilar to a given word
word = 'uninteresting'
most_dissimilar_words = find_most_dissimilar_words(model.wv, word, top_n=10)

print(f"Words most dissimilar to '{word}':")
for other_word, similarity in most_dissimilar_words:
    print(f"{other_word}: {similarity}")

Words most dissimilar to 'uninteresting':
appleseed: -0.4673237204551697
wolstencroft: -0.4455452561378479
saleslady: -0.4338872730731964
burlesks: -0.4295748770236969
photog: -0.4244859218597412
siddons: -0.41785940527915955
felson: -0.4168294072151184
kirron: -0.4103279411792755
lindsley: -0.4011898636817932
earls: -0.39931991696357727


- extraordinary/vivid >< dumbsh

In [34]:
list_positive = [model.wv[word] for word in list_positive_words ]
list_negative = [model.wv[word] for word in list_negative_words ]

In [45]:
for i in range(len(list_positive)):
    print(cosine_similarity(list_positive[i], list_negative[i]))

0.6893846
0.546904
0.56399274
0.682836
0.69089097


In [86]:
positive_vector = sum_weights(list_positive, [1, 2, 3, 1, 1])
negative_vector = sum_weights(list_negative, [1, 2, 3, 1, 1])

In [87]:
cosine_similarity(positive_vector, negative_vector)

0.5466110229191588

In [48]:
def sentence_to_vector(sentence, model, f = id_v):
    # 100-dimensional vectors of the words that are in the vocabulary
    word_vectors = [model.wv[word] for word in sentence if word in model.wv]
    #dedicates= []
    pos_weights = []
    neg_weights = []
    # Same length as the concatenated output below, so every sentence gets a vector of the same size
    res = np.zeros(2 * model.vector_size)
    if len(word_vectors) == 0:
        return res
    for word_vector in word_vectors:
        checkPositive = cosine_similarity(word_vector, positive_vector) # in [-1, 1]
        checkNegative = cosine_similarity(word_vector, negative_vector) # in [-1, 1]
        pos_weights.append(f(checkPositive))
        neg_weights.append(f(checkNegative))

    # Weight each word vector by its similarity to the positive and negative anchors
    pos_res = sum_weights(word_vectors, pos_weights)
    neg_res = sum_weights(word_vectors, neg_weights)
    return np.concatenate([pos_res, neg_res])

    # if(len(dedicates) > 0):
    #     res = sum_weights(word_vectors,dedicates)
    # return res
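
To make the construction concrete, here is a small worked example of the same idea on 2-dimensional toy vectors: each word vector is weighted by its cosine similarity to a positive anchor and to a negative anchor, and the two weighted sums are concatenated. All names and values below (`toy_sentence_vector`, the anchors, the word vectors) are made up for illustration and do not come from the trained model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_sentence_vector(word_vectors, positive_anchor, negative_anchor):
    """Concatenate the positive-weighted and negative-weighted sums of the word vectors."""
    word_vectors = np.asarray(word_vectors, dtype=float)
    pos_w = np.array([cosine(v, positive_anchor) for v in word_vectors])
    neg_w = np.array([cosine(v, negative_anchor) for v in word_vectors])
    return np.concatenate([pos_w @ word_vectors, neg_w @ word_vectors])

positive_anchor = np.array([1.0, 0.0])
negative_anchor = np.array([0.0, 1.0])
words = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]

sentence_vec = toy_sentence_vector(words, positive_anchor, negative_anchor)
print(sentence_vec)  # [1. 0. 0. 2.]
```

The output has twice the dimension of the word vectors, which is why the notebook's sentence vectors have 200 components for 100-dimensional embeddings.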

#### Build the train and test sets for the model

In [49]:
X_w2v_train = [ sentence_to_vector(sent.split(), model, id_v)  for sent in X_train]
X_w2v_test =  [ sentence_to_vector(sent.split(), model, id_v)  for sent in X_test]

In [64]:
X_train_removed = X_train.apply(remove_stop_words)
X_test_removed = X_test.apply(remove_stop_words)
X_w2v_train_removed = [ sentence_to_vector(sent.split(), model)  for sent in X_train_removed]
X_w2v_test_removed =  [ sentence_to_vector(sent.split(), model)  for sent in X_test_removed]

## Models
Run only to test the Word2Vec model.

### Evaluation function

In [50]:
from sklearn.metrics import accuracy_score, classification_report
def train_and_valid(model, X_train, y_train, X_test, y_test):
    model.fit(X_train,y_train)
    y_pred_train = model.predict(X_train) 
    y_pred = model.predict(X_test)
    # Evaluate the model
    accuracy_train = accuracy_score(y_train, y_pred_train)
    accuracy_test = accuracy_score(y_test, y_pred)
    print(f"training accuracy: {accuracy_train} \nvalidation accuracy: {accuracy_test}")
    

### Decision tree

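In [52]:
from sklearn.tree import DecisionTreeClassifier
# Assumption: `clf` is not defined in the cells shown; a default decision tree is used here
clf = DecisionTreeClassifier()
train_and_valid(clf, X_w2v_train, y_train, X_w2v_test, y_test)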

training accuracy: 0.9797142857142858 
validation accuracy: 0.7700666666666667


In [53]:
train_and_valid(clf, X_w2v_train_removed, y_train, X_w2v_test_removed, y_test)

NameError: name 'X_w2v_train_removed' is not defined

### Logistic regression


In [56]:
from sklearn.linear_model import LogisticRegression

# Create the Logistic Regression model
log_regr = LogisticRegression(
    random_state=42,
    max_iter=400
)

In [57]:
train_and_valid(log_regr,X_w2v_train, y_train, X_w2v_test, y_test)

training accuracy: 0.8802857142857143 
validation accuracy: 0.8788


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
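
The warning above suggests either raising `max_iter` or scaling the data. One way to try the second option is to put a `StandardScaler` in front of the classifier with scikit-learn's `make_pipeline`. The sketch below runs on synthetic data with deliberately mismatched feature scales; the data, the variable names, and the choice of scaler are assumptions for illustration, and the notebook's own accuracies were not obtained this way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for sentence vectors: four features on very different scales
X = rng.normal(size=(200, 4)) * np.array([1.0, 50.0, 0.01, 500.0])
# The label depends linearly on the first two features
y = (X[:, 0] + X[:, 1] / 50.0 > 0).astype(int)

# Standardize the features, then fit logistic regression on the scaled values
scaled_log_regr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=400, random_state=42),
)
scaled_log_regr.fit(X, y)
train_accuracy = scaled_log_regr.score(X, y)
```

Because the pipeline exposes `fit` and `predict`, it could be passed to `train_and_valid` in place of `log_regr`.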


In [71]:
train_and_valid(log_regr,X_w2v_train_removed, y_train, X_w2v_test_removed, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training accuracy: 0.8616 
validation accuracy: 0.8570666666666666


### Random forest

In [58]:
# Import the required libraries
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest model
rf_clf = RandomForestClassifier(
    random_state=42,
)

In [59]:
train_and_valid(rf_clf, X_w2v_train, y_train, X_w2v_test, y_test)

training accuracy: 0.9999714285714286 
validation accuracy: 0.8488


In [52]:
train_and_valid(rf_clf, X_w2v_train_removed, y_train, X_w2v_test_removed, y_test)

training accuracy: 0.9999714285714286 
validation accuracy: 0.8292666666666667


### MLP

In [64]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Create the MLP model
mlp_classifier = MLPClassifier(
    random_state=42,
    hidden_layer_sizes=400,
    max_iter=300
)


In [65]:
train_and_valid(mlp_classifier, X_w2v_train ,y_train ,X_w2v_test, y_test)

training accuracy: 0.9736571428571429 
validation accuracy: 0.8483333333333334


### Ensemble model

[0.5,1.5,0.75,0.75,1.5] --> 0.855 \
weights --> 0.8466
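
The note above lists per-model weights next to an accuracy, but the notebook does not show how the models were combined. One common scheme that fits such a note is weighted soft voting, where each model's predicted probability of the positive class is averaged using the weights. The sketch below assumes that scheme; `weighted_soft_vote` and the toy probabilities are hypothetical and are not the author's ensemble.

```python
import numpy as np

def weighted_soft_vote(probabilities, weights):
    """Combine per-model positive-class probabilities with a weighted average.

    probabilities: shape (n_models, n_samples); weights: shape (n_models,).
    Returns 1 where the weighted average is at least 0.5, else 0.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    weights = np.asarray(weights, dtype=float)
    combined = weights @ probabilities / weights.sum()
    return (combined >= 0.5).astype(int)

# Three toy models scoring two reviews
toy_probabilities = [
    [0.9, 0.4],
    [0.6, 0.3],
    [0.2, 0.8],
]
toy_weights = [0.5, 1.5, 0.75]
votes = weighted_soft_vote(toy_probabilities, toy_weights)
print(votes)  # [1 0]
```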

## Summary

| Word embedding model | Data | ML model | Parameters | training accuracy | testing accuracy | Assessment |
|---------|------|----------|------|--------|----|---|
|Word2Vec|Stop words kept |Decision tree | default | 0.979 | 0.7249 | Overfitting |
|Word2Vec|Stop words kept |Logistic Regression | max_iter = 400 | 0.852 | 0.850 | -- |
|Word2Vec|Stop words kept |Random forest | default | 0.99997 | 0.818 | Overfitting |
|Word2Vec|Stop words kept |XGBoost | default | 0.961 | 0.833 | Overfitting |
|Word2Vec|Stop words kept |MLP | max_iter = 250, learning_rate_init = 0.0005 | 0.868 | 0.850 | -- |
|Word2Vec|Stop words removed |Decision tree | default | 0.976 | 0.751 | Overfitting |
|Word2Vec|Stop words removed |Logistic Regression | default | 0.850 | 0.848 | -- |
|Word2Vec|Stop words removed |Random forest | default | 0.999 | 0.830 | Overfitting |
|Word2Vec|Stop words removed |XGBoost | default | 0.964 | 0.834 | Overfitting |
|Word2Vec|Stop words removed |MLP | default | 0.938 | 0.816 | Overfitting |
