<a href="https://colab.research.google.com/github/HowardHNguyen/Natural_Language_Processing-NLP/blob/main/English_Sentence_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, install the Python packages used in this notebook: nltk, textblob, and googletrans (pinned to 4.0.0-rc1).

In [None]:
!pip install nltk textblob googletrans==4.0.0-rc1

Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: googletrans
  Building wheel for googletrans (setup.py) ... done
  Created wheel for googletrans: filename=googletrans-4.0.0rc1-py3-none-any.whl size=17395 sha256=327b32ca3d48d5fcc34235c601d4fbda6e271dd0fdf3b8777d58c77a9bd1a484
  Stored in directory: /root/.cache/pip/wheels/c0/59/9f/7372f0cf70160fe61b528532e1a7c8498c4becd6bcffb022de
Successfully built googletrans
Installing collected packages: googletrans
  Attempting uninstall: googletrans
    Found existing installation: googletrans 3.0.0
    Uninstalling googletrans-3.0.0:
      Successfully uninstalled googletrans-3.0.0
Successfully installed googletrans-4.0.0rc1


# Basic Sentiment Analysis for English
For English, we can use TextBlob, which is straightforward for basic sentiment analysis:

In [None]:
from textblob import TextBlob

text = "Thanks a million."
blob = TextBlob(text)
sentiment = blob.sentiment

print(f"Sentiment of the text: Polarity={sentiment.polarity}, Subjectivity={sentiment.subjectivity}")

Sentiment of the text: Polarity=0.2, Subjectivity=0.2
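
TextBlob reports polarity on a scale from -1.0 (negative) to 1.0 (positive) and subjectivity from 0.0 (objective) to 1.0 (subjective). A minimal sketch of turning the polarity number into a coarse label follows; the `label_from_polarity` helper and its 0.1 threshold are illustrative assumptions, not part of TextBlob.

```python
def label_from_polarity(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity in [-1.0, 1.0] to a coarse sentiment label.

    The threshold is a hypothetical cut-off; tune it for your own data.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity value taken from the output above
print(label_from_polarity(0.2))   # "Thanks a million."
print(label_from_polarity(0.0))   # a text with no sentiment-bearing words
```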


# Sentiment Analysis for Vietnamese
For Vietnamese, you can use the googletrans library to translate the text to English and then perform sentiment analysis with TextBlob. Note that the translation step can introduce some noise into the analysis:

In [None]:
!pip install googletrans==4.0.0-rc1

In [None]:
from textblob import TextBlob
from googletrans import Translator

translator = Translator()

def translate_text(text, dest_language='en'):
    return translator.translate(text, dest=dest_language).text

vietnamese_text = "Con chào bố ạ."
translated_text = translate_text(vietnamese_text)

blob = TextBlob(translated_text)
sentiment = blob.sentiment

print(f"Sentiment of the translated text: Polarity={sentiment.polarity}, Subjectivity={sentiment.subjectivity}")

Sentiment of the translated text: Polarity=0.0, Subjectivity=0.0


In [None]:
translated_text

'Hello dad.'
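
Because this approach depends on an external translation service, it can help to isolate the translation step so that a failed request does not crash the analysis. The sketch below assumes nothing about googletrans beyond the `translate_text` function defined above; `analyze_via_translation` and the two stand-in functions are hypothetical names used only for illustration.

```python
from typing import Callable, Optional, Tuple


def analyze_via_translation(
    text: str,
    translate: Callable[[str], str],
    score: Callable[[str], float],
) -> Tuple[Optional[str], Optional[float]]:
    """Translate `text`, then score the translation.

    Returns (translated_text, polarity), or (None, None) if translation fails.
    """
    try:
        translated = translate(text)
    except Exception as exc:  # e.g. network errors or service changes
        print(f"Translation failed: {exc}")
        return None, None
    return translated, score(translated)


# Stand-ins so the sketch runs offline. In the notebook, pass translate_text
# and lambda t: TextBlob(t).sentiment.polarity instead.
def fake_translate(text: str) -> str:
    return {"Con chào bố ạ.": "Hello dad."}[text]


def fake_score(text: str) -> float:
    return 0.0


print(analyze_via_translation("Con chào bố ạ.", fake_translate, fake_score))
print(analyze_via_translation("unknown", fake_translate, fake_score))
```

Passing the translator and scorer as arguments also makes it easy to swap in a different translation backend later.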

# Advanced Options with Pre-trained Models
For more advanced sentiment analysis, especially for handling nuances in Vietnamese without relying on translation, consider using a model trained specifically for Vietnamese. Libraries like transformers by Hugging Face provide access to models that can handle multiple languages including Vietnamese. Here's a basic setup:

In [None]:
!pip install transformers



In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline("sentiment-analysis", model='nlptown/bert-base-multilingual-uncased-sentiment')

In [None]:
# example texts
english_text = "Today is such a beautiful day, I want to take a walk in the park."
vietnamese_text = "Hôm nay trời đẹp quá, tôi muốn đi dạo ở công viên."

In [None]:
# perform analysis
english_result = classifier(english_text)
vietnamese_result = classifier(vietnamese_text)

In [None]:
print("English text sentiment:", english_result)
print("Vietnamese text sentiment:", vietnamese_result)

English text sentiment: [{'label': '5 stars', 'score': 0.7996292114257812}]
Vietnamese text sentiment: [{'label': '5 stars', 'score': 0.27345144748687744}]


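The score is the model's confidence in the label it predicts. For the Vietnamese sentence the model still predicts 5 stars, but with a confidence of only about 0.27, compared with about 0.80 for the equivalent English sentence.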
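
This model labels text with a star rating from "1 star" to "5 stars", so the label and the score answer different questions: which rating, and how confident the model is in it. A small sketch of unpacking both from the pipeline output shown above; `parse_star_result` and the 0.5 confidence cut-off are assumptions made for illustration.

```python
def parse_star_result(result, min_confidence=0.5):
    """Extract (stars, score, is_confident) from a one-item pipeline result."""
    top = result[0]
    stars = int(top["label"].split()[0])  # "5 stars" -> 5
    score = top["score"]
    return stars, score, score >= min_confidence


# Results copied from the output above
english_result = [{'label': '5 stars', 'score': 0.7996292114257812}]
vietnamese_result = [{'label': '5 stars', 'score': 0.27345144748687744}]

print(parse_star_result(english_result))
print(parse_star_result(vietnamese_result))
```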

# Pre-trained Models for Vietnamese Sentiment Analysis
Several pre-trained models can handle Vietnamese text directly. One of the popular platforms to find such models is Hugging Face's Model Hub. Here's how you can use a model directly trained on Vietnamese text using the transformers library:

In [None]:
from transformers import pipeline

# Load a BERT checkpoint pre-trained on Vietnamese text (example model; verify the model name on the Hugging Face Hub)
classifier = pipeline('sentiment-analysis', model='trituenhantaoio/bert-base-vietnamese-uncased')

# Vietnamese text example
vietnamese_text = "Hôm nay trời đẹp quá, tôi muốn đi dạo ở công viên."

# Perform sentiment analysis
result = classifier(vietnamese_text)

print("Sentiment Analysis Result:", result)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at trituenhantaoio/bert-base-vietnamese-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment Analysis Result: [{'label': 'LABEL_0', 'score': 0.5011802315711975}]


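The score of 0.501 from "bert-base-vietnamese-uncased" should not be read as an improvement over 0.27. As the warning above states, the classification head of this checkpoint is newly initialized, so the generic label LABEL_0 and a score close to 0.5 amount to a coin flip between two classes. The model must be fine-tuned on labeled Vietnamese sentiment data before its predictions are meaningful.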
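
The warning printed above says the classifier weights were newly initialized, which is worth checking for before trusting any score. A rough heuristic sketch follows: it flags results that carry a generic `LABEL_<n>` placeholder and a score near chance level. The `looks_untrained` helper, the assumed two-label head, and the tolerance are all illustrative assumptions, not a transformers API.

```python
def looks_untrained(result, num_labels=2, tolerance=0.05):
    """Heuristic check for an untrained classification head.

    Flags a result whose label is a generic 'LABEL_<n>' placeholder and whose
    score is within `tolerance` of the uniform probability 1 / num_labels.
    """
    top = result[0]
    generic_label = top["label"].startswith("LABEL_")
    near_chance = abs(top["score"] - 1.0 / num_labels) <= tolerance
    return generic_label and near_chance


# Result copied from the output above
result = [{'label': 'LABEL_0', 'score': 0.5011802315711975}]
print(looks_untrained(result))
```

The load-time warning remains the more reliable signal; this check only helps when that output has been lost or suppressed.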

# Using VnCoreNLP for Vietnamese Natural Language Processing
VnCoreNLP is a Vietnamese natural language processing toolkit that supports tasks like POS tagging, named entity recognition, and dependency parsing. For sentiment analysis, you might need to train a classifier using features extracted with VnCoreNLP.

In [None]:
!pip install vncorenlp

Collecting vncorenlp
  Downloading vncorenlp-1.0.3.tar.gz (2.6 MB)
     2.6/2.6 MB 10.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: vncorenlp
  Building wheel for vncorenlp (setup.py) ... done
  Created wheel for vncorenlp: filename=vncorenlp-1.0.3-py3-none-any.whl size=2645932 sha256=1e9adcc899c457f09b0dbca79d473def9d9fb69b36b557e543a1c16762cd7680
  Stored in directory: /root/.cache/pip/wheels/5d/d9/b3/41f6c6b1ab758561fd4aab55dc0480b9d7a131c6aaa573a3fa
Successfully built vncorenlp
Installing collected packages: vncorenlp
Successfully installed vncorenlp-1.0.3


In [None]:
from vncorenlp import VnCoreNLP

# Setup the VnCoreNLP client
vncorenlp_file = '/content/drive/MyDrive/data/vncorenlp/VnCoreNLP-1.1.1.jar'
rdrsegmenter = VnCoreNLP(vncorenlp_file, annotators="wseg", max_heap_size='-Xmx500m')

text = "Hôm nay trời đẹp quá, tôi muốn đi dạo ở công viên."
word_segmented_text = rdrsegmenter.tokenize(text)
print("Word-segmented text:", word_segmented_text)

Word-segmented text: [['Hôm_nay', 'trời', 'đẹp', 'quá', ',', 'tôi', 'muốn', 'đi', 'dạo', 'ở', 'công_viên', '.']]
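
The tokenizer returns one list of tokens per sentence, with the syllables of a multi-syllable word joined by an underscore (for example `công_viên`). To feed this into a text vectorizer, each sentence can be joined back into a single space-separated string. A minimal sketch using the output above:

```python
# Output of rdrsegmenter.tokenize(text), copied from above
word_segmented_text = [['Hôm_nay', 'trời', 'đẹp', 'quá', ',', 'tôi', 'muốn',
                        'đi', 'dạo', 'ở', 'công_viên', '.']]

# One space-separated string per sentence
sentences = [" ".join(tokens) for tokens in word_segmented_text]
print(sentences[0])
```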


Beyond word segmentation, VnCoreNLP can run a full annotation pipeline: POS tagging, named entity recognition, and dependency parsing. None of these annotators produces a sentiment score, so sentiment analysis still requires training a separate classifier on features extracted with VnCoreNLP.

In [None]:
annotator = VnCoreNLP(vncorenlp_file, annotators="wseg,pos,ner,parse", max_heap_size='-Xmx2g')

annotations = annotator.annotate(text)
print(annotations)

{'sentences': [[{'index': 1, 'form': 'Hôm_nay', 'posTag': 'N', 'nerLabel': 'O', 'head': 3, 'depLabel': 'tmp'}, {'index': 2, 'form': 'trời', 'posTag': 'N', 'nerLabel': 'O', 'head': 3, 'depLabel': 'sub'}, {'index': 3, 'form': 'đẹp', 'posTag': 'A', 'nerLabel': 'O', 'head': 0, 'depLabel': 'root'}, {'index': 4, 'form': 'quá', 'posTag': 'R', 'nerLabel': 'O', 'head': 3, 'depLabel': 'amod'}, {'index': 5, 'form': ',', 'posTag': 'CH', 'nerLabel': 'O', 'head': 3, 'depLabel': 'punct'}, {'index': 6, 'form': 'tôi', 'posTag': 'P', 'nerLabel': 'O', 'head': 7, 'depLabel': 'sub'}, {'index': 7, 'form': 'muốn', 'posTag': 'V', 'nerLabel': 'O', 'head': 3, 'depLabel': 'dep'}, {'index': 8, 'form': 'đi', 'posTag': 'V', 'nerLabel': 'O', 'head': 7, 'depLabel': 'vmod'}, {'index': 9, 'form': 'dạo', 'posTag': 'V', 'nerLabel': 'O', 'head': 8, 'depLabel': 'vmod'}, {'index': 10, 'form': 'ở', 'posTag': 'E', 'nerLabel': 'O', 'head': 9, 'depLabel': 'loc'}, {'index': 11, 'form': 'công_viên', 'posTag': 'N', 'nerLabel': 'O'
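
Each token in the annotation is a dictionary, so candidate features for a classifier can be pulled out with ordinary list comprehensions. The sketch below collects word forms by POS tag from the first four tokens of the output above; `words_with_tag` is a hypothetical helper, and reading `A` as adjective and `N` as noun follows my understanding of the VnCoreNLP tag set.

```python
# First four tokens copied from the annotation output above
annotations = {'sentences': [[
    {'index': 1, 'form': 'Hôm_nay', 'posTag': 'N', 'nerLabel': 'O', 'head': 3, 'depLabel': 'tmp'},
    {'index': 2, 'form': 'trời', 'posTag': 'N', 'nerLabel': 'O', 'head': 3, 'depLabel': 'sub'},
    {'index': 3, 'form': 'đẹp', 'posTag': 'A', 'nerLabel': 'O', 'head': 0, 'depLabel': 'root'},
    {'index': 4, 'form': 'quá', 'posTag': 'R', 'nerLabel': 'O', 'head': 3, 'depLabel': 'amod'},
]]}


def words_with_tag(annotations, pos_tag):
    """Return the word forms whose POS tag equals `pos_tag`."""
    return [
        token['form']
        for sentence in annotations['sentences']
        for token in sentence
        if token['posTag'] == pos_tag
    ]


print(words_with_tag(annotations, 'A'))  # adjectives often carry sentiment
print(words_with_tag(annotations, 'N'))
```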

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy dataset: five Vietnamese sentences with sentiment labels
texts = [
    "Hôm nay tôi cảm thấy rất vui!",
    "Tôi thấy buồn.",
    "Cuộc sống này quá tuyệt!",
    "Tôi không hài lòng với dịch vụ này.",
    "Thức ăn rất ngon."
]
labels = [1, 0, 1, 0, 1]  # 1 for positive, 0 for negative

In [None]:
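# Transform texts to features using TF-IDF
# (fitting on all texts before the split lets test vocabulary leak into training; tolerable only in a toy demo)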
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
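
Note that the texts above are passed to TF-IDF without word segmentation. With its default settings, `TfidfVectorizer` lowercases the text and extracts tokens with the pattern `(?u)\b\w\w+\b`, so an unsegmented Vietnamese word such as "công viên" is split into two syllables and single-character words are dropped. The sketch below reproduces only that default tokenization step with the standard library to show the difference word segmentation makes; `default_tokens` is a hypothetical helper, not a scikit-learn function.

```python
import re
import unicodedata

# Default token_pattern of scikit-learn's TfidfVectorizer
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Lowercase the text and extract tokens of two or more word characters."""
    # Normalize to composed form so accented letters are single characters
    text = unicodedata.normalize("NFC", text).lower()
    return TOKEN_PATTERN.findall(text)


raw = "tôi muốn đi dạo ở công viên."
segmented = "tôi muốn đi dạo ở công_viên."

print(default_tokens(raw))        # "công" and "viên" become separate tokens
print(default_tokens(segmented))  # "công_viên" stays a single token
```

Because the underscore counts as a word character, joining VnCoreNLP's segmented tokens with spaces before vectorizing keeps multi-syllable words intact as features.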

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

Accuracy: 0.0
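
An accuracy of 0.0 here says little about the approach itself. With five sentences and test_size=0.2, the test set holds a single sentence, so accuracy can only be 0.0 or 1.0. The arithmetic below mirrors how a fractional test size is rounded up to a whole number of samples, which is my understanding of how train_test_split behaves.

```python
import math

n_samples = 5
test_size = 0.2

# A fractional test size is rounded up to a whole number of samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Every accuracy value that a test set of this size can produce
possible_accuracies = [correct / n_test for correct in range(n_test + 1)]

print("train:", n_train, "test:", n_test)
print("possible accuracies:", possible_accuracies)
```

A larger labeled dataset, or cross-validation over the available data, would give a more informative estimate.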


In [None]:
# Example of predicting sentiment score for a new text
new_text = "Cuộc sống này quá tuyệt!"
new_features = vectorizer.transform([new_text])
prediction = model.predict_proba(new_features)
print("Sentiment Score:", prediction)

Sentiment Score: [[0.21801074 0.78198926]]


The predict_proba output gives the model's estimated probability of each class for the new text "Cuộc sống này quá tuyệt!" ("This life is wonderful!"):

**Sentiment Score:** [[0.21801074 0.78198926]]

The first number in the array (0.21801074) represents the model's estimated probability that the sentiment of the text is negative (class 0). The second number (0.78198926) represents the probability that the sentiment is positive (class 1).

**Interpretation**
The model therefore predicts that "Cuộc sống này quá tuyệt!" is positive, with an estimated probability of about 78.2%. Note that this sentence also appears in the five-sentence example dataset, so the figure reflects how well the model fits that data rather than how it would perform on unseen text.
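
The columns of predict_proba follow the order of the model's classes (here 0, then 1), and each row sums to 1. A minimal sketch of turning the row above into a label and a confidence value; the `class_names` mapping is an assumption that matches the labels used in this notebook.

```python
# Output of model.predict_proba(new_features), copied from above
probabilities = [[0.21801074, 0.78198926]]
class_names = {0: "negative", 1: "positive"}

row = probabilities[0]
# Index of the largest probability is the predicted class
predicted = max(range(len(row)), key=row.__getitem__)

print(class_names[predicted], round(row[predicted], 3))
```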

In [None]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
from textblob import TextBlob

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from textblob import TextBlob

text_ttt = "This is the best day of my life!"
blob = TextBlob(text_ttt)
polarity = blob.sentiment.polarity

print("Polarity Score:", polarity)

Polarity Score: 1.0


In [None]:
# Load data from CSV files
vietnamese_data = pd.read_csv('/content/drive/MyDrive/data/final_tale_of_kieu_vi.csv')
english_data = pd.read_csv('/content/drive/MyDrive/data/final_tale_of_kieu_en.csv')