<a href="https://colab.research.google.com/github/maulanaakbardj/ML-X/blob/main/Text%20Sentiment%20Analysis/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Sentiment Analysis

Text Sentiment Analysis is a text processing technique used to determine the sentiment or emotion conveyed in a piece of text, such as positive, negative, or neutral. The goal is to understand the viewpoint or emotions expressed in the text, especially in contexts like product reviews, social media, or online comments.

## Text Sentiment Analysis Dataset

The dataset used: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset

## Import libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report


## Load the Dataset

In [None]:
train_data = pd.read_csv('train.csv', encoding='ISO-8859-1')
test_data = pd.read_csv('test.csv', encoding='ISO-8859-1')

In [None]:
train_data

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26
...,...,...,...,...,...,...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative,night,31-45,Ghana,31072940,227540.0,137
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative,morning,46-60,Greece,10423054,128900.0,81
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive,noon,60-70,Grenada,112523,340.0,331
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive,night,70-100,Guatemala,17915568,107160.0,167


In [None]:
test_data

Unnamed: 0,textID,text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,morning,0-20,Afghanistan,38928346.0,652860.0,60.0
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,noon,21-30,Albania,2877797.0,27400.0,105.0
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,night,31-45,Algeria,43851044.0,2381740.0,18.0
3,01082688c6,happy bday!,positive,morning,46-60,Andorra,77265.0,470.0,164.0
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,noon,60-70,Angola,32866272.0,1246700.0,26.0
...,...,...,...,...,...,...,...,...,...
4810,,,,,,,,,
4811,,,,,,,,,
4812,,,,,,,,,
4813,,,,,,,,,


## Text Preprocessing

The code starts by importing NLTK (the Natural Language Toolkit), a widely used Python library for natural language processing (NLP).

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

`nltk.download('punkt')` downloads the "Punkt" tokenizer models. Punkt is a sentence tokenizer, and NLTK's `word_tokenize` (used below) relies on it to split text into sentences before splitting those into words. Tokenization is a basic building block for many text processing tasks, including text classification and sentiment analysis.

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

`nltk.download('stopwords')` is used to download a list of common stopwords in various languages. Stopwords are words that are commonly used in a language but are often removed from text data because they don't typically carry significant meaning (e.g., "and," "the," "is"). Removing stopwords can help reduce the dimensionality of text data for text analysis tasks.

In [None]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    if isinstance(text, str):
        text = re.sub(r'[^\w\s]', '', text)
        text = text.lower()
        tokens = word_tokenize(text)
        tokens = [word for word in tokens if word not in stop_words]
        return ' '.join(tokens)
    else:
        return text

train_data['text'] = train_data['text'].apply(preprocess_text)
test_data['text'] = test_data['text'].apply(preprocess_text)

The `preprocess_text` function preprocesses a single text value. It takes the text as an argument and performs the following operations:

*   Check if the input is a string; non-string values (such as missing values) are returned unchanged.
*   Remove punctuation from the text using the `re.sub()` function.
*   Convert the text to lowercase to ensure consistency.
*   Tokenize the text into words using `word_tokenize` from NLTK.
*   Remove stopwords by filtering out words in the `stop_words` set.
*   Join the filtered tokens back into a string.
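
The steps above can be traced on two tweets from the training data with a small, dependency-free sketch. Two things here are assumptions made for illustration: `STOP_WORDS` is a hypothetical hand-picked subset of NLTK's English stopword list, and `str.split()` stands in for `word_tokenize`, so results on other inputs may differ from the notebook's function.

```python
import re

# Hypothetical, tiny stop list for illustration only;
# the notebook uses NLTK's full English stopword list.
STOP_WORDS = {"i", "will", "you", "here", "in", "my", "is", "me"}

def preprocess_sketch(text):
    """Mirror the steps of preprocess_text without requiring NLTK."""
    if not isinstance(text, str):
        return text  # non-string values (e.g. NaN) pass through unchanged
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = text.lower()                  # normalize case
    tokens = text.split()                # stand-in for word_tokenize (assumption)
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess_sketch("Sooo SAD I will miss you here in San Diego!!!"))
print(preprocess_sketch("my boss is bullying me..."))
```

Both results match the preprocessed `text` values shown for these rows further down in the notebook.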







### Drop rows with missing values

In [None]:
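# Drop rows whose text is missing. After dropna no NaN remains in 'text',
# so a follow-up fillna is unnecessary; .copy() avoids pandas
# chained-assignment warnings when a column is added later.
train_data = train_data.dropna(subset=['text']).copy()
test_data = test_data.dropna(subset=['text']).copy()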

In [None]:
train_data

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,id responded going,"I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,sooo sad miss san diego,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,boss bullying,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,interview leave alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,sons couldnt put releases already bought,"Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26
...,...,...,...,...,...,...,...,...,...,...
27476,4eac33d1c0,wish could come see u denver husband lost job ...,d lost,negative,night,31-45,Ghana,31072940,227540.0,137
27477,4f4c4fc327,ive wondered rake client made clear net dont f...,", don`t force",negative,morning,46-60,Greece,10423054,128900.0,81
27478,f67aae2310,yay good enjoy break probably need hectic week...,Yay good for both of you.,positive,noon,60-70,Grenada,112523,340.0,331
27479,ed167662a5,worth,But it was worth it ****.,positive,night,70-100,Guatemala,17915568,107160.0,167


In [None]:
test_data

Unnamed: 0,textID,text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,f87dea47db,last session day httptwitpiccom67ezh,neutral,morning,0-20,Afghanistan,38928346.0,652860.0,60.0
1,96d74cb729,shanghai also really exciting precisely skyscr...,positive,noon,21-30,Albania,2877797.0,27400.0,105.0
2,eee518ae67,recession hit veronique branquinho quit compan...,negative,night,31-45,Algeria,43851044.0,2381740.0,18.0
3,01082688c6,happy bday,positive,morning,46-60,Andorra,77265.0,470.0,164.0
4,33987a8ee5,httptwitpiccom4w75p like,positive,noon,60-70,Angola,32866272.0,1246700.0,26.0
...,...,...,...,...,...,...,...,...,...
3529,e5f0e6ef4b,3 im tired cant sleep try,negative,noon,21-30,Nicaragua,6624554.0,120340.0,55.0
3530,416863ce47,alone old house thanks net keeps alive kicking...,positive,night,31-45,Niger,24206644.0,1266700.0,19.0
3531,6332da480c,know mean little dog sinking depression wants ...,negative,morning,46-60,Nigeria,206139589.0,910770.0,226.0
3532,df1baec676,_sutra next youtube video gon na love videos,positive,noon,60-70,North Korea,25778816.0,120410.0,214.0


## Model Building

Split the labeled training data into training (80%) and validation (20%) sets.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train_data['text'], train_data['sentiment'], test_size=0.2, random_state=42)

In [None]:
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_val_vectorized = vectorizer.transform(X_val)
X_test_vectorized = vectorizer.transform(test_data['text'])


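`CountVectorizer` from scikit-learn converts the text into a numerical format that machine learning models can use: each document becomes a vector of token counts over the vocabulary learned from the training split. This process is known as text vectorization. The vectorizer is fitted on the training split only and then reused to transform the validation and test texts, so words that never appear in the training split are ignored.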
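
To make the idea of count-based vectorization concrete, here is a minimal pure-Python sketch of a bag-of-words transform. It is not how scikit-learn implements `CountVectorizer` (which, among other differences, uses its own token pattern and returns a sparse matrix); `fit_vocabulary` and `transform` are hypothetical helper names, and whitespace splitting is a simplifying assumption.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token in the training documents to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Turn each document into a row of token counts; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Vocabulary is learned from the training documents only
train_docs = ["happy bday", "boss bullying", "sooo sad miss san diego"]
vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)

# A new document: "unknownword" is not in the vocabulary, so it is dropped
val_matrix = transform(["sad sad boss unknownword"], vocabulary)

print(list(vocabulary))
print(val_matrix[0])
```

The new document is encoded using only the columns learned from the training documents, which is why the notebook calls `fit_transform` on the training split and plain `transform` on the validation and test texts.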

## Train the Model with Multinomial Naive Bayes

I am using the Multinomial Naive Bayes classifier to train a text classification model. Text classification is the task of assigning text documents to predefined categories or labels.

In [None]:
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)


The Multinomial Naive Bayes classifier is a probabilistic machine learning algorithm commonly used for text classification and other natural language processing tasks. It is a variant of the Naive Bayes algorithm, which is based on Bayes' theorem and the assumption that features are conditionally independent given the class. The "Multinomial" part of the name refers to the distribution used to model the data: the word counts of each class are treated as draws from a multinomial distribution.
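
As a rough illustration of what happens inside such a classifier, the following from-scratch sketch estimates class priors and Laplace-smoothed word likelihoods from a toy corpus and predicts the class with the highest log score. It is a simplified teaching version, not scikit-learn's implementation; the function names and the toy documents are made up for this example, and `alpha=1.0` is chosen to mirror the default smoothing of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    log_likelihood = {}
    for c in docs_per_class:
        # Laplace smoothing: add alpha to every word count in the class
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((word_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy, already-tokenized documents (illustrative only)
docs = [["happy", "bday"], ["yay", "good", "enjoy"],
        ["boss", "bullying"], ["sooo", "sad", "miss"]]
labels = ["positive", "positive", "negative", "negative"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(["sad", "boss"], log_prior, log_likelihood))
print(predict(["happy", "good"], log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was never seen in a class from forcing that class's probability to zero.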

## Model Evaluation Results with Classification Report

The performance of the Multinomial Naive Bayes text classification model is evaluated on the validation set using `classification_report`. The report lists precision, recall, F1-score, and support (the number of samples) for each class, along with the overall accuracy and averaged metrics.

In [None]:
y_pred = classifier.predict(X_val_vectorized)

print(classification_report(y_val, y_pred))


              precision    recall  f1-score   support

    negative       0.67      0.54      0.60      1572
     neutral       0.58      0.66      0.62      2236
    positive       0.69      0.68      0.69      1688

    accuracy                           0.63      5496
   macro avg       0.65      0.63      0.63      5496
weighted avg       0.64      0.63      0.63      5496



The F1-score is the harmonic mean of precision and recall:

*   For the "negative" class, the F1-score is 0.60.
*   For the "neutral" class, the F1-score is 0.62.
*   For the "positive" class, the F1-score is 0.69.

The accuracy of the model is 0.63, meaning that the model correctly predicted 63% of the 5,496 instances in the validation set.
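
The relationship between these numbers can be checked with a short calculation. The sketch below recomputes F1 from the rounded precision and recall values printed in the report, and approximates accuracy from recall and support. `f1_score_from` is a hypothetical helper name, and because the inputs are already rounded to two decimals, the results can differ from the report in the last digit.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the report above
report = {
    "negative": (0.67, 0.54, 1572),
    "neutral": (0.58, 0.66, 2236),
    "positive": (0.69, 0.68, 1688),
}

for label, (precision, recall, _) in report.items():
    print(label, round(f1_score_from(precision, recall), 3))

# recall * support approximates the number of correct predictions per class
correct = sum(recall * support for _, recall, support in report.values())
total = sum(support for _, _, support in report.values())
print("approximate accuracy:", round(correct / total, 2))
```

For the "positive" class the rounded inputs give about 0.685 rather than exactly the reported 0.69, which is the expected effect of starting from two-decimal values instead of the underlying counts.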

## Predict Sentiment on the Test Data

I am using the trained Multinomial Naive Bayes classifier to make predictions on the test data.

In [None]:
test_predictions = classifier.predict(X_test_vectorized)
test_data['predicted_sentiment'] = test_predictions


In [None]:
test_data

Unnamed: 0,textID,text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²),predicted_sentiment
0,f87dea47db,last session day httptwitpiccom67ezh,neutral,morning,0-20,Afghanistan,38928346.0,652860.0,60.0,positive
1,96d74cb729,shanghai also really exciting precisely skyscr...,positive,noon,21-30,Albania,2877797.0,27400.0,105.0,positive
2,eee518ae67,recession hit veronique branquinho quit compan...,negative,night,31-45,Algeria,43851044.0,2381740.0,18.0,negative
3,01082688c6,happy bday,positive,morning,46-60,Andorra,77265.0,470.0,164.0,positive
4,33987a8ee5,httptwitpiccom4w75p like,positive,noon,60-70,Angola,32866272.0,1246700.0,26.0,neutral
...,...,...,...,...,...,...,...,...,...,...
3529,e5f0e6ef4b,3 im tired cant sleep try,negative,noon,21-30,Nicaragua,6624554.0,120340.0,55.0,negative
3530,416863ce47,alone old house thanks net keeps alive kicking...,positive,night,31-45,Niger,24206644.0,1266700.0,19.0,neutral
3531,6332da480c,know mean little dog sinking depression wants ...,negative,morning,46-60,Nigeria,206139589.0,910770.0,226.0,negative
3532,df1baec676,_sutra next youtube video gon na love videos,positive,noon,60-70,North Korea,25778816.0,120410.0,214.0,neutral


`test_data` now has a new column called `predicted_sentiment`, which contains the sentiment label predicted by the Multinomial Naive Bayes classifier for each entry in the test data.
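
Because the test file also carries the true `sentiment` labels, the predictions can be compared against them. The sketch below shows one possible way to do that with pandas, using only the nine rows displayed above as a stand-in (an assumption made to keep the example self-contained); in the notebook the same two expressions would be applied to the full `test_data` frame.

```python
import pandas as pd

# True and predicted labels for the nine rows displayed above;
# this small frame is only a stand-in for the full test_data.
sample = pd.DataFrame({
    "sentiment": ["neutral", "positive", "negative", "positive", "positive",
                  "negative", "positive", "negative", "positive"],
    "predicted_sentiment": ["positive", "positive", "negative", "positive", "neutral",
                            "negative", "neutral", "negative", "neutral"],
})

# Fraction of rows where the prediction matches the true label
sample_accuracy = (sample["sentiment"] == sample["predicted_sentiment"]).mean()

# Rows: true label, columns: predicted label
confusion = pd.crosstab(sample["sentiment"], sample["predicted_sentiment"])

print(sample_accuracy)
print(confusion)
```

The value computed here covers only the displayed rows and says nothing reliable about the whole test set; it is meant to show the mechanics of the comparison.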