### Sentiment Analysis in Python

A Natural Language Processing project in Python that builds a sentiment analysis classifier using three different techniques:

        K-nn (K-nearest neighbors)
        VADER (Valence Aware Dictionary and sEntiment Reasoner)
        Logistic Regression model


### Step 0. Read in the data and analyze it

In [41]:
import pandas as pd                     # pip install pandas
import re                               # for regex
import nltk                             # pip install nltk

from nltk.corpus import stopwords                                       # pip install nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer             # Vader Sentiment Analysis (eng)
from sklearn.feature_extraction.text import CountVectorizer             # pip install scikit-learn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [42]:
# Read in data
df = pd.read_csv('Reviews.csv')

In [None]:
# Attribute information
df.info()

In [None]:
# Show the first 5 rows of the file
df.head()

#### Data analysis

In [None]:
# Overview of reviews
print(df['Score'].value_counts().sort_index())

# Data visualization
graph = df['Score'].value_counts().sort_index().plot(kind='bar', title='Number of reviews per score', figsize=(10, 5), color='red')
graph.set_xlabel('Score')
graph.set_ylabel('Number of reviews')
graph.plot()

#### Text labeling

Each review is labeled according to its "Score":

Score > 3: Positive

Score < 3: Negative

Score = 3: Neutral

Neutral reviews (Score = 3) are dropped before training.

In [None]:
# Remove the neutral reviews (copy to avoid pandas' SettingWithCopyWarning when adding a column)
df = df[df['Score'] != 3].copy()

# Create a new "Sentiment" column that marks each review as Positive, Negative or Neutral
df['Sentiment'] = df['Score'].apply(lambda score: 'Positive' if score > 3 else ('Negative' if score < 3 else 'Neutral'))

df.head()

### Step 1. Data cleaning

Stopwords, HTML tags, special characters and punctuation will be removed, and the text will be converted to lowercase.

Stopwords are very common words that add little meaning to a sentence, such as "I, are, all, mine, yours, ours, theirs, was...".

Similar words are then normalized using "stemming", a technique that reduces inflected forms of a word to a common stem, mainly by stripping suffixes.

        Ex: Watch, Watched, Watching -> watch
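
The cleaning steps listed above can be sketched as one small function that uses only the standard library. This is an illustration, not the notebook's own pipeline: `clean_text` and `TOY_STOPWORDS` are hypothetical names, the tiny stopword set stands in for NLTK's English list, tokens are split on whitespace, and stemming is left out.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")             # HTML tags such as <br />
NON_ALNUM_RE = re.compile(r"[^a-zA-Z0-9\s]")  # anything that is not a letter, digit or whitespace

# Hypothetical, tiny stopword set used for illustration only
TOY_STOPWORDS = {"i", "it", "the", "a", "is", "was", "are", "this"}

def clean_text(text, stop_words=TOY_STOPWORDS):
    """Strip HTML tags and punctuation, lowercase, and drop stopwords."""
    text = TAG_RE.sub("", text)
    text = NON_ALNUM_RE.sub("", text)
    text = text.lower()
    tokens = [token for token in text.split() if token not in stop_words]
    return " ".join(tokens)

sample = "<br />This product is GREAT, I loved it!"
cleaned = clean_text(sample)
print(cleaned)
```

Running the steps in a single pass makes the order explicit: lowercasing has to happen before stopword removal, because the stopword set is lowercase.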

##### Remove HTML Tags

In [59]:
def remove_tags(text):
    TAG_RE = re.compile(r'<[^>]+>')
    return TAG_RE.sub('', text)

df.Text = df.Text.apply(remove_tags)
df.Text[0]

'bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better'

##### Remove special characters (punctuation)

In [60]:
def remove_special_characters(text):
    # keep only letters, digits and whitespace (the range A-z would also let through [ \ ] ^ _ and `)
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    return text

df.Text = df.Text.apply(remove_special_characters)
df.Text[0]

'bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better'

##### Reviews to lowercase

In [61]:
def lower_case(text):
    return text.lower()

df.Text = df.Text.apply(lower_case)
df.Text[0]

'bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better'

##### Remove stopwords

In [62]:
nltk.download('stopwords')
nltk.download('punkt')

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
   
    # Add specific words to the stop_words list if necessary
    stop_words.update(['rt'])
    
    words = word_tokenize(text)
    filtered_text = [word for word in words if word not in stop_words]
    
    return filtered_text

df.Text = df.Text.apply(remove_stopwords)

# transform list to string
df.Text = df.Text.apply(lambda x: ' '.join(x))
df.Text[0]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andre_Rodrigues\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andre_Rodrigues\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


'bought sever vital dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better'

##### Data normalization with 'stemming' technique

In [63]:
def stemming_words(text):
    stemmer = SnowballStemmer('english')
    words = word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in words]
    return stemmed_words

df.Text = df.Text.apply(stemming_words)

# transform list to string
df.Text = df.Text.apply(lambda x: ' '.join(x))
df.Text[0]

'bought sever vital dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better'

##### Check frequent words in the dataset

This can help us identify additional words to add to the stopword list so that they are excluded from the analysis.

In [64]:
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

In [65]:
# Get the 20 most frequent words in the dataset
top_words = get_top_n_words(df.Text, n=20)

In [None]:
# Display the 20 most frequent words
top_df = pd.DataFrame(top_words)
top_df.columns = ["Word", "Frequency"]
top_df

### Step 2. Bag of words

Create the Bag of Words (BOW), which is the set of distinct words that occur in our dataset. Each word appears only once in the BOW.

First we tokenize the dataset. Tokenization transforms a string into a list of units called tokens. Tokens can be words, emoticons, hashtags, links, or even individual characters. A basic way to tokenize is to split the text on whitespace and punctuation.

The NLTK library provides a default tokenizer through the word_tokenize(text) function.

With the words tokenized, we place one representation of each in our "bag of words".
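
To make the idea concrete, here is a minimal sketch of a bag of words built with the standard library only. It is not how `CountVectorizer` is implemented: the helper names are hypothetical, and tokens are obtained by a plain whitespace split, which differs from scikit-learn's default tokenization.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Collect every distinct word once, in sorted order."""
    return sorted({word for sentence in sentences for word in sentence.split()})

def to_count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

docs = ["good dog food", "bad food bad smell"]
vocab = build_vocabulary(docs)
matrix = [to_count_vector(doc, vocab) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row of `matrix` is one document and each column is one vocabulary word, which is the same layout as the matrix returned by `generate_bow_matrix` below.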

In [66]:
# Simple BOW with 1 representation of each word
def generate_bow(sentences):
    vec = CountVectorizer().fit(sentences)
    # the fitted vocabulary already holds each word exactly once
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    words = list(vec.get_feature_names_out())
    return words

# BOW of words and their frequencies
def generate_bow_freq(sentences):
    vec = CountVectorizer().fit(sentences)
    bag_of_words = vec.transform(sentences)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq

# Matrix BOW
def generate_bow_matrix(sentences):
    vec = CountVectorizer().fit(sentences)
    bag_of_words = vec.transform(sentences)
    return bag_of_words
    

### Step 3. Balanced Dataset

The dataset used to train the model should be balanced, with the same number of samples in the positive and negative classes.

In [67]:
# verify how many positive and negative reviews we have
df.Sentiment.value_counts()

Positive    3846
Negative     759
Name: Sentiment, dtype: int64

In [69]:
# 759 negative and 3846 positive reviews
# Take the first 759 samples of each sentiment class to form the balanced dataset
df_balanced = df.groupby('Sentiment').head(759)
df_balanced.Sentiment.value_counts()

Positive    759
Negative    759
Name: Sentiment, dtype: int64
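
The cell above hard-codes 759 and keeps the first rows of each class. As a library-free sketch of the same downsampling idea, the hypothetical helper below derives the minority-class size from the data and samples rows at random with a fixed seed. The list-of-dicts input is an assumption made to keep the example self-contained; it is not the notebook's DataFrame.

```python
import random
from collections import defaultdict

def downsample_to_minority(rows, label_key="Sentiment", seed=9):
    """Return a balanced subset with as many rows per class as the smallest class has."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    n_per_class = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], n_per_class))
    return balanced

# Made-up rows: 5 positive and 2 negative reviews
rows = [{"Text": f"review {i}", "Sentiment": "Positive"} for i in range(5)]
rows += [{"Text": f"complaint {i}", "Sentiment": "Negative"} for i in range(2)]

balanced = downsample_to_minority(rows)
counts = {label: sum(r["Sentiment"] == label for r in balanced)
          for label in ("Positive", "Negative")}
print(counts)
```

Random sampling avoids any bias that comes from the order of the rows in the file, at the cost of needing a seed for reproducibility.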

In [None]:
df_balanced.head()

### Step 4. Training and testing datasets

In [71]:
# Separate training and testing datasets
df_train, df_test, label_column_train, label_column_test = train_test_split(df_balanced, df_balanced['Sentiment'], test_size=0.2, random_state=9)


# x and y must have the same number of rows
print(f"Train shapes: x = {df_train.shape}, y = {label_column_train.shape}")
print(f"Test shapes: x = {df_test.shape}, y = {label_column_test.shape}")

Train shapes: x = (1214, 11), y = (1214,)
Test shapes: x = (304, 11), y = (304,)


### Step 5. Classification models

We will use the classification models below.

#### K-NN model

KNN (K-Nearest Neighbors) is a classification algorithm. To predict the class of a test sample, KNN computes the distance between that sample and every training point, then selects the K training points that are closest to it. The class that is most common among those K neighbors (that is, the class with the highest estimated probability) is assigned to the test sample.
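
The procedure described above can be sketched in plain Python. This is an illustration under simple assumptions (Euclidean distance, majority vote, tiny made-up count vectors), not the `KNeighborsClassifier` implementation; `euclidean` and `knn_predict` are hypothetical helper names.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query with the majority class among its k nearest training points."""
    distances = sorted(
        (euclidean(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-word count vectors: [count of "good", count of "bad"]
train_vectors = [[2, 0], [3, 1], [0, 2], [1, 3]]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

prediction = knn_predict(train_vectors, train_labels, [2, 1], k=3)
print(prediction)
```

With two classes, an odd `k` avoids tied votes.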

In [72]:

# K-nn sentiment analysis model
def generate_knn_model(df, txt_vectorized_arr, k):
    knn = KNeighborsClassifier(n_neighbors=k)

    # Parameters: texts vectorized as a 2D array, array of sentiment labels
    knn.fit(txt_vectorized_arr, df.Sentiment.values)

    return knn

# K-nn sentiment analysis of a single dataframe
def knn_sentiment_analysis(df, txt_vectorized_arr, model):
    df_copy = df.copy()

    # predict
    # Parameter: texts vectorized as a 2D array
    df_copy['Knn_Predict'] = model.predict(txt_vectorized_arr)

    # check accuracy (the same overall value is stored in every row)
    # Parameters: array of true sentiment labels, array of predicted labels
    df_copy['Accuracy'] = accuracy_score(df_copy.Sentiment.values, df_copy.Knn_Predict.values)
    return df_copy

In [73]:
cv = CountVectorizer()

# Create the knn model
# Parameters: training dataframe, 2D array of vectorized DF texts, k-neighbors
# cv.fit_transform(df_train.Text).toarray() fits the vocabulary on the training texts and returns
# a 2D array of word counts, which is used to train the model
knn_model = generate_knn_model(df_train, cv.fit_transform(df_train.Text).toarray(), 5)


In [74]:
# Prediction using created model
df_train_Knn = knn_sentiment_analysis(df_train, cv.transform(df_train.Text).toarray(), knn_model)
df_test_Knn = knn_sentiment_analysis(df_test, cv.transform(df_test.Text).toarray(), knn_model)


##### Training Result

Comparison of the actual class counts with the counts predicted by the model on the training dataset

In [76]:
print("\nQuantity of positive and negative values in the dataset:")
print(df_train_Knn.Sentiment.value_counts())

print("\nQuantity of positive and negative values found by the model:")
print(df_train_Knn.Knn_Predict.value_counts())


Quantity of positive and negative values in the dataset:
Positive    619
Negative    595
Name: Sentiment, dtype: int64

Quantity of positive and negative values found by the model:
Positive    698
Negative    516
Name: Knn_Predict, dtype: int64


Training Dataframe

In [79]:
df_train_Knn.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Sentiment,Knn_Predict,Accuracy
199,200,B0028C44Z0,A1WTXY4MW3YDF2,Gordon,0,0,5,1344729600,These mints are awesome!,huge suppli im still work plenti spare much ef...,Positive,Negative,0.81631
559,560,B000G6RYNE,A10RJEQN64ATXU,"Paul Rodney Williams ""Higher Lifestyle""",3,3,5,1188777600,delicious,love kettl brand sea salt vinegar chip sinc fi...,Positive,Positive,0.81631
46,47,B001EO5QW8,AQLL2R1PPR46X,grumpyrainbow,0,0,5,1192752000,good,good oatmeal like appl cinnamon best though wo...,Positive,Positive,0.81631
551,552,B000G6RYNE,A2B5OI74EHGVH1,"Jane ""jdeaton2""",3,8,1,1273017600,dripping in oil,purcha low salt ind low salt howev mani mani c...,Negative,Negative,0.81631
2466,2467,B002JX7GVM,AN9E7KAWWF95Q,ChristineThomson,0,0,2,1348963200,Terrible flavor,almost gag first tast use coconut base product...,Negative,Negative,0.81631


Testing Dataframe

In [80]:
df_test_Knn.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Sentiment,Knn_Predict,Accuracy
133,134,B003OB0IB8,AOTEC8KEH8JGN,Seth S Moyers,0,0,5,1334880000,Great value and convenient ramen,got sale rough 25 cent per cup half price loca...,Positive,Negative,0.6875
277,278,B001D07IPG,A3QN14A5DGUA0U,J. Kraayenbrink,0,0,5,1343692800,Excellent for G/F,plea ignor onestar comment check bag main ingr...,Positive,Positive,0.6875
1520,1521,B001LQNX8S,A2QA5W84LGMG8L,Dee,0,0,1,1342742400,These are absolutely revolting,flavor pod horrid order senseo stock long need...,Negative,Positive,0.6875
2951,2952,B000H280KS,AJITVA02GIXPO,Gretchen,0,0,2,1173312000,"Looks better than it tasted, I'm told",sent gift found later mani item stale love tho...,Negative,Positive,0.6875
758,759,B0035YE9CS,AGXFRN1RZKI4J,Earl Hudson,0,1,5,1321228800,Outstanding Product!,saw made interest processdecid give tri realli...,Positive,Positive,0.6875


#### Vader model

Valence Aware Dictionary and sEntiment Reasoner (VADER) is an open source Python library built for use in sentiment analysis tasks.

VADER works in a simple way: it has a lexicon of words in which each word already has a sentiment score (valence) assigned. When given a document (sentence), it returns the following values as proportions between 0 and 1:

        pos: how positive the sentence/document is;

        neu: how neutral the sentence/document is;

        neg: how negative the sentence/document is;

compound: the metric that summarizes the sentiment of the sentence as a whole. It is calculated by summing the valence scores of each word found in the lexicon, adjusting the sum according to VADER's rules, and normalizing the result to a number between -1 (very negative) and +1 (very positive).

        If compound >= 0.05 -> Positive

        If compound <= -0.05 -> Negative

        If -0.05 < compound < 0.05 -> Neutral

The compound score is the most important metric when you only want to know whether a sentence is positive or negative, because its value can be converted directly into those categories, which is exactly what we do here.
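
As a rough sketch of how a compound score comes about, the example below sums word valences from a toy lexicon and squashes the sum into the range -1 to +1. The lexicon values and helper names are hypothetical, and VADER's heuristics (negation, intensifiers, punctuation, capitalization) are deliberately left out. The normalization x / sqrt(x^2 + alpha) with alpha = 15 follows the form used in the VADER reference implementation.

```python
import math

# Hypothetical toy lexicon; the real VADER lexicon holds thousands of rated entries
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "horrid": -2.5}

def toy_compound(text, alpha=15):
    """Sum the valence of known words and normalize the sum to (-1, 1)."""
    total = sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())
    return total / math.sqrt(total * total + alpha)

def label_from_compound(compound):
    """Map a compound score to a label using the thresholds listed above."""
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"

for sentence in ["great good food", "bad horrid smell", "dog food"]:
    score = toy_compound(sentence)
    print(sentence, round(score, 3), label_from_compound(score))
```

A sentence with no lexicon words sums to zero and is therefore labeled Neutral, which is why VADER can return a third class even though the dataset only contains Positive and Negative labels.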

In [81]:
nltk.download('vader_lexicon')

# Full analysis with pos, neg, neu and compound sentiment values
# Returns a dictionary with the sentiment values neg, pos, neu and compound of the passed sentence
# The analyzer is created once and reused, so the lexicon is not reloaded for every text
vader_analyzer = SentimentIntensityAnalyzer()

def vader_complete_sentiment_analyse(text):
    sentiment = vader_analyzer.polarity_scores(text)
    return sentiment

# Simple analysis that returns pos, neg or neu according to the compound value
def vader_get_sentiment(text):
    sentiment = vader_complete_sentiment_analyse(text)
    if sentiment['compound'] >= 0.05:
        return 'Positive'
    elif sentiment['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Sentiment analysis of the dataset with the VADER model
def vader_df_sentiment_analyse(df):
    df_copy = df.copy()
    df_copy['Vader_Predict'] = df_copy.Text.apply(vader_get_sentiment)
    df_copy['Accuracy'] = accuracy_score(df_copy.Sentiment.values, df_copy.Vader_Predict.values)

    return df_copy
    

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Andre_Rodrigues\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [82]:
df_train_vader = vader_df_sentiment_analyse(df_train)
df_test_vader = vader_df_sentiment_analyse(df_test)

##### Training Result

Comparison of the actual class counts with the counts predicted by the model on the training dataset

In [83]:
print("\nQuantity of actual positive and negative values in the dataset:")
print(df_train_vader.Sentiment.value_counts())

print("\nQuantity of positive and negative values found by the model:")
print(df_train_vader.Vader_Predict.value_counts())


Quantity of actual positive and negative values in the dataset:
Positive    619
Negative    595
Name: Sentiment, dtype: int64

Quantity of positive and negative values found by the model:
Positive    945
Negative    191
Neutral      78
Name: Vader_Predict, dtype: int64


In [84]:
df_train_vader.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Sentiment,Vader_Predict,Accuracy
199,200,B0028C44Z0,A1WTXY4MW3YDF2,Gordon,0,0,5,1344729600,These mints are awesome!,huge suppli im still work plenti spare much ef...,Positive,Positive,0.608731
559,560,B000G6RYNE,A10RJEQN64ATXU,"Paul Rodney Williams ""Higher Lifestyle""",3,3,5,1188777600,delicious,love kettl brand sea salt vinegar chip sinc fi...,Positive,Positive,0.608731
46,47,B001EO5QW8,AQLL2R1PPR46X,grumpyrainbow,0,0,5,1192752000,good,good oatmeal like appl cinnamon best though wo...,Positive,Positive,0.608731
551,552,B000G6RYNE,A2B5OI74EHGVH1,"Jane ""jdeaton2""",3,8,1,1273017600,dripping in oil,purcha low salt ind low salt howev mani mani c...,Negative,Negative,0.608731
2466,2467,B002JX7GVM,AN9E7KAWWF95Q,ChristineThomson,0,0,2,1348963200,Terrible flavor,almost gag first tast use coconut base product...,Negative,Negative,0.608731


##### Test Result

In [85]:
print("\nQuantity of actual positive and negative values in the dataset:")
print(df_test_vader.Sentiment.value_counts())

print("\nQuantity of positive and negative values found by the model:")
print(df_test_vader.Vader_Predict.value_counts())


In [86]:
df_test_vader.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Sentiment,Vader_Predict,Accuracy
133,134,B003OB0IB8,AOTEC8KEH8JGN,Seth S Moyers,0,0,5,1334880000,Great value and convenient ramen,got sale rough 25 cent per cup half price loca...,Positive,Positive,0.542763
277,278,B001D07IPG,A3QN14A5DGUA0U,J. Kraayenbrink,0,0,5,1343692800,Excellent for G/F,plea ignor onestar comment check bag main ingr...,Positive,Positive,0.542763
1520,1521,B001LQNX8S,A2QA5W84LGMG8L,Dee,0,0,1,1342742400,These are absolutely revolting,flavor pod horrid order senseo stock long need...,Negative,Negative,0.542763
2951,2952,B000H280KS,AJITVA02GIXPO,Gretchen,0,0,2,1173312000,"Looks better than it tasted, I'm told",sent gift found later mani item stale love tho...,Negative,Positive,0.542763
758,759,B0035YE9CS,AGXFRN1RZKI4J,Earl Hudson,0,1,5,1321228800,Outstanding Product!,saw made interest processdecid give tri realli...,Positive,Positive,0.542763


#### Logistic Regression model

Logistic regression is a probabilistic classification model that predicts a binary outcome (0 or 1). It passes a linear combination of the input features through the logistic (sigmoid) function to estimate the probability of the positive class, and it can be extended to more complex classification problems. Samples that lie far from the decision boundary and are already classified correctly have little influence on the fit, which is driven mostly by the samples close to the line that separates the classes. Like the Perceptron, it is trained iteratively and learns a linear separator, so it can only separate the classes perfectly when the data are linearly separable.
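
As a rough illustration of the probability model, the sketch below applies the logistic (sigmoid) function to a weighted sum of word counts. The vocabulary, weights and bias are hypothetical values chosen by hand; in practice they are learned from the training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for the vocabulary ["bad", "good", "love"]
weights = [-1.5, 1.0, 2.0]
bias = 0.0

p_good_love = predict_proba(weights, bias, [0, 1, 1])  # counts for "good love"
p_bad_bad = predict_proba(weights, bias, [2, 0, 0])    # counts for "bad bad"

print(round(p_good_love, 3), round(p_bad_bad, 3))
```

A review is classified as Positive when the probability is above 0.5, which happens exactly when the weighted sum is positive; that is what makes the decision boundary linear.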

In [87]:

def logistic_regression_sentiment_analysis(df, iterations=1000):
    # generate BOW matrix
    bow_matrix = generate_bow_matrix(df.Text)
    # generate BOW frequency
    bow_freq = generate_bow_freq(df.Text)
    # generate BOW frequency dataframe
    bow_freq_df = pd.DataFrame(bow_freq)
    bow_freq_df.columns = ["Word", "Frequency"]

    # Logistic Regression, fitted on the same DataFrame that is scored below
    lr = LogisticRegression(max_iter=iterations)
    lr.fit(bow_matrix, df.Sentiment)

    df_copy = df.copy()
    # predict
    df_copy['Logistic_Predict'] = lr.predict(bow_matrix)
    # check accuracy
    df_copy['Accuracy'] = accuracy_score(df_copy.Sentiment.values, df_copy.Logistic_Predict.values)

    return df_copy

In [88]:
df_train_lr = logistic_regression_sentiment_analysis(df_train)
df_test_lr = logistic_regression_sentiment_analysis(df_test)
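
Note that `logistic_regression_sentiment_analysis` fits a new model on whatever DataFrame it receives and then scores that same DataFrame, so the accuracy it stores for `df_test` is measured on data the model was trained on, not on held-out data. A minimal sketch of a held-out evaluation could look like the following; `fit_and_score_held_out` is a hypothetical helper that is not part of the original notebook, and the tiny texts are made up only to show the call shape.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def fit_and_score_held_out(train_texts, train_labels, test_texts, test_labels, max_iter=1000):
    """Fit on the training texts only and report (train accuracy, test accuracy)."""
    # Learn the vocabulary from the training texts only
    vectorizer = CountVectorizer()
    x_train = vectorizer.fit_transform(train_texts)
    # Reuse the same vocabulary for the test texts (no refitting)
    x_test = vectorizer.transform(test_texts)

    model = LogisticRegression(max_iter=max_iter)
    model.fit(x_train, train_labels)

    train_acc = accuracy_score(train_labels, model.predict(x_train))
    test_acc = accuracy_score(test_labels, model.predict(x_test))
    return train_acc, test_acc

# Made-up example data
train_acc, test_acc = fit_and_score_held_out(
    ["good great love", "love good", "bad horrid awful", "awful bad"],
    ["Positive", "Positive", "Negative", "Negative"],
    ["great love", "horrid bad"],
    ["Positive", "Negative"],
)
print(train_acc, test_acc)
```

With the notebook's variables, the call would be `fit_and_score_held_out(df_train.Text, df_train.Sentiment, df_test.Text, df_test.Sentiment)`, which mirrors how the K-NN section fits `cv` on the training texts and only transforms the test texts.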

##### Training Result

Comparison of the actual class counts with the counts predicted by the model on the training dataset

In [89]:
print("\nQuantity of actual positive and negative values in the dataset:")
print(df_train_lr.Sentiment.value_counts())

print("\nQuantity of positive and negative values found by the model:")
print(df_train_lr.Logistic_Predict.value_counts())


Quantity of actual positive and negative values in the dataset:
Positive    619
Negative    595
Name: Sentiment, dtype: int64

Quantity of positive and negative values found by the model:
Positive    623
Negative    591
Name: Logistic_Predict, dtype: int64


In [90]:
df_train_lr.head()


##### Test Result

In [91]:
df_test_lr.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Sentiment,Logistic_Predict,Accuracy
133,134,B003OB0IB8,AOTEC8KEH8JGN,Seth S Moyers,0,0,5,1334880000,Great value and convenient ramen,got sale rough 25 cent per cup half price loca...,Positive,Positive,1.0
277,278,B001D07IPG,A3QN14A5DGUA0U,J. Kraayenbrink,0,0,5,1343692800,Excellent for G/F,plea ignor onestar comment check bag main ingr...,Positive,Positive,1.0
1520,1521,B001LQNX8S,A2QA5W84LGMG8L,Dee,0,0,1,1342742400,These are absolutely revolting,flavor pod horrid order senseo stock long need...,Negative,Negative,1.0
2951,2952,B000H280KS,AJITVA02GIXPO,Gretchen,0,0,2,1173312000,"Looks better than it tasted, I'm told",sent gift found later mani item stale love tho...,Negative,Negative,1.0
758,759,B0035YE9CS,AGXFRN1RZKI4J,Earl Hudson,0,1,5,1321228800,Outstanding Product!,saw made interest processdecid give tri realli...,Positive,Positive,1.0
