**Sentiment Analysis**

---

This notebook follows a tutorial from **DataCamp's website**. It is a summarised version of the blog post, written for practice purposes. Here's the link to the full version:

https://www.datacamp.com/tutorial/text-analytics-beginners-nltk

---


**Text analysis:** analysing & extracting meaningful insights from unstructured text data.

**Sentiment analysis:** establishing the particular tone of the text provided.



---



**But what is Sentiment Analysis?**

It is a method for determining the overall emotional tone/sentiment expressed in a text.
The main challenge is that human language is complex: it features sarcasm, irony, etc., which are hard to detect from the words alone.


**3 Methodologies for Sentiment Analysis**

(a) Lexicon-based analysis \
(b) Machine learning \
(c) Pre-trained transformer-based deep learning

(a) **Lexicon-based analysis** uses a set of **predefined rules & heuristics to determine the sentiment** of a piece of text.
The rules are based on **lexical & syntactic features of the text**, e.g. the presence of positive/negative words.
It is **simple to implement & interpret but typically less accurate** than the other two approaches.
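
To make the idea concrete, here is a minimal sketch of a lexicon-based scorer. The word scores in `TOY_LEXICON`, the negation list and the function name `lexicon_score` are all made up for illustration; real lexicons such as VADER's are far larger and use more rules.

```python
# Toy lexicon: hypothetical word scores chosen for illustration only.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text):
    """Sum word scores; flip the sign of a word that directly follows a negation."""
    score = 0.0
    negate = False
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        value = TOY_LEXICON.get(word, 0.0)
        score += -value if negate else value
        negate = False  # a negation only affects the next word
    return score


print(lexicon_score("I love this great app"))  # 4.0
print(lexicon_score("This is not good"))       # -1.0
```

A score above zero would be read as positive and below zero as negative. Every decision can be traced back to a word in the lexicon, which is why this approach is easy to interpret.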

---

(b) In **machine learning**, a model is trained to **identify the sentiment of a piece of text from a set of labeled training data** (a wide range of algorithms can be used).
These models tend to be **more accurate but are computationally expensive** & need a large amount of training data.
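
As a rough sketch of what "training on labeled data" means, the block below implements a tiny Naive Bayes classifier in plain Python. Naive Bayes is just one of the many algorithms that could be used here (it is my choice for illustration, not something the tutorial prescribes), and the four training sentences are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (text, label). Returns word counts and document counts per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Pick the label with the highest log-probability, using add-one smoothing."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set: 1 = positive, 0 = negative.
training = [
    ("love this great game", 1),
    ("awesome fun app", 1),
    ("stupid app froze", 0),
    ("hate this boring game", 0),
]
model = train_naive_bayes(training)
print(predict("great fun", *model))          # 1
print(predict("boring and stupid", *model))  # 0
```

With only four examples the model knows very few words, which hints at why this family of methods needs a large amount of labeled data to work well.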

---

(c) Lastly, **pre-trained transformer-based deep learning** uses **models pre-trained on massive amounts of text data.** The models use **complex neural networks to encode the context & meaning of the text**. \
This allows **high levels of accuracy** to be achieved. But these models **need significant computational resources** & aren't practical for all use cases.

---

In [1]:
#install necessary library
!pip install nltk



**Text Preprocessing**
- involves cleaning & normalizing text data, making it easier to analyze.
- involves steps that help transform raw text data into a form you can use for analysis.
- common techniques: \
a) tokenization \
b) stop word removal \
c) stemming \
d) lemmatization

---

- data cleaning involves: \
a) identifying noise \
b) noise removal \
c) character normalization \
d) data masking

- linguistic processing involves: \
a) tokenization \
b) POS tagging \
c) lemmatization \
d) named entity recognition

---

**Tokenization**
- a text preprocessing step
- involves breaking down the text into individual words/tokens
- essential: it separates individual words from raw text to make analysis easier
- done using NLTK's `word_tokenize` function, which splits text into individual words and punctuation tokens.
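
The idea behind tokenization can be sketched with a regular expression. This is a simplified stand-in, not how `word_tokenize` works internally; the function name `simple_tokenize` is hypothetical, and NLTK's tokenizer treats cases such as contractions differently.

```python
import re


def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


print(simple_tokenize("This app is great, I love it!"))
# ['this', 'app', 'is', 'great', ',', 'i', 'love', 'it', '!']
```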

**Stop words**
- stop word removal drops common words that are unlikely to convey sentiment.
- examples: "*and, the, of, it*"
- they add noise and can skew results.
- removing them can improve accuracy
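
A minimal sketch of stop word removal on an already tokenized sentence. `TOY_STOP_WORDS` is a small hand-picked set used only for this example; the notebook itself uses NLTK's much longer English list.

```python
# Small illustrative stop word set (an assumption, not NLTK's list).
TOY_STOP_WORDS = frozenset({"and", "the", "of", "it", "is", "a", "this"})


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["this", "is", "the", "best", "app", "of", "the", "year"]
print(remove_stop_words(tokens))  # ['best', 'app', 'year']
```

One thing to keep in mind is that many stop word lists also contain negations such as "not", so removing them can change the meaning of a phrase like "not good".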

**Stemming & lemmatization**
- techniques used to reduce words to their root forms
- stemming involves removing suffixes from words
- lemmatization involves reducing words to their base form (based on their part of speech)
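
The difference between the two can be seen in a toy sketch. Both functions below are deliberately simplistic assumptions: `toy_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hand-written table keyed by part of speech. Real stemmers and lemmatizers (for example NLTK's `PorterStemmer` and `WordNetLemmatizer`) are far more thorough.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, far simpler than a real stemmer


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("studies", "n"): "study",
    ("better", "a"): "good",
    ("running", "v"): "run",
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary base form for the given part of speech, if known."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_stem("studies"), toy_lemmatize("studies", "n"))  # studi study
print(toy_stem("running"), toy_lemmatize("running", "v"))  # runn run
print(toy_stem("better"), toy_lemmatize("better", "a"))    # better good
```

Stemming can produce strings that are not real words ("studi"), while lemmatization returns dictionary forms but needs to know the part of speech.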

---



**Bag of Words (BoW) Model**
- technique that represents text data as a set of numerical features
- each piece of text is seen as a "bag" of words, with each word in the vocabulary represented by a separate dimension.
- the value of each feature is the number of times that word appears in the text.
- model lets us analyze text data using ML algorithms which usually need numerical input.
- by having text data represented as numerical features we can train ML models to classify text/analyze sentiments.
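
The points above can be sketched in a few lines of plain Python. The function name `bag_of_words` and the two sample sentences are made up for this example; in practice a library class such as scikit-learn's `CountVectorizer` does the same job.

```python
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["love this game", "hate this game game"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['game', 'hate', 'love', 'this']
print(vectors)  # [[1, 0, 1, 1], [2, 1, 0, 1]]
```

Each column corresponds to one word of the vocabulary, and word order within a document is lost, which is where the "bag" in the name comes from.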

---



**Sentiment-analysis pipeline**

tokenization => stop word removal => stemming/lemmatization => pass to VADER sentiment analyzer

In [7]:
#importing necessary libraries
import pandas as pd
import nltk

In [4]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [5]:
import nltk
nltk.download('all') #additional data is installed i.e. pre-trained models, corpora etc.

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    |   Package bcp47 is already up-to-dat

True

In [6]:
#we'll use a dataset from amazon customer reviews
df = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/amazon.csv')

df

Unnamed: 0,reviewText,Positive
0,This is a one of the best apps acording to a b...,1
1,This is a pretty good version of the game for ...,1
2,this is a really cool game. there are a bunch ...,1
3,"This is a silly game and can be frustrating, b...",1
4,This is a terrific game on any pad. Hrs of fun...,1
...,...,...
19995,this app is fricken stupid.it froze on the kin...,0
19996,Please add me!!!!! I need neighbors! Ginger101...,1
19997,love it! this game. is awesome. wish it had m...,1
19998,I love love love this app on my side of fashio...,1


In [12]:
#function to handle pre-processing
#build the stop word set and the lemmatizer once, not once per token/review
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    #tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    #remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    #lemmatize tokens (the default part of speech is noun)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    #combine the tokens back into a string
    processed_text = ' '.join(lemmatized_tokens)

    return processed_text

In [13]:
#apply the function to the 'reviewText' column
df['reviewText'] = df['reviewText'].apply(preprocess_text)
df

Unnamed: 0,reviewText,Positive
0,one best apps acording bunch people agree bomb...,1
1,pretty good version game free . lot different ...,1
2,really cool game . bunch level find golden egg...,1
3,"silly game frustrating , lot fun definitely re...",1
4,terrific game pad . hr fun . grandkids love . ...,1
...,...,...
19995,app fricken stupid.it froze kindle wont allow ...,0
19996,please add ! ! ! ! ! need neighbor ! ginger101...,1
19997,love ! game . awesome . wish free stuff house ...,1
19998,love love love app side fashion story fight wo...,1


**Sentiment Analyzer**

In [16]:
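analyzer = SentimentIntensityAnalyzer()

#function that takes a text string and returns 1 (positive) or 0 (not positive)
def get_sentiment(text):
    #polarity_scores returns 'neg', 'neu', 'pos' and 'compound' scores
    scores = analyzer.polarity_scores(text)
    #label the text positive if it has any positive score at all
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment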

In [17]:
df['sentiment'] = df['reviewText'].apply(get_sentiment)
df

Unnamed: 0,reviewText,Positive,sentiment
0,one best apps acording bunch people agree bomb...,1,1
1,pretty good version game free . lot different ...,1,1
2,really cool game . bunch level find golden egg...,1,1
3,"silly game frustrating , lot fun definitely re...",1,1
4,terrific game pad . hr fun . grandkids love . ...,1,1
...,...,...,...
19995,app fricken stupid.it froze kindle wont allow ...,0,0
19996,please add ! ! ! ! ! need neighbor ! ginger101...,1,1
19997,love ! game . awesome . wish free stuff house ...,1,1
19998,love love love app side fashion story fight wo...,1,1


**Classification report**

In [18]:
from sklearn.metrics import classification_report

print(classification_report(df['Positive'], df['sentiment']))

              precision    recall  f1-score   support

           0       0.66      0.24      0.35      4767
           1       0.80      0.96      0.87     15233

    accuracy                           0.79     20000
   macro avg       0.73      0.60      0.61     20000
weighted avg       0.77      0.79      0.75     20000
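
To read the report, it helps to see how precision, recall and F1 are computed for one class. The sketch below uses five invented labels, not the review data, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented toy labels: one of the two negative items is predicted as positive.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0]

for label in (0, 1):
    p, r, f = precision_recall_f1(y_true, y_pred, positive_label=label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
# 0 1.0 0.5 0.67
# 1 0.75 1.0 0.86
```

Read the same way, the recall of 0.24 for class 0 in the report above means that only about a quarter of the negative reviews were labeled negative, while the recall of 0.96 for class 1 shows that almost all positive reviews were caught.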

