## What is Sentiment Analysis? 
- Sentiment Analysis is a classification use case that employs NLP techniques to classify a given text into sentiments such as Positive or Negative; Happy, Sad, or Neutral; etc. 


- The key aim of Sentiment Analysis is to categorize opinions expressed in a piece of text, determining the writer's attitude towards a particular subject (topic, product, etc.). 


- Sentiment analysis helps with public opinion analysis on social media posts, customer reviews, or news articles. For example, it can be used to analyze Twitter data to determine the overall sentiment towards a particular product, or to track customer sentiment in online reviews.


- **Different forms of Sentiment Analysis exist:**
    1. **Baseline Sentiment Analysis:** Classify a text into one of the three sentiments: Positive, Negative, or Neutral. 
    2. **Fine-grained Sentiment Analysis:** Classify a text into one of the sentiments on a five-degree scale: Very Positive, Positive, Neutral, Negative, Very Negative. 
    3. **Intent Analysis:** 
    4. **Opinion Mining:** 
    5. **Emotion/Mood Analysis:** Classify a text by the emotion it expresses, e.g. Happy, Sad, Fear, Anger, Disgust. 
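
To make the difference between the baseline and fine-grained forms concrete, here is a minimal sketch that maps a polarity score in [-1, 1] to a three-class and a five-class label. The function names and all cut-off values are illustrative assumptions, not part of any library; in practice such thresholds would be tuned on labeled data.

```python
def baseline_label(score: float, neutral_band: float = 0.1) -> str:
    """Map a polarity score in [-1, 1] to Positive / Negative / Neutral."""
    # neutral_band is a hypothetical tolerance around zero.
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"


def fine_grained_label(score: float) -> str:
    """Map a polarity score in [-1, 1] to a five-degree scale."""
    # Hypothetical cut-offs that split [-1, 1] into five bands.
    if score >= 0.6:
        return "Very Positive"
    if score >= 0.2:
        return "Positive"
    if score > -0.2:
        return "Neutral"
    if score > -0.6:
        return "Negative"
    return "Very Negative"


for s in (1.0, 0.3, 0.0, -0.3, -1.0):
    print(s, baseline_label(s), fine_grained_label(s))
```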
    

## Ways to Perform Sentiment Analysis using Python 

- Sentiment analysis is considered a hard task in natural language processing because sentiment is subjective and context-dependent, and even humans struggle to analyze sentiments accurately. 
- **Different methods to automate Sentiment Analysis include:** 
    1. Rule-based systems 
    2. Machine Learning 
    3. Deep Learning 
    4. Pre-trained NLP libraries 
    5. Online tools (free or paid), which usually come as a SaaS bundle with attached Tableau/Power BI dashboards for user-friendly visualizations 
- **This notebook covers the following approaches:** 
    - Using TextBlob
    - Using VADER
    - Using Bag of Words Vectorization-based Models
    - Using LSTM-based Models
    - Using Transformer-based Models
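
As a rough illustration of what a rule-based system does, the sketch below scores a text by summing per-word values from a small lexicon and flipping the sign of a word that directly follows a negation. The lexicon, its values, and the function name are all hypothetical; real rule-based analyzers use far larger lexicons and many more rules.

```python
import re

# Tiny illustrative lexicon (hypothetical values).
LEXICON = {"awesome": 1.0, "good": 0.5, "decent": 0.25, "bad": -0.5, "terrible": -1.0}
NEGATIONS = {"not", "no", "never"}


def rule_based_score(text: str) -> float:
    """Sum word scores, negating a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


print(rule_based_score("The movie was awesome"))   # 1.0
print(rule_based_score("The food was not good"))   # -0.5
```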

_Note: The code in this notebook uses Python 3.12.0._    

## A. Sentiment Analysis using TextBlob
- TextBlob is a Python library for NLP tasks. 
- It takes text as input and can return polarity and subjectivity as outputs. 
    - Polarity determines the sentiment of the text. Its value lies in [-1, 1], where -1 denotes a highly negative sentiment and 1 denotes a highly positive sentiment.

    - Subjectivity determines whether a text input is factual information or a personal opinion. Its value lies in [0, 1], where a value closer to 0 denotes factual information and a value closer to 1 denotes a personal opinion.

In [2]:
# install textblob 
!pip install textblob==0.17.1

Collecting textblob==0.17.1
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
     ------------------------------------ 636.8/636.8 kB 355.0 kB/s eta 0:00:00
Installing collected packages: textblob
Successfully installed textblob-0.17.1


In [10]:
# import textblob
from textblob import TextBlob

text_1 = "The movie was so awesome."
text_2 = "The food here tastes terrible."
text_3 = "The talk seems decent"

#Determining the Polarity 
p_1 = TextBlob(text_1).sentiment.polarity
p_2 = TextBlob(text_2).sentiment.polarity
p_3 = TextBlob(text_3).sentiment.polarity

#Determining the Subjectivity
s_1 = TextBlob(text_1).sentiment.subjectivity
s_2 = TextBlob(text_2).sentiment.subjectivity
s_3 = TextBlob(text_3).sentiment.subjectivity

print("Polarity of Text 1 is", p_1)
print("Polarity of Text 2 is", p_2)
print("Polarity of Text 3 is", p_3)
print("\n")
print("Subjectivity of Text 1 is", s_1)
print("Subjectivity of Text 2 is", s_2)
print("Subjectivity of Text 3 is", s_3)

Polarity of Text 1 is 1.0
Polarity of Text 2 is -1.0
Polarity of Text 3 is 0.16666666666666666


Subjectivity of Text 1 is 1.0
Subjectivity of Text 2 is 1.0
Subjectivity of Text 3 is 0.6666666666666666


## B. Sentiment Analysis using VADER 
- VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analyzer that is specifically attuned to sentiments expressed in social media text. 

In [27]:
# install nltk, which bundles the VADER sentiment analyzer
!pip install nltk==3.8.1
# download the VADER lexicon (only needed once)
import nltk
nltk.download('vader_lexicon')






In [31]:
# import the VADER analyzer bundled with nltk
# (the standalone vaderSentiment package exposes the same class:
#  from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentiment = SentimentIntensityAnalyzer()
text_1 = "The book was a perfect balance between writing style and plot."
text_2 = "The pizza tastes terrible."
sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)
print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Sentiment of text 1: {'neg': 0.0, 'neu': 0.709, 'pos': 0.291, 'compound': 0.5719}
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}
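
In the dictionaries above, `neg`, `neu`, and `pos` are the proportions of the text that fall into each category, and `compound` is a single normalized score in [-1, 1]. A minimal sketch of turning that score into a label follows; the ±0.05 cut-offs are the convention suggested in VADER's documentation, while the helper name `vader_label` is an assumption made for this example. The two dictionaries are copied from the output above so that the snippet runs without nltk.

```python
def vader_label(scores: dict, threshold: float = 0.05) -> str:
    """Classify a VADER result by its compound score."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores copied from the cell output above.
sent_1 = {'neg': 0.0, 'neu': 0.709, 'pos': 0.291, 'compound': 0.5719}
sent_2 = {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

print(vader_label(sent_1))  # positive
print(vader_label(sent_2))  # negative
```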


## C. Sentiment Analysis using a custom Machine Learning Model 
- The two approaches above, TextBlob and VADER, rely on ready-to-use Python libraries to perform sentiment analysis. In this approach, an ML model is trained from scratch on custom data. 
- The steps involved in performing sentiment analysis using the Bag of Words Vectorization method are as follows:

    - Import the custom dataset. The dataset used here is a financial sentiment dataset from Kaggle: 
    https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis
    - Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stop-word removal, and stemming/lemmatization). 
    - Create a Bag of Words for the pre-processed text data using the Count Vectorization or TF-IDF Vectorization approach.
    - Train a suitable classification model on the processed data for sentiment classification.
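
To show what the Bag of Words step produces, here is a small pure-Python sketch of count vectorization: each document becomes a vector of term counts over a shared, sorted vocabulary. The three-sentence corpus and the helper names are hypothetical; the notebook itself uses scikit-learn's `CountVectorizer` for this step.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for the real dataset.
docs = ["Profit rose sharply", "Profit fell", "Sales rose"]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-zA-Z0-9]+", text.lower())


# Shared vocabulary: every distinct token, in sorted order.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})


def vectorize(text):
    """Return the count of each vocabulary term in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocab]


matrix = [vectorize(d) for d in docs]
print(vocab)
for row in matrix:
    print(row)
```

Word order is discarded in this representation, which is why it is called a "bag" of words.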


In [35]:
#Loading the Dataset
import pandas as pd
data = pd.read_csv('Finance_data.csv')
data

Unnamed: 0,Sentence,Sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
2,"For the last quarter of 2010 , Componenta 's n...",positive
3,According to the Finnish-Russian Chamber of Co...,neutral
4,The Swedish buyout firm has sold its remaining...,neutral
...,...,...
5837,RISING costs have forced packaging producer Hu...,negative
5838,Nordic Walking was first used as a summer trai...,neutral
5839,"According shipping company Viking Line , the E...",neutral
5840,"In the building and home improvement trade , s...",neutral


In [37]:
#Pre-processing and Bag of Words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1,1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['Sentence'])

#Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['Sentiment'], test_size=0.25, random_state=5)

#Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

#Calculating the accuracy score of the model (y_true first, then y_pred)
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ", accuracy)

Accuracy Score:  0.6851471594798083


- The accuracy may be improved further by trying a different ML model, such as XGBoost, and tuning its hyperparameters on the dataset. 
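
When comparing models, it can help to look at more than one number. The sketch below, written in plain Python on hypothetical toy labels, shows what an accuracy score measures and how per-class recall can reveal classes that a model rarely gets right. The helper names are assumptions made for this example; scikit-learn provides equivalent metrics.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical labels, not taken from the dataset.
y_true = ["neutral", "neutral", "positive", "negative", "positive", "neutral"]
y_pred = ["neutral", "neutral", "positive", "neutral", "neutral", "neutral"]

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

In this toy case the overall accuracy looks moderate even though the negative class is never predicted correctly.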

## D. Sentiment Analysis using a custom Deep Learning Model 
- When the training dataset grows large, Deep Learning models can be employed in place of classical Machine Learning models, which may yield better predictions. 
- LSTM (Long Short-Term Memory) is a Deep Learning model that can be implemented using TensorFlow with Keras.
- The steps to perform sentiment analysis using LSTM-based models are as follows:

    - Import the custom dataset. The dataset used here is a financial sentiment dataset from Kaggle: 
    https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis
    - Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stop-word removal, and stemming/lemmatization). 
    - Import Tokenizer from keras.preprocessing.text and create a Tokenizer object. Fit it on the entire training text so that it learns the training vocabulary. Convert the texts to integer sequences using the texts_to_sequences() method of the Tokenizer and pad them to an equal length. An Embedding layer then maps these integer indices to dense vectors (embeddings are numerical/vectorized representations of text that let a model capture word semantics/associations). 
    - Train a suitable DL classification model on the processed data for sentiment classification. Here, an LSTM is used. Add dropout and tune the hyperparameters to get a decent accuracy score. ReLU or LeakyReLU activation functions are generally used in the dense layers that follow the LSTM to mitigate the vanishing gradient problem, and a Softmax or Sigmoid activation function is used at the output layer. 
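
The data-preparation steps above can be sketched in plain Python to show the shapes the network expects: texts become integer sequences, the sequences are left-padded with zeros to a common length, and the labels are one-hot encoded. The toy corpus and variable names are hypothetical, and this only mirrors the idea behind the Keras `Tokenizer` and `pad_sequences` utilities rather than reproducing them exactly.

```python
from collections import Counter

# Hypothetical toy corpus and labels.
texts = ["profit rose", "profit fell sharply", "sales rose"]
labels = ["positive", "negative", "positive"]

# 1. Vocabulary: more frequent words get smaller indices; 0 is reserved for padding.
freq = Counter(word for t in texts for word in t.split())
word_index = {w: i for i, (w, _) in enumerate(freq.most_common(), start=1)}

# 2. Replace every word with its integer index.
sequences = [[word_index[w] for w in t.split()] for t in texts]

# 3. Left-pad with zeros so that all sequences share the same length.
max_len = max(len(s) for s in sequences)
padded = [[0] * (max_len - len(s)) + s for s in sequences]

# 4. One-hot encode the labels: one column per class, in sorted order.
classes = sorted(set(labels))
one_hot = [[1 if lab == c else 0 for c in classes] for lab in labels]

print(padded)
print(classes, one_hot)
```

A softmax output layer with one unit per class, trained with categorical cross-entropy, expects targets in this one-hot form.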
    

In [46]:
#Importing necessary libraries
import nltk
import pandas as pd
from textblob import Word
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split 
from keras.layers import LeakyReLU
#Loading the dataset
data = pd.read_csv('Finance_data.csv')
data

Unnamed: 0,Sentence,Sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
2,"For the last quarter of 2010 , Componenta 's n...",positive
3,According to the Finnish-Russian Chamber of Co...,neutral
4,The Swedish buyout firm has sold its remaining...,neutral
...,...,...
5837,RISING costs have forced packaging producer Hu...,negative
5838,Nordic Walking was first used as a summer trai...,neutral
5839,"According shipping company Viking Line , the E...",neutral
5840,"In the building and home improvement trade , s...",neutral


In [50]:
#Pre-Processing the text 
def cleaning(df, stop_words):
    # Lowercasing every token
    df['Sentence'] = df['Sentence'].apply(lambda x: ' '.join(w.lower() for w in x.split()))
    # Removing the digits/numbers
    df['Sentence'] = df['Sentence'].str.replace(r'\d+', '', regex=True)
    # Removing stop words
    df['Sentence'] = df['Sentence'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop_words))
    # Lemmatization
    df['Sentence'] = df['Sentence'].apply(lambda x: ' '.join(Word(w).lemmatize() for w in x.split()))
    return df
stop_words = set(stopwords.words('english'))
data_cleaned = cleaning(data, stop_words)

#Converting the sentences (not the labels) into padded integer sequences
tokenizer = Tokenizer(num_words=500, split=' ') 
tokenizer.fit_on_texts(data_cleaned['Sentence'].values)
X = tokenizer.texts_to_sequences(data_cleaned['Sentence'].values)
X = pad_sequences(X)

#One-hot encoding the labels: categorical_crossentropy expects targets of shape (n, 3).
#Passing the raw label column instead raises
#"ValueError: Shapes (None, 1) and (None, 3) are incompatible".
Y = pd.get_dummies(data_cleaned['Sentiment']).values.astype('float32')

#Splitting the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=5)

#Model Building
model = Sequential()
model.add(Embedding(500, 120, input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(704, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(352, activation=LeakyReLU(alpha=0.3)))
model.add(Dense(3, activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
model.summary()


In [53]:
#Model Training
model.fit(X_train, Y_train, epochs = 20, batch_size=32, verbose =1)
#Model Testing
model.evaluate(X_test,Y_test)


## E. Sentiment Analysis using a custom advanced Deep Learning Model - Transformer Model 
- Transformer-based models are among the most advanced Natural Language Processing techniques. The original Transformer follows an encoder-decoder architecture and relies on self-attention to yield impressive results; many popular variants keep only the encoder or only the decoder. 
- Building and training Transformers from scratch is technically and computationally expensive. Hence, a pre-trained Transformer from Hugging Face is used. 
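
To give a feel for self-attention, here is a minimal pure-Python sketch of scaled dot-product attention in which the queries, keys, and values are all the input vectors themselves. This is a simplification for illustration: real Transformer layers apply learned projections and use multiple attention heads, and the three 2-dimensional "token embeddings" below are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all token vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d)])
    return outputs, weights


# Three hypothetical 2-dimensional token embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(X)
for row in attn:
    print([round(a, 3) for a in row])
```

Each row of `attn` sums to 1 and describes how much one token attends to every token in the sequence.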

In [56]:
# install transformers
!pip install transformers==4.36.2

Collecting transformers==4.36.2
  Obtaining dependency information for transformers==4.36.2 from https://files.pythonhosted.org/packages/20/0a/739426a81f7635b422fbe6cb8d1d99d1235579a6ac8024c13d743efa6847/transformers-4.36.2-py3-none-any.whl.metadata
  Using cached transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Collecting filelock (from transformers==4.36.2)
  Obtaining dependency information for filelock from https://files.pythonhosted.org/packages/81/54/84d42a0bee35edba99dee7b59a8d4970eccdd44b99fe728ed912106fc781/filelock-3.13.1-py3-none-any.whl.metadata
  Using cached filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting huggingface-hub<1.0,>=0.19.3 (from transformers==4.36.2)
  Obtaining dependency information for huggingface-hub<1.0,>=0.19.3 from https://files.pythonhosted.org/packages/a0/0a/02ac0ae1047d97769003ff4fb8e6717024f3f174a5d13257415aa09e13d9/huggingface_hub-0.20.1-py3-none-any.whl.metadata
  Using cached huggingface_hub-0.20.1-py3-none-any.whl.metadata (12

Collecting tokenizers<0.19,>=0.14 (from transformers==4.36.2)
  Using cached tokenizers-0.15.0.tar.gz (318 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'error'

  error: subprocess-exited-with-error
  
  Preparing metadata (pyproject.toml) did not run successfully.
  exit code: 1
  
  [6 lines of output]
  
  Cargo, the Rust package manager, is not installed or is not on PATH.
  This package requires Rust and Cargo to compile extensions. Install it through
  the system's package manager or via https://rustup.rs/
  
  Checking for Rust toolchain....
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.

See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.


In [57]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["It was the best of times.", "It was the worst of times."]
sentiment_pipeline(data)

ModuleNotFoundError: No module named 'transformers'

## References
- https://www.analyticsvidhya.com/blog/2022/07/sentiment-analysis-using-python/