# Sentiment analysis
**Sentiment analysis**, sometimes called **opinion mining** or **polarity detection**, refers to the **set of algorithms** and **techniques** that are used to **extract the polarity of a given document**; that is, it determines whether the **sentiment** of a **document** is 

* **Positive**, 
* **Negative**,
* **Neutral**. 

**Sentiment analysis** is **gaining popularity** in **industry** because it **allows organizations**
to **mine the opinions** of a **large group of users** or **potential customers** in a **cost-efficient way**.
**Sentiment analysis** is now **used extensively**, for example in

* **Advertisement campaigns**, 
* **Political campaigns**, 
* **Stock analysis**. 

# Financial Phrase Bank dataset
The **[FinancialPhraseBank](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news) dataset** contains the **sentiments** for **financial news headlines** from the **perspective of a retail investor**. It contains **two columns**:
* **Sentiment**
* **News Headline**. 

In [None]:
import pickle
import numpy as np
import pandas as pd
import spacy
from spacy.tokens.doc import Doc
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from typing import Generator

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
news_csv = '/content/drive/MyDrive/Financial News dataset/financial_news.csv'

In [None]:
news_df = pd.read_csv(
    filepath_or_buffer=news_csv, 
    sep=';', 
    encoding='unicode_escape', 
    engine='python', 
    header=None,
    usecols=[0, 1], 
    names=['Sentiment', 'News']
)

In [None]:
news_df.head()

|   | Sentiment | News |
|---|-----------|------|
| 0 | neutral | According to Gran , the company has no plans t... |
| 1 | neutral | Technopolis plans to develop in stages an area... |
| 2 | negative | The international electronic industry company ... |
| 3 | positive | With the new production plant the company woul... |
| 4 | positive | According to the company 's updated strategy f... |


In [None]:
news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Sentiment  4846 non-null   object
 1   News       4846 non-null   object
dtypes: object(2)
memory usage: 75.8+ KB


In [None]:
X = news_df['News']
y = news_df['Sentiment']

In [None]:
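# Map the string labels to integers; map() returns a new Series with an
# integer dtype and avoids in-place modification of a DataFrame column
y = y.map({'negative': -1, 'neutral': 0, 'positive': 1})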

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)
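
The split above is purely random. If the three classes are not equally frequent, a stratified split keeps the label proportions the same in the training and test partitions. The following is a minimal sketch on toy data, not the split used in this notebook: `toy_X` and `toy_y` are hypothetical stand-ins for `X` and `y`, while `stratify` is a standard parameter of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the news headlines and labels (hypothetical)
toy_X = pd.Series([f"headline {i}" for i in range(20)])
toy_y = pd.Series([0] * 12 + [1] * 4 + [-1] * 4)

# stratify=toy_y keeps the label proportions equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=17, stratify=toy_y
)

# Both partitions should show the same relative class frequencies
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```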

# Text preprocessing pipeline
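There are **three transformations** included in the **pipeline**. When `lib` is `'spacy'` or `'nltk'`, a **custom preprocessor** additionally runs before them and handles tokenization, stopword removal, lemmatization, and case folding.
* **CountVectorizer** **tokenizes** each document and outputs a **vector of token counts**; when no custom preprocessor is used, it also **lower-cases words** and **removes stopwords**.
* **SelectKBest** with the **chi2** score function (the `chi2score` step) selects the **top k features** most related to the **target** based on the **chi-square test statistic**.
* **TfidfTransformer** transforms the **vector of selected k features** to a **TF-IDF representation**.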

In [None]:
class SpacyPreprocessor(BaseEstimator, TransformerMixin):
  def __init__(self):
    self.nlp = spacy.load(
      name='en_core_web_sm', 
      disable=["parser", "ner"]
    )
  
  def fit(self, X, y=None):
    return self
  
  def transform(self, X, y=None) -> list:
    return [
      ' '.join(SpacyPreprocessor.process_document(doc)) 
      for doc in self.nlp.pipe(X)
    ]

  @staticmethod
  def process_document(doc: Doc) -> Generator[str, None, None]:
    # Tokenize document
    for token in doc:
      # Skip tokens that are not purely alphabetic (this also drops numbers and punctuation)
      if not token.is_alpha:
        continue
        
      # Stopword removal
      if token.is_stop:
        continue
        
      # Lemmatization
      token = token.lemma_
        
      # Case folding
      token = token.casefold()

      # Yield token
      yield token

In [None]:
class NLTKPreprocessor(BaseEstimator, TransformerMixin):
  def __init__(self):
    self.tokenizer = TweetTokenizer()
    self.lemmatizer = WordNetLemmatizer()
    self.stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
  
  def fit(self, X, y=None):
    return self
  
  def transform(self, X, y=None) -> list:
    return [' '.join(self.process_document(doc)) for doc in X]
  
  def process_document(self, doc: str) -> Generator[str, None, None]:
    for token in self.tokenizer.tokenize(doc):
      word = token.casefold()
      if word not in self.stop_words:
        tag = NLTKPreprocessor.wordnet_pos_tag(word)
        lemma = self.lemmatizer.lemmatize(word, pos=tag)
        yield lemma
  
  # This is a common method which is widely used across the NLP community
  @staticmethod
  def wordnet_pos_tag(token: str) -> str:
    """
    Maps POS tags to the first character lemmatize() accepts.
    WordNet groups [N]ouns, [V]erbs, [A]djectives, and Adve[R]bs into synsets.
    """
    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV
    }
    tag = nltk.pos_tag([token])[0][1][0].upper()
    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
def preprocess_text(
    lib: str = "", 
    vec_norm: str = 'l2',
    n_grams: tuple = (1, 1), 
    k_best: int = 1000
  ) -> Pipeline:

  # Use SpaCy to preprocess text
  if lib == 'spacy':
    count_vectorizer = CountVectorizer(
        ngram_range=n_grams,
        lowercase=False
    )
    steps = [
      ('spacy_preprocessor', SpacyPreprocessor()),
      ('count_vectorizer', count_vectorizer),
    ]

  
  # Use NLTK to preprocess text
  elif lib == 'nltk':
    count_vectorizer = CountVectorizer(
        ngram_range=n_grams,
        lowercase=False
    )
    steps = [
      ('nltk_preprocessor', NLTKPreprocessor()),
      ('count_vectorizer', count_vectorizer),
    ]

  # Use CountVectorizer's in-built preprocessing
  else:
    count_vectorizer = CountVectorizer(
        ngram_range=n_grams,
        stop_words=stopwords.words('english'), 
        lowercase=True
    )
    steps = [('count_vectorizer', count_vectorizer)]

  # Customize pipeline
  pipe = Pipeline([
    *steps,
    ('chi2score', SelectKBest(chi2, k=k_best)),
    ('tfidf_transformer', TfidfTransformer(norm=vec_norm, use_idf=True)),
  ])

  # Return custom pipeline
  return pipe

# Multinomial Naive Bayes

In [None]:
pipe_nb = Pipeline([
  *preprocess_text(lib='spacy').steps, 
  ('naive_bayes', MultinomialNB())
])

In [None]:
param_grid_nb = {
    'count_vectorizer__ngram_range': [(1, 2), (1, 3)],
    'chi2score__k': [500, 1000],
    'tfidf_transformer__norm': ['l1', 'l2'],
    'naive_bayes__alpha': [0.25, 0.5, 1],
}
grid_search_nb = GridSearchCV(pipe_nb, param_grid_nb)

In [None]:
grid_search_nb.fit(X_train, y_train)
print(
  'Naive Bayes score: {:.2f}\nPipeline parameters: {}'
  .format(grid_search_nb.best_score_, grid_search_nb.best_params_)
)

Naive Bayes score: 0.73
Pipeline parameters: {'chi2score__k': 500, 'count_vectorizer__ngram_range': (1, 2), 'naive_bayes__alpha': 0.5, 'tfidf_transformer__norm': 'l2'}


In [None]:
naive_bayes_pipe = grid_search_nb.best_estimator_

# Support Vector Classification (SVC)

In [None]:
pipe_svc = Pipeline([
  *preprocess_text(lib='spacy').steps, 
  ('svc', SVC())
])

In [None]:
param_grid_svc = {
    'count_vectorizer__ngram_range': [(1, 2), (1, 3)],
    'chi2score__k': [500, 1000],
    # Kernel type. C (inversely proportional to the regularization strength)
    # is not searched here and stays at its default of 1.0.
    'svc__kernel': ['linear', 'poly'],
    # Normalization applied to each TF-IDF row vector.
    'tfidf_transformer__norm': ['l2'], 
}
grid_search_svc = GridSearchCV(pipe_svc, param_grid_svc)

In [None]:
grid_search_svc.fit(X_train, y_train)
print(
  'SVC score: {:.2f}\nPipeline parameters: {}'
  .format(grid_search_svc.best_score_, grid_search_svc.best_params_)
)

SVC score: 0.75
Pipeline parameters: {'chi2score__k': 1000, 'count_vectorizer__ngram_range': (1, 2), 'svc__kernel': 'linear', 'tfidf_transformer__norm': 'l2'}


In [None]:
svc_pipe = grid_search_svc.best_estimator_
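
The scores printed above are mean cross-validation scores computed on the training data (`best_score_`). A held-out test set gives an independent check and a per-class breakdown. The sketch below assumes toy data and a simplified pipeline without the spaCy and chi-square steps; `train_docs`, `test_docs`, and `toy_clf` are hypothetical names, and with the real notebook objects the same calls would take `X_test` and `y_test`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy train/test data (hypothetical, for illustration only)
train_docs = ["profit rose", "profit increased", "loss widened", "loss deepened"]
train_labels = [1, 1, -1, -1]
test_docs = ["profit rose again", "loss widened again"]
test_labels = [1, -1]

toy_clf = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("tfidf_transformer", TfidfTransformer(norm="l2")),
    ("svc", SVC(kernel="linear")),
])
toy_clf.fit(train_docs, train_labels)

# Score on documents the model has not seen during fitting
y_pred = toy_clf.predict(test_docs)
accuracy = accuracy_score(test_labels, y_pred)

print(f"Test accuracy: {accuracy:.2f}")
print(classification_report(test_labels, y_pred))
```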

# Productionizing a trained sentiment analyzer


In [None]:
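def save_pipe(pipe: Pipeline, path: str) -> None:
  # Serialize the fitted pipeline to disk
  with open(path, 'wb') as f_out:
    pickle.dump(pipe, f_out)

def load_pipe(path: str) -> Pipeline:
  # Only unpickle files from a trusted source: pickle.load can execute arbitrary code
  with open(path, 'rb') as f_in:
    return pickle.load(f_in)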

In [None]:
LOAD_MODEL = False
SAVE_MODEL = False
MODEL_PATH = r'/content/drive/MyDrive/sentiment_classifier.pickle'

In [None]:
if LOAD_MODEL:
  pipe = load_pipe(MODEL_PATH)

else:
  # Build the model
  pipe = Pipeline([
    *preprocess_text(lib='spacy', k_best=1000, n_grams=(1, 2), vec_norm='l2').steps, 
    ('svc', SVC(kernel='linear'))
  ])

  # Fit the model
  pipe.fit(X_train, y_train)

  if SAVE_MODEL:
    save_pipe(pipe, MODEL_PATH)
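
Before relying on a saved model, it is worth checking that the restored pipeline reproduces the predictions of the original. A minimal sketch, assuming a toy pipeline (`toy_pipe` and the sample documents are hypothetical) and an in-memory round trip with `pickle.dumps` / `pickle.loads` instead of a file:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy pipeline standing in for the trained sentiment classifier (hypothetical)
toy_pipe = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("naive_bayes", MultinomialNB()),
])
toy_pipe.fit(["profit rose", "loss widened"], [1, -1])

# Serialize to bytes and restore, mirroring pickle.dump/pickle.load on a file
restored = pickle.loads(pickle.dumps(toy_pipe))

# The restored pipeline should give the same predictions as the original
sample = ["profit rose again"]
same = bool((restored.predict(sample) == toy_pipe.predict(sample)).all())
print(same)
```

Pickled scikit-learn models are generally tied to the library versions they were created with, so recording those versions next to the pickle file is a sensible precaution.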

# Custom sentiment classifier in action

In [None]:
def predict_sentiment(pipe: Pipeline, doc: str):
  y_pred = pipe.predict([doc])[0]
  sentiments = {
      -1: "Negative",
      0: "Neutral",
      1: "Positive"
  }
  return sentiments[y_pred]

In [None]:
emojis = {
    "Negative": "🙁",
    "Neutral": "😐",
    "Positive": "🙂",
}
for i in np.random.randint(low=0, high=X_test.shape[0], size=(5,)):
  doc = X_test.iloc[i]
  sentiment = predict_sentiment(pipe, doc)
  print(f'[{emojis[sentiment]}] {doc}')

[🙁] Earnings per share EPS in 2005 decreased to EUR0 .66 from EUR1 .15 in 2004 .
[😐] The percentages of shares and voting rights have been calculated in proportion to the total number of shares registered with the Trade Register and the total number of voting rights related to them .
[😐] The company had earlier said that it was considering different strategic options for the struggling low-cost mobile operator , including a divestment of its holding .
[😐] Nokia is requesting that the companies stop making and selling the mobile phones and pay monetary damages and costs .
[😐] The total value of the deliveries is some EUR65m .


# Summary
There are some **important things** to consider while creating and deploying the **sentiment analyzer**:
* The **training data** should be **consistent with the objective** of the **sentiment analyzer**. 
> **Don't train the model** using **movie reviews** if the **objective** is to predict
the **sentiment** of **financial news articles**.

* **Accurately labeling** the **training data** is **critical** for the model to perform well. 
> If you are creating a **real-world application**, you will have to **spend time labeling the training documents**, unless you use a **pre-labeled dataset**. Typically, **labeling** should be done by someone with a **good understanding of industry jargon**.

* **Sourcing training data** is a **difficult task**. You can use techniques such as **web scraping**
or **social media scraping**, subject to permissions.
> Spend effort on **sourcing data** from **multiple platforms** rather than **relying too heavily on a single source**.

* **Evaluate the performance of your model regularly** and **retrain the model if
required**.
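
One simple way to act on the last point is to score the deployed model on a freshly labeled batch at regular intervals and flag it for retraining when accuracy falls below a chosen threshold. This is only a sketch: `needs_retraining`, the 0.70 threshold, and the batch below are hypothetical and would need to be adapted to the application.

```python
from sklearn.metrics import accuracy_score


def needs_retraining(y_true, y_pred, threshold: float = 0.70) -> bool:
    """Return True when accuracy on a freshly labeled batch drops below threshold."""
    return bool(accuracy_score(y_true, y_pred) < threshold)


# Hypothetical batch: true labels vs. labels predicted by the deployed model
recent_true = [1, 0, -1, 0, 1, 0, -1, 0, 0, 1]
recent_pred = [1, 0, 0, 0, 0, 0, -1, 1, 0, 0]

# 6 of the 10 predictions match, so accuracy is 0.60 and the check fires
print(needs_retraining(recent_true, recent_pred))
```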