# NLP / Sentiment Analysis Introduction 
##### What is NLP / Sentiment Analysis?
---

Natural Language Processing (NLP) is a subfield of AI concerned with enabling machines 
to understand, analyze, and generate natural human language. Sentiment Analysis is 
the subfield of NLP concerned with classifying the polarity or emotion of a block of input text. 
Applications include customer feedback analysis, brand reputation monitoring, competitor analysis, 
marketing effectiveness, and much more! 

Types of Sentiment Analysis include:
1. Emotion Detection 
2. Aspect-Based Analysis
3. Multi-lingual Analysis
4. Fine-Grained 
5. Rule/Sentiment Based 

## Initial Approaches to Sentiment Analysis
##### Basic Lexicon Sentiment Analysis
---
In the early approaches, Sentiment Analysis models relied on statically defined *lexicons* (essentially lists of keywords) that identify words of interest. Each keyword in the lexicon is mapped to a *polarity*: positive (1) or negative (-1). From there, we can scan a block of text and keep track of how many positive and negative words we encounter.

Straightforward implementation:
1. Define a static lexicon (or vocabulary) of positive and negative words
2. Iterate through the input tokens
3. Add to the score for a positive word, subtract from it for a negative word


In [183]:
# --------- Static Lexicon---------
#   Positive sentiment = +1 
#   Negative sentiment = -1
lexicon = {"good": 1, "great": 1, "excellent": 1, "wonderful": 1, "amazing": 1, "fantastic": 1,
           "bad": -1, "terrible": -1, "awful": -1, "horrible": -1, "stupid": -1}

def simple_sentiment(text: str) -> str:
  '''
  Initial Sentiment Analysis approach: count positive and negative words
  :param text: Block of text to be analyzed
  :return: string result "POS"/"NEG"/"neutral"
  '''
  tokens = text.split()
  score = 0
  for token in tokens:
    score += lexicon.get(token.lower(), 0)
  return "POS" if score > 0 else "NEG" if score < 0 else "neutral"



In [211]:
#positive test
print("Expected: POS\t=>\t<" + simple_sentiment("This is a good movie!") + ">")
print("Expected: POS\t=>\t<" + simple_sentiment("This is a very good movie!") + ">", end='\n\n')
#negative test
print("Expected: NEG\t=>\t<" + simple_sentiment("This is a bad movie!") + ">")
print("Expected: NEG\t=>\t<" + simple_sentiment("That guy is not a good person") + ">", end='\n\n')
#confusing it on purpose
print("\nExpected: NEG\t=>\t<" + simple_sentiment("I don't feel very good today.") + ">", end='')
print("\nExpected: NEG\t=>\t<" + simple_sentiment("I do not like The Amazing Spiderman. The visuals were good, but overall the storyline was awful and predictable.") + ">", end='')

Expected: POS	=>	<POS>
Expected: POS	=>	<POS>

Expected: NEG	=>	<NEG>
Expected: NEG	=>	<POS>


Expected: NEG	=>	<POS>
Expected: NEG	=>	<neutral>

#### Strengths and Limitations 
##### Lexicon Based Methods 
- - -
Even though this is a simplistic implementation of a lexicon-based model, the pitfalls are clearly visible. That is the tradeoff we accept, as lexicon-based models are simple to implement and interpret. We can clearly determine *WHY* a given decision was made. Additionally, there is no training overhead, so there is no need to gather a labeled dataset that fits your use case. 

That being said, even with a well-designed lexicon model, the pitfalls are significant. There is a reason why better models were quickly produced, after all. Just to name a few of the problems here: 
1. Fails to consider context
2. Does not handle negation (That guy is NOT a good person) 
3. Does not handle intensity or sarcasm
4. Predefined lexicons do not adapt with language or respond well to unknown words


## Improving Initial Approach 
##### Rule Handling and Lexicon Improvements 
---
To improve the limited lexicon-based approach, we must handle the following:
1. Negation handling
2. Intensity handling
3. Dynamic / Domain-specific lexicon
4. Tokenization
5. Multi-Word Phrase Recognition

People who are much better at programming than I am came up with something called VADER. VADER stands for Valence Aware Dictionary and sEntiment Reasoner, and it is a lexicon + rule based model. Its lexicon includes the typical sentiment words, but also accounts for slang (such as "meh"). Each word is mapped to a valence score, and a set of defined rules handles context. 


These improvements to the simplistic lexicon-based model make VADER very effective at classifying blocks of text found in social media posts and product reviews (think Twitter, Instagram, etc.). That being said, it is still a lexicon + rule based approach, and it still reaches its limits when confronted with sarcasm or more complex tasks. 

Implementing all of that from scratch can get rather extensive, so here is a pretty simple implementation to give a basic view into how one might approach it.

In [234]:
#1. Negation Handling
#2. Intensity Handling

#------- Statically Defined Vocabularies -----

lexicon_plus = {"good": 1, "great": 2, "excellent": 3, "wonderful": 2, "amazing": 2, "fantastic": 3,
                "bad": -1, "terrible": -2, "awful": -2, "horrible": -3, "stupid": -2}
negation_words  = ["not", "don't", "dont", "never", "no"]
amplifier_words = {"extremely": 2.0, "very": 1.5, "so": 1.2, "really": 1.3}

def simple_sentiment_plus(text: str) -> str:
  '''
  Slight improvement on simple_sentiment
  Adds rules for handling negation and intensity
  :param text: Block of text to be analyzed
  :return: string result "POS"/"NEG"/"neutral"
  '''
  tokens = text.split()
  negate_window = 0    # how many upcoming scored tokens a negation word still affects
  multiplier    = 1.0  # product of the amplifiers seen since the last scored token
  score = 0

  for token in tokens:
    word = token.lower().strip(".,?!()_\"'@#$%^&*+=:;")
    if word in negation_words:
      negate_window = 3
      continue

    if word in amplifier_words:
      multiplier *= amplifier_words[word]
      continue

    value = lexicon_plus.get(word, 0)
    if negate_window > 0:
      # flip sentiment inside the window; each non-amplifier token uses up one slot,
      # so the negation expires after three tokens instead of lingering indefinitely
      value = -value
      negate_window -= 1

    score += value * multiplier
    multiplier = 1.0

  return "POS" if score > 0 else "NEG" if score < 0 else "neutral"

In [235]:
#positive test
print("Expected: POS\t=>\t<" + simple_sentiment_plus("This is a good movie!") + ">")
print("Expected: POS\t=>\t<" + simple_sentiment_plus("This is a very good movie!") + ">", end='\n\n')
#negative test
print("Expected: NEG\t=>\t<" + simple_sentiment_plus("This is a bad movie!") + ">")
print("Expected: NEG\t=>\t<" + simple_sentiment_plus("That guy is not a good person") + ">", end='\n\n')
#confusing it on purpose
print("Expected: NEG\t=>\t<" + simple_sentiment_plus("I don't feel very good today.") + ">", end='\n')
print("Expected: NEG\t=>\t<" + simple_sentiment_plus("I do not like The Amazing Spiderman. The visuals were good, but overall the storyline was awful and predictable. not good at all") + ">", end='')

Expected: POS	=>	<POS>
Expected: POS	=>	<POS>

Expected: NEG	=>	<NEG>
Expected: NEG	=>	<NEG>

Expected: NEG	=>	<NEG>
Expected: NEG	=>	<NEG>
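
Negation and intensity are handled above, but multi-word phrase recognition from the improvement list is not. One way to approach it is to merge known phrases into single tokens before scoring, so that something like "not bad" can carry its own polarity. This is only a minimal sketch: `phrase_lexicon`, its entries, its scores, and the helper `merge_phrases` are all hypothetical names and values made up for illustration.

```python
# Hypothetical phrase lexicon: multi-word expressions mapped to a polarity score.
# The entries and scores are illustrative assumptions, not taken from a real lexicon.
phrase_lexicon = {
    ("waste", "of", "money"): -2,
    ("not", "bad"): 1,
    ("top", "notch"): 2,
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in phrase_lexicon)

def merge_phrases(tokens: list[str]) -> list[str]:
    """Greedily join the longest known phrase starting at each position into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        # try the longest candidate first, down to two-word phrases
        for size in range(min(MAX_PHRASE_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in phrase_lexicon:
                merged.append("_".join(candidate))
                i += size
                break
        else:
            # no phrase starts here, keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_phrases("this was not bad but a waste of money".split()))
# ['this', 'was', 'not_bad', 'but', 'a', 'waste_of_money']
```

The merged tokens could then be looked up in a lexicon keyed by the joined form (for example `"not_bad"`), ahead of the single-word negation rule.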

In [256]:
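#nltk VADER example 
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  #one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()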

samples = [
    "I love this product! It works great and is very affordable.",
    "This product is okay. It gets the job done, but could be better.",
    "I hate this product. It doesn't work at all and is a waste of money."
]

for text in samples:
    scores = analyzer.polarity_scores(text)
    print(text)
    print(scores, end='\n\n')

I love this product! It works great and is very affordable.
{'neg': 0.0, 'neu': 0.482, 'pos': 0.518, 'compound': 0.8622}

This product is okay. It gets the job done, but could be better.
{'neg': 0.0, 'neu': 0.675, 'pos': 0.325, 'compound': 0.6486}

I hate this product. It doesn't work at all and is a waste of money.
{'neg': 0.371, 'neu': 0.629, 'pos': 0.0, 'compound': -0.7579}
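
The `compound` value in the output above is VADER's overall score, normalized to the range -1 to 1. To get back to the POS/NEG/neutral labels used earlier, it can be thresholded. A minimal sketch, assuming the commonly cited cutoffs of ±0.05 (a convention, so tune it for your data). `label_from_compound` is a hypothetical helper name, not part of NLTK.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse label (threshold is an assumed default)."""
    if compound >= threshold:
        return "POS"
    if compound <= -threshold:
        return "NEG"
    return "neutral"

# compound scores copied from the three sample outputs above
compounds = [0.8622, 0.6486, -0.7579]
print([label_from_compound(c) for c in compounds])
# ['POS', 'POS', 'NEG']
```

Note that the lukewarm second review still lands on the positive side with these cutoffs.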


While combining lexicon- and rule-based methods improves performance significantly, maintaining and scaling such systems is cumbersome and rarely worth the effort. They break easily, and they will never capture all the ways people convey sentiment. 

# Next Generation Techniques 
##### Overcoming Lexicon + Rule Based Limitations
- - -
Eventually some smart person finally got tired of having to hand-write rules and update their lexicons constantly and asked themselves - *"Why do I hate my life, and how can I get the computer to do this for me?"* - and all of a sudden shit got real.

In the following sections, I aim to cover some of the approaches to 
1. Dynamic Lexicon Creation
2. Tokenization
3. Multi-Word Phrase Detection
4. N-Grams
5. and possibly more...


## Dynamic Lexicon Creation
##### How statistics took us a step further
---
The idea here is simple: how can we overcome the need for manually created lexicons by learning sentiment from data? 

The answer is, well, that many techniques were developed:
1. Corpus-Based Lexicon Expansion
2. Semantic Orientation
3. Sentiment Classification via Clustering
4. many more...

For the sake of simplicity and time, I will focus on number 1, *Corpus-Based Lexicon Expansion*.

In [263]:
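#Corpus Based Lexicon Expansion (approximated here with WordNet synonyms)
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')

def get_synonyms(word: str) -> set[str]:
  '''
  Collect the lemma names of every WordNet synset that contains the word
  :param word: seed word to expand
  :return: set of synonym strings (the word itself is included)
  '''
  synonyms = set()
  for synset in wordnet.synsets(word):
    for lemma in synset.lemmas():
      synonyms.add(lemma.name())  #store the string, not the Lemma object
  return synonyms

print(get_synonyms("terrible"))
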

[nltk_data] Error loading wordnet: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\Trevor/nltk_data'
    - 'C:\\Users\\Trevor\\PycharmProjects\\Practice\\.venv\\nltk_data'
    - 'C:\\Users\\Trevor\\PycharmProjects\\Practice\\.venv\\share\\nltk_data'
    - 'C:\\Users\\Trevor\\PycharmProjects\\Practice\\.venv\\lib\\nltk_data'
    - 'C:\\Users\\Trevor\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
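
Once a synonym lookup like the one above works, expanding the lexicon is mostly bookkeeping: copy each seed word's polarity onto its synonyms. A minimal sketch, assuming a hand-written synonym table so that it runs offline. `toy_synonyms` is hypothetical data standing in for WordNet results, and `expand_lexicon` is a hypothetical helper name.

```python
# Hypothetical synonym table standing in for WordNet lookups (made-up data).
toy_synonyms = {
    "good": ["fine", "solid"],
    "terrible": ["dreadful", "atrocious", "awful"],
}

def expand_lexicon(seed_lexicon: dict[str, int],
                   synonyms: dict[str, list[str]]) -> dict[str, int]:
    """Copy each seed word's polarity onto synonyms that are not already scored."""
    expanded = dict(seed_lexicon)
    for seed, polarity in seed_lexicon.items():
        for synonym in synonyms.get(seed, []):
            # setdefault keeps an existing score instead of overwriting it
            expanded.setdefault(synonym, polarity)
    return expanded

seed = {"good": 1, "terrible": -1, "awful": -1}
expanded = expand_lexicon(seed, toy_synonyms)
print(expanded)
# {'good': 1, 'terrible': -1, 'awful': -1, 'fine': 1, 'solid': 1, 'dreadful': -1, 'atrocious': -1}
```

Existing entries win over inherited ones, which is why `awful` keeps its own score rather than being overwritten through `terrible`.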


## Tokenization 
##### What is it and Why am I talking about it?
---

Tokenization is the process of breaking text into a sequence of *tokens*. There are many different ways to *tokenize* our input text:
1. Character Tokenization
    - Split text by each character
2. Word Tokenization
    - Split text by each word 
3. Subword Tokenization
    - Breaks words down into smaller units
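
Subword tokenization is the least obvious of the three, so here is a rough illustration. It splits a word by repeatedly taking the longest piece found in a vocabulary (greedy longest-match, in the spirit of WordPiece-style tokenizers). This is a sketch under assumptions: `toy_vocab` is a hand-written, hypothetical vocabulary, whereas real subword tokenizers learn theirs from a corpus, and `subword_tokenize` is a made-up helper name.

```python
# Hypothetical hand-written subword vocabulary (real tokenizers learn this from data).
toy_vocab = {"un", "break", "able", "token", "ization", "s"}

def subword_tokenize(word: str, vocab: set[str], unknown: str = "[UNK]") -> list[str]:
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # shrink the candidate from the right until it is a known piece
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # no known piece starts at this position
            return [unknown]
        pieces.append(word[start:end])
        start = end
    return pieces

print(subword_tokenize("unbreakable", toy_vocab))   # ['un', 'break', 'able']
print(subword_tokenize("tokenization", toy_vocab))  # ['token', 'ization']
print(subword_tokenize("xyz", toy_vocab))           # ['[UNK]']
```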

Why bother with Tokenization? Breaking a block of text into smaller tokens lets the computer identify meaningful features and patterns within language. We can standardize our input (e.g. stripping whitespace, removing stop words, ...), generate vocabularies, and later convert our *tokens* into numerical representations. 
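
To make the last two points concrete, here is a minimal sketch of building a vocabulary and turning tokens into integer IDs. The helper names `build_vocab` and `encode` and the `<unk>` placeholder are assumptions for illustration. Libraries handle this in their own ways.

```python
def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token an ID in order of first appearance (0 = unknown)."""
    vocab = {"<unk>": 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Replace each token with its ID, falling back to the unknown ID."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = build_vocab(tokens)
print(vocab)
# {'<unk>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8}
print(encode("the lazy cat".split(), vocab))
# [1, 7, 0]
```

Words never seen while building the vocabulary (like "cat" here) all collapse to the same unknown ID, which is one reason character and subword tokenization exist.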

In [244]:
#3. Tokenization
import re

def character_tokenization(text: str) -> list:
  '''
  tokenize text at the character level
  :param text: block of text to be tokenized
  :return: list of tokens 
  '''
  return list(text)

#splits input into lowercase words (keeping contractions) and punctuation marks
def word_tokenization(text: str) -> list:
  '''
  tokenize text at the word level
  :param text: block of text to be tokenized
  :return: list of tokens
  ''' 
  #text.split() may be sufficient in some cases
  pattern = r"[A-Za-z]+(?:'[A-Za-z]+)?|[.,!?;]"
  return re.findall(pattern, text.lower())

#stem extraction
#I would implement this myself, but a rule-based stemmer is one long chain of
#if/else cases, so I use NLTK's PorterStemmer and focus on the bigger parts
from nltk.stem import PorterStemmer
def stem_tokenization(text: str) -> list:
  stemmer = PorterStemmer()
  words = word_tokenization(text)
  return [stemmer.stem(word) for word in words]

#lemmatization
#again, I would implement this myself, but I want to get somewhere in the project first
#I can always come back and implement it from scratch if I have the time
from nltk.stem import WordNetLemmatizer
def lemmatize_tokenization(text: str) -> list:
  lemmatizer = WordNetLemmatizer()  #requires the 'wordnet' corpus
  words = word_tokenization(text)
  return [lemmatizer.lemmatize(word) for word in words]

In [259]:
text = "The quick brown fox jumps over the lazy dog that "

characters = character_tokenization(text)
words = word_tokenization(text)
stems = stem_tokenization(text)
lemmas = lemmatize_tokenization(text)
print(f"Char Tokenization: {characters[:51]}", end='\n\n')
print(f"Word Tokenization: {words}", end='\n\n')
print(f"Stemming : {stems}", end='\n\n')
print(f"Lemmatization : {lemmas}", end='\n\n')


Char Tokenization: ['T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ' ', 'b', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 's', ' ', 'o', 'v', 'e', 'r', ' ', 't', 'h', 'e', ' ', 'l', 'a', 'z', 'y', ' ', 'd', 'o', 'g', ' ', 't', 'h', 'a', 't', ' ']

Word Tokenization: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'that']

Stemming : ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', 'that']

Lemmatization : ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', 'that']


In [246]:
#4. Dynamic Lexicon

#some predefined stop words (punctuation included)
#alternatively, stop = set(stopwords.words('english'))
stop = {'.', '!', '?', ':', ';', 'it', 'he', 'had', 'i', 'if', 'a', 'an', 'as', 'to', 'the', 'was',
        'of', 'these', 'in', 'on', 'at', 'with', 'is', 'am', 'are', 'we', 'be', 'been', 'im', 'so', 'that'}

def generate_lexicon(text: str, window_size: int = 3) -> dict:
  '''
  Build a new lexicon for words that are missing from our static lexicon
  Filters out predefined stopwords and punctuation markers
  Scores each new word with a sliding window: (positive neighbors - negative neighbors)
  :param text: block of text to build the lexicon from
  :param window_size: number of tokens to inspect on each side of a word
  :return: dict mapping each new (stemmed) token to its score
  '''
  static_lexicon = lexicon.copy()

  #we need the positive + negative words in the statically defined lexicon
  pos_words = {word for word, polarity in static_lexicon.items() if polarity > 0}
  neg_words = {word for word, polarity in static_lexicon.items() if polarity < 0}

  #now let's iterate through our tokenized text
  dynamic_lexicon = {}
  tokens = stem_tokenization(text)

  for index, token in enumerate(tokens):
    #skip predefined stop words (which include punctuation), numbers, and special characters
    if token in stop or not token.isalpha():
      continue
    #skip tokens we already account for
    if token in static_lexicon or token in amplifier_words:
      continue

    #establish the bounds of our window, clamped to the token list
    start = max(0, index - window_size)
    end   = min(len(tokens), index + window_size + 1)

    positives = 0
    negatives = 0
    for j in range(start, end):
      #skip the word we are evaluating at the moment and any stop words
      if j == index or tokens[j] in stop:
        continue
      if tokens[j] in pos_words:
        positives += 1
      elif tokens[j] in neg_words:
        negatives += 1

    #a token that appears more than once keeps the score of its last occurrence
    dynamic_lexicon[token] = positives - negatives

  return dynamic_lexicon

In [196]:
text = "I was very good to hear about your wonderful vacation! Im so sorry about that stupid hassle at the airport, glad it was fantastic Superwoman !"
print(generate_lexicon(text, 3))

{'wa': 0, 'veri': 1, 'hear': 1, 'about': -1, 'your': 0, 'wonder': 0, 'sorri': -1, 'hassl': -1, 'glad': 0, 'fantast': 0, 'superwoman': 0}


#### *Notice how this behavior isn't exactly what we were shooting for.*
1. <'wonderful'> was converted to <'wonder'> by stemming, so it went unrecognized as a positive seed
2. <'very'> was converted to <'veri'> by stemming, so it went unrecognized as an amplifier word
3. 'superwoman' is preceded by 'fantastic', yet a neutral score was calculated
4. I guess my statically defined lexicon does not define 'glad' as a positive word

This is clearly problematic. To fix this, I hypothesised that stemming my lexicons (positive, negative, stop, amplifiers...) the same way as the input will yield better, more expected results.
Logically, what is applied to one should be applied to all, for consistency. Words with similar stems 