# Sentiment Analysis
> [Main Table of Contents](../../../README.md)

## In This Notebook
- Sentiment Analysis
- Preprocess Text
	- Tokenization
	- Stemming
	- Lemmatization
	- Stop Words
	- Convert text to numerical representations
		- one-hot encoding
		- Vectorization
			- Type of vectorization - Bag of Words
- NER. Named Entity Recognition
- n-gram | q-gram | ngram | ngrams

## Sentiment Analysis

Libraries | Method(s)/Attribute(s) | Description
--- | --- | ---
nltk.sentiment.vader.SentimentIntensityAnalyzer() | .polarity_scores(text) | Trained on social media text<br>Returns a dict of neg, neu, pos, and compound scores, where -1 <= compound <= 1
textblob.TextBlob(text) | .sentiment | Returns named tuple<br>Sentiment(polarity, subjectivity)<br>polarity is a float within the range [-1.0, 1.0]<br>subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective
spacy | Doc.sentiment<br>Span.sentiment<br>Token.sentiment | Scalar value
rasa | interpreter.parse(text) | Returns dict(intent, entities, text), which can be used further<br>TODO: come back to this
sklearn | sklearn.linear_model.LogisticRegression | Train on text labeled with sentiment classes (e.g. positive, negative, etc.)

## Preprocess Text
- Leads to smaller vocabularies, which reduces dimensionality and can improve model performance

In [2]:
# Apply a preprocessing step to a pandas column of transcripts
import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")
stopwords = STOP_WORDS

def preprocess(text):
    # Create Doc object (NER and the parser are not needed to produce lemmas)
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lowercased lemmas so they match the lowercase stop word list
    lemmas = [token.lemma_.lower() for token in doc]
    # Remove stopwords and non-alphabetic tokens
    # NOTE: caution, isalpha() would remove abbreviations with periods and proper nouns that use non-alpha chars
    # NOTE: if niche context, advisable to create custom stopwords
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() and lemma not in stopwords]

    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted = pd.DataFrame({'transcript': ["We're going to talk — my — a new lecture, just", "This is a representation of your brain, and yo."], 'url': ["https://www.ted.com/talks/al_seckel_says_our_b", "https://www.ted.com/talks/aaron_o_connell_maki"]})
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])


### Tokenization
- Split text into structures called tokens

	Library | Method
	--- | ---
	nltk | nltk.word_tokenize(text) (requires nltk.download('punkt') first)
	spacy | nlp(text) produces a Doc that has been tokenized

In [None]:
# Tokenization Example
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("I won't focus on negative thoughts")
print(*doc)


### Stop Words
- Common, uninformative words (e.g. prepositions) that are typically filtered out
- For a niche context, it is often advisable to build your own list of stop words
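
A minimal sketch of extending a general stop word list for a niche context. Both `BASE_STOP_WORDS` and `DOMAIN_STOP_WORDS` are hypothetical placeholders made up for illustration; in practice the base list would likely come from a library such as spaCy or NLTK.

```python
# Hypothetical, tiny base list; library lists hold a few hundred words.
BASE_STOP_WORDS = {"the", "a", "of", "on", "is", "and", "to"}
# Hypothetical additions for a niche context, e.g. transcripts of talks.
DOMAIN_STOP_WORDS = {"applause", "laughter"}

STOP_WORDS = BASE_STOP_WORDS | DOMAIN_STOP_WORDS

def remove_stop_words(tokens):
    """Return the tokens that are not in the combined stop word set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = "The talk on the brain ended to applause".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['talk', 'brain', 'ended']
```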

### Stemming
- Truncate a word to its root (root may not be a valid word)
	- e.g. house, housing, houses -> hous
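
To make the "root may not be a valid word" point concrete, here is a toy suffix stripper. It is not the Porter algorithm or any library's stemmer; `toy_stem` and its suffix list are assumptions chosen only to reproduce the example above. Real stemmers, such as NLTK's `PorterStemmer`, apply many more rules.

```python
def toy_stem(word, suffixes=("ing", "es", "s", "e")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["house", "housing", "houses"]]
print(stems)  # ['hous', 'hous', 'hous']
```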

### Lemmatization
- Reduce a word to its dictionary form, the lemma (the lemma is always a valid word)
	- e.g. house, housing, houses -> house

In [3]:
# Lemmatization Example
# Turned tokens: wo -> will and n't -> not
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("I won't focus on negative thoughts")
print(' '.join(token.lemma_ for token in doc))


### Convert Text to numerical representations
- Machine learning models need numerical input, so text must be converted to numbers
- Integer encoding
- One-hot encoding
- Vectorization

#### Integer Encoding
- Convert text to numerical representations by mapping each distinct token (or category) to a unique integer
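
A minimal sketch of integer encoding using only the standard library. The helper name `build_vocab` and the sample sentence are assumptions for illustration; IDs are assigned in order of first appearance, which is one common convention among several.

```python
def build_vocab(tokens):
    """Assign each distinct token the next free integer, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = "the movie was good and the cast was good".split()
vocab = build_vocab(tokens)
encoded = [vocab[tok] for tok in tokens]
print(vocab)    # {'the': 0, 'movie': 1, 'was': 2, 'good': 3, 'and': 4, 'cast': 5}
print(encoded)  # [0, 1, 2, 3, 4, 0, 5, 2, 3]
```

Note that the integers imply an ordering (`cast` > `movie`) that has no meaning, which is one reason one-hot encoding is often preferred for categorical data.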

#### One-Hot Encoding / one-hot vector
- One-hot encoding is the process of converting categorical data (text or numerical) to a binary representation
	- binary variables are called 'dummy' variables in statistics
		Ways to get dummy variables |
		--- |
		`pandas.get_dummies()`|
		`tensorflow.keras.utils.to_categorical()`|
		`pyspark.ml.feature.StringIndexer` + `pyspark.ml.feature.OneHotEncoder` in a pyspark pipeline |

In [None]:
# One-Hot Encoding Example
import pandas as pd
pre = pd.DataFrame({'age': [10, 8, 12, 18], 'type': ['beagle', 'lab', 'pood', 'beagle'], 'sex': ['M', 'F', 'F', 'M']})
print(f'Pre DF\n{pre.to_string()}')
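# Convert the type and sex columns into numerical (dummy) columns
# dtype=int keeps the 0/1 output shown below; recent pandas versions default to booleans
post = pd.get_dummies(pre, columns=['sex', 'type'], dtype=int)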
print(f'Post DF\n{post.to_string()}')

Pre DF
   age    type sex
0   10  beagle   M
1    8     lab   F
2   12    pood   F
3   18  beagle   M
Post DF
   age  sex_F  sex_M  type_beagle  type_lab  type_pood
0   10      0      1            1         0          0
1    8      1      0            0         1          0
2   12      1      0            0         0          1
3   18      0      1            1         0          0


#### Vectorization
- Numerical representation of text as vectors, using algorithms like spacy's `tok2vec` component

##### Type of vectorization - Bag of Words
- Represents each document by the count/frequency of its tokens/words
- Use sklearn `CountVectorizer` or `TfidfVectorizer` to vectorize text
	- Use spacy beforehand to preprocess the text (lemmatization, etc.)
- The model cannot analyze words in the test set that were absent from the training vocabulary
- BOW shortcoming: it ignores context (word order). `ngram_range` can partly address this, though larger n-grams may do only marginally better, and the added cost (time and dimensionality) may not be worth it.
	- e.g. "The movie was good and not boring" and "The movie was not good and boring" have the same BOW but opposite sentiment.
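
The two-sentence example can be checked with a standard-library sketch. The helper `bag_of_ngrams` is an assumption for illustration: it lowercases and splits on whitespace, which is simpler than the tokenization sklearn's `CountVectorizer` applies, but the idea of counting unigrams versus bigrams is the same.

```python
from collections import Counter

def bag_of_ngrams(text, n=1):
    """Count the n-grams (as space-joined strings) in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

a = "The movie was good and not boring"
b = "The movie was not good and boring"

# Unigram bags are identical, so a unigram BOW model cannot tell the sentences apart.
same_unigrams = bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1)
# Bigram bags differ, because bigrams keep some of the word order.
same_bigrams = bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2)
print(same_unigrams)  # True
print(same_bigrams)   # False

only_in_a = sorted(set(bag_of_ngrams(a, 2)) - set(bag_of_ngrams(b, 2)))
print(only_in_a)  # ['and not', 'not boring', 'was good']
```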

## NER. Named Entity Recognition
- Categorizing words into entities e.g. Person, Place
- NER is useful for new (unseen) words: it infers meaning and assigns a category based on context.

## n-gram | q-gram | ngram | ngrams
- Contiguous sequence of *n* items from a given text
- Incorporate context to words by also analyzing surrounding words
- Use phrases (bigram, trigram, etc) instead of a word (unigram)
- ngram applications include
	- sentence completion
	- spelling correction
	- machine translation correction
	- take in more context in BOW vectorizations
- High-order ngrams are rarely useful: most of them occur only once, and they inflate the feature space (curse of dimensionality)
	- Keep ngrams small for efficiency and usefulness
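
A rough sketch of why high-order n-grams add little. The three-sentence corpus and the `ngrams` helper are assumptions for illustration. For each *n*, the loop counts the distinct n-grams and how many of them occur more than once; an n-gram that never repeats gives a model nothing to generalize from.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical toy corpus
corpus = [
    "the movie was good",
    "the movie was boring",
    "the cast was good",
]

print(ngrams(corpus[0].split(), 2))
# [('the', 'movie'), ('movie', 'was'), ('was', 'good')]

stats = {}
for n in range(1, 5):
    counts = Counter(g for doc in corpus for g in ngrams(doc.split(), n))
    repeated = sum(1 for c in counts.values() if c > 1)
    stats[n] = (len(counts), repeated)  # (distinct n-grams, n-grams seen more than once)
    print(n, stats[n])
# 1 (6, 4)
# 2 (6, 3)
# 3 (5, 1)
# 4 (3, 0)
```

In this tiny corpus the share of repeated n-grams drops to zero by *n* = 4. On larger corpora the number of distinct n-grams typically grows with *n* as well, which adds columns without adding much signal.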