# Sentiment Analysis

- Also known as opinion mining or polarity detection  
- Definition: the set of algorithms and techniques used to extract the polarity of a given document  
- It can be approached with both machine learning (ML) and deep learning (DL)  
- There are two approaches: the Machine Learning Approach and the Lexicon-based (dictionary) Approach

Relevant Python Libraries:

- ```NLTK``` (Python Natural Language Toolkit): can perform tokenization, stemming, lemmatization, punctuation handling, character count, and word count  
- ```VADER``` (Valence Aware Dictionary and sEntiment Reasoner): better suited to sentiment analysis of social media text  
- ```Scikit Learn```: provides ML models (Naive Bayes, decision trees, SVM) that can be used for sentiment analysis  
- ```TextBlob```

## 0. Text Vectorization

- Computers take in vectors and matrices, so text has to be converted before a model can use it. 
- Vectorization consists of the following 5 steps:  
    - ```Tokenization```: splitting text into relevant units (characters, words, phrases, ...). These units are called tokens.  
    - ```Lemmatization```: reducing each word to its dictionary form (lemma) by removing inflections, for example stripping the -ing and -ed from playing, played, performed. This is often not simple, because the result (play, perform) has to be confirmed as a real English word.  
    - ```Stemming```: keeping only the root of the word and discarding other forms. It is similar to ```Lemmatization```, but it strips a fixed, predefined set of suffixes by rule, so the result is not guaranteed to be a real word.  
    - ```Stop words```: removing words that carry no sentiment, such as articles and connectors (the, a, ...), subject pronouns, auxiliary verbs, etc.  
    - ```Normalization```: cleaning the data, for example removing emoticons and extra punctuation marks. 
- The result of these steps is a vector of words.
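
The notes above stop at "a vector of words". As a minimal sketch of how such a word list could then become a numeric vector, the snippet below assumes a simple bag-of-words (count) representation, which the notes do not specify. The helper names `build_vocabulary` and `to_count_vector` and the two toy documents are made up for illustration and are not part of NLTK.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map every distinct token in the corpus to a column index (sorted for stability)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def to_count_vector(tokens, vocab):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(tokens)
    # dicts keep insertion order, so iterating vocab follows the column indices
    return [counts.get(tok, 0) for tok in vocab]

# Hypothetical documents that are already tokenized, lowercased and stop-word filtered
docs = [
    ["sentimental", "person", "believe", "utility", "sentiment", "analysis"],
    ["sentiment", "analysis", "utility", "utility"],
]

vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]

print(list(vocab))
for vec in vectors:
    print(vec)
```

Each document becomes one row of equal length, which is the kind of matrix an ML model can consume. Other representations (for example TF-IDF or embeddings) would work with the same preprocessing.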


### Python code:  

Install ```NLTK```: ```conda install nltk```

In [1]:
text = "I am not a sentimental person but I believe in the utility of sentiment analysis."

In [2]:
# Tokenization
from nltk.tokenize import word_tokenize

In [3]:
tokens = word_tokenize(text)

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\User/nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\share\\nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\lib\\nltk_data'
    - 'C:\\Users\\User\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


A ```LookupError``` is raised because the required resource (the ```punkt``` tokenizer data) has to be downloaded first:

In [4]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [13]:
tokens = word_tokenize(text)
tokens

['I',
 'am',
 'not',
 'a',
 'sentimental',
 'person',
 'but',
 'I',
 'believe',
 'in',
 'the',
 'utility',
 'of',
 'sentiment',
 'analysis',
 '.']
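
To make concrete what word-level tokenization produces, here is a rough regular-expression approximation that needs no downloaded resource. It is only a sketch and not how `word_tokenize` works internally; the variable name `simple_tokens` is made up for illustration.

```python
import re

text = "I am not a sentimental person but I believe in the utility of sentiment analysis."

# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
simple_tokens = re.findall(r"\w+|[^\w\s]", text)
print(simple_tokens)
```

For this sentence the result has the same 16 tokens as the output above. On contractions, abbreviations and similar cases the two will differ, so this is not a replacement for a real tokenizer.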

In [7]:
# Lemmatization
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() # create a lemmatizer object

tokens = [lemmatizer.lemmatize(word) for word in tokens]

LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\User/nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\share\\nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\lib\\nltk_data'
    - 'C:\\Users\\User\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


Again, the required resource (the WordNet corpus) has to be downloaded first:

In [8]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [19]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() # create a lemmatizer object

tokens = [lemmatizer.lemmatize(word) for word in tokens]

In [20]:
tokens

['i',
 'am',
 'not',
 'a',
 'sentiment',
 'person',
 'but',
 'i',
 'believ',
 'in',
 'the',
 'util',
 'of',
 'sentiment',
 'analysi',
 '.']

In [16]:
# Stemming 
from nltk.stem import PorterStemmer
tokens = word_tokenize(text.lower())
ps = PorterStemmer()

tokens = [ps.stem(word) for word in tokens]

In [17]:
tokens

['i',
 'am',
 'not',
 'a',
 'sentiment',
 'person',
 'but',
 'i',
 'believ',
 'in',
 'the',
 'util',
 'of',
 'sentiment',
 'analysi',
 '.']

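Note that stemming turns "a sentimental person" into "a sentiment person". The lemmatization output above (In [20]) is identical only because that cell ran after the stemming cell (In [16]), so it received tokens that were already lowercased and stemmed. Applied to the original tokens, ```WordNetLemmatizer``` with its default settings leaves "sentimental" unchanged.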

In [23]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [24]:
stopwords = nltk.corpus.stopwords.words("english")
stopwords # the English words that the nltk package treats as stop words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [27]:
stopwords_ARB = nltk.corpus.stopwords.words("arabic")

In [29]:
stopwords_ARB[:5]

['إذ', 'إذا', 'إذما', 'إذن', 'أف']

The stopwords files are stored in: ```C:\Users\User\AppData\Roaming\nltk_data\corpora\stopwords```  
Note: this download contains no stop word list for Chinese.
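
The cells above load the stop word list but never apply it. A minimal sketch of the filtering step follows, assuming a small hand-picked subset of the English list (`stop_subset`, a hypothetical name) so that it runs without any downloaded resource.

```python
# Hand-picked subset of the English list printed above (assumption: enough for this sentence)
stop_subset = {"i", "am", "a", "but", "in", "the", "of"}

tokens = ["I", "am", "not", "a", "sentimental", "person", "but", "I",
          "believe", "in", "the", "utility", "of", "sentiment", "analysis", "."]

# The NLTK list is lowercase, so compare against the lowercased token
filtered = [tok for tok in tokens if tok.lower() not in stop_subset]
print(filtered)
```

With the full list, the same comprehension works with `set(nltk.corpus.stopwords.words("english"))`. The full list also contains negation words such as `not`, which matter for polarity, so it may be worth excluding them from the stop word set before filtering.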

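Among the vectorization steps above, lemmatization and stemming are not always appropriate, because they can conflate words with different meanings. For example, promised (a commitment was made) and promising (likely to succeed) mean different things, but lemmatization or stemming can reduce both to the same root, promise.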
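
The collision can be checked directly with the Porter stemmer used earlier. This is a small sketch; the variable names `words` and `stems` are made up for illustration, and it only covers stemming, not lemmatization.

```python
from nltk.stem import PorterStemmer  # needs no extra nltk.download() resource

ps = PorterStemmer()
words = ["promised", "promising"]
stems = [ps.stem(w) for w in words]

# Both words map to one stem, so the difference in meaning is lost
print(dict(zip(words, stems)))
print("collapsed:", len(set(stems)) == 1)
```

If such distinctions matter for the task, one option is to skip stemming and keep the surface forms as separate tokens.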

## 1. Lexicon Based Approach