# Sentiment Analysis

- 又稱為 opinion mining, polarity detection  
- definition: The set of algorithms and techniques used to extract the polarity of a given document  
- 屬於 ML，也屬於 DL  
- 有兩種 appraoches: Machine Learning Approach 與 Lexicon-based (辭典) Approach

Relevant Python Libraries:

- ```NLTK``` (Python Natural Language Toolkit): Can perform Tokenization, Stemming, Lemmatization, Punctuation, Character count, word count   
- ```VADER``` (Valence Aware Dictionary and sEntiment Reasoner): Better for social media sentiment analysis    
- ```Scikit Learn```: Use ML models (Naive Bayes, Decision trees, SVM) to perform sentiment analysis  
- ```TextBlob```

## 0. Text Vectorization (向量化):

- Computers take in vectors and matrices. 
- A vectorization cotains the following 5 steps:  
    - ```Tokenization``` (斷詞): splitting text into relevant units (characters, words, phrases,...)。這些 units 稱為 tokens  
    - ```Lemmatization``` (詞幹提取): removing the inflectional forms of the words (把詞形變化去掉，比如將 playing, played, performed 的 -ing, -ed 去掉，往往不簡單，因為要確認 play, perform 這些字是真的英文)  
    - ```Stemming``` (詞幹提取): keeping only the root word and rejecting other forms (跟 ```Lemmatization``` 很像，但這次指定好詞形變化是哪些)  
    - ```Stop words```: removing non sentiment showing words such as connector words (the,a...), subject verbs etc.  
    - ```Normalization```: cleaning data such as removing emoticons (表情符號), extra punctuation marks etc. 
- 做完上述步驟後的結果是 a vector of words


### python code:  

安裝 ```NLTK```: ```conda install nltk```

In [1]:
text = "I am not a sentimental person but I believe in the utility of sentiment analysis."

In [2]:
# Tokenization
from nltk.tokenize import word_tokenize

In [3]:
tokens = word_tokenize(text)

LookupError: 
**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt/english.pickle[0m

  Searched in:
    - 'C:\\Users\\User/nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\share\\nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\lib\\nltk_data'
    - 'C:\\Users\\User\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


 出現 ```LookupError```，原因是要下載字典:

In [4]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [13]:
tokens = word_tokenize(text)
tokens

['I',
 'am',
 'not',
 'a',
 'sentimental',
 'person',
 'but',
 'I',
 'believe',
 'in',
 'the',
 'utility',
 'of',
 'sentiment',
 'analysis',
 '.']

In [7]:
# Lemmatization
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() # create a lemmatizer object

tokens = [lemmatizer.lemmatize(word) for word in tokens]

LookupError: 
**********************************************************************
  Resource [93mwordnet[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('wordnet')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mcorpora/wordnet[0m

  Searched in:
    - 'C:\\Users\\User/nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\share\\nltk_data'
    - 'C:\\Users\\User\\Anaconda3\\envs\\quantTrading\\lib\\nltk_data'
    - 'C:\\Users\\User\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


一樣要下載字典:

In [8]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [19]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() # create a lemmatizer object

tokens = [lemmatizer.lemmatize(word) for word in tokens]

In [20]:
tokens

['i',
 'am',
 'not',
 'a',
 'sentiment',
 'person',
 'but',
 'i',
 'believ',
 'in',
 'the',
 'util',
 'of',
 'sentiment',
 'analysi',
 '.']

In [16]:
# Stemming 
from nltk.stem import PorterStemmer
tokens = word_tokenize(text.lower())
ps = PorterStemmer()

tokens = [ps.stem(word) for word in tokens]

In [17]:
tokens

['i',
 'am',
 'not',
 'a',
 'sentiment',
 'person',
 'but',
 'i',
 'believ',
 'in',
 'the',
 'util',
 'of',
 'sentiment',
 'analysi',
 '.']

可以發現，經過 lemmatize 或 stemmize 後， a sentimental person 變成 a sentiment person

In [23]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [24]:
stopwords = nltk.corpus.stopwords.words("english")
stopwords # 這些是被 nltk package 認定為 stopwords 的英文字

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [27]:
stopwords_ARB = nltk.corpus.stopwords.words("arabic")

In [29]:
stopwords_ARB[:5]

['إذ', 'إذا', 'إذما', 'إذن', 'أف']

stopwords 的檔案存放在: ```C:\Users\User\AppData\Roaming\nltk_data\corpora\stopwords```  
注意: 沒有中文字的 stopwords

在上述的 vectorizaton 步驟中，lemmatize 或 stemmize 其實不太合理，因為可能造成誤會，比如 promised (承諾)跟 promising (有未來的)意思不同，但在 lemmatize 或 stemmize 下都會變成 promise。

## 1. Lexicon Based Approach

- Sentiment lexicon 就是一個辭典，裡面存放很多字，而根據這些字的正面或負面，分別給 -1 到 +1 的值，數值越大，代表這個詞越 positive  
Some pre-existing lexicons:   
    - LIWC  
    - ANEW  
    - SentiWordNEt  
    - SenticNet   
    - VADER  
- Drawbacks:  
    - 無法為 acronyms (縮寫，例如: laser)、initialism (縮寫，例如 PDF。[initialism 跟 acronyms 的差別在 initialism 的縮寫只能一個一個字母唸](https://www.editage.com.tw/resources/tutorial/difference-between-abbreviation.html))、emoticons、slangs etc. 因此 lexicon based approach 較不適合拿來分析 social media。  
    - 無法分辨 sentiment intensity (強度)。例如: Food is exceptional 與 Food is good 哪一個比較好。  
    - 無法分辨 sarcasm (諷刺)  
- 改善:   
    VADER 改善前述問題，其特色為:  
    - 納入 slangs (eg: LOL, OMG, ROFL, Nah, Meh etc) 與 emoji  
    - wider scale: -4 (extremely negative) to +4 (extremely positive)

### VADER ([github 網站](https://github.com/cjhutto/vaderSentiment)):

安裝: ```pip install vaderSentiment```

- 根據 [```vaderSentiment.py``` 裡的 ```normalize``` method:](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py) 作者將每個字標準化，使數值介於 ```0``` 到 ```1```。  
- 同時，作者建議:  
    - positive sentiment: ```compound``` score >= 0.05
    - neutral sentiment: (```compound``` score > -0.05) and (```compound``` score < 0.05)
    - negative sentiment: ```compound``` score <= -0.05

In [1]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [7]:
analyser = SentimentIntensityAnalyzer() # create an object

In [8]:
analyser.polarity_scores("This is a good course.")

{'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.4404}

回傳值是一個字典，裡面有 ```This is a good course.``` 這段話裡:  
- 負面的字 (```neg```) 的程度  
- 中性字 (```neu```) 的程度  
- 正面字的程度 (```pos```)  
- 所有字合在一起 (也就是整句話) 的程度 (```compound```)

In [10]:
analyser.polarity_scores("This is an awesome course.") # better than previous

{'neg': 0.0, 'neu': 0.494, 'pos': 0.506, 'compound': 0.6249}

In [11]:
analyser.polarity_scores("This is an exceptional course.") # cant tell

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

增強語意:

In [12]:
analyser.polarity_scores("The instructor is so cool.")

{'neg': 0.0, 'neu': 0.572, 'pos': 0.428, 'compound': 0.4572}

In [13]:
analyser.polarity_scores("The instructor is so cool!")

{'neg': 0.0, 'neu': 0.549, 'pos': 0.451, 'compound': 0.5079}

In [14]:
analyser.polarity_scores("The instructor is so COOL!")

{'neg': 0.0, 'neu': 0.488, 'pos': 0.512, 'compound': 0.6369}

表情符號:

```pip install emoji```

[完整 emoji lists 及其 utf-8 與 CLDR code](https://unicode.org/emoji/charts/full-emoji-list.html)

In [18]:
import emoji

In [29]:
print(emoji.emojize(":pleading_face:"))

🥺


In [27]:
print("\U0001F97A")

🥺


In [30]:
plead_face = "\U0001F97A"

In [38]:
print("I want a girlfriend" + plead_face)
analyser.polarity_scores("I want a girlfriend" + plead_face)

I want a girlfriend🥺


{'neg': 0.0, 'neu': 0.794, 'pos': 0.206, 'compound': 0.0772}

In [35]:
crying_face = emoji.emojize(":crying_face:")

In [37]:
print("I want a girlfriend" + crying_face)
analyser.polarity_scores("I want a girlfriend" + crying_face)

I want a girlfriend😢


{'neg': 0.369, 'neu': 0.476, 'pos': 0.155, 'compound': -0.4215}

Slangs:

In [34]:
analyser.polarity_scores("The movie SUX")

{'neg': 0.618, 'neu': 0.382, 'pos': 0.0, 'compound': -0.4995}

### Textblob ([github 網站](https://github.com/sloria/TextBlob))

安裝: ```pip install textblob```

In [39]:
from textblob import TextBlob

In [40]:
TextBlob("His").sentiment

Sentiment(polarity=0.0, subjectivity=0.0)

In [41]:
txt = "His remarkable work ethic impressed me"

In [45]:
for i in txt.split(" "):
    print("Showing the sentiment of {}: {}".format(i, TextBlob(i).sentiment))

Showing the sentiment of His: Sentiment(polarity=0.0, subjectivity=0.0)
Showing the sentiment of remarkable: Sentiment(polarity=0.75, subjectivity=0.75)
Showing the sentiment of work: Sentiment(polarity=0.0, subjectivity=0.0)
Showing the sentiment of ethic: Sentiment(polarity=0.0, subjectivity=0.0)
Showing the sentiment of impressed: Sentiment(polarity=1.0, subjectivity=1.0)
Showing the sentiment of me: Sentiment(polarity=0.0, subjectivity=0.0)


In [46]:
TextBlob(txt).sentiment

Sentiment(polarity=0.875, subjectivity=0.875)

## 目標: 爬取 [oilprice.com 關於 crude oil 的新聞](https://oilprice.com/Energy/Crude-Oil/)，並解析 sentiment:

In [47]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [48]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [49]:
# create lists to store relevant datas:
url_list = []
date_time = []
headlines = []
news_text = []

In [55]:
# scrap page 1 to page 3 of the website:
for i in range(1, 3):
    url = "http://oilprice.com/Energy/Crude-Oil/Page-" + str(i) + ".html"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html")
    
    for div in soup.find_all("div", {"class":"categoryArticle__content"}):
        for a in div.find_all("a"):
            if a.get('href') not in url_list:
                url_list.append(a.get('href'))


In [57]:
url_list[0]

'https://oilprice.com/Energy/Crude-Oil/Will-US-Shale-Finally-Reward-Shareholders.html'

查看如何找出目標資訊 (使用一個特定網頁檢查):

Find the headline:

In [58]:
# 新聞標題就在 url 裡:
url_list[0].split("/")

['https:',
 '',
 'oilprice.com',
 'Energy',
 'Crude-Oil',
 'Will-US-Shale-Finally-Reward-Shareholders.html']

In [61]:
url_list[0].split("/")[-1].replace("-", " ")

'Will US Shale Finally Reward Shareholders.html'

In [132]:
egg = url_list[0].split("/")[-1].replace("-", " ").split(" ")

In [133]:
egg

['Will', 'US', 'Shale', 'Finally', 'Reward', 'Shareholders.html']

Find the date:

In [79]:
# 找出 datetime:
url = "https://oilprice.com/Energy/Crude-Oil/Will-US-Shale-Finally-Reward-Shareholders.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html")

In [93]:
dates = soup.find("span", {"class":"article_byline"}) # find the dates

In [92]:
dates

<span class="article_byline">By <a href="https://oilprice.com/contributors/Tsvetana-Paraskova">Tsvetana Paraskova</a> - Jan 31, 2021, 6:00 PM CST</span>

In [97]:
type(dates)

bs4.element.Tag

In [98]:
dates.text

'By Tsvetana Paraskova - Jan 31, 2021, 6:00 PM CST'

In [99]:
type(dates.text)

str

In [104]:
# slice the relevant info:
dates.text.split("-")[-1]

' Jan 31, 2021, 6:00 PM CST'

Find the main text of the news:

In [106]:
par = soup.find_all("p")

In [108]:
type(par)

bs4.element.ResultSet

In [114]:
par

[<p><a href="https://oilprice.com/oil-price-charts#prices">Click Here for 150+ Global Oil Prices <img alt="Link" src="https://d1o9e4un86hhpc.cloudfront.net/a/img/common/header/link.png"/></a></p>,
 <p><a href="https://oilprice.com/oil-price-charts#prices">Click Here for 150+ Global Oil Prices <img alt="Link" src="https://d1o9e4un86hhpc.cloudfront.net/a/img/common/header/link.png"/></a></p>,
 <p><a href="https://oilprice.com/oil-price-charts#prices">Click Here for 150+ Global Oil Prices <img alt="Link" src="https://d1o9e4un86hhpc.cloudfront.net/a/img/common/header/link.png"/></a></p>,
 <p><a href="https://oilprice.com/oil-price-charts#prices">Click Here for 150+ Global Oil Prices <img alt="Link" src="https://d1o9e4un86hhpc.cloudfront.net/a/img/common/header/link.png"/></a></p>,
 <p><a href="https://oilprice.com/oil-price-charts#prices">Click Here for 150+ Global Oil Prices <img alt="Link" src="https://d1o9e4un86hhpc.cloudfront.net/a/img/common/header/link.png"/></a></p>,
 <p><a href="ht

```html``` code 的文章內容都是放在 ```<p>``` tag 裡，故把網頁的所有 paragraph tags 找出來。  
但要注意的是，不是所有 ```<p>``` 的 tag 都是內文，  
觀察 ```par```，發現文章的正文是在 ```More Info``` 之後，止於  ```By Tsvetana Paraskova for Oilprice.com```。 (這裡就是要觀察)

抓取主文內容，關鍵 attribute: ```.text```。

In [111]:
# slice the relevant text of the website:
temp = []
for news in soup.find_all("p"):
    temp.append(news.text)

In [112]:
temp

['Click Here for 150+ Global Oil Prices ',
 'Click Here for 150+ Global Oil Prices ',
 'Click Here for 150+ Global Oil Prices ',
 'Click Here for 150+ Global Oil Prices ',
 'Click Here for 150+ Global Oil Prices ',
 'Click Here for 150+ Global Oil Prices ',
 'Click Here for 150+ Global Oil Prices ',
 'Click Here for 150+ Global Oil Prices ',
 "U.S. Shale: Biden's Drilling Ban Actually Undermines Emission Targets",
 '\nFind us on:\n\n\n\n\n\n\n\n ',
 'Iraq will pump less oil…',
 'The declining upstream investment, if…',
 'Saudi Arabia narrowly beat Russia…',
 'Tsvetana Paraskova',
 'Tsvetana is a writer for Oilprice.com with over a decade of experience writing for news outlets such as iNVEZZ and SeeNews.\xa0',
 'More Info',
 'U.S. oil prices above $50 a barrel are helping the shale patch to generate more cash flow, especially after the massive capital spending cuts last year.\xa0 But higher prices are also reviving the good old dilemma of U.S. shale producers—raise production or raise p

In [116]:
temp[-1]

'Merchant of Record: A Media Solutions trading as Oilprice.com'

規則: 如果 ```temp[-1]``` 的開頭是 ```By``` 且結尾是 ```Oilprice.com```，則 ```break``` 掉迴圈

### 取得文章的開頭:

In [120]:
temp.index("More Info") # 文章開頭的 index 是 15

15

In [122]:
temp[temp.index("More Info")+1] # 因此下一個就是正文的開頭

'U.S. oil prices above $50 a barrel are helping the shale patch to generate more cash flow, especially after the massive capital spending cuts last year.\xa0 But higher prices are also reviving the good old dilemma of U.S. shale producers—raise production or raise payouts to shareholders, who have grown increasingly frustrated in recent years with the lack of meaningful returns while drillers were sinking cash flows, and even spending beyond cash flow generation into breaking production records.\xa0\xa0\xa0\xa0'

### 全部整合:

In [123]:
url_list = []
date_time = []
headlines = []
news_text = []

In [124]:
# scrap page 1 to page 3 of the website:
for i in range(1, 3):
    url = "http://oilprice.com/Energy/Crude-Oil/Page-" + str(i) + ".html"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html")
    
    for div in soup.find_all("div", {"class":"categoryArticle__content"}):
        for a in div.find_all("a"):
            if a.get('href') not in url_list:
                url_list.append(a.get('href')) # use .get() method

In [140]:
for www in url_list:
    page = requests.get(www)
    soup = BeautifulSoup(page.content, 'html')

    headlines.append(www.split("/")[-1].replace("-", " ")) # update the headline

    # update the date
    dates = soup.find("span", {"class":"article_byline"})
    date_time.append(dates.text.split("-")[-1])

    # update the main text
    temp = []
    for news in soup.find_all("p"):
        temp.append(news.text) # contains a lot of unrelevant information

    for last_sentence in reversed(temp): # 從尾到頭 loop，找出結尾的那句話
        if last_sentence.split(" ")[0] == "By" and last_sentence.split(" ")[-1] == "Oilprice.com":
            break 
        elif last_sentence.split(" ")[0] == "By":
            break 
        elif last_sentence.split(" ")[-1] == "Oilprice.com":
            break
    
    join_text = ' '.join( temp[(temp.index("More Info")+1):temp.index(last_sentence)]  )
    news_text.append(join_text) # add the main text to global variable


In [145]:
news_df = pd.DataFrame({'Date':date_time,
                        'Headline':headlines, 
                        'News':news_text
                        })

In [146]:
news_df

Unnamed: 0,Date,Headline,News
0,"Jan 31, 2021, 6:00 PM CST",Will US Shale Finally Reward Shareholders.html,U.S. oil prices above $50 a barrel are helping...
1,"Jan 29, 2021, 10:00 AM CST",The Enemy Of My Enemy Big Oil Befriends Big Co...,"It’s the oddest of odd couples, yet understand..."
2,"Jan 26, 2021, 7:00 PM CST",Big Oils Exploration Cuts Exacerbate Supply De...,The idea that crude oil could ever again hit t...
3,"Jan 26, 2021, 6:00 PM CST",Underinvestment In Oil And Gas May Cause Major...,"Following an unpredictable and shocking 2020, ..."
4,"Jan 26, 2021, 12:00 PM CST",Game changing Iranian Pipeline Set To Launch I...,The geopolitically game-changing Goreh-Jask pi...
5,"Jan 25, 2021, 3:00 PM CST",An Oil Boom Hidden Within One Of Americas Bigg...,When you think of giant oil fields you probabl...
6,"Jan 25, 2021, 2:00 PM CST",US Shale Industry Accelerates As Oil Prices Ri...,The number of drilled but uncompleted wells (D...
7,"Jan 25, 2021, 10:00 AM CST",Oil Major BP Significantly Downsizes Oil Explo...,As part of its ambition to reduce oil and gas ...
8,"Jan 25, 2021, 9:00 AM CST",Iraq Slashes Oil Output To Compensate For Over...,Iraq will pump less oil this month and next to...
9,"Jan 22, 2021, 11:00 AM CST",Iran Begins Boosting Oil Production.html,Iran has started ramping up its crude oil prod...


In [147]:
# sentiment analysis
analyser = SentimentIntensityAnalyzer()

In [152]:
def Comp_score(text):
    return analyser.polarity_scores(text)["compound"]

用 ```.apply()``` 將某一 function 套用在 ```data frame```:

In [153]:
news_df["Sentiment"] = news_df["News"].apply(Comp_score)

In [154]:
news_df["Sentiment"]

0     0.9960
1     0.6660
2     0.9907
3     0.9078
4    -0.8914
5     0.7397
6     0.9971
7     0.2518
8    -0.9702
9    -0.6691
10    0.9930
11   -0.9493
12    0.8395
13    0.9932
14    0.4001
15   -0.9450
16    0.9978
17    0.9742
18   -0.5086
19   -0.9912
20    0.2670
21   -0.5634
22   -0.5853
23   -0.9969
24   -0.6576
25   -0.8805
26   -0.9961
27    0.9502
28    0.8515
29   -0.9661
30    0.9691
31   -0.9993
32   -0.8268
33    0.9510
34    0.9790
35    0.9223
36   -0.9957
37    0.9694
38   -0.9930
39    0.8392
Name: Sentiment, dtype: float64