###  ---------------------------------------------------------------------------------------------------------------------------------------
##### Copyright (c) Rajdeep Biswas
##### Licensed under the MIT license.
###### File: stock_news_sentiment.ipynb
###### Date: 10/24/2021
###  ---------------------------------------------------------------------------------------------------------------------------------------

### Table of Contents

* [Initial Configurations](#IC)
    * [Import Libraries](#IL)
    * [Authenticate the AML Workspace](#AML)
* [Get Data](#GD)
    * [Setup Directory Structure](#SD)
    * [Read Gold File](#RG) 
* [Apply NLP](#AN)   
    * [VADER](#AV) 
    * [TextBlob](#ATB)
    * [Dictionary-Based Sentiment Analysis](#ADB)     
* [Evaluation](#EV)      

### Initial Configurations <a class="anchor" id="IC"></a>
#### Import Libraries <a class="anchor" id="IL"></a>

In [1]:
#Import required Libraries
import os

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib.image import imread
import cv2
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

#pip install pandas_datareader
import pandas_datareader.data as web
import datetime as dt

import azureml.core
import azureml.automl
from azureml.core import Workspace, Dataset, Datastore



#viz
#pip install seaborn
import matplotlib.pyplot as plt
import seaborn as sns 
import matplotlib.gridspec as gridspec

#settings
color = sns.color_palette()
sns.set_style("dark")

#### Authenticate the AML Workspace <a class="anchor" id="AML"></a>

In [3]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.33.0 to work with houston-techsummit-workspace


### Get Data <a class="anchor" id="GD"></a>
#### Setup Directory Structure <a class="anchor" id="SD"></a>

In [4]:
data_folder = os.path.join(os.getcwd(), 'data')

#Create the data directory
os.makedirs(data_folder, exist_ok=True)

gold_data_folder = os.path.join(data_folder, 'gold')
os.makedirs(gold_data_folder, exist_ok=True)

#Create sub folder for stock news data in gold
news_data_gold = os.path.join(gold_data_folder, 'snp500_news')
os.makedirs(news_data_gold, exist_ok=True)

#### Read Gold File <a class="anchor" id="RG"></a>

In [5]:
output_file_name = news_data_gold + '/snp500_all_news.csv'
df_snp500_news_gold = pd.read_csv(output_file_name, index_col=None, header=0) 

In [20]:
# News items for Apple; copy the slice so that columns can be added to it later
rev_apple = df_snp500_news_gold[df_snp500_news_gold['Ticker']=='AAPL'].copy()

In [21]:
type(rev_apple)

pandas.core.frame.DataFrame

### Apply NLP <a class="anchor" id="AN"></a>
#### VADER <a class="anchor" id="AV"></a>

VADER is an automated sentiment analyzer. The name stands for
'Valence Aware Dictionary and sEntiment Reasoner' (the lowercase 's'
and capital 'E' in 'sEntiment' are intentional).
VADER is a lexicon- and rule-based sentiment analysis tool. A lexicon is
a list of lexical features (words) that are labeled as positive or
negative based on their semantic meaning. Because VADER relies on this
lexicon rather than on labeled training data, it can also be used to
label text that has no sentiment labels yet.
https://pypi.org/project/vaderSentiment/

In [8]:
import nltk
nltk.download('vader_lexicon') #Only needed the first time; comment out afterwards

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/azureuser/nltk_data...


True

In [9]:
import nltk
#import the VADER Sentiment Analysis from nltk. 
#Then create an instance for the imported library.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vds = SentimentIntensityAnalyzer()

The `polarity_scores(text)` method analyzes a piece of text, for example
'I went to the movie, yesterday. It was amazing! Everyone acted well.'
Its `compound` value is then mapped to a label.
Below is the standard scoring rule followed by most analyzers.
1. Positive sentiment: compound score >= 0.05
2. Neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
3. Negative sentiment: compound score <= -0.05
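
The threshold rule above can be sketched on raw compound scores, without loading the lexicon. The function name `label_from_compound` and its `threshold` parameter are hypothetical, introduced here only for illustration, and the sample scores are made up solely to exercise each branch and both boundaries.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    Hypothetical helper: boundaries are inclusive for Positive and Negative,
    matching the rule listed above.
    """
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'


# Illustrative scores only (not VADER output): one per branch plus both boundaries
for score in (0.6, 0.05, 0.0, -0.05, -0.6):
    print(score, label_from_compound(score))
```

Keeping the thresholds in one parameter makes it easy to try a wider or narrower neutral band later, if the default turns out to be a poor fit for news headlines.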

In [10]:
def sentiment_analyzer_vader(lines):
    """Label a text as Positive, Negative, or Neutral using VADER's compound score."""
    #polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys
    vader_hash = vds.polarity_scores(lines)
    sentiment_score = vader_hash['compound']

    #Apply the scoring rule (boundaries inclusive, as listed above)
    if sentiment_score >= 0.05:
        sentiment = 'Positive'
    elif sentiment_score <= -0.05:
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'

    return sentiment

In [22]:
rev_apple['sentiment_vader']=rev_apple['CleanedText'].apply(sentiment_analyzer_vader)
rev_apple

Unnamed: 0,Ticker,NewsNum,CleanedText,sentiment_vader
120,AAPL,0,santa clara calif september pfizer executive d...,Neutral
121,AAPL,1,santa clara calif september agilent announces ...,Neutral
122,AAPL,2,hedge fund manager william ackman bet universa...,Positive
123,AAPL,3,smart beta etf report ftc,Positive
124,AAPL,4,stock cut loss amid mixed economic data,Negative
...,...,...,...,...
195,AAPL,15,global shopper face possible shortage smartpho...,Negative
196,AAPL,16,bloomberg china embattled tech tycoon linedup ...,Positive
197,AAPL,17,bloomberg iphone assembly operation china begi...,Positive
198,AAPL,18,deal actually closed long time ago back near d...,Negative


In [24]:
#Create sub folder for news with sentiment labels, then read the labeled Apple file
news_data_gold_sentiment = news_data_gold +"/news_with_sentiment"
os.makedirs(news_data_gold_sentiment, exist_ok=True)
output_file_name = news_data_gold_sentiment + '/apple_sentiment.csv'
apple_sent_gold = pd.read_csv(output_file_name, index_col=None, header=0)
apple_sent_gold

Unnamed: 0,Ticker,NewsNum,CleanedText,Sentiment
0,AAPL,0,santa clara calif september pfizer executive d...,neutral
1,AAPL,1,santa clara calif september agilent announces ...,neutral
2,AAPL,2,hedge fund manager william ackman bet universa...,neutral
3,AAPL,3,smart beta etf report ftc,neutral
4,AAPL,4,stock cut loss amid mixed economic data,neutral
...,...,...,...,...
75,AAPL,15,global shopper face possible shortage smartpho...,neutral
76,AAPL,16,bloomberg china embattled tech tycoon linedup ...,neutral
77,AAPL,17,bloomberg iphone assembly operation china begi...,neutral
78,AAPL,18,deal actually closed long time ago back near d...,negative


In [25]:
apple_sent_gold['sentiment_vader']=apple_sent_gold['CleanedText'].apply(sentiment_analyzer_vader)
apple_sent_gold

Unnamed: 0,Ticker,NewsNum,CleanedText,Sentiment,sentiment_vader
0,AAPL,0,santa clara calif september pfizer executive d...,neutral,Neutral
1,AAPL,1,santa clara calif september agilent announces ...,neutral,Neutral
2,AAPL,2,hedge fund manager william ackman bet universa...,neutral,Positive
3,AAPL,3,smart beta etf report ftc,neutral,Positive
4,AAPL,4,stock cut loss amid mixed economic data,neutral,Negative
...,...,...,...,...,...
75,AAPL,15,global shopper face possible shortage smartpho...,neutral,Negative
76,AAPL,16,bloomberg china embattled tech tycoon linedup ...,neutral,Positive
77,AAPL,17,bloomberg iphone assembly operation china begi...,neutral,Positive
78,AAPL,18,deal actually closed long time ago back near d...,negative,Negative


#### TextBlob <a class="anchor" id="ATB"></a>
##### Unsupervised Sentiment Analysis using TextBlob

In [29]:
#pip install textblob

In [34]:
from textblob import TextBlob
# Get the polarity score using below function
def get_textBlob_score(sent):
    # This polarity score is between -1 to 1
    polarity = TextBlob(sent).sentiment.polarity
    return polarity

In [31]:
def sentiment_analyzer_textBlob(lines):
    """Label a text as Positive, Negative, or Neutral using TextBlob's polarity."""
    #TextBlob polarity is a float between -1 and 1
    sentiment_score = TextBlob(lines).sentiment.polarity

    #Applying rule
    if sentiment_score > 0.05:
        sentiment = 'Positive'
    elif sentiment_score < -0.05:
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'

    return sentiment

In [32]:
apple_sent_gold['sentiment_textBlob']=apple_sent_gold['CleanedText'].apply(sentiment_analyzer_textBlob)
apple_sent_gold

Unnamed: 0,Ticker,NewsNum,CleanedText,Sentiment,sentiment_vader,sentiment_textBlob
0,AAPL,0,santa clara calif september pfizer executive d...,neutral,Neutral,Neutral
1,AAPL,1,santa clara calif september agilent announces ...,neutral,Neutral,Neutral
2,AAPL,2,hedge fund manager william ackman bet universa...,neutral,Positive,Neutral
3,AAPL,3,smart beta etf report ftc,neutral,Positive,Positive
4,AAPL,4,stock cut loss amid mixed economic data,neutral,Negative,Positive
...,...,...,...,...,...,...
75,AAPL,15,global shopper face possible shortage smartpho...,neutral,Negative,Neutral
76,AAPL,16,bloomberg china embattled tech tycoon linedup ...,neutral,Positive,Neutral
77,AAPL,17,bloomberg iphone assembly operation china begi...,neutral,Positive,Neutral
78,AAPL,18,deal actually closed long time ago back near d...,negative,Negative,Neutral


#### Punctuation changes the polarity score, which raises the question of whether to clean the text first or pass it as is.

In [36]:
print("! ",get_textBlob_score("The phone is super cool!"))
print("!! ",get_textBlob_score("The phone is super cool!!"))
print("!!! ",get_textBlob_score("The phone is super cool!!!"))

!  0.38541666666666663
!!  0.44010416666666663
!!!  0.5084635416666666


#### Dictionary-Based Sentiment Analysis <a class="anchor" id="ADB"></a>

In [37]:
import nltk
nltk.download('opinion_lexicon') #Only needed the first time; comment out afterwards
from nltk.corpus import opinion_lexicon

[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Unzipping corpora/opinion_lexicon.zip.


In [41]:
nltk.download('stopwords')  #Only needed the first time; comment out afterwards
nltk.download('punkt')   #Only needed the first time; comment out afterwards
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/azureuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [38]:
#Build a set of positive words and a set of negative words from
#the lexicon downloaded above (sets give fast membership checks):
pos_list=set(opinion_lexicon.positive())
neg_list=set(opinion_lexicon.negative())

In [45]:
# Load stop words into a set for fast membership checks
stop_words = set(stopwords.words('english'))

In [46]:
#Define a dictionary-based sentiment analyzer function
def sentiment_analyzer(lines):
    """Label a text by counting positive and negative lexicon words."""
    sentiment_score = 0
    #Tokenize words
    tokenized_words = word_tokenize(lines)
    #Remove stopwords
    stopwords_removed_words = [word for word in tokenized_words if word not in stop_words]
    #Add 1 for each positive word and subtract 1 for each negative word
    for word in stopwords_removed_words:
        if word in pos_list:
            sentiment_score += 1
        elif word in neg_list:
            sentiment_score -= 1

    #Rule
    #1 sentiment_score > 0 -> Positive
    #2 sentiment_score < 0 -> Negative
    #3 sentiment_score = 0 -> Neutral
    if sentiment_score > 0:
        sentiment = 'Positive'
    elif sentiment_score < 0:
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'

    return sentiment

In [47]:
apple_sent_gold['sentiment_dictionary']=apple_sent_gold['CleanedText'].apply(sentiment_analyzer)
apple_sent_gold

Unnamed: 0,Ticker,NewsNum,CleanedText,Sentiment,sentiment_vader,sentiment_textBlob,sentiment_dictionary
0,AAPL,0,santa clara calif september pfizer executive d...,neutral,Neutral,Neutral,Neutral
1,AAPL,1,santa clara calif september agilent announces ...,neutral,Neutral,Neutral,Neutral
2,AAPL,2,hedge fund manager william ackman bet universa...,neutral,Positive,Neutral,Positive
3,AAPL,3,smart beta etf report ftc,neutral,Positive,Positive,Positive
4,AAPL,4,stock cut loss amid mixed economic data,neutral,Negative,Positive,Negative
...,...,...,...,...,...,...,...
75,AAPL,15,global shopper face possible shortage smartpho...,neutral,Negative,Neutral,Negative
76,AAPL,16,bloomberg china embattled tech tycoon linedup ...,neutral,Positive,Neutral,Negative
77,AAPL,17,bloomberg iphone assembly operation china begi...,neutral,Positive,Neutral,Negative
78,AAPL,18,deal actually closed long time ago back near d...,negative,Negative,Neutral,Positive


### Evaluation <a class="anchor" id="EV"></a>
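So how do we evaluate which technique works best?  
For now the comparison is manual: the labels from all three analyzers are saved next to the existing `Sentiment` column and inspected by hand.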

In [48]:
#Create sub folder for news with sentiment labels (no-op if it already exists)
news_data_gold_sentiment = news_data_gold +"/news_with_sentiment"
os.makedirs(news_data_gold_sentiment, exist_ok=True)

In [49]:
output_file_name = news_data_gold_sentiment + '/apple_sentiment_all_analyze.csv'
apple_sent_gold.to_csv(output_file_name, index=False)  

In [50]:
test_df = pd.read_csv(output_file_name, index_col=None, header=0)
test_df

Unnamed: 0,Ticker,NewsNum,CleanedText,Sentiment,sentiment_vader,sentiment_textBlob,sentiment_dictionary
0,AAPL,0,santa clara calif september pfizer executive d...,neutral,Neutral,Neutral,Neutral
1,AAPL,1,santa clara calif september agilent announces ...,neutral,Neutral,Neutral,Neutral
2,AAPL,2,hedge fund manager william ackman bet universa...,neutral,Positive,Neutral,Positive
3,AAPL,3,smart beta etf report ftc,neutral,Positive,Positive,Positive
4,AAPL,4,stock cut loss amid mixed economic data,neutral,Negative,Positive,Negative
...,...,...,...,...,...,...,...
75,AAPL,15,global shopper face possible shortage smartpho...,neutral,Negative,Neutral,Negative
76,AAPL,16,bloomberg china embattled tech tycoon linedup ...,neutral,Positive,Neutral,Negative
77,AAPL,17,bloomberg iphone assembly operation china begi...,neutral,Positive,Neutral,Negative
78,AAPL,18,deal actually closed long time ago back near d...,negative,Negative,Neutral,Positive
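
As one possible step beyond manual inspection, the sketch below counts how often each analyzer agrees with the existing `Sentiment` column. It assumes that column can serve as a reference label, which the notebook does not establish. The helper names `agreement_rate` and `confusion_counts` are hypothetical, and the sample lists contain only the nine rows visible in the truncated table above, so the resulting rates are illustrative and say nothing about the full file.

```python
from collections import Counter


def agreement_rate(gold, predicted):
    """Share of rows where the predicted label equals the gold label (case-insensitive)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    matches = sum(g.lower() == p.lower() for g, p in zip(gold, predicted))
    return matches / len(gold)


def confusion_counts(gold, predicted):
    """Count (gold, predicted) label pairs to show where an analyzer disagrees."""
    return Counter((g.lower(), p.lower()) for g, p in zip(gold, predicted))


# Only the nine rows visible in the truncated output above (not the full file)
gold = ['neutral'] * 8 + ['negative']
predictions = {
    'sentiment_vader': ['Neutral', 'Neutral', 'Positive', 'Positive', 'Negative',
                        'Negative', 'Positive', 'Positive', 'Negative'],
    'sentiment_textBlob': ['Neutral', 'Neutral', 'Neutral', 'Positive', 'Positive',
                           'Neutral', 'Neutral', 'Neutral', 'Neutral'],
    'sentiment_dictionary': ['Neutral', 'Neutral', 'Positive', 'Positive', 'Negative',
                             'Negative', 'Negative', 'Negative', 'Positive'],
}

for name, labels in predictions.items():
    print(name, round(agreement_rate(gold, labels), 3))
    print(dict(confusion_counts(gold, labels)))
```

With the full DataFrame, the same helpers could be fed `test_df['Sentiment'].tolist()` and, for example, `test_df['sentiment_vader'].tolist()`.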
