### StanfordCoreNLP
#### Using this API to get sentiment scores (via the pycorenlp wrapper)

See https://stackoverflow.com/questions/32879532/stanford-nlp-for-python

Since StanfordCoreNLP is written in Java, we use pycorenlp (a Python wrapper for StanfordCoreNLP) to talk to the Java server from this notebook. Before running the code below, complete the following setup:

1. Download the latest StanfordCoreNLP release (a zip file) from http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
    - Note: on macOS you can download it directly with `wget` or `curl`

2. Unzip the downloaded file into a directory, for example one named `stanford-corenlp-4.0.0`.

3. Start the server from a terminal
    - Open a terminal and use `cd` to go to the directory you just created, e.g. `cd Users\xxx\stanford-corenlp-4.0.0`
    - Then run `java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000`
    - You should now see something like `[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000`, which means the server has started and is waiting for requests on port 9000
   
4. Install the `pycorenlp` package with `pip install pycorenlp`
5. You should now be able to run the sentiment analysis code below using StanfordCoreNLP. Note that the code connects to port 9000, the port the server reported in step 3; if the two port numbers do not match, the code will fail.
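
Before running the cells below, it can help to confirm that something is actually listening on the port from step 3. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name (not part of pycorenlp), and it only checks that a TCP connection succeeds, not that the listener is really the CoreNLP server.

```python
import socket


def port_is_open(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open()` with the defaults checks `localhost:9000`, the address used later in this notebook; a `False` result suggests the server from step 3 is not running or is using a different port.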


In [1]:
import pandas as pd
import json
import nltk
import numpy as np
import string
import spacy
import en_core_web_sm

import pprint

import seaborn
import matplotlib.pyplot as plt

from Data_Processor import clean, Data_Processor

from pycorenlp import StanfordCoreNLP

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hwk97\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hwk97\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hwk97\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
DP=Data_Processor(start_month='2020-01',end_month='2020-05',
                  template=["../../Data/FidelityInvestments", 
                            "../../Data/eTrade",
                            "../../Data/CharlesSchwab",
                           "../../Data/TDAmeritrade"])

DP.readdata()
print(DP.datanums())
DP.specifylang()
DP.removenoise() 
DP.clean()

# counts after removing noise (promotional/advertisement data deleted)
print(DP.datanums())

([9394, 11212, 15014, 12599, 12810], 61029)
([4499, 5478, 9431, 7038, 7388], 33834)


In [3]:
DP.tokenizetext()

In [4]:
months = list(DP.textdata().keys())
print(months)

['2020-01', '2020-02', '2020-03', '2020-04', '2020-05']


In [5]:
token_data = {}
string_data = {}
# drop tokens shorter than 3 characters, then rebuild each document as a string
for m in months:
    token_data[m] = [[word for word in sent.split() if len(word)>=3] for sent in DP.textdata()[m]]
    string_data[m] = [" ".join(token) for token in token_data[m]]

In [6]:
#print(token_data['2020-05'][:3])
print(string_data['2020-05'][:3])

['think woman biggest influencer abigail johnson fidelity investment', 'sure fidelity investment super supportive employee time', 'untrue laughable robinhood founded disrupted discount brokerage space providing commissionfree trading forcing incumbent retail broker charles schwab corp ameritrade holding corp fidelity investment follow suit']


In [7]:
all_sentences = "\n".join(string_data[m])  # m is still the last month from the loop above; documents are joined with '\n'
all_sentences[-1000:]

'rket buy roku didnt get filled morning next thing know position blank etradepro totally unresponsive never got fill roku run close back cash hour mattered\nwhaaat dont know story\nnever forget time etrade actually digitally mark account fucked roku fun phone call ever\nthats saying told could trade paper money said wtf think usd mark\ncomputer shouldnt problem maybe switch etrade\ntrying explain tdameritrade would like digitally mark account balance ive shown jay powell interview dont get frustrating\nliked webull account mess take month get response use ameritrade customer service good anywhere else account messed many time webull worth headache call\ntdameritrade wonder used debit card buy stock\nthats blowup think swim chart think swim free account ameritrade\nmany financial institution use like fidelity tdameritrade talk broker option use app like acorn robinhood\nfelt wrath robin hood traded month switched ameritrade loving highly recommend\nameritrade use like havent used anythi

In [12]:
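LIMIT = 100000  # StanfordCoreNLP can only process 100000 characters at a time, so the text is split into several parts
n_divid = len(all_sentences)//LIMIT*2  # number of parts: twice the count of full LIMIT-sized blocks, so each part stays well under LIMIT
len_divid = len(string_data[m])//n_divid  # documents per part; integer division leaves the last len(string_data[m]) % n_divid documents unprocessed
print(len(string_data[m]), n_divid, len_divid)

nlp = StanfordCoreNLP('http://localhost:9000')  # the port must match the one the server started above is listening on
# nlp.annotate returns a dictionary whose 'sentences' key holds one entry per sentence

# below, nlp.annotate processes each sentence of the joined text with the annotators listed in `properties`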

res = {}
sentiment = []

for i in range(n_divid):
    print(f"processing {i*len_divid} to {(i+1)*len_divid}")
    sentence = ". ".join(string_data[m][i*len_divid:(i+1)*len_divid])
    
    res[i] = nlp.annotate(sentence,
                   properties={
                       'annotators': 'tokenize,ssplit,pos,parse,sentiment',
                       'outputFormat': 'json',
                       'timeout': 100000,
                   })
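    # a plain string here means the response was not decoded as JSON
    if isinstance(res[i],str):
        raise Exception("Sentence is too long to parse")
    
    # otherwise collect one [text, score, label] row per sentence
    else:
        for s in res[i]['sentences']:
            # [:-2] trims the trailing " ." left by the ". " separator
            sentiment.append([" ".join([t["word"] for t in s["tokens"]])[:-2], s["sentimentValue"], s["sentiment"]])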

7388 12 615
processing 0 to 615
processing 615 to 1230
processing 1230 to 1845
processing 1845 to 2460
processing 2460 to 3075
processing 3075 to 3690
processing 3690 to 4305
processing 4305 to 4920
processing 4920 to 5535
processing 5535 to 6150
processing 6150 to 6765
processing 6765 to 7380
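
Note that the run above stops at document 7380 of 7388, because splitting by a fixed number of documents per part leaves a remainder. As an alternative, here is a minimal sketch that groups documents greedily by character count so that no document is skipped. `chunk_by_chars` is a hypothetical helper, not what this notebook actually ran, and it assumes no single document is longer than the limit (such a document would end up alone in an oversized chunk).

```python
def chunk_by_chars(docs, limit=100000, sep=". "):
    """Greedily group docs so that each joined chunk is at most `limit` characters."""
    chunks, current, current_len = [], [], 0
    for doc in docs:
        # characters this doc would add, including the separator if the chunk is non-empty
        added = len(doc) + (len(sep) if current else 0)
        if current and current_len + added > limit:
            # the current chunk is full: emit it and start a new one with this doc
            chunks.append(sep.join(current))
            current, current_len = [doc], len(doc)
        else:
            current.append(doc)
            current_len += added
    if current:
        # emit the final, possibly short, chunk so the tail is not dropped
        chunks.append(sep.join(current))
    return chunks
```

Each returned string could then be passed to `nlp.annotate` in place of the `sentence` variable built in the loop above.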


In [13]:
sentiment_df = pd.DataFrame(sentiment, columns=['text', 'score', 'sentiment'])
sentiment_df['score'] = sentiment_df['score'].astype(int)
positive_df = sentiment_df[sentiment_df['score']>=3]
negative_df = sentiment_df[sentiment_df['score']<=1]
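
The thresholds above rely on CoreNLP's five-point sentiment scale, which runs from 0 (very negative) through 2 (neutral) to 4 (very positive) and arrives as a string in `sentimentValue`. The sketch below mirrors the same cut-offs in plain Python; `bucket` and the sample rows are hypothetical and only illustrate the mapping.

```python
from collections import Counter


def bucket(score):
    """Map a 0-4 sentiment score to the three groups used in this notebook."""
    score = int(score)  # the score arrives as a string, hence the cast
    if score <= 1:
        return "negative"  # same cut-off as negative_df
    if score >= 3:
        return "positive"  # same cut-off as positive_df
    return "neutral"       # score == 2 is in neither DataFrame


# hypothetical (text, score) rows, for illustration only
rows = [("great service", "3"), ("terrible app", "1"), ("opened account", "2")]
counts = Counter(bucket(score) for _, score in rows)
```

Neutral sentences (score 2) fall into neither `positive_df` nor `negative_df`, so they are excluded from the topic models built later.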

In [14]:
negative_df['text']

2       untrue laughable robinhood founded disrupted d...
9       chinmay thank interest fidelity investment off...
10      dan nathan liberal political hack writing far ...
13      rep pat toomey rpa accepted fidelity investmen...
19      weird wildly date tweet promoted timeline fide...
                              ...                        
7341    course never actually held stock disagreement ...
7344        ridiculous happened trying close tdameritrade
7345    get phone middle day they re clearly freaking ...
7346    quick october market buy roku did nt get fille...
7348    never forget time etrade actually digitally ma...
Name: text, Length: 2180, dtype: object

In [15]:
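from gensim import corpora
from gensim.models import LdaModel

def LDA(texts, topics=10, num_words=15, dictionary = None):
    # texts: list of lists of words; num_words is accepted but not used inside this function
    if not dictionary:
        dictionary = corpora.Dictionary(texts) # map each unique word to an integer id
    corpus = [dictionary.doc2bow(text) for text in texts] # bag-of-words vector per document
    num_topics = topics # the number of topics to generate
    passes = 30 # number of training passes over the corpus
    lda = LdaModel(corpus,
              id2word=dictionary,
              alpha = 'auto',
              num_topics=num_topics,
              passes=passes)
    
    return lda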

In [56]:
negative_token = [sent.split(" ") for sent in negative_df['text']]
positive_token = [sent.split(" ") for sent in positive_df['text']]

In [58]:
lda_model = {}
k = 10
num_words = 15
print("lda modeling", m)
#lda_model[m] = {}
lda_pos = LDA(positive_token, topics=k, num_words=num_words)  # input must be tokenized: [[word1, word2], [...]]
lda_neg = LDA(negative_token, topics=k, num_words=num_words)  

lda modeling 2020-05


In [59]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(lda_pos.print_topics(num_words=15))

[   (   0,
        '0.018*"high" + 0.018*"good" + 0.013*"well" + 0.013*"time" + '
        '0.012*"thanks" + 0.011*"got" + 0.011*"tdameritrade" + 0.011*"etrade" '
        '+ 0.010*"schwab" + 0.010*"ameritrade" + 0.009*"charles" + 0.009*"day" '
        '+ 0.007*"thing" + 0.007*"always" + 0.006*"know"'),
    (   1,
        '0.022*"tdameritrade" + 0.016*"update" + 0.014*"price" + '
        '0.010*"signal" + 0.008*"premium" + 0.008*"thinkorswim" + '
        '0.007*"right" + 0.007*"thanks" + 0.007*"street" + 0.007*"definitely" '
        '+ 0.007*"half" + 0.006*"analyst" + 0.006*"wall" + 0.006*"excellent" + '
        '0.006*"think"'),
    (   2,
        '0.011*"tdameritrade" + 0.010*"share" + 0.008*"job" + 0.008*"join" + '
        '0.008*"case" + 0.007*"military" + 0.007*"real" + 0.007*"family" + '
        '0.007*"part" + 0.006*"estate" + 0.006*"network" + 0.006*"year" + '
        '0.006*"may" + 0.006*"great" + 0.006*"week"'),
    (   3,
        '0.022*"account" + 0.021*"ameritrade" + 0.016*"

In [60]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(lda_neg.print_topics(num_words=15))

[   (   0,
        '0.016*"stock" + 0.015*"day" + 0.012*"trade" + 0.011*"schwab" + '
        '0.008*"platform" + 0.007*"last" + 0.007*"ameritrade" + 0.006*"like" + '
        '0.006*"think" + 0.006*"etrade" + 0.006*"year" + 0.006*"money" + '
        '0.006*"charles" + 0.006*"every" + 0.005*"do"'),
    (   1,
        '0.025*"ameritrade" + 0.018*"stock" + 0.016*"account" + 0.012*"new" + '
        '0.011*"tdameritrade" + 0.009*"first" + 0.008*"market" + '
        '0.008*"online" + 0.008*"schwab" + 0.008*"robinhood" + 0.007*"saw" + '
        '0.007*"etrade" + 0.007*"charles" + 0.007*"like" + 0.007*"use"'),
    (   2,
        '0.023*"stock" + 0.021*"schwab" + 0.016*"charles" + '
        '0.016*"tdameritrade" + 0.011*"ameritrade" + 0.009*"would" + '
        '0.008*"account" + 0.007*"bought" + 0.007*"market" + 0.007*"investor" '
        '+ 0.007*"say" + 0.007*"investment" + 0.006*"charlesschwab" + '
        '0.006*"oil" + 0.005*"open"'),
    (   3,
        '0.025*"etrade" + 0.024*"ameritrade" 
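
The topic strings printed above are awkward to compare by eye. As a rough sketch, the `weight*"word"` terms can be pulled out with a regular expression; `parse_topic` is a hypothetical helper, and the sample string is the first five terms of positive topic 0 copied from the output above.

```python
import re

# matches terms of the form 0.018*"high"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


# first five terms of positive topic 0, copied from the output above
sample = '0.018*"high" + 0.018*"good" + 0.013*"well" + 0.013*"time" + 0.012*"thanks"'
pairs = parse_topic(sample)
```

With the pairs in hand, it becomes straightforward to, for example, drop the brokerage names that dominate most topics before comparing the positive and negative models.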