### Stanford CoreNLP
#### Use this API to get sentiment scores (via the pycorenlp wrapper)

See https://stackoverflow.com/questions/32879532/stanford-nlp-for-python

Since Stanford CoreNLP is written in Java, we use pycorenlp (a Python wrapper for Stanford CoreNLP) to talk to the Java server from Python. To run the code below, you need to set up the following:

1. Download the latest Stanford CoreNLP release (a zip file) from http://nlp.stanford.edu/software/stanford-corenlp-latest.zip 
    - Note: On macOS, you can download it directly using wget or curl

2. Unzip the downloaded file and put it in a directory, for example one named `stanford-corenlp-4.0.0`.

3. Start the server from the terminal
    - Open the terminal and use `cd` to go to the directory you just created, e.g. `cd Users\xxx\stanford-corenlp-4.0.0`
    - Then type in `java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000` 
    - You should now see something like `[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000`, which means the server has started and is waiting for requests on port 9000
   
4. Install the `pycorenlp` package with `pip install pycorenlp` 
5. You should now be able to run the sentiment analysis code below using Stanford CoreNLP (note that the URL below also uses port 9000, the port the server listens on; if the ports do not match, the code will fail)
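
Before running the cells below, it can help to confirm that something is actually listening on the port from step 3. The following is a minimal sketch using only the Python standard library. The helper name `server_is_listening` is hypothetical and not part of pycorenlp, and the check only confirms that the TCP port accepts connections, not that the process behind it is a CoreNLP server.

```python
import socket


def server_is_listening(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        # create_connection raises an OSError subclass if the port is closed
        # or the host cannot be reached within the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Prints True once the server from step 3 is up, False otherwise.
print(server_is_listening())
```

If this prints `False`, go back to step 3 and check that the Java process is still running and that the port matches.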


In [2]:
import pandas as pd
import json
import nltk
import numpy as np
import string
import spacy
import en_core_web_sm

import pprint

import seaborn
import matplotlib.pyplot as plt

import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models import Word2Vec, CoherenceModel

from Data_Processor import clean, Data_Processor

from pycorenlp import StanfordCoreNLP

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [3]:
key = 'Morgan_Stanley'
DP=Data_Processor(start_month='2017-06',end_month='2020-05',
                  template=["../../Data/"+key])
                  #"../../Data/UBS","../../Data/Goldman_Sachs"

DP.readdata()
print(DP.datanums())
DP.specifylang()
DP.removenoise() 
DP.clean()

# after removing noise (promotional/advertisement data deleted)
print(DP.datanums())

([9513, 9768, 9885, 9534, 9371, 9661, 10183, 9527, 9292, 9290, 9059, 9399, 9927, 9314, 9829, 9700, 9283, 9411, 9914, 9075, 8763, 9295, 8949, 9234, 8468, 8648, 9049, 8523, 9401, 8371, 7937, 8180, 8683, 9569, 8916, 9644], 332565)
([2129, 2492, 2770, 2498, 2423, 2481, 2387, 2563, 2221, 2036, 2126, 2878, 2602, 2066, 2146, 2745, 2096, 1885, 2462, 1846, 2046, 2658, 2298, 2876, 2647, 1632, 2031, 2538, 2128, 2122, 2057, 2389, 2242, 2557, 2490, 2711], 84274)


In [4]:
## do not tokenize for sentiment data
#DP.tokenizetext()

In [4]:
months = list(DP.textdata().keys())
print(months)

['2017-06', '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12', '2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06', '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12', '2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06', '2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12', '2020-01', '2020-02', '2020-03', '2020-04', '2020-05']


In [8]:
LIMIT = 100000 # Stanford CoreNLP can only process 100000 characters at a time, so the text must be divided into several parts
nlp = StanfordCoreNLP('http://localhost:9000') # the port here must match the port of the server started above
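
One way to respect the character limit is to group consecutive texts into batches whose joined length stays within `LIMIT`. The following is a minimal sketch of that idea, assuming the same `" . "` separator used when the texts are joined for annotation. `chunk_by_chars` is a hypothetical helper, not part of pycorenlp, and a single text longer than the limit is kept as its own oversized batch rather than being split.

```python
def chunk_by_chars(docs, limit=100000, sep=" . "):
    """Group consecutive docs so each joined batch is at most `limit` characters."""
    batches, current, current_len = [], [], 0
    for doc in docs:
        # Length this doc would add, including the separator if the batch is non-empty.
        added = len(doc) + (len(sep) if current else 0)
        if current and current_len + added > limit:
            # The batch is full: store it and start a new one with this doc.
            batches.append(current)
            current, current_len = [], 0
            added = len(doc)
        current.append(doc)
        current_len += added
    if current:
        batches.append(current)
    return batches


# Tiny illustration with a limit of 11 characters.
docs = ["aaaa", "bbbb", "cccc"]
batches = chunk_by_chars(docs, limit=11)
print(batches)  # [['aaaa', 'bbbb'], ['cccc']]
```

Batching on the actual character count avoids having to guess how many parts a month needs, at the cost of one pass over the texts.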

In [6]:
def sentiment_analysis(data, month):
    """Return per-sentence sentiment scores and labels for one month of texts."""
    all_sentences = " . ".join(data)
    # Aim for about three parts per LIMIT characters; use at least one part
    # so that a short month does not cause a division by zero.
    n_divid = max(1, len(all_sentences)//LIMIT*3)
    len_divid = max(1, len(data)//n_divid)  # number of texts per part
    print(f"Processing {month} - length {len(data)}, {n_divid} parts")

    # nlp.annotate returns a dictionary whose 'sentences' key holds one entry
    # per sentence found by the annotators specified below
    sentiment = []  # sentiment labels, e.g. "Positive", "Neutral", "Negative"
    score = []  # sentiment scores

    # step through the texts one part at a time; the last part holds the remainder
    for start in range(0, len(data), len_divid):
        sentence = " . ".join(data[start:start+len_divid])

        res = nlp.annotate(sentence,
                           properties={
                           'annotators': 'tokenize,ssplit,pos,parse,sentiment',
                           'outputFormat': 'json',
                           'timeout': 100000,
                           })
        # pycorenlp returns the raw response text when it cannot be parsed as JSON
        if isinstance(res, str):
            raise Exception("Sentence is too long to parse")

        # collect the sentiment result of every sentence
        for s in res['sentences']:
            score.append(s["sentimentValue"])
            sentiment.append(s["sentiment"])

    return score, sentiment
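
The `sentences` list in the server response holds one entry per sentence that CoreNLP detects, so a text containing several sentences yields several scores. If a single score per text is wanted, one option is to annotate each text separately and average its sentence scores. The sketch below illustrates only the averaging step. `mean_sentiment` is a hypothetical helper, and the sample response is hand-written to mimic the `sentimentValue` and `sentiment` fields read above; it is not real server output.

```python
def mean_sentiment(response):
    """Average the per-sentence sentimentValue of one annotate() response."""
    # int() is applied because the value may arrive as a string such as "1".
    values = [int(s["sentimentValue"]) for s in response["sentences"]]
    if not values:
        return None  # no sentences were detected
    return sum(values) / len(values)


# Hand-written stand-in for a parsed JSON response (assumption, for illustration).
sample = {
    "sentences": [
        {"sentimentValue": "1", "sentiment": "Negative"},
        {"sentimentValue": "3", "sentiment": "Positive"},
    ]
}
print(mean_sentiment(sample))  # 2.0
```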

#### Save to JSON file

In [None]:
for i,m in enumerate(months):
    score, sentiment = sentiment_analysis(DP.textdata()[m], m)  # two lists of results for that month
    # assumes result j belongs to record j, i.e. each text is parsed as exactly one sentence
    for j in range(len(DP.data[i])):
        DP.data[i][j]['sentiment_score'] = score[j]
        DP.data[i][j]['sentiment'] = sentiment[j]
    
    with open(key+m+'.json',"w") as file:
        json.dump(DP.data[i], file)

Processing 2017-06 - length 2129, 3 parts
Processing 2017-07 - length 2492, 6 parts
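
Each month's records are written to a file named `key + m + '.json'`. The sketch below reads such a file back and tallies the sentiment labels. The records, the `text` field, and the temporary directory are made up for illustration, and `load_month` is a hypothetical helper; only the `sentiment` and `sentiment_score` keys and the file naming come from the cell above.

```python
import json
import os
import tempfile
from collections import Counter


def load_month(directory, key, month):
    """Load the records saved for one month (hypothetical helper)."""
    path = os.path.join(directory, key + month + ".json")
    with open(path) as file:
        return json.load(file)


# Made-up records standing in for DP.data[i]; real records have more fields.
records = [
    {"text": "great service", "sentiment": "Positive", "sentiment_score": "3"},
    {"text": "long wait", "sentiment": "Negative", "sentiment_score": "1"},
    {"text": "helpful staff", "sentiment": "Positive", "sentiment_score": "3"},
]

with tempfile.TemporaryDirectory() as tmp:
    # Write the file the same way the notebook does, then read it back.
    with open(os.path.join(tmp, "Morgan_Stanley" + "2017-06" + ".json"), "w") as file:
        json.dump(records, file)
    loaded = load_month(tmp, "Morgan_Stanley", "2017-06")

counts = Counter(r["sentiment"] for r in loaded)
print(counts["Positive"], counts["Negative"])  # 2 1
```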
