# Sentiment Analysis on 10-K Filings using CoreNLP

## 1. Setting up

### Install Stanford CoreNLP & Python API

Follow [this article](https://towardsdatascience.com/natural-language-processing-using-stanfords-corenlp-d9e64c1e1024) to install Stanford CoreNLP and its Python API, then start the CoreNLP server from the install directory:

```shell
cd stanford-corenlp-full-2018-10-05

java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 5000
```
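
Before sending any text, it can help to confirm that the server started above is answering. The sketch below is not part of the original notebook: `server_is_up` is a hypothetical helper, and the URL assumes the server listens on `http://localhost:9000`, the address the notebook connects to later.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:9000", timeout=5):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    # Skip any system proxy settings, since the server runs locally
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied, just not with a success status
        return True
    except OSError:
        # Connection refused, timeout, name resolution failure, ...
        return False


if __name__ == "__main__":
    print("CoreNLP server reachable:", server_is_up())
```

If this prints `False`, check that the `java` command is still running and that nothing else is using the port.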

### Requirements
- pycorenlp
- nltk
- pandas
- textract

Environment: Python 3.8.2

## 2. Data Pre-processing

The thesis "" presents 5 steps for preprocessing the 10-K reports. This notebook covers the following three:

1. Remove punctuation
2. Remove stopwords
3. Stemming
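
As a rough illustration of how these three steps chain together, here is a minimal standard-library sketch. It is not the notebook's implementation: `TOY_STOPWORDS` is a tiny hand-picked list standing in for NLTK's English stopwords, and `crude_stem` is a naive suffix stripper standing in for the Porter stemmer (the two will disagree on many words).

```python
import string

# Tiny illustrative stopword list; the notebook uses NLTK's full English list
TOY_STOPWORDS = {"the", "of", "to", "and", "in", "a", "is", "are", "for", "or"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for Porter stemming, not equivalent to it."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, remove punctuation, drop stopwords, then stem."""
    text = text.lower()
    # 1. Remove punctuation by replacing each punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).split()
    # 2. Remove stopwords
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    # 3. Stemming
    return [crude_stem(t) for t in tokens]


print(preprocess("The securities are registered, pursuant to Section 12(b) of the Act."))
```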

### Load dataset

In [1]:
import os
import textract

data_folder_path = '/Users/aringuyen/Desktop/CAP/10k-Reports/'
file_path = 'Starbuck-10k.pdf'
full_file_path = os.path.join(data_folder_path, file_path)

# extract text from pdf
text = textract.process(full_file_path, method='pdfminer').decode('utf-8')

# lower case
text = text.lower()

In [2]:
text = text[:100000]
text

'table of contents\n\nunited states securities and exchange commission\n\n☒ annual report pursuant to section 13 or 15(d) of the securities exchange act of 1934\n\n☐ transition report pursuant to section 13 or 15(d) of the securities exchange act of 1934\n\nwashington, dc 20549\n\nform 10-k\n\nfor the fiscal year ended september 27, 2020\n\nor\n\nfor the transition period from            to            .\n\ncommission file number: 0-20322\n\nstarbucks corporation\n\n(exact name of registrant as specified in its charter)\n\nwashington\n\n(state of incorporation)\n\n91-1325671\n\n(irs employer id)\n\n2401 utah avenue south, seattle, washington 98134\n\n(206) 447-1575\n\n(address of principal executive office, zip code, telephone number)\n\nsecurities registered pursuant to section 12(b) of the act:\n\ntitle of each class\n\ncommon stock, $0.001 par value per share\n\ntrading symbol\n\nsbux\n\nname of each exchange on which registered\n\nnasdaq global select market\n\nsecurities registered
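
The slice in cell `In [2]` keeps only the first 100,000 characters, so the rest of the report is never analysed. One alternative, sketched here as an assumption and not something the notebook does, is to split the full text on blank lines into chunks that each stay under a size limit and annotate the chunks one by one. `chunk_text` and its `max_chars` parameter are hypothetical names.

```python
def chunk_text(text, max_chars=100_000):
    """Split text on blank lines into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Hard-split a paragraph that is itself longer than the limit
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The current chunk is full: store it and start a new one
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "aaa\n\nbbb\n\nccc"
print(chunk_text(demo, max_chars=8))
```

Each chunk could then be passed to the annotator separately and the per-sentence results concatenated.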

### Data Preprocessing

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer

In [4]:
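# Remove punctuation: tokenize with a regex
# (note that this pattern still emits punctuation-only tokens such as ',' and ')')
# tokenizer = RegexpTokenizer(r"\s+")
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
text_tokens = tokenizer.tokenize(text)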
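
To see what this pattern does with punctuation, the sketch below applies the same regular expression with `re.findall`, which is assumed here to match how `RegexpTokenizer` behaves in its default (non-gap) mode. `drop_punctuation` is a hypothetical helper, not part of the notebook, showing one way to discard punctuation-only tokens afterwards.

```python
import re

# Same pattern as the notebook's RegexpTokenizer
PATTERN = r'\w+|\$[\d\.]+|\S+'


def tokenize(text):
    """Return all non-overlapping matches of PATTERN, left to right."""
    return re.findall(PATTERN, text)


def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]


sample = "washington, dc 20549 form 10-k $0.001 par value (d)"
tokens = tokenize(sample)
print(tokens)
print(drop_punctuation(tokens))
```

The comma survives tokenization as its own token because the final `\S+` alternative matches any run of non-whitespace characters; the filter removes it while keeping mixed tokens such as `'(d)'` and `'$0.001'`.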

In [5]:
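# remove stop words (requires a one-time nltk.download('stopwords'))
# use a separate name so the imported `stopwords` module is not shadowed
stop_words = set(stopwords.words('english'))
text_tokens_no_stopwords = [word for word in text_tokens if word not in stop_words]
text_tokens_no_stopwords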

['table',
 'contents',
 'united',
 'states',
 'securities',
 'exchange',
 'commission',
 '☒',
 'annual',
 'report',
 'pursuant',
 'section',
 '13',
 '15',
 '(d)',
 'securities',
 'exchange',
 'act',
 '1934',
 '☐',
 'transition',
 'report',
 'pursuant',
 'section',
 '13',
 '15',
 '(d)',
 'securities',
 'exchange',
 'act',
 '1934',
 'washington',
 ',',
 'dc',
 '20549',
 'form',
 '10',
 '-k',
 'fiscal',
 'year',
 'ended',
 'september',
 '27',
 ',',
 '2020',
 'transition',
 'period',
 '.',
 'commission',
 'file',
 'number',
 ':',
 '0',
 '-20322',
 'starbucks',
 'corporation',
 '(exact',
 'name',
 'registrant',
 'specified',
 'charter',
 ')',
 'washington',
 '(state',
 'incorporation',
 ')',
 '91',
 '-1325671',
 '(irs',
 'employer',
 'id',
 ')',
 '2401',
 'utah',
 'avenue',
 'south',
 ',',
 'seattle',
 ',',
 'washington',
 '98134',
 '(206)',
 '447',
 '-1575',
 '(address',
 'principal',
 'executive',
 'office',
 ',',
 'zip',
 'code',
 ',',
 'telephone',
 'number',
 ')',
 'securities',
 'regi

In [6]:
# stemming
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in text_tokens_no_stopwords]

In [7]:
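# Note: clean_text is built from the stopword-filtered tokens;
# the stemmed tokens computed above are not used here
clean_text = ' '.join(text_tokens_no_stopwords)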

## 3. Sentiment Analysis

In [8]:
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')

In [9]:
result = nlp.annotate(text,
                   properties={
                       'annotators': 'sentiment, ner, pos',
                       'outputFormat': 'json',
                       'timeout': 600000,
                   })
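
The cells that follow read `sentences`, `tokens`, `word`, `sentiment` and `sentimentValue` from the result. The sketch below shows that shape on a hand-written stand-in (not real CoreNLP output) and adds a guard for the case where the reply is not a parsed dictionary, for example an error message returned as plain text. `sentence_rows` is a hypothetical helper, and the assumption that `sentimentValue` arrives as a numeric string should be checked against your server version.

```python
def sentence_rows(result):
    """Turn a CoreNLP-style JSON result into a list of row dictionaries."""
    if not isinstance(result, dict) or "sentences" not in result:
        # Assumption: a non-dict reply means the server response was not valid JSON
        raise RuntimeError(f"Unexpected CoreNLP reply: {str(result)[:200]!r}")
    rows = []
    for s in result["sentences"]:
        rows.append({
            "sentence": " ".join(t["word"] for t in s["tokens"]),
            "sentiment": s["sentiment"],
            # int() accepts both "2" and 2
            "sentiment value": int(s["sentimentValue"]),
        })
    return rows


# Hand-written stand-in for a server reply, for illustration only
sample_result = {
    "sentences": [
        {"sentiment": "Neutral", "sentimentValue": "2",
         "tokens": [{"word": "form"}, {"word": "10-k"}]},
        {"sentiment": "Negative", "sentimentValue": "1",
         "tokens": [{"word": "sales"}, {"word": "declined"}, {"word": "."}]},
    ]
}

for row in sentence_rows(sample_result):
    print(row)
```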

In [10]:
import pandas as pd

# Collect one row per sentence, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = [
    {
        'sentence': " ".join(t["word"] for t in s["tokens"]),
        'sentiment': s["sentiment"],
        'sentiment value': s["sentimentValue"],
    }
    for s in result["sentences"]
]
df = pd.DataFrame(rows, columns=['sentence', 'sentiment', 'sentiment value'])

### With Preprocessed data

In [11]:
result2 = nlp.annotate(clean_text,
                   properties={
                       'annotators': 'sentiment, ner, pos',
                       'outputFormat': 'json',
                       'timeout': 600000,
                   })

In [12]:
# Collect one row per sentence, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows_clean = [
    {
        'sentence': " ".join(t["word"] for t in s["tokens"]),
        'sentiment': s["sentiment"],
        'sentiment value': s["sentimentValue"],
    }
    for s in result2["sentences"]
]
df_clean = pd.DataFrame(rows_clean, columns=['sentence', 'sentiment', 'sentiment value'])

### Save to csv file

In [13]:
df.to_csv('starbucks-10k-sentiment-analysis_no_preprocess.csv') 
df_clean.to_csv('starbucks-10k-sentiment-analysis.csv')

## 4. Analysis

### No preprocessing step

In [14]:
df.groupby('sentiment').count()

| sentiment | sentence | sentiment value |
|---|---|---|
| Negative | 294 | 294 |
| Neutral | 29 | 29 |
| Positive | 64 | 64 |
| Verynegative | 28 | 28 |
| Verypositive | 3 | 3 |


### With preprocessing step

In [15]:
df_clean.groupby('sentiment').count()

| sentiment | sentence | sentiment value |
|---|---|---|
| Negative | 313 | 313 |
| Neutral | 79 | 79 |
| Positive | 70 | 70 |
| Verynegative | 44 | 44 |
| Verypositive | 3 | 3 |
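
Because the two runs produce different numbers of sentences, the raw counts above are easier to compare as shares. The sketch below recomputes the totals and proportions from the counts reported in the two tables; the dictionary and function names are illustrative.

```python
# Sentence counts per label, copied from the two tables above
no_preprocessing = {"Negative": 294, "Neutral": 29, "Positive": 64,
                    "Verynegative": 28, "Verypositive": 3}
with_preprocessing = {"Negative": 313, "Neutral": 79, "Positive": 70,
                      "Verynegative": 44, "Verypositive": 3}


def shares(counts):
    """Return each label's fraction of the total sentence count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for name, counts in [("no preprocessing", no_preprocessing),
                     ("with preprocessing", with_preprocessing)]:
    s = shares(counts)
    negative = s["Negative"] + s["Verynegative"]
    print(f"{name}: {sum(counts.values())} sentences, "
          f"{negative:.1%} negative, {s['Neutral']:.1%} neutral")
```

In both runs the negative labels dominate. The preprocessed run has more sentences in total (509 versus 418), presumably because sentence splitting behaves differently on the stopword-filtered text.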


- Limitation: cannot fine-tune the sentiment model to our dataset