#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/3_1.png)

# 3.1 Financial Sentiment Analysis
* [3.1.1. Sentiment correlation](#3.1.1)
* [3.1.2. Sentiment through the corpus](#3.1.2)

---

Sentiment analysis is a subfield within: 
- Textual analysis
- Natural language processing
- Content analysis
- Computational linguistics

Increased interest attributable to:
- Bigger, faster computers → faster processing of data (HFTs)
- Availability of large quantities of text → better interpretation of information
- New technologies derived from search engines → improve quality of information

Some early financial sentiment analysis was done by [Paul C. Tetlock](https://www.jstor.org/stable/4622297?seq=1#metadata_info_tab_contents)
![](images/paultetlock.png)

Applications:
- Stock prediction: text mining assigns scores (financial sentiments) to text, and those scores are used to trade
- Automated news analysis: content and tone, measurement of qualitative and quantitative attributes
- Dictionaries (lexicons) are domain specific
- Events (predicates), such as events involving an organization, can also be extracted with an unsupervised approach. *See: [The Stock Sonar](https://pdfs.semanticscholar.org/a9b7/bf3e34c1a0235d07d423eb9d0c3b46c630e5.pdf)*
- More applications and better interfaces keep appearing. *See the happiness of the world in the [Hedonometer](http://hedonometer.org/index.html)*

---
### 3.1.1. Sentiment correlation
<a id="3.1.1"></a>

1. **TextBlob package:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. A word's sentiment can vary depending on where it appears in a sentence. The TextBlob package allows us to take advantage of these labels.
2. **Sentiment Labels:** Each word in a corpus is labeled in terms of polarity and subjectivity, and we can define a corpus' sentiment as the average of these labels.
   * **Polarity**: How positive or negative a word is. -1 is very negative. +1 is very positive.
   * **Subjectivity**: How subjective, or opinionated, a word is. 0 is fact. +1 is very much an opinion.

For more information, see this explanation of how `TextBlob` implements its [sentiment function](https://planspace.org/20150607-textblob_sentiment/).
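
The averaging idea described above can be illustrated without any library. The following is a minimal sketch, not TextBlob's actual algorithm (which, according to the linked write-up, also accounts for things like negation and intensifiers). The lexicon `TOY_LEXICON`, its scores, and the helper `toy_sentiment` are invented for illustration only.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
# The scores are made up for this example and are NOT TextBlob's values.
TOY_LEXICON = {
    "strong": (0.5, 0.75),
    "growth": (0.25, 0.5),
    "loss": (-0.5, 0.5),
    "weak": (-0.25, 0.75),
}


def toy_sentiment(text):
    """Average the (polarity, subjectivity) labels of the known words in text."""
    # Lowercase each token and strip trailing/leading punctuation.
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        # No labeled word found: treat the text as neutral and factual.
        return 0.0, 0.0
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity


print(toy_sentiment("Strong growth this year."))  # (0.375, 0.625)
print(toy_sentiment("The report was filed."))     # (0.0, 0.0)
```

Words missing from the lexicon are simply ignored, which is one reason a domain-specific dictionary matters for financial text.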

In [None]:
import pandas as pd

data = pd.read_pickle('pickle/AnnualReports_corpus.pkl')
data.head()

In [None]:
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

data['polarity'] = data['report'].apply(pol)
data['subjectivity'] = data['report'].apply(sub)

In [None]:
data.head()

In [None]:
# Let's plot the results
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = [16, 12]

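for index, company in enumerate(data.index):
    x = data.polarity.loc[company]
    # Negate subjectivity so that facts (0) plot at the top, matching the y-label
    y = - data.subjectivity.loc[company]
    plt.scatter(x, y, color='#008080', s=120, alpha=.8)
    # Look up the name by label so this also works with a non-default index
    plt.text(x+.001, y+.001, data['company_name'].loc[company], fontsize=8)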
    #plt.xlim(-.01, .12) 
    
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=20)
plt.ylabel('<-- Opinions -------- Facts -->', fontsize=20)

plt.show()

---
### 3.1.2. Sentiment through the corpus
<a id="3.1.2"></a>

In [None]:
# Split each document into 20 parts
import numpy as np
import math

def split_text(text, n=20):
    '''Takes in a string of text and splits it into n equal parts, with a default of 20 equal parts.
    Any leftover characters at the end (fewer than n) are dropped.'''

    # Calculate length of text, the size of each chunk of text and the starting points of each chunk of text
    length = len(text)
    size = math.floor(length / n)
    start = np.arange(0, length, size)
    
    # Pull out equally sized pieces of text and put it into a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list
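
The chunking arithmetic used by `split_text` can be checked in isolation. The sketch below uses a hypothetical helper, `chunk_bounds`, which is not part of the notebook. It reproduces the start and end offsets of each chunk with plain integer arithmetic, assuming the text is at least `n` characters long.

```python
def chunk_bounds(length, n=20):
    """Return the (start, end) character offsets of n equally sized chunks."""
    size = length // n  # same value as math.floor(length / n)
    return [(i * size, (i + 1) * size) for i in range(n)]


# Hypothetical text length that is not a multiple of n
bounds = chunk_bounds(103, n=20)
covered = bounds[-1][1]  # offset where the last chunk ends
print(bounds[:3], "...", bounds[-1])
print("characters left out:", 103 - covered)
```

When the length is not a multiple of `n`, the few trailing characters fall outside every chunk. For long annual reports this loss is negligible compared with the chunk size.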

In [None]:
data = data.head(18)

In [None]:
# Let's create a list to hold all of the pieces of text
list_pieces = []
for t in data.report:
    split = split_text(t)
    list_pieces.append(split)   

In [None]:
# The list has one element per report (18 after data.head(18) above)
len(list_pieces)

In [None]:
# Calculate the polarity for each piece of text
polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)   

In [None]:
# Show the plot for all companies
plt.rcParams['figure.figsize'] = [16, 20]

for index, company in enumerate(data.index):    
    plt.subplot(6, 3, index+1)
    plt.plot(polarity_transcript[index])
    plt.plot(np.arange(0,20), np.zeros(20))
    plt.title(data['company_name'].loc[company])
    
plt.show()

---
#### *Learn more about applications in this [YouTube video](https://www.youtube.com/watch?v=QXlCAFPtmbg): Techniques and Applications for Sentiment Analysis at the 5th Annual Wolfram Data Summit 2014*