#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/1_4.png)

# 1.4 Data Analysis
* [1.4.1. Most common words](#1.4.1)
* [1.4.2. Wordclouds](#1.4.2)
* [1.4.3. Relevant words analysis](#1.4.3)

---

## Exploratory data analysis of annual reports

After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any algorithm, let's explore the data first.

When working with numerical data, exploratory data analysis (EDA) techniques include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find the more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques.

1. **Most common words** - find these and create word clouds
2. **Size of vocabulary** - count the unique words and the total words per report
3. **Risk-loss trade-off** - compare how often the terms "risk" and "loss" appear in each report

---
### 1.4.1. Most Common Words
<a id="1.4.1"></a>

We will read the document-term matrix, in which each row is a report and each column is a word, and explore the word counts per document.
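
To make the layout concrete, here is a minimal sketch of a document-term matrix built from a made-up two-document corpus. The names `toy_corpus` and `toy_matrix` are hypothetical and the whitespace tokenization is an assumption; the pickled matrix used in this notebook was produced in the earlier cleaning step and may have been built differently.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy corpus; the notebook itself loads a pre-built matrix from a pickle.
toy_corpus = {
    "report_a": "risk loss risk growth",
    "report_b": "growth growth profit",
}

# Count tokens per document, then stack the counts into a documents x terms table.
counts = {name: Counter(text.split()) for name, text in toy_corpus.items()}
toy_matrix = pd.DataFrame.from_dict(counts, orient="index").fillna(0).astype(int)
toy_matrix = toy_matrix.sort_index(axis=1)  # alphabetical term order for readability

print(toy_matrix)
```

Each cell holds how many times a term occurs in a document, and a zero means the term does not occur in that document.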

In [None]:
import pandas as pd

In [None]:
data_matrix = pd.read_pickle('pickle/AnnualReports_matrix.pkl')
data_matrix.head()

In [None]:
data = data_matrix.transpose()

<br>
Find the top 30 words per report. After transposing, each column of `data` is one report.

In [None]:
# Map each report to its 30 most frequent (word, count) pairs
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c] = list(zip(top.index, top.values))

In [None]:
top_words = pd.DataFrame.from_dict(top_dict)
top_words
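
The same top-N selection can be sketched on a small made-up table, here with `Series.nlargest` instead of sorting the whole column. The table `toy` and the value `n = 2` are illustrative assumptions, not data from the annual reports.

```python
import pandas as pd

# Hypothetical term counts: rows are terms, columns are reports (same layout as `data`).
toy = pd.DataFrame(
    {"report_a": [5, 1, 3], "report_b": [0, 4, 2]},
    index=["risk", "growth", "loss"],
)

# For each report, keep the n most frequent terms as (term, count) pairs.
n = 2
toy_top = {
    report: list(toy[report].nlargest(n).items())
    for report in toy.columns
}
print(toy_top)
```

Each report maps to a list of pairs ordered from the most to the least frequent term, which is the same shape as `top_dict` above.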

<br>
Next, we count the number of unique words that each company uses.

In [None]:
# list of companies we're working on
companies = list(data_matrix.index)

In [None]:
# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
unique_list = []
for company in data.columns:
    uniques = data[company].to_numpy().nonzero()[0].size
    unique_list.append(uniques)

In [None]:
# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(companies, unique_list)), columns=['company', 'unique_words'])

In [None]:
data_words.head()

In [None]:
# Find the total number of words per document
total_list = []
for company in data.columns:
    totals = sum(data[company])
    total_list.append(totals)
# Let's add some columns to our dataframe
data_words['total_words'] = total_list

With the total word count we can work with relative numbers. The number of unique words is expected to grow with the length of the document, so we normalize it by dividing by the total number of words.
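
As a worked example of this normalization, the sketch below computes the unique-to-total ratio for two made-up token lists using only the standard library. The helper `unique_ratio` is a hypothetical name introduced for illustration.

```python
def unique_ratio(tokens):
    """Return unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Made-up documents: the same three-word vocabulary, at two different lengths.
short_doc = "risk loss growth".split()
long_doc = ("risk loss growth " * 10).split()

print(len(set(short_doc)), len(short_doc), unique_ratio(short_doc))
print(len(set(long_doc)), len(long_doc), unique_ratio(long_doc))
```

Both documents have three unique words, but the longer one repeats them far more often, so its ratio is much lower. The ratio should be read as a rough comparison, since it still depends on document length.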

In [None]:
# Calculate the ratio
data_words['unique_relative'] = data_words['unique_words']/data_words['total_words']

In [None]:
# Sort by the ratio, highest first
data_sort = data_words.sort_values(by='unique_relative', ascending=False)
data_sort.head(10)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
y_pos = np.arange(len(data_words))

plt.rcParams['figure.figsize'] = [12, 18]
plt.barh(y_pos, data_sort.unique_relative, align='center')
plt.yticks(y_pos, data_sort.company)
plt.title('Unique Words Ratio (unique / total)', fontsize=20)

plt.tight_layout()
plt.show()

---
### 1.4.2. Wordclouds
<a id="1.4.2"></a>

A word cloud helps to interpret a text visually at first glance. It displays the word frequencies of the text as a weighted list, so the most prominent terms stand out by size.
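
The weighted list behind a word cloud can be sketched with the standard library alone. The sentence below is made up, and the whitespace tokenization is a simplifying assumption: the `WordCloud` class does its own tokenization and removes common stopwords by default, so its weights will generally differ.

```python
from collections import Counter

text = "risk management reduces risk and loss"  # made-up sentence
frequencies = Counter(text.split())

# Scale each count by the largest count, so the most frequent word has weight 1.0.
top_count = max(frequencies.values())
weights = {word: count / top_count for word, count in frequencies.items()}

for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word}: {weight:.2f}")
```

If precomputed counts are already available, `WordCloud.generate_from_frequencies` accepts a dictionary of this kind instead of raw text.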

In [None]:
# Wordcloud python library
from wordcloud import WordCloud

In [None]:
# Define the word cloud object
wc = WordCloud(background_color="white", colormap="Dark2", max_font_size=150, random_state=42)

To plot a word cloud we first need the cleaned corpus.

In [None]:
# Read the corpus into a DataFrame
data_clean = pd.read_pickle('pickle/AnnualReports_corpus.pkl')

In [None]:
index = 'Pfizer_(2018).pdf' #'ABN_AMRO_Group_(2018).pdf' 
wc.generate(data_clean.report[index])
plt.figure(figsize=(12,12))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title(index)
plt.show()

---
#### *Learn more about the WordCloud library in its official documentation [amueller.github.io/word_cloud](https://amueller.github.io/word_cloud/index.html)*

---
### 1.4.3. Relevant words analysis (Risk-loss trade-off)
<a id="1.4.3"></a>

Within our corpus, certain terms may be relevant to a particular topic, such as negative or positive sentiment. Here we look at how often each report uses the words "risk" and "loss".
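
Raw counts of a term are hard to compare across reports of different length, so one option is to express them per thousand words. The sketch below uses made-up numbers, and `per_thousand` is a hypothetical helper name.

```python
def per_thousand(term_count, total_words):
    """Occurrences of a term per 1,000 words of the document."""
    return 1000 * term_count / total_words


# Made-up numbers: 30 mentions in a 12,000-word report and in a 60,000-word report.
short_report = per_thousand(30, 12_000)
long_report = per_thousand(30, 60_000)

print(short_report, long_report)
```

The same 30 mentions weigh five times more in the shorter report, which the raw count alone would hide.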

In [None]:
top_words

In [None]:
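# Isolate the counts of the two terms of interest.
# .copy() gives an independent DataFrame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning.
data_risk_words = data_matrix[['risk', 'loss']].copy()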
data_risk_words

In [None]:
# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Add the total word count, aligned on the company name rather than on row order
data_risk_words['total_words'] = data_words.set_index('company')['total_words']

In [None]:
# Express each count per 1,000 words of the report
data_risk_words['relative_risk'] = data_risk_words['risk']/data_risk_words['total_words']*1000
data_risk_words['relative_loss'] = data_risk_words['loss']/data_risk_words['total_words']*1000

In [None]:
data_risk_words

<br> 
We plot the relative "risk" mentions against the relative "loss" mentions in a scatter plot, one point per report.

In [None]:
plt.rcParams['figure.figsize'] = [16, 16]

for company in data_risk_words.index:
    x = round(data_risk_words.loc[company, 'relative_risk'], 2)
    y = round(data_risk_words.loc[company, 'relative_loss'], 2)
    plt.scatter(x, y, color='blue')
    # Label each point with the report name, slightly offset from the marker
    plt.text(x + 0.1, y + 0.1, company, fontsize=8)
    
plt.title('Number of Risk-Loss Terms Used in Document per Thousand Words', fontsize=20)
plt.xlabel('Risk mentions', fontsize=15)
plt.ylabel('Loss mentions', fontsize=15)

plt.show()

---
#### *More tips and tricks for the Matplotlib Python library at [realpython.com](https://realpython.com/python-matplotlib-guide/)*