# Text mining jaarverslagen (annual reports)
See the README.

## 1. Read from `file_list.csv`

In [1]:
from os.path import join, isfile
import pandas as pd

input_folder = '../jaarverslagen'
files = pd.read_csv(join(input_folder, 'file_list.csv'))
files

Unnamed: 0,filename,filename_no_extension,output_folder,language
0,ABNAMRO_2017.pdf,ABNAMRO_2017,../output/ABNAMRO_2017,english
1,AEGON_2017.pdf,AEGON_2017,../output/AEGON_2017,english
2,Akzonobel_2017.pdf,Akzonobel_2017,../output/Akzonobel_2017,english
3,Heineken_2017.pdf,Heineken_2017,../output/Heineken_2017,english
4,ING_Groep_2017.pdf,ING_Groep_2017,../output/ING_Groep_2017,english
5,KPN_2017.pdf,KPN_2017,../output/KPN_2017,english
6,Philips_2017.pdf,Philips_2017,../output/Philips_2017,english
7,Unilever_2017.pdf,Unilever_2017,../output/Unilever_2017,english


In [2]:
# Develop: to speed things up let's stick to 3 for now
files = files.iloc[:3, :]

## 2. Run through all annual reports
Using `papermill`, see https://papermill.readthedocs.io/en/latest/usage.html.

How does this work with my virtual environment `tmj`?
* Jupyter Notebook is served from the `base` environment, which has `widgetsnbextension` enabled
* `ipywidgets` is then installed in this kernel's environment, `tmj`
* See https://ipywidgets.readthedocs.io/en/stable/user_install.html#installing-with-multiple-environments for more information

In [3]:
import papermill as pm

### 2.1 PDF to text
First we'll convert all PDFs to plain text. This takes some time and only has to be done once, as long as the script hasn't changed. Consider skipping this step if it has already been done.

We'll execute the notebook `pdf_to_text.ipynb` and use the same file for input and output. This means that the parameters will be inserted into the original notebook, in a cell tagged `injected-parameters`.

In [11]:
# Iterate over the rows as named tuples
for row in files.itertuples():
    
    # Show progress
    print('Running:', row.filename)
    
    # Sanity check: skip files that are missing from the input folder
    if not isfile(join(input_folder, row.filename)):
        print('File not found:', row.filename)
        continue
    
    # Execute the notebook
    pm.execute_notebook(
       'pdf_to_text.ipynb',
       'pdf_to_text.ipynb',
       parameters = dict(
           filename = row.filename,
           folder = input_folder,
           filename_no_extension = row.filename_no_extension,
           output_folder = row.output_folder
       )
    )

Running: ABNAMRO_2017.pdf


Running: AEGON_2017.pdf


Running: Akzonobel_2017.pdf



### 2.2 Mine all the text
Run the text mining script for all previously extracted plain text files.

This will take a while. If you have already done this and the script hasn't changed, consider skipping this part.

In [12]:
# The text files are in the output folder
for row in files.itertuples():
    
    # Show progress
    print('Running:', row.filename_no_extension)
    
    # Execute the notebook
    pm.execute_notebook(
       'mine_text.ipynb',
       'mine_text.ipynb',
       parameters = dict(
           filename = row.filename_no_extension + '.txt',
           folder = row.output_folder,
           language = row.language,
           filename_no_extension = row.filename_no_extension,
           output_folder = row.output_folder
       )
    )

Running: ABNAMRO_2017


Running: AEGON_2017


Running: Akzonobel_2017



Now we've mined each document. Only some calculations are left to be done. Things will go a lot faster from now on.

## 3. Extract the data from the output folders
And combine it of course.

Each output folder corresponds to one annual report (jaarverslag) and contains multiple CSV files. Each CSV has a *tag* in its filename which describes the current stage in the text mining process, e.g. `bag_of_words`, `basic_processing`, `lemmatized`.
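As a small illustration of this naming scheme, the sketch below builds the path of one tagged CSV. The helper name `tagged_csv_path` is hypothetical, and `posixpath.join` is used instead of `os.path.join` only so that the result uses forward slashes on every platform.

```python
from posixpath import join  # assumption: POSIX-style paths, as in file_list.csv


def tagged_csv_path(output_folder, filename_no_extension, tag):
    """Hypothetical helper: path of the CSV for one report and one tag."""
    return join(output_folder, filename_no_extension + tag + '.csv')


# Values taken from the first row of file_list.csv
path = tagged_csv_path('../output/ABNAMRO_2017', 'ABNAMRO_2017', '_basic_processing')
print(path)
```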

Read them all into a large data frame `df_tf`.

Each CSV already contains the term frequency `tf`, which is the count normalized by the total number of words, per document.
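To make the definition of `tf` concrete, here is a minimal sketch on a made-up, already tokenized document. The token list is invented for illustration, and the sketch uses plain Python rather than the code of the mining notebook.

```python
from collections import Counter

# Toy document (made up for illustration), already tokenized
tokens = ['risk', 'bank', 'risk', 'annual', 'risk', 'bank', 'report', 'risk']

counts = Counter(tokens)
total_words = sum(counts.values())  # 8 words in this toy document

# Term frequency: count of each word divided by the total number of words
tf = {word: count / total_words for word, count in counts.items()}

# Print the words from most to least frequent
for word, value in sorted(tf.items(), key=lambda item: -item[1]):
    print(word, counts[word], value)
```

Because every count is divided by the same total, the `tf` values of one document sum to 1, which makes documents of different lengths comparable.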

In [13]:
# Set the tag for which we want to fetch the data
tag = '_basic_processing'

# Initiate the all results data frame
df_tf = pd.DataFrame()

# Loop over our output files
for row in files.itertuples():
    # Construct the CSV filename
    filename = row.filename_no_extension + tag + '.csv'
    print('Reading:', join(row.output_folder, filename))
    
    # Pandas read
    df = pd.read_csv(join(row.output_folder, filename),
                     nrows=2000 # This parameter might be interesting for TF-IDF later
                    )
    print('Shape:', df.shape)

    # Add the report name (filename without tag) as a column
    df['filename'] = row.filename_no_extension
    
    # Append to rest of results
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    df_tf = pd.concat([df_tf, df])

# Show the first rows
df_tf.head()

Reading: ../output/ABNAMRO_2017/ABNAMRO_2017_basic_processing.csv
Shape: (2000, 3)
Reading: ../output/AEGON_2017/AEGON_2017_basic_processing.csv
Shape: (2000, 3)
Reading: ../output/Akzonobel_2017/Akzonobel_2017_basic_processing.csv
Shape: (2000, 3)


Unnamed: 0,word,count,tf,filename
0,risk,1551,0.018978,ABNAMRO_2017
1,amro,1352,0.016543,ABNAMRO_2017
2,abn,1352,0.016543,ABNAMRO_2017
3,financial,1269,0.015527,ABNAMRO_2017
4,annual,1014,0.012407,ABNAMRO_2017


## 4. Inverse document frequency

### 4.1 Calculate inverse document frequency for every word

The inverse document frequency (IDF) of a term decreases as the number of documents containing that term increases. We define it as:

$$\text{idf}(t,D)=\log{ \frac{N}{|\{d \in D:t \in d\}|} }$$

where $t$ is a term or word, $D$ is the collection of all documents (the corpus), $N$ is the total number of documents and $d$ is one document. $|\{d \in D:t \in d\}|$ is the number of documents $d$ in $D$ that contain term $t$. Note that $\log$ is the natural logarithm.

See https://en.wikipedia.org/wiki/tf-idf for more info.
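The formula can be checked by hand on a toy corpus. The sketch below is an illustration in plain Python, not the notebook's pandas code; the three documents and their words are made up, and each document is reduced to a set so that a word counts at most once per document.

```python
import math

# Toy corpus (made up): each document is the set of words it contains
documents = {
    'doc_a': {'risk', 'bank', 'amro'},
    'doc_b': {'risk', 'bank', 'insurance'},
    'doc_c': {'risk', 'paint'},
}

N = len(documents)  # total number of documents

# Document count: number of documents that contain each word
document_count = {}
for words in documents.values():
    for word in words:
        document_count[word] = document_count.get(word, 0) + 1

# IDF with the natural logarithm, as in the formula above
idf = {word: math.log(N / n_docs) for word, n_docs in document_count.items()}

for word in sorted(idf):
    print(word, document_count[word], round(idf[word], 6))
```

With $N = 3$, a word found in a single document gets $\log 3 \approx 1.098612$, a word found in two documents gets $\log(3/2) \approx 0.405465$, and a word found in all three gets $\log 1 = 0$.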

*Note:* Maybe this will clear too many words? Maybe take into account *the counts* in other documents? For example: the word 'risk' is in every document, but perhaps it occurs far more often in one document than in the others.

In [14]:
# Extract and count unique filenames in the data frame
N = len(set(df_tf['filename']))

# Group by word and count the number of filenames
idf = (df_tf.groupby(by='word', as_index=False)['filename'].count()
       .rename(columns={'filename': 'document_count'})
      )

# Calculate IDF from the document count
import numpy as np
idf['idf'] = np.log(N/idf['document_count'])
idf.head()

Unnamed: 0,word,document_count,idf
0,aa,1,1.098612
1,aaa,1,1.098612
2,aag,1,1.098612
3,aandelenlease,1,1.098612
4,ab,1,1.098612


### 4.2 Join each annual report's TF list with the overall IDF list
The IDF data frame is a collection of all words in the corpus; information about which words belong to which document is lost. The TF data frame contains the words (and counts) per document.

So join TF with IDF and calculate TF-IDF for every word in every document.
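As a worked check of this step, the sketch below repeats the join and the multiplication in plain Python for two words. The `tf` values and document counts are taken from the ABNAMRO_2017 rows in this notebook's output; the dictionary lookup stands in for the pandas left join and is only an illustration.

```python
import math

N = 3  # number of documents in this development run

# TF rows as (filename, word, tf)
tf_rows = [
    ('ABNAMRO_2017', 'amro', 0.016543),
    ('ABNAMRO_2017', 'eur', 0.010914),
]

# IDF per word, from document counts of 1 ('amro') and 2 ('eur')
idf = {'amro': math.log(N / 1), 'eur': math.log(N / 2)}

# Look up the IDF of each word (the "join"), then multiply TF by IDF
tfidf_rows = [
    (filename, word, tf, idf[word], tf * idf[word])
    for filename, word, tf in tf_rows
]

for filename, word, tf, word_idf, tfidf in tfidf_rows:
    print(filename, word, round(tfidf, 6))
```

A word that is both frequent in a document and rare across the corpus gets the highest score, which is why 'amro' ranks above 'eur' here.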

In [16]:
# Join TF with IDF
df_tfidf = df_tf.merge(idf, how='left', on='word')

# Calculate TF-IDF
df_tfidf['tf-idf'] = df_tfidf['tf'] * df_tfidf['idf']

# Sort
df_tfidf.sort_values(by=['filename', 'tf-idf'], ascending=[True, False], inplace=True)
df_tfidf.head(10)

Unnamed: 0,word,count,tf,filename,document_count,idf,tf-idf
1,amro,1352,0.016543,ABNAMRO_2017,1,1.098612,0.018174
2,abn,1352,0.016543,ABNAMRO_2017,1,1.098612,0.018174
6,eur,892,0.010914,ABNAMRO_2017,2,0.405465,0.004425
50,banks,209,0.002557,ABNAMRO_2017,1,1.098612,0.002809
14,bank,497,0.006081,ABNAMRO_2017,2,0.405465,0.002466
18,banking,469,0.005739,ABNAMRO_2017,2,0.405465,0.002327
29,leadership,373,0.004564,ABNAMRO_2017,2,0.405465,0.001851
86,recognised,135,0.001652,ABNAMRO_2017,1,1.098612,0.001815
35,clients,286,0.003499,ABNAMRO_2017,2,0.405465,0.001419
168,depositary,79,0.000967,ABNAMRO_2017,1,1.098612,0.001062


In [19]:
# Save as CSV
output_folder = '../output'
df_tfidf.to_csv(
    join(output_folder, 'Alle_jaarverslagen' + tag + '.csv'),
    index=False
)

TODO: generate word clouds for each `filename` in `df_tfidf`.