# Text mining annual reports (jaarverslagen)
1. Read the list of reports from `file_list.csv`.
2. Use `analyse_jaarverslag.ipynb` to analyse each annual report (jaarverslag), one at a time.

## 1. Read from `file_list.csv`

In [1]:
from os.path import join, isfile
import pandas as pd

folder = '../jaarverslagen'
files = pd.read_csv(join(folder, 'file_list.csv'))
files

|   | filename | language |
|---|---|---|
| 0 | ABNAMRO_2017.pdf | english |
| 1 | AEGON_2017.pdf | english |
| 2 | Akzonobel_2017.pdf | english |
| 3 | Heineken_2017.pdf | english |
| 4 | ING_Groep_2017.pdf | english |
| 5 | KPN_2017.pdf | english |
| 6 | Philips_2017.pdf | english |
| 7 | Unilever_2017.pdf | english |


## 2. Run through all annual reports
This step uses `papermill`; see https://papermill.readthedocs.io/en/latest/usage.html.

How does this work with my virtual environment `tmj`?
* Jupyter Notebook is served from the `base` environment, which has `widgetsnbextension` enabled.
* `ipywidgets` is installed in this kernel's environment, `tmj`.
* See https://ipywidgets.readthedocs.io/en/stable/user_install.html#installing-with-multiple-environments for more information.

In [2]:
import papermill as pm

We'll execute the notebook `analyse_jaarverslag.ipynb` and use the same file for both input and output. This means that the parameters are inserted into the original notebook, in a cell tagged `injected-parameters`.

In [23]:
# Loop over all files
for row in files.itertuples():
    # Report progress
    print('Running:', row.filename)

    # First check that the file exists
    if not isfile(join(folder, row.filename)):
        print('File not found:', row.filename)
        continue

    # Execute the notebook, writing the result back to the same file
    pm.execute_notebook(
        'analyse_jaarverslag.ipynb',
        'analyse_jaarverslag.ipynb',
        parameters=dict(filename=row.filename,
                        folder=folder,
                        language=row.language)
    )

Running: ABNAMRO_2017.pdf
Running: AEGON_2017.pdf
Running: Akzonobel_2017.pdf
Running: Heineken_2017.pdf
Running: ING_Groep_2017.pdf
Running: KPN_2017.pdf
Running: Philips_2017.pdf
Running: Unilever_2017.pdf

(Each run also displayed a progress bar widget covering 18 notebook cells.)
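Because input and output are the same file, the notebook on disk only reflects the most recent run. If a separate executed notebook per report were wanted, the runs could be planned up front. The following is a minimal sketch, not part of the original workflow: the helper `plan_runs`, its `exists` argument, and the output naming scheme `analyse_jaarverslag_<report>.ipynb` are all assumptions made for illustration.

```python
from os.path import isfile, join, splitext


def plan_runs(rows, folder, template='analyse_jaarverslag.ipynb', exists=isfile):
    """Return one (output_notebook, parameters) pair per report that exists.

    `rows` is an iterable of (filename, language) pairs. The output naming
    scheme is a hypothetical choice, not something the original notebook uses.
    """
    runs = []
    for filename, language in rows:
        # Skip reports that are listed but missing on disk
        if not exists(join(folder, filename)):
            continue
        report = splitext(filename)[0]
        output_notebook = f'{splitext(template)[0]}_{report}.ipynb'
        parameters = dict(filename=filename, folder=folder, language=language)
        runs.append((output_notebook, parameters))
    return runs
```

Each pair could then be passed to `pm.execute_notebook('analyse_jaarverslag.ipynb', output_notebook, parameters=parameters)`, which leaves the template notebook untouched.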




## 3. Extract and combine the data from the output folders

Each output folder corresponds to one annual report and contains multiple CSV-PNG pairs. Each CSV (a table of unique words and their frequencies) and its PNG (the accompanying word cloud) share a *tag* that identifies the stage of the text mining process they come from.

In [110]:
# Let's stick to one tag for now
tag = '_basic_processing'

# Our output folder
output_folder = '../output/'

# Maximum words (most common) to extract from the files
max_rows = 1000

In [111]:
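import re
from os import walk

# Create a list (of tuples) with the path and filename of our CSVs
csvs = []
# Use `filenames` rather than `files` so the data frame from step 1 is not overwritten
for path, _, filenames in walk(output_folder):
    for filename in filenames:
        # Only store the CSVs we're interested in
        if re.fullmatch('.+' + re.escape(tag) + r'\.csv', filename):
            csvs.append((path, filename))

csvs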

[('../output/ABNAMRO_2017', 'ABNAMRO_2017_basic_processing.csv'),
 ('../output/AEGON_2017', 'AEGON_2017_basic_processing.csv'),
 ('../output/Akzonobel_2017', 'Akzonobel_2017_basic_processing.csv'),
 ('../output/Heineken_2017', 'Heineken_2017_basic_processing.csv'),
 ('../output/ING_Groep_2017', 'ING_Groep_2017_basic_processing.csv'),
 ('../output/KPN_2017', 'KPN_2017_basic_processing.csv'),
 ('../output/Philips_2017', 'Philips_2017_basic_processing.csv'),
 ('../output/Unilever_2017', 'Unilever_2017_basic_processing.csv')]
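The same collection step could also be written with `pathlib`. This is a minimal sketch of an alternative, assuming the folder layout shown above; the function name `find_tagged_csvs` is hypothetical. Note that the glob pattern `*` also matches an empty prefix, whereas the regular expression `.+` requires at least one character before the tag.

```python
from pathlib import Path


def find_tagged_csvs(output_folder, tag):
    """Return sorted (folder, filename) tuples for CSVs whose name ends in `tag`.

    Assumes `tag` contains no glob wildcards such as `*`, `?` or `[`.
    """
    matches = sorted(Path(output_folder).rglob(f'*{tag}.csv'))
    return [(str(p.parent), p.name) for p in matches]
```

Calling `find_tagged_csvs(output_folder, tag)` would give a list in the same shape as `csvs`, so the rest of the notebook could use it unchanged.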

Now we have the files we need. Read them into a large data frame.

In [112]:
import pandas as pd

# Collect the per-report data frames here
frames = []

# Loop over our files
for path, filename in csvs:
    # Read the headerless CSV (for debugging, pass nrows=max_rows)
    df = pd.read_csv(join(path, filename), header=None, names=['word', 'count'])

    # Sort (again, but just to make sure)
    df = df.sort_values(by='count', ascending=False)

    # Truncate to maximum rows; copy so the assignment below is not made on a view
    df = df[:max_rows].copy()

    # Add the report name (filename without tag) as a column
    df['filename'] = re.match('(.+)' + re.escape(tag) + r'\.csv', filename).group(1)

    # Combine with the rest of the results
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    frames.append(df)
    df_large = pd.concat(frames)
    print(df_large.shape)

df_large.head()

(1000, 3)
(2000, 3)
(3000, 3)
(4000, 3)
(5000, 3)
(6000, 3)
(7000, 3)
(8000, 3)


|   | word | count | filename |
|---|---|---|---|
| 0 | risk | 1544 | ABNAMRO_2017 |
| 1 | amro | 1352 | ABNAMRO_2017 |
| 2 | abn | 1352 | ABNAMRO_2017 |
| 3 | 2017 | 1294 | ABNAMRO_2017 |
| 4 | financial | 1269 | ABNAMRO_2017 |


In [113]:
# Save as CSV
df_large.to_csv(join(output_folder, 'Alle_jaarverslagen' + tag + '.csv'),
                index=False
               )
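Before moving on, it may be worth checking that every report contributed at most `max_rows` rows and glancing at the most frequent word per report. The sketch below is illustrative only; `summarise_reports` is a hypothetical helper and assumes the `word`, `count` and `filename` columns built above.

```python
import pandas as pd


def summarise_reports(df_large, max_rows):
    """Return (rows per report, most frequent word per report).

    Raises ValueError if any report has more than `max_rows` rows.
    """
    counts = df_large.groupby('filename').size()
    too_many = counts[counts > max_rows]
    if not too_many.empty:
        raise ValueError(f'Reports exceeding {max_rows} rows: {list(too_many.index)}')

    # The combined index repeats across reports, so avoid index-based lookups:
    # sort by count and keep the first row seen for each report
    top = (df_large.sort_values('count', ascending=False, kind='stable')
                   .drop_duplicates('filename')[['filename', 'word', 'count']])
    return counts, top
```

For example, `counts, top = summarise_reports(df_large, max_rows)` returns a row count per report and one row per report with its most frequent word.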

## 4. Inverse document frequency