In [1]:
# Load

import os
import pandas as pd
import json
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import itertools
from itertools import compress
import numpy as np
from collections import Counter

# Gensim
import gensim
import gensim.corpora as corpora

data_path = os.path.join('..', 'data')
pdf_path = os.path.join(data_path, 'pdf')
out_path = os.path.join('..', 'output')
datafile = 'drr_scrape2021-07-08.json'
datafile_tokenized = 'drr_scrape2021-07-08_tokenized.json'
datafile_es = 'drr_scrape2021-07-08_es.json'

path = os.path.join(data_path, datafile_tokenized)

with open(path, 'r') as file:
    data = json.load(file)
    
# Dummy functions for using existing tokens in sklearn vectorizer
def return_tokens(tokens):
    return tokens

# Function for summarizing keywords with tf-idf
def tfidf_summarize(token_list, n_words=50):
    vectorizer = TfidfVectorizer(
        tokenizer=return_tokens,
        preprocessor=return_tokens,
        token_pattern=None,
        norm=None)  # no normalization, so raw TF-IDF scores can be summed

    # Fitting vectorizer
    transformed_documents = vectorizer.fit_transform(token_list)
    transformed_documents_as_array = transformed_documents.toarray()
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df = pd.DataFrame(transformed_documents_as_array, columns=vectorizer.get_feature_names_out())

    # Sum of TF-IDF scores per word across all documents
    word_tfidfsum = df.sum().sort_values(ascending=False)
    word_tfidfsum_select = word_tfidfsum[0:n_words]

    return word_tfidfsum_select

# DRR: Initial data exploration

This report briefly describes the initial analysis process in a project at AAU on disaster risk reduction.

The report contains the following:
- Data sources and data collection (web scraping)
- Data description
- Initial keywords
- Example topic model

The report is meant to give an initial idea of the possibilities with the techniques used.

## Data sources and data collection

Two main data sources were initially identified for gaining insights into how international organizations work with disaster risk reduction:
- The United Nations Inter-Agency Working Group (IAWG) on Disarmament, Demobilization and Reintegration (DDR) (UNDDR - https://www.unddr.org/)
- The European Commission Disaster Risk Management Knowledge Centre (DRMKC - https://drmkc.jrc.ec.europa.eu/)

The data collection consisted of two main steps:
1. Scraping the textual content and links of all webpages on the two websites
2. Identifying and downloading the initially identified PDF documents on the websites
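
A minimal sketch of how the second step might look, assuming the links collected in the first step are available as a list of `href` strings together with the URL of the page they were found on. The helper name `find_pdf_links` and the example URLs are illustrative and not taken from the project's scraper.

```python
from urllib.parse import urljoin, urlparse


def find_pdf_links(page_url, hrefs):
    """Return absolute URLs of links that point at PDF files (hypothetical helper)."""
    pdf_links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)       # resolve relative links
        path = urlparse(absolute).path.lower()   # ignore query strings and fragments
        if path.endswith('.pdf') and absolute not in pdf_links:
            pdf_links.append(absolute)
    return pdf_links


# Illustrative input; the URLs below are made up for the example
example_links = find_pdf_links(
    'https://example.org/reports/',
    ['summary.pdf', '/docs/Guide.PDF?download=1', 'about.html', 'summary.pdf'],
)
print(example_links)
```

Checking the URL path rather than the full URL keeps links with query strings, and the duplicate check avoids downloading the same document twice.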

The webpages and PDF documents were merged into a single dataset containing all the texts.

All the texts are made available on an Elasticsearch instance (link and login details sent via e-mail).

**NOTE ON PDFs!:** Not all PDF documents have been downloaded. Both websites contain several search portals for searching through PDF documents on the site. Downloading all PDFs available through these portals would require setting up a different scraper, which was not done for this iteration of the project.

## Data description and initial keywords

In [2]:
data_un = [entry for entry in data if entry['org'] == 'unddr']
data_drmkc = [entry for entry in data if entry['org'] == 'drmkc']

print(f"""The data consists of {len(data)} texts in total. \n
{len(data_un)} texts are from UN DDR (https://www.unddr.org). {dict(Counter([entry.get('type') for entry in data_un]))['webpage']} from webpages and {dict(Counter([entry.get('type') for entry in data_un]))['pdf']} from pdfs \n
{len(data_drmkc)} texts are from DRMKC EU (https://drmkc.jrc.ec.europa.eu). {dict(Counter([entry.get('type') for entry in data_drmkc]))['webpage']} from webpages and {dict(Counter([entry.get('type') for entry in data_drmkc]))['pdf']} from pdfs""")

The data consists of 433 texts in total. 

196 texts are from UN DDR (https://www.unddr.org). 133 from webpages and 63 from pdfs 

237 texts are from DRMKC EU (https://drmkc.jrc.ec.europa.eu). 151 from webpages and 86 from pdfs


In order to identify keywords (here understood as the most frequent words), the texts have been 'tokenized'. 'Tokenization' is the process of converting raw text into standardized word values, so that words can be compared and counted regardless of casing or inflection. Here it also involves filtering out "stop words", meaning common words that carry little semantic content.

A pre-developed English language model from spaCy was used in the tokenization (https://spacy.io/).

The tokenization involved the following steps:
- Converting the text to lower-case
- Splitting the text into individual words
- Removing numbers and punctuation
- Filtering out common stop words (as defined in the language model)
- Keeping only nouns, proper nouns, adjectives and verbs
- Filtering out words shorter than five characters
- Converting the word to its lemma
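
The filtering steps above can be sketched roughly as follows. This is an illustration under assumptions, not the project's tokenizer: the function `filter_tokens` and the `stub` helper are hypothetical, and the function only relies on the attributes that spaCy tokens expose (`is_punct`, `like_num`, `is_stop`, `pos_`, `text`, `lemma_`). Stub tokens are used so the example runs without a language model, and lower-casing is applied to the lemma at the end rather than to the raw text first.

```python
from types import SimpleNamespace

KEEP_POS = {'NOUN', 'PROPN', 'ADJ', 'VERB'}


def filter_tokens(doc, min_length=5):
    """Reduce an iterable of spaCy-like tokens to a list of lower-cased lemmas."""
    tokens = []
    for token in doc:
        if token.is_punct or token.like_num:   # drop punctuation and numbers
            continue
        if token.is_stop:                      # drop stop words
            continue
        if token.pos_ not in KEEP_POS:         # keep nouns, proper nouns, adjectives, verbs
            continue
        if len(token.text) < min_length:       # drop words shorter than five characters
            continue
        tokens.append(token.lemma_.lower())    # lemma, lower-cased
    return tokens


def stub(text, pos, lemma=None, is_stop=False, is_punct=False, like_num=False):
    """Stand-in for a spaCy token (hypothetical, for illustration only)."""
    return SimpleNamespace(text=text, pos_=pos, lemma_=lemma or text,
                           is_stop=is_stop, is_punct=is_punct, like_num=like_num)


example_doc = [
    stub('Disasters', 'NOUN', lemma='disaster'),
    stub('are', 'AUX', lemma='be', is_stop=True),
    stub('reduced', 'VERB', lemma='reduce'),
    stub('in', 'ADP', is_stop=True),
    stub('2021', 'NUM', like_num=True),
    stub('.', 'PUNCT', is_punct=True),
    stub('risk', 'NOUN'),
]
example_tokens = filter_tokens(example_doc)
print(example_tokens)
```

With spaCy installed, the same function could be applied to a parsed text, for example `filter_tokens(nlp(text))` where `nlp` is a loaded English pipeline. Which spaCy model was used is not stated in this report.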

With the texts tokenized, the most frequent tokens can be identified.

### Keywords based on counts

**Top 50 most common terms based on counts across all data**

In [3]:
# Keywords based on counts

drr_tokens = [entry['tokens'] for entry in data]
drr_tokens_flat = list(itertools.chain(*drr_tokens))

un_tokens = [entry['tokens'] for entry in data_un]
un_tokens_flat = list(itertools.chain(*un_tokens))

drmkc_tokens = [entry['tokens'] for entry in data_drmkc]
drmkc_tokens_flat = list(itertools.chain(*drmkc_tokens))

tokens_counted = Counter(drr_tokens_flat)
tokens_counted_un = Counter(un_tokens_flat)
tokens_counted_drmkc = Counter(drmkc_tokens_flat)
tokens_counted.most_common()[0:50]

[('should', 6140),
 ('their', 4689),
 ('national', 4093),
 ('support', 3832),
 ('reintegration', 3675),
 ('international', 3664),
 ('management', 3532),
 ('disaster', 3504),
 ('security', 3323),
 ('information', 3290),
 ('groups', 3288),
 ('programmes', 3161),
 ('programme', 3071),
 ('european', 2836),
 ('processes', 2479),
 ('system', 2450),
 ('these', 2422),
 ('force', 2364),
 ('service', 2200),
 ('including', 2156),
 ('development', 2063),
 ('commission', 2016),
 ('crisis', 2007),
 ('community', 2002),
 ('demobilization', 1976),
 ('planning', 1976),
 ('activities', 1960),
 ('peace', 1905),
 ('measures', 1884),
 ('people', 1883),
 ('women', 1882),
 ('process', 1875),
 ('weapons', 1825),
 ('united', 1798),
 ('between', 1784),
 ('natural', 1756),
 ('assessment', 1716),
 ('excombatants', 1693),
 ('during', 1692),
 ('transitional', 1630),
 ('provide', 1612),
 ('health', 1594),
 ('inform', 1592),
 ('research', 1579),
 ('communities', 1573),
 ('different', 1560),
 ('disarmament', 1522),
 (

**Top 50 most common terms based on counts across data from UNDDR**

In [4]:
tokens_counted_un.most_common()[0:50]

[('should', 5109),
 ('reintegration', 3675),
 ('their', 3373),
 ('support', 3177),
 ('programmes', 3039),
 ('national', 3038),
 ('programme', 2892),
 ('security', 2864),
 ('groups', 2819),
 ('international', 2321),
 ('force', 2298),
 ('processes', 2126),
 ('demobilization', 1976),
 ('peace', 1895),
 ('information', 1872),
 ('women', 1872),
 ('weapons', 1822),
 ('excombatants', 1693),
 ('including', 1673),
 ('transitional', 1626),
 ('process', 1556),
 ('activities', 1543),
 ('disarmament', 1522),
 ('community', 1519),
 ('these', 1519),
 ('planning', 1468),
 ('mission', 1429),
 ('rights', 1337),
 ('ensure', 1264),
 ('ammunition', 1237),
 ('development', 1234),
 ('integrated', 1196),
 ('provide', 1192),
 ('communities', 1186),
 ('measures', 1144),
 ('combatants', 1135),
 ('nations', 1122),
 ('include', 1105),
 ('during', 1080),
 ('management', 1067),
 ('where', 1053),
 ('training', 1048),
 ('united', 1036),
 ('political', 1031),
 ('service', 996),
 ('between', 939),
 ('gender', 939),
 ('v

**Top 50 most common terms based on counts across data from DRMKC**

In [5]:
tokens_counted_drmkc.most_common()[0:50]

[('disaster', 3495),
 ('european', 2799),
 ('management', 2465),
 ('crisis', 1936),
 ('system', 1769),
 ('inform', 1495),
 ('commission', 1473),
 ('impacts', 1444),
 ('research', 1438),
 ('information', 1418),
 ('international', 1343),
 ('their', 1316),
 ('people', 1302),
 ('service', 1204),
 ('assessment', 1182),
 ('disasters', 1181),
 ('change', 1172),
 ('global', 1143),
 ('natural', 1127),
 ('damage', 1118),
 ('hazards', 1066),
 ('national', 1055),
 ('medium', 1035),
 ('should', 1031),
 ('index', 1010),
 ('different', 1003),
 ('science', 997),
 ('knowledge', 962),
 ('impact', 962),
 ('communication', 945),
 ('events', 942),
 ('centre', 920),
 ('model', 915),
 ('these', 903),
 ('reduction', 881),
 ('vulnerability', 881),
 ('complex', 860),
 ('severity', 853),
 ('between', 845),
 ('infrastructure', 844),
 ('africa', 842),
 ('development', 829),
 ('resilience', 826),
 ('population', 808),
 ('health', 804),
 ('report', 795),
 ('university', 790),
 ('framework', 781),
 ('protection', 775

### Keywords based on TF-IDF

A common way of weighting words is the metric "TF-IDF": term frequency-inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

Briefly put, the metric "punishes" a word the more documents it occurs in. This rests on the assumption that words occurring in almost all documents are less relevant than words occurring in only some documents. Words that occur in all documents are often common words in general and may not be of interest for a specific topic.
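
To make the weighting concrete, the sketch below computes summed TF-IDF scores by hand for three made-up token lists. It assumes the smoothed IDF that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, and no normalization, which should correspond to how `tfidf_summarize` is configured above. The toy documents are illustrative only.

```python
import math
from collections import Counter

# Made-up token lists standing in for tokenized texts
docs = [
    ['disaster', 'management', 'disaster'],
    ['disaster', 'reintegration'],
    ['disaster', 'reintegration', 'reintegration', 'weapons'],
]

n_docs = len(docs)
# Number of documents each token occurs in
doc_freq = Counter(token for doc in docs for token in set(doc))


def idf(token):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1


# Sum term frequency * IDF over all documents
tfidf_sum = Counter()
for doc in docs:
    for token, count in Counter(doc).items():
        tfidf_sum[token] += count * idf(token)

for token, score in tfidf_sum.most_common():
    print(f'{token:<15}{score:.3f}')
```

Under this formula a word that occurs in every document still has an IDF of 1, so its score equals its raw count. This is one reason very frequent words can remain near the top of the TF-IDF lists below.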

**Top 50 terms based on TF-IDF across all data**

In [6]:
tfidf_summarize(drr_tokens)

should            14188.745480
their              9765.367421
reintegration      9666.695295
disaster           8864.243130
national           8441.441019
support            7802.304282
groups             7771.235126
programmes         7471.068806
international      7320.198314
programme          7258.352516
management         7237.647295
security           7154.630370
european           6807.011724
information        6593.621692
force              6274.538082
processes          5544.934553
these              5436.737990
system             5257.187580
demobilization     5244.706958
women              5136.428009
weapons            5103.390608
service            4973.892393
peace              4944.823206
excombatants       4933.675336
crisis             4817.232909
including          4771.728472
natural            4703.613486
commission         4641.557578
community          4626.362940
transitional       4558.096817
activities         4546.130512
planning           4532.787977
measures

**Top 50 terms based on TF-IDF across data from UNDDR**

In [7]:
tfidf_summarize(un_tokens)

should            10395.325658
reintegration      6764.030336
their              6585.682620
support            5923.081635
national           5813.990352
programmes         5702.636339
groups             5617.546147
programme          5571.423257
security           5480.996830
force              4810.975571
international      4657.173792
processes          4015.492224
women              3756.238405
weapons            3733.455142
demobilization     3683.981527
information        3679.819632
peace              3602.730593
excombatants       3596.474853
transitional       3331.832086
including          3288.642224
process            3188.395281
activities         3161.757017
these              3090.722191
community          3090.722191
darfur             3034.839366
mission            2887.328584
planning           2885.670523
ammunition         2884.634533
disarmament        2837.560670
rights             2819.487605
ensure             2665.544004
measures           2504.054886
communit

**Top 50 terms based on TF-IDF across data from DRMKC**

In [8]:
tfidf_summarize(drmkc_tokens)

disaster          7052.642204
european          5552.192417
management        5152.451556
crisis            4171.659036
system            3933.858739
impacts           3482.679146
inform            3390.029191
commission        3193.766820
research          3060.733032
change            3000.610267
people            2991.844192
their             2889.415594
information       2878.000334
natural           2819.730063
service           2766.651618
should            2681.702822
international     2664.020870
assessment        2662.765829
damage            2658.521393
global            2644.202016
disasters         2643.271884
hazards           2552.785623
communication     2540.236583
science           2532.823706
medium            2532.561147
vulnerability     2329.037393
national          2316.362805
journal           2315.031774
model             2306.739664
complex           2292.423038
index             2275.290598
different         2274.380788
severity          2273.763781
events    

## Topic modelling (example)

A topic model tries to extract a set number of topics from a collection of documents. The model is built on the assumption that each text is composed of a mixture of these topics. Simply put, a topic is a probability distribution over the words in all the texts. The different topics thus indicate which words often occur together in a text.
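
The idea of a text as a mixture of topics can be illustrated with a small calculation. The topics, vocabulary and probabilities below are made up for the example and do not come from the model in this report.

```python
# Two made-up topics, each a probability distribution over a tiny vocabulary
topics = {
    0: {'disaster': 0.6, 'management': 0.3, 'weapons': 0.1},
    1: {'disaster': 0.1, 'management': 0.2, 'weapons': 0.7},
}

# A made-up text that consists of 75% topic 0 and 25% topic 1
doc_topics = {0: 0.75, 1: 0.25}


def word_probability(word, doc_topics, topics):
    """P(word | text) = sum over topics of P(topic | text) * P(word | topic)."""
    return sum(weight * topics[topic].get(word, 0.0)
               for topic, weight in doc_topics.items())


word_probs = {word: word_probability(word, doc_topics, topics) for word in topics[0]}
print(word_probs)
```

Fitting a topic model runs this logic in reverse: given only the observed words, it estimates both the topics and each text's mixture of them.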

An accessible description of how topic models work can be found here: http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf

To make proper use of a topic model, it makes sense to try to identify what each topic is actually capturing: what do the words in the topic have in common, and what texts are associated with the different topics?

From that, key texts within each topic can be identified, and it is possible to compare and contrast the prevalence of different topics across organizations.

Creating a proper topic model involves fine-tuning the different parameters in the model as well as trying out different pre-processing techniques. Texts, especially from websites, can be tricky and involve a lot of "noise" that is specific to the data.

In this example, a simple model has been put together to illustrate some of the outputs it can produce. Using this technique in practice would involve more work than has been put in for this report.

### Example topic model

A topic model has been created using the texts from the two websites. The model was set to identify 10 topics. Finding the optimal number of topics involves both trying out different parameters (the fine-tuning referred to above) and evaluating the outputs qualitatively.

In [9]:
## Dictionary and filter extremes
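id2token = corpora.Dictionary([entry.get('tokens') for entry in data])
# Note: gensim's no_below is an absolute document count, not a fraction, so 0.05
# removes no rare tokens here. The values are left unchanged because the saved
# model loaded below was presumably trained with this dictionary.
id2token.filter_extremes(no_below=0.05, no_above=0.95)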

## Gensim doc2bow corpus
for entry in data:
    entry['doc2bow'] = id2token.doc2bow(entry.get('tokens'))    
    
tokens_bow = [entry.get('doc2bow') for entry in data]

lda_model = gensim.models.LdaMulticore.load(os.path.join(out_path, 'lda_model'))
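
As a rough illustration of what filtering extremes does, the sketch below mimics the documented meaning of the two thresholds in plain Python: `no_below` is a minimum number of texts, while `no_above` is a maximum fraction of texts. The function `filter_by_document_frequency` is hypothetical and the token lists are made up. gensim's own implementation may differ in details such as rounding.

```python
from collections import Counter


def filter_by_document_frequency(token_lists, no_below, no_above):
    """Keep tokens found in at least `no_below` texts and at most a fraction `no_above` of texts."""
    n_docs = len(token_lists)
    # Number of texts each token occurs in
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    return {token for token, df in doc_freq.items()
            if df >= no_below and df <= no_above * n_docs}


# Made-up token lists: 'should' is in every text, 'darfur' in only one
example_texts = [
    ['should', 'disaster'],
    ['should', 'disaster'],
    ['should', 'darfur'],
    ['should'],
]

kept_strict = filter_by_document_frequency(example_texts, no_below=2, no_above=0.75)
kept_loose = filter_by_document_frequency(example_texts, no_below=0.05, no_above=0.95)
print(kept_strict, kept_loose)
```

In this toy case a `no_below` value under 1 keeps every rare token, since each token occurs in at least one text.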

The most likely words within each of the 10 topics can be seen below:

In [10]:
from pprint import pprint 

# Show Topics
pprint(lda_model.show_topics(formatted=False, num_topics=10))

[(0,
  [('disaster', 0.01122476),
   ('management', 0.007977575),
   ('european', 0.0079092765),
   ('system', 0.005714025),
   ('impacts', 0.0048752055),
   ('their', 0.004089153),
   ('change', 0.003980041),
   ('disasters', 0.0038803057),
   ('information', 0.003842459),
   ('research', 0.0038103017)]),
 (1,
  [('function', 0.018646052),
   ('newcategorysectionsettingscategorysection', 0.015501083),
   ('european', 0.011650168),
   ('return', 0.011440888),
   ('commission', 0.010161978),
   ('found', 0.009384863),
   ('university', 0.009039401),
   ('research', 0.008704226),
   ('centre', 0.008032969),
   ('joint', 0.0071320636)]),
 (2,
  [('inform', 0.028230362),
   ('index', 0.017968168),
   ('severity', 0.015301224),
   ('crisis', 0.011634377),
   ('people', 0.008789542),
   ('dimension', 0.0065143546),
   ('population', 0.0063117063),
   ('indicators', 0.006056953),
   ('model', 0.0054247826),
   ('vulnerability', 0.0050228564)]),
 (3,
  [('reintegration', 0.008804386),
   ('nat

The model can be used to identify the topics in the texts. 

In [52]:
lda_model[tokens_bow[21]]

[(5, 0.2899878), (7, 0.30732593), (8, 0.40228155)]

The output above shows that text 21 (corresponding to this text: https://www.unddr.org/the-ddr-bulletin-map-test/) consists of 29% topic 5, 31% topic 7 and 40% topic 8.

A mean topic probability is calculated across all the texts below:

In [53]:
# Topic prob for all texts

for entry in data:
    entry['topic_prob'] = lda_model[entry['doc2bow']]
    
df = pd.DataFrame.from_records(data)
df = df.loc[:, ['id', 'org', 'topic_prob']]
df = df.explode('topic_prob').reset_index(drop = True)

df['topic'] = 0
df['prob'] = 0.0  # float column, so probabilities are not assigned into an integer column

for row in range(df.shape[0]):
    df.loc[row, 'topic'] = df.loc[row, 'topic_prob'][0]
    df.loc[row, 'prob'] = df.loc[row, 'topic_prob'][1]

df.groupby('topic')['prob'].mean()

topic
0    0.267625
1    0.157956
2    0.169609
3    0.117099
4    0.104254
5    0.231530
6    0.102894
7    0.183316
8    0.166712
9    0.103568
Name: prob, dtype: float64

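This indicates that topics 0 and 5 are the most prevalent topics in the data. Note that gensim only reports the topics above a minimum probability for each text, so each mean is taken over the texts in which the topic was reported.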
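
The sketch below shows how the choice of denominator affects such a mean, using made-up topic probabilities in the same `(topic, probability)` format as the model output. It assumes that topics missing from a text's list were simply not reported for that text. The variable names are illustrative.

```python
from collections import defaultdict

# Made-up topic probabilities for three texts
doc_topic_probs = [
    [(0, 0.9), (5, 0.1)],
    [(5, 0.6), (7, 0.4)],
    [(0, 0.5), (7, 0.5)],
]

sums = defaultdict(float)
counts = defaultdict(int)
for topic_probs in doc_topic_probs:
    for topic, prob in topic_probs:
        sums[topic] += prob
        counts[topic] += 1

# Mean over the texts in which the topic was reported
mean_when_present = {topic: sums[topic] / counts[topic] for topic in sums}
# Mean over all texts, counting an unreported topic as probability 0
mean_over_all = {topic: sums[topic] / len(doc_topic_probs) for topic in sums}

print(mean_when_present)
print(mean_over_all)
```

The first variant answers "how dominant is the topic where it appears", the second "how much of the whole collection does the topic account for". Which one is more useful depends on the question being asked.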

Below, the mean probabilities are calculated for each organization:

In [56]:
df.groupby(['org', 'topic'])['prob'].mean()

org    topic
drmkc  0        0.379858
       1        0.203076
       2        0.218982
       3        0.125504
       4        0.099662
       5        0.101839
       6        0.100968
       7        0.101872
       8        0.222663
       9        0.099669
unddr  0        0.098370
       1        0.099214
       2        0.097106
       3        0.108001
       4        0.108762
       5        0.323969
       6        0.104720
       7        0.238617
       8        0.112833
       9        0.107720
Name: prob, dtype: float64

This indicates that topic 0 is the most prevalent in the DRMKC texts and topic 5 is the most prevalent in the UNDDR texts.