## Baseline search engine for EviDENce

Search strategy

1. Collect corpus to perform search on
2. Index documents in corpus
3. Collect Keywords
4. Perform search
5. Compare with ground truth
6. Process and analyze results

In [1]:
# Imports from python libraries
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import string
import sys

# Imports from own script
from baseline_search import create_searchable_data
from baseline_search import create_searchable_data2
from baseline_search import create_lemma
from baseline_search import eng_to_dutch
from baseline_search import search_corpus
from baseline_search import quote_phrase

# disable SettingWithCopyWarning warning (default='warn')
pd.options.mode.chained_assignment = None 

In [2]:
# Define paths:
# to extracted folder with lemma fragments
root = os.path.join(os.sep,"media","sf_MartinedeVos")
manual_set = os.path.join(root,"TargetSize150","Manual_annotation_sets-20190110T093859Z-001")
search_dir = os.path.join(manual_set,"Auto_annotation_sets")
# to reports on baseline_search results
report = os.path.join(manual_set,"Reports")
# to ground truth
gt_file = os.path.join(manual_set,"Manual Annotation Jeroen en Susan","Manual annotation TOTAAL door Jeroen en Susan.csv")


### 1. Collect corpus to perform search on

Our corpus consists of oral history accounts.
These are split into text fragments of 150 lemmas each, which can be found in a zip archive on SURFdrive:

../Data/NR-teksts/EviDENce_NR_output/TargetSize150/Lemma_preserve_paragraph.zip

*Both the file names and the path names are long. Extract the zip archive to a high-level location on your computer to avoid a "path-too-long" error.*


### 2. Index all documents (i.e., lemma fragments) in the directory

* Create Schema
* Add documents
* Perform indexing

_NB: this step only has to be run once, or when data is added or changed_
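
The three steps above are handled by `create_searchable_data`, whose implementation is not shown in this notebook. As a rough illustration of what indexing achieves, the sketch below builds a toy inverted index in plain Python. It is a stand-in for the idea, not the Whoosh-based code, and the fragment names and texts are made up.

```python
from collections import defaultdict

def build_inverted_index(fragments):
    """Map each lemma to the set of fragment names that contain it.

    `fragments` is a dict {fragment_name: text}; the text is assumed to be
    a whitespace-separated sequence of lemmas (hypothetical toy format).
    """
    index = defaultdict(set)
    for name, text in fragments.items():
        for lemma in text.lower().split():
            index[lemma].add(name)
    return dict(index)

# Hypothetical lemma fragments, for illustration only
toy_fragments = {
    "fragment_a": "de soldaat schieten op de brug",
    "fragment_b": "wij vluchten naar het dorp",
    "fragment_c": "de bom vallen op de brug",
}

toy_index = build_inverted_index(toy_fragments)
print(sorted(toy_index["brug"]))
```

Looking up a keyword then becomes a dictionary access instead of a scan over all fragments, which is why the index only has to be rebuilt when the data changes.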

In [3]:
# The index only needs to be created once; after that, opening the existing index is sufficient.
# In that case, keep the following line commented out.

#create_searchable_data(search_dir)

### 3. Collect list of keywords from an existing vocabulary

Preprocessing entails:
* manually selecting keywords related to violence
* expressing keywords as lemmas to ensure more effective matching

#### a.  Keywords from CEO-ECB mappings ####

Keywords are based on mappings of Circumstantial Event Ontology (CEO) classes onto the ECB+ corpus.
* The selected keywords are translated to Dutch

In [4]:
ceo_file ="../data/MdV_selectedCEOECB.csv"
ceo_df = pd.read_csv(ceo_file,sep=';',encoding = "ISO-8859-1")

ceo_df['wordnet_lemma']=ceo_df.apply(lambda x:create_lemma(x['Mention']), axis=1)

NB: Automatic translation is either unstable or does not provide sufficient results, so use the following workaround: load the translations from a file.

In [5]:
transl_file = "../data/Translated_lemmas.csv"
transl_df = pd.read_csv(transl_file,sep=',',encoding = "ISO-8859-1") 

ceo_df['Dutch']= transl_df['Dutch']
# Add quotes in case of multiple words to enable 'phrase queries'.
ceo_df['Dutch']= ceo_df.apply(lambda x:quote_phrase(x['Dutch']), axis=1)
ceo_list = list(ceo_df['Dutch'])
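
`quote_phrase` is imported from `baseline_search` and its implementation is not shown here. A minimal sketch of what such a helper could look like, assuming it wraps multi-word keywords in double quotes so that a query parser treats them as a phrase and leaves single words untouched. The name `quote_phrase_sketch` and the example keywords are hypothetical.

```python
def quote_phrase_sketch(keyword):
    """Wrap a multi-word keyword in double quotes; leave single words as-is."""
    keyword = keyword.strip()
    already_quoted = keyword.startswith('"') and keyword.endswith('"')
    if " " in keyword and not already_quoted:
        return '"{}"'.format(keyword)
    return keyword

# Hypothetical keywords, for illustration only
for kw in ["bom", "in brand steken"]:
    print(quote_phrase_sketch(kw))
```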

Create dictionary with CEO class per unique keyword

In [6]:
ceo_class_df = pd.DataFrame(data =ceo_df[['CEO class','Dutch']]).drop_duplicates(subset= 'Dutch')
ceo_class_df = ceo_class_df.drop(columns=['Dutch']).set_index(ceo_class_df['Dutch'])
ceo_class_dict = ceo_class_df.transpose().to_dict(orient = 'records')[0]

#### b.  Keywords from WW2 Thesaurus ####

The WW2 Thesaurus describes keywords related to events, locations, concepts and objects from the Second World War.

In [7]:
thes_file ="../data/Geweldslexicon obv WOIIThesaurus SH en JW.csv"
thes_df = pd.read_csv(thes_file,sep=',',encoding = "ISO-8859-1")

Two annotators each manually selected the relevant violence keywords.

Take either the union or the intersection of the two manual selections.

In [8]:
susan = thes_df['Susan'].dropna().str.lower().tolist()
jeroen = thes_df['Jeroen'].dropna().str.lower().tolist()

thes_intersect = [x for x in susan if x in jeroen]
thes_union = set(susan+jeroen)
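
The two options differ in strictness: the intersection keeps only keywords that both annotators selected, while the union keeps every keyword that at least one of them selected. A small sketch with made-up selections (not the actual annotations), using sets so that duplicates within one list do not matter:

```python
# Hypothetical selections by two annotators
annotator_1 = ["bombardement", "razzia", "executie", "razzia"]
annotator_2 = ["razzia", "executie", "deportatie"]

# Keywords both annotators agree on (stricter selection)
agreed = sorted(set(annotator_1) & set(annotator_2))
# Keywords selected by at least one annotator (broader selection)
either = sorted(set(annotator_1) | set(annotator_2))

print(agreed)
print(either)
```

Note that the list comprehension in the cell above keeps duplicates from the first list, whereas the set-based version does not.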

### 4. Perform search ###

Using whoosh library:
* Define query parser: which schema, which search fields, AND/OR search
* Define searcher: which scoring approach

Store the information from the results object in a pandas DataFrame, which contains
* all keywords, also those that are not present in documents
* all documents, also those that have no keywords
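
`search_corpus` is defined in `baseline_search` and not shown here. To illustrate the shape of the table described above, the sketch below builds a fragment-by-keyword count table in plain Python, keeping rows for fragments without hits and columns for keywords that never occur. The fragment names, the texts and the simple token counting are assumptions for illustration; the real search goes through the Whoosh index.

```python
def count_table(fragments, keywords):
    """Return {fragment_name: {keyword: count}} for every fragment and keyword."""
    table = {}
    for name, text in fragments.items():
        tokens = text.lower().split()
        # Every keyword gets an entry, also when its count is zero
        table[name] = {kw: tokens.count(kw) for kw in keywords}
    return table

# Hypothetical fragments and keywords
toy_fragments = {
    "fragment_a": "de soldaat schieten en schieten op de brug",
    "fragment_b": "wij wonen in het dorp",
}
toy_keywords = ["schieten", "bom"]

toy_table = count_table(toy_fragments, toy_keywords)
for name, row in toy_table.items():
    print(name, row)
```

Keeping the all-zero rows and columns makes it possible to report later which fragments were never retrieved and which keywords were never found.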

In [34]:
indexdir = os.path.join(os.sep,search_dir,"indexdir")

# Search results for selected keyword list
merged_df = search_corpus(indexdir,ceo_list)

### 5. Compare with ground truth ###

1. From ground truth: determine which fragments are relevant, i.e., violent fragments 

In [35]:
# Prepare ground truth dataframe
gt_df = pd.read_csv(gt_file,sep=',',encoding = "ISO-8859-1") 

# Random set for manual annotation contains duplicate fragments; remove these
gt_all_df = gt_df.drop_duplicates(subset = ['Titel'])
# Select all violent fragments
gt_vio_df = gt_all_df[gt_all_df['JA/NEE'].isin(['ja','Ja'])]

# Add columns with clean titles to enable comparison
titles = gt_vio_df['Titel'].tolist()
new_titles = [title.split('_text')[0] for title in titles]
se = pd.Series(new_titles)
gt_vio_df['clean title'] = se.values

2. From baseline: determine which fragments are correctly marked as violent

In [36]:
merged_df['hits']=merged_df.sum(axis=1)
base_vio_df = merged_df[merged_df['hits'] > 0]
base_vio_df.shape

(371, 173)

In [37]:
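# 'hits' was already added to merged_df in the previous cell; summing again
# would also count the 'hits' column itself, so only the selection is repeated
base_vio_df = merged_df[merged_df['hits'] > 0].copy()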

# Add columns with clean titles to enable comparison
titles = base_vio_df.index.values.tolist()
new_titles = [title.split('_lemma')[0] for title in titles]
se = pd.Series(new_titles)
base_vio_df['clean title'] = se.values

# Select correctly annotated violent fragments
base_vio_df['correct']= base_vio_df['clean title'].isin(gt_vio_df['clean title'])
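
The two title columns are made comparable by cutting each name at a source-specific marker (`_text` for the ground truth, `_lemma` for the search results). A minimal sketch of that matching step in plain Python, with invented file names:

```python
def clean_title(title, marker):
    """Keep the part of the title before the first occurrence of `marker`."""
    return title.split(marker)[0]

# Hypothetical names; the real titles are much longer
gt_titles = ["interview01_p3_text.txt", "interview02_p7_text.txt"]
hit_titles = ["interview01_p3_lemma.txt", "interview05_p1_lemma.txt"]

gt_clean = {clean_title(t, "_text") for t in gt_titles}
hit_clean = [clean_title(t, "_lemma") for t in hit_titles]

# True where a retrieved fragment is also marked violent in the ground truth
correct = [t in gt_clean for t in hit_clean]
print(list(zip(hit_clean, correct)))
```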

3. Determine recall and precision

In [38]:
relevant = len(gt_vio_df['clean title'])
retrieved = len(base_vio_df['clean title'])
correct_retrieved = len(base_vio_df[base_vio_df['correct'] == True])

recall = correct_retrieved/relevant
precision = correct_retrieved/retrieved

print('Recall: {}'.format(recall))
print('Precision: {}'.format(precision))


Recall: 0.5381944444444444
Precision: 0.41778975741239893
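
Recall is the share of relevant fragments that were retrieved; precision is the share of retrieved fragments that are relevant. The sketch below computes both from two collections of titles; the helper name and the toy titles are illustrative only. The printed values above appear consistent with 155 correct fragments out of 288 relevant and 371 retrieved (371 matches the row count of `base_vio_df` shown earlier). These counts are inferred from the output, not stated in the notebook.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of fragment titles."""
    relevant, retrieved = set(relevant), set(retrieved)
    correct = len(relevant & retrieved)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 2 of 3 retrieved fragments are relevant, 2 of 4 relevant ones are found
toy_precision, toy_recall = precision_recall(
    relevant={"f1", "f2", "f3", "f4"},
    retrieved={"f1", "f2", "f9"},
)
print(toy_precision, toy_recall)

# Inferred counts (an assumption derived from the printed output above)
print(155 / 288, 155 / 371)
```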


In [39]:
compare_report = os.path.join(report, "base_vs_gt.csv")

base_vio_df.to_csv(compare_report)

# Which keywords were successful?
correct_df = base_vio_df[base_vio_df['correct'] == True]
clean_correct_df = correct_df.loc[:, (correct_df != 0).any(axis=0)]
clean_correct_df.columns

Index(['aanval', 'arm', 'arresteren', 'bezetting', 'bom', 'bombardement',
       'brand', 'breuk', 'conflict', 'doden', 'dood', 'doodslag',
       'gebombardeerd', 'geslagen', 'gesloten', 'gevangen', 'gevangenis',
       'geweer', 'geweld', 'gewond', 'hangen', 'herrie', 'inval', 'kwaad',
       'missen', 'moord', 'onderzoek', 'oorlog', 'raken', 'roken', 'rook',
       'schade', 'schieten', 'schot', 'staking', 'steken', 'stempel',
       'sterven', 'straf', 'straffen', 'strijd', 'vermoorden', 'vernietigen',
       'veroordeling', 'verraden', 'verwoesting', 'vluchten', 'wond',
       'zelfmoord', 'hits', 'clean title', 'correct'],
      dtype='object')

In [None]:
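# Create Series with the number of hits per successful keyword
# (drop the helper columns, which are not keywords)
helper_cols = ['hits', 'clean title', 'correct']
successful_keywords = clean_correct_df.drop(columns=helper_cols, errors='ignore').sum()

successful_keywords = successful_keywords.sort_values(ascending=True)
# Plot the (up to) 35 keywords with the most hits
p6 = successful_keywords.iloc[-35:].plot(kind='barh', figsize = (12,10))

p6.set_title('Figure 6: Total nr of hits for successful keywords')

p6.get_figure().savefig(os.path.join(report,"Succesful_keywords.png"))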

### 6. Process and describe results ###

* Describe general characteristics of baseline search
    * Original corpus size
    * Number of keywords
    * csv with raw data

In [None]:
# Open the existing index to count all documents in the corpus
from whoosh import index
ix = index.open_dir(indexdir)
all_docs = ix.searcher().documents()
summed_docs = sum(1 for x in all_docs)

# Assumption: the fragments with at least one hit are the rows of base_vio_df
summed_results = len(base_vio_df.index)
percent_hits = (summed_results/summed_docs)*100

# Assumption: the query used ceo_list; a keyword counts as found if it has at least one hit
keyword_hits = merged_df.drop(columns=['hits'], errors='ignore').sum()
found_keywords = int((keyword_hits > 0).sum())
percent_keywords = (found_keywords/len(set(ceo_list)))*100

In [None]:
report_summary = os.path.join(report,"Baseline_summary.txt")

# The with-statement closes the file automatically
with open(report_summary, 'w') as file_handler:
    # Add path to corpus
    file_handler.write("Original corpus size: %s \n"%summed_docs)
    file_handler.write("Number of snippets with keyword(s) present: %s \n"%summed_results)
    file_handler.write("Percentage snippets with keyword(s) in corpus: %s \n"%percent_hits)
    # Add path to keywords
    file_handler.write("Total number of unique keywords found: %s \n"%found_keywords)
    file_handler.write("Percentage keywords found wrt set used in query: %s \n"%percent_keywords)

In [None]:
# Store raw data
report_raw_data = os.path.join(report,"Baseline_results.csv")
merged_df.to_csv(report_raw_data)

* Analyze results
    * nr keywords found per document
    * nr keywords found per category
    * nr hits found per keyword 
    * missed keywords
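
A compact sketch of the four analyses listed above, on a toy fragment-by-keyword table in plain Python. The table, the keyword-to-class mapping and the class names are invented for illustration; the notebook itself works on `merged_df` with pandas.

```python
from collections import Counter

# Toy hit counts: {fragment: {keyword: hits}}
toy_table = {
    "fragment_a": {"schieten": 2, "bom": 1, "staking": 0},
    "fragment_b": {"schieten": 0, "bom": 0, "staking": 0},
    "fragment_c": {"schieten": 1, "bom": 0, "staking": 0},
}
# Hypothetical mapping from keyword to class
toy_classes = {"schieten": "Shooting", "bom": "Bombing", "staking": "Strike"}

# Total hits per fragment, and how many fragments share each total
hits_per_doc = {doc: sum(row.values()) for doc, row in toy_table.items()}
doc_distribution = Counter(hits_per_doc.values())

# Total hits per keyword, and per class
hits_per_keyword = Counter()
for row in toy_table.values():
    hits_per_keyword.update(row)
hits_per_class = Counter()
for kw, n in hits_per_keyword.items():
    hits_per_class[toy_classes[kw]] += n

# Keywords that never occur in the corpus
missed = sorted(kw for kw in toy_classes if hits_per_keyword[kw] == 0)
print(hits_per_doc, dict(doc_distribution), dict(hits_per_keyword), missed)
```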


In [None]:
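# Create Series with the number of documents per total hit count
# (exclude the helper column 'hits', which is itself a row sum)
sum_docs = merged_df.drop(columns=['hits'], errors='ignore').sum(axis=1).value_counts()
sum_docs = sum_docs.sort_index()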

p1 = sum_docs.plot(kind='bar')

p1.set_title('Figure 1: Distribution of keywords (total nr) in the searched corpus')

p1.get_figure().savefig(os.path.join(report,"Hits_per_document.png"))

In [None]:
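# Create Series with the total number of hits per keyword
# (exclude the helper column 'hits')
sum_keywords = merged_df.drop(columns=['hits'], errors='ignore').sum()

sum_keywords = sum_keywords.sort_values(ascending=True)
# Plot the 35 most frequent keywords
p2 = sum_keywords.iloc[-35:].plot(kind='barh', figsize = (12,10))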

p2.set_title('Figure 2: Total nr of hits for top 35 most frequent keywords')

p2.get_figure().savefig(os.path.join(report,"Freq_keywords.png"))


In [None]:
# Store list of keywords that were not present in analyzed corpus
report_missed_keywords = os.path.join(report,"Keywords_not_found_in_corpus.csv")

missed_keywords = sum_keywords[sum_keywords == 0]
missed_keywords.to_csv(report_missed_keywords)

In [None]:
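# Assumption: ceo_df (loaded above) is the keyword table with the 'CEO class' column
non_class_labels = ['/B','/M','/P','\\P','/O']
count_ceo = ceo_df['CEO class'].value_counts(sort=True,ascending=True).drop(labels=non_class_labels, errors='ignore')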

p3 = count_ceo.plot(kind='barh', figsize = (12,10))

p3.set_title('Figure 3: Distribution of classes of keywords in the query')

p3.get_figure().savefig(os.path.join(report,"Keyword_classes.png"))

In [None]:
# Look up the CEO class of each keyword in the dictionary created in step 3a
ceo_sum_keywords = [ceo_class_dict[i] for i in sum_keywords.index]

serie = pd.Series(sum_keywords.values, index=ceo_sum_keywords)

res = serie.groupby(serie.index).sum()
res = res.drop(labels=['\\P','/M','/B'], errors='ignore').sort_values()

p4 = res.iloc[-35:].plot(kind='barh', figsize = (12,10))

p4.set_title('Figure 4: Distribution of classes of keywords in search results')

p4.get_figure().savefig(os.path.join(report,"Freq_keywords_classes.png"))

