## Baseline search engine for EviDENce

Search strategy

1. Collect corpus to perform search on
2. Index documents in corpus
3. Collect Keywords
4. Construct query
5. Perform search
6. Analyze results

In [None]:
import numpy as np
import os
import pandas as pd
import string
import sys

**1. Collect corpus to perform search on**

Our corpus consists of oral history accounts.
These are broken up into text fragments of 100 lemmas each and can be found in a zip archive on surfdrive:

../Data/NR-teksts/EviDENce_NR_output/TargetSize100/Lemma_preserve_paragraph.zip

*Make sure to extract the zip archive to a high-level location on your computer to avoid a "path-too-long" error.*


In [None]:
# Provide path to extracted folder with lemma fragments
root = os.path.join(os.sep,"media","sf_MartinedeVos")
search_dir = os.path.join(os.sep,root,"lem_par_150","lemma_preserve_paragraph")

In [None]:
# Path to alternative folder with lemma fragments
surf = os.path.join(os.sep,root,"surfdrive","Projects", "EviDENce","Data","NR-Teksts","EviDENce_NR_output")
alt_search_dir = os.path.join(os.sep,surf,"Size200","fragmented_lemmas")

**2. Index all documents (i.e., lemma fragments) in the directory**

* Create Schema
* Add documents
* Perform indexing

_NB: this step only has to be run once, or when data is added or changed_

In [None]:
from baseline_search import create_searchable_data

# The index only needs to be created once; after that, opening the existing index is sufficient.
# In that case, the following line should remain commented out.

#create_searchable_data(search_dir)


**3. Collect list of keywords from CEO-ECB mappings**

Keywords are based on mappings from classes of the Circumstantial Event Ontology (CEO) on the ECB+ corpus

Preprocessing entails:
* express keywords as lemmas to ensure more effective matching 
* manually select keywords related to violence
* translate selected keywords to Dutch

In [None]:
# The following functions use the google translate API 
# As this API has stability issues, there is a workaround in the next cell
from baseline_search import create_lemma_list
from baseline_search import eng_to_dutch_list

mention_file ="../data/MdV_selectedCEOECB.csv"

#en_mentions = create_lemma_list(mention_file)
#nl_mentions = eng_to_dutch_list(en_mentions)

In [None]:
import pandas as pd

mention_df = pd.read_csv(mention_file,sep=';',encoding = "ISO-8859-1")
# Select both columns with a list of column names (double brackets)
mention_df[['Mention','CEO class']]

In [None]:
import pandas as pd

prefab_file = "../data/nl_mentions.csv"
prefab_mentions = pd.read_csv(prefab_file,sep=';',encoding = "ISO-8859-1")
nl_mentions = list(prefab_mentions["Mention"])

**4. Construct query**

* Sort keywords
* Add double quotes to phrase queries
* Concatenate all keywords into one query string

In [None]:
from baseline_search import quote_phrases

nl_mention_list = list(set(nl_mentions))
quoted_nl_mention_list = quote_phrases (nl_mention_list)
nl_mention_query = ",".join(quoted_nl_mention_list)
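
The three query construction steps can be illustrated without the `baseline_search` module. The sketch below is an assumption about what `quote_phrases` does (wrap multi-word keywords in double quotes so they are treated as phrase queries); the helper name `quote_phrases_sketch` and the example keywords are made up for illustration.

```python
def quote_phrases_sketch(keywords):
    """Wrap multi-word keywords in double quotes; leave single words untouched."""
    quoted = []
    for keyword in keywords:
        keyword = keyword.strip()
        if " " in keyword:
            quoted.append('"%s"' % keyword)
        else:
            quoted.append(keyword)
    return quoted


# Made-up Dutch keywords, including a duplicate and a multi-word phrase
toy_keywords = ["slaan", "in brand steken", "schieten", "slaan"]

# Sort and deduplicate, quote the phrases, then concatenate into one query string
toy_unique = sorted(set(toy_keywords))
toy_query_string = ",".join(quote_phrases_sketch(toy_unique))
print(toy_query_string)
```

Sorting the deduplicated keywords makes the query string reproducible between runs, which a bare `set` does not guarantee.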

**5. Perform search**

Using the whoosh library:

* Define query parser: which schema, which search fields, AND/OR search
* Define searcher: which scoring approach
* Store info from results object in pandas dataframes

In [None]:
from collections import defaultdict
import pandas as pd
from whoosh import scoring
from whoosh import qparser
from whoosh.index import open_dir

indexdir = os.path.join(os.sep,search_dir,"indexdir")
ix = open_dir(indexdir)

parser = qparser.QueryParser("content", schema=ix.schema,group=qparser.OrGroup)
my_query = parser.parse(nl_mention_query)

cols_list = []
titles_list = []

with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
    results = searcher.search(my_query,limit=None, terms = True)
    for res in results:
        titles_list.append(res["title"])
        #row_dict = {}
        col_dict = defaultdict(int)
        hits = [term.decode('utf8')  for where,term in res.matched_terms()]
        for hit in hits:
            col_dict[hit]+= 1
            #row_dict[hit] = row_dict.get(hit, 0) + 1  
        cols_list.append(col_dict)

#Create a dataframe for results of this search, i.e. with a limited set of keywords 
results_df = pd.DataFrame(cols_list)
results_df.set_index([titles_list], inplace=True)

In [51]:
with ix.searcher() as searcher:
    index_dic = {doc['title']:[doc['textdata']] for doc in searcher.all_stored_fields()}   

index_df = pd.DataFrame.from_dict(index_dic, orient='index')    
index_df

| | 0 |
|---|---|
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_805-813_lemma | dus dat zijn echt een huiselijk tafereel wat j... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_814-822_lemma | je hebben kans dat er misschien wel tweehonder... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_823-832_lemma | en als hij dan dus de slang hebben localiseren... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_833-839_lemma | en vanuit die kamp worden ze dus ook weer naar... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_840-844_lemma | en dan hebben je daar weer een ellende en dan ... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_845-851_lemma | en dan komen de militair politie er weer bij ,... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_852-863_lemma | de stemming veranderen je worden dus eh ... ja... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_864-873_lemma | en dan zijn dus de ding gebeuren die dus moeil... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_874-881_lemma | en dat staan ook in het boek van ' 60 jaar ext... |
| GV_SiteFilms_Java_05_conversation_clipped_150_paragraph_882-891_lemma | en dat zijn eigenlijk nog niet hoeven gebeuren... |


Search results dataframe contains:
* only those keywords that are found in documents
* only those documents that have one or more keywords

Combined dataframe contains: 
* all keywords, also those that are not present in documents
* all documents, also those that have no keywords

In [None]:
#Create a dataframe for all docs and keywords with empty values
keywords_dic = {term:0 for term in quoted_nl_mention_list}
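# Use a context manager so the searcher is closed after reading the titles
with ix.searcher() as searcher:
    list_docs = [doc['title'] for doc in searcher.documents()]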

all_df = pd.DataFrame(keywords_dic, index = list_docs)

# Create a dataframe for all docs and keywords with search results
merged_df = results_df.combine_first(all_df)

In [None]:
# Phrase queries are apparently still broken up into separate search terms;
# this shows up as surplus columns in merged_df
surplus = [col for col in merged_df if col not in all_df]
# Remove these for now as a workaround; the phrase query handling still needs to be fixed
merged_df = merged_df.drop(columns = surplus)
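
The difference between the search results dataframe and the combined dataframe, and the effect of the surplus workaround, can be shown on made-up data. In this sketch the fragment names and keywords are hypothetical; `brand` stands in for a single word split off from the phrase `"in brand steken"`.

```python
import pandas as pd

# Search results: only documents with hits, only terms that matched
toy_results = pd.DataFrame({"slaan": [1.0], "brand": [1.0]}, index=["frag_1"])

# Full grid: every document and every keyword in the query, all zeros
toy_keywords = ["slaan", "schieten", '"in brand steken"']
toy_all = pd.DataFrame({kw: 0 for kw in toy_keywords},
                       index=["frag_1", "frag_2", "frag_3"])

# combine_first keeps values from toy_results and fills the gaps from toy_all
toy_merged = toy_results.combine_first(toy_all)

# Columns that exist only in the results are fragments of phrase queries
toy_surplus = [col for col in toy_merged if col not in toy_all]
toy_merged = toy_merged.drop(columns=toy_surplus)

print(toy_surplus)
print(toy_merged)
```

Note that dropping the surplus columns also discards the partial phrase matches, so the phrase column itself stays at zero in this workaround.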

**6. Process and describe results** 

* Describe general characteristics of baseline search
    * Original corpus size
    * Number of keywords
    * csv with raw data

In [None]:
# Count all documents in the index; the context manager closes the searcher afterwards
with ix.searcher() as searcher:
    summed_docs = sum(1 for x in searcher.documents())

summed_results =len(results_df.index)
percent_hits = (summed_results/summed_docs)*100

percent_keywords = (len(results_df.columns)/len(quoted_nl_mention_list))*100

In [None]:
report_summary = os.path.join(root,"surfdrive","Projects", "EviDENce","Baseline summary.txt")

# The with-statement closes the file automatically, so no explicit close() is needed
with open(report_summary, 'w') as file_handler:
    # Add path to corpus
    file_handler.write("Original corpus size: %s \n"%summed_docs)
    file_handler.write("Number of snippets with keyword(s) present: %s \n"%summed_results)
    file_handler.write("Percentage of snippets with keyword(s) in corpus: %s \n"%percent_hits)
    # Add path to keywords
    file_handler.write("Total number of unique keywords found: %s \n"%len(results_df.columns))
    file_handler.write("Percentage of keywords found with respect to the set used in query: %s \n"%percent_keywords)

In [None]:
# Store raw data
report_raw_data = os.path.join(root,"surfdrive","Projects", "EviDENce","Baseline results.csv")
merged_df.to_csv(report_raw_data)

* Analyze results
    * nr keywords found per document
    * nr keywords found per category
    * nr hits found per keyword 
    * missed keywords


In [None]:
# Create Series for documents
sum_docs = pd.Series(merged_df.sum(axis=1)).value_counts(sort=True)
sum_docs = sum_docs.sort_index()

p1 = sum_docs.plot(kind='bar')

p1.set_title('Figure 1: Distribution of keywords (total nr) in the searched corpus')

p1.get_figure().savefig(os.path.join(root,"surfdrive","Projects", "EviDENce","Hits_per_document.png"))
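
What Figure 1 plots can be checked on a small made-up matrix: each row sum is the number of keywords found in one document, and `value_counts` then counts how many documents share each total. The fragment names and values below are hypothetical.

```python
import pandas as pd

# Made-up document-by-keyword matrix (1 = keyword found in document)
toy_matrix = pd.DataFrame(
    {"slaan": [1, 0, 1, 0], "schieten": [1, 0, 0, 0]},
    index=["frag_1", "frag_2", "frag_3", "frag_4"],
)

# Number of keywords found per document
toy_hits_per_doc = toy_matrix.sum(axis=1)

# Number of documents per keyword total, ordered by total
toy_distribution = toy_hits_per_doc.value_counts().sort_index()
print(toy_distribution)
```

The index of `toy_distribution` is the x-axis of the bar chart (keywords per document) and its values are the bar heights (number of documents).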

In [None]:
# Create Series for keywords
sum_keywords = pd.Series(merged_df.sum().iloc[:-1])

sum_keywords = sum_keywords.sort_values(ascending=True)
# After an ascending sort the 35 most frequent keywords are the last 35 entries
p2 = sum_keywords.iloc[-35:].plot(kind='barh', figsize = (12,10))

p2.set_title('Figure 2: Total nr of hits for top 35 most frequent keywords')

p2.get_figure().savefig(os.path.join(root,"surfdrive","Projects", "EviDENce","Freq_keywords.png"))


In [None]:
# Store list of keywords that were not present in analyzed corpus
report_missed_keywords = os.path.join(root,"surfdrive","Projects", "EviDENce","Keywords not found in corpus.csv")

missed_keywords = sum_keywords[sum_keywords == 0]
missed_keywords.to_csv(report_missed_keywords)
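
The two per-keyword selections used above, the most frequent keywords and the keywords that were never found, can be sketched on a made-up Series of totals. After an ascending sort the largest values sit at the end, so `iloc[-n:]` keeps the last n entries. The keywords and counts are hypothetical.

```python
import pandas as pd

# Made-up total number of hits per keyword
toy_totals = pd.Series({"slaan": 5, "schieten": 3, "bom": 0, "moord": 1})

# Top 2 keywords: sort ascending, then keep the last two entries
toy_top = toy_totals.sort_values(ascending=True).iloc[-2:]

# Missed keywords: totals that stayed at zero
toy_missed = toy_totals[toy_totals == 0]

print(list(toy_top.index))
print(list(toy_missed.index))
```

Keeping the ascending order is convenient for a horizontal bar chart, because the most frequent keyword then ends up at the top of the plot.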