**Introduction**
- ConcordanceLoader is a class that loads text data from CSV files, plain text files, and existing DataFrames. Once an instance is created, a keyword can be searched for in the data, and each occurrence is shown in its surrounding context (its concordance).

- The main advantages of this class over other concordance tools are:
    1. The ability to work with multiple data inputs: files (CSV, text) and DataFrames.
    
    2. Most concordance tools show the context of a keyword only within the line the keyword is on. The context ConcordanceLoader works with can span more than that line and is limited only by how the data is grouped into chunks (see below).
    
    3. When loading structured data that has descriptive dimensions as well as text (for instance, a CSV with a text column and other columns describing the text), this tool can search for the keyword in context and also display the descriptive columns associated with the matching text.
    
    4. Natural language processing tools drive the keyword search, so ConcordanceLoader could be used in more versatile ways in the future (for instance, with languages other than English).
    
**How it works**

- Lines of text are grouped into chunks, and each row is tagged with its row number. The chunk variable is an integer giving the number of lines grouped into each chunk (i.e. the size of one chunk in lines). The context a keyword appears in is bounded by the chunk it resides in. A larger chunk value groups the data more coarsely, offering greater context (at the expense of loading time in some cases).

- Text files are a special case: a symbol can be assigned that is used to search and split each line into key-value pairs. ConcordanceLoader filters the text for these key-value pairs and converts the matches into a two-column DataFrame.
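
The chunking described above can be sketched with plain pandas. This is a minimal illustration, not the library's actual implementation: the helper name `tag_and_chunk` is hypothetical, and the `row--text` tag format is an assumption inferred from the grouped output shown in Demo 1.

```python
import pandas as pd


def tag_and_chunk(df: pd.DataFrame, chunk: int, text_col: str = "text") -> pd.DataFrame:
    """Hypothetical helper: tag each row with its number and assign a chunk id."""
    if chunk < 1:
        raise ValueError("chunk must be a positive integer")
    out = df.reset_index(drop=True).copy()
    out["row"] = range(len(out))
    # Integer division puts `chunk` consecutive rows into the same group.
    out["chunk"] = out["row"] // chunk
    # Prefix each line with its row number (assumed tag format "N--").
    out[text_col] = out["row"].astype(str) + "--" + out[text_col].astype(str)
    return out


lines = pd.DataFrame(
    {"text": ["first line", "second line", "third line", "fourth line", "fifth line"]}
)
tagged = tag_and_chunk(lines, chunk=2)

# Joining the rows of one chunk gives the region a keyword's context can span.
chunk_text = tagged.groupby("chunk")["text"].apply(" ".join)
print(tagged)
print(chunk_text)
```

With `chunk=2`, five lines fall into three groups, and a keyword's context can extend across the two lines of its group but no further.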


**Limitations:**
- If the word you are matching is at the start of a chunk, the left context is limited by the start of that chunk. A larger chunk integer is suggested.
- Lines are tagged with a [line_number]-- prefix in the text (which can be removed from the widget display). However, if the raw data already contains this pattern, it could be confused with the line tags.
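
The first limitation can be illustrated with a small, self-contained sketch. It assumes the context window is cut from the text of a single chunk; the function name `context_window` and its parameters are hypothetical and are not part of `atap_widgets`.

```python
import re


def context_window(chunk_text: str, keyword: str, window: int):
    """Hypothetical helper: return (left, match, right) for the first hit in one chunk."""
    hit = re.search(re.escape(keyword), chunk_text)
    if hit is None:
        return None
    # The left context cannot start before the beginning of the chunk.
    left_start = max(0, hit.start() - window)
    right_end = min(len(chunk_text), hit.end() + window)
    return (
        chunk_text[left_start:hit.start()],
        hit.group(0),
        chunk_text[hit.end():right_end],
    )


small_chunk = "Wombats are my favourite."
large_chunk = "Kangaroos and koalas. Wombats are my favourite."

# Keyword at the very start of the chunk: no left context is available.
print(context_window(small_chunk, "Wombats", window=10))
# The same keyword inside a larger chunk: left context is available.
print(context_window(large_chunk, "Wombats", window=10))
```

However large the window is, the left context in the first call stays empty, which is why a larger chunk value is the suggested remedy.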

In [4]:
import pandas as pd
import spacy
#from atap_widgets.concordance import ConcordanceTable, ConcordanceWidget
from atap_widgets.concordance import ConcordanceTable, ConcordanceWidget, ConcordanceLoader
from atap_widgets.concordance import prepare_text_df
import dask.bag as db
import re


In [9]:
# Set up references to some example data


def basic_spacy_nlp():
    return spacy.lang.en.English()

Question_Answer_Dialogue = '../tests/data/D.QandA_Dummy.txt'

MarkScottSpeach = "../tests/data/MarkScottNationalPressClub.txt"


## ConcordanceLoader Demo 1

Run the below code and explore:

* Keyword searches, and the options to toggle case sensitivity, regular expression matching and whole word matching.
* Increasing "Window Size (characters)" to bring in more context around the keyword.
* The "Show More" multiselect dropdown, which can bring in more than one column using "command + click" when choosing.
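
To make the three search toggles concrete, here is a rough sketch of how they could map onto a single regular expression using Python's `re` module. The helper `build_pattern`, its parameter names and its defaults are assumptions for illustration; the widget's internal logic may differ.

```python
import re


def build_pattern(keyword: str, use_regex: bool = False, whole_word: bool = False,
                  ignore_case: bool = True) -> re.Pattern:
    """Hypothetical helper mapping three search toggles onto one compiled regex."""
    # Treat the keyword literally unless regular expressions are enabled.
    body = keyword if use_regex else re.escape(keyword)
    if whole_word:
        # \b anchors the match at word boundaries, so "she" does not match "Sherlock".
        body = rf"\b(?:{body})\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(body, flags)


sentence = "To Sherlock Holmes she is always the woman."

loose = build_pattern("she")                      # substring, case-insensitive
strict = build_pattern("she", whole_word=True)    # whole word only
cased = build_pattern("She", ignore_case=False)   # case-sensitive substring
print(loose.findall(sentence))
print(strict.findall(sentence))
print(cased.findall(sentence))
```

Searching for "she" in the Sherlock Holmes data is a good way to see the difference: without whole word matching, the keyword is also found inside "Sherlock".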

In [10]:

CHUNK = 2

# Load the data into a ConcordanceLoader from a CSV file
DataCSV = ConcordanceLoader(type="csv", path="../tests/data/sherlock_for_testing.csv", chunk=CHUNK)

# ... or from an existing dataframe (sherlock_df is assumed to have been loaded beforehand)
#DataCSV = ConcordanceLoader(type="dataframe", df_input=sherlock_df, chunk=CHUNK)

# Call show() on the instance to display the widget
DataCSV.show() # For instance, search for "she" in the Sherlock Holmes data



<atap_widgets.concordance.ConcordanceLoaderWidget at 0x28eed76d0>

In [11]:
# Explore how the underlying data was grouped into chunks; the line tags in the text column are used internally

DataCSV.get_grouped_data()

| Unnamed: 0 | text | speaker | chunk | row |
|---|---|---|---|---|
| 0 | 0--To Sherlock Holmes she is always the woman. | A | 0 | 0 |
| 1 | 1--I have seldom heard him\n mention her un... | B | 0 | 1 |
| 2 | 2--In his eyes she eclipses and predominates t... | A | 1 | 2 |
| 3 | 3--It was not that he felt any emotion akin to... | B | 1 | 3 |
| 4 | 4--All emotions, and that one particularly, we... | A | 2 | 4 |
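
The numeric prefix on each `text` value in the output above is the line tag. A minimal sketch of how such tags could be stripped for display follows, assuming the tag is a run of digits followed by `--` at the start of the text or after whitespace. The pattern and the helper `strip_line_tags` are hypothetical, not the widget's own code.

```python
import re

# Assumed tag format: digits followed by "--" at the start of the text or after whitespace.
LINE_TAG = re.compile(r"(?:^|(?<=\s))\d+--")


def strip_line_tags(text: str) -> str:
    """Hypothetical helper: remove line tags such as "0--" from chunk text."""
    return LINE_TAG.sub("", text)


# Two tagged lines joined into one chunk (second line shortened for illustration).
chunk_text = "0--To Sherlock Holmes she is always the woman. 1--I have seldom heard him mention her"
print(strip_line_tags(chunk_text))

# Raw text that already contains the same pattern is stripped too,
# which is the ambiguity noted under Limitations.
print(strip_line_tags("Call 000--now"))
```

The second call shows why raw data containing the same pattern is a problem: a pattern-based approach cannot tell a genuine tag from text that merely looks like one.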


## ConcordanceLoader Demo 2: Larger, multi-dimensional (i.e. multi-column) text data

In [12]:
# More complex and larger debate data
CHUNK = 10 # Increase the chunk size to expand the context region, e.g. when searching for "time"
data = pd.read_excel("../tests/data/A.debate_clean.xlsx") # already has text_id
DataDF = ConcordanceLoader(type="dataframe", df_input=data, chunk=CHUNK)
DataDF.show() # Search "economy" or "environment" and bring in speaker from the "Show More" dropdown to find out who said what



<atap_widgets.concordance.ConcordanceLoaderWidget at 0x28fd69c90>
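
Outside the widget, the same "who said what" question can be approximated directly in pandas. The sketch below uses a tiny made-up dataframe as a stand-in for the debate data; the `speaker` and `text` column names are assumptions, and the rows are invented purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the debate data; only the column names matter here.
debate = pd.DataFrame(
    {
        "speaker": ["A", "B", "A"],
        "text": [
            "The economy is growing.",
            "We must protect the environment.",
            "Jobs depend on the economy.",
        ],
    }
)

# Keep rows whose text contains the keyword, then show who said it.
hits = debate[debate["text"].str.contains("economy", case=False, regex=False)]
print(hits[["speaker", "text"]])
```

Unlike the widget, this plain filter returns whole rows and gives no control over the context window, so it is best seen as a quick cross-check.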

## ConcordanceLoader Demo 3: Structured Text

In [6]:
# This is what the data looks like. Notice the key:value structure within the text.
! head -15 $Question_Answer_Dialogue

Question: What is your favourite animal in Australia?
Name 6: Kangaroos and koalas.

Question: What is your favourite animal in Australia?
Name 1: Wombats are my favourite.

Question: What is your favourite animal in Australia?
Name 10: I don’t know, but I know I don’t like any of the poisonous spiders and dangerous snakes!

Question: What is your favourite food in Australia?
Name 10: Tomatos for sure!

Question: What is your favourite food in Australia?
Name 6: I decline to answer that.
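
The `key: value` layout shown above could be split into two columns roughly as follows. This is a sketch under stated assumptions: the helper `split_key_value` is hypothetical, the split happens on the first occurrence of the symbol only, and lines without the symbol are skipped. ConcordanceLoader's own parsing may behave differently.

```python
import re

SYMBOL = r":"  # separator symbol, treated as a regular expression

# Abridged excerpt of the question-and-answer data.
sample = """Question: What is your favourite animal in Australia?
Name 6: Kangaroos and koalas.

Question: What is your favourite food in Australia?
Name 10: Tomatos for sure!"""


def split_key_value(line: str, symbol: str = SYMBOL):
    """Hypothetical helper: split a line on the first occurrence of the symbol."""
    parts = re.split(symbol, line, maxsplit=1)
    if len(parts) != 2:
        return None  # no key/value structure on this line
    key, value = parts
    return key.strip(), value.strip()


pairs = [kv for kv in map(split_key_value, sample.splitlines()) if kv is not None]
for key, value in pairs:
    print(f"{key!r} -> {value!r}")
```

Splitting on the first occurrence only keeps any later colons inside the value, which matters for free text answers.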



In [14]:
# For text files, you can define a symbol to split lines, assuming all relevant information has the structure key [SYMBOL] value.
# The keyword is searched for in the value field, and the additional key column (whatever came before the SYMBOL) can be selected.
symbol = r':'

CHUNK = 4

DataDF = ConcordanceLoader(type="txt", path=Question_Answer_Dialogue, re_symbol_txt=symbol, chunk=CHUNK)

DataDF.show() # Search "tomatos" and pick "key" in "Show More" to bring in the key associated with the text.


AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'

In [8]:
# As a side note, the underlying dataframe can be used for further analysis.
DataDF.get_grouped_data().sort_values('key').key.unique()


array(['Name 1', 'Name 10', 'Name 6', 'Name11', 'Question'], dtype=object)

## ConcordanceLoader Demo 4: Plain text

In [9]:
# The underlying text communicates the University of Sydney 2022 Strategy
! head -n 10 $MarkScottSpeach



I acknowledge that we meet today on the ancestral lands of the Ngunnawal people, the traditional custodians of this land. I pay my respects to elders past and present, and those who have cared for and continue to care for country.

It’s great to be with you.

The University of Sydney is Australia’s oldest university. We took in our first students in 1852 and just yesterday released our aspirations for the decade through to 2032, by which time we’ll be closing in on the end of the University’s second century.

In considering our future, we humbly acknowledge that for hundreds of centuries before the University of Sydney opened its doors, generations of First Nations peoples have been exchanging knowledge on the ancestral lands on which the University’s campuses and facilities now stand. And as we create a university for the future, we aim to extend and build upon this prior knowledge.



In [10]:
# Load text without any key-value structure
CHUNK = 4
DataDF = ConcordanceLoader(type="txt", path=MarkScottSpeach, chunk=CHUNK)
DataDF.show() # Search for "pandemic", for instance



<atap_widgets.concordance.ConcordanceLoaderWidget at 0x29e21c190>

### Simpler functionality that reflects the older DataWidget and Concordance Table development is still present

In [13]:

original_data = DataCSV.get_original_data()

original_data.head() # chunk and row columns are added to the original data

data = pd.read_csv("../tests/data/sherlock_for_testing.csv")
data = prepare_text_df(data)

table = ConcordanceTable(df=data, keyword="she")
table

search_results_df = table.to_dataframe() # extract the results into a dataframe
search_results_df.head()

oldWidget = ConcordanceWidget(data) # run the simpler widget (no chunks or context)
oldWidget.show()



