# ATAP Concordancer

## Explanation

**Introduction**
- ConcordanceLoader is a class that loads text data from CSV files, text files, and existing DataFrames. Once an instance is created, a keyword can be searched in the data and its concordance (i.e. where it occurs, together with its surrounding context) is shown.

- The main advantages of this class over other concordance tools are:
    1. The ability to work with multiple data inputs: files (CSV, text) and DataFrames.
    
    2. Most concordance tools only show the context of a keyword within the line the keyword is in. The context the ConcordanceLoader can work with spans more than that line and is limited only by how the data is grouped into chunks (see below).
    
    3. When loading structured data that has not only text but also other descriptive dimensions (for instance a CSV with a text column and other columns describing the text), this tool can search for the context by keyword and also display the other descriptive columns associated with the matching text.
    
    4. Natural language processing tools drive the keyword search, so the ConcordanceLoader could in the future be used in more versatile ways (for instance with languages other than English).
    
**How it works**

- Lines of text are grouped into chunks, and each row is tagged with its row number. The chunk variable is an integer giving the number of lines grouped into each chunk (i.e. the size of one chunk in lines). The context a keyword appears in is bounded by the chunk it resides in. A larger chunk value groups the data more coarsely, offering greater context (at the expense of loading time in some cases).

- Text files are a special case: a symbol can be supplied, which is used to search the text and split it into key-value pairs. The ConcordanceLoader filters the text for these key-value pairs and converts the matches into a two-column DataFrame.
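
As a rough illustration of the key-value splitting described above, the sketch below splits each line on a symbol and keeps only the lines that contain it. The helper name `split_key_value`, the choice to split on the first occurrence of the symbol, and the whitespace stripping are assumptions made for illustration; they are not taken from the ConcordanceLoader source.

```python
import re


def split_key_value(text, symbol):
    """Hypothetical helper: split lines of the form 'key<symbol>value'."""
    # Escape the symbol so characters such as '|' or '.' are matched literally.
    # The non-greedy first group makes the split happen at the first occurrence.
    pattern = re.compile(r"^(.*?)" + re.escape(symbol) + r"(.*)$")
    rows = []
    for line in text.splitlines():
        match = pattern.match(line)
        if match:  # lines without the symbol are filtered out
            rows.append((match.group(1).strip(), match.group(2).strip()))
    return rows


sample = "Alice: Hello there\nno symbol on this line\nBob: Hi, Alice: how are you?"
rows = split_key_value(sample, ":")
for key, value in rows:
    print(f"{key!r} -> {value!r}")
```

The resulting list of pairs has the same shape as a two-column table, so it could be handed to `pandas.DataFrame(rows, columns=["key", "value"])` (column names assumed here) to obtain a DataFrame.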


**Limitations:**
- If the word you are matching sits at the start of a chunk, the left context is limited by the start of that chunk. A larger chunk integer is suggested.
- Lines are tagged with a --[line_number] symbol in the text (which can be removed from the widget display). However, if the raw data already contains this pattern, it could be confused with the line tags.
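
Both limitations can be reproduced with a small stand-alone sketch of chunking and tagging. Everything here is an assumption for illustration (the helper names, placing the `--[n]` tag at the end of each line, joining lines with a space, and the fixed character `width`); it mimics the behaviour described above rather than the ConcordanceLoader's actual code.

```python
import re

# Pattern for the line tags, e.g. " --[3]" (tag placement is an assumption).
TAG = re.compile(r"\s*--\[\d+\]")


def chunk_lines(lines, chunk):
    """Tag each line with --[n], then join every `chunk` lines into one string."""
    tagged = [f"{line} --[{n}]" for n, line in enumerate(lines, start=1)]
    return [" ".join(tagged[i:i + chunk]) for i in range(0, len(tagged), chunk)]


def concordance(chunks, keyword, width=30):
    """Return (left, match, right) triples; context never crosses a chunk."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for text in chunks:
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((left, m.group(0), right))
    return hits


def strip_tags(text):
    """Remove line tags for display; also removes look-alike raw text."""
    return TAG.sub("", text)


lines = ["the cat sat", "on the mat", "cat naps follow", "in the sun"]
small = concordance(chunk_lines(lines, 2), "cat")
large = concordance(chunk_lines(lines, 4), "cat")

# With chunk=2 the second "cat" opens its chunk, so it has no left context.
print(repr(small[1][0]))
# With chunk=4 the same match sees the preceding lines.
print(repr(strip_tags(large[1][0])))
# Raw text that already contains the tag pattern is stripped along with real tags.
print(strip_tags("see note --[7] here --[1]"))
```

Raising the chunk size from 2 to 4 restores the left context of the second match, which is why a larger chunk integer is suggested. The last call shows the tag collision: the `--[7]` that belonged to the raw text disappears together with the genuine `--[1]` tag.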

## Tool

Upload either a CSV or text file here:

In [None]:
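from ipywidgets import FileUpload
from IPython.display import display
from src.atap_widgets.concordance import ConcordanceLoader

# Restrict uploads to the two supported file types.
uploader = FileUpload(accept=".csv,.txt")
display(uploader)
symbol = None  # default: no key/value splitting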

Provide a key/value splitting symbol or leave empty for no splitting.

In [None]:
symbol = input("Enter a symbol to split the text into key/value pairs (leave empty for none): ")
if symbol == "":
    symbol = None

Ensure you have uploaded a file, then run the block below to show the widget.

In [None]:
uploaded = len(uploader.value) > 0
if uploaded:
    uploaded_file = uploader.value[0]
    file_name = uploaded_file.name
    with open(file_name, "wb") as fp:
        fp.write(uploaded_file.content)
    
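    # Take the extension without the dot (e.g. "csv" or "txt") rather than the last three characters.
    file_type = uploaded_file.name.rsplit(".", 1)[-1].lower()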
    
    concordance_loader = ConcordanceLoader(path=file_name, type=file_type, re_symbol_txt=symbol)
    concordance_loader.show()
else:
    print("Ensure you upload a file!")