# ATAP Concordancer

## Introduction

This notebook is a Concordancer tool. It lets users upload text data (e.g. a .csv or .txt file), search the text for every instance of a search term, and view the results as a concordance. The Concordancer retrieves all relevant instances of the search term, displays them in the tool, and makes them available for download as a CSV file for additional analysis. It has been designed specifically to allow users (i) to undertake ‘dialogic’ analysis (when the input consists of related text pairs, such as question-answer or social media post-response) and/or (ii) to make visible the metadata associated with each occurrence of the search term (when available in the input; for example, speaker identity, political affiliation, company, etc.).

To do so, the data loaded into the notebook must be ‘structured’: one column contains the ‘text’ (e.g. the question or social media post), and the other columns contain either the associated text of the dialogic pair (e.g. the relevant answer or the relevant reply/comment) or metadata (describing aspects of the text). This is explained further below. In addition to analysing structured data, the notebook can create its own structured data based on symbols present in the uploaded text(s), automatically splitting the text that precedes and follows the relevant symbol (e.g. a colon or a question mark). This is also explained and illustrated further below.

In sum, this notebook is not meant to offer every type of analysis available in current off-the-shelf Concordancers and should be considered complementary to such tools. You may want to use it if you are interested in using a Concordancer for dialogic analysis or for exploring the relationship between a search term and its metadata.

## File upload

Upload a single .txt or .csv file. Only one file can be uploaded and analysed at a time. Note that there is no progress indicator, but you will get a message if you run the next cell before the upload has completed.

In [3]:
from IPython.display import display
from ipywidgets import FileUpload
from src.atap_widgets.concordance import ConcordanceLoader

# Restrict the file picker to the two supported formats
uploader = FileUpload(accept=".csv,.txt")
display(uploader)


## How to use

### Preparation
1. Upload a file by clicking the above 'Upload' button
2. Run the code block below and wait for the concordancer tool to display

Note: if you want to analyse dialogic structures (question-answer; post-response) or metadata associated with your search term (such as the identity of the speaker, the date, etc.), you should upload your text data as a ‘structured’ .csv file. You can save an .xlsx spreadsheet as a .csv file within Excel (‘Save as’). Make sure that the text you want to analyse is in the column titled ‘text’. A mocked-up example is provided below.

![Structured text](./concordance_standalone_imgs/structured_eg.png)
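A structured file of this kind can also be produced programmatically. The following is a minimal sketch using Python's standard `csv` module. Only the `text` column name comes from the instructions above; the other column names (`answer`, `speaker`, `party`) and the row contents are made up for illustration.

```python
import csv
import io

# Hypothetical rows: "text" is the column the Concordancer searches;
# "answer", "speaker" and "party" are illustrative extra columns.
rows = [
    {
        "text": "Will the minister rule out further cuts?",
        "answer": "We have no plans to change the funding.",
        "speaker": "Member A",
        "party": "Opposition",
    },
    {
        "text": "When will the report be released?",
        "answer": "The report will be tabled next month.",
        "speaker": "Member B",
        "party": "Crossbench",
    },
]


def write_structured_csv(rows, handle):
    """Write the rows as CSV with a header line, 'text' column first."""
    fieldnames = ["text", "answer", "speaker", "party"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_structured_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To save the file to disk instead, pass a handle opened with `open("structured_example.csv", "w", newline="", encoding="utf-8")`; the file name here is only an example.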

### Search
1. Enter a search term into the search field and press enter on your keyboard to perform a search
2. Toggle the checkboxes below the search field to enable/disable regular expression matching, case sensitivity, and whole word matching

Note that without regular expressions the search field uses exact matching and considers punctuation (for example, a search for “oh my god” will NOT retrieve instances of “oh, my god”).
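To make the effect of the three checkboxes concrete, here is a minimal sketch of how such options could be translated into a Python regular expression. The helper `build_pattern` is hypothetical and is not the tool's own code; in particular, the default of case-insensitive matching is an assumption.

```python
import re


def build_pattern(term, use_regex=False, case_sensitive=False, whole_word=False):
    """Compile a search pattern from a term and three toggles (hypothetical helper)."""
    # Without regex mode, escape the term so it is matched literally,
    # punctuation and spaces included.
    body = term if use_regex else re.escape(term)
    if whole_word:
        # Require a word boundary on both sides of the term.
        body = r"\b(?:" + body + r")\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(body, flags)


sample = "Oh, my god. She said oh my god twice."

# Exact matching: the comma in "Oh, my god" prevents a match.
literal = build_pattern("oh my god")
print([m.group(0) for m in literal.finditer(sample)])

# Regex matching with an optional comma finds both variants.
flexible = build_pattern(r"oh,? my god", use_regex=True)
print([m.group(0) for m in flexible.finditer(sample)])

# Whole-word matching skips "godly" but keeps "god".
print(build_pattern("god", whole_word=True).findall("godly god"))
```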

Regular expressions can be used for advanced searches. Here are some examples:
- `\b(find\w*|\w*ness)\b` - matches words that start with 'find' or end in 'ness'
- `oh,? my god` - matches "oh my god" or "oh, my god"
- `\bhope\w{2}\b` - matches words that consist of "hope" followed by exactly 2 more characters
- `\bwomen(?:\s\w+)?\smen\b` - matches "women" followed by "men" with 0 or 1 words between them, e.g. "women and men" and "women or men" will both match
- `\bthe\s\w+\sof\b` - matches "the" followed by "of" with exactly 1 word between them, e.g. "the minister of" will match but "the of" will not
- `\b(his|him|himself)\b` - matches any of the following words: "him", "his", "himself"
- `\b\d+\s[A-Za-z]+\b` - matches a number followed by a whitespace character and a word made up of letters, e.g. "10 dogs" will match
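If you want to check a pattern before using it in the search field, you can try it with Python's `re` module. The sketch below tests several of the patterns listed above against short sample strings. The sample strings are made up for illustration, and the tool's own matching may differ in details such as case handling.

```python
import re

# (pattern, sample text, whether a match is expected)
examples = [
    (r"\bhope\w{2}\b", "hopers", True),
    (r"\bhope\w{2}\b", "hopeful", False),  # three characters follow "hope"
    (r"\bwomen(?:\s\w+)?\smen\b", "women and men", True),
    (r"\bwomen(?:\s\w+)?\smen\b", "women men", True),
    (r"\bwomen(?:\s\w+)?\smen\b", "women and young men", False),
    (r"\bthe\s\w+\sof\b", "the minister of finance", True),
    (r"\bthe\s\w+\sof\b", "the of", False),
    (r"\b(his|him|himself)\b", "he blamed himself", True),
    (r"\b(his|him|himself)\b", "history", False),
    (r"\b\d+\s[A-Za-z]+\b", "10 dogs", True),
    (r"\b\d+\s[A-Za-z]+\b", "ten dogs", False),
]


def has_match(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


for pattern, text, expected in examples:
    print(f"{pattern!r:32} {text!r:28} match={has_match(pattern, text)}")
```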

### Display
1. Use the 'Sort by' dropdown to sort by text_id, left context, or right context
    - The text_id field corresponds to the line number of the match in the text (where text_id is 0 for the first line). Sorting by text_id displays results in the order in which they appear in the text.
    - If sorting by left or right context, sorting is done in alphabetical order.
2. If your data contains metadata columns, use the 'Show More' field to select a metadata column to display.
    - To select multiple metadata columns, hold the control/command key and click each column.
3. Export the results to an Excel spreadsheet by providing an appropriate file name and clicking the button labelled "Export to Excel".
   The spreadsheet will appear in the Jupyter file window on the left and can be downloaded by right-clicking the file and clicking "Download".

- If the context windows don't display all the text you would like to display, change the window size using the "Window size" field
- If there are many results from the search, navigate through the pages of results using the 'Page' navigator field
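The display options above (context window, text_id, sorting by left or right context) can be illustrated with a small keyword-in-context sketch. This is not the tool's implementation: the function `concordance_lines` is hypothetical, and measuring the window in characters is an assumption.

```python
import re


def concordance_lines(texts, pattern, window=20):
    """Build one row per match, with left and right context (hypothetical helper).

    The window is measured in characters here, which is an assumption.
    """
    rows = []
    for text_id, text in enumerate(texts):  # text_id is 0 for the first line
        for m in re.finditer(pattern, text):
            rows.append(
                {
                    "text_id": text_id,
                    "left": text[max(0, m.start() - window):m.start()],
                    "match": m.group(0),
                    "right": text[m.end():m.end() + window],
                }
            )
    return rows


texts = [
    "The minister of health spoke first.",
    "Then the minister of finance replied.",
]
rows = concordance_lines(texts, r"minister", window=10)

# Sorting by text_id keeps the order of appearance in the text.
by_id = sorted(rows, key=lambda r: r["text_id"])
# Sorting by right context orders rows alphabetically by what follows the match.
by_right = sorted(rows, key=lambda r: r["right"].lower())

for row in by_right:
    print(f'{row["text_id"]} | {row["left"]:>10} | {row["match"]} | {row["right"]}')
```

A larger `window` value widens both context columns, which mirrors what the "Window size" field does in the display.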


## Concordancer
Ensure you have uploaded a file, then run the code cell below to show the Concordancer.

In [4]:
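# Proceed only if a file has been uploaded
uploaded = len(uploader.value) > 0
if uploaded:
    uploaded_file = uploader.value[0]
    file_name = uploaded_file.name
    # Save the uploaded bytes to disk so the loader can read them from a path
    with open(file_name, "wb") as fp:
        fp.write(uploaded_file.content)

    # The last three characters of the file name ("csv" or "txt") give the file type
    file_type = uploaded_file.name[-3:]

    concordance_loader = ConcordanceLoader(path=file_name, type=file_type)
    concordance_loader.show()
else:
    print("Ensure you upload a file!")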
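The cell above writes the uploaded bytes to disk unchanged. If the file was saved in an encoding other than UTF-8, loading it can fail with a `UnicodeDecodeError`. A minimal sketch of one way to convert such a file before uploading is shown below. The helper `to_utf8` is hypothetical, and the fallback encoding `cp1252` (common for files saved on Windows) is an assumption that may not suit your data.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the bytes re-encoded as UTF-8 (hypothetical helper).

    Bytes that already decode as UTF-8 are returned unchanged. Otherwise they
    are decoded with the fallback encoding, which is an assumption about how
    the file was saved, and undecodable bytes are replaced.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace").encode("utf-8")


# The byte 0xe9 is "é" in cp1252 but is not valid on its own in UTF-8.
legacy_bytes = b"caf\xe9"
fixed_bytes = to_utf8(legacy_bytes)
print(fixed_bytes.decode("utf-8"))
```

To convert a file on disk, read it with `open(path, "rb")`, pass the bytes through `to_utf8`, and write the result back with `open(path, "wb")`.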


## Concordancer - Unstructured data

The dialogic feature is a more advanced part of the Concordancer that lets you analyse discourse structures in unstructured text (text that is not organised into columns of aligned text pairs or aligned text-metadata pairs). It requires that your text consistently uses a symbol to mark that structure. For instance, your text might use the colon symbol (:) ONLY after speakers and before their respective dialogue, as in the example below.

![Dialogue example 1](./concordance_standalone_imgs/dialogue_eg1.png)

Or your text might use the question mark symbol (?) only after the interviewer’s question, as in the example below:

![Dialogue example 2](./concordance_standalone_imgs/dialogue_eg2.png)

If your text uses such symbols consistently, you can use this tool to structure your text, for example into speaker-text pairs or question-answer pairs, and analyse it accordingly (similar to structured data).

To do so, you specify a character at which the text data will be split. For example, if your text has the format `speaker: spoken words`, you can set the "splitter" to `:`, which will create a column called "key" for the speaker and a column for the spoken words. You can then see which speaker spoke the words in a given concordance line. Note that the information to the left of the chosen symbol (here, the speaker) becomes a metadata column, while the information to the right of the chosen symbol (here, the words spoken by the speaker) is treated as the text that the Concordancer searches.
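The splitting described above can be sketched in a few lines of Python. This illustrates the idea only and is not the tool's own code: the helper `split_lines` is hypothetical, it splits each line at the first occurrence of the splitter, and the column name `text` for the right-hand side is an assumption (only the `key` column is named above).

```python
def split_lines(raw_text, splitter=":"):
    """Split each non-empty line into a 'key' and a 'text' part (hypothetical helper)."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # partition() splits at the first occurrence of the splitter only
        key, found, text = line.partition(splitter)
        if found:
            rows.append({"key": key.strip(), "text": text.strip()})
        else:
            # No splitter on this line: keep the whole line as text
            rows.append({"key": "", "text": line.strip()})
    return rows


transcript = "Alice: Hello there.\nBob: Hi, Alice: welcome."
for row in split_lines(transcript):
    print(row)

interview = "What did you see? A red car."
print(split_lines(interview, splitter="?"))
```

The second colon in Bob's line ends up inside the text because only the first one is used for splitting. This is why the symbol needs to be used consistently in your data.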

In the code cell below, replace the : between the quotation marks to specify a different splitter character (for example a question mark), depending on what symbols are present in the uploaded text. 

In [None]:
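# Symbol at which each line is split into a "key" and the searchable text
splitter = ":"

# Proceed only if a file has been uploaded
uploaded = len(uploader.value) > 0
if uploaded:
    uploaded_file = uploader.value[0]
    file_name = uploaded_file.name
    # Save the uploaded bytes to disk so the loader can read them from a path
    with open(file_name, "wb") as fp:
        fp.write(uploaded_file.content)

    # The last three characters of the file name ("csv" or "txt") give the file type
    file_type = uploaded_file.name[-3:]

    concordance_loader = ConcordanceLoader(path=file_name, type=file_type, re_symbol_txt=splitter)
    concordance_loader.show()
else:
    print("Ensure you upload a file!")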