## Generating Concordances

This notebook reads the contents of a text file, tokenizes the text with a [regular expression](https://docs.python.org/3/library/re.html), and builds concordances from the resulting tokens.

In [1]:
with open("Emma_Austen.txt") as file:
    txt = file.read()
print(txt)


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister’s marriage, been
mistress of his house from a very early period. Her mother had died
too long ago for her to have more than an indistinct remembrance of
her caresses; and her place had been supplied by an excellent woman as
governess, who had fallen little short of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a
governess than a friend, very fond of both daughters, but particularly
of Emma.
Miss Taylor had ceased to hold the nominal office of governess, the
mildness of her temper had hardly allowed her to impose any restraint;
and the shadow of authority being now long passed away,

### Tokenization
Now we tokenize the text, producing a list called `list_of_tokens`, and check the first ten words. The text is converted to lowercase first, and the pattern keeps only runs of word characters (optionally joined by hyphens), so punctuation is dropped.

In [2]:
import re
list_of_tokens = re.findall(r'\b\w[\w-]*\b', txt.lower())

# print the first 10 words
print(list_of_tokens[:10])

['emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home']
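To see what the pattern keeps and what it drops, here is a minimal sketch on a made-up sentence. The `sample` string is an assumption for illustration and is not read from the notebook's input file.

```python
import re

# Hypothetical sample, chosen to include a hyphen and a curly apostrophe
sample = "Her sister’s marriage; nearly twenty-one years."

# Same pattern as the tokenization cell above
sample_tokens = re.findall(r'\b\w[\w-]*\b', sample.lower())
print(sample_tokens)
```

With this pattern the hyphenated "twenty-one" stays a single token, while the curly apostrophe splits "sister’s" into `'sister'` and `'s'`. Searching the concordance for a possessive form would therefore need the bare word.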


### Main function

This function builds the concordance lines and appends them to the list passed in as `conc_list`. It does not return a value.

In [3]:
def make_conc(word_to_conc, list_to_find_in, context_to_use, conc_list):
    """Build concordance lines for a single keyword.

    word_to_conc - The word to find in the text (one lowercase token)
    list_to_find_in - The tokenized text to search
    context_to_use - Number of words to show before and after the keyword
    conc_list - The list that concordance lines are appended to
    """
    end = len(list_to_find_in)  # number of tokens in the text
    for location in range(end):
        if list_to_find_in[location] == word_to_conc:
            # Clamp the window when the keyword is near the beginning or end
            if (location - context_to_use) < 0:
                begin_conc = 0  # fewer than context_to_use words before the keyword
            else:
                begin_conc = location - context_to_use

            if (location + context_to_use + 1) > end:
                end_conc = end  # fewer than context_to_use words after the keyword
            else:
                end_conc = location + context_to_use + 1

            # Slice the token list between the start and end points
            the_context = list_to_find_in[begin_conc:end_conc]
            conc_line = ' '.join(the_context)
            # Prefix each line with the keyword's position in the token list
            conc_list.append("{}: {}".format(location, conc_line))
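As a possible alternative, the same windowing logic can be written so that the function returns a fresh list instead of filling one supplied by the caller. This is only a sketch: `make_conc_returning` and `sample_tokens` are hypothetical names introduced here and are not part of the notebook.

```python
def make_conc_returning(word_to_conc, list_to_find_in, context_to_use):
    """Hypothetical variant of make_conc that returns a new list."""
    results = []
    for location, token in enumerate(list_to_find_in):
        if token == word_to_conc:
            # max/min clamp the window to the bounds of the token list
            begin_conc = max(0, location - context_to_use)
            end_conc = min(len(list_to_find_in), location + context_to_use + 1)
            conc_line = ' '.join(list_to_find_in[begin_conc:end_conc])
            results.append("{}: {}".format(location, conc_line))
    return results


# Small made-up token list for illustration
sample_tokens = ['emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich']
print(make_conc_returning('clever', sample_tokens, 2))
```

Returning the list avoids having to create an empty container before each call, at the cost of no longer accumulating results from several keywords into one list.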

### Call the main function

We set the arguments and call the function.

In [4]:
# set a word to find
word_to_find = 'emma'

# set the context of words on either side to grab
context = 5

# create a list to store the concordance
conc_list = []

make_conc(word_to_find, list_of_tokens, context, conc_list)
conc_list[-5:] # show up to the last 5 items of the concordance list

['0: emma woodhouse handsome clever and rich',
 '140: both daughters but particularly of emma miss taylor had ceased to',
 '188: friend very mutually attached and emma doing just what she liked']

Each result line starts with a number giving the position of the keyword in the token list. Here the first result, **"emma"**, was found at the very beginning, position zero, so it has no preceding context.

The word "emma" appears only 3 times in the text, so we get three concordance lines.
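Because each line has the form `position: context`, the positions can be recovered by splitting on the first colon. The sketch below works on the three lines copied from the output above; `conc_lines` and `locations` are names introduced here for illustration.

```python
# Concordance lines copied from the output above
conc_lines = [
    '0: emma woodhouse handsome clever and rich',
    '140: both daughters but particularly of emma miss taylor had ceased to',
    '188: friend very mutually attached and emma doing just what she liked',
]

# The position is everything before the first colon
locations = [int(line.split(':', 1)[0]) for line in conc_lines]
print(locations)
print(len(locations))
```

On the full token list, `list_of_tokens.count('emma')` should give the same number as `len(conc_list)`, which is a quick way to check that no occurrence was missed.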

**Reference:**  
https://github.com/sgsinclair/alta/blob/2eb10ab6787d032e317ce883fb0bc3427406333d/ipynb/utilities/Concordances.ipynb