# Generating Concordances

This notebook shows how you can generate a concordance using lists of tokens.

First we list the text files available in the working directory.

In [1]:
ls *.txt

FullText.txt                performanceConcordance.txt
Hume Enquiry.txt            theWritingStory.txt
StoryOfWriting.txt          truthConcordance.txt
bigdata.txt


We are going to use "Hume Enquiry.txt" from Project Gutenberg, but you can use whatever text you want. We print the number of characters and the first 50 characters to check that the file was read.

In [3]:
theText2Use = "Hume Enquiry.txt"
with open(theText2Use, "r") as fileToRead:
    fileRead = fileToRead.read()
    
print("This string has", len(fileRead), "characters.")
print(fileRead[:50])

This string has 366798 characters.
The Project Gutenberg EBook of An Enquiry Concerni


## Cleaning the Text

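Because the text comes from Project Gutenberg, we need to remove the material that Gutenberg adds at the front and end.

First we remove the material at the front: everything up to and including the phrase "Distributed Proofreaders".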

In [4]:
text2find = "Distributed Proofreaders"
lengthOfT2F = len(text2find)
location = fileRead.find(text2find)
up2location = location + lengthOfT2F
# print(fileRead[location - 20:location + 20 + lengthOfT2F]) # Use to check.
theFullText1 = fileRead[up2location:]
print(theFullText1[:50])













AN ENQUIRY CONCERNING HUMAN UNDERSTAND


Now we remove the material at the end: everything from the phrase "End of the Project Gutenberg EBook" onward.

In [5]:
text2find = "End of the Project Gutenberg EBook"
location = theFullText1.find(text2find)
# print(theFullText1[location:location + 20 + len(text2find)]) # Use to check.
theFullText = theFullText1[:location]
print(theFullText[-50:])

on, 57.

  Freedom of (v. _Necessity_).
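
The two trimming steps above can be combined into one helper. The following is a minimal sketch, not part of the original notebook: the function name `strip_boilerplate`, its parameters, and the sample string are all hypothetical. It also guards against `str.find` returning -1 when a marker is missing, a case the cells above would silently mishandle.

```python
def strip_boilerplate(text, start_marker, end_marker):
    """Return the text between start_marker and end_marker (hypothetical helper)."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found: " + start_marker)
    begin = start + len(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, begin)
    if end == -1:
        raise ValueError("end marker not found: " + end_marker)
    return text[begin:end]

# A tiny made-up sample standing in for the Gutenberg file
sample = "Produced by Distributed Proofreaders\n\nTHE BODY\n\nEnd of the Project Gutenberg EBook of ..."
body = strip_boilerplate(sample, "Distributed Proofreaders", "End of the Project Gutenberg EBook")
print(repr(body))
```

With the real file, the call would take `fileRead` as its first argument and the same two marker strings used in the cells above.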













## Tokenization

Now we tokenize the text, producing a list called "listOfTokens", and check the first words. This eliminates punctuation and lowercases the words. Internal hyphens are kept, so a hyphenated word counts as one token.

In [6]:
import re
listOfTokens = re.findall(r'\b\w[\w-]*\b', theFullText.lower())
print(listOfTokens[:10])

['an', 'enquiry', 'concerning', 'human', 'understanding', 'by', 'david', 'hume', 'extracted', 'from']
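
To see what the regular expression keeps and drops, here is a small sketch on a made-up sentence (the sample string is an assumption, not taken from the Hume text). Note that `\w` also matches the underscore, so Gutenberg's `_italic_` markup survives tokenization, and an apostrophe splits a word in two.

```python
import re

# Made-up sentence with a hyphen, underscores, and an apostrophe
sample = "The well-known _Necessity_ doesn't vanish."
tokens = re.findall(r'\b\w[\w-]*\b', sample.lower())
print(tokens)
```

If the underscores get in the way of a search, one option is to strip them from each token with `token.strip('_')` before building the concordance.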


## Main function

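Here is the main function that does the work, populating a list with the concordance lines. Each line starts with the position of the matched word in the list of tokens. Once the function has been run, we check the first 5 concordance lines further down.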

In [7]:
def makeConc(word2conc, list2FindIn, context2Use, concList):
    # Scan every token; for each match, append a line of context to concList
    end = len(list2FindIn)
    for location in range(end):
        if list2FindIn[location] == word2conc:
            # Clamp the window so it never runs past the beginning or end of the list
            beginCon = max(location - context2Use, 0)
            endCon = min(location + context2Use + 1, end)

            theContext = list2FindIn[beginCon:endCon]
            concordanceLine = ' '.join(theContext)
            # print(str(location) + ": " + concordanceLine)
            concList.append(str(location) + ": " + concordanceLine)
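
As a quick check of how the context window behaves at the edges of the list, here is a minimal sketch on a made-up list of tokens. It uses a hypothetical variant, `concordance_lines`, that returns a new list instead of appending to one passed in; the windowing logic is meant to mirror `makeConc`.

```python
def concordance_lines(tokens, word, context):
    """Return 'position: context' lines for every occurrence of word (hypothetical variant)."""
    lines = []
    for location, token in enumerate(tokens):
        if token == word:
            begin = max(location - context, 0)              # do not run past the start
            end = min(location + context + 1, len(tokens))  # do not run past the end
            lines.append(str(location) + ": " + " ".join(tokens[begin:end]))
    return lines

# Made-up tokens; "the" occurs at the start, in the middle, and near the end
tokens = "the cat sat on the mat near the door".split()
for line in concordance_lines(tokens, "the", 2):
    print(line)
```

The first and last lines come out shorter than the middle one because the window is clipped at the boundaries of the list.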

## Input and Run

Now we run code that asks for the word to look up and the amount of context (the number of words on either side). It generates the concordance and reports how many lines were found.

In [10]:
# Ask for the word to search for
word2find = input("What word do you want concordances for? ").lower() 

# Ask for the number of words of context to grab on either side
context = input("How much context do you want? ") 

theConc = []
makeConc(word2find,listOfTokens,int(context),theConc)
len(theConc)  # the number of concordance lines found

What word do you want concordances for? truth
How much context do you want? 20


21
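
`int(context)` raises a `ValueError` if the answer typed in is not a whole number. A minimal sketch of a more forgiving conversion follows; the helper name `parse_context` and the fallback value of 5 are assumptions for illustration, not part of the original notebook.

```python
def parse_context(answer, default=5):
    """Convert the typed answer to a non-negative int, falling back to default (hypothetical helper)."""
    try:
        value = int(answer.strip())
    except ValueError:
        return default
    # A negative window makes no sense, so fall back in that case too
    return value if value >= 0 else default

print(parse_context("20"))
print(parse_context("twenty"))
print(parse_context("-3"))
```

It could stand in for `int(context)` in the call to `makeConc`.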

In [11]:
theConc[:5]

['468: should not yet have fixed beyond controversy the foundation of morals reasoning and criticism and should for ever talk of truth and falsehood vice and virtue beauty and deformity without being able to determine the source of these distinctions while they',
 '2919: that what is really distinct to the immediate perception may be distinguished by reflexion and consequently that there is a truth and falsehood in all propositions on this subject and a truth and falsehood which lie not beyond the compass of',
 '2930: distinguished by reflexion and consequently that there is a truth and falsehood in all propositions on this subject and a truth and falsehood which lie not beyond the compass of human understanding there are many obvious distinctions of this kind such',
 '3686: happy if we can unite the boundaries of the different species of philosophy by reconciling profound enquiry with clearness and truth with novelty and still more happy if reasoning in this easy manner we can undermi

## Output

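We can write any concordance we generate to a text file. The file name is built from the search word, so a concordance for "truth" is saved as "Truth.Concordance.txt".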

In [6]:
nameOfResults = word2find.capitalize() + ".Concordance.txt"

with open(nameOfResults, "w") as fileToWrite:
    for line in theConc:
        fileToWrite.write(line + "\n")
    
print("Done")

Done
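
To confirm what was written without leaving Python, the file can be read back. The sketch below is self-contained: it uses a made-up list of lines and a temporary directory rather than the real `theConc`, and the helper name `save_concordance` is hypothetical. It also passes `encoding="utf-8"` explicitly, which is an assumption about the desired file encoding.

```python
import os
import tempfile

def save_concordance(lines, word, directory="."):
    """Write one concordance line per row and return the file path (hypothetical helper)."""
    path = os.path.join(directory, word.capitalize() + ".Concordance.txt")
    with open(path, "w", encoding="utf-8") as fileToWrite:
        for line in lines:
            fileToWrite.write(line + "\n")
    return path

# Made-up lines standing in for theConc
sampleConc = ["0: the cat sat", "4: sat on the mat near"]
with tempfile.TemporaryDirectory() as tmpDir:
    savedPath = save_concordance(sampleConc, "the", tmpDir)
    # Read the file back to confirm that every line was written
    with open(savedPath, encoding="utf-8") as fileToRead:
        linesBack = fileToRead.read().splitlines()

print(os.path.basename(savedPath), len(linesBack))
```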


Here we check that the file was created.

In [7]:
ls

Basic CSV Handling.ipynb             Truth.Concordance.txt
Concordances.ipynb                   Truths.Concordance.txt
ExampleTable.csv                     Untitled.ipynb
Exploring a text with NLTK.ipynb     Untitled1.ipynb
Hume Enquiry.txt                     Untitled2.ipynb
Python language notes.ipynb          theText.txt
Teaching IPython to Humanists.ipynb


---
[CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) From [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com)<br>Created October, 2016 (Jupyter 4.2.1)