# Checking and retrieving character indexes from quotations


What you will need to run this notebook:

+ The Project Gutenberg full text of your source text (text A). In this case, the Project Gutenberg version of *Middlemarch*: `middlemarch.txt`. This should already be in the GitHub repository.
+ The JSON file with the output of `text-matcher`. In this case, this is `jstor-middlemarch-articles.json`. You can download it from here: https://drive.google.com/drive/folders/1N1IXEy5CGEKplru0R6KNzj5kDLwxMVDC?usp=share_link

Both of these files must be in the same directory as this notebook for the filepaths below to run correctly. You should move the JSON file into this directory, and remember to delete it and empty your Trash when you finish working on this task.

In addition, you should open [the `middlemarch-ground-truth-indexes` Google Sheet](https://docs.google.com/spreadsheets/d/1I1xEVKGQIf9eGvfLs_l0rCRmSz5GYVmHmRbWzUBWIRc/edit?usp=share_link).


### A preliminary note about character indexes:

A match in `text-matcher` takes the form of a pair, or a list of pairs, of character indexes. These character indexes store the position of a match and can be used to retrieve the corresponding text.

Let's say you were looking at this output: `[[173657, 173756], [292143, 292406]]`.

In each pair, the first number is the **starting character index** and the second number is the **ending character index** of a quotation.

So in this example, for the match `[173657, 173756]`:
+ the **starting character index** is 173657
+ the **ending character index** is 173756
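To make the idea concrete, here is a minimal sketch of how a pair of character indexes maps back to text with Python slicing. The string `text_a` and the pair `match` are made-up stand-ins, not values from the real data, and the sketch assumes the ending index is exclusive (the convention the sanity-check cell later in this notebook uses).

```python
# Toy stand-in for the full text of text A (not the real novel file).
text_a = "Miss Brooke had that kind of beauty which seems to be thrown into relief by poor dress."

# A hypothetical match in the same [start, end] form described above.
match = [5, 11]

# Unpack the pair and slice the text; the character at `end` is not included.
start, end = match
recovered = text_a[start:end]
print(recovered)
```

The same slice, applied to the real text with a real pair of indexes, returns the matched passage.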

### Import libraries
Run the cell below to import the libraries.

In [12]:
from text_matcher.matcher import Text, Matcher
import json
import pandas as pd
from IPython.display import clear_output
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [16, 6]
#pd.set_option('display.max_colwidth', None)

### Load in our data files:

In [13]:
# Load Middlemarch .txt file 
# (Note: must have 'middlemarch.txt' in this directory)
with open('middlemarch.txt') as f: 
    rawMM = f.read()

mm = Text(rawMM, 'Middlemarch')

# Load in the JSON file with our JSTOR articles and data from text-matcher
# (Note: must have the file 'jstor-middlemarch-articles.json' in the same directory as this notebook)
df = pd.read_json('jstor-middlemarch-articles.json')

In [14]:
# Let's peek inside our DataFrame
df.head(3)

Unnamed: 0,creator,datePublished,docSubType,docType,fullText,id,identifier,isPartOf,issueNumber,language,...,title,url,volumeNumber,wordCount,numMatches,Locations in A,Locations in B,abstract,keyphrase,subTitle
0,[Rainer Emig],2006-01-01,book-review,article,"[Monika Mueller, George Eliot U.S.: Transat- l...",http://www.jstor.org/stable/41158244,"[{'name': 'issn', 'value': '03402827'}, {'name...",Amerikastudien / American Studies,3,[eng],...,Review Article,http://www.jstor.org/stable/41158244,51,1109,0,[],[],,,
1,[Martin Green],1970-01-01,book-review,article,[Reviews I57 Thackeray's Critics: An Annotated...,http://www.jstor.org/stable/3722819,"[{'name': 'issn', 'value': '00267937'}, {'name...",The Modern Language Review,1,[eng],...,Review Article,http://www.jstor.org/stable/3722819,65,1342,0,[],[],,,
2,[Richard Exner],1982-01-01,book-review,article,[Essays Mary McCarthy. Ideas and the Novel. Ne...,http://www.jstor.org/stable/40137021,"[{'name': 'issn', 'value': '01963570'}, {'name...",World Literature Today,1,[eng],...,Review Article,http://www.jstor.org/stable/40137021,56,493,0,[],[],,,


# Check quotation matches for particular articles


## Set the `article_id` ‼️

In the cell below, change the variable `article_id` to the id of the article you wish to examine.

**Where can I find the article id?**

+ This can be found in the `id` column of the [Google Sheet](https://docs.google.com/spreadsheets/d/1I1xEVKGQIf9eGvfLs_l0rCRmSz5GYVmHmRbWzUBWIRc/edit?usp=share_link).

*Note: JSTOR outputs the full text of each article as a list of strings, so we have to concatenate them using text-matcher's `Text()` function.*
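As a rough illustration of why the concatenation step matters for character indexes, the sketch below joins a list of page strings by hand. The helper `join_pages` and the separator `" \n "` are assumptions made for this example (the separator is a guess based on how the printed article text further down looks); in the notebook itself, `Text()` does the joining, and it is the resulting `.text` attribute that the indexes refer to.

```python
def join_pages(pages, separator=" \n "):
    """Join a list of page strings into one string (hypothetical helper)."""
    return separator.join(pages)

# Made-up pages standing in for one article's 'fullText' list.
pages = ["first page", "second page"]
joined = join_pages(pages)

# Indexes count characters in the joined string, separators included.
position = joined.index("second")
print(position)
```

Because separators count toward every index after the first page, indexes should always be computed against the same concatenated string that will later be sliced.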


In [17]:
# ‼️ 🛑 Make sure to change the variable below to the correct article id 🛑  ‼️
article_id  = 'http://www.jstor.org/stable/25088885' # CHANGE THIS to article id

# Use article_id to get the index of the article in our DataFrame
article_index = df[df['id'] == article_id].index[0]
article_text = df['fullText'].loc[article_index]
article_title = df['title'].loc[article_index]

# Assign the full text of this article to a variable called `cleaned_article_text`, with text-matcher's Text function
cleaned_article_text = Text(article_text, article_title)

# Print out the title and ID of the article we selected as confirmation
print(f"""
Article selected:
ID: {article_id}
Title: {article_title}
""")



Article selected:
ID: http://www.jstor.org/stable/25088885
Title: Self-Suppression & Attachment: Mid-Victorian Emotional Life



## Find the index positions of a given quotation

To establish all of the "ground truth" quotations (and their character indexes), we'll want to get the character indexes not just for quotations that text-matcher successfully matched, but for *all* quotations in that article.

To retrieve the character indexes for every quotation in an article that is legible to human eyes, follow the steps below.


### Step 1: Select the quotation in the ground truth spreadsheet

Work through the quotations in your selected article one by one.

### Step 2: Locate the text of that quotation as it appears in the JSON file in the "fullText" field
(🛑 Make sure you've entered the `article_id` for the article in the section "Set the `article_id`" first!)  
Run the cell below, and then use "CTRL+F" in your browser to find the quotation as it appears in the "Quotation from PDF" column on the spreadsheet.

Look at the text below and see whether the "Quotation from PDF" text is actually a quotation from *Middlemarch*. Often it will include too much or too little; sometimes it won't be the quotation at all.
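If a browser search fails only because the spacing or line breaks differ between the spreadsheet and the article text, a whitespace-tolerant search can help locate the passage. The sketch below is one possible approach, not part of the notebook's workflow: `find_ignoring_whitespace` is a hypothetical helper, and `sample` is a made-up snippet rather than real article text.

```python
import re

def find_ignoring_whitespace(text, quotation):
    """Return [start, end] of the first match, treating any run of
    whitespace in the text as equivalent to a single space."""
    words = quotation.split()
    if not words:
        return None
    # Escape each word so punctuation is matched literally, then allow
    # any amount of whitespace between consecutive words.
    pattern = r"\s+".join(re.escape(word) for word in words)
    found = re.search(pattern, text)
    return [found.start(), found.end()] if found else None

# Made-up snippet with a line break in the middle of the quotation.
sample = "born in moral \n stupidity, taking the world"
span = find_ignoring_whitespace(sample, "moral stupidity, taking")
print(span)
print(repr(sample[span[0]:span[1]]))
```

The returned indexes point into the original text, so the slice shows the passage exactly as it appears there, which is the form to copy in the next step.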

In [28]:
print(cleaned_article_text.text)

SELF-SUPPRESSION & ATTACHMENT MID-VICTORIAN EMOTIONAL LIFE Judith M. Hughes T tow can an historian hope to recreate the emotional life of the - - past, to discern the major features of an inner landscape that has disappeared from view? More specifically, how can I recall to life the psychic rhythms of the mid-Victorians? Where should I expect to hear the voices of the dead reverberate with sufficient clarity so that I could begin to interpret the sounds? Cultural artifacts, in particular the novel, provide a starting point. I can begin by trying to pinpoint prominent emotional configurations in a number of novels. Such artistic constructions condense inner experiences and render them immediate, thereby promising access to those deeper layers of the mind which have long since been buried. It is, of course, a promise that cannot be entirely fulfilled. Neverthe less, it would be a timid scholar indeed who would hesitate to exploit 541 
 The Massachusetts Review a rich store of material si

### Step 3: Copy the actual quoted text from *Middlemarch* as it appears exactly in the article text above

### Step 4: Paste the text of the quotation in the `quotation` field below
Make sure that you enclose the quotation in quotation marks.

If there are quotation marks in the text of the quote, either place an escape character `\` in front of them, or change the quotation marks that you use. (E.g., if there are single quotes (`'`) in the text, use double quotes (`"`) to surround the text.)
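The options look like this in practice. The strings below are made-up examples, not quotations from the article.

```python
# Option 1: the text contains single quotes, so surround it with double quotes.
q1 = "the world as an 'udder' to feed"

# Option 2: keep single quotes on the outside and escape the inner ones with a backslash.
q2 = 'the world as an \'udder\' to feed'

# Both spellings produce exactly the same string.
print(q1 == q2)

# If the text mixes both kinds of quotation mark, triple quotes avoid escaping entirely.
q3 = """She said, "it's fine" and left."""
print(q3)
```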

Run the cell below.

In [27]:
# PASTE the quotation below in the field, replacing the text below ‼️
# Make sure to include quotation marks around the string
quotation = "We are all of us born in moral stupidity, taking the world as an udder to feed our supreme selves" #pas

index = cleaned_article_text.text.rindex(quotation)
print(f"Article id: {article_id}")
print('Starting index:', index) 
print('Ending index:', index + len(quotation))
print(f'Quotation text and indexes to paste into spreadsheet:\n{quotation}\t[{index}, {index + len(quotation)}]')
print("\n\nSanity check (does this match the text above?):")
cleaned_article_text.text[index:index + len(quotation)]



Article id: http://www.jstor.org/stable/25088885
Starting index: 4329
Ending index: 4426
Quotation text and indexes to paste into spreadsheet:
We are all of us born in moral stupidity, taking the world as an udder to feed our supreme selves	[4329, 4426]


Sanity check (does this match the text above?):


'We are all of us born in moral stupidity, taking the world as an udder to feed our supreme selves'
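Two behaviours of `rindex` are worth keeping in mind: it raises a `ValueError` when the quotation is not found, and it reports only the last occurrence when the quotation appears more than once. The sketch below shows one way to list every occurrence instead. `find_all_spans` is a hypothetical helper written for this illustration, and `sample_text` is a made-up string.

```python
def find_all_spans(text, quotation):
    """Return a [start, end] pair for every occurrence of quotation in text."""
    spans = []
    start = text.find(quotation)  # str.find returns -1 instead of raising
    while start != -1:
        spans.append([start, start + len(quotation)])
        start = text.find(quotation, start + 1)
    return spans

# Made-up text in which the quotation appears twice.
sample_text = "moral stupidity is common; moral stupidity is universal."
spans = find_all_spans(sample_text, "moral stupidity")
print(spans)

# An empty list signals that the quotation was not found at all.
missing = find_all_spans(sample_text, "not in the text")
print(missing)
```

If more than one pair comes back, check the surrounding article text to decide which occurrence corresponds to the row in the spreadsheet.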

### Step 5: Record the character indexes and article id in the spreadsheet
Add the character indexes and article ID as a new row in the spreadsheet.