# Checking and retrieving character indexes from quotations


What you will need to run this notebook:

+ The Project Gutenberg full text of your source text (text A). In this case, the Project Gutenberg version of *Middlemarch*: `middlemarch.txt`. This should already be in the GitHub repository.
+ The JSON file with the output of `text-matcher`. In this case, this is `jstor-middlemarch-articles.json`. You can download it from here: https://drive.google.com/drive/folders/1N1IXEy5CGEKplru0R6KNzj5kDLwxMVDC?usp=share_link

Both of these files must be in the same directory as this notebook for the filepaths below to run correctly. You should move the JSON file into this directory, and remember to delete it and empty your Trash when you finish working on this task.

In addition, you should open [the `middlemarch-ground-truth-indexes` Google Sheet](https://docs.google.com/spreadsheets/d/1I1xEVKGQIf9eGvfLs_l0rCRmSz5GYVmHmRbWzUBWIRc/edit?usp=share_link).


### A preliminary note about character indexes:

A match in `text-matcher` takes the form of a pair, or a list of pairs, of character indexes. These character indexes store the position of a match and can be used to retrieve the corresponding text.

Suppose you are looking at this output: `[[173657, 173756], [292143, 292406]]`.

In each pair, the first number corresponds to the **starting character index**, and the second number corresponds to the **ending character index** of a quotation. 

So in this example, for the match `[173657, 173756]`:
+ the **starting character** is 173657
+ the **ending character** is 173756
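As a minimal sketch of how a pair of indexes maps back to text, the toy example below slices a short string with Python's `text[start:end]` syntax. The sample string and the values in `pairs` are made up for illustration; the real indexes refer to positions in `middlemarch.txt`. Because the cells later in this notebook slice with `[start:end]`, the ending index is treated as exclusive: the slice stops just before that position.

```python
# Toy text standing in for the full novel (hypothetical example data)
sample_text = "Miss Brooke had that kind of beauty"

# A list of [start, end] pairs in the same shape as the text-matcher output
pairs = [[0, 11], [21, 35]]

# Slice the text with each pair; the end index is exclusive
snippets = [sample_text[start:end] for start, end in pairs]

for (start, end), snippet in zip(pairs, snippets):
    print(f"[{start}, {end}] -> {snippet!r}")
```

One consequence of this convention is that `end - start` always equals the length of the retrieved text.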

### Import libraries
Run the cell below to import libraries.

In [1]:
from text_matcher.matcher import Text, Matcher
import json
import pandas as pd
from IPython.display import clear_output
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [16, 6]
#pd.set_option('display.max_colwidth', None)

### Load in our data files:

In [2]:
# Load Middlemarch .txt file 
# (Note: must have 'middlemarch.txt' in this directory)
with open('middlemarch.txt') as f: 
    rawMM = f.read()

mm = Text(rawMM, 'Middlemarch')

# Load in the JSON file with our JSTOR articles and data from TextMatcher
# (Note: must have the file 'jstor-middlemarch-articles.json' in the same directory as this notebook)
df = pd.read_json('jstor-middlemarch-articles.json')

In [3]:
# Let's peek inside our DataFrame
df.head(3)

Unnamed: 0,creator,datePublished,docSubType,docType,fullText,id,identifier,isPartOf,issueNumber,language,...,title,url,volumeNumber,wordCount,numMatches,Locations in A,Locations in B,abstract,keyphrase,subTitle
0,[Rainer Emig],2006-01-01,book-review,article,"[Monika Mueller, George Eliot U.S.: Transat- l...",http://www.jstor.org/stable/41158244,"[{'name': 'issn', 'value': '03402827'}, {'name...",Amerikastudien / American Studies,3,[eng],...,Review Article,http://www.jstor.org/stable/41158244,51,1109,0,[],[],,,
1,[Martin Green],1970-01-01,book-review,article,[Reviews I57 Thackeray's Critics: An Annotated...,http://www.jstor.org/stable/3722819,"[{'name': 'issn', 'value': '00267937'}, {'name...",The Modern Language Review,1,[eng],...,Review Article,http://www.jstor.org/stable/3722819,65,1342,0,[],[],,,
2,[Richard Exner],1982-01-01,book-review,article,[Essays Mary McCarthy. Ideas and the Novel. Ne...,http://www.jstor.org/stable/40137021,"[{'name': 'issn', 'value': '01963570'}, {'name...",World Literature Today,1,[eng],...,Review Article,http://www.jstor.org/stable/40137021,56,493,0,[],[],,,


# Check quotation matches for particular articles


## Set the `article_id` ‼️

In the cell below, change the variable `article_id` to the id of the article you wish to examine.

**Where can I find the article id?**

+ This can be found in the `id` column of the [Google Sheet](https://docs.google.com/spreadsheets/d/1I1xEVKGQIf9eGvfLs_l0rCRmSz5GYVmHmRbWzUBWIRc/edit?usp=share_link)

*Note: JSTOR outputs the full text of each article as a list of strings, so we have to concatenate them using text-matcher's `Text()` function.*


In [5]:
# ‚ÄºÔ∏è üõë Make sure to change the variable below to the correct article id üõë  ‚ÄºÔ∏è
article_id  = 'http://www.jstor.org/stable/xxxxxxxxx' # CHANGE THIS to article id

# Use article_id to get the index of the article in our DataFrame
article_index = df[df['id'] == article_id].index[0]
article_text = df['fullText'].loc[article_index]
article_title = df['title'].loc[article_index]

# Assign the full text of this article to a variable called `cleaned_article_text`, with text-matcher's Text function
cleaned_article_text = Text(article_text, article_title)

# Print out the title and ID of the article we selected as confirmation
print(f"""
Article selected:
ID: {article_id}
Title: {article_title}
""")



Article selected:
ID: http://www.jstor.org/stable/25088885
Title: Self-Suppression & Attachment: Mid-Victorian Emotional Life



## Part 1: Get quotes (& their character indexes) from `text-matcher` output


### What are the index positions of matches in our source text (Text "A")?
Retrieve the character indexes for the source text (Text A):

In [7]:
# What are the locations in A?
print("Middlemarch character indexes:")
df.loc[df['id'] == article_id, 'Locations in A'].item()

Middlemarch character indexes:


[[173657, 173756], [292143, 292406]]

### What's the text of one of those matches?

Let's check the corresponding text in Middlemarch for one of the matches output above.  
Change the start and end character indexes to one of the index ranges in the cell above. 

In [8]:
#‚ÄºÔ∏è üõë IMPORTANT: Change the start and end character indexes to one of the ouputs above

mm_start = 173657 # üõë REPLACE the number with one of the starting character indexes
mm_end = 173756 # üõë REPLACE the number with one of the ending character indexes

# Output the text in "A" for the start and end characters selected above
print("Middlemarch character indexes:", f"[{mm_start}, {mm_end}]")
mm.text[mm_start:mm_end]

Middlemarch character indexes: [173657, 173756]


'all of\nus, grave or light, get our thoughts entangled in metaphors, and act\nfatally on the strength'

### What are the index positions of matches in our target text (Text "B")?
Retrieve the character indexes for the B text (that is, the article):

In [9]:
# What are the locations in B?
print(f"Character index locations for {article_id}:")
df.loc[df['id'] == article_id, 'Locations in B'].item()

Character index locations for http://www.jstor.org/stable/30030019:


[[14718, 14816], [64553, 64816]]

### What's the text of one of those matches in Text "B" (the article)?
Change the start and end character indexes to one of the index ranges in the cell above.

In [10]:
#‚ÄºÔ∏è üõë IMPORTANT: Change the start and end character indexes to one of the ouputs above

textB_start = 14718 # üõë REPLACE the number to the left with one of the starting character indexes
textB_end = 14816 # üõë REPLACE the number to the left with one of the ending character indexes

# Output the text in "B" for the start and end characters selected above 
print(f"Character index locations for {article_id}:", f"[{textB_start}, {textB_end}]")
cleaned_article_text.text[textB_start:textB_end]

Character index locations for http://www.jstor.org/stable/30030019: [14718, 14816]


'All of us, grave or light, get our thoughts entangled in metaphors and act fatally on the strength'
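The two retrieved strings are not character-for-character identical: the *Middlemarch* text contains line breaks and an extra comma, and the capitalization differs. As a rough way to confirm by eye that the A and B snippets are the same quotation, the sketch below compares them after normalizing case, punctuation, and whitespace. The `normalize` helper is hypothetical and is not how `text-matcher` itself decides on a match.

```python
import re

def normalize(s):
    """Lowercase, drop punctuation, and collapse all whitespace to single spaces."""
    s = re.sub(r"[^\w\s]", "", s.lower())
    return " ".join(s.split())

# The two snippets printed by the cells above
text_a = 'all of\nus, grave or light, get our thoughts entangled in metaphors, and act\nfatally on the strength'
text_b = 'All of us, grave or light, get our thoughts entangled in metaphors and act fatally on the strength'

# True when the snippets differ only in case, punctuation, or whitespace
same_quotation = normalize(text_a) == normalize(text_b)
print(same_quotation)
```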

---

## Find the index positions of a given quotation

To establish all of the "ground truth" quotations (and their character indexes), we'll want to get the index characters not just for quotations that text-matcher successfully matched, but for *all* quotations in that article.

To retrieve the character indexes for all quotations in an article that are legible to human eyes, follow the steps below.


### Step 1: Locate the quotation in the PDF of the article.

### Step 2: Locate the text of that quotation as it appears in the JSON file in the "fullText" field
(🛑 Make sure you've first entered the `article_id` for the article in the section "Set the `article_id`"!)  
Run the cell below, and then use "CTRL+F" in your browser to find the quotation as it appears in the article text.

In [None]:
print(cleaned_article_text.text)

### Step 3: Copy the text of the quotation exactly as it appears in the article text above.

### Step 4: Paste the text of the quotation in the `quotation` field below
Make sure that you enclose the quotation in quotation marks.

If there are quotation marks in the text of the quote, either place an escape character `\` in front of them, or change the quotation marks that you use. (E.g., if there are single quotes (`'`) in the text, use double quotes (`"`) to surround the text.)
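As an illustration of the quoting options, the sketch below writes the same made-up sentence (not a real quotation from any article) three ways. All three produce an identical Python string, so any of them can be used for the `quotation` variable.

```python
# Option 1: escape the inner double quotes with a backslash
q1 = "He wrote \"hello\" and left"

# Option 2: surround the text with single quotes instead
q2 = 'He wrote "hello" and left'

# Option 3: use triple quotes, which tolerate both kinds of inner quote
q3 = """He wrote "hello" and left"""

# All three spellings yield the same string
all_equal = q1 == q2 == q3
print(all_equal)
```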

Run the cell below.

In [12]:
# PASTE the quotation below in the field, replacing the text below ‼️
# Make sure to include quotation marks around the string
quotation = "All of us, grave or light, get our thoughts entangled in metaphors and act fatally on the strength"

# rindex returns the start of the LAST occurrence and raises ValueError if the quotation is not found
index = cleaned_article_text.text.rindex(quotation)
print(f"Article id: {article_id}")
print('Starting index:', index) 
print('Ending index:', index + len(quotation))
print(f'Character indexes for match: [{index}, {index + len(quotation)}]')
print("\n Corresponding text:")
cleaned_article_text.text[index:index + len(quotation)]



Article id: http://www.jstor.org/stable/30030019
Starting index: 14718
Ending index: 14816
Character indexes for match: [14718, 14816]

 Corresponding text:


'All of us, grave or light, get our thoughts entangled in metaphors and act fatally on the strength'
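The cell above uses `rindex`, which reports only the last occurrence of the quotation and raises a `ValueError` when the pasted text is not found (for example, because of a stray space or a different quotation mark). If an article quotes the same passage more than once, a helper along the following lines could list every occurrence. This is a sketch, not part of `text-matcher`: `find_all_spans` and the toy article string are hypothetical names and data.

```python
def find_all_spans(text, quotation):
    """Return [start, end] pairs for every occurrence of quotation in text."""
    if not quotation:
        return []
    spans = []
    start = text.find(quotation)
    while start != -1:
        spans.append([start, start + len(quotation)])
        # Continue searching just past the start of the previous hit
        start = text.find(quotation, start + 1)
    return spans

# Toy article text (hypothetical example data)
toy_article = "She said yes. Later, she said yes again."

found = find_all_spans(toy_article, "said yes")
missing = find_all_spans(toy_article, "said no")
print(found)
print(missing)
```

In the notebook, the same function could be called as `find_all_spans(cleaned_article_text.text, quotation)`; an empty list then signals that the pasted quotation does not appear verbatim in the article text.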

### Step 5: Record the character indexes and article id in spreadsheet
Add the character indexes and article ID as a new row in the spreadsheet.