# Sample Notebook: Exploring a Tabled Image Dataset

The script *extract_info_from_dataset.py* extracts information from a dataset of letter images and saves it in a CSV file. This notebook demonstrates how that information can be used to explore the dataset and retrieve letter images satisfying given criteria, such as the date a letter was written, its recipient, or its signator. Together with the *letter_image* class, the information can also be used to find dataset letters similar to a letter image that is not in the dataset.

In [1]:
import pandas as pd
import letter_image as li
import os
import nltk.tag.stanford as st
from ast import literal_eval

dataset_dir = "/media/datadr/datasets/letters" 


## Get the Data Table from the CSV File

In [3]:
# Read the data table from the CSV file
data_file_name = "infoTable.csv"
data = pd.read_csv(data_file_name)

### The data table: a closer look

Each row in the table represents one semantic segment (part) of the image named in **image_name**.
The table consists of:
-  an **image_name** column
-  a **dir** column giving the path to the image
-  a **part** column containing the name of the current part. Currently this is one of:
    -  heading/sender
    -  recipient
    -  greeting
    -  body
    -  signature
    -  date
    -  enclosures/notes
-  a **contour** column holding a list of contours, each representing a block of the current part
-  a **text** column holding a list of the OCR-recognized text within the contours in **contour**
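
Because the table is loaded from a CSV file, each **text** cell arrives as a string that looks like a Python list, which is why later cells pass it through `literal_eval`. The sketch below illustrates this on a tiny made-up table; the rows are invented for illustration and are not taken from the dataset. The **contour** strings appear to hold NumPy `array(...)` representations, which `literal_eval` does not accept, so they are left out here.

```python
from ast import literal_eval

import pandas as pd

# Invented rows that mimic the layout of infoTable.csv (illustrative only)
sample = pd.DataFrame({
    "image_name": ["letter_a", "letter_a"],
    "part": ["date", "greeting"],
    "text": ["['March 10, 2016']", "['Dear', 'Ms. Smith']"],
})

# Convert each string cell into a real Python list
sample["text_list"] = sample["text"].apply(literal_eval)

print(sample.loc[0, "text_list"])       # the parsed list for the date row
print(len(sample.loc[1, "text_list"]))  # number of OCR text blocks in the greeting
```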

In [4]:
data.head()

Unnamed: 0,image_name,dir,contour,text,part
0,letter_1135,/media/datadr/datasets/letters,"[array([[ 395. , 1417.66887443],\n ...",['HILL\n\nW\n\nlﬂcﬂlPCIATID 1.7!\n\nO\nQ\nC\nO...,heading
1,letter_1135,/media/datadr/datasets/letters,"[array([[928. , 578.84663755],\n ...",['2013'],date
2,letter_1135,/media/datadr/datasets/letters,"[array([[938. , 438.44451491],\n ...","['April', 'M.Sc.', 'Elliott Bay', 'Mr.', 'Kids...",recipient
3,letter_1135,/media/datadr/datasets/letters,"[array([[1389. , 483.4266401 ],\n ...",['Dear'],greeting
4,letter_1135,/media/datadr/datasets/letters,"[array([[1500. , 1560.20216494],\n ...","['Hill', 'Chamber', 'Richmond', 'you on receiv...",body


#### What letter parts are in the table?

In [5]:
# Letter parts
data.part.unique()

array(['heading', 'date', 'recipient', 'greeting', 'body', 'signature',
       'enclosures', nan], dtype=object)
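
The `nan` entry above suggests that some rows carry no part label. A minimal sketch, on invented data, of counting rows per part with missing labels included, and of dropping unlabeled rows before further filtering:

```python
import numpy as np
import pandas as pd

# Invented part labels, including one missing value (illustrative only)
parts = pd.DataFrame({"part": ["heading", "date", "body", "date", np.nan]})

# Count rows per part; dropna=False keeps the missing label as its own entry
counts = parts["part"].value_counts(dropna=False)
print(counts)

# Keep only rows that carry a part label
labeled = parts.dropna(subset=["part"])
print(len(labeled))
```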

#### Show a sample of the letters' dates 

In [6]:
# Dates in database
data[data.part=='date'].text.head()

1                  ['2013']
6      ['November 8. 2007']
10                  ['may']
13    ['September 1, 2018']
26       ['March 10, 2016']
Name: text, dtype: object

#### Do we have letters in the database signed by a given signator?

In [None]:
# Do we have letters in the database signed by a given signator?
given_signator = "Joan Lau"

# Set up the Stanford NER tagger (requires Java and a local Stanford NER install)
stanfordNER_path = "/home/reem/tools/stanford-ner-2018-10-16/"
english_nertagger = st.StanfordNERTagger(
    os.path.join(stanfordNER_path, "classifiers/english.all.3class.distsim.crf.ser.gz"),
    os.path.join(stanfordNER_path, "stanford-ner.jar"))

letters = []
letters_w_sig = data[data.part == 'signature']
for row_idx, row in letters_w_sig.iterrows():
    for line in literal_eval(row.text):
        for inline in line.split("\n"):
            tagged = english_nertagger.tag(inline.split())
            # Consider the line only if it contains a PERSON entity
            if any(tag == 'PERSON' for _, tag in tagged):
                if given_signator.lower() == inline.strip().lower():
                    # Record the row label (not the inner line index), once per row
                    if row_idx not in letters:
                        letters.append(row_idx)

if letters:
    print("Letters signed by {}:".format(given_signator))
    for row_idx in letters:
        print(data.loc[row_idx].image_name)
else:
    print("No letters in the database were signed by {}".format(given_signator))
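
The search above can also be wrapped in a function so that the same logic serves any part and any name. The following is a sketch, not part of the original scripts: `find_letters_by_name` and its `is_person` parameter are hypothetical names, and the sample rows are invented. The optional `is_person` callable stands in for the NER check so that the example runs without a Stanford NER installation.

```python
from ast import literal_eval

import pandas as pd


def find_letters_by_name(table, part, name, is_person=None):
    """Return image names whose `part` text has a line equal to `name`.

    Hypothetical helper: `is_person` is an optional callable that takes a
    line of text and returns True if the line should be treated as a
    person name (for example, a wrapper around an NER tagger).
    """
    target = name.strip().lower()
    matches = []
    for _, row in table[table["part"] == part].iterrows():
        if not isinstance(row["text"], str):
            continue  # skip rows with missing text
        for block in literal_eval(row["text"]):
            for line in block.split("\n"):
                if is_person is not None and not is_person(line):
                    continue
                if line.strip().lower() == target and row["image_name"] not in matches:
                    matches.append(row["image_name"])
    return matches


# Invented rows that mimic the layout of infoTable.csv (illustrative only)
sample = pd.DataFrame({
    "image_name": ["letter_a", "letter_b", "letter_b"],
    "part": ["signature", "signature", "recipient"],
    "text": ["['Sincerely,\\nJoan Lau']", "['John Doe']", "['Joan Lau']"],
})

signed = find_letters_by_name(sample, "signature", "joan lau")
print(signed)
```

With a tagger available, `is_person` could be a small function that calls `english_nertagger.tag(line.split())` and checks for a `PERSON` tag, which would reproduce the filtering used in the cell above.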

### Retrieving letters that are similar to a new letter image

Sometimes we may wish to find dataset letters that are similar to a letter not in our dataset. In this case we start by extracting information from the new letter using the *letter_image* class. We can then compare the contents of the new letter to those in the dataset to, for example, find all letters signed by the same signator, or all dataset letters written on the same day as the new letter.

(Note: For more examples of how to use the letter_image class, please refer to the __[letter_image_examples.ipynb](https://github.com/ReemHal/letter_images/blob/master/letter_image_examples.ipynb)__ notebook.)


In [None]:
# Let's try to match our dataset letters to a new letter image
image_name = "test_letter"
letter_obj = li.letter_image("", image_name)
letter_obj.process_letter()

#### Letter Parts

First, let us look at the parts identified within test_letter.

In [None]:
letter_obj.display_contours(display=True,save=False,savedir='contours')

#### Text within each part

Now let us look at the contents of each of those parts.

In [None]:
# What do we know about the test letter's content?
# Map each part name to the list of text blocks recognized in it
test_letter_content = {}
for clip_item in letter_obj.clip:
    test_letter_content[clip_item['part']] = clip_item['text']

for key, lines in test_letter_content.items():
    text = "\n".join(lines)
    print("<<{}>>\n{}\n=====".format(key, text))

#### Who signed our *test_letter*?

In [None]:
# Who signed the test letter?

stanfordNER_path = "/home/reem/tools/stanford-ner-2018-10-16/"
english_nertagger = st.StanfordNERTagger(
    os.path.join(stanfordNER_path, "classifiers/english.all.3class.distsim.crf.ser.gz"),
    os.path.join(stanfordNER_path, "stanford-ner.jar"))

signator = None  # stays None if no PERSON entity is found
sig_segment = test_letter_content['signature']
for line in sig_segment:
    tagged = english_nertagger.tag(line.split())
    # Keep the last line that contains a PERSON entity
    if any(tag == 'PERSON' for _, tag in tagged):
        signator = line

print("Letter signed by", signator)

#### Who is the recipient of the letter?

In [None]:
# Who was the test letter sent to?
recipient_name = None  # stays None if no PERSON entity is found
recipient_segment = test_letter_content['recipient']
for line in recipient_segment:
    for inline in line.split("\n"):
        tagged = english_nertagger.tag(inline.split())
        # Keep the last line that contains a PERSON entity
        if any(tag == 'PERSON' for _, tag in tagged):
            recipient_name = inline

print("Letter was sent to", recipient_name)

### Putting things together: find similar letters in our dataset

#### Retrieve letters written on the same date

Now we would like to find all dataset letters that were written on the same date as the *test_letter*.

In [None]:
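# Which letters were sent on the same day as the test letter?
test_date = test_letter_content['date'][0]
options = data[data.part == 'date']
for _, row in options.iterrows():
    # row['text'] is the string form of a list, so this is a substring match
    if isinstance(row['text'], str) and test_date in row['text']:
        print(row['image_name'])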
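
The comparison above is a plain substring match, so OCR variants such as `November 8. 2007` and `November 8, 2007` would not match each other. One possible refinement, sketched below, is to normalize date strings before comparing them. The helper name `parse_letter_date` and the list of accepted formats are assumptions made for illustration; strings that fit none of the formats yield `None`. Month names are matched according to the current locale, assumed to be English here.

```python
from datetime import datetime

# Formats tried in order; an assumption based on the sample dates shown earlier
DATE_FORMATS = ("%B %d, %Y", "%B %d. %Y")


def parse_letter_date(text):
    """Return a datetime.date for a recognized date string, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


# Two OCR variants of the same date normalize to the same value
same_day = parse_letter_date("November 8. 2007") == parse_letter_date("November 8, 2007")
print(same_day)

# Incomplete dates are rejected rather than guessed
print(parse_letter_date("may"))
```

Partial dates such as a bare year are deliberately left unparsed in this sketch, since treating them as a specific day would produce false matches.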

#### Were any letters in our database sent to the same recipient as the test letter?

In [None]:
# Do we have letters in the database that were sent to the same recipient?
target = (recipient_name or "").strip().lower()
letters = []
letters_w_recipient = data[data.part == 'recipient']
for row_idx, row in letters_w_recipient.iterrows():
    for line in literal_eval(row.text):
        for inline in line.split("\n"):
            tagged = english_nertagger.tag(inline.split())
            if any(tag == 'PERSON' for _, tag in tagged):
                # Compare case-insensitively on both sides
                if target == inline.strip().lower():
                    # Record the row label (not the inner line index), once per row
                    if row_idx not in letters:
                        letters.append(row_idx)

if letters:
    print("Letters sent to {}:".format(recipient_name))
    for row_idx in letters:
        print(data.loc[row_idx].image_name)
else:
    print("No letters in the database were sent to {}".format(recipient_name))
    