# Dynamic Time Warping

## Files and Data
#### ground-truth folder
Contains ground truth data.


#### transcription.txt  
- XXX-YY-ZZ: XXX = Document Number, YY = Line Number, ZZ = Word Number
- Contains the character-wise transcription of the word (letters separated by dashes)
- Special characters denoted with s_
	- numbers (s_x)
	- punctuation (s_pt, s_cm, ...)
	- strong s (s_s)
	- hyphen (s_mi)
	- semicolon (s_sq)
	- apostrophe (s_qt)
	- colon (s_qo)
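
The format above could be parsed roughly as follows. This is only a sketch: it assumes that the ID and the transcription are separated by whitespace, the function name `parse_transcription_line` and the sample line are made up for illustration, and it is not the code behind `helpers.parse_transcript`.

```python
def parse_transcription_line(line):
    """Split 'XXX-YY-ZZ c-h-a-r-s' into its ID parts and character tokens."""
    # assumption: one whitespace gap between the ID and the transcription
    word_id, transcription = line.split(maxsplit=1)
    document, line_number, word_number = word_id.split("-")
    tokens = transcription.strip().split("-")
    return {
        "document": document,
        "line": line_number,
        "word_number": word_number,
        "tokens": tokens,
        # special characters are the tokens that carry the s_ prefix
        "special": [t for t in tokens if t.startswith("s_")],
    }


# hypothetical sample line, not taken from transcription.txt
entry = parse_transcription_line("270-01-02 L-e-t-t-e-r-s-s_cm")
print(entry["document"], entry["tokens"], entry["special"])
```

A plain letter `s` is kept as a normal token because only the `s_` prefix marks a special character.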
    
### Task
Three files:  
##### training and validation data:
train.txt, valid.txt  
Both contain a list of document numbers.
##### keywords that occur at least once in both the training and the validation set defined by train.txt and valid.txt
keywords.txt

In [1]:
import numpy as np
from matplotlib import pyplot as plt
import PIL
from dtaidistance import dtw
import cv2
import time
import os
import helpers
import pandas as pd

In [2]:
###### IMPORTS
IMAGES = helpers.import_images()

In [3]:
###### OTHER VARIABLES
# IDs of the documents that belong to the validation set
VALID_DOCUMENT_IDS = helpers.get_file('valid.txt')

# IDs of the documents that belong to the training set
TRAIN_DOCUMENT_IDS = helpers.get_file('train.txt')

# list of words which are present in the training set and also in the validation set
KEYWORDS = helpers.get_file('keywords.txt')

# transcription with all information about the words
transcript_list = helpers.get_file('ground-truth/transcription.txt')
TRANSCRIPT = helpers.parse_transcript(transcript_list)

In [4]:
###### FEATURE EXTRACTION
# try loading first
try:
    IMAGES_REDUCED = helpers.load_obj("images_reduced")
except FileNotFoundError:
    print("File not found, calculating feature vectors...")
    IMAGES_REDUCED = helpers.features_and_labels(IMAGES,TRANSCRIPT)
    #IMAGES_REDUCED = helpers.reduce_to_feature_vectors(IMAGES)
    print("Done!")
    helpers.save_obj(IMAGES_REDUCED, "images_reduced")


print(IMAGES_REDUCED[:10])

File not found, calculating feature vectors...
Done!
[{'id': 0, 'document': '270', 'image': array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  8., 11., 16., 19., 22.,
       25., 26., 25., 27., 29., 30., 30., 31., 31., 33., 34., 35., 37.,
       37., 37., 39., 39., 39., 38., 36., 34., 35., 36., 36., 38., 37.,
       34., 34., 31., 27., 23., 21., 18., 15., 12.,  6.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  3.,  9., 12., 14., 17.,
       19., 23., 23., 26., 27., 27., 28., 27., 26., 23., 22., 22., 22.,
       23., 22., 23., 23., 22., 23., 22., 23., 21., 19., 18., 16., 15.,
       14., 14., 16., 21., 22., 24., 27., 28., 29., 29., 29., 28., 28.,
       27., 24., 22., 23., 22., 22., 22., 23., 23., 21., 22., 22., 22.,
       22., 22., 22., 23., 22., 23., 22., 23., 27., 30., 33., 32., 31.,
       29., 27., 26., 21., 16., 12., 10.,  6.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  2.,  3.,  4.,  5.,  5.,  5.,  5.,  6.,
        7.,  7.,  8.,  9.,  8.,  9., 12., 11

### Evaluation

##### Feature Vectors
We used a sliding window with a width of 1 px and an offset of 1 px.  
Features used: 
1. number of black pixels per window
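
With a window that is 1 px wide and moves 1 px at a time, this feature is simply the count of black pixels in each image column. A minimal sketch, assuming a grayscale word image in which dark pixels have low values; the function name `black_pixels_per_column` and the threshold of 128 are assumptions, not the implementation in `helpers`:

```python
import numpy as np


def black_pixels_per_column(image, threshold=128):
    """Return one feature per column: the number of pixels darker than threshold."""
    image = np.asarray(image)
    black = image < threshold          # boolean mask of "black" pixels
    return black.sum(axis=0).astype(float)


# toy 2x3 image: 255 = white, 0 = black
toy = np.array([[255, 0, 0],
                [255, 255, 0]])
print(black_pixels_per_column(toy))
```

Each word image then becomes a one-dimensional sequence whose length equals the image width, which is the kind of input DTW compares.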

##### Questions (2)
*1. How many selected items are relevant?*  
*2. How many of the relevant are selected?*  
Precision = TP / (TP + FP); Recall = TP / (TP + FN)  
where TP: true positives, FP: false positives, FN: false negatives
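
As a small worked example of the two formulas, the sketch below uses made-up counts; the function name `precision_recall` and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio with an empty denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# hypothetical retrieval: 4 words selected, 3 of them correct, 2 relevant words missed
precision, recall = precision_recall(tp=3, fp=1, fn=2)
print(precision, recall)
```

Here 3 of the 4 selected words are relevant (precision 3/4), and 3 of the 5 relevant words are selected (recall 3/5).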


#### Steps
- iterate over the keywords
- for each keyword, retrieve all other words, ranked by their distance to the keyword
- for each retrieved word, check whether its label matches the keyword
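
The steps above can be sketched on toy data as follows. The notebook itself uses `dtaidistance`; the textbook DTW below is only a stand-in so that the example is self-contained, and its values may differ from the library's. The names `dtw_distance` and `rank_candidates` and all feature vectors and labels are made up.

```python
import numpy as np


def dtw_distance(a, b):
    """Textbook DTW with squared point costs; returns the square root of the path cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)   # accumulated cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(np.sqrt(acc[n, m]))


def rank_candidates(query, candidates, labels):
    """Return (label, distance) pairs sorted by increasing distance to the query."""
    distances = [dtw_distance(query, c) for c in candidates]
    order = np.argsort(distances, kind="stable")
    return [(labels[i], distances[i]) for i in order]


# hypothetical feature vectors: the query is an instance of the keyword "the"
query = [0, 3, 6, 3, 0]
candidates = [[5, 5, 5, 5], [0, 2, 6, 3, 0], [0, 3, 3, 6, 3, 0]]
labels = ["and", "the", "the"]

ranking = rank_candidates(query, candidates, labels)
for label, distance in ranking:
    # a retrieved word is correct if its label equals the keyword
    print(label, round(distance, 2), label == "the")
```

Cutting this ranking after the first k entries gives the selected items for that cut-off, from which TP, FP and FN follow.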

In [43]:
keyword_dict = {}
val_set = list()
val_label = list()
train_set = list()
train_label = list()

for index, image in enumerate(IMAGES_REDUCED):
    word = TRANSCRIPT[index]['word']
    
    if image['document'] in TRAIN_DOCUMENT_IDS:
        train_set.append(np.array(image['image'], dtype=float))  # np.float was removed in NumPy 1.24
        train_label.append(word)
        if word in KEYWORDS and word not in keyword_dict:
            # remember the first occurrence of each keyword in the training documents
            keyword_dict[word] = index
    else:
        val_set.append(np.array(image['image'], dtype=float))
        val_label.append(word)

In [70]:
###### GET DISTANCE MATRIX
# compute the matrix with all distances of words to each other
first = 5

# feature vectors (and their labels) to be compared
vectors = train_set[:first]
labels = train_label[:first]

# mirror the upper triangle so that the matrix is symmetric with a zero diagonal
triangular_matrix = dtw.distance_matrix_fast(vectors)
distance_matrix = np.triu(triangular_matrix) + np.triu(triangular_matrix).T
np.fill_diagonal(distance_matrix, 0)
print(distance_matrix[:,0])
print(distance_matrix.shape)

###### PRECISION AND RECALL
# cut-offs: the top 1, 2, ... retrieved words (the query itself is excluded)
ranks = np.arange(1, first)

# separate arrays; "fp = tp" would only alias the same array
tp = np.zeros(len(ranks))
fp = np.zeros(len(ranks))
fn = np.zeros(len(ranks))

for keyword in keyword_dict:
    if keyword not in labels:
        continue  # keyword does not occur among the first words

    # use the first occurrence of the keyword as query
    query = labels.index(keyword)

    # a dict maps each column name to its values (a list of rows caused the ValueError)
    results = pd.DataFrame({'label': labels, 'distance': distance_matrix[:, query]})
    results = results.drop(index=query).sort_values('distance')

    # hits[k-1] = number of correct words among the top k results
    is_match = (results['label'] == keyword).to_numpy()
    hits = np.cumsum(is_match)

    tp += hits
    fp += ranks - hits
    fn += is_match.sum() - hits

# undefined ratios (0/0) become nan instead of emitting warnings
with np.errstate(divide='ignore', invalid='ignore'):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)


[  0.         137.97101145 123.76186812 113.64858116 171.53133824]
(5, 5)