<a href="https://colab.research.google.com/github/TristanFaine/Master_2_MLVC_Recognize_Handwritten_Equation/blob/main/code_template_explanation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task

We were given InkML files, which contain metadata and a list of strokes, each stroke being a sequence of points such as [(0,0),(1,0)].  
These are on-line handwritten mathematical expressions, which we try to recognize and output as an LG (Labelled Graph).  
This is the sequence of actions performed:  
1) Determine possible stroke combinations.  
2) Remove impossible combinations with a classifier.  
3) Convert each combination to a symbol.  

We'll handle spatial relations later.  
We'll also handle the final decision later; a grammar or language model can be added at that point.
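
To make the input concrete, here is a minimal sketch of reading strokes out of an InkML document with the standard library. It assumes each stroke is stored in a `<trace>` element of the W3C InkML namespace, as a comma-separated list of space-separated coordinates; the function name `parse_traces` and the sample document are made up for illustration and are not taken from the project scripts.

```python
import xml.etree.ElementTree as ET

# Namespace used by InkML documents (W3C InkML).
INKML_TRACE = "{http://www.w3.org/2003/InkML}trace"


def parse_traces(inkml_text):
    """Return {trace_id: [(x, y), ...]} for every <trace> element."""
    root = ET.fromstring(inkml_text)
    strokes = {}
    for trace in root.iter(INKML_TRACE):
        points = []
        for chunk in (trace.text or "").split(","):
            coords = chunk.split()
            # Keep x and y only; extra channels (time, pressure) are ignored.
            if len(coords) >= 2:
                points.append((float(coords[0]), float(coords[1])))
        strokes[trace.get("id")] = points
    return strokes


# Hypothetical two-stroke document, not one of the project files.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace id="0">0 0, 1 0</trace>
  <trace id="1">2 1, 2 3, 3 3</trace>
</ink>"""

strokes = parse_traces(SAMPLE)
print(strokes)
```

Trace identifiers are kept as strings here, since that is how they appear in the XML attributes.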

# Environment setup

## Getting project files

We will import the project files from our GitHub repository.

In [1]:
!git clone https://github.com/TristanFaine/Master_2_MLVC_Recognize_Handwritten_Equation.git

Cloning into 'Master_2_MLVC_Recognize_Handwritten_Equation'...
remote: Enumerating objects: 122372, done.
remote: Counting objects: 100% (2661/2661), done.
remote: Compressing objects: 100% (2531/2531), done.
remote: Total 122372 (delta 163), reused 2605 (delta 116), pack-reused 119711
Receiving objects: 100% (122372/122372), 50.25 MiB | 15.11 MiB/s, done.
Resolving deltas: 100% (167/167), done.
Checking out files: 100% (211970/211970), done.


In [2]:
%cd Master_2_MLVC_Recognize_Handwritten_Equation/code

/content/Master_2_MLVC_Recognize_Handwritten_Equation/code


We will first show what each script does, and how to interpret their output, then we will show how to use the evaluation scripts, and finally detail how each script functions.

## Data to train our classifiers

[This](https://uncloud.univ-nantes.fr/index.php/s/OdAtErZgxKGjsNy) contains a bunch of symbols (in datasymbol_iso/) and expressions (in FullExpressions/) to help train our future classifiers.



## Showcasing parts of code

### segmenter.py explanation

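The first action is to collect the possible stroke combinations. For now we simply take every combination of consecutive strokes.  
With n strokes there are at most n(n+1)/2 such combinations, so 10 for an inkml file with 4 strokes. The example below, with 5 strokes, lists 14 hypotheses of one to four strokes each.  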

Each line of the output corresponds to a hypothesis. Its fields are, in order: the entry type (`O`, an object, i.e. a symbol); the symbol with its index (indices start at 1, but this is a dummy value such as `hyp0` for now, since we don't know yet what symbol the combination could be); the symbol without index (`*` for now); the confidence of the model; and finally the strokes used.
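
As an illustration of this layout, the sketch below splits one hypothesis line into named fields. The `Hypothesis` class and `parse_hypothesis` function are hypothetical helpers, not part of the project, and the sketch assumes stroke identifiers are integers, as in the example output of segmenter.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    kind: str          # entry type, "O" for an object
    identifier: str    # symbol with index, or a dummy such as "hyp5"
    label: str         # symbol without index, "*" when unknown
    confidence: float  # confidence of the model
    strokes: List[int] # indices of the strokes used


def parse_hypothesis(line):
    """Split a comma-separated hypothesis line into its fields."""
    fields = [field.strip() for field in line.split(",")]
    return Hypothesis(
        kind=fields[0],
        identifier=fields[1],
        label=fields[2],
        confidence=float(fields[3]),
        strokes=[int(s) for s in fields[4:]],  # assumes integer stroke ids
    )


hyp = parse_hypothesis("O,hyp5,*,1.0,0,1")
print(hyp)
```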

At this point, we could already optimize by removing hypotheses that don't make sense. For instance, a symbol made of every single stroke is highly irregular and shouldn't happen, so we could remove that combination before calling the other scripts.

The ground truth for the segmentation is available in the original lg files, since the strokes used are listed next to each symbol.

In [48]:
!python3 segmenter.py -i ../data/formulaire001-equation001.inkml -o ../data/example.lg

../data/example.lg


In [47]:
!python3 segmenter.py -i ../data/formulaire001-equation001.inkml

O,hyp0,*,1.0,0
O,hyp1,*,1.0,1
O,hyp2,*,1.0,2
O,hyp3,*,1.0,3
O,hyp4,*,1.0,4
O,hyp5,*,1.0,0,1
O,hyp6,*,1.0,1,2
O,hyp7,*,1.0,2,3
O,hyp8,*,1.0,3,4
O,hyp9,*,1.0,0,1,2
O,hyp10,*,1.0,1,2,3
O,hyp11,*,1.0,2,3,4
O,hyp12,*,1.0,0,1,2,3
O,hyp13,*,1.0,1,2,3,4



As we've said earlier, we simply keep every consecutive stroke combination as our possible hypotheses.
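
The enumeration of consecutive stroke combinations can be sketched as follows. This is not the code of segmenter.py: the function names are made up, and the `max_size=4` cap is an assumption inferred from the output above, where no hypothesis uses more than four strokes.

```python
def consecutive_combinations(n_strokes, max_size=4):
    """List every run of consecutive stroke indices, shortest runs first.

    max_size is an assumed cap on the number of strokes per hypothesis.
    """
    combos = []
    for size in range(1, min(max_size, n_strokes) + 1):
        for start in range(n_strokes - size + 1):
            combos.append(list(range(start, start + size)))
    return combos


def to_lg_lines(combos):
    """Format combinations as hypothesis lines with a dummy label and confidence."""
    return [
        "O,hyp{},*,1.0,{}".format(index, ",".join(str(s) for s in combo))
        for index, combo in enumerate(combos)
    ]


lines = to_lg_lines(consecutive_combinations(5))
print("\n".join(lines))
```

With 5 strokes and this cap, the sketch yields 5 + 4 + 3 + 2 = 14 lines in the same order as the output shown above.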

### Git stuff

If a commit goes wrong, either amend it or run the two cells below.

In [136]:
#!git stash
#!git stash drop

No local changes to save
No stash entries found.


In [137]:
#!git reset --soft HEAD^ 

In [145]:
#!git pull

error: You have not concluded your merge (MERGE_HEAD exists).
hint: Please, commit your changes before merging.
fatal: Exiting because of unfinished merge.


In [3]:
#!git config --global user.email "XXX@gmail.com"
#!git config --global user.name "XXX"

In [148]:
#!git add .

In [149]:
#!git add ../data/

In [150]:
#!git add ../lgeval

In [151]:
#!git commit -m "Fixed error that was introduced in lgEval/src/sumMetric"

[main 3acb30cb0] Fixed error that was introduced in lgEval/src/sumMetric
 3 files changed, 3 insertions(+), 7 deletions(-)


In [152]:
#!git push https://hiddentoken@github.com/TristanFaine/Master_2_MLVC_Recognize_Handwritten_Equation.git

Counting objects: 8, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 654 bytes | 654.00 KiB/s, done.
Total 8 (delta 7), reused 0 (delta 0)
remote: Resolving deltas: 100% (7/7), completed with 7 local objects.
To https://github.com/TristanFaine/Master_2_MLVC_Recognize_Handwritten_Equation.git
   a7c31c815..3acb30cb0  main -> main


### CROHME_train_segmentSelector.py & segmentSelect.py explanation

Since we use neural networks in our overall process as our classifiers/predictors, they need to be trained beforehand. But let's first explain what we're trying to do in this part of the process:  

The script 'segmentSelect.py' takes as input the initial inkml file, alongside the "prototype" lg file: We combine the stroke combinations from the prototype file alongside the inkml stroke data to generate images, then we check whether these images make sense as a symbol, no matter the context.  
This is a classification problem with two possible outputs : Image is valid or invalid.  
We also check the confidence value of the model with a threshold in order to ignore unsure hypotheses, which should boost accuracy somewhat.
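
Turning a stroke combination into an image could look like the sketch below. It is a pure-Python illustration under several assumptions: the function name `rasterize`, the grid size and the sampling density are made up, and the real script may well use an imaging library instead. The strokes are scaled with a single factor so the aspect ratio is preserved, and the input is assumed to contain at least one point.

```python
def rasterize(strokes, size=32, samples_per_segment=64):
    """Draw strokes (lists of (x, y) points) onto a size x size binary grid."""
    points = [point for stroke in strokes for point in stroke]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    # One scale for both axes keeps the aspect ratio of the symbol.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (size - 1) / span

    grid = [[0] * size for _ in range(size)]

    def plot(x, y):
        col = int(round((x - min_x) * scale))
        row = int(round((y - min_y) * scale))
        grid[row][col] = 1

    for stroke in strokes:
        if len(stroke) == 1:
            plot(*stroke[0])  # a dot made of a single point
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            # Sample each segment densely instead of using a line algorithm.
            for k in range(samples_per_segment + 1):
                t = k / samples_per_segment
                plot(x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return grid


# Hypothetical "+" sign made of two strokes.
image = rasterize([[(0, 0), (10, 0)], [(5, -5), (5, 5)]], size=11)
print("\n".join("".join("#" if v else "." for v in row) for row in image))
```

The resulting grid is what a valid/invalid classifier would receive as input, after conversion to whatever tensor format the model expects.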

For the training part, we could simply store the weights of the models and import them, but we still want to show the specifics of our training, which stem from the characteristics of our data:

While training the model batch by batch, we make sure that each batch contains a representative random subset of the original data. We have far fewer invalid images than valid ones for training, so we also try to make sure that the model does not excessively favor valid images.  
The rest of the training logic is standard: the state of the model is saved whenever the validation loss reaches a new minimum, and we implemented early stopping to prevent overfitting.
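
One common way to keep a rare class from being drowned out is to give every class the same quota in each batch. The sketch below shows that idea only; it is an assumption about how such balancing can be done, not a description of what CROHME_train_segmentSelector.py actually does, and the names `balanced_batches` and `per_class` are made up.

```python
import random


def balanced_batches(samples_by_class, per_class, n_batches, seed=0):
    """Yield batches holding exactly per_class samples of every class."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for label, samples in samples_by_class.items():
            # Sampling with replacement lets a small class fill its quota.
            batch.extend((s, label) for s in rng.choices(samples, k=per_class))
        rng.shuffle(batch)
        yield batch


# Made-up, imbalanced data set: many more valid than invalid images.
data = {
    "valid": ["valid_%d" % i for i in range(900)],
    "invalid": ["invalid_%d" % i for i in range(100)],
}
batches = list(balanced_batches(data, per_class=16, n_batches=5))
for batch in batches:
    counts = {label: sum(1 for _, l in batch if l == label) for label in data}
    print(counts)
```

The trade-off is that minority samples are repeated more often than majority ones, so the early stopping mentioned above matters even more to limit overfitting on them.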

In [33]:
!python3 CROHME_train_segmentSelector.py

cuda:0
('invalid', 'valid')
nb classes 2 , training size 12000, val size 4000, test size 4000
invalid valid valid invalid
AlexNet(
  (layer1): Sequential(
    (0): Conv2d(1, 96, kernel_size=(11, 11), stride=(4, 4), bias=False)
    (1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (layer2): Sequential(
    (0): Conv2d(96, 384, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), bias=False)
    (1): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (layer3): Sequential(
    (0): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (fc): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU()
  )
  (fc1): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(

### segmentSelect.py example

In [36]:
!python3 segmentSelect.py -o ../data/example2.lg ../data/formulaire001-equation001.inkml ../data/example.lg 

In [35]:
!python3 segmentSelect.py ../data/formulaire001-equation001.inkml ../data/example.lg 

O,hyp6,*,0.5201140642166138,1,2
O,hyp7,*,0.5395737886428833,2,3
O,hyp8,*,0.9398249983787537,3,4
O,hyp13,*,0.8418141603469849,1,2,3,4



This script takes as input the initial inkml file alongside the "prototype" lg file. We combine the stroke combinations with the inkml data to generate images, then check whether each image makes sense as a symbol, regardless of context.

For now this simply gives a fully random probability that the images are valid symbols.

Two things still need to change here: first, add a classifier to check the validity of the supposed symbols;  
then set the threshold based on empirical evidence or, failing that, on a value picked by hand.
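
The thresholding step itself is simple and can be sketched as below. The function name `filter_hypotheses` and the default threshold of 0.5 are assumptions for illustration; the first sample line is invented, while the other two are taken from the output above.

```python
def filter_hypotheses(lines, threshold=0.5):
    """Keep object lines whose confidence (4th field) reaches the threshold."""
    kept = []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 5 or fields[0] != "O":
            continue  # skip anything that is not a complete object line
        if float(fields[3]) >= threshold:
            kept.append(line.strip())
    return kept


scored = [
    "O,hyp5,*,0.31,0,1",                # invented low-confidence example
    "O,hyp6,*,0.5201140642166138,1,2",
    "O,hyp8,*,0.9398249983787537,3,4",
]
print(filter_hypotheses(scored, threshold=0.5))
print(filter_hypotheses(scored, threshold=0.9))
```

Raising the threshold discards more unsure hypotheses, at the risk of also dropping a correct segmentation, which is why the value is better chosen from validation data.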

### segmentReco.py explanation

In [34]:
!python3 CROHME_train_segmentReco.py

cuda:0
['i', 'a', 'gamma', 'M_', 'q', 'dot', 'geq', 'p', 'm', 'o', 'd', 'int', 's', ']', 'h', 'H_', 'b', 'pi', 'P_', 'forall', '!', 'beta', 'rightarrow', '+', 'e', 'log', 'A_', 'X_', ',', 'sum', 'y', 'G_', 'sqrt', 'R_', '-', 'C_', 'in', 'phi', 'Delta', '7', 'x', 'E_', 'B_', 'sigma', '8', 'lim', 'z', 'N_', '0', 'n', '{', 'sin', 'pm', 'tan', 'g', 'prime', 'leq', 'div_op', 'S_', '1', '6', 't', ')', 'neq', 'times', '}', '(', 'L_', 'lambda', 'cos', 'pipe', 'u', 'V_', 'v', 'lt', 'I_', 'k', '4', 'w', '3', 'mu', 'F_', 'ldots', '[', 'c', 'alpha', '=', '2', 'r', 'infty', 'Y_', 'f', 'exists', 'j', 'T_', '9', '5', 'theta', 'l', 'gt', 'div']
nb classes 101 , training size 6000, val size 2000, test size 2000
    a     a gamma     i
AlexNet(
  (layer1): Sequential(
    (0): Conv2d(1, 96, kernel_size=(11, 11), stride=(4, 4), bias=False)
    (1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (layer2): Sequential(
    (0): Conv2d(96, 384, kernel_si

In [38]:
!python3 symbolReco.py -o ../data/example3.lg ../data/formulaire001-equation001.inkml ../data/example2.lg

In [37]:
!python3 symbolReco.py ../data/formulaire001-equation001.inkml ../data/example2.lg

O,hyp6,M_,0.9991289973258972,1,2
O,hyp7,M_,0.9998371601104736,2,3
O,hyp8,M_,0.9998027682304382,3,4
O,hyp13,M_,1.0,1,2,3,4



TODO: write what this does

### selectBestSeg.py explanation

In [40]:
!python3 selectBestSeg.py -o ../data/examplefinal.lg  ../data/example3.lg

In [39]:
!python3 selectBestSeg.py ../data/example3.lg

O,hyp13,M_,1.0,2,4,1,3



TODO: write what this does

# Process everything

In [41]:
!chmod 755 ./processAll.sh

In [64]:
! ./processAll.sh ../data/inkml_gt_mini ../data/lg_output

Recognize: ../data/inkml_gt_mini/UN_101_em_0.inkml
../data/lg_output/hyp/UN_101_em_0.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_10.inkml
../data/lg_output/hyp/UN_101_em_10.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_11.inkml
../data/lg_output/hyp/UN_101_em_11.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_12.inkml
../data/lg_output/hyp/UN_101_em_12.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_13.inkml
../data/lg_output/hyp/UN_101_em_13.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_14.inkml
../data/lg_output/hyp/UN_101_em_14.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_15.inkml
../data/lg_output/hyp/UN_101_em_15.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_16.inkml
../data/lg_output/hyp/UN_101_em_16.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_17.inkml
../data/lg_output/hyp/UN_101_em_17.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_18.inkml
../data/lg_output/hyp/UN_101_em_18.lg
Recognize: ../data/inkml_gt_mini/UN_101_em_19.inkml
../data/lg_output/hyp/UN_101_em_19.lg
Recognize: .

## Evaluating

In [66]:
!chmod 755 ../lgeval/bin/evaluate

#Use an absolute path here, Antoine! Thanks.

#If it doesn't work on Ikoula, well, good luck.

In [94]:
# doing !export LgEvalDir = "/content/Master_2_MLVC_Recognize_Handwritten_Equation/lgeval/"
# doesn't work on colab since "The reason is that !export will assign the environment variable in an ephemeral sub-shell. But, you want to update the environment for the Python subprocess that spawns those sub-shells."
import os
os.environ['LgEvalDir'] = "/content/Master_2_MLVC_Recognize_Handwritten_Equation/lgeval/"

In [138]:
#!rm -rf Results_result

#Use an absolute path here, Antoine! Thanks.

The folder containing the metrics is created from wherever the evaluate script is called from.

In our case, that'd be the cloned repo's code folder.

In [131]:
! ../lgeval/bin/evaluate /content/Master_2_MLVC_Recognize_Handwritten_Equation/data/lg_output/result /content/Master_2_MLVC_Recognize_Handwritten_Equation/data/lg_gt

The output stream was truncated and only contains the last 5000 lines.
      /content/Master_2_MLVC_Recognize_Handwritten_Equation/data/lg_gt/UN_456_em_735.lg
      ['0', '1', '10', '2', '3', '4', '5', '6', '7', '8', '9']
  >> Comparing UN_456_em_736.lg
  !! IO Error (cannot open): /content/Master_2_MLVC_Recognize_Handwritten_Equation/data/lg_output/result/UN_456_em_736.lg
  !! Inserting ABSENT nodes for:
      /content/Master_2_MLVC_Recognize_Handwritten_Equation/data/lg_output/result/UN_456_em_736.lg vs.
      /content/Master_2_MLVC_Recognize_Handwritten_Equation/data/lg_gt/UN_456_em_736.lg
      ['0', '1', '10', '11', '12', '13', '14', '15', '16', '2', '3', '4', '5', '6', '7', '8', '9']
  !! Inserting ABSENT nodes for:
      /content/Master_2_MLVC_Recognize_Handwritten_Equation/data/lg_output/result/UN_456_em_736.lg vs.
      /content/Master_2_MLVC_Recognize_Handwritten_Equation/data/lg_gt/UN_456_em_736.lg
      ['0', '1', '10', '11', '12', '13', '14', '15', '16

You can now check the results in Results_result/Summary.txt.

TODO: I think I made a mistake when converting the lgeval code to Python 3, or when fixing the number of spaces/tabs.

If something seems wrong after executing evaluate, restore the original lgeval src scripts, then re-convert everything carefully with the original file side by side.
