# Worklog Hannes

## Initial Notes

Began studying what had already been done on relation extraction, especially for medical texts: 

- [Training a Machine Learning Classifier for Relation Extraction from Medical Literature](https://www.microsoft.com/developerblog/2016/09/13/training-a-classifier-for-relation-extraction-from-medical-literature/)
- [Distant supervision for relation extraction without labeled data](http://web.stanford.edu/~jurafsky/mintz.pdf)
- [Relation Extraction Using Support Vector Machine](https://www.aclweb.org/anthology/I05-1033)
- [Exploring Semi-supervised Variational Autoencoders for Biomedical Relation Extraction](https://arxiv.org/abs/1901.06103)


## Relations Model

The idea of this model is to take the mentions extracted from a section or sentence, along with parts of the surrounding text, and identify the relations/interactions between them, if any. How to build a neural net for this is not obvious and will probably take a bit more reading; hopefully I can find a reasonably simple implementation to use as inspiration for the first solution.

### Initial Model Design Suggestion

Marcus gave some suggestions on how to approach building the relation extraction model. In short, using Keras:

#### Network Design
- Input 1, Input 2
- Embeddings ([BioASQ](http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts))
- BiLSTM/Attention
- Concat (Merge layer)
- Dense (1 or more)

Might use spaCy for POS tagging and dependency parsing. The first models should just be trained on interaction vs. no interaction.


## 2019-04-15

Missed the meeting because of chaotic trains. 

Relation extraction with an SVM seems like a good approach, with established performance given certain features.

### Issues

- The [TAC2018 DDI corpus](https://bionlp.nlm.nih.gov/tac2018druginteractions/) isn't really translatable to the problem we are trying to solve.

## 2019-04-25

Meeting with Anja, Pierre, Markus, Vilhelm and Olof:

I've decided to abandon the TAC2018 [DDI corpus](https://bionlp.nlm.nih.gov/tac2018druginteractions/) and have started using the binarized [BioInfer corpus](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50) instead. I'm going to try to use [scispaCy](https://allenai.github.io/scispacy/) for POS tagging.

### BioInfer Interaction Tagging
- Predicate Name (easiest to use for training and should be similar to training on no_interaction/causes/prevents)
  - POS/NEG
  - ACTION/POS(-/+/0)/NEG(-/+/0)
  - Equality/Similarity/F-CONTAIN/MUTUALCOMPLEX/... (many many more)
- Predicate MapsTo
  - Assembly/Binding/Modification/Cleavage/...
- Predicate Effect
  - Other/Physical/Negative_Regulations/... (these nest within each other: Unspecified_regulation -> Negative_regulation)
 

### Issues

- The entity tagging in the BioInfer corpus uses multiple tokens for single entities (mostly solved by merging those tokens for that interaction) and sometimes assigns parts of the same tokens to multiple entities (ignoring these cases makes the most sense, since they basically cause one entity to interact with itself).
- Not sure how to use embeddings well; getting help from Marcus.
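
A minimal sketch of the two entity-tagging workarounds from the first issue above, assuming each entity is given as a set of token indices into the sentence. That is a simplification of my own; the real BioInfer XML uses its own subtoken ids. `merge_entity` and `candidate_pairs` are hypothetical helper names, not part of any corpus tooling.

```python
from itertools import combinations


def merge_entity(tokens, indices):
    """Join the tokens of a (possibly multi-token) entity into one string."""
    return " ".join(tokens[i] for i in sorted(indices))


def candidate_pairs(entities):
    """Yield entity-id pairs whose token indices do not overlap.

    `entities` maps an entity id to the set of token indices it covers.
    Pairs that share a token are skipped, since they would make an
    entity interact with (part of) itself.
    """
    ordered = sorted(entities.items(), key=lambda item: item[0])
    for (id_a, idx_a), (id_b, idx_b) in combinations(ordered, 2):
        if set(idx_a) & set(idx_b):
            continue
        yield id_a, id_b


# Toy sentence: e3 shares token 1 with e1, so the (e1, e3) pair is dropped.
tokens = ["alpha", "catenin", "binds", "beta", "catenin"]
entities = {"e1": {0, 1}, "e2": {3, 4}, "e3": {1}}

merged = {eid: merge_entity(tokens, idx) for eid, idx in entities.items()}
pairs = list(candidate_pairs(entities))
```

Skipping overlapping pairs loses a few examples, but keeps every remaining pair well defined.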



## 2019-04-30

Built the first working classifier for the BioInfer corpus, running the [SGDClassifier](https://scikit-learn.org/stable/modules/sgd.html) from scikit-learn. Using only entity distance and n-grams gives ~70% accuracy for both no_interaction/interaction and no_interaction/POS/NEG, and ~35% accuracy for the third predicate part (Equality/Similarity/F-CONTAIN/MUTUALCOMPLEX/... (many many more)). The plan is to build the features:

- n-gram
- Distance
- POS
- Dependencies

The plan is then to test these features with the [SGDClassifier](https://scikit-learn.org/stable/modules/sgd.html), record the results, and then try switching to a Keras solution and adding embeddings.
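
A rough sketch of the two features that already exist (n-grams and entity distance), written in plain Python. It assumes the n-grams are taken from the tokens between the two entities and that distance is counted in tokens; the notes above do not pin down either choice. `word_ngrams` and `pair_features` are hypothetical names.

```python
def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of `tokens` as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pair_features(tokens, pos_a, pos_b, n=2):
    """Build a sparse feature dict for one entity pair.

    `pos_a` and `pos_b` are the token positions of the two entities.
    Assumption: n-grams come only from the tokens between the entities.
    """
    lo, hi = sorted((pos_a, pos_b))
    features = {"distance": hi - lo}
    for gram in word_ngrams(tokens[lo + 1:hi], n):
        key = "ngram=" + gram
        features[key] = features.get(key, 0) + 1
    return features


tokens = "profilin binds to actin monomers".split()
feats = pair_features(tokens, 0, 3)
# feats == {"distance": 3, "ngram=binds to": 1}
```

Feature dicts like this can be turned into a sparse matrix with scikit-learn's `DictVectorizer` before being passed to a classifier.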

### Issues
- scispaCy/spaCy is not working in my current environment; will rebuild the environment and try to fix the issue.

## 2019-05-06

Meeting in the BMC with the whole team. 

I presented the results from my initial scikit-learn classifier. The plan is to keep working on the code and make the POS extraction better and more reliable (mostly some parsing changes). Hopefully I can commit the Keras solution and embeddings after next week's meeting.

spaCy is working right now, and hopefully switching to scispaCy will work as well and show a demonstrable improvement in accuracy.

### Issues
- Code needs a lot of cleanup and commenting; not sure when it will make sense for me to spend the time needed for this.

## 2019-05-13
Adjusted the code a lot: fixed scispaCy, added dependencies to the features, switched to the regular SVC, etc. Getting worse results on the simpler classification (55%-65%) but slightly better on the harder classifications (45%). The results change way too much depending on how many negative examples the model trains on, which is annoying. I'm thinking of making a test set with far fewer negative examples while keeping more of them for training; will discuss at the meeting tomorrow. Been reading more about SVM settings: [in-depth-parameter-tuning-for-svc](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769)
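
One way to make the negative/positive ratio explicit per split, so that results from different runs are comparable, could look like the sketch below. It assumes examples are `(sentence, label)` tuples and that the negative class is labelled `no_interaction`; `downsample_negatives` is a hypothetical helper, not existing project code.

```python
import random


def downsample_negatives(examples, ratio=1.0,
                         negative_label="no_interaction", seed=0):
    """Keep all positives and at most `ratio` negatives per positive."""
    positives = [ex for ex in examples if ex[1] != negative_label]
    negatives = [ex for ex in examples if ex[1] == negative_label]
    keep = min(len(negatives), int(ratio * len(positives)))
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    result = positives + rng.sample(negatives, keep)
    rng.shuffle(result)
    return result


# Toy data: 90 negatives and 10 positives.
examples = [("s%d" % i, "no_interaction") for i in range(90)]
examples += [("p%d" % i, "POS") for i in range(10)]

balanced = downsample_negatives(examples, ratio=2.0)
# 10 positives plus 2.0 * 10 = 20 sampled negatives
```

A different `ratio` could then be used for the training and test splits. Another option worth trying is `class_weight="balanced"` on scikit-learn's `SVC`, which reweights classes instead of dropping examples.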

## 2019-05-14

Meeting in the BMC with the whole team. 

The entity tagging and relations training sets seem to be coming along, although getting them only next week will make things a little tight for me. Discussed some distinctions regarding what my relation extractor will try to do. As I have not worked on entity tagging, it will need sentences with the entities already tagged in some manner.

## 2019-05-17

Built a Keras neural net model for the type of inputs discussed at the last meeting. It reaches around 70% accuracy on no_interaction/Positive/Negative with the BioInfer data. Currently it simply converts the two entities to "ENTITY1" and "ENTITY2" and trains on the processed sentences.
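
The entity replacement step can be sketched roughly as follows, assuming each entity is a `(start, end)` token span with `end` exclusive and that the two spans do not overlap. `mask_entities` is a hypothetical name used only for illustration.

```python
def mask_entities(tokens, span_a, span_b):
    """Replace two non-overlapping token spans with ENTITY1 / ENTITY2.

    `span_a` becomes ENTITY1 and `span_b` becomes ENTITY2, regardless of
    which one comes first in the sentence.
    """
    (first, first_tag), (second, second_tag) = sorted(
        [(span_a, "ENTITY1"), (span_b, "ENTITY2")]
    )
    if first[1] > second[0]:
        raise ValueError("entity spans overlap")
    return (
        tokens[:first[0]] + [first_tag]
        + tokens[first[1]:second[0]] + [second_tag]
        + tokens[second[1]:]
    )


tokens = "alpha catenin binds beta catenin in vitro".split()
masked = " ".join(mask_entities(tokens, (0, 2), (3, 5)))
# masked == "ENTITY1 binds ENTITY2 in vitro"
```

Masking this way keeps the model from memorising specific entity names and collapses multi-token entities into a single token.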

I'm planning to introduce both stemming to the pre-processing and embeddings to the net.

### Issues
- Still somewhat unsure about the details of how to use Keras.
- Can't seem to "overtrain" the model on the training data no matter the batch size and number of epochs. It reaches 90% at best, which might be an issue with the data more than anything; it's both small and arguably not the most consistent.

## 2019-05-21
Introduced 3-grams to the Keras model. This allows much better results on the training set; 30 epochs with batch size 20 gives the output:

```console
Training set: 
Accuracy: 0.9765
Loss: 0.0121


Test set: 
Accuracy: 0.7748
Loss: 0.1310
```

An improvement in how well the model learns the training set (this is overtrained, but it's good that you CAN overtrain on the training set).

Found a good source on word vector learning: [Distributed Representations of Sentences and Documents](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)