Comparison of Entity Extraction techniques for annotation - build, train, and evaluate machine learning models for Named Entity Recognition.

SimoneRichetti/NER-comparison

Comparison of Entity Extraction techniques for annotation

This repo contains implementations of several machine learning algorithms for Named Entity Recognition. We build, train, and evaluate them on several datasets, comparing three aspects: prediction quality, memory consumption, and inference latency.
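As a rough illustration of how latency and memory could be measured for any of the models, here is a minimal sketch that uses only the Python standard library. The `profile_inference` helper and the `dummy_tagger` stand-in are hypothetical names, not part of this repo, and `tracemalloc` only sees Python-level allocations, so native memory used by TensorFlow would need a different tool.

```python
import time
import tracemalloc
from statistics import mean


def profile_inference(predict_fn, batches):
    """Return mean latency per batch (seconds) and peak Python-level memory (bytes)."""
    # Warm-up call so one-off initialisation cost does not skew the timings.
    predict_fn(batches[0])

    # Pass 1: wall-clock latency of each call.
    latencies = []
    for batch in batches:
        start = time.perf_counter()
        predict_fn(batch)
        latencies.append(time.perf_counter() - start)

    # Pass 2: peak memory, traced separately because tracing slows execution.
    tracemalloc.start()
    for batch in batches:
        predict_fn(batch)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {"mean_latency_s": mean(latencies), "peak_memory_bytes": peak_bytes}


# Hypothetical stand-in for a trained tagger: labels every token "O".
def dummy_tagger(sentences):
    return [["O"] * len(sentence) for sentence in sentences]


batches = [[["Simone", "lives", "in", "Modena"]] for _ in range(5)]
stats = profile_inference(dummy_tagger, batches)
print(stats)
```

Replacing `dummy_tagger` with a model's own prediction function would, under these assumptions, give comparable numbers across algorithms.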

Dependencies

See environment.yml. In general, I used tensorflow.keras and scikit-learn for my ML experiments 🔮.

Setup

conda env create -f environment.yml
conda activate ner-suite

You can now play with the notebooks!

Project Structure

  • data/: directory where all the datasets used in the notebooks are saved. The datasets are:
  • embeddings/: directory that contains the pre-trained word embeddings:
    • glove.6B.100d.txt for English;
    • w2v.itWac.128d.txt for Italian;
  • utils: a package I wrote to make the code more modular, reusable, and readable;
  • <algo>-<dataset>.ipynb: the notebooks with our experiments;
  • environment.yml: conda environment file for replicating the environment on your machine and reproducing the experiments;
  • results.xlsx: results of the experiments.
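For readers who want to inspect the embedding files, the sketch below parses the plain-text layout used by GloVe, where each line holds a word followed by its space-separated vector components. The `load_text_embeddings` helper is a hypothetical name, not a function from `utils`, and it is only an assumption that the itWac file follows the same layout (lines with an unexpected number of fields, such as a word2vec-style header, are skipped).

```python
import io


def load_text_embeddings(lines, dim):
    """Parse 'word v1 v2 ... v_dim' lines into a {word: [floats]} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        # Skip blank lines and header lines (e.g. a word2vec "<count> <dim>" line).
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a file such as embeddings/glove.6B.100d.txt.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
embeddings = load_text_embeddings(sample, dim=3)
print(embeddings["cat"])  # → [-0.5, 0.0, 0.25]
```

With a real file, the same function would be called on an open file handle, for example `load_text_embeddings(open("embeddings/glove.6B.100d.txt", encoding="utf-8"), dim=100)`.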

Models and references:

  • Conditional Random Fields: a traditional machine learning algorithm for labeling sequences. Refer to the original paper and the implementation of the sklearn wrapper;
  • LSTM: a widely used recurrent neural network for modeling sequences. We also use it in combination with pre-trained embeddings such as GloVe and itWac;
  • End-to-end model: this paper proposes a model that combines a CNN to extract morphological features from the characters of each word, GloVe embeddings to represent word-level features, a bidirectional LSTM to model the context, and finally a CRF layer to decode the best sequence of labels. We implemented it, building on the work already done in this repo.
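To make the "decode the best sequence of labels" step concrete, here is a minimal pure-Python sketch of Viterbi decoding, the standard way a linear-chain CRF picks its output. It is not the code used in the notebooks: the `viterbi_decode` function, the three-label tag set, and the hand-picked scores are all assumptions for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring sequence of label indices.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        step_scores, step_back = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            best_prev = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            step_back.append(best_prev)
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emissions[t][j])
        scores = step_scores
        backpointers.append(step_back)

    # Follow the backpointers from the best final label.
    best = max(range(n_labels), key=lambda j: scores[j])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    path.reverse()
    return path


labels = ["O", "B-PER", "I-PER"]
# Hand-picked scores: the O -> I-PER transition is heavily penalised.
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
# Per-token label scores, as a BiLSTM might produce them (illustrative values).
emissions = [[1.0, 0.9, 0.0],
             [0.0, 0.0, 2.0]]
decoded = [labels[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)  # → ['B-PER', 'I-PER']
```

Picking the best label per token independently would give `O, I-PER` here, which is not a valid BIO sequence. The transition scores steer the decoder to `B-PER, I-PER` instead, which is the benefit of adding a CRF layer on top of the LSTM.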

TODO

  • Improve documentation;
