This project is an adaptation of Guillaume's model for NER. I adapted the model to expose more configurable options and to solve additional tasks; the extra configurations were added to investigate how they impact the model's performance. Documentation, which includes a theoretical description of the model and a comparison of the different model configurations, can be found here.
- Added multitasking: this version solves the POS, chunking, and NER tasks
- Added word2vec: choose between Google's pre-trained word2vec embeddings or Stanford's GloVe embeddings, and compare the resulting accuracy
- Added CNN character embedding (a serious bug persists that needs fixing)
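Multitasking works because each token in the CoNLL-2003 data carries a POS tag, a chunk tag, and an NER tag at once. A minimal sketch of splitting one such line into per-task pairs (the helper name is illustrative, not from the repo):

```python
def parse_conll_line(line):
    """Split a CoNLL-2003 line "word POS chunk NER" into per-task (word, tag) pairs."""
    word, pos, chunk, ner = line.split()
    return {"POS": (word, pos), "CHUNK": (word, chunk), "NER": (word, ner)}

# One token from the CoNLL-2003 shared-task layout.
tags = parse_conll_line("Germany NNP I-NP I-LOC")
```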
- In the terminal, run
$ make wembedding
to download the Stanford GloVe file, and
$ ./word2vec_download_google_model.sh
to download the Google News word2vec file
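Both downloads use the same plain-text layout: one token per line followed by its vector components. A minimal sketch of parsing that layout (function name is hypothetical; the repo's own loader may differ):

```python
import io
import numpy as np

def load_embeddings(lines):
    """Parse GloVe/word2vec-style text lines of the form "<token> v1 v2 ... vN"."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Tiny two-word stand-in for a real embedding file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
emb = load_embeddings(sample)
```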
- Run build_data.py to generate the files required for training.
This generates .txt word-embedding files and trimmed versions of them. Note: when written, the word2vec .txt file is over 10 GB in size, so the process may take a long time depending on your machine.
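"Trimming" here means keeping only the embedding rows for words that actually occur in the task vocabulary, so training never has to reload the multi-gigabyte file. A minimal sketch of the idea, assuming a word-to-index vocabulary dict (names are illustrative, not the repo's exact functions):

```python
import numpy as np

def trim_embeddings(vocab, embeddings, dim):
    """Build a dense matrix with one row per vocab word; out-of-vocabulary rows stay zero."""
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for word, idx in vocab.items():
        if word in embeddings:
            matrix[idx] = embeddings[word]
    return matrix

vocab = {"the": 0, "cat": 1, "sat": 2}
embeddings = {"the": np.array([0.1, 0.2]), "cat": np.array([0.3, 0.4])}
trimmed = trim_embeddings(vocab, embeddings, dim=2)
# A trimmed matrix like this can then be stored compactly, e.g. with
# np.savez_compressed, instead of re-reading the full .txt file each run.
```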
- Configure the model in config.py.
Here you can set the model to perform one of the POS, CHUNK, or NER tasks; specify whether to use word embeddings and which type; enable or disable the CRF layer, character embeddings, etc.; and tune a number of hyperparameters. Note: running build_data.py generates a tag-specific file, so to perform NER, for example, you must re-run build_data.py with 'task' in config.py set to NER in order to generate the required tag file.
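To illustrate the kinds of switches described above, here is a hypothetical excerpt of config.py; the option names follow the description but the actual names in the repo may differ:

```python
# Hypothetical config.py excerpt -- option names are illustrative.
task = "NER"              # one of "POS", "CHUNK", "NER"; also controls which tag file build_data.py emits
use_pretrained = True     # whether to use pre-trained word embeddings at all
embedding_type = "glove"  # "glove" or "word2vec"
use_crf = True            # CRF output layer instead of per-token softmax
use_chars = True          # character-level (CNN) embeddings
hidden_size_lstm = 300    # example hyperparameter
learning_rate = 0.001     # example hyperparameter
```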
- Run train.py to train the model.