This project was developed by Davide Cavicchini as part of the master's course "Applied Natural Language Processing". The goal is to develop a Semantic Role Labeling (SRL) system using a transformer encoder model. This repository includes data preprocessing, model training, and inference functionalities.
```
├── ConceptNET - Unused scripts to generate data from ConceptNET
│   ├── ...
│
├── dataloaders
│   ├── pre-process.py - General preprocessing script
│   ├── preprocess_nombank.py - Preprocessing script for the NomBank dataset
│   ├── preprocess_UP.py - Preprocessing script for the Universal Proposition dataset
│   ├── UP_dataloader.py - Data loader for the Universal Proposition dataset
│   ├── NOM_dataloader.py - Data loader for the NomBank dataset
│
├── inference
│   ├── inference.py - Inference script for making predictions on new sentences from the terminal
│   ├── interactive_KG.py - Generates an interactive Knowledge Graph from a file or a Wikipedia page
│
├── train
│   ├── functions.py - Functions for training the model and evaluating the results
│   ├── utils.py - Utility functions to load the data
│   ├── main.py - Main script for training the model
│
├── model.py - Core model architecture
├── evaluation.py - Computes the metrics of the saved models on the test set and saves them to a file
├── requirements.txt - Python dependencies
│
├── .gitignore - Git ignore file
├── README.md - Project README file
```
- Clone the repository:

  ```bash
  git clone https://github.com/DavidC001/ANLP.git
  cd ANLP
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Note: to use CUDA you may need to install the PyTorch build that matches your CUDA version.
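For example, a minimal sketch assuming CUDA 12.1 (pick the index URL matching your local CUDA version; see the official PyTorch installation page for the exact command):

```bash
# Example only: installs a CUDA 12.1 build of PyTorch.
# Replace cu121 with the tag matching your CUDA version.
pip install torch --index-url https://download.pytorch.org/whl/cu121
```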
Before training the model, you need to preprocess the data. The dataloaders directory contains scripts for preprocessing different datasets:
- `preprocess_UP.py`: preprocesses the Universal Proposition dataset.
- `preprocess_nombank.py`: preprocesses the NomBank dataset. Modified from this implementation.
- `pre-process.py`: preprocesses both datasets.
To run these scripts, you need to adjust the paths and parameters according to your setup. These are the links to download the datasets:
For the paths to adjust in the NomBank preprocessing script, refer to this repository. Note that you also need to download the ALGNLP_2 data and adjust its paths.
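Once the paths are set, the scripts can be run directly; a sketch, assuming they are launched from the repository root:

```bash
# Sketch: run after editing the paths/parameters inside each script.
python dataloaders/preprocess_UP.py
python dataloaders/preprocess_nombank.py
```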
The core model is defined in model.py, including the SRL_MODEL class, which extends a pre-trained transformer encoder based on the BERT architecture.
For more details on the model architecture, refer to the project report.
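One of the options exposed by the model is how predicate and word representations are combined (see the `combine_method` hyperparameter below). As an illustration, here is a minimal sketch of a common "gating" formulation; it is illustrative only, and the repository's exact implementation lives in model.py:

```python
import torch
import torch.nn as nn

class GatingCombine(nn.Module):
    """Illustrative gating combiner: a learned sigmoid gate decides, per
    dimension, how much of the word vs. predicate representation to keep.
    A common formulation, not necessarily the one used in model.py."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, word: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([word, pred], dim=-1)))
        return g * word + (1 - g) * pred
```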
Training is managed by train/main.py. The datasets need to be properly preprocessed and available in the datasets/preprocessed directory.
In the script you can define which tests to run and the model hyperparameters; here is an example configuration:
```python
tests = {
    "SRL_DISTILBERT": {
        "model_name": "distilbert-base-uncased",  # name of the encoder model to use
        "combine_method": "gating",  # how to combine the predicate and word representations: "sum", "concat", "soft_attention", or "gating"
        "role_layers": [256],  # hidden dimensions of the role classifier
        "norm_layer": True,  # whether to apply layer normalization
        "proj_dim": 512,  # dimension of the projection layer
        "relation_proj": True,  # whether to project the relation representation
        "role_RNN": True,  # whether to use a recurrent layer in the role classifier
        "RNN_type": "GRU",  # type of recurrent layer: "RNN", "LSTM", or "GRU"
        "train_encoder": True,  # whether to train the encoder
        "train_embedding_layer": True,  # whether to train the embedding layer
        "dropout_prob": 0.5,  # dropout rate
    },
}
```

For more information on how these options modify the model, refer to the documentation in the model.py file.
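A hypothetical usage sketch (the assumption here is that `SRL_MODEL` accepts these hyperparameters as keyword arguments, as the tests dictionary suggests; check model.py for the actual constructor signature):

```python
# Hypothetical sketch: assumes SRL_MODEL takes the hyperparameters above as
# keyword arguments; see model.py for the real constructor signature.
from model import SRL_MODEL

name, config = next(iter(tests.items()))
model = SRL_MODEL(**config)
print(f"Built {name} with encoder {config['model_name']}")
```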
The metrics used are Precision, Recall, and F1-score. These metrics are computed both for the identification of predicates and for the classification of semantic roles in the sentences.
To compute these metrics for the models in the models directory, you can run the script train/evaluation.py, which will evaluate all the models in the directory and save the results to a JSON file. When calling the script you can choose which argument-span identification strategy to evaluate using the --top and --concat flags. You can also control the confidence threshold for span identification using the --threshold flag.
For example:
```bash
python ./train/evaluation.py --top --threshold 0.75
```
There are two scripts for inference:
- `inference.py`: makes predictions on new sentences from the terminal. To use it, adjust the paths and have a trained model available together with its configuration JSON file.
- `interactive_KG.py`: generates an interactive Knowledge Graph from a file (by default `text.txt` in the root directory of the repository) or a Wikipedia page. To use it, set up a Neo4j instance and adjust the credentials in the script.
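A sketch of a typical invocation (assuming both scripts are run from the repository root, with their paths and credentials already adjusted):

```bash
# Sketch: requires a trained model plus its configuration JSON (inference.py)
# and a running Neo4j instance (interactive_KG.py); paths are set in the scripts.
python inference/inference.py
python inference/interactive_KG.py
```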
The checkpoints and configuration files for the trained models are available in the following Drive folder. The evaluation metrics for the models are available in the following Drive folder.
This project is licensed under the MIT License.