4 changes: 2 additions & 2 deletions README.md
@@ -1,7 +1,7 @@
# Geoscience Language Models

## Introduction
Language models are the foundation for the predictive text tools that billions of people in their everyday lives. Although most of these language models are trained on vast digital corpora, they are often missing the specialized vocabulary and underlying concepts that are important to specific scientific sub-domains. Herein we report two new language models that were re-trained using geoscientific text to address that knoweldge gap. The raw and processed text from the GEOSCAN publications database, which were used to generate these new language models are reported in a pending Open File. Language model performance and validation are discussed separately in pending manuscript. The supporting datasets and preferred language models can be used and expanded on in the future to support a range of down-stream natural language processing tasks (e.g., keyword prediction, document similarity, and recommender systems).
Language models are the foundation for the predictive text tools that billions of people use in their everyday lives. Although most of these language models are trained on vast digital corpora, they often lack the specialized vocabulary and underlying concepts that are important to specific scientific sub-domains. Herein we report two new language models that were re-trained using geoscientific text to address that knowledge gap. The raw and processed text from the GEOSCAN publications database, which was used to generate these new language models, is reported in a pending Open File. Language model performance and validation are discussed separately in a pending manuscript. The supporting datasets and preferred language models can be used and expanded on in the future to support a range of downstream natural language processing tasks (e.g., keyword prediction, document similarity, and recommender systems).


## Geoscience language modelling methods and evaluation
@@ -27,7 +27,7 @@ Contextual language models, including the Bidirectional Encoder Representations


## Example Application
Principal component analysis (PCA) biplot of mineral names colour coded to the Dana classification scheme. Word vectors for macthing mineral names (n = 1893) are based on the preferred GloVE model. Minerals with similar classifications plot together in PCA space, reflecting similar vector properties. Word embeddings provide a powerful framework for evaluating and predicting mineral assemblages based on thousands of observations from the natural rock record.
Principal component analysis (PCA) biplot of mineral names colour-coded to the Dana classification scheme. Word vectors for matching mineral names (n = 1893) are based on the preferred GloVe model. Minerals with similar classifications plot together in PCA space, reflecting similar vector properties. Word embeddings provide a powerful framework for evaluating and predicting mineral assemblages based on thousands of observations from the natural rock record.
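A minimal sketch of that workflow, assuming the word vector for each mineral name has already been looked up from the trained GloVe model (random vectors and a handful of mineral names stand in here for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the preferred GloVe model: in practice these
# vectors would be looked up from the trained embedding file for each of
# the n = 1893 matching mineral names.
rng = np.random.default_rng(0)
minerals = ["quartz", "calcite", "pyrite", "galena", "olivine"]
vectors = {name: rng.normal(size=300) for name in minerals}

# Stack the word vectors into a matrix and project onto the first two
# principal components — one (x, y) point per mineral name.
X = np.stack([vectors[m] for m in minerals])
coords = PCA(n_components=2).fit_transform(X)

for name, (x, y) in zip(minerals, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

In practice the two components would be plotted and colour-coded by Dana class to produce the biplot described above.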



2 changes: 1 addition & 1 deletion project_tools/Evaluation.md
@@ -86,7 +86,7 @@ where task is set to train on either the MULTICLASS or PAIRING keyword predictio
To use weights and biases (https://wandb.ai) visualizations, set the correct api keys and dir in `scripts/run_bert_evaluation.py` before finetuning. Refer to the wandb documentation for more details.

#### Training the GloVe models with sklearn
Glove keyword prediction models can be trained with sklearn using the `run_keyword_prediction_classic()` funtion in `/nrcan_p2/evaluation/keyword_prediction.py`.
GloVe keyword prediction models can be trained with sklearn using the `run_keyword_prediction_classic()` function in `/nrcan_p2/evaluation/keyword_prediction.py`.
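For illustration, the general shape of such a pipeline — averaging GloVe word vectors into a document vector and fitting an sklearn classifier — can be sketched as follows (the vocabulary, labels, and classifier choice here are placeholders, not the project's actual configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: a tiny vocabulary of "GloVe" vectors and labelled documents.
# In the project, vectors come from the trained GloVe model and labels from
# the keyword sets; everything below is illustrative.
rng = np.random.default_rng(0)
glove = {w: rng.normal(size=50) for w in ["gold", "vein", "till", "esker"]}

def doc_vector(tokens):
    """Average the GloVe vectors of in-vocabulary tokens."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0)

docs = [["gold", "vein"], ["till", "esker"], ["vein", "gold"], ["esker", "till"]]
labels = [1, 0, 1, 0]  # e.g. a keyword is present (1) or absent (0)

# Fit a simple classifier on the averaged-embedding document features.
X = np.stack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
pred = clf.predict(X)
```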

To launch training, run one of the following scripts:

3 changes: 2 additions & 1 deletion project_tools/Training.md
@@ -10,7 +10,7 @@ You can train 3 types of embeddings models:
## GloVe

### Local training
To train a GloVe model, run the follwing command after setting the correct dataset and training parameters in the script (see the script for more details):
To train a GloVe model, run the following command after setting the correct dataset and training parameters in the script (see the script for more details):
```
bash scripts/train_glove_model.sh
```
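The trained embeddings are written in the plain-text GloVe format: one token per line, followed by its vector components. A minimal loader sketch, assuming that format (the project's own `glove2dict` helper plays this role in the mittens scripts):

```python
import numpy as np

def glove2dict(path):
    """Parse a GloVe-format text file into {token: vector}.
    Sketch of the loader used by the mittens scripts; the real helper
    lives in the project tools."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float64)
    return embeddings

# Demo on a two-line file in the GloVe text format.
import os, tempfile
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("granite 0.1 0.2 0.3\nbasalt 0.4 0.5 0.6\n")
tmp.close()
emb = glove2dict(tmp.name)
os.unlink(tmp.name)
```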
@@ -40,6 +40,7 @@ python embeddings/mittens_scripts/run_mittens.py
--ORIGINAL_EMBEDDINGS_PATH ${ORIGINAL_EMBEDDINGS_PATH}
--MAX_ITER ${MAX_ITER}
--VECTOR_SIZE ${VECTOR_SIZE}

```
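Mittens fine-tunes the original GloVe vectors against a vocabulary-by-vocabulary co-occurrence matrix (the `M` passed to `Mittens.fit()` in `train.py`). A sketch of how such a matrix can be counted, assuming a simple symmetric window; standard GloVe additionally weights counts by inverse distance, which is omitted here:

```python
import numpy as np

def cooccurrence_matrix(tokens, vocabulary, window=5):
    """Count symmetric co-occurrences within a sliding window.
    Illustrative sketch of the matrix consumed by Mittens; the project's
    actual counting and weighting may differ."""
    index = {w: i for i, w in enumerate(vocabulary)}
    M = np.zeros((len(vocabulary), len(vocabulary)))
    for i, center in enumerate(tokens):
        if center not in index:
            continue
        # Count every in-vocabulary neighbour within `window` positions.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in index:
                M[index[center], index[tokens[j]]] += 1
    return M

vocab = ["gold", "quartz", "vein"]
M = cooccurrence_matrix(["gold", "quartz", "vein", "gold"], vocab, window=1)
```

The resulting matrix, together with the vocabulary and the original embeddings dictionary, is what the `Mittens(n=..., max_iter=...)` model is fitted on.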

At minimum, you will probably need to change the following parameters:
4 changes: 2 additions & 2 deletions project_tools/embeddings/mittens_scripts/run_mittens.py
@@ -53,5 +53,5 @@
max_iter=args.MAX_ITER,
original_embeddings_filename=args.ORIGINAL_EMBEDDINGS_PATH,
vector_size=args.VECTOR_SIZE,
mittens_filename=args.MITTENS_FILENAME
)
mittens_filename=MITTENS_FILENAME
)
2 changes: 1 addition & 1 deletion project_tools/embeddings/mittens_scripts/train.py
@@ -35,7 +35,7 @@ def run_train(
# Load pre-trained Glove embeddings
original_embeddings = glove2dict(original_embeddings_filename)

mittens_model = Mittens(n=config.VECTOR_SIZE, max_iter=max_iter)
mittens_model = Mittens(n=vector_size, max_iter=max_iter)

new_embeddings = mittens_model.fit(M, vocab=vocabulary, initial_embedding_dict=original_embeddings)

2 changes: 1 addition & 1 deletion project_tools/scripts/run_preprocess_csv_for_modelling.sh
@@ -7,7 +7,7 @@
# (for debugging)
N_FILES="-1"

# Set below accorind to the provided dataset name
# Set below according to the provided dataset name
INPUT_DIRS=
OUTPUT_DIR=
PARTIAL_OUTPUT_DIR=