4 changes: 2 additions & 2 deletions README.md
@@ -1,7 +1,7 @@
# Geoscience Language Models

## Introduction
Language models are the foundation for the predictive text tools that billions of people in their everyday lives. Although most of these language models are trained on vast digital corpora, they are often missing the specialized vocabulary and underlying concepts that are important to specific scientific sub-domains. Herein we report two new language models that were re-trained using geoscientific text to address that knoweldge gap. The raw and processed text from the GEOSCAN publications database, which were used to generate these new language models are reported in a pending Open File. Language model performance and validation are discussed separately in pending manuscript. The supporting datasets and preferred language models can be used and expanded on in the future to support a range of down-stream natural language processing tasks (e.g., keyword prediction, document similarity, and recommender systems).
Language models are the foundation for the predictive text tools that billions of people use in their everyday lives. Although most of these language models are trained on vast digital corpora, they often lack the specialized vocabulary and underlying concepts that are important to specific scientific sub-domains. Herein we report two new language models that were re-trained using geoscientific text to address that knowledge gap. The raw and processed text from the GEOSCAN publications database, which was used to generate these new language models, is reported in a pending Open File. Language model performance and validation are discussed separately in a pending manuscript. The supporting datasets and preferred language models can be used and expanded on in the future to support a range of downstream natural language processing tasks (e.g., keyword prediction, document similarity, and recommender systems).


## Geoscience language modelling methods and evaluation
@@ -27,7 +27,7 @@ Contextual language models, including the Bidirectional Encoder Representations


## Example Application
Principal component analysis (PCA) biplot of mineral names colour coded to the Dana classification scheme. Word vectors for macthing mineral names (n = 1893) are based on the preferred GloVE model. Minerals with similar classifications plot together in PCA space, reflecting similar vector properties. Word embeddings provide a powerful framework for evaluating and predicting mineral assemblages based on thousands of observations from the natural rock record.
Principal component analysis (PCA) biplot of mineral names colour-coded to the Dana classification scheme. Word vectors for matching mineral names (n = 1893) are based on the preferred GloVe model. Minerals with similar classifications plot together in PCA space, reflecting similar vector properties. Word embeddings provide a powerful framework for evaluating and predicting mineral assemblages based on thousands of observations from the natural rock record.
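A minimal sketch of that workflow, assuming the word vector for each mineral name has already been looked up from the trained GloVe model (random vectors and a handful of mineral names stand in here for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the preferred GloVe model: in practice these
# vectors would be looked up from the trained embedding file for each of
# the n = 1893 matching mineral names.
rng = np.random.default_rng(0)
minerals = ["quartz", "calcite", "pyrite", "galena", "olivine"]
vectors = {name: rng.normal(size=300) for name in minerals}

# Stack the word vectors into a matrix and project onto the first two
# principal components — one (x, y) point per mineral name.
X = np.stack([vectors[m] for m in minerals])
coords = PCA(n_components=2).fit_transform(X)

for name, (x, y) in zip(minerals, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

In practice the two components would be plotted and colour-coded by Dana class to produce the biplot described above.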



2 changes: 1 addition & 1 deletion project_tools/Evaluation.md
@@ -86,7 +86,7 @@ where task is set to train on either the MULTICLASS or PAIRING keyword predictio
To use weights and biases (https://wandb.ai) visualizations, set the correct api keys and dir in `scripts/run_bert_evaluation.py` before finetuning. Refer to the wandb documentation for more details.

#### Training the GloVe models with sklearn
Glove keyword prediction models can be trained with sklearn using the `run_keyword_prediction_classic()` funtion in `/nrcan_p2/evaluation/keyword_prediction.py`.
GloVe keyword prediction models can be trained with sklearn using the `run_keyword_prediction_classic()` function in `/nrcan_p2/evaluation/keyword_prediction.py`.
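For illustration, the general shape of such a pipeline — averaging GloVe word vectors into a document vector and fitting an sklearn classifier — can be sketched as follows (the vocabulary, labels, and classifier choice here are placeholders, not the project's actual configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: a tiny vocabulary of "GloVe" vectors and labelled documents.
# In the project, vectors come from the trained GloVe model and labels from
# the keyword sets; everything below is illustrative.
rng = np.random.default_rng(0)
glove = {w: rng.normal(size=50) for w in ["gold", "vein", "till", "esker"]}

def doc_vector(tokens):
    """Average the GloVe vectors of in-vocabulary tokens."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0)

docs = [["gold", "vein"], ["till", "esker"], ["vein", "gold"], ["esker", "till"]]
labels = [1, 0, 1, 0]  # e.g. a keyword is present (1) or absent (0)

# Fit a simple classifier on the averaged-embedding document features.
X = np.stack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
pred = clf.predict(X)
```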

To launch training, run one of the following scripts:

3 changes: 2 additions & 1 deletion project_tools/Training.md
@@ -10,7 +10,7 @@ You can train 3 types of embeddings models:
## GloVe

### Local training
To train a GloVe model, run the follwing command after setting the correct dataset and training parameters in the script (see the script for more details):
To train a GloVe model, run the following command after setting the correct dataset and training parameters in the script (see the script for more details):
```
bash scripts/train_glove_model.sh
```
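The trained embeddings are written in the plain-text GloVe format: one token per line, followed by its vector components. A minimal loader sketch, assuming that format (the project's own `glove2dict` helper plays this role in the mittens scripts):

```python
import numpy as np

def glove2dict(path):
    """Parse a GloVe-format text file into {token: vector}.
    Sketch of the loader used by the mittens scripts; the real helper
    lives in the project tools."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float64)
    return embeddings

# Demo on a two-line file in the GloVe text format.
import os, tempfile
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("granite 0.1 0.2 0.3\nbasalt 0.4 0.5 0.6\n")
tmp.close()
emb = glove2dict(tmp.name)
os.unlink(tmp.name)
```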
@@ -40,6 +40,7 @@ python embeddings/mittens_scripts/run_mittens.py
--ORIGINAL_EMBEDDINGS_PATH ${ORIGINAL_EMBEDDINGS_PATH}
--MAX_ITER ${MAX_ITER}
--VECTOR_SIZE ${VECTOR_SIZE}

```
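Mittens fine-tunes the original GloVe vectors against a vocabulary-by-vocabulary co-occurrence matrix (the `M` passed to `Mittens.fit()` in `train.py`). A sketch of how such a matrix can be counted, assuming a simple symmetric window; standard GloVe additionally weights counts by inverse distance, which is omitted here:

```python
import numpy as np

def cooccurrence_matrix(tokens, vocabulary, window=5):
    """Count symmetric co-occurrences within a sliding window.
    Illustrative sketch of the matrix consumed by Mittens; the project's
    actual counting and weighting may differ."""
    index = {w: i for i, w in enumerate(vocabulary)}
    M = np.zeros((len(vocabulary), len(vocabulary)))
    for i, center in enumerate(tokens):
        if center not in index:
            continue
        # Count every in-vocabulary neighbour within `window` positions.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in index:
                M[index[center], index[tokens[j]]] += 1
    return M

vocab = ["gold", "quartz", "vein"]
M = cooccurrence_matrix(["gold", "quartz", "vein", "gold"], vocab, window=1)
```

The resulting matrix, together with the vocabulary and the original embeddings dictionary, is what the `Mittens(n=..., max_iter=...)` model is fitted on.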

At minimum, you will probably need to change the following parameters:
4 changes: 2 additions & 2 deletions project_tools/embeddings/mittens_scripts/run_mittens.py
@@ -53,5 +53,5 @@
max_iter=args.MAX_ITER,
original_embeddings_filename=args.ORIGINAL_EMBEDDINGS_PATH,
vector_size=args.VECTOR_SIZE,
mittens_filename=args.MITTENS_FILENAME
)
mittens_filename=MITTENS_FILENAME
)
2 changes: 1 addition & 1 deletion project_tools/embeddings/mittens_scripts/train.py
@@ -35,7 +35,7 @@ def run_train(
# Load pre-trained Glove embeddings
original_embeddings = glove2dict(original_embeddings_filename)

mittens_model = Mittens(n=config.VECTOR_SIZE, max_iter=max_iter)
mittens_model = Mittens(n=vector_size, max_iter=max_iter)

new_embeddings = mittens_model.fit(M, vocab=vocabulary, initial_embedding_dict=original_embeddings)

2 changes: 1 addition & 1 deletion project_tools/scripts/run_preprocess_csv_for_modelling.sh
@@ -7,7 +7,7 @@
# (for debugging)
N_FILES="-1"

# Set below accorind to the provided dataset name
# Set below according to the provided dataset name
INPUT_DIRS=
OUTPUT_DIR=
PARTIAL_OUTPUT_DIR=