
ChangeIsKey/non-recorded-sense-detection


This repository contains the code written and used as part of the lexicography thesis project. It is shared for the sake of comprehensibility and reproducibility of the results, so it may seem complicated or inconvenient in places. In addition, some scripts have been adapted to handle different datasets, which is why in some cases only certain data can be processed.

data

Due to its size, the data directory is not included in this repository. It can be downloaded [here].

This directory contains sample data for testing the functionality of the scripts; all final results of the models' predictions are included in full. It is subdivided into /annotation_results, /corpora, /dictionary and /outputs.

The downloaded /data directory should be placed in the root directory of the project (a quick layout check is sketched after the list below).

  • /annotation_results contains the results of both human annotation phases and thus also the final results of the models' predictions.
  • /corpora contains all four corpora (historical and modern for both languages), along with processed versions in which every sentence is tokenized and lemmatized using spaCy.
  • /dictionary contains both WordNet and the Swedish dictionary, as well as versions in which unique sense identifiers were added to distinguish between different senses.
  • /outputs contains all files that are generated by executing the scripts.
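As a quick sanity check after downloading, the expected layout can be verified with a few lines of Python. This is a minimal sketch; the subdirectory names are taken from the list above:

```python
# Minimal sanity check that /data sits in the project root with the four
# subdirectories described above.
from pathlib import Path

DATA = Path("data")
for sub in ("annotation_results", "corpora", "dictionary", "outputs"):
    assert (DATA / sub).is_dir(), f"missing directory: {DATA / sub}"
print("data/ layout looks complete")
```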

data analysis

Notebooks in this directory are used to analyze and transform data.

human annotation phase 1

  • sample_data.ipynb samples word usages of dictionary headwords from a corpus.
  • reduce_sample.ipynb reduces the sampled word usages to a set maximum per headword (both steps are sketched after this list).
  • generate_wsbest.ipynb generates all files needed for human annotation in PhiTag.
  • reduce_sense_file.ipynb removes duplicates from the senses.tsv file for PhiTag.
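The idea behind the sampling and capping steps can be illustrated with a short, hedged sketch; the column name `lemma`, the cap of 50 usages and the function interface are illustrative assumptions, not the notebooks' actual API:

```python
# Sketch of the sampling + capping idea (sample_data.ipynb / reduce_sample.ipynb).
# Assumes a corpus DataFrame with a "lemma" column; all names are illustrative.
import pandas as pd

def sample_usages(corpus: pd.DataFrame, headwords: set,
                  max_per_headword: int = 50, seed: int = 42) -> pd.DataFrame:
    """Keep sentences whose lemma is a dictionary headword, capped per headword."""
    usages = corpus[corpus["lemma"].isin(headwords)]
    return (usages.groupby("lemma", group_keys=False)
                  .apply(lambda g: g.sample(n=min(len(g), max_per_headword),
                                            random_state=seed)))
```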

model sense embedding

  • extract_context.ipynb extracts a context (gloss/examples) from the dictionary entries.
  • xl_model_embeddings.ipynb uses the extracted contexts to generate sense embeddings (a minimal embedding sketch follows this list).
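The embedding step can be sketched roughly as follows. The model choice (`xlm-roberta-base` as a stand-in for the larger model the notebook name suggests) and mean pooling over the last hidden layer are assumptions:

```python
# Hedged sketch: embed each sense's gloss/examples with a multilingual encoder.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

@torch.no_grad()
def embed(texts: list) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)      # mean pooling

sense_vecs = embed(["gloss or example text of one sense", "another sense"])
```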

model tuning

  • sort_training_data.ipynb filters usable usages from the human annotation phase 1 for model tuning.
  • generate_gold_splits.ipynb randomly divides the usable data into known/unknown for training purposes.
  • vectorize_annotations.ipynb generates usage embeddings from the word usages of the human annotation phase 1.
  • cross_validation.ipynb performs cross-validation of a model on the training data generated by generate_gold_splits.ipynb (the threshold-tuning idea is sketched after this list).
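The threshold-tuning idea, as it might look in a hedged sketch: a usage counts as a known sense if its cosine similarity to the nearest sense embedding exceeds a threshold, and the threshold is chosen by cross-validation. All names and the F1 criterion are illustrative assumptions:

```python
# Sketch of cross-validated threshold tuning on usage/sense embeddings.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

def nearest_sense_similarity(usage_vecs: np.ndarray,
                             sense_vecs: np.ndarray) -> np.ndarray:
    """Max cosine similarity of each usage embedding to any sense embedding."""
    u = usage_vecs / np.linalg.norm(usage_vecs, axis=1, keepdims=True)
    s = sense_vecs / np.linalg.norm(sense_vecs, axis=1, keepdims=True)
    return (u @ s.T).max(axis=1)

def tune_threshold(sims: np.ndarray, labels: np.ndarray,
                   candidates=np.linspace(0.0, 1.0, 101)) -> float:
    """5-fold CV: on each training split pick the threshold with the best F1
    for separating known (1) from unknown (0) senses; return the mean choice."""
    chosen = []
    for train, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(sims):
        f1s = [f1_score(labels[train], sims[train] >= t) for t in candidates]
        chosen.append(candidates[int(np.argmax(f1s))])
    return float(np.mean(chosen))
```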

model prediction

  • sample_data.ipynb takes and cleans a sample from the corpora for the model prediction.
  • model_prediction.ipynb predicts, for every usage in the sample, whether it is covered by the dictionary, based on the tuned threshold (the decision rule is sketched below).
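A minimal sketch of this decision rule, assuming usage and sense embeddings as NumPy arrays and a threshold tuned as above; a usage is flagged as a potential non-recorded sense when even its nearest sense falls below the threshold (names are illustrative):

```python
# Sketch of the coverage decision behind model_prediction.ipynb.
import numpy as np

def predict_covered(usage_vecs: np.ndarray, sense_vecs: np.ndarray,
                    threshold: float) -> np.ndarray:
    """True = covered by some recorded sense; False = candidate non-recorded sense."""
    u = usage_vecs / np.linalg.norm(usage_vecs, axis=1, keepdims=True)
    s = sense_vecs / np.linalg.norm(sense_vecs, axis=1, keepdims=True)
    return (u @ s.T).max(axis=1) >= threshold
```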

human annotation phase 2

  • pre_sort_samples.ipynb filters the models' predictions and sorts them by similarity to the nearest sense (see the sketch after this list).
  • build_annotation_data.ipynb generates all files needed for human annotation in PhiTag.
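A hedged sketch of the filter-and-sort step; the column names `covered` and `nearest_sense_similarity`, as well as the descending sort order, are assumptions rather than the notebook's actual interface:

```python
# Sketch of pre_sort_samples.ipynb's idea: keep usages the model flagged as
# not covered and rank them by similarity to their nearest recorded sense.
import pandas as pd

def sort_for_annotation(predictions: pd.DataFrame) -> pd.DataFrame:
    flagged = predictions[~predictions["covered"]]
    return flagged.sort_values("nearest_sense_similarity", ascending=False)
```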

human annotation analysis

  • ws_best_analysis.ipynb analyses the results of the human annotations (a minimal aggregation example follows).
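One typical aggregation step, shown as a hedged sketch: derive a majority label and a raw agreement rate per annotated instance. The real notebook may compute different statistics; column names are assumptions:

```python
# Sketch: per-instance majority label and raw annotator agreement.
# Expects one row per (instance, annotator) with "instance" and "label" columns.
import pandas as pd

def summarize(annotations: pd.DataFrame) -> pd.DataFrame:
    grouped = annotations.groupby("instance")["label"]
    majority = grouped.agg(lambda g: g.value_counts().idxmax())
    agreement = grouped.agg(lambda g: g.value_counts().max() / len(g))
    return pd.DataFrame({"majority": majority, "agreement": agreement})
```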

