Ye Olde French: Error Analysis of SIGMORPHON-UniMorph Shared Task Splits and Effect of Old and Middle French
This project uses scripts designed to inflect lemmas based on morphosyntactic descriptions and aims to uncover patterns behind the errors those scripts make, in this case when run on the French language. Our goal is to emphasize the role of high-quality training data and to recommend improved practices for scraping and collecting training data for these tasks.
This project was pursued as an attempt at part 1 of the SIGMORPHON 2023 Inflection Shared Task for the French language.
The following are the folders found within the repository:
- CounterSorterOutput: Contains files generated by counterSorterOutput.py and errorSummation.py.
- NeuralTransducerFormatted: Contains formatted files for neural transducer outputs generated by formatNeuralErrors.py.
- NeuralTransducerOutput: Stores the output (found in the checkpoints folder) from the neural transducer. This includes files like fra.decode.dev.tsv and fra.decode.test.tsv.
- SegmentationsSplits: Holds segmentation splits, such as fra.segmentations, as well as the splits generated from .segmentations files (found in the UniMorph Fra Repository).
- SharedTaskData: Includes data used for shared tasks, like development (fra.dev), training (fra.trn), and testing (fra.tst) splits.
- plots_for_paper: Contains R code and output for generating Figure 1 in the SIGMORPHON 2024 paper (and related figures not included in the paper).
- src: Source code directory containing scripts for various operations such as error analysis, data formatting, error counting, etc.
Below is an overview of the scripts located in the src directory:
reproduce.sh
Reproduces the data that was annotated for use in the paper.
errorSummation.py
Takes a set of output files and merges them together by form. Make sure that all files contain the same forms in the same order. After each file, give the index of the column that holds the predicted form, followed by the number of header lines to skip. You may:
- Specify the split to check your words against using the -s or --split flag.
- Specify the output file using the -o or --output flag.
- Include the universally correct forms in the results using the -c or --correct flag.
- Assign header names to each of the files supplied using the -n or --names flag.
- Run it on previously generated sums using the -e or --errors flag.
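The merge described above can be pictured with a minimal Python sketch. It is not the code in errorSummation.py: the function names, the in-memory input, and the "all_correct" field are assumptions made for illustration, and the only behavior taken from the description is that each file comes with a predicted-form column index and a number of header lines to skip.

```python
def read_predictions(text, pred_col, skip):
    """Return the predicted forms from one tab-separated output file."""
    rows = [line.split("\t") for line in text.splitlines() if line]
    return [row[pred_col] for row in rows[skip:]]


def merge_by_form(gold_forms, outputs, names):
    """Line up each system's prediction against the gold form, row by row.

    outputs holds one (text, pred_col, skip) triple per file (hypothetical layout).
    """
    columns = [read_predictions(text, col, skip) for text, col, skip in outputs]
    for name, column in zip(names, columns):
        # The files must list the same forms in the same order.
        if len(column) != len(gold_forms):
            raise ValueError(f"{name}: {len(column)} rows, expected {len(gold_forms)}")
    merged = []
    for i, gold in enumerate(gold_forms):
        row = {"form": gold}
        for name, column in zip(names, columns):
            row[name] = column[i]
        # True when every system produced the gold form for this row.
        row["all_correct"] = all(column[i] == gold for column in columns)
        merged.append(row)
    return merged
```

Because the rows are matched purely by position, a file with a missing or reordered line would silently misalign every later row, which is why the sketch at least checks the row counts before merging.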
formatNeuralErrors.py
Takes the output files placed in NeuralTransducerOutput and converts them to a format that is more human readable and is usable for errorSummation.py and places the new file into the NeuralTransducerFormatted folder. You may:
- Specify a designated input directory using the -p or --path flag.
- Specify a designated output directory using the -d or --dest flag.
formatSegmentations.py
Takes the .segmentations files placed in SegmentationsSplits and converts them to match the shared task data format. It then uses the splits in SharedTaskData to create new splits in the same directory that have similar demographics but only include words in the .segmentations files. You may:
- Specify a designated input directory for the .segmentations files using the -p or --path flag.
- Specify a designated input directory for the original shared task splits using the -o or --original flag.
- Force the recreation of the .total file using the -f or --force flag (Normally, if the .total file is present, it will skip that step).
And either:
- Specify a language to convert using the -l or --lang flag with the UniMorph abbreviation.
- Run all files using the -a or --all flag [Not Yet Implemented].
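One simple way to build a split that only includes words from the .segmentations files is to filter each original split by lemma, as in the sketch below. This is an illustration rather than the logic of formatSegmentations.py: the function name is hypothetical, the lemma is assumed to sit in the first tab-separated column, and the real script may do more to keep the demographics of the new splits similar.

```python
def filter_split(split_lines, segmented_lemmas):
    """Keep only the entries whose lemma appears in the segmentations data."""
    kept = []
    for line in split_lines:
        # Assumption: the lemma is the first tab-separated field of each entry.
        lemma = line.split("\t")[0]
        if lemma in segmented_lemmas:
            kept.append(line)
    return kept
```

Filtering each of the training, development, and test splits separately keeps every surviving entry in the split it originally belonged to.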
nonneural.py
This is the baseline nonneural.py taken from the Sigmorphon 2023 Shared Task Repo. It has been modified to use the argparse module and to create separate output files for the different splits. You may:
- Specify a designated input directory using the -p or --path flag.
- Run it on the test split using the -t or --test flag.
- Turn on output file generation using the -o or --out flag (The output is placed in the input directory).
properties.py
Contains default paths and settings for the project. The following properties are defined:
- SEGMENTATIONS_FOLDER = "../SegmentationsSplits"
- SHARED_TASK_DATA_FOLDER = "../SharedTaskData"
- NEURAL_OUTPUT_FOLDER = "../NeuralTransducerOutput"
- NEURAL_ERRORS_FOLDER = "../NeuralTransducerFormatted"
- COUNTER_SORTER_OUTPUT_FOLDER = "../CounterSorterOutput"
The following sort methods are defined for use with counterSorter.py:
- field which sorts alphabetically.
- suffix which sorts alphabetically from the end of the string.
- number which sorts based on the number value of a column.
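The three sort methods could be expressed as key functions along the following lines. This is a sketch under assumptions, not the contents of properties.py: the function bodies and the sort_rows helper are hypothetical, and only the method names and the SORT_FUNCTIONS dictionary name come from this README.

```python
def field(value):
    """Sort alphabetically by the value itself."""
    return value


def suffix(value):
    """Sort alphabetically from the end of the string by reversing it."""
    return value[::-1]


def number(value):
    """Sort by the numeric value of the column."""
    return float(value)


# Assumed shape: method name mapped to a key function.
SORT_FUNCTIONS = {"field": field, "suffix": suffix, "number": number}


def sort_rows(rows, column, method, reverse=False):
    """Sort rows (lists of strings) on one column with the chosen method."""
    key = SORT_FUNCTIONS[method]
    return sorted(rows, key=lambda row: key(row[column]), reverse=reverse)
```

Sorting from the end of the string groups forms that share an ending, which is convenient when looking for errors tied to a particular inflectional suffix.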
counterSorter.py
For a given file, either counts or sorts it and places the output in CounterSorterOutput by default.
- If you are sorting, include an s or the word sort after the specified file.
- If you are counting, include a c or the word count after the specified file.
For either option, you may:
- Specify a designated output file using the -d or --dest flag.
- Specify a sorting function using the -m or --method flag and a key from the SORT_FUNCTIONS dictionary in properties.py.
- Invert the sort direction using the -r or --reverse flag.
- Ignore the header of the file using the -s or --skip flag and a number of lines to skip.
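As a rough picture of the counting mode, the sketch below tallies how often each distinct line occurs after skipping a header. What counterSorter.py actually counts is not spelled out here, so treating whole lines as the counted unit is an assumption, and count_lines is a hypothetical name.

```python
from collections import Counter


def count_lines(lines, skip=0):
    """Count occurrences of each distinct line, ignoring `skip` header lines."""
    counts = Counter(lines[skip:])
    # most_common() lists the entries from most to least frequent.
    return counts.most_common()
```

For example, passing a column of morphosyntactic tags from an error file would show which tags account for the most errors.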