This project supports Jérémie Arné's Master 1 research thesis.
Its goal is to find the occurrences of certain words or groups of words in a text file.
You'll need two languages installed and ready: Python and Rust.
- I first read Jérémie's transcription (.docx) and convert it to a .txt file using a trivial Python script.
- I then use a Rust program to find the occurrences of the words I'm looking for.
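The conversion step could look roughly like the sketch below. It is not the project's actual script: it assumes only the Python standard library, reads the document body straight from the .docx archive, and ignores headers, footnotes and formatting. The function names `docx_to_text` and `convert` are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source) -> str:
    """Return the text of a .docx file, one line per paragraph.

    `source` is a path or a binary file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)


def convert(docx_path: str, txt_path: str) -> None:
    """Write the extracted text to `txt_path` as UTF-8."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(docx_to_text(docx_path))
```

Keeping one paragraph per line matters here, because the search step reads the text file line by line.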
To make it easy to use for people who are not familiar with code and a terminal (Jérémie), I automated almost the whole process.
In order to use it, you'll just need to:
- Make sure the transcription file is in the ./src/assets/ folder.
- (Create or) fill the src/assets/toFind.json file. It should have the following structure:
{
"the_word_to_look_for": 4, // This number is the maximum errors possible in that word.
"another_word or expression": 5
}
You'll need to do this only ONCE.
cargo build --release
The script takes several arguments that can be listed using this command:
./target/release/projet-jeremie -h
After making sure all configuration files are OK, the easiest way to get things working is via this command:
./target/release/projet-jeremie -ro
This command will:
- -r: run the Python script to convert the .docx transcription file into a .txt one.
- -o: output the results in the src/outputs/occurences.json file.
The JSON file for the strings to search must be an object of "string": number pairs, like so:
{
"Jehan de Luxembourg": 4,
"Duc de Bourgogne": 3
}
The numbers specify the maximum number of errors allowed for a given string.
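As an illustration of that format, here is a small Python sketch that parses and checks such a file's content. It is an assumption about what a validity check could look like, not the Rust program's own loading code, and `parse_search_terms` is a hypothetical name.

```python
import json


def parse_search_terms(raw: str) -> dict[str, int]:
    """Parse and validate the content of the search-terms file."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for term, max_errors in data.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_errors, bool) or not isinstance(max_errors, int):
            raise ValueError(f"{term!r}: max errors must be an integer")
        if max_errors < 0:
            raise ValueError(f"{term!r}: max errors cannot be negative")
    return data


terms = parse_search_terms('{"Jehan de Luxembourg": 4, "Duc de Bourgogne": 3}')
for term, max_errors in terms.items():
    print(f"{term}: up to {max_errors} errors")
```

Standard JSON has no comment syntax, so Python's `json` module rejects a file that contains a `//` comment. Whether the Rust program tolerates comments is not stated here, so leaving them out of the real file is the safer choice.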
The word "algorithm" is a bit of a stretch here.
All I'm doing is reading the file line by line and, for each line, looking for occurrences of the search terms using windows the size of the term (counted in words).
Sometimes, words are written with different spellings.
For example, "Jehan de Luxembourg" can be found as "Jehan de Luxembourcq" or "Jehan de Luxembouc".
In the line "Le vallet Jehan de Luxembourcq pris son arme.", given the "Jehan de Luxembourg" search, the window will span 3 words, and the program will browse the line like this:
- Le vallet Jehan | distance: 16
- vallet Jehan de | distance: 16
- Jehan de Luxembourcq | distance: 1
- de Luxembourcq pris | distance: 12
- Luxembourcq pris son | distance: 19
- pris son arme. | distance: 17
If the distance is less than the maximum distance allowed, the program counts the window as an occurrence. If multiple occurrences are found, each of them is counted.
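The window scan can be sketched in a few lines of Python. This is an illustration, not the Rust implementation: it assumes the distance is a plain Levenshtein edit distance, which the text does not state. Under that assumption the matching window in the example scores 2 rather than the 1 listed above, so the real program may use a different metric. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def find_occurrences(line: str, term: str, max_errors: int) -> list[tuple[str, int]]:
    """Return (window, distance) for every window close enough to `term`."""
    words = line.split()
    size = len(term.split())  # window size = number of words in the term
    hits = []
    for start in range(len(words) - size + 1):
        window = " ".join(words[start:start + size])
        distance = levenshtein(window, term)
        if distance < max_errors:  # "less than", as the text states
            hits.append((window, distance))
    return hits


line = "Le vallet Jehan de Luxembourcq pris son arme."
for window, distance in find_occurrences(line, "Jehan de Luxembourg", 4):
    print(f"{window} | distance: {distance}")
```

Every window under the threshold is kept, so the number of occurrences on a line is simply the length of the returned list.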