Search for occurrences.

This project is for Jérémie Arné's Master 1 research thesis.
Its goal is to find the occurrences of certain words or groups of words in a text file.

Requirements

You'll need these two languages installed and ready.

Rust
Python

Steps

I first read Jérémie's transcription (.docx) and convert it to a .txt file using a short Python script.
I then use a Rust program to find the occurrences of the words I'm looking for.

Usage

Before running the script

To make it easy to use for people who are not familiar with code and a terminal (Jérémie), I automated almost the whole process.

To use it, you just need to:

Make sure the transcription file is in the ./src/assets/ folder.

Create (or fill in) the src/assets/toFind.json file. Each key is a word or expression to look for, and each value is the maximum number of errors allowed for it. It should have the following structure:

{
  "the_word_to_look_for": 4,
  "another_word or expression": 5
}

Compiling

You'll need to do this only ONCE.

cargo build --release

Running the script

The script takes several arguments, which can be listed with the help flag:
./target/release/projet-jeremie -h

After making sure all configuration files are OK, the easiest way to get things working is via this command:

./target/release/projet-jeremie -ro

This command will:

-r: runs the Python script that converts the .docx transcription file into a .txt one.
-o: writes the results to the src/outputs/occurences.json file.

JSON file

The JSON file listing the strings to search for must be an object of "string": number pairs, like so:

{
  "Jehan de Luxembourg": 4,
  "Duc de Bourgogne": 3
}

Each number specifies the maximum number of errors allowed for the given string.
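As a rough illustration of the expected shape, here is a dependency-free sketch that reads such a flat object into a map. The function name `parse_to_find` is hypothetical and this is not the project's actual loading code; a real program would more likely rely on a JSON library. The sketch deliberately ignores escape sequences and nested values.

```rust
use std::collections::HashMap;

/// Parses a flat JSON object of `"string": number` pairs into a map.
/// Hypothetical helper for illustration only: no escapes, no nesting.
fn parse_to_find(json: &str) -> Result<HashMap<String, usize>, String> {
    let body = json
        .trim()
        .strip_prefix('{')
        .and_then(|s| s.strip_suffix('}'))
        .ok_or("expected a JSON object")?;
    let mut map = HashMap::new();
    let mut rest = body.trim();
    while !rest.is_empty() {
        // Key: the text between the next pair of double quotes.
        let after_quote = rest.strip_prefix('"').ok_or("expected a quoted key")?;
        let end = after_quote.find('"').ok_or("unterminated key")?;
        let key = &after_quote[..end];
        // Value: everything after the colon, up to the next comma or the end.
        let after_key = after_quote[end + 1..].trim_start();
        let after_colon = after_key.strip_prefix(':').ok_or("expected ':'")?;
        let (value, tail) = match after_colon.find(',') {
            Some(i) => (&after_colon[..i], &after_colon[i + 1..]),
            None => (after_colon, ""),
        };
        let max_errors: usize = value
            .trim()
            .parse()
            .map_err(|_| format!("invalid number for {:?}", key))?;
        map.insert(key.to_string(), max_errors);
        rest = tail.trim();
    }
    Ok(map)
}
```

Calling `parse_to_find` on the object shown above should give a map with two entries. Anything that is not a plain `"string": number` pair, such as a `//` comment, is rejected with an error.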

Algorithm

The word "algorithm" is a bit of a stretch here. I read the file line by line and, for each line, look for occurrences of the target words using a sliding window that is as many words wide as the search string.
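A minimal sketch of the windowing step, assuming words are separated by whitespace. The helper name `word_windows` is made up for this illustration and is not taken from the project's source.

```rust
/// Returns every run of `size` consecutive words in `line`,
/// each joined back together with single spaces.
/// Hypothetical helper: the real program may split words differently.
fn word_windows(line: &str, size: usize) -> Vec<String> {
    let words: Vec<&str> = line.split_whitespace().collect();
    // `slice::windows` panics on a size of 0, so guard against it.
    if size == 0 || words.len() < size {
        return Vec::new();
    }
    words.windows(size).map(|w| w.join(" ")).collect()
}
```

A line of 8 words and a window of 3 words gives 6 candidate windows. Each candidate can then be compared with the search string.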

Example

Sometimes, words are written with different spellings. For example, Jehan de Luxembourg can be found as Jehan de Luxembourcq or Jehan de Luxembouc.

In the line "Le vallet Jehan de Luxembourcq pris son arme.", searching for "Jehan de Luxembourg" gives a search window of 3 words. The program browses the line like this:

Le vallet Jehan | distance: 16
vallet Jehan de | distance: 16
Jehan de Luxembourcq | distance: 1
de Luxembourcq pris | distance: 12
Luxembourcq pris son | distance: 19
pris son arme. | distance: 17

If the distance is less than the maximum distance allowed, the program counts the window as an occurrence. If a line contains several occurrences, each of them is counted.
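The matching step can be sketched as follows. This README does not name the distance metric, so the sketch assumes plain Levenshtein edit distance. Its numbers will therefore not necessarily match the ones listed above: Levenshtein gives 2, not 1, for "Jehan de Luxembourcq". The names `levenshtein` and `find_matches` are hypothetical. The strict `<` comparison follows the wording above; the real program may use `<=`.

```rust
/// Levenshtein edit distance, counted in Unicode scalar values.
/// Assumption: the project's actual metric is not stated in this README.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i-1 chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut curr = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1) // deletion
                .min(curr[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Returns every window of `line` whose distance to `needle` is
/// strictly below `max_errors`, together with that distance.
fn find_matches(line: &str, needle: &str, max_errors: usize) -> Vec<(String, usize)> {
    let size = needle.split_whitespace().count();
    let words: Vec<&str> = line.split_whitespace().collect();
    if size == 0 || words.len() < size {
        return Vec::new();
    }
    words
        .windows(size)
        .map(|w| w.join(" "))
        .map(|candidate| {
            let distance = levenshtein(&candidate, needle);
            (candidate, distance)
        })
        .filter(|(_, distance)| *distance < max_errors)
        .collect()
}
```

With the example line, the search string "Jehan de Luxembourg" and a maximum of 4, only the window "Jehan de Luxembourcq" is kept.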
