Language-Text-Extraction ReadMe

👩‍💻 Project BreakDown

extract text.py is the main Python file, and its start_work function is the driver that starts the program.
The lexicon file should be a UTF-8 text file containing the words of the target language.
The document to extract language text from should also be a UTF-8 text file.

🔦 How the Code Works

The code takes a text file as input.
The text in the file is cleaned and split into sentences.
The words in each sentence are matched against the language's lexicon, and the sentence is given a score.
Based on that score, the original (uncleaned) sentence from the input file is written to an output text file.
The code produces four output text files, one for each sentence score band: 25, 50, 75, and 100.
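
The scoring step can be sketched as below. This is a minimal illustration, not the code in extract text.py: it assumes the score is the percentage of a sentence's words found in the lexicon, and the function name score_sentence and the sample words are hypothetical.

```python
def score_sentence(sentence, lexicon):
    """Return the percentage (0-100) of words in `sentence` found in `lexicon`.

    Assumption: words are separated by whitespace and compared in lower case.
    """
    words = sentence.lower().split()
    if not words:
        # An empty sentence has nothing to match, so it scores zero.
        return 0.0
    matched = sum(1 for word in words if word in lexicon)
    return 100.0 * matched / len(words)


# Hypothetical three-word lexicon, stored as a set for fast lookups.
sample_lexicon = {"omi", "ile", "ni"}
print(score_sentence("omi ni ile", sample_lexicon))    # 100.0
print(score_sentence("omi is water", sample_lexicon))  # one word in three matches
```

Holding the lexicon in a set keeps each word lookup fast even when the lexicon is large.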

🍿 Tell me more about the four text files

After it runs, the code outputs four text files. The files are named according to how well their sentences match the words in the lexicon.

🔨 The 100 percent text file contains sentences that score between 100 percent and 74 percent against the lexicon.
The 75 percent text file contains sentences that score between 75 percent and 51 percent.
The 50 percent text file tends to contain mixed results.
The 25 percent text file usually contains sentences that are NOT in the lexicon's language.
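
One way to picture the mapping from a sentence score to an output file is the sketch below. The cutoffs (75, 50, 25) and the function name pick_bucket are assumptions made for illustration. The ranges listed above meet around 74 to 75 percent, so the exact boundaries used by extract text.py may differ.

```python
def pick_bucket(score, cutoffs=(75, 50, 25)):
    """Map a 0-100 sentence score to one of the four file labels.

    Assumption: a score must be strictly above a cutoff to reach the
    higher file. The real boundaries in the project may be different.
    """
    high, mid, low = cutoffs
    if score > high:
        return 100
    if score > mid:
        return 75
    if score > low:
        return 50
    return 25


for example_score in (100, 60, 40, 10):
    print(example_score, "goes to the", pick_bucket(example_score), "percent file")
```

Passing the cutoffs as a parameter makes it easy to try stricter or looser boundaries for a given language.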

🧪 How to Run the Code

Move your lexicon text file and your language document text file into the code's directory.
Change the string variables lexicon_txt and corpus_txt to the names of your lexicon text file and language document text file, respectively.
Run the code.
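
As a rough illustration of the setup steps, the sketch below sets the two variables and checks that both files are present before running. The file names are assumed to be the sample Yoruba files from the repository, and the helper missing_inputs is hypothetical rather than part of the project.

```python
import os

# Assumed values: replace these with the names of your own files.
lexicon_txt = "Yoruba_lexicon.txt"
corpus_txt = "Yoruba_corpus.txt"


def missing_inputs(directory, names):
    """Return the file names from `names` that are not present in `directory`."""
    return [name for name in names
            if not os.path.isfile(os.path.join(directory, name))]


# Check the current working directory, assumed here to be the code's directory.
missing = missing_inputs(os.getcwd(), [lexicon_txt, corpus_txt])
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("Both input files found.")
```

A check like this reports a missing input file up front instead of failing partway through a run.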

📔 Note

The code strips diacritics and digits from sentences before scoring them. See the cleanText.py file.
The code identifies sentence boundaries by the full stop (.). Edit sentence_tokenizer.py if the desired language does not use a full stop to mark the end of a sentence.
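
The two notes above can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of cleanText.py or sentence_tokenizer.py: the function names clean_text and split_sentences and the terminators parameter are hypothetical, and diacritics are removed here by Unicode decomposition, which may not match the project's own approach.

```python
import re
import unicodedata


def clean_text(text):
    """Strip diacritics (combining marks) and digits from `text`."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed
                            if not unicodedata.combining(ch))
    return "".join(ch for ch in without_marks if not ch.isdigit())


def split_sentences(text, terminators="."):
    """Split `text` on any character in `terminators`, dropping empty pieces."""
    pattern = "[" + re.escape(terminators) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


print(split_sentences("First sentence. Second sentence."))
# A language that ends sentences with other marks can pass them explicitly.
print(split_sentences("one! two? three", terminators="!?"))
print(clean_text("Caf\u00e9 2024"))
```

Exposing the terminators as a parameter is one way to support languages that do not end sentences with a full stop.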

💡 A Few Extra Features

The program uses only the Python standard library, so it can be easily ported to another system.
The program also uses Python's os library so that it runs across platforms, such as Windows and Linux-based computers, without you having to worry about file path differences, i.e. '/' and '\'.
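
The path handling described above can be sketched with os.path.join, which inserts the correct separator for the current platform. The output file naming and the function name output_path are hypothetical and used only for illustration.

```python
import os


def output_path(directory, score_bucket):
    """Build a path to an output file without hard-coding '/' or '\\'."""
    # Hypothetical naming scheme: the project's real file names may differ.
    filename = "{}_percent.txt".format(score_bucket)
    return os.path.join(directory, filename)


for bucket in (25, 50, 75, 100):
    print(output_path("output", bucket))
```

The printed separator depends on the operating system the sketch runs on.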

👓 Author

Moses Bankole
