A Python library designed for preprocessing Creole text. The library includes various functions and tools to clean, tokenize, and prepare text data for natural language processing (NLP) tasks.

bjclayton/CreoleNLTK

CreoleNLTK: Creole Natural Language Toolkit

CreoleNLTK is a Python library for preprocessing Creole text. It prepares text data for natural language processing (NLP) tasks by providing cleaning, tokenization, lowercasing, stopword removal, contraction expansion, and spell checking.

Features

  • Spelling Check: Identify and correct spelling errors.
  • Contraction to Expansion: Expand contractions in the text.
  • Stopword Removal: Remove common words that contribute little to the meaning.
  • Tokenization: Break the text into words or tokens.
  • Text Cleaning: Remove unwanted characters and clean the text.

Installation

You can install CreoleNLTK using pip:

pip install creolenltk
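
To confirm the install from Python, one option is the standard library's importlib.metadata. This is a minimal sketch: the helper name installed_version is hypothetical, and it assumes the distribution name matches the one used in the pip command.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("creolenltk"))
```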

Usage

Spelling Checker

from creolenltk.spelling_checker import SpellingChecker

# Initialize the spelling checker
spell_checker = SpellingChecker()

# Correct spelling errors in a word
corrected_word = spell_checker.correction('òtgraf')

print(f"Original Word: òtgraf, Corrected Word: {corrected_word}") # òtograf
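
For intuition about what a correction step can involve, here is a minimal sketch of edit-distance-one correction against a tiny vocabulary. It is not CreoleNLTK's implementation: VOCAB, ALPHABET, edits1, and suggest are hypothetical names chosen for illustration, and a real checker would use a full lexicon and word frequencies.

```python
# Illustrative only: not CreoleNLTK's implementation.
# VOCAB and ALPHABET are hypothetical stand-ins for a real Creole lexicon.
VOCAB = {"òtograf", "fraz", "lakay", "manje"}
ALPHABET = "abcdefghijklmnopqrstuvwxyzàèò"


def edits1(word):
    """Return every string one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word):
    """Return word if known, else a known word one edit away, else word unchanged."""
    if word in VOCAB:
        return word
    candidates = sorted(edits1(word) & VOCAB)
    return candidates[0] if candidates else word


print(suggest("òtgraf"))  # inserting "o" after "òt" yields the known word òtograf
```

Picking the alphabetically first candidate is an arbitrary tie-break; ranking candidates by corpus frequency is the more common choice.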

Contraction to Expansion

from creolenltk.contraction_expansion import ContractionToExpansion

# Initialize the contraction expander
contraction_expander = ContractionToExpansion()

# Expand contractions in a sentence
original_sentence = "L'ap manje. m'ap rete lakay mw."
expanded_sentence = contraction_expander.expand_contractions(original_sentence)

print(f"Original Sentence: {original_sentence}\nExpanded Sentence: {expanded_sentence}") # li ap manje. mwen ap rete lakay mwen.
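
The idea behind contraction expansion can be sketched as a lookup table applied with regular expressions. The tables and the expand function below are hypothetical and cover only the forms in the sentence above; they are not the mapping or the code that CreoleNLTK uses.

```python
import re

# Hypothetical, deliberately tiny tables; a real mapping would be much larger.
APOSTROPHE_FORMS = {"l'": "li ", "m'": "mwen "}  # attached to the next word
WORD_FORMS = {"mw": "mwen"}                      # standalone short forms


def expand(text):
    """Replace the known contracted forms with their full forms."""
    for short, full in APOSTROPHE_FORMS.items():
        # Match at a word start; the apostrophe marks the end of the contraction.
        text = re.sub(r"\b" + re.escape(short), full, text, flags=re.IGNORECASE)
    for short, full in WORD_FORMS.items():
        # Require word boundaries on both sides so "mwen" itself is left alone.
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text, flags=re.IGNORECASE)
    return text


print(expand("L'ap manje. m'ap rete lakay mw."))
# li ap manje. mwen ap rete lakay mwen.
```

Note that this sketch emits the lowercase full form even when the contraction was capitalized, which is why "L'ap" becomes "li ap".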

Stopword Removal

from creolenltk.stopword import Stopword

# Initialize the stopword handler
stopword_handler = Stopword()

# Remove stopwords from a sentence
sentence_with_stopwords = "Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."
sentence_without_stopwords = stopword_handler.remove_stopwords(sentence_with_stopwords)

print(f"Sentence with Stopwords: {sentence_with_stopwords}\nWithout Stopwords: {sentence_without_stopwords}") # fraz tès stopwords Kreyòl Ayisyen.
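
Conceptually, stopword removal is a set-membership filter over the words of a sentence. The sketch below assumes a small hypothetical STOPWORDS set containing only the words dropped in the example above; it does not reproduce CreoleNLTK's stopword list or its implementation.

```python
# Hypothetical stopword set, limited to the words removed in the example sentence.
STOPWORDS = {"sa", "se", "yon", "ak", "kèk", "nan"}


def drop_stopwords(sentence):
    """Keep only words whose lowercase, punctuation-stripped form is not a stopword."""
    kept = [
        word
        for word in sentence.split()
        if word.strip(".,;:!?").lower() not in STOPWORDS
    ]
    return " ".join(kept)


print(drop_stopwords("Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."))
# fraz tès stopwords Kreyòl Ayisyen.
```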

Tokenizer

from creolenltk.tokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Tokenize a sentence
sentence = "Sa se yon fraz senp"
tokens = tokenizer.word_tokenize(sentence, expand_contractions=True, lowercase=True)

print(f"Sentence: {sentence}\nTokens: {tokens}") # ["sa", "se", "yon", "fraz", "senp"]
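
As a rough illustration of word tokenization with optional lowercasing, the following sketch uses a single regular expression. The names TOKEN_RE and simple_tokenize are hypothetical, and the library's own tokenizer may handle punctuation and contractions differently.

```python
import re

# One or more word characters, optionally joined by internal apostrophes (e.g. m'ap).
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")


def simple_tokenize(text, lowercase=True):
    """Split text into word tokens, discarding punctuation and whitespace."""
    if lowercase:
        text = text.lower()
    return TOKEN_RE.findall(text)


print(simple_tokenize("Sa se yon fraz senp"))
# ['sa', 'se', 'yon', 'fraz', 'senp']
```

Because Python's \w is Unicode-aware for str patterns, accented letters such as è and ò stay inside their tokens.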

For more detailed usage and examples, refer to the documentation.

License

MIT licensed. See the bundled LICENSE file for more details.
