Husky_Simplex

A custom text preprocessing package

Data preprocessing is the first and most essential stage in developing a machine learning model, as it affects the overall accuracy and efficiency of the outcome. Raw text data contains non-contextual words, noise, misspelled words, symbols, punctuation, and unnecessary syntactic connotations. To remove these hindrances, raw text must be cleaned into a form suitable for statistical and computational analysis.

The purpose of this package is to provide a one-stop platform for the most common text preprocessing techniques. These steps make text data more useful, statistically and computationally, for Natural Language Processing tasks.

Package Functions

We have implemented three classes: Word, Clean, and Vectorizer. The Word class contains methods that deal with the words in the text data, such as Tokenization, Word Counter, and Stopword removal. The Clean class deals with correcting noise and non-contextual words that carry no statistical significance; Punctuation removal, Symbol removal, Sentence splitting, and Stemming are the methods included in this class. Finally, the Vectorizer class contains the Bag of Words, Count_vectorizer, and TFIDF_vectorizer methods. Overall, we implement methods with the following functionalities (an illustrative sketch follows the list):

  1. Tokenization - Converting a string input into a list of words.
  2. Word counter - Counting the total number of words in the input.
  3. Stopword removal - Removing non-contextual words that serve only the grammatical structure.
  4. Punctuation removal - Removing punctuation marks.
  5. Symbol removal - Removing symbols.
  6. Stemming - Reducing words to their stems, removing tense and inflection suffixes.
  7. Bag of words - Quantifying words by their counts.
  8. Count vectorization - Vectorizing text based on term frequency.
  9. TF-IDF vectorization - Vectorizing text based on term frequency in relation to document frequency.
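
As a concrete illustration of items 1-3 above, here is a minimal from-scratch sketch of tokenization, word counting, and stopword removal. It is illustrative only, not the package's actual implementation; the regex and the tiny stopword list are assumptions made purely for demonstration.

import re

# Tiny example stopword list; real stopword lists are much larger.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "it", "that"}

def tokenize(text):
    # 1. Tokenization: split a string into lowercase word tokens.
    return re.findall(r"[a-z']+", text.lower())

def word_count(text):
    # 2. Word counter: total number of word tokens in the input.
    return len(tokenize(text))

def remove_stopwords(tokens):
    # 3. Stopword removal: drop words that serve only grammatical structure.
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("She is a good person, and she loves pizza.")
print(word_count("She is a good person, and she loves pizza."))  # 9
print(remove_stopwords(tokens))  # ['she', 'good', 'person', 'she', 'loves', 'pizza']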

Code Usage

Class Word (tokenize)

from husky_simplex import Word  # assuming the classes are exported at the package top level

sample_input = "She is a good person, and she loves pizza@#$%, that's probably because of her intestinal^*& prerogatory transformation. The neighbours got%£ some pizza, enjoying it without electrical assistance.........."
word = Word(sample_input)
print(word.tokenize())

Class Clean (remove_punctuation)

from husky_simplex import Clean  # import path assumed, as above

sample_input = "She is a good person, and she loves pizza@#$%, that's probably because of her intestinal^*& prerogatory transformation. The neighbours got%£ some pizza, enjoying it without electrical assistance.........."
clean = Clean()
print(clean.remove_punctuation(sample_input))
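
For comparison, punctuation removal can be sketched in a few lines using Python's built-in string.punctuation set. This is illustrative only and may differ from how Clean.remove_punctuation is implemented; note that non-ASCII symbols such as £ are not in string.punctuation, which is why Symbol removal is a separate method.

import string

def strip_punctuation(text):
    # Delete every ASCII punctuation character from the input.
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("She loves pizza@#$%, that's it!"))
# -> She loves pizza thats it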

Class Vectorizer (BOW_fit_transform)

from husky_simplex import Vectorizer  # import path assumed, as above

sample_input = "She is a good person, and she loves pizza@#$%, that's probably because of her intestinal^*& prerogatory transformation. The neighbours got%£ some pizza, enjoying it without electrical assistance.........."
vectorize = Vectorizer(sample_input)
print(vectorize.BOW_fit_transform())
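
To show the idea behind items 7-9 in the list above, here is a from-scratch sketch of bag-of-words counts and TF-IDF weighting over a toy two-document corpus. It is illustrative only and is not the Vectorizer class's actual code.

import math
from collections import Counter

docs = [
    ["she", "loves", "pizza"],
    ["the", "neighbours", "got", "some", "pizza"],
]

def bag_of_words(tokens):
    # Bag of words: map each term to its raw count in one document.
    return dict(Counter(tokens))

def tfidf(tokens, corpus):
    # TF-IDF: tf(t, d) * log(N / df(t)), where N is the number of
    # documents and df(t) is how many documents contain the term t.
    n_docs = len(corpus)
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: (count / total) * math.log(n_docs / sum(1 for d in corpus if term in d))
        for term, count in counts.items()
    }

print(bag_of_words(docs[0]))  # {'she': 1, 'loves': 1, 'pizza': 1}
print(tfidf(docs[0], docs))   # 'pizza' weighs 0.0 because it occurs in every document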

Installation

pip install husky_simplex

or
git clone https://github.com/Sudhendra/Husky_Simplex.git
cd Husky_Simplex
pip install -r requirements.txt