Text mining relies heavily on pre-processing. This library is an assortment of common text-processing techniques. It is divided into the following modules.
Load Data
This module contains all the functions for loading the data and outputting results.
Text Processing
This module contains the functions for tokenizing, normalising, removing special characters, and similar clean-up steps.
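The steps named above can be illustrated with a minimal sketch that uses only the standard library. The function names `normalise`, `remove_special_characters`, and `tokenize` are hypothetical and chosen for illustration; they are not necessarily the names this library exposes.

```python
import re
import unicodedata


def normalise(text):
    """Lower-case the text and strip accents (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_special_characters(text):
    """Replace anything that is not a letter, digit, or whitespace with a space."""
    return re.sub(r"[^\w\s]|_", " ", text)


def tokenize(text):
    """Normalise, clean, and split the text on whitespace."""
    return remove_special_characters(normalise(text)).split()


tokens = tokenize("Héllo, World! It's 2020.")
```

Replacing special characters with a space rather than deleting them keeps words such as "well-known" from being fused into one token; whether that is the desired behaviour depends on the task.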
Feature Selection
This module contains the functions for selecting text features such as word frequency, n-grams, and type-token ratio (TTR).
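As a rough illustration of these features, the sketch below computes word frequencies, n-grams, and the type-token ratio (number of distinct tokens divided by total tokens) from a token list. The function names are hypothetical and may differ from the ones in this library.

```python
from collections import Counter


def word_frequency(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


tokens = ["the", "cat", "saw", "the", "dog"]
frequencies = word_frequency(tokens)   # "the" occurs twice
bigrams = ngrams(tokens, 2)            # four bigrams
ttr = type_token_ratio(tokens)         # 4 distinct / 5 total
```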
Distance Measures
This module contains the functions for calculating the distance and similarity between two vectors.
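The source does not list which measures are included, so the sketch below assumes two common ones, Euclidean distance and cosine similarity, written in plain Python. The function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores vector length, which makes it a common choice when comparing word-frequency vectors of documents with very different sizes.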
Corpus Processor
This module converts the corpus into dictionaries (key: author, value: books by that author).
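The author-to-books mapping described above might look like the following sketch. It assumes the corpus is already available as (author, book) pairs; the function name `build_author_index` and the input format are assumptions made for illustration, not this library's API.

```python
from collections import defaultdict


def build_author_index(records):
    """Group book titles by author, keeping the input order of the books."""
    index = defaultdict(list)
    for author, book in records:
        index[author].append(book)
    return dict(index)


records = [
    ("Austen", "Emma"),
    ("Dickens", "Bleak House"),
    ("Austen", "Persuasion"),
]
corpus = build_author_index(records)
```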
Building requires wheel. If it is not installed, install it using the following command.
python3 -m pip install --user --upgrade setuptools wheel
Install the requirements.
pip install -r requirements.txt
Then enter the package directory and build the package using the following command.
python3 setup.py sdist bdist_wheel
This creates the dist folder containing the packaged files (a source tarball and a wheel).
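Install the package from the dist folder using the following command.
pip install ./dist/preprocess_NLP_pkg-0.0.1.tar.gz
To uninstall it, pass the package name without the version number.
pip uninstall preprocess_NLP_pkg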
- Lists of the most frequent words in different languages are available from the Computation Linguistics Group, University of Neuchatel.