GitHub - AlexTsev/NLP_Preprocess_Documents

Preprocess text documents (PDFs) using Python NLP libraries. Extract text with pdfplumber, tokenize with NLTK and SpaCy, remove Greek stopwords, and optionally handle punctuation. Includes scripts and folder structure for preparing datasets for machine learning or deep learning NLP workflows.
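
The steps named in the description (extract, tokenize, drop Greek stopwords, optionally drop punctuation) could be chained roughly as follows. This is a minimal sketch, not the repository's actual scripts: the function names (`extract_text`, `tokenize_nltk`, `tokenize_spacy`, `clean_tokens`, `preprocess_pdf`), the extra Greek punctuation set, and the lowercasing step are all assumptions made for illustration. It also assumes that pdfplumber, NLTK, and spaCy are installed and that the NLTK `stopwords` and `punkt` resources can be downloaded.

```python
import string
from typing import Iterable, List, Set

# ASCII punctuation plus a few marks common in Greek text (illustrative choice).
PUNCTUATION: Set[str] = set(string.punctuation) | {"«", "»", "·", "…"}


def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page of a PDF using pdfplumber."""
    import pdfplumber  # third-party, imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def load_greek_stopwords() -> Set[str]:
    """Load the Greek stopword list shipped with NLTK's stopwords corpus."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    return set(stopwords.words("greek"))


def tokenize_nltk(text: str) -> List[str]:
    """Tokenize with NLTK, using its Greek sentence model."""
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases
    return word_tokenize(text, language="greek")


def tokenize_spacy(text: str) -> List[str]:
    """Tokenize with a blank spaCy Greek pipeline (no trained model needed)."""
    import spacy

    nlp = spacy.blank("el")
    return [token.text for token in nlp(text)]


def clean_tokens(
    tokens: Iterable[str],
    stopwords: Set[str],
    keep_punctuation: bool = False,
) -> List[str]:
    """Lowercase tokens, drop stopwords, and optionally drop punctuation."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stopwords:
            continue
        is_punct = all(ch in PUNCTUATION for ch in token)
        if is_punct and not keep_punctuation:
            continue
        cleaned.append(lowered)
    return cleaned


def preprocess_pdf(pdf_path: str, keep_punctuation: bool = False) -> List[str]:
    """Run the whole sketch on one PDF file."""
    text = extract_text(pdf_path)
    tokens = tokenize_nltk(text)
    return clean_tokens(tokens, load_greek_stopwords(), keep_punctuation)
```

Under these assumptions, `preprocess_pdf` would be called with the path of a file inside the `pdf` folder, and `tokenize_spacy` could be swapped in for `tokenize_nltk` to compare the two tokenizers. Keeping `clean_tokens` free of library calls makes it easy to check the filtering step on a hand-written token list before running it on real documents.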

pdf
bag_of_words.py
lemmatization_nltk.py
lemmatization_numpy_nltk.py
requirements.txt
spacy_ex.py
tokenizer_nltk.py