A repository for exploring text clustering techniques in natural language processing (NLP). The project implements unsupervised learning methods that group similar text documents.
- Preprocessing pipeline for text datasets, including Chinese text
- Multiple clustering algorithms (K-Means, MiniBatchKMeans, Birch, AffinityPropagation, AgglomerativeClustering, DBSCAN)
- Several vectorization methods: vector space model (VSM), latent semantic indexing (LSI), and latent Dirichlet allocation (LDA)
- Easy integration with custom datasets (given text and keywords)
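
To give a feel for how vectorization and clustering fit together, here is a minimal sketch that chains a TF-IDF vector space model, an LSI projection, and K-Means. It assumes scikit-learn is installed and uses a toy corpus invented for illustration; it is not the code in `main.py`, and the parameter values (`n_components=2`, `n_clusters=2`) are chosen only to suit the toy data.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy corpus invented for this example; not part of the repository.
docs = [
    "cats and kittens sleep on the sofa",
    "kittens and cats play on the sofa",
    "stock markets fell after rates rose",
    "rates rose while stock markets fell",
]

# VSM: TF-IDF weighted bag-of-words vectors.
# LSI: truncated SVD of the TF-IDF matrix, followed by L2 normalization.
lsi = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
features = lsi.fit_transform(docs)

# K-Means on the reduced vectors; two clusters match the two toy topics.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)
print(labels)
```

The cluster ids themselves are arbitrary; what matters is which documents share an id. Swapping `KMeans` for another scikit-learn estimator such as `AgglomerativeClustering` or `DBSCAN` only changes the final step.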
- Clone the repository:
git clone https://github.com/michellemashutian/clusteringText.git
cd clusteringText
- Install required dependencies:
pip install -r requirements.txt
- Prepare your dataset as a .txt file whose columns contain the text data.
- Run the main script:
python main.py
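
The exact column layout expected by `main.py` is not described here, so the following is only a sketch of one plausible layout: one document per line, with the text and an optional comma-separated keyword list separated by a tab. The function name `read_documents` and this layout are assumptions made for illustration; check the repository's sample data before relying on them.

```python
import csv
import io


def read_documents(handle):
    """Parse tab-separated lines into (text, keywords) pairs.

    Assumed layout (hypothetical): text<TAB>keyword1, keyword2, ...
    """
    documents = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or not row[0].strip():
            continue  # skip blank lines
        text = row[0].strip()
        # The keyword column is treated as optional in this sketch.
        raw_keywords = row[1] if len(row) > 1 else ""
        keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
        documents.append((text, keywords))
    return documents


# In-memory sample standing in for a real .txt file.
sample = io.StringIO(
    "Deep learning for text clustering\tdeep learning, clustering\n"
    "\n"
    "文本聚类方法研究\t文本聚类, 方法\n"
)
docs = read_documents(sample)
for text, keywords in docs:
    print(text, keywords)
```

For a file on disk, pass `open(path, encoding="utf-8", newline="")` instead of the `StringIO` object so that Chinese text is decoded correctly.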
For questions or feedback, open an issue on GitHub or email mashutian0608@hotmail.com