Final year BTech Project
College of Engineering, Pune
Bachelor of Technology, Information Technology
2016 - 2020
Team Members:
- Devashish Gaikwad (111608023)
- Venkatesh Yelnoorkar (111608077)
- Atharva Jadhav (111608031)
AbstractClustering is a comprehensive machine learning project focused on clustering research paper abstracts using various natural language processing (NLP) techniques and clustering algorithms. The project processes academic papers from ArXiv and other repositories, creates embeddings from their abstracts, and groups similar papers together using multiple clustering approaches.
- Data Extraction: Converts JSON metadata from research paper repositories into structured CSV format
- Text Preprocessing: Implements custom preprocessing with tokenization, stopword removal, and lemmatization using NLTK
- Abstract Extraction: convert_repository_json.py and the AbstractPreprocessor class handle extraction and cleaning of paper abstracts
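The extraction step can be illustrated with a minimal sketch using only the standard library. The record layout (id, title, abstract keys), the sample data, and the function names clean_abstract and json_to_csv are assumptions made for illustration; they are not taken from convert_repository_json.py, and the real repository JSON may be structured differently.

```python
import csv
import io
import json

# Hypothetical metadata records; the real repository JSON layout may differ.
SAMPLE_JSON = """
[
  {"id": "1901.00001", "title": "Paper A", "abstract": "  We study\\nclustering of   text. "},
  {"id": "1901.00002", "title": "Paper B", "abstract": "Word embeddings\\nfor abstracts."}
]
"""


def clean_abstract(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(text.split())


def json_to_csv(json_text, out_file):
    """Write one CSV row per paper with id, title, and cleaned abstract."""
    records = json.loads(json_text)
    writer = csv.DictWriter(out_file, fieldnames=["id", "title", "abstract"])
    writer.writeheader()
    for record in records:
        writer.writerow({
            "id": record["id"],
            "title": record["title"],
            "abstract": clean_abstract(record.get("abstract", "")),
        })
    return len(records)


# Write to an in-memory buffer here; a real run would pass an open file.
buffer = io.StringIO()
rows_written = json_to_csv(SAMPLE_JSON, buffer)
csv_text = buffer.getvalue()
```

Writing through csv.DictWriter keeps quoting correct when an abstract contains commas or quotation marks, which is common in paper metadata.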
The project implements multiple word embedding approaches:
- Word2Vec: Pre-trained Word2Vec models for word-level embeddings
- GloVe: Support for GloVe embeddings (50d, 100d versions)
- FastText: FastText embeddings for handling out-of-vocabulary words
- Custom Embedding Classes: Modular architecture with Word2Vec, GloveVectors, and FastTextVectors classes
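As a rough illustration of what an embedding loader does, the sketch below parses the plain-text GloVe layout (one word per line followed by its space-separated vector components). The sample vectors are made up, and load_glove and embed_tokens are hypothetical helpers, not the methods of the GloveVectors class. Out-of-vocabulary words are simply dropped here, which is one possible policy among several.

```python
import io

import numpy as np

# Three made-up 4-dimensional vectors in the GloVe text layout.
SAMPLE_GLOVE = """cluster 0.1 0.2 0.3 0.4
paper 0.5 0.1 0.0 0.2
abstract 0.3 0.3 0.3 0.3
"""


def load_glove(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> float32 vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def embed_tokens(tokens, vectors):
    """Look up each token, dropping out-of-vocabulary words."""
    found = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    return np.vstack(found) if found else np.zeros((0, dim), dtype=np.float32)


glove = load_glove(io.StringIO(SAMPLE_GLOVE))
# "unknownword" is not in the vocabulary, so only two rows come back.
matrix = embed_tokens(["paper", "unknownword", "cluster"], glove)
```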
- P-Means Aggregation: Implements generalized power mean (p-means) to aggregate word embeddings into sentence embeddings
- Combines multiple aggregation strategies:
- Arithmetic mean (p=1)
- Maximum pooling
- Minimum pooling
- Quadratic mean (p=2)
- Quartic mean (p=4)
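The aggregation strategies listed above are all instances of the generalized power mean, which for a set of values x_1..x_n is (mean of x_i^p)^(1/p); maximum and minimum pooling are the limits p = +inf and p = -inf. The following is a minimal NumPy sketch of that idea. The names power_mean and concat_power_means are hypothetical, and concatenating the per-p results into one vector is an assumption about how the strategies are combined, not a description of gen_mean() or get_pmeans().

```python
import numpy as np


def power_mean(vectors, p):
    """Generalized power mean over axis 0 of a (n_words, dim) array.

    p = +inf and p = -inf are the limiting cases: element-wise max and min.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    if p == np.inf:
        return vectors.max(axis=0)
    if p == -np.inf:
        return vectors.min(axis=0)
    if p == 1:
        return vectors.mean(axis=0)
    # For even p the mean of x**p is non-negative, so the real root exists.
    # Odd p > 1 would need complex arithmetic for negative inputs.
    if p % 2 != 0:
        raise ValueError("this sketch only supports p = 1 and even p")
    return np.mean(vectors ** p, axis=0) ** (1.0 / p)


def concat_power_means(vectors, ps=(1, np.inf, -np.inf, 2, 4)):
    """Concatenate one power mean per p into a single sentence vector."""
    return np.concatenate([power_mean(vectors, p) for p in ps])


# Two 2-dimensional word vectors yield a 5 * 2 = 10-dimensional sentence vector.
words = np.array([[1.0, -2.0], [3.0, 2.0]])
sentence = concat_power_means(words)
```

Note that even values of p discard the sign of each component, so the quadratic and quartic means describe magnitude only; the arithmetic mean, max, and min retain sign information.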
Multiple clustering approaches are explored:
- K-Means Clustering: Standard K-Means with Euclidean distance
- GPU-Accelerated K-Means: CUDA implementation using libKMCUDA for large-scale clustering
- Hierarchical Clustering: Agglomerative clustering methods
- DBSCAN: Density-based spatial clustering
- Deep Embedded Clustering (DEC): Keras-based deep learning approach for clustering
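To show how the CPU-based approaches differ in use, here is a small scikit-learn sketch that runs K-Means, agglomerative clustering, and DBSCAN on the same data. The synthetic blobs stand in for abstract embeddings, and the parameter values (n_clusters, eps, min_samples) are illustrative choices for this toy data, not settings taken from the project notebooks.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans

# Two well-separated synthetic blobs standing in for abstract embeddings.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 5)),
    rng.normal(loc=3.0, scale=0.1, size=(20, 5)),
])

# K-Means and agglomerative clustering need the cluster count up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)

# DBSCAN infers the cluster count from density; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(embeddings)
```

The practical difference is that K-Means and agglomerative clustering require choosing the number of clusters, while DBSCAN requires choosing a neighbourhood radius and may leave some points unassigned.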
- Automated methods to find optimal number of clusters
- Evaluation using the silhouette score and the elbow method
- Template notebooks for systematic K optimization
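A common way to search for K is to fit K-Means over a range of values and record both the inertia (for the elbow plot) and the silhouette score. The sketch below does this with scikit-learn on synthetic data; scan_k is a hypothetical helper and the K range is arbitrary, so this should be read as an outline of the approach rather than the contents of the FindK template notebooks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three synthetic 2-D blobs standing in for sentence embeddings.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
embeddings = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in centers])


def scan_k(data, k_values):
    """Fit K-Means for each k and collect inertia and silhouette score."""
    inertias, silhouettes = {}, {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        inertias[k] = model.inertia_  # within-cluster sum of squares
        silhouettes[k] = silhouette_score(data, model.labels_)
    return inertias, silhouettes


inertias, silhouettes = scan_k(embeddings, range(2, 7))
# Pick the k with the highest silhouette score; plot inertias to see the elbow.
best_k = max(silhouettes, key=silhouettes.get)
```

Inertia always tends to fall as K grows, so it is read visually for the point of diminishing returns, whereas the silhouette score can be maximized directly.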
- SQLite Database: papersqlite.py provides an interface to store and query research papers
- Schema: Paper ID, Title, Abstract
- Efficient retrieval of paper information
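A storage layer with the three-column schema described above could look roughly like the following sqlite3 sketch. The class name PaperStore, the table name papers, and the column names are assumptions made for illustration; the actual interface in papersqlite.py may differ.

```python
import sqlite3


class PaperStore:
    """Hypothetical wrapper; not the interface of papersqlite.py."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS papers ("
            "paper_id TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
        )

    def add(self, paper_id, title, abstract):
        # Parameterized query avoids SQL injection and quoting bugs.
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
                (paper_id, title, abstract),
            )

    def get(self, paper_id):
        """Return (paper_id, title, abstract) or None if not stored."""
        return self.conn.execute(
            "SELECT paper_id, title, abstract FROM papers WHERE paper_id = ?",
            (paper_id,),
        ).fetchone()


store = PaperStore()
store.add("1901.00001", "Paper A", "We study clustering of text.")
found = store.get("1901.00001")
missing = store.get("does-not-exist")
```

Declaring the paper ID as the primary key gives an index for free, so lookups by ID stay fast as the table grows.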
- Plotting utilities for cluster visualization
- DBSCAN plots and analysis
- HTML output generation for results presentation
AbstractClustering/
├── preprocessing.py # Text preprocessing utilities
├── word_embedding.py # Word embedding model loaders
├── sentence_embedding.py # Sentence embedding generation (p-means)
├── convert_repository_json.py # JSON to CSV converter
├── papersqlite.py # SQLite database interface
├── test_cuda.py # CUDA K-Means unit tests
├── BaseTemplate.ipynb # Template for clustering workflow
├── ColabTemplate.ipynb # Google Colab compatible version
├── CreateSentenceEmbedding.ipynb
├── ClusterKEuclid.ipynb
├── FindKTemplateClusterKEuclid.ipynb
├── Keras-DEC.ipynb # Deep Embedded Clustering
├── h_cluster.ipynb # Hierarchical clustering
├── plot_dbscan.ipynb
├── word2vec_cluster.ipynb
├── Devashish_word2vec_cluster.ipynb
├── TemplateClusterKEuclid.ipynb
├── MetaConcat.ipynb
├── PickleViewer.ipynb
├── splitembs.ipynb
├── dataset/ # Data files and GloVe embeddings
├── arxiv/ # ArXiv paper datasets
│ ├── processed/
│ └── unprocessed/
├── final app/ # Final application
│ ├── basiclib/ # Core library modules
│ └── finalapp.ipynb
├── templates/ # Template notebooks for different configurations
├── models/ # Saved models
├── plots/ # Generated visualizations
├── MegaDfs/ # Large dataframes
├── MegaSentEmbs/ # Sentence embeddings
├── PklCleaned/ # Cleaned pickle files
└── OptimalK/ # Optimal K analysis results
- preproc(): Basic preprocessing function
- AbstractPreprocessor: Class-based preprocessor with configurable tokenization and lemmatization
- Word2Vec: Loads Word2Vec models
- GloveVectors: Loads and manages GloVe embeddings
- FastTextVectors: Handles FastText embeddings with OOV support
- gen_mean(): Generalized power mean calculation
- get_pmeans(): Aggregates word embeddings using multiple p-means
- AbstractExtracter: Extracts and cleans abstracts from JSON repository files
- Handles ArXiv and other research paper metadata formats
- Python 3.x
- Machine Learning: scikit-learn, NumPy, pandas
- Deep Learning: Keras, TensorFlow
- NLP: NLTK, Gensim
- GPU Computing: CUDA, libKMCUDA
- Embeddings: Word2Vec, GloVe, FastText
- Visualization: Matplotlib, Jupyter Notebooks
- Database: SQLite3
- Data Collection: Research papers downloaded from ArXiv
- Preprocessing: Abstracts extracted, tokenized, and cleaned
- Embedding Generation: Words converted to vectors using pre-trained models
- Sentence Embedding: Word vectors aggregated to create abstract representations
- Clustering: Similar abstracts grouped using various clustering algorithms
- Optimization: Optimal number of clusters determined
- Evaluation: Analyze cluster quality and coherence
- Visualization: Generate plots and reports
- BaseTemplate.ipynb: Standard workflow template
- ColabTemplate.ipynb: Google Colab compatible version
- CreateSentenceEmbedding.ipynb: Generates embeddings for abstracts
- Keras-DEC.ipynb: Deep Embedded Clustering implementation
- FindKTemplateClusterKEuclid.ipynb: Optimal K determination
- plot_dbscan.ipynb: DBSCAN clustering analysis
- GPU acceleration using CUDA for large-scale K-Means clustering
- Batch processing for embedding generation
- Efficient data structures (NumPy arrays) for vector operations
- Optimized distance calculations (Euclidean, Cosine)
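Vectorized distance computation with NumPy can be sketched as follows. Both functions are hypothetical helpers written for illustration, not code from the repository; they compute all pairwise distances with matrix operations instead of Python loops, and the cosine version assumes no row is the zero vector.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Distances between rows of a (n, d) and b (m, d).

    Uses the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y so the
    whole (n, m) matrix comes from a single matrix product.
    """
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))  # clamp tiny negatives from rounding


def pairwise_cosine_distance(a, b):
    """1 - cosine similarity between rows of a and b (rows must be non-zero)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


x = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([[3.0, 0.0]])
euclid = pairwise_euclidean(x, y)        # shape (2, 1)
cosine = pairwise_cosine_distance(x, y)  # shape (2, 1)
```

Cosine distance ignores vector length, so the first row of x is at cosine distance 0 from y even though the two points are 2 units apart in Euclidean terms.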
MIT License - Copyright (c) 2020 Devashish Gaikwad
See LICENSE file for full details.