DevashishX/AbstractClustering

AbstractClustering

Final year BTech Project
College of Engineering, Pune
Bachelor of Technology, Information Technology
2016 - 2020

Team Members:

  • Devashish Gaikwad (111608023)
  • Venkatesh Yelnoorkar (111608077)
  • Atharva Jadhav (111608031)

Project Overview

AbstractClustering is a machine learning project that clusters research paper abstracts using natural language processing (NLP) techniques and several clustering algorithms. It processes academic papers from ArXiv and other repositories, builds embeddings from their abstracts, and groups similar papers together.

Key Features

1. Data Processing Pipeline

  • Data Extraction: Converts JSON metadata from research paper repositories into structured CSV format
  • Text Preprocessing: Implements custom preprocessing with tokenization, stopword removal, and lemmatization using NLTK
  • Abstract Extraction: convert_repository_json.py and AbstractPreprocessor class handle extraction and cleaning of paper abstracts
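The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's AbstractPreprocessor: the function name, the regex tokenizer, and the tiny hard-coded stopword set are assumptions made so the example runs without downloading NLTK resources. In practice an NLTK lemmatizer method such as WordNetLemmatizer().lemmatize could be passed as the lemmatize argument.

```python
import re

# Tiny illustrative stopword set (an assumption); the project relies on NLTK's list.
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "for", "and", "or",
    "to", "is", "are", "we", "this", "that", "with",
}


def preprocess_abstract(text, stopwords=STOPWORDS, lemmatize=None):
    """Lowercase, tokenize, drop stopwords, and optionally lemmatize one abstract."""
    # Keep runs of ASCII letters only; digits and punctuation are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in stopwords]
    if lemmatize is not None:
        # lemmatize is any callable mapping a token to its base form.
        tokens = [lemmatize(t) for t in tokens]
    return tokens


print(preprocess_abstract("We propose a method for clustering the abstracts of papers."))
# → ['propose', 'method', 'clustering', 'abstracts', 'papers']
```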

2. Word Embedding Methods

The project implements multiple word embedding approaches:

  • Word2Vec: Pre-trained Word2Vec models for word-level embeddings
  • GloVe: Support for GloVe embeddings (50d, 100d versions)
  • FastText: FastText embeddings for handling out-of-vocabulary words
  • Custom Embedding Classes: Modular architecture with Word2Vec, GloveVectors, and FastTextVectors classes
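As an illustration of what a loader for pre-trained vectors might look like, the sketch below parses the plain-text GloVe format (one word per line, followed by its space-separated components) into a dictionary. The function names load_glove and lookup are hypothetical and are not taken from the project's GloveVectors class; skipping unknown words is also an assumption about how out-of-vocabulary tokens might be handled.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe text-format lines into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def lookup(vectors, tokens):
    """Stack the vectors of known tokens into an (n_known, dim) matrix."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.empty((0, 0), dtype=np.float32)
    return np.stack(found)


# In-memory stand-in for a real GloVe file, using made-up 2-dimensional vectors.
sample = io.StringIO("paper 0.1 0.2\ncluster 0.3 0.4\n")
glove = load_glove(sample)
matrix = lookup(glove, ["paper", "unknownword", "cluster"])
print(matrix.shape)  # → (2, 2)
```

For a real file, the same function could be given an open file handle, since it only iterates over lines.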

3. Sentence Embedding Generation

  • P-Means Aggregation: Implements generalized power mean (p-means) to aggregate word embeddings into sentence embeddings
  • Combines multiple aggregation strategies:
    • Arithmetic mean (p=1)
    • Maximum pooling
    • Minimum pooling
    • Quadratic mean (p=2)
    • Quartic mean (p=4)
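The power mean of values x_1..x_n is (1/n * sum(x_i^p))^(1/p); maximum and minimum pooling are commonly treated as its limiting cases for p going to plus and minus infinity. A minimal NumPy sketch of concatenating the five aggregates listed above is shown below. The function names are hypothetical and this is not the project's gen_mean() or get_pmeans(); in particular, the sketch only handles p=1 and even integer p, so that negative vector components never produce complex results.

```python
import numpy as np


def power_mean(word_vectors, p):
    """Aggregate an (n_words, dim) array over the word axis with one p-mean."""
    if p == "max":
        return word_vectors.max(axis=0)
    if p == "min":
        return word_vectors.min(axis=0)
    if p == 1:
        return word_vectors.mean(axis=0)
    if not isinstance(p, int) or p <= 0 or p % 2 != 0:
        raise ValueError("this sketch only handles p=1 and even positive integers")
    # Even powers are non-negative, so the p-th root stays real.
    return np.mean(word_vectors ** p, axis=0) ** (1.0 / p)


def pmeans_embedding(word_vectors, powers=(1, "max", "min", 2, 4)):
    """Concatenate several p-means into one vector of size len(powers) * dim."""
    return np.concatenate([power_mean(word_vectors, p) for p in powers])


# Two made-up 2-dimensional word vectors standing in for one abstract.
words = np.array([[1.0, -2.0], [3.0, 4.0]])
print(pmeans_embedding(words).shape)  # → (10,)
```

Note that the output dimension grows linearly with the number of aggregates, so five p-means over 100-dimensional word vectors give a 500-dimensional abstract embedding.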

4. Clustering Algorithms

Multiple clustering approaches are explored:

  • K-Means Clustering: Standard K-Means with Euclidean distance
  • GPU-Accelerated K-Means: CUDA implementation using libKMCUDA for large-scale clustering
  • Hierarchical Clustering: Agglomerative clustering methods
  • DBSCAN: Density-based spatial clustering
  • Deep Embedded Clustering (DEC): Keras-based deep learning approach for clustering
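To make the K-Means step concrete, here is a small NumPy sketch of Lloyd's algorithm with Euclidean distance. It is an illustration only: the project's notebooks are described as using library and GPU implementations, and the function name, the random initialization, and the convergence check below are assumptions.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups standing in for abstract embeddings.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

The distance computation materializes an (n_points, k, dim) array, which is fine for a toy example but is exactly the kind of cost that motivates a GPU implementation at scale.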

5. Optimal K Selection

  • Automated methods to find the optimal number of clusters
  • Evaluation using the silhouette score and the elbow method
  • Template notebooks for systematic K optimization
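One way to compare candidate values of K is to score the labels produced for each K and keep the best-scoring one. The sketch below computes the mean silhouette coefficient directly in NumPy so that it is self-contained; the function name is hypothetical, and scikit-learn's sklearn.metrics.silhouette_score would be the usual choice in practice. It assumes at least two clusters and scores singleton clusters as 0.

```python
import numpy as np


def silhouette_sketch(X, labels):
    """Mean silhouette coefficient from a full pairwise distance matrix."""
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the point's own cluster.
        a = dists[i, same].mean()
        # b: mean distance to the nearest other cluster.
        b = min(
            dists[i, labels == c].mean()
            for c in np.unique(labels)
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
good = silhouette_sketch(X, [0, 0, 0, 1, 1, 1])
bad = silhouette_sketch(X, [0, 1, 0, 1, 0, 1])
print(good > bad)  # → True
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 or below suggest overlapping or misassigned points.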

6. Database Integration

  • SQLite Database: papersqlite.py provides interface to store and query research papers
  • Schema: Paper ID, Title, Abstract
  • Efficient retrieval of paper information
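A storage layer with the three-column schema above could look roughly like the following sketch, which uses Python's built-in sqlite3 module. The table name, column names, and function names are assumptions for illustration and are not taken from papersqlite.py.

```python
import sqlite3


def open_paper_db(path=":memory:"):
    """Open (or create) the database and make sure the papers table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "paper_id TEXT PRIMARY KEY, "
        "title TEXT NOT NULL, "
        "abstract TEXT NOT NULL)"
    )
    return conn


def add_paper(conn, paper_id, title, abstract):
    """Insert a paper, overwriting any existing row with the same ID."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO papers (paper_id, title, abstract) "
            "VALUES (?, ?, ?)",
            (paper_id, title, abstract),
        )


def get_paper(conn, paper_id):
    """Return (paper_id, title, abstract), or None if the ID is unknown."""
    return conn.execute(
        "SELECT paper_id, title, abstract FROM papers WHERE paper_id = ?",
        (paper_id,),
    ).fetchone()


conn = open_paper_db()
add_paper(conn, "example-001", "Placeholder title", "Placeholder abstract text.")
print(get_paper(conn, "example-001"))
# → ('example-001', 'Placeholder title', 'Placeholder abstract text.')
```

Making the paper ID the primary key means lookups by ID use an index, which is what keeps retrieval fast once clusters need to be mapped back to titles and abstracts.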

7. Visualization and Analysis

  • Plotting utilities for cluster visualization
  • DBSCAN plots and analysis
  • HTML output generation for results presentation

Project Structure

AbstractClustering/
├── preprocessing.py              # Text preprocessing utilities
├── word_embedding.py             # Word embedding model loaders
├── sentence_embedding.py         # Sentence embedding generation (p-means)
├── convert_repository_json.py    # JSON to CSV converter
├── papersqlite.py                # SQLite database interface
├── test_cuda.py                  # CUDA K-Means unit tests
├── BaseTemplate.ipynb            # Template for clustering workflow
├── ColabTemplate.ipynb           # Google Colab compatible version
├── CreateSentenceEmbedding.ipynb
├── ClusterKEuclid.ipynb
├── FindKTemplateClusterKEuclid.ipynb
├── Keras-DEC.ipynb               # Deep Embedded Clustering
├── h_cluster.ipynb               # Hierarchical clustering
├── plot_dbscan.ipynb
├── word2vec_cluster.ipynb
├── Devashish_word2vec_cluster.ipynb
├── TemplateClusterKEuclid.ipynb
├── MetaConcat.ipynb
├── PickleViewer.ipynb
├── splitembs.ipynb
├── dataset/                      # Data files and GloVe embeddings
├── arxiv/                        # ArXiv paper datasets
│   ├── processed/
│   └── unprocessed/
├── final app/                    # Final application
│   ├── basiclib/                 # Core library modules
│   └── finalapp.ipynb
├── templates/                    # Template notebooks for different configurations
├── models/                       # Saved models
├── plots/                        # Generated visualizations
├── MegaDfs/                      # Large dataframes
├── MegaSentEmbs/                 # Sentence embeddings
├── PklCleaned/                   # Cleaned pickle files
└── OptimalK/                     # Optimal K analysis results

Core Modules

preprocessing.py

  • preproc(): Basic preprocessing function
  • AbstractPreprocessor: Class-based preprocessor with configurable tokenization and lemmatization

word_embedding.py

  • Word2Vec: Loads Word2Vec models
  • GloveVectors: Loads and manages GloVe embeddings
  • FastTextVectors: Handles FastText embeddings with OOV support

sentence_embedding.py

  • gen_mean(): Generalized power mean calculation
  • get_pmeans(): Aggregates word embeddings using multiple p-means

convert_repository_json.py

  • AbstractExtracter: Extracts and cleans abstracts from JSON repository files
  • Handles ArXiv and other research paper metadata formats
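The JSON-to-CSV step might be sketched as below using only the standard library. This is not the project's AbstractExtracter: the function name and the input layout (a JSON array of records with "id", "title", and "abstract" keys) are assumptions, and real repository dumps may use different field names or one JSON object per line.

```python
import csv
import io
import json


def json_records_to_csv(json_text, out_file, fields=("id", "title", "abstract")):
    """Write records with a non-empty abstract to out_file; return the row count."""
    records = json.loads(json_text)
    writer = csv.DictWriter(out_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for record in records:
        # Collapse newlines and repeated spaces inside the abstract.
        abstract = " ".join(str(record.get("abstract", "")).split())
        if not abstract:
            continue  # skip papers without an abstract
        row = {f: record.get(f, "") for f in fields}
        row["abstract"] = abstract
        writer.writerow(row)
        written += 1
    return written


# Made-up records standing in for a metadata dump.
raw = json.dumps([
    {"id": "1", "title": "First", "abstract": "  spans\ntwo lines "},
    {"id": "2", "title": "Second", "abstract": ""},
])
out = io.StringIO()
print(json_records_to_csv(raw, out))  # → 1
```

When writing to a real file, opening it with newline="" is the documented way to avoid blank lines between CSV rows on some platforms.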

Technologies Used

  • Python 3.x
  • Machine Learning: scikit-learn, NumPy, pandas
  • Deep Learning: Keras, TensorFlow
  • NLP: NLTK, Gensim
  • GPU Computing: CUDA, libKMCUDA
  • Embeddings: Word2Vec, GloVe, FastText
  • Visualization: Matplotlib, Jupyter Notebooks
  • Database: SQLite3

Workflow

  1. Data Collection: Research papers downloaded from ArXiv
  2. Preprocessing: Abstracts extracted, tokenized, and cleaned
  3. Embedding Generation: Words converted to vectors using pre-trained models
  4. Sentence Embedding: Word vectors aggregated to create abstract representations
  5. Clustering: Similar abstracts grouped using various clustering algorithms
  6. Optimization: Optimal number of clusters determined
  7. Evaluation: Cluster quality and coherence analyzed
  8. Visualization: Plots and reports generated

Key Notebooks

  • BaseTemplate.ipynb: Standard workflow template
  • ColabTemplate.ipynb: Google Colab compatible version
  • CreateSentenceEmbedding.ipynb: Generates embeddings for abstracts
  • Keras-DEC.ipynb: Deep Embedded Clustering implementation
  • FindKTemplateClusterKEuclid.ipynb: Optimal K determination
  • plot_dbscan.ipynb: DBSCAN clustering analysis

Performance Optimization

  • GPU acceleration using CUDA for large-scale K-Means clustering
  • Batch processing for embedding generation
  • Efficient data structures (NumPy arrays) for vector operations
  • Optimized distance calculations (Euclidean, Cosine)
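As an example of the kind of vectorized distance computation referred to above, the sketch below computes all pairwise squared Euclidean and cosine distances between two embedding matrices with matrix products instead of Python loops. The function names are hypothetical, and the README does not state which formulation the project uses. The Euclidean version relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b.

```python
import numpy as np


def pairwise_sq_euclidean(A, B):
    """Squared Euclidean distances between rows of A (n, d) and B (m, d)."""
    a2 = (A ** 2).sum(axis=1)[:, None]
    b2 = (B ** 2).sum(axis=1)[None, :]
    # Clip tiny negative values caused by floating-point cancellation.
    return np.maximum(a2 + b2 - 2.0 * (A @ B.T), 0.0)


def pairwise_cosine_distance(A, B, eps=1e-12):
    """Cosine distances (1 - cosine similarity) between rows of A and B."""
    # eps guards against division by zero for all-zero rows.
    A_n = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), eps)
    B_n = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), eps)
    return 1.0 - A_n @ B_n.T


A = np.array([[1.0, 0.0], [0.0, 2.0]])
B = np.array([[1.0, 0.0], [3.0, 4.0]])
print(pairwise_sq_euclidean(A, B))
print(pairwise_cosine_distance(A, B))
```

Both functions return an (n, m) matrix and avoid building the (n, m, d) intermediate array that a naive broadcasted subtraction would create.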

License

MIT License - Copyright (c) 2020 Devashish Gaikwad

See LICENSE file for full details.

About

Code for my bachelor's thesis. The published research paper can be found at https://link.springer.com/chapter/10.1007/978-981-33-6987-0_26
