Approaches:
- Semi-Supervised Keyword Extraction + Brown Clustering
- Semi-Supervised Keyword Extraction + DistilBERT Embeddings + HDBSCAN
- Clustering with K-means and hierarchical clustering didn't yield good results. Neither the elbow method nor silhouette scores helped in finding a suitable number of clusters.
- HDBSCAN was able to find a clean clustering, as shown in the image below. In practice, however, each problem can be attributed to multiple clusters, so Brown Clustering was applied to this problem.
- Each directory has its own model and output results.
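The embeddings + HDBSCAN approach can be sketched roughly as follows. This is a minimal sketch under assumptions, not the code in this repository: the model name distilbert-base-nli-mean-tokens, the UMAP and HDBSCAN parameter values, and the helper names cluster_texts and count_clusters are illustrative choices.

```python
def cluster_texts(texts, min_cluster_size=15):
    """Embed texts, reduce dimensionality with UMAP, then cluster with HDBSCAN.

    The heavy dependencies are imported lazily, so this module can be
    imported even when they are not installed.
    """
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # Assumed model name; another DistilBERT-based sentence model could be used.
    encoder = SentenceTransformer("distilbert-base-nli-mean-tokens")
    embeddings = encoder.encode(texts)

    # Assumed parameters: project to a few dimensions before density clustering.
    reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine")
    reduced = reducer.fit_transform(embeddings)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean")
    return clusterer.fit_predict(reduced).tolist()


def count_clusters(labels):
    """Count distinct clusters; HDBSCAN marks noise points with the label -1."""
    return len({label for label in labels if label != -1})
```

Unlike K-means, HDBSCAN is not given a number of clusters, so count_clusters(labels) reports how many it found rather than how many were requested.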
model.py
- The main file, containing the model and its implementation (for Brown Clustering).
A utils file containing the Brown Clustering implementation is also used. This code is partly taken from https://github.com/yangyuan/brown-clustering/
- Pickle files of the model objects are stored in their respective folders for further use.
(The full dependency list is in requirements.txt.)
- python 3.8
- nltk
- tqdm
- pandas
- pickle
- matplotlib
- yake :
pip install git+https://github.com/LIAAD/yake
- sentence-transformers :
pip install sentence-transformers
- hdbscan :
pip install hdbscan
- umap :
pip install umap-learn
python model.py
To check similar words / similarly clustered words for Brown Clustering:
clustering.get_similar('vibration')
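Because the fitted model objects are pickled, a later session could reload one and query it without retraining. The sketch below only illustrates that round trip: ToyClustering is a hypothetical stand-in (not the class in the utils file), its get_similar shows one plausible reading of "similarly clustered words", and save_model / load_model are illustrative helper names.

```python
import pickle


class ToyClustering:
    """Hypothetical stand-in for a fitted model; NOT the repository's class."""

    def __init__(self, word_to_cluster):
        self.word_to_cluster = dict(word_to_cluster)

    def get_similar(self, word):
        """Return the other words that share a cluster with the given word."""
        cluster = self.word_to_cluster.get(word)
        if cluster is None:
            return []
        return sorted(
            w for w, c in self.word_to_cluster.items() if c == cluster and w != word
        )


def save_model(model, path):
    """Serialize a model object to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a pickled model. Only unpickle files you created yourself."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

With the repository's own classes importable, a load_model call on one of the stored pickle files would return the fitted object, after which get_similar can be called on it as shown above.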
- Brown Clustering: clustered_Dataset.csv
The clusters column contains indices into cluster.csv, so each text can be attributed to multiple clusters.
- HDBSCAN + DistilBERT Embeddings: hdbscan_Dataset.csv
The clusters column assigns each text to one of 11 clusters. This number was not fixed in advance; it was found by HDBSCAN after UMAP dimensionality reduction.
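To illustrate how a single text can end up with several Brown clusters, here is a small sketch. It assumes that each extracted keyword belongs to one word cluster and that a text inherits the clusters of its keywords; the function name clusters_for_text and the sample mapping are hypothetical, not taken from the repository's output files.

```python
def clusters_for_text(keywords, word_to_cluster):
    """Return the sorted, de-duplicated cluster indices hit by a text's keywords."""
    return sorted({word_to_cluster[k] for k in keywords if k in word_to_cluster})


# Hypothetical word-to-cluster mapping, for illustration only.
example_mapping = {"vibration": 0, "noise": 0, "leak": 2}

# Two keywords share cluster 0 and one falls in cluster 2; unknown words are skipped.
example_clusters = clusters_for_text(["vibration", "noise", "leak", "pump"], example_mapping)
```

A hard clustering such as HDBSCAN would give the same text exactly one label, which is the practical difference between the two output files.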
- Testing new samples with BERT and XLNet embeddings.
- Evaluating the sanity of the clusters, which was not possible with the current sample size.