Text-Clustering

Text Clustering: Clusters sentences using a modified k-means clustering algorithm.

Advantage: The user does not need to specify the number of output clusters. The algorithm creates clusters based on the percentage of similarity between the sentences.
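The idea of letting a similarity threshold decide the number of clusters can be sketched as follows. This is a minimal illustration, not the code in this repository: it assumes cosine similarity, a mean vector as the cluster representative, and a single pass over the input. The function names (`cosine_similarity`, `mean_vector`, `threshold_cluster`) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    count = float(len(vectors))
    return [sum(component) / count for component in zip(*vectors)]


def threshold_cluster(vectors, threshold=0.80):
    """Single-pass clustering where the cluster count is not fixed in advance.

    Each vector joins the most similar existing cluster if that similarity
    is greater than `threshold`; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = []   # member indices per cluster
    centroids = []  # representative (mean) vector per cluster
    for index, vector in enumerate(vectors):
        best_cluster, best_score = None, threshold
        for cluster_id, centroid in enumerate(centroids):
            score = cosine_similarity(vector, centroid)
            if score > best_score:
                best_cluster, best_score = cluster_id, score
        if best_cluster is None:
            # Nothing is similar enough: open a new cluster.
            clusters.append([index])
            centroids.append(list(vector))
        else:
            # Join the best cluster and refresh its representative vector.
            clusters[best_cluster].append(index)
            members = [vectors[i] for i in clusters[best_cluster]]
            centroids[best_cluster] = mean_vector(members)
    return clusters


# Toy 2-D vectors standing in for 300-dimensional sentence vectors.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = threshold_cluster(toy_vectors, threshold=0.80)
print(clusters)
```

A higher threshold tends to produce more, smaller clusters, and a lower one fewer, larger clusters.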

Requirements:

Python: 2.7.8
gensim: 1.0.0

Execution:

Download the Google News word2vec pre-trained model file from:
- https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
Extract this file into the Text-Clustering source package so that the bin file is located at:
- Text-Clustering\GoogleNews-vectors-negative300.bin\GoogleNews-vectors-negative300.bin
Change to the source package directory "Text-Clustering".
Open config.ini and set the desired inputs as described below:
1. word2vec_model:
  - Path to the pre-trained word2vec model file (the .bin file downloaded in the first step above), which is used to convert sentences to vectors.
  - Currently, the Google News pre-trained vector model with 300 dimensions is used.
2. threshold:
  - Threshold value to be used for clustering.
  - If the similarity score of two sentences is greater than this threshold, they are considered similar; otherwise they are treated as different sentences.
  - Default threshold value is 0.80
3. input_file_path:
  - Path of input text file, containing sentences to be clustered.
4. output_dir_path:
  - Path of the directory where the output clusters are generated.
  - Default value is './output_clusters'
5. cluster_overlap:
  - Specifies whether cluster overlapping is allowed.
  - If set to True, then a sentence can be present in more than 1 cluster.
  - If set to False, then a sentence can be present only in 1 cluster.
  - Default value is: True
6. word_vector_dim:
  - Dimension of vector used for representing each input sentence.
  - Default value is 300. NOTE: if word_vector_dim is changed, a word2vec model trained with the matching dimension must be used.
7. representative_word_vector:
  - Specifies how the representative vector for each cluster is computed.
  - If "add": the representative vector for each cluster is computed by adding all sentence vectors in the cluster.
  - If "average": the representative vector for each cluster is computed by averaging all sentence vectors in the cluster.
  - Default value is 'average'.
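The two representative_word_vector modes can be illustrated with a short sketch. The function name `representative_vector` is hypothetical and this is not the repository's own implementation; it only shows the difference between summing and averaging the sentence vectors of one cluster.

```python
def representative_vector(sentence_vectors, mode="average"):
    """Combine the sentence vectors of one cluster into a single vector.

    mode="add"     -> component-wise sum of all vectors
    mode="average" -> component-wise sum divided by the number of vectors
    """
    if not sentence_vectors:
        raise ValueError("a cluster needs at least one sentence vector")
    summed = [sum(component) for component in zip(*sentence_vectors)]
    if mode == "add":
        return summed
    if mode == "average":
        count = float(len(sentence_vectors))
        return [value / count for value in summed]
    raise ValueError("mode must be 'add' or 'average'")


# Toy 2-D vectors standing in for 300-dimensional sentence vectors.
cluster_vectors = [[1.0, 2.0], [3.0, 4.0]]
added = representative_vector(cluster_vectors, mode="add")
averaged = representative_vector(cluster_vectors, mode="average")
print(added, averaged)
```

If the similarity measure is cosine similarity (an assumption, since the measure is not stated here), both modes yield vectors pointing in the same direction and therefore the same similarity scores; the choice matters for magnitude-sensitive measures.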
Once config.ini is updated, run the project with the following command:
- $ python main_executor.py
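For reference, a filled-in configuration covering the seven options above might look like the sample in the sketch below. The section name `[settings]`, the input file name, and the helper `read_settings` are assumptions made for illustration; check the config.ini shipped with the project for the actual layout. The sketch uses Python 3's `configparser` (the module is named `ConfigParser` in Python 2.7).

```python
import configparser

# Hypothetical config.ini contents; section name and file paths are placeholders.
SAMPLE_CONFIG = """\
[settings]
word2vec_model = ./GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin
threshold = 0.80
input_file_path = ./input/sentences.txt
output_dir_path = ./output_clusters
cluster_overlap = True
word_vector_dim = 300
representative_word_vector = average
"""


def read_settings(text):
    """Parse the options into typed values, applying the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["settings"]
    settings = {
        "word2vec_model": section.get("word2vec_model"),
        "threshold": section.getfloat("threshold", fallback=0.80),
        "input_file_path": section.get("input_file_path"),
        "output_dir_path": section.get("output_dir_path", fallback="./output_clusters"),
        "cluster_overlap": section.getboolean("cluster_overlap", fallback=True),
        "word_vector_dim": section.getint("word_vector_dim", fallback=300),
        "representative_word_vector": section.get(
            "representative_word_vector", fallback="average"
        ),
    }
    # Only the two documented modes are accepted.
    if settings["representative_word_vector"] not in ("add", "average"):
        raise ValueError("representative_word_vector must be 'add' or 'average'")
    return settings


settings = read_settings(SAMPLE_CONFIG)
print(settings)
```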

Ruchi2507/Text-Clustering
