Analyse text using spacy and cluster it using kmeans

BishalLakha/Text-Clustering

NLP ASSIGNMENT

Data preprocessing

  1. Remove punctuation

    Punctuation marks are tokenized as words, so they are removed.

  2. Remove digits

    Digits are not considered words, so they are removed.
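
The two preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the function name `preprocess` is hypothetical, and it assumes ASCII punctuation and digits only.

```python
import string

# Translation table that deletes ASCII punctuation and digits.
# Assumption: non-ASCII punctuation does not need handling here.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def preprocess(text: str) -> str:
    """Remove punctuation and digits, then collapse leftover whitespace."""
    cleaned = text.translate(_DROP)
    return " ".join(cleaned.split())


print(preprocess("In 2019, spaCy parsed 3 files!"))
```

Note that this also strips apostrophes and hyphens inside words, which may or may not be what a given corpus needs.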

Document clustering

Documents are clustered with k-means, using the TF-IDF of the text's bigrams as the feature vector. k-means was chosen because it is comparatively easy to understand and implement while still giving good results.
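
A rough sketch of this setup using scikit-learn is shown below. The choice of scikit-learn, the toy documents, and the number of clusters are all assumptions made for illustration; the README does not state which library or parameters were used.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); the real corpus is not part of this README.
docs = [
    "machine learning models need training data",
    "deep learning models need training data too",
    "fresh bread needs flour water and yeast",
    "sourdough bread needs flour water and time",
]

# ngram_range=(2, 2) keeps bigrams only, so each feature is a word pair.
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)

# n_clusters=2 is an assumption chosen to suit the toy corpus.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, doc)
```

On a real corpus the number of clusters would need to be chosen separately, for example by comparing inertia across several values of `n_clusters`.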

Finding:

Most of the top bigrams consisted of stop words, so removing stop words from the text corpus should give better insight into the data.
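
The effect can be illustrated with a small bigram counter. This is only a sketch: the stop-word list is a tiny hypothetical one (spaCy and scikit-learn ship much fuller lists), and `top_bigrams` is an illustrative name, not a function from the repository.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not spaCy's list).
STOP_WORDS = {"the", "of", "in", "a", "is", "and", "to", "on"}


def top_bigrams(text, n=3, remove_stop_words=False):
    """Return the n most frequent bigrams as ((word1, word2), count) pairs."""
    tokens = text.lower().split()
    if remove_stop_words:
        # Stop words are dropped before pairing, so the resulting bigrams
        # may join words that were not adjacent in the original text.
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(zip(tokens, tokens[1:])).most_common(n)


text = "the cat sat on the mat and the cat sat in the sun"
print(top_bigrams(text))
print(top_bigrams(text, remove_stop_words=True))
```

Without filtering, most of the counted pairs contain "the", "on", or "in"; with filtering, the content-bearing pair ("cat", "sat") comes out on top.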

Problem encountered:

spaCy's nlp parser only accepts text up to 1,000,000 characters long. Increasing the parser's maximum length resulted in a memory error (on a machine with 16 GB of RAM), so I divided the text corpus into 6 parts, each at most 1,000,000 characters long.
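
One possible way to do such a split is sketched below. The helper `split_text` is hypothetical and not taken from the repository; it assumes that cutting at the last space before the limit is acceptable, and it falls back to a hard cut when a chunk contains no space.

```python
def split_text(text, max_len=1_000_000):
    """Split text into chunks of at most max_len characters.

    Chunks end at a space where possible so words are not cut in half.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Back up to the last space inside the window, if there is one.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        start = end
    return chunks


parts = split_text("alpha beta gamma delta", max_len=8)
print(parts)
```

Each chunk could then be parsed on its own, for example with `nlp(chunk)`, where `nlp` is assumed to be an already loaded spaCy pipeline. Joining the chunks back together reproduces the original text exactly.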
