NLP ASSIGNMENT

Data preprocessing

Remove punctuation

Punctuations are tokenized as words so removing them
Remove digits

Not considering digit to be words

Document clustering

k-means clustering using tfidf of bigram of text as feature vector. Chose it as it is comparatively easier to understand, and implement but have good results.

Finding:

Most top bigrams were made of stop words so removing stop words from the text corpus will be better as it will give better insight to the data.

Problem encountered:

Spacy nlp parser can only parse text of length 1000000. Tried to increase the maximum length of the parser but it resulted in memory error ( for 16 GB ram). So I divided the text corpus to 6 parts having maximum length of 1000000.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
Data		Data
result		result
.gitignore		.gitignore
NLP Analysis.ipynb		NLP Analysis.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NLP ASSIGNMENT

Data preprocessing

Document clustering

About

Releases

Packages

Languages

BishalLakha/Text-Clustering

Folders and files

Latest commit

History

Repository files navigation

NLP ASSIGNMENT

Data preprocessing

Document clustering

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages