Clustering PubMed abstracts about Alzheimer's Disease (AD)

Objective: Cluster PubMed abstracts about AD.

Data: Journal articles, including reviews, were retrieved using Biopython's query tools. Information from the 130,672 results was stored in MongoDB on AWS. Note: the example code in the Biopython docs has a bug when NCBI's history support is used. It is in section "9.15.2 Searching for and downloading abstracts using the history", where the while loop runs forever. My code in data_collection.ipynb works around this (in a hacky way).
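
For reference, here is a minimal sketch of one way to page through a history-backed search with a retry loop that is guaranteed to stop. It is not the code from data_collection.ipynb and does not reproduce the exact bug in the docs. The helper names (`fetch_with_retries`, `download_abstracts`), the batch size, and the retry settings are assumptions made for illustration; the `Entrez.esearch`, `Entrez.read`, and `Entrez.efetch` calls are from Biopython's `Bio.Entrez` module.

```python
import time
from typing import Callable, Iterator


def fetch_with_retries(fetch: Callable[[], str],
                       max_attempts: int = 3,
                       delay: float = 15.0) -> str:
    """Call fetch() until it succeeds or max_attempts is used up."""
    last_error = None
    for _ in range(max_attempts):   # bounded loop, so it cannot spin forever
        try:
            return fetch()          # success leaves the loop immediately
        except OSError as err:      # urllib's HTTPError and URLError subclass OSError
            last_error = err
            time.sleep(delay)       # wait before the next attempt
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error


def download_abstracts(term: str, email: str, batch_size: int = 500) -> Iterator[str]:
    """Yield MEDLINE-formatted text, one batch of records at a time."""
    from Bio import Entrez  # imported lazily so the helper above works without Biopython

    Entrez.email = email    # NCBI asks for a contact address
    handle = Entrez.esearch(db="pubmed", term=term, usehistory="y")
    search = Entrez.read(handle)
    handle.close()
    count = int(search["Count"])

    # range() ends the paging once every record offset has been visited.
    for start in range(0, count, batch_size):
        def fetch(start: int = start) -> str:
            handle = Entrez.efetch(db="pubmed", rettype="medline", retmode="text",
                                   retstart=start, retmax=batch_size,
                                   webenv=search["WebEnv"],
                                   query_key=search["QueryKey"])
            try:
                return handle.read()
            finally:
                handle.close()

        yield fetch_with_retries(fetch)
```

Both loops are `for` loops over a fixed range, so a failed or successful request can never leave the download stuck.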

Algorithm: Clustering (unsupervised learning) using KMeans.
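
As an illustration of this step, the sketch below clusters a handful of texts with scikit-learn's `KMeans`. The TF-IDF features, the function name `cluster_abstracts`, and all parameter values are assumptions; the notebooks may use different features and settings.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_abstracts(texts, n_clusters, random_state=0):
    """Return one cluster label per text, plus the fitted vectorizer and model."""
    # Turn each abstract into a sparse TF-IDF vector (assumed feature choice).
    vectorizer = TfidfVectorizer(stop_words="english")
    features = vectorizer.fit_transform(texts)

    # n_init restarts KMeans from several random seeds and keeps the best fit.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(features)
    return labels, vectorizer, model


if __name__ == "__main__":
    # Made-up example texts, not taken from the PubMed data.
    sample = [
        "amyloid beta plaques in mouse brain",
        "amyloid plaques and tau tangles",
        "tau tangles and amyloid beta deposition",
        "caregiver burden and nursing home care",
        "caregiver stress in nursing care",
        "nursing home caregiver support",
    ]
    labels, _, _ = cluster_abstracts(sample, n_clusters=2)
    print(labels)
```

For a corpus of this size, scikit-learn's `MiniBatchKMeans` is a drop-in alternative that trades some accuracy for speed.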

Evaluation: After clustering, topic modeling was run on both abstracts and titles, and a subset was visualized with t-distributed stochastic neighbor embedding (t-SNE).
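
The visualization step might look roughly like the sketch below, which projects a subset of texts to two dimensions with scikit-learn's `TSNE`. The TF-IDF features, the `TruncatedSVD` reduction before t-SNE, the function name `project_2d`, and the parameter values are all assumptions rather than the settings used in the notebooks. The topic modeling method is not specified here, so it is not shown.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def project_2d(texts, perplexity=30.0, svd_components=50, random_state=0):
    """Return an array of shape (len(texts), 2) with t-SNE coordinates."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

    # Reduce the sparse TF-IDF matrix to a small dense one first;
    # t-SNE is slow on very high-dimensional input.
    n_components = max(1, min(svd_components, min(tfidf.shape) - 1))
    dense = TruncatedSVD(n_components=n_components,
                         random_state=random_state).fit_transform(tfidf)

    # perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=random_state)
    return tsne.fit_transform(dense)
```

The returned coordinates can be passed to any scatter plot routine and colored by cluster label to check whether the clusters separate visually.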

Further Improvement: Improve topic modeling through better text preprocessing. Topic words like "alzheimers", "neurodegenerative", and "diseases" may be worth removing. The project could also be expanded to cover additional neurodegenerative diseases and used as the starting point for a focused recommender system.
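
A minimal sketch of the stop word idea, using only the standard library. The function name `preprocess`, the tokenization rule, and the singular forms "alzheimer" and "disease" in the stop list are assumptions added for illustration; only "alzheimers", "neurodegenerative", and "diseases" come from the note above.

```python
import re

# Words that appear in almost every AD abstract and so carry little signal.
DOMAIN_STOP_WORDS = {
    "alzheimer", "alzheimers", "neurodegenerative", "disease", "diseases",
}


def preprocess(text, stop_words=DOMAIN_STOP_WORDS):
    """Lowercase, tokenize, and drop domain-specific stop words."""
    # Remove apostrophes so "Alzheimer's" becomes "alzheimers".
    text = text.lower().replace("'", "").replace("\u2019", "")
    # Keep tokens of two or more characters; hyphenated terms stay whole.
    tokens = re.findall(r"[a-z][a-z0-9-]+", text)
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    print(preprocess("Alzheimer's disease is a neurodegenerative disease "
                     "marked by amyloid-beta plaques."))
```

General English stop words (such as "is" and "by") are left in place here; a vectorizer's built-in stop list can handle those separately.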

sfung11/Cluster_AD_Abstracts_Pubmed