Getting meaning out of scientific articles using nltk/gensim and scikit-learn

Aim:

Help researchers with the bibliography process. Don't miss interesting/relevant papers!

Idea:

Get meaning from the content of scientific articles and use it to classify new articles.

diagram

Tools:

IPython/Jupyter notebook, Python 2, Matplotlib, Gensim, Scikit-learn.

Data:

I used eLife Sciences articles found on GitHub, now stored in my elife-articles/ directory.

Project:

  1. I parsed the XML articles using the Beautiful Soup library.

Had a few unicode induced nightmares :-/ but I've been told it'll get better once I (finally) move to Python 3.

I chose to focus on articles only marked with the topic "Cell biology" or "Neuroscience" for my two categories A and B (see diagram above).
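A minimal sketch of this parsing and filtering step, written for Python 3. The sample XML is a made-up, heavily trimmed stand-in for an article file, and the tag layout (topics stored as `subject` elements inside a `subj-group` with `subj-group-type="heading"`) is an assumption about the eLife files, not something checked against them. The names `parse_article` and `single_category` are hypothetical, and reading "only marked with the topic" as "has exactly one topic" is also an assumption. The built-in `html.parser` is used to avoid an extra dependency; Beautiful Soup's `"xml"` parser (which needs lxml) is the more usual choice for XML.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for one article file.
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <article-categories>
        <subj-group subj-group-type="display-channel">
          <subject>Research article</subject>
        </subj-group>
        <subj-group subj-group-type="heading">
          <subject>Neuroscience</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>A made-up title</article-title>
      </title-group>
      <abstract><p>Neurons fire when stimulated.</p></abstract>
    </article-meta>
  </front>
</article>"""

CATEGORIES = {"Cell biology", "Neuroscience"}


def parse_article(xml_text):
    """Return (topics, title, abstract) for one article."""
    soup = BeautifulSoup(xml_text, "html.parser")
    # Assumed layout: topic names live in "heading" subject groups.
    topics = [
        subject.get_text(strip=True)
        for group in soup.find_all("subj-group",
                                   attrs={"subj-group-type": "heading"})
        for subject in group.find_all("subject")
    ]
    title = soup.find("article-title").get_text(strip=True)
    abstract = soup.find("abstract").get_text(" ", strip=True)
    return topics, title, abstract


def single_category(topics):
    """Return the category if it is the article's only topic, else None."""
    if len(topics) == 1 and topics[0] in CATEGORIES:
        return topics[0]
    return None


topics, title, abstract = parse_article(SAMPLE_XML)
print(single_category(topics), "|", title, "|", abstract)
```

Articles for which `single_category` returns `None` (several topics, or a topic outside the two categories) would simply be skipped.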

  2. I extracted terms/topics representative of each category.

I used LSI (Latent Semantic Indexing) first and was then advised to try LDA (Latent Dirichlet Allocation). For both models I used the Gensim library.

  3. I trained an NB (naive Bayes) classifier and a KNN (k-nearest neighbours) classifier on the data for the "Cell biology" and "Neuroscience" articles.

I tried to classify a new article based on the presence or absence of the terms the LSI model returned as most frequent.

There wasn't much difference between the NB and KNN classifiers.
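The presence/absence idea can be sketched with scikit-learn as follows (Python 3). The term list, the toy training articles, the helper name `presence_vector`, the Bernoulli variant of naive Bayes and `n_neighbors=3` are all assumptions chosen for illustration, not the settings used in the notebooks.

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical terms standing in for the ones the topic model returned.
TERMS = ["neuron", "synapse", "cortex", "membrane", "protein", "mitosis"]


def presence_vector(tokens, terms=TERMS):
    """Return 1 for each term that occurs in the article, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in terms]


# Hypothetical tokenised training articles with their categories.
train = [
    (["neuron", "synapse", "signal"], "Neuroscience"),
    (["cortex", "neuron", "mouse"], "Neuroscience"),
    (["synapse", "cortex", "protein"], "Neuroscience"),
    (["membrane", "protein", "cell"], "Cell biology"),
    (["mitosis", "protein", "division"], "Cell biology"),
    (["membrane", "mitosis", "neuron"], "Cell biology"),
]
X_train = [presence_vector(tokens) for tokens, _ in train]
y_train = [label for _, label in train]

# Train both classifiers on the same binary features.
nb = BernoulliNB().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Classify a new (made-up) article from its term presence vector.
new_article = ["synapse", "neuron", "imaging"]
x_new = [presence_vector(new_article)]
print("NB: ", nb.predict(x_new)[0])
print("KNN:", knn.predict(x_new)[0])
```

Because both classifiers see the same binary vectors, swapping one for the other only changes the two lines that build `nb` and `knn`.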

While the accuracy was quite high (> 80%), looking at the precision and recall showed that the predictions were biased towards the category with the most articles in the training set.
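To see how a high accuracy can hide this kind of bias, here is a small worked example in plain Python 3. The labels and counts are made up for illustration and are not the project's results; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision and recall for one label, from paired true/predicted lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up test set: 17 articles of category "A" and only 3 of category "B".
y_true = ["A"] * 17 + ["B"] * 3
# A classifier that leans towards "A": it finds only one of the three "B"s.
y_pred = ["A"] * 17 + ["A", "A", "B"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
for label in ("A", "B"):
    precision, recall = per_class_metrics(y_true, y_pred, label)
    print(label, "precision:", round(precision, 2), "recall:", round(recall, 2))
```

Here 18 of 20 predictions are correct, so the accuracy is 90%, yet the recall for "B" is only 1/3: two of the three "B" articles were assigned to the larger category.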

=> I need more data!

Slides:

This project was presented at the PyData London 2015 and PyCon UK 2015 conferences. Here are the latest slides.


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.