
Aim:

Help researchers with the bibliography process. Don't miss interesting/relevant papers!

Idea:

Extract meaning from the content of scientific articles and use it to classify new articles.

diagram

Tools:

IPython/Jupyter notebook, Python 2, Matplotlib, Gensim, Scikit-learn.

Data:

I used eLife Sciences articles, which I found on GitHub and which now live in my elife-articles/ directory.

Project:

  1. I parsed the XML articles using the Beautiful Soup library.

Had a few Unicode-induced nightmares :-/ but I've been told it'll get better once I (finally) move to Python 3.

I chose to focus only on articles marked with the topic "Cell biology" or "Neuroscience" for my two categories A and B (see diagram above).
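A minimal sketch of what this parsing and filtering step could look like. It is not the project's actual code: the sample XML, the `subj-group-type="heading"` attribute used to find the topic, and the helper names `parse_article` and `single_category` are all assumptions made for illustration. It also uses Python's built-in `html.parser` backend (which lowercases tag names) so that no extra XML parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened article; real eLife files are much larger.
SAMPLE_XML = """
<article>
  <front>
    <article-meta>
      <article-categories>
        <subj-group subj-group-type="display-channel"><subject>Research article</subject></subj-group>
        <subj-group subj-group-type="heading"><subject>Neuroscience</subject></subj-group>
      </article-categories>
      <title-group><article-title>A made-up title</article-title></title-group>
    </article-meta>
  </front>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</article>
"""

CATEGORIES = {"Cell biology", "Neuroscience"}


def parse_article(xml_text):
    """Return the title, topic headings and body text of one article."""
    soup = BeautifulSoup(xml_text, "html.parser")
    # Assumption: topics are stored in subj-group elements of type "heading".
    topics = [
        subject.get_text(strip=True)
        for group in soup.find_all("subj-group", attrs={"subj-group-type": "heading"})
        for subject in group.find_all("subject")
    ]
    title_tag = soup.find("article-title")
    title = title_tag.get_text(strip=True) if title_tag else ""
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return {"title": title, "topics": topics, "text": text}


def single_category(article):
    """Return the category if the article has exactly one topic and it is
    one of the two categories of interest, otherwise None."""
    topics = article["topics"]
    if len(topics) == 1 and topics[0] in CATEGORIES:
        return topics[0]
    return None


article = parse_article(SAMPLE_XML)
print(article["title"], "->", single_category(article))
```

Articles carrying both topics, or neither, get `None` here and would be left out of the training set.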

  1. I extracted terms/topics representative of each category.

I used LSI (Latent Semantic Indexing) first and was then recommended to try LDA (Latent Dirichlet Allocation). For both models I used the Gensim library.

  1. I trained an NB (naive Bayes) classifier and a KNN (k-nearest neighbours) classifier on the data for the "Cell biology" and "Neuroscience" articles.

I tried to classify a new article based on the presence or absence of the terms that the LSI model returned as most frequent.

There wasn't much difference between the NB and KNN classifiers.

While the accuracy was quite high (> 80%), looking at the precision and recall showed that the predictions were biased towards the category with the most articles in the training set.
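The presence/absence idea can be sketched with scikit-learn as follows. Everything concrete here is an assumption for illustration: the term list stands in for the terms returned by LSI, the training sentences are invented, and `BernoulliNB` and `n_neighbors=3` are simply reasonable choices for binary features, not necessarily the ones used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical terms standing in for the most frequent terms from the LSI model.
terms = ["actin", "membrane", "mitosis", "neuron", "synapse", "axon"]

# Invented training texts, three per category.
train_docs = [
    "actin filaments reshape the membrane",
    "mitosis requires actin remodelling",
    "membrane trafficking during mitosis",
    "each neuron forms a synapse",
    "the axon of a neuron",
    "synapse strength along the axon",
]
train_labels = ["Cell biology"] * 3 + ["Neuroscience"] * 3

# binary=True records only whether a term is present (1) or absent (0).
vectorizer = CountVectorizer(vocabulary=terms, binary=True)
X_train = vectorizer.fit_transform(train_docs)

classifiers = {
    "NB": BernoulliNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

new_doc = ["a neuron extends its axon toward the synapse"]
X_new = vectorizer.transform(new_doc)

predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    predictions[name] = str(clf.predict(X_new)[0])

print(predictions)
```

On data this cleanly separated both classifiers agree, which is the easy case; real articles share far more vocabulary across categories.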
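To see why a high accuracy can hide this kind of bias, here is a small worked example in plain Python. The labels are made up (17 articles in one category, 3 in the other) and are not the project's results; the helper name `per_class_precision_recall` is also just for illustration.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label lists."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up, imbalanced test set: 17 "Neuroscience" vs 3 "Cell biology".
y_true = ["Neuroscience"] * 17 + ["Cell biology"] * 3
# A biased classifier that always predicts the majority category.
y_pred = ["Neuroscience"] * 20

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_precision_recall(y_true, y_pred)

print("accuracy:", accuracy)
for label, (precision, recall) in scores.items():
    print(label, "precision:", precision, "recall:", recall)
```

The accuracy comes out at 0.85 even though not a single "Cell biology" article is recognised (recall 0.0 for that category). scikit-learn's `classification_report` gives the same kind of per-class breakdown.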

=> I need more data!

Slides:

This project was presented at PyData London 2015 and PyCon UK 2015 conferences. Here are the latest slides.


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
