# Broad Topic Classification

While some of the research outputs supplied by users could be identified and therefore programmatically classified into the two broad categories of "Life Sciences" and "Earth Sciences", there were far more that could not be identified. In order to make an estimate of the types of projects being completed, it is necessary to attempt to classify these remaining outputs.

This notebook details an attempt to train a Support Vector Machine (SVM) classifier and a Naive Bayes classifier in order to categorise the outputs based on their titles.

In [1]:
import sqlitedict

Get a list of journals represented by the identified outputs:

In [2]:
metadata = sqlitedict.SqliteDict('../synth/data/doi_metadata.db')
journal_list = []
for k, v in metadata.items():
    issn = v.get('ISSN', [])
    journal_list += issn
journal_list = list(set(journal_list))

for j in journal_list[:3]:
    print(j)

1572-9699
2032-3913
1026-2296


Scrape the ASJC data from this page: https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus

In [3]:
from bs4 import BeautifulSoup
import requests

response = requests.get('https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus')
page = BeautifulSoup(response.content)
table_rows = page.find('table').find_all('tr')[1:]
all_asjc = [tuple([cell.text for cell in row.find_all('td')]) for row in table_rows]
asjc = {x[0]: x[2] for x in all_asjc}

for x in all_asjc[:3]:
    print(x)

for x in list(asjc.items())[:3]:
    print(x)

('1000', 'Multidisciplinary', 'Multidisciplinary')
('1100', 'General Agricultural and Biological Sciences', 'Life Sciences')
('1101', 'Agricultural and Biological Sciences (miscellaneous)', 'Life Sciences')
('1000', 'Multidisciplinary')
('1100', 'Life Sciences')
('1101', 'Life Sciences')


Categorise each journal based on the subjects it's tagged with on CrossRef.

In [4]:
from crossref.restful import Etiquette, Journals
from collections import Counter
import json
import os

# multithreading speeds the download process up
from concurrent.futures import ThreadPoolExecutor
from tqdm.contrib.concurrent import thread_map

etiquette = Etiquette('SYNTH transform', '0.1', 'https://github.com/NaturalHistoryMuseum/synth_transform',
                      'data@nhm.ac.uk')
journal_api = Journals(etiquette=etiquette)

if os.path.exists('journals.json'):
    with open('journals.json', 'r') as f:
        all_issns = json.load(f)
else:
    all_issns = []

    def get_journal(issn):
        journal = journal_api.journal(issn)
        if journal is None:
            return
        subjects = journal.get('subjects', [])
        top_level_subjects = Counter([asjc.get(str(s['ASJC'])) for s in subjects])

        # for each category, make sure there's no overlap between subjects
        if top_level_subjects['Life Sciences'] > 0 and top_level_subjects['Physical Sciences'] == 0:
            all_issns.append((issn, 'life'))
        elif top_level_subjects['Physical Sciences'] > 0 and top_level_subjects['Life Sciences'] == 0:
            all_issns.append((issn, 'earth'))
        elif len(top_level_subjects) > 0 and top_level_subjects['Life Sciences'] == 0 and top_level_subjects['Physical Sciences'] == 0:
            all_issns.append((issn, 'other'))

    with ThreadPoolExecutor(10) as thread_executor:
        thread_map(get_journal, journal_list)
        
    with open('journals.json', 'w') as f:
        json.dump(all_issns, f)
        
print(len([j for j in all_issns if j[1] == 'life']))
print(len([j for j in all_issns if j[1] == 'earth']))
print(len([j for j in all_issns if j[1] == 'other']))

479
177
45


Get a sample of articles from each journal, attempting to ignore irrelevant results such as front/back matter, tables of contents, etc.

In [5]:
import os
import re


if os.path.exists('titles.json'):
    # load from json file if possible because downloading new results will take quite a while
    with open('titles.json', 'r') as f:
        work_titles = json.load(f)
else:
    work_titles = {
        'earth': [],
        'life': [],
        'other': []
    }
    
    ignore = [re.compile(i) for i in 
              ['(front|back) matter',
               'special issue',
               'price\W',
               '(volume|issue) \d']
             ]
    
    def iter_works(issn, add_to):
        for attempt in range(3):
            try:
                works = list(journal_api.works(issn).sample(100))
                break
            except json.decoder.JSONDecodeError:
                works = []
                continue
        for work in works:
            title = work.get('title')
            if title is None or len(title) == 0:
                continue
            title = title[0].lower()
            if len(title.split(' ')) < 5:
                # ignore it if it has fewer than 5 words in the title - these are usually not articles
                continue
            if any([rgx.search(title) is not None for rgx in ignore]):
                continue
            work_titles[add_to].append(title)
        
    
    with ThreadPoolExecutor(10) as thread_executor:
        thread_map(lambda x: iter_works(*x), all_issns)
        
    with open('titles.json', 'w') as f:
        json.dump(work_titles, f)


Transform the titles into data that can be used to train the classifier by:
1. removing punctuation (except hyphens);
2. discarding words that aren't nouns or adjectives;
3. stemming words so that e.g. "geology" and "geological" are both counted as the same word;
4. discarding the most frequent words.

In [6]:
import pandas as pd
from nltk.stem.porter import PorterStemmer
import spacy

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
stemmer = PorterStemmer()

no_punct_rgx = re.compile(r'[^a-z- ]')
en_em_dash_rgx = re.compile(r'\s-\s')


if os.path.exists('training_data.csv'):
    # again, read from a file if available because this might take a while
    df = pd.read_csv('training_data.csv', index_col=0)
else:
    def process_texts(texts):
        token_lists = []

        def get_tokens(txt):
            txt = no_punct_rgx.sub(' ', txt.lower())
            txt = en_em_dash_rgx.sub(' ', txt)
            doc = nlp(txt)
            tokens = [stemmer.stem(token.text) for token in doc if token.pos_ in ['NOUN', 'ADJ'] and len(token.lemma_) > 1]
            token_lists.append(tokens)

        with ThreadPoolExecutor(10) as thread_executor:
            thread_map(get_tokens, texts)

        all_tokens = [t for sublist in token_lists for t in set(sublist)]
        most_common = [k for k, v in sorted(Counter(all_tokens).items(), key=lambda x: -x[1])][:20]
        print(most_common)

        output = [' '.join([token for token in doc if token not in most_common]) for doc in token_lists]

        return output


    labels = {'earth': 0,
              'life': 1,
              'other': 2}

    # transform the data into (title, label) tuples
    data = [(x, labels[k]) for k, v in work_titles.items() for x in v]
    df = pd.DataFrame(data, columns=['text', 'label'])
    df.text = process_texts(df.text)
    df = df.where(df != '')
    df = df.dropna(axis=0)
    df.to_csv('training_data.csv')
    
print(df.head())

                                            text  label
0               co enrich yield florunn cultivar      0
1  long term qualiti stabil assess cryosat- data      0
2            magnet boundari outer planet review      0
3               factor photocatalyt oxid ethylen      0
4  weather disturb low latitud low altitud model      0


Split the data into a training group and a testing group, then create a vectoriser to get a numerical representation of the text.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

training_data, test_data = train_test_split(df, test_size=0.2, stratify=df.label)

vectoriser = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000)
features = vectoriser.fit_transform(training_data.text)

print(features[:5])

  (0, 777)	0.39119454696540323
  (0, 591)	0.3626417838975834
  (0, 114)	0.6313649373251458
  (0, 175)	0.4012984656187292
  (0, 224)	0.394709539287537
  (1, 604)	0.5156354922310193
  (1, 133)	0.6357126146484683
  (1, 994)	0.5744471348422607
  (2, 895)	0.4348512077415644
  (2, 79)	0.41289966924863764
  (2, 597)	0.4047166439878694
  (2, 959)	0.3772342053171079
  (2, 611)	0.5782015934585747
  (3, 363)	1.0
  (4, 887)	1.0


Use those features to train a Support Vector Machine (SVM) classifier.

In [8]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from datetime import datetime as dt

classifier = SVC()
print('Fitting SVC...')
start = dt.now()
classifier.fit(features, training_data.label)
print(f'Done ({(dt.now() - start).total_seconds()})')

predicted = classifier.predict(vectoriser.transform(test_data.text))
print(accuracy_score(test_data.label, predicted) * 100)
print(classification_report(test_data.label, predicted))
print(confusion_matrix(test_data.label, predicted))

Fitting SVC...
Done (164.440875)
81.01289833080425
              precision    recall  f1-score   support

           0       0.83      0.68      0.74      3102
           1       0.81      0.95      0.87      6755
           2       0.62      0.04      0.08       687

    accuracy                           0.81     10544
   macro avg       0.75      0.56      0.56     10544
weighted avg       0.80      0.81      0.78     10544

[[2096  996   10]
 [ 330 6418    7]
 [ 107  552   28]]


Similarly, it can be used to train a Naive Bayes classifier.

In [9]:
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
print('Fitting Naive Bayes...')
start = dt.now()
nb_classifier.fit(features, training_data.label)
print(f'Done ({(dt.now() - start).total_seconds()})')

nb_predicted = nb_classifier.predict(vectoriser.transform(test_data.text))
print(accuracy_score(test_data.label, nb_predicted) * 100)
print(classification_report(test_data.label, nb_predicted))
print(confusion_matrix(test_data.label, predicted))

Fitting Naive Bayes...
Done (0.009734)
79.92223065250379
              precision    recall  f1-score   support

           0       0.83      0.63      0.72      3102
           1       0.79      0.95      0.86      6755
           2       0.58      0.04      0.08       687

    accuracy                           0.80     10544
   macro avg       0.73      0.54      0.55     10544
weighted avg       0.79      0.80      0.77     10544

[[2096  996   10]
 [ 330 6418    7]
 [ 107  552   28]]


Finally, use the best-performing classifier (the SVM classifier) to estimate the broad category of titles in the database.

In [10]:
from synth.model.analysis import Output
from synth.utils import Config, Context
import yaml
from sqlalchemy.orm import sessionmaker

with open('../config.yml', 'r') as f:
    config = Config(**yaml.safe_load(f))

context = Context(config)
session = sessionmaker(bind=context.target_engine)()

titles = [t[0] for t in session.query(Output.title).filter(Output.title.isnot(None)).all()]

for t in titles[:5]:
    print(t)

Molecular phylogeny within true bugs (Hemiptera: Miridae).
Gene-flow solid frozen - the roles of intrinsic and extrinsic factors on microevolution of Antarctic shelf fishes
Age and rate of speciation in the adaptive radiation of antarctic fishes (Trematominae)
Did glacial advances during the Pleistocene influence differently the demographic histories of benthic and pelagic Antarctic shelf fishes? – Inferences from intraspecific mitochondrial and nuclear DNA sequence diversity
Contribution to the Pupae of the Western Palearctic Tiger Moths (Lepidoptera, Noctuoidea, Arctiidae).


In [11]:
# use the same preprocessing as earlier, except without discarding common words

processed_titles = []

def get_tokens(txt):
    txt = no_punct_rgx.sub(' ', txt.lower())
    txt = en_em_dash_rgx.sub(' ', txt)
    doc = nlp(txt)
    tokens = [stemmer.stem(token.text) for token in doc if token.pos_ in ['NOUN', 'ADJ'] and len(token.lemma_) > 1]
    processed_titles.append(' '.join(tokens))

with ThreadPoolExecutor(10) as thread_executor:
    thread_map(get_tokens, titles)
    
for t in processed_titles[:5]:
    print(t)

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=9594.0), HTML(value='')))


true bug
contribut pupa western palearct moth noctuoidea
contribut descript pupa western noctuida
gene flow solid role intrins extrins factor microevolut antarct shelf fish
compar biogeograph histori old world forest bird phylogeni molecular date


In [12]:
transformed_titles = vectoriser.transform(processed_titles)
predictions = pd.Series(classifier.predict(transformed_titles))

predictions = predictions.replace(0, 'earth').replace(1, 'life').replace(2, 'other')

print(predictions.value_counts())

life     7722
earth    1848
other      24
dtype: int64


In [13]:
classified_titles = pd.DataFrame({'text': pd.Series(titles), 'label': predictions})

for title in classified_titles[classified_titles.label=='earth'].sample(5).text:
    print(title)
    
print('\n')

for title in classified_titles[classified_titles.label=='life'].sample(5).text:
    print(title)

print('\n')
    
for title in classified_titles[classified_titles.label=='other'].sample(5).text:
    print(title)

Molecular and morphological analysis of the Sagitta setosa complex with description of a new species
The Types of Anthomyiidae (Diptera) in the Museum für Naturkunde Berlin, Germany
Genetic and age relationship of base metal mineralization along the Periadriatic-Balaton Lineament system on the basis of radiogenic isotope studies
ENTHESOPATHIES AND PREHISTORIC HUMAN ACTIVITIES - Methodological approach and application to European Upper Palaeolithic and Mesolithic human fossils
The Portalón at Cueva Mayor (Sierra de Atapuerca, Spain): a new archaeological sequence.


The vertebral remains of the late Miocene great ape Hispanopithecus laietanus from Can Llobateres 2 (Vallès-Penedès Basin, NE Iberian Peninsula)
New data on the ground beetles (Coleoptera: Carabidae) of Serbia
The identity of the tropical African Polichne mukonja Griffini, 1908 (Orthoptera, Tettigoniidae, Phaneropterinae)
Notes on the Galerucini from India and Sri Lanka, with description of Pyrrhalta warchalowskii sp. nov. f