<a href="https://colab.research.google.com/github/AI-Growth-Lab/SciNerTopic/blob/main/notebooks/Sci_NERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sci-NERTopic

This notebook is a friendly demo of Sci-NERTopic, a topic-modeling technique tuned for the analysis of scientific documents with transformers. It is optimized to work with document abstracts, and you should be able to run it on a free Google Colab GPU instance. It is inspired by BERTopic and combines Sentence Transformers with UMAP and HDBSCAN. In addition, a fine-tuned Named Entity Recognition (NER) model (SciBERT retrained on the SciERC corpus) is used to extract various classes of scientific keywords (in this example `Issue` and `Method`) from the analysed abstracts.

The technique performs the following steps:

- Load up NER and SBERT models (from HuggingFace)
- Extract and aggregate NER keywords
- Embed all documents using SBERT
- Reduce dimensionality with UMAP
- Cluster embeddings with HDBSCAN
- Use c-TF-IDF with NER keywords
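
The last step can be illustrated with a small, dependency-free sketch. It assumes every document already has a topic label and a list of NER keywords, weights each keyword with a plain, unnormalised TF-IDF score, and sums the scores within each topic. The helper `rank_keywords` and the toy data are hypothetical; the notebook itself relies on gensim for this step, so exact scores will differ.

```python
import math
from collections import Counter, defaultdict

def rank_keywords(doc_keywords, topics, top_n=3):
    """Sum per-document TF-IDF weights within each topic and keep the top keywords."""
    n_docs = len(doc_keywords)
    # document frequency: in how many documents does each keyword occur?
    df = Counter(kw for kws in doc_keywords for kw in set(kws))
    scores = defaultdict(Counter)
    for kws, topic in zip(doc_keywords, topics):
        for kw, tf in Counter(kws).items():
            # keywords present in every document get weight log(1) = 0
            scores[topic][kw] += tf * math.log(n_docs / df[kw])
    return {t: [kw for kw, _ in c.most_common(top_n)] for t, c in scores.items()}

# toy corpus: keyword lists per document and one topic label per document
docs = [
    ["bert", "topic modeling"],
    ["bert", "clustering"],
    ["bert", "parsing"],
    ["bert", "parsing", "treebank"],
]
topics = [0, 0, 1, 1]
top = rank_keywords(docs, topics, top_n=2)
print(top)
```

Because "bert" occurs in every toy document, it carries no weight and does not describe either topic, which is the behaviour the keyword filtering later in the notebook aims for.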

In this notebook you can use a pre-extracted dataset (~1000 documents about NLP extracted from OpenAlex), explore another subject area on OpenAlex (documents are extracted through their API), or upload your own data (e.g. an export from Web of Science or Scopus).

In [None]:
#@title ##Install and import requirements

!pip install -U sentence-transformers simpletransformers umap-learn hdbscan -qqqq

from simpletransformers.ner import NERModel, NERArgs
from sentence_transformers import SentenceTransformer, util

import itertools
import io
from google.colab import files

import requests, json
import math

import numpy as np
import pandas as pd
import altair as alt

import umap
import hdbscan

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.matutils import corpus2csc, corpus2dense

import logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)


def kwReshaper(column, end_tags):
  """
  column = pd.column or iterable where ents are stored
           (one list of (token, tag) pairs per document)
  end_tags = tuple of entity types to keep, e.g. ('Method', 'Task')
  Extracts keywords from CoNLL-style BIO tags
  as produced by SciBERT
  """
  if isinstance(end_tags, str):
    end_tags = (end_tags,)  # accept a single entity type as well
  extracted_kws = []
  for ents in column:
    keywords = []
    term = []
    for token, tag in ents:
      wanted = tag.endswith(end_tags)
      # a new B- tag or a tag outside the selection closes the current term
      if term and (tag.startswith('B') or not wanted):
        keywords.append(' '.join(term))
        term = []
      if wanted:
        term.append(token.strip(',.'))
    # flush a term that runs to the end of the abstract
    if term:
      keywords.append(' '.join(term))
    extracted_kws.append(keywords)
  return extracted_kws


!mkdir -p outputs

## Selecting Transformer Models

In [None]:
#@title ###Select and load NER model

#@markdown `pretrained_ner_model_name_or_path` (huggingface or local). For this NER model, please keep the labels as specified.
pretrained_ner_model_name_or_path = "RJuro/SciNERTopic" #@param {type:"string"}
ner_labels = ['B-Material', 'O', 'B-OtherScientificTerm', 'I-OtherScientificTerm', 'B-Generic', 'B-Method', 'I-Method', 'B-Task', 'I-Task', 'I-Material', 'B-Metric', 'I-Metric', 'I-Generic'] #@param

model = NERModel(
    "bert", pretrained_ner_model_name_or_path, labels=ner_labels
)

In [None]:
#@title ###Select and load sentence transformer model

#@markdown `pretrained_sbert_model_name_or_path` (huggingface or local). We recommend `allenai-specter` for scientific documents. Consider `AI-Growth-Lab/PatentSBERTa` for patent documents. [Read more here](https://www.sbert.net/docs/pretrained_models.html) about available pretrained models.
pretrained_sbert_model_name_or_path = "allenai-specter" #@param {type:"string"}


model_st = SentenceTransformer(pretrained_sbert_model_name_or_path)

## Loading Data and Selecting Key Columns

This notebook accepts any tabular data loaded as a `pandas.DataFrame`.
You need one column with the documents to analyse, e.g. `Abstract`. Additionally, consider having columns with the `Title` and publication `Year` of your documents.

In [None]:
# Load remote file: a dataframe of 247 publication records on NLP research from OpenAlex
data = pd.read_csv('https://raw.githubusercontent.com/AI-Growth-Lab/SciNerTopic/main/data/nlp_openalex.csv')

In [None]:
#@title Load Data from OpenAlex

#@markdown You can check out the list of concepts with their IDs [here](https://docs.google.com/spreadsheets/d/1LBFHjPt4rj_9r0t0TTAlT68NwOtNH8Z21lBMsJDMoZg/edit#gid=575855905), e.g., NLP c204321447 
# specify endpoint
endpoint = 'works'
concept = "'c204321447'" #@param {type:"string"}
oa = True #@param {type:"boolean"}
nDocs = 2302 #@param {type:"slider", min:200, max:3000, step:1}
from_pub_date = "2017-01-01" #@param {type:"date"}
#@markdown Enter your email for API call to OpenAlex
email = 'test@test.com'#@param {type:"string"} 


def OA(oa):
  # translate the boolean form field into the lowercase string used in the filter
  if oa:
    return 'true'
  else:
    return 'false'



oa_str = OA(oa)

# build the 'filter' parameter
filters = ",".join((
    f'concepts.id:{concept}',
    'is_paratext:false', 
    f'from_publication_date:{from_pub_date}',
    f'is_oa:{oa_str}'
))

# put the URL together
filtered_works_url = f'https://api.openalex.org/{endpoint}?mailto={email}&filter={filters}'
print(f'complete URL with filters:\n{filtered_works_url}')


paging_param = 'per-page=100&cursor=*'

works_query = f'{filtered_works_url}&{paging_param}'

response = requests.get(works_query)
meta = json.loads(response.text)['meta']
next_cursor = meta['next_cursor']
results_alx = json.loads(response.text)['results']


cycles = math.floor((meta['count'] - 100) / meta['per_page'])+1
if cycles > 30:
  cycles = int(nDocs/100)

df_input = []

def rebuild_abstract(inverted_index):
  # OpenAlex stores abstracts as {word: [positions]}; sorting by position restores the word order
  positions = [(pos, word) for word, poss in inverted_index.items() for pos in poss]
  return ' '.join(word for _, word in sorted(positions))

def collect(results):
  # keep only works that come with an abstract
  for result in results:
    if result['abstract_inverted_index']:
      abstract = rebuild_abstract(result['abstract_inverted_index'])
      df_input.append((result['id'], result['doi'],result['title'],result['publication_year'],abstract))

collect(results_alx)

for cycle in range(cycles):
  if not next_cursor:  # no further pages available
    break
  # replace the trailing '*' of the first query with the cursor of the next page
  cycle_query = f'{works_query[:-1]}{next_cursor}'
  response = requests.get(cycle_query)
  meta = json.loads(response.text)['meta']
  next_cursor = meta['next_cursor']
  results_alx = json.loads(response.text)['results']
  collect(results_alx)


data = pd.DataFrame(df_input, columns=['id','doi','title','publication_year','abstract'])

print(f'Downloaded {str(len(data))} documents')

data.head()

In [None]:
#@title Load local file (optional)
#@markdown Run this cell to upload a local CSV file for analysis. Files will be deleted after a restart of the notebook.
#@markdown Specify the separator used in your CSV.
loc_sep = "," #@param {type:"string"}

# files.upload() returns a dict {filename: file content as bytes}
uploaded = files.upload()
key = list(uploaded.keys())[0]
data = pd.read_csv(io.BytesIO(uploaded[key]), sep=loc_sep)

data.info()

In [None]:
#Inspect DF fields

data.info()

In [None]:
#@title ### Select columns containing the text corpus, title and years

#@markdown Dataframe column containing the text
txt_col = "abstract" #@param {type:"string"}

#@markdown ### Optional
#@markdown Dataframe column containing a title
tit_col = "title" #@param {type:"string"}

#@markdown Dataframe column containing publication year
y_col = "publication_year" #@param {type:"string"}

#@markdown Dataframe column containing citation counts
cit_col = "Cited by" #@param {type:"string"}



In [None]:
#@markdown Extract and append keywords categories

data.dropna(subset=[txt_col], inplace=True)
data.index = range(len(data))

ixs = []
txts = []

for txt in data[txt_col]:
  sents = sent_tokenize(txt)
  ixs.append(len(sents))
  txts.extend(sents)

p, r = model.predict(txts)

joiner = (_ for _ in p)

abstracts = []
n_predictions = []

for i in ixs:
  pred_1 = [next(joiner) for _ in range(i)]
  pred_1 = list(itertools.chain(*pred_1))
  abstracts.append(list(itertools.chain(*[pr.items() for pr in pred_1])))

data['ents'] = abstracts
data.index = range(len(data))

extracted_kws = kwReshaper(data.ents, ('Method','OtherScientificTerm','Task'))
extracted_kw_M = kwReshaper(data.ents, ('Method',))
extracted_kw_T = kwReshaper(data.ents, ('Task',))
extracted_kw_O = kwReshaper(data.ents, ('OtherScientificTerm',))

data['ner-keywords'] = extracted_kws
data['ner-keywords-Method'] = extracted_kw_M
data['ner-keywords-Task'] = extracted_kw_T
data['ner-keywords-OtherSci'] = extracted_kw_O
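
The cell above sends all sentences to the NER model in one flat batch and then regroups the predictions per abstract using the stored sentence counts. A minimal sketch of that regrouping idea, using placeholder strings instead of model predictions; `regroup` is a hypothetical helper and not part of simpletransformers:

```python
from itertools import islice

def regroup(flat_items, counts):
    """Split a flat sequence back into consecutive groups of the given sizes."""
    it = iter(flat_items)
    return [list(islice(it, n)) for n in counts]

# three abstracts with 2, 1 and 3 sentences respectively
sentence_counts = [2, 1, 3]
flat_predictions = ["a1", "a2", "b1", "c1", "c2", "c3"]

grouped = regroup(flat_predictions, sentence_counts)
print(grouped)
```

Batching all sentences into one call keeps the model busy with a single large prediction job, while the counts make it possible to map every prediction back to its source document.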

In [None]:
#@markdown Create text embeddings 
embeddings = model_st.encode(data[txt_col].tolist(), convert_to_tensor=True, show_progress_bar=True)

In [None]:
#@markdown parameters for the topic model


#@markdown [n_neighbors](https://umap-learn.readthedocs.io/en/latest/parameters.html#n-neighbors) controls how UMAP balances local versus global structure in the data. Low values of n_neighbors will force UMAP to concentrate on very local structure, which in turn will lead to a higher number of specific topics but potentially also higher numbers of unassigned observations. Consider increasing this parameter with corpus size.
n_neighbors = 5 #@param {type:"slider", min:2, max:25, step:1}

#@markdown [min_cluster_size](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#selecting-min-cluster-size) controls the minimum size of topics identified by the HDBSCAN clustering. Higher values may be useful for larger corpus sizes. However, they can lead to higher numbers of unassigned values.
min_cluster_size = 15 #@param {type:"slider", min:2, max:100, step:1}

#@markdown [min_samples](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#selecting-min-samples) The larger the value of `min_samples` you provide, the more conservative the clustering – more points will be declared as noise. Thus lower numbers will result in fewer unassigned documents, but potentially less precisely defined topics. 
min_samples = 5 #@param {type:"slider", min:1, max:5, step:1}
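
HDBSCAN marks documents it cannot assign to any cluster with the label `-1`, so the effect of these parameters can be checked by counting that label after clustering. A minimal sketch on a hypothetical list of labels (in the notebook the labels are available as `clusterer_abs.labels_` once clustering has run); `noise_share` is an illustrative helper, not part of hdbscan:

```python
from collections import Counter

def noise_share(labels):
    """Return the fraction of unassigned documents (label -1) and the size of each topic."""
    counts = Counter(labels)
    share = counts.get(-1, 0) / len(labels)
    sizes = {topic: n for topic, n in counts.items() if topic != -1}
    return share, sizes

labels = [0, 0, 1, -1, 1, 1, -1, 0]  # toy cluster labels
share, sizes = noise_share(labels)
print(f"unassigned: {share:.0%}, topic sizes: {sizes}")
```

If the unassigned share is high, lowering `min_samples` or `min_cluster_size` and re-running the clustering is one way to trade precision for coverage.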

In [None]:
#@markdown Execute dimensionality reduction and clustering

kws_lab = ['ner-keywords', 'ner-keywords-Method', 'ner-keywords-Task']
eid = []
df_outs = []


umap_reducer_abs = umap.UMAP(random_state=42, n_components=2, n_neighbors=n_neighbors)
embeddings_abs = embeddings.detach().cpu().numpy()
embeddings_abs_red = umap_reducer_abs.fit_transform(embeddings_abs)

for lab in kws_lab:
  data[lab] = data[lab].map(lambda t: [x.lower() for x in t])

clusterer_abs = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)
clusterer_abs.fit(embeddings_abs_red)
clusters_col = list(set(clusterer_abs.labels_))

data['topic'] = clusterer_abs.labels_
cluster_share = data['topic'].value_counts(normalize=True)
cluster_share = [cluster_share[i] for i in clusters_col]

cluster_size = data['topic'].value_counts(normalize=False)
cluster_size = [cluster_size[i] for i in clusters_col]

In [None]:
#@markdown Keyword Filtering and Number of Keywords


#@markdown How often a keyword must appear overall (number of documents) to be considered for topic labeling
no_below = 3 #@param {type:"slider", min:2, max:100, step:1}

#@markdown Maximum share of documents in the corpus that may contain a keyword. General keywords appearing in too many documents will be discounted but may introduce ambiguity. Lower values lead to fewer general keywords.
no_above = 0.1 #@param {type:"slider", min:0.1, max:0.9, step:0.05}

#@markdown Cut-off number of top keywords to keep as topic descriptors
n_kws = 10 #@param {type:"slider", min:3, max:100, step:1}
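
The two thresholds act on document frequency, i.e. the number of documents a keyword appears in. The sketch below approximates the documented behaviour of gensim's `Dictionary.filter_extremes` in plain Python; gensim's exact rounding of the upper bound may differ, and `filter_keywords` is a hypothetical helper used only for illustration:

```python
from collections import Counter

def filter_keywords(doc_keywords, no_below=2, no_above=0.5):
    """Drop keywords found in fewer than no_below documents or in more than
    a no_above share of all documents."""
    n_docs = len(doc_keywords)
    # document frequency: count each keyword once per document
    df = Counter(kw for kws in doc_keywords for kw in set(kws))
    max_docs = no_above * n_docs
    keep = {kw for kw, n in df.items() if no_below <= n <= max_docs}
    return [[kw for kw in kws if kw in keep] for kws in doc_keywords]

docs = [
    ["model", "bert"],
    ["model", "bert"],
    ["model", "parsing"],
    ["model", "lstm"],
]
filtered = filter_keywords(docs, no_below=2, no_above=0.5)
print(filtered)
```

In this toy corpus "model" is too general (it appears in every document) and "parsing" and "lstm" are too rare, so only "bert" survives.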

In [None]:
#@markdown Rank NER keywords and display topic stats

top_kw = []
for lab in kws_lab:
  # Generate a dictionary and filter
  dictionary = Dictionary(data[lab])
  dictionary.filter_extremes(no_below=no_below, no_above=no_above)

  # construct corpus using this dictionary
  corpus = [dictionary.doc2bow(word) for word in [doc for doc in data[lab]]]
  # Create and fit a new TfidfModel using the corpus: tfidf
  tfidf = TfidfModel(corpus)
  # transform corpus to TFIDF
  corpus_tfidf = tfidf[corpus]
  # Let's check out the topics by getting "top-tfidf" for the different clusters (and we need to transpose)
  tfidf_matrix = corpus2dense(corpus_tfidf, len(dictionary)).T

  top_kw_i = []
  for i in clusters_col:
    cluster_index = data[data['topic'] == i].index
    topk = np.flip(np.argsort(np.sum(tfidf_matrix[cluster_index,:], axis=0)))[:n_kws]
    top_kw_i.append([dictionary[x] for x in topk])
    #print(str(i) + str([dictionary[x] for x in topk]))
  top_kw.append(top_kw_i)

df_out = pd.DataFrame(zip(clusters_col, cluster_size, cluster_share, top_kw[0],top_kw[1],top_kw[2]))
df_out.columns = ['topic','topic_size', 'topic_size_pct', 'top_kw_all', 'top_kw_method', 'top_kw_issue']


topic_centers = []
for topic in df_out['topic'].values:
  t_ix = data[data['topic']==topic].index
  topic_centers.append(np.median(embeddings_abs_red[t_ix], axis=0))
topic_centers = np.vstack(topic_centers)


emb_df = pd.DataFrame(np.hstack([clusterer_abs.labels_.reshape(-1,1),embeddings_abs_red]), columns=['topic','x','y'])
topic_centers_df = pd.DataFrame(topic_centers, columns=['x_med','y_med'])
topic_centers_df['topic'] = df_out['topic'].values
emb_df = pd.merge(emb_df,topic_centers_df, how='left')
emb_df['dist_topic_cent'] = emb_df.apply(lambda t: np.linalg.norm(np.array([t['x'],t['y']]) - np.array([t['x_med'],t['y_med']])), axis=1)

data['dist_topic_cent'] = emb_df['dist_topic_cent']

data.to_csv('outputs/data.csv', index=None)
df_out.to_csv('outputs/df_out.csv', index=None)

df_out
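
The cell above also stores `dist_topic_cent`: the Euclidean distance of each document from the median point of its topic in the reduced 2D space. A minimal, dependency-free sketch of that calculation on toy coordinates; `distances_to_topic_median` is a hypothetical helper, and the notebook computes the same quantity with NumPy and pandas:

```python
import math
from statistics import median

def distances_to_topic_median(points, topics):
    """Distance of every 2D point to the coordinate-wise median of its topic."""
    centers = {}
    for topic in set(topics):
        members = [p for p, t in zip(points, topics) if t == topic]
        centers[topic] = (median(p[0] for p in members),
                          median(p[1] for p in members))
    return [math.dist(p, centers[t]) for p, t in zip(points, topics)]

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
topics = [0, 0, 0, 1, 1]
dists = distances_to_topic_median(points, topics)
print(dists)
```

Documents with the smallest distance sit closest to the centre of a topic and tend to be its most typical members, which is what the "closest article titles" cell further down relies on.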

In [None]:
#@markdown Plot Topic-Map
df_out['desc'] = [', '.join(t) for t in df_out['top_kw_all']]
df_out['x'] = topic_centers[:,0]
df_out['y'] = topic_centers[:,1]

df_out_viz = df_out[~df_out['topic'].isin([0,-1])]

# plot
chart_map = alt.Chart(df_out_viz).mark_circle(size=60).encode(
    x='x',
    y='y',
    size=alt.Size("topic_size_pct:Q", scale=alt.Scale(range=[5/ df_out_viz.topic_size_pct.min(), 1000/ df_out_viz.topic_size_pct.max()]), title='Topic Size (share)'),
    tooltip=[alt.Tooltip('topic:Q', title='Topic N'),
        alt.Tooltip('desc:O', title='Top NER Keywords'),
        alt.Tooltip('topic_size_pct:Q', title='Topic Size (share)')]
).properties(
    title=f'NERTopic-mapplot - {len(df_out_viz)} Topics',
    width=800,
    height=600
).interactive()

chart_map.save('outputs/map_plot.html')

chart_map

In [None]:
#@markdown print 10 closest article titles wrt topic

top_p = 1 #@param {type:"number"}

for title in data.query(f'topic == {top_p}').sort_values('dist_topic_cent')[tit_col][:10]:
  print(title + '\n')

In [None]:
#@markdown print title of 10 most cited articles within topic

top_p = 1 #@param {type:"number"}

for row in data.query(f'topic == {top_p}').sort_values(cit_col, ascending=False)[:10].iterrows():
  print(row[1][tit_col]+ '; N Citations ' + str(row[1][cit_col]) + '\n')

In [None]:
#@markdown Plot topics over time (punchcard plot)

#@markdown Plot width
dyn_p_width = 800 #@param {type:"number"}

ctab1 = pd.crosstab(data['topic'], data[y_col], normalize='columns').stack().reset_index()
ctab1.columns = ['topic','year','YearlyShare']
ctab2 = pd.crosstab(data['topic'], data[y_col], normalize='index')
ctab2 = ctab2.stack().reset_index()
ctab2.columns = ['topic','year','TopicShare']
ctab2['YearlyShare'] = ctab1['YearlyShare']
ctab2 = ctab2.merge(df_out[['topic','top_kw_all','topic_size_pct', 'topic_size']], left_on='topic', right_on='topic')
ctab2['topic_size_pct'] = ctab2['topic_size_pct']*100
ctab2['desc'] = [', '.join(t) for t in ctab2['top_kw_all']]

dynamic_chart = alt.Chart(ctab2).mark_point(filled=True).encode(
    y=alt.Y('topic:O',title='Topic Number'),
    x=alt.X('year:O', title='Year'),
    color=alt.Color('max(YearlyShare):Q', scale=alt.Scale(scheme="inferno")),
    size=alt.Size("TopicShare:Q", scale=alt.Scale(range=[0, 600])),
    order=alt.Order("TopicShare:Q", sort="descending"),
    tooltip=[
        alt.Tooltip('TopicShare:Q', title='topic distr. over time'),
        alt.Tooltip('topic:O', title='Topic'),
        alt.Tooltip('year:O', title='Year'),
        alt.Tooltip('topic_size:Q', title='Topic Size (N)'),
        alt.Tooltip('topic_size_pct:Q', title='Topic Size in %'),
        alt.Tooltip('desc:O', title='Topic descr.')
    ]
).properties(width=dyn_p_width)  # use the plot width selected in the form above

dynamic_chart.save('outputs/dynamic_plot.html')
dynamic_chart
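
The punchcard plot combines two normalisations of the same topic-by-year counts: `TopicShare` spreads each topic's documents over the years (point size), while `YearlyShare` spreads each year's documents over the topics (colour). A small sketch of the difference in plain Python; `topic_year_shares` and the toy data are hypothetical, and the notebook obtains the same numbers with `pd.crosstab`:

```python
from collections import Counter

def topic_year_shares(topics, years):
    """Return TopicShare and YearlyShare for every (topic, year) pair that occurs."""
    counts = Counter(zip(topics, years))
    per_topic = Counter(topics)
    per_year = Counter(years)
    return {
        (t, y): {"TopicShare": n / per_topic[t], "YearlyShare": n / per_year[y]}
        for (t, y), n in counts.items()
    }

topics = [0, 0, 0, 1]
years = [2020, 2021, 2021, 2021]
shares = topic_year_shares(topics, years)
for key, value in sorted(shares.items()):
    print(key, value)
```

Here topic 0 accounts for every document published in 2020 (`YearlyShare` of 1), although only a third of topic 0 falls in that year (`TopicShare` of 1/3).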


In [None]:
#@markdown ### Download Processed Data, Summaries and Plots

!zip -r outputs.zip outputs

files.download('outputs.zip') 