# Evaluate Word Burstiness Scores on the Genia Corpus Data

Authors: Samuel Sarria Hurtado and Paul Sheridan

Goal: Evaluate the following word burstiness scores on the Genia corpus data:
- Kwok
- Irvine and Callison-Burch
- Deviation of Proportions (DOP for short)
- Chi-square
- Naive Sarria Hurtado Mullen Sheridan
- Sarria Hurtado Mullen Sheridan tail probability

Likewise evaluate KeyBERT word scores. Calculate P@k scores for each scoring function using the Genia terms as ground truth.
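
As a point of reference, P@k is read here as the fraction of the k top-ranked terms that appear in the ground-truth set. The sketch below assumes that reading; `precision_at_k` and the toy terms are illustrative names and data, not part of the notebook's helper modules.

```python
def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the first k ranked terms that are in the ground-truth set."""
    relevant = set(relevant_terms)  # set membership keeps each lookup cheap
    top = ranked_terms[:k]
    hits = sum(1 for term in top if term in relevant)
    return hits / k

# Toy ranking (hypothetical data), best-scoring term first
ranked = ["Bcl-6", "of", "SHP1", "the", "TCRzeta"]
ground_truth = {"Bcl-6", "SHP1", "TCRzeta"}

p_at_2 = precision_at_k(ranked, ground_truth, 2)
p_at_5 = precision_at_k(ranked, ground_truth, 5)
print(p_at_2, p_at_5)
```

Dividing by k rather than by the number of terms actually returned matches how the tables later in this notebook are computed.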

## Preliminaries

In [None]:
# Mount Google drive
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

# Add path to Python function files to system path
import sys
import json
import pandas as pd
imports_path = '/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation'
sys.path.append(imports_path)
import word_stats
import word_burstiness_metrics as wbm
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import scipy
import nltk
from nltk.corpus import stopwords
from io import StringIO
from numpy import nan

Mounted at /content/drive/


### Read in all necessary data

In [None]:
# load the genia corpus, the genia keywords, the keybert rankings, and all the different lists of stopwords.
json_genia_path = '/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/preprocessing/output/GENIAcorpus3.02-preprocessed.json'

with open(json_genia_path, "r") as j:
  genia = json.loads(j.read())

keyword_genia_path = '/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/preprocessing/output/GENIAcorpus3.02-keywords.csv'

with open(keyword_genia_path, "r") as c:
  key_words = pd.read_csv(c)
important_words = key_words.lex.to_numpy()

pauls_word_list_path = '/content/drive/MyDrive/2023-bursty-summer-project/computation/bursty-tail-gamma-estimation/gamma-pvalues-noquotes.tsv'

with open(pauls_word_list_path, 'r') as p:
  pauls_words_df = pd.read_table(p)
  pauls_words = pauls_words_df.term.to_numpy()
  pauls_scores = pauls_words_df[['term', 'nlog_pval_est']]

json_keybert_path = '/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/preprocessing/keyBERT/keybert_scores.json'

with open(json_keybert_path, 'r') as k:
  keybert_scores = json.loads(k.read())
keybert = pd.DataFrame(keybert_scores)
keybert.columns = ['term', 'keybert']

terrier_path = '/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation/stopwords.txt'

terrier_stopwords = np.loadtxt(terrier_path, dtype=str)

myisam_path = '/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation/myisam-stopwords.txt'

with open(myisam_path, 'r') as t:
  myisam_txt = StringIO(t.read() + '  NA  NA')

myisam_stopwords = np.loadtxt(myisam_txt, dtype=str)
myisam_stopwords = np.reshape(myisam_stopwords, (545, ))
nas = np.where(myisam_stopwords == 'NA')[0]
myisam_stopwords = np.delete(myisam_stopwords, nas).tolist()

## Data cleaning

There is an issue with the vocabulary we use for the GENIA collection: it appears to contain a null value. The next two cells show that this NaN value should really be the string `'null'`: the string `'null'` occurs in the GENIA corpus but is missing from our vocabulary.
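
The likely cause is that pandas treats the literal string `null` as a missing-value marker when parsing text files. The sketch below reproduces this on a two-row toy table (hypothetical data) and shows `keep_default_na=False` as one possible way to keep the string intact at read time. The notebook itself patches the value afterwards instead.

```python
from io import StringIO
import pandas as pd

# Toy TSV with a term that is literally the word "null"
tsv = "term\tscore\nnull\t1.5\nkinase\t2.0\n"

# Default parsing: "null" is on pandas' list of NA markers
default_df = pd.read_table(StringIO(tsv))

# Disable the default NA markers so "null" stays a string
literal_df = pd.read_table(StringIO(tsv), keep_default_na=False)

print(default_df.term.isnull().sum())
print(literal_df.term.tolist())
```

Note that `keep_default_na=False` also stops empty fields from being read as NaN, so it is only appropriate when the file has no genuinely missing entries.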



In [None]:
pre_vocab = []
for i in range(len(genia)):
  pre_vocab.append(genia[i].split())

vocab = []
for i in range(len(pre_vocab)):
  for j in range(len(pre_vocab[i])):
    vocab.append(pre_vocab[i][j])

In [None]:
# Here is the NaN value
vocab_null = np.where(np.array(vocab) == 'null')[0][0]
null_val = np.where(pauls_words_df.term.isnull())[0][0]
print('This is the NaN value in our desired vocab: ', null_val, '. In the genia corpus, this is the word \'null\'')
print('This is the index of the \'null\' term in the genia corpus: ', vocab_null)

# Now to fix it, we change the value of our pauls_words to 'null'
pauls_words[null_val] = 'null'

This is the NaN value in our desired vocab:  14816 . In the genia corpus, this is the word 'null'
This is the index of the 'null' term in the genia corpus:  97738


Here we show that the vocab we'll use for the CountVectorizer is the same as all of the unique words in the genia collection.

In [None]:
# First we take out all the duplicates in our imported vocab and all the words in genia
pauls_words = list(set(pauls_words))
vocab = list(set(vocab))

# We show that there are no words that are in our vocab that aren't in genia
# And likewise there are no words in genia that aren't in our imported vocab
set_difference1 = set(pauls_words).difference(set(vocab))
set_difference2 = set(vocab).difference(set(pauls_words))

# Then we show that both vocabs have the same length
same_len = len(vocab) == len(pauls_words)

print('This is the set difference of our imported vocab and the genia vocab: ', set_difference1)
print('This is the set difference in the other direction: ', set_difference2)
print('Are the lengths of both vocabs the same? ', same_len)

This is the set difference of our imported vocab and the genia vocab:  set()
This is the set difference in the other direction:  set()
Are the lengths of both vocabs the same?  True


## Vectorizing the GENIA collection

In [None]:
# Custom function so the Count vectorizer won't ignore any words
def analyzer_custom(doc):
  return doc.split()

In [None]:
counter = CountVectorizer(lowercase=False, vocabulary=pauls_words, analyzer=analyzer_custom)
collection = counter.transform(genia)

### Important Text Analysis Variables

In [None]:
m = len(counter.get_feature_names_out())
d = collection.shape[0]
N_i = word_stats.get_Ni(collection)
N_j = word_stats.get_Nj(collection)
N = word_stats.get_N(N_j)
B_ij = word_stats.get_Bij(collection)
B_i = word_stats.get_Bi(B_ij)
B_j = word_stats.get_Bj(B_ij)
CF = word_stats.get_cf(N_i)
DF = word_stats.get_df(B_i, d)
nij_by_nj = word_stats.get_nij_by_nj(collection, N_j)
alpha_i = word_stats.get_alpha_i(collection, B_j, N_j)
mu_alpha_i = word_stats.get_mu_alpha_i(alpha_i, d)
sigma_alpha_i = word_stats.get_sigma_alpha_i(alpha_i, mu_alpha_i, d)

### Little experiment

In [None]:
def eidf_icf_diff(theta, d, nj, bi_obs):
  sum1 = np.power(1 - theta, nj, dtype = np.longdouble).sum()
  sum2 = np.power(1 - theta, 2*nj, dtype = np.longdouble).sum()

  EDF = 1 - (1/d)*sum1
  EIDF = (sum1 - sum2)/((2*d**2)*(EDF**2)) - np.log(EDF)

  return EIDF - np.log(d/bi_obs)
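
Read term by term, the function above appears to be a second-order (delta-method) approximation of the expected IDF under a model in which a word with per-token probability $\theta$ occurs in document $j$ independently with probability $1-(1-\theta)^{n_j}$. This reading is an interpretation of the code, not something the notebook states; the symbols $p_j$ and $X$ (the fraction of documents containing the word) are introduced here for illustration.

```latex
\begin{align*}
p_j(\theta) &= 1 - (1-\theta)^{n_j}, \qquad
\mathrm{EDF}(\theta) = \frac{1}{d}\sum_{j=1}^{d} p_j(\theta)
  = 1 - \frac{1}{d}\sum_{j=1}^{d}(1-\theta)^{n_j}, \\
\operatorname{Var}(X) &= \frac{1}{d^{2}}\sum_{j=1}^{d} p_j(\theta)\bigl(1-p_j(\theta)\bigr)
  = \frac{1}{d^{2}}\Bigl(\sum_{j=1}^{d}(1-\theta)^{n_j} - \sum_{j=1}^{d}(1-\theta)^{2n_j}\Bigr), \\
\mathrm{EIDF}(\theta) &\approx -\log \mathrm{EDF}(\theta)
  + \frac{\operatorname{Var}(X)}{2\,\mathrm{EDF}(\theta)^{2}}, \\
f(\theta) &= \mathrm{EIDF}(\theta) - \log\frac{d}{b_i}.
\end{align*}
```

Here the two sums correspond to `sum1` and `sum2` in the code, and a root of $f$ is a value of $\theta$ at which the approximate expected IDF equals the observed IDF $\log(d/b_i)$.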


In [None]:
thetas = np.array(range(1, max(N_i.A[0]) + 1))/N
opt_thetas = []
for i in range(m):
  result = scipy.optimize.brentq(f = lambda x: eidf_icf_diff(x, d, N_j.T.A[0], B_i.A[0][i]), a=min(thetas), b=max(thetas))
  opt_thetas.append(result)

opt_thetas = np.array(opt_thetas)
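
`scipy.optimize.brentq` needs a bracket `[a, b]` on which the objective changes sign. The sketch below shows that requirement on a deliberately simplified objective that matches the expected document frequency to an observed one, without the variance correction used in `eidf_icf_diff`. The document lengths, the target value, and the name `expected_df` are all hypothetical.

```python
import numpy as np
from scipy.optimize import brentq

def expected_df(theta, doc_lengths):
    """Expected fraction of documents containing a word with token probability theta."""
    return 1.0 - np.mean((1.0 - theta) ** doc_lengths)

doc_lengths = np.array([50, 80, 120, 200])  # toy document lengths
observed_df = 0.5                           # toy observed document frequency

def objective(theta):
    return expected_df(theta, doc_lengths) - observed_df

lo, hi = 1e-6, 0.5
# brentq raises ValueError unless the objective has opposite signs at the ends
if objective(lo) * objective(hi) >= 0:
    raise ValueError("bracket does not contain a sign change")

theta_hat = brentq(objective, lo, hi)
print(theta_hat, expected_df(theta_hat, doc_lengths))
```

In the notebook the bracket runs from `min(thetas)` to `max(thetas)`, so the same sign-change condition has to hold there for every word.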

In [None]:
def get_B1(thetas, vector_cf, N):
  expected_ICF = word_stats.get_eicf(thetas, N)
  observed_ICF = word_stats.get_icf(vector_cf)
  return expected_ICF - observed_ICF

In [None]:
B1 = get_B1(opt_thetas, CF, N)

In [None]:
B1.A[0]

array([ 1.79150853e+00,  1.09836135e+00, -2.72500228e-03, ...,
        2.86051669e-01, -2.50942360e-04, -2.50942360e-04])

### Important Bursty Heuristics/Measures

In [None]:
church = word_stats.get_church(N_i, B_i)
irvine = wbm.get_irvine(nij_by_nj, B_i)
gries = wbm.get_gries(collection, N_i, N_j, N)
chisqr = wbm.get_chisq_score(collection)

## Heuristics/Measures Comparison

### Preliminary Dataframes

In [None]:
pauls_scores = pauls_words_df[['term', 'nlog_pval_est']]
dta = {'term': counter.get_feature_names_out(), 'church': church.A[0], 'irvine': irvine.A[0], 'dop': gries.A[0], 'chisq_score': chisqr, 'RICF': B1.A[0]}
df = pd.DataFrame(data=dta)
all_scores = df.merge(keybert,how='left', left_on='term', right_on='term')

In [None]:
sorted_indices = []
cols = all_scores.columns.values.tolist()
for col in cols:
  if col == 'term':
    sorted_indices.append(np.array(all_scores['term']))
  elif col == 'keybert':
    a = np.array(all_scores[[col]])
    sorted_indices.append(5000 - scipy.stats.rankdata(a, method='ordinal', nan_policy='omit'))
  else:
    a = np.array(all_scores[[col]])
    sorted_indices.append(len(a) - scipy.stats.rankdata(a, method='ordinal', nan_policy='omit').astype(int))

sorted_indices = np.array(sorted_indices)
m_t_pair = zip(cols, sorted_indices)
measures_indices = dict(m_t_pair)
measures_indices_df = pd.DataFrame(measures_indices)

# Cast the non-NaN keybert ranks to int; .loc avoids pandas chained assignment
for i in range(len(measures_indices_df['keybert'])):
  if not np.isnan(measures_indices_df['keybert'][i]):
    measures_indices_df.loc[i, 'keybert'] = int(measures_indices_df['keybert'][i])
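
The rank conversion used above can be checked on a small array. `rankdata` with `method='ordinal'` assigns rank 1 to the smallest score, so subtracting from the array length gives a 0-based rank in which the largest score comes first. The scores below are toy values.

```python
import numpy as np
from scipy.stats import rankdata

scores = np.array([0.2, 0.9, 0.5, 0.1])  # toy scores, no ties

ascending = rankdata(scores, method='ordinal')  # 1 = smallest score
descending = len(scores) - ascending            # 0 = largest score

print(descending.tolist())
```

The hard-coded `5000` in the `keybert` branch plays the role of `len(scores)` here and presumably reflects the number of terms that have a KeyBERT score.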

In [None]:
nltk.download('stopwords')
st_words = stopwords.words('english')
all_stopwords = [st_words, terrier_stopwords, myisam_stopwords]
lst_stopwords = []
for i in range(len(all_stopwords)):
  for j in range(len(all_stopwords[i])):
    lst_stopwords.append(all_stopwords[i][j])

lst_stopwords = set(lst_stopwords)
vocab_st_words = list(set(pauls_words).intersection(set(lst_stopwords)))
vocab_st_words_in = []
for i in range(len(measures_indices_df['term'])):
  if measures_indices_df['term'][i] in vocab_st_words:
    vocab_st_words_in.append(i)

vocab_st_words_in = np.array(vocab_st_words_in)

table_of_stop_words = measures_indices_df[measures_indices_df['term'].isin(vocab_st_words)]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
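
The nested loops in the cell above build the union of the three stopword lists and then locate the vocabulary entries that fall in it. A more compact equivalent is sketched below with short stand-in lists (hypothetical data; the real lists come from NLTK, Terrier, and MyISAM).

```python
from itertools import chain

# Hypothetical stand-ins for the three stopword lists
nltk_like = ["the", "of", "and"]
terrier_like = ["the", "thus", "however"]
myisam_like = ["of", "we"]

# Union of all lists; duplicates collapse automatically in a set
all_stop = set(chain(nltk_like, terrier_like, myisam_like))

# Positions of vocabulary terms that are stopwords
vocabulary = ["Bcl-6", "the", "SHP1", "we", "kinase"]
stop_positions = [i for i, term in enumerate(vocabulary) if term in all_stop]

print(sorted(all_stop), stop_positions)
```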


### Statistical analysis of measures/heuristics

In [None]:
# Each measure assigns every word in the corpus a unique rank. The smaller the rank,
# the higher the burstiness. These are the quartiles of the stopword ranks under each measure.

quantiles = []
for col in cols:
  if col == 'term' or col == 'keybert': continue
  quantiles.append(table_of_stop_words[col].quantile([0, 0.25, 0.5, 0.75, 1]))

quantiles_df = pd.DataFrame(quantiles)
display(quantiles_df)

measure,0.00,0.25,0.50,0.75,1.00
church,17.0,7686.0,8680.0,22151.0,40199.0
irvine,68.0,11210.0,16645.0,22143.0,39942.0
dop,0.0,257.0,1111.0,3685.0,39924.0
chisq_score,1.0,1591.0,31516.0,32286.0,39676.0
RICF,2069.0,8047.0,8736.0,38716.0,40238.0


In [None]:
#quantiles_df.to_csv(path)

In [None]:
sorted_terms = []
measures = cols[1:]
for measure in measures:
  sorted_terms.append(np.array(all_scores[['term', measure]].sort_values(measure, ascending=False)['term']))

sorted_terms = np.array(sorted_terms)
measure_term_pair = zip(measures, sorted_terms)
sorted_measures = dict(measure_term_pair)

In [None]:
def top_k(dct, k):
  keys = dct.keys()
  values = []
  for key in keys:
    values.append(dct[key][:k])
  keys_values_pair = zip(keys, values)
  return dict(keys_values_pair)
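
For reference, the same truncation can be written as a dictionary comprehension. The sketch below uses a two-measure toy dictionary (hypothetical data) to show the input and output shapes that `top_k` works with.

```python
def top_k(ranked_by_measure, k):
    """Keep the first k terms of each measure's ranked list."""
    return {measure: terms[:k] for measure, terms in ranked_by_measure.items()}

# Toy rankings, best term first
rankings = {
    "church": ["Bcl-6", "v-erbA", "SMX"],
    "dop": ["of", "the", "in"],
}

top_2 = top_k(rankings, 2)
print(top_2)
```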

In [None]:
# These are the top 10 most bursty words as ranked by each measure

top_10_terms = pd.DataFrame(top_k(sorted_measures, 10))
top_10_terms

rank,church,irvine,dop,chisq_score,RICF,keybert
0,Bcl-6,Bcl-6,of,suggest,Bcl-6,CD3+CD28-induced_interleukin-2_production
1,v-erbA,TCRzeta,the,here,v-erbA,CD28-mediated_activation
2,SMX,ML-9,in,indicate,SMX,CD28-induced_IL-2_promoter_activity
3,SHP1,AITL,and,results,ML-9,CD28-associated_signaling_pathway
4,ML-9,SHP1,to,however,SHP1,CD28_costimulatory_pathway
5,beta-casein,beta-casein,a,study,beta-casein,CD28-mediated_signal_transduction
6,TCRzeta,A-myb,that,previous,DM,interleukin_2_CD28-responsive_complex
7,EBNA-2,I_kappaB,by,recent,p95vav,CD3/CD28-induced_IL-2_mRNA_accumulation
8,I_kappaB,SMX,with,investigate,I_kappaB,CD28_costimulation_pathway
9,p95vav,Rap1_protein,we,thus,TCRzeta,CD28_receptor_ligation


In [None]:
#top_10_terms.to_csv(path)

In [None]:
green = ['G#cell_type', 'G#cell_component', 'G#cell_line',
         'G#other_artificial_source']
blue = ['G#nucleotide', 'G#polynucleotide', 'G#DNA_N/A',
        'G#DNA_domain_or_region', 'G#DNA_family_or_group', 'G#DNA_molecule',
        'G#DNA_substructure', 'G#RNA_N/A', 'G#RNA_domain_or_region',
        'G#RNA_family_or_group', 'G#RNA_molecule', 'G#RNA_substructure']
light_blue = ['G#amino_acid_monomer', 'G#peptide', 'G#protein_N/A',
              'G#protein_complex', 'G#protein_domain_or_region',
              'G#protein_family_or_group', 'G#protein_molecule',
              'G#protein_substructure', 'G#protein_subunit',
              'G#other_organic_compound', 'G#organic', 'G#inorganic', 'G#atom',
              'G#carbohydrate', 'G#lipid']
yellow = ['G#virus', 'G#mono_cell', 'G#multi_cell', 'G#body_part', 'G#tissue']
red = ['G#other_name']

sem = np.array(key_words['sem'])
lex = np.array(key_words['lex'])
lex_sem_dct = dict(zip(lex, sem))

def get_color_words(lst_color):
  words = []
  for k, v in lex_sem_dct.items():
    if v in lst_color:
      words.append(k)
  return words

green_words = get_color_words(green)
blue_words = get_color_words(blue)
light_blue_words = get_color_words(light_blue)
yellow_words = get_color_words(yellow)
red_words = get_color_words(red)

print('green words: ', len(green_words), 'blue words: ', len(blue_words),
      'light blue: ', len(light_blue_words), 'yellow words: ', len(yellow_words),
      'red words: ', len(red_words))

green words:  4051 blue words:  5574 light blue:  10159 yellow words:  1444 red words:  10560
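
One property of `dict(zip(lex, sem))` is worth keeping in mind: if a term appears with more than one semantic class, only the last class survives. Whether that actually happens in the keyword file is not checked here. The sketch below shows the effect on hypothetical annotations and one way to keep every class per term.

```python
from collections import defaultdict

# Hypothetical (term, semantic class) annotations with a repeated term
pairs = [
    ("NF-kappa_B", "G#protein_complex"),
    ("NF-kappa_B", "G#protein_molecule"),
    ("T_cell", "G#cell_type"),
]

# Plain dict: later pairs overwrite earlier ones for the same term
last_wins = dict(pairs)

# Multi-map: collect every class seen for each term
all_classes = defaultdict(set)
for term, sem_class in pairs:
    all_classes[term].add(sem_class)

print(last_wins["NF-kappa_B"], sorted(all_classes["NF-kappa_B"]))
```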


In [None]:
def count_words(lst, imp_words):
  counter = 0
  for x in lst:
    if x in imp_words:
      counter += 1
  return counter

def create_p_k(lst_words):
  measures = sorted_measures.keys()
  counts = [[] for _ in measures]
  p_k_dct = dict(zip(measures, counts))
  for measure in p_k_dct.keys():
    for value in at_values:
      p_k_dct[measure].append(count_words(top_k(sorted_measures, value)[measure], lst_words)/value)
  result = pd.DataFrame(p_k_dct)
  result.index = at_values
  return result

In [None]:
# These are the p@k scores for the different categories of domain-specific words

at_values = np.array([10, 50, 100, 500, 1000, 5000])
highlights = {'green_words': green_words, 'blue_words': blue_words, 'light_blue_words': light_blue_words, 'yellow_words': yellow_words, 'important_words': important_words, 'red_words': red_words}
dfs = []
for k, v in highlights.items():
  dfs.append(create_p_k(v))
  print(k)
  display(dfs[-1])

green_words


k,church,irvine,dop,chisq_score,RICF,keybert
10,0.0,0.0,0.0,0.0,0.0,0.0
50,0.0,0.0,0.0,0.0,0.0,0.02
100,0.01,0.02,0.01,0.0,0.01,0.02
500,0.036,0.06,0.02,0.012,0.044,0.038
1000,0.059,0.068,0.026,0.024,0.059,0.05
5000,0.0998,0.099,0.0538,0.064,0.1006,0.0932


blue_words


k,church,irvine,dop,chisq_score,RICF,keybert
10,0.0,0.1,0.0,0.0,0.0,0.0
50,0.1,0.12,0.0,0.0,0.1,0.04
100,0.09,0.1,0.01,0.01,0.1,0.03
500,0.144,0.146,0.01,0.016,0.146,0.06
1000,0.13,0.139,0.019,0.024,0.135,0.081
5000,0.156,0.149,0.0544,0.0774,0.1568,0.1246


light_blue_words


k,church,irvine,dop,chisq_score,RICF,keybert
10,1.0,0.8,0.0,0.0,1.0,0.1
50,0.76,0.72,0.02,0.02,0.8,0.08
100,0.82,0.75,0.03,0.01,0.83,0.07
500,0.694,0.626,0.062,0.026,0.69,0.158
1000,0.655,0.593,0.091,0.05,0.654,0.227
5000,0.4316,0.413,0.176,0.1494,0.432,0.3056


yellow_words


k,church,irvine,dop,chisq_score,RICF,keybert
10,0.0,0.0,0.0,0.0,0.0,0.0
50,0.04,0.04,0.0,0.0,0.04,0.0
100,0.03,0.03,0.0,0.0,0.03,0.0
500,0.03,0.03,0.014,0.004,0.03,0.0
1000,0.035,0.042,0.02,0.009,0.034,0.0
5000,0.043,0.0422,0.026,0.0288,0.0438,0.0008


important_words


k,church,irvine,dop,chisq_score,RICF,keybert
10,1.0,1.0,0.0,0.0,1.0,1.0
50,0.96,1.0,0.06,0.04,1.0,1.0
100,0.98,0.98,0.09,0.05,1.0,1.0
500,0.984,0.974,0.164,0.1,0.992,1.0
1000,0.983,0.963,0.218,0.189,0.985,0.998
5000,0.9346,0.9054,0.423,0.5006,0.9368,0.9952


red_words


k,church,irvine,dop,chisq_score,RICF,keybert
10,0.0,0.1,0.0,0.0,0.0,0.9
50,0.06,0.12,0.04,0.02,0.06,0.86
100,0.03,0.08,0.04,0.03,0.03,0.88
500,0.08,0.112,0.058,0.042,0.082,0.744
1000,0.104,0.121,0.062,0.082,0.103,0.64
5000,0.2042,0.2022,0.1128,0.181,0.2036,0.471


In [None]:
# This cell writes the p@k tables to a folder, uncomment to rewrite.

#dfs[0].to_csv('/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation/p@k tables/green_words.csv')
#dfs[1].to_csv('/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation/p@k tables/blue_words.csv')
#dfs[2].to_csv('/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation/p@k tables/light_blue_words.csv')
#dfs[3].to_csv('/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation/p@k tables/yellow_words.csv')
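#dfs[4].to_csv('/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation/p@k tables/important_words.csv')
#dfs[5].to_csv('/content/drive/MyDrive/2023-bursty-summer-project/computation/genia/bursty-score-evaluation/p@k tables/red_words.csv')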

In [None]:
# This cell counts how many word occurrences (tokens) of each color there are in the corpus
color_words = [green_words, blue_words, light_blue_words, yellow_words, important_words, red_words]
color_words_counter = [0, 0, 0, 0, 0, 0]
for i in range(len(color_words)):
  for j in range(len(pauls_words)):
    if pauls_words[j] in color_words[i]:
      color_words_counter[i] += N_i.A[0][j]

# Key the totals by category name; the word lists themselves are unhashable
word_count_zip = zip(highlights.keys(), color_words_counter)
word_counter_dict = dict(word_count_zip)

[11850, 12533, 44909, 6655, 104692, 28745]

In [None]:
# This cell counts the number of unique words of each color in the corpus

color_words_counter = [0, 0, 0, 0, 0, 0]
for i in range(len(color_words)):
  for j in range(len(pauls_words)):
    if pauls_words[j] in color_words[i]:
      color_words_counter[i] += 1

# Key the counts by category name; the word lists themselves are unhashable
unique_word_count_zip = zip(highlights.keys(), color_words_counter)
unique_word_num_dict = dict(unique_word_count_zip)
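
The last two cells differ only in what they add up: the first sums corpus frequencies (tokens), the second counts distinct vocabulary entries (types). A minimal sketch of that distinction on a toy sentence, with a hypothetical category word list, is:

```python
from collections import Counter

# Toy corpus and a hypothetical category word list
tokens = "Bcl-6 binds Bcl-6 site in T_cell and T_cell line".split()
category = {"Bcl-6", "T_cell"}

freq = Counter(tokens)

# Token count: every occurrence of a category word
token_count = sum(n for term, n in freq.items() if term in category)

# Type count: each distinct category word once
type_count = sum(1 for term in freq if term in category)

print(token_count, type_count)
```

Storing each category as a set, as done here, also makes the membership test much faster than scanning a list for every vocabulary entry.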