The goal of this project is to build a term risk classifier. You give the classifier a one-word term, and it outputs whether that term is associated with a higher or lower risk for Covid-19. To build it, we use term frequency–inverse document frequency (TF-IDF) to retrieve the documents for each term, and then feed the provided document embeddings of those documents into a Linear Discriminant Analysis classifier.

First, import all of the libraries we will need for the project.

In [None]:
import numpy as np
import pickle
import pandas as pd
from collections import defaultdict
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

We begin by reading in all of the data in metadata.csv and setting up our TF-IDF helper functions and structures.

In [None]:
meta_df = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
num = len(meta_df)

porter = PorterStemmer()
 
def getalpha(string):
    """Return only the alphabetic characters of a token."""
    return ''.join(s for s in string if s.isalpha())

def get_frequency(abstract):
    """Count the stemmed, lowercased, alphabetic tokens in one abstract."""
    words = word_tokenize(abstract)
    words = [getalpha(k).lower() for k in words if len(getalpha(k)) > 0]
    words = [porter.stem(k) for k in words]
    freq = defaultdict(lambda: 0)
    for i in words:
        freq[i] += 1
    return freq

term_frequency = {} # map from document -> word -> relative frequency in that document
doc_frequency = defaultdict(lambda: 0) # map from word -> number of documents containing it
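
As a rough illustration of what get_frequency produces, the sketch below counts normalized tokens using only the standard library. It is a simplified stand-in rather than the notebook's method: a regular expression replaces word_tokenize and no stemming is applied, so its keys will differ from the stemmed keys used here. The name toy_frequency and the sample sentence are made up for the example.

```python
import re
from collections import Counter

def toy_frequency(abstract):
    """Count lowercase alphabetic tokens (no stemming; simplified stand-in)."""
    # Assumption: a regex replaces nltk's word_tokenize for illustration only
    words = re.findall(r"[a-z]+", abstract.lower())
    return Counter(words)

counts = toy_frequency("Smoking and obesity: smoking raises risk.")
total = sum(counts.values())
# Relative frequency of each token within this one abstract
relative = {word: n / total for word, n in counts.items()}
print(relative["smoking"])
```

The relative frequencies computed at the end correspond to the per-document values that the notebook stores in term_frequency.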

If you already have some of the data processed, you can use the next cell to pick up where you left off.

In [None]:
with open("/kaggle/working/result.pkl", "rb") as f:
    term_frequency, doc_frequency = pickle.load(f)
    doc_frequency = defaultdict(lambda: 0, doc_frequency) # map from word -> frequency

Whether you have processed some of the data or none at all, the next cell processes all of the remaining documents. While running, it prints the fraction of documents processed (out of the total number of documents) for every 1000 documents processed. Because some documents have a null abstract, only about 69% of the documents are actually processed and used. Once this is complete, the data can be saved to a pickle file in the next step for future use.

In [None]:
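for i, row in meta_df[["cord_uid", "abstract"]].iterrows():
    if row.cord_uid in term_frequency:
        continue  # already processed (e.g. loaded from the pickle)
    if type(row.abstract) is not str:
        continue  # skip documents whose abstract is missing
    freqs = get_frequency(row.abstract)
    term_frequency[row.cord_uid] = {}
    tot = sum(freqs.values())

    for k in freqs:
        term_frequency[row.cord_uid][k] = freqs[k] / tot
        doc_frequency[k] += 1

    # Report progress once per 1000 processed documents; checking here
    # (after a document is added) avoids repeated prints on skipped rows
    if len(term_frequency) % 1000 == 0:
        print(len(term_frequency) / num)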

We can now save however much data was processed in the step above.

In [None]:
# Write to the same path that the loading cells read from
with open("/kaggle/working/result.pkl", "wb") as f:
    pickle.dump((term_frequency, dict(doc_frequency)), f)
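
The dict(doc_frequency) conversion in the cell above matters because a defaultdict whose default factory is a lambda cannot be pickled directly. The sketch below demonstrates this with in-memory bytes instead of a file; the variable names and the sample key are illustrative only.

```python
import pickle
from collections import defaultdict

counts = defaultdict(lambda: 0)
counts["viru"] += 2

try:
    pickle.dumps(counts)  # fails: the lambda default factory is not picklable
    pickled_directly = True
except (pickle.PicklingError, AttributeError, TypeError):
    pickled_directly = False

# Converting to a plain dict first makes the round trip work
restored = pickle.loads(pickle.dumps(dict(counts)))
# Re-wrap after loading to get the default-to-zero behaviour back
restored = defaultdict(lambda: 0, restored)
print(pickled_directly, restored["viru"], restored["unseen"])
```

This is also why the resume cell wraps the loaded doc_frequency in a defaultdict again before counting continues.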

We now load the data from the saved pickle file and set up a function that returns the TF-IDF score of a given term for each document it appears in. Note that there is a built-in threshold (thresh), so only documents with a high enough TF-IDF score are returned, indicating that the term is of some importance within those documents.

In [None]:
with open('/kaggle/working/result.pkl', "rb") as f:
    tf, dfa = pickle.load(f)

def get_tfidf_term(term, thresh=.05):
    """Return {doc: tf-idf} for documents where the stemmed term scores above thresh."""
    porter = PorterStemmer()
    term = porter.stem(term.lower())
    if term not in dfa:
        return {}  # the term never appears in the corpus, so there is nothing to score
    idf = np.log(len(tf) / dfa[term])
    tfidf = {
        doc: tf[doc][term] * idf for doc in tf if term in tf[doc]
    }
    return {doc: score for doc, score in tfidf.items() if score > thresh}
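
To make the arithmetic concrete, the score used here is tf-idf(t, d) = tf(t, d) * ln(N / df(t)), where N is the number of processed documents and df(t) is the number of documents containing the term. The sketch below applies the same formula and threshold to a tiny made-up corpus; toy_tf, toy_df, and toy_tfidf are hypothetical stand-ins for the notebook's tf, dfa, and get_tfidf_term, and stemming is left out.

```python
import math

# Toy stand-ins for the notebook's tf and dfa structures (made-up values)
toy_tf = {
    "doc_a": {"smoke": 0.20, "lung": 0.10},
    "doc_b": {"smoke": 0.01, "viru": 0.30},
    "doc_c": {"viru": 0.25},
    "doc_d": {"lung": 0.05},
}
toy_df = {"smoke": 2, "lung": 2, "viru": 2}

def toy_tfidf(term, thresh=0.05):
    """Return {doc: tf-idf} for documents scoring above thresh."""
    if term not in toy_df:
        return {}
    idf = math.log(len(toy_tf) / toy_df[term])
    scores = {doc: freqs[term] * idf
              for doc, freqs in toy_tf.items() if term in freqs}
    return {doc: s for doc, s in scores.items() if s > thresh}

print(toy_tfidf("smoke"))
```

Here doc_b mentions the term but falls below the threshold, so only doc_a is returned. That filtering is what keeps passing mentions out of the retrieved document sets.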

Below is a simple check to see which stemmed terms appear in our documents. We can use this to fine-tune the classifier by adding words we see here, together with an associated level of risk, to either the 'good' or 'bad' risk terms. Note that these stemmed terms may look strange or incomplete because each one stands for several word forms that share the same idea; for example, disease and diseases are both represented as 'diseas'.

In [None]:
dfa

We now use terms known to be associated with high-risk patients and terms known to be associated with low-risk patients to get a set of documents that contain these terms with relatively high frequency.

In [None]:
bad_terms = ["Cancer","compromise", "Senior", "Pulmonary","Pregnancy","Old", "Death","Obesity","Sickle","Smoking","Diabetes","Complication","Disease", "issue"]

good_terms = ["immune", "success", "aid", "Recover", "prevent", "recovery","Healthy","Thin","Athletic","Young","Fit", "good", "D", "Well", "Help"]
bad_docs = []
good_docs = []

for term in bad_terms:
    bad_docs.extend(get_tfidf_term(term).keys())
    
for term in good_terms:
    good_docs.extend(get_tfidf_term(term).keys())
    
bad_docs = set(bad_docs)
good_docs = set(good_docs)

Below we check a few things to learn more about the documents retrieved above:
1. The ratio of the size of the intersection of the good and bad sets to the size of the smaller of the two sets. This tells us how separated the smaller set is from the larger set (a lower value means less overlap).
2. The ratio of the size of the intersection to the size of the union of the good and bad sets. This tells us something very similar to the first measure, but with more regard to the retrieved documents as a whole.
3. The total number of docs in our good set.
4. The total number of docs in our bad set.

In [None]:
print(len(bad_docs.intersection(good_docs))/min(len(bad_docs), len(good_docs)))
print(len(bad_docs.intersection(good_docs))/len(bad_docs.union(good_docs)))

print(len(good_docs))
print(len(bad_docs))
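
The two ratios printed above are commonly known as the overlap coefficient and the Jaccard index. A minimal sketch on made-up document IDs shows how they differ; the function names and the toy sets are illustrative and not part of the notebook.

```python
def overlap_coefficient(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))

def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

# Made-up document IDs: two of them appear in both sets
toy_bad = {"d1", "d2", "d3", "d4"}
toy_good = {"d3", "d4", "d5", "d6", "d7", "d8"}

print(overlap_coefficient(toy_bad, toy_good))
print(jaccard_index(toy_bad, toy_good))
```

In this toy case half of the smaller set is shared (0.5), while the shared documents make up a quarter of everything retrieved (0.25). Both values matter because documents in the intersection end up in the training data under both labels.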

Now we load in the document embeddings that have already been provided by Kaggle.

In [None]:
embeddings = pd.read_csv('/kaggle/input/CORD-19-research-challenge/cord_19_embeddings/cord_19_embeddings_2021-01-18.csv', sep=',')

After loading the embeddings, we take the subset of embeddings for the good docs and for the bad docs. These embeddings are what our classifier is trained on. The column named ug7v899j holds the document IDs; the name looks like a cord_uid, which suggests that the file has no header row and pandas used its first record as column names. The idea is that documents about low-risk terms will have similar document embeddings, and that documents about high-risk terms will also have similar document embeddings. Under this assumption we can build a classifier that labels each document as good or bad, and then classify any given term by averaging the predictions over the documents associated with it.

In [None]:
good = embeddings[embeddings.ug7v899j.isin(good_docs)].drop("ug7v899j", axis=1)
bad = embeddings[embeddings.ug7v899j.isin(bad_docs)].drop("ug7v899j", axis=1)

We now stack the good and bad document embeddings into a single array and label each row as good (0) or bad (1).

In [None]:
X = np.concatenate((np.array(good), np.array(bad)))
Y = [0 if i < len(good) else 1 for i in range(len(X))]
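
The labelling above relies purely on row order: the good rows come first in X, so every index below len(good) gets label 0. The sketch below shows the same idea on made-up 3-dimensional vectors (the real embeddings have many more dimensions), building the labels with NumPy instead of a comprehension. All names ending in _toy or starting with toy_ are hypothetical.

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made-up values, not real CORD-19 data)
toy_good = np.array([[0.1, 0.2, 0.3], [0.0, 0.1, 0.2]])
toy_bad = np.array([[0.9, 0.8, 0.7], [1.0, 0.9, 0.8], [0.1, 0.2, 0.3]])

X_toy = np.concatenate((toy_good, toy_bad))
# Rows from toy_good come first, so label by position: 0 = good, 1 = bad
Y_toy = np.concatenate((np.zeros(len(toy_good), dtype=int),
                        np.ones(len(toy_bad), dtype=int)))

print(X_toy.shape, Y_toy.tolist())
```

Note that the vector [0.1, 0.2, 0.3] appears once with each label. The same thing happens in the notebook for any document that was retrieved for both a good term and a bad term.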

We split our data into train and test sets, fit the classifier on the training set, and check its accuracy on the held-out 'ground truth' documents.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15)

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, Y_train)
clf.score(X_test, Y_test)

We now introduce a function that takes a term and outputs the mean predicted risk over all of the documents in which the term has a high TF-IDF score.

In [None]:
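def pred(term):
    """Return (mean predicted label, risk label) for the documents retrieved for term."""
    doc_ids = get_tfidf_term(term).keys()
    encodings = embeddings[embeddings.ug7v899j.isin(doc_ids)].drop("ug7v899j", axis=1)
    if len(encodings) == 0:
        return None, "No matching documents"  # nothing to classify for this term
    predictions = clf.predict(encodings)
    mean_pred = np.mean(predictions)
    # A mean below 0.5 means most documents were classified as good (0)
    risk_assessment = "Lower Risk" if mean_pred < .5 else "Higher Risk"
    return mean_pred, risk_assessment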
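
The aggregation step in pred can be looked at in isolation: the classifier returns one 0/1 label per document, and the term's score is the fraction of documents labelled bad (1). The sketch below reproduces only that step on hand-written labels. The function name summarize, the cutoff parameter, and the handling of an empty list are assumptions made for illustration.

```python
def summarize(predictions, cutoff=0.5):
    """Average 0/1 document labels and map the mean to a risk label."""
    if len(predictions) == 0:
        # Assumption: report explicitly when no documents were retrieved
        return None, "No matching documents"
    mean_pred = sum(predictions) / len(predictions)
    label = "Lower Risk" if mean_pred < cutoff else "Higher Risk"
    return mean_pred, label

print(summarize([1, 1, 0, 1]))  # three of four documents labelled bad
print(summarize([0, 0, 1]))     # one of three documents labelled bad
print(summarize([]))            # no documents retrieved for the term
```

Because the mean is taken over hard labels rather than probabilities, a score of exactly 0.5 is reported as higher risk, as in the notebook's function.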

You can now pass in different terms to see whether they are classified as higher or lower risk, along with the actual mean value between 0 and 1.

In [None]:
pred("elderly")

The classifier performs fairly well given the prior knowledge of high-risk and low-risk terms that we gave the program. A good extension and next step for this project would be to make this a semi-supervised model that finds terms that classify strongly as higher or lower risk and then includes them in the terms used for training.
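
One possible shape for that extension is sketched below, under stated assumptions: candidate terms are scored with the mean prediction from pred, and only those with very confident scores are added to the seed lists before retraining. The function expand_terms, the 0.9 and 0.1 cutoffs, and the sample scores are all hypothetical, and the retraining step itself is not shown.

```python
def expand_terms(scores, bad_terms, good_terms, hi=0.9, lo=0.1):
    """Add confidently scored candidate terms to copies of the seed lists."""
    new_bad, new_good = list(bad_terms), list(good_terms)
    for term, mean_pred in scores.items():
        if term in new_bad or term in new_good:
            continue  # already a seed term
        if mean_pred >= hi:
            new_bad.append(term)   # strongly associated with bad documents
        elif mean_pred <= lo:
            new_good.append(term)  # strongly associated with good documents
    return new_bad, new_good

# Made-up scores standing in for pred(term)[0] on candidate terms
toy_scores = {"hypertension": 0.95, "exercise": 0.05, "hospital": 0.55}
bad, good = expand_terms(toy_scores, ["cancer"], ["healthy"])
print(bad, good)
```

Ambiguous terms such as the made-up "hospital" score are left out, which limits how much label noise each round of expansion can introduce.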