# Recommender System That Returns the Top Related Research Papers for a Query

**Hello! Welcome to this kernel.**
<p>I am a beginner in data science, so if you have any suggestions or thoughts you want to share, please do not hesitate to leave a comment! This is also one of my ways to learn more. I am currently a student, and this project is actually the final assessment for one of my courses. I just thought it would be nice to post it here too!

# Goal
<p>For this project, the dataset consists of over 45000 biomedical papers. With a dataset this large, it is hard to find valuable information directly. Therefore, I want to build a <b>recommender system</b> that recommends which papers to read for a specific query.

Import important libraries

In [None]:
import pandas as pd 
import numpy as np 
import re
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

#nltk.download("stopwords")
#nltk.download('wordnet')
#nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from gensim.models import word2vec
from sklearn import metrics
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import json
import os

import warnings
warnings.filterwarnings("ignore")

# Load and read the csv file

In [None]:
# load the meta data from the CSV file 
df=pd.read_csv("../input/covid19-json-to-csv-file/df.csv")
print (df.shape)

df["abstract"] = df["abstract"].str.lower()
df['title'] = df['title'].str.lower()
df['full_text'] = df['full_text'].str.lower()
# print the shape again; lower-casing does not change it
print (df.shape)

Next we will read the metadata CSV file, which contains around 45000 papers. Some of the papers do not have full text. We only need the "title" and "abstract" columns from this dataframe.

In [None]:
metadata=pd.read_csv("../input/CORD-19-research-challenge/metadata.csv", usecols=['title','abstract'])
metadata["abstract"] = metadata["abstract"].str.lower()
metadata['title'] = metadata['title'].str.lower()
print(metadata.shape)

Here we merge the two dataframes with a left join. Because no `on` argument is passed, pandas joins on every column name the two dataframes share. All rows of `df` are kept, and any column that comes only from `metadata` is NaN for rows without a match.
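
To make the join behavior concrete, here is a minimal sketch on made-up data (the frames `left` and `right` are hypothetical stand-ins, not the real dataset). It uses the `indicator=True` option of `pd.merge` to show which rows found a match.

```python
import pandas as pd

# Toy stand-ins for the two frames (hypothetical data, not from the dataset).
left = pd.DataFrame({
    "title": ["paper a", "paper b"],
    "abstract": ["abs a", "abs b"],
    "full_text": ["text a", "text b"],
})
right = pd.DataFrame({
    "title": ["paper a", "paper b"],
    "abstract": ["abs a", "a different abstract"],
})

# With no `on=`, merge joins on every shared column name: title AND abstract.
merged = pd.merge(left, right, how="left", indicator=True)
print(merged)

# "paper b" exists in both frames, but its abstract differs, so that row
# is flagged as "left_only".
match_flags = merged["_merge"].astype(str).tolist()
```

If matching on the title alone is what is wanted, passing `on="title"` would be one way to do it; pandas then keeps both abstract columns with suffixes.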

In [None]:
papers = pd.merge(df, metadata, how = 'left')
papers

In [None]:
papers=papers.dropna()
papers

# 1. Data Cleaning

For the cleaning step, we will do the following:
*  remove stopwords, adding irrelevant words to the stopword list so that they are removed as well
*  remove punctuation such as ":+=%"
*  remove URLs from the columns
*  lemmatize the words in each row

In [None]:
stop = set(stopwords.words('english'))
stop |= set(['title','abstract','preprint','biorxiv','read','author','funder','copyright','holder','https','license','et','al','may',
             'also','medrxiv','granted','reuse','rights','used','reserved','peer','holder','figure','fig','table','doi','within'])
lemmatizer = WordNetLemmatizer()

In [None]:
def data_preprocessing(text):
    # remove URLs and collapse runs of whitespace (including newlines)
    text = ' '.join(re.sub(r'https?://\S+|www\.\S+','',text).split())
    text = text.replace('\n', '')
    # remove punctuation characters
    text = re.sub(r"[!@#$+%*:()/<.=,—']", '', text)
    # drop stopwords, then lemmatize the remaining words
    text = ' '.join([word for word in text.split() if word not in stop])
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return text

In [None]:
papers['title'] = papers['title'].apply(lambda x: data_preprocessing(x))
papers['abstract'] = papers['abstract'].apply(lambda x: data_preprocessing(x))
papers['full_text'] = papers['full_text'].apply(lambda x: data_preprocessing(x))

In [None]:
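# reset_index returns a new dataframe, so assign it back to keep the new index
papers = papers.reset_index(drop=True)
papers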

After applying the data_preprocessing function to the "title", "abstract", and "full_text" columns, we get our clean text. Every row of the three columns is now lower-cased, stripped of stopwords and punctuation, and lemmatized.
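
As a rough illustration of what the cleaning does to a single sentence, here is a small sketch that runs without the NLTK data. It assumes a tiny hand-written stop word set (`toy_stop`) and skips lemmatization, so `clean_text` is a simplified, hypothetical stand-in for `data_preprocessing`, not a replacement for it.

```python
import re

# Hypothetical, tiny stop word set; the notebook uses NLTK's English list
# plus extra domain words.
toy_stop = {"the", "of", "in", "a", "is", "and", "doi", "fig"}

def clean_text(text):
    # Drop URLs and collapse whitespace.
    text = " ".join(re.sub(r"https?://\S+|www\.\S+", "", text).split())
    # Remove the same punctuation characters as data_preprocessing.
    text = re.sub(r"[!@#$+%*:()/<.=,—']", "", text)
    # Remove stop words (lemmatization is omitted in this sketch).
    return " ".join(w for w in text.split() if w not in toy_stop)

sample = "the transmission of sars-cov (fig 2) is described in https://example.org/x."
print(clean_text(sample))
```

Note that the hyphen is not in the punctuation set, so tokens such as "sars-cov" and "covid-19" survive the cleaning intact.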

# 2. Data Visualization and Exploratory Data Analysis

In this part I will do the data visualization and exploratory data analysis. My aim is to make recommendations based on a specific query. Therefore, for the EDA, I will look at how the data is distributed by paper content: the most frequent words, the viruses discussed, and the topics covered.

### 2.1 Word Cloud
<p>In this section, I will generate word clouds from the full_text column, the abstract column, and the title column.

In [None]:
contentCorpus = papers.full_text.values
plt.figure(figsize = (12, 8))
# join all documents into one string; str() on a long NumPy array abbreviates it with "..."
wordcloud = WordCloud(width = 3000,height = 2000,background_color="white",max_words=1000).generate(' '.join(contentCorpus))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Figure 1. Full_text Corpus Word Cloud')

Figure 1 is the word cloud generated from the full_text column. From it we can see that "sequence", "virus", "protein", and "sample" are the most common words in the body text. This word cloud gives us a general idea of what the papers are about.
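
One detail worth checking when building the text for a word cloud: `str()` applied to a NumPy array with more than 1000 elements (NumPy's default print threshold) returns the abbreviated printout, with `...` in place of the middle elements. Joining the documents with spaces keeps all of them. A minimal sketch with made-up documents standing in for `papers.full_text.values`:

```python
import numpy as np

# 2000 made-up documents standing in for papers.full_text.values.
docs = np.array([f"document{i} virus" for i in range(2000)], dtype=object)

as_printed = str(docs)      # NumPy's summarized printout of the array
as_joined = " ".join(docs)  # every document, separated by spaces

print(len(as_printed), len(as_joined))
print("document1000" in as_printed, "document1000" in as_joined)
```

The printed version only contains the first and last few documents, so a word cloud generated from it reflects a handful of rows rather than the whole column.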

In [None]:
contentCorpus = papers.abstract.values
plt.figure(figsize = (12, 8))
# join all documents into one string; str() on a long NumPy array abbreviates it with "..."
wordcloud = WordCloud(width = 3000,height = 2000,background_color="white",max_words=1000).generate(' '.join(contentCorpus))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Figure 2. Abstract Corpus Word Cloud')

Figure 2 shows the word cloud generated from the abstracts. The words "ibv", "sequence", "virus", "sample", "dna", and "isolate" are among the largest in the word cloud.

In [None]:
contentCorpus = papers.title.values
plt.figure(figsize = (12, 8))
# join all documents into one string; str() on a long NumPy array abbreviates it with "..."
wordcloud = WordCloud(width = 3000,height = 2000,background_color="white",max_words=10000).generate(' '.join(contentCorpus))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Figure 3. Title Corpus Word Cloud')

Figure 3 is the word cloud for the title column. "Infections", "replications", "rna", and "pseudoknots" are the most common words in this column.

### 2.2 Count plot of the viruses discussed in the papers
<p>Figures 1, 2, and 3 give a general idea of what the papers are about. In this section, I will categorize the papers by the virus they discuss. This tells us how many of them discuss COVID-19 and how many discuss other viruses.
<p>The following code does the virus assignment. I added a column "virus" to record the virus for each paper, determined by looking for specific keywords in the full_text. For example, if an article mentions "covid-19" in the full_text, I assign the paper to covid-19. The keyword groups are checked in order, so a paper that matches several groups gets the first matching label.
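
The first-match-wins keyword assignment described above can also be written with `np.select`, which takes a list of conditions and a list of labels instead of nested calls. The sketch below uses made-up texts and a shortened betacoronavirus pattern; the names `toy` and `rules` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up texts standing in for papers.full_text.
toy = pd.DataFrame({"full_text": [
    "early reports from wuhan describe covid-19 cases",
    "mers-cov infection in camels",
    "ibv strains isolated from chickens",
    "influenza vaccination in adults",
]})

# (pattern, label) pairs, checked in order; the first match wins.
# "mers|sars" already covers "mers-cov", "sars-cov" and "sars-cov2".
rules = [
    ("covid-19|covid|wuhan", "covid-19"),
    ("alphacoronavirus|alpha-cov", "alphacoronavirus"),
    ("betacoronavirus|mers|sars", "betacoronavirus"),
    ("gammacoronavirus|ibv", "gammacoronavirus"),
]
conditions = [toy["full_text"].str.contains(p) for p, _ in rules]
labels = [label for _, label in rules]

toy["virus"] = np.select(conditions, labels, default="None")
print(toy["virus"].tolist())
```

Keeping the patterns and labels in one list makes it easier to add a category or reorder the priorities later.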

In [None]:
papers['virus'] = np.where(papers.full_text.str.contains('covid-19|covid|wuhan'), 'covid-19',
              np.where(papers.full_text.str.contains('alphacoronavirus|alpha-cov'), 'alphacoronavirus',
              np.where(papers.full_text.str.contains('betacoronavirus|mers|mers-cov|sars|sars-cov|sars-cov2'), 'betacoronavirus',
              np.where(papers.full_text.str.contains('gammacoronavirus|ibv'), 'gammacoronavirus',
              "None"))))

In [None]:
papers['virus'].value_counts()

In [None]:
plt.figure(figsize = (16, 8))
ax = sns.countplot(x="virus", data=papers)
ax.set_title('Figure 4. Distribution of the different viruses covered in the papers')
plt.xticks(rotation=45)

Figure 4 shows the distribution of the different viruses discussed in the dataset. As the figure shows, betacoronavirus has the highest count. This genus includes the MERS and SARS coronaviruses, both of which caused earlier outbreaks. COVID-19 is also discussed a lot: around 1000 papers already mention it. Around 400-500 papers talk about the gamma and alpha coronaviruses.

### 2.3 Distribution of the topic covered in the metadata
From the above, we know that around 1000 papers talk about COVID-19 and the others relate to other coronaviruses. In this section, we will look at the topics covered in those papers. I determined the topic of each paper by looking for specific keywords in the abstract. For example, if an article mentions "transmission" in the abstract, I assign the paper to the topic transmission.
<p>As in the previous section, a topic column is added and filled with the corresponding values. After the assignment step, I got 7467 papers on "genetics|origin|evolution", 2085 papers on "transmission", 6469 papers on "vaccines|therapeutics", 219 papers on "incubation", 788 papers on "non-pharmaceutical interventions", 5430 papers on "medical care", and 326 papers on "ethical|social". Lastly, 2073 papers were not assigned to any topic.
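
One thing to keep in mind with this kind of keyword matching: `str.contains` treats the pattern as a regular expression and matches it anywhere in the string, including inside longer words. Short keywords can therefore match more abstracts than intended. The sketch below, on made-up abstracts, compares plain substring matching with a word-boundary variant (`\b`), which is an assumption about a possible refinement, not what the notebook does.

```python
import pandas as pd

# Made-up abstracts for illustration.
abstracts = pd.Series([
    "samples were obtained from patients",   # "ai" appears inside "obtained"
    "an ai model predicts ventilation need",
])

substring_hits = abstracts.str.contains("ards|ecmo|ai").tolist()
whole_word_hits = abstracts.str.contains(r"\b(?:ards|ecmo|ai)\b").tolist()

print(substring_hits)
print(whole_word_hits)
```

The non-capturing group `(?:...)` keeps pandas from warning about capture groups in the pattern.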

In [None]:
papers['topic'] = np.where(papers.abstract.str.contains('transmission|transmitting'), 'transmission',
              np.where(papers.abstract.str.contains('incubation'), 'incubation',
              np.where(papers.abstract.str.contains('vaccines|vaccine|vaccination|therapeutics|therapeutic|drug|drugs'), 'vaccines|therapeutics',
              np.where(papers.abstract.str.contains('gene|origin|evolution|genetics|genomes|genomic'), 'genetics|origin|evolution',
              np.where(papers.abstract.str.contains('npi|npis|interventions|distancing|isolating|isolation|isolate|mask'), 'non-pharmaceutical interventions',
              np.where(papers.abstract.str.contains('ards|ecmo|respirators|eua|clia|ventilation|cardiomyopathy|ai'), 'medical care',
              np.where(papers.abstract.str.contains('ethical|social|media|rumor|misinformation|ethics|multidisciplinary'), 'ethical|social',
              "None")))))))

In [None]:
papers['topic'].value_counts()

In [None]:
plt.figure(figsize = (12, 8))
ax = sns.countplot(x="topic", data=papers)
ax.set_title('Figure 5. Distribution of different topics covered in the metadata')
plt.xticks(rotation=30)

From Figure 5, we can see that the topic "genetics|origin|evolution" has the highest count, followed by "vaccines|therapeutics" and "medical care". "transmission", "non-pharmaceutical interventions", "ethical|social", and "incubation" have the lowest counts.

# 3. Model selection and fitting to data

Even though we now have a general idea of what the articles in the metadata are about, the number of papers is still too large. Researchers will have a hard time finding the paper or topic they want to read among this many articles. This is why a recommender system is useful here. The steps I will take are as follows:
* tokenize the sentences in each row of "title", "abstract", and "full_text"
* create three new columns called "title_tokenized", "abstract_tokenized", and "full_text_tokenized"
* use word embeddings (word2vec) as features: each text is represented by the average of the vectors of its words. (I trained the model on the abstract column, as training on full_text takes a really long time and gives similar results.)
* store the vector of each row in new columns called "title_embedding" and "abstract_embedding"
* embed the query phrase as a vector using **word2vec**
* calculate the cosine similarity between the query vector and each row of the abstract embedding column
* store the similarity scores in a new column called "cos_sim"
* sort by that column and return the top 10 paper titles with the highest cosine score.
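
The averaging and cosine similarity steps above can be sketched with plain NumPy. The example assumes a hypothetical dictionary of 3-dimensional word vectors (`toy_vectors`) in place of a trained word2vec model; `average_vector` and `cosine` are illustrative helper names.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained word2vec model would
# supply 100-dimensional ones.
toy_vectors = {
    "virus":        np.array([1.0, 0.0, 0.0]),
    "transmission": np.array([0.0, 1.0, 0.0]),
    "vaccine":      np.array([0.0, 0.0, 1.0]),
}

def average_vector(tokens, vectors, size=3):
    # Mean of the vectors of all known tokens; zeros if none are known.
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

def cosine(u, v):
    # cos(u, v) = u.v / (|u| |v|), defined as 0 when either vector is all zeros.
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

query = average_vector(["virus", "transmission"], toy_vectors)
doc_a = average_vector(["transmission", "virus", "unknownword"], toy_vectors)
doc_b = average_vector(["vaccine"], toy_vectors)

print(cosine(query, doc_a), cosine(query, doc_b))
```

Because the vectors are averaged, word order does not matter and unknown words are ignored, so `doc_a` ends up pointing in the same direction as the query while `doc_b` is orthogonal to it.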




In [None]:
tokenized_sentences_title = [sentence.split() for sentence in papers['title'].values]
tokenized_sentences_abstract = [sentence.split() for sentence in papers['abstract'].values]
tokenized_sentences_full_text = [sentence.split() for sentence in papers['full_text'].values]

In [None]:
papers['title_tokenized'] = tokenized_sentences_title
papers['abstract_tokenized'] = tokenized_sentences_abstract
papers['full_text_tokenized'] = tokenized_sentences_full_text

In [None]:
# `size` is the argument name in gensim < 4.0; gensim 4.0 renamed it to `vector_size`
model = word2vec.Word2Vec(tokenized_sentences_abstract, size = 100, min_count=1)

In [None]:
def buildWordVector(word_list, size):
    #function to average all words vectors in a given paragraph
    vec = np.zeros(size)
    count = 0.
    for word in word_list:
        if word in model.wv:
            vec += model.wv[word]
            count += 1.
    if count != 0:
        vec /= count
    return vec

In [None]:
papers['title_embedding'] = papers['title_tokenized'].apply(lambda x: buildWordVector(x, size = 100))
papers['abstract_embedding'] = papers['abstract_tokenized'].apply(lambda x: buildWordVector(x, size = 100))


In [None]:
papers.head(10)

In [None]:
def embedding_query(query):
    query = query.split(' ')
    query_vec = np.zeros(100).reshape((1,100))
    count = 0
    for word in query:
        if word in model.wv:
            query_vec += model.wv[word]
            count += 1.
    if count != 0:
        query_vec /= count
    return query_vec

In [None]:
# reference: https://www.kaggle.com/mathijs02/recommend-a-paper-by-using-word-embeddings
def get_similarity(query,n_top):
    query_vec = embedding_query(query)
    papers["cos_sim"] = papers['abstract_embedding'].apply(
        lambda x: metrics.pairwise.cosine_similarity(
            [x],query_vec.reshape(1,-1))[0][0])
    top_list = (papers.sort_values("cos_sim", ascending=False)
                [["title","abstract","cos_sim"]]
                .drop_duplicates()[:n_top])
    return top_list

In [None]:
get_similarity('transmission incubation in human ',10)
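
`get_similarity` computes the cosine similarity one row at a time. A possible alternative is to stack all embeddings into one matrix and score every paper in a single operation. The NumPy-only sketch below uses made-up 2-dimensional embeddings; `rank_by_cosine` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def rank_by_cosine(doc_matrix, query_vec, n_top):
    # Cosine score of every row against the query; all-zero vectors score 0.
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    denom = doc_norms * query_norm
    scores = np.divide(doc_matrix @ query_vec, denom,
                       out=np.zeros(len(doc_matrix)), where=denom != 0)
    # Indices of the n_top highest scores, best first.
    order = np.argsort(-scores, kind="stable")[:n_top]
    return order, scores[order]

# Made-up 2-dimensional embeddings for four documents.
docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 2.0]])
query = np.array([0.0, 1.0])

top_idx, top_scores = rank_by_cosine(docs, query, n_top=2)
print(top_idx, top_scores)
```

In the notebook, the matrix could presumably be built with `np.vstack(papers['abstract_embedding'].values)` and the query vector flattened with `embedding_query(query).ravel()`; the returned indices would then be row positions in `papers`.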

# 4. Deriving insights about policy and guidance to tackle the outbreak based on model findings

In the previous section, I created a recommender system that uses word embeddings as features. The dataset is huge: it contains over 45000 papers, and 35000 of them have full text. Working through the dataset directly would be very hard and time-consuming. With the recommender system built in the previous section, researchers can simply provide a specific query or a paper title. The model then calculates the cosine similarity between the query and the papers in the dataset and returns the top 10 or 20 most similar papers to read. This saves time and could help in the fight against COVID-19.
<p>Next, I will run the following queries related to COVID-19 as examples:

* Risk factors of the novel coronavirus 2019
* covid-19 genetics, origin, or evolution 
* Drugs or medicines to treat COVID-19 patients


Running the above queries gives us the top 10 recommendations for each. (The results are shown below.)
<p>For the first query, on the risk factors of COVID-19, we get pretty good recommendations. Many of the recommended papers contain "risk", and they describe different kinds of risk factors. To researchers, governments, and healthcare professionals who are interested in the risk factors of the novel coronavirus, I would suggest reading the recommended articles.
<p>The second query asks about the genetics, origin, or evolution of COVID-19. The results are also satisfying: the recommended papers are about genetic diversity and virus evolution. Therefore, if researchers want to learn about the genetics, origin, or evolution of the novel virus, I would recommend reading those papers.
<p>The third query is about drugs or medicines to treat COVID-19 patients. The recommended papers are very good as well, and some of them recommend Chinese medicine. Thus, to healthcare professionals or public health departments who need information on drugs and medicines for treating patients, I would highly recommend reading the recommended papers.
<p>To conclude, with over 45000 papers related to the new coronavirus, it would be a waste of time for researchers to go through them one by one. A recommender system like the one in this notebook would save a lot of time. It is a fast way for researchers, governments, and healthcare professionals to find relevant material.

In [None]:
get_similarity('risk covid-19',10)

In [None]:
get_similarity('covid-19 genetics origin evolution',10)

In [None]:
get_similarity('drugs medicine to treat covid-19 patients',10)