# What do we know about virus genetics, origin, and evolution?
What do we know about the virus origin and management measures at the human-animal interface?

**Specifically, we want to know what the literature reports about:**

* Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.
* Access to geographically and temporally diverse sample sets to understand geographic distribution and genomic differences, and to determine whether more than one strain is in circulation. Multilateral agreements such as the Nagoya Protocol could be leveraged.
* Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.
* Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.
* Surveillance of mixed wildlife-livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.
* Experimental infections to test host range for this pathogen.
* Animal host(s) and any evidence of continued spill-over to humans.
* Socioeconomic and behavioral risk factors for this spill-over.
* Sustainable risk reduction strategies.

In [1]:
import heapq
import json
import os
import re

import matplotlib.pyplot as plt
import nltk
import pandas as pd
from tqdm import tqdm

In [2]:
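# Get the English stopwords from nltk (needs the 'stopwords' corpus, e.g. nltk.download('stopwords')).
# A set makes the membership tests in word_freq fast.
stopwords = set(nltk.corpus.stopwords.words("english"))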

dirs = ['biorxiV_medrxiv', 'comm_use_subset', 'custom_license', 'noncomm_use_subset']

docs = []
for d in dirs:
    print(d)
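    for file in tqdm(os.listdir(f"{d}/{d}")):
        filepath = f"{d}/{d}/{file}"
        # Use a context manager so each file handle is closed after parsing
        with open(filepath, 'rb') as f:
            j = json.load(f)
        title = j['metadata']['title']
        # Some papers have no abstract; fall back to an empty string
        try:
            abstract = j['abstract'][0]['text']
        except (KeyError, IndexError):
            abstract = ''

        # Concatenate the body paragraphs, separated by blank lines
        fulltext = ''
        for text in j['body_text']:
            fulltext += text['text'] + "\n\n"
        docs.append([title, abstract, fulltext])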

biorxiV_medrxiv
100%|██████████| 885/885 [00:01<00:00, 837.79it/s]

comm_use_subset
100%|██████████| 9118/9118 [00:16<00:00, 565.89it/s]

custom_license
100%|██████████| 16959/16959 [00:28<00:00, 590.25it/s]

noncomm_use_subset
100%|██████████| 2353/2353 [00:03<00:00, 678.31it/s]
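The loading loop above relies on each JSON file having a `metadata.title`, an optional `abstract` list, and a `body_text` list of paragraphs. As a minimal sketch, the same extraction can be written as a standalone function and tried on a made-up record. The name `extract_record` and the `sample` dict are illustrative assumptions, not part of the dataset or the original notebook.

```python
def extract_record(paper):
    """Return [title, abstract, fulltext] from one parsed paper dict."""
    title = paper.get("metadata", {}).get("title", "")
    # The abstract may be missing or an empty list; fall back to ''.
    abstract_parts = paper.get("abstract") or []
    abstract = abstract_parts[0]["text"] if abstract_parts else ""
    # Same paragraph separator as the loading loop.
    fulltext = "".join(p["text"] + "\n\n" for p in paper.get("body_text", []))
    return [title, abstract, fulltext]


# Hypothetical record, for illustration only.
sample = {
    "metadata": {"title": "Example title"},
    "abstract": [],
    "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
}
record = extract_record(sample)
print(record)
```

Checking for an empty list explicitly avoids a `try`/`except`, at the cost of being slightly more verbose.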


In [5]:
df = pd.DataFrame(docs, columns = ['title', 'abstract', 'fulltext'])
fulltexts = df['fulltext'].values

In [10]:
def clean_text(text):
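    # Remove bracketed numeric citation markers such as [1] or [23] (and empty [])
    text = re.sub(r'\[[0-9]*\]',' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+',' ', text)

    # Remove {{...}} template-style fragments
    text = re.sub(r'\{\{[\s\S]*?\}\}', '', text)

    # Remove doi links (currently disabled)
    #text = re.sub(r'^https://$', '',text)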
    return text

def clean_spchar_digs(text):
    # Removing special characters and digits
    text = re.sub('[^a-zA-Z]', ' ', text )
    text = re.sub(r'\s+', ' ', text)
    
    return text

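def word_freq(formatted_text):
    # Build a dictionary with words as keys and normalized frequency as values.
    # Note: tokens keep their original case here, while sent_scores lowercases
    # sentence words, so capitalized tokens never match there and are not
    # filtered by the lowercase stopword list.
    word_frequencies = {}
    for word in nltk.word_tokenize(formatted_text):
        if word not in stopwords:
            word_frequencies[word] = word_frequencies.get(word, 0) + 1

    # Nothing to normalize if no words survived filtering (avoids max() on an empty dict)
    if not word_frequencies:
        return word_frequencies

    # Divide by the maximum frequency so the most common word scores 1.0
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / maximum_frequency

    return word_frequencies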

def sent_scores(sentence_list, word_frequencies):
    # Score each sentence by summing the normalized frequencies of its words.
    # Sentences of 60 or more space-separated words are skipped.
    sentence_scores = {}
    for sent in sentence_list:
        if len(sent.split(' ')) >= 60:
            continue
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

    return sentence_scores

def get_summary(dirty_text):
    text = clean_text(dirty_text)
    formatted_text = clean_spchar_digs(text)

    sentence_list = nltk.sent_tokenize(text)

    word_frequencies = word_freq(formatted_text) 
    sentence_scores = sent_scores(sentence_list,word_frequencies)
    
    
    summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get) #first value is number highest scoring sentences to print
    summary = '\n\n '.join(summary_sentences)
    return summary

def get_improved_summary(searchlist):
    #get summary where all you have to do is provide the words you are searching for in a list
    covid_alias = ['CoV', 'COVID', 'Covid', 'corona virus', 'coronavirus', 'Coronavirus', 'Corona virus'] #depending on here the results w
    desired_sents = set()
    covid_sents = set()
    for text in fulltexts:
        # Note: splitting on '. ' drops the period, so the sentences rejoined
        # below are no longer separated by sentence-ending punctuation
        for sentence in text.split('. '):
            # Search terms are matched case-insensitively
            if any(i.lower() in sentence.lower() for i in searchlist): #using .lower changes the results dramatically
                desired_sents.add(sentence)
            # Aliases are matched case-sensitively
            if any(j in sentence for j in covid_alias):
                covid_sents.add(sentence)
    # Keep only sentences that mention both a search term and a coronavirus alias
    desired_sents = list(desired_sents.intersection(covid_sents))
    desired_text = ''
    for x in desired_sents:
        desired_text += ' ' + x
    # The remaining steps are identical to get_summary
    return get_summary(desired_text)
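The functions above implement a frequency-based extractive summary: each word gets a weight equal to its count divided by the count of the most common word, each sentence is scored by summing the weights of its words, and `heapq.nlargest` picks the top sentences. The sketch below shows the same idea on a toy text using only the standard library. It is an illustration under stated assumptions, not the notebook's code: `toy_summary`, its tiny stopword tuple, and the regex tokenizer are hypothetical stand-ins for the nltk calls, and it lowercases both the frequency table and the sentences.

```python
import heapq
import re


def toy_summary(text, n=1, stop=("the", "a", "of", "is", "in")):
    """Return the n highest-scoring sentences of text."""
    # Naive sentence split: whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    # Count lowercase non-stopword tokens.
    freq = {}
    for w in re.findall(r"[a-z]+", text.lower()):
        if w not in stop:
            freq[w] = freq.get(w, 0) + 1
    if not freq:
        return []

    # Normalize so the most frequent word has weight 1.0.
    top = max(freq.values())
    weights = {w: c / top for w, c in freq.items()}

    # A sentence's score is the sum of its word weights.
    scores = {}
    for s in sentences:
        score = sum(weights.get(w, 0.0) for w in re.findall(r"[a-z]+", s.lower()))
        if score > 0:
            scores[s] = score
    return heapq.nlargest(n, scores, key=scores.get)


text = (
    "Bats host coronaviruses. Civets were sold in the market. "
    "Bats and civets both carry coronaviruses."
)
print(toy_summary(text))
```

Because scores are sums, longer sentences tend to score higher, which is one reason the notebook caps sentence length at 60 words.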

### Livestock

In [22]:
slist = ['livestock', 'animals', 'hosts', 'spillover', 'suscept']
livestock_sum = get_improved_summary(slist)
print(livestock_sum)

Although three animals were identified as susceptible to SARS-CoV infection, the larger sale volume of civets in comparison to other animals in the market made them the target animals of subsequent surveillance studies Thus, even highly susceptible infant mice do not easily favor infection with non-mouse coronaviruses.

 The causative agent has now been determined to be a novel coronavirus (SARS-CoV) that is genetically distinct from any previously identified coronavirus known to cause disease in animals or humans [1, These studies indicate that an enormous, previously unrecognized reservoir for coronaviruses exists among animals, which is not unlike the reservoir that exists for influenza viruses in animals.

 The 12 bat cells were also tested for susceptible to HCoV-229E infection Interestingly, among characterization of many respiratory virus infections such as various influenza strains , respiratory syncytial virus , Nipah virus , and coronaviruses , other viruses have also recentl
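Several sentences in this output run together (for example "surveillance studies Thus, even"). That is consistent with `get_improved_summary` splitting on `'. '`, which discards the period, and then rejoining the pieces with spaces, so the sentence tokenizer no longer sees a boundary. The sketch below contrasts that with one possible alternative that keeps the period by splitting on the whitespace after it. `split_keep_period` is a hypothetical helper, not part of the notebook, and like the original split it will still break after abbreviations such as "Dr.".

```python
import re


def split_keep_period(text):
    """Split on whitespace that follows a period, keeping the period."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


text = "Civets were surveyed. Thus, mice resist infection."

lossy = text.split(". ")          # the period after "surveyed" is lost
kept = split_keep_period(text)    # each piece keeps its period

print(" ".join(lossy))
print(" ".join(kept))
```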

### Genetic sequencing and receptor binding

In [9]:
slist = ['field surveillance', 'genetic sequencing', 'receptor binding']
receptor_sum = get_improved_summary(slist)
print(receptor_sum)

Considerable evidence has proved that recombinant receptor binding domain (rRBD)-based subunit vaccine is a promising candidate vaccine against the SARS-CoV infection The viral receptor binding domain (RBD) of the S protein, located between residue 318 and 510 of the S1 domain , interacts with angiotensin-converting enzyme 2 (ACE2), which has been identified as the SARS-CoV receptor .

 Coronavirus infection starts with receptor binding via the S protein (Figure 1) S1-NTD -S1 subunit N terminal domain is shown in purple; S1-CTD -S1 subunit C terminal domain is shown in red; RBD -PDCoV receptor binding domain is shown in yellow; TMD -trans-membrane domain is shown in orange.

 The S protein of coronaviruses is usually responsible for receptor binding and is solely responsible for membrane fusion for cellular entry The discovery of HCoVs, their receptor usage, cell tropism and receptor binding domain (RBD) is summarized in Table 1 .

  The three dimensional structural analysis revealed t

### Strains

In [13]:
slist = ['strain']
strain_sum = get_improved_summary(slist)
print(strain_sum)

As shown in Additional file 1, thirty-two FCoV strains, seven canine coronavirus (CCoV) strains, four transmissible gastroenteritis coronavirus (TGEV) strains, a porcine respiratory coronavirus (PRCV) strain and a Mink coronavirus (MiCoV) strain were used for comparative genome analysis Therefore, our results clearly demonstrate the usefulness of the sequence variation information as molecular fingerprint in ''tagging'' SARS-CoV viral strains.

 Feline coronavirus strains including type I feline infectious peritonitis virus (FIPV) , type II FIPV WSU 79-1146, and type II feline enteric coronavirus (FECV) WSU 79-1683 were kindly provided by Dr However, there is no serotypic discrimination between strains because the ability of human serum to neutralize diverse MERS-CoV strains, including the EMC/2012 strain, does not differ (14) .

 Intense coronavirus surveillance is ongoing in China since the SARS-CoV outbreak, and it was demonstrated that co-roosting can maintain all strains with the 
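Each query above keeps only sentences that contain at least one search term, matched case-insensitively, and at least one coronavirus alias, matched case-sensitively. A minimal sketch of that filter on invented sentences follows. The `matches` helper, the shortened alias list, and the sample sentences are assumptions made for illustration.

```python
def matches(sentence, search_terms, aliases):
    """True if the sentence has a search term (any case) and an alias (exact case)."""
    has_term = any(t.lower() in sentence.lower() for t in search_terms)
    has_alias = any(a in sentence for a in aliases)
    return has_term and has_alias


aliases = ["CoV", "COVID", "coronavirus"]
sentences = [
    "Livestock were tested for SARS-CoV antibodies",
    "Livestock prices rose last year",
    "The CORONAVIRUS genome is large",
]
kept = [s for s in sentences if matches(s, ["livestock", "genome"], aliases)]
print(kept)
```

The third sentence is dropped because no alias matches its all-caps spelling, which shows how the case-sensitive alias check can silently narrow the results.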