Methodology: 
*   Use Beautiful Soup to web-scrape the article
*   Use the Hugging Face summarization pipeline
*   Summarize the article chunk by chunk
*   Concatenate the chunk summaries to get a summary of the whole article





#Loading important dependencies

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Fo

In [2]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests

#Summarization pipeline from Hugging Face

In [3]:
summarizer=pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



#Using BeautifulSoup to scrape an article

In [38]:

URL='https://www.biorxiv.org/content/10.1101/2021.05.25.445601v3.full' #Strainflow Paper by Dr. Tavpritesh Sethi from IIIT-D

In [39]:
req_url=requests.get(URL)
req_url


<Response [200]>

In [40]:
req_url.text



In [41]:
soup = BeautifulSoup(req_url.text, 'html.parser')
soup

<!DOCTYPE html>

<html dir="ltr" lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:mml="http://www.w3.org/1998/Math/MathML">
<head prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book#">
<!--[if IE]><![endif]-->
<link href="//d33xdlntwy0kbs.cloudfront.net" rel="dns-prefetch"/>
<link href="//www.google.com" rel="dns-prefetch"/>
<link href="//scholar.google.com" rel="dns-prefetch"/>
<link href="//www.google-analytics.com" rel="dns-prefetch"/>
<link href="//stats.g.doubleclick.net" rel="dns-prefetch"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="https://www.biorxiv.org/sites/default/files/images/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<link href="/content/10.1101/2021.05.25.445601v3.full.pdf" rel="alternate" title="Full Text (PDF)" type="application/pdf"/>
<link href="/content/10.1101/2021.05.25.445601v3.full.txt" 

In [42]:
results = soup.find_all(['h1', 'p'])
results

[<h1 class="highwire-cite-title" id="page-title">Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning</h1>,
 <p id="p-4">The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity. Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge. Here, we developed <em>Strainflow</em>, to learn the latent dimensions of 0.9 million high-quality SARS-CoV-2 genome sequences, and used machine learning algorithms to predict upcoming caseloads of SARS-CoV-2. In our <em>Strainflow</em> model, SARS-CoV-2 genome sequences were treated as documents, and codons as words to learn unsupervised codon embeddings (latent dimensions). We discovered that codon-level changes lead to a change in the entropy of the latent dimensions. We used a machine learning algorithm to find the most relevant 

In [43]:
text = [result.text for result in results]
text
# for result in results:
#   text=result.text
#   print(text)

['Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning',
 'The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity. Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge. Here, we developed Strainflow, to learn the latent dimensions of 0.9 million high-quality SARS-CoV-2 genome sequences, and used machine learning algorithms to predict upcoming caseloads of SARS-CoV-2. In our Strainflow model, SARS-CoV-2 genome sequences were treated as documents, and codons as words to learn unsupervised codon embeddings (latent dimensions). We discovered that codon-level changes lead to a change in the entropy of the latent dimensions. We used a machine learning algorithm to find the most relevant latent dimensions called Dimensions of Concern (DoCs) of SARS-CoV-2 spike genes,
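
The extraction step above can be tried offline on a small HTML string. This is a minimal sketch, not the notebook's own code: the HTML snippet and the helper name `extract_text_blocks` are made up for illustration. It also collapses whitespace and skips empty tags, which the cell above does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML standing in for the downloaded page.
html = """
<html><body>
  <h1 id="page-title">Example Title</h1>
  <p>First paragraph with <em>inline</em> markup.</p>
  <p>   </p>
  <p>Second paragraph.</p>
</body></html>
"""


def extract_text_blocks(html, tags=("h1", "p")):
    """Return the text of every matching tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all(list(tags)):
        # Collapse runs of whitespace and newlines into single spaces.
        text = " ".join(tag.get_text().split())
        if text:  # skip tags that contain only whitespace
            blocks.append(text)
    return blocks


blocks = extract_text_blocks(html)
print(blocks)
```

On a real page, `find_all(['h1', 'p'])` may also pick up navigation or footer paragraphs, so inspecting the first and last few blocks is worthwhile.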

In [44]:
Article_data = ' '.join(text)
Article_data



#Split article into individual sentences

In [45]:
#Append an <eos> marker after every '.', '?' and '!' (note: this also splits decimals such as 0.9)
Article_data = Article_data.replace('.', '.<eos>') 
Article_data = Article_data.replace('?', '?<eos>')
Article_data = Article_data.replace('!', '!<eos>')
Article_data



In [46]:
sentences=Article_data.split('<eos>')
sentences

['Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity.',
 ' Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge.',
 ' Here, we developed Strainflow, to learn the latent dimensions of 0.',
 '9 million high-quality SARS-CoV-2 genome sequences, and used machine learning algorithms to predict upcoming caseloads of SARS-CoV-2.',
 ' In our Strainflow model, SARS-CoV-2 genome sequences were treated as documents, and codons as words to learn unsupervised codon embeddings (latent dimensions).',
 ' We discovered that codon-level changes lead to a change in the entropy of the latent dimensions.',
 ' We used a machine learning algorithm to find the most relevant latent dimensions called Dimensions of Concern (DoCs) 
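
Note that the output above breaks "0.9 million" into two pieces, because every period is treated as a sentence end. One possible alternative, sketched here under the assumption that a sentence end is always followed by whitespace, is to split with a regular expression. The helper name `split_sentences` and the sample text are illustrative only.

```python
import re

# Split on whitespace that directly follows '.', '?' or '!'.
# A period inside a number such as "0.9" is followed by a digit, so it is kept.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text):
    """Split text into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


sample = (
    "Here, we developed Strainflow, to learn the latent dimensions of "
    "0.9 million sequences. Is it accurate? Yes!"
)
sentences_demo = split_sentences(sample)
print(sentences_demo)
```

This still splits after abbreviations such as "et al." when they are followed by a space, so it is an improvement rather than a complete sentence tokenizer.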

In [47]:
sentences[0]

'Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity.'

#Grouping the sentences into chunks of at most 500 words

In [48]:
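max_chunk = 500    # maximum number of words per chunk
current_chunk = 0  # index of the chunk currently being filled
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        # The current chunk exists: add the sentence if it still fits
        if len(chunks[current_chunk]) + len(sentence.split(' ')) <= max_chunk:
            chunks[current_chunk].extend(sentence.split(' '))
        else:
            # Otherwise start a new chunk with this sentence
            current_chunk += 1
            chunks.append(sentence.split(' '))
    else:
        # Only reached for the first sentence, when no chunk exists yet
        print(current_chunk)
        chunks.append(sentence.split(' '))

# Join each chunk's word list back into a single string
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = ' '.join(chunks[chunk_id])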

0
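
The same grouping idea can be written as a small reusable function. The sketch below is a simplified variant, not a line-for-line copy of the cell above: the name `chunk_sentences` is hypothetical, and it uses `split()` instead of `split(' ')`, so repeated spaces do not count as words.

```python
def chunk_sentences(sentences, max_words=500):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own
    (oversized) chunk, since sentences are never cut in the middle.
    """
    chunks = []
    current = []  # words collected for the chunk being built
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # ignore empty strings left over from splitting
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = ["one two three.", "four five.", "six seven eight nine.", "ten."]
demo_chunks = chunk_sentences(demo, max_words=5)
print(demo_chunks)
```

With `max_words=5`, the first two sentences fill one chunk exactly and the last two fill the next.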


In [49]:
chunks

 ' On the other hand, machine learning approaches are likely to be biased by underlying data characteristics and do not explain the biological basis of predictions.  Here, we propose Strainflow, a hybrid architecture of machine learning and language modeling, along with empirical experiments to demonstrate explainable genomic signals for tracking and predicting the spread of SARS-CoV-2 across countries.  Strainflow is rooted in language models for generating sequence embeddings that have recently shown promise for capturing biological insights from DNA sequences.  Typically, in language models, word embeddings represent the latent space (dimensions) of a corpus of text (Mikolov et al. , 2013) and can capture highly nonlinear and contextual relationships.  Codons (tri-nucleotides, 3-mers) translations represent a natural basis for word representations and have been utilized in the past for learning embedding models for modeling various outcomes such as mutation susceptibility (Yilmaz, 2

In [50]:
len(chunks) #number of chunks (each at most 500 words)

12

In [51]:
chunks[0]



#Summarizing each chunk

In [52]:
chunk_summary = summarizer(chunks, max_length=200, min_length=50, do_sample=False)
chunk_summary

[{'summary_text': ' Global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity . Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge .'},
 {'summary_text': ' Strainflow is rooted in language models for generating sequence embeddings that have recently shown promise for capturing biological insights from DNA sequences . The approach is extensible to global threats in pathogen surveillance such as emerging infections, pandemics, and antibiotic resistance . The global tSNE highlights dynamic emerging patterns derived from LR of spike genes of SARS-CoV-2 .'},
 {'summary_text': ' To investigate the information content in the latent dimensions or space (LD or LS) of the Strainflow model, we performed qualitative and quantitative analysis on 0. 9 million SARS-CoV-2 spike genes collected from December, 20

In [53]:
chunk_summary[0]

{'summary_text': ' Global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity . Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge .'}
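
A chunk of 500 words can tokenize to more tokens than the model accepts, especially for scientific text. A possible safeguard, sketched below, is to pass `truncation=True` so that over-long inputs are cut rather than raising an error. The wrapper `summarize_chunks` is a hypothetical helper, and `fake_summarizer` is only a stand-in so the example runs without downloading a model. In the notebook, the real `summarizer` object would be passed instead.

```python
def summarize_chunks(summarizer, chunks, max_length=200, min_length=50):
    """Summarize each chunk separately and return the summaries as strings."""
    summaries = []
    for chunk in chunks:
        result = summarizer(
            chunk,
            max_length=max_length,
            min_length=min_length,
            do_sample=False,
            truncation=True,  # cut inputs that exceed the model's token limit
        )
        # The pipeline returns a list with one dict per input text.
        summaries.append(result[0]["summary_text"].strip())
    return summaries


def fake_summarizer(text, **kwargs):
    """Stand-in for the real pipeline: returns the first sentence only."""
    return [{"summary_text": " " + text.split(".")[0] + " ."}]


demo_summaries = summarize_chunks(fake_summarizer, ["A b c. D e f.", "G h. I j."])
print(demo_summaries)
```

Truncation silently drops the tail of an over-long chunk, so lowering the chunk size is the safer option when no content should be lost.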

In [54]:
' '.join([summ['summary_text'] for summ in chunk_summary])



In [55]:
text = ' '.join([summ['summary_text'] for summ in chunk_summary])
text
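
The summaries shown earlier contain a space before each period and a leading space. A small clean-up pass, sketched here with a hypothetical helper `tidy_summary`, can normalize the joined text. It does not repair decimals that were already split, such as "0. 9".

```python
import re


def tidy_summary(text):
    """Remove spaces before punctuation and collapse repeated whitespace."""
    text = re.sub(r"\s+([.,;:?!])", r"\1", text)  # "challenge ." -> "challenge."
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    return text.strip()


raw = " Early prediction remains an open challenge .  Strainflow is rooted in language models ."
clean = tidy_summary(raw)
print(clean)
```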

