# Extractive text summarization project

* Extractive summarization identifies the most salient sentences in a document and groups them, unchanged, into a concise summary.

* Abstractive summarization builds an internal semantic representation of the document and then generates new text from it, so the summary can contain sentences that never appear in the source.

In [None]:
# `util` provides cos_sim, which is used below to compare sentence embeddings
from sentence_transformers import SentenceTransformer, util




In [None]:
%pip install LexRank


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting LexRank
  Downloading lexrank-0.1.0-py3-none-any.whl (69 kB)
Collecting urlextract>=0.7
  Downloading urlextract-1.8.0-py3-none-any.whl (21 kB)
Collecting path.py>=10.5
  Downloading path.py-12.5.0-py3-none-any.whl (2.3 kB)
Collecting path
  Downloading path-16.6.0-py3-none-any.whl (26 kB)
Collecting uritools
  Downloading uritools-4.0.1-py3-none-any.whl (10 kB)
Installing collected packages: uritools, path, urlextract, path.py, LexRank
Successfully installed LexRank-0.1.0 path-16.6.0 path.py-12.5.0 uritools-4.0.1 urlextract-1.8.0


In [None]:
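# Note: the pip package installed above imports as `lexrank` (lowercase).
# This capitalised import assumes a local LexRank.py that defines
# degree_centrality_scores (for example the one shipped with the
# sentence-transformers summarization example) is on the Python path.
from LexRank import degree_centrality_scores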

In [None]:
import nltk
nltk.download('punkt')
import numpy as np


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
document = """
The technology at the heart of the most innovative progress in health care artificial intelligence (AI) is in a subdomain called machine learning (ML), which describes the use of software algorithms to identify patterns in very large datasets. ML has driven much of the progress of health care AI over the past 5 years, demonstrating impressive results in clinical decision support, patient monitoring and coaching, surgical assistance, patient care, and systems management. Clinicians in the near future will find themselves working with information networks on a scale well beyond the capacity of human beings to grasp, thereby necessitating the use of intelligent machines to analyze and interpret the complex interactions between data, patients, and clinical decision makers. However, as this technology becomes more powerful, it also becomes less transparent, and algorithmic decisions are therefore progressively more opaque. This is problematic because computers will increasingly be asked for answers to clinical questions that have no single right answer and that are open-ended, subjective, and value laden. As ML continues to make important contributions in a variety of clinical domains, clinicians will need to have a deeper understanding of the design, implementation, and evaluation of ML to ensure that current health care is not overly influenced by the agenda of technology entrepreneurs and venture capitalists. The aim of this article is to provide a nontechnical introduction to the concept of ML in the context of health care, the challenges that arise, and the resulting implications for clinicians.
"""

In [None]:
#Split the document into sentences
sentences = nltk.sent_tokenize(document)
print("Num sentences:", len(sentences))

Num sentences: 7


In [None]:
#Compute the sentence embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings

tensor([[-0.0242,  0.0052, -0.0049,  ...,  0.0052,  0.0987, -0.0198],
        [-0.0220, -0.0485,  0.0473,  ..., -0.0291,  0.0181,  0.0178],
        [-0.0115, -0.0389, -0.0047,  ...,  0.0634,  0.0042, -0.0572],
        ...,
        [-0.0177,  0.0239, -0.0424,  ...,  0.0183,  0.0302,  0.0132],
        [ 0.0003, -0.0153, -0.0098,  ...,  0.0094,  0.0676, -0.0611],
        [ 0.0379, -0.0280, -0.0103,  ...,  0.0116,  0.0799, -0.0324]],
       device='cuda:0')

In [None]:

#Compute the pair-wise cosine similarities
cos_scores = util.cos_sim(embeddings, embeddings).cpu().numpy()
cos_scores

array([[1.0000001 , 0.7872413 , 0.5536606 , 0.21243115, 0.31392062,
        0.6049767 , 0.5642389 ],
       [0.7872413 , 1.        , 0.5510065 , 0.2097673 , 0.34670934,
        0.70290637, 0.628978  ],
       [0.5536606 , 0.5510065 , 1.        , 0.2658146 , 0.472893  ,
        0.5480004 , 0.38552487],
       [0.21243115, 0.2097673 , 0.2658146 , 0.99999994, 0.18209803,
        0.22326922, 0.09074488],
       [0.31392062, 0.34670934, 0.472893  , 0.18209803, 1.0000001 ,
        0.41686526, 0.29272667],
       [0.6049767 , 0.70290637, 0.5480004 , 0.22326922, 0.41686526,
        0.99999994, 0.70623785],
       [0.5642389 , 0.628978  , 0.38552487, 0.09074488, 0.29272667,
        0.70623785, 1.0000001 ]], dtype=float32)
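
The matrix above is symmetric with (approximately) 1 on the diagonal, which is what pairwise cosine similarity should produce; values such as `1.0000001` are float32 rounding. As a rough illustration of what is being computed, here is a minimal NumPy sketch on made-up 2-D vectors. The function name `cosine_similarity_matrix` and the toy data are assumptions for illustration only and are not part of sentence-transformers.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the matrix of pairwise cosine similarities between rows."""
    # Scale every row to unit length; the dot product of two unit
    # vectors is the cosine of the angle between them.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Hypothetical toy "embeddings", not real model output
toy_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
toy_cos_scores = cosine_similarity_matrix(toy_vectors)
print(np.round(toy_cos_scores, 4))
```

Vector length does not matter here: `[0.0, 2.0]` and `[0.0, 1.0]` would give the same similarities, because only direction is compared.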

In [None]:
#Compute the centrality for each sentence
centrality_scores = degree_centrality_scores(cos_scores, threshold=None)
centrality_scores


array([1.12481091, 1.17779551, 1.05247874, 0.60863274, 0.84301207,
       1.17100933, 1.02225816])
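
The seven scores above sum to about 7, so they average to 1. That is consistent with treating the similarity matrix as a random walk between sentences and rescaling its stationary distribution by the number of sentences. The sketch below illustrates that idea under stated assumptions (all similarities non-negative, graph connected, no thresholding); `centrality_sketch` and the toy matrix are hypothetical, and this is not the library's implementation.

```python
import numpy as np

def centrality_sketch(similarity: np.ndarray, iterations: int = 100) -> np.ndarray:
    """LexRank-style scores for a non-negative, connected similarity matrix."""
    n = similarity.shape[0]
    # Row-normalise so each row is a probability distribution over
    # which sentence the random walk visits next.
    transition = similarity / similarity.sum(axis=1, keepdims=True)
    # Power iteration: push a uniform distribution through the walk
    # until it settles on the stationary distribution.
    distribution = np.full(n, 1.0 / n)
    for _ in range(iterations):
        distribution = distribution @ transition
    # Rescale so that the scores average to 1.
    return distribution * n

# Hypothetical symmetric similarity matrix for three sentences
toy_similarity = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])
toy_centrality = centrality_sketch(toy_similarity)
row_sums = toy_similarity.sum(axis=1)
print(toy_centrality)
print(row_sums / row_sums.mean())  # same values for a symmetric matrix
```

For a symmetric matrix the stationary distribution is proportional to the row sums, so a sentence scores highly when it is similar to many other sentences. The same ratio holds in the real output: sentence 3 has the smallest row of similarities and the lowest centrality.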

In [None]:

#We argsort so that the first element is the sentence with the highest score
most_central_sentence_indices = np.argsort(-centrality_scores)
most_central_sentence_indices

array([1, 5, 0, 2, 6, 4, 3])

In [None]:
# Print the 2 sentences with the highest scores
print("\n\nSummary:")
for idx in most_central_sentence_indices[:2]:
    print(sentences[idx].strip())



Summary:
ML has driven much of the progress of health care AI over the past 5 years, demonstrating impressive results in clinical decision support, patient monitoring and coaching, surgical assistance, patient care, and systems management.
As ML continues to make important contributions in a variety of clinical domains, clinicians will need to have a deeper understanding of the design, implementation, and evaluation of ML to ensure that current health care is not overly influenced by the agenda of technology entrepreneurs and venture capitalists.
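
With `[:2]` the selected indices (1 and 5) happen to be in document order already, but a longer slice such as `[:3]` would print sentence 0 after sentence 5. One possible refinement, sketched below with made-up data, is to pick the top-k sentences by score and then sort the chosen indices so the summary reads in the original order. The helper `top_k_in_document_order` is a hypothetical name, not something from the notebook or its libraries.

```python
def top_k_in_document_order(scores, k):
    """Return the indices of the k highest scores, sorted by position."""
    # Rank all indices from highest to lowest score
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the best k, then restore their original document order
    return sorted(ranked[:k])

# Hypothetical sentences and scores, for illustration only
toy_sentences = ["First point.", "Side remark.", "Main claim.", "Closing note."]
toy_ranking_scores = [0.9, 0.4, 1.3, 0.7]
for i in top_k_in_document_order(toy_ranking_scores, k=2):
    print(toy_sentences[i])
```

In the notebook this would amount to sorting `most_central_sentence_indices[:k]` before the print loop.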
