In [6]:
import spacy

In [7]:
nlp = spacy.load('en_core_web_sm')

In [3]:
text = """This is an example text. We will use seven sentences and we will return 3. This blog is written by Yujian Tang. Yujian is the best software content creator. This is a software content blog focused on Python, your software career, and Machine Learning. Yujian's favorite ML subcategory is Natural Language Processing. This is the end of our example."""

In [4]:
doc = nlp(text)

In [10]:
dictionary = {}
for word in doc:
    word = word.text.lower()
    if word in dictionary:
        dictionary[word] += 1
    else:
        dictionary[word] = 1

In [11]:
dictionary

{'this': 4,
 'is': 6,
 'an': 1,
 'example': 2,
 'text': 1,
 '.': 7,
 'we': 2,
 'will': 2,
 'use': 1,
 'seven': 1,
 'sentences': 1,
 'and': 2,
 'return': 1,
 '3': 1,
 'blog': 2,
 'written': 1,
 'by': 1,
 'yujian': 3,
 'tang': 1,
 'the': 2,
 'best': 1,
 'software': 3,
 'content': 2,
 'creator': 1,
 'a': 1,
 'focused': 1,
 'on': 1,
 'python': 1,
 ',': 2,
 'your': 1,
 'career': 1,
 'machine': 1,
 'learning': 1,
 "'s": 1,
 'favorite': 1,
 'ml': 1,
 'subcategory': 1,
 'natural': 1,
 'language': 1,
 'processing': 1,
 'end': 1,
 'of': 1,
 'our': 1}

In [25]:
sents = []
for i, sent in enumerate(doc.sents):
    sent_score = 0
    for word in sent:
        word = word.text.lower()
        sent_score += dictionary[word]
    sents.append((sent.text.replace('\n', ' '), sent_score/len(sent), i))

In [26]:
sents

[('This is an example text.', 3.5, 0),
 ('We will use seven sentences and we will return 3.', 2.0, 1),
 ('This blog is written by Yujian Tang.', 3.125, 2),
 ('Yujian is the best software content creator.', 3.125, 3),
 ('This is a software content blog focused on Python, your software career, and Machine Learning.',
  2.2777777777777777,
  4),
 ("Yujian's favorite ML subcategory is Natural Language Processing.", 2.3, 5),
 ('This is the end of our example.', 3.0, 6)]
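
As a sanity check on the scores above, each value is the sum of a sentence's token counts divided by its number of tokens, punctuation included. The minimal sketch below recomputes two of them by hand. The token lists are typed in manually rather than produced by spaCy, and `counts` and `average_score` are names introduced only for this illustration.

```python
# Token counts copied from the `dictionary` output above (only the entries needed here).
counts = {'this': 4, 'is': 6, 'an': 1, 'example': 2, 'text': 1, '.': 7,
          'the': 2, 'end': 1, 'of': 1, 'our': 1}

def average_score(tokens, counts):
    """Sum the count of every token and divide by the number of tokens."""
    return sum(counts[t] for t in tokens) / len(tokens)

# Hand-written token lists (assumption: they mirror how spaCy split these sentences).
first = ['this', 'is', 'an', 'example', 'text', '.']
last = ['this', 'is', 'the', 'end', 'of', 'our', 'example', '.']

print(average_score(first, counts))  # 21 / 6 = 3.5
print(average_score(last, counts))   # 24 / 8 = 3.0
```

The full stop alone contributes 7 to every sentence, so punctuation and very common words dominate these scores.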

In [27]:
sents = sorted(sents, key=lambda x: -x[1])
sents = sorted(sents[:3], key=lambda x: x[2])
sents

[('This is an example text.', 3.5, 0),
 ('This blog is written by Yujian Tang.', 3.125, 2),
 ('Yujian is the best software content creator.', 3.125, 3)]

In [28]:
summary = ' '.join([sent[0] for sent in sents])
summary

'This is an example text. This blog is written by Yujian Tang. Yujian is the best software content creator.'

This summarizer is very rudimentary. To improve this form of extractive summarization, we can do the following:
1. Remove stop words and punctuation from the dictionary
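
A minimal sketch of what filtering the dictionary could look like, using only the standard library. The stop list `STOP` is a tiny hypothetical one, and a simple regex tokenizer stands in for spaCy's tokenizer, so the counts are illustrative rather than what spaCy would produce.

```python
import re
from collections import Counter

# Hypothetical mini stop list for illustration; a real list is much longer.
STOP = {'this', 'is', 'an', 'a', 'the', 'we', 'will', 'and', 'of', 'our', 'by', 'on', 'your'}

def content_word_counts(text):
    """Count lowercase word tokens that are not stop words; punctuation is dropped."""
    # Assumption: a simple regex tokenizer stands in for spaCy's tokenizer.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOP)

sample = "This is an example text. This is the end of our example."
print(content_word_counts(sample))
```

With the filler words gone, only `example`, `text`, and `end` remain to drive the sentence scores.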

#### Summary theory

1. **Extractive summarization**
Attempts to identify significant sentences and adds them to the summary, so the summary contains exact sentences from the original text.
2. **Abstractive summarization**
Attempts to identify important sections, interpret the context, and generate a new summary in its own words.

In [9]:
from spacy.lang.en.stop_words import STOP_WORDS
from heapq import nlargest

In [30]:
text = """"In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow. The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. 
This program also included developer-focused AI school that provided a bunch of assets to help build AI skills."""

In [31]:
doc = nlp(text)

In [34]:
freq = {}

for token in doc:
    word = token.text.lower()
    # Count only content words: skip stop words and punctuation
    if word not in STOP_WORDS and not token.is_punct:
        freq[word] = freq.get(word, 0) + 1

In [35]:
freq

{'attempt': 2,
 'build': 5,
 'ai': 14,
 'ready': 3,
 'workforce': 2,
 'microsoft': 4,
 'announced': 2,
 'intelligent': 3,
 'cloud': 5,
 'hub': 3,
 'launched': 1,
 'empower': 1,
 'generation': 1,
 'students': 2,
 'skills': 5,
 'envisioned': 1,
 'year': 2,
 'collaborative': 1,
 'program': 8,
 'support': 2,
 '100': 1,
 'institutions': 2,
 'infrastructure': 2,
 'course': 1,
 'content': 1,
 'curriculum': 1,
 'developer': 3,
 'development': 2,
 'tools': 2,
 'access': 1,
 'services': 4,
 'redmond': 1,
 'giant': 1,
 'wants': 1,
 'expand': 1,
 'reach': 1,
 'planning': 1,
 'strong': 1,
 'ecosystem': 1,
 'india': 2,
 'set': 2,
 'core': 1,
 'iot': 1,
 'selected': 1,
 'campuses': 1,
 'company': 2,
 'provide': 2,
 'azure': 2,
 'cognitive': 2,
 'bot': 1,
 'machine': 1,
 'learning': 2,
 'according': 1,
 'manish': 1,
 'prakash': 1,
 'country': 1,
 'general': 1,
 'manager': 1,
 'ps': 1,
 'health': 1,
 'education': 1,
 'said': 1,
 'defining': 1,
 'technology': 1,
 'time': 1,
 'transforming': 1,
 'lives':

In [36]:
max_freq = max(freq.values())
for token in freq.keys():
    freq[token] = freq[token]/max_freq

In [37]:
freq

{'attempt': 0.14285714285714285,
 'build': 0.35714285714285715,
 'ai': 1.0,
 'ready': 0.21428571428571427,
 'workforce': 0.14285714285714285,
 'microsoft': 0.2857142857142857,
 'announced': 0.14285714285714285,
 'intelligent': 0.21428571428571427,
 'cloud': 0.35714285714285715,
 'hub': 0.21428571428571427,
 'launched': 0.07142857142857142,
 'empower': 0.07142857142857142,
 'generation': 0.07142857142857142,
 'students': 0.14285714285714285,
 'skills': 0.35714285714285715,
 'envisioned': 0.07142857142857142,
 'year': 0.14285714285714285,
 'collaborative': 0.07142857142857142,
 'program': 0.5714285714285714,
 'support': 0.14285714285714285,
 '100': 0.07142857142857142,
 'institutions': 0.14285714285714285,
 'infrastructure': 0.14285714285714285,
 'course': 0.07142857142857142,
 'content': 0.07142857142857142,
 'curriculum': 0.07142857142857142,
 'developer': 0.21428571428571427,
 'development': 0.14285714285714285,
 'tools': 0.14285714285714285,
 'access': 0.07142857142857142,
 'services
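
Dividing by the maximum count rescales every weight into the range (0, 1], with the most frequent word (`ai`, 14 occurrences) at exactly 1.0. The sketch below repeats the step on a handful of counts copied from the earlier `freq` output. It builds a new dictionary called `normalized` (a name used only here) instead of overwriting `freq` in place.

```python
# Raw counts copied from the `freq` output before normalization.
raw = {'ai': 14, 'program': 8, 'build': 5, 'attempt': 2, 'launched': 1}

max_freq = max(raw.values())  # 14, the count of 'ai'
normalized = {word: count / max_freq for word, count in raw.items()}

for word, weight in normalized.items():
    print(f"{word}: {weight:.4f}")
```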

In [38]:
sent_tokens = [sent for sent in doc.sents]
sent_scores = {}
for sent in sent_tokens:
    for word in sent:
        word = word.text.lower()
        if word in freq.keys():
            if sent in sent_scores.keys():
                sent_scores[sent] += freq[word]
            else:
                sent_scores[sent] = freq[word]

In [39]:
sent_scores

{"In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills.: 5.0,
 Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services.: 5.857142857142858,
 As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses.: 4.2142857142857135,
 The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.: 4.428571428571429,
 According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, With AI being the
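
Because each score here is a plain sum of word weights, longer sentences tend to score higher simply by containing more words. One possible variation, not used in this notebook, is to divide by the number of scored words, similar to how the first summarizer divided by `len(sent)`. A minimal sketch on made-up weights follows; every name and number in it is hypothetical.

```python
# Hypothetical word weights, for illustration only.
weights = {'ai': 1.0, 'program': 0.57, 'skills': 0.36}

def summed_score(words, weights):
    """Sum of the weights of the words that have one (the approach used above)."""
    return sum(weights[w] for w in words if w in weights)

def averaged_score(words, weights):
    """Same sum, divided by the number of weighted words to reduce length bias."""
    scored = [weights[w] for w in words if w in weights]
    return sum(scored) / len(scored) if scored else 0.0

short = ['ai', 'skills']
long = ['ai', 'program', 'skills', 'program', 'skills']

print(summed_score(short, weights), summed_score(long, weights))
print(averaged_score(short, weights), averaged_score(long, weights))
```

In this toy case the longer sentence wins on the summed score, while the shorter one wins on the averaged score.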

In [3]:
def summarize(text: str, percentage: float) -> str:
    """Return an extractive summary with roughly `percentage` of the sentences in `text`."""
    if not 0 <= percentage <= 1:
        raise ValueError("Percentage should be a float between 0 and 1")
    doc = nlp(text)

    # Count content words, skipping stop words and punctuation
    freq = {}
    for token in doc:
        word = token.text.lower()
        if word not in STOP_WORDS and not token.is_punct:
            freq[word] = freq.get(word, 0) + 1

    # Nothing to score (empty text or only stop words and punctuation)
    if not freq:
        return ''

    # Normalize counts by the most frequent word so weights fall in (0, 1]
    max_freq = max(freq.values())
    for word in freq:
        freq[word] = freq[word] / max_freq

    # Score each sentence as the sum of its word weights
    sent_tokens = list(doc.sents)
    sent_scores = {}
    for sent in sent_tokens:
        for token in sent:
            word = token.text.lower()
            if word in freq:
                sent_scores[sent] = sent_scores.get(sent, 0) + freq[word]

    # Keep the top-scoring sentences, then restore their original order
    select_length = int(len(sent_tokens) * percentage)
    top_sents = nlargest(select_length, sent_scores, key=sent_scores.get)
    top_sents = sorted(top_sents, key=lambda sentence: sentence.start)
    return ' '.join(sent.text for sent in top_sents)

In [58]:
summarize(text, 0.3)

'"In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.'

In [59]:
summarize(text, 0.2)

'"In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services.'
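
The selection step can be looked at in isolation: pick the `k` highest-scoring sentences, then put them back in document order. The sketch below does this with the four scores visible in the truncated `sent_scores` output, keyed by sentence position. `select_positions` is a helper introduced only for this illustration, and the remaining sentences are left out because their scores are not shown.

```python
from heapq import nlargest

# The four sentence scores visible in the truncated `sent_scores` output,
# keyed by sentence position (the remaining sentences are omitted here).
scores = {0: 5.0, 1: 5.857142857142858, 2: 4.2142857142857135, 3: 4.428571428571429}

def select_positions(scores, percentage):
    """Pick the top-scoring positions, then return them in document order."""
    k = int(len(scores) * percentage)  # int() truncates toward zero
    top = nlargest(k, scores, key=scores.get)
    return sorted(top)

print(select_positions(scores, 0.5))   # [0, 1]
print(select_positions(scores, 0.75))  # [0, 1, 3]
```

Because `int()` truncates, a small percentage on a short text can select zero sentences and produce an empty summary.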

### Extractive summarization of news articles

In [1]:
from newspaper import Article

In [61]:
url = 'https://www.theguardian.com/world/2023/nov/26/russia-wages-electronic-warfare-using-uk-made-tech-ukraine-dossier-claims'
article = Article(url)
article.download()
article.parse()
article.text

'Many of the countries that have sanctioned Russia over the war in Ukraine need to take urgent action to disrupt the supply of technology for its electronic warfare campaign, according to a new report.\n\nThe dossier compiled by Ukraine and circulated to the major countries which have imposed sanctions identifies key Russian firms involved in the development and production of electronic military equipment. It says the UK and other countries have not yet sanctioned some of the firms involved.\n\nIt identified what it claims is technology made by British firms in some of the advanced electronic equipment engaged in the conflict, and says more effective action is required to block the use of foreign components.\n\nThe report states: “The effectiveness of Russian electronic systems largely depends on access to imported components that are widely used in the production of such systems ... Specific steps should be taken immediately to reduce the Russian military-industrial complex’s capabili

In [65]:
summarize(article.text.replace("\n", ''), 0.1)

'Many of the countries that have sanctioned Russia over the war in Ukraine need to take urgent action to disrupt the supply of technology for its electronic warfare campaign, according to a new report. The dossier compiled by Ukraine and circulated to the major countries which have imposed sanctions identifies key Russian firms involved in the development and production of electronic military equipment. They include the entities Strela Research and Production Association, Protek Research and Development Enterprise and Radioelectronic Technologies Concern, which it says have not been sanctioned by the UK.It also names components from British firms which it says have been found in Russian electronic warfare.'
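
The fused `UK.It` in the summary above most likely comes from replacing `\n` with an empty string, which joins paragraphs without a space. One alternative is to collapse every run of whitespace into a single space. This is a minimal sketch: `normalize_whitespace` is a helper name introduced here, and `raw` is a shortened excerpt assembled for illustration in the shape of `article.text`. A summary produced this way may differ from the output shown above, since sentence boundaries can change.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    return ' '.join(text.split())

# Shortened excerpt assembled for illustration; paragraphs are separated by '\n\n'.
raw = 'have not been sanctioned by the UK.\n\nIt also names components from British firms.'

print(raw.replace('\n', ''))      # paragraphs fused together
print(normalize_whitespace(raw))  # paragraphs separated by one space
```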

In [2]:
url = 'https://wien.orf.at/stories/3239608/'
article = Article(url)
article.download()
article.parse()
article.text

'„Er wurde auf Anordnung der Staatsanwaltschaft enthaftet“, teilte Florian Kreiner, der Verteidiger des 47-jährigen Tschetschenen, am Sonntagabend mit. Die Staatsanwaltschaft Wien bestätigte das auf APA-Anfrage. Die Anordnung sei erlassen worden, „weil sich der dringende Tatverdacht nicht erhärtet hat“, wie Behördensprecherin Nina Bussek feststellte.\n\nVerteidiger: „Von Anfang an kein Beweis“\n\nUrsprünglich hatte es geheißen, der bisher unbescholtene Familienvater habe gemeinsam mit einem 28-jährigen Tadschiken und dessen 27 Jahre alter Ehefrau, die seit 2022 in Wien lebten, einem länderübergreifenden radikalislamischen Terrornetzwerk angehört, das Anschläge in Deutschland und in Wien erwogen haben soll. Die Staatsanwaltschaft Wien ermittelt gegen die mutmaßliche Zelle der Terrorgruppe Islamischer Staat Provinz Khorasan (ISPK) wegen terroristischer Vereinigung (§ 278b StGB) in Verbindung mit terroristischen Straftaten (§ 278c StGB).\n\nDer 28 Jahre alte Tadschike und seine Ehefrau be

In [10]:
summarize(article.text.replace("\n", ''), 0.05)

'Verteidiger: „Von Anfang an kein Beweis“Ursprünglich hatte es geheißen, der bisher unbescholtene Familienvater habe gemeinsam mit einem 28-jährigen Tadschiken und dessen 27 Jahre alter Ehefrau, die seit 2022 in Wien lebten, einem länderübergreifenden radikalislamischen Terrornetzwerk angehört, das Anschläge in Deutschland und in Wien erwogen haben soll.'
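
Assuming the English pipeline (`en_core_web_sm`) and the English `STOP_WORDS` imported earlier are still in use for this German article, common German function words are not filtered out and count toward the word frequencies. The sketch below illustrates the effect with two tiny, hypothetical stop lists and a plain `split()` in place of spaCy's tokenizer.

```python
# Hypothetical mini stop lists for illustration; real lists are far longer.
ENGLISH_STOP = {'the', 'of', 'and', 'in', 'to', 'is'}
GERMAN_STOP = {'er', 'der', 'die', 'das', 'und', 'in', 'auf', 'mit', 'wurde'}

def content_words(words, stop_words):
    """Return the lowercase words that are not in the given stop list."""
    return [w.lower() for w in words if w.lower() not in stop_words]

words = 'Er wurde auf Anordnung der Staatsanwaltschaft enthaftet'.split()

print(content_words(words, ENGLISH_STOP))  # nothing is filtered
print(content_words(words, GERMAN_STOP))   # only the content words remain
```

spaCy ships German stop words in `spacy.lang.de.stop_words` and German pipelines such as `de_core_news_sm`, which would be a natural fit for this article.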