# Text Summarization with spaCy

Spacy is a Python library that provides various natural language processing (NLP) capabilities, including text summarization. 

Text summarization involves condensing a longer piece of text into a shorter summary, while retaining the most important information. 

Spacy's summarization capabilities rely on machine learning algorithms that identify the most important sentences in a text and use them to generate a summary. 

Spacy's summarization capabilities can be customized by adjusting various parameters, such as the length of the summary and the importance assigned to different types of words and phrases. 

Text summarization with Spacy can be used in a variety of applications, such as news articles, legal documents, and academic papers, to quickly and efficiently distill important information from longer texts.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer



In [3]:
# Load CSV file into DataFrame
df = pd.read_csv('output-merge.csv')
df = df[['Title', 'Content','Category','SubCategory']]
df

Unnamed: 0,Title,Content,Category,SubCategory
0,\n Consumer confidence worsened in April ...,US consumer confidence worsened in April as Am...,MONEY,economy
1,\n GM earnings much better than expected ...,General Motors reported a much better-than-exp...,BUSINESS,business
2,"\n Chevrolet Bolt, GM’s first popular ele...","The Chevrolet Bolt EV, General Motors first fu...",BUSINESS,business
3,\n New home sales rise for the fourth mon...,"New home sales rose in March, climbing for the...",BUSINESS,homes
4,\n Tucker Carlson out at Fox News\n,"Fox News and Tucker Carlson, the right-wing ex...",ENTERTAINMENT,
...,...,...,...,...
1180,Hospitals and health care facilities should dr...,Walensky told Congress that masking guidance ‘...,HEALTHY LIVING,health
1181,ChatGPT for health care providers: Can the AI ...,"OpenAI CEO Sam Altman said that he was ""a litt...",HEALTHY LIVING,health
1182,Want to get better sleep? Exercise for this lo...,Get the rest you need with these simple tweaks...,HEALTHY LIVING,health
1183,Massachusetts town says Avian Flu detected amo...,Fox News Flash top headlines are here. Check o...,HEALTHY LIVING,health


In [4]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest
spacy.cli.download("en_core_web_md")
from string import punctuation
punctuation=punctuation+ '\n'

Collecting en_core_web_md==2.3.1
  Using cached en_core_web_md-2.3.1-py3-none-any.whl
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_md')


In [5]:
import nltk

# Download the stopwords corpus (only need to do this once)
nltk.download('stopwords')

# Load the stopwords into a set
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/keerthanaakannan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
nlp = spacy.load('en_core_web_md')

Another way of doing it by defining a function called textSummarizer()

In [7]:
def textSummarizer(text, percentage):
    
    # load the model into spaCy
#     nlp = spacy.load('en_core_web_sm')
    
    # pass the text into the nlp function
    doc= nlp(text)
    
    ## The score of each word is kept in a frequency table
    tokens=[token.text for token in doc]
    freq_of_word=dict()
    
    # Text cleaning and vectorization 
    for word in doc:
        if word.text.lower() not in list(STOP_WORDS):
            if word.text.lower() not in punctuation:
                if word.text not in freq_of_word.keys():
                    freq_of_word[word.text] = 1
                else:
                    freq_of_word[word.text] += 1
                    
    # Maximum frequency of word
    max_freq=max(freq_of_word.values())
    
    # Normalization of word frequency
    for word in freq_of_word.keys():
        freq_of_word[word]=freq_of_word[word]/max_freq
        
    # In this part, each sentence is weighed based on how often it contains the token.
    sent_tokens= [sent for sent in doc.sents]
    sent_scores = dict()
    for sent in sent_tokens:
        for word in sent:
            if word.text.lower() in freq_of_word.keys():
                if sent not in sent_scores.keys():                            
                    sent_scores[sent]=freq_of_word[word.text.lower()]
                else:
                    sent_scores[sent]+=freq_of_word[word.text.lower()]
    
    
    len_tokens=int(len(sent_tokens)*percentage)
    
    # Summary for the sentences with maximum score. Here, each sentence in the list is of spacy.span type
    summary = nlargest(n = len_tokens, iterable = sent_scores,key=sent_scores.get)
    
    # Prepare for final summary
    final_summary=[word.text for word in summary]
    
    #convert to a string
    summary=" ".join(final_summary)
    
    # Return final summary
    return summary

## Summarising for each news article

In [8]:
df['Summary'] = df['Content'].apply(lambda x: textSummarizer(x,0.3))

In [16]:
df.to_excel('summary.xlsx')

# Summary for each category of news article

In [10]:
# Apply textSummarizer function to 'Content' column
# df['Summary'] = df['Content'].apply(lambda x: textSummarizer(x,0.3))
# df
df['Summary'] = df['Content'].astype(str).apply(lambda x: textSummarizer(x, 0.1))

In [11]:
def tclean(title):
    
    title = title.replace("\n", "").strip()
    return title
df['Title'] = df['Title'].apply(tclean)


In [12]:
df[['Category','SubCategory','Title','Summary']]

Unnamed: 0,Category,SubCategory,Title,Summary
0,MONEY,economy,Consumer confidence worsened in April as Ameri...,The business group’s measure of economic expec...
1,BUSINESS,business,GM earnings much better than expected as reven...,It also said it now expects to earnings betwee...
2,BUSINESS,business,"Chevrolet Bolt, GM’s first popular electric ve...",The Michigan assembly plant where it’s produce...
3,BUSINESS,homes,New home sales rise for the fourth month in a row,"New home sales rose in March, climbing for the..."
4,ENTERTAINMENT,,Tucker Carlson out at Fox News,"Jonathan Greenblatt, the head of the Anti-Defa..."
...,...,...,...,...
1180,HEALTHY LIVING,health,Hospitals and health care facilities should dr...,"""After three years of universal masking in hea..."
1181,HEALTHY LIVING,health,ChatGPT for health care providers: Can the AI ...,""" This might mean making up diseases the patie..."
1182,HEALTHY LIVING,health,Want to get better sleep? Exercise for this lo...,TEXAS HOSPITAL SEES 30 INFANT DEATHS IN 15-MON...
1183,HEALTHY LIVING,health,Massachusetts town says Avian Flu detected amo...,"The town of Swansea, 50 miles south of Boston,..."


In [15]:
grouped_df = df.groupby(['Category','SubCategory','Title','Summary']).count()
grouped_df
filtered = grouped_df.loc[('WORLD NEWS','europe')]
filtered

Unnamed: 0_level_0,Unnamed: 1_level_0,Content
Title,Summary,Unnamed: 2_level_1
Analysis: Will Italy’s PM stop the boats or will the boats stop her?,"“This is a serious issue, I think this is the most relevant crisis that she’s facing and the most relevant challenge for her government now,” Giovanni Orsina, director of the School of Government at Luiss Guido Carli University in Rome, told CNN, adding that she is addressing migration on two fronts: by putting pressure on Europe and by taking it very seriously at home. “If you ask migration experts if she could stop boats, the answer would be no,” she said, adding that the only thing that has ever stopped migration was the Covid-19 pandemic. “The turning point (is) when the migrants stop headlining the news and start becoming the people in front of their homes, you find them in the streets and squares in small Italian towns, then it becomes existential not abstract.”",1
Cocaine worth nearly $440 million found floating in the sea off Italy,,1
"Fleet of Russian spy ships has been gathering intelligence in Nordic waters, investigation finds","“We saw a couple of months ago that Russian ships, a Russian ship, wanted to enter the area where Dutch windmill parks in the North Sea are located with the intention to see how the command and control structure of these windmill farms, how it is operated,” Jan Swillens, head of the Dutch Military Intelligence and Security Service (MIVD), told reporters. Russia has a fleet of suspected spy ships operating in Nordic waters as part of a program for the potential sabotage of underwater cables and wind farms in the region, according to a joint investigation by the public broadcasters of Sweden, Denmark, Norway and Finland.",1
Foreign powers rescue nationals while Sudanese must fend for themselves,"One British citizen, named Fatima, told the BBC that she feels “abandoned” by the government, calling the situation on the ground “traumatizing.” An Egyptian diplomat, Mohamed Al-Gharawi, was shot and killed on his way back to the Egyptian embassy in Khartoum on Monday following “evacuation procedures for Egyptian citizens in Sudan,” Egypt’s Ministry of Foreign Affairs said. EU High Representative for Foreign Affairs and Security Policy Josep Borrell said Monday that more than 1,000 EU nationals have been evacuated so far, calling it a “successful operation.” “We have managed to relocate some our staff by road to Kassala and Gedaref and will try to evacuate some non-essential personnel by road to Ethiopia and Chad,” Africa’s regional spokesperson Alyona Synenko told CNN in a statement. At the same time, Britons in Sudan said they feel “abandoned” by the UK government’s move to evacuate diplomats only. US special forces helped bring almost 100 people – mostly US embassy staff, as well as a small number of diplomatic professionals from other countries – to safety over the weekend, US officials said. The RSF responded by offering their sincerest condolences to the Egyptian government and said that they will “spare no effort in cooperating with the brothers in the Republic of Egypt to uncover the facts about the Gharawi incident.” As regions of Sudan are battered by the violence, the International Committee of the Red Cross said it will have to “adapt” its emergency response.",1
Italian minister sparks fury for saying immigration leads to ‘ethnic replacement’,Far right White supremacist groups and conservative media personalities in both Europe and the US have been widely condemned in recent years for attempting to inflame nativist feelings among conservative White populations by warning that immigrants are “replacing” native born populations.,1
"NATO chief says Ukraine’s ‘rightful place’ is in the alliance, but Kyiv likely won’t join any time soon","In his first visit to Ukraine since the invasion began, Stoltenberg said he discussed a “multiyear support initiative” with President Volodymyr Zelensky, adding that it would help Ukraine transition from Soviet-era equipment and doctrines to “NATO standards.” These standards include being able to demonstrate a “functioning democratic political system based on a market economy” and “the fair treatment of minority populations,” according to NATO’s website, along with other things that are more difficult to demonstrate during wartime. All allies agree on that,” Stoltenberg said, adding that the main focus of the alliance now is “to ensure that Ukraine prevails.”",1
One of the last survivors of the Warsaw Ghetto resistance tells of the bravery of those who dared to stand up against the Nazis,"“She knew that if I stayed I would not survive and it was more important to her that someone would be left alive to tell the story and because of that I agreed to leave the ghetto,” said Vitis-Shomron, who made a commitment to tell the world what had happened after she escaped with her mother and sister on April 17. Find out more: Remembering the Warsaw Ghetto Uprising and the people who fought back “We would spread the news to different parts of the ghetto,” she said. One of the last surviving members of the Warsaw Ghetto resistance has told CNN the world must never forget the bravery of those who stood up to the Nazis, 80 years after World War II’s largest Jewish uprising. “I wanted to stay and fight with my fellow fighters,” she told CNN, explaining that she had sought the advice of one of the commanders within the resistance movement.",1
Poland and Hungary ban Ukrainian grain amid glut from neighbor,"Hungarian Agriculture Minister István Nagy on Sunday announced Budapest would also temporarily ban the import of grain, oil seeds and other agricultural products from Ukraine, saying the move was necessary “in the absence of meaningful EU measures.”",1
"Russia having difficulty making new weapons, but might have enough older ones, report says","“Moscow is estimated to have lost anywhere from 1,845 to 3,511 tanks one year into the war,” the CSIS report says, with losses of its newer, upgraded T-72B3 main battle tank, first delivered in 2013, noted as especially damaging. In the case of high-quality ball bearings – “critical to producing any type of moving vehicle,” the report said – 55% of Russia’s pre-war supply came from Europe and North America. “This is the crux of this war in its second year: the Russian military can rely on its mass and continue feeding older or less than state-of-the-art technology as long as it thinks it can simply outlast the Western deliveries of weapons and systems to Ukraine,” the CSIS report says.",1
"Russia’s Lavrov hosts UN meeting on ‘international peace,’ gets slammed by Western diplomats","“As was the case during the Cold War, we have reached the dangerous, possibly even more dangerous threshold,” Lavrov said, accusing the “United States and its allies” of “abandoning diplomacy and demanding clarification of relations on the battlefield.” Lavrov repeatedly described the Ukrainian government as “the putchists” and “the Nazi Kyiv regime,” “Russia’s invasion of Ukraine, in violation of the United Nations Charter and international law, is causing massive suffering and devastation to the country and its people and adding to the global economic dislocation triggered by the Covid-19 pandemic,” he said, sitting right next to Lavrov.",2


<left><img src="images/sudan.gif" width="400" height="300" /></left>