# Text Summarization with spaCy

spaCy is a Python library for natural language processing (NLP). It has no built-in summarizer, but its tokenizer, stop-word list, and sentence segmentation provide the pieces needed to build one.

Text summarization condenses a longer text into a shorter one while retaining the most important information.

The approach in this notebook is extractive and frequency-based rather than a trained summarization model: each word is scored by how often it occurs, each sentence is scored by the words it contains, and the highest-scoring sentences are used as the summary.

The summarizer can be tuned by changing the fraction of sentences kept and by changing which tokens are counted, for example by excluding stop words and punctuation.

This kind of summarization can be applied to news articles, legal documents, and academic papers to distill the important information from longer texts.
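
To make the idea of picking the most important sentences concrete, here is a minimal sketch of frequency-based extractive summarization in plain Python. It is an illustration under simplifying assumptions, not spaCy's API: sentences are split with a naive regular expression, the stop-word list is a tiny hand-picked set, and the names `summarize_sketch`, `SMALL_STOP_WORDS`, and `ratio` are hypothetical.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
SMALL_STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "on"}


def summarize_sketch(text, ratio=0.3):
    """Keep the top `ratio` fraction of sentences, ranked by word frequency."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def words(sentence):
        # Lowercased alphabetic words, minus the illustrative stop words
        found = re.findall(r"[a-z']+", sentence.lower())
        return [w for w in found if w not in SMALL_STOP_WORDS]

    freq = Counter(w for s in sentences for w in words(s))
    if not freq:
        return ""

    max_freq = max(freq.values())
    # Sentence score = sum of the normalized frequencies of its words
    scores = {
        i: sum(freq[w] / max_freq for w in words(s))
        for i, s in enumerate(sentences)
    }

    # Keep at least one sentence, then put the winners back in document order
    keep = max(1, int(len(sentences) * ratio))
    top = sorted(nlargest(keep, scores, key=scores.get))
    return " ".join(sentences[i] for i in top)


example = (
    "Cats sleep a lot. Cats chase mice and cats climb trees. "
    "Dogs bark. The weather is nice."
)
print(summarize_sketch(example, ratio=0.5))
```

Unlike the notebook's function, this sketch always keeps at least one sentence and returns the selected sentences in their original order; both are design choices made for the illustration.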

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import string
import math
import spacy
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
# Load CSV file into DataFrame
df = pd.read_csv('output-merged.csv')
df = df[['Title', 'Content']]
df

Unnamed: 0,Title,Content
0,\n Kids under 6 were increasingly treated...,The outbreak of Covid-19 presented many danger...
1,\n How to exercise when you have a chroni...,Many people struggle to maintain a regular wor...
2,\n Australia unveils biggest defense over...,Australia has unveiled a radical shakeup of it...
3,\n Another cheetah has died after relocat...,A cheetah from Africa has died two months afte...
4,\n Residents in South Florida condo build...,Residents of a South Florida condo building ha...
...,...,...
1205,ChatGPT for health care providers: Can the AI ...,"OpenAI CEO Sam Altman said that he was ""a litt..."
1206,Want to get better sleep? Exercise for this lo...,Get the rest you need with these simple tweaks...
1207,Massachusetts town says Avian Flu detected amo...,Fox News Flash top headlines are here. Check o...
1208,The world lost faith in childhood vaccines dur...,Fox News Flash top headlines are here. Check o...


In [5]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

# Download the medium English pipeline (only needed once per environment)
spacy.cli.download("en_core_web_md")

# Treat newlines as punctuation so they are not counted as words
punctuation = punctuation + '\n'

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
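
The cell above invokes the download every time it is executed. Since spaCy pipelines are installed as regular Python packages, one possible way to skip the download when the model is already present is to check for the package with the standard library's `importlib`. The sketch below rests on that assumption; the helper names `is_installed` and `ensure_model` are hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


def ensure_model(model_name="en_core_web_md"):
    """Download the spaCy pipeline only if it is not already installed."""
    if not is_installed(model_name):
        import spacy.cli  # imported lazily; only needed for the download

        spacy.cli.download(model_name)
```

Calling `ensure_model()` in place of the unconditional `spacy.cli.download(...)` would leave an existing installation untouched.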


In [6]:
import nltk

# Download the stopwords corpus (only need to do this once)
nltk.download('stopwords')

# Load the stopwords into a set
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to C:\Users\Yamini
[nltk_data]     Manral\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
nlp = spacy.load('en_core_web_md')

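The summarizer is defined as a function, textSummarizer(), which scores each sentence by the normalized frequency of its words and keeps the top-scoring fraction of sentences.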

In [8]:
def textSummarizer(text, percentage):
    """Return an extractive summary keeping roughly `percentage` of the sentences."""

    # Missing or non-string content (e.g. NaN from the CSV) has nothing to summarize
    if not isinstance(text, str) or not text.strip():
        return ""

    # Run the spaCy pipeline loaded above (en_core_web_md)
    doc = nlp(text)

    # Frequency table of lowercased words, skipping stop words and punctuation.
    # Keys are lowercased so that the lookups below match them.
    freq_of_word = dict()
    for word in doc:
        key = word.text.lower()
        if key not in STOP_WORDS and key not in punctuation:
            freq_of_word[key] = freq_of_word.get(key, 0) + 1

    # Nothing left to score (text contained only stop words or punctuation)
    if not freq_of_word:
        return ""

    # Normalize each frequency by the maximum frequency
    max_freq = max(freq_of_word.values())
    for word in freq_of_word:
        freq_of_word[word] = freq_of_word[word] / max_freq

    # Score each sentence as the sum of the normalized frequencies of its words
    sent_tokens = list(doc.sents)
    sent_scores = dict()
    for sent in sent_tokens:
        for word in sent:
            key = word.text.lower()
            if key in freq_of_word:
                sent_scores[sent] = sent_scores.get(sent, 0) + freq_of_word[key]

    # Number of sentences to keep
    len_tokens = int(len(sent_tokens) * percentage)

    # Highest-scoring sentences, in descending score order; each is a spaCy Span
    summary = nlargest(n=len_tokens, iterable=sent_scores, key=sent_scores.get)

    # Join the sentence texts into a single string
    final_summary = [sent.text for sent in summary]
    return " ".join(final_summary)
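
The selection step relies on how `heapq.nlargest` treats a dictionary: iterating a dict yields its keys, and `key=sent_scores.get` ranks those keys by their stored score. A small sketch with made-up sentence strings and scores (in the function above the keys are spaCy spans rather than strings):

```python
from heapq import nlargest

# Made-up scores for illustration; the keys stand in for sentences
sent_scores = {
    "First sentence.": 1.5,
    "Second sentence.": 2.0,
    "Third sentence.": 0.5,
    "Fourth sentence.": 3.0,
}

# Iterating the dict yields its keys; key=sent_scores.get ranks them by score
top_two = nlargest(2, sent_scores, key=sent_scores.get)
print(top_two)  # ['Fourth sentence.', 'Second sentence.']

# The result is in descending score order. To read the summary in the
# original order instead, sort the winners by their position in the source.
order = list(sent_scores)
in_document_order = sorted(top_two, key=order.index)
print(in_document_order)  # ['Second sentence.', 'Fourth sentence.']
```

Whether to present sentences by score or by position is a design choice; the function above returns them by score.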

## Summarizing each news article

In [None]:
df['Summary'] = df['Content'].apply(lambda x: textSummarizer(x,0.3))
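
With a ratio of 0.3, the number of sentences kept is the sentence count multiplied by 0.3 and truncated to an integer, so very short articles can end up with no sentences at all unless a minimum is enforced. A small sketch of that arithmetic follows; the helper `sentences_to_keep` and its `minimum` parameter are hypothetical and not part of the notebook's function.

```python
def sentences_to_keep(n_sentences, percentage, minimum=0):
    """Number of sentences kept: the truncated product, with an optional floor."""
    return max(minimum, int(n_sentences * percentage))


# Compare the plain truncation with a floor of one sentence
for n in (1, 3, 4, 10):
    plain = sentences_to_keep(n, 0.3)
    floored = sentences_to_keep(n, 0.3, minimum=1)
    print(f"{n} sentences -> keep {plain} (or {floored} with minimum=1)")
```

Articles with three or fewer sentences yield an empty summary at this ratio, which is worth keeping in mind when inspecting the `Summary` column.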

In [None]:
df