##Importing the files from Kaggle and unzipping them

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
!rm -rf /root/.kaggle
!mkdir -p /root/.kaggle

!cp /content/drive/"My Drive"/Codes/kaggle.json /root/.kaggle
!chmod 600 /root/.kaggle/kaggle.json

!kaggle datasets download -d pariza/bbc-news-summary


In [0]:
!unzip /content/bbc-news-summary.zip

In [0]:
import spacy
#!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

This project builds an automatic extractive text summarization system.
The dataset is taken from https://www.kaggle.com/pariza/bbc-news-summary

This dataset for extractive text summarization has four hundred and seventeen political news articles from the BBC, published from 2004 to 2005, in the News Articles folder.

The dataset consists of 5 categories of news articles: Business, Entertainment, Politics, Sports and Tech. We will build and test our model on the Business news articles.

In [0]:

import os
import pandas as pd

all_files = os.listdir("/content/BBC News Summary/News Articles/business/")
print(all_files)

business_news_articles = []

for file_name in all_files:
    name = "/content/BBC News Summary/News Articles/business/" + str(file_name)
    with open(name, 'r') as f:
        business_news_articles.append(f.read())

df_articles = pd.DataFrame(business_news_articles)
df_articles.columns = ['article']
df_articles.head(10)


After importing all the articles from the text files in the business folder, we preprocess the data as follows:

1. Removing extra newlines.

2. Removing punctuation, numbers and special characters.

3. Removing the stopwords.
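
The three steps above can be sketched on a single string using only the standard library. This is a minimal illustration, not the pipeline used in the cells below: `SAMPLE_STOPWORDS` and `preprocess` are hypothetical names, the stopword list is a tiny stand-in for NLTK's, and unlike the notebook this sketch also drops the sentence-ending periods.

```python
import string

# Hypothetical, tiny stopword list for illustration only (the notebook uses NLTK's list)
SAMPLE_STOPWORDS = {"the", "a", "of", "in", "to", "and", "is"}

def preprocess(text, stopwords=SAMPLE_STOPWORDS):
    # Step 1: replace newlines with spaces
    text = text.replace("\n", " ")
    tokens = []
    for raw in text.split():
        # Step 2: strip surrounding punctuation and keep purely alphabetic words
        word = raw.strip(string.punctuation).lower()
        # Step 3: drop stopwords
        if word.isalpha() and word not in stopwords:
            tokens.append(word)
    return " ".join(tokens)

sample = "Profits rose 12% in\n2004, the firm said."
cleaned = preprocess(sample)
print(cleaned)
```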

In [0]:
def remove_newlines(text):
    # replace every newline with a space so each article becomes one line
    return text.replace('\n', ' ')

df_articles['lines_removed'] = df_articles['article'].apply(remove_newlines)

df_articles.head(10)


Exploring the data:

We first visualize the word count distribution with a histogram.

After removing stopwords, we can show the unigram distribution, for example the top 10 or 20 unigrams.

The top 20 bigrams can be shown in the same way.

We can show the distribution of the lengths of the articles.

We can also generate word clouds.

Finally, we can visualize the distribution of articles across the GPEs (geopolitical entities) they mention, for example the top 10 GPEs across all the articles.
This shows whether the BBC articles in this set favour specific regions.

In [0]:

# the backslash is escaped explicitly to avoid an invalid escape sequence warning
punctuations = '!"#$%^&*{`|}~<=>?;:/+-,@[\\_]'

def clean_articles(article):
    toks = nlp(article, disable=['tagger', 'parser', 'ner'])
    punc_removed = [token.text for token in toks if token.text not in punctuations]
    words = [word.lower() for word in punc_removed]
    # keep alphabetic tokens only (drops numbers), plus '.' to preserve sentence boundaries
    words = [word for word in words if word.isalpha() or word == '.']

    return ' '.join(words)


df_articles['cleaned_up_text'] = df_articles['lines_removed'].apply(clean_articles)


df_articles['article_length'] = [len(article.split()) for article in df_articles['cleaned_up_text']]

df_articles.head(10)

In [0]:
#Plotting Distribution of Article Lengths
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.xlabel('Article Lengths')
plt.ylabel('Count')
plt.hist(df_articles['article_length'])
plt.show()

In [0]:
# removing stopwords and unigram/bigram distributions
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# a set makes the membership test below much faster than a list
stops = set(stopwords.words('english'))


def remove_stops(text):
    doc = nlp(text, disable=['parser','tagger'])
    tokens = [token.text for token in doc if token.text not in stops]

    return ' '.join(tokens)

df_articles['stops_removed'] = df_articles['cleaned_up_text'].apply(remove_stops)

df_articles.head(10)
df_articles['stops_removed'][0]

After preprocessing, we explore the data.

We look at the distribution of article lengths.

We see that most articles are between 200 and 300 words long.
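
The histogram reading can be cross-checked numerically by counting lengths per 100-word bucket. The sketch below is an assumption-laden illustration: `length_buckets` is a hypothetical helper and `sample_lengths` holds made-up word counts, not values from the dataset.

```python
from collections import Counter

def length_buckets(lengths, width=100):
    """Count how many lengths fall in each [k*width, (k+1)*width) bucket."""
    counts = Counter((n // width) * width for n in lengths)
    return dict(sorted(counts.items()))

# Hypothetical word counts, not taken from the dataset
sample_lengths = [180, 220, 250, 299, 300, 410]
buckets = length_buckets(sample_lengths)
print(buckets)
```

In the notebook, the same helper could be called on `df_articles['article_length']` to see which bucket holds the most articles.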

We then look at the words appearing in the articles by visualizing the distributions of the top unigrams, bigrams and trigrams.

The top n-grams across the articles make it clear that most of them come from business-related articles.
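
As a small illustration of what an n-gram count is, the sketch below builds n-grams from a token list with `zip` and counts them with `Counter`. The helper name `count_ngrams` and the toy sentence are assumptions made for this example; the notebook itself uses `nltk.util.ngrams`.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Return a Counter of n-gram tuples built from consecutive tokens."""
    # zip over n shifted copies of the token list yields each window of n tokens
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Toy token list, not taken from the dataset
tokens = "interest rates rise as interest rates fall".split()
unigram_counts = count_ngrams(tokens, 1)
bigram_counts = count_ngrams(tokens, 2)
print(bigram_counts.most_common(1))
```

Counting per article and then summing the counters, as the notebook does, keeps n-grams from spanning two different articles.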

We can also look at the word cloud to see which words appear most often.

##Visualizing the n-grams

In [0]:
from nltk.util import ngrams
from collections import Counter

unigrams_frequencies = Counter()
bigrams_frequencies = Counter()
trigrams_frequencies = Counter()

def generate_ngrams():
    # the counters are updated in place, so no global declaration is needed
    for tex in df_articles['stops_removed']:
        toks = nlp(tex, disable=['parser', 'tagger'])
        # drop the '.' tokens so they do not show up inside n-grams
        tok_texts = [t.text for t in toks if t.text != '.']
        unigrams_frequencies.update(ngrams(tok_texts, 1))
        bigrams_frequencies.update(ngrams(tok_texts, 2))
        trigrams_frequencies.update(ngrams(tok_texts, 3))

generate_ngrams()


In [0]:


top_unigrams = unigrams_frequencies.most_common(20)
top_bigrams = bigrams_frequencies.most_common(20)
top_trigrams = trigrams_frequencies.most_common(20)

#unzipping the ngrams
strings_unigrams, count_unigrams = zip(*top_unigrams)
strings_bigrams, count_bigrams = zip(*top_bigrams)
strings_trigrams, count_trigrams = zip(*top_trigrams)

#converting the n-gram tuples into readable label strings, e.g. ('a', 'b') -> 'a b'
list_of_strings_unigrams = [' '.join(uni_strings) for uni_strings in strings_unigrams]
list_of_strings_bigrams = [' '.join(bi_strings) for bi_strings in strings_bigrams]
list_of_strings_trigrams = [' '.join(tri_strings) for tri_strings in strings_trigrams]

plt.figure(1, figsize=(10, 8))
plt.title('Top Unigrams')
plt.xlabel('Count')
plt.ylabel('Unigrams')
plt.barh(list_of_strings_unigrams, count_unigrams)

plt.figure(2, figsize=(10, 8))
plt.title('Top Bigrams')
plt.xlabel('Count')
plt.ylabel('Bigrams')
plt.barh(list_of_strings_bigrams, count_bigrams)

plt.figure(3, figsize=(10, 8))
plt.title('Top Trigrams')
plt.xlabel('Count')
plt.ylabel('Trigrams')
plt.barh(list_of_strings_trigrams, count_trigrams)

plt.show()


##Visualizing the GPEs

We will visualize the geopolitical entities (GPEs) present across the articles, such as the top 10 GPEs mentioned across all the articles.

The US is mentioned most often in the business news articles. After the US, the UK, China, India and Japan are mentioned most frequently.

Most of the business news relates to these countries, while other African and American countries are covered less.
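
The counting step itself can be sketched without loading a spaCy model. The example below assumes the entities have already been extracted as `(text, label)` pairs, standing in for `(ent.text, ent.label_)`; the helper name `count_gpes` and the sample pairs are hypothetical.

```python
from collections import Counter

def count_gpes(entities):
    """Count entity texts whose label is 'GPE'."""
    return Counter(text for text, label in entities if label == "GPE")

# Hypothetical (text, label) pairs standing in for (ent.text, ent.label_)
sample_entities = [
    ("US", "GPE"),
    ("BBC", "ORG"),
    ("UK", "GPE"),
    ("US", "GPE"),
    ("2004", "DATE"),
]
gpe_counts = count_gpes(sample_entities)
print(gpe_counts.most_common(2))
```

Using a `Counter` gives `most_common` directly, which avoids sorting the dictionary items by hand.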



In [0]:
GPE_counts = {}

def Count_GPEs(text):
    # GPE_counts is mutated in place, so no global declaration is needed
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            GPE_counts[ent.text] = GPE_counts.get(ent.text, 0) + 1


for article in df_articles['lines_removed']:
    Count_GPEs(article)

In [0]:
#getting the top GPE counts
top_GPEs = sorted(GPE_counts.items(), key=lambda x: x[1], reverse=True)

top_GPEs[:20]

In [0]:
#unzip the keys and values
GPEs, counts = zip(*top_GPEs[:20])

plt.figure(figsize=(10, 8))
plt.title('Top Geographical Places')
plt.xlabel('Count')
plt.ylabel('Places')
plt.barh(GPEs, counts)


##Let's generate the word cloud

In [0]:
from wordcloud import WordCloud
from matplotlib import rcParams
import matplotlib.pyplot as plt

all_articles = ' '.join(df_articles['stops_removed'].str.lower())

wordcloud = WordCloud(background_color="white", max_words=500).generate(all_articles)

rcParams['figure.figsize'] = 15, 10
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

##And now the Summary Part.

For the summary, we calculate the word frequencies of the article, normalize them by the highest frequency, and then score each sentence by summing the weights of the words it contains. Sentences of 30 words or more are not considered for the summary.

Finally, we take the 5 highest-scoring sentences to make the summary.
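
The scoring idea can be sketched in plain Python on a few toy sentences. This is a simplified illustration under stated assumptions, not the notebook's implementation: `summarize` is a hypothetical helper, tokenization is a simple regular expression instead of spaCy, and the sentences and stopwords are made up.

```python
import heapq
import re
from collections import Counter

def summarize(sentences, stopwords, top_n=2, max_words=30):
    """Rank sentences by the sum of their normalized word frequencies."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # Word frequencies over the whole text, ignoring stopwords
    freqs = Counter(w for toks in tokenized for w in toks if w not in stopwords)
    if not freqs:
        return []
    max_freq = max(freqs.values())
    scores = {}
    for sent, toks in zip(sentences, tokenized):
        if len(toks) >= max_words:
            continue  # overly long sentences are left out of the summary
        # Stopwords are absent from freqs, so they contribute 0
        scores[sent] = sum(freqs[w] / max_freq for w in toks)
    return heapq.nlargest(top_n, scores, key=scores.get)

# Toy input, not taken from the dataset
sentences = [
    "Oil prices rose sharply",
    "Analysts expect oil prices to rise further",
    "The weather was mild",
]
summary = summarize(sentences, stopwords={"the", "to", "was"})
print(summary)
```

Because the score is a plain sum, sentences with more content words tend to rank higher, which is one reason for capping the sentence length.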


In [0]:
import re
import heapq

def generate_summary(text_with_periods, cleaned_text):
    # split the article into sentences and strip punctuation from each one
    doc = nlp(text_with_periods)
    sentence_list = [re.sub(r'[^\w\s]', '', str(sent)) for sent in doc.sents]

    #calculate word frequencies, skipping the '.' tokens kept for sentence boundaries
    word_freqs = {}
    for word in nlp(cleaned_text, disable=['parser', 'tagger', 'ner']):
        if word.text == '.':
            continue
        word_freqs[word.text] = word_freqs.get(word.text, 0) + 1

    max_freq = max(word_freqs.values())

    #weighted frequency: normalize by the highest frequency
    for word in word_freqs:
        word_freqs[word] /= max_freq

    #calculate sentence scores for sentences shorter than 30 words
    sent_scores = {}
    for sent in sentence_list:
        if len(sent.split(' ')) >= 30:
            continue
        for word in nlp(sent.lower(), disable=['tagger','parser','ner']):
            if word.text in word_freqs:
                sent_scores[sent] = sent_scores.get(sent, 0) + word_freqs[word.text]

    #keep the 5 highest-scoring sentences
    summary_sentences = heapq.nlargest(5, sent_scores, key=sent_scores.get)

    summary = '. '.join(summary_sentences)
    print("Original Text::::::::::::\n")
    print(text_with_periods)
    print('\n\nSummarized text::::::::\n')
    print(summary)

    return summary





In [0]:
system_summary = generate_summary(df_articles['cleaned_up_text'][10], df_articles['stops_removed'][10])
