<h1>Extractive text summarization in Python</h1>

<p>Extractive text summarization is a technique for summarizing text data. What characterizes this kind of summarization is that the summary consists of the most important sentences of the source text, copied verbatim.</p>

<p>The way I achieve this here is by using term frequency-inverse document frequency (TF-IDF) over the entire text body. In short, the input text is split into sentences, and a TF-IDF calculation is run over each sentence. All the TF-IDF scores are then saved in a list and sorted from highest to lowest, so the resulting sentences are ordered from most to least relevant.</p>
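<p>To make the idea concrete before the notebook builds it step by step, here is a minimal standard-library sketch. It is not the notebook's implementation: the regex-based sentence split, the function name rank_sentences, the sample text, and the choice to score a sentence by summing the TF-IDF values of its words are all assumptions made for illustration.</p>

```python
import math
import re
from collections import Counter


def rank_sentences(text):
    """Return (score, sentence) pairs sorted from highest to lowest score."""
    # Naive sentence split (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Lowercased word tokens for every sentence.
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]

    # Document frequency: the number of sentences each word appears in.
    df = Counter(word for words in tokenized for word in set(words))
    n = len(sentences)

    ranked = []
    for sentence, words in zip(sentences, tokenized):
        tf = Counter(words)
        # Sentence score (assumption): sum of tf * idf over its distinct words.
        score = sum(count * math.log(n / df[word]) for word, count in tf.items())
        ranked.append((score, sentence))
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)


sample = "Stocks fell sharply. Stocks fell again on Thursday. Traders panicked."
ranking = rank_sentences(sample)
for score, sentence in ranking:
    print(f"{score:.3f}  {sentence}")
```

<p>Summing the values favors longer sentences, so dividing by the sentence length is a common variation worth trying.</p>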

<h1>Setting the project up</h1>

<p>Before getting to the fun part, I need to import a couple of libraries. </p>

In [72]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
import math
from nltk.tokenize import word_tokenize, sent_tokenize
import contractions
from nltk.tokenize import RegexpTokenizer


<p>Before pre-processing any sort of data, I need to have data, of course. I've taken a fragment of an article about Black Thursday from Investopedia. You can check it out <a href='https://www.investopedia.com/terms/b/blackthursday.asp'>here</a>.</p>

In [33]:
InputText = """Black Thursday marked the beginning of the end of one of the longest-running bull markets in U.S. history. For nearly the entire decade of the 1920s, stock prices had been steadily climbing, rising to unprecedented heights. The Dow Jones Industrial Average (DJIA) increased sixfold from 63 in August 1921 to 381 in September 1929.
However, even before the New York Stock Exchange (NYSE) opened on that fateful Thursday in 1929, the elevated equity prices were making investors and financial experts uneasy. On Sept. 5, at the annual National Business Conference, economist Roger Babson predicted that “sooner or later a crash is coming, and it may be terrific.” Throughout September, stock prices gyrated, with sudden declines and rapid recoveries.
The jitters continued into October. In fact, on Oct. 23, the Dow fell 4.6%. A Washington Post headline exclaimed, “Huge Selling Wave Creates Near-Panic as Stocks Collapse.”
By this time, the stock market had already fallen nearly 20% since its record close of 381 on Sept. 3. When trading opened on Thursday, Oct. 24, the Dow fell 11% in the first few hours.  Even more ominous was the heavy trading volume: It was to hit a record 12.9 million shares—three times the normal amount—by day’s end."""

In [34]:
InputText

'Black Thursday marked the beginning of the end of one of the longest-running bull markets in U.S. history. For nearly the entire decade of the 1920s, stock prices had been steadily climbing, rising to unprecedented heights. The Dow Jones Industrial Average (DJIA) increased sixfold from 63 in August 1921 to 381 in September 1929.\nHowever, even before the New York Stock Exchange (NYSE) opened on that fateful Thursday in 1929, the elevated equity prices were making investors and financial experts uneasy. On Sept. 5, at the annual National Business Conference, economist Roger Babson predicted that “sooner or later a crash is coming, and it may be terrific.” Throughout September, stock prices gyrated, with sudden declines and rapid recoveries.\nThe jitters continued into October. In fact, on Oct. 23, the Dow fell 4.6%. A Washington Post headline exclaimed, “Huge Selling Wave Creates Near-Panic as Stocks Collapse.”\nBy this time, the stock market had already fallen nearly 20% since its rec

<p>It's a fairly short text, but it is enough for the purposes of this notebook. We're not training a model to recognize specific texts, so we don't need much data; one sample text like this is enough. That said, I suppose the longer the text, the better the result would be.</p>

<h1>Preprocessing</h1>

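<p>What we now need to do is split the text into sentences, save them in a list and remove the so-called stop words from each sentence. The splitting is handled by NLTK's sent_tokenize, which looks for sentence-ending punctuation such as periods, question marks and exclamation marks, rather than splitting on every period (note how "U.S." and "Sept. 5" stay intact in the output below). Stop words are words that carry little or no meaning by themselves. In English, those are words like "the", "and", "but", etc.</p>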

In [40]:
InputTextSentences = sent_tokenize(InputText)

In [42]:
InputTextSentences

['Black Thursday marked the beginning of the end of one of the longest-running bull markets in U.S. history.',
 'For nearly the entire decade of the 1920s, stock prices had been steadily climbing, rising to unprecedented heights.',
 'The Dow Jones Industrial Average (DJIA) increased sixfold from 63 in August 1921 to 381 in September 1929.',
 'However, even before the New York Stock Exchange (NYSE) opened on that fateful Thursday in 1929, the elevated equity prices were making investors and financial experts uneasy.',
 'On Sept. 5, at the annual National Business Conference, economist Roger Babson predicted that “sooner or later a crash is coming, and it may be terrific.” Throughout September, stock prices gyrated, with sudden declines and rapid recoveries.',
 'The jitters continued into October.',
 'In fact, on Oct. 23, the Dow fell 4.6%.',
 'A Washington Post headline exclaimed, “Huge Selling Wave Creates Near-Panic as Stocks Collapse.”\nBy this time, the stock market had already fall

<p>Now that we have all the sentences in the text, we can tokenize each sentence (lowercase it and split it into words), just like we split the text into sentences in the previous cells.</p>

In [79]:
WordsInSentenceList = [word_tokenize(sentence.lower()) for sentence in InputTextSentences]

In [93]:
WordsInSentenceList[0:2]

[['black',
  'thursday',
  'marked',
  'the',
  'beginning',
  'of',
  'the',
  'end',
  'of',
  'one',
  'of',
  'the',
  'longest-running',
  'bull',
  'markets',
  'in',
  'u.s.',
  'history',
  '.'],
 ['for',
  'nearly',
  'the',
  'entire',
  'decade',
  'of',
  'the',
  '1920s',
  ',',
  'stock',
  'prices',
  'had',
  'been',
  'steadily',
  'climbing',
  ',',
  'rising',
  'to',
  'unprecedented',
  'heights',
  '.']]

<p>Next, we need to remove the stopwords from the text. </p>

In [108]:
EnglishStopWords = set(stopwords.words('english'))  # build the stop word set once instead of once per word
NoStopWordsSentences = [word for sentence in WordsInSentenceList for word in sentence if word not in EnglishStopWords]

In [116]:
NoStopWordsSentencesJoined = " ".join(NoStopWordsSentences).split(' . ')

In [118]:
NoStopWordsSentencesJoined[1]

'nearly entire decade 1920s , stock prices steadily climbing , rising unprecedented heights'
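<p>The cells above flatten every sentence into a single word list and then recover the sentence boundaries by splitting on ' . ', which relies on each sentence ending in a standalone '.' token. A minimal alternative sketch filters each sentence separately, so the boundaries never have to be reconstructed. The small stop word set SAMPLE_STOP_WORDS and the hand-written tokens are hypothetical stand-ins for the notebook's data.</p>

```python
# Hypothetical mini stop word list; the notebook uses stopwords.words('english').
SAMPLE_STOP_WORDS = {"the", "of", "in", "for", "had", "been", "to", "one"}

# Hand-written stand-in for the nested token lists produced by word_tokenize.
sample_tokens = [
    ["black", "thursday", "marked", "the", "beginning", "of", "the", "end", "."],
    ["for", "nearly", "the", "entire", "decade", "of", "the", "1920s", "."],
]

# Filter inside each sentence so the nested list structure is preserved.
filtered_sentences = [
    [word for word in sentence if word not in SAMPLE_STOP_WORDS]
    for sentence in sample_tokens
]

# One string per sentence; the trailing '.' token is still present here.
joined_sentences = [" ".join(sentence) for sentence in filtered_sentences]
print(joined_sentences)
```

<p>With this approach the number of output sentences always matches the number of input sentences, and the leftover '.' tokens disappear once punctuation is stripped.</p>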

<p>Let's remove any punctuation now, so that we don't get any unpleasant surprises down in the notebook.</p>

In [136]:
RegexTokenizer = RegexpTokenizer(r'\w+')

In [137]:
NoPunctuationSentences = [RegexTokenizer.tokenize(sentence) for sentence in NoStopWordsSentencesJoined]

In [140]:
NoPunctuationSentences[0:2]

[['black',
  'thursday',
  'marked',
  'beginning',
  'end',
  'one',
  'longest',
  'running',
  'bull',
  'markets',
  'u',
  's',
  'history'],
 ['nearly',
  'entire',
  'decade',
  '1920s',
  'stock',
  'prices',
  'steadily',
  'climbing',
  'rising',
  'unprecedented',
  'heights']]
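<p>The output above also shows a side effect of the \w+ pattern: hyphens and periods act as separators, so "longest-running" becomes two tokens and "u.s." becomes "u" and "s". As a rough illustration, the same pattern can be applied with the standard library's re.findall; treating it as equivalent to RegexpTokenizer for this pattern is an assumption of this sketch, and sample_sentence is typed in by hand.</p>

```python
import re

# First sentence after stop word removal, typed in by hand for illustration.
sample_sentence = (
    "black thursday marked beginning end one longest-running "
    "bull markets u.s. history"
)

# \w+ matches runs of letters, digits and underscores, so every other
# character (spaces, hyphens, periods) splits the text and is dropped.
tokens = re.findall(r"\w+", sample_sentence)
print(tokens)
```

<p>If single-letter leftovers such as "u" and "s" turn out to be a problem, they could be filtered out by token length after this step.</p>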

In [169]:
ProcessedData = [" ".join(sentence) for sentence in NoPunctuationSentences]

<p>Our data seems to be clean now, ready to be processed by the algorithm we will use!</p>

<h1>Data processing</h1>

<p>The first thing we've got to do is represent our data as numbers, so that the computer can work with it. We will achieve this with the TfidfVectorizer, thanks to scikit-learn :)</p>

In [174]:
Vectorizer = TfidfVectorizer()

In [176]:
VectorizedDataMatrix = Vectorizer.fit_transform(ProcessedData)

<p>What we just did in the cell above is take the input data, turn it into numbers, compute the tf-idf value for each word and save it all to a matrix, hence the variable name. Each row of the matrix corresponds to a sentence and each column to a word of the vocabulary. The tf-idf value is the key thing here; this is the basic formula behind the tf-idf value of every word in our text:</p>

$TFIDF(S) = TF(S) \cdot IDF(S)$, where $TF(S)$ is the frequency of a given term S in a given text and $IDF(S)$ is the inverse document frequency of that term: $IDF(S) = \log(N / df(S))$, where N is the total number of documents (texts or sentences, depending on the case; here every sentence is a document) and $df(S)$ is the number of documents that contain the term S.
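<p>One caveat: with its default settings, scikit-learn's TfidfVectorizer does not use the plain formula above. As far as its documentation describes, it applies a smoothed IDF, $\ln((1 + N) / (1 + df)) + 1$, and then L2-normalizes every row, so the values in VectorizedDataMatrix will differ from a hand calculation. The sketch below compares the two IDF variants; the function names and the document count of 8 are hypothetical.</p>

```python
import math


def textbook_idf(n_docs, doc_freq):
    """Plain IDF: log(N / df)."""
    return math.log(n_docs / doc_freq)


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 8  # hypothetical number of sentences (documents)
for doc_freq in (1, 4, 8):
    plain = textbook_idf(n_docs, doc_freq)
    smooth = smoothed_idf(n_docs, doc_freq)
    print(f"df={doc_freq}: plain={plain:.3f}, smoothed={smooth:.3f}")
```

<p>In both variants a rarer term gets a higher weight. The difference shows at the extreme: a term that occurs in every sentence gets a plain IDF of 0, while the smoothed version still gives it a weight of 1.</p>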