# <center>NATURAL LANGUAGE PROCESSING </center>
## <center> CAC 1 : TEXT SUMMARIATION </center>
### <center> K Nidhi Sharma, 2148041 </center>

Text summarization is the method of generating a brief and accurate summary of a large text while concentrating on the passages that provide relevant information and keeping the main ideas intact.

In general, there are two methods used in NLP to summarise texts: extraction and abstraction.

In **extraction-based summarising**, a selection of words that best capture the text's key ideas are selected, then they are concatenated to create a summary.

Advanced deep learning algorithms are used in **abstraction-based summarization** to paraphrase and condense the original text, exactly like humans do.

### Importing libraries 

In [1]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import regex as re

### Reading the data

In [2]:
text_original=open('E:\\PG_sem3\\nlp\\text.txt').read()
print("The length of the original text is",len(text_original)) 

The length of the original text is 5010


In [3]:
print("ORIGINAL TEXT:\n", text_original)

ORIGINAL TEXT:
 Enzo Ferrari was not initially interested in the idea of producing road cars when
he formed Scuderia Ferrari in 1929, with headquarters in Modena. Scuderia Ferrari
(pronounced [skude?ri?a]) literally means &quot;Ferrari Stable&quot; and is usually used to
mean &quot;Team Ferrari.&quot; Ferrari bought,[citation needed] prepared, and fielded Alfa
Romeo racing cars for gentleman drivers, functioning as the racing division of
Alfa Romeo. In 1933, Alfa Romeo withdrew its in-house racing team and Scuderia
Ferrari took over as its works team:[1] the Scuderia received Alfa&#39;s Grand Prix
cars of the latest specifications and fielded many famous drivers such as Tazio
Nuvolari and Achille Varzi. In 1938, Alfa Romeo brought its racing operation again
in-house, forming Alfa Corse in Milan and hired Enzo Ferrari as manager of the new
racing department; therefore the Scuderia Ferrari was disbanded.[1]
The first vehicle made with the Ferrari name was the 125 S. Only two of this smal

### Pre-processing the text file

When the original text is observed, it is noticed that there exists extra new line spaces, and quotes are represented as **&quot** and apostrophe is denoted as **&#39**. 

In the pre-processing step, the following is corrected to get accurate results. 

In [4]:
text=open('E:\\PG_sem3\\nlp\\text.txt').read()

In [5]:
text=re.sub("\n"," ",text) # removing newline (\n)
text=re.sub("&#39;","'",text) #replacing apostrophe
text=re.sub("&quot;","'",text) #replacing quotes

print("PROCESSED TEXT:\n",text)

PROCESSED TEXT:
 Enzo Ferrari was not initially interested in the idea of producing road cars when he formed Scuderia Ferrari in 1929, with headquarters in Modena. Scuderia Ferrari (pronounced [skude?ri?a]) literally means 'Ferrari Stable' and is usually used to mean 'Team Ferrari.' Ferrari bought,[citation needed] prepared, and fielded Alfa Romeo racing cars for gentleman drivers, functioning as the racing division of Alfa Romeo. In 1933, Alfa Romeo withdrew its in-house racing team and Scuderia Ferrari took over as its works team:[1] the Scuderia received Alfa's Grand Prix cars of the latest specifications and fielded many famous drivers such as Tazio Nuvolari and Achille Varzi. In 1938, Alfa Romeo brought its racing operation again in-house, forming Alfa Corse in Milan and hired Enzo Ferrari as manager of the new racing department; therefore the Scuderia Ferrari was disbanded.[1] The first vehicle made with the Ferrari name was the 125 S. Only two of this small two-seat sports/racin

## Summarization using frequency distribution

Steps involved in summarization:
1. Import libraries
2. Remove stop words and tokenization
3. Create frequency table
4. Scoring sentence
5. Summarization

### Removing stopwords & word tokenization

In [6]:
stopwords = stopwords.words("english") # english stopwords such as - a, the, and, an, etc
words = word_tokenize(text) #word tokenization for frequency

### Creating frequency distribution

A dictionary is created for the word frequency table from the given text file. The words which are not part of stopwords arrays are only included in  this dictionary.

In [7]:
freqTable = dict() # new dictionary for storing the frequency of words

for word in words:
    word=word.lower()
    if word in stopwords:
        continue
    if word in freqTable:
        freqTable[word]+=1
    else:
        freqTable[word]=1
freqTable

{'enzo': 7,
 'ferrari': 38,
 'initially': 2,
 'interested': 1,
 'idea': 1,
 'producing': 1,
 'road': 6,
 'cars': 10,
 'formed': 1,
 'scuderia': 8,
 '1929': 1,
 ',': 45,
 'headquarters': 1,
 'modena': 1,
 '.': 37,
 '(': 6,
 'pronounced': 1,
 '[': 12,
 'skude': 1,
 '?': 2,
 'ri': 1,
 ']': 12,
 ')': 6,
 'literally': 1,
 'means': 1,
 "'ferrari": 1,
 'stable': 1,
 "'": 2,
 'usually': 1,
 'used': 2,
 'mean': 1,
 "'team": 1,
 'bought': 1,
 'citation': 1,
 'needed': 1,
 'prepared': 1,
 'fielded': 3,
 'alfa': 7,
 'romeo': 5,
 'racing': 7,
 'gentleman': 1,
 'drivers': 2,
 'functioning': 1,
 'division': 1,
 '1933': 1,
 'withdrew': 1,
 'in-house': 2,
 'team': 2,
 'took': 2,
 'works': 2,
 ':': 1,
 '1': 5,
 'received': 2,
 "'s": 7,
 'grand': 2,
 'prix': 1,
 'latest': 1,
 'specifications': 1,
 'many': 2,
 'famous': 1,
 'tazio': 1,
 'nuvolari': 1,
 'achille': 1,
 'varzi': 1,
 '1938': 1,
 'brought': 1,
 'operation': 1,
 'forming': 1,
 'corse': 2,
 'milan': 1,
 'hired': 1,
 'manager': 1,
 'new': 4,
 'de

### Sentence tokenization

In [8]:
sentences=sent_tokenize(text) #sentence tokenization

Using term frequency method, each sentence is scored by it's words. Frequency of every non-stop word in a sentence.

In [9]:
sentenceValue=dict() #creating dictionay for sentence scores

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence]+=freq
            else:
                sentenceValue[sentence]=freq
print(sentenceValue)

{'Enzo Ferrari was not initially interested in the idea of producing road cars when he formed Scuderia Ferrari in 1929, with headquarters in Modena.': 178, "Scuderia Ferrari (pronounced [skude?ri?a]) literally means 'Ferrari Stable' and is usually used to mean 'Team Ferrari.'": 141, 'Ferrari bought,[citation needed] prepared, and fielded Alfa Romeo racing cars for gentleman drivers, functioning as the racing division of Alfa Romeo.': 195, "In 1933, Alfa Romeo withdrew its in-house racing team and Scuderia Ferrari took over as its works team:[1] the Scuderia received Alfa's Grand Prix cars of the latest specifications and fielded many famous drivers such as Tazio Nuvolari and Achille Varzi.": 241, 'In 1938, Alfa Romeo brought its racing operation again in-house, forming Alfa Corse in Milan and hired Enzo Ferrari as manager of the new racing department; therefore the Scuderia Ferrari was disbanded.': 188, '[1] The first vehicle made with the Ferrari name was the 125 S. Only two of this s

### Score calculation 

To compare the scores, average score of each sentence is calculated. 

In [10]:
sumValues=0
for snetence in sentenceValue:
    sumValues +=sentenceValue[sentence]
sumValues

4995

In [11]:
average = int(sumValues/len(sentenceValue))
average

135

### Generating summary
Generating summary using the scores as threshold.

In [12]:
summary=''
for sentence in sentences:
    if(sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " "+ sentence
print("SUMMARIZED TEXT:\n",summary)

SUMMARIZED TEXT:
  Enzo Ferrari was not initially interested in the idea of producing road cars when he formed Scuderia Ferrari in 1929, with headquarters in Modena. Ferrari bought,[citation needed] prepared, and fielded Alfa Romeo racing cars for gentleman drivers, functioning as the racing division of Alfa Romeo. In 1933, Alfa Romeo withdrew its in-house racing team and Scuderia Ferrari took over as its works team:[1] the Scuderia received Alfa's Grand Prix cars of the latest specifications and fielded many famous drivers such as Tazio Nuvolari and Achille Varzi. In 1938, Alfa Romeo brought its racing operation again in-house, forming Alfa Corse in Milan and hired Enzo Ferrari as manager of the new racing department; therefore the Scuderia Ferrari was disbanded. In September 1939, Ferrari left Alfa Romeo under the provision he would not use the Ferrari name in association with races or racing cars for at least four years. [1] A few days later he founded Auto Avio Costruzioni, headqua

In [13]:
print("The length of the original text is ",len(text_original))
print("The length of the summarized text is",len(summary))

The length of the original text is  5010
The length of the summarized text is 2269


## Summarization using TF-IDF

Steps:
1. Implement TF-IDF vectorizer.
2. Create a dataframe of the vectorized array using x.toarray()
3. Calculate sentence weights.
4. Append the weights in the dataframe and sort the dataframe as per the weights.
5. Set threshold.
6. Generate summary based on scores.


In [14]:
vector=TfidfVectorizer() #vectorization
x=vector.fit_transform(sentences)

In [15]:
import pandas as pd
data=pd.DataFrame(x.toarray(),index=sentences) #dataframe

data["Sentence_weights"]=(data.sum(axis=1))/len(data.columns) #weights
finaldf = data.sort_values(by = 'Sentence_weights',ascending=False)
finaldf[["Sentence_weights"]]

finaldf = finaldf.drop(finaldf[finaldf.Sentence_weights < 0.011].index) #threshold
finaldf[["Sentence_weights"]]

Unnamed: 0,Sentence_weights
"125 S replica 166 MM Touring Barchetta The first series produced Ferrari, the 1958 250 GT Coupé The first Ferrari-badged car was the 1947 125 S, powered by a 1.5 L V12 engine;[1] Enzo Ferrari reluctantly built and sold his automobiles to fund Scuderia Ferrari.",0.014229
"In 1933, Alfa Romeo withdrew its in-house racing team and Scuderia Ferrari took over as its works team:[1] the Scuderia received Alfa's Grand Prix cars of the latest specifications and fielded many famous drivers such as Tazio Nuvolari and Achille Varzi.",0.014111
"It was initially offered to loyal and recurring customers, each of the 399 made (minus the 400th which was donated to the Vatican for charity) had a price tag of $650,000 apiece (equivalent to £400,900).",0.013875
"[20] Ferrari's former CEO and Chairman, Luca di Montezemolo, resigned from the company after 23 years, who was succeeded by Amedeo Felisa and finally on 3 May 2016 Amedeo resigned and was succeeded by Sergio Marchionne, CEO and Chairman of Fiat Chrysler Automobiles, Ferrari's parent company.",0.013821
"An immediate result was an increase in available investment funds, and work started at once on a factory extension intended to transfer production from Fiat's Turin plant of the Ferrari engined Fiat Dino.",0.013121
"In September 1939, Ferrari left Alfa Romeo under the provision he would not use the Ferrari name in association with races or racing cars for at least four years.",0.012982
"In 1938, Alfa Romeo brought its racing operation again in-house, forming Alfa Corse in Milan and hired Enzo Ferrari as manager of the new racing department; therefore the Scuderia Ferrari was disbanded.",0.012832
"In 1989, the company was renamed Ferrari S.p.A.[19] From 2002 to 2004, Ferrari produced the Enzo, their fastest model at the time, which was introduced and named in honor of the company's founder, Enzo Ferrari.",0.012762
"On 15 September 2012, 964 Ferrari cars worth over $162 million (£99.95 million) attended the Ferrari Driving Days event at Silverstone Circuit and paraded round the Silverstone Circuit setting a world record.",0.012495
"This rear mid- engine layout would go on to be used in many Ferraris of the 1980s, 1990s and to the present day.",0.011491


In [16]:
summary_tfidf=""

for i in finaldf.index:
    summary_tfidf+=" "+i

print(summary_tfidf) #summary

 125 S replica  166 MM Touring Barchetta The first series produced Ferrari, the 1958 250 GT Coupé The first Ferrari-badged car was the 1947 125 S, powered by a 1.5 L V12 engine;[1] Enzo Ferrari reluctantly built and sold his automobiles to fund Scuderia Ferrari. In 1933, Alfa Romeo withdrew its in-house racing team and Scuderia Ferrari took over as its works team:[1] the Scuderia received Alfa's Grand Prix cars of the latest specifications and fielded many famous drivers such as Tazio Nuvolari and Achille Varzi. It was initially offered to loyal and recurring customers, each of the 399 made (minus the 400th which was donated to the Vatican for charity) had a price tag of $650,000 apiece (equivalent to £400,900). [20] Ferrari's former CEO and Chairman, Luca di Montezemolo, resigned from the company after 23 years, who was succeeded by Amedeo Felisa and finally on 3 May 2016 Amedeo resigned and was succeeded by Sergio Marchionne, CEO and Chairman of Fiat Chrysler Automobiles, Ferrari's p

In [17]:
print("The length of the original text is ",len(text_original))
print("The length of the summarized text is",len(summary_tfidf))

The length of the original text is  5010
The length of the summarized text is 2814


----