# Abstractive Text Summarization

**Abstractive text summarization generates a summary of an input text rather than selecting and rearranging existing sentences, as extractive summarization does. The model must capture the meaning of the input and produce a concise, coherent summary in natural language, which may paraphrase and restructure the original sentences.**
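
For contrast, the extractive approach mentioned above can be sketched in a few lines. This is only an illustrative baseline, not part of the model built in this notebook: the function name `extractive_summary` and the frequency-based scoring are assumptions chosen for the example. It returns sentences copied verbatim from the input, which is exactly what an abstractive model is not limited to.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring original sentences, in document order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole text (lowercased alphabetic tokens only).
    freqs = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        # Average frequency of the words in the sentence.
        words = re.findall(r'[a-z]+', sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return ' '.join(s for s in sentences if s in top)

text = "The cat sat on the mat. The cat chased the mouse. Dogs bark."
summary = extractive_summary(text, num_sentences=1)
print(summary)
```

The result is always a subset of the input sentences; no new wording is produced.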

In [None]:
!pip install tensorflow nltk datasets
!pip install --upgrade tensorflow





**Importing Required Libraries**

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, TimeDistributed
from tensorflow.keras.models import Model
import nltk
from datasets import load_dataset
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')





**Loading the CNN/Daily Mail Dataset**

**The CNN/Daily Mail dataset is a popular dataset used for training and evaluating abstractive text summarization models. It consists of news articles (from CNN and Daily Mail) paired with human-generated summaries.**

In [None]:
dataset = load_dataset("cnn_dailymail", "3.0.0")
train_data = dataset['train']
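
Each record in this dataset is a dictionary with the fields `article`, `highlights` (the human-written summary) and `id`. The sketch below shows one way article/summary pairs could be collected from such records. The helper name `to_pairs` and the toy records are made up for illustration, so the example runs without downloading anything.

```python
def to_pairs(records):
    """Collect (article, highlights) pairs, skipping records that lack either field."""
    pairs = []
    for record in records:
        article = record.get('article')
        summary = record.get('highlights')
        if article and summary:
            pairs.append((article, summary))
    return pairs

# Made-up records that only mimic the dataset's field layout.
toy_records = [
    {'id': 'a1', 'article': 'Long news story text ...', 'highlights': 'Short summary.'},
    {'id': 'a2', 'article': 'Another story ...'},  # no summary, so it is skipped
]
pairs = to_pairs(toy_records)
print(pairs)
```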

**Creating a function that replaces each word with the first lemma of its most common WordNet sense (a first-sense baseline for Word Sense Disambiguation)**

In [None]:
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def disambiguate_word_senses(sentence):
    words = word_tokenize(sentence)
    lemmatized_sentence = []
    for word in words:
        synsets = wn.synsets(word)
        if synsets:
            lemmatized_sentence.append(synsets[0].lemmas()[0].name())  # Choose the most common sense
        else:
            lemmatized_sentence.append(word)
    return ' '.join(lemmatized_sentence)

# Assuming train_data is a list of dictionaries with key 'article'
preprocessed_texts = [disambiguate_word_senses(article['article']) for article in train_data if 'article' in article]

In [None]:
preprocessed_texts

["London , England ( Reuters ) -- harass potter star Daniel Radcliffe addition entree to angstrom report £20 million ( $ 41.1 million ) luck arsenic helium bend eighteen on Monday , merely helium insist the money wo n't cast angstrom enchantment on him . Daniel Radcliffe arsenic harass potter inch `` harass potter and the order of the Phoenix '' To the disappointment of chitchat columnist about the universe , the young actor say helium hour_angle no plan to fritter his cash away on fast car , drink and celebrity party . `` iodine bash n't plan to beryllium one of those people World_Health_Organization , arsenic soon arsenic they bend eighteen , suddenly bargain themselves angstrom massive sport car collection Oregon something similar , '' helium state Associate_in_Nursing Australian interviewer earlier this calendar_month . `` iodine bash n't think iodine 'll beryllium particularly excessive . `` The things iodine like buying are things that cost about ten pound -- book and cadmium and
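
In the output above, short function words are replaced by unrelated nouns ("he" becomes "helium", "as" becomes "arsenic", "a" becomes "angstrom", "an" becomes "Associate_in_Nursing"), because the first WordNet entry is taken for every token regardless of context. One possible mitigation, sketched here under assumptions, is to leave a set of function words untouched. The dictionary `toy_first_lemma` is a hypothetical stand-in for the WordNet lookup and `toy_skip` is a hand-picked word set; in practice a stop-word list such as `nltk.corpus.stopwords` could play that role.

```python
def first_sense_replace(words, first_lemma, skip_words):
    """Replace each word by its first-sense lemma unless it is in skip_words."""
    out = []
    for word in words:
        if word.lower() in skip_words:
            out.append(word)  # keep function words unchanged
        else:
            # Fall back to the original word when no entry exists.
            out.append(first_lemma.get(word.lower(), word))
    return out

# Hypothetical stand-in for the WordNet lookup, using mappings visible in the output above.
toy_first_lemma = {'he': 'helium', 'as': 'arsenic', 'a': 'angstrom',
                   'an': 'Associate_in_Nursing', 'told': 'state'}
toy_skip = {'he', 'as', 'a', 'an', 'the'}

words = ['he', 'told', 'an', 'interviewer']
without_skip = first_sense_replace(words, toy_first_lemma, set())
with_skip = first_sense_replace(words, toy_first_lemma, toy_skip)
print(without_skip)
print(with_skip)
```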

**Functions for saving models and variables to a pickle file on the local machine, so they can be reused in a later runtime**

In [None]:
import pickle
import os

In [None]:
# Save preprocessed_text
preprocessed_texts_path = r"/Desktop/Abstractive Text Summarisation Minor Project"

In [None]:
# Function for saving model, variable for another runtime
def save_work(path, var_str, var_name):
    temp = os.path.join(path, var_name)

    # Create the directory if it doesn't exist
    os.makedirs(os.path.dirname(temp), exist_ok=True)

    absolute_path = os.path.abspath(temp)
    print(f"Saving file to: {absolute_path}")

    try:
        if not os.path.exists(temp):
            with open(temp, 'wb') as f:
                pickle.dump(var_str, f)
            print("File saved successfully.")
            return True
        else:
            print("File already exists.")
            return False
    except Exception as e:
        print(f"An error occurred: {e}")
        return False

In [None]:
# Function for loading model, variable for another runtime
def load_work(path,var_name):
    temp = os.path.join(path,var_name)
    if os.path.exists(temp):
        with open(temp, 'rb') as f:
            var_name = pickle.load(f)
        return var_name
    else:
        return False
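
The save-once and load behaviour of the two helpers above can be tried in isolation with a temporary directory. The sketch below uses compact, hypothetical variants named `save_once` and `load_or_none`. Unlike `load_work`, the loader returns `None` for a missing file, so a stored value of `False` would remain distinguishable from "not found".

```python
import os
import pickle
import tempfile

def save_once(directory, obj, name):
    """Pickle obj to directory/name unless that file already exists."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, name)
    if os.path.exists(target):
        return False
    with open(target, 'wb') as f:
        pickle.dump(obj, f)
    return True

def load_or_none(directory, name):
    """Unpickle directory/name, or return None when the file is missing."""
    target = os.path.join(directory, name)
    if not os.path.exists(target):
        return None
    with open(target, 'rb') as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    first = save_once(tmp, ['some', 'texts'], 'demo')
    second = save_once(tmp, ['other'], 'demo')  # refused: the file already exists
    restored = load_or_none(tmp, 'demo')
    missing = load_or_none(tmp, 'absent')
```

Because an existing file is never overwritten, a stale pickle has to be deleted by hand before a changed variable can be saved under the same name.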

**Invoking the function to save the preprocessed text**

In [None]:
save_work(preprocessed_texts_path,preprocessed_texts,var_name="preprocessed_texts")

Saving file to: C:\Desktop\Abstractive Text Summarisation Minor Project\preprocessed_texts
File already exists.


False

In [None]:
preprocessed_texts = load_work(preprocessed_texts_path,var_name = "preprocessed_texts")

In [None]:
preprocessed_texts

**Building the Encoder-Decoder (Seq2Seq) Network**

In [None]:
# Model parameters
vocab_size = 10000  # Choose your vocabulary size
embedding_dim = 256
lstm_units = 512

# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_size, embedding_dim)(encoder_inputs)
# return_sequences=True exposes the per-step outputs that attention needs
encoder_lstm = LSTM(lstm_units, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(vocab_size, embedding_dim)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(lstm_units, return_sequences=True, return_state=True)
decoder_lstm_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = TimeDistributed(Dense(vocab_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# Attention Mechanism
from tensorflow.keras.layers import Attention, Concatenate

attention_layer = Attention()
# Query = decoder LSTM outputs, value = encoder outputs (both lstm_units wide)
context_vector, attention_weights = attention_layer(
    [decoder_lstm_outputs, encoder_outputs], return_attention_scores=True)
decoder_combined_context = Concatenate(axis=-1)([context_vector, decoder_lstm_outputs])
# Note: decoder_combined_context is not connected to the output layer,
# so the model below is a plain Seq2Seq network without attention.

# Seq2Seq Model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
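
The Keras `Attention` layer used above implements dot-product (Luong-style) attention. As a rough illustration of what it computes for a single decoder step, the sketch below scores each encoder step against the decoder vector, normalises the scores with a softmax, and returns the weighted sum. It is plain Python with toy two-dimensional vectors, and the function name `dot_product_attention` is chosen for the example only.

```python
import math

def dot_product_attention(query, keys, values):
    """Return (context, weights) for one query vector over a list of key/value vectors."""
    # Similarity of the query to every key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

encoder_steps = [[1.0, 0.0], [0.0, 1.0]]  # toy encoder outputs, one vector per time step
decoder_step = [2.0, 0.0]                 # toy decoder output for the current step
context, weights = dot_product_attention(decoder_step, encoder_steps, encoder_steps)
print(weights, context)
```

The first encoder step receives the larger weight because it points in the same direction as the decoder vector.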





**Model Parameters**

In [None]:
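vocab_size = 6000      # Used by the tokenizer; the network above was already built with vocab_size = 10000
max_seq_length = 350   # Overridden by max_seq_length = 100 in the tokenization cell below
embedding_dim = 256
lstm_units = 512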

**Tokenizing and padding the text, then preparing the encoder and decoder inputs**

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Initialize the tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<unk>')
tokenizer.fit_on_texts(preprocessed_texts)

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(preprocessed_texts)

# Determine the desired sequence length (you can set this to a fixed value)
max_seq_length = 100

# Pad or truncate sequences to the desired length
encoder_input_data = pad_sequences(sequences, maxlen=max_seq_length, padding='post')

# Prepare decoder input data and output data
# Note: both are derived from the article text itself; the dataset's 'highlights' are not used here
# Shift target sequences by one time step
decoder_input_data = [seq[:-1] for seq in sequences]  # Decoder input is the same as target, without the last word
decoder_target_data = [seq[1:] for seq in sequences]  # Decoder target is one step ahead

# Pad or truncate decoder sequences to the same length as encoder sequences
decoder_input_data = pad_sequences(decoder_input_data, maxlen=max_seq_length, padding='post')
decoder_target_data = pad_sequences(decoder_target_data, maxlen=max_seq_length, padding='post')

# Convert lists to NumPy arrays
encoder_input_data = np.array(encoder_input_data)
decoder_input_data = np.array(decoder_input_data)
decoder_target_data = np.array(decoder_target_data)  # Ensure this is an array of integers

# Compile the model
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
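
The one-step shift between decoder input and decoder target (teacher forcing) and the effect of padding can be checked on a tiny example. The sketch below is plain Python and only mimics the behaviour requested from `pad_sequences` above: with `padding='post'` zeros are appended at the end, and with the default `truncating='pre'` over-long sequences lose tokens from the front. The helper names `pad_post` and `teacher_forcing_pair` are hypothetical.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Keep the last maxlen tokens, then right-pad with pad_value."""
    trimmed = seq[-maxlen:]
    return trimmed + [pad_value] * (maxlen - len(trimmed))

def teacher_forcing_pair(seq):
    """Decoder input drops the last token; the target drops the first."""
    return seq[:-1], seq[1:]

tokens = [5, 8, 2, 9]
dec_in, dec_target = teacher_forcing_pair(tokens)
padded_in = pad_post(dec_in, 5)
padded_target = pad_post(dec_target, 5)
truncated = pad_post([1, 2, 3, 4, 5, 6], 4)
print(padded_in, padded_target, truncated)
```

At every position the target holds the token that follows the corresponding input token, which is what the decoder is trained to predict.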





**Defining the number of epochs and the batch size for training**

In [None]:
epochs = 3

# Set batch size
batch_size = 64

# Calculate the total number of samples
total_samples = len(encoder_input_data)

# Determine the number of batches
num_batches = total_samples // batch_size

In [None]:
# The tqdm library is used to monitor training progress
from tqdm import tqdm

# Calculate the total number of batches including the remaining samples
total_batches = (total_samples + batch_size - 1) // batch_size

# Train the model in batches
for epoch in range(epochs):
    print(f'Epoch {epoch + 1}/{epochs}')

    # Create tqdm progress bar for batches
    progress_bar = tqdm(range(total_batches), desc=f'Epoch {epoch + 1}/{epochs}')

    for batch in progress_bar:
        start_idx = batch * batch_size
        end_idx = min((batch + 1) * batch_size, total_samples)
        encoder_batch_data = encoder_input_data[start_idx:end_idx]
        decoder_input_batch = decoder_input_data[start_idx:end_idx]
        decoder_target_batch = decoder_target_data[start_idx:end_idx]

        # Train the model on the current batch
        loss = model.train_on_batch([encoder_batch_data, decoder_input_batch], decoder_target_batch)

        # Update the progress bar description with the current loss
        progress_bar.set_postfix(loss=loss, refresh=True)
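
The batch arithmetic in the loop above can be verified on small numbers. The sketch below reproduces the ceiling division and the `min(...)` clamp in a hypothetical helper `batch_ranges`, and shows that the last, shorter batch would be lost with plain floor division.

```python
def batch_ranges(total_samples, batch_size):
    """Return (start, end) index pairs that cover all samples, last batch included."""
    total_batches = (total_samples + batch_size - 1) // batch_size  # ceiling division
    return [(b * batch_size, min((b + 1) * batch_size, total_samples))
            for b in range(total_batches)]

ranges = batch_ranges(10, 4)
full_batches_only = 10 // 4  # floor division ignores the final partial batch
print(ranges, full_batches_only)
```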

Epoch 1/3


Epoch 1/3:   0%|                                                                              | 0/4487 [00:00<?, ?it/s]




Epoch 1/3: 100%|██████████████████████████████████████████████████████| 4487/4487 [3:03:06<00:00,  2.45s/it, loss=1.03]


Epoch 2/3


Epoch 2/3: 100%|█████████████████████████████████████████████████████| 4487/4487 [3:01:36<00:00,  2.43s/it, loss=0.392]


Epoch 3/3


Epoch 3/3: 100%|█████████████████████████████████████████████████████| 4487/4487 [3:01:04<00:00,  2.42s/it, loss=0.185]


**Saving the Model onto the local machine**

In [None]:
model_path = r'/Desktop/Abstractive Text Summarisation Minor Project'

In [None]:
save_work(model_path,model,var_name='abstract_text_summarision_pickle_model')

Saving file to: C:\Desktop\Abstractive Text Summarisation Minor Project\abstract_text_summarision_pickle_model
File saved successfully.


True

In [None]:
model_pickel = load_work(model_path,var_name='abstract_text_summarision_pickle_model')

In [None]:
model_pickel

<keras.src.engine.functional.Functional at 0x24e0f130b50>

In [None]:
model.save(r'/Desktop/Abstractive Text Summarisation Minor Project/ATS_model_h5_.h5')



**Loading the Model for testing purposes**

In [None]:
from tensorflow.keras.models import load_model

# Specify the full path to your saved model
model_path_h5_local = r'/Desktop/Abstractive Text Summarisation Minor Project/ATS_model_h5_.h5'

# Load the model
model_h5 = load_model(model_path_h5_local)

In [None]:
model_pickel

<keras.src.engine.functional.Functional at 0x24e0f130b50>

In [None]:
model_h5

<keras.src.engine.functional.Functional at 0x24e0edb6290>

In [None]:
model

<keras.src.engine.functional.Functional at 0x24d08836bf0>

**Defining the Encoder and Decoder Models Used for Inference**

In [None]:
# Define the encoder model
encoder_model = Model(encoder_inputs, encoder_states)

# Define the decoder model
decoder_state_input_h = Input(shape=(lstm_units,))
decoder_state_input_c = Input(shape=(lstm_units,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2 = dec_emb_layer(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs2] + decoder_states2)


**Utility function for Summarizing the Text**

In [None]:
def summarize_text(input_text):
    # Convert input_text to sequences and pad
    input_seq = tokenizer.texts_to_sequences([input_text])
    input_seq = pad_sequences(input_seq, maxlen=max_seq_length, padding='post')

    # Encode input and retrieve initial decoder state
    states_value = list(encoder_model.predict(input_seq, verbose=0))

    # Generate empty target sequence of length 1
    target_seq = np.zeros((1, 1))
    # Populate the first position with the start word. No dedicated start/end
    # markers were added during preprocessing, so this relies on the ordinary
    # words 'start' and 'end' being present in the tokenizer vocabulary.
    target_seq[0, 0] = tokenizer.word_index['start']

    stop_condition = False
    decoded_words = []

    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=0)

        # Pick the most probable token (greedy decoding)
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        # Index 0 is the padding value and has no entry in index_word
        sampled_word = tokenizer.index_word.get(sampled_token_index, '')
        if sampled_word:
            decoded_words.append(sampled_word)

        # Exit condition: padding or the stop word was produced, or 50 words were generated
        if sampled_word in ('', 'end') or len(decoded_words) >= 50:
            stop_condition = True

        # Update the target sequence (of length 1)
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return ' '.join(decoded_words)
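
The control flow of the decoding loop above (feed the last token back in, take the most probable next token, stop on an end marker or a length limit) can be illustrated without a trained model. In the sketch below the hypothetical `toy_step` function stands in for `decoder_model.predict` and simply looks up made-up scores in a table. Unlike the real decoder it carries no LSTM state, and the end marker is not included in the output.

```python
def greedy_decode(step_fn, start_token, end_token, max_tokens):
    """Repeatedly pick the highest-scoring next token until end_token or max_tokens."""
    tokens = []
    current = start_token
    while len(tokens) < max_tokens:
        scores = step_fn(current)                  # mapping: next token -> score
        current = max(scores, key=scores.get)      # greedy choice (argmax)
        if current == end_token:
            break
        tokens.append(current)
    return tokens

# Made-up next-token scores, standing in for the decoder's softmax output.
toy_table = {
    'start': {'markets': 0.7, 'rise': 0.2, 'end': 0.1},
    'markets': {'rise': 0.6, 'markets': 0.1, 'end': 0.3},
    'rise': {'end': 0.9, 'markets': 0.1},
}

def toy_step(token):
    return toy_table[token]

summary_tokens = greedy_decode(toy_step, 'start', 'end', max_tokens=10)
capped_tokens = greedy_decode(toy_step, 'start', 'end', max_tokens=1)
print(summary_tokens, capped_tokens)
```

Greedy decoding commits to the single best token at each step; alternatives such as beam search keep several candidates but are outside the scope of this notebook.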

**Testing the Model**

In [None]:
input_article = '''As results for assembly elections in four states - Chhattisgarh, Rajasthan, Madhya Pradesh and Telangana - pour in, the BJP looks to score a big win in three heartland states. In Telangana, K Chandrasekhar Rao-led BRS has accepted defeat against the Congress.
Prime Minister Narendra Modi is addressing the party workers at the party headquarters in Delhi shortly

BJP leaders credited Prime Minister Narendra Modi's leadership, Amit Shah's strategy and party's welfare policies for the positive trends that show a clear victory for the party in three states. The opposition, however, claims that the results will not have any impact on the Lok Sabha elections in 2024.  '''
# This is the input article that you want to summarize.

# You can use the summarize function to generate a summary:
summary = summarize_text(input_article)

print("Generated Summary:", summary)


Generated Summary:
Prime Minister Narendra Modi is addressing the BJP workers at the party headquarters in Delhi shortly BJP leaders credited Prime Minister Narendra Modi's leadership, Amit Shah's strategy and party's welfare policies for the positive trends that show a clear victory for the party in three states.
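
A generated summary is usually compared against a reference summary with an overlap measure. The sketch below is a simplified unigram-overlap score in the spirit of ROUGE-1, offered only as an illustration: it splits on whitespace, applies no stemming, and is not a replacement for an established ROUGE implementation. The function name `rouge1` and the toy sentences are assumptions.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Return (precision, recall, f1) of unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(p, r, f)
```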
