<a href="https://colab.research.google.com/github/Nehalshetta/SumText/blob/model/sum_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <span style="color:black">I. Dataset Preprocessing</span>
## <span style="color:black">This code block <span style="color:green">imports</span> every Python library used in the notebook</span>

The next two cells install NLTK, download its data, and import the libraries used for text processing and modelling: **pandas** for data manipulation, **re** and **string** for regular expressions and string operations, **unicodedata** for Unicode normalization, **nltk** for natural language processing (the **stopwords** corpus for removing common words, **word_tokenize** for splitting text into words, **WordNetLemmatizer** for lemmatization, and **PorterStemmer** for stemming), **BeautifulSoup** for parsing HTML, and **TensorFlow/Keras** for tokenizing, padding, and building the models.

In [None]:
!pip install nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
import pandas as pd
import re
import string
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Conv1D, MaxPooling1D, Flatten, Concatenate, Dropout
from tensorflow.keras.optimizers import Adam
# Import from tensorflow.keras (not standalone keras) so all Keras objects come from one package
from tensorflow.keras.utils import to_categorical

In the first cell above, **nltk.download('punkt')** downloads the Punkt tokenizer models that NLTK's **word_tokenize** function relies on, **nltk.download('stopwords')** downloads the stopword lists, and **nltk.download('wordnet')** downloads the WordNet lexical database used by **WordNetLemmatizer**. Without these resources, the preprocessing steps later in the notebook raise a LookupError.

The next two cells set the dataset directory and download the CNN/DailyMail dataset from Kaggle. The download cell:

- Creates a directory named **.kaggle** in the user's home directory.
- Copies the **kaggle.json** file into the **.kaggle** directory, where the Kaggle API looks for its credentials.
- Restricts the **kaggle.json** file to read and write access for the owner only (**chmod 600**).
- Downloads the dataset from Kaggle using the Kaggle CLI.
- Extracts the contents of the downloaded zip file into a directory named **dataset-folder**.
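
For readers who prefer to stay in Python, the credential and extraction steps listed above can be sketched with the standard library. This is a minimal sketch under assumptions, not the notebook's method: the helper names **install_kaggle_credentials** and **extract_archive** are hypothetical, and the download itself is still left to the Kaggle CLI.

```python
import os
import shutil
import stat
import zipfile
from pathlib import Path


def install_kaggle_credentials(src, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to its owner."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # like `mkdir -p ~/.kaggle`
    dest = kaggle_dir / "kaggle.json"
    shutil.copyfile(src, dest)                      # like `cp kaggle.json ~/.kaggle/`
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)     # like `chmod 600`
    return dest


def extract_archive(zip_path, dest_dir):
    """Extract a zip archive into dest_dir and return the archived file names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)                # like `unzip <zip> -d <dest_dir>`
        return archive.namelist()
```

In Colab, the calls would mirror the shell commands, for example `install_kaggle_credentials('/content/kaggle.json')` followed by `extract_archive('/content/newspaper-text-summarization-cnn-dailymail.zip', 'dataset-folder')`.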

In [None]:
# Define dataset directory
dataset_directory = '/content/dataset-folder/cnn_dailymail/'

In [None]:
# download the dataset
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d gowrishankarp/newspaper-text-summarization-cnn-dailymail

!unzip /content/newspaper-text-summarization-cnn-dailymail.zip -d dataset-folder

Downloading newspaper-text-summarization-cnn-dailymail.zip to /content
 97% 487M/503M [00:03<00:00, 168MB/s]
100% 503M/503M [00:03<00:00, 157MB/s]
Archive:  /content/newspaper-text-summarization-cnn-dailymail.zip
  inflating: dataset-folder/cnn_dailymail/test.csv  
  inflating: dataset-folder/cnn_dailymail/train.csv  
  inflating: dataset-folder/cnn_dailymail/validation.csv  


In [None]:
# Load the dataset
train = pd.read_csv(dataset_directory + 'train.csv')
validation = pd.read_csv(dataset_directory + 'validation.csv')
test = pd.read_csv(dataset_directory + 'test.csv')

In [None]:
# Remove unnecessary columns
columns_to_keep = ['article', 'highlights']
train = train[columns_to_keep]
validation = validation[columns_to_keep]
test = test[columns_to_keep]

The **clean_text** function below prepares raw text for modelling in six steps: it strips HTML tags, converts the text to lowercase, removes accents, tokenizes the text into words, drops stopwords and non-alphabetic tokens (punctuation and numbers), and lemmatizes the remaining words. It is applied to both the **article** and **highlights** columns of every split.
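
To make the individual stages concrete, here is a dependency-free sketch that approximates them on one sentence. It is an illustration only: the name **demo_clean** and the tiny stopword set are hypothetical, a regular expression stands in for NLTK's **word_tokenize**, and lemmatization is left out.

```python
import re
import unicodedata

# Hypothetical, tiny stopword set used only for this illustration
DEMO_STOP_WORDS = {"in", "the", "a", "of"}


def demo_clean(text):
    """Approximate the cleaning stages using only the standard library."""
    text = re.sub(r"<.*?>", "", text)             # strip HTML tags
    text = text.lower()                           # lowercase
    text = (unicodedata.normalize("NFKD", text)   # split accents from letters
            .encode("ascii", "ignore")            # drop the non-ASCII accent marks
            .decode("ascii"))
    tokens = re.findall(r"\w+|[^\w\s]", text)     # rough word/punctuation split
    # Keep alphabetic tokens that are not stopwords
    tokens = [t for t in tokens if t.isalpha() and t not in DEMO_STOP_WORDS]
    return " ".join(tokens)


print(demo_clean("<p>Café prices rose 5% in 2023.</p>"))  # cafe prices rose
```

The number, the percent sign, and the full stop disappear because they are not alphabetic, and "in" is removed as a stopword.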

In [None]:
# Define function to clean text data
# Build the stopword set and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove HTML tags and other markup
    cleaned = re.sub(r'<.*?>', '', text)

    # Convert to lowercase and remove accents
    cleaned = cleaned.lower()
    cleaned = unicodedata.normalize('NFKD', cleaned).encode('ascii', 'ignore').decode('utf-8', 'ignore')

    # Tokenize text
    tokens = word_tokenize(cleaned)

    # Remove stopwords and non-alphabetic tokens (punctuation, numbers)
    tokens = [t for t in tokens if t not in stop_words and t.isalpha()]

    # Lemmatize words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # Join tokens back into a string
    return ' '.join(tokens)


In [None]:
# Clean the text in the train dataset
train['article'] = train['article'].apply(clean_text)
train['highlights'] = train['highlights'].apply(clean_text)

In [None]:
# Clean the text in the validation dataset
validation['article'] = validation['article'].apply(clean_text)
validation['highlights'] = validation['highlights'].apply(clean_text)

In [None]:
# Clean the text in the test dataset
test['article'] = test['article'].apply(clean_text)
test['highlights'] = test['highlights'].apply(clean_text)

In [None]:
# Save cleaned datasets into new CSV files
train.to_csv(dataset_directory + 'cleaned_train.csv', index=False)
validation.to_csv(dataset_directory + 'cleaned_validation.csv', index=False)
test.to_csv(dataset_directory + 'cleaned_test.csv', index=False)

In [None]:
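# Load cleaned datasets
# (a text that was cleaned down to an empty string is read back as NaN, so restore it to '')
cleaned_train = pd.read_csv(dataset_directory + 'cleaned_train.csv').fillna('')
cleaned_validation = pd.read_csv(dataset_directory + 'cleaned_validation.csv').fillna('')
cleaned_test = pd.read_csv(dataset_directory + 'cleaned_test.csv').fillna('')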

In [None]:
# Combine all the text data into one list
text_data = list(cleaned_train['article']) + list(cleaned_validation['article']) + list(cleaned_test['article'])

# Tokenize the text data
vocab_size = 100000
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(text_data)

# Convert the text data into numerical tokens
train_tokens = tokenizer.texts_to_sequences(cleaned_train['article'])
val_tokens = tokenizer.texts_to_sequences(cleaned_validation['article'])
test_tokens = tokenizer.texts_to_sequences(cleaned_test['article'])

# Pad the sequences to ensure they all have the same length
max_len = 512
train_tokens_padded = pad_sequences(train_tokens, maxlen=max_len, padding='post', truncating='post')
val_tokens_padded = pad_sequences(val_tokens, maxlen=max_len, padding='post', truncating='post')
test_tokens_padded = pad_sequences(test_tokens, maxlen=max_len, padding='post', truncating='post')

# Print the tokenized and padded sequences
print(train_tokens_padded)
print(val_tokens_padded)
print(test_tokens_padded)

[[ 1381   563   160 ...     0     0     0]
 [   37  6931  6002 ...     0     0     0]
 [ 2693   481   163 ...     0     0     0]
 ...
 [  504  4448  5598 ...     0     0     0]
 [ 4700 12630   160 ...     0     0     0]
 [   37    71   809 ...   773  1581    97]]
[[ 5499  8832 17397 ...     0     0     0]
 [  776   482 17087 ...     0     0     0]
 [   53   939   584 ...     0     0     0]
 ...
 [ 3483   767  3839 ...     0     0     0]
 [  640   208  2862 ...     0     0     0]
 [  515   128   744 ...     0     0     0]]
[[  308  2210   545 ...     0     0     0]
 [ 2693  2232   227 ...     0     0     0]
 [22285 17981  7272 ...     0     0     0]
 ...
 [ 1116  2147  5879 ... 16618   247   618]
 [ 3225  4310  4268 ...     0     0     0]
 [  929   105 13451 ...     0     0     0]]
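
The effect of the **oov_token**, **num_words**, **padding='post'**, and **truncating='post'** settings used in the cell above can be illustrated with a small pure-Python sketch. It imitates the behaviour of the Keras **Tokenizer** and **pad_sequences** on toy data; the helper names **build_word_index**, **to_sequence**, and **pad_post** are hypothetical, and the real classes handle more cases (character filters, other padding modes).

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Index 1 is the OOV token; other words follow by descending frequency."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to ids; unknown words and ids >= num_words become the OOV id."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.split():
        idx = word_index.get(word, oov_id)
        ids.append(idx if idx < num_words else oov_id)
    return ids


def pad_post(sequence, maxlen):
    """Truncate at the end, then pad with zeros at the end."""
    trimmed = sequence[:maxlen]
    return trimmed + [0] * (maxlen - len(trimmed))


texts = ["police said police", "court said"]
word_index = build_word_index(texts)   # police=2, said=3, court=4
seq = to_sequence("police said judge court", word_index, num_words=4)
print(pad_post(seq, maxlen=6))         # [2, 3, 1, 1, 0, 0]
```

Here "judge" was never seen and "court" has an id outside the **num_words** limit, so both map to the OOV id 1. The trailing zeros are the padding that appears at the end of most rows in the output above.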


In [None]:
# Define the model architecture for extractive summarization
def extractive_summarization_model(input_dim, output_dim, hidden_units):
    # Encoder
    encoder_inputs = Input(shape=(None,))
    encoder_embedding = Embedding(input_dim, hidden_units, mask_zero=True)(encoder_inputs)
    encoder_cnn = Conv1D(filters=64, kernel_size=3, activation='relu', padding='same')(encoder_embedding)
    encoder_lstm = LSTM(hidden_units, return_sequences=True, return_state=True, dropout=0.2, recurrent_dropout=0.2)
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_cnn)
    encoder_states = [state_h, state_c]

    # Decoder
    decoder_inputs = Input(shape=(None,))
    decoder_embedding = Embedding(output_dim, hidden_units, mask_zero=True)(decoder_inputs)
    decoder_lstm = LSTM(hidden_units, return_sequences=True, return_state=True, dropout=0.2, recurrent_dropout=0.2)
    decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
    attention = Attention()([decoder_outputs, encoder_outputs])
    decoder_attention = Concatenate()([decoder_outputs, attention])
    decoder_dense = Dense(output_dim, activation='softmax')
    decoder_outputs = decoder_dense(decoder_attention)

    # Define the model
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    return model

# Define the model architecture for abstractive summarization
def abstractive_summarization_model(input_dim, output_dim, hidden_units):
    # Encoder
    encoder_inputs = Input(shape=(None,))
    encoder_embedding = Embedding(input_dim, hidden_units, mask_zero=True)(encoder_inputs)
    encoder_lstm = LSTM(hidden_units, return_state=True, dropout=0.2, recurrent_dropout=0.2)
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
    encoder_states = [state_h, state_c]

    # Decoder
    decoder_inputs = Input(shape=(None,))
    decoder_embedding = Embedding(output_dim, hidden_units, mask_zero=True)(decoder_inputs)
    decoder_lstm = LSTM(hidden_units, return_sequences=True, return_state=True, dropout=0.2, recurrent_dropout=0.2)
    decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
    decoder_dense = Dense(output_dim, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)

    # Define the model
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    return model

# Define the model parameters
input_dim = len(tokenizer.word_index) + 1
output_dim = len(tokenizer.word_index) + 1
hidden_units = 256

# Build the extractive summarization model
extractive_model = extractive_summarization_model(input_dim, output_dim, hidden_units)
extractive_model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy')

# Build the abstractive summarization model
abstractive_model = abstractive_summarization_model(input_dim, output_dim, hidden_units)
abstractive_model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy')

# Print the model summaries
extractive_model.summary()
abstractive_model.summary()



Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding (Embedding)          (None, None, 256)    111964672   ['input_1[0][0]']                
                                                                                                  
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 conv1d (Conv1D)                (None, None, 64)     49216       ['embedding[0][0]']              
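
The parameter counts in the summary above can be checked by hand. The sketch below is plain arithmetic, assuming the usual formulas for an embedding table (rows × columns) and a 1D convolution ((kernel size × input channels + 1 bias) × filters); the function names are hypothetical.

```python
def embedding_params(input_dim, output_dim):
    """An embedding table stores one output_dim-sized vector per input id."""
    return input_dim * output_dim


def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return (kernel_size * in_channels + 1) * filters


hidden_units = 256

# Rows of the embedding table implied by the reported parameter count
rows = 111_964_672 // hidden_units
print(rows)                                         # 437362

# Conv1D(filters=64, kernel_size=3) applied to 256-channel embeddings
print(conv1d_params(3, hidden_units, 64))           # 49216

# For comparison: an embedding sized by vocab_size = 100000 instead
print(embedding_params(100_000, hidden_units))      # 25600000
```

The 437,362 rows correspond to **len(tokenizer.word_index) + 1**, because **word_index** keeps every word seen during fitting. With **num_words=vocab_size**, however, the tokenizer only emits ids below 100,000, so a table sized by **vocab_size** would cover every id that can actually occur.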
                                                                                              

In [None]:
# Define the target data for extractive and abstractive models
# (the 'article' column is reused as the target here; 'highlights' holds the reference summaries)
train_target_data = cleaned_train['article']
val_target_data = cleaned_validation['article']

# Tokenize the target data
train_target_tokens = tokenizer.texts_to_sequences(train_target_data)
val_target_tokens = tokenizer.texts_to_sequences(val_target_data)

# Pad the target sequences to ensure they have the same length as input sequences
train_target_tokens_padded = pad_sequences(train_target_tokens, maxlen=max_len, padding='post', truncating='post')
val_target_tokens_padded = pad_sequences(val_target_tokens, maxlen=max_len, padding='post', truncating='post')

# Keep the targets as integer token ids and use the sparse loss: one-hot encoding
# every position over vocab_size classes with to_categorical would not fit in memory
loss = 'sparse_categorical_crossentropy'
extractive_model.compile(optimizer=Adam(learning_rate=0.001), loss=loss)
abstractive_model.compile(optimizer=Adam(learning_rate=0.001), loss=loss)

# Training settings shared by both models
batch_size = 32
epochs = 10

# Train the extractive summarization model
extractive_model.fit([train_tokens_padded, train_tokens_padded], train_target_tokens_padded,
                     validation_data=([val_tokens_padded, val_tokens_padded], val_target_tokens_padded),
                     batch_size=batch_size, epochs=epochs)

# Train the abstractive summarization model
abstractive_model.fit([train_tokens_padded, train_tokens_padded], train_target_tokens_padded,
                      validation_data=([val_tokens_padded, val_tokens_padded], val_target_tokens_padded),
                      batch_size=batch_size, epochs=epochs)
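
How the training targets are encoded has a large effect on memory use. The arithmetic below is a rough sketch, assuming float32 one-hot vectors and int32 token ids; **max_len** and **vocab_size** are the values set earlier in the notebook, and the function names are hypothetical.

```python
def one_hot_bytes(num_examples, max_len, vocab_size, bytes_per_value=4):
    """Size of one-hot targets: one vocab_size-wide row per token position."""
    return num_examples * max_len * vocab_size * bytes_per_value


def integer_bytes(num_examples, max_len, bytes_per_value=4):
    """Size of integer targets: one token id per position."""
    return num_examples * max_len * bytes_per_value


max_len = 512
vocab_size = 100_000

print(one_hot_bytes(1, max_len, vocab_size) / 1e6)  # 204.8  (MB per example)
print(integer_bytes(1, max_len) / 1e3)              # 2.048  (KB per example)
```

Integer targets pair with the **sparse_categorical_crossentropy** loss, which expects class ids, while **categorical_crossentropy** expects one-hot rows.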
