<a href="https://colab.research.google.com/github/Atfssene/FRASA/blob/main/Text_Summarization_Model_FRASA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization Model

In this notebook, we build a model for the text summarization task. TextRank and SumBasic serve as feature extractors: each assigns a weight to every sentence, and those weights are fed to a neural network. Let's start!

## Import libraries

In [None]:
!pip install Sastrawi
# !pip install fasttext

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209kB)

In [None]:
# Import library
import pandas as pd
import numpy as np
import re
import networkx as nx
import tensorflow as tf
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize, word_tokenize
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from sklearn.metrics.pairwise import cosine_similarity

# For pre trained text embedding from FastText
# import gzip
# import fasttext
# import fasttext.util

factory = StopWordRemoverFactory()
stop_words = factory.get_stop_words()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read data

In [None]:
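import ast

train = tf.keras.utils.get_file('train.csv', 'https://raw.githubusercontent.com/Atfssene/FRASA/main/Text%20Summarization/train.csv')
test = tf.keras.utils.get_file('test.csv', 'https://raw.githubusercontent.com/Atfssene/FRASA/main/Text%20Summarization/test.csv')

# 'labels' holds stringified lists such as "[False, True, ...]"; ast.literal_eval parses
# them without executing arbitrary code. dtype is given per column because pandas warns
# when the same column has both a dtype and a converter.
text_dtypes = {'paragraphs': str, 'summary': str}
df_train = pd.read_csv(train, dtype=text_dtypes, converters={'labels': ast.literal_eval})
df_test = pd.read_csv(test, dtype=text_dtypes, converters={'labels': ast.literal_eval})
df_train.info()
# df_test.info()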

Downloading data from https://raw.githubusercontent.com/Atfssene/FRASA/main/Text%20Summarization/train.csv
Downloading data from https://raw.githubusercontent.com/Atfssene/FRASA/main/Text%20Summarization/test.csv


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15012 entries, 0 to 15011
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   labels      15012 non-null  object
 1   paragraphs  15012 non-null  object
 2   summary     15012 non-null  object
dtypes: object(3)
memory usage: 352.0+ KB

## Preprocess Data

In [None]:
# Each row holds [labels, paragraphs, summary]; preprocess() is applied row by row.
# Labels are converted from False/True to 0/1 in a later cell.
# Paragraphs are split into sentences and each sentence is cleaned.

def preprocess(paragraph):
  """Return the cleaned sentences of one paragraph as a list of strings."""
  # Split into sentences, lowercase each one, then re-split the lowercased text
  sentences = [part
               for sentence in sent_tokenize(paragraph)
               for part in sent_tokenize(sentence.lower())]
  processed = []
  for text in sentences:
    text = re.sub(r"[^a-zA-Z]", " ", text)    # keep letters only
    text = re.sub(r"\b\w{1,3}\b", " ", text)  # drop words of one to three characters
    text = " ".join(word for word in text.split() if word not in stop_words)
    processed.append(text)
  return processed

# Trial run ("coba" means "try") on the first five rows only
df_train['coba'] = df_train[:5].apply(lambda row: preprocess(row['paragraphs']), axis=1)
df_train

| | labels | paragraphs | summary | coba |
|---|---|---|---|---|
| 0 | [False, True, True, True, False, False, False,... | Jakarta, CNN Indonesia - - Dokter Ryan Thamrin... | Dokter Lula Kamal yang merupakan selebriti sek... | [jakarta indonesia dokter ryan thamrin terkena... |
| 1 | [False, False, False, False, False, True, True... | Selfie ialah salah satu tema terpanas di kalan... | Asus memperkenalkan ZenFone generasi keempat... | [selfie ialah salah satu tema terpanas kalanga... |
| 2 | [True, True, False, False, False, False, False... | Jakarta, CNN Indonesia - - Dinas Pariwisata Pr... | Dinas Pariwisata Provinsi Bengkulu kembali men... | [jakarta indonesia dinas pariwisata provinsi b... |
| 3 | [True, True, False, False, False, True, False,... | Merdeka.com - Indonesia Corruption Watch (ICW)... | Indonesia Corruption Watch (ICW) meminta Komis... | [merdeka indonesia corruption watch meminta ko... |
| 4 | [False, True, True, True, True, False, False, ... | Merdeka.com - Presiden Joko Widodo (Jokowi) me... | Jokowi memimpin upacara penurunan bendera. Usa... | [merdeka presiden joko widodo jokowi memimpin ... |
| ... | ... | ... | ... | ... |
| 15007 | [True, True, False, False, True, True, False, ... | MANCHESTER, JUARA.net - Mantan striker Manches... | Mantan striker Manchester United, Andrew' Andy... | |
| 15008 | [True, True, True, False, False, False, False,... | Jakarta, CNN Indonesia - - Ratu Tisha Destria ... | Ratu Tisha Destria terpilih menjadi Sekretaris... | |
| 15009 | [True, True, True, True, False, False, False, ... | ITALIA - Borussia Dortmund berhasil lolos ke b... | Borussia Dortmund lolos ke babak 16 Liga Europ... | |
| 15010 | [True, False, True, False, False, False, False... | AC Milan kembali ke jalur kemenangan dengan me... | AC Milan kembali ke jalur kemenangan pasca dit... | |


The raw text is preprocessed for feature extraction with these rules:
1. Split paragraphs into sentences.
2. Lowercase all letters.
3. Remove punctuation and every other non-alphabetic character (digits included).
4. Remove words of three characters or fewer.
5. Remove stopwords.
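
The rules above can be sketched without NLTK or Sastrawi. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical stand-in for Sastrawi's stopword list, `clean_sentences` is a made-up helper name, and the regex-based sentence splitter is far cruder than `sent_tokenize`.

```python
import re

# Hypothetical stand-in for Sastrawi's stopword list (assumption, not the real list).
STOP_WORDS = {"yang", "pada", "dari", "untuk", "dengan"}

def clean_sentences(paragraph, stop_words=STOP_WORDS):
    """Split a paragraph into sentences and clean each one."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    cleaned = []
    for sentence in sentences:
        text = sentence.lower()
        text = re.sub(r"[^a-z]", " ", text)       # drop punctuation, digits, symbols
        text = re.sub(r"\b\w{1,3}\b", " ", text)  # drop words of one to three characters
        words = [w for w in text.split() if w not in stop_words]
        cleaned.append(" ".join(words))
    return cleaned

example = "Dokter Ryan meninggal dunia pada Jumat (4/8). Lula belum sempat membesuk Ryan!"
print(clean_sentences(example))
```

Returning one cleaned string per sentence keeps the cleaned text aligned with the original sentences, which matters later when per-sentence weights are matched to per-sentence labels.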

In [None]:
factory = StopWordRemoverFactory()
stop_words = factory.get_stop_words()

def preprocess_text(row):
  """Clean one row's paragraphs and join the cleaned sentences with '. '."""
  sentences = []
  processed = ""
  # The raw text is in the 'paragraphs' column
  for sentence in sent_tokenize(row['paragraphs']):
    sentences.extend(sent_tokenize(sentence.lower()))
  for text in sentences:
    text = re.sub(r"[^a-zA-Z]", " ", text)    # keep letters only
    text = re.sub(r"\b\w{1,3}\b", " ", text)  # drop words of one to three characters
    text = " ".join(word for word in text.split() if word not in stop_words)
    processed = processed + text + ". "
  return processed

In [None]:
# Store the result in 'clean_text', the column read by the TextRank and SumBasic cells
df_train['clean_text'] = df_train.apply(preprocess_text, axis=1)

Convert gold labels into binary

In [None]:
def convert_binary(label_row):
  """Map a list of True/False gold labels to 1/0."""
  return [int(bool(label)) for label in label_row]
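
Because the labels are stored in the CSV as stringified lists, parsing and conversion can also be done in one step. The sketch below assumes the raw cell looks like `"[False, True, ...]"`; `parse_labels` is a hypothetical helper, not part of the notebook.

```python
import ast

def parse_labels(raw):
    """Parse a stringified list such as "[False, True]" into a list of 0/1 integers."""
    values = ast.literal_eval(raw)  # evaluates Python literals only, unlike eval
    return [int(bool(value)) for value in values]

print(parse_labels("[False, True, True, False]"))
```

`ast.literal_eval` rejects anything that is not a plain literal, so a malformed or malicious cell raises an error instead of running code.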

In [None]:
df_train['binary_label'] = df_train.apply(lambda row: convert_binary(row['labels']),axis=1)
df_train.head()

| | labels | paragraphs | summary | clean_text | binary_label |
|---|---|---|---|---|---|
| 0 | [False, True, True, True, False, False, False,... | Jakarta, CNN Indonesia - - Dokter Ryan Thamrin... | Dokter Lula Kamal yang merupakan selebriti sek... | jakarta indonesia dokter ryan thamrin terkenal... | [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | [False, False, False, False, False, True, True... | Selfie ialah salah satu tema terpanas di kalan... | Asus memperkenalkan ZenFone generasi keempat... | selfie ialah salah satu tema terpanas kalangan... | [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, ... |
| 2 | [True, True, False, False, False, False, False... | Jakarta, CNN Indonesia - - Dinas Pariwisata Pr... | Dinas Pariwisata Provinsi Bengkulu kembali men... | jakarta indonesia dinas pariwisata provinsi be... | [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | [True, True, False, False, False, True, False,... | Merdeka.com - Indonesia Corruption Watch (ICW)... | Indonesia Corruption Watch (ICW) meminta Komis... | merdeka indonesia corruption watch meminta kom... | [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0] |
| 4 | [False, True, True, True, True, False, False, ... | Merdeka.com - Presiden Joko Widodo (Jokowi) me... | Jokowi memimpin upacara penurunan bendera. Usa... | merdeka presiden joko widodo jokowi memimpin u... | [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0] |


## Create TextRank

Download the Indonesian word vectors

In [None]:
# !wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.id.300.vec.gz
# !wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.id.300.bin.gz

Unzipping...

In [None]:
# !gunzip cc.id.300.vec.gz
# !gunzip cc.id.300.bin.gz
# from gensim.models.keyedvectors import KeyedVectors
# kv = KeyedVectors.load_word2vec_format('/content/drive/MyDrive/model_summarization/cc.id.300.vec', limit=400000)

In [None]:
# kv.save_word2vec_format("/content/cc.id.vec", binary=False)

Load the pretrained word embeddings

In [None]:
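word_embeddings = {}
# Each line of the .vec file is a token followed by its vector components.
# A word2vec-format .vec file normally starts with a "count dimension" header line;
# this loop stores that line as an ordinary entry too.
with open('/content/drive/MyDrive/model_summarization/cc.id.vec', encoding='utf-8') as file:
    for line in file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

len(word_embeddings)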

400001
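
TextRank compares sentences through vectors built from these embeddings. A minimal sketch of that step, assuming toy 3-dimensional embeddings with made-up values (the notebook uses 300-dimensional FastText vectors) and an exact mean (the notebook divides by the word count plus 0.001); `sentence_vector` and `cosine` are illustrative names.

```python
import math

# Toy embeddings with made-up numbers, for illustration only.
toy_embeddings = {
    "dokter": [1.0, 0.0, 0.0],
    "sakit": [0.0, 1.0, 0.0],
    "ryan": [1.0, 1.0, 0.0],
}
DIM = 3

def sentence_vector(sentence, embeddings, dim=DIM):
    """Average the word vectors of a cleaned sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(words) for t in total]

def cosine(u, v):
    """Cosine similarity, defined here as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

a = sentence_vector("dokter ryan", toy_embeddings)
b = sentence_vector("ryan sakit", toy_embeddings)
print(a, b, round(cosine(a, b), 3))
```

Guarding against zero vectors matters because a sentence can become empty after stopword and short-word removal.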

In [None]:
# Sort key: the sentence's original position, stored at index 2 of each entry
def sorting(e):
  return e[2]

TextRank Algorithm

In [None]:
def textrank(row):
    """Return the TextRank weight of every sentence, in original sentence order."""
    sentences = sent_tokenize(row['paragraphs'])
    clean_sentences = sent_tokenize(row['clean_text'])

    # Sentence vector = (approximate) mean of its word vectors; unknown words count as zeros
    sentence_vectors = []
    for sentence in clean_sentences:
        words = sentence.split()
        if words:
            v = sum(word_embeddings.get(w, np.zeros((300,))) for w in words) / (len(words) + 0.001)
        else:
            v = np.zeros((300,))
        sentence_vectors.append(v)

    # Cleaning can change how the text splits into sentences; report rows where the counts differ
    if len(sentence_vectors) != len(sentences):
        print(row.name, len(sentence_vectors), len(sentences))

    # Pairwise cosine similarity between sentence vectors
    sim_mat = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 300),
                                                  sentence_vectors[j].reshape(1, 300))[0, 0]

    # PageRank over the similarity graph (nx.pagerank_numpy was removed in NetworkX 3.0)
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)

    ranked_sentences = sorted(([scores[i], i + 1, s] for i, s in enumerate(sentences)), reverse=True)

    # Each entry becomes [TextRank weight, TextRank order, sentence order, sentence]
    text_rank = []
    for index, sentence in enumerate(ranked_sentences):
        sentence.insert(1, index + 1)
        text_rank.append(sentence)
    text_rank = sorted(text_rank, key=sorting)  # back to original sentence order

    # Return only the weights; the TextRank order is entry[1] if it is needed later
    return [entry[0] for entry in text_rank]

coba = df_train.apply(textrank, axis=1)
coba

In [None]:
df_train['text_rank'] = df_train.apply(lambda row: textrank(row), axis=1)
# df_train['sum_basic'], df_train['sum_basic_order'] = df_train.apply(lambda row: sumbasic(row['paragraphs'], row['clean_text']), axis=1)

df_train.head()

Example result from variable text_rank:


```
[0.06091013588314788, 1, 3, 'Lula menuturkan, sakit itu membuat Ryan mesti vakum dari semua kegiatannya, termasuk menjadi pembawa acara Dokter Oz Indonesia.']
[0.05986087469041391, 2, 2, 'Dokter Lula Kamal yang merupakan selebriti sekaligus rekan kerja Ryan menyebut kawannya itu sudah sakit sejak setahun yang lalu.']
[0.05837850794448731, 3, 5, 'Setahu saya dia orangnya sehat, tapi tahun lalu saya dengar dia sakit.']
[0.05811819645865672, 4, 14, '“ Saya tidak tahu, barangkali penyakit yang dulu sama yang sekarang berbeda, atau penyebab kematiannya beda dari penyakit sebelumnya.']
[0.05776311916576284, 5, 13, 'Meski demikian, ia mendengar beberapa kabar yang menyebut bahwa penyebab Ryan meninggal adalah karena jatuh di kamar mandi.']
[0.05773574513429258, 6, 8, 'Lula yang mengenal Ryan sejak sebelum aktif berkarier di televisi mengaku belum sempat membesuk Ryan lantaran lokasi yang jauh.']
[0.05656949054199408, 7, 1, 'Jakarta, CNN Indonesia - - Dokter Ryan Thamrin, yang terkenal lewat acara Dokter Oz Indonesia, meninggal dunia pada Jumat (4 / 8) dini hari.']
[0.05628137259134671, 8, 16, 'Ryan Thamrin terkenal sebagai dokter yang rutin membagikan tips dan informasi kesehatan lewat tayangan Dokter Oz Indonesia.']
[0.05626494382459023, 9, 6, '( Karena) sakitnya, ia langsung pulang ke Pekanbaru, jadi kami yang mau jenguk juga susah.']
[0.056198999719088594, 10, 7, 'Barangkali mau istirahat, ya betul juga, kalau di Jakarta susah isirahatnya, " kata Lula kepada CNNIndonesia.com, Jumat (4 / 8).']
[0.0559482929036589, 11, 11, 'Enggak tahu berat sekali apa bagaimana, " tutur Ryan.']
[0.05588348948999913, 12, 12, 'Walau sudah setahun menderita sakit, Lula tak mengetahui apa penyebab pasti kematian Dr Oz Indonesia itu.']
[0.05466616578431332, 13, 10, 'Itu saya enggak tahu, belum sempat jenguk dan enggak selamanya bisa dijenguk juga.']
[0.05358955066414319, 14, 9, 'Dia juga tak tahu penyakit apa yang diderita Ryan. "']
[0.05168620622168481, 15, 17, 'Ryan menempuh Pendidikan Dokter pada tahun 2002 di Fakultas Kedokteran Universitas Gadjah Mada.']
[0.050898801974925634, 16, 4, 'Kondisi itu membuat Ryan harus kembali ke kampung halamannya di Pekanbaru, Riau untuk menjalani istirahat. "']
[0.05009205343710138, 17, 18, 'Dia kemudian melanjutkan pendidikan Klinis Kesehatan Reproduksi dan Penyakit Menular Seksual di Mahachulalongkornrajavidyalaya University, Bangkok, Thailand pada 2004.']
[0.04915405357039277, 18, 15, 'Kita kan enggak bisa mengambil kesimpulan, " kata Lula.']
```
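
Weights like the ones listed above come from PageRank run on a sentence-similarity graph. The notebook relies on NetworkX for this; the following is only a pure-Python sketch of the underlying power iteration, assuming a symmetric, non-negative similarity matrix with made-up values. The `pagerank` function here is illustrative and is not the NetworkX implementation.

```python
def pagerank(sim, damping=0.85, iterations=100, tol=1e-9):
    """Power-iteration PageRank over a symmetric, non-negative similarity matrix."""
    n = len(sim)
    # The row sum is the total edge weight leaving each sentence node.
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_weight[j] > 0:
                    rank += sim[j][i] / out_weight[j] * scores[j]
                else:
                    rank += scores[j] / n  # a node without edges spreads its score evenly
            new_scores.append((1 - damping) / n + damping * rank)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Made-up similarities between three sentences
sim = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
scores = pagerank(sim)
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, order)
```

A sentence that is similar to many other sentences accumulates a higher score, which is why the weights within one article sum to roughly 1 and differ only slightly from each other.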



## Create SumBasic

SumBasic Algorithm

In [None]:
processed = df_train['clean_text'].iloc[0]
sentences = sent_tokenize(df_train['paragraphs'].iloc[0])  # original sentences, for display

# Word frequencies over the cleaned text, normalised by the most frequent word
frequency = {}
for word in word_tokenize(processed):
  if word.isalnum():
    frequency[word] = frequency.get(word, 0) + 1
max_fre = max(frequency.values())
for word in frequency:
  frequency[word] = frequency[word] / max_fre

# Sentence score = sum of the normalised frequencies of its words
scores = {}
for i, sentence in enumerate(sent_tokenize(processed)):
  for word in word_tokenize(sentence):
    if word in frequency:
      scores[i] = scores.get(i, 0) + frequency[word]
# A sentence left with no scored words gets a weight of 0
ranked_sentences = sorted(([scores.get(i, 0), i, s] for i, s in enumerate(sentences)), reverse=True)

# Each entry becomes [SumBasic weight, SumBasic order, sentence order, sentence] => sum_bas
sum_bas = []
for index, sentence in enumerate(ranked_sentences):
  sentence.insert(1, index + 1)
  sum_bas.append(sentence)

In [None]:
for sentence in sorted(sum_bas,key=sorting):
  print(sentence)

[4.727272727272726, 1, 0, 'Jakarta, CNN Indonesia - - Dokter Ryan Thamrin, yang terkenal lewat acara Dokter Oz Indonesia, meninggal dunia pada Jumat (4 / 8) dini hari.']
[3.909090909090908, 4, 1, 'Dokter Lula Kamal yang merupakan selebriti sekaligus rekan kerja Ryan menyebut kawannya itu sudah sakit sejak setahun yang lalu.']
[4.09090909090909, 2, 2, 'Lula menuturkan, sakit itu membuat Ryan mesti vakum dari semua kegiatannya, termasuk menjadi pembawa acara Dokter Oz Indonesia.']
[1.9999999999999998, 12, 3, 'Kondisi itu membuat Ryan harus kembali ke kampung halamannya di Pekanbaru, Riau untuk menjalani istirahat. "']
[1.0909090909090908, 17, 4, 'Setahu saya dia orangnya sehat, tapi tahun lalu saya dengar dia sakit.']
[0.9090909090909092, 18, 5, '( Karena) sakitnya, ia langsung pulang ke Pekanbaru, jadi kami yang mau jenguk juga susah.']
[2.0, 11, 6, 'Barangkali mau istirahat, ya betul juga, kalau di Jakarta susah isirahatnya, " kata Lula kepada CNNIndonesia.com, Jumat (4 / 8).']
[3.7272
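
The scoring used in the SumBasic cell can be wrapped in a function that returns both the weights and the rank order per sentence. This is a sketch under assumptions: `frequency_scores` is a hypothetical name, the input is a list of already-cleaned sentences, and it mirrors the frequency-sum scoring shown here rather than the full SumBasic procedure, which as usually described also re-weights words after each selected sentence.

```python
def frequency_scores(clean_sentences):
    """Score each cleaned sentence by the sum of its normalised word frequencies."""
    frequency = {}
    for sentence in clean_sentences:
        for word in sentence.split():
            frequency[word] = frequency.get(word, 0) + 1
    if not frequency:
        count = len(clean_sentences)
        return [0.0] * count, list(range(1, count + 1))
    max_freq = max(frequency.values())
    weights = [
        sum(frequency[word] / max_freq for word in sentence.split())
        for sentence in clean_sentences
    ]
    # Rank 1 = highest weight; ties keep the earlier sentence first.
    by_weight = sorted(range(len(weights)), key=lambda i: -weights[i])
    order = [0] * len(weights)
    for rank, index in enumerate(by_weight, start=1):
        order[index] = rank
    return weights, order

toy = ["dokter ryan sakit", "ryan sakit", "lula"]
weights, order = frequency_scores(toy)
print(weights, order)
```

Because the score is a sum rather than an average, longer sentences tend to score higher, which is consistent with the long lead sentence ranking first in the output above.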

In [None]:
def count_both(row):
  # Call textrank and store its per-sentence weights in a new column
  row['text_rank'] = textrank(row)
  # TODO: call sumbasic once the SumBasic cell above is wrapped in a function,
  # then store its weights and order in new columns as well
  return row

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15012 entries, 0 to 15011
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   labels      15012 non-null  object
 1   paragraphs  15012 non-null  object
 2   summary     15012 non-null  object
 3   clean_text  15012 non-null  object
 4   bin_labels  15012 non-null  object
 5   text_rank   15012 non-null  object
dtypes: object(6)
memory usage: 703.8+ KB


Run both algorithms on all rows.

## Neural Network

In [None]:
df_train['bin_labels'] = df_train['labels'].apply(convert_binary)
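
The notebook does not show how the features reach the network, so the following is only one plausible arrangement: each sentence becomes one training example with two features (its TextRank weight and its SumBasic weight) and one binary target. `build_examples` and the sample numbers are hypothetical.

```python
def build_examples(text_rank, sum_basic, bin_labels):
    """Turn one document's per-sentence lists into (features, labels) for training."""
    if not (len(text_rank) == len(sum_basic) == len(bin_labels)):
        raise ValueError("feature and label lists must have the same length")
    # One row per sentence: [TextRank weight, SumBasic weight]
    features = [[tr, sb] for tr, sb in zip(text_rank, sum_basic)]
    return features, list(bin_labels)

# Made-up weights for a three-sentence document
X, y = build_examples([0.06, 0.05, 0.04], [4.7, 3.9, 4.1], [0, 1, 1])
print(X, y)
```

The length check is there because sentence counts can drift apart when the cleaned text splits differently from the original text.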