<a href="https://colab.research.google.com/github/ShinAsakawa/ShinAsakawa.github.io/blob/master/2022notebooks/2022_0107Text_summarization_of_Japanese_Articles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


* url: https://medium.com/@shubhamsingh_31435/text-summarization-of-japanese-articles-using-python-and-nlp-47a214d769b
* date: 2022_0107
* title: Text Summarization of Japanese Articles using python and NLP
* filename: 2022_0107Text_summarization_of_Japanese_Articles.ipynb

# Text Summarization of Japanese Articles using python and NLP


In [None]:
import platform

# Rough heuristic: any Linux host (not only Colab) is treated as Colab here
isColab = platform.system() == 'Linux'
if isColab:
    !pip install --upgrade wikipedia > /dev/null 2>&1 
    !pip install --upgrade jamdict jamdict-data nagisa pykakasi > /dev/null 2>&1

In [None]:
# https://nagisa.readthedocs.io/en/latest/basic_usage.html
import nagisa

text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

# Get a list of words
print(words.words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

# Get a list of POS-tags
print(words.postags)
#=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

In [None]:
# Extracting all nouns from a text
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
#=> Python/名詞 ツール/名詞

# Filtering specific POS-tags from a text
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

# A list of available POS-tags
print(nagisa.tagger.postags)
#=> ['補助記号', '名詞', ... , 'URL']


In [None]:
# default
text = "3月に見た「3月のライオン」"
print(nagisa.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号

# If a word ("3月のライオン") is included in the single_word_list, it is recognized as a single word.
new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
print(new_tagger.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号

In [None]:
text = '(人•ᴗ•♡)こんばんは♪'
words = nagisa.tagging(text)
print(words)
#=> (人•ᴗ•♡)/補助記号 こんばんは/感動詞 ♪/補助記号

url = 'https://github.com/taishi-i/nagisaでコードを公開中(๑¯ω¯๑)'
words = nagisa.tagging(url)
print(words)
#=> https://github.com/taishi-i/nagisa/URL で/助詞 コード/名詞 を/助詞 公開/名詞 中/接尾辞 (๑　̄ω　̄๑)/補助記号

words = nagisa.filter(url, filter_postags=['URL', '補助記号', '助詞'])
print(words)
#=> コード/名詞 公開/名詞 中/接尾辞

In [None]:
#https://github.com/neocl/jamdict
from jamdict import Jamdict
jam = Jamdict()

# use wildcard matching to find any entry that starts with 食べ and ends with る
result = jam.lookup('食べ%る')

# print all word entries
for entry in result.entries:
     print(entry)


In [None]:
# for k in ['chars', 'entries', 'names', 'text']:
#     print(f'{k} : {type(getattr(result, k))}')
#     if isinstance(getattr(result,k), list):
#         _list = getattr(result,k)
#         for x in _list:
#             print(k, x)
#             #print(getattr(result,k)[x])

#dir(result)
for l in result.text().split('。'):
    print(l)

In [None]:
# print all related characters
for c in result.chars:
    print(repr(c))


In [None]:
!python3 -m jamdict lookup 日本語教育学
# Note: the sample output below is for the query 言語学, not 日本語教育学.
# ========================================
# Found entries
# ========================================
# Entry: 1264430 | Kj:  言語学 | Kn: げんごがく
# --------------------
# 1. linguistics ((noun (common) (futsuumeishi)))

# ========================================
# Found characters
# ========================================
# Char: 言 | Strokes: 7
# --------------------
# Readings: yan2, eon, 언, Ngôn, Ngân, ゲン, ゴン, い.う, こと
# Meanings: say, word
# Char: 語 | Strokes: 14
# --------------------
# Readings: yu3, yu4, eo, 어, Ngữ, Ngứ, ゴ, かた.る, かた.らう
# Meanings: word, speech, language
# Char: 学 | Strokes: 8
# --------------------
# Readings: xue2, hag, 학, Học, ガク, まな.ぶ
# Meanings: study, learning, science

# No name was found.

In [None]:
# Using KRAD/RADK mapping
# Jamdict has built-in support for KRAD/RADK (i.e. kanji-radical and radical-kanji mapping).
# The terminology of radicals/components used by Jamdict may differ from that used elsewhere.

# A radical in Jamdict is a principal component; each character has only one radical.
# A character may be decomposed into several writing components.
# By default, jamdict provides two maps:

# jam.krad is a Python dict that maps each character to a list of its components.
# jam.radk is a Python dict that maps each available component to the characters that contain it.

# Find all writing components (often called "radicals") of the character 雲
print(jam.krad['雲'])
# ['一', '雨', '二', '厶']

# Find all characters with the component 鼎
chars = jam.radk['鼎']
print(chars)
# {'鼏', '鼒', '鼐', '鼎', '鼑'}

# look up the characters info
result = jam.lookup(''.join(chars))
for c in result.chars:
    print(c, c.meanings())
# 鼏 ['cover of tripod cauldron']
# 鼒 ['large tripod cauldron with small']
# 鼐 ['incense tripod']
# 鼎 ['three legged kettle']
# 鼑 []
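
The relationship between the two maps can be illustrated without jamdict. The sketch below inverts a small character-to-components dictionary into a component-to-characters dictionary. The names `toy_krad` and `invert_krad` are hypothetical, and the toy entries are illustrative samples rather than jamdict's actual data (only the 雲 entry is taken from the output above).

```python
from collections import defaultdict

# Toy character -> components map (illustrative sample, not jamdict's data).
toy_krad = {
    "雲": ["一", "雨", "二", "厶"],
    "雪": ["雨", "ヨ"],
    "去": ["土", "厶"],
}

def invert_krad(krad):
    """Build a component -> set-of-characters map from a character -> components map."""
    radk = defaultdict(set)
    for char, components in krad.items():
        for component in components:
            radk[component].add(char)
    return dict(radk)

toy_radk = invert_krad(toy_krad)
print(toy_radk["雨"])  # the characters in toy_krad that contain 雨
print(toy_radk["厶"])  # the characters in toy_krad that contain 厶
```

A set is used for the values because a character either contains a component or does not, so duplicates and ordering carry no meaning.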

In [None]:
# Finding named entities
# Find all names containing 岩下
# (the commented-out query and the sample output below use 鈴木 instead)
#result = jam.lookup('%鈴木%')
result = jam.lookup('%岩下%')
for name in result.names:
    print(name)

# [id#5025685] キューティーすずき (キューティー鈴木) : Kyu-ti- Suzuki (1969.10-) (full name of a particular person)
# [id#5064867] パパイヤすずき (パパイヤ鈴木) : Papaiya Suzuki (full name of a particular person)
# [id#5089076] ラジカルすずき (ラジカル鈴木) : Rajikaru Suzuki (full name of a particular person)
# [id#5259356] きつねざきすずきひなた (狐崎鈴木日向) : Kitsunezakisuzukihinata (place name)
# [id#5379158] こすずき (小鈴木) : Kosuzuki (family or surname)
# [id#5398812] かみすずき (上鈴木) : Kamisuzuki (family or surname)
# [id#5465787] かわすずき (川鈴木) : Kawasuzuki (family or surname)
# [id#5499409] おおすずき (大鈴木) : Oosuzuki (family or surname)
# [id#5711308] すすき (鈴木) : Susuki (family or surname)
# ...

In [None]:
# look up the word once and print every matching entry
result = jam.lookup('花火')
for entry in result.entries:
    print(entry)


In [None]:
# All necessary imports
import wikipedia # fetches the Japanese article from Wikipedia
import re # regular expressions for pattern matching and text pre-processing
import nagisa # Japanese word segmentation and POS tagging
import pykakasi # converts kanji into hiragana, katakana and romaji
import heapq # heap-based priority queue, useful for picking the highest-scoring items
import pandas as pd # tabular data handling
from jamdict import Jamdict # Japanese dictionary lookup

# set the language as Japanese for wikipedia article
wikipedia.set_lang("ja")

# search article on any topic
wikipedia.search("COVID-19") # searching for article related to "COVID-19" across wikipedia

In [None]:
article = wikipedia.page("2019新型コロナウイルス") # getting the article for topic: "2019新型コロナウイルス"
article_content = article.content # getting the content of the article

# Clean the article with regex before further processing
text = re.sub(r'\[[0-9]+\]','',article_content) # remove reference markers such as [1] or [2]
text = re.sub(r"\s+",' ',text) # collapse runs of whitespace (including newlines) into a single space
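
To see what the two substitutions above do, here is a minimal sketch applied to a made-up string. The helper name `strip_citations` and the sample text are assumptions for illustration; they are not part of the original notebook or of the `wikipedia` package.

```python
import re

def strip_citations(raw):
    """Remove [n]-style reference markers, then collapse all whitespace runs."""
    without_refs = re.sub(r"\[[0-9]+\]", "", raw)
    return re.sub(r"\s+", " ", without_refs)

# Hypothetical sample with citation markers and several newlines
sample = "ウイルスは2019年に発見された[1]。\n\n\n感染は拡大した[12][13]。"
print(strip_citations(sample))
# → ウイルスは2019年に発見された。 感染は拡大した。
```

Note that the sentence delimiter "。" survives this step, which matters later when the text is split into sentences.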

In [None]:
# Pre-processing the Japanese text using regex
clean_text = text.lower() # converts any English words to lower case
clean_text = re.sub(r"\W"," ",clean_text) # replace non-word characters (punctuation, symbols) with a space
clean_text = re.sub(r"\d"," ",clean_text) # replace digits with a space
clean_text = re.sub(r"\s+",' ',clean_text) # collapse runs of whitespace into a single space
clean_text = clean_text.strip() # remove leading and trailing whitespace

# The article is split into sentences from `text` rather than `clean_text`,
# because the cleaning step above has already removed the "。" delimiters from `clean_text`
sentences = text.split("。") # getting all the sentences using "。" as delimiter

# using the "nagisa" library to extract individual words with the following parts of speech:
# 英単語   : English words
# 接頭辞   : prefixes
# 形容詞   : adjectives
# 名詞     : nouns
# 動詞     : verbs
# 助動詞   : auxiliary verbs
# 副詞     : adverbs
jp_tokenised_words = nagisa.extract(clean_text, extract_postags=['英単語','接頭辞','形容詞','名詞','動詞','助動詞','副詞'])
tokenised_words = jp_tokenised_words.words

# list of stop-words. Stop words are words which are filtered out before or after processing of natural language data  
jp_stopwords = ["あそこ","あっ","あの","あのかた","あの人","あり","あります","ある","あれ","い","いう","います","いる","う","うち","え","お","および","おり","おります","か","かつて","から","が","き","ここ","こちら","こと","この","これ","これら","さ","さらに","し","しかし","する","ず","せ","せる","そこ","そして","その","その他","その後","それ","それぞれ","それで","た","ただし","たち","ため","たり","だ","だっ","だれ","つ","て","で","でき","できる","です","では","でも","と","という","といった","とき","ところ","として","とともに","とも","と共に","どこ","どの","な","ない","なお","なかっ","ながら","なく","なっ","など","なに","なら","なり","なる","なん","に","において","における","について","にて","によって","により","による","に対して","に対する","に関する","の","ので","のみ","は","ば","へ","ほか","ほとんど","ほど","ます","また","または","まで","も","もの","ものの","や","よう","より","ら","られ","られる","れ","れる","を","ん","何","及び","彼","彼女","我々","特に","私","私達","貴方","貴方方"]
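
The regex cleaning in this cell relies on Python's Unicode-aware character classes: kana and kanji count as word characters, while Japanese punctuation such as 、 and 。 does not, and `\d` also matches full-width digits. The sketch below shows this on a made-up sentence. The helper name `clean_japanese` and the sample strings are assumptions for illustration.

```python
import re

def clean_japanese(raw):
    """Replace punctuation and digits with spaces, then tidy the whitespace."""
    cleaned = re.sub(r"\W", " ", raw)       # 、 and 。 are non-word characters
    cleaned = re.sub(r"\d", " ", cleaned)   # matches ASCII and full-width digits
    cleaned = re.sub(r"\s+", " ", cleaned)  # collapse runs of spaces
    return cleaned.strip()

sample = "感染者は１２３人、死者は45人。"
print(clean_japanese(sample))
# → 感染者は 人 死者は 人

# Splitting on "。" leaves an empty string after the final delimiter,
# so it can be worth filtering empty pieces out.
pieces = "一つ目。二つ目。".split("。")
sentences_demo = [s for s in pieces if s]
print(pieces)          # → ['一つ目', '二つ目', '']
print(sentences_demo)  # → ['一つ目', '二つ目']
```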


In [None]:
#https://gist.github.com/shubh2016shiv/f409b89b19303f1ba4e8b5f23b981e46#file-text_summarization-py
# Calculate the frequency of each word
word2count = {} # maps each word to its frequency
for word in tokenised_words:
    if word not in jp_stopwords:  # skip stop words
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1


# Calculate the weighted frequency of each word: its count divided by the count of the most frequent word in the article.
# The maximum is computed once, before the loop; recomputing it inside the loop would mix raw counts with already-normalised values.
max_count = max(word2count.values(), default=1)
for key in word2count:
    word2count[key] = word2count[key]/max_count # weighted frequency

    
'''After calculating the weighted frequency of each word,
the importance score of a sentence is calculated by adding up the weighted frequencies of the words in that sentence.'''

# The function below, "getSpaceSeperatedJpWords(text)", inserts spaces between the words of a Japanese sentence using the 'pykakasi' library
'''For example:
if the sentence is "日本は素晴らしい国です",
the result will be "日本 は 素晴ら しい 国 です",
where each piece carries either lexical or grammatical meaning.

日本 means "Japan".

は is the topic-marker particle.

素晴ら しい means "amazing". It is a single word, but a space is inserted between 素晴ら and しい
because "しい" is the ending of an i-adjective, which is grammatically significant and can be conjugated.

国 means "country".

です means "is / are".
'''
def getSpaceSeperatedJpWords(text):
    wakati = pykakasi.wakati()
    conv = wakati.getConverter()
    result_with_spaces = conv.do(text)
    return result_with_spaces
  

sent2score={} # maps each sentence to its score
for sentence in sentences:
    # Only sentences with fewer than 20 space-separated pieces are scored; longer ones are skipped.
    # The length check is done once per sentence instead of once per word.
    if len(getSpaceSeperatedJpWords(sentence).split(" ")) >= 20:
        continue
    # tokenise the sentence, keeping only the parts of speech listed below
    tokenised_sentence = nagisa.extract(sentence, extract_postags=['英単語','接頭辞','形容詞','名詞','動詞','助動詞','副詞'])
    words = tokenised_sentence.words
    for word in words:
        if word in word2count: # only words with a weighted frequency contribute
            if sentence not in sent2score: # first matching word: start the score
                sent2score[sentence] = word2count[word]
            else: # later matching words: add their weighted frequency
                sent2score[sentence] += word2count[word]
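
The scoring steps in this cell can be condensed into a small, dependency-free sketch. It assumes the sentences have already been tokenised (in the notebook this is done by nagisa), and it measures sentence length by the number of tokens instead of by pykakasi's segmentation, which is a simplification. The names `score_sentences`, `toy_sentences` and `toy_stopwords` are hypothetical.

```python
from collections import Counter

def score_sentences(tokenised_sentences, stopwords, max_tokens=20):
    """Score each sentence by the sum of the weighted frequencies of its words."""
    counts = Counter(
        word
        for words in tokenised_sentences.values()
        for word in words
        if word not in stopwords
    )
    if not counts:
        return {}
    max_count = max(counts.values())  # computed once, before normalising
    weights = {word: count / max_count for word, count in counts.items()}

    scores = {}
    for sentence, words in tokenised_sentences.items():
        if len(words) >= max_tokens:  # skip long sentences
            continue
        matched = [weights[word] for word in words if word in weights]
        if matched:  # sentences without any known word get no entry
            scores[sentence] = sum(matched)
    return scores

# Hypothetical pre-tokenised input: sentence -> list of content words
toy_sentences = {
    "ウイルスは世界に広がった": ["ウイルス", "世界", "広がっ"],
    "ウイルスの感染は続く": ["ウイルス", "感染", "続く"],
    "世界は対策を急いだ": ["世界", "対策", "急い"],
}
toy_stopwords = {"は", "に", "の", "を", "た", "だ"}

toy_scores = score_sentences(toy_sentences, toy_stopwords)
for sentence, score in toy_scores.items():
    print(sentence, score)
```

Here ウイルス and 世界 each occur twice, so they receive weight 1.0, and every other word receives 0.5. The first sentence contains both frequent words and therefore scores highest.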
    

The threshold of 20 is used to select only sentences shorter than that length.
Long sentences tend to carry redundant information, so they are excluded when generating the summary.
After each sentence has been assigned a score, the result is as follows:

In [None]:
sent2score
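
Once every sentence has a score, a summary can be formed from the highest-scoring sentences. The sketch below is one possible way to do this with `heapq.nlargest`, using the `heapq` module imported earlier. The function name `summarise`, the parameter `n` and the toy scores are assumptions for illustration, and the original article may select sentences differently.

```python
import heapq

def summarise(sentence_scores, n=2):
    """Join the n highest-scoring sentences into a short summary."""
    best = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
    if not best:
        return ""
    # Sentences were split on "。", so the delimiter is restored here.
    return "。".join(best) + "。"

# Hypothetical scores standing in for sent2score
toy_scores = {"文A": 2.5, "文B": 1.0, "文C": 2.0, "文D": 0.5}
print(summarise(toy_scores))
# → 文A。文C。
```

This version orders the summary by score. Keeping the selected sentences in their original article order is an alternative design that may read more naturally.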