The theory of asymmetric information will always be an important topic in the financial sector. Nowadays, tech-savvy financial institutions are commonly using natural language processing (NLP), and especially named entity recognition (NER) to find patterns, behaviors, or signals for investment research.

Japan is the third-biggest economy by nominal GDP. It’s the land of conglomerates such as Toyota, Sony, Uniqlo, Softbank, and Nintendo, and it’s the fifth-biggest stock market. Japanese stocks have outperformed their European peers for years. Compared with the sky-high valuations in the USA, Japanese companies are high-quality yet undervalued firms. The Japanese market is also probably the embodiment of asymmetric information, because the complexity of the Japanese language keeps much of that information out of reach of non-Japanese readers.

## Chairman & CEO, SoftBank Group Corp, Masayoshi Son
### ソフトバンクグループ（株）代表取締役 会長兼社長執行役員、孫 正義

After a three-year hiatus, Masayoshi Son, the most powerful Japanese businessman and the biggest tech investor in the world, came back to Twitter to share his voice, but in Japanese...
What did he say? Does it have any impact on Softbank, or his recent investments? 

**In this project, we will try to understand:**
- Specificities of natural language processing in Japanese
- Informal speech of **Masayoshi Son (Softbank)** on Twitter for a micro perspective **(insights start in Chapter 7)**
- Formal language of **Governor Kuroda (Bank of Japan)** with PDF files for a macro perspective

**Technical goal:**
Create an NLP pipeline in Japanese for NER, based on a solid framework, with a small, carefully designed dataset and very little computing power.

**Challenge**: Do not tweak the split mode, fine-tuning, or hyperparameters, and avoid PyTorch, Keras, and TensorFlow.

# Natural Language Processing / 自然言語処理

Natural language processing (NLP) can be broken into three steps: text processing, feature extraction, and modeling.
In simple words, NLP transforms human language through various steps, including cleaning, preprocessing, and engineering a statistical model for a particular task. I will not go into the details of each step; instead, I will explain three key points for Japanese.

## 1. Problem I / Tokenization トークン化
### Splitting raw text into small units  /  生テキストを単語の集合に自動的に分割
For the sake of an example, let’s try to tokenize one sentence in English and one tweet from Masayoshi Son with NLTK. 

In [71]:
import sys
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
english = 'Good muffins cost $3.88\nin New York.'
word_tokenize(english)

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']

* English sentences can easily be cut into words, mostly along whitespace and punctuation, with a common tool such as NLTK.

In [72]:
japanese = '今日の決算発表の録画はこちらです。興味があるの方、ご覧ください。'
word_tokenize(japanese)

['今日の決算発表の録画はこちらです。興味があるの方、ご覧ください。']

* However, Japanese, an agglutinative language, is written without spaces to separate words. A basic tokenizer such as the one above cannot cut Japanese sentences into linguistic units (the same problem arises for Chinese, which is also written without spaces).

## 2. Solution I / Japanese morphological analyzers  /  日本語を形態素解析
Dictionary-based morphological analyzers are required to tokenize Japanese. Written in C++, **MeCab** has been the most popular for years. Other morphological analyzers are available, such as Kuromoji in Java, Suika in Ruby, JUMAN in Perl, Janome in pure Python, and Rakuten MA in JavaScript.
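To give an intuition of what "based on dictionaries" means, here is a minimal sketch of greedy longest-match segmentation against a toy word list. It is only an illustration: the `toy_dictionary` and the `longest_match_tokenize` helper are assumptions made for this example, and real analyzers such as MeCab rely on far larger dictionaries and a cost-based search over candidate segmentations rather than this greedy rule.

```python
def longest_match_tokenize(text, dictionary, max_len=8):
    """Greedy segmentation: at each position, take the longest dictionary word.

    Characters that start no known word are emitted one by one.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary, just large enough for the example sentence.
toy_dictionary = {"今日", "の", "決算", "発表", "録画", "は", "こちら", "です"}

tokens = longest_match_tokenize("今日の決算発表の録画はこちらです。", toy_dictionary)
print(tokens)
```

Without a dictionary entry, a word simply falls apart into single characters, which is why the quality and coverage of the dictionary matter so much.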

### NLP framework, spaCy  /  自然言語処理フレームワーク、spaCy
spaCy is not a morphological analyzer but a full NLP framework that integrates pretrained pipelines in multiple languages, including Japanese. Developed by Explosion, spaCy leveraged UniDic and MeCab in its first Japanese version. It now uses SudachiDict and SudachiPy from Works Applications. We will use spaCy as the backbone of this notebook.

In [73]:
import spacy
from spacy import displacy 
nlp = spacy.load('ja_core_news_md') 
doc = nlp('大西市長、了解致しました。')
# One row per token: index, surface form, lemma, coarse POS, fine-grained tag
r_list = [[str(token.i), token.text, token.lemma_, token.pos_, token.tag_]
          for sent in doc.sents for token in sent]
df_list = pd.DataFrame(r_list, columns=['token_no', 'text', 'lemma', 'pos', 'tag'])

In [74]:
df_list

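| | token_no | text | lemma | pos | tag |
|---|---|---|---|---|---|
| 0 | 0 | 大西 | 大西 | PROPN | 名詞-固有名詞-地名-一般 |
| 1 | 1 | 市 | 市 | NOUN | 名詞-普通名詞-一般 |
| 2 | 2 | 長 | 長 | NOUN | 接尾辞-名詞的-一般 |
| 3 | 3 | 、 | 、 | PUNCT | 補助記号-読点 |
| 4 | 4 | 了解 | 了解 | VERB | 名詞-普通名詞-サ変可能 |
| 5 | 5 | 致し | 致す | AUX | 動詞-非自立可能 |
| 6 | 6 | まし | ます | AUX | 助動詞 |
| 7 | 7 | た | た | AUX | 助動詞 |
| 8 | 8 | 。 | 。 | PUNCT | 補助記号-句点 |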


In [75]:
displacy.render(doc, style='dep', jupyter=True, options={'compact':True,'distance':120})

* Using displaCy, we can visualize the sentence as linguistic units and syntactic dependencies. The name Onishi (大西) and the title "City Mayor" (市長) are correctly linked together. The SOV structure of Japanese is also clearly visible, with the verb "to understand" (了解) in the polite form (致す) and in the past tense (ました).

In [76]:
displacy.render(doc, style='ent', options={'compact':True})

* On April 22, 2020, Masayoshi Son responded to the mayor of Kumamoto. In this short sentence, we can see that the **pretrained Japanese NER** model perfectly recognized the mayor of Kumamoto, Kazufumi Onishi (熊本市長 大西一史), and his title (市長).
***

## 3. Problem II / Named Entity Recognition  / 固有表現抽出
The Japanese writing system is based on three separate sets of characters: hiragana (平仮名 / ひらがな), katakana (片仮名 / カタカナ), and kanji (漢字). Modern Japanese is arguably the hardest language to process, with its blend of Sino-Japanese and native Japanese vocabulary, Latin script (romaji / ローマ字), and loanwords from Western languages (Dutch, Portuguese, French, English, German). The numeral system is a mix of Arabic and traditional Chinese numerals.

The Japanese digital world brought us emoji in Unicode : ) and kaomoji built with the Cyrillic alphabet, such as ∵･(ﾟДﾟ), or the Greek alphabet, such as ＿φ(°-°=). If you add regional slang, colloquialisms, marketing contractions, and the constant evolution of the language, it becomes a beautiful nightmare for named entity recognition.
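The mix of scripts can be made visible with a few lines of standard-library Python. The sketch below is only an illustration: the `script_of` and `script_runs` helpers are hypothetical names, and the Unicode ranges cover the main hiragana, katakana, and common kanji blocks only, not every extension block.

```python
from itertools import groupby


def script_of(ch):
    """Rough script label for one character, based on its Unicode code point."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (common kanji)
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isdigit():
        return "digit"
    return "other"


def script_runs(text):
    """Group consecutive characters that share the same script."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]


for script, chunk in script_runs("ソフトバンクG(9984)の子会社"):
    print(f"{script:9s} {chunk}")
```

A single company mention from the financial press already switches between katakana, Latin letters, digits, hiragana, and kanji, which hints at why entity boundaries are hard to learn.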


## 4. Solution II / Pretrained pipelines / 学習済みモデル
**SpaCy (2.4)** provides 3 powerful Japanese pipelines optimized for CPU: small (12MB), medium (41MB), and large (531MB). Based on Universal Dependencies, SudachiPy, and the spaCy framework, **GiNZA (4.0)** is also popular among data teams in Japan. Let's see how they compare on tricky sentences about Softbank.
### 4.a spaCy Model / spaCyのモデル

In [77]:
doc1 = ('ソフトバンクグループが外資系金融機関からの借り入れを増やしている。'
'ソフトバンクは7月1日から、同社の5Gブランドである「SoftBank 5G」と「ドラえもん」がコラボレーションした新テレビCM。'
'福岡ソフトバンクホークス・王貞治会長「君たちが歴史の中でも一番強い」と最大限の称賛。'
'ソフトバンク(9434)は、ソフトバンクG(9984)の子会社、会長の孫正義氏は創業者取締役です。')

displacy.render(nlp(doc1),style='ent',jupyter=True)

Without modifying the split mode, the spaCy Japanese pipeline with the medium model gives us incredible results.

1. In the first sentence, **SoftBank Group (ソフトバンクグループ)** is recognized as a company.

2. In the second sentence, about a TV spot (featuring Bruce Willis as Doraemon), the mobile company **Softbank (ソフトバンク)** is spotted perfectly, along with the **date (7月1日)**, the product **Softbank 5G**, and the collaboration with the Japanese anime **Doraemon (ドラえもん)**, labeled as a product.

3. In the third sentence, the NER model also finds the baseball team **Fukuoka Softbank Hawks (福岡ソフトバンクホークス)**, with a small mistake on the chairman and baseball legend **Sadaharu Oh (王貞治)**, probably because his name is of Chinese origin.

4. The fourth sentence (from financial news) is trickier: **Softbank (ソフトバンク)** with the ticker (9434) is correct, but **Softbank Group (ソフトバンクG)** with the ticker (9984) is recognized as a product. The CEO of Softbank, **Masayoshi Son (孫正義)**, is spotted as a person.
### 4.b Ginza Model / GiNZAのモデル

In [78]:
nlp_ginza = spacy.load('ja_ginza') 
displacy.render(nlp_ginza(doc1),style='ent',jupyter=True)

The **Ginza pretrained pipeline** gives us interesting results.

1. In the first sentence, **SoftBank Group (ソフトバンクグループ)** is recognized as a person.

2. In the second sentence, about the TV spot, the mobile company **Softbank (ソフトバンク)** is spotted as a sports team and the Japanese anime character **Doraemon (ドラえもん)** as music.

3. In the third sentence, the pipeline labels the baseball team **Fukuoka Softbank Hawks (福岡ソフトバンクホークス)** as a person. The former coach, chairman, and baseball legend **Sadaharu Oh (王貞治)** is perfectly spotted, as well as his title.

4. In the fourth sentence, the mobile company **Softbank (ソフトバンク)** is recognized as a sports team, and **Softbank Group (ソフトバンクG)** as a geographical entity (GEO). The CEO of Softbank, **Masayoshi Son (孫正義)**, is spotted as a person. The honorific suffix **Mr. (氏)** and the position **company director (取締役)** are both recognized. However, the title **founder (創業者)** in between is missed.

## 5. Problem III / Wrong NER labels  / 間違えてしまった固有表現タグ
The Japanese pipeline by spaCy gives us great results but doesn't separate the entities. The Ginza pipeline gives us separate entities, but unfortunately the wrong ones.

Depending on whom you ask **“What is Softbank?”**, you will get a different answer. Most Japanese people will tell you it’s a telecom company. A baseball fan will tell you it’s the best team in Japan over the past five years. For people working in the finance sector, Softbank is two giant tech funds, Vision Fund I & II.

Trained on media outlets and news, pretrained models reproduce human inconsistencies. The Japanese financial press writes **SoftbankG (ソフトバンクG)** for Softbank Group. Even Masayoshi Son uses the contraction **SB** on Twitter. Apparently, the Ginza model was trained on data from the website **Livedoor** by the data team at **Recruit Holdings** (Indeed/Glassdoor), which is perfect for titles and job positions. However, a model trained on a lot of baseball articles that use the word **Softbank (ソフトバンク)** for the Japanese baseball team, **Fukuoka Softbank Hawks (福岡ソフトバンクホークス)**, will give us the wrong entities.

## 6. Solution III / Industry specific labeling  / 依存関係のラベリング

Regardless of the amazing capabilities of spaCy and Ginza, NER needs to be adjusted to the domain. There are three possible approaches: a traditional rule-based model, a bespoke statistical model with prior knowledge, or a neural approach, which is costly in terms of computing power.
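As a point of comparison, the first of these approaches can be sketched with the standard library alone. This is not the method used in this notebook, only a minimal rule-based illustration: the `GAZETTEER` entries, the label names, and the `rule_based_entities` helper are all assumptions made for the example.

```python
import re

# Hypothetical gazetteer: surface form -> label (names and labels are illustrative).
GAZETTEER = {
    "福岡ソフトバンクホークス": "BASEBALL_TEAM",
    "ソフトバンクグループ": "COMPANY",
    "ソフトバンクG": "COMPANY",
    "ソフトバンク": "COMPANY",
    "孫正義": "PERSON",
}

# Longest names first, so "ソフトバンクG" wins over its prefix "ソフトバンク".
_names = sorted(GAZETTEER, key=len, reverse=True)
NAME_RE = re.compile("|".join(re.escape(name) for name in _names))

# A four-digit code between parentheses is treated as a stock ticker.
TICKER_RE = re.compile(r"(?<=\()\d{4}(?=\))")


def rule_based_entities(text):
    """Return (surface, label) pairs in order of appearance."""
    spans = [(m.start(), m.end(), GAZETTEER[m.group()]) for m in NAME_RE.finditer(text)]
    spans += [(m.start(), m.end(), "TICKER") for m in TICKER_RE.finditer(text)]
    return [(text[start:end], label) for start, end, label in sorted(spans)]


sentence = "ソフトバンク(9434)は、ソフトバンクG(9984)の子会社、会長の孫正義氏は創業者取締役です。"
for surface, label in rule_based_entities(sentence):
    print(label, surface)
```

Rules like these are precise but brittle: every new spelling (SB, SBG, SoftBank) needs a new entry, which is one reason to prefer a statistical model trained on a small, well-labeled dataset.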

I have trained a bespoke NER model with a small training dataset in order to recognize particular entities:

1. **Softbank** (company)
2. **Softbank Hawks** (baseball team)
3. **Softbank Vision Fund**
4. **Masayoshi Son**
5. **Softbank tickers**
6. **Softbank 5G products**
7. **SARS-CoV-2**
8. **Vaccine**
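For readers who want to build a similar dataset, NER annotations for spaCy are commonly expressed as a text paired with character offsets, `(text, {"entities": [(start, end, label)]})`. The sketch below shows one possible way to produce such tuples from a list of known surface forms. It is an assumption about how a dataset like this could be prepared, not a description of how `SBModel` was actually trained, and the label names and the `make_example` helper are hypothetical.

```python
def make_example(text, surface_to_label):
    """Build a (text, {"entities": [...]}) tuple with character offsets.

    Longer surface forms are placed first and overlapping shorter matches are
    skipped, because NER spans are not allowed to overlap.
    """
    taken = []
    for surface in sorted(surface_to_label, key=len, reverse=True):
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            if all(end <= s or start >= e for s, e, _ in taken):
                taken.append((start, end, surface_to_label[surface]))
            start = text.find(surface, end)
    return text, {"entities": sorted(taken)}


# Hypothetical labels, for illustration only.
labels = {
    "福岡ソフトバンクホークス": "BASEBALL_TEAM",
    "ソフトバンクG": "COMPANY",
    "ソフトバンク": "COMPANY",
}

example = make_example("ソフトバンク(9434)は、ソフトバンクG(9984)の子会社", labels)
print(example)
```

Automatic matching of this kind only bootstraps the labeling. The ambiguous cases, such as ソフトバンク meaning the baseball team, still need a human decision in context.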

### 6.a Bespoke Model / ビスポークのモデル 

In [79]:
import spacy
nlp_bespoke = spacy.load(r"SBModel") #Loading bespoke model
displacy.render(nlp_bespoke(doc1),style='ent',jupyter=True)

* My simple bespoke NER model in Japanese is able to recognize the business entities (companies, tickers, and products) and to separate out the baseball team. The model is still a little confused about Softbank as a baseball team in the second sentence.
***
***- Now let's try the model on a few tweets from Masayoshi Son!***


## 7. Tweets of Masayoshi Son / 孫正義氏のツイート

Masayoshi Son has an impressive 2.8M followers. Before his return, his last two tweets dated from October 2015, about the Hawks’ victory, and from February 2017, announcing an environmental grant.

The Japanese investor came back strong, with 110 tweets from March 10th, 2020 to June 6th, 2021. From the first three tweets, we can easily understand that Masayoshi Son was concerned about SARS-CoV-2.

In [80]:
import csv
df_masa = pd.read_csv("MasaSB.csv", sep=";",encoding='utf-8')
df_masa.head(3)

| | UserScreenName | UserName | Date | Text |
|---|---|---|---|---|
| 0 | 孫正義 | @masason | 2020-03-10 12:15:50 | 久しぶりのツイートです。新型コロナウイルスの状況を心配しています。 |
| 1 | 孫正義 | @masason | 2020-03-11 02:52:50 | 行動を開始します。 |
| 2 | 孫正義 | @masason | 2020-03-11 09:46:03 | 本日厚労省を訪問しました。医療崩壊を起こさないよう連携しながらやっていきたい。#コロナ検査有志 |


## 8. Haiku / 一句

Often pictured by the Western media as "*not your typical Japanese businessman*", Masayoshi Son tells another story through his Twitter feed. He uses the polite form (丁寧語) to speak with his followers. When speaking to government officials, he uses the humble form (謙譲語) out of respect. We can see that the Japanese CEO supports his local baseball team, takes pictures of cherry blossoms, and writes haiku.

In [81]:
print(df_masa['Text'][95])

十六で
志立て単身渡米。
今の正直な心境を一句
したためてみました。

飛び跳ねて
田で鳴くカエル
空遠し。


>"At sixteen, I had the will to go alone in USA. I will try to write a haiku about my present state of mind:

>**Jump up and down**

>**Croaking frogs in the rice field**

>**Far away"**

  *Masayoshi Son*

*Note: the standard NLP preprocessing (deleting the `\n`) would have destroyed the structure of this poetic tweet. : )*

In [82]:
text = df_masa['Text']

In [83]:
import re  # Tweet cleaning
def clean_text(text):
    # Remove URLs and @mentions first, while newlines and spaces still delimit them
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)
    text = text.replace("#", "")       # drop the hashtag symbol, keep its text
    text = text.replace("\n", "")      # line breaks
    text = text.replace("\u3000", "")  # full-width (ideographic) spaces
    return text
df_masa['Text'] = df_masa['Text'].map(clean_text)

In [84]:
df_masa['Text'].head(6)  # read the cleaned column; `text` from In [82] still holds the raw tweets

0                    久しぶりのツイートです。新型コロナウイルスの状況を心配しています。
1                                            行動を開始します。
2       本日厚労省を訪問しました。医療崩壊を起こさないよう連携しながらやっていきたい。コロナ検査有志
3                             現在検討中の簡易PCR検査の流れ コロナ検査有志
4    検査したくても検査してもらえない人が多数いると聞いて発案したけど、評判悪いから、やめようかな...
5    新型コロナウイルスに不安のある方々に、簡易PCR検査の機会を無償で提供したい。まずは100万...
Name: Text, dtype: object

### Test of custom NER model with 4 tweets

In [85]:
blm_fund = df_masa['Text'][62]
doc4 = nlp_bespoke(blm_fund)
displacy.render(doc4,style="ent", jupyter=True)

* The custom model is spotting **Softbank Group (SBグループ)** as a company, but not the new fund.

In [86]:
paypay = df_masa['Text'][103]
doc5 = nlp_bespoke(paypay)
displacy.render(doc5,style="ent",jupyter=True)

* The NER model spots **Hawks (ホークス)** as the baseball team without being confused by the **Fukuoka PayPay Dome (福岡PayPayドーム)**.

In [87]:
hage = df_masa['Text'][75]
doc6 = nlp_bespoke(hage)
displacy.render(doc6,style="ent", jupyter=True)

* The entity **SARS-CoV-2 (新型コロナウイルス)** was correctly found as an event inside a confusing joke linking Covid-19 and baldness.

In [88]:
vaccin = df_masa['Text'][102]
doc7 = nlp_bespoke(vaccin)
displacy.render(doc7,style="ent",jupyter=True)

* In the last tweet, the entity **vaccine (ワクチン)** was also found correctly.

## 9. Insights / 洞察

* Covid-19:

Of his 110 tweets, 77 were about Covid-19, by far his main topic. He connected with government officials to offer assistance, and even the use of the Fukuoka PayPay Dome. The Japanese CEO used his companies to obtain masks, test kits, and other necessities.

* Business:

Masayoshi Son talked about Softbank Group as a company only 4 times: 2 tweets to announce financial results, 1 tweet in response to Softbank's first deficit, and 1 tweet to announce that SoftBank Group would launch a `$100` million fund, the **"Opportunity Growth Fund"**, for minority-owned startups in response to the Black Lives Matter movement.

* Other facts:

He posted 6 tweets to celebrate the championship title of the Softbank Hawks and 3 tweets to share his concerns about the upcoming Tokyo Olympics. There is only 1 haiku, but more than 10 tweets with inspiring quotes. He shared one picture of cherry blossoms, one of a lotus flower, one of his brother, and one joke about baldness. Since Masayoshi Son does not use Twitter as a digital tool to promote Softbank, he might not be the best CEO to mine for signals.
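Counts of this kind can be approximated with simple keyword matching. The sketch below is not how the figures above were obtained; it is a minimal illustration in which the `TOPIC_KEYWORDS` lists and the helper names are assumptions, applied to a few tweets quoted earlier in this notebook.

```python
from collections import Counter

# Hypothetical keyword lists; a real study would need a far more careful lexicon.
TOPIC_KEYWORDS = {
    "covid": ["コロナ", "PCR", "ワクチン", "マスク"],
    "business": ["決算", "SBグループ", "ソフトバンクグループ"],
    "baseball": ["ホークス"],
}


def topics_of(tweet):
    """Return the set of topics whose keywords appear in the tweet."""
    return {topic for topic, keywords in TOPIC_KEYWORDS.items()
            if any(keyword in tweet for keyword in keywords)}


def count_topics(tweets):
    """Count, for each topic, how many tweets mention it at least once."""
    counts = Counter()
    for tweet in tweets:
        counts.update(topics_of(tweet))
    return counts


sample = [
    "久しぶりのツイートです。新型コロナウイルスの状況を心配しています。",
    "行動を開始します。",
    "本日厚労省を訪問しました。医療崩壊を起こさないよう連携しながらやっていきたい。コロナ検査有志",
    "今日の決算発表の録画はこちらです。興味があるの方、ご覧ください。",
]
counts = count_topics(sample)
print(counts)
```

On the full dataset, the same function could be called as `count_topics(df_masa['Text'])`. Entities from the NER model can replace the raw keywords once the model is reliable enough.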

## 10. Hiroshi Mikitani of Rakuten / 三木谷浩史 (楽天株式会社)

During the same period, **Hiroshi Mikitani**, the founder and chairman of **Rakuten**, shared around 230 tweets with his 1.1M followers. The Japanese entrepreneur carefully shares news about his ventures (**Lyft, Viber, Pinterest**), Rakuten’s financial results, his baseball team, the **Tohoku Rakuten Golden Eagles**, sports partnerships (**FC Barcelona, Golden State Warriors**), and meditation. A digital-savvy executive, Hiroshi Mikitani secured an investment deal worth `150` billion yen (`$1.4` billion) from Japan Post and Tencent Investment for an `8%` stake in Rakuten. In his use of social media, the Japanese CEO displays a clear pattern in his digital communication.

## 11. Governor of the Bank of Japan, Haruhiko Kuroda / 第31代日本銀行総裁、黒田 東彦
After **Jerome Powell** of the US Federal Reserve, **Haruhiko Kuroda** of the Bank of Japan is probably one of the most closely followed central bank governors. The 31st governor is not on Twitter, but the Bank of Japan constantly shares great insights in both Japanese and English.

Yet not everything is shared in both languages... Let's look at the RSS feed.

In [89]:
import feedparser
jarawrss = ['https://www.boj.or.jp/rss/whatsnew.xml']
enrawrss = ['https://www.boj.or.jp/en/rss/whatsnew.xml']

def fetch_posts(urls):
    # Collect (published, title, link) for every entry of every feed
    return [(post.published, post.title, post.link)
            for url in urls
            for post in feedparser.parse(url).entries]

JA = pd.DataFrame(fetch_posts(jarawrss), columns=['JA_Date', 'JA_Title', 'JA_Link'])
EN = pd.DataFrame(fetch_posts(enrawrss), columns=['EN_Date', 'EN_Title', 'EN_Link'])

In [90]:
JA.head(5)

| | JA_Date | JA_Title | JA_Link |
|---|---|---|---|
| 0 | Mon, 14 Jun 2021 17:00:00 +0900 | 日本銀行による国庫短期証券の銘柄別買入額 | http://www.boj.or.jp/statistics/boj/other/tmei... |
| 1 | Mon, 14 Jun 2021 17:00:00 +0900 | 日本銀行が保有する国債の銘柄別残高 | http://www.boj.or.jp/statistics/boj/other/mei/... |
| 2 | Mon, 14 Jun 2021 11:00:00 +0900 | （金研ニュースレター）2021年国際コンファランス | http://www.boj.or.jp/announcements/release_202... |
| 3 | Mon, 14 Jun 2021 10:00:00 +0900 | 営業毎旬報告（6月10日現在） | http://www.boj.or.jp/statistics/boj/other/acma... |
| 4 | Mon, 14 Jun 2021 10:00:00 +0900 | 金融広報関連事務担当者の募集について | http://www.boj.or.jp/announcements/release_202... |


In [91]:
EN.head(5)

| | EN_Date | EN_Title | EN_Link |
|---|---|---|---|
| 0 | Mon, 14 Jun 2021 17:00:00 +0900 | T-Bills Purchased by the Bank of Japan | http://www.boj.or.jp/en/statistics/boj/other/t... |
| 1 | Mon, 14 Jun 2021 17:00:00 +0900 | Japanese Government Bonds Held by the Bank of ... | http://www.boj.or.jp/en/statistics/boj/other/m... |
| 2 | Mon, 14 Jun 2021 11:00:00 +0900 | (IMES Newsletter) 2021 BOJ-IMES Conference | http://www.boj.or.jp/en/announcements/release_... |
| 3 | Mon, 14 Jun 2021 10:00:00 +0900 | Bank of Japan Accounts (June 10) | http://www.boj.or.jp/en/statistics/boj/other/a... |
| 4 | Thu, 10 Jun 2021 16:30:00 +0900 | Basic Figures on Fails (May) | http://www.boj.or.jp/en/statistics/set/bffail/... |


In [92]:
print("Bank of Japan - RSS feed - Articles in Japanese", JA.shape)
print("Bank of Japan - RSS feed - Articles in English", EN.shape)

Bank of Japan - RSS feed - Articles in Japanese (89, 3)
Bank of Japan - RSS feed - Articles in English (64, 3)


The difference between the two RSS feeds of the BOJ is a perfect illustration of information asymmetry due to language. The extremely interesting research paper on the **"Digital Transformation of Japanese Banks (わが国の銀行におけるデジタル・トランスフォーメーション)"** was published on March 29th, 2021 in Japanese and on May 31st, 2021 in English.
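Both aspects of this asymmetry, the number of items and the delay of the English version, can be quantified with the standard library. The sketch below only reuses figures quoted in this notebook (89 and 64 feed entries, and the two publication dates of the research paper); the helper names are hypothetical.

```python
from datetime import date
from email.utils import parsedate_to_datetime


def relative_gap_percent(n_ja, n_en):
    """How many more items (in %) the Japanese feed has than the English one."""
    return (n_ja - n_en) / n_en * 100


def lag_in_days(published_ja, published_en):
    """Number of days between the Japanese and the English publication."""
    return (published_en - published_ja).days


gap = relative_gap_percent(89, 64)
lag = lag_in_days(date(2021, 3, 29), date(2021, 5, 31))
print(f"{gap:.1f}% more entries in Japanese; English paper published {lag} days later")

# The RSS dates are RFC 2822 strings, which the standard library can parse directly.
latest = parsedate_to_datetime("Mon, 14 Jun 2021 17:00:00 +0900")
print(latest.date())
```

With parsed dates in both DataFrames, the same lag could be computed for every document that exists in both languages, provided the Japanese and English entries can be matched.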

In [93]:
research = JA['JA_Title'][85]
doc8 = nlp_bespoke(research)
displacy.render(doc8,style="ent",jupyter=True)

And yes, my custom NER model still works on the RSS feed of the BOJ. For legal reasons, I will not go deeper in this notebook. Overall, the **Bank of Japan** shares about 30% more reports, statistics, research/studies, and announcements in Japanese than in English (in the snapshot above, 89 entries versus 64).

## 12. Final thoughts on NER in Japanese

Fifteen years ago, French was difficult to process using C++, Pascal, and Object Pascal. Eight years ago, Japanese was still a nightmare. In recent years, it became really easy to do NLP using a framework in multiple languages.

spaCy is fast, powerful, stable, and perfect for production. Ginza could be leveraged for less domain-specific tasks in Japanese. I achieved pretty amazing results with my custom model based on a small training dataset.

In the financial sector, a bespoke NER model needs to include corporate entities, subsidiaries, titles (CEO, CFO), boards of directors (for financial reports), tickers, Bloomberg shortcuts (TPX/JSDA), macro phenomena, currencies, products, business partnerships, and financial keywords such as “赤字” for deficit or “倒産” for bankruptcy.

In Japanese, tokenization can play a huge role in the final results. Wrong tokenization often leads to a loss in both syntax and semantics.
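One concrete way tokenization affects NER is span alignment: an entity can only be labeled if its character span starts and ends on token boundaries (in spaCy, for instance, `Doc.char_span` returns `None` by default when it does not). The sketch below reproduces that check with plain Python. The two tokenizations are hypothetical, chosen only to show the effect.

```python
def token_boundaries(tokens):
    """Return the sets of character offsets where tokens start and end."""
    starts, ends, position = set(), set(), 0
    for token in tokens:
        starts.add(position)
        position += len(token)
        ends.add(position)
    return starts, ends


def span_is_aligned(tokens, start, end):
    """True if the character span [start, end) matches token boundaries."""
    starts, ends = token_boundaries(tokens)
    return start in starts and end in ends


# Two hypothetical segmentations of "ソフトバンクG(9984)".
good_tokens = ["ソフトバンク", "G", "(", "9984", ")"]
bad_tokens = ["ソフトバンク", "G(", "9984", ")"]

# The entity "ソフトバンクG" covers characters 0 to 7.
print(span_is_aligned(good_tokens, 0, 7))
print(span_is_aligned(bad_tokens, 0, 7))
```

In the second segmentation, the company name cannot be isolated because "G" is glued to the parenthesis, so the annotation would be lost before the model ever sees it.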

Coming from a linguistic background, I find that upcoming data scientists focus too much on the latest neural model or scores, when it’s important to have good labeling in context and a stable model.

Optimal results in NLP often rely on linguistic basics, proficiency in the language, and knowledge of one particular sector. I would also argue that, while Japanese is mandatory, incorporating correct English can create better NER models in Japanese.

Thank you for reading!

Please feel free to contact me if you have any questions.



**Akim Mousterou**

***
**Disclaimer**: *None of the content published on this notebook constitutes a recommendation that any particular security, portfolio of securities, transaction, or investment strategy is suitable for any specific person. None of the information providers or their affiliates will advise you personally concerning the nature, potential, value, or suitability of any particular security, portfolio of securities, transaction, investment strategy, or other matter.*