<a href="https://colab.research.google.com/github/ShinAsakawa/ShinAsakawa.github.io/blob/master/2025notebooks/2025_0915CDPja.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Preparation

<img src="https://raw.githubusercontent.com/project-ccap/project-ccap.github.io/refs/heads/master/2025figs/1998Zorzi_CDP_fig1.svg" style="width:49%;"><br/>
<p>Zorzi+(1998) Fig.1 Architecture of the model. The arrow denotes full connectivity between layers. Each box stands for a group of letters (26) or phonemes (44).</p>


<img src="https://raw.githubusercontent.com/project-ccap/project-ccap.github.io/refs/heads/master/2025figs/1998Zorzi_CDP_fig8.svg" width="49%;"><br/>
<p>Zorzi+(1998) Fig.8. Architecture of the model with the hidden layer pathway. In both the direct pathway and the mediated pathway the layers are fully connected (arrows).</p>

<img src="https://raw.githubusercontent.com/project-ccap/project-ccap.github.io/refs/heads/master/2025figs/1998Zorzi_fig10.svg" width="49%"><br/>
<p style="text-align:center">
Figure 10. Lexical and sublexical procedures in reading aloud, and their interaction in the phonological decision system, where the final phonological code is computed for articulation.
</p>
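
The two captions above describe a direct pathway and a hidden-layer (mediated) pathway that both project onto the same output units. The following is a minimal sketch of that idea, not the CDP_ja implementation: the class name `TwoPathwaySketch`, the use of 8 letter/phoneme positions, the hidden size, and the sigmoid activations are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TwoPathwaySketch(nn.Module):
    """Hypothetical illustration: a direct pathway plus a hidden-layer pathway."""
    def __init__(self, n_inp: int, n_hid: int, n_out: int):
        super().__init__()
        self.direct = nn.Linear(n_inp, n_out)       # direct pathway (no hidden units)
        self.to_hidden = nn.Linear(n_inp, n_hid)    # mediated pathway, first stage
        self.from_hidden = nn.Linear(n_hid, n_out)  # mediated pathway, second stage

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = torch.sigmoid(self.to_hidden(x))
        # Both pathways send their net input to the same output units
        return torch.sigmoid(self.direct(x) + self.from_hidden(hidden))

torch.manual_seed(0)
n_positions = 8                                   # assumed number of slots
net = TwoPathwaySketch(n_inp=n_positions * 26, n_hid=16, n_out=n_positions * 44)
x = torch.zeros(1, n_positions * 26)
x[0, 0] = 1.0                                     # switch on a single letter unit
y = net(x)
print(y.shape)
```

Summing the two net inputs before the output nonlinearity is one simple way to let both pathways contribute to a single phonological output.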


## 0.1 Importing the required libraries

In [None]:
%config InlineBackend.figure_format = 'retina'
import torch
#device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device = torch.device('cuda:0' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
print(f'device:{device}')

# Libraries shared by all models
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

from collections import OrderedDict
import sys
import os
import numpy as np
import operator
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

HOME = os.environ['HOME']

# Check whether the runtime is Google Colaboratory
from IPython import get_ipython
isColab = 'google.colab' in str(get_ipython())
print(f'isColab:{isColab}')

try: # needed to render Japanese text in figures
    import japanize_matplotlib
except ImportError:
    !pip install japanize_matplotlib
    import japanize_matplotlib

try:
    import jaconv
except ImportError:
    !pip install jaconv
    import jaconv   # import again after installation

try: # import the author's own library
    import CDP_ja
except ImportError:

    !git clone https://github.com/ShinAsakawa/CDP_ja.git
    import CDP_ja

## Loading CDP_ja (tokenizers and models)

* Tokenizers: mora segmentation, Kunrei-shiki romanization, Gakushu (education) kanji, Joyo (regular-use) kanji
* Models: direct_TLA, indirect_TLA, combined_TLA

The cell below imports them.
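
To make "mora segmentation" concrete, here is a small sketch of splitting a katakana reading into morae. It is only an illustration and is not how `mora_Tokenizer` in CDP_ja is implemented; the function name `split_into_morae` and the rule set (small kana attach to the preceding character, everything else is its own mora) are assumptions.

```python
# Small kana that combine with the preceding kana to form one mora
SMALL_KANA = set("ァィゥェォャュョヮ")

def split_into_morae(katakana: str) -> list[str]:
    """Split a katakana string into morae (hypothetical helper)."""
    morae: list[str] = []
    for ch in katakana:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch      # e.g. キ + ョ -> キョ
        else:
            morae.append(ch)     # ッ, ン and ー each count as one mora
    return morae

print(split_into_morae("キョウト"))   # ['キョ', 'ウ', 'ト']
print(split_into_morae("ガッコウ"))   # ['ガ', 'ッ', 'コ', 'ウ']
```

Each mora then becomes one output token, which is why the output length of a word can differ from its number of kanji.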

In [None]:
# %load_ext autoreload
# %autoreload 2

# Tokenizers for mora segmentation, Kunrei-shiki romanized phonology, Gakushu kanji, and Joyo kanji
from CDP_ja import mora_Tokenizer, kunrei_Tokenizer, gakushu_Tokenizer, joyo_Tokenizer

mora_tokenizer = mora_Tokenizer()
kunrei_tokenizer = kunrei_Tokenizer()
gakushu_tokenizer = gakushu_Tokenizer()
joyo_tokenizer = joyo_Tokenizer()

# Three TLA models: single-layer perceptron, three-layer perceptron, and three-layer perceptron with skip connections
from CDP_ja import direct_TLA, indirect_TLA, combined_TLA

# Models based on recurrent neural networks
from CDP_ja import Seq2Seq_wAtt, Seq2Seq_woAtt

# Other utility functions
from CDP_ja import fit_an_epoch
from CDP_ja import eval_an_epoch
from CDP_ja import Psylex71_Dataset
from CDP_ja import init_seed

# Defining the psylex71 dataset (NTT Japanese lexical properties, word-frequency data)
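
The cell that follows builds one dataset per maximum word length and stores them in a dictionary. As a rough sketch of that pattern on toy data (the helper `select_by_length` and the word list are made up here and are unrelated to the internals of `Psylex71_Dataset`):

```python
def select_by_length(words, min_len, max_len):
    """Keep the words whose character length lies in [min_len, max_len]."""
    return [w for w in words if min_len <= len(w) <= max_len]

toy_words = ["山", "戦争", "図書館", "新幹線", "自動販売機"]

# One subset per maximum length, keyed by that length
toy_datasets = {max_len: select_by_length(toy_words, 1, max_len)
                for max_len in [2, 3, 4, 5]}

for max_len, subset in toy_datasets.items():
    print(f"max length {max_len}: {len(subset)} words")
```

Keying the dictionary by the maximum length makes it easy to switch datasets later by changing a single integer.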

In [None]:
psylex71_dss={}   # dictionary that holds several datasets

#input_tokenizer = gakushu_tokenizer
input_tokenizer = joyo_tokenizer
output_tokenizer = mora_tokenizer

inp_minlen=1      # minimum word length (in characters) of the input

for inp_maxlen in [2,3,4,5,6,7]:  # maximum word length (in characters) of the input
#for inp_maxlen in [2]:

    # Print all character candidates only when the maximum length equals the minimum length
    display = (inp_maxlen == inp_minlen)

    psylex71_dss[inp_maxlen] = Psylex71_Dataset(
        inp_minlen=inp_minlen,
        inp_maxlen=inp_maxlen,             # only this value changes inside the loop
        input_tokenizer=input_tokenizer,   # tokenizer for the input
        output_tokenizer=output_tokenizer, # tokenizer for the output
        #excel_fname='psylex71all.xlsx',
        excel_fname='psylex71all_sorted.xlsx',
        #device=device,
        display=display)

    print(f'psylex71 min length:{inp_minlen:2d}, max length:{inp_maxlen:2d}',
          f'dataset size (words):{len(psylex71_dss[inp_maxlen]):7,d}')

# Splitting into training (train), validation (valid), and test datasets

## Choosing a dataset

## Splitting the dataset: training, test, and validation sets
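
As a standalone sketch of the split performed in the next cell, the snippet below divides a toy dataset into three parts with `torch.utils.data.random_split` and a seeded generator. The toy dataset and the names `toy_ds`, `train_part`, `valid_part`, and `resid_part` are assumptions for illustration; only the 10% / 10% / remainder proportions mirror the notebook.

```python
import torch
from torch.utils.data import TensorDataset, random_split

toy_ds = TensorDataset(torch.arange(1000))

train_size = int(len(toy_ds) * 0.1)
valid_size = int(len(toy_ds) * 0.1)
resid_size = len(toy_ds) - train_size - valid_size   # lengths must sum to len(toy_ds)

train_part, valid_part, resid_part = random_split(
    toy_ds,
    lengths=(train_size, valid_size, resid_size),
    generator=torch.Generator().manual_seed(42))      # seeded for reproducibility

print(len(train_part), len(valid_part), len(resid_part))
```

Computing the residual size by subtraction guarantees that the three lengths add up exactly, even when the fractions do not divide the dataset evenly.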

In [None]:
# Split the dataset into training, validation, and residual subsets

# Set the random seed
seed=42
init_seed(seed=seed)

# Select the dataset to use by assigning it to _ds
inp_maxlen = 2
#inp_maxlen = 5
_ds = psylex71_dss[inp_maxlen]
#_ds = psylex71_ds_mora
#_ds = psylex71_ds_o2o
#_ds = psylex71_ds_p2p
#_ds = psylex71_ds_kunrei

# To save time, only the following fractions of the full data are used here
#train_size = int(len(_ds) * 0.9)
#valid_size = int(len(_ds) * 0.1)
#train_size = int(len(_ds) * 0.8)
#valid_size = int(len(_ds) * 0.2)
train_size = int(len(_ds) * 0.1)
valid_size = int(len(_ds) * 0.1)
resid_size = len(_ds) - train_size - valid_size

# Perform the actual split (the third subset is kept as resid_ds so that resid_size stays an int)
train_ds, valid_ds, resid_ds = torch.utils.data.random_split(
    dataset=_ds,
    lengths=(train_size, valid_size, resid_size),
    generator=torch.Generator().manual_seed(seed))

print(f'train_size:{train_size}')
print(f'valid_size:{valid_size}')

## Defining the batch size and setting up the data loaders
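
Words differ in length, so a mini-batch cannot be stacked into one tensor without padding. The sketch below shows one common arrangement, assumed here for illustration: a collate function that returns lists of tensors, followed by `pad_sequence`. The toy ID sequences and the name `collate_as_lists` are hypothetical, and where the padding actually happens inside CDP_ja is not shown in this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy (input ids, target ids) pairs of different lengths
toy_pairs = [
    (torch.tensor([2, 5, 1]),    torch.tensor([2, 7, 8, 9, 1])),
    (torch.tensor([2, 5, 6, 1]), torch.tensor([2, 7, 1])),
    (torch.tensor([2, 4, 1]),    torch.tensor([2, 3, 3, 1])),
]

def collate_as_lists(batch):
    """Return the inputs and targets of a mini-batch as two lists of tensors."""
    inps, tgts = zip(*batch)
    return list(inps), list(tgts)

loader = DataLoader(toy_pairs, batch_size=2, shuffle=False, collate_fn=collate_as_lists)
inps, tgts = next(iter(loader))

# Pad to the longest sequence in the batch; 0 is assumed to be the padding ID
inp_batch = pad_sequence(inps, batch_first=True, padding_value=0)
tgt_batch = pad_sequence(tgts, batch_first=True, padding_value=0)
print(inp_batch.shape, tgt_batch.shape)
```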

In [None]:
# Mini-batch size
#batch_size = 1024
batch_size = 128

# Collate function that returns each mini-batch as lists of inputs and targets
def _collate_fn(batch):
    inps, tgts = list(zip(*batch))
    inps = list(inps)
    tgts = list(tgts)
    return inps, tgts

# Data loader for the training dataset
train_dl = torch.utils.data.DataLoader(
    dataset=train_ds,
    batch_size=batch_size,
    shuffle=True,
    num_workers=0,
    collate_fn=_collate_fn)

# Data loader for the validation dataset (no shuffling is needed for evaluation)
valid_dl = torch.utils.data.DataLoader(
    dataset=valid_ds,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    collate_fn=_collate_fn)

print(f'len(train_ds):{len(train_ds)}')
print(f'len(valid_ds):{len(valid_ds)}')

# Defining the models
## Defining the TLA models

In [None]:
#from RAM import Transformer
from CDP_ja import Transformer
transformer = Transformer(src_vocab_size=len(input_tokenizer.tokens),
                          tgt_vocab_size=len(output_tokenizer.tokens),
                          model_dim=256,
                          num_heads=4,
                          num_layers=1,
                          max_seq_length=_ds.out_maxlen,
                          dropout=0.,
                          ff_dim=32,
                          device=device)
transformer.eval();

## Sanity-checking the defined models

In [None]:
inp_maxlen

In [None]:
# Hyperparameters needed to instantiate the models
_ds = train_ds                     # dataset
n_hid=256                          # number of hidden-layer units
out_f = None
hidden_out_f = None
n_layers=1
bidirectional=False

tla_direct = direct_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len = inp_maxlen + 2,
    out_f = out_f,
    out_len = _ds.dataset.out_maxlen).to(device)
print(tla_direct.eval())


tla_indirect = indirect_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len = inp_maxlen + 2,
    out_f = out_f,
    hidden_out_f=hidden_out_f,
    out_len = _ds.dataset.out_maxlen).to(device)
print(tla_indirect.eval())

tla_combined = combined_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len = inp_maxlen + 2,
    out_f = out_f,
    hidden_out_f=hidden_out_f,
    out_len = _ds.dataset.out_maxlen).to(device)
print(tla_combined.eval())

tla_seq2seq = Seq2Seq_wAtt(
    enc_vocab_size=len(input_tokenizer.tokens),
    dec_vocab_size=len(output_tokenizer.tokens),
    n_layers=n_layers,
    bidirectional=bidirectional,
    n_hid=n_hid).to(device)
print(tla_seq2seq.eval())

tla_seq2seq0 = Seq2Seq_woAtt(
    enc_vocab_size=len(input_tokenizer.tokens),
    dec_vocab_size=len(output_tokenizer.tokens),
    n_layers=n_layers,
    bidirectional=bidirectional,
    n_hid=n_hid).to(device)
print(tla_seq2seq0.eval())

transformer = Transformer(src_vocab_size=len(input_tokenizer.tokens),
                          tgt_vocab_size=len(output_tokenizer.tokens),
                          model_dim=n_hid,
                          num_heads=4,
                          num_layers=n_layers,
                          max_seq_length=_ds.dataset.out_maxlen,
                          dropout=0.,
                          ff_dim=32,
                          device=device)
transformer.eval()

# Assign the models to variables
#tla = tla_vanilla
model1 = tla_direct
model2 = tla_indirect
model3 = tla_combined
model4 = tla_seq2seq
model5 = tla_seq2seq0
model6 = transformer

model = model1

# Run N samples through the model
N = 2
ids = np.random.permutation(len(_ds))[:N]  # shuffle the indices and keep the first N

for idx in ids:
    # The dataset returns an input signal inp and a teacher signal tch
    inp, tch = _ds.__getitem__(idx)
    print(f'idx:{idx}:', f'inp:{inp}', f'tch:{tch}')

    # Inputs and outputs are token IDs, so convert them to a human-readable form
    print(f'_ds.dataset.ids2inp({inp}):{_ds.dataset.ids2inp(inp)}')
    print(f'_ds.dataset.target_ids2target({tch}):{_ds.dataset.target_ids2target(tch)}')
    inp = pad_sequence(inp.unsqueeze(0), batch_first=True).to(device)
    tch = pad_sequence(tch.unsqueeze(0), batch_first=True).to(device)

    # Feed the data to the model and obtain its output
    outs = model(inp, tch)

    # Show the result
    print('target:', _ds.dataset.target_ids2target([_t.cpu().numpy() for _t in tch.squeeze(0)]), end=": ")
    print('target ids:', [int(_tch.cpu().numpy()) for _tch in tch.squeeze(0)])
    print('output ids:', [int(_out.argmax().cpu().numpy()) for _out in outs[0]], end="\n===\n")

In [None]:
from asa_pytorch_lstm import NaiveLSTM
#print(dir(NaiveLSTM))

class vanilla_LM(torch.nn.Module):
    def __init__(self,
                 inp_vocab_size:int=None, # len(psylex71_ds.input_tokenizer.tokens),
                 out_vocab_size:int=None, # len(psylex71_ds.output_tokenizer.tokens),
                 n_hid:int=512,
                 device:str='cpu'):
        super().__init__()
        self.inp_vocab_size=inp_vocab_size
        self.out_vocab_size=out_vocab_size
        self.n_hid=n_hid

        self.emb_layer = torch.nn.Linear(in_features=inp_vocab_size, out_features=n_hid).to(device)
        self.emb_outf = torch.nn.Tanh()
        self.out_outf = torch.nn.Sigmoid()
        self.lstm = NaiveLSTM(n_inp=n_hid, n_hid=n_hid).to(device)
        self.out_layer = torch.nn.Linear(in_features=n_hid, out_features=out_vocab_size).to(device)

    def forward(self, X, Y=None):
        '''Y is accepted for interface compatibility but is not used.'''

        # The input X holds token IDs, so convert it to one-hot vectors
        X = torch.nn.functional.one_hot(X, num_classes=self.inp_vocab_size)

        X = X.float()               # one_hot returns int64; convert to floating point
        X = self.emb_layer(X)       # propagate to the embedding layer
        X = self.emb_outf(X)        # nonlinearity of the embedding layer

        X, (h, c) = self.lstm(X)

        X = self.out_layer(X)       # propagate to the output layer
        X = self.out_outf(X)        # nonlinearity of the output layer

        return X

lm = vanilla_LM(inp_vocab_size=len(input_tokenizer.tokens),
                out_vocab_size=len(output_tokenizer.tokens),
                n_hid=1024,
                device=device)
lm.eval()

# Training

## Redefining the models used for training

In [None]:
# The models to be compared are defined below
_ds = train_ds
n_layers=1
bidirectional=False
n_hid=1024
n_hid=2048

out_f = None
# hidden_out_f = 'sigmoid'
# out_f = None
#out_f = 'tanh'
hidden_out_f = 'ReLU'

tla_combined = combined_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len=inp_maxlen+2,
    out_len=_ds.dataset.out_maxlen,
    out_f=out_f,
    hidden_out_f=hidden_out_f,
    device=device)
#print(tla_combined.eval())

tla_direct = direct_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len=inp_maxlen+2,
    out_len=_ds.dataset.out_maxlen,
    out_f=out_f,
    device=device)
#print(tla_direct.eval())

tla_indirect = indirect_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len=inp_maxlen+2,
    out_len=_ds.dataset.out_maxlen,
    out_f=out_f,
    hidden_out_f=hidden_out_f,
    # out_f=None,
    # hidden_out_f=None,
    device=device)
#print(tla_indirect.eval())

enc_dec = Seq2Seq_wAtt(
    enc_vocab_size=len(input_tokenizer.tokens),
    dec_vocab_size=len(output_tokenizer.tokens),
    n_layers=n_layers,
    bidirectional=bidirectional,
    n_hid=n_hid).to(device)
#print(seq2seq.eval())

enc_dec0 = Seq2Seq_woAtt(
    enc_vocab_size=len(input_tokenizer.tokens),
    dec_vocab_size=len(output_tokenizer.tokens),
    n_layers=n_layers,
    bidirectional=bidirectional,
    n_hid=n_hid).to(device)
#print(seq2seq0.eval())

transformer = Transformer(
    src_vocab_size=len(input_tokenizer.tokens),
    tgt_vocab_size=len(output_tokenizer.tokens),
    model_dim=n_hid,
    num_heads=4,
    num_layers=n_layers,
    max_seq_length=_ds.dataset.out_maxlen,
    dropout=0.,
    ff_dim=32,
    device=device)
#print(transformer.eval())

## Running the training
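
The training cells below iterate over a list of `(learning rate, epochs)` pairs and build a new optimizer for each pair. The following sketch shows that staged schedule on a toy classifier; the model, the random data, and the names `stages` and `history` are assumptions for illustration and do not use `fit_an_epoch`.

```python
import torch

torch.manual_seed(0)
toy_model = torch.nn.Linear(4, 3)
x = torch.randn(32, 4)
y = torch.randint(0, 3, (32,))
loss_fn = torch.nn.CrossEntropyLoss()

stages = [(1e-2, 5), (1e-3, 5)]   # (learning rate, number of epochs) pairs
history = []

for lr, epochs in stages:
    # A fresh Adam optimizer per stage, so its moment estimates restart with the new rate
    optimizer = torch.optim.Adam(toy_model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(toy_model(x), y)
        loss.backward()
        optimizer.step()
        history.append(loss.item())

print(len(history))
```

Recreating the optimizer at each stage is the simplest option; a learning-rate scheduler such as those in `torch.optim.lr_scheduler` would be an alternative that keeps the optimizer state.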

In [None]:
#_ds = psylex71_dss[2]
_ds = train_ds

In [None]:

# out_f = None
model1 = indirect_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len = inp_maxlen + 2,
    out_f = out_f,
    hidden_out_f=None,
    out_len = _ds.dataset.out_maxlen).to(device)

model2 = indirect_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len = inp_maxlen + 2,
    out_f = out_f,
    hidden_out_f='sigmoid',
    out_len = _ds.dataset.out_maxlen).to(device)

model3 = indirect_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len = inp_maxlen + 2,
    out_f = out_f,
    hidden_out_f='tanh',
    out_len = _ds.dataset.out_maxlen).to(device)

model4 = indirect_TLA(
    inp_vocab_size=len(input_tokenizer.tokens),
    out_vocab_size=len(output_tokenizer.tokens),
    inp_len = inp_maxlen + 2,
    out_f = out_f,
    hidden_out_f='ReLU',
    out_len = _ds.dataset.out_maxlen).to(device)

In [None]:
model1 = tla_direct
model2 = tla_indirect
model3 = tla_combined
model4 = tla_seq2seq
model5 = tla_seq2seq0
model6 = transformer

#str(models[0]).split(",")[0]
#str(tla_indirect).split("\n")[0]
for _model in [model1, model2, model3, model4, model5, model6]:
    print(str(_model).split('\n')[0].replace('(','')) #.split('.')[-1])

In [None]:
model_list = []
#out_fs = [None, 'sigmoid', 'tanh', 'ReLU']
out_fs = [None]

#hidden_out_fs = [None, 'sigmoid', 'tanh', 'ReLU']
hidden_out_fs = [None, 'tanh', 'ReLU']

# The loop below instantiates each entry, so this list must hold model classes,
# not already-built instances such as enc_dec or transformer
model_classes = [indirect_TLA, combined_TLA, direct_TLA]

max_modelname = 0
for _model in model_classes:
    for hidden_out_f in hidden_out_fs:
        for out_f in out_fs:
            model_name = _model.__name__
            if 'direct_TLA' == model_name:
                # direct_TLA is called without hidden_out_f, as in the cells above
                model = _model(
                    inp_vocab_size=len(input_tokenizer.tokens),
                    out_vocab_size=len(output_tokenizer.tokens),
                    inp_len= inp_maxlen + 2,
                    out_f = out_f,
                    out_len = _ds.dataset.out_maxlen).to(device)
            else:
                model = _model(
                    inp_vocab_size=len(input_tokenizer.tokens),
                    out_vocab_size=len(output_tokenizer.tokens),
                    inp_len= inp_maxlen + 2,
                    out_f = out_f,
                    hidden_out_f=hidden_out_f,
                    out_len = _ds.dataset.out_maxlen).to(device)
            model_name += '_OUT' + str(out_f) + '_HID' + str(hidden_out_f)

            max_modelname = max(max_modelname, len(model_name))
            model_list.append((model_name, model))


#loss_f = torch.nn.CrossEntropyLoss(ignore_index=-1)
loss_f = torch.nn.CrossEntropyLoss()
print(len(model_list))
print(f'max_modelname:{max_modelname}')
for i, _model in enumerate(model_list):
    print(i, _model[0], end="\t\t")
    if ((i+1) % 3) == 0:
        print()

In [None]:
# List of (learning rate, number of epochs) pairs
#iter_params = [(1e-3, 2), (1e-4, 2), (1e-5, 2)]
#iter_params = [(1e-3, 20)]
#iter_params = [(1e-1, 50), (1e-2,20), (1e-3,20), (1e-4,10)]
#iter_params = [(1e-3, 30)]
#iter_params = [(1e-3, 10), (1e-4, 10)]
iter_params = [(1e-2, 10)]
# iter_params = [(1e-2, 50), (1e-3, 50), (1e-4, 10)]

models = model_list
try:
    results                      # keep results from a previous run of this cell, if any
except NameError:
    results = [{} for _ in models]

# How often (in epochs) to print intermediate results
interval = 1
#interval = 3
#interval = 10

for (lr, epochs) in iter_params: # vary the learning rate and epoch count according to the list above

    # Create the optimizers with the current learning rate
    optimizers = []
    for model in models:
        optimizers.append(torch.optim.Adam(model[1].parameters(), lr=lr))

    print(f'lr:{lr}, epochs:{epochs}')
    # Train for the given number of epochs
    for epoch in range(epochs):
        print(f"epoch:{epoch+1:3d}")

        for N, (_model, optimizer) in enumerate(zip(models, optimizers)):
            #model_name = str(type(model)).split("'")[1].split('.')[-1]
            model_name = _model[0]
            model = _model[1]

            # Run one epoch of validation
            is_seq2seq = 'Seq2Seq' in str(type(model))
            out, model = eval_an_epoch(
                model=model,
                _dl=valid_dl,
                loss_f=loss_f,
                is_seq2seq=is_seq2seq,
                device=device)
            if (epoch % interval) == 0:
                print(f"{model_name:35s} ",
                      f"valid loss={out['sum_loss']:10.3f}",
                      f"accuracy={out['P']:5.3f}",
                      f"({out['count']:5d}/{out['N']:5d})",
                      end="\t")

            if not 'valid_loss' in results[N]:
                results[N]['valid_loss'] = [out['sum_loss']]
            else:
                results[N]['valid_loss'].append(out['sum_loss'])
            if not 'valid_P' in results[N]:
                results[N]['valid_P'] = [out['P']]
            else:
                results[N]['valid_P'].append(out['P'])


            # Run one epoch of training
            out, model, optimizer = fit_an_epoch(
                model=model,
                _dl=train_dl,
                loss_f=loss_f,
                optimizer=optimizer,
                is_seq2seq=is_seq2seq,
                device=device)
            if (epoch % interval) == 0:
                print(f"train loss={out['sum_loss']:10.3f}",
                      f"accuracy={out['P']:5.3f}",
                      f"({out['count']:5d}/{out['N']:5d})",
                      end="\t")

            if not 'train_loss' in results[N]:
                results[N]['train_loss'] = [out['sum_loss']]
            else:
                results[N]['train_loss'].append(out['sum_loss'])
            if not 'train_P' in results[N]:
                results[N]['train_P'] = [out['P']]
            else:
                results[N]['train_P'].append(out['P'])

            if (epoch % interval) == 0:
                print()

## Plotting the learning curves

In [None]:
for i in range(len(results)):    # one curve per trained model
    #plt.plot(results[i]['train_loss'], 'x-', label=f'{i}: training data')
    #plt.plot(results[i]['valid_loss'], 'o-', label=f'{i}: validation data')
    plt.plot(results[i]['train_P'], 'x-', label=f'model {i}: training data')
    #plt.plot(results[i]['valid_P'], 'o-', label=f'model {i}: validation data')

plt.legend()
plt.xlabel('epochs')
plt.ylim(0,1)
plt.show()

#print(results)


# Examining the words the models failed to read
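
A word counts as read correctly only when every output token matches the target. As a plain-Python sketch of that criterion (the helper names, the toy ID lists, and the choice to ignore padded target positions are assumptions; the cells below compare every position, padding included):

```python
def is_exact_match(predicted_ids, target_ids, pad_id=0):
    """True when the prediction equals the target at every non-padding position."""
    return all(p == t for p, t in zip(predicted_ids, target_ids) if t != pad_id)

def collect_errors(predictions, targets):
    """Return the indices of the items that were not reproduced exactly."""
    return [i for i, (p, t) in enumerate(zip(predictions, targets))
            if not is_exact_match(p, t)]

toy_predictions = [[2, 54, 148, 1, 0], [2, 54, 10, 1, 0], [2, 7, 1, 0, 0]]
toy_targets     = [[2, 54, 148, 1, 0], [2, 54, 148, 1, 0], [2, 7, 1, 0, 0]]

toy_errors = collect_errors(toy_predictions, toy_targets)
accuracy = 1 - len(toy_errors) / len(toy_targets)
print(toy_errors, f"{accuracy:.3f}")
```

Dividing by the number of items actually evaluated, rather than by the size of the whole dataset, keeps the reported accuracy meaningful when only a subset is scanned.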

In [None]:
from tqdm.notebook import tqdm

In [None]:
# psylex71_ds_mora is not defined in this notebook; the dataset built above is used instead
_ds = psylex71_dss[inp_maxlen]
n_eval = len(_ds) >> 3      # evaluate only the first 1/8 of the dataset to save time

_errors = {}
for N, (model_name, model) in enumerate(models[2:]):   # models holds (name, model) pairs
    model = model.eval()

    if not model_name in _errors:
        _errors[model_name] = []

    is_seq2seq = 'seq2seq' in str(type(model)).lower()
    verbose = False
    for idx in tqdm(range(n_eval)):
        inp, tch = _ds.__getitem__(idx)
        inp = pad_sequence(inp.unsqueeze(0), batch_first=True).to(device)
        tch = pad_sequence(tch.unsqueeze(0), batch_first=True).to(device)
        if is_seq2seq:
            out, state = model(inp,tch)
        else:
            out = model(inp,tch)
        out = out.squeeze(0).argmax(dim=1)
        tch = tch.squeeze(0)
        yesno = bool((out == tch).all())   # correct only if every token matches

        if not yesno:
            record = (f'{idx:07d}',
                      f'{yesno}',
                      f'output:{"".join(ch for ch in _ds.output_tokenizer.decode(out)).replace("<PAD>","")}',
                      f'{_ds.getitem(idx)}')
            if verbose:
                print(*record)
            else:
                _errors[model_name].append(record)

    # Accuracy is computed over the n_eval items that were actually evaluated
    print(f'{model_name}',
      f'total num. errors:{len(_errors[model_name])}', '/', f'{n_eval}',
      f'P(Correct)={((n_eval - len(_errors[model_name])) / n_eval)*100:.3f}%')

In [None]:
for model_name in list(_errors.keys()):
    for err in _errors[model_name][-5:]:
        print(model_name, err)

In [None]:
# Examine the words the model failed to read

loss_f = torch.nn.CrossEntropyLoss(ignore_index=-1)

def _strip_special(tokens):
    '''Join decoded tokens and remove the special tokens.'''
    s = "".join(tokens)
    for sp in ('<PAD>', '<SOW>', '<EOW>'):
        s = s.replace(sp, '')
    return s

model = tla_seq2seq0.eval()
for i, (inp, tch) in enumerate(valid_ds):
    out = model(inp,tch)
    out_ids = out.argmax(dim=1)
    yesno = bool((tch == out_ids).all())
    if not yesno:
        # valid_ds was built with input_tokenizer and output_tokenizer, so decode with the same ones
        print(f'input:{_strip_special(input_tokenizer.decode(inp))}',
              f'output:{_strip_special(output_tokenizer.decode(out_ids.detach().cpu().numpy()))}',
              f'target:{_strip_special(output_tokenizer.decode(tch))}',
              end=" ")
        loss = loss_f(out,tch)   # no backward pass is needed just to report the loss
        print(f'loss:{loss.item():.3f}')


In [None]:
def eval_a_word(model:torch.nn.Module=tla_seq2seq0,
                wrd:str="",
                input_tokenizer=gakushu_tokenizer,
                output_tokenizer=mora_tokenizer):
    # Wrap the word's token IDs with the start-of-word and end-of-word tokens
    inps = torch.LongTensor([input_tokenizer.tokens.index('<SOW>')]+input_tokenizer(wrd)+[input_tokenizer.tokens.index('<EOW>')])
    inps = pad_sequence(inps.unsqueeze(0), batch_first=True).to(device)

    # Hard-coded teacher IDs kept from the original experiment
    tchs = pad_sequence(torch.LongTensor([[  2,  54, 148,  56,  10,   1,   0]]),batch_first=True).to(device)
    #tchs = pad_sequence(torch.LongTensor([[  2,  0]]),batch_first=True).to(device)
    outs = model(inps,tchs)
    print(f'inps:{inps}')
    print(f'tchs:{tchs}')
    print(f'outs:{outs.squeeze(0).argmax(dim=1)}')
    return inps, wrd, tchs, outs

model = tla_seq2seq0.eval()
inps, wrd, tchs, outs = eval_a_word(model=model, wrd='戦争')
# print(inp,wrd, out0, outs.squeeze(0).argmax(dim=1))
# sys.exit()


In [None]:
wrd='戦争'

if wrd in psylex71_ds_mora.inputs:
    idx = psylex71_ds_mora.inputs.index(wrd)
    print(idx, wrd, psylex71_ds_mora.getitem(idx), psylex71_ds_mora.__getitem__(idx))

output_tokenizer.tokens.index('')

#sys.exit()
def eval_a_word(model:torch.nn.Module=tla_seq2seq0,
                wrd:str="",
                input_tokenizer=gakushu_tokenizer,
                output_tokenizer=mora_tokenizer):
    inps = torch.LongTensor(input_tokenizer(wrd))
    print(wrd, inps)
    inps = pad_sequence(inps.unsqueeze(0), batch_first=True).to(device)
    print(wrd, inps)
    return inps, wrd

eval_a_word(model=tla_seq2seq0, wrd=wrd)
sys.exit()

inps,tchs = valid_ds.__getitem__(8)
inps = pad_sequence(inps.unsqueeze(0), batch_first=True).to(device)
tchs = pad_sequence(tchs.unsqueeze(0), batch_first=True).to(device)

# inps, tchs = next(iter(valid_dl))
# inps = pad_sequence(inps, batch_first=True).to(device)
# tchs = pad_sequence(tchs, batch_first=True).to(device)
model.eval()
outs = model(inps,tchs)
print(inps.size(), tchs.size(), outs.size()) # torch.Size([1024, 2]) torch.Size([1024, 9]) torch.Size([1024, 9, 155])


for inp, tch, out in zip(inps, tchs, outs):
    print(inp.size(), tch.size(), out.size())
    print(gakushu_tokenizer.decode(inp), mora_tokenizer.decode(tch), out.argmax(dim=1))
    print(mora_tokenizer.decode(out.argmax(dim=1).detach().cpu().numpy()))



In [None]:
inp = '戦争'
tch = 'センソウ'
inp_ids = torch.LongTensor(gakushu_tokenizer(inp)).unsqueeze(0)
tch_ids = torch.LongTensor([mora_tokenizer.tokens.index('<SOW>')]).unsqueeze(0)
model = tla_seq2seq0
inps = pad_sequence(inp_ids, batch_first=True).to(device)
tchs = pad_sequence(tch_ids, batch_first=True).to(device)
outs = model(inps, tchs)
print(outs.size(), outs.squeeze(0).argmax(), mora_tokenizer(tch))
print(mora_tokenizer.decode([outs.squeeze(0).argmax()]))

tch_ids = torch.LongTensor(mora_tokenizer('センソウ')).unsqueeze(0)
tch_ids = mora_tokenizer('センソウ')
tch_ids = [mora_tokenizer.tokens.index('<SOW>')]+tch_ids+[mora_tokenizer.tokens.index('<EOW>')]
tch_ids = torch.LongTensor(tch_ids).unsqueeze(0)
outs = model(inps, tchs)
print(tch_ids, outs.squeeze(0).argmax())

# # 正解のカウント
# out_ids = [out.argmax(dim=1) for out in outs]
# for tch, out in zip(tchs[:], out_ids[:]):
# yesno = ((tch==out) * 1).sum().cpu().numpy() == len(tch)
# count += 1 if yesno else 0



In [None]:
#train_ds.dataset.output_tokenizer
results[0]

In [None]:
model = tla_seq2seq0
model = model2
_ds = valid_ds
with torch.no_grad():
    for i in range(3):
    #for i in range(_ds.__len__()):

        idx = _ds.indices[i]
        inp, tgt = _ds.dataset.__getitem__(idx)
        inp_ids = pad_sequence(inp.unsqueeze(0), batch_first=True).to(device)
        tgt_ids = pad_sequence(tgt.unsqueeze(0), batch_first=True).to(device)

        enc_emb = model.encoder_emb(inp_ids)
        enc_out, (hnx, cnx) = model.encoder(enc_emb)

        dec_inp = torch.tensor([_ds.dataset.output_tokenizer.tokens.index('<SOW>')], device=device)
        dec_inp = model.decoder_emb(dec_inp).unsqueeze(0)
        dec_state = (hnx, cnx)

        print(f'inp:{inp.cpu().numpy()}',
              f'{"".join(c for c in _ds.dataset.ids2inp(inp.cpu().numpy()))}',
              f'{tgt.cpu().numpy()}',
              f'{"".join(c for c in _ds.dataset.target_ids2target(tgt.cpu().numpy()))}'
             )

        for _i in range(len(tgt_ids[0][1:])):
        #for i in range(len(tgt_ids[0])):

            dec_out, dec_state = model.decoder(dec_inp, dec_state)
            dec_out = dec_out.argmax().unsqueeze(0)
            dec_inp = model.decoder_emb(dec_out).unsqueeze(0)
            # print(f'_i:{_i}',
            #        f'dec_out:{dec_out.detach().cpu().numpy()}',
            #        f'tgt_ids[0][{_i+1}]:{tgt_ids[0][_i+1]}')

        out_ = model(inp_ids,tgt_ids)
        out_ids =  out_.squeeze(0).argmax(dim=1).detach().cpu().numpy()
        out_tokens = "".join(c for c in _ds.dataset.target_ids2target(out_ids))
        print(out_tokens)

        #inp, tgt = _ds.__getitem__(i)
        #inp_ids = inp

        #print(f'out_.squeeze(0).argmax(dim=1){out_.squeeze(0).argmax(dim=1).detach().cpu().numpy()}')


In [None]:
out_ids = out_.squeeze(0).argmax(dim=1).detach().cpu().numpy()
print(_ds.dataset.target_ids2target(out_ids), out_ids)
#print(psylex71_ds.target_ids2target(out_ids), out_ids)

out_tokens = "".join(c for c in _ds.dataset.target_ids2target(out_ids))
out_tokens
inp, tgt = _ds.__getitem__(2)
inp_ids = _ds.dataset.ids2inp(inp)
inp_ids

In [None]:
model = tla_seq2seq
with torch.no_grad():
    for i in range(_ds.__len__()):
        inp, tgt = chihaya_ds.__getitem__(i)
        # print(f'input:{"".join(c for c in chihaya_ds.ids2tkn(inp))}')
        # print(f'target:{"".join(c for c in chihaya_ds.ids2tkn(tgt))}')

        inp_ids = pad_sequence(inp.unsqueeze(0), batch_first=True).to(device)
        tgt_ids = pad_sequence(tgt.unsqueeze(0), batch_first=True).to(device)
        #inp_ids = torch.as_tensor(inp, device=device)
        #tgt_ids = torch.as_tensor(tgt, device=device)
        enc_out, enc_state = model.encoder(inp_ids)
        #dec_out, dec_state = decoder(tgt_ids, enc_state)

        dec_ids = dec_out.argmax(dim=1).detach().cpu().numpy()
        print(f'len(dec_ids):{len(dec_ids)}')
        print("".join(c for c in chihaya_ds.ids2tkn(dec_ids)))

        dec_inp = torch.tensor([chihaya_ds.chihaya_tokens.index('<SOS>')], device=device)
        dec_inp = tgt_ids[0].unsqueeze(0)
        for i in range(len(dec_ids)):
            dec_out, dec_state = decoder(dec_inp, dec_state)
            #print(dec_out.size(), dec_out.argmax().cpu().numpy(), type(dec_out.argmax())) #, dec_out)

            if teacher_forcing:
                dec_inp = tgt_ids[i].unsqueeze(0)
            else:
                dec_inp = dec_out.argmax().unsqueeze(0) # .clone().detach()

            #dec_inp = dec_out.argmax().unsqueeze(0).clone().detach()
            print(f'({dec_inp.cpu().numpy()}',
                  f'{chihaya_ds.chihaya_tokens[dec_inp.cpu().numpy()[0]]})', end=" ") # , type(dec_inp)) # , dec_inp)
            #print(dec_inp.size(), dec_inp.argmax().cpu().numpy(), type(dec_inp.argmax())) # , dec_inp)
            #sys.exit()
        sys.exit()


# Defining the Vec2Seq model

In [None]:
class Vec2Seq(nn.Module):
    def __init__(self,
                 sem_dim:int,
                 dec_vocab_size:int,
                 n_hid:int,
                 n_layers:int=2,
                 bidirectional:bool=False):
        super().__init__()

        # Linear map that feeds a word's semantic vector (a.k.a. embedding) into the decoder's hidden state.
        # Feeding it into the input layer is an alternative, left for a separate implementation.
        self.enc_transform_layer = nn.Linear(
            in_features=sem_dim,
            out_features=n_hid)
        self.decoder_emb = nn.Embedding(
            num_embeddings=dec_vocab_size,
            embedding_dim=n_hid,
            padding_idx=0)

        self.decoder = nn.LSTM(
            input_size=n_hid,
            hidden_size=n_hid,
            num_layers=n_layers,
            batch_first=True,
            bidirectional=bidirectional)

        # Final output layer
        self.n_layers = n_layers
        self.bi_fact = 2 if bidirectional else 1
        self.out_layer = nn.Linear(self.bi_fact * n_hid, dec_vocab_size)

    def forward(self, enc_inp, dec_inp):
        enc_emb = self.enc_transform_layer(enc_inp)   # (batch, n_hid)

        # nn.LSTM expects initial states of shape (num_layers * num_directions, batch, n_hid),
        # so the same projected vector is copied to every layer and direction
        n_states = self.n_layers * self.bi_fact
        hnx = enc_emb.unsqueeze(0).repeat(n_states, 1, 1)
        cnx = hnx.clone()

        dec_emb = self.decoder_emb(dec_inp)
        dec_out, (hny, cny) = self.decoder(dec_emb, (hnx, cnx))

        return self.out_layer(dec_out)

# Sanity check below
# ds = train_ds
# tla_vec2seq = Vec2Seq(
#     sem_dim=n_layers,
#     dec_vocab_size=len(mora_tokenizer.tokens),
#     n_hid=n_hid,
#     n_layers=n_layers,
#     bidirectional=bidirectional).to(device)
# print(tla_vec2seq.eval())

# Loading JALEX

In [None]:
import pandas as pd
try:
    import jaconv
except ImportError:
    !pip install jaconv --upgrade
    import jaconv
# Import the helpers that use MeCab to obtain readings (yomi)
from ccap.mecab_settings import wakati, yomi #, parser

jalex_base = os.path.join(HOME, 'study/2025_2014jalex')
jalex_xls_fname = 'JALEX.xlsx'
jalex_fname = os.path.join(jalex_base, jalex_xls_fname)
jalex_DF = pd.read_excel(jalex_fname)
jalex_DF
jalex_words = jalex_DF['目標語']
print(len(jalex_words))

jalex_dic = OrderedDict()
for wrd in jalex_words:
    if wrd not in jalex_dic:
        _yomi = yomi(wrd).strip()
        _wakati = wakati(wrd).strip()
        _mora = mora_tokenizer.wakachi(_yomi)
        _kunrei = kunrei_tokenizer.wakachi(_yomi)
        _hira_yomi = jaconv.kata2hira(_yomi)
        _julius = jaconv.hiragana2julius(_hira_yomi).split(' ')

        jalex_dic[wrd] = {'ヨミ':_yomi, 'モーラ':_mora, '訓令':_kunrei, 'ユリウス':_julius} # , 'Jalex':jalex_DF['wrd']}
        print(jalex_dic[wrd])
        break  # debug: stop after the first entry

In [None]:
# Exact match: jalex_DF[jalex_DF['目標語'] == 'あさって']
# Substring match (returns a boolean mask over all rows):
jalex_DF['目標語'].str.contains('あさって')
#print(dir(jaconv))
#help(jaconv.normalize)

## 0.2 Downloading psylex71.txt, the word-frequency data of the NTT Japanese lexical properties database (NTT 日本語語彙特性)

In [None]:
# if isColab:
#     !pip install googledrivedownloader==0.4
#     from google_drive_downloader import GoogleDriveDownloader as gdd
#     import os

#     # ID of the shared file
#     file_id = '1eBJDN392BsUckg5LBFbbw5KT9PCsmnxI' # psylex71utf8_.txt
#     # https://drive.google.com/file/d/1eBJDN392BsUckg5LBFbbw5KT9PCsmnxI/view?usp=drive_link

#     # Destination directory and file name
#     # e.g. save under /content/ as original_file_name.<extension>
#     destination_path = '/content/psylex71utf8_.txt' # set the file extension as appropriate
#     try:
#         print(f"Starting download (file ID: {file_id})...")
#         gdd.download_file_from_google_drive(file_id=file_id,
#                                             dest_path=destination_path)
#                                             # pass unzip=True if the file is a zip archive
#         print(f"Download complete; saved to '{destination_path}'.")

#         # Example: read the downloaded file (text file)
#         if os.path.exists(destination_path):
#             print("First lines of the downloaded file:")
#             with open(destination_path, 'r') as f:
#                 # Show the first 5 lines
#                 for i in range(5):
#                     line = f.readline()
#                     if not line:
#                         break
#                     print(line.strip())
#         else:
#             print(f"Error: downloaded file '{destination_path}' not found.")

#     except Exception as e:
#         print(f"Error while downloading the file: {e}")

In [None]:
import torch
class Psylex71_Dataset_Original(torch.utils.data.Dataset):
    '''PyTorch dataset class for training neural network models on Psylex71'''

    def __init__(self,
                 dic=Psylex71,
                 grph_list=grph_list,
                 phon_list=mora_list,
                 special_tokens=special_tokens,
                 maxlen_phon=maxlen_phon +2, # +2 accounts for the two special tokens <SOW> and <EOW>
                 device=device):
        super().__init__()
        self.dic = dic
        self.special_tokens = special_tokens
        self.maxlen_phon = maxlen_phon
        self.grph_list = grph_list
        self.phon_list = phon_list
        self.input_cands = grph_list
        # Use the phon_list argument (not the global mora_list) so the default can be overridden
        self.target_cands = special_tokens + phon_list
        self.inputs = [v['単語'] for v in dic.values()]
        # Targets are mora sequences; use v['ヨミ'] instead to train on raw readings
        self.targets = [v['モーラ'] for v in dic.values()]
        self.device = device

    def __len__(self):
        return len(self.dic)

    def __getitem__(self, idx):
        inp, tgt = self.inputs[idx], self.targets[idx]

        # Variant that also wraps the input in <SOW> and <EOW> tokens
        #inp = [self.input_cands.index('<SOW>')]  + [self.input_cands.index(x) for x in inp]  + [self.input_cands.index('<EOW>')]

        # Input without special tokens
        inp = [self.input_cands.index(x) for x in inp]

        # The target (teacher) signal is wrapped in <SOW> and <EOW>
        tgt = [self.target_cands.index('<SOW>')] + [self.target_cands.index(x) for x in tgt] + [self.target_cands.index('<EOW>')]

        while len(tgt) < self.maxlen_phon:
            tgt = tgt + [self.target_cands.index('<PAD>')]

        inp, tgt = torch.LongTensor(inp), torch.LongTensor(tgt)
        inp, tgt = inp.to(self.device), tgt.to(self.device)
        return inp, tgt

    def getitem(self, idx):
        #inp, tgt = self.inputs[idx], self.targets[idx]
        wrd = self.inputs[idx]
        phn = self.targets[idx]
        return wrd, phn

    def ids2argmax(self, ids):
        # .item() also works for tensors on GPU/MPS, where .numpy() raises an error
        out = np.array([torch.argmax(idx).item() for idx in ids], dtype=np.int32)
        return out

    def ids2tgt(self, ids):
        #out = [self.target_cands[torch.argmax(idx)] for idx in ids]
        # ids index target_cands directly (special tokens included), as in __getitem__
        out = [self.target_cands[idx] for idx in ids]
        return out

    def ids2inp(self, ids):
        out = [self.input_cands[idx] for idx in ids]
        #out = [self.input_cands[idx - len(self.special_tokens)] for idx in ids]
        return out

    def target_ids2target(self, ids:list):
        ret = []
        for idx in ids:
            if idx == self.target_cands.index('<EOW>'):
                return ret+['<EOW>']
            ret.append(self.target_cands[idx])
        return ret
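
The target encoding performed in `__getitem__` and undone in `target_ids2target` can be illustrated without PyTorch. The following is a minimal sketch under stated assumptions: the token inventory, the order of the special tokens, and the helper names `encode_target` and `decode_target` are hypothetical and chosen only for illustration; they are not the notebook's actual values. It uses a dictionary for lookups instead of `list.index`, which avoids a linear scan per token.

```python
# Hypothetical inventory: special tokens first, then a few morae (illustration only)
special_tokens = ['<PAD>', '<EOW>', '<SOW>', '<UNK>']
mora_list = ['ア', 'サ', 'ッ', 'テ']
target_cands = special_tokens + mora_list
tok2id = {tok: i for i, tok in enumerate(target_cands)}  # constant-time lookup


def encode_target(morae, maxlen):
    """Wrap morae in <SOW>/<EOW> and right-pad with <PAD> up to maxlen ids."""
    ids = [tok2id['<SOW>']] + [tok2id[m] for m in morae] + [tok2id['<EOW>']]
    ids += [tok2id['<PAD>']] * (maxlen - len(ids))
    return ids


def decode_target(ids):
    """Map ids back to tokens, stopping at (and including) the first <EOW>."""
    out = []
    for i in ids:
        out.append(target_cands[i])
        if target_cands[i] == '<EOW>':
            break
    return out


ids = encode_target(['ア', 'サ', 'ッ', 'テ'], maxlen=8)
print(ids)
print(decode_target(ids))
```

Placing `<PAD>` at index 0 in this sketch matches the `padding_idx=0` setting of the decoder embedding in `Vec2Seq`; whether the notebook's `special_tokens` list follows the same order is an assumption.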

## Perry+2007 Appendix

#### Appendix A. Complex Graphemes Used in the CDP+ Sublexical Network

The complex graphemes are identical to those implemented in the connectionist model of spelling of Houghton and Zorzi (2003).

* The onset consonants were as follows: ch, gh, gn, kn, ph, qu, sh, th, wh, and wr.
* The vowels were as follows: air, ai, ar, au, aw, ay, ear, eau, eir, eer, ea, ee, ei, er, eu, ew, ey, ier, ieu, iew, ie, ir, oar, oor, our, oa, oe, oi, oo, ou, or, ow, oy, uar, ua, ue, ui, ur, uy, ye, and yr.
* The coda consonants were as follows: ght, tch, que, ch, ck, dd, dg, ff, gh, gn, ll, mb, ng, ph, sh, ss, th, tt, and zz.
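
As a quick consistency check, the three lists above can be written down and counted. This is only a sketch; the variable names are chosen for illustration. Counting by position gives the 10 + 41 + 19 = 70 complex graphemes that, together with the 26 single letters, yield 96 grapheme nodes per slot.

```python
import string

# Complex graphemes as listed in Appendix A
ONSET = "ch gh gn kn ph qu sh th wh wr".split()
VOWEL = ("air ai ar au aw ay ear eau eir eer ea ee ei er eu ew ey ier ieu iew "
         "ie ir oar oor our oa oe oi oo ou or ow oy uar ua ue ui ur uy ye yr").split()
CODA = "ght tch que ch ck dd dg ff gh gn ll mb ng ph sh ss th tt zz".split()

n_complex = len(ONSET) + len(VOWEL) + len(CODA)
n_graphemes = len(string.ascii_lowercase) + n_complex
shared = sorted(set(ONSET) & set(CODA))  # spellings listed for both onset and coda

print(len(ONSET), len(VOWEL), len(CODA), n_complex, n_graphemes)
print("listed in both onset and coda:", shared)
```

Some spellings (for example ch and th) appear in both the onset and the coda list, so the total of 70 counts position-specific entries rather than unique letter strings.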

#### Appendix B. Parameters Used in the Model

##### Lexical route

* Features
    * Feature-to-letter excitation: 0.005
    * Feature-to-letter inhibition: 0.150
* Letters
    * Letter-to-letter inhibition: 0
    * Letter-to-orthography excitation: 0.075
    * Letter-to-orthography inhibition: 0.550
* Orthographic lexicon
    * Orthography-to-orthography inhibition: 0.06
    * Orthography-to-phonology excitation: 1.40
    * Orthography-to-letter excitation: 0.30
* Phonological lexicon
    * Phonology-to-phonology inhibition: 0.160
    * Phonology-to-phoneme excitation: 0.128
    * Phonology-to-phoneme inhibition: 0.010
    * Phonology-to-orthography excitation: 1.100
* Phonological output buffer
    * Phoneme-to-phoneme inhibition: 0.040
    * Phoneme-to-phonology excitation: 0.098
    * Phoneme-to-phonology inhibition: 0.060

##### Overall parameters

* Overall activation rate: 0.2
* Lexicon frequency scaling: >0.4 × log(word frequency)
* Phoneme naming activation criterion: 0.67
* Cycle-to-cycle stopping criterion: 0.0023
* Maximum number of cycles a word is run for before being timed out and considered an outlier: 250

##### Parameters used in the sublexical network

* Network to phonological output buffer activation: 0.085
* Number of cycles taken for each letter to be processed: 15
* Level of activation that a letter must be over before grapheme identification begins: 0.21
* Temperature ($\tau$) in the assembly network: 3
* Learning rate ($\epsilon$) in the assembly network: 0.05

### Appendix C

#### Activation and Learning Equations Used With the Sublexical Network

The sublexical spelling-to-sound network is identical to the two-layer assembly network of Zorzi et al. (1998b), except that instead of letter units, we have grapheme units. These include the complex (i.e., multiletter) graphemes listed in Appendix A in addition to all single letters.

#### Activation Function

For any given input pattern, the input units are clamped to a value of 1.0 or 0.0, according to the presence or absence of the grapheme they encode; the net input to each output unit is simply
$$
\text{net}_{i}=\sum_{j}w_{ij}a_{j},
$$
where $a_{j}$ is the activation value of the input unit $j$, and $w_{ij}$ is the weight of the connection linking the input unit $j$ to the output unit $i$. The activation of the output unit $i$ is determined by an S-shaped squashing function (sigmoid) of the net input, bounding phoneme activations in the range $[0,1]$ and with $f(0)\approx 0$ (i.e., no input and no output):
$$
o_{i}=\frac{1}{1+e^{-(\text{net}_{i}-1)\tau}},
$$
where $\tau$ is a temperature parameter determining the slope of the function ($\tau=3$ for all simulations). Note that the $-1$ in the exponent shifts the sigmoid to the right, such that $f(0)$ is very close to 0 (for $\tau=3$, $f(0)=1/(1+e^{3})\approx 0.047$), rather than the standard $f(0)=0.5$. As in Zorzi et al. (1998b), in the simulations reported here, values less than 0.05 are set to 0, so no input really does mean no output.
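
The activation function can be sketched in a few lines of plain Python. The function name `output_activation` and the constant names are illustrative assumptions; the values $\tau=3$ and the 0.05 cutoff come from the text above.

```python
import math

TAU = 3.0          # temperature (Appendix B)
THRESHOLD = 0.05   # activations below this value are set to 0


def output_activation(net, tau=TAU, threshold=THRESHOLD):
    """Shifted sigmoid 1 / (1 + exp(-(net - 1) * tau)), set to 0 below threshold."""
    o = 1.0 / (1.0 + math.exp(-(net - 1.0) * tau))
    return o if o >= threshold else 0.0


for net in (0.0, 0.5, 1.0, 2.0):
    raw = 1.0 / (1.0 + math.exp(-(net - 1.0) * TAU))
    print(f"net={net:.1f}  raw={raw:.3f}  after cutoff={output_activation(net):.3f}")
```

With a net input of 0 the raw sigmoid value falls just under the cutoff, so the unit's output is exactly 0; a net input of 1 gives 0.5.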

#### Learning Rule

The model was trained with the simple gradient descent technique known as the delta rule (Widrow & Hoff, 1960). For any input pattern, the error correction is made by changing the weights according to the difference between the activation of the output units and desired activation pattern. The desired output is just the correct pronunciation of the orthographic input (nodes that should be on have a target activation of 1, nodes that should be off have a target activation of 0). Formally,
$$
\Delta w_{ij}=\epsilon(t_{i}-o_{i})a_{j},
$$
where $\epsilon$ is the learning rate (0.05 in the simulations), $a_{j}$ is the activation of the $j$-th input unit, and $t_{i}$ and $o_{i}$ are the teaching input and the actual output of the $i$-th output unit, respectively (for further details, see Zorzi et al., 1998b, pp. 1136–1137).
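
The delta rule can be tried out on a toy problem. The sketch below is not the authors' training code: the 3-input, 2-output pattern, the number of steps, and the function names are assumptions made for illustration. Only the update rule, $\tau=3$, and $\epsilon=0.05$ are taken from the text.

```python
import math


def activation(net, tau=3.0):
    """Shifted sigmoid from Appendix C (without the 0.05 cutoff)."""
    return 1.0 / (1.0 + math.exp(-(net - 1.0) * tau))


def delta_rule_step(w, a, t, eps=0.05):
    """Apply w[i][j] += eps * (t[i] - o[i]) * a[j] once; return the outputs used."""
    o = [activation(sum(w_ij * a_j for w_ij, a_j in zip(row, a))) for row in w]
    for i, row in enumerate(w):
        for j in range(len(row)):
            row[j] += eps * (t[i] - o[i]) * a[j]
    return o


# Toy pattern (hypothetical): 3 input units, 2 output units
a = [1.0, 0.0, 1.0]                  # input units 0 and 2 are clamped on
t = [1.0, 0.0]                       # output 0 should be on, output 1 off
w = [[0.0] * 3 for _ in range(2)]    # weights start at zero

first = delta_rule_step(w, a, t)
for _ in range(199):
    last = delta_rule_step(w, a, t)

print("output before training:", [round(x, 3) for x in first])
print("output after 200 steps:", [round(x, 3) for x in last])
print("weights from the silent input unit:", [row[1] for row in w])
```

Because the update is multiplied by $a_{j}$, the weights leaving an input unit that is never active stay at zero, which is why a grapheme never seen in a given position cannot drive any phoneme.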

Zorzi, M., Houghton, G., & Butterworth, B. (1998b). Two routes or one in reading aloud? A connectionist dual-process model. Journal of Experimental Psychology: Human Perception and Performance, 24, 1131–1161.

### Appendix D

<img src="2007Perry_appendixD.svg" width="77%">

### from page 281

**Training corpus**. The training corpus was extracted from the English CELEX word form database (Baayen, Piepenbrock, & van Rijn, 1993), and it basically consists of all monosyllables with an orthographic frequency equal to or bigger than one. The database was also cleaned. For example, acronyms (e.g., mph), abbreviations, and proper names were removed. Note that we did not remove words with relatively strange spellings (e.g., isle). A number of errors in the database were also removed. This left 7,383 unique orthographic patterns and 6,663 unique phonological patterns.(footnote 2)

**Network training**. In previous simulation work (Hutzler et al., 2004), we have shown that adapting the training regimen to account for explicit teaching methods is important for simulating reading development. Explicit teaching of small-unit correspondences is an important step in early reading and can be simulated by pretraining a connectionist model on a set of grapheme–phoneme correspondences prior to the introduction of a word corpus.

The two-layer associative network was initially pretrained for 50 epochs on a set of 115 grapheme–phoneme correspondences selected because they are similar to those found in children’s phonics programs (see Hutzler et al., 2004, for further discussion). They consist of very common correspondences but are by no means all possible grapheme–phoneme relationships. The same correspondence (e.g., the correspondence $L\rightarrow/l/$) may be exposed in more than one position in the network where it commonly occurs. The list of correspondences used appears in Appendix D. Note that the total number differs from that of Hutzler et al. (2004) because of the different coding scheme (their simulations were based on the CDP model). Learning parameters were identical to those in Zorzi et al. (1998b; see Appendix C).

### from page 280

### Sublexical Route

**Input and output representation.** As justified earlier, we added an orthographic buffer to the sublexical route. The orthographic buffer was taken from the spelling model of Houghton&Zorzi (2003) and was implemented in the input coding scheme used with the two-layer associative network. Thus, single input nodes do not represent individual letters only, as in CDP, but also complex graphemes such as ck, th, and so forth. The set of complex graphemes specified by Houghton and Zorzi (see Appendix A) includes 10 onset graphemes, 41 vowel graphemes, and 19 coda graphemes. When letters combine to form one of these graphemes, the grapheme is activated instead of the letters (i.e., conjunctive coding). Note that the complex graphemes are basically the most frequent ones that occur in English (see Perry&Ziegler 2004 for a full analysis), although they are by no means the entire set that can be found.


The input representation is constructed by aligning graphemes to a graphosyllabic template (Caramazza&Miceli 1990, Houghton&Zorzi 2003) with onset, vowel, and coda constituents. There are three onset slots, one vowel slot, and four coda slots. Each grapheme is assigned to one input slot. If the first grapheme in a letter string is a consonant, it is assigned to the first onset slot, and the following consonant graphemes are assigned to the second and then to the third onset slots. Slots are left empty if there are no graphemes to be assigned. The vowel grapheme is assigned to the vowel slot. The grapheme following the vowel is assigned to the first coda slot, and subsequent graphemes (if any) fill the successive coda slots. Thus, for example, black would be coded as `b-l-*-a-ck-*-*-*`, where each asterisk represents a slot that is not activated by any grapheme. Similarly, a nonword like sloiched would be coded as `s-l-*-oi-ch-e-d-*`.
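
The alignment procedure can be sketched as a greedy longest-match parser. This is a simplified illustration, not the parser used in CDP+: the function names are hypothetical, only a, e, i, o, u are treated as vowel letters (so y counts as a consonant), and the input is assumed to be a lowercase monosyllable.

```python
ONSET = "ch gh gn kn ph qu sh th wh wr".split()
VOWEL = ("air ai ar au aw ay ear eau eir eer ea ee ei er eu ew ey ier ieu iew "
         "ie ir oar oor our oa oe oi oo ou or ow oy uar ua ue ui ur uy ye yr").split()
CODA = "ght tch que ch ck dd dg ff gh gn ll mb ng ph sh ss th tt zz".split()
VOWEL_LETTERS = set("aeiou")  # simplifying assumption: 'y' is treated as a consonant


def longest_match(word, pos, graphemes):
    """Return the longest complex grapheme starting at pos, else the single letter."""
    for size in (3, 2):
        if word[pos:pos + size] in graphemes:
            return word[pos:pos + size]
    return word[pos]


def pad(slots, n):
    """Fill unused slots with '*'."""
    return slots + ["*"] * (n - len(slots))


def encode(word, n_onset=3, n_coda=4):
    """Align a letter string to onset/vowel/coda slots ('*' marks an empty slot)."""
    pos, onset, coda = 0, [], []
    while pos < len(word) and word[pos] not in VOWEL_LETTERS:
        onset.append(longest_match(word, pos, ONSET))
        pos += len(onset[-1])
    vowel = "*"
    if pos < len(word):
        vowel = longest_match(word, pos, VOWEL)
        pos += len(vowel)
    while pos < len(word):
        coda.append(longest_match(word, pos, CODA))
        pos += len(coda[-1])
    if len(onset) > n_onset or len(coda) > n_coda:
        raise ValueError(f"{word!r} does not fit the template")
    return pad(onset, n_onset) + [vowel] + pad(coda, n_coda)


for w in ("black", "sloiched"):
    print(w, "->", "-".join(encode(w)))
```

Under these assumptions the sketch reproduces the two codings given in the text; words whose vowel spelling depends on y or on a silent final e would need a more careful parser.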

The phonological output of the network has a representational structure identical to that described in Zorzi+(1998b), except that instead of using three onset, one vowel, and three coda slots, it uses three onset, one vowel, and four coda slots. Thus, when training patterns are presented to the network, the output (phonology) is broken down in a way that respects an onset–vowel–coda distinction. The addition of a fourth coda slot was motivated by the existence of words with four coda phonemes in the database used to train the model. Thus, a word like prompts (/prɒmpts/) with four consonants in the coda can be handled by the model and would be coded as `p-r-*-ɒ-m-p-t-s`.
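
The effect of the fourth coda slot can be illustrated with a small helper. This is a sketch with a hypothetical function name, and it assumes the phonemes have already been segmented by hand into onset, vowel, and coda.

```python
def align_phonemes(onset, vowel, coda, n_onset=3, n_coda=4):
    """Place pre-segmented phonemes into onset/vowel/coda slots ('*' = empty slot)."""
    if len(onset) > n_onset or len(coda) > n_coda:
        raise ValueError("too many phonemes for the template")
    return (list(onset) + ["*"] * (n_onset - len(onset))
            + [vowel]
            + list(coda) + ["*"] * (n_coda - len(coda)))


# /prɒmpts/ segmented by hand into onset, vowel, and coda
slots = align_phonemes(["p", "r"], "ɒ", ["m", "p", "t", "s"])
print("-".join(slots))

# With only three coda slots, as in Zorzi+(1998b), the same word does not fit
try:
    align_phonemes(["p", "r"], "ɒ", ["m", "p", "t", "s"], n_coda=3)
    fits_three = True
except ValueError as err:
    fits_three = False
    print("3 coda slots:", err)
```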

**TLA network.** The sublexical network is a simple two-layer network, identical to the sublexical route of CDP apart from the number of input and output nodes. The input nodes encode the orthography of the word according to the grapheme buffer representation described earlier. Thus, graphemes are encoded over 8 slots in the input layer (3 onset slots + 1 vowel slot + 4 coda slots), where each slot consists of 96 grapheme nodes (26 single letters + 70 complex graphemes). The phonology of the word is encoded at the output layer of the network, which contains 43 phoneme nodes for each of the 8 available slots (3 onset slots + 1 vowel slot + 4 coda slots). This means that there are 768 input nodes and 344 output nodes (i.e., $8\times 96$ and $8\times 43$). Replicating an identical coding scheme across slots means that the entire set of orthographic (or phonological) units is potentially available at any position. However, this choice was made only for the sake of simplicity, and it has no practical consequences. Indeed, nodes that are never activated (like codas in onset positions, vowels in coda positions, etc.) are completely irrelevant for the network: That is, nodes that never get any input in training never cause any output (see Equation 2 of Zorzi+1998b, p. 1136), and because of the way the representation is constructed, these irrelevant nodes are never activated. Thus, performance of the network would be identical if we had used a coding scheme based on a slot-specific set of orthographic (or phonological) units divided into three main sections (onset, vowel, coda).
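
The size of the input layer follows directly from the slot scheme and can be checked with a small encoder. In this sketch the ordering of the 96 nodes within a slot (letters first, then the onset, vowel, and coda graphemes of Appendix A) and the function name `encode_input` are assumptions made for illustration; the paper specifies only the counts.

```python
import string

ONSET = "ch gh gn kn ph qu sh th wh wr".split()
VOWEL = ("air ai ar au aw ay ear eau eir eer ea ee ei er eu ew ey ier ieu iew "
         "ie ir oar oor our oa oe oi oo ou or ow oy uar ua ue ui ur uy ye yr").split()
CODA = "ght tch que ch ck dd dg ff gh gn ll mb ng ph sh ss th tt zz".split()

# Assumed node order within each slot: 26 letters, then position-specific graphemes
INVENTORY = ([("letter", c) for c in string.ascii_lowercase]
             + [("onset", g) for g in ONSET]
             + [("vowel", g) for g in VOWEL]
             + [("coda", g) for g in CODA])
NODE_ID = {key: i for i, key in enumerate(INVENTORY)}
SLOT_KINDS = ["onset"] * 3 + ["vowel"] + ["coda"] * 4


def encode_input(slots):
    """Turn 8 template slots (e.g. b-l-*-a-ck-*-*-*) into a 0/1 input vector."""
    vec = [0.0] * (len(SLOT_KINDS) * len(INVENTORY))
    for s, (kind, g) in enumerate(zip(SLOT_KINDS, slots)):
        if g == "*":
            continue  # empty slot: no node is activated
        key = ("letter", g) if len(g) == 1 else (kind, g)
        vec[s * len(INVENTORY) + NODE_ID[key]] = 1.0
    return vec


x = encode_input("b-l-*-a-ck-*-*-*".split("-"))
active = [i for i, v in enumerate(x) if v]
print("input nodes:", len(x), " active:", active)
print("output nodes:", len(SLOT_KINDS) * 43)
```

Each filled slot switches on exactly one node, so a word activates at most 8 of the 768 input nodes.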

Note that an onset–vowel–coda distinction was also used by Plaut+(1996). Their orthographic and phonological representations, however, did not use slots to encode the order of grapheme or phoneme units. Their solution was to arrange the units within each set (onset, vowel, or coda) with an order that allows only orthotactically/phonotactically legal sequences to occur. In the case of consonant clusters that have two possible orderings (e.g., /ts/ vs. /st/), an additional node was activated to disambiguate between them. Another difference between CDP+ and the triangle model is that a multiletter grapheme is coded only by the activation of the grapheme unit with CDP+, whereas in Plaut et al., both the grapheme and the individual letters were activated.

The breakdown of input and output nodes into slots does not imply a built-in network structure (e.g., a specific pattern of connectivity); it only reflects the representational scheme used to activate the nodes. Accordingly, input and output nodes are fully connected, with no hidden units between them. Thus, any given input node has the potential to activate any given output. Activation of output nodes is calculated on the basis of the activation of input nodes in a manner identical to the one used by Zorzi+(1998b), and indeed the same parameters are used (see Appendix B). The equations used and how they work are described in full detail by Zorzi et al. (pp. 1136–1137) and in Appendix C.

It is worth noting that learning of a given input–output relationship, in any connectionist model, strictly depends on its existence in the training corpus and on how the inputs and outputs are coded. For example, our network cannot learn to generalize relationships from parts of words such as onset consonants to parts of words such as coda positions. Thus, a grapheme does not map to any phoneme if it is never activated in a specific position during learning. Though this is rather uncommon, one example is the consonant j in the coda position. Accordingly, a nonword like jinje cannot be correctly named by the model. However, the fact that the letter j never occurs after the letter n in English orthographic codas suggests that the word might be treated as disyllabic at an orthographic level (i.e., jin-je; see Taft, 1979, for a discussion of orthographic syllable boundaries). However, regardless of the issue of orthographic syllables, a model learning multisyllabic words would not have any difficulties with jinje because the –nj pattern does occur in words such as injure and banjo (see Plaut+1996 for a similar argument).
