In [1]:
import pandas as pd
import numpy as np

In [2]:
train = pd.read_csv('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/train.csv')
test = pd.read_csv('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/test.csv')
target=train['sentiment']

In [3]:
print('There are {} rows and {} cols in train set'.format(train.shape[0],train.shape[1]))
print('There are {} rows and {} cols in test set'.format(test.shape[0],test.shape[1]))

There are 27486 rows and 4 cols in train set
There are 3535 rows and 3 cols in test set


## <font size='4' color='blue'> Fast BERT-lstm model</font><a id='5'></a>

Here we try to build a multi-input model that uses BERT embeddings to predict the target key phrase. This is a naive approach that will be improved later.

In [4]:
import transformers
from tokenizers import BertWordPieceTokenizer
import gc
import os

In [5]:
cols=['textID','text','sentiment','selected_text']
train_df=train[cols].copy()
del train
test_df=test.copy()
del test
gc.collect()

22

In [6]:
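train  # `train` was deleted in the previous cell, so this raises NameError; the data now lives in train_df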

NameError: name 'train' is not defined

- The function below is from this [kernel](https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras) by @xhlulu. It encodes sentences quickly, in chunks, using the DistilBERT tokenizer.

In [7]:
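from tqdm.notebook import tqdm  # progress bar used inside fast_encode

def fast_encode(texts, tokenizer, chunk_size=256, maxlen=128):
    """Encode a pandas Series of strings into a (len(texts), maxlen) id array."""
    # Every encoding is cut or padded to exactly maxlen ids.
    tokenizer.enable_truncation(max_length=maxlen)
    # Note: newer releases of `tokenizers` name this argument `length`.
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []

    # Encode chunk_size texts at a time.
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)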

- We load the pretrained DistilBERT tokenizer (uncased) and save it to a directory.
- We then reload its vocabulary with `BertWordPieceTokenizer`.
- A tokenizer implementation consists of the following pipeline of processes, each applying a different transformation to the text:
![](https://miro.medium.com/max/1400/1*7uy9X3eE1rVmqV08yKrDgg.png)
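
To make the model step of that pipeline concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece applies to each word. The helper `wordpiece` and the vocabulary `toy_vocab` are hypothetical and exist only for illustration; the real tokenizer reads its vocabulary from `vocab.txt`, also normalizes the text, and adds the `[CLS]` and `[SEP]` tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercased word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real DistilBERT vocab.txt.
toy_vocab = {"ha", "##ha", "better", "don", "##t"}
tokens = [p for w in "haha better dont".split() for p in wordpiece(w, toy_vocab)]
print(tokens)
```

Pieces that start with `##` continue the previous piece of the same word, which is why a single word can map to several tokens.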

In [8]:
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Save the loaded tokenizer locally
save_path = '/home/tidal/ML_Data/Tweet_Sentiment_Extraction/distilbert_base_uncased/'
os.makedirs(save_path, exist_ok=True)
tokenizer.save_pretrained(save_path)

# Reload the saved vocabulary with the Hugging Face tokenizers library
fast_tokenizer = BertWordPieceTokenizer(os.path.join(save_path, 'vocab.txt'), lowercase=True)
fast_tokenizer

Tokenizer(vocabulary_size=30522, model=BertWordPiece, add_special_tokens=True, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True, strip_accents=True, lowercase=True, wordpieces_prefix=##)

- The tweet text is now encoded with this tokenizer.
- We set `maxlen=128` as the limit on sequence length: shorter sequences are padded and longer ones are truncated.
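
As a rough illustration of what a fixed `maxlen` does to each sequence of ids, the sketch below pads short inputs and cuts long ones. The helper `pad_or_truncate` is hypothetical and is not part of the `tokenizers` library; the library handles special tokens during truncation itself, which this sketch ignores.

```python
def pad_or_truncate(ids, maxlen=128, pad_id=0):
    """Return exactly `maxlen` ids: cut long inputs, right-pad short ones."""
    ids = list(ids)[:maxlen]
    return ids + [pad_id] * (maxlen - len(ids))

# 101 and 102 are the [CLS] and [SEP] ids in the BERT uncased vocabulary;
# the ids in between are placeholders for this illustration.
short = pad_or_truncate([101, 2985, 1996, 102], maxlen=8)
long_ = pad_or_truncate(range(200), maxlen=8)
print(short)
print(long_)
```

Because every row ends up with the same length, the encoded rows can be stacked into a single rectangular array.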

In [10]:
from tqdm.notebook import tqdm
x_train = fast_encode(train_df.text.astype(str), fast_tokenizer, maxlen=128)
x_test = fast_encode(test_df.text.astype(str),fast_tokenizer,maxlen=128)

In [11]:
x_train

array([[  101,  2985,  1996, ...,     0,     0,     0],
       [  101,  2821,   999, ...,     0,     0,     0],
       [  101,  2758,  2204, ...,     0,     0,     0],
       ...,
       [  101,  2652, 19219, ...,     0,     0,     0],
       [  101,  2156,  1057, ...,     0,     0,     0],
       [  101,  5292,  5292, ...,     0,     0,     0]])

- Now we load the pretrained DistilBERT (uncased) transformer layer.
- It is used to create token representations and to train on our corpus.

In [12]:
transformer_layer = transformers.TFDistilBertModel.from_pretrained('distilbert-base-uncased')

This code is taken from this [kernel](https://www.kaggle.com/gskdhiman/bert-baseline-starter-kernel#Training).
- In this section we create the representation of the selected text within the tweet text.
- The representation marks the positions of the tokens selected from the text with 1 and all other positions with 0.
- For example, consider the tweet `"I have a cute dog"` and the selected text `"cute dog"`.
- This produces the output `[0,0,0,1,1]`.
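
One way to sketch this labelling is to search for the selected tokens as a contiguous slice of the text tokens. The helper `span_mask` is hypothetical and is not the function used in this notebook; returning an all-zero mask when the selected tokens are not found is an assumption about how to handle that case.

```python
def span_mask(tokens, selected):
    """Mark with 1 the first contiguous occurrence of `selected` in `tokens`."""
    n, m = len(tokens), len(selected)
    for offset in range(n - m + 1):
        if tokens[offset:offset + m] == selected:
            return [0] * offset + [1] * m + [0] * (n - offset - m)
    return [0] * n  # no exact match: fall back to an all-zero mask

# Whitespace splitting stands in for the real tokenizer here.
text = "i have a cute dog".split()
selected = "cute dog".split()
mask = span_mask(text, selected)
print(mask)
```

Comparing slices directly keeps the mask the same length as the token list in every case, including when no match exists.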

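__Some entries in t_text look odd (they contain "##").__ The `##` prefix is how WordPiece marks a piece that continues the previous token, so these entries are expected.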

In [13]:
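def create_targets(df):
    # Uses the global (slow) DistilBertTokenizer loaded above.
    df['t_text'] = df['text'].apply(lambda x: tokenizer.tokenize(str(x)))
    df['t_selected_text'] = df['selected_text'].apply(lambda x: tokenizer.tokenize(str(x)))
    def func(row):
        x,y = row['t_text'],row['t_selected_text'][:]
        for offset in range(len(x)):
            # Pair text tokens from this offset with the selected tokens.
            # Caveat: dict() collapses repeated tokens, so the check is approximate.
            d = dict(zip(x[offset:],y))
            # when every k == v we have found the offset
            check = [k==v for k,v in d.items()]
            if all(check):
                break
        # Caveat: if no offset matches, the last offset is used and the
        # result can differ in length from x.
        return [0]*offset + [1]*len(y) + [0]* (len(x)-offset-len(y))
    df['targets'] = df.apply(func,axis=1)
    return df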

train_df = create_targets(train_df)

print('MAX_SEQ_LENGTH_TEXT', max(train_df['t_text'].apply(len)))
print('MAX_TARGET_LENGTH',max(train_df['targets'].apply(len)))
MAX_TARGET_LEN=108

MAX_SEQ_LENGTH_TEXT 108
MAX_TARGET_LENGTH 108


In [14]:
train_df

| | textID | text | sentiment | selected_text | t_text | t_selected_text | targets |
|---|---|---|---|---|---|---|---|
| 0 | a3d0a7d5ad | Spent the entire morning in a meeting w/ a ven... | neutral | my boss was not happy w/ them. Lots of fun. | [spent, the, entire, morning, in, a, meeting, ... | [my, boss, was, not, happy, w, /, them, ., lot... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ... |
| 1 | 251b6a6766 | Oh! Good idea about putting them on ice cream | positive | Good | [oh, !, good, idea, about, putting, them, on, ... | [good] | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] |
| 2 | c9e8d1ef1c | says good (or should i say bad?) afternoon! h... | neutral | says good (or should i say bad?) afternoon! | [says, good, (, or, should, i, say, bad, ?, ),... | [says, good, (, or, should, i, say, bad, ?, ),... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ... |
| 3 | f14f087215 | i dont think you can vote anymore! i tried | negative | i dont think you can vote anymore! | [i, don, ##t, think, you, can, vote, anymore, ... | [i, don, ##t, think, you, can, vote, anymore, !] | [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0] |
| 4 | bf7473b12d | haha better drunken tweeting you mean? | positive | better | [ha, ##ha, better, drunken, t, ##wee, ##ting, ... | [better] | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 27481 | 3dbae74fcd | I want to go to VP, but no one is willing to c... | neutral | I want to go to VP, but no one is willing to c... | [i, want, to, go, to, vp, ,, but, no, one, is,... | [i, want, to, go, to, vp, ,, but, no, one, is,... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 27482 | 63147b35cb | Wah, why are you sad? | neutral | Wah, why are you sad? | [wah, ,, why, are, you, sad, ?] | [wah, ,, why, are, you, sad, ?] | [1, 1, 1, 1, 1, 1, 1] |
| 27483 | bdb196a09f | playing sudoku while mommy makes me breakfast ... | neutral | playing sudoku while mommy makes me breakfast ... | [playing, sud, ##oku, while, mommy, makes, me,... | [playing, sud, ##oku, while, mommy, makes, me,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ... |
| 27484 | 18c2a1e98e | see u bye see u! i love the hot30 | positive | i love | [see, u, bye, see, u, !, i, love, the, hot, ##30] | [i, love] | [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0] |


- Each output must have the same length before it can be fed to the neural network.
- To do that, we take the maximum target length and pad every other target to that length with zeros.

In [15]:
train_df['targets'] = train_df['targets'].apply(lambda x :x + [0] * (MAX_TARGET_LEN-len(x)))
targets=np.asarray(train_df['targets'].values.tolist())

In [16]:
print(train_df['targets'][27485])

[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


- We need to use the sentiment as a feature, so we encode it with `LabelEncoder`.

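__The way the label is produced is worth reconsidering (sentiment expresses good versus bad, so it could probably be represented more meaningfully).__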
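
As a small sketch of the two options, the snippet below reproduces the integer mapping that `LabelEncoder` assigns (it numbers the sorted unique classes) next to a signed polarity scale. The `polarity` mapping is a hypothetical alternative, not something this notebook uses, and the `labels` list is made up for the example.

```python
labels = ["neutral", "positive", "neutral", "negative"]

# LabelEncoder numbers the sorted unique classes, so this mirrors its mapping.
alphabetical = {name: i for i, name in enumerate(sorted(set(labels)))}

# Hypothetical alternative: a signed scale that keeps the polarity ordering.
polarity = {"negative": -1, "neutral": 0, "positive": 1}

print([alphabetical[s] for s in labels])
print([polarity[s] for s in labels])
```

For these three class names the alphabetical order happens to match the polarity order (negative, neutral, positive), so the signed scale only re-centres the values around neutral.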

In [18]:
from sklearn.preprocessing import LabelEncoder
lb=LabelEncoder()
sent_train=lb.fit_transform(train_df['sentiment'])
# Reuse the mapping fitted on the training set instead of refitting on test
sent_test=lb.transform(test_df['sentiment'])

In [19]:
sent_train

array([1, 2, 1, ..., 1, 2, 2])

- This is a multi-input model (tweet text + sentiment label).
- I have built a simple LSTM model.
- Both inputs are concatenated.

In [20]:
embedding_matrix=transformer_layer.weights[0].numpy()
embedding_matrix

array([[-0.01664949, -0.06661227, -0.01632868, ..., -0.01999032,
        -0.05139988, -0.0263568 ],
       [-0.01319846, -0.06733431, -0.01605646, ..., -0.0226614 ,
        -0.05537301, -0.02600443],
       [-0.01759106, -0.07094341, -0.01443494, ..., -0.02457913,
        -0.05956192, -0.0231829 ],
       ...,
       [-0.0231029 , -0.05878259, -0.01048967, ..., -0.01945743,
        -0.02615411, -0.02118432],
       [-0.0490171 , -0.05614787, -0.00465348, ..., -0.01065376,
        -0.01797333, -0.02187675],
       [-0.00646111, -0.0914881 , -0.00254872, ..., -0.01505679,
        -0.05040044,  0.04597744]], dtype=float32)

In [21]:
np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/features/x_train', x_train)
np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/features/x_test', x_test)
np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/features/sent_train', sent_train)
np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/features/sent_test', sent_test)
np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/targets/targets', targets)
np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/transformer_layer/embedding_matrix', embedding_matrix)