<a href="https://colab.research.google.com/github/Tyanakai/retweet_analysis/blob/main/retweet_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

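# 0. Introduction
This notebook trains a deep learning model that takes the text of a tweet as input and predicts how many times it will be retweeted. The model is BERT. The prediction target is not the raw retweet count but the retweet count divided by the follower count of the account that posted the tweet. See README.md for the reasoning behind this choice.<br>
The notebook is intended to be run on Google Colaboratory.<br>
Before running it, execute twitter_api.py to create tw_data.csv and save it in the project folder on Google Drive.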

# 1. Setup
Install and import the libraries, and configure training.

In [None]:
!pip install -q transformers
!pip install -q ipadic
!pip install -q fugashi


In [None]:
import os

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
import transformers

In [None]:
BASE = "/content/drive/MyDrive/recruit/portfolio/retweet_analysis/"

class Config:
    file_path = os.path.join(BASE, "tw_data.csv")
    model_path = os.path.join(BASE, "model.h5")
    model = "cl-tohoku/bert-base-japanese"
    epochs = 2
    steps_per_epochs = None
    max_length = 64
    train_batch_size = 8
    valid_batch_size = 8
    test_batch_size = 8
    from_pt = False

# 2. Functions
Define the functions used for training.

In [None]:
def prepare_data(file_path):
    df = pd.read_csv(file_path)
    df["target"] = df["retweet"] / (df["followers"] + 1)    # see README.md
    df["target"] = df["target"].fillna(0)

    return df
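
The target computed in `prepare_data` can be checked by hand on a few rows. The following is a minimal sketch in plain Python with made-up numbers; the helper name `normalized_target` is hypothetical. The `+ 1` in the denominator presumably guards against division by zero for accounts with no followers, but README.md is the authoritative explanation.

```python
def normalized_target(retweets, followers):
    """Retweets per follower, with +1 in the denominator as in prepare_data."""
    return retweets / (followers + 1)


# Hypothetical rows: (retweets, followers). These are not taken from tw_data.csv.
rows = [(50, 999), (50, 99_999), (3, 0)]
targets = [normalized_target(r, f) for r, f in rows]
print(targets)
```

The first two rows have the same retweet count, but the account with fewer followers gets a target 100 times larger, which is the effect the normalization is meant to capture.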

In [None]:
def tokenize_texts(texts, tokenizer):
    """
    Tokenize a batch of strings and return a dict
    with the keys "input_ids" and "attention_mask".
    """
    tokenized_dict = tokenizer.batch_encode_plus(
        texts,
        padding='max_length',
        truncation=True,
        max_length=Config.max_length,
        return_token_type_ids=False,
    )
    return dict(tokenized_dict)
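
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: the function `pad_and_truncate`, the token ids, and the padding id of 0 are all assumptions for illustration, and the sketch ignores details such as keeping the special tokens in place when truncating.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids, not produced by a real tokenizer.
short_ids, short_mask = pad_and_truncate([2, 101, 102, 3], max_length=6)
long_ids, long_mask = pad_and_truncate(list(range(10, 20)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up with exactly `max_length` entries, which is why the Keras inputs can be declared with the fixed shape `(Config.max_length,)`.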

In [None]:
def build_model():
    """
    Define the complete Keras model used for training.
    """
    # encoder
    encoder = (
        transformers
        .TFAutoModel
        .from_pretrained(Config.model, from_pt=Config.from_pt)
        )

    # inputs
    input_ids = tf.keras.layers.Input(shape=(Config.max_length, ), 
                                           dtype=tf.int32, 
                                           name='input_ids')
    attention_mask = tf.keras.layers.Input(shape=(Config.max_length, ),
                                           dtype=tf.int32, 
                                           name='attention_mask')
    
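    # overall network structure
    x = encoder(input_ids=input_ids, 
                attention_mask=attention_mask, 
                output_hidden_states=True)
    # x[0] is the last hidden state (batch, seq_len, hidden);
    # take the vector at position 0, i.e. the [CLS] token
    x = x[0][:, 0, :]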
    # x = tf.keras.layers.Dropout(0.2)(x)
    output = tf.keras.layers.Dense(1)(x)

    # wrap the graph as a Keras model
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask],
                                  outputs=[output])

    # optimizer and loss function
    model.compile(optimizer=tf.keras.optimizers.RMSprop(0.001), loss="mse")
    model.summary()
    return model   
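
The head of `build_model` takes the hidden vector at sequence position 0 and passes it through a single linear unit. The sketch below mimics those two steps with nested Python lists instead of tensors. The function names, the toy values, and the weights are made up for illustration; the real model learns its weights during training.

```python
def cls_vectors(last_hidden_state):
    """Pick the vector at sequence position 0 for every example.

    last_hidden_state is a nested list shaped (batch, seq_len, hidden).
    """
    return [sequence[0] for sequence in last_hidden_state]


def linear_head(vectors, weights, bias):
    """One output per example: dot(vector, weights) + bias, like Dense(1)."""
    return [sum(v * w for v, w in zip(vec, weights)) + bias for vec in vectors]


# Toy tensor: batch of 2, sequence length 3, hidden size 2 (made-up values).
hidden = [
    [[1.0, 2.0], [9.0, 9.0], [9.0, 9.0]],
    [[3.0, 4.0], [9.0, 9.0], [9.0, 9.0]],
]
pooled = cls_vectors(hidden)
outputs = linear_head(pooled, weights=[0.5, -1.0], bias=0.25)
print(pooled, outputs)
```

Only the first position of each sequence influences the output here, which corresponds to the slice `x[0][:, 0, :]` in the model.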

In [None]:
def get_dataset(x, y=None, dataset="test"):
    """
    Convert the data into a tf.data.Dataset.
    """
    if dataset=="train":
        tr_ds = tf.data.Dataset.from_tensor_slices((x, y))
        if Config.steps_per_epochs is not None:
            tr_ds = tr_ds.repeat()
        tr_ds = tr_ds.shuffle(128)
        tr_ds = tr_ds.batch(Config.train_batch_size)
        tr_ds = tr_ds.prefetch(tf.data.experimental.AUTOTUNE)
        return tr_ds

    elif dataset=="valid":
        val_ds = tf.data.Dataset.from_tensor_slices((x, y))
        val_ds = val_ds.batch(Config.valid_batch_size)
        val_ds = val_ds.prefetch(tf.data.experimental.AUTOTUNE)
        return val_ds
    
    elif dataset=="test":
        test_ds = tf.data.Dataset.from_tensor_slices(x)
        test_ds = test_ds.batch(Config.test_batch_size)
        test_ds = test_ds.prefetch(tf.data.experimental.AUTOTUNE)
        return test_ds
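
When `Config.steps_per_epochs` is `None`, one epoch is one pass over the batched dataset, so the number of steps follows from the sample count and the batch size. The sketch below works this out in plain Python; the helper names are hypothetical, and it assumes the last, smaller batch is kept, which is the default behaviour of `Dataset.batch`.

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_samples / batch_size)


def batch_slices(n_samples, batch_size):
    """(start, stop) index pairs covering range(n_samples) in order."""
    return [(start, min(start + batch_size, n_samples))
            for start in range(0, n_samples, batch_size)]


print(batches_per_epoch(160, 8))  # 160 training rows, batch size 8
print(batch_slices(20, 8))        # 20 validation rows, batch size 8
```

With the batch size of 8 from `Config`, 160 training rows give 20 steps per epoch, and 20 validation rows give two full batches plus one batch of 4.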

In [None]:
def train_fn(train_df, valid_df, model, tokenizer, filepath):
    """
    Training function.
    """
    # prepare the tf.data.Dataset objects
    tr_text = tokenize_texts(texts=train_df["text"].tolist(), 
                             tokenizer=tokenizer)
    val_text = tokenize_texts(texts=valid_df["text"].tolist(),
                              tokenizer=tokenizer)

    tr_ds = get_dataset(x=tr_text, 
                        y=train_df["target"].values, 
                        dataset="train")
    val_ds = get_dataset(x=val_text, 
                         y=valid_df["target"].values, 
                         dataset="valid")

    # callbacks
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        filepath,  
        verbose=1, 
        save_best_only=True, 
        save_weights_only=True,
        )
    
    
    # run training
    history = model.fit(
        tr_ds, 
        epochs=Config.epochs, 
        verbose=1, 
        callbacks=[checkpoint],
        validation_data=val_ds, 
        steps_per_epoch=Config.steps_per_epochs,
        )
    
    return history


def predict_fn(test_df, model, tokenizer, filepath):
    """
    Inference function.
    """
    model.load_weights(filepath)
    te_text = tokenize_texts(texts=test_df["text"].tolist(), 
                             tokenizer=tokenizer)
    te_ds = get_dataset(x=te_text, y=None, dataset="test")
    preds = model.predict(te_ds)
    return preds
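
The checkpoint callback in `train_fn` uses `save_best_only=True`, so weights are written only when the monitored value (by default `val_loss`) improves on the best value seen so far. `predict_fn` then reloads those best weights rather than the weights from the final epoch. The following plain-Python sketch shows the bookkeeping; the function name and the loss values are hypothetical.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint would write."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than the best so far
            best = loss
            saved.append(epoch)
    return saved


# Made-up validation losses for four epochs.
print(epochs_that_save([0.9, 0.8, 0.85, 0.7]))
```

If the validation loss gets worse after the first epoch, as in a run where only epoch 1 improves, the file on disk still holds the epoch 1 weights.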

# 3. Execution
Define the procedure and run it.

In [None]:
def main():
    # prepare the model and tokenizer
    model = build_model()
    tokenizer = transformers.AutoTokenizer.from_pretrained(Config.model)

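    # prepare the data
    df = prepare_data(Config.file_path)

    # Split the data.
    # This is a simple fixed split for demonstration; performance should
    # properly be evaluated with cross validation.
    # .copy() makes each split an independent DataFrame, so the target
    # assignments below do not trigger pandas' SettingWithCopyWarning.
    train_df = df.iloc[:160].copy()
    valid_df = df.iloc[160:180].copy()
    test_df = df.iloc[180:].copy()

    # Preprocessing: standardize the target with statistics from the training split only.
    scaler = StandardScaler()
    train_df["target"] = scaler.fit_transform(train_df.target.values.reshape(-1, 1)).flatten()
    valid_df["target"] = scaler.transform(valid_df.target.values.reshape(-1, 1)).flatten()
    test_df["target"] = scaler.transform(test_df.target.values.reshape(-1, 1)).flatten()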

    history = train_fn(train_df, valid_df, model, tokenizer, Config.model_path)
    preds = predict_fn(test_df, model, tokenizer, Config.model_path)

    print(f"MSE : {mean_squared_error(test_df.target.values, preds)}")
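
Because the target is standardized before training, the MSE printed by `main()` is in standardized units, not in retweets per follower. The sketch below reproduces the scaling in plain Python to show how the two relate. All values and helper names are made up; like `StandardScaler`, it uses the population standard deviation of the training targets.

```python
import statistics


def fit_scaler(values):
    """Mean and population standard deviation of the training targets."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


def inverse_transform(values, mean, std):
    return [v * std + mean for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


train_targets = [0.0, 0.5, 1.0, 1.5]  # made-up values
test_targets = [2.0, 0.5]             # made-up values
scaled_preds = [1.0, 0.0]             # made-up predictions, standardized units

mean, std = fit_scaler(train_targets)           # statistics from train only
scaled_test = transform(test_targets, mean, std)
restored = inverse_transform(scaled_test, mean, std)

mse_scaled = mse(scaled_test, scaled_preds)
mse_original = mse(test_targets, inverse_transform(scaled_preds, mean, std))
print(mse_scaled, mse_original)
```

The error in original units equals the standardized error multiplied by the squared standard deviation of the training targets, so applying `scaler.inverse_transform` to the predictions would be one way to report the error on the original scale.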

In [None]:
main()


Some layers from the model checkpoint at cl-tohoku/bert-base-japanese were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at cl-tohoku/bert-base-japanese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 64)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 64)]         0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  110617344   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 64,                                            




Epoch 1/2
Epoch 1: val_loss improved from inf to 0.71486, saving model to /content/drive/MyDrive/recruit/portfolio/retweet_analysis/model.h5
Epoch 2/2
Epoch 2: val_loss did not improve from 0.71486
MSE : 0.7137475082378183
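
The comment in `main()` notes that a fixed split is used only for demonstration and that cross validation would be the proper way to evaluate the model. As a minimal sketch of what that involves, the hypothetical helper below generates contiguous, unshuffled k-fold index lists in plain Python. In practice `sklearn.model_selection.KFold` provides this, and each fold would need a freshly built model and a scaler fitted on that fold's training rows.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, valid_idx in folds:
    print(len(train_idx), valid_idx)
```

Each row appears in exactly one validation fold, so averaging the per-fold MSE gives an estimate that does not depend on a single choice of held-out rows.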
