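Student ID: 110302005  Name: 江政達
Competition: Natural Language Processing with Disaster Tweets
Competition overview: Classify whether a tweet describes a real disaster.
Why I chose this competition: There is too much information online, and keyword search alone often fails to surface what is needed, so I wanted to use this competition to try judging from text analysis whether a reported situation is real.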


Import modules

In [2]:
import pandas as pd
import numpy as np
import os

Load the data

In [3]:
test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")

Inspect the data

In [4]:
test_df.head()
test_df.info()
train_df.head()
train_df.info()
test_df.isnull().sum()
train_df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

Drop the columns with too many missing values

In [5]:
test_df.drop(["keyword","location"],axis = 1, inplace = True)
train_df.drop(["keyword","location"],axis = 1, inplace = True)
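
The missing-value counts printed above can be turned into per-column fractions to see how much of each dropped column was actually missing. A minimal sketch using the training-set counts from that output (the dictionary and variable names are illustrative, not part of the notebook):

```python
# Missing-value counts for the training set, copied from the isnull().sum() output.
n_rows = 7613
missing_counts = {"id": 0, "keyword": 61, "location": 2533, "text": 0, "target": 0}

# Fraction of rows in which each column is missing.
missing_ratio = {col: count / n_rows for col, count in missing_counts.items()}

for col, ratio in missing_ratio.items():
    print(f"{col:<10}{ratio:7.2%}")
```

By this measure location is missing in roughly a third of the rows, while keyword is missing in under 1%, so keeping keyword and filling its gaps would also be a reasonable option.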

Data cleaning

In [6]:
import text_hammer as th
from tqdm.notebook import tqdm

# Register progress_apply on pandas objects (used below).
tqdm.pandas()


def text_preprocessing(df, col_name):
    """Clean the text column in place and return the same DataFrame."""
    column = col_name
    df[column] = df[column].progress_apply(lambda x: str(x))
    df[column] = df[column].progress_apply(lambda x: th.remove_emails(x))
    df[column] = df[column].progress_apply(lambda x: th.remove_html_tags(x))
    df[column] = df[column].progress_apply(lambda x: th.remove_special_chars(x))
    df[column] = df[column].progress_apply(lambda x: th.remove_accented_chars(x))

    return df

train_cleaned_df = text_preprocessing(train_df,'text')
train_cleaned_df[train_cleaned_df.target == 0]





         id  text                                               target
  15     23  Whats up man                                       0
  16     24  I love fruits                                      0
  17     25  Summer is lovely                                   0
  18     26  My car is so fast                                  0
  19     28  What a goooooooaaaaaal                             0
 ...    ...  ...                                                ...
7581  10833  engineshed Great atmosphere at the British Lio...  0
7582  10834  Cramer Igers 3 words that wrecked Disneys stoc...  0
7584  10837  These boxes are ready to explode Exploding Kit...  0
7587  10841  Sirens everywhere                                  0
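
To make the four cleaning steps concrete, here is a rough standard-library approximation of what they do to a single string. This is only a sketch: the regular expressions and the function name `clean_tweet` are assumptions for illustration and are not text_hammer's actual implementation, so results may differ from the library on edge cases.

```python
import re
import unicodedata

# Illustrative patterns (assumptions, not taken from text_hammer).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
TAG_RE = re.compile(r"<[^>]+>")
SPECIAL_RE = re.compile(r"[^\w\s]")


def clean_tweet(text):
    """Apply the same sequence of steps as text_preprocessing to one string."""
    text = str(text)
    text = EMAIL_RE.sub("", text)        # remove e-mail addresses
    text = TAG_RE.sub(" ", text)         # remove HTML tags
    text = SPECIAL_RE.sub("", text)      # remove punctuation and symbols
    # Strip accents: decompose characters, then drop the non-ASCII marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())        # collapse repeated whitespace


print(clean_tweet("Café <b>fire</b> near me@x.com!!"))
```

Note that removing special characters also strips apostrophes, which is why the cleaned table shows "Whats" and "Disneys".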


Tokenize the data and build the model

In [9]:
from transformers import AutoTokenizer,TFBertModel
tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
bert = TFBertModel.from_pretrained('bert-large-uncased')

Some layers from the model checkpoint at bert-large-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-large-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Find the maximum number of words in a tweet

In [11]:
print("max len of tweets",max([len(x.split()) for x in train_df.text]))

max len of tweets 31
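
The figure above counts whitespace-separated words, whereas the tokenizer's `max_length` counts tokens, including the `[CLS]` and `[SEP]` markers that BERT adds. WordPiece can also split one word into several tokens, so the word count only gives a lower bound on the input length. A rough sketch of that bound (the function name and sample tweets are hypothetical):

```python
def min_bert_tokens(text):
    """Lower bound on BERT input length: one token per word plus [CLS] and [SEP]."""
    return len(text.split()) + 2


sample_tweets = ["Sirens everywhere", "My car is so fast"]
lower_bound = max(min_bert_tokens(t) for t in sample_tweets)
print(lower_bound)
```

By this bound a 31-word tweet needs at least 33 tokens, so a limit of 35 leaves little headroom, and tweets whose words split into several pieces will be truncated.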


Convert the training text to BERT input format

In [12]:
x_train = tokenizer(
    text = train_cleaned_df.text.tolist(),
    add_special_tokens = True,
    max_length = 35,
    truncation = True,         # truncate sequences longer than 35 tokens
    padding = "max_length",    # pad shorter sequences with 0 up to 35 tokens
    return_tensors = "tf",
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

y_train = train_cleaned_df.target.values
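
The effect of truncation and padding on a single sequence can be sketched in plain Python. This is an illustration of the idea, not the tokenizer's actual code; the function name is hypothetical, and the special-token ids are the ones I understand the bert-*-uncased vocabulary to use.

```python
# Assumed ids in the bert-*-uncased vocabulary: [CLS]=101, [SEP]=102, [PAD]=0.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def encode_fixed_length(content_ids, max_len):
    """Wrap token ids in [CLS]/[SEP], truncate, and pad to exactly max_len."""
    kept = list(content_ids)[: max_len - 2]   # leave room for the two special tokens
    ids = [CLS_ID] + kept + [SEP_ID]
    mask = [1] * len(ids)                     # 1 marks real tokens
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad   # 0 marks padding


short_ids, short_mask = encode_fixed_length([7592, 2088], max_len=6)
long_ids, long_mask = encode_fixed_length(range(1000, 1010), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside `input_ids` later on.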

Import the modeling modules

In [13]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy,BinaryCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy,BinaryAccuracy
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.utils import plot_model

Build the model


In [14]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense

max_len = 35  # must match max_length used by the tokenizer

input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

# Index 1 of the BERT output is the pooled [CLS] representation.
embeddings = bert([input_ids, input_mask])[1]

# Classification head on top of the pooled output.
out = tf.keras.layers.Dropout(0.1)(embeddings)
out = Dense(128, activation='relu')(out)
out = tf.keras.layers.Dropout(0.1)(out)
out = Dense(32, activation='relu')(out)

# Sigmoid output: probability that the tweet describes a real disaster.
y = Dense(1, activation='sigmoid')(out)

model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=y)
model.layers[2].trainable = True  # layer 2 is the BERT model, so it is fine-tuned too

In [15]:
model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 35)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 35)]         0           []                               
                                                                                                  
 tf_bert_model_1 (TFBertModel)  TFBaseModelOutputWi  335141888   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 35,                                            
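
The summary output above is cut off before the dense layers, but their parameter counts follow directly from the layer sizes in the model definition. A small sketch, assuming the pooled BERT output has width 1024 (my understanding of bert-large-uncased; the helper name is hypothetical):

```python
def dense_params(n_in, n_out):
    """Parameters in a fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


hidden_size = 1024  # assumption: pooled output width of bert-large-uncased
layer_sizes = [hidden_size, 128, 32, 1]   # Dense(128) -> Dense(32) -> Dense(1)

# Dropout layers have no parameters, so only the Dense layers are counted.
head_params = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(head_params)
```

Compared with the 335,141,888 parameters reported for the BERT layer, the classification head is a very small fraction of the model.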

Compile the model

In [16]:
optimizer = tf.keras.optimizers.legacy.Adam(
    learning_rate=6e-06,
    epsilon=1e-08,
    decay=0.01,
    clipnorm=1.0)

# The output layer already applies a sigmoid, so the loss receives
# probabilities rather than logits.
loss = BinaryCrossentropy(from_logits=False)
metric = [BinaryAccuracy('accuracy')]

model.compile(
    optimizer = optimizer,
    loss = loss,
    metrics = metric)
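
Because the final Dense layer uses a sigmoid activation, the model outputs probabilities, and the loss should be told so (`from_logits=False`). The sketch below, written in plain Python rather than Keras, shows why the setting matters: treating a probability as if it were a logit applies the sigmoid a second time and changes the loss value. The function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(p, y):
    """Binary cross-entropy when the model output is already a probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """Binary cross-entropy when the model output is a raw logit."""
    return bce_from_probability(sigmoid(z), y)


logit, label = 2.0, 1
prob = sigmoid(logit)                        # what a sigmoid output layer emits

correct = bce_from_probability(prob, label)  # probability treated as a probability
consistent = bce_from_logit(logit, label)    # logit treated as a logit
mismatched = bce_from_logit(prob, label)     # probability treated as a logit

print(correct, consistent, mismatched)
```

The first two values agree, while the mismatched one is larger. Some Keras versions detect this mismatch and emit a warning, but matching the flag to the output layer avoids relying on that.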

Train the model

In [17]:
train_history = model.fit(
    x ={'input_ids':x_train['input_ids'],'attention_mask':x_train['attention_mask']} ,
    y = y_train,
    validation_split = 0.2,
    epochs=1, 
    batch_size=24
)
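
With `validation_split = 0.2`, Keras holds out the last 20% of the samples (taken before shuffling) for validation. The resulting sizes can be estimated as follows; this sketch assumes the training share is rounded down, which may differ between Keras versions.

```python
import math

n_samples = 7613          # rows in train.csv, from the info() output
validation_split = 0.2
batch_size = 24

# Assumption: the training share is floored and the remainder is used for validation.
n_train = math.floor(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```

Since the held-out rows are simply the tail of the file, shuffling the data before calling `fit` may give a more representative validation set.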





In [19]:
test_cleaned_df = text_preprocessing(test_df,'text')


Convert the test text to BERT input format

In [23]:
x_test = tokenizer(
    text=test_cleaned_df.text.tolist(),
    add_special_tokens=True,
    max_length=35,
    truncation=True,         # truncate sequences longer than 35 tokens
    padding='max_length',    # pad to exactly 35 tokens to match the model input shape
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

In [24]:
predicted = model.predict({'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']})



Set predictions greater than 0.5 to 1 and all others to 0


In [28]:
y_predicted = np.where(predicted.ravel() > 0.5, 1, 0)  # flatten (n, 1) predictions to (n,)
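
A small NumPy illustration of the thresholding step, using made-up probabilities in the same `(n, 1)` shape that `model.predict` returns for a single-unit output layer:

```python
import numpy as np

# Hypothetical model outputs with shape (n, 1).
predicted = np.array([[0.91], [0.50], [0.08], [0.73]])

# Flatten to shape (n,) and apply a strict > 0.5 threshold.
labels = (predicted.ravel() > 0.5).astype(int)
print(labels)
```

Because the comparison is strict, a prediction of exactly 0.5 is labeled 0.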

Export the results


In [29]:
sample_df = pd.read_csv("sample_submission.csv")

In [30]:
sample_df['id'] = test_df.id
sample_df['target'] = y_predicted

In [32]:
sample_df.to_csv('1.csv',index = False)
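
The exported file has two columns, `id` and `target`, with one row per test tweet. As a sanity check on that layout, the sketch below writes the same structure with the standard `csv` module into an in-memory buffer; the ids and labels are made up and the function name is hypothetical.

```python
import csv
import io


def write_submission(ids, targets, handle):
    """Write id/target pairs as CSV with a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "target"])
    for tweet_id, target in zip(ids, targets):
        writer.writerow([int(tweet_id), int(target)])


buffer = io.StringIO()
write_submission([0, 2, 3], [1, 0, 1], buffer)   # hypothetical ids and labels
print(buffer.getvalue())
```

Casting to `int` keeps the labels as `0`/`1` rather than `0.0`/`1.0`, which can otherwise creep in when predictions are stored as floats.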