In this notebook we train a DistilBERT model to detect spam in SMS messages

In [1]:
model_name = 'distilbert'
train_dataset_name = 'spam sms'

In [2]:
!pip install tensorflow-text
import tensorflow_text as text

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Load the data

In [4]:
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/MyDrive/data_for_colab/spam_sms.csv', encoding="ISO-8859-1")
df

Mounted at /content/drive


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [5]:
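# Drop the empty trailing columns and give the remaining ones descriptive names
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)
df.columns = ['IS_SPAM', 'DATA_COLUMN']
# Encode the label as an integer: 1 for spam, 0 for ham
df['IS_SPAM'] = (df['IS_SPAM'] == 'spam').astype(int)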

In [6]:
df

Unnamed: 0,IS_SPAM,DATA_COLUMN
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


In [7]:
df['IS_SPAM'].value_counts()

0    4825
1     747
Name: IS_SPAM, dtype: int64
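The classes are imbalanced, so accuracy alone can look good even for a useless model. As a rough sketch using only the counts printed above, the accuracy of a trivial classifier that labels every message as ham can be computed like this (the variable names are illustrative, not part of the notebook):

```python
# Class counts as printed by value_counts() above
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

# Share of spam in the dataset
spam_share = n_spam / total

# Accuracy of a trivial classifier that labels every message as ham
baseline_accuracy = n_ham / total

print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model should be compared against this baseline, which is why precision and recall are tracked alongside accuracy later on.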

In [8]:
df_positive = df[df['IS_SPAM']==1]

In [9]:
df_negative = df[df['IS_SPAM']==0]

Create the test and training sets

In [10]:
# Test set: the last half of each class (positional split, no shuffling)
n_test = df_negative.shape[0] // 2
df_negative_test = df_negative.tail(n_test)
n_test = df_positive.shape[0] // 2
df_positive_test = df_positive.tail(n_test)

In [11]:
df_negative_test.shape

(2412, 2)

In [12]:
df_positive_test.shape

(373, 2)

In [13]:
df_positive_test

Unnamed: 0,IS_SPAM,DATA_COLUMN
2728,1,Urgent Please call 09066612661 from landline. ...
2729,1,Urgent! Please call 09066612661 from your land...
2741,1,I don't know u and u don't know me. Send CHAT ...
2766,1,Married local women looking for discreet actio...
2769,1,Burger King - Wanna play footy at a top stadiu...
...,...,...
5537,1,Want explicit SEX in 30 secs? Ring 02073162414...
5540,1,ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547,1,Had your contract mobile 11 Mnths? Latest Moto...
5566,1,REMINDER FROM O2: To get 2.50 pounds free call...


In [14]:
df_balanced_test = pd.concat([df_negative_test, df_positive_test])

In [15]:
df_balanced_test.sample(10)

Unnamed: 0,IS_SPAM,DATA_COLUMN
3262,0,So u gonna get deus ex?
4512,1,Money i have won wining number 946 wot do i do...
4437,0,Nothing will ever be easy. But don't be lookin...
5483,0,So li hai... Me bored now da lecturer repeatin...
5500,0,Love has one law; Make happy the person you lo...
3387,0,Same as kallis dismissial in 2nd test:-).
4539,0,"Urgh, coach hot, smells of chip fat! Thanks ag..."
4306,0,I guess it is useless calling u 4 something im...
4322,0,Aight well keep me informed
3422,1,Had your mobile 10 mths? Update to latest Oran...


In [16]:
df_balanced_test['IS_SPAM'].value_counts()

0    2412
1     373
Name: IS_SPAM, dtype: int64

In [17]:
# Training set: the first half of each class (positional split, no shuffling)
n_train = df_negative.shape[0] // 2
df_negative_train = df_negative.head(n_train)
n_train = df_positive.shape[0] // 2
df_positive_train = df_positive.head(n_train)

In [18]:
df_balanced_train = pd.concat([df_negative_train, df_positive_train])

In [19]:
df_balanced_train['IS_SPAM'].value_counts()

0    2412
1     373
Name: IS_SPAM, dtype: int64
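The split is positional: `head()` takes the first half of each class and `tail()` the last half, without shuffling. A minimal sketch with plain lists standing in for the dataframes (the helper `head_tail_split` is hypothetical) reproduces the set sizes and shows that, because both class counts are odd, the middle row of each class ends up in neither set:

```python
def head_tail_split(rows):
    """Return (first half, last half) of a sequence, each of length len(rows) // 2."""
    n = len(rows) // 2
    if n == 0:
        # rows[-0:] would return the whole sequence, so handle this case explicitly
        return rows[:0], rows[:0]
    return rows[:n], rows[-n:]


# Stand-ins for df_negative (4825 rows) and df_positive (747 rows)
ham = list(range(4825))
spam = list(range(747))

ham_train, ham_test = head_tail_split(ham)
spam_train, spam_test = head_tail_split(spam)

train_size = len(ham_train) + len(spam_train)
test_size = len(ham_test) + len(spam_test)
unused = len(ham) + len(spam) - train_size - test_size

print(train_size, test_size, unused)
```

Note that despite the `balanced` in the dataframe names, both sets keep the original class ratio (2412 ham to 373 spam) rather than a 1:1 balance.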

In [20]:
df_balanced_train.sample(10)

Unnamed: 0,IS_SPAM,DATA_COLUMN
1462,1,09066362231 URGENT! Your mobile No 07xxxxxxxxx...
1602,0,Ok pa. Nothing problem:-)
1485,0,(I should add that I don't really care and if ...
568,0,Love it! Daddy will make you scream with pleas...
1518,0,Shall i ask one thing if you dont mistake me.
662,0,Sorry me going home first... Daddy come fetch ...
1155,0,"Sorry man, accidentally left my phone on silen..."
74,0,U can call me now...
1201,0,I know she called me
1258,0,Honey boo I'm missing u.


In [21]:
X_train = df_balanced_train['DATA_COLUMN'].squeeze()
y_train = df_balanced_train['IS_SPAM'].squeeze()

In [22]:
X_test = df_balanced_test['DATA_COLUMN'].squeeze()
y_test = df_balanced_test['IS_SPAM'].squeeze()

Build and train the model

In [23]:
distilbert_preprocess = hub.KerasLayer('https://tfhub.dev/jeongukjae/distilbert_en_uncased_preprocess/2')




In [24]:
distilbert_encoder = hub.KerasLayer("https://tfhub.dev/jeongukjae/distilbert_en_uncased_L-6_H-768_A-12/1")

In [25]:
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = distilbert_preprocess(text_input)
outputs = distilbert_encoder(preprocessed_text)

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


In [26]:
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)
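For intuition, the classification head is a single dense unit with a sigmoid, applied to the encoder's pooled output (dropout is only active during training). The sketch below shows what such a unit computes in plain Python. The 4-dimensional feature vector, the weights, and the bias are made-up values standing in for the 768-dimensional pooled output and the learned parameters:

```python
import math


def dense_sigmoid(features, weights, bias):
    """Single-unit dense layer followed by a sigmoid: p = 1 / (1 + exp(-(w.x + b)))."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical stand-in for the pooled output and the learned parameters
features = [0.5, -1.0, 0.25, 2.0]
weights = [0.8, 0.1, -0.4, 0.6]
bias = -0.3

# The result is a number in (0, 1), interpreted as the probability of spam
p_spam = dense_sigmoid(features, weights, bias)
print(p_spam)
```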

In [27]:
model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [28]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128)}                                                          
                                                                                                  
 keras_layer_1 (KerasLayer)     {'pooled_output': (  66362880    ['keras_layer[0][0]',        

In [29]:
METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
]

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)

In [30]:
history = model.fit(X_train, y_train, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [31]:
y_predicted = model.predict(X_test)
y_predicted




array([[0.18666743],
       [0.01383404],
       [0.00208221],
       ...,
       [0.97241205],
       [0.993364  ],
       [0.9814306 ]], dtype=float32)

In [32]:
y_predicted = np.where(y_predicted > 0.5, 1, 0)
y_predicted

array([[0],
       [0],
       [0],
       ...,
       [1],
       [1],
       [1]])
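The thresholding step turns each predicted probability into a hard label. A NumPy-free sketch of the same rule, applied to the six probabilities visible in the output above (the helper `to_labels` is hypothetical), could look like this:

```python
def to_labels(probabilities, threshold=0.5):
    """Label a message as spam (1) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


# The six probabilities shown in the prediction output above
probs = [0.18666743, 0.01383404, 0.00208221, 0.97241205, 0.993364, 0.9814306]

labels = to_labels(probs)
print(labels)

# A stricter threshold flags fewer messages as spam
strict_labels = to_labels(probs, threshold=0.99)
print(strict_labels)
```

Raising the threshold generally trades recall for precision, so 0.5 is a default rather than a tuned choice.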

In [33]:
y_test

2795    0
2796    0
2797    0
2798    0
2799    0
       ..
5537    1
5540    1
5547    1
5566    1
5567    1
Name: IS_SPAM, Length: 2785, dtype: int64

In [34]:
accuracy_score(y_test, y_predicted)

0.9892280071813285

In [35]:
precision_score(y_test, y_predicted)


0.9803921568627451

In [36]:
recall_score(y_test, y_predicted)

0.938337801608579

In [37]:
f1_score(y_test, y_predicted)

0.958904109589041
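As a sanity check, the four scores can be tied together through a confusion matrix. The counts below are not printed by the notebook; they are inferred here from the reported scores and the test set sizes (373 spam, 2412 ham), so treat them as an assumption:

```python
# Confusion-matrix counts inferred from the printed scores (an assumption;
# the notebook itself does not print them)
tp, fp, fn, tn = 350, 7, 23, 2405

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Under this reading, the model misses 23 of the 373 spam messages in the test set and flags 7 of the 2412 ham messages as spam.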

In [38]:
df_results_on_test = pd.DataFrame(columns=['accuracy', 'precision', 'recall', 'f1_score'])

In [39]:
index = model_name + ' trained on ' + train_dataset_name + ' and tested on ' + train_dataset_name + ' dataset'
df_results_on_test.loc[index, 'accuracy'] = accuracy_score(y_test, y_predicted)
df_results_on_test.loc[index, 'precision'] = precision_score(y_test, y_predicted)
df_results_on_test.loc[index, 'recall'] = recall_score(y_test, y_predicted)
df_results_on_test.loc[index, 'f1_score'] =  f1_score(y_test, y_predicted)

In [40]:
df_results_on_test

Unnamed: 0,accuracy,precision,recall,f1_score
distilbert trained on spam sms and tested on spam sms dataset,0.989228,0.980392,0.938338,0.958904


Save the trained model

In [41]:
saved_model_path = '/content/drive/MyDrive/data_for_colab/distilbert_trained_on_spam_sms_19_january'

In [42]:
print(saved_model_path)

/content/drive/MyDrive/data_for_colab/distilbert_trained_on_spam_sms_19_january


In [43]:
model.save(saved_model_path, include_optimizer=True) 



Save the dataframes with the results on the training set and on the test set

First, the results for the training set (taken from the last epoch of the training history)

In [44]:
name_for_train_csv = model_name + ' trained on ' + train_dataset_name + ' quality on train dataset'

In [45]:
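def calculate_f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall; return 0.0 when both are zero
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)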

In [46]:
df_with_train_quality = pd.DataFrame(columns=['accuracy', 'precision', 'recall', 'f1_score'])

In [47]:
for el in ['accuracy', 'precision', 'recall']:
    df_with_train_quality.loc[name_for_train_csv, el] = history.history[el][-1]
df_with_train_quality.loc[name_for_train_csv, 'f1_score'] = calculate_f1_score(history.history['precision'][-1], history.history['recall'][-1])

In [48]:
df_with_train_quality

Unnamed: 0,accuracy,precision,recall,f1_score
distilbert trained on spam sms quality on train dataset,0.989228,0.975069,0.9437,0.959128


In [49]:
df_with_train_quality.to_csv('/content/drive/MyDrive/data_for_colab/dataframes/train_quality/' + name_for_train_csv + '.csv')

Now the results for the test set

In [50]:
name_for_test_csv = model_name + ' trained on ' + train_dataset_name + ' and tested on ' + train_dataset_name + ' dataset'

In [51]:
df_results_on_test.to_csv('/content/drive/MyDrive/data_for_colab/dataframes/test_quality/' + name_for_test_csv + '.csv')