<a href="https://colab.research.google.com/github/Theieyrre/Natural-Language-Processing-with-Disaster-Tweets/blob/main/B%C4%B0L_470.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BİL 470 Project: Natural Language Processing with Disaster Tweets
[Kaggle link](https://www.kaggle.com/c/nlp-getting-started/overview)  
Models used
* Logistic Regression (unigrams, bigrams, and both together)
* Naive Bayes (unigrams, bigrams, and both together)
* SVC with OneVsRestClassifier (unigrams, bigrams, and both together)
* fastText with bigrams
* BERT
* RoBERTa





## Using Bag of Words and TF-IDF
The libraries needed for all models are imported here.

In [None]:
import sklearn
import pandas as pd
import re
import nltk
import numpy as np
import gc
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from google.colab import drive

### GDrive
To avoid uploading the Kaggle dataset to Colab on every run, I uploaded it to Drive.  
I mounted GDrive and read the data from there.


In [None]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
train = pd.read_csv('/content/gdrive/MyDrive/bil470/train.csv')
print(len(train))

7613


I listed the target distribution in the data. It shows the accuracy that a trivial baseline, such as always predicting the majority class, would reach.

In [None]:
print(train["target"].value_counts())

0    4342
1    3271
Name: target, dtype: int64
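As a reference point, the counts above can be turned into a majority-class baseline. This is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))
```

A model is only useful here if it clearly beats this baseline of roughly 0.57.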


In [None]:
train.head(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


### Preprocessing
I used the NLTK library to remove punctuation and stopwords.


In [None]:
nltk.download('stopwords')
en_stopwords = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Remove the links from the tweets and convert the text to lowercase.

In [None]:
tk = RegexpTokenizer(r'\w+')
train['text'] = train['text'].apply(lambda s: re.sub(r'http\S+', '', s))
train['words'] = train['text'].str.lower().apply(tk.tokenize)

In [None]:
train['words'].head(5)

0    [our, deeds, are, the, reason, of, this, earth...
1        [forest, fire, near, la, ronge, sask, canada]
2    [all, residents, asked, to, shelter, in, place...
3    [13, 000, people, receive, wildfires, evacuati...
4    [just, got, sent, this, photo, from, ruby, ala...
Name: words, dtype: object

Remove numbers and stopwords.

In [None]:
train['words'] = train['words'].apply(lambda words: [word for word in words if word not in en_stopwords])
train['words'] = train['words'].apply(lambda words: [word for word in words if not word.isdigit()])

In [None]:
train['words'].head(5)

0    [deeds, reason, earthquake, may, allah, forgiv...
1        [forest, fire, near, la, ronge, sask, canada]
2    [residents, asked, shelter, place, notified, o...
3    [people, receive, wildfires, evacuation, order...
4    [got, sent, photo, ruby, alaska, smoke, wildfi...
Name: words, dtype: object
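The preprocessing steps above can be condensed into a single function that works on one tweet. This is a sketch using only the standard library: `re.findall(r"\w+", ...)` should behave like `RegexpTokenizer(r'\w+')`, and `DEMO_STOPWORDS` is a small hypothetical stand-in for NLTK's English stopword list. The sample URL is made up.

```python
import re

# Hypothetical, tiny stopword set for illustration; the notebook uses
# NLTK's full English list (stopwords.words('english')).
DEMO_STOPWORDS = {"our", "are", "the", "of", "this", "all", "to", "in", "just", "from"}

def preprocess(text, stop_words=DEMO_STOPWORDS):
    """Strip links, lowercase, tokenize on word characters, drop stopwords and digits."""
    text = re.sub(r"http\S+", "", text)          # remove URLs
    tokens = re.findall(r"\w+", text.lower())    # split on runs of word characters
    return [t for t in tokens if t not in stop_words and not t.isdigit()]

print(preprocess("13,000 people receive #wildfires evacuation orders http://t.co/abc"))
```

Note that `13,000` is split into `13` and `000` by the tokenizer, and both pieces are then removed by the digit filter.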

### Bag of Words
I converted the list of token lists into a list of strings.


In [None]:
corpus = train['words'].apply(lambda words: ' '.join(words)).tolist()
corpus[:5]

['deeds reason earthquake may allah forgive us',
 'forest fire near la ronge sask canada',
 'residents asked shelter place notified officers evacuation shelter place orders expected',
 'people receive wildfires evacuation orders california',
 'got sent photo ruby alaska smoke wildfires pours school']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(corpus, train["target"], test_size=0.20, random_state=42)

In [None]:
corpus_np = np.array(corpus)

### Setup needed for the other models
**This part contains the steps required for GridSearchCV, fastText, and BERT.**  
Because the bag-of-words and vectorization steps happen later (for GridSearchCV, inside the pipeline), I only split the preprocessed data into train and test sets here. When running those models, I executed everything above this point and skipped the cells in between.

I created two separate vectorizers, one for unigrams and one for bigrams.

In [None]:
unigram_vectorizer = CountVectorizer(ngram_range=(1,1), min_df=0., max_df=1.0)
bigram_vectorizer = CountVectorizer(ngram_range=(2,2), min_df=0., max_df=1.0)

In [None]:
uni_bow = unigram_vectorizer.fit_transform(corpus)
bi_bow = bigram_vectorizer.fit_transform(corpus)
del unigram_vectorizer
del bigram_vectorizer
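To make the `ngram_range` settings concrete, here is a small sketch of what counts as a unigram or a bigram. The helper `word_ngrams` is illustrative only. `CountVectorizer` does its own tokenization (its default pattern skips single-character tokens), so this is not a reimplementation of it.

```python
def word_ngrams(tokens, n):
    """Return the contiguous n-word sequences in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "forest fire near la ronge".split()
unigrams = word_ngrams(tokens, 1)   # the features counted with ngram_range=(1, 1)
bigrams = word_ngrams(tokens, 2)    # the features counted with ngram_range=(2, 2)
both = unigrams + bigrams           # the features counted with ngram_range=(1, 2)
print(bigrams)
```

Bigram features are much sparser than unigram features on short tweets, which is worth keeping in mind when comparing the two settings.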

### TF-IDF Vectors

I created a TF-IDF transformer and used it to build the TF-IDF vectors for the unigram and bigram counts.  
Because of RAM limits, I deleted the BoW matrices afterwards.

In [None]:
tfidfconverter = TfidfTransformer()
X_uni = tfidfconverter.fit_transform(uni_bow).toarray()
X_bi =  tfidfconverter.fit_transform(bi_bow).toarray()
del uni_bow
del bi_bow
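To make the weighting concrete, the sketch below works through the formula that `TfidfTransformer` applies with its default settings, as far as I know: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper names and the counts are made up for illustration.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def tfidf_row(counts, doc_freqs, n_docs):
    """Weight one document's term counts by IDF, then scale the row to unit L2 norm."""
    weights = [c * smooth_idf(n_docs, df) for c, df in zip(counts, doc_freqs)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

# Two terms in a 3-document corpus: the first occurs in every document, the second in one.
row = tfidf_row(counts=[1, 1], doc_freqs=[3, 1], n_docs=3)
print(row)
```

A term that appears in every document gets an IDF of exactly 1, so it is kept but never weighted above rarer terms.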

I created the KFold splits and evaluated the models on them.

In [None]:
kf = KFold(n_splits=5)
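Without `shuffle=True`, `KFold` cuts the data into contiguous blocks in row order. The sketch below computes the block sizes for this dataset; `contiguous_fold_sizes` is a hypothetical helper, not a scikit-learn function.

```python
def contiguous_fold_sizes(n_samples, n_splits):
    """Sizes of the test folds when the data is split in order, without shuffling."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = contiguous_fold_sizes(7613, 5)
print(sizes)
```

These sizes match the `support` totals in the reports that follow. Since the class balance differs noticeably from fold to fold in those reports, passing `shuffle=True` to `KFold` or using `StratifiedKFold` may give more comparable folds.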

### Logistic Regression
To get a baseline result, I used LR and looked at its scores for binary classification.
I ran fit and predict first for the unigrams and then for the bigrams.

In [None]:
lr = LogisticRegression()

The confusion matrix and classification report are printed for each KFold split.  
First for the unigrams

In [None]:
for train_index, test_index in kf.split(X_uni):
  X_uni_train, X_uni_test = X_uni[train_index], X_uni[test_index]
  Y_uni_train, Y_uni_test = train.target[train_index], train.target[test_index] 
  uni_lr = lr.fit(X_uni_train, Y_uni_train)
  uni_lr_predicted_test = uni_lr.predict(X_uni_test)
  print(confusion_matrix(Y_uni_test, uni_lr_predicted_test))
  print(classification_report(Y_uni_test, uni_lr_predicted_test))

[[887  59]
 [286 291]]
              precision    recall  f1-score   support

           0       0.76      0.94      0.84       946
           1       0.83      0.50      0.63       577

    accuracy                           0.77      1523
   macro avg       0.79      0.72      0.73      1523
weighted avg       0.78      0.77      0.76      1523

[[826  68]
 [325 304]]
              precision    recall  f1-score   support

           0       0.72      0.92      0.81       894
           1       0.82      0.48      0.61       629

    accuracy                           0.74      1523
   macro avg       0.77      0.70      0.71      1523
weighted avg       0.76      0.74      0.73      1523

[[744  48]
 [441 290]]
              precision    recall  f1-score   support

           0       0.63      0.94      0.75       792
           1       0.86      0.40      0.54       731

    accuracy                           0.68      1523
   macro avg       0.74      0.67      0.65      1523
weigh

For the bigrams

In [None]:
for train_index, test_index in kf.split(X_bi):
  X_bi_train, X_bi_test = X_bi[train_index], X_bi[test_index]
  Y_bi_train, Y_bi_test = train.target[train_index], train.target[test_index] 
  bi_lr = lr.fit(X_bi_train, Y_bi_train)
  bi_lr_predicted_test = bi_lr.predict(X_bi_test)
  print(confusion_matrix(Y_bi_test, bi_lr_predicted_test))
  print(classification_report(Y_bi_test, bi_lr_predicted_test))

[[946   0]
 [544  33]]
              precision    recall  f1-score   support

           0       0.63      1.00      0.78       946
           1       1.00      0.06      0.11       577

    accuracy                           0.64      1523
   macro avg       0.82      0.53      0.44      1523
weighted avg       0.77      0.64      0.52      1523

[[894   0]
 [588  41]]
              precision    recall  f1-score   support

           0       0.60      1.00      0.75       894
           1       1.00      0.07      0.12       629

    accuracy                           0.61      1523
   macro avg       0.80      0.53      0.44      1523
weighted avg       0.77      0.61      0.49      1523

[[791   1]
 [708  23]]
              precision    recall  f1-score   support

           0       0.53      1.00      0.69       792
           1       0.96      0.03      0.06       731

    accuracy                           0.53      1523
   macro avg       0.74      0.52      0.38      1523
weigh

I ran GridSearchCV with an LR pipeline to find the best parameters.

In [None]:
lr_pipe = Pipeline([('vectorizer', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('lr', LogisticRegression())])

I ran the grid search over the n-gram range, whether to use IDF weighting, and the LR solver.  
`n_jobs=-1` runs as many parallel jobs as there are available cores, which speeds up the search.

In [None]:
parameters_lr = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'tfidf__use_idf': (True, False),
              'lr__solver': ['lbfgs', 'liblinear']}
gs_lr = GridSearchCV(lr_pipe, parameters_lr, n_jobs=-1)
gs_lr = gs_lr.fit(X_train, Y_train)

The accuracy and parameters of the best model:

In [None]:
print(gs_lr.best_score_)
print(gs_lr.best_params_)

0.8
{'lr__solver': 'liblinear', 'tfidf__use_idf': True, 'vectorizer__ngram_range': (1, 2)}
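To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below uses only `itertools`; the fold count of 5 is an assumption based on the default `cv` of recent scikit-learn versions, since no `cv` argument is passed above.

```python
from itertools import product

param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__use_idf": [True, False],
    "lr__solver": ["lbfgs", "liblinear"],
}

# Every combination of the listed values is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 5  # assumption: GridSearchCV's default cv when none is passed
print(len(candidates), len(candidates) * n_folds)
```

Each candidate is fitted once per fold, so adding one more parameter with two values doubles the number of fits.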


I listed the confusion matrix and scores obtained with the best estimator.

In [None]:
y_lr_best_pred = gs_lr.best_estimator_.predict(X_test)
print(confusion_matrix(Y_test, y_lr_best_pred))
print(classification_report(Y_test, y_lr_best_pred))

[[779  95]
 [212 437]]
              precision    recall  f1-score   support

           0       0.79      0.89      0.84       874
           1       0.82      0.67      0.74       649

    accuracy                           0.80      1523
   macro avg       0.80      0.78      0.79      1523
weighted avg       0.80      0.80      0.79      1523
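The numbers in the report can be recomputed directly from the confusion matrix, which helps when reading the other reports in this notebook. A minimal sketch, assuming scikit-learn's layout of `[[TN, FP], [FN, TP]]` for binary labels; `binary_report` is an illustrative helper.

```python
def binary_report(cm):
    """Accuracy, precision, recall and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Confusion matrix of the best LR estimator printed above.
report = binary_report([[779, 95], [212, 437]])
print({name: round(value, 2) for name, value in report.items()})
```

The low recall for class 1 comes from the 212 disaster tweets that were predicted as class 0.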



### Naive Bayes
Here are the results for Naive Bayes, one of the basic models for NLP.  
As with LR, first for the unigrams and then for the bigrams.

In [None]:
bayes = MultinomialNB()

In [None]:
for train_index, test_index in kf.split(X_uni):
  X_uni_train, X_uni_test = X_uni[train_index], X_uni[test_index]
  Y_uni_train, Y_uni_test = train.target[train_index], train.target[test_index] 
  uni_bayes = bayes.fit(X_uni_train, Y_uni_train)
  uni_bayes_predicted_test = uni_bayes.predict(X_uni_test)
  print(confusion_matrix(Y_uni_test, uni_bayes_predicted_test))
  print(classification_report(Y_uni_test, uni_bayes_predicted_test))

[[811 135]
 [227 350]]
              precision    recall  f1-score   support

           0       0.78      0.86      0.82       946
           1       0.72      0.61      0.66       577

    accuracy                           0.76      1523
   macro avg       0.75      0.73      0.74      1523
weighted avg       0.76      0.76      0.76      1523

[[756 138]
 [220 409]]
              precision    recall  f1-score   support

           0       0.77      0.85      0.81       894
           1       0.75      0.65      0.70       629

    accuracy                           0.76      1523
   macro avg       0.76      0.75      0.75      1523
weighted avg       0.76      0.76      0.76      1523

[[692 100]
 [309 422]]
              precision    recall  f1-score   support

           0       0.69      0.87      0.77       792
           1       0.81      0.58      0.67       731

    accuracy                           0.73      1523
   macro avg       0.75      0.73      0.72      1523
weigh

In [None]:
for train_index, test_index in kf.split(X_bi):
  X_bi_train, X_bi_test = X_bi[train_index], X_bi[test_index]
  Y_bi_train, Y_bi_test = train.target[train_index], train.target[test_index] 
  bi_bayes = bayes.fit(X_bi_train, Y_bi_train)
  bi_bayes_predicted_test = bi_bayes.predict(X_bi_test)
  print(confusion_matrix(Y_bi_test, bi_bayes_predicted_test))
  print(classification_report(Y_bi_test, bi_bayes_predicted_test))

[[926  20]
 [484  93]]
              precision    recall  f1-score   support

           0       0.66      0.98      0.79       946
           1       0.82      0.16      0.27       577

    accuracy                           0.67      1523
   macro avg       0.74      0.57      0.53      1523
weighted avg       0.72      0.67      0.59      1523

[[886   8]
 [529 100]]
              precision    recall  f1-score   support

           0       0.63      0.99      0.77       894
           1       0.93      0.16      0.27       629

    accuracy                           0.65      1523
   macro avg       0.78      0.58      0.52      1523
weighted avg       0.75      0.65      0.56      1523

[[784   8]
 [653  78]]
              precision    recall  f1-score   support

           0       0.55      0.99      0.70       792
           1       0.91      0.11      0.19       731

    accuracy                           0.57      1523
   macro avg       0.73      0.55      0.45      1523
weigh

I ran GridSearchCV with a Naive Bayes pipeline to find the best parameters.

In [None]:
bayes_pipe = Pipeline([('vectorizer', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('bayes', MultinomialNB())])

In [None]:
parameters_bayes = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'tfidf__use_idf': (True, False)}
gs_bayes = GridSearchCV(bayes_pipe, parameters_bayes, n_jobs=-1)
gs_bayes = gs_bayes.fit(X_train, Y_train)

In [None]:
print(gs_bayes.best_score_)
print(gs_bayes.best_params_)

0.8011494252873563
{'tfidf__use_idf': True, 'vectorizer__ngram_range': (1, 2)}


I listed the confusion matrix and scores obtained with the best estimator.

In [None]:
y_bayes_best_pred = gs_bayes.best_estimator_.predict(X_test)
print(confusion_matrix(Y_test, y_bayes_best_pred))
print(classification_report(Y_test, y_bayes_best_pred))

[[787  87]
 [226 423]]
              precision    recall  f1-score   support

           0       0.78      0.90      0.83       874
           1       0.83      0.65      0.73       649

    accuracy                           0.79      1523
   macro avg       0.80      0.78      0.78      1523
weighted avg       0.80      0.79      0.79      1523



Logistic Regression and Naive Bayes gave very close results. LR is better at finding true positives (437 vs. 423), while Naive Bayes does better on true negatives (787 vs. 779).

### SVC
As an alternative to LR and Naive Bayes, I also looked at the results with SVC.  
I wrapped it in OneVsRestClassifier, although for a binary problem SVC handles the two classes directly and the wrapper is not strictly needed.

In [None]:
onevrest = OneVsRestClassifier(SVC())

In [None]:
for train_index, test_index in kf.split(X_uni):
  X_uni_train, X_uni_test = X_uni[train_index], X_uni[test_index]
  Y_uni_train, Y_uni_test = train.target[train_index], train.target[test_index] 
  uni_svc = onevrest.fit(X_uni_train, Y_uni_train)
  uni_svc_predicted_test = uni_svc.predict(X_uni_test)
  print(confusion_matrix(Y_uni_test, uni_svc_predicted_test))
  print(classification_report(Y_uni_test, uni_svc_predicted_test))

[[916  30]
 [316 261]]
              precision    recall  f1-score   support

           0       0.74      0.97      0.84       946
           1       0.90      0.45      0.60       577

    accuracy                           0.77      1523
   macro avg       0.82      0.71      0.72      1523
weighted avg       0.80      0.77      0.75      1523

[[853  41]
 [351 278]]
              precision    recall  f1-score   support

           0       0.71      0.95      0.81       894
           1       0.87      0.44      0.59       629

    accuracy                           0.74      1523
   macro avg       0.79      0.70      0.70      1523
weighted avg       0.78      0.74      0.72      1523

[[757  35]
 [481 250]]
              precision    recall  f1-score   support

           0       0.61      0.96      0.75       792
           1       0.88      0.34      0.49       731

    accuracy                           0.66      1523
   macro avg       0.74      0.65      0.62      1523
weigh

If all cells are run in sequence, RAM runs out, so objects that are no longer needed are deleted.

In [None]:
del uni_svc
del uni_svc_predicted_test

In [None]:
for train_index, test_index in kf.split(X_bi):
  X_bi_train, X_bi_test = X_bi[train_index], X_bi[test_index]
  Y_bi_train, Y_bi_test = train.target[train_index], train.target[test_index] 
  bi_svc = onevrest.fit(X_bi_train, Y_bi_train)
  bi_svc_predicted_test = bi_svc.predict(X_bi_test)
  print(confusion_matrix(Y_bi_test, bi_svc_predicted_test))
  print(classification_report(Y_bi_test, bi_svc_predicted_test))
  del bi_svc
  del bi_svc_predicted_test

[[946   0]
 [552  25]]
              precision    recall  f1-score   support

           0       0.63      1.00      0.77       946
           1       1.00      0.04      0.08       577

    accuracy                           0.64      1523
   macro avg       0.82      0.52      0.43      1523
weighted avg       0.77      0.64      0.51      1523

[[894   0]
 [600  29]]
              precision    recall  f1-score   support

           0       0.60      1.00      0.75       894
           1       1.00      0.05      0.09       629

    accuracy                           0.61      1523
   macro avg       0.80      0.52      0.42      1523
weighted avg       0.76      0.61      0.48      1523

[[791   1]
 [714  17]]
              precision    recall  f1-score   support

           0       0.53      1.00      0.69       792
           1       0.94      0.02      0.05       731

    accuracy                           0.53      1523
   macro avg       0.74      0.51      0.37      1523
weigh

I ran GridSearchCV with a OneVsRest pipeline to find the best parameters.  
Although SVC has many parameters that could be fine-tuned, I used the same parameter grid as for Naive Bayes.

In [None]:
onevrest_pipe = Pipeline([('vectorizer', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('onevrest', OneVsRestClassifier(SVC()))])

In [None]:
parameters_onevrest = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'tfidf__use_idf': (True, False)}
gs_onevrest = GridSearchCV(onevrest_pipe, parameters_onevrest, n_jobs=-1)
gs_onevrest = gs_onevrest.fit(X_train, Y_train)

In [None]:
print(gs_onevrest.best_score_)
print(gs_onevrest.best_params_)

0.8021346469622331
{'tfidf__use_idf': False, 'vectorizer__ngram_range': (1, 1)}


Unlike LR and Naive Bayes, SVC gives its best result with unigrams only and without IDF weighting (`use_idf=False`).  
I listed the confusion matrix and scores obtained with the best estimator.

In [None]:
y_onevrest_best_pred = gs_onevrest.best_estimator_.predict(X_test)
print(confusion_matrix(Y_test, y_onevrest_best_pred))
print(classification_report(Y_test, y_onevrest_best_pred))

[[781  93]
 [208 441]]
              precision    recall  f1-score   support

           0       0.79      0.89      0.84       874
           1       0.83      0.68      0.75       649

    accuracy                           0.80      1523
   macro avg       0.81      0.79      0.79      1523
weighted avg       0.81      0.80      0.80      1523



### Results
All models gave results close to each other. LR and Naive Bayes performed best with unigrams and bigrams together plus TF-IDF, while SVC performed best with unigrams only.

## Using Word Embeddings


### fastText
I downloaded the fastText command-line application from the Facebook Research GitHub page.

In [None]:
!wget https://github.com/facebookresearch/fastText/archive/0.2.0.zip
!unzip 0.2.0.zip
%cd fastText-0.2.0
!make

Because I used the command-line version, I created text files for training and testing.

In [None]:
index_file = 1
for train_index, test_index in kf.split(corpus):
  X_ft_train, X_ft_test = corpus_np[train_index], corpus_np[test_index]
  Y_ft_train, Y_ft_test = train.target[train_index], train.target[test_index] 
  train_dict = {'text': X_ft_train, 'target': Y_ft_train}
  test_dict = {'text': X_ft_test, 'target': Y_ft_test}
  df_ft_train = pd.DataFrame(train_dict)
  df_ft_test = pd.DataFrame(test_dict)
  f_train = open('train_' + str(index_file) + '.txt', 'w+')
  for index, row in df_ft_train.iterrows():
    f_train.write('__label__' + str(row.target) + ' ' +  row.text +"\n")
  f_test = open('test_' + str(index_file) + '.txt', 'w+')
  for index, row in df_ft_test.iterrows():
    f_test.write('__label__' + str(row.target) + ' ' +  row.text +"\n")
  f_train.close()
  f_test.close()
  index_file = index_file + 1
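Each line written above follows fastText's supervised input format, `__label__<label> <text>`. The sketch below shows the same idea with a context manager, so the file is closed even if a write fails. The helper names, the file name, and the second demo example are hypothetical.

```python
import os
import tempfile

def to_fasttext_line(label, text):
    """Format one example as fastText supervised input: '__label__<label> <text>'."""
    return f"__label__{label} {text}\n"

def write_fasttext_file(path, texts, labels):
    # The context manager closes the file even if a write fails.
    with open(path, "w", encoding="utf-8") as handle:
        for text, label in zip(texts, labels):
            handle.write(to_fasttext_line(label, text))

demo_path = os.path.join(tempfile.mkdtemp(), "train_demo.txt")  # hypothetical file name
write_fasttext_file(demo_path, ["forest fire near la ronge", "nice weather today"], [1, 0])
with open(demo_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

Iterating over `zip(texts, labels)` also avoids building an intermediate DataFrame for each fold.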

### Train & Test
Using 300 dimensions and bigrams, I trained and evaluated 5 different models, one for each of the 5 KFold splits.

In [None]:
!./fasttext supervised -input train_1.txt -output fasttext_model_1  -dim 300 -wordNgrams 2 
!./fasttext supervised -input train_2.txt -output fasttext_model_2  -dim 300 -wordNgrams 2
!./fasttext supervised -input train_3.txt -output fasttext_model_3  -dim 300 -wordNgrams 2
!./fasttext supervised -input train_4.txt -output fasttext_model_4  -dim 300 -wordNgrams 2
!./fasttext supervised -input train_5.txt -output fasttext_model_5  -dim 300 -wordNgrams 2

Read 0M words
Number of words:  14172
Number of labels: 2
tcmalloc: large alloc 2417008640 bytes == 0x5602f8632000 @  0x7fdc3b75e887 0x5602f01920d3 0x5602f01aca5e 0x5602f01b2d92 0x5602f017c197 0x7fdc3a7fbbf7 0x5602f017c45a
Progress: 100.0% words/sec/thread:   94645 lr:  0.000000 loss:  0.471875 ETA:   0h 0m
Read 0M words
Number of words:  14051
Number of labels: 2
tcmalloc: large alloc 2416869376 bytes == 0x56538a39a000 @  0x7f35bda94887 0x565380f750d3 0x565380f8fa5e 0x565380f95d92 0x565380f5f197 0x7f35bcb31bf7 0x565380f5f45a
Progress: 100.0% words/sec/thread:   94227 lr:  0.000000 loss:  0.424461 ETA:   0h 0m
Read 0M words
Number of words:  14192
Number of labels: 2
tcmalloc: large alloc 2417033216 bytes == 0x55cc72502000 @  0x7f29dbfb7887 0x55cc68ac00d3 0x55cc68adaa5e 0x55cc68ae0d92 0x55cc68aaa197 0x7f29db054bf7 0x55cc68aaa45a
Progress: 100.0% words/sec/thread:   93875 lr:  0.000000 loss:  0.420099 ETA:   0h 0m
Read 0M words
Number of words:  14389
Number of labels: 2
tcmalloc: larg

The test command prints the number of examples (N) together with the precision and recall values.

In [None]:
!./fasttext test fasttext_model_1.bin test_1.txt
!./fasttext test fasttext_model_2.bin test_2.txt
!./fasttext test fasttext_model_3.bin test_3.txt
!./fasttext test fasttext_model_4.bin test_4.txt
!./fasttext test fasttext_model_5.bin test_5.txt

tcmalloc: large alloc 2417008640 bytes == 0x55bc426e8000 @  0x7f24d29e8887 0x55bc4119208f 0x55bc411a2934 0x55bc411a33c7 0x55bc411b3a0b 0x55bc4117a3f5 0x7f24d1a85bf7 0x55bc4117a45a
N	1523
P@1	0.783
R@1	0.783
tcmalloc: large alloc 2416869376 bytes == 0x561589fc2000 @  0x7fb741b46887 0x5615881ff08f 0x56158820f934 0x5615882103c7 0x561588220a0b 0x5615881e73f5 0x7fb740be3bf7 0x5615881e745a
N	1523
P@1	0.744
R@1	0.744
tcmalloc: large alloc 2417033216 bytes == 0x559db4dd6000 @  0x7f3765170887 0x559db315708f 0x559db3167934 0x559db31683c7 0x559db3178a0b 0x559db313f3f5 0x7f376420dbf7 0x559db313f45a
N	1523
P@1	0.703
R@1	0.703
tcmalloc: large alloc 2417270784 bytes == 0x55eecf1d4000 @  0x7f7ec4b03887 0x55eecced808f 0x55eeccee8934 0x55eeccee93c7 0x55eeccef9a0b 0x55eeccec03f5 0x7f7ec3ba0bf7 0x55eeccec045a
N	1522
P@1	0.736
R@1	0.736
tcmalloc: large alloc 2417188864 bytes == 0x563f0a5e0000 @  0x7fd81545f887 0x563f07e0408f 0x563f07e14934 0x563f07e153c7 0x563f07e25a0b 0x563f07dec3f5 0x7fd8144fcbf7 0x563f0
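P@1 and R@1 are identical in every fold above. This is expected when each tweet has exactly one true label and the model returns exactly one prediction, because both metrics then reduce to plain accuracy. A small sketch with made-up labels:

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """P@1 and R@1 when each example has one true label and one prediction."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = hits / len(predicted_labels)   # correct / number of predictions made
    recall = hits / len(true_labels)           # correct / number of true labels
    return precision, recall

# Made-up labels for illustration only.
p_at_1, r_at_1 = precision_recall_at_1([1, 0, 1, 1], [1, 0, 0, 1])
print(p_at_1, r_at_1)
```

Per-class precision and recall therefore have to be computed separately from the predicted labels, which is what the following cells do.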

To get the predicted labels, I called predict and saved its output.

In [None]:
!./fasttext predict fasttext_model_1.bin test_1.txt > labels_1.txt
!./fasttext predict fasttext_model_2.bin test_2.txt > labels_2.txt
!./fasttext predict fasttext_model_3.bin test_3.txt > labels_3.txt
!./fasttext predict fasttext_model_4.bin test_4.txt > labels_4.txt
!./fasttext predict fasttext_model_5.bin test_5.txt > labels_5.txt

tcmalloc: large alloc 2417008640 bytes == 0x55aac6ac4000 @  0x7efd19cab887 0x55aac4f2008f 0x55aac4f30934 0x55aac4f313c7 0x55aac4f3fa70 0x55aac4f0831e 0x7efd18d48bf7 0x55aac4f0845a
tcmalloc: large alloc 2416869376 bytes == 0x55bcd91f2000 @  0x7fe542d32887 0x55bcd803408f 0x55bcd8044934 0x55bcd80453c7 0x55bcd8053a70 0x55bcd801c31e 0x7fe541dcfbf7 0x55bcd801c45a
tcmalloc: large alloc 2417033216 bytes == 0x5647b31da000 @  0x7fd7bb163887 0x5647b10f108f 0x5647b1101934 0x5647b11023c7 0x5647b1110a70 0x5647b10d931e 0x7fd7ba200bf7 0x5647b10d945a
tcmalloc: large alloc 2417270784 bytes == 0x5556c5e9e000 @  0x7f1dddc7c887 0x5556c341e08f 0x5556c342e934 0x5556c342f3c7 0x5556c343da70 0x5556c340631e 0x7f1ddcd19bf7 0x5556c340645a
tcmalloc: large alloc 2417188864 bytes == 0x55838d6d6000 @  0x7f75115e5887 0x55838b95108f 0x55838b961934 0x55838b9623c7 0x55838b970a70 0x55838b93931e 0x7f7510682bf7 0x55838b93945a


I collected the labels for each fold into a DataFrame and computed the confusion matrix.

In [None]:
for i in range(1, 6):
  df_predict = pd.read_csv('labels_' + str(i) + '.txt', delimiter = "\t", header=None, names=['label'])
  df_predict['label'] = df_predict.label.str.replace('__label__', '')
  df_test = pd.read_csv('test_' + str(i) + '.txt', delimiter = "\t", header=None, names=['label'])
  df_test['label'] = df_test.label.apply(lambda x: re.findall("__label__[01]", x)[0])
  df_test['label'] = df_test.label.str.replace('__label__', '')
  print(confusion_matrix(df_test['label'], df_predict['label']))
  print(classification_report(df_test['label'], df_predict['label']))

[[884  62]
 [269 308]]
              precision    recall  f1-score   support

           0       0.77      0.93      0.84       946
           1       0.83      0.53      0.65       577

    accuracy                           0.78      1523
   macro avg       0.80      0.73      0.75      1523
weighted avg       0.79      0.78      0.77      1523

[[788 106]
 [284 345]]
              precision    recall  f1-score   support

           0       0.74      0.88      0.80       894
           1       0.76      0.55      0.64       629

    accuracy                           0.74      1523
   macro avg       0.75      0.71      0.72      1523
weighted avg       0.75      0.74      0.73      1523

[[703  89]
 [364 367]]
              precision    recall  f1-score   support

           0       0.66      0.89      0.76       792
           1       0.80      0.50      0.62       731

    accuracy                           0.70      1523
   macro avg       0.73      0.69      0.69      1523
weigh

With an average accuracy of 0.75, it gave a slightly worse result than the other models.

## BERT
Taken from the notebook I worked on with my instructor, Mücahid Hoca.\
Link: https://colab.research.google.com/drive/1_sY8ClRubVlNyX0yKo0cDFC4sww_eBAw?usp=sharing

In [None]:
%pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/1a/41c644c963249fd7f3836d926afa1e3f1cc234a1c40d80c5f03ad8f6f1b2/transformers-4.8.2-py3-none-any.whl (2.5MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting huggingface-hub==0.0.12
  Downloading https://files.pythonhosted.org/packages/2f/ee/97e253668fda9b17e968b3f97b2f8e53aa0127e8807

In [None]:
import torch
import torch.nn as nn
from tqdm.notebook import tqdm
from collections import defaultdict
from torch.utils.data import DataLoader, Dataset, Subset
from transformers import BertForSequenceClassification
from transformers import BertModel
from transformers import AdamW, get_linear_schedule_with_warmup
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

The results of two different BERT variants are compared.  
BERT Base was pretrained on BooksCorpus and English Wikipedia.  
The RoBERTa model was trained on Twitter data.  

## BERT Base
I defined the parameters for BERT and the optimizer.

In [None]:
# BERT Parameters
h_preprocess_mode = "bert-base-uncased"
h_max_len = 128
h_batch_size = 16
h_epoch = 5

# Adam Optimizer Parameters
h_learning_rate = 2e-6
h_eps = 1e-8

### Tokenizer
I downloaded the pretrained tokenizer that turns words into tokens.

In [None]:
tokenizer = BertTokenizer.from_pretrained(h_preprocess_mode)

### Device Control
Google Colab provides CUDA support. Still, the output should be checked to confirm that a GPU is actually available.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


### Create Special Dataset
To feed inputs to BERT, I prepared a custom dataset whose items are dictionaries.  
I passed the data to the model through dataloaders built on this dataset.

In [None]:
class BERTDataset(Dataset):
  def __init__(self, text, label, tokenizer, max_len):
    self.text = text
    self.label = label
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.text)
  
  def __getitem__(self, item):
    text = str(self.text[item])
    encoding = self.tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=self.max_len,
        return_token_type_ids=False,
        padding='max_length',
        return_attention_mask=True,
        truncation=True,
        return_tensors='pt'
    )

    return {
        'text': text,
        'input_ids': encoding['input_ids'].flatten(),
        'attention_mask': encoding['attention_mask'].flatten(),
        'labels': torch.tensor(self.label[item], dtype=torch.long)
    }
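The `padding='max_length'`, `truncation=True`, and `return_attention_mask=True` options give every item the same length plus a mask that marks the real tokens. The sketch below is a simplified illustration of that idea, not the tokenizer's implementation: the token ids are hypothetical, `pad_id=0` is an assumption, and the real tokenizer keeps the closing special token when it truncates.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token ids to max_len and build the matching attention mask."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding   # 0 marks padding

# Hypothetical token ids; real ids come from the tokenizer's vocabulary.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(input_ids, attention_mask)
```

The model uses the mask to ignore the padded positions, so short tweets are not affected by the fixed length of 128.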

### Prepare data
I created dataloaders to iterate over BERTDataset.  
I built three dataloaders, for train, validation, and test, and collected them in a list.  
The val dataset is used between epochs to check for overfitting.  



In [None]:
def create_data_loaders(tokenizer, max_len, batch_size):
  ds = BERTDataset(
      text=corpus_np,
      label=train.target.to_numpy(),
      tokenizer=tokenizer,
      max_len=max_len
  )
  train_idx, test_idx = train_test_split(list(range(len(ds))), test_size=0.20)
  datasets = {}
  train_val = Subset(ds, train_idx)
  train_idx, val_idx = train_test_split(list(range(len(train_val))), test_size=0.25)
  datasets['train'] = Subset(train_val, train_idx)
  datasets['val'] = Subset(train_val, val_idx)
  datasets['test'] = Subset(ds, test_idx)
  print(len(datasets['train']))
  print(len(datasets['test']))
  print(len(datasets['val']))

  # Use the batch_size argument instead of the global h_batch_size
  return [DataLoader(x, batch_size=batch_size, num_workers=2) for _, x in datasets.items()]

Number of rows in each split, printed in the order train, test, val:

In [None]:
dataloaders = create_data_loaders(tokenizer, h_max_len, h_batch_size)

4567
1523
1523
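The printed sizes follow from two successive splits: 20% of all rows are held out for test, then 25% of the remainder for validation, which works out to roughly 60/20/20. A minimal sketch of the arithmetic, assuming the held-out share is rounded up (this assumption reproduces the sizes above); `split_sizes` is an illustrative helper.

```python
import math

def split_sizes(n_samples, test_share=0.20, val_share=0.25):
    """Sizes of the train, validation and test subsets after two successive splits."""
    n_test = math.ceil(n_samples * test_share)    # assumption: held-out share is rounded up
    n_train_val = n_samples - n_test
    n_val = math.ceil(n_train_val * val_share)
    return n_train_val - n_val, n_val, n_test

print(split_sizes(7613))
```

Because no `random_state` is passed to these splits, the rows that end up in each subset change from run to run.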


### Build Classifier
I created a PyTorch module for binary classification.

In [None]:
class Classifier(nn.Module):
  def __init__(self, n_classes):
    super(Classifier, self).__init__()
    self.bert = BertModel.from_pretrained(h_preprocess_mode)
    self.drop = nn.Dropout(0.3)
    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
    self.softmax = nn.Softmax(dim=1)

  def forward(self, input_ids, attention_mask):
    _, pooled_output = self.bert(
        input_ids=input_ids,
        attention_mask=attention_mask,
        return_dict=False
    )
    output = self.drop(pooled_output)
    return self.out(output)
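`forward` returns raw logits and never calls `self.softmax`. That is consistent with `nn.CrossEntropyLoss`, which applies log-softmax internally, so a softmax is only needed when probabilities are wanted at inference time. The sketch below shows both computations in plain Python with made-up logits.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    return [e / sum(exps) for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    return -math.log(softmax(logits)[target])

logits = [0.2, 1.5]  # made-up output for one tweet: [not disaster, disaster]
print(softmax(logits), cross_entropy(logits, target=1))
```

Applying softmax before the loss would normalize the scores twice and flatten the gradients, so passing logits directly is the safer choice.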

I created an instance of this Classifier and moved it to the CUDA device.

In [None]:
model = Classifier(2)
model = model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Setting up optimizer
I defined the optimizer that updates the weights after backpropagation

In [None]:
optimizer = AdamW(model.parameters(),
                  lr=h_learning_rate,
                  correct_bias=False, 
                  eps=h_eps)

### Get scheduler
I defined a learning-rate scheduler and passed the optimizer to it as input.  

In [None]:
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(dataloaders[0])*h_epoch)
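
To make the schedule concrete, here is a minimal pure-Python sketch of the multiplier a linear schedule with warmup applies to the base learning rate. `linear_lr_factor` is a hypothetical name used only for illustration, and the figures assume 4567 training rows, a batch size of 16 and 5 epochs.

```python
import math

def linear_lr_factor(step, total_steps, warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

steps_per_epoch = math.ceil(4567 / 16)  # assumed: 4567 rows, batch size 16
total_steps = steps_per_epoch * 5       # assumed: 5 epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr_factor(step, total_steps))
```

With `num_warmup_steps=0` the learning rate starts at its full value and falls linearly to zero by the last batch of the last epoch, so later epochs change the weights less and less.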

### Loss function
I defined cross entropy as the loss function and moved it to CUDA

In [None]:
loss_fn = nn.CrossEntropyLoss().to(device)
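
`nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, which is why `forward` returns the output of the linear layer directly and never calls `self.softmax`. The sketch below shows the same computation for a single example in plain Python; `cross_entropy_from_logits` is a hypothetical helper written for illustration only.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Equal logits mean a 50/50 guess, so the loss is log(2)
print(cross_entropy_from_logits([0.0, 0.0], 0))
# A confident, correct prediction gives a small loss
print(cross_entropy_from_logits([2.0, -1.0], 0))
# The same logits with the other label give a large loss
print(cross_entropy_from_logits([2.0, -1.0], 1))
```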

### Training
I defined the training function that runs once per epoch  


In [None]:
def train_epoch(
    model,
    dataloader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    n_examples
):
  model = model.train()
  losses = []
  correct_predictions = 0
  for d in tqdm(dataloader):
    input_ids = d['input_ids'].to(device)
    attention_mask = d['attention_mask'].to(device)
    labels = d['labels'].to(device)

    outputs = model(
        input_ids,
        attention_mask
    )

    _, preds = torch.max(outputs, dim=1)
    loss = loss_fn(outputs, labels)

    correct_predictions += torch.sum(preds == labels)
    # Detach before storing so each batch's autograd graph can be freed
    losses.append(loss.detach())

    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
  
  return correct_predictions.double() / n_examples, torch.mean(torch.stack(losses))
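
The `clip_grad_norm_` call in the loop rescales all gradients together whenever their combined L2 norm exceeds `max_norm`. A minimal sketch of that rule on a flat list of numbers follows; `clip_by_global_norm` is a hypothetical stand-in, not the PyTorch function itself.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale is capped at 1.0, so small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the update is preserved; only its length is limited, which helps keep fine-tuning stable when a single batch produces unusually large gradients.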

### Validation
I defined the validation function that runs once per epoch  
Unlike training, no backward pass is performed  


In [None]:
def val_epoch(
    model,
    dataloader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    n_examples
):
  model = model.eval()  # disable dropout during evaluation
  losses = []
  correct_predictions = 0
  with torch.no_grad():
    for d in tqdm(dataloader):
      input_ids = d['input_ids'].to(device)
      attention_mask = d['attention_mask'].to(device)
      labels = d['labels'].to(device)

      outputs = model(
          input_ids,
          attention_mask
      )

      _, preds = torch.max(outputs, dim=1)
      loss = loss_fn(outputs, labels)

      correct_predictions += torch.sum(preds == labels)
      losses.append(loss)
      del input_ids
      del attention_mask
      del labels
      del preds
      del loss
 
  return correct_predictions.double() / n_examples, torch.mean(torch.stack(losses))

### Training Loop
Because CUDA memory was insufficient, I trained the model and saved a checkpoint after every epoch  
Later, for validation, I loaded each checkpoint with load_state_dict, found the best model, and used that one

In [None]:
for epoch in tqdm(range(h_epoch)):
  train_acc, train_loss = train_epoch(
      model,
      dataloaders[0],
      loss_fn,
      optimizer,
      device,
      scheduler,
      len(dataloaders[0].dataset)
  )
  torch.save(model.state_dict(), "/content/gdrive/MyDrive/models/bert/model"+ str(epoch) + ".bin")

  tqdm.write(f'Train Loss: {train_loss}')
  tqdm.write(f'Train Acc: {train_acc}') 

Train Loss: 0.42626693844795227
Train Acc: 0.8184804028903

Train Loss: 0.3599497675895691
Train Acc: 0.8567987738121305

Train Loss: 0.34179550409317017
Train Acc: 0.8705933873439895

Train Loss: 0.3252500891685486
Train Acc: 0.877819137289249

Train Loss: 0.3172646760940552
Train Acc: 0.878476023647909



Running this segment straight after training raises a CUDA out-of-memory error. I tried saving the models, reconnecting to the environment, and running it again, but that was not enough.  
Thanks to
[CUDA Memory fix](https://discuss.pytorch.org/t/out-of-memory-error-during-evaluation-but-training-works-fine/12274/24) for the solution to the CUDA memory error

In [None]:
for i in tqdm(range(h_epoch)):
  model = Classifier(2)
  model = model.to(device)
  model.load_state_dict(torch.load("/content/gdrive/MyDrive/models/bert/model"+ str(i) + ".bin"))
  val_acc, val_loss = val_epoch(
      model,
      dataloaders[2],
      loss_fn,
      optimizer,
      device,
      scheduler,
      len(dataloaders[2].dataset)
  )
  tqdm.write(f'Val Loss: {val_loss}')
  tqdm.write(f'Val Acc: {val_acc}')

Val Loss: 0.4315555989742279
Val Acc: 0.8135259356533158

Val Loss: 0.4606506824493408
Val Acc: 0.8102429415627052

Val Loss: 0.4680980145931244
Val Acc: 0.8115561391989494

Val Loss: 0.46683669090270996
Val Acc: 0.809586342744583

Val Loss: 0.4716014862060547
Val Acc: 0.8056467498358503



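### Best model
I loaded the best checkpoint (epoch 0, which had the highest validation accuracy at 0.8135 and the lowest validation loss) and used it in the get_texts function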

In [None]:
best_model = Classifier(2)
best_model.load_state_dict(torch.load("/content/gdrive/MyDrive/models/bert/model0.bin"))
best_model = best_model.to(device)
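
The choice of `model0.bin` can be reproduced from the validation accuracies reported above. This is only a sketch of the selection step: the list is copied from the notebook output and `best_epoch` is a name introduced here for illustration.

```python
# Validation accuracy per saved checkpoint (epochs 0 to 4), copied from the output above
val_acc = [
    0.8135259356533158,
    0.8102429415627052,
    0.8115561391989494,
    0.809586342744583,
    0.8056467498358503,
]

# Index of the checkpoint with the highest validation accuracy
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__)
checkpoint_name = "model" + str(best_epoch) + ".bin"
print(best_epoch, checkpoint_name)
```

Training accuracy kept rising over the five epochs while validation accuracy peaked at the first one, which suggests the later epochs mostly overfit the training split.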



### Get Label function
Predicts the class of a given text using the model

In [None]:
def get_label(text):
    # Encode a single text, run it through best_model, and return the class id (0 or 1)
    encoded_review = tokenizer.encode_plus(
        text,
        max_length=h_max_len,
        add_special_tokens=True,
        return_token_type_ids=False,
        padding='max_length',
        return_attention_mask=True,
        truncation=True,
        return_tensors='pt',
    )
    input_ids = encoded_review['input_ids'].to(device)
    attention_mask = encoded_review['attention_mask'].to(device)
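    best_model.eval()  # disable dropout for inference
    with torch.no_grad():  # no gradients are needed for prediction
        output = best_model(input_ids, attention_mask)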
    _, prediction = torch.max(output, dim=1)
    return prediction.item()

### Predict
I compared the predictions with the true labels on the val dataset (`dataloaders[1]`)

In [None]:
def get_texts(model, dataloader):
  model = model.eval()  # disable dropout for evaluation
  texts = []
  predictions = []
  prediction_probs = []
  real_values = []

  with torch.no_grad():
      for d in dataloader:
        # Separate name so the accumulator list is not overwritten
        batch_texts = d['text']
        input_ids = d['input_ids'].to(device)
        attention_mask = d['attention_mask'].to(device)
        labels = d['labels'].to(device)

        # Use the model passed in rather than a global
        outputs = model(
          input_ids,
          attention_mask
        )

        _, preds = torch.max(outputs, dim=1)

        texts.extend(batch_texts)
        predictions.extend(preds)
        # Convert logits to probabilities so the name matches the content
        prediction_probs.extend(torch.softmax(outputs, dim=1))
        real_values.extend(labels)

  predictions = torch.stack(predictions).to(device)
  prediction_probs = torch.stack(prediction_probs).to(device)
  real_values = torch.stack(real_values).to(device)

  return texts, predictions, prediction_probs, real_values


In [None]:
y_texts, y_preds, y_pred_probs, y_test = get_texts(best_model, dataloaders[1])

In [None]:
print(confusion_matrix(y_test.cpu(), y_preds.cpu()))
print(classification_report(y_test.cpu(), y_preds.cpu()))

[[790  78]
 [160 495]]
              precision    recall  f1-score   support

           0       0.83      0.91      0.87       868
           1       0.86      0.76      0.81       655

    accuracy                           0.84      1523
   macro avg       0.85      0.83      0.84      1523
weighted avg       0.85      0.84      0.84      1523
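
The class 1 row of the report can be checked by hand from the confusion matrix, whose rows are true labels and whose columns are predictions. The sketch below does that arithmetic; `binary_report` is a hypothetical helper, not a scikit-learn function.

```python
def binary_report(cm):
    """Precision, recall, F1 for class 1 plus overall accuracy from a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm  # rows: true class 0/1, columns: predicted class 0/1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

p, r, f1, acc = binary_report([[790, 78], [160, 495]])
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))
```

Recall for the disaster class (0.76) is the weakest figure: 160 of the 655 disaster tweets were predicted as non-disaster.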



## RoBERTa
I chose this model because it was trained on Twitter data.  
No recommended parameters are listed for it, so I tried the parameters from the previous experiment

In [None]:
# Roberta Parameters
h_preprocess_mode = 'cardiffnlp/twitter-roberta-base-sentiment'
h_max_len = 128
h_batch_size = 16
h_epoch = 5

# Adam Optimizer Parameters
h_learning_rate = 1e-5
h_eps = 1e-8

In [None]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification, RobertaModel

### Tokenizer
I downloaded the pretrained tokenizer to turn words into tokens

In [None]:
tokenizer = RobertaTokenizer.from_pretrained(h_preprocess_mode)


### Device Control
Google Colab provides CUDA infrastructure. It is still worth checking the output

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


### Create Special Dataset
To feed the data to RoBERTa, I prepared a custom dataset that returns dictionaries  
I converted the train and test datasets to RobertaDataset

In [None]:
class RobertaDataset(Dataset):
  def __init__(self, text, label, tokenizer, max_len):
    self.text = text
    self.label = label
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.text)
  
  def __getitem__(self, item):
    text = str(self.text[item])
    # Use the tokenizer stored on the instance, not the global one
    encoding = self.tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=self.max_len,
        return_token_type_ids=False,
        padding='max_length',
        return_attention_mask=True,
        truncation=True,
        return_tensors='pt'
    )

    return {
        'text': text,
        'input_ids': encoding['input_ids'].flatten(),
        'attention_mask': encoding['attention_mask'].flatten(),
        'labels': torch.tensor(self.label[item], dtype=torch.long)
    }
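
The `padding='max_length'` and `truncation=True` options make every item the same length, and the attention mask tells the model which positions hold real tokens. A simplified sketch of that idea follows. `pad_and_mask` is a hypothetical helper, the pad id of 1 is an assumption about the RoBERTa vocabulary, and unlike the real tokenizer this naive cut does not keep the closing special token when truncating.

```python
def pad_and_mask(token_ids, max_len, pad_id=1):
    """Truncate or right-pad token ids to max_len and build the attention mask."""
    ids = token_ids[:max_len]        # naive truncation (assumption, see note above)
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

ids, mask = pad_and_mask([0, 100, 2], max_len=5)
print(ids)
print(mask)
```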

### Prepare data
I created dataloaders to iterate over the RobertaDataset  
I created three separate dataloaders for train, validation, and test and collected them in a list


In [None]:
def create_data_loaders(tokenizer, max_len, batch_size):
  ds = RobertaDataset(
      text=corpus_np,
      label=train.target.to_numpy(),
      tokenizer=tokenizer,
      max_len=max_len
  )
  train_idx, test_idx = train_test_split(list(range(len(ds))), test_size=0.20)
  datasets = {}
  train_val = Subset(ds, train_idx)
  train_idx, val_idx = train_test_split(list(range(len(train_val))), test_size=0.25)
  datasets['train'] = Subset(train_val, train_idx)
  datasets['val'] = Subset(train_val, val_idx)
  datasets['test'] = Subset(ds, test_idx)
  print(len(datasets['train']))
  print(len(datasets['test']))
  print(len(datasets['val']))

  # Use the batch_size argument; the list order follows dict insertion: [train, val, test]
  return [DataLoader(x, batch_size=batch_size, num_workers=2) for _, x in datasets.items()]

Number of rows in each split, printed in the order train, test, val

In [None]:
dataloaders = create_data_loaders(tokenizer, h_max_len, h_batch_size)

4567
1523
1523


### Build Classifier
I created a PyTorch module for binary classification

In [None]:
class Classifier(nn.Module):
  def __init__(self, n_classes):
    super(Classifier, self).__init__()
    self.roberta = RobertaModel.from_pretrained(h_preprocess_mode, return_dict=False)
    self.drop = nn.Dropout(0.3)
    self.out = nn.Linear(self.roberta.config.hidden_size, n_classes)
    self.softmax = nn.Softmax(dim=1)

  def forward(self, input_ids, attention_mask):
    _, pooled_output = self.roberta(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    output = self.drop(pooled_output)
    return self.out(output)

I created an instance of this Classifier and moved it to CUDA

In [None]:
model = Classifier(2)
model = model.to(device)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment were not used when initializing RobertaModel: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictio

### Setting up optimizer
I defined the optimizer that updates the weights after backpropagation

In [None]:
optimizer = AdamW(model.parameters(),
                  lr=h_learning_rate,
                  correct_bias=False, 
                  eps=h_eps)

### Get scheduler
I defined a learning-rate scheduler and passed the optimizer to it as input. 

In [None]:
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(dataloaders[0])*h_epoch)

### Loss function
I defined cross entropy as the loss function and moved it to CUDA

In [None]:
loss_fn = nn.CrossEntropyLoss().to(device)

### Training
I defined the training function that runs once per epoch  


In [None]:
def train_epoch(
    model,
    dataloader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    n_examples
):
  model = model.train()
  losses = []
  correct_predictions = 0
  for d in tqdm(dataloader):
    input_ids = d['input_ids'].to(device)
    attention_mask = d['attention_mask'].to(device)
    labels = d['labels'].to(device)

    outputs = model(
        input_ids,
        attention_mask
    )

    _, preds = torch.max(outputs, dim=1)
    loss = loss_fn(outputs, labels)

    correct_predictions += torch.sum(preds == labels)
    # Detach before storing so each batch's autograd graph can be freed
    losses.append(loss.detach())

    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
  
  return correct_predictions.double() / n_examples, torch.mean(torch.stack(losses))

### Validation
I defined the validation function that runs once per epoch  
Unlike training, no backward pass is performed  



In [None]:
def val_epoch(
    model,
    dataloader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    n_examples
):
  model = model.eval()  # disable dropout during evaluation
  losses = []
  correct_predictions = 0
  with torch.no_grad():
    for d in tqdm(dataloader):
      input_ids = d['input_ids'].to(device)
      attention_mask = d['attention_mask'].to(device)
      labels = d['labels'].to(device)

      outputs = model(
          input_ids,
          attention_mask
      )

      _, preds = torch.max(outputs, dim=1)
      loss = loss_fn(outputs, labels)

      correct_predictions += torch.sum(preds == labels)
      losses.append(loss)
      del input_ids
      del attention_mask
      del labels
      del preds
      del loss
 
  return correct_predictions.double() / n_examples, torch.mean(torch.stack(losses))

### Training Loop
Because CUDA memory was insufficient, I trained the model and saved a checkpoint after every epoch  
Later, for validation, I loaded each checkpoint with load_state_dict, found the best model, and used that one

In [None]:
for epoch in tqdm(range(h_epoch)):
  train_acc, train_loss = train_epoch(
      model,
      dataloaders[0],
      loss_fn,
      optimizer,
      device,
      scheduler,
      len(dataloaders[0].dataset)
  )
  torch.save(model.state_dict(), "/content/gdrive/MyDrive/models/roberta/model"+ str(epoch) + ".bin")

  tqdm.write(f'Train Loss: {train_loss}')
  tqdm.write(f'Train Acc: {train_acc}') 

Train Loss: 0.4784523546695709
Train Acc: 0.7906722137070287

Train Loss: 0.339642733335495
Train Acc: 0.8659951828333698

Train Loss: 0.24923035502433777
Train Acc: 0.9049704401138603

Train Loss: 0.2176932394504547
Train Acc: 0.9277425005474054

Train Loss: 0.19093620777130127
Train Acc: 0.9397854171228378



Running this segment straight after training raises a CUDA out-of-memory error. I tried saving the models, reconnecting to the environment, and running it again, but that did not work.  
The fix I used for BERT made it work here as well 

In [None]:
for i in tqdm(range(h_epoch)):
  model = Classifier(2)
  model = model.to(device)
  model.load_state_dict(torch.load("/content/gdrive/MyDrive/models/roberta/model"+ str(i) + ".bin"))
  val_acc, val_loss = val_epoch(
      model,
      dataloaders[2],
      loss_fn,
      optimizer,
      device,
      scheduler,
      len(dataloaders[2].dataset)
  )
  tqdm.write(f'Val Loss: {val_loss}')
  tqdm.write(f'Val Acc: {val_acc}')

Val Loss: 0.42052945494651794
Val Acc: 0.8194353250164149

Val Loss: 0.399945467710495
Val Acc: 0.8384766907419566

Val Loss: 0.4880884289741516
Val Acc: 0.840446487196323

Val Loss: 0.5814012289047241
Val Acc: 0.8266579120157583

Val Loss: 0.632189154624939
Val Acc: 0.830597504924491



### Best model
I loaded the best checkpoint (epoch 2, which had the highest validation accuracy at 0.8404) and used it in the get_texts function

In [None]:
best_model = Classifier(2)
best_model.load_state_dict(torch.load("/content/gdrive/MyDrive/models/roberta/model2.bin"))
best_model = best_model.to(device)


### Get Label function
Predicts the class of a given text using the model

In [None]:
def get_label(text):
    # Encode a single text, run it through best_model, and return the class id (0 or 1)
    encoded_review = tokenizer.encode_plus(
        text,
        max_length=h_max_len,
        add_special_tokens=True,
        return_token_type_ids=False,
        padding='max_length',
        return_attention_mask=True,
        truncation=True,
        return_tensors='pt',
    )
    input_ids = encoded_review['input_ids'].to(device)
    attention_mask = encoded_review['attention_mask'].to(device)
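    best_model.eval()  # disable dropout for inference
    with torch.no_grad():  # no gradients are needed for prediction
        output = best_model(input_ids, attention_mask)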
    _, prediction = torch.max(output, dim=1)
    return prediction.item()

### Predict
I compared the predictions with the true labels on the val dataset (`dataloaders[1]`)


In [None]:
def get_texts(model, dataloader):
  model = model.eval()  # disable dropout for evaluation
  texts = []
  predictions = []
  prediction_probs = []
  real_values = []

  with torch.no_grad():
      for d in dataloader:
        # Separate name so the accumulator list is not overwritten
        batch_texts = d['text']
        input_ids = d['input_ids'].to(device)
        attention_mask = d['attention_mask'].to(device)
        labels = d['labels'].to(device)

        # Use the model passed in rather than a global
        outputs = model(
          input_ids,
          attention_mask
        )

        _, preds = torch.max(outputs, dim=1)

        texts.extend(batch_texts)
        predictions.extend(preds)
        # Convert logits to probabilities so the name matches the content
        prediction_probs.extend(torch.softmax(outputs, dim=1))
        real_values.extend(labels)

  predictions = torch.stack(predictions).to(device)
  prediction_probs = torch.stack(prediction_probs).to(device)
  real_values = torch.stack(real_values).to(device)

  return texts, predictions, prediction_probs, real_values

In [None]:
y_texts, y_preds, y_pred_probs, y_test = get_texts(best_model, dataloaders[1])

In [None]:
print(confusion_matrix(y_test.cpu(), y_preds.cpu()))
print(classification_report(y_test.cpu(), y_preds.cpu()))

[[773 118]
 [153 479]]
              precision    recall  f1-score   support

           0       0.83      0.87      0.85       891
           1       0.80      0.76      0.78       632

    accuracy                           0.82      1523
   macro avg       0.82      0.81      0.82      1523
weighted avg       0.82      0.82      0.82      1523

