# ❇️LSTM #

##  ` 👉Task description` ### 


## 🇵🇱 Task 2.2.8 (optional)

A new dataset from `KLEJ`: [PolEmo2.0-IN](https://clarin-pl.eu/dspace/handle/11321/710). It is a collection of online reviews from the medicine and hotel domains. The task is to predict the sentiment of each review.

* input: `../input/klej/klej_polemo2.0-in/train.tsv`
* models: `../models/word2vec/klej_polemo2.0-in_train`

##  ` 👉Library imports` ### 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf

import random
import os

from gensim.utils import simple_preprocess
import tensorflow.keras.preprocessing.text as kpt 

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Flatten
from tensorflow.keras.layers import Dropout

from itertools import product
import mlflow
import time
import datetime
import warnings

import team_helper as th
from team_helper import recall, precision, f1, get_y
from team_helper import random_polemo2_opinion as random_opinion



##  ` 👉Suppressing unnecessary warnings + mlflow setup` ### 

In [2]:
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)

In [2]:
mlflow.set_tracking_uri("file:///home/jovyan/nlp2/shared-mlruns/team-three/mariusz")

##  ` 👉Attempt to ensure experiment reproducibility` ### 

In [4]:
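seed_value = 0
# Note: PYTHONHASHSEED set here only reaches subprocesses; the running interpreter fixed its hash seed at startup.
os.environ['PYTHONHASHSEED'] = str(seed_value)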
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)
tf.compat.v1.set_random_seed(seed_value)

##  ` 👉Dataset exploration` ### 

In [5]:
df = pd.read_csv('../input/klej/klej_polemo2.0-in/train.tsv', sep='\t')

In [6]:
df

| Unnamed: 0 | sentence | target |
|---|---|---|
| 0 | Super lekarz i człowiek przez duże C . Bardzo ... | `__label__meta_plus_m` |
| 1 | Bardzo olewcze podejscie do pacjenta . Przypro... | `__label__meta_minus_m` |
| 2 | Lekarz zalecił mi kurację alternatywną do doty... | `__label__meta_amb` |
| 3 | Konsumenci oczywiście kierują się ceną . Te  l... | `__label__meta_zero` |
| 4 | Pani Doktor Iwona jest profesjonalistką w każd... | `__label__meta_plus_m` |
| ... | ... | ... |
| 5739 | Centralne skrzydło jest jednokondygnacyjne , z... | `__label__meta_zero` |
| 5740 | Ogólnie w hotelu panuje balagan informacyjny -... | `__label__meta_minus_m` |
| 5741 | Przybyli śmy z rodziną na krotki wypoczynek . ... | `__label__meta_amb` |
| 5742 | Opinię może wyrazić dzisiaj każdy i jest to je... | `__label__meta_zero` |


In [7]:
df['target'].value_counts(normalize=True)

__label__meta_minus_m    0.380223
__label__meta_plus_m     0.270369
__label__meta_amb        0.181929
__label__meta_zero       0.167479
Name: target, dtype: float64
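
The class shares above give a simple reference point for the later results: a model that always predicts the most frequent class would be right about 38% of the time on this training set. A minimal sketch of that arithmetic, with the proportions copied from the output above (the variable names are illustrative only):

```python
# Class proportions copied from the value_counts output above.
class_share = {
    "__label__meta_minus_m": 0.380223,
    "__label__meta_plus_m": 0.270369,
    "__label__meta_amb": 0.181929,
    "__label__meta_zero": 0.167479,
}

# A majority-class baseline always predicts the most frequent label,
# so its accuracy equals the share of that label.
majority_label = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_label]

print(majority_label, round(baseline_accuracy, 3))
```

Any validation accuracy reported by the experiment is worth reading against this baseline rather than against 25%, because the four classes are not balanced.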

##  ` 👉Extracting variables and preprocessing` ### 

In [27]:
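# Assumes nlp_sm is an already loaded spaCy Polish pipeline (it is not defined in this notebook).
def polish_tokenizer_sm(doc):
    # Return lemmas, skipping punctuation and the "-PRON-" placeholder lemma.
    return [token.lemma_ for token in nlp_sm(doc) if not token.is_punct and token.lemma_ != "-PRON-"]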

In [8]:
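# Tokens of the original sentences (alternative input; the experiment below uses Xprep).
X_oryg = df['sentence'].map(th.simple_tokens)
# Lemmatize with the helper's spaCy tokenizer, rejoin the lemmas into a string,
# apply th.preprocessing, then split with gensim's simple_preprocess.
Xprep = df['sentence'].map(th.polish_tokenizer_sm).apply(lambda x: ' '.join(x)).map(th.preprocessing).map(simple_preprocess)

y = get_y(df, 'target')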
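
The model later uses `y.shape[1]` as the number of output neurons and trains with `categorical_crossentropy`, so `y` has to be a two-dimensional one-hot matrix. The implementation of `get_y` lives in `team_helper` and is not shown here; the sketch below is only a hypothetical stand-in (`one_hot_labels` is an invented name) that illustrates the expected shape of the result using plain Python lists.

```python
def one_hot_labels(labels):
    """Hypothetical stand-in for get_y: map each label to a one-hot row."""
    classes = sorted(set(labels))              # fixed, reproducible column order
    column = {c: i for i, c in enumerate(classes)}
    rows = [[1 if column[label] == j else 0 for j in range(len(classes))]
            for label in labels]
    return rows, classes


labels = ["__label__meta_plus_m", "__label__meta_minus_m",
          "__label__meta_amb", "__label__meta_zero", "__label__meta_plus_m"]
y_demo, classes = one_hot_labels(labels)

print(classes)
print(y_demo[0])   # row for "__label__meta_plus_m"
```

With four sentiment labels every row has four columns, which is why the final `Dense` layer ends up with four softmax outputs.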

In [9]:
Xprep

0       [super, lekarz, człowiek, przez, duży, bardzo,...
1       [bardzo, olewcze, podejscie, do, pacjent, przy...
2       [lekarz, zalecić, ja, kuracja, alternatywny, d...
3       [konsument, oczywiście, kierować, się, cena, t...
4       [pani, doktor, iwon, jest, każdym, cal, iść, d...
                              ...                        
5739    [centralny, skrzydło, jest, zlokalizowane, są,...
5740    [ogólnie, hotel, panować, balagan, informacyjn...
5741    [przybyć, śmy, rodzina, na, krotki, wypoczynek...
5742    [opinia, móc, wyrazić, dzisiaj, każdy, jest, t...
5743    [apartamenty, wielki, budynek, basen, spokojny...
Name: sentence, Length: 5744, dtype: object

##  ` 👉Choosing experiment parameters` ### 

In [10]:
EXP_NAME = 'LSTM_polemo2_0_IN_X_prep_spacy_sm_tokenizer'
SELECTED_X = Xprep

# vocabulary size (number of most frequent words to keep)
num_words_list = [500, 300]

# maximum number of words kept per review (longer reviews are truncated)
maxlens_list = [60, 50, 40, 10]

# number of bidirectional LSTM layers (0 = a single unidirectional LSTM)
nobs_list = [0, 1, 2]

# length of the embedding vector representing each word
embedding_vector_lengths_list = [400, 500]

# number of LSTM units
units_list = [32, 64]

# number of epochs
epochs_list = [10]
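
Before launching a grid search it helps to know how many models it will train. The sketch below copies the lists from the cell above into a dictionary (the name `grid` is illustrative) and enumerates them with `itertools.product`, in the same argument order the experiment loop uses.

```python
from itertools import product

# Grid values copied from the cell above.
grid = {
    "num_words": [500, 300],
    "maxlen": [60, 50, 40, 10],
    "nob": [0, 1, 2],
    "embedding_vector_length": [400, 500],
    "unit": [32, 64],
    "epoch": [10],
}

combos = list(product(*grid.values()))
print(len(combos))   # number of models the grid search will train
print(combos[0])     # first combination, in the order of the grid keys
```

The product has 2 × 4 × 3 × 2 × 2 × 1 = 96 combinations, which is the total shown by the progress counter in the experiment output. The rightmost lists vary fastest, so consecutive runs differ first in `unit`, then in `embedding_vector_length`.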

##  ` 👉Helper functions` ### 

In [11]:
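def make_sequence(inputX, num_words, maxlen):
    # Build a frequency-ranked vocabulary; num_words caps the word indices that are kept.
    tokenizer = kpt.Tokenizer(num_words)
    tokenizer.fit_on_texts(inputX)
    # Map each review to a list of word indices; words outside the vocabulary are dropped.
    sequences = tokenizer.texts_to_sequences(inputX)
    # Pad with zeros or cut at the end so that every row has exactly maxlen entries.
    X = sequence.pad_sequences(sequences, maxlen, padding='post', truncating='post')
    return X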
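
To make the effect of `num_words` and `maxlen` concrete, here is a simplified pure-Python sketch of what `make_sequence` does. It is not the Keras implementation: `make_sequence_sketch` is a hypothetical name, and details such as lowercasing and character filtering are left out. It assumes the input is already a list of token lists, as `Xprep` is.

```python
from collections import Counter


def make_sequence_sketch(token_lists, num_words, maxlen):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Rank 1 is the most frequent token; ties keep first-seen order (sorted is stable).
    ranked = sorted(counts, key=lambda t: -counts[t])
    index = {tok: i for i, tok in enumerate(ranked, start=1)}
    rows = []
    for toks in token_lists:
        # Keep only tokens whose rank is below num_words; rarer tokens are dropped.
        ids = [index[t] for t in toks if index[t] < num_words]
        ids = ids[:maxlen]                            # truncating='post'
        rows.append(ids + [0] * (maxlen - len(ids)))  # padding='post' with zeros
    return rows


docs = [["dobry", "lekarz"], ["dobry", "hotel", "dobry", "basen"]]
demo = make_sequence_sketch(docs, num_words=3, maxlen=3)
print(demo)
```

With `num_words=3` only the tokens ranked 1 and 2 survive, so the second review loses "hotel" and "basen" before padding. Index 0 is reserved for padding, which is why an `Embedding` layer with `num_words` rows is large enough for these sequences.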

In [12]:
def get_multi_LSTM(nob, unit, num_words, embedding_vector_length, maxlen):
    # nob = number of bidirectional LSTM layers; 0 means one unidirectional LSTM.
    # The output size comes from the global one-hot target matrix y.
    if nob == 0:
        model = Sequential([
            Embedding(num_words, embedding_vector_length, input_length=maxlen),
            Dropout(0.9),
            LSTM(unit),
            Dense(y.shape[1], activation='softmax')
        ])
    elif nob == 1:
        model = Sequential([
            Embedding(num_words, embedding_vector_length, input_length=maxlen),
            Dropout(0.9),
            Bidirectional(LSTM(unit, dropout=0.5)),
            #Bidirectional(LSTM(unit)),
            #Dropout(0.5),
            Dense(y.shape[1], activation='softmax')
        ])
    elif nob == 2:
        model = Sequential([
            Embedding(num_words, embedding_vector_length, input_length=maxlen),
            Dropout(0.9),
            Bidirectional(LSTM(unit, dropout=0.5, return_sequences=True)),
            Bidirectional(LSTM(unit, dropout=0.5)),
            #Bidirectional(LSTM(unit, return_sequences=True)),
            #Bidirectional(LSTM(unit)),
            #Dropout(0.5),
            Dense(y.shape[1], activation='softmax')
        ])
    else:
        # Fail early instead of hitting an unbound `model` at the return statement.
        raise ValueError(f'Unsupported number of bidirectional layers: {nob}')
    return model
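
A rough parameter count shows how large these models are relative to the 5744 training reviews. The sketch below is an estimate, not output from Keras: it assumes the standard LSTM parameterisation with default settings (four gates, each with input weights, recurrent weights and a bias), concatenated outputs for `Bidirectional`, and four output classes as seen in the label distribution. The function names are hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def count_params(nob, unit, num_words, embedding_vector_length, n_classes=4):
    total = num_words * embedding_vector_length        # Embedding table
    if nob == 0:
        total += lstm_params(embedding_vector_length, unit)
        last = unit
    else:
        in_dim = embedding_vector_length
        for _ in range(nob):
            total += 2 * lstm_params(in_dim, unit)     # forward + backward LSTM
            in_dim = 2 * unit                          # concatenated outputs
        last = 2 * unit
    total += last * n_classes + n_classes              # Dense softmax layer
    return total


for nob in (0, 1, 2):
    print(nob, count_params(nob, unit=32, num_words=500, embedding_vector_length=400))
```

Under these assumptions the embedding table alone accounts for 200,000 of the weights in the first configuration, which may be one reason the models are regularised so heavily with dropout.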

##  ` 👉Experiment` ### 

In [None]:
params_cnt = len(num_words_list)*\
len(maxlens_list)*\
len(nobs_list)*\
len(embedding_vector_lengths_list)*\
len(units_list)*len(epochs_list)

exp_start = datetime.datetime.now()

for n,(num_words, maxlen, nob, embedding_vector_length, unit, epoch) in \
enumerate(product(num_words_list, maxlens_list, nobs_list, embedding_vector_lengths_list, units_list, epochs_list),1):
    
    X = make_sequence(SELECTED_X, num_words, maxlen)

    model_str = f'Embedding (max_top_words={num_words}, max_opinion_length={maxlen}, ' \
    + f'embedding_vector_lengths={embedding_vector_length}, units={unit}, epochs={epoch}, nobs={nob})'

    print(f'{n}/{params_cnt}')
    print(model_str)

    with mlflow.start_run(experiment_id=th._eid(EXP_NAME), run_name=model_str):

        model = get_multi_LSTM(nob, unit, num_words, embedding_vector_length, maxlen)

        start = time.time()
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', precision, recall, f1])
        history = model.fit( 
            X, y,
            epochs=epoch, batch_size=256,
            validation_split=0.1,
            #callbacks=[PlotLossesKeras()],
            verbose=0)
        end = time.time()
        
        print('train_f1_curr = ', round(history.history['f1'][-1],3))
        print('train_f1_max = ', round(max(history.history['f1']),3))
        print('val_f1_curr = ', round(history.history['val_f1'][-1],3))
        print('val_f1_max = ',round(max(history.history['val_f1']),3))
        
        print('time: ',round((end-start),2))
        print('-'*10)

        # log params
        mlflow.log_param('model', model_str)
        mlflow.log_param('max_opinion_length', maxlen)
        mlflow.log_param('max_top_words', num_words)
        mlflow.log_param('epoch', epoch)
        mlflow.log_param('unit', unit)
        mlflow.log_param('nob', nob)
        
        # log metrics
        mlflow.log_metric('train_accuracy_curr', history.history['accuracy'][-1])
        mlflow.log_metric('train_accuracy_max', max(history.history['accuracy']))
        mlflow.log_metric('val_accuracy_curr', history.history['val_accuracy'][-1])
        mlflow.log_metric('val_accuracy_max', max(history.history['val_accuracy']))
        
        mlflow.log_metric('train_precision_curr', history.history['precision'][-1])
        mlflow.log_metric('train_precision_max', max(history.history['precision']))
        mlflow.log_metric('val_precision_curr', history.history['val_precision'][-1])
        mlflow.log_metric('val_precision_max', max(history.history['val_precision']))

        mlflow.log_metric('train_recall_curr', history.history['recall'][-1])
        mlflow.log_metric('train_recall_max', max(history.history['recall']))
        mlflow.log_metric('val_recall_curr', history.history['val_recall'][-1])
        mlflow.log_metric('val_recall_max', max(history.history['val_recall']))
        
        mlflow.log_metric('train_f1_curr', history.history['f1'][-1])
        mlflow.log_metric('train_f1_max', max(history.history['f1']))
        mlflow.log_metric('val_f1_curr', history.history['val_f1'][-1])
        mlflow.log_metric('val_f1_max', max(history.history['val_f1']))
        
        mlflow.log_metric('loss', history.history['loss'][-1])
        mlflow.log_metric('calc_time', (end-start))
        

        # log artifacts
        fig = plt.figure()
        plt.plot(history.history['f1'])
        plt.plot(history.history['val_f1'])
        plt.title('model f1')
        plt.ylabel('f1')
        plt.xlabel('epoch')
        plt.legend(['train', 'valid'], loc='best')
        mlflow.log_figure(fig, "f1-plot.png")
        plt.close(fig)  # release the figure; otherwise figures accumulate across runs
        
        fig2 = plt.figure()
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title('model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'valid'], loc='best')
        mlflow.log_figure(fig2, "loss-plot.png")
        plt.close(fig2)  # release the figure; otherwise figures accumulate across runs
        
        
exp_end = datetime.datetime.now()
print('='*10)
print(f'Start of experiment: {exp_start}')
print(f'End of experiment: {exp_end}')
th.calculate_delta((exp_end-exp_start))

dfruns = mlflow.search_runs(experiment_ids=[th._eid(EXP_NAME)])

1/96
Embedding (max_top_words=500, max_opinion_length=60, embedding_vector_lengths=400, units=32, epochs=10, nobs=0)
train_f1_curr =  0.736
train_f1_max =  0.736
val_f1_curr =  0.688
val_f1_max =  0.695
time:  174.85
----------
2/96
Embedding (max_top_words=500, max_opinion_length=60, embedding_vector_lengths=400, units=64, epochs=10, nobs=0)
train_f1_curr =  0.732
train_f1_max =  0.732
val_f1_curr =  0.687
val_f1_max =  0.699
time:  226.87
----------
3/96
Embedding (max_top_words=500, max_opinion_length=60, embedding_vector_lengths=500, units=32, epochs=10, nobs=0)
train_f1_curr =  0.724
train_f1_max =  0.724
val_f1_curr =  0.679
val_f1_max =  0.679
time:  198.37
----------
4/96
Embedding (max_top_words=500, max_opinion_length=60, embedding_vector_lengths=500, units=64, epochs=10, nobs=0)
train_f1_curr =  0.734
train_f1_max =  0.737
val_f1_curr =  0.695
val_f1_max =  0.695
time:  260.3
----------
5/96
Embedding (max_top_words=500, max_opinion_length=60, embedding_vector_lengths=400, u

37/96
Embedding (max_top_words=500, max_opinion_length=10, embedding_vector_lengths=400, units=32, epochs=10, nobs=0)
train_f1_curr =  0.614
train_f1_max =  0.618
val_f1_curr =  0.522
val_f1_max =  0.524
time:  45.99
----------
38/96
Embedding (max_top_words=500, max_opinion_length=10, embedding_vector_lengths=400, units=64, epochs=10, nobs=0)
train_f1_curr =  0.629
train_f1_max =  0.63
val_f1_curr =  0.521
val_f1_max =  0.521
time:  55.4
----------
39/96
Embedding (max_top_words=500, max_opinion_length=10, embedding_vector_lengths=500, units=32, epochs=10, nobs=0)
train_f1_curr =  0.633
train_f1_max =  0.633
val_f1_curr =  0.513
val_f1_max =  0.529
time:  50.18
----------
40/96
Embedding (max_top_words=500, max_opinion_length=10, embedding_vector_lengths=500, units=64, epochs=10, nobs=0)
train_f1_curr =  0.634
train_f1_max =  0.634
val_f1_curr =  0.52
val_f1_max =  0.532
time:  58.53
----------
41/96
Embedding (max_top_words=500, max_opinion_length=10, embedding_vector_lengths=400, un

73/96
Embedding (max_top_words=300, max_opinion_length=40, embedding_vector_lengths=400, units=32, epochs=10, nobs=0)
train_f1_curr =  0.7
train_f1_max =  0.703
val_f1_curr =  0.71
val_f1_max =  0.71
time:  134.07
----------
74/96
Embedding (max_top_words=300, max_opinion_length=40, embedding_vector_lengths=400, units=64, epochs=10, nobs=0)
train_f1_curr =  0.702
train_f1_max =  0.703
val_f1_curr =  0.678
val_f1_max =  0.695
time:  174.11
----------
75/96
Embedding (max_top_words=300, max_opinion_length=40, embedding_vector_lengths=500, units=32, epochs=10, nobs=0)
train_f1_curr =  0.711
train_f1_max =  0.711
val_f1_curr =  0.665
val_f1_max =  0.682
time:  152.42
----------
76/96
Embedding (max_top_words=300, max_opinion_length=40, embedding_vector_lengths=500, units=64, epochs=10, nobs=0)
train_f1_curr =  0.726
train_f1_max =  0.726
val_f1_curr =  0.726
val_f1_max =  0.726
time:  195.76
----------
77/96
Embedding (max_top_words=300, max_opinion_length=40, embedding_vector_lengths=400,

In [13]:
dfruns.to_csv('LSTM_X_prep_spacy_sm_token.csv')
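
Once the runs are exported, the CSV can be ranked by validation F1 to find the most promising configuration. The sketch below uses only the standard library and a tiny in-memory stand-in for the file, with three rows whose values are copied from the log above (runs 1, 37 and 76). The column names assume MLflow's `params.<name>` and `metrics.<name>` naming in `search_runs` output; the real file contains more columns.

```python
import csv
import io

# Tiny stand-in for the exported CSV; values copied from the experiment log.
sample_csv = """params.max_top_words,params.max_opinion_length,params.unit,metrics.val_f1_max
500,60,32,0.695
500,10,32,0.524
300,40,64,0.726
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
# Sort so that the run with the highest validation F1 comes first.
rows.sort(key=lambda r: float(r["metrics.val_f1_max"]), reverse=True)
best = rows[0]

print(best["params.max_top_words"],
      best["params.max_opinion_length"],
      best["metrics.val_f1_max"])
```

To rank the real export, replace `io.StringIO(sample_csv)` with an open handle to `LSTM_X_prep_spacy_sm_token.csv`.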

In [None]:
!mlflow ui --backend-store-uri /home/jovyan/nlp2/shared-mlruns/team-three/mariusz --default-artifact-root /home/jovyan/nlp2/shared-mlruns/team-three/mariusz --port 5001

[2021-10-10 21:15:06 +0000] [100] [INFO] Starting gunicorn 20.1.0
[2021-10-10 21:15:06 +0000] [100] [INFO] Listening at: http://127.0.0.1:5001 (100)
[2021-10-10 21:15:06 +0000] [100] [INFO] Using worker: sync
[2021-10-10 21:15:06 +0000] [102] [INFO] Booting worker with pid: 102


In [42]:
!mlflow ui

[2021-10-05 17:25:19 +0000] [677] [INFO] Starting gunicorn 20.1.0
[2021-10-05 17:25:19 +0000] [677] [INFO] Listening at: http://127.0.0.1:5000 (677)
[2021-10-05 17:25:19 +0000] [677] [INFO] Using worker: sync
[2021-10-05 17:25:19 +0000] [679] [INFO] Booting worker with pid: 679
^C
[2021-10-05 17:26:37 +0000] [677] [INFO] Handling signal: int
[2021-10-05 17:26:37 +0000] [679] [INFO] Worker exiting (pid: 679)
