<a href="https://colab.research.google.com/github/Lukas-Swc/neural-network-course/blob/main/03_keras/03_overfitting_underfitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Key problems in machine learning: overfitting and underfitting

>This notebook shows examples of a model that fits the training data too closely (overfitting) and of a model that fits the training data too poorly (underfitting).
>
>We use the IMDB dataset from the Keras library: 50000 movie reviews labeled by sentiment (positive/negative). The reviews are preprocessed, and each review is encoded as a sequence of word indices. Words are indexed by their overall frequency in the dataset, so rank 5 denotes the fifth most frequent word. Index 0 does not stand for a specific word. When the data is loaded with `index_from=3`, as it is below, every rank is shifted by 3 (the most frequent word, "the", becomes index 4) and indices 0, 1 and 2 are reserved for the padding, start-of-sequence and unknown-word tokens.

### Table of contents
1. [Importing libraries](#a1)
2. [Loading and preparing the data](#a2)
3. [Building the baseline model](#a3)
4. [Building a 'smaller' model](#a4)
5. [Building a 'bigger' model](#a5)
6. [Comparing model performance](#a6)
7. [Regularization methods](#a7)

### <a name='a1'></a> 1. Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

import tensorflow as tf

sns.set()
tf.__version__

'2.18.0'

### <a name='a2'></a> 2. Loading and preparing the data

In [2]:
NUM_WORDS = 10000   # keep only the 10000 most frequent words
INDEX_FROM = 3

(train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)

In [3]:
print(f'train_data shape: {train_data.shape}')
print(f'test_data shape: {test_data.shape}')

train_data shape: (25000,)
test_data shape: (25000,)


In [4]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [5]:
# Word -> rank mapping; shift ranks by INDEX_FROM to match the indices returned by load_data
word_to_idx = tf.keras.datasets.imdb.get_word_index()
word_to_idx = {k: (v + INDEX_FROM) for k, v in word_to_idx.items()}
# Reserved indices for the special tokens
word_to_idx['<PAD>'] = 0
word_to_idx['<START>'] = 1
word_to_idx['<UNK>'] = 2
word_to_idx['<UNUSED>'] = 3
# Invert the mapping and decode the first review back into text
idx_to_word = {v: k for k, v in word_to_idx.items()}
print(' '.join(idx_to_word[idx] for idx in train_data[0]))

<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for wha

In [6]:
train_labels[:10]

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

In [7]:
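def multi_hot_sequences(sequences, dimension):
    """Encode each sequence of word indices as a 0/1 vector of length `dimension`."""
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        # Set a 1.0 at every index that occurs in the review (word order and counts are dropped)
        results[i, word_indices] = 1.0
    return results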

train_data = multi_hot_sequences(train_data, dimension=NUM_WORDS)
test_data = multi_hot_sequences(test_data, dimension=NUM_WORDS)
train_data.shape

(25000, 10000)

In [8]:
test_data.shape

(25000, 10000)
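
To make the multi-hot encoding above easier to picture, here is a minimal sketch on a toy input. The function name `multi_hot` and the two toy sequences are assumptions made for illustration; the logic mirrors `multi_hot_sequences` from the cell above.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Encode each sequence of word indices as a 0/1 vector of length `dimension`."""
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        # Fancy indexing sets every listed column of row i to 1.0;
        # a repeated index is simply set twice, so counts are not kept.
        results[i, word_indices] = 1.0
    return results

# Toy data (hypothetical): two "reviews" over a vocabulary of 5 indices
toy_sequences = [[1, 3, 3], [0, 2]]
encoded = multi_hot(toy_sequences, dimension=5)
print(encoded)
```

Each row has a fixed length regardless of the length of the review, which is what the dense layers below need. Note that the real arrays are large: 25000 x 10000 values stored as `float64` take about 2 GB each.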

### <a name='a3'></a> 3. Building the baseline model

In [9]:
baseline_model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(NUM_WORDS,)),
    tf.keras.layers.Dense(units=16, activation='relu'),
    tf.keras.layers.Dense(units=16, activation='relu'),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

baseline_model.compile(optimizer='adam',
                       loss='binary_crossentropy',
                       metrics=['accuracy', 'binary_crossentropy'])

baseline_model.summary()
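
The output of `summary()` is not reproduced here, so the following sketch computes the number of trainable parameters by hand. It assumes only that a `Dense` layer with a bias has `inputs * units + units` parameters; the helper names `dense_params` and `count_params` are hypothetical. The layer widths (4, 16 and 512) are the ones used for the three models in this notebook.

```python
def dense_params(n_inputs, n_units):
    """Parameters of one fully connected layer: weight matrix plus bias vector."""
    return n_inputs * n_units + n_units

def count_params(input_dim, layer_units):
    """Sum the parameters of a stack of Dense layers fed with `input_dim` features."""
    total = 0
    n_inputs = input_dim
    for n_units in layer_units:
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units  # the output of this layer is the input of the next one
    return total

NUM_WORDS = 10000
layer_sizes = {
    'smaller': [4, 4, 1],
    'baseline': [16, 16, 1],
    'bigger': [512, 512, 1],
}
param_counts = {name: count_params(NUM_WORDS, units) for name, units in layer_sizes.items()}
for name, n_params in param_counts.items():
    print(f'{name}: {n_params:,} parameters')
```

Almost all parameters sit in the first layer because of the 10000-dimensional input, so the width of that layer is what mostly determines the capacity of each model.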

In [10]:
train_labels = train_labels.reshape(-1, 1)
test_labels = test_labels.reshape(-1, 1)

In [11]:
print(train_labels.shape)
print(test_labels.shape)

(25000, 1)
(25000, 1)


In [12]:
baseline_history = baseline_model.fit(train_data, train_labels, epochs=20, batch_size=512, validation_data=(test_data, test_labels))

Epoch 1/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 7s 116ms/step - accuracy: 0.7326 - binary_crossentropy: 0.5690 - loss: 0.5690 - val_accuracy: 0.8785 - val_binary_crossentropy: 0.3285 - val_loss: 0.3285
Epoch 2/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 7s 41ms/step - accuracy: 0.9093 - binary_crossentropy: 0.2599 - loss: 0.2599 - val_accuracy: 0.8886 - val_binary_crossentropy: 0.2833 - val_loss: 0.2833
Epoch 3/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 3s 59ms/step - accuracy: 0.9400 - binary_crossentropy: 0.1797 - loss: 0.1797 - val_accuracy: 0.8850 - val_binary_crossentropy: 0.2897 - val_loss: 0.2897
Epoch 4/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 5s 53ms/step - accuracy: 0.9557 - binary_crossentropy: 0.1392 - loss: 0.1392 - val_accuracy: 0.8798 - val_binary_crossentropy: 0.3122 - val_loss: 0.3122
Epoch 5/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 4s 71ms/step - accuracy: 0.

### <a name='a4'></a> 4. Building a 'smaller' model

In [13]:
smaller_model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(NUM_WORDS,)),
    tf.keras.layers.Dense(units=4, activation='relu'),
    tf.keras.layers.Dense(units=4, activation='relu'),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

smaller_model.compile(optimizer='adam',
                      loss='binary_crossentropy',
                      metrics=['accuracy', 'binary_crossentropy'])

smaller_model.summary()

In [14]:
smaller_history = smaller_model.fit(train_data, train_labels, epochs=20, batch_size=512, validation_data=(test_data, test_labels))

Epoch 1/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 7s 96ms/step - accuracy: 0.6036 - binary_crossentropy: 0.6592 - loss: 0.6592 - val_accuracy: 0.7888 - val_binary_crossentropy: 0.5371 - val_loss: 0.5371
Epoch 2/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 2s 38ms/step - accuracy: 0.8430 - binary_crossentropy: 0.4800 - loss: 0.4800 - val_accuracy: 0.8745 - val_binary_crossentropy: 0.4043 - val_loss: 0.4043
Epoch 3/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 2s 38ms/step - accuracy: 0.9095 - binary_crossentropy: 0.3367 - loss: 0.3367 - val_accuracy: 0.8851 - val_binary_crossentropy: 0.3285 - val_loss: 0.3285
Epoch 4/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 5s 81ms/step - accuracy: 0.9290 - binary_crossentropy: 0.2526 - loss: 0.2526 - val_accuracy: 0.8866 - val_binary_crossentropy: 0.2983 - val_loss: 0.2983
Epoch 5/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 3s 55ms/step - accuracy: 0.9

### <a name='a5'></a> 5. Building a 'bigger' model

In [16]:
bigger_model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(NUM_WORDS,)),
    tf.keras.layers.Dense(units=512, activation='relu'),
    tf.keras.layers.Dense(units=512, activation='relu'),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])
bigger_model.compile(optimizer='adam',
                     loss='binary_crossentropy',
                     metrics=['accuracy', 'binary_crossentropy'])

bigger_model.summary()

In [17]:
bigger_history = bigger_model.fit(train_data, train_labels, epochs=20, batch_size=512, validation_data=(test_data, test_labels))

Epoch 1/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 23s 430ms/step - accuracy: 0.8019 - binary_crossentropy: 0.4353 - loss: 0.4353 - val_accuracy: 0.8794 - val_binary_crossentropy: 0.2977 - val_loss: 0.2977
Epoch 2/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 45s 506ms/step - accuracy: 0.9491 - binary_crossentropy: 0.1488 - loss: 0.1488 - val_accuracy: 0.8734 - val_binary_crossentropy: 0.3329 - val_loss: 0.3329
Epoch 3/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 41s 515ms/step - accuracy: 0.9825 - binary_crossentropy: 0.0610 - loss: 0.0610 - val_accuracy: 0.8692 - val_binary_crossentropy: 0.4208 - val_loss: 0.4208
Epoch 4/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 38s 441ms/step - accuracy: 0.9983 - binary_crossentropy: 0.0117 - loss: 0.0117 - val_accuracy: 0.8688 - val_binary_crossentropy: 0.5614 - val_loss: 0.5614
Epoch 5/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 38s 388ms/step - acc

In [18]:
hist = pd.DataFrame(baseline_history.history)
hist['epoch'] = baseline_history.epoch
hist.head()

| | accuracy | binary_crossentropy | loss | val_accuracy | val_binary_crossentropy | val_loss | epoch |
|---|---|---|---|---|---|---|---|
| 0 | 0.81624 | 0.464861 | 0.464861 | 0.87852 | 0.328464 | 0.328464 | 0 |
| 1 | 0.91364 | 0.24469 | 0.24469 | 0.8886 | 0.283332 | 0.283332 | 1 |
| 2 | 0.93628 | 0.181422 | 0.181422 | 0.88504 | 0.289663 | 0.289663 | 2 |
| 3 | 0.95144 | 0.146008 | 0.146008 | 0.8798 | 0.312159 | 0.312159 | 3 |
| 4 | 0.9612 | 0.119475 | 0.119475 | 0.87344 | 0.342014 | 0.342014 | 4 |
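
One way to read overfitting off these numbers is to track the gap between validation and training loss. The sketch below is only an illustration: it re-enters the five baseline rows shown above by hand (the remaining 15 epochs are not shown), and the column name `gap` is an assumption.

```python
import pandas as pd

# First five epochs of the baseline model, copied from the table above
hist = pd.DataFrame({
    'epoch': [0, 1, 2, 3, 4],
    'binary_crossentropy': [0.464861, 0.24469, 0.181422, 0.146008, 0.119475],
    'val_binary_crossentropy': [0.328464, 0.283332, 0.289663, 0.312159, 0.342014],
})

# Generalization gap: validation loss minus training loss
hist['gap'] = hist['val_binary_crossentropy'] - hist['binary_crossentropy']

# Epoch at which the validation loss is lowest among the rows shown
best_row = hist['val_binary_crossentropy'].idxmin()
best_epoch = int(hist.loc[best_row, 'epoch'])

print(hist[['epoch', 'gap']])
print(f'lowest validation loss at epoch {best_epoch}')
```

In these rows the training loss keeps falling while the validation loss bottoms out at epoch 1 and then rises, so the gap widens every epoch. The negative gap at epoch 0 is most likely an artifact of how the metrics are computed: the training value is averaged over the whole epoch while the model is still improving, and the validation value is measured at the end of the epoch.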


### <a name='a6'></a> 6. Comparing model performance

In [19]:
import plotly.graph_objects as go

fig = go.Figure()
for name, history in zip(['smaller', 'baseline', 'bigger'], [smaller_history, baseline_history, bigger_history]):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    # One trace for the training loss and one for the validation loss of each model
    fig.add_trace(go.Scatter(x=hist['epoch'], y=hist['binary_crossentropy'], name=name + '_binary_crossentropy', mode='lines+markers'))
    fig.add_trace(go.Scatter(x=hist['epoch'], y=hist['val_binary_crossentropy'], name=name + '_val_binary_crossentropy', mode='lines+markers'))
fig.update_layout(xaxis_title='Epochs', yaxis_title='binary_crossentropy')
fig.show()

### <a name='a7'></a> 7. Regularization methods

In [22]:
l2_model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(NUM_WORDS,)),
    tf.keras.layers.Dense(16, kernel_regularizer=tf.keras.regularizers.l2(0.001), activation='relu'),
    tf.keras.layers.Dense(16, kernel_regularizer=tf.keras.regularizers.l2(0.01), activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

l2_model.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy', 'binary_crossentropy'])

l2_model.summary()
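
Because of the `kernel_regularizer` arguments, the `loss` that Keras reports for this model is the cross-entropy plus a weight penalty, which is why `loss` and `binary_crossentropy` differ in this model's training log. The NumPy sketch below illustrates the idea. The weight matrices and the value of `data_loss` are made-up toy numbers, and `l2_penalty` is a hypothetical helper that follows the documented behavior of `tf.keras.regularizers.l2` (the coefficient times the sum of squared weights).

```python
import numpy as np

def l2_penalty(weights, l2):
    """L2 regularization term: coefficient times the sum of squared weights."""
    return l2 * np.sum(np.square(weights))

# Toy kernels (hypothetical values), standing in for the two regularized Dense layers
w1 = np.array([[0.5, -1.0],
               [2.0, 0.0]])
w2 = np.array([[1.0],
               [-3.0]])

data_loss = 0.30  # hypothetical binary cross-entropy on a batch

# Same coefficients as in l2_model: 0.001 for the first layer, 0.01 for the second
penalty = l2_penalty(w1, 0.001) + l2_penalty(w2, 0.01)
total_loss = data_loss + penalty

print(f'penalty:    {penalty:.5f}')
print(f'total loss: {total_loss:.5f}')
```

Since the optimizer minimizes the total, large weights are only worth keeping if they reduce the cross-entropy by more than they add to the penalty. For that reason it is the `binary_crossentropy` metric, not `loss`, that should be compared with the unregularized baseline.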

In [23]:
l2_model_history = l2_model.fit(train_data, train_labels, epochs=20, batch_size=512, validation_data=(test_data, test_labels))

Epoch 1/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 9s 134ms/step - accuracy: 0.6965 - binary_crossentropy: 0.6066 - loss: 0.7897 - val_accuracy: 0.8714 - val_binary_crossentropy: 0.3717 - val_loss: 0.5342
Epoch 2/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 6s 46ms/step - accuracy: 0.8983 - binary_crossentropy: 0.3019 - loss: 0.4629 - val_accuracy: 0.8866 - val_binary_crossentropy: 0.2964 - val_loss: 0.4498
Epoch 3/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 3s 67ms/step - accuracy: 0.9253 - binary_crossentropy: 0.2232 - loss: 0.3736 - val_accuracy: 0.8858 - val_binary_crossentropy: 0.2850 - val_loss: 0.4253
Epoch 4/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 4s 54ms/step - accuracy: 0.9364 - binary_crossentropy: 0.1951 - loss: 0.3322 - val_accuracy: 0.8850 - val_binary_crossentropy: 0.2839 - val_loss: 0.4124
Epoch 5/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 5s 58ms/step - accuracy: 0.

In [24]:
fig = go.Figure()
for name, history in zip(['baseline', 'l2'], [baseline_history, l2_model_history]):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    # Plot the cross-entropy metric (without the L2 penalty) so both models are comparable
    fig.add_trace(go.Scatter(x=hist['epoch'], y=hist['binary_crossentropy'], name=name + '_binary_crossentropy', mode='lines+markers'))
    fig.add_trace(go.Scatter(x=hist['epoch'], y=hist['val_binary_crossentropy'], name=name + '_val_binary_crossentropy', mode='lines+markers'))
fig.update_layout(xaxis_title='Epochs', yaxis_title='binary_crossentropy')
fig.show()

In [25]:
dropout_model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(NUM_WORDS,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

dropout_model.compile(optimizer='adam',
                      loss='binary_crossentropy',
                      metrics=['accuracy', 'binary_crossentropy'])

dropout_model.summary()
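
As a rough picture of what the `Dropout(0.5)` layers do during training, here is a minimal NumPy sketch of inverted dropout. It is not the Keras implementation: the function `dropout_forward`, the toy activations and the random seed are assumptions for illustration. It only mirrors the documented behavior that units are zeroed with probability `rate` and the remaining ones are scaled by `1 / (1 - rate)`.

```python
import numpy as np

def dropout_forward(x, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        # At inference time the layer passes its input through unchanged
        return x
    keep_mask = rng.random(x.shape) >= rate  # True for units that are kept
    return x * keep_mask / (1.0 - rate)      # rescale so the expected sum is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1000, 16))  # toy batch: 1000 samples, 16 hidden units

dropped = dropout_forward(activations, rate=0.5, rng=rng)
at_inference = dropout_forward(activations, rate=0.5, rng=rng, training=False)

print('distinct values after dropout:', np.unique(dropped))
print('mean activation after dropout:', dropped.mean())
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single unit. This is also why the training metrics of the dropout model look worse than its validation metrics early on: validation runs with dropout switched off.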

In [26]:
dropout_model_history = dropout_model.fit(train_data, train_labels, epochs=20, batch_size=512, validation_data=(test_data, test_labels))

Epoch 1/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 7s 109ms/step - accuracy: 0.5638 - binary_crossentropy: 0.6757 - loss: 0.6757 - val_accuracy: 0.7706 - val_binary_crossentropy: 0.5759 - val_loss: 0.5759
Epoch 2/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 8s 56ms/step - accuracy: 0.7855 - binary_crossentropy: 0.5554 - loss: 0.5554 - val_accuracy: 0.8720 - val_binary_crossentropy: 0.4354 - val_loss: 0.4354
Epoch 3/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 5s 56ms/step - accuracy: 0.8502 - binary_crossentropy: 0.4412 - loss: 0.4412 - val_accuracy: 0.8846 - val_binary_crossentropy: 0.3519 - val_loss: 0.3519
Epoch 4/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 5s 56ms/step - accuracy: 0.8873 - binary_crossentropy: 0.3629 - loss: 0.3629 - val_accuracy: 0.8859 - val_binary_crossentropy: 0.3136 - val_loss: 0.3136
Epoch 5/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 5s 54ms/step - accuracy: 0.

In [27]:
fig = go.Figure()
# Label the second model 'dropout' so the legend matches the history being plotted
for name, history in zip(['baseline', 'dropout'], [baseline_history, dropout_model_history]):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    fig.add_trace(go.Scatter(x=hist['epoch'], y=hist['binary_crossentropy'], name=name + '_binary_crossentropy', mode='lines+markers'))
    fig.add_trace(go.Scatter(x=hist['epoch'], y=hist['val_binary_crossentropy'], name=name + '_val_binary_crossentropy', mode='lines+markers'))
fig.update_layout(xaxis_title='Epochs', yaxis_title='binary_crossentropy')
fig.show()

In [28]:
l2_dropout_model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(NUM_WORDS,)),
    tf.keras.layers.Dense(16, kernel_regularizer=tf.keras.regularizers.l2(0.001), activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, kernel_regularizer=tf.keras.regularizers.l2(0.01), activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

l2_dropout_model.compile(optimizer='adam',
                         loss='binary_crossentropy',
                         metrics=['accuracy', 'binary_crossentropy'])

l2_dropout_model.summary()

In [29]:
l2_dropout_model_history = l2_dropout_model.fit(train_data, train_labels, epochs=20, batch_size=512, validation_data=(test_data, test_labels))

Epoch 1/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 9s 145ms/step - accuracy: 0.5765 - binary_crossentropy: 0.6683 - loss: 0.8341 - val_accuracy: 0.8503 - val_binary_crossentropy: 0.5166 - val_loss: 0.6491
Epoch 2/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 6s 50ms/step - accuracy: 0.7543 - binary_crossentropy: 0.5244 - loss: 0.6513 - val_accuracy: 0.8768 - val_binary_crossentropy: 0.3820 - val_loss: 0.4964
Epoch 3/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 3s 69ms/step - accuracy: 0.8291 - binary_crossentropy: 0.4177 - loss: 0.5297 - val_accuracy: 0.8850 - val_binary_crossentropy: 0.3232 - val_loss: 0.4288
Epoch 4/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 4s 41ms/step - accuracy: 0.8672 - binary_crossentropy: 0.3566 - loss: 0.4606 - val_accuracy: 0.8877 - val_binary_crossentropy: 0.2955 - val_loss: 0.3951
Epoch 5/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 4s 71ms/step - accuracy: 0.

In [30]:
fig = go.Figure()
# Label the second model 'l2_dropout' so the legend matches the history being plotted
for name, history in zip(['baseline', 'l2_dropout'], [baseline_history, l2_dropout_model_history]):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    fig.add_trace(go.Scatter(x=hist['epoch'], y=hist['binary_crossentropy'], name=name + '_binary_crossentropy', mode='lines+markers'))
    fig.add_trace(go.Scatter(x=hist['epoch'], y=hist['val_binary_crossentropy'], name=name + '_val_binary_crossentropy', mode='lines+markers'))
fig.update_layout(xaxis_title='Epochs', yaxis_title='binary_crossentropy')
fig.show()
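
Besides the plots, the models can be compared in a small table. The sketch below is one possible way to do it, not part of the original notebook: `summarize_histories` is a hypothetical helper, and the `toy_histories` values are invented so that the block runs on its own. It relies only on the fact that `History.history` is a dictionary mapping metric names to per-epoch lists.

```python
import pandas as pd

def summarize_histories(histories, metric='val_binary_crossentropy'):
    """For each model, report the epoch with the lowest `metric` and the final value."""
    rows = []
    for name, history in histories.items():
        values = history[metric]
        best_epoch = min(range(len(values)), key=values.__getitem__)
        rows.append({
            'model': name,
            'best_epoch': best_epoch,
            'best_value': values[best_epoch],
            'final_value': values[-1],
        })
    return pd.DataFrame(rows).set_index('model')

# Invented numbers, only to demonstrate the helper
toy_histories = {
    'model_a': {'val_binary_crossentropy': [0.50, 0.30, 0.35, 0.45]},
    'model_b': {'val_binary_crossentropy': [0.60, 0.45, 0.38, 0.37]},
}
summary = summarize_histories(toy_histories)
print(summary)
```

In the notebook itself the helper could be called with a dictionary such as `{'baseline': baseline_history.history, 'l2': l2_model_history.history}`. A large difference between `best_value` and `final_value` indicates that the model kept training well past the point where it generalized best.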