# Bag of words approach (RNN)
### Unigram:
- A neural network with many inputs (10k, 20k, etc.), where each input indicates whether a particular word is present (1) or absent (0). Only presence is recorded; the number of times a word occurs is not taken into account.
### N-gram:
- Each input to the network corresponds to a single word or to a sequence of up to N consecutive words. This means that word order is taken into account to some extent. The larger N is, the more word order is captured.
- Bi-gram means using sequences of two consecutive words.
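
The two encodings described above can be illustrated with a minimal pure-Python sketch. The helper names `ngrams` and `multi_hot` and the toy vocabularies are assumptions made for illustration; they are not part of Keras or of this notebook's code.

```python
def ngrams(tokens, n):
    """Return all word sequences of length 1..n as space-joined strings."""
    out = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out


def multi_hot(text, vocabulary, n=1):
    """Encode text as a 0/1 vector: 1 if the vocabulary term occurs at all."""
    present = set(ngrams(text.lower().split(), n))
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical toy vocabularies (a real one would have ~20k entries).
unigram_vocab = ["the", "movie", "was", "not", "good"]
bigram_vocab = unigram_vocab + ["not good", "was good"]

# Repeated words still give 1, because only presence is recorded.
print(multi_hot("the movie was good good good", unigram_vocab))

# With n=2 the two sentences differ in the bigram slots,
# which is how some word order is captured.
print(multi_hot("the movie was not good", bigram_vocab, n=2))
print(multi_hot("the movie was good", bigram_vocab, n=2))
```

With unigrams alone, the last two sentences differ only in the slot for "not"; the bigram slots "not good" and "was good" additionally separate them by word order.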

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import keras
from keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import daniel_functions as dfunc

### Read in data and small EDA:

In [2]:
# Load the datasets
X_test = pd.read_csv("data/Train_Test_splits/X_test_50proc_orig.csv")
X_train = pd.read_csv("data/Train_Test_splits/X_train_50proc_orig.csv")
y_test = pd.read_csv("data/Train_Test_splits/y_test_50proc.csv")
y_train = pd.read_csv("data/Train_Test_splits/y_train_50proc.csv")

y_test['sentiment'] = y_test['sentiment'].apply(lambda x: 1 if x == 'LABEL_1' else 0)
y_train['sentiment'] = y_train['sentiment'].apply(lambda x: 1 if x == 'LABEL_1' else 0)

In [3]:
print(len(X_train), len(X_test), len(y_train), len(y_test))
print(type(X_train), type(X_test), type(y_train), type(y_test))
print(f"X_train: {X_train.loc[0]},\n y_train: {y_train.loc[0]}")
print(f"X_train.head(1):\n {X_train.head(1)}")

25000 25000 25000 25000
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
X_train: review    One of the other reviewers has mentioned that ...
Name: 0, dtype: object,
 y_train: sentiment    1
Name: 0, dtype: int64
X_train.head(1):
                                               review
0  One of the other reviewers has mentioned that ...


## Unigram bag of words approach:
- Creates the dataset. `output_mode="multi_hot"` gives a 0/1 presence vector; since `ngrams` is not set, the tokens are single words (unigrams).

In [4]:
text_vectorization = layers.TextVectorization(max_tokens=20000, output_mode="multi_hot")
text_vectorization.adapt(X_train['review'])
X_train_unigram = text_vectorization(X_train['review'])
X_test_unigram = text_vectorization(X_test['review'])
y_train = y_train['sentiment'].to_numpy()
y_test = y_test['sentiment'].to_numpy()

Create a model:

In [5]:
model = dfunc.get_model()
model.summary()

In [6]:
callbacks = [keras.callbacks.ModelCheckpoint("bagofwords_unigram.keras", save_best_only=True)]
model.fit(x=X_train_unigram, y=y_train, validation_split=0.2, epochs=10, callbacks=callbacks, verbose=True)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7660 - loss: 0.4947 - val_accuracy: 0.8778 - val_loss: 0.2991
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8932 - loss: 0.2809 - val_accuracy: 0.8834 - val_loss: 0.2905
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9132 - loss: 0.2376 - val_accuracy: 0.8830 - val_loss: 0.3144
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9237 - loss: 0.2198 - val_accuracy: 0.8826 - val_loss: 0.3263
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9304 - loss: 0.2138 - val_accuracy: 0.8792 - val_loss: 0.3316
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9350 - loss: 0.2028 - val_accuracy: 0.8816 - val_loss: 0.3446
Epoch 7/10
625/625

<keras.src.callbacks.history.History at 0x11b2cfba120>

In [7]:
model = keras.models.load_model("bagofwords_unigram.keras")
# Make sure X_test_unigram is a tf.Tensor before evaluating (this does not itself check for None values)
X_test_unigram = tf.convert_to_tensor(X_test_unigram)
print(f"Test acc: {model.evaluate(X_test_unigram, y_test)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.8940 - loss: 0.2749
Test acc: 0.893


## Bi-gram:
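
The bi-gram cell in this section uses `output_mode="tf_idf"` rather than `"multi_hot"`, so each input is a weighted count instead of a 0/1 flag. The sketch below shows the general idea of TF-IDF weighting in pure Python. The helper `tf_idf_vectors`, the toy documents, and the smoothed IDF formula `log(1 + N / (1 + df))` are assumptions for illustration; the exact weighting Keras applies may differ.

```python
import math
from collections import Counter


def tf_idf_vectors(documents, vocabulary):
    """Return one TF-IDF vector per document, plus the IDF weight per term."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Assumed smoothed IDF: terms found in fewer documents get a larger weight.
    idf = {t: math.log(1 + n_docs / (1 + doc_freq[t])) for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency within this document
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vectors, idf


docs = ["good good movie", "bad movie", "good plot"]
vocab = ["good", "bad", "movie", "plot"]
vectors, idf = tf_idf_vectors(docs, vocab)

for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

Unlike the multi-hot encoding, a repeated word ("good good") now increases the value of its input, and a term that appears in fewer documents ("bad") gets a higher weight than one that appears in more ("good").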

In [12]:
text_vectorization = layers.TextVectorization(ngrams=2, max_tokens=20000, output_mode="tf_idf")
text_vectorization.adapt(X_train['review'])
X_train_bigram = text_vectorization(X_train['review'])
X_test_bigram = text_vectorization(X_test['review'])
# y_train and y_test were already converted to NumPy arrays in the unigram section,
# so no further conversion is needed here.

In [13]:
model_bigram = dfunc.get_model()
model_bigram.summary()

In [14]:
callbacks = [keras.callbacks.ModelCheckpoint("bagofwords_bigram.keras", save_best_only=True)]
model_bigram.fit(x=X_train_bigram, y=y_train, validation_split=0.2, epochs=10, callbacks=callbacks, verbose=True)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.7240 - loss: 0.5861 - val_accuracy: 0.8778 - val_loss: 0.3348
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.8776 - loss: 0.3094 - val_accuracy: 0.8860 - val_loss: 0.3194
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8984 - loss: 0.2680 - val_accuracy: 0.8870 - val_loss: 0.3209
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8979 - loss: 0.2625 - val_accuracy: 0.8828 - val_loss: 0.3357
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9054 - loss: 0.2377 - val_accuracy: 0.8792 - val_loss: 0.3452
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9119 - loss: 0.2300 - val_accuracy: 0.8812 - val_loss: 0.3535
Epoch 7/10
625/625

<keras.src.callbacks.history.History at 0x11b2cf327e0>

In [15]:
model = keras.models.load_model("bagofwords_bigram.keras")
X_test_bigram = tf.convert_to_tensor(X_test_bigram)
print(f"Test acc: {model.evaluate(X_test_bigram, y_test)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.8924 - loss: 0.3000
Test acc: 0.892
