<h1> Modelling and evaluation </h1>
<h2> 1. Import and download </h2>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score as ACC
from sklearn.ensemble import RandomForestClassifier

from tensorflow import keras
from keras import layers
from keras.layers import RNN, Dense, Dropout, BatchNormalization
from keras import Sequential, layers, Input, callbacks

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing all the datasets
train_A = pd.read_csv('data/train_A.csv')
train_B = pd.read_csv('data/train_B.csv')
train_C = pd.read_csv('data/train_C.csv')

val_A = pd.read_csv('data/val_A.csv')
val_B = pd.read_csv('data/val_B.csv')
val_C = pd.read_csv('data/val_C.csv')

test_A = pd.read_csv('data/test_A.csv')
test_B = pd.read_csv('data/test_B.csv')
test_C = pd.read_csv('data/test_C.csv')

<h2> 2. Data preprocessing </h2>

In [3]:
datasets = [train_A, val_A, test_A, 
            train_B, val_B, test_B, 
            train_C, val_C, test_C]


In [4]:
imp_char = ["FRODO", "SAM", "GANDALF", "PIPPIN", "MERRY", "GOLLUM", "GIMLI", "THEODEN", "FARAMIR", "ARAGORN"]

# Creating a common label for the characters not of interest
def common_label_removal(data):
    mask = data["char"].isin(imp_char)
    data.loc[~ mask, "char"] = "Rest"
    mask2 = data['char'] == 'Rest'
    data = data[~mask2]
    return data

def x_y_split(data):
    y_data = data['char']
    x_data = data.drop(columns=['char', 'dialog'])
    return x_data, y_data

def char_2_num(y_data):
    encoder = LabelEncoder()
    y_data = y_data.values.reshape(-1, 1)
    encoded_data = encoder.fit_transform(y_data)
    names = list(encoder.inverse_transform(np.unique(encoded_data)))
    print(names)
    print(np.unique(encoded_data))
    return encoded_data, names

def preprocessing(data):
    data = common_label_removal(data)
    x_data, y_data = x_y_split(data)
    y_data = char_2_num(y_data)
    return x_data, y_data

for i in range(len(datasets)):
    datasets[i] = preprocessing(datasets[i])

['ARAGORN', 'FARAMIR', 'FRODO', 'GANDALF', 'GIMLI', 'GOLLUM', 'MERRY', 'PIPPIN', 'SAM', 'THEODEN']
[0 1 2 3 4 5 6 7 8 9]
['ARAGORN', 'FARAMIR', 'FRODO', 'GANDALF', 'GIMLI', 'GOLLUM', 'MERRY', 'PIPPIN', 'SAM', 'THEODEN']
[0 1 2 3 4 5 6 7 8 9]
['ARAGORN', 'FARAMIR', 'FRODO', 'GANDALF', 'GIMLI', 'GOLLUM', 'MERRY', 'PIPPIN', 'SAM', 'THEODEN']
[0 1 2 3 4 5 6 7 8 9]
['ARAGORN', 'FARAMIR', 'FRODO', 'GANDALF', 'GIMLI', 'GOLLUM', 'MERRY', 'PIPPIN', 'SAM', 'THEODEN']
[0 1 2 3 4 5 6 7 8 9]
['ARAGORN', 'FARAMIR', 'FRODO', 'GANDALF', 'GIMLI', 'GOLLUM', 'MERRY', 'PIPPIN', 'SAM', 'THEODEN']
[0 1 2 3 4 5 6 7 8 9]
['ARAGORN', 'FARAMIR', 'FRODO', 'GANDALF', 'GIMLI', 'GOLLUM', 'MERRY', 'PIPPIN', 'SAM', 'THEODEN']
[0 1 2 3 4 5 6 7 8 9]
['ARAGORN', 'FARAMIR', 'FRODO', 'GANDALF', 'GIMLI', 'GOLLUM', 'MERRY', 'PIPPIN', 'SAM', 'THEODEN']
[0 1 2 3 4 5 6 7 8 9]
['ARAGORN', 'FARAMIR', 'FRODO', 'GANDALF', 'GIMLI', 'GOLLUM', 'MERRY', 'PIPPIN', 'SAM', 'THEODEN']
[0 1 2 3 4 5 6 7 8 9]
['ARAGORN', 'FARAMIR', 'FRODO', 

In [5]:
A_tra_X =datasets[0][0]
A_tra_y =datasets[0][1][0]
A_val_X =datasets[1][0]
A_val_y=datasets[1][1][0]
A_tar_X=datasets[2][0]
A_tar_y=datasets[2][1][0]

B_tra_X =datasets[3][0]
B_tra_y =datasets[3][1][0]
B_val_X =datasets[4][0]
B_val_y=datasets[4][1][0]
B_tar_X=datasets[5][0]
B_tar_y=datasets[5][1][0]

C_tra_X =datasets[6][0]
C_tra_y =datasets[6][1][0]
C_val_X =datasets[7][0]
C_val_y=datasets[7][1][0]
C_tar_X=datasets[8][0]
C_tar_y=datasets[8][1][0]

<h2> 2. Benchmarks </h2>
<h3> 2.1 Naive Benchmark, Monte Carlo Method </h3>
<p> Using 1000 simulations with random guesses on target labels. </p>

In [6]:
def naive_benchmark_MonC(y):
    accuracy_list = []
    for i in range(0,1000,1):
        naive_rand_pred = np.random.randint(0,12,size=(len(y)))
        accuracy_sel = ACC(naive_rand_pred, y)
        accuracy_list.append(accuracy_sel)
    return np.mean(accuracy_list)

In [7]:
naive_benchmark_MonC(A_tar_y)

0.08397863247863248

<h3> 2.2 Naive Benchmark, Majority Class Method </h3>
<p> Using Frodo, which equals label 2, as guess </p>

In [8]:
def naive_benchmark_MajC(y):
    pred_MCNB =np.repeat(2,len(y))
    return ACC(pred_MCNB, y)

In [9]:
naive_benchmark_MajC(A_tar_y)

0.1752136752136752

<h2> 3. Modelling  </h2>
<h3> 3.1 ANN on dataset A</h3>
<p> Dataset A contains various numerical retrieved from the characters. </p>
<p> The feedforward neural network has a relative simple architecture.

In [10]:
scaler = StandardScaler()
A1 = scaler.fit_transform(A_tra_X)
A2 = scaler.transform(A_val_X)
A3 = scaler.transform(A_tar_X)

Y1 = np.eye(10)[A_tra_y]
Y2 = np.eye(10)[A_val_y]
Y3 = np.eye(10)[A_tar_y]

In [11]:
# ann_model = keras.Sequential([
#     layers.Dense(8, activation='relu',input_dim=20),
#     layers.BatchNormalization(),
#     layers.Dropout(rate=0.3),
#     # layers.Dense(16, activation='selu'),
#     # layers.BatchNormalization(),
#     # layers.Dropout(0.3),
#     layers.Dense(10, activation='softmax'),
#     layers.Dense(10)
# ])

# optimizer = keras.optimizers.Adam(learning_rate=0.01)
# ann_model.compile(optimizer=optimizer,
#               loss = 'categorical_crossentropy',
#               metrics=['accuracy']
#               )

# early_stopping = callbacks.EarlyStopping(
#     min_delta=0.001, # minimium amount of change to count as an improvement
#     patience=35, # how many epochs to wait before stopping
#     restore_best_weights=True,
# )
# ann_model.fit(A1, Y1, 
#           validation_data= (A2, Y2),
#           epochs=200, batch_size=10, 
#           callbacks=early_stopping,
#           verbose=0
#           )

# print('Accuracy train: ',ann_model.evaluate(A1, Y1))
# print('Accuracy validation: ',ann_model.evaluate(A2, Y2))
# print('Accuracy test: ',ann_model.evaluate(A3, Y3))

In [12]:
from xgboost import XGBClassifier 
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

In [15]:
p_g = {
    'objective':['multi:softprob'],
    'alpha': hp.uniform('alpha',0,1),
    'gamma': hp.uniform('gamma',0,9),
    'reg_lambda':hp.quniform('reg_lamda',0,3,1),
    'max_depth':hp.quniform('max_depth',6,12,1),
    'learning_rate': hp.uniform('learning_rate',0.001,0.05),
    'n_estimators': hp.quniform('n_estimators', 5,500,1),
    'min_child_weight': hp.quniform('min_child_weight',0,5,1),
    'colsample_bytree' : hp.uniform('colsample_bytree', 0.5,1),
    'seed':42
    }

In [20]:
from sklearn.model_selection import cross_val_score

In [22]:
def bayopt_xgb(p_g):
    internal_model = XGBClassifier(
                     objective='multi:softprob',
                     alpha=p_g['alpha'],
                     gamma=p_g['gamma'],
                     reg_lambda= p_g['reg_lambda'],
                    #  colsample_bytree= p_q['colsample_bytree'],
                     max_depth = int(p_g['max_depth']),
                     n_estimator = (p_g['n_estimators']),
                     learning_rate=p_g['learning_rate'],
                    #  min_child_weight=p_g['min_child_weight'],
                     seed =p_g['seed'],
                     )
    # evaluation = [(A2, A_val_y)]

    internal_model.fit(A1, A_tra_y,
                     eval_set = [(A2, A_val_y)],
                     eval_metric = 'mlogloss',
                     early_stopping_rounds=25,verbose=False)
    
    # pred_valid = internal_model.predict(A2)
    # score = ACC(pred_valid, A_tra_y)

    score =np.mean(cross_val_score(internal_model, A1, A_tra_y, scoring='accuracy', cv=5))
    print('Score:', score)
    return {'loss':-score, 'status':STATUS_OK}

def tune():
    trials = Trials()
    best_tune = fmin(fn=bayopt_xgb, 
                    space=p_g,
                    algo= tpe.suggest,
                    max_evals=50,
                    trials=trials)
    return best_tune


ntune = tune()
ntune['n_estimators'] =  int(ntune['n_estimators'])
ntune['max_depth'] =  int(ntune['max_depth'])
xmodel = XGBClassifier(**ntune)

Score:                                                
0.2083387127397                                       
Score:                                                                        
0.15518891209417124                                                           
Score:                                                                        
0.16041389785456617                                                           
Score:                                                                        
0.19093221947977976                                                           
Score:                                                                        
0.19003987089424718                                                           
Score:                                                                        
0.21621036643250427                                                           
Score:                                                                            
0.190898044427567

In [23]:
xmodel.fit(A1, A_tra_y)

In [24]:
print('Accuracy train: ',ACC(xmodel.predict(A1),A_tra_y))
print('Accuracy validation: ',ACC(xmodel.predict(A2),A_val_y))
print('Accuracy test: ',ACC(xmodel.predict(A3),A_tar_y))

Accuracy train:  0.4115082824760244
Accuracy validation:  0.4115082824760244
Accuracy test:  0.2264957264957265


In [26]:
# A1a

In [25]:
# internal_model = XGBClassifier(
#                             objective='multi:softmax',
#                                 #  alpha=p_q['alpha'],
#                                 #  gamma=p_q['gamma'],
#                                 #  reg_lambda= p_q['reg_lambda'],
#                                 #  colsample_bytree= p_q['colsample_bytree'],
#                             # max_depth = int(p_g['max_depth']),
#                             max_depth = int(3),

#                             n_estimator = (p_g['n_estimators']),
#                             learning_rate=p_g['learning_rate'],
#                             #  min_child_weight=p_g['min_child_weight'],
#                             seed =p_g['seed'],
#                             )
# evaluation = [(A2, A_val_y)]

# internal_model.fit(A1, A_tra_y,
#                 eval_set = evaluation,
#                 eval_metric = 'mlogloss',
#                 early_stopping_rounds=25,verbose=False)
    
# pred_valid = internal_model.predict(A2)
# score = ACC(A2, A_tra_y)
#     # return pred_valid

# print('Score:', score)
# {'loss':-score, 'status':STATUS_OK}

# def tune():
#     trials = Trials()
#     best_tune = fmin(fn=internal_model, 
#                     space=p_g,
#                     algo= tpe.suggest,
#                     max_evals=200,
#                     trials=trials)
#     return best_tune


# ntune = tune()
# ntune['n_estimators'] =  int(ntune['n_estimators'])
# ntune['max_depth'] =  int(ntune['max_depth'])
# # xmodel = XGBClassifier(**ntune)

In [None]:

# def cvscore():
#     ntune = tune()
#     ntune['n_estimators'] =  int(ntune['n_estimators'])
#     ntune['max_depth'] =  int(ntune['max_depth'])
#     xmodel = XGBClassifier(**ntune, random_state=42)
#     cvs = cross_val_score(xmodel, A1, Y1, cv=25,
#                          random_state=42)
#     cvs.predict
#     return cvs.mean()

<h3> 3.2 RNN on dataset B </h3>
<p> Dataset B contains embeddings(?). This, I need to read myself up on.</p>

In [26]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten,Embedding,Dense

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense , Flatten ,Embedding,Input
from keras.models import Model



In [27]:
B1 = pd.read_csv('data/train_df.csv')
B2= pd.read_csv('data/val_df.csv')
B3 = pd.read_csv('data/test_df.csv')

In [28]:
B1 = common_label_removal(B1).reset_index(drop=True)
B2 = common_label_removal(B2).reset_index(drop=True)
B3 = common_label_removal(B3).reset_index(drop=True)

In [29]:
def quote_list(X):
    quote_list = []
    for quote in range(len(X)):
        splitted_quote =  X['dialog'][quote].split()
        sequence_list = []
        for split in range(len(splitted_quote)):
            splitted_word = splitted_quote[split]

            word_list = str()
            i=0
            while i < (len(splitted_word)):
                # print(splitted_word[i])
                if splitted_word[i].isalpha() == True:
                    word_list += splitted_word[i]
                i+=1
            sequence_list.append(word_list)
        quote_list.append(sequence_list)
    return quote_list

In [30]:
def maxlen(X):
    uni = []
    for i in range(len(quote_list)):
        for j in range(len(quote_list[i])):
            if quote_list[i][j] not in uni:
                uni.append(quote_list[i][j])
    return len(uni)

In [31]:
B1 = quote_list(B1)
B2 = quote_list(B2)
B3 = quote_list(B3)

In [32]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(B1)
B1_seq = tokenizer.texts_to_sequences(B1)
B2_seq = tokenizer.texts_to_sequences(B2)
B3_seq = tokenizer.texts_to_sequences(B3)
maxlen = max([len(seq) for seq in B1_seq])


B1_padseq = pad_sequences(B1_seq, maxlen=maxlen,padding='post')
B2_padseq = pad_sequences(B2_seq, maxlen=maxlen,padding='post')
B3_padseq = pad_sequences(B3_seq, maxlen=maxlen,padding='post')

B1y = np.eye(10)[B_tra_y]
B2y = np.eye(10)[B_val_y]
B3y = np.eye(10)[B_tar_y]

In [33]:
emb_model = Sequential([
    layers.Embedding(input_dim=2500, output_dim=15, input_length=maxlen),
    # layers.Flatten(),
    layers.LSTM(8,activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0,3),
    layers.Dense(32, activation='selu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation='gelu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')
])

optimizer = keras.optimizers.Adam(learning_rate=0.005)
emb_model.compile(optimizer=optimizer, 
            loss='categorical_crossentropy', 
            metrics=['accuracy'])

early_stopping = callbacks.EarlyStopping(
    min_delta=0.001, # minimium amount of change to count as an improvement
    patience=35, # how many epochs to wait before stopping
    restore_best_weights=True,
)

emb_model.fit(B1_padseq,B1y, epochs=100, batch_size=30, 
        validation_data=(B2_padseq, B2y),
        callbacks=early_stopping,)

Epoch 1/100
[1m39/39[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 22ms/step - accuracy: 0.1209 - loss: 2.3952 - val_accuracy: 0.1299 - val_loss: 2.2504
Epoch 2/100
[1m39/39[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - accuracy: 0.1347 - loss: 2.3081 - val_accuracy: 0.0984 - val_loss: 2.2546
Epoch 3/100
[1m39/39[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - accuracy: 0.1126 - loss: 2.3298 - val_accuracy: 0.1299 - val_loss: 2.2445
Epoch 4/100
[1m39/39[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - accuracy: 0.1236 - loss: 2.2828 - val_accuracy: 0.1299 - val_loss: 2.2578
Epoch 5/100
[1m39/39[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - accuracy: 0.1420 - loss: 2.2659 - val_accuracy: 0.1299 - val_loss: 2.2507
Epoch 6/100
[1m39/39[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - accuracy: 0.1166 - loss: 2.2792 - val_accuracy: 0.1299 - val_loss: 2.2382
Epoch 7/100
[1m39/39[0m [

<keras.src.callbacks.history.History at 0x1d80a1a8490>

In [34]:
emb_model.summary()

In [35]:
# Train accuracy
emb_model.evaluate(B1_padseq, B1y)

[1m36/36[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.1567 - loss: 2.2568


[2.2551958560943604, 0.14646905660629272]

In [36]:
# Validation accuracy
emb_model.evaluate(B2_padseq, B2y)

[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.1307 - loss: 2.2342 


[2.2347300052642822, 0.13385826349258423]

In [37]:
# Test accuracy
emb_model.evaluate(B3_padseq, B3y)

[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.1461 - loss: 2.2180 


[2.2260401248931885, 0.1794871836900711]

<p> sources </p>
<ul>
<li>https://keras.io/api/models/model/</li>
<li>https://towardsdatascience.com/machine-learning-word-embedding-sentiment-classification-using-keras-b83c28087456</li>
<li>https://www.kaggle.com/code/rajmehra03/a-detailed-explanation-of-keras-embedding-layer</li>
<li>https://medium.com/@iqra.bismi/understanding-keras-embedding-for-natural-language-processing-9f65a281b1a7</li>

</ul>

<h3> 3.3 RFC on dataset C </h3>
<p>  Dataset C contains a counter on how many times a specific word have been mentioned in a quote. </p>

In [38]:
param_grid = {
    'n_estimators': [30,35,45,55,65,75,85,95],
    'max_depth': [6,9,12,15,18,21,24,27,30],
}

acc_list = []
for n in range(len(param_grid['n_estimators'])):
    nE = param_grid['n_estimators'][n]
    for d in range(len(param_grid['max_depth'])):
        mD = param_grid['max_depth'][d]
        
        model = RandomForestClassifier(n_estimators=nE, max_depth=mD, random_state=42) 
        model.fit(C_tra_X,C_tra_y)
        X1 = model.predict(C_tra_X)
        x2 = model.predict(C_val_X)
        acc_list.append(ACC(x2, C_val_y))


In [39]:
a = pd.Series(acc_list)
np.where(a==max(a))

(array([59], dtype=int64),)

In [40]:
#ne 85
#md 24
rfc_model = RandomForestClassifier(n_estimators=55, max_depth=15,random_state=42)
rfc_model.fit(C_tra_X,C_tra_y)
predCtrain= rfc_model.predict(C_tra_X)
predCval= rfc_model.predict(C_val_X)
predCtest= rfc_model.predict(C_tar_X)

In [41]:
# Train accuracy 
ACC(predCtrain, C_tra_y)

0.5483870967741935

In [42]:
# Train accuracy 
ACC(predCval, C_val_y)

0.2952755905511811

In [43]:
# Train accuracy 
ACC(predCtest, C_tar_y)

0.3247863247863248

<h2> 4. Ensemble model </h2>
<p> The RFC contains absolutely best results therefore, they will have prioritized votes if there are ties. </p>

In [44]:
# ann_model
# emb_model
# rfc_model

In [48]:
P1 = xmodel.predict(A3)
# P1 = pp.argmax(axis=1)

pp = emb_model.predict(B3_padseq)
P2 = pp.argmax(axis=1)

P3 = rfc_model.predict(C_tar_X)

[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step 


In [49]:
final_preds = []
for i in range(len(P1)):
    preds =  [P1[i],P2[i],P3[i]]
    if preds[0]==preds[1]:
        ans = preds[0]
    elif preds[0]==preds[2]:
        ans= preds[0]
    elif preds[1]==preds[2]:
        ans=preds[1]
    else:
        ans = preds[2]
    final_preds.append(ans)

In [50]:
ACC(final_preds, A_tar_y)

0.2692307692307692

<h1> 5. Conclusion: </h1>
<p> We have used three different datasets trained on three different models. The best individual model is the random forest classifier, which is trained on dummy coded BoW. </p>
<br>
<p> Furthermore, all the models have been put together in an ensemble model, where the majority class wins. The accuracy of the ensemble model is equal to the accuracy retrieved from the rfc model. This might indicate that there are no documents where the two other models agrees upon another label than the rfc model. In other words; the other models are do not give any type of additional explanatory power other what than the rfc model gives.</p>
<br>
<p> The upside of the modelling phase is that we have been able to create a model that is better than random guessing by 300% and a model that better than guessing Frodo all the time by approximately 100%. </p>
<br>
<h1> Biological hazard have left the building at 01:55.  </h1>