# Bi-directional LSTM: Machine Learning

## Description of text features

This notebook describes the pre-computed text features provided for assignment 2. **You do not need to recompute the features yourself for this assignment**; this information is for reference only. However, feel free to experiment with different text features if you are interested. If you do generate your own text features, keep the following in mind:
- There are many decisions to make throughout the feature design process, from the text preprocessing to the size of the output vectors. There's no guarantee that the defaults we chose will produce the best possible text features for this classification task, so feel free to experiment with different settings.
- These features must be trained on a training corpus. Generally, the training corpus should not include validation samples, but for the purposes of this assignment we have used the entire non-test set (training+validation) as the training corpus, to allow you to experiment with different validation sets. If you recompute the text features as part of your own model, you should exclude validation samples and retrain on training samples only. Note that if you do N-fold cross-validation, this means generating N sets of features, one for each of the N training-validation splits.
- This code may take a long time to run and needs a substantial amount of memory, which is why we are not requiring you to recompute these features yourself. doc2vec in particular is very slow unless you can implement some speed-ups in C.
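
The point about refitting features for every training-validation split can be sketched with a scikit-learn pipeline. This is only an illustration under stated assumptions: the toy corpus and labels below are made up, and `CountVectorizer` stands in for doc2vec purely to keep the example small and fast.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the recipe names; labels are made up.
texts = ["quick salad", "slow roast beef", "quick toast", "slow braised lamb",
         "quick omelette", "slow cooked stew", "quick smoothie", "slow baked ham"]
labels = [1, 2, 1, 2, 1, 2, 1, 2]

# Because the vectorizer sits inside the pipeline, it is re-fitted on the
# training part of every fold, so validation text never shapes the features.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
fold_scores = cross_val_score(pipeline, texts, labels, cv=4)
print(fold_scores.shape)
```

With 4 folds the vectorizer is fitted 4 times, which is the "N sets of features" cost mentioned above.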

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD, PCA

from sklearn import linear_model, naive_bayes

from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Seed variable used in random_state
SEED_NO = 0

# read text
# for DEMONSTRATION PURPOSES, the entire training set will be used to train the models and also as a test set
X_train_original = pd.read_csv(r"./COMP30027_2021_Project2_datasets/recipe_train.csv", index_col = False, delimiter = ',', header=0)
# use recipe name as an example
train_corpus_name = X_train_original['name']
test_name = X_train_original['name']

In [None]:
print('X_train_original Dimensions:', X_train_original.shape)
print("train_corpus_name Dimensions:", train_corpus_name.shape)
print("y_train Dimensions:", X_train_original.duration_label.shape)
print("Attributes Names: {}".format(X_train_original.columns.values))
X_train_original.head()

In [61]:
# Using pre-existing doc2Vec50 files
X_train_name = pd.read_csv(r"./COMP30027_2021_Project2_datasets/recipe_text_features_doc2vec50/train_name_doc2vec50.csv", 
            index_col = False, 
            delimiter = ',', 
            header=None)

X_train_steps = pd.read_csv(r"./COMP30027_2021_Project2_datasets/recipe_text_features_doc2vec50/train_steps_doc2vec50.csv", 
            index_col = False, 
            delimiter = ',', 
            header=None)
            
X_train_ingr = pd.read_csv(r"./COMP30027_2021_Project2_datasets/recipe_text_features_doc2vec50/train_ingr_doc2vec50.csv", 
            index_col = False, 
            delimiter = ',', 
            header=None)

In [62]:
X_test_name = pd.read_csv(r"./COMP30027_2021_Project2_datasets/recipe_text_features_doc2vec50/test_name_doc2vec50.csv", 
            index_col = False, 
            delimiter = ',', 
            header=None)

X_test_steps = pd.read_csv(r"./COMP30027_2021_Project2_datasets/recipe_text_features_doc2vec50/test_steps_doc2vec50.csv", 
            index_col = False, 
            delimiter = ',', 
            header=None)
            
X_test_ingr = pd.read_csv(r"./COMP30027_2021_Project2_datasets/recipe_text_features_doc2vec50/test_ingr_doc2vec50.csv", 
            index_col = False, 
            delimiter = ',', 
            header=None)

In [None]:
print("Set of Duration Labels: " , set(X_train_original.duration_label))

In [None]:
# Establish a feature & label DataFrame for graphing purposes
vectorized_df = pd.concat([X_train_name, X_train_steps, X_train_ingr, X_train_original.duration_label], axis=1)
vectorized_df.duration_label = vectorized_df.duration_label.astype(str)  # cast to str so the legend shows discrete values
vectorized_df.head()

In [None]:
tokenized_reduced = vectorized_df.copy()  # explicit copy used for dimensionality reduction

# Show that the PCA-transformed feature vectors do not form distinct clusters.
# The label column is dropped first so that it does not leak into the projection.
sc = StandardScaler()
pca = PCA(n_components=3, random_state=SEED_NO)
components = pca.fit_transform(sc.fit_transform(tokenized_reduced.drop(columns='duration_label')))
labels = {
    str(i): f"PCA {i+1} ({var:.1f}%)" for i, var in enumerate(pca.explained_variance_ratio_ * 100)
} 

# Scatter matrix of the first three principal components
fig = px.scatter_matrix(components, 
                        labels=labels, 
                        dimensions=range(pca.components_.shape[0]), 
                        color=tokenized_reduced.duration_label,
                        title="Total Explained Variance: {:.2f}%".format(pca.explained_variance_ratio_.sum()*100),
                        width=800, height=500
                       ).update_traces(diagonal_visible=False, marker=dict(size=3))
fig.show()

In [None]:
# Fit a full PCA to see how much variance each additional component explains
# (the label column is excluded so only the features are decomposed)
tokenized_pca = PCA().fit(sc.fit_transform(tokenized_reduced.drop(columns='duration_label')))

# Plot the per-component and cumulative explained-variance ratios
component_no = np.arange(1, tokenized_pca.n_components_ + 1)  # same length as the ratio array
fig = go.Figure()
fig.add_trace(go.Scatter(x=component_no,
                         y=tokenized_pca.explained_variance_ratio_, 
                         mode='lines',
                         name='Explained variance ratio'))
fig.add_trace(go.Scatter(x=component_no,
                         y=np.cumsum(tokenized_pca.explained_variance_ratio_), 
                         mode='lines',
                         name='Cumulative explained variance ratio'))
fig.update_layout(title='Explained Variance vs. No. of PCA Components',
                  xaxis_title='Principal Component No.',
                  yaxis_title='Explained Variance Ratio')
fig.show()
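
The cumulative curve above can also be read off numerically. The following is a minimal sketch, assuming random stand-in data and an illustrative 90% cut-off (neither comes from this notebook), of how to find the smallest number of components that reaches a target explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the doc2vec columns: 200 samples, 10 features.
X_demo = rng.normal(size=(200, 10))

pca_demo = PCA().fit(StandardScaler().fit_transform(X_demo))
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

threshold = 0.90  # illustrative cut-off, not a value used in this notebook
# searchsorted gives the first index whose cumulative ratio is >= threshold;
# adding 1 turns that index into a component count.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[n_keep - 1])
```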

## Standard models

In [None]:
y_train = X_train_original.duration_label

# Linear classifier used as a benchmark
print("--- \nMulti-class Logistic Regression ")
LR_clf = linear_model.LogisticRegression(random_state=SEED_NO,
                                         C=0.9, 
                                         max_iter = 50000,
                                         multi_class='multinomial')

# Compute the 5-fold cross-validation accuracy on each feature set separately,
# then average the three mean accuracies
LR_accuracy = (cross_val_score(LR_clf, X_train_name, y_train, cv=5).mean() +
        cross_val_score(LR_clf, X_train_steps, y_train, cv=5).mean() +
        cross_val_score(LR_clf, X_train_ingr, y_train, cv=5).mean()) / 3
print("Multinomial Logistic Regression Accuracy:", LR_accuracy)

# Gaussian Naive Bayes (GaussianNB), evaluated the same way
print("--- \nGaussian Naive Bayes")
NB_clf = naive_bayes.GaussianNB()
NB_accuracy = (cross_val_score(NB_clf, X_train_name, y_train, cv=5).mean() +
        cross_val_score(NB_clf, X_train_steps, y_train, cv=5).mean() +
        cross_val_score(NB_clf, X_train_ingr, y_train, cv=5).mean()) / 3
print("Gaussian NB Accuracy:", NB_accuracy)

In [None]:
import tensorflow as tf 
import tensorflow.keras.utils as utils
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, LSTM, Bidirectional

# Stack all vectorized features and reshape them into a 3-D array (samples, timesteps=1, features)
X_train_3dim = np.hstack((X_train_ingr, X_train_name, X_train_steps))
X_train_3dim = np.reshape(X_train_3dim, (X_train_3dim.shape[0], 1, X_train_3dim.shape[1]))
print("Transformed X_train Dimensions:          ", X_train_3dim.shape)

# One-hot Encoding for Y_train
y_train_le = LabelEncoder().fit_transform(y_train)
y_train_onehot = utils.to_categorical(y_train_le)
print("One-hot Encoding Dimensions for y_train: ", y_train_onehot.shape)
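
To make the 3-D layout concrete, here is a small NumPy-only sketch with tiny made-up arrays. The first layout mirrors the reshape used in this notebook (one timestep holding every feature). The second, where each feature block becomes its own timestep, is an untested alternative shown only for comparison and is not evaluated anywhere in this notebook.

```python
import numpy as np

n_samples, n_dims = 4, 50  # tiny hypothetical sizes
name = np.zeros((n_samples, n_dims))
steps = np.ones((n_samples, n_dims))
ingr = np.full((n_samples, n_dims), 2.0)

# Layout used in this notebook: a single timestep holding all 150 features.
single_step = np.hstack((ingr, name, steps)).reshape(n_samples, 1, 3 * n_dims)

# Alternative layout (an assumption, not evaluated here):
# three timesteps of 50 features each, in the order ingr, name, steps.
three_steps = np.stack((ingr, name, steps), axis=1)

print(single_step.shape, three_steps.shape)
```

With a single timestep the recurrent layers have no sequence to traverse, so the forward and backward passes of the bidirectional wrapper see the same one-step input.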

In [None]:
def Bidirectional_LSTM_clf(X, y, epochs_size, batch_size):
    model = Sequential()
    model.add(Bidirectional(LSTM(X.shape[2], return_sequences=True, dropout=0.4, input_shape=(1, X.shape[2]))))
    model.add(Bidirectional(LSTM(X.shape[2], return_sequences=False, dropout=0.4, input_shape=(1, X.shape[2]))))
    model.add(Dense(X.shape[2], activation='tanh'))
    model.add(Dense(y.shape[1], activation='softmax'))  # softmax over the duration classes (3 here)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model_history = model.fit(X, y, 
                            epochs=epochs_size, 
                            batch_size=batch_size, 
                            validation_split=0.2,
                            verbose=1)
    return model, model_history 

bi_lstm_model, bi_lstm_model_history = Bidirectional_LSTM_clf(X_train_3dim, y_train_onehot, 10, 128)

In [None]:
# lstm_model.summary()
# print("LSTM Model Training Acccuracy:                        {:.2f}".format(lstm_model_history.history['accuracy'][-1]))
# print("LSTM Model Cross-Validation Acccuracy:                {:.2f}".format(lstm_model_history.history['val_accuracy'][-1]))
# print("-----------------------------------------------------------------\n")
bi_lstm_model.summary()
print("Bidirectional LSTM Model Training Accuracy:    {:.2f}".format(bi_lstm_model_history.history['accuracy'][-1]))
# 'val_accuracy' comes from the 20% hold-out created by validation_split, not from k-fold cross-validation
print("Bidirectional LSTM Model Validation Accuracy:  {:.2f}".format(bi_lstm_model_history.history['val_accuracy'][-1]))
# bi_lstm_model_history.history

In [None]:
# Collect the per-epoch accuracy into a DataFrame for plotting
acc_df = pd.DataFrame(
            zip(np.arange(len(bi_lstm_model_history.history['accuracy'])), 
                # lstm_model_history.history['accuracy'],  
                bi_lstm_model_history.history['accuracy'],
                # lstm_model_history.history['val_accuracy'],  
                bi_lstm_model_history.history['val_accuracy']
                ),
            columns=["Epoch Iteration", 
                    # "LSTM Model (Training)", 
                    "Bidirectional LSTM Model (Training)",
                    # "LSTM Model (Cross Validation)",
                    "Bidirectional LSTM Model (Cross Validation)",
                    ],
            )

# Collect the per-epoch loss into a DataFrame for plotting
loss_df = pd.DataFrame(
            zip(np.arange(len(bi_lstm_model_history.history['loss'])), 
                # lstm_model_history.history['loss'],  
                bi_lstm_model_history.history['loss'],
                # lstm_model_history.history['val_loss'],  
                bi_lstm_model_history.history['val_loss']
                ),
            columns=["Epoch Iteration", 
                    # "LSTM Model (Training)", 
                    "Bidirectional LSTM Model (Training)",
                    # "LSTM Model (Cross Validation)",
                    "Bidirectional LSTM Model (Cross Validation)",
                    ],
            )

In [None]:
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=["Accuracy per Epoch Iteration", "Loss per Epoch Iteration"])

fig.add_trace(
    go.Scatter(x=acc_df['Epoch Iteration'], y=acc_df["Bidirectional LSTM Model (Training)"], 
                name="Bidirectional LSTM Model (Training)", legendgroup='group2', line_color='purple'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=acc_df['Epoch Iteration'], y=acc_df["Bidirectional LSTM Model (Cross Validation)"], 
                name="Bidirectional LSTM Model (Cross Validation)", legendgroup='group4', line_color='red'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=loss_df['Epoch Iteration'], y=loss_df["Bidirectional LSTM Model (Training)"], 
                name="Bidirectional LSTM Model (Training)", legendgroup='group2', line_color='purple', showlegend = False),
    row=1, col=2
)

fig.add_trace(
    go.Scatter(x=loss_df['Epoch Iteration'], y=loss_df["Bidirectional LSTM Model (Cross Validation)"], 
                name="Bidirectional LSTM Model (Cross Validation)", legendgroup='group4', line_color='red', showlegend = False), 
    row=1, col=2
)

fig.update_layout(height=600, width=1000, title_text="Accuracy & Loss per Epoch Iterations")
fig.show()

## Prediction based on Remaining Discrete Data 

The remaining discrete features are the number of steps (`n_steps`) and the number of ingredients (`n_ingredients`).


In [None]:
X_train_discrete =  X_train_original[['n_steps', 'n_ingredients']]
X_train_discrete.head()

In [None]:
scatter_df = X_train_discrete.join(y_train)
scatter_df.duration_label = scatter_df.duration_label.astype(str)
px.scatter(scatter_df, x="n_steps", y="n_ingredients", color="duration_label", title="No. of Cooking Steps vs. No. of Ingredients")

## Evaluation

In [None]:
def majority_voting_scoring(clf, X_train_name, X_train_steps, X_train_ingr, y_train):
    # Out-of-fold (5-fold) predictions from the same classifier on each feature set
    y_pred_name = cross_val_predict(clf, X_train_name, y_train, cv=5)
    y_pred_steps = cross_val_predict(clf, X_train_steps, y_train, cv=5)
    y_pred_ingr = cross_val_predict(clf, X_train_ingr, y_train, cv=5)

    y_pred_pool = pd.DataFrame({"name": y_pred_name, "steps": y_pred_steps, "ingr": y_pred_ingr})
    # The row-wise mode is the majority label. On a three-way tie, mode() lists the
    # tied labels in ascending order, so [0] falls back to the smallest label.
    return y_pred_pool.mode(axis=1)[0]
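
The row-wise vote can be checked on a tiny example. The three votes per row below are made up. The sketch shows a clear majority, a three-way tie, and a unanimous row, and flags the tied rows so they can be handled differently if desired.

```python
import pandas as pd

# Hypothetical predictions from the three per-feature classifiers.
votes = pd.DataFrame({"name":  [1.0, 3.0, 2.0],
                      "steps": [1.0, 2.0, 2.0],
                      "ingr":  [2.0, 1.0, 2.0]})

modes = votes.mode(axis=1)             # one column per tied mode, NaN-padded
majority = modes[0]                    # first (smallest) mode of each row
tied = modes.notna().sum(axis=1) > 1   # True where more than one label ties

print(majority.tolist())  # [1.0, 1.0, 2.0]
print(tied.tolist())      # [False, True, False]
```

In the second row every classifier disagrees, so the vote silently resolves to label 1.0. With imbalanced classes, that fallback may be worth replacing with the prediction of the most accurate single classifier.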
    

In [None]:
# predict relevant data
y_pred_NB = majority_voting_scoring(NB_clf, X_train_name, X_train_steps, X_train_ingr, y_train)
y_pred_LR = majority_voting_scoring(LR_clf, X_train_name, X_train_steps, X_train_ingr, y_train)

In [None]:
# Convert the softmax output back to one of {1.0, 2.0, 3.0}.
# np.argmax returns the 0-based position of the most probable class, so add 1 to match the labels.
# Note: these predictions are made on the same data the model was trained on.
y_pred_BLSTM = np.argmax(bi_lstm_model.predict(X_train_3dim), axis=-1) + 1.0
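
Adding 1 to the `argmax` position works because the labels happen to be 1.0, 2.0, 3.0. A slightly more general sketch, assuming the fitted `LabelEncoder` is kept around (the notebook discards it) and using made-up probabilities, maps positions back through `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_demo = np.array([1.0, 3.0, 2.0, 3.0])  # hypothetical duration labels
encoder = LabelEncoder().fit(y_demo)      # classes_ is sorted: [1.0, 2.0, 3.0]

# Hypothetical softmax outputs, one row per sample.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])

# argmax gives the encoded class index; inverse_transform restores the label.
decoded = encoder.inverse_transform(np.argmax(probs, axis=-1))
print(decoded.tolist())  # [1.0, 3.0, 2.0, 3.0]
```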

### Classification Report for Naive Bayes & Logistic Regression

In [None]:
target_names =['1.0', '2.0', '3.0']

print("----------- Naive Bayes Classifier Report -----------")
# print("Accuracy Score: ", accuracy_score(y_train.astype(str), y_pred_NB.astype(str)))
print(classification_report(y_train.astype(str), y_pred_NB.astype(str), target_names=target_names, digits=4))
print("------- Logistic Regression Classifier Report -------")
# print("Accuracy Score: ", accuracy_score(y_train.astype(str), y_pred_LR.astype(str)))
print(classification_report(y_train.astype(str), y_pred_LR.astype(str), target_names=target_names, digits=4))

In [None]:
X_train = pd.concat([X_train_name, X_train_steps, X_train_ingr], axis=1)

# Hold out 30% of the data to test the LSTM's accuracy
X_train_lstm, X_test_lstm, y_train_lstm, y_test_lstm = train_test_split(X_train, y_train, test_size=0.3, random_state=SEED_NO)

# Reshape the stacked features into 3-D arrays (samples, timesteps=1, features)
X_train_3dim = np.reshape(np.array(X_train_lstm), newshape=(X_train_lstm.shape[0], 1, X_train_lstm.shape[1]))
X_test_3dim = np.reshape(np.array(X_test_lstm), newshape=(X_test_lstm.shape[0], 1, X_test_lstm.shape[1]))
print("Transformed X_train Dimensions:          ", X_train_3dim.shape)

# One-hot Encoding for Y_train
y_train_le = LabelEncoder().fit_transform(y_train_lstm)
y_train_onehot = utils.to_categorical(y_train_le)
print("One-hot Encoding Dimensions for y_train: ", y_train_onehot.shape)

bi_lstm_model, bi_lstm_model_history = Bidirectional_LSTM_clf(X_train_3dim, y_train_onehot, 10, 128)

In [None]:
print("------- Bidirectional LSTM Classifier Report -------")
y_pred_BLSTM = np.array(list(map(lambda x: float(x+1), # np.argmax returns the 0-based class position, so add 1 to match the labels
                        np.argmax(bi_lstm_model.predict(X_test_3dim), axis=-1))))
print(classification_report(y_test_lstm.astype(str), y_pred_BLSTM.astype(str), target_names=target_names, digits=4))

## Statistics for accuracy score under different feature extraction setting 

In [None]:
# Accuracies to show in the bar chart, ordered as [Naive Bayes, Logistic Regression, BLSTM]
vec50_acc = [0.6245, 0.6358, 0.6965]
vec100_acc = [0.5743, 0.6205, 0.7146]
chi2_300f_acc = [0.6584, 0.7985, 0.7783]
chi2_100f_acc = [0.6584, 0.7952, 0.7909]

# Assemble a DataFrame of accuracies in percent, one column per feature set
hist_df = pd.DataFrame({'vec_50': np.multiply(vec50_acc, 100), 
                        'vec_100': np.multiply(vec100_acc, 100), 
                        'chi2_300': np.multiply(chi2_300f_acc, 100), 
                        'chi2_100': np.multiply(chi2_100f_acc, 100)
                        })
hist_df = hist_df.T
hist_df.columns = ['Naive Bayes', 'Logistic Regression', 'BLSTM']
hist_df.head()
# print(hist_df.iloc[0])

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(y=hist_df.iloc[0], x=hist_df.columns, name="vec_50"))
fig.add_trace(go.Bar(y=hist_df.iloc[1], x=hist_df.columns, name="vec_100"))
fig.add_trace(go.Bar(y=hist_df.iloc[2], x=hist_df.columns, name="chi2_300"))
fig.add_trace(go.Bar(y=hist_df.iloc[3], x=hist_df.columns, name="chi2_100"))

fig.update_layout(
    xaxis_title="Classification Model",
    yaxis_title="Accuracy (%)",
    legend_title="Feature Extraction Type",
    title="Classification Accuracy Summary"
)


fig.update_yaxes(range=[0, 100])
fig.show()

## Exporting label prediction to CSV files 

In [None]:
def export_pred_to_csv(y_pred, fname):
    # Write predictions as (id, duration_label) rows, with ids starting at 1
    pd.DataFrame(zip(np.arange(1, len(y_pred)+1), y_pred), columns=["id", "duration_label"]).to_csv(fname, header=True, index=False)

In [63]:
# Gaussian Naive Bayes, refitted on each feature set in turn
NB_clf = naive_bayes.GaussianNB()

# Fit on the name features and predict
NB_clf.fit(X_train_name, y_train)
y_pred_name = NB_clf.predict(X_test_name)

# Fit on the steps features and predict
NB_clf.fit(X_train_steps, y_train)
y_pred_steps = NB_clf.predict(X_test_steps)

# Fit on the ingredients features and predict
NB_clf.fit(X_train_ingr, y_train)
y_pred_ingr = NB_clf.predict(X_test_ingr)

# Pool the three predictions and take the row-wise majority vote
y_pred_NB_pool = pd.DataFrame({"name": y_pred_name, "steps": y_pred_steps, "ingr": y_pred_ingr})
y_pred_NB_pool = y_pred_NB_pool.mode(axis=1)[0]

In [64]:
# Multinomial logistic regression, refitted on each feature set in turn
LR_clf = linear_model.LogisticRegression(random_state=SEED_NO,
                                         C=0.9, 
                                         max_iter = 50000,
                                         multi_class='multinomial')
# Fit on the name features and predict
LR_clf.fit(X_train_name, y_train)
y_pred_name = LR_clf.predict(X_test_name)

# Fit on the steps features and predict
LR_clf.fit(X_train_steps, y_train)
y_pred_steps = LR_clf.predict(X_test_steps)

# Fit on the ingredients features and predict
LR_clf.fit(X_train_ingr, y_train)
y_pred_ingr = LR_clf.predict(X_test_ingr)

# Pool the three predictions and take the row-wise majority vote
y_pred_LR_pool = pd.DataFrame({"name": y_pred_name, "steps": y_pred_steps, "ingr": y_pred_ingr})
y_pred_LR_pool = y_pred_LR_pool.mode(axis=1)[0]

In [65]:
# Stack all vectorized features and reshape them into a 3-D array (samples, timesteps=1, features)
X_train_3dim = np.hstack((X_train_ingr, X_train_name, X_train_steps))
X_train_3dim = np.reshape(X_train_3dim, (X_train_3dim.shape[0], 1, X_train_3dim.shape[1]))
print("Transformed X_train Dimensions:          ", X_train_3dim.shape)

# One-hot Encoding for Y_train
y_train_le = LabelEncoder().fit_transform(y_train)
y_train_onehot = utils.to_categorical(y_train_le)
print("One-hot Encoding Dimensions for y_train: ", y_train_onehot.shape)

Transformed X_train Dimensions:           (40000, 1, 150)
One-hot Encoding Dimensions for y_train:  (40000, 3)


In [66]:
print("------- Bidirectional LSTM Classifier Report -------")
y_pred_BLSTM = np.array(list(map(lambda x: float(x+1), # np.argmax returns the 0-based class position, so add 1 to match the labels
                        np.argmax(bi_lstm_model.predict(X_test_3dim), axis=-1))))
print(classification_report(y_test_lstm.astype(str), y_pred_BLSTM.astype(str), target_names=target_names, digits=4))

------- Bidirectional LSTM Classifier Report -------
              precision    recall  f1-score   support

         1.0     0.6563    0.8081    0.7243      5283
         2.0     0.7642    0.6520    0.7037      6110
         3.0     0.6206    0.2883    0.3937       607

    accuracy                         0.7023     12000
   macro avg     0.6804    0.5828    0.6072     12000
weighted avg     0.7094    0.7023    0.6971     12000



In [67]:
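# NOTE: y_pred_BLSTM holds predictions for the 30% hold-out split (X_test_3dim),
# not for the X_test_* feature files; the NB and LR pools are predictions on the test files.
export_pred_to_csv(y_pred_BLSTM, "BLSTM_y_pred_vec50.csv")
export_pred_to_csv(y_pred_NB_pool, "NB_y_pred_vec50.csv")
export_pred_to_csv(y_pred_LR_pool, "LR_y_pred_vec50.csv")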