

# Demo Code for NLU Coursework: Bi-LSTM Model Inference

## Before we begin
- Upload the following files directly to this Colab session before proceeding:
  - Pre-trained model: 'Bi-lstm_model.h5'
  - Tokenizer: 'tokenizer.joblib'
  - Test data file (a CSV; the data-loading cell uses 'demo_input.csv' as a placeholder name)
- The model and tokenizer files are the result of training the Bi-LSTM model on the NLI task; the fitted Tokenizer is the one used for pre-processing.
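
As an optional pre-flight step, the upload requirement above can be checked in one go so that all missing files are reported together. This is a minimal sketch, not part of the original notebook: the helper `find_missing_files` is a hypothetical name, and `'demo_input.csv'` is only the placeholder test file name used later in this notebook.

```python
import os

# File names taken from this notebook; 'demo_input.csv' is the placeholder
# test file name and should be replaced with the real one.
REQUIRED_FILES = ['Bi-lstm_model.h5', 'tokenizer.joblib', 'demo_input.csv']


def find_missing_files(paths, directory='.'):
    """Return the subset of `paths` that does not exist in `directory`."""
    return [p for p in paths if not os.path.isfile(os.path.join(directory, p))]


missing = find_missing_files(REQUIRED_FILES)
if missing:
    print("Missing files, please upload them before continuing:", missing)
else:
    print("All required files are present.")
```

Running this before the loading cells avoids fixing one `FileNotFoundError` only to hit the next.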



In [None]:
# Installation of required packages (if not already available in the Google Colab environment)
# Run the following lines if you encounter import errors, or if instructed by the notebook
!pip install joblib
!pip install pandas  # Usually, pandas is pre-installed in Google Colab
!pip install keras
!pip install tensorflow


1. Import necessary libraries


In [1]:
import pandas as pd
import joblib
import os
import re
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

2. Load the pre-trained model, tokenizer, and test dataset for prediction

In [2]:
# Load the pre-trained model and tokenizer
try:
    lstm_model = load_model('Bi-lstm_model.h5')
    tokenizer = joblib.load('tokenizer.joblib')
except FileNotFoundError as e:
    print(e)
    print("Please make sure the model and tokenizer files are uploaded to this session.")
    raise  # Exit if files not found.


In [3]:
# Load new input data for prediction
try:
    new_data = pd.read_csv('demo_input.csv')  # Placeholder: replace with the actual file name of the test data
except FileNotFoundError:
    print("Data file not found. Please upload the test dataset to this session.")
    raise

3. Pre-process test dataset

In [4]:
# Function to preprocess text data
def preprocess_text(text):
    # Remove punctuation (anything that is not a word character or whitespace) and convert to lowercase.
    # The raw string avoids invalid-escape warnings for \w and \s.
    text = text.apply(lambda t: re.sub(r'[^\w\s]', '', t).lower())
    # Tokenize and pad sequences
    sequences = tokenizer.texts_to_sequences(text)
    padded_sequences = pad_sequences(sequences, maxlen=100)  # Ensure the same max length as during training
    return padded_sequences
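
To see what the cleaning step in `preprocess_text` actually removes, here is a small stand-alone sketch that applies the same regular expression to single strings. The helper name `clean` and the example sentences are made up for illustration.

```python
import re

# Same pattern as in preprocess_text: drop every character that is neither a
# word character (letter, digit, underscore) nor whitespace.
PUNCTUATION = re.compile(r'[^\w\s]')


def clean(sentence):
    """Strip punctuation from a single string and lowercase it."""
    return PUNCTUATION.sub('', sentence).lower()


# Made-up sentences, for illustration only.
for s in ["A man is playing a guitar!", "It's 5 o'clock, isn't it?", "snake_case stays"]:
    print(repr(s), '->', repr(clean(s)))
```

Note that apostrophes are deleted rather than replaced by a space, so "isn't" becomes "isnt", while digits and underscores are kept.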

In [5]:
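# Replace missing values with empty strings so the concatenation below does not produce NaN
new_data.fillna('', inplace=True)
# Join premise and hypothesis into a single input text, separated by a space
new_data['text'] = new_data['premise'].str.cat(new_data['hypothesis'], sep=' ')
# Clean, tokenize, and pad the combined text
X_new = preprocess_text(new_data['text'])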
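
The call `pad_sequences(sequences, maxlen=100)` relies on the Keras defaults, which pad and truncate at the start of each sequence and fill with 0. The sketch below re-implements that default behaviour in plain Python purely for illustration; `pad_pre` is a hypothetical helper, not a Keras function, and the token ids are invented.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each list of token ids to length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        seq = seq[max(len(seq) - maxlen, 0):]   # keep only the last `maxlen` ids
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Hypothetical token ids; real ids come from tokenizer.texts_to_sequences.
example = [[4, 9], [7, 1, 3, 8, 2, 6]]
for row in pad_pre(example, maxlen=4):
    print(row)
# [0, 0, 4, 9]
# [3, 8, 2, 6]
```

Because premise and hypothesis are concatenated in that order, any text longer than `maxlen` tokens loses words from the start of the premise first, assuming the defaults were also used during training.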

4. Make predictions using the loaded model


In [6]:
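# Run the model on the padded sequences
predictions = lstm_model.predict(X_new)
# Convert the model outputs to binary labels: 1 if the output is above 0.5, otherwise 0
predictions = (predictions > 0.5).astype(int)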
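
The thresholding step can be illustrated without the model. The sketch below assumes the network ends in a single sigmoid unit, so `predict` returns one probability per example in a column; that is an assumption based on the 0.5 cut-off, not something this notebook states. The helper `to_labels` and the probabilities are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it is strictly above `threshold`, else 0."""
    return [[int(p > threshold) for p in row] for row in probabilities]


# Made-up probabilities shaped like the predict output: one row per example.
probs = [[0.91], [0.50], [0.08]]
print(to_labels(probs))  # [[1], [0], [0]]
```

The comparison is strict, so an output of exactly 0.5 is labelled 0.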




5. Save the predictions to a CSV file in the specified format



In [7]:
predictions_df = pd.DataFrame(predictions, columns=['prediction'])
predictions_df.to_csv('Group_3_B.csv', header=True, index=False)
# The file is saved in the session storage. Click the file icon in Colab's left-hand sidebar to view and download it;
# if it does not appear immediately, right-click the Files panel and select Refresh.
print("Predictions have been saved successfully to 'Group_3_B.csv'")


Predictions have been saved successfully to 'Group_3_B.csv'


In [8]:
# Output Verification
print(f"Checking if the output file exists: {'Found' if os.path.isfile('Group_3_B.csv') else 'Not Found'}")
if os.path.isfile('Group_3_B.csv'):
    print("\nFirst 5 lines of the prediction file:")
    with open('Group_3_B.csv', 'r') as file:
        for _ in range(5):
            print(file.readline().strip())
else:
    print("Output file not found. Please check the file path and write permissions.")


Checking if the output file exists: Found

First 5 lines of the prediction file:
prediction
1
1
1
1
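
Beyond checking that the file exists, its contents can also be validated. The following is a minimal sketch under the assumption that the expected format is a single `prediction` column containing only 0 or 1, as produced above; `check_prediction_file` is a hypothetical helper, not part of the original notebook.

```python
import csv
import os


def check_prediction_file(path, expected_rows=None):
    """Return a list of problems found in the predictions CSV (empty if none)."""
    problems = []
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    if not rows or rows[0] != ['prediction']:
        problems.append("header should be exactly 'prediction'")
    # Treat blank lines as empty values so they are reported as invalid.
    labels = [r[0] if r else '' for r in rows[1:]]
    bad = [v for v in labels if v not in ('0', '1')]
    if bad:
        problems.append(f"{len(bad)} value(s) are not 0 or 1")
    if expected_rows is not None and len(labels) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(labels)}")
    return problems


# Only run the check if the output file from the previous cells is present.
if os.path.isfile('Group_3_B.csv'):
    print(check_prediction_file('Group_3_B.csv') or "No problems found.")
```

In this notebook, passing `expected_rows=len(new_data)` would additionally confirm that there is one prediction per input row.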
