# Note
If you need to run the code, there are two options:
- Install all the required libraries on your local machine before running this file.

- Or open the Google Colab link: https://colab.research.google.com/drive/1rApB67sZ2R5lxZu8m1PuMbOh9LwHlw_I?usp=sharing
    In Google Colab, click the folder icon on the left side and upload the dataset file before running the code.


# Pre-processing the Data

First, we loaded the file containing the IELTS dataset (a CSV file in the code below) and retrieved the important columns using the pandas and numpy libraries. The IELTS writing scored essays dataset is a collection of over 1200 essays, each accompanied by a variety of columns.

This step is important because the IELTS dataset has two task types. The Task_Type column categorizes the essays into their respective IELTS writing tasks, "Task 1" and "Task 2". In this project we focused on Task 2, so one of the things we did in the pre-processing phase was to filter out the rows for Task 1.

In [None]:
#Import tensorflow and pandas
import pandas as pd
import tensorflow as tf


In [None]:
!pip install tensorflow==2.8.0
!pip install transformers==4.18.0




In [None]:
#Read the CSV file
file_path = '/content/ielts_writing_dataset.csv'
data_csv = pd.read_csv(file_path)
data_csv.head() #This is just to make sure we read the file correctly...


Unnamed: 0,Task_Type,Question,Essay,Examiner_Commen,Task_Response,Coherence_Cohesion,Lexical_Resource,Range_Accuracy,Overall
0,1,The bar chart below describes some changes abo...,"Between 1995 and 2010, a study was conducted r...",,,,,,5.5
1,2,Rich countries often give money to poorer coun...,Poverty represents a worldwide crisis. It is t...,,,,,,6.5
2,1,The bar chart below describes some changes abo...,The left chart shows the population change hap...,,,,,,5.0
3,2,Rich countries often give money to poorer coun...,Human beings are facing many challenges nowada...,,,,,,5.5
4,1,The graph below shows the number of overseas v...,Information about the thousands of visits from...,,,,,,7.0


Inspecting the data:

In [None]:
print(data_csv.columns)


Index(['Task_Type', 'Question', 'Essay', 'Examiner_Commen', 'Task_Response',
       'Coherence_Cohesion', 'Lexical_Resource', 'Range_Accuracy', 'Overall'],
      dtype='object')


In [None]:
print(data_csv.shape)


(1435, 9)


In [None]:
print(data_csv.isnull().sum()) #to know how many empty cells


Task_Type                0
Question                 0
Essay                    0
Examiner_Commen       1373
Task_Response         1435
Coherence_Cohesion    1435
Lexical_Resource      1435
Range_Accuracy        1435
Overall                  0
dtype: int64


In [None]:
print(data_csv.dtypes) #data type


Task_Type               int64
Question               object
Essay                  object
Examiner_Commen        object
Task_Response         float64
Coherence_Cohesion    float64
Lexical_Resource      float64
Range_Accuracy        float64
Overall               float64
dtype: object


Since the task type is an integer and the overall score is a float, we don't have to convert their data types.

In [None]:
#data_csv['Task_Type'] holds the task type of every row. We want to focus on Task 2 only, so we filter the rows this way:


data_task_2 = data_csv[data_csv['Task_Type'] == 2]
#To make sure we did it right, let's print data_task_2:
data_task_2.head()

Unnamed: 0,Task_Type,Question,Essay,Examiner_Commen,Task_Response,Coherence_Cohesion,Lexical_Resource,Range_Accuracy,Overall
1,2,Rich countries often give money to poorer coun...,Poverty represents a worldwide crisis. It is t...,,,,,,6.5
3,2,Rich countries often give money to poorer coun...,Human beings are facing many challenges nowada...,,,,,,5.5
5,2,Some countries achieve international sports by...,Whether countries should only invest facilitie...,,,,,,6.5
7,2,Some countries achieve international sports by...,"Sports is an essential part to most of us , so...",,,,,,5.5
9,2,Some countries achieve international sports by...,International sports events require the most w...,,,,,,9.0


In [None]:
data_task_2.shape #The Kaggle page says there are 793 Task 2 samples, so checking the shape confirms that we filtered them correctly

(793, 9)

In [None]:
#Checking the data and making sure the encoding went correctly:
data_task_2.head()

Unnamed: 0,Task_Type,Question,Essay,Examiner_Commen,Task_Response,Coherence_Cohesion,Lexical_Resource,Range_Accuracy,Overall
1,2,Rich countries often give money to poorer coun...,Poverty represents a worldwide crisis. It is t...,,,,,,6.5
3,2,Rich countries often give money to poorer coun...,Human beings are facing many challenges nowada...,,,,,,5.5
5,2,Some countries achieve international sports by...,Whether countries should only invest facilitie...,,,,,,6.5
7,2,Some countries achieve international sports by...,"Sports is an essential part to most of us , so...",,,,,,5.5
9,2,Some countries achieve international sports by...,International sports events require the most w...,,,,,,9.0


In [None]:
#------------
#The columns we need are Question, Essay and Overall. We treat Question and Essay as features and Overall as the label (output).
labels = data_task_2['Overall'].values

features = data_task_2[['Question', 'Essay']].values #.values converts the DataFrame to a NumPy array; features[0] is the first row (its question and essay),
# features[0][0] is the question of the first row, and features[0][1] is the essay of the first row.


# Training set, cross-validation set and testing set

In this section we split the data into three sets. Currently we have 793 rows, each with a question, an essay response, and the score that person received. We split them 60% / 20% / 20% (training / cross-validation / testing) using scikit-learn.

We split the data into three sets because overfitting is a very common problem when training a model, and a cross-validation set helps us detect it. On the cross-validation set we do not update the weights (we do not train the model); we only measure the model's performance. Finally, the testing set serves as the unbiased evaluation set, and we do not use it until training is done.
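As a quick check on the 60/20/20 arithmetic, the sketch below computes how many of the 793 rows land in each set when the split is done in two steps (20% for testing, then 25% of the remainder for cross-validation). The helper `split_sizes` is hypothetical, and it assumes fractional sizes are rounded up, which is my understanding of how scikit-learn treats a float `test_size`.

```python
import math

def split_sizes(n_rows, test_fraction=0.2, val_fraction_of_rest=0.25):
    """Return (train, validation, test) row counts for a two-step split.

    Assumption: the held-out share is rounded up at each step.
    """
    n_test = math.ceil(n_rows * test_fraction)        # rows held out for testing
    n_rest = n_rows - n_test                          # rows left for train + validation
    n_val = math.ceil(n_rest * val_fraction_of_rest)  # validation rows taken from the rest
    n_train = n_rest - n_val                          # everything else is training data
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(793)
print("train / validation / test rows:", n_train, n_val, n_test)
print("shares:", [round(n / 793, 3) for n in (n_train, n_val, n_test)])
```

Taking 25% of the remaining 80% is what makes the validation share roughly 20% of the full dataset.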


In [None]:
from sklearn.model_selection import train_test_split

# Combine text features, using SEP to indicate a question-answer relation
data_task_2['combined_features'] = data_task_2['Question'] + " [SEP] " + data_task_2['Essay']

# Split the combined features and labels into training, validation, and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(
    data_task_2['combined_features'], data_task_2['Overall'], test_size=0.2, random_state=42
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42  # 0.25 x 0.8 = 0.2
)



print(X_val.iloc[0], y_val.iloc[0])  #example


Universities should accept equal numbers of male and female students in every subject.To what extent do you agree or disagree?Give reasons for your answer and include any relevant examples from your own knowledge or experience. [SEP] Some people believe that educational facilities must accept the same numbers of men as women in every subject. There are both pros and cons in this argument, which will be discussed in this essay.
A tremendous number of people think that equality is crucial for education. Nowadays, more and more activists fight for women's rights. They want that every female can study everything that she prefers, notwithstanding this is socially acceptable or not. Moreover, many feminists suggest that patriarchal societies do not allow women to study equality with men. For example, when a little boy says to his younger sister that she is not intelligent enough for the men's play, a little girl grows up and thinks that she cannot study men's subjects and work men's jobs. Th

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_task_2['combined_features'] = data_task_2['Question'] + " [SEP] " + data_task_2['Essay']
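The warning above appears because `data_task_2` was created by filtering `data_csv`, so pandas cannot tell whether the new column is meant for the filtered frame or the original. One common way to avoid it, shown here as a minimal sketch on a hypothetical two-row frame rather than the real dataset, is to take an explicit copy when filtering.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real dataset
df = pd.DataFrame({
    "Task_Type": [1, 2],
    "Question": ["Describe the chart.", "Do you agree or disagree?"],
    "Essay": ["The chart shows ...", "I partly agree ..."],
})

# .copy() makes task2 an independent DataFrame, so adding a column is unambiguous
task2 = df[df["Task_Type"] == 2].copy()
task2["combined_features"] = task2["Question"] + " [SEP] " + task2["Essay"]

print(task2["combined_features"].iloc[0])
```

The original frame `df` is left untouched, and the new column exists only on the Task 2 copy.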


# Skewness (we have not worked on this part yet, but it is part of the research, so I decided to include it here)

Skewness is when the training samples are not balanced (for example, most scores fall between 4 and 9 and very few between 0 and 3). Because of that, the model might not predict scores in the under-represented range correctly. It turned out that this is the case with the Kaggle dataset we are using: most scores are between 5 and 9.
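One simple way to see this kind of imbalance is to count how often each band score occurs. The sketch below does that on a short list of hypothetical scores (not taken from the dataset); the helper names `band_distribution` and `share_below` are illustrative only.

```python
from collections import Counter

def band_distribution(scores):
    """Map each band score to the fraction of essays that received it."""
    counts = Counter(scores)
    total = len(scores)
    return {band: counts[band] / total for band in sorted(counts)}

def share_below(scores, threshold):
    """Fraction of essays scored strictly below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)

# Hypothetical scores, NOT taken from the dataset
example_scores = [5.5, 6.5, 5.0, 5.5, 7.0, 6.5, 9.0, 4.0, 6.0, 6.5]

for band, share in band_distribution(example_scores).items():
    print(f"band {band}: {share:.0%}")
print(f"share below 5.0: {share_below(example_scores, 5.0):.0%}")
```

In the notebook itself, something like `data_task_2['Overall'].value_counts().sort_index()` should give the same kind of table directly from the DataFrame.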

# Fine-tuning

The next steps are as follows:
 - Install the transformers library (already installed in Google Colab)
 - Load the pre-trained model (source of model: https://huggingface.co/google-bert/bert-base-cased)
 - Load the tokenizer
 - Convert the tokenized inputs and labels into TensorFlow datasets

In [None]:
#Calling the pre-trained model:

from transformers import TFBertModel, BertTokenizer

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Load the pre-trained BERT model
bert_model = TFBertModel.from_pretrained('bert-base-cased')


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [None]:
#Tokenize:

# Convert the series to lists.
X_train_list = X_train.tolist()
X_val_list = X_val.tolist()
X_test_list = X_test.tolist()

# Tokenize the training set
train_encodings = tokenizer(X_train_list,
                            add_special_tokens=True,  # Add [CLS] at the start and a final [SEP] at the end of each sequence
                            truncation=True,  # Cut sequences that exceed the model's maximum length (512 tokens for BERT)
                            padding=True,  # Pad shorter sequences to the length of the longest one
                            return_tensors='tf')  # Return TensorFlow tensors

# Validation set
val_encodings = tokenizer(X_val_list,
                           add_special_tokens=True,
                           truncation=True,
                           padding=True,
                           return_tensors='tf')

# The test set
test_encodings = tokenizer(X_test_list,
                            add_special_tokens=True,
                            truncation=True,
                            padding=True,
                            return_tensors='tf')

# Convert the label Series to numpy arrays
y_train_np = y_train.to_numpy()
y_val_np = y_val.to_numpy()
y_test_np = y_test.to_numpy()
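To make the `truncation` and `padding` flags more concrete, here is a toy sketch of what they do to lists of token ids. It is not the Hugging Face implementation: the helper `pad_and_truncate` and the token ids are hypothetical, and unlike the real tokenizer it does not keep the closing [SEP] token when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy version of truncation=True / padding=True on lists of token ids."""
    # Truncation: drop everything after the first max_length tokens
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the length of the longest one
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Hypothetical token ids; max_length=4 stands in for BERT's 512-token limit
ids, mask = pad_and_truncate([[101, 7, 8, 9, 10, 102], [101, 7, 102]], max_length=4)
print(ids)
print(mask)
```

One consequence worth keeping in mind: any question-plus-essay text longer than the limit loses its ending, so the model never sees the final part of long essays.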


In [None]:
# Convert Tokenized Inputs and Labels into TensorFlow Datasets

train_dataset = tf.data.Dataset.from_tensor_slices((
    {
        'input_ids': train_encodings['input_ids'],
        'token_type_ids': train_encodings['token_type_ids'],
        'attention_mask': train_encodings['attention_mask']
    },
    y_train_np
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    {
        'input_ids': val_encodings['input_ids'],
        'token_type_ids': val_encodings['token_type_ids'],
        'attention_mask': val_encodings['attention_mask']
    },
    y_val_np
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    {
        'input_ids': test_encodings['input_ids'],
        'token_type_ids': test_encodings['token_type_ids'],
        'attention_mask': test_encodings['attention_mask']
    },
    y_test_np
))

# You can batch and prefetch the dataset for better performance during training
batch_size = 8  # Depending on your GPU memory..
train_dataset = train_dataset.shuffle(len(X_train)).batch(batch_size).prefetch(tf.data.AUTOTUNE)
val_dataset = val_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)


In [None]:
from transformers import TFBertModel, BertTokenizer
import tensorflow as tf

# Load the pre-trained BERT model
bert_model = TFBertModel.from_pretrained('bert-base-cased')

# Freeze BERT's pre-trained layers so they are reused as fixed features; only the layers added on top are trained
bert_model.trainable = False

# Define model inputs
input_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name='input_ids')  # tokens
token_type_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name='token_type_ids')
attention_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name='attention_mask')

# Use the BERT model
outputs = bert_model(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)

# Use the pooled output (the [CLS] token's final hidden state passed through BERT's pooler layer) as the essay representation
pooled_output = outputs.pooler_output

# Add custom layers on top for our specific task
# A single Dense unit with no activation outputs one real-valued score, as needed for regression:
score = tf.keras.layers.Dense(1, activation=None)(pooled_output)

# Construct the final model
model = tf.keras.Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=score)

# Model summary to see all layers
model.summary()



Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 token_type_ids (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 tf_bert_model_1 (TFBertModel)  TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]',     
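Because `bert_model.trainable = False`, the only weights updated during training should be those of the final Dense layer. The sketch below works out that count, assuming the pooled output of a bert-base model has 768 values; the BERT parameter count is the one shown in the summary above, and the helper `dense_params` is hypothetical.

```python
HIDDEN_SIZE = 768                  # assumed width of the pooled output for bert-base models
FROZEN_BERT_PARAMS = 108_310_272   # BERT parameter count from the model summary

def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

trainable = dense_params(HIDDEN_SIZE, 1)   # the regression head
total = FROZEN_BERT_PARAMS + trainable

print(f"trainable parameters: {trainable}")
print(f"total parameters:     {total}")
print(f"trainable share:      {trainable / total:.6%}")
```

So the regression head is a very small fraction of the network, which helps explain why training is fast but also limits how much the model can adapt to essay scoring.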

# Training
We trained for 3 epochs. We did try more, but after the third epoch the improvement was minimal.

After each epoch, the model is also evaluated on the cross-validation set.
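The choice of when to stop can also be automated by watching the validation loss. The sketch below shows the idea in plain Python on hypothetical loss values (not results from this project); `stopping_epoch`, `patience` and `min_delta` are illustrative names. Keras offers a built-in `tf.keras.callbacks.EarlyStopping` callback that follows a similar rule.

```python
def stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss                      # meaningful improvement: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)                   # never triggered: ran all epochs

# Hypothetical validation losses per epoch, NOT measured in this project
hypothetical_val_losses = [2.10, 1.45, 1.30, 1.29, 1.31]
print(stopping_epoch(hypothetical_val_losses, patience=1, min_delta=0.05))
```

With these made-up numbers the fourth epoch improves by less than `min_delta`, so training would stop there.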

In [None]:
# Compile the model with  Adam optimizer, loss function MSE, and metrics MAE
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='mean_squared_error',
              metrics=['mean_absolute_error'])

# Training the model with epochs and batch size as required

epochs = 3

history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=epochs
)


Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
# Evaluate the model on the test dataset
test_loss, test_mae = model.evaluate(test_dataset)

print(f"Test Loss: {test_loss:.4f}")
print(f"Test Mean Absolute Error (MAE): {test_mae:.4f}")


Test Loss: 1.2144
Test Mean Absolute Error (MAE): 0.8849


The loss on the test set is 1.2144, and on the training set it is 1.2981. Because the two values are close, this is a good indication that the model is not overfitting.

In [None]:
# Extract a batch from the test dataset
for batch in test_dataset.take(1):   # You can adjust this for more batches
    inputs, labels = batch

# Predict scores using the model
predicted_scores = model.predict(inputs)

# Print out a few predictions vs actual scores
for i in range(len(predicted_scores)):
    print(f"Essay {i+1}:")
    print(f"Predicted Score: {predicted_scores[i][0]:.2f}")
    print(f"Actual Score: {labels[i].numpy()}\n")


Essay 1:
Predicted Score: 7.29
Actual Score: 5.0

Essay 2:
Predicted Score: 6.14
Actual Score: 7.5

Essay 3:
Predicted Score: 6.89
Actual Score: 8.0

Essay 4:
Predicted Score: 6.84
Actual Score: 6.5

Essay 5:
Predicted Score: 7.22
Actual Score: 9.0

Essay 6:
Predicted Score: 6.75
Actual Score: 6.5

Essay 7:
Predicted Score: 6.59
Actual Score: 5.0

Essay 8:
Predicted Score: 6.84
Actual Score: 6.0
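To connect these eight predictions to the metrics used in training, the sketch below recomputes the mean squared error (the loss) and the mean absolute error by hand from the values printed above. The two helper functions are written out for illustration rather than taken from a library, and since this is a single batch of eight essays, the result need not match the full test-set figures.

```python
# Values copied from the printed output above
predicted = [7.29, 6.14, 6.89, 6.84, 7.22, 6.75, 6.59, 6.84]
actual = [5.0, 7.5, 8.0, 6.5, 9.0, 6.5, 5.0, 6.0]

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences (the loss used in training)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences, in band-score units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(f"MSE on this batch: {mse:.4f}")
print(f"MAE on this batch: {mae:.4f}")
print(f"prediction range:  {min(predicted)} to {max(predicted)}")
print(f"actual range:      {min(actual)} to {max(actual)}")
```

The last two lines make it easy to compare how widely the predictions vary against how widely the actual scores vary in this batch.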



# References:

- The model we used:
 https://huggingface.co/google-bert/bert-base-cased

- Dataset from Kaggle:
https://www.kaggle.com/datasets/mazlumi/ielts-writing-scored-essays-dataset/data

- I relied on this page to set up the training API:
https://pypi.org/project/easy-tensorflow/1.1.1/