<a href="https://colab.research.google.com/github/Arron-The-Analyst/MSc-Dissertation-Coding-Repository/blob/master/MSc_Disseration_Codebase.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MSc Dissertation Technical Experiment



Step 1: Importing the relevant packages.


In [75]:
# Firstly, let's import the relevant packages that we will be using.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.utils.data as data_utils
import torch.optim as optim
import gc  # If you are running this on a GPU, you will likely need the garbage collector
from tqdm import tqdm

In [77]:
# Then, we need to install the transformers package from Hugging Face (Source: https://huggingface.co/transformers/index.html)
%%capture
!pip install transformers

In [55]:
# Let's now import the ALBERT sequence classifier and ALBERT tokenizer, again from Hugging Face (Source: https://huggingface.co/transformers/index.html)
%%capture
from transformers import AlbertForSequenceClassification, AlbertTokenizer

# It is ideal to run this on a GPU, but as many computers do not have one, we fall back to the CPU when CUDA is unavailable.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


Step 2: Importing the Data.

In [56]:
# Firstly, select the CSV file you wish to run the model over.
from google.colab import files
uploaded = files.upload()

Saving news.csv to news (3).csv


In [78]:
# Next, we can use pandas to import our real and fake news sources on the Coronavirus.
text_data = pd.read_csv("news.csv",header=1) # Note: make sure the file name matches the CSV file you have just uploaded

In [79]:
# Let's set out our column names and double check that all our data is present
text_data.columns = ['text','target_names','target']
text_data.head(10)

Unnamed: 0,text,target_names,target
0,One patient and five members of staff at a car...,Real,1
1,Four councils in north Wales are to go into lo...,Real,1
2,Visits to a prison have been stopped after an ...,Real,1
3,Police officers in England and Wales are to be...,Real,1
4,Students at Aberdeen University have been warn...,Real,1
5,I feel as if this pandemic has truly left me w...,Real,1
6,"Steve Thomas, leader of the council's Labour g...",Real,1
7,The World Health Organisation has announced it...,Real,1
8,More than 10 million people have downloaded th...,Real,1
9,A third wave of coronavirus is “entirely possi...,Real,1


Step 3: Set up the ALBERT tokenizer.


In [80]:
# Now, we need to tokenize our text, which we shall do with the standard pretrained ALBERT tokenizer. A custom tokenization could be a topic for future study.
token_maker = AlbertTokenizer.from_pretrained('albert-base-v1')

In [81]:
# Next, let's map this to our own dataset using a straightforward lambda function.
df_tokens = list(map(lambda t: ['[CLS]'] + token_maker.tokenize(t)[:510] + ['[SEP]'], text_data['text']))
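
The slice to 510 pieces leaves room for the two special tokens, so no sequence can exceed 512. Here is a minimal sketch of that arithmetic. The `add_special_tokens` helper and the made-up word pieces are assumptions for illustration only; they stand in for the real tokenizer output.

```python
def add_special_tokens(pieces, max_len=512):
    """Keep at most max_len - 2 pieces, then wrap them in [CLS] ... [SEP]."""
    body = pieces[:max_len - 2]
    return ['[CLS]'] + body + ['[SEP]']

# Hypothetical word pieces; the real ones come from token_maker.tokenize(...)
short_seq = add_special_tokens(['covid', 'cases', 'rise'])
long_seq = add_special_tokens(['word'] * 1000)

print(short_seq)      # ['[CLS]', 'covid', 'cases', 'rise', '[SEP]']
print(len(long_seq))  # 512
```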

In [82]:
# Transformer models have a maximum input of 512 tokens they can process, so we need to limit the length of each input sequence.
max_input = 512

In [83]:
# Now, we can easily find the index of each token using the mapping function
index_tokens = list(map(token_maker.convert_tokens_to_ids, df_tokens))

In [84]:
# And pad every sequence with zeros up to max_input, stacking the result into a single NumPy matrix.
index_matrix = np.array([xi+[0]*(max_input-len(xi)) for xi in index_tokens])
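
To see what the padding step does, here is a small sketch on plain Python lists. The `pad_ids` helper, the short `max_len` and the token ids are all illustrative assumptions; the notebook itself pads to 512 and uses the ids produced by the tokenizer.

```python
def pad_ids(ids, max_len, pad_id=0):
    """Right-pad a list of token ids with pad_id until it has max_len entries."""
    return ids + [pad_id] * (max_len - len(ids))

# Two hypothetical sequences of different lengths
rows = [[2, 698, 2041, 3], [2, 19, 3]]
padded = [pad_ids(r, max_len=6) for r in rows]

print(padded)  # [[2, 698, 2041, 3, 0, 0], [2, 19, 3, 0, 0, 0]]
```

Every row now has the same length, which is what allows the list of lists to become a rectangular matrix.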

In [85]:
# Keeping it simple we just want to know whether a system believes a news item is Real(1) or Fake(0). 
# Therefore using a binary classifier is the easiest way to achieve this.
target = text_data['target'].values

In [86]:
# Let's now create two dictionaries which map our tokens and indices to each other.

# Create flat lists of all of the words and indices
words = []
indices = []

# Use the extend function to flatten the token lists into one list of words.
for l in df_tokens:
  words.extend(l)

# Use the extend function to flatten the index lists into one list of indices.
for i in index_tokens:
  indices.extend(i)

# Use the zip function to pair them up and create our two dictionaries.
word_to_ix = dict(zip(words, indices))
ix_to_word = dict(zip(indices, words))
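
Because a dictionary keeps one entry per key, repeated tokens collapse into a single mapping. The sketch below shows this on a toy example; the words and ids are made up for illustration and are not taken from the ALBERT vocabulary.

```python
# Flattened tokens from two hypothetical articles, with their (made-up) ids
words = ['[CLS]', 'the', 'virus', '[SEP]', '[CLS]', 'the', 'vaccine', '[SEP]']
indices = [2, 14, 5000, 3, 2, 14, 6000, 3]

word_to_ix = dict(zip(words, indices))
ix_to_word = dict(zip(indices, words))

print(len(words), len(word_to_ix))  # 8 5
print(ix_to_word[6000])             # vaccine
```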

In [87]:
# ALBERT also needs an attention mask so that the padding is ignored: 1.0 marks a real token and 0.0 marks a padding position (index 0).
mask = [[float(i>0) for i in ii] for ii in index_matrix]
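
The same list comprehension can be tried on a tiny padded matrix to see the mask it produces. The ids below are illustrative assumptions, not real tokenizer output.

```python
# Two hypothetical padded rows (0 is the padding index)
padded = [[2, 698, 2041, 3, 0, 0], [2, 19, 3, 0, 0, 0]]

# 1.0 where there is a real token, 0.0 where there is padding
mask = [[float(i > 0) for i in row] for row in padded]

print(mask)  # [[1.0, 1.0, 1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]]
```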

Step 4: Format the data into a tensor so that it can be processed by the ALBERT model



In [88]:
# Let's start by setting a small batch size of 5 results. (You can change this value)
Batch_Size = 5

In [89]:
# Then we can create a method to load our data into a form which can be processed by PyTorch.
def tensor_format(text_data, mask, labels, batch_size):
    
    X = torch.from_numpy(text_data)
    X = X.long()
    
    mask = torch.tensor(mask)
    
    y = torch.from_numpy(labels)
    y = y.long()
    
    t_p = data_utils.TensorDataset(X, mask, y)
    loader = data_utils.DataLoader(t_p, batch_size=batch_size, shuffle=False)
    
    return loader
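
With `shuffle=False` the loader walks through the rows in order, and by default the last batch is simply smaller when the row count is not a multiple of the batch size. Here is a plain-Python sketch of that behaviour. The `batches` helper is hypothetical, and the 14 rows are an assumption chosen only because they give three batches at a batch size of 5.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of rows, in order (no shuffling)."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# 14 hypothetical training rows, batch size 5
sizes = [len(b) for b in batches(list(range(14)), batch_size=5)]

print(sizes)  # [5, 5, 4]
```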

In [90]:
# Finally, we can split our data using the classic train-test split function. We have set the default to an 80:20 split.

X_train, X_test, y_train, y_test = train_test_split(index_matrix, target, 
                                                    test_size=0.2, random_state=42)  

# The masks are split with the same random_state so that they stay aligned with the inputs.
train_masks, test_masks, _, _ = train_test_split(mask, index_matrix, 
                                                       test_size=0.2, random_state=42)

training_data = tensor_format(X_train, train_masks, y_train,Batch_Size)
testing_data = tensor_format(X_test, test_masks, y_test, Batch_Size)
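
Splitting the masks in a second call only works because both calls use the same seed on inputs of the same length, so the rows are shuffled identically. The sketch below illustrates that idea with the standard `random` module. The `split` helper is a simplified stand-in and an assumption, not scikit-learn's actual algorithm.

```python
import random

def split(rows, test_size, seed):
    """Shuffle row positions with a fixed seed, then cut off the test share."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Hypothetical inputs and masks, labelled by their row number
ids = ['ids_%d' % i for i in range(10)]
masks = ['mask_%d' % i for i in range(10)]

ids_train, ids_test = split(ids, 0.2, seed=42)
masks_train, masks_test = split(masks, 0.2, seed=42)

# Each training input still sits next to the mask from the same row
aligned = all(a.split('_')[1] == b.split('_')[1]
              for a, b in zip(ids_train, masks_train))
print(aligned)  # True
```

`train_test_split` also accepts more than two arrays in a single call, which would keep the inputs, masks and targets aligned without relying on a repeated seed.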

In [91]:
# Let us just run an iteration of our training data to make sure this is working as intended.
next(iter(training_data))

[tensor([[    2,   698,  2041,  ...,     0,     0,     0],
         [    2, 13538, 14792,  ...,     0,     0,     0],
         [    2,    19,    21,  ...,     0,     0,     0],
         [    2,    83,     8,  ...,     0,     0,     0],
         [    2,    21,   653,  ...,     0,     0,     0]]),
 tensor([[1., 1., 1.,  ..., 0., 0., 0.],
         [1., 1., 1.,  ..., 0., 0., 0.],
         [1., 1., 1.,  ..., 0., 0., 0.],
         [1., 1., 1.,  ..., 0., 0., 0.],
         [1., 1., 1.,  ..., 0., 0., 0.]]),
 tensor([1, 1, 0, 1, 1])]

Step 5: Load and Run the Model.


In [92]:
# Firstly, let us define the model we shall be running.
model = AlbertForSequenceClassification.from_pretrained('albert-base-v1') # Source (https://github.com/google-research/albert)

# Then, just quickly check it has loaded in.
model

AlbertForSequenceClassification(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=76

In [93]:
# Let us build a simple method to check the accuracy of our results, using tqdm to show progress.

def accuracy(model, dataloader, processor):
    model.eval()
    correct, num_samples = 0, 0
    
    with torch.no_grad():
        
        for i, batch in enumerate(tqdm(dataloader)):
            token_ids, masks, labels = tuple(t.to(processor) for t in batch)
            # Index the output rather than unpacking it, so this works whether the
            # model returns a tuple or a ModelOutput: [0] is the loss, [1] the logits.
            yhat = model(input_ids=token_ids, attention_mask=masks, labels=labels)[1]
            # Note: this thresholds the class-1 logit on its own; yhat.argmax(dim=1)
            # would instead pick whichever of the two logits is larger.
            predict = (torch.sigmoid(yhat[:,1]) > 0.5).long()
            num_samples += labels.size(0)
            correct += (predict==labels.long()).sum()
            
            # Save space by deleting unnecessary tensors
            del token_ids, masks, labels 
        gc.collect() 
        
        # Return the accuracy as a percentage.
        return correct.float()/num_samples*100
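
The prediction rule in `accuracy` applies a sigmoid to the class-1 logit and thresholds it at 0.5. For a head with two logits, that can disagree with simply picking the larger logit. The plain-Python sketch below compares the two rules on one made-up pair of logits; the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_threshold(logits):
    """Rule used in accuracy(): class 1 if sigmoid(logit for class 1) > 0.5."""
    return int(sigmoid(logits[1]) > 0.5)

def predict_argmax(logits):
    """Alternative rule: the class whose logit is largest."""
    return max(range(len(logits)), key=lambda k: logits[k])

logits = [2.0, 1.0]  # hypothetical logits for (Fake, Real)

print(predict_threshold(logits))  # 1
print(predict_argmax(logits))     # 0
```

The two rules agree whenever the class-1 logit is positive and also the larger of the two, so the difference only matters for examples like the one above.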

Step 6: Output the Results.

In [95]:
# To start with, let us set our processor again (the GPU if one is available, otherwise the CPU).
processor = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Set the number of times you wish to run the model
times_run = 4

# We use the Adam optimiser with a small learning rate to fine-tune the model.
# Note: loss_function is defined here but not used below, because the model
# returns its own loss when labels are passed to it.
loss_function = nn.BCEWithLogitsLoss()
lost = []
model.to(processor)
opt = optim.Adam(model.parameters(), lr=3e-6)
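
When labels are passed in, the model returns its own loss, which for a two-label classification head should be a softmax cross-entropy over the two logits. As a reference point for reading the loss values logged in the run below, here is a plain-Python sketch of that formula; the `cross_entropy` helper is an illustrative assumption, not the library's code.

```python
import math

def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true label."""
    peak = max(logits)  # subtract the peak for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]

# A model that cannot tell the classes apart gives equal logits
chance = cross_entropy([0.0, 0.0], label=1)
print(round(chance, 4))  # 0.6931

# Favouring the correct class pushes the loss below that value
print(cross_entropy([0.0, 2.0], label=1) < chance)  # True
```

A loss near 0.69 (the natural log of 2) therefore means the model is doing about as well as guessing between the two classes.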

In [96]:
# For each run over the training set, reset the running totals
for times in range(times_run):
    model.train()
    loss_rate = 0.0
    iterate = 0
    
    # For every batch in the training set, run the optimization.
    for i, batch in enumerate(training_data):
        iterate += 1
        token_ids, masks, labels = tuple(t.to(processor) for t in batch)
        opt.zero_grad()
        # Index the output so this works for both tuple and ModelOutput returns: [0] is the loss.
        loss = model(input_ids=token_ids, attention_mask=masks, labels=labels)[0]
        loss.backward()
        opt.step()
        loss_rate  += float(loss.item())
         
        # Get rid of obsolete information.
        del token_ids, masks, labels
    
        # Display the average loss every 20 batches (batch 1, 21, 41, ...)
        if not i%20:
            print(f'Run Number: {times+1:03d}/{times_run:03d} | '
                  f'Batch {i+1:03d}/{len(training_data):03d} | '
                  f'Average loss in last iteration: {(loss_rate/iterate):.4f}')
            
            # Reset values
            loss_rate  = 0.0
            iterate = 0

        # Optional, save RAM and perform another garbage collection.
        # gc.collect()

        # Append the loss value to a list.
        lost.append(float(loss.item()))
    
    # Print out current training accuracy after each run.
    with torch.set_grad_enabled(False):
        print(f'\n Current training accuracy: '
              f'{accuracy(model, training_data, processor):.2f}%')

Run Number: 001/004 | Batch 001/003 | Average loss in last iteration: 0.8032


0it [00:00, ?it/s]
100%|██████████| 3/3 [00:25<00:00,  8.51s/it]



 Current training accuracy: 21.43%
Run Number: 002/004 | Batch 001/003 | Average loss in last iteration: 0.7600


0it [00:00, ?it/s]
100%|██████████| 3/3 [00:25<00:00,  8.47s/it]



 Current training accuracy: 28.57%
Run Number: 003/004 | Batch 001/003 | Average loss in last iteration: 0.7203


0it [00:00, ?it/s]
100%|██████████| 3/3 [00:25<00:00,  8.44s/it]



 Current training accuracy: 64.29%
Run Number: 004/004 | Batch 001/003 | Average loss in last iteration: 0.6727


0it [00:00, ?it/s]
100%|██████████| 3/3 [00:25<00:00,  8.38s/it]


 Current training accuracy: 78.57%
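
Each run in the log above reports only "Batch 001/003" because the print is gated on `i % 20` being zero. With three batches, that is only true for the first one, so the "average" shown is the loss of a single batch. The sketch below lists which batch numbers would be reported for a given number of batches; the `logged_batches` helper is hypothetical.

```python
def logged_batches(n_batches, every=20):
    """Return the 1-based batch numbers for which `not i % every` is true."""
    return [i + 1 for i in range(n_batches) if not i % every]

print(logged_batches(3))   # [1]
print(logged_batches(45))  # [1, 21, 41]
```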





In [97]:
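# Note: this evaluates training_data, so the figure is a training accuracy;
# pass testing_data instead to measure accuracy on the held-out split.
with torch.set_grad_enabled(False):
  print(f'\n Final training accuracy:'
  f'{accuracy(model, training_data, processor):.2f}%')

0it [00:00, ?it/s]
100%|██████████| 3/3 [00:25<00:00,  8.38s/it]


 Final training accuracy:78.57%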





Step 7: Try to interpret the accuracy as a qualitative measure.


In [109]:
# Get the accuracy value and convert it from a tensor to a Python float (.item() also works when the tensor is on a GPU).
accuracy_value = accuracy(model, training_data, processor).item()

0it [00:00, ?it/s]
100%|██████████| 3/3 [00:25<00:00,  8.41s/it]


In [125]:
# Provide a contextual explanation of the obtained results.
# accuracy_value is a percentage (0-100), so the bands are set on that scale.
print("We obtained a training accuracy score of %d" %accuracy_value + "%")

if accuracy_value < 25:
  print("This indicates that: The model is struggling to distinguish between Real and Fake News Sources on the Coronavirus.")
elif accuracy_value < 50:
  print("This indicates that: The Model is underperforming at finding the difference between Real and Fake Sources on the Coronavirus.")
elif accuracy_value <= 75:
  print("This indicates that: The Model is performing okay at identifying the difference between Real and Fake Sources on the Coronavirus.")
else:
  print("This indicates that: The Model is confidently and accurately defining Real and Fake News Sources on the Coronavirus.")

We obtained a training accuracy score of 78%
This indicates that: The Model is confidently and accurately defining Real and Fake News Sources on the Coronavirus.
