<a href="https://colab.research.google.com/github/Axiom-G/Axiom-G/blob/main/Pilot_Study_notebook_v10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pilot study: Measuring Psychological Variables in Text: a tutorial

Authors: Goddard, A. & Gillespie, A.

This notebook is for replicating the results of a pilot study for measuring misunderstandings in online dialogue.

For both understanding and misunderstanding, we build two classifiers:

1. A dictionary-based classifier

2. A machine learning (ML) classifier, using Google's BERT transformer model (see [Devlin et al., 2019](https://arxiv.org/abs/1810.04805v1))

All four classifiers are binary (e.g. misunderstanding/not misunderstanding) and are validated using standard classification statistics (accuracy, precision, recall and F1 score).

*NOTE: PLEASE RUN THE FINAL CELL ("RUN ME FIRST") BEFORE ANY OTHER CELL TO INSTALL ALL PACKAGES FOR THIS TUTORIAL*

# 1. Clean and Load Data

In this section we load the data using Google Drive and perform a few very basic cleaning tasks. As the manually coded data has already been cleaned – and we want a raw text input – not much needs to be done to the data. 

In [2]:
# for setting directory
import os 
from pathlib import Path
# for mathematical / vectoral operations
import numpy as np
# for loading google drive
from google.colab import drive
# for loading dataframes and spreadsheets (.csv)
import pandas as pd

In [3]:
# load google drive
drive.mount('/content/drive/')
# pick path to data
os.chdir("/content/drive/My Drive/Colab Notebooks/Datasets/_PhDdata")
# load manually coded dataset
adf = pd.read_csv('Final_TurnCodedDataset_July2022_v2.csv')

Mounted at /content/drive/


In [4]:
# Select relevant columns for analysis - "M_C" indicates misunderstandings and "U_C" understandings
adf = adf[["turn_id", "group", "turn", "author", "text", "M_C", "U_C"]]

In [5]:
# delete any rows with no text (pandas reads empty cells as NaN, so check for both)
adf = adf[adf["text"].notna() & (adf["text"] != "")]
# delete all rows where the misunderstanding code is not zero or one
adf = adf[adf["M_C"].isin([0,1])]
# delete all rows where the understanding code is not zero or one
adf = adf[adf["U_C"].isin([0,1])]

In [6]:
# inspect dataframe
adf.head()

| | turn_id | group | turn | author | text | M_C | U_C |
|---|---|---|---|---|---|---|---|
| 0 | 162942 | Twitter_g_34679 | turn_1 | 153653 | @AskTarget messaged you again as the replaceme... | 0.0 | 0.0 |
| 1 | 162941 | Twitter_g_34679 | turn_2 | AskTarget | @153653 Thank you for taking the time to reach... | 0.0 | 1.0 |
| 2 | 8427 | Wiki8427 | turn_1 | Jayy008 | Another thing, please read [[Wikipedia:GOODCHA... | 0.0 | 0.0 |
| 3 | 8428 | Wiki8427 | turn_2 | Jayy008 | That was my fault with the Ballads, I called i... | 1.0 | 0.0 |
| 4 | 8429 | Wiki8427 | turn_3 | Jayy008 | Also, if you have a problem with my edits, rep... | 0.0 | 0.0 |


# 2. Dictionary classifier

This dictionary classifier is built using SpaCy's rule-based pattern matcher: https://spacy.io/api/matcher/ 

First, we load the dictionary and the relevant packages. We also specify a Python dict whose keys are target words and whose values are lists of synonyms. These are used to create augmented dictionary items, generating a new phrase for each synonym:


> `"I didn't mean"` becomes: `["I didn't say", "I didn't intend", "I didn't try"]`

These are then converted to (very basic) SpaCy patterns:

> `"I get it."` becomes: `[{"LOWER": "i"}, {"LOWER": "get"}, {"LOWER": "it"}, {"IS_PUNCT": True}]`

The matcher then counts, for each turn in the dialogue, all occurrences of the dictionary's items (now SpaCy patterns).
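The phrase-to-pattern and counting steps described above can be sketched without any dependencies. This is only an illustration of the idea: `tokenize`, `to_pattern` and `count_matches` are hypothetical helpers, and the regex tokenizer is a rough stand-in for SpaCy's tokenizer (which, for example, splits contractions differently), so it will not reproduce the notebook's results.

```python
import re

def tokenize(text):
    # Hypothetical stand-in for SpaCy's tokenizer:
    # runs of letters/digits/apostrophes, or single punctuation marks.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def is_punct(token):
    # A token counts as punctuation if it is a single non-word, non-space character.
    return re.fullmatch(r"[^\w\s]", token) is not None

def to_pattern(phrase):
    # One dict per token, mirroring the shape of the patterns shown above.
    return [{"IS_PUNCT": True} if is_punct(tok) else {"LOWER": tok.lower()}
            for tok in tokenize(phrase)]

def token_matches(spec, token):
    # IS_PUNCT matches any punctuation mark; LOWER compares case-insensitively.
    if spec.get("IS_PUNCT"):
        return is_punct(token)
    return token.lower() == spec["LOWER"]

def count_matches(pattern, text):
    # Slide the pattern over the token sequence and count full matches.
    tokens = tokenize(text)
    n = len(pattern)
    return sum(
        all(token_matches(spec, tok) for spec, tok in zip(pattern, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )

pattern = to_pattern("I get it.")
count = count_matches(pattern, "Oh, I GET it! I get it.")
has_hit = int(count > 0)  # binary code for the turn, as used later for validation
print(pattern)
print(count, has_hit)
```

Because the final element only requires some punctuation mark, both `I GET it!` and `I get it.` count as hits, while `I get it now` would not.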

In [34]:
# SpaCy's rule based matcher
from spacy.matcher import Matcher
# SpaCy's medium language model 
import en_core_web_md
# for counting values in a list/array
from collections import Counter 
# for displaying remaining time in a loading bar
from tqdm import tqdm 
# for accuracy statistics
from sklearn.metrics import classification_report
# load nlp() from SpaCy
nlp = en_core_web_md.load()

In [21]:
# define synonyms for augmenting existing list of dictionary items
target_dict = {"mean":["say","intend","try"],
               "saying":["meaning", "intending", "speaking", "talking", "trying", "saying"],
               "really":["actually"],
               "understand":["get","comprehend","grasp", "realize", "see", "imagine"],
               "expect":["want"],
               "knew":["had known", "had foreseen"],
               "feel":["think","believe","intend", "want","mean"],
               "realize":["get","comprehend","grasp","understand","have any idea", "know", "have knowledge of"],
               "meant":["intended", "believed", "thought", "said"],
               "said":["meant", "believed", "thought", "wrote", "written", "spoke"],
               "misunderstood":["miscomprehend", "misconstured"],
               "completely:": ["absolutely", "actually", "also", "apparently", "basically", "clearly",
               "definitely", "especially", "essentially", "even", "extremely",
               "generally", "hardly", "so", "indeed", "mostly", "particularly", "presently",
               "primarily", "principally", "probably", "prolly", "relatively", "somewhat", 
               "soo*", "totally", "truly", "ultimately", "specifically", "totally",
               "vastly", "undeniably", "undoubtedly", "definitively", "evidently",
               "fundamentally", "infallibly", "irrefutably", "necessarily", "obviously",
               "positively", "100%", "complete", "total", "absolute", "full"],
               "ok":["alright", "fair", "my mistake", "my bad", "m"],
               "know":["have knowledge of", "get","comprehend","grasp","understand","have any idea", "realize"],
               "issue":["problem", "warning"],
               "thought":["believed", "intended", "said", "meant", "perceived"],
               "serious":["genuine","honest","truthful","real"],
               "problem":["issue", "warning"],
               "actually":["really"],
               "think":["feel", "intend"],
               "offend": ["upset", "hurt"],
               "don't":["do not"],
               "you've": ["you have"],
               "doesn't":["does not"],
               "you're": ["you are"],
               "I'm":["I am"],
               "hadn't":["had not"],
               "I'd":["I had"],
               "hell":["fuck"],
               "that's":["that is"],
               "clarification":["explanation", "explanation", "answer"],
               "clarifying": ["explaining", "showing", "demonstrating"],
               "clarifies": ["clears", "explains", "clarified", "cleared", "explained"],
               "thanks": ["thank you"],
               "problem": ["issue", "issues", "grievance", "bad experience", "bad time", "bad treatment", "poor experience", "poor time", "poor treatment"],
               "difficult":["hard", "painful","unfair","unjust","awful", "terrible", "dreadful","catastrophic","scary","upsetting"],
               "point":["argument", "position","perspective", "response"]}

In [22]:
# load dictionaries:
dicts = pd.read_csv("Dictionaries - IPA (v21).csv")

## 2.1 Prepare data & dictionary items for analysis

In [23]:
# make dictionaries into lists and delete any missing values
# Understanding
und_dict = dicts["Understanding"].to_list()
und_dict = [x for x in und_dict if str(x) != "nan"]
# Misunderstanding
mis_dict = dicts["Misunderstanding"].to_list()
mis_dict = [x for x in mis_dict if str(x) != "nan"]
print("there are {} items in the understanding dictionary.".format(len(und_dict)))
print("there are {} items in the misunderstanding dictionary.".format(len(mis_dict)))

there are 69 items in the understanding dictionary.
there are 135 items in the misunderstanding dictionary.


In [24]:
# create function to map on synonyms
def dictionary_augmenter(sentence, target_dict):
  # create output list
  new_sentences = []
  # iterate over dictionary keys and values
  for k,v in target_dict.items():
    # if the word (key) is in the target sentence
    if k in sentence:
      # for every synonym in the values
      for v_ in v:
        # create a new sentence
        new_sentences.append(sentence.replace(k, v_))
  # return output      
  return new_sentences
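
To see what the augmenter produces, the sketch below runs a copy of the function above on a toy synonym dict. Note that `k in sentence` and `str.replace` work on substrings, so a key such as "mean" also fires inside "meant". The `augment_whole_words` variant is a hypothetical alternative (not part of the pilot study) that only replaces whole words; using it in the notebook may change the dictionary sizes and results reported here.

```python
import re

def dictionary_augmenter(sentence, target_dict):
    # Copy of the notebook's function: substring-based matching and replacement.
    new_sentences = []
    for k, v in target_dict.items():
        if k in sentence:
            for v_ in v:
                new_sentences.append(sentence.replace(k, v_))
    return new_sentences

def augment_whole_words(sentence, target_dict):
    # Hypothetical variant: replace a key only where it appears as a whole word.
    new_sentences = []
    for k, v in target_dict.items():
        word = re.compile(r"(?<!\w)" + re.escape(k) + r"(?!\w)")
        if word.search(sentence):
            for v_ in v:
                # A function replacement avoids treating the synonym as a regex template.
                new_sentences.append(word.sub(lambda m, rep=v_: rep, sentence))
    return new_sentences

toy_dict = {"mean": ["say", "intend"]}  # toy example, not the study's dictionary

print(dictionary_augmenter("I didn't mean that", toy_dict))
print(dictionary_augmenter("That is not what I meant", toy_dict))
print(augment_whole_words("That is not what I meant", toy_dict))
```

The second call yields non-words such as "sayt", which are harmless for matching (they simply never occur in real text) but inflate the count of generated items.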

In [25]:
# augment the understanding dictionary by looping the new function over every sentence
und_aug = [dictionary_augmenter(s, target_dict) for s in und_dict] 
# flatten list of lists [[x,y,z],[i,j,k]] --> [x,y,z,i,j,k]
und_aug = [y for x in und_aug for y in x]
print("{} new items were generated for the understanding dictionary.".format(len(und_aug)))

279 new items were generated for the understanding dictionary.


In [26]:
# join the new dictionary with the old one
und_fDict = und_aug + und_dict
# keep only unique values
und_fDict = list(set(und_fDict))
print("Final understanding dictionary: {} items.".format(len(und_fDict)))

Final understanding dictionary: 343 items.


In [27]:
# repeat for misunderstandings
mis_aug = [dictionary_augmenter(s, target_dict) for s in mis_dict]
mis_aug = [y for x in mis_aug for y in x]
print("{} new items were generated for the misunderstanding dictionary.".format(len(mis_aug)))
mis_fDict = mis_aug + mis_dict
mis_fDict = list(set(mis_fDict))
print("Final misunderstanding dictionary: {} items.".format(len(mis_fDict)))

544 new items were generated for the misunderstanding dictionary.
Final misunderstanding dictionary: 614 items.


In [28]:
# create SpaCy patterns for each list of items

def createPattern(sentence, # expects string
                  nlp): # expects SpaCy NLP 
  # output list
  out_pat = []
  # create spacy document from sentence
  doc = nlp(sentence)
  # for every token in the document
  for t in doc:
    # let x be the lowercase token
    x = t.text.lower()
    # if the token is punctuation:
    if t.dep_ == "punct":
      # add punctuation pattern to output list
      out_pat.append({"IS_PUNCT":True})
    else:
      # if not punctuation, return the pattern for a lowercase word
      out_pat.append({"LOWER":x})
  return out_pat

In [29]:
# output list for misunderstanding patterns
misPatterns = []
# for each phrase in the dictionary
for x in mis_fDict: 
  # create pattern from phrase and add to output list
  misPatterns.append(createPattern(x, nlp))

# repeat for understanding patterns
undPatterns = []
for x in und_fDict:
  undPatterns.append(createPattern(x,nlp))

## 2.2 Run dictionary

In [30]:
# function for generating pattern matches

def create_matches(df, # expects dataframe
                   patterns, # expects list of patterns
                   patterns_tag, # expects string - variable name
                   nlp, # expects SpaCy nlp()
                   text_col = "text"):
  # create SpaCy documents for each text in text column
  df["doc"] = [nlp(d) for d in df[text_col]]
  # initiate the pattern matcher
  matcher = Matcher(nlp.vocab)
  # add patterns to the matcher
  matcher.add(patterns_tag, patterns)
  # output list
  output = []
  # for document in document column
  for doc in tqdm(df.doc):
    # create matches
    matches = matcher(doc)
    # every match carries the same pattern tag, so the number of matches
    # is the frequency count for this document (zero if there are none)
    output.append(len(matches))
  # create output column of match counts
  df[patterns_tag] = output
  # return dataframe
  return df

# function for making a frequency vector binary 
def binaryMaker(list_):
  out = []
  for x in list_:
    if x > 0:
      out.append(1)
    else:
      out.append(0)
  return out

In [31]:
# Create misunderstanding results
adf = create_matches(adf, misPatterns, "M_D", nlp)
adf["M_D"] = binaryMaker(adf["M_D"].tolist())
print("there are {x} misunderstanding dictionary hits for {n} turns in the dataset".format(x = sum(adf["M_D"]), n = len(adf["M_D"])))

100%|██████████| 4032/4032 [00:05<00:00, 698.17it/s] 

there are 44 misunderstanding dictionary hits for 4032 turns in the dataset





In [32]:
# Create understanding results
adf = create_matches(adf, undPatterns, "U_D", nlp)
adf["U_D"] = binaryMaker(adf["U_D"].tolist())
print("there are {x} understanding dictionary hits for {n} turns in the dataset".format(x = sum(adf["U_D"]), n = len(adf["U_D"])))

100%|██████████| 4032/4032 [00:01<00:00, 2347.46it/s]

there are 242 understanding dictionary hits for 4032 turns in the dataset





## 2.3 Validate model

We use the following statistics for assessing the accuracy of the classifiers:

*   Accuracy: the proportion of all predictions that are correct
*   Precision: the ratio of true positives to all predicted positives (true positives plus false positives)
*   Recall: the ratio of true positives to all actual positives (true positives plus false negatives)
*   F1 score: the harmonic mean of precision and recall
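
As a worked example of these definitions, the sketch below recomputes the four statistics from the confusion matrix reported for the BERT misunderstanding classifier in section 3.1.3 (`[[1093, 24], [65, 28]]`), assuming the usual scikit-learn layout of rows as true classes and columns as predicted classes. `binary_metrics` is a hypothetical helper written for illustration, not a function used in the notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    # Each statistic follows directly from the four cells of a 2x2 confusion matrix.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Cells read from [[TN, FP], [FN, TP]] = [[1093, 24], [65, 28]]
metrics = binary_metrics(tp=28, fp=24, fn=65, tn=1093)
print({name: round(value, 2) for name, value in metrics.items()})
# {'accuracy': 0.93, 'precision': 0.54, 'recall': 0.3, 'f1': 0.39}
```

These values agree with the row for class 1 in that classification report, and show how a high accuracy can coexist with a low recall when the positive class is rare.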


In [35]:
# Misunderstanding classification report
print(classification_report(adf["M_C"].tolist(), adf["M_D"].tolist()))

              precision    recall  f1-score   support

         0.0       0.93      0.99      0.96      3723
         1.0       0.50      0.07      0.12       309

    accuracy                           0.92      4032
   macro avg       0.71      0.53      0.54      4032
weighted avg       0.90      0.92      0.90      4032



In [36]:
# Understanding classification report
print(classification_report(adf["U_C"].tolist(), adf["U_D"].tolist()))

              precision    recall  f1-score   support

         0.0       0.86      0.98      0.92      3322
         1.0       0.72      0.25      0.37       710

    accuracy                           0.85      4032
   macro avg       0.79      0.61      0.64      4032
weighted avg       0.83      0.85      0.82      4032



# 3. Machine learning classifier

We use the package ktrain (https://github.com/amaiya/ktrain) to fine-tune Google's Bidirectional Encoder Representations from Transformers (BERT) model, originally introduced by Devlin and colleagues (2019): https://github.com/google-research/bert/blob/master/README.md 

BERT is a deep learning transformer model pre-trained on large quantities of text data. Unlike classic machine learning, deep learning requires little to no feature engineering. Instead, the model learns its own features from the text. BERT's features have (in the simplest of terms) already been learned from a large dataset. Through transfer learning, they can be reused and fine-tuned for a new natural language processing task (in this case, binary classification). 

In [40]:
# Load ktrain: 
# See also: https://colab.research.google.com/drive/1ixOZTKLz4aAa-MtC6dy_sAvc9HujQmHN#scrollTo=Y8hIFvooF4vc
# and: https://medium.com/towards-data-science/bert-text-classification-in-3-lines-of-code-using-keras-264db7e7a358
# and: https://towardsdatascience.com/ktrain-a-lightweight-wrapper-for-keras-to-help-train-neural-networks-82851ba889c 
import ktrain
from ktrain import text

# for splitting dataset into train and test
from sklearn.model_selection import train_test_split

## 3.1 Misunderstanding classifier
### 3.1.1 Prepare data

The first step is to split our data into:


*   A training set (here 70% of the coded data)
*   A validation set (here 30% of the coded data).

The model only learns using the training set and we test its accuracy using the validation set. This ensures that validation is only done using new data not previously "seen" by the model. Held-out validation does not itself prevent overfitting, but it exposes it: a model that has merely memorised the precise training documents manually coded as the target variable will score poorly on the validation set.
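
The idea of a stratified 70/30 split can be sketched in plain Python. This is a simplified, hypothetical stand-in for scikit-learn's `train_test_split(..., stratify=...)` used in the next cell; it will not select the same rows, but it shows how stratification keeps the share of positive codes the same in both sets. The label counts (309 misunderstandings out of 4032 turns) are taken from the classification report in section 2.3.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.30, seed=10):
    # Group row indices by class so each class is split separately.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Hold out the same fraction of every class.
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 309 turns coded as misunderstanding (1) and 3723 coded as not (0)
labels = [1] * 309 + [0] * 3723
train_idx, test_idx = stratified_split(labels)

for name, idxs in [("train", train_idx), ("validation", test_idx)]:
    positives = sum(labels[i] for i in idxs)
    print(name, len(idxs), positives, round(positives / len(idxs), 3))
```

Both sets end up with roughly 7.7% positive codes, which matters here because an unstratified split of such an imbalanced dataset could leave the validation set with too few misunderstandings to evaluate recall reliably.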

In [41]:
# split the coded dataset into training and validation sets
# X = Texts
X = adf['text'].copy()
# y_u = Understanding codes
y_u = adf['U_C'].astype(int).copy()
# y_m = Misunderstanding codes
y_m = adf['M_C'].astype(int).copy()

# MISUNDERSTANDING test and train vectors - 30% test size and stratified for coding distribution
# X_train = list of texts in training; X_test = list of texts for validation
# y_train = list of codes for training; y_test = list of codes for validation
X_trainM, X_testM, y_trainM, y_testM = train_test_split(X, y_m, test_size=0.30, random_state=10, stratify=y_m)

# Create train dataframe, ensuring codes are integers for the ktrain.text function
MisTraindf = pd.DataFrame({"text": X_trainM, "Misunderstanding": [int(x) for x in y_trainM]})
# Create validation dataframe, ensuring codes are integers for the ktrain.text function
MisValdf = pd.DataFrame({"text": X_testM, "Misunderstanding": [int(x) for x in y_testM]})

# Create objects ready for ktrain learner object (train/test sets + preprocessing object)
(x_train_m,  y_train_m), (x_test_m, y_test_m), preproc = text.texts_from_df(train_df = MisTraindf, # training df
                                                                            text_column = "text",
                                                                            label_columns = ["Misunderstanding"], 
                                                                            val_df = MisValdf, # validation df
                                                                            preprocess_mode='bert', # preprocessing mode for feature embeddings - Google's BERT
                                                                            maxlen=350, # maximum number of tokens per document (longer texts are truncated)
                                                                            max_features=35000) # maximum vocabulary size


['not_Misunderstanding', 'Misunderstanding']
      not_Misunderstanding  Misunderstanding
1256                   1.0               0.0
3281                   1.0               0.0
2646                   1.0               0.0
3054                   1.0               0.0
276                    1.0               0.0
['not_Misunderstanding', 'Misunderstanding']
      not_Misunderstanding  Misunderstanding
481                    1.0               0.0
3726                   1.0               0.0
3715                   1.0               0.0
2768                   1.0               0.0
1424                   1.0               0.0
preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


### 3.1.2 Train model

In [42]:
# Load the bert model and create a "learner" object

# load a BERT text classifier and specify the training data - "preproc" is the relevant bert preprocessing 
model = text.text_classifier('bert', train_data=(x_train_m, y_train_m), preproc=preproc)

# Load the ktrain "learner" object that primes the algorithm - Batch size should be set according to the GPU capabilities: https://github.com/google-research/bert/blob/master/README.md 
learner = ktrain.get_learner(model, train_data=(x_train_m, y_train_m), batch_size=6)

Is Multi-Label? False
maxlen is 350
done.


In [43]:
# run the model - the BERT paper (Devlin et al., 2019) recommends learning rates of 5e-5, 3e-5, or 2e-5. A suitable rate can also be estimated using learner.lr_find() followed by learner.lr_plot()
# We train for four epochs
learner.fit_onecycle(2e-5, 4)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f83a62eef50>

### 3.1.3 Validate model

In [44]:
# Validate the model
learner.validate(val_data=(x_test_m, y_test_m))

              precision    recall  f1-score   support

           0       0.94      0.98      0.96      1117
           1       0.54      0.30      0.39        93

    accuracy                           0.93      1210
   macro avg       0.74      0.64      0.67      1210
weighted avg       0.91      0.93      0.92      1210



array([[1093,   24],
       [  65,   28]])

## 3.2 Understanding classifier

In [45]:
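# Data preparation
# note: as written, this split is stratified on the misunderstanding codes (y_m),
# so it selects the same rows as the split in section 3.1
X_trainU, X_testU, y_trainU, y_testU = train_test_split(X, y_u, test_size=0.30, random_state=10, stratify=y_m)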


UndTraindf = pd.DataFrame({"text": X_trainU, "Understanding": [int(x) for x in y_trainU]})
UndValdf = pd.DataFrame({"text": X_testU, "Understanding": [int(x) for x in y_testU]})

(x_train_u,  y_train_u), (x_test_u, y_test_u), preproc = text.texts_from_df(train_df = UndTraindf,
                                                                            text_column = "text",
                                                                            label_columns = ["Understanding"],
                                                                            val_df = UndValdf,
                                                                            preprocess_mode='bert',
                                                                            maxlen=350, 
                                                                            max_features=35000)

['not_Understanding', 'Understanding']
      not_Understanding  Understanding
1256                1.0            0.0
3281                1.0            0.0
2646                1.0            0.0
3054                1.0            0.0
276                 1.0            0.0
['not_Understanding', 'Understanding']
      not_Understanding  Understanding
481                 1.0            0.0
3726                1.0            0.0
3715                1.0            0.0
2768                1.0            0.0
1424                1.0            0.0
preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


In [46]:
# load model
model_und = text.text_classifier('bert', train_data=(x_train_u, y_train_u), preproc=preproc)
learner_und = ktrain.get_learner(model_und, train_data=(x_train_u, y_train_u), batch_size=6)

Is Multi-Label? False
maxlen is 350
done.


In [47]:
# run model
learner_und.fit_onecycle(2e-5, 4)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f832073ba50>

In [48]:
# Validate the model
learner_und.validate(val_data=(x_test_u, y_test_u))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       991
           1       0.78      0.79      0.79       219

    accuracy                           0.92      1210
   macro avg       0.87      0.87      0.87      1210
weighted avg       0.92      0.92      0.92      1210



array([[941,  50],
       [ 45, 174]])

## 3.3 Testing understanding model

The machine learning models can be used to predict a classification for any new document. Here we use the understanding model, the better performing of the two, as an example.

In [49]:
# create a predictor tool
predictor = ktrain.get_predictor(learner_und.model, preproc)

In [50]:
# check categories
predictor.get_classes()

['not_Understanding', 'Understanding']

In [51]:
predictor.predict("I don't get what you're saying")

'not_Understanding'

In [52]:
predictor.predict("Thank you so much")

'Understanding'

In [53]:
predictor.predict("Ah, I get it now")

'Understanding'

In [54]:
predictor.predict("I don't get it dude")

'not_Understanding'

# RUN ME FIRST

In [1]:
!python -m spacy download en_core_web_md
!pip install ktrain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.0/en_core_web_md-3.4.0-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ktrain
  Downloading ktrain-0.31.7.tar.gz (25.3 MB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)