<a href="https://colab.research.google.com/github/Amsterdam-Internships/Readability-Lexical-Simplification/blob/master/Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **All Analyses**
This notebook contains the components for the quantitative analysis of the data and the model outputs:

1.   Subword-tokenization inspection
2.   Generation Properties
3.   Selection Properties



## 0. Installations

In [None]:
!pip install transformers
!pip install sentencepiece
!pip install torch
!pip install spacy

import transformers
import torch
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import nltk
import spacy


from transformers import AutoTokenizer, AutoModelForMaskedLM
from collections import Counter, defaultdict

from nltk.corpus import stopwords
nltk.download('stopwords')

!python -m spacy download en_core_web_sm 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
nlp=spacy.load('en_core_web_sm')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Subword Tokenization

### Functions

In [None]:
def get_tokens(file_path, bench=False):
  all_annos =[]
  complex_words = []

  if bench:
    with open(file_path, 'r',encoding="utf-8") as infile:
      data = infile.readlines()

  else: 
    with open(file_path, 'r',encoding="ISO-8859-1") as infile:
      data = infile.readlines()

  print("dataset of size:", len(data)) 
  for row in data:
    row = row.strip()
    info = row.split("\t")

    complex_word = info[1]
    complex_words.append(complex_word)
    annotations = info[3:]
    
    if bench:
      clean_annotations = [anno[2:] for anno in annotations]
    
    else: 
      clean_annotations = annotations

    for a in clean_annotations:
      all_annos.append(a)
  
  return all_annos, complex_words

In [None]:
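def relative_subwords(abs_subwords):
  # Convert absolute counts per number of subwords into percentages
  relative_dict = dict()

  total = sum(abs_subwords.values())

  # `length` instead of `len`, so the built-in is not shadowed
  for length, freq in abs_subwords.items():
    relative_dict[length] = (freq/total)*100

  return relative_dict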


In [None]:
def count_subwordtokenization(tokenizer, words):

  # Maps number of subword tokens -> number of words split into that many tokens
  tokenize_sizes = defaultdict(int)
  for word in words:
    tokenized_word = tokenizer.tokenize(word)
    nr_of_subwords = len(tokenized_word)
    # defaultdict(int) starts missing keys at 0, so no membership check is needed
    tokenize_sizes[nr_of_subwords] += 1

  relative = relative_subwords(tokenize_sizes)

  return dict(tokenize_sizes), relative
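
The counting logic can be checked without downloading a model. The sketch below runs the same tally with a stand-in tokenizer. `ToyTokenizer` and `subword_distribution` are hypothetical names introduced here for illustration; the toy tokenizer simply cuts a word into chunks of at most four characters and does not reproduce WordPiece behaviour.

```python
from collections import Counter


class ToyTokenizer:
    """Hypothetical stand-in: splits a word into chunks of at most 4 characters."""

    def tokenize(self, word):
        chunks = [word[i:i + 4] for i in range(0, len(word), 4)]
        # Mark continuation pieces with "##", as BERT-style tokenizers do
        return chunks[:1] + ["##" + c for c in chunks[1:]]


def subword_distribution(tokenizer, words):
    # Count how many words are split into 1, 2, 3, ... subword tokens
    counts = Counter(len(tokenizer.tokenize(w)) for w in words)
    total = sum(counts.values())
    relative = {n: 100 * c / total for n, c in counts.items()}
    return dict(counts), relative


absolute, relative = subword_distribution(
    ToyTokenizer(), ["cat", "dog", "simplify", "tokenization"]
)
print(absolute)  # two words stay whole, one splits in two, one in three
print(relative)
```

Any object with a `tokenize(word)` method returning a list can be passed in, so the same function should also accept the tokenizer loaded with `AutoTokenizer.from_pretrained`.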

### Running analysis

In [None]:
# Choose Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
# tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")

In [None]:
# Choose file for analysis

data_files = ["/content/BenchLS.txt", "/content/lex.mturk.txt","/content/NNSeval.txt"]
# data_files = ["/content/dutch_sents_for_annotation.txt"]
# data_files = ["/content/dutch_train_sents.txt"]

with open("tokenization_analysis.txt", "w") as outfile:

  for file in data_files:
    dataset = file.replace("/content/","")
    print(dataset)

    if "Bench" in file or "NNSeval" in file or "dutch" in file:
      annotations, complex_words = get_tokens(file, bench=True)

    else: 
      annotations, complex_words = get_tokens(file, bench=False)

    abs_cword, rel_cword = count_subwordtokenization(tokenizer, complex_words)
    abs_annos, rel_annos = count_subwordtokenization(tokenizer, annotations)

    print("percentage of subword tokenized complex words:",100-rel_cword[1])
    print("percentage of subword tokenized annotation: ",100-rel_annos[1])

    outfile.write(f"""{dataset}
    Complex Words: 
    \t Absolute \t {abs_cword}
    \t Relative \t {rel_cword}
    \t percentage subword-tokenized \t {100-rel_cword[1]}
    Annotations:
    \t Absolute \t {abs_annos} 
    \t Relative \t {rel_annos}
    \t percentage subword-tokenized \t {100-rel_annos[1]}
    """)

BenchLS.txt
dataset of size: 929
percentage of subword tokenized complex words: 6.996770721205607
percentage of subword tokenized annotation:  12.795793163891318
lex.mturk.txt
dataset of size: 501
percentage of subword tokenized complex words: 11.177644710578832
percentage of subword tokenized annotation:  11.33010931848662
NNSeval.txt
dataset of size: 239
percentage of subword tokenized complex words: 7.94979079497908
percentage of subword tokenized annotation:  28.92238972640982
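
The percentages above are 100 minus the share of words that map to exactly one token. A minimal sketch of that step follows, using a hypothetical helper `percent_split` and a made-up distribution. It relies on `dict.get`, on the assumption that a dataset could contain no single-token words at all, in which case indexing with `[1]` would raise a `KeyError`.

```python
def percent_split(relative):
    """Share (in %) of words that are split into more than one subword token."""
    # relative maps number of subwords -> percentage of words
    return 100 - relative.get(1, 0.0)


example = {1: 80.0, 2: 15.0, 3: 5.0}  # made-up distribution
print(percent_split(example))      # 20.0
print(percent_split({2: 100.0}))   # 100.0, no word stays whole
```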


## 2. Generation Comparison

### Loading and opening files

In [None]:
# Opening the frequency file, and storing it in a dictionary
freq_dict = dict()

with open ("/content/frequency_merge_wiki_child.txt", "r") as infile:
  data = infile.readlines()

for line in data:
  line = line.strip()
  info = line.split(" ")
  word = info [0]
  freq = int(info [1])
  freq_dict[word]=freq


In [None]:
def open_lexmturk():
  with open ("/content/lex.mturk.txt","r",encoding="ISO-8859-1") as infile:
    lexmturk = infile.readlines()[1:]

  annotations = []
  sentences = []
  cwords = []
  for line in lexmturk:
    info = line.strip().split("\t")
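    sentence = info[0]
    cword = info[1]
    # Collect sentence and complex word so the returned lists are not empty
    sentences.append(sentence)
    cwords.append(cword)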
    line_annotations = [] 
    for an in  info[2:]:
      if an.count(" ")<1:
        line_annotations.append(an)
    annotations.append(set(line_annotations))

  return annotations, sentences, cwords

### Functions

In [None]:
def get_characteristics(row):
  row = row.strip().split("\t")
  sentence = row[0]
  complex_word = row[1]
  options = set(row[2:])
  
  return(sentence, complex_word, options)

In [None]:
def get_comparisons(base_model, other_model, annotations):
  prediction_dict = defaultdict(list)

  for base_line, model_line, annotations_line in zip(base_model, other_model, annotations):
    if len(base_line)>3:
      sent, complex_word, base_options = get_characteristics(base_line)
    else: continue
    if len(model_line)>3:
      sent, complex_word, model_options = get_characteristics(model_line)
    else: continue
    
    prediction_dict["all_model_preds"].append(model_options)
    prediction_dict["all_base_preds"].append(base_options)

    prediction_dict["ABM"].append(base_options.intersection(model_options, annotations_line))
    prediction_dict["AM"].append(annotations_line.intersection(model_options).difference(base_options))
    prediction_dict["AB"].append(annotations_line.intersection(base_options).difference(model_options))
    prediction_dict["BM"].append(base_options.intersection(model_options).difference(annotations_line))
    prediction_dict["A"].append(annotations_line.difference(base_options, model_options))
    prediction_dict["B"].append(base_options.difference(model_options, annotations_line))
    prediction_dict["M"].append(model_options.difference(base_options, annotations_line))

    # Append (not assign) so that "all_preds" holds one set per sentence, like the other keys
    prediction_dict["all_preds"].append(base_options.union(model_options, annotations_line))
    
  return prediction_dict
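
The seven settings correspond to the regions of a three-set Venn diagram over annotations (A), base-model predictions (B) and fine-tuned-model predictions (M). The sketch below shows this for a single sentence with made-up word sets, using set operators instead of the method calls in `get_comparisons`.

```python
# Made-up candidate sets for one complex word
annotations = {"help", "aid", "assist"}
base = {"help", "support", "aid"}
model = {"help", "assist", "back"}

regions = {
    "ABM": annotations & base & model,       # found by everyone
    "AM": (annotations & model) - base,      # gold, only the other model finds it
    "AB": (annotations & base) - model,      # gold, only the base model finds it
    "BM": (base & model) - annotations,      # both models, not in the gold set
    "A": annotations - base - model,         # gold, missed by both models
    "B": base - model - annotations,         # base model only
    "M": model - base - annotations,         # other model only
}

for name, words in regions.items():
    print(name, sorted(words))

# The regions are disjoint and together cover the union of the three sets
union = annotations | base | model
print(sum(len(w) for w in regions.values()) == len(union))
```

Because the regions partition the union, the occurrence counts of the seven settings for one sentence always add up to the number of distinct candidates for that sentence.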

In [None]:
def get_frequencies(words):
  # Flatten the per-sentence word sets into a single list
  word_list = [item for wordset in words for item in wordset]
  frequencies = []
  for word in word_list:
    # Words that are missing from the frequency list are skipped
    if word in freq_dict:
      frequencies.append(freq_dict[word])

  return frequencies
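
As an illustration of what the returned list is used for later on, the sketch below looks up frequencies for two small word sets and summarises them. The dictionary `toy_freq` and its counts are invented for the example and are unrelated to the real frequency file.

```python
import statistics

toy_freq = {"help": 900, "aid": 120, "assist": 60}    # made-up counts
word_sets = [{"help", "aid"}, {"assist", "succour"}]  # "succour" has no entry

# Flatten the sets and keep only the words that have a frequency entry
frequencies = [toy_freq[w] for ws in word_sets for w in ws if w in toy_freq]

print(len(frequencies))                # number of words with a known frequency
print(statistics.median(frequencies))
print(statistics.stdev(frequencies))   # needs at least two values
```

Note that out-of-vocabulary words are dropped silently, so the count of frequencies can be lower than the number of words in the setting.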

In [None]:
def get_pos(sents, cwords):
  complex_poss = [] 
  for sent, cword in zip(sents, cwords):
    print(sent)
    print(cword)
    doc = nlp(sent)
    pos_sequence = []
    tok_sequence = []
    for token in doc:
      pos_sequence.append(token.pos_)
      tok_sequence.append(token.text)
    
    try:
      c_index = tok_sequence.index(cword)
      c_pos = pos_sequence[c_index]

    # list.index raises ValueError when the complex word is not a single spaCy token
    except ValueError:
      c_pos = "NONE"
    complex_poss.append(c_pos)
    
  return complex_poss

### Analysis

In [None]:
from os import listdir
from os.path import isfile, join
mypath = "/content/ouputs"
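# Build the list first, then print it (printing before assignment raises a NameError on a fresh run)
models_to_analyze = [f for f in listdir(mypath) if isfile(join(mypath, f))]
print(models_to_analyze)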

['lex.mturk_FT_lr5e-07_50000sents_outputs.txt', 'lex.mturk_FT_lr5e-07_1000sents_outputs.txt', '.ipynb_checkpoints', 'lex.mturk_FT_lr5e-06_50000sents_outputs.txt', 'lex.mturk_FT_lr0.0005_1000sents_outputs.txt', 'lex.mturk_FT_lr0.0005_50000sents_outputs.txt', 'lex.mturk_FT_lr5e-06_1000sents_outputs.txt', 'lex.mturk_FT_lr0.0005_10000sents_outputs.txt', 'lex.mturk_FT_lr5e-05_1000sents_outputs.txt', 'lex.mturk_FT_lr5e-06_10000sents_outputs.txt', 'lex.mturk_FT_lr5e-07_10000sents_outputs.txt', 'lex.mturk_FT_lr5e-05_50000sents_outputs.txt', 'lex.mturk_FT_lr5e-05_10000sents_outputs.txt']


In [None]:
# Open Base Model File:
wwm_path = "/content/LMTWWMoutputs.txt"
with open (wwm_path, "r") as infile:
  base_model = infile.readlines()  

# Open Annotation File
annotations, sentences, cwords = open_lexmturk()

# Load Stopwords
stop_words = set(stopwords.words('english'))

# List of complex word POSs
pos_list = get_pos(sentences, cwords)

# List of model outputs to analyze
# models_to_analyze = ["/content/lex.mturk_FT_lr5e-06_10000sents_outputs.txt",
                    #  "/content/lex.mturk_FT_lr5e-07_1000sents_outputs.txt"] 

In [None]:
with open("generation_analysis.csv","w") as outfile:

  # Write Header
  outfile.write("model_name\tsetting\t occurence\t avg_tok_len\t median\t sigma\t stop_word_count\t stop_word_unique\n")

  # Analysis for each of the model outputs
  for model_to_analyze in models_to_analyze:

    model_name = model_to_analyze.replace("/content/","")
    model_path = "/content/ouputs/"+model_to_analyze

    with open (model_path, "r") as infile:
      model = infile.readlines()
      comparisons = get_comparisons(base_model, model, annotations)

      for name, setting in comparisons.items():
        print("\nSetting:",name)

        frequencies = get_frequencies(setting)
        occurence = len(frequencies)
        print("This setting occured", occurence, "times")

        # Only compute statistics if at least two frequencies were found (statistics.stdev needs two data points):
        if occurence> 1:
        
          flattened =  [item for wordset in setting for item in wordset]
          lengths = [len(i) for i in flattened]
          
          # Amount of generated stopwords
          stop_word_count = 0  
          for stop_word in stop_words:
            stop_word_count += flattened.count(stop_word)
          stop_word_unique = len(stop_words.intersection(set(flattened)))

          # Frequency statistics
          avg_tok_len = sum(lengths) / len(lengths)
          print(avg_tok_len)
          median = statistics.median(frequencies)
          sigma = statistics.stdev(frequencies)
          
        else:
          stop_word_count = 0
          stop_word_unique = 0
          avg_tok_len = 0
          median = 0
          sigma = 0
        
        outfile.write(model_name+"\t"+
                      name+"\t"+
                      str(occurence)+"\t"+
                      str(avg_tok_len)+"\t"+
                      str(median)+"\t"+
                      str(sigma)+"\t"+
                      str(stop_word_count)+"\t"+
                      str(stop_word_unique)+"\n")



Setting: all_model_preds
This setting occured 4998 times
7.5754301720688275

Setting: all_base_preds
This setting occured 5000 times
7.485

Setting: ABM
This setting occured 1190 times
6.81344537815126

Setting: AM
This setting occured 218 times
6.477064220183486

Setting: AB
This setting occured 289 times
6.73356401384083

Setting: BM
This setting occured 2063 times
7.965584100824042

Setting: A
This setting occured 3374 times
6.878685762426285

Setting: B
This setting occured 1458 times
7.502057613168724

Setting: M
This setting occured 1527 times
7.798952193844139

Setting: all_preds
This setting occured 112 times
1.0

Setting: all_model_preds
This setting occured 4980 times
7.327510040160643

Setting: all_base_preds
This setting occured 5000 times
7.485

Setting: ABM
This setting occured 1028 times
6.761673151750973

Setting: AM
This setting occured 199 times
6.57286432160804

Setting: AB
This setting occured 451 times
6.880266075388026

Setting: BM
This setting occured 1564 times

### Dutch: Analysis of non-word predictions

In [None]:
with open("/content/drive/MyDrive/Thesis/code/Dutch/Error Analysis/outputs.txt","r") as infile:
  data = infile.readlines()

In [None]:
print(len(data))
only_chars = [] 
also_chars = []

for row in data:
  info = row.strip().split("\t")
  sentence = info [0]
  cword = info [1]
  predictions = info[2:]
  one_char_preds = len([i for i in predictions if len(i)==1])

  if one_char_preds == 0:
    continue
  elif one_char_preds in range(1,10):
    also_chars.append(row)
    print("less than ten")
    print(sentence, cword, predictions)
  else: 
    only_chars.append(row)
    print("there are ten")
    print(sentence, cword, predictions)

1024
less than ten
De gemeente Amsterdam stimuleert mobiliteit en heeft daarbij ook aandacht voor de mogelijkheden van promotie en demotie. stimuleert ['biedt', 'steunt', 'bevorderd', 'gestimuleerd', 'c', 'l', 'l', 'l', 'l', 'l']
less than ten
prestatie met bijbehorend stimulerend gedrag. stimulerend ['gezond', 'positief', 'prettig', 'aantrekkelijk', 'creatief', 'o', 'l', 'i', 'i', 'i']
less than ten
Geef aandacht aan diversiteit, b.v. door verschillende feestdagen te vieren diversiteit ['cultuur', 'schoonheid', 'innovatie', 'à', 'j', 'j', 'j', 'j', 'j', 'j']
there are ten
Daar kun je het met elkaar over hebben. daar ['m', 'i', 'p', 'k', 'à', 'è', 'è', 'è', 'è', 'è']
less than ten
Ook vervult de monitor een verantwoordingsfunctie. monitor ['bestuurder', 'voorzitter', 'burgemeester', 'directeur', 'werknemer', 'ambtenaar', 'secretaris', 'c', 'c', 'c']
less than ten
van ‘High Impact Crime’ (HIC) die in 2011 in de regio Amsterdam-Amstelland is opgezet. impact ['effect', 't', 'à', 'i', 'y',

In [None]:
with open ("only_chars.txt","w") as outfile:
  outfile.writelines(only_chars)

with open ("also_chars.txt","w") as outfile:
  outfile.writelines(also_chars)
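
The split into `only_chars` and `also_chars` depends only on how many predictions are a single character long. A compact sketch of that rule follows, with a hypothetical helper `classify_row` and the assumption that every row holds ten predictions; the two example lists are taken from the output above.

```python
def classify_row(predictions, n_expected=10):
    """Label a row by how many of its predictions are single characters."""
    one_char = sum(1 for p in predictions if len(p) == 1)
    if one_char == 0:
        return "none"   # all predictions look like words
    if one_char < n_expected:
        return "some"   # would go to also_chars
    return "all"        # would go to only_chars


mixed = ['biedt', 'steunt', 'bevorderd', 'gestimuleerd', 'c', 'l', 'l', 'l', 'l', 'l']
chars = ['m', 'i', 'p', 'k', 'à', 'è', 'è', 'è', 'è', 'è']

print(classify_row(mixed))  # some
print(classify_row(chars))  # all
```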

## 3. Selection Analysis