# Using Universal Dependencies to Get Word Order Information

In notebooks, you can intermingle code with text.

In [None]:
# Code and text from https://colab.research.google.com/drive/1d7LO_0665DYw6DrVJXXautJAJzHHqYOm#scrollTo=4WwZYkNr1bPN
# This cell downloads the Universal Dependencies treebank corpus. It fetches every treebank, but we start with only the
# English GUM treebank. It also installs the conllu package, which was developed to parse data in the CoNLL-U format, a
# format commonly used for linguistically annotated files. The parsed sentences end up in a list named sentences.

#Install conllu package, download the UD Treebanks corpus and unpack it.
!pip install conllu
!wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz
!tar zxf ud-treebanks-v2.5.tgz

#The imports needed to open and parse the conllu file. At the end we'll have a list of dicts.
from io import open
import conllu
import glob
from collections import defaultdict
import numpy as np
import pandas as pd



In [None]:
file_path = "ud-treebanks-v2.5/UD_English-GUM/en_gum-ud-train.conllu"

with open(file_path, 'r', encoding='utf-8') as f:
  data = f.read()

# Parse the file using the conllu library
sentences = conllu.parse(data)

In [None]:
s = sentences[10]

In [None]:
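# Print each token with its index, part of speech, dependency relation, and head index
for token in s:
  print(token, token["id"], token["upos"], token["deprel"], token["head"])

# Every token has the same fields; inspect them on the last token from the loop
print(token.keys())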

Thus 1 ADV advmod 16
, 2 PUNCT punct 1
the 3 DET det 4
time 4 NOUN nsubj 16
it 5 PRON nsubj 6
takes 6 VERB acl:relcl 4
and 7 CCONJ cc 9
the 8 DET det 9
ways 9 NOUN conj 4
of 10 SCONJ mark 12
visually 11 ADV advmod 12
exploring 12 VERB acl 9
an 13 DET det 14
artwork 14 NOUN obj 12
can 15 AUX aux 16
inform 16 VERB root 0
about 17 ADP case 19
its 18 PRON nmod:poss 19
relevance 19 NOUN obl 16
, 20 PUNCT punct 21
interestingness 21 NOUN conj 19
, 22 PUNCT punct 27
and 23 CCONJ cc 27
even 24 ADV advmod 27
its 25 PRON nmod:poss 27
aesthetic 26 ADJ amod 27
appeal 27 NOUN conj 19
. 28 PUNCT punct 16
dict_keys(['id', 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc'])
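
The ten keys printed above correspond to the ten tab-separated columns of a CoNLL-U token line. As a rough illustration of what the parser does for each line, here is a minimal pure-Python sketch. `parse_token_line` is a hypothetical helper, not part of the conllu package, and it ignores comment lines, multiword tokens, and empty nodes. The sample line fills in only the fields visible in the output above; every other column is left as `_`, the CoNLL-U placeholder for an unspecified value.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

def parse_token_line(line):
    """Hypothetical helper: turn one CoNLL-U token line into a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    token = dict(zip(FIELDS, values))
    # For ordinary tokens the word index and head index are integers
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    return token

# Token 4 of the sentence above; unknown columns are left as "_"
line = "4\ttime\t_\tNOUN\t_\t_\t16\tnsubj\t_\t_"
token = parse_token_line(line)
print(token["form"], token["id"], token["upos"], token["deprel"], token["head"])
```

In practice the conllu package handles the cases this sketch skips, so it remains the better choice for real treebank files.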


In [None]:
# A dependent whose id is smaller than its head's id appears earlier in the sentence than its head
for token in s:
  if token["deprel"] == "nsubj" and (token["id"] < token["head"]):
    print("subject comes first!", token, token["id"], token["upos"], token["deprel"], token["head"])
  if token["deprel"] == "obj" and (token["id"] < token["head"]):
    print("object comes first!", token, token["id"], token["upos"], token["deprel"], token["head"])


subject comes first! time 4 NOUN nsubj 16
subject comes first! it 5 PRON nsubj 6


In [None]:
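# For each relation of interest, record whether the dependent precedes its head
arg_first = defaultdict(list)
for token in s:
  if token["deprel"] in ["nsubj", "obj", "case"]:
    arg_first[token["deprel"]].append(token["id"] < token["head"])
# Averaging the booleans gives the proportion of dependent-first cases per relation
means = {i: np.round(np.mean(arg_first[i]), 2) for i in arg_first}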


In [None]:
means

{'nsubj': 1.0, 'obj': 0.0, 'case': 1.0}
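
To make the arithmetic behind `means` concrete, the sketch below recomputes the same proportions by hand for sentence 10. It uses plain Python instead of NumPy, and the `tokens` list is typed in from the token printout earlier in this notebook, so it is an assumption of this example rather than something read from the treebank.

```python
from collections import defaultdict

# (form, id, deprel, head) for the nsubj, obj, and case tokens of sentence 10
tokens = [
    ("time", 4, "nsubj", 16),
    ("it", 5, "nsubj", 6),
    ("artwork", 14, "obj", 12),
    ("about", 17, "case", 19),
]

dependent_first = defaultdict(list)
for form, idx, deprel, head in tokens:
    # True when the dependent precedes its head in the sentence
    dependent_first[deprel].append(idx < head)

# The mean of a list of booleans is the share of True values
proportions = {rel: round(sum(flags) / len(flags), 2) for rel, flags in dependent_first.items()}
print(proportions)
```

Both subjects precede their verbs, the single object follows its verb, and the single case marker precedes its noun, which is why the values are all exactly 0 or 1 for one sentence.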

In [None]:
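def get_counts(fn):
  """Return (treebank name, proportion of dependents that precede their head) for one CoNLL-U file."""
  # "ud-treebanks-v2.5/UD_English-GUM/..." -> "English-GUM": take the folder name and drop the "UD_" prefix
  lang = fn.split("/")[1][3:]
  with open(fn, 'r', encoding='utf-8') as f:
    data = f.read()
  # Parse the file using the conllu library
  sentences = conllu.parse(data)
  # For each relation, record whether the dependent comes before its head
  arg_first = defaultdict(list)
  for sentence in sentences:
    for token in sentence:
      if token["deprel"] in ["nsubj", "obj", "case"]:
        arg_first[token["deprel"]].append(token["id"] < token["head"])
  # The mean of the booleans is the proportion of dependent-first cases
  means = {i: np.round(np.mean(arg_first[i]), 2) for i in arg_first}
  print(lang, means)
  return (lang, means)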
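
The expression `fn.split("/")[1][3:]` relies on the path having exactly the shape `ud-treebanks-v2.5/UD_<Name>/<file>`. The sketch below shows what it extracts and, as one possible alternative, a hypothetical helper `treebank_name` that reads the parent folder instead of a fixed position in the path. The example file name is an assumption based on the naming pattern of the training file used earlier.

```python
from pathlib import PurePosixPath

def treebank_name(path):
    """Hypothetical alternative to fn.split("/")[1][3:]."""
    folder = PurePosixPath(path).parent.name  # e.g. "UD_English-GUM"
    # Drop the "UD_" prefix only if it is actually there
    return folder[3:] if folder.startswith("UD_") else folder

example = "ud-treebanks-v2.5/UD_English-GUM/en_gum-ud-test.conllu"
original_result = example.split("/")[1][3:]
alternative_result = treebank_name(example)
print(original_result, alternative_result)
```

Either version works for the paths returned by the glob pattern used in this notebook; the helper only matters if the corpus is unpacked somewhere deeper in the file system.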

In [None]:
ud_files = glob.glob("ud-treebanks-v2.5/*/*-test.conllu")
results = [get_counts(i) for i in ud_files]

Faroese-OFT {'nsubj': 0.87, 'case': 0.99, 'obj': 0.03}
Japanese-Modern {'obj': 1.0, 'case': 0.0, 'nsubj': 1.0}
Scottish_Gaelic-ARCOSG {'nsubj': 0.03, 'obj': 0.34, 'case': 1.0}
Finnish-TDT {'nsubj': 0.89, 'obj': 0.29, 'case': 0.11}
Ukrainian-IU {'nsubj': 0.76, 'obj': 0.21, 'case': 1.0}
Swedish-PUD {'nsubj': 0.83, 'case': 1.0, 'obj': 0.03}
Italian-PoSTWITA {'case': 1.0, 'nsubj': 0.81, 'obj': 0.17}
Latvian-LVTB {'nsubj': 0.78, 'obj': 0.35, 'case': 0.99}
Ancient_Greek-PROIEL {'nsubj': 0.66, 'obj': 0.39, 'case': 0.99}
Amharic-ATT {'obj': 0.88, 'case': 0.47, 'nsubj': 0.55}
Belarusian-HSE {'nsubj': 0.81, 'obj': 0.05, 'case': 1.0}
Maltese-MUDT {'nsubj': 0.75, 'case': 0.99, 'obj': 0.06}
Russian-Taiga {'case': 0.99, 'nsubj': 0.8, 'obj': 0.27}
Old_Russian-RNC {'case': 0.99, 'obj': 0.61, 'nsubj': 0.74}
Turkish-PUD {'nsubj': 1.0, 'case': 0.01, 'obj': 1.0}
English-GUM {'case': 0.96, 'nsubj': 0.96, 'obj': 0.02}
Indonesian-GSD {'case': 0.97, 'nsubj': 0.98, 'obj': 0.05}
Czech-PUD {'case': 1.0, 'nsubj':

In [None]:
pd.DataFrame([i[1] for i in results], index=[i[0] for i in results]).to_csv("results.csv")
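
One way to read these proportions is as rough word order labels. In Universal Dependencies the head of an `obj` is the verb and the head of a `case` marker is the noun, so a high `obj` value suggests object-before-verb order and a high `case` value suggests prepositions. The sketch below applies that reading to three rows copied from the output above. The `describe` helper, the 0.5 threshold, and the labels are assumptions made for illustration, and `case` also covers markers other than adpositions, so the labels are only approximate.

```python
# Three rows copied from the results printed above
rows = {
    "Japanese-Modern": {"obj": 1.0, "case": 0.0, "nsubj": 1.0},
    "Turkish-PUD": {"nsubj": 1.0, "case": 0.01, "obj": 1.0},
    "English-GUM": {"case": 0.96, "nsubj": 0.96, "obj": 0.02},
}

def describe(means, threshold=0.5):
    """Label a treebank by whether objects and case markers mostly precede their heads."""
    verb_order = "OV" if means["obj"] > threshold else "VO"
    adposition = "prepositions" if means["case"] > threshold else "postpositions"
    return verb_order, adposition

labels = {lang: describe(means) for lang, means in rows.items()}
for lang, (verb_order, adposition) in labels.items():
    print(lang, verb_order, adposition)
```

The same function could be applied to every row of `results.csv`, although treebanks with values near 0.5 (such as the Amharic `case` value above) would need a closer look than a single threshold allows.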

# Getting the Probability of a Sentence Using Minicons

Minicons is a library built on Hugging Face Transformers that makes it easy to ask linguistic questions of language models. You give it the name of a model, and it returns the probabilities that model assigns to text.

In [None]:
!pip install torch transformers minicons

from minicons import scorer
import torch

# Load the GPT-2 model with minicons
gpt2_scorer = scorer.IncrementalLMScorer('gpt2')

# Define the sentences you want to score
sentence1 = "Colorless green ideas sleep furiously."
sentence2 = "Furiously sleep ideas green colorless."

# The reduction sums the token log probabilities and negates the total, so each score is a
# negative log probability: a lower score means the model finds the sentence more likely.
print(gpt2_scorer.sequence_score(sentence1, reduction = lambda x: -x.sum(0).item()))
print(gpt2_scorer.sequence_score(sentence2, reduction = lambda x: -x.sum(0).item()))

# Convenience wrapper that scores one sentence the same way
def get_sequence_score(s, gpt2_scorer=gpt2_scorer):
  return (gpt2_scorer.sequence_score(s, reduction = lambda x: -x.sum(0).item()))


[52.596832275390625]
[67.12906646728516]
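
The two numbers above are sums of negated token log probabilities, so a larger value means a less probable sentence. The sketch below walks through that arithmetic with made-up per-token probabilities: the values in `token_probs` are hypothetical and are not GPT-2 outputs, and natural logarithms are assumed. It also computes a per-token average, which is one common way to compare sentences of different lengths.

```python
import math

# Hypothetical conditional probabilities P(token | preceding tokens); not real GPT-2 output
token_probs = [0.2, 0.05, 0.1, 0.5]

log_probs = [math.log(p) for p in token_probs]

# Chain rule: the sentence log probability is the sum of the token log probabilities
total_score = -sum(log_probs)
mean_score = total_score / len(log_probs)

# Exponentiating the negated total recovers the joint probability of the sentence
sentence_prob = math.exp(-total_score)

print(round(total_score, 3), round(mean_score, 3), sentence_prob)
```

Because the total grows with every added token, longer sentences tend to get higher summed scores even when each word is unsurprising, which is worth keeping in mind when comparing sentences of different lengths.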


In [None]:
get_sequence_score("Incandescent aardvark brains activate extravagantly.")

[61.90435791015625]

In [None]:
get_sequence_score("The key to the cabinets are on the table.")

[42.31085205078125]

In [None]:
texts = [i.metadata["text"] for i in sentences]

In [None]:
import nltk
import random
nltk.download('punkt')

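def swap_words(text, p=.05):
  # Walk left to right over adjacent word pairs. Note the direction of the comparison: a pair is
  # swapped when random.random() > p, so with the default p=.05 about 95% of the pairs are swapped.
  words = nltk.word_tokenize(text)
  for i in range(len(words) - 1):
    if random.random() > p:
      words[i], words[i + 1] = words[i + 1], words[i]
  return " ".join(words)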

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
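
To see how the swap probability shapes the result without downloading NLTK data, here is a small self-contained sketch. `swap_adjacent` is a hypothetical variant of `swap_words`: it splits on whitespace instead of using `nltk.word_tokenize`, takes the swap probability directly (a pair is swapped when the random draw falls below `swap_prob`), and uses its own seeded generator so runs are repeatable.

```python
import random

def swap_adjacent(text, swap_prob, rng):
    """Hypothetical variant of swap_words: swap each adjacent pair with probability swap_prob."""
    words = text.split()
    for i in range(len(words) - 1):
        if rng.random() < swap_prob:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the run is repeatable
original = "Colorless green ideas sleep furiously ."

unchanged = swap_adjacent(original, 0.0, rng)  # never swaps
shifted = swap_adjacent(original, 1.0, rng)    # always swaps, so the first word travels to the end
scrambled = swap_adjacent(original, 0.5, rng)  # swaps roughly half of the pairs

print(unchanged)
print(shifted)
print(scrambled)
```

The always-swap case shows why a high swap rate mostly carries one word rightward through the sentence in a single pass, rather than shuffling the words uniformly.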


In [None]:
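# Print the sentence with its rounded score, scramble it a little more, and repeat 100 times.
# The first line printed is the original, unscrambled sentence.
text = texts[100]
for i in range(100):
  print(text, round(get_sequence_score(text)[0]))
  text = swap_words(text)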


For this reason, institutions may also be slow to investigate accusations of fraud, and they may try to keep their discoveries in-house to protect their reputations. 106
this reason , institutions may also For be to investigate accusations of fraud , and they may try to keep their discoveries in-house to protect their reputations . slow 168
reason , institutions may also For be to this accusations of fraud investigate and they may try to keep their discoveries in-house to protect their reputations . slow , 180
, institutions may also For be to this accusations of fraud investigate and they may try to keep their discoveries in-house to protect their reputations . slow , reason 183
institutions may also For be to this accusations of fraud investigate , they may try to keep their discoveries in-house to protect their reputations . slow , reason and 179
may also For be to this accusations of fraud investigate institutions they may try to keep their discoveries in-house to protect their rep