<a href="https://colab.research.google.com/github/Jagoda222/LoLa---group-8/blob/main/LoLa_tools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tools for the Logic & Language course

by Lasha.Abzianidze@gmail.com

# SpaCy processing

In [None]:
import spacy
print(f"spaCy version={spacy.__version__}")

In [None]:
# downloading spaCy's medium/large model if needed (small is downloaded by default)
# !python -m spacy download en_core_web_lg

In [None]:
NLP = spacy.load("en_core_web_sm")

In [None]:
sent = "This is a sample sentence to be parsed"
doc = NLP(sent)

In [None]:
spacy.displacy.render(doc, style='dep', jupyter=True, options={'fine_grained':True, 'compact':False})

In [None]:
# if many sentences need to be parsed, use pipe (it processes the texts in batches)
# the already loaded NLP pipeline is reused, so there is no need to load the model again
docs_sm = list(NLP.pipe(1000 * [sent]))
print(len(docs_sm))

# CoreNLP parsing

CoreNLP is used through the [Stanza CoreNLP interface](https://github.com/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb). CoreNLP provides both constituency and dependency trees. For English, dependency trees can be obtained either directly from a dependency parser or indirectly by converting constituency trees into dependency trees.

In [None]:
!pip install stanza

In [None]:
import stanza
import os
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)
# Set the CORENLP_HOME environment variable to point to the installation location
os.environ["CORENLP_HOME"] = corenlp_dir
# Import client module
from stanza.server import CoreNLPClient
# src: https://github.com/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb
from nltk.tree import Tree

## Dependency parsing

In [None]:
sents = ["This is a sample sentence to be parsed", "A brown fox is jumping over the lazy dog"]

In [None]:
# Getting dependency trees from a dependency parser
# https://stanfordnlp.github.io/CoreNLP/depparse.html
with CoreNLPClient(annotators='tokenize,pos,depparse',
                   memory='4G', endpoint='http://localhost:9021', be_quiet=True,
                   output_format='json') as client:
    core_dep_parses = [ client.annotate(s)['sentences'][0] for s in sents ]

## Constituency parsing

In [None]:
# Getting constituency trees (and the dependency trees converted from them) from a constituency parser
# takes 3-4min
# https://stanfordnlp.github.io/CoreNLP/parse.html
with CoreNLPClient(annotators='tokenize,pos,parse',
                   memory='4G', endpoint='http://localhost:9030', be_quiet=True,
                   output_format='json') as client:
    core_con_parses = [ client.annotate(s)['sentences'][0] for s in sents ]

In [None]:
# Drawing CoreNLP constituency trees with NLTK's Tree object
Tree.fromstring(core_con_parses[0]['parse']).pretty_print()

# AMuSE word senses

We use an API to predict word senses with AMuSE-WSD, a multilingual word sense disambiguation system. For more details, see the [about page](http://nlp.uniroma1.it/amuse-wsd/about).  

In [None]:
import requests

In [None]:
headers = {'accept': 'application/json', 'Content-Type': 'application/json'}
url = 'http://nlp.uniroma1.it/amuse-wsd/api/model'

In [None]:
# disambiguating English sentences
# (named wsd_input to avoid shadowing Python's built-in input)
wsd_input = [
    {'text': "This table is too long for this room", "lang": "EN" },
    {'text': "The bank is wet", "lang": "EN" }
]

In [None]:
res = requests.post(url, json=wsd_input, headers=headers).json()

In [None]:
res

# Prover9

The NLTK-native tableau prover for FOL cannot handle the equality predicate properly. That is why we will use Prover9, a proper theorem prover for FOL. Fortunately, it is nicely integrated into NLTK classes.  
We need to download Prover9 because it is no longer available by default in recent NLTK versions.

In [None]:
import nltk
print(nltk.__version__)

In [None]:
%%bash
prover9_file_name="p9m4-v05.tar.gz"
[[ ${prover9_file_name} =~ (.+)\.tar\.gz ]]
prover9_folder_name=${BASH_REMATCH[1]}
if [[ ! -d ${prover9_folder_name} ]]; then
  curl -sL "https://www.cs.unm.edu/~mccune/prover9/gui/$prover9_file_name" -o ${prover9_file_name}
  tar -xzf ${prover9_file_name}
  rm -rf 'prover9'
  mv ${prover9_folder_name} 'prover9'
  rm ${prover9_file_name}
fi

In [None]:
prover9 = nltk.Prover9()
prover9.config_prover9("/content/prover9/bin")

In [None]:
str2exp = nltk.sem.Expression.fromstring

In [None]:
premises = ["all x.(man(x) -> walks(x))", "not walks(Alex)"]
conclusion = "some y. not man(y)"
prover9.prove(str2exp(conclusion), [ str2exp(p) for p in premises ])

In [None]:
conclusion = "exists x. (L(x) & exists y. (E(y) & y = x)) -> exists x.(L(x) & E(x))"
prover9.prove(str2exp(conclusion), [])

# CFG parsing and generation

In [None]:
import nltk
print(nltk.__version__)

In [None]:
# to graphically display parse trees (when ascii display is not enough)
!pip install svgling
import svgling

## Parsing with CFG

In [None]:
# a toy grammar with a PP-attachment ambiguity
groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas'
    V -> 'shot'
    P -> 'in'
    """)

In [None]:
sent = 'I shot an elephant in my pajamas'.split()

In [None]:
parser = nltk.ChartParser(groucho_grammar)

In [None]:
for tree in parser.parse(sent):
    print(tree)

In [None]:
# Alternatively you can print trees in a more beautiful way
for tree in parser.parse(sent):
    tree.pretty_print()

In [None]:
# graphical display of the tree
tree

## Generating with CFG

We can also generate sentences from a CFG. All generated sentences are automatically parsable by the grammar that generated them. Some grammars generate infinitely many sentences. In that case, one needs to limit the output of the generation; otherwise the code won't terminate.  
Read more about generation here: [howto](https://www.nltk.org/howto/generate.html), [source](https://www.nltk.org/_modules/nltk/parse/generate.html).
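To make the termination issue concrete, here is a minimal sketch with a toy recursive grammar (the grammar is an illustrative assumption, not part of the course material). The rule `S -> 'a' S` can be applied any number of times, so the grammar licenses infinitely many sentences, and the `n` argument of `generate` caps how many of them are produced.

```python
import nltk
from nltk.parse.generate import generate

# Hypothetical toy grammar: the second alternative is recursive,
# so the language it describes is infinite
recursive_grammar = nltk.CFG.fromstring("""
    S -> 'b' | 'a' S
    """)

# n caps the number of generated sentences; each sentence is a list of words
limited = list(generate(recursive_grammar, n=5))
for words in limited:
    print(' '.join(words))
```

The generator expands rules depth-first in the order they are listed, which is why the non-recursive alternative comes first here. The `depth` argument can additionally bound the recursion.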

In [None]:
from nltk.parse.generate import generate

In [None]:
# Let's limit generation by the number of sentences (n=20) and by the maximum depth (depth=4).
# Note that each generated sentence is a list of words
for sent in generate(groucho_grammar, n=20, depth=4):
    print(sent)

## Parsing with feature-based CFG

This section largely follows [Chapter 9](https://www.nltk.org/book/ch09.html), but also digs into several issues that are left unexplained in the NLTK book.  
[This page](https://www.nltk.org/howto/featgram.html) also provides usage examples of feature-based grammars, but the instructions might seem relatively terse.  
Note that Section 3 in Chapter 9 goes too deep into feature-based grammars and syntactic theory. Feel free to read it if you find it interesting, but expect to find some parts unclear or shallow. It is also useful to know the limits of the implementation of feature-based grammars in NLTK.

In [None]:
# nltk.download('book_grammars') # download predefined grammars
from nltk.grammar import FeatureGrammar
from nltk.parse.featurechart import FeatureChart, FeatureChartParser
from nltk.featstruct import Feature
from typing import List

In [None]:
# The grammar is taken from grammars/book_grammars/feat0.fcfg
GR1 = """
% start S
# ###################
# Grammar Productions
# ###################
# S expansion productions
S -> NP[NUM=?n] VP[NUM=?n]
# NP expansion productions
NP[NUM=?n] -> N[NUM=?n]
NP[NUM=?n] -> PropN[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
NP[NUM=pl] -> N[NUM=pl]
# VP expansion productions
VP[TENSE=?t, NUM=?n] -> IV[TENSE=?t, NUM=?n]
VP[TENSE=?t, NUM=?n] -> TV[TENSE=?t, NUM=?n] NP
# ###################
# Lexical Productions
# ###################
Det[NUM=sg] -> 'this' | 'every'
Det[NUM=pl] -> 'these' | 'all'
Det -> 'the' | 'some' | 'several'
PropN[NUM=sg]-> 'Kim' | 'Jody'
N[NUM=sg] -> 'dog' | 'girl' | 'car' | 'child'
N[NUM=pl] -> 'dogs' | 'girls' | 'cars' | 'children'
IV[TENSE=pres,  NUM=sg] -> 'disappears' | 'walks'
TV[TENSE=pres, NUM=sg] -> 'sees' | 'likes'
IV[TENSE=pres,  NUM=pl] -> 'disappear' | 'walk'
TV[TENSE=pres, NUM=pl] -> 'see' | 'like'
IV[TENSE=past] -> 'disappeared' | 'walked'
TV[TENSE=past] -> 'saw' | 'liked'
"""

In [None]:
# creating a grammar object from the string
gr1 = FeatureGrammar.fromstring(GR1)
print(f"The type of gr1 is {type(gr1)}")

In NLTK, feature structures are sets of attribute-value pairs, but they also have a non-terminal symbol, e.g., `S`, `NP`, `VP`, etc. What are these symbols in feature structures? Each feature structure has a special feature called `type`, and its values are these symbols. Let's get the start symbol of the grammar.

In [None]:
print(f"The start symbol of gr1 in a raw format is {gr1.start()}")
print(f"The start symbol of gr1 in a clean format is {gr1.start()[Feature('type')]}")

Now let's create a parser based on the grammar object. The parser we create is a chart parser (you have seen a chart parser in the previous section too). You can read more about chart parsing in [J&M (Ch13-13.2)](https://web.stanford.edu/~jurafsky/slp3/13.pdf) or [here](https://en.wikipedia.org/wiki/CYK_algorithm), but this is not necessary at this stage.

In [None]:
# creating a chart parser object based on the grammar
parser1 = FeatureChartParser(gr1)
# this is the same as
# FeatureChartParser(gr1, trace=0, chart_class=FeatureChart)
# When trace is > 0, the parsing procedure prints the workings of the parsing algorithm
# this is useful when the parser produces unexpected results

Let's parse some grammatical and ungrammatical sentences.

In [None]:
for tree in parser1.parse("this dog likes children".split()):
    print(tree) # plain print; unfortunately, feature grammar trees do not support pretty_print in NLTK yet

In [None]:
# *this dogs: det-noun number disagreement
for tree in parser1.parse("this dogs likes children".split()):
    print(tree) # prints nothing: no parse is found

# *dogs likes: subject-verb number disagreement
for tree in parser1.parse("these dogs likes children".split()):
    print(tree) # prints nothing: no parse is found

The above code prints nothing as no parses are found for the ungrammatical sentences, i.e., the iterator the parser returns is empty, hence the body of the for-loop is not carried out.

Now, this is a significant improvement over the previous bare, feature-less grammar. But the grammar cannot distinguish semantically nonsensical sentences from sensible ones (e.g., if the grammar had appropriate rules and descriptions for certain words, it would parse the famous sentence [Colorless green ideas sleep furiously](https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously)). It would be too much to ask a syntax-based grammar for such a fine-grained distinction.

## Generating with feature-based CFG

Unfortunately, when using a feature-based CFG, the generation ignores the constraints from the features and generates more sentences than it should.  
Here is my NLTK [issue](https://github.com/nltk/nltk/issues/2628) about this (I don't think it is fixed).  
The issue also mentions a workaround: generate sentences from the FCFG and then keep only those sentences that are parsed by the same FCFG.
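The workaround can be sketched as follows. The toy grammar and the helper `has_parse` are illustrative assumptions (they are not taken from the issue): candidates are generated while the features are ignored, and then only the candidates for which the feature-based parser finds a tree are kept.

```python
from nltk.grammar import FeatureGrammar
from nltk.parse.featurechart import FeatureChartParser
from nltk.parse.generate import generate

# Hypothetical toy FCFG with number agreement
TOY_FCFG = """
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
VP[NUM=?n] -> IV[NUM=?n]
Det[NUM=sg] -> 'this'
Det[NUM=pl] -> 'these'
N[NUM=sg] -> 'dog'
N[NUM=pl] -> 'dogs'
IV[NUM=sg] -> 'walks'
IV[NUM=pl] -> 'walk'
"""
toy_grammar = FeatureGrammar.fromstring(TOY_FCFG)
toy_parser = FeatureChartParser(toy_grammar)

def has_parse(parser, words):
    """Return True if the parser finds at least one tree for the list of words."""
    return next(iter(parser.parse(words)), None) is not None

# Generation does not check the NUM features, so it overgenerates
candidates = list(generate(toy_grammar, n=100))
# Keep only the candidates that the same grammar can parse
agreeing = [words for words in candidates if has_parse(toy_parser, words)]

for words in agreeing:
    print(' '.join(words))
```

Filtering by parsing costs one parse per candidate, so for larger grammars it helps to cap the number of candidates with `n` or `depth`.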

# Fine-tuning a BERT-like model on NLI

## Loading data and models

In [None]:
import torch
from os import path as op
import os
import numpy as np
from collections import Counter

In [None]:
# Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
!pip install accelerate -U

In [None]:
# transformers complained about the newest version 0.0.13, so installing the older version
# ! pip install huggingface-hub==0.0.12

In [None]:
! pip install datasets #transformers

In [None]:
import datasets
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

In [None]:
# META Variables
# it is good to have certain directories for saving model checkpoints (e.g., on google drive)
MODEL_DIR = 'model_checkpoints'
MODEL_CHECKPOINT = "distilbert-base-uncased"
BATCH_SIZE = 16

In [None]:
snli_data = load_dataset("snli")
print(Counter(snli_data['train']['label']))

# SNLI data needs to be cleaned as it contains -1 as a label (problems without a gold label)
for k in snli_data:
    snli_data[k] = snli_data[k].filter( lambda prob: prob['label'] >= 0 )

In [None]:
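# note: load_metric was removed from recent releases of datasets;
# with those, the evaluate package offers evaluate.load('glue', 'mnli') instead
metric = load_metric('glue', "mnli")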

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT, use_fast=True)

In [None]:
# https://huggingface.co/transformers/preprocessing.html
def preprocess_function(d):
    return tokenizer(d['premise'], d['hypothesis'], truncation=True)

In [None]:
# tokenize the data
encoded_snli_data = snli_data.map(preprocess_function, batched=True, load_from_cache_file=True)

In [None]:
# load a model and prepare it for 3-way classification
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=3)

## Fine-tuning

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
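
The two steps inside `compute_metrics` can be reproduced with plain NumPy. The sketch below uses made-up logits and gold labels (both are assumptions for illustration) and computes accuracy by hand, which is the score the GLUE MNLI metric reports.

```python
import numpy as np

# Hypothetical logits for 4 problems over the 3 NLI classes
logits = np.array([[2.1, 0.3, -1.0],
                   [0.2, 0.1, 1.5],
                   [-0.5, 1.2, 0.4],
                   [0.9, 1.1, 0.2]])
# Hypothetical gold labels for the same 4 problems
gold = np.array([0, 2, 1, 0])

# index of the highest-scoring class in each row
predicted = np.argmax(logits, axis=1)
# fraction of problems where the prediction matches the gold label
accuracy = float((predicted == gold).mean())
print(predicted.tolist(), accuracy)
```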

In [None]:
args = TrainingArguments(
    MODEL_DIR, # to save models
    # evaluation_strategy = "epoch", # 1 epoch for training takes too long for colab
    evaluation_strategy = "steps",
    eval_steps = 500, # evaluate after every 500 steps, i.e., after every 500x16 training examples
    save_steps=500, # save a checkpoint after every 500 steps; save_steps should be a multiple of eval_steps
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=1, # going through the training data only once
    weight_decay=0.01,
    load_best_model_at_end=True, # after fine-tuning trainer.model will keep the best model
    metric_for_best_model="accuracy",
)

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_snli_data["train"],
    eval_dataset=encoded_snli_data["validation"],
    # You could use "test" here but it will be cheating then
    # to select the model checkpoint which gets highest score on test
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()
# it takes ~32min to fine-tune one epoch on the training set (550K problems) on V100
# it takes ~45min to fine-tune one epoch on the training set (550K problems) on T4
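
The relation between steps, batch size, and evaluations can be worked out with a few lines of arithmetic. This is a rough sketch that assumes a single device and takes the approximate training size of 550K quoted above; the variable names are illustrative.

```python
import math

train_size = 550_000   # approximate number of training problems (assumption: rounded figure)
batch_size = 16        # same value as BATCH_SIZE
eval_steps = 500       # same value as in TrainingArguments

# one step processes one batch, so an epoch needs this many steps
steps_per_epoch = math.ceil(train_size / batch_size)
# number of training examples seen between two consecutive evaluations
examples_between_evals = eval_steps * batch_size
# number of evaluations (and saved checkpoints) within one epoch
num_evals = steps_per_epoch // eval_steps

print(steps_per_epoch, examples_between_evals, num_evals)
```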

In [None]:
# if colab times out after 5000 steps (i.e., after training on 5000x16 examples),
# you will still have a model in $MODEL_DIR/checkpoint-5000
# you can load that checkpoint and continue fine-tuning on the remaining problems
# note that the first 5000x16 problems will be skipped
trainer.train(resume_from_checkpoint=op.join(MODEL_DIR, 'checkpoint-5000'))

## Evaluation (no fine-tuning)

In [None]:
# evaluation of a particular model

# if you want to load a model from a checkpoint for evaluation
# ft_model = AutoModelForSequenceClassification.from_pretrained(op.join(MODEL_DIR, 'checkpoint-5000'))

trainer_eval = Trainer(
    trainer.model, # the model that you want to evaluate; in this case, the best model from the fine-tuning
    args,
    train_dataset=encoded_snli_data["train"],
    eval_dataset=encoded_snli_data["validation"], # use "test" here for the final evaluation on the test part
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer_eval.evaluate()

# Decision trees on SNLI

The code examples below show how to use the `snli_jsonl2dict` function to read data from SNLI files. The data it returns separates NLI problem info from sentence annotations: many sentences occur in many NLI problems, and there is no need to reprocess the same sentence every time it is encountered in an NLI problem. This separation saves space and runtime, and it does make a difference when you think of creating feature representations of 550K NLI problems.  
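
The idea behind this separation can be sketched with plain dictionaries. The mini-dataset and the `annotate` stand-in below are illustrative assumptions; only the keys `p`, `h`, and `g` mirror the problem format used in this notebook.

```python
# Hypothetical mini-dataset: problems refer to sentences by their text
problems = {
    'p1': {'p': 'A man is sleeping.', 'h': 'A person is asleep.', 'g': 'entailment'},
    'p2': {'p': 'A man is sleeping.', 'h': 'A man is running.', 'g': 'contradiction'},
    'p3': {'p': 'A dog runs.', 'h': 'A man is sleeping.', 'g': 'neutral'},
}

def annotate(sentence):
    """Stand-in for an expensive analysis (parsing, tagging, etc.)."""
    return {'tok': sentence.rstrip('.').split() + ['.']}

# Annotate each distinct sentence once, keyed by the sentence itself
sen2anno = {}
for prob in problems.values():
    for role in ('p', 'h'):
        sen = prob[role]
        if sen not in sen2anno:
            sen2anno[sen] = annotate(sen)

num_sentence_slots = 2 * len(problems)
print(f"{num_sentence_slots} sentence slots, {len(sen2anno)} annotations computed")
```

Problems then look up annotations through the sentence text, so a sentence shared by many problems is analyzed only once.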

You are also provided with the `sen2features`, `problem2features`, and `problems2df` functions, which show how feature selection on a sentence level and a problem level can be done in a modular way. Note that the provided features are very simplistic. Try to replace them with more effective or reasonable ones. The final function demonstrates how to visually inspect the feature representation of the problems (this is useful to verify whether your code is really doing what you think it should be doing).

In [None]:
from tqdm import tqdm
import pandas as pd

In [None]:
# assigntools package is a course specific collection of useful tools
!rm -fr assigntools # helps to rerun this cell without errors, if recloning is needed
! git clone https://github.com/kovvalsky/assigntools.git

In [None]:
from assigntools.LoLa.read_nli import snli_jsonl2dict, sen2anno_from_nli_problems
from assigntools.LoLa.sen_analysis import spacy_process_sen2tok, display_doc_dep

## Reading data

In [None]:
from nltk.tree import Tree

In [None]:
# Get SNLI data on the fly
!wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip
!unzip snli_1.0.zip
# !rm -r __MACOSX/ snli_1.0/*_test*

In [None]:
# takes ~1min to read and pre-process data
# By default it reads the problems that have a gold label.
# SNLI is dict {part: {problem_id: problem_info}}
# S2A is dict {sentence: sentence annotation dict}
SNLI, S2A = snli_jsonl2dict('snli_1.0')

In [None]:
# access a problem with its ID in the train part
some_prob = SNLI['train']['4804607632.jpg#0r1e']
display(some_prob) # you can use print but the data will be squeezed in a single line

In [None]:
# The analysis/annotation of the hypothesis sentence
# It includes tokenization, tree structures and pos tags.
# Additionally, for each sentence you can find out in which
# parts, problems, and role (premise or hypothesis) it occurs.
# Check the key "pids" (problemIDs) for this info.
print(f"Sentence: {some_prob['h']}")
S2A[some_prob['h']]

In [None]:
# It is a good idea to keep the problem annotations and sentence annotations separately
# because many sentences occur in many NLI problems and you don't want to extract features
# for the same sentence for each problem it occurs in.
# For example the following sentence occurs many times in NLI problems
len(S2A["A man is sleeping."]['pids'])

### Displaying syntax trees

In [None]:
# we can read tree representations as NLTK Tree objects
t = Tree.fromstring(S2A[some_prob['h']]['tree'])
print(t)
# better printing
t.pretty_print()

In [None]:
# you need to have svgling installed to display tree objects
! pip install svgling

In [None]:
# display tree
t

## Processing with spaCy [optional]

For more reasoning-relevant features, one can use [spaCy](https://spacy.io/) to get dependency parse trees for sentences. In addition to dependency parsing, the spaCy pipeline also does part-of-speech tagging (with general and fine-grained POS tags), named entity recognition, and lemmatization (details [here](https://spacy.io/usage/processing-pipelines)). For a quick intro to spaCy, have a look at the following sections of the [spaCy tutorial](https://course.spacy.io/en/): sections 1 & 5 in [chapter 1](https://course.spacy.io/en/chapter1), and 4 & 8 in [chapter 2](https://course.spacy.io/en/chapter2).   
Use the attributes of spaCy's [Token objects](https://spacy.io/api/token) to access these annotations.  
After annotation, tokens come with two POS tags: the fine-grained one corresponds to [Penn Treebank POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), while the coarse-grained one corresponds to [Universal POS tags](https://universaldependencies.org/u/pos/). The dependency parse trees follow the [Stanford style](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf).  
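
As a minimal sketch of the Token attributes mentioned above: `pos_` holds the coarse-grained tag, `tag_` the fine-grained one, `dep_` the dependency label, and `head` the governing token. To keep the example independent of any downloaded model, the `Doc` below is annotated by hand (the annotations are illustrative assumptions and rely on the spaCy v3 `Doc` constructor); with a loaded pipeline, the same attributes are filled in automatically.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["A", "man", "is", "sleeping", "."]
# Hand-written annotations; heads are absolute token positions
doc = Doc(
    Vocab(),
    words=words,
    pos=["DET", "NOUN", "AUX", "VERB", "PUNCT"],
    tags=["DT", "NN", "VBZ", "VBG", "."],
    heads=[1, 3, 3, 3, 3],
    deps=["det", "nsubj", "aux", "ROOT", "punct"],
)

# Collect (text, coarse POS, fine POS, dependency label, head text) per token
rows = [(t.text, t.pos_, t.tag_, t.dep_, t.head.text) for t in doc]
for row in rows:
    print(row)
```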

In [None]:
# downloading spaCy's large model
# !python -m spacy download en_core_web_lg
import spacy

In [None]:
NLP = spacy.load("en_core_web_sm")

In [None]:
# SNLI train part contains 640K different sentences
# Processing all these sentences with spaCy first and then using the analyses
# for feature extraction is not feasible as colab will run out of memory
# There are two options: either reduce the number of sentences by using a subpart
# of the training part, or process the sentences with spaCy in batches while
# at the same time converting NLI problems into a set of feature-values
# The former is simpler and this is how you can create new SNLI and S2A variables
print(f"Train contains {len(SNLI['train'])} problems")
print(f"The number of different sentences in SNLI: {len(S2A)}")
# Let's decide that we take first 10K problems from TRAIN
# (the label distribution should reflect the original distribution from the training data)
SNLI['sub_train'] = { pid: SNLI['train'][pid] for pid in sorted(SNLI['train'])[:10000] }
sub_S2A = sen2anno_from_nli_problems({**SNLI['sub_train'], **SNLI['dev']}, S2A)
print(f"The number of different sentences in subTRAIN and DEV: {len(sub_S2A)}")

In [None]:
# process all sentences in DEV and subTRAIN with spaCy
# Note that the following function takes a spaCy pipeline and a sen->tokens dict
# With the tokenization as input, the pipeline is forced to use the same tokenization
sen2Doc = spacy_process_sen2tok(NLP, { sen: anno['tok'] for sen, anno in sub_S2A.items() })

In [None]:
display_doc_dep(sen2Doc["A man is sleeping."])

## Create features [demo]

This section shows one way to organize your code in a modular and hierarchical way: separate sentence-level features from problem/pair-level features, where the latter use the former. The conversion of the entire training data into feature-values is wrapped in a separate function so that it works for the train, dev, and test parts in the same way.  

In [None]:
# You can modify the function
def sen2features(sen, anno):
    '''
    Takes a sentence and its annotation and returns a dictionary
    of feature:value that characterizes the sentence
    '''
    feats = {}
    # number of tokens
    feats['tok_num'] = len(anno['tok'])
    # number of negation words
    neg_words = {"n't", "no", "not"} # a set, so that membership is tested on whole tokens, not substrings
    feats['neg_num'] = len([ t for t in anno['tok'] if t.lower() in neg_words ])
    # number of nouns
    feats['noun_num'] = len([ t for t in anno['pos'] if t == "NNS" or t == "NN" ])

    return {**feats, **anno}

In [None]:
# Add features to the sentence annotations
s2af = { s: sen2features(s, a) for s, a in tqdm(sub_S2A.items()) }

In [None]:
# an example of a sentence with feature-added annotations
s2af['The women is not on her phone.']

In [None]:
# You can modify the function
def problem2features(sen1, anno1, sen2, anno2, sen_feats=('tok_num', 'neg_num', 'noun_num')):
    '''
    Takes two sentences and their annotations (with features) and returns a dictionary of
    feature:value that characterizes the sentence pair, i.e., each feature is about both sentences
    sen_feats defines the sentence-based features that will be part of the problem features
    '''
    features = {}
    sen_feats = set(sen_feats)
    sen1_feats = { f"{k}1": v for k, v in anno1.items() if k in sen_feats }
    sen2_feats = { f"{k}2": v for k, v in anno2.items() if k in sen_feats }
    # not very smart idea: putting single sentence-based features as pair features
    features = {**sen1_feats, **sen2_feats} # merge two dicts

    # pair-related features
    # if only one of the sentences has a negation
    neg_set = set([anno1['neg_num'], anno2['neg_num']])
    features['neg_diff'] = 1 if (0 in neg_set and len(neg_set) > 1) else 0

    # If premise contains more tokens than hypothesis has
    features['tok1>2'] = int(anno1['tok_num'] > anno2['tok_num'])

    # If premise has more nouns than hypothesis has
    features['noun1>2'] = int(anno1['noun_num'] > anno2['noun_num'])

    return features

In [None]:
# You can modify the function, but it might not be necessary as it is pretty general
def problems2df(data_dict, sen2af):
    '''
    Read a dictionary of NLI problems {pid->prob} and
    a dictionary of sentence annotations {sent->anno_feats}
    and represent each problem as a set of feature-values in a DataFrame.
    DataFrame offers an easy way of viewing and manipulating data.
    Separate DataFrames are created for labels, features, and sentence pairs
    https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
    '''
    dict_of_feats = { pid: problem2features(prob['p'], sen2af[prob['p']], prob['h'], sen2af[prob['h']])
                      for pid, prob in tqdm(data_dict.items()) }
    # Use  {key : list or dict} to create DataFrame
    gold_labels = { pid:[prob['g']] for pid, prob in data_dict.items() }
    # Don't use label annotations as features as this will be cheating :)
    # Create DataFrame for sentence pairs for visualization
    pair_df = { pid:[f"{prob['p']} ??? {prob['h']}"] for pid, prob in tqdm(data_dict.items()) }
    # make each problem's characteristics a row
    feat_df = pd.DataFrame(dict_of_feats).transpose()
    lab_df = pd.DataFrame(gold_labels).transpose()
    pair_df = pd.DataFrame(pair_df).transpose()
    # match the order in label, feature and pair frames
    # (reindex returns a new frame, so the result has to be assigned)
    lab_df = lab_df.reindex(feat_df.index)
    pair_df = pair_df.reindex(feat_df.index)
    return feat_df, lab_df, pair_df
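
A note on `reindex`: in pandas it returns a new, reordered DataFrame rather than modifying the original, so the result has to be assigned for the reordering to take effect. A minimal sketch with made-up frames (the names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical feature and label frames whose rows are in different orders
feat = pd.DataFrame({'tok_num1': [7, 5]}, index=['p2', 'p1'])
lab = pd.DataFrame({0: ['neutral', 'entailment']}, index=['p1', 'p2'])

# The result is discarded here, so lab keeps its old row order
lab.reindex(feat.index)
unchanged_order = list(lab.index)

# Assigning the result actually reorders the rows
lab = lab.reindex(feat.index)
new_order = list(lab.index)

print(unchanged_order, new_order)
```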

In [None]:
feat_df, lab_df, pair_df = problems2df(SNLI['sub_train'], s2af)

In [None]:
# Let's put all three dataframes together for visualization
# Press the magic wand icon after the frame appears
pd.concat([lab_df, feat_df, pair_df], axis=1)

## Training

In [None]:
# Just an example with decision trees
from sklearn.tree import DecisionTreeClassifier as DTC

# preparing data and converting it to feature-values
s2af = { s: sen2features(s, a) for s, a in tqdm(sub_S2A.items()) }
feat_df, lab_df, pair_df = problems2df(SNLI['sub_train'], s2af)

# initializing a DT classifier and training it
DT = DTC(criterion="gini", max_depth=10, random_state=0)
default_DT = DT.fit(feat_df, lab_df)

MODEL = {'cheater': default_DT}

## Evaluation

In [None]:
def evaluate(model, dataset, sen2anno):
    """
    model - a classifier to predict NLI classes
    dataset and sen2anno are similar to the output of snli_jsonl2dict
    dataset - a dict of nli problems: keys are problem ids and values problem descriptions
    sen2anno - a dict of sentence annotations from SNLI: keys are sentences and values their tree, pos tags and tokenization.
    The function converts problems in dataset into sets of feature-values (sen2anno can be used to process each sentence only once)
    and predicts the inference classes of the problems.
    It can use the spaCy model "NLP" on the fly to get features based on its analyses.
    Returns a list of predictions and a list of gold values
    """

    # a sample code which is adapted to the previous code about decision trees
    s2af = { s: sen2features(s, a) for s, a in tqdm(sen2anno.items()) }
    feat_df, lab_df, pair_df = problems2df(dataset, s2af)

    pred_list = model.predict(feat_df)
    return pred_list.tolist(), lab_df.values.squeeze().tolist()

In [None]:
# TEST
from nltk.metrics.scores import accuracy as Accuracy
from nltk.metrics import ConfusionMatrix

S2A_dev = sen2anno_from_nli_problems(SNLI['dev'], S2A)
# The code should also work for 'test' part

for name in MODEL:
    pred, gold = evaluate(MODEL[name], SNLI['dev'], S2A_dev)
    print(f"{name:=^80}")
    print(ConfusionMatrix(gold, pred))
    print(f"Accuracy = {Accuracy(gold, pred)}")
    print(f"{'':=^80}")

# Text2FOL with LogicLLaMa

The code follows the content of the [demo notebook](https://github.com/gblackout/LogicLLaMA/blob/main/demo.ipynb).

<font color="red">Run the cells on GPU, e.g., the cells were tested on T4.</font>

In [None]:
# the repo has this in requirements (might be relevant only for replicating results)
! pip install transformers@git+https://github.com/huggingface/transformers.git@3ec7a47664ebe40c40f4b722f6bb1cd30c3821ec

In [None]:
# you might need to specify exact versions of the modules from requirements.txt
! pip install peft Levenshtein SentencePiece bitsandbytes

In [None]:
# !rm -fr /content/drive/MyDrive/Llama/LogicLLaMA
! git clone https://github.com/gblackout/LogicLLaMA.git

In [None]:
import os
os.chdir("LogicLLaMA")

In [None]:
import torch
from functools import partial
import transformers
print(f"transformers version={transformers.__version__}")
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel, prepare_model_for_int8_training
from utils import TranslationDataPreparer, ContinuousCorrectionDataPreparer, make_parent_dirs
from fol_parser import parse_text_FOL_to_tree
from generate import llama_generate

In [None]:
# download data: MALLS and FOLIO
! sh data_download.sh

In [None]:
HF_ACCESS_TOKEN="hf_{SOME_MESS_OF_ALPHANUMERICS}"

In [None]:
LLAMA2_MODEL = 'meta-llama/Llama-2-7b-hf'

In [None]:
prompt_template_path='data/prompt_templates'
load_in_8bit = True
max_output_len = 128

In [None]:
tokenizer = LlamaTokenizer.from_pretrained(LLAMA2_MODEL, use_auth_token=HF_ACCESS_TOKEN)
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": '<unk>',
    "pad_token": '<unk>',
})
tokenizer.padding_side = "left"  # Allow batched inference

In [None]:
generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=1
)

llama_model = LlamaForCausalLM.from_pretrained(
    LLAMA2_MODEL,
    load_in_8bit=load_in_8bit,
    torch_dtype=torch.float16,
    device_map='auto',
    use_auth_token=HF_ACCESS_TOKEN
)
llama_model = prepare_model_for_int8_training(llama_model)

In [None]:
!cd .. && git clone https://huggingface.co/yuan-yang/LogicLLaMA-7b-direct-translate-delta-v0.1

In [None]:
peft_path='../LogicLLaMA-7b-direct-translate-delta-v0.1'

In [None]:
model = PeftModel.from_pretrained(
    llama_model,
    peft_path,
    torch_dtype=torch.float16
)

In [None]:
data_preparer = TranslationDataPreparer(
    prompt_template_path,
    tokenizer,
    False,
    256 # just a filler number
)

prepare_input = partial(
    data_preparer.prepare_input,
    **{"nl_key": "NL"},
    add_eos_token=False,
    eval_mode=True,
    return_tensors='pt'
)

simple_generate = partial(
    llama_generate,
    llama_model=model,
    data_preparer=data_preparer,
    max_new_tokens=max_output_len,
    generation_config=generation_config,
    prepare_input=prepare_input,
    return_tensors=False
)

In [None]:
data_point = {'NL': 'The one who created this repo is either a human or an alien'}

In [None]:
full_resp_str, resp_parts = simple_generate(input_str=data_point)

In [None]:
resp_parts