# Task-1.1: Sentence Negation

### Importing Necessary Libraries

The NLTK library is used for tokenization and part-of-speech tagging, and Hugging Face's datasets library provides the dataset.

In [66]:
import csv
import pandas as pd
import string
from datasets import load_dataset
import nltk
from nltk import word_tokenize, pos_tag

### Data

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. We load this corpus through the datasets library below.

In [67]:
dataset = load_dataset('glue', 'mnli')

Reusing dataset glue (/Users/rishideychowdhury/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



The first $5000$ premise sentences are used for this negation task and are reused in the second task to train the model.

In [68]:
data = dataset['train']['premise'][:5000]

### Data Processing

In any NLP workflow, tokenization is the most basic step in handling text data. Below, the word_tokenize function from the NLTK library is used for this step.

In [69]:
data_tokenized = [word_tokenize(sent) for sent in data]

Since our purpose is to negate English sentences using 'not' negation, we need a few rules for doing so:
- Insert the negation marker ‘not’ after the first auxiliary verb in the original true target sentence to generate a new distractor sentence.
- If the sentence already contains the negation marker ‘not’, remove it instead.
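
The two rules above can be sketched on an already-tokenized sentence as follows. This is a simplified illustration, not the notebook's implementation: the helper name `toggle_not` and the reduced auxiliary-verb set are assumptions made for the example, and it ignores POS tags and clause boundaries.

```python
# Illustrative subset of auxiliary verbs (assumption: the full list is longer).
AUX_VERBS = {"is", "are", "was", "were", "do", "does", "did",
             "have", "has", "had", "will", "can", "would", "should", "must"}
NEGATION_MARKERS = {"not", "n't"}


def toggle_not(tokens):
    """Hypothetical helper: remove the first negation marker if one exists,
    otherwise insert 'not' after the first auxiliary verb."""
    # Rule 2: the sentence is already negated, so drop the marker.
    for i, tok in enumerate(tokens):
        if tok.lower() in NEGATION_MARKERS:
            return tokens[:i] + tokens[i + 1:]
    # Rule 1: insert 'not' right after the first auxiliary verb.
    for i, tok in enumerate(tokens):
        if tok.lower() in AUX_VERBS:
            return tokens[:i + 1] + ["not"] + tokens[i + 1:]
    # No auxiliary verb found: return an unchanged copy.
    return list(tokens)


negated = toggle_not(["She", "will", "go"])
restored = toggle_not(["He", "is", "not", "here"])
```

A sentence without any auxiliary verb, such as `["Birds", "fly"]`, is returned unchanged by this sketch.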

The following is the reason why this sort of data processing is of particular interest:
- ‘Not’ Negation aims to detect whether a sentence embedding model is misled by the negation of a sentence relation caused by adding the word ‘not’.

So, the 'not' negation task boils down to finding the first auxiliary verb of a sentence, because in English the word 'not' appears directly after the auxiliary verb. That is, a simple 'not'-negated English sentence has the form:
$$S\ +\ AV\ +\ not\ +\ MV\ +\ R$$
where $S$ is the subject, $AV$ is the auxiliary verb, $MV$ is the main verb and $R$ is the rest of the sentence.

Similarly, most English sentences containing 'not' appear in the above form, so we can simply remove the word 'not' to un-negate the sentence.

English has only a finite set of auxiliary verbs that can be followed by 'not'. They are listed in the cell below, and we leverage this property for our purpose.

In [70]:
# Auxiliary Verbs
aux_verbs = [
    'is', 'am', 'are', 'was', 'were', 'do', 'does', 'did', 'have', 'has', 'had', 'may', 'can', 'will', 'shall', 
    'might', 'could', 'would', 'should', 'must', 'ought', 'need', 'dare', 'used', "'m", "'re", "'ve", "'ll", "'d", "'s" 
]

Some practical details kept in mind while implementing the sentence negation function:
- The tokenizer produces contracted tokens such as 'm, 're, 've, 'll, 'd and n't. These need special care because 's is also the possessive marker, and 'not' should not be added after it. Hence, we use the part-of-speech tagger to check whether such a token is tagged as a verb.
- The original sentence structure is preserved as far as possible, i.e. no extra spaces, punctuation marks or words are added, apart from inserting or deleting the word 'not'.
- **If a sentence is joined by a coordinating or subordinating conjunction, then both clauses are negated.** The conjunctions are identified using the POS tags, and punctuation marks are treated as clause boundaries as well.
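
The 's ambiguity from the first point can be illustrated without running a tagger. In the minimal sketch below, the Penn Treebank tags are assigned by hand, so they are assumptions rather than real tagger output. The helper name `is_negatable_aux` is hypothetical, and the tag test mirrors the check used in the negation loop.

```python
# Illustrative subset of the auxiliary-verb list (assumption for this example).
AUX_VERBS = {"is", "has", "will", "'s", "'ll", "'d"}


def is_negatable_aux(token, tag):
    """Hypothetical helper: True when the token is a listed auxiliary AND
    its POS tag marks it as a verb (VB*) or a modal (MD)."""
    return token.lower() in AUX_VERBS and ("V" in tag or "MD" in tag)


# Hand-tagged tokens for "John's car's parked" (tags are assumptions):
# the first 's is a possessive ending (POS), the second is a verb (VBZ).
tagged = [("John", "NNP"), ("'s", "POS"), ("car", "NN"),
          ("'s", "VBZ"), ("parked", "VBN")]

flags = [is_negatable_aux(tok, tag) for tok, tag in tagged]
# flags == [False, False, False, True, False]
```

Only the second 's qualifies as an insertion point for 'not', which is why the token list alone is not enough and the tag has to be consulted.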

In [71]:
# stores the 'not' negated sentences and the 'not' word removed sentences
data_negated = list()
for i in range(len(data)):
    neg_sent_tokens = list() # Stores the tokens of the new negated sentence
    neg_flag = 1 # 1 while the current clause may still be changed; 0 once a 'not' has been added or removed
    sent_pos_tagged = pos_tag(data_tokenized[i]) # POS-tag the sentence to identify conjunctions, possessive markers, etc.
    
    for j in range(len(data_tokenized[i])):
        if neg_flag == 1 and (data_tokenized[i][j].lower() in ['not', "n't"]) and j > 0:
            if data_tokenized[i][j-1].lower() in aux_verbs:
                # If the current token is a variation of 'not' that follows an auxiliary verb, skip it
                neg_flag = 0
                continue
            neg_sent_tokens.append(data_tokenized[i][j])
        elif neg_flag == 1 and data_tokenized[i][j].lower() in aux_verbs and ('V' in sent_pos_tagged[j][1] or 'MD' in sent_pos_tagged[j][1]) and  j < len(data_tokenized[i])-1:
            neg_sent_tokens.append(data_tokenized[i][j])
            # If the current token is an auxiliary verb and the next token is not a variation of 'not', add 'not'
            if data_tokenized[i][j+1].lower() not in ['not', "n't"]: 
                neg_sent_tokens.append('not')
                neg_flag = 0
        else:
            # Otherwise simply add the token as it is
            neg_sent_tokens.append(data_tokenized[i][j])
        if data_tokenized[i][j] in string.punctuation or sent_pos_tagged[j][1] in ['IN', 'CC'] or data_tokenized[i][j] in ['--', '..', '...', '....']:
            # If a punctuation mark or conjunction appears, the sentence usually has two or more
            # parts, so we treat the part after it as an independent sentence and allow its
            # negation
            neg_flag = 1
    
    # Tokens such as 's are split off by the tokenizer, so joining all tokens with spaces would change the sentence.
    # Punctuation marks would also end up separated by spaces, so we attach such tokens directly to the previous one.
    negated_sentence = neg_sent_tokens[0]
    for token in neg_sent_tokens[1:]:
        negated_sentence += token if token[0] in string.punctuation or token == "n't" else " " + token
    data_negated.append(negated_sentence)
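
The spacing rule used at the end of the loop can be looked at in isolation. The sketch below wraps the same logic in a hypothetical helper, `join_tokens`. The guard for an empty token list is an addition made for the example and is not part of the cell above.

```python
import string


def join_tokens(tokens):
    """Hypothetical helper: rebuild a sentence from tokens, attaching tokens
    that start with punctuation (and "n't") directly to the previous token."""
    if not tokens:  # added guard: an empty list would otherwise raise IndexError
        return ""
    sentence = tokens[0]
    for token in tokens[1:]:
        if token[0] in string.punctuation or token == "n't":
            sentence += token          # no space before 's, 'll, n't, '.', ','
        else:
            sentence += " " + token    # ordinary word: separate with a space
    return sentence


example = join_tokens(["I", "'ll", "not", "admit", "that", "it", "was", "he", "."])
# example == "I'll not admit that it was he."
```

One limitation of this rule is that opening punctuation is also glued to the preceding word: `join_tokens(["say", "(", "hi", ")"])` yields `"say( hi)"`.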

The following table shows the original sentences.

In [72]:
pd.DataFrame(data)

| | 0 |
|---|---|
| 0 | Conceptually cream skimming has two basic dime... |
| 1 | you know during the season and i guess at at y... |
| 2 | One of our number will carry out your instruct... |
| 3 | How do you know? All this is their information... |
| 4 | yeah i tell you what though if you go price so... |
| ... | ... |
| 4995 | right because clothes are some are really expe... |
| 4996 | I'll admit that it wasn't he who bought strych... |
| 4997 | Particularly noteworthy are the ornately Frenc... |
| 4998 | Do you mean tall or short? |


The following are the negated sentences.

In [73]:
pd.DataFrame(data_negated)

| | 0 |
|---|---|
| 0 | Conceptually cream skimming has not two basic ... |
| 1 | you know during the season and i guess at at y... |
| 2 | One of our number will not carry out your inst... |
| 3 | How do not you know? All this is not their inf... |
| 4 | yeah i tell you what though if you go price so... |
| ... | ... |
| 4995 | right because clothes are not some are really ... |
| 4996 | I'll not admit that it was he who bought stryc... |
| 4997 | Particularly noteworthy are not the ornately F... |
| 4998 | Do not you mean tall or short? |


Write the results to a CSV file.

In [74]:
filename = "rishi_dey_chowdhury_negation.csv"
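fields = ['index', 'real_sentence', 'fake_sentence']
# newline='' lets the csv module control line endings, as the csv documentation recommends
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    csvwriter = csv.writer(csvfile)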
    csvwriter.writerow(fields)
    for i in range(len(data)):
        csvwriter.writerow([i, data[i], data_negated[i]])
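
A quick way to check the output is to read the file back. The following is a minimal sketch of such a round trip that writes to a temporary file rather than the real output file. The helper name `write_and_read_back` and the file name `negation_check.csv` are assumptions made for the example.

```python
import csv
import os
import tempfile


def write_and_read_back(rows, path):
    """Hypothetical helper: write (index, real, fake) rows with the same
    header as above, then read them back as a list of dictionaries."""
    fields = ["index", "real_sentence", "fake_sentence"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(fields)
        writer.writerows(rows)
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    records = write_and_read_back(
        [(4998, "Do you mean tall or short?", "Do not you mean tall or short?")],
        os.path.join(tmp, "negation_check.csv"),
    )

# Every value comes back as a string, including the index column.
print(records[0]["index"], "|", records[0]["fake_sentence"])
```

Because `csv.DictReader` returns strings, the index has to be converted with `int(...)` if it is needed as a number later.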