# Task-1.2: Sentence with Antonyms

### Importing Necessary Libraries

NLTK is used for tokenization and part-of-speech tagging, and its WordNet interface is used for antonym lookup. The Hugging Face `datasets` library provides the dataset.

In [5]:
import csv
import pandas as pd
import string
from datasets import load_dataset
import nltk
from nltk.corpus import wordnet as wn
from nltk import word_tokenize, pos_tag

### Data

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. We load this corpus through the `datasets` library below.

In [6]:
dataset = load_dataset('glue', 'mnli')

Reusing dataset glue (/Users/rishideychowdhury/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



The first $5000$ premises are used for this negation task and are reused in the second task to train the model.

In [7]:
data = dataset['train']['premise'][:5000]

### Data Processing

Tokenization is the most basic step in any NLP workflow that handles text. Below, NLTK's `word_tokenize` splits each sentence into tokens.

In [8]:
data_tokenized = [word_tokenize(sent) for sent in data]

Since our purpose is to generate a sentence in which every adjective is replaced with its antonym, where one is available, we start by tagging each token's part of speech with NLTK's `pos_tag`.

In [9]:
data_pos_tagged = [pos_tag(tokenized_sent) for tokenized_sent in data_tokenized]
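
`pos_tag` returns each sentence as a list of `(token, tag)` pairs using Penn Treebank tags, in which adjectives are tagged `JJ`, `JJR` (comparative) or `JJS` (superlative). As a minimal sketch of how adjectives can be picked out of such a list, the snippet below filters a hand-written tagged sentence; the helper name `adjectives` and the example tags are illustrative assumptions, not output from the tagger.

```python
# Penn Treebank adjective tags: base, comparative, superlative.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

def adjectives(tagged_sentence):
    """Return the tokens whose POS tag marks them as adjectives."""
    return [token for token, tag in tagged_sentence if tag in ADJECTIVE_TAGS]

# Hand-written example in the (token, tag) format produced by pos_tag;
# the tags are assumed for illustration, not actual tagger output.
example = [("Do", "VBP"), ("you", "PRP"), ("mean", "VB"),
           ("tall", "JJ"), ("or", "CC"), ("short", "JJ"), ("?", ".")]
print(adjectives(example))
```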

The following helper function finds the antonyms of a word using WordNet, accessed through NLTK.

In [10]:
def find_antonyms(word):
    antonyms = []  # Collects the antonyms of the input word
    for syn in wn.synsets(word):  # Every synset (sense) that contains the word
        for lemma in syn.lemmas():  # Every lemma belonging to that synset
            if lemma.antonyms():  # Only some lemmas have antonyms
                antonyms.append(lemma.antonyms()[0].name())  # Keep the first antonym's name
    # Remove duplicates and sort in lexicographical order
    return sorted(set(antonyms))
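
Because the antonyms are de-duplicated and sorted, the first element of the returned list is the lexicographically smallest one, regardless of the order in which WordNet lists the senses. The sketch below isolates that step on a hand-written candidate list; `smallest_antonym` is a hypothetical helper and the candidates are illustrative, not actual WordNet output.

```python
def smallest_antonym(candidates):
    """Return the lexicographically smallest unique candidate, or None if there is none."""
    unique_sorted = sorted(set(candidates))
    return unique_sorted[0] if unique_sorted else None

# Illustrative candidates (assumed, not taken from WordNet).
print(smallest_antonym(["evil", "bad", "ill", "bad"]))  # bad
print(smallest_antonym([]))                             # None
```

Python compares strings by code point, so a capitalized candidate would sort before any lowercase one.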

Each original sentence is traversed token by token. Every adjective is replaced with the lexicographically smallest antonym retrieved from WordNet, and the new sentence is then assembled with the same spacing as the original.
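
The cell that follows also adjusts the replacement when the original adjective ends in 'ing' or 'ed': if the antonym lacks that ending, its last character is dropped and the ending is appended. A minimal sketch of this rule in isolation, with `match_suffix` as a hypothetical helper name and hand-picked word pairs rather than WordNet results:

```python
def match_suffix(original, antonym):
    """Give the antonym the same 'ing'/'ed' ending as the original adjective."""
    for suffix in ("ing", "ed"):
        if original.endswith(suffix):
            if antonym.endswith(suffix):
                return antonym
            # Drop the antonym's last character, then append the suffix.
            return antonym[:-1] + suffix
    return antonym

print(match_suffix("boring", "excite"))  # exciting
print(match_suffix("tired", "rest"))     # resed
print(match_suffix("tall", "short"))     # short
```

The second call shows the limit of the rule: dropping the last character only yields a valid word when the antonym ends in a letter that the suffix replaces, such as a silent 'e'.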

In [11]:
# Stores the sentences with adjectives replaced by their antonyms
data_antonymed = list()
for tokens in data_pos_tagged:
    anto_sent_tokens = list() # stores the fake sentence tokens for the selected sentence
    
    for token in tokens:
        if 'J' in token[1]: # Identifying the adjectives from the POS tags
            ants = find_antonyms(token[0]) # Finding the antonyms corresponding to this adjective using above func
            if ants: # If antonym exists then add the first antonym as it is lexicographically the smallest
                if token[0].endswith('ing'): 
                    # Handling common situation of token ending with 'ing'
                    anto_sent_tokens.append(ants[0] if ants[0].endswith('ing') else ants[0][:-1] + 'ing')
                elif token[0].endswith('ed'):
                    # Handling common situation of token ending with 'ed'
                    anto_sent_tokens.append(ants[0] if ants[0].endswith('ed') else ants[0][:-1] + 'ed')
                else:
                    anto_sent_tokens.append(ants[0])
            else: # If no antonym exists then simply add the original word
                anto_sent_tokens.append(token[0])
        else: # If not adjective then simply add it as it is
            anto_sent_tokens.append(token[0])
            
    # word_tokenize splits off clitics such as 's and n't as well as punctuation marks, so joining
    # every token with a space would change the sentence. Such tokens are attached without a space.
    antonym_sentence = anto_sent_tokens[0]
    for token in anto_sent_tokens[1:]:
        antonym_sentence += token if token[0] in string.punctuation or token == "n't" else " " + token
    data_antonymed.append(antonym_sentence)
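
The spacing rule at the end of the loop can be read on its own: a token is attached directly to the preceding text when it starts with a punctuation character or is the clitic `n't`, and is preceded by a space otherwise. A minimal sketch, assuming a hypothetical helper `join_tokens` and adding a guard for an empty token list that the cell above does not have:

```python
import string

def join_tokens(tokens):
    """Rebuild a sentence from tokens, attaching punctuation and n't without a space."""
    if not tokens:  # Guard added for this sketch
        return ""
    sentence = tokens[0]
    for token in tokens[1:]:
        if token[0] in string.punctuation or token == "n't":
            sentence += token
        else:
            sentence += " " + token
    return sentence

print(join_tokens(["It", "was", "n't", "he", "who", "called", "."]))   # It wasn't he who called.
print(join_tokens(["Do", "you", "mean", "short", "or", "long", "?"]))  # Do you mean short or long?
```

Tokens that open a bracket or a quotation also start with a punctuation character, so this rule attaches them to the preceding word as well.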

The following are the original sentences.

In [12]:
pd.DataFrame(data)

| | 0 |
|---|---|
| 0 | Conceptually cream skimming has two basic dime... |
| 1 | you know during the season and i guess at at y... |
| 2 | One of our number will carry out your instruct... |
| 3 | How do you know? All this is their information... |
| 4 | yeah i tell you what though if you go price so... |
| ... | ... |
| 4995 | right because clothes are some are really expe... |
| 4996 | I'll admit that it wasn't he who bought strych... |
| 4997 | Particularly noteworthy are the ornately Frenc... |
| 4998 | Do you mean tall or short? |


The following are the sentences in which each adjective is replaced by its lexicographically smallest antonym.

In [13]:
pd.DataFrame(data_antonymed)

| | 0 |
|---|---|
| 0 | Conceptually cream skimming has two incidental... |
| 1 | you know during the season and i guess at at y... |
| 2 | One of our number will carry out your instruct... |
| 3 | How do you know? All this is their information... |
| 4 | yeah i tell you what though if you go price so... |
| ... | ... |
| 4995 | right because clothes are some are really chea... |
| 4996 | I'll admit that it wasn't he who bought strych... |
| 4997 | Particularly noteworthy are the ornately Frenc... |
| 4998 | Do you mean short or long? |


Write the results to a CSV file.

In [14]:
filename = "rishi_dey_chowdhury_antonyms.csv"
fields = ['index', 'real_sentence', 'fake_sentence'] 
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    csvwriter = csv.writer(csvfile) 
    csvwriter.writerow(fields)
    for i in range(len(data)):
        csvwriter.writerow([i, data[i], data_antonymed[i]])
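
To check such a file, it can be read back with `csv.DictReader`, which uses the header row as keys. The sketch below round-trips two made-up sentence pairs through a temporary file; the rows and the temporary path are assumptions for illustration and do not touch the file written above. Opening the file with `newline=''` follows the `csv` module's documented recommendation.

```python
import csv
import os
import tempfile

fields = ["index", "real_sentence", "fake_sentence"]
# Made-up rows for illustration; the commas check that quoting survives.
rows = [
    [0, "Do you mean tall or short?", "Do you mean short or long?"],
    [1, "Yes, it is cheap.", "Yes, it is expensive."],
]

path = os.path.join(tempfile.mkdtemp(), "antonyms_check.csv")
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(fields)
    writer.writerows(rows)

with open(path, newline="", encoding="utf-8") as csvfile:
    read_back = list(csv.DictReader(csvfile))

print(len(read_back))                 # 2
print(read_back[1]["real_sentence"])  # Yes, it is cheap.
```

Every value comes back as a string, so the `index` column needs an explicit `int(...)` conversion if it is used numerically.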