### Connect Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Set the path to your project folder below.

In [2]:
%cd /content/drive/MyDrive/AIISC-Internship/text-based-object-discovery

In [3]:
PATH = '/content/drive/MyDrive/AIISC-Internship/text-based-object-discovery'

### Install Required Packages

`Stanza`, the Stanford NLP package, benefits from a `GPU`, so enable one under `View Resources > Change runtime type`.

In [4]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-31930931-6e0a-de5d-fb1a-76ff7f7bde6c)


In [5]:
!pip install stanza # for stanford pos tagger
!pip install ftfy regex tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stanza
  Downloading stanza-1.4.2-py3-none-any.whl (691 kB)
Collecting emoji
  Downloading emoji-2.2.0.tar.gz (240 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.2.0-py3-none-any.whl size=234926 sha256=21928d512a2d69a83152877f2cb774765cb6aa4a08d89cdda00fb74bc8c32983
  Stored in directory: /root/.cache/pip/wheels/86/62/9e/a6b27a681abcde69970dbc0326ff51955f3beac72f15696984
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully installed emoji-2.2.0 stan

### Load Necessary Libraries

We will load the necessary libraries required for generating DAAM outputs for input prompts.

In [6]:
import os
import json
from tqdm import tqdm

from matplotlib import pyplot as plt
import numpy as np

from nltk.corpus import stopwords

from pycocotools.coco import COCO

Download the NLTK resources used below: the stopword list, WordNet, and the tokenizer and tagger data.

In [7]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [8]:
import stanza
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


### Load Data

Below, we download the `MS-COCO` annotations. The captions they contain are the input from which objects are extracted in the following sections.

In [11]:
!wget -c http://images.cocodataset.org/annotations/annotations_trainval2017.zip
!unzip annotations_trainval2017.zip
!rm annotations_trainval2017.zip
!ls annotations

--2023-01-06 09:08:43--  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.216.57.33, 54.231.227.105, 54.231.201.89, ...
Connecting to images.cocodataset.org (images.cocodataset.org)|52.216.57.33|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252907541 (241M) [application/zip]
Saving to: ‘annotations_trainval2017.zip’


2023-01-06 09:09:05 (11.4 MB/s) - ‘annotations_trainval2017.zip’ saved [252907541/252907541]

Archive:  annotations_trainval2017.zip
  inflating: annotations/instances_train2017.json  
  inflating: annotations/instances_val2017.json  
  inflating: annotations/captions_train2017.json  
  inflating: annotations/captions_val2017.json  
  inflating: annotations/person_keypoints_train2017.json  
  inflating: annotations/person_keypoints_val2017.json  
captions_train2017.json   instances_val2017.json
captions_val2017.json	  person_keypoints_train2017.json
instances_t

Now, we load the JSON files containing the train and validation captions.

In [12]:
with open('annotations/captions_train2017.json') as json_file:
  caption_data_train_file = json.load(json_file)
with open('annotations/captions_val2017.json') as json_file:
  caption_data_val_file = json.load(json_file)

In [13]:
caption_data_train = caption_data_train_file['annotations']
caption_data_val = caption_data_val_file['annotations']

In [14]:
prompts_train = [ann['caption'] for ann in caption_data_train]
prompts_val = [ann['caption'] for ann in caption_data_val]
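
For orientation, each entry of `annotations` in the caption files is a dictionary with `image_id`, `id`, and `caption` keys, and an image usually has several captions. The sketch below groups captions by image. The three records are made up for illustration, and `group_captions_by_image` is a hypothetical helper that is not used elsewhere in this notebook.

```python
from collections import defaultdict

# Made-up records that mirror the layout of captions_*.json['annotations'].
sample_annotations = [
    {"image_id": 1, "id": 10, "caption": "A dog on a couch."},
    {"image_id": 1, "id": 11, "caption": "A brown dog resting."},
    {"image_id": 2, "id": 12, "caption": "Two bikes by a wall."},
]

def group_captions_by_image(annotations):
    """Map each image_id to the list of its captions, in file order."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)

captions_by_image = group_captions_by_image(sample_annotations)
print(captions_by_image[1])
```

The flat `prompts_train` and `prompts_val` lists above discard this grouping, which is fine here because objects are extracted per caption.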

In [15]:
def show_captions():
  print('***train captions***\n', '\n'.join(prompts_train[:5]))
  print()
  print('Number of train captions:', len(prompts_train))
  print()
  print()
  print('***validation captions:***\n', '\n'.join(prompts_val[:5]))
  print()
  print('Number of validation captions:', len(prompts_val))

show_captions()

***train captions***
 A bicycle replica with a clock as the front wheel.
A room with blue walls and a white sink and door.
A car that seems to be parked illegally behind a legally parked car
A large passenger airplane flying through the air.
There is a GOL plane taking off in a partly cloudy sky.

Number of train captions: 591753


***validation captions:***
 A black Honda motorcycle parked in front of a garage.
A Honda motorcycle parked in a grass driveway
An office cubicle with four different types of computers.
A small closed toilet in a cramped space.
Two women waiting at a bench next to a street.

Number of validation captions: 25014


### Caption Processing

Each prompt is cleaned with the following steps:
- Lower Case Conversion
- Tokenization
- Remove stop words
- Remove non-alphabets
- Keep only nouns
- Lemmatization (to store the object name)
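
The steps above can be sketched on a single caption without loading any model. The `(word, XPOS tag, lemma)` triples below are a hand-written stand-in for tagger output, the stopword set is a tiny illustrative subset, and `nouns_and_objects` is a hypothetical helper. Note that keeping only noun tags also drops punctuation and other non-alphabetic tokens as a side effect.

```python
# Hand-written stand-in for tagger output: (word, XPOS tag, lemma) triples.
tagged_caption = [
    ("the", "DT", "the"),
    ("fishes", "NNS", "fish"),
    ("are", "VBP", "be"),
    ("playing", "VBG", "play"),
    ("in", "IN", "in"),
    ("the", "DT", "the"),
    ("mountains", "NNS", "mountain"),
    (".", ".", "."),
]

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags
STOPWORDS = {"the", "are", "in"}          # tiny illustrative stopword list

def nouns_and_objects(tagged):
    """Return (surface nouns, noun lemmas) after stopword filtering."""
    kept = [(word, lemma) for word, tag, lemma in tagged
            if tag in NOUN_TAGS and word not in STOPWORDS]
    return [word for word, _ in kept], [lemma for _, lemma in kept]

print(nouns_and_objects(tagged_caption))
# (['fishes', 'mountains'], ['fish', 'mountain'])
```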

In [16]:
# loads the text processing pipeline
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma', tokenize_no_ssplit=True, tokenize_pretokenized=False, verbose=True, pos_batch_size=10000)

# treebank-specific POS (XPOS) tags to keep, other POS tagged tokens will not be retained
keep_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']

# Stopwords
stpwords = set(stopwords.words('english'))

# extract parts of speech
def extract_pos(doc):
  parsed_text = list()
  for sent in doc.sentences:
    parsed_sent = list()
    for wrd in sent.words:
      #extract text and pos
      parsed_sent.append((wrd.text, wrd.xpos))
    parsed_text.append(parsed_sent)
  return parsed_text

# extract lemma
def extract_lemma(doc):
  parsed_text = list()
  for sent in doc.sentences:
    parsed_sent = list()
    for wrd in sent.words:
      # extract text and lemma
      parsed_sent.append((wrd.text, wrd.lemma))
    parsed_text.append(parsed_sent)
  return parsed_text

def clean_prompt(sentences):
  # convert the sentences to lower case
  sentences_lc = [sentence.lower() for sentence in sentences]

  # the pipeline is called on one string; with tokenize_no_ssplit=True, blank lines ("double newline")
  # mark sentence boundaries, so each caption is treated as exactly one sentence
  sentence_string = "\n\n".join(sentences_lc)

  # tokenizes, lemmatizes and pos tags the prompt
  processed_prompt = nlp(sentence_string)
  
  # extracts pos tags from the processed_prompt
  pos_tagged_prompt = extract_pos(processed_prompt)

  # lemmatized text
  lemmatized_prompt = extract_lemma(processed_prompt)

  # keep only the noun words, removes stopwords
  fin_prompt, obj_prompt = [], []
  for sent_pos, sent_lemma in zip(pos_tagged_prompt, lemmatized_prompt):
    # surface forms: nouns that are not stopwords
    fin_prompt.append([word for word, pos_tag in sent_pos
                       if pos_tag in keep_pos_tags and word not in stpwords])
    # lemmas: nouns whose surface form or lemma is not a stopword
    obj_prompt.append([lemma for (word, pos_tag), (_, lemma) in zip(sent_pos, sent_lemma)
                       if pos_tag in keep_pos_tags and (word not in stpwords or lemma not in stpwords)])
  return fin_prompt, obj_prompt

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |

INFO:stanza:Use device: gpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!


The example below applies `clean_prompt` to a single caption. The first list holds the nouns as written, the second their lemmas.

In [17]:
clean_prompt(["The fishes are playing in the mountains."])

([['fishes', 'mountains']], [['fish', 'mountain']])
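
`clean_prompt` relies on each caption producing exactly one sentence, because the results are later written back by position. The notebook does not check this. As a precaution, the sketch below shows one way to join captions so that the number of blank-line-separated blocks always equals the number of captions. `join_captions` and its `placeholder` argument are hypothetical and are not part of the original pipeline.

```python
def join_captions(captions, placeholder="empty"):
    """Join captions with blank lines, one block per caption.

    Internal whitespace (including newlines) is collapsed so a caption
    cannot introduce an extra blank-line boundary, and whitespace-only
    captions are replaced by `placeholder` so they still occupy a slot.
    """
    cleaned = []
    for caption in captions:
        text = " ".join(caption.split())
        cleaned.append(text if text else placeholder)
    return "\n\n".join(cleaned)

captions = ["A dog on a couch.\n", "  ", "Two bikes\n\nby a wall."]
joined = join_captions(captions)
assert len(joined.split("\n\n")) == len(captions)
```

A matching check after running the pipeline would compare the number of returned sentence lists with `len(sentences)` before any result is written back.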

Below, we process every caption in the train and validation splits and store the objects detected in each.

In [18]:
NUM_PROMPTS_INFO_DISPLAY = 500 # The running count of unique objects is recorded once every this many prompts

In [19]:
import shutil # WARNING: the output directory is deleted if it already exists
if os.path.exists(os.path.join(PATH, 'Caption-Processing1')):
  shutil.rmtree(os.path.join(PATH, 'Caption-Processing1'))
os.mkdir(os.path.join(PATH, 'Caption-Processing1'))
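
If deleting earlier results is not wanted, a less destructive variant is possible. This is a minimal sketch, assuming it is acceptable for files from a previous run to remain in place until they are overwritten. `ensure_output_dir` is a hypothetical helper, demonstrated in a temporary directory rather than on Drive.

```python
import os
import tempfile

def ensure_output_dir(base, name):
    """Create base/name if it is missing and return its path; nothing is deleted."""
    out_dir = os.path.join(base, name)
    os.makedirs(out_dir, exist_ok=True)
    return out_dir

with tempfile.TemporaryDirectory() as tmp:
    first = ensure_output_dir(tmp, "Caption-Processing1")
    # Simulate a result file left over from an earlier run.
    open(os.path.join(first, "marker.txt"), "w").close()
    second = ensure_output_dir(tmp, "Caption-Processing1")
    survived = os.path.exists(os.path.join(second, "marker.txt"))

print(survived)  # True
```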

In [None]:
print('Starting...')
print('Captions to be processed:', len(prompts_train))
print('Cleaning Prompts... Storing Objects per prompt...')
processed_train = clean_prompt(prompts_train) # start processing the train captions

In [None]:
total_objects = set() # Set of distinct objects detected so far
num_objects_detected = list() # Running count of distinct objects, recorded every NUM_PROMPTS_INFO_DISPLAY prompts

# Attach the cleaned nouns and object lemmas to each train annotation
cleaned_prompts, object_prompts = processed_train
for idx, (cleaned, objects) in enumerate(tqdm(zip(cleaned_prompts, object_prompts), total=len(cleaned_prompts))):
  # update files and object list
  caption_data_train_file['annotations'][idx]['cleaned'] = cleaned
  caption_data_train_file['annotations'][idx]['objects'] = objects
  total_objects.update(objects)

  if (idx+1) % NUM_PROMPTS_INFO_DISPLAY == 0: # Record the running count
    num_objects_detected.append(len(total_objects))

# Record the final count if the last block was incomplete
if len(cleaned_prompts) % NUM_PROMPTS_INFO_DISPLAY != 0:
  num_objects_detected.append(len(total_objects))

# Save the processed captions data
with open(os.path.join(PATH, 'Caption-Processing1/train-captions-processed.json'), 'w') as outfile: # Save Results in json
  outfile.write(json.dumps({'captions': caption_data_train_file['annotations'], 'num_objects': num_objects_detected}, indent=4))

# Save the objects detected info
with open(os.path.join(PATH, 'Caption-Processing1/train-objects.json'), 'w') as outfile: # Saving Total objects in json
  outfile.write(json.dumps({'objects': list(total_objects), 'num_objects': num_objects_detected}, indent=4))

print('Saved and Finished Processing...')
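
The bookkeeping in the loop above is repeated for the validation split. It can be sketched as a small reusable function. `running_unique_counts` is a hypothetical helper and the toy object lists are made up.

```python
def running_unique_counts(object_lists, step):
    """Return (set of all objects, unique-object count after every `step` lists).

    A final count is appended when the number of lists is not a multiple of `step`.
    """
    seen = set()
    counts = []
    for n_done, objects in enumerate(object_lists, start=1):
        seen.update(objects)
        if n_done % step == 0:
            counts.append(len(seen))
    if object_lists and len(object_lists) % step != 0:
        counts.append(len(seen))
    return seen, counts

toy_objects = [["dog", "couch"], ["dog"], ["bike", "wall"], ["cat"], ["dog", "cat"]]
all_objects, counts = running_unique_counts(toy_objects, step=2)
print(counts)  # [2, 5, 5]
```

In the toy run, the last entry covers a single leftover list, just as the last entry of `num_objects_detected` may cover fewer than `NUM_PROMPTS_INFO_DISPLAY` prompts.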

In [None]:
print(total_objects)

In [None]:
print('Starting...')
print('Captions to be processed:', len(prompts_val))
print('Cleaning Prompts... Storing Objects per prompt...')
processed_val = clean_prompt(prompts_val) # start processing the validation captions

In [None]:
total_objects = set() # Set of distinct objects detected so far
num_objects_detected = list() # Running count of distinct objects, recorded every NUM_PROMPTS_INFO_DISPLAY prompts

# Attach the cleaned nouns and object lemmas to each validation annotation
cleaned_prompts, object_prompts = processed_val
for idx, (cleaned, objects) in enumerate(tqdm(zip(cleaned_prompts, object_prompts), total=len(cleaned_prompts))):
  # update files and object list
  caption_data_val_file['annotations'][idx]['cleaned'] = cleaned
  caption_data_val_file['annotations'][idx]['objects'] = objects
  total_objects.update(objects)

  if (idx+1) % NUM_PROMPTS_INFO_DISPLAY == 0: # Record the running count
    num_objects_detected.append(len(total_objects))

# Record the final count if the last block was incomplete
if len(cleaned_prompts) % NUM_PROMPTS_INFO_DISPLAY != 0:
  num_objects_detected.append(len(total_objects))

# Save the processed captions data
with open(os.path.join(PATH, 'Caption-Processing1/val-captions-processed.json'), 'w') as outfile: # Save Results in json
  outfile.write(json.dumps({'captions': caption_data_val_file['annotations'], 'num_objects': num_objects_detected}, indent=4))

# Save the objects detected info
with open(os.path.join(PATH, 'Caption-Processing1/val-objects.json'), 'w') as outfile: # Saving Total objects in json
  outfile.write(json.dumps({'objects': list(total_objects), 'num_objects': num_objects_detected}, indent=4))

print('Saved and Finished Processing...')

In [None]:
print(total_objects)

Now, we look at how the number of unique objects in `total_objects` grew as more prompts were processed.

In [None]:
# Load the objects set for train set
with open(os.path.join(PATH, 'Caption-Processing1/train-objects.json')) as json_file:
  train_objects_file = json.load(json_file)

# Load the objects set for val set
with open(os.path.join(PATH, 'Caption-Processing1/val-objects.json')) as json_file:
  val_objects_file = json.load(json_file)
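
The stored `num_objects` lists are cumulative. To see how many new objects each block of prompts contributed, the differences between consecutive entries can be taken. The sketch below uses made-up counts, and `new_objects_per_interval` is a hypothetical helper.

```python
def new_objects_per_interval(cumulative_counts):
    """Turn cumulative unique-object counts into per-interval gains."""
    gains = []
    previous = 0
    for count in cumulative_counts:
        gains.append(count - previous)
        previous = count
    return gains

toy_counts = [120, 190, 230, 245, 250]  # made-up cumulative counts
print(new_objects_per_interval(toy_counts))  # [120, 70, 40, 15, 5]
```

Applied to `train_objects_file['num_objects']`, shrinking gains would indicate that later captions mostly repeat objects already seen.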

A plot shows this growth more clearly than the raw numbers, so let's plot the results.

In [None]:
plt.figure(figsize=(16,12))
plt.plot(train_objects_file['num_objects'])
plt.xlabel(f'Number of Prompts processed (1 unit = {NUM_PROMPTS_INFO_DISPLAY} prompts)')
plt.ylabel('Number of unique objects extracted')
plt.title('Objects Extracted from COCO captions (train split)')
plt.show()

In [None]:
plt.figure(figsize=(16,12))
plt.plot(val_objects_file['num_objects'])
plt.xlabel(f'Number of Prompts processed (1 unit = {NUM_PROMPTS_INFO_DISPLAY} prompts)')
plt.ylabel('Number of unique objects extracted')
plt.title('Objects Extracted from COCO captions (validation split)')
plt.show()