### Connect Google Drive

Open the [COLAB NOTEBOOK HERE](https://colab.research.google.com/drive/1lLfWRUrcfo_yKM0d9QZdhN5r8dlHNuZ2?usp=sharing).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Set your project path below.

In [None]:
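# `!cd` runs in a throwaway subshell and does not change the notebook's working directory; the `%cd` magic does
%cd /content/drive/MyDrive/AIISC-Internship/text-based-object-discovery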

In [None]:
PATH = '/content/drive/MyDrive/AIISC-Internship/text-based-object-discovery'

### Local PC or Mac

In [6]:
PATH = '/Users/rishideychowdhury/Desktop/Text-Based-Object-Discovery'
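
The two `PATH` cells above can be folded into one so the notebook runs unchanged in both environments. The following is a minimal sketch, not part of the original notebook: `running_in_colab` and `resolve_path` are hypothetical helpers, and the check relies on the assumption that the Colab kernel has `google.colab` loaded in `sys.modules`.

```python
import sys

# The two locations used in this notebook; replace both with your own.
COLAB_PATH = '/content/drive/MyDrive/AIISC-Internship/text-based-object-discovery'
LOCAL_PATH = '/Users/rishideychowdhury/Desktop/Text-Based-Object-Discovery'

def running_in_colab():
  # Assumption: the Colab kernel preloads the google.colab module
  return 'google.colab' in sys.modules

def resolve_path(in_colab, colab_path=COLAB_PATH, local_path=LOCAL_PATH):
  # Pick the project root that matches the current environment
  return colab_path if in_colab else local_path

PATH = resolve_path(running_in_colab())
print(PATH)
```

Passing the environment flag as an argument keeps `resolve_path` easy to check by hand for both cases.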

### Install Required Packages

`Stanza`, the Stanford NLP package, benefits from a GPU, so enable one under `View Resources > Change runtime type`.

In [None]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-4e4bd966-8027-b9b5-2880-23cb26555ad4)


In [None]:
!pip install stanza # for stanford pos tagger
!pip install ftfy regex tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stanza
  Downloading stanza-1.4.2-py3-none-any.whl (691 kB)
Collecting emoji
  Downloading emoji-2.2.0.tar.gz (240 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.2.0-py3-none-any.whl size=234926 sha256=490ec6569a39a1f6ffd2470b17be6af43097309b03ba1ad434c2c40759aa8bb4
  Stored in directory: /root/.cache/pip/wheels/86/62/9e/a6b27a681abcde69970dbc0326ff51955f3beac72f15696984
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully installed emoji-2.2.0 stan

### Load Necessary Libraries

We will load the necessary libraries required for extracting objects from prompts.

In [3]:
import os
import json
from tqdm import tqdm

from matplotlib import pyplot as plt

import numpy as np
import pandas as pd

from nltk.corpus import stopwords

Download the NLTK resources used below: the stopword list, WordNet, the `punkt` tokenizer, and the POS tagger.

In [4]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rishideychowdhury/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rishideychowdhury/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rishideychowdhury/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/rishideychowdhury/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rishideychowdhury/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
import stanza
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

2023-01-06 16:12:15 INFO: Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/default.zip:   0%|          | 0…

2023-01-06 16:16:28 INFO: Finished downloading models and saved to /Users/rishideychowdhury/stanza_resources.


### Load Data

Below, we load the `Google Conceptual Caption` annotations and pull out the captions, from which the objects are extracted in the following sections.

In [9]:
train_file = pd.read_csv(os.path.join(PATH, 'Data/Goggle-Conceptual-Caption/Train_GCC-training.tsv'), sep='\t', names=['captions', 'url'])
val_file = pd.read_csv(os.path.join(PATH, 'Data/Goggle-Conceptual-Caption/Validation_GCC-1.1.0-Validation.tsv'), sep='\t', names=['captions', 'url'])
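
The train file holds over three million rows, so it can help to peek at the first few lines before loading everything with pandas. The sketch below is an illustration using only the standard library; `peek_tsv` is a hypothetical helper, and the in-memory sample with `example.com` URLs stands in for the real TSV file, which, like the `pd.read_csv` call above, is assumed to have no header row and two tab-separated columns.

```python
import csv
import io
from itertools import islice

def peek_tsv(handle, n=5):
  """Return the first n rows of a headerless, tab-separated stream as tuples."""
  reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
  return [tuple(row) for row in islice(reader, n)]

# In-memory stand-in for the real annotation file (made-up URLs)
sample = io.StringIO(
  'a very typical bus station\thttp://example.com/1.jpg\n'
  'man driving a car through the mountains\thttp://example.com/2.jpg\n'
)

for caption, url in peek_tsv(sample):
  print(caption, '->', url)
```

With the real data, the same function can be called on `open(path, newline='', encoding='utf-8')`.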

In [10]:
train_file

| | captions | url |
| --- | --- | --- |
| 0 | a very typical bus station | http://lh6.ggpht.com/-IvRtNLNcG8o/TpFyrudaT6I/... |
| 1 | sierra looked stunning in this top and this sk... | http://78.media.tumblr.com/3b133294bdc7c7784b7... |
| 2 | young confused girl standing in front of a war... | https://media.gettyimages.com/photos/young-con... |
| 3 | interior design of modern living room with fir... | https://thumb1.shutterstock.com/display_pic_wi... |
| 4 | cybernetic scene isolated on white background . | https://thumb1.shutterstock.com/display_pic_wi... |
| ... | ... | ... |
| 3318328 | the teams line up for a photo after kick - off | https://i0.wp.com/i.dailymail.co.uk/i/pix/2015... |
| 3318329 | stickers given to delegates at the convention . | http://cdn.radioiowa.com/wp-content/uploads/20... |
| 3318330 | this is my very favourite design that i recent... | https://i.pinimg.com/736x/96/f0/77/96f07728efe... |
| 3318331 | man driving a car through the mountains | https://www.quickenloans.com/blog/wp-content/u... |


In [11]:
val_file

| | captions | url |
| --- | --- | --- |
| 0 | author : a life in photography -- in pictures | https://i.pinimg.com/736x/66/01/6c/66016c3ba27... |
| 1 | an angler fishes river on a snowy day . | http://www.standard.net/image/2015/02/04/800x_... |
| 2 | photograph of the sign being repaired by brave... | http://indianapolis-photos.funcityfinder.com/f... |
| 3 | the player staring intently at a computer scre... | http://www.abc.net.au/news/image/9066492-3x2-7... |
| 4 | globes : the green 3d person carrying in hands... | https://www.featurepics.com/StockImage/2009031... |
| ... | ... | ... |
| 15835 | a bougainvillea with pink flowers on a white b... | https://media.istockphoto.com/photos/bougainvi... |
| 15836 | ingredient hanging over river during festival | http://l7.alamy.com/zooms/4e49c7b4c0274166bb07... |
| 15837 | the general circulation of the atmosphere | http://slideplayer.com/5036014/16/images/22/Th... |
| 15838 | young teenager and her black horse in a traini... | https://www.featurepics.com/StockImage/2008082... |


Now we load the train and validation captions into lists.

In [12]:
prompts_train = list(train_file['captions'])
prompts_val = list(val_file['captions'])
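
Before any text processing, it may be worth checking that every caption is a usable string: a missing value read by pandas shows up as a float `NaN`, which has no `.lower()` method, and a blank caption could throw off the one-caption-per-sentence alignment that the later steps rely on. The sketch below is a hypothetical check, not part of the original notebook; `find_invalid_captions` is an illustrative name.

```python
def find_invalid_captions(captions):
  """Return the indices of captions that are not non-empty strings."""
  return [
    idx for idx, caption in enumerate(captions)
    if not isinstance(caption, str) or not caption.strip()
  ]

# Toy list with one NaN and one whitespace-only caption
example = ['a very typical bus station', float('nan'), '   ', 'man driving a car']
print(find_invalid_captions(example))
```

An empty result for `prompts_train` and `prompts_val` would mean the lists can be passed on as they are.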

In [13]:
def show_captions():
  print('***train captions***\n', '\n'.join(prompts_train[:5]))
  print()
  print('Number of train captions:', len(prompts_train))
  print()
  print()
  print('***validation captions:***\n', '\n'.join(prompts_val[:5]))
  print()
  print('Number of validation captions:', len(prompts_val))

show_captions()

***train captions***
 a very typical bus station
sierra looked stunning in this top and this skirt while performing with person at their former university
young confused girl standing in front of a wardrobe
interior design of modern living room with fireplace in a new house
cybernetic scene isolated on white background .

Number of train captions: 3318333


***validation captions:***
 author : a life in photography -- in pictures
an angler fishes river on a snowy day .
photograph of the sign being repaired by brave person
the player staring intently at a computer screen .
globes : the green 3d person carrying in hands globe

Number of validation captions: 15840


### Caption Processing

I clean the prompts in a few steps:
- Lower Case Conversion
- Tokenization (the captions provided by `Google Conceptual Caption` are already pre-tokenized)
- Remove stop words
- Remove non-alphabetic tokens
- Keep only nouns
- Lemmatization (to store the object name)
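
The steps listed above can be sketched on a single hand-tagged caption without loading any NLP model. This is a toy illustration only: `TOY_STOPWORDS` is a tiny made-up stopword set (the notebook uses NLTK's English list), the POS tags and lemmas are written by hand rather than predicted, and `toy_clean` is a hypothetical function, not the `clean_prompt` defined in this notebook.

```python
# Tiny made-up stopword set for illustration; NLTK's list is much longer
TOY_STOPWORDS = {'the', 'in', 'a', 'are'}
# Penn Treebank noun tags
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

def toy_clean(tagged_tokens):
  """tagged_tokens: list of (text, xpos, lemma) triples for one caption."""
  cleaned, objects = [], []
  for text, xpos, lemma in tagged_tokens:
    text = text.lower()              # lower case conversion
    if text in TOY_STOPWORDS:        # remove stop words
      continue
    if not text.isalpha():           # remove non-alphabetic tokens
      continue
    if xpos not in NOUN_TAGS:        # keep only nouns
      continue
    cleaned.append(text)
    objects.append(lemma)            # the lemma is stored as the object name
  return cleaned, objects

# Hand-tagged example caption
caption = [
  ('The', 'DT', 'the'), ('fishes', 'NNS', 'fish'), ('are', 'VBP', 'be'),
  ('playing', 'VBG', 'play'), ('in', 'IN', 'in'), ('the', 'DT', 'the'),
  ('mountains', 'NNS', 'mountain'), ('.', '.', '.'),
]
print(toy_clean(caption))
```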

In [14]:
# loads the text processing pipeline
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma', tokenize_no_ssplit=True, tokenize_pretokenized=True, verbose=True, pos_batch_size=10000)

# treebank-specific POS (XPOS) tags to keep, other POS tagged tokens will not be retained
keep_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']

# Stopwords
stpwords = set(stopwords.words('english'))

# extract parts of speech
def extract_pos(doc):
  parsed_text = list()
  for sent in doc.sentences:
    parsed_sent = list()
    for wrd in sent.words:
      #extract text and pos
      parsed_sent.append((wrd.text, wrd.xpos))
    parsed_text.append(parsed_sent)
  return parsed_text

# extract lemma
def extract_lemma(doc):
  parsed_text = list()
  for sent in doc.sentences:
    parsed_sent = list()
    for wrd in sent.words:
      # extract text and lemma
      parsed_sent.append((wrd.text, wrd.lemma))
    parsed_text.append(parsed_sent)
  return parsed_text

def clean_prompt(sentences):
  # convert the sentences to lower case
  sentences_lc = [sentence.lower() for sentence in sentences]

  # the pipeline is fed one string here: captions are joined with a blank line,
  # and tokenize_no_ssplit=True keeps each caption as a single sentence
  sentence_string = "\n\n".join(sentences_lc)

  # tokenizes, lemmatizes and pos tags the prompt
  processed_prompt = nlp(sentence_string)
  
  # extracts pos tags from the processed_prompt
  pos_tagged_prompt = extract_pos(processed_prompt)

  # lemmatized text
  lemmatized_prompt = extract_lemma(processed_prompt)

  # keep only the noun words, removes stopwords
  fin_prompt = [
    [word for word, pos_tag in sent
     if pos_tag in keep_pos_tags and word not in stpwords]
    for sent in pos_tagged_prompt
  ]
  # lemmas of the noun words; a word is dropped only when both its text and its lemma are stopwords
  obj_prompt = [
    [lemma for (_, pos_tag), (word, lemma) in zip(sent_pos, sent_lemma)
     if pos_tag in keep_pos_tags and (word not in stpwords or lemma not in stpwords)]
    for sent_pos, sent_lemma in zip(pos_tagged_prompt, lemmatized_prompt)
  ]
  return fin_prompt, obj_prompt

2023-01-06 16:19:43 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

2023-01-06 16:19:44 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |

2023-01-06 16:19:44 INFO: Use device: cpu
2023-01-06 16:19:44 INFO: Loading: tokenize
2023-01-06 16:19:44 INFO: Loading: pos
2023-01-06 16:19:44 INFO: Loading: lemma
2023-01-06 16:19:44 INFO: Done loading processors!


An example is shown below for the application of `clean_prompt`.

In [15]:
clean_prompt(["The fishes are playing in the mountains."])

([['fishes', 'mountains.']], [['fish', 'mountains.']])
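
Note that the period stays attached in `'mountains.'`: the pipeline is built with `tokenize_pretokenized=True`, so the input appears to be split on whitespace only, and the example sentence, unlike the dataset captions, is not pre-tokenized. One way to guard against this is a small post-filter. The sketch below is an illustration under that assumption; `strip_non_alpha` is a hypothetical helper and is not called anywhere in this notebook.

```python
import string

def strip_non_alpha(tokens):
  """Trim punctuation from both ends of each token and keep purely alphabetic results."""
  kept = []
  for token in tokens:
    core = token.lower().strip(string.punctuation)
    if core.isalpha():  # also rejects empty strings and tokens containing digits
      kept.append(core)
  return kept

print(strip_non_alpha(['fishes', 'mountains.', '--', '3d']))
```

Applied to each sentence of the two returned lists, this would turn `'mountains.'` into `'mountains'` and drop tokens such as `'--'`.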

Below, we start processing each prompt and store the objects detected in the captions from train and validation split.

In [16]:
NUM_PROMPTS_INFO_DISPLAY = 500 # The running count of unique objects is recorded after every this many prompts

In [None]:
# import shutil # Removes directory if already present! CAREFUL!!!!!!!!!!!!!!!!!!
# if os.path.exists(os.path.join(PATH, 'Caption-Processing1')):
#   shutil.rmtree(os.path.join(PATH, 'Caption-Processing1'))
# os.mkdir(os.path.join(PATH, 'Caption-Processing1'))

In [None]:
print('Starting...')
print('Captions to be processed:', len(prompts_train))
print('Cleaning Prompts... Storing Objects per prompt...')
processed_train = clean_prompt(prompts_train) # start processing the train captions

Starting...
Captions to be processed: 3318333
Cleaning Prompts... Storing Objects per prompt...
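
The call above joins all 3.3 million captions into one string and processes it in a single pass. If memory or run time becomes a problem, one option is to feed the captions in fixed-size chunks and concatenate the results. The sketch below is a hypothetical wrapper, not the method used in this notebook: `process_in_chunks` and the default `chunk_size` are illustrative, and `fake_clean` stands in for `clean_prompt` so the example runs without Stanza.

```python
def process_in_chunks(sentences, clean_fn, chunk_size=10000):
  """Apply clean_fn to consecutive slices of sentences and merge the two result lists."""
  cleaned_all, objects_all = [], []
  for start in range(0, len(sentences), chunk_size):
    cleaned, objects = clean_fn(sentences[start:start + chunk_size])
    cleaned_all.extend(cleaned)
    objects_all.extend(objects)
  return cleaned_all, objects_all

def fake_clean(batch):
  # Stand-in with the same return shape as clean_prompt: two lists of token lists
  tokens = [sentence.lower().split() for sentence in batch]
  return tokens, tokens

cleaned, objects = process_in_chunks(['A bus', 'A car', 'A dog'], fake_clean, chunk_size=2)
print(cleaned)
```

With the real function, the call would be `process_in_chunks(prompts_train, clean_prompt)`, and the result has the same `(cleaned, objects)` shape as `processed_train`.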


In [None]:
total_objects = set() # Set of distinct objects detected so far
num_objects_detected = list() # Running count of distinct objects, recorded every NUM_PROMPTS_INFO_DISPLAY prompts
caption_data_train_file = {'annotations':[{'caption':caption} for caption in prompts_train]} # For storing results

# Processing each prompt and updating annotation file for train set
cleaned_prompts, object_prompts = processed_train
for idx, prompt in tqdm(enumerate(zip(cleaned_prompts, object_prompts))):
  cleaned, objects = prompt # Process prompt
  # update files and object list
  caption_data_train_file['annotations'][idx]['cleaned'] = cleaned
  caption_data_train_file['annotations'][idx]['objects'] = objects
  total_objects.update(set(objects))

  if (idx+1) % NUM_PROMPTS_INFO_DISPLAY == 0: # Record the running count of unique objects
    num_objects_detected.append(len(total_objects))

# Record the final count if the last batch of prompts was smaller than NUM_PROMPTS_INFO_DISPLAY
if (idx+1) % NUM_PROMPTS_INFO_DISPLAY != 0:
  num_objects_detected.append(len(total_objects))

# Save the processed captions data
with open(os.path.join(PATH, 'Data-Captions/GCC/train-captions-processed.json'), 'w') as outfile: # Save Results in json
  outfile.write(json.dumps({'captions': caption_data_train_file['annotations'], 'num_objects': num_objects_detected}, indent=4))

# Save the objects detected info
with open(os.path.join(PATH, 'Data-Captions/GCC/train-objects.json'), 'w') as outfile: # Saving Total objects in json
  outfile.write(json.dumps({'objects': list(total_objects), 'num_objects': num_objects_detected}, indent=4))

print('Saved and Finished Processing...')
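
The bookkeeping in the cell above (a growing set of objects plus a count recorded every `NUM_PROMPTS_INFO_DISPLAY` prompts and once more at the end) can be isolated into a small function and checked on toy data. This is a sketch for illustration; `running_unique_counts` is a hypothetical name and the three object lists are made up.

```python
def running_unique_counts(object_lists, every):
  """Return (set of unique objects, unique-object counts recorded every `every` lists)."""
  seen = set()
  counts = []
  processed = 0
  for processed, objects in enumerate(object_lists, start=1):
    seen.update(objects)
    if processed % every == 0:
      counts.append(len(seen))
  # Record the final count when the last group is smaller than `every`
  if processed % every != 0:
    counts.append(len(seen))
  return seen, counts

# Made-up per-caption object lists
toy_objects = [['bus', 'station'], ['girl', 'wardrobe'], ['bus']]
seen, counts = running_unique_counts(toy_objects, every=2)
print(counts)
```

Initialising `processed` before the loop also keeps the final check safe when the input is empty.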

In [None]:
print(total_objects)

In [None]:
print('Starting...')
print('Captions to be processed:', len(prompts_val))
print('Cleaning Prompts... Storing Objects per prompt...')
processed_val = clean_prompt(prompts_val) # start processing the validation captions

In [None]:
total_objects = set() # Set of distinct objects detected so far
num_objects_detected = list() # Running count of distinct objects, recorded every NUM_PROMPTS_INFO_DISPLAY prompts
caption_data_val_file = {'annotations':[{'caption':caption} for caption in prompts_val]} # For storing results

# Processing each prompt and updating annotation file for validation set
cleaned_prompts, object_prompts = processed_val
for idx, prompt in tqdm(enumerate(zip(cleaned_prompts, object_prompts))):
  cleaned, objects = prompt # Process prompt
  # update files and object list
  caption_data_val_file['annotations'][idx]['cleaned'] = cleaned
  caption_data_val_file['annotations'][idx]['objects'] = objects
  total_objects.update(set(objects))

  if (idx+1) % NUM_PROMPTS_INFO_DISPLAY == 0: # Record the running count of unique objects
    num_objects_detected.append(len(total_objects))

# Record the final count if the last batch of prompts was smaller than NUM_PROMPTS_INFO_DISPLAY
if (idx+1) % NUM_PROMPTS_INFO_DISPLAY != 0:
  num_objects_detected.append(len(total_objects))

# Save the processed captions data
with open(os.path.join(PATH, 'Data-Captions/GCC/val-captions-processed.json'), 'w') as outfile: # Save Results in json
  outfile.write(json.dumps({'captions': caption_data_val_file['annotations'], 'num_objects': num_objects_detected}, indent=4))

# Save the objects detected info
with open(os.path.join(PATH, 'Data-Captions/GCC/val-objects.json'), 'w') as outfile: # Saving Total objects in json
  outfile.write(json.dumps({'objects': list(total_objects), 'num_objects': num_objects_detected}, indent=4))

print('Saved and Finished Processing...')

In [None]:
print(total_objects)

Now we look at how the number of unique objects in `total_objects` grows as more prompts are processed.

In [None]:
# Load the objects set for train set
with open(os.path.join(PATH, 'Data-Captions/GCC/train-objects.json')) as json_file:
  train_objects_file = json.load(json_file)

# Load the objects set for val set
with open(os.path.join(PATH, 'Data-Captions/GCC/val-objects.json')) as json_file:
  val_objects_file = json.load(json_file)

A plot shows this growth more clearly than raw numbers, so let's plot the results.

In [None]:
plt.figure(figsize=(16,12))
plt.plot(train_objects_file['num_objects'])
plt.xlabel(f'Number of Prompts processed (1 unit = {NUM_PROMPTS_INFO_DISPLAY} prompts)')
plt.ylabel('Number of unique objects extracted')
plt.title('Objects Extracted from Google Conceptual Captions (train split)')
plt.show()

In [None]:
plt.figure(figsize=(16,12))
plt.plot(val_objects_file['num_objects'])
plt.xlabel(f'Number of Prompts processed (1 unit = {NUM_PROMPTS_INFO_DISPLAY} prompts)')
plt.ylabel('Number of unique objects extracted')
plt.title('Objects Extracted from Google Conceptual Captions (validation split)')
plt.show()
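
In the plots above, the x-axis is the checkpoint index rather than the number of prompts, and the last checkpoint may cover fewer than `NUM_PROMPTS_INFO_DISPLAY` prompts. A small helper can convert checkpoint indices into actual prompt counts. This is a sketch with a hypothetical function name, `checkpoint_positions`, under the assumption that counts were recorded as in the processing cells of this notebook.

```python
def checkpoint_positions(num_prompts, every):
  """Return the number of prompts processed at each recorded checkpoint."""
  positions = list(range(every, num_prompts + 1, every))
  # The final checkpoint covers a partial group when num_prompts is not a multiple of `every`
  if num_prompts % every != 0:
    positions.append(num_prompts)
  return positions

print(checkpoint_positions(1200, 500))
```

The returned list has one entry per recorded count, so it could be passed as the x values, for example `plt.plot(checkpoint_positions(len(prompts_val), NUM_PROMPTS_INFO_DISPLAY), val_objects_file['num_objects'])`.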