### Install Required Packages

`Stanza`, the Stanford NLP package, benefits from a `GPU`, so enable one under `View Resources > Change runtime type`.

In [2]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-ff8ef99e-b482-1d18-03ac-44412b25c0c7)


In [3]:
%%capture
!pip install stanza # for stanford pos tagger
!pip install ftfy regex tqdm
!pip install datasets

### Load Necessary Libraries

We load the libraries required for generating DAAM outputs for input prompts.

In [4]:
# General
import os
import gc
import json
import time
from tqdm import tqdm

# Plotting
from matplotlib import pyplot as plt

# Data-Handling
import numpy as np
import pandas as pd
from datasets import load_dataset
from pycocotools.coco import COCO

# Model Handling
import torch

# Caption-Processing
from nltk.corpus import stopwords

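Download the NLTK resources used below: the stopword list, WordNet (with `omw-1.4`), the `punkt` tokenizer, and the averaged perceptron POS tagger.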

In [5]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [6]:
# POS-Tagging
import stanza
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


### Load Data

Below, we load the `LAION-2B` dataset (image URLs and captions) in streaming mode, which avoids downloading over `350 GB` of data. Only the `train` split is available.

In [7]:
dataset = load_dataset('laion/laion2B-en', split='train', streaming=True)

Downloading readme:   0%|          | 0.00/30.0 [00:00<?, ?B/s]



For faster processing, I group the data into batches, using `batch_size=10000`.

In [8]:
BATCH_SIZE = 10000 # Also the save interval (SAVE_AFTER): results are saved after processing this many prompts.

In [9]:
# For processing data in batches
def group_batch(batch):
  return {k: [v] for k, v in batch.items()}
data = dataset.map(group_batch, batch_size=BATCH_SIZE, batched=True, remove_columns=['SAMPLE_ID', 'URL', 'HEIGHT', 'WIDTH', 'LICENSE', 'NSFW', 'similarity'])
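
The effect of `group_batch` can be checked on a tiny in-memory batch. This is a minimal sketch, assuming (as with batched mapping in `datasets`) that the function receives a dict mapping each column name to a list of values; the toy column values are made up for illustration.

```python
def group_batch(batch):
    # Wrap every column's list in an outer list, so the whole
    # batch becomes a single row whose cells are lists.
    return {k: [v] for k, v in batch.items()}

# Toy batch with made-up values (3 rows, 2 columns).
toy_batch = {
    "TEXT": ["a cat", "a dog", "a fish"],
    "URL": ["u1", "u2", "u3"],
}

grouped = group_batch(toy_batch)
print(len(toy_batch["TEXT"]), "rows in ->", len(grouped["TEXT"]), "row out")
print(grouped["TEXT"][0])
```

Each item yielded by the mapped dataset therefore carries up to `BATCH_SIZE` captions in its `TEXT` field, which is why a single `next(iter(data))` can return a whole batch.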

We will look at the captions in the Caption Processing part, together with the cleaned captions. For comparison, I use the first `BATCH_SIZE` captions.

In [10]:
t0 = time.time()
examples = next(iter(data))['TEXT']
t1 = time.time()
print(f'Fetched {BATCH_SIZE} captions in {t1-t0} secs')

Fetched 10000 captions in 2.7841286659240723 secs


### Caption Processing

Cleaning the prompts. I adopt a few steps to clean each prompt (struck-through steps are not applied in this version):
- Lower Case Conversion
- Tokenization
- ~Remove stop words~
- ~Remove non-alphabets~
- ~Keep only nouns~ POS Tag each word
- ~Lemmatization (to store the object name)~ !VERY IMPORTANT
- ~Discard any lemma/word with non-alphabet characters. (As `LAION` has lots of noise)~

#### Stanza Pipeline

- Fully Neural Pipeline

In [20]:
# loads the text processing pipeline
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos', tokenize_no_ssplit=True, verbose=True, pos_batch_size=6500)

# extract parts of speech
def extract_pos(doc):
  parsed_text = list()
  for sent in doc.sentences:
    parsed_sent = list()
    for wrd in sent.words:
      #extract text and pos
      parsed_sent.append((wrd.text, wrd.xpos))
    parsed_text.append(parsed_sent)
  return parsed_text

def clean_prompt(sentences):
  # convert the sentences to lower case
  sentences_lc = [sentence.lower() for sentence in sentences]

  # stanza accepts only a single string instead of list of strings. So, we have set the tokenize_no_ssplit=True and have to join each sentence with double newline
  sentence_string = "\n\n".join(sentences_lc)

  # tokenizes and pos tags the prompt
  with torch.no_grad():
    processed_prompt = nlp(sentence_string)
  
  # extracts pos tags from the processed_prompt
  pos_tagged_prompt = extract_pos(processed_prompt)

  del processed_prompt
  
  return pos_tagged_prompt

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |

INFO:stanza:Use device: gpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Done loading processors!
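
Because `clean_prompt` joins captions with `"\n\n"` and relies on `tokenize_no_ssplit=True` to get one sentence back per caption, a caption that is empty, missing, or itself contains a blank line may shift the alignment between inputs and outputs. The sketch below is one possible pre-cleaning step, not part of the original pipeline; `sanitize_caption`, `join_for_stanza`, and the `PLACEHOLDER` value are hypothetical names introduced here.

```python
PLACEHOLDER = "."  # hypothetical stand-in so empty captions still yield one chunk

def sanitize_caption(caption):
    """Lower-case a caption and collapse all whitespace (including newlines)."""
    text = " ".join((caption or "").split())
    return text.lower() if text else PLACEHOLDER

def join_for_stanza(captions):
    """Return the double-newline-joined string plus the cleaned captions."""
    cleaned = [sanitize_caption(c) for c in captions]
    return "\n\n".join(cleaned), cleaned

# Made-up captions covering the problem cases: an embedded blank line,
# an empty string, a missing value, and stray whitespace.
captions = ["A red\n\nbicycle", "", None, "  Two dogs\tplaying  "]
joined, cleaned = join_for_stanza(captions)

# One chunk per input caption, so outputs can be zipped back to inputs.
print(len(joined.split("\n\n")) == len(captions))
```

With this guard, the number of `"\n\n"`-separated chunks always equals the number of input captions, which keeps the later `pd.DataFrame` columns the same length.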


An example of how it works

In [21]:
clean_prompt(['The fishes are playing in the mountains.'])

[[('the', 'DT'),
  ('fishes', 'NNS'),
  ('are', 'VBP'),
  ('playing', 'VBG'),
  ('in', 'IN'),
  ('the', 'DT'),
  ('mountains', 'NNS'),
  ('.', '.')]]

Run the above pipeline on the first batch and save the results to a CSV file.

In [22]:
# Takes quite a bit of time with large batch_size.
# Note: uses the global `examples` and whichever `clean_prompt` is currently defined.
def process(file_name):
  t1 = time.time()
  pos_tagged = clean_prompt(examples)
  t2 = time.time()

  pd.DataFrame({
      'original prompts': examples,
      'pos tagged prompt': pos_tagged,
  }).to_csv(f'{file_name}.csv')

  print(f'Time taken to pos tag 1 batch containing {BATCH_SIZE} prompts:{t2-t1} secs')
  return pos_tagged

stanza_pos_1 = process('stanza-full-neural')

Time taken to pos tag 1 batch containing 10000 prompts:25.224125862121582 secs


#### NLTK Pipeline

In [23]:
def clean_prompt(sentences):
  # convert the sentences to lower case
  sentences_lc = [sentence.lower() for sentence in sentences]

  # tokenizes and pos tags the prompt
  processed_prompt = list()
  for sentence_string in sentences_lc:
    processed_prompt.append(nltk.pos_tag(nltk.word_tokenize(sentence_string)))
  
  return processed_prompt

In [24]:
nltk_pos = process('nltk')

Time taken to pos tag 1 batch containing 10000 prompts:9.226912260055542 secs


NLTK takes significantly less time for POS tagging than Stanza: about 9.2 secs versus 25.2 secs for the same 10,000 prompts.
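
To put the two timings on a common scale, the following sketch converts them to throughput and a speed-up factor. It only uses the two wall-clock times printed above; the numbers come from a single run, so treat them as indicative.

```python
BATCH_SIZE = 10_000

# Wall-clock times (secs) reported by the two `process` runs above.
timings = {
    "stanza": 25.224125862121582,
    "nltk": 9.226912260055542,
}

# Prompts tagged per second for each tagger.
throughput = {name: BATCH_SIZE / secs for name, secs in timings.items()}
# How many times faster NLTK was on this batch.
speedup = timings["stanza"] / timings["nltk"]

for name, rate in throughput.items():
    print(f"{name}: {rate:.0f} prompts/sec")
print(f"NLTK is {speedup:.2f}x faster on this batch")
```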

Let's compare the quality of results.
- How many unique nouns does each tagger detect? All noun tags (singular, plural, proper) are pooled, and words are not lemmatized.

In [26]:
noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']

stanza_nouns = set([word for sent in stanza_pos_1 for word, pos in sent if pos in noun_tags])
nltk_nouns = set([word for sent in nltk_pos for word, pos in sent if pos in noun_tags])

print(f'Number of nouns detected by Stanza: {len(stanza_nouns)}')
print(f'Number of nouns detected by NLTK: {len(nltk_nouns)}')

Number of nouns detected by Stanza: 18108
Number of nouns detected by NLTK: 16628


Stanza tags noticeably more unique words as nouns than NLTK: 18,108 versus 16,628, about 9% more.
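
Raw counts do not show where the two taggers disagree. A minimal sketch of a set-based comparison follows; `compare_noun_sets` is a hypothetical helper, and the toy sets stand in for `stanza_nouns` and `nltk_nouns`.

```python
def compare_noun_sets(a, b):
    """Summarize the overlap between two sets of noun strings."""
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard similarity: shared items over all distinct items.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Toy stand-ins for the real noun sets (made-up values).
stanza_nouns_toy = {"fishes", "mountains", "bicycle", "dog"}
nltk_nouns_toy = {"fishes", "mountains", "dog", "red"}

report = compare_noun_sets(stanza_nouns_toy, nltk_nouns_toy)
print(report)
```

Calling `compare_noun_sets(stanza_nouns, nltk_nouns)` on the real sets would show how many nouns both taggers agree on, and inspecting `stanza_nouns - nltk_nouns` would list the words only Stanza treats as nouns.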