<a href="https://colab.research.google.com/github/SamuelBFG/DL-studies/blob/master/1_get_MSCOCO_validation_image_ids_with_legible_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Cloud APIs for Computer Vision: Up and Running in 15 Minutes**

This code is part of Chapter 8, Cloud APIs for Computer Vision: Up and Running in 15 Minutes.

### **Get MSCOCO validation image ids with legible text**

We will develop a dataset of images from the MSCOCO dataset that contain at least a single instance of legible text and are in the validation split.

In order to do this, we first need to download cocotext.v2.json from https://bgshih.github.io/cocotext/.

In [1]:
!wget -nc -q -O tmp.zip https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip && unzip -n tmp.zip && rm tmp.zip

Archive:  tmp.zip
  inflating: cocotext.v2.json        


Let's verify that the file has been downloaded and that it exists.

In [2]:
import os
os.path.isfile('./cocotext.v2.json')

True

We also need to download the coco_text.py file from the COCO-Text repository (see http://vision.cornell.edu/se3/coco-text/).

In [3]:
!wget -nc https://raw.githubusercontent.com/bgshih/coco-text/master/coco_text.py

--2020-11-28 14:22:48--  https://raw.githubusercontent.com/bgshih/coco-text/master/coco_text.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10259 (10K) [text/plain]
Saving to: ‘coco_text.py’


2020-11-28 14:22:48 (78.8 MB/s) - ‘coco_text.py’ saved [10259/10259]



In [4]:
import coco_text

In [5]:
# Load the COCO text json file
ct = coco_text.COCO_Text('./cocotext.v2.json')

loading annotations into memory...
0:00:01.706332
creating index...
index created!


In [6]:
# Find the total number of images in validation set
print(len(ct.val))

10000


In [7]:
!git clone https://github.com/waleedka/coco
!pip install -U setuptools
!pip install -U wheel
!make install -C coco/PythonAPI

Cloning into 'coco'...
remote: Enumerating objects: 904, done.
remote: Total 904 (delta 0), reused 0 (delta 0), pack-reused 904
Receiving objects: 100% (904/904), 10.39 MiB | 30.39 MiB/s, done.
Resolving deltas: 100% (539/539), done.
Requirement already up-to-date: setuptools in /usr/local/lib/python3.6/dist-packages (50.3.2)
Requirement already up-to-date: wheel in /usr/local/lib/python3.6/dist-packages (0.35.1)
make: Entering directory '/content/coco/PythonAPI'
# install pycocotools to the Python site-packages
python setup.py build_ext install
Compiling pycocotools/_mask.pyx because it changed.
[1/1] Cythonizing pycocotools/_mask.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running build_ext
building 'pycocotools._mask' extension
creating build
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/pycocotools
creating build/common
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-securi

In [8]:
ls /content/coco/PythonAPI

Makefile  pycocoDemo.ipynb  pycocoEvalDemo.ipynb  pycocotools/  setup.py


In [9]:
!pwd

/content


In [10]:
!curl http://images.cocodataset.org/zips/train2014.zip --output ./train2014.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.5G  100 12.5G    0     0  41.3M      0  0:05:11  0:05:11 --:--:-- 47.1M


In [11]:
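from zipfile import ZipFile

file_name = "train2014.zip"

# Extract the archive into the current working directory.
# The handle is named `archive` so it does not shadow the built-in zip().
with ZipFile(file_name, 'r') as archive:
    archive.extractall()
    print('Done')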

Done


In [12]:
ls /content/train2014/

Streaming output truncated to the last 5000 lines.
COCO_train2014_000000256648.jpg  COCO_train2014_000000547308.jpg
COCO_train2014_000000256651.jpg  COCO_train2014_000000547315.jpg
COCO_train2014_000000256655.jpg  COCO_train2014_000000547318.jpg
COCO_train2014_000000256659.jpg  COCO_train2014_000000547348.jpg
COCO_train2014_000000256662.jpg  COCO_train2014_000000547351.jpg
COCO_train2014_000000256664.jpg  COCO_train2014_000000547352.jpg
COCO_train2014_000000256683.jpg  COCO_train2014_000000547363.jpg
COCO_train2014_000000256690.jpg  COCO_train2014_000000547367.jpg
COCO_train2014_000000256706.jpg  COCO_train2014_000000547369.jpg
COCO_train2014_000000256707.jpg  COCO_train2014_000000547378.jpg
COCO_train2014_000000256720.jpg  COCO_train2014_000000547387.jpg
COCO_train2014_000000256721.jpg  COCO_train2014_000000547388.jpg
COCO_train2014_000000256731.jpg  COCO_train2014_000000547391.jpg
COCO_train2014_000000256734.jpg  COCO_train2014_000000547411.jpg
COCO_train2014_0000002567

In [13]:
path = '/content/train2014/' # Please update with local absolute path to train2014
os.path.exists(path)

True

Get all validation images containing at least one instance of legible text.

In [14]:
image_ids = ct.getImgIds(imgIds=ct.val, catIds=[('legibility', 'legible')])

Find the total number of validation images that have legible text.

In [15]:
print(len(image_ids))

3261
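
To make the `getImgIds` filter above more concrete, here is a minimal sketch of what a legibility filter does conceptually, run on a tiny hand-made annotation table. The dictionary layout and the helper name `images_with_legible_text` are assumptions for illustration only. This is not the `coco_text` implementation, which may organize its data differently.

```python
# Toy annotations: annotation id -> record. The layout is hypothetical.
toy_anns = {
    1: {"image_id": 10, "legibility": "legible"},
    2: {"image_id": 10, "legibility": "illegible"},
    3: {"image_id": 11, "legibility": "illegible"},
    4: {"image_id": 12, "legibility": "legible"},
}


def images_with_legible_text(anns, candidate_ids):
    """Return the sorted candidate image ids that have at least one legible annotation."""
    candidates = set(candidate_ids)
    legible = {
        ann["image_id"]
        for ann in anns.values()
        if ann["legibility"] == "legible" and ann["image_id"] in candidates
    }
    return sorted(legible)


# Image 10 has one legible annotation, image 11 has none,
# and image 12 is not among the candidates.
print(images_with_legible_text(toy_anns, [10, 11]))  # prints [10]
```

An image qualifies as soon as one of its annotations is legible, which matches the "at least one instance of legible text" criterion used in this notebook.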


Check which of these image IDs actually exist in the data we downloaded, and keep only those.

In [16]:
def filename_from_image_id(image_id):
    # Note: the fixed "000000" prefix only yields a valid 12-digit name for six-digit IDs.
    return "COCO_train2014_000000" + str(image_id) + ".jpg"


final_image_ids = []

for each in image_ids:
    filename = filename_from_image_id(each)
    if os.path.exists(path + filename):
        final_image_ids.append(each)

print(len(final_image_ids))

2752
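
Only 2752 of the 3261 IDs were found. One likely cause, stated here as an assumption rather than something verified in this notebook, is that the helper above prepends a fixed run of six zeros, so it only builds a matching name for six-digit IDs. The file names in the directory listing carry a 12-digit, zero-padded ID, so a padded variant could look like the sketch below. The name `padded_filename_from_image_id` and the `split` parameter are hypothetical.

```python
def padded_filename_from_image_id(image_id, split="train2014"):
    """Build a COCO 2014 style file name with the id zero-padded to 12 digits."""
    return f"COCO_{split}_{int(image_id):012d}.jpg"


# A six-digit id gives the same name as the original helper.
print(padded_filename_from_image_id(307352))  # COCO_train2014_000000307352.jpg
# A shorter id is still padded to the full 12 digits.
print(padded_filename_from_image_id(9))       # COCO_train2014_000000000009.jpg
```

Swapping this helper in would probably change the count printed above, so the recorded output should be regenerated rather than assumed.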


Make a folder where all the temporary data files can be stored.

In [17]:
# -p creates parent directories as needed and does not fail if they already exist
!mkdir -p data-may-2020/legible-images

Save a list of the image IDs of the validation images.

In [18]:
with open('./data-may-2020/val-image-ids-final.csv', 'w') as f:
    f.write("\n".join(str(image_id) for image_id in final_image_ids))
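
The file written above holds one ID per line with no header. As a minimal sketch of how a later step might read it back, the round trip below writes a few IDs to a temporary file and restores them as integers. The helper names `write_ids` and `read_ids` and the temporary path are assumptions for illustration.

```python
import os
import tempfile


def write_ids(path, ids):
    """Write one image id per line, mirroring the cell above."""
    with open(path, "w") as f:
        f.write("\n".join(str(image_id) for image_id in ids))


def read_ids(path):
    """Read the ids back as integers, skipping any blank lines."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]


# Round trip through a throwaway file so nothing in the notebook is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "ids.csv")
    write_ids(demo_path, [307352, 404292, 358620])
    restored_ids = read_ids(demo_path)

print(restored_ids)  # prints [307352, 404292, 358620]
```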

Copy these images to a separate folder for future use.

In [19]:
from shutil import copy2

for each in final_image_ids:
    filename = filename_from_image_id(each)
    if os.path.exists(path + filename):
        copy2(path + filename, './data-may-2020/legible-images/')

In [20]:
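# `!cd` runs in a throwaway subshell and has no lasting effect; `%cd` changes the notebook's directory
%cd -q /content
# Zip the directory so that we can download it
!zip -r data-may-2020.zip /content/data-may-2020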

  adding: content/data-may-2020/ (stored 0%)
  adding: content/data-may-2020/legible-images/ (stored 0%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000307352.jpg (deflated 0%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000404292.jpg (deflated 0%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000358620.jpg (deflated 10%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000366984.jpg (deflated 1%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000214478.jpg (deflated 0%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000498425.jpg (deflated 0%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000435575.jpg (deflated 0%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000510290.jpg (deflated 0%)
  adding: content/data-may-2020/legible-images/COCO_train2014_000000357994.jpg (deflated 0%)
  adding: content/data-may-2020/legible-images/COCO_train

Now please download the ***data-may-2020.zip*** file and store it locally for further use.
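
Because the `zip` command was given an absolute path, every entry in the archive is stored under a leading `content/` folder, as the output shows. If that prefix is unwanted, one alternative is to build the archive from Python with `shutil.make_archive` and paths relative to the parent directory. The sketch below runs on a throwaway directory. The helper name `zip_directory` and the `data-demo` folder are assumptions for illustration.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(parent_dir, folder_name, archive_base):
    """Zip parent_dir/folder_name into archive_base + '.zip', storing paths relative to parent_dir."""
    return shutil.make_archive(
        archive_base, "zip", root_dir=parent_dir, base_dir=folder_name
    )


with tempfile.TemporaryDirectory() as work_dir:
    # Build a small stand-in for the data folder.
    images_dir = os.path.join(work_dir, "data-demo", "legible-images")
    os.makedirs(images_dir)
    with open(os.path.join(images_dir, "placeholder.jpg"), "wb") as f:
        f.write(b"not a real image")

    archive_path = zip_directory(
        work_dir, "data-demo", os.path.join(work_dir, "data-demo")
    )
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

# Entries start at the folder name rather than at the absolute path.
print(archived_names)
```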