# Cloud APIs for Computer Vision: Up and Running in 15 Minutes

This code is part of [Chapter 8- Cloud APIs for Computer Vision: Up and Running in 15 Minutes ](https://learning.oreilly.com/library/view/practical-deep-learning/9781492034858/ch08.html).

## Get MSCOCO validation image ids with legible text

We will develop a dataset of images from the MSCOCO dataset that contain at least a single instance of legible text and are in the validation split.

In order to do this, we first need to download `cocotext.v2.json` from https://bgshih.github.io/cocotext/.

In [1]:
!wget -nc -q -O tmp.zip https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip && unzip -n tmp.zip && rm tmp.zip

Archive:  tmp.zip


Let's verify that the file has been downloaded and that it exists.

In [2]:
import os

os.path.isfile("./cocotext.v2.json")

True

We also need to download the `coco_text.py` file from the COCO-Text repository from http://vision.cornell.edu/se3/coco-text/

In [3]:
!wget -nc https://raw.githubusercontent.com/bgshih/coco-text/master/coco_text.py

File ‘coco_text.py’ already there; not retrieving.



In [4]:
import coco_text

In [5]:
# Load the COCO text json file
ct = coco_text.COCO_Text("./cocotext.v2.json")

loading annotations into memory...
0:00:01.462544
creating index...
index created!


In [6]:
# Find the total number of images in validation set
print(len(ct.val))

10000


Add the paths to the `train2014` directory downloaded from the [MSCOCO website](http://cocodataset.org/#download).

In [7]:
path = <PATH_TO_IMAGES> # Please update with local absolute path to train2014
os.path.exists(path) 

True

Get all images containing at least one instance of legible text

In [8]:
image_ids = ct.getImgIds(imgIds=ct.val, catIds=[("legibility", "legible")])

Find total number of validation images which have legible text

In [9]:
print(len(image_ids))

3261


In the data we downloaded, make sure all the image IDs exist.

In [10]:
def filename_from_image_id(image_id):
    return "COCO_train2014_000000" + str(image_id) + ".jpg"


final_image_ids = []

for each in image_ids:
    filename = filename_from_image_id(each)
    if os.path.exists(path + filename):
        final_image_ids.append(each)

print(len(final_image_ids))

2752


Make a folder where all the temporary data files can be stored

In [11]:
!mkdir data-may-2020
!mkdir data-may-2020/legible-images

mkdir: cannot create directory ‘data-may-2020’: File exists
mkdir: cannot create directory ‘data-may-2020/legible-images’: File exists


Save a list of the image ids of the validation images

In [12]:
with open("./data-may-2020/val-image-ids-final.csv", "w") as f:
    f.write("\n".join(str(image_id) for image_id in final_image_ids))

Move these images to a separate folder for future use.

In [13]:
from shutil import copy2

for each in final_image_ids:
    filename = filename_from_image_id(each)
    if os.path.exists(path + filename):
        copy2(path + filename, "./data-may-2020/legible-images/")