## initial-exploration

This notebook is a quick initial EDA of the Google Open Images V4 dataset. Here I'm mainly downloading and previewing the metadata files, poking around in them a bit, getting a feel for how to work with them, and saving them to local disk.

In [3]:
import pandas as pd

train_boxed = pd.read_csv(
    "https://storage.googleapis.com/openimages/2018_04/train/train-annotations-bbox.csv"
)

### Boxed images, train set, bounding boxes

In [4]:
train_boxed.head()

Unnamed: 0,ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax,IsOccluded,IsTruncated,IsGroupOf,IsDepiction,IsInside
0,000002b66c9c498e,xclick,/m/01g317,1,0.0125,0.195312,0.148438,0.5875,0,1,0,0,0
1,000002b66c9c498e,xclick,/m/01g317,1,0.025,0.276563,0.714063,0.948438,0,1,0,0,0
2,000002b66c9c498e,xclick,/m/01g317,1,0.151562,0.310937,0.198437,0.590625,1,0,0,0,0
3,000002b66c9c498e,xclick,/m/01g317,1,0.25625,0.429688,0.651563,0.925,1,0,0,0,0
4,000002b66c9c498e,xclick,/m/01g317,1,0.257812,0.346875,0.235938,0.385938,1,0,0,0,0


`xclick` marks manually drawn boxes; `activemil` marks boxes produced by an augmented image labeling technique and then human-verified to be accurate to IoU >= 0.7.
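
For reference, IoU (intersection over union) is the area where two boxes overlap divided by the area of their union. A minimal sketch, assuming boxes are given as `(xmin, xmax, ymin, ymax)` tuples in the same order as the columns above; the `iou` helper is hypothetical and not part of any dataset tooling:

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, xmax, ymin, ymax) boxes."""
    ax0, ax1, ay0, ay1 = box_a
    bx0, bx1, by0, by1 = box_b

    # Width and height of the overlap; clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    # Degenerate (zero-area) boxes get an IoU of 0 rather than a division error.
    return intersection / union if union > 0 else 0.0
```

Under this definition, a verified box at the 0.7 threshold shares at least 70% of the combined area with the reference box.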

In [6]:
train_boxed['Source'].value_counts()

xclick       13050532
activemil     1559697
Name: Source, dtype: int64

In [8]:
train_boxed['LabelName'].value_counts().head()

/m/09j2d     1438128
/m/04yx4     1418594
/m/07j7r     1051344
/m/0dzct     1037710
/m/01g317    1034721
Name: LabelName, dtype: int64

For the attribute flags below (`IsOccluded` through `IsInside`), 1 means present, 0 means not present, and -1 means unknown. `Confidence` is always 1 in this file.

In [10]:
train_boxed['Confidence'].value_counts()

1    14610229
Name: Confidence, dtype: int64

In [11]:
train_boxed['IsOccluded'].value_counts()

 1    9651132
 0    4937226
-1      21871
Name: IsOccluded, dtype: int64

In [12]:
train_boxed['IsTruncated'].value_counts()

 0    10922480
 1     3665878
-1       21871
Name: IsTruncated, dtype: int64

In [13]:
train_boxed['IsGroupOf'].value_counts()

 0    13713801
 1      874557
-1       21871
Name: IsGroupOf, dtype: int64

In [14]:
train_boxed['IsDepiction'].value_counts()

 0    13791999
 1      796359
-1       21871
Name: IsDepiction, dtype: int64

In [15]:
train_boxed['IsInside'].value_counts()

 0    14552771
 1       35587
-1       21871
Name: IsInside, dtype: int64
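
Because -1 marks an unknown value rather than a real category, it can be convenient to treat it as missing before computing rates. A small sketch on a toy frame (the rows are made up; only the column names come from the file above):

```python
import pandas as pd

# Toy stand-in for train_boxed; values are illustrative only.
toy = pd.DataFrame({
    "IsOccluded": [1, 0, -1, 1],
    "IsTruncated": [0, 0, -1, 1],
})

flag_cols = ["IsOccluded", "IsTruncated"]

# Mask the -1 "unknown" entries as missing, so mean() covers known rows only.
known = toy[flag_cols].mask(toy[flag_cols] == -1)

# Share of known rows where each attribute is present.
present_rate = known.mean()
```

The same two calls should carry over to the full frame by extending `flag_cols` to all five attribute columns.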

In [16]:
train_annotations = pd.read_csv(
    "https://storage.googleapis.com/openimages/2018_04/train/train-annotations-human-imagelabels-boxable.csv"
)

### Boxed images, train set, labels

In [19]:
train_annotations.head()

Unnamed: 0,ImageID,Source,LabelName,Confidence
0,000002b66c9c498e,verification,/m/014j1m,0
1,000002b66c9c498e,verification,/m/014sv8,1
2,000002b66c9c498e,verification,/m/01599,0
3,000002b66c9c498e,verification,/m/015p6,0
4,000002b66c9c498e,verification,/m/015x4r,0


`verification` labels were verified by in-house labelers; `crowdsource-verification` labels were verified through the Crowdsource app.

In [18]:
train_annotations['Source'].value_counts()

verification                8659710
crowdsource-verification     337085
Name: Source, dtype: int64

In [21]:
train_annotations['LabelName'].value_counts().head()

/m/01g317    839436
/m/09j2d     675650
/m/04yx4     472414
/m/05s2s     436288
/m/07j7r     423757
Name: LabelName, dtype: int64

Confidences are 0 for confirmed negative and 1 for confirmed positive.

In [23]:
train_annotations['Confidence'].value_counts()

1    6622219
0    2374576
Name: Confidence, dtype: int64

Interestingly, this means that some images have a very large number of positive or negative labels on record.

In [26]:
train_annotations['ImageID'].value_counts().head()

000020780ccee28d    140
000002b97e5471a0    116
000004f4400f6ec5     98
0000333f08ced1cd     97
0000071d71a0a6f6     93
Name: ImageID, dtype: int64
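
To see how those labels split between positive and negative per image, one option is a cross-tabulation of `ImageID` against `Confidence`. A sketch on toy data (the IDs and labels below are made up):

```python
import pandas as pd

# Toy stand-in for train_annotations; values are illustrative only.
toy = pd.DataFrame({
    "ImageID": ["a", "a", "a", "b", "b"],
    "LabelName": ["/m/1", "/m/2", "/m/3", "/m/1", "/m/2"],
    "Confidence": [1, 0, 0, 1, 1],
})

# Rows are images, columns are the Confidence values (0 = negative, 1 = positive).
label_counts = pd.crosstab(toy["ImageID"], toy["Confidence"])
```

Combinations that never occur are filled with 0, so every image gets both a negative and a positive count.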

### Memory usage

Still pretty small footprints.

In [27]:
train_boxed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14610229 entries, 0 to 14610228
Data columns (total 13 columns):
ImageID        object
Source         object
LabelName      object
Confidence     int64
XMin           float64
XMax           float64
YMin           float64
YMax           float64
IsOccluded     int64
IsTruncated    int64
IsGroupOf      int64
IsDepiction    int64
IsInside       int64
dtypes: float64(4), int64(6), object(3)
memory usage: 1.4+ GB


In [28]:
train_annotations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8996795 entries, 0 to 8996794
Data columns (total 4 columns):
ImageID       object
Source        object
LabelName     object
Confidence    int64
dtypes: int64(1), object(3)
memory usage: 274.6+ MB
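
The `+` in the memory figures means pandas did not count the string contents of the `object` columns, so the real footprint is larger than reported. A sketch of how to get the full number and shrink it, on a toy frame (made-up rows; with the real data the same calls would apply to `train_boxed`):

```python
import pandas as pd

# Toy stand-in with heavily repeated strings, like Source and LabelName.
toy = pd.DataFrame({
    "Source": ["xclick", "activemil"] * 5000,
    "LabelName": ["/m/01g317", "/m/09j2d"] * 5000,
    "Confidence": [1] * 10000,
})

# deep=True measures the actual string data instead of a shallow estimate.
before = toy.memory_usage(deep=True).sum()

# Columns with few distinct values compress well as categoricals,
# and small integer flags fit in int8.
compact = toy.astype({
    "Source": "category",
    "LabelName": "category",
    "Confidence": "int8",
})
after = compact.memory_usage(deep=True).sum()
```

`ImageID` has far more distinct values, so whether a categorical helps there is something to measure rather than assume.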


### Image IDs

In [29]:
image_ids = pd.read_csv(
    "https://storage.googleapis.com/openimages/2018_04/train/train-images-boxable-with-rotation.csv"
)

In [31]:
image_ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1743042 entries, 0 to 1743041
Data columns (total 12 columns):
ImageID               object
Subset                object
OriginalURL           object
OriginalLandingURL    object
License               object
AuthorProfileURL      object
Author                object
Title                 object
OriginalSize          int64
OriginalMD5           object
Thumbnail300KURL      object
Rotation              float64
dtypes: float64(1), int64(1), object(10)
memory usage: 159.6+ MB


In [30]:
image_ids.head()

Unnamed: 0,ImageID,Subset,OriginalURL,OriginalLandingURL,License,AuthorProfileURL,Author,Title,OriginalSize,OriginalMD5,Thumbnail300KURL,Rotation
0,4fa8054781a4c382,train,https://farm3.staticflickr.com/5310/5898076654...,https://www.flickr.com/photos/michael-beat/589...,https://creativecommons.org/licenses/by/2.0/,https://www.flickr.com/people/michael-beat/,Michael Beat,...die FNF-Kerze,4405052,KFukvivpCM5QXl5SqKe41g==,https://c1.staticflickr.com/6/5310/5898076654_...,0.0
1,b37f763ae67d0888,train,https://c1.staticflickr.com/1/67/197493648_628...,https://www.flickr.com/photos/drstarbuck/19749...,https://creativecommons.org/licenses/by/2.0/,https://www.flickr.com/people/drstarbuck/,Karen,Three boys on a hill,494555,9IzEn38GRNsVpATuv7gzEA==,https://c3.staticflickr.com/1/67/197493648_628...,0.0
2,7e8584b0f487cb9e,train,https://c7.staticflickr.com/8/7056/7143870979_...,https://www.flickr.com/photos/circasassy/71438...,https://creativecommons.org/licenses/by/2.0/,https://www.flickr.com/people/circasassy/,CircaSassy,A Christmas carol and The cricket on the heart...,2371584,3hQwu0iSzY1VIoXiwp0/Mg==,https://c7.staticflickr.com/8/7056/7143870979_...,0.0
3,86638230febe21c4,train,https://farm5.staticflickr.com/5128/5301868579...,https://www.flickr.com/photos/ajcreencia/53018...,https://creativecommons.org/licenses/by/2.0/,https://www.flickr.com/people/ajcreencia/,Alex,Abbey and Kenny,949267,onB+rCZnGQg5PRX7xOs18Q==,https://c4.staticflickr.com/6/5128/5301868579_...,
4,249086e72671397d,train,https://c6.staticflickr.com/4/3930/15342460029...,https://www.flickr.com/photos/codnewsroom/1534...,https://creativecommons.org/licenses/by/2.0/,https://www.flickr.com/people/codnewsroom/,COD Newsroom,Suburban Law Enforcement Academy 20th Annivers...,6541758,MjpaAVbMAWbCusSaxI1D7w==,https://c1.staticflickr.com/4/3930/15342460029...,0.0


### Class labels

In [20]:
class_names = pd.read_csv(
    "https://storage.googleapis.com/openimages/2018_04/class-descriptions-boxable.csv", 
    header=None
)
class_names.columns = ['LabelID', 'LabelName']

In [25]:
class_names.head()

Unnamed: 0,LabelID,LabelName
0,/m/011k07,Tortoise
1,/m/011q46kg,Container
2,/m/012074,Magpie
3,/m/0120dh,Sea turtle
4,/m/01226z,Football
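
With the class descriptions loaded, the opaque `/m/...` IDs from the earlier label counts can be mapped to readable names. A sketch on toy data, reusing the `LabelID`/`LabelName` column names assigned above; the counts are made up:

```python
import pandas as pd

# Toy stand-ins: label counts keyed by ID, and the ID-to-name table.
counts = pd.Series({"/m/011k07": 3, "/m/012074": 1}, name="n_boxes")
class_names = pd.DataFrame({
    "LabelID": ["/m/011k07", "/m/012074"],
    "LabelName": ["Tortoise", "Magpie"],
})

# Build an ID -> name lookup and relabel the index of the counts.
id_to_name = dict(zip(class_names["LabelID"], class_names["LabelName"]))
named_counts = counts.rename(index=id_to_name)
```

IDs missing from the lookup are left unchanged by `rename`, which makes gaps in the class table easy to spot.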


### Write to disk

This is optional, but now that we've examined the metadata it's helpful to write it to disk. We can then pass the file paths to the `openimager` downloader, which would otherwise have to re-download these files itself, and that is hella slow.

In [75]:
mkdir -p ../data/metadata/

In [76]:
train_boxed.to_csv("../data/metadata/train-annotations-bbox.csv")

In [78]:
train_annotations.to_csv("../data/metadata/train-annotations-human-imagelabels-boxable.csv")

In [79]:
image_ids.to_csv("../data/metadata/train-images-ids.csv")

In [27]:
class_names.to_csv("../data/metadata/image-class-names.csv")
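
When reading these files back, keep in mind that `to_csv` writes the row index as an extra unnamed column by default, which comes back as `Unnamed: 0` unless `index=False` is passed on write or `index_col=0` on read. A small in-memory sketch of the difference (toy data, no files touched):

```python
import io

import pandas as pd

toy = pd.DataFrame({"LabelID": ["/m/011k07"], "LabelName": ["Tortoise"]})

# Default: the index is written as a leading, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reread_default = pd.read_csv(with_index)

# index=False: the file holds only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)
```

Which form a downstream tool expects is worth checking against its own documentation; this only shows what pandas produces.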