
#<center>**Load the data**</center>

The data comes from the [Quick, Draw! dataset](https://github.com/googlecreativelab/quickdraw-dataset).

The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located. You can browse the recognized drawings on [quickdraw.withgoogle.com/data](https://quickdraw.withgoogle.com/data).

For our project we use a preprocessed version of the dataset in .npy format. Each file holds more than 100,000 images of a single category. The full dataset is 37 GB, which is very difficult to work with. Therefore, we reduced the number of images per category to 10,000 (reduce_dataset.py), bringing the total size down to 2.5 GB. Even so, the 25.5 GB of RAM available in Google Colab wasn't enough to process all 345 categories, so we only considered 100.
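The reduction step itself is not shown in this notebook. As a rough sketch of what trimming one category file could look like (the function name `reduce_category` and the demo paths are hypothetical and are not taken from reduce_dataset.py), the file can be opened memory-mapped so that only the rows being kept are read into RAM:

```python
import os
import tempfile

import numpy as np


def reduce_category(src_path, dst_path, max_images=10000):
    """Copy at most `max_images` rows of one .npy file into a new file."""
    # mmap_mode="r" maps the file instead of loading the whole array into RAM
    data = np.load(src_path, mmap_mode="r")
    # np.array(...) materializes only the rows that are kept
    reduced = np.array(data[:max_images])
    np.save(dst_path, reduced)
    return reduced.shape


# Demo on a small synthetic file standing in for a real category file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "full.npy")
    dst = os.path.join(tmp, "reduced.npy")
    np.save(src, np.zeros((50, 784), dtype=np.uint8))
    demo_shape = reduce_category(src, dst, max_images=10)

print(demo_shape)  # (10, 784)
```

For the real data, `max_images` would stay at 10000 and the function would be called once per category file.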

The next step is to load the data and pickle it.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

import numpy as np
import os
import pickle

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
files = os.listdir("/content/gdrive/My Drive/CTPproject/data")
x = []
x_load = []
y = []
y_load = []
print(files)

['ambulance.npy', 'alarm clock.npy', 'airplane.npy', 'bandage.npy', 'banana.npy', 'backpack.npy', 'axe.npy', 'asparagus.npy', 'arm.npy', 'apple.npy', 'ant.npy', 'angel.npy', 'beard.npy', 'bear.npy', 'beach.npy', 'bathtub.npy', 'bat.npy', 'basketball.npy', 'basket.npy', 'baseball.npy', 'baseball bat.npy', 'barn.npy', 'blueberry.npy', 'blackberry.npy', 'birthday cake.npy', 'bird.npy', 'binoculars.npy', 'bicycle.npy', 'bench.npy', 'belt.npy', 'bee.npy', 'bed.npy', 'bowtie.npy', 'boomerang.npy', 'book.npy', 'bucket.npy', 'broom.npy', 'broccoli.npy', 'bridge.npy', 'camera.npy', 'camel.npy', 'calendar.npy', 'calculator.npy', 'cake.npy', 'cactus.npy', 'butterfly.npy', 'bush.npy', 'bus.npy', 'bracelet.npy', 'bulldozer.npy', 'ceiling fan.npy', 'cat.npy', 'castle.npy', 'carrot.npy', 'car.npy', 'canoe.npy', 'cannon.npy', 'candle.npy', 'campfire.npy', 'camouflage.npy', 'compass.npy', 'coffee cup.npy', 'cloud.npy', 'clock.npy', 'clarinet.npy', 'circle.npy', 'church.npy', 'chandelier.npy', 'chair.np

Let's see what some of the images look like:

In [3]:
# Test print: what does the data look like?
file = "/content/gdrive/My Drive/CTPproject/data/" + files[1]
# np.load returns a 2-D array with one row per image;
# each row holds the 784 pixel values of a flattened 28x28 bitmap
x_test = np.load(file)
test_print = [x_test[i].reshape(28, 28) for i in range(10)]
# Character ramp indexed by pixel intensity (0..255 scaled to 0..len(chars)-1)
chars = 'x%#*+=-:. '
scale = (len(chars) - 1) / 255.
print()
for image in test_print:
    for row in image:
        print(' '.join(chars[int(value * scale)] for value in row))
    print()


x x x x x x x x x x x x x x x x x x x x x x x x x x x x
x x x x x x x x x x x x x x x x x x x * + * * # x x x x
x x x x * .   . * x x x x x x x x % :           = x x x
x x x * . - * - . # x x x x x x x : : * x x x = . x x x
x x # . - x x x - . x x x # * + = * x . : x x %   * x x
x x + . x x x x +   - - .     . .   . + . : x x . + x x
x x #   # x x * . .   : = + % x x % : . * . = %   * x x
x x x . + x * . . . # x x x x x x x x - . =   +   # x x
x x x = . * . . . - . : : % x x x x x x : . : .   % x x
x x x x . . : . % : . . = x x x x x x x x . - * * x x x
x x x x + - . = x : :   * x x x x x x x x + . # x x x x
x x x x x + . x x . + = . # x x x x x x x x : : x x x x
x x x x x + . x x * % x : . % x x x x x x x *   % x x x
x x x x x + . x x x x x x . - x x x x x x x x . = x x x
x x x x x *   x x x x x x *   * x x x % : # x = . x x x
x x x x x *   % x x x x x x - . # x x x - . * #   % x x
x x x x x #   # x x x x x x +   . = + + * -   =   # x x
x x x x x %   * x x x x x x % . . : .       .  

In [0]:
def load_data():
    '''Read the first 100 files in the data directory; return per-category lists of scaled images and integer labels.'''
    x_load.clear()
    y_load.clear()
    for num, file in enumerate(files[:100]):
        file = "/content/gdrive/My Drive/CTPproject/data/" + file
        x = np.load(file)
        x = x[:10000, :]  # keep only the first 10000 images of each category
        x = x / 255.      # scale pixels to [0, 1] after slicing, so only the kept rows become float
        x_load.append(x)
        # label every image with the index of its category file
        y = np.full((x.shape[0], 1), num)
        y_load.append(y)
        print(num)

    return x_load, y_load
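Because each label is just the position of a file in `files`, and `os.listdir` does not guarantee any particular order, it may help to store the label-to-name mapping next to the data. A minimal sketch, assuming the category name is the file name without its `.npy` extension (`build_label_map` and `example_files` are hypothetical names used only for illustration):

```python
import os


def build_label_map(file_names, num_classes=100):
    """Map integer label -> category name, following the order of `file_names`."""
    return {
        idx: os.path.splitext(name)[0]
        for idx, name in enumerate(file_names[:num_classes])
    }


# Stand-in for the first few entries of the `files` list printed earlier
example_files = ['ambulance.npy', 'alarm clock.npy', 'airplane.npy']
label_map = build_label_map(example_files)
print(label_map[1])  # alarm clock
```

In the notebook, the same function could be called as `build_label_map(files)` and the result pickled together with the images and labels.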

In [5]:
images, labels = load_data()
images = np.array(images)
labels = np.array(labels)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99


In [6]:
print(images.shape)
print(labels.shape)

(100, 10000, 784)
(100, 10000, 1)
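The shapes above make it possible to estimate how much memory the image array takes up. The sketch below assumes the scaled pixels are stored as float64 (which is what dividing by `255.` produces in NumPy) and, for comparison, that the raw files use one byte per pixel. The notebook does not state the input dtype, so the uint8 figure is an assumption.

```python
import numpy as np

n_categories, n_images, n_pixels = 100, 10000, 784
n_values = n_categories * n_images * n_pixels

# Bytes needed under each dtype
bytes_f64 = n_values * np.dtype(np.float64).itemsize
bytes_u8 = n_values * np.dtype(np.uint8).itemsize  # assumed raw dtype

gib_f64 = bytes_f64 / 2**30
gib_u8 = bytes_u8 / 2**30
print(f"float64: {gib_f64:.2f} GiB, uint8: {gib_u8:.2f} GiB")
# float64: 5.84 GiB, uint8: 0.73 GiB
```

Converting the list of per-category arrays with `np.array(images)` creates a second copy, so peak usage while that cell runs is likely around twice the float64 figure.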


Flatten the data by merging the category and image axes, so that each row is one image:

In [0]:
images = images.reshape(-1, images.shape[2])
labels = labels.reshape(-1, labels.shape[2])

In [8]:
print(images.shape)
print(labels.shape)

(1000000, 784)
(1000000, 1)
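Reshaping this way keeps row `k` of the images paired with row `k` of the labels, but the rows remain grouped by category. The toy example below (3 categories, 2 images each, 4 pixels per image, all made-up values) checks the pairing. It also shows one way to shuffle both arrays with a shared permutation, in case a later training step expects mixed classes. The shuffle is an illustration, not something this notebook does.

```python
import numpy as np

# Tiny stand-in: pixel values encode the category as value // 8
images_small = np.arange(3 * 2 * 4).reshape(3, 2, 4)
labels_small = np.repeat(np.arange(3), 2).reshape(3, 2, 1)

flat_images = images_small.reshape(-1, images_small.shape[2])
flat_labels = labels_small.reshape(-1, labels_small.shape[2])
print(flat_images.shape, flat_labels.shape)  # (6, 4) (6, 1)
print(flat_labels.ravel())                   # [0 0 1 1 2 2]

# Apply the same permutation to both arrays so pairs stay aligned
rng = np.random.default_rng(0)
perm = rng.permutation(len(flat_images))
shuffled_images = flat_images[perm]
shuffled_labels = flat_labels[perm]
```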


Pickle the data:

In [0]:

with open("/content/gdrive/My Drive/CTPproject/images", "wb") as f:
    pickle.dump(images, f, protocol=4)
with open("/content/gdrive/My Drive/CTPproject/labels", "wb") as f:
    pickle.dump(labels, f, protocol=4)
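To read the arrays back in another notebook, the files are opened in binary mode and passed to `pickle.load`. The round trip below is a small sketch that uses a temporary directory and a made-up array rather than the Drive paths. The helper names `save_pickle` and `load_pickle` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np


def save_pickle(obj, path):
    # protocol=4 matches the cell above and supports very large objects
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=4)


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "images")
    demo = np.arange(12, dtype=np.float64).reshape(3, 4)
    save_pickle(demo, demo_path)
    restored = load_pickle(demo_path)

print(np.array_equal(demo, restored))  # True
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.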