## Creating the X and y arrays for cnn.ipynb or similar code ##

After running my versions of either fetch_sdss.ipynb or fetch_sdss.py, you will need to build the X and y arrays from the resulting data.  First specify the folder that contains the .npy files created by the fetch_sdss code (by default, I assume it is the folder named "result", as specified in fetch_sdss).  The notebook then iterates over those files, extracts the redshift from each .npy file name into the 'y' array, and copies the image data stored in each file into the 'X' array.

The resulting X and y arrays are saved to a separate directory (`savedir` below) that is **NOT** inside of "result".

#### Importing needed libraries ####

In [1]:
import os
import numpy as np
import sys
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

#### Importing future to make up for any python2/3 discrepancies ####
This *shouldn't* be an issue.

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

### *SPECIFY THE IMAGE AND ARRAY PATH HERE* ###

In [3]:
path = "../data/results/"
savedir = "../data/split/"
traindir = "../data/train/"
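`np.save` does not create missing folders, so the output directories need to exist before the later cells run. A minimal sketch of one way to guarantee that is below; `ensure_dir` is a hypothetical helper name (not part of the original notebook), and it is demonstrated on a temporary location rather than on the real `savedir`.

```python
import os
import tempfile

def ensure_dir(directory):
    """Create `directory` (and any missing parents) if it does not exist yet."""
    if not os.path.isdir(directory):
        os.makedirs(directory)
    return directory

# Demonstrated on a temporary location; in the notebook you would pass savedir.
base = tempfile.mkdtemp()
demo_savedir = ensure_dir(os.path.join(base, "data", "split"))
print(os.path.isdir(demo_savedir))
```

The explicit `os.path.isdir` check is used instead of `exist_ok=True` so the same code runs on both Python 2 and Python 3.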

### Get the list of all of the files you need ###

In [4]:
npfiles = os.listdir(path)
print(len(npfiles))

45722


### Initializing the numpy arrays ###

In [5]:
X = np.zeros((len(npfiles), 5, 48, 48))
y = np.zeros((len(npfiles), 1))
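Before allocating, it can help to estimate how much memory `X` will need. The sketch below is only arithmetic on the shape used above. The `float32` comparison is my suggestion rather than something the original notebook does, and `array_nbytes` is a hypothetical helper name.

```python
import numpy as np

def array_nbytes(shape, dtype=np.float64):
    """Return the number of bytes a dense array of this shape and dtype occupies."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize

shape = (45722, 5, 48, 48)  # shape of X from the cell above

bytes_f64 = array_nbytes(shape, np.float64)  # np.zeros defaults to float64
bytes_f32 = array_nbytes(shape, np.float32)  # assumption: float32 precision is enough

print("float64: %.2f GB" % (bytes_f64 / 1e9))
print("float32: %.2f GB" % (bytes_f32 / 1e9))
```

With the default `float64` dtype, `X` alone takes roughly 4.2 GB, which is why the later cells split it up and free memory as they go.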

### Filling the arrays ###
First all of the galaxy images are loaded into the numpy array made above, and the redshift information is extracted from each file name.  These two HUGE arrays are then split into 200 (or so) roughly equal sub-arrays.  Each X/y pair of sub-arrays is then run through train_test_split from scikit-learn, which turns the pair into four arrays (X_train, X_test, y_train, y_test), so 200 splits produce 800 files.  Each of these new sub-arrays is saved via numpy.save so that it can be stitched back together when you want to do deep learning on it over in cnn.ipynb.

In [6]:
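for s, file in enumerate(npfiles):
    image = np.load(os.path.join(path, file))
    # Copy the image data for file number s into X
    X[s] = image
    # The first 10 characters of the file name hold the redshift
    y[s] = float(file[:10])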
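The loop relies on the first ten characters of every file name being a number that `float` can parse. A small sketch of that step in isolation is below. The file names are hypothetical (the real naming scheme comes from fetch_sdss), and `redshift_from_name` is a helper name I made up for illustration.

```python
def redshift_from_name(filename, width=10):
    """Parse the redshift stored in the first `width` characters of a file name."""
    return float(filename[:width])

# Hypothetical file names; the real naming scheme comes from fetch_sdss.
example_names = ["0.12345678_example.npy", "0.08765432_example.npy"]

for index, name in enumerate(example_names):
    print(index, redshift_from_name(name))
```

If a stray file in the folder does not start with a number, `float` raises a `ValueError`, so it is worth making sure the folder only contains the fetch_sdss output.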
    

### Specify the number of split arrays here ###
The larger the corpus, the bigger the number should be (currently 200).

In [7]:
# Now we split the arrays into 200 roughly equal chunks
X_split = np.array_split(X, 200)
y_split = np.array_split(y, 200)

# Freeing up memory
%xdel X
%xdel y
%xdel image
%xdel file

In [8]:
print(len(X_split))
print(X_split[0].shape)

200
(229, 5, 48, 48)
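To see where the 229 comes from, the sketch below applies `np.array_split` to a plain index vector of the same length, so it runs without loading any images. The variable names are mine, chosen for illustration.

```python
import numpy as np

n_samples, n_chunks = 45722, 200

# Split only an index vector so the example stays small.
chunks = np.array_split(np.arange(n_samples), n_chunks)
sizes = [len(c) for c in chunks]

print(sizes[0], sizes[-1])
print(sizes.count(229), sizes.count(228))
```

Because 45722 is not divisible by 200, `np.array_split` gives the first 122 chunks 229 entries each and the remaining 78 chunks 228 entries each, which is why `X_split[0]` has shape (229, 5, 48, 48).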


In [9]:
for t in range(len(X_split)):
    # Split chunk t into train and test pieces (80/20, fixed seed);
    # these are the arrays that get concatenated again later
    X_train, X_test, y_train, y_test = train_test_split(X_split[t], y_split[t], test_size=0.2, random_state=24)

    # np.save appends ".npy" to each file name
    np.save(savedir + "X_train" + str(t), X_train)
    np.save(savedir + "X_test" + str(t), X_test)
    np.save(savedir + "y_train" + str(t), y_train)
    np.save(savedir + "y_test" + str(t), y_test)

    print(t)

    # Freeing up memory
    %xdel X_train
    %xdel X_test
    %xdel y_train
    %xdel y_test

0
1
2
...
199


In [10]:
%xdel X_split
%xdel y_split

In [11]:
trfiles = os.listdir(savedir)
print(len(trfiles))
print(trfiles)

800
['y_test171.npy', 'y_test11.npy', 'X_train186.npy', 'X_train29.npy', 'y_test1.npy', 'X_train111.npy', 'y_test104.npy', 'X_test55.npy', 'y_test146.npy', 'X_train132.npy', 'y_train172.npy', 'X_test0.npy', 'y_test101.npy', 'X_test151.npy', 'X_train83.npy', 'y_test37.npy', 'y_test108.npy', 'y_train65.npy', 'y_train156.npy', 'y_test83.npy', 'X_test122.npy', 'X_train75.npy', 'X_train14.npy', 'y_train120.npy', 'y_test13.npy', 'y_test159.npy', 'y_train99.npy', 'X_test128.npy', 'y_test24.npy', 'X_test77.npy', 'y_test138.npy', 'X_test171.npy', 'y_test124.npy', 'X_train33.npy', 'y_train83.npy', 'X_test114.npy', 'y_test190.npy', 'X_test45.npy', 'X_train99.npy', 'y_test144.npy', 'X_train159.npy', 'X_train45.npy', 'X_train3.npy', 'y_train150.npy', 'X_test145.npy', 'X_train27.npy', 'y_train113.npy', 'X_test192.npy', 'X_test179.npy', 'X_test170.npy', 'y_test6.npy', 'y_test184.npy', 'X_train1.npy', 'y_train57.npy', 'y_test53.npy', 'X_test87.npy', 'y_train169.npy', 'X_test5.npy', 'X_train69.npy', 'X
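`os.listdir` returns the files in arbitrary order, as the listing above shows, so stitching the pieces back together is safer when done by chunk index than by listing order. Below is a minimal sketch of that idea on small synthetic data. The `stitch` helper and the toy arrays are assumptions for illustration and are not taken from cnn.ipynb.

```python
import os
import tempfile

import numpy as np

def stitch(directory, prefix, n_chunks):
    """Load prefix0.npy ... prefix{n_chunks-1}.npy in index order and concatenate them."""
    chunks = [np.load(os.path.join(directory, "%s%d.npy" % (prefix, i)))
              for i in range(n_chunks)]
    return np.concatenate(chunks, axis=0)

# Small synthetic stand-in for the real chunks (assumption: 3 chunks of toy data).
demo_dir = tempfile.mkdtemp()
toy = np.arange(7 * 2).reshape(7, 2)
for i, chunk in enumerate(np.array_split(toy, 3)):
    np.save(os.path.join(demo_dir, "X_train%d" % i), chunk)

restored = stitch(demo_dir, "X_train", 3)
print(restored.shape)
```

Loading `X_train` and `y_train` with the same index loop keeps each image lined up with its redshift.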

In [13]:
X_ex = np.load(savedir+"X_test.npz")
namelist = X_ex.zip.namelist()
X_ex.zip.extract(namelist[0])
X_memmap = np.load(namelist[0], mmap_mode='r+')
assert np.all(X==X_memmap[:])

In [15]:
X_memmap.shape

(45722, 5, 48, 48)

In [17]:
y_test = np.load(savedir+"y_array.npy")
y_test.shape

(45722, 1)

In [None]:
print(X_test.shape)
print(y_test.shape)

print(X_train.shape)
print(y_train.shape)