# Script for generating the training and testing data sets for desirable galaxy types

This script follows **02. Reading and Processing SDSS data.ipynb** (in short, **02**) and uses the data files generated in that notebook. It can be run locally once the data sets produced by **02** on the *lesta* cluster have been copied over. The aim of this script is to generate the training and testing data sets that are used to produce deep learning models for the three-category classification of galaxy types.

1. Defining the input parameters
2. Preliminary preparation of the training and testing sets
3. Generation of the training and testing data sets

**Date**: 11th Nov, 2019 <br>
**Author**: Soumya Shreeram <br>
**Guidance from**: Anand Raichoor <br>
**Script adapted from:** S. Ben Nejma


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20})
import numpy as np
from numpy.lib.format import open_memmap
import os, sys
import subprocess
import random

## 1. Defining the input parameters

In [3]:
# setting the right path for the directory with the data
current_dir = os.getcwd()
data_dir = os.path.join(current_dir, "Data_files")

# ratio with which the data is separated for training and testing
ratio = 0.7

In [4]:
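def writeOutputToFile(input_name, shape_arr, in_dtype):
    """
    Write to a .npy file as a memory-mapped array
    @param input_name :: array name (used as the file name, without extension)
    @param shape_arr :: shape of the array to be memory-mapped
    @param in_dtype :: data type of the array elements, e.g. 'float32'
    
    @return output_arr :: the memory-mapped array
    """
    # build the path from data_dir (defined above) rather than a hard-coded string
    filename = os.path.join(data_dir, input_name + '.npy')
    output_arr = open_memmap(filename, dtype=in_dtype, mode='w+', shape=shape_arr)
    return output_arr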
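
As a minimal sketch of what the memory-mapped files above provide, the following round trip creates a small `.npy` file with `open_memmap`, fills it, and reopens it lazily with `np.load(..., mmap_mode='r')`. The temporary directory and the names `demo_path`, `demo` and `reloaded` are assumptions made only for this illustration; they are not part of the original script.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# hypothetical scratch location, used only for this illustration
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.npy")

# create a memory-mapped .npy file of a known shape and dtype
demo = open_memmap(demo_path, dtype="float32", mode="w+", shape=(4, 3))
demo[:] = np.arange(12, dtype="float32").reshape(4, 3)
demo.flush()   # push pending changes to disk
del demo       # release the mapping

# reopen lazily; values are read from disk only when they are indexed
reloaded = np.load(demo_path, mmap_mode="r")
print(reloaded.shape, reloaded.dtype)
```

Because the shape and dtype are stored in the `.npy` header, the file can later be reopened without repeating them, which is how `X` and `Y` are loaded further down.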

## 2. Preliminary preparation of the training and testing sets

The functions in the following code cell generate the training and testing data sets; a brief explanation of the sample processing is given after them.

In [5]:
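def createCategoryList(Y, ratio):
    """
    Function creates a list of indexes for every category (galaxy type) in the sample
    @param Y :: 1-D array of target types (labels)
    @param ratio :: fraction of every category assigned to the training set

    @return category_indexes :: list of shuffled index arrays, one per category
    @return categories :: sorted array of the unique labels
    @return min_samples :: number of training samples per category
    """
    # unique labels, and an index for every sample
    categories = np.unique(Y).astype(int)
    indexes = np.arange(len(Y))

    # number of training samples to choose for each category/type of galaxy
    min_samples = np.array([int(ratio*len(Y[Y == i])) for i in categories])

    # list of indices for every target type, shuffled in place at random
    category_indexes = [indexes[Y == i] for i in categories]
    # iterate over the arrays themselves so labels need not be 0, 1, 2, ...
    for idx in category_indexes:
        random.shuffle(idx)
    return category_indexes, categories, min_samples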

def testTrainIndexes(category_indexes, categories, min_samples, train, test):
    """
    Function produces the indices used for generating training and testing samples
    """
    if not (train or test):
        raise ValueError("Either train or test must be True")
    # position k in category_indexes and min_samples corresponds to categories[k]
    positions = range(len(categories))
    if train:
        indexes_interm = [category_indexes[k][:min_samples[k]] for k in positions]
    else:
        indexes_interm = [category_indexes[k][min_samples[k]:] for k in positions]
    # flatten the per-category index arrays and mix the categories together
    indexes = np.concatenate(indexes_interm)
    random.shuffle(indexes)
    return indexes

def generateTrainTestFiles(len_train, X):
    """
    Function to generate empty memory-mapped files for training and testing data sets
    """
    X_train = writeOutputToFile('X_train', (len_train, X.shape[1]), 'float32')
    Y_train = writeOutputToFile('Y_train', (len_train,), 'float32')
    
    X_test =  writeOutputToFile('X_test', (X.shape[0]-len_train, X.shape[1]), 'float32')
    Y_test = writeOutputToFile('Y_test', (X.shape[0]-len_train,), 'float32')
    return X_train, Y_train, X_test, Y_test

The X, Y data sets that were generated by running notebook 02 on the *lesta* cluster are loaded.
* X $\equiv$ flux values $\forall$ wavelengths
* Y $\equiv$ target type, i.e. Quasars, Galaxies, Other, $\forall$ fibres

In [6]:
# loading the (X, Y) == (flux, target-types) data sets
X = np.load('Data_files/X_corrupted.npy', mmap_mode='r')
Y = np.load('Data_files/Y_corrupted.npy', mmap_mode='r')

The following code block creates a list of all categories of target types and selects from every category the fraction of samples defined by ${\rm ratio}$ (see Section 1) for training; the remainder is kept for testing. The indices of the samples are processed rather than the samples themselves, which reduces the computational cost of handling high-dimensional samples. Finally, the lists of indices for training and testing (${\rm indexes\_train,\ indexes\_test}$) are generated.

Empty memory-mapped arrays are generated to successively save the X, Y values for training and testing purposes.

In [7]:
# creates lists of indexes
category_indexes, categories, min_samples = createCategoryList(Y, ratio)

# Generates the indices for training and testing
indexes_train = testTrainIndexes(category_indexes, categories, min_samples, train=True, test=False)
indexes_test = testTrainIndexes(category_indexes, categories, min_samples, train=False, test=True)

# Creates empty memory-mapped array for the (X, Y) training and testing samples
len_train = np.sum(min_samples)
X_train, Y_train, X_test, Y_test = generateTrainTestFiles(len_train, X)
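
The per-category split can be illustrated on a toy label array. This is a simplified sketch of the same idea, not the notebook's own functions: it uses NumPy's `default_rng` rather than `random.shuffle`, and `toy_Y`, `toy_ratio`, `toy_train` and `toy_test` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the toy example repeatable
toy_Y = np.array([0]*4 + [1]*6 + [2]*10)
toy_ratio = 0.7

toy_train, toy_test = [], []
for label in np.unique(toy_Y):
    idx = np.flatnonzero(toy_Y == label)   # row indexes of this category
    rng.shuffle(idx)
    n_train = int(toy_ratio * len(idx))    # rounded down, as in createCategoryList
    toy_train.append(idx[:n_train])
    toy_test.append(idx[n_train:])

toy_train = np.concatenate(toy_train)
toy_test = np.concatenate(toy_test)

# the two index sets should be disjoint and together cover every row
overlap = np.intersect1d(toy_train, toy_test)
covered = np.sort(np.concatenate([toy_train, toy_test]))
print("overlap:", overlap.size, "| all rows covered:", np.array_equal(covered, np.arange(len(toy_Y))))
```

Because the number of training samples is rounded down separately for each category, the overall training fraction can come out slightly below `ratio`.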

## 3. Generation of the training and testing data sets
The following two lines save the values of the fluxes (X) and target types (Y) for the training and testing samples.

In [8]:
X_train[:], Y_train[:] = X[indexes_train], Y[indexes_train]
X_test[:], Y_test[:] = X[indexes_test], Y[indexes_test]
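
Indexing with an array of indices copies the selected rows into memory before they are assigned; for the training set reported at the end of this notebook (553301 rows of 4317 `float32` values) that is roughly 9.5 GB. If this does not fit in RAM, one possible alternative is to copy in chunks. The helper `copy_rows_in_chunks` and its `chunk_size` parameter below are hypothetical and not part of the original script; the toy arrays stand in for `X`, `indexes_train` and `X_train`.

```python
import numpy as np

def copy_rows_in_chunks(src, dst, row_indexes, chunk_size=10000):
    """Copy src[row_indexes] into dst, at most chunk_size rows at a time."""
    for start in range(0, len(row_indexes), chunk_size):
        stop = min(start + chunk_size, len(row_indexes))
        # only this slice of rows is held in memory at once
        dst[start:stop] = src[row_indexes[start:stop]]
    return dst

# toy stand-ins for X, indexes_train and X_train
toy_X = np.arange(20, dtype="float32").reshape(10, 2)
toy_indexes = np.array([7, 2, 9, 0, 4])
toy_out = np.empty((len(toy_indexes), toy_X.shape[1]), dtype="float32")

copy_rows_in_chunks(toy_X, toy_out, toy_indexes, chunk_size=2)
print(np.array_equal(toy_out, toy_X[toy_indexes]))
```

The same call should work unchanged when `src` and `dst` are memory-mapped arrays, since it relies only on slicing and index-array selection.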

In [9]:
print("Shape of [X, Y] training data sets: ", [np.shape(X_train), np.shape(Y_train)])
print("Shape of [X, Y] testing data sets: ", [np.shape(X_test), np.shape(Y_test)])

Shape of [X, Y] training data sets:  [(553301, 4317), (553301,)]
Shape of [X, Y] testing data sets:  [(237130, 4317), (237130,)]
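
Beyond the shapes, it may be worth confirming that each target type makes up a similar fraction of both sets. A minimal sketch follows, using toy label arrays in place of `Y_train` and `Y_test`; the helper `class_fractions` is a hypothetical name introduced for this illustration.

```python
import numpy as np

def class_fractions(labels):
    """Return {label: fraction of samples} for a 1-D label array."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): c / counts.sum() for v, c in zip(values, counts)}

# toy stand-ins for Y_train and Y_test
toy_Y_train = np.array([0, 0, 1, 1, 1, 1, 2, 2], dtype="float32")
toy_Y_test = np.array([0, 1, 1, 2], dtype="float32")

print("train:", class_fractions(toy_Y_train))
print("test: ", class_fractions(toy_Y_test))
```

In the toy arrays the fractions match exactly; on the real data small differences are expected because the per-category training counts are rounded down.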
