Cite:
D. A. Goldstein, et al. 2015 "Automated Transient Identification in the Dark Energy Survey" AJ (accepted).

# Background

* We are aiming here to classify two different types of astronomy images: true data, and artificially injected


First things first, let's get the pyspark kernel. Open up a Cori terminal and type "module load spark"

Let's grab the data.

In [1]:
! rm -rf autoscan_features.2.csv &&  wget http://portal.nersc.gov/project/dessn/autoscan/autoscan_features.2.csv

--2016-08-23 06:46:24--  http://portal.nersc.gov/project/dessn/autoscan/autoscan_features.2.csv
Resolving portal.nersc.gov... 128.55.6.160
Connecting to portal.nersc.gov|128.55.6.160|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 448893905 (428M) [text/plain]
Saving to: “autoscan_features.2.csv”


2016-08-23 06:46:28 (106 MB/s) - “autoscan_features.2.csv” saved [448893905/448893905]



In [None]:
from skimage.io import imread, imshow

from matplotlib import pyplot as plt

path_to_sample_image = "/project/projectdirs/dasrepo/data_day/astron-images/srch11802308.gif"

%matplotlib inline

#### Here is a sample astronomy image:

In [None]:
#im = imread(path_to_sample_image)

#get an image of the other day

#plt.imshow(im,cmap='gray')

Instead of running directly on the images, we will run on the 38 physically motivated features computed from each image. If the features are discriminating, the ML algorithm has an easier job telling the two classes apart.

It would be interesting to see whether a machine learning algorithm could discriminate based solely on the pixels of the image. If you are interested, I can show later how to apply deep learning to classify the raw images.

We have a CSV file. Here is what it looks like. Each line represents a single event. Each event has 40 columns: an id, the class label, and 38 physically motivated features computed from the image. After a few comment lines starting with "#", the first row of the file is the header with the name of each column.


In [4]:
! head -12 './autoscan_features.2.csv' | grep "^#" 

# autoscan training data
# use the id column to cross-match rows with thumbnails
# object_type gives the class of the row
# object_type = 0: artifact
# object_type = 1: non-artifact
# remaining 38 columns defined in section 3 and table 2 of companion paper 


In [5]:
! sed -i.bak '/^#/d' ./autoscan_features.2.csv
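The sed command above strips the "#" comment lines so that the header row comes first. As a rough, Spark-free sketch of the same idea, the snippet below skips comment lines and parses the rest with the standard csv module. The sample text is a made-up miniature of the real file: the column names come from the dataset, but the ID and feature values are hypothetical.

```python
import csv
import io

# Hypothetical miniature stand-in for autoscan_features.2.csv:
# '#' comment lines, then a header row, then one row per event.
sample = """# autoscan training data
# object_type = 0: artifact
ID,OBJECT_TYPE,AMP,SNR
10742010,0,0.81,7.72
8828139,1,0.92,12.4
"""

def read_events(handle):
    """Skip '#' comment lines and return each event as a dict keyed by column name."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines))

events = read_events(io.StringIO(sample))
print(len(events), sorted(events[0]))
```

Note that every value comes back as a string, which is also what we will see in the Spark schema.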

OK, we will use Spark here. The sed command above deleted the comment lines at the beginning of the file, so now let's load the modules of interest.

In [6]:
from pyspark.sql import SparkSession

SparkSession is the entry point for working with DataFrames in Spark, so it is the workhorse object here.

In [7]:
spark = SparkSession.builder.getOrCreate()

Now we will read the csv file to a data frame

In [8]:
df = spark.read.csv('./autoscan_features.2.csv', header=True)

CPU times: user 5 ms, sys: 2 ms, total: 7 ms
Wall time: 8.12 s


In [9]:
#ID will not be useful and band is non-numerical
df=df.drop('ID')
df=df.drop('BAND')

Now let's look at a sample record from the dataset. As we can see, underneath the dataframe is an RDD of rows.

In [10]:
df.take(1)

CPU times: user 5 ms, sys: 4 ms, total: 9 ms
Wall time: 1.01 s


[Row(OBJECT_TYPE=u'0', AMP=u'0.8083234429359436', A_IMAGE=u'1.5080000162124634', A_REF=u'2.65006947517395', B_IMAGE=u'0.949999988079071', B_REF=u'1.8995014429092407', CCDID=u'10', COLMEDS=u'0.11207699775695801', DIFFSUMRN=u'25.857545852661133', ELLIPTICITY=u'0.37002652883529663', FLAGS=u'0', FLUX_RATIO=u'0.2590300440788269', GAUSS=u'226.4202880859375', GFLUX=u'1.0089635848999023', L1=u'103.80699920654297', LACOSMIC=u'1.736109972000122', MAG=u'23.031299591064453', MAGDIFF=u'-0.4524995982646942', MAGLIM=u'0', MAG_FROM_LIMIT=u'1.6222000122070312', MAG_REF=u'22.578800201416016', MAG_REF_ERR=u'0.11959999799728394', MASKFRAC=u'0.0', MIN_DISTANCE_TO_EDGE_IN_NEW=u'559.7000122070312', N2SIG3=u'0', N2SIG3SHIFT=u'-7', N2SIG5=u'0', N2SIG5SHIFT=u'-8', N3SIG3=u'0', N3SIG3SHIFT=u'-8', N3SIG5=u'0', N3SIG5SHIFT=u'-9', NN_DIST_RENORM=u'0.6749339699745178', NUMNEGRN=u'22', SCALE=u'2.0241222381591797', SNR=u'7.722346305847168', SPREADERR_MODEL=u'0.004628799855709076', SPREAD_MODEL=u'-0.0037175000179558992

In [11]:
len(df.columns)

38

And the schema. As we can see here, after dropping ID and BAND there is one label (OBJECT_TYPE) and 37 feature columns, all read in as strings.

In [12]:
df.printSchema()

#describe a couple of the physics features

root
 |-- OBJECT_TYPE: string (nullable = true)
 |-- AMP: string (nullable = true)
 |-- A_IMAGE: string (nullable = true)
 |-- A_REF: string (nullable = true)
 |-- B_IMAGE: string (nullable = true)
 |-- B_REF: string (nullable = true)
 |-- CCDID: string (nullable = true)
 |-- COLMEDS: string (nullable = true)
 |-- DIFFSUMRN: string (nullable = true)
 |-- ELLIPTICITY: string (nullable = true)
 |-- FLAGS: string (nullable = true)
 |-- FLUX_RATIO: string (nullable = true)
 |-- GAUSS: string (nullable = true)
 |-- GFLUX: string (nullable = true)
 |-- L1: string (nullable = true)
 |-- LACOSMIC: string (nullable = true)
 |-- MAG: string (nullable = true)
 |-- MAGDIFF: string (nullable = true)
 |-- MAGLIM: string (nullable = true)
 |-- MAG_FROM_LIMIT: string (nullable = true)
 |-- MAG_REF: string (nullable = true)
 |-- MAG_REF_ERR: string (nullable = true)
 |-- MASKFRAC: string (nullable = true)
 |-- MIN_DISTANCE_TO_EDGE_IN_NEW: string (nullable = true)
 |-- N2SIG3: string (nullable = true)
 

In [13]:
df.groupBy('OBJECT_TYPE').count().show()

+-----------+------+
|OBJECT_TYPE| count|
+-----------+------+
|          0|454092|
|          1|444871|
+-----------+------+
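The two classes are close to balanced. A quick sketch of the arithmetic, using the counts printed above, gives the accuracy that a trivial "always predict the majority class" classifier would reach. That is a useful floor to compare any trained model against.

```python
# Counts copied from the groupBy output above.
counts = {0: 454092, 1: 444871}

total = sum(counts.values())
fractions = {label: float(n) / total for label, n in counts.items()}

# Accuracy of always predicting the most common class.
majority_baseline = max(fractions.values())
print(total, round(majority_baseline, 4))
```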



In [14]:
from pyspark.ml.linalg import Vectors

In [15]:
from pyspark.sql import Row

In [16]:
from pyspark.ml.linalg import Vectors, Vector, VectorUDT

Now the ML algorithm wants a pair made of a label and a vector of the other features. Let's make a little function to convert rows to vectors.

In [17]:
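def convert_row_to_vector(row, lbl_key='OBJECT_TYPE'):
    # Convert a Row of strings into a (label, dense feature vector) pair.
    row = row.asDict()
    lbl = int(row[lbl_key])
    # Sort the keys so every row yields its features in the same column order;
    # missing (None) or empty values are replaced by 0.0.
    float_list = [0.0 if row[k] is None or str(row[k]) == '' else float(row[k])
                  for k in sorted(row) if k != lbl_key]
    return (lbl, Vectors.dense(float_list))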
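To see what this conversion does without starting Spark, here is a plain-Python sketch that works on an ordinary dict instead of a Row and returns a list instead of a dense vector. The function name `row_to_label_and_features` and the sample values are hypothetical. Sorting the keys is an assumption made here so that the feature order is the same for every row.

```python
def row_to_label_and_features(row, lbl_key="OBJECT_TYPE"):
    """Plain-dict analogue: return (int label, list of floats in sorted-key order)."""
    lbl = int(row[lbl_key])
    # Missing (None) or empty values become 0.0, everything else is parsed as float.
    feats = [0.0 if row[k] in (None, "") else float(row[k])
             for k in sorted(row) if k != lbl_key]
    return lbl, feats

# Hypothetical row with one empty and one missing value.
row = {"OBJECT_TYPE": "1", "SNR": "7.5", "AMP": "", "MAG": None}
result = row_to_label_and_features(row)
print(result)  # (1, [0.0, 0.0, 7.5])
```

The features come out in the order AMP, MAG, SNR, so the empty AMP and the missing MAG both turn into 0.0.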
    

Now, we call map on the rdd in the dataframe, converting each row to a vector

In [18]:
lbl_vec_pairs = df.rdd.map(convert_row_to_vector)

Now we can create a dataframe

In [20]:
data = spark.createDataFrame(lbl_vec_pairs, ['label', 'features'])

In [21]:
from pyspark.sql.types import StructField, IntegerType, StructType

from pyspark.mllib.regression import LabeledPoint

In [43]:
#data=lbl_vec_pairs.map(lambda (l,v): LabeledPoint(l,v))

In [22]:
from pyspark.ml.classification import RandomForestClassifier

from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder

from pyspark.ml.feature import MinMaxScaler

from pyspark.ml import Pipeline

from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [23]:
from pyspark.ml.tuning import TrainValidationSplitModel

In [24]:
bce = BinaryClassificationEvaluator(metricName='areaUnderROC')
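BinaryClassificationEvaluator supports the metrics areaUnderROC and areaUnderPR. As a minimal sketch of what areaUnderROC measures, the snippet below computes it as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The function name `area_under_roc` and the toy labels and scores are hypothetical. Spark computes the value from the ROC curve, so its numbers may differ slightly from this pairwise version.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 3 of the 4 (positive, negative) pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)  # 0.75
```

A value of 0.5 means the scores are no better than random, and 1.0 means every non-artifact outranks every artifact.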

In [25]:
tr_data, te_data = data.randomSplit([0.8, 0.2])

In [26]:
rf = RandomForestClassifier()

In [27]:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [10, 50, 100, 300]) \
    .addGrid(rf.maxDepth, [30, 15, 5]) \
    .build()
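The grid is just the Cartesian product of the value lists, so its size grows multiplicatively. Here is a small sketch with itertools, assuming four tree counts and three depths that stay within Spark's maximum supported tree depth of 30.

```python
from itertools import product

num_trees = [10, 50, 100, 300]
max_depth = [5, 15, 30]  # assumption: only depths within Spark's limit of 30

# One dict per hyperparameter combination, like the param maps in the grid.
grid = [{"numTrees": t, "maxDepth": d} for t, d in product(num_trees, max_depth)]
print(len(grid))  # 12
```

Every entry is a full model fit, so it is worth keeping the grid small before submitting to the queue.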

In [28]:
tvs = TrainValidationSplit(estimator=rf,
                           estimatorParamMaps=paramGrid,
                           evaluator=bce,
                           trainRatio=0.8)
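TrainValidationSplit splits the data once according to trainRatio, scores every parameter combination on the held-out part, and then refits the best combination on the whole input. The sketch below mimics that loop in plain Python on a toy problem. All names here (`train_validation_select`, `fit`, `score`, the threshold "model") are hypothetical, and Spark splits by random sampling rather than by an exact cut.

```python
import random

def train_validation_select(rows, candidates, fit, score, train_ratio=0.8, seed=0):
    """Split once, score each candidate on the held-out part, refit the best on all rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train, valid = shuffled[:cut], shuffled[cut:]
    best = max(candidates, key=lambda params: score(fit(train, params), valid))
    return best, fit(rows, best)

# Toy problem: the label is 1 when x > 0.5, and a "model" is just a threshold.
rows = [(x / 100.0, int(x > 50)) for x in range(100)]

def fit(data, params):
    return params["threshold"]

def score(threshold, data):
    # Accuracy of predicting 1 whenever x exceeds the threshold.
    return sum(int(x > threshold) == y for x, y in data) / float(len(data))

candidates = [{"threshold": t} for t in (0.5, 0.2, 0.8)]
best, model = train_validation_select(rows, candidates, fit, score)
print(best)
```

Unlike k-fold cross-validation, each candidate is evaluated only once, which is cheaper but noisier.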

In [None]:
model = tvs.fit(tr_data)


In [None]:
prediction = model.transform(te_data)

In [5]:
# convert to .py file. Now let's submit to queue
! jupyter nbconvert --to script spark-astro-ml.ipynb
!sed -i.bak '/ipython*/d' ./*.py


[NbConvertApp] Converting notebook spark-astro-ml.ipynb to script
[NbConvertApp] Writing 5728 bytes to spark-astro-ml.py


HW!
Items to Work on: 3 Options:

1. ML
 * make a logistic regression model
 * use cross-validation to search a good space of logistic regression hyperparameters
 * preprocess all features to mean zero and stdev 1
 * submit this job to batch
 
 
2. Data Munging / Saving
 * find number of columns that have an element over 1
 * make a new data frame that contains 
     * the sum of the GFLUX, SNR and GAUSS columns
     * a column with the max value from each row from the original data
     * the mean value from each row
     * the median
 * convert this data frame to pandas
 * also save this data frame out to JSON

 
3. Deep Learning
    * Train a convolutional neural network to classify the astronomy images for at least 50 epochs
    * Submit this job to the queue
    * Plot the learning curve and an accuracy curve