# Data Collection for Quasar Detection

**Goal:** Gather images of quasars, non-quasar celestial objects, and quasar candidates.

We will use images collected from the <a href="http://www.sdss.org/">Sloan Digital Sky Survey</a>.

Some useful links for the Sloan Digital Sky Survey:
<ul>
   <li><a href="http://skyserver.sdss.org/dr12/en/tools/chart/listinfo.aspx">SDSS DR12 Image List Tool</a></li>
   <li><a href="https://dr12.sdss.org/fields/">SDSS DR12 Simple Image Query</a></li>
   <li><a href="https://data.galaxyzoo.org/">The Galaxy Zoo</a></li>
   <li><a href="http://cdsweb.u-strasbg.fr/cgi-bin/Sesame">Sesame Name Resolver</a></li>
   <li><a href="http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx">DR 12 Image Retrieval Script</a></li>
   <li><a href="http://www.sdss.org/wp-content/uploads/2016/08/dr13_boss.png">A Visual Representation of the DR13 Footprint</a></li>
   <li><a href="http://simbad.u-strasbg.fr/simbad/">SIMBAD Astronomical Database</a></li>
   <li><a href="http://simbad.u-strasbg.fr/simbad/sim-help?Page=sim-fsam#Sotypes">Useful SIMBAD Documentation</a></li>
</ul> 

Other useful links about quasars:
<ul>
    <li><a href="http://www.galaxyzooforum.org/index.php?topic=272689.0">Understanding QSO and Quasars</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Quasar">Wikipedia Article on Quasars</a></li>
</ul>

In [None]:
from flask import Flask
import urllib.parse
import urllib.request
from urllib.parse import urlencode, urlparse
from urllib.request import urlopen, Request
from urllib.error import HTTPError
from IPython.display import display, Image
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline
from astropy import units as u
from astropy.coordinates import SkyCoord
from astropy.table import Table
from astroquery.sdss import SDSS
from astroquery.simbad import Simbad

As a test, we will use AstroPy to get a nice image for the project page. We will use SIMBAD to find a random quasar.
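
In this notebook, SIMBAD returns right ascension as space-separated hours, minutes, and seconds, and declination as degrees, arcminutes, and arcseconds. The notebook hands these strings to `SkyCoord`, but a plain-Python conversion to decimal degrees can serve as a sanity check. The following is a minimal sketch; `ra_to_degrees` and `dec_to_degrees` are hypothetical helper names, not part of AstroPy or astroquery.

```python
def ra_to_degrees(ra_text):
    """Convert 'H M S' (hours, minutes, seconds) to decimal degrees."""
    parts = [float(p) for p in ra_text.split()]
    parts += [0.0] * (3 - len(parts))  # pad missing minutes or seconds
    hours, minutes, seconds = parts
    # One hour of right ascension corresponds to 15 degrees.
    return 15.0 * (hours + minutes / 60.0 + seconds / 3600.0)


def dec_to_degrees(dec_text):
    """Convert '+D M S' (degrees, arcminutes, arcseconds) to decimal degrees."""
    parts = dec_text.split()
    # Take the sign from the text so that values such as '-00 30 00' stay negative.
    sign = -1.0 if parts[0].startswith('-') else 1.0
    values = [abs(float(p)) for p in parts]
    values += [0.0] * (3 - len(values))
    degrees, arcmin, arcsec = values
    return sign * (degrees + arcmin / 60.0 + arcsec / 3600.0)


print(ra_to_degrees('12 00 00.0'), dec_to_degrees('+30 30 00'))
```

The sign is read from the string instead of the parsed number because a declination between -1d and 0d has a degrees field of `-00`, which parses to zero.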

In [None]:
# Limit the number of results we get from our query. 
Simbad.ROW_LIMIT = 1000

In [None]:
result = Simbad.query_criteria('region(box,180d +30d, 8d +8d)',otype='QSO')

In [None]:
result

In [None]:
# Choose a random quasar among the rows actually returned (there may be fewer than ROW_LIMIT).
qnumber = np.random.randint(0, len(result))
# Get that quasar's coordinates, and format them for SkyCoord
resultRA = result[qnumber]['RA'].split()
RA = '%sh%sm%ss' % (resultRA[0], resultRA[1], resultRA[2])
resultDEC = result[qnumber]['DEC'].split()
DEC = '%sd%sm%ss' % (resultDEC[0], resultDEC[1], resultDEC[2])
# Convert to a SkyCoordinate
QCoord = SkyCoord(RA, DEC, frame='icrs')
QCoord

We will now get an image from the Sloan Digital Sky Survey. The following code follows <a href="http://www.astropy.org/astropy-tutorials/Coordinates.html">this tutorial</a>.
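
The cutout service takes its `scale` parameter in arcseconds per pixel, so a field of view rendered at `impix` pixels needs a scale equal to the field of view in arcseconds divided by `impix`. The sketch below builds the same request URL with the standard library only. `build_cutout_url` is a hypothetical helper name, and the field of view is passed in arcminutes as a plain number instead of an AstroPy quantity.

```python
from urllib.parse import urlencode

# Same endpoint the notebook uses for DR12 JPEG cutouts.
CUTOUT_BASE_URL = 'http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx'


def build_cutout_url(ra_deg, dec_deg, impix=1024, imsize_arcmin=12.0):
    """Return the cutout URL for a square image centred on (ra_deg, dec_deg)."""
    # Field of view in arcseconds divided by the pixel count gives arcsec per pixel.
    scale = imsize_arcmin * 60.0 / impix
    query = urlencode(dict(ra=ra_deg, dec=dec_deg,
                           width=impix, height=impix, scale=scale))
    return CUTOUT_BASE_URL + '?' + query


example_url = build_cutout_url(180.0, 30.0)
print(example_url)
```

With the defaults, 12 arcminutes across 1024 pixels works out to 720 / 1024 = 0.703125 arcseconds per pixel.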

In [None]:
impix = 1024
imsize = 12*u.arcmin
cutoutbaseurl = 'http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx'
query_string = urlencode(dict(ra=QCoord.ra.deg, 
                                     dec=QCoord.dec.deg, 
                                     width=impix, height=impix, 
                                     scale=imsize.to(u.arcsec).value/impix))
url = cutoutbaseurl + '?' + query_string
urllib.request.urlretrieve(url, 'Quasar.jpg')

In [None]:
display(Image('Quasar.jpg'))

# Gathering Quasar Images

The <a href="http://astrostatistics.psu.edu/datasets/SDSS_quasar.html">Penn State Center for Astrostatistics</a> provides a list of 46420 detected quasars. We will use their <a href="http://astrostatistics.psu.edu/datasets/SDSS_quasar.dat">SDSS_quasar.dat</a> data set and the <a href="http://www.astropy.org/">AstroPy</a> Python package.

In [None]:
Quasars = pd.read_fwf('SDSS_quasar.dat')

In [None]:
Quasars.head()

In [None]:
Quasars.tail() # 46420 rows

In [None]:
coord = SkyCoord(str(Quasars.iloc[1]['R.A.'])+'d',str(Quasars.iloc[1]['Dec.'])+'d',frame='icrs')

In [None]:
impix = 120
imsize = 1*u.arcmin
cutoutbaseurl = 'http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx'
query_string = urllib.parse.urlencode(dict(ra=coord.ra.deg, 
                                     dec=coord.dec.deg, 
                                     width=impix, height=impix, 
                                     scale=imsize.to(u.arcsec).value/impix))
url = cutoutbaseurl + '?' + query_string
urllib.request.urlretrieve(url, 'Quasar_1.jpg')
display(Image('Quasar_1.jpg'))

In [None]:
def get_image(coordinate, name, impix=120):
    '''
    Downloads a 1 arcmin by 1 arcmin cutout from the SDSS DR12 release as an impix pixel by impix pixel image.
    
    Parameters
    ----------
    coordinate : Coordinate of the celestial object as a SkyCoord.
    name : The name string to save the image as. It will be saved as './Images/name.jpg'.
    impix : Width and height of the image in pixels.
    
    '''
    imsize = 1*u.arcmin
    cutoutbaseurl = 'http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx'
    query_string = urllib.parse.urlencode(dict(ra=coordinate.ra.deg, 
                                     dec=coordinate.dec.deg, 
                                     width=impix, height=impix, 
                                     scale=imsize.to(u.arcsec).value/impix))
    url = cutoutbaseurl + '?' + query_string
    # urlretrieve does not create missing directories, so make sure the target exists.
    os.makedirs('./Images', exist_ok=True)
    urllib.request.urlretrieve(url, './Images/' + name + '.jpg')

In [None]:
get_image(coord,'test1')
# Worked successfully

In [None]:
# Some data manipulation to get Sky Coordinates for each entry.
# Building one SkyCoord per row is slow for a table of this size.
QuasarLocs = pd.concat([Quasars['R.A.'].apply(lambda x: str(x)+'d'), 
                        Quasars['Dec.'].apply(lambda x: str(x)+'d')], axis=1)
QuasarLocs['Coords']= QuasarLocs[['R.A.','Dec.']].apply(lambda x: SkyCoord(x['R.A.'],x['Dec.'],frame='icrs'), axis=1)

In [None]:
QuasarLocs.head()

In [None]:
# We will now download these images from SDSS DR12
for i in range(len(QuasarLocs)):
    get_image(QuasarLocs['Coords'].iloc[i],name='Quasar_'+str(i))
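
A loop over tens of thousands of requests is likely to be interrupted at some point, so it can help to make the download resumable. The sketch below is one possible approach, not the method used in this notebook: `download_missing` is a hypothetical helper that skips files already on disk, records failures instead of stopping, and accepts a `fetch` argument (defaulting to `urlretrieve`) so it can be exercised without network access.

```python
import os
from urllib.request import urlretrieve


def download_missing(urls_by_name, out_dir, fetch=urlretrieve):
    """Fetch each URL to out_dir/name.jpg, skipping files that already exist.

    Returns the list of names whose download raised an error.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for name, url in urls_by_name.items():
        path = os.path.join(out_dir, name + '.jpg')
        if os.path.exists(path):
            continue  # already downloaded in an earlier run
        try:
            fetch(url, path)
        except OSError:
            # URLError and HTTPError are subclasses of OSError.
            failed.append(name)
            # Remove any partial file so the next run tries again.
            if os.path.exists(path):
                os.remove(path)
    return failed
```

Here `urls_by_name` would be a dict mapping names such as `'Quasar_0'` to cutout URLs. Calling the function a second time with the same dict only retries the entries that are still missing.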

# Gathering Non-quasar Celestial Objects

We will use SIMBAD to find objects that are not Quasars or Quasar Candidates. We will sample 200 random regions in the SDSS footprint and take 500 objects from each region.
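
For a box to stay entirely inside a rectangular footprint, its centre has to lie at least half a box width away from every edge. The sketch below draws such centres with the standard library. It assumes the box size is a full width in whole degrees and that centres are integers; `random_box_center` and `format_center` are hypothetical helpers, and the default limits (120d to 240d in right ascension, which is 8h to 16h, and 0d to +60d in declination) follow the footprint region used in this notebook.

```python
import random


def random_box_center(ra_range=(120, 240), dec_range=(0, 60), box_deg=8, rng=random):
    """Draw an integer (ra, dec) centre so a box_deg-wide box fits in the ranges."""
    half = box_deg // 2
    # randint is inclusive at both ends.
    ra = rng.randint(ra_range[0] + half, ra_range[1] - half)
    dec = rng.randint(dec_range[0] + half, dec_range[1] - half)
    return ra, dec


def format_center(ra, dec):
    """Format a centre in the 'RAd +DECd' style used in the region queries."""
    return '%dd %+dd' % (ra, dec)


rng = random.Random(0)  # seeded only to make the example repeatable
centers = [random_box_center(rng=rng) for _ in range(5)]
print([format_center(ra, dec) for ra, dec in centers])
```

With an 8d box and these limits, centres fall between 124d and 236d in right ascension and between +4d and +56d in declination.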

In [None]:
# Limit the number of results we get from our query.
Simbad.ROW_LIMIT = 20000

In [None]:
# We will stay in the 8h to 16h, +0d to +60d footprint region of SDSS. 
# Note that there are some regions in SDSS DR 12 and DR 13 outside of this range, 
# but this range covers a majority of the footprint.
# The box centers are drawn from 128d to 236d in right ascension
# and from +4d to +56d in declination, and each query below uses a 4d by 4d box.
NonQuasars = pd.DataFrame()
randcoord =[]
while len(randcoord) < 200:
    center = str(np.random.randint(128,237))+'d +' + str(np.random.randint(4,57))+'d'
    try: 
        # For otype, QSO are quasars, Q? are quasar candidates, and LeQ are gravitationally lensed quasars.
        result = Simbad.query_criteria('region(box,' + center + ', 4d +4d)','otype != QSO','otype != Q?','otype != LeQ')
        sample = result.to_pandas().sample(500)
    except Exception:
        # A failed query (a timeout, or fewer than 500 objects in the region)
        # is retried with a newly drawn region.
        print('Attempt Failed... retrying')
        continue
    randcoord.append(center)
    NonQuasars = pd.concat([NonQuasars,sample],axis=0)
    if len(randcoord) % 10 == 0:
        print('At attempt %s' % len(randcoord))

In [None]:
NonQuasars

In [None]:
# Checking for duplicates, which is a possibility in this process.
NonQuasars[NonQuasars.duplicated()]

In [None]:
# As duplicates were found, we will drop all but the first.
NonQuasars = NonQuasars.drop_duplicates(keep='first')

In [None]:
# Reindexing
NonQuasars.reset_index(drop=True, inplace=True)

In [None]:
# Saving a copy of the data to accompany the images.
NonQuasars.to_csv('NonQuasarsData.csv',index=False)

In [None]:
NonQuasars.head()

In [None]:
NonQuasars.tail() #94670 rows

In [None]:
def RAtoICRS(RAValue):
    '''
    Converts SIMBAD Right Ascension (RA) format to the "XhYmZs" string format accepted by SkyCoord.
    
    Parameters
    ----------
    RAValue : A SIMBAD Right Ascension value in "X Y Z" format for X hours, Y minutes, and Z seconds.
    
    '''
    parts = RAValue.split()
    if len(parts) == 1:
        return '%sh' % parts[0]
    elif len(parts) == 2:
        return '%sh%sm' % (parts[0], parts[1])
    elif len(parts) == 3: 
        return '%sh%sm%ss' % (parts[0], parts[1], parts[2])
    else: 
        # Unrecognized format
        return np.nan
    
def DECtoICRS(DECValue):
    '''
    Converts SIMBAD Declination (DEC) format to the "XdYmZs" string format accepted by SkyCoord.
    
    Parameters
    ----------
    DECValue : A SIMBAD Declination value in "+X Y Z" format for X degrees, Y minutes, and Z seconds.
    
    '''
    parts = DECValue.split()
    if len(parts) == 1:
        return '%sd' % parts[0]
    elif len(parts) == 2:
        return '%sd%sm' % (parts[0], parts[1])
    elif len(parts) == 3: 
        return '%sd%sm%ss' % (parts[0], parts[1], parts[2])
    else: 
        # Unrecognized format
        return np.nan

In [None]:
# Data manipulation to get Sky Coordinates for each entry.
# Note that SIMBAD gives the values separated as hours, minutes, seconds for RA and degrees, minutes, seconds for Dec
NonQuasarLocs = pd.concat([NonQuasars['RA'].apply(RAtoICRS), 
                           NonQuasars['DEC'].apply(DECtoICRS)], axis=1)
NonQuasarLocs['Coords']= NonQuasarLocs[['RA','DEC']].apply(lambda x: SkyCoord(x['RA'],x['DEC'],frame='icrs'), axis=1)

In [None]:
NonQuasarLocs.head()

In [None]:
# We will now download these images from SDSS DR12
for i in range(len(NonQuasarLocs)):
    get_image(NonQuasarLocs['Coords'].iloc[i],name='NonQ_'+str(i))

# Gathering Quasar Candidates

We will now use SIMBAD to identify quasar candidates for analysis with our trained model.

In [None]:
# As with the Non-quasar data, we will stay in the 8h to 16h +0d to +60 footprint 
# region of SDSS. Note that there are some regions in SDSS DR 12 and DR 13 outside 
# of this range, but this range covers a majority of the footprint.
# Due to timeout issues from SIMBAD, we will use smaller regions
# of width 10d by +6d to gather the candidates.
QuasarCandidates = pd.DataFrame()
for i in range(12): # Separate longitude into 12 segments of length 10d
    for j in range(10): # Separate latitude into 10 segments of length 6d
        try: 
            # For otype Q? are quasar candidates.
            result = Simbad.query_criteria('region(box,' + str(125+10*i)+'d +' + str(3+6*j) + 'd' + ', 5d +3d)','otype = Q?')
            QuasarCandidates = pd.concat([QuasarCandidates,result.to_pandas()],axis=0)
            if (i*10+j) % 20 == 0:
                print('At attempt %s' % (i*10+j))
        except Exception:
            # Catch query errors only, so the loop can still be interrupted from the keyboard.
            print('Attempt Failed at i=%s and j=%s.' % (i,j))

In [None]:
QuasarCandidates.head()

In [None]:
QuasarCandidates.tail()

In [None]:
# Checking for duplication
QuasarCandidates[QuasarCandidates.duplicated()]

In [None]:
# Reindexing
QuasarCandidates.reset_index(drop=True, inplace=True)

In [None]:
QuasarCandidates.tail() # 5418 rows

In [None]:
# Saving a copy of the data to accompany the images.
QuasarCandidates.to_csv('QuasarCandidatesData.csv',index=False)

In [None]:
# Data manipulation to get Sky Coordinates for each entry.
# Note that SIMBAD gives the values separated as hours, minutes, seconds for RA and degrees, minutes, seconds for Dec
QuasarCandidateLocs = pd.concat([QuasarCandidates['RA'].apply(RAtoICRS), 
                           QuasarCandidates['DEC'].apply(DECtoICRS)], axis=1)
QuasarCandidateLocs['Coords']= QuasarCandidateLocs[['RA','DEC']].apply(lambda x: SkyCoord(x['RA'],x['DEC'],frame='icrs'), axis=1)

In [None]:
QuasarCandidateLocs.head()

In [None]:
# We will now download these images from SDSS DR12
for i in range(len(QuasarCandidateLocs)):
    get_image(QuasarCandidateLocs['Coords'].iloc[i],name='QuasarCandidate_'+str(i))