# SDSS Galaxies vs quasars

We're now going to classify galaxies vs. quasars in the Sloan Digital Sky Survey (SDSS).

**Data** 

The dataset is at `solutions/galaxyquasar.csv`. I extracted it myself from the SDSS database using the SQL query reported here.

## Tasks

- Create arrays for the $(u - g)$, $(g - r)$, $(r - i)$, and $(i - z)$ colors. Also create an array with the class labels where galaxy = 0 and quasar = 1.
- Classify the dataset against the target label.
- Try some of the classification methods we've seen so far and evaluate the performance using the ROC curve.
- Remember to split the dataset into training and validation sets.
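As a rough sketch of the workflow these tasks describe (split, fit, score with a ROC curve), the snippet below runs on a synthetic stand-in for the four colors rather than on the SDSS file. The blob parameters, the names `X_demo`, `y_demo`, and `roc_results`, and the choice of classifiers are assumptions made for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the four colors: two Gaussian blobs (NOT SDSS data)
rng = np.random.default_rng(0)
n = 1000
X_demo = np.vstack([
    rng.normal(loc=1.0, scale=0.5, size=(n, 4)),  # label 0 ("galaxy")
    rng.normal(loc=0.2, scale=0.5, size=(n, 4)),  # label 1 ("quasar")
])
y_demo = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Hold out 25% of the sample for validation, keeping the class balance
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Illustrative choice of classifiers; swap in whichever methods you want to compare
models = {
    "GaussianNB": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (k=10)": KNeighborsClassifier(n_neighbors=10),
}

roc_results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_va)[:, 1]  # probability of class 1
    fpr, tpr, _ = roc_curve(y_va, scores)
    roc_results[name] = (fpr, tpr, roc_auc_score(y_va, scores))
    print(f"{name}: AUC = {roc_results[name][2]:.3f}")
```

Each `(fpr, tpr)` pair can be passed to `plt.plot` to draw the ROC curve. With the real data, `X_demo` and `y_demo` would be replaced by the color matrix and the 0/1 class labels.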

In [24]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from astroML.utils import completeness_contamination, split_samples
from sklearn.naive_bayes import GaussianNB

In [5]:
# Import data
file_path = '/home/alessia_pozzi/Astrostatistic/19_Classification/galaxyquasar.csv'
data = pd.read_csv(file_path)

print(data.head())

          u         g         r         i         z   class        z1  \
0  18.97213  18.53676  18.58280  18.34936  18.29215     QSO  0.522819   
1  19.24592  17.47646  16.47817  16.04472  15.68851  GALAXY  0.122846   
2  19.43536  17.70268  16.91565  16.58327  16.39128  GALAXY  0.000000   
3  19.31626  18.18312  17.39591  16.94549  16.65395  GALAXY  0.147435   
4  19.28828  19.11188  18.88937  18.80013  18.49183     QSO  2.011455   

       zerr  
0  0.000155  
1  0.000028  
2  0.000000  
3  0.000009  
4  0.000631  


In [None]:
# Build the four colors and stack them into an (N, 4) feature matrix
u_g = data['u'] - data['g']
g_r = data['g'] - data['r']
r_i = data['r'] - data['i']
i_z = data['i'] - data['z']
x = np.column_stack((u_g, g_r, r_i, i_z))  # column_stack takes a single tuple of arrays

# Class labels: galaxy = 0, quasar = 1 (assumes the only classes are GALAXY and QSO)
classes = np.where(data['class'] == 'QSO', 1, 0)

# Split into training and validation sets, passing the colors rather than the raw DataFrame
(x_train, x_test), (y_train, y_test) = split_samples(x, classes, [0.75, 0.25], random_state=0)
print('Training set size: ', len(x_train))
print('Validation set size: ', len(x_test))

In [None]:
classifiers = []
predictions = []
y_probs = []  # one array of quasar probabilities per classifier
Ncolors = np.arange(1, x_train.shape[1] + 1)

# Alternative column order (g - r first); defined but not applied below
order = np.array([1, 0, 2, 3])

for nc in Ncolors:
    # Fit Gaussian naive Bayes on the first nc colors only
    clf = GaussianNB()
    clf.fit(x_train[:, :nc], y_train)
    y_pred = clf.predict(x_test[:, :nc])

    # Added by GTR to be able to compute precision, recall, fpr, and tpr
    # predict_proba gives the probability for both classes; keep the quasar column (class 1)
    y_probs.append(clf.predict_proba(x_test[:, :nc])[:, 1])

    classifiers.append(clf)
    predictions.append(y_pred)

# One completeness and one contamination value per classifier
completeness, contamination = completeness_contamination(predictions, y_test)

print("completeness", completeness)
print("contamination", contamination)
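To make the two metrics returned by `completeness_contamination` concrete, here is a minimal hand-rolled sketch using the usual definitions for the positive class (quasar = 1): completeness is TP / (TP + FN) and contamination is FP / (TP + FP). The helper `completeness_contamination_demo` and the toy label arrays are hypothetical, written only for this illustration; they are not the astroML implementation.

```python
import numpy as np

def completeness_contamination_demo(y_pred, y_true):
    """Hypothetical helper: completeness and contamination for class 1."""
    y_pred = np.asarray(y_pred).astype(bool)
    y_true = np.asarray(y_true).astype(bool)

    tp = np.sum(y_pred & y_true)    # quasars correctly selected
    fp = np.sum(y_pred & ~y_true)   # galaxies wrongly selected as quasars
    fn = np.sum(~y_pred & y_true)   # quasars that were missed

    completeness = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    contamination = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    return completeness, contamination

# Toy labels: 4 true quasars, 6 true galaxies
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

comp, cont = completeness_contamination_demo(y_pred_demo, y_true_demo)
print(comp, cont)  # 0.75 0.25
```

Here three of the four quasars are recovered (completeness 0.75), and one of the four objects selected as quasars is actually a galaxy (contamination 0.25).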

## Ideas

- Try using different colors (a subset of them first, then all together). Which is the most important feature?
- Which colors best satisfy, or most clearly violate, the "naive" assumption of independence between the attributes?
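One possible way to explore both ideas is sketched below on synthetic data (not the SDSS colors). Within-class correlation matrices indicate which feature pairs strain the naive independence assumption, and a single-feature ROC AUC gives a crude importance ranking. The helper `make_class`, the feature construction, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
n = 2000

def make_class(shift, size):
    """Hypothetical generator: features 0 and 1 share a common component,
    feature 2 carries the class shift, feature 3 is pure noise."""
    base = rng.normal(size=size)
    f0 = base + 0.3 * rng.normal(size=size)
    f1 = base + 0.3 * rng.normal(size=size)
    f2 = rng.normal(size=size) + shift
    f3 = rng.normal(size=size)
    return np.column_stack([f0, f1, f2, f3])

X_demo = np.vstack([make_class(0.0, n), make_class(2.0, n)])
y_demo = np.repeat([0, 1], n)

# Naive Bayes assumes independence *within* each class, so correlate per class
corr_by_class = {
    label: np.corrcoef(X_demo[y_demo == label], rowvar=False)
    for label in (0, 1)
}
print(np.round(corr_by_class[0], 2))

# Crude feature ranking: validation AUC of a one-feature GaussianNB
perm = rng.permutation(len(y_demo))
train_idx, val_idx = perm[:3000], perm[3000:]
single_auc = []
for j in range(X_demo.shape[1]):
    clf = GaussianNB().fit(X_demo[train_idx][:, [j]], y_demo[train_idx])
    prob = clf.predict_proba(X_demo[val_idx][:, [j]])[:, 1]
    single_auc.append(roc_auc_score(y_demo[val_idx], prob))

best_feature = int(np.argmax(single_auc))
print("single-feature AUC:", np.round(single_auc, 3), "best:", best_feature)
```

Large off-diagonal entries in a within-class correlation matrix flag pairs of features for which the independence assumption is doubtful. With the real data, the same two checks can be run on the four color columns.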