# Exercises: Classification and Regression

## Exercise 1: Classifying galaxies and quasars

The goal of this exercise is to classify galaxies versus quasars in the Sloan Digital Sky Survey (SDSS). For this purpose, we will use a catalog of magnitudes of galaxies and quasars compiled by the SDSS. This catalog contains the following columns: `u, g, r, i, z, class, z1, zerr`. You will explore this data set and use a Gaussian Naive Bayes classifier to find which colors yield the most efficient classification.

**In Practice**

- Read the CSV file `galaxyquasar.csv`. This file originally results from an SQL query implemented in the `fetch_sdss_galaxy_colors()` function of the `astroML.datasets` module (but this query occasionally fails).
- Create arrays for the (u-g), (g-r), (r-i), and (i-z) colors. Also create an array with the class labels where galaxy=0 and quasar=1.
- Classify the data set against the target labels using a Gaussian Naive Bayes classifier. 
- Is it best to use all features or only a subset of them? If a subset, which one(s)? Justify your answer.
- Explore other classifiers and discuss your results. 
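
The steps above can be sketched as follows. This is only a possible starting point, not a reference solution. The sketch builds a small synthetic table so that it runs on its own; with the real data, the synthetic block would be replaced by `pd.read_csv("galaxyquasar.csv")`. The string labels `"GALAXY"` and `"QSO"` in the `class` column are an assumption and should be checked against the actual file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for galaxyquasar.csv (NOT real SDSS data).
# With the real file: df = pd.read_csv("galaxyquasar.csv")
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame(rng.normal(19.0, 1.0, size=(n, 5)), columns=list("ugriz"))
df["class"] = rng.choice(["GALAXY", "QSO"], size=n)  # assumed label strings
# Make the fake quasars brighter in u so that there is a signal to learn.
df.loc[df["class"] == "QSO", "u"] -= 1.0

# The four colors, in the order (u-g), (g-r), (r-i), (i-z).
colors = np.column_stack([
    df["u"] - df["g"],
    df["g"] - df["r"],
    df["r"] - df["i"],
    df["i"] - df["z"],
])
# Class labels: galaxy = 0, quasar = 1.
y = (df["class"] == "QSO").astype(int).to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    colors, y, test_size=0.3, random_state=0, stratify=y
)

# Fit one classifier per number of colors used (first k columns only).
scores = {}
for k in range(1, 5):
    clf = GaussianNB().fit(X_train[:, :k], y_train)
    scores[k] = clf.score(X_test[:, :k], y_test)
    print(f"first {k} color(s): accuracy = {scores[k]:.3f}")
```

Accuracy alone can be misleading when one class dominates the sample, so it may be worth also looking at completeness and contamination, or at a ROC curve, before deciding which colors to keep.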

## Exercise 2: Predicting redshift from photometry

The goal of this exercise is to predict the redshift of galaxies from the Sloan Digital Sky Survey based only on the photometry. The data sample is now saved in the CSV file `sdss_data_sub_v2.csv`. This data set is closer to what you would get in the real world than the one in Exercise 1 (i.e. it has undergone less cleaning).

The columns are: 
- 'bestobjid': Object ID
- 'ra', 'dec': location of the object on the sky in degrees.
- 'zsp': Spectroscopic redshift
- 'class': Object Class
- 'sourceType': Source type  ('SEQUELS_TARGET','QSO', 'LRG', 'ELG', 'GALAXY')
- 'u', 'g', 'r', 'i', 'z' and 'err_u', 'err_g', 'err_r', 'err_i', 'err_z' are the magnitudes in 5 photometric bands and their associated uncertainties. 

The source type acronyms mean:
- QSO = Quasars.
- LRG = Luminous Red Galaxies.
- ELG = Emission Line Galaxies.
- GALAXY = Galaxies that are neither LRG nor ELG.
- SEQUELS_TARGET = Objects that do not fall into any meaningful class. They should not be considered for redshift estimation.
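
Excluding the unclassified objects can be done with a boolean mask, as in the minimal sketch below. The tiny table is made up for illustration only. The label spelling `SEQUELS_TARGET` follows the column description above; it is worth checking `df["sourceType"].unique()` on the real file to confirm the exact string.

```python
import pandas as pd

# Hand-made toy table (NOT real data), only to illustrate the filtering step.
df = pd.DataFrame({
    "sourceType": ["QSO", "LRG", "SEQUELS_TARGET", "ELG", "GALAXY"],
    "zsp": [1.8, 0.5, 0.0, 0.9, 0.1],
})

# Keep every row whose source type is a meaningful class.
keep = df["sourceType"] != "SEQUELS_TARGET"
df_clean = df.loc[keep].reset_index(drop=True)

print(df_clean["sourceType"].value_counts())
```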

### Objective: 
- Use a regression algorithm to predict the redshift based on the 5 photometric bands ('u', 'g', 'r', 'i', 'z'). Compare the results obtained with various regression techniques (the first one to try is `LinearRegression`).
- Redo the same exercise for each source type separately.
- Which band(s) are the most important for the redshift predictions? Is it the same band for each source type?
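
One way to organize the comparison is sketched below, under stated assumptions: the data are synthetic (the fake redshift is built from `g - r` plus noise), `RandomForestRegressor` is just one arbitrary choice of second technique, and `evaluate` is a hypothetical helper introduced for this sketch. With the real file, the synthetic block would be replaced by reading `sdss_data_sub_v2.csv` and dropping the unclassified objects.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

bands = ["u", "g", "r", "i", "z"]

# Synthetic stand-in for sdss_data_sub_v2.csv (NOT real SDSS data).
rng = np.random.default_rng(1)
n = 1500
df = pd.DataFrame(rng.normal(20.0, 1.0, size=(n, 5)), columns=bands)
df["sourceType"] = rng.choice(["QSO", "LRG", "ELG", "GALAXY"], size=n)
df["zsp"] = 0.5 + 0.2 * (df["g"] - df["r"]) + rng.normal(0.0, 0.05, size=n)


def evaluate(frame, model):
    """Hypothetical helper: fit on a train split, return R^2 on the test split."""
    X = frame[bands].to_numpy()
    y = frame["zsp"].to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Factories so that each evaluation gets a fresh, unfitted model.
models = {
    "linear": lambda: LinearRegression(),
    "forest": lambda: RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, make_model in models.items():
    # Whole sample first, then each source type separately.
    results[("all", name)] = evaluate(df, make_model())
    for stype, sub in df.groupby("sourceType"):
        results[(stype, name)] = evaluate(sub, make_model())

for key, r2 in sorted(results.items()):
    print(key, f"R^2 = {r2:.3f}")
```

The score reported here is the coefficient of determination returned by `score`; other metrics, such as the scatter of the predicted minus spectroscopic redshift, may be more informative for this problem.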

Tip: To understand which bands are the most important for the redshift prediction, you can use the `permutation_importance` function, which is imported with:
`from sklearn.inspection import permutation_importance` (see [https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html) for an explanation).
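
A minimal sketch of how this function might be used is shown below. The data are synthetic, with a fake redshift that depends only on `g` and `r`, so those two bands should come out on top; nothing here reflects the real catalog. The importance of a band is the drop in the model score when the values of that band are shuffled.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

bands = ["u", "g", "r", "i", "z"]

# Synthetic magnitudes and redshift (NOT real data): z depends on g and r only.
rng = np.random.default_rng(2)
X = rng.normal(20.0, 1.0, size=(1000, 5))
y = 0.5 + 0.2 * (X[:, 1] - X[:, 2]) + rng.normal(0.0, 0.05, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Shuffle each band several times on held-out data and record the score drop.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# Bands sorted from most to least important.
order = np.argsort(result.importances_mean)[::-1]
ranked = [bands[i] for i in order]
for i in order:
    print(f"{bands[i]}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Computing the importances on the test split rather than the training split gives a picture of which bands matter for generalization. Keep in mind that strongly correlated bands can share importance, which can make each of them look less important than it is.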

Bonus: Explore methodologies similar to those of Exercise 1 for classification, this time applied to the different source types.
