# SDSS Galaxies Machine Learning Tutorial

This is less a tutorial than a playground for exploring machine learning. I've extracted a random sample of a few thousand galaxy spectra from SDSS DR12, along with the redshift of each galaxy as determined by template and line-fitting algorithms. The goal is to use machine learning regression to predict the redshift from the spectrum.

Have a look at the description of SDSS spectra here: http://www.sdss.org/dr12/spectro/

In [1]:
from astropy.io import fits
import numpy as np
import matplotlib.pyplot as plt
import glob, os
from scipy.interpolate import interp1d
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

%matplotlib nbagg

### Load the data
I've extracted the SDSS spectra from the original FITS files (which are a pain to work with), interpolated them onto a common wavelength grid and put the results into a numpy array (`F`). Each row is one galaxy, and each column in that row is the flux at a given wavelength. The wavelength grid is given in `wavs`, which you won't need for the machine learning (since it's the same for every object) but which you can use if you want to interpret the spectra physically. Lastly, there's an array of metadata, `met`, each row of which corresponds to the same row in `F`. The columns are [`redshift`, `plate`, `mjd`, `fiberID`]. Together, the plate, mjd and fiberID uniquely identify a spectrum, so you can go back to the SDSS database to get more metadata to play with.
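
To make that layout concrete, here is a minimal sketch that checks the three arrays line up and pulls out the redshift column. It runs on small synthetic stand-in arrays rather than the real files, so the array sizes, the made-up metadata values and the helper name `describe_dataset` are all assumptions for illustration only.

```python
import numpy as np

def describe_dataset(F, met, wavs):
    """Check that the arrays line up as described and return the redshifts."""
    n_gal, n_wav = F.shape
    # One metadata row per spectrum: [redshift, plate, mjd, fiberID]
    assert met.shape == (n_gal, 4), "expected one 4-column metadata row per spectrum"
    # One wavelength per flux column
    assert wavs.shape == (n_wav,), "expected one wavelength per flux column"
    z = met[:, 0]  # column 0 holds the redshift
    return n_gal, n_wav, z

# Synthetic stand-ins with the same structure as F, met and wavs (not real data)
rng = np.random.default_rng(0)
F_demo = rng.normal(size=(5, 1000))
met_demo = np.column_stack([
    rng.uniform(0.0, 0.3, size=5),       # redshift (made-up values)
    rng.integers(1000, 7000, size=5),    # plate (made-up values)
    rng.integers(51000, 57000, size=5),  # mjd (made-up values)
    rng.integers(1, 1001, size=5),       # fiberID (made-up values)
])
wavs_demo = np.linspace(4000, 8000, 1000)

n_gal, n_wav, z_demo = describe_dataset(F_demo, met_demo, wavs_demo)
print(n_gal, n_wav, z_demo.shape)
```

Once the real arrays are loaded, the same call with `F`, `met` and `wavs` should be a quick sanity check before any modelling.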

In [2]:
F = np.load('spectra.npy')
met = np.load('metadata.npy')
wavs = np.linspace(4000, 8000, 1000) # in angstroms

From here you're on your own! Plot the data, come up with your own set of features and try out a regression algorithm!
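
If you'd like a starting point, the sketch below is one possible baseline built from the tools imported at the top: compress each spectrum with PCA, then fit a random forest to the components. It is not a recommended solution, just something to beat. The helper name `fit_baseline`, the hyperparameters, the 75/25 split and the synthetic demo spectra (a single Gaussian feature that shifts with redshift) are all assumptions made so the example runs without the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def fit_baseline(F, z, n_components=10, seed=0):
    """Fit PCA + random forest on a train split and report RMSE on the held-out split."""
    X_train, X_test, z_train, z_test = train_test_split(
        F, z, test_size=0.25, random_state=seed
    )
    model = make_pipeline(
        PCA(n_components=n_components, random_state=seed),
        RandomForestRegressor(n_estimators=100, random_state=seed),
    )
    model.fit(X_train, z_train)
    rmse = np.sqrt(mean_squared_error(z_test, model.predict(X_test)))
    return model, rmse

# Synthetic demo data (not real spectra): one Gaussian feature whose
# observed centre moves as (1 + z); the rest wavelength of 5000 is arbitrary.
rng = np.random.default_rng(0)
wavs_demo = np.linspace(4000, 8000, 200)
z_demo = rng.uniform(0.0, 0.3, size=300)
centre = 5000.0 * (1.0 + z_demo)
F_demo = np.exp(-0.5 * ((wavs_demo[None, :] - centre[:, None]) / 150.0) ** 2)
F_demo += rng.normal(scale=0.05, size=F_demo.shape)

model, rmse = fit_baseline(F_demo, z_demo)
print(f"test RMSE on synthetic data: {rmse:.3f}")
```

To try it on the real data, pass `F` and `met[:, 0]` instead of the synthetic arrays. From there, `GridSearchCV` can tune `n_components` and the forest settings, and hand-built features such as line positions or continuum colours may well do better than raw PCA components.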