# Hubble's Law

For this project, we'll explore one of the most famous (and fundamental) relations in astronomy, Hubble's law. Hubble's law describes the relationship between a galaxy's distance and its recession velocity (the velocity with which it appears to be moving away from us). 

To derive Hubble's observation, we first need a distance and a velocity for each galaxy in a sample. To get these, we'll use data from the [Sloan Digital Sky Survey (SDSS)](https://en.wikipedia.org/wiki/Sloan_Digital_Sky_Survey), an observing program that has been conducted from the Apache Point Observatory in New Mexico since the year 2000. Hubble's law itself is much older: Edwin Hubble published it almost a century ago, in 1929! *Read a little bit about SDSS to see where this data is coming from! Some things to think about might be: how big is the telescope? how does it compare to the telescopes you used on the roof (or the telescopes at Lick Observatory)? how many objects have been observed? why was this survey conducted?*

## Part I: Distances
First, let's figure out how to measure the distance to a galaxy. Measuring the *absolute* distance to galaxies  requires some knowledge of the intrinsic brightness (which we call the luminosity, $L$) of a star or some other detectable source in the galaxy. 

As we get farther from a galaxy, its photons are spread over a larger area. The image below shows this: the light from a galaxy (its intrinsic brightness, or luminosity) is emitted from the galaxy and spreads out along the dashed lines. Notice how the distance between the dashed lines gets larger as they get farther from the galaxy. The farther apart the dashed lines are, the dimmer the galaxy seems.

<img src="./flux-img.png" width=300 height=300 />

This means that if we know what the light from the galaxy actually is (luminosity) and also measure the light from Earth, we can relate the two to get a distance. Said as an equation:

$$
F = \frac{L}{4\pi d^2}
$$

where $F$ is what we measure on Earth (the object's flux), $L$ is the luminosity of the object, and $d$ is the distance to the object.
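
To see the inverse-square behavior in numbers, here is a minimal sketch of the equation above. The function name `flux` and the values are made up for illustration (arbitrary units), not taken from the SDSS data.

```python
import math

def flux(luminosity, distance):
    """Flux received at `distance` from a source with the given `luminosity`."""
    return luminosity / (4 * math.pi * distance**2)

# Illustrative values in arbitrary units (assumptions, not real measurements)
L_source = 1.0
f_near = flux(L_source, 1.0)   # source at distance 1
f_far = flux(L_source, 2.0)    # same source, twice as far away

# Doubling the distance should cut the flux to a quarter
ratio = f_far / f_near
print(ratio)
```

The ratio depends only on the two distances, which is why we can compare galaxies without knowing $L$, as long as we assume $L$ is the same for both.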

While this seems like a simple process, doing this is challenging in general. Luckily, we can get away with measuring the *relative* distance for calculating Hubble's law. That is, with some assumptions, we can measure the distance of an object relative to some better-understood standard, such as our nearest large neighbor, the Andromeda galaxy. If we assume that, on average, all galaxies have roughly the same intrinsic brightness and size, then the only differences in their observed properties will be due to their being closer or farther away! This is the same principle as what we see on Earth -- *what do we expect to happen as we move something closer and further away from us?* 

In classic astronomy fashion, the system used to quantify the observed brightness of an object (its *magnitude*, $m$) is unintuitive and archaic, but tradition dictates that we keep it around. In fact, the system runs exactly opposite to what you'd expect -- the smaller the magnitude, the brighter the object, so the very brightest objects have negative magnitudes. For example, the brightest object in our sky, the Sun, has a magnitude of $m_{\rm Sun} \sim -26$. *What is the next brightest star in the sky? What magnitude is it?* The faintest objects we can see with our naked eyes have magnitude $m\approx 6$. This is part of why we use telescopes -- by collecting light for a long period of time and summing that up, we can see fainter objects than we'd be able to with our eyes alone. With this, the faintest objects that can be detected as part of SDSS have magnitudes $m\approx 23$. 

The data we'll be working with from SDSS gives us the *apparent magnitude* of the observed galaxies. This is related to the flux received at the detector by the following relation
$$F_{\rm gal}/F_{\rm stan} = 10^{-0.4(m_{\rm gal}-m_{\rm stan})}$$
where the subscripts 'gal' and 'stan' refer to the galaxy of interest and our standard source, respectively. *Let's familiarize ourselves with the magnitude system. Using the equations above, calculate how many times fainter the faintest source detectable in SDSS is than the faintest source we can see with the naked eye.* For our standard source, let's use the Andromeda galaxy, which is at a distance of $760\ {\rm kiloparsec}$ (*what is that unit?*) and has an apparent magnitude of $m_{\rm andromeda}\sim 3.4$. *Recall we're assuming that every galaxy has roughly the same intrinsic luminosity $L$. Use this equation to relate the distance to the apparent magnitudes. (Hint: you'll want to use the first equation written above).*
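
To get a feel for the magnitude relation, here is a small sketch using made-up magnitudes (they are not the SDSS or naked-eye limits, so the exercise above is still yours to do). The helper name `flux_ratio` is hypothetical.

```python
def flux_ratio(m_a, m_b):
    """Ratio F_a / F_b for two sources with apparent magnitudes m_a and m_b."""
    return 10 ** (-0.4 * (m_a - m_b))

# Illustrative magnitudes (assumptions): source A is 5 magnitudes fainter than B
m_a = 6.0
m_b = 1.0

ratio_5mag = flux_ratio(m_a, m_b)
print(ratio_5mag)  # about 0.01: A delivers roughly 1/100 of B's flux
```

A difference of 5 magnitudes corresponds to a factor of 100 in flux, and the larger magnitude belongs to the fainter source.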

Okay, enough background, let's start working with some real data!

In [None]:
# first, import useful packages
import numpy as np
import matplotlib.pyplot as plt

# to import the SDSS data, we'll need some astronomy packages
from astroquery.sdss import SDSS
import astropy.units as u
from astropy.constants import c

Based on your reading about SDSS, you'll know that the data were captured using plates. Therefore, we'll have to pick an observing plate from which we can analyze the data. Visit [this webpage](https://skyserver.sdss.org/dr12/en/tools/getimg/plate.aspx) and choose a plate from the BOSS survey list dropdown. Once you click `get plate`, you'll see the list of galaxies that were observed on that plate. Note down the two numbers listed in the dropdown menu -- these will be the plate number and the date on which these galaxies were observed and you'll need them for the next step.


In [None]:
# read in data from a file

# link to plate browser: https://skyserver.sdss.org/dr12/en/tools/getimg/plate.aspx
# choose a plate and paste your plate number and the associated date (MJD) here
plate = # CHOOSE A PLATE
mjd = 51602
print('plate =', plate, "\nMJD =", mjd)

# build the query for accessing the data from the database
query = "SELECT TOP 1000 objid, ra, dec, modelMag_i AS app_mag , z from SpecPhoto WHERE (class = 'GALAXY' AND plate = '%d')" %(plate)

# Run the query and store the results
result = SDSS.query_sql(query, data_release=16)

In [None]:
# define some variables for our 'standard' source -- the distances can then be calculated relative to this
m_andromeda = 3.4
d_andromeda = 760 # in units of kpc

# fill this in with the equation relating distance and magnitude above
m_gal = 
d_gal = 

Having calculated the distances associated with each galaxy using the apparent magnitude, let's do a sanity check to make sure that the trend matches our intuition! *Think about what you expect this plot to look like before you make it.*

In [None]:
# plot the distance vs the apparent magnitude -- I've filled this one in for you, but I'll let you do the other plots
plt.scatter(d_gal, m_gal, s = 5)
plt.xlabel('change this')
plt.ylabel('change this')
plt.show()

If you're satisfied that the above plot makes sense, then we're in good shape -- we've now found the physical distances to hundreds of galaxies!

## Part II: Velocities
The next step to finding Hubble's law is finding the velocity associated with each of these galaxies. In astronomy, we measure the velocities of distant objects using their [spectra](https://openstax.org/books/astronomy/pages/5-3-spectroscopy-in-astronomy), leveraging a principle similar to the idea behind the Doppler shift. *Read the link in the previous sentence to get familiar with the idea of observing/'taking' a spectrum.* The spectrum of an object is one of the most valuable things we can observe as it gives us direct insight into the composition of the object, among other things. Different elements that are present in a star or galaxy will leave a unique imprint on the observed spectrum, in the form of [*absorption and emission lines*](https://openstax.org/books/astronomy/pages/5-5-formation-of-spectral-lines), that appear at well-defined places (wavelengths) in the spectrum. However, if the object is moving relative to us, these lines will be shifted by a well-defined amount, just like a [*Doppler shift*](https://openstax.org/books/astronomy/pages/5-6-the-doppler-effect). That is, the amount that they are shifted is given by the equation
$$\frac{\lambda_{\rm obs}-\lambda_{\rm em}}{\lambda_{\rm em}} = \frac{v}{c}$$
where the 'obs' and 'em' subscripts correspond to the observed and emitted values of the wavelength $\lambda$, respectively, $v$ is the velocity of the object's motion, and $c = 3\times 10^{8}\ {\rm m/s}$ is the speed of light. To better understand what this is saying, let's put some numbers to it: if we observe a galaxy and expect to see an emission line associated with oxygen, such as that at ~4363 angstroms, but instead find that line at 5363 angstroms when we observe the spectrum, then this means that 
$$\frac{\lambda_{\rm obs}-\lambda_{\rm em}}{\lambda_{\rm em}} = \frac{5363 - 4363}{4363} = \frac{1000}{4363} \approx 0.23\implies v \approx 0.23 c = 69,000\ {\rm km/s}$$
so this tells us that the galaxy is moving away from us at 23% of the speed of light, or roughly 69,000 km/s -- that's pretty fast!

Doing this for a large number of galaxies is not straightforward because it requires identifying different lines and comparing them to their expected wavelengths, but people have developed codes that can do this automatically given some spectrum. Conventionally, the result of these calculations is reported as a *redshift*, $z$, for the galaxy, which is defined as 
$$z = \frac{\lambda_{\rm obs}-\lambda_{\rm em}}{\lambda_{\rm em}} \approx \frac{v}{c}$$
This means that for the above example, we'd say the galaxy is "at redshift 0.23". We call it a redshift because the emission line has been moved to longer (redder) wavelengths as a result of the galaxy's motion away from us. Conversely, if the galaxy were moving towards us, then $\lambda_{\rm obs}$ would be smaller than $\lambda_{\rm em}$, so $z$ would be negative, and we'd say that the line is *blueshifted*. 
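
The worked example above can be reproduced in a few lines. This is a minimal sketch: the function name `redshift` is hypothetical, and the speed of light is rounded to $3\times 10^{5}$ km/s, matching the value quoted in the text.

```python
C_KM_S = 3.0e5  # speed of light in km/s (rounded, as in the text)

def redshift(lambda_obs, lambda_em):
    """Redshift z from observed and emitted wavelengths (same units)."""
    return (lambda_obs - lambda_em) / lambda_em

# Wavelengths from the worked example, in angstroms
z = redshift(5363.0, 4363.0)

# Low-redshift approximation v ~ z * c
v = z * C_KM_S
print(z, v)
```

The unrounded redshift is slightly below 0.23, so the velocity comes out a little under the 69,000 km/s quoted above. A negative `z` from this function would indicate a blueshift.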

*Play around with this on the plate browser you used before. Click on a few galaxies and open their spectra. Compare the location of the emission lines to the number listed in their name (i.e., \[OIII\]4363 refers to an oxygen line, hence the O, emitted at a wavelength of 4363 angstroms) and calculate the redshift. Compare this to what's listed on the galaxy profile. You should find that these numbers match! (You may need to look up the wavelengths of different emission lines.)*

In [None]:
# you can use these cells to do your calculations!


Now that you're familiar with how we get the redshift, convert the redshifts reported in the data to velocities for our diagram.

In [None]:
# calculate the velocity below


Make a histogram of the velocities that we're using for our sample.


In [None]:
# code for plotting goes here


## Part III: Putting it together
With the distances and velocities calculated, you're ready to make a Hubble diagram. *Plot the two quantities against each other and think about what the results mean. Does there appear to be a trend between the distance and velocity? What sort of shape does that trend take (i.e., is it a line, a parabola, something else)? Interpret this -- what does that trend indicate?*

In [None]:
# code for plotting distance vs velocity goes here


Ok ignoring the few outliers, it looks like there is a pretty suggestive linear trend between the distance and velocity of these galaxies. On top of that, it looks like virtually all of these galaxies have positive velocities. *What does this mean?* This trend suggests that the velocity and distance are related by *a constant of proportionality*. In other words, we can write an equation relating the two as follows:
$$v = H d$$
Mathematically, such an equation indicates a linear relationship between the two quantities, with the constant of proportionality appearing as the slope. As I'm sure you've seen (at least in your math classes), such a relationship is pretty common in situations in our everyday lives. For example, if you're working a job and make $\$10$ an hour, you could describe your total earnings over some period of time with such a relationship. In our/Hubble's case, we know $v$ and $d$, so we want to find that constant of proportionality, which we've written as $H$ (for Hubble).

Therefore, our goal is to fit a line to the data to read off the slope. Luckily, this is pretty straightforward to do in python using numpy, but *first, pick some representative points and estimate the slope of the line by hand*. This will help ensure that we're confident in the result that we ultimately get from the computer.
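
Before fitting the real data, it can help to see `numpy.polyfit` recover a slope you already know. The sketch below uses the hourly-wage analogy with made-up numbers; the variable names are illustrative only.

```python
import numpy as np

# Made-up data for the wage analogy: $10 per hour, no starting bonus
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
earnings = 10.0 * hours

# Fit a degree-1 polynomial; coefficients come back highest power first
slope, intercept = np.polyfit(hours, earnings, 1)
print(slope, intercept)
```

Here the slope should come out at the hourly wage and the intercept near zero. For the Hubble diagram, distance would take the place of `hours` and velocity the place of `earnings`.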

In [None]:
# code to fit the line goes here (the numpy polyfit method might come in handy)


In [None]:
# print out the coefficients you measure -- what do these correspond to in Hubble's law?


In [None]:
# code for making hubble diagram with best fit line


## Part IV: Cleaning the data

Since our line doesn't seem to fit that well visually, let's try to improve the calculation. One easy thing that we can do is remove the outliers. This means we remove data points that look like they might not fit the general trend and are thus biasing our calculation. *Can you identify any points that you might want to remove?* 

Doing this by hand may be the most intuitive/straightforward thought, but it is also the most time consuming. Instead, let's think about how we can do this automatically, using some basic statistics. Previously we plotted the distribution of velocities; let's now do the same with the distance:

In [None]:
# code to plot a histogram of distances goes here


From this, it seems pretty clear that there are some points at fairly large distances that don't really sit with the rest of the distribution. To quantify just how far these points are from the rest of the distribution, let's calculate the *mean* and the *standard deviation*. These are statistics (numbers) that summarize a distribution of values. You're probably fairly familiar with the mean: it is the average value of the distribution and is in some sense a "representative" ballpark value of what we might expect if we picked a number randomly from the distribution. The standard deviation quantifies how much the distribution is spread away from the average. This means that if the standard deviation is larger, we are more likely to get values far away from the mean when we randomly draw from the distribution, and vice versa if it's smaller. These aren't too difficult to calculate by hand, but `numpy` has built-in functions to make our lives easier (`numpy.std` and `numpy.mean`). 


In [None]:
mean_dist = # fill this in
stdev_dist = # fill this in
print(mean_dist + 3*stdev_dist)

Now let's remove the data points that are really far away from the mean (i.e. more than 3 standard deviations away).
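
One common way to do this kind of cut is with a boolean mask. The sketch below uses a toy array rather than the galaxy data; the arrays `values` and `partner` are made up, with `partner` standing in for any second quantity that must stay aligned with the first.

```python
import numpy as np

# Toy data (an assumption): 50 values between 90 and 110, plus one extreme outlier
values = np.append(np.linspace(90.0, 110.0, 50), 1000.0)
partner = np.arange(values.size)  # a second array aligned element-by-element

mean = np.mean(values)
std = np.std(values)

# True for points within 3 standard deviations of the mean
keep = np.abs(values - mean) < 3 * std

# Apply the same mask to both arrays so they stay aligned
clipped = values[keep]
clipped_partner = partner[keep]
print(values.size, clipped.size)
```

Applying one mask to both arrays matters: if distances and velocities were filtered separately, the pairs would no longer line up.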

In [None]:
new_dist = # fill this in
new_vel = # fill this in

In [None]:
# code to fit the line goes here (the numpy polyfit method might come in handy)


In [None]:
# plot the data and your new line


This looks like it fits our nearby points a bit better! Believe it or not, we've now done better than Hubble did when he made his groundbreaking discovery. Here are a few questions to answer based on these results:
***
1. What are the units of H in our equation relating velocity and distance? What about the reciprocal of H? To answer this, you should consolidate the units (i.e., cancel out like units) as much as possible. When you consolidate the units, what is the numerical value you measure for H?
***
2. In general, distance and velocity are related by the following: $v = d/t$. How does this relate to the equation for Hubble's law? I.e., what role does H play in this equation?
***
3. Use your answer to (1) to calculate the reciprocal of H and express your answer in years. This is known as the "Hubble time". Combine this with your answer to (2) to interpret what this quantity represents for our universe. Compare what you get with what you'd get if you used the measured value for H = 70 km/s/Mpc.
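
As a hint for the unit bookkeeping in question 3, here is a minimal sketch of the conversion. The value of `H` is deliberately a round illustrative number, not your measurement and not the accepted value. The conversion factors are approximate.

```python
# Approximate conversion factors
KM_PER_MPC = 3.0857e19      # kilometers in one megaparsec
SECONDS_PER_YEAR = 3.156e7  # seconds in one year

# Illustrative value only (an assumption): H in km/s/Mpc
H = 100.0

# Cancel the distance units: (km/s/Mpc) / (km/Mpc) leaves 1/s
H_per_second = H / KM_PER_MPC

# The reciprocal is a time in seconds; convert it to years
hubble_time_years = 1.0 / H_per_second / SECONDS_PER_YEAR
print(hubble_time_years)
```

For this illustrative `H`, the result is a little under 10 billion years. A smaller `H` gives a proportionally longer time.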