# Further Hypothesis Testing

In [None]:
# Select this cell and type Ctrl-Enter to execute the code below.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt


## 1 - Introduction

In this workshop, you will apply several common parametric and non-parametric tests to answer research questions about an [example dataset](https://www.kaggle.com/deepu1109/star-dataset).

We begin with a brief exploration of the data.

### Data set

The file `stars.csv` contains a dataset of 240 stars, with five variables for each star:

|variable | description|
|:---|:---|
|temperature | the surface temperature (K)|
|luminosity | luminosity relative to the Sun ($L/L_{sun}$)|
|radius | radius relative to the Sun ($R/R_{sun}$)|
|spectral_class | the spectral class of each star (O,B,A,F,G,K or M)|
|type | as defined below|


The luminosity and radius of each star are expressed relative to those of the Sun:

$L_{sun} = 3.83 \times 10^{26}\text{W}$

$R_{sun} = 6.96 \times 10^8\text{m}$
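As a minimal sketch of what "relative to the Sun" means in practice, the snippet below converts solar-relative values back to SI units using the two constants quoted above. The helper name `to_si` and the example star are hypothetical and are not part of the dataset.

```python
# Solar reference values quoted in the text above
L_SUN = 3.83e26  # solar luminosity in watts
R_SUN = 6.96e8   # solar radius in metres


def to_si(luminosity_rel, radius_rel):
    """Convert solar-relative luminosity and radius to watts and metres."""
    return luminosity_rel * L_SUN, radius_rel * R_SUN


# Hypothetical star: half the Sun's luminosity, twice its radius
lum_w, rad_m = to_si(0.5, 2.0)
print(f"{lum_w:.3e} W, {rad_m:.3e} m")
```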


The stars are classified into 6 types:

|code | type|
|:---|:---|
|0 | Brown Dwarf|
|1 | Red Dwarf|
|2 | White Dwarf|
|3 | Main Sequence|
|4 | Supergiant|
|5 | Hypergiant|

The dataset contains 40 examples of each type.


Load the data using the `pandas` library:

In [None]:
data = pd.read_csv("stars.csv")
# Human-readable names, indexed by the integer code in the `type` column
type_key = ['Brown Dwarf', 'Red Dwarf', 'White Dwarf', 'Main Sequence', 'Supergiant', 'Hypergiant']

data.head()
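The statement that each type has the same number of examples can be checked by counting rows per type code. The sketch below uses a small toy frame (hypothetical values, three types only) so that it runs on its own; in the notebook, the same two lines would be applied to `data` and `type_key`.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (hypothetical values)
toy = pd.DataFrame({"type": [0, 0, 1, 1, 2, 2]})
type_names = ['Brown Dwarf', 'Red Dwarf', 'White Dwarf']

# Count rows per type code, then relabel the codes with readable names
counts = toy["type"].value_counts().sort_index()
counts.index = [type_names[code] for code in counts.index]
print(counts)
```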

We will be using histogram plots from `matplotlib` to visualise distributions of these variables.

For example, execute the following to see an overall histogram of luminosity:

In [None]:
sample = data.luminosity
xlab = 'luminosity'
ylab = 'freq'

# np.linspace defaults to 50 points, giving 49 equal-width bins
bins = np.linspace(sample.min(), sample.max())
ax = plt.axes()
ax.set_xlabel(xlab)
ax.set_ylabel(ylab)
ax.hist(sample, bins, alpha=0.5, color='gray')
plt.show()

Or, more usefully, on a log scale:

In [None]:
sample = np.log(data.luminosity)  # np.log is the natural logarithm
xlab = 'log(luminosity)'
ylab = 'freq'

bins = np.linspace(sample.min(), sample.max())
ax = plt.axes()
ax.set_xlabel(xlab)
ax.set_ylabel(ylab)
plt.hist(sample, bins, alpha=0.5, color='gray')
plt.show()
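An alternative to transforming the values is to keep luminosity in its original units and make the bin edges logarithmically spaced instead. The sketch below illustrates this on synthetic data; the array `lum`, the random seed, and the choice of 20 bins are all assumptions made for the example, not properties of the star dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for `data.luminosity` (hypothetical log-normal values)
lum = rng.lognormal(mean=0.0, sigma=4.0, size=240)

# 21 edges equally spaced on a log scale give 20 bins
edges = np.geomspace(lum.min(), lum.max(), 21)
counts, _ = np.histogram(lum, bins=edges)
print(counts)

# In the notebook, the same edges could be passed to matplotlib:
#   plt.hist(data.luminosity, bins=edges)
#   plt.xscale('log')
```

With this approach the x-axis shows luminosity itself rather than its logarithm, which can be easier to read off.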

We can split the data by another variable to compare different groups of stars. 

For example, the following shows log(luminosity), grouped by type:

In [None]:
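sample = np.log(data.luminosity)  # np.log is the natural logarithm
grouped = sample.groupby(data.type)
xlab = 'log(luminosity)'
ylab = 'freq'

# Shared bin edges so the histograms of all types are directly comparable
bins = np.linspace(sample.min(), sample.max())
ax = plt.axes()
ax.set_xlabel(xlab)
ax.set_ylabel(ylab)
# One semi-transparent histogram per star type, coloured and labelled by type code
for code, values in grouped:
    ax.hist(values, bins, alpha=0.5, color='C' + str(code), label=type_key[code])
ax.legend(loc='upper center')
plt.show()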
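The same grouping can also produce numerical summaries to accompany the histograms. The following is a minimal sketch on a toy frame with hypothetical values; in the notebook, `toy` would be replaced by `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `data` frame (hypothetical values)
toy = pd.DataFrame({
    "luminosity": [1e-3, 1e-2, 1e4, 1e5],
    "type": [0, 0, 4, 4],
})

# Group the log-transformed luminosity by star type and summarise each group
log_lum = np.log(toy["luminosity"])
summary = log_lum.groupby(toy["type"]).agg(["count", "mean"])
print(summary)
```

Comparing group means and counts in this way is a useful sanity check before running a formal test on the groups.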