### *** Names: [Insert Your Name Here]***

# Prelab 6

##  Prelab 6 Contents

1. Creating Statistical Graphics from Pandas DataFrames
2. Filtering/Selecting a Subset of Data
3. Testing Differences Between Datasets
  * Computing Confidence Intervals

In [None]:
#various things that we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as st

In [None]:
# these set the pandas defaults so that it will print ALL values, even for very long lists and large dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Read in the exoplanet data as a pandas dataframe called "data".

In [None]:
#read in the data, skipping the first 72 rows of ancillary information
data=pd.read_csv('planets030619.csv', skiprows=72)
print(data.shape)

In [None]:
data.columns

To make the dataset a bit more manageable for plotting, we'll truncate it to include only planet discovery methods that have found more than 30 planets and only objects that are legitimately classified as planets (masses < 13 Jupiter masses). You don't have to understand everything that's going on in the cell below. However, some of the techniques employed may be useful to you later, so I recommend you spend a few minutes trying to understand what's going on. 

In [None]:
#this truncates to only planet detection methods with >30 successful detections (skip if you want all of them)
methods,methods_inds,methods_counts = np.unique(data['pl_discmethod'],return_index=True,return_counts=True)
methods = methods[methods_counts> 30]
print("I am keeping only the following discovery methods: ", methods)

#find the indices of all entries where pl_discmethod is one of these four and the mass is < 13 Jupiter masses
inds = [j for j in range(len(data)) if data['pl_discmethod'][j] in methods and data['pl_bmassj'][j] < 13.]

#write a new dataframe with just these entries
data2 = data.loc[inds]

#note the table is much smaller than it once was
print("My shape is now: ", data2.shape)
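
The same kind of truncation can also be written with boolean masks instead of a loop over row indices. The sketch below is only an illustration on a small made-up table: the values, the variable names (`toy`, `min_count`, `keep_methods`, `toy_small`), and the threshold of 1 are assumptions chosen so that the example is easy to check by hand. It is not the method used in the cell above.

```python
import pandas as pd

# Toy table with made-up values, reusing the two column names from the cell above
toy = pd.DataFrame({
    "pl_discmethod": ["Transit", "Transit", "Transit",
                      "Radial Velocity", "Radial Velocity", "Imaging"],
    "pl_bmassj": [0.5, 1.2, 20.0, 3.0, float("nan"), 8.0],
})

# Hypothetical threshold: keep methods with more than this many rows
# (the cell above uses 30 for the real data)
min_count = 1

# Count rows per discovery method and keep the method names above the threshold
counts = toy["pl_discmethod"].value_counts()
keep_methods = counts[counts > min_count].index

# Combine two conditions with &; each condition needs its own parentheses.
# A comparison against NaN is False, so rows with a missing mass are dropped.
mask = toy["pl_discmethod"].isin(keep_methods) & (toy["pl_bmassj"] < 13.0)
toy_small = toy[mask]

print(toy_small)
```

Note that the filtered table keeps the original row labels, which is also true of `data2` above.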

## 1. Creating Statistical Graphics from Pandas DataFrames

<div class=hw>
### Exercise 1 - Summary plots for distributions

*Warning: Although you will be using Exoplanet database data to investigate and experiment with each type of plot below, when you write up your descriptions, they should refer to the **general properties** of the plots, and not to the exoplanet data specifically. In other words, your descriptions should be general descriptions of the plot types that could be applied to any dataset.*

### 1a - Histogram
The syntax for creating a histogram for a pandas dataframe column is: 

dataframe["Column Name"].hist(bins=nbins)
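
As a minimal sketch of this syntax, the example below draws a histogram of randomly generated numbers rather than the exoplanet data. The dataframe `toy`, its column `value`, and the output file name are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: 500 draws from a normal distribution (mean 10, std 2)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"value": rng.normal(loc=10.0, scale=2.0, size=500)})

# Draw the histogram on an explicit set of axes so it can be labeled afterwards
fig, ax = plt.subplots()
toy["value"].hist(bins=20, ax=ax)
ax.set_xlabel("value")
ax.set_ylabel("number of entries")

# Save the figure so it can be embedded in a markdown cell
fig.savefig("toy_histogram.png")
```

Rerunning with a different `bins` value is a quick way to see how the bin width changes the apparent shape of the distribution.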

Play around with the column name and bins and refer to the docstring as needed until you understand thoroughly what is being shown. Describe what this ***type of plot*** (not any individual plot that you've made) shows in words and describe when you think it might be useful. 

Play around with inputs (e.g. column name) until you find a case (dataframe column) where you think the histogram tells you something important and use it as an example to inform your answer. Inputs that do not produce informative histograms should also help to inform your answer. Save a couple of representative histograms (good and bad, use plt.savefig("figure name")) and integrate them into your written (markdown) explanation to support your argument. 

In [None]:
#this cell is for playing around with histograms

*Your explanation here, with figures*

<div class=hw>
### 1b - Box plot

The syntax for creating a box plot for a pair of pandas dataframe columns is: 

dataframe.boxplot(column="column name 1", by="column name 2")
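
A minimal sketch of this call on made-up data is shown below. The dataframe `toy`, the column names `group` and `value`, and the file name are assumptions for illustration only: `value` is numeric and `group` is a categorical label with three levels.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: three groups of 50 numbers with different centers and spreads
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 50),
    "value": np.concatenate([
        rng.normal(0.0, 1.0, 50),
        rng.normal(2.0, 1.0, 50),
        rng.normal(1.0, 3.0, 50),
    ]),
})

# One box per unique entry in the "by" column, summarizing the "column" values
fig, ax = plt.subplots()
toy.boxplot(column="value", by="group", ax=ax)
ax.set_ylabel("value")
fig.savefig("toy_boxplot.png")
```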

Play around with the column and by variables and refer to the docstring as needed until you understand thoroughly what is being shown. Describe what this ***type of plot*** (not any individual plot that you've made) shows in words and describe when you think it might be useful. 

Play around with inputs (e.g. column names) until you find a case that you think is well-described by a box and whisker plot and use it as an example to inform your answer. Inputs that do not produce informative box plots should also help to inform your answer. Save a couple of representative box plots (good and bad) and integrate them into your written explanation.  

In [None]:
#your sample boxplot code here

*Your explanation here*

<div class=hw>
### 1c - Scatter Plot
The syntax for creating a scatter plot is: 

dataframe.plot.scatter(x='column name',y='column name')
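
The sketch below applies this call to made-up data in which `y` depends roughly linearly on `x` with some added noise. The dataframe `toy`, its column names, and the file name are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: y is about 2*x plus random scatter
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 200)
toy = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(0.0, 2.0, 200)})

# Each row of the dataframe becomes one point at (x, y)
ax = toy.plot.scatter(x="x", y="y")
ax.figure.savefig("toy_scatter.png")
```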

Play around with the x and y column names and refer to the docstring as needed until you understand thoroughly what is being shown. Describe what this ***type of plot*** (not any individual plot that you've made) shows in words and describe when you think it might be useful.

Play around with inputs (e.g. column names) until you find a case that you think is well-described by a scatter plot and use it as an example to inform your answer. Inputs that do not produce informative scatter plots should also help to inform your answer. Save a couple of representative scatter plots (good and bad) and integrate them into your written explanation.  

In [None]:
#your sample scatter plot code here

*Your explanation here*

## 2. Filtering/Selecting a Subset of Data

You will find it quite useful for the rest of this class to be able to select subsets from larger datasets. One basic form of filtering employs conditionals inside of square brackets. For example:

In [None]:
x = np.array(np.arange(10))
print(x)
y=x[x > 3]
print(y)
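
Conditions can also be combined, and the same bracket syntax works on a pandas Series. The short sketch below uses made-up values; the names `both`, `either`, `s`, and `big` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(10)

# & means "and", | means "or"; each condition needs its own parentheses
both = x[(x > 3) & (x < 8)]
either = x[(x < 2) | (x > 7)]
print(both)    # [4 5 6 7]
print(either)  # [0 1 8 9]

# The same idea on a pandas Series; comparisons with NaN are False,
# so the missing entry is dropped
s = pd.Series([1.5, np.nan, 4.0, 7.5], index=["a", "b", "c", "d"])
big = s[s > 3]
print(list(big.index))  # ['c', 'd']
```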

<div class=hw>
### Exercise 2
--------------

Write a function called "filter" that takes a dataframe, a column name, and a value for that column as input and returns a new dataframe containing only those rows where the column equals that value. For example, filter(data2, "pl_discmethod", "Transit") should return a dataframe in which every value in the pl_discmethod column is "Transit". 

In [None]:
#your function here

In [None]:
#your tests here

*** If you get to this point during lab time on Tuesday, stop here***

## 3. Testing Differences Between Datasets 

### 3.1 Computing Confidence Intervals

Now that we have a mechanism for filtering the dataset, we can test differences between groups with confidence intervals. The syntax for computing the confidence interval on a mean for a given variable is as follows. 

variable1 = st.t.interval(conf_level, n, loc=np.nanmean(variable2), scale=st.sem(variable2, nan_policy='omit'))

where conf_level is the confidence level you wish to calculate (e.g. 0.95 is 95% confidence, 0.98 is 98%, etc.) and
n is the number of degrees of freedom, which should generally be set to the number of valid (non-NaN) entries in variable2 minus 1. The nan_policy='omit' argument tells st.sem to ignore NaN entries, matching np.nanmean. 
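
As a small worked sketch of this calculation, the example below uses a short made-up array with two missing entries. The names `values`, `n_valid`, `mean`, `sem`, `low`, and `high` are illustrative only.

```python
import numpy as np
import scipy.stats as st

# Made-up measurements with two missing (NaN) entries
values = np.array([4.8, 5.1, np.nan, 5.4, 4.9, np.nan, 5.3, 5.0])

# Count only the valid entries; the degrees of freedom are this count minus 1
n_valid = np.count_nonzero(~np.isnan(values))

# Mean and standard error of the mean, both ignoring NaN
mean = np.nanmean(values)
sem = float(st.sem(values, nan_policy="omit"))

# 95% confidence interval on the mean from the Student t distribution
low, high = st.t.interval(0.95, n_valid - 1, loc=mean, scale=sem)
print(n_valid, mean, (low, high))

# A higher confidence level gives a wider interval
low99, high99 = st.t.interval(0.99, n_valid - 1, loc=mean, scale=sem)
print((low99, high99))
```

The interval is symmetric about the mean, and two groups whose intervals do not overlap differ at roughly the chosen confidence level.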

An example can be found below (if your filter function is working as specified).

In [None]:
## apply filter to select only planets found by the transit method, and pull the radii from this group into a variable
df2=filter(data2,'pl_discmethod','Transit')
transit_radii=df2['pl_radj']
#print mean
print(np.nanmean(transit_radii))

In [None]:
#compute 95% confidence intervals on the mean (low and high)
#degrees of freedom = number of valid (non-NaN) radii minus 1
transitradii_conf=st.t.interval(0.95, transit_radii.count()-1, loc=np.nanmean(transit_radii), 
                                scale=st.sem(transit_radii, nan_policy='omit'))
transitradii_conf

<div class=hw>
### Exercise 3
------------------

Choose a planet property that you find interesting and compare the mean for that property across the four discovery methods. Then write a paragraph describing the results. Are the differences between the groups significant according to your data? Would they still be significant if you were to compute the 99.7% (3-sigma) confidence intervals?

In [None]:
#code to filter data and compute confidence intervals for each discovery method

***explanatory text***

In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../../custom.css", "r").read()
    return HTML(styles)
css_styling()