# Section 1. Intro to Pandas

So far, we have used Python lists as well as NumPy arrays to store and manipulate data, each of which has its place in a simple Python program. But there's an excellent (and very popular) Python package called Pandas that greatly facilitates handling comma-separated value (CSV) files and other types of data files. The basic tool in Pandas is the DataFrame, which you can think of as a large table with built-in functions to process rows and columns, read and write to many different formats on disk, and interact with other DataFrames. One key advantage of Pandas over other types of data storage is that a single table can easily hold data of many different types, including strings and numbers.

## 1.1 Getting Started

First things first, we need to import the package. We should also import NumPy for other useful functions.
```
import pandas as pd
import numpy as np
```


### 1.1.1 The DataFrame

There are many ways to create DataFrames in Pandas. Here, we will use one method that involves lists. In this case, each row is its own list, much like with a NumPy 2D array.

First, let's define some familiar looking data.

```
planets = [['Mercury', 0.0553, 0.4, 0],
           ['Venus', 0.815, 0.7, 0],
           ['Earth', 1.0, 1.0, 1],
           ['Mars', 0.107, 1.5, 2],
           ['Jupiter', 317.8, 5.2, 80],
           ['Saturn', 95.2, 9.5, 83],
           ['Uranus', 14.5, 19.2, 27],
           ['Neptune', 17.1, 30.1, 14]]

colnames = ['Name','Mass','Distance','N_moons']
# Units: None, Earth Masses, AU, None
```

Take note of a few things here. 

First, unlike NumPy arrays, which hold only one type (usually int or float), each row I've defined here mixes three different types of data: string, float, and int. This is one of the advantages of Pandas: different columns are allowed (and encouraged) to have different datatypes, depending on the use case. In particular, one useful datatype we will not use here is the [time/datetime](https://pandas.pydata.org/docs/user_guide/timeseries.html) datatype.

Second, notice that I have also defined a second list to use as names for the columns. This is a huge feature of the design philosophy of the DataFrame. While a NumPy 2D array can be thought of as a huge matrix of numbers, a DataFrame really is a table. Note that I could have put the units in the column names. However, that can quickly become very unwieldy for reasons you will see later. 

Now, let's make the dataframe and print it out.
```
df = pd.DataFrame(planets, columns=colnames)
print(df)
```

In a second cell, try running just the bare name `df` and see how the output looks different from `print(df)`. This distinction is only relevant when working in a Jupyter notebook. When working with large amounts of data in Python scripts, we'd essentially never need to display our DataFrames in either manner.
```
df
```
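
To see the "different columns, different datatypes" point in practice, here is a minimal sketch that rebuilds the same table and asks Pandas what type it assigned to each column. The exact label shown for the string column can vary between Pandas versions, so I only comment on the numeric ones.

```python
import pandas as pd

# Same planet table as above (units: none, Earth masses, AU, none)
planets = [['Mercury', 0.0553, 0.4, 0],
           ['Venus', 0.815, 0.7, 0],
           ['Earth', 1.0, 1.0, 1],
           ['Mars', 0.107, 1.5, 2],
           ['Jupiter', 317.8, 5.2, 80],
           ['Saturn', 95.2, 9.5, 83],
           ['Uranus', 14.5, 19.2, 27],
           ['Neptune', 17.1, 30.1, 14]]
colnames = ['Name', 'Mass', 'Distance', 'N_moons']

df = pd.DataFrame(planets, columns=colnames)

# Each column carries its own dtype: Mass and Distance are floats,
# N_moons is an integer, and Name holds strings
print(df.dtypes)

# (number of rows, number of columns)
print(df.shape)
```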

### 1.1.2 Accessing Rows and Cells

First, let's talk about how to access individual rows and cells. For this, we will use the indexer `df.loc[]`, where you can replace df with the name of your dataframe. Note the square brackets: `.loc` is indexed like a list, not called like a function.

You probably noticed in the previous example that each row is specified by an index. In the example above, we just had integer indices from 0 to 7. So, if we want to access the row for Earth, we can just execute
```
df.loc[2]
```
Note that some Pandas DataFrames you encounter may be indexed differently. For instance, the indices may not correspond to their row number at all. In fact, you can even find tables where the indices are strings. When using `df.loc[]` you must remember to use the label given.

We can illustrate this quite well by showing how to access a specific cell. We simply specify both the index (as above) and the column name. So, if we wanted to get the mass of Venus, for example, we'd execute
```
df.loc[1, 'Mass']  # Venus is the row with index 1
```
Try this yourself below.

Note that we can also use this functionality with `.loc[]` to change the values of individual cells, but in practice when working with data we didn't create ourselves, there is rarely a reason to do so.
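
Here is a small sketch of label-based access when the index is made of strings, plus what assigning to a single cell looks like. I use `set_index` to turn the `Name` column into the row labels. The three-row table and the new moon count are just for illustration (the moon count is made up).

```python
import pandas as pd

df = pd.DataFrame(
    [['Venus', 0.815, 0.7, 0],
     ['Earth', 1.0, 1.0, 1],
     ['Mars', 0.107, 1.5, 2]],
    columns=['Name', 'Mass', 'Distance', 'N_moons'],
)

# Use the planet names as row labels instead of 0, 1, 2
by_name = df.set_index('Name')

earth_row = by_name.loc['Earth']           # whole row, returned as a Series
venus_mass = by_name.loc['Venus', 'Mass']  # single cell

# Assigning through .loc changes one cell. I work on a copy so the
# original table stays untouched.
edited = by_name.copy()
edited.loc['Mars', 'N_moons'] = 3  # made-up value, only to show the syntax

print(earth_row)
print(venus_mass)
```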

### 1.1.3 Accessing Columns

Accessing individual rows and cells is actually less useful than you'd think since each row only represents one data point. Usually, when we're using tables, we're actually more interested in the statistics of many data points. The DataFrame structure reflects this, because it's very easy to get different columns. For example, if we want the different masses,
```
df['Mass']
```
If we wanted multiple columns, we enter a list of column names instead of just one, e.g.,
```
df[['Mass','Distance']]
```
Note this is why I chose not to include units in column names. Since we need to write out the column names, adding the units can make the code harder to parse. Instead, it can be better, depending on your preferences, to state clearly in a comment what units are being used.
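
Since the point of grabbing columns is usually to compute statistics over many data points, here is a short sketch of what that can look like. A single column comes back as a Series and a list of columns comes back as a smaller DataFrame. Both have built-in summary methods such as `mean()`, `max()`, and `describe()`.

```python
import pandas as pd

planets = [['Mercury', 0.0553, 0.4, 0],
           ['Venus', 0.815, 0.7, 0],
           ['Earth', 1.0, 1.0, 1],
           ['Mars', 0.107, 1.5, 2],
           ['Jupiter', 317.8, 5.2, 80],
           ['Saturn', 95.2, 9.5, 83],
           ['Uranus', 14.5, 19.2, 27],
           ['Neptune', 17.1, 30.1, 14]]
df = pd.DataFrame(planets, columns=['Name', 'Mass', 'Distance', 'N_moons'])

masses = df['Mass']                # one column -> Series
subset = df[['Mass', 'Distance']]  # list of columns -> DataFrame

mean_mass = masses.mean()          # average mass in Earth masses
heaviest_row = masses.idxmax()     # index label of the largest mass

print(mean_mass)
print(df.loc[heaviest_row, 'Name'])
print(subset.describe())           # count, mean, std, min, quartiles, max
```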

### 1.1.4 df.values
When using some functions, like `plt.plot()`, we don't want the full Pandas DataFrame. Using `df.values` will extract the values and return a NumPy array.
```
df.values # returns a 2D array
df.loc[2].values # returns a 1D array of the third row
df['Mass'].values # returns a 1D array of the Mass column
```
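
As a quick check of what actually comes back, the sketch below looks at the shape and dtype of the extracted arrays. I also show `to_numpy()`, which does the same job as `.values` and is the spelling the Pandas documentation recommends for new code.

```python
import numpy as np
import pandas as pd

planets = [['Mercury', 0.0553, 0.4, 0],
           ['Venus', 0.815, 0.7, 0],
           ['Earth', 1.0, 1.0, 1],
           ['Mars', 0.107, 1.5, 2],
           ['Jupiter', 317.8, 5.2, 80],
           ['Saturn', 95.2, 9.5, 83],
           ['Uranus', 14.5, 19.2, 27],
           ['Neptune', 17.1, 30.1, 14]]
df = pd.DataFrame(planets, columns=['Name', 'Mass', 'Distance', 'N_moons'])

full = df.values            # 2D array of the whole table
mass = df['Mass'].values    # 1D float array, ready for plotting or NumPy math
mass_alt = df['Mass'].to_numpy()

# Because the table mixes strings and numbers, the full array cannot use
# a single numeric dtype, so extracting one column is usually more useful
print(full.shape, full.dtype)
print(mass.shape, mass.dtype)
```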

## 1.2 Making Changes to a DataFrame
### 1.2.1 Adding New Columns from Data

Let's say we get more data from a different source. We want to add it to our DataFrame. We can do so like this.
```
radii = [0.383, 0.949, 1, 0.532, 11.21, 9.45, 4.01, 3.88] # in Earth radii 

df['Radius'] = radii
```

One big note: **your list or list-like object with the new data must be in the same order as the rows in your DataFrame. If they aren't, your data will be incorrect. This is extremely important.**
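
If you are not sure your new data is in the right order, one way to sidestep the problem is to key the new values by something that identifies the row, such as the planet name, and let Pandas do the matching. The sketch below does this with `Series.map` and a dictionary. The dictionary name `radius_by_name` is my own, and the radii are the same values as in the list above.

```python
import pandas as pd

df = pd.DataFrame(
    [['Venus', 0.815, 0.7, 0],
     ['Earth', 1.0, 1.0, 1],
     ['Mars', 0.107, 1.5, 2]],
    columns=['Name', 'Mass', 'Distance', 'N_moons'],
)

# Radii in Earth radii, deliberately listed in a different order than df
radius_by_name = {'Earth': 1.0, 'Mars': 0.532, 'Venus': 0.949}

# Look up each planet's radius by its name, so row order no longer matters
df['Radius'] = df['Name'].map(radius_by_name)

print(df[['Name', 'Radius']])
```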

Add this new column below. It will be used in the next section.

### 1.2.2 Adding New Columns from Other Columns

It's important to remember that columns are essentially NumPy arrays with extra stuff. As a result, we can do mathematical operations on columns just like we do with arrays. We can leverage this to create new columns. For example, let's get the density of the different planets. Remember that density is mass over volume. For this example, let's get the density in SI units, which means we'll also need to convert the masses and radii to SI units.
```
m_e = 5.9722e24 # kg
r_e = 6.371e6 # m, using the average radius
df['Density'] = m_e*df['Mass']/((4/3)*np.pi*(r_e*df['Radius'])**3)
```

Now that you've done that, what do you notice about the densities of the outer planets versus the inner planets? Compare these densities to the densities of various materials in this [wikipedia page](https://en.wikipedia.org/wiki/Density#Densities). Which materials are these densities most similar to?

What you should find is that the inner planets are rocky, so their densities are more comparable to metals, while the gas giants have densities that are closer to that of water.
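
For reference, here is a self-contained sketch of the whole density calculation, with a quick comparison against water (about 1000 kg/m^3). It uses the same constants as above and assumes each planet is a perfect sphere.

```python
import numpy as np
import pandas as pd

planets = [['Mercury', 0.0553, 0.4, 0],
           ['Venus', 0.815, 0.7, 0],
           ['Earth', 1.0, 1.0, 1],
           ['Mars', 0.107, 1.5, 2],
           ['Jupiter', 317.8, 5.2, 80],
           ['Saturn', 95.2, 9.5, 83],
           ['Uranus', 14.5, 19.2, 27],
           ['Neptune', 17.1, 30.1, 14]]
df = pd.DataFrame(planets, columns=['Name', 'Mass', 'Distance', 'N_moons'])
df['Radius'] = [0.383, 0.949, 1, 0.532, 11.21, 9.45, 4.01, 3.88]  # Earth radii

m_e = 5.9722e24  # kg
r_e = 6.371e6    # m, using the average radius

# Density = mass / volume of a sphere, in kg/m^3
df['Density'] = m_e * df['Mass'] / ((4 / 3) * np.pi * (r_e * df['Radius'])**3)

# Ratio to the density of water, for an easy comparison
df['Density_vs_water'] = df['Density'] / 1000.0

print(df[['Name', 'Density', 'Density_vs_water']])
```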

### 1.2.3 Practice Problem 1: Kepler's Law

Using Kepler's Third Law, make another column that corresponds to the orbital period for each planet **in years.** Once again, Kepler's Third Law is 

$$T = 2\pi \sqrt{\frac{a^3}{G(M_1 + M_2)}}$$

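Here, $a$ is the semimajor axis (the `Distance` column, converted to meters) and $G$ is the gravitational constant. For the case of solar system planets, you may make the approximation that $M_1 + M_2 \approx M_\odot \approx 2 \times 10^{30}$ kg. 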

You can calculate by hand how many seconds are in a year, but note a year technically has 365.25 days!

Finally, you'll need to know that 1 AU is defined as exactly 149597870700 m.
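
Before building the column, it can help to sanity-check your units on a single planet. The sketch below plugs in Earth only ($a$ = 1 AU) and should come out very close to one year. The value of $G$ is not given above, so the one I use here (6.674e-11 in SI units) is my own addition. The variable names are also just my choices.

```python
import numpy as np

G = 6.674e-11           # m^3 kg^-1 s^-2, gravitational constant (assumed value)
M_sun = 2e30            # kg, approximation for M_1 + M_2 from above
AU = 149597870700       # m, exact by definition
year = 365.25 * 24 * 3600  # seconds in a year

a = 1.0 * AU            # Earth's semimajor axis in meters

# Kepler's Third Law gives the period in seconds
T_seconds = 2 * np.pi * np.sqrt(a**3 / (G * M_sun))

# Convert to years
T_years = T_seconds / year
print(T_years)  # should be close to 1
```

The same expression works unchanged if you swap the single number `a` for a whole column converted to meters.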

### 1.2.4 Practice Problem 2: Equilibrium Temperature

The [equilibrium temperature](https://en.wikipedia.org/wiki/Planetary_equilibrium_temperature) of a planet is defined as the temperature at which the power supplied by its star (due to solar radiation) is equal to the power emitted by the planet (due to blackbody radiation). Without getting too much into the details, the formula (in units of Kelvin) is as follows:

$$ T_{\rm eq} =  \left( \frac{L (1 - A_B)}{16\sigma\pi a^2} \right)^{1/4}$$

Here, $L$ is the luminosity of the Sun (3.828e26 Watts), $a$ is the semimajor axis (the `Distance` column, converted to meters), $\sigma =$ 5.67e-8 (SI units) is the Stefan-Boltzmann constant, and $A_B$ is something called the [Bond Albedo](https://en.wikipedia.org/wiki/Bond_albedo).

Your task:
1. Create a new column for the Bond albedo of each planet, taking the data from the linked wikipedia article
2. Create a new column for the equilibrium temperature using the formula above. 
3. Create a new column for the effective/surface temperature of each planet using the values listed below. 
4. Create a scatter plot of the predicted equilibrium temperature versus the actual surface temperature. Show the 1-to-1 line using a dotted line for comparison. After reading the first linked wikipedia article, can you think of at least one reason why they may be different? (Hint: you've definitely learned about one of them in other science classes or the news)

```
Teff = [412.5, 737, 288, 215, 124.4, 95, 59.1, 59.3]
```
Values (in Kelvin) taken from de Pater and Lissauer (2010). Note that Mercury's is the average of 100 and 725.
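
As with the Kepler problem, a single-planet check is a good way to catch unit mistakes before you build the columns. The sketch below evaluates the formula for Earth only. The albedo of 0.3 is a rough round number I picked for illustration, not the value from the linked table, so use the table's numbers in your actual column.

```python
import numpy as np

L_sun = 3.828e26     # W, luminosity of the Sun
sigma = 5.67e-8      # W m^-2 K^-4, Stefan-Boltzmann constant
AU = 149597870700    # m

a = 1.0 * AU         # Earth's semimajor axis in meters
A_B = 0.3            # rough Bond albedo for Earth (assumed, for illustration)

# Equilibrium temperature in Kelvin
T_eq_earth = (L_sun * (1 - A_B) / (16 * sigma * np.pi * a**2))**0.25
print(T_eq_earth)
```

The result lands noticeably below the 288 K listed for Earth in `Teff`, which is exactly the kind of gap the last task asks you to think about.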



## 1.3 Intermediate Pandas

### 1.3.1 Boolean Indexing with DataFrames

When we take a look at data, it can happen that we are only interested in a subset of the rows based on one or more conditions. Luckily for us, this is almost identical to doing the same thing with NumPy arrays! The difference is that we refer to columns by name instead of by numerical index. For example, let's get only the planets that have moons.

```
df[df['N_moons'] > 0]
```

Just like before, we can chain different conditions together. For example, let's get only the planets that have moons and are farther than 5 AU from the Sun.

```
df[(df['N_moons'] > 0) & (df['Distance'] > 5)]
```

As a reminder, here is a list of the logical operators you can use:

```
== # equal to
!= # not equal to
<= # less than or equal to
<  # less than
>= # greater than or equal to
>  # greater than

& # logical and
| # logical or
~ # logical not
```
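
The example above only used `&`, so here is a short sketch of the other two combining operators, `|` and `~`. Note the parentheses around each condition: they are needed because `&`, `|`, and `~` bind more tightly than the comparison operators.

```python
import pandas as pd

planets = [['Mercury', 0.0553, 0.4, 0],
           ['Venus', 0.815, 0.7, 0],
           ['Earth', 1.0, 1.0, 1],
           ['Mars', 0.107, 1.5, 2],
           ['Jupiter', 317.8, 5.2, 80],
           ['Saturn', 95.2, 9.5, 83],
           ['Uranus', 14.5, 19.2, 27],
           ['Neptune', 17.1, 30.1, 14]]
df = pd.DataFrame(planets, columns=['Name', 'Mass', 'Distance', 'N_moons'])

# OR: planets with no moons, or farther than 10 AU from the Sun
no_moons_or_far = df[(df['N_moons'] == 0) | (df['Distance'] > 10)]

# NOT: planets that are not more massive than Earth
light = df[~(df['Mass'] > 1)]

print(no_moons_or_far['Name'].tolist())
print(light['Name'].tolist())
```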


### 1.3.2 Using Other People's Data

Over this module I've been hinting at the fact that you will rarely create your own DataFrame from scratch in practical situations. Almost all data you analyze with Pandas will either be data you get from somewhere else or data that you create from some complicated simulation. In other words, you need to know how to take files of data and create DataFrames out of them.

A common format for data is the CSV file, short for comma-separated values. In a CSV file, each row contains the same number of values, each separated by a comma. It's possible to use other "separators" as well. A commonly used one is whitespace, where some number of whitespace characters (whether tabs or spaces) are used to separate different values in a row.

One of the main ways to read such files in Pandas is to use `pd.read_csv()`.

```
filename = 'path/to/file' # whatever the path to the file is
data = pd.read_csv(filename) # read the file into a DataFrame
```
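
To try this without needing a file on disk, the sketch below feeds `pd.read_csv()` some text through `io.StringIO`, which makes a string behave like an open file. It reads the same small table twice: once comma-separated, and once whitespace-separated using the `sep` argument with the regular expression `\s+` (one or more whitespace characters). The two-row table is just a cut-down version of our planet data.

```python
import io
import pandas as pd

# Comma-separated: the first line is used as the column names
csv_text = """Name,Mass,Distance,N_moons
Earth,1.0,1.0,1
Mars,0.107,1.5,2
"""
df_csv = pd.read_csv(io.StringIO(csv_text))

# Whitespace-separated: any run of spaces or tabs splits the values
ws_text = """Name   Mass   Distance  N_moons
Earth  1.0    1.0       1
Mars   0.107  1.5       2
"""
df_ws = pd.read_csv(io.StringIO(ws_text), sep=r'\s+')

print(df_csv)
print(df_ws)
```

With a real file, you would pass the path (like `filename` above) in place of the `io.StringIO(...)` object.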