# Selecting Data from a Data Frame
***
Once you've collected your data, saved it to a CSV and uploaded it into the notebook with Pandas, the next big thing you'll need to do is to **select which data you want to analyze**. Most of the time, you'll have data for different objects, data with rough trials and so on all in the same data set. <font color="#cccccc">*And if you don't, you should worry less about keeping a perfect spreadsheet :)*</font>

Selecting the right subset of your data from the data frame is tricky, but once you master it, you will save yourself lots of time!

First, let's look at our old data set from the [Pandas Quick Start Guide](/0%20Quick%20Start%20Guide%20-%20Pandas.ipynb).

In [31]:
import pandas as pd    # Importing the Pandas library and giving it a nickname, pd
myData = pd.read_csv("../data/pandas-masses.csv")   # Loading in the data file
myData

Unnamed: 0,Object,Trial,Volume (cm3),dV (cm3),Mass (g),dM (g),Extra notes
0,golf ball,1.0,41.0,1.0,62.0,1.0,measured
1,golf ball,2.0,39.0,1.0,58.0,1.0,measured
2,golf ball,3.0,37.0,1.0,56.0,1.0,measured
3,golf ball,4.0,40.0,1.0,62.0,1.0,measured
4,golf ball,5.0,42.0,1.0,64.0,1.0,measured
5,Taylor,1.0,920.0,10.0,630.0,10.0,measured
6,Human Heart,1.0,280.0,10.0,310.0,10.0,wiki
7,Coffee Cup,1.0,240.0,10.0,1100.0,100.0,measured
8,Table,1.0,1300.0,100.0,20000.0,1000.0,measured
9,Reid,1.0,66400.0,100.0,80000.0,1000.0,he told us
10,Strange Quark,1.0,0.0,1e-34,1.7e-25,1e-27,wiki
11,All humanity,1.0,490000000000000.0,100000000000.0,461000000000000.0,1000000000000.0,wiki


### `head()` function

A quick check to see if your data set is correct is to look at just the top few rows. <br>
You can do this using `head(n)` where `n` is the number of rows you want to see.

In [10]:
myData.head(2) # Displaying top 2 rows of the data file

Unnamed: 0,Object,Trial,Volume (cm3),dV (cm3),Mass (g),dM (g),Extra notes
0,golf ball,1.0,41.0,1.0,62.0,1.0,measured
1,golf ball,2.0,39.0,1.0,58.0,1.0,measured


### `tail()` function
Similarly, you can look at only the last `n` rows:

In [30]:
myData.tail(2) # Displaying the bottom 2 rows

Unnamed: 0,Object,Trial,Volume (cm3),dV (cm3),Mass (g),dM (g),Extra notes
10,Strange Quark,1.0,0.0,1e-34,1.7e-25,1e-27,wiki
11,All humanity,1.0,490000000000000.0,100000000000.0,461000000000000.0,1000000000000.0,wiki
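If you leave out `n`, both functions fall back to showing 5 rows. Here is a minimal sketch of that, using a made-up ten-row table called `sample` (not the real data file) so it runs on its own:

```python
import pandas as pd

# Hypothetical ten-row table, used only to illustrate the calls
sample = pd.DataFrame({"Trial": list(range(1, 11))})

top_default = sample.head()     # No argument: the first 5 rows
bottom_three = sample.tail(3)   # The last 3 rows

print(top_default)
print(bottom_three)
```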


## Selecting a specific column
***
Other than these two simple functions, choosing data in the DataFrame is almost always done using **square brackets**, `[ ... ]`. It can get pretty confusing, because you use square brackets for pretty much anything: selecting rows, columns, subsets of the data... but once you get the hang of it, it's also pretty convenient.

You can choose a column of the data frame by using **dataFrameName**`["columnName"]`.

In [13]:
myData["Volume (cm3)"]    # Looking at just volumes, regardless of any other data

0     4.100000e+01
1     3.900000e+01
2     3.700000e+01
3     4.000000e+01
4     4.200000e+01
5     9.200000e+02
6     2.800000e+02
7     2.400000e+02
8     1.300000e+03
9     6.640000e+04
10    0.000000e+00
11    4.900000e+14
Name: Volume (cm3), dtype: float64

If you want, you can save this column to a variable. This variable is a [Pandas Series](https://www.geeksforgeeks.org/python-pandas-series/), which is a lot like a [Numpy array](../2%20Python%20Basics/1%20Quick%20Start%20Guide%20-%20NumPy.ipynb). <br>
If it's a column of numbers, you can do all sorts of math to it! You can also use the same selection methods on the Series as on the Data Frame.

In [15]:
volumes = myData["Volume (cm3)"]   # Selecting the volumes column and saving it to a variable called "volumes"
volumes.head(2)                    # Displaying just the top 2 entries of volumes

0    41.0
1    39.0
Name: Volume (cm3), dtype: float64
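As a rough sketch of the kind of math you can do with a Series, here are the five golf ball volumes typed in by hand (so the example does not need the CSV file). The variable names `mean_volume` and `volumes_litres` are just for illustration:

```python
import pandas as pd

# The five golf ball volumes from the table above, typed in by hand
volumes = pd.Series([41.0, 39.0, 37.0, 40.0, 42.0], name="Volume (cm3)")

mean_volume = volumes.mean()       # Average of all entries
volumes_litres = volumes / 1000    # Arithmetic is applied to every entry at once

print(mean_volume)
print(volumes_litres.head(2))      # head() works on a Series too
```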

## Selecting rows you want

***

Very often you want to select only the data entries that satisfy a specific condition. For example, here I measured golf balls, strange quarks and other things; but I really want to look at just the golf balls.

To do this, we need to write a condition that we use when choosing the rows. For example, to choose the golf balls, we want all rows where **Object** is equal to **golf ball**.

You can write this as `myData["Object"] == "golf ball"`
* The left hand side selects the column **Object** from the Data Frame, like above
* The right hand side checks each row of the column: is it equal to "golf ball"?

<font color="#cccccc">*If you are curious, you can try putting this line into Python and seeing what it outputs. You will see a series that looks like "True, True, False, ..." depending on the object in each row. The next trick we show selects just the rows where this condition is equal to True!*</font>
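Here is a small sketch of what that condition produces on its own. It uses a hand-typed four-row table called `sample` in place of the full data set, so the names and rows are only for illustration:

```python
import pandas as pd

# A small hand-typed stand-in for a few rows of the data set (illustrative only)
sample = pd.DataFrame({
    "Object": ["golf ball", "golf ball", "Taylor", "Table"],
    "Mass (g)": [62.0, 58.0, 630.0, 20000.0],
})

is_golf_ball = sample["Object"] == "golf ball"   # One True/False value per row
print(is_golf_ball)
```

The result is a Series of `True`/`False` values with the same row numbers as the data frame.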

Then, using the square brackets, `[ ... ]` we choose all rows that satisfy this condition.

In [18]:
myData[ myData["Object"] == "golf ball" ]  # Selecting just the rows where the object is the golf ball

Unnamed: 0,Object,Trial,Volume (cm3),dV (cm3),Mass (g),dM (g),Extra notes
0,golf ball,1.0,41.0,1.0,62.0,1.0,measured
1,golf ball,2.0,39.0,1.0,58.0,1.0,measured
2,golf ball,3.0,37.0,1.0,56.0,1.0,measured
3,golf ball,4.0,40.0,1.0,62.0,1.0,measured
4,golf ball,5.0,42.0,1.0,64.0,1.0,measured


***
**Be careful!** A very common error here is to miss the first `myData` outside of the `[ ... ]`. Remember, if you were reading this code, you would read 
> *Select all rows from **myData** where the **column "Object" of myData** is equal to golf ball*. 

It's tempting to write `myData[ "Object" == "golf ball"]`, i.e. *Select all rows where object is golf ball*. But if you do this, Python compares the two strings `"Object"` and `"golf ball"`, which are definitely not the same, so the comparison is simply `False`. Pandas then looks for a column named `False`, cannot find one, and raises a `KeyError`. You want to tell Python to compare the values of the **Object column of myData**. So you need to say **myData** twice!
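If you want to see the mistake in action without breaking your notebook, the sketch below catches the error instead of crashing. The table `sample` and the variable `error_raised` are made up for this example:

```python
import pandas as pd

# A small hand-typed stand-in for the data set (illustrative only)
sample = pd.DataFrame({
    "Object": ["golf ball", "golf ball", "Taylor"],
    "Mass (g)": [62.0, 58.0, 630.0],
})

print("Object" == "golf ball")   # Plain string comparison, no data frame involved

try:
    sample["Object" == "golf ball"]   # Same as writing sample[False]
    error_raised = False
except KeyError:
    error_raised = True               # There is no column called False

print(error_raised)
```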

***

Like before, we can save this selection to a new variable, which will be a smaller Data Frame.

In [20]:
golf_balls = myData[ myData["Object"] == "golf ball"]  # Saving the selection to a new data frame
golf_balls.head(2)                                     # Outputting the top 2 rows of the new data frame

Unnamed: 0,Object,Trial,Volume (cm3),dV (cm3),Mass (g),dM (g),Extra notes
0,golf ball,1.0,41.0,1.0,62.0,1.0,measured
1,golf ball,2.0,39.0,1.0,58.0,1.0,measured


### Multiple Conditions

Sometimes, you want to select the data using more than one condition. Once you can use one condition, the rest is easy. For example, we want to choose rows where

1. The **Object** is **golf ball** `AND`
2. The **Mass** is *less than* **64** `AND`
3. The **Mass** is *more than* **56**

We can write each of these conditions separately first.

1. `myData["Object"] == "golf ball"`
2. `myData["Mass (g)"] < 64`
3. `myData["Mass (g)"] > 56`

Then, we just combine these conditions inside our selection, following two extra rules:

* Every separate condition must be in round brackets, `( ... )`, because Python evaluates `&` before comparisons like `==` and `<`
* `AND` is written using `&` operator

In [22]:
# Select all rows from myData where:
myData[ (myData["Object"] == "golf ball") &   # The object is golf ball AND
        (myData["Mass (g)"] < 64) &           # The mass is less than 64 g AND
        (myData["Mass (g)"] > 56)  ]          # The mass is more than 56 g

Unnamed: 0,Object,Trial,Volume (cm3),dV (cm3),Mass (g),dM (g),Extra notes
0,golf ball,1.0,41.0,1.0,62.0,1.0,measured
1,golf ball,2.0,39.0,1.0,58.0,1.0,measured
3,golf ball,4.0,40.0,1.0,62.0,1.0,measured
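One way to keep long selections readable is to save each condition to its own variable first and combine them afterwards. This is only a sketch on a hand-typed table `sample`; the variable names are made up. Because each condition is already worked out before `&` is applied, no extra round brackets are needed here:

```python
import pandas as pd

# Hand-typed copy of the golf ball masses plus one other object (illustrative only)
sample = pd.DataFrame({
    "Object": ["golf ball"] * 5 + ["Taylor"],
    "Mass (g)": [62.0, 58.0, 56.0, 62.0, 64.0, 630.0],
})

is_golf_ball  = sample["Object"] == "golf ball"   # Condition 1
not_too_heavy = sample["Mass (g)"] < 64           # Condition 2
not_too_light = sample["Mass (g)"] > 56           # Condition 3

selected = sample[is_golf_ball & not_too_heavy & not_too_light]
print(selected)
```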


## Combining selections
***

Now you know how to choose columns from the data frame, and how to choose specific rows, yay! 
You often need to do both of these things, which is not much harder.

For example, suppose you need to look at the volumes of all golf balls. To do this, you want to:

1. Choose only rows that measure the golf balls and
2. Choose the volumes column.

You already know how to do this in two steps:

In [23]:
golf_balls   = myData[ myData["Object"] == "golf ball"]  # Select just the rows that contain golf balls
golf_volumes = golf_balls["Volume (cm3)"]                # Look at the volumes column
golf_volumes.head(2)                                     # Output the top 2 volumes

0    41.0
1    39.0
Name: Volume (cm3), dtype: float64

If you want, you can do it all in one line by **combining brackets**! The first pair of brackets selects the rows, and the second pair picks the column from that result, exactly as in the two-step version.

In [28]:
golf_volumes = myData[ myData["Object"] == "golf ball"]["Volume (cm3)"]
golf_volumes.head(2)

0    41.0
1    39.0
Name: Volume (cm3), dtype: float64
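Pandas also has `.loc[rows, column]`, which does the row and column selection in a single pair of brackets. It is not used elsewhere in this guide, so treat this as an optional alternative, sketched on a small hand-typed table `sample`:

```python
import pandas as pd

# A small hand-typed stand-in for the data set (illustrative only)
sample = pd.DataFrame({
    "Object": ["golf ball", "golf ball", "Taylor"],
    "Volume (cm3)": [41.0, 39.0, 920.0],
})

# Rows where the object is a golf ball, and only the volumes column
golf_volumes = sample.loc[sample["Object"] == "golf ball", "Volume (cm3)"]
print(golf_volumes)
```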

## Using `iloc[]`

Sometimes, instead of selecting data rows based on some logical condition, you might want to select the rows based only on their row number. You've already seen two functions that do this: `head()` and `tail()`, which let you select the first/last few rows. 

`iloc[]` allows you to choose rows from anywhere in the data frame. <br>

The simplest way to use it is: **dataFrameName**`.iloc[start row number : end row number]`. Row numbers start at 0, and the row at the end number is never included.

For example:

In [32]:
myData.iloc[3:6]

Unnamed: 0,Object,Trial,Volume (cm3),dV (cm3),Mass (g),dM (g),Extra notes
3,golf ball,4.0,40.0,1.0,62.0,1.0,measured
4,golf ball,5.0,42.0,1.0,64.0,1.0,measured
5,Taylor,1.0,920.0,10.0,630.0,10.0,measured


`iloc` follows all the same selection rules as lists. You can check out the [Picking values from lists guide](../2%20Python%20Basics/6%20Picking%20Values%20from%20Lists.ipynb) for more ways to use `iloc`!
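A few list-style selections with `iloc`, sketched on a hand-typed six-row table `sample` (the variable names are only for illustration):

```python
import pandas as pd

# A small hand-typed stand-in for the data set (illustrative only)
sample = pd.DataFrame({
    "Object": ["golf ball", "golf ball", "Taylor", "Human Heart", "Coffee Cup", "Table"],
    "Mass (g)": [62.0, 58.0, 630.0, 310.0, 1100.0, 20000.0],
})

first_row   = sample.iloc[0]      # A single row number gives back one row as a Series
last_two    = sample.iloc[-2:]    # Negative numbers count from the end
every_other = sample.iloc[::2]    # A step of 2 takes rows 0, 2 and 4

print(first_row)
print(last_two)
print(every_other)
```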