# Selecting Rows

In the last lecture, we saw how to extract specific columns from a data frame. In many cases, we also need to extract specific rows. This operation is often called "filtering" -- we are filtering out the rows that we don't want, leaving the ones that we do. 

In [1]:
#standard imports
import pandas as pd
import numpy as np

From the last video, you should already have a file called palmer_penguins.csv in the same folder as this notebook. Please make sure this is the case.

__Now:__ let's read the CSV file into a data frame.

In [2]:
penguins=pd.read_csv("palmer_penguins.csv")

Let's restrict attention to only four columns:

In [3]:
cols=["Species", "Region", "Island", "Culmen Length (mm)"]
penguins=penguins[cols]

#use .head() to only display the top five rows
penguins.head()

,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7


The simplest way to select rows of data is by explicitly naming the value(s) of the index for the rows you want. Remember that the index is the set of bold numbers at the far left. To do this, you should use the `df.loc` attribute of the data frame, like this: 

In [4]:
#rows one through three (note the end point IS included here)
penguins.loc[1:3]


,Species,Region,Island,Culmen Length (mm)
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
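
The inclusive end point is specific to label-based slicing with `.loc`. Position-based slicing with `.iloc` follows the usual Python convention and excludes the end point. Here is a minimal sketch on a small made-up data frame (the name `toy` and its values are assumptions for illustration, not part of the penguins data):

```python
import pandas as pd

# Hypothetical toy data frame with the default integer index 0..4
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Label-based slice: both end points are included (labels 1, 2, 3)
by_label = toy.loc[1:3]

# Position-based slice: the end point is excluded (positions 1, 2)
by_position = toy.iloc[1:3]

print(by_label["value"].tolist())     # [20, 30, 40]
print(by_position["value"].tolist())  # [20, 30]
```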


You can also select rows by passing a list of index values. The rows come back in the order given in the list.

In [9]:
rows=[1,4,0]
s=penguins.loc[rows]


Note: Indexing with `.loc` works like looking up keys in a `dict`. Rows are found by their index label, not by their position.

In [19]:
# note that this works even though s has only three rows,
# because .loc looks up the index label 4, and s does have a row with that label
s.loc[4]

Species               Adelie Penguin (Pygoscelis adeliae)
Region                                             Anvers
Island                                          Torgersen
Culmen Length (mm)                                   36.7
Name: 4, dtype: object

In [20]:
# on the other hand, this raises a KeyError, because s has no row with index label 2
#s.loc[2]
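
To make the dict analogy concrete, the sketch below rebuilds the same situation on a made-up data frame (`toy` and its values are assumptions for illustration). Looking up a missing label with `.loc` raises a `KeyError`, just as a missing key does in a dictionary, while `.iloc` looks rows up by position instead.

```python
import pandas as pd

# Hypothetical toy data frame with the default integer index 0..4
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Select rows by a list of labels; the result keeps those labels, in that order
s = toy.loc[[1, 4, 0]]
print(s.index.tolist())        # [1, 4, 0]

# .loc looks up the label 4, which exists even though s has only three rows
print(s.loc[4, "value"])       # 50

# The label 2 is not in the index of s, so .loc raises a KeyError
try:
    s.loc[2]
    found = True
except KeyError:
    found = False
print(found)                   # False

# .iloc uses positions: position 2 is the third row, which carries label 0
print(s.iloc[2]["value"])      # 10
```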

## Boolean Indexing

While it's good to know how to refer to rows by index, this is not the most useful way to filter data frames. Boolean indexing instead allows us to filter the rows of a data set based on one or more conditions. Boolean indexing in data frames is very similar to Boolean indexing in `numpy` arrays. 
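
As a reminder of the `numpy` version, here is a minimal sketch with made-up values (the array `lengths` is an assumption for illustration). A comparison produces a Boolean array, and indexing with that array keeps only the entries where it is `True`.

```python
import numpy as np

# Hypothetical measurements, not taken from the penguins data
lengths = np.array([39.1, 39.5, 40.3, 36.7])

# Comparing an array to a number gives one True/False per entry
mask = lengths < 40
print(mask.tolist())            # [True, True, False, True]

# Indexing with the mask keeps only the entries where the mask is True
print(lengths[mask].tolist())   # [39.1, 39.5, 36.7]
```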

In [21]:
#Recall, our data looks like this 
penguins

,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
...,...,...,...,...
339,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,
340,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,46.8
341,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,50.4
342,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,45.2


Let's use Boolean indexing to restrict attention to penguins with culmen length below 40 mm.

In [22]:
#create a mask
culmen_mask = penguins['Culmen Length (mm)'] < 40
culmen_mask


0       True
1       True
2      False
3      False
4       True
       ...  
339    False
340    False
341    False
342    False
343    False
Name: Culmen Length (mm), Length: 344, dtype: bool

In [23]:
penguins[culmen_mask]

,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.3
6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,38.9
...,...,...,...,...
146,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,39.2
147,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,36.6
148,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,36.0
149,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,37.8
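
Notice that row 3, whose culmen length is missing, does not appear in the result. A comparison such as `<` with a missing value evaluates to `False`, so the mask drops such rows. A minimal sketch on made-up data (the `toy` data frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing measurement
toy = pd.DataFrame({"length": [39.1, np.nan, 40.3, 36.7]})

# NaN < 40 evaluates to False, so the row with the missing value is excluded
mask = toy["length"] < 40
print(mask.tolist())             # [True, False, False, True]
print(toy[mask].index.tolist())  # [0, 3]
```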


Now, let's restrict to penguins living on Torgersen Island.

In [24]:
#create a new mask
island_mask=penguins['Island']=="Torgersen"

#apply Boolean indexing
penguins[island_mask]

,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.3
6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,38.9
7,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.2
8,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,34.1
9,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,42.0


Penguins that live on Torgersen and have a short culmen. (You need the `&` operator here; Python's `and` doesn't work on masks.)

In [25]:
penguins[culmen_mask & island_mask]

,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.3
6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,38.9
7,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.2
8,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,34.1
10,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,37.8
11,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,37.8
13,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,38.6
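
The `and` keyword fails on masks because Python tries to reduce each mask to a single `True` or `False`, which is ambiguous for a whole column, so pandas raises a `ValueError`. The `&` operator instead combines the masks entry by entry. When the conditions are written inline rather than stored in variables, each one needs its own parentheses, because `&` binds more tightly than comparisons such as `<`. A minimal sketch on made-up data (the `toy` data frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the penguins data
toy = pd.DataFrame({
    "island": ["Torgersen", "Torgersen", "Dream", "Dream"],
    "length": [39.1, 42.0, 36.6, 41.5],
})

short = toy["length"] < 40
torgersen = toy["island"] == "Torgersen"

# Element-wise "and": True only where both masks are True
print((short & torgersen).tolist())   # [True, False, False, False]

# Python's `and` asks for the truth value of a whole Series and fails
try:
    short and torgersen
    and_worked = True
except ValueError:
    and_worked = False
print(and_worked)                      # False

# Inline conditions need parentheses around each comparison
both = toy[(toy["length"] < 40) & (toy["island"] == "Torgersen")]
print(both.index.tolist())             # [0]
```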


Penguins that live on Torgersen or have a short culmen. (Use the bitwise or operator `|`, not `or`.)

In [26]:
penguins[culmen_mask | island_mask]

,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
...,...,...,...,...
146,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,39.2
147,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,36.6
148,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,36.0
149,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,37.8


## Getting Rid of NaNs

In [27]:
# The method .isna() returns True for each entry that is NaN
nans=penguins["Culmen Length (mm)"].isna()

#Here are the rows with NaN in the culmen length
penguins[nans]

,Species,Region,Island,Culmen Length (mm)
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
339,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,


Typically, we want to keep only the rows WITHOUT NaNs.

In [28]:
# np.invert flips each True/False in the mask, so this keeps the rows that are not NaN
penguins=penguins[np.invert(nans)]

In [29]:
penguins.shape

(342, 4)
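
`np.invert` is one of several equivalent ways to flip a Boolean mask. The sketch below, on made-up data (the `toy` data frame is an assumption for illustration), suggests that the `~` operator, the `.notna()` method, and `.dropna()` with the `subset` argument all keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing measurements
toy = pd.DataFrame({
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "length": [39.1, np.nan, np.nan, 46.8],
})

nans = toy["length"].isna()

# Four ways to keep only the rows whose length is present
a = toy[np.invert(nans)]
b = toy[~nans]
c = toy[toy["length"].notna()]
d = toy.dropna(subset=["length"])

print(a.equals(b) and b.equals(c) and c.equals(d))  # True
print(a.shape)                                       # (2, 2)
```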