# Indexing, Slicing and Subsetting DataFrames in Python

## Loading our data again

Since this is a new notebook we'll need to import the pandas library again, and re-initialize the surveys_df DataFrame we were using in all of our examples.

In [1]:
import pandas as pd

In [2]:
surveys_df = pd.read_csv('https://ndownloader.figshare.com/files/2292172')

## Selecting data using Labels (Column Headings)

To get a single column from our DataFrame, we pass the name of the column in square brackets. This is similar to getting a single *value* from a *dictionary* by passing in its *key*.

In [3]:
print(surveys_df['species_id'].head())

0    NL
1    NL
2    DM
3    DM
4    DM
Name: species_id, dtype: object


You can do the exact same thing using **dot-notation**. The downsides are that dot-notation does not work for column names that contain a space, and that Jupyter notebooks don't highlight the column name as clearly.

In [4]:
print(surveys_df.species_id.head())

0    NL
1    NL
2    DM
3    DM
4    DM
Name: species_id, dtype: object
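
As a minimal sketch of those limits, the example below uses a small made-up DataFrame (`toy_df` and its column names are assumptions for illustration, not part of the surveys data). It also shows one further pitfall not mentioned above: a column whose name matches an existing DataFrame method, such as `count`, cannot be reached with dot-notation either.

```python
import pandas as pd

# Hypothetical toy DataFrame (not the surveys data): one column name contains
# a space, and another shares its name with the DataFrame.count() method.
toy_df = pd.DataFrame({
    'species_id': ['NL', 'DM', 'DM'],
    'plot id': [2, 3, 2],
    'count': [5, 7, 1],
})

# Square brackets work for every column name
print(toy_df['plot id'].tolist())
print(toy_df['count'].tolist())

# Dot-notation works for a simple name like species_id ...
print(toy_df.species_id.tolist())

# ... but toy_df.count is the built-in method, not our column
print(callable(toy_df.count))
```

Writing `toy_df.plot id` is simply a Python syntax error, which is why square brackets are the safer habit.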


You can save the contents of 1 column to a variable, and then run further commands on it separately. Here we figure out the value_counts() like we did previously, except this time we do it on the *species_column* variable.

In [5]:
species_column = surveys_df['species_id']
print(species_column.value_counts())

DM    10596
PP     3123
DO     3027
PB     2891
RM     2609
DS     2504
OT     2249
PF     1597
PE     1299
NL     1252
OL     1006
PM      899
AH      437
AB      303
SS      248
SH      147
SA       75
RF       75
CB       50
BA       46
SO       43
SF       43
DX       40
PC       39
PL       36
PH       32
CQ       16
CM       13
OX       12
UR       10
PI        9
RO        8
PG        8
UP        8
PX        6
PU        5
SU        5
UL        4
US        4
RX        2
AS        2
ZL        2
CT        1
SC        1
ST        1
CU        1
CS        1
CV        1
Name: species_id, dtype: int64


Be aware that when we subset a DataFrame down to 1 column, the result is a **Series**, which is a different data type from a DataFrame. Some functions can only be run on a Series, and some can only be run on a DataFrame.

In [6]:
print(type(species_column))

<class 'pandas.core.series.Series'>
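
The difference can be seen side by side in a small sketch. The DataFrame here is a made-up stand-in (`toy_df` is an assumed name, not the surveys data): single brackets return a Series, while a one-item list returns a one-column DataFrame. The `.str` accessor is used as one example of something available on a text Series but not on a DataFrame.

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for surveys_df
toy_df = pd.DataFrame({
    'species_id': ['NL', 'NL', 'DM'],
    'plot_id': [2, 3, 2],
})

one_column = toy_df['species_id']        # column name in brackets -> Series
one_column_df = toy_df[['species_id']]   # list with one name -> DataFrame

print(type(one_column).__name__)
print(type(one_column_df).__name__)

# A Series has one dimension, a DataFrame has two
print(one_column.shape, one_column_df.shape)

# The .str accessor exists on a Series of text, but not on a DataFrame
print(hasattr(one_column, 'str'), hasattr(one_column_df, 'str'))
```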


You can also pass in more than 1 column as a *list* to subset a DataFrame. Notice that since the list of columns is a Python *list*, we wrap it with an extra set of square brackets.

In [7]:
print(surveys_df[['species_id','plot_id']].head())

  species_id  plot_id
0         NL        2
1         NL        3
2         DM        2
3         DM        7
4         DM        3


Here you can see that the extra square-brackets were necessary because we were essentially creating a brand new list on the spot. If we save a list of columns into a variable, we can pass that right into the DataFrame and get the same thing.

It's also important to notice that since Python *lists* are **ordered**, this is how you can re-order the columns in a Pandas DataFrame.

In [8]:
column_list = ['plot_id','species_id']
print(surveys_df[column_list].head())

   plot_id species_id
0        2         NL
1        3         NL
2        2         DM
3        7         DM
4        3         DM


## Extracting Range based Subsets: Slicing

Back in *Lesson 00 - Introduction to Python*, we learned how to create lists, and how to access a single item in a list.

In [9]:
a = ['a','b','c','d']

Remember that Python is 0-indexed, so counting items starts with the number 0.

In [10]:
a[0]

'a'

We can also get a **range** of values from a list, by providing the index of the first item, a colon, and then the index *after* the last item we want to pull out.

In [11]:
a[0:3]

['a', 'b', 'c']

## Slicing Subsets of Rows in Python

We can **slice** a DataFrame using the same range notation, which works on the row number of the DataFrame.

In [12]:
surveys_df[0:3]

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN


Python range notation has a shortcut, where if you leave out the starting index, it defaults to 0.

In [13]:
surveys_df[:5]

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN


In [14]:
df_slice = surveys_df[:2]
print(df_slice)

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  \
0          1      7   16  1977        2         NL   M             32.0   
1          2      7   16  1977        3         NL   M             33.0   

   weight  
0     NaN  
1     NaN  
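
The slicing rules above can be checked on a small example. This is a sketch on a made-up DataFrame (`toy_df` is an assumed stand-in with a default 0, 1, 2, ... index), showing that the start is included, the stop is excluded, and either end can be left out.

```python
import pandas as pd

# Hypothetical toy DataFrame with the default integer index
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3, 4, 5],
    'species_id': ['NL', 'NL', 'DM', 'DM', 'DM'],
})

first_three = toy_df[0:3]   # rows at positions 0, 1 and 2 (3 is excluded)
same_rows = toy_df[:3]      # leaving out the start means "from the beginning"

print(first_three.equals(same_rows))
print(first_three.equals(toy_df.iloc[0:3]))   # same rows as positional iloc slicing

# Leaving out the stop means "through the last row"
print(len(toy_df[3:]))
```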


## Copying Objects vs Referencing Objects in Python

Let's start out with an example that shows how *copying* and *referencing* objects works in Pandas.

In [15]:
# Using the copy() method
true_copy_surveys_df = surveys_df.copy()

# Using the '=' operator
ref_surveys_df = surveys_df

You might think that the code ref_surveys_df = surveys_df creates a fresh distinct copy of the surveys_df DataFrame object. However, using the = operator in the simple statement y = x does not create a copy of our DataFrame. Instead, y = x creates a new variable y that references the same object that x refers to. To state this another way, there is only one object (the DataFrame), and both x and y refer to it.

In contrast, the copy() method for a DataFrame creates a true copy of the DataFrame.

Let’s look at what happens when we reassign the values within a subset of the DataFrame that references another DataFrame object:

In [16]:
# Assign the value `0` to the first three rows of data in the DataFrame
ref_surveys_df[0:3] = 0

Let’s try the following code:

In [24]:
# ref_surveys_df was created using the '=' operator
ref_surveys_df.head()

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          0      0    0     0        0          0   0              0.0     0.0
1          0      0    0     0        0          0   0              0.0     0.0
2          0      0    0     0        0          0   0              0.0     0.0
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN


In [25]:
# surveys_df is the original dataframe
surveys_df.head()

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          0      0    0     0        0          0   0              0.0     0.0
1          0      0    0     0        0          0   0              0.0     0.0
2          0      0    0     0        0          0   0              0.0     0.0
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN


What is the difference between these two dataframes?

There is none. When we assigned the value 0 to the first 3 rows using the ref_surveys_df DataFrame, **the surveys_df DataFrame was modified too**. Remember that we created the reference ref_surveys_df above when we ran ref_surveys_df = surveys_df, so surveys_df and ref_surveys_df refer to the exact same DataFrame object. If a change is made through either name, the other name sees the same change, because there is only one object.
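
To see the contrast with a true copy, here is a minimal sketch on a small made-up DataFrame (`toy_df` and the derived variable names are assumptions for illustration). It uses only integer columns so that assigning 0 does not change any column's data type.

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for surveys_df
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3],
    'plot_id': [2, 3, 2],
})

true_copy_toy_df = toy_df.copy()   # a separate object with its own data
ref_toy_df = toy_df                # a second name for the same object

print(ref_toy_df is toy_df)
print(true_copy_toy_df is toy_df)

# Assign 0 to the first two rows through the reference
ref_toy_df[0:2] = 0

# The change shows up under the original name, but not in the copy
print(toy_df['plot_id'].tolist())
print(true_copy_toy_df['plot_id'].tolist())
```

The `is` operator checks whether two names point at the same object, which is a quick way to tell a reference from a copy.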

Okay, that’s enough of that. Let’s create a brand new clean dataframe from the original data CSV file.

In [26]:
surveys_df = pd.read_csv('https://ndownloader.figshare.com/files/2292172')

## Slicing Subsets of Rows and Columns in Python

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

* **loc** is primarily label-based indexing. Integers may be used, but they are interpreted as labels.
* **iloc** is primarily **i**nteger-based (positional) indexing.

To select a subset of rows and columns from our DataFrame, we can use the iloc indexer. For example, we can select month, day and year (columns 2, 3 and 4 if we start counting at 1, which are positions 1, 2 and 3 when counting from 0) for the first three rows like this:

In [28]:
# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]

   month  day  year
0      7   16  1977
1      7   16  1977
2      7   16  1977


Just like with list range indexing, we can provide a lone colon if we want to return every row:

In [30]:
surveys_df.iloc[:,0:3]

   record_id  month  day
0          1      7   16
1          2      7   16
2          3      7   16
3          4      7   16
4          5      7   16
5          6      7   16
6          7      7   16
7          8      7   16
8          9      7   16
9         10      7   16


Indexing by labels with loc differs from indexing by integers with iloc. With loc, both the start bound and the stop bound are inclusive. Integers can be used with loc, but they refer to the index label and not the position. For example, selecting rows 1:4 with loc returns a different result than selecting rows 1:4 with iloc.

In [31]:
surveys_df.loc[[0,2],['species_id','year']]

  species_id  year
0         NL  1977
2         DM  1977
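
The difference between the two kinds of 1:4 slice can be sketched on a small made-up DataFrame (`toy_df` and `relabelled` are assumed names, not part of the surveys data). The second half re-labels the rows with record_id to show that loc follows labels while iloc follows positions.

```python
import pandas as pd

# Hypothetical toy DataFrame with the default index labels 0..5
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3, 4, 5, 6],
    'year': [1977] * 6,
})

by_label = toy_df.loc[1:4]      # labels 1 through 4, stop bound included
by_position = toy_df.iloc[1:4]  # positions 1, 2 and 3, stop bound excluded

print(by_label['record_id'].tolist())
print(by_position['record_id'].tolist())

# With record_id as the index, the labels no longer match the positions
relabelled = toy_df.set_index('record_id')
print(relabelled.loc[1:4].index.tolist())    # rows labelled 1, 2, 3, 4
print(relabelled.iloc[1:4].index.tolist())   # second, third and fourth rows
```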


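We can also select a single data value by giving its row and column location within the DataFrame. With iloc we pass integer positions, as in the syntax below; the example after it uses loc with a row label and a column name instead:

```python
# Syntax for using iloc indexing to find a specific data element
# (dat is a placeholder for any DataFrame; row and column are integer positions)
dat.iloc[row, column]
```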

In [23]:
surveys_df.loc[0,'year']

1977
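
As a sketch of how the two indexers reach the same cell, the example below uses a small made-up DataFrame (`toy_df` is an assumed stand-in whose first four columns mirror the surveys data). The year column sits at position 3, so `iloc[0, 3]` and `loc[0, 'year']` point at the same value.

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for surveys_df
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3],
    'month': [7, 7, 7],
    'day': [16, 16, 16],
    'year': [1977, 1977, 1977],
})

value_by_position = toy_df.iloc[0, 3]    # row position 0, column position 3
value_by_label = toy_df.loc[0, 'year']   # row label 0, column name 'year'

print(value_by_position == value_by_label)
print(int(value_by_position))
```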