### Subset DataFrame
- We’ve learned how to inspect, describe and summarize a Pandas DataFrame.

- Now we’ll learn how to extract a subset of a Pandas DataFrame. This is very useful because we often want to perform operations on subsets of our data.

- There are many different ways of subsetting a Pandas DataFrame. You may need to select specific columns with all rows, specific rows with all columns, or rows and columns that meet a specific criterion.

  - All the different ways of subsetting can be divided into 4 categories: `Selection`, `Slicing`, `Indexing` and `Filtering`.

- `Selection`  - Column selection
- `Slicing`    - Row selection
- `Indexing`   - Combines Selection and Slicing
- `Filtering`  - Selection based on conditions
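
As a rough preview of the four categories, the sketch below applies each one to a small hand-made DataFrame. The name `df_demo` and its values are invented for illustration and are not part of the iris data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 7.0],
    "width": [3.5, 3.0, 2.5, 3.2],
    "label": ["a", "a", "b", "b"],
})

selected = df_demo[["length", "label"]]    # Selection: whole columns, all rows
sliced = df_demo.iloc[1:3]                 # Slicing: rows at positions 1 and 2, all columns
indexed = df_demo.loc[[0, 3], ["width"]]   # Indexing: chosen rows and chosen columns
filtered = df_demo[df_demo["length"] > 5]  # Filtering: rows that satisfy a condition

print(selected.shape, sliced.shape, indexed.shape, filtered.shape)
```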

In [2]:
import pandas as pd

In [3]:
df_iris=pd.read_csv("data/iris.csv")

In [15]:
df_iris

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


### Selection

- When we grab entire column(s), it is referred to as Selection. The selected column(s) contain all the rows.

**Selecting a single column using the column name**

- We can select a single column of a Pandas DataFrame using its column name. 

In [19]:
df_iris["Species"]          #The output is a Series which is a single column

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object

In [18]:
df_iris.Species      #Attribute access works the same way

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object
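
Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below illustrates this on an invented `df_demo` whose column names (`count`, `sepal length`) are chosen purely to show the limitation.

```python
import pandas as pd

# Hypothetical toy data: "count" clashes with DataFrame.count,
# and "sepal length" contains a space
df_demo = pd.DataFrame({
    "Species": ["x", "y"],
    "count": [1, 2],
    "sepal length": [5.1, 4.9],
})

# For an ordinary name, both forms return the same Series
same_result = df_demo["Species"].equals(df_demo.Species)

# df_demo.count is the DataFrame method, not the column
count_is_method = callable(df_demo.count)

# Bracket access always reaches the column
count_values = df_demo["count"].tolist()
sepal_values = df_demo["sepal length"].tolist()

print(same_result, count_is_method, count_values, sepal_values)
```

Bracket access is therefore the safer default when column names are not under your control.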

In [21]:
df_iris[["Species"]]        #The output is a DataFrame with a single column

Unnamed: 0,Species
0,Iris-setosa
1,Iris-setosa
2,Iris-setosa
3,Iris-setosa
4,Iris-setosa
...,...
145,Iris-virginica
146,Iris-virginica
147,Iris-virginica
148,Iris-virginica
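
The difference between single and double brackets can be checked directly by looking at the type and shape of each result. This is a minimal sketch on an invented `df_demo`, not on the iris data.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({"Id": [1, 2, 3], "Species": ["x", "y", "z"]})

as_series = df_demo["Species"]     # single brackets: one-dimensional Series
as_frame = df_demo[["Species"]]    # double brackets: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```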


**Selecting multiple columns using the column names**

- We can select multiple columns of a Pandas DataFrame using their column names. We put the column names inside a list:

In [25]:
df_iris[["SepalLengthCm","Species"]]       #This time, the output is a Pandas DataFrame

Unnamed: 0,SepalLengthCm,Species
0,5.1,Iris-setosa
1,4.9,Iris-setosa
2,4.7,Iris-setosa
3,4.6,Iris-setosa
4,5.0,Iris-setosa
...,...,...
145,6.7,Iris-virginica
146,6.3,Iris-virginica
147,6.5,Iris-virginica
148,6.2,Iris-virginica


### Position and Label Based Indexing: `df.iloc` and `df.loc`

You have seen some ways of selecting rows and columns from dataframes. Let's now see some other ways of indexing dataframes, which pandas recommends, since they are more explicit (and less ambiguous).

There are two main ways of indexing dataframes:
1. Position based indexing using ```df.iloc```
2. Label based indexing using ```df.loc```



**.loc[]**\
`.loc` provides purely label-based indexing. When slicing, the end bound is also included. Integers are valid labels, but they refer to the label and not the position.

- `loc` stands for location; it retrieves data by index label (strings, integers or any other label type)
- syntax-- df.loc[row,column]


**.iloc[]**\
`.iloc` provides purely integer position-based indexing. Like Python and NumPy, positions are 0-based, and the end bound of a slice is excluded.

- `iloc` stands for index location; it retrieves data by integer position, whatever the index labels are
- syntax-- df.iloc[row,column]
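
The label/position distinction is easiest to see when the index is not the default 0, 1, 2, ... The sketch below assumes an invented `df_demo` with integer labels 10, 20, 30, so that labels and positions differ.

```python
import pandas as pd

# Hypothetical toy data with integer labels that are not positions
df_demo = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = df_demo.loc[10, "value"]   # row whose label is 10
by_position = df_demo.iloc[0, 0]      # row at position 0, column at position 0

# Label 0 does not exist, so .loc raises KeyError
try:
    df_demo.loc[0, "value"]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```

With the iris data the default index happens to make labels and positions coincide, which is why `.loc[0]` and `.iloc[0]` return the same row there.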


**Selecting a single column using the .loc attribute**

- The same results can be obtained using the .loc attribute which selects Pandas data by label (column name).

In [29]:
df_iris.loc[:,'SepalLengthCm']        #selecting all rows in SepalLengthCm column

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: SepalLengthCm, Length: 150, dtype: float64

**Selecting multiple columns using the `.loc` attribute**


In [35]:
df_iris.loc[:,'SepalLengthCm':'PetalLengthCm']    #all rows in SepalLengthCm to PetalLengthCm

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4
...,...,...,...
145,6.7,3.0,5.2
146,6.3,2.5,5.0
147,6.5,3.0,5.2
148,6.2,3.4,5.4


In [36]:
df_iris.loc[:,['SepalLengthCm','Species']]   #A list selects specific columns instead of a slice

Unnamed: 0,SepalLengthCm,Species
0,5.1,Iris-setosa
1,4.9,Iris-setosa
2,4.7,Iris-setosa
3,4.6,Iris-setosa
4,5.0,Iris-setosa
...,...,...
145,6.7,Iris-virginica
146,6.3,Iris-virginica
147,6.5,Iris-virginica
148,6.2,Iris-virginica


**Selecting a single column using the `.iloc` attribute**

In [40]:
df_iris

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [41]:
df_iris.iloc[:,5]         #accessing a column by its integer position

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object

**Selecting consecutive columns using the `.iloc` attribute (The easy way)**

In [48]:
df_iris.iloc[:,3:5]   

Unnamed: 0,PetalLengthCm,PetalWidthCm
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2
...,...,...
145,5.2,2.3
146,5.0,1.9
147,5.2,2.0
148,5.4,2.3


**Selecting multiple columns using the .iloc attribute (list of positions)**

In [47]:
df_iris.iloc[:,[3,4,1]]


Unnamed: 0,PetalLengthCm,PetalWidthCm,SepalLengthCm
5,1.7,0.4,5.4
6,1.4,0.3,4.6
2,1.3,0.2,4.7
3,1.5,0.2,4.6
4,1.4,0.2,5.0
56,4.7,1.6,6.3


### Slicing

- When we want to extract certain rows from the DataFrame, it is referred to as Slicing. The extracted rows are called slices and contain all the columns.

**Selecting a single row using the .iloc attribute**
- The easiest way to extract a single row is to use the row's integer position inside .iloc[].

In [52]:
df_iris.iloc[0]          #only 0th index row details
#The output is a Pandas Series which contains the row values.

Id                         1
SepalLengthCm            5.1
SepalWidthCm             3.5
PetalLengthCm            1.4
PetalWidthCm             0.2
Species          Iris-setosa
Name: 0, dtype: object

In [53]:
df_iris.iloc[[0]]       #The output is a Pandas DataFrame which contains the row values.

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa


**Selecting multiple rows using the `.iloc` attribute**
- We can extract multiple rows of a Pandas DataFrame using their integer positions. We put the positions inside a list:

In [55]:
df_iris.iloc[[2,4,7,87]]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
2,3,4.7,3.2,1.3,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
7,8,5.0,3.4,1.5,0.2,Iris-setosa
87,88,6.3,2.3,4.4,1.3,Iris-versicolor


**Selecting the last few rows**

- Negative indices count rows from the bottom: -1 is the last row, -2 the one before it, and so on.

In [57]:
df_iris.iloc[[-4,-10,-3,-11]]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
146,147,6.3,2.5,5.0,1.9,Iris-virginica
140,141,6.7,3.1,5.6,2.4,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
139,140,6.9,3.1,5.4,2.1,Iris-virginica
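
Negative positions also work in slices, which gives a compact way to take the last few rows in their original order. The following is a small sketch on an invented `df_demo`; the comparison with `tail()` is included only as a sanity check.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

last_row = df_demo.iloc[-1]       # Series holding the last row
last_three = df_demo.iloc[-3:]    # DataFrame holding the last three rows

# A negative slice selects the same rows as tail()
matches_tail = last_three.equals(df_demo.tail(3))

print(last_row["value"], last_three["value"].tolist(), matches_tail)
```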


### Indexing

- When we combine column selection and row slicing, it is referred to as Indexing. Here, we can use .loc and .iloc attributes of a Pandas DataFrame.



**Selecting a single value using the .iloc attribute**

- If we specify a single row and a single column, the intersection is a single value!

In [58]:
df_iris.iloc[0,0] 
#Keep in mind that we cannot use column or row names inside .iloc[]. Only integer positions can be used.

1

**Selecting a single value using the .loc attribute**

In [59]:
df_iris.loc[0,"Species"]

'Iris-setosa'

**Selecting multiple rows and columns using the .iloc attribute**

In [60]:
df_iris.iloc[[23,12,56],[1,5]]    

Unnamed: 0,SepalLengthCm,Species
23,5.1,Iris-setosa
12,4.8,Iris-setosa
56,6.3,Iris-versicolor


**Selecting multiple rows and columns using the .loc attribute**

In [61]:
df_iris.loc[[34,12,10,11],['Species','SepalLengthCm']]

Unnamed: 0,Species,SepalLengthCm
34,Iris-setosa,4.9
12,Iris-setosa,4.8
10,Iris-setosa,5.4
11,Iris-setosa,4.8


**Selecting consecutive rows and columns using the .loc and .iloc attributes (The easy way)**

In [63]:
df_iris.iloc[0:7,1:4]        #slicing rows and columns

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4
5,5.4,3.9,1.7
6,4.6,3.4,1.4


In [64]:
df_iris.loc[0:7,'SepalLengthCm':"PetalLengthCm"]

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4
5,5.4,3.9,1.7
6,4.6,3.4,1.4
7,5.0,3.4,1.5
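
The two outputs above differ by one row because `.iloc` excludes the end of a slice while `.loc` includes it. A minimal sketch of the same behaviour, on an invented `df_demo` with the default integer index:

```python
import pandas as pd

# Hypothetical toy data with the default index 0..9
df_demo = pd.DataFrame({"a": range(10), "b": range(10, 20)})

by_position = df_demo.iloc[0:7]   # positions 0..6, end excluded
by_label = df_demo.loc[0:7]       # labels 0..7, end included

print(len(by_position), len(by_label))
print(by_position.index.tolist())
print(by_label.index.tolist())
```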


### Filtering

- When we select rows and columns based on specific criteria or conditions, it is referred to as Filtering. We can also combine the above-discussed methods with this.

**Filtering based on a single criterion with all columns**

- Let’s subset our data when SepalLengthCm > 5. Here, we select all the columns for the rows where SepalLengthCm > 5.

In [67]:
df_iris["SepalLengthCm"]>5
#The output is a Series of boolean data type. We can use this Series to get the required subset of the data.

0       True
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149     True
Name: SepalLengthCm, Length: 150, dtype: bool

In [69]:
df_iris[df_iris["SepalLengthCm"]>5]
#A boolean Series inside df[...] keeps the rows marked True and returns a DataFrame

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
5,6,5.4,3.9,1.7,0.4,Iris-setosa
10,11,5.4,3.7,1.5,0.2,Iris-setosa
14,15,5.8,4.0,1.2,0.2,Iris-setosa
15,16,5.7,4.4,1.5,0.4,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
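
It can help to store the boolean Series in a variable (often called a mask) before using it. The sketch below, on an invented `df_demo`, counts the matching rows and shows that the filtered result keeps the original index labels, which is why the output above jumps from 0 to 5 to 10.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 5.4, 4.6],
    "Species": ["s1", "s2", "s3", "s4"],
})

mask = df_demo["SepalLengthCm"] > 5   # boolean Series, one value per row
n_matching = int(mask.sum())          # True counts as 1, so the sum is the number of matches
subset = df_demo[mask]                # rows where the mask is True

print(n_matching)
print(subset.index.tolist())          # original labels are preserved
```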


**Filtering based on a single criterion with a few columns**

- Let’s subset our data when SepalLengthCm > 5. This time, we select only 1 column for the rows where SepalLengthCm > 5. For this, we can combine the above filtering technique with .loc[].

In [70]:
df_iris.loc[df_iris["SepalLengthCm"]>5,["Species"]]

Unnamed: 0,Species
0,Iris-setosa
5,Iris-setosa
10,Iris-setosa
14,Iris-setosa
15,Iris-setosa
...,...
145,Iris-virginica
146,Iris-virginica
147,Iris-virginica
148,Iris-virginica


**Filtering based on two criteria with & operator (Same column)**

- Let’s subset our data when SepalLengthCm > 5 & SepalLengthCm < 6. Here, we use two conditions and combine them with the & operator. Each condition should be enclosed in parentheses, because & binds more tightly than the comparison operators.

In [74]:
df_iris[(df_iris["SepalLengthCm"]> 5) & (df_iris["SepalLengthCm"]< 6)]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
5,6,5.4,3.9,1.7,0.4,Iris-setosa
10,11,5.4,3.7,1.5,0.2,Iris-setosa
14,15,5.8,4.0,1.2,0.2,Iris-setosa
15,16,5.7,4.4,1.5,0.4,Iris-setosa
16,17,5.4,3.9,1.3,0.4,Iris-setosa
17,18,5.1,3.5,1.4,0.3,Iris-setosa
18,19,5.7,3.8,1.7,0.3,Iris-setosa
19,20,5.1,3.8,1.5,0.3,Iris-setosa
20,21,5.4,3.4,1.7,0.2,Iris-setosa
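
The need for parentheses comes from Python's operator precedence: `&` is evaluated before `>` and `<`. The sketch below uses an invented `df_demo` and shows that the unparenthesized form fails instead of filtering. The exact exception type may depend on the column dtype and pandas version, so both `TypeError` and `ValueError` are caught.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({"SepalLengthCm": [4.9, 5.4, 5.8, 6.3]})
col = df_demo["SepalLengthCm"]

# With parentheses: each comparison is evaluated first, then combined element-wise
mask = (col > 5) & (col < 6)
print(df_demo[mask])

# Without parentheses Python reads this as: col > (5 & col) < 6
try:
    df_demo[col > 5 & col < 6]
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print("failed with", type(err).__name__)
```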


**Filtering based on two criteria with the between() method**

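- Similar filtering can be achieved using the between() method. Note that between() includes both bounds by default, so rows where SepalLengthCm is exactly 5.0 or 6.0 are also returned, unlike the strict > and < comparison above.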

In [72]:
df_iris[df_iris['SepalLengthCm'].between(5,6)]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
5,6,5.4,3.9,1.7,0.4,Iris-setosa
7,8,5.0,3.4,1.5,0.2,Iris-setosa
10,11,5.4,3.7,1.5,0.2,Iris-setosa
...,...,...,...,...,...,...
119,120,6.0,2.2,5.0,1.5,Iris-virginica
121,122,5.6,2.8,4.9,2.0,Iris-virginica
138,139,6.0,3.0,4.8,1.8,Iris-virginica
142,143,5.8,2.7,5.1,1.9,Iris-virginica
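
The effect of the bounds can be compared side by side on a few hand-picked values. This is a sketch on an invented Series `s`; the string form `inclusive="neither"` assumes pandas 1.3 or later, where `between()` accepts `"both"`, `"neither"`, `"left"` or `"right"`.

```python
import pandas as pd

# Hypothetical values chosen to sit on and around the bounds
s = pd.Series([4.9, 5.0, 5.5, 6.0, 6.1])

default_mask = s.between(5, 6)                       # both bounds included
strict_mask = (s > 5) & (s < 6)                      # both bounds excluded
neither_mask = s.between(5, 6, inclusive="neither")  # assumes pandas >= 1.3

print(default_mask.tolist())
print(strict_mask.tolist())
print(neither_mask.tolist())
```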


**Filtering based on two criteria with AND operator (Different columns)**

- Here, the two conditions are made using two different columns: SepalLengthCm and PetalLengthCm.

In [77]:
df_iris[(df_iris["SepalLengthCm"]>5)  &  (df_iris["PetalLengthCm"]>6)]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
105,106,7.6,3.0,6.6,2.1,Iris-virginica
107,108,7.3,2.9,6.3,1.8,Iris-virginica
109,110,7.2,3.6,6.1,2.5,Iris-virginica
117,118,7.7,3.8,6.7,2.2,Iris-virginica
118,119,7.7,2.6,6.9,2.3,Iris-virginica
122,123,7.7,2.8,6.7,2.0,Iris-virginica
130,131,7.4,2.8,6.1,1.9,Iris-virginica
131,132,7.9,3.8,6.4,2.0,Iris-virginica
135,136,7.7,3.0,6.1,2.3,Iris-virginica


**Filtering based on two criteria with OR operator**

- When we use the AND operator (&), a row is kept only if both conditions are true. If we want at least one condition to be true, we can use the OR operator (|).

In [78]:
df_iris[(df_iris["SepalLengthCm"]>5)  |  (df_iris["PetalLengthCm"]>6)]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
5,6,5.4,3.9,1.7,0.4,Iris-setosa
10,11,5.4,3.7,1.5,0.2,Iris-setosa
14,15,5.8,4.0,1.2,0.2,Iris-setosa
15,16,5.7,4.4,1.5,0.4,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
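
The relationship between the two operators can be checked on a small example: every row kept by `&` is also kept by `|`. The column names and values in `df_demo` below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    "sepal": [4.8, 5.4, 6.5, 7.7],
    "petal": [1.4, 1.7, 6.2, 6.7],
})

cond_sepal = df_demo["sepal"] > 5
cond_petal = df_demo["petal"] > 6

both = df_demo[cond_sepal & cond_petal]     # rows where both conditions hold
either = df_demo[cond_sepal | cond_petal]   # rows where at least one holds

print(both.index.tolist())
print(either.index.tolist())
```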


**Filtering based on the minimum and maximum values**

- Let’s subset our data based on the minimum and maximum values of the SepalLengthCm variable. First, we get the index labels of the minimum and maximum with idxmin() and idxmax():

In [81]:
df_iris["SepalLengthCm"].idxmax()     #max value index

131

In [80]:
df_iris["SepalLengthCm"].idxmin()     #min value index

13

In [84]:
df_iris.loc[[df_iris["SepalLengthCm"].idxmin(),df_iris["SepalLengthCm"].idxmax()]]   #idxmin/idxmax return labels, so .loc is the matching indexer

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
13,14,4.3,3.0,1.1,0.1,Iris-setosa
131,132,7.9,3.8,6.4,2.0,Iris-virginica
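
Because `idxmin()` and `idxmax()` return index labels rather than positions, pairing them with `.loc` keeps working when the index is not the default 0, 1, 2, ... The sketch below assumes an invented `df_demo` with string labels to make this visible.

```python
import pandas as pd

# Hypothetical toy data with string labels
df_demo = pd.DataFrame({"value": [3.0, 1.0, 2.0]}, index=["a", "b", "c"])

min_label = df_demo["value"].idxmin()   # label of the smallest value
max_label = df_demo["value"].idxmax()   # label of the largest value

# .loc accepts labels; .iloc would reject these strings
extremes = df_demo.loc[[min_label, max_label]]

print(min_label, max_label)
print(extremes)
```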
