# DataFrame Subsetting and Indexing

### Setting up the workspace

In [1]:
import numpy as np
import pandas as pd
from numpy.random import randn
from IPython.display import display
pd.options.display.float_format = '{:,.3f}'.format

### DataFrame Example

In [2]:
np.random.seed(234)
df = pd.DataFrame(randn(7, 5), index = list('ABCDEFG'), 
                  columns = ["VAR" + "_" + str(num) for num in range(1, 6)])

In [3]:
df


Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
B,-3.129,-0.97,0.935,0.044,1.425
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
E,-0.301,0.908,-0.646,-1.324,1.678
F,0.299,0.109,1.044,0.15,-0.252
G,0.847,0.506,0.393,0.142,-1.143


## Conditional Selection with DataFrames  

  - Conditional selection is very common in data analysis.
  
  
  - When we apply a condition to a DataFrame or to a column, we get back an object of booleans.
  
  
  - Pandas lets us pass that boolean object inside square brackets to select data. This is __conditional selection using bracket notation__.
  
```python
# The bracket notation, where cond1 is a boolean Series or DataFrame:

df[cond1]

```

### Comparison Operators

  - If you compare the whole DataFrame with a number, you get a DataFrame of __booleans__.

In [4]:
df > 0

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,True,False,True,True,False
B,False,False,True,True,True
C,False,True,False,True,False
D,True,True,False,False,False
E,False,True,False,False,True
F,True,True,True,True,False
G,True,True,True,True,False


To perform boolean selection, you create a boolean object then use the __bracket notation__.

In [5]:
cond_bool = df > 0

In [6]:
slice_df = df[cond_bool]
slice_df

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,,0.351,0.922,
B,,,0.935,0.044,1.425
C,,0.927,,1.096,
D,0.479,1.345,,,
E,,0.908,,,1.678
F,0.299,0.109,1.044,0.15,
G,0.847,0.506,0.393,0.142,


> When we index with a boolean DataFrame, we keep the data points where the condition is __True__ and get __NaN__ where it is __False__. 

> Actually, we can combine the previous two steps in just one step as follows:

In [7]:
df[df > 0]

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,,0.351,0.922,
B,,,0.935,0.044,1.425
C,,0.927,,1.096,
D,0.479,1.345,,,
E,,0.908,,,1.678
F,0.299,0.109,1.044,0.15,
G,0.847,0.506,0.393,0.142,
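
The masking behavior shown above can be reproduced on a tiny example. The sketch below uses a hypothetical frame called `toy` (not the notebook's `df`) and compares bracket masking with `DataFrame.where`, which should give the same result for a boolean DataFrame condition.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"x": [1.0, -2.0, 3.0], "y": [-4.0, 5.0, -6.0]},
                   index=list("abc"))

masked = toy[toy > 0]        # bracket notation with a boolean DataFrame
same = toy.where(toy > 0)    # explicit method form of the same masking

print(masked)
print(masked.equals(same))   # both forms keep True cells and set the rest to NaN
print(masked.isna().sum())   # number of masked cells per column
```

The shape of the result always matches the original frame; only the values that fail the condition are replaced.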


> However, conditional selection usually means selecting rows or columns, not masking the entire data set.

#### Selecting Rows Conditionally 

 - The concern is often to retrieve the data that meets a certain condition. In other words, we are trying to answer a question by selecting only the observations needed for that answer.

```python

# Build a condition by comparing a column with a value.
# <op> is any comparison operator: >, <, >=, <=, ==, !=
#   df['col-name'] <op> value

# Then we can subset the DataFrame, for example:

df[df['col-name'] == 0]
```

#### Note: 

 Comparing a column with a value returns a __Series of booleans__.

In [8]:
df['VAR_1'] < 0

A    False
B     True
C     True
D    False
E     True
F    False
G    False
Name: VAR_1, dtype: bool

In [9]:
type(df['VAR_1'] < 0)

pandas.core.series.Series

> When we pass that boolean Series inside the brackets, we get only the rows that correspond to __True__.

In [10]:
sliced_var_1 = df[df['VAR_1'] < 0]
sliced_var_1

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
B,-3.129,-0.97,0.935,0.044,1.425
C,-0.557,0.927,-1.284,1.096,-1.932
E,-0.301,0.908,-0.646,-1.324,1.678
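
Because the condition is an ordinary boolean Series, it can be stored and reused. The sketch below, on a hypothetical frame `toy`, counts the matching rows with `sum()` and selects the complementary rows with the `~` (not) operator.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"score": [0.5, -1.2, 2.0, -0.3]}, index=list("abcd"))

mask = toy["score"] < 0        # boolean Series aligned with toy's index
n_negative = int(mask.sum())   # True counts as 1, so this counts matching rows

negatives = toy[mask]          # rows where the mask is True
non_negatives = toy[~mask]     # ~ flips the mask element by element

print(n_negative)
print(negatives)
print(non_negatives)
```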


Conditional selection is often done in just one step:

In [11]:
df[df['VAR_3'] > 0]

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
B,-3.129,-0.97,0.935,0.044,1.425
F,0.299,0.109,1.044,0.15,-0.252
G,0.847,0.506,0.393,0.142,-1.143


> When subsetting, we usually have a goal in mind, such as selecting one or more columns from the rows that match. That is done in two steps: 

1. Subset the DataFrame and store the new subset.


2. Select the desired column(s).

In [12]:
sub_df = df[df['VAR_5'] < 0]
sub_df

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
F,0.299,0.109,1.044,0.15,-0.252
G,0.847,0.506,0.393,0.142,-1.143


#### Selecting one or more variables:

  - A single variable can be selected with the usual single-bracket column selection.

In [13]:
sub_df['VAR_1']

A    0.819
C   -0.557
D    0.479
F    0.299
G    0.847
Name: VAR_1, dtype: float64

### Selecting more variables

  - Selecting more than one variable requires passing a list of variables inside the brackets: [[...]].

In [14]:
sub_df[['VAR_1', 'VAR_3']]

Unnamed: 0,VAR_1,VAR_3
A,0.819,0.351
C,-0.557,-1.284
D,0.479,-0.175
F,0.299,1.044
G,0.847,0.393


 We can achieve the same result by combining the steps into one. That is, we open a new single (or double) square bracket for the second subsetting, directly after the first.

Here is the Syntax:

```python 

# One variable:
#   df[df['col-name'] <cond>]['col-name']
# More variables:
#   df[df['col-name'] <cond>][['col1', 'col2', ...]]

```
So many brackets can be confusing at first, but the chained form is concise and avoids naming an intermediate object.

#### Selecting the first variable where the fifth variable values are negative

In [15]:
var_1 = df[df['VAR_5'] < 0]['VAR_1']
var_1

A    0.819
C   -0.557
D    0.479
F    0.299
G    0.847
Name: VAR_1, dtype: float64

### Selecting two variables: 

  - We want to retrieve the first and the fifth variable where the fifth variable values are negative.

In [16]:
var_1_and_5 = df[df['VAR_5'] < 0][['VAR_1', 'VAR_5']]
var_1_and_5

Unnamed: 0,VAR_1,VAR_5
A,0.819,-0.087
C,-0.557,-1.932
D,0.479,-0.888
F,0.299,-0.252
G,0.847,-1.143


### Selecting three variables example

In [17]:
df[df['VAR_2'] < 0][['VAR_2', 'VAR_3', 'VAR_5']]

Unnamed: 0,VAR_2,VAR_3,VAR_5
A,-1.044,0.351,-0.087
B,-0.97,0.935,1.425
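
The chained-bracket selections above can also be written as a single `.loc` call, which takes the row condition and the column list together. This is only a sketch on a hypothetical frame `toy`; for reading data, both forms should return the same rows and columns.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"p": [1, -1, 2], "q": [10, 20, 30], "r": [0.1, 0.2, 0.3]},
                   index=list("abc"))

# Two bracket operations: filter rows first, then pick columns
chained = toy[toy["p"] > 0][["q", "r"]]

# One .loc operation: rows and columns selected together
with_loc = toy.loc[toy["p"] > 0, ["q", "r"]]

print(with_loc)
print(chained.equals(with_loc))
```

The `.loc` form is the one to reach for when the selection will be assigned to, because assigning through a chained selection may act on a temporary copy.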


### Multiple Conditional Selection

- When performing conditional selection we __are not restricted to only one condition__, but we can pass more than one condition. 


- Multiple conditional selection requires us to know about the logical operators **ampersand (&)** and the **pipe operator (|)**. 


- Multiple conditional selection is an attempt to answer a question or solve a problem.


- A multiple selection in pandas is performed by wrapping **each condition in parentheses and combining them with the logical operators**.

Here is the syntax
---
```python

df[(df['col-name'] <cond>) & (df['col-name'] <cond>)]

# Warning:
# &  is the "and" operator (the ampersand); the usual `and` does not work.
#    A row is kept only if both conditions are True.
# |  is the "or" operator (the pipe); the usual `or` does not work.
#    A row is kept if at least one condition is True.
```

### Note: 

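   - Q: Why do `and` and `or` not work on pandas Series?
   

   - A: `and` and `or` need a single True/False value from each operand, but a Series holds many values, so pandas raises a `ValueError` (the truth value of a Series is ambiguous). `&` and `|` compare element by element instead.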
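
A minimal sketch of that difference, using a hypothetical Series `s`: the `and` version raises, while the `&` version returns one boolean per element. The parentheses matter too, because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical Series, used only for illustration
s = pd.Series([1, -1, 2])

try:
    (s > 0) and (s < 2)       # `and` asks each Series for one truth value
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
    print("ValueError:", message)

both = (s > 0) & (s < 2)      # element-wise "and"
either = (s > 0) | (s < 2)    # element-wise "or"

print(both.tolist())
print(either.tolist())
```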

#### Review of logical Operators

In [18]:
True & True, True & False, False & True, False & False

(True, False, False, False)

In [19]:
True  | True, True | False, False | True, False | False

(True, True, True, False)

### Two conditions example

- Suppose we want the rows where variable 5 (VAR_5) is negative and variable 1 (VAR_1) is positive. 

In [20]:
m_cond_df = df[(df['VAR_5'] < 0) & (df['VAR_1'] > 0)]
m_cond_df

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
D,0.479,1.345,-0.175,-0.083,-0.888
F,0.299,0.109,1.044,0.15,-0.252
G,0.847,0.506,0.393,0.142,-1.143


### The Or operator "|" Example

- Retrieve the rows where the fifth variable is negative **OR** the first variable is positive.

In [21]:
df[(df['VAR_5'] < 0) | (df['VAR_1'] > 0)]

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
F,0.299,0.109,1.044,0.15,-0.252
G,0.847,0.506,0.393,0.142,-1.143


### Three Conditions Example

In [22]:
df[(df['VAR_5'] < 0) & (df['VAR_1'] > 0) & (df['VAR_2'] > 1)]

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
D,0.479,1.345,-0.175,-0.083,-0.888
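
When several conditions are combined, one option is to give each mask a name before combining them. The sketch below uses a hypothetical frame `toy` and hypothetical mask names; it is meant only to show that the named and inline forms select the same rows.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"a": [1, -1, 2, 3],
                    "b": [-1, -2, 5, -4],
                    "c": [0.5, 2.0, 3.0, 1.5]})

# Each mask is a boolean Series; the names are illustrative
a_pos = toy["a"] > 0
b_neg = toy["b"] < 0
c_big = toy["c"] > 1

selected = toy[a_pos & b_neg & c_big]

# The same selection written inline, with each condition in parentheses
inline = toy[(toy["a"] > 0) & (toy["b"] < 0) & (toy["c"] > 1)]

print(selected)
print(selected.equals(inline))
```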


### Selecting Variable based on multiple conditions

   - It is no different from subsetting with one condition. We can use extra **single square brackets to select one variable**.
   
   
   - Or **double square brackets (a list of variable names inside the indexing brackets) to select more than one variable**.
   
   
   - Using lots of square brackets is a bit confusing at first, but it is __concise__ and spares us from naming intermediate objects. It is good practice to familiarize ourselves with this syntax.

### Selecting One Variable based on multiple conditions

 - Retrieve the third variable where the fifth variable is negative and the first variable is positive.

In [23]:
var_3 = df[(df['VAR_5'] < 0) & (df['VAR_1'] > 0)]['VAR_3']
var_3

A    0.351
D   -0.175
F    1.044
G    0.393
Name: VAR_3, dtype: float64

### Selecting more variables with multiple conditions example

In [24]:
var_3_4_2 = df[(df['VAR_5'] < 0) & (df['VAR_1'] > 0)][['VAR_3', 'VAR_4', 'VAR_2']]
var_3_4_2

Unnamed: 0,VAR_3,VAR_4,VAR_2
A,0.351,0.922,-1.044
D,-0.175,-0.083,1.345
F,1.044,0.15,0.109
G,0.393,0.142,0.506


Note:

- The returned DataFrame has the variables in the order in which they were listed, not in their order in the original DataFrame. 

# Re-Indexing DataFrames

  - Sometimes we want to change the index labels to new ones, reorder them, or reset the index to its default.  
  
### Resetting the Index 

  - __df.reset_index()__ resets the index to the default integer index (0, 1, 2, ...). Note that:
  
      - reset_index() returns a new object and leaves the original unchanged, unless inplace = True is passed. 
      - If we already have an index, reset_index will convert the old index to a new column.

In [25]:
df.head()

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
B,-3.129,-0.97,0.935,0.044,1.425
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
E,-0.301,0.908,-0.646,-1.324,1.678


In [26]:
df.index

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

Our DataFrame already has a label index. We expect it to become a new column when the index is reset to its default.

### Reset Index

In [27]:
df.reset_index()

Unnamed: 0,index,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
0,A,0.819,-1.044,0.351,0.922,-0.087
1,B,-3.129,-0.97,0.935,0.044,1.425
2,C,-0.557,0.927,-1.284,1.096,-1.932
3,D,0.479,1.345,-0.175,-0.083,-0.888
4,E,-0.301,0.908,-0.646,-1.324,1.678
5,F,0.299,0.109,1.044,0.15,-0.252
6,G,0.847,0.506,0.393,0.142,-1.143


The old index became a new column called **index**.

 Did the original df get affected? No, it did not:

In [28]:
df.head()

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
B,-3.129,-0.97,0.935,0.044,1.425
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
E,-0.301,0.908,-0.646,-1.324,1.678
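
If the old labels are not needed as a column, `reset_index` also accepts `drop=True`, which discards them instead. A small sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"v": [10, 20, 30]}, index=list("xyz"))

kept = toy.reset_index()              # old labels become a column named "index"
dropped = toy.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))
print(list(dropped.columns))
print(list(toy.index))                # the original frame keeps its labels
```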


### Inplace Argument 

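   - Many pandas DataFrame methods return a new object by default, which protects us from accidentally changing the data. They accept an __inplace__ argument to change that behavior.
   
   
   - If we want to reset the DataFrame index permanently, we have to provide __inplace = True__ (or assign the result back to the same name). 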

In [29]:
df.reset_index(inplace = True)

In [30]:
df.head()

Unnamed: 0,index,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
0,A,0.819,-1.044,0.351,0.922,-0.087
1,B,-3.129,-0.97,0.935,0.044,1.425
2,C,-0.557,0.927,-1.284,1.096,-1.932
3,D,0.479,1.345,-0.175,-0.083,-0.888
4,E,-0.301,0.908,-0.646,-1.324,1.678
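
One detail worth seeing in code: with `inplace=True` the method modifies the object and returns `None`, so its result should not be assigned. The sketch below, on a hypothetical frame `toy`, compares that with the reassignment style.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"v": [1, 2]}, index=["a", "b"])

# Reassignment style: the method returns a new object
reassigned = toy.reset_index()

# In-place style: toy itself is modified and the call returns None
result = toy.reset_index(inplace=True)

print(result)
print(list(toy.columns))
print(reassigned.equals(toy))   # both routes end with the same data
```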


### Rename Method
- I would like to rename the new column to __OldIndex__, to refer to it later. This is a chance for us to learn about the __rename method__. 

- The rename method has several arguments, but in order to rename variables, we need to pass a dict to the __columns__ argument. 

- We need to set __inplace = True__ if we want the change to take place permanently.

Here is the syntax
---
```python 

df.rename(columns = {"old-var-name":"new-var-name"}, inplace = True)
```

In [31]:
df.rename(columns = {'index': 'OldIndex'}, inplace = True ) 
df.head()

Unnamed: 0,OldIndex,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
0,A,0.819,-1.044,0.351,0.922,-0.087
1,B,-3.129,-0.97,0.935,0.044,1.425
2,C,-0.557,0.927,-1.284,1.096,-1.932
3,D,0.479,1.345,-0.175,-0.083,-0.888
4,E,-0.301,0.908,-0.646,-1.324,1.678
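
A small sketch of `rename` without `inplace`, on a hypothetical frame `toy`. It also shows that, with the default settings, a key in the dict that matches no column is ignored rather than raising an error.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"index": ["a", "b"], "v": [1, 2]})

# "missing" is not a column of toy; by default rename skips it silently
renamed = toy.rename(columns={"index": "OldIndex", "missing": "ignored"})

print(list(renamed.columns))
print(list(toy.columns))   # the original keeps its names
```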


### Column as an Index 

Great! What if we want to use a column as the index? In our case, we will turn the old index (renamed to OldIndex) back into the __index__. 

- This is what __df.set_index('col-name')__ does. By default, it also does not affect the original data.


- **Setting the new index permanently requires inplace = True**. If we do this, it will replace the current index, which is then discarded. Do not do it unless you are sure, or make a copy of the original data first. 

Here is the syntax
---
```python

df.set_index('col-to-be-index')
```

In [32]:
df.set_index("OldIndex")

OldIndex,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
B,-3.129,-0.97,0.935,0.044,1.425
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
E,-0.301,0.908,-0.646,-1.324,1.678
F,0.299,0.109,1.044,0.15,-0.252
G,0.847,0.506,0.393,0.142,-1.143


In [33]:
df.head()

Unnamed: 0,OldIndex,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
0,A,0.819,-1.044,0.351,0.922,-0.087
1,B,-3.129,-0.97,0.935,0.044,1.425
2,C,-0.557,0.927,-1.284,1.096,-1.932
3,D,0.479,1.345,-0.175,-0.083,-0.888
4,E,-0.301,0.908,-0.646,-1.324,1.678


#### Setting the New Index Permanently

In [34]:
df.set_index("OldIndex", inplace= True)

In [35]:
df.head()

OldIndex,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
B,-3.129,-0.97,0.935,0.044,1.425
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
E,-0.301,0.908,-0.646,-1.324,1.678


### The index name

In [36]:
df.index.name

'OldIndex'
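
The column that becomes the index is moved, not lost: `reset_index` can turn it back into a column. A round-trip sketch on a hypothetical frame `toy` with a hypothetical column named `key`:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"key": ["a", "b", "c"], "v": [1, 2, 3]})

indexed = toy.set_index("key")     # "key" moves from the columns to the index
restored = indexed.reset_index()   # and back to a column again

print(indexed.index.name)
print(list(indexed.columns))
print(restored.equals(toy))
```

What cannot be recovered this way is a non-default index that `set_index` replaced, which is why copying the data first is the safer habit.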

### Reindex method

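  - The **reindex** method creates a new object with the data conformed to a new index. (reindex works on both Series and DataFrame objects.) Passing an existing DataFrame to pd.DataFrame together with an index conforms the data in the same way: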

In [37]:
df_reind = pd.DataFrame(df, index = ['F', 'B', 'A', 'E', 'C', 'D', 'G'])

In [38]:
df_reind

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
F,0.299,0.109,1.044,0.15,-0.252
B,-3.129,-0.97,0.935,0.044,1.425
A,0.819,-1.044,0.351,0.922,-0.087
E,-0.301,0.908,-0.646,-1.324,1.678
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
G,0.847,0.506,0.393,0.142,-1.143


### Reindexing Example

- If we call the __reindex method__ on the object, it will rearrange the DataFrame according to the new index. 

- If we give an index label that has no matching data, NaN will be produced for that row.

- To reindex a DataFrame, pass a list of labels to the __labels__ argument (it is also the first positional argument).

Here is the syntax
```python
df.reindex(labels = ['label1', 'label2', ...])

```

In [39]:
df_reind.reindex(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
B,-3.129,-0.97,0.935,0.044,1.425
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
E,-0.301,0.908,-0.646,-1.324,1.678
F,0.299,0.109,1.044,0.15,-0.252
G,0.847,0.506,0.393,0.142,-1.143
H,,,,,


### Filling NaNs with the Reindex Method

The __reindex__ method has a __fill_value__ argument to fill the values that reindexing would otherwise leave as NaN.

In [40]:
df_reind.reindex(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], fill_value= -999)

Unnamed: 0,VAR_1,VAR_2,VAR_3,VAR_4,VAR_5
A,0.819,-1.044,0.351,0.922,-0.087
B,-3.129,-0.97,0.935,0.044,1.425
C,-0.557,0.927,-1.284,1.096,-1.932
D,0.479,1.345,-0.175,-0.083,-0.888
E,-0.301,0.908,-0.646,-1.324,1.678
F,0.299,0.109,1.044,0.15,-0.252
G,0.847,0.506,0.393,0.142,-1.143
H,-999.0,-999.0,-999.0,-999.0,-999.0
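
Since reindex works on Series as well, the same behavior can be sketched on a hypothetical Series `s`: existing labels are reordered, and a label with no data gets NaN or the given `fill_value`.

```python
import pandas as pd

# Hypothetical Series, used only for illustration
s = pd.Series([1.0, 2.0, 3.0], index=list("abc"))

reordered = s.reindex(["c", "a", "d"])                # "d" has no data -> NaN
filled = s.reindex(["c", "a", "d"], fill_value=0.0)   # "d" gets 0.0 instead

print(reordered)
print(filled)
print(list(s.index))   # the original Series is unchanged
```

Note that label "b" is simply left out of the result because it is not in the new index.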


### Note:

  - **reindex** works on columns as well. 

In [41]:
col_df = df.reindex(columns = ['VAR_1', 'VAR_6', 'VAR_2'])
col_df

OldIndex,VAR_1,VAR_6,VAR_2
A,0.819,,-1.044
B,-3.129,,-0.97
C,-0.557,,0.927
D,0.479,,1.345
E,-0.301,,0.908
F,0.299,,0.109
G,0.847,,0.506


NaNs are produced for the new column.

### Filling the variable NaN values with fill_value

In [42]:
df.reindex(columns = ['VAR_1', 'VAR_6', 'VAR_2'], fill_value=0)

OldIndex,VAR_1,VAR_6,VAR_2
A,0.819,0,-1.044
B,-3.129,0,-0.97
C,-0.557,0,0.927
D,0.479,0,1.345
E,-0.301,0,0.908
F,0.299,0,0.109
G,0.847,0,0.506
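
Rows and columns can be reindexed in one call by passing both `index` and `columns`. A sketch on a hypothetical frame `toy`, where label "c" and column "z" do not exist and are filled with 0:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"u": [1, 2], "w": [3, 4]}, index=["a", "b"])

both_axes = toy.reindex(index=["b", "a", "c"],   # reorder rows, add "c"
                        columns=["w", "z"],      # keep "w", add "z", drop "u"
                        fill_value=0)

print(both_axes)
```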


## Conclusion

**Subsetting (variable selection), indexing, and reindexing are common tasks in data analysis, so it is worth taking the time to master this section.**