# Querying a DataFrame
In this module we're going to talk about querying DataFrames. The first step in the process is to understand Boolean masking. Boolean masking is the heart of fast and efficient querying in numpy and pandas, and it's analogous to bit masking used in other areas of computational science. By the end of this module you'll understand how Boolean masking works, and how to apply this to a DataFrame to get out data you're interested in.

A Boolean mask is an array, either one-dimensional like a Series or two-dimensional like a DataFrame, in which every value is either True or False. This array is essentially overlaid on top of the data structure that we're querying: any cell aligned with a True value is admitted into the final result, and any cell aligned with a False value is not.
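
To make the overlay idea concrete, here is a minimal sketch on a hypothetical three-value Series (the values and labels are made up for illustration and are not taken from the admissions file). The mask is written by hand so that the alignment between True/False and the data is easy to see.

```python
import pandas as pd

# Hypothetical toy data, not taken from the admissions file.
scores = pd.Series([0.92, 0.65, 0.80], index=["a", "b", "c"])

# A hand-written mask with one True/False per value in `scores`.
mask = pd.Series([True, False, True], index=["a", "b", "c"])

# Only the positions aligned with True survive.
kept = scores[mask]
print(kept)
```

In practice the mask is rarely typed out like this; it is usually produced by a comparison, as the rest of this module shows.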

In [1]:
#First, let's import the pandas library, which we will use throughout.
import pandas as pd

In [2]:
#Now let's load the dataset we are going to work with into a DataFrame.
csvDataFrame=pd.read_csv("assets/Admission_Predict.csv")

In [3]:
csvDataFrame.head()

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


Now, just to be on the safer side, let's tidy up the DataFrame.

1) Change the index from the default integer index to the Serial No. column.

2) Convert all the column names to lowercase and trim any extra spaces.

In [4]:
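#Edit 1: reload the file, using the first column (Serial No.) as the index
csvDataFrame=pd.read_csv("assets/Admission_Predict.csv",index_col=0)
#Edit 2: lowercase every column name and strip surrounding whitespace
csvDataFrame.columns=[x.lower().strip() for x in csvDataFrame.columns]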
csvDataFrame.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
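
The following sketch shows why trimming column names is worth the effort. It uses a hypothetical two-column frame and assumes, purely for illustration, that one raw header carries a trailing space; the real file's headers may differ.

```python
import pandas as pd

# Hypothetical frame; the trailing space in the second header is an assumption.
raw = pd.DataFrame({"GRE Score": [337, 324], "Chance of Admit ": [0.92, 0.76]})

# The clean name is not found while the header still has a trailing space.
found_before = "chance of admit" in raw.columns

# Same cleanup as in the cell above: lowercase, then strip whitespace.
raw.columns = [name.lower().strip() for name in raw.columns]
found_after = "chance of admit" in raw.columns

print(found_before, found_after)
```

After the cleanup, columns can be addressed by a predictable lowercase name without worrying about invisible whitespace.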


###### Applying Boolean Masking-Example 1
To apply a Boolean mask, we first build it directly from the Series or DataFrame we want to query.

For instance, suppose we want to see only those students whose chance of admit is more than 0.7. Let's see how we do it.

In [5]:
#Let's first access the column on which the comparison is to be conducted.
targetColumn=csvDataFrame["chance of admit"]

In [6]:
#Then we will simply compare all that data in our target column with 0.7
qualifiedCandidates=targetColumn>0.7

In [7]:
qualifiedCandidates

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

In [8]:
#One way is to pass the mask to the indexing operator
csvDataFrame[qualifiedCandidates]

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
6,330,115,5,4.5,3.0,9.34,1,0.90
...,...,...,...,...,...,...,...,...
395,329,111,4,4.5,4.0,9.23,1,0.89
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91


We did what we had to do. Comparing the column with 0.7 in In [6] produced a Boolean mask: a Series of True or False values that shares the index of the original data. The mask is a Series because the condition was applied to a single column; had we applied it to the whole DataFrame, the mask would have been a DataFrame. Underneath, pandas applies the comparison operator you specified through vectorization (so efficiently, without an explicit Python loop) to all of the values in the array you specified which, in this case, is the chance of admit column of the DataFrame.
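
As a small illustration of the difference between a mask built from one column and a mask built from a whole frame, consider this sketch on a hypothetical two-row DataFrame (the values are chosen for illustration only).

```python
import pandas as pd

# Hypothetical toy frame indexed by serial number.
toy = pd.DataFrame(
    {"cgpa": [9.65, 8.21], "chance of admit": [0.92, 0.65]},
    index=[1, 5],
)

# Comparing a single column gives a one-dimensional mask (a Series).
column_mask = toy["chance of admit"] > 0.7

# Comparing the whole frame gives a two-dimensional mask (a DataFrame).
frame_mask = toy > 0.7

print(type(column_mask).__name__, type(frame_mask).__name__)
```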

So, what do you do with the Boolean mask once you have formed it? Well, you can just lay it on top of the data to "hide" the data you don't want, which is represented by all of the False values. We do this by using the .where() method on the original DataFrame.

The general syntax is:

###### < Variable >=< DataFrame to be masked >.where(< Boolean mask >)

Let's see how we do it.

In [9]:
csvBooleanEdit=csvDataFrame.where(qualifiedCandidates)
csvBooleanEdit.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,


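When we pass the Boolean mask to .where(), we obtain a new DataFrame of the same shape in which the data is visible for the students whose mask value was True, while the data for students whose mask value was False is replaced by NaN. Because NaN is a floating-point value, the integer columns are now displayed as floats (337 becomes 337.0).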
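
Here is a minimal sketch of the same behaviour on a hypothetical two-row frame, so that the effect of .where() on a masked-out row can be checked directly.

```python
import pandas as pd

# Hypothetical toy frame indexed by serial number.
toy = pd.DataFrame(
    {"gre score": [337, 314], "chance of admit": [0.92, 0.65]},
    index=[1, 5],
)

# Rows where the mask is False are kept, but every value becomes NaN.
masked = toy.where(toy["chance of admit"] > 0.7)

print(masked)
print(masked.loc[5].isna().all())  # every cell of row 5 is missing
```

Note that the result still has both rows: .where() never changes the shape of the data, it only blanks out values.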

But let's say we want to completely drop the rows that contain nothing but NaN. For that we simply chain the .where() method with the .dropna() method. The general syntax is as follows.

###### < Variable >=< DataFrame to be masked >.where(< Boolean mask >).dropna()

Let's see an example.

In [10]:
csvBooleanEdit=csvDataFrame.where(qualifiedCandidates).dropna()
csvBooleanEdit.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9


As you can see, the fifth row, which contained only NaN, has been completely dropped, and the row with serial number 6 now follows serial number 4. Similarly, every other row that aligned with a False value in the Boolean mask has been dropped.
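
One caveat is worth a quick sketch. With its default arguments, .dropna() removes any row that contains at least one NaN, so a row that passed the mask but was already missing a value is dropped as well. The toy frame below is hypothetical and includes a missing cgpa purely to show this.

```python
import pandas as pd

# Hypothetical toy frame; the missing cgpa for serial number 2 is deliberate.
toy = pd.DataFrame(
    {"cgpa": [9.65, None, 8.21], "chance of admit": [0.92, 0.80, 0.65]},
    index=[1, 2, 5],
)
mask = toy["chance of admit"] > 0.7

# where + dropna: row 2 passes the mask but is lost because of its missing cgpa.
via_where = toy.where(mask).dropna()

# Passing the mask to the indexing operator keeps row 2.
via_index = toy[mask]

print(list(via_where.index), list(via_index.index))
```

If the data has no missing values to begin with, the two approaches select the same rows.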

There is another way of doing what we have just done with .where() and .dropna(), in a more concise and typically faster manner: passing the Boolean mask directly to the indexing operator, which pandas overloads for this purpose. Let's see the general syntax.

###### < Variable >=< DataFrame >[< DataFrame >[< Column to be tested >] < Operator > < Number >]

Let's see an example for a better understanding.

In [11]:
csvBooleanEdit=csvDataFrame[csvDataFrame["chance of admit"]>0.7]
csvBooleanEdit.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
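
The output above shows integers such as 337 rather than 337.0. The sketch below, on a hypothetical three-row frame, compares the two approaches side by side: both keep the same rows, but only the indexing operator leaves the integer column as integers, since no NaN is ever introduced.

```python
import pandas as pd

# Hypothetical toy frame indexed by serial number.
toy = pd.DataFrame(
    {"gre score": [337, 314, 330], "chance of admit": [0.92, 0.65, 0.90]},
    index=[1, 5, 6],
)
mask = toy["chance of admit"] > 0.7

via_index = toy[mask]                  # integer column stays integer
via_where = toy.where(mask).dropna()   # integer column was converted to float

print(via_index["gre score"].dtype, via_where["gre score"].dtype)
```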


(***)
 
Taking a bit of a detour here. We have seen quite a few applications of the indexing operator on a DataFrame. Let's sum them up for future reference.

1) Selecting a single column or a set of columns, returned as a Series or a DataFrame respectively.

2) Selecting a column on which to build a Boolean mask, and passing the finished mask back to the DataFrame to keep the matching rows.

Examples can be observed in the previous cells.

(***)
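
The uses of the indexing operator summed up in this detour can be sketched together on a hypothetical toy frame (column names mirror the cleaned admissions columns, but the values are illustrative only).

```python
import pandas as pd

# Hypothetical toy frame indexed by serial number.
toy = pd.DataFrame(
    {"gre score": [337, 314], "cgpa": [9.65, 8.21], "chance of admit": [0.92, 0.65]},
    index=[1, 5],
)

one_column = toy["cgpa"]                   # a column label gives a Series
two_columns = toy[["gre score", "cgpa"]]   # a list of labels gives a DataFrame
rows = toy[toy["chance of admit"] > 0.7]   # a Boolean mask selects rows

print(type(one_column).__name__, type(two_columns).__name__, list(rows.index))
```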

Till now we have used a single condition to build the Boolean mask. Let's say we want to use multiple conditions. Previously, whenever we had to check multiple conditions, we generally used "and" or "or". But using "and" or "or" on pandas Series raises an error, because these keywords need a single True or False value and a Series holds many. Hence we replace "and" with "&" and "or" with "|", which combine the masks element by element.
Let's see a few examples to understand.

###### Question
Generate masked data of dataframe Admission_Predict where the chance of admit is greater than 0.7 and less than 0.9. 

In [12]:
csvBooleanEdit=csvDataFrame[(csvDataFrame["chance of admit"]>0.7) and (csvDataFrame["chance of admit"]<0.9)]
csvBooleanEdit.head()

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
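
The error above can be reproduced in isolation. In this sketch, two hypothetical hand-written masks are combined first with `and`, which asks Python for a single truth value of a whole Series, and then with `&`, which works element by element.

```python
import pandas as pd

# Two hypothetical masks of the same length.
mask_a = pd.Series([True, False, True])
mask_b = pd.Series([True, True, False])

# `and` needs one True/False for the whole Series, which pandas refuses to guess.
try:
    mask_a and mask_b
    raised = False
except ValueError:
    raised = True

# `&` combines the masks position by position.
combined = mask_a & mask_b

print(raised, combined.tolist())
```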

In [14]:
csvBooleanEdit=csvDataFrame[(csvDataFrame["chance of admit"]>0.7) & (csvDataFrame["chance of admit"]<0.9)]
csvBooleanEdit.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
7,321,109,3,3.0,4.0,8.2,1,0.75
12,327,111,4,4.0,4.5,9.0,1,0.84
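
The "|" operator works the same way as "&". The sketch below uses a hypothetical four-row frame to select rows that satisfy either of two conditions. It also shows "~", which inverts a mask; that operator is not covered in the text above and is included only as a closely related extra.

```python
import pandas as pd

# Hypothetical toy frame indexed by serial number.
toy = pd.DataFrame({"chance of admit": [0.92, 0.76, 0.65, 0.45]}, index=[1, 2, 5, 9])
chance = toy["chance of admit"]

# OR: very high or very low chance of admit.
extremes = toy[(chance > 0.9) | (chance < 0.5)]

# NOT: everything that is not above 0.7.
not_high = toy[~(chance > 0.7)]

print(list(extremes.index), list(not_high.index))
```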


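Note one thing here. Each condition that we pass must be enclosed in parentheses, or else an error will be generated. This is because & binds more tightly than comparison operators such as > and <, so without parentheses Python tries to evaluate 0.7 & csvDataFrame["chance of admit"] first. See the example.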

In [15]:
csvBooleanEdit=csvDataFrame[csvDataFrame["chance of admit"]>0.7 & csvDataFrame["chance of admit"]<0.9]
csvBooleanEdit.head()

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
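
The precedence problem can be sketched on a hypothetical Series. Without parentheses, Python groups the expression as `chance > (0.7 & chance) < 0.9`, and the `&` between a float and a float column is what fails. The exact wording of the error message may vary between pandas versions, so the sketch only checks that a TypeError is raised.

```python
import pandas as pd

# Hypothetical toy column.
chance = pd.Series([0.92, 0.76, 0.65], index=[1, 2, 5])

# Without parentheses: `0.7 & chance` is evaluated first and fails.
try:
    chance > 0.7 & chance < 0.9
    failed = False
except TypeError:
    failed = True

# With parentheses: each comparison is evaluated first, then the masks are combined.
mask = (chance > 0.7) & (chance < 0.9)

print(failed, mask.tolist())
```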

To avoid this parenthesis problem we can use the predefined methods .gt(< Number >) and .lt(< Number >), chained to the column.

Let's see an example to understand this better.

In [16]:
csvBooleanEdit=csvDataFrame[csvDataFrame["chance of admit"].gt(0.7) & csvDataFrame["chance of admit"].lt(0.9)]
csvBooleanEdit.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
7,321,109,3,3.0,4.0,8.2,1,0.75
12,327,111,4,4.0,4.5,9.0,1,0.84
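
As a quick check that the comparison methods are interchangeable with the operators, this sketch compares them on a hypothetical Series. It also uses .ge(), the method form of >=, which is not used in the text above.

```python
import pandas as pd

# Hypothetical toy column.
chance = pd.Series([0.92, 0.70, 0.65], index=[1, 2, 5])

# Each method produces the same mask as its operator counterpart.
same_gt = chance.gt(0.7).equals(chance > 0.7)
same_lt = chance.lt(0.9).equals(chance < 0.9)
same_ge = chance.ge(0.7).equals(chance >= 0.7)

print(same_gt, same_lt, same_ge)
```

Because a method call already groups its own argument, no extra parentheses are needed when the resulting masks are combined with & or |.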


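It is tempting to shorten this further by chaining the two methods, as in the following code, but this does not give the intended result. .lt(0.9) returns a Boolean Series, and .gt(.7) then compares those True/False values (treated as 1 and 0) with 0.7, so the chained expression is equivalent to .lt(0.9) alone. Notice that serial number 5, with a chance of admit of 0.65, appears in the output.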

In [17]:
csvBooleanEdit=csvDataFrame[csvDataFrame["chance of admit"].lt(0.9).gt(.7)]
csvBooleanEdit.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
7,321,109,3,3.0,4.0,8.2,1,0.75
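
The reason serial number 5 (chance of admit 0.65) shows up in this output can be traced on a hypothetical three-value Series. Chaining .gt() onto the result of .lt() compares Boolean values, not the original numbers, so the lower bound is never really applied.

```python
import pandas as pd

# Hypothetical toy column.
chance = pd.Series([0.92, 0.76, 0.65], index=[1, 2, 5])

# Chained form: .gt(0.7) is applied to True/False values, and True > 0.7.
chained = chance.lt(0.9).gt(0.7)

# It is therefore the same mask as .lt(0.9) on its own.
only_lt = chance.lt(0.9)

# Intended query: build both masks from the original column and combine them.
intended = chance.gt(0.7) & chance.lt(0.9)

print(chained.tolist(), intended.tolist())
```

Combining two separately built masks with & remains the reliable way to express a range condition.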


In this module, we have learned to query a DataFrame using Boolean masking, which is extremely important and often used in the world of data science. With Boolean masking, we can select data based on the criteria we desire and, frankly, you'll use it everywhere. We've also seen how there are many different ways to query the DataFrame, and the interesting side implications that come up when doing so.