### Querying a DataFrame

In [11]:
import pandas as pd
df = pd.read_csv('admission_predict.csv')
df.columns = [x.lower().strip() for x in df.columns]
df


Unnamed: 0,serial no.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.00,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.80
4,5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...,...
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67


In [12]:
# Check which students have a chance of admit higher than 0.95
# To do that we create a BOOLEAN MASK
admit_mask = df['chance of admit'] > 0.95
admit_mask


0      False
1      False
2      False
3      False
4      False
       ...  
395    False
396    False
397    False
398    False
399    False
Name: chance of admit, Length: 400, dtype: bool
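
To make the idea of a mask concrete, here is a minimal sketch on a tiny hand-made DataFrame. The name toy and its values are invented for illustration and are not taken from admission_predict.csv. It shows that the comparison returns one boolean per row, and that summing the mask counts the matching rows.

```python
import pandas as pd

# Hypothetical stand-in for the admissions data (values invented for illustration)
toy = pd.DataFrame({
    'gre score': [337, 324, 316, 340],
    'chance of admit': [0.92, 0.76, 0.72, 0.97],
})

# Comparing a column with a scalar yields a boolean Series aligned on the index
toy_mask = toy['chance of admit'] > 0.95
print(toy_mask.tolist())  # [False, False, False, True]

# True counts as 1, so summing the mask counts the rows that satisfy the condition
n_matches = int(toy_mask.sum())
print(n_matches)  # 1
```

Counting the True values first is a cheap way to check how many rows a filter will keep before applying it.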

In [13]:
# Now we can use our mask to "filter" the data in the original dataframe
df.where(admit_mask)


Unnamed: 0,serial no.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
0,,,,,,,,,
1,,,,,,,,,
2,,,,,,,,,
3,,,,,,,,,
4,,,,,,,,,
...,...,...,...,...,...,...,...,...,...
395,,,,,,,,,
396,,,,,,,,,
397,,,,,,,,,
398,,,,,,,,,


In [14]:
# Notice that where() left NaN in every row where the condition is False, so we drop those rows
df.where(admit_mask).dropna()


Unnamed: 0,serial no.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
24,25.0,336.0,119.0,5.0,4.0,3.5,9.8,1.0,0.97
71,72.0,336.0,112.0,5.0,5.0,5.0,9.76,1.0,0.96
81,82.0,340.0,120.0,4.0,5.0,5.0,9.5,1.0,0.96
130,131.0,339.0,114.0,5.0,4.0,4.5,9.76,1.0,0.96
143,144.0,340.0,120.0,4.0,4.5,4.0,9.92,1.0,0.97
148,149.0,339.0,116.0,4.0,4.0,3.5,9.8,1.0,0.96
202,203.0,340.0,120.0,5.0,4.5,4.5,9.91,1.0,0.97
203,204.0,334.0,120.0,5.0,4.0,5.0,9.87,1.0,0.97
213,214.0,333.0,119.0,5.0,5.0,4.5,9.78,1.0,0.96
384,385.0,340.0,113.0,4.0,5.0,5.0,9.74,1.0,0.96


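**IMPORTANT**. Pandas provides a shorthand that combines .where and .dropna: index the DataFrame directly with the boolean mask. It is arguably less readable, but it is the form you will most often find in people's code. Unlike .where, it also keeps integer columns as integers (compare 25.0 above with 25 below).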

In [15]:
df[df['chance of admit'] > 0.95]


Unnamed: 0,serial no.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
24,25,336,119,5,4.0,3.5,9.8,1,0.97
71,72,336,112,5,5.0,5.0,9.76,1,0.96
81,82,340,120,4,5.0,5.0,9.5,1,0.96
130,131,339,114,5,4.0,4.5,9.76,1,0.96
143,144,340,120,4,4.5,4.0,9.92,1,0.97
148,149,339,116,4,4.0,3.5,9.8,1,0.96
202,203,340,120,5,4.5,4.5,9.91,1,0.97
203,204,334,120,5,4.0,5.0,9.87,1,0.97
213,214,333,119,5,5.0,4.5,9.78,1,0.96
384,385,340,113,4,5.0,5.0,9.74,1,0.96
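
The difference between the two approaches can be checked on a small example. This is a sketch on a made-up DataFrame (toy and its values are assumptions, not the real dataset): both routes keep the same rows, but where() has to insert NaN first, which upcasts integer columns to float.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical data, invented for illustration
toy = pd.DataFrame({
    'serial no.': [1, 2, 3],
    'chance of admit': [0.92, 0.97, 0.96],
})
toy_mask = toy['chance of admit'] > 0.95

via_where = toy.where(toy_mask).dropna()  # NaN-fill the rejected rows, then drop them
via_index = toy[toy_mask]                 # boolean indexing selects the rows directly

# Both approaches keep the same rows
print(via_where.index.tolist())  # [1, 2]
print(via_index.index.tolist())  # [1, 2]

# where() introduced NaN, so the integer column became float;
# boolean indexing left the dtype untouched
print(is_float_dtype(via_where['serial no.']))    # True
print(is_integer_dtype(via_index['serial no.']))  # True
```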


---

In [16]:
# We can combine multiple masks with logical operators
# Be careful: Python's "and" and "or" do not work on a Series. Pandas needs the & and | operators
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)
# It is also important to wrap each term in parentheses, because & binds more tightly than > and <

0      False
1       True
2       True
3       True
4      False
       ...  
395     True
396     True
397    False
398    False
399    False
Name: chance of admit, Length: 400, dtype: bool
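
The following sketch illustrates both points from the cell above on a three-value Series (scores is a made-up example): the element-wise operators work when each comparison is parenthesised, while Python's "and" raises an error because a Series has no single truth value.

```python
import pandas as pd

# Hypothetical values, invented for illustration
scores = pd.Series([0.92, 0.76, 0.65], name='chance of admit')

# Element-wise AND: each comparison sits in its own parentheses
in_range = (scores > 0.7) & (scores < 0.9)
print(in_range.tolist())  # [False, True, False]

# Element-wise OR uses |, and ~ negates a mask
outside = (scores <= 0.7) | (scores >= 0.9)
print(outside.tolist())      # [True, False, True]
print((~in_range).tolist())  # [True, False, True]

# Python's "and" asks the Series for one True/False answer,
# which pandas refuses to give, so this raises ValueError
try:
    (scores > 0.7) and (scores < 0.9)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```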

In [17]:
# Equivalently, we can use the comparison methods gt and lt
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)

0      False
1       True
2       True
3       True
4      False
       ...  
395     True
396     True
397    False
398    False
399    False
Name: chance of admit, Length: 400, dtype: bool

In [18]:
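# Careful: chaining the methods does NOT give the same result. gt(0.7) returns booleans,
# and lt(0.9) then compares those booleans (True == 1, False == 0) with 0.9,
# which simply inverts the first mask (compare the output below with In [17])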
df['chance of admit'].gt(0.7).lt(0.9)


0      False
1      False
2      False
3      False
4       True
       ...  
395    False
396    False
397    False
398     True
399    False
Name: chance of admit, Length: 400, dtype: bool
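
A small sketch makes the behaviour of the chained call easier to see, and shows one alternative for a range test. The Series scores is invented for illustration, and the inclusive='neither' argument of Series.between assumes pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical values, invented for illustration
scores = pd.Series([0.92, 0.76, 0.65])

# Chained call: lt(0.9) is applied to the booleans returned by gt(0.7),
# so the result is just the negation of "scores > 0.7"
chained = scores.gt(0.7).lt(0.9)
print(chained.tolist())  # [False, False, True]

# Intended range test: two separate masks combined with &
combined = scores.gt(0.7) & scores.lt(0.9)
print(combined.tolist())  # [False, True, False]

# Series.between expresses the same range in one call;
# inclusive='neither' makes both bounds strict (assumes pandas >= 1.3)
between = scores.between(0.7, 0.9, inclusive='neither')
print(between.tolist())  # [False, True, False]
```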

https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#pandas-equivalents-for-some-sql-analytic-and-aggregate-functions

#### The link above compares querying in SQL and in pandas. It's very important
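
As a rough illustration of that comparison, the sketch below pairs two SQL WHERE clauses with their boolean-mask equivalents. The table name toy, its values, and the SQL statements in the comments are assumptions made up for this example, not taken from the linked page.

```python
import pandas as pd

# Hypothetical data, invented for illustration
toy = pd.DataFrame({
    'gre score': [337, 324, 340],
    'research': [1, 1, 0],
    'chance of admit': [0.92, 0.76, 0.97],
})

# SQL: SELECT * FROM toy WHERE "chance of admit" > 0.9 AND research = 1;
where_and = toy[(toy['chance of admit'] > 0.9) & (toy['research'] == 1)]
print(where_and.index.tolist())  # [0]

# SQL: SELECT "gre score" FROM toy WHERE research = 0 OR "gre score" < 330;
# .loc takes the row mask first and the list of columns second
where_or = toy.loc[(toy['research'] == 0) | (toy['gre score'] < 330), ['gre score']]
print(where_or['gre score'].tolist())  # [324, 340]
```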