### Pandas nlargest(), nsmallest(), query() and where() methods

https://medium.com/@tderick/become-a-pandas-ninja-with-nlargest-nsmallest-query-and-where-methods-490ab97bbe99

Using the dataset loaded below, this article explores the nlargest(), nsmallest(), query() and where() methods.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [7]:
df = pd.read_csv('https://raw.githubusercontent.com/tderick/datainsightonelineprogram/main/pandas%20technique%20for%20datascience%20blog%202/Participated%20Teams%20General%20Statistics.csv')

In [8]:
df

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
0,1,Egypt,7.0,24,100,57,17,26,164,88,76,188,62.7
1,2,Ghana,4.0,22,99,54,20,25,130,82,48,182,61.3
2,3,Nigeria,3.0,18,93,51,21,21,132,89,43,174,62.4
3,4,Ivory Coast,2.0,23,95,42,25,28,138,100,38,151,53.0
4,5,Cameroon,5.0,19,84,41,27,16,123,76,47,150,59.5
5,6,Algeria,2.0,18,74,28,21,25,93,85,8,105,47.3
6,7,Zambia,1.0,17,67,26,20,21,81,69,12,98,48.8
7,8,Tunisia,1.0,19,75,23,29,23,94,91,3,98,43.5
8,9,Morocco,1.0,17,65,24,23,18,74,58,16,95,48.7
9,10,DR Congo,2.0,19,73,20,24,29,88,102,14,84,38.3


### 1. DataFrame.nlargest() and DataFrame.nsmallest()

__DataFrame.nlargest()__ is a pandas method that returns the first n rows of a DataFrame ordered by the given columns in descending order. The signature of this method is:

DataFrame.nlargest(n, columns, keep ="first")

where 

* __columns__ is the column label, or list of column labels, used to order the DataFrame in descending order

* __keep__ indicates which rows to keep when there are duplicate values. This parameter takes one of the values __first__, __last__ or __all__

* __n__ indicates the number of rows to return after sorting the DataFrame
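
As a rough sketch of what these parameters do, the snippet below builds a small hypothetical DataFrame (the `toy` frame and its values are invented for illustration, not taken from the dataset) and checks that `nlargest()` returns the same rows as sorting in descending order and taking the head.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame(
    {"Team": ["A", "B", "C", "D"], "Points": [10, 40, 30, 20]}
)

# Top 2 rows by Points using nlargest().
top2 = toy.nlargest(2, columns="Points")

# Same rows obtained by sorting in descending order and taking the first 2.
top2_sorted = toy.sort_values("Points", ascending=False).head(2)

print(top2)
print(top2.equals(top2_sorted))
```

When only a few rows are needed, `nlargest()` states the intent more directly than a full sort followed by `head()`.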

In [9]:
df.nlargest(5, columns="Rank")

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
41,42,Botswana,,1,3,0,0,3,2,9,7,0,0.0
40,41,Mauritius,,1,3,0,0,3,2,8,6,0,0.0
39,40,Burundi,,1,3,0,0,3,0,4,4,0,0.0
38,39,Tanzania,,2,6,0,1,5,5,14,9,1,5.5
37,38,Niger,,2,6,0,1,5,1,9,8,1,5.6


The output shows that the __nlargest()__ method sorts the DataFrame in descending order and takes the first 5 rows. You can add the __keep__ parameter to tell __nlargest()__ how to handle duplicate values. The default value for __keep__ is __first__.

In [10]:
df.nlargest(5, columns="Rank", keep="first")  # keep="first" is the default

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
41,42,Botswana,,1,3,0,0,3,2,9,7,0,0.0
40,41,Mauritius,,1,3,0,0,3,2,8,6,0,0.0
39,40,Burundi,,1,3,0,0,3,0,4,4,0,0.0
38,39,Tanzania,,2,6,0,1,5,5,14,9,1,5.5
37,38,Niger,,2,6,0,1,5,1,9,8,1,5.6


In [11]:
df.nlargest(5, columns="Rank", keep="last")

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
41,42,Botswana,,1,3,0,0,3,2,9,7,0,0.0
40,41,Mauritius,,1,3,0,0,3,2,8,6,0,0.0
39,40,Burundi,,1,3,0,0,3,0,4,4,0,0.0
38,39,Tanzania,,2,6,0,1,5,5,14,9,1,5.5
37,38,Niger,,2,6,0,1,5,1,9,8,1,5.6


In [12]:
df.nlargest(5, columns="Rank", keep="all")

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
41,42,Botswana,,1,3,0,0,3,2,9,7,0,0.0
40,41,Mauritius,,1,3,0,0,3,2,8,6,0,0.0
39,40,Burundi,,1,3,0,0,3,0,4,4,0,0.0
38,39,Tanzania,,2,6,0,1,5,5,14,9,1,5.5
37,38,Niger,,2,6,0,1,5,1,9,8,1,5.6
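
The three outputs above are identical because the five largest Rank values contain no ties. To see __keep__ make a difference, here is a minimal sketch on a hypothetical DataFrame with a tie (the `toy` frame and its values are assumptions made for illustration).

```python
import pandas as pd

# Hypothetical toy data with a tie on Points (B and C both have 20).
toy = pd.DataFrame(
    {"Team": ["A", "B", "C", "D"], "Points": [30, 20, 20, 10]}
)

# keep="first": among tied rows, prefer the one that appears first.
first = toy.nlargest(2, columns="Points", keep="first")

# keep="last": among tied rows, prefer the one that appears last.
last = toy.nlargest(2, columns="Points", keep="last")

# keep="all": keep every tied row, so more than n rows can be returned.
everything = toy.nlargest(2, columns="Points", keep="all")

print(first["Team"].tolist())
print(last["Team"].tolist())
print(everything["Team"].tolist())
```

With `keep="all"` the result holds three rows even though `n` is 2, because both tied teams are kept.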


To sort the DataFrame in ascending order and get the first n rows, pandas provides another method: __DataFrame.nsmallest()__. The signature of this method is:

DataFrame.nsmallest(n, columns, keep ="first")

This method takes the same parameters as __nlargest()__ but orders the rows in the opposite direction.

In [13]:
df.nsmallest(5, columns="Rank")

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
0,1,Egypt,7.0,24,100,57,17,26,164,88,76,188,62.7
1,2,Ghana,4.0,22,99,54,20,25,130,82,48,182,61.3
2,3,Nigeria,3.0,18,93,51,21,21,132,89,43,174,62.4
3,4,Ivory Coast,2.0,23,95,42,25,28,138,100,38,151,53.0
4,5,Cameroon,5.0,19,84,41,27,16,123,76,47,150,59.5


The resulting DataFrame is sorted in ascending order based on the columns we specify.
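
Since __columns__ can also be a list, later columns act as tie-breakers for earlier ones. The sketch below illustrates this on a hypothetical DataFrame (the `toy` frame, its values and the choice of columns are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data: C and D are tied on Lose.
toy = pd.DataFrame(
    {
        "Team": ["A", "B", "C", "D"],
        "Lose": [3, 1, 2, 2],
        "Goal_Against": [5, 9, 8, 4],
    }
)

# Order by Lose first; when Lose is tied, the smaller Goal_Against wins.
fewest = toy.nsmallest(2, columns=["Lose", "Goal_Against"])

print(fewest)
```

Here team B is selected for having the fewest losses, and D is preferred over C because it conceded fewer goals.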

In this section, we covered nlargest() and nsmallest(), which are useful and easy to use when working with pandas. In the next section, we’ll talk about __DataFrame.query()__.

### 2. DataFrame.query()

This method is a pandas shortcut for filtering a DataFrame. It allows us to query the columns of a DataFrame with a boolean expression. The signature of this method is:

DataFrame.query(expr, inplace=False, **kwargs)

where 

* __expr__ is the query string to evaluate

* __inplace__ indicates whether to modify the DataFrame in place instead of returning a new one


In [15]:
df.query("Win > 20")

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
0,1,Egypt,7.0,24,100,57,17,26,164,88,76,188,62.7
1,2,Ghana,4.0,22,99,54,20,25,130,82,48,182,61.3
2,3,Nigeria,3.0,18,93,51,21,21,132,89,43,174,62.4
3,4,Ivory Coast,2.0,23,95,42,25,28,138,100,38,151,53.0
4,5,Cameroon,5.0,19,84,41,27,16,123,76,47,150,59.5
5,6,Algeria,2.0,18,74,28,21,25,93,85,8,105,47.3
6,7,Zambia,1.0,17,67,26,20,21,81,69,12,98,48.8
7,8,Tunisia,1.0,19,75,23,29,23,94,91,3,98,43.5
8,9,Morocco,1.0,17,65,24,23,18,74,58,16,95,48.7
10,11,Senegal,,15,60,23,14,23,69,54,15,83,46.1


The previous line of code returns all entries where the number of wins is greater than 20. We can combine multiple conditions in the same expression.
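
To make the "shortcut" concrete, the following sketch compares `query()` with ordinary boolean indexing on a hypothetical DataFrame (the `toy` frame and its values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame(
    {"Team": ["A", "B", "C"], "Win": [25, 10, 30], "Lose": [12, 30, 22]}
)

# Filtering with a query string.
by_query = toy.query("Win > 20")

# The same filter written as a boolean mask.
by_mask = toy[toy["Win"] > 20]

print(by_query)
print(by_query.equals(by_mask))
```

Both forms select the same rows; the query string simply avoids repeating the DataFrame name for each column.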

In [18]:
display(df.query("Win > 20 and Lose < 20"))
display(df.query("Win > 20 & Lose < 20"))

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
4,5,Cameroon,5.0,19,84,41,27,16,123,76,47,150,59.5
8,9,Morocco,1.0,17,65,24,23,18,74,58,16,95,48.7


Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
4,5,Cameroon,5.0,19,84,41,27,16,123,76,47,150,59.5
8,9,Morocco,1.0,17,65,24,23,18,74,58,16,95,48.7


We have several boolean operators for combining conditions in an expression:

* __and, with its equivalent &__
* __or, with its equivalent |__
* __not, with its equivalent ~__
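
The sketch below checks these equivalences on a hypothetical DataFrame (the `toy` frame and its values are assumptions for illustration). Note that inside a query string `&` and `|` take the precedence of `and` and `or`, so no parentheses are needed around each comparison, whereas plain boolean indexing does need them.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame(
    {"Team": ["A", "B", "C"], "Win": [25, 10, 30], "Lose": [12, 30, 22]}
)

# "and" and "&" give the same result inside query().
with_and = toy.query("Win > 20 and Lose < 20")
with_amp = toy.query("Win > 20 & Lose < 20")

# Outside query(), each comparison must be wrapped in parentheses.
with_mask = toy[(toy["Win"] > 20) & (toy["Lose"] < 20)]

# "not" and "~" are interchangeable as well, here combined with "or" / "|".
with_not = toy.query("not (Win > 20 or Lose < 20)")
with_tilde = toy.query("~(Win > 20 | Lose < 20)")

print(with_and["Team"].tolist())
print(with_not["Team"].tolist())
```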

We can find all African countries that have never won the Africa Cup of Nations like this:

In [27]:
display(df.query("~(Titles>= 1)"))
display(df.query("not(Titles>= 1)"))

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
10,11,Senegal,,15,60,23,14,23,69,54,15,83,46.1
11,12,Mali,,11,50,17,17,16,61,64,3,68,45.3
13,14,Guinea,,12,43,12,16,15,59,63,4,52,40.3
14,15,Burkina Faso,,11,41,7,13,21,38,62,24,34,27.6
17,18,Gabon,,7,21,6,7,8,19,26,7,25,39.7
18,19,Angola,,8,26,4,12,10,30,39,9,24,30.7
20,21,Togo,,8,25,3,8,14,19,42,23,17,22.7
21,22,Equatorial Guinea,,2,10,4,3,3,8,10,2,15,50.0
22,23,Uganda,,7,23,4,3,16,21,38,17,15,21.7
23,24,Libya,,3,11,3,5,3,12,13,1,14,42.4


Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
10,11,Senegal,,15,60,23,14,23,69,54,15,83,46.1
11,12,Mali,,11,50,17,17,16,61,64,3,68,45.3
13,14,Guinea,,12,43,12,16,15,59,63,4,52,40.3
14,15,Burkina Faso,,11,41,7,13,21,38,62,24,34,27.6
17,18,Gabon,,7,21,6,7,8,19,26,7,25,39.7
18,19,Angola,,8,26,4,12,10,30,39,9,24,30.7
20,21,Togo,,8,25,3,8,14,19,42,23,17,22.7
21,22,Equatorial Guinea,,2,10,4,3,3,8,10,2,15,50.0
22,23,Uganda,,7,23,4,3,16,21,38,17,15,21.7
23,24,Libya,,3,11,3,5,3,12,13,1,14,42.4
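
The negation works because teams without a title have a missing value (NaN) in the Titles column, and any comparison with NaN evaluates to False. A minimal sketch on a hypothetical DataFrame (the `toy` frame and its values are invented for illustration) shows why the condition is negated instead of written as `Titles < 1`.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: NaN means the team has no title.
toy = pd.DataFrame(
    {"Team": ["A", "B", "C", "D"], "Titles": [2.0, np.nan, 1.0, np.nan]}
)

# NaN >= 1 is False, so negating the comparison selects the NaN rows.
no_title = toy.query("~(Titles >= 1)")

# NaN < 1 is also False, so this direct comparison finds nothing here.
direct = toy.query("Titles < 1")

# An explicit missing-value test selects the same rows as the negation.
by_isna = toy[toy["Titles"].isna()]

print(no_title["Team"].tolist())
print(len(direct))
```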


In [28]:
# To query all African countries with more than 20 wins and fewer than 20 losses, we write this condition:

df.query("Win > 20 and Lose < 20 ")

Unnamed: 0,Rank,Team,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
4,5,Cameroon,5.0,19,84,41,27,16,123,76,47,150,59.5
8,9,Morocco,1.0,17,65,24,23,18,74,58,16,95,48.7


### 3. DataFrame.where()

DataFrame.where() is another pandas method that is not regularly used but can be useful in certain cases. This method replaces the values for which our condition evaluates to False. In other words, it replaces every value that does not satisfy the condition with the value we provide, or with NaN by default. The signature of this method is:

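DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None, errors="raise", try_cast=NoDefault.no_default)

This is the signature documented for older pandas 1.x releases; from pandas 2.0 onward the __errors__ and __try_cast__ parameters are no longer accepted.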

where

* __cond__ is the condition to be satisfied by each entry. This condition can be a boolean Series/DataFrame, an array-like, or a callable

* __other__ specifies the value to give to each cell where the condition evaluates to __False__
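
As a small sketch of how __cond__ and __other__ interact, the snippet below applies `where()` to a hypothetical numeric DataFrame (the `toy` frame, its values and the replacement value `-1` are assumptions for illustration). It also shows `mask()`, which replaces values where the condition is True and is therefore the mirror image of `where()`.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame({"Played": [100, 99, 93], "Win": [57, 54, 51]})

# Keep values below 100; replace everything else with -1.
kept = toy.where(toy < 100, other=-1)

# mask() replaces where the condition is True, so the inverted
# condition yields the same result.
masked = toy.mask(toy >= 100, other=-1)

print(kept)
print(kept.equals(masked))
```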

In [29]:
# For the following example, we drop the Team column so that the DataFrame contains only numerical values.

In [30]:
df.drop("Team", axis=1, inplace=True)

In [31]:
df.where(df < 100, "HIGH").head(10)

Unnamed: 0,Rank,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
0,1,7.0,24,HIGH,57,17,26,HIGH,88,76,HIGH,62.7
1,2,4.0,22,99,54,20,25,HIGH,82,48,HIGH,61.3
2,3,3.0,18,93,51,21,21,HIGH,89,43,HIGH,62.4
3,4,2.0,23,95,42,25,28,HIGH,HIGH,38,HIGH,53.0
4,5,5.0,19,84,41,27,16,HIGH,76,47,HIGH,59.5
5,6,2.0,18,74,28,21,25,93,85,8,HIGH,47.3
6,7,1.0,17,67,26,20,21,81,69,12,98,48.8
7,8,1.0,19,75,23,29,23,94,91,3,98,43.5
8,9,1.0,17,65,24,23,18,74,58,16,95,48.7
9,10,2.0,19,73,20,24,29,88,HIGH,14,84,38.3


In [33]:
df.where(df<100,"küçük").head(10)

Unnamed: 0,Rank,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
0,1,7.0,24,küçük,57,17,26,küçük,88,76,küçük,62.7
1,2,4.0,22,99,54,20,25,küçük,82,48,küçük,61.3
2,3,3.0,18,93,51,21,21,küçük,89,43,küçük,62.4
3,4,2.0,23,95,42,25,28,küçük,küçük,38,küçük,53.0
4,5,5.0,19,84,41,27,16,küçük,76,47,küçük,59.5
5,6,2.0,18,74,28,21,25,93,85,8,küçük,47.3
6,7,1.0,17,67,26,20,21,81,69,12,98,48.8
7,8,1.0,19,75,23,29,23,94,91,3,98,43.5
8,9,1.0,17,65,24,23,18,74,58,16,95,48.7
9,10,2.0,19,73,20,24,29,88,küçük,14,84,38.3


If we don’t set the __other__ parameter, the False entries are replaced by NaN.

In [34]:
df.where(df < 100).head()

Unnamed: 0,Rank,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
0,1,7.0,24,,57,17,26,,88.0,76,,62.7
1,2,4.0,22,99.0,54,20,25,,82.0,48,,61.3
2,3,3.0,18,93.0,51,21,21,,89.0,43,,62.4
3,4,2.0,23,95.0,42,25,28,,,38,,53.0
4,5,5.0,19,84.0,41,27,16,,76.0,47,,59.5
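
Notice that columns which received a NaN, such as Played and Goal_Against, are now shown with decimals (99.0, 88.0). NaN is a floating-point value, so an integer column that gets one is converted to float. The sketch below reproduces this on a hypothetical DataFrame (the `toy` frame and its values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data with integer columns only.
toy = pd.DataFrame({"Played": [100, 99, 93], "Win": [57, 54, 51]})

# No "other" given: values that fail the condition become NaN.
result = toy.where(toy < 100)

# Compare the column types before and after the replacement.
print(toy.dtypes)
print(result.dtypes)
print(result)
```

The Played column holds a NaN after the call and is therefore stored as float, while the values in Win are unchanged.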


In [35]:
# We can achieve the same result as before by passing a callable as the first parameter of the where method.

df.where(lambda x : x < 100, "Ok").head()

Unnamed: 0,Rank,Titles,Part's,Played,Win,Draw,Lose,Goal_For,Goal_Against,Goal_Difference,Points,Points%
0,1,7.0,24,Ok,57,17,26,Ok,88,76,Ok,62.7
1,2,4.0,22,99,54,20,25,Ok,82,48,Ok,61.3
2,3,3.0,18,93,51,21,21,Ok,89,43,Ok,62.4
3,4,2.0,23,95,42,25,28,Ok,Ok,38,Ok,53.0
4,5,5.0,19,84,41,27,16,Ok,76,47,Ok,59.5
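
When __cond__ is a callable, pandas calls it with the DataFrame itself and uses the boolean result as the condition. The following sketch checks that both forms agree on a hypothetical DataFrame (the `toy` frame, its values and the replacement value `0` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame({"Played": [100, 99, 93], "Win": [57, 54, 51]})

# Condition given as a callable that receives the DataFrame.
with_callable = toy.where(lambda x: x < 100, other=0)

# Condition given directly as a boolean DataFrame.
with_frame = toy.where(toy < 100, other=0)

print(with_callable)
print(with_callable.equals(with_frame))
```

The callable form is mostly convenient in method chains, where the intermediate DataFrame has no variable name to refer to.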


### Conclusion

We gave an overview of how the nlargest(), nsmallest(), query() and where() methods work. For more examples, you can check the pandas documentation for each method. The source code used for this article is available here:
https://github.com/tderick/datainsightonelineprogram/tree/main/pandas%20technique%20for%20datascience%20blog%204