<a href="https://colab.research.google.com/github/OptimalDecisions/sports-analytics-foundations/blob/main/pandas-basics/Pandas_Basics_2_5_Filtering_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pandas Basics 2.5


# Filtering using Conditions

  <img src = "../img/sa_logo.png" width="100" align="left">

  Ram Narasimhan

  <br><br><br>

  << [2.4 Column Operations](Pandas_Basics_2_4_Column_Operations.ipynb) | [2.5 Filtering using Conditions](Pandas_Basics_2_5_Filtering_Data.ipynb) | [2.6 Sorting](Pandas_Basics_2_6_Sorting.ipynb) >>


In this notebook, we will explore how we can filter rows in a data frame using "boolean masks" and Conditions.

- Learn what a Boolean mask is
- Filter a df by one condition
- Filter a df by two or more conditions
- Use AND (`&`) to combine two conditions
- Use OR (`|`) to combine two conditions
- Use `~` to negate a condition
- Combine multiple conditions
- Create new (smaller) DataFrames by subsetting rows based on a condition
- Use `np.where()` to search and modify columns

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline


In [4]:
url = "https://raw.githubusercontent.com/OptimalDecisions/sports-analytics-foundations/main/data/2022-2023%20NBA%20Player%20Stats%20-%20Regular.csv"
df = pd.read_csv(url, encoding = "ISO-8859-1", sep=";")


## What is a `Boolean Mask`?

A `Boolean mask` in Pandas is a Series of `True` and `False` values, one per row, that is used to filter a DataFrame. You create it by writing a conditional expression that is evaluated element-wise on a column, for example `df['PTS'] > 30`. The mask can then be used to selectively extract or modify data in the DataFrame.

### Step by Step Approach

- Creating a Condition:

You start by creating a condition or a logical expression that produces a boolean result. For example, you might want to find all rows where a certain column is greater than a specific value.

- Apply Mask and Filter Data:

You can use the boolean mask to filter the DataFrame, extracting only the rows that satisfy the condition.

In [12]:
cond = df['PTS'] > 30
print(cond.sum())

6
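
To make the mask itself visible, here is a minimal sketch on a tiny DataFrame. The frame `toy` is a stand-in built from a few rows that appear in this notebook's outputs, not the full dataset. It shows that a condition produces one `True`/`False` value per row, and why `.sum()` counts the matching rows.

```python
import pandas as pd

# A few rows copied from the outputs in this notebook (not the full dataset).
toy = pd.DataFrame({
    "Player": ["Joel Embiid", "Stephen Curry", "Damian Lillard", "Kyrie Irving"],
    "PTS": [33.1, 29.4, 32.2, 27.1],
})

mask = toy["PTS"] > 30      # one True/False value per row
print(mask.tolist())        # [True, False, True, False]
print(mask.dtype)           # bool
print(mask.sum())           # 2, because True counts as 1 and False as 0
print(mask.mean())          # 0.5, the fraction of rows that satisfy the condition
```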


In [13]:
df[cond]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
12,11,Giannis Antetokounmpo,PF,28,MIL,63,63,32.1,11.2,20.3,...,0.645,2.2,9.6,11.8,5.7,0.8,0.8,3.9,3.1,31.1
160,125,Luka Dončić,PG,23,DAL,66,66,36.2,10.9,22.0,...,0.742,0.8,7.8,8.6,8.0,1.4,0.5,3.6,2.5,32.4
184,143,Joel Embiid,C,28,PHI,66,66,34.6,11.0,20.1,...,0.857,1.7,8.4,10.2,4.2,1.0,1.7,3.4,3.1,33.1
209,164,Shai Gilgeous-Alexander,PG,24,OKC,68,68,35.5,10.4,20.3,...,0.905,0.9,4.0,4.8,5.5,1.6,1.0,2.8,2.8,31.4
373,292,Damian Lillard,PG,32,POR,58,58,36.3,9.6,20.7,...,0.914,0.8,4.0,4.8,7.3,0.9,0.3,3.3,1.9,32.2
590,465,Jayson Tatum,PF,24,BOS,74,74,36.9,9.8,21.1,...,0.854,1.1,7.7,8.8,4.6,1.1,0.7,2.9,2.2,30.1


## Filtering by Condition


In [14]:
df[df['PTS']>30]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
12,11,Giannis Antetokounmpo,PF,28,MIL,63,63,32.1,11.2,20.3,...,0.645,2.2,9.6,11.8,5.7,0.8,0.8,3.9,3.1,31.1
160,125,Luka Dončić,PG,23,DAL,66,66,36.2,10.9,22.0,...,0.742,0.8,7.8,8.6,8.0,1.4,0.5,3.6,2.5,32.4
184,143,Joel Embiid,C,28,PHI,66,66,34.6,11.0,20.1,...,0.857,1.7,8.4,10.2,4.2,1.0,1.7,3.4,3.1,33.1
209,164,Shai Gilgeous-Alexander,PG,24,OKC,68,68,35.5,10.4,20.3,...,0.905,0.9,4.0,4.8,5.5,1.6,1.0,2.8,2.8,31.4
373,292,Damian Lillard,PG,32,POR,58,58,36.3,9.6,20.7,...,0.914,0.8,4.0,4.8,7.3,0.9,0.3,3.3,1.9,32.2
590,465,Jayson Tatum,PF,24,BOS,74,74,36.9,9.8,21.1,...,0.854,1.1,7.7,8.8,4.6,1.1,0.7,2.9,2.2,30.1


Summary

- A Boolean mask acts as a filter that helps you focus on specific parts of your data based on certain conditions.
- It allows us to create logical conditions and then use those conditions to either extract or modify the relevant data.
- This concept is central to data manipulation and analysis in Pandas.
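
The cells above use a mask to extract rows. A mask can also be used to modify data in place, typically through `.loc[mask, column]`. The sketch below assumes a tiny stand-in frame `toy` (rows copied from this notebook's outputs) and a hypothetical `Tier` column that exists only for this illustration.

```python
import pandas as pd

# A few rows copied from the outputs in this notebook (not the full dataset).
toy = pd.DataFrame({
    "Player": ["Joel Embiid", "Stephen Curry", "Damian Lillard", "Kyrie Irving"],
    "PTS": [33.1, 29.4, 32.2, 27.1],
})

# "Tier" is a hypothetical column, added only for this example.
toy["Tier"] = "Regular"

mask = toy["PTS"] > 30
toy.loc[mask, "Tier"] = "30+ scorer"   # assign only in rows where the mask is True

print(toy)
```

Rows where the mask is `False` keep their original `Tier` value.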


## Using Two or More Conditions

In practice, a single condition is often not enough.
To filter on two or more conditions in Pandas, we combine them with the element-wise logical operators `&` (and) and `|` (or), and we negate a condition with `~` (not). Python's `and`, `or`, and `not` keywords do not work on Boolean masks.

Let's look at a few examples:

### First AND second condition


In [19]:
cond1 = df['3P'] > 3
cond2 = df['PTS'] > 25

df[cond1 & cond2]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
139,108,Stephen Curry,PG,34,GSW,56,56,34.7,10.0,20.2,...,0.915,0.7,5.4,6.1,6.3,0.9,0.4,3.2,2.1,29.4
293,230,Kyrie Irving,PG,30,TOT,60,60,37.4,9.9,20.1,...,0.905,1.0,4.1,5.1,5.5,1.1,0.8,2.1,2.8,27.1
294,230,Kyrie Irving,PG,30,BRK,40,40,37.0,10.0,20.5,...,0.883,1.0,4.2,5.1,5.3,1.0,0.8,2.3,2.7,27.1
373,292,Damian Lillard,PG,32,POR,58,58,36.3,9.6,20.7,...,0.914,0.8,4.0,4.8,7.3,0.9,0.3,3.3,1.9,32.2
427,338,Donovan Mitchell,SG,26,CLE,68,68,35.8,10.0,20.6,...,0.867,0.9,3.3,4.3,4.4,1.5,0.4,2.6,2.5,28.3
590,465,Jayson Tatum,PF,24,BOS,74,74,36.9,9.8,21.1,...,0.854,1.1,7.7,8.8,4.6,1.1,0.7,2.9,2.2,30.1


Only 5 distinct players make more than 3 three-pointers per game (`3P`) and also average more than 25 points (`PTS`). The table has 6 rows because Kyrie Irving played for 2 different teams and therefore appears twice.
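
The two conditions can also be written inline, in which case each comparison needs its own parentheses. The sketch below uses a small stand-in frame `toy` (rows copied from this notebook's outputs, so the counts differ from the full dataset) and also shows one way to count distinct players when a name appears on more than one row.

```python
import pandas as pd

# A few rows copied from the outputs in this notebook (not the full dataset).
toy = pd.DataFrame({
    "Player": ["Stephen Curry", "Kyrie Irving", "Kyrie Irving",
               "Damian Lillard", "Joel Embiid"],
    "Tm": ["GSW", "TOT", "BRK", "POR", "PHI"],
    "3P": [4.9, 3.1, 3.3, 4.2, 1.0],
    "PTS": [29.4, 27.1, 27.1, 32.2, 33.1],
})

# Each comparison is wrapped in parentheses because & binds more
# tightly than > in Python.
both = toy[(toy["3P"] > 3) & (toy["PTS"] > 25)]

print(len(both))                  # 4 rows, Kyrie Irving appears twice
print(both["Player"].nunique())   # 3 distinct players
```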

### Either condition 1 OR condition 2

We can use the OR operator, written with the `pipe` symbol `|`, to keep rows that satisfy at least one of the two conditions.





In [24]:
useful_cols = ['Player', 'Pos', 'Tm', 'FGA', 'FG', '3PA', '3P']
cond_high_3PA = df['3PA'] > 10
cond_high_FGA = df['FGA']> 20

df[cond_high_3PA | cond_high_FGA][useful_cols]

Unnamed: 0,Player,Pos,Tm,FGA,FG,3PA,3P
12,Giannis Antetokounmpo,PF,MIL,20.3,11.2,2.7,0.7
24,LaMelo Ball,PG,CHO,20.0,8.2,10.6,4.0
66,Devin Booker,SG,PHO,20.1,9.9,6.0,2.1
85,Jaylen Brown,SF,BOS,20.6,10.1,7.3,2.4
139,Stephen Curry,PG,GSW,20.2,10.0,11.4,4.9
160,Luka Dončić,PG,DAL,22.0,10.9,8.2,2.8
184,Joel Embiid,C,PHI,20.1,11.0,3.0,1.0
209,Shai Gilgeous-Alexander,PG,OKC,20.3,10.4,2.5,0.9
293,Kyrie Irving,PG,TOT,20.1,9.9,8.3,3.1
294,Kyrie Irving,PG,BRK,20.5,10.0,8.7,3.3
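
The cell above filters rows first and then selects columns in a second step. An alternative that does both in one step is `.loc[row_mask, column_list]`. This is a sketch on a stand-in frame `toy`: three rows come from the output above, and the last row is a made-up player added only so that one row fails both conditions.

```python
import pandas as pd

toy = pd.DataFrame({
    # First three rows are copied from the output above.
    # "Hypothetical Player" is invented for illustration.
    "Player": ["Giannis Antetokounmpo", "LaMelo Ball", "Stephen Curry",
               "Hypothetical Player"],
    "Pos": ["PF", "PG", "PG", "SG"],
    "FGA": [20.3, 20.0, 20.2, 9.0],
    "3PA": [2.7, 10.6, 11.4, 4.0],
})

cols = ["Player", "FGA", "3PA"]
high_3pa = toy["3PA"] > 10
high_fga = toy["FGA"] > 20

# Rows: either condition is True. Columns: only the ones listed in cols.
result = toy.loc[high_3pa | high_fga, cols]
print(result)
```

Giannis qualifies through `FGA` only, LaMelo Ball through `3PA` only, and Stephen Curry through both, so all three are kept.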


## Using a Negative Condition (NOT)

The tilde `~` operator negates a Boolean column. That is, it flips every `True` to `False` and every `False` to `True`.

To use the Negative condition, we would write the *opposite* of what we want, and then we flip the Boolean mask.

This example will make things clear.

Goal: We are looking for NBA players who are 38 years or older.


In [33]:
young_players = df['Age'] < 38

df[~young_players]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
258,201,Udonis Haslem,C,42,MIA,7,1,10.1,1.4,4.1,...,0.8,0.6,1.0,1.6,0.0,0.1,0.3,0.1,1.6,3.9
290,227,Andre Iguodala,SF,39,GSW,8,0,14.1,0.9,1.9,...,0.667,0.4,1.8,2.1,2.4,0.5,0.4,1.1,1.4,2.1
306,239,LeBron James,PF,38,LAL,55,54,35.5,11.1,22.2,...,0.768,1.2,7.1,8.3,6.8,0.9,0.6,3.2,1.6,28.9
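
Negating `Age < 38` is usually the same as writing `Age >= 38` directly, with one caveat: rows with a missing age behave differently. The sketch below assumes a stand-in frame `toy` with three ages copied from this notebook's outputs and one made-up row with a missing age. Whether the real dataset has missing ages is not checked here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    # The last row is hypothetical and exists only to show the NaN case.
    "Player": ["Udonis Haslem", "LeBron James", "Stephen Curry",
               "Hypothetical Player"],
    "Age": [42, 38, 34, np.nan],
})

young = toy["Age"] < 38          # NaN < 38 evaluates to False

negated = toy[~young]            # the NaN row flips to True and is kept
direct = toy[toy["Age"] >= 38]   # NaN >= 38 is False, so the row is dropped

print(negated["Player"].tolist())
# ['Udonis Haslem', 'LeBron James', 'Hypothetical Player']
print(direct["Player"].tolist())
# ['Udonis Haslem', 'LeBron James']
```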


## Combining Multiple conditions

We can combine several conditions. Be careful when mixing OR and AND: `&` is evaluated before `|`, so use parentheses to group the conditions the way you intend.

Example: We only want Point Guards (PG).
Among them, we are looking for players with high 3PA or high FGA.

Let's see how we can combine everything.

In [26]:
cond_PG = df['Pos']=='PG'

df[cond_PG & (cond_high_FGA | cond_high_3PA)][useful_cols]

Unnamed: 0,Player,Pos,Tm,FGA,FG,3PA,3P
24,LaMelo Ball,PG,CHO,20.0,8.2,10.6,4.0
139,Stephen Curry,PG,GSW,20.2,10.0,11.4,4.9
160,Luka Dončić,PG,DAL,22.0,10.9,8.2,2.8
209,Shai Gilgeous-Alexander,PG,OKC,20.3,10.4,2.5,0.9
293,Kyrie Irving,PG,TOT,20.1,9.9,8.3,3.1
294,Kyrie Irving,PG,BRK,20.5,10.0,8.7,3.3
373,Damian Lillard,PG,POR,20.7,9.6,11.3,4.2
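
The parentheses in the filter above matter. As a minimal sketch of what can go wrong, the example below builds a stand-in frame `toy` from four rows of this notebook's outputs and compares the grouped expression with an ungrouped one. The names `is_pg`, `high_fga`, and `high_3pa` are local to this example.

```python
import pandas as pd

# Four rows copied from the outputs in this notebook (not the full dataset).
toy = pd.DataFrame({
    "Player": ["Giannis Antetokounmpo", "Stephen Curry", "Joel Embiid",
               "Shai Gilgeous-Alexander"],
    "Pos": ["PF", "PG", "C", "PG"],
    "FGA": [20.3, 20.2, 20.1, 20.3],
    "3PA": [2.7, 11.4, 3.0, 2.5],
})

is_pg = toy["Pos"] == "PG"
high_fga = toy["FGA"] > 20
high_3pa = toy["3PA"] > 10

# Intended: point guards who have high FGA or high 3PA.
intended = toy[is_pg & (high_fga | high_3pa)]

# Without parentheses, & is applied first: high_fga | (high_3pa & is_pg).
unintended = toy[high_fga | high_3pa & is_pg]

print(intended["Player"].tolist())
# ['Stephen Curry', 'Shai Gilgeous-Alexander']
print(len(unintended))   # 4, the PF and C rows slip through via high_fga
```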


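## Creating New Dataframe by Subsetting Rows

We can use the conditions above to create a smaller df (a subset).

It is good practice to call the `.copy()` method so that the subset is clearly separate from the original df. Later changes to the subset then cannot affect the original, and this avoids the `SettingWithCopyWarning` that Pandas can raise when you modify a filtered DataFrame.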




In [36]:
cond = df['Tm'] == 'GSW'

gsw_df = df[cond].copy()

In [39]:
gsw_df.shape


(18, 30)

In [40]:
gsw_df.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
23,20,Patrick Baldwin Jr.,SF,20,GSW,31,0,7.3,1.4,3.5,...,0.667,0.0,1.3,1.3,0.4,0.2,0.1,0.4,0.5,3.9
139,108,Stephen Curry,PG,34,GSW,56,56,34.7,10.0,20.2,...,0.915,0.7,5.4,6.1,6.3,0.9,0.4,3.2,2.1,29.4
159,124,Donte DiVincenzo,SG,26,GSW,72,36,26.3,3.3,7.5,...,0.817,1.1,3.4,4.5,3.5,1.3,0.1,1.6,1.8,9.4
227,176,Draymond Green,PF,32,GSW,73,73,31.5,3.4,6.5,...,0.713,0.9,6.3,7.2,6.8,1.0,0.8,2.8,3.1,8.5
229,178,JaMychal Green,PF,32,GSW,57,1,14.0,2.4,4.4,...,0.776,1.3,2.3,3.6,0.9,0.4,0.4,0.9,1.8,6.4
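
To illustrate what "clearly separate" means, here is a minimal sketch on a stand-in frame `toy` (three rows copied from this notebook's outputs). It modifies the subset and then confirms that the original is unchanged. The replacement label `"Warriors"` is arbitrary and used only for this example.

```python
import pandas as pd

# Three rows copied from the outputs in this notebook (not the full dataset).
toy = pd.DataFrame({
    "Player": ["Stephen Curry", "Draymond Green", "Damian Lillard"],
    "Tm": ["GSW", "GSW", "POR"],
    "PTS": [29.4, 8.5, 32.2],
})

gsw = toy[toy["Tm"] == "GSW"].copy()   # independent subset
gsw["Tm"] = "Warriors"                 # change the subset only

print(gsw["Tm"].tolist())   # ['Warriors', 'Warriors']
print(toy["Tm"].tolist())   # ['GSW', 'GSW', 'POR'], the original is untouched
```

Note that the subset keeps the original row labels (here `0` and `1`), just as `gsw_df` above keeps labels such as `23` and `139`.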


## Using `np.where()` to Search and Modify



Let's say we believe that the two most offense-oriented positions in the NBA are typically the Shooting Guard (SG) and the Small Forward (SF). (This is debatable, but this is just an example!)

We want to add a new column, `Style`, to the player df, with the value "Offense" or "Defense". If a player's `Pos` is SG or SF, we label them "Offense"; everyone else is labeled "Defense".

We can do this in one line using `np.where(condition, value_if_true, value_if_false)`, which picks the first value where the condition is `True` and the second value where it is `False`.

In [42]:
df['Pos'].unique()

array(['C', 'SG', 'PF', 'PG', 'SF', 'PF-SF', 'SF-SG', 'SG-PG'],
      dtype=object)

In [44]:
pos_SF = df['Pos']=='SF'
pos_SG = df['Pos']=='SG'
df["Style"] = np.where(pos_SF | pos_SG, "Offense", "Defense")

In [46]:
df.sample(10)  # Look at the newly created Style column, which we created using np.where()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Style
274,215,Justin Holiday,SG,33,TOT,46,2,15.3,1.7,4.4,...,0.1,1.1,1.2,0.9,0.4,0.4,0.5,1.5,4.5,Offense
270,211,George Hill,PG,36,IND,11,1,15.1,1.7,3.1,...,0.2,1.5,1.6,1.9,0.6,0.3,0.6,1.3,5.2,Defense
263,206,Juancho Hernangómez,PF,27,TOR,42,10,14.6,1.1,2.7,...,0.6,2.3,2.9,0.6,0.4,0.1,0.4,1.0,2.9,Defense
49,38,Dāvis Bertāns,PF,30,DAL,45,1,10.9,1.5,3.6,...,0.2,1.0,1.2,0.5,0.2,0.2,0.2,1.2,4.6,Defense
183,142,Keon Ellis,SG,23,SAC,16,0,4.4,0.4,1.0,...,0.3,0.3,0.5,0.4,0.3,0.1,0.1,0.6,1.5,Offense
662,525,Vince Williams Jr.,SG,22,MEM,15,1,7.0,0.8,2.7,...,0.3,0.7,1.0,0.3,0.4,0.1,0.3,0.8,2.0,Offense
519,408,Davon Reed,SG,27,TOT,43,1,8.0,0.7,2.0,...,0.2,1.1,1.4,0.5,0.3,0.1,0.5,0.9,2.1,Offense
291,228,Joe Ingles,SF,35,MIL,46,0,22.7,2.3,5.4,...,0.3,2.5,2.8,3.3,0.7,0.1,1.2,1.6,6.9,Offense
125,96,John Collins,PF,25,ATL,71,71,30.0,5.1,10.0,...,1.1,5.4,6.5,1.2,0.6,1.0,1.1,3.1,13.1,Defense
47,36,Malik Beasley,SG,26,LAL,26,14,23.9,4.0,10.3,...,0.3,3.0,3.3,1.2,0.8,0.0,1.2,1.2,11.1,Offense
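
One detail worth noting: the equality checks above match only the exact labels `SF` and `SG`, so combined labels such as `SF-SG` end up as "Defense". The sketch below uses a stand-in frame `toy` that holds just the position labels listed by `df['Pos'].unique()`. It rewrites the same rule with `isin()`, and then shows a hypothetical variant (the `Style_any` column is not part of the original notebook) that also counts combined labels.

```python
import numpy as np
import pandas as pd

# Position labels taken from the df['Pos'].unique() output above.
toy = pd.DataFrame(
    {"Pos": ["C", "SG", "PF", "PG", "SF", "PF-SF", "SF-SG", "SG-PG"]}
)

# Same rule as above, written with isin() instead of two == comparisons.
exact = toy["Pos"].isin(["SF", "SG"])
toy["Style"] = np.where(exact, "Offense", "Defense")

# Hypothetical variant: any label that contains SF or SG counts as Offense.
any_part = toy["Pos"].str.contains("SF|SG")
toy["Style_any"] = np.where(any_part, "Offense", "Defense")

print(toy)
```

Which of the two rules is appropriate depends on how you want to treat players listed at two positions.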



