<a href="https://colab.research.google.com/github/MonkeyWrenchGang/MGTPython/blob/main/module_2/2_filtering_rows_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Filtering Rows of Data in a Pandas DataFrame

It is pretty easy to select specific rows of a DataFrame based on conditions. Behind the scenes, pandas uses boolean indexing: it creates a mask of True and False values that marks the rows meeting your desired conditions. 
Once you have your mask, you can use it to select only the rows of the DataFrame that meet your conditions.

There are different ways of filtering rows of data in pandas. One way is by using the boolean indexing method, which allows you to filter rows by passing a boolean expression to the DataFrame. The expression should be such that it returns a boolean value (True or False) for each row of the DataFrame. For example, to filter a DataFrame df to show only the rows where the value in the "Age" column is greater than 30, you could use the following code: df[df["Age"] > 30]. This will return a new DataFrame with only the rows where the value in the "Age" column is greater than 30.

Another way to filter the rows is by using the query() method. It allows you to filter the rows by passing a string containing a boolean expression as an argument. This can be more readable than chaining multiple bracketed conditions together, and on large DataFrames it can sometimes be faster as well. For example, you can use the following code to filter a DataFrame df to show only the rows where the value in the "Age" column is greater than 30 and the value in the "Name" column starts with 'A': df.query('Age > 30 and Name.str.startswith("A")')
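To make the two approaches concrete, here is a minimal sketch on a tiny made-up DataFrame. The `people` frame and its values are assumptions for illustration only; they are not part of the Cramer dataset used later.

```python
import pandas as pd

# Hypothetical toy data, just to show the mechanics
people = pd.DataFrame({"Name": ["Ann", "Bob", "Al", "Cy"],
                       "Age": [34, 45, 28, 31]})

# Boolean indexing: build the True/False mask first, then use it to select rows
mask = people["Age"] > 30
print(mask.tolist())        # [True, True, False, True]
over_30 = people[mask]

# query(): the same condition written as a string expression
same_rows = people.query("Age > 30")

print(over_30)
print(same_rows)
```

Both calls should select the same three rows; which one you use is mostly a matter of readability.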


---

In this tutorial we are going to do some basic filtering of rows using both methods. 

## "boolean indexing" 
- filter using simple equality (==)
- filter using simple greater than less than 
- filter using compound conditions with and (&) and or (|)

## "query()" 
- filter using simple equality conditions 
- filter strings w. functions and equality condition 
- filter using compound conditions 
- filter with date parts 


---

### To do this we'll be using 

Mad Money stock picks by Jim Cramer, 2016-2022

found here: 

https://www.kaggle.com/datasets/diamondprox/jimcramer

which was pulled from : 
https://old.reddit.com/r/wallstreetbets/comments/109cswl/inverse_cramer_i_analyzed_all_21653_buy_and_sell/

and I've posted a copy here: 
https://raw.githubusercontent.com/MonkeyWrenchGang/MGTPython/main/module_2/data/Cramers%20Picks_2016%20to%202022_Performance%20v2.csv



## Import Libraries
As always, we set the stage by importing the necessary libraries and configuring our environment. 

```python
# -- notebook options -- 
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML
from IPython.display import clear_output
display(HTML("<style>.container { width:90% }</style>"))
import warnings
warnings.filterwarnings('ignore')

# -- key libraries --
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

# -- need this to render charts in notebook -- 
%matplotlib inline

```

Let's have some fun with this!


# Import Data

Let's pull the data into a dataframe 

```python 
# -- read the CSV straight from GitHub; the rest of the tutorial calls this DataFrame `cramer`
cramer = pd.read_csv("https://raw.githubusercontent.com/MonkeyWrenchGang/MGTPython/main/module_2/data/Cramers%20Picks_2016%20to%202022_Performance%20v2.csv")
cramer.head()
```

Unfortunately, this dataset is going to require quite a bit of cleanup, so let's dive into what we need to do. 

1. standardize column names to remove spaces, shift to lower case and replace special characters
2. parse the date column into a real date.
3. convert the % columns to numeric

In [4]:
cramer = pd.read_csv("https://raw.githubusercontent.com/MonkeyWrenchGang/MGTPython/main/module_2/data/Cramers%20Picks_2016%20to%202022_Performance%20v2.csv")
cramer.head()

Unnamed: 0,S.No,Company,Ticker,Date,Call,1-Day Change Recommendation,1-Week Change Recommendation,1-Month Change Recommendation,1-Year Change Recommendation,1-Day Change Benchmark,1-Week Change Benchmark,1-Month Change Benchmark,1-Year Change Benchmark
0,1,Lululemon Athletica,LULU,27/3/2018,Positive Mention,2.0%,2.7%,15.4%,95%,-0.4%,1.4%,3%,10%
1,2,Penn National Gaming,PENN,14/7/2020,Buy,10.3%,4.4%,52.5%,100%,-0.2%,1.6%,5%,37%
2,3,Simon Property Group,SPG,13/11/2020,Buy,-1.2%,8.2%,10.6%,121%,0.4%,-1.4%,1%,31%
3,4,PVH Corp,PVH,25/5/2016,Buy,-2.5%,3.8%,-9.4%,13%,0.0%,0.7%,-4%,18%
4,5,Broadcom,AVGO,2/6/2016,Positive Mention,-1.9%,-1.1%,-7.0%,58%,0.0%,-0.1%,0%,18%


## Clean up column names
One simple yet powerful tip that will make your Pandas experience smoother is to clean up your column names. This includes making them all lowercase (a-z), removing special characters, and replacing spaces with underscores. This will ensure that dealing with columns is consistent and effortless. Trust me, this one small step will save you a lot of headaches in the long run.

Let's see how to do this. Copy and paste the following code: 
```python
cramer.columns = ( cramer.columns
    .str.strip()           # -- remove leading / trailing spaces 
    .str.lower()           # -- lower case column names 
    .str.replace(' ', '_') # -- replace spaces with underscore 
    .str.replace('-', '_') # -- replace dash with underscore 
    .str.replace('.', '_', regex=False) # -- replace period with underscore (literal match, not regex)
    .str.replace('(', '', regex=False)  # -- remove open paren
    .str.replace(')', '', regex=False)  # -- remove close paren
    .str.replace('?', '', regex=False)  # -- remove question mark 
    .str.replace('\'', '') # -- remove single quote; notice the backslash \ which is an escape character
)

print(cramer.columns)
cramer.head()
```
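Here is a small self-contained sketch of the same cleanup on a few made-up column names (the names in `raw_names` are assumptions, chosen to exercise each step). It passes `regex=False` explicitly so that characters such as `.`, `(` and `?` are treated as literal text; the default for this parameter has changed between pandas versions, so being explicit avoids surprises.

```python
import pandas as pd

# Hypothetical messy column names
raw_names = pd.Index([" S.No", "1-Day Change (Rec)", "Buy?"])

clean_names = (
    raw_names
    .str.strip()                          # remove leading / trailing spaces
    .str.lower()                          # lower case
    .str.replace(" ", "_", regex=False)   # spaces -> underscore
    .str.replace("-", "_", regex=False)   # dash -> underscore
    .str.replace(".", "_", regex=False)   # period -> underscore
    .str.replace("(", "", regex=False)    # drop open paren
    .str.replace(")", "", regex=False)    # drop close paren
    .str.replace("?", "", regex=False)    # drop question mark
)

print(list(clean_names))  # ['s_no', '1_day_change_rec', 'buy']
```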

In [5]:
cramer.columns = ( cramer.columns
    .str.strip()           # -- remove leading / trailing spaces 
    .str.lower()           # -- lower case column names 
    .str.replace(' ', '_') # -- replace spaces with underscore 
    .str.replace('-', '_') # -- replace dash with underscore 
    .str.replace('.', '_', regex=False) # -- replace period with underscore (literal match, not regex)
    .str.replace('(', '', regex=False)  # -- remove open paren
    .str.replace(')', '', regex=False)  # -- remove close paren
    .str.replace('?', '', regex=False)  # -- remove question mark 
    .str.replace('\'', '') # -- remove single quote; notice the backslash \ which is an escape character
)

print(cramer.columns)
cramer.head()

Index(['s_no', 'company', 'ticker', 'date', 'call',
       '1_day_change_recommendation', '1_week_change_recommendation',
       '1_month_change_recommendation', '1_year_change_recommendation',
       '1_day_change_benchmark', '1_week_change_benchmark',
       '1_month_change_benchmark', '1_year_change_benchmark'],
      dtype='object')


Unnamed: 0,s_no,company,ticker,date,call,1_day_change_recommendation,1_week_change_recommendation,1_month_change_recommendation,1_year_change_recommendation,1_day_change_benchmark,1_week_change_benchmark,1_month_change_benchmark,1_year_change_benchmark
0,1,Lululemon Athletica,LULU,27/3/2018,Positive Mention,2.0%,2.7%,15.4%,95%,-0.4%,1.4%,3%,10%
1,2,Penn National Gaming,PENN,14/7/2020,Buy,10.3%,4.4%,52.5%,100%,-0.2%,1.6%,5%,37%
2,3,Simon Property Group,SPG,13/11/2020,Buy,-1.2%,8.2%,10.6%,121%,0.4%,-1.4%,1%,31%
3,4,PVH Corp,PVH,25/5/2016,Buy,-2.5%,3.8%,-9.4%,13%,0.0%,0.7%,-4%,18%
4,5,Broadcom,AVGO,2/6/2016,Positive Mention,-1.9%,-1.1%,-7.0%,58%,0.0%,-0.1%,0%,18%


## Convert date to a "date"

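The **to_datetime** function in the pandas library converts various types of date representations into a standard datetime format. It can handle input in a variety of formats, such as strings, integers, and datetime objects, and can also handle missing or malformed data relatively gracefully. One caveat: the raw dates in this file are written day-first (for example 27/3/2018), so an ambiguous value such as 2/6/2016 can be read month-first unless you pass `dayfirst=True` or an explicit `format="%d/%m/%Y"`. The outputs shown in this tutorial come from the default call, which is why that row appears as 2016-02-06.

Let's see how to do this: 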

```python
# -- fix date
cramer["date"] = pd.to_datetime(cramer["date"])
cramer.head()

# or

# -- another way 
date_as_string = cramer["date"]
cramer["date"] = pd.to_datetime(date_as_string)
cramer.head()
```
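Because the raw dates in this file look day-first (27/3/2018, 14/7/2020), it can be worth spelling out the format so that ambiguous values are not read month-first. The sketch below uses three date strings copied from the first rows of the data; the variable names are only illustrative.

```python
import pandas as pd

# Three raw date strings in day/month/year order
raw_dates = pd.Series(["27/3/2018", "14/7/2020", "2/6/2016"])

# An explicit format removes any guessing about day vs. month
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2018-03-27', '2020-07-14', '2016-06-02']
```

With the explicit format, 2/6/2016 becomes 2 June 2016 rather than 6 February 2016. Applying this to the `cramer` DataFrame would change some of the dates, and therefore some of the filter results, shown in the rest of this tutorial.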


In [6]:
date_as_string = cramer["date"]
cramer["date"] = pd.to_datetime(date_as_string)
cramer.head()

Unnamed: 0,s_no,company,ticker,date,call,1_day_change_recommendation,1_week_change_recommendation,1_month_change_recommendation,1_year_change_recommendation,1_day_change_benchmark,1_week_change_benchmark,1_month_change_benchmark,1_year_change_benchmark
0,1,Lululemon Athletica,LULU,2018-03-27,Positive Mention,2.0%,2.7%,15.4%,95%,-0.4%,1.4%,3%,10%
1,2,Penn National Gaming,PENN,2020-07-14,Buy,10.3%,4.4%,52.5%,100%,-0.2%,1.6%,5%,37%
2,3,Simon Property Group,SPG,2020-11-13,Buy,-1.2%,8.2%,10.6%,121%,0.4%,-1.4%,1%,31%
3,4,PVH Corp,PVH,2016-05-25,Buy,-2.5%,3.8%,-9.4%,13%,0.0%,0.7%,-4%,18%
4,5,Broadcom,AVGO,2016-02-06,Positive Mention,-1.9%,-1.1%,-7.0%,58%,0.0%,-0.1%,0%,18%


## Convert % columns to number

This is a little fancy: we use the `apply()` function along with a `lambda` function to strip the percent symbol and convert several columns of a Pandas DataFrame to a numeric data type at once. Dividing by 100 turns a value like 15.4% into the fraction 0.154. Don't worry about the details now; we'll cover this in more detail later.

Here is an example:

```python
# list columns to convert 
cols_to_convert = ["col1","col2"] 
# use apply and lambda to convert them. 
df[cols_to_convert] = df[cols_to_convert].apply(lambda x: pd.to_numeric(x.str.strip('%'), errors='coerce')/100)
```
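To see what the `apply()` and `lambda` combination does, here is a runnable sketch on a small made-up DataFrame. The `demo` frame, the ticker "XYZ" and the "n/a" value are assumptions added to show how `errors='coerce'` turns unparseable text into NaN.

```python
import pandas as pd

# Hypothetical sample with percent strings and one bad value
demo = pd.DataFrame({
    "ticker": ["LULU", "PENN", "XYZ"],
    "1_day_change": ["2.0%", "10.3%", "n/a"],
    "1_year_change": ["95%", "100%", "-4%"],
})

cols_to_convert = ["1_day_change", "1_year_change"]

# For each listed column: drop the % sign, convert to a number, scale to a fraction
demo[cols_to_convert] = demo[cols_to_convert].apply(
    lambda col: pd.to_numeric(col.str.strip("%"), errors="coerce") / 100
)

print(demo)
print(demo.dtypes)
```

The "n/a" entry comes out as NaN instead of raising an error, and both converted columns end up with a float dtype.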



In [7]:
cols_to_convert = cramer.filter(regex='benchmark|recommendation').columns
cramer[cols_to_convert] = cramer[cols_to_convert].apply(lambda x: pd.to_numeric(x.str.strip('%'), errors='coerce')/100)
cramer.head()

Unnamed: 0,s_no,company,ticker,date,call,1_day_change_recommendation,1_week_change_recommendation,1_month_change_recommendation,1_year_change_recommendation,1_day_change_benchmark,1_week_change_benchmark,1_month_change_benchmark,1_year_change_benchmark
0,1,Lululemon Athletica,LULU,2018-03-27,Positive Mention,0.02,0.027,0.154,0.95,-0.004,0.014,0.03,0.1
1,2,Penn National Gaming,PENN,2020-07-14,Buy,0.103,0.044,0.525,1.0,-0.002,0.016,0.05,0.37
2,3,Simon Property Group,SPG,2020-11-13,Buy,-0.012,0.082,0.106,1.21,0.004,-0.014,0.01,0.31
3,4,PVH Corp,PVH,2016-05-25,Buy,-0.025,0.038,-0.094,0.13,0.0,0.007,-0.04,0.18
4,5,Broadcom,AVGO,2016-02-06,Positive Mention,-0.019,-0.011,-0.07,0.58,0.0,-0.001,0.0,0.18


# 1. Boolean indexing

1. filter using simple equality (==)
  - how many times did Cramer mention "GME", and did any of those picks return more than 50% in the first week? 
```python
# 1. filter using simple equality (==)
cramer[cramer["ticker"] == "GME"][["company","ticker","date","call","1_week_change_recommendation"]]
```
2. filter using simple greater than less than
  - what picks returned more than 50% in the first week? 
```python
# 2. filter using simple greater than less than
cramer[cramer["1_week_change_recommendation"] > 0.5][["company","date","call","1_week_change_recommendation"]]
```

3. filter using compound conditions with and (&) and or (|)
  - what picks returned more than 50% in the first week and were a call equal to Sell? 
```python
# 3. filter using compound conditions with and (&) and or (|)
cramer[(cramer["1_week_change_recommendation"] > 0.5) & (cramer["call"] == "Sell")][["company","date","call","1_week_change_recommendation"]]
```
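The three filters above all follow the same pattern: build a True/False mask, then pass it to the DataFrame. The sketch below spells that out on a tiny made-up frame (the `picks` data and the shortened column name `week_change` are assumptions). Note the parentheses around each comparison when conditions are combined inline: `&` and `|` bind more tightly than `>` and `==`, so leaving them out typically raises an error.

```python
import pandas as pd

# Hypothetical picks, not taken from the real dataset
picks = pd.DataFrame({
    "ticker": ["GME", "AMC", "GME", "AVGO"],
    "call": ["Sell", "Buy", "Buy", "Sell"],
    "week_change": [0.615, 0.60, -0.07, 0.10],
})

# Each comparison produces one boolean value per row
big_move = picks["week_change"] > 0.5
is_sell = picks["call"] == "Sell"

both = picks[big_move & is_sell]      # and: rows where both masks are True
either = picks[big_move | is_sell]    # or: rows where at least one mask is True

# The same "and" filter written inline needs parentheses around each condition
inline = picks[(picks["week_change"] > 0.5) & (picks["call"] == "Sell")]

print(both)
print(either)
```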



In [16]:
# 1. filter using simple equality (==)
cramer[cramer["ticker"] == "GME"][["company","ticker","date","call","1_week_change_recommendation"]]

Unnamed: 0,company,ticker,date,call,1_week_change_recommendation
354,GameStop,GME,2021-07-06,Negative Mention,-0.258
934,GameStop,GME,2021-02-26,Sell,0.615
1023,GameStop,GME,2021-03-24,Sell,0.042
1123,GameStop,GME,2021-03-19,Sell,-0.068
1764,GameStop,GME,2016-08-22,Negative Mention,-0.112
2679,GameStop,GME,2022-01-25,Negative Mention,-0.031
2770,GameStop,GME,2019-03-29,Sell,-0.041
3112,GameStop,GME,2021-05-19,Negative Mention,0.491
6253,GameStop,GME,2017-11-17,Sell,0.105
8045,GameStop,GME,2017-11-28,Positive Mention,-0.006


In [14]:
# 2. filter using simple greater than less than
cramer[cramer["1_week_change_recommendation"] > 0.5][["company","date","call","1_week_change_recommendation"]]

Unnamed: 0,company,date,call,1_week_change_recommendation
373,MicroVision,2021-04-19,Sell,0.941
577,CareTrust REIT,2020-03-16,Sell,0.529
614,QuantumScape,2020-12-14,Positive Mention,1.117
780,GrowGeneration,2020-11-08,Negative Mention,1.514
934,GameStop,2021-02-26,Sell,0.615
971,fuboTV,2021-01-19,Negative Mention,0.551
3117,Pivotal Software,2019-07-08,Sell,0.58
6895,AMC Entertainment,2021-05-20,Negative Mention,1.162
7381,Corsair Gaming,2020-11-16,Buy,0.597
8017,United Natural Foods,2020-06-05,Sell,0.714


In [15]:
# 3. filter using compound conditions with and (&) and or (|)
cramer[(cramer["1_week_change_recommendation"] > 0.5) & (cramer["call"] == "Sell")][["company","date","call","1_week_change_recommendation"]]

Unnamed: 0,company,date,call,1_week_change_recommendation
373,MicroVision,2021-04-19,Sell,0.941
577,CareTrust REIT,2020-03-16,Sell,0.529
934,GameStop,2021-02-26,Sell,0.615
3117,Pivotal Software,2019-07-08,Sell,0.58
8017,United Natural Foods,2020-06-05,Sell,0.714
19205,Royal Caribbean Cruises,2020-01-04,Sell,0.62
19859,Virgin Galactic,2020-12-02,Sell,0.575
21128,Norwegian Cruise Line,2020-01-04,Sell,0.561
21311,Upstart,2022-10-05,Sell,0.726


# 2. Query() method

The query() method in Pandas allows you to filter rows of a DataFrame based on one or more conditions. It takes a string as an argument, which is the conditional expression used to filter the rows of the DataFrame. I prefer the query() method because it offers a more convenient and readable way to filter rows, and on large DataFrames it can also be faster when the optional numexpr library is installed.

You can use standard Python comparison operators, such as >, <, ==, !=, in, and not in, as well as logical operators, such as and, or, and not to filter rows.



---

Let's make some queries:
1. filter using simple equality conditions
  - What dates did Cramer recommend "Broadcom"? The problem is that the following query won't work when the company names carry leading or trailing spaces:
  ```python
  # won't work because of trailing spaces 
  cramer.query('company  == "Broadcom"')[["company","ticker","date","call"]]
  ```
  - instead we need to strip the spaces like this: 
  ```python
  # 1. filter using simple equality conditions
  cramer["company"] = cramer["company"].str.strip()
  cramer.query('company  == "Broadcom"')[["company","ticker","date","call"]]
  ```
  - or like this:
  ```python
  # 1. filter using simple equality conditions
  cramer.query('company.str.strip()  == "Broadcom"')[["company","ticker","date","call"]]
  ```
2. filter strings w. functions and equality condition
  - we just saw one of these with `str.strip()`. How about looking for company names containing "Block"? String methods like this may not be supported by the default engine, so we tell query to evaluate the expression with plain Python by passing engine="python".
  ```python
  # 2. filter strings w. functions
  cramer.query("company.str.contains('Block')",engine='python')[["company","date","call","1_week_change_recommendation"]]
  ```

3. filter using compound conditions
  - what stocks have a Buy call and returned over 50% in one week? Here is a gotcha: 1_week_change_recommendation starts with a number, which makes it an invalid Python variable name. To get around this we quote the column name with backticks, like this:
  ```python
  # 3. filter using compound conditions
  cramer.query('call == "Buy" & `1_week_change_recommendation` > 0.5')[["company","date","call","1_week_change_recommendation"]]
  ```
  - try switching Buy to Sell, using "and" instead of "&", and referencing a different column such as "1_month_change_recommendation". 

4. filter with date parts
  - can you filter for all of Cramer's Buy picks for January 2022 that had a 1 week change > 10%? 
  ```python
  # 4. filter with date parts
  cramer.query(' date.dt.month == 1 & date.dt.year == 2022 and call == "Buy" and `1_week_change_recommendation` > 0.1 ')\
[["company","date","call","1_week_change_recommendation"]]

  ```
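The query patterns above can be tried on a small made-up frame before running them on the full dataset. In this sketch the `picks` data and the shortened column name `1_week_change` are assumptions; the column name deliberately starts with a digit so that backticks are required.

```python
import pandas as pd

# Hypothetical picks; "1_week_change" is not a valid Python identifier
picks = pd.DataFrame({
    "company": ["Block", "Broadcom", "H&R Block", "Nucor"],
    "call": ["Buy", "Buy", "Sell", "Buy"],
    "1_week_change": [0.076, 0.60, 0.55, 0.115],
})

# Compound condition: backticks quote the awkward column name
big_buys = picks.query('call == "Buy" and `1_week_change` > 0.5')

# String method inside the expression, evaluated with the python engine
blocks = picks.query("company.str.contains('Block')", engine="python")

print(big_buys)
print(blocks)
```

Only Broadcom is a Buy with a one-week change above 0.5 here, and two company names contain "Block".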



In [30]:
# 1. filter using simple equality conditions
cramer.query('company.str.strip()  == "Broadcom"')[["company","ticker","date","call"]]

Unnamed: 0,company,ticker,date,call
4,Broadcom,AVGO,2016-02-06,Positive Mention
106,Broadcom,AVGO,2016-08-12,Positive Mention
611,Broadcom,AVGO,2020-01-23,Positive Mention
653,Broadcom,AVGO,2018-11-30,Negative Mention
736,Broadcom,AVGO,2020-11-06,Buy
...,...,...,...,...
20651,Broadcom,AVGO,2021-11-05,Buy
20696,Broadcom,AVGO,2018-01-31,Buy
20856,Broadcom,AVGO,2019-12-12,Buy
20903,Broadcom,AVGO,2016-08-11,Buy


In [32]:
# 2. filter strings w. functions

cramer.query("company.str.contains('Block')",engine='python')[["company","date","call","1_week_change_recommendation"]]

Unnamed: 0,company,date,call,1_week_change_recommendation
268,Block,2020-02-07,Buy,0.076
281,Block,2020-08-09,Positive Mention,0.047
316,Block,2021-01-03,Buy,-0.107
599,Block,2018-10-30,Buy,0.126
738,Block,2019-05-08,Buy,-0.025
...,...,...,...,...
20783,Block,2021-05-01,Sell,0.018
20987,Block,2020-06-11,Buy,-0.027
21004,Block,2020-07-05,Buy,0.055
21041,Block,2020-02-27,Buy,-0.123


In [22]:
# 3. filter using compound conditions
cramer.query('call == "Buy" and `1_week_change_recommendation` > 0.5')[["company","date","call","1_week_change_recommendation"]]

Unnamed: 0,company,date,call,1_week_change_recommendation
7381,Corsair Gaming,2020-11-16,Buy,0.597
8609,Robinhood,2021-07-29,Buy,0.565


In [29]:
cramer.query(' date.dt.month == 1 & date.dt.year == 2022 and call == "Buy" and `1_week_change_recommendation` > 0.1 ')\
[["company","date","call","1_week_change_recommendation"]]

Unnamed: 0,company,date,call,1_week_change_recommendation
158,Qualtrics,2022-01-26,Buy,0.114
163,Revolve Group,2022-01-26,Buy,0.124
654,Bed Bath & Beyond,2022-01-26,Buy,0.148
660,Cleveland-Cliffs,2022-01-25,Buy,0.161
879,Callon Petroleum,2022-01-26,Buy,0.114
880,Enphase Energy,2022-01-26,Buy,0.108
2097,Unity Software,2022-01-02,Buy,0.132
3010,Nucor,2022-01-27,Buy,0.115
9168,Dutch Bros,2022-01-24,Buy,0.194
19137,Advanced Micro Devices,2022-01-27,Buy,0.174


# Passing Values to Query()

One useful feature of Pandas is the ability to query data using the .query() method. In addition to passing plain strings to the query method, you can also pass variables to the query by using the @ symbol. This allows for more dynamic and flexible querying of your data, as the variables can be passed in from external sources or calculated dynamically.

Let's see some examples:

```python
# 1. what picks returned more than 20% and were a Buy call in 2022? 
max_one_week_return = 0.2
cramer_call = "Buy"
cramer.query(' date.dt.year == 2022 and call == @cramer_call and `1_week_change_recommendation` > @max_one_week_return ')\
[["company","date","call","1_week_change_recommendation"]]

# 2. what stock had the max one week return in 2022 and call was Buy? 

max_return = cramer.query(' date.dt.year == 2022 and call == @cramer_call')["1_week_change_recommendation"].max()
print("max return in 2022: {}\n".format(max_return))

cramer.query(' date.dt.year == 2022 and call == @cramer_call and `1_week_change_recommendation` == @max_return ')\
[["company","date","call","1_week_change_recommendation"]]

# 3. what stock had the max one week return and call was Sell
cramer_call = "Sell"

max_return = cramer.query(' date.dt.year == 2022 and call == @cramer_call')["1_week_change_recommendation"].max()
print("max return in 2022: {}\n".format(max_return))

cramer.query(' date.dt.year == 2022 and call == @cramer_call and `1_week_change_recommendation` == @max_return ')\
[["company","date","call","1_week_change_recommendation"]]


```
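As a stripped-down illustration of the `@` syntax, the sketch below filters a tiny made-up frame using two ordinary Python variables. The `picks` data and the names `wanted_call` and `min_return` are assumptions for this example only.

```python
import pandas as pd

# Hypothetical picks
picks = pd.DataFrame({
    "ticker": ["GME", "AMC", "AVGO", "GME"],
    "call": ["Sell", "Buy", "Buy", "Buy"],
    "week_change": [0.615, 0.25, 0.12, -0.07],
})

# Ordinary Python variables defined outside the query string
wanted_call = "Buy"
min_return = 0.1

# @name tells query() to look the value up in the surrounding Python scope
matches = picks.query("call == @wanted_call and week_change > @min_return")

print(matches)
```

Changing `wanted_call` or `min_return` and re-running the query is all it takes to ask a different question, which is what makes this handy inside loops or functions.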



In [33]:
max_one_week_return = 0.2
cramer_call = "Buy"
cramer.query(' date.dt.year == 2022 and call == @cramer_call and `1_week_change_recommendation` > @max_one_week_return ')\
[["company","date","call","1_week_change_recommendation"]]

Unnamed: 0,company,date,call,1_week_change_recommendation
51,Palo Alto Networks,2022-02-22,Buy,0.219
367,Nextdoor 0ings,2022-03-02,Buy,0.224
2411,Oceaneering International,2022-02-28,Buy,0.238
6996,Coupa Software,2022-03-15,Buy,0.304
20504,Dutch Bros,2022-03-14,Buy,0.202
20715,Roblox,2022-11-05,Buy,0.202
20859,Upstart,2022-01-24,Buy,0.246


In [37]:
max_return = cramer.query(' date.dt.year == 2022 and call == @cramer_call')["1_week_change_recommendation"].max()
print("max return in 2022: {}\n".format(max_return))

cramer.query(' date.dt.year == 2022 and call == @cramer_call and `1_week_change_recommendation` == @max_return ')\
[["company","date","call","1_week_change_recommendation"]]

max return in 2022: 0.304



Unnamed: 0,company,date,call,1_week_change_recommendation
6996,Coupa Software,2022-03-15,Buy,0.304


In [38]:
cramer_call = "Sell"

max_return = cramer.query(' date.dt.year == 2022 and call == @cramer_call')["1_week_change_recommendation"].max()
print("max return in 2022: {}\n".format(max_return))

cramer.query(' date.dt.year == 2022 and call == @cramer_call and `1_week_change_recommendation` == @max_return ')\
[["company","date","call","1_week_change_recommendation"]]


max return in 2022: 0.726



Unnamed: 0,company,date,call,1_week_change_recommendation
21311,Upstart,2022-10-05,Sell,0.726


# Conclusion 


---

In Pandas, there are two commonly used methods for filtering rows in a DataFrame: "Boolean indexing" and the "Query" method. The "Boolean indexing" approach is often favored by data engineers, while the "Query" method is often favored by data scientists and analytics professionals. The "@" symbol is used in the Query method to pass values to the query, making it a useful tool for automating data filtering processes.

Some ideas to try:

1. filter on a list of values with Boolean indexing 

```python
# Create a list of tickers to filter on
stocks_list = ['GME', 'AMC']

# Use the list in the filter
cramer[cramer['ticker'].isin(stocks_list)]

```

2. try the same stunt with a query like this

```python
# Create a list of tickers to filter on
stocks_list = ['GME', 'AMC']
cramer.query('ticker in @stocks_list')
```

3. order by date descending? 

```python
# Create a list of tickers to filter on, then sort newest first
stocks_list = ['GME', 'AMC']
cramer.query('ticker in @stocks_list').sort_values("date", ascending=False)
```