# 📓 Lesson 4: Selecting and Filtering Data

📘 In this lesson, you will learn:
1. How to select rows and columns from a DataFrame
2. How to use loc and iloc for selection
3. How to filter rows based on conditions
4. How to use functions like isin(), between(), and query()

## 📁 Step 1: Load the sample dataset
We will use the file Sales_January_2019.csv from the "data/" folder.

In [None]:
import pandas as pd

df = pd.read_csv('../data/Sales_January_2019.csv')

# View first few rows
print(df.head())

## 🧱 Step 2: Selecting Columns
You can select a single column by name:

In [None]:
# Select one column
print(df['Product'])

Or multiple columns:

In [None]:
# Select multiple columns
print(df[['Product', 'Price Each']])
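
Single and double brackets return different types, which matters when you chain further operations. The sketch below uses a tiny made-up table (the rows are invented for illustration and are not taken from the CSV file) to show the difference:

```python
import pandas as pd

# Tiny made-up table; only the column names mirror the sales file
sample = pd.DataFrame({
    "Product": ["iPhone", "USB-C Charging Cable"],
    "Price Each": [700.0, 11.95],
})

one_col = sample["Product"]        # single brackets -> a Series
one_col_df = sample[["Product"]]   # double brackets -> a DataFrame with one column

print(type(one_col).__name__)
print(type(one_col_df).__name__)
```

Use double brackets when you want to keep a table, even if it has only one column.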

## 🔢 Step 3: Selecting Rows by Position (with iloc)
iloc selects rows by integer position, counting from 0. A slice such as 0:5 stops before position 5:

In [None]:
# Get the first row
print(df.iloc[0])

# Get rows 0 to 4
print(df.iloc[0:5])
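
iloc also accepts negative positions and a second argument for columns. Here is a minimal sketch on a small made-up table (the data is invented for illustration):

```python
import pandas as pd

# Made-up rows; only the column names mirror the sales file
sample = pd.DataFrame({
    "Product": ["iPhone", "USB-C Charging Cable", "Wired Headphones", "27in FHD Monitor"],
    "Quantity Ordered": [1, 2, 1, 1],
    "Price Each": [700.0, 11.95, 11.99, 149.99],
})

first_two = sample.iloc[0:2]       # rows at positions 0 and 1 (2 is excluded)
last_row = sample.iloc[-1]         # negative positions count from the end
block = sample.iloc[0:2, [0, 2]]   # rows 0-1, columns at positions 0 and 2

print(first_two)
print(last_row)
print(block)
```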

## 🏷 Step 4: Selecting Rows by Label (with loc)
loc is used to select rows and columns by name/index label.

Example (after setting an index):

In [None]:
# Set Order ID as index
df_with_index = df.set_index('Order ID')

# Get row with a specific index label
print(df_with_index.loc['141235'])

💡 Tip: It is always a good idea to check "df.dtypes" after reading data to make sure the data type is correct.

You can check whether the Order ID column is stored as "object" (text) or "int":

In [None]:
# is it object or int?
print(df['Order ID'].dtype) 

If it is "object", look up the label as a string ("str"):

In [None]:
# Set Order ID as index
df_with_index = df.set_index('Order ID')

# Get row with a specific index label (str)
print(df_with_index.loc['141235'])

You can also convert the column to a number before calling set_index, if you want to be sure you can look up labels as integers:

In [None]:
df['Order ID'] = pd.to_numeric(df['Order ID'], errors='coerce')
df_with_index = df.set_index('Order ID')
print(df_with_index.loc[141235])

💡 errors='coerce' replaces any value that cannot be converted with NaN, which makes bad rows easy to find and remove.

💡 After conversion, it is better to remove the rows that become NaN to avoid problems in the analysis:
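
To see both tips in action, here is a small sketch on a made-up Series. The stray text "Order ID" and the missing value are invented examples of bad data, not rows quoted from the file:

```python
import pandas as pd

# Made-up values: two valid IDs, one stray text value, one missing value
raw = pd.Series(["141234", "Order ID", None, "141235"], name="Order ID")

# Values that cannot be parsed become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")
print(converted)

# Drop the NaNs, then convert to integers (NaN forces a float column until it is removed)
clean = converted.dropna().astype("int64")
print(clean)
```

Because NaN is a float, the column stays a float type until the NaN rows are dropped.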

🛠 Final proposed version (safe and understandable)

In [None]:
# Convert Order ID to numeric (in case it’s string)
df['Order ID'] = pd.to_numeric(df['Order ID'], errors='coerce')

# Drop NaNs if needed
df_clean = df.dropna(subset=['Order ID'])

# Set index
df_with_index = df_clean.set_index('Order ID')

# Safely check existence
order_id = 141235
if order_id in df_with_index.index:
    print(df_with_index.loc[order_id])
else:
    print(f"Order ID {order_id} not found.")

💡 Tip: To see which Order ID values are available, list the unique values:

In [None]:
print(df['Order ID'].dropna().unique())

🔸 If index is not set, you can still use .loc with conditions (see below).
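
As a rough sketch of what .loc can do, the example below passes a condition and a list of columns in one call, and then shows that label slices include both ends (unlike iloc). The table is made up for illustration:

```python
import pandas as pd

# Made-up rows; only the column names mirror the sales file
sample = pd.DataFrame({
    "Order ID": [141234, 141235, 141236, 141237],
    "Product": ["iPhone", "Lightning Charging Cable", "Wired Headphones", "27in FHD Monitor"],
    "Quantity Ordered": [1, 1, 2, 1],
})

# Condition for the rows, list of names for the columns
multi = sample.loc[sample["Quantity Ordered"] > 1, ["Order ID", "Product"]]
print(multi)

# With an index set, a label slice includes both endpoints
indexed = sample.set_index("Order ID")
span = indexed.loc[141235:141236]
print(span)
```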

## 🧪 Step 5: Filtering Rows with Conditions

You can filter rows based on a condition:

In [None]:
# Convert the 'Quantity Ordered' column to numeric
df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'], errors='coerce')

# Remove bad rows (NaNs)
df = df.dropna(subset=['Quantity Ordered'])

# Find all orders where quantity is more than 5
print(df[df['Quantity Ordered'] > 5])

You can combine conditions with & (and) and | (or):

In [None]:
# Orders in San Francisco and quantity > 2
# na=False treats a missing address as "no match" instead of raising an error
filtered = df[(df['Purchase Address'].str.contains('San Francisco', na=False)) & (df['Quantity Ordered'] > 2)]
print(filtered.head())
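
The other operators work the same way. The sketch below uses made-up addresses to show | (or) and ~ (not). Storing each condition in its own variable is only a readability choice, not a requirement:

```python
import pandas as pd

# Made-up rows; only the column names mirror the sales file
sample = pd.DataFrame({
    "Purchase Address": [
        "917 1st St, Dallas, TX 75001",
        "682 Chestnut St, Boston, MA 02215",
        "669 Spruce St, Los Angeles, CA 90001",
        "333 8th St, San Francisco, CA 94016",
    ],
    "Quantity Ordered": [2, 1, 3, 1],
})

in_sf = sample["Purchase Address"].str.contains("San Francisco", na=False)
big_order = sample["Quantity Ordered"] > 2

either = sample[in_sf | big_order]   # San Francisco OR quantity > 2
not_sf = sample[~in_sf]              # everything except San Francisco

print(either)
print(not_sf)
```

When you write the conditions inline, keep each one in parentheses, because & and | bind more tightly than comparisons like >.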


## 🔍 Step 6: Using isin() and between()

Use isin() to match values from a list:

In [None]:
# Orders of specific products
products = ['USB-C Charging Cable', 'Bose SoundSport Headphones']
print(df[df['Product'].isin(products)])
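
You can also invert isin() with ~ to keep everything that is not in the list. A short sketch with made-up rows:

```python
import pandas as pd

# Made-up rows; only the column name mirrors the sales file
sample = pd.DataFrame({
    "Product": [
        "USB-C Charging Cable",
        "iPhone",
        "Bose SoundSport Headphones",
        "Wired Headphones",
    ],
})

products = ["USB-C Charging Cable", "Bose SoundSport Headphones"]

wanted = sample[sample["Product"].isin(products)]    # rows in the list
others = sample[~sample["Product"].isin(products)]   # rows NOT in the list

print(wanted)
print(others)
```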

Use between() for range filtering:

In [None]:
# Ensure 'Price Each' is numeric
df['Price Each'] = pd.to_numeric(df['Price Each'], errors='coerce')

# Drop rows with NaN values
df = df.dropna(subset=['Price Each'])

# Now safely filter with between
filtered = df[df['Price Each'].between(10, 100)]

print(filtered.head())
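
By default, between() includes both ends of the range. The sketch below, on made-up prices, compares the default with inclusive="neither". It assumes pandas 1.3 or newer, where the inclusive parameter accepts the strings "both", "neither", "left", and "right":

```python
import pandas as pd

# Made-up prices for illustration
prices = pd.Series([10.0, 11.95, 100.0, 149.99])

# Default: 10 and 100 themselves are included
both_ends = prices[prices.between(10, 100)]

# Exclude both boundaries (assumes pandas >= 1.3)
strict = prices[prices.between(10, 100, inclusive="neither")]

print(both_ends.tolist())
print(strict.tolist())
```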

## 🧠 Step 7: Using .query() for cleaner filtering

The query() method takes the filter as a string expression, which is often easier to read than chained conditions:

In [None]:
# Filter by product name and price with a query string
result = df.query("`Product` == 'Macbook Pro Laptop' and `Price Each` > 1000")
print(result.head())


Wrap column names that contain spaces (like Price Each) in backticks `. Names without spaces (like Product) do not need them.
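
A query string can also refer to a Python variable by putting @ in front of its name. Here is a minimal sketch on made-up rows. The variable name min_price is just an example:

```python
import pandas as pd

# Made-up rows; only the column names mirror the sales file
sample = pd.DataFrame({
    "Product": ["Macbook Pro Laptop", "iPhone", "Macbook Pro Laptop"],
    "Price Each": [1700.0, 700.0, 1700.0],
    "Quantity Ordered": [1, 1, 2],
})

min_price = 1000  # example threshold, referenced as @min_price below

result = sample.query("Product == 'Macbook Pro Laptop' and `Price Each` > @min_price")
print(result)
```

This keeps the threshold in one place, so you can change it without editing the query string.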

## 🧠 Practice Exercises
1. Show all orders from New York with quantity over 3
2. Select only the 'Product' and 'Price Each' columns from those orders
3. Show orders where price is between $50 and $150
4. Use .query() to find orders for 'iPhone' that cost more than $600

In [None]:
# 1
ny_orders = df[(df['Purchase Address'].str.contains('New York')) & (df['Quantity Ordered'] > 3)]

# 2
print(ny_orders[['Product', 'Price Each']])

# 3
print(df[df['Price Each'].between(50, 150)])

# 4
print(df.query("`Product` == 'iPhone' and `Price Each` > 600"))


## 📌 Summary

In this lesson, you learned how to:
- Select columns using []
- Select rows using iloc and loc
- Filter rows based on conditions
- Use isin(), between(), and query() for advanced filtering

👉 In the next lesson, we’ll clean our data by removing or fixing missing and duplicate values.