# Class 2: pandas Data Wrangling for R Users

**Course:** Data 201, Class 2  
**Background assumed:** R, dplyr pipelines, basic data.frames

---

### How to use this notebook
- Run cells **top to bottom**.
- Complete all **TODO** sections.
- When R code is shown, **translate it to pandas**.
- Focus on clarity and correctness.

## 0. Setup

In [None]:
import numpy as np
import pandas as pd
# Show up to 20 columns when a DataFrame is displayed
pd.set_option("display.max_columns", 20)

## 1. From data.frame to DataFrame

In R you might do:
```r
df <- read.csv("data.csv")
```

For today we create a small example dataset.

In [None]:
data = {
    "price": [200, 180, 250, 300, np.nan, 220, 260],
    "size": [1400, 1200, 1600, 1800, 1500, 1550, 1700],
    "bedrooms": [3, 2, 3, 4, 3, 3, 4],
    "neighborhood": ["A", "A", "B", "B", "A", "B", "B"],
}

df = pd.DataFrame(data)
df

**TODO:**  
1. Check the **dimensions** of `df`  
2. **Inspect** the structure of `df`  
3. Produce **summary statistics**

In [None]:
# Your code here

## 2. Index vs Columns (Very Important!)

**TODO:**  
1. Inspect `df.index`  
2. Inspect `df.columns`  

**Question:** Why is the index *not* the same as a regular column?

In [None]:
# Your code here
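
As a hint for the question above, here is a minimal sketch on a hypothetical `pets` table (not the housing data, and the names are made up for illustration). It assumes nothing beyond standard pandas and shows that the index holds row labels, which live outside `.columns` until you move them back.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the housing example.
pets = pd.DataFrame(
    {"name": ["Rex", "Tom", "Ivy"], "weight_kg": [30.0, 4.5, 6.2]}
)

# The default index is a range of integer row labels: 0, 1, 2.
default_labels = list(pets.index)

# Promote a column to the index; it is no longer listed in .columns.
by_name = pets.set_index("name")
remaining_columns = list(by_name.columns)

# Rows can now be looked up by label with .loc.
tom_weight = by_name.loc["Tom", "weight_kg"]

# reset_index() turns the index back into an ordinary column.
restored = by_name.reset_index()

print(default_labels, remaining_columns, tom_weight)
```

The closest R analogue is `rownames(df)`, although pandas uses the index far more heavily, for example when aligning two objects in arithmetic.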

## 3. Selecting Columns (dplyr::select)

In R:
```r
select(df, price, size)
```

**TODO:**  
1. Select only `price` and `size`  
2. Select a single column (`price`) as a **Series**

In [None]:
# Your code here

## 4. Filtering Rows (dplyr::filter)

In R:
```r
filter(df, price > 200)
```

**TODO:**  
1. Filter rows where `price > 200`  
2. Filter rows where `price > 200` **and** `bedrooms >= 3`

In [None]:
# Your code here

## 5. Creating New Variables (dplyr::mutate)

In R:
```r
mutate(df, price_per_sqft = price / size)
```

**TODO:** Create a new column `price_per_sqft`.

In [None]:
# Your code here

## 6. Sorting Rows (dplyr::arrange)

In R:
```r
arrange(df, price)
```

**TODO:**  
1. Sort `df` by `price` (ascending)  
2. Sort `df` by `price` (descending)

In [None]:
# Your code here

## 7. Grouping and Aggregation

In R:
```r
df %>%
  group_by(neighborhood) %>%
  summarize(mean_price = mean(price, na.rm = TRUE))
```

**TODO:** Reproduce the summary above in pandas.

In [None]:
# Your code here
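
One difference from R worth keeping in mind for this section: pandas reductions such as `mean()` skip missing values by default, so there is no direct counterpart to writing `na.rm = TRUE`. The sketch below uses hypothetical toy values (not the housing data) to show both behaviours.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
scores = pd.Series([1.0, np.nan, 3.0])

# Default: NaN is skipped, so this is (1 + 3) / 2.
mean_default = scores.mean()

# skipna=False makes NaN propagate, like mean(x) without na.rm in R.
mean_strict = scores.mean(skipna=False)

# Grouped means skip NaN in the same way.
toy = pd.DataFrame({"group": ["x", "x", "y"], "value": [1.0, np.nan, 5.0]})
group_means = toy.groupby("group")["value"].mean()

print(mean_default, mean_strict)
print(group_means)
```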

## 8. Method Chaining (Pipe Mindset)

pandas supports method chaining, which plays the same role as `%>%` in R: each method returns a new object, and the next method is called on that result.
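
To make the chaining pattern concrete, here is a minimal sketch on a hypothetical `orders` table (the data and column names are invented for illustration and differ from the exercise). It is one possible style, not the only one: wrapping the chain in parentheses lets each step sit on its own line, much like a dplyr pipeline.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the housing example.
orders = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west"],
        "units": [10, 3, 8, 12],
        "revenue": [100.0, 45.0, 64.0, 120.0],
    }
)

result = (
    orders
    .query("units > 5")                                      # like filter()
    .assign(unit_price=lambda d: d["revenue"] / d["units"])  # like mutate()
    .groupby("region", as_index=False)                       # like group_by()
    .agg(mean_unit_price=("unit_price", "mean"))             # like summarize()
)

print(result)
```

The `lambda` inside `assign()` receives the DataFrame as it exists at that point in the chain, so the new column is computed on the filtered rows rather than on the original `orders`.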

**TODO:** Using chaining, compute:
- **Filter:** `price > 200`  
- **Mutate:** `price_per_sqft = price / size`  
- **Group by** `neighborhood`  
- **Summarize:** average `price_per_sqft`

In [None]:
# Your code here

## 9. Missing Data

**TODO:**  
1. **Identify** which values are missing  
2. **Drop** rows with missing values  
3. **Fill** missing prices with the mean price  

**Question:** When is dropping missing data reasonable? When is it risky?

In [None]:
# Your code here

## 10. Common Pitfalls for R Users

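- When combining conditions, use `&` and `|` (not `and`/`or`) and wrap each condition in **parentheses**, because `&` binds more tightly than comparisons such as `>`.  
- `groupby()` alone computes nothing: it returns a lazy GroupBy object, and results appear only once you **aggregate** (for example with `.mean()` or `.agg()`).  
- Watch out for **chained assignment** warnings (such as `SettingWithCopyWarning`): assign through a single `.loc[rows, column] = value` call instead of chaining two indexing steps.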
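
The three pitfalls above can be sketched on a small hypothetical table (the `toy` data and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data, unrelated to the housing example.
toy = pd.DataFrame(
    {
        "group": ["x", "x", "y", "y"],
        "value": [1, 5, 2, 8],
        "flag": [True, False, True, True],
    }
)

# 1. Parentheses: each condition is wrapped before combining with &.
#    Without them, `1 & toy["flag"]` would be evaluated first.
both = toy[(toy["value"] > 1) & (toy["flag"])]

# 2. groupby() returns a GroupBy object, not a table of results.
grouped = toy.groupby("group")
totals = grouped["value"].sum()  # the aggregation triggers the computation

# 3. Assign through a single .loc call rather than chaining,
#    e.g. avoid fixed[fixed["value"] > 4]["value"] = 0.
fixed = toy.copy()
fixed.loc[fixed["value"] > 4, "value"] = 0

print(both)
print(totals)
print(fixed)
```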

## 11. Active Learning Exercise (15–20 min)

**R pipeline:**
```r
df %>%
  filter(size > 1400) %>%
  mutate(price_per_sqft = price / size) %>%
  group_by(neighborhood) %>%
  summarize(avg_ppsqft = mean(price_per_sqft, na.rm = TRUE))
```

**TASK:** Translate the pipeline above into pandas using method chaining. Write clean, readable code.

In [None]:
# Your solution below

## 12. Wrap-Up Reflection

In 2–3 sentences:
- What feels **most similar** to dplyr?  
- What feels **most different**?  
- What do you find **confusing** so far?