# Assignment 2 – Data Wrangling with pandas

## Author Section:
- ***Name:*** Hunter Tzou
- ***Date:*** 02.24.2026
- ***Class:*** DATA 201




*(Based on Class 2 – pandas for R users)*

---

## Learning Objectives
By completing this assignment, you should be able to:

- Use pandas equivalents of `dplyr` verbs:
  - `filter()` → boolean indexing / `.loc[]`
  - `mutate()` → `.assign()`
  - `group_by()` → `.groupby()`
  - `summarise()` → `.agg()`
  - `arrange()` → `.sort_values()`
- Apply boolean logic correctly using `&` and `|`
- Use **method chaining** for readable, step-by-step transformations
- Create grouped summary tables with multiple statistics
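
As a rough illustration of how the verbs listed above line up in a single chain, here is a minimal sketch on a hypothetical toy DataFrame. The column names (`city`, `sales`, `units`) and the values are made up for this example and are not part of `housing.csv`.

```python
import pandas as pd

# Hypothetical toy data; names and values are assumptions for illustration only.
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "sales": [10, 20, 30, 50],
    "units": [1, 2, 3, 5],
})

summary = (
    toy
    # filter(): boolean indexing, each condition in its own parentheses
    .loc[lambda d: (d["sales"] > 10) & (d["units"] < 5)]
    # mutate(): add a derived column
    .assign(sales_per_unit=lambda d: d["sales"] / d["units"])
    # group_by()
    .groupby("city")
    # summarise(): named aggregation, new_name=(column, function)
    .agg(total_sales=("sales", "sum"), n=("sales", "count"))
    # arrange(desc(total_sales))
    .sort_values("total_sales", ascending=False)
)
print(summary)
```

Each step returns a new DataFrame (or a grouped object), which is what lets the next method attach directly to it.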

---

## Dataset

Use the provided dataset: `housing.csv`

In [13]:
```
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/HunterTzou/DATA201/refs/heads/main/Dataset/housing.csv")
```

---

# Part A – Core Wrangling (Method Chaining Required)

Using **one chained expression**, create a summary table that:

1. Keeps only observations where `price > 250000`
2. Keeps only homes with `size > 1000`
3. Creates a new variable:
   - `price_per_sqft = price / size`
4. Groups by `neighborhood`
5. Computes:
   - mean of `price_per_sqft`
   - median of `price_per_sqft`
   - count of homes
6. Sorts the result by mean `price_per_sqft` (descending)

### Requirements
- Do NOT create intermediate variables (`df2`, `df3`, etc.)
- Use `.assign()` to create new variables
- Use `.agg()` with named aggregation
- **Do not use `reset_index()`** (we did not cover it yet)

### Output format note
After `groupby()`, pandas will put `neighborhood` in the *index*. That is OK for this assignment.
Your final output can look like a table where `neighborhood` is on the left as the index.

### Deliverable
- Paste your code
- Show the final DataFrame output

---


In [6]:
```
## PART A ##
(df
 .query("price > 250000 and size > 1000")
 .assign(price_per_sqft = lambda t: t.price / t.size)
 .groupby("neighborhood")
 .agg(mean_price_per_sqft = ("price_per_sqft","mean"))
 .round(2)
)
```

| neighborhood | mean_price_per_sqft |
|---|---|
| Downtown | 505.05 |
| Midtown | 505.05 |
| Suburb | 505.05 |
| Uptown | 505.05 |
| Waterfront | 505.05 |
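
A mean of 505.05 in every neighborhood is a sign that every price is being divided by the same constant. In pandas, attribute access only reaches a column when the name does not collide with an existing DataFrame attribute, and `size` does collide: `t.size` is the element count (rows × columns), so `t.price / t.size` divides each price by one number. The sketch below is one possible way to write the full Part A chain with bracket access, adding the median, count, and sort steps from the task list. It runs on hypothetical toy rows (not `housing.csv`), uses boolean indexing for the same two filter conditions, and the names `toy`, `part_a`, `median_price_per_sqft`, and `n_homes` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for housing.csv (values are made up).
toy = pd.DataFrame({
    "price": [300000, 400000, 500000, 600000, 200000],
    "size": [1500.0, 2000.0, 1250.0, 2000.0, 900.0],
    "neighborhood": ["Uptown", "Uptown", "Suburb", "Suburb", "Suburb"],
})

# Attribute access returns DataFrame.size (5 rows * 3 columns), not the column.
assert toy.size == 15

part_a = (
    toy
    .loc[lambda t: (t["price"] > 250000) & (t["size"] > 1000)]
    # Bracket access selects the "size" column unambiguously.
    .assign(price_per_sqft=lambda t: t["price"] / t["size"])
    .groupby("neighborhood")
    .agg(
        mean_price_per_sqft=("price_per_sqft", "mean"),
        median_price_per_sqft=("price_per_sqft", "median"),
        n_homes=("price_per_sqft", "count"),
    )
    .sort_values("mean_price_per_sqft", ascending=False)
    .round(2)
)
print(part_a)
```

With bracket access each row is divided by its own square footage, so the group means can differ from one neighborhood to the next.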


# Part B – Translation to dplyr

In a Markdown cell, write the equivalent **dplyr** code that performs the same transformation.

Your R pipeline should include:

- `filter()`
- `mutate()`
- `group_by()`
- `summarise()`
- `arrange()`

### Reflection (3–6 sentences)

Answer:

- Which syntax feels clearer for you — pandas chaining or dplyr pipelines?
- What feels similar between them?
- What feels different?

---

```
library(dplyr)

df %>%
  filter(price > 250000, size > 1000) %>%
  mutate(price_per_sqft = price / size) %>%
  group_by(neighborhood) %>%
  summarise(
    mean_price_per_sqft = round(mean(price_per_sqft), 2),
    median_price_per_sqft = round(median(price_per_sqft), 2),
    n_homes = n()
  ) %>%
  arrange(desc(mean_price_per_sqft))
```

I personally prefer pandas chaining because it feels cleaner, although both are fairly easy to read. They are similar in that each one chains steps together, so the pipeline reads from top to bottom. The main difference is that pandas puts a `.` in front of each method, while dplyr puts the pipe `%>%` at the end of each line.


# Part C – Boolean Logic Debugging

The following code produces an error:

```
df[df["price"] > 250000 & df["size"] > 1000]
```
### Tasks

1. Fix the code.
2. Explain **why** the error occurs.
3. Rewrite the filter using `.query()` instead.

---

In [9]:
```
## PART C ##

## FIXED

df[(df["price"] > 250000) & (df["size"] > 1000)]
```



| | listing_id | price | size | bedrooms | neighborhood | type |
|---|---|---|---|---|---|---|
| 0 | 100001 | 1500000 | 1280.741760 | 1.0 | Suburb | Townhouse |
| 1 | 100002 | 1500000 | 1406.283113 | 2.0 | Uptown | SingleFamily |
| 2 | 100003 | 1500000 | 4146.825713 | 6.0 | Suburb | MultiFamily |
| 3 | 100004 | 1500000 | 3946.599818 | 6.0 | Suburb | SingleFamily |
| 4 | 100005 | 1500000 | 1243.751760 | 1.0 | Downtown | MultiFamily |
| ... | ... | ... | ... | ... | ... | ... |
| 595 | 100596 | 1500000 | 1443.241197 | 3.0 | Midtown | Condo |
| 596 | 100597 | 1500000 | 1083.909714 | 2.0 | Suburb | Condo |
| 597 | 100598 | 1500000 | 1600.126432 | 1.0 | Suburb | SingleFamily |
| 598 | 100599 | 1500000 | 1248.216637 | 1.0 | Waterfront | Condo |


In [10]:
```
## .QUERY

df.query("price > 250000 and size > 1000")
```

| | listing_id | price | size | bedrooms | neighborhood | type |
|---|---|---|---|---|---|---|
| 0 | 100001 | 1500000 | 1280.741760 | 1.0 | Suburb | Townhouse |
| 1 | 100002 | 1500000 | 1406.283113 | 2.0 | Uptown | SingleFamily |
| 2 | 100003 | 1500000 | 4146.825713 | 6.0 | Suburb | MultiFamily |
| 3 | 100004 | 1500000 | 3946.599818 | 6.0 | Suburb | SingleFamily |
| 4 | 100005 | 1500000 | 1243.751760 | 1.0 | Downtown | MultiFamily |
| ... | ... | ... | ... | ... | ... | ... |
| 595 | 100596 | 1500000 | 1443.241197 | 3.0 | Midtown | Condo |
| 596 | 100597 | 1500000 | 1083.909714 | 2.0 | Suburb | Condo |
| 597 | 100598 | 1500000 | 1600.126432 | 1.0 | Suburb | SingleFamily |
| 598 | 100599 | 1500000 | 1248.216637 | 1.0 | Waterfront | Condo |


2. The error occurs because of operator precedence: `&` binds more tightly than `>`, so without parentheses Python first tries to evaluate `250000 & df["size"]` instead of comparing each column to its threshold. Wrapping each condition in parentheses makes the two comparisons run first, and `&` then combines the two resulting boolean Series.
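
One way to see what the parentheses change: comparison operators such as `>` have lower precedence than `&`, so the unparenthesized filter is grouped as `df["price"] > (250000 & df["size"]) > 1000`. The sketch below reproduces this on a hypothetical three-row DataFrame. The exact exception type depends on the column dtypes, so it catches both `TypeError` and `ValueError`; the names `toy`, `raised`, and `fixed` are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    "price": [300000, 200000, 400000],
    "size": [1200.0, 1500.0, 800.0],
})

# Without parentheses, `&` is applied before either `>` comparison.
try:
    toy[toy["price"] > 250000 & toy["size"] > 1000]
    raised = None
except (TypeError, ValueError) as err:
    raised = type(err).__name__

# With parentheses, each comparison yields a boolean Series first,
# and `&` then combines them element by element.
fixed = toy[(toy["price"] > 250000) & (toy["size"] > 1000)]

print(raised)
print(fixed)
```

Only the first toy row satisfies both conditions, so `fixed` keeps a single row.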

# Part D – Short Concept Questions

Answer briefly (2–4 sentences each):

1. Why must we wrap each condition in parentheses when using `&` in pandas?
   - It is because of *operator precedence*. Just like PEMDAS in math, Python evaluates some operators before others, and `&` binds more tightly than `>`. Without parentheses, Python would try to evaluate `250000 & df["size"]` first, so each comparison needs its own parentheses to make it run before `&`.
2. What is the advantage of method chaining over creating many temporary DataFrames?
   - A chain is easier to read, debug, and reproduce because the steps appear in order, top to bottom. Temporary DataFrames add many names to remember and use correctly, while a chain works from the same starting variable throughout.
3. In `.agg(mean_price=("price", "mean"))`, what does `"price"` represent? What does `"mean"` represent?
   - `"price"` is the column to aggregate: pandas takes the `price` values within each group (i.e. the data to compute on).
   - `"mean"` is the function to apply: pandas computes the average of those `price` values for each group and stores it in the new `mean_price` column.

4. When you `groupby("neighborhood")`, why does `neighborhood` appear on the left (index) in the result table?

   - Each group is reduced to one row, and pandas needs a label for each of those rows, so by default it uses the group keys (the `neighborhood` values) as the index.
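
To make answers 3 and 4 concrete, here is a minimal sketch on hypothetical toy data. It shows the `new_name=(column, function)` pattern and where the group key ends up. The second version uses the `as_index=False` parameter of `groupby()`, which keeps the key as a regular column; it is shown only for comparison, and `toy`, `by_index`, and `by_column` are made-up names.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    "neighborhood": ["Uptown", "Uptown", "Suburb"],
    "price": [100, 300, 500],
})

# Default: the group key becomes the index of the result.
by_index = toy.groupby("neighborhood").agg(mean_price=("price", "mean"))

# as_index=False: the group key stays an ordinary column.
by_column = (
    toy
    .groupby("neighborhood", as_index=False)
    .agg(mean_price=("price", "mean"))
)

print(by_index)
print(by_column)
```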

---

## Optional Extension (Extra practice)
Create a second summary table grouped by `type` that reports:
- mean `price`
- median `price`
- count of listings

(Use method chaining again.)

---

In [16]:
```
## OPTIONAL EXTENSION ##

(df
 .groupby("type")
 .agg(
     mean_price = ("price", "mean"),
     med_price = ("price", "median"),
     count_listings = ("price", "count"),
 )
 .round(2)
)
```

| type | mean_price | med_price | count_listings |
|---|---|---|---|
| Condo | 1500000.0 | 1500000.0 | 183 |
| MultiFamily | 1500000.0 | 1500000.0 | 63 |
| SingleFamily | 1500000.0 | 1500000.0 | 235 |
| Townhouse | 1500000.0 | 1500000.0 | 119 |


In [19]:
```
### I wanted to make sure I was not crazy because all the prices were the same lol

df["price"].nunique()
df["price"].value_counts().head(10)
```

| price | count |
|---|---|
| 1500000 | 600 |
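
By default a notebook cell displays only the value of its last expression, which is why only the `value_counts()` table appears above and the `nunique()` result does not. A minimal sketch of one way to show both, using `print()` on a hypothetical stand-in Series (the names `prices`, `n_unique`, and `counts` are assumptions, and the four values are made up):

```python
import pandas as pd

# Hypothetical stand-in for df["price"]; the real column comes from housing.csv.
prices = pd.Series([1500000] * 4, name="price")

n_unique = prices.nunique()      # number of distinct prices
counts = prices.value_counts()   # how often each price occurs

# Printing each result explicitly shows both, not only the last expression.
print(n_unique)
print(counts.head(10))
```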
