# Assignment 2 – Data Wrangling with pandas
*(Based on Class 2 – pandas for R users)*

---


## Dataset


In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Reben80/Data201/refs/heads/main/Dataset/housing.csv")
df.head()

Unnamed: 0,listing_id,price,size,bedrooms,neighborhood,type
0,100001,1500000,1280.74176,1.0,Suburb,Townhouse
1,100002,1500000,1406.283113,2.0,Uptown,SingleFamily
2,100003,1500000,4146.825713,6.0,Suburb,MultiFamily
3,100004,1500000,3946.599818,6.0,Suburb,SingleFamily
4,100005,1500000,1243.75176,1.0,Downtown,MultiFamily


# Part A – Core Wrangling (Method Chaining Required)

In [None]:
# Filter the listings, add price per square foot, then summarize by neighborhood
(df
 .query("price > 250000 and size > 1000")
 # df["price"] / df["size"] is used here because the lambda version
 # returned incorrect values that seemed to be the same number repeated
 .assign(price_sqft = df["price"] / df["size"])
 .groupby("neighborhood")
 .agg(mean_ppsqft = ("price_sqft", "mean"),
      median_ppsqft = ("price_sqft", "median"),
      number_of_homes = ("listing_id", "count"))
 .sort_values("mean_ppsqft", ascending = False)
 .head())

neighborhood,mean_ppsqft,median_ppsqft,number_of_homes
Downtown,977.820905,1001.557049,99
Midtown,921.141446,901.992377,92
Suburb,861.917078,836.878644,157
Uptown,860.935608,843.653468,99
Waterfront,849.508891,792.663291,48
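
The comment in the chain above mentions that a lambda version gave unexpected values. As a point of comparison, here is a minimal sketch on a small made-up DataFrame (the `toy` name and its values are assumptions, not rows from housing.csv) that computes price per square foot both ways. Passing a lambda to `.assign()` makes pandas call it on the already-filtered frame, while passing `toy["price"] / toy["size"]` relies on index alignment, so under these assumptions the two versions should agree.

```python
import pandas as pd

# Hypothetical toy data, not taken from housing.csv
toy = pd.DataFrame({
    "price": [300000, 200000, 500000, 450000],
    "size": [1500.0, 900.0, 2000.0, 800.0],
})

# Version 1: refer to the original frame; pandas aligns the Series on the index
via_df = (toy
    .query("price > 250000 and size > 1000")
    .assign(price_sqft=toy["price"] / toy["size"]))

# Version 2: the lambda receives the filtered frame from the previous step
via_lambda = (toy
    .query("price > 250000 and size > 1000")
    .assign(price_sqft=lambda d: d["price"] / d["size"]))

print(via_lambda)

# True when both versions hold the same rows and values
same = via_df.equals(via_lambda)
print(same)
```

The lambda form keeps the chain independent of the variable name `df`, which matters if the chain is later applied to a different DataFrame.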


# Part B – Translation to dplyr

{r}  
housing <- df |>  
  filter(price > 250000 & size > 1000) |>  
  mutate(price_sqft = price / size) |>  
  group_by(neighborhood) |>  
  summarise(  
    mean_ppsqft = mean(price_sqft),  
    median_ppsqft = median(price_sqft),  
    n = n()) |>  
  arrange(desc(mean_ppsqft))


### Reflection (3–6 sentences)

Answer:

I prefer dplyr pipelines because they read more cleanly than method chaining in Python, and I find them easier to understand.
Both approaches follow a similar structure and produce the same result, but pandas links its steps with method calls rather than separating them with a pipe operator. For example, inside `.assign()` you have to refer to columns such as price and size with `df`, brackets, and quotes, which makes the expression a bit more complicated, whereas dplyr's `mutate()` allows you to use just the column names.

# Part C – Boolean Logic Debugging

The following code produces an error:

```python
df[df["price"] > 250000 & df["size"] > 1000]
```

### Tasks

1. Fix the code.
2. Explain **why** the error occurs.
3. Rewrite the filter using `.query()` instead.

---

1.


In [12]:
df[(df["price"] > 250000) & (df["size"] > 1000)]

Unnamed: 0,listing_id,price,size,bedrooms,neighborhood,type
0,100001,1500000,1280.741760,1.0,Suburb,Townhouse
1,100002,1500000,1406.283113,2.0,Uptown,SingleFamily
2,100003,1500000,4146.825713,6.0,Suburb,MultiFamily
3,100004,1500000,3946.599818,6.0,Suburb,SingleFamily
4,100005,1500000,1243.751760,1.0,Downtown,MultiFamily
...,...,...,...,...,...,...
595,100596,1500000,1443.241197,3.0,Midtown,Condo
596,100597,1500000,1083.909714,2.0,Suburb,Condo
597,100598,1500000,1600.126432,1.0,Suburb,SingleFamily
598,100599,1500000,1248.216637,1.0,Waterfront,Condo


2.  
Python gives `&` higher precedence than `>`, so without parentheses the middle of the expression is read as "**250000 & df["size"]**" and pandas never sees `df["price"] > 250000` as a condition of its own. Adding parentheses around `df["price"] > 250000` makes it a complete condition, and `&` then requires `(df["size"] > 1000)` to be met in the same row as well.  
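
The grouping described above can be checked without running the filter at all. This is a small sketch that uses Python's standard `ast` module to parse the unparenthesized expression as text and show how Python groups it. It assumes Python 3.9 or later for `ast.unparse`.

```python
import ast

# The unparenthesized filter from the task, kept as a string so it is only parsed, not run
expr = 'df["price"] > 250000 & df["size"] > 1000'
tree = ast.parse(expr, mode="eval").body

# The whole expression is a single chained comparison: left > middle > right
print(type(tree).__name__, len(tree.ops))

# The middle operand is the & expression that Python grouped first
middle = tree.comparators[0]
print(ast.unparse(middle))
```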
3.  


In [13]:
df.query("price > 250000 and size > 1000")

Unnamed: 0,listing_id,price,size,bedrooms,neighborhood,type
0,100001,1500000,1280.741760,1.0,Suburb,Townhouse
1,100002,1500000,1406.283113,2.0,Uptown,SingleFamily
2,100003,1500000,4146.825713,6.0,Suburb,MultiFamily
3,100004,1500000,3946.599818,6.0,Suburb,SingleFamily
4,100005,1500000,1243.751760,1.0,Downtown,MultiFamily
...,...,...,...,...,...,...
595,100596,1500000,1443.241197,3.0,Midtown,Condo
596,100597,1500000,1083.909714,2.0,Suburb,Condo
597,100598,1500000,1600.126432,1.0,Suburb,SingleFamily
598,100599,1500000,1248.216637,1.0,Waterfront,Condo
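
As a quick check that the two filters are interchangeable, the sketch below applies both to a small made-up DataFrame (the `toy` data and the `min_price` / `min_size` names are assumptions for illustration). It also uses the `@` prefix, which lets `.query()` read local Python variables so that thresholds do not have to be hard-coded in the string.

```python
import pandas as pd

# Hypothetical toy data, not taken from housing.csv
toy = pd.DataFrame({
    "price": [300000, 200000, 500000, 450000],
    "size": [1500.0, 900.0, 2000.0, 800.0],
})

# Hypothetical thresholds stored in ordinary Python variables
min_price = 250000
min_size = 1000

# Boolean-mask version: each comparison needs its own parentheses
by_mask = toy[(toy["price"] > min_price) & (toy["size"] > min_size)]

# .query() version: plain column names, `and` instead of &, @ for local variables
by_query = toy.query("price > @min_price and size > @min_size")

print(by_query)

# True when both filters keep exactly the same rows
match = by_mask.equals(by_query)
print(match)
```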


# Part D – Short Concept Questions

Answer briefly (2–4 sentences each):

1. Why must we wrap each condition in parentheses when using `&` in pandas?
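- Each pair of parentheses is evaluated as one complete condition before `&` combines them. Without parentheses, `&` binds more tightly than comparison operators such as `>`, so Python would try to combine the number and the column first instead of the two conditions.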
2. What is the advantage of method chaining over creating many temporary DataFrames?
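- Method chaining keeps the code organized and easier to read because each step follows the previous one from top to bottom. It also avoids a trail of intermediate DataFrame names that have to be tracked and can be reused by mistake.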
3. In `.agg(mean_price=("price", "mean"))`, what does `"price"` represent? What does `"mean"` represent?
- `"price"` is the column in the dataset that the aggregation is applied to. `"mean"` is the function that is applied to the values of that column within each group.
4. When you `groupby("neighborhood")`, why does `neighborhood` appear on the left (index) in the result table?
- Grouping by neighborhood collapses the rows so that the result has one row per neighborhood value instead of one row per house. By default, pandas uses the grouping column as the row labels (the index) of the result, which is why `neighborhood` appears on the left.
---
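
To make answers 3 and 4 concrete, here is a minimal sketch on a made-up DataFrame (the `toy` data is an assumption, not part of housing.csv). It shows the named-aggregation form `new_name=("column", "function")` and how `as_index=False` keeps `neighborhood` as an ordinary column instead of the index.

```python
import pandas as pd

# Hypothetical toy data, not taken from housing.csv
toy = pd.DataFrame({
    "neighborhood": ["Suburb", "Suburb", "Uptown"],
    "price": [300000, 500000, 400000],
})

# Default behavior: the grouping column becomes the index of the result
summary = toy.groupby("neighborhood").agg(mean_price=("price", "mean"))
index_name = summary.index.name
print(summary)

# as_index=False keeps neighborhood as a regular column
flat = toy.groupby("neighborhood", as_index=False).agg(mean_price=("price", "mean"))
flat_columns = list(flat.columns)
print(flat)
```

Calling `.reset_index()` on `summary` is another way to get the same flat layout.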