# Pandas Student Notebook — Foundations Practice (2)  
## Dataset: Kaggle “House Prices: Advanced Regression Techniques” (train.csv)

### Goal of this notebook
You are practicing the core Pandas workflow you will reuse in every analysis: loading → inspecting → cleaning → feature engineering → grouping/aggregating → reshaping → validating.

### Important reminders (read before coding)
This is not about making code run. It is about results that are meaningfully correct. Always define the grain (what does 1 row represent), prefer vectorized operations over loops and `apply(axis=1)`, verify you are counting the right thing (`count` vs `nunique`), sanity-check with `describe()` and `value_counts()`, and treat missing values as information: count them, locate them, decide how to handle them, and verify the effect.
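The `count` vs `nunique` reminder above is easy to get wrong, so here is a minimal sketch on a hypothetical toy frame (the `sales` DataFrame and its values are made up for illustration, not taken from `train.csv`). It only shows that the two methods answer different questions.

```python
import pandas as pd

# Hypothetical toy data: five sales spread over two neighborhoods.
sales = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"],
    "SalePrice": [140000, 150000, 145000, 200000, 210000],
})

# count() returns the number of non-missing values, i.e. rows with a neighborhood.
n_rows = sales["Neighborhood"].count()

# nunique() returns the number of distinct neighborhoods.
n_neighborhoods = sales["Neighborhood"].nunique()

print(n_rows, n_neighborhoods)
```

If the question is "how many houses were sold?", `count` is the right tool; if it is "how many neighborhoods appear?", `nunique` is.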

Expected columns include: `SalePrice, Neighborhood, OverallQual, GrLivArea, YearBuilt, YearRemodAdd, LotArea, MSZoning, HouseStyle, KitchenQual, ...`

Write your code in the empty code cells.


## 0. Setup + first look

Load `train.csv` into a DataFrame named `df` and get a feel for the dataset (missing values, number of columns, column names, basic summary statistics, ...) as you see fit.

Write as a comment: What is the grain of this dataset (otherwise phrased: what does 1 row represent)?
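One way to back up a statement about the grain is to check that the candidate key really is unique. The sketch below assumes that `Id` identifies one row (one house sale); the `houses` frame is a hypothetical stand-in for `train.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for train.csv.
houses = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# If one row is one house sale, the identifier must not repeat.
id_is_unique = houses["Id"].is_unique
n_duplicate_ids = int(houses["Id"].duplicated().sum())

print(id_is_unique, n_duplicate_ids)
```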


In [2]:
import kagglehub
import pandas as pd
import os

# Download latest version
path = kagglehub.dataset_download("rishitaverma02/house-prices-advanced-regression-techniques")

df = pd.read_csv(os.path.join(path, 'train (1).csv'))



In [3]:
df.shape

(1460, 81)

In [4]:
df.describe()

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [5]:
df.isna().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [7]:
df.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object

In [8]:
df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


## 1. Missing values audit

1) Create a Series `missing_counts` with the number of missing values per column, sorted descending.  
2) Show the top 10 columns with missing values.  
3) Create a DataFrame `missing_top10` with columns: `column`, `missing_count`, `missing_pct`.

Write as a comment: Are missing values here likely random? Give one reason why they might not be.


In [9]:
missing_counts = df.isna().sum().sort_values(ascending=False)

In [10]:
missing_counts.head(10)

PoolQC         1453
MiscFeature    1406
Alley          1369
Fence          1179
MasVnrType      872
FireplaceQu     690
LotFrontage     259
GarageYrBlt      81
GarageCond       81
GarageType       81
dtype: int64

In [11]:
# missing_pct is stored as a fraction of all rows (0 to 1).
missing_top10 = pd.concat([missing_counts, missing_counts / len(df)], axis=1).reset_index().head(10)
missing_top10.columns = ['column', 'missing_count', 'missing_pct']
missing_top10

# They are likely not random. For example, the column with the most missing values is PoolQC (pool quality). Most houses have no pool, so there is no quality to record. The same holds for MiscFeature (miscellaneous feature not covered in other categories) and Alley (type of alley access to property): a missing value means the house does not have that feature.


        column  missing_count  missing_pct
0       PoolQC           1453     0.995205
1  MiscFeature           1406     0.963014
2        Alley           1369     0.937671
3        Fence           1179     0.807534
4   MasVnrType            872     0.597260
5  FireplaceQu            690     0.472603
6  LotFrontage            259     0.177397
7  GarageYrBlt             81     0.055479
8   GarageCond             81     0.055479
9   GarageType             81     0.055479
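Whether missingness is structural can be checked by comparing it against a related column. The sketch below uses a hypothetical five-row frame (`pools`, with made-up values) and assumes that `PoolArea == 0` means "no pool"; on the real data the same comparison could be run on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: PoolQC is only filled in when a pool exists.
pools = pd.DataFrame({
    "PoolArea": [0, 0, 512, 0, 648],
    "PoolQC": [np.nan, np.nan, "Ex", np.nan, "Gd"],
})

has_pool = pools["PoolArea"] > 0

# Share of missing PoolQC values among houses without and with a pool.
share_missing_without_pool = pools.loc[~has_pool, "PoolQC"].isna().mean()
share_missing_with_pool = pools.loc[has_pool, "PoolQC"].isna().mean()

print(share_missing_without_pool, share_missing_with_pool)
```

If the two shares differ sharply, the missing values carry information ("feature absent") rather than being random gaps.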


## 2. Basic column selection + filtering

1) Create `mini` with these columns (only if they exist):  
`SalePrice, Neighborhood, OverallQual, GrLivArea, YearBuilt, YearRemodAdd, LotArea, MSZoning`

2) Find all the houses where the above grade (ground) living area is more than 250 m² and the overall material and finish of the house is rated at least 7 (careful, harder than it seems ;) ) and store them in a DataFrame `large_houses`.


Show:
- number of rows in the filtered result
- `head()` of the filtered result

Write as a comment: Why is it risky to interpret patterns on a very small filtered subset?
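The "harder than it seems" part of this task is the unit: `GrLivArea` is recorded in square feet, while the threshold is given in square metres. A minimal sketch of the conversion on hypothetical toy values (the `homes` frame and the constant name `SQFT_TO_SQM` are illustrative assumptions):

```python
import pandas as pd

SQFT_TO_SQM = 0.09290304  # 1 square foot = 0.3048 m * 0.3048 m

# Hypothetical toy data; GrLivArea is in square feet.
homes = pd.DataFrame({
    "GrLivArea": [1710, 2800, 3200, 2700],
    "OverallQual": [7, 8, 6, 9],
})

# Convert to square metres before applying the 250 m² threshold.
area_sqm = homes["GrLivArea"] * SQFT_TO_SQM
large = homes[(area_sqm > 250) & (homes["OverallQual"] >= 7)]

print(large)
```

Comparing the raw column against 250 would instead select nearly every house, because 250 square feet is only about 23 m².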


In [12]:
mini = df[['SalePrice', 'Neighborhood', 'OverallQual', 'GrLivArea', 'YearBuilt', 'YearRemodAdd', 'LotArea', 'MSZoning']]

# GrLivArea is recorded in square feet, so convert it to square metres before comparing with the 250 m² threshold.
imperial_to_metric_factor = 0.09290304  # 1 sq ft = 0.09290304 m²

large_houses = mini[(mini['GrLivArea'] * imperial_to_metric_factor > 250) & (mini['OverallQual'] >= 7)]
large_houses.shape

In [13]:
len(large_houses)
# Interpreting patterns on a small, filtered subset is risky because it lacks statistical power, leading to "noise" being mistaken for a genuine "signal." In a tiny sample, random fluctuations and outliers are magnified, creating spurious correlations that don't exist in the broader population.

# Furthermore, heavy filtering often introduces selection bias, where the remaining data is no longer representative of the whole (you'll later learn that this results in 'overfitting', where a model or conclusion is so precisely tuned to the quirks of that specific subset that it fails to generalize to any other data). Effectively, the smaller the subset, the higher the probability that your "discovery" is simply a mathematical fluke.

## 3. Time-like feature engineering (vectorized)

`YearBuilt` and `YearRemodAdd` are integers, but they represent time.

Create the following columns
- house_age_at_sale
- years_since_remodel
- is_remodeled

Constraints:
- No loops.
- No `apply(axis=1)`.

Show `df[['YrSold','YearBuilt','YearRemodAdd','house_age_at_sale','years_since_remodel','is_remodeled']].head()`.


In [14]:
df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [15]:
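from datetime import datetime as dt

# Age of the house in the year it was sold.
df['house_age_at_sale'] = df['YrSold'] - df['YearBuilt']
# Years since the last remodel, counted from the current calendar year, so the values depend on when the notebook is run.
df['years_since_remodel'] = dt.now().year - df['YearRemodAdd']
# YearRemodAdd equals YearBuilt when the house was never remodeled.
df['is_remodeled'] = df['YearRemodAdd'] != df['YearBuilt']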


## 4. Data validation checks

Create checks (print counts of violations):
1) `house_age_at_sale` should not be negative.
2) `years_since_remodel` should not be negative.
3) `SalePrice` should be positive.

If there are violations, show a small sample of the violating rows.

Write as a comment: Why are these data quality checks rather than business logic?
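The "print counts of violations, then show a sample" pattern can be sketched as a dictionary of boolean masks. The frame `checks_df` below is hypothetical toy data with two deliberately broken rows; the rule names are illustrative.

```python
import pandas as pd

# Hypothetical toy data: row 1 is sold before it was built, row 2 has a zero price.
checks_df = pd.DataFrame({
    "YrSold": [2008, 2006, 2010],
    "YearBuilt": [2003, 2007, 1990],
    "SalePrice": [208500, 181500, 0],
})
checks_df["house_age_at_sale"] = checks_df["YrSold"] - checks_df["YearBuilt"]

# Each mask is True where a row violates the rule.
violations = {
    "negative house_age_at_sale": checks_df["house_age_at_sale"] < 0,
    "non-positive SalePrice": checks_df["SalePrice"] <= 0,
}

for rule, mask in violations.items():
    print(f"{rule}: {int(mask.sum())} violation(s)")
    if mask.any():
        # Show a small sample of the offending rows.
        print(checks_df[mask].head())
```

Keeping the masks around (instead of only the counts) makes it cheap to inspect or exclude the violating rows later.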


In [16]:
len(df[df['house_age_at_sale'] < 0])

# better: assert (df['house_age_at_sale'] >= 0).all()

0

In [17]:
len(df[df['years_since_remodel'] < 0])
# better: assert (df['years_since_remodel'] >= 0).all()

0

In [18]:
len(df[df['SalePrice'] <= 0])  # "positive" excludes zero, so zero prices count as violations too
# better: assert (df['SalePrice'] > 0).all()

0

In [19]:
# These checks are classified as data quality rather than business logic because they validate physical and temporal reality rather than a specific company strategy. They address "impossible" data points—such as a house being sold before it was built or a negative transaction price—which represent errors in data entry, pipeline corruption, or sensor malfunction. While business logic defines how a company chooses to process valid data, data quality checks ensure the underlying values are internally consistent and logically sound enough to be processed in the first place.

## 5. Groupby: pricing patterns by neighborhood

Compute a table by `Neighborhood` with:
- number of houses
- median sale price
- mean sale price
- mean `OverallQual`

Sort by median sale price descending and show the top 10 neighborhoods.

Constraints:
- Use clear output column names.

Write as a comment: Why is median often better than mean for prices?
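A small worked example makes the mean vs median question concrete. The five prices below are made up for illustration; one of them is a deliberate outlier.

```python
import pandas as pd

# Hypothetical prices: four typical sales and one very expensive house.
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 900_000])

mean_price = prices.mean()      # pulled upward by the single outlier
median_price = prices.median()  # the middle value, unaffected by the outlier

print(mean_price, median_price)
```

Here the mean is 280,000 although four of the five houses sold for 150,000 or less, while the median stays at 130,000.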


In [20]:
neighborhood_group = df.groupby('Neighborhood').agg(number_of_houses=('Neighborhood', 'count'),
                                                    median_sale_price=('SalePrice', 'median'),
                                                    mean_sale_price=('SalePrice', 'mean'),
                                                    mean_OverallQual=('OverallQual', 'mean'))

# The task asks for the top 10 neighborhoods by median sale price.
neighborhood_group = neighborhood_group.sort_values(by='median_sale_price', ascending=False)
neighborhood_group.head(10)

# The median is generally preferred over the mean because it is robust to outliers. Price distributions are typically "right-skewed," meaning a small number of ultra-luxury mansions or multi-million dollar sales can pull the mean significantly upward, creating a misleading "average" that is higher than what most people actually paid.

# The median represents the literal middle of the market, where half the homes are more expensive and half are cheaper. This makes it a more accurate reflection of the typical experience. While the mean is useful for understanding total volume or aggregate value, it is too easily distorted by extreme high-end data points to serve as a reliable measure of central tendency for prices.


## 6. String cleaning + standardization

Pick ONE categorical text column that often contains messy values (examples: `MSZoning`, `HouseStyle`, `KitchenQual`).

Tasks:
1) Show `value_counts()` for the chosen column.  
2) Create a cleaned version:
- strip whitespace
- uppercase consistently

Extra challenge: if you find other data inconsistencies in your chosen column, resolve those too.

3) Compare `value_counts()` before vs after.

Constraints:
- Do not use loops. We've learned a better way.

Note: this dataset is already relatively clean, so treat this just as an exercise.

Write as a comment: Why can inconsistent casing silently break groupby results?


In [21]:
df['MSZoning'].value_counts()

MSZoning
RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: count, dtype: int64

In [22]:
df['HouseStyle'].value_counts()

HouseStyle
1Story    726
2Story    445
1.5Fin    154
SLvl       65
SFoyer     37
1.5Unf     14
2.5Unf     11
2.5Fin      8
Name: count, dtype: int64

In [23]:
df['KitchenQual'].value_counts()

KitchenQual
TA    735
Gd    586
Ex    100
Fa     39
Name: count, dtype: int64

In [24]:
df['MSZoning'] = df['MSZoning'].str.strip().str.upper()
df['HouseStyle'] = df['HouseStyle'].str.strip().str.upper()
df['KitchenQual'] = df['KitchenQual'].str.strip().str.upper()


In [25]:
df['MSZoning'].value_counts()

MSZoning
RL         1151
RM          218
FV           65
RH           16
C (ALL)      10
Name: count, dtype: int64

In [26]:
df['HouseStyle'].value_counts()

HouseStyle
1STORY    726
2STORY    445
1.5FIN    154
SLVL       65
SFOYER     37
1.5UNF     14
2.5UNF     11
2.5FIN      8
Name: count, dtype: int64

In [27]:
df['KitchenQual'].value_counts()

KitchenQual
TA    735
GD    586
EX    100
FA     39
Name: count, dtype: int64

In [28]:
# Inconsistent casing silently breaks groupby results because string comparisons are case-sensitive, meaning the computer treats "London," "london," and "LONDON" as three entirely distinct categories. When a grouping operation is performed, the algorithm maps keys based on their exact binary representation; as a result, data that logically belongs to a single group is split across multiple redundant rows.
#
# This is particularly dangerous because it doesn't trigger an error or a crash. Instead, it produces "silent" analytical errors: your counts will be lower than they should be, and your aggregate statistics (like sums or means) will be fragmented across these unintended subgroups. Without a prior normalization step, such as converting all strings to lowercase, your analysis will fail to capture the true aggregate behavior of the data.
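The fragmentation described above can be reproduced on a tiny hypothetical frame (`messy`, with made-up values). The sketch groups once on the raw strings and once on a stripped, uppercased key.

```python
import pandas as pd

# Hypothetical toy data: the same zone code written three different ways.
messy = pd.DataFrame({
    "zone": ["RL", "rl", " RL", "RM"],
    "price": [100, 200, 300, 400],
})

# Grouping on the raw strings splits "RL" into three separate groups.
before = messy.groupby("zone")["price"].sum()

# Normalize with vectorized string methods, then group on the cleaned key.
cleaned_zone = messy["zone"].str.strip().str.upper()
after = messy.groupby(cleaned_zone)["price"].sum()

print(before)
print(after)
```

No error is raised in the first case; the only symptom is that the totals for one logical category are spread over several rows.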

## 7. Simple merge exercise

Steps:
1) Create a small lookup table yourself (in code) that groups neighborhoods based on number of houses sold in that area ('Sales Potential').
- More than 100: `High Sales`
- Between 25 and 100 (inclusive): `Medium Sales`
- Less than 25: `Low Sales`
2) Merge it into `df` to create a new column `Sales Potential`.
3) Show the first 5 rows of the new df, keeping only the columns Id, SalePrice, Neighborhood and Sales Potential.

This is a harder one, it's normal if you struggle a bit
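One possible approach to the steps above, sketched on hypothetical toy frames: bin the per-neighborhood counts with `pd.cut`, then merge the lookup back with `how="left"` and `validate="many_to_one"` so that rows are neither dropped nor duplicated. The column name `n_sold` and the frames `counts` and `toy` are illustrative assumptions; the three counts are the ones that appear in this notebook's outputs.

```python
import numpy as np
import pandas as pd

# Lookup table: one row per neighborhood with its number of sales.
counts = pd.DataFrame({
    "Neighborhood": ["NAmes", "Edwards", "Veenker"],
    "n_sold": [225, 100, 11],
})

# Intervals are right-closed: (-inf, 24] Low, (24, 100] Medium, (100, inf) High.
counts["Sales Potential"] = pd.cut(
    counts["n_sold"],
    bins=[-np.inf, 24, 100, np.inf],
    labels=["Low Sales", "Medium Sales", "High Sales"],
).astype(str)

# Hypothetical house-level rows to enrich.
toy = pd.DataFrame({"Id": [1, 2, 3], "Neighborhood": ["Veenker", "NAmes", "Edwards"]})

merged = toy.merge(
    counts[["Neighborhood", "Sales Potential"]],
    on="Neighborhood",
    how="left",              # keep every house, even without a lookup match
    validate="many_to_one",  # raise if the lookup has duplicate neighborhoods
)

print(merged)
```

Checking `len(merged) == len(toy)` after the merge is a cheap way to confirm that the grain did not change.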


In [29]:
def assign_sales_potential(value):
    if value > 100:
        return 'High'
    elif value < 25:
        return 'Low'
    return 'Medium'


sales_lookup = df['Neighborhood'].value_counts().reset_index()
sales_lookup['Sales Potential'] = sales_lookup['count'].apply(assign_sales_potential)

# A left merge keeps every house and preserves the row order of df.
df = df.merge(sales_lookup[['Neighborhood', 'Sales Potential']], on='Neighborhood', how='left')

df[['Id', 'SalePrice', 'Neighborhood', 'Sales Potential']].head()

   Id  SalePrice Neighborhood Sales Potential
0   1     208500      CollgCr            High
1   2     181500      Veenker             Low
2   3     223500      CollgCr            High
3   4     140000      Crawfor          Medium
4   5     250000      NoRidge          Medium


In [30]:
sales_lookup

  Neighborhood  count Sales Potential
0        NAmes    225            High
1      CollgCr    150            High
2      OldTown    113            High
3      Edwards    100          Medium
4      Somerst     86          Medium
5      Gilbert     79          Medium
6      NridgHt     77          Medium
7       Sawyer     74          Medium
8       NWAmes     73          Medium
9      SawyerW     59          Medium


## 9. Normalization concept: per-unit price

Create two new columns:
- `median_neigbourhood_price_per_sqft`
- `median_neigbourhood_SalePrice`

Sort by median `price_per_sqft` descending and show top 10.

Write as a comment: Why can `price_per_sqft` change the ranking compared to raw `SalePrice`?
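A toy example shows how the ranking can flip once price is normalized by size. The two neighborhoods "A" and "B" and all values below are hypothetical.

```python
import pandas as pd

# Hypothetical data: A has large, expensive houses; B has small, cheaper ones.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "SalePrice": [300_000, 320_000, 200_000, 220_000],
    "GrLivArea": [3000, 3200, 1000, 1100],
})
toy["price_per_sqft"] = toy["SalePrice"] / toy["GrLivArea"]

medians = toy.groupby("Neighborhood")[["SalePrice", "price_per_sqft"]].median()

top_by_price = medians["SalePrice"].idxmax()
top_by_ppsf = medians["price_per_sqft"].idxmax()

print(medians)
print(top_by_price, top_by_ppsf)
```

Neighborhood A leads on raw price only because its houses are three times larger; per square foot, B is twice as expensive.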


In [31]:
df['price_per_sqft'] = df['SalePrice'] / df['GrLivArea']

df[['median_neigbourhood_price_per_sqft', 'median_neigbourhood_SalePrice']] = df.groupby('Neighborhood')[
    ['price_per_sqft', 'SalePrice']].transform('median')


## 10. Capstone: a clean analysis table

Create a DataFrame `analysis_df` containing (if columns exist):
- `SalePrice`
- `Neighborhood`
- `Sales Potential`
- `OverallQual`
- `GrLivArea`
- `LotArea`
- `house_age_at_sale`
- `is_remodeled`
- `price_per_sqft`

Requirements:
- No missing values in `house_age_at_sale`, `is_remodeled`, `price_per_sqft`.
- Show `analysis_df.head()`
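The "if columns exist" and "no missing values" requirements can both be expressed in a few lines. This is a sketch on a hypothetical two-row frame (`toy`); the names `wanted`, `present` and `required` are illustrative.

```python
import pandas as pd

# Hypothetical toy frame that lacks the Neighborhood column.
toy = pd.DataFrame({
    "SalePrice": [208500, 181500],
    "GrLivArea": [1710, 1262],
    "house_age_at_sale": [5, 31],
})

# Keep only the requested columns that are actually present, in the requested order.
wanted = ["SalePrice", "Neighborhood", "GrLivArea", "house_age_at_sale"]
present = [col for col in wanted if col in toy.columns]
analysis = toy[present].copy()

# Fail loudly if a required column contains missing values.
required = ["house_age_at_sale"]
assert analysis[required].notna().all().all(), "missing values in required columns"

print(analysis.head())
```

Taking a `.copy()` makes the analysis table independent of the source frame, so later edits to one do not affect the other.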


In [32]:
analysis_df = df[[
    'SalePrice',
    'Neighborhood',
    'Sales Potential',
    'OverallQual',
    'GrLivArea',
    'LotArea',
    'house_age_at_sale',
    'is_remodeled',
    'price_per_sqft',
]]

In [33]:
analysis_df[['house_age_at_sale', 'is_remodeled', 'price_per_sqft']].isna().sum()

house_age_at_sale    0
is_remodeled         0
price_per_sqft       0
dtype: int64

In [34]:
analysis_df.head()

   SalePrice Neighborhood Sales Potential  OverallQual  GrLivArea  LotArea  house_age_at_sale  is_remodeled  price_per_sqft
0     208500      CollgCr            High            7       1710     8450                  5         False      121.929825
1     181500      Veenker             Low            6       1262     9600                 31         False      143.819334
2     223500      CollgCr            High            7       1786    11250                  7          True      125.139978
3     140000      Crawfor          Medium            7       1717     9550                 91          True       81.537566
4     250000      NoRidge          Medium            8       2198    14260                  8         False      113.739763
