# Exercise: House price levels and dispersion

For this exercise, we're using data on around 1,500 observations of house prices and house characteristics from Ames, a small city in Iowa.

1. Load the Ames housing data set from `ames_houses.csv` located in the `data/` folder. 
2. Restrict the data to the columns `SalePrice` and `Neighborhood`.
3. Check that there are no observations with missing values in this data.
4. Compute the average house price (column `SalePrice`) by neighborhood (column `Neighborhood`). List the three most expensive neighborhoods, for example by using [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html).
5. You want to quantify the price dispersion in each neighborhood. To this end, compute the standard deviation of `SalePrice` by neighborhood using [`std()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html). Which are the three neighborhoods with the most dispersed prices?
6. An alternative measure of dispersion is the ratio of the 90th and 10th percentile of the house price distribution. Use the [`quantile()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html) method to compute the P90 and P10 statistics by neighborhood, compute their ratio and print the three neighborhoods with the largest dispersion.

    *Hint:* The `quantile()` function takes _quantiles_ as arguments, i.e., instead of the 90th percentile you need to specify the quantile as 0.9.

In [2]:
import pandas as pd 
import numpy as np 

DATA_PATH = '../data'

fn = f'{DATA_PATH}/ames_houses.csv'

df = pd.read_csv(fn, sep=',')

df

Unnamed: 0,SalePrice,LotArea,Neighborhood,BuildingType,OverallQuality,OverallCondition,YearBuilt,CentralAir,LivingArea,Bathrooms,Bedrooms,Fireplaces,HasGarage
0,208500.0,784.954075,CollgCr,Single-family,7,5,2003,Y,158.848694,2,3,0,1
1,181500.0,891.782144,Veenker,Single-family,6,8,1976,Y,117.232194,2,3,1,1
2,223500.0,1045.057200,CollgCr,Single-family,7,5,2001,Y,165.908636,2,3,1,1
3,140000.0,887.137445,Crawfor,Single-family,7,5,1915,Y,159.498952,1,3,1,1
4,250000.0,1324.668060,NoRidge,Single-family,8,5,2000,Y,204.180953,2,4,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,175000.0,735.441587,Gilbert,Single-family,6,5,1999,Y,152.996374,2,3,1,1
1456,210000.0,1223.878099,NWAmes,Single-family,6,6,1978,Y,192.569207,2,3,2,1
1457,266500.0,839.947307,Crawfor,Single-family,7,9,1941,Y,217.371898,2,4,2,1
1458,142125.0,902.650739,NAmes,Single-family,5,6,1950,Y,100.139703,1,2,0,1


In [3]:
df.head()


Unnamed: 0,SalePrice,LotArea,Neighborhood,BuildingType,OverallQuality,OverallCondition,YearBuilt,CentralAir,LivingArea,Bathrooms,Bedrooms,Fireplaces,HasGarage
0,208500.0,784.954075,CollgCr,Single-family,7,5,2003,Y,158.848694,2,3,0,1
1,181500.0,891.782144,Veenker,Single-family,6,8,1976,Y,117.232194,2,3,1,1
2,223500.0,1045.0572,CollgCr,Single-family,7,5,2001,Y,165.908636,2,3,1,1
3,140000.0,887.137445,Crawfor,Single-family,7,5,1915,Y,159.498952,1,3,1,1
4,250000.0,1324.66806,NoRidge,Single-family,8,5,2000,Y,204.180953,2,4,1,1


In [4]:
df2 = df[['SalePrice', 'Neighborhood']]
df2

Unnamed: 0,SalePrice,Neighborhood
0,208500.0,CollgCr
1,181500.0,Veenker
2,223500.0,CollgCr
3,140000.0,Crawfor
4,250000.0,NoRidge
...,...,...
1455,175000.0,Gilbert
1456,210000.0,NWAmes
1457,266500.0,Crawfor
1458,142125.0,NAmes


In [5]:
df2.isna().sum()

SalePrice       0
Neighborhood    0
dtype: int64

In [6]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SalePrice     1460 non-null   float64
 1   Neighborhood  1460 non-null   object 
dtypes: float64(1), object(1)
memory usage: 22.9+ KB


In [14]:
# Group the restricted data by neighborhood and list the three highest average prices
groups = df2.groupby('Neighborhood')
groups['SalePrice'].mean().sort_values(ascending=False).head(3)

Neighborhood
NoRidge    335295.317073
NridgHt    316270.623377
StoneBr    310499.000000
Name: SalePrice, dtype: float64

In [15]:
# Standard deviation of prices by neighborhood, largest first
groups = df2.groupby('Neighborhood')
groups['SalePrice'].std().sort_values(ascending=False).head(3)

Neighborhood
NoRidge    121412.658640
StoneBr    112969.676640
NridgHt     96392.544954
Name: SalePrice, dtype: float64

In [17]:
P10 = groups['SalePrice'].quantile(0.1)
P90 = groups['SalePrice'].quantile(0.9)

In [18]:
ratio = P90 / P10

In [19]:
ratio.sort_values(ascending=False).head(3)

Neighborhood
IDOTRR     2.546182
StoneBr    2.533834
BrkSide    2.309796
Name: SalePrice, dtype: float64
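
The mean, the standard deviation and the P90/P10 ratio computed above can also be collected in a single table with one `agg()` call. The following is a minimal sketch on made-up toy data (the neighborhoods `A` and `B`, their prices, and the column name `P90_P10` are assumptions for illustration, not part of the Ames data):

```python
import pandas as pd

# Toy data standing in for the SalePrice / Neighborhood columns (values are made up)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'SalePrice': [100.0, 200.0, 300.0, 100.0, 110.0, 120.0],
})

# Named aggregation: each keyword becomes a column of the result
stats = toy.groupby('Neighborhood')['SalePrice'].agg(
    mean='mean',
    std='std',
    P10=lambda x: x.quantile(0.1),
    P90=lambda x: x.quantile(0.9),
)

# Ratio of the 90th to the 10th percentile as a measure of dispersion
stats['P90_P10'] = stats['P90'] / stats['P10']

# Neighborhoods ordered from most to least dispersed
print(stats.sort_values('P90_P10', ascending=False))
```

Keeping all statistics in one `DataFrame` makes it easy to compare the rankings implied by the different dispersion measures side by side.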

***
# Exercise: Determinants of house prices

For this exercise, we're using data on around 1,500 observations of house prices and house characteristics from Ames, a small city in Iowa.

1.  Load the Ames housing data set from `ames_houses.csv` located in the `data/` folder. 
2.  Restrict the data to the columns `SalePrice`, `LotArea` and `Bedrooms`.
3.  Restrict your data set to houses with one or more bedrooms and a lot area of at least 100m².
4.  Compute the average lot area. Create a new column `LargeLot` which takes on the value of 1 if the lot area is above the average (_"large"_), and 0 otherwise (_"small"_). 

    What is the average lot area within these two categories?
5.  Create a new column `Rooms` which categorizes the number of `Bedrooms` into three groups: 1, 2, and 3 or more. You can create these categories using boolean indexing, [`np.where()`](https://numpy.org/doc/2.0/reference/generated/numpy.where.html), pandas's [`where()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html), or some other way.
6.  Compute the mean `SalePrice` within each group formed by `LargeLot` and `Rooms` (for a total of 6 different categories) using [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html).
7.  Compute and report the average price difference between 1 and 2 bedrooms for a house with a small lot area.
8.  Compute and report the average price difference between a small and a large lot for a house with 2 bedrooms.
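
The steps above could be approached along the following lines. This is only a sketch on a small made-up data set (all values in `toy` are invented, and the variable names `means`, `diff_rooms` and `diff_lot` are chosen for illustration), so the numbers say nothing about the Ames data:

```python
import numpy as np
import pandas as pd

# Made-up data with the three columns used in the exercise
toy = pd.DataFrame({
    'SalePrice': [100.0, 120.0, 150.0, 180.0, 200.0, 260.0, 90.0],
    'LotArea': [300.0, 900.0, 400.0, 1000.0, 500.0, 1100.0, 50.0],
    'Bedrooms': [1, 1, 2, 2, 3, 4, 0],
})

# Step 3: keep houses with at least one bedroom and a lot area of at least 100 m²
toy = toy.query('Bedrooms >= 1 and LotArea >= 100').copy()

# Step 4: indicator for lots above the average lot area
mean_area = toy['LotArea'].mean()
toy['LargeLot'] = (toy['LotArea'] > mean_area).astype(int)
print(toy.groupby('LargeLot')['LotArea'].mean())

# Step 5: group bedrooms into 1, 2, and 3 or more (coded as 3)
toy['Rooms'] = np.where(toy['Bedrooms'] >= 3, 3, toy['Bedrooms'])

# Step 6: mean price within each (LargeLot, Rooms) cell
means = toy.groupby(['LargeLot', 'Rooms'])['SalePrice'].mean()
print(means)

# Step 7: 2 vs. 1 bedrooms, small lot; Step 8: large vs. small lot, 2 bedrooms
diff_rooms = means.loc[(0, 2)] - means.loc[(0, 1)]
diff_lot = means.loc[(1, 2)] - means.loc[(0, 2)]
print(diff_rooms, diff_lot)
```

Because `means` is indexed by the pair `(LargeLot, Rooms)`, each price difference is a subtraction of two entries selected with `.loc`.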

***
# Exercise: Inflation and unemployment in the US

In this exercise, you'll be working with selected macroeconomic variables for the United States reported at monthly frequency obtained from [FRED](https://fred.stlouisfed.org/).
The data set starts in 1948 and contains observations for a total of 864 months.

1.  Load the data from the file `FRED_monthly.csv` located in the `data/` folder. Print the first 10 observations to get an idea of what the data looks like.
2.  Keep only the columns `Year`, `Month`, `CPI`, and `UNRATE`. Moreover, perform this analysis only on observations prior to 1970 and drop the rest.
3.  Since pandas has great support for time series data, we want to create an index based on observation dates. 

    -   To this end, use [`to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) to convert the `Year` and `Month` columns into a date.

        *Hint:* `to_datetime()` requires information on Year/Month/Day, so you need to create a `Day` column first and assign it a value of 1.
        You can then call `to_datetime()` with the argument `df[['Year', 'Month', 'Day']]` to create the corresponding date.
    -   Store the date information in the column `Date`. Delete the columns `Year`, `Month` and `Day` once you are done as these are no longer needed.
    -   Set the `Date` column as the index for the `DataFrame` using [`set_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html).

4.  The column `CPI` stores the consumer price index for the US. You may be more familiar with the concept of inflation, which is the percent change of the CPI relative to the previous period. 
    Create a new column `Inflation` which contains the _annual_ inflation _in percent_ relative to the same month in the previous year by applying 
    [`pct_change()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pct_change.html) to the column `CPI`.

    *Hints:* 
    
    -   Since this is monthly data, you need to pass the argument `periods=12` to `pct_change()` to get annual percent changes.
    -   You need to multiply the values returned by `pct_change()` by 100 to get percent values.

5.  Compute the average unemployment rate (column `UNRATE`) over the whole sample period. Create a new column `UNRATE_HIGH` that contains an indicator which is 1 whenever the unemployment rate is above its average value (_"high unemployment period"_), and 0 otherwise.
    -   How many observations fall into the high- and the low-unemployment periods?
    -   What is the average unemployment rate in the high- and low-unemployment periods?
6.  Compute the average inflation rate for high- and low-unemployment periods. Is there any difference?
7.  Use [`resample()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html) to aggregate
    the inflation data to annual frequency and compute the average inflation within each calendar year.

    Which are the three years with the highest inflation rates in the sample?

    *Hint:* Use the resampling rule `'YE'` when calling `resample()`.
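
The date handling in steps 3, 4 and 7 can be illustrated on synthetic data. The sketch below assumes a made-up CPI series that is constant within each year (100, 110 and 132), so that annual inflation is 10% in 1949 and 20% in 1950. The fallback to the rule `'A'` is an assumption to accommodate older pandas versions that do not recognize `'YE'`:

```python
import pandas as pd

# Synthetic monthly data for 1948-1950 (CPI values are made up)
years = [1948 + i // 12 for i in range(36)]
months = [i % 12 + 1 for i in range(36)]
level = {1948: 100.0, 1949: 110.0, 1950: 132.0}

df = pd.DataFrame({'Year': years, 'Month': months})
df['CPI'] = df['Year'].map(level)

# Step 3: build a date from Year/Month/Day and use it as the index
df['Day'] = 1
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
df = df.drop(columns=['Year', 'Month', 'Day']).set_index('Date')

# Step 4: percent change relative to the same month one year earlier
df['Inflation'] = df['CPI'].pct_change(periods=12) * 100

# Step 7: average inflation within each calendar year
try:
    annual = df['Inflation'].resample('YE').mean()
except ValueError:
    # Older pandas versions use 'A' for year-end frequency
    annual = df['Inflation'].resample('A').mean()

# Years ordered from highest to lowest average inflation
print(annual.sort_values(ascending=False).head(3))
```

The first 12 months have no inflation value because there is no observation one year earlier, so the first calendar year's average is missing.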


In [20]:
DATA_PATH = '../data'
fn = f'{DATA_PATH}/FRED_monthly.csv'
df = pd.read_csv(fn)
df

Unnamed: 0,Year,Month,CPI,UNRATE,FEDFUNDS,REALRATE,LFPART
0,1948,1,23.7,3.4,,,58.6
1,1948,2,23.7,3.8,,,58.9
2,1948,3,23.5,4.0,,,58.5
3,1948,4,23.8,3.9,,,59.0
4,1948,5,24.0,3.5,,,58.3
...,...,...,...,...,...,...,...
859,2019,8,256.0,3.6,2.1,0.6,63.1
860,2019,9,256.4,3.5,2.0,0.3,63.2
861,2019,10,257.2,3.6,1.8,-0.0,63.3
862,2019,11,257.9,3.6,1.6,-0.2,63.3


In [22]:
df2 = df[['Year', 'Month', 'CPI', 'UNRATE']]
df2


Unnamed: 0,Year,Month,CPI,UNRATE
0,1948,1,23.7,3.4
1,1948,2,23.7,3.8
2,1948,3,23.5,4.0
3,1948,4,23.8,3.9
4,1948,5,24.0,3.5
...,...,...,...,...
859,2019,8,256.0,3.6
860,2019,9,256.4,3.5
861,2019,10,257.2,3.6
862,2019,11,257.9,3.6


In [26]:
df = df.drop(columns = ['FEDFUNDS', 'REALRATE', 'LFPART'])
df

Unnamed: 0,Year,Month,CPI,UNRATE
0,1948,1,23.7,3.4
1,1948,2,23.7,3.8
2,1948,3,23.5,4.0
3,1948,4,23.8,3.9
4,1948,5,24.0,3.5
...,...,...,...,...
859,2019,8,256.0,3.6
860,2019,9,256.4,3.5
861,2019,10,257.2,3.6
862,2019,11,257.9,3.6


In [27]:
# Keep only observations prior to 1970
df = df.query('Year < 1970')
df.tail(10)

Unnamed: 0,Year,Month,CPI,UNRATE
254,1969,3,36.1,3.4
255,1969,4,36.3,3.4
256,1969,5,36.4,3.4
257,1969,6,36.6,3.5
258,1969,7,36.8,3.5
259,1969,8,36.9,3.5
260,1969,9,37.1,3.7
261,1969,10,37.3,3.7
262,1969,11,37.5,3.5
263,1969,12,37.7,3.5
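
The high-unemployment indicator from steps 5 and 6 could then be built along these lines. This is a sketch on made-up values rather than the FRED data (the numbers in `toy` and the names `counts` and `summary` are assumptions for illustration):

```python
import pandas as pd

# Made-up unemployment and inflation rates in percent
toy = pd.DataFrame({
    'UNRATE': [3.0, 4.0, 5.0, 8.0],
    'Inflation': [4.0, 3.0, 2.0, 1.0],
})

# Indicator: 1 if unemployment is above its sample average, 0 otherwise
mean_unrate = toy['UNRATE'].mean()
toy['UNRATE_HIGH'] = (toy['UNRATE'] > mean_unrate).astype(int)

# Number of observations in the low- and high-unemployment periods
counts = toy['UNRATE_HIGH'].value_counts()
print(counts)

# Average unemployment and inflation within each period
summary = toy.groupby('UNRATE_HIGH')[['UNRATE', 'Inflation']].mean()
print(summary)
```

Months with missing `Inflation` (the first 12 months of a sample) are ignored by `mean()`, so they do not need to be dropped beforehand.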
