In [None]:
import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
sfh_df = pd.read_csv('data/sfhousing.csv', on_bad_lines='skip',
                     usecols=['city','zip','street', 'price', 'br', 'lsqft', 'bsqft', 'date'])
sfh_df["price"] = pd.to_numeric(sfh_df["price"], errors='coerce')
sfh_df = sfh_df.dropna(subset=['price'])

In [None]:
date_format = '%Y-%m-%d'

def parse_dates_and_years(df, column='date'):
    dates = pd.to_datetime(df[column], format=date_format)
    return df.assign(timestamp=dates)

sfh_df = sfh_df.pipe(parse_dates_and_years)
sfh_df = sfh_df.drop(['date'], axis=1)

(sec:eda_example)=
# Example: Sale Prices for Houses

In this final section, we carry out an exploratory analysis using the questions in the previous
section to direct our investigations. Although EDA typically begins in the data
wrangling stage, for demonstration purposes the data we work with here
have already been partially cleaned so that we can focus on exploring
the features of interest. Note also that we do not discuss creating the
visualizations in much detail; that topic is covered in {numref}`Chapter %s <ch:viz>`.

First, we consider the scope (see {numref}`Chapter %s <ch:data_scope>`) and granularity (see {numref}`Chapter %s <ch:wrangling>`) of the data.

These data were scraped from the San Francisco Chronicle (SFChron)
website[^SFChron]. They form a complete list of homes sold in the area from April 2003 to December 2008.
Since we have no plans to generalize our findings beyond the time period and location where the data were collected, we are working with a census: the population matches the access frame, and the sample consists of the entire population.

[^SFChron]:The SFChron published weekly data on the sale of houses in the San
Francisco Bay Area.

Each record represents a sale of a home in the SF Bay Area
during the specified time period.  This means that if a home was sold
twice during this time, then it will have two records in the table. And, if a
home in the Bay Area was not up for sale during this time, then it will not appear in
the dataset.

The data are in the data frame `sfh_df`.

In [None]:
sfh_df

The dataset does not have an accompanying codebook, but we
can determine the features and their storage types by inspection.

In [None]:
sfh_df.info()

Based on the names of the fields, we expect the primary key to consist of some
combination of city, zip, street address, and date.
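
One way to test a candidate key is to check whether any two records share the same values in the chosen columns. The sketch below uses made-up records, and the helper `is_unique_key` is our own illustration, not part of the dataset or of pandas.

```python
import pandas as pd

# Toy records standing in for house sales; these values are made up.
sales = pd.DataFrame({
    'city':   ['Albany', 'Albany', 'Berkeley', 'Berkeley'],
    'zip':    ['94706', '94706', '94702', '94702'],
    'street': ['1 Main St', '1 Main St', '2 Oak Ave', '3 Oak Ave'],
    'date':   ['2003-05-04', '2006-08-13', '2004-02-01', '2004-02-01'],
})

def is_unique_key(df, cols):
    """Return True if no two rows share the same values in `cols`."""
    return not df.duplicated(subset=cols).any()

# A home sold twice repeats its address, so the address alone is not a key.
address_only = is_unique_key(sales, ['city', 'zip', 'street'])
address_and_date = is_unique_key(sales, ['city', 'zip', 'street', 'date'])
print(address_only, address_and_date)
```

In the toy data, the first home was sold twice, so the address needs the sale date to identify a record. The same check could be applied to the real data frame, where duplicates would point to either repeat sales on one day or repeated records.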

Sale price is our focus. So let's begin by exploring the distribution of sale price.
To develop your intuition about distributions, make a guess about the shape of the distribution
before you start reading the next section. Don't worry about the range of prices, just sketch the general shape.

## Understanding Price

A good guess for the distribution of sale price is that it is highly skewed to the right, with a few very expensive houses. The summary statistics shown below confirm this skewness.

In [None]:
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)

In [None]:
percs = [0, 25, 50, 75, 100]
# method= replaces the deprecated interpolation= keyword (NumPy 1.22+)
prices = np.percentile(sfh_df['price'], percs, method='lower')
pd.DataFrame({'price': prices}, index=percs)

The median is closer to the lower quartile than the upper quartile. 
Also, the maximum is 40 times as large as the median!
We might wonder whether that \$20M sale price is simply an anomalous value or
whether there are many houses that sold at such a high price. To find out, we can zoom in on
the right tail of the distribution and compute a few high percentiles.

In [None]:
percs = [95, 97, 98, 99, 99.5, 99.9]
# method= replaces the deprecated interpolation= keyword (NumPy 1.22+)
prices = np.percentile(sfh_df['price'], percs, method='lower')
pd.DataFrame({'price': prices}, index=percs)

We see that $99.9\%$ of the houses sold for under $\$4M$ so the $\$20M$ sale is indeed
a rarity. Let's examine the histogram of sale prices below $\$4M$ because 
fewer than 1 in 1,000 sales exceeded $\$4M$. 

In [None]:
sfh = sfh_df[sfh_df['price'] < 4_000_000]
fig = px.histogram(sfh, 
                   x='price', width=400, height=250)

fig.update_traces(xbins=dict( # bins used for histogram
        start=0.0,
        end=4_000_000,
        size=100_000
    ))

fig.show()

Even without the top 0.1%, the distribution remains highly skewed to the right, with a single mode around \$500k.
Let's plot the histogram of the log-transformed sale price. The logarithm transformation often does a good job of converting a right-skewed distribution into one that is more symmetric.

In [None]:
sfh = sfh.assign(log_price=np.log10(sfh['price']))
fig = px.histogram(sfh, x='log_price', nbins=40,
                   width=400, height=250)

fig.show()

We see that the distribution of log-transformed sale price is roughly symmetric. 
Now that we have an understanding of the distribution of sale price, let's
consider the so-what questions posed in the previous section on EDA guidelines.
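
The effect of the log transformation on symmetry can also be checked numerically by comparing how far the quartiles sit from the median. The sketch below uses synthetic lognormal values rather than the housing data, and the helper `quartile_skew` (often called Bowley's skewness) is our own addition for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (lognormal); not the housing data.
prices = 10 ** rng.normal(loc=5.7, scale=0.25, size=10_000)

def quartile_skew(x):
    """Quartile skewness: near 0 when the quartiles are equally far
    from the median, positive when the upper quartile is farther away."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

raw_skew = quartile_skew(prices)
log_skew = quartile_skew(np.log10(prices))
print(f"raw: {raw_skew:.3f}, log10: {log_skew:.3f}")
```

A positive value on the raw scale says the median is closer to the lower quartile than to the upper one, which is the pattern we saw in the summary statistics. A value near zero after the transformation matches the roughly symmetric histogram.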

## What Next?

We have a description of the shape of sale price, but we need to consider why the shape matters and
look for subgroups of the data that might have different distributions.  

Shape matters because models and statistics based on symmetric distributions tend to have more robust and stable properties
than those based on highly skewed distributions. (We address this issue more when we cover linear models in
{numref}`Chapter %s <ch:linear>`.)
For this reason, we primarily work with the log-transformed sale price.
And, we might also choose to limit our analysis to sale prices under \$4M since the super-expensive houses may behave quite differently.

To address the considerations of subgroups and further comparisons of sale price, 
we look to the context. The housing market rose rapidly during this time, and then the bottom fell out of the market.
So the distribution of sale price in, say, 2004, might be quite different
than in 2008, right before the crash. To explore this notion further, we can examine
the behavior of prices over time. Alternatively, we can fix time, and examine the
relationships between price and the other features of interest. 
Both approaches are potentially worthwhile and we consider them both.

We begin with the approach of narrowing our focus. We do this by first limiting the data to
sales made in one calendar year, 2004, so rising prices should have a limited
impact on the distributions and relationships that we examine. And, to limit the
influence of the very expensive and large houses, we also restrict ourselves to
sales below \$4M and houses smaller than 12,000 ft^2. This subset still
contains large and expensive houses, but not outrageously so. Later, we
further restrict our exploration to a few cities of interest.

In [None]:
def subset(df):
    return df.loc[(df['price'] < 4_000_000) &
                  (df['bsqft'] < 12_000) & 
                  (df['timestamp'].dt.year == 2004)]


sfh = sfh_df.pipe(subset)
sfh


For this subset, the shape of the distribution of sale price remains the
same---price is still highly skewed to the right. We continue to work with
this subset to address the question of whether there are any potentially important
features to study along with price.

## Examining Other Features

In addition to the date of the sale, which we identified earlier as a feature of interest, a few other features that might be important to our investigation are the size of the house, lot (or property)
size, and number of bedrooms. We explore the distributions of these features
and their relationship to sale price.

Let's begin with building and lot size.

Since the size of the property is likely related to its price, it seems
reasonable to guess that these features are also skewed to the right. The figure
below shows the distribution of building size on the left and the log-transformed distribution on the right.

In [None]:
sfh = sfh.assign(log_bsqft=np.log10(sfh['bsqft']))

left = px.histogram(sfh, x='bsqft', histnorm='percent', nbins=60)
right = px.histogram(sfh, x='log_bsqft', histnorm='percent', nbins=60)
fig = left_right(left, right)

In [None]:
fig.update_xaxes(title='Building size (ft^2)', row=1, col=1)
fig.update_xaxes(title='Building size (ft^2, log10)', row=1, col=2)
fig.update_yaxes(title="percent", range=[0, 18], row=1, col=1)
fig.update_yaxes(range=[0, 18], row=1, col=2)
fig.update_layout(width=450, height=250)
fig.show()

The distribution is unimodal with a peak at about 1,500 ft^2, and
many houses are over 2,500 ft^2 in size. 
We have confirmed our intuition: the log-transformed building size is
nearly symmetric, although it maintains a slight skew. The same is the case for
the distribution of lot size.

Given that both house and lot size have skewed distributions, a scatter plot of the two should most likely be on a log scale too. We compare the scatter plot with and without the log transformation below.

In [None]:
sfh = sfh.assign(log_lsqft=np.log10(sfh['lsqft']))

```{figure} figures/scatterPlot_price_bsqft.jpg
---
name: scatter-over-plot
---

```

The scatter plot on the left is in the original
units, which makes it difficult to discern the relationship because most of the points are crowded into the bottom of the plotting region.
On the other hand, the scatter plot on the right reveals a few interesting features:
there is a horizontal line along the bottom of the scatter plot where it appears that many houses have the same lot size no matter the building size;
and there appears to be a slight positive log-log linear association between lot and building size.


Let's look at some lower quantiles of lot size to try to figure out this unusual value:

In [None]:
percs = [0.5, 1, 1.5, 2, 2.5, 3]
# method= replaces the deprecated interpolation= keyword (NumPy 1.22+)
lots = np.percentile(sfh['lsqft'].dropna(), percs, method='lower')
pd.DataFrame({'lot_size': lots}, index=percs)

We found something interesting! About 2.5% of the houses have a lot size of 436
ft^2. This is an avenue of investigation worth pursuing, which we make a note of.
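
A repeated value like this can also be spotted directly by tabulating how often each distinct value occurs. The sketch below uses a small made-up series of lot sizes, so the proportions are illustrative only.

```python
import pandas as pd

# Made-up lot sizes with one suspiciously common value (436).
lot_sizes = pd.Series([436, 436, 436, 2500, 3100, 4000, 5200, 6000, 7500, None])

# Share of non-missing records taken by each distinct value, largest first.
shares = lot_sizes.value_counts(normalize=True)
top_value = shares.index[0]
top_share = shares.iloc[0]
print(top_value, round(top_share, 2))
```

Since continuous measurements such as lot size rarely repeat exactly, a single value that accounts for a noticeable share of records is worth a closer look.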

Another measure of house size is the number of bedrooms. Since this is a
discrete quantitative variable, we can treat it as a qualitative feature and
make a bar plot. 

Houses in the Bay Area tend to be on the smaller side, so we venture to guess
that the distribution will have a peak at three and be skewed to the right, with a
few houses having 5 or 6 bedrooms.

In [None]:
br_cat = sfh.groupby(by=["br"]).size().reset_index(name="counts")
px.bar(br_cat, x="br", y="counts", width=350, height=250)

The bar plot confirms that we generally had the right idea. However, we find that there are some houses with over 30 bedrooms! That's a bit hard to believe and points to a potential data quality problem. Since the records include the addresses of the houses, we can double check these values on a real estate app.

In the meantime, we can transform the number of bedrooms into an ordinal feature by
reassigning all values of 8 or more to 8+, and recreate the bar plot for the
transformed data.

In [None]:
eight_up = sfh.loc[sfh['br'] >= 8, 'br'].unique()
sfh['new_br'] = sfh['br'].replace(eight_up, 8)

br_cat = sfh.groupby(by='new_br').size().reset_index(name="counts")
px.bar(br_cat, x="new_br", y="counts", width=350, height=250)

We can see that even lumping all of the houses with 8+ bedrooms together, they do not amount to many. The distribution is nearly symmetric with a peak at 3: nearly the same proportion of houses have
2 or 4 bedrooms, and nearly the same have 1 or 5. There is some asymmetry,
with a few houses having 6 or more bedrooms.
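
As an aside, the same lumping can be written by capping the values rather than replacing them one by one. The sketch below compares the two approaches on a made-up series of bedroom counts; it is an alternative we offer for illustration, not the approach used for the plots in this section.

```python
import numpy as np
import pandas as pd

# Made-up bedroom counts, including an implausible 32 and a missing value.
br = pd.Series([1, 3, 3, 8, 9, 32, np.nan])

# Approach used in this section: replace every observed value >= 8 with 8.
eight_up = list(br[br >= 8].unique())
replaced = br.replace(eight_up, 8)

# Alternative: cap values at 8 directly; missing values stay missing.
clipped = br.clip(upper=8)

print(clipped.tolist())
```

Both versions leave missing bedroom counts untouched, so records without a count are not silently moved into the 8+ group.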

Now, we examine the relationship between the number of bedrooms and sale price.
Before we proceed, we save the transformations done thus far into `sfh`.

In [None]:
def log_vals(df):
    return df.assign(log_price=np.log10(df['price']),
                     log_bsqft=np.log10(df['bsqft']),
                     log_lsqft=np.log10(df['lsqft']))

def clip_br(df):
    eight_up = df.loc[df['br'] >= 8, 'br'].unique()
    new_br = df['br'].replace(eight_up, 8)
    return df.assign(new_br=new_br)

sfh = (sfh_df
 .pipe(subset)
 .pipe(log_vals)
 .pipe(clip_br)
)

Now we're ready to consider relationships between the number of bedrooms and other variables.

## Delving Deeper into Relationships

Let's begin by examining how the distribution of price changes for houses with
different numbers of bedrooms. We can do this with box plots.

In [None]:
px.box(sfh, x='new_br', y='price', log_y=True,
      width=450, height=250)

The median sale price increases with the number of bedrooms from 1
to 5 bedrooms, but for the largest houses (those with 6, 7, and 8+ bedrooms),
the distribution of log-transformed sale price appears nearly the same.

We would expect that houses with one bedroom are smaller than houses with, say,
4 bedrooms. We might also guess that houses with 6 or more bedrooms are similar
in size. To dive deeper, we consider a kind of transformation that divides price by
building size to give us the price per square foot.
We want to check whether this feature is constant for all houses; in other words, whether price
is primarily determined by size. To do this, we look at two relationships: price against size,
and price per square foot against size.
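
The two relationships are tied together by the logarithm: since price per square foot is price divided by size, its log is the log of price minus the log of size. The sketch below illustrates this with synthetic values that follow an exact power law; the constants `c` and `b` are made up and are not estimates from the housing data.

```python
import numpy as np

# Synthetic sizes and prices following price = c * size**b; c and b are made up.
size = np.linspace(500, 5000, 50)
c, b = 2000.0, 0.7
price = c * size ** b
ppsf = price / size  # price per square foot

# Slopes of the straight lines in log-log space.
slope_price = np.polyfit(np.log10(size), np.log10(price), deg=1)[0]
slope_ppsf = np.polyfit(np.log10(size), np.log10(ppsf), deg=1)[0]
print(round(slope_price, 3), round(slope_ppsf, 3))
```

In this idealized setting, the slope for price per square foot is the slope for price minus one. If price were exactly proportional to size, the first slope would be 1 and price per square foot would be constant; a slope below 1 means larger houses cost less per square foot.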

We create two scatter plots. The one on the left shows price
against the building size (both log-transformed), and the plot on the right
shows price per square foot (log-transformed) against building size. In addition, each
plot has an added smooth curve that reflects the local average price (or price
per square foot) for buildings of roughly the same size.

In [None]:
sfh = sfh.assign(
    ppsf=sfh['price'] / sfh['bsqft'], 
    log_ppsf=lambda df: np.log10(df['ppsf']))

```{figure} figures/trendPPSF.jpg
---
name: trend-PPSF-plot
---

```

The lefthand plot shows what we expect---larger houses cost more. We also see
that there is roughly a linear association between these features on the log-log scale.

The righthand plot in this figure is interestingly nonlinear. We see that
smaller houses cost more per square foot than larger ones, and the price per
square foot for larger houses (houses with many bedrooms) is relatively flat.
This feature appears to be quite interesting so we save the price per square foot transforms into `sfh`.

In [None]:
def compute_ppsf(df):
    return df.assign(
    ppsf=df['price'] / df['bsqft'], 
    log_ppsf=lambda df: np.log10(df['ppsf']))

sfh = (sfh_df
 .pipe(subset)
 .pipe(log_vals)
 .pipe(clip_br)
 .pipe(compute_ppsf)
)

So far we haven't considered the relationship between prices and location. There are house
sales from over 150 different cities in this dataset. Some cities have a
handful of sales and others have thousands. We continue our narrowing down of the data
and examine relationships for a few cities next.

## Fixing Time and Location

One factor to consider is location. You may have heard the expression:
There are three things that matter in real estate: *location, location, location.*
Comparing price across cities might bring added value to our investigation.

We examine data for some cities in the East Bay: Richmond, El Cerrito, Albany,
Berkeley, Walnut Creek, Lamorinda (which is a combination of Lafayette, Moraga,
and Orinda, three neighboring bedroom communities), and Piedmont.

In [None]:
def make_lamorinda(df):
    return df.replace({
        'city': {
            'Lafayette': 'Lamorinda',
            'Moraga': 'Lamorinda',
            'Orinda': 'Lamorinda',
        }
    })

sfh = (sfh_df
 .pipe(subset)
 .pipe(log_vals)
 .pipe(clip_br)
 .pipe(compute_ppsf)
 .pipe(make_lamorinda)
)

Let's begin by comparing the distribution of sale price for these cities. 

In [None]:
cities = ['Richmond', 'El Cerrito', 'Albany', 'Berkeley',
          'Walnut Creek', 'Lamorinda', 'Piedmont']

px.box(sfh.query('city in @cities'),
        x='city', y='price', log_y=True,
       width=450, height=250)

The box plots show that Lamorinda and Piedmont tend to have more expensive homes and Richmond has the least
expensive, but there is overlap in sale price for all areas.
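
The ordering seen in the box plots can be summarized numerically with a grouped median. The sketch below uses a handful of made-up sales; the city names come from the text, but the prices are invented for illustration.

```python
import pandas as pd

# Made-up sales; city names come from the text, prices are invented.
toy = pd.DataFrame({
    'city':  ['Richmond', 'Richmond', 'Orinda', 'Moraga', 'Lafayette', 'Piedmont'],
    'price': [300_000, 350_000, 900_000, 800_000, 1_000_000, 1_200_000],
})

# Same relabeling idea as make_lamorinda: three cities become one group.
lamorinda = {'Lafayette': 'Lamorinda', 'Moraga': 'Lamorinda', 'Orinda': 'Lamorinda'}
toy = toy.assign(city=toy['city'].replace(lamorinda))

# Median price per city, sorted from least to most expensive.
medians = toy.groupby('city')['price'].median().sort_values()
print(medians)
```

A table of medians loses the information about overlap that the box plots show, so it complements the plot rather than replacing it.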


Next, we examine the relationship between price per square foot and house size more closely with a scatter plot, one for each city.

In [None]:
four_cities = ['Berkeley', 'Lamorinda', 'Piedmont', 'Richmond']
fig = px.scatter(sfh.query('city in @four_cities'),
           x='bsqft', y='log_ppsf', facet_col='city', facet_col_wrap=2,
           trendline='ols', trendline_color_override="black"
           )
fig.update_layout(xaxis_range=[0,5500], 
                  yaxis_range=[1.5, 3.5], width=450, height=400)
fig.show()

The relationship between price per square foot and building size is roughly
log-linear with a negative association for each of the four locations. While
the lines are not parallel, it does appear that there is a "location" boost for houses,
regardless of size, where, say, a house in Berkeley costs about \$250 more per
square foot than a house in Richmond.  We also see that Piedmont and Lamorinda
are more expensive cities, and in both cities, there is not the same reduction
in price per square foot for larger houses in comparison to smaller ones.
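
To compare trend lines across cities numerically, we can fit a separate least-squares line to each group. The sketch below uses synthetic data for two hypothetical cities, `A` and `B`, and fits the lines with `np.polyfit`; it illustrates the idea and does not reproduce the fits in the plot.

```python
import numpy as np
import pandas as pd

# Synthetic data: two hypothetical cities with the same slope but different
# intercepts on the log10 price-per-square-foot scale.
bsqft = np.tile(np.arange(800, 4001, 400), 2)
city = np.repeat(['A', 'B'], len(bsqft) // 2)
intercept = np.where(city == 'A', 2.6, 2.4)
toy = pd.DataFrame({
    'city': city,
    'bsqft': bsqft,
    'log_ppsf': intercept - 0.0001 * bsqft,
})

def fit_line(group):
    """Least-squares slope and intercept of log_ppsf on bsqft."""
    slope, icpt = np.polyfit(group['bsqft'], group['log_ppsf'], deg=1)
    return pd.Series({'slope': slope, 'intercept': icpt})

fits = toy.groupby('city')[['bsqft', 'log_ppsf']].apply(fit_line)
print(fits)
```

Similar slopes with different intercepts correspond to roughly parallel lines. Keep in mind that a fixed vertical gap on the log10 scale is a fixed ratio on the original scale: a gap of 0.2, as in this toy example, is a factor of about 1.58 in price per square foot.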

In EDA, we often revisit earlier plots to check whether new findings add
insights to previous visualizations. It is important to continually take stock of our findings and use them to guide us in further explorations. Let's summarize our findings from our EDA.

## EDA Discoveries

Our EDA has uncovered several interesting phenomena. Briefly, some of the
most notable are:

- Sale price and building size are highly skewed to the right with one mode.
- Price per square foot decreases nonlinearly with building size, with smaller
  houses costing more per square foot than larger houses, and the price per
  square foot being roughly constant for houses with three or more bedrooms.
- More desirable locations add a bump in sale price that is roughly the same
  size for houses of different sizes.

There are many additional explorations we can (and should) perform, and there are several checks that we should make. These include: investigating the 436 value for lot size and crosschecking unusual houses,
like the 30-bedroom house and the \$20M house, with online real estate apps.

We narrowed our investigation down to one year and later to a few cities. This narrowing helped us control for features that might interfere with simpler relationships. For example, since the data were collected over several years, the date of sale may confound the relationship between sale price and number of bedrooms. At other times, we want to consider the effect of time on prices.
To examine price changes over time we often make line plots, and we adjust for inflation. We revisit these data in {numref}`Chapter %s <ch:viz>` when we consider data scope and look more closely at trends in time.
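
As a rough sketch of what adjusting for inflation involves, the example below divides each sale price by a price index for its year and then summarizes by year. Both the sales and the index values are made up; a real analysis would use a published price index.

```python
import pandas as pd

# Made-up sales and a made-up price index; neither comes from the housing data.
toy = pd.DataFrame({
    'timestamp': pd.to_datetime(['2004-01-11', '2004-01-25', '2005-01-09', '2005-01-23']),
    'price': [400_000, 500_000, 540_000, 560_000],
})
# Hypothetical index: prices in general rose 10% between 2004 and 2005.
price_index = pd.Series({2004: 1.00, 2005: 1.10})

# Express every price in 2004 dollars by dividing by that year's index.
year = toy['timestamp'].dt.year
toy = toy.assign(real_price=toy['price'] / year.map(price_index))

# Median nominal and inflation-adjusted price per year, ready for a line plot.
yearly = toy.groupby(year)[['price', 'real_price']].median()
print(yearly)
```

In this toy example, the nominal median rises from one year to the next, while the adjusted median rises by less, since part of the nominal increase reflects the general rise in prices.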
  
Despite being brief, this section conveys the basic approach of
EDA in action. For an extended case study on a different dataset, see
{numref}`Chapter %s <ch:pa>`.