# Lab 3: Data Visualization

**Data Science Bootcamp with Python, Pandas, and Plotly**

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

# Sam's preferred styles
pio.templates["sam"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+sam"

In [3]:
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)

## Example: Sale Prices for Houses

Let's carry out an exploratory analysis with a focus on visualization. 

Although EDA typically begins in the data
wrangling stage, for demonstration purposes the data we work with here
have already been cleaned so that we can focus on visualization.

**Scope** Our data were scraped from the San Francisco Chronicle (SFChron) website.
They form a complete list of homes sold in the area from April 2003 to December 2008.  
Since we have no plans to generalize our findings beyond this time period and location, we are working with a census. In terms of scope, the population matches the access frame, and the sample consists of the entire population.

**Granularity** As for granularity, each record represents a sale of a home in the SF Bay Area
during the specified time period.  This means that if a home was sold
twice during this time, then there are two records in the table. And, if a
home in the Bay Area was not up for sale during this time, then it does not appear in the dataset.

We read the data into the data frame `sfh`.

In [4]:
def data(csv):
    return f'https://github.com/DS-100/oreilly-bootcamp/blob/main/data/{csv}?raw=true'

sfh_path = data('sfh_2004_EB.csv')
sfh = pd.read_csv(sfh_path)
sfh

The dataset does not have an accompanying codebook, but we
can determine the features and their storage types by inspection.

In [5]:
sfh.info()

Based on the names of the fields, we expect the primary key to consist of some
combination of city, zip, street address, and date.
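
One way to test a candidate key is to count the rows that share the same combination of values. The sketch below uses a tiny made-up table rather than `sfh`, and the column names `city`, `zip`, `street`, and `date` are assumptions based on the description above, not confirmed field names.

```python
import pandas as pd

# Toy records (NOT the sfh data); column names are hypothetical stand-ins
toy = pd.DataFrame({
    'city':   ['Berkeley', 'Berkeley', 'Albany', 'Albany'],
    'zip':    [94703, 94703, 94706, 94706],
    'street': ['1 Oak St', '1 Oak St', '2 Elm St', '2 Elm St'],
    'date':   ['2004-01-05', '2004-09-12', '2004-03-01', '2004-03-01'],
})

key = ['city', 'zip', 'street', 'date']
# keep=False marks every row that shares its key values with another row
n_dup_rows = int(toy.duplicated(subset=key, keep=False).sum())
is_key = n_dup_rows == 0  # True only if no two rows share the same key values
```

In this toy table the two Albany rows are identical on all four columns, so the combination fails as a key. A nonzero count on the real data would suggest that another field is needed or that some records are duplicated.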

Sale price is our focus. So let's begin by exploring the distribution of sale price.
To develop your intuition about distributions, make a guess about the shape of the distribution
before you start reading the next section. Don't worry about the range of prices, just sketch the general shape.

### Understanding Price

A good guess is that the distribution of sale price is highly skewed to the right, with a few very expensive houses. 

Let's create a histogram of sale price.

In [6]:
...

**Describe the Shape**

+ Modes
+ Skew
+ Tails
+ Gaps and Outliers

In [13]:
percs = [95, 97, 98, 99, 99.5, 99.9]
# NumPy >= 1.22 names this argument `method`; older versions call it `interpolation`
prices = np.percentile(sfh['price'], percs, method='lower')
pd.DataFrame({'price': prices}, index=percs)

We see that 99.9\% of the houses sold for under \$2M. In fact, in the original dataset, some of the sale prices are as high as \$20M. Since such sales are rare, and we are interested in the more typical sale prices, we dropped records with a sale price over \$4M.
The super-expensive houses may also behave quite differently from the rest of the market, which is another reason to limit our analysis to sale prices under \$4M.

This subset  still contains large and expensive houses, but not outrageously so.
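
As a minimal sketch of this kind of trimming, the code below filters a synthetic set of right-skewed prices at the \$4M threshold and reports the share of records dropped. The data are simulated and are not the `sfh` prices; only the column name `price` and the cutoff come from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed prices (NOT the sfh data)
toy = pd.DataFrame({'price': rng.lognormal(mean=13.0, sigma=0.6, size=10_000)})

cutoff = 4_000_000  # the $4M threshold mentioned in the text
trimmed = toy[toy['price'] <= cutoff]

# Fraction of records removed by the cutoff
share_dropped = 1 - len(trimmed) / len(toy)
```

Checking `share_dropped` before committing to a cutoff is a quick way to confirm that only a small tail of records is being set aside.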

Even without the top 0.1%, the distribution remains highly skewed to the right, with a single mode around \$500k.
Let's plot the histogram of the log-transformed sale price. The logarithm transformation often does a good job at converting a right-skewed distribution into one that is more symmetric. 

In [7]:
# Make a histogram using the log of price rather than price itself
...

We see that the distribution of log-transformed sale price is roughly symmetric. 
Now that we have an understanding of the distribution of sale price, let's
consider the so-what questions posed in the previous section on EDA guidelines.

### Why use the log transformation?


+ Shape matters because models and statistics based on symmetric distributions tend to have more robust and stable properties than highly skewed distributions. For this reason, we primarily work with the log-transformed sale price.

+ Also, it's easier to see the left side of the distribution with the log-transformed values. It would be hard to see small secondary modes or gaps without this transformation. 
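
The effect on shape can also be checked numerically. The sketch below compares the sample skewness before and after a log transformation, using simulated log-normal values as a stand-in for sale prices (an assumption made for illustration; these are not the `sfh` data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "prices" (NOT the sfh data)
price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5_000))

skew_raw = price.skew()            # clearly positive for right-skewed data
skew_log = np.log10(price).skew()  # near zero when the log values are symmetric
```

A skewness near zero after the transformation matches what a roughly symmetric histogram shows visually.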

### What about temporality?

The housing market rose rapidly during this time and then the bottom fell out of the market. 
So the distribution of sale price in, say, 2004, might be quite different
from that in 2008, right before the crash. To explore this notion further, we can examine
the behavior of prices over time. 

Alternatively, we can fix time, and examine the
relationships between price and the other features of interest. 
Both approaches are potentially worthwhile.

We narrow our focus to one year. We have reduced the data to sales made in 2004, so rising prices should have a limited impact on the distributions and relationships that we examine. 
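
Fixing time amounts to filtering on the sale date. The sketch below shows one way to do this on a toy table; the column name `date` and the values are hypothetical, and the `sfh` data used in this lab have already been reduced to 2004.

```python
import pandas as pd

# Toy records (NOT the sfh data); `date` is a hypothetical column name
toy = pd.DataFrame({
    'date':  ['2003-06-15', '2004-02-01', '2004-11-30', '2005-01-10'],
    'price': [400_000, 450_000, 520_000, 560_000],
})

toy['date'] = pd.to_datetime(toy['date'])       # parse strings into timestamps
sales_2004 = toy[toy['date'].dt.year == 2004]   # keep only sales made in 2004
```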

### Examining other features

In addition to the sale price, which is our main focus, a few other features that might be important to our investigation are the size of the house, lot (or property)
size, and number of bedrooms. We explore the distributions of these features
and their relationship to sale price and to each other.

**Make a guess about the shape of the distribution of the building size.**

Since the size of the property is likely related to its price, it seems
reasonable to guess that these features are also skewed to the right.

In [9]:
# Make a histogram of house square footages
...

In [10]:
# Make a histogram of log house square footages
...

The distribution is unimodal with a peak at about 1500 ft^2, and
many houses are over 2,500 ft^2 in size. 
We have confirmed our intuition: the log-transformed building size is
nearly symmetric, although it maintains a slight skew to the right. The same is the case for
the distribution of lot size. 

### Relationship between building and lot size

**What do you think it might look like?**

Given both house and lot size have skewed distributions, a scatter plot of the two should most likely be on log scale too. We compare the scatter plot with and without the log-transformation below. 

In [13]:
# Make a scatter plot, encoding bsqft on the x-axis and lsqft on the y-axis
...

In [15]:
# Make a scatter plot, encoding log_bsqft on the x-axis and log_lsqft on the y-axis
...

The scatter plot in the original
units is very difficult to read because most of the points are crowded into the bottom of the plotting region. 

On the other hand, the scatter plot on the log scale reveals a few interesting features:

+ There is a horizontal line along the bottom of the scatter plot, where it appears that many houses have the same lot size no matter the building size;
+ There is a collection of houses with very large lots, which we should look into;
+ There is possibly a slight positive linear association between lot and building size (on the log-log scale).


Let's look at some lower quantiles of lot size to try to figure out the unusually small value for the lot size.

In [16]:
percs = [0.5, 1, 1.5, 2, 2.5, 3]
# NumPy >= 1.22 names this argument `method`; older versions call it `interpolation`
lots = np.percentile(sfh['lsqft'].dropna(), percs, method='lower')
pd.DataFrame({'lot_size': lots}, index=percs)

We found something interesting! About 2.5% of the houses have a lot size of 436
ft^2. This is tiny and makes little sense so we make a note of the anomaly for further investigation.
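
A repeated value like this can also be surfaced by counting how often each value occurs. The sketch below uses a handful of made-up lot sizes (not the `sfh` data) in which 436 is repeated on purpose.

```python
import pandas as pd

# Toy lot sizes in ft^2 (NOT the sfh data); 436 is repeated on purpose
lsqft = pd.Series([436, 436, 436, 2500, 3100, 4000, 5200, None, 6000, 7500])

counts = lsqft.value_counts()      # sorted by count, missing values excluded
most_common = counts.index[0]      # the value that occurs most often
share = counts.iloc[0] / lsqft.notna().sum()  # its share of non-missing rows
```

For a continuous measurement such as lot size, a single value that accounts for a noticeable share of records is a hint that it may be a placeholder or a data entry artifact.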

### Distribution of Number of Bedrooms 

Another measure of house size is the number of bedrooms.

**What do you think it might look like?**

Houses in the Bay Area tend to be on the smaller side, so we venture to guess
that the distribution will have a peak at three and be skewed to the right, with a
few houses having 5 or 6 bedrooms.

Since the number of bedrooms is a discrete quantitative variable, we can treat it as a qualitative feature and make a bar plot.

In [17]:
# Make a bar plot, encoding number of bedrooms on the x-axis and
# their counts on the y-axis
...

The bar plot confirms that we generally had the right idea. However, we find that there are some houses with over 30 bedrooms! That's a bit hard to believe and points to another possible data quality problem. Since the records include the addresses of the houses, we can double-check these values on a real estate app. 

In the meantime, let's transform the number of bedrooms into an ordinal feature by
reassigning all values of 8 or more to a single 8+ category. Then we recreate the bar plot with the
transformed data. 

In [19]:
# We already ran this code to create the new_br column
#eight_up = sfh.loc[sfh['br'] >= 8, 'br'].unique()
#sfh['new_br'] = sfh['br'].replace(eight_up, 8)

# Make a bar plot, encoding new_br on the x-axis and counts on the y-axis
...

We can see that even after lumping all of the houses with 8+ bedrooms together, they do not amount to many. The distribution is nearly symmetric with a peak at 3: nearly the same proportion of houses have
2 or 4 bedrooms, and nearly the same have 1 or 5. The asymmetry comes from
the few houses with 6 or more bedrooms.
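
Lumping the large values into one category can be sketched in a single step with `Series.clip`. This is an alternative to the `replace` approach in the commented code above, shown on made-up bedroom counts rather than the `sfh` data.

```python
import pandas as pd

# Toy bedroom counts (NOT the sfh data)
br = pd.Series([1, 2, 3, 3, 3, 4, 5, 8, 12, 32])

new_br = br.clip(upper=8)                    # 8, 12, and 32 all become 8
counts = new_br.value_counts().sort_index()  # bar heights in bedroom order
```

Sorting by index keeps the bars in bedroom order rather than in order of frequency, which matters for an ordinal feature.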

Now, we examine the relationship between the number of bedrooms and sale price and between building size and sale price.

## Delving Deeper into Relationships with Price

Let's begin by examining how the distribution of price changes for houses with
different numbers of bedrooms. We can do this with box plots.

In [21]:
# Make a box plot, encoding new_br on the x-axis and price on the y-axis.
# Use a log scale for the y-axis
...

The median sale price increases with the number of bedrooms from 1
to 5 bedrooms, but for the largest houses (those with 6, 7, and 8+ bedrooms),
the distribution of log-transformed sale price appears nearly the same.


### Price per square-foot

We would expect that houses with one bedroom are smaller than houses with, say,
4 bedrooms. We might also guess that houses with 6 or more bedrooms are similar
in size. 

Let's consider a transformation that divides price by 
building size to give us the price per square foot. 

$$ \frac{\text{price}} {\text{building size (ft}^2)} $$

Do you think that this feature is constant for all houses? In other words, is price
primarily determined by size?
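
The transformation itself is a single column operation. The sketch below applies it to three made-up records; the column names `price`, `bsqft`, and `ppsf` follow the names used elsewhere in this lab, but the values are invented.

```python
import pandas as pd

# Toy records (NOT the sfh data)
toy = pd.DataFrame({
    'price': [450_000, 600_000, 900_000],
    'bsqft': [900, 1500, 3000],
})

# Price per square foot, in dollars per ft^2
toy['ppsf'] = toy['price'] / toy['bsqft']
```

In this invented example the smallest house has the highest price per square foot, so the feature is not constant even though price rises with size.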

Let's first look at how price per square foot relates to the number of bedrooms.


In [22]:
# Make a box plot, encoding new_br on the x-axis and price per square foot
# on the y-axis. Use a log scale for the y-axis
...

As we might expect, the price per square foot decreases with the number of bedrooms. The smallest houses are the most expensive per square foot. This observation is consistent with the notion of a cost to enter the market. 

### Price and Size

Next let's look at the relationship between price and the size of the house. 
Below we'll add a smooth curve to the scatter plot that shows the average price for houses of the same (or similar) size. 

**What does the trend line tell you about the relationship between price and building size?**

In [25]:
# Make a scatter plot, encoding log_bsqft on the
# x-axis and log_price on the y-axis. Add the lowess trend line as a black line
# on top of the scatter plot.
...

Some observations:

+ larger houses cost more - not a big surprise
+ the relationship is roughly linear on a log-log scale
+ there's a minimum price to enter into the market 
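
The idea behind the smooth curve, averaging price over houses of similar size, can be approximated with binned means. Note that this is a simpler stand-in and not the lowess method itself. The sketch runs on simulated data with a linear log-log relationship, which is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic data (NOT the sfh data): log price rises linearly with log size
log_bsqft = rng.uniform(2.8, 3.6, size=2_000)
log_price = 3.5 + 0.7 * log_bsqft + rng.normal(0, 0.1, size=2_000)
toy = pd.DataFrame({'log_bsqft': log_bsqft, 'log_price': log_price})

# Group houses into 8 equal-width size bins and average log price in each
bins = pd.cut(toy['log_bsqft'], bins=8)
trend = toy.groupby(bins, observed=True)['log_price'].mean()
```

Lowess replaces the hard bin edges with overlapping, weighted neighborhoods, which gives a smoother curve, but the binned means convey the same basic trend.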

## Location, Location, Location!

You may have heard the expression:
There are three things that matter in real estate: *location, location, location.*
Comparing price across cities might bring additional insights to our investigation.

So far we haven't considered the relationship between prices and location. There are house
sales from over 150 different cities in the original dataset. Some cities have a
handful of sales and others have thousands. We have narrowed down the data
to a few cities in the SF East Bay: Richmond, El Cerrito, Albany,
Berkeley, Walnut Creek, Lamorinda (which is a combination of Lafayette, Moraga,
and Orinda, three neighboring bedroom communities), and Piedmont.

The city feature is a nominal feature. How do we examine the relationship between a quantitative feature like price and a nominal feature, like city?  

In [26]:
cities = ['Richmond', 'El Cerrito', 'Albany', 'Berkeley',
          'Walnut Creek', 'Lamorinda', 'Piedmont']

# Make a box plot, encoding city on the x-axis and price on the y-axis.
# Only plot the cities that are present in the cities variable above.
...

The box plots show that Lamorinda and Piedmont tend to have more expensive homes and Richmond has the least
expensive, but there is overlap in sale price for many cities.
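
A numeric companion to the box plots is a per-city summary. The sketch below filters a toy table to a list of cities with `isin` and then computes the median price for each; the records and prices are made up and are not the `sfh` data.

```python
import pandas as pd

# Toy records (NOT the sfh data); prices are made up
toy = pd.DataFrame({
    'city':  ['Richmond', 'Richmond', 'Berkeley', 'Berkeley',
              'Piedmont', 'Piedmont', 'Oakland'],
    'price': [300_000, 350_000, 600_000, 700_000,
              1_200_000, 1_400_000, 500_000],
})
cities = ['Richmond', 'Berkeley', 'Piedmont']

subset = toy[toy['city'].isin(cities)]  # drops the Oakland row
# Median price per city, ordered from least to most expensive
medians = subset.groupby('city')['price'].median().sort_values()
```

Ordering the cities by median price is also a reasonable way to order the boxes in the plot, since a nominal feature has no natural order of its own.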

## Three Features

We can examine the relationship between price per square foot and house size and compare this relationship across four of these cities.

In [27]:
four_cities = ['Berkeley', 'Lamorinda', 'Piedmont', 'Richmond']

# Make one scatter plot for each city in four_cities.
# Encode bsqft on the x-axis and ppsf on the y-axis. Add the lowess trend line
# to each plot.
...

The relationship between price-per-square-foot and building size is nearly flat for two of the cities (Piedmont and Lamorinda). 
The other two cities show the market entry dip we saw earlier.

We also see that a house in Berkeley costs about \$100 more per
square foot than a house in Richmond, regardless of size. Piedmont and Lamorinda are more expensive cities, with houses costing about \$500 per square foot. 

These plots support the location-location-location saying.

In EDA, we often revisit earlier plots to check whether new findings add
insight to previous visualizations. It is important to continually take stock of our findings and use them to guide us in further explorations. Let's summarize our findings so far. 

## EDA discoveries

Our EDA has uncovered several interesting phenomena. Briefly, some of the
most notable are:

- Sale price and building size are highly skewed to the right with one mode. A log-transformation gives a more symmetric distribution, but there's still some skew. 
- Price per square foot decreases nonlinearly with building size, with smaller
  houses costing more per square foot than larger houses, and the price per
  square foot being roughly constant for houses with three or more bedrooms.
- More desirable locations add a bump in sale price that is roughly the same
  size for houses of different sizes.

There are many additional explorations we can (and should) perform, and there are several checks that we should make. These include investigating the 436 value for lot size and cross-checking unusual houses,
like the 30-bedroom house and the \$20M house, with online real estate apps.

We narrowed our investigation down to one year, a few cities, and sale prices under \$4M. This narrowing helped us control for features that might interfere with finding simpler relationships. For example, since the data were collected over several years, the date of sale may confound the relationship between sale price and number of bedrooms. At other times, we want to consider the effect of time on prices.