In [6]:
# Import the libraries and dependencies:
import pandas as pd
from pathlib import Path
import hvplot.pandas
import numpy as np

# Read the national-home-sales.csv file into a DataFrame:
df_home_sales = pd.read_csv(
    Path('national-home-sales.csv'),
    index_col='period_end_date',
    parse_dates=True
)

# Review the DataFrame:
display(df_home_sales.head())
df_home_sales[['inventory', 'homes_sold']].hvplot().opts(yformatter='%.0f')

period_end_date,inventory,homes_sold,median_sale_price
2020-01-01,1250798,377964,289000
2020-02-01,1265253,405992,294000
2020-03-01,1316823,507324,303000
2020-04-01,1297460,436855,304000
2020-05-01,1289500,421351,299000


In [7]:
# IDENTIFYING PATTERNED RELATIONSHIPS
# When analyzing time series, finding seasonal patterns is just one part of the job.
# Another important task is to identify any relationships between time series patterns.
# By doing so, we can better understand the time series behavior and identify predictable relationships.
# Let's examine the time series of home inventory and homes sold from a different perspective.
# To start, recall our first plot of this dataset in the previous lesson.
# After examining both series on the plot, we can infer a relationship between them.
# As the number of homes sold increases, the inventory decreases.
# However, you might want to quantitatively verify this relationship.
# We can use correlation to do so.

In [8]:
# ANALYZING DATA CORRELATIONS
# In statistics, a CORRELATION measures the relationship between two or more variables, whether causal or not.
# We can use Pandas to compute a correlation by using the DataFrame `corr` method as follows:

# Compute the correlation between 'inventory' and 'homes_sold':
df_home_sales_corr = df_home_sales[['inventory', 'homes_sold']].corr()

# Review the correlation between the inventory and homes_sold:
df_home_sales_corr

,inventory,homes_sold
inventory,1.0,-0.006937
homes_sold,-0.006937,1.0


In [9]:
# Notice that the DataFrame has one column and one row for each variable: the inventory and the homes sold.
# The diagonal values are 1.0 because each variable is perfectly correlated with itself.
# Each numerical value indicates the correlation between these variables.
# The value of a correlation ranges from -1 to +1.
# A positive value implies an increasing (or direct) relationship: as one variable increases, the other tends to increase.
# In contrast, a negative value implies a decreasing (or inverse) relationship: as one variable increases, the other tends to decrease.
# The closer the correlation is to either -1 or +1, the stronger the relationship is between the two variables.
# A value close to 0 indicates little or no linear relationship.
# From our correlation, we can observe the following:
# The negative sign is consistent with a decreasing relationship between the inventory and the homes sold (the more homes that are for sale at any time, the fewer homes that actually seem to sell).
# However, at -0.006937 the correlation is almost zero, so it isn't strong.
# This indicates that no predictable linear relationship exists between these two factors.
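# To make the -1 to +1 scale concrete, here is a minimal sketch that uses synthetic data (not the home sales dataset).
# The column names `x`, `direct`, `inverse`, and `noise` are made up for illustration.
# It assumes only NumPy and Pandas, and it uses the same `corr` method as above.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (hypothetical columns, fixed seed).
rng = np.random.default_rng(seed=0)
x = np.arange(1000, dtype=float)

df_demo = pd.DataFrame({
    'x': x,
    'direct': 2 * x + 5,              # rises exactly in step with x
    'inverse': -3 * x + 10,           # falls exactly as x rises
    'noise': rng.normal(size=1000),   # generated independently of x
})

# Pairwise Pearson correlations (the default for DataFrame.corr):
demo_corr = df_demo.corr()

# Review how each column correlates with x:
print(demo_corr['x'])
```

# In this sketch, `direct` correlates with `x` at about +1, `inverse` at about -1, and `noise` at a value close to 0.
# The home sales correlation of -0.006937 resembles the `noise` case.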
# Correlations are helpful.
# However, be aware that a correlation alone doesn't provide enough information to infer a cause-and-effect relationship between two variables.
# Statisticians say that a correlation does not imply causation.
# It means that you can't assume causation from only a correlation value.
# You will need a good deal of information to determine causation between factors, including expertise in the field and extensive testing, which will likely include the ability to control for other related factors.
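# The following sketch illustrates one way that a strong correlation can arise without causation.
# It uses synthetic data in which a hypothetical shared `driver` influences both `a` and `b`, while `a` and `b` never influence each other.
# To "control for" the driver, it removes the part of each variable that a straight-line fit on the driver explains and then correlates what remains (a simple partial correlation).
# All names here are assumptions made for illustration, and this is not a substitute for domain expertise or proper testing.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (hypothetical variables, fixed seed).
rng = np.random.default_rng(seed=1)
n = 1000
driver = rng.normal(size=n)

df_conf = pd.DataFrame({
    'driver': driver,
    'a': 2.0 * driver + rng.normal(scale=0.5, size=n),    # depends on driver only
    'b': -1.5 * driver + rng.normal(scale=0.5, size=n),   # depends on driver only
})

# The raw correlation between a and b is strongly negative:
raw_corr = df_conf['a'].corr(df_conf['b'])


def residuals(y, x):
    """Return y minus its straight-line (least squares) fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Correlate what is left of a and b after removing the driver's effect:
controlled_corr = residuals(df_conf['a'], df_conf['driver']).corr(
    residuals(df_conf['b'], df_conf['driver'])
)

print(f"raw: {raw_corr:.3f}, controlled for driver: {controlled_corr:.3f}")
```

# In this sketch, the raw correlation is strong, but it nearly disappears after controlling for the shared driver.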
# Remember that a correlation evaluates how much two variables move together.
# You can apply correlations to the analysis of stocks, too.
# In the next activity, you'll use the code for computing correlations in the same way.
# But the interpretation will differ a bit.
# That's because going forward, you'll use correlations to identify the relationships between current observations and FUTURE values.
# This differs from identifying the relationships that you've examined so far, which involved variables that were measured at the same time.
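# As a rough sketch of correlating current observations with future values, one option is to shift one series with the Pandas `shift` method before calling `corr`.
# The example below uses synthetic data, and the names `leader` and `follower` are hypothetical.
# The next activity might take a different approach.

```python
import numpy as np
import pandas as pd

# Synthetic daily data for illustration only (hypothetical series, fixed seed).
rng = np.random.default_rng(seed=2)
dates = pd.date_range('2020-01-01', periods=200, freq='D')
leader = pd.Series(rng.normal(size=200), index=dates)

# Each follower value copies the leader's value from the previous day, plus a little noise.
follower = leader.shift(1) + rng.normal(scale=0.1, size=200)

df_lag = pd.DataFrame({'leader': leader, 'follower': follower})

# Correlation between values measured at the same time:
same_time_corr = df_lag['leader'].corr(df_lag['follower'])

# Align each current leader value with the follower's NEXT value.
# shift(-1) moves every value one row earlier; corr skips the resulting NaN rows.
df_lag['follower_next'] = df_lag['follower'].shift(-1)
future_corr = df_lag['leader'].corr(df_lag['follower_next'])

print(f"same time: {same_time_corr:.3f}, current vs. next: {future_corr:.3f}")
```

# In this sketch, the same-time correlation is weak, while the correlation between the current `leader` value and the next `follower` value is strong.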

In [None]:
# ON THE JOB
# Correlation is a useful statistical tool, because it can indicate a predictive relationship that we can capitalize on in practice.
# For example, a natural gas supplier might deliver more gas on cold days based on the correlation between gas consumption and weather.
# This is because extreme weather might cause people to consume more gas for heating.