# Welcome to the Lab 🥼🧪
## Correlation analysis of institutional ownership vs. price appreciation

**Note** This notebook will work with any of the 70k+ markets in the API

As a reminder, you can get your Parcl Labs API key [here](https://dashboard.parcllabs.com/signup) to follow along.

To run this immediately, you can use Google Colab. Remember, you must set your `PARCL_LABS_API_KEY` as a secret. See this [guide](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75) for more information.

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ParclLabs/parcllabs-examples/blob/main/python/inspiration/corr_analysis_institutional_vs_price_appreciation.ipynb)

In [2]:
import os
import sys
import json
import subprocess
from datetime import datetime
from urllib.request import urlopen

# Collab setup from one click above
if "google.colab" in sys.modules:
    from google.colab import userdata
    %pip install parcllabs plotly kaleido
    api_key = userdata.get('PARCL_LABS_API_KEY')
else:
    api_key = os.getenv('PARCL_LABS_API_KEY')

In [3]:
import parcllabs
import pandas as pd
import plotly.express as px
from parcllabs import ParclLabsClient

print(f"Parcl Labs Version: {parcllabs.__version__}")

Parcl Labs Version: 0.2.2


In [4]:
# Initialize the Parcl Labs client
client = ParclLabsClient(api_key)

In [None]:
market_df = pd.read_csv('atl_zips.csv')
market_df.columns = [col.lower() for col in market_df.columns]
parcl_ids = market_df['parcl_id'].tolist()

In [248]:
# lets get all US markets currently available to trade on the Parcl Exchange
# Now lets say you want all price feed markets that are on the parcl exchange
market_df = client.search_markets.retrieve(
    location_type='ZIP5',
    state_abbreviation='WA',
    sort_by='TOTAL_POPULATION',
    sort_order='DESC',
    as_dataframe=True,
    # query='Tampa',
    params={'limit': 1000},  # expand the default limit to 14, as of this writing, 14 markets are available
)

# lets store the parcl_ids of the markets we are interested in
parcl_ids = market_df['parcl_id'].tolist()
len(parcl_ids)

584

In [249]:
# get stock
stock = client.market_metrics_housing_stock.retrieve_many(
    parcl_ids=parcl_ids,
    as_dataframe=True,
    params={
        'limit': 1
    }
)

stock.head()

|████████████████████████████████████████| 584/584 [100%] in 1:16.3 (7.65/s) 


Unnamed: 0,date,single_family,condo,townhouse,other,all_properties,parcl_id
0,2024-04-01,19553,508,701,4935,25697,5445836
1,2024-04-01,12555,11648,954,1175,26332,5446039
2,2024-04-01,16149,7771,598,1575,26093,5445883
3,2024-04-01,17545,2113,134,2889,22681,5445884
4,2024-04-01,13090,3531,326,2133,19080,5446082


In [250]:
# filter to at least 80% sfh's, at least 10000 units
stock['pct_sfh'] = stock['single_family']/stock['all_properties']
use = stock.loc[(stock['pct_sfh'] > 0.51) & (stock['single_family'] > 1000)]
# use = stock
use.shape

(277, 8)

In [251]:
# get single family home prices
sfh_prices = client.market_metrics_housing_event_prices.retrieve_many(
    parcl_ids=parcl_ids,
    property_type='SINGLE_FAMILY',
    as_dataframe=True,
    params={
        'limit': 300
    }
)

sfh_prices.head()

|████████████████████████████████████████| 584/584 [100%] in 1:47.1 (5.45/s) 



The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.



Unnamed: 0,date,price_median_sales,price_median_new_listings_for_sale,price_median_new_rental_listings,price_standard_deviation_sales,price_standard_deviation_new_listings_for_sale,price_standard_deviation_new_rental_listings,price_percentile_20th_sales,price_percentile_20th_new_listings_for_sale,price_percentile_20th_new_rental_listings,...,price_per_square_foot_standard_deviation_sales,price_per_square_foot_standard_deviation_new_listings_for_sale,price_per_square_foot_standard_deviation_new_rental_listings,price_per_square_foot_percentile_20th_sales,price_per_square_foot_percentile_20th_new_listings_for_sale,price_per_square_foot_percentile_20th_new_rental_listings,price_per_square_foot_percentile_80th_sales,price_per_square_foot_percentile_80th_new_listings_for_sale,price_per_square_foot_percentile_80th_new_rental_listings,parcl_id
0,2024-04-01,419250,435000,1975,37594,71876,198,384140,403980,1895,...,48.19,30.14,0.25,196.91,212.42,1.23,283.44,278.66,1.63,5445836
1,2024-03-01,417000,435000,1995,43917,60562,168,370000,410000,1950,...,50.13,49.01,0.23,211.48,203.4,1.37,292.03,286.7,1.6,5445836
2,2024-02-01,401925,429000,2123,41728,37544,287,375000,395000,1945,...,51.13,40.06,0.08,207.14,214.11,1.41,302.74,294.31,1.5,5445836
3,2024-01-01,399874,457375,1950,50825,53500,180,350000,410000,1750,...,46.05,43.31,0.29,182.32,227.69,1.29,281.07,298.84,1.69,5445836
4,2023-12-01,394750,475000,1850,62188,79040,240,331920,384503,1710,...,42.02,43.61,0.17,235.88,210.25,1.29,286.44,275.1,1.58,5445836


In [252]:
# get percent of homes owned by large portfolios
# get single family home prices
own = client.portfolio_metrics_sf_housing_stock_ownership.retrieve_many(
    parcl_ids=parcl_ids,
    as_dataframe=True,
    params={
        'limit': 1
    }
)

own.head()

|████████████████████████████████████▌⚠︎  | (!) 533/584 [91%] in 1:14.7 (7.14/s) 



The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.



Unnamed: 0,date,count_portfolio_2_to_9,count_portfolio_10_to_99,count_portfolio_100_to_999,count_portfolio_1000_plus,count_all_portfolios,pct_sf_housing_stock_portfolio_2_to_9,pct_sf_housing_stock_portfolio_10_to_99,pct_sf_housing_stock_portfolio_100_to_999,pct_sf_housing_stock_portfolio_1000_plus,pct_sf_housing_stock_all_portfolios,parcl_id
0,2024-04-01,1190,76,1.0,,1267,6.09,0.39,0.01,,6.48,5445836
1,2024-04-01,844,27,1.0,4.0,876,6.72,0.22,0.01,0.03,6.98,5446039
2,2024-04-01,870,58,1.0,45.0,974,5.39,0.36,0.01,0.28,6.03,5445883
3,2024-04-01,1374,122,2.0,84.0,1582,7.83,0.7,0.01,0.48,9.02,5445884
4,2024-04-01,730,19,,62.0,811,5.58,0.15,,0.47,6.2,5446082


In [253]:
# prepare percent change analysis
# get the first four months of prices for 2020 and 2024
twenty = ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04']
twenty_four = ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']

price_col = 'price_median_sales'

current = sfh_prices[['date', price_col, 'parcl_id']].loc[sfh_prices['date'].isin(twenty_four)].groupby('parcl_id')[price_col].median().reset_index()
bottom = sfh_prices[['date', price_col, 'parcl_id']].loc[sfh_prices['date'].isin(twenty)].groupby('parcl_id')[price_col].median().reset_index()
# current = sfh_prices[['date', 'price_median_sales', 'parcl_id']].loc[sfh_prices['date']=='2024-04-01']
# bottom = sfh_prices.loc[sfh_prices['date']=='2020-04-01'][['parcl_id', 'price_median_sales']]
current = current.loc[current[price_col].notnull()]
bottom = bottom.loc[bottom[price_col].notnull()]

bottom = bottom.rename(columns={price_col: 'start_price'})
current = current.rename(columns={price_col: 'end_price'})
current = current.merge(bottom, on='parcl_id')
current.head()


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of empty slice


Mean of e

Unnamed: 0,parcl_id,end_price,start_price
0,5277027,469958.0,376750.0
1,5277029,381498.0,299990.0
2,5277032,492300.0,310750.0
3,5277037,1300000.0,1212500.0
4,5277039,460000.0,360200.0


In [254]:
own['parcl_id'].nunique()

533

In [255]:
current['pct_delta'] = (current['end_price'] - current['start_price']) / current['start_price']
current = current.merge(own[['parcl_id', 'pct_sf_housing_stock_portfolio_1000_plus']], on='parcl_id')
current['pct_sf_housing_stock_portfolio_1000_plus'] = current['pct_sf_housing_stock_portfolio_1000_plus']/100
current = current.merge(market_df[['parcl_id', 'name']], on='parcl_id')


In [256]:
current.loc[current['pct_sf_housing_stock_portfolio_1000_plus'].isnull()].shape

(77, 6)

In [257]:
current['pct_sf_housing_stock_portfolio_1000_plus'] = current['pct_sf_housing_stock_portfolio_1000_plus'].fillna(0)


In [258]:
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr

# Assuming 'current' and 'use' DataFrames are defined and loaded with appropriate data
current = current.loc[(current['pct_delta'] < 1) & (current['pct_delta']>0)]
current = current.loc[current['parcl_id'].isin(use['parcl_id'].unique().tolist())]

# Fill null values in 'pct_sf_housing_stock_portfolio_1000_plus' with 0
current['pct_sf_housing_stock_portfolio_1000_plus'] = current['pct_sf_housing_stock_portfolio_1000_plus'].fillna(0)

# Set charting constants
labs_logo_lookup = {
    'blue': 'https://parcllabs-assets.s3.amazonaws.com/powered-by-parcllabs-api.png',
    'white': 'https://parcllabs-assets.s3.amazonaws.com/powered-by-parcllabs-api-logo-white+(1).svg'
}

labs_logo_dict = dict(
    source=labs_logo_lookup['white'],
    xref="paper",
    yref="paper",
    x=0.5,  # Centering the logo below the title
    y=1.01,  # Adjust this value to position the logo just below the title
    sizex=0.15, 
    sizey=0.15,
    xanchor="center",
    yanchor="bottom"
)

# Create the scatter plot
fig = px.scatter(current, x='pct_sf_housing_stock_portfolio_1000_plus', y='pct_delta', text='name')

# Calculate the line of best fit
X = current[['pct_sf_housing_stock_portfolio_1000_plus']]
y = current['pct_delta']
model = LinearRegression().fit(X, y)
current['line_of_best_fit'] = model.predict(X)

# Calculate the correlation coefficient
corr_coef, _ = pearsonr(current['pct_sf_housing_stock_portfolio_1000_plus'], current['pct_delta'])

# Add line of best fit to the plot
fig.add_trace(go.Scatter(
    x=current['pct_sf_housing_stock_portfolio_1000_plus'],
    y=current['line_of_best_fit'],
    mode='lines',
    name='Line of Best Fit',
    line=dict(color='red', dash='dash', width=1),
    opacity=0.7,
    showlegend=False
))

# Annotate the plot with the correlation coefficient
fig.add_annotation(
    x=0.5,
    y=1.1,
    xref='paper',
    yref='paper',
    text=f'<i>Correlation Coefficient: <b>{corr_coef:.2f}</b></i>',
    showarrow=False,
    font=dict(size=14, color='white')
)

# Update layout to show text labels
fig.update_traces(textposition='top center')

# Customize layout
HEIGHT = 900
WIDTH = 1600

fig.update_layout(
    height=HEIGHT,
    width=WIDTH,
    title={
        'text': 'Washington State: Relationship b/w SFH Institutional Ownership and Price Appreciation by Zip Code',
        'y': 0.99,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': dict(size=28, color='#FFFFFF'),
    },
    plot_bgcolor='#000000',  # Dark background for better contrast
    paper_bgcolor='#000000',  # Dark background for the paper
    font=dict(color='#FFFFFF'),
    xaxis=dict(
        title_text='% of Single Family Homes Owned by Institutional Portfolios (1000+ Units)',
        showgrid=False,  # Disable vertical grid lines
        tickangle=-45,
        tickfont=dict(size=14),
        linecolor='rgba(255, 255, 255, 0.7)',  # Axis line color with opacity
        linewidth=1,  # Axis line width
        tickformat='.0%'  # Format ticks as percentages
    ),
    yaxis=dict(
        title_text='% Single Family Home Price Growth Since Q1, `20',
        showgrid=True,
        gridwidth=0.5,  # Horizontal grid line width
        gridcolor='rgba(255, 255, 255, 0.2)',  # Horizontal grid line color with opacity
        tickfont=dict(size=14),
        tickprefix='',  # Add dollar sign to y-axis labels
        zeroline=False,
        linecolor='rgba(255, 255, 255, 0.7)',  # Axis line color with opacity
        linewidth=1,  # Axis line width
        tickformat='.0%'  # Format ticks as percentages
    ),
    hovermode='x unified',  # Unified hover mode for better interactivity
    hoverlabel=dict(
        bgcolor='#1F1F1F',
        font_size=14,
        font_family="Rockwell"
    ),
    margin=dict(l=10, r=10, t=120, b=10)  # Increased top margin to accommodate the logo
)

# Add logo
fig.add_layout_image(labs_logo_dict)
fig.write_image(os.path.join('../graphics/wa_state_corr_coef_zips.png'), width=WIDTH, height=HEIGHT)

# Show plot
fig.show()


In [259]:
# 81 zips
# Corr Coef: 0.34

# GA State
# 267 zips
# Corr Coef: 0.2

# WA State
# 209 zips
# Corr Coef: -0.06
current.shape

(209, 7)

In [260]:
import numpy as np
from scipy.stats import norm

# Given correlation coefficients and sample sizes
r1 = -0.06  # correlation coefficient for all ZIP codes in Georgia
r2 = 0.34  # correlation coefficient for ZIP codes in Atlanta metro area
N1 = 209  # sample size for all ZIP codes in Georgia
N2 = 81  # sample size for ZIP codes in Atlanta metro area

# Fisher Z transformation
Z1 = 0.5 * np.log((1 + r1) / (1 - r1))
Z2 = 0.5 * np.log((1 + r2) / (1 - r2))

# Standard error
SE = np.sqrt(1/(N1 - 3) + 1/(N2 - 3))

# Z-score for the difference
Z = (Z1 - Z2) / SE

# p-value from the Z-score
p_value = 2 * (1 - norm.cdf(abs(Z)))

# Output the results
print(f"Z1: {Z1:.4f}, Z2: {Z2:.4f}")
print(f"Standard Error: {SE:.4f}")
print(f"Z-score: {Z:.4f}")
print(f"P-value: {p_value:.4f}")

# Check if the result is significant
alpha = 0.05
if p_value < alpha:
    print("The difference in correlation coefficients is statistically significant.")
else:
    print("The difference in correlation coefficients is not statistically significant.")


Z1: -0.0601, Z2: 0.3541
Standard Error: 0.1329
Z-score: -3.1153
P-value: 0.0018
The difference in correlation coefficients is statistically significant.


In [124]:
current[['pct_delta', 'pct_sf_housing_stock_portfolio_1000_plus']].corr()

Unnamed: 0,pct_delta,pct_sf_housing_stock_portfolio_1000_plus
pct_delta,1.0,0.203869
pct_sf_housing_stock_portfolio_1000_plus,0.203869,1.0


In [None]:
for pid in listings_long.sort_values('name')['parcl_id'].unique():
    data = listings_long.loc[listings_long['parcl_id'] == pid].sort_values('date')
    sum_data = cnts.loc[cnts['parcl_id']==pid]
    name = data['name'].iloc[0].replace('Kings County', 'Brooklyn County').replace('Washington City', 'Washington, DC')
    build_chart(name, data, save_graphic=True)

    april_new_listings = sum_data['new_listings_for_sale'].values[0]
    april_sales = sum_data['sales'].values[0]
    delta = sum_data['delta'].values[0]
    pct_change_yoy = sum_data['New Listings YoY % Change (30 Day Rolling)'].values[0]
    print(name)
    print(f"April New Listings: {april_new_listings}")
    print(f"April Sales: {april_sales}")
    inc_decr = 'Increase' if delta > 0 else 'Decrease'
    print(f"Delta: {delta} Total Supply {inc_decr}")
    print(f"New Listings YoY % Change (30 Day Rolling): {pct_change_yoy:.02%}\n")
    print('Trade today on: @parcl')
    # create row

In [None]:
cnts.head()

output = cnts[['name', 'new_listings_for_sale', 'sales', 'delta', 'New Listings YoY % Change (30 Day Rolling)']]
output = output.rename(columns={
    'new_listings_for_sale': 'April `24 New Listings for Sale',
    'sales': 'April `24 Sales',
    'delta': 'April `24 Supply Delta',
    'New Listings YoY % Change (30 Day Rolling)': 'New Listings YoY % Change (30 Day Rolling)'
})

output.head()

In [None]:
import plotly.graph_objects as go
import pandas as pd

# Assuming the DataFrame `df` is already defined and populated with the relevant data
df = output[['name', 'April `24 New Listings for Sale', 'April `24 Sales', 'April `24 Supply Delta', 'New Listings YoY % Change (30 Day Rolling)']]
df['New Listings YoY % Change (30 Day Rolling)'] = df['New Listings YoY % Change (30 Day Rolling)']*100

# Function to format the supply delta with a plus sign if positive
def format_supply_delta(value):
    return f"+{value}" if value > 0 else str(value)

# Function to format the percentage with 2 decimal places
def format_percentage(value):
    return f"{value:.2f}%"

# Format the supply delta and percentage change columns
df['April `24 Supply Delta'] = df['April `24 Supply Delta'].apply(format_supply_delta)
df['New Listings YoY % Change (30 Day Rolling)'] = df['New Listings YoY % Change (30 Day Rolling)'].apply(format_percentage)

# Prepare data for the table
formatted_data = [df[col].tolist() for col in df.columns]

# Define headers and table layout
column_headers = ['<b>Market</b>', '<b>April `24 New Listings for Sale</b>', '<b>April `24 Sales</b>', '<b>April `24 Supply Delta</b>', '<b>New Listings YoY % Change (30 Day)</b>']

fig = go.Figure(data=[go.Table(
    header=dict(values=column_headers,
                fill_color='#000000',  # Black for header
                font=dict(color='#FFFFFF', size=12),
                align='center',
                height=30),
    cells=dict(values=formatted_data,
               fill_color='#000000',  # No color coding for cells
               font=dict(color='#FFFFFF', size=12),
               align='center',
               height=30)
)])

# Add the logo image
labs_logo_lookup = {
    'blue': 'https://parcllabs-assets.s3.amazonaws.com/powered-by-parcllabs-api.png',
    'white': 'https://parcllabs-assets.s3.amazonaws.com/powered-by-parcllabs-api-logo-white+(1).svg'
}
labs_logo_dict = dict(
    source=labs_logo_lookup['white'],
    xref="paper",
    yref="paper",
    x=0.5,
    y=1.01,
    sizex=0.2,
    sizey=0.2,
    xanchor="center",
    yanchor="bottom"
)
fig.add_layout_image(labs_logo_dict)

# Set the dimensions of the figure
w = 1400
h = 600

# Update layout and display the figure
fig.update_layout(
    title={
        'text': 'Parcl Exchange Markets Supply & Demand Overview',
        'y': 0.94,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    title_font_color='#FFFFFF',
    width=w,  # Increase the width for wider cells
    height=h,
    paper_bgcolor='#080D16',
    margin=dict(l=10, r=10, t=100, b=10)
)

fig.show()

fig.write_image(os.path.join('../graphics/pricefeeds/supply_demand', f'comp_table_supply_demand.png'), width=w, height=h)
