<a href="https://colab.research.google.com/github/Iamjohnko/Data-science-Project-Portfolio/blob/main/Real_Estate_Market_Intelligence_%26_Trend_Forecasting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Executive Summary**


The exploratory analysis reveals a heterogeneous market in which the relationships among rents, vacancy, transactions and sentiment depend strongly on submarket and time. At the aggregate level there is no detectable linear relationship between rent and vacancy (Pearson r ≈ 0.005, p = 0.7470) and no detectable monotonic association between sentiment and rent change (Spearman ρ ≈ -0.006, p = 0.6725).
However, submarket-level Granger causality tests show localized lead–lag dynamics (e.g., vacancy → rent in Industrial Zone at lag 1; vacancy → rent in Suburban East at lags 2 & 4; and rent → vacancy in Suburban West at multiple lags). Outlier detection flagged a small number of extreme rent observations (22 by the per-submarket IQR rule, 7 by a global Z-score threshold of |z| > 3) worth case-level review. Spatial tests (Moran’s I with the synthetic centroids and a distance-band weight) did not detect significant spatial clustering (Moran’s I ≈ 0.06; p-values of roughly 0.225–0.251 depending on specification).

Implication: simple, single-equation relationships do not capture the market. A mix of submarket-specific time-series, event-aware sentiment signals, and higher-resolution spatial modeling is needed to produce actionable forecasts and early-warning signals.

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("real_estate_market_intelligence.csv")
df.head(10)

Unnamed: 0,submarket,property_type,rent_per_sqft,vacancy_rate,transaction_volume,cap_rate,year,quarter,news_sentiment,development_activity
0,Tech Corridor,Retail,35.9,0.152,42368556,0.093,2024,3,0.029,Completed
1,Suburban West,Retail,31.47,0.085,37392010,0.094,2017,1,-0.244,Planned
2,Industrial Zone,Office,49.28,0.249,12222670,0.112,2023,4,0.941,Planned
3,Tech Corridor,Office,33.5,0.019,31423224,0.11,2022,1,0.615,
4,Suburban East,Multifamily,22.07,0.049,37683050,0.055,2021,1,0.862,
5,Industrial Zone,Office,36.87,0.224,1286737,0.088,2015,3,0.115,
6,Industrial Zone,Multifamily,34.47,0.025,13518985,0.065,2016,4,0.357,
7,Tech Corridor,Industrial,26.92,0.223,7714219,0.077,2023,3,-0.667,Planned
8,Midtown,Multifamily,24.82,0.182,8350682,0.08,2019,2,0.506,
9,Suburban East,Industrial,27.02,0.019,47767188,0.066,2015,4,-0.528,Completed


In [None]:
# Ensure dtypes
df['year'] = df['year'].astype(int)
df['quarter'] = df['quarter'].astype(int)
df['year_q'] = df['year'].astype(str) + "-Q" + df['quarter'].astype(str)

print("shape:", df.shape)
print(df.columns.tolist())

shape: (5000, 11)
['submarket', 'property_type', 'rent_per_sqft', 'vacancy_rate', 'transaction_volume', 'cap_rate', 'year', 'quarter', 'news_sentiment', 'development_activity', 'year_q']


In [None]:
# --- Basic summary prints ---
print("\n--- Basic numeric summary ---")
print(df[['rent_per_sqft','vacancy_rate','transaction_volume','cap_rate','news_sentiment']].describe())


--- Basic numeric summary ---
       rent_per_sqft  vacancy_rate  transaction_volume     cap_rate  \
count    5000.000000   5000.000000        5.000000e+03  5000.000000   
mean       29.947494      0.130619        2.516970e+07     0.075220   
std         9.937642      0.068611        1.432543e+07     0.026052   
min         5.000000      0.010000        1.045540e+05     0.030000   
25%        23.185000      0.071000        1.285092e+07     0.053000   
50%        29.985000      0.131000        2.531304e+07     0.075000   
75%        36.662500      0.188000        3.756484e+07     0.098000   
max        65.990000      0.250000        4.998736e+07     0.120000   

       news_sentiment  
count     5000.000000  
mean        -0.009781  
std          0.576962  
min         -1.000000  
25%         -0.501000  
50%         -0.010500  
75%          0.488000  
max          1.000000  


In [None]:
# --- 1. Rental & Vacancy Data (STRUCTURED) ---

# A. Distribution of rents across submarkets (histograms with box marginal)
fig_hist = px.histogram(
    df,
    x="rent_per_sqft",
    color="submarket",
    marginal="box",
    nbins=70,
    title="Rent per Sqft Distribution by Submarket (hist + box)",
    labels={"rent_per_sqft": "Rent ($/sqft)"}
)
fig_hist.update_layout(barmode='overlay')
fig_hist.update_traces(opacity=0.6)
fig_hist.show()

In [None]:
# B. KDE-like density per submarket (histnorm='density', overlay)
fig_density = px.histogram(
    df,
    x="rent_per_sqft",
    color="submarket",
    histnorm='density',
    nbins=80,
    title="Density Approximation of Rent per Sqft by Submarket"
)
fig_density.update_layout(barmode='overlay')
fig_density.update_traces(opacity=0.5)
fig_density.show()

In [None]:
# D. Vacancy trends per submarket (small multiples)
vac_sub = df.groupby(['submarket','year','quarter','year_q'])['vacancy_rate'].mean().reset_index()
fig_vac_sub = px.line(vac_sub.sort_values(['submarket','year','quarter']), x='year_q', y='vacancy_rate',
                      color='submarket', facet_col='submarket', facet_col_wrap=3,
                      title="Vacancy Rate Over Time by Submarket")
fig_vac_sub.update_xaxes(tickangle=45)
fig_vac_sub.update_layout(showlegend=False, height=900)
fig_vac_sub.show()

# E. Correlations between rents, vacancy, transaction volume, cap_rate and sentiment
corr_cols = ['rent_per_sqft','vacancy_rate','transaction_volume','cap_rate','news_sentiment']
corr = df[corr_cols].corr()
fig_corr = px.imshow(corr, text_auto=True, title="Correlation Matrix: rent, vacancy, transaction_volume, cap_rate, sentiment")
fig_corr.show()

In [None]:
# F. Geographic map of submarket centroids (rent as color, transaction volume as marker size)
# NOTE: The coordinates below are sample values; replace them with your actual submarket centroids (lat, lon).
submarket_coords = {
    "Downtown": (40.7128, -74.0060),
    "Midtown": (40.7549, -73.9840),
    "Suburban East": (40.7891, -73.1349),
    "Suburban West": (40.7282, -74.0776),
    "Industrial Zone": (40.6782, -73.9442),
    "Waterfront": (40.7003, -74.0121),
    "Tech Corridor": (40.7411, -73.9897)
}
centroid_df = df.groupby('submarket').agg(
    rent_per_sqft=('rent_per_sqft','mean'),
    vacancy_rate=('vacancy_rate','mean'),
    transaction_volume=('transaction_volume','sum')
).reset_index()
centroid_df['lat'] = centroid_df['submarket'].map(lambda x: submarket_coords.get(x,(np.nan,np.nan))[0])
centroid_df['lon'] = centroid_df['submarket'].map(lambda x: submarket_coords.get(x,(np.nan,np.nan))[1])

fig_geo = px.scatter_geo(centroid_df, lat='lat', lon='lon', hover_name='submarket',
                         size='transaction_volume', color='rent_per_sqft',
                         projection="natural earth",
                         title="Submarket centroids (rent color, transaction volume size)")
fig_geo.update_traces(marker=dict(opacity=0.8, sizemode='area'))
fig_geo.show()

In [None]:
# G. Outlier detection for rent spikes/drops (IQR & Z-score)
outliers_list = []
for sm, g in df.groupby('submarket'):
    q1 = g['rent_per_sqft'].quantile(0.25)
    q3 = g['rent_per_sqft'].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5*iqr
    upper = q3 + 1.5*iqr
    sm_out = g[(g['rent_per_sqft'] < lower) | (g['rent_per_sqft'] > upper)].copy()
    sm_out['outlier_method'] = 'IQR'
    outliers_list.append(sm_out)
outliers_df = pd.concat(outliers_list) if outliers_list else pd.DataFrame()
print("IQR outliers found:", outliers_df.shape[0])

IQR outliers found: 22


In [None]:
# Z-score global
df['rent_zscore'] = stats.zscore(df['rent_per_sqft'].fillna(df['rent_per_sqft'].mean()))
z_outliers = df[np.abs(df['rent_zscore']) > 3]
print("Z-score extreme outliers:", z_outliers.shape[0])


Z-score extreme outliers: 7


In [None]:
# Visualize outliers on timeline
fig_out = px.scatter(df, x='year_q', y='rent_per_sqft', color='submarket',
                     hover_data=['property_type','vacancy_rate'], title="Rent per Sqft over time (outliers marked)")
if not z_outliers.empty:
    fig_out.add_trace(go.Scatter(x=z_outliers['year_q'], y=z_outliers['rent_per_sqft'],
                                 mode='markers', marker_symbol='x', marker_size=10, marker_color='black',
                                 name='Z-outliers'))
fig_out.update_xaxes(tickangle=45)
fig_out.show()

In [None]:
# --- 2. TRANSACTION DATA ---

# A. Deal volume trends over time (aggregate)
trans_ts = df.groupby(['year','quarter','year_q'])['transaction_volume'].sum().reset_index().sort_values(['year','quarter'])
fig_trans = px.bar(trans_ts, x='year_q', y='transaction_volume', title="Transaction Volume Over Time")
fig_trans.update_xaxes(tickangle=45)
fig_trans.show()

# B. Price per sqft vs submarket (violin + box)
fig_vio = px.violin(df, x='submarket', y='rent_per_sqft', box=True, points='all', title="Rent per Sqft Distribution by Submarket")
fig_vio.update_layout(xaxis_title="Submarket", yaxis_title="Rent ($/sqft)")
fig_vio.show()

# C. Cap rate analysis across asset classes
fig_cap = px.box(df, x='property_type', y='cap_rate', points='all', title="Cap Rate Distribution by Property Type")
fig_cap.show()
print(df.groupby('property_type')['cap_rate'].agg(['mean','median','count']).round(3))

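# D. Time-lag analysis (vacancy -> rent): correlate rent with vacancy shifted by -4..+4 quarters.
#    A positive lag pairs current rent with vacancy from `lag` quarters earlier (vacancy leads rent).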
agg = df.groupby(['submarket','year','quarter']).agg(rent_per_sqft=('rent_per_sqft','mean'), vacancy_rate=('vacancy_rate','mean')).reset_index()
agg = agg.sort_values(['submarket','year','quarter'])
agg['period_index'] = agg.groupby('submarket').cumcount()

lag_results = []
for lag in range(-4,5):
    corr_list = []
    for sm, g in agg.groupby('submarket'):
        series_rent = g.set_index('period_index')['rent_per_sqft']
        series_vac = g.set_index('period_index')['vacancy_rate']
        shifted_vac = series_vac.shift(lag)
        pair = pd.concat([series_rent, shifted_vac], axis=1).dropna()
        if pair.shape[0] > 3:
            corr_list.append(pair.corr().iloc[0,1])
    if len(corr_list) > 0:
        lag_results.append({'lag': lag, 'mean_corr': np.nanmean(corr_list), 'n_submarkets': len(corr_list)})
lag_df = pd.DataFrame(lag_results)
fig_lag = px.line(lag_df, x='lag', y='mean_corr', title="Average correlation between vacancy (lagged) and rent across submarkets")
fig_lag.add_hline(y=0, line_dash='dash')
fig_lag.update_xaxes(dtick=1)
fig_lag.show()

                mean  median  count
property_type                      
Industrial     0.076   0.075   1238
Multifamily    0.074   0.073   1241
Office         0.075   0.075   1260
Retail         0.076   0.076   1261


In [None]:
# --- 3. TEXT DATA (News, Planning Docs, Reports) ---

# If you have `news_text` column with raw text, the script runs TF-IDF, n-grams, and LDA.
# If not, numeric 'news_sentiment' column is used for time series correlation analyses.

if 'news_text' not in df.columns:
    print("\nNo 'news_text' column detected. Skipping TF-IDF / LDA. Using numeric 'news_sentiment' for sentiment + correlation analysis.")
else:
    print("\nProcessing textual data (TF-IDF, n-grams, LDA topics)...")
    # Basic cleaning (light)
    df['news_text_clean'] = df['news_text'].astype(str).str.replace(r'http\S+','', regex=True).str.replace(r'[^A-Za-z0-9 ]+', ' ', regex=True).str.lower()
    docs = df['news_text_clean'].fillna('')
    # 1) Top n-grams / word frequency
    vectorizer = CountVectorizer(ngram_range=(1,2), max_features=1500, stop_words='english')
    X_counts = vectorizer.fit_transform(docs)
    sum_words = np.array(X_counts.sum(axis=0)).ravel()
    terms = vectorizer.get_feature_names_out()
    freq_df = pd.DataFrame({'term':terms, 'count':sum_words}).sort_values('count', ascending=False).head(40)
    fig_tf = px.bar(freq_df.head(25), x='count', y='term', orientation='h', title="Top terms / n-grams (news_text)")
    fig_tf.update_layout(yaxis={'categoryorder':'total ascending'})
    fig_tf.show()

    # 2) TF-IDF top features
    tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=1500, stop_words='english')
    X_tfidf = tfidf.fit_transform(docs)
    tfidf_sums = np.array(X_tfidf.sum(axis=0)).ravel()
    top_idx = tfidf_sums.argsort()[::-1][:25]
    tfidf_df = pd.DataFrame({'term': tfidf.get_feature_names_out()[top_idx], 'tfidf': tfidf_sums[top_idx]})
    fig_tfidf = px.bar(tfidf_df, x='tfidf', y='term', orientation='h', title="Top TF-IDF terms")
    fig_tfidf.update_layout(yaxis={'categoryorder':'total ascending'})
    fig_tfidf.show()

    # 3) Topic modeling (LDA)
    n_topics = 6
    lda_vec = CountVectorizer(max_df=0.95, min_df=5, stop_words='english')
    lda_counts = lda_vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(lda_counts)
    feature_names = lda_vec.get_feature_names_out()
    def top_words_for_topic(model, feature_names, n_top_words=10):
        topic_words = []
        for idx, topic in enumerate(model.components_):
            top_idx = topic.argsort()[:-n_top_words-1:-1]
            words = [feature_names[i] for i in top_idx]
            topic_words.append(" ".join(words))
        return topic_words
    topics = top_words_for_topic(lda, feature_names, n_top_words=12)
    for i, t in enumerate(topics):
        print(f"Topic {i}: {t}")

    # assign topics and visualize prevalence over time
    doc_topic = lda.transform(lda_counts)
    df['topic'] = doc_topic.argmax(axis=1)
    topic_ts = df.groupby(['year','quarter','topic']).size().reset_index(name='count')
    topic_ts['year_q'] = topic_ts['year'].astype(str) + "-Q" + topic_ts['quarter'].astype(str)
    fig_topic = px.bar(topic_ts.sort_values(['year','quarter']), x='year_q', y='count', color='topic', title="Topic prevalence over time")
    fig_topic.update_xaxes(tickangle=45)
    fig_topic.show()



No 'news_text' column detected. Skipping TF-IDF / LDA. Using numeric 'news_sentiment' for sentiment + correlation analysis.


In [None]:
# Sentiment analysis over time (numeric)
if 'news_sentiment' in df.columns:
    sent_ts = df.groupby(['year','quarter','year_q'])['news_sentiment'].mean().reset_index().sort_values(['year','quarter'])
    fig_sent = px.line(sent_ts, x='year_q', y='news_sentiment', title="Average News Sentiment Over Time (All Submarkets)", markers=True)
    fig_sent.update_xaxes(tickangle=45)
    fig_sent.show()

    # Sentiment by submarket heatmap
    sent_by_sub = df.groupby(['submarket','year','quarter'])['news_sentiment'].mean().reset_index()
    if not sent_by_sub.empty:
        # convert to pivot (submarket rows, year_q columns)
        sent_by_sub['year_q'] = sent_by_sub['year'].astype(str) + "-Q" + sent_by_sub['quarter'].astype(str)
        pivot_sent = sent_by_sub.pivot(index='submarket', columns='year_q', values='news_sentiment').fillna(0)
        fig_sent_heat = px.imshow(pivot_sent, title="Sentiment by Submarket (Year-Quarter columns)")
        fig_sent_heat.update_layout(xaxis={'tickangle':45})
        fig_sent_heat.show()

    # Correlation of sentiment & market metrics (quarterly aggregated)
    agg_corr = df.groupby(['year','quarter']).agg({
        'news_sentiment':'mean',
        'rent_per_sqft':'mean',
        'transaction_volume':'sum',
        'vacancy_rate':'mean'
    }).reset_index()
    corr_agg = agg_corr[['news_sentiment','rent_per_sqft','transaction_volume','vacancy_rate']].corr()
    fig_corr_agg = px.imshow(corr_agg, text_auto=True, title="Quarter-aggregated correlation: sentiment vs market metrics")
    fig_corr_agg.show()

# HYPOTHESIS TESTING & ADVANCED STATISTICAL ANALYSIS


In [None]:
!pip install esda


Collecting esda
  Downloading esda-2.8.0-py3-none-any.whl.metadata (2.0 kB)
Downloading esda-2.8.0-py3-none-any.whl (157 kB)
Installing collected packages: esda
Successfully installed esda-2.8.0


In [None]:
# HYPOTHESIS TESTING & ADVANCED STATISTICAL ANALYSIS
# ================================================

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.api as sm
from statsmodels.tsa.stattools import grangercausalitytests
from scipy.stats import pearsonr, spearmanr
from sklearn.preprocessing import StandardScaler
from libpysal.weights import DistanceBand
from esda.moran import Moran
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load dataset
df = pd.read_csv("real_estate_market_intelligence.csv")

# Ensure proper dtypes
df['year'] = df['year'].astype(int)
df['quarter'] = df['quarter'].astype(int)

# Sort by time for time-series tests
df = df.sort_values(by=['submarket', 'year', 'quarter'])

In [None]:
# 1. RENTAL PRICE DYNAMICS
# ================================================

# Correlation between Rent and Vacancy
corr, pval = pearsonr(df['rent_per_sqft'], df['vacancy_rate'])
print(f"Correlation (Rent vs Vacancy): {corr:.3f}, p-value={pval:.4f}")

if pval < 0.05:
    print("Reject H0 → Rent and vacancy rate are significantly linearly correlated.")
else:
    print("Fail to reject H0 → No significant relationship detected between rents and vacancy rates.")

# Plot relationship
fig = px.scatter(df, x='vacancy_rate', y='rent_per_sqft',
                 color='submarket',
                 trendline='ols',
                 title='Rental Price vs Vacancy Rate across Submarkets')
fig.show()

Correlation (Rent vs Vacancy): 0.005, p-value=0.7470
Fail to reject H0 → No significant relationship detected between rents and vacancy rates.
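A pooled correlation can hide relationships that differ by submarket. The sketch below repeats the same Pearson test within each submarket. The helper name `corr_by_submarket` and the demo frame are illustrative assumptions (the demo data is synthetic, not the notebook's dataset); only the column names follow the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def corr_by_submarket(data, x="vacancy_rate", y="rent_per_sqft"):
    """Pearson r and p-value between x and y, computed separately per submarket."""
    rows = []
    for submarket, g in data.groupby("submarket"):
        pair = g[[x, y]].dropna()
        r, p_value = pearsonr(pair[x], pair[y])
        rows.append({"submarket": submarket, "n": len(pair),
                     "pearson_r": r, "p_value": p_value})
    return pd.DataFrame(rows)


# Synthetic demo: one submarket with a negative rent/vacancy link, one without.
rng = np.random.default_rng(0)
vac = rng.uniform(0.01, 0.25, 200)
demo = pd.DataFrame({
    "submarket": ["Industrial Zone"] * 100 + ["Midtown"] * 100,
    "vacancy_rate": vac,
    "rent_per_sqft": np.concatenate([
        40 - 50 * vac[:100] + rng.normal(0, 3, 100),  # vacancy depresses rent
        rng.normal(30, 5, 100),                       # no relationship
    ]),
})
print(corr_by_submarket(demo).round(4))
```

On the real data, calling `corr_by_submarket(df)` would return one row per submarket; with several tests run side by side, the p-values are best read with a multiple-comparison correction in mind.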


## Time series stationarity testing

### Subtask:
Test for stationarity in key time series like rent, vacancy, and transaction volume within each submarket using tests like Augmented Dickey-Fuller (ADF) or Kwiatkowski-Phillips-Schmidt-Shin (KPSS).


In [None]:
# ================================================
# 2. TRANSACTION VOLUME & RENTAL PRICE CAUSALITY
# ================================================

# Aggregate by year-quarter for a single submarket example (e.g., Midtown)
midtown = df[df['submarket'] == 'Midtown'].groupby(['year', 'quarter']).agg({
    'rent_per_sqft': 'mean',
    'transaction_volume': 'mean'
}).reset_index()

# Prepare data for Granger causality
granger_df = midtown[['rent_per_sqft', 'transaction_volume']].dropna()
granger_df.columns = ['rent', 'txn']

# Test if Transaction Volume Granger-causes Rent
print("\nGranger Causality Test: Transaction Volume → Rent")
grangercausalitytests(granger_df[['rent', 'txn']], maxlag=4, verbose=True)


Granger Causality Test: Transaction Volume → Rent

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.8249  , p=0.3698  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=0.8937  , p=0.3445  , df=1
likelihood ratio test: chi2=0.8836  , p=0.3472  , df=1
parameter F test:         F=0.8249  , p=0.3698  , df_denom=36, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=0.4841  , p=0.6206  , df_denom=33, df_num=2
ssr based chi2 test:   chi2=1.1149  , p=0.5727  , df=2
likelihood ratio test: chi2=1.0989  , p=0.5773  , df=2
parameter F test:         F=0.4841  , p=0.6206  , df_denom=33, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=0.2396  , p=0.8680  , df_denom=30, df_num=3
ssr based chi2 test:   chi2=0.8866  , p=0.8287  , df=3
likelihood ratio test: chi2=0.8762  , p=0.8312  , df=3
parameter F test:         F=0.2396  , p=0.8680  , df_denom=30, df_num=3

Granger Causality
number of lags (no zero) 4
ssr

{np.int64(1): ({'ssr_ftest': (np.float64(0.824932542864846),
    np.float64(0.369782297919501),
    np.float64(36.0),
    np.int64(1)),
   'ssr_chi2test': (np.float64(0.8936769214369163),
    np.float64(0.34448283836429483),
    np.int64(1)),
   'lrtest': (np.float64(0.8835914903802404),
    np.float64(0.34721988246777324),
    np.int64(1)),
   'params_ftest': (np.float64(0.8249325428648699),
    np.float64(0.3697822979194926),
    np.float64(36.0),
    1.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7aa09f2de3f0>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7aa0a14fd7f0>,
   array([[0., 1., 0.]])]),
 np.int64(2): ({'ssr_ftest': (np.float64(0.4841025308285098),
    np.float64(0.620557645895667),
    np.float64(33.0),
    np.int64(2)),
   'ssr_chi2test': (np.float64(1.1149027982717195),
    np.float64(0.5726667044455944),
    np.int64(2)),
   'lrtest': (np.float64(1.0988604545615885),
    np.float64(0.5772786343113493),
    np.int64

In [None]:
# 2. Granger Causality Tests
# ================================================

print("\n--- Granger Causality Tests (Vacancy <-> Rent) ---")
max_lag = 4 # Test up to 4 quarters of lag

# Perform Granger causality test for each submarket
for submarket in df['submarket'].unique():
    print(f"\nSubmarket: {submarket}")
    submarket_df = df[df['submarket'] == submarket].copy()
    # Ensure data is sorted by time
    submarket_df = submarket_df.sort_values(by=['year', 'quarter'])

    # NOTE: grangercausalitytests checks whether the SECOND column Granger-causes
    # the FIRST, so the candidate cause is always passed as the second column.

    # Test if vacancy rate Granger-causes rent per sqft
    try:
        # Drop missing values so both series stay aligned for the test
        temp_df = submarket_df[['rent_per_sqft', 'vacancy_rate']].dropna()
        if len(temp_df) > max_lag:
            gc_test_vr_to_rent = grangercausalitytests(temp_df, max_lag, verbose=False)
            # Summarize the ssr-based F-test p-value for each lag
            print("  Vacancy Rate Granger-causes Rent per Sqft:")
            for lag in range(1, max_lag + 1):
                p_value = gc_test_vr_to_rent[lag][0]['ssr_ftest'][1]
                print(f"    Lag {lag}: p-value = {p_value:.4f}")
                if p_value < 0.05:
                    print("      Result: Significant (Reject H0 - Vacancy Rate Granger-causes Rent)")
                else:
                    print("      Result: Not Significant (Fail to Reject H0)")
        else:
            print(f"  Not enough data ({len(temp_df)} data points) to perform Granger causality test for Vacancy -> Rent.")

    except Exception as e:
        print(f"  Error during Granger causality test (Vacancy -> Rent): {e}")

    # Test if rent per sqft Granger-causes vacancy rate
    try:
        # Drop missing values so both series stay aligned for the test
        temp_df = submarket_df[['vacancy_rate', 'rent_per_sqft']].dropna()
        if len(temp_df) > max_lag:
            gc_test_rent_to_vr = grangercausalitytests(temp_df, max_lag, verbose=False)
            # Summarize the ssr-based F-test p-value for each lag
            print("  Rent per Sqft Granger-causes Vacancy Rate:")
            for lag in range(1, max_lag + 1):
                p_value = gc_test_rent_to_vr[lag][0]['ssr_ftest'][1]
                print(f"    Lag {lag}: p-value = {p_value:.4f}")
                if p_value < 0.05:
                    print("      Result: Significant (Reject H0 - Rent Granger-causes Vacancy Rate)")
                else:
                    print("      Result: Not Significant (Fail to Reject H0)")
        else:
            print(f"  Not enough data ({len(temp_df)} data points) to perform Granger causality test for Rent -> Vacancy.")

    except Exception as e:
        print(f"  Error during Granger causality test (Rent -> Vacancy): {e}")


--- Granger Causality Tests (Vacancy <-> Rent) ---

Submarket: Downtown
  Vacancy Rate Granger-causes Rent per Sqft:
    Lag 1: p-value = 0.4674
      Result: Not Significant (Fail to Reject H0)
    Lag 2: p-value = 0.4912
      Result: Not Significant (Fail to Reject H0)
    Lag 3: p-value = 0.5886
      Result: Not Significant (Fail to Reject H0)
    Lag 4: p-value = 0.7352
      Result: Not Significant (Fail to Reject H0)
  Rent per Sqft Granger-causes Vacancy Rate:
    Lag 1: p-value = 0.8085
      Result: Not Significant (Fail to Reject H0)
    Lag 2: p-value = 0.3892
      Result: Not Significant (Fail to Reject H0)
    Lag 3: p-value = 0.3383
      Result: Not Significant (Fail to Reject H0)
    Lag 4: p-value = 0.3489
      Result: Not Significant (Fail to Reject H0)

Submarket: Industrial Zone
  Vacancy Rate Granger-causes Rent per Sqft:
    Lag 1: p-value = 0.0252
      Result: Significant (Reject H0 - Vacancy Rate Granger-causes Rent)
    Lag 2: p-value = 0.0519
      Resul

In [None]:
# ================================================
# 3. SENTIMENT ANALYSIS & RENTAL CORRELATION
# ================================================

# Compute correlation between Sentiment and Rent change
df['Rent_Change'] = df.groupby('submarket')['rent_per_sqft'].pct_change()
sent_corr, sent_pval = spearmanr(df['news_sentiment'], df['Rent_Change'], nan_policy='omit')
print(f"\nCorrelation (Sentiment vs Rent Change): {sent_corr:.3f}, p-value={sent_pval:.4f}")

if sent_pval < 0.05:
    print("Reject H0 → Sentiment and rent change are significantly rank-correlated (check the sign of ρ for direction).")
else:
    print("Fail to reject H0 → Sentiment shows no predictive relationship with rent changes.")

# Sentiment over time
sent_trend = df.groupby(['year'])['news_sentiment'].mean().reset_index()
fig = px.line(sent_trend, x='year', y='news_sentiment', title='Average Sentiment Over Time')
fig.show()


Correlation (Sentiment vs Rent Change): -0.006, p-value=0.6725
Fail to reject H0 → Sentiment shows no predictive relationship with rent changes.
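
The test above is contemporaneous, so it cannot show whether sentiment leads rent. One way to probe that, sketched here under the assumption that the data is first averaged per submarket and quarter, is to correlate the quarterly rent change with sentiment from earlier quarters. The helper name `lagged_sentiment_corr` and the demo data are illustrative; Spearman's ρ is computed as the Pearson correlation of ranks.

```python
import numpy as np
import pandas as pd


def lagged_sentiment_corr(data, max_lag=4):
    """Spearman rho between quarterly rent change and sentiment lagged 0..max_lag quarters."""
    quarterly = (
        data.groupby(["submarket", "year", "quarter"])
        .agg(rent=("rent_per_sqft", "mean"), sentiment=("news_sentiment", "mean"))
        .reset_index()
        .sort_values(["submarket", "year", "quarter"])
    )
    quarterly["rent_change"] = quarterly.groupby("submarket")["rent"].pct_change()
    rows = []
    for lag in range(max_lag + 1):
        # shift(lag) pairs the current rent change with sentiment from `lag` quarters earlier
        lagged = quarterly.groupby("submarket")["sentiment"].shift(lag)
        pair = pd.DataFrame({"s": lagged, "r": quarterly["rent_change"]}).dropna()
        rho = pair["s"].rank().corr(pair["r"].rank())  # Spearman = Pearson on ranks
        rows.append({"lag_quarters": lag, "spearman_rho": rho, "n": len(pair)})
    return pd.DataFrame(rows)


# Synthetic demo: 2 submarkets, 40 quarters, 3 observations per quarter.
rng = np.random.default_rng(42)
demo = pd.DataFrame([
    {"submarket": sm, "year": year, "quarter": quarter,
     "rent_per_sqft": rng.normal(30, 5),
     "news_sentiment": rng.uniform(-1, 1)}
    for sm in ("Midtown", "Downtown")
    for year in range(2015, 2025)
    for quarter in range(1, 5)
    for _ in range(3)
])
print(lagged_sentiment_corr(demo).round(3))
```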


In [None]:
# 4. SPATIAL AUTOCORRELATION (Moran’s I)
# ================================================

# Approximate coordinates for submarkets (synthetic)
coords = {
    "Downtown": (40.7128, -74.0060),
    "Midtown": (40.7549, -73.9840),
    "Suburban East": (40.8300, -73.9000),
    "Suburban West": (40.7200, -74.2000),
    "Industrial Zone": (40.7000, -74.2500),
    "Waterfront": (40.6900, -73.9800),
    "Tech Corridor": (40.7600, -73.9500)
}

df['lat'] = df['submarket'].map(lambda x: coords[x][0])
df['lon'] = df['submarket'].map(lambda x: coords[x][1])

# Average rent per submarket
geo_df = df.groupby('submarket')[['rent_per_sqft', 'lat', 'lon']].mean().reset_index()

# Build distance-based spatial weights
w = DistanceBand.from_array(geo_df[['lat', 'lon']].values, threshold=0.1, binary=True)

# Moran’s I test
moran = Moran(geo_df['rent_per_sqft'], w)
print(f"\nMoran’s I: {moran.I:.3f}, p-value={moran.p_sim:.4f}")

if moran.p_sim < 0.05:
    print("Reject H0 → Rental prices cluster geographically (spatial spillover effects).")
else:
    print("Fail to reject H0 → No spatial autocorrelation detected in rents.")

# Spatial heatmap visualization
fig = px.scatter_mapbox(
    geo_df,
    lat='lat', lon='lon',
    color='rent_per_sqft',
    size='rent_per_sqft',
    color_continuous_scale='Viridis',
    mapbox_style='carto-positron',
    zoom=9,
    title='Spatial Distribution of Average Rent per Sqft by Submarket'
)
fig.show()


Moran’s I: 0.060, p-value=0.2250
Fail to reject H0 → No spatial autocorrelation detected in rents.
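
For reference, the statistic behind this test can be written out with NumPy alone. The sketch below assumes binary distance-band weights that are then row-standardised, and a permutation-based pseudo p-value taken toward the nearer tail. It is a hand-rolled illustration, not esda's implementation, and the rent values in the demo are made up.

```python
import numpy as np


def morans_i(values, coords, threshold, permutations=999, seed=0):
    """Moran's I with row-standardised distance-band weights and a permutation p-value."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(values)
    # Binary distance-band weights: 1 if two distinct points lie within `threshold`
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = ((dist <= threshold) & (dist > 0)).astype(float)
    row_sums = w.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("at least one location has no neighbours within the threshold")
    w = w / row_sums  # row-standardise so every row sums to 1

    def statistic(x):
        z = x - x.mean()
        return (n / w.sum()) * (z @ w @ z) / (z @ z)

    observed = statistic(values)
    rng = np.random.default_rng(seed)
    sims = np.array([statistic(rng.permutation(values)) for _ in range(permutations)])
    # Pseudo p-value: permutations at least as extreme as observed, in the nearer tail
    larger = int(np.sum(sims >= observed))
    extreme = min(larger, permutations - larger)
    return observed, (extreme + 1) / (permutations + 1)


# Demo with the notebook's synthetic centroids and hypothetical mean rents.
demo_coords = [(40.7128, -74.0060), (40.7549, -73.9840), (40.8300, -73.9000),
               (40.7200, -74.2000), (40.7000, -74.2500), (40.6900, -73.9800),
               (40.7600, -73.9500)]
demo_rents = [30.1, 29.8, 30.4, 29.6, 30.0, 29.9, 30.2]  # made-up values
i_stat, p_sim = morans_i(demo_rents, demo_coords, threshold=0.1)
print(f"Moran's I: {i_stat:.3f}, pseudo p-value: {p_sim:.4f}")
```

With only seven submarket centroids the permutation distribution is coarse, so any p-value from this kind of test should be read as indicative at best.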


In [None]:
# SUMMARY DASHBOARD (INTERACTIVE VIEW)
# ================================================
summary = pd.DataFrame({
    'Test': [
        'Rent vs Vacancy',
        'Transaction Volume → Rent (Granger)',
        'Sentiment vs Rent Change',
        'Spatial Autocorrelation (Moran’s I)'
    ],
    'Statistic': [
        round(corr, 3),
        'F-stat (from Granger)',
        round(sent_corr, 3),
        round(moran.I, 3)
    ],
    'p-value': [
        round(pval, 4),
        'See output above',
        round(sent_pval, 4),
        round(moran.p_sim, 4)
    ],
    'Conclusion': [
        'Reject H0' if pval < 0.05 else 'Fail to Reject H0',
        'Refer to lag tests above',
        'Reject H0' if sent_pval < 0.05 else 'Fail to Reject H0',
        'Reject H0' if moran.p_sim < 0.05 else 'Fail to Reject H0'
    ]
})

fig = go.Figure(data=[go.Table(
    header=dict(values=list(summary.columns), fill_color='lightblue', align='left'),
    cells=dict(values=[summary[col] for col in summary.columns], fill_color='white', align='left')
)])
fig.update_layout(title='Hypothesis Testing Summary Dashboard')
fig.show()