# Investigating Factors of Rising House Values in New York City: Step 2 (Analysis)

Team Members: Francisco Brady (fbrady), Zhonghan Xie (jonasxie), Michael Garner (mngarner)  
Date: 2024-11-06

## Background and Guiding Questions

As house prices continue to rise in New York City, the ripple effects on communities are profound. For example, the median home price in NYC reached approximately \$754,000 as of September 2024, a 29.5\% increase over September 2017. While rapidly increasing property values may signal economic growth, they also contribute to housing instability and displacement, particularly among lower-income residents. At the same time, access to quality educational resources remains a crucial factor for many families in deciding where to live. In areas where public schools are highly rated and well-attended, home prices often exceed city averages, reflecting the economic value that buyers place on educational quality. This project aims to investigate the factors that contribute to rising house prices in NYC, their impact on eviction rates, and the intersection between educational outcomes and housing markets. By understanding these relationships, we hope to inform policymakers and community stakeholders on how to address the challenges of access to quality education, housing affordability, and housing stability in the city.

Three questions that we seek to answer in an analysis of housing price, eviction rate, and educational datasets for NYC are:
1. What impact do rising housing prices have on eviction rates in NYC? Are the two strongly correlated?
2. Does the change in eviction rate due to housing prices predict a change in primary/secondary education attendance? Specifically: do neighborhoods with higher eviction rates, potentially due to rising housing costs, see a decrease in school attendance?
3. How do the relationships between housing prices, eviction rates, and education vary across different neighborhoods in NYC? Do the predictive relationships differ in locations with different socioeconomic characteristics?

## Descriptive Statistics

Guidance:

Provide a comprehensive summary of your combined dataset using descriptive statistics. This should include means, medians, modes, ranges, variance, and standard deviations for the relevant features of your data.  The descriptive statistics should inform your guiding questions that you developed in Part I of the project, rather than merely providing an overview of your data.  Interpret these results to draw preliminary conclusions about the data.

In [2]:
# Import libraries and data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [4]:
# Will change with merged data, but import data
allData = pd.read_csv('./data/analytic_dataset.csv')

In [None]:
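# Sum all non-white race shares; min_count=1 leaves rows with no census data as NaN
minority_cols = [c for c in allData.columns if c.endswith('_pct') and c not in ('white_pct', 'minority_pct')]
allData["minority_pct"] = allData[minority_cols].sum(axis=1, min_count=1)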

Unnamed: 0,DBN,school_name,school_type,academic_type,grade_type,open_date,status,address,community_district,council_district,...,hvi,total_population,median_income,white_pct,black_pct,american_indian_alaska_native_pct,asian_pct,hawaiian_pacific_islander_pct,multiple_race_pct,other_race_pct
0,15K001,P.S. 001 The Bergen,DOE,General Academic,Elementary,1965-07-01T00:00:00.000,Open,309 47 STREET,307.0,38.0,...,643949.214602,0.0,,,,,,,,
1,15K001,P.S. 001 The Bergen,DOE,General Academic,Elementary,1965-07-01T00:00:00.000,Open,309 47 STREET,307.0,38.0,...,643949.214602,1503.0,18941.0,40.652,2.728,10.313,6.055,0.0,1.464,39.521
2,15K001,P.S. 001 The Bergen,DOE,General Academic,Elementary,1965-07-01T00:00:00.000,Open,309 47 STREET,307.0,38.0,...,643949.214602,1738.0,25856.0,42.750,3.625,0.000,23.072,0.0,4.603,28.251
3,15K001,P.S. 001 The Bergen,DOE,General Academic,Elementary,1965-07-01T00:00:00.000,Open,309 47 STREET,307.0,38.0,...,643949.214602,5328.0,23235.0,33.296,0.638,1.126,30.593,0.0,9.685,29.505
4,15K001,P.S. 001 The Bergen,DOE,General Academic,Elementary,1965-07-01T00:00:00.000,Open,309 47 STREET,307.0,38.0,...,643949.214602,5431.0,24473.0,41.300,3.701,0.110,19.923,0.0,29.571,20.180
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69437,75X811,P.S. X811,DOE,Special Education,Secondary School,1994-10-24T00:00:00.000,Open,1434 LONGFELLOW AVENUE,203.0,17.0,...,,,,,,,,,,
69438,75X811,P.S. X811,DOE,Special Education,Secondary School,1994-10-24T00:00:00.000,Open,1434 LONGFELLOW AVENUE,203.0,17.0,...,,,,,,,,,,
69439,75X811,P.S. X811,DOE,Special Education,Secondary School,1994-10-24T00:00:00.000,Open,1434 LONGFELLOW AVENUE,203.0,17.0,...,,,,,,,,,,
69440,75X811,P.S. X811,DOE,Special Education,Secondary School,1994-10-24T00:00:00.000,Open,1434 LONGFELLOW AVENUE,203.0,17.0,...,,,,,,,,,,


In [None]:
# Descriptive statistics for the key variables: evictions, school absenteeism, housing prices
# Mean, median, mode, std dev, range, IQR, etc
# Box plot for each
keepCols = ['year', 'pct_attendance',
       'pct_chronically_absent', 'nta_code', 'nta_name', 'borough',
       'number_of_sales', 'average_sale_price', 'median_sale_price',
       'lowest_sale_price', 'highest_sale_price', 'hvi', 'total_population',
       'median_income', 'white_pct', 'black_pct',
       'american_indian_alaska_native_pct', 'asian_pct',
       'hawaiian_pacific_islander_pct', 'multiple_race_pct', 'other_race_pct']
# List every available column (last expression, so it is shown as the cell output)
allData.columns

Index(['DBN', 'school_name', 'school_type', 'academic_type', 'grade_type',
       'open_date', 'status', 'address', 'community_district',
       'council_district', 'census_tract', 'year', 'pct_attendance',
       'pct_chronically_absent', 'nta_code', 'nta_name', 'borough',
       'number_of_sales', 'average_sale_price', 'median_sale_price',
       'lowest_sale_price', 'highest_sale_price', 'hvi', 'total_population',
       'median_income', 'white_pct', 'black_pct',
       'american_indian_alaska_native_pct', 'asian_pct',
       'hawaiian_pacific_islander_pct', 'multiple_race_pct', 'other_race_pct'],
      dtype='object')
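The statistics listed in the cell comments (mean, median, mode, standard deviation, variance, range, IQR) could be collected in one table along the following lines. This is a minimal sketch, not the team's final code: `describe_key_vars` is a hypothetical helper, and the small `demo` frame uses made-up values so the block runs on its own. In the notebook it would be called on `allData` with the relevant entries of `keepCols`.

```python
import numpy as np
import pandas as pd

def describe_key_vars(df, cols):
    """Return mean, median, mode, std, variance, range, and IQR per column."""
    num = df[cols]
    summary = pd.DataFrame({
        "mean": num.mean(),
        "median": num.median(),
        # mode() can return several values; keep the smallest for a compact table
        "mode": num.mode().iloc[0],
        "std": num.std(),
        "variance": num.var(),
        "min": num.min(),
        "max": num.max(),
    })
    summary["range"] = summary["max"] - summary["min"]
    summary["iqr"] = num.quantile(0.75) - num.quantile(0.25)
    return summary

# Synthetic stand-in for allData (values are illustrative only)
demo = pd.DataFrame({
    "pct_chronically_absent": [10.0, 20.0, 20.0, 30.0, np.nan],
    "hvi": [500000.0, 600000.0, 650000.0, 700000.0, 900000.0],
})
stats = describe_key_vars(demo, ["pct_chronically_absent", "hvi"])
print(stats.round(2))
```

Missing values are skipped by pandas by default, which matters here because many rows of the merged dataset have no census or housing values.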

Interpretation: 

## Inferential Statistics

Guidance:

Conduct appropriate hypothesis tests to investigate if there are significant differences or correlations within your data.  This might involve regression analysis, ANOVA, and/or chi-squared tests.

Clearly state your null and alternative hypotheses, choose an appropriate significance level, and discuss your findings. Make sure to justify the choice of your tests.

Research questions and associated hypotheses to be answered in the inferential analysis:

1. How are housing prices related to eviction rates? Do areas with higher housing prices have higher eviction rates?
    - Null hypothesis (H0): There is no relationship between housing prices and eviction rates.
    - Alternative hypothesis (Ha): There is a positive relationship between housing prices and eviction rates.
2. Is the impact of income on eviction rates mediated by housing prices?
    - H0: Housing prices do not mediate the relationship between income and eviction rates (income has no indirect effect on eviction rates through housing prices).
    - Ha: Housing prices mediate the relationship between income and eviction rates.
3. Do neighborhoods with higher eviction rates see a decrease in school attendance?
    - H0: There is no relationship between eviction rates and school attendance.
    - Ha: There is a negative association between eviction rates and school attendance.
4. Is the impact of income on school attendance mediated by evictions?
    - H0: Evictions do not mediate the relationship between income and school attendance (income has no indirect effect on school attendance through evictions).
    - Ha: Evictions mediate the relationship between income and school attendance.
5. How do the relationships between housing prices, evictions, and chronic absenteeism vary between minority and white-dominant neighborhoods?
    - H0: There is no statistically significant difference in the relationships between housing prices, evictions, and chronic absenteeism between minority and white-dominant neighborhoods.
    - Ha: The relationships between housing prices, evictions, and chronic absenteeism differ between minority and white-dominant neighborhoods.
6. How do the relationships between housing prices, evictions, and chronic absenteeism vary between low, medium and high income neighborhoods?
    - H0: There is no statistically significant difference in the relationships between housing prices, evictions, and chronic absenteeism between neighborhoods of different income levels.
    - Ha: The relationships between housing prices, evictions, and chronic absenteeism differ by neighborhood income level.


ANOVAs:
- Housing prices, eviction rates, and chronic absenteeism by minority-dominant vs white-dominant neighborhoods
- Housing prices, evictions, and school attendance by income level (low, medium, high income neighborhoods)

In [None]:
# Correlation heatmap for housing prices, evictions, and school chronic absenteeism

In [None]:
# Mediation analysis: income -> housing -> evictions

# 1. Regression analysis of income and evictions

# 2. Regression analysis of income and housing prices

# 3. Regression analysis of housing prices and evictions

# 4. Multiple regression of income and housing prices on evictions
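The four regressions outlined above could be fit with the `statsmodels` formula API roughly as follows. This is a sketch under assumptions: `mediation_steps` is a hypothetical helper, `eviction_rate` is a hypothetical column name, and the data are simulated (standardized) so that the block is self-contained.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def mediation_steps(df, x, m, y):
    """Fit the four regressions and return the key coefficients."""
    total = smf.ols(f"{y} ~ {x}", data=df).fit()        # 1. x -> y (total effect c)
    a_path = smf.ols(f"{m} ~ {x}", data=df).fit()       # 2. x -> m (path a)
    m_only = smf.ols(f"{y} ~ {m}", data=df).fit()       # 3. m -> y
    full = smf.ols(f"{y} ~ {x} + {m}", data=df).fit()   # 4. x + m -> y (c' and b)
    return {
        "c_total": total.params[x],
        "a": a_path.params[x],
        "m_on_y": m_only.params[m],
        "c_direct": full.params[x],
        "b": full.params[m],
        "indirect_ab": a_path.params[x] * full.params[m],
    }

# Simulated data: income raises housing prices, which raise evictions
rng = np.random.default_rng(42)
n = 500
income = rng.normal(0, 1, n)
hvi = 0.8 * income + rng.normal(0, 0.6, n)
evict = 0.5 * hvi - 0.3 * income + rng.normal(0, 0.5, n)
demo = pd.DataFrame({"median_income": income, "hvi": hvi, "eviction_rate": evict})

result = mediation_steps(demo, x="median_income", m="hvi", y="eviction_rate")
for name, value in result.items():
    print(f"{name:12s} {value: .3f}")
```

For ordinary least squares fit on the same rows, the total effect equals the direct effect plus the indirect effect (`c_total = c_direct + a * b`), which gives a quick sanity check on the fitted models. Judging whether the indirect effect is significant would need an additional test that this sketch does not include.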

In [None]:
# Mediation analysis: income -> evictions -> absenteeism

# 1. Regression analysis of income and absenteeism

# 2. Regression analysis of income and evictions (already done)

# 3. Regression analysis of evictions and absenteeism

# 4. Multiple regression of income and evictions on absenteeism

In [None]:
# ANOVA tests for housing prices, eviction rates, school attendance for low, medium, high income neighborhoods
# Define low = < Q1, medium = Q1 to Q3, high = > Q3

In [None]:
# ANOVA tests for housing prices, eviction rates, school attendance for white/minority dominated neighborhoods
# Sum all other races into minority_pct
# Define white-dominated as white_pct > minority_pct, minority-dominated as minority_pct > white_pct
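The classification rule in the comments above could be sketched as follows. The race column names come from the dataset, while `RACE_COLS` and `dominance` are hypothetical names and the `demo` values are invented. `scipy.stats.f_oneway` is used for the test, which is an assumption beyond the notebook's own imports. With only two groups, the one-way ANOVA is equivalent to a pooled-variance two-sample t-test.

```python
import numpy as np
import pandas as pd
from scipy import stats

RACE_COLS = ["black_pct", "american_indian_alaska_native_pct", "asian_pct",
             "hawaiian_pacific_islander_pct", "multiple_race_pct", "other_race_pct"]

def dominance(df):
    """Label rows 'white' or 'minority'; ties and missing data get no label."""
    minority = df[RACE_COLS].sum(axis=1, min_count=1)
    label = np.where(df["white_pct"] > minority, "white", "minority")
    ok = df["white_pct"].notna() & minority.notna() & (df["white_pct"] != minority)
    return pd.Series(label, index=df.index).where(ok)

# Invented example rows (illustrative only); the last row has no census data
demo = pd.DataFrame({
    "white_pct": [70.0, 65.0, 80.0, 30.0, 20.0, 40.0, np.nan],
    "black_pct": [20.0, 25.0, 10.0, 40.0, 50.0, 35.0, np.nan],
    "asian_pct": [10.0, 10.0, 10.0, 30.0, 30.0, 25.0, np.nan],
    "hvi": [900e3, 850e3, 950e3, 600e3, 550e3, 620e3, 700e3],
})
for col in RACE_COLS:
    if col not in demo:
        demo[col] = 0.0  # remaining race shares set to zero for the example

demo["group"] = dominance(demo)
labels = demo["group"]

# One-way ANOVA of housing value across the two neighborhood types
groups = [g["hvi"].to_numpy() for _, g in demo.dropna(subset=["group"]).groupby("group")]
f_stat, p_value = stats.f_oneway(*groups)
print(labels.tolist(), round(f_stat, 2), round(p_value, 4))
```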

## Graphical Analysis

Guidance:

Create various types of plots to visualize relationships within your data. Use histograms, bar charts, scatter plots, box plots, and any other suitable graphical representations you've learned.

Be sure to use appropriate titles, labels, and legends to make your plots readable and informative.

Interpret the graphical representations to uncover patterns, trends, and outliers.


In [None]:
# May be redundant with the above, maybe add some pairplots

## Comparative Analysis

Guidance:

Compare and contrast different subsets of your data. This can include comparisons over time, across different categories, or any other relevant segmentation.  Note that for some projects, the nature of this comparative analysis will be obvious.  For others, you will need to think about how you might subset your data.

Discuss any notable similarities or differences you have identified.


Subsets to examine/comparisons to make:
1. Housing prices, evictions, and school attendance by race (minority-dominant vs white-dominant neighborhoods)
    - Multi-line plots: housing prices, evictions, and school attendance, each as a separate figure, plotted by racial percentage (white, Black, Asian, Pacific Islander, etc.)
2. Housing prices, evictions, and school attendance by income level (low, medium, high income neighborhoods)
    - Multi-line plots as well

Most of this is done in the inferential statistics section. Can reproduce or reference the ANOVAs and add plots to substantiate, then discuss.
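A multi-line plot of the kind described in the list above can be produced by averaging within each year and group, then pivoting the groups into columns. The sketch below uses invented numbers and a hypothetical `income_group` column. The same pattern would apply to evictions or attendance, or to neighborhood race categories.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented example data: two observations per year and income group
demo = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018, 2019, 2019] * 2,
    "income_group": ["low"] * 6 + ["high"] * 6,
    "hvi": [400e3, 420e3, 430e3, 450e3, 470e3, 490e3,
            800e3, 820e3, 850e3, 870e3, 900e3, 940e3],
})

# Mean housing value per year and group, with one column (one line) per group
trend = demo.groupby(["year", "income_group"])["hvi"].mean().unstack("income_group")

ax = trend.plot(marker="o", figsize=(6, 4))
ax.set_title("Mean home value index by income group (synthetic data)")
ax.set_xlabel("Year")
ax.set_ylabel("Home value index (USD)")
ax.set_xticks(trend.index)
ax.legend(title="Income group")
plt.tight_layout()
```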

In [None]:
# Maybe do additional comparisons based on demographic characteristics

## Multivariate Analysis

Guidance:

Perform multivariate analysis to understand the relationships among three or more variables in your dataset.

Use techniques like cross-tabulation, pivot tables, and multivariate graphs.


In [None]:
# Pairplot and correlation heatmap for all variables
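For the pivot tables and cross-tabulations named in the guidance, a small sketch is given below. The `borough` and `pct_chronically_absent` columns exist in the dataset, while `income_group` is a hypothetical derived column and all values are invented.

```python
import pandas as pd

# Invented example rows (illustrative only)
demo = pd.DataFrame({
    "borough": ["Brooklyn", "Brooklyn", "Bronx", "Bronx", "Bronx", "Brooklyn"],
    "income_group": ["low", "high", "low", "low", "high", "low"],
    "pct_chronically_absent": [30.0, 12.0, 40.0, 36.0, 20.0, 34.0],
})

# Mean chronic absenteeism for each borough x income-group cell
pivot = demo.pivot_table(index="borough", columns="income_group",
                         values="pct_chronically_absent", aggfunc="mean")

# Number of observations behind each cell, to flag thinly populated cells
counts = pd.crosstab(demo["borough"], demo["income_group"])

print(pivot)
print(counts)
```

Reporting the counts next to the means makes it easier to see which cells rest on too few schools to interpret.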

## Synthesis

Guidance:

Synthesize the findings from your descriptive and inferential statistics along with your graphical analyses to answer your research questions.

Discuss how the combination of the datasets has provided added value in terms of insights or capabilities that would not be possible with the individual datasets in isolation.


More expanded version of the interpretation blocks for descriptive, inferential statistics

## Reflection

Include a section (using one or more markdown blocks) at the end of your notebook in which you reflect on the process of analyzing the data. Discuss any challenges you encountered and how you overcame them.

Critically evaluate the limitations of your analysis and suggest areas for further research or improvement.


Obvious: reducing complex socioeconomic phenomena such as school truancy and evictions to a few variables likely oversimplifies the problem. The relationships between housing prices, eviction rates, and school attendance are likely to be mediated by a variety of other factors, such as income, employment, and social services.