# Impact of Affordable Housing Status on Housing Prices in Nashville Census Tracts
For the analysis in this notebook, each tract is categorized into one of three groups based on its "affordable housing status". These three groups are:
1. Tracts with no affordable housing developments
2. Tracts with affordable housing developments completed before 2010
3. Tracts with affordable housing developments completed since 2010.

Note: A tract with both an affordable housing development completed before 2010 and a development completed since 2010 is placed in the second group (completed before 2010).


In [None]:
import pandas as pd
import sqlite3 as sql
import matplotlib.pyplot as plt
import seaborn as sns

### Read in cleaned sales data & set datatypes

The data file "sales_cleansed_data.csv" is created by the ***sales_data_cleansing.ipynb notebook, which must be run before this notebook.***

In [None]:
#Datatypes dictionary
sales_datatypes = {'apn':'str',
                   'pin':'str',
                   'saleamount':'int',
                   "tract":"str",
                   "saleyear":"int"}

In [None]:
sales_df = pd.read_csv('../data/sales_cleansed_data.csv', dtype = sales_datatypes)

In [None]:
sales_df['saledate'] = pd.to_datetime(sales_df.saledate)
sales_df['ownerdate'] = pd.to_datetime(sales_df.ownerdate)

In [None]:
sales_df.info()

### Average multiple sales for same property in same year

In [None]:
#Create new dataframe subset of columns.
sales_by_apn_year_df = sales_df[['tract','apn','saleyear','saleamount']]
sales_by_apn_year_df.head()

In [None]:
sales_by_apn_year_df = sales_by_apn_year_df.groupby(by=['tract','apn','saleyear']).agg('mean')
sales_by_apn_year_df = sales_by_apn_year_df.reset_index()

***Note:*** The ***sales_by_apn_year_df*** dataframe contains sales data at the property/parcel level (APN). The next several steps create a dataframe with information for each tract; this tracts dataframe will be merged into the sales_by_apn_year_df dataframe to add columns with data about each sold property's census tract. ***The sales_by_apn_year_df dataframe will then serve as the starting point for each type of analysis that follows.***

### Build Tracts Dataframe

A dataframe with information for each tract will be built through the following four steps.

#### Build Tracts Dataframe: 1. List of Tracts with Sales Data

A list of tracts with relevant sales data is created for use later in step 4. 

In [None]:
#List of tracts for which we have relevant sales data.
tracts_with_sales = list(sales_df.tract.unique())
print(len(tracts_with_sales),'census tracts contain relevant sales since 2010.')

#### Build Tracts Dataframe: 2. Initial lists of tracts with (and without) affordable housing developments

Lists of tracts with and without affordable housing developments are created for use later in step 4.

In [None]:
lihtc_datatypes = {'FIPS2010':'str'}
lihtc_df = pd.read_csv('../data/LIHTC.csv', dtype = lihtc_datatypes)

In [None]:
#Initial list of tracts with affordable home developments.
lihtc_df['FIPS2010'] = lihtc_df.FIPS2010.str[-5:]
tracts_with_projects = list(lihtc_df.FIPS2010.unique())
#add tract from Barnes projects
tracts_with_projects = tracts_with_projects + ['17200']
print(len(tracts_with_projects),'census tracts contain affordable housing developments.')

In [None]:
tracts_with_sales_no_projects = list(set(tracts_with_sales) - set(tracts_with_projects))
print(len(tracts_with_sales_no_projects),'census tracts do not contain affordable housing developments.')

In [None]:
tracts_with_projects_no_sales = list(set(tracts_with_projects) - set(tracts_with_sales))
tracts_with_projects_no_sales
#This list is empty, which confirms that every census tract that contains an affordable housing development
#has had at least one relevant property sale since 2010.

In [None]:
#Initial list of tracts with affordable housing developments completed before 2010. This list was created by
#a separate analysis focused on the census tract and year completed for each development in LIHTC.csv.
tracts_with_projects_2010 = ['11001', '19400', '11800', '11300', '11200', '19200', '12701', 
                             '13900', '13800', '12702', '13700', '12600', '14300', '13500', 
                             '12801', '19300', '10903', '19500', '11400', '12100', '17401', 
                             '15618', '17000', '11700', '15622', '17100', '13300', '10302', 
                             '19116', '18202', '15629', '15502', '10501', '11002', '10402', 
                             '16200', '16100', '10401', '13601', '19105', '15900', '10301', 
                             '15610', '15620', '14400', '10702', '15623', '15628', '19005']
print(len(tracts_with_projects_2010), 'census tracts contain affordable housing developments completed before 2010.')

In [None]:
#Initial list of tracts with affordable housing developments completed since 2010.
tracts_with_projects_after_2010 = list(set(tracts_with_projects) - set(tracts_with_projects_2010))
print(len(tracts_with_projects_after_2010),'census tracts contain affordable housing developments completed since 2010.')

#### Build Tracts Dataframe: 3. Initial dataframe of Tracts' Median Household Incomes 

Dataframes with median household incomes for each tract are created for later use in step 4.

In [None]:
#Connect with the census database and retrieve data from table B19013, which contains median household incomes
#by census tract.
conn = sql.connect('../data/census.sqlite')
B19013_med_hhold_inc = pd.read_sql("SELECT * FROM B19013;", conn) #pd.read_sql uses the connection directly; no cursor is needed.
conn.close()

In [None]:
#Change datatype of tract column to string.
B19013_med_hhold_inc['tract'] = B19013_med_hhold_inc.tract.astype(str)

In [None]:
#Create dataframe with median of each tract's median household incomes from 2010 to 2020.
med_hhinc_2010_2020_df = B19013_med_hhold_inc.loc[((B19013_med_hhold_inc.year >= 2010)
                                                  & (B19013_med_hhold_inc.year <= 2020))
                                                    & (B19013_med_hhold_inc.value > 0)]
med_hhinc_2010_2020_df = med_hhinc_2010_2020_df[['tract','value']]
med_hhinc_2010_2020_df = med_hhinc_2010_2020_df.groupby(by=['tract']).agg('median')
med_hhinc_2010_2020_df = med_hhinc_2010_2020_df.reset_index()
med_hhinc_2010_2020_df = med_hhinc_2010_2020_df.round({'value': 0})
med_hhinc_2010_2020_df = med_hhinc_2010_2020_df.rename(columns={'value':'med_hhinc_2010_2020'})
med_hhinc_2010_2020_df.head()

In [None]:
#Create dataframe with median of each tract's median household incomes from 2015 to 2020.
med_hhinc_2015_2020_df = B19013_med_hhold_inc.loc[((B19013_med_hhold_inc.year >= 2015)
                                                  & (B19013_med_hhold_inc.year <= 2020))
                                                    & (B19013_med_hhold_inc.value > 0)]
med_hhinc_2015_2020_df = med_hhinc_2015_2020_df[['tract','value']]
med_hhinc_2015_2020_df = med_hhinc_2015_2020_df.groupby(by=['tract']).agg('median')
med_hhinc_2015_2020_df = med_hhinc_2015_2020_df.reset_index()
med_hhinc_2015_2020_df = med_hhinc_2015_2020_df.round({'value': 0})
med_hhinc_2015_2020_df = med_hhinc_2015_2020_df.rename(columns={'value':'med_hhinc_2015_2020'})
med_hhinc_2015_2020_df.head()
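
The two cells above differ only in the year window and the output column name. As a minimal sketch (not part of the original notebook), the shared logic could be wrapped in one function. The name `median_income_by_tract` and the toy data are hypothetical; the column names `tract`, `year`, and `value` are taken from the cells above.

```python
import pandas as pd

def median_income_by_tract(income_df, start_year, end_year):
    """Median of each tract's positive median household incomes in a year window."""
    # Series.between is inclusive on both ends, matching the >= and <= filters above.
    in_window = income_df.year.between(start_year, end_year) & (income_df.value > 0)
    return (income_df.loc[in_window, ['tract', 'value']]
            .groupby('tract', as_index=False)
            .median()
            .round({'value': 0})
            .rename(columns={'value': f'med_hhinc_{start_year}_{end_year}'}))

# Toy stand-in for the B19013 table; the values are made up for illustration.
toy_income = pd.DataFrame({
    'tract': ['10100', '10100', '10100', '10200', '10200'],
    'year':  [2009, 2012, 2018, 2016, 2019],
    'value': [30000, 40000, 50000, -1, 62000],
})
toy_2010_2020 = median_income_by_tract(toy_income, 2010, 2020)
toy_2015_2020 = median_income_by_tract(toy_income, 2015, 2020)
```

With the real table, the equivalent calls would be `median_income_by_tract(B19013_med_hhold_inc, 2010, 2020)` and `median_income_by_tract(B19013_med_hhold_inc, 2015, 2020)`.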

#### Build Tracts Dataframe: 4. Build final tracts dataframe

A dataframe with information for each tract is built using the results of steps 1 through 3. The list of tracts with sales data is iterated through to create four related lists of tract information. The final tracts dataframe is built by combining these five lists (the tract list plus the four derived lists) into a single dataframe with five columns.

In [None]:
has_project = []  #A True/False boolean for whether a tract has affordable housing.
has_project_2010 = [] #This list becomes the "affordable housing status" for each tract.
                      #The three groupings are 'No AH', 'Built before 2010', and 'Built since 2010'.
med_hhinc_2010_2020 = [] #This is the median of each tract's median household income from 2010 to 2020.
med_hhinc_2015_2020 = [] #This is the median of each tract's median household income from 2015 to 2020.

for tract in tracts_with_sales:
    if tract in tracts_with_projects_2010:
        has_project.append(True)
        has_project_2010.append('Built before 2010')
    elif tract in tracts_with_projects_after_2010:
        has_project.append(True)
        has_project_2010.append('Built since 2010')
    else:
        has_project.append(False)
        has_project_2010.append('No AH')
    
    if med_hhinc_2010_2020_df[med_hhinc_2010_2020_df.tract == tract].shape[0] > 0:
        med_hhinc_2010_2020.append(med_hhinc_2010_2020_df[med_hhinc_2010_2020_df.tract == tract].iloc[0,1])
    else:
        med_hhinc_2010_2020.append(None)
        
    if med_hhinc_2015_2020_df[med_hhinc_2015_2020_df.tract == tract].shape[0] > 0:
        med_hhinc_2015_2020.append(med_hhinc_2015_2020_df[med_hhinc_2015_2020_df.tract == tract].iloc[0,1])
    else:
        med_hhinc_2015_2020.append(None)

In [None]:
tracts_dict = {'tract':tracts_with_sales, 
               'has_project':has_project,
               'has_project_2010':has_project_2010,
               'med_hhinc_2010_2020':med_hhinc_2010_2020,
               'med_hhinc_2015_2020':med_hhinc_2015_2020}
tracts_df = pd.DataFrame(tracts_dict)
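
The loop above could also be expressed without explicit iteration. The following is a hedged sketch on toy data, not the notebook's method: it assumes `numpy` is available, and all `toy_*` names, tract codes, and incomes are hypothetical. One tract is deliberately placed in both project lists to show that 'Built before 2010' takes precedence, as in the if/elif order above.

```python
import numpy as np
import pandas as pd

# Toy inputs; the tract codes and incomes are made up for illustration.
toy_tracts_with_sales = ['10100', '10200', '10300', '10400']
toy_before_2010 = ['10200']
toy_since_2010 = ['10200', '10300']  # '10200' appears in both lists on purpose
toy_income = pd.DataFrame({'tract': ['10100', '10200', '10300'],
                           'med_hhinc_2010_2020': [45000.0, 62000.0, 38000.0]})

toy_tracts_df = pd.DataFrame({'tract': toy_tracts_with_sales})
in_before = toy_tracts_df.tract.isin(toy_before_2010)
in_since = toy_tracts_df.tract.isin(toy_since_2010)

# np.select returns the choice for the first condition that is True,
# so a tract in both lists is labeled 'Built before 2010'.
toy_tracts_df['has_project_2010'] = np.select(
    [in_before, in_since],
    ['Built before 2010', 'Built since 2010'],
    default='No AH')
toy_tracts_df['has_project'] = in_before | in_since

# A left merge keeps every tract and leaves NaN where no income is available.
toy_tracts_df = toy_tracts_df.merge(toy_income, how='left', on='tract')
```

The left merge plays the role of the `None` branches in the loop: tracts without an income row end up with a missing value rather than being dropped.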

***tracts_df columns:***
- has_project: A True/False boolean for whether a tract has affordable housing.
- has_project_2010: The "affordable housing status" for each tract. The three groupings are 'No AH', 'Built before 2010', and 'Built since 2010'.
- med_hhinc_2010_2020: The median of each tract's median household income from 2010 to 2020.
- med_hhinc_2015_2020: The median of each tract's median household income from 2015 to 2020.

***Note:*** The columns has_project and med_hhinc_2015_2020 are not used. They were created for analyses that were later removed from this notebook.

In [None]:
tracts_df.head()

### Tracts by Affordable Housing Status Bar Chart

In [None]:
# Plot of # of tracts grouped by has_project_2010
tracts_summary = tracts_df[['tract','has_project_2010']].groupby(by=['has_project_2010']).agg('count')
tracts_summary = tracts_summary.reset_index()
tracts_summary = tracts_summary.rename(columns={'tract':'number_of_tracts'})
tracts_summary = tracts_summary.sort_values(by=['number_of_tracts'],ascending=False)
tracts_summary

In [None]:
hue_order = ['No AH',
             'Built before 2010',
             'Built since 2010',]
palette = ['#0F0064','#F3C400','#E7800C']
fig, ax = plt.subplots(figsize=(10,6))
ahs = sns.barplot(data=tracts_summary,
                  x='has_project_2010',
                  y='number_of_tracts',
                  order=hue_order, #Fix the bar order so it matches the hard-coded tick labels and counts below.
                  palette=palette
                 )
plt.title('Affordable Housing Status for Tracts with Residential Sales (2010-2022)',
          fontweight = 'bold',
          fontsize = 14)
plt.xlabel('Affordable Housing Status',
          fontweight = 'bold',
          fontsize = 12)
plt.ylabel('Number of Tracts',
          fontweight = 'bold',
          fontsize = 12)
ahs.set_ylim(bottom=0,top=110)
ahs.set_xticklabels(['No AH','Built before 2010','Built since 2010 (in or after 2010)'])
plt.text(x = 0, y=100, s='97', fontsize = 14,ha='center') #No AH
plt.text(x = 1, y=52, s='49', fontsize = 14,ha='center') #Built before 2010
plt.text(x = 2, y=15, s='13', fontsize = 14,ha='center') #Built since 2010
;

### Add tracts info to sales_by_apn_year_df

This dataframe has sales at the property/parcel level (APN) and also info about the tract that contains the parcel. This dataframe serves as the starting point for each type of analysis that follows.

In [None]:
sales_by_apn_year_df = pd.merge(sales_by_apn_year_df, tracts_df, how='left', on=['tract'])
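
As an optional sanity check (a sketch on toy data, not part of the original analysis), `pd.merge` can verify that each tract appears only once in the tracts dataframe and can flag sales whose tract found no match. The `toy_*` frames and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for sales_by_apn_year_df and tracts_df; values are made up.
toy_sales = pd.DataFrame({'tract': ['10100', '10100', '10200', '10900'],
                          'apn': ['a', 'b', 'c', 'd'],
                          'saleyear': [2010, 2011, 2010, 2012],
                          'saleamount': [100000.0, 120000.0, 90000.0, 150000.0]})
toy_tracts = pd.DataFrame({'tract': ['10100', '10200'],
                           'has_project_2010': ['No AH', 'Built before 2010']})

# validate='many_to_one' raises a MergeError if a tract is duplicated on the right;
# indicator=True adds a '_merge' column showing where each row came from.
merged = pd.merge(toy_sales, toy_tracts, how='left', on='tract',
                  validate='many_to_one', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'tract'].unique().tolist()
```

In this notebook the tracts dataframe is built from the tracts present in the sales data, so `unmatched` would be expected to be empty; the check is mainly a guard against later changes.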

### Median sale prices by affordable housing status and year

In [None]:
#Create new dataframe without apn and tract columns.
sales_by_year_df = sales_by_apn_year_df[['has_project_2010',
                                         'saleyear',
                                         'saleamount']]

In [None]:
#Aggregate to find median price by has_project_2010 and year
sales_by_year_df = sales_by_year_df.groupby(by=['has_project_2010',
                                                'saleyear']).agg('median')
sales_by_year_df = sales_by_year_df.reset_index()

In [None]:
sales_by_year_df = sales_by_year_df.rename(columns={'saleamount':'medprice'})
# medprice is short for median sale price

In [None]:
sales_by_year_df = sales_by_year_df.round({'medprice': 0})

In [None]:
sales_by_year_df.head()

In [None]:
hue_order = ['No AH',
             'Built before 2010',
             'Built since 2010']
palettedict = {'No AH':'#0F0064',
               'Built before 2010':'#F3C400',
               'Built since 2010':'#E7800C'}
fig, ax = plt.subplots(figsize=(10,5))
lp = sns.lineplot(data=sales_by_year_df,
         x='saleyear',
         y='medprice',
         marker='o',
         markersize=5,
         hue='has_project_2010',
         hue_order=hue_order,
         palette=palettedict
         )
lp.set_ylim(0,500000)
plt.title('Annual Median Sale Price by Affordable Housing Status',
          fontweight = 'bold',
          fontsize = 14)
plt.xlabel('')
plt.ylabel('Median Sale Price',
          fontweight = 'bold',
          fontsize = 12)
plt.legend(loc="lower right")
lp.set_xticks([2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022])
lp.set_yticks([0,100000,200000,300000,400000,500000])
lp.set_yticklabels(['$0','$100,000','$200,000','$300,000','$400,000','$500,000'])
plt.text(x = 2014, y=100000, s='AH in 9 of these 13 tracts by 2014', fontsize = 10,ha='left')
plt.text(x = 2018, y=200000, s='AH in all of these tracts by 2018', fontsize = 10,ha='left');

### Annual Growth by affordable housing status and year

A column is added to sales_by_year_df that contains the percentage change in median sale price compared to the prior year.

In [None]:
for index, row in sales_by_year_df.iterrows():
    if row['saleyear'] > 2010:
        prioryearrow = sales_by_year_df.loc[(sales_by_year_df.has_project_2010 == row['has_project_2010'])
                                            & (sales_by_year_df.saleyear == row['saleyear'] - 1)]
        priormedprice = prioryearrow.iloc[0,2]
        sales_by_year_df.loc[index, 'pctpricegrowth'] = 100 * (row['medprice'] - priormedprice) / priormedprice
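
The same year-over-year calculation can be sketched with a grouped `pct_change`. This is an alternative shown on made-up toy data, not the notebook's method. Note one assumption: `pct_change` compares each row with the previous row in its group, so it matches the loop above only when every group has one row per consecutive year.

```python
import pandas as pd

# Toy stand-in for sales_by_year_df; the prices are made up for illustration.
toy = pd.DataFrame({
    'has_project_2010': ['No AH', 'No AH', 'No AH',
                         'Built before 2010', 'Built before 2010'],
    'saleyear': [2010, 2011, 2012, 2010, 2011],
    'medprice': [100000.0, 110000.0, 99000.0, 80000.0, 100000.0],
})

# Sort so that, within each status group, rows run in year order.
toy = toy.sort_values(['has_project_2010', 'saleyear'])
# The first year of each group has no prior year, so its growth is NaN.
toy['pctpricegrowth'] = toy.groupby('has_project_2010')['medprice'].pct_change() * 100
```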

In [None]:
sales_by_year_df.head()

In [None]:
hue_order = ['No AH',
             'Built before 2010',
             'Built since 2010']
palettedict = {'No AH':'#0F0064',
               'Built before 2010':'#F3C400',
               'Built since 2010':'#E7800C'}
fig, ax = plt.subplots(figsize=(10,5))
pg = sns.lineplot(data=sales_by_year_df,
         x='saleyear',
         y='pctpricegrowth',
         marker='o',
         markersize=5,
         hue='has_project_2010',
         hue_order=hue_order,
         palette=palettedict
         )
plt.title('Annual Sale Price Growth by Affordable Housing Status',
          fontweight = 'bold',
          fontsize = 14)
plt.xlabel('')
plt.ylabel('% Growth in Median Sale Price',
          fontweight = 'bold',
          fontsize = 12)
pg.set_xticks([2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022])
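pg.set_ylim(bottom=-10,top=30)
pg.set_yticks([-10,-5,0,5,10,15,20,25,30]) #Set tick positions explicitly so the labels below line up with them.
pg.set_yticklabels(['-10%','-5%','0%','5%','10%','15%','20%','25%','30%'])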
plt.legend(loc="lower right")
;

### Analysis of 2010-2020 sales price growth vs median household income by tract

In [None]:
#Create new dataframe with a subset of columns from sales_by_apn_year_df.
sales_by_tract_year_df = sales_by_apn_year_df[['tract',
                                               'saleyear',
                                               'saleamount',
                                               'has_project_2010',
                                               'med_hhinc_2010_2020',
                                               'med_hhinc_2015_2020']]

In [None]:
#Aggregate to find median sale price by tract and year
sales_by_tract_year_df = sales_by_tract_year_df.groupby(by=['tract',
                                                            'saleyear',
                                                            'has_project_2010',
                                                            'med_hhinc_2010_2020',
                                                            'med_hhinc_2015_2020']).agg('median')
sales_by_tract_year_df = sales_by_tract_year_df.reset_index()

In [None]:
sales_by_tract_year_df = sales_by_tract_year_df.rename(columns={'saleamount':'medprice'}) #medprice = median sale price

In [None]:
sales_by_tract_year_df = sales_by_tract_year_df.round({'medprice': 0})

In [None]:
#Create dataframe with one row per tract with data for 2020, including median household
#income for 2010-2020 and 2015-2020 periods.
growth_by_tract_df = sales_by_tract_year_df.loc[sales_by_tract_year_df.saleyear == 2020].copy() #Copy so new columns can be added without a SettingWithCopyWarning.

In [None]:
#Iterate through growth_by_tract_df to add columns for the percent growth in median sale 
#price from 2010-2020 and 2015-2020.
for index, row in growth_by_tract_df.iterrows():
    if sales_by_tract_year_df.loc[(sales_by_tract_year_df.tract == row['tract'])
                                  & (sales_by_tract_year_df.saleyear == 2010)].shape[0] > 0:
        medprice2010 = sales_by_tract_year_df.loc[(sales_by_tract_year_df.tract == row['tract'])
                                                  & (sales_by_tract_year_df.saleyear == 2010)].iloc[0,5]
        growth_by_tract_df.loc[index,'pctpricegrowth10yr'] = 100 * (row['medprice'] - medprice2010) / medprice2010
    else:
        print('tract',row['tract'],'missing 2010') #Note when a tract did not have a sale in 2010.
    
    if sales_by_tract_year_df.loc[(sales_by_tract_year_df.tract == row['tract'])
                                  & (sales_by_tract_year_df.saleyear == 2015)].shape[0] > 0:
        medprice2015 = sales_by_tract_year_df.loc[(sales_by_tract_year_df.tract == row['tract'])
                                                  & (sales_by_tract_year_df.saleyear == 2015)].iloc[0,5]
        growth_by_tract_df.loc[index,'pctpricegrowth5yr'] = 100 * (row['medprice'] - medprice2015) / medprice2015
    else:
        print('tract',row['tract'],'missing 2015') #Note when a tract did not have a sale in 2015.

In [None]:
growth_by_tract_df.describe()

In [None]:
#Count of tracts in upper quartile by median household income 2010-2020
growth_by_tract_df.loc[growth_by_tract_df.med_hhinc_2010_2020 > 59602].has_project_2010.value_counts()

In [None]:
#Count of tracts in upper quartile by growth in median sale price 2010-2020
growth_by_tract_df.loc[growth_by_tract_df.pctpricegrowth10yr > 169.189189].has_project_2010.value_counts()
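
The thresholds 59602 and 169.189189 in the two cells above are hard-coded, presumably copied from the 75% row of the `describe()` output. A minimal sketch of computing such a threshold directly with `Series.quantile` follows; the toy dataframe and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for growth_by_tract_df; the values are made up for illustration.
toy_growth = pd.DataFrame({
    'has_project_2010': ['No AH', 'No AH', 'Built before 2010',
                         'Built since 2010', 'No AH'],
    'pctpricegrowth10yr': [50.0, 120.0, 200.0, 310.0, 90.0],
})

# quantile(0.75) uses linear interpolation by default, the same as describe()'s 75% row.
threshold = toy_growth.pctpricegrowth10yr.quantile(0.75)
upper_counts = (toy_growth.loc[toy_growth.pctpricegrowth10yr > threshold]
                .has_project_2010.value_counts())
```

Computing the threshold this way keeps the upper-quartile counts in step with the data if the underlying sales file changes.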

In [None]:
hue_order = ['No AH',
             'Built before 2010',
             'Built since 2010']
palettedict = {'No AH':'#0F0064',
               'Built before 2010':'#F3C400',
               'Built since 2010':'#E7800C'}
fig, ax = plt.subplots(figsize=(7,7))
sp10 = sns.scatterplot(data=growth_by_tract_df,
                       x='med_hhinc_2010_2020',
                       y='pctpricegrowth10yr',
                       hue='has_project_2010',
                       hue_order=hue_order,
                       palette=palettedict)
plt.legend(loc="upper right")

plt.title('Price Growth vs Household Income for Nashville Tracts (2010-2020)',
          fontweight = 'bold',
          fontsize = 14)
plt.xlabel('Median Household Income *',
           fontweight = 'bold',
           fontsize = 12)
plt.ylabel('% Growth in Median Sale Price',
           fontweight = 'bold',
           fontsize = 12)
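sp10.set_xlim(0,200000)
sp10.set_xticks([0,25000,50000,75000,100000,125000,150000,175000,200000]) #Set tick positions explicitly so the labels below line up with them.
sp10.set_xticklabels(['$0','$25k','$50k','$75k','$100k',
                      '$125k','$150k','$175k','$200k'])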
sp10.set_ylim(-100,1000)
sp10.set_yticks([-100,0,100,200,300,400,500,600,700,800,900,1000])
sp10.set_yticklabels(['-100%','0%','100%','200%','300%','400%','500%','600%','700%','800%','900%','1000%'])

sp10.axhline(169.189189, linewidth=1, linestyle=':', color='black', label='1qtr')

plt.text(x = 197000, y=180, s='upper quartile', fontsize = 10,ha='right') #LIP
;

### Same scatterplot but without tracts 'Built since 2010'

This was needed to complement a graph from a different notebook, with the two graphs appearing side by side on a single presentation slide.

In [None]:
growth_by_tract_df_remove_after_2010 = growth_by_tract_df.loc[growth_by_tract_df.has_project_2010 != 'Built since 2010']
growth_by_tract_df_remove_after_2010.has_project_2010.value_counts()

In [None]:
hue_order = ['No AH',
             'Built before 2010']
palettedict = {'No AH':'#0F0064',
               'Built before 2010':'#F3C400'}
fig, ax = plt.subplots(figsize=(7,7))
sp10 = sns.scatterplot(data=growth_by_tract_df_remove_after_2010,
         x='med_hhinc_2010_2020',
         y='pctpricegrowth10yr',
         hue='has_project_2010',
         hue_order=hue_order,
         palette=palettedict
         )
plt.legend(loc="upper right")

plt.title('Price Growth vs Household Income for Nashville Tracts (2010-2020)',
          fontweight = 'bold',
          fontsize = 14)
plt.xlabel('Median Household Income',
           fontweight = 'bold',
           fontsize = 12)
plt.ylabel('% Growth in Median Sale Price',
          fontweight = 'bold',
          fontsize = 12)
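sp10.set_xlim(0,200000)
sp10.set_xticks([0,25000,50000,75000,100000,125000,150000,175000,200000]) #Set tick positions explicitly so the labels below line up with them.
sp10.set_xticklabels(['$0','$25k','$50k','$75k','$100k',
                      '$125k','$150k','$175k','$200k'])
sp10.set_ylim(-100,1000) #Same y-axis range as the first scatterplot.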

sp10.set_yticks([-100,0,100,200,300,400,500,600,700,800,900,1000])
sp10.set_yticklabels(['-100%','0%','100%','200%','300%','400%','500%','600%','700%','800%','900%','1000%'])
;