# Washington DC Housing Market Analysis

## Project Goal

The goal of this analysis is to explore Washington DC housing market data and gather initial findings. From these findings, we will reconfigure and group the Washington DC regions according to various housing statistics, so that the regions within each group show similar sale prices and other housing characteristics. This analysis will serve as the basis for future evaluation, including building predictive models for future home sale prices and other notable housing market variables.

## Summary of Data

This analysis uses housing market data from February 2012 to October 2019, including data for prices (median sale price, percentage of homes sold above list price, percentage of homes that had a price drop, etc.), inventory (number of homes on market, new listings, months of supply, etc.), and sales (number of homes sold, median days on market, etc.).

#### Data Source: https://www.redfin.com/blog/data-center

## Testing and Analysis

In order to reconfigure the Washington DC regions, we perform multiple hypothesis tests to check whether certain regions share the same housing characteristics. In our project, we focus on three variables: Median Sale Price, Homes Sold Month-over-Month, and Inventory Month-over-Month.


### Testing Methodology:

- Two-sample T test with unequal variances (Welch's t-test) to compare the means of two regions

https://en.wikipedia.org/wiki/Student%27s_t-test

- Kolmogorov-Smirnov test to compare the distributions of two regions

https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

- Wilcoxon signed-rank test to test whether the population mean ranks of two paired samples (regions) differ

https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

- Mann-Whitney rank test to test whether a randomly selected value from one region tends to be larger or smaller than one from another region

https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

- Kendall Tau test to test if two samples are independent

https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
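
As a rough illustration of the five tests listed above, the sketch below applies each of them to one pair of synthetic series. The data, the function name `compare_regions`, and the variable names are assumptions made for this example only. It also uses `scipy.stats.ttest_ind` with `equal_var=False` for the unequal-variance T test, whereas the notebook code uses the `statsmodels` implementation.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two regions' monthly series
# (93 months, matching February 2012 to October 2019).
rng = np.random.default_rng(0)
region_a = rng.normal(loc=500_000, scale=40_000, size=93)
region_b = region_a + rng.normal(loc=0, scale=10_000, size=93)  # closely tracks region_a


def compare_regions(x, y):
    """Return the p-value of each test for one pair of equally long series."""
    return {
        # Two-sample T test with unequal variances (Welch)
        "ttest": stats.ttest_ind(x, y, equal_var=False).pvalue,
        # Kolmogorov-Smirnov test on the two empirical distributions
        "kstest": stats.ks_2samp(x, y).pvalue,
        # Wilcoxon signed-rank test on the paired differences
        "wctest": stats.wilcoxon(x, y, zero_method="zsplit").pvalue,
        # Mann-Whitney rank test
        "mutest": stats.mannwhitneyu(x, y, alternative="two-sided").pvalue,
        # Kendall Tau test of independence between the two series
        "kttest": stats.kendalltau(x, y).pvalue,
    }


p_values = compare_regions(region_a, region_b)
for name, p in p_values.items():
    print(f"{name}: p = {p:.4g}")
```

Because `region_b` is built from `region_a`, the Kendall Tau p-value should be very small (the series are clearly associated), while the remaining tests compare location and distribution.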

### Assumptions: 

Because our dataset is a time series, each feature is related to its own past values over time. Therefore, practically speaking, we relax the testing assumption that the observations within each sample (region) are independent. In addition, we use the Kendall Tau test to check whether two samples are independent of each other.
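
To make the within-sample dependence concrete, here is a minimal sketch that computes the lag-1 autocorrelation of a series. The function `lag1_autocorrelation` and both example series are hypothetical and are not taken from the Redfin data; the trending series is a random walk with drift, chosen only to loosely mimic a price level.

```python
import random


def lag1_autocorrelation(series):
    """Sample autocorrelation at lag 1; values near 0 suggest little serial dependence."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    numer = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    return numer / denom


random.seed(0)
noise = [random.gauss(0, 1) for _ in range(93)]  # independent draws

# Hypothetical trending series: cumulative sum of drift plus noise.
trend = []
level = 0.0
for shock in noise:
    level += 0.5 + shock
    trend.append(level)

print("white noise:", round(lag1_autocorrelation(noise), 3))
print("trending series:", round(lag1_autocorrelation(trend), 3))
```

A value close to 1 for the trending series is the kind of dependence that the independence assumption ignores.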


### Exhaustive Approach:

We perform a pairwise test among 82 regions for each testing methodology. At the current size, there are over 3,300 comparisons (82 choose 2 = 3,321). Therefore, we run the tests in a loop over the pairs in 'combine_list'. Running this many hypothesis tests simultaneously also raises a concern about the overall significance level. To resolve this, we set the significance level $\alpha = .000001$ for each individual test. If the tests were independent, the family-wise significance level would then stay below .01 even for 10,000 tests, since $(1-\alpha)^{10000} > .99$.
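
The arithmetic behind this choice can be checked with a short sketch. It assumes the individual tests are independent, which is a simplification, and the helper `family_wise_error` is a name introduced here for illustration. The Bonferroni level is shown only for comparison and is not used in the notebook.

```python
import math

n_regions = 82
n_pairs = math.comb(n_regions, 2)  # number of pairwise comparisons
alpha = 1e-6                       # per-test significance level


def family_wise_error(alpha, m):
    """Probability of at least one false rejection across m independent tests."""
    return 1 - (1 - alpha) ** m


fwer_bound = family_wise_error(alpha, 10_000)   # conservative test count used in the text
fwer_pairs = family_wise_error(alpha, n_pairs)  # one test type over all region pairs
bonferroni_alpha = 0.01 / n_pairs               # per-test level a Bonferroni correction would give

print(n_pairs)
print(f"{fwer_bound:.6f} {fwer_pairs:.6f} {bonferroni_alpha:.2e}")
```

Both family-wise rates come out below .01, so the per-test level of .000001 is stricter than a Bonferroni correction over the 3,321 pairs would require.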

### Rationale for the current approach:

In practice, it is difficult to test two time series samples because of their dependence over time. Therefore, as mentioned in the assumptions section, we relax the independence assumption. In addition, we rely on the results of the non-parametric Wilcoxon rank test, Mann-Whitney rank test, and Kendall Tau test, because we do not want to verify the normality assumption of the two-sample T test, and the Kolmogorov-Smirnov test does not take the time factor into consideration.

We will combine the test results from all three non-parametric tests to properly decide which two regions should be combined. 

### Data Import

In [1]:
#Import code and libraries from Exploratory Data Analysis. This code takes care of all data imports and data cleaning.

%run python_folder/python_files/DC_House_Prices_EDA

## Hypothesis Tests and Visualizations

### Median Sale Price Hypothesis Tests

In [7]:
# median_sale_price is the final dataset (pandas DataFrame) with only the median sale price for each of the n regions
# Create a list combine_list containing every pair of regions: list(itertools.combinations(range(median_sale_price.shape[1]), 2))
# Run a loop to perform the hypothesis tests: T test, Wilcoxon test, KS test, Mann-Whitney rank test, Kendall Tau test

import itertools
from statsmodels.stats.weightstats import ttest_ind as t_test
from scipy.stats import wilcoxon
from scipy.stats import ks_2samp 
from scipy.stats import mannwhitneyu
from scipy.stats import kendalltau
 
combine_list = list(itertools.combinations(range(median_sale_price.shape[1]), 2))
#statistic, p-value, degree of freedom for two-sample t test
t_stat = []
p_val_t = []
df = []

# statistic and p-value for Wilcoxon test 
non_stat = []
p_val_wc = []

# statistic and p-value for KS test 
ks_stat = []
p_val_ks = []

# statistic and p-value for Mann-Whitney rank test
mu_stat = []
p_val_mu = []

# statistic and p-value for Kendall Tau test
kt_stat = []
p_val_kt = []

for i in range(len(combine_list)):
    temp1, temp2, temp3 = t_test(median_sale_price.iloc[:,combine_list[i][0]], median_sale_price.iloc[:,combine_list[i][1]], alternative = "two-sided", usevar = "unequal")
    t_stat.append(float(temp1)) 
    p_val_t.append(float(temp2))
    df.append(float(temp3))
    temp4, temp5 = wilcoxon(median_sale_price.iloc[:,combine_list[i][0]], median_sale_price.iloc[:,combine_list[i][1]], alternative = "two-sided", zero_method = "zsplit")
    non_stat.append(float(temp4))
    p_val_wc.append(float(temp5)) 
    temp6, temp7 = ks_2samp(median_sale_price.iloc[:,combine_list[i][0]], median_sale_price.iloc[:,combine_list[i][1]], alternative = "two-sided", mode = 'asymp')
    ks_stat.append(float(temp6))
    p_val_ks.append(float(temp7))
    temp8, temp9 = mannwhitneyu(median_sale_price.iloc[:,combine_list[i][0]], median_sale_price.iloc[:,combine_list[i][1]], alternative = "two-sided")
    mu_stat.append(float(temp8))
    p_val_mu.append(float(temp9))
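    # Use median_sale_price here as well; 'A' is not defined in this notebook
    temp10, temp11 = kendalltau(median_sale_price.iloc[:,combine_list[i][0]], median_sale_price.iloc[:,combine_list[i][1]])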
    kt_stat.append(float(temp10))
    p_val_kt.append(float(temp11))
    
test_matrix = pd.DataFrame(list(zip(combine_list, t_stat, p_val_t, df, non_stat, p_val_wc, ks_stat, p_val_ks, mu_stat, p_val_mu, kt_stat, p_val_kt)), 
               columns =['combinatory', 'Stat_ttest', 'p_ttest', 'df_ttest', 'Stat_wctest', 'p_wctest', 'Stat_kstest', 'p_kstest', 'Stat_mutest', 'p_mutest', 'Stat_kttest', 'p_kttest'])

#test_matrix contains the test results for our median_sale_price hypothesis tests

In [9]:
# Create a function to detect the hypothesis testing results. 0: reject null, 1: fail to reject null
def test_p_value(p = .05, name = 'ttest'):
    name_list = []
    for i in range(len(test_matrix)):
        if test_matrix["p_" + name][i] < p:
            name_list.append(0) 
        else:
            name_list.append(1)   
    test_matrix['index_' + name] = name_list
    
    
test_p_value(p = .000001, name = 'ttest')
test_p_value(p = .000001, name = 'wctest')
test_p_value(p = .000001, name = 'kstest')
test_p_value(p = .000001, name = 'mutest')
test_p_value(p = .000001, name = 'kttest')
#test_matrix
#test_matrix.shape

# Save the final testing results for median_sale_price in data folder
test_matrix.to_csv('data/TM_Median Sale Price.csv')
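
The same flagging step is written separately for each test matrix in this notebook. A minimal sketch of a single reusable, vectorized version is shown below. The function `flag_tests` and the toy p-values are hypothetical and introduced only for illustration; the 0/1 coding follows the convention above (0: reject null, 1: fail to reject null).

```python
import pandas as pd


def flag_tests(results, alpha=0.05, names=("ttest", "wctest", "kstest", "mutest")):
    """Return a copy with index_<name> columns: 0 = reject the null, 1 = fail to reject."""
    flagged = results.copy()
    for name in names:
        # ~(p < alpha) keeps the original behaviour for missing p-values (flagged as 1)
        flagged["index_" + name] = (~(flagged["p_" + name] < alpha)).astype(int)
    return flagged


# Toy p-values standing in for a test matrix.
toy = pd.DataFrame({
    "combinatory": [(0, 1), (0, 2), (1, 2)],
    "p_wctest": [0.40, 1e-9, 0.03],
    "p_mutest": [0.55, 1e-8, 0.20],
})

flagged = flag_tests(toy, alpha=1e-6, names=("wctest", "mutest"))
print(flagged[["combinatory", "index_wctest", "index_mutest"]])
```

Passing the DataFrame as an argument and returning a copy avoids relying on a global variable, so one function could serve all three test matrices.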

#### Region Grouping

In [None]:
# Find all region pairs with similar characteristics: Wilcoxon and Mann-Whitney tests fail to reject (index == 1)
# and the Kendall Tau test rejects independence (index == 0)
index_wc = test_matrix['index_wctest'] == 1
index_mu = test_matrix['index_mutest'] == 1
index_kt = test_matrix['index_kttest'] == 0
#test_matrix[index_wc]
#test_matrix[index_mu]
test_matrix[index_wc & index_mu & index_kt][0:40]

In [None]:
#Find names of matched regions from above cell
df_list[1]['Region_1'][0]
df_list[18]['Region_18'][0]
df_list[35]['Region_35'][0]
df_list[65]['Region_65'][0]
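
The matched regions above are read off the filtered table by hand. As a hedged sketch, matched pairs could also be merged into groups automatically by linking regions that share a matched pair. The function `group_regions` and the example pairs are hypothetical. Note that this links regions transitively (connected components), which is looser than requiring every pair inside a group to match.

```python
from collections import defaultdict


def group_regions(matched_pairs):
    """Group region indices that are linked by at least one matched pair."""
    parent = {}

    def find(node):
        # Follow parent links to the representative of the node's group
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical matched pairs; in the notebook they would come from the
# 'combinatory' column of the filtered test matrix.
pairs = [(1, 18), (18, 35), (35, 65), (4, 7)]
print(group_regions(pairs))
```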

In [17]:
# Set up data time frame
time_frame = pd.date_range('2012-02-01','2019-10-01', 
              freq='MS').strftime("%Y-%b").tolist()

### Median Sale Price Data Visualizations

In [18]:
#Create dataset of first 8 regions in dataset for visualization purposes, using median_sale_price
df_temp = pd.DataFrame(list(zip(time_frame, median_sale_price['Median Sale Price_0'], median_sale_price['Median Sale Price_1'], median_sale_price['Median Sale Price_2'], median_sale_price['Median Sale Price_3'], median_sale_price['Median Sale Price_4'], median_sale_price['Median Sale Price_5'], median_sale_price['Median Sale Price_6'], median_sale_price['Median Sale Price_7'])), 
               columns =['Time', df_list[0]['Region_0'][0], df_list[1]['Region_1'][0], df_list[2]['Region_2'][0], df_list[3]['Region_3'][0], df_list[4]['Region_4'][0], df_list[5]['Region_5'][0], df_list[6]['Region_6'][0], df_list[7]['Region_7'][0]])

In [None]:
# plot of first 8 regions in dataset, using median_sale_price
# Pass x and y as keyword arguments, which newer seaborn versions require
for k in range(1, 9):
    sns_plot = sns.lineplot(x=df_temp['Time'], y=df_temp.iloc[:, k], label=df_temp.columns[k])


plt.legend()
plt.title("Time Series - Median House Price",fontsize = 25)
plt.xlabel("Time", fontsize = 15)
plt.ylabel("Median House Price", fontsize = 25)
sns.set(rc={'figure.figsize':(15,10)})
plt.xticks(range(5, 100, 12))
#plt.xlim()
plt.show()

#save image of visualization
sns_plot.figure.savefig("data_visualizations/Time Series - Median House Price.png")

In [22]:
#Create dataset of 4 matched regions, based on median sale price
df_temp = pd.DataFrame(list(zip(time_frame, median_sale_price['Median Sale Price_1'], median_sale_price['Median Sale Price_18'], median_sale_price['Median Sale Price_35'], median_sale_price['Median Sale Price_65'])), 
               columns =['Time', df_list[1]['Region_1'][0], df_list[18]['Region_18'][0], df_list[35]['Region_35'][0], df_list[65]['Region_65'][0]])
#df_temp1 = df_temp.set_index(df['Time'])

In [None]:
# plot of 4 matched regions, based on median sale price
# Pass x and y as keyword arguments, which newer seaborn versions require
sns_plot = sns.lineplot(x=df_temp['Time'], y=df_temp.iloc[:,1], label = 'American University Park / Friendship Heights / Tenleytown')
sns_plot = sns.lineplot(x=df_temp['Time'], y=df_temp.iloc[:,2], label = 'Chevy Chase-DC')
sns_plot = sns.lineplot(x=df_temp['Time'], y=df_temp.iloc[:,3], label = 'Foxhall Village')
sns_plot = sns.lineplot(x=df_temp['Time'], y=df_temp.iloc[:,4], label = 'Southeast Chevy Chase')

plt.legend()
plt.title("Time Series - Median House Price",fontsize = 20)
plt.xlabel("Time", fontsize = 20)
plt.ylabel("Median House Price", fontsize = 20)
sns.set(rc={'figure.figsize':(15,10)})
plt.xticks(range(5, 100, 12))
#plt.xlim()
plt.show()

#save image of visualization
sns_plot.figure.savefig("data_visualizations/Time Series - Median House Price_Group.png")

### Homes Sold Month-over-Month Hypothesis Tests

In [None]:
# homes_sold_mom is the final dataset (pandas DataFrame) with only homes sold MoM for each of the n regions
# Create a list combine_list containing every pair of regions: list(itertools.combinations(range(homes_sold_mom.shape[1]), 2))
# Run a loop to perform the hypothesis tests: T test, Wilcoxon test, KS test, Mann-Whitney rank test

import itertools
from statsmodels.stats.weightstats import ttest_ind as t_test
from scipy.stats import wilcoxon
from scipy.stats import ks_2samp 
from scipy.stats import mannwhitneyu

combine_list = list(itertools.combinations(range(homes_sold_mom.shape[1]), 2))

#statistic, p-value, degree of freedom for two-sample t test
t_stat = []
p_val_t = []
df = []

# statistic and p-value for Wilcoxon test 
non_stat = []
p_val_wc = []

# statistic and p-value for KS test 
ks_stat = []
p_val_ks = []

# statistic and p-value for Mann-Whitney rank test
mu_stat = []
p_val_mu = []

for i in range(len(combine_list)):
    temp1, temp2, temp3 = t_test(homes_sold_mom.iloc[:,combine_list[i][0]], homes_sold_mom.iloc[:,combine_list[i][1]], alternative = "two-sided", usevar = "unequal")
    t_stat.append(float(temp1)) 
    p_val_t.append(float(temp2))
    df.append(float(temp3))
    temp4, temp5 = wilcoxon(homes_sold_mom.iloc[:,combine_list[i][0]], homes_sold_mom.iloc[:,combine_list[i][1]], alternative = "two-sided", zero_method = "zsplit")
    non_stat.append(float(temp4))
    p_val_wc.append(float(temp5)) 
    temp6, temp7 = ks_2samp(homes_sold_mom.iloc[:,combine_list[i][0]], homes_sold_mom.iloc[:,combine_list[i][1]], alternative = "two-sided", mode = 'asymp')
    ks_stat.append(float(temp6))
    p_val_ks.append(float(temp7))
    temp8, temp9 = mannwhitneyu(homes_sold_mom.iloc[:,combine_list[i][0]], homes_sold_mom.iloc[:,combine_list[i][1]], alternative = "two-sided")
    mu_stat.append(float(temp8))
    p_val_mu.append(float(temp9))

test_matrix_2 = pd.DataFrame(list(zip(combine_list, t_stat, p_val_t, df, non_stat, p_val_wc, ks_stat, p_val_ks, mu_stat, p_val_mu)), 
               columns =['combinatory', 'Stat_ttest', 'p_ttest', 'df_ttest', 'Stat_wctest', 'p_wctest', 'Stat_kstest', 'p_kstest', 'Stat_mutest', 'p_mutest'])
#test_matrix_2 contains the test results for our homes_sold_mom hypothesis tests

In [27]:
# Create a function to detect the hypothesis testing results. 0: reject null, 1: fail to reject null
def test_p_value_2(p = .05, name = 'ttest'):
    name_list = []
    for i in range(len(test_matrix_2)):
        if test_matrix_2["p_" + name][i] < p:
            name_list.append(0) 
        else:
            name_list.append(1)   
    test_matrix_2['index_' + name] = name_list
    

test_p_value_2(p = .05, name = 'ttest')
test_p_value_2(p = .05, name = 'wctest')
test_p_value_2(p = .05, name = 'kstest')
test_p_value_2(p = .05, name = 'mutest')
#(test_matrix_2.index_wctest == 1)


# Save the final testing results for homes_sold_mom in data folder
test_matrix_2.to_csv('data/TM_Homes Sold MoM.csv')

### Homes Sold Month-over-Month Data Visualizations

In [30]:
#Create dataset of first 8 regions in dataset for visualization purposes, using homes_sold_mom
df_temp_2 = pd.DataFrame(list(zip(time_frame, homes_sold_mom['Homes Sold MoM _0'], homes_sold_mom['Homes Sold MoM _1'], homes_sold_mom['Homes Sold MoM _2'], homes_sold_mom['Homes Sold MoM _3'], homes_sold_mom['Homes Sold MoM _4'], homes_sold_mom['Homes Sold MoM _5'], homes_sold_mom['Homes Sold MoM _6'], homes_sold_mom['Homes Sold MoM _7'])), 
               columns =['Time', df_list[0]['Region_0'][0], df_list[1]['Region_1'][0], df_list[2]['Region_2'][0], df_list[3]['Region_3'][0], df_list[4]['Region_4'][0], df_list[5]['Region_5'][0], df_list[6]['Region_6'][0], df_list[7]['Region_7'][0]])

In [None]:
# plot of first 8 regions in dataset, using homes_sold_mom
# Pass x and y as keyword arguments, which newer seaborn versions require
for k in range(1, 9):
    sns_plot = sns.lineplot(x=df_temp_2['Time'], y=df_temp_2.iloc[:, k], label=df_temp_2.columns[k])


plt.legend()
plt.title("Time Series - Homes Sold Month-over-Month",fontsize = 15)
plt.xlabel("Time", fontsize = 15)
plt.ylabel("Homes Sold Month-over-Month", fontsize = 15)
sns.set(rc={'figure.figsize':(15,10)})
plt.xticks(range(5, 100, 12))
#plt.xlim()
plt.show()

#save image of visualization
sns_plot.figure.savefig("data_visualizations/Time Series - Homes Sold Month-over-Month.png")

In [32]:
#Create dataset of 4 matched regions, based on homes_sold_mom
df_temp_2 = pd.DataFrame(list(zip(time_frame, homes_sold_mom['Homes Sold MoM _1'], homes_sold_mom['Homes Sold MoM _18'], homes_sold_mom['Homes Sold MoM _35'], homes_sold_mom['Homes Sold MoM _65'])), 
               columns =['Time', df_list[1]['Region_1'][0], df_list[18]['Region_18'][0], df_list[35]['Region_35'][0], df_list[65]['Region_65'][0]])

In [None]:
# plot of 4 matched regions, based on homes_sold_mom
# Pass x and y as keyword arguments, which newer seaborn versions require
sns_plot = sns.lineplot(x=df_temp_2['Time'], y=df_temp_2.iloc[:,1], label = 'American University Park / Friendship Heights / Tenleytown')
sns_plot = sns.lineplot(x=df_temp_2['Time'], y=df_temp_2.iloc[:,2], label = 'Chevy Chase-DC')
sns_plot = sns.lineplot(x=df_temp_2['Time'], y=df_temp_2.iloc[:,3], label = 'Foxhall Village')
sns_plot = sns.lineplot(x=df_temp_2['Time'], y=df_temp_2.iloc[:,4], label = 'Southeast Chevy Chase')

plt.legend()
plt.title("Time Series - Homes Sold Month-over-Month",fontsize = 15)
plt.xlabel("Time", fontsize = 15)
plt.ylabel("Homes Sold Month-over-Month", fontsize = 15)
sns.set(rc={'figure.figsize':(15,10)})
plt.xticks(range(5, 100, 12))
#plt.xlim()
plt.show()

#save image of visualization
sns_plot.figure.savefig("data_visualizations/Time Series - Homes Sold Month-over-Month_Group.png")

### Inventory Month-over-Month Hypothesis Tests

In [34]:
# inventory_mom is the final dataset (pandas DataFrame) with only inventory MoM for each of the n regions
# Create a list combine_list containing every pair of regions: list(itertools.combinations(range(inventory_mom.shape[1]), 2))
# Run a loop to perform the hypothesis tests: T test, Wilcoxon test, KS test, Mann-Whitney rank test

import itertools
from statsmodels.stats.weightstats import ttest_ind as t_test
from scipy.stats import wilcoxon
from scipy.stats import ks_2samp 
from scipy.stats import mannwhitneyu
combine_list = list(itertools.combinations(range(inventory_mom.shape[1]), 2))

#statistic, p-value, degree of freedom for two-sample t test
t_stat = []
p_val_t = []
df = []

# statistic and p-value for Wilcoxon test 
non_stat = []
p_val_wc = []

# statistic and p-value for KS test 
ks_stat = []
p_val_ks = []

# statistic and p-value for Mann-Whitney rank test
mu_stat = []
p_val_mu = []

for i in range(len(combine_list)):
    temp1, temp2, temp3 = t_test(inventory_mom.iloc[:,combine_list[i][0]], inventory_mom.iloc[:,combine_list[i][1]], alternative = "two-sided", usevar = "unequal")
    t_stat.append(float(temp1)) 
    p_val_t.append(float(temp2))
    df.append(float(temp3))
    temp4, temp5 = wilcoxon(inventory_mom.iloc[:,combine_list[i][0]], inventory_mom.iloc[:,combine_list[i][1]], alternative = "two-sided", zero_method = "zsplit")
    non_stat.append(float(temp4))
    p_val_wc.append(float(temp5)) 
    temp6, temp7 = ks_2samp(inventory_mom.iloc[:,combine_list[i][0]], inventory_mom.iloc[:,combine_list[i][1]], alternative = "two-sided", mode = 'asymp')
    ks_stat.append(float(temp6))
    p_val_ks.append(float(temp7))
    temp8, temp9 = mannwhitneyu(inventory_mom.iloc[:,combine_list[i][0]], inventory_mom.iloc[:,combine_list[i][1]], alternative = "two-sided")
    mu_stat.append(float(temp8))
    p_val_mu.append(float(temp9))
test_matrix_3 = pd.DataFrame(list(zip(combine_list, t_stat, p_val_t, df, non_stat, p_val_wc, ks_stat, p_val_ks, mu_stat, p_val_mu)), 
               columns =['combinatory', 'Stat_ttest', 'p_ttest', 'df_ttest', 'Stat_wctest', 'p_wctest', 'Stat_kstest', 'p_kstest', 'Stat_mutest', 'p_mutest'])
#test_matrix_3 contains the test results for our inventory_mom hypothesis tests

In [36]:
# Create a function to detect the hypothesis testing results. 0: reject null, 1: fail to reject null

def test_p_value_3(p = .05, name = 'ttest'):
    name_list = []
    for i in range(len(test_matrix_3)):
        if test_matrix_3["p_" + name][i] < p:
            name_list.append(0) 
        else:
            name_list.append(1)   
    test_matrix_3['index_' + name] = name_list
test_p_value_3(p = .05, name = 'ttest')
test_p_value_3(p = .05, name = 'wctest')
test_p_value_3(p = .05, name = 'kstest')
test_p_value_3(p = .05, name = 'mutest')
test_matrix_3

#save inventory_mom test matrix to data folder
test_matrix_3.to_csv('data/TM_Inventory MoM.csv')

### Inventory Month-over-Month Data Visualizations

In [39]:
#Create dataset of first 8 regions in dataset for visualization purposes, using inventory_mom
df_temp_3 = pd.DataFrame(list(zip(time_frame, inventory_mom['Inventory MoM _0'], inventory_mom['Inventory MoM _1'], inventory_mom['Inventory MoM _2'], inventory_mom['Inventory MoM _3'], inventory_mom['Inventory MoM _4'], inventory_mom['Inventory MoM _5'], inventory_mom['Inventory MoM _6'], inventory_mom['Inventory MoM _7'])), 
               columns =['Time', df_list[0]['Region_0'][0], df_list[1]['Region_1'][0], df_list[2]['Region_2'][0], df_list[3]['Region_3'][0], df_list[4]['Region_4'][0], df_list[5]['Region_5'][0], df_list[6]['Region_6'][0], df_list[7]['Region_7'][0]])

In [None]:
# plot of first 8 regions in dataset, using inventory_mom
# Pass x and y as keyword arguments, which newer seaborn versions require
for k in range(1, 9):
    sns_plot = sns.lineplot(x=df_temp_3['Time'], y=df_temp_3.iloc[:, k], label=df_temp_3.columns[k])


plt.legend()
plt.title("Time Series - Inventory Month-over-Month",fontsize = 15)
plt.xlabel("Time", fontsize = 15)
plt.ylabel("Inventory Month-over-Month", fontsize = 15)
sns.set(rc={'figure.figsize':(15,10)})
plt.xticks(range(5, 100, 12))
#plt.xlim()
plt.show()

#save image of visualization
sns_plot.figure.savefig("data_visualizations/Time Series - Inventory Month-over-Month.png")

In [41]:
#Create dataset of 4 matched regions, based on inventory_mom
df_temp_3 = pd.DataFrame(list(zip(time_frame, inventory_mom['Inventory MoM _1'], inventory_mom['Inventory MoM _18'], inventory_mom['Inventory MoM _35'], inventory_mom['Inventory MoM _65'])), 
               columns =['Time', df_list[1]['Region_1'][0], df_list[18]['Region_18'][0], df_list[35]['Region_35'][0], df_list[65]['Region_65'][0]])

In [None]:
# plot of 4 matched regions, based on inventory_mom
# Pass x and y as keyword arguments, which newer seaborn versions require
sns_plot = sns.lineplot(x=df_temp_3['Time'], y=df_temp_3.iloc[:,1], label = 'American University Park / Friendship Heights / Tenleytown')
sns_plot = sns.lineplot(x=df_temp_3['Time'], y=df_temp_3.iloc[:,2], label = 'Chevy Chase-DC')
sns_plot = sns.lineplot(x=df_temp_3['Time'], y=df_temp_3.iloc[:,3], label = 'Foxhall Village')
sns_plot = sns.lineplot(x=df_temp_3['Time'], y=df_temp_3.iloc[:,4], label = 'Southeast Chevy Chase')

plt.legend()
plt.title("Time Series - Inventory Month-over-Month",fontsize = 15)
plt.xlabel("Time", fontsize = 15)
plt.ylabel("Inventory Month-over-Month", fontsize = 15)
sns.set(rc={'figure.figsize':(15,10)})
plt.xticks(range(5, 100, 12))
#plt.xlim()
plt.show()

#save image of visualization
sns_plot.figure.savefig("data_visualizations/Time Series - Inventory Month-over-Month_Group.png")

## Results and Conclusions

As mentioned previously, we reconfigured regions based on (1) the Wilcoxon rank test, (2) the Mann-Whitney rank test, and (3) the Kendall Tau test. First, we use the Kendall Tau test to check whether regions are independent. Next, based on the test statistics from the Wilcoxon rank test and the Mann-Whitney rank test, we decide which regions should be reconfigured. We tested three housing statistics/features in our analysis: Median Home Sale Price, Homes Sold Month-over-Month, and Inventory Month-over-Month.

Based on our analysis of Median Home Sale Price, the results support our claim that some regions share similar home sale prices. For example, four regions should be reconfigured into one group: (1) American University Park / Friendship Heights / Tenleytown, (2) Chevy Chase-DC, (3) Foxhall Village, and (4) Southeast Chevy Chase. However, for Homes Sold Month-over-Month and Inventory Month-over-Month, the tests do not show a statistically significant difference among these four regions. Therefore, we recommend additional investigation of these two variables.

## What's Next

#### Use Case
These initial tests and findings can support further analysis, both to predict future Washington DC housing market statistics and to understand what influences Washington DC housing market characteristics over time.

#### Next Steps
We will use prior years’ data to build a predictive model of future home sale prices in the Washington DC housing market, by region.