# Do Duplexes Sell for Less per Square Foot than Single Family Homes?

## Hypotheses
Our null hypothesis is that duplexes do not sell for less per square foot than single family homes.
Our alternative hypothesis is that they do sell for less than single family homes..

We will try to get results with a 95% confidence, so we will set our alpha to .05

#### Possible Errors:
If we make a type 1 error, we would claim that duplexes sell for less per square foot, when in reality they do not.

On the other hand, if we make a type 2 error, we would claim that they do not sell for less, when in fact they do.

In [9]:
# First we import the libraries we will be using.
import scipy.stats as stats
import statsmodels.stats.power as power
import pandas as pd
import numpy as np

In [10]:
def tt_ind(sample1, sample2, alpha = .05, equal_var = True, tails = 2):
    """
    Takes 2 array-like objects, sample1 and sample 2: samples to test for difference
    and 1 float: the level of confidence, alpha (default .05)
    and 1 bool: whether samples have equal variances (default True)
    and a number of tails: 1 or 2 (default 2)
    performs two sample t-test and prints critical stat, test stat, and one-tailed pvalue
    """
    tcrit = stats.t.ppf(q=.05, df = len(sample1) + len(sample2)-1)
    tstat = stats.ttest_ind(sample1, sample2, equal_var = equal_var)
    if tails == 1:
        print(f'critical stat is {tcrit}, test stat is {tstat[0]} with a pvalue of {tstat[1]/2}')
    elif tails == 2:
        print(f'critical stat is {tcrit}, test stat is {tstat[0]} with a pvalue of {tstat[1]}')
    else:
        print('Please set tails to either 1 or 2')


def cohen_d(sample1, sample2):
    """
    Takes 2 array-like objects: samples to compare
    Returns a float: the standard effect size according to the Cohen D equation.
    """
    effect_size = (sample1.mean() - sample2.mean()) / np.sqrt(((len(sample1) -1) * sample1.var()
                                                         + len(sample2) -1 * sample2.var()
                                                          / len(sample1) + len(sample2) -2))
    return effect_size

## Load the data

In [11]:
salespath = r'C:\Users\caell\flatiron\projects\phase_2_project\phase_2_project_chicago-sf-seattle-ds-082420\data\EXTR_RPSale.csv'
parcelpath = r'C:\Users\caell\flatiron\projects\phase_2_project\phase_2_project_chicago-sf-seattle-ds-082420\data\EXTR_Parcel.csv'
residentialpath = r'C:\Users\caell\flatiron\projects\phase_2_project\phase_2_project_chicago-sf-seattle-ds-082420\data\EXTR_ResBldg.csv'
sales = pd.read_csv(salespath, encoding = 'ISO-8859-1')
parcels = pd.read_csv(parcelpath, encoding = 'ISO-8859-1')
residences = pd.read_csv(residentialpath, encoding = 'ISO-8859-1')

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


## Prepare the data


In [12]:
sales = sales[sales['DocumentDate'].astype(str).str.endswith('2019')]
sales = sales[(sales['SalePrice'] > 120000) & (sales['SalePrice'] < 3000000)]

### Join the tables and extract the features we want to compare

In [13]:

duplexs = parcels[parcels['PresentUse'] == 3]
duplexs = duplexs.merge(sales, on = ['Major','Minor']).merge(residences, on = ['Major','Minor'])
duplexs = duplexs[['SalePrice','SqFtTotLiving']]
duplexs['cost_per_sqft'] = duplexs.SalePrice / duplexs.SqFtTotLiving
singlefamily = parcels[parcels['PresentUse'] == 2]
singlefamily = singlefamily.merge(sales, on = ['Major','Minor']).merge(residences, on = ['Major','Minor'])
singlefamily = singlefamily[['SalePrice','SqFtTotLiving']]
singlefamily['cost_per_sqft'] = singlefamily.SalePrice / singlefamily.SqFtTotLiving


### Single family vs duplex sample size and sample means

In [14]:
sample1 = duplexs['cost_per_sqft']
sample2 = singlefamily['cost_per_sqft']
print(f'In 2019 {len(sample1)} duplexes were sold, and {len(sample2)} single family homes were sold.')

print(f'The mean cost per sqft of our samples for single family homes is {sample2.mean()}')
print(f'The mean cost per sqft of our samples for duplexs is {sample1.mean()}')

In 2019 269 duplexes were sold, and 24705 single family homes were sold.
The mean cost per sqft of our samples for single family homes is 368.8780652069677
The mean cost per sqft of our samples for duplexs is 439.0665500492637


A quick glace at the sample means seems to indicate that, in fact, duplexes sell for more per square foot than single family homes.  Let's test whether this difference is statistically significant, especially since our sample size of duplexes is much smaller than for single family homes.

## Testing for statistical significance

We will be using a two sample, one-tailed Welch's test to determine the statistical significance of the difference in means.  Our T-critical value tells us that we need a test statistic below -1.645 to confirm with 95% confidence that duplexes sell for less per square foot than single family homes.  We are looking for a pvalue of .05 or less to confirm our result.

In [16]:
tt_ind(sample1, sample2, alpha = .05, equal_var = False, tails = 1)

critical stat is -1.6449146458926769, test stat is 3.3512678738706114 with a pvalue of 0.00045785649983049495


# We cannot reject the null hypothesis

Our critical stat, which tells us if duplexes sell for less per square foot than single family homes is ~ -1.64.  Our test statistic would need to be below that for us to confidently confirm this.  

In fact, the test statistic is ~ 3.35.  We cannot reject our null hypothesis. 

### What are the chances we are wrong?
Let's check the power of our test, the chance that we would detect the lower average per square foot value of duplexes, if it were there.

In [17]:
effect = cohen_d(sample1, sample2)

power.tt_ind_solve_power(alpha = .95, 
                         nobs1 = len(sample1), 
                         ratio = len(sample1) / len(sample2),
                         alternative = 'smaller',
                         effect_size = effect)

0.9477560624182295

The power of our test is just under .95.  If, in fact, duplexes do sell for less per square foot than single family homes, we would get this same result about 5% of the time.  This gives us ~ 95% confidence that we are not mistaken. 

## It's likely that duplexes do not sell for less than single family homes.

## Next steps

We recommend testing whether duplexes, in fact may sell for more per square foot than single family homes, which our tests may indicate is the case.  If so, subdividing a home into a duplex may be a successful way for some homeowners to improve the value of their home.