# Homework 11

The Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The dataset columns are:

1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town.
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - Median value of owner-occupied homes in $1000's

Read the BostonHousing dataset and import libraries (0.1p)

In [8]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.weightstats import ztest

boston_df = pd.read_csv('BostonHousing.csv') 
boston_df

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


Problem 1: Descriptive Statistics (0.2p)
Compute the mean, median, standard deviation, minimum, and maximum values for the "AGE" column in the Boston Housing Dataset.

In [9]:
mean_age = boston_df['age'].mean()
median_age = boston_df['age'].median()
std_dev_age = boston_df['age'].std()
min_age = boston_df['age'].min()
max_age = boston_df['age'].max()

print(f"Mean AGE: {mean_age}")
print(f"Median AGE: {median_age}")
print(f"Standard Deviation AGE: {std_dev_age}")
print(f"Minimum AGE: {min_age}")
print(f"Maximum AGE: {max_age}")

Mean AGE: 68.57490118577076
Median AGE: 77.5
Standard Deviation AGE: 28.148861406903638
Minimum AGE: 2.9
Maximum AGE: 100.0
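
As a cross-check on the pandas calls above, the same statistics can be sketched with the standard library alone. The five values below are the AGE entries of rows 0 to 4 in the preview printed earlier, so the results describe only that small subset, not the full column. One detail worth keeping in mind: `Series.std()` uses the sample formula (divisor n - 1) by default, which corresponds to `statistics.stdev` rather than `statistics.pstdev`.

```python
import statistics

# AGE values of the first five rows shown in the preview (not the full column)
age_head = [65.2, 78.9, 61.1, 45.8, 54.2]

mean_head = statistics.mean(age_head)
median_head = statistics.median(age_head)
sample_sd = statistics.stdev(age_head)       # divisor n - 1, like pandas .std()
population_sd = statistics.pstdev(age_head)  # divisor n, like .std(ddof=0)

print(f"Mean: {mean_head}, Median: {median_head}")
print(f"Min: {min(age_head)}, Max: {max(age_head)}")
print(f"Sample SD: {sample_sd}, Population SD: {population_sd}")
```

The sample standard deviation is always at least as large as the population one, which is a quick way to tell which convention a library is using.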


Problem 2: Confidence Interval for Proportion (0.2p)
Calculate a 95% confidence interval for the proportion of residential land zoned for lots over 25,000 sq.ft. (column 'ZN').

In [10]:
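# Count tracts whose ZN value exceeds 25, then build a normal-approximation CI
# (the default method of proportion_confint) for the share of such tracts.
ci_zn = proportion_confint((boston_df['zn'] > 25).sum(), len(boston_df), alpha=0.05)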
print(f"Confidence Interval for ZN: {ci_zn}")

Confidence Interval for ZN: (0.12087737320071373, 0.1834704528862428)
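
The interval above can be reproduced by hand as a sketch of what the normal-approximation (Wald) method computes: p ± z * sqrt(p(1 - p) / n). The count of 77 is an assumption. The notebook does not print it, and it is inferred here from the midpoint of the reported interval multiplied by the 506 observations. The helper name `wald_interval` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(count, nobs, alpha=0.05):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = count / nobs
    z = NormalDist().inv_cdf(1 - alpha / 2)           # about 1.96 for alpha = 0.05
    half_width = z * sqrt(p_hat * (1 - p_hat) / nobs)
    return p_hat - half_width, p_hat + half_width


# Assumption: 77 of the 506 rows satisfy zn > 25 (inferred, not printed above)
low, high = wald_interval(77, 506)
print(f"Confidence Interval for ZN: ({low}, {high})")
```

Note that this interval describes the share of rows with ZN above 25, not the share of land area itself.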


Problem 3: Z-Test for Mean (0.2p)
Perform a Z-test to determine if there is a significant difference in the average number of rooms per dwelling ('RM') between two random samples of towns.

In [11]:
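# Draw two random samples of 30 rows each; different seeds give different samples
town_sample_1 = boston_df['rm'].sample(n=30, random_state=1)
town_sample_2 = boston_df['rm'].sample(n=30, random_state=2)
# Two-sample z-test of equal means (two-sided by default)
z_stat_rm, p_value_rm = ztest(town_sample_1, town_sample_2)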
print(f"Z-statistic for RM: {z_stat_rm}")
print(f"P-value for RM: {p_value_rm}")

Z-statistic for RM: -0.1355493555604788
P-value for RM: 0.8921775442074631
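
To make the test statistic concrete, here is a minimal pure-Python sketch of a two-sample z-test with a pooled variance, which is my understanding of what `ztest` does with its default `usevar='pooled'`. The function name `pooled_ztest` is hypothetical. The inputs are the RM values visible in the head and tail of the preview printed earlier, not the 30-row samples drawn above, so the numbers it prints will differ from the notebook output.

```python
from math import erfc, sqrt
from statistics import mean


def pooled_ztest(x1, x2):
    """Two-sided z-test for equal means using a pooled variance estimate."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    ss1 = sum((v - m1) ** 2 for v in x1)   # sum of squared deviations, sample 1
    ss2 = sum((v - m2) ** 2 for v in x2)   # sum of squared deviations, sample 2
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (m1 - m2) / std_err
    p_value = erfc(abs(z) / sqrt(2))       # equals 2 * (1 - Phi(|z|))
    return z, p_value


# RM values from the first five and last four rows of the preview
rm_head = [6.575, 6.421, 7.185, 6.998, 7.147]
rm_tail = [6.593, 6.120, 6.976, 6.794]

z_ab, p_ab = pooled_ztest(rm_head, rm_tail)
z_ba, p_ba = pooled_ztest(rm_tail, rm_head)
z_same, p_same = pooled_ztest(rm_head, rm_head)
print(z_ab, p_ab)
```

Swapping the two samples flips the sign of z but leaves the two-sided p-value unchanged, and comparing a sample with itself gives z = 0.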


Problem 4: Z-Test for Proportion (0.2p)
Conduct a Z-test to assess if the proportion of towns whose 'AGE' value is below 90 is significantly different from 0.5.

In [22]:
from statsmodels.stats.proportion import proportions_ztest

count_age_below_90 = (boston_df['age'] < 90).sum()
nobs_age = len(boston_df['age'])
z_stat_age, p_value_age = proportions_ztest(count_age_below_90, nobs_age, value=0.5)
print(f"Z-statistic for AGE: {z_stat_age}")
print(f"P-value for AGE: {p_value_age}")

Z-statistic for AGE: 7.811946338924822
P-value for AGE: 5.631153655947907e-15
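
The statistic above can be checked by hand. The sketch below assumes that 336 of the 506 rows have AGE below 90. That count is not printed in the notebook and is inferred here because it reproduces the reported z-statistic. By default `proportions_ztest` estimates the standard error from the sample proportion (`prop_var=False`); the second statistic shows the variant that uses the null value 0.5 instead, for comparison.

```python
from math import erfc, sqrt

count, nobs, p_null = 336, 506, 0.5   # count is an inferred assumption
p_hat = count / nobs

# Default behaviour: variance based on the sample proportion
se_sample = sqrt(p_hat * (1 - p_hat) / nobs)
z_sample = (p_hat - p_null) / se_sample

# Alternative: variance based on the null proportion (prop_var=0.5)
se_null = sqrt(p_null * (1 - p_null) / nobs)
z_null = (p_hat - p_null) / se_null

# Two-sided p-value; erfc stays accurate for very small tail probabilities
p_value = erfc(abs(z_sample) / sqrt(2))

print(f"z (sample variance): {z_sample}")
print(f"z (null variance):   {z_null}")
print(f"p-value:             {p_value}")
```

Either way the proportion is far from 0.5, so the choice of variance does not change the conclusion here.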


Problem 5: Confidence Interval for the Population Mean (0.2p)
Calculate a 90% confidence interval for the average nitric oxides concentration ('NOX').

In [23]:
ci_nox = sm.stats.DescrStatsW(boston_df['nox']).tconfint_mean(alpha=0.1)
print(f"Confidence Interval for NOX: {ci_nox}")

Confidence Interval for NOX: (0.5462062027268326, 0.5631839158502425)


Problem 6: Linear Regression with sm.OLS.from_formula() (0.2p)
Perform a linear regression using the formula "MEDV ~ RM + AGE" and provide the summary.

In [26]:
model_ols = sm.OLS.from_formula('medv ~ rm + age', data=boston_df)
result_ols = model_ols.fit()
summary_ols = result_ols.summary()
print(summary_ols)


                            OLS Regression Results                            
Dep. Variable:                   medv   R-squared:                       0.530
Model:                            OLS   Adj. R-squared:                  0.528
Method:                 Least Squares   F-statistic:                     283.9
Date:                Sat, 25 Nov 2023   Prob (F-statistic):           2.94e-83
Time:                        11:28:24   Log-Likelihood:                -1649.1
No. Observations:                 506   AIC:                             3304.
Df Residuals:                     503   BIC:                             3317.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -25.2774      2.857     -8.848      0.0
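
Several entries in the header of the summary are linked by simple formulas, which gives a quick consistency check. The sketch below uses only the rounded values printed above (R-squared, Log-Likelihood, No. Observations), so agreement is expected only up to rounding. The parameter count of 3 (Intercept, rm, age) follows from Df Model = 2 plus the intercept.

```python
from math import log

n_obs = 506        # No. Observations
k_params = 3       # Intercept, rm, age (Df Model + 1)
r_squared = 0.530  # rounded value from the summary
log_lik = -1649.1  # rounded value from the summary

# Adjusted R-squared penalises R-squared for the number of parameters
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k_params)

# Information criteria computed from the log-likelihood
aic = 2 * k_params - 2 * log_lik
bic = k_params * log(n_obs) - 2 * log_lik

print(f"Adj. R-squared: {adj_r_squared:.3f}")
print(f"AIC: {aic:.1f}")
print(f"BIC: {bic:.1f}")
```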

Problem 7: GEE Analysis with sm.GEE.from_formula() (0.2p)
Conduct a Generalized Estimating Equation (GEE) analysis using the formula "MEDV ~ RM + AGE" and provide the summary.

In [27]:
# groups=boston_df.index makes every row its own cluster (506 clusters of size 1)
model_gee = sm.GEE.from_formula('medv ~ rm + age', groups=boston_df.index, data=boston_df)
result_gee = model_gee.fit()
summary_gee = result_gee.summary()
print(summary_gee)


                               GEE Regression Results                              
Dep. Variable:                        medv   No. Observations:                  506
Model:                                 GEE   No. clusters:                      506
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   1
Family:                           Gaussian   Mean cluster size:                 1.0
Dependence structure:         Independence   Num. iterations:                     1
Date:                     Sat, 25 Nov 2023   Scale:                          39.890
Covariance type:                    robust   Time:                         11:28:33
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -25.2774      4.570     -5.531      0.000     -34.234     -16.321
rm     