<a href="https://colab.research.google.com/github/GrassyPeaches/StatsInColab/blob/main/House_sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sale prices of houses in Duke Forest, Durham, NC**

In [4]:
import pandas as pd
import numpy as np
import plotly.express as px
import requests
from io import StringIO

url = "https://openintro.org/data/csv/duke_forest.csv"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
response.raise_for_status() # Raise an exception for bad status codes
df = pd.read_csv(StringIO(response.text))

## **1. The data dictionary:**

**address:**
Address of house.

**price:** ***This is our Y or dependent variable***

Sale price, in USD.

**bed:**
Number of bedrooms.

**bath:**
Number of bathrooms.

**area:** ***This is our X or independent variable***

Area of home, in square feet.

**type:**
Type of home (all are Single Family).

**year_built:**
Year the home was built.

**heating:**
Heating system.

**cooling:**
Cooling system (other or central).

**parking:**
Type of parking available and number of parking spaces.

**lot:**
Area of the entire property, in acres.

**hoa:**
If the home belongs to a Home Owners Association, the associated fee (NA otherwise).

**url:**
URL of the listing.

## **2. What are we trying to explore?**

This section explores the relationship between two variables in the dataset: the area of the home (in square feet) and its sale price (in USD). By the end, we should know whether the price of a home in Duke Forest, Durham, NC is positively or negatively correlated with its size.


In [5]:
fig = px.scatter(
    df,
    x = "area",
    y = "price",
    labels = {"area":"Area of home, in square feet","price":"Sale price, in USD"}
)
fig.show()

From the scatter plot above, we can already see that there **may** be a positive correlation between the area of a home and its price.

In [6]:
# Fit linear regression line (y = mx + b); Finding m & b
m, b = np.polyfit(df["area"], df["price"], 1)
df["price_pred"] = m * df["area"] + b    # y_pred (predicted sale price)

In [7]:
print("Slope or beta1 is", m, "\nIntercept or beta0 is", b)

Slope or beta1 is 159.48328050235747 
Intercept or beta0 is 116652.32506259099
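
For a single predictor, the least-squares slope and intercept also have a closed form, which can serve as a sanity check on `np.polyfit`. The sketch below uses a small made-up dataset; the `area` and `price` arrays are illustrative assumptions, not rows from `duke_forest.csv`.

```python
import numpy as np

# Hypothetical toy data, NOT taken from the Duke Forest dataset
area = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([250000.0, 340000.0, 410000.0, 520000.0, 580000.0])

# Closed-form least squares for one predictor:
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   b = y_mean - m * x_mean
x_dev = area - area.mean()
y_dev = price - price.mean()
m_manual = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
b_manual = price.mean() - m_manual * area.mean()

# np.polyfit with degree 1 returns (slope, intercept)
m_fit, b_fit = np.polyfit(area, price, 1)

print(m_manual, b_manual)
print(np.isclose(m_manual, m_fit), np.isclose(b_manual, b_fit))
```

Applying the same two formulas to `df["area"]` and `df["price"]` should give the slope and intercept printed above, up to floating-point rounding.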


In [8]:
fig = px.scatter(
    df,
    x = "area",
    y = "price",
    labels = {"area":"Area of home, in square feet","price":"Sale price, in USD"}
)
fig.add_scatter(x=df["area"], y=df["price_pred"], mode="lines",name='Regression line (prediction)')
fig.show()

With the regression line added, the overall trend is easier to see: the predicted price rises with area. Most points lie fairly close to the line, although there are some outliers.

In [9]:
import statsmodels.formula.api as smf
model = smf.ols('price ~ area',data=df).fit()
print(model.summary())
print(f"\nRoot MSE (Residual Std Error): {np.sqrt(model.mse_resid):.4f}")

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.445
Model:                            OLS   Adj. R-squared:                  0.439
Method:                 Least Squares   F-statistic:                     77.03
Date:                Mon, 28 Jul 2025   Prob (F-statistic):           6.29e-14
Time:                        18:15:39   Log-Likelihood:                -1317.6
No. Observations:                  98   AIC:                             2639.
Df Residuals:                      96   BIC:                             2644.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.167e+05   5.33e+04      2.188      0.0

## **3. How strong is the relationship between price and area? Prove analytically.**


In [12]:
np.corrcoef(df["price"], df["area"])[0,1]

np.float64(0.6672289740040003)

r ≈ 0.667

Since r ≈ +0.67, there is a moderate positive linear relationship between price and area.

## **4. How well does the linear model fit the dataset?**

Based on the OLS regression results, there is a moderate positive relationship between the two variables.


$R^2$ = 0.445, which means that only about 44.5% of the variance in the price of the home is explained by the area of the home.
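
In a simple linear regression with one predictor, $R^2$ equals the square of the Pearson correlation coefficient, so the two results can be cross-checked against each other. A minimal check using only values printed in this notebook:

```python
# Values copied from the notebook outputs
r = 0.6672289740040003      # from np.corrcoef(df["price"], df["area"])[0, 1]
r_squared_reported = 0.445  # R-squared from the OLS summary

# For a single predictor, R^2 = r^2
r_squared_from_r = r ** 2
print(round(r_squared_from_r, 3))
```

The squared correlation rounds to the same value the OLS summary reports, which confirms that both cells describe the same fit.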


For the Duke Forest, Durham, NC dataset, the residual standard error (root MSE) is about $168,797, so the model's predictions typically miss the actual sale price by roughly that amount.
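
Assuming the figure above comes from the `Root MSE` line printed by the regression cell, it is the square root of the residual sum of squares divided by the residual degrees of freedom ($n - 2$ for one predictor). A hedged sketch of that formula on made-up numbers (the arrays below are hypothetical, not the Duke Forest data):

```python
import numpy as np

# Hypothetical toy data, NOT taken from the Duke Forest dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit the line and compute residuals (observed minus predicted)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Residual standard error: sqrt(SSR / (n - 2)) for one predictor
n = len(x)
ssr = (residuals ** 2).sum()
rse = np.sqrt(ssr / (n - 2))
print(rse)
```

Dividing by $n - 2$ rather than $n$ accounts for the two estimated parameters (slope and intercept); with 98 observations this gives the 96 residual degrees of freedom shown in the OLS summary.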


Therefore, the fit of the model to the dataset is moderate at best.

## **5. Analyze the relationship between area and price to identify the rate of increase**

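Slope ≈ $159.48

Intercept ≈ $116,652

price = 116,652 + 159.48 × area

This means that each additional square foot of area is associated with an increase of about 159.48 USD in the predicted sale price. The intercept (about 116,652 USD) is the predicted price at an area of zero square feet, so it is the baseline of the line rather than a realistic house price.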
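
To make the equation concrete, the sketch below plugs a home size into the fitted line. The function name `predict_price` and the 2,000 sq ft example are illustrative assumptions; the coefficients are the ones printed earlier in the notebook.

```python
# Coefficients printed by the np.polyfit cell earlier in the notebook
SLOPE = 159.48328050235747      # USD per additional square foot
INTERCEPT = 116652.32506259099  # USD

def predict_price(area_sqft: float) -> float:
    """Predicted sale price (USD) from the line price = intercept + slope * area."""
    return INTERCEPT + SLOPE * area_sqft

# Hypothetical 2,000 sq ft home
print(round(predict_price(2000)))

# One extra square foot changes the prediction by exactly the slope
print(round(predict_price(2001) - predict_price(2000), 2))
```

The line is only a reasonable guide within the range of areas observed in the data; predictions for much smaller or larger homes are extrapolations.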

## **6. Identify the outliers and prove numerically (see the box plot below)**

To identify the outliers, we can use the IQR (interquartile range) method.


Using the box plot below we can compute:


The IQR = Q3 - Q1 = 645k - 450k = 195k

The Lower Fence = Q1 - 1.5 × IQR = 450k - 292.5k = 157.5k

The Upper Fence = Q3 + 1.5 × IQR = 645k + 292.5k = 937.5k

As a result, any sale price above 937.5k or below 157.5k is an outlier.

- In fact, in the box plot below, we can see a high outlier of $1.52M
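
The fence arithmetic can be reproduced in a few lines. This is a sketch using the quartiles read off the box plot (Q1 = 450k, Q3 = 645k); the helper name `iqr_fences` is hypothetical, and quartiles computed by another tool (for example `df["price"].quantile`) may differ slightly because quartile conventions vary.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower_fence, upper_fence) for the IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles read off the box plot, in USD
lower, upper = iqr_fences(450_000, 645_000)
print(lower, upper)

# The highest sale price ($1.52M) lies beyond the upper fence
print(1_520_000 > upper)
```

Any sale price outside the interval from `lower` to `upper` is flagged as an outlier under this rule.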


In [13]:
fig = px.box(
    df,
    x = "price",
    points = "all",
    title = "Sale prices (USD) of single-family houses sold in Nov 2020 in the Duke Forest neighborhood of Durham, NC"
)
fig.show()

## **7. Use a suitable graph to illustrate the relationship between price and the number of bedrooms**

In [22]:
df['bed'].unique()

array([3, 5, 2, 4, 6])

In [25]:
fig = px.box(df, x='bed', y='price',
             title='Price of Home vs. Number of Bedrooms')
fig.update_layout(xaxis_title='Number of Bedrooms', yaxis_title='Price (in USD)')
fig.show()


The box plot above shows the distribution of sale prices for each bedroom count, which allows the relationship between the number of bedrooms and price to be explored in more detail. For example, it shows the median price and the outliers for each number of bedrooms.
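
The medians shown in a box plot can also be tabulated directly, together with the number of homes in each group, which helps judge how much weight each box deserves. A minimal sketch on a made-up frame (the `toy` DataFrame is hypothetical; in the notebook the same call would be applied to `df`):

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the Duke Forest dataset
toy = pd.DataFrame({
    "bed":   [2, 2, 3, 3, 3, 4],
    "price": [300000, 320000, 400000, 450000, 410000, 600000],
})

# Median price and number of homes for each bedroom count
summary = toy.groupby("bed")["price"].agg(["median", "count"])
print(summary)
```

Groups with very few homes produce medians that are sensitive to individual sales, so they should be interpreted with caution.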

In [26]:
avg_price = df.groupby('bed', as_index=False)['price'].mean()

fig = px.bar(avg_price, x='bed', y='price',
             title='Average Price by Number of Bedrooms')
fig.update_layout(xaxis_title='Number of Bedrooms', yaxis_title='Average Price')
fig.show()


The bar graph above gives a simpler view of the positive relationship between the number of bedrooms and the average home price.