# Understanding Boston house prices
This script uses the Boston Housing dataset to examine how various housing-related features influence the median home price in Boston.
More specifically, it aims to determine which factors significantly impact housing prices by fitting a multiple linear regression (ordinary least squares) to the dataset. The regression model helps identify key predictors of home prices, such as crime rates, number of rooms, property tax rates, and accessibility to highways.
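
As a reading aid, the model fitted below can be written out as follows. The notation is an assumption introduced here, not taken from the script: $z_{ij}$ is the standardized value of feature $j$ for observation $i$, $\beta_0$ is the intercept, and $\varepsilon_i$ is the error term. The upper limit of 13 matches the `Df Model` value reported in the regression output.

$$
\text{MEDV}_i = \beta_0 + \sum_{j=1}^{13} \beta_j \, z_{ij} + \varepsilon_i
$$

A predictor is usually read as significant when the p-value for its coefficient $\beta_j$ falls below a chosen threshold.
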
## Data pre-processing
Rows with missing values are dropped, and the features are standardized before the model is fit.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

# Functions for data pre-processing and regression analysis

In [8]:
def load_and_clean_data():
    """Loads the Boston Housing dataset and cleans missing values."""
    data = fetch_openml(name="boston", version=1, as_frame=True)
    df = data.frame
    df = df.dropna()  # Remove missing values if any
    return df

def normalize_data(df, target_column):
    """Standardizes features (zero mean, unit variance) and separates the target variable."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Keep the original index so X stays aligned with y if dropna() removed rows
    return pd.DataFrame(X_scaled, columns=X.columns, index=X.index), y

def perform_regression(X, y):
    """Fits a multiple linear regression (OLS) using statsmodels."""
    X = sm.add_constant(X)  # Adds an intercept term
    model = sm.OLS(y, X).fit()
    return model
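
One consequence of standardizing the features is worth seeing on a small scale: once every predictor is centered at zero, the fitted intercept equals the sample mean of the target. The sketch below illustrates this with a single predictor and plain Python. The helper names `standardize` and `fit_line` and the toy numbers are assumptions made for illustration, not part of the script or the dataset.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0),
    the same convention StandardScaler follows."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy numbers, invented for illustration only (not taken from the dataset).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 17.0, 21.0, 30.0, 40.0]

z = standardize(rooms)
intercept, slope = fit_line(z, price)

# The standardized predictor has mean 0, so the intercept is the mean price.
print(round(intercept, 6), round(fmean(price), 6))
```

The same property holds with several standardized predictors, so the `const` row of the summary can be read as the average of the target over the fitted rows.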

In [9]:
def main():
    df = load_and_clean_data()
    
    target_column = "MEDV"  # Boston housing price column
    X, y = normalize_data(df, target_column)
    
    model = perform_regression(X, y)
    
    print(model.summary())  # Display regression results
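
Since the stated goal is to find which factors significantly affect prices, it can help to filter the coefficient table by p-value rather than reading it by eye. The following is a minimal sketch under stated assumptions: `significant_predictors` is a hypothetical helper, the 0.05 threshold is a common convention rather than something the script specifies, and the p-values in the example are placeholders, not results from this regression.

```python
def significant_predictors(pvalues, alpha=0.05, skip=("const",)):
    """Return (name, p-value) pairs with p < alpha, smallest p-value first.

    `pvalues` is any mapping with an .items() method, such as a dict
    or a pandas Series indexed by predictor name.
    """
    kept = [(name, p) for name, p in pvalues.items()
            if name not in skip and p < alpha]
    return sorted(kept, key=lambda pair: pair[1])

# Placeholder p-values, invented for illustration only.
# They are NOT results from the regression in this script.
example_pvalues = {
    "const": 0.0,
    "feature_a": 0.001,
    "feature_b": 0.40,
    "feature_c": 0.03,
}

print(significant_predictors(example_pvalues))
# prints [('feature_a', 0.001), ('feature_c', 0.03)]
```

With the fitted statsmodels result, the per-coefficient p-values are exposed as `model.pvalues`, so a call such as `significant_predictors(model.pvalues)` inside `main()` would be one way to apply this.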

# Running the data pre-processing and regression analysis

In [10]:
if __name__ == "__main__":
    main()

                            OLS Regression Results                            
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Fri, 14 Mar 2025   Prob (F-statistic):          6.72e-135
Time:                        13:06:47   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         22.5328      0.211    106.814      0.0