# **Project Name**    -



##### **Project Type**    - EDA/Regression
##### **Contribution**    - SARATHRAJ R

# **Project Summary -**

This project focuses on analyzing the Boston Housing dataset to build a predictive model for estimating housing prices based on various socioeconomic and environmental factors. The dataset includes 506 observations on 14 different attributes, such as crime rate, average number of rooms per dwelling, distance to employment centers, and more.

The primary objectives of this project are:

To explore and understand the underlying structure of the dataset using exploratory data analysis (EDA).

To identify key factors that influence housing prices.

To preprocess the data, handle missing values (if any), and normalize/standardize features where appropriate.

To apply and evaluate various regression models (e.g., Linear Regression, Decision Trees, Random Forest, Gradient Boosting, etc.) for predicting median home values.

To interpret model outputs and validate model performance using appropriate metrics such as RMSE, MAE, and R².

Through this project, we aim to gain insights into the housing market in Boston and demonstrate how machine learning techniques can be effectively used in real estate price prediction tasks.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The goal of this project is to develop a predictive model that accurately estimates the median value of owner-occupied homes in the Boston area based on a variety of features describing the neighborhood, housing characteristics, and environmental factors.

With rising interest in data-driven decision-making in real estate, stakeholders such as homebuyers, realtors, and city planners need reliable tools to assess property values. By leveraging historical housing data and machine learning techniques, this project seeks to identify the key determinants of housing prices and provide a model that can predict home values with high accuracy.

This problem falls under the domain of supervised regression and involves analyzing how features like crime rate, average number of rooms, proximity to employment hubs, and more, influence housing prices in the Boston area.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First Look
df = pd.read_csv('/content/Boston_Housing (1).csv')
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

In [None]:
sns.heatmap(df.corr())

### What did you know about your dataset?

Upon reviewing the basic information of the Boston Housing dataset, we observed that it contains 506 rows and 14 columns, with no missing values. The features include numeric attributes representing socioeconomic, housing, and environmental factors such as crime rate, number of rooms, and distance to employment centers. The target variable is the median value of owner-occupied homes (MEDV). Most features are continuous and vary across a wide range, indicating potential for diverse influence on housing prices. Initial observations suggest certain features, like the number of rooms (RM), have a strong correlation with housing prices.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The Boston Housing dataset consists of 14 variables describing socioeconomic and environmental factors that may influence housing prices. These include the per-capita crime rate (CRIM), residential land zoning (ZN), proportion of non-retail business land (INDUS), and whether the area borders the Charles River (CHAS). Other features capture pollution levels (NOX), average number of rooms (RM), age of buildings (AGE), distance to employment centers (DIS), highway access (RAD), property tax rate (TAX), and the pupil-teacher ratio (PTRATIO), which is often read as a rough proxy for school resources. The dataset also includes demographic indicators such as the percentage of lower-status population (LSTAT) and a variable related to the proportion of Black residents (B). The target variable is MEDV, the median value of owner-occupied homes in thousands of dollars.
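
As a quick reference, the descriptions above can be kept next to the data as a small lookup table. This is only an illustrative sketch: the names `COLUMN_DESCRIPTIONS` and `describe_columns` are hypothetical, the column names assume the uppercase spelling used later in this notebook, and the demo frame is made up.

```python
import pandas as pd

# Hypothetical lookup table; wording follows the variable description above.
COLUMN_DESCRIPTIONS = {
    "CRIM": "Per-capita crime rate",
    "ZN": "Residential land zoning",
    "INDUS": "Proportion of non-retail business land",
    "CHAS": "Borders the Charles River (1 = yes, 0 = no)",
    "NOX": "Pollution level (nitric oxides concentration)",
    "RM": "Average number of rooms per dwelling",
    "AGE": "Age of buildings",
    "DIS": "Distance to employment centers",
    "RAD": "Highway access index",
    "TAX": "Property tax rate",
    "PTRATIO": "Pupil-teacher ratio",
    "B": "Variable related to the proportion of Black residents",
    "LSTAT": "Percentage of lower-status population",
    "MEDV": "Median value of owner-occupied homes (thousands of dollars)",
}

def describe_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and description."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "description": [COLUMN_DESCRIPTIONS.get(c, "(no description)") for c in frame.columns],
    })

# Tiny made-up frame standing in for the real data.
demo = pd.DataFrame({"RM": [6.5, 5.9], "MEDV": [24.0, 21.6], "EXTRA": [1, 2]})
summary = describe_columns(demo)
print(summary)
```

Calling `describe_columns(df)` on the real frame would flag any column that has no entry in the lookup table.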

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Print the column name next to its count so the output is readable.
for col in df.columns.tolist():
  print(f"{col}: {df[col].nunique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Rename columns for consistency: trim surrounding whitespace first,
# then replace inner spaces with underscores.
df.columns = df.columns.str.strip().str.replace(' ', '_')
df.columns

In [None]:
df.columns = df.columns.str.strip().str.upper()

In [None]:
df.dtypes

In [None]:
df['CHAS'] = df['CHAS'].astype(float)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

for col in df.select_dtypes(include=['float64', 'int64']).columns:
    plt.figure(figsize=(6, 2))
    sns.boxplot(x=df[col])
    plt.title(f"Outlier Check: {col}")
    plt.show()
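
The boxplots flag outliers visually; by default the whiskers extend 1.5 times the interquartile range (IQR) beyond the quartiles. A rough sketch of counting the same points numerically is given below. The helper name `iqr_outlier_counts` and the demo frame are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on column names.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Made-up data: column "a" has one extreme value, column "b" has none.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
counts = iqr_outlier_counts(demo)
print(counts)
```

Running `iqr_outlier_counts(df)` on the housing frame would give one count per feature, which can help decide where outlier treatment is worth considering.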



In [None]:
from scipy.stats import skew

# Select only numerical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Calculate skewness for each column
skew_values = df[numeric_cols].apply(skew).sort_values(ascending=False)

# Print skewness values
print("Skewness of numerical features:")
print(skew_values)

# Identify highly skewed features (threshold = |0.5|)
print("\nHighly skewed features (|skew| > 0.5):")
skewed_features = skew_values[abs(skew_values) > 0.5]
print(skewed_features)


In [None]:
#  Feature Engineering (log-transform skewed features)
df['LOG_CRIM'] = np.log1p(df['CRIM'])
df['LOG_LSTAT'] = np.log1p(df['LSTAT'])

In [None]:
# 📐 Feature Scaling (excluding target variable)
from sklearn.preprocessing import StandardScaler

features = df.drop(['MEDV'], axis=1)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
scaled_df = pd.DataFrame(scaled_features, columns=features.columns)


In [None]:
# Reattach the target variable
scaled_df['MEDV'] = df['MEDV'].values

In [None]:
#  Final Cleaned Data Ready
print("\nCleaned and processed dataset shape:", scaled_df.shape)
print(scaled_df.head())

### What all manipulations have you done and insights you found?

Loaded the dataset, standardized column names, and checked for duplicates and missing values.

Cast the binary CHAS column to float so that it is handled like the other numeric features.

Visualized potential outliers using boxplots for key numerical features.

Applied log transformation to skewed features (CRIM, LSTAT) to reduce skewness.

Scaled all features (excluding the target MEDV) using StandardScaler and created a final cleaned dataset.
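
To illustrate why a log transform helps with right-skewed features, the sketch below applies `np.log1p` to synthetic log-normal values and compares the skewness before and after. The data is generated, not taken from the housing file, so the exact numbers will differ for CRIM and LSTAT.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for a feature such as CRIM.
raw = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
transformed = np.log1p(raw)

skew_before = skew(raw)
skew_after = skew(transformed)
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

`log1p` is used rather than `log` so that zero values remain valid inputs.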

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8, 4))
sns.histplot(df['MEDV'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of MEDV (House Prices)')
plt.xlabel('MEDV')
plt.ylabel('Frequency')
plt.show()

Shows how house prices are distributed and reveals any skew or ceiling effect.
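
A ceiling effect appears as many observations piled up at the largest value. One way to quantify it is the share of rows equal to the maximum, sketched below with a hypothetical helper `share_at_max` and made-up prices. MEDV in this dataset is commonly described as capped at 50.0, but whether the file used here shows that cap is an assumption to verify on the real column.

```python
import pandas as pd

def share_at_max(values: pd.Series) -> float:
    """Fraction of observations equal to the largest observed value."""
    return float((values == values.max()).mean())

# Made-up prices: three of the eight values sit at the maximum.
demo_prices = pd.Series([21.0, 34.5, 50.0, 18.2, 50.0, 27.1, 50.0, 22.4])
print(f"share at max: {share_at_max(demo_prices):.3f}")
```

On the real data this would be called as `share_at_max(df['MEDV'])`; a share well above what a continuous variable would produce suggests censoring.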

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.countplot(x='CHAS', data=df, palette='Set2')
plt.title('Number of Homes Near the Charles River')
plt.xlabel('CHAS (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()


Shows how many homes are near the Charles River — useful for categorical variable understanding.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(6, 4))
sns.scatterplot(x='RM', y='MEDV', data=df, color='green', hue = 'CHAS')
plt.title('RM vs MEDV')
plt.xlabel('Average Number of Rooms per Dwelling')
plt.ylabel('MEDV')
plt.show()

Reveals a strong positive correlation between number of rooms and house price.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(6, 4))
sns.regplot(x='LSTAT', y='MEDV', data=df, scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
plt.title('LSTAT vs MEDV (with Regression Line)')
plt.xlabel('% Lower Status Population')
plt.ylabel('MEDV')
plt.show()

Shows a clear negative relationship: higher LSTAT is associated with lower house prices.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
top_corr = df.corr().abs().nlargest(5, 'MEDV')['MEDV'].index
plt.figure(figsize=(8, 6))
sns.heatmap(df[top_corr].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Top Features Correlated with MEDV')
plt.show()


Highlights multivariate relationships and helps in feature selection.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

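1. Homes in areas bordering the Charles River (CHAS = 1) have a different average MEDV than homes elsewhere (CHAS = 0).

2. The average number of rooms per dwelling (RM) is linearly correlated with MEDV.

3. Average MEDV differs across highway-access (RAD) groups.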

### Hypothetical Statement - 1

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Hypothesis:
# H0: There is no difference in average MEDV between CHAS=1 and CHAS=0
# H1: There is a significant difference in average MEDV between CHAS=1 and CHAS=0

medv_near_river = df[df['CHAS'] == 1]['MEDV']
medv_not_near_river = df[df['CHAS'] == 0]['MEDV']

t_stat, p_val = ttest_ind(medv_near_river, medv_not_near_river, equal_var=False)

print("🔹 T-test: MEDV by CHAS")
print(f"T-statistic: {t_stat:.4f}, P-value: {p_val:.4g}")  # .4g keeps very small p-values visible
if p_val < 0.05:
    print(" Result: Reject the null hypothesis — CHAS has a significant effect on MEDV.\n")
else:
    print(" Result: Fail to reject the null hypothesis — No significant effect of CHAS on MEDV.\n")
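
A significant t-test says that the two means differ, not by how much. A common companion is Cohen's d, the mean difference in units of the pooled standard deviation. The sketch below is an addition for illustration only: the helper `cohens_d` and the two small samples are hypothetical.

```python
import numpy as np

def cohens_d(a, b) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Made-up samples standing in for MEDV near and away from the river.
near = [28.0, 33.0, 25.0, 30.0]
far = [22.0, 24.0, 20.0, 26.0]
d = cohens_d(near, far)
print(f"Cohen's d: {d:.2f}")
```

With the notebook's variables this would be `cohens_d(medv_near_river, medv_not_near_river)`.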


### Hypothetical Statement - 2

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

# Hypothesis:
# H0: There is no correlation between RM and MEDV (ρ = 0)
# H1: There is a significant correlation between RM and MEDV (ρ ≠ 0)

corr_coef, p_val = pearsonr(df['RM'], df['MEDV'])

print("🔹 Pearson Correlation: RM vs MEDV")
print(f"Correlation Coefficient: {corr_coef:.4f}, P-value: {p_val:.4g}")  # .4g keeps very small p-values visible
if p_val < 0.05:
    print(" Result: Reject the null hypothesis — RM is significantly correlated with MEDV.\n")
else:
    print(" Result: Fail to reject the null hypothesis — No significant correlation.\n")
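
Alongside the p-value, a confidence interval shows how precisely the correlation is estimated. The sketch below uses the Fisher z-transform, a standard large-sample approximation. The helper `pearson_ci` is hypothetical and the data is synthetic.

```python
import numpy as np
from scipy.stats import norm, pearsonr

def pearson_ci(x, y, level: float = 0.95):
    """Pearson r with an approximate confidence interval via the Fisher z-transform."""
    r, _ = pearsonr(x, y)
    n = len(x)
    z = np.arctanh(r)                                   # Fisher z-transform of r
    half_width = norm.ppf(0.5 + level / 2) / np.sqrt(n - 3)
    return float(r), float(np.tanh(z - half_width)), float(np.tanh(z + half_width))

rng = np.random.default_rng(1)
# Synthetic, positively related pair standing in for RM and MEDV.
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)
r, low, high = pearson_ci(x, y)
print(f"r = {r:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

On the housing data the call would be `pearson_ci(df['RM'], df['MEDV'])`.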



### Hypothetical Statement - 3

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import f_oneway

# Hypothesis:
# H0: All RAD groups have the same average MEDV
# H1: At least one RAD group has a different average MEDV

groups = [df[df['RAD'] == rad]['MEDV'] for rad in df['RAD'].unique()]
f_stat, p_val = f_oneway(*groups)

print("🔹 ANOVA: MEDV across RAD groups")
print(f"F-statistic: {f_stat:.4f}, P-value: {p_val:.4g}")  # .4g keeps very small p-values visible
if p_val < 0.05:
    print(" Result: Reject the null hypothesis — MEDV varies across RAD groups.\n")
else:
    print(" Result: Fail to reject the null hypothesis — No significant difference among RAD groups.\n")
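
ANOVA reports whether group means differ; eta squared estimates how much of the total variance the grouping explains. The following sketch is illustrative only, with a hypothetical helper `eta_squared` and made-up groups.

```python
import numpy as np

def eta_squared(groups) -> float:
    """Share of total variance explained by group membership (SS_between / SS_total)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

# Three made-up groups with clearly separated means.
demo_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(f"eta squared: {eta_squared(demo_groups):.2f}")
```

With the notebook's list of RAD groups this would be `eta_squared(groups)`.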



## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Feature'] = features.columns
vif['VIF'] = [variance_inflation_factor(scaled_df[features.columns].values, i) for i in range(len(features.columns))]
print(vif)
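
For intuition about what the VIF measures, each value equals 1 / (1 - R²) from regressing one feature on the others, and the same numbers appear on the diagonal of the inverse correlation matrix. The sketch below checks this on synthetic data. The helper names are hypothetical and the snippet is a cross-check, not a replacement for the statsmodels call above.

```python
import numpy as np

def vif_from_correlation(X: np.ndarray) -> np.ndarray:
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def vif_from_r2(X: np.ndarray, i: int) -> float:
    """VIF of column i as 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(y)), others])  # include an intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid.var() / y.var()
    return float(1.0 / (1.0 - r2))

rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.3 * rng.normal(size=300)   # deliberately collinear with a
X = np.column_stack([a, b, c])
vifs = vif_from_correlation(X)
print(np.round(vifs, 2))
```

Keeping both a raw feature and its log transform (for example CRIM and LOG_CRIM) would be expected to produce high VIF values for that pair.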

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
#  Import ML Libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

#  Define Features and Target
X = scaled_df.drop('MEDV', axis=1)
y = scaled_df['MEDV']

#  Split into Train and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#  Initialize Models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

#  Train and Evaluate Each Model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # mean_squared_error returns the MSE; take the square root to report RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f" {name}")
    print(f"    RMSE: {rmse:.2f}")
    print(f"    R² Score: {r2:.2f}\n")


Comparing the test-set metrics, Random Forest outperforms the other models on this dataset.
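
The project summary lists RMSE, MAE, and R² as evaluation metrics. A minimal sketch of computing all three for one set of predictions is shown below. The helper `regression_report` and the toy values are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred) -> dict:
    """Return RMSE, MAE and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "RMSE": float(np.sqrt(mse)),                      # same units as the target
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred)),
    }

# Toy values, not model output.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
report = regression_report(y_true, y_pred)
print(report)
```

Inside the model loop, `regression_report(y_test, y_pred)` could be collected per model into a DataFrame for a side-by-side comparison.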

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# 5-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')

# Results
print(" R² Scores per Fold:", scores)
print(" Average R² Score:", np.mean(scores).round(4))
print(" Standard Deviation:", np.std(scores).round(4))
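
In this notebook the scaler was fitted on the full dataset before splitting, so each validation fold contributes to the scaling statistics. One way to avoid that is to place the scaler inside a pipeline so it is re-fitted on the training part of every fold. The sketch below uses synthetic data and a Ridge model purely as stand-ins. The effect is typically small for tree-based models, which are largely insensitive to feature scaling.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=5, noise=10.0, random_state=0
)

# The scaler is re-fitted on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="r2")
print("R² per fold:", np.round(fold_scores, 3))
```

The same pattern would apply to the housing data by passing the unscaled features to the pipeline instead of `scaled_df`.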


In [None]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid Search with 5-fold CV
grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

# Fit to training data
grid_search.fit(X_train, y_train)

# Best parameters
print(" Best Parameters:", grid_search.best_params_)
print(f" Best Cross-Validated R²: {grid_search.best_score_:.4f}")

# Evaluate best model on test data
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

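from sklearn.metrics import mean_squared_error, r2_score
print("\n Evaluation on Test Set:")
# Take the square root explicitly; the `squared=False` argument is deprecated in newer scikit-learn versions.
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))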
print("R² Score:", r2_score(y_test, y_pred))
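
Before launching a grid search it can help to know how many model fits it implies. The sketch below counts the candidates for the grid used above with `ParameterGrid` and multiplies by the 5 folds. It only counts combinations and trains nothing.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5   # 5-fold cross-validation
print(n_candidates, n_fits)  # 108 540
```

If that cost is too high, a randomized search over the same space is a common alternative.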


### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.
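
One model-agnostic explainability option is permutation importance, which measures how much the test score drops when a single feature is shuffled. The sketch below is a hedged illustration on synthetic data where only the first feature matters. It is not the analysis performed in this project, and the variable names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Only the first column drives the target; the other two are noise.
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature several times and record the mean drop in score.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by importance:", ranking)
```

For the tuned forest in this notebook the equivalent call would take `best_rf`, `X_test`, and `y_test`.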

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
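
A minimal sketch of the save-and-reload round trip with `joblib` follows. It uses a small linear model and a temporary directory as stand-ins. The file name `model.joblib` is hypothetical, and in the notebook the saved object would be the chosen estimator.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model; in the notebook this would be the chosen estimator.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")   # hypothetical file name
    joblib.dump(model, path)                       # save
    restored = joblib.load(path)                   # load again

# Sanity check: the reloaded model predicts the same values on unseen inputs.
X_new = np.array([[4.0], [5.0]])
same_predictions = np.allclose(model.predict(X_new), restored.predict(X_new))
print("predictions match after reload:", same_predictions)
```

Any scaler or other preprocessing fitted during training needs to be saved with the model, for example as part of one pipeline object, so that new data is transformed the same way.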

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***