<h1>Problem Statement</h1>

<h2>Context</h2>
<p>To explore and analyze regression algorithms, we will be working with the Boston Housing dataset. This dataset consists of various attributes of Boston suburbs or towns, originally drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Our goal is to understand the relationship between these attributes and the median home value.</p>

<h2>Content</h2>
<p>Each record in the database describes a Boston suburb or town. The data attributes are defined as follows (adapted from the <a href="https://archive.ics.uci.edu/ml/datasets/Housing" target="_blank">UCI Machine Learning Repository</a>):</p>

<ul>
  <li><b>CRIM:</b> Per capita crime rate by town</li>
  <li><b>ZN:</b> Proportion of residential land zoned for lots over 25,000 sq.ft.</li>
  <li><b>INDUS:</b> Proportion of non-retail business acres per town</li>
  <li><b>CHAS:</b> Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)</li>
  <li><b>NOX:</b> Nitric oxides concentration (parts per 10 million)</li>
  <li><b>RM:</b> Average number of rooms per dwelling</li>
  <li><b>AGE:</b> Proportion of owner-occupied units built prior to 1940</li>
  <li><b>DIS:</b> Weighted distances to five Boston employment centers</li>
  <li><b>RAD:</b> Index of accessibility to radial highways</li>
  <li><b>TAX:</b> Full-value property-tax rate per $10,000</li>
  <li><b>PTRATIO:</b> Pupil-teacher ratio by town</li>
  <li><b>B:</b> 1000(Bk − 0.63)² where Bk is the proportion of Black residents by town</li>
  <li><b>LSTAT:</b> % lower status of the population</li>
  <li><b>MEDV:</b> Median value of owner-occupied homes in $1000s</li>
</ul>

<p>As seen, the input attributes vary in their units and scales, making this a good dataset for exploring feature engineering, scaling, and regression modeling techniques.</p>
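<p>To make the point about scales concrete, the sketch below standardizes a handful of NOX, TAX and RM values with scikit-learn's <code>StandardScaler</code>. The small <code>toy</code> frame and its variable names are illustrative assumptions, not part of the original notebook; the idea is only to show that columns with very different spreads end up with zero mean and unit variance.</p>

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (illustrative only): a few NOX, TAX and RM values
# on very different scales.
toy = pd.DataFrame({
    "NOX": [0.538, 0.469, 0.469, 0.458, 0.458],
    "TAX": [296.0, 242.0, 242.0, 222.0, 222.0],
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
})

# Standardize each column to zero mean and unit (population) variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print("Raw standard deviations:")
print(toy.std(ddof=0).round(3))
print("Scaled means and standard deviations:")
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

<p>In a real workflow the scaler would be fit on the training rows only and then applied to the validation and test rows, so that no information leaks from held-out data.</p>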

In [24]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


In [25]:
# Reading the data
df = pd.read_csv('/kaggle/input/bouston-housing-dataset/Housing DB.csv')
df.head(5)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [26]:
df.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')

In [27]:
# checking for missing values
df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [28]:
# checking for descriptive stats
df.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [29]:
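from sklearn.model_selection import train_test_split

# Separate the 13 input attributes from the target.
# Assumption: MEDV is the target, as stated in the problem description.
X = df.drop(columns='MEDV')
y = df['MEDV']

# First split: 80% for training+validation, 20% for test
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Second split: 75% of the 80% for training, 25% of the 80% for validation
# (about 60% / 20% / 20% of the full dataset overall)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)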

# Display shapes of the splits to confirm
print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)
print("y_test shape:", y_test.shape)

X_train shape: (303, 13)
X_val shape: (101, 13)
X_test shape: (102, 13)
y_train shape: (303,)
y_val shape: (101,)
y_test shape: (102,)
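
The split sizes above follow from simple arithmetic: 20% of 506 rows is held out for testing, and 25% of the remaining 80% (that is, 20% of the whole) becomes the validation set. The sketch below reproduces the row counts on placeholder arrays of the same shape; `X_demo`, `y_demo` and the other names are assumptions introduced for illustration and do not touch the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the housing features and target.
n_rows, n_features = 506, 13
X_demo = np.zeros((n_rows, n_features))
y_demo = np.zeros(n_rows)

# Same two-stage split as in the notebook.
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.25, random_state=42)

# Row counts and their share of the full dataset.
sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
fractions = {name: round(n / n_rows, 3) for name, n in sizes.items()}
print(sizes)
print(fractions)
```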


In [33]:
# Model creation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Initialize the models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Support Vector Regressor": SVR(kernel='rbf'),
    "Neural Network (MLP)": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=42)
}


In [39]:
# Train and evaluate each model
for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Calculate MAE
    mae = mean_absolute_error(y_test, y_pred)
    
    # Print the performance
    print(f"{model_name} - Mean Absolute Error: {mae:.2f}")

Linear Regression - Mean Absolute Error: 3.28
Decision Tree - Mean Absolute Error: 2.99
Random Forest - Mean Absolute Error: 2.28
Gradient Boosting - Mean Absolute Error: 2.13
Support Vector Regressor - Mean Absolute Error: 4.60
Neural Network (MLP) - Mean Absolute Error: 3.42
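
The loop above scores every model directly on the test set, while the validation split created earlier goes unused. A common alternative is to compare candidates on validation data and keep the test set for a single final check. The sketch below illustrates that pattern on synthetic data from `make_regression`; the data, the `candidates` dictionary and the "Scaled SVR" pipeline are assumptions for illustration, and the numbers it prints say nothing about the housing results. Wrapping `SVR` in a pipeline with `StandardScaler` is one plausible way to address its sensitivity to feature scale.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption; not the housing data).
X_syn, y_syn = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=42)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    # Scaling lives inside the pipeline, so it is fit on the training rows only.
    "Scaled SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
}

# Fit each candidate on the training rows and score it on the held-out rows.
val_mae = {}
for name, estimator in candidates.items():
    estimator.fit(X_fit, y_fit)
    val_mae[name] = mean_absolute_error(y_hold, estimator.predict(X_hold))

# Pick the candidate with the lowest validation MAE.
best_name = min(val_mae, key=val_mae.get)
for name, score in sorted(val_mae.items(), key=lambda item: item[1]):
    print(f"{name} - validation MAE: {score:.2f}")
print("Selected on validation data:", best_name)
```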


In [35]:
# Training the data
# Fit every model on the training split
for model_name, model in models.items():
    model.fit(X_train, y_train)

# Note: after the loop, `model` still refers to the last entry in `models`
# (the MLP regressor), so the training metrics below describe that model only.
y_train_pred = model.predict(X_train)
train_mae = mean_absolute_error(y_train, y_train_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
print(f"Training MAE: {train_mae}")
print(f"Training MSE: {train_mse}")

Training MAE: 3.2843514129815112
Training MSE: 20.02999089101483


In [38]:
# Test-set metrics for the same `model` object (the last model fitted in the loop, the MLP)
y_test_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
print(f"Test MAE: {test_mae}")
print(f"Test MSE: {test_mse}")
print(f"Test RMSE: {test_rmse}")



Test MAE: 3.4239163349104795
Test MSE: 21.783001594659353
Test RMSE: 4.6672263277732045
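
For readers less familiar with these metrics, the sketch below computes MAE, MSE and RMSE by hand on four made-up values and checks them against scikit-learn's helpers. The arrays `y_true` and `y_hat` are invented for illustration. Because MEDV is recorded in $1000s, MAE and RMSE are in $1000s as well, while MSE is in squared units.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values in $1000s, chosen so the arithmetic is easy to follow.
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([25.0, 20.6, 32.7, 35.4])

errors = y_hat - y_true            # roughly [1, -1, -2, 2]
mae = np.mean(np.abs(errors))      # (1 + 1 + 2 + 2) / 4
mse = np.mean(errors ** 2)         # (1 + 1 + 4 + 4) / 4
rmse = np.sqrt(mse)                # back in the units of the target

print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  RMSE: {rmse:.3f}")

# The hand-computed values agree with scikit-learn's helpers.
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
```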


In [40]:
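# Saving the model
import pickle
import joblib

# Select the Gradient Boosting model explicitly: after the training loop,
# `model` refers to the last entry in `models` (the MLP), not Gradient Boosting.
best_model = models["Gradient Boosting"]
with open('gradient_boosting_model.pkl', 'wb') as file:
    pickle.dump(best_model, file)
joblib.dump(best_model, 'gradient_boosting_model.joblib')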


['gradient_boosting_model.joblib']

In [41]:
# Loading the model

with open('gradient_boosting_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)


In [42]:
# Alternatively, load the joblib copy of the same model
loaded_model = joblib.load('gradient_boosting_model.joblib')
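
After reloading a model it is worth checking that it still produces the same predictions as the object that was saved. The sketch below shows one way to do that, assuming a small synthetic dataset and a temporary directory; the names `fitted`, `restored` and `demo_model.joblib` are hypothetical and chosen so the example does not overwrite the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (an assumption; not the housing data).
X_syn, y_syn = make_regression(n_samples=100, n_features=13, noise=5.0, random_state=42)
fitted = GradientBoostingRegressor(random_state=42).fit(X_syn, y_syn)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(fitted.predict(X_syn), restored.predict(X_syn))
print("Round trip preserved predictions:", same_predictions)
```

Pickle and joblib files should only be loaded from trusted sources, since unpickling can execute arbitrary code.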
