
# Description:
                                                               
The Boston Housing dataset contains information about various factors affecting housing prices in the Boston area.

It includes features such as the per capita crime rate, average number of rooms per dwelling, 
proportion of residential land zoned for lots over 25,000 square feet, and more.
---------------------------------------------------------
# Columns:

* CRIM: per capita crime rate by town
* ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS: proportion of non-retail business acres per town
* CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX: nitric oxides concentration (parts per 10 million)
* RM: average number of rooms per dwelling
* AGE: proportion of owner-occupied units built prior to 1940
* DIS: weighted distances to five Boston employment centres
* RAD: index of accessibility to radial highways
* TAX: full-value property-tax rate per $10,000
* PTRATIO: pupil-teacher ratio by town
* B: 1000(Bk - 0.63)^2 where Bk is the proportion of [people of African American descent] by town
* LSTAT: % lower status of the population
* MEDV: Median value of owner-occupied homes in $1000s (target variable)
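
The formula for `B` is easier to read with numbers plugged in. The sketch below evaluates 1000(Bk - 0.63)^2 for a few illustrative proportions. The function name `b_feature` and the sample values are hypothetical and are not taken from the dataset.

```python
import math

def b_feature(bk: float) -> float:
    """Return 1000 * (bk - 0.63)^2 for a proportion bk in [0, 1]."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2

# Illustrative proportions (hypothetical values, not rows from the dataset)
for bk in (0.0, 0.53, 0.63, 0.73):
    print(f"Bk = {bk:.2f} -> B = {b_feature(bk):.2f}")
```

Because the expression is a square centred on 0.63, proportions of 0.53 and 0.73 map to the same `B` value, so the column cannot be converted back to a unique proportion.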

-----------------------------------------------
# Dataset Source: 
The data were derived from information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.

------------------

# **Step 1 : Import Libraries**
--------------

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# to avoid warnings
import warnings
warnings.filterwarnings('ignore')


# **Step 2 : Read the Data**
---------------------------

In [None]:
# Read the data
house = pd.read_csv("/kaggle/input/the-boston-housing-dataset/Boston (1).csv")
house.head()


# **Step 3 : Data Exploration**
---------------------------------

In [None]:
# Shape of Data
house.shape

In [None]:
# Data information
house.info()

In [None]:
# Checking Null Values
house.isna().sum()

In [None]:
# Checking Duplicate Values
house.duplicated().sum()

In [None]:
# Summary of data
house.describe()

In [None]:
house.columns


# **Step 4 : Data Visualization**
---------------

**4.1 Histogram :  Distribution of Housing Prices**
--

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of Housing Prices (MEDV)

# sns.distplot has been deprecated since seaborn 0.11; histplot is its replacement
sns.histplot(house['MEDV'], bins=20, kde=True)
plt.title('Distribution of Housing Prices (MEDV)')
plt.xlabel('Median Housing Price ($1000s)')
plt.ylabel('Frequency')
plt.show()

**This histogram shows the distribution of median housing prices across Boston neighborhoods, which helps in understanding the range and shape of housing prices in the dataset.**

**Obs :-**


* The distribution is right-skewed: there are fewer properties with very high prices than properties with low to moderate prices.
* Prices range from $5,000 to $50,000 (MEDV values of 5 to 50), indicating a diverse range of property values.
* The quartiles reported by `house.describe()` show how prices are spread across the different price ranges.
* Extreme values, like the $50,000 maximum, may represent potential outliers.
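
The observations above can be checked numerically rather than read off the plot. The following is a minimal sketch, assuming a pandas Series of prices. The helper name `describe_prices` and the ten sample values are hypothetical stand-ins for `house['MEDV']`.

```python
import pandas as pd

def describe_prices(prices: pd.Series) -> dict:
    """Summarize centre, quartiles, skewness, and ties at the maximum."""
    q1, median, q3 = (float(v) for v in prices.quantile([0.25, 0.5, 0.75]))
    return {
        "mean": float(prices.mean()),
        "median": median,
        "q1": q1,
        "q3": q3,
        # A positive value indicates a longer right tail
        "skew": float(prices.skew()),
        # Several rows sharing the maximum value can hint at a capped variable
        "n_at_max": int((prices == prices.max()).sum()),
    }

# Hypothetical sample in $1000s, standing in for house['MEDV']
sample = pd.Series([5, 10, 15, 20, 21, 22, 23, 25, 50, 50], name="MEDV")
summary = describe_prices(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

On the real data, the same call would be `describe_prices(house['MEDV'])`. A mean above the median together with a positive skewness is consistent with the right-skewed shape described above.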



-----
**4.2 Boxplot & Scatterplot**
--
**Relation between Features and Target, Distribution of Features of Houses**
--



In [None]:
# List of features
features = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'] 

# Create a scatter plot and boxplot for each feature side by side
for feature in features:
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))  # Create a new figure with 1 row and 2 columns

    # Scatter plot of feature with target variable
    axes[0].scatter(house[feature], house['MEDV'])
    axes[0].set_title(f'Scatter plot of {feature} with House Price')
    axes[0].set_xlabel(feature)
    axes[0].set_ylabel('House Price')

    # Boxplot of the feature
    axes[1].boxplot(house[feature])
    axes[1].set_title(f'Boxplot for {feature}')
    axes[1].set_xlabel('Feature')
    axes[1].set_ylabel('Values')

    plt.tight_layout()  
    plt.show();

**`This visualization provides insights into the variability of housing features across different areas of Boston.`**

**Obs : -**
--



- **`CRIM:`** Shows the spread of crime rates, with potential outliers indicating areas with exceptionally high crime rates.
- **`ZN:`** Illustrates the distribution of residential land proportions, with outliers highlighting towns with unique zoning characteristics.

- **`NX:`** Shows the variability in nitric oxides concentration, indicating differences in air quality.
- **`RM:`** Illustrates the distribution of room counts, with outliers representing properties with unusual room counts.
- **`AGE:`** Indicates the prevalence of older properties, highlighting areas with historic significance.
- **`DIS:`** Illustrates the distribution of distances to employment centers, indicating accessibility to job opportunities.
- **`RAD:`** Shows the distribution of accessibility to radial highways, indicating transportation infrastructure.
- **`TAX:`** Reveals the distribution of property tax rates, highlighting variations in tax burdens.
- **`PTRATIO:`** Indicates differences in pupil-teacher ratios, reflecting educational resources.
- **`B:`** Shows the distribution of the transformed variable 1000(Bk - 0.63)^2, which is derived from the proportion of Black residents (Bk) rather than being the proportion itself.
- **`LSTAT:`** Shows the distribution of socio-economic status, indicating concentrations of lower-status residents.
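
The points drawn beyond the boxplot whiskers can also be counted per feature. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule in matplotlib's `boxplot`. The helper name `iqr_outlier_counts` and the toy frame are hypothetical. On the real data the call would be `iqr_outlier_counts(house[features])`.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame) -> pd.Series:
    """Count, per numeric column, the values outside the 1.5 * IQR fences."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the DataFrame columns
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum().sort_values(ascending=False)

# Toy frame with made-up values; CRIM contains one extreme value
toy = pd.DataFrame({
    "CRIM": [0.1, 0.2, 0.3, 0.2, 0.1, 0.3, 0.2, 9.5],
    "RM": [6.0, 6.2, 5.9, 6.1, 6.3, 6.0, 6.2, 6.1],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

A high count does not by itself mean the rows are errors. For a heavily skewed feature such as a crime rate, many legitimate values can fall outside the fences.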


---------------------



**4.3 Barplot : Average Housing Prices by Accessibility to Radial Highways**
--


In [None]:
rad_medv_mean = house.groupby('RAD')['MEDV'].mean().reset_index()
rad_medv_mean

In [None]:
# Bar plot: average house price by accessibility to radial highways (RAD)

sns.barplot(x='RAD', y='MEDV', data=rad_medv_mean, color='orange', edgecolor='black')
plt.title('Average House Price by Accessibility to Radial Highways')
plt.xlabel('Index of Accessibility to Radial Highways (RAD)')
plt.ylabel('Mean Housing Price ($1000s)')
plt.show();

**`This bar plot illustrates the relationship between the index of accessibility to radial highways (RAD) and the average house price (MEDV).`**

**Obs:-**
--

* A correlation between RAD (index of accessibility to radial highways) and MEDV could indicate the impact of transportation infrastructure on property values.
* There is a potential positive relationship between RAD and housing prices.
* There is a notable outlier at RAD=24, where the mean MEDV is substantially lower than at the surrounding index values.
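
A group mean is easier to interpret next to the number of rows behind it, since a mean over a handful of tracts is less stable than one over many. The sketch below adds the median and count to the per-`RAD` summary and computes the Pearson correlation. The helper name `price_by_group` and the toy rows are hypothetical.

```python
import pandas as pd

def price_by_group(df: pd.DataFrame, group_col: str = "RAD",
                   target_col: str = "MEDV") -> pd.DataFrame:
    """Return mean, median, and row count of the target for each group value."""
    return (
        df.groupby(group_col)[target_col]
        .agg(["mean", "median", "count"])
        .reset_index()
    )

# Toy rows with made-up values, standing in for the `house` DataFrame
toy = pd.DataFrame({
    "RAD": [1, 1, 4, 4, 4, 24, 24],
    "MEDV": [24.0, 26.0, 20.0, 22.0, 21.0, 10.0, 14.0],
})
summary = price_by_group(toy)
pearson_r = toy["RAD"].corr(toy["MEDV"])  # Pearson correlation by default
print(summary)
print(f"Pearson r between RAD and MEDV: {pearson_r:.2f}")
```

Because RAD is an index rather than a measured quantity, the correlation coefficient is only a rough guide. A single group such as RAD=24 can dominate it.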


-----------
**4.4 Heatmap : Boston Housing Features**
--


In [None]:
# Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(house.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Boston Housing Features')
plt.show()

**`This heatmap visualizes the correlation matrix of all features in the dataset. It helps identify which features are most strongly correlated with housing prices and with each other.`**

**Obs:-**
--
* Strong positive correlation between the number of rooms (RM) and median housing prices (MEDV).
* Strong negative correlation between the percentage of lower status population (LSTAT) and median housing prices.


* Moderate negative correlations of both industrial land proportion (INDUS) and pupil-teacher ratio (PTRATIO) with median housing prices.


* Weak negative correlation between crime rate (CRIM) and median housing prices.
* Weak positive correlation between properties along the Charles River (CHAS) and median housing prices.


* Potential multicollinearity between accessibility to radial highways (RAD) and property-tax rate (TAX).
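
The same readings can be pulled from the correlation matrix directly instead of from the heatmap colours. The sketch below ranks features by the strength of their correlation with the target and lists feature pairs whose absolute correlation exceeds a threshold. The helper names, the 0.8 threshold, and the synthetic data are assumptions for illustration, not part of the notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def target_correlations(df: pd.DataFrame, target: str = "MEDV") -> pd.Series:
    """Correlation of each feature with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

def correlated_pairs(df: pd.DataFrame, target: str = "MEDV",
                     threshold: float = 0.8) -> list:
    """Feature pairs whose absolute correlation is at least `threshold`."""
    corr = df.drop(columns=[target]).corr()
    pairs = [
        (a, b, float(corr.loc[a, b]))
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) >= threshold
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Synthetic data: TAX is built to track RAD, and MEDV to depend mostly on RM
rng = np.random.default_rng(0)
n = 200
rad = rng.integers(1, 25, size=n)
rm = rng.normal(6.0, 0.7, size=n)
toy = pd.DataFrame({
    "RM": rm,
    "RAD": rad,
    "TAX": 20 * rad + rng.normal(0, 20, size=n),
    "MEDV": 5 * rm - 0.3 * rad + rng.normal(0, 1, size=n),
})
ranked = target_correlations(toy)
pairs = correlated_pairs(toy)
print(ranked)
print(pairs)
```

On the real data, `target_correlations(house)` and `correlated_pairs(house)` would give the numeric counterpart to the heatmap. The threshold is a judgment call and can be lowered to surface weaker pairs.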


---------------

# **Step 5 : Split the Data**
--------------

In [None]:
# Split the Data 
X = house.drop(columns=['MEDV']) #features
y = house['MEDV'] #target variable

In [None]:
# Splitting Data for Train and Test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.8,random_state=2)

In [None]:
# Shape of the split data
print("The shape of X_train :",X_train.shape)
print("The shape of X_test :",X_test.shape)
print("The shape of y_train :",y_train.shape)
print("The shape of y_test :",y_test.shape)

# **Step 6 : Train the Model**

-------------------------



## **1.Decision Tree Regressor**
------------

In [None]:
# Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor 
dtr = DecisionTreeRegressor(max_depth=5)

In [None]:
# Fit the model on Training dataset
dtr.fit(X_train,y_train)

In [None]:
# Predictions of the Decision Tree Regressor on the testing data
y_pred_dtr=dtr.predict(X_test)

In [None]:
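# Accuracy score of the model, reported as (1 - MAPE) * 100
from sklearn.metrics import mean_absolute_percentage_error
# sklearn expects (y_true, y_pred): the error is measured relative to the true values
error = mean_absolute_percentage_error(y_test, y_pred_dtr)
print("Accuracy of Decision Tree Regressor is: %.2f%%" % ((1 - error) * 100))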
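
The "accuracy" printed here is one minus the mean absolute percentage error (MAPE). MAPE divides each absolute error by the true value, which is why scikit-learn's `mean_absolute_percentage_error` takes `y_true` as its first argument and `y_pred` as its second. Below is a simplified NumPy sketch of the definition. The function name `mape` and the three sample values are hypothetical, and the sketch assumes no true value is zero.

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean of |y_true - y_pred| / |y_true| (assumes no zero in y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))

# Hypothetical prices in $1000s
y_true = [20.0, 25.0, 50.0]
y_pred = [22.0, 20.0, 45.0]

correct_order = mape(y_true, y_pred)   # errors relative to the true prices
swapped_order = mape(y_pred, y_true)   # errors relative to the predictions
print(f"MAPE (y_true, y_pred): {correct_order:.4f}")
print(f"MAPE (y_pred, y_true): {swapped_order:.4f}")
print(f"Accuracy as 1 - MAPE: {(1 - correct_order) * 100:.2f}%")
```

The two orders give different results because the denominator changes, so the argument order affects the reported accuracy.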


## **2.Random Forest Regressor**
------------

In [None]:
# Random Forest Regressor 
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(max_depth = 10, min_samples_leaf = 1, min_samples_split = 2, n_estimators = 200)

In [None]:
# Fit the model on Training dataset
rfr.fit(X_train,y_train)

In [None]:
# Predictions of the Random Forest Regressor on the testing data
y_pred_rfr = rfr.predict(X_test)

In [None]:
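# Accuracy score of the model, reported as (1 - MAPE) * 100
# sklearn expects (y_true, y_pred): the error is measured relative to the true values
error = mean_absolute_percentage_error(y_test, y_pred_rfr)
print("Accuracy of Random Forest Regressor is: %.2f%%" % ((1 - error) * 100))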

# Best Model
Random Forest Regressor:
--
* **Accuracy**: 90.34% in the recorded run, computed as (1 - MAPE) × 100 on the test set
* **Pros**: An ensemble of decision trees reduces overfitting, is robust to noise and outliers, and handles high-dimensional data well.

* **Because the Random Forest Regressor has the highest accuracy on this test split, it can be considered the best of the models tried for making predictions on this dataset.**
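
A single train/test split can favour one model by chance, so a comparison is more convincing when repeated over several folds. The following is a minimal sketch using scikit-learn's `cross_val_score` with the `neg_mean_absolute_percentage_error` scorer. The helper name `compare_models`, the fold count, the smaller `n_estimators`, and the synthetic data are assumptions chosen to keep the example fast and self-contained. They are not the notebook's settings or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

def compare_models(X, y, n_splits: int = 5, seed: int = 0) -> dict:
    """Return {model name: (mean MAPE, std of MAPE)} over K shuffled folds."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = {
        "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=seed),
        "random_forest": RandomForestRegressor(
            n_estimators=50, max_depth=10, random_state=seed
        ),
    }
    results = {}
    for name, model in models.items():
        scores = cross_val_score(
            model, X, y, cv=cv, scoring="neg_mean_absolute_percentage_error"
        )
        fold_mape = -scores  # the scorer returns negated errors
        results[name] = (float(fold_mape.mean()), float(fold_mape.std()))
    return results

# Synthetic regression data with a strictly positive target (MAPE needs y != 0)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 25 + 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

results = compare_models(X_demo, y_demo)
for name, (mean_mape, std_mape) in results.items():
    print(f"{name}: accuracy {(1 - mean_mape) * 100:.2f}% (MAPE std {std_mape:.4f})")
```

With the real data, `compare_models(X, y)` would report how much each model's score varies between folds, which helps judge whether the gap between the two models is larger than the noise from the split.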
