### **Name: OBBA MARK CALVIN**
### *REG NO: S23B23/047*
### *ACCESS NO: B24277*


CRISP-DM (Cross-Industry Standard Process for Data Mining) is a widely adopted six-phase framework for data mining and data science projects. It provides a structured, cyclical, and iterative approach that runs from initial business understanding through to deployment and monitoring.

## **Business Understanding:**


The project begins with a clear definition of business objectives, success criteria, and requirements. In this case, the business domain is housing.

The primary goal is to analyze housing data to understand factors that influence property tax rates and develop predictive models for property tax estimation using the TAX (Property tax per year in dollars) column as our target variable.

## **Data Understanding:**

 This phase involves collecting, exploring, and analyzing data to gain initial insights and identify data quality issues. 

## **Importing the necessary libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## **Loading the dataset**

In [2]:
housing = pd.read_excel('HousingData.xlsx')
housing.head()

Unnamed: 0,PID,CRIM,AC,INDUS,LS,PR,RM,AGE,DIS,RAD,PTRATIO,DMT,LSTAT,MO,TAX
0,101,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,15.3,396.9,4.98,2.0,296.0
1,102,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,17.8,396.9,9.14,2.0,242.0
2,103,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,17.8,292.4,4.03,3.0,242.0
3,104,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,18.7,394.63,2.94,0.0,222.0
4,105,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,18.7,396.9,5.33,0.0,222.0


## **The Exploratory Data Analysis (EDA)**

In [3]:
# checking the info of the dataset
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   PID      506 non-null    int64  
 1   CRIM     504 non-null    float64
 2   AC       506 non-null    float64
 3   INDUS    506 non-null    float64
 4   LS       504 non-null    float64
 5   PR       503 non-null    float64
 6   RM       502 non-null    float64
 7   AGE      502 non-null    float64
 8   DIS      503 non-null    float64
 9   RAD      504 non-null    float64
 10  PTRATIO  503 non-null    float64
 11  DMT      502 non-null    float64
 12  LSTAT    505 non-null    float64
 13  MO       504 non-null    float64
 14  TAX      505 non-null    float64
dtypes: float64(14), int64(1)
memory usage: 59.4 KB


In [4]:
# checking for the data types of each column
housing.dtypes

PID          int64
CRIM       float64
AC         float64
INDUS      float64
LS         float64
PR         float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
PTRATIO    float64
DMT        float64
LSTAT      float64
MO         float64
TAX        float64
dtype: object

In [5]:

housing.describe()

Unnamed: 0,PID,CRIM,AC,INDUS,LS,PR,RM,AGE,DIS,RAD,PTRATIO,DMT,LSTAT,MO,TAX
count,506.0,504.0,506.0,506.0,504.0,503.0,502.0,502.0,503.0,504.0,503.0,502.0,505.0,504.0,505.0
mean,353.5,3.604056,11.363636,11.136779,0.069444,0.554164,6.285307,68.53008,3.796207,9.492063,18.443539,305.41453,12.654099,0.644841,407.726733
std,146.213884,8.609134,23.322453,6.860353,0.254461,0.11583,0.704098,28.13882,2.103234,8.676649,2.165602,142.836519,7.148104,1.794307,168.312294
min,101.0,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,12.6,0.32,1.73,0.0,187.0
25%,227.25,0.082155,0.0,5.19,0.0,0.449,5.8855,45.025,2.10035,4.0,17.35,293.5725,6.93,0.0,279.0
50%,353.5,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.2157,5.0,19.0,386.91,11.34,0.0,330.0
75%,479.75,3.674808,12.5,18.1,0.0,0.624,6.6235,93.975,5.16495,24.0,20.2,394.9975,16.96,0.0,666.0
max,606.0,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,22.0,396.9,37.97,19.0,711.0


In [6]:
# Check for missing values
housing.isnull().sum()

PID        0
CRIM       2
AC         0
INDUS      0
LS         2
PR         3
RM         4
AGE        4
DIS        3
RAD        2
PTRATIO    3
DMT        4
LSTAT      1
MO         2
TAX        1
dtype: int64

## **Handling missing values**

Mean imputation is used here: each missing value is replaced with the mean of the observed values in its column.

In [7]:
# Handling all missing values using mean imputation
housing.fillna(housing.mean(), inplace=True)
# checking for the missing values again
housing.isnull().sum()

PID        0
CRIM       0
AC         0
INDUS      0
LS         0
PR         0
RM         0
AGE        0
DIS        0
RAD        0
PTRATIO    0
DMT        0
LSTAT      0
MO         0
TAX        0
dtype: int64
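
To make the effect of mean imputation easier to see, here is a minimal sketch on a tiny toy frame. The frame `toy` and its values are made up for illustration and are not taken from the housing data; only the column names are borrowed.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values only)
toy = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0],
    "AGE": [60.0, 80.0, np.nan],
})

# DataFrame.mean() skips NaN by default, so each mean uses only observed values
col_means = toy.mean()

# Passing a Series to fillna fills each column with its matching entry
toy_filled = toy.fillna(col_means)
print(toy_filled)
```

The missing `RM` entry becomes the mean of 6.0 and 7.0, and the missing `AGE` entry becomes the mean of 60.0 and 80.0.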

## **Importing more necessary libraries**

In [8]:
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.preprocessing import StandardScaler  # for feature scaling
from sklearn.linear_model import LinearRegression  # for linear regression model
from sklearn.metrics import mean_squared_error, r2_score # for model evaluation



## **Data Preparation:**

 Data is cleaned, transformed, and integrated to create a final dataset ready for modeling. 

In [9]:
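# Features and target variable selection
# Note: the Business Understanding section names TAX as the target, but this cell
# uses PR as the target, so every result below is for predicting PR.
X = housing.drop(['PR', 'PID'], axis=1)  # Features: all columns except PR and PID
y = housing['PR']  # Target variable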


In [10]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training and 20% test

This line splits the dataset into training and testing parts so that the model can be trained on one set (X_train, y_train) and then evaluated on unseen data (X_test, y_test).
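
As a rough check of what an 80/20 split produces, the sketch below runs the same call on stand-in arrays with 506 rows, the row count reported by `housing.info()`. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical and the array contents are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the housing dataset (506)
X_demo = np.arange(506 * 2).reshape(506, 2)
y_demo = np.arange(506)

# Same arguments as the notebook cell: 20% held out, fixed seed for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With 506 rows, the test set size is rounded up to 102 rows, which leaves 404 rows for training, and no row appears in both parts.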

In [11]:
# feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) 
X_test_scaled = scaler.transform(X_test)
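
The scaler is fitted on the training set only and then reused on the test set. The following sketch, on random stand-in data, shows what that implies: the training features end up with mean 0 and standard deviation 1, while the test features are shifted and scaled with the training statistics. The names `train_demo`, `test_demo`, and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from training data
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean and std

# Manual equivalent: subtract the training mean, divide by the training std (population std)
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(np.allclose(test_scaled, manual))
```

Fitting on the training set alone keeps information from the test set out of the preprocessing step.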


## **Modeling the data**

This phase applies several data mining techniques to build predictive models and discover patterns in the housing data.


Linear Regression

In [None]:
# using Linear Regression model
housing_model = LinearRegression() # create a linear regression model
housing_model.fit(X_train_scaled, y_train) # train the model

Parameter,Value
fit_intercept,True
copy_X,True
tol,1e-06
n_jobs,None
positive,False


The algorithm looks at the relationship between the inputs (X_train_scaled) and the target (y_train).

It finds the best-fitting line (or hyperplane when there are several features) by minimizing the sum of squared differences between the predicted and actual target values.

The model stores this relationship in the form of an intercept and coefficients (weights for each input feature).
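
A minimal sketch of how the stored intercept and coefficients produce a prediction is given below. It uses synthetic, noise-free data generated from a known linear rule, so the names `X_demo`, `y_demo`, and `lr_demo` and the chosen weights are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))

# Known rule: intercept 3.0, weights 2.0 and -1.0 (no noise added)
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

lr_demo = LinearRegression().fit(X_demo, y_demo)
print(lr_demo.intercept_, lr_demo.coef_)

# A prediction is the intercept plus the weighted sum of the features
manual_pred = lr_demo.intercept_ + X_demo @ lr_demo.coef_
print(np.allclose(manual_pred, lr_demo.predict(X_demo)))
```

On the notebook's own model, the same attributes are available as `housing_model.intercept_` and `housing_model.coef_`.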

In [16]:
# using Random Forest Regressor model
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train) # train the model


Parameter,Value
n_estimators,100
criterion,'squared_error'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,1.0
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


Each decision tree in the forest learns slightly different patterns because it is trained on a bootstrap sample of the rows (and, when max_features is set below 1.0, on a random subset of features at each split). The forest then combines the trees by averaging their predictions, which reduces overfitting and usually improves accuracy compared with a single tree.
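
The averaging step can be checked directly, since a fitted forest exposes its individual trees through `estimators_`. The sketch below uses synthetic data and a smaller forest than the notebook; the names `X_demo`, `y_demo`, and `rf_demo` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=200)

rf_demo = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# One row of predictions per tree, for the first five samples
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in rf_demo.estimators_])

# Averaging over the trees reproduces the forest's own prediction
manual_avg = per_tree.mean(axis=0)
print(np.allclose(manual_avg, rf_demo.predict(X_demo[:5])))
```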

In [19]:
#using k-Means clustering
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_train_scaled) # fit the model



Parameter,Value
n_clusters,3
init,'k-means++'
n_init,'auto'
max_iter,300
tol,0.0001
verbose,0
random_state,42
copy_x,True
algorithm,'lloyd'


A KMeans model with 3 clusters was created and fitted on the scaled training data. Because clustering is unsupervised, the model grouped the observations into 3 clusters without using the target labels.
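
After fitting, the cluster assignments and centers can be inspected through `labels_` and `cluster_centers_`. The sketch below shows this on three well-separated synthetic blobs; the data, the blob centers, and the names `X_demo` and `km_demo` are assumptions for illustration, and `n_init=10` is set explicitly so the code also runs on scikit-learn versions that do not accept `'auto'`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated blobs of 50 points each (synthetic data)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

km_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

print(km_demo.cluster_centers_.shape)  # one center per cluster
print(np.bincount(km_demo.labels_))    # number of points assigned to each cluster

# A new point is assigned to the cluster with the nearest center
new_labels = km_demo.predict(np.array([[0.2, -0.1]]))
print(new_labels)
```

For the notebook's model, `kmeans.labels_` would give the cluster of each training row in the same way.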

## **Evaluation of the Trained Model**

In [13]:
#predictions
y_pred = housing_model.predict(X_test_scaled) # make predictions on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.5f}")
print(f"R^2 Score: {r2:.5f}")

Mean Squared Error: 0.00305
R^2 Score: 0.73900
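
For reference, both metrics can be computed by hand. The sketch below uses only the five Actual/Predicted pairs displayed by `predictions.head()` at the end of this notebook, so its numbers will differ from the full test-set scores above; the names `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Five Actual / Predicted pairs copied from the predictions table
y_true = np.array([0.51, 0.447, 0.609, 0.413, 0.713])
y_hat = np.array([0.553223, 0.477914, 0.700797, 0.446654, 0.669173])

# MSE: average of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both values agree with the scikit-learn helpers used in the notebook
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

An R^2 close to 1 means the model explains most of the variance in the target, while the MSE is expressed in squared units of the target.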


## **Deployment**

In [None]:
# deployment: predict PR for one new observation
# The new row must contain the same feature columns the scaler and model were fitted on.
# The values below are copied from the first row of the dataset and are only an illustration.
new_data = pd.DataFrame({
    'CRIM': [0.00632],
    'AC': [18.0],
    'INDUS': [2.31],
    'LS': [0.0],
    'RM': [6.575],
    'AGE': [65.2],
    'DIS': [4.09],
    'RAD': [1.0],
    'PTRATIO': [15.3],
    'DMT': [396.9],
    'LSTAT': [4.98],
    'MO': [2.0],
    'TAX': [296.0]
})
new_data = new_data[X_train.columns]  # enforce the training column order
new_data_scaled = scaler.transform(new_data)
predicted_pr = housing_model.predict(new_data_scaled)
print(f"Predicted PR: {predicted_pr[0]:.3f}")

In [None]:
predictions = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
display(predictions.head())

Unnamed: 0,Actual,Predicted
173,0.51,0.553223
274,0.447,0.477914
491,0.609,0.700797
72,0.413,0.446654
452,0.713,0.669173
