
# <h3> **Project: Determine Median Price of Houses Using Regression Trees - Brandon Andrews** </h3>

<h4>Problem:</h4>
The goal of this project is to predict the median price of houses using various features such as crime rate, number of rooms, 
and more. The challenge is to build a regression model that can accurately estimate house prices based on these features. Key 
questions include:
1. Can we accurately predict house prices using a regression model?
2. How does a regression tree perform in predicting these prices compared to other methods?

<h4>Solution:</h4>
I used a Regression Tree model to solve the problem. The steps taken are outlined below:

1. **Install Necessary Libraries:**
   - Installed the required Python libraries such as pandas, matplotlib, scikit-learn, and numpy.

2. **Load and Explore the Dataset:**
   - The dataset was loaded from a CSV file using pandas.
   - Checked the dataset size (506 rows, 13 columns) and counted missing values per column (20 each in CRIM, ZN, INDUS, CHAS, AGE, and LSTAT).

3. **Data Preprocessing:**
   - Removed rows containing missing values from the dataset using dropna().
   - Features (X) and target variable (Y) were separated, with X containing house features and Y representing the median price (MEDV).
   - Split the data into training and test sets using train_test_split.

4. **Train a Regression Tree:**
   - A **DecisionTreeRegressor** was trained using the **mean squared error (MSE)** criterion.
   - The model was fit to the training data to learn from the house features and predict prices.

5. **Model Evaluation:**
   - The model was evaluated on the test set using the score() method to calculate the R-squared value.
   - The prediction error was calculated as the mean absolute difference between predicted and actual test set values, multiplied by 1,000 to report it in dollars.

6. **Alternative Regression Tree (MAE Criterion):**
   - A second regression tree was trained using **mean absolute error (MAE)** as the criterion to compare performance with the first tree.
   - The R-squared value and average prediction error were reported to assess model performance.

<h4>Conclusion:</h4>
The regression tree predicted median house prices from the features with an R-squared of about 0.85 on the test set using the MSE criterion and about 0.87 using the MAE criterion, with average prediction errors of roughly $2,743 and $2,616 respectively. 
No other regression method was fitted in this notebook, so the comparison raised in the second question remains open.
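
The second question asks how a regression tree compares with other methods. A minimal sketch of one way to set up such a comparison follows. It assumes a linear regression baseline and synthetic data from `make_regression`. Neither is part of the original notebook, and the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): NOT the housing dataset.
X_syn, y_syn = make_regression(n_samples=300, n_features=12, noise=15.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

# Fit both models on the same training split and score them on the same test split.
models = {
    "regression tree": DecisionTreeRegressor(random_state=1),
    "linear regression": LinearRegression(),
}
r2 = {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in models.items()}

for name, value in r2.items():
    print(name, round(value, 3))
```

The synthetic target is linear by construction, so the ranking here says nothing about the housing data. The point is only that both models see identical splits.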


----


1. Install Libraries 

In [2]:
# Install libraries not already in the environment using pip
#!pip install pandas==1.3.4
#!pip install scikit-learn==0.20.1

In [3]:
# Pandas will allow us to create a dataframe of the data so it can be used and manipulated
import pandas as pd
# Regression Tree Algorithm
from sklearn.tree import DecisionTreeRegressor
# Split our data into training and testing sets
from sklearn.model_selection import train_test_split

2. Load Data

In [4]:
data = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv")

In [5]:
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,,36.2


Now let's check the size of our data: there are 506 rows and 13 columns.


In [6]:
data.shape

(506, 13)

In [7]:
data.isna().sum()

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
LSTAT      20
MEDV        0
dtype: int64
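
The missing values are spread across six columns, so `dropna()` may discard more than 20 rows. A minimal sketch of counting the affected rows before dropping them, using a small made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.4],
    "LSTAT": [5.0, 9.1, np.nan, 2.9],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# Rows containing at least one NaN are the ones dropna() will discard.
rows_with_nan = int(toy.isna().any(axis=1).sum())
cleaned = toy.dropna()

print(rows_with_nan, len(toy), len(cleaned))
```

Running the same `isna().any(axis=1).sum()` check on `data` would show how many of the 506 rows are lost.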

3. Data Preprocessing

In [8]:
data.dropna(inplace=True)

In [9]:
data.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
LSTAT      0
MEDV       0
dtype: int64

In [10]:
X = data.drop(columns=["MEDV"])
Y = data["MEDV"]

In [11]:
X.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3,222,18.7,5.21


In [12]:
Y.head()

0    24.0
1    21.6
2    34.7
3    33.4
5    28.7
Name: MEDV, dtype: float64

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)
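
With `test_size=.2`, 20% of the rows are held out for testing, and `random_state=1` makes the split reproducible. A small sketch on made-up arrays (not the housing data) that checks both properties:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)
# Repeating the call with the same seed should give the same partition.
a_train2, a_test2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, a_test2))
```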

4. Train Regression Tree

In [14]:
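# 'mse' is the criterion name in scikit-learn 0.20.1; it was renamed to 'squared_error' in scikit-learn 1.0
regression_tree = DecisionTreeRegressor(criterion = 'mse')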

In [15]:
regression_tree.fit(X_train, Y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
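
The output above comes from an older scikit-learn release. One possible way to keep the cell portable across releases is to choose the criterion string from the installed version. The sketch below uses a hypothetical helper, `criterion_name`, which is not part of scikit-learn or the original notebook.

```python
import re

import sklearn
from sklearn.tree import DecisionTreeRegressor


def criterion_name(kind):
    """Hypothetical helper: map 'mse'/'mae' to the name the installed version accepts."""
    major, minor = map(int, re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())
    new_api = (major, minor) >= (1, 0)  # names changed in scikit-learn 1.0
    names = {
        "mse": "squared_error" if new_api else "mse",
        "mae": "absolute_error" if new_api else "mae",
    }
    return names[kind]


# Tiny made-up training set, just to show the estimator accepts the chosen name.
tree = DecisionTreeRegressor(criterion=criterion_name("mse"), random_state=0)
tree.fit([[0.0], [1.0], [2.0], [3.0]], [0.0, 1.0, 2.0, 3.0])
print(tree.get_params()["criterion"])
```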

5. Evaluation

In [16]:
regression_tree.score(X_test, Y_test)

0.8517429863227787

In [17]:
prediction = regression_tree.predict(X_test)

# Mean absolute error; MEDV is in thousands of dollars, so multiply by 1000
print("$",(prediction - Y_test).abs().mean()*1000)

$ 2743.037974683544
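
For a regressor, `score()` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, and the dollar figure above is a mean absolute error. A minimal sketch of both computed by hand on made-up prices (in thousands of dollars, not taken from the dataset) and checked against `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted prices, in thousands of dollars.
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([22.0, 23.6, 30.7, 33.4])

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Mean absolute error, converted from thousands of dollars to dollars.
mae_dollars = np.mean(np.abs(y_pred - y_true)) * 1000

print(r2_manual, r2_score(y_true, y_pred))
print(mae_dollars, mean_absolute_error(y_true, y_pred) * 1000)
```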


6. Alt Regression Tree

Train a regression tree using the `criterion` `mae` (`absolute_error` in scikit-learn 1.0 and later), then report its $R^2$ value and average error.


In [18]:
regression_tree = DecisionTreeRegressor(criterion = "mae")

regression_tree.fit(X_train, Y_train)

print(regression_tree.score(X_test, Y_test))

prediction = regression_tree.predict(X_test)

print("$",(prediction - Y_test).abs().mean()*1000)

0.8677767519304183
$ 2616.455696202531


