Hello
I will use some of what I learned (or copied 🙂) from Chapter 2 of Hands-On ML by Aurélien Géron to create a prediction model.

For any comments please take it easy on me. I work in accounting and have been programming for a little over 2 years. 🙂

**Imports and looking at the data**



In [None]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer

HOUSING_PATH = "/kaggle/input/usa-housing/"
FILE_NAME = "USA_Housing.csv"

def load_data(housing_path = HOUSING_PATH, file_name = FILE_NAME):
    csv_path = os.path.join(housing_path, file_name)
    return pd.read_csv(csv_path)

housing = load_data()

housing.head()

Now let's check the summary of the DataFrame: its columns, their non-null counts and their data types.

In [None]:
housing.info()

The DataFrame is made up of 5000 rows. Luckily for us we have no missing values. We also see that all of the columns contain numbers (floats) except the “Address” column.

The median income or average income is usually the main indicator for housing prices. We have an “Avg. Area Income” column in the DataFrame. Let’s create a category by using the “Avg. Area Income” column.

In [None]:
# Bin the incomes into five categories labeled with the integers 1 to 5
housing["income_cat"] = pd.cut(housing["Avg. Area Income"],
                               bins=[0, 40000, 60000, 80000, 90000, 110000],
                               labels=[1, 2, 3, 4, 5])

pd.cut(housing["Avg. Area Income"],
             bins=[0, 40000, 60000, 80000, 90000, 110000]).value_counts()

housing["income_cat"].value_counts()
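
To see what `pd.cut` does with the bin edges, here is a minimal sketch on a handful of made-up incomes (the values are my own, not taken from the data set). It uses the same bins and labels as above.

```python
import pandas as pd

# Hypothetical incomes, not taken from USA_Housing.csv
incomes = pd.Series([35000, 40000, 59999.5, 75000, 85000, 105000])

income_cat = pd.cut(incomes,
                    bins=[0, 40000, 60000, 80000, 90000, 110000],
                    labels=[1, 2, 3, 4, 5])

# Bins include their right edge by default, so 40000 falls in category 1
print(income_cat.tolist())
print(income_cat.dtype)
```

Any value outside the outermost bin edges would come back as NaN, so it is worth checking that the bins cover the whole income range.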

**Create a train and test set for our prediction model**

It is good practice to set aside a test set and not look at it. If we don’t, we may fall prey to data snooping. I got the simplest explanation of data snooping from quantitrader.com. Their website says: “Data snooping refers to statistical inference that the researcher decides to perform after looking at the data”.

Since we have 5 categories in our data it is important that we draw out samples in proportion to those categories. This is referred to as stratified sampling. It means that the sample should be representative of the overall population. The below shows how the data is split by percentage between the 5 categories.

In [None]:
housing["income_cat"].value_counts() / len(housing) *100

From the above we see that category 3 represents most of the population with 65.44%. Now let's do some stratified sampling.

In [None]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Drop 'income_cat' so that the data is back to its original state
# (the loop variable is named set_ to avoid shadowing Python's built-in set)
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
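
One way to convince yourself that the stratified split worked is to compare the category proportions in the test set with those in the full data. The sketch below does that on a made-up DataFrame (the column names `value` and `cat` and the category sizes are my own assumptions), since the real `income_cat` column has already been dropped at this point.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 1,000 rows in three unevenly sized categories
df = pd.DataFrame({
    "value": np.arange(1000),
    "cat": [1] * 100 + [2] * 700 + [3] * 200,
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(df, df["cat"]))
test_set = df.iloc[test_index]

# Share of each category in the full data versus the test set
overall = df["cat"].value_counts(normalize=True).sort_index()
in_test = test_set["cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "test": in_test})
print(comparison)
```

The two columns should be very close to each other. With a purely random split they could drift apart, especially for the small categories.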

**Visualize data to gain insights**

Before we start, let's create a copy of our training set.

In [None]:
housing = strat_train_set.copy()
housing

Let’s see how much each attribute (column) correlates with the “Price”.

In [None]:
# "Address" holds text, so leave it out before computing the correlations
corr_matrix = housing.drop("Address", axis=1).corr()
corr_matrix["Price"].sort_values(ascending=False)

As mentioned earlier, median income or average income is usually the main indicator for housing prices. The correlations above show that “Avg. Area Income” does indeed have the strongest correlation with the “Price”.

We can visualize how the 3 top attributes correlate with the house price via a scatter matrix.

In [None]:
import matplotlib.pyplot as plt

attributes = ["Price", "Avg. Area Income", "Avg. Area House Age", "Area Population"]
pd.plotting.scatter_matrix(housing[attributes],figsize=(12,8))
plt.show()


Now we can zoom in on “Avg. Area Income” and “Price” since that pair shows the most promise.

In [None]:
housing.plot(kind="scatter", x="Avg. Area Income", y="Price", alpha=0.5)
plt.show()

The above reveals a strong correlation, as the points are not too dispersed. We can use the Avg. Area Income in our prediction model to determine the house prices.

**Prepare data for the machine learning algorithm**

First we need to separate the predictors and the labels because we do not want to apply the same transformations to them.

In [None]:
housing = strat_train_set.drop("Price", axis=1)
housing_labels = strat_train_set["Price"].copy()
housing
housing_labels

We do not have any missing data in our DataFrame. But you might get another data set that has missing values. Therefore we need to account for missing values by using Scikit-Learn’s SimpleImputer.

In [None]:
imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("Address",axis=1)
imputer.fit(housing_num)

All we are saying here is that if we encounter a missing value, we want to replace it with the median value of its column. We can check that the median is indeed being used by executing the code below.

In [None]:
imputer.statistics_

In [None]:
housing_num.median().values

Both results are the same, which is what we wanted. Now we can use the trained imputer to transform the training set by replacing the missing values with the learned medians. This won’t do anything for this data set because it has no missing values, but it will be very handy for data sets that do.

In [None]:
# Returns a Numpy array
X = imputer.transform(housing_num)
# Turn the array into a Pandas DataFrame
housing_tr = pd.DataFrame(X,columns=housing_num.columns)
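
Since our data has no gaps, here is a small sketch of the imputer at work on a made-up DataFrame that does have missing values. The column names `rooms` and `age` and all the numbers are my own, purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one missing value in each column
df = pd.DataFrame({
    "rooms": [5.0, 6.0, np.nan, 8.0, 7.0],
    "age": [10.0, np.nan, 30.0, 20.0, 40.0],
})

demo_imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(demo_imputer.fit_transform(df), columns=df.columns)

# Medians learned from the non-missing values of each column
print(demo_imputer.statistics_)
print(filled)
```

The missing `rooms` value is filled with 6.5 and the missing `age` value with 25.0, which are the medians of the remaining four values in each column.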

Let’s now build a small pipeline. The pipeline in the Hands-On ML book is bigger than what we are going to do here. The Pipeline class helps to execute the data transformation steps in the right order. We will also scale our data using Scikit-Learn’s StandardScaler.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

housing_tr = pipeline.fit_transform(housing_num)
housing_tr
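
If you are wondering what the scaling step actually does, the sketch below compares StandardScaler with the same calculation done by hand: subtract the column mean and divide by the column standard deviation. The incomes are made-up numbers, not values from the data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature on a large scale
incomes = np.array([[50000.0], [60000.0], [70000.0], [80000.0], [90000.0]])

scaled = StandardScaler().fit_transform(incomes)

# Same thing by hand: (value - mean) / standard deviation, per column
by_hand = (incomes - incomes.mean(axis=0)) / incomes.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, by_hand))
```

After scaling, the column has a mean of 0 and a standard deviation of 1, so features measured in very different units end up on a comparable scale.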

**Select and Train a model**

I found out through practice (the few models I built) and by listening to Dr Eugene Dubossarsky that the RandomForestRegressor is the best choice in most cases. Because of this I’m not going to show how different models compare with each other. So let’s create a model, fit it and save some data and actual values to predict on later.

In [None]:
from sklearn.ensemble import RandomForestRegressor

my_model = RandomForestRegressor()
my_model.fit(housing_tr, housing_labels)

# Get some data and some labels: the first 10 instances
some_data = housing_num.iloc[:10]
some_labels = housing_labels.iloc[:10]

some_data_prepared = pipeline.transform(some_data)

Now let’s calculate some indicators like MSE (mean squared error) and RMSE (root mean squared error).

In [None]:
from sklearn.metrics import mean_squared_error

housing_predictions = my_model.predict(housing_tr)
my_model_mse = mean_squared_error(housing_labels,housing_predictions)
my_model_rmse = np.sqrt(my_model_mse)

"Model Root Mean Squared Error:", my_model_rmse

The price of a house ranges between USD15,938 and USD2,469,065, which is huge. The RMSE is equal to USD45,104, which means a typical prediction error on the training set will be around USD45,104. Is the RMSE too big? I don’t think so, but let's use cross-validation before we predict and then see.
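
For intuition about what the RMSE measures, here is the calculation done by hand on three made-up prices (the numbers are mine, not from the model above): take the errors, square them, average them, then take the square root.

```python
import numpy as np

# Hypothetical actual and predicted prices
actual = np.array([200000.0, 300000.0, 400000.0])
predicted = np.array([210000.0, 280000.0, 400000.0])

errors = predicted - actual       # 10000, -20000, 0
mse = np.mean(errors ** 2)        # average of the squared errors
rmse = np.sqrt(mse)               # back in the same units as the prices (USD)
print(mse, rmse)
```

Because the errors are squared before averaging, one large miss pushes the RMSE up more than several small ones would.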

In [None]:
from sklearn.model_selection import cross_val_score

my_model_scores = cross_val_score(RandomForestRegressor(), housing_tr, housing_labels,
                                  scoring="neg_mean_squared_error", cv=10)
# The scores are negative MSE values, so flip the sign before taking the root
my_model_rmse_scores = np.sqrt(-my_model_scores)

def display_scores(scores):
    print("Mean: ", scores.mean())
    print("Standard deviation: ", scores.std())

display_scores(my_model_rmse_scores)

We get a standard deviation of USD4,704. Is that good or bad? I think it’s good.
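
To show what `cross_val_score` does behind the scenes, the sketch below writes out the 10 folds by hand and checks that the results match. To keep it quick and repeatable I swapped in LinearRegression and randomly generated data, so this is an illustration of the idea rather than a rerun of the model above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: y depends linearly on X plus some noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Train on 9 folds, measure the RMSE on the held-out fold, repeat 10 times
kfold = KFold(n_splits=10)
manual_rmse = []
for train_idx, valid_idx in kfold.split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    errors = model.predict(X_demo[valid_idx]) - y_demo[valid_idx]
    manual_rmse.append(np.sqrt(np.mean(errors ** 2)))
manual_rmse = np.array(manual_rmse)

# The same thing in one call
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring="neg_mean_squared_error", cv=kfold)
print(np.allclose(manual_rmse, np.sqrt(-scores)))
```

Each of the 10 scores comes from data the model did not see during training, which is why the cross-validated RMSE is a more honest estimate than the RMSE measured on the training set.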

**Predictions**

Finally we can make some predictions and compare them with the actual prices.

In [None]:
# Set prediction price and actual price variables
predictions = np.ceil(my_model.predict(some_data_prepared))
actuals = np.ceil(list(some_labels))

for p, a in zip(predictions, actuals):
    percentage_diff = (a - p) / a * 100
    percentage_diff = round(percentage_diff,2)
    print("Prediction:", int(p), "Actual:", int(a), "Percentage difference:", percentage_diff)

The percentage difference ranges from -2.2% to 4.3%, which is a spread of about 6.5 percentage points. Is it good? I think so again. Maybe you can build a better one. Maybe I missed something that you can see and improve on.
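
If you want a single number instead of eyeballing the list, one option is to summarize the percentage differences. The sketch below does that on made-up prices (not the model's output), using the same formula as the loop above.

```python
import numpy as np

# Hypothetical predicted and actual prices
demo_predictions = np.array([1020000.0, 950000.0, 1500000.0, 800000.0])
demo_actuals = np.array([1000000.0, 1000000.0, 1500000.0, 780000.0])

percentage_diff = (demo_actuals - demo_predictions) / demo_actuals * 100

print(np.round(percentage_diff, 2))
print("Spread:", round(percentage_diff.max() - percentage_diff.min(), 2))
print("Mean absolute difference:", round(np.abs(percentage_diff).mean(), 2))
```

The spread tells you how far apart the best and worst cases are, while the mean absolute difference tells you how far off a typical prediction is, regardless of direction.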