<a href="https://www.kaggle.com/code/shameerhussain5817/california-housing-prices-alternative-for-practice?scriptVersionId=195528597" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This project focuses on predicting house prices in King County, USA. It involves extensive data transformation and cleaning processes, along with experimenting with various models to identify the most effective one for accurate predictions. The project also explores multiple approaches to deepen understanding and improve the prediction outcomes.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
housing = pd.read_csv("/kaggle/input/housesalesprediction/kc_house_data.csv")

In [None]:
housing.head()

In [None]:
housing.info()

In [None]:
housing.describe()

Plot a histogram for every numeric column, with the column's values on the horizontal axis and the number of instances on the vertical axis.

In [None]:
import matplotlib.pyplot as plt

housing.hist(bins = 50, figsize = (12,8))
plt.show()

# Create Training and Test Set

1. shuffle_and_split_data function

In [None]:
def shuffle_and_split_data(data, test_ratio):
    # Shuffle the row positions, then take the first test_ratio share as the test set
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

In [None]:
train_set,test_set = shuffle_and_split_data(housing,0.2)
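
Because `np.random.permutation` is not seeded here, this function returns a different split on every run. The sketch below shows one way to make it reproducible. The `seed` parameter, the function name `shuffle_and_split_seeded`, and the toy DataFrame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def shuffle_and_split_seeded(data, test_ratio, seed):
    # `seed` is a hypothetical extra parameter, not in the original function
    rng = np.random.default_rng(seed)
    shuffled_indices = rng.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# Toy data standing in for the housing DataFrame
toy = pd.DataFrame({"id": range(100), "price": np.arange(100) * 1000})

train_a, test_a = shuffle_and_split_seeded(toy, 0.2, seed=42)
train_b, test_b = shuffle_and_split_seeded(toy, 0.2, seed=42)

print(len(train_a), len(test_a))
# Same seed, same rows in the test set
print(test_a.index.equals(test_b.index))
```

A fixed seed only helps while the dataset stays the same; if rows are added later, the permutation changes, which is what the id-hash approach in the next step addresses.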

2. Use a hash of each row's id to decide which set it belongs to, so that every run of this code produces the same training and test sets.

In [None]:
from zlib import crc32

def is_id_in_test_set(identifier,test_ratio):
    return crc32(np.int64(identifier))<test_ratio*2**32
def split_data_with_id_hash(data,test_ratio,ids_column):
    ids = data[ids_column]
    in_test_set = ids.apply(lambda id_ : is_id_in_test_set(id_,test_ratio))
    in_test_set = in_test_set.astype(bool)
    return data.loc[~in_test_set],data.loc[in_test_set]

In [None]:
train_set,test_set = split_data_with_id_hash(housing,0.2,'id')
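
The benefit of hashing the id is that a row's assignment depends only on its own id, not on how many other rows exist. The sketch below checks this on made-up integer ids (an assumption; the real `id` column is not used here): after new ids are appended, every old id keeps its original assignment.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def is_id_in_test_set(identifier, test_ratio):
    # Same rule as in the notebook: hash the id and compare against a threshold
    return crc32(np.int64(identifier)) < test_ratio * 2**32

# Made-up ids standing in for the dataset's id column
toy_ids = pd.Series(range(1000))
in_test_before = toy_ids.apply(lambda id_: is_id_in_test_set(id_, 0.2))

# Simulate a refreshed dataset that contains 500 additional rows
bigger_ids = pd.Series(range(1500))
in_test_after = bigger_ids.apply(lambda id_: is_id_in_test_set(id_, 0.2))

# The first 1000 ids land in the same set as before
unchanged = bool((in_test_before.values == in_test_after.values[:1000]).all())
print(unchanged)
print(f"{in_test_before.mean():.2%} of the toy ids fall in the test set")
```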

In [None]:
per_train_set = train_set.shape[0]/housing.shape[0]*100
per_test_set = test_set.shape[0]/housing.shape[0]*100
print(f"{per_train_set:.2f}%")
print(f"{per_test_set:.2f}%")

3. By using train_test_split

In [None]:
from sklearn.model_selection import train_test_split

train_set,test_set = train_test_split(housing,test_size = 0.2,random_state =442)

In [None]:
per_train_set = train_set.shape[0]/housing.shape[0]*100
per_test_set = test_set.shape[0]/housing.shape[0]*100
print(f"{per_train_set:.2f}%")
print(f"{per_test_set:.2f}%")

4. Stratified Shuffle Split

For this we need a feature that is likely to be important for predicting house prices, such as the grade that housing inspectors assign to each house. We bin its values into categories and then sample from each category in proportion, so that both the training set and the test set are representative of the full dataset. This reduces sampling bias.
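
As a rough illustration of why stratification helps, the sketch below compares category proportions in a stratified test set and a purely random one. It uses synthetic data with a hypothetical `cat` column (an assumption, not the notebook's `grade_cat`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with one rare category, standing in for a binned feature
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cat": rng.choice([1, 2, 3], size=1000, p=[0.05, 0.70, 0.25]),
    "value": rng.normal(size=1000),
})

_, strat_test = train_test_split(toy, test_size=0.2, stratify=toy["cat"], random_state=42)
_, rand_test = train_test_split(toy, test_size=0.2, random_state=42)

# Share of each category in the full data and in each test set
comparison = pd.DataFrame({
    "overall": toy["cat"].value_counts(normalize=True).sort_index(),
    "stratified": strat_test["cat"].value_counts(normalize=True).sort_index(),
    "random": rand_test["cat"].value_counts(normalize=True).sort_index(),
})
print(comparison.round(3))
```

The stratified column should track the overall proportions closely, while the random column can drift, especially for the rare category.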

In [None]:
# Get the value counts and sort by the grade index
grade_counts = housing["grade"].value_counts().sort_index()

# Plotting
grade_counts.plot(kind="bar")
plt.xlabel("Grade")
plt.ylabel("Number of Houses")
plt.title("Distribution of House Grades")
plt.show()

In [None]:
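# include_lowest=True keeps grade 1 in the first bin instead of turning it into NaN
housing["grade_cat"] = pd.cut(housing["grade"], bins=[1, 4, 7, 10, 13, np.inf],
                              labels=[1, 2, 3, 4, 5], include_lowest=True)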
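
A small sketch of how `pd.cut` treats bin edges, using a made-up series of grades (an assumption for illustration). Intervals are right-closed by default, so a value equal to the lowest edge falls outside every bin and becomes NaN unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

# Made-up grades: the lowest edge, an interior edge, an interior value, the top grade
toy_grades = pd.Series([1, 4, 5, 13])
bins = [1, 4, 7, 10, 13, np.inf]
labels = [1, 2, 3, 4, 5]

default_cut = pd.cut(toy_grades, bins=bins, labels=labels)
inclusive_cut = pd.cut(toy_grades, bins=bins, labels=labels, include_lowest=True)

# With the defaults, grade 1 is not in (1, 4] and becomes NaN
print(default_cut.isna().tolist())   # [True, False, False, False]
# With include_lowest=True, grade 1 joins the first category
print(inclusive_cut.tolist())        # [1, 1, 2, 4]
```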

In [None]:
housing["grade_cat"].value_counts().sort_index().plot.bar(rot =0, grid = True)
plt.xlabel("Grade category")
plt.ylabel("Number of Houses")
plt.show()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

housing_cleaned = housing.dropna(subset=["grade_cat"])
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []

for train_index, test_index in splitter.split(housing_cleaned, housing_cleaned["grade_cat"]):
    strat_train_set_n = housing_cleaned.iloc[train_index]
    strat_test_set_n = housing_cleaned.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])

Because the splitter was configured with n_splits=10, it produced ten different train/test splits. We use the first one.

In [None]:
train_set,test_set = strat_splits[0]

In [None]:
per_train_set = train_set.shape[0]/housing.shape[0]*100
per_test_set = test_set.shape[0]/housing.shape[0]*100
print(f"{per_train_set:.2f}%")
print(f"{per_test_set:.2f}%")

**Another way of doing stratified sampling**

In [None]:
train_set, test_set = train_test_split(
    housing_cleaned, test_size=0.2, stratify=housing_cleaned["grade_cat"], random_state=42)

In [None]:
per_train_set = train_set.shape[0]/housing.shape[0]*100
per_test_set = test_set.shape[0]/housing.shape[0]*100
print(f"{per_train_set:.2f}%")
print(f"{per_test_set:.2f}%")

In [None]:
test_set["grade_cat"].value_counts()/len(test_set)

You won't use the grade_cat column again, so you might as well drop it,
reverting the data back to its original state:

In [None]:
for set_ in (train_set, test_set):
    set_.drop("grade_cat", axis=1, inplace=True)

# See and Visualize the data to gain insights

In [None]:
housing = train_set.copy()

Since you're going to experiment with various transformations of the full training set, the cell above makes a copy of it so you can revert to the original afterwards.


In [None]:
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.image as mpimg
import urllib.request
from PIL import Image
import numpy as np

# Load your image from URL
url = "https://images.squarespace-cdn.com/content/v1/5bf9022b5cfd79f62c905004/1609870717560-TI9F0J5T71BOX54O6F9D/Geography-Region.png"
with urllib.request.urlopen(url) as response:
    img = Image.open(response)
    img = np.array(img)  # Convert the image to a numpy array for matplotlib

# Create a GridSpec with 2 columns, keeping the heights of both plots consistent
fig = plt.figure(figsize=(12, 8))
gs = gridspec.GridSpec(2, 2, width_ratios=[2, 1], height_ratios=[0.1, 2])

# Title for the whole figure
fig.suptitle("Scatter Plot and Original Image Side by Side", fontsize=16)

# Add title section (1st row, full width)
ax_title = plt.subplot(gs[0, :])
ax_title.text(0.5, 0.5, "This is a figure with two sections", ha='center', fontsize=14, weight='bold')
ax_title.axis('off')  # Hide the axes for the title section

# Scatter plot in the first subplot (2nd row, 1st column)
ax0 = plt.subplot(gs[1, 0])
housing.plot(kind="scatter", x="long", y="lat", grid=True, ax=ax0)
ax0.set_xlabel("Longitude")
ax0.set_ylabel("Latitude")
ax0.set_title("Scatter Plot of Housing Data")

# Image in the second subplot (2nd row, 2nd column)
ax1 = plt.subplot(gs[1, 1])
ax1.imshow(img)
ax1.set_title("Original Image")
ax1.axis('off')  # Hide the axes

# Adjust layout for better spacing
plt.tight_layout(rect=[0, 0, 1, 0.95])  # Leave space for the suptitle
plt.show()


In the figure below, red marks expensive houses and blue marks cheap ones. Larger circles indicate houses with a larger lot (sqft_lot).

In [None]:
# Circle size is driven by lot size, so the legend label says so
housing.plot(kind="scatter", x="long", y="lat", grid=True,
             s=housing["sqft_lot"] / 3000, label="lot size",
             c="price", cmap="jet", colorbar=True,
             legend=True, sharex=False, figsize=(10, 7))
plt.show()

In [None]:
housing = housing.drop("date",axis = 1)

In [None]:
corr_matrix = housing.corr()
corr_matrix["price"].sort_values(ascending=False)

The features sqft_living, grade, sqft_above, and sqft_living15 are the most strongly positively correlated with price.
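
For reference, the values returned by `corr()` are Pearson correlation coefficients by default. The sketch below computes one by hand on synthetic data (the numbers are made up and unrelated to the King County dataset) and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Synthetic living area and price with a linear relationship plus noise
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
price = 200 * sqft + rng.normal(0, 100_000, size=200)
toy = pd.DataFrame({"sqft_living": sqft, "price": price})

# Pearson r: covariance of the centered columns divided by the product of their spreads
x = toy["sqft_living"] - toy["sqft_living"].mean()
y = toy["price"] - toy["price"].mean()
r_manual = (x * y).sum() / np.sqrt((x**2).sum() * (y**2).sum())

r_pandas = toy.corr().loc["sqft_living", "price"]
print(round(r_manual, 4), round(r_pandas, 4))
```

Note that this coefficient only measures linear relationships, so a low value does not rule out a nonlinear dependence.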

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["price", "sqft_living", "grade",
"sqft_above", "sqft_living15"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()

In [None]:
housing.plot(kind="scatter", x="sqft_living", y="price",
alpha=0.1, grid=True)
plt.show()

The correlation is indeed quite strong;
you can clearly see the upward trend.

# Prepare the Data for Machine Learning Algorithms


Now separate the labels from the input features, since this is a supervised learning task (specifically, a regression task).

In [None]:
housing = train_set.drop("price", axis=1)
housing_labels = train_set["price"].copy()

In [None]:
housing_labels

Clean the Data

In [None]:
from sklearn.impute import SimpleImputer
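# .copy() gives an independent frame, so the column assignments below do not trigger SettingWithCopyWarning
housing_num = housing.select_dtypes(include=[np.number]).copy()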

# Step 1: Explicitly cast each column to float64 first
housing_num['sqft_basement'] = housing_num['sqft_basement'].astype('float64')
housing_num['yr_renovated'] = housing_num['yr_renovated'].astype('float64')

# Step 2: Replace 0 with NaN
housing_num.loc[:, ["sqft_basement", "yr_renovated"]] = housing_num.loc[:, ["sqft_basement", "yr_renovated"]].replace(0, np.nan)


imputer = SimpleImputer(strategy= "most_frequent")
imputer.fit(housing_num)

In [None]:
X = imputer.transform(housing_num)

In [None]:
housing_tr = pd.DataFrame(X,columns= housing_num.columns,index = housing_num.index)
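
To make the imputation step concrete, here is a minimal sketch on a tiny made-up frame (the values are assumptions, not rows from the dataset). Zeros are first turned into NaN, then `SimpleImputer(strategy="most_frequent")` fills each gap with that column's most common value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values; 0 stands for "no basement" / "never renovated"
toy = pd.DataFrame({
    "sqft_basement": [0.0, 500.0, 500.0, 800.0, 0.0],
    "yr_renovated": [0.0, 0.0, 1990.0, 2005.0, 1990.0],
})
toy = toy.replace(0, np.nan)

toy_imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(toy_imputer.fit_transform(toy), columns=toy.columns)

# The learned fill value for each column
print(toy_imputer.statistics_)   # [ 500. 1990.]
print(filled)
```

Whether a zero really means "missing" is a modeling choice: for these two columns it may equally mean that the house has no basement or was never renovated.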

In [None]:
# Convert the whole-number columns back to int64 after imputation.
# bathrooms and floors stay as floats because they take fractional values (e.g. 2.25, 1.5).
columns_to_convert = ["bedrooms", "waterfront", "view", "condition",
                       "grade", "sqft_above", "sqft_basement", "yr_built", "yr_renovated",
                       "zipcode"]

# Convert columns to integers
housing_tr[columns_to_convert] = housing_tr[columns_to_convert].astype('int64')

In [None]:
housing = housing.drop("date",axis = 1)

# Feature Scaling and Transformation to make sample predictions

In [None]:
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

some_new_data = housing[["sqft_living"]].iloc[:5]

# Scale the labels before fitting and convert predictions back to dollars afterwards
model = TransformedTargetRegressor(LinearRegression(),
                                   transformer=StandardScaler())
model.fit(housing[["sqft_living"]], housing_labels)
sample_predictions = model.predict(some_new_data)

In [None]:
sample_predictions.round(2)
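
`TransformedTargetRegressor` scales the labels before fitting and applies the inverse transform to the predictions. The sketch below, on synthetic data (an assumption for illustration), does the same steps by hand and checks that both routes give matching predictions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic living area (feature) and price (label)
rng = np.random.default_rng(0)
X_toy = rng.uniform(500, 4000, size=(100, 1))
y_toy = 200 * X_toy[:, 0] + rng.normal(0, 50_000, size=100)

# Manual route: scale the labels, fit, predict, then undo the scaling
target_scaler = StandardScaler()
y_scaled = target_scaler.fit_transform(y_toy.reshape(-1, 1))
manual_model = LinearRegression().fit(X_toy, y_scaled)
manual_pred = target_scaler.inverse_transform(manual_model.predict(X_toy[:5]))

# Wrapped route: the same steps handled by TransformedTargetRegressor
wrapped = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
wrapped.fit(X_toy, y_toy)
wrapped_pred = wrapped.predict(X_toy[:5])

print(np.allclose(manual_pred.ravel(), wrapped_pred))
```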

# Now through Transformation Pipelines


In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, make_column_transformer


attributes = [
    ['id', 'price'],
    ['bedrooms', 'bathrooms', 'sqft_living'],
    ['sqft_lot', 'floors', 'waterfront'],
    ['view', 'condition', 'grade'],
    ['sqft_above', 'sqft_basement', 'yr_built'],
    ['yr_renovated', 'zipcode', 'lat'],
    ['long', 'sqft_living15', 'sqft_lot15']
]


pipeline = make_pipeline(SimpleImputer(strategy = "median"),StandardScaler())

preprocessing = make_column_transformer(
(pipeline, make_column_selector(dtype_include=np.number)))

In [None]:
housing_prepared = preprocessing.fit_transform(housing_num)
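
A quick way to sanity-check a preprocessing pipeline like this one is to run it on a tiny frame and inspect the result. The sketch below uses made-up rows (an assumption) with missing values and one non-numeric column. The selector keeps only numeric columns, the imputer fills gaps with the median, and the scaler leaves each column with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows with gaps and a non-numeric column
toy = pd.DataFrame({
    "sqft_living": [1000.0, 1500.0, np.nan, 3000.0],
    "bedrooms": [2.0, 3.0, 3.0, np.nan],
    "date": ["20140502", "20140503", "20140504", "20140505"],
})

toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_preprocessing = make_column_transformer(
    (toy_pipeline, make_column_selector(dtype_include=np.number)))

toy_prepared = toy_preprocessing.fit_transform(toy)

# Two numeric columns survive; the date column is dropped
print(toy_prepared.shape)
print(toy_prepared.mean(axis=0).round(6), toy_prepared.std(axis=0).round(6))
```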

# Select and Train a Model


1. Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(housing, housing_labels)

In [None]:
housing_predictions = lin_reg.predict(housing)
housing_predictions[:5].round(-2) # -2 = rounded to the nearest hundred

In [None]:
housing_labels.iloc[:5].values

Now check the difference between the predictions and the actual prices using the root mean squared error (RMSE).

In [None]:
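from sklearn.metrics import mean_squared_error
# Take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn releases
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))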
lin_rmse
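
As a reminder of what this metric measures, the sketch below computes the RMSE by hand on three made-up prices (assumed values, not model output) and compares it with the scikit-learn result. The RMSE is the square root of the mean squared difference between labels and predictions, so it is expressed in the same unit as the price.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true prices and predictions, in dollars
y_true = np.array([300_000.0, 450_000.0, 600_000.0])
y_pred = np.array([320_000.0, 400_000.0, 650_000.0])

# Square the errors, average them, then take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(round(rmse_manual, 2), round(rmse_sklearn, 2))
```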

# Model is Underfitting the data

This is a large error. Prices range from 75,000 dollars to 7,700,000 dollars, and a linear model is too simple to capture that spread, so it underfits the training data.

Let's try another model, a decision tree regressor.

2. Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(housing, housing_labels)

In [None]:
housing_predictions = tree_reg.predict(housing)
tree_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
tree_rmse

The training error is not zero, but it is far lower than the linear model's. A low training error can also mean that the tree is overfitting, so let's evaluate it on the test set.
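
To illustrate why the training error of a decision tree should be read with caution, the sketch below fits an unconstrained tree on synthetic noisy data (an assumption, unrelated to the real dataset) and compares its RMSE on the rows it was trained on with its RMSE on held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic living area and noisy price
rng = np.random.default_rng(0)
X_toy = rng.uniform(500, 4000, size=(500, 1))
y_toy = 200 * X_toy[:, 0] + rng.normal(0, 100_000, size=500)

X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# An unconstrained tree can memorize the training rows, noise included
toy_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, toy_tree.predict(X_tr)))
val_rmse = np.sqrt(mean_squared_error(y_val, toy_tree.predict(X_val)))
print(f"train RMSE: {train_rmse:,.0f}  held-out RMSE: {val_rmse:,.0f}")
```

The gap between the two numbers is the signal to look at: a model that scores far better on its own training rows than on unseen rows has fitted noise rather than structure.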

# Try the best model on the test set

In [None]:
X_test = test_set.drop("price", axis=1)
y_test = test_set["price"].copy()
predictions = tree_reg.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(final_rmse)

In [None]:
final_predictions = predictions.round(2)

for i, pred in enumerate(final_predictions[:5], start=1):
    print(f"Prediction of price for house No {i}: ${int(pred):,}")


# Finally Display these predictions using matplotlib

In [None]:
import matplotlib.pyplot as plt

# Sample final predictions
final_predictions = predictions.round(2)

# Get the first 5 predictions
house_numbers = [f'House No {i}' for i in range(1, 6)]
predicted_prices = [f"${int(pred):,}" for pred in final_predictions[:5]]

# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# Plot the predictions
ax.barh(house_numbers, final_predictions[:5], color='lightblue')

# Add labels
for i, v in enumerate(final_predictions[:5]):
    ax.text(v, i, f" {predicted_prices[i]}", va='center', fontsize=12)

ax.set_xlabel('Price ($)')
ax.set_title('Final Predictions for Houses')

# Remove x-axis labels for a cleaner look
ax.get_xaxis().set_ticks([])
plt.savefig('final_predictions.png', dpi=300, bbox_inches='tight')

# Show the plot
plt.show()
