# Project: House Sales

This project aims to determine the market price of a house based on a set of features. We will analyze and predict housing prices using attributes such as:

- Square footage of the house and/or lot
- Number of bedrooms
- Number of bathrooms
- Number of floors
- Age of the house
- Location (e.g., zip code, neighborhood)
- Condition or quality of the house
- Additional features like a swimming pool, garage, or garden

By applying various data analysis and machine learning techniques, we will build a model to predict house prices and explore the factors that influence the real estate market.


## About the Dataset

This dataset contains house sale prices for **King County**, which includes **Seattle**, and covers homes sold between **May 2014 and May 2015**. It includes various features related to the houses such as square footage, number of bedrooms, and more. 

The dataset was sourced from [Kaggle - House Sales Prediction](https://www.kaggle.com/harlfoxem/housesalesprediction?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-wwwcourseraorg-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2022-01-01).

## Import the required libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

## Load the Dataset and Display First Few Rows

Let's load the dataset and take a look at the first few rows to understand its structure.

In [2]:
df = pd.read_csv("kc_house_data.csv")
df.head()

| Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


## Find the Feature Most Correlated with Price

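We can use the pandas `corr()` method to compute the Pearson correlation between each numeric feature and the target variable (`price`). Sorting the result shows which feature is most strongly correlated with house prices. Non-numeric columns such as `date` are excluded from the calculation.

In [3]:
# numeric_only=True skips non-numeric columns such as `date`;
# recent pandas versions raise an error without it.
df.corr(numeric_only=True)['price'].sort_values()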

zipcode         -0.053203
id              -0.016762
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308350
sqft_basement    0.323816
view             0.397293
bathrooms        0.525138
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64
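
To make the numbers above more concrete, here is a minimal sketch of the Pearson coefficient computed by hand. It uses only the five `sqft_living` and `price` values shown in the `df.head()` output, so the result will not match the full-dataset value of 0.702; the helper name `pearson_r` is hypothetical and introduced only for illustration.

```python
import math

# The five (sqft_living, price) pairs from the df.head() output above.
sqft_living = [1180, 2570, 770, 1960, 1680]
price = [221900.0, 538000.0, 180000.0, 604000.0, 510000.0]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(sqft_living, price)
print(f"Pearson r on five sample rows: {r:.3f}")
```

A value near 1 means the two columns rise together almost linearly, a value near -1 means one falls as the other rises, and a value near 0 means little linear relationship.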

## Fit a Linear Regression Model and Calculate R-Squared Score

Now, let's fit a linear regression model on the eleven features most strongly correlated with the target variable (`price`), from `floors` up to `sqft_living` in the ranking above. We evaluate the model with the R-squared score, which here is computed on the same data the model was fitted on.


In [4]:
features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view",
            "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]
X = df[features]
Y = df["price"]
lm = LinearRegression()
lm.fit(X,Y)
lm.score(X,Y)

0.6576821190183728
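
For reference, the value returned by `score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand and compares it with `LinearRegression.score`. It runs on synthetic data (an assumption, since the housing file is not bundled here), and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the King County dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree up to floating-point error.
print(r2_manual, model.score(X_demo, y_demo))
```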

## Create a Pipeline to Predict Price and Calculate R-Squared Score

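We will use a pipeline to chain data preprocessing and model fitting into a single estimator. The pipeline includes the following steps:
1. Scaling the features with `StandardScaler`.
2. Expanding the scaled features with `PolynomialFeatures` (second order by default).
3. Fitting a linear regression model to predict the `price`.

After fitting, we calculate the R-squared score of the whole pipeline on the data used for fitting.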

In [5]:
# Each step is a (name, estimator) pair; the last step is the model.
steps = [
    ('scale', StandardScaler()),
    ('polynomial', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression()),
]
pipe = Pipeline(steps)
X = df[features]
Y = df['price']
pipe.fit(X, Y)
pipe.score(X, Y)

0.7513474236019944
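
Part of the improvement comes from the much larger feature space the polynomial step creates. As a rough sketch, the following counts the columns a second-order expansion produces for eleven inputs; the placeholder array of zeros is an assumption used only to tell the transformer how many input columns there are.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 11  # length of the `features` list above

# Fitting only needs the number of columns, so a row of zeros suffices.
placeholder = np.zeros((1, n_inputs))

poly_no_bias = PolynomialFeatures(degree=2, include_bias=False).fit(placeholder)
poly_with_bias = PolynomialFeatures(degree=2, include_bias=True).fit(placeholder)

# 11 linear terms + 11 squares + 55 pairwise products = 77 columns.
print(poly_no_bias.n_output_features_)    # 77
# include_bias=True adds one constant column.
print(poly_with_bias.n_output_features_)  # 78
```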

## Model Evaluation and Refinement Using Scikit-Learn

In [6]:
# Import the necessary modules
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import Ridge

### Split the Data into Training and Testing Sets

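We will divide the dataset into training and testing sets, holding out 15% of the rows for testing. The training set is used to fit the model, while the testing set is used to evaluate the model's performance on data it has not seen. Fixing `random_state` makes the split reproducible.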

In [7]:
X = df[features]
Y = df['price']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)

print("number of test samples:", x_test.shape[0])
print("number of training samples:",x_train.shape[0])

number of test samples: 3242
number of training samples: 18371
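
The counts above can be checked with a little arithmetic. This sketch assumes that `train_test_split` rounds the test share up and gives the remaining rows to the training set, which is consistent with the numbers printed above.

```python
import math

n_samples = 3242 + 18371  # total rows, from the counts printed above
test_size = 0.15

# Assumption: the test share is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
# The remaining rows form the training set.
n_train = n_samples - n_test

print(n_samples, n_test, n_train)
```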


### Create and Fit a Ridge Regression Model

We will now create a Ridge regression model, set the regularization parameter (alpha) to 0.1, and evaluate the model's performance using the R-squared score on the test data.

In [8]:
RM = Ridge(alpha = 0.1)
RM.fit(x_train,y_train)
score = RM.score(x_test,y_test)
print(score)

0.6480374087702241
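
The value `alpha = 0.1` is fixed by hand here, and `cross_val_score` is imported above but not used. As a hedged sketch of how the two could be combined, the following compares a few candidate `alpha` values with 4-fold cross-validation. The data is synthetic, and the grid `candidate_alphas` is a hypothetical choice, not something taken from the original analysis.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train (an assumption for illustration).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=1
)

candidate_alphas = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid
mean_scores = {}
for alpha in candidate_alphas:
    # 4-fold cross-validation; the default score for regressors is R^2.
    fold_scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo, cv=4)
    mean_scores[alpha] = fold_scores.mean()

# Pick the alpha with the highest mean cross-validated R^2.
best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best alpha:", best_alpha)
```

Cross-validating on the training rows keeps the test set untouched until the final evaluation.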


### Perform a Second Order Polynomial Transformation and Fit Ridge Regression

Next, we apply a second-order polynomial transformation to both the training and testing data, which lets the model capture non-linear relationships between the features and the target. We then create and fit a Ridge regression model on the transformed training data and calculate the R-squared score on the transformed test data.

In [9]:
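pf = PolynomialFeatures(degree=2)
# Fit the transformer on the training data only
x_train_pf = pf.fit_transform(x_train)
# Reuse the fitted transformer on the test data (transform, not fit_transform)
x_test_pf = pf.transform(x_test)
RM = Ridge(alpha=0.1)
RM.fit(x_train_pf, y_train)
score = RM.score(x_test_pf, y_test)
print(score)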

0.7004432078660063
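
The transformation and the model can also be wrapped in a single `Pipeline`, so that every preprocessing step is fitted on the training rows only and then reused on the test rows. The sketch below is one possible arrangement, not the method used above: it adds a `StandardScaler` step, runs on synthetic data with a deliberate quadratic term, and uses hypothetical names such as `poly_ridge` and `xd_train`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term (an assumption, not the housing data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 2))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

xd_train, xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1
)

poly_ridge = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge(alpha=0.1)),
])

# fit() learns the scaling statistics from the training rows only;
# score() reuses them when transforming the test rows.
poly_ridge.fit(xd_train, yd_train)
linear_only = Ridge(alpha=0.1).fit(xd_train, yd_train)

poly_score = poly_ridge.score(xd_test, yd_test)
linear_score = linear_only.score(xd_test, yd_test)
print("polynomial ridge R^2:", poly_score)
print("linear ridge R^2:    ", linear_score)
```

On data with a genuine quadratic component, the polynomial pipeline should score noticeably higher on the held-out rows than the purely linear model, which mirrors the improvement seen above.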


## Conclusion

In this project, we compared different regression models to predict house prices based on various features. The key metric used to evaluate the performance of each model was the **R-squared (R²) score**.

- The **R² score** indicates how well the model explains the variance in the target variable (price).
- A higher **R² score**, closer to **1**, indicates a better fit of the model to the data.

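Two of the scores were computed on the same data used for fitting: plain linear regression reached an R² of about 0.658, and the pipeline with scaling and second-order polynomial features reached about 0.751. On the held-out test set, Ridge regression scored about 0.648, and Ridge regression on second-order polynomial features scored about 0.700. Because scores computed on the fitting data tend to be optimistic, the test-set figures are the fairer comparison. By that measure, the polynomial Ridge model was the most effective for predicting house prices: its polynomial terms capture non-linear relationships between the features and the price that the purely linear models miss.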