## Importing Data

First, we import the packages that we will be using. Then, we import our dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
%matplotlib inline

In [None]:
data = pd.read_csv('/kaggle/input/housesalesprediction/kc_house_data.csv')

We look at the first 5 rows of the dataframe to see what it looks like. Then, we display the data type of each column.

In [None]:
data.head(5)

In [None]:
data.dtypes

## Data Wrangling

We drop the columns "id" and "date".

In [None]:
data.drop(["id","date"], axis=1, inplace=True)

Next, we check whether there are missing values in our data.

In [None]:
data.isnull().sum()

Since none of the columns has missing values, we proceed to the statistical summary of our data to examine the means and other relevant values.

In [None]:
data.describe()

## Exploratory Data Analysis

From the boxplots below, we can infer that bedrooms, bathrooms, waterfront, and grade have an effect on price. View also appears to affect price, but to a lesser degree, while floors appears to have little to no effect.

In [None]:
f, axes = plt.subplots(1, 2,figsize=(15,5))
sns.boxplot(x="bedrooms", y="price", data=data, ax=axes[0])
sns.boxplot(x="bathrooms", y="price", data=data, ax=axes[1])
f, axes = plt.subplots(1, 2,figsize=(15,5))
sns.boxplot(x="floors", y="price", data=data, ax=axes[0])
sns.boxplot(x="waterfront", y="price", data=data, ax=axes[1])
f, axes = plt.subplots(1, 2,figsize=(15,5))
sns.boxplot(x="view", y="price", data=data, ax=axes[0])
sns.boxplot(x="grade", y="price", data=data, ax=axes[1])

Using the function regplot, we can see that the feature "sqft_living" has a positive correlation with "price".

In [None]:
sns.regplot(x="sqft_living", y="price", data=data)
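The visual impression from the regression plot can be backed by a number. The following is a minimal sketch, assuming a small synthetic dataframe (`toy`, with made-up values, not the real housing data), that computes the Pearson correlation coefficient between the two columns with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are made up for illustration)
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 5000, size=200)
price = 250 * sqft_living + rng.normal(0, 100_000, size=200)
toy = pd.DataFrame({"sqft_living": sqft_living, "price": price})

# Pearson correlation coefficient; a value above 0 indicates a positive correlation
r = toy["sqft_living"].corr(toy["price"])
print(f"Pearson r = {r:.2f}")
```

On the real dataframe, the same call would be `data["sqft_living"].corr(data["price"])`.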

In the graph below, we can see that the most commonly sold houses are those with 3 bedrooms.

In [None]:
data['bedrooms'].value_counts().plot(kind='bar')
plt.title('Number of Bedrooms')
plt.xlabel('Bedrooms')
plt.ylabel('Count')
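The most frequent value can also be read off programmatically instead of from the bar chart. A minimal sketch, assuming a short hand-made series of bedroom counts (hypothetical values, not the real data):

```python
import pandas as pd

# Hand-made bedroom counts for illustration only
bedrooms = pd.Series([3, 3, 4, 2, 3, 4, 5, 3], name="bedrooms")

# value_counts() tallies each distinct value; idxmax() returns the most frequent one
counts = bedrooms.value_counts()
most_common = counts.idxmax()
print(most_common, counts[most_common])
```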

Now, we explore the correlation between different features.

In [None]:
corrmat = data.corr()
f, ax1 = plt.subplots(figsize=(12,9))

ax1=sns.heatmap(corrmat,vmax = 0.8);

Let's take a closer look by examining their correlation coefficients.

In [None]:
corrmat = data.corr()
f, ax1 = plt.subplots(figsize=(12,9))

ax1=sns.heatmap(corrmat,vmax = 0.8,annot = True);
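When the question is which features relate most strongly to price, a sorted column of the correlation matrix can be easier to read than the full heatmap. A minimal sketch, assuming a synthetic dataframe `toy` whose columns and coefficients are invented for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are made up for illustration)
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "sqft_living": rng.uniform(500, 5000, n),
    "bedrooms": rng.integers(1, 6, n),
    "floors": rng.integers(1, 4, n),
})
toy["price"] = (200 * toy["sqft_living"] + 20_000 * toy["bedrooms"]
                + rng.normal(0, 50_000, n))

# Take the "price" column of the correlation matrix, drop the trivial
# self-correlation, and sort from strongest positive to most negative
price_corr = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

On the real data, `corrmat["price"].drop("price").sort_values(ascending=False)` would give the same kind of ranking.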

## Model Development

### Multiple Linear Regression

We use multiple linear regression to predict the price. We set price as our criterion variable (y_data) and all the other features except price as our predictors (x_data).

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

In [None]:
x_data = data.drop('price',axis=1)
y_data = data['price']

70% of the data will be used as our training data, while 30% will be our test data. The rows are shuffled at random before splitting; setting random_state fixes the seed so that the same split is obtained on every run.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=1)

print("Number of test samples :", x_test.shape[0])
print("Number of training samples:",x_train.shape[0])
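To illustrate what `random_state` does, here is a small sketch on a toy array (the data are placeholders): calling `train_test_split` twice with the same seed yields identical test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny placeholder dataset: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=1)

# The shuffled splits are identical because the seed is fixed
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Same split on both calls:", same_split)
```

Leaving `random_state` unset would generally give a different split on each run, which makes scores harder to compare between runs.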

In [None]:
lm.fit(x_train,y_train)

In [None]:
lm.score(x_test,y_test)

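The R² score of the model on the test set is only about **0.69**; that is, the model explains roughly 69% of the variance in price. (For regressors, `score` returns the coefficient of determination R², not a classification accuracy.)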
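For scikit-learn regressors, `score` returns the coefficient of determination R². The sketch below, on synthetic data (the coefficients and noise level are arbitrary assumptions), checks this against `r2_score` and also computes the RMSE, which is expressed in the units of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (arbitrary coefficients, Gaussian noise)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# score() and r2_score() compute the same quantity
r2_from_score = model.score(X_test, y_test)
r2_from_metric = r2_score(y_test, y_pred)

# Root mean squared error, in the same units as y
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(r2_from_score, r2_from_metric, rmse)
```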

In [None]:
lm.intercept_

In [None]:
lm.coef_
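The raw `coef_` array is easier to interpret when each coefficient is paired with its feature name. A minimal sketch, assuming a two-feature synthetic dataframe (the column names mirror the dataset, but the values and true coefficients are invented):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic predictors and target with known coefficients (200 and 10,000)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "sqft_living": rng.uniform(500, 5000, 200),
    "bedrooms": rng.integers(1, 6, 200),
})
y = 200 * X["sqft_living"] + 10_000 * X["bedrooms"] + rng.normal(0, 1_000, 200)

model = LinearRegression().fit(X, y)

# Label each fitted coefficient with the column it belongs to
coef_table = pd.Series(model.coef_, index=X.columns, name="coefficient")
print(coef_table)
```

With the notebook's variables, this would be `pd.Series(lm.coef_, index=x_train.columns)`. Note that coefficients are on the scale of each feature, so their magnitudes are not directly comparable without standardization.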

### Gradient Boosting Regression

Since we want a better fit, we try a different method: gradient boosting regression.

In [None]:
from sklearn import ensemble
# loss='ls' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
clf = ensemble.GradientBoostingRegressor(n_estimators=400, max_depth=5, min_samples_split=2,
                                         learning_rate=0.1, loss='squared_error')

In [None]:
clf.fit(x_train,y_train)

In [None]:
clf.score(x_test,y_test)

The R² score of the new model is now about **0.89**, a lot better than the first one!
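A fitted gradient boosting model also exposes `feature_importances_`, which gives a rough indication of which predictors the trees rely on. The following sketch uses synthetic data with one informative column and one pure-noise column (both invented for illustration) and a smaller model than the one above so that it runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# One informative feature and one irrelevant feature (synthetic values)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "sqft_living": rng.uniform(500, 5000, 300),
    "noise_feature": rng.normal(0, 1, 300),
})
y = 200 * X["sqft_living"] + rng.normal(0, 10_000, 300)

gbr = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
gbr.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(gbr.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the notebook's model, the equivalent would be `pd.Series(clf.feature_importances_, index=x_train.columns)`.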