<h2>House price prediction</h2>

<h3>Importing libraries</h3>

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor


<h3>Importing the dataset</h3>

In [2]:
# Load the datasets.
# Note: `train` is read from test.csv and `test` from train.csv.
train = pd.read_csv("archive/test.csv")
test = pd.read_csv("archive/train.csv")

In [3]:
#The dataset contains the following features
train.columns, test.columns

(Index(['beds', 'baths', 'size', 'size_units', 'lot_size', 'lot_size_units',
        'zip_code', 'price'],
       dtype='object'),
 Index(['beds', 'baths', 'size', 'size_units', 'lot_size', 'lot_size_units',
        'zip_code', 'price'],
       dtype='object'))

The target variable is `price`.

In [4]:
# first five rows
train.head()

Unnamed: 0,beds,baths,size,size_units,lot_size,lot_size_units,zip_code,price
0,3,3.0,2850.0,sqft,4200.0,sqft,98119,1175000.0
1,4,5.0,3040.0,sqft,5002.0,sqft,98106,1057500.0
2,3,1.0,1290.0,sqft,6048.0,sqft,98125,799000.0
3,3,2.0,2360.0,sqft,0.28,acre,98188,565000.0
4,3,3.5,1942.0,sqft,1603.0,sqft,98107,1187000.0


In [5]:
test.head()

Unnamed: 0,beds,baths,size,size_units,lot_size,lot_size_units,zip_code,price
0,3,2.5,2590.0,sqft,6000.0,sqft,98144,795000.0
1,4,2.0,2240.0,sqft,0.31,acre,98106,915000.0
2,4,3.0,2040.0,sqft,3783.0,sqft,98107,950000.0
3,4,3.0,3800.0,sqft,5175.0,sqft,98199,1950000.0
4,2,2.0,1042.0,sqft,,,98102,950000.0


In [6]:
# Inner join on zip_code: pairs every train row with every test row in the same zip code
pd.merge(train, test, on="zip_code", how="inner")


Unnamed: 0,beds_x,baths_x,size_x,size_units_x,lot_size_x,lot_size_units_x,zip_code,price_x,beds_y,baths_y,size_y,size_units_y,lot_size_y,lot_size_units_y,price_y
0,3,3.0,2850.0,sqft,4200.0,sqft,98119,1175000.0,3,2.5,2974.0,sqft,3840.00,sqft,1750000.0
1,3,3.0,2850.0,sqft,4200.0,sqft,98119,1175000.0,2,2.0,871.0,sqft,0.29,acre,430000.0
2,3,3.0,2850.0,sqft,4200.0,sqft,98119,1175000.0,3,2.5,2620.0,sqft,4080.00,sqft,1300000.0
3,3,3.0,2850.0,sqft,4200.0,sqft,98119,1175000.0,1,1.0,607.0,sqft,,,427000.0
4,3,3.0,2850.0,sqft,4200.0,sqft,98119,1175000.0,4,4.0,3040.0,sqft,4000.00,sqft,1585000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47374,3,2.0,1301.0,sqft,3000.0,sqft,98103,895000.0,3,3.0,1650.0,sqft,1376.00,sqft,974999.0
47375,3,2.0,1301.0,sqft,3000.0,sqft,98103,895000.0,3,2.0,1210.0,sqft,1374.00,sqft,875000.0
47376,3,2.0,1301.0,sqft,3000.0,sqft,98103,895000.0,2,1.0,1770.0,sqft,4800.00,sqft,926000.0
47377,3,2.0,1301.0,sqft,3000.0,sqft,98103,895000.0,1,1.0,705.0,sqft,,,490000.0
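
An inner join on `zip_code` produces one row per matching (train, test) pair, which is why the result above has tens of thousands of rows and `_x`/`_y` column suffixes. If the goal is a single table in which every listing appears once, row-wise concatenation is one option. The sketch below uses small stand-in frames (`train_demo`, `test_demo`, and the `source` column are hypothetical names, and the values are illustrative) to contrast the two operations.

```python
import pandas as pd

# Tiny stand-in frames with a subset of the notebook's columns (illustrative values).
train_demo = pd.DataFrame(
    {"beds": [3, 4], "zip_code": [98119, 98106], "price": [1175000.0, 1057500.0]}
)
test_demo = pd.DataFrame(
    {"beds": [3, 2], "zip_code": [98119, 98119], "price": [1750000.0, 430000.0]}
)

# Inner join: one output row per (train row, test row) pair sharing a zip code.
joined = pd.merge(train_demo, test_demo, on="zip_code", how="inner")

# Row-wise stacking: every listing appears exactly once.
# The hypothetical `source` column records which frame each row came from.
full = pd.concat(
    [train_demo.assign(source="train"), test_demo.assign(source="test")],
    ignore_index=True,
)

print(len(joined), len(full))
```

Keeping a `source` column makes it straightforward to split the stacked frame back into its two parts after shared preprocessing.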


<h3>Data Cleaning</h3>
- Data cleaning identifies and corrects errors and inconsistencies in the data, such as missing values, duplicate records, and incorrect data types. This step matters because reliable models depend on complete and accurate data.



- First, the columns that are unlikely to help predict the target variable are dropped from the data frame.



In [18]:
train_copy = train.copy()
test_copy = test.copy()

In [19]:
train_copy.head()

Unnamed: 0,beds,baths,size,size_units,lot_size,lot_size_units,zip_code,price
0,3,3.0,2850.0,sqft,4200.0,sqft,98119,1175000.0
1,4,5.0,3040.0,sqft,5002.0,sqft,98106,1057500.0
2,3,1.0,1290.0,sqft,6048.0,sqft,98125,799000.0
3,3,2.0,2360.0,sqft,0.28,acre,98188,565000.0
4,3,3.5,1942.0,sqft,1603.0,sqft,98107,1187000.0


In [20]:
# train dataset
train_trimmed = train_copy.drop(columns=["size_units", "lot_size_units"])
train_trimmed.head()

Unnamed: 0,beds,baths,size,lot_size,zip_code,price
0,3,3.0,2850.0,4200.0,98119,1175000.0
1,4,5.0,3040.0,5002.0,98106,1057500.0
2,3,1.0,1290.0,6048.0,98125,799000.0
3,3,2.0,2360.0,0.28,98188,565000.0
4,3,3.5,1942.0,1603.0,98107,1187000.0


In [23]:
# test dataset
test_trimmed = test_copy.drop(columns=["size_units", "lot_size_units"])
test_trimmed.head()

Unnamed: 0,beds,baths,size,lot_size,zip_code,price
0,3,2.5,2590.0,6000.0,98144,795000.0
1,4,2.0,2240.0,0.31,98106,915000.0
2,4,3.0,2040.0,3783.0,98107,950000.0
3,4,3.0,3800.0,5175.0,98199,1950000.0
4,2,2.0,1042.0,,98102,950000.0
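
The rows shown above mix units in `lot_size`: most values are in square feet, but some are in acres (for example 0.28 and 0.31). Once the unit columns are dropped, those values can no longer be told apart. A minimal sketch of one way to handle this is to convert acres to square feet before dropping the unit columns. The helper `lot_size_to_sqft` and the `demo` frame are hypothetical, and the sketch assumes `size` is already in square feet and that the only units are `sqft` and `acre`.

```python
import numpy as np
import pandas as pd

SQFT_PER_ACRE = 43_560  # 1 acre = 43,560 square feet


def lot_size_to_sqft(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `lot_size` in square feet and the unit columns dropped.

    Assumes the only unit labels are "sqft" and "acre"; rows with a missing
    unit are left unchanged.
    """
    out = df.copy()
    is_acre = out["lot_size_units"].eq("acre")  # NaN units compare as False
    out.loc[is_acre, "lot_size"] = out.loc[is_acre, "lot_size"] * SQFT_PER_ACRE
    return out.drop(columns=["size_units", "lot_size_units"])


# Illustrative rows modelled on the head() output above.
demo = pd.DataFrame(
    {
        "size": [2850.0, 2360.0, 1042.0],
        "size_units": ["sqft", "sqft", "sqft"],
        "lot_size": [4200.0, 0.28, np.nan],
        "lot_size_units": ["sqft", "acre", np.nan],
    }
)
converted = lot_size_to_sqft(demo)
print(converted)
```

The same helper could be applied to `train_copy` and `test_copy` in place of the plain `drop` calls, provided the assumptions about the unit labels hold for the full data.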


<h3>Missing values</h3>
- Missing entries should be represented as NaN (Not a Number). With a single, explicit marker for missing data, it is easy to see which values are absent, count them, and take appropriate action.
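
As a small illustration of this idea, the sketch below turns placeholder strings into NaN so they can be counted. The placeholder values (`""` and `"N/A"`) and the `raw` series are hypothetical; `pd.read_csv` already reads empty fields as NaN, so this step is only needed when a file uses other markers.

```python
import pandas as pd

# Hypothetical raw column in which missing values appear as placeholder strings.
raw = pd.Series(["4200", "", "N/A", "0.28"], name="lot_size")

# errors="coerce" converts anything that cannot be parsed as a number to NaN.
lot_size = pd.to_numeric(raw, errors="coerce")

# With NaN as the single missing-value marker, counting is a one-liner.
missing = int(lot_size.isna().sum())
print(missing)
```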



In [46]:
train_trimmed.isna().sum()

beds          0
baths         0
size          0
lot_size    505
zip_code      0
price         0
dtype: int64

In [47]:
test_trimmed.isna().sum()

beds          0
baths         0
size          0
lot_size    347
zip_code      0
price         0
dtype: int64
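
Only `lot_size` has missing values in both frames. One possible action is to fill them with the median computed on the training data. This is a sketch of that option, not necessarily the approach taken later in the notebook. The frames `train_demo` and `test_demo` are hypothetical stand-ins with illustrative values.

```python
import numpy as np
import pandas as pd

# Stand-in frames; in the notebook these would be train_trimmed and test_trimmed.
train_demo = pd.DataFrame({"lot_size": [4200.0, 5002.0, np.nan, 6048.0]})
test_demo = pd.DataFrame({"lot_size": [6000.0, np.nan, 3783.0]})

# Compute the statistic on the training data only (NaN is skipped by default),
# then reuse it for the test data so no test information leaks into training.
lot_median = train_demo["lot_size"].median()

train_filled = train_demo.assign(lot_size=train_demo["lot_size"].fillna(lot_median))
test_filled = test_demo.assign(lot_size=test_demo["lot_size"].fillna(lot_median))

print(train_filled["lot_size"].isna().sum(), test_filled["lot_size"].isna().sum())
```

The median is less sensitive than the mean to a few very large lots, which is why it is used here. Dropping the rows or adding a missing-value indicator column are alternatives.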