# California Housing Price Prediction

### Background of Problem Statement :

The US Census Bureau has published California census data with 10 metrics, such as population, median income, and median housing price, for each block group in California. The dataset also serves as an input for project scoping and helps specify the project's functional and nonfunctional requirements.

### Problem Objective :

The project aims to build a model that predicts median house values in California from the provided dataset. The model should learn from the data and be able to predict the median housing price in any district, given all the other metrics.

Districts or block groups are the smallest geographical units for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). There are 20,640 districts in the project dataset.

### Domain: Finance and Housing

### Analysis Tasks to be performed:

a. Build a model of housing prices to predict median house values in California using the provided dataset.

b. Train the model to learn from the data to predict the median housing price in any district, given all the other metrics.

c. Predict housing prices based on median_income and plot the regression chart for it.


#### 1. Load the data :
Read the “housing.csv” file from the folder into the program.
Print first few rows of this data.
Extract input (X) and output (Y) data from the dataset.

#### 2. Handle missing values :
Fill the missing values with the mean of the respective column.

#### 3. Encode categorical data :
Convert categorical column in the dataset to numerical data.

#### 4. Split the dataset : 
Split the data into 80% training dataset and 20% test dataset.

#### 5. Standardize data :
Standardize training and test datasets.

#### 6. Perform Linear Regression : 
Perform Linear Regression on training data.
Predict output for test dataset using the fitted model.
Print root mean squared error (RMSE) from Linear Regression.
            [ HINT: Import mean_squared_error from sklearn.metrics ]


#### 7. Bonus exercise: Perform Linear Regression with one independent variable :
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
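
The bonus exercise above can be sketched as follows. This is a minimal illustration, not the notebook's own solution: it uses synthetic stand-in data (the slope, intercept, and noise level are assumptions chosen only to resemble the dataset's ranges), and the names `income_model`, `X_inc_train`, and `X_inc_test` are hypothetical. With the real data, the `median_income` column would be taken from the unscaled training and test frames instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for median_income and median_house_value (assumption, not real data)
rng = np.random.default_rng(0)
median_income = rng.uniform(0.5, 15.0, size=500)
median_house_value = 45_000 + 40_000 * median_income + rng.normal(0, 50_000, size=500)

# scikit-learn expects a 2-D feature matrix, even for a single feature
X_income = median_income.reshape(-1, 1)
X_inc_train, X_inc_test, y_inc_train, y_inc_test = train_test_split(
    X_income, median_house_value, test_size=0.2, random_state=1
)

# Fit on the training split only, then predict the held-out split
income_model = LinearRegression().fit(X_inc_train, y_inc_train)
y_inc_pred = income_model.predict(X_inc_test)

# Plot the same fitted line over the training points and over the test points
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
panels = [("Training data", X_inc_train, y_inc_train), ("Test data", X_inc_test, y_inc_test)]
for ax, (title, X_part, y_part) in zip(axes, panels):
    order = np.argsort(X_part[:, 0])
    ax.scatter(X_part[:, 0], y_part, s=8, alpha=0.5)
    ax.plot(X_part[order, 0], income_model.predict(X_part[order]), color="red")
    ax.set_title(title)
    ax.set_xlabel("median_income")
axes[0].set_ylabel("median_house_value")
fig.tight_layout()
# In a notebook with %matplotlib inline the figure renders automatically;
# in a plain script, call plt.show() here.
```

If the red line tracks the point cloud about as well in the test panel as in the training panel, the single-feature model generalizes reasonably to the held-out data.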


### Dataset Description :
| Field | Type | Description |
| --- | --- | --- |
| longitude | numeric, float (signed) | Longitude of the block in California, USA |
| latitude | numeric, float | Latitude of the block in California, USA |
| housing_median_age | numeric, int | Median age of the houses in the block |
| total_rooms | numeric, int | Total number of rooms (excluding bedrooms) in all houses in the block |
| total_bedrooms | numeric, float | Total number of bedrooms in all houses in the block |
| population | numeric, int | Total population of the block |
| households | numeric, int | Total number of households in the block |
| median_income | numeric, float | Median household income of all the houses in the block |
| ocean_proximity | categorical (string) | Type of landscape of the block. Unique values: `'NEAR BAY'`, `'<1H OCEAN'`, `'INLAND'`, `'NEAR OCEAN'`, `'ISLAND'` |
| median_house_value | numeric, int | Median house price of all the houses in the block |

In [1]:
## Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

In [2]:
df = pd.read_excel("Data/1553768847_housing.xlsx")

In [3]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,INLAND,78100
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,INLAND,77100
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,INLAND,92300
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,INLAND,84700


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  int64  
 3   total_rooms         20640 non-null  int64  
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  int64  
 6   households          20640 non-null  int64  
 7   median_income       20640 non-null  float64
 8   ocean_proximity     20640 non-null  object 
 9   median_house_value  20640 non-null  int64  
dtypes: float64(4), int64(5), object(1)
memory usage: 1.6+ MB


In [5]:
df.isnull()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
20635,False,False,False,False,False,False,False,False,False,False
20636,False,False,False,False,False,False,False,False,False,False
20637,False,False,False,False,False,False,False,False,False,False
20638,False,False,False,False,False,False,False,False,False,False


In [6]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64

In [7]:
df.shape

(20640, 10)

In [8]:
df1 = df.copy()

In [9]:
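# Assign the result back; fillna(inplace=True) on a selected column may not update df1 in recent pandas versions
df1['total_bedrooms'] = df1['total_bedrooms'].fillna(df1['total_bedrooms'].mean())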

In [10]:
df1.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
median_house_value    0
dtype: int64
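
The cell above fills the gaps with the mean of the full column, which is what the task asks for. A common alternative, sketched here on toy values, is to learn the mean from the training split only and reuse it for the test split, so that no test-set information enters preprocessing. The frames `train_part` and `test_part` are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for a training and a test split (assumption, not real data)
train_part = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 200.0]})
test_part = pd.DataFrame({"total_bedrooms": [np.nan, 500.0]})

# fit_transform learns the column mean from the training split (200.0 here)
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_part)

# transform reuses the training mean for the test split
test_filled = imputer.transform(test_part)

print(imputer.statistics_)  # the learned per-column means
```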

In [11]:
df1.ocean_proximity.unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [12]:
le = LabelEncoder()
df1['ocean_proximity'] = le.fit_transform(df1.ocean_proximity)

In [13]:
df1

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,3,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,3,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,3,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,3,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,3,342200
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,1,78100
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,1,77100
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,1,92300
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,1,84700


In [14]:
df1["ocean_proximity"].value_counts()

0    9136
1    6551
4    2658
3    2290
2       5
Name: ocean_proximity, dtype: int64
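
To read the integer codes in the output above, it helps to recover the label-to-code mapping. `LabelEncoder` assigns codes in sorted order of the labels, so a small sketch like the following (using a separate, hypothetical `demo_encoder` so the notebook's `le` is left untouched) shows which code belongs to which category.

```python
from sklearn.preprocessing import LabelEncoder

# The five unique values reported by df1.ocean_proximity.unique()
categories = ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']

demo_encoder = LabelEncoder().fit(categories)

# classes_ holds the labels in sorted order; the position of each label is its code
mapping = {label: code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)
# {'<1H OCEAN': 0, 'INLAND': 1, 'ISLAND': 2, 'NEAR BAY': 3, 'NEAR OCEAN': 4}
```

These codes imply an ordering (for example, `NEAR OCEAN` > `NEAR BAY` > `ISLAND`) that has no physical meaning, and a linear model will treat them as if it did. One-hot encoding avoids this.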

In [15]:
df1.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,1.165843,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,419.266592,1132.462122,382.329753,1.899822,1.420662,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,0.0,14999.0
25%,-121.8,33.93,18.0,1447.75,297.0,787.0,280.0,2.5634,0.0,119600.0
50%,-118.49,34.26,29.0,2127.0,438.0,1166.0,409.0,3.5348,1.0,179700.0
75%,-118.01,37.71,37.0,3148.0,643.25,1725.0,605.0,4.74325,1.0,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,4.0,500001.0


In [16]:
X = df1.drop(columns='median_house_value')  # independent variables

y = df1[['median_house_value']]  # target variable, kept as a one-column DataFrame


In [17]:
X.head(3)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
0,-122.23,37.88,41,880,129.0,322,126,8.3252,3
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,3
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,3


In [18]:
y.head(3)

Unnamed: 0,median_house_value
0,452600
1,358500
2,352100


In [19]:
# Note: this cast changes df1 only; X was extracted earlier and still holds the integer codes
df1['ocean_proximity'] = df1['ocean_proximity'].astype('category')

In [20]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   longitude           20640 non-null  float64 
 1   latitude            20640 non-null  float64 
 2   housing_median_age  20640 non-null  int64   
 3   total_rooms         20640 non-null  int64   
 4   total_bedrooms      20640 non-null  float64 
 5   population          20640 non-null  int64   
 6   households          20640 non-null  int64   
 7   median_income       20640 non-null  float64 
 8   ocean_proximity     20640 non-null  category
 9   median_house_value  20640 non-null  int64   
dtypes: category(1), float64(4), int64(5)
memory usage: 1.4 MB


In [21]:
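# Note: by default get_dummies expands only object, string, or category columns.
# ocean_proximity in X is an integer column at this point, so this call leaves X unchanged.
X = pd.get_dummies(X,drop_first=True)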

In [22]:
X.head(3)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
0,-122.23,37.88,41,880,129.0,322,126,8.3252,3
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,3
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,3
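
The output above still shows a single integer `ocean_proximity` column, so no dummy columns were created. A minimal sketch of one-hot encoding applied to the original string labels is given below. The `demo` frame and its `median_income` values are made up for illustration; naming the column explicitly through `columns=` is one way to make sure it is expanded.

```python
import pandas as pd

# Toy frame with the five original string labels (values are illustrative only)
demo = pd.DataFrame({
    "median_income": [8.3, 3.5, 2.6, 4.1, 5.0],
    "ocean_proximity": ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
})

# drop_first=True drops the first category in sorted order ('<1H OCEAN'),
# which then acts as the baseline for the remaining indicator columns
encoded = pd.get_dummies(demo, columns=["ocean_proximity"], drop_first=True)
print(list(encoded.columns))
```

Applied to the real data, this would give twelve feature columns instead of nine, so the shapes and metrics printed later in the notebook would change.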


In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [24]:
X_train.shape,X_test.shape

((16512, 9), (4128, 9))

In [25]:
X_train.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
15961,-122.43,37.71,52,1410,286.0,879,282,3.1908,3
1771,-122.35,37.95,42,1485,290.0,971,303,3.6094,3
16414,-121.24,37.9,16,50,10.0,20,6,2.625,1
5056,-118.35,34.02,34,5218,1576.0,3538,1371,1.5143,0
8589,-118.39,33.89,38,1851,332.0,750,314,7.3356,0


In [27]:
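from sklearn.preprocessing import StandardScaler
independent_scaler = StandardScaler()
# Fit the scaler on the training set only, then apply the same mean and scale to the test set
X_train = independent_scaler.fit_transform(X_train)
X_test = independent_scaler.transform(X_test)
# Both are now NumPy arrays; print the first five rows of each
print(X_train[0:5,:])
print("test data")
print(X_test[0:5,:])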

[[-1.42250942  0.97229046  1.85890297 -0.56497684 -0.60419991 -0.4861138
  -0.57159385 -0.36232605  1.28811826]
 [-1.38265919  1.08459626  1.06434823 -0.53051556 -0.59464877 -0.40424308
  -0.51668155 -0.14102329  1.28811826]
 [-0.8297373   1.06119922 -1.0014941  -1.18987464 -1.26322834 -1.25053723
  -1.29329827 -0.66144956 -0.1168232 ]
 [ 0.60985212 -0.75441118  0.42870444  1.18473701  2.47604167  1.8801282
   2.27600079 -1.24864731 -0.81929393]
 [ 0.58992701 -0.81524349  0.74652633 -0.36234454 -0.49436183 -0.6009108
  -0.48791797  1.82892019 -0.81929393]]
test data
[[ 0.60487084 -0.73569355  0.82598181  0.07830031  0.31270922 -0.28143698
   0.32269207 -0.33102858 -0.81929393]
 [-0.10247067  0.53710549  0.66707086 -0.20887699 -0.20066438 -0.25118041
  -0.16367395 -1.0032899  -0.1168232 ]
 [-1.41752814  0.98164928  1.38217013 -0.37704801 -0.30572688  0.09677018
  -0.24734983  0.0724551   1.28811826]
 [-1.34779025  1.01908454  1.85890297 -1.05662437 -1.05549111 -1.09035537
  -1.08149371 
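
As a check on what the scaler does, the following sketch reproduces `StandardScaler` by hand on toy arrays: each column has the training mean subtracted and is divided by the training standard deviation (the population form, `ddof=0`). The arrays `toy_train` and `toy_test` are illustrative and unrelated to the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
toy_test = np.array([[4.0, 30.0]])

# The scaler learns the mean and standard deviation from the training rows only
toy_scaler = StandardScaler().fit(toy_train)

# Manual equivalent: z = (x - training mean) / training std, column by column
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler
manual_test = (toy_test - train_mean) / train_std

print(np.allclose(toy_scaler.transform(toy_test), manual_test))  # True
```

Because the test rows are scaled with training statistics, their standardized columns will generally not have mean 0 and variance 1 exactly.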

In [30]:
from sklearn.linear_model import LinearRegression
linreg=LinearRegression()
linreg.fit(X_train,y_train)

LinearRegression()

In [31]:
# Explicit defaults; `normalize` is omitted because newer scikit-learn releases no longer accept it
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None)

LinearRegression()

In [33]:
y_predict = linreg.predict(X_test)

In [37]:
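import math
from sklearn.metrics import mean_squared_error, r2_score
# RMSE: square root of the mean squared error, in the same units as median_house_value
print(math.sqrt(mean_squared_error(y_test, y_predict)))
# R^2 (coefficient of determination) on the test set
print(r2_score(y_test, y_predict))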

69888.79391558649
0.6276223517950293
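
The two numbers above are the RMSE and the R² score on the test set. As a sketch of what they measure, the snippet below computes both by hand on four toy values and compares them with the scikit-learn functions. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_pred = np.array([2.0, 5.0, 8.0, 11.0])

# RMSE: square root of the average squared residual
rmse_manual = np.sqrt(np.mean((toy_true - toy_pred) ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((toy_true - toy_pred) ** 2)
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(toy_true, toy_pred))))  # True
print(np.isclose(r2_manual, r2_score(toy_true, toy_pred)))  # True
```

Read this way, an RMSE of about 69,889 means typical prediction errors are on the order of 70,000 in the units of `median_house_value`, and an R² of about 0.63 means the model accounts for roughly 63% of the variance in the test targets.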


In [38]:
lin = LinearRegression()

In [39]:
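# Refit an identical model to inspect the learned coefficients (one per standardized feature) and the intercept
lin.fit(X_train, y_train)
print(lin.coef_)
print(lin.intercept_)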

[[-85719.62419285 -90753.20159388  14624.24311588 -15959.37414613
   37190.05030209 -43836.23163718  27376.37429372  76258.38625495
     609.71258031]]
[207735.06419574]
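
The coefficient array is easier to interpret when paired with feature names. The sketch below copies the printed coefficients and assumes they follow the column order of `X` shown earlier, which holds because neither `train_test_split` nor `StandardScaler` reorders columns. Since the features were standardized, each coefficient is the change in predicted `median_house_value` for a one-standard-deviation increase in that feature, with the others held fixed.

```python
import pandas as pd

# Column order of X as shown by X.head(3)
feature_names = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "ocean_proximity",
]

# Values copied from the printed lin.coef_ output
coefficients = [
    -85719.62419285, -90753.20159388, 14624.24311588, -15959.37414613,
    37190.05030209, -43836.23163718, 27376.37429372, 76258.38625495,
    609.71258031,
]

coef_table = pd.Series(coefficients, index=feature_names)

# Rank features by the absolute size of their coefficient, largest first
ranked = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)
print(ranked)
```

In this ranking `latitude`, `longitude`, and `median_income` carry the largest weights, while the label-encoded `ocean_proximity` contributes almost nothing.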


In [41]:
predictions = lin.predict(X_test)
print(math.sqrt(mean_squared_error(y_test, predictions)))

69888.79391558649


In [42]:
lin.score(X_test, y_test)

0.6276223517950293