# Linear Regression on the Housing Prices Dataset

The dataset used here describes the houses in each California district, with summary statistics drawn from the 1990 census data. The columns are as follows; their names are self-explanatory:

1. longitude
2. latitude
3. housing_median_age
4. total_rooms
5. total_bedrooms
6. population
7. households
8. median_income
9. median_house_value
10. ocean_proximity

## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Import all Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import keras
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D, Conv2D, MaxPooling1D, MaxPooling2D, LSTM, Embedding, Dropout
from keras.preprocessing import image
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.metrics import make_scorer
from keras.wrappers.scikit_learn import KerasClassifier

## Preprocessing

### 1. Read Dataset

In [None]:
df=pd.read_csv("drive/MyDrive/housing.csv")
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


### 2. Remove Rows containing Null values

The dataset contains a small number of null values. Since they affect only a small fraction of a large dataset, we drop the rows that contain them; losing those rows should make little difference to the model.

In [None]:
df.dropna(inplace=True)
df.shape

(20433, 10)
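
Before dropping rows, it can help to check how many values are actually missing and in which columns. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the housing data) to show the pattern; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print("Rows dropped:", rows_dropped)
```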

### 3. Split the dataset into Input and Output values

We now separate the dataframe into X (the features) and y (the output variable, which is a continuous value because we are doing regression).

In [None]:
X=pd.DataFrame(columns=['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income','ocean_proximity'],data=df)
y=pd.DataFrame(columns=['median_house_value'],data=df)

### 4. Substitute Non Numeric Values

Looking at the dataset, the feature ```ocean_proximity``` holds non-numeric categories: ```<1H OCEAN```, ```INLAND```, ```ISLAND```, ```NEAR BAY``` and ```NEAR OCEAN```.

We therefore one-hot encode it into indicator features named with the prefix ```ocean_proximity```. Because ```drop_first = True``` drops the first category, four new features remain.

A value of 1 is given if the row belongs to that class, otherwise 0. A row with all four set to 0 belongs to the dropped category.

In [None]:
X = pd.get_dummies(data = X, columns = ['ocean_proximity'] , prefix = ['ocean_proximity'] , drop_first = True)
X.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 0 | 0 | 1 | 0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 0 | 0 | 1 | 0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 0 | 0 | 1 | 0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 0 | 0 | 1 | 0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 0 | 0 | 1 | 0 |
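
To see what ```drop_first``` does in isolation, here is a minimal sketch on a made-up column with three categories (`toy` is hypothetical, not the housing data). The first category in sorted order gets no column of its own and is represented by all zeros.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"]})

encoded = pd.get_dummies(
    toy,
    columns=["ocean_proximity"],
    prefix=["ocean_proximity"],
    drop_first=True,  # drop the first category in sorted order
)
print(list(encoded.columns))
print(encoded)
```

Only two indicator columns are produced for three categories; the first row, whose category was dropped, has zeros in both.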


### 5. Normalize Inputs

Since the input features are on very different scales, we standardize them so that each has an equal chance of influencing our model.

We use scikit-learn's ```StandardScaler``` to do this.

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X) # Fit to the input data, transform it, and keep the scaled result (fit_transform does not modify X in place)
X

array([[-1.32731375,  1.05171726,  0.98216331, ..., -0.01564487,
         2.82866074, -0.38418614],
       [-1.32232256,  1.04235526, -0.60621017, ..., -0.01564487,
         2.82866074, -0.38418614],
       [-1.33230494,  1.03767426,  1.85576873, ..., -0.01564487,
         2.82866074, -0.38418614],
       ...,
       [-0.82320322,  1.77727236, -0.92388486, ..., -0.01564487,
        -0.35352419, -0.38418614],
       [-0.87311515,  1.77727236, -0.84446619, ..., -0.01564487,
        -0.35352419, -0.38418614],
       [-0.83318561,  1.74918635, -1.00330353, ..., -0.01564487,
        -0.35352419, -0.38418614]])
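
The cell above applies the scaler to the full feature matrix before the train/test split. A common alternative, sketched here on made-up data, is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that no information from the test set reaches the model. The arrays and column meanings below are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.normal(loc=3.0, scale=1.0, size=100),       # an income-like column
    rng.normal(loc=2000.0, scale=500.0, size=100),  # a room-count-like column
])
train, test = features[:80], features[80:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuse those statistics on the test rows

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(train_scaled.std(axis=0))   # close to 1 for each column
```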

### 6. Adjust Output Scale

Since ```median_house_value``` is very large, we divide the output by ```100,000``` to express it in lakhs (units of 100,000). This keeps the target and the loss values in a small range that is easier to read and visualize.

In [None]:
y = y / 100000

### 7. Split the dataset into Training and Testing datasets

We use scikit-learn's ```train_test_split``` function to split our dataset into training and testing datasets with an 80/20 split.

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state = 0)

In [None]:
X_train.shape

(16346, 12)
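
The training shape above follows directly from the 80/20 split of 20433 rows. The sketch below reproduces the sizes on placeholder arrays of zeros (`X_demo` and `y_demo` are hypothetical names) with the same number of rows and columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and columns as X and y
X_demo = np.zeros((20433, 12))
y_demo = np.zeros((20433, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)
```

The test set receives the remaining 4087 rows, so the two parts together cover all 20433 rows.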

## Training

### 1. Create the Model

We now define a function ```create_model``` that builds and returns the model used by our random search. The function takes the following inputs:
  - ```nl```, the number of hidden layers,
  - ```nn```, a list with the number of neurons in each hidden layer,
  - ```lr```, the learning rate of the optimizer, and
  - ```act```, the activation function of the hidden layers.


We create ```nl``` hidden layers followed by a single-unit output layer. The output layer uses the **ReLU activation**, which keeps the predicted house value non-negative while leaving positive outputs linear, as suits a **regression problem**. We use the **Adam optimizer** to train our ANN, with **Mean Squared Error** as both the loss function and the reported metric.

In [None]:
def create_model(nl=1, nn=(1000,), lr=0.01, act="relu"):
    """Build and compile a fully connected regression network."""
    opt = keras.optimizers.Adam(learning_rate=lr)
    model = Sequential()
    for i in range(nl):
        if i == 0:
            # Only the first layer needs the input dimension
            model.add(Dense(nn[i], activation=act, input_dim=X.shape[1]))
        else:
            model.add(Dense(nn[i], activation=act))
    # Single output unit; ReLU keeps the predicted value non-negative
    model.add(Dense(1, activation="relu"))
    model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mse'])
    return model

### 2. Random Search and Training

#### 2.1 Custom Implementation of Random Search

With exploration and learning in mind, we use a **custom-built random search function** instead of a random search library. It plays the same role as scikit-learn's ```RandomizedSearchCV```, except that each candidate is scored once on the held-out test set rather than by cross-validation. We define our random search function ```getRandom()```.

The function generates the parameters of the model based on the following
1. ```nl``` - Uniform Distribution of Integers ```[5, 30]```,
2. ```nn``` - Uniform Distribution of Integers ```[16, 1024]```
3. ```lr``` - Randomly picked from set ```{0.01, 0.001, 0.0001, 0.00001}```
4. ```act``` - relu.

In each iteration we train the created model using Keras' ```model.fit()```. We then evaluate the model against our test dataset and compute the **Mean Square Error** Loss. 

After each iteration we return the ```params```, ```model```, and ```mseSum```. These are processed to find the model, and set of params, which minimize the MSE.
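
The sampling step on its own can be sketched separately from training. In this minimal sketch, `sample_params` and the seeded `random.Random` instance are hypothetical additions for illustration; the ranges are the ones listed above.

```python
import random

def sample_params(rng):
    """Draw one hyperparameter configuration from the search space."""
    nl = rng.randint(5, 30)                          # number of hidden layers (bounds inclusive)
    nn = [rng.randint(16, 1024) for _ in range(nl)]  # neurons in each hidden layer
    lr = rng.choice([0.01, 0.001, 0.0001, 0.00001])  # learning rate
    act = "relu"                                     # activation is fixed
    return {"nl": nl, "nn": nn, "lr": lr, "act": act}

rng = random.Random(42)  # the seed is an arbitrary choice, used only for reproducibility
params = sample_params(rng)
print(params["nl"], len(params["nn"]), params["lr"])
```

Seeding the generator makes a search run repeatable, which helps when comparing results across sessions.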

In [None]:
import random

def getRandom():
  # Number of hidden layers
  nl = random.randint(5, 30)

  # Number of neurons in each layer
  nn = [random.randint(16, 1024) for _ in range(nl)]

  # Activation function
  act = "relu"

  # Learning rate
  lr = random.choice([0.01, 0.001, 0.0001, 0.00001])

  params = [nl, nn, act, lr]

  model = create_model(nl=nl, nn=nn, act=act, lr=lr)
  model.fit(x=X_train, y=y_train.to_numpy(), epochs=100, verbose=1, batch_size=45)

  # Predict on the test set and compute the mean squared error
  predy = model.predict(X_test)
  acty = y_test.to_numpy()
  # Both arrays have shape (n, 1), so averaging over axis 0 gives a one-element array
  mseSum = np.mean((predy - acty) ** 2, axis=0)

  return params, model, mseSum
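
The test-set MSE is the mean of the squared differences between predictions and targets. The sketch below, on made-up numbers, checks that an explicit loop and a vectorized NumPy expression give the same value. The arrays `pred` and `actual` are hypothetical and shaped `(n, 1)` like the output of ```model.predict```.

```python
import numpy as np

# Hypothetical predictions and targets, each of shape (n, 1)
pred = np.array([[2.0], [3.5], [1.0]])
actual = np.array([[2.5], [3.0], [1.0]])

# Explicit loop over the rows
total = 0.0
for i in range(pred.shape[0]):
    diff = pred[i, 0] - actual[i, 0]
    total += diff * diff
mse_loop = total / pred.shape[0]

# Vectorized equivalent
mse_vec = float(np.mean((pred - actual) ** 2))

print(mse_loop, mse_vec)
```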

We now initialize lists to store the randomly generated models, their params, and their MSE losses.

We then call ```getRandom()``` a total of ```n_iter = 20``` times and store the results of each call.

In [None]:
modelList = []
paramsList = []
mseSumList = []
finalM = None
n_iter = 20
mseMin = 999999999
for i in range(n_iter) :
  print("Iteration ",i+1)
  params, model, mseSum = getRandom()
  paramsList.append(params)
  modelList.append(model)
  mseSumList.append(mseSum)

Iteration  1
Epoch 1/100
Epoch 2/100
...

We now process the models obtained in the previous step and select the one with the minimum MSE.

In [None]:
best_mse = float("inf")  # avoids shadowing the built-in min()
min_i = 0

print("Number of models tried:", len(mseSumList))

# Linear scan for the model with the lowest test MSE
for i in range(len(mseSumList)) :
  if mseSumList[i] < best_mse :
    best_mse = mseSumList[i]
    min_i = i

print("The minimum MSE error on the test data is :-",best_mse)
print("The number of layers are :-",paramsList[min_i][0])
print("Number of neurons per layer are :- ",paramsList[min_i][1])
print("Activation Function Used :- ",paramsList[min_i][2])
print("Learning rate is :-",paramsList[min_i][3])

Number of models tried: 20
The minimum MSE error on the test data is :- 0.3811634
The number of layers are :- 22
Number of neurons per layer are :-  [239, 350, 417, 461, 328, 787, 318, 488, 725, 646, 883, 419, 983, 29, 113, 647, 345, 207, 392, 571, 478, 528]
Activation Function Used :-  relu
Learning rate is :- 0.0001
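
The same selection can be written more compactly with ```np.argmin```, which returns the index of the smallest value. A minimal sketch on made-up MSE values (`mse_values` is hypothetical):

```python
import numpy as np

# Hypothetical MSE values for five candidate models
mse_values = [0.58, 0.43, 0.38, 0.50, 0.61]

best_index = int(np.argmin(mse_values))  # index of the smallest MSE
best_value = mse_values[best_index]
print(best_index, best_value)
```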


Printing a summary of the Best Model

In [None]:
modelList[min_i].summary()

Model: "sequential_33"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_723 (Dense)            (None, 239)               3107      
_________________________________________________________________
dense_724 (Dense)            (None, 350)               84000     
_________________________________________________________________
dense_725 (Dense)            (None, 417)               146367    
_________________________________________________________________
dense_726 (Dense)            (None, 461)               192698    
_________________________________________________________________
dense_727 (Dense)            (None, 328)               151536    
_________________________________________________________________
dense_728 (Dense)            (None, 787)               258923    
_________________________________________________________________
dense_729 (Dense)            (None, 318)             
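
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input and unit pair, plus one bias per unit. The sketch below recomputes the counts for the first six layers, taking 12 as the number of input features; `dense_param_count` is a hypothetical helper written for this check.

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

# 12 input features, followed by the first six layer widths from the summary
widths = [12, 239, 350, 417, 461, 328, 787]
counts = [dense_param_count(a, b) for a, b in zip(widths, widths[1:])]
print(counts)
```

The values match the `Param #` column shown for those layers.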

We now collect all the randomly generated parameters along with their errors, and show the top 5 sets of parameters.

In [None]:
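def Extract(lst,i):
    # Return the i-th element of every item in lst
    return [item[i] for item in lst]

# Each stored MSE is a one-element array, so take element 0 to get a scalar
mseSumList = Extract(mseSumList,0)
b = Extract(paramsList,0)  # number of layers
c = Extract(paramsList,1)  # neurons per layer
d = Extract(paramsList,2)  # activation function
e = Extract(paramsList,3)  # learning rate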

In [None]:
tempList = []
for i in mseSumList :
  tempList.append(i)
tempList.sort()
rankList = [None]*n_iter
for i in range(n_iter) :
  for j in range(n_iter) :
    if(tempList[j] == mseSumList[i]) :
      rankList[i] = j+1
      break
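
An equivalent ranking can be obtained with pandas. In this minimal sketch on made-up values, ```rank(method="min")``` gives tied values the lowest rank of their group, which is also what the nested loop above does when it stops at the first match.

```python
import pandas as pd

# Hypothetical MSE values; the duplicate shows how ties are handled
mse_values = [0.58, 0.43, 0.38, 0.43]

# Rank 1 is the smallest MSE; tied values share the lowest rank of the group
ranks = pd.Series(mse_values).rank(method="min").astype(int).tolist()
print(ranks)
```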


In [None]:
paramData = {
    "Number Of Layers" : b,
    "Neuron/Layer" : c,
    "Activation Fn" : d,
    "Learning Rate" : e,
    "Mse" : mseSumList,
    "Rank" : rankList
}
dframe = pd.DataFrame(data=paramData)
dframe.to_csv('drive/MyDrive/paramData.csv')
dframe.to_csv('paramData.csv')
dframe.sort_values("Rank").head()

| | Number Of Layers | Neuron/Layer | Activation Fn | Learning Rate | Mse | Rank |
|---|---|---|---|---|---|---|
| 12 | 22 | [239, 350, 417, 461, 328, 787, 318, 488, 725, ... | relu | 0.0001 | 0.381163 | 1 |
| 7 | 12 | [196, 834, 922, 790, 329, 167, 101, 700, 378, ... | relu | 0.0001 | 0.432452 | 2 |
| 6 | 26 | [213, 719, 914, 1011, 681, 724, 827, 112, 1007... | relu | 1e-05 | 0.502728 | 3 |
| 4 | 29 | [490, 791, 787, 511, 354, 299, 692, 459, 784, ... | relu | 1e-05 | 0.577672 | 4 |
| 13 | 26 | [549, 707, 272, 1024, 173, 46, 141, 694, 177, ... | relu | 1e-05 | 0.583462 | 5 |


### 3. Result

We obtained the following results after performing regression on the Housing Prices dataset with a deep neural network, tuning its hyperparameters with a custom implementation of random search.

The best hyperparameters found were:

- Number of layers: 22
- Number of neurons per layer: [239, 350, 417, 461, 328, 787, 318, 488, 725, 646, 883, 419, 983, 29, 113, 647, 345, 207, 392, 571, 478, 528]
- Activation Function Used: relu
- Learning rate: 0.0001

The minimum MSE obtained on the test data is 0.3811634, with house values expressed in units of 100,000.
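
Because the target was divided by 100,000 before training, the MSE is in squared units of 100,000. Taking the square root and undoing the scaling gives a root mean squared error in the original units of ```median_house_value```. A short sketch of that arithmetic:

```python
import math

mse = 0.3811634                       # best test MSE reported above
rmse_scaled = math.sqrt(mse)          # RMSE in units of 100,000
rmse_original = rmse_scaled * 100000  # undo the division applied to the target
print(rmse_scaled, rmse_original)
```

This works out to an RMSE of roughly 0.62 in scaled units, or a little under 62,000 in the original units.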

