# Assignment 1: Bucharest Housing Dataset





## Dataset Description
In the dataset linked below you have over three thousand apartments listed for sale on the locally popular website *imobiliare.ro*. Each entry provides details about different aspects of the house or apartment:
1. `Nr Camere` indicates the number of rooms;
2. `Suprafata` specifies the total area of the dwelling;
3. `Etaj` specifies the floor that the home is located at;
4. `Total Etaje` is the total number of floors of the block of flats;
5. `Sector` represents the administrative district of Bucharest in which the apartment is located;
6. `Pret` represents the listing price of each dwelling;
7. `Scor` represents a rating between 1 and 5 of location of the apartment. It was computed in the following manner by the dataset creator:
  1. The initial dataset included the address of each flat;
  2. An extra dataset was used, which included the average sales price of dwellings in different areas of town;
  3. Using all of these monthly averages, a clusterization algorithm grouped them into 5 classes, which were then labelled 1-5;
  4. You can think of these scores as an indication of the value of the surrounding area, with 1 being expensive, and 5 being inexpensive.

Dataset Source: [kaggle.com/denisadutca](https://www.kaggle.com/denisadutca/bucharest-house-price-dataset/kernels)




## To Do

To complete this assignment, you must:
1. Get the data in a PyTorch-friendly format;
2. Predict the `Nr Camere` of each dwelling, treating it as a **classification** problem. Choose an appropriate loss function;
3. Predict the `Nr Camere` of each dwelling, treating it as a **regression** problem. Choose an appropriate loss function;
4. Compare the results of the two approaches, displaying the Confusion Matrix for the two, as well as any comparing any other metrics you think are interesting (e.g. MSE). Comment on the results;
5. Choose to predict a feature more suitable to be treated as a **regression** problem, then successfully solve it.
6. What values should the loss have when the predictions are random (when your network is not trained at all)?
7. Don't forget to split the dataset in training and validation.




## Hints
1. It might prove useful to link your Google Drive to this Notebook. See the code cell below;
2. You might want to think of ways of preprocessing your data (e.g. One Hot Encoding, etc.);
3. Don't be afraid of using text cells to actually write your thoughts about the data/results. Might prove useful at the end of the semester when you'll need to walk us through your solution 😉.



## Deadline
March 18, 2021, 23:59

**Punctaj maxim:** 2 puncte.

Depunctarea este de 0.25 puncte pe zi intarziata. Dupa mai mult de 4 zile intarziere, punctajul maxim care se poate obtine ramane 1 punct.

Trimite notebookul si datasetul intr-o arhiva `NumePrenume_Grupa_Tema1.zip` aici: https://forms.gle/MGrLvehEjmtWmQZP7 (la sustinerea temei, vei rula codul din arhiva).

Import Libraries

In [1]:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from typing import Iterator

#Predict the Nr Camere of each dwelling, treating it as a regression problem. Choose an appropriate loss function.
Use Linear Regression from Lab2.

In [2]:
class MSE():
    """The Mean Squared Error loss"""
    def __call__(self, y: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        y = torch.Tensor(y)
        target = torch.Tensor(target)
        mse = ((y - target) ** 2).sum().sqrt().mean()
        return mse

In [3]:
class GDLinearRegression(nn.Module):
    """A simple Linear Regression model"""

    def __init__(self):
        super().__init__()
        # We're initializing our model with random weights
        self.w = nn.Parameter(torch.randn(6, requires_grad=True))
        self.b = nn.Parameter(torch.randn(1, requires_grad=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.Tensor(x)
        y = x @ self.w + self.b
        return y

  # PyTorch is accumulating gradients
  # After each Gradient Descent step we should reset the gradients
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.Tensor(x)
        y = x @ self.w + self.b
        return y
     

In [4]:
class GD:
  """
  Gradient Descent optimizer
  """
  def __init__(self, params: Iterator[nn.Parameter], lr: int):
    self.w, self.b = list(params)
    self.lr = lr

  def step(self):
    """
    Perform a gradient decent step. Update the parameters w and b by using:
     - the gradient of the loss with respect to the parameters
     - the learning rate
    This method is called after backward(), so the gradient of the loss wrt
    the parameters is already computed.
    """
    with torch.no_grad():
      self.w -= self.w.grad * self.lr
      self.b -= self.b.grad * self.lr

In [5]:
def train(model: GDLinearRegression, data: torch.Tensor,
          target: torch.Tensor, optim: GD, criterion: MSE):
    """Linear Regression train routine"""
    # forward pass: compute predictions (hint: use model.forward)
    predictions = model(data)

    # forward pass: compute loss (hint: use criterion)
    loss = criterion(predictions, target)

    # backpropagation: compute gradients of loss wrt weights
    loss.backward()

    # GD step: update weights using the gradients (hint: use optim)
    optim.step()

    # reset the gradients (hint: use model)
    model.zero_grad()

    return model

#Get the data in a PyTorch-friendly format
Split the data in traing and validation and tandardize features by removing the mean and scaling to unit variance.

In [6]:
data = pd.read_csv('Bucharest_HousePriceDataset.csv', sep=',')

predict = 'Nr Camere'
x = data.drop(columns=predict).values

y = data[[predict]].values.ravel()
y = y-1
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

scaler = StandardScaler()
scaler.fit(x)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [7]:
mse = MSE()
lr = 0.0015 #@param {type: "slider", min: 0.001, max: 2, step: 0.005}
total_steps = 100 #@param {type:"slider", min: 0, max: 100, step: 1}

model = GDLinearRegression()
optimizer = GD(model.parameters(), lr=lr)
criterion = MSE()

for i in range(total_steps):
    train(model, x_train, y_train, optimizer, criterion)

with torch.no_grad():
    y_pred = model(x_train)

Print the accuracy for training and validation, as well as the Confusion Matrix and MSE.

In [8]:
print(accuracy_score(y_train, torch.round(y_pred).numpy()))

with torch.no_grad():
    y_pred = model(x_test)

print(accuracy_score(y_test, torch.round(y_pred).numpy()))

predicted = torch.round(y_pred).numpy()
regression_matrix = confusion_matrix(predicted,y_test)
regression_mse = mse(predicted,y_test)

print(regression_matrix)
print('MSE:', regression_mse)

0.683669854764435
0.6883852691218131
[[  6   0   0   0   0   0   0]
 [ 54 313  71   1   0   0   0]
 [  0  14 145  42   0   0   0]
 [  0   0  15  21   5   0   0]
 [  0   0   1   8   1   1   0]
 [  0   0   1   2   2   0   0]
 [  0   0   1   2   0   0   0]]
MSE: tensor(16.4621)


Predict the Nr Camere of each dwelling, treating it as a 
classification problem. 
Choose an appropriate loss function
Using the nn from lab2 ex 3

In [9]:
#Define out Nn
class ThreeClassNN(nn.Module):
    def __init__(self,
                 input_size: int,
                 hidden_size: int,
                 output_size: int,
                 hidden_activation_fn=nn.ReLU()):
        # Initialise the base class (nn.Module)
        super().__init__()

        # As stated above, we'll simply use `torch.nn.Linear` to define our 2 layers
        self._layer1 = nn.Linear(input_size, hidden_size)
        self._layer2 = nn.Linear(hidden_size, output_size)

        self._hidden_activation = hidden_activation_fn

    def forward(self, x):

        # Layer 1 using ReLU as activation
        h = self._hidden_activation(self._layer1(x))
        out = self._layer2(h)

        return out

Data Preprocessing

In [10]:
predict = 'Nr Camere'

x = data.drop(columns=predict).values

y = data[[predict]].values
y = y-1
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

scaler = StandardScaler()
scaler.fit(x)
x_train = torch.tensor(scaler.transform(x_train)).float()
x_test = torch.tensor(scaler.transform(x_test)).float()

Train the model

In [11]:
model = ThreeClassNN(6, 1000, 9)

NUM_EPOCHS = 500

optim = torch.optim.Adam(model.parameters(), lr=1.5)
criterion = nn.CrossEntropyLoss()

for i in range(NUM_EPOCHS):
    # Set the model to train mode and reset the gradients
    model.train()
    optim.zero_grad()

    output = model(x_train)
    target = torch.tensor(y_train).long().squeeze(1)
    loss = criterion(output, target)

    loss.backward()
    optim.step()

Print the accuracy for training and validation, as well as the Confusion Matrix and MSE.

In [12]:
predicted = np.array(torch.argmax(model(x_train), dim=-1))
accuracy = (predicted==y_train.ravel()).sum()/predicted.shape[0]
print(accuracy)

predicted = np.array(torch.argmax(model(x_test), dim=-1))
accuracy = (predicted==y_test.ravel()).sum()/predicted.shape[0]
print(accuracy)


class_matrix = confusion_matrix(predicted,y_test)
class_mse = mse(predicted,y_test)

print(class_matrix)
print('MSE:', class_mse)

0.7357421183138505
0.6841359773371105
[[ 51  10   0   0   0   0   0]
 [ 27 233  53   5   0   0   0]
 [  0  54 180  45   2   1   0]
 [  0   2  20  19   0   0   0]
 [  0   0   0   3   0   0   0]
 [  0   0   0   0   0   0   0]
 [  0   0   0   1   0   0   0]]
MSE: tensor(815.0227)


Loss when the network is not trained.
Random variables uniform distribuited.
Random  means that any number can occur with the same probability so the prob to be a certain number
of rooms is 1/max number (max number = 9)

In [13]:
p = 1.0/9.0
Q = [p,p,p,p,p,p,p,p,p]

#calculate the probability for each nr of rooms from our 
#dataset nr_favourite_cases/nr of posible cases

nr_y = [0,0,0,0,0,0,0,0,0]
for i in range(len(y)):
    nr_y[y[i][0]] += 1

P = [nr_y[i]/len(y) for i in range(len(nr_y))]


#Cross-entropy can be calculated using the probabilities of
#the events from P and Q, as follows:
#H(P, Q) = – sum x in X P(x) * log(Q(x))
H = 0
for i in range(len(Q)):
    H += P[i] * np.log(Q[i]+1e-9) #we don't want 0 values
    
H = -H
print(H)

2.1972245683362193


Choose to predict a feature more suitable to be treated as a regression problem, then successfully solve it.
Choose: pret

In [14]:
predict = 'Pret'

x = data.drop(columns=predict).values

y = data[[predict]].values.ravel()

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

scaler = StandardScaler()
scaler.fit(x)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [15]:
lr = 100
total_steps = 100
model = GDLinearRegression()
optimizer = GD(model.parameters(), lr=lr)
criterion = MSE()

for i in range(total_steps):
    train(model, x_train, y_train, optimizer, criterion)

with torch.no_grad():
    y_pred = model(x_train)

Traing and validation accuracy

In [16]:
mse = MSE()
print(mse(y_pred, y_train).item())

with torch.no_grad():
    y_pred = model(x_test)

mse = MSE()
print(mse(y_pred, y_test).item())

1650838.75
852138.875
