# Real-Estate-Estimator

In the following challenge, we will try to figure out whether there exists a ***LINEAR RELATIONSHIP*** between:
- the **price** of a flat (our **target** for each flat)
- and some usual **features** such as surface area, number of bedrooms, etc.

## Imports

In [1]:
import numpy as np

## Data and approach 

Suppose that we were able to collect data for 4 flats down below: 
- their **features**:
    - `surface` (square feet)
    - `bedrooms`
    - `floors` 
- their **target**:
    - `price` (in thousands of USD)

|flats |surface (square feet)|bedrooms|floors|price (k USD)|
|------|-------------|--------|------|------------|
|flat1 |620|1|1|244|
|flat2 |3280|4|2|671|
|flat3 |1900|2|2|504|
|flat4 |1320|3|3|510|

👉 A first approach to **predict the price of an apartment** is to try to **find a linear relationship between the target and the features** (*i.e. between the price and the (surface, bedrooms, floors)*), by solving the following **system of $n = 4$ linear equations with $p = 4$ unknown variables**:



$$\begin{cases}
    244 = \theta_0 + 620\theta_1 + 1\theta_2 + 1\theta_3 \\
    671 = \theta_0 + 3280\theta_1 + 4\theta_2 + 2\theta_3 \\
    504 = \theta_0 + 1900\theta_1 + 2\theta_2 + 2\theta_3 \\
    510 = \theta_0 + 1320\theta_1 + 3\theta_2 + 3\theta_3 \\
\end{cases}$$

which can be written as a matrix equation:

$$\boldsymbol y = \boldsymbol {X \cdot \theta}$$

$$\begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}_{4 \times 1} = \begin{bmatrix}
    1 & 620 & 1 & 1 \\
    1 & 3280 & 4 & 2 \\
    1 & 1900 & 2 & 2 \\
    1 & 1320 & 3 & 3
\end{bmatrix}_{4 \times 4} \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}_{4 \times 1}$$

where :
* $\boldsymbol y$ is the **`target`**, the vector of `Price`
* $\boldsymbol X$ represents the **`matrix of features`**
* $\boldsymbol {\theta} = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$ (*theta*) is the vector of **coefficients/variables/unknowns** to be found

----

Here, we are using the Greek letter `theta` $\boldsymbol \theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3 \\
\end{bmatrix}$, to represent the coefficients of our **features**:

- A flat with no surface, no bedroom and no floor would cost $\theta_0$ thousand USD
- An increase of one square foot - *holding the number of bedrooms and the floor number constant* - would increase the price by $\theta_1$ thousand USD
- An additional bedroom - *holding the surface and the floor number constant* - would increase the price by $\theta_2$ thousand USD
- An increase of one floor number - *holding the surface and the number of bedrooms constant* - would increase the price by $\theta_3$ thousand USD

----

If we manage to solve this system of linear equations (i.e. if we find $\theta_0$, $\theta_1$, $\theta_2$, $\theta_3$), the price of any new flat could be estimated using the following formula: $$y_{newflat} = \boldsymbol x_{newflat} \cdot \boldsymbol \theta$$

## Defining the matrix $\boldsymbol X$ of `features`:

Create a $(4,3)$ `numpy.ndarray` storing the values of the 3 features (surface, bedrooms, floors) for the 4 observations. 

In [2]:
# Each flat is a list of features: [surface, bedrooms, floors]
# (their coefficients will be θ1, θ2, θ3 respectively)
flat1 = [620,1,1] 
flat2 = [3280,4,2] 
flat3 = [1900,2,2] 
flat4 = [1320,3,3]


X = np.array([
            flat1,
            flat2,
            flat3,
            flat4
            ])

print(f"""
{type(flat1) = }
--------------------------------------------
{X = }
--------------------------------------------
{type(X)= }
{len(X)= }
{X.shape = }
{X.ndim = }
{X.dtype = }
""")



type(flat1) = <class 'list'>
--------------------------------------------
X = array([[ 620,    1,    1],
       [3280,    4,    2],
       [1900,    2,    2],
       [1320,    3,    3]])
--------------------------------------------
type(X)= <class 'numpy.ndarray'>
len(X)= 4
X.shape = (4, 3)
X.ndim = 2
X.dtype = dtype('int64')



Add a "constant" vector of ones $\boldsymbol x_0 = \begin{bmatrix}
    1 \\
    1 \\
    1 \\
    1 \\
\end{bmatrix}$ to create the $(4,4)$ matrix $\boldsymbol X$ representing the linear system of equations.

<details>
    <summary><i>Explanations</i></summary>

As you've probably noticed, the linear system of equations includes a $\theta_0$ coefficient which appears in the 4 equations. 

We need an additional constant feature, equal to 1 for every flat, to represent the y-intercept $\theta_0$ of the linear regression line.

_Note_: strictly speaking, this is an [affine relation](https://math.stackexchange.com/questions/275310/what-is-the-difference-between-linear-and-affine-function) rather than a strict linear relation between the `price` and the features (_Cf. Decision Science Module_)
    
    
</details>

## Define x0 as a (4,1) vector filled with ones using the fastest NumPy method

In [3]:
x0 = np.ones((4,1)) 
print(f"""
{x0 = }
--------------------------------------------
{type(x0)= }
{len(x0)= }
{x0.shape = }
{x0.ndim = }
{x0.dtype = }
""")


x0 = array([[1.],
       [1.],
       [1.],
       [1.]])
--------------------------------------------
type(x0)= <class 'numpy.ndarray'>
len(x0)= 4
x0.shape = (4, 1)
x0.ndim = 2
x0.dtype = dtype('float64')



## Use `numpy.hstack` to create the (4,4) matrix X by concatenating x0 to your previous (4,3) matrix


In [4]:
X = np.hstack((x0,X))

print(f"""
{X = }
--------------------------------------------
{type(X)= }
{len(X)= }
{X.shape = }
{X.ndim = }
{X.dtype = }
""")



X = array([[1.00e+00, 6.20e+02, 1.00e+00, 1.00e+00],
       [1.00e+00, 3.28e+03, 4.00e+00, 2.00e+00],
       [1.00e+00, 1.90e+03, 2.00e+00, 2.00e+00],
       [1.00e+00, 1.32e+03, 3.00e+00, 3.00e+00]])
--------------------------------------------
type(X)= <class 'numpy.ndarray'>
len(X)= 4
X.shape = (4, 4)
X.ndim = 2
X.dtype = dtype('float64')
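To see what the column of ones does, the short sketch below compares the matrix product with an explicit "intercept + features" computation. The values in `theta_demo` are arbitrary placeholders chosen only for illustration (they are not the solution of the system), and the names `features`, `X_demo`, `with_ones` and `explicit` are assumptions of this example.

```python
import numpy as np

# Features of the 4 flats: surface, bedrooms, floors
features = np.array([
    [620, 1, 1],
    [3280, 4, 2],
    [1900, 2, 2],
    [1320, 3, 3],
])

# Prepend the column of ones, as done above with np.hstack
X_demo = np.hstack((np.ones((4, 1)), features))

# Arbitrary coefficients, for illustration only (NOT the solution of the system)
theta_demo = np.array([[10.0], [0.1], [2.0], [3.0]])

# Matrix product: the column of ones multiplies theta_0 in every row
with_ones = X_demo @ theta_demo

# Same computation written as "intercept + features . other coefficients"
explicit = theta_demo[0] + features @ theta_demo[1:]

print(np.allclose(with_ones, explicit))
```

Both expressions give the same $(4,1)$ result, which is why the intercept can be handled as just one more column of $\boldsymbol X$.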



## Define the vector $\boldsymbol y$ of `Prices`

$\boldsymbol y  = \begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}$

In order to match our matrix representation $\boldsymbol y  = \boldsymbol {X\cdot \theta}$, what should the shape of $\boldsymbol y$ be? Define $\boldsymbol y$ down below. 

<details>
    <summary><i>Hint</i></summary>

$\boldsymbol y$ should be a $(4,1)$ array, i.e. a column vector (a 1-D "vector" laid out vertically)
</details>

In [5]:
y = np.array([
    [244],
    [671],
    [504],
    [510]])

print(f"""
{y = }
--------------------------------------------
{type(y)= }
{len(y)= }
{y.shape = }
{y.ndim = }
{y.dtype = }
""")


y = array([[244],
       [671],
       [504],
       [510]])
--------------------------------------------
type(y)= <class 'numpy.ndarray'>
len(y)= 4
y.shape = (4, 1)
y.ndim = 2
y.dtype = dtype('int64')
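If the prices come as a flat 1-D array, one way to obtain the $(4,1)$ column shape is `reshape(-1, 1)`. A minimal sketch (the names `prices` and `y_col` are assumptions of this example):

```python
import numpy as np

# Prices as a 1-D array of shape (4,)
prices = np.array([244, 671, 504, 510])

# -1 lets NumPy infer the number of rows; 1 forces a single column
y_col = prices.reshape(-1, 1)

print(prices.shape)  # (4,)
print(y_col.shape)   # (4, 1)
```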



## Find the solution of the system

Now, it's time to find the vector of coefficients $\boldsymbol \theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$ !

Provided that $\boldsymbol X$ is invertible, the solution of the equation is:
 
$$ \large \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y 
\large \iff \boldsymbol X^{-1} \cdot \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol X^{-1} \cdot \boldsymbol y 
\large \iff \boldsymbol \theta = \boldsymbol X^{-1} \cdot \boldsymbol y
$$

where $\large \boldsymbol X^{-1}$ is the inverse of $\large \boldsymbol X$.
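The inverse only exists when $\boldsymbol X$ is square with a non-zero determinant (equivalently, when it has full rank). A quick way to check this is sketched below with `numpy.linalg.det` and `numpy.linalg.matrix_rank`; `X` is re-declared so that the snippet runs on its own, and `det_X` and `rank_X` are names chosen for this example.

```python
import numpy as np

X = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
], dtype=float)

det_X = np.linalg.det(X)            # non-zero determinant <=> invertible
rank_X = np.linalg.matrix_rank(X)   # rank 4 <=> the 4 columns are independent

print("det(X) =", det_X)
print("rank(X) =", rank_X)
```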

## Compute the inverse of the matrix X with the right NumPy method

You can check that the inversion worked by testing the following equality:

$$\boldsymbol X^{-1} \cdot\boldsymbol X = \boldsymbol I_4$$
where $\boldsymbol I_4$ is the $ 4 \times 4 $ identity matrix $ \begin{bmatrix}
    1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 1
\end{bmatrix}$

In [11]:
X_inverse = np.linalg.inv(X)
I4 = np.eye(4)
X_invX = np.dot(X_inverse, X)  # matrix product, NOT the element-wise product X_inverse * X


print(f"""
{X_inverse = }
--------------------------------------------
{type(X_inverse)= }
{len(X_inverse)= }
{X_inverse.shape = }
{X_inverse.ndim = }
{X_inverse.dtype = }
..............................................................................
{I4 = }
--------------------------------------------
{type(I4)= }
{len(I4)= }
{I4.shape = }
{I4.ndim = }
{I4.dtype = }
..............................................................................
{X_invX = }
--------------------------------------------
{type(X_invX)= }
{len(X_invX)= }
{X_invX.shape = }
{X_invX.ndim = }
{X_invX.dtype =}
""")


X_inverse = array([[ 1.64516129e+00,  4.42419702e-17, -2.90322581e-01,
        -3.54838710e-01],
       [-5.37634409e-04, -2.50426246e-19,  1.07526882e-03,
        -5.37634409e-04],
       [ 3.70967742e-01,  5.00000000e-01, -1.24193548e+00,
         3.70967742e-01],
       [-6.82795699e-01, -5.00000000e-01,  8.65591398e-01,
         3.17204301e-01]])
--------------------------------------------
type(X_inverse)= <class 'numpy.ndarray'>
len(X_inverse)= 4
X_inverse.shape = (4, 4)
X_inverse.ndim = 2
X_inverse.dtype = dtype('float64')
..............................................................................
I4 = array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
--------------------------------------------
type(I4)= <class 'numpy.ndarray'>
len(I4)= 4
I4.shape = (4, 4)
I4.ndim = 2
I4.dtype = dtype('float64')
..............................................................................
X_invX = array([[ 1.00000000e+00, -5.10702591e-14, 

Does it look like $\boldsymbol I_4$?

If it doesn't, you probably used the `*` operator to multiply $\boldsymbol X^{-1}$ and $\boldsymbol X$, which performs an element-wise product. Here we want the matrix product, so you should find the right NumPy function to do so.

If it does, you might have noticed that you do not get exactly zeros and ones in the resulting product, because of floating-point rounding. To be sure, you can use the [`numpy.allclose()`](https://numpy.org/doc/stable/reference/generated/numpy.allclose.html?highlight=allclose#numpy.allclose) function to check your result:

In [12]:
np.allclose(X_invX, I4)

True
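The difference between the element-wise product `*` and the matrix product can be seen on a small example. This sketch uses an arbitrary $2 \times 2$ matrix `A` (an assumption of this example, unrelated to the flats) and the `@` operator, which is equivalent to `np.dot` for 2-D arrays:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
A_inv = np.linalg.inv(A)

elementwise = A_inv * A   # multiplies entries position by position
matrix_prod = A_inv @ A   # true matrix product, same as np.dot(A_inv, A)

print(np.allclose(elementwise, np.eye(2)))  # False
print(np.allclose(matrix_prod, np.eye(2)))  # True
```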

You are finally able to compute `theta` using the formula $ \large \boldsymbol \theta = \boldsymbol X^{-1}\cdot \boldsymbol y $:

In [14]:
theta = np.dot(X_inverse, y)

print(f"""
{theta = }
--------------------------------------------
{type(theta)= }
{len(theta)= }
{theta.shape = }
{theta.ndim = }
{theta.dtype = }
""")


theta = array([[ 74.12903226],
       [  0.13655914],
       [-10.72580645],
       [ 95.93010753]])
--------------------------------------------
type(theta)= <class 'numpy.ndarray'>
len(theta)= 4
theta.shape = (4, 1)
theta.ndim = 2
theta.dtype = dtype('float64')
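As a side note, explicitly inverting $\boldsymbol X$ is not the only option: `numpy.linalg.solve` solves $\boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y$ directly when $\boldsymbol X$ is square and full rank, and is generally considered the more numerically robust choice. A minimal sketch, re-declaring `X` and `y` so that it runs on its own (`theta_inv` and `theta_solve` are names chosen for this example):

```python
import numpy as np

X = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
], dtype=float)
y = np.array([[244], [671], [504], [510]], dtype=float)

theta_inv = np.linalg.inv(X) @ y      # via the explicit inverse
theta_solve = np.linalg.solve(X, y)   # direct solve, no inverse formed

# Both routes agree, and the solution reproduces the 4 observed prices
print(np.allclose(theta_inv, theta_solve))
print(np.allclose(X @ theta_solve, y))
```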



## Estimation of a new flat price

Having solved the system for $\boldsymbol \theta$, you are now able to estimate the `Price` (in thousands of USD) of a 5th flat given these characteristics:

- `Surface`: 3000 $ft^2$
- `Bedrooms`: 5 
- `Floors`: 1

with the following formula:

$$y_{flat5} = \boldsymbol x_{flat5} \cdot \boldsymbol \theta$$

## Define x5

In [16]:
x5 = np.array([[1,3000,5,1]])

## Compute y5


In [17]:
# You should find a price of about 526 (thousand USD), i.e. about 526,000 USD
y5 = np.dot(x5,theta) 

print(f"""
{y5 = }
--------------------------------------------
{type(y5)= }
{len(y5)= }
{y5.shape = }
{y5.ndim = }
{y5.dtype = }
""")


y5 = array([[526.10752688]])
--------------------------------------------
type(y5)= <class 'numpy.ndarray'>
len(y5)= 1
y5.shape = (1, 1)
y5.ndim = 2
y5.dtype = dtype('float64')
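Written out with the rounded coefficients of `theta` found earlier, this matrix product amounts to:

$$y_{flat5} \approx 74.129 + 3000 \times 0.13656 + 5 \times (-10.726) + 1 \times 95.930 \approx 526.1$$

that is, about 526 thousand USD.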



**In reality, a flat price is never entirely determined by its surface, number of bedrooms and floor number.**

Let's imagine that we measure the real price $y_{flat5}$ at 700,000 USD instead of the predicted 526,000 USD.

**Could we take this new information into account to improve our model?**

Update the linear system of equations $ \large \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y$ to incorporate the information carried by this new flat.

## Create the new matrix of features X of shape (5,4)

In [19]:
new_flat = np.array([[1, 3000, 5, 1]])
X2 = np.vstack((X, new_flat))

print(f"""
{X2 = }
--------------------------------------------
{type(X2)= }
{len(X2)= }
{X2.shape = }
{X2.ndim = }
{X2.dtype = }
""")



X2 = array([[1.00e+00, 6.20e+02, 1.00e+00, 1.00e+00],
       [1.00e+00, 3.28e+03, 4.00e+00, 2.00e+00],
       [1.00e+00, 1.90e+03, 2.00e+00, 2.00e+00],
       [1.00e+00, 1.32e+03, 3.00e+00, 3.00e+00],
       [1.00e+00, 3.00e+03, 5.00e+00, 1.00e+00]])
--------------------------------------------
type(X2)= <class 'numpy.ndarray'>
len(X2)= 5
X2.shape = (5, 4)
X2.ndim = 2
X2.dtype = dtype('float64')



## Create new y of shape (5,1)


In [20]:
y2 = np.vstack((y, np.array([[700]])))

print(f"""
{y2 = }
--------------------------------------------
{type(y2)= }
{len(y2)= }
{y2.shape = }
{y2.ndim = }
{y2.dtype = }
""")


y2 = array([[244],
       [671],
       [504],
       [510],
       [700]])
--------------------------------------------
type(y2)= <class 'numpy.ndarray'>
len(y2)= 5
y2.shape = (5, 1)
y2.ndim = 2
y2.dtype = dtype('int64')



Let's try to predict the price of a 6th flat from our updated model.  
To do so, try to solve $\boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y$ for $\boldsymbol \theta$ using [`numpy.linalg.solve`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html).

What is going on? What can you conclude?

In [21]:
np.linalg.solve(X2, y2)

LinAlgError: Last 2 dimensions of the array must be square

$ \large \boldsymbol X$ is now a $5 \times 4$ matrix: it is not square

$ \large  \rightarrow$ therefore it is not invertible: $ \large  \boldsymbol X^{-1}$ does not exist
 
$ \large  \rightarrow$ $ \large \boldsymbol \theta$ cannot be computed as $ \large \boldsymbol X^{-1} \cdot \boldsymbol y$. With 5 equations and only 4 unknowns, the system is overdetermined and has no exact solution here: the $\boldsymbol \theta$ that fits the first 4 flats predicts about 526 for the 5th flat, not the observed 700.
    
Our initial approach, which consists in finding an exact formula that computes the price of a flat as a linear combination of only 3 features, **does not hold** for our 5 observed flats. 
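One way to confirm that the 5-equation system has no exact solution is to compare the rank of $\boldsymbol X$ with the rank of the augmented matrix $[\boldsymbol X \mid \boldsymbol y]$: the system is consistent only if both ranks are equal (Rouché-Capelli theorem). A sketch, with `X2` and `y2` re-declared so it runs on its own; the names `augmented`, `rank_X2` and `rank_aug` are assumptions of this example:

```python
import numpy as np

X2 = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
    [1, 3000, 5, 1],
], dtype=float)
y2 = np.array([[244], [671], [504], [510], [700]], dtype=float)

# Append y2 as an extra column: [X2 | y2]
augmented = np.hstack((X2, y2))

rank_X2 = np.linalg.matrix_rank(X2)
rank_aug = np.linalg.matrix_rank(augmented)

# Different ranks => y2 is not a linear combination of the columns of X2
print(rank_X2, rank_aug)  # 4 5
```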

***Trust the process !*** 

$ \large  \rightarrow$ Instead, we will learn in the coming weeks methods to **approximate** a flat price based on these features.

For instance, instead of solving $\large  \boldsymbol y = \boldsymbol X \cdot \boldsymbol \theta$ exactly, we could find the $ \large  \hat{\boldsymbol \theta}$ that makes the error $ \large \boldsymbol e = \boldsymbol X \cdot \hat{\boldsymbol \theta} - \boldsymbol y $ as small as possible (typically by minimizing the sum of its squared components). This approach is called a **Linear Regression model**.

This new estimator can then be used to give an **approximate** estimation of the price of any new flat with $ \large  \hat y_{flat_6} = \boldsymbol x_{flat_6} \cdot \hat{\boldsymbol \theta}$ 
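As a preview, `numpy.linalg.lstsq` computes a $\hat{\boldsymbol \theta}$ that minimizes the sum of squared errors $\lVert \boldsymbol X \cdot \hat{\boldsymbol \theta} - \boldsymbol y \rVert^2$. The sketch below is only one possible way to do it, not necessarily the method used later in the course. The 6th flat `x6` is a made-up example, and `theta_4`, `sse_hat` and `sse_4` are names chosen here for illustration.

```python
import numpy as np

X2 = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
    [1, 3000, 5, 1],
], dtype=float)
y2 = np.array([[244], [671], [504], [510], [700]], dtype=float)

# Least-squares estimate of theta (rcond=None selects NumPy's default cutoff)
theta_hat, residuals, rank, sing_vals = np.linalg.lstsq(X2, y2, rcond=None)

# Exact solution fitted on the first 4 flats only, for comparison
theta_4 = np.linalg.solve(X2[:4], y2[:4])

# Sum of squared errors over the 5 flats for each candidate theta
sse_hat = float(np.sum((X2 @ theta_hat - y2) ** 2))
sse_4 = float(np.sum((X2 @ theta_4 - y2) ** 2))
print("SSE least squares:", round(sse_hat, 1))
print("SSE 4-flat solution:", round(sse_4, 1))

# Hypothetical 6th flat: intercept term, 2000 sq ft, 3 bedrooms, 2 floors
x6 = np.array([[1, 2000, 3, 2]])
y6_hat = x6 @ theta_hat
print(y6_hat)
```

The least-squares estimate no longer reproduces any single flat exactly, but its total squared error over the 5 flats is lower than that of the coefficients fitted on the first 4 flats alone.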

