## Contents:
* [Data-set description](#1)
* * [Data Manipulation in R](#2)
* * [Data Manipulation in Python](#3)
* [Introduction to the Linear Regression Model](#4)
* * [Usefulness of the Linear Regression Model](#5)
* * [Formal Approach to the Linear Regression Model](#6)
* * [Basic Assumptions](#7)
* * [Assumptions Consequences](#8)
* * [Matrix representation of the basic assumption of the model](#9)
* [Estimation](#10)
* * [Prediction of Response Variable](#11)
* * [Estimation of model coefficients](#12)
* * [Estimation of model errors](#13)
* * [Regression Hyperplane ](#14)
* * [Hat-Matrix](#15)
* * [Right Join](#16)
* * [Semi Join](#17)
* * [Anti Join](#18)
* * [Union](#19)
* * [Intersect](#20)
* * [Difference](#21)
*  [Concatenate](#22)
*  [Group and Summarize](#23)
*  [Other useful functions ](#24)

## Data-set description <a class="anchor" id="1"></a>



In this section we describe the data-set used throughout this article.

The data consist of 1905 observations of 38 variables on housing features.

The data were downloaded from:
<https://www.kaggle.com/datasets/dataregress/dubai-properties-dataset?resource=download>



The variables of our interest are the following:

-   id : identifier

-   neighborhood: the name of the neighborhood

-   latitude: the latitude of the house

-   longitude: the longitude of the house

-   price: the market price of the house

-   size_in_sqft: the size of the house in square feet

    -   1 sqft = 0.092903 $m^2$

-   price_per_sqft: the market price of the house per square foot

-   no_of_bedrooms: number of bedrooms in the house

-   no_of_bathrooms: number of bathrooms in the house

-   quality: quality of the house, based on the number of services. Its
    categories are Ultra, High, Medium and Low

-   maid_room: indicates if the house has a maid room (true/false)

-   unfurnished: indicates if the house is unfurnished (true/false)

-   balcony: indicates if the house has balcony (true/false)

-   barbecue_area: indicates if the house has barbecue area (true/false)

-   central_ac: indicates if the house has central air conditioning
    (true/false)

-   childrens_play_area: indicates if the house has a children's play area
    (true/false)

-   childrens_pool: indicates if the house has a children's pool
    (true/false)

-   concierge: indicates if the house has concierge (true/false)

-   covered_parking: indicates if the house has covered parking
    (true/false)

-   kitchen_appliances: indicates if the house has kitchen appliances
    (true/false)

-   maid_service: indicates if the house has maid service (true/false)

-   pets_allowed: indicates if pets are allowed (true/false)

-   private_garden: indicates if the house has private garden
    (true/false)

-   private_gym: indicates if the house has private gym (true/false)

-   private_jacuzzi: indicates if the house has private jacuzzi
    (true/false)

-   private_pool: indicates if the house has private pool (true/false)

-   security: indicates if the house has private security (true/false)

-   shared_gym: indicates if the house has shared gym (true/false)

-   shared_pool: indicates if the house has shared pool (true/false)

-   shared_spa: indicates if the house has shared spa (true/false)

-   view_of_water: indicates if the house has view of the water
    (true/false)





Now we are going to do the following:

1. We are going to load and manipulate the data-set in R

2. We will repeat this task in Python



### Data Manipulation in R <a class="anchor" id="2"></a>

In [54]:
import rpy2

%load_ext rpy2.ipython

import rpy2.robjects as robjects

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [55]:
%%R

library(tidyverse)


We load the data-set with which we are going to work:


In [56]:
%%R 

url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Linear%20Regression%20in%20Python%20and%20R/properties_data.csv'

properties_data <- read_csv(url)

Rows: 1905 Columns: 38
-- Column specification --------------------------------------------------------
Delimiter: ","
chr  (2): neighborhood, quality
dbl  (8): id, latitude, longitude, price, size_in_sqft, price_per_sqft, no_o...
lgl (28): maid_room, unfurnished, balcony, barbecue_area, built_in_wardrobes...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


Now, we are going to transform the variables that are measured in square feet (sqft) to square meters $(m^2)$:



size_in_m\_2 = 0.092903 \* size_in_sqft

price_per_m\_2 = price_per_sqft / 0.092903








In [57]:
%%R 

size_in_m_2 <-  0.092903*properties_data$size_in_sqft

properties_data$size_in_m_2 <- size_in_m_2

price_per_m_2 <- properties_data$price_per_sqft /  0.092903 

properties_data$price_per_m_2 <- price_per_m_2




The next step is to recode the variable *quality* as a factor (Low = 0, Medium = 1, High = 2, Ultra = 3) and to keep in the data-set only the variables that we will take into account:


In [58]:
%%R 

 properties_data$quality = recode(properties_data$quality , "Low"=0, "Medium"=1, "High"=2 , "Ultra"=3)

properties_data$quality = factor(properties_data$quality)

In [59]:
%%R 

data_R <- properties_data %>% select("price", "size_in_m_2", "longitude", "latitude", "no_of_bedrooms", "no_of_bathrooms", "quality")

head(data_R)

# A tibble: 6 x 7
    price size_in_m_2 longitude latitude no_of_bedrooms no_of_bathrooms quality
    <dbl>       <dbl>     <dbl>    <dbl>          <dbl>           <dbl> <fct>  
1 2700000       100.       55.1     25.1              1               2 1      
2 2850000       147.       55.2     25.1              2               2 1      
3 1150000       181.       55.1     25.1              3               5 1      
4 2850000       188.       55.3     25.2              2               3 0      
5 1729200        47.1      55.1     25.1              0               1 1      
6 3119900        94.3      55.1     25.1              1               2 1      


### Data Manipulation in Python <a class="anchor" id="3"></a>

In [60]:
import pandas as pd

from dfply import *

import warnings
warnings.filterwarnings('ignore')

In [61]:
url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Linear%20Regression%20in%20Python%20and%20R/properties_data.csv'

data_Python = pd.read_csv(url)

data_Python

Unnamed: 0,id,neighborhood,latitude,longitude,price,size_in_sqft,price_per_sqft,no_of_bedrooms,no_of_bathrooms,quality,...,private_pool,security,shared_gym,shared_pool,shared_spa,study,vastu_compliant,view_of_landmark,view_of_water,walk_in_closet
0,5528049,Palm Jumeirah,25.113208,55.138932,2700000,1079,2502.32,1,2,Medium,...,False,False,True,False,False,False,False,False,True,False
1,6008529,Palm Jumeirah,25.106809,55.151201,2850000,1582,1801.52,2,2,Medium,...,False,False,True,True,False,False,False,False,True,False
2,6034542,Jumeirah Lake Towers,25.063302,55.137728,1150000,1951,589.44,3,5,Medium,...,False,True,True,True,False,False,False,True,True,True
3,6326063,Culture Village,25.227295,55.341761,2850000,2020,1410.89,2,3,Low,...,False,False,False,False,False,False,False,False,False,False
4,6356778,Palm Jumeirah,25.114275,55.139764,1729200,507,3410.65,0,1,Medium,...,False,True,True,True,True,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1900,7705450,Mohammed Bin Rashid City,25.176892,55.310712,1500000,1087,1379.94,2,2,Ultra,...,False,True,True,True,True,True,True,True,True,True
1901,7706287,Mohammed Bin Rashid City,25.166145,55.276684,1230000,760,1618.42,1,2,Medium,...,False,False,True,True,False,False,False,False,True,True
1902,7706389,Dubai Creek Harbour (The Lagoons),25.206500,55.345056,2900000,1930,1502.59,3,5,Medium,...,False,False,False,True,False,False,False,False,False,False
1903,7706591,Jumeirah Village Circle,25.073858,55.229844,675000,740,912.16,1,2,Medium,...,False,True,True,True,False,False,False,False,True,True


In [62]:
data_Python['size_in_m_2'] = 0.092903*data_Python['size_in_sqft']
data_Python['price_per_m_2'] = data_Python['price_per_sqft']/0.092903

In [63]:
data_Python = data_Python >> select(X.price , X.size_in_m_2, X.longitude, X.latitude, X.no_of_bedrooms, X.no_of_bathrooms, X.quality)
data_Python

Unnamed: 0,price,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality
0,2700000,100.242337,55.138932,25.113208,1,2,Medium
1,2850000,146.972546,55.151201,25.106809,2,2,Medium
2,1150000,181.253753,55.137728,25.063302,3,5,Medium
3,2850000,187.664060,55.341761,25.227295,2,3,Low
4,1729200,47.101821,55.139764,25.114275,0,1,Medium
...,...,...,...,...,...,...,...
1900,1500000,100.985561,55.310712,25.176892,2,2,Ultra
1901,1230000,70.606280,55.276684,25.166145,1,2,Medium
1902,2900000,179.302790,55.345056,25.206500,3,5,Medium
1903,675000,68.748220,55.229844,25.073858,1,2,Medium


In [64]:
data_Python.dtypes

price                int64
size_in_m_2        float64
longitude          float64
latitude           float64
no_of_bedrooms       int64
no_of_bathrooms      int64
quality             object
dtype: object

In [65]:
data_Python['quality'] = data_Python['quality'].astype('category')

In [66]:
data_Python.dtypes

price                 int64
size_in_m_2         float64
longitude           float64
latitude            float64
no_of_bedrooms        int64
no_of_bathrooms       int64
quality            category
dtype: object

In [67]:
data_Python['quality'].unique()

['Medium', 'Low', 'High', 'Ultra']
Categories (4, object): ['High', 'Low', 'Medium', 'Ultra']

In [68]:
# Map each category label to its numeric code (Low=0, Medium=1, High=2, Ultra=3)
quality_codes = {'Low': 0, 'Medium': 1, 'High': 2, 'Ultra': 3}

# Vectorized recoding; avoids the row-by-row loop with chained assignment
data_Python['quality_recode'] = data_Python['quality'].astype(str).map(quality_codes)

In [69]:
data_Python.head()

Unnamed: 0,price,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality,quality_recode
0,2700000,100.242337,55.138932,25.113208,1,2,Medium,1
1,2850000,146.972546,55.151201,25.106809,2,2,Medium,1
2,1150000,181.253753,55.137728,25.063302,3,5,Medium,1
3,2850000,187.66406,55.341761,25.227295,2,3,Low,0
4,1729200,47.101821,55.139764,25.114275,0,1,Medium,1


In [70]:
data_Python = data_Python >> select( ~X.quality )

In [71]:
data_Python = data_Python >> rename(quality = X.quality_recode)

In [72]:
data_Python['quality'] = data_Python['quality'].astype('category')

The final Python data-set is:

In [73]:
data_Python

Unnamed: 0,price,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality
0,2700000,100.242337,55.138932,25.113208,1,2,1
1,2850000,146.972546,55.151201,25.106809,2,2,1
2,1150000,181.253753,55.137728,25.063302,3,5,1
3,2850000,187.664060,55.341761,25.227295,2,3,0
4,1729200,47.101821,55.139764,25.114275,0,1,1
...,...,...,...,...,...,...,...
1900,1500000,100.985561,55.310712,25.176892,2,2,3
1901,1230000,70.606280,55.276684,25.166145,1,2,1
1902,2900000,179.302790,55.345056,25.206500,3,5,1
1903,675000,68.748220,55.229844,25.073858,1,2,1


**Important**: in the Python models of this article the categorical variable is used in recoded form (its values are numbers that represent its categories), i.e., we do not use the original text variable *quality*; instead we use *quality_recode*, renamed to *quality* in the final data-set.

In R this recoding is not strictly necessary, because `lm` works directly with factor variables, but we have applied the same codes there so that both data-sets coincide.

Note that we have obtained the same data-set that was obtained with R.

- The R data-set has been called *data_R*

- The Python data-set has been called *data_Python*


We will use both of them throughout this article. 

## Introduction to the Linear Regression Model <a class="anchor" id="4"></a>


The main purpose of this article is to give a theoretical and
also practical exposition of the linear regression model.

Without any doubt, this is the best-known statistical model.

There is the idea that the linear regression model is outdated compared
with other modern statistical models. But I would like to defend its
validity nowadays, first of all as a statistical tool, and second as a
necessary previous step to learning other more modern and complex methods.

The linear regression model is the basis of many modern regression
techniques, so it is highly recommended to study it thoroughly before going deeper into other statistical models.

The most important references on which this article is based are "Linear Models with
R" by Julian Faraway (second edition), "An Introduction to
Statistical Learning" by Gareth James et al. (second edition), the blog [cienciadedatos.net](https://www.cienciadedatos.net/index.html) by Joaquin Amat Rodrigo, and the web page [realpython](https://realpython.com/).





### Usefulness of the Linear Regression Model <a class="anchor" id="5"></a>



The main use of the linear regression model is to predict the
values of a **quantitative** variable, called the response, from the values of other variables (**quantitative or categorical**),
called predictors.

The model has other uses besides this one. We will
see them later.





### Formal Approach to the Linear Regression Model <a class="anchor" id="6"></a>

We have the following elements:

-   Response Variable:  a **quantitative** variable
      $Y=(y_{1} , y_2,...,y_n)^t$

-   Predictors: a set of **quantitative** or **categorical**
    variables:


\begin{gather*}
X_1 = (x_{11}, x_{21}, ..., x_{n1})^t \\
X_2 = (x_{12}, x_{22}, ..., x_{n2})^t \\
... \\
X_p = (x_{1p}, x_{2p}, ..., x_{np})^t
\end{gather*}


-   Predictors Matrix:

    
    \begin{gather*}
    X=(1, X_1, X_2,...,X_p) = 
    \begin{pmatrix}
    1 & x_{11}&x_{12}&...&x_{1p}\\
    1 & x_{21}&x_{22}&...&x_{2p}\\
    &...&\\
    1& x_{n1}&x_{n2}&...&x_{np}
    \end{pmatrix} = 
    \begin{pmatrix}
    x_{1}^t\\
    x_{2}^t\\
    ...\\
    x_{n}^t
    \end{pmatrix}
    \end{gather*}
    

-   Coefficients vector:


\begin{gather*}
\beta=(\beta_{0}, \beta_{1}, ..., \beta_{p})^t 
\end{gather*}

-   Errors vector:


\begin{gather*}
\varepsilon=(\varepsilon_{1}, \varepsilon_{2}, ..., \varepsilon_{n})^t 
\end{gather*}




### Basic Assumptions <a class="anchor" id="7"></a>


The basic assumptions of the model are the following:

<br>

- $ y_i \hspace{0.1cm} =  \hspace{0.1cm} x_i^t \cdot \beta  +  \varepsilon_i \hspace{0.1cm} =  \hspace{0.1cm}   \beta_0 + \sum_{j=1}^{p} \left( \beta_j \cdot x_{ij} \right) + \varepsilon_i \hspace{0.1cm} =  \hspace{0.1cm}  \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + ... + \beta_p \cdot x_{ip} + \varepsilon_i $
 
 <br>
 
-  $\varepsilon_i$ is a random variable such that:


   - $E[\varepsilon_i]=0$ 
   - $Var(\varepsilon_i)=\sigma^2$
   - $\varepsilon_i \sim N(0,\sigma)$ 
   - $cov(\varepsilon_i , \varepsilon_j)=0$ for $i \neq j$

<br>

- Additional assumptions:

  -  $n > p+1$ (number of observations \> number of coefficients to estimate)

  -  $Rg(X)=p+1$, i.e., the matrix $X$ has full column rank




### Assumptions Consequences <a class="anchor" id="8"></a>


   -  $y_i$ is a random variable because  $\varepsilon_i$ is a random variable

   -  $E[y_i]= x_i^t \cdot \beta$

   -  $Var(y_i) = \sigma^2$

   -  $y_i \sim N(x_i^t \cdot \beta , \sigma)$

   -  $cov(y_i , y_j)=0$ for $i \neq j$




### Matrix representation of the basic assumption of the model <a class="anchor" id="9"></a>

<br>

- $ Y=X\cdot \beta + \varepsilon $

- $\varepsilon_i \sim N(0,\sigma) \hspace{0.4cm} \forall \hspace{0.1cm} i=1,...,n $
  
- $cov(\varepsilon_i , \varepsilon_j)=0 \hspace{0.4cm} \forall \hspace{0.1cm} i\neq j =1,...,n $



## Estimation  <a class="anchor" id="10"></a>



###  Prediction of Response Variable <a class="anchor" id="11"></a>


The linear regression model predicts the response variable value $y_i$  for the combination of predictor values  $x_i = (1, x_{i1}, x_{i2}, ..., x_{ip})^t$  as:

\begin{gather*}
\hat{y}_i = x_i^t \cdot \hat{\beta}  = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j \cdot x_{ij} = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_{i1} + \hat{\beta}_2 \cdot x_{i2} + ... + \hat{\beta}_p \cdot x_{ip} 
\end{gather*}




### Estimation of model coefficients <a class="anchor" id="12"></a>


The estimation of $\beta$  in the classic linear regression model is done
using the ordinary least squares (OLS) method.

$\hat{\beta}$  is computed as the solution of the following optimization
problem:

\begin{gather*}
Min  \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2 = \sum_{i=1}^{n} (y_i - x_i^t \cdot \beta)^2 
\end{gather*}

<br>

The problem solution is:

\begin{gather*}
\hat{\beta}=(X^t \cdot X)^{-1} \cdot X^t \cdot Y
\end{gather*}

<br>

**Observation:**

We will not go into the mathematical details of the solution of
this optimization problem here. It is a classic convex optimization problem,
so it is enough to take the first derivatives of the objective function with
respect to the coefficients  $\beta_0,\beta_1,...,\beta_p$ ,  set them equal to zero, and solve the resulting system of equations with respect
to  $\beta$
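
As a minimal sketch of the closed-form solution above, the following Python snippet computes $\hat{\beta}=(X^t \cdot X)^{-1} \cdot X^t \cdot Y$ with NumPy on a small synthetic data-set and compares it with NumPy's own least-squares routine. The data, the seed and the names `predictors`, `beta_true` and `beta_lstsq` are illustrative assumptions; they are not part of the housing data-set.

```python
import numpy as np

# Synthetic data (illustrative assumption, not the housing data-set)
rng = np.random.default_rng(0)
n, p = 200, 2
predictors = rng.normal(size=(n, p))
X = np.column_stack((np.ones(n), predictors))   # predictors matrix with a column of ones
beta_true = np.array([1.0, 2.0, -3.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: (X^t X) beta = X^t Y
# np.linalg.solve avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

Solving the linear system instead of inverting $X^t \cdot X$ gives the same estimate and is usually numerically safer.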



### Estimation of model errors <a class="anchor" id="13"></a>


The model errors  $\varepsilon_i$  are estimated as:

$$
\hat{\varepsilon}_i \hspace{0.1 cm} = \hspace{0.1 cm} y_i - \hat{y}_i \hspace{0.1 cm} = \hspace{0.1 cm} y_i - x_i^t \cdot \hat{\beta}  
$$

for $\hspace{0.1cm}$ $i=1,...,n$


**Observation:**

$\hat{\varepsilon}_i$  is the error made by the model when it
estimates/predicts  $y_i$  as  $\hat{y}_i=x_i^t \cdot \hat{\beta}$
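
The estimated errors (residuals) can be sketched in Python as follows, again on synthetic data (the data and variable names are assumptions made only for illustration). The snippet also checks a known property of OLS with an intercept: the residuals are orthogonal to the columns of $X$, so in particular they sum to zero.

```python
import numpy as np

# Synthetic simple regression (illustrative assumption)
rng = np.random.default_rng(1)
n = 50
x1 = rng.uniform(0, 10, size=n)
X = np.column_stack((np.ones(n), x1))
Y = 3.0 + 1.5 * x1 + rng.normal(scale=2.0, size=n)

# OLS estimate, predictions and estimated errors
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
residuals = Y - Y_hat

print(residuals[:5])
# With an intercept in the model the residuals sum to (numerically) zero
print(np.isclose(residuals.sum(), 0.0, atol=1e-8))
```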




### Regression Hyperplane <a class="anchor" id="14"></a>

The regression hyperplane is the matrix expression of the predictions
that the model makes for the response variable values:


\begin{gather*}
\hat{Y} = X \cdot \hat{\beta}  
\end{gather*}

Where:    $\hat{Y}=(\hat{y}_1,\hat{y}_2,...,\hat{y}_n)^t$




### Hat-Matrix <a class="anchor" id="15"></a>



\begin{gather*}
\hat{Y} = X \cdot \hat{\beta} = X \cdot (X^t \cdot X)^{-1} \cdot X^t \cdot Y = H \cdot Y  
\end{gather*}



Where:   $H= X \cdot (X^t \cdot X)^{-1} \cdot X^t$  is called Hat-Matrix
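
A minimal NumPy sketch of the Hat-Matrix, on synthetic data chosen only for illustration, is shown here. It checks that $H \cdot Y$ reproduces $X \cdot \hat{\beta}$, and also verifies three standard properties that the text does not state explicitly: $H$ is symmetric, idempotent, and its trace equals $p+1$.

```python
import numpy as np

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(2)
n, p = 30, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p))))
Y = rng.normal(size=n)

# Hat-Matrix H = X (X^t X)^{-1} X^t
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Predictions through H and through the coefficient estimates
Y_hat = H @ Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(Y_hat, X @ beta_hat))             # same predictions
print(np.allclose(H, H.T), np.allclose(H @ H, H))   # symmetric and idempotent
print(round(np.trace(H), 6))                        # equals p + 1
```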


## Estimation of the Linear Regression Model in R




In this section we are going to show how to estimate a linear regression
model in R, using for this purpose the data-set that was shown at the beginning of the article.

The linear regression model that we propose is the following:

\begin{gather*}
price_i = \beta_0 +  \beta_1 \cdot size\_in\_m\_2_i + \beta_2 \cdot no\_of\_bedrooms_i +  \beta_3 \cdot no\_of\_bathrooms_i + \\ + \beta_4 \cdot quality_i + \beta_5\cdot  latitude_i +  \beta_6 \cdot longitude_i + \varepsilon_i
\end{gather*}



In [74]:
%%R

model_R_1 <- lm( price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude ,
data = data_R)

summary(model_R_1)


Call:
lm(formula = price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + 
    quality + latitude + longitude, data = data_R)

Residuals:
      Min        1Q    Median        3Q       Max 
-13398393   -562302     68143    562733  15384235 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -6.207e+07  2.995e+07  -2.073   0.0383 *  
size_in_m_2      3.566e+04  7.238e+02  49.271  < 2e-16 ***
no_of_bedrooms  -8.367e+05  8.282e+04 -10.102  < 2e-16 ***
no_of_bathrooms -5.712e+04  6.829e+04  -0.836   0.4030    
quality1         1.400e+05  8.358e+04   1.675   0.0940 .  
quality2         3.406e+05  1.551e+05   2.196   0.0282 *  
quality3         2.788e+05  1.976e+05   1.410   0.1586    
latitude         6.115e+06  7.809e+05   7.830 8.03e-15 ***
longitude       -1.677e+06  6.908e+05  -2.428   0.0153 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1605000 on 1896 degrees of freedom
Multiple R-squared:  0.6

## Estimation of Linear Regression Model in Python with `statsmodels`



We can implement a linear regression model in Python with the following code:

In [75]:
import statsmodels.formula.api as smf
import statsmodels.api as sm


In [76]:
model_Python_1 = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', 
                 data =data_Python)

model_Python_1 = model_Python_1.fit()
 
print(model_Python_1.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     547.4
Date:               do., 10 jul. 2022   Prob (F-statistic):               0.00
Time:                        14:03:24   Log-Likelihood:                -29918.
No. Observations:                1905   AIC:                         5.985e+04
Df Residuals:                    1896   BIC:                         5.990e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept       -6.207e+07   2.99e+07     

The previous outputs give us the estimates of the model coefficients
(betas). Both outputs give similar results (we will take the Python output as reference):

<br>

\begin{gather*}
\hat{\beta}_0 =  -6.207e+07  \\
\hat{\beta}_{quality1} =1.4e+05\\
\hat{\beta}_{quality2} = 3.406e+05 \\ 
\hat{\beta}_{quality3} =2.788e+05 \\
\hat{\beta}_{size\_in\_m\_2} =3.566e+04 \\
\hat{\beta}_{no\_of\_bedrooms} = -8.367e+05 \\
\hat{\beta}_{no\_of\_bathrooms} = -5.712e+04 \\
\hat{\beta}_{latitude}=6.115e+06 \\
\hat{\beta}_{longitude}= -1.677e+06
\end{gather*}

<br>

So, the estimated model is:


\begin{gather*}
\widehat{price}_i =  -6.207e+07 +  3.566e+04 \cdot size\_in\_m\_2_i -8.367e+05 \cdot no\_of\_bedrooms_i -5.712e+04 \cdot no\_of\_bathrooms_i + \\ + 1.4e+05 \cdot quality1_i + 3.406e+05\cdot quality2_i + 2.788e+05  \cdot quality3_i  +6.115e+06\cdot  latitude_i -1.677e+06   \cdot longitude_i 
\end{gather*}

<br>

**Observation:**

The categorical variable *quality*, which has 4 categories (Low (0), Medium (1),
High (2), Ultra (3)), enters the model as 3 variables (quality1,
quality2, quality3). The category that is left out of the model is Low (0), because it is the first category; it acts as the reference category.

This isn't a particularity of this variable, but rather a property of categorical variables in regression models.

Later it will be seen how this affects the interpretation of the model coefficients.  
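
To make the role of quality1, quality2 and quality3 concrete, here is a small sketch that builds the three indicator variables by hand with `pandas.get_dummies`. The six example values are invented for illustration, and this only mimics the reference-category coding described above; it is not necessarily how `lm` or `statsmodels` construct the variables internally.

```python
import pandas as pd

# Six invented quality codes (0 = Low, 1 = Medium, 2 = High, 3 = Ultra)
quality = pd.Series([1, 1, 0, 3, 2, 1], dtype="category")

# One indicator column per category, dropping the first one (Low = 0),
# which becomes the reference category
dummies = pd.get_dummies(quality, prefix="quality", prefix_sep="",
                         drop_first=True, dtype=int)

print(dummies)
```

A house with quality Low has zeros in the three columns, so its prediction uses only the intercept and the other predictors.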


## Estimation of Linear Regression Model in Python with `scikit-learn`


In [77]:
# pip install scikit-learn

import numpy as np   # needed below for np.hstack and np.vstack
import sklearn

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split



We can use a training data-set to train the model in Python with the  `scikit-learn` module. 

These concepts will be seen in much more detail in a specific article about validation techniques.



In [78]:
X = data_Python[['size_in_m_2', 'no_of_bedrooms', 'no_of_bathrooms' , 'quality' , 'latitude' , 'longitude' ]]
y = data_Python[['price']]

In [79]:
X_train, X_test, y_train, y_test = train_test_split( 
                                           X , 
                                           y.values.reshape(-1,1) ,
                                           train_size = 0.8,
                                           random_state = 1234, 
                                           shuffle = True )

In [80]:
X.head()

Unnamed: 0,size_in_m_2,no_of_bedrooms,no_of_bathrooms,quality,latitude,longitude
0,100.242337,1,2,1,25.113208,55.138932
1,146.972546,2,2,1,25.106809,55.151201
2,181.253753,3,5,1,25.063302,55.137728
3,187.66406,2,3,0,25.227295,55.341761
4,47.101821,0,1,1,25.114275,55.139764


In [81]:
y.head()

Unnamed: 0,price
0,2700000
1,2850000
2,1150000
3,2850000
4,1729200


In [82]:
X_train, X_test, y_train, y_test = train_test_split( 
                                           X , 
                                           y.values.reshape(-1,1) ,
                                           train_size = 0.8,
                                           random_state = 1234, 
                                           shuffle = True )


`y.values.reshape(-1,1)` transforms `y` into a column array:


In [83]:
y.values.reshape(-1,1)

array([[2700000],
       [2850000],
       [1150000],
       ...,
       [2900000],
       [ 675000],
       [ 760887]], dtype=int64)

train_size = 0.8 --> the size of the training data-set is 80% of the original data-set

random_state = 1234 --> a seed to replicate the random process that selects the observations that will be considered as training data 

shuffle --> whether or not to shuffle the data at random before splitting it into training set and test set
      

In [84]:
data_train = pd.DataFrame( 

              np.hstack((X_train, y_train)) , 
            
              columns=['size_in_m_2', 'no_of_bedrooms', 'no_of_bathrooms' , 
                       'quality' , 'latitude' , 'longitude', 'price'] 
                       
                       )

In [85]:
X_train.head()

Unnamed: 0,size_in_m_2,no_of_bedrooms,no_of_bathrooms,quality,latitude,longitude
92,97.641053,1,1,1,25.132445,55.152216
95,242.383927,4,5,0,25.086726,55.145205
1838,38.554745,0,1,2,25.07913,55.154713
411,165.181534,2,3,0,25.197316,55.274196
192,278.709,4,5,1,25.076319,55.133627


In [86]:
y_train

array([[2100000],
       [5500000],
       [ 400888],
       ...,
       [1800000],
       [ 999999],
       [ 770000]], dtype=int64)

In [87]:
np.hstack((X_train, y_train))

array([[9.76410530e+01, 1.00000000e+00, 1.00000000e+00, ...,
        2.51324450e+01, 5.51522160e+01, 2.10000000e+06],
       [2.42383927e+02, 4.00000000e+00, 5.00000000e+00, ...,
        2.50867260e+01, 5.51452050e+01, 5.50000000e+06],
       [3.85547450e+01, 0.00000000e+00, 1.00000000e+00, ...,
        2.50791300e+01, 5.51547130e+01, 4.00888000e+05],
       ...,
       [1.82647298e+02, 2.00000000e+00, 3.00000000e+00, ...,
        2.50805800e+01, 5.51391470e+01, 1.80000000e+06],
       [7.78527140e+01, 1.00000000e+00, 2.00000000e+00, ...,
        2.51914040e+01, 5.52738960e+01, 9.99999000e+05],
       [7.49727210e+01, 1.00000000e+00, 2.00000000e+00, ...,
        2.50850470e+01, 5.51424150e+01, 7.70000000e+05]])

Let's see how hstack (and vstack) work:


In [88]:
a = np.array((1,2,3))
b = np.array((4,5,6))

In [89]:
np.hstack((a,b)) # for 1-D arrays this concatenates them, like c(a, b) in R; for 2-D arrays it binds columns, like cbind

array([1, 2, 3, 4, 5, 6])

In [90]:
np.vstack((a,b)) # similar to rbind in R

array([[1, 2, 3],
       [4, 5, 6]])
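
The following sketch contrasts the behaviour of `np.hstack` on 1-D and 2-D inputs, using the same two small arrays. With 1-D arrays `np.hstack` returns one longer vector, whereas binding them as columns (what `cbind` does in R) requires either `np.column_stack` or reshaping to column arrays first, as is done with `y.values.reshape(-1,1)` earlier. The names `flat`, `cols` and `cols_2d` are illustrative.

```python
import numpy as np

a = np.array((1, 2, 3))
b = np.array((4, 5, 6))

# 1-D inputs: plain concatenation, shape (6,)
flat = np.hstack((a, b))

# 1-D inputs placed side by side as columns, shape (3, 2)
cols = np.column_stack((a, b))

# Same result with hstack after reshaping each vector to a column array
cols_2d = np.hstack((a.reshape(-1, 1), b.reshape(-1, 1)))

print(flat.shape, cols.shape, cols_2d.shape)
print(np.array_equal(cols, cols_2d))
```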

In [91]:

model_train_Python_1 = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', 
                 data =data_train)

model_train_Python_1 = model_train_Python_1.fit()
 
print(model_train_Python_1.summary())

## Precision of model coefficient estimates


The precision of the model coefficient estimates is given by the
variance of the coefficient estimators, that is, by $\hspace{0.05cm}$  $Var(\hat{\beta_j})$

It holds that $\hspace{0.05cm}$ 
$\hat{\beta_j} \sim N(\beta_j , \sqrt{ \sigma^2 \cdot q_{jj} } )$ $\hspace{0.05cm}$ $\Rightarrow$ $\hspace{0.05cm}$ $Var(\hat{\beta_j})=\sigma^2 \cdot q_{jj}$

Therefore, the estimation of the variance of $\hspace{0.05cm}$  $\hat{\beta_j}$  $\hspace{0.05cm}$  is $\hspace{0.05cm}$  $\widehat{Var}(\hat{\beta_j})= \hat{\sigma}^2 \cdot q_{jj}$ 
 

Where:

$\hat{\sigma}^2$  is the estimation of the variance of the error  $\varepsilon_i$ $\hspace{0.05cm}$ , i.e, $\hspace{0.05cm}$ $\hat{\sigma}^2 = \widehat{Var}(\varepsilon_i)$


$q_{jj}$ $\hspace{0.05cm}$  is the element  $j+1$  of the principal diagonal of the matrix $\hspace{0.05cm}$ $(X^t \cdot X)^{-1}$  $\hspace{0.05cm}$ , for  $j=0,1,...,p$
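
The formula $\widehat{Var}(\hat{\beta_j})= \hat{\sigma}^2 \cdot q_{jj}$ can be sketched with NumPy as follows. The data are synthetic, and the snippet assumes the usual unbiased estimator $\hat{\sigma}^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 / (n-p-1)$, which the text above does not spell out.

```python
import numpy as np

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack((np.ones(n), rng.normal(size=(n, p))))
Y = X @ np.array([5.0, 1.0, -2.0]) + rng.normal(scale=1.5, size=n)

# OLS fit and residuals
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
residuals = Y - X @ beta_hat

# Assumed estimator of the error variance: RSS / (n - p - 1)
sigma2_hat = residuals @ residuals / (n - p - 1)

# q_jj: main diagonal of (X^t X)^{-1}, for j = 0, 1, ..., p
q = np.diag(XtX_inv)

# Estimated standard deviation (standard error) of each coefficient estimator
std_errors = np.sqrt(sigma2_hat * q)
print(std_errors)
```

These are the quantities that regression summaries report in their standard error column.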

<br>

- ***Why is the variance of the coefficient estimators important?***


The standard deviation of a coefficient estimator indicates how much
the estimates of the coefficient deviate, on average, when they
are recalculated using many different samples.

Suppose many samples are obtained, and with each of them a linear
regression model is trained. Then we obtain many estimates of the model
coefficients, one with each sample.

Then  $\sqrt{\widehat{Var}(\hat{\beta_j})}$  indicates how much
$\hat{\beta_j}$ varies, on average, from one sample to another.

If the standard deviation is high, large differences will be obtained
when $\beta_j$ is estimated with $\hat{\beta_j}$,
depending on the sample that is used to estimate it. This means that the
estimator $\hat{\beta_j}$ is imprecise, because the values of
$\hat{\beta_j}$ are more dispersed around their mean.

On the contrary, if the standard deviation is low, small differences
will be obtained when $\beta_j$ is estimated with
$\hat{\beta_j}$, depending on the sample that is used to estimate it.
This means that the estimator $\hat{\beta_j}$ is precise, because the values
of $\hat{\beta_j}$ are less dispersed around their mean.
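
The thought experiment of drawing many samples can be simulated directly. The sketch below is an illustration under assumed values (true coefficients, error standard deviation, a fixed grid of predictor values and 2000 replications, none of which come from the housing data): it refits the model on each simulated sample and compares the empirical standard deviation of the estimates with the theoretical value $\sqrt{\sigma^2 \cdot q_{jj}}$.

```python
import numpy as np

# Assumed simulation settings (illustrative only)
rng = np.random.default_rng(4)
n, n_samples = 100, 2000
beta_true = np.array([1.0, 2.0])
sigma = 1.0

# Fixed predictor values, so only the errors change between samples
x = np.linspace(0, 1, n)
X = np.column_stack((np.ones(n), x))
XtX_inv = np.linalg.inv(X.T @ X)

# One OLS estimate per simulated sample
estimates = np.empty((n_samples, 2))
for s in range(n_samples):
    Y = X @ beta_true + rng.normal(scale=sigma, size=n)
    estimates[s] = XtX_inv @ X.T @ Y

empirical_sd = estimates.std(axis=0, ddof=1)
theoretical_sd = np.sqrt(sigma ** 2 * np.diag(XtX_inv))

print(empirical_sd)
print(theoretical_sd)
```

The two vectors should be close, which is what it means for $Var(\hat{\beta_j})$ to measure the sample-to-sample variation of the estimates.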



### Estimating the standard deviation of the coefficient estimators in R


In [92]:
%%R 

summary(model_R_1)


Call:
lm(formula = price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + 
    quality + latitude + longitude, data = data_R)

Residuals:
      Min        1Q    Median        3Q       Max 
-13398393   -562302     68143    562733  15384235 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -6.207e+07  2.995e+07  -2.073   0.0383 *  
size_in_m_2      3.566e+04  7.238e+02  49.271  < 2e-16 ***
no_of_bedrooms  -8.367e+05  8.282e+04 -10.102  < 2e-16 ***
no_of_bathrooms -5.712e+04  6.829e+04  -0.836   0.4030    
quality1         1.400e+05  8.358e+04   1.675   0.0940 .  
quality2         3.406e+05  1.551e+05   2.196   0.0282 *  
quality3         2.788e+05  1.976e+05   1.410   0.1586    
latitude         6.115e+06  7.809e+05   7.830 8.03e-15 ***
longitude       -1.677e+06  6.908e+05  -2.428   0.0153 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1605000 on 1896 degrees of freedom
Multiple R-squared:  0.6

### Estimating the standard deviation of the coefficient estimators in Python


In [None]:
print(model_Python_1.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     547.4
Date:               do., 10 jul. 2022   Prob (F-statistic):               0.00
Time:                        13:53:51   Log-Likelihood:                -29918.
No. Observations:                1905   AIC:                         5.985e+04
Df Residuals:                    1896   BIC:                         5.990e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept       -6.207e+07   2.99e+07     


These outputs give us a lot of information about the model. Some of this
information has already been seen (the coefficient estimates), and other information
will be seen later.

Now we will focus on the part of the output that contains the estimated standard errors of the coefficient estimators (`std err` in Python, `Std. Error` in R).

$
\sqrt{\widehat{Var}(\hat{\beta_0})}=2.995e+07 \\
 \sqrt{\widehat{Var}(\hat{\beta}_{quality1})}=8.358e+04 \\
\sqrt{\widehat{Var}(\hat{\beta}_{quality2})}=1.551e+05\\ 
\sqrt{\widehat{Var}(\hat{\beta}_{quality3})}= 1.976e+05 \\ 
\sqrt{\widehat{Var}(\hat{\beta}_{size\_in\_m\_2})}= 7.238e+02 \\
\sqrt{\widehat{Var}(\hat{\beta}_{no\_of\_bedrooms})}=8.282e+04 \\ 
\sqrt{\widehat{Var}(\hat{\beta}_{no\_of\_bathrooms})}=6.829e+04 \\  
\sqrt{\widehat{Var}(\hat{\beta}_{latitude})}=7.809e+05\\ 
\sqrt{\widehat{Var}(\hat{\beta}_{longitude})}=6.908e+05
$

The estimated standard deviations of the coefficient estimators are, in
general, very high. For example, the standard error of the
`no_of_bathrooms` coefficient (6.829e+04) is larger than the absolute
value of its estimate (5.712e+04). This implies that if we trained the
model with other samples, we would obtain coefficient estimates quite
different from the ones obtained with our initial sample.

This is a serious problem: the fitted linear regression model, and
therefore its results, would change considerably from one sample to
another.
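
As a rough illustration of what these standard errors measure, the following sketch uses synthetic data (not the housing data; the sample size, the true coefficients and the helper `simulate_and_fit` are assumptions made for the example). It compares the formula-based standard errors, $\sqrt{S_R^2 \cdot diag((X^t X)^{-1})}$, computed from a single sample, with the actual spread of the coefficient estimates when the model is refitted on many new samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 2.0
beta_true = np.array([1.0, 3.0, -2.0])  # hypothetical intercept and two slopes

def simulate_and_fit(rng):
    """Draw one synthetic sample; return (beta_hat, formula-based std. errors)."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    # Residual variance with n - p - 1 degrees of freedom (X has p + 1 columns)
    s2 = residuals @ residuals / (n - X.shape[1])
    std_err = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta_hat, std_err

# Standard errors estimated from one single sample
_, std_err_one_sample = simulate_and_fit(rng)

# Spread of the estimates over many independent samples
estimates = np.array([simulate_and_fit(rng)[0] for _ in range(2000)])
empirical_sd = estimates.std(axis=0, ddof=1)

print(std_err_one_sample)
print(empirical_sd)
```

The two printed vectors should be of similar magnitude: the standard error reported for one sample is an estimate of how much each coefficient would vary across samples.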



## Estimation of model errors in R




With the function `predict` we can get the predictions made by the model
for the response variable in R:



In [93]:
%%R

predict(model_R_1)[1:30]

         1          2          3          4          5          6          7 
 1781425.9  2551624.8  2522740.0  4222873.3   785153.0  1574504.1  4350316.8 
         8          9         10         11         12         13         14 
 1574504.1  3521401.8  2610791.1  2934947.8   446056.9  -147809.7  1908923.1 
        15         16         17         18         19         20         21 
 1841127.9  2032066.6  2874430.5  2078235.2  1527279.8  1369157.0   394709.0 
        22         23         24         25         26         27         28 
 1241331.5  4403015.1  9109695.4 13905182.9  4487692.4  3257719.2 12177081.6 
        29         30 
 2544105.1  1411866.2 



We estimate the model errors as $\hspace{0.1cm}$ $\hat{\varepsilon}_i= y_i - \hat{y}_i$


In [94]:
%%R 

(estimated_errors <- data_R$price - predict(model_R_1))[1:30]

          1           2           3           4           5           6 
  918574.13   298375.22 -1372740.03 -1372873.34   944047.04  1545395.91 
          7           8           9          10          11          12 
 4153283.22  1545395.91 -1421401.76    79208.94   615052.21  1648942.06 
         13          14          15          16          17          18 
 1197808.65   -59923.08   248871.10   317923.42   624569.55   621764.80 
         19          20          21          22          23          24 
  -37279.84  -419157.01   705290.95   148668.47  -403115.06 -5760695.43 
         25          26          27          28          29          30 
-5435182.86 -1988692.41  -457719.22 -3677081.56  -669105.11  -136866.23 



We collect in a data frame the observed values of the response variable,
the model predictions of the response, and the estimated model errors:


In [98]:
%%R

(df_predictions <- tibble(price=data_R$price , 
                         price_predictions=predict(model_R_1), 
                         estimated_errors))[1:15, ]

# A tibble: 15 x 3
     price price_predictions estimated_errors
     <dbl>             <dbl>            <dbl>
 1 2700000          1781426.          918574.
 2 2850000          2551625.          298375.
 3 1150000          2522740.        -1372740.
 4 2850000          4222873.        -1372873.
 5 1729200           785153.          944047.
 6 3119900          1574504.         1545396.
 7 8503600          4350317.         4153283.
 8 3119900          1574504.         1545396.
 9 2100000          3521402.        -1421402.
10 2690000          2610791.           79209.
11 3550000          2934948.          615052.
12 2094999           446057.         1648942.
13 1049999          -147810.         1197809.
14 1849000          1908923.          -59923.
15 2089999          1841128.          248871.


## Estimation of model errors in Python


In [121]:
predictions = pd.DataFrame( {'predictions': model_Python_1.predict(X)} )
predictions

| | predictions |
|---|---|
| 0 | 1.781426e+06 |
| 1 | 2.551625e+06 |
| 2 | 2.522740e+06 |
| 3 | 4.222873e+06 |
| 4 | 7.851530e+05 |
| ... | ... |
| 1900 | 1.211313e+06 |
| 1901 | 8.171580e+05 |
| 1902 | 2.981084e+06 |
| 1903 | 2.651215e+05 |


In [111]:
data_Python

| | price | size_in_m_2 | longitude | latitude | no_of_bedrooms | no_of_bathrooms | quality |
|---|---|---|---|---|---|---|---|
| 0 | 2700000 | 100.242337 | 55.138932 | 25.113208 | 1 | 2 | 1 |
| 1 | 2850000 | 146.972546 | 55.151201 | 25.106809 | 2 | 2 | 1 |
| 2 | 1150000 | 181.253753 | 55.137728 | 25.063302 | 3 | 5 | 1 |
| 3 | 2850000 | 187.664060 | 55.341761 | 25.227295 | 2 | 3 | 0 |
| 4 | 1729200 | 47.101821 | 55.139764 | 25.114275 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1900 | 1500000 | 100.985561 | 55.310712 | 25.176892 | 2 | 2 | 3 |
| 1901 | 1230000 | 70.606280 | 55.276684 | 25.166145 | 1 | 2 | 1 |
| 1902 | 2900000 | 179.302790 | 55.345056 | 25.206500 | 3 | 5 | 1 |
| 1903 | 675000 | 68.748220 | 55.229844 | 25.073858 | 1 | 2 | 1 |


In [141]:
price = pd.DataFrame( {'price': data_Python['price'] } )
price

| | price |
|---|---|
| 0 | 2700000 |
| 1 | 2850000 |
| 2 | 1150000 |
| 3 | 2850000 |
| 4 | 1729200 |
| ... | ... |
| 1900 | 1500000 |
| 1901 | 1230000 |
| 1902 | 2900000 |
| 1903 | 675000 |


In [129]:
price - predictions

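| | predictions | price |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | NaN | NaN |
| 4 | NaN | NaN |
| ... | ... | ... |
| 1900 | NaN | NaN |
| 1901 | NaN | NaN |
| 1902 | NaN | NaN |
| 1903 | NaN | NaN |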
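
The result above contains only missing values (NaN) because pandas aligns two data frames on both row and column labels before subtracting: `price` only has a column named `price` and `predictions` only has a column named `predictions`, so no pair of cells matches. A minimal sketch with a few made-up rows (the two small data frames below are illustrative stand-ins, not the full data) shows this behaviour and one way to get the differences directly, by subtracting the columns as Series:

```python
import pandas as pd

# Toy stand-ins for the `price` and `predictions` data frames
price = pd.DataFrame({'price': [2700000, 2850000, 1150000]})
predictions = pd.DataFrame({'predictions': [1781426.0, 2551625.0, 2522740.0]})

# DataFrame - DataFrame aligns on column names: no common column, so all NaN
misaligned = price - predictions

# Series - Series aligns only on the row index, giving the estimated errors
errors = price['price'] - predictions['predictions']

print(misaligned)
print(errors)
```

The cells that follow reach the same differences by first concatenating both columns into one data frame.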


In [143]:
df_predictions_Python = pd.concat([price, predictions], axis=1)
df_predictions_Python

| | price | predictions |
|---|---|---|
| 0 | 2700000 | 1.781426e+06 |
| 1 | 2850000 | 2.551625e+06 |
| 2 | 1150000 | 2.522740e+06 |
| 3 | 2850000 | 4.222873e+06 |
| 4 | 1729200 | 7.851530e+05 |
| ... | ... | ... |
| 1900 | 1500000 | 1.211313e+06 |
| 1901 | 1230000 | 8.171580e+05 |
| 1902 | 2900000 | 2.981084e+06 |
| 1903 | 675000 | 2.651215e+05 |


In [147]:
from dfply import *

In [151]:
df_predictions_Python = df_predictions_Python >> mutate(train_errors = X.price - X.predictions) 

df_predictions_Python

| | price | predictions | train_errors |
|---|---|---|---|
| 0 | 2700000 | 1.781426e+06 | 9.185741e+05 |
| 1 | 2850000 | 2.551625e+06 | 2.983752e+05 |
| 2 | 1150000 | 2.522740e+06 | -1.372740e+06 |
| 3 | 2850000 | 4.222873e+06 | -1.372873e+06 |
| 4 | 1729200 | 7.851530e+05 | 9.440470e+05 |
| ... | ... | ... | ... |
| 1900 | 1500000 | 1.211313e+06 | 2.886873e+05 |
| 1901 | 1230000 | 8.171580e+05 | 4.128420e+05 |
| 1902 | 2900000 | 2.981084e+06 | -8.108440e+04 |
| 1903 | 675000 | 2.651215e+05 | 4.098785e+05 |


## Estimation of the error variance


The estimator of the error variance $Var(\varepsilon_i)=\sigma^2$ is
called the residual variance, and is defined as:

$$
S_R^2 = \dfrac{1}{n-p-1} \cdot \sum_{i=1}^{n} \hat{\varepsilon}_i^2 =  \dfrac{1}{n-p-1} \cdot (Y^t \cdot Y - \hat{\beta}^t \cdot X^t \cdot Y)
$$

Under the basic assumptions of the model, with normally distributed errors, the following holds:

$$
\dfrac{n-p-1}{\sigma^2} \cdot S_R^2 \sim \chi_{n-p-1}^2 \\
E[S_R^2]=\sigma^2 \\
Var(S_R^2)=\dfrac{2 \cdot \sigma^4}{n-p-1}
$$
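
As a sanity check on the two expressions for $S_R^2$, the sketch below fits a regression on synthetic data (the sample size, coefficients and variable names are assumptions for illustration only) and computes the residual variance both ways. Here `p` counts every column of the design matrix except the intercept, so a categorical predictor coded with three dummy variables would contribute 3 to `p`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
Y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # least-squares estimate
residuals = Y - X @ beta_hat

# Residual variance from the sum of squared residuals
s2_from_residuals = np.sum(residuals**2) / (n - p - 1)

# Residual variance from the matrix expression Y'Y - beta_hat' X' Y
s2_from_matrices = (Y @ Y - beta_hat @ X.T @ Y) / (n - p - 1)

print(s2_from_residuals, s2_from_matrices)
```

Both values agree up to floating-point error, and they should be close to the error variance used to generate the data (1 in this sketch).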



### Estimation of the error variance in R


In [161]:
%%R
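n <- length(estimated_errors)
# p = 6 counts the original predictors (quality counted once). Note that
# quality enters the model as 3 dummy variables, so the fitted model has
# 8 slopes and summary() uses n - 8 - 1 = 1896 degrees of freedom.
p <- 6
( estimated_variance_error <- sum(estimated_errors^2)/(n-p-1) )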

[1] 2.572224e+12


In [169]:
%%R

( estimated_standard_deviation_error <- sqrt(estimated_variance_error) )

[1] 1603816


### Estimation of the error variance in Python


In [159]:
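n = len(data_Python)
# p = 6 counts the original predictors (quality counted once); the summary
# reports Df Model: 8 and Df Residuals: 1896 because quality uses 3 dummies.
p = 6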

In [156]:
df_predictions_Python['train_errors']

0       9.185741e+05
1       2.983752e+05
2      -1.372740e+06
3      -1.372873e+06
4       9.440470e+05
            ...     
1900    2.886873e+05
1901    4.128420e+05
1902   -8.108440e+04
1903    4.098785e+05
1904   -6.187149e+04
Name: train_errors, Length: 1905, dtype: float64

In [166]:
estimated_variance_train_error = sum(df_predictions_Python['train_errors']**2)/(n-p-1) 
estimated_variance_train_error

2572224474734.9785

In [168]:
import math

estimated_standard_deviation_train_error = math.sqrt(estimated_variance_train_error)
estimated_standard_deviation_train_error

1603815.5987316554

# Model Training Validation


We can compute a metric to measure how far the predictions are from the
observations of the response variable.

One of the most common metrics is the mean absolute deviation (MAD), also known as the mean absolute error:

\begin{gather*}
MAD=  \dfrac{1}{n} \sum_{i=1}^{n} \mid y_i - \hat{y}_i \mid = \dfrac{1}{n} \sum_{i=1}^{n} \mid \hat{\varepsilon}_i \mid
\end{gather*}



### Model Training Validation in R


In [170]:
%%R

( MAD <- mean(abs((estimated_errors))) ) 

[1] 938065.2


### Model Training Validation in Python

In [176]:
MAD_Py = (df_predictions_Python['train_errors'].abs()).mean()
MAD_Py

938065.2280944842


On average, the predictions that the model makes for the response
variable deviate from the observations, in absolute value, by 938065 units.

This is an estimate of the model error, specifically of the training
error, because the predictions were computed with the same observations
used to train the model.

There is a more interesting model error, called the test error, which is
computed with observations of the predictors that haven't been used to
train the model.
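
The difference between both errors can be sketched with a simple hold-out split. The example below uses synthetic data and an arbitrary 80/20 split (both are assumptions for illustration, not the procedure followed in this article): the model is fitted on the training part only, and the MAD is computed separately on each part.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=1.5, size=n)

# Random 80/20 split of the observations
idx = rng.permutation(n)
train, test = idx[:400], idx[400:]

# Fit using the training observations only
beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def mad(y_obs, y_pred):
    """Mean absolute deviation between observations and predictions."""
    return np.mean(np.abs(y_obs - y_pred))

train_mad = mad(y[train], X[train] @ beta_hat)   # training error
test_mad = mad(y[test], X[test] @ beta_hat)      # test error
print(train_mad, test_mad)
```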

In this article we will not go deeper into that, but these concepts will
be developed further in another article about validation techniques.




## Model Coefficients Interpretation


### Null Coefficient


We have the following estimated linear regression model:

  $$\hat{y}_i= \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} + ...+ \hat{\beta}_p\cdot x_{ip}$$

- $\hat{\beta}_0$ $\hspace{0.05cm}$ is the value the model estimates for the response variable, i.e. $\hspace{0.05cm}$ $\hat{y}_i$, $\hspace{0.05cm}$ when $\hspace{0.05cm}$ $x_{ij}=0$, $\hspace{0.05cm}$ $\forall j=1,2,...,p$




###  Quantitative Predictor Coefficient

Let $X_k$ be a **quantitative** variable, and let $\hspace{0.05cm}$ $h>0$.

We have the following estimated linear regression model:

$$\hat{y}_i= \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} + .. + \hat{\beta}_k\cdot x_{ik} + ..+ \hat{\beta}_p\cdot x_{ip}$$


-   If $\hat{\beta}_k > 0$  , then

    -   If  $x_{ik}$  **increases** by $h$ units $\hspace{0.05cm}$  $\Rightarrow$  $\hspace{0.05cm}$ $\hat{y}_i$
        **increases** by $\hspace{0.05cm}$ $\hat{\beta}_k \cdot h$  $\hspace{0.05cm}$ units, holding the other predictors fixed.
        
        And the opposite if it decreases.
        
   
-   If  $\hat{\beta}_k < 0$  , then

    -  If $x_{ik}$ **increases** by $h$ units $\hspace{0.05cm}$ $\Rightarrow$ $\hspace{0.05cm}$  $\hat{y}_i$
        **decreases** by $\hspace{0.05cm}$ $|\hat{\beta}_k| \cdot h$ $\hspace{0.05cm}$ units, holding the other predictors fixed. 

        And the opposite if it decreases.


-   If  $\hat{\beta}_k = 0$  , then

    -   $\hat{y}_i$ $\hspace{0.05cm}$ doesn't depend on $\hspace{0.05cm}$  $x_{ik}$


**Observation:**


The statements above follow from this identity, where the remaining predictors are held fixed:

- $(\hat{y}_i \hspace{0.05cm} | \hspace{0.05cm} x_{ik}=c+h ) - (\hat{y}_i  \hspace{0.05cm} | \hspace{0.05cm}  x_{ik}=c ) =  \hat{\beta}_k\cdot h$
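
This identity can be checked numerically. In the sketch below, the coefficients and predictor values are made-up numbers chosen for illustration, and `predict` is a hypothetical helper, not a function from any library.

```python
# Hypothetical estimated coefficients: intercept, beta_1, beta_2
beta_hat = [10.0, 2.5, -4.0]

def predict(x):
    """Estimated response for predictor values x = (x_1, x_2)."""
    return beta_hat[0] + sum(b * v for b, v in zip(beta_hat[1:], x))

c, h = 3.0, 2.0   # starting value of x_2 and size of the increase
x1 = 1.0          # the other predictor is held fixed

# Change in the estimated response when x_2 goes from c to c + h
change = predict((x1, c + h)) - predict((x1, c))
print(change)     # equals beta_2 * h = -8.0
```

Since $\hat{\beta}_2 < 0$ in this sketch, increasing $x_2$ by $h=2$ units lowers the estimated response by $|\hat{\beta}_2| \cdot h = 8$ units.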


## Categorical Predictor Coefficient 


### Categorical Predictors with 2 categories

Let $X_k$ be a categorical variable with 2 categories 
$\lbrace A_0 , A_1 \rbrace$.

If the reference category is  $A_0$, then  $X_k$ enters the model as the binary (0,1) variable $X_{k, A_1}$ defined as:

$$
x_{i k, A_1}=1  \hspace{0.05cm} \Leftrightarrow \hspace{0.05cm}  x_{i k}=A_1 \\
x_{i k, A_1}=0  \hspace{0.05cm} \Leftrightarrow \hspace{0.05cm}  x_{i k}=A_0
$$

<br>

We have the following estimated linear regression model:

$$\hat{y}_i= \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} + .. + \hat{\beta}_{k, A_1} \cdot x_{ik, A_1} + ..+ \hat{\beta}_p\cdot x_{ip}$$

<br>

-   If $\hat{\beta}_{k, A_1} > 0$ , then

    -   $\hat{y}_i$ $\hspace{0.05cm}$ is $\hspace{0.05cm}$ $\hat{\beta}_{k,A_1}$ $\hspace{0.05cm}$ units **greater** if  $\hspace{0.05cm}$ $x_{ik}=A_1$ $\hspace{0.05cm}$ than if $\hspace{0.05cm}$ $x_{ik}= A_0$

<br>

-   If $\hat{\beta}_{k, A_1} < 0$ , then

    -   $\hat{y}_i$ $\hspace{0.05cm}$ is $\hspace{0.05cm}$ $|\hat{\beta}_{k,A_1}|$ $\hspace{0.05cm}$ units **less** if $\hspace{0.05cm}$  $x_{ik}= A_1$ $\hspace{0.05cm}$ than if $\hspace{0.05cm}$ $x_{ik}=A_0$




### Categorical Predictors with 3 categories:

Let $X_k$ be a categorical variable with 3 categories
$\lbrace A_0 , A_1, A_2 \rbrace$.

If the reference category is $A_0$, then $X_k$ enters the model as
two binary (0,1) variables $X_{k, A_1}$ and $X_{k, A_2}$ defined as:

\begin{gather*}
x_{i k, A_1}=1   \Leftrightarrow   x_{i k}=A_1 \\
x_{i k, A_2}=1  \Leftrightarrow     x_{i k}=A_2 \\
x_{i k, A_1}=x_{i k, A_2}=0  \Leftrightarrow     x_{i k}=A_0
\end{gather*}

<br>

We have the following estimated linear regression model:

$$\hat{y}_i= \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} + .. + \hat{\beta}_{k, A_1} \cdot x_{ik, A_1} + \hat{\beta}_{k, A_2} \cdot x_{ik, A_2} + ..+ \hat{\beta}_p\cdot x_{ip}$$

<br>

-   If  $\hat{\beta}_{k, A_1} > 0$  , then

    -  $\hat{y}_i$ $\hspace{0.05cm}$  is $\hspace{0.05cm}$ $\hat{\beta}_{k,A_1}$ $\hspace{0.05cm}$ units **greater** if $\hspace{0.05cm}$ $x_{ik}= A_1$ $\hspace{0.05cm}$  than if  $\hspace{0.05cm}$ $x_{ik}= A_0$

<br>

-   If  $\hat{\beta}_{k, A_1} < 0$  , then

    -   $\hat{y}_i$ $\hspace{0.05cm}$ is $\hspace{0.05cm}$ $|\hat{\beta}_{k,A_1}|$ $\hspace{0.05cm}$ units **less** if $\hspace{0.05cm}$  $x_{ik}= A_1$ $\hspace{0.05cm}$ than if $\hspace{0.05cm}$ $x_{ik}= A_0$

<br>

-   If $\hat{\beta}_{k, A_2} > 0$ , then

    -   $\hat{y}_i$ $\hspace{0.05cm}$  is $\hspace{0.05cm}$ $\hat{\beta}_{k,A_2}$ $\hspace{0.05cm}$ units **greater** if $\hspace{0.05cm}$  $x_{ik}= A_2$  $\hspace{0.05cm}$ than if $\hspace{0.05cm}$ $x_{ik}= A_0$

<br>

-   If $\hat{\beta}_{k, A_2} < 0$  , then 

    -  $\hat{y}_i$ $\hspace{0.05cm}$ is $\hspace{0.05cm}$ $|\hat{\beta}_{k,A_2}|$ $\hspace{0.05cm}$ units **less** if  $\hspace{0.05cm}$ $x_{ik}= A_2$ $\hspace{0.05cm}$ than if $\hspace{0.05cm}$ $x_{ik}= A_0$

<br>

-   If  $\hat{\beta}_{k, A_2} > \hat{\beta}_{k, A_1}$ , then

    -   $\hat{y}_i$ $\hspace{0.05cm}$ is $\hspace{0.05cm}$ $\hat{\beta}_{k,A_2} - \hat{\beta}_{k,A_1}$  $\hspace{0.05cm}$ units  **greater** if $\hspace{0.05cm}$ $x_{ik}= A_2$ $\hspace{0.05cm}$ than if $\hspace{0.05cm}$ $x_{ik}= A_1$

<br>

-   If  $\hat{\beta}_{k, A_2} < \hat{\beta}_{k, A_1}$  , then

    -   $\hat{y}_i$ $\hspace{0.05cm}$ is $\hspace{0.05cm}$ $\hat{\beta}_{k,A_1} - \hat{\beta}_{k,A_2}$ $\hspace{0.05cm}$ units
        **less** if $\hspace{0.05cm}$ $x_{ik}= A_2$ $\hspace{0.05cm}$ than if $\hspace{0.05cm}$ $x_{ik}= A_1$

<br>

**Observation:**

The above extends directly to a categorical predictor with $r$
categories, for $r>3$: it enters the model as $r-1$ binary variables, one for each non-reference category.
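
The dummy coding described above can be sketched as follows. The category labels, the coefficients and the helper names (`dummies`, `predict`) are hypothetical, chosen only to illustrate how differences between categories are read from the coefficients.

```python
categories = ["A0", "A1", "A2"]          # A0 is the reference category
beta_0 = 100.0                           # hypothetical intercept
beta_dummy = {"A1": 15.0, "A2": -5.0}    # hypothetical dummy coefficients

def dummies(category):
    """Binary (0,1) variables, one per non-reference category."""
    return {c: int(category == c) for c in categories[1:]}

def predict(category):
    """Estimated response when the categorical predictor equals `category`."""
    d = dummies(category)
    return beta_0 + sum(beta_dummy[c] * d[c] for c in d)

print(dummies("A0"))                      # {'A1': 0, 'A2': 0}
print(predict("A1") - predict("A0"))      # 15.0  = beta_A1
print(predict("A2") - predict("A1"))      # -20.0 = beta_A2 - beta_A1
```

Adding a fourth category would only require one more label in `categories` and one more entry in `beta_dummy`.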



### Example of coefficient interpretation

We obtained the following estimated model:


$$
\widehat{price}_i =  -61730799 +  35664 \cdot size\_in\_m\_2_i - 836683 \cdot no\_of\_bedrooms_i - 57121 \cdot no\_of\_bathrooms_i \\
- 340613 \cdot qualityLow_i - 200594 \cdot qualityMedium_i - 61852  \cdot qualityUltra_i  + 6114932\cdot  latitude_i - 1677161   \cdot longitude_i
$$

The interpretation of the estimated model coefficients is the following:

-   $\hat{\beta}_0 = -61730799$   is the $price$ estimated by the model for
    houses with $size\_in\_m\_2_i =0$  ,  $no\_of\_bedrooms_i =0$  ,  $no\_of\_bathrooms_i =0$  ,   $qualityLow_i=0$ ,   $qualityMedium_i=0$  ,  $qualityUltra_i=0$  (i.e. high quality, the reference category) and  $latitude_i=longitude_i=0$


-   $\hat{\beta}_{size\_in\_m\_2} =35664$   $\Rightarrow$   if
    $size\_in\_m\_2_i$ increases by $h$ units, the estimated housing
    $price$ increases by  $h\cdot 35664$  units, holding the other predictors fixed.


-   $\hat{\beta}_{no\_of\_bedrooms} = - 836683$   $\Rightarrow$   if
    $no\_of\_bedrooms_i$ increases by  $h$  units, the estimated housing
    $price$ decreases by  $h\cdot 836683$  units, holding the other predictors fixed.


-   $\hat{\beta}_{no\_of\_bathrooms} = -57121$   $\Rightarrow$   if
    $no\_of\_bathrooms_i$ increases by  $h$  units, the estimated housing
    $price$ decreases by  $h\cdot 57121$  units, holding the other predictors fixed.


-   $\hat{\beta}_{qualityLow} = -340613$   $\Rightarrow$    the estimated
    $price$ of houses with low quality  $(qualityLow_i=1)$  is  $340613$   units lower than the estimated price of houses with high quality
    $(qualityLow_i=qualityMedium_i=qualityUltra_i=0)$ , because high quality is the reference category  of the $quality$ variable.


-   $\hat{\beta}_{qualityMedium} = -200594$  $\Rightarrow$    the
    estimated $price$ of houses with medium quality 
    $(qualityMedium_i=1)$  is  $200594$ units lower than the estimated
    price of houses with high quality, because high
    quality is the reference category of the  $quality$  variable.


-   $\hat{\beta}_{qualityUltra} = -61852$   $\Rightarrow$     the estimated
    $price$ of houses with ultra quality  $(qualityUltra_i=1)$   is
  $61852$  units lower than the estimated price of houses with high quality, because high quality is the reference category of the $quality$  variable.


-   $\hat{\beta}_{qualityUltra} - \hat{\beta}_{qualityMedium} = -61852  -( -200594 )=138742$     $\Rightarrow$   the estimated price of houses  with ultra quality  $(qualityUltra_i=1)$   is  $138742$ units greater   than the estimated price of houses with medium quality  $(qualityMedium_i=1)$ 
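
These comparisons between quality levels can be reproduced directly from the estimated coefficients listed above. The dictionary and the helper `quality_effect` are only an illustrative sketch, not part of the original analysis; the reference category (High) is given a coefficient of 0.

```python
# Estimated quality coefficients from the model above (reference category: High)
quality_coef = {"High": 0, "Low": -340613, "Medium": -200594, "Ultra": -61852}

def quality_effect(a, b):
    """Estimated price difference between quality a and quality b,
    with all other predictors held fixed."""
    return quality_coef[a] - quality_coef[b]

print(quality_effect("Ultra", "Medium"))   # 138742
print(quality_effect("Low", "High"))       # -340613
```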