## Index:
* [Data-set description](#1)
* * [Data Manipulation in R](#2)
* * [Data Manipulation in Python](#3)
* [Introduction to the Linear Regression Model](#4)
* * [Usefulness of the Linear Regression Model](#5)
* * [Formal Approach to the Linear Regression Model](#6)
* * [Basic Assumptions](#7)
* * [Assumptions Consequences](#8)
* * [Matrix representation of the basic assumptions of the model](#9)
* [Estimation](#10)
* * [Prediction of Response Variable](#11)
* * [Estimation of model coefficients](#12)
* * [Estimation of model errors](#13)
* * [Regression Hyperplane ](#14)
* * [Hat-Matrix](#15)

## Data-set description <a class="anchor" id="1"></a>



This section describes the data-set used throughout this article.

The data consist of 1905 observations of 38 variables describing housing features.

The data were downloaded from:
<https://www.kaggle.com/datasets/dataregress/dubai-properties-dataset?resource=download>



The variables of our interest are the following:

-   id: identifier

-   neighborhood: the name of the neighborhood

-   latitude: the latitude of the house

-   longitude: the longitude of the house

-   price: the market price of the house

-   size_in_sqft: the size of the house in square feet

    -   1 sqft = 0.092903 $m^2$

-   price_per_sqft: the market price of the house per square foot

-   no_of_bedrooms: number of bedrooms in the house

-   no_of_bathrooms: number of bathrooms in the house

-   quality: quality of the house, based on the number of services. Its
    categories are Ultra, High, Medium and Low

-   maid_room: indicates if the house has a maid room
    (true/false)

-   unfurnished: indicates if the house is unfurnished
    (true/false)

-   balcony: indicates if the house has balcony (true/false)

-   barbecue_area: indicates if the house has barbecue area (true/false)

-   central_ac: indicates if the house has central air conditioning
    (true/false)

-   childrens_play_area: indicates if the house has a children's play area
    (true/false)

-   childrens_pool: indicates if the house has childrens pool
    (true/false)

-   concierge: indicates if the house has concierge (true/false)

-   covered_parking: indicates if the house has covered parking
    (true/false)

-   kitchen_appliances: indicates if the house has kitchen appliances
    (true/false)

-   maid_service: indicates if the house has maid service
    (true/false)

-   pets_allowed: indicates if pets are allowed (true/false)

-   private_garden: indicates if the house has private garden
    (true/false)

-   private_gym: indicates if the house has private gym (true/false)

-   private_jacuzzi: indicates if the house has private jacuzzi
    (true/false)

-   private_pool: indicates if the house has private pool (true/false)

-   security: indicates if the house has private security (true/false)

-   shared_gym: indicates if the house has shared gym (true/false)

-   shared_pool: indicates if the house has shared pool (true/false)

-   shared_spa: indicates if the house has shared spa (true/false)

-   view_of_water: indicates if the house has view of the water
    (true/false)





Now we are going to do the following:

1. We are going to load and manipulate the data-set in R

2. We will repeat this task in Python



### Data Manipulation in R <a class="anchor" id="2"></a>

In [1]:
import rpy2

%load_ext rpy2.ipython

import rpy2.robjects as robjects

Unable to determine R home: [WinError 2] The system cannot find the file specified


In [2]:
%%R

library(tidyverse)

R[write to console]: -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

R[write to console]: v ggplot2 3.3.6     v purrr   0.3.4
v tibble  3.1.7     v dplyr   1.0.9
v tidyr   1.2.0     v stringr 1.4.0
v readr   2.1.2     v forcats 0.5.1

R[write to console]: -- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()




We load the data-set with which we are going to work:


In [5]:
%%R 

url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Linear%20Regression%20in%20Python%20and%20R/properties_data.csv'

properties_data <- read_csv(url)

Rows: 1905 Columns: 38
-- Column specification --------------------------------------------------------
Delimiter: ","
chr  (2): neighborhood, quality
dbl  (8): id, latitude, longitude, price, size_in_sqft, price_per_sqft, no_o...
lgl (28): maid_room, unfurnished, balcony, barbecue_area, built_in_wardrobes...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


Now we are going to transform the variables that are measured in square feet (sqft) to square meters $(m^2)$:



size_in_m\_2 = 0.092903 \* size_in_sqft

price_per_m\_2 = price_per_sqft / 0.092903








In [6]:
%%R 

size_in_m_2 <-  0.092903*properties_data$size_in_sqft

properties_data$size_in_m_2 <- size_in_m_2

price_per_m_2 <- properties_data$price_per_sqft /  0.092903 

properties_data$price_per_m_2 <- price_per_m_2




The next step is to remove from the data-set the variables that we will not take into account:


In [7]:
%%R 

data_R <- properties_data %>% select("price", "size_in_m_2", "longitude", "latitude", "no_of_bedrooms", "no_of_bathrooms", "quality")

### Data Manipulation in Python <a class="anchor" id="3"></a>

In [35]:
import pandas as pd

from dfply import *

import warnings
warnings.filterwarnings('ignore')

In [10]:
url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Linear%20Regression%20in%20Python%20and%20R/properties_data.csv'

data_Python = pd.read_csv(url)

data_Python

| | id | neighborhood | latitude | longitude | price | size_in_sqft | price_per_sqft | no_of_bedrooms | no_of_bathrooms | quality | ... | private_pool | security | shared_gym | shared_pool | shared_spa | study | vastu_compliant | view_of_landmark | view_of_water | walk_in_closet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5528049 | Palm Jumeirah | 25.113208 | 55.138932 | 2700000 | 1079 | 2502.32 | 1 | 2 | Medium | ... | False | False | True | False | False | False | False | False | True | False |
| 1 | 6008529 | Palm Jumeirah | 25.106809 | 55.151201 | 2850000 | 1582 | 1801.52 | 2 | 2 | Medium | ... | False | False | True | True | False | False | False | False | True | False |
| 2 | 6034542 | Jumeirah Lake Towers | 25.063302 | 55.137728 | 1150000 | 1951 | 589.44 | 3 | 5 | Medium | ... | False | True | True | True | False | False | False | True | True | True |
| 3 | 6326063 | Culture Village | 25.227295 | 55.341761 | 2850000 | 2020 | 1410.89 | 2 | 3 | Low | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 6356778 | Palm Jumeirah | 25.114275 | 55.139764 | 1729200 | 507 | 3410.65 | 0 | 1 | Medium | ... | False | True | True | True | True | False | False | True | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1900 | 7705450 | Mohammed Bin Rashid City | 25.176892 | 55.310712 | 1500000 | 1087 | 1379.94 | 2 | 2 | Ultra | ... | False | True | True | True | True | True | True | True | True | True |
| 1901 | 7706287 | Mohammed Bin Rashid City | 25.166145 | 55.276684 | 1230000 | 760 | 1618.42 | 1 | 2 | Medium | ... | False | False | True | True | False | False | False | False | True | True |
| 1902 | 7706389 | Dubai Creek Harbour (The Lagoons) | 25.206500 | 55.345056 | 2900000 | 1930 | 1502.59 | 3 | 5 | Medium | ... | False | False | False | True | False | False | False | False | False | False |
| 1903 | 7706591 | Jumeirah Village Circle | 25.073858 | 55.229844 | 675000 | 740 | 912.16 | 1 | 2 | Medium | ... | False | True | True | True | False | False | False | False | True | True |


In [12]:
data_Python['size_in_m_2'] = 0.092903*data_Python['size_in_sqft']
data_Python['price_per_m_2'] = data_Python['price_per_sqft']/0.092903

In [17]:
data_Python = data_Python >> select(X.price , X.size_in_m_2, X.longitude, X.latitude, X.no_of_bedrooms, X.no_of_bathrooms, X.quality)
data_Python

| | price | size_in_m_2 | longitude | latitude | no_of_bedrooms | no_of_bathrooms | quality |
|---|---|---|---|---|---|---|---|
| 0 | 2700000 | 100.242337 | 55.138932 | 25.113208 | 1 | 2 | Medium |
| 1 | 2850000 | 146.972546 | 55.151201 | 25.106809 | 2 | 2 | Medium |
| 2 | 1150000 | 181.253753 | 55.137728 | 25.063302 | 3 | 5 | Medium |
| 3 | 2850000 | 187.664060 | 55.341761 | 25.227295 | 2 | 3 | Low |
| 4 | 1729200 | 47.101821 | 55.139764 | 25.114275 | 0 | 1 | Medium |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1900 | 1500000 | 100.985561 | 55.310712 | 25.176892 | 2 | 2 | Ultra |
| 1901 | 1230000 | 70.606280 | 55.276684 | 25.166145 | 1 | 2 | Medium |
| 1902 | 2900000 | 179.302790 | 55.345056 | 25.206500 | 3 | 5 | Medium |
| 1903 | 675000 | 68.748220 | 55.229844 | 25.073858 | 1 | 2 | Medium |


In [18]:
data_Python.dtypes

price                int64
size_in_m_2        float64
longitude          float64
latitude           float64
no_of_bedrooms       int64
no_of_bathrooms      int64
quality             object
dtype: object

In [20]:
data_Python['quality'] = data_Python['quality'].astype('category')

In [21]:
data_Python.dtypes

price                 int64
size_in_m_2         float64
longitude           float64
latitude            float64
no_of_bedrooms        int64
no_of_bathrooms       int64
quality            category
dtype: object

In [23]:
data_Python['quality'].unique()

['Medium', 'Low', 'High', 'Ultra']
Categories (4, object): ['High', 'Low', 'Medium', 'Ultra']

In [36]:
# Recode quality as integers: Low=0, Medium=1, High=2, Ultra=3.
# A vectorized map avoids the chained assignment of an element-by-element loop,
# which pandas may apply to a temporary copy instead of the data-set itself.
quality_codes = {'Low': 0, 'Medium': 1, 'High': 2, 'Ultra': 3}

data_Python['quality_recode'] = data_Python['quality'].map(quality_codes).astype(int)

In [27]:
data_Python.head()

| | price | size_in_m_2 | longitude | latitude | no_of_bedrooms | no_of_bathrooms | quality | quality_recode |
|---|---|---|---|---|---|---|---|---|
| 0 | 2700000 | 100.242337 | 55.138932 | 25.113208 | 1 | 2 | Medium | 1 |
| 1 | 2850000 | 146.972546 | 55.151201 | 25.106809 | 2 | 2 | Medium | 1 |
| 2 | 1150000 | 181.253753 | 55.137728 | 25.063302 | 3 | 5 | Medium | 1 |
| 3 | 2850000 | 187.66406 | 55.341761 | 25.227295 | 2 | 3 | Low | 0 |
| 4 | 1729200 | 47.101821 | 55.139764 | 25.114275 | 0 | 1 | Medium | 1 |


In [30]:
data_Python = data_Python >> select( ~X.quality )

In [33]:
data_Python = data_Python >> rename(quality = X.quality_recode)

The final Python data-set is:

In [34]:
data_Python

| | price | size_in_m_2 | longitude | latitude | no_of_bedrooms | no_of_bathrooms | quality |
|---|---|---|---|---|---|---|---|
| 0 | 2700000 | 100.242337 | 55.138932 | 25.113208 | 1 | 2 | 1 |
| 1 | 2850000 | 146.972546 | 55.151201 | 25.106809 | 2 | 2 | 1 |
| 2 | 1150000 | 181.253753 | 55.137728 | 25.063302 | 3 | 5 | 1 |
| 3 | 2850000 | 187.664060 | 55.341761 | 25.227295 | 2 | 3 | 0 |
| 4 | 1729200 | 47.101821 | 55.139764 | 25.114275 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1900 | 1500000 | 100.985561 | 55.310712 | 25.176892 | 2 | 2 | 3 |
| 1901 | 1230000 | 70.606280 | 55.276684 | 25.166145 | 1 | 2 | 1 |
| 1902 | 2900000 | 179.302790 | 55.345056 | 25.206500 | 3 | 5 | 1 |
| 1903 | 675000 | 68.748220 | 55.229844 | 25.073858 | 1 | 2 | 1 |


**Important**: in the approach followed in this article, a categorical variable must be recoded before it is used in a linear regression model in Python (its values must be numbers that represent its categories), i.e., we cannot use the variable *quality* as it was loaded; instead we use *quality_recode*.

This is the reason we have recoded *quality* in Python but not in R, where it is not strictly necessary.

Note that, apart from this recoding, we have obtained the same data-set as in R. As a consequence, in Python *quality* enters the model as a single numeric predictor.

- The R data-set has been called *data_R*

- The Python data-set has been called *data_Python*


We will use both of them throughout this article. 

## Introduction to the Linear Regression Model <a class="anchor" id="4"></a>


The main purpose of this article is to give a theoretical and
also practical exposition of the linear regression model.

This is without doubt the best-known statistical model.

There is a common idea that the linear regression model is outdated compared
with other modern statistical models. I would like to defend its
validity today, first as a statistical tool in its own right, and second as a
necessary step before learning other more modern and complex methods.

The linear regression model is the basis of many modern regression
techniques, so it is highly recommended to study it thoroughly before going deeper into other statistical models.

The two main texts on which this article is based are "Linear Models with
R" by Julian Faraway (second edition), and "An Introduction to
Statistical Learning" by Gareth James et al. (second edition).





### Usefulness of the Linear Regression Model <a class="anchor" id="5"></a>



The main use of the linear regression model is to predict the
values of a **quantitative** variable from the values of other variables (**quantitative or categorical**),
called predictors.

The model has other uses besides this one. We will
see them later.





### Formal Approach to the Linear Regression Model <a class="anchor" id="6"></a>

We have the following elements:

-   Response Variable:  a **quantitative** variable
      $Y=(y_{1} , y_2,...,y_n)^t$

-   Predictors: a set of **quantitative** or **categorical**
    variables:


\begin{gather*}
X_1 = (x_{11}, x_{21}, ..., x_{n1})^t \\
X_2 = (x_{12}, x_{22}, ..., x_{n2})^t \\
... \\
X_p = (x_{1p}, x_{2p}, ..., x_{np})^t
\end{gather*}


-   Predictors Matrix:

    
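    \begin{gather*}
    X=(1, X_1, X_2,...,X_p) = 
    \begin{pmatrix}
    1 & x_{11}&x_{12}&...&x_{1p}\\
    1 & x_{21}&x_{22}&...&x_{2p}\\
    ...&...&...&...&...\\
    1& x_{n1}&x_{n2}&...&x_{np}
    \end{pmatrix} = 
    \begin{pmatrix}
    x_{1}^t\\
    x_{2}^t\\
    ...\\
    x_{n}^t
    \end{pmatrix}
    \end{gather*}

    where $x_i = (1, x_{i1}, x_{i2}, ..., x_{ip})^t$ is the vector of predictor values of the $i$-th observation, preceded by a 1 for the intercept.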

-   Coefficients vector:


\begin{gather*}
\beta=(\beta_{0}, \beta_{1}, ..., \beta_{p})^t 
\end{gather*}

-   Errors vector:


\begin{gather*}
\varepsilon=(\varepsilon_{1}, \varepsilon_{2}, ..., \varepsilon_{n})^t 
\end{gather*}




### Basic Assumptions <a class="anchor" id="7"></a>


The basic assumptions of the model are the following:

-  $y_i =  x_i^t \cdot \beta  +  \varepsilon_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + ... + \beta_p \cdot x_{ip} + \varepsilon_i = \beta_0 + \sum_{j=1}^{p} \beta_j \cdot x_{ij} + \varepsilon_i$

 
-  $\varepsilon_i$ is a random variable such that:


   - $E[\varepsilon_i]=0$ 
   - $Var(\varepsilon_i)=\sigma^2$
   - $\varepsilon_i \sim N(0,\sigma)$ 
   - $cov(\varepsilon_i , \varepsilon_j)=0$ for $i \neq j$



- Additional assumptions:

  -  $n > p+1$ (number of observations \> number of coefficients to estimate)

  -  $rank(X)=p+1$ (the columns of $X$ are linearly independent, so $X^t \cdot X$ is invertible)
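
As a rough illustration of these assumptions, the following sketch simulates a small data-set that satisfies them and checks the two additional conditions. The sample size, coefficients, error standard deviation and seed are all hypothetical values chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, only for reproducibility

n, p = 50, 2                             # hypothetical sample size and number of predictors
beta = np.array([1.0, 2.0, -0.5])        # hypothetical (beta_0, beta_1, beta_2)
sigma = 0.3                              # hypothetical standard deviation of the errors

predictors = rng.normal(size=(n, p))             # simulated values of X_1, ..., X_p
X = np.column_stack([np.ones(n), predictors])    # predictors matrix with the column of ones
eps = rng.normal(loc=0.0, scale=sigma, size=n)   # independent errors with mean 0 and sd sigma
y = X @ beta + eps                               # y_i = x_i^t * beta + eps_i

has_enough_rows = n > p + 1                          # more observations than coefficients
has_full_rank = np.linalg.matrix_rank(X) == p + 1    # columns of X linearly independent
print(has_enough_rows, has_full_rank)
```

With real data the same two checks can be run on the actual predictors matrix before estimating the model.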




### Assumptions Consequences <a class="anchor" id="8"></a>


   -  $y_i$ is a random variable because  $\varepsilon_i$ is a random variable

   -  $E[y_i]= x_i^t \cdot \beta$

   -  $Var(y_i) = \sigma^2$

   -  $y_i \sim N(x_i^t \cdot \beta , \sigma)$

   -  $cov(y_i , y_j)=0$ for $i \neq j$




### Matrix representation of the basic assumptions of the model <a class="anchor" id="9"></a>


\begin{gather*}
Y=X\cdot \beta + \varepsilon \\ \\
\varepsilon_i \sim N(0,\sigma) \\
cov(\varepsilon_i , \varepsilon_j)=0 \quad (i \neq j)
\end{gather*}






## Estimation  <a class="anchor" id="10"></a>



###  Prediction of Response Variable <a class="anchor" id="11"></a>


The linear regression model predicts the value of the response variable $y_i$ for the combination of predictor values $x_i = (1, x_{i1}, x_{i2}, ..., x_{ip})^t$ as:

\begin{gather*}
\hat{y}_i = x_i^t \cdot \hat{\beta}  = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j \cdot x_{ij} = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_{i1} + \hat{\beta}_2 \cdot x_{i2} + ... + \hat{\beta}_p \cdot x_{ip} 
\end{gather*}
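
A minimal sketch of this prediction rule for a single observation follows. The coefficient estimates and predictor values are made-up numbers used only to show the arithmetic.

```python
import numpy as np

beta_hat = np.array([10.0, 2.0, -1.0])   # hypothetical estimates (beta_0, beta_1, beta_2)
x_new = np.array([1.0, 3.0, 4.0])        # leading 1 for the intercept, then x_i1 = 3 and x_i2 = 4

y_pred = x_new @ beta_hat                # 10 + 2*3 - 1*4
print(y_pred)
```

The leading 1 in `x_new` is what makes the dot product include the intercept $\hat{\beta}_0$.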




### Estimation of model coefficients <a class="anchor" id="12"></a>


The estimation of $\beta$ in the classic linear regression model is done
using the ordinary least squares (OLS) method.

$\hat{\beta}$ is computed as the solution of the following optimization
problem:

\begin{gather*}
\underset{\beta}{Min} \sum_{i=1}^{n} (y_i - x_i^t \cdot \beta)^2 
\end{gather*}


The solution of the problem is:

\begin{gather*}
\hat{\beta}=(X^t \cdot X)^{-1} \cdot X^t \cdot Y
\end{gather*}
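
The closed-form solution can be checked numerically. The sketch below uses simulated data (all values are hypothetical), computes $\hat{\beta}$ from the formula above and compares it with NumPy's general least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)                       # arbitrary seed
n, p = 100, 3                                        # hypothetical sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, 1.0, -1.0, 0.5])          # hypothetical true coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)    # simulated response

# Closed form: solve (X^t X) beta = X^t y, which avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))
```

Solving the linear system is usually preferred to computing $(X^t \cdot X)^{-1}$ directly, because it is numerically more stable; the result is the same estimate.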


**Observation:**

We will not go through the mathematical details of solving this
optimization problem here. It is a classic convex optimization problem,
so it is enough to take the first derivatives of the objective function with
respect to the coefficients $\beta_0,\beta_1,...,\beta_p$, set them equal to zero, and solve the resulting system of equations for $\beta$.
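
For reference, these steps can be sketched in matrix form. This is the standard derivation written with the notation of this article, where $S(\beta)$ is a name introduced here for the objective function:

\begin{gather*}
S(\beta) = \sum_{i=1}^{n} (y_i - x_i^t \cdot \beta)^2 = (Y - X \cdot \beta)^t \cdot (Y - X \cdot \beta) \\
\frac{\partial S(\beta)}{\partial \beta} = -2 \cdot X^t \cdot (Y - X \cdot \beta) = 0 \\
X^t \cdot X \cdot \beta = X^t \cdot Y \\
\hat{\beta} = (X^t \cdot X)^{-1} \cdot X^t \cdot Y
\end{gather*}

The last step requires $X^t \cdot X$ to be invertible, which holds when the columns of $X$ are linearly independent.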



### Estimation of model errors <a class="anchor" id="13"></a>


The model errors  $\varepsilon_i$  are estimated as:

\begin{gather*}
\hat{\varepsilon}_i = y_i - \hat{y}_i = y_i - x_i^t \cdot \hat{\beta}  
\end{gather*}

for  $i=1,...,n$


**Observation:**

$\hat{\varepsilon}_i$ (the residual) is the error made by the model when it
estimates/predicts  $y_i$  as  $\hat{y}_i=x_i^t \cdot \hat{\beta}$
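
The residuals can be computed directly from this definition. The sketch below uses simulated data with one predictor (all values hypothetical) and also shows a well-known property of OLS with an intercept: the residuals sum to zero and are orthogonal to every column of $X$, up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(2)                       # arbitrary seed
n = 30                                               # hypothetical sample size
x = rng.uniform(0, 10, size=n)                       # one simulated predictor
X = np.column_stack([np.ones(n), x])                 # predictors matrix with intercept column
y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=n)    # simulated response

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # OLS estimate
y_hat = X @ beta_hat                                 # fitted values
residuals = y - y_hat                                # estimated errors

print(residuals.sum())      # numerically close to 0 because the model has an intercept
print(X.T @ residuals)      # numerically close to the zero vector
```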




### Regression Hyperplane <a class="anchor" id="14"></a>

The regression hyperplane is the matrix expression of the predictions
that the model makes for the values of the response variable:


\begin{gather*}
\hat{Y} = X \cdot \hat{\beta}  
\end{gather*}

Where:    $\hat{Y}=(\hat{y}_1,\hat{y}_2,...,\hat{y}_n)^t$




### Hat-Matrix <a class="anchor" id="15"></a>



\begin{gather*}
\hat{Y} = X \cdot \hat{\beta} = X \cdot (X^t \cdot X)^{-1} \cdot X^t \cdot Y = H \cdot Y  
\end{gather*}



Where:   $H= X \cdot (X^t \cdot X)^{-1} \cdot X^t$  is called the hat matrix
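
The following sketch builds $H$ for a small simulated predictors matrix (hypothetical sizes and seed) and checks that $H \cdot Y$ reproduces the fitted values. It also verifies three standard properties of the hat matrix: it is symmetric, idempotent, and its trace equals the number of coefficients $p+1$.

```python
import numpy as np

rng = np.random.default_rng(3)                       # arbitrary seed
n, p = 20, 2                                         # hypothetical sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)                               # any response vector works for this check

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                                # hat matrix, shape (n, n)
beta_hat = XtX_inv @ X.T @ y                         # OLS estimate

same_predictions = np.allclose(H @ y, X @ beta_hat)  # H*Y equals X*beta_hat
is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)
trace_H = np.trace(H)                                # equals p + 1 up to rounding
print(same_predictions, is_symmetric, is_idempotent, round(trace_H, 6))
```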


## Estimation of the Linear Regression Model in R


In this section we show how to estimate a linear regression
model in R, using the data-set presented at the beginning of the article.

The linear regression model that we propose is the following:

\begin{gather*}
price_i = \beta_0 +  \beta_1 \cdot size\_in\_m\_2_i + \beta_2 \cdot no\_of\_bedrooms_i +  \beta_3 \cdot no\_of\_bathrooms_i + \\ + \beta_4 \cdot quality_i + \beta_5\cdot  latitude_i +  \beta_6 \cdot longitude_i + \varepsilon_i
\end{gather*}



In [None]:
%%R

Model_1 <- lm( price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude ,
data = data_R)

Model_1

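The previous output gives us the estimates of the model coefficients
(betas). Because *quality* is categorical, R replaces it with three dummy variables, so nine coefficients are estimated instead of the seven written in the model above. They are numbered here in the order in which R reports them: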

$\hat{\beta}_0 = -61730799 \ , \ \hat{\beta}_1 =35664 \ , \ \hat{\beta}_2 = -836683 \ , \ \hat{\beta}_3 =-57121 \ , \ \hat{\beta}_4 =-340613 \ , \\ \hat{\beta}_5 = -200594 \ , \  \hat{\beta}_6= -61852 \ , \ \hat{\beta}_7=6114932 \ , \ \hat{\beta}_8= -1677161$


The estimated model is:


\begin{gather*}
\widehat{price}_i =  -61730799 +  35664 \cdot size\_in\_m\_2_i -836683 \cdot no\_of\_bedrooms_i -57121 \cdot no\_of\_bathrooms_i - \\ -340613 \cdot qualityLow_i -200594 \cdot qualityMedium_i -61852  \cdot qualityUltra_i  +6114932\cdot  latitude_i -1677161   \cdot longitude_i 
\end{gather*}



**Observation:**

The categorical variable *quality*, which has 4 categories (Low, Medium,
High, Ultra), enters the model as 3 variables (qualityLow,
qualityMedium, qualityUltra). The category left out of the model is High, because it is the first category in alphabetical order and R takes it as the reference category. If *quality* had been recoded with numbers, for example Low=0, Medium=1, High=2, Ultra=3, and still treated as a factor, then the category left out of the model would be Low, because it is recoded as 0 and R would take it as the reference category.

This isn't a peculiarity of this variable, but rather a property of categorical variables in regression models.

Later it will be seen how this affects the model coefficients interpretation.  
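
The dummy coding described in this observation can be reproduced on a few toy values. The sketch below uses pandas `get_dummies`, which is only one possible way to mimic what R does internally; the five quality values are invented for the example.

```python
import pandas as pd

# Toy values, not taken from the data-set
quality = pd.Series(['Medium', 'Low', 'High', 'Ultra', 'Medium'], name='quality')

# drop_first=True removes the first category in sorted order (High),
# which then plays the role of the reference category
dummies = pd.get_dummies(quality, prefix='quality', prefix_sep='', drop_first=True, dtype=int)

print(list(dummies.columns))
print(dummies)
```

A house of High quality has zeros in the three remaining columns, which is why its effect is absorbed by the intercept.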


## Estimation of Linear Regression Model in Python

We can implement a linear regression model in Python with the following code:


In [None]:
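import statsmodels.formula.api as smf

# data_Python is the data-set built above; after the rename, its recoded quality column is called 'quality'
modelo = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', data = data_Python)

modelo = modelo.fit()
 
print(modelo.summary())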


We can also use a training data-set to fit the model in Python with the following code. These concepts will be covered in much more detail in a specific article about validation techniques.



In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

# data_Python is the data-set built above; after the rename, its recoded quality column is called 'quality'
predictors = ['size_in_m_2', 'no_of_bedrooms', 'no_of_bathrooms', 'quality', 'latitude', 'longitude']

X = data_Python[predictors]
y = data_Python[['price']]

X_train, X_test, y_train, y_test = train_test_split( X , y.values.reshape(-1,1) , train_size = 0.8, random_state = 1234, shuffle = True )

# y.values.reshape(-1,1) turns y into a column vector

# train_size = 0.8 --> the training data-set contains 80% of the original observations

# random_state = 1234 --> a seed that makes the random selection of the training observations reproducible

# shuffle --> whether or not to shuffle the data before splitting it into training set and test set

data_train = pd.DataFrame( np.hstack((X_train, y_train)), columns = predictors + ['price'] )

# np.hstack((X_train, y_train)) joins the arrays column-wise, more or less like cbind() in R

modelo = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', data = data_train)

modelo = modelo.fit()
 
print(modelo.summary())


Let's see how hstack (and vstack) work:



In [None]:
a = np.array((1,2,3))
b = np.array((4,5,6))

np.hstack((a,b))

In [None]:
np.vstack((a,b))
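
With one-dimensional arrays such as `a` and `b`, `np.hstack` returns a single one-dimensional array with the six values, while `np.vstack` returns a 2 x 3 array with one input per row. The case used when building the training data-set involves two-dimensional arrays, which the sketch below illustrates with toy values (the names `features` and `target` are hypothetical):

```python
import numpy as np

features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 rows, 2 columns (toy values)
target = np.array([[10.0], [20.0], [30.0]])                  # 3 rows, 1 column

side_by_side = np.hstack((features, target))   # appends columns: shape (3, 3)
stacked = np.vstack((features, features))      # appends rows: shape (6, 2)

print(side_by_side.shape, stacked.shape)
print(side_by_side)
```

For two-dimensional inputs, `np.hstack` requires the same number of rows in every array and `np.vstack` the same number of columns.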