# Using R-style formulas in `statsmodels`

One of the most versatile features of `statsmodels` is that statistical models can be specified with **symbolic formulas**, in the style of the R language. A formula states the whole model in one readable expression, which makes models easier both to write and to interpret.

Formulas are written as strings and follow the syntax of the `patsy` package, which parses each formula and turns it into the design matrices the model needs.
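To make the formula-to-matrix step concrete, here is a minimal sketch that calls `patsy` directly. The DataFrame `toy` and its columns are invented for this illustration; they are not part of the dataset used later.

```python
import pandas as pd
from patsy import dmatrices

# Toy data invented for this illustration (assumption, not the tutorial's dataset).
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x1": [0.5, 1.5, 2.5, 3.5],
    "x2": [10.0, 20.0, 15.0, 5.0],
})

# dmatrices parses the formula and returns the response and the design matrix.
y_mat, X_mat = dmatrices("y ~ x1 + x2", data=toy, return_type="dataframe")

print(list(X_mat.columns))  # ['Intercept', 'x1', 'x2']
print(X_mat.shape)          # (4, 3)
```

The `Intercept` column of ones is added automatically; `statsmodels` performs this same step internally when a model is built from a formula.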

## General syntax

```r
y ~ x1 + x2 + x3
```

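where:

* `y`: dependent variable (target)
* `x1 + x2 + x3`: independent variables (regressors)
* Transformations can be applied directly in the formula: `y ~ np.log(x1) + I(x2**2)`. Arithmetic such as squaring goes inside `I(...)`, because a bare `**` is a formula operator, not exponentiation.
* Interactions can be included: `y ~ x1*x2` (equivalent to `x1 + x2 + x1:x2`)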
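Since `+`, `*`, `:` and `**` all have a formula meaning in `patsy`, ordinary arithmetic on a variable is usually wrapped in `I(...)`. The sketch below, on invented toy data, compares the design matrix for a bare `x ** 2` with the one for `I(x ** 2)`.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy data invented for this illustration.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

# Without I(), ** is treated as a formula operator, so no squared column appears.
X_plain = dmatrix("x ** 2", data=toy, return_type="dataframe")

# With I(), the expression is evaluated as ordinary Python arithmetic.
X_squared = dmatrix("x + I(x ** 2)", data=toy, return_type="dataframe")

print(X_plain.shape)    # (4, 2): intercept and x only
print(X_squared.shape)  # (4, 3): intercept, x and x squared
print(np.allclose(X_squared.iloc[:, -1], toy["x"] ** 2))  # True
```

Function calls such as `np.log(x1)` do not need `I(...)`, because the contents of a function call are passed to Python unchanged.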

Formulas let you specify models in a more declarative and natural way. They also simplify working with categorical variables: string columns are dummy-coded automatically, and the `C(var_categorica)` syntax marks a variable as categorical explicitly, so there is no need to encode it by hand. Formulas also work directly with pandas DataFrames.
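As a small sketch of that automatic coding, the example below builds the design matrix for a string column with and without `C()`. The column names `group` and `x` are made up for the example.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy data invented for this illustration: one string column, one numeric column.
toy = pd.DataFrame({
    "group": ["a", "b", "c", "a", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# A string column is dummy-coded automatically ...
X_auto = dmatrix("x + group", data=toy, return_type="dataframe")
# ... and C() requests the same treatment explicitly.
X_explicit = dmatrix("x + C(group)", data=toy, return_type="dataframe")

print(list(X_auto.columns))
print(list(X_explicit.columns))

# Only the column labels differ; the numbers are identical.
print(np.allclose(X_auto.to_numpy(), X_explicit.to_numpy()))  # True
```

With an intercept in the model, the first level (`a` here) is dropped and acts as the reference category. `C()` becomes necessary when a categorical variable is stored as integers, which would otherwise be treated as numeric.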

## Importing libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Importing the data

In [2]:
df = sm.datasets.get_rdataset("Guerry", "HistData").data

variables = ["Lottery", "Literacy", "Wealth", "Region"]

df = df[variables].dropna()

df.head()

,Lottery,Literacy,Wealth,Region
0,41,37,73,E
1,38,51,22,N
2,66,13,61,C
3,80,46,76,E
4,79,69,83,E


## OLS example using a formula

In [3]:
model = smf.ols(formula="Lottery ~ Literacy + Wealth + Region", data=df)
model

<statsmodels.regression.linear_model.OLS at 0x10348ac20>

In [4]:
res = model.fit()
res

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x103494c10>

In [5]:
res.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared:,0.338
Model:,OLS,Adj. R-squared:,0.287
Method:,Least Squares,F-statistic:,6.636
Date:,"Wed, 28 May 2025",Prob (F-statistic):,1.07e-05
Time:,19:15:06,Log-Likelihood:,-375.3
No. Observations:,85,AIC:,764.6
Df Residuals:,78,BIC:,781.7
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,38.6517,9.456,4.087,0.000,19.826,57.478
Region[T.E],-15.4278,9.727,-1.586,0.117,-34.793,3.938
Region[T.N],-10.0170,9.260,-1.082,0.283,-28.453,8.419
Region[T.S],-4.5483,7.279,-0.625,0.534,-19.039,9.943
Region[T.W],-10.0913,7.196,-1.402,0.165,-24.418,4.235
Literacy,-0.1858,0.210,-0.886,0.378,-0.603,0.232
Wealth,0.4515,0.103,4.390,0.000,0.247,0.656

0,1,2,3
Omnibus:,3.049,Durbin-Watson:,1.785
Prob(Omnibus):,0.218,Jarque-Bera (JB):,2.694
Skew:,-0.34,Prob(JB):,0.26
Kurtosis:,2.454,Cond. No.,371.0


Notice that the dummy variables for `Region` were created automatically. To make the categorical treatment explicit, and easier for a reader to spot, we can wrap the variable in `C()`, as shown in the next section.

## Categorical variables

In [7]:
df.info()
# Region has dtype object (strings), so it is treated as categorical

<class 'pandas.core.frame.DataFrame'>
Index: 85 entries, 0 to 84
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Lottery   85 non-null     int64 
 1   Literacy  85 non-null     int64 
 2   Wealth    85 non-null     int64 
 3   Region    85 non-null     object
dtypes: int64(3), object(1)
memory usage: 3.3+ KB


In [8]:
# unique values of the Region variable
df['Region'].unique()

array(['E', 'N', 'C', 'S', 'W'], dtype=object)

In [9]:
# Mark Region explicitly as a categorical variable
model = smf.ols(formula="Lottery ~ Literacy + Wealth + C(Region)", data=df)
res = model.fit()
res.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared:,0.338
Model:,OLS,Adj. R-squared:,0.287
Method:,Least Squares,F-statistic:,6.636
Date:,"Wed, 28 May 2025",Prob (F-statistic):,1.07e-05
Time:,21:34:03,Log-Likelihood:,-375.3
No. Observations:,85,AIC:,764.6
Df Residuals:,78,BIC:,781.7
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,38.6517,9.456,4.087,0.000,19.826,57.478
C(Region)[T.E],-15.4278,9.727,-1.586,0.117,-34.793,3.938
C(Region)[T.N],-10.0170,9.260,-1.082,0.283,-28.453,8.419
C(Region)[T.S],-4.5483,7.279,-0.625,0.534,-19.039,9.943
C(Region)[T.W],-10.0913,7.196,-1.402,0.165,-24.418,4.235
Literacy,-0.1858,0.210,-0.886,0.378,-0.603,0.232
Wealth,0.4515,0.103,4.390,0.000,0.247,0.656

0,1,2,3
Omnibus:,3.049,Durbin-Watson:,1.785
Prob(Omnibus):,0.218,Jarque-Bera (JB):,2.694
Skew:,-0.34,Prob(JB):,0.26
Kurtosis:,2.454,Cond. No.,371.0


## Removing terms

In the following syntax

```r
y ~ x1 + x2 + x3
```

each `+` adds a term, that is, one or more columns, to the matrix $X$ in the equation $\beta = (X^TX)^{-1}X^TY$, which gives the optimal parameters under OLS. To remove the column of 1's from this matrix, the one associated with the coefficient $\beta_0$ (called the **intercept** in statsmodels), we can use the `-` sign and write `- 1` in the formula.
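The closed-form expression can be checked numerically. The following is a minimal sketch on synthetic data (the coefficients 2.0, 1.5 and -0.5 are arbitrary choices for the illustration) that compares the normal-equation solution with the parameters estimated by `smf.ols`, and then drops the intercept with `- 1`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data invented for this illustration.
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 2.0 + 1.5 * toy["x1"] - 0.5 * toy["x2"] + rng.normal(scale=0.1, size=n)

# Design matrix with an explicit column of ones for the intercept.
X = np.column_stack([np.ones(n), toy["x1"], toy["x2"]])
y = toy["y"].to_numpy()

# Solve (X^T X) beta = X^T y; solve() avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

res_toy = smf.ols("y ~ x1 + x2", data=toy).fit()
print(np.allclose(beta_hat, res_toy.params.to_numpy()))  # True

# "- 1" removes the column of ones, so no Intercept parameter is estimated.
res_no_int = smf.ols("y ~ x1 + x2 - 1", data=toy).fit()
print(list(res_no_int.params.index))  # ['x1', 'x2']
```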

In [11]:
# Explicitly remove the intercept
model_c = smf.ols(formula="Lottery ~ Literacy + Wealth + C(Region) - 1", data=df)
res_c = model_c.fit()
res_c.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared:,0.338
Model:,OLS,Adj. R-squared:,0.287
Method:,Least Squares,F-statistic:,6.636
Date:,"Wed, 28 May 2025",Prob (F-statistic):,1.07e-05
Time:,21:43:38,Log-Likelihood:,-375.3
No. Observations:,85,AIC:,764.6
Df Residuals:,78,BIC:,781.7
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
C(Region)[C],38.6517,9.456,4.087,0.000,19.826,57.478
C(Region)[E],23.2239,14.931,1.555,0.124,-6.501,52.949
C(Region)[N],28.6347,13.127,2.181,0.032,2.501,54.769
C(Region)[S],34.1034,10.370,3.289,0.002,13.459,54.748
C(Region)[W],28.5604,10.018,2.851,0.006,8.616,48.505
Literacy,-0.1858,0.210,-0.886,0.378,-0.603,0.232
Wealth,0.4515,0.103,4.390,0.000,0.247,0.656

0,1,2,3
Omnibus:,3.049,Durbin-Watson:,1.785
Prob(Omnibus):,0.218,Jarque-Bera (JB):,2.694
Skew:,-0.34,Prob(JB):,0.26
Kurtosis:,2.454,Cond. No.,653.0


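The **intercept** term no longer appears. Because `Region` is categorical, all five regions are now coded instead of dropping a reference level, so the fit itself is unchanged (same R-squared and log-likelihood). The coefficient `C(Region)[C]` equals the previous intercept, 38.6517, and every other region coefficient is the previous intercept plus that region's offset, for example 38.6517 - 15.4278 = 23.2239 for `C(Region)[E]`.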
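The relationship between the two parameterizations can be verified on synthetic data. In this sketch the column names `g`, `x` and `y` and the group offsets are invented; only the formula syntax mirrors the models above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data invented for this illustration: three groups of 20 rows each.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "g": np.tile(["a", "b", "c"], 20),
    "x": rng.normal(size=60),
})
offsets = toy["g"].map({"a": 0.0, "b": 1.0, "c": -2.0})
toy["y"] = 3.0 + offsets + 0.7 * toy["x"] + rng.normal(scale=0.2, size=60)

with_int = smf.ols("y ~ x + C(g)", data=toy).fit()
no_int = smf.ols("y ~ x + C(g) - 1", data=toy).fit()

# Same fitted values: the two formulas describe the same model.
print(np.allclose(with_int.fittedvalues, no_int.fittedvalues))  # True

# The reference level's coefficient takes over the role of the intercept.
print(np.isclose(with_int.params["Intercept"], no_int.params["C(g)[a]"]))  # True

# Every other level equals intercept + offset from the first model.
print(np.isclose(
    with_int.params["Intercept"] + with_int.params["C(g)[T.b]"],
    no_int.params["C(g)[b]"],
))  # True
```

This equivalence holds because a categorical term is present. In a model with only numeric regressors, removing the intercept forces the fit through the origin and does change the model.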

## Interaction terms

**Interaction terms** model the joint effect of two or more variables. In other words, they let the model express the idea that **the effect of one variable on the response can depend on the value of another variable**.

### Basic syntax

- `x1 + x2`: includes the individual effects of `x1` and `x2`.
- `x1:x2`: includes **only the interaction term** between `x1` and `x2` (not the individual effects).
- `x1 * x2`: expands automatically to `x1 + x2 + x1:x2`.

This is useful when the influence of one variable is suspected to change with the level of another.
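A quick way to see what each operator contributes is to inspect the design-matrix columns. This sketch uses invented toy data and calls `patsy` directly; for two numeric variables the interaction column is simply their element-wise product.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy data invented for this illustration.
toy = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [2.0, 0.0, 1.0, 3.0]})

X_only = dmatrix("x1:x2", data=toy, return_type="dataframe")
X_full = dmatrix("x1 * x2", data=toy, return_type="dataframe")

print(list(X_only.columns))  # ['Intercept', 'x1:x2']
print(list(X_full.columns))  # ['Intercept', 'x1', 'x2', 'x1:x2']

# For numeric variables the interaction is the product of the two columns.
print(np.allclose(X_full["x1:x2"], toy["x1"] * toy["x2"]))  # True
```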

In [12]:
# Interaction only: the Literacy-by-Wealth term, without the individual effects
model_int1 = smf.ols(formula="Lottery ~ Literacy:Wealth + C(Region)", data=df)
res_int1 = model_int1.fit()
res_int1.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared:,0.229
Model:,OLS,Adj. R-squared:,0.181
Method:,Least Squares,F-statistic:,4.7
Date:,"Wed, 28 May 2025",Prob (F-statistic):,0.000827
Time:,21:46:59,Log-Likelihood:,-381.76
No. Observations:,85,AIC:,775.5
Df Residuals:,79,BIC:,790.2
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,47.8130,5.909,8.092,0.000,36.051,59.575
C(Region)[T.E],-31.8424,8.850,-3.598,0.001,-49.459,-14.226
C(Region)[T.N],-27.6259,7.707,-3.584,0.001,-42.967,-12.285
C(Region)[T.S],-8.8565,7.752,-1.143,0.257,-24.285,6.573
C(Region)[T.W],-11.0910,7.734,-1.434,0.156,-26.486,4.304
Literacy:Wealth,0.0070,0.002,2.989,0.004,0.002,0.012

0,1,2,3
Omnibus:,8.998,Durbin-Watson:,1.8
Prob(Omnibus):,0.011,Jarque-Bera (JB):,3.891
Skew:,-0.251,Prob(JB):,0.143
Kurtosis:,2.079,Cond. No.,11000.0


In [13]:
# Interaction between Literacy and Wealth, plus each variable on its own
model_int2 = smf.ols(formula="Lottery ~ Literacy*Wealth + C(Region)", data=df)
res_int2 = model_int2.fit()
res_int2.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared:,0.338
Model:,OLS,Adj. R-squared:,0.278
Method:,Least Squares,F-statistic:,5.615
Date:,"Wed, 28 May 2025",Prob (F-statistic):,2.96e-05
Time:,21:48:19,Log-Likelihood:,-375.3
No. Observations:,85,AIC:,766.6
Df Residuals:,77,BIC:,786.1
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,39.0993,17.470,2.238,0.028,4.312,73.887
C(Region)[T.E],-15.4451,9.807,-1.575,0.119,-34.973,4.082
C(Region)[T.N],-9.9728,9.432,-1.057,0.294,-28.753,8.808
C(Region)[T.S],-4.5754,7.380,-0.620,0.537,-19.270,10.119
C(Region)[T.W],-10.1122,7.275,-1.390,0.169,-24.598,4.374
Literacy,-0.1960,0.396,-0.495,0.622,-0.984,0.592
Wealth,0.4432,0.290,1.530,0.130,-0.133,1.020
Literacy:Wealth,0.0002,0.007,0.031,0.976,-0.013,0.013

0,1,2,3
Omnibus:,3.076,Durbin-Watson:,1.784
Prob(Omnibus):,0.215,Jarque-Bera (JB):,2.709
Skew:,-0.341,Prob(JB):,0.258
Kurtosis:,2.452,Cond. No.,15600.0


In [16]:
# No Region and no intercept, interaction only
model_int3 = smf.ols(formula="Lottery ~ Literacy:Wealth -1", data=df)
res_int3 = model_int3.fit()
res_int3.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared (uncentered):,0.542
Model:,OLS,Adj. R-squared (uncentered):,0.536
Method:,Least Squares,F-statistic:,99.33
Date:,"Wed, 28 May 2025",Prob (F-statistic):,6.79e-16
Time:,21:51:27,Log-Likelihood:,-419.22
No. Observations:,85,AIC:,840.4
Df Residuals:,84,BIC:,842.9
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Literacy:Wealth,0.0182,0.002,9.966,0.000,0.015,0.022

0,1,2,3
Omnibus:,6.8,Durbin-Watson:,1.536
Prob(Omnibus):,0.033,Jarque-Bera (JB):,6.531
Skew:,-0.675,Prob(JB):,0.0382
Kurtosis:,3.15,Cond. No.,1.0


In [17]:
# No Region and no intercept, with the interaction and the individual effects
model_int4 = smf.ols(formula="Lottery ~ Literacy*Wealth -1", data=df)
res_int4 = model_int4.fit()
res_int4.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared (uncentered):,0.817
Model:,OLS,Adj. R-squared (uncentered):,0.811
Method:,Least Squares,F-statistic:,122.3
Date:,"Wed, 28 May 2025",Prob (F-statistic):,3.55e-30
Time:,21:52:03,Log-Likelihood:,-380.15
No. Observations:,85,AIC:,766.3
Df Residuals:,82,BIC:,773.6
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Literacy,0.4274,0.099,4.297,0.000,0.230,0.625
Wealth,1.0810,0.104,10.397,0.000,0.874,1.288
Literacy:Wealth,-0.0136,0.003,-4.265,0.000,-0.020,-0.007

0,1,2,3
Omnibus:,2.001,Durbin-Watson:,1.946
Prob(Omnibus):,0.368,Jarque-Bera (JB):,2.002
Skew:,-0.321,Prob(JB):,0.367
Kurtosis:,2.609,Cond. No.,89.8


## Functions

One of the most powerful features of formulas in `statsmodels` (thanks to `patsy`) is that they allow **mathematical functions to be applied directly inside the formula**, without transforming the data in the DataFrame beforehand.

This is very useful for modeling nonlinear relationships, applying logarithmic or quadratic transformations (the latter wrapped in `I(...)`), normalizing variables, or including custom combinations directly in the model specification, for example with **numpy**.
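As a sanity check, a function applied inside the formula should give the same estimates as a column transformed beforehand. The sketch below tests this on synthetic data; the names `toy`, `x`, `y` and `log_x` are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data invented for this illustration (x is strictly positive).
rng = np.random.default_rng(2)
toy = pd.DataFrame({"x": rng.uniform(1.0, 50.0, size=40)})
toy["y"] = 5.0 - 2.0 * np.log(toy["x"]) + rng.normal(scale=0.1, size=40)

# Transformation written inside the formula ...
res_formula = smf.ols("y ~ np.log(x)", data=toy).fit()

# ... versus a column computed in the DataFrame first.
toy["log_x"] = np.log(toy["x"])
res_manual = smf.ols("y ~ log_x", data=toy).fit()

print(np.allclose(res_formula.params.to_numpy(), res_manual.params.to_numpy()))  # True
```

The name `np` is resolved from the namespace in which the model is created, so `numpy` has to be imported under that name wherever the formula is used.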


In [18]:
model = smf.ols(formula="Lottery ~ np.log(Literacy)", data=df)
res = model.fit()
res.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared:,0.161
Model:,OLS,Adj. R-squared:,0.151
Method:,Least Squares,F-statistic:,15.89
Date:,"Wed, 28 May 2025",Prob (F-statistic):,0.000144
Time:,21:54:58,Log-Likelihood:,-385.38
No. Observations:,85,AIC:,774.8
Df Residuals:,83,BIC:,779.7
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,115.6091,18.374,6.292,0.000,79.064,152.155
np.log(Literacy),-20.3940,5.116,-3.986,0.000,-30.570,-10.218

0,1,2,3
Omnibus:,8.907,Durbin-Watson:,2.019
Prob(Omnibus):,0.012,Jarque-Bera (JB):,3.299
Skew:,0.108,Prob(JB):,0.192
Kurtosis:,2.059,Cond. No.,28.7


In [19]:
# We can also define a custom function (here, the logarithm shifted by 1)

def log_plus_1(x):
    return np.log(x) + 1

In [20]:
model = smf.ols(formula="Lottery ~ log_plus_1(Literacy)", data=df)
res = model.fit()
res.summary()

0,1,2,3
Dep. Variable:,Lottery,R-squared:,0.161
Model:,OLS,Adj. R-squared:,0.151
Method:,Least Squares,F-statistic:,15.89
Date:,"Wed, 28 May 2025",Prob (F-statistic):,0.000144
Time:,22:12:27,Log-Likelihood:,-385.38
No. Observations:,85,AIC:,774.8
Df Residuals:,83,BIC:,779.7
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,136.0031,23.454,5.799,0.000,89.354,182.652
log_plus_1(Literacy),-20.3940,5.116,-3.986,0.000,-30.570,-10.218

0,1,2,3
Omnibus:,8.907,Durbin-Watson:,2.019
Prob(Omnibus):,0.012,Jarque-Bera (JB):,3.299
Skew:,0.108,Prob(JB):,0.192
Kurtosis:,2.059,Cond. No.,45.5


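With the custom function, the slope and the fit statistics (R-squared, F-statistic, log-likelihood) are the same as before. Only the intercept changes, from 115.6091 to 136.0031: a constant added to a regressor is absorbed by the intercept, which shifts by minus the slope (115.6091 + 20.3940 = 136.0031).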
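The intercept shift follows from rewriting $a' + b(\log x + 1)$ as $(a' + b) + b \log x$, so the intercept of the shifted model is $a' = a - b$. A minimal sketch on synthetic data (the data and the names `toy`, `x`, `y` are invented for the illustration) confirms this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def log_plus_1(x):
    """Natural logarithm shifted by one, as in the cell above."""
    return np.log(x) + 1

# Synthetic data invented for this illustration.
rng = np.random.default_rng(3)
toy = pd.DataFrame({"x": rng.uniform(1.0, 80.0, size=50)})
toy["y"] = 100.0 - 20.0 * np.log(toy["x"]) + rng.normal(scale=1.0, size=50)

res_log = smf.ols("y ~ np.log(x)", data=toy).fit()
res_shift = smf.ols("y ~ log_plus_1(x)", data=toy).fit()

slope_log = res_log.params["np.log(x)"]
slope_shift = res_shift.params["log_plus_1(x)"]

# The slope is unchanged by the shift ...
print(np.isclose(slope_log, slope_shift))  # True
# ... and the intercept moves by minus the slope.
print(np.isclose(res_shift.params["Intercept"], res_log.params["Intercept"] - slope_log))  # True
```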