## Variable interactions

In [2]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
trainf = pd.read_csv('./Datasets/Car_features_train.csv')
trainp = pd.read_csv('./Datasets/Car_prices_train.csv')
testf = pd.read_csv('./Datasets/Car_features_test.csv')
testp = pd.read_csv('./Datasets/Car_prices_test.csv')
# Combine features and prices; by default pd.merge joins on the columns the two frames share
train = pd.merge(trainf,trainp)
train.head()

| | carID | brand | model | year | transmission | mileage | fuelType | tax | mpg | engineSize | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18473 | bmw | 6 Series | 2020 | Semi-Auto | 11 | Diesel | 145 | 53.3282 | 3.0 | 37980 |
| 1 | 15064 | bmw | 6 Series | 2019 | Semi-Auto | 10813 | Diesel | 145 | 53.043 | 3.0 | 33980 |
| 2 | 18268 | bmw | 6 Series | 2020 | Semi-Auto | 6 | Diesel | 145 | 53.4379 | 3.0 | 36850 |
| 3 | 18480 | bmw | 6 Series | 2017 | Semi-Auto | 18895 | Diesel | 145 | 51.514 | 3.0 | 25998 |
| 4 | 18492 | bmw | 6 Series | 2015 | Automatic | 62953 | Diesel | 160 | 51.4903 | 3.0 | 18990 |


Until now, we have assumed that the association between a predictor $X_j$ and the response $Y$ does not depend on the values of the other predictors. For example, the multiple linear regression model that we developed in Chapter [2](https://nustat.github.io/STAT303-2-class-notes/Lec2_MultipleLinearRegression.html) assumes that the average increase in `price` associated with a unit increase in `engineSize` is always \$12,180, regardless of the values of the other predictors. However, this assumption may be incorrect.

We can relax this assumption by considering another predictor, called an interaction term. Let us assume that the average increase in `price` associated with a one-unit increase in `engineSize` depends on the model `year` of the car. In other words, there is an interaction between `engineSize` and `year`. This interaction can be included as a predictor, which is the product of `engineSize` and `year`. *Note that there are several possible interactions that we can consider. Here the interaction between `engineSize` and `year` is just an example.*
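
To make the idea of "a predictor that is the product of two others" concrete, here is a minimal sketch on synthetic, noise-free data. The numbers, the coefficient values in `true_beta`, and the variable names are all made up for illustration and are not taken from the car dataset. It uses a plain NumPy least-squares fit instead of the `statsmodels` formula interface, only to show that the interaction column is nothing more than an element-wise product.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical values, not the car dataset).
rng = np.random.default_rng(0)
n = 200
year = rng.integers(2010, 2021, size=n).astype(float)
engine_size = rng.uniform(1.0, 4.0, size=n)

# Assumed "true" coefficients: intercept, year, engineSize, year*engineSize.
true_beta = np.array([-50000.0, 30.0, -800000.0, 400.0])

# The interaction predictor is simply the element-wise product of the two columns.
interaction = year * engine_size

# Design matrix with columns [1, year, engineSize, year*engineSize].
X = np.column_stack([np.ones(n), year, engine_size, interaction])
price = X @ true_beta

# Ordinary least squares estimates one coefficient per column,
# including one for the interaction column.
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)
print(beta_hat)
```

Because the synthetic response is generated without noise, the estimates should match `true_beta` up to numerical error. With real data the estimates would of course carry sampling error.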

In [4]:
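# Considering interaction between engineSize and year.
# In the formula, 'year*engineSize' expands to year + engineSize + year:engineSize,
# so both main effects and their product (the interaction term) are included.
ols_object = smf.ols(formula = 'price~year*engineSize+mileage+mpg', data = train)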
model = ols_object.fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.682 |
| Model: | OLS | Adj. R-squared: | 0.681 |
| Method: | Least Squares | F-statistic: | 2121.0 |
| Date: | Tue, 17 Jan 2023 | Prob (F-statistic): | 0.0 |
| Time: | 02:19:05 | Log-Likelihood: | -52338.0 |
| No. Observations: | 4960 | AIC: | 104700.0 |
| Df Residuals: | 4954 | BIC: | 104700.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 5.606e+05 | 2.74e+05 | 2.048 | 0.041 | 2.4e+04 | 1.1e+06 |
| year | -275.3833 | 135.695 | -2.029 | 0.042 | -541.405 | -9.361 |
| engineSize | -1.796e+06 | 9.97e+04 | -18.019 | 0.000 | -1.99e+06 | -1.6e+06 |
| year:engineSize | 896.7687 | 49.431 | 18.142 | 0.000 | 799.861 | 993.676 |
| mileage | -0.1525 | 0.008 | -17.954 | 0.000 | -0.169 | -0.136 |
| mpg | -84.3417 | 9.048 | -9.322 | 0.000 | -102.079 | -66.604 |

| | | | |
|---|---|---|---|
| Omnibus: | 2330.413 | Durbin-Watson: | 0.524 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 29977.437 |
| Skew: | 1.908 | Prob(JB): | 0.0 |
| Kurtosis: | 14.423 | Cond. No. | 76600000.0 |


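Note that the R-squared has increased compared to the model in Chapter [2](https://nustat.github.io/STAT303-2-class-notes/Lec2_MultipleLinearRegression.html), since we added a predictor. In ordinary least squares, adding a predictor can never decrease the R-squared on the training data, so the increase by itself does not show that the interaction is useful.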
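
The link between adding a predictor and R-squared can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `r_squared` (neither is part of the notes) to fit a small model and then a larger one that adds a predictor unrelated to the response. The larger model's training R-squared is never lower.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on X (X must already contain an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Synthetic data: y depends on x1 only; noise_predictor is unrelated to y by construction.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
noise_predictor = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])          # intercept + x1
X_big = np.column_stack([X_small, noise_predictor])  # same, plus the unrelated column

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)
print(r2_small, r2_big)
```

This is why the summary also reports an adjusted R-squared, which penalizes the number of predictors, and why the p-value of the added term is worth checking alongside R-squared.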

The model equation is:

\begin{equation}
price = \beta_0 + \beta_1*year + \beta_2*engineSize + \beta_3*(year*engineSize) + \beta_4*mileage + \beta_5*mpg,
\end{equation}

or

\begin{equation}
price = \beta_0 + \beta_1*year + (\beta_2+\beta_3*year)*engineSize + \beta_4*mileage + \beta_5*mpg,
\end{equation}

or

\begin{equation}
price = \beta_0 + \beta_1*year + \tilde \beta*engineSize + \beta_4*mileage + \beta_5*mpg,
\end{equation}
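
Comparing the last two equations gives $\tilde \beta = \beta_2 + \beta_3*year$, so the slope of `engineSize` is itself a function of `year`. The sketch below evaluates this slope for a few model years. The helper `engine_size_slope` is a hypothetical name, and the coefficient values are copied from the summary table as printed. The `engineSize` coefficient is rounded there to four significant digits, so the resulting slopes are approximate.

```python
# Coefficient estimates copied from the regression summary (rounded as printed there).
BETA_ENGINE_SIZE = -1.796e06   # coefficient of engineSize
BETA_INTERACTION = 896.7687    # coefficient of year:engineSize

def engine_size_slope(year):
    """Estimated change in price per unit increase in engineSize for a given model year."""
    return BETA_ENGINE_SIZE + BETA_INTERACTION * year

# Illustrative model years; any year in the range of the data could be used.
for year in (2010, 2015, 2020):
    print(year, round(engine_size_slope(year), 2))
```

Under this fitted model, each additional model year raises the estimated effect of a one-unit increase in `engineSize` on `price` by about 897.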