In [None]:
# Import useful libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

In [None]:
Auto_df=pd.read_csv('C:\\Users\\jheredi2\\Documents\\PythonDataAnalytics\\1-Datasets\\Auto_ISLR.csv')

<br>

Let's study the __possible__ interaction effect between horsepower and year. That is, we want to know if the effect of horsepower on mpg depends on the year.

If the interaction term between these predictors is significant, the answer to this question is "YES".

<br>

Before we start this analysis, let's pose a practical (non-statistical) question:

Does it make sense to think that there might be an interaction between these predictors? In other words, in practice, is it reasonable to suspect that horsepower and year interact?

<br>

To study the __possible__ interaction effect between horsepower and year, we are going to study this model:

Estimated mpg = b0 + b1 * horsepower + b2 * year + b3 * (horsepower) * (year)
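
One way to see why the interaction term answers the question: differentiating the model with respect to horsepower shows that the slope of horsepower is itself a function of year. The symbols $b_0, \dots, b_3$ below are the coefficients of the equation above; nothing else is assumed.

```latex
\[
\widehat{\text{mpg}} = b_0 + b_1\,\text{horsepower} + b_2\,\text{year}
                     + b_3\,(\text{horsepower}\times\text{year})
\]
\[
\frac{\partial\,\widehat{\text{mpg}}}{\partial\,\text{horsepower}} = b_1 + b_3\,\text{year}
\]
```

If $b_3 = 0$, the slope of horsepower is $b_1$ in every year, which is exactly the "no interaction" case.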

<br>

__Note__: We could also study the effect of this interaction term in more complex models involving these two predictors. For example, in a polynomial model:

Estimated mpg = b0 + b1 * horsepower + b2 * (horsepower squared) + b3 * year + b4 * (year squared) + b5 * (horsepower) * (year)


However, the interpretation of the interaction term is less obvious when non-linear terms are included. Thus, for the sake of a clean interpretation, let's study the interaction in the model with only linear terms.
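
To make the interaction term concrete, here is a minimal sketch of what fitting this model amounts to: the interaction is just an extra column holding the product of the two predictors. The data below are synthetic and noise-free (not the Auto dataset), and the "true" coefficients are hypothetical values chosen only for the demonstration.

```python
import numpy as np

# Synthetic, noise-free data (NOT the Auto dataset): illustration only.
rng = np.random.default_rng(0)
n = 50
horsepower = rng.uniform(50, 200, size=n)
year = rng.uniform(70, 82, size=n)

# Hypothetical "true" coefficients b0, b1, b2, b3 chosen for the demo.
true_coefs = np.array([10.0, 0.5, 0.3, -0.01])

# Design matrix: intercept, horsepower, year, and the product column.
X = np.column_stack([np.ones(n), horsepower, year, horsepower * year])
mpg = X @ true_coefs

# Ordinary least squares recovers the coefficients (up to rounding error),
# because the synthetic response contains no noise.
estimated_coefs, *_ = np.linalg.lstsq(X, mpg, rcond=None)
print(np.round(estimated_coefs, 4))
```

The formula interface used in the next cell spares us from building the product column by hand, but the idea is the same.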

In [None]:
regression_object1=smf.ols('mpg~ horsepower+ year + horsepower:year', data=Auto_df)

# An alternative way of writing the previous code is:

# smf.ols('mpg~ horsepower*year', data=Auto_df)

In [None]:
regression_model1= regression_object1.fit()

In [None]:
regression_model1.summary() 

Now, let's compare the model that includes the interaction term with the one that does not (Estimated mpg = b0 + b1 * horsepower + b2 * year).

Run the following code cell and then take a moment to compare the two summaries (for example, their R-squared and adjusted R-squared values).

In [None]:
smf.ols('mpg~ horsepower+ year', data=Auto_df).fit().summary()

Now, let's write the equation with the interaction term and interpret the interaction:

In [None]:
regression_model1.params

<br>

Estimated mpg= -126.61 + 1.05 * hp + 2.19 * year - 0.02* (hp) *(year)

Estimated mpg= -126.61 + 1.05 *hp - 0.02 *(hp) *(year) + 2.19 *year 

Estimated mpg= -126.61 + (1.05 - 0.02 year)* hp + 2.19* year


__Some comments/interpretations about the previous model__:

As we already knew from the previous class session, horsepower and mpg are negatively related: more hp, less mpg.


How can we tell from this equation that horsepower and mpg are negatively related?


Notice that the expression -0.02*year + 1.05 is negative for every year in the dataset: it only turns positive when year is below 52.5 (that is, 1.05/0.02), and the earliest year in the data is 70. In other words, hp is always multiplied by a negative number. See the next code cells:

In [None]:
Auto_df['year'].describe()

In [None]:
# When year equals its minimum value of 70, the expression takes a negative value

-0.02*70 +1.05

In [None]:
# As the year increases (for example, from 70 to 80), the expression (-0.02*year + 1.05) becomes even more negative:

-0.02*80 +1.05

__Interpreting the interaction__

Let's interpret the interaction by using an example. Let's increase the value of horsepower and see how this affects mpg under two different values of year.

In [None]:
# Computing the reduction in mpg when horsepower increases from 100 to 120 and year=75

regression_model1.predict({'horsepower':100, 'year':75}) - regression_model1.predict({'horsepower':120, 'year':75})

In [None]:
# Computing the reduction in mpg when horsepower increases from 100 to 120 and year=80

regression_model1.predict({'horsepower':100, 'year':80}) - regression_model1.predict({'horsepower':120, 'year':80})

__Interpretation__: The effect of horsepower on mpg intensifies (becomes stronger) as the years go by.

The same increase in horsepower produces a larger reduction in mpg for more recent years.
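
The two predictions above can also be reproduced by hand, which makes the role of the interaction explicit: the intercept and the year term cancel in the difference, so only the year-dependent slope of horsepower matters. This is a rough sketch that uses the rounded coefficients from the equation above; the function names are made up for this example, and because of the rounding the numbers will not match the `predict` output exactly.

```python
# Rounded coefficients copied from the equation above; the fitted model
# stores more decimals, so predict() will give somewhat different numbers.
B_HP = 1.05        # coefficient of horsepower
B_HP_YEAR = -0.02  # coefficient of horsepower:year

def hp_slope(year):
    """Change in predicted mpg per extra unit of horsepower in a given year."""
    return B_HP + B_HP_YEAR * year

def mpg_reduction(hp_from, hp_to, year):
    """Drop in predicted mpg when horsepower rises from hp_from to hp_to."""
    return -(hp_to - hp_from) * hp_slope(year)

for year in (75, 80):
    print(year, round(hp_slope(year), 2), round(mpg_reduction(100, 120, year), 2))
```

With these rounded values, the slope is more negative in year 80 than in year 75, so the same 20-unit increase in horsepower costs more mpg in the later year.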


### Interaction between a qualitative and a quantitative predictor

As we discussed in the Polynomial Regression session, the Auto dataset has a qualitative predictor: the car's origin.
However, __origin__ is improperly labeled.

Last class, I chose to avoid the data cleaning needed to use __origin__ as a predictor, but tonight we are going to do it. Let's fix __origin__ so that we can use it as a qualitative predictor!

In the current version of the Auto dataset, __origin__ is an integer variable, with values of 1 for American cars, 2 for European cars, and 3 for Japanese cars.

In [None]:
Auto_df.info()

Let us use the Pandas function cut() to fix origin.

Note: An alternative to using cut() is to do this by using a loop and if-else statements.
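
For reference, the loop-based alternative mentioned in the note could look roughly like the sketch below. The list `origin_codes` is a small hypothetical sample (not the real column), and `origin_label` is a helper name invented for this example.

```python
# Hypothetical sample of origin codes, just to exercise the function.
origin_codes = [1, 3, 2, 1, 3]

def origin_label(code):
    """Map an integer origin code to its label (1, 2, 3 as described above)."""
    if code == 1:
        return 'American'
    elif code == 2:
        return 'European'
    elif code == 3:
        return 'Japanese'
    raise ValueError(f'Unexpected origin code: {code}')

# Build the list of labels one element at a time.
origin_labels = []
for code in origin_codes:
    origin_labels.append(origin_label(code))

print(origin_labels)
```

On the real DataFrame, the same helper could be applied with `Auto_df['origin'].apply(origin_label)`. One difference worth keeping in mind is that this yields plain strings, whereas cut() returns a categorical column.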

In [None]:
Auto_df['origin_cat']= pd.cut(Auto_df['origin'], bins=[0,1,2,3], labels=['American','European','Japanese'])

# By default, cut() uses right-closed bins: (0, 1], (1, 2], (2, 3].
# The lower limit of each bin is excluded, so the first edge must be below 1.

In [None]:
Auto_df[['origin','origin_cat']].head(50)

In [None]:
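# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to True/False),
# so they behave as numeric columns in the regression formula and in predict()
Auto_df_dummies=pd.get_dummies(Auto_df,columns=['origin_cat'], drop_first=True, dtype=int)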

In [None]:
Auto_df_dummies.head()

<br>

__Let's do a regression analysis that includes Origin and... What quantitative predictor?__

I am not sure what the best quantitative predictor is to combine with the two dummies created from origin.
We could do something similar to what we did in the Polynomial Regression of mpg vs. horsepower. Remember that we ran a loop to find the best predictor to add to the second-degree polynomial on horsepower.

We could run a loop to see which quantitative predictor is best to combine with the dummies created from origin.

To save time, we are going to assume we are interested in the interaction between Origin and Horsepower. Does the car's origin change the effect of horsepower on mpg?

In [None]:
regression_object2=smf.ols('mpg~ horsepower+ origin_cat_European + origin_cat_Japanese + horsepower:origin_cat_European + horsepower:origin_cat_Japanese', data=Auto_df_dummies)

In [None]:
regression_model2= regression_object2.fit()

In [None]:
regression_model2.summary() 

Let's write the equation with the interaction terms and interpret it:

In [None]:
regression_model2.params

Predicted mpg = 34.48 - 0.12 * hp + 11 * European + 14.34 * Japanese - 0.1 * (hp)(European) - 0.11 * (hp)(Japanese)

1) For an American car:

__Predicted mpg = 34.48 - 0.12 * hp__

2) For a European car:

Predicted mpg = 34.48 - 0.12 * hp + 11 * 1 - 0.1 * (hp)(1)

__Predicted mpg = 45.48 -0.22 * hp__

3) For a Japanese car:

Predicted mpg = 34.48 - 0.12 * hp + 14.34 * 1 - 0.11 * (hp)(1)

__Predicted mpg = 48.82 -0.23 * hp__
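
The three equations above follow one rule: start from the baseline (American) intercept and slope, then add the dummy coefficient to the intercept and the interaction coefficient to the slope. A small sketch of that bookkeeping, using the rounded coefficients from the equation above (the dictionary keys and the function name `line_for` are invented for this example):

```python
# Rounded coefficients copied from the "Predicted mpg" equation above.
coefs = {
    'Intercept': 34.48, 'hp': -0.12,
    'European': 11.0, 'Japanese': 14.34,
    'hp:European': -0.10, 'hp:Japanese': -0.11,
}

def line_for(origin):
    """Return (intercept, slope) of the mpg-vs-hp line for one origin."""
    intercept, slope = coefs['Intercept'], coefs['hp']
    if origin != 'American':          # American is the baseline category
        intercept += coefs[origin]
        slope += coefs['hp:' + origin]
    return round(intercept, 2), round(slope, 2)

for origin in ('American', 'European', 'Japanese'):
    print(origin, line_for(origin))
```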

<br>

__Interpretation of interaction__

The effect of horsepower on mpg is stronger for European cars than for American cars.

The effect of horsepower on mpg is stronger for Japanese cars than for American cars.
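
A steeper slope combined with a higher intercept means the European and Japanese lines eventually cross the American one. As a rough check, the sketch below solves for those crossing points using the rounded per-origin equations above; `crossover_hp` is a helper name made up for this example, and the results are only as precise as the rounded coefficients.

```python
# (intercept, slope) pairs taken from the three rounded equations above.
lines = {
    'American': (34.48, -0.12),
    'European': (45.48, -0.22),
    'Japanese': (48.82, -0.23),
}

def crossover_hp(origin_a, origin_b):
    """Horsepower at which the two origins' fitted lines intersect."""
    (a0, a1), (b0, b1) = lines[origin_a], lines[origin_b]
    # Solve a0 + a1*hp = b0 + b1*hp for hp.
    return (b0 - a0) / (a1 - b1)

print(round(crossover_hp('American', 'European'), 1))
print(round(crossover_hp('American', 'Japanese'), 1))
```

With these rounded values the crossings fall at roughly 110 hp (European) and roughly 130 hp (Japanese): below those values the model predicts higher mpg for the European or Japanese car, and above them it predicts higher mpg for the American car.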

<br>

Now let's do a scatterplot of mpg vs. horsepower, with a different regression line for each origin.

<br>

In [None]:
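# In Matplotlib 3.6 and later this style is called 'seaborn-v0_8'; older versions use 'seaborn'
plt.style.use('seaborn-v0_8')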

plt.scatter(Auto_df['horsepower'], Auto_df['mpg'],c='grey',marker='o')

plt.xlabel("Horsepower")

plt.ylabel("MPG")


# Number of rows, so the dummy arrays always match the length of horsepower
n_cars = len(Auto_df)

# To get the line for American (both dummies equal to 0)

plt.plot(Auto_df['horsepower'], regression_model2.predict({'horsepower':Auto_df['horsepower'],'origin_cat_European':np.full(n_cars,0),'origin_cat_Japanese':np.full(n_cars,0)}), c='red', ls='--', label='American')


# To get the line for European

plt.plot(Auto_df['horsepower'], regression_model2.predict({'horsepower':Auto_df['horsepower'],'origin_cat_European':np.full(n_cars,1),'origin_cat_Japanese':np.full(n_cars,0)}), c='green', ls='--', label='European')

# To get the line for Japanese

plt.plot(Auto_df['horsepower'], regression_model2.predict({'horsepower':Auto_df['horsepower'],'origin_cat_European':np.full(n_cars,0),'origin_cat_Japanese':np.full(n_cars,1)}), c='black', ls='--', label='Japanese')

# Show which line belongs to which origin
plt.legend()

#plt.figure(figsize=(12,12))
              
plt.show()