<a href="https://colab.research.google.com/github/MadhurJain06/Carseats-ISLP-models/blob/main/carseats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install ISLP

Collecting ISLP
  Downloading ISLP-0.4.0-py3-none-any.whl.metadata (7.0 kB)
Collecting lifelines (from ISLP)
  Downloading lifelines-0.30.0-py3-none-any.whl.metadata (3.2 kB)
Collecting pygam (from ISLP)
  Downloading pygam-0.10.1-py3-none-any.whl.metadata (9.7 kB)
Collecting pytorch-lightning (from ISLP)
  Downloading pytorch_lightning-2.5.5-py3-none-any.whl.metadata (20 kB)
Collecting torchmetrics (from ISLP)
  Downloading torchmetrics-1.8.2-py3-none-any.whl.metadata (22 kB)
Collecting autograd-gamma>=0.3 (from lifelines->ISLP)
  Downloading autograd-gamma-0.5.0.tar.gz (4.0 kB)
  Preparing metadata (setup.py) ... done
Collecting formulaic>=0.2.2 (from lifelines->ISLP)
  Downloading formulaic-1.2.1-py3-none-any.whl.metadata (7.0 kB)
Collecting lightning-utilities>=0.10.0 (from pytorch-lightning->ISLP)
  Downloading lightning_utilities-0.15.2-py3-none-any.whl.metadata (5.7 kB)
Collecting interface-meta>=1.2.0 (from formulaic>=0.2.2->lifelines->ISLP)
  Downloading interface_

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from ISLP import load_data
from ISLP.models import (ModelSpec,
                         summarize,
                         Column,
                         Feature,
                         build_columns)

In [None]:
Carseats = load_data('Carseats')
Carseats.columns
Carseats['ShelveLoc']


Unnamed: 0,ShelveLoc
0,Bad
1,Good
2,Medium
3,Medium
4,Bad
...,...
395,Good
396,Medium
397,Medium
398,Bad


In [None]:
MS = ModelSpec(['ShelveLoc', 'Price'])
X = MS.fit_transform(Carseats)
X.iloc[:10]

Unnamed: 0,intercept,ShelveLoc[Good],ShelveLoc[Medium],Price
0,1.0,0.0,0.0,120
1,1.0,1.0,0.0,83
2,1.0,0.0,1.0,80
3,1.0,0.0,1.0,97
4,1.0,0.0,0.0,128
5,1.0,0.0,0.0,72
6,1.0,0.0,1.0,108
7,1.0,1.0,0.0,120
8,1.0,0.0,1.0,124
9,1.0,0.0,1.0,124


We note that a column has been added for the intercept by default. This can be changed using the intercept argument.

In [None]:
MS_no1 = ModelSpec(['ShelveLoc','Price'], intercept=False)
MS_no1.fit_transform(Carseats)[:10]

Unnamed: 0,ShelveLoc[Good],ShelveLoc[Medium],Price
0,0.0,0.0,120
1,1.0,0.0,83
2,0.0,1.0,80
3,0.0,1.0,97
4,0.0,0.0,128
5,0.0,0.0,72
6,0.0,1.0,108
7,1.0,0.0,120
8,0.0,1.0,124
9,0.0,1.0,124


ShelveLoc still contributes only two columns to the design. By default, one category is treated as the base (reference) level and the remaining categories are represented by dummy variables.

A regression model normally includes a column of 1's (the intercept), but this design does not have one. That is because we dropped it ourselves with intercept=False, which makes room for the third ShelveLoc column, 'Bad'.

To include this intercept via ShelveLoc we can use 3 columns to encode this categorical variable. Following the nomenclature of R, we call this a Contrast of the categorical variable.
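
Why the intercept has to go when all three levels are encoded can be checked numerically: the three indicator columns add up to a column of ones, so keeping an intercept as well makes the design rank-deficient. The following is a minimal sketch with NumPy on a small hand-made vector of levels (hypothetical values, not the Carseats data, and not ISLP's own machinery).

```python
import numpy as np

# Hypothetical toy levels, not taken from Carseats.
levels = np.array(['Bad', 'Good', 'Medium', 'Medium', 'Bad', 'Good'])
categories = np.array(['Bad', 'Good', 'Medium'])

# One indicator column per level (no level dropped).
full_dummies = (levels[:, None] == categories[None, :]).astype(float)
intercept = np.ones((len(levels), 1))

# Intercept plus all three indicators: 4 columns.
with_intercept = np.hstack([intercept, full_dummies])
# Intercept plus two indicators ('Bad' is the reference level): 3 columns.
reference_coded = np.hstack([intercept, full_dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full_dummies)
rank_with_intercept = np.linalg.matrix_rank(with_intercept)
rank_reference = np.linalg.matrix_rank(reference_coded)

print(with_intercept.shape[1], rank_with_intercept)
print(full_dummies.shape[1], rank_full)
print(reference_coded.shape[1], rank_reference)
```

The four-column design has fewer independent columns than it has columns, so its coefficients are not uniquely determined. Dropping either the intercept or one level restores full column rank.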

### _*The job of a contrast is to convert a categorical variable into columns of a numeric design matrix, i.e. to decide the encoding.*_

When a variable is categorical (for example ShelveLoc = {Bad, Medium, Good}), it cannot be used directly in a regression.

The contrast specifies how each category is split into columns (dummy variables, effect coding, full-rank coding, etc.).
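
The difference between reference (treatment) coding and keeping every level can be seen outside ISLP as well. The sketch below uses pandas' get_dummies on a hypothetical five-row column. It is only an analogy for what a contrast decides, and the column names pandas produces (with an underscore) differ from the bracketed names ISLP uses.

```python
import pandas as pd

# Hypothetical toy column; the level order is fixed explicitly.
toy = pd.DataFrame({
    'ShelveLoc': pd.Categorical(
        ['Bad', 'Good', 'Medium', 'Medium', 'Bad'],
        categories=['Bad', 'Good', 'Medium'])
})

# Treatment (reference) coding: the first level, 'Bad', is dropped.
treatment = pd.get_dummies(toy['ShelveLoc'], prefix='ShelveLoc',
                           drop_first=True, dtype=float)

# Full coding: every level gets its own indicator column.
full = pd.get_dummies(toy['ShelveLoc'], prefix='ShelveLoc', dtype=float)

print(treatment.columns.tolist())
print(full.columns.tolist())
```

With treatment coding a row of all zeros stands for the reference level, whereas with full coding every row contains exactly one 1.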

In [None]:
from ISLP.models import contrast
shelve = contrast('ShelveLoc', None)
# Dummy encoding normally drops one category (as in "treatment coding"),
# but passing None here means no category is dropped: every category gets its own column.
MS_contr = ModelSpec([shelve, 'Price'],intercept=False)
MS_contr.fit_transform(Carseats)[:10]

Unnamed: 0,ShelveLoc[Bad],ShelveLoc[Good],ShelveLoc[Medium],Price
0,1.0,0.0,0.0,120
1,0.0,1.0,0.0,83
2,0.0,0.0,1.0,80
3,0.0,0.0,1.0,97
4,1.0,0.0,0.0,128
5,1.0,0.0,0.0,72
6,0.0,0.0,1.0,108
7,0.0,1.0,0.0,120
8,0.0,0.0,1.0,124
9,0.0,0.0,1.0,124


In [None]:
shelve

Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=(), encoder=Contrast(method=None))

In [None]:
shelve.get_columns(Carseats)

(array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        ...,
        [0., 0., 1.],
        [1., 0., 0.],
        [0., 1., 0.]]),
 ['ShelveLoc[Bad]', 'ShelveLoc[Good]', 'ShelveLoc[Medium]'])

Fit a simple OLS model with this design.

In [None]:
X = MS_contr.transform(Carseats)
Y = Carseats['Sales']
M_ols = sm.OLS(Y, X).fit()
summarize(M_ols)

Unnamed: 0,coef,std err,t,P>|t|
ShelveLoc[Bad],12.0018,0.503,23.839,0.0
ShelveLoc[Good],16.8976,0.522,32.386,0.0
ShelveLoc[Medium],13.8638,0.487,28.467,0.0
Price,-0.0567,0.004,-13.967,0.0
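
Because this design has one indicator per shelf location and no intercept, each ShelveLoc coefficient plays the role of an intercept for that location. The short calculation below uses only the coefficients printed in the table above. The differences from 'Bad' are what a reference-coded fit with 'Bad' as the base level would report, up to rounding of the printed values. The helper predicted_sales is a hypothetical name introduced for illustration.

```python
# Coefficients copied from the summary table above (rounded as printed).
coef = {
    'ShelveLoc[Bad]': 12.0018,
    'ShelveLoc[Good]': 16.8976,
    'ShelveLoc[Medium]': 13.8638,
    'Price': -0.0567,
}

# Level effects relative to the 'Bad' shelf location.
good_vs_bad = coef['ShelveLoc[Good]'] - coef['ShelveLoc[Bad]']
medium_vs_bad = coef['ShelveLoc[Medium]'] - coef['ShelveLoc[Bad]']


def predicted_sales(level, price):
    """Fitted Sales for a given shelf location and price (hypothetical helper)."""
    return coef[f'ShelveLoc[{level}]'] + coef['Price'] * price


print(round(good_vs_bad, 4), round(medium_vs_bad, 4))
print(round(predicted_sales('Good', 100), 4))
```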




# Interactions

In [None]:
ModelSpec([(shelve,'Price'),'Price']).fit_transform(Carseats).iloc[:10]

Unnamed: 0,intercept,ShelveLoc[Bad]:Price,ShelveLoc[Good]:Price,ShelveLoc[Medium]:Price,Price
0,1.0,120.0,0.0,0.0,120
1,1.0,0.0,83.0,0.0,83
2,1.0,0.0,0.0,80.0,80
3,1.0,0.0,0.0,97.0,97
4,1.0,128.0,0.0,0.0,128
5,1.0,72.0,0.0,0.0,72
6,1.0,0.0,0.0,108.0,108
7,1.0,0.0,120.0,0.0,120
8,1.0,0.0,0.0,124.0,124
9,1.0,0.0,0.0,124.0,124
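
An interaction between a categorical variable and a numeric one amounts to multiplying each indicator column by the numeric column. The sketch below uses NumPy rather than ISLP and copies the first five ShelveLoc and Price values from the outputs above to rebuild the three interaction columns by hand.

```python
import numpy as np

# First five rows of ShelveLoc and Price, copied from the outputs above.
levels = np.array(['Bad', 'Good', 'Medium', 'Medium', 'Bad'])
price = np.array([120.0, 83.0, 80.0, 97.0, 128.0])
categories = ['Bad', 'Good', 'Medium']

# One indicator column per level.
dummies = np.column_stack([(levels == c).astype(float) for c in categories])

# Interaction columns: each indicator multiplied elementwise by Price.
interaction = dummies * price[:, None]

print(interaction)
print(interaction.sum(axis=1))
```

Each row carries its Price in exactly one interaction column, so the three columns add up to Price itself. A design that contains all three interaction columns together with Price is therefore collinear.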


# Ordinal Variables

In [None]:
Carseats['OIncome'] = pd.cut(Carseats['Income'],
                             [0,50,90,200],
                             labels=['L','M','H'])
MS_order = ModelSpec(['OIncome']).fit(Carseats)
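
How pd.cut assigns values to the three bins is easiest to see on a handful of numbers. The sketch below applies the same bin edges and labels to a hypothetical income series. With the default settings each bin includes its right edge, and the result is an ordered categorical.

```python
import pandas as pd

# Hypothetical incomes chosen to sit on and around the bin edges.
income = pd.Series([21, 50, 51, 90, 120])

# Same edges and labels as above: (0, 50] -> L, (50, 90] -> M, (90, 200] -> H.
binned = pd.cut(income, [0, 50, 90, 200], labels=['L', 'M', 'H'])

print(binned.tolist())
print(binned.cat.ordered)
print(binned.cat.categories.tolist())
```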

In [None]:
MS_order.column_info_

{'Sales': Column(idx='Sales', name='Sales', is_categorical=False, is_ordinal=False, columns=('Sales',), encoder=None),
 'CompPrice': Column(idx='CompPrice', name='CompPrice', is_categorical=False, is_ordinal=False, columns=('CompPrice',), encoder=None),
 'Income': Column(idx='Income', name='Income', is_categorical=False, is_ordinal=False, columns=('Income',), encoder=None),
 'Advertising': Column(idx='Advertising', name='Advertising', is_categorical=False, is_ordinal=False, columns=('Advertising',), encoder=None),
 'Population': Column(idx='Population', name='Population', is_categorical=False, is_ordinal=False, columns=('Population',), encoder=None),
 'Price': Column(idx='Price', name='Price', is_categorical=False, is_ordinal=False, columns=('Price',), encoder=None),
 'ShelveLoc': Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=np.True_, is_ordinal=np.False_, columns=('ShelveLoc[Good]', 'ShelveLoc[Medium]'), encoder=Contrast()),
 'Age': Column(idx='Age', name='Age', is_categor

## Structure of ModelSpec
### The first argument of ModelSpec is stored as the terms attribute.
### When the spec is fitted, this sequence is used to produce the terms_ attribute, which specifies the objects that will ultimately create the design.

In [None]:
MS = ModelSpec(['ShelveLoc', 'Price'])
MS.fit(Carseats)
MS.terms_

[Feature(variables=('ShelveLoc',), name='ShelveLoc', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False),
 Feature(variables=('Price',), name='Price', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False)]

In [None]:
shelve_var = MS.terms_[0]

We can find the columns associated with each term using the build_columns function.

In [None]:
df, names = build_columns(MS.column_info_,
                          Carseats,
                          shelve_var)
df

Unnamed: 0,ShelveLoc[Good],ShelveLoc[Medium]
0,0.0,0.0
1,1.0,0.0
2,0.0,1.0
3,0.0,1.0
4,0.0,0.0
...,...,...
395,1.0,0.0
396,0.0,1.0
397,0.0,1.0
398,0.0,0.0


# Feature Objects
Each element of terms_ should be a Feature, which describes a set of columns to be extracted from a columnar data form, together with an optional encoder.

Feature objects have a tuple of variables as well as an encoder attribute.

The columns named in the variables tuple are first concatenated into a single dataframe, which is then run through encoder.transform.


In [None]:
new_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=None)
build_columns(MS.column_info_,
              Carseats,
              new_var)[0]

Unnamed: 0,Price,Income,OIncome
0,120.0,73.0,2.0
1,83.0,48.0,1.0
2,80.0,35.0,1.0
3,97.0,100.0,0.0
4,128.0,64.0,2.0
...,...,...,...
395,128.0,108.0,0.0
396,120.0,23.0,1.0
397,159.0,26.0,1.0
398,95.0,79.0,2.0


To transform these columns with an encoder, we first build the array above and call pca.fit on it; pca.transform is then applied within build_columns.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(build_columns(MS.column_info_, Carseats, new_var)[0])
pca_var = Feature(('Price','Income','OIncome'),name='mynewvar', encoder=pca)
build_columns(MS.column_info_, Carseats, pca_var)[0]

Unnamed: 0,mynewvar[0],mynewvar[1]
0,3.595740,4.850530
1,-15.070401,-35.706773
2,-27.412228,-40.772377
3,33.983048,-13.468087
4,-6.580644,11.287452
...,...,...
395,36.856308,18.418138
396,-45.731520,-3.243768
397,-49.087659,35.727136
398,13.565178,-18.847760
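
What a fitted PCA encoder does at transform time can be sketched with plain NumPy: subtract the column means learned during fitting, then project onto the leading principal directions. The example below is an illustration under that assumption and not the scikit-learn implementation. It uses the first five Price, Income, OIncome rows shown earlier, so its numbers will not match the full-data output above, and the sign of each component is arbitrary.

```python
import numpy as np

# First five (Price, Income, OIncome) rows copied from the earlier table.
X = np.array([
    [120.0, 73.0, 2.0],
    [83.0, 48.0, 1.0],
    [80.0, 35.0, 1.0],
    [97.0, 100.0, 0.0],
    [128.0, 64.0, 2.0],
])

# "Fit": learn the column means and the principal directions.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:2]  # the two leading directions

# "Transform": center with the stored means, then project.
scores = (X - mean) @ components.T

# The same stored mean and directions are reused for a new row (hypothetical values).
new_row = np.array([[100.0, 60.0, 1.0]])
new_scores = (new_row - mean) @ components.T

print(scores.shape, new_scores.shape)
```

Keeping the fit and transform steps separate is what allows an encoder fitted once on the training data to be applied unchanged to new data.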


# Predicting at new points


In [None]:
MS = ModelSpec(['Price','Income']).fit(Carseats)
X = MS.transform(Carseats)
Y = Carseats['Sales']
M_ols = sm.OLS(Y,X).fit()
M_ols.params

Unnamed: 0,0
intercept,12.661546
Price,-0.052213
Income,0.012829


As ModelSpec is a transformer, it can be evaluated at new feature values. The transform method constructs the design matrix at any such values.

In [None]:
new_data = pd.DataFrame({'Price':[40,50], 'Income':[10,20]})
new_X = MS.transform(new_data)
M_ols.get_prediction(new_X).predicted_mean

array([10.70130676, 10.307465  ])
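
These predicted means are simply the fitted linear combination evaluated at the new points. The check below recomputes them by hand from the coefficients printed above. Since those coefficients are rounded to six decimals, the results agree with the output only to about four decimal places. The function predict is a hypothetical helper, not part of statsmodels or ISLP.

```python
# Coefficients copied from M_ols.params above (rounded as printed).
params = {'intercept': 12.661546, 'Price': -0.052213, 'Income': 0.012829}

# The same two new points as in new_data.
new_points = [
    {'Price': 40, 'Income': 10},
    {'Price': 50, 'Income': 20},
]


def predict(point):
    """Intercept plus each coefficient times its feature value (hypothetical helper)."""
    return (params['intercept']
            + params['Price'] * point['Price']
            + params['Income'] * point['Income'])


predictions = [predict(p) for p in new_points]
print([round(p, 4) for p in predictions])
```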