**Categorical Variables in Linear Models**

We need to discuss the inclusion of categorical variables as predictors in linear models (including logistic regression, which strictly speaking is a generalized linear model).

A categorical predictor is usually referred to as a *factor* and the values it can take on are referred to as its *levels*.

A new variable *emp_status* has been added to the mortgage data.

In [2]:
import pandas as pd
import patsy as ps

df=pd.read_csv("mortgage_data2.csv")
print(df)
print(df.emp_status.value_counts())

      location  princ  irate  cscore emp_status  default
0     suburban    358   7.00     728      ftime      0.0
1     suburban    637   7.25     675      ftime      1.0
2     suburban    303   7.25     645      noemp      0.0
3     suburban    397   7.25     609      noemp      0.0
4     suburban    420   7.75     669      noemp      0.0
...        ...    ...    ...     ...        ...      ...
9859  suburban    769   7.75     586      noemp      0.0
9860  suburban    451   7.25     684      ftime      1.0
9861  suburban    410   7.00     702      ftime      1.0
9862  suburban    851   7.00     774      ptime      0.0
9863  suburban    260   7.50     657      ftime      1.0

[9864 rows x 6 columns]
emp_status
ftime    7457
ptime    1449
noemp     958
Name: count, dtype: int64


We want to fit a logistic regression model in which location and emp_status are included as factors.

We also wish to include an intercept parameter in our model. So we create a model in which the conditional probability of default is of the form
$$
\frac{1}{1+\exp(-L)}
$$

where

$$
L = \beta_0 + \left( \begin{array}{c} \beta_{suburban}\\ \beta_{urban}\\ \beta_{rural}\end{array} \right)
+ \beta_{princ} \times princ + \beta_{irate} \times irate + \beta_{cscore} \times cscore
+ \left( \begin{array}{c} \beta_{noemp}\\ \beta_{ptime}\\ \beta_{ftime}\end{array} \right)
$$

where all of the $\beta$ terms are unknown. We interpret this equation as saying that, for an applicant for a suburban home who is employed part time, with given values of princ, irate and cscore, the expression for $L$ is

$$
\beta_0 + \beta_{suburban} + \beta_{princ} \times princ + \beta_{irate} \times irate + \beta_{cscore} \times cscore
+ \beta_{ptime}
$$

and similarly, for an applicant for an urban home who is employed full time, it is

$$
\beta_0 + \beta_{urban} + \beta_{princ} \times princ + \beta_{irate} \times irate + \beta_{cscore} \times cscore
+ \beta_{ftime}
$$

and so on.
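As a small numerical illustration of this "pick one term per factor" reading, the sketch below evaluates $L$ and the resulting probability for a suburban, part-time applicant. All coefficient values are hypothetical, chosen only for illustration and not fitted to the mortgage data; the function names are likewise made up for this example.

```python
import math

# Hypothetical coefficient values, chosen only for illustration (not fitted).
beta = {
    "intercept": -2.0,
    "suburban": 0.1, "urban": 0.3, "rural": -0.2,
    "princ": 0.001, "irate": 0.2, "cscore": -0.002,
    "noemp": 0.8, "ptime": 0.4, "ftime": 0.0,
}

def linear_predictor(location, emp_status, princ, irate, cscore):
    """Add the intercept, one location term, one emp_status term and the numeric terms."""
    return (beta["intercept"] + beta[location] + beta[emp_status]
            + beta["princ"] * princ
            + beta["irate"] * irate
            + beta["cscore"] * cscore)

def default_probability(L):
    """Logistic transform 1 / (1 + exp(-L))."""
    return 1.0 / (1.0 + math.exp(-L))

L = linear_predictor("suburban", "ptime", princ=851, irate=7.0, cscore=774)
p = default_probability(L)
print(L, p)
```

Changing the location or emp_status argument swaps which single coefficient from each factor enters the sum; the numeric terms are unaffected.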

We can interpret this model using what we refer to as *dummy variables*. For each value that location can take on (its *levels*), e.g. suburban, we create an *indicator* of that value: a 0/1-valued column that is 1 when the location variable equals that value and 0 otherwise. We do the same for emp_status.

If we denote the dummy location variables by $I_{suburban},$ $I_{urban},$ $I_{rural},$ the dummy emp_status variables by $I_{noemp},$ $I_{ptime},$ $I_{ftime},$ and introduce a variable $int$ that always takes the value 1, our model takes the form

$$
\begin{aligned}
L = {} & \beta_0 \times int + \beta_{suburban} \times I_{suburban} + \beta_{urban} \times I_{urban} + \beta_{rural} \times I_{rural} \\
& + \beta_{princ} \times princ + \beta_{irate} \times irate + \beta_{cscore} \times cscore \\
& + \beta_{noemp} \times I_{noemp} + \beta_{ptime} \times I_{ptime} + \beta_{ftime} \times I_{ftime}
\end{aligned}
$$

and having done this, $L$ is represented as a linear combination of known variables with unknown coefficients to be determined.
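The linear-combination form can be sketched by building the indicator columns by hand and multiplying by a coefficient vector. The toy rows and the coefficient values below are made up for illustration; only the column layout follows the equation above.

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    "location": ["suburban", "urban", "rural", "suburban"],
    "emp_status": ["ptime", "ftime", "noemp", "ftime"],
    "princ": [851, 400, 300, 358],
    "irate": [7.0, 7.25, 7.5, 7.0],
    "cscore": [774, 650, 600, 728],
})

# Build the columns in the same order as the terms of L.
design = pd.DataFrame({"int": np.ones(len(toy))})
for level in ["suburban", "urban", "rural"]:
    design["I_" + level] = (toy["location"] == level).astype(int)
for col in ["princ", "irate", "cscore"]:
    design[col] = toy[col]
for level in ["noemp", "ptime", "ftime"]:
    design["I_" + level] = (toy["emp_status"] == level).astype(int)

# Hypothetical coefficients, one per column of `design`, in the same order.
beta = np.array([-2.0, 0.1, 0.3, -0.2, 0.001, 0.2, -0.002, 0.8, 0.4, 0.0])

# One value of L per row: a linear combination of known columns.
L = design.to_numpy(dtype=float) @ beta
print(design)
print(L)
```

Each row has exactly one 1 among the location indicators and exactly one 1 among the emp_status indicators, so the matrix product selects one coefficient from each factor.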


**Identifiability of the Parameters and Reference Values**

When we consider $L$ of the above form we run into a technical issue: the parameters are not identifiable. This means that we can set the values of the parameters in multiple ways that lead to the same value of $L$ and hence the same predicted probabilities.

For example starting with some choice of $\beta_0,$ $\beta_{suburban},$ $\beta_{urban}$ and $\beta_{rural}$ for any constant $c$ we can replace $\beta_0$ by $\beta_0+c$ and $\beta_{suburban},$ $\beta_{urban}$ and $\beta_{rural}$ by
$\beta_{suburban}-c,$ $\beta_{urban}-c$ and $\beta_{rural}-c$  and we get the same value of $L$.
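This shift can be checked numerically. The sketch below uses a toy design matrix with an intercept column and all three location indicators, plus hypothetical coefficient values; it also shows that the matrix has fewer independent columns than it has columns, which is the linear-algebra face of the same problem.

```python
import numpy as np

# Columns: int, I_suburban, I_urban, I_rural (toy rows, made up for illustration).
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
], dtype=float)

beta = np.array([-2.0, 0.1, 0.3, -0.2])  # hypothetical values
c = 5.0
# Add c to the intercept and subtract c from every location coefficient.
beta_shifted = beta + np.array([c, -c, -c, -c])

print(X @ beta)
print(X @ beta_shifted)                       # same values of L
print(np.linalg.matrix_rank(X), X.shape[1])   # rank is below the number of columns
```

Because the three indicator columns sum to the intercept column, the columns are linearly dependent and no fitting procedure can single out one coefficient vector.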

To remedy this, we can pick one of the levels and view it as a *reference*. If we take the reference level for location to be suburban, we set $\beta_{suburban}=0$ and interpret $\beta_{urban}$ and $\beta_{rural}$ as *adjustments* to be made, relative to the reference location, for an application in the urban or rural areas.

We can also select a level of the emp_status variable, e.g. ftime, as the reference level, so $\beta_{ftime}$ is taken to be zero
and the other terms $\beta_{ptime}$ and $\beta_{noemp}$ become adjustments to be made for those whose employment status is ptime or noemp.

Now our model can be written in the form

$$
L = \beta_0 + \left( \begin{array}{c} 0\\ \beta_{urban}\\ \beta_{rural}\end{array} \right)
+ \beta_{princ} \times princ + \beta_{irate} \times irate + \beta_{cscore} \times cscore
+ \left( \begin{array}{c} \beta_{noemp}\\ \beta_{ptime}\\ 0\end{array} \right)
$$

and the identifiability issue goes away. 

So when we create dummy variables for a given factor, we need to drop one of the levels and view it as reference.
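Fixing reference levels loses nothing: any unconstrained choice of parameters can be re-expressed in reference form without changing $L$. As a short worked derivation (the primed symbols are notation introduced here for the re-expressed parameters), take suburban and ftime as the reference levels and set

$$
\beta_0' = \beta_0 + \beta_{suburban} + \beta_{ftime}
$$

$$
\beta_{urban}' = \beta_{urban} - \beta_{suburban}, \qquad \beta_{rural}' = \beta_{rural} - \beta_{suburban}, \qquad \beta_{suburban}' = 0
$$

$$
\beta_{noemp}' = \beta_{noemp} - \beta_{ftime}, \qquad \beta_{ptime}' = \beta_{ptime} - \beta_{ftime}, \qquad \beta_{ftime}' = 0
$$

with the coefficients of princ, irate and cscore left unchanged. For example, for an urban, part-time applicant the factor terms give

$$
\beta_0' + \beta_{urban}' + \beta_{ptime}' = \beta_0 + \beta_{urban} + \beta_{ptime}
$$

so $L$ is the same as before, and the same check works for every combination of levels.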


We can create dummy variables for a given categorical variable using pandas *get_dummies* function. We'll use the prefix I attached to the name of our dummy variables so that we don't reuse the old variable names.

This function produces a new data frame with those dummy variables included.

Observe that the values are Boolean but these are interpreted by sklearn as 0/1-valued variables.

In [93]:
pd.get_dummies(df["location"],prefix="I")

Unnamed: 0,I_rural,I_suburban,I_urban
0,False,True,False
1,False,True,False
2,False,True,False
3,False,True,False
4,False,True,False
...,...,...,...
9859,False,True,False
9860,False,True,False
9861,False,True,False
9862,False,True,False
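If 0/1 integer columns are preferred over Boolean ones, get_dummies accepts a dtype argument. A minimal sketch on a toy Series (the data is made up for illustration):

```python
import pandas as pd

# Toy location values, made up for illustration.
toy = pd.Series(["suburban", "urban", "rural", "suburban"], name="location")

# dtype=int requests integer 0/1 columns instead of Boolean ones.
dummies_int = pd.get_dummies(toy, prefix="I", dtype=int)
print(dummies_int)
```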


**Dropping columns**

By default get_dummies keeps a column for every level, so no reference level has been dropped yet. We have a couple of options for doing that.

**Option 1**: use drop_first, which drops the first level when the names are sorted alphabetically.

In [94]:
pd.get_dummies(df["location"],prefix="I",drop_first=True)

Unnamed: 0,I_suburban,I_urban
0,True,False
1,True,False
2,True,False
3,True,False
4,True,False
...,...,...
9859,True,False
9860,True,False
9861,True,False
9862,True,False
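With drop_first the dropped level is whichever comes first, which for plain strings is the alphabetically first one (rural here). One way to steer this, sketched below on toy data, is to convert the column to a categorical type with an explicit category order so that the intended reference level comes first. This is an illustrative variation, not a step used elsewhere in these notes.

```python
import pandas as pd

# Toy location values, made up for illustration.
location = pd.Series(["suburban", "urban", "rural", "suburban"], name="location")

# Put the intended reference level (suburban) first in the category order.
ordered_type = pd.CategoricalDtype(categories=["suburban", "urban", "rural"])
ref_first = location.astype(ordered_type)

# drop_first now drops suburban rather than rural.
dummies = pd.get_dummies(ref_first, prefix="I", drop_first=True)
print(dummies)
```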


**Option 2:** drop a column after creating all dummies (which allows us to select which one to drop).

We'll use get_dummies to get all of our dummy variables and drop the ones we want to drop.

In [95]:
catvars=df.loc[:,["location","emp_status"]]
dummies=pd.get_dummies(catvars, prefix="I")
dummies.drop(columns=["I_suburban","I_ftime"],inplace=True)
dummies

Unnamed: 0,I_rural,I_urban,I_noemp,I_ptime
0,False,False,False,False
1,False,False,False,False
2,False,False,True,False
3,False,False,True,False
4,False,False,True,False
...,...,...,...,...
9859,False,False,True,False
9860,False,False,False,False
9861,False,False,False,False
9862,False,False,False,True
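Using the single prefix I for every factor works here because no two factors share a level name. If they did, the resulting column names would collide. A hedged alternative is to pass a dict mapping each column to its own prefix; the prefixes loc and emp below are arbitrary names chosen for this sketch, and the data is a toy frame.

```python
import pandas as pd

# Toy frame containing every level of both factors, made up for illustration.
toy = pd.DataFrame({
    "location": ["suburban", "urban", "rural", "suburban"],
    "emp_status": ["ftime", "noemp", "ptime", "ftime"],
})

# One prefix per factor, so dummy names record which factor they came from.
dummies = pd.get_dummies(toy, prefix={"location": "loc", "emp_status": "emp"})

# Drop the chosen reference level of each factor.
dummies = dummies.drop(columns=["loc_suburban", "emp_ftime"])
print(dummies)
```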


Now we create a data frame that combines the numeric predictors with the dummy variables, and drop the original categorical columns, which are no longer needed.

In [96]:
df2=pd.concat([df,dummies],axis=1)
Y=df2.default
df2.drop(columns=["location","emp_status"],inplace=True)
df2

Unnamed: 0,princ,irate,cscore,default,I_rural,I_urban,I_noemp,I_ptime
0,358,7.00,728,0.0,False,False,False,False
1,637,7.25,675,1.0,False,False,False,False
2,303,7.25,645,0.0,False,False,True,False
3,397,7.25,609,0.0,False,False,True,False
4,420,7.75,669,0.0,False,False,True,False
...,...,...,...,...,...,...,...,...
9859,769,7.75,586,0.0,False,False,True,False
9860,451,7.25,684,1.0,False,False,False,False
9861,410,7.00,702,1.0,False,False,False,False
9862,851,7.00,774,0.0,False,False,False,True


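We can now use these to fit our logistic regression model using sklearn. Note that df2 still contains the response column default, so we drop it when passing the predictors to fit; otherwise the response would be used to predict itself.
(Here we don't bother to separate training and testing.)

In [97]:
from sklearn.linear_model import LogisticRegression
# Exclude the response from the predictors; LogisticRegression fits its own intercept by default.
clf = LogisticRegression().fit(df2.drop(columns=["default"]),Y)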
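Once a model is fitted, the estimated intercept and coefficients can be read from the intercept_ and coef_ attributes and lined up with the predictor names. The sketch below uses synthetic data generated from a made-up logistic model, so the numbers say nothing about the mortgage data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors: one numeric column and one 0/1 dummy column.
X = pd.DataFrame({
    "irate": rng.uniform(6.5, 8.0, n),
    "I_noemp": rng.integers(0, 2, n),
})

# Synthetic response drawn from a made-up logistic model.
L = -8.0 + 1.0 * X["irate"] + 1.0 * X["I_noemp"]
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-L))).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has one row per fitted binary problem; pair it with the column names.
coefs = pd.Series(clf.coef_[0], index=X.columns)
print(clf.intercept_[0])
print(coefs)
```

The coefficient on a dummy column is the adjustment to $L$ relative to the reference level of that factor. Note that LogisticRegression applies L2 regularization by default, so the estimates are shrunk somewhat towards zero.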

**Patsy**

Patsy is a nice package for setting up linear models to be fit in sklearn.

It creates the matrices needed by modeling methods (like regression) in sklearn:

- the matrix of predictor variable columns i.e. the *design matrix*
- the column of response variable values

It allows us to specify models using *formulas* (as in R) rather than by doing things by hand. 

In particular, it handles the creation of dummies.
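To see which dummy columns patsy creates, we can inspect the column names of the design matrix. The sketch below uses a small made-up frame. By default patsy uses treatment coding with the first level in sorted order as the reference; the second formula shows one way to choose the reference explicitly, using patsy's C(...) and Treatment(reference=...) syntax.

```python
import pandas as pd
import patsy as ps

# Toy frame, made up for illustration.
toy = pd.DataFrame({
    "default": [0, 1, 0, 1, 0, 1],
    "location": ["suburban", "urban", "rural", "suburban", "urban", "rural"],
    "irate": [7.0, 7.25, 7.5, 7.0, 7.75, 7.25],
})

# Default coding: the first level in sorted order (rural) is the reference.
_, X_default = ps.dmatrices("default ~ location + irate", toy)
print(X_default.design_info.column_names)

# Explicitly choose suburban as the reference level.
formula = "default ~ C(location, Treatment(reference='suburban')) + irate"
_, X_ref = ps.dmatrices(formula, toy)
print(X_ref.design_info.column_names)
```

In both cases an Intercept column of ones is included and one level of the factor is left out, which is the reference-level construction described earlier.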

We go back to using the original dataset.

In [98]:
import numpy as np
import patsy as ps

formula="default~location+princ+irate+cscore+emp_status"
Y,X=ps.dmatrices(formula,df)
Y=np.squeeze(Y)
clf = LogisticRegression().fit(X,Y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


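The fit completes but raises a convergence warning: the solver reached its iteration limit before converging. As the warning suggests, this can be addressed by increasing max_iter or by scaling the predictors, which here are on very different scales (compare princ and cscore with irate). The main point was to illustrate how patsy did a lot of the work for us by creating the intercept column and the dummy variables directly from the formula.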
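A hedged sketch of the two remedies named in the warning, scaling the predictors and raising max_iter, using an sklearn pipeline. The data is synthetic, with column scales loosely mimicking princ, irate and cscore, and the response is random, so only the mechanics are illustrated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 1000

# Synthetic predictors on very different scales.
X = np.column_stack([
    rng.uniform(200, 900, n),   # roughly the range of princ
    rng.uniform(6.5, 8.0, n),   # roughly the range of irate
    rng.uniform(550, 800, n),   # roughly the range of cscore
])
y = rng.integers(0, 2, n)       # random 0/1 response, for illustration only

# Standardize each column, then fit with a larger iteration limit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Number of iterations the solver actually used.
n_iter = model.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Scaling changes the units of the coefficients (they become effects per standard deviation), so it is a trade-off between solver behaviour and direct interpretability.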

In [99]:
import patsy as ps

formula="default~irate+cscore"
Y,X=ps.dmatrices(formula,df)
Y=np.squeeze(Y)
clf = LogisticRegression().fit(X,Y)

**How to use patsy if we separate training and testing**

When we build a model on training data and determine its performance on testing data, we need to make sure that we use the same operations to produce the design matrices for the training and testing data frames. To do this we use the following approach.


In [103]:
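# Shuffle the row labels, then use 9000 rows for training and the rest for testing.
I=np.random.permutation(range(df.shape[0]))
dftrain=df.loc[I[0:9000]]
dftest=df.loc[I[9000:]]
formula="default~irate"
# Build the response and design matrix from the training data only.
Ytrain,Xtrain=ps.dmatrices(formula,dftrain)
Ytrain=np.squeeze(Ytrain)
clf = LogisticRegression().fit(Xtrain,Ytrain)
Ytest=dftest.default
# Reuse the training design_info so the test design matrix is built the same way.
Xtest=ps.build_design_matrices([Xtrain.design_info],dftest)[0]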
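With the test design matrix built from the training design_info, the fitted model can be evaluated on the held-out rows. The sketch below repeats the pattern end to end on a synthetic frame (random response, made-up values) and includes a factor, to show that the test matrix gets the same columns as the training matrix.

```python
import numpy as np
import pandas as pd
import patsy as ps
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 300

# Synthetic data, for illustration only.
toy = pd.DataFrame({
    "irate": rng.uniform(6.5, 8.0, n),
    "emp_status": rng.choice(["ftime", "ptime", "noemp"], n),
})
toy["default"] = rng.integers(0, 2, n).astype(float)

dftrain, dftest = toy.iloc[:250], toy.iloc[250:]

# Design matrices from the training rows.
Ytrain, Xtrain = ps.dmatrices("default ~ irate + emp_status", dftrain)
clf = LogisticRegression(max_iter=1000).fit(np.asarray(Xtrain), np.squeeze(np.asarray(Ytrain)))

# Reuse the training design_info so the test columns match the training columns.
Xtest = np.asarray(ps.build_design_matrices([Xtrain.design_info], dftest)[0])
Ytest = dftest["default"].to_numpy()

accuracy = clf.score(Xtest, Ytest)          # fraction of correct 0/1 predictions
probs = clf.predict_proba(Xtest)[:, 1]      # predicted probability of default
print(Xtest.shape, accuracy)
```

Since the response here is random, the accuracy itself is not meaningful; the point is that Xtest has the same number and order of columns as Xtrain.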