# 13.1 Interfacing Between pandas and Model Code

A common workflow for model development is to use pandas for data loading and cleaning before switching over to a modeling library to build the model itself. An important part of this process is what machine learning calls feature engineering: any data transformation or analytics that extracts information from a raw dataset that may be useful in a modeling context

The point of contact between pandas and other analysis libraries is usually NumPy arrays. To turn a DataFrame into a NumPy array, use the .values property

In [1]:
import pandas as pd
import numpy as np 

data = pd.DataFrame({
    'x0': [1,2,3,4,5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})
data

,x0,x1,y
0,1,0.01,-1.5
1,2,-0.01,0.0
2,3,0.25,3.6
3,4,-4.1,1.3
4,5,0.0,-2.0


In [2]:
data.columns

Index(['x0', 'x1', 'y'], dtype='object')

In [3]:
data.index

RangeIndex(start=0, stop=5, step=1)

In [4]:
data.values

array([[ 1.  ,  0.01, -1.5 ],
       [ 2.  , -0.01,  0.  ],
       [ 3.  ,  0.25,  3.6 ],
       [ 4.  , -4.1 ,  1.3 ],
       [ 5.  ,  0.  , -2.  ]])

In [5]:
type(data.values)

numpy.ndarray

In [6]:
df2 = pd.DataFrame(data.values, columns = ['one', 'two', 'three'])
df2

,one,two,three
0,1.0,0.01,-1.5
1,2.0,-0.01,0.0
2,3.0,0.25,3.6
3,4.0,-4.1,1.3
4,5.0,0.0,-2.0


The .values attribute is intended to be used when the data is homogeneous, for example, all numeric types. If you have heterogeneous data, the result will be an ndarray of Python objects

In [7]:
df3 = data.copy()
df3['strings'] = ['a', 'b', 'c', 'd', 'e']
df3

,x0,x1,y,strings
0,1,0.01,-1.5,a
1,2,-0.01,0.0,b
2,3,0.25,3.6,c
3,4,-4.1,1.3,d
4,5,0.0,-2.0,e


In [8]:
df3.values

array([[1, 0.01, -1.5, 'a'],
       [2, -0.01, 0.0, 'b'],
       [3, 0.25, 3.6, 'c'],
       [4, -4.1, 1.3, 'd'],
       [5, 0.0, -2.0, 'e']], dtype=object)

In [13]:
type(df3.values)

numpy.ndarray
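
As a complement to .values, pandas 0.24 and later also provide the DataFrame.to_numpy method. The sketch below rebuilds df3 so that it runs on its own, and it assumes that selecting the numeric columns first (with select_dtypes) is acceptable for your model; that selection step is an illustration, not something the text above prescribes.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.],
    'strings': ['a', 'b', 'c', 'd', 'e'],
})

# Mixed column types fall back to a generic object array.
mixed = df3.to_numpy()

# Keeping only the numeric columns gives a homogeneous float64 array.
numeric = df3.select_dtypes(include='number').to_numpy()

print(mixed.dtype)    # object
print(numeric.dtype)  # float64
print(numeric.shape)  # (5, 3)
```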

For some models, you may only wish to use a subset of the columns. Use loc indexing with values

In [14]:
model_cols = ['x0', 'x1']
data.loc[:, model_cols].values

array([[ 1.  ,  0.01],
       [ 2.  , -0.01],
       [ 3.  ,  0.25],
       [ 4.  , -4.1 ],
       [ 5.  ,  0.  ]])

Some libraries have native support for pandas and do some of this work for you automatically: converting to NumPy from DataFrame and attaching model parameter names to the columns of output tables or Series. In other cases, you will have to perform this metadata management manually
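
A minimal sketch of what manual metadata management can look like, assuming a plain least squares fit with numpy.linalg.lstsq as the "model" (chosen only for illustration). NumPy returns a bare array of coefficients, so the column names are reattached by hand with a Series.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.],
})

model_cols = ['x0', 'x1']
X = data[model_cols].to_numpy(dtype=float)
y = data['y'].to_numpy()

# NumPy knows nothing about column names, so the result is a bare array.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Reattach the names by hand so each coefficient can be looked up by column.
coef = pd.Series(coef, index=model_cols)

# Fitted values get the original row index back in the same way.
fitted = pd.Series(X @ coef.to_numpy(), index=data.index, name='fitted')
```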

Categorical variables need extra handling. Suppose the example dataset has a non-numeric column

In [15]:
data['category'] = pd.Categorical(values = ['a', 'b', 'a', 'a', 'b'], 
                                  categories = ['a', 'b'])
data

,x0,x1,y,category
0,1,0.01,-1.5,a
1,2,-0.01,0.0,b
2,3,0.25,3.6,a
3,4,-4.1,1.3,a
4,5,0.0,-2.0,b


To replace the 'category' column with dummy variables, we create the dummy variables, drop the 'category' column, and then join the result

In [16]:
dummies = pd.get_dummies(data.category, prefix = 'category')
data_with_dummies = data.drop('category', axis = 1).join(dummies)
data_with_dummies

,x0,x1,y,category_a,category_b
0,1,0.01,-1.5,1,0
1,2,-0.01,0.0,0,1
2,3,0.25,3.6,1,0
3,4,-4.1,1.3,1,0
4,5,0.0,-2.0,0,1


In [17]:
dummies

,category_a,category_b
0,1,0
1,0,1
2,1,0
3,1,0
4,0,1
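
The create, drop, and join steps can also be collapsed into one call by passing the whole DataFrame to pd.get_dummies with the columns argument. This is a sketch on a trimmed-down copy of the example data; the dtype=int argument is an assumption made here to get 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'category': pd.Categorical(['a', 'b', 'a', 'a', 'b'],
                               categories=['a', 'b']),
})

# Encode the named column and drop the original in a single step.
# The column name is used as the prefix of the new indicator columns.
encoded = pd.get_dummies(data, columns=['category'], dtype=int)

print(encoded.columns.tolist())  # ['x0', 'category_a', 'category_b']
```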


# 13.2 Creating Model Descriptions with Patsy

Patsy is a Python library for describing statistical models (especially linear models) with a small string-based 'formula syntax', which is inspired by the formula syntax used by the R and S statistical programming languages

y ~ x0 + x1

The syntax a + b does not mean to add a to b, but rather that these are terms in the design matrix created for the model. The patsy.dmatrices function takes a formula string along with a dataset (which can be a DataFrame or a dict of arrays) and produces design matrices for a linear model

In [18]:
data = pd.DataFrame({
    'x0': [1,2,3,4,5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})
data

,x0,x1,y
0,1,0.01,-1.5
1,2,-0.01,0.0
2,3,0.25,3.6
3,4,-4.1,1.3
4,5,0.0,-2.0


In [23]:
import patsy

y, X = patsy.dmatrices('y ~ x0 + x1', data)

In [24]:
y

DesignMatrix with shape (5, 1)
     y
  -1.5
   0.0
   3.6
   1.3
  -2.0
  Terms:
    'y' (column 0)

In [25]:
X

DesignMatrix with shape (5, 3)
  Intercept  x0     x1
          1   1   0.01
          1   2  -0.01
          1   3   0.25
          1   4  -4.10
          1   5   0.00
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'x1' (column 2)

These Patsy DesignMatrix instances are NumPy ndarrays with additional metadata

In [26]:
np.asarray(y)

array([[-1.5],
       [ 0. ],
       [ 3.6],
       [ 1.3],
       [-2. ]])

In [28]:
np.asarray(X)

array([[ 1.  ,  1.  ,  0.01],
       [ 1.  ,  2.  , -0.01],
       [ 1.  ,  3.  ,  0.25],
       [ 1.  ,  4.  , -4.1 ],
       [ 1.  ,  5.  ,  0.  ]])
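
To make the meaning of the formula concrete, the same arrays can be assembled by hand. This is only a sketch of what 'y ~ x0 + x1' encodes (one column per term plus a constant), not a description of how Patsy builds them internally.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.],
})

# Right-hand side: a constant column for the intercept, then x0 and x1.
X_manual = np.column_stack([np.ones(len(data)), data['x0'], data['x1']])

# Left-hand side: the response as a single-column 2D array.
y_manual = data[['y']].to_numpy()

print(X_manual.shape, y_manual.shape)  # (5, 3) (5, 1)
```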

Suppress the intercept by adding the term + 0 to the model

In [29]:
patsy.dmatrices('y ~ x0 + x1 + 0', data)[1]

DesignMatrix with shape (5, 2)
  x0     x1
   1   0.01
   2  -0.01
   3   0.25
   4  -4.10
   5   0.00
  Terms:
    'x0' (column 0)
    'x1' (column 1)

The Patsy objects can be passed directly into algorithms like numpy.linalg.lstsq, which performs an ordinary least squares regression

In [31]:
coef, resid, _, _ = np.linalg.lstsq(X, y, rcond = None)

In [32]:
coef

array([[ 0.31290976],
       [-0.07910564],
       [-0.26546384]])

In [33]:
coef = pd.Series(coef.squeeze(), index = X.design_info.column_names)
coef

Intercept    0.312910
x0          -0.079106
x1          -0.265464
dtype: float64

## 13.2.1 Data Transformations in Patsy Formulas

In [34]:
y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)
X

DesignMatrix with shape (5, 3)
  Intercept  x0  np.log(np.abs(x1) + 1)
          1   1                 0.00995
          1   2                 0.00995
          1   3                 0.22314
          1   4                 1.62924
          1   5                 0.00000
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'np.log(np.abs(x1) + 1)' (column 2)

In [35]:
y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)
X

DesignMatrix with shape (5, 3)
  Intercept  standardize(x0)  center(x1)
          1         -1.41421        0.78
          1         -0.70711        0.76
          1          0.00000        1.02
          1          0.70711       -3.33
          1          1.41421        0.77
  Terms:
    'Intercept' (column 0)
    'standardize(x0)' (column 1)
    'center(x1)' (column 2)

In [36]:
X.design_info

DesignInfo(['Intercept', 'standardize(x0)', 'center(x1)'],
           factor_infos={EvalFactor('standardize(x0)'): FactorInfo(factor=EvalFactor('standardize(x0)'),
                                    type='numerical',
                                    state=<factor state>,
                                    num_columns=1),
                         EvalFactor('center(x1)'): FactorInfo(factor=EvalFactor('center(x1)'),
                                    type='numerical',
                                    state=<factor state>,
                                    num_columns=1)},
           term_codings=OrderedDict([(Term([]),
                                      [SubtermInfo(factors=(),
                                                   contrast_matrices={},
                                                   num_columns=1)]),
                                     (Term([EvalFactor('standardize(x0)')]),
                                      [SubtermInfo(factors=(EvalFactor('standardiz

As part of a modeling process, you may fit a model on one dataset, then evaluate the model based on another. This might be a hold-out portion or new data that is observed later. When applying transformations like center and standardize, you should be careful when using the model to form predictions based on new data. These are called stateful transformations, because you must use statistics like the mean or standard deviation of the original dataset when transforming a new dataset

The patsy.build_design_matrices function can apply transformations to new out-of-sample data using the saved information from the original in-sample dataset

In [37]:
new_data = pd.DataFrame({
    'x0':[6,7,8,9],
    'x1':[3.1,-0.5,0,2.3],
    'y':[1,2,3,4]})

new_data

,x0,x1,y
0,6,3.1,1
1,7,-0.5,2
2,8,0.0,3
3,9,2.3,4


In [38]:
new_X = patsy.build_design_matrices([X.design_info], new_data)
new_X

[DesignMatrix with shape (4, 3)
   Intercept  standardize(x0)  center(x1)
           1          2.12132        3.87
           1          2.82843        0.27
           1          3.53553        0.77
           1          4.24264        3.07
   Terms:
     'Intercept' (column 0)
     'standardize(x0)' (column 1)
     'center(x1)' (column 2)]
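
The idea of a stateful transformation can be sketched in plain NumPy: the mean and standard deviation are learned from the in-sample data only and then reused unchanged on the new data. The sketch assumes a population standard deviation (ddof=0), which is consistent with the standardize(x0) values shown in the outputs above.

```python
import numpy as np

# In-sample data: the "state" is estimated from these values only.
x0_train = np.array([1, 2, 3, 4, 5], dtype=float)
x1_train = np.array([0.01, -0.01, 0.25, -4.1, 0.])

x0_mean = x0_train.mean()
x0_std = x0_train.std()  # np.std uses ddof=0 by default
x1_mean = x1_train.mean()

# Out-of-sample data is transformed with the saved in-sample statistics,
# not with its own mean and standard deviation.
x0_new = np.array([6, 7, 8, 9], dtype=float)
x1_new = np.array([3.1, -0.5, 0, 2.3])

x0_new_standardized = (x0_new - x0_mean) / x0_std
x1_new_centered = x1_new - x1_mean

print(np.round(x0_new_standardized, 5))  # approx. 2.12132, 2.82843, 3.53553, 4.24264
print(np.round(x1_new_centered, 2))      # approx. 3.87, 0.27, 0.77, 3.07
```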

Because the plus symbol (+) in the context of Patsy formulas does not mean addition, when you want to add columns from a dataset by name, you must wrap them in the special I function

In [39]:
y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)
X

DesignMatrix with shape (5, 2)
  Intercept  I(x0 + x1)
          1        1.01
          1        1.99
          1        3.25
          1       -0.10
          1        5.00
  Terms:
    'Intercept' (column 0)
    'I(x0 + x1)' (column 1)

## 13.2.2 Categorical Data and Patsy

When you use non-numeric terms in a Patsy formula, they are converted to dummy variables by default. If there is an intercept, one of the levels will be left out to avoid collinearity

In [40]:
data = pd.DataFrame({
    'key1': ['a','a','b','b','a','b','a','b'],
    'key2': [0,1,0,1,0,1,0,0],
    'v1':[1,2,3,4,5,6,7,8],
    'v2': [-1,0,2.5,-0.5,4.0,-1.2,0.2,-1.7]
})
data

,key1,key2,v1,v2
0,a,0,1,-1.0
1,a,1,2,0.0
2,b,0,3,2.5
3,b,1,4,-0.5
4,a,0,5,4.0
5,b,1,6,-1.2
6,a,0,7,0.2
7,b,0,8,-1.7


In [41]:
y, X = patsy.dmatrices('v2 ~ key1', data)
X

DesignMatrix with shape (8, 2)
  Intercept  key1[T.b]
          1          0
          1          0
          1          1
          1          1
          1          0
          1          1
          1          0
          1          1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)

If you omit the intercept from the model, then columns for each category value will be included in the model design matrix

In [42]:
y, X = patsy.dmatrices('v2 ~ key1 + 0', data)
X

DesignMatrix with shape (8, 2)
  key1[a]  key1[b]
        1        0
        1        0
        0        1
        0        1
        1        0
        0        1
        1        0
        0        1
  Terms:
    'key1' (columns 0:2)
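
The two codings can be imitated with pd.get_dummies, which may help when checking a design matrix by hand. This is a rough pandas analogue rather than Patsy's own machinery; the drop_first=True argument plays the role of leaving out the baseline level when an intercept is present, and dtype=int is used here only to get 0/1 integers.

```python
import pandas as pd

key1 = pd.Series(['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'], name='key1')

# No intercept: one indicator column per category.
full = pd.get_dummies(key1, prefix='key1', dtype=int)

# With an intercept: drop the first level so that 'a' becomes the baseline.
reduced = pd.get_dummies(key1, prefix='key1', drop_first=True, dtype=int)

# The full set of indicators always sums to 1 in each row, which is exactly
# the intercept column. Keeping all of them alongside an intercept would
# therefore make the design matrix collinear.
row_sums = full.sum(axis=1)
```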

Numeric columns can be interpreted as categorical with the C function

In [43]:
y, X = patsy.dmatrices('v2 ~ C(key2)', data)
X

DesignMatrix with shape (8, 2)
  Intercept  C(key2)[T.1]
          1             0
          1             1
          1             0
          1             1
          1             0
          1             1
          1             0
          1             0
  Terms:
    'Intercept' (column 0)
    'C(key2)' (column 1)

In [44]:
data['key2'] = data['key2'].map({0: 'zero', 1: 'one'})
data

,key1,key2,v1,v2
0,a,zero,1,-1.0
1,a,one,2,0.0
2,b,zero,3,2.5
3,b,one,4,-0.5
4,a,zero,5,4.0
5,b,one,6,-1.2
6,a,zero,7,0.2
7,b,zero,8,-1.7


In [45]:
y, X = patsy.dmatrices('v2 ~ key1 + key2', data)
X

DesignMatrix with shape (8, 3)
  Intercept  key1[T.b]  key2[T.zero]
          1          0             1
          1          0             0
          1          1             1
          1          1             0
          1          0             1
          1          1             0
          1          0             1
          1          1             1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)
    'key2' (column 2)

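Patsy formulas can also include interaction terms of the form key1:key2, which can be used, for example, in analysis of variance (ANOVA) models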

In [46]:
y, X = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)
X

DesignMatrix with shape (8, 4)
  Intercept  key1[T.b]  key2[T.zero]  key1[T.b]:key2[T.zero]
          1          0             1                       0
          1          0             0                       0
          1          1             1                       1
          1          1             0                       0
          1          0             1                       0
          1          1             0                       0
          1          0             1                       0
          1          1             1                       1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)
    'key2' (column 2)
    'key1:key2' (column 3)
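
For two dummy-coded factors, the interaction column is the elementwise product of the two indicator columns. The NumPy sketch below reproduces the key1[T.b]:key2[T.zero] column from the output above under that reading; the variable names are illustrative.

```python
import numpy as np

key1 = np.array(['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'])
key2 = np.array(['zero', 'one', 'zero', 'one', 'zero', 'one', 'zero', 'zero'])

# Indicator columns for the non-baseline levels.
key1_b = (key1 == 'b').astype(int)
key2_zero = (key2 == 'zero').astype(int)

# The interaction is 1 only where both indicators are 1.
interaction = key1_b * key2_zero
print(interaction)  # [0 0 1 0 0 0 0 1]
```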

# 13.3 Introduction to statsmodels

statsmodels is a Python library for fitting many kinds of statistical models, performing statistical tests, and data exploration and visualization. statsmodels contains more classical frequentist statistical methods, while Bayesian methods and machine learning models are found in other libraries

Some kinds of models found in statsmodels include:
- Linear models, generalized linear models, and robust linear models 
- Linear mixed effects models 
- Analysis of variance (ANOVA) methods 
- Time series processes and state space models
- Generalized method of moments 

## 13.3.1 Estimating Linear Models

Linear models in statsmodels have two different main interfaces: array-based and formula-based. These are accessed through the following API module imports

In [47]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [48]:
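def dnorm(mean, variance, size = 1):
    # Draw normally distributed samples with the given mean and variance
    if isinstance(size, int):
        size = (size,)  # randn takes each dimension as a separate argument
    return mean + np.sqrt(variance) * np.random.randn(*size)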

np.random.seed(12345)

N = 100
X = np.c_[dnorm(0, 0.4, size = N),
          dnorm(0, 0.6, size = N),
          dnorm(0, 0.2, size = N)]
eps = dnorm(0, 0.1, size = N)
beta = [0.1, 0.3, 0.5]

y = np.dot(X, beta) + eps

In [49]:
X[:5]

array([[-0.12946849, -1.21275292,  0.50422488],
       [ 0.30291036, -0.43574176, -0.25417986],
       [-0.32852189, -0.02530153,  0.13835097],
       [-0.35147471, -0.71960511, -0.25821463],
       [ 1.2432688 , -0.37379916, -0.52262905]])

In [50]:
y[:5]

array([ 0.42786349, -0.67348041, -0.09087764, -0.48949442, -0.12894109])

In [51]:
X_model = sm.add_constant(X)
X_model[:5]

array([[ 1.        , -0.12946849, -1.21275292,  0.50422488],
       [ 1.        ,  0.30291036, -0.43574176, -0.25417986],
       [ 1.        , -0.32852189, -0.02530153,  0.13835097],
       [ 1.        , -0.35147471, -0.71960511, -0.25821463],
       [ 1.        ,  1.2432688 , -0.37379916, -0.52262905]])
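
By default, sm.add_constant prepends a column of ones, which is what lets a linear model estimate an intercept. Note that the model fitted in this section is built from X rather than X_model, which is why its summary reports an uncentered R-squared and no constant term. The NumPy-only sketch below, on hypothetical simulated data with a true intercept of 0.5, compares a least squares fit with and without the constant column.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical data, not the chapter's seed
N = 100
X = rng.normal(size=(N, 3))
y = 0.5 + X @ np.array([0.1, 0.3, 0.5]) + rng.normal(scale=0.3, size=N)

# Hand-built equivalent of prepending a constant column.
X_model = np.column_stack([np.ones(N), X])

coef_no_const, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_const, *_ = np.linalg.lstsq(X_model, y, rcond=None)

# Residual sum of squares for each fit; the model with the constant
# can never fit worse, because it nests the model without it.
rss_no_const = np.sum((y - X @ coef_no_const) ** 2)
rss_const = np.sum((y - X_model @ coef_const) ** 2)

print(coef_const[0])  # estimated intercept, close to the true value 0.5
```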

In [52]:
model = sm.OLS(y, X)
model

<statsmodels.regression.linear_model.OLS at 0x7fe7d4356dc0>

The model's fit method returns a regression results object containing estimated model parameters and other diagnostics

In [60]:
results = model.fit()
results

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fe7d4bd7550>

In [61]:
results.params

array([0.17826108, 0.22303962, 0.50095093])

In [62]:
results.summary()

Dep. Variable:,y,R-squared (uncentered):,0.43
Model:,OLS,Adj. R-squared (uncentered):,0.413
Method:,Least Squares,F-statistic:,24.42
Date:,"Wed, 08 Jul 2020",Prob (F-statistic):,7.44e-12
Time:,15:30:29,Log-Likelihood:,-34.305
No. Observations:,100,AIC:,74.61
Df Residuals:,97,BIC:,82.42
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x1,0.1783,0.053,3.364,0.001,0.073,0.283
x2,0.2230,0.046,4.818,0.000,0.131,0.315
x3,0.5010,0.080,6.237,0.000,0.342,0.660

Omnibus:,4.662,Durbin-Watson:,2.201
Prob(Omnibus):,0.097,Jarque-Bera (JB):,4.098
Skew:,0.481,Prob(JB):,0.129
Kurtosis:,3.243,Cond. No.,1.74


In [63]:
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.430
Model:                            OLS   Adj. R-squared (uncentered):              0.413
Method:                 Least Squares   F-statistic:                              24.42
Date:                Wed, 08 Jul 2020   Prob (F-statistic):                    7.44e-12
Time:                        15:30:33   Log-Likelihood:                         -34.305
No. Observations:                 100   AIC:                                      74.61
Df Residuals:                      97   BIC:                                      82.42
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [64]:
data = pd.DataFrame(X, columns = ['col0', 'col1', 'col2'])
data['y'] = y
data[:5]

,col0,col1,col2,y
0,-0.129468,-1.212753,0.504225,0.427863
1,0.30291,-0.435742,-0.25418,-0.67348
2,-0.328522,-0.025302,0.138351,-0.090878
3,-0.351475,-0.719605,-0.258215,-0.489494
4,1.243269,-0.373799,-0.522629,-0.128941


In [65]:
results = smf.ols('y ~ col0 + col1 + col2', data = data).fit()
results

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fe7d43a7cd0>

In [67]:
results.summary()

Dep. Variable:,y,R-squared:,0.435
Model:,OLS,Adj. R-squared:,0.418
Method:,Least Squares,F-statistic:,24.68
Date:,"Wed, 08 Jul 2020",Prob (F-statistic):,6.37e-12
Time:,15:32:12,Log-Likelihood:,-33.835
No. Observations:,100,AIC:,75.67
Df Residuals:,96,BIC:,86.09
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0336,0.035,0.952,0.343,-0.036,0.104
col0,0.1761,0.053,3.320,0.001,0.071,0.281
col1,0.2248,0.046,4.851,0.000,0.133,0.317
col2,0.5148,0.082,6.304,0.000,0.353,0.677

Omnibus:,4.504,Durbin-Watson:,2.223
Prob(Omnibus):,0.105,Jarque-Bera (JB):,3.957
Skew:,0.475,Prob(JB):,0.138
Kurtosis:,3.222,Cond. No.,2.38


In [68]:
results.params

Intercept    0.033559
col0         0.176149
col1         0.224826
col2         0.514808
dtype: float64

In [69]:
results.tvalues

Intercept    0.952188
col0         3.319754
col1         4.850730
col2         6.303971
dtype: float64

Given new out-of-sample data, you can compute predicted values from the estimated model parameters (here the first five in-sample rows stand in for new data)

In [70]:
results.predict(data[:5])

0   -0.002327
1   -0.141904
2    0.041226
3   -0.323070
4   -0.100535
dtype: float64
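
A prediction from a linear model is the intercept plus the sum of each coefficient times the matching column value. The sketch below checks the first predicted value by hand, using the rounded parameters and the first data row printed above, so it agrees with the predict output only up to rounding.

```python
import pandas as pd

# Estimated parameters as printed by results.params (rounded).
params = pd.Series({
    'Intercept': 0.033559,
    'col0': 0.176149,
    'col1': 0.224826,
    'col2': 0.514808,
})

# First row of the data as printed by data[:5] (rounded).
row = pd.Series({'col0': -0.129468, 'col1': -1.212753, 'col2': 0.504225})

# Intercept plus the coefficient-weighted sum of the predictors.
prediction = params['Intercept'] + (row * params[row.index]).sum()
print(round(prediction, 4))  # -0.0023
```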

## 13.3.2 Estimating Time Series Processes

In [71]:
init_x = 4

values = [init_x, init_x]
N = 1000

b0 = 0.8
b1 = -0.4
noise = dnorm(0, 0.1, N)

for i in range(N):
    new_x = values[-1] * b0 + values[-2] * b1 + noise[i]
    values.append(new_x)

In [72]:
values[:10]

[4,
 4,
 1.8977509636904242,
 0.08686526220610424,
 -0.5769469132535335,
 -0.4995023802308947,
 0.27887732380859565,
 0.601921099044195,
 0.5143399251963061,
 0.23230090795613595]

In [74]:
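MAXLAGS = 5
# Note: sm.tsa.AR is deprecated in later statsmodels releases in favor of
# statsmodels.tsa.ar_model.AutoReg, e.g. AutoReg(values, lags=MAXLAGS).fit()
model = sm.tsa.AR(values)
results = model.fit(MAXLAGS)
results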

<statsmodels.tsa.ar_model.ARResultsWrapper at 0x7fe7d4a558b0>

In [76]:
results.summary()

Dep. Variable:,y,No. Observations:,1002.0
Model:,AR(5),Log Likelihood,-241.974
Method:,cmle,S.D. of innovations,0.308
Date:,"Wed, 08 Jul 2020",AIC,-2.338
Time:,15:36:29,BIC,-2.304
Sample:,0,HQIC,-2.325
,,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.0062,0.010,-0.628,0.530,-0.025,0.013
L1.y,0.7845,0.032,24.696,0.000,0.722,0.847
L2.y,-0.4085,0.040,-10.108,0.000,-0.488,-0.329
L3.y,-0.0136,0.042,-0.321,0.748,-0.097,0.070
L4.y,0.0150,0.040,0.373,0.709,-0.064,0.094
L5.y,0.0143,0.030,0.475,0.635,-0.045,0.073

,Real,Imaginary,Modulus,Frequency
AR.1,0.8072,-1.2276j,1.4692,-0.1574
AR.2,0.8072,+1.2276j,1.4692,0.1574
AR.3,2.4132,-0.0000j,2.4132,-0.0000
AR.4,-2.5374,-2.6442j,3.6647,-0.3717
AR.5,-2.5374,+2.6442j,3.6647,0.3717


In [77]:
results.params

array([-0.00616093,  0.78446347, -0.40847891, -0.01364148,  0.01496872,
        0.01429462])
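
The simulated series has an AR(2) structure with coefficients 0.8 and -0.4, and the first two lag estimates above (0.7845 and -0.4085) recover them closely while the higher lags are near zero. As a rough cross-check that does not depend on statsmodels, an AR(2) model can be estimated by regressing each value on its two predecessors. This sketch uses its own random generator and seed, so its numbers will differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: a fresh seed, not the chapter's
b0, b1 = 0.8, -0.4
N = 1000
noise = rng.normal(scale=np.sqrt(0.1), size=N)  # variance 0.1, as in dnorm

# Simulate x[t] = b0 * x[t-1] + b1 * x[t-2] + noise[t]
values = [4.0, 4.0]
for i in range(N):
    values.append(b0 * values[-1] + b1 * values[-2] + noise[i])
values = np.asarray(values)

# Regress x[t] on a constant, x[t-1], and x[t-2].
target = values[2:]
lagged = np.column_stack([np.ones(len(target)), values[1:-1], values[:-2]])
estimates, *_ = np.linalg.lstsq(lagged, target, rcond=None)
const, lag1, lag2 = estimates

print(lag1, lag2)  # should land near 0.8 and -0.4
```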

# 13.4 Introduction to scikit-learn

scikit-learn is one of the most widely used and trusted general-purpose Python machine learning toolkits. It contains a broad selection of standard supervised and unsupervised machine learning methods with tools for model selection and evaluation, data transformation, data loading, and model persistence. These models can be used for classification, clustering, prediction, and other common tasks

pandas can be very useful for massaging datasets prior to model fitting 

In [78]:
train = pd.read_csv('datasets/titanic/train.csv')

In [79]:
test = pd.read_csv('datasets/titanic/test.csv')

In [80]:
train[:4]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


In [83]:
train.isnull().sum(axis = 0)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [86]:
test.isnull().sum(axis = 0)

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

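A model is fitted on a training dataset and then evaluated on an out-of-sample testing dataset. Age has missing values in both tables, so before using it as a predictor we fill the nulls in both with the median computed from the training dataset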

In [87]:
impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)

In [88]:
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

In [89]:
predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
X_test = test[predictors].values
y_train = train['Survived'].values

In [90]:
X_train[:5]

array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])

In [91]:
y_train[:5]

array([0, 1, 1, 1, 0])

In [92]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

In [93]:
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [94]:
y_predict = model.predict(X_test)
y_predict[:10]

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

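In [None]:
# y_true is not defined in this notebook: the test file has no Survived column.
# With true labels for the test set, accuracy could be computed as:
# (y_true == y_predict).mean()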
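
Because the test file carries no labels, one common way to get an accuracy figure is to hold out part of the labeled training data and score the model on that portion. The sketch below uses hypothetical simulated data in place of the Titanic tables; the split size and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical labeled data standing in for X_train / y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)
y = (signal > 0).astype(int)

# Keep 25% of the labeled rows aside as a validation set.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_fit, y_fit)
y_predict = model.predict(X_val)

# Fraction of held-out rows whose predicted label matches the true label.
accuracy = (y_val == y_predict).mean()
```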

In practice, there are often many additional layers of complexity in model training. Many models have parameters that can be tuned, and there are techniques such as cross-validation that can be used for parameter tuning to avoid overfitting to the training data. This can often yield better predictive performance or robustness on new data.

Cross-validation works by splitting the training data to simulate out-of-sample prediction. Based on a model accuracy score like mean squared error, one can perform a grid search on model parameters. Some models, like logistic regression, have estimator classes with built-in cross-validation. For example, the LogisticRegressionCV class can be used with a parameter indicating how fine-grained of a grid search to do on the model regularization parameter C

In [96]:
from sklearn.linear_model import LogisticRegressionCV

In [98]:
model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X_train, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)
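
The splitting-and-scoring idea behind a cross-validated grid search can be sketched with NumPy alone. To keep the example short, ridge regression scored by mean squared error stands in for logistic regression here, and the data, the penalty grid, and the helper names ridge_fit and cv_mse are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
true_coef = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.5, size=120)

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: (X'X + alpha * I)^-1 X'y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_mse(X, y, alpha, n_folds=4):
    # Split the row positions into folds; each fold is held out once.
    all_rows = np.arange(len(y))
    errors = []
    for held_out in np.array_split(all_rows, n_folds):
        train_rows = np.setdiff1d(all_rows, held_out)
        coef = ridge_fit(X[train_rows], y[train_rows], alpha)
        residuals = y[held_out] - X[held_out] @ coef
        errors.append(np.mean(residuals ** 2))
    return np.mean(errors)

# Grid search: score every candidate penalty, keep the best one.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: cv_mse(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
```

Note that the ridge penalty alpha grows with the amount of regularization, whereas the C parameter of LogisticRegression is an inverse regularization strength.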

To do cross-validation by hand, you can use the cross_val_score helper function, which handles the data splitting process

In [99]:
from sklearn.model_selection import cross_val_score

In [100]:
model = LogisticRegression(C = 10)
scores = cross_val_score(model, X_train, y_train, cv = 4)
scores

array([0.77578475, 0.79820628, 0.77578475, 0.78828829])

The default scoring metric is model-dependent, but it is possible to choose an explicit scoring function. Cross-validated models take longer to train, but can often yield better model performance
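
An explicit metric can be requested through the scoring argument of cross_val_score. The sketch below runs on hypothetical data generated with make_classification rather than the Titanic tables, and compares the 'accuracy' and 'roc_auc' scorers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical binary classification data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(C=10)

# Same model and folds, scored with two different metrics.
accuracy_scores = cross_val_score(model, X, y, cv=4, scoring='accuracy')
auc_scores = cross_val_score(model, X, y, cv=4, scoring='roc_auc')
```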

# 13.5 Continuing Your Education

- Introduction to Machine Learning with Python by Andreas Mueller and Sarah Guido (O’Reilly)
- Python Data Science Handbook by Jake VanderPlas (O’Reilly)
- Data Science from Scratch: First Principles with Python by Joel Grus (O’Reilly)
- Python Machine Learning by Sebastian Raschka (Packt Publishing)
- Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (O’Reilly)

While books can be valuable resources for learning, they can sometimes grow out of date when the underlying open source software changes. It is a good idea to be familiar with the documentation for the various statistics or machine learning frameworks to stay up to date on the latest features and API