# Intro to Modeling

Plan ➔ Acquire ➔ Prepare ➔ Explore ➔ **Model** ➔ Deliver

Before modeling:

0. Split your data
1. Data preprocessing
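
One way to produce the three splits is two successive calls to scikit-learn's `train_test_split`. The following is a minimal sketch on a made-up DataFrame. The column names, the split proportions, and the `random_state` are assumptions for illustration, and this is not necessarily how `prepare.prep_titanic` does it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    'feature': range(100),
    'target': [0, 1] * 50,
})

# First carve off the test split (20% here, an arbitrary choice) ...
train_validate, test = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df.target
)
# ... then split the remainder into train and validate
train, validate = train_test_split(
    train_validate, test_size=0.3, random_state=123,
    stratify=train_validate.target
)

print(train.shape, validate.shape, test.shape)
```

Stratifying on the target keeps the class balance roughly the same in every split, which makes accuracy on the splits easier to compare.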

The modeling "loop"

1. Create a model
    - algorithm + hyperparams
    - training data
1. Evaluate the model
1. Repeat

After a certain amount of time or number of repetitions:

1. Compare models
1. Evaluate on test
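
The loop and the final comparison can be sketched roughly as follows. The toy data, the candidate names, and the choice of strategies are assumptions made for illustration; with real algorithms the structure would stay the same.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: DummyClassifier ignores the feature values
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = np.zeros((300, 1))
X_train, y_train = X[:180], y[:180]
X_validate, y_validate = X[180:240], y[180:240]
X_test, y_test = X[240:], y[240:]

# Each candidate is an algorithm plus a set of hyperparameters
candidates = {
    'most_frequent': DummyClassifier(strategy='most_frequent'),
    'stratified': DummyClassifier(strategy='stratified', random_state=0),
    'uniform': DummyClassifier(strategy='uniform', random_state=0),
}

# The loop: create, fit on train, evaluate on validate, repeat
validate_scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    validate_scores[name] = candidate.score(X_validate, y_validate)

# Compare on validate, then use the test split once, for the winner only
best_name = max(validate_scores, key=validate_scores.get)
test_score = candidates[best_name].score(X_test, y_test)
print(best_name, validate_scores[best_name], test_score)
```

Keeping the test split out of the loop means the final score is an estimate on data that played no part in choosing the model.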

In [27]:
# We'll use sklearn's DummyClassifier as a stand-in for other classification algorithms.
# It has the same interface, so we use it the same way we'll use the "real" models.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
import acquire
import prepare
import pandas as pd

## Data Split

In [2]:
train, validate, test = prepare.prep_titanic(acquire.get_titanic_data())
train.shape, validate.shape, test.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['age'] = imputer.transform(test[['age']])


((497, 14), (214, 14), (178, 14))

In [3]:
X_train, y_train = train.drop(columns='survived'), train.survived
X_validate, y_validate = validate.drop(columns='survived'), validate.survived
X_test, y_test = test.drop(columns='survived'), test.survived

## Create our First Model

### Aside: Working with sklearn ML objects

1. Create the object
1. Fit the object on training data
1. Use the object (.score, .predict, .transform)
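
The same three steps apply to objects that transform data rather than predict. As a small sketch, here they are with `SimpleImputer`; the toy arrays and the `mean` strategy are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column feature matrices with missing values
X_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
X_validate = np.array([[np.nan], [25.0]])

# 1. Create the object
imputer = SimpleImputer(strategy='mean')
# 2. Fit the object on training data only (it learns the training mean)
imputer.fit(X_train)
# 3. Use the object on any split
X_train_imputed = imputer.transform(X_train)
X_validate_imputed = imputer.transform(X_validate)

print(imputer.statistics_)
print(X_validate_imputed)
```

Fitting only on train keeps information from validate and test out of the values used to fill gaps.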

In [4]:
# 1. Create the object
model = DummyClassifier(strategy='constant', constant=1)
# 2. Fit the object
model.fit(X_train, y_train)

DummyClassifier(constant=1, strategy='constant')

Ways we use sklearn classification models:

- `.score` returns the accuracy for a given set of independent variables and their true labels
- `.predict` makes predictions given a set of independent variables
- `.predict_proba` gives the estimated probability of each label for each observation
- some specific model types have additional properties
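
A short sketch of all three methods on one fitted model. The four-row toy data set and the `prior` strategy are assumptions picked so the probabilities are easy to check by hand.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: three observations of class 0, one of class 1
X = np.zeros((4, 1))
y = np.array([0, 0, 0, 1])

# strategy='prior' predicts the most frequent class and reports class priors
clf = DummyClassifier(strategy='prior').fit(X, y)

labels = clf.predict(X)               # one label per observation
probabilities = clf.predict_proba(X)  # one row per observation, one column per class
accuracy = clf.score(X, y)            # fraction of labels predicted correctly

print(clf.classes_)  # the column order used by predict_proba
print(probabilities[0])
print(accuracy)
```

The columns of `predict_proba` follow the order of `clf.classes_`, so it is worth checking that attribute before reading a column as "the probability of surviving".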

In [5]:
print('Training accuracy: %.4f' % model.score(X_train, y_train))

Training accuracy: 0.3823


In [6]:
# TODO: view the accuracy on the validate split
print('Validate accuracy: %.4f' % model.score(X_validate, y_validate))

Validate accuracy: 0.3832


In [7]:
model.predict(X_validate)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [8]:
# TODO: create a new column on the train dataframe that contains the model's predictions
train['prediction'] = model.predict(X_train)

In [9]:
train.head()

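| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | embark_town | alone | Q | S | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 734 | 734 | 0 | 2 | male | 23.0 | 0 | 0 | 13.0 | S | Second | Southampton | 1 | 0 | 1 | 1 |
| 93 | 93 | 0 | 3 | male | 26.0 | 1 | 2 | 20.575 | S | Third | Southampton | 0 | 0 | 1 | 1 |
| 64 | 64 | 0 | 1 | male | 30.750423 | 0 | 0 | 27.7208 | C | First | Cherbourg | 1 | 0 | 0 | 1 |
| 809 | 809 | 1 | 1 | female | 33.0 | 1 | 0 | 53.1 | S | First | Southampton | 0 | 0 | 1 | 1 |
| 571 | 571 | 1 | 1 | female | 53.0 | 2 | 0 | 51.4792 | S | First | Southampton | 0 | 0 | 1 | 1 |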


In [10]:
# use the column you just created and the actual values in the survived column
# to generate a classification report
print(classification_report(train.survived, train.prediction))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       307
           1       0.38      1.00      0.55       190

    accuracy                           0.38       497
   macro avg       0.19      0.50      0.28       497
weighted avg       0.15      0.38      0.21       497
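
The class 1 row of the report can be reproduced by hand from the support counts (307 actual 0s, 190 actual 1s) and the fact that this model always predicts 1. The sketch below rebuilds label arrays from those counts instead of using the Titanic data itself, and cross-checks the arithmetic against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rebuild labels from the support counts in the report
y_true = np.array([0] * 307 + [1] * 190)
y_pred = np.ones_like(y_true)  # the constant=1 model predicts 1 every time

# Confusion counts, treating 1 as the positive class
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

precision = tp / (tp + fp)  # of everything predicted 1, how much was 1
recall = tp / (tp + fn)     # of everything actually 1, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Class 0 is never predicted, so its precision has a zero denominator. scikit-learn reports 0.00 in that case and emits a warning.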





## More models

Now we'll make more models. One model is a unique combination of:

- algorithm
- hyperparameters
- training data

In [11]:
model1 = DummyClassifier(strategy='constant', constant=0)
# TODO: fit the model on the training data
model1.fit(X_train, y_train)
# TODO: see how this model performs on train and validate
train['prediction1'] = model1.predict(X_train)
# how does it do on train?
print('Training accuracy: %.4f' % model1.score(X_train, y_train))
# how does it do on validate?
print('Validate accuracy: %.4f' % model1.score(X_validate, y_validate))

Training accuracy: 0.6177
Validate accuracy: 0.6168


In [12]:
model2 = DummyClassifier(strategy='uniform', random_state=0)
# TODO: fit the model on the training data
model2.fit(X_train, y_train)
train['prediction2'] = model2.predict(X_train)
# TODO: see how this model performs on train and validate
print('Training accuracy: %.4f' % model2.score(X_train, y_train))
print('Validate accuracy: %.4f' % model2.score(X_validate, y_validate))

Training accuracy: 0.4748
Validate accuracy: 0.5748


In [13]:
# Following the pattern above, create 2 more models that vary in either hyperparameters or training data
# fit the models and view their performance
model3 = DummyClassifier(strategy='stratified', random_state=1221)
model3.fit(X_train, y_train)
print('Training accuracy: %.4f' % model3.score(X_train, y_train))
print('Validate accuracy: %.4f' % model3.score(X_validate, y_validate))

Training accuracy: 0.5372
Validate accuracy: 0.5607


In [14]:
model4 = DummyClassifier(strategy='stratified', random_state=78215)
model4.fit(X_train, y_train)
print('Training accuracy: %.4f' % model4.score(X_train, y_train))
print('Validate accuracy: %.4f' % model4.score(X_validate, y_validate))

Training accuracy: 0.5191
Validate accuracy: 0.5187


What are we looking for when we evaluate model performance?

- Is the model overfit? (Does it perform drastically better on train than on validate?)
- How does the model perform compared to the baseline and to other models?
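
Both questions can be answered from a small table of scores. This sketch copies the accuracies printed above into a DataFrame. Treating `model1` (always predict 0, the most common class in train) as the baseline is an assumption made here for illustration.

```python
import pandas as pd

# Accuracies copied from the printed output above
results = pd.DataFrame(
    {
        'train': [0.3823, 0.6177, 0.4748, 0.5372, 0.5191],
        'validate': [0.3832, 0.6168, 0.5748, 0.5607, 0.5187],
    },
    index=['model', 'model1', 'model2', 'model3', 'model4'],
)

# A large positive gap (train much better than validate) suggests overfitting
results['gap'] = results.train - results.validate

# Assumed baseline: always predict the most common class, which is model1
baseline = results.loc['model1', 'validate']
results['beats_baseline'] = results.validate > baseline

print(results.sort_values('validate', ascending=False))
```

None of these dummy models learns from the features, so none is expected to beat the baseline. The same table becomes more informative once real algorithms are added as rows.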

## Compare and Finalize

In [None]:
# TODO: compare the performance of your models on the validate split

Model 1 is the best on validate, with about 62% accuracy (0.6168).

In [None]:
# TODO: find the performance of your best model on the test split

In [15]:
model1.score(X_test, y_test)

0.6179775280898876

In [17]:
X_train

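| | passenger_id | pclass | sex | age | sibsp | parch | fare | embarked | class | embark_town | alone | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 734 | 734 | 2 | male | 23.000000 | 0 | 0 | 13.0000 | S | Second | Southampton | 1 | 0 | 1 |
| 93 | 93 | 3 | male | 26.000000 | 1 | 2 | 20.5750 | S | Third | Southampton | 0 | 0 | 1 |
| 64 | 64 | 1 | male | 30.750423 | 0 | 0 | 27.7208 | C | First | Cherbourg | 1 | 0 | 0 |
| 809 | 809 | 1 | female | 33.000000 | 1 | 0 | 53.1000 | S | First | Southampton | 0 | 0 | 1 |
| 571 | 571 | 1 | female | 53.000000 | 2 | 0 | 51.4792 | S | First | Southampton | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 877 | 877 | 3 | male | 19.000000 | 0 | 0 | 7.8958 | S | Third | Southampton | 1 | 0 | 1 |
| 164 | 164 | 3 | male | 1.000000 | 4 | 1 | 39.6875 | S | Third | Southampton | 0 | 0 | 1 |
| 144 | 144 | 2 | male | 18.000000 | 0 | 0 | 11.5000 | S | Second | Southampton | 1 | 0 | 1 |
| 412 | 412 | 1 | female | 33.000000 | 1 | 0 | 90.0000 | Q | First | Queenstown | 0 | 1 | 0 |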


In [18]:
y_train

734    0
93     0
64     0
809    1
571    1
      ..
877    0
164    0
144    0
412    1
518    1
Name: survived, Length: 497, dtype: int64

In [29]:
model2p = pd.DataFrame(model2.predict(X_train))

In [32]:
model2p.value_counts()

1    261
0    236
dtype: int64

In [34]:
model3p = pd.DataFrame(model3.predict(X_validate))
model3p.value_counts()

0    142
1     72
dtype: int64

In [None]:
# notes and take-aways

# 