# Saving and loading the model

Before we can save a model, we have to train it.
Below is all the code necessary for model training.


In [5]:
import pandas as pd
import numpy as np
 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
 
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Data preparation
# In a Jupyter notebook, a leading '!' runs the rest of the line as a shell command,
# and '$data' substitutes the value of the Python variable `data` into that command.

data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'
!wget $data -O data-week-3.csv
df = pd.read_csv('data-week-3.csv')
 
df.columns = df.columns.str.lower().str.replace(' ', '_')
 
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
 
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
 
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
 
df.churn = (df.churn == 'yes').astype(int)

# Data splitting
 
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

numerical = ['tenure', 'monthlycharges', 'totalcharges']
 
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']



--2025-10-23 10:44:21--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 977501 (955K) [text/plain]
Saving to: ‘data-week-3.csv’


2025-10-23 10:44:21 (73.9 MB/s) - ‘data-week-3.csv’ saved [977501/977501]
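The `!wget` line only works inside Jupyter/IPython. If the cell is later run as a plain Python script, a standard-library download is one possible substitute. The sketch below is an assumption and not part of the original notebook; the helper name `download_if_missing` is hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: str) -> Path:
    """Download url to dest unless the file is already present (hypothetical helper)."""
    path = Path(dest)
    if not path.exists():
        # urlretrieve writes the response body to the given file name
        urlretrieve(url, str(path))
    return path
```

In the notebook this could be called as `download_if_missing(data, 'data-week-3.csv')` before `pd.read_csv`.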



The train function takes three arguments: the training dataframe df_train, the target values y_train, and C, the inverse regularization strength of LogisticRegression (smaller values mean stronger regularization).

First we use DictVectorizer to turn the records into a feature matrix: it one-hot encodes the string-valued categorical columns and passes the numerical columns through unchanged. Then we train a logistic regression model on X_train and y_train with the fit function. To apply the model later we need both the DictVectorizer and the model, so we return them together.

In [6]:
def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')
 
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)
 
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
 
    return dv, model
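To make the encoding step concrete, here is a minimal sketch of how DictVectorizer treats string and numeric values. The two records are made up for illustration and are not rows from the churn dataset.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records: one string-valued feature and one numeric feature
records = [
    {"contract": "month-to-month", "tenure": 1},
    {"contract": "two_year", "tenure": 24},
]

toy_dv = DictVectorizer(sparse=False)
X_toy = toy_dv.fit_transform(records)

# String values become one column per distinct value; numbers keep their value
print(toy_dv.feature_names_)
# ['contract=month-to-month', 'contract=two_year', 'tenure']
print(X_toy)
# [[ 1.  0.  1.]
#  [ 0.  1. 24.]]
```

The same fitted DictVectorizer has to be reused at prediction time so that the columns line up with the ones the model was trained on.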

The predict function takes a dataframe to score, along with the DictVectorizer and the model. It converts the dataframe to a list of dictionaries, transforms them into the feature matrix X with the already fitted DictVectorizer, and returns the predicted probability of churn for each row.

In [7]:
def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')
 
    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:,1]
 
    return y_pred

In [8]:
# C is the inverse regularization strength of the logistic regression model
# n_splits is the number of folds used in K-Fold cross-validation

C = 1.0
n_splits = 5

Next we implement K-Fold cross-validation on the full training dataset. The for loop iterates over the folds; for each fold it splits the data into training and validation parts, trains a model, and computes roc_auc_score on the validation part. At the end we print the mean and standard deviation of the scores across all folds.

In [9]:
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)  
 
scores = []
 
for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]
 
    y_train = df_train.churn.values
    y_val = df_val.churn.values
 
    dv, model = train(df_train, y_train, C=C)
    y_pred = predict(df_val, dv, model)
 
    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)
 
print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))
 

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

C=1.0 0.842 +- 0.007


In [10]:
scores

[0.8446829053857807,
 0.8451798602834062,
 0.8332289785269917,
 0.8347808882778027,
 0.8517225691067114]
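As a quick check, the summary line printed by the cross-validation loop can be reproduced from these five scores with the standard library alone. This is only a sketch; note that np.std computes the population standard deviation by default, which corresponds to statistics.pstdev, not statistics.stdev.

```python
import statistics

# The five fold scores shown above
scores = [
    0.8446829053857807,
    0.8451798602834062,
    0.8332289785269917,
    0.8347808882778027,
    0.8517225691067114,
]

mean_auc = statistics.fmean(scores)
std_auc = statistics.pstdev(scores)  # population std, like np.std with ddof=0

summary = '%.3f +- %.3f' % (mean_auc, std_auc)
print(summary)  # 0.842 +- 0.007
```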

The last step is to train the final model on the full training data (df_full_train) and evaluate its predictions on the test dataset.

In [11]:
dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)
y_test = df_test.churn.values
 
auc = roc_auc_score(y_test, y_pred)
auc

0.8583517501381259

So far, our model only exists in the Jupyter notebook. To use it in a web service for customer scoring, we first need to save the model so it can be loaded later. To save the model, we’ll use pickle, a built-in Python library for serializing objects. 


In [12]:
import pickle

First, let’s name our model file. Here are two ways to do it:

In [13]:
output_file = 'model_C=%s.bin' % C
# or
output_file = f'model_C={C}.bin'
# Output: 'model_C=1.0.bin'


Next, we create and write to the file. The 'wb' mode means write in binary. We save both the DictVectorizer and the model, since the model alone can’t convert customer data into feature matrices. Don’t forget to close the file to ensure it’s properly saved:

In [14]:
f_out = open(output_file, 'wb')
pickle.dump((dv, model), f_out)
f_out.close()


A safer way is to use a with statement, which automatically closes the file:

In [15]:
with open(output_file, 'wb') as f_out:
    pickle.dump((dv, model), f_out)
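The following sketch shows a full save-and-restore round trip using the with statement. The tuple contents are stand-ins for (dv, model), assumed here only so the example runs without a trained model; pickle handles the real objects the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-ins for (dv, model); any picklable Python objects behave the same way
artifacts = ({"feature_names": ["tenure", "contract=two_year"]}, [0.1, -0.3])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_C=1.0.bin"

    # 'wb' = write binary; the file is closed when the block exits
    with open(path, "wb") as f_out:
        pickle.dump(artifacts, f_out)

    # 'rb' = read binary; unpack the tuple in the order it was saved
    with open(path, "rb") as f_in:
        restored_dv, restored_model = pickle.load(f_in)

print(restored_dv, restored_model)
```

Saving both objects as one tuple keeps them in a single file, so the vectorizer and the model cannot get out of sync.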


# Loading the model with Pickle

For loading the model we’ll also use pickle.

In [16]:
import pickle

In [17]:
model_file = 'model_C=1.0.bin'

# Use a ‘with’ statement to load the model. Here, ‘rb’ means read binary.
# The ‘load’ function from pickle returns the tuple we saved: the DictVectorizer and the model.

with open(model_file, 'rb') as f_in:
    dv, model = pickle.load(f_in)
 
dv, model

(DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))

After loading the model, let’s use it to score one sample customer. Before we can call the model, we need to turn the customer into a feature matrix. The DictVectorizer expects a list of dictionaries, which is why we wrap the single customer in a list.

In [18]:
customer = {
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'yes',
    'dependents': 'no',
    'phoneservice': 'no',
    'multiplelines': 'no_phone_service',
    'internetservice': 'dsl',
    'onlinesecurity': 'no',
    'onlinebackup': 'yes',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'no',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 29.85,
    'totalcharges': 29.85
}

In [19]:
X = dv.transform([customer])
X

array([[ 1.  ,  0.  ,  0.  ,  1.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ,
         0.  ,  1.  ,  0.  ,  0.  , 29.85,  0.  ,  1.  ,  0.  ,  0.  ,
         0.  ,  1.  ,  1.  ,  0.  ,  0.  ,  0.  ,  1.  ,  0.  ,  1.  ,
         0.  ,  0.  ,  1.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ,
         0.  ,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  , 29.85]])

We use the model’s predict_proba function to get the probability that this customer will churn. The output has one row per customer and one column per class; the churn probability is in the second column, so we select row 0 and column 1.

In [20]:
model.predict_proba(X)
# Returns an array of shape (1, 2): [[P(no churn), P(churn)]]
 
model.predict_proba(X)[0,1]
# Probability that this customer churns (row 0, column 1)

np.float64(0.6273177838726128)
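The column order of predict_proba follows the model's classes_ attribute, which is why column 1 corresponds to churn (label 1). A minimal sketch on a made-up one-feature dataset, not the churn data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, binary target (1 = churn)
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# classes_ lists the labels in the same order as the predict_proba columns
print(toy_model.classes_)

proba = toy_model.predict_proba(np.array([[2.0]]))
print(proba.shape)  # one row per sample, one column per class

# Row 0 is the only sample; column 1 is the probability of label 1
churn_probability = proba[0, 1]
print(churn_probability)
```

Each row of the output sums to 1, so the first column is simply one minus the churn probability.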

We can turn the Jupyter Notebook code into a Python file. One easy way to do this is to click “File” -> “Download as” and then “Python (.py)”.
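Once exported, the loading and scoring steps could be organized into small functions. The layout below is only a sketch of one possible structure; the function names `load_artifacts` and `score_customer` are hypothetical and do not come from the notebook.

```python
import pickle


def load_artifacts(model_file):
    """Load the (dv, model) tuple written earlier with pickle.dump."""
    # Only unpickle files you trust: pickle can execute code while loading
    with open(model_file, "rb") as f_in:
        dv, model = pickle.load(f_in)
    return dv, model


def score_customer(customer, dv, model):
    """Return the churn probability for one customer dictionary."""
    # The vectorizer expects a list of dictionaries, hence the one-item list
    X = dv.transform([customer])
    # Row 0 is the only customer; column 1 is the positive (churn) class
    return float(model.predict_proba(X)[0][1])
```

With the file saved earlier, usage would look like `dv, model = load_artifacts('model_C=1.0.bin')` followed by `score_customer(customer, dv, model)`.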