# Dataset

I will use the [Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn "Customer Churn") dataset from Kaggle.

# Plan for the Customer Churn Prediction Project
1. Initial data preparation
2. Splitting data up into train, validation, and test parts
3. Exploratory data analysis
4. Calculate risk ratio for all variables
5. Feature engineering
6. Training a logistic regression model to predict churn
7. Model interpretation
8. Using model

In [135]:
# Importing libraries
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt

# Display images inline
%matplotlib inline

In [136]:
# Reading in our dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [80]:
len(df)

7043

## Initial data preparation

In [81]:
# Exploring the first few rows
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [82]:
# Transposing dataframe to be able to see more information
df.head().T

,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


In [83]:
# Checking if all datatypes are correct
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

**TotalCharges** has dtype object, which is incorrect: the column holds monetary amounts and should be numeric.

In [84]:
# Force TotalCharges column to be numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Fill in the missing values with zeros
df['TotalCharges'] = df['TotalCharges'].fillna(0)
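To see why the column was read as object, it can help to look at what errors='coerce' does to values that are not valid numbers. The sketch below uses a small made-up series (not the real dataset) and assumes the offending entries are blank strings:

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; the blank string is an assumed
# example of a value that cannot be parsed as a number.
raw = pd.Series(['29.85', ' ', '108.15'])

# errors='coerce' turns unparseable entries into NaN instead of raising
parsed = pd.to_numeric(raw, errors='coerce')
n_missing = int(parsed.isna().sum())

# Same fill strategy as in the cell above
filled = parsed.fillna(0)

print(parsed.tolist())
print('missing after coercion:', n_missing)
print(filled.tolist())
```

Counting the NaN values before filling them shows how many rows are affected, which is worth checking before deciding that 0 is a sensible replacement.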

In [85]:
# Make columns naming convention uniform by lowercasing everything and replacing spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

string_columns = list(df.dtypes[df.dtypes == 'object'].index)

for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')

In [86]:
# Convert the categorical variable churn into a binary number: comparing with 'yes'
# gives a Boolean series, which we then cast to int (1 = churned, 0 = stayed)
df.churn = (df.churn == 'yes').astype(int)

In [87]:
# Explore the first few rows
df.head().T

,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


# Splitting data up into train, validation, and test parts

In [88]:
# Import the function used to split the data
from sklearn.model_selection import train_test_split

In [89]:
# Split the dataframe into two new dataframes: 20% of the data goes to test
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [90]:
# Split the df_train_full dataframe into train and validation
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)
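Together, the two calls above give roughly a 54% / 26% / 20% split of the full dataset. The arithmetic can be checked with a short sketch, assuming scikit-learn rounds the test share up to a whole number of rows (an assumption that matches the row counts reported elsewhere in this notebook):

```python
import math

n_total = 7043  # len(df) reported above

# First split: 20% of all rows go to the test set
n_test = math.ceil(n_total * 0.2)
n_train_full = n_total - n_test

# Second split: 33% of the remaining rows go to validation
n_val = math.ceil(n_train_full * 0.33)
n_train = n_train_full - n_val

for name, n in [('train', n_train), ('val', n_val), ('test', n_test)]:
    print(f'{name}: {n} rows ({n / n_total:.1%})')
```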

In [91]:
# Take the column with the target variable, churn, and save it outside the dataframe
y_train = df_train.churn.values
y_val = df_val.churn.values

In [92]:
# Delete columns to make sure we do not accidentally use the churn variable as a feature during training
del df_train['churn']
del df_val['churn']

## Exploratory data analysis

In [93]:
# Check for missing values
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

No missing values are reported, which is expected because the missing TotalCharges values were already filled with zeros.

In [94]:
# Check the distribution of values in the target variable
df_train_full.churn.value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [95]:
# Calculate the global mean
global_mean = df_train_full.churn.mean()
round(global_mean, 3)

0.27

In [96]:
# Create a list of categorical and a list of numerical variables
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']

In [97]:
# Check the number of unique values in each variable
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## Feature importance analysis
We will look at some features, such as gender and the presence/absence of a partner, compute the churn rate for each group, and compare it to the global churn rate.

In [98]:
# Compute churn rate for all female customers
female_mean = df_train_full[df_train_full.gender == 'female'].churn.mean()
print('gender == female:', round(female_mean, 3))

# Compute churn rate for all male customers
male_mean = df_train_full[df_train_full.gender == 'male'].churn.mean()
print('gender == male:  ', round(male_mean, 3))

gender == female: 0.277
gender == male:   0.263


The difference in churn rate between male and female customers is quite small, which suggests that gender does not help much in identifying whether customers churn.

In [99]:
female_mean / global_mean

1.0253955354648652

In [100]:
male_mean / global_mean

0.9749802969838747

In [101]:
# Compute churn rate for customers who have a partner
partner_yes = df_train_full[df_train_full.partner == 'yes'].churn.mean()
print('partner == yes:', round(partner_yes, 3))

# Compute churn rate for customers who do not have a partner
partner_no = df_train_full[df_train_full.partner == 'no'].churn.mean()
print('partner == no :', round(partner_no, 3))

partner == yes: 0.205
partner == no : 0.33


There is a noticeable difference between customers with and without a partner: those with a partner churn less (20.5% vs. 33%). Thus, the partner variable is useful for predicting churn.

In [102]:
partner_yes / global_mean

0.7594724924338315

In [103]:
partner_no / global_mean

1.2216593879412643

In [104]:
# Calculate the average churn rate for each group
df_group = df_train_full.groupby(by='gender').churn.agg(['mean'])
# Calculate the difference between group churn rate and global rate
df_group['diff'] = df_group['mean'] - global_mean
# Calculate the risk of churning
df_group['risk'] = df_group['mean'] / global_mean
df_group

gender,mean,diff,risk
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


# Calculate risk ratio for all variables

In [105]:
from IPython.display import display

In [106]:
global_mean = df_train_full.churn.mean()
global_mean

0.26996805111821087

In [107]:
for col in categorical:
    df_group = df_train_full.groupby(by=col).churn.agg(['mean'])
    df_group['diff'] = df_group['mean'] - global_mean
    df_group['risk'] = df_group['mean'] / global_mean
    display(df_group)

gender,mean,diff,risk
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


seniorcitizen,mean,diff,risk
0,0.24227,-0.027698,0.897403
1,0.413377,0.143409,1.531208


partner,mean,diff,risk
no,0.329809,0.059841,1.221659
yes,0.205033,-0.064935,0.759472


dependents,mean,diff,risk
no,0.31376,0.043792,1.162212
yes,0.165666,-0.104302,0.613651


phoneservice,mean,diff,risk
no,0.241316,-0.028652,0.89387
yes,0.273049,0.003081,1.011412


multiplelines,mean,diff,risk
no,0.257407,-0.012561,0.953474
no_phone_service,0.241316,-0.028652,0.89387
yes,0.290742,0.020773,1.076948


internetservice,mean,diff,risk
dsl,0.192347,-0.077621,0.712482
fiber_optic,0.425171,0.155203,1.574895
no,0.077805,-0.192163,0.288201


onlinesecurity,mean,diff,risk
no,0.420921,0.150953,1.559152
no_internet_service,0.077805,-0.192163,0.288201
yes,0.153226,-0.116742,0.56757


onlinebackup,mean,diff,risk
no,0.404323,0.134355,1.497672
no_internet_service,0.077805,-0.192163,0.288201
yes,0.217232,-0.052736,0.80466


deviceprotection,mean,diff,risk
no,0.395875,0.125907,1.466379
no_internet_service,0.077805,-0.192163,0.288201
yes,0.230412,-0.039556,0.85348


techsupport,mean,diff,risk
no,0.418914,0.148946,1.551717
no_internet_service,0.077805,-0.192163,0.288201
yes,0.159926,-0.110042,0.59239


streamingtv,mean,diff,risk
no,0.342832,0.072864,1.269897
no_internet_service,0.077805,-0.192163,0.288201
yes,0.302723,0.032755,1.121328


streamingmovies,mean,diff,risk
no,0.338906,0.068938,1.255358
no_internet_service,0.077805,-0.192163,0.288201
yes,0.307273,0.037305,1.138182


contract,mean,diff,risk
month-to-month,0.431701,0.161733,1.599082
one_year,0.120573,-0.149395,0.446621
two_year,0.028274,-0.241694,0.10473


paperlessbilling,mean,diff,risk
no,0.172071,-0.097897,0.637375
yes,0.338151,0.068183,1.25256


paymentmethod,mean,diff,risk
bank_transfer_(automatic),0.168171,-0.101797,0.622928
credit_card_(automatic),0.164339,-0.10563,0.608733
electronic_check,0.45589,0.185922,1.688682
mailed_check,0.19387,-0.076098,0.718121


Main conclusions:
* There is not much difference in churn between males (0.97) and females (1.03)
* Senior citizens tend to churn more than non-seniors (1.53 vs. 0.90)
* People with a partner tend to churn less than people without a partner (0.76 vs. 1.22)
* Phone service makes little difference: people with phone service churn at about the average rate, and people without it churn slightly less (1.01 vs. 0.89)
* People with tech support churn less than those without tech support (0.59 vs. 1.55)
* People with a month-to-month contract churn far more than people with a two-year contract (1.60 vs. 0.10)
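The numbers in this list are risk ratios: a group's churn rate divided by the global churn rate, so values above 1 mean the group churns more than average and values below 1 mean it churns less. As a minimal sketch, the loop above can be wrapped in a reusable function; `risk_table` and the toy data are hypothetical names introduced only for illustration:

```python
import pandas as pd

def risk_table(df, column, target='churn'):
    """Churn rate per group, plus its difference from and ratio to the global rate."""
    global_rate = df[target].mean()
    table = df.groupby(column)[target].agg(['mean'])
    table['diff'] = table['mean'] - global_rate
    table['risk'] = table['mean'] / global_rate
    return table

# Made-up data: two customers per plan, only one churner overall
toy = pd.DataFrame({
    'plan': ['a', 'a', 'b', 'b'],
    'churn': [1, 0, 0, 0],
})

toy_risk = risk_table(toy, 'plan')
print(toy_risk)
```

Here plan a churns at twice the global rate (risk 2.0) and plan b does not churn at all (risk 0.0).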


In [108]:
# Import Scikit-learn function that will calculate the mutual info score
from sklearn.metrics import mutual_info_score

In [109]:
def calculate_mi(series):
    return mutual_info_score(series, df_train_full.churn)

# Apply the function defined above to each categorical column of the dataset
df_mi = df_train_full[categorical].apply(calculate_mi)
# Sort the result in descending order, so the most informative features come first
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')


display(df_mi.head())
display(df_mi.tail())

,MI
contract,0.09832
onlinesecurity,0.063085
techsupport,0.061032
internetservice,0.055868
onlinebackup,0.046923


,MI
partner,0.009968
seniorcitizen,0.00941
multiplelines,0.000857
phoneservice,0.000229
gender,0.000117


We can see that contract, onlinesecurity and techsupport are the most important features, while gender is the least important feature.
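To make the score less of a black box, here is a minimal sketch of what mutual information measures, written in plain Python on made-up labels. The function name `mutual_information` is hypothetical; scikit-learn's mutual_info_score also uses the natural logarithm, so it should return the same values for these inputs:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long label sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

contract = ['monthly', 'monthly', 'yearly', 'yearly']
churn_dependent = [1, 1, 0, 0]    # fully determined by contract
churn_independent = [1, 0, 1, 0]  # unrelated to contract

mi_dependent = mutual_information(contract, churn_dependent)
mi_independent = mutual_information(contract, churn_independent)
print(round(mi_dependent, 3), round(mi_independent, 3))
```

The fully dependent pair scores ln 2 (about 0.693), while the independent pair scores 0, which is why features with an MI close to zero, such as gender, carry almost no information about churn.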

The mutual information score used above expects categorical (discrete) values, so we cannot apply it directly to our numerical variables. What we can measure instead is the linear dependency between the binary target variable and each numerical variable, using Pearson's correlation coefficient.

In [110]:
df_train_full[numerical].corrwith(df_train_full.churn).to_frame('correlation')

,correlation
tenure,-0.351885
monthlycharges,0.196805
totalcharges,-0.196353


Conclusions:
* tenure has the strongest correlation of the three, and it is negative (about -0.35): as tenure grows, churn rate goes down
* monthlycharges has a positive correlation (about 0.20): the more customers pay per month, the more likely they are to churn
* totalcharges has a negative correlation (about -0.20): the longer people stay with the company, the more they have paid in total, so it is less likely that they will churn
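With a 0/1 target, the sign of Pearson's coefficient tells which group has the larger average value of the feature: a negative coefficient means churners have a lower average than non-churners. The sketch below illustrates this on made-up numbers (not the real dataset):

```python
import numpy as np

# Made-up data: churners (1) tend to have short tenure
tenure = np.array([1, 2, 3, 40, 50, 60])
churned = np.array([1, 1, 1, 0, 0, 0])

# Pearson correlation between a numeric feature and a 0/1 target
r = np.corrcoef(tenure, churned)[0, 1]

# The difference in group means points the same way as the sign of r
mean_gap = tenure[churned == 1].mean() - tenure[churned == 0].mean()

print(round(r, 3), mean_gap)
```

This is the same relationship that the group means in the next cell show for the real data.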

In [111]:
df_train_full.groupby(by='churn')[numerical].mean()

churn,tenure,monthlycharges,totalcharges
0,37.531972,61.176477,2548.021627
1,18.070348,74.521203,1545.689415


# Feature engineering

## One-hot encoding
One-hot encoding represents a categorical value as a vector with one element per possible category: all elements are 0 except the one that corresponds to the observed category, which is 1.

In [112]:
from sklearn.feature_extraction import DictVectorizer

In [113]:
# Convert dataframe into a list of dictionaries
train_dict = df_train[categorical + numerical].to_dict(orient='records')

In [114]:
# Inspect the first element
train_dict[0]

{'gender': 'male',
 'seniorcitizen': 0,
 'partner': 'yes',
 'dependents': 'no',
 'phoneservice': 'yes',
 'multiplelines': 'no',
 'internetservice': 'dsl',
 'onlinesecurity': 'yes',
 'onlinebackup': 'yes',
 'deviceprotection': 'yes',
 'techsupport': 'yes',
 'streamingtv': 'yes',
 'streamingmovies': 'yes',
 'contract': 'two_year',
 'paperlessbilling': 'yes',
 'paymentmethod': 'bank_transfer_(automatic)',
 'tenure': 71,
 'monthlycharges': 86.1,
 'totalcharges': 6045.9}

In [115]:
# Create a DictVectorizer that returns a dense NumPy array
dv = DictVectorizer(sparse=False)
# Fit the vectorizer to the list of dictionaries we've created previously
dv.fit(train_dict)

DictVectorizer(sparse=False)

In [116]:
# convert dictionaries to a matrix
X_train = dv.transform(train_dict)

In [117]:
# show the shape of the matrix
X_train.shape

(3774, 45)

In [133]:
# get the names of all columns
dv.get_feature_names_out()

array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'dependents=no', 'dependents=yes',
       'deviceprotection=no', 'deviceprotection=no_internet_service',
       'deviceprotection=yes', 'gender=female', 'gender=male',
       'internetservice=dsl', 'internetservice=fiber_optic',
       'internetservice=no', 'monthlycharges', 'multiplelines=no',
       'multiplelines=no_phone_service', 'multiplelines=yes',
       'onlinebackup=no', 'onlinebackup=no_internet_service',
       'onlinebackup=yes', 'onlinesecurity=no',
       'onlinesecurity=no_internet_service', 'onlinesecurity=yes',
       'paperlessbilling=no', 'paperlessbilling=yes', 'partner=no',
       'partner=yes', 'paymentmethod=bank_transfer_(automatic)',
       'paymentmethod=credit_card_(automatic)',
       'paymentmethod=electronic_check', 'paymentmethod=mailed_check',
       'phoneservice=no', 'phoneservice=yes', 'seniorcitizen',
       'streamingmovies=no', 'streamingmovies=no_internet_service',

## Training logistic regression to predict churn
Logistic regression is also a linear model, but unlike linear regression, it’s a classification model, not regression, even though the name might suggest that.

In this case, yi = 1 means that the customer churned, and yi = 0 means that the customer stayed.
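What turns the linear model into a classifier is the sigmoid function, which squeezes the linear score (bias plus weighted sum of features) into a probability between 0 and 1. A minimal sketch with made-up weights follows; `churn_probability` is a hypothetical helper, not part of scikit-learn:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def churn_probability(features, weights, bias):
    """Linear score followed by the sigmoid (hypothetical helper)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Made-up weights and features, only to show the mechanics
p = churn_probability(features=[1.0, 0.0], weights=[2.0, -1.0], bias=-2.0)
print(p)
```

A score of 0 maps to a probability of 0.5, positive scores map above 0.5, and negative scores map below it.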

In [119]:
from sklearn.linear_model import LogisticRegression

In [120]:
# create the model and train it using fit method
model = LogisticRegression(solver='liblinear', random_state=1)
model.fit(X_train, y_train)

LogisticRegression(random_state=1, solver='liblinear')

In [121]:
# apply one hot encoding to all categorical variables from the validation part of the dataset
val_dict = df_val[categorical + numerical].to_dict(orient='records')
# convert dictionaries to a matrix
X_val = dv.transform(val_dict)

In [122]:
model.predict_proba(X_val)

array([[0.76508712, 0.23491288],
       [0.73112726, 0.26887274],
       [0.68054585, 0.31945415],
       ...,
       [0.94274521, 0.05725479],
       [0.38476843, 0.61523157],
       [0.93872794, 0.06127206]])

The result of predict_proba is a two-dimensional NumPy array, or a two-column matrix. The first column of the array contains the probability that the target is negative (no churn), and the second column contains the probability that the target is positive (churn).

In [123]:
# Only one column is sufficient, so we slice the array to keep the second column (probability of churn)
y_pred = model.predict_proba(X_val)[:, 1]

In [124]:
y_pred

array([0.23491288, 0.26887274, 0.31945415, ..., 0.05725479, 0.61523157,
       0.06127206])

In [125]:
# To get the binary predictions, we take the probabilities and cut them above a certain threshold.
# If the probability for a customer is higher than this threshold, we predict churn, otherwise, not churn
churn = y_pred > 0.5

In [126]:
# calculate accuracy of the predictions
(y_val == churn).mean()

0.8016129032258065
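An accuracy of about 80% is easier to judge next to a trivial baseline. Since roughly 27% of customers in the training data churn, a model that always predicts "no churn" would already be right about 73% of the time on data with the same class balance. The sketch below shows the comparison on made-up arrays; `y_true` and `y_prob` are hypothetical stand-ins for y_val and y_pred:

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.6, 0.8, 0.3, 0.4, 0.1, 0.2, 0.9, 0.3])

# Model accuracy at the 0.5 threshold
y_hat = (y_prob > 0.5).astype(int)
model_accuracy = (y_hat == y_true).mean()

# Baseline: always predict the majority class (0, no churn)
baseline_accuracy = (y_true == 0).mean()

print('model:   ', model_accuracy)
print('baseline:', baseline_accuracy)
```

Running the same two lines on y_val and y_pred would show how much of the 80% comes from the model rather than from the class imbalance.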

## Model interpretation

In [127]:
# Get the bias term
model.intercept_[0]

-0.12198931607030322

In [128]:
# See which feature is associated with each weight by zipping feature names together with coefficients
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))

{'contract=month-to-month': 0.563,
 'contract=one_year': -0.086,
 'contract=two_year': -0.599,
 'dependents=no': -0.03,
 'dependents=yes': -0.092,
 'deviceprotection=no': 0.1,
 'deviceprotection=no_internet_service': -0.116,
 'deviceprotection=yes': -0.106,
 'gender=female': -0.027,
 'gender=male': -0.095,
 'internetservice=dsl': -0.323,
 'internetservice=fiber_optic': 0.317,
 'internetservice=no': -0.116,
 'monthlycharges': 0.001,
 'multiplelines=no': -0.168,
 'multiplelines=no_phone_service': 0.127,
 'multiplelines=yes': -0.081,
 'onlinebackup=no': 0.136,
 'onlinebackup=no_internet_service': -0.116,
 'onlinebackup=yes': -0.142,
 'onlinesecurity=no': 0.258,
 'onlinesecurity=no_internet_service': -0.116,
 'onlinesecurity=yes': -0.264,
 'paperlessbilling=no': -0.213,
 'paperlessbilling=yes': 0.091,
 'partner=no': -0.048,
 'partner=yes': -0.074,
 'paymentmethod=bank_transfer_(automatic)': -0.027,
 'paymentmethod=credit_card_(automatic)': -0.136,
 'paymentmethod=electronic_check': 0.175,


### Let’s redo the same steps we did for training, this time using a smaller set of features:

In [129]:
subset = ['contract', 'tenure', 'totalcharges']
train_dict_small = df_train[subset].to_dict(orient='records')
dv_small = DictVectorizer(sparse=False)
dv_small.fit(train_dict_small)

X_small_train = dv_small.transform(train_dict_small)

# Check which names the small model will use
dv_small.get_feature_names_out()

array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'tenure', 'totalcharges'], dtype=object)

In [130]:
# Train the small model on this subset of features
model_small = LogisticRegression(solver='liblinear', random_state=1)
model_small.fit(X_small_train, y_train)

LogisticRegression(random_state=1, solver='liblinear')

In [131]:
# Check the bias term
model_small.intercept_[0]

-0.5772299145133957

In [137]:
# Check the other weights
dict(zip(dv_small.get_feature_names_out(), model_small.coef_[0].round(3)))

{'contract=month-to-month': 0.866,
 'contract=one_year': -0.327,
 'contract=two_year': -1.117,
 'tenure': -0.094,
 'totalcharges': 0.001}
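These weights are enough to score a customer by hand: add the bias, the weight of the customer's contract type, and the weighted tenure and total charges, then pass the sum through the sigmoid. The sketch below uses the rounded values printed above, so the result only approximates what model_small.predict_proba would return (the totalcharges weight in particular is heavily rounded). The example customer is made up:

```python
import math

bias = -0.577
weights = {
    'contract=month-to-month': 0.866,
    'contract=one_year': -0.327,
    'contract=two_year': -1.117,
    'tenure': -0.094,
    'totalcharges': 0.001,
}

# Hypothetical customer: month-to-month contract, 1 month of tenure, 85.7 paid in total
customer_features = {
    'contract=month-to-month': 1,
    'tenure': 1,
    'totalcharges': 85.7,
}

score = bias + sum(weights[name] * value for name, value in customer_features.items())
probability = 1 / (1 + math.exp(-score))
print(round(score, 3), round(probability, 3))
```

The month-to-month weight pushes the score up, while each month of tenure pulls it down by about 0.094, so the same customer with a longer tenure would get a noticeably lower probability.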

In [138]:
val_dict_small = df_val[subset].to_dict(orient='records')
X_small_val = dv_small.transform(val_dict_small)

In [139]:
y_pred_small = model_small.predict_proba(X_small_val)[:, 1]

## Using the model

In [None]:
# Take a customer we want to score and put all the variable values in a dictionary
customer = {
    'customerid': '8879-zkjof',
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'no',
    'dependents': 'no',
    'tenure': 41,
    'phoneservice': 'yes',
    'multiplelines': 'no',
    'internetservice': 'dsl',
    'onlinesecurity': 'yes',
    'onlinebackup': 'no',
    'deviceprotection': 'yes',
    'techsupport': 'yes',
    'streamingtv': 'yes',
    'streamingmovies': 'yes',
    'contract': 'one_year',
    'paperlessbilling': 'yes',
    'paymentmethod': 'bank_transfer_(automatic)',
    'monthlycharges': 79.85,
    'totalcharges': 3320.75,
}

In [142]:
# Convert this dictionary into a matrix
X_test = dv.transform([customer])
# Make a prediction for one customer (we need the first row and second column)
model.predict_proba(X_test)[0, 1]

0.8321664335468258

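The output above gives an 83% probability of churn, which would call for sending the customer a promotional email. Note, however, that the feature vector printed below contains monthlycharges = 85.7 and a month-to-month contract, which match the customer defined further down rather than this one (whose defining cell is marked In [None], i.e. it was not run). The 0.83 score therefore most likely belongs to that later customer, and the cells should be re-run in order to get the score for this one-year-contract customer.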

In [143]:
print(list(X_test[0]))

[1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 85.7, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 85.7]


In [144]:
customer = {
    'gender': 'female',
    'seniorcitizen': 1,
    'partner': 'no',
    'dependents': 'no',
    'phoneservice': 'yes',
    'multiplelines': 'yes',
    'internetservice': 'fiber_optic',
    'onlinesecurity': 'no',
    'onlinebackup': 'no',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'yes',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 85.7,
    'totalcharges': 85.7
}

In [145]:
X_test = dv.transform([customer])
model.predict_proba(X_test)[0, 1]

0.8321664335468258

This customer gets a very high churn probability (about 83%), consistent with the risk factors found earlier: a month-to-month contract, electronic check payments, and no tech support.
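The two scoring steps (vectorize, then predict) can be wrapped in a small helper so that every customer is scored the same way. This is a sketch only: `predict_single` is a hypothetical function, and the tiny model below is trained on made-up data purely so the example runs on its own. In the notebook, the fitted dv and model would be passed in instead:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def predict_single(customer, dv, model):
    """Return the churn probability for one customer dictionary."""
    X = dv.transform([customer])
    return model.predict_proba(X)[0, 1]

# Tiny made-up training set, only so the example is self-contained
toy_customers = [
    {'contract': 'month-to-month', 'tenure': 1},
    {'contract': 'month-to-month', 'tenure': 3},
    {'contract': 'month-to-month', 'tenure': 2},
    {'contract': 'two_year', 'tenure': 60},
    {'contract': 'two_year', 'tenure': 48},
    {'contract': 'one_year', 'tenure': 30},
]
toy_churn = [1, 1, 1, 0, 0, 0]

toy_dv = DictVectorizer(sparse=False)
toy_model = LogisticRegression(solver='liblinear', random_state=1)
toy_model.fit(toy_dv.fit_transform(toy_customers), toy_churn)

p_new = predict_single({'contract': 'month-to-month', 'tenure': 2}, toy_dv, toy_model)
p_loyal = predict_single({'contract': 'two_year', 'tenure': 55}, toy_dv, toy_model)
print(round(p_new, 3), round(p_loyal, 3))
```

Passing the customer dictionary as an argument, instead of relying on a shared customer variable, also avoids scoring a stale customer when cells are run out of order.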