# **Exercise**

Data from https://www.kaggle.com/blastchar/telco-customer-churn

###### The project aims to identify customers who are likely to churn, that is, to stop using a service. Each customer receives a score that estimates the probability of churning. Based on this score, the company can send an email with discounts or other promotions to at-risk customers to discourage them from leaving.

### Downloading the data 

In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv

--2021-10-03 19:44:05--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 977501 (955K) [text/plain]
Saving to: ‘WA_Fn-UseC_-Telco-Customer-Churn.csv’


2021-10-03 19:44:06 (95.9 MB/s) - ‘WA_Fn-UseC_-Telco-Customer-Churn.csv’ saved [977501/977501]



In [2]:
# Import the required libraries
import pandas as pd 
import numpy as np

In [3]:
# Read the CSV file into a data frame
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [4]:
len(df)

7043

### Initial data preparation

In [5]:
# Transpose the first rows so that every column is easy to read
df.head().T

,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


In [6]:
# To see the data types
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [7]:
# Convert TotalCharges from object to numeric; values that cannot be parsed become NaN and are then replaced with 0
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)

In [8]:
df['TotalCharges'].dtype

dtype('float64')
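
The conversion above has two steps: values that cannot be parsed become missing, and missing values are then replaced with 0. A minimal plain-Python sketch of the same idea follows. The sample strings are made up, and the helper `coerce_to_float` is hypothetical (it is not part of pandas):

```python
import math

def coerce_to_float(value):
    """Return value as a float, or NaN when it cannot be parsed (hypothetical helper)."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

# Made-up sample values; the blank string stands in for an unparseable entry
raw = ["29.85", "1889.5", " ", "108.15"]

coerced = [coerce_to_float(v) for v in raw]               # unparseable -> NaN
filled = [0.0 if math.isnan(v) else v for v in coerced]   # NaN -> 0
print(filled)
```

Replacing missing values with 0 is a modelling choice rather than a neutral default, so it is worth checking how many rows are affected before relying on it.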

In [9]:
# Normalize the column names: lowercase, with spaces replaced by underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")


In [10]:
# Normalize the values of the string columns in the same way
string_col = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_col:
    df[col] = df[col].str.lower().str.replace(" ", "_")


In [11]:
# The data after cleaning
df

,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,7590-vhveg,female,0,yes,no,1,no,no_phone_service,dsl,no,...,no,no,no,no,month-to-month,yes,electronic_check,29.85,29.85,no
1,5575-gnvde,male,0,no,no,34,yes,no,dsl,yes,...,yes,no,no,no,one_year,no,mailed_check,56.95,1889.50,no
2,3668-qpybk,male,0,no,no,2,yes,no,dsl,yes,...,no,no,no,no,month-to-month,yes,mailed_check,53.85,108.15,yes
3,7795-cfocw,male,0,no,no,45,no,no_phone_service,dsl,yes,...,yes,yes,no,no,one_year,no,bank_transfer_(automatic),42.30,1840.75,no
4,9237-hqitu,female,0,no,no,2,yes,no,fiber_optic,no,...,no,no,no,no,month-to-month,yes,electronic_check,70.70,151.65,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-resvb,male,0,yes,yes,24,yes,yes,dsl,yes,...,yes,yes,yes,yes,one_year,yes,mailed_check,84.80,1990.50,no
7039,2234-xaduh,female,0,yes,yes,72,yes,yes,fiber_optic,no,...,yes,no,yes,yes,one_year,yes,credit_card_(automatic),103.20,7362.90,no
7040,4801-jzazl,female,0,yes,yes,11,no,no_phone_service,dsl,yes,...,no,no,no,no,month-to-month,yes,electronic_check,29.60,346.45,no
7041,8361-ltmkd,male,1,yes,no,4,yes,yes,fiber_optic,no,...,no,no,no,no,month-to-month,yes,mailed_check,74.40,306.60,yes


In [12]:
# Encode the target as an integer: 1 = churned, 0 = stayed
df.churn = (df.churn == "yes").astype(int)

In [13]:
df.head().T

,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


In [14]:
# Splitting the data with scikit-learn
from sklearn.model_selection import train_test_split

In [15]:
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [16]:
# 25% of the remaining 80% gives a 60/20/20 train/validation/test split
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)

In [17]:
# Extract the target values for each split
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values
# Remove the target column from the feature data frames
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]

In [18]:
len(df_train), len(df_val) , len(df_test)

(4225, 1409, 1409)
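
The sizes above follow from applying the two fractions in sequence. The sketch below reproduces the arithmetic in plain Python. The helper `split_sizes` is hypothetical, and it assumes the test part is rounded up while the training part receives the remainder, which is how scikit-learn appears to behave for a fractional `test_size`:

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test part is rounded up and the train part gets the rest.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_full, n_test = split_sizes(7043, 0.2)       # first split: 80% / 20%
n_train, n_val = split_sizes(n_full, 0.25)    # second split: 25% of the 80%
print(n_train, n_val, n_test)
```

Because 0.25 × 0.8 = 0.2, the validation and test sets end up the same size.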

### Exploratory data analysis

In [19]:
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [20]:
df_train_full.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int64
dtype: object

In [21]:
numeric = ["tenure", "monthlycharges","totalcharges"]
numeric

['tenure', 'monthlycharges', 'totalcharges']

In [22]:
df_train_full.columns

Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

In [23]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']
categorical

['gender',
 'seniorcitizen',
 'partner',
 'dependents',
 'phoneservice',
 'multiplelines',
 'internetservice',
 'onlinesecurity',
 'onlinebackup',
 'deviceprotection',
 'techsupport',
 'streamingtv',
 'streamingmovies',
 'contract',
 'paperlessbilling',
 'paymentmethod']

In [24]:
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

### Feature importance

In [25]:
from IPython.display import display  # renders each table produced inside the loop below

In [26]:
global_mean = df_train_full.churn.mean().round(3)
global_mean

0.27

In [27]:
# diff > 0: the group churns more often than average; diff < 0: less often
# risk > 1: the group churns more often than average; risk < 1: less often

for col in categorical:
    df_group = df_train_full.groupby(by = col).churn.agg(["mean"])
    df_group["diff"] = df_group["mean"] - global_mean
    df_group["risk"] = df_group["mean"] / global_mean
    display(df_group)


gender,mean,diff,risk
female,0.276824,0.006824,1.025274
male,0.263214,-0.006786,0.974865


seniorcitizen,mean,diff,risk
0,0.24227,-0.02773,0.897297
1,0.413377,0.143377,1.531027


partner,mean,diff,risk
no,0.329809,0.059809,1.221515
yes,0.205033,-0.064967,0.759383


dependents,mean,diff,risk
no,0.31376,0.04376,1.162074
yes,0.165666,-0.104334,0.613579


phoneservice,mean,diff,risk
no,0.241316,-0.028684,0.893764
yes,0.273049,0.003049,1.011292


multiplelines,mean,diff,risk
no,0.257407,-0.012593,0.953361
no_phone_service,0.241316,-0.028684,0.893764
yes,0.290742,0.020742,1.07682


internetservice,mean,diff,risk
dsl,0.192347,-0.077653,0.712398
fiber_optic,0.425171,0.155171,1.574709
no,0.077805,-0.192195,0.288167


onlinesecurity,mean,diff,risk
no,0.420921,0.150921,1.558967
no_internet_service,0.077805,-0.192195,0.288167
yes,0.153226,-0.116774,0.567503


onlinebackup,mean,diff,risk
no,0.404323,0.134323,1.497494
no_internet_service,0.077805,-0.192195,0.288167
yes,0.217232,-0.052768,0.804564


deviceprotection,mean,diff,risk
no,0.395875,0.125875,1.466205
no_internet_service,0.077805,-0.192195,0.288167
yes,0.230412,-0.039588,0.853379


techsupport,mean,diff,risk
no,0.418914,0.148914,1.551534
no_internet_service,0.077805,-0.192195,0.288167
yes,0.159926,-0.110074,0.59232


streamingtv,mean,diff,risk
no,0.342832,0.072832,1.269747
no_internet_service,0.077805,-0.192195,0.288167
yes,0.302723,0.032723,1.121195


streamingmovies,mean,diff,risk
no,0.338906,0.068906,1.255209
no_internet_service,0.077805,-0.192195,0.288167
yes,0.307273,0.037273,1.138047


contract,mean,diff,risk
month-to-month,0.431701,0.161701,1.598893
one_year,0.120573,-0.149427,0.446568
two_year,0.028274,-0.241726,0.104718


paperlessbilling,mean,diff,risk
no,0.172071,-0.097929,0.6373
yes,0.338151,0.068151,1.252412


paymentmethod,mean,diff,risk
bank_transfer_(automatic),0.168171,-0.101829,0.622854
credit_card_(automatic),0.164339,-0.105661,0.608661
electronic_check,0.45589,0.18589,1.688482
mailed_check,0.19387,-0.07613,0.718036
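
To make the `diff` and `risk` columns concrete, here is a small worked example in plain Python. The data is made up and the helper `churn_risk` is hypothetical; it only mirrors the per-group mean, difference and ratio computed in the loop above:

```python
def churn_risk(groups, labels):
    """Per-group churn rate, its difference from the global rate, and its ratio to it."""
    global_rate = sum(labels) / len(labels)
    result = {}
    for group in sorted(set(groups)):
        values = [y for g, y in zip(groups, labels) if g == group]
        rate = sum(values) / len(values)
        result[group] = {
            "mean": rate,
            "diff": rate - global_rate,   # > 0 means above-average churn
            "risk": rate / global_rate,   # > 1 means above-average churn
        }
    return result

# Made-up toy data: 1 = churned, 0 = stayed
contract = ["monthly"] * 4 + ["yearly"] * 4
churn = [1, 1, 1, 0, 0, 0, 0, 0]

stats = churn_risk(contract, churn)
for group, row in stats.items():
    print(group, row)
```

In this toy data the global churn rate is 0.375, so the "monthly" group (rate 0.75) has a risk of 2.0, while the "yearly" group (rate 0) has a risk of 0.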


### Feature Importance: Mutual Information 

In [28]:
from sklearn.metrics import mutual_info_score

In [29]:
def mutual_info_churn_score(series):
    return mutual_info_score( series, df_train_full.churn)

In [30]:
mi = df_train_full[categorical].apply(mutual_info_churn_score)
mi.sort_values(ascending = False)
# The most informative variables come first, for example contract

contract            0.098320
onlinesecurity      0.063085
techsupport         0.061032
internetservice     0.055868
onlinebackup        0.046923
deviceprotection    0.043453
paymentmethod       0.043210
streamingtv         0.031853
streamingmovies     0.031581
paperlessbilling    0.017589
dependents          0.012346
partner             0.009968
seniorcitizen       0.009410
multiplelines       0.000857
phoneservice        0.000229
gender              0.000117
dtype: float64
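
For intuition about what these scores measure, the sketch below computes mutual information between two discrete sequences by hand. It uses made-up data and the natural logarithm, which, as far as I know, is also what `mutual_info_score` uses. The function `mutual_information` is a hypothetical stand-in, not the scikit-learn implementation:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        mi += p_xy * math.log(count * n / (count_x[x] * count_y[y]))
    return mi

churn = [1, 1, 0, 0]                     # made-up target
informative = ["a", "a", "b", "b"]       # fully determines churn
uninformative = ["a", "b", "a", "b"]     # independent of churn

mi_informative = mutual_information(informative, churn)
mi_uninformative = mutual_information(uninformative, churn)
print(mi_informative, mi_uninformative)
```

A variable that fully determines a balanced binary target scores ln 2 (about 0.693), and an independent variable scores 0. The values in the table above lie between these extremes, much closer to 0.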

### Feature Importance: Correlation 

In [31]:
# Correlation coefficient between each numerical variable and churn
df_train_full[numeric].corrwith(df_train_full.churn)

tenure           -0.351885
monthlycharges    0.196805
totalcharges     -0.196353
dtype: float64

In [32]:
df_train_full.groupby(by='churn')[numeric].mean()


churn,tenure,monthlycharges,totalcharges
0,37.531972,61.176477,2548.021627
1,18.070348,74.521203,1545.689415
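
The negative coefficient for tenure and the lower mean tenure among churned customers express the same pattern. The sketch below computes a Pearson correlation by hand on made-up numbers. It assumes that `corrwith` is used with its default Pearson method, and `pearson` is a hypothetical helper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy data: short-tenure customers churn, long-tenure customers stay
tenure = [1, 2, 5, 30, 45, 60]
churn = [1, 1, 1, 0, 0, 0]

r = pearson(tenure, churn)
print(round(r, 3))
```

A negative value means that churn becomes less frequent as the variable grows; a positive value, as for monthly charges above, means the opposite.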


### One-Hot Encoding
- Use scikit-learn to encode the categorical features

In [33]:
from sklearn.feature_extraction import DictVectorizer

In [34]:
train_dict = df_train[categorical + numeric].to_dict(orient='records')
train_dict[0]

{'gender': 'female',
 'seniorcitizen': 0,
 'partner': 'yes',
 'dependents': 'yes',
 'phoneservice': 'yes',
 'multiplelines': 'yes',
 'internetservice': 'fiber_optic',
 'onlinesecurity': 'no',
 'onlinebackup': 'yes',
 'deviceprotection': 'yes',
 'techsupport': 'no',
 'streamingtv': 'yes',
 'streamingmovies': 'yes',
 'contract': 'one_year',
 'paperlessbilling': 'yes',
 'paymentmethod': 'credit_card_(automatic)',
 'tenure': 58,
 'monthlycharges': 105.2,
 'totalcharges': 6225.4}

In [35]:
df_train[["gender","contract"]]

,gender,contract
5323,female,one_year
3026,female,month-to-month
1860,male,two_year
5251,female,month-to-month
2642,female,month-to-month
...,...,...
3977,male,two_year
6273,female,month-to-month
3790,male,one_year
5712,female,month-to-month


In [36]:
dv = DictVectorizer(sparse = False)
x_train = dv.fit_transform(train_dict)

val_dict = df_val[categorical + numeric].to_dict(orient='records')
# Reuse the encoder fitted on the training data; do not refit it on the validation data
x_val = dv.transform(val_dict)

x_train.shape , x_val.shape

((4225, 45), (1409, 45))
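
The key idea of this step is that the set of output columns is learned from the training records only and then reused for any other data. The plain-Python sketch below illustrates that idea on made-up records. The helpers `fit_columns` and `transform` are hypothetical, simplified stand-ins and not the `DictVectorizer` implementation:

```python
def column_name(key, value):
    """String values get one column per category; numeric values keep their key."""
    return f"{key}={value}" if isinstance(value, str) else key

def fit_columns(records):
    """Learn the sorted list of output columns from the training records."""
    return sorted({column_name(k, v) for record in records for k, v in record.items()})

def transform(records, columns):
    """Encode records with previously learned columns; unseen categories are ignored."""
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for record in records:
        row = [0.0] * len(columns)
        for key, value in record.items():
            name = column_name(key, value)
            if name in index:
                row[index[name]] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return rows

# Made-up records; "two_year" never appears in the training data
train = [{"contract": "one_year", "tenure": 58},
         {"contract": "month-to-month", "tenure": 16}]
val = [{"contract": "two_year", "tenure": 19}]

columns = fit_columns(train)
x_train_toy = transform(train, columns)
x_val_toy = transform(val, columns)
print(columns)
print(x_train_toy, x_val_toy)
```

Because the columns come from the training data alone, the validation matrix always has the same width and column order as the training matrix, which is what the model expects.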

In [37]:
dv.get_feature_names()

['contract=month-to-month',
 'contract=one_year',
 'contract=two_year',
 'dependents=no',
 'dependents=yes',
 'deviceprotection=no',
 'deviceprotection=no_internet_service',
 'deviceprotection=yes',
 'gender=female',
 'gender=male',
 'internetservice=dsl',
 'internetservice=fiber_optic',
 'internetservice=no',
 'monthlycharges',
 'multiplelines=no',
 'multiplelines=no_phone_service',
 'multiplelines=yes',
 'onlinebackup=no',
 'onlinebackup=no_internet_service',
 'onlinebackup=yes',
 'onlinesecurity=no',
 'onlinesecurity=no_internet_service',
 'onlinesecurity=yes',
 'paperlessbilling=no',
 'paperlessbilling=yes',
 'partner=no',
 'partner=yes',
 'paymentmethod=bank_transfer_(automatic)',
 'paymentmethod=credit_card_(automatic)',
 'paymentmethod=electronic_check',
 'paymentmethod=mailed_check',
 'phoneservice=no',
 'phoneservice=yes',
 'seniorcitizen',
 'streamingmovies=no',
 'streamingmovies=no_internet_service',
 'streamingmovies=yes',
 'streamingtv=no',
 'streamingtv=no_internet_servic

In [38]:
x_train

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        5.80000e+01, 6.22540e+03],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.60000e+01, 1.37825e+03],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        7.10000e+01, 1.37845e+03],
       ...,
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.00000e+00, 2.83000e+01],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.30000e+01, 4.70600e+02],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        6.40000e+01, 5.32725e+03]])

In [39]:
x_val

array([[1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.90000e+01, 1.28605e+03],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.90000e+01, 1.88865e+03],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        5.10000e+01, 4.90575e+03],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        7.10000e+01, 1.89810e+03],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        6.40000e+01, 6.72160e+03],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        7.20000e+01, 1.36325e+03]])

### Logistic Regression
Training Logistic Regression with Scikit-Learn

In [40]:
from sklearn.linear_model import LogisticRegression

In [41]:
model = LogisticRegression(solver='lbfgs')

In [42]:
model.fit(x_train, y_train)

LogisticRegression()

In [43]:
model.coef_[0].round(3)

array([ 0.29 , -0.157, -0.249,  0.062, -0.177,  0.099, -0.097, -0.118,
       -0.022, -0.094, -0.29 ,  0.27 , -0.097,  0.001, -0.193,  0.036,
        0.041,  0.113, -0.097, -0.132,  0.282, -0.097, -0.301, -0.257,
        0.141, -0.054, -0.061, -0.07 , -0.108,  0.266, -0.204,  0.036,
       -0.151,  0.197, -0.071, -0.097,  0.052, -0.063, -0.097,  0.044,
        0.271, -0.097, -0.291, -0.069,  0.   ])

In [44]:
model.intercept_[0]

-0.11617142437089834

In [45]:
model.predict(x_val)

array([0, 0, 0, ..., 0, 0, 0])

In [46]:
model.predict_proba(x_val).round(3)

array([[0.809, 0.191],
       [0.796, 0.204],
       [0.717, 0.283],
       ...,
       [0.997, 0.003],
       [0.895, 0.105],
       [0.998, 0.002]])

In [47]:
model.predict_proba(x_val)[:,1]

array([0.19074268, 0.20387398, 0.2832973 , ..., 0.00348612, 0.10513765,
       0.00155225])
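
For a binary logistic regression, the probability in the second column is the sigmoid of the intercept plus the weighted sum of the features. A minimal sketch with made-up weights and a made-up customer (none of these numbers come from the fitted model, and `predict_churn_probability` is a hypothetical helper):

```python
import math

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_churn_probability(features, weights, bias):
    """Probability of churn: sigmoid(bias + sum of weight * feature)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Made-up illustration: [month-to-month flag, two-year flag, tenure in months]
weights = [0.3, -0.25, -0.07]
bias = -0.1
customer = [1.0, 0.0, 12.0]

p = predict_churn_probability(customer, weights, bias)
print(round(p, 3))
```

Positive weights push the probability up and negative weights push it down, which is why the sign of each coefficient can be read as the direction of its effect.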

In [48]:
y_pred = model.predict(x_val)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [49]:
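# Note: y_pred already holds hard 0/1 labels from model.predict, so this threshold leaves them unchanged;
# to threshold probabilities, compare model.predict_proba(x_val)[:, 1] with 0.5 instead
churn_dicision = (y_pred >= 0.5)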
churn_dicision

array([False, False, False, ..., False, False, False])

In [50]:
y_val 

array([0, 1, 0, ..., 0, 0, 0])

In [51]:
churn_dicision.astype(int)

array([0, 0, 0, ..., 0, 0, 0])

In [52]:
(y_val == churn_dicision).mean()

0.794889992902768

In [53]:
df_pred = pd.DataFrame()
df_pred["probability"] = y_pred  # note: y_pred holds 0/1 labels from model.predict, not probabilities
df_pred["prediction"] = churn_dicision.astype(int)
df_pred["actual"] = y_val 
df_pred

,probability,prediction,actual
0,0,0,0
1,0,0,1
2,0,0,0
3,0,0,1
4,0,0,0
...,...,...,...
1404,0,0,0
1405,1,1,0
1406,0,0,0
1407,0,0,0


In [54]:
from sklearn.metrics import accuracy_score

In [55]:
accuracy = np.round(accuracy_score(y_val, y_pred),2)
accuracy

0.79
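
Accuracy is the share of customers for which the thresholded prediction matches the actual label. The sketch below shows the calculation on made-up probabilities and labels; `accuracy_at_threshold` is a hypothetical helper, not a scikit-learn function:

```python
def accuracy_at_threshold(probabilities, actual, threshold=0.5):
    """Share of correct predictions when probabilities >= threshold count as churn."""
    predictions = [int(p >= threshold) for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, actual))
    return correct / len(actual)

# Made-up churn probabilities and true labels for five customers
probabilities = [0.19, 0.20, 0.28, 0.62, 0.81]
actual = [0, 1, 0, 1, 1]

acc = accuracy_at_threshold(probabilities, actual)
print(acc)
```

With churners in the minority, a model that always predicts "no churn" would already reach an accuracy equal to one minus the churn rate, so accuracy is best read against that baseline.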

### Model Interpretation

In [56]:
dict(zip(dv.get_feature_names(), model.coef_[0].round(3)))

{'contract=month-to-month': 0.29,
 'contract=one_year': -0.157,
 'contract=two_year': -0.249,
 'dependents=no': 0.062,
 'dependents=yes': -0.177,
 'deviceprotection=no': 0.099,
 'deviceprotection=no_internet_service': -0.097,
 'deviceprotection=yes': -0.118,
 'gender=female': -0.022,
 'gender=male': -0.094,
 'internetservice=dsl': -0.29,
 'internetservice=fiber_optic': 0.27,
 'internetservice=no': -0.097,
 'monthlycharges': 0.001,
 'multiplelines=no': -0.193,
 'multiplelines=no_phone_service': 0.036,
 'multiplelines=yes': 0.041,
 'onlinebackup=no': 0.113,
 'onlinebackup=no_internet_service': -0.097,
 'onlinebackup=yes': -0.132,
 'onlinesecurity=no': 0.282,
 'onlinesecurity=no_internet_service': -0.097,
 'onlinesecurity=yes': -0.301,
 'paperlessbilling=no': -0.257,
 'paperlessbilling=yes': 0.141,
 'partner=no': -0.054,
 'partner=yes': -0.061,
 'paymentmethod=bank_transfer_(automatic)': -0.07,
 'paymentmethod=credit_card_(automatic)': -0.108,
 'paymentmethod=electronic_check': 0.266,
 'p

In [57]:
# Customers predicted to churn: these are the ones to target with an email offering discounts or other promotions
df_val.loc[churn_dicision, 'customerid']

155     6551-gnydg
3360    0689-nkylf
6625    3398-fshon
1384    4704-eryfc
5098    8258-gstjk
           ...    
6716    9850-owrhq
3726    1965-ddbwu
4986    2694-ciumo
5494    8837-vvwlq
5354    4273-mbhya
Name: customerid, Length: 329, dtype: object

### End of my project 

### Thank you