# 1. Machine learning for marketing basics
## Investigate the data
---
Great work so far! Now you know the key techniques to explore and prepare datasets for supervised machine learning models. You will now test your knowledge in practice. In this exercise, you will explore the key characteristics of the telecom churn dataset. You should run each line separately before submitting the assignment so you get valuable information about the dataset.

In [1]:
import pandas as pd
telco_raw = pd.read_csv('../Datasets/telco.csv')

# Print the data types of telco_raw dataset
print(telco_raw.dtypes)

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


In [2]:
# convert to numeric
telco_raw.TotalCharges = pd.to_numeric(telco_raw.TotalCharges, errors='coerce')

# drop null values
telco_raw = telco_raw.dropna().reset_index(drop=True)
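
TotalCharges is loaded as `object` rather than a numeric type, which usually means that some entries are not valid numbers. The sketch below uses a small hypothetical frame (the values and the blank entry are made up for illustration) to show how `errors='coerce'` turns such entries into NaN so that `dropna()` can remove them.

```python
import pandas as pd

# Hypothetical toy data: the middle entry is a blank string, not a number
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# Entries that cannot be parsed become NaN instead of raising an error
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Drop the rows with NaN and rebuild a clean 0..n-1 index
toy_clean = toy.dropna().reset_index(drop=True)
print(n_missing, len(toy_clean))
```

Resetting the index matters later on, when frames are concatenated column-wise and rows are matched by index.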

In [3]:
# Display the telco_raw dataset (use telco_raw.head() to show only the first rows)
telco_raw

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.50,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7027,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.50,No
7028,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.90,No
7029,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7030,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.60,Yes


In [4]:
# Print the number of unique values in each telco_raw column
telco_raw.nunique()

customerID          7032
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                72
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1584
TotalCharges        6530
Churn                  2
dtype: int64

## Separate numerical and categorical columns
---
In the last exercise, you explored the dataset characteristics, and you are now ready to do some data pre-processing. You will separate the categorical and numerical variables of the telco_raw DataFrame using a unique-value-count threshold: columns with fewer than 5 unique values are treated as categorical. The pandas module has been loaded for you as pd.

The raw telecom churn dataset telco_raw has been loaded for you as a pandas DataFrame. You can familiarize yourself with the dataset by exploring it in the console.



In [5]:
# Store customerID and Churn column names
custid = ['customerID']
target = ['Churn']

# Store categorical column names
categorical = telco_raw.nunique()[telco_raw.nunique() < 5].keys().tolist()

# Remove target from the list of categorical variables
categorical.remove(target[0])

# Store numerical column names
numerical = [x for x in telco_raw.columns if x not in custid + target + categorical]

In [6]:
categorical

['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod']

In [7]:
numerical

['tenure', 'MonthlyCharges', 'TotalCharges']
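
The threshold logic can be tried in isolation on a tiny frame. This is a minimal sketch with hypothetical data; only the column names and the `< 5` threshold are taken from the exercise.

```python
import pandas as pd

# Hypothetical toy frame with one ID, one categorical, one numerical and the target
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d", "e", "f"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "tenure": [1, 34, 2, 45, 8, 22],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})
toy_custid, toy_target = ["customerID"], ["Churn"]

# Columns with fewer than 5 unique values are treated as categorical
counts = toy.nunique()
toy_categorical = counts[counts < 5].index.tolist()
toy_categorical.remove(toy_target[0])

# Everything that is not an ID, the target, or categorical is numerical
toy_numerical = [c for c in toy.columns
                 if c not in toy_custid + toy_target + toy_categorical]
print(toy_categorical, toy_numerical)
```

Calling `nunique()` once and reusing the result avoids computing the counts twice.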

## Encode categorical and scale numerical variables
---
In this final step, you will perform one-hot encoding on the categorical variables and then scale the numerical columns. The pandas library has been loaded for you as pd, as well as the StandardScaler class from the sklearn.preprocessing module.

The raw telecom churn dataset telco_raw has been loaded for you as a pandas DataFrame, as well as the lists custid, target, categorical, and numerical with column names you have created in the previous exercise. You can familiarize yourself with the dataset by exploring it in the console.

In [8]:
# Perform one-hot encoding to categorical variables 
telco_raw = pd.get_dummies(data = telco_raw, columns = categorical, drop_first=True)
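
With `drop_first=True`, a column with k categories becomes k-1 indicator columns, and the dropped category is represented by all zeros. A minimal sketch on a hypothetical one-column frame follows; `dtype=int` is an optional extra, used here only so that the result shows 0/1 values on recent pandas versions.

```python
import pandas as pd

# Hypothetical toy column with three contract types
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# The first category in sorted order ("Month-to-month") is dropped
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for linear models such as logistic regression.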

In [9]:
telco_raw

Unnamed: 0,customerID,tenure,MonthlyCharges,TotalCharges,Churn,gender_Male,SeniorCitizen_1,Partner_Yes,Dependents_Yes,PhoneService_Yes,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,7590-VHVEG,1,29.85,29.85,No,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,0
1,5575-GNVDE,34,56.95,1889.50,No,1,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
2,3668-QPYBK,2,53.85,108.15,Yes,1,0,0,0,1,...,0,0,0,0,0,0,1,0,0,1
3,7795-CFOCW,45,42.30,1840.75,No,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,9237-HQITU,2,70.70,151.65,Yes,0,0,0,0,1,...,0,0,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7027,6840-RESVB,24,84.80,1990.50,No,1,0,1,1,1,...,0,1,0,1,1,0,1,0,0,1
7028,2234-XADUH,72,103.20,7362.90,No,0,0,1,1,1,...,0,1,0,1,1,0,1,1,0,0
7029,4801-JZAZL,11,29.60,346.45,No,0,0,1,1,0,...,0,0,0,0,0,0,1,0,1,0
7030,8361-LTMKD,4,74.40,306.60,Yes,1,1,1,0,1,...,0,0,0,0,0,0,1,0,0,1


In [10]:
# Initialize StandardScaler instance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit and transform the scaler on numerical columns
scaled_numerical = scaler.fit_transform(telco_raw[numerical])

# Build a DataFrame from scaled_numerical
scaled_numerical = pd.DataFrame(scaled_numerical, columns=numerical)

In [11]:
scaled_numerical

        tenure  MonthlyCharges  TotalCharges
0    -1.280248       -1.161694     -0.994194
1     0.064303       -0.260878     -0.173740
2    -1.239504       -0.363923     -0.959649
3     0.512486       -0.747850     -0.195248
4    -1.239504        0.196178     -0.940457
...        ...             ...           ...
7027 -0.343137        0.664868     -0.129180
7028  1.612573        1.276493      2.241056
7029 -0.872808       -1.170004     -0.854514
7030 -1.158016        0.319168     -0.872095
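
StandardScaler subtracts each column's mean and divides by its standard deviation, so every scaled column ends up with mean 0 and standard deviation 1. The sketch below checks this on a small hypothetical frame and compares the result with the formula applied by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values for two numerical columns
toy = pd.DataFrame({"tenure": [1, 34, 2, 45],
                    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30]})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())
```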


## Split data to training and testing
---
You are now ready to build an end-to-end machine learning model by following a few simple steps! You will explore modeling nuances in much more detail in the next chapters, but for now you will practice and understand the key steps.

The independent features have been loaded for you as a pandas DataFrame named X, and the dependent values as a pandas Series named Y.

Also, the train_test_split function has been loaded from the sklearn library. You will now create training and testing datasets, and then make sure the data was correctly split.

In [12]:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [13]:
X = telco_raw.drop(['Churn', 'customerID'], axis=1)
Y = telco_raw['Churn'].map({'Yes': 1, 'No': 0})

In [14]:
# Split X and Y into training and testing datasets
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=0.25)

# Ensure training dataset has only 75% of original X data
print(train_X.shape[0] / X.shape[0])

# Ensure testing dataset has only 25% of original X data
print(test_Y.shape[0] / Y.shape[0])

0.75
0.25
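
The split above is random, so the rows in each set (and every score computed later) change from run to run. As an optional variation that is not part of the original exercise, `random_state` makes the split repeatable and `stratify` keeps the class proportions similar in both sets. A sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 25% positive class
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([0] * 75 + [1] * 25)

tr_X, te_X, tr_y, te_y = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

# Check the split sizes and the share of positives in each part
print(len(tr_X), len(te_X), tr_y.mean(), te_y.mean())
```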


## Fit a decision tree
---
Now, you will take a stab at building a decision tree model. A decision tree is a set of machine-learned if-else rules that decide, in the telecom churn case, whether a customer will churn or not. Here's an example decision tree graph built on the famous Titanic survival dataset.

![image](https://assets.datacamp.com/production/repositories/4976/datasets/9af90703e437e4972c8db2342eb149eac33456a9/Decision-tree-on-Titanic-survival-data-Source-https-en.png)

The train_X, test_X, train_Y, test_Y from the previous exercise have been loaded for you. Also, the tree module and the accuracy_score function have been loaded from the sklearn library. You will now build your model and check its performance on unseen data.

In [15]:
# Initialize the model with max_depth set at 5
mytree = tree.DecisionTreeClassifier(max_depth = 5)

# Fit the model on the training data
treemodel = mytree.fit(train_X, train_Y)

# Predict values on the testing data
pred_Y = treemodel.predict(test_X)

# Measure model performance on testing data
accuracy_score(test_Y, pred_Y)

0.7963594994311718
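
To see the if-else rules as text, scikit-learn provides `tree.export_text`. The sketch below fits a one-split tree on four hypothetical customers (the feature values and labels are made up) and prints the learned rule.

```python
from sklearn import tree

# Hypothetical toy data: [tenure, MonthlyCharges] and a churn label
toy_X = [[1, 70.0], [2, 80.0], [40, 30.0], [60, 25.0]]
toy_y = [1, 1, 0, 0]

# A depth-1 tree learns a single if-else rule
toy_tree = tree.DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(toy_X, toy_y)

# Render the fitted rules as indented text
rules = tree.export_text(toy_tree, feature_names=["tenure", "MonthlyCharges"])
print(rules)
```

Either feature separates this toy data perfectly, so the printed rule may use tenure or MonthlyCharges.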

## Predict churn with decision tree
---
Now you will build on the skills you acquired in the earlier exercise, and build a more complex decision tree with additional parameters to predict customer churn. You will dive deep into the churn prediction problem in the next chapter. Here you will run the decision tree classifier again on your training data, predict churn on unseen (test) data, and assess model accuracy on both datasets.

The tree module from the sklearn library has been loaded for you, as well as the accuracy_score function from sklearn.metrics. The features and target variables have also been imported as train_X, train_Y for training data, and test_X, test_Y for test data.

In [16]:
import numpy as np 

# Initialize the Decision Tree
clf = tree.DecisionTreeClassifier(max_depth=7,
                                  criterion='gini',
                                  splitter='best')

# Fit the model to the training data
clf = clf.fit(train_X, train_Y)

# Predict the values on test dataset
pred_Y = clf.predict(test_X)

# Print accuracy values
print("Training accuracy: ", np.round(clf.score(train_X, train_Y), 3)) 
print("Test accuracy: ", np.round(accuracy_score(test_Y, pred_Y), 3))

Training accuracy:  0.822
Test accuracy:  0.795


# 2. Churn prediction and drivers
## Explore churn rate and split data
---
Building on the overview you saw in Chapter 1, in this lesson you will dig deeper into the data preparation needed to use machine learning for churn prediction. You will explore the churn distribution and split the data into training and testing sets before you proceed to modeling. This step shows you how churn is distributed and prepares the data so you can build a model on the training set and measure its performance on unseen testing data.

In [17]:
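# Combine the encoded columns with the scaled numerical columns
telcom = pd.concat([telco_raw.drop(numerical, axis=1), scaled_numerical], axis=1)

# Encode only the target as 0/1, leaving customerID and the other columns untouched
telcom['Churn'] = telcom['Churn'].map({'Yes': 1, 'No': 0})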

# Print the unique Churn values
print(set(telcom['Churn']))

{0, 1}


In [18]:
# Calculate the percentage of customers in each churn group
telcom.groupby(['Churn']).size() / telcom.shape[0] * 100

Churn
0    73.421502
1    26.578498
dtype: float64
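
An equivalent way to get the same percentages is `value_counts(normalize=True)`. A minimal sketch on a hypothetical four-row target column:

```python
import pandas as pd

# Hypothetical toy target: three non-churners and one churner
toy_churn = pd.Series([0, 0, 0, 1], name="Churn")

# normalize=True returns shares instead of counts; multiply by 100 for percent
shares = toy_churn.value_counts(normalize=True) * 100
print(shares)
```

An imbalance like the roughly 73/27 split in the output above is worth keeping in mind when reading accuracy scores, because always predicting "no churn" would already be right about 73% of the time.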

In [19]:
# Import the function for splitting data to train and test
from sklearn.model_selection import train_test_split

# Split the data into train and test
train, test = train_test_split(telcom, test_size = .25)

## Separate features and target variable
---
Now that you have split the data into training and testing sets, it's time to perform the final step before fitting the model, which is to separate the features and target variables into different datasets. You will use the lists of column names that have been loaded for you.

The main dataset is loaded as telcom, and split into training and testing datasets which are loaded as pandas DataFrames into train and test respectively. The target and custid lists contain the names of the target variable and the customer ID respectively. You will have to create the cols list with the names of the remaining columns. Feel free to explore the datasets in the console.

In [20]:
# Store column names from `telcom` excluding target variable and customer ID
cols = [col for col in telcom.columns if col not in custid + target]

# Extract training features
train_X = train[cols]

# Extract training target
train_Y = train[target].Churn

# Extract testing features
test_X = test[cols]

# Extract testing target
test_Y = test[target].Churn

## Fit logistic regression model
---
Logistic regression is a simple yet very powerful classification model that is used in many different use cases. You will now fit a logistic regression on the training part of the telecom churn dataset, and then predict labels on the unseen test set. Afterwards, you will calculate the accuracy of your model predictions.

The accuracy_score function has been imported, and a LogisticRegression instance from sklearn has been initialized as logreg. The training and testing datasets that you've built previously have been loaded as train_X and test_X for features, and train_Y and test_Y for target variables.

In [21]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [22]:
# Fit logistic regression on training data
logreg.fit(train_X, train_Y)

# Predict churn labels on testing data
pred_test_Y = logreg.predict(test_X)

# Calculate accuracy score on testing data
test_accuracy = accuracy_score(test_Y, pred_test_Y)

# Print test accuracy score rounded to 4 decimals
print('Test accuracy:', round(test_accuracy, 4))

Test accuracy: 0.8072
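
Accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand on hypothetical labels and compares the result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels (one of five predictions is wrong)
toy_true = np.array([0, 1, 0, 1, 0])
toy_pred = np.array([0, 1, 0, 0, 0])

# Share of positions where the prediction equals the true label
manual_accuracy = float((toy_true == toy_pred).mean())
sklearn_accuracy = accuracy_score(toy_true, toy_pred)
print(manual_accuracy, sklearn_accuracy)
```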


## Fit logistic regression with L1 regularization
---
You will now run a logistic regression model on scaled data with L1 regularization to perform feature selection alongside model building. In the video exercise you have seen how the different C values have an effect on your accuracy score and the number of non-zero features. In this exercise, you will set the C value to 0.025.

The LogisticRegression class and the accuracy_score function from the sklearn library have been loaded for you. Also, the scaled features and target variables have been loaded as train_X, train_Y for training data, and test_X, test_Y for test data.

In [23]:
# Initialize logistic regression instance 
logreg = LogisticRegression(penalty='l1', C=0.025, solver='liblinear')

# Fit the model on training data
logreg.fit(train_X, train_Y)

# Predict churn values on test data
pred_test_Y = logreg.predict(test_X)

# Print the accuracy score on test data
print('Test accuracy:', round(accuracy_score(test_Y, pred_test_Y), 4))

Test accuracy: 0.7981
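
L1 regularization performs feature selection by shrinking some coefficients exactly to zero, and a smaller C means a stronger penalty. The following sketch illustrates this on synthetic data from `make_classification` (not the telecom dataset; the sample size and feature counts are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only a few of them informative
toy_X, toy_y = make_classification(n_samples=500, n_features=20,
                                   n_informative=3, random_state=0)

# Count the non-zero coefficients under a weak and a strong L1 penalty
nonzero = {}
for c in [1, 0.01]:
    model = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    model.fit(toy_X, toy_y)
    nonzero[c] = int(np.count_nonzero(model.coef_))

print(nonzero)
```

The stronger penalty (C=0.01) should leave fewer non-zero coefficients than the weaker one (C=1).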


## Identify optimal L1 penalty coefficient
---
You will now tune the C parameter for the L1 regularization to discover the one which reduces model complexity while still maintaining good model performance metrics. You will run a for loop through possible C values and build logistic regression instances on each, as well as calculate performance metrics.

A list C has been created with the possible values. The l1_metrics array has been built with 3 columns, with the first being the C values, and the next two being placeholders for non-zero coefficient counts and the recall score of the model. The scaled features and target variables have been loaded as train_X, train_Y for training, and test_X, test_Y for testing.

In [24]:
C = [1, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025]

l1_metrics = np.array([[1.    , 0.    , 0.    ],
       [0.5   , 0.    , 0.    ],
       [0.25  , 0.    , 0.    ],
       [0.1   , 0.    , 0.    ],
       [0.05  , 0.    , 0.    ],
       [0.025 , 0.    , 0.    ],
       [0.01  , 0.    , 0.    ],
       [0.005 , 0.    , 0.    ],
       [0.0025, 0.    , 0.    ]])

In [25]:
from sklearn.metrics import precision_score, recall_score

# Run a for loop over the range of C list length
for index in range(0, len(C)):
  # Initialize and fit Logistic Regression with the C candidate
  logreg = LogisticRegression(penalty='l1', C=C[index], solver='liblinear')
  logreg.fit(train_X, train_Y)
  # Predict churn on the testing data
  pred_test_Y = logreg.predict(test_X)
  # Create non-zero count and recall score columns
  l1_metrics[index,1] = np.count_nonzero(logreg.coef_)
  l1_metrics[index,2] = recall_score(test_Y, pred_test_Y)

# Name the columns and print the array as pandas DataFrame
col_names = ['C','Non-Zero Coeffs','Recall']
print(pd.DataFrame(l1_metrics, columns=col_names))

        C  Non-Zero Coeffs    Recall
0  1.0000             29.0  0.527311
1  0.5000             26.0  0.523109
2  0.2500             22.0  0.521008
3  0.1000             17.0  0.510504
4  0.0500             16.0  0.493697
5  0.0250             12.0  0.468487
6  0.0100              8.0  0.386555
7  0.0050              3.0  0.313025
8  0.0025              2.0  0.002101
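
Recall is the share of actual churners that the model catches: true positives divided by true positives plus false negatives. A small hand calculation on hypothetical labels, checked against `recall_score`:

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical labels: four actual churners, two of them predicted correctly
toy_true = np.array([1, 1, 1, 1, 0, 0])
toy_pred = np.array([1, 1, 0, 0, 0, 1])

true_pos = int(((toy_true == 1) & (toy_pred == 1)).sum())
false_neg = int(((toy_true == 1) & (toy_pred == 0)).sum())

manual_recall = true_pos / (true_pos + false_neg)
sklearn_recall = recall_score(toy_true, toy_pred)
print(manual_recall, sklearn_recall)
```

This explains the last row of the table above: with C = 0.0025 the model predicts almost no churners, so recall collapses to nearly zero.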


## Fit decision tree model
---
Now you will fit a decision tree on the training set of the telecom dataset, and then predict labels on the unseen testing data, and calculate the accuracy of your model predictions. You will see the difference in the performance compared to the logistic regression.

The accuracy_score function has been imported. The training and testing datasets that you've built previously have been loaded as train_X and test_X for features, and train_Y and test_Y for target variables.

In [26]:
# Initialize decision tree classifier
mytree = tree.DecisionTreeClassifier()

# Fit the decision tree on training data
mytree.fit(train_X, train_Y)

# Predict churn labels on testing data
pred_test_Y = mytree.predict(test_X)

# Calculate accuracy score on testing data
test_accuracy = accuracy_score(test_Y, pred_test_Y)

# Print test accuracy
print('Test accuracy:', round(test_accuracy, 4))

Test accuracy: 0.7372
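
A tree with no depth limit can keep splitting until it memorizes the training set, which is one plausible reason for the lower test accuracy here. The sketch below shows the effect on synthetic data from `make_classification` (an assumption for illustration, not the telecom dataset).

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data with continuous features
toy_X, toy_y = make_classification(n_samples=400, n_features=10, random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(toy_X, toy_y, test_size=0.25,
                                          random_state=0)

# No max_depth: the tree grows until every training leaf is pure
deep_tree = tree.DecisionTreeClassifier(random_state=0).fit(tr_X, tr_y)

train_acc = deep_tree.score(tr_X, tr_y)
test_acc = deep_tree.score(te_X, te_y)
print(train_acc, test_acc)
```

Comparing training and test accuracy in this way is a quick check for over-fitting.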


## Identify optimal tree depth
---
Now you will tune the max_depth parameter of the decision tree to discover the one which reduces over-fitting while still maintaining good model performance metrics. You will run a for loop through multiple max_depth parameter values and fit a decision tree for each, and then calculate performance metrics.

The list called depth_list with the parameter candidates has been loaded for you. The depth_tuning array has been built for you with 2 columns, with the first one being filled with the depth candidates, and the next one being a placeholder for the recall score. Also, the features and target variables have been loaded as train_X, train_Y for the training data, and test_X, test_Y for the test data. Both numpy and pandas libraries are loaded as np and pd respectively.

In [27]:
depth_list = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

depth_tuning = np.array([[ 2.,  0.],
       [ 3.,  0.],
       [ 4.,  0.],
       [ 5.,  0.],
       [ 6.,  0.],
       [ 7.,  0.],
       [ 8.,  0.],
       [ 9.,  0.],
       [10.,  0.],
       [11.,  0.],
       [12.,  0.],
       [13.,  0.],
       [14.,  0.]])

In [28]:
# Run a for loop over the range of depth list length
for index in range(0, len(depth_list)):
  # Initialize and fit decision tree with the `max_depth` candidate
  mytree = tree.DecisionTreeClassifier(max_depth=depth_list[index])
  mytree.fit(train_X, train_Y)
  # Predict churn on the testing data
  pred_test_Y = mytree.predict(test_X)
  # Calculate the recall score 
  depth_tuning[index,1] = recall_score(test_Y, pred_test_Y)

# Name the columns and print the array as pandas DataFrame
col_names = ['Max_Depth','Recall']
print(pd.DataFrame(depth_tuning, columns=col_names))

    Max_Depth    Recall
0         2.0  0.378151
1         3.0  0.378151
2         4.0  0.497899
3         5.0  0.441176
4         6.0  0.481092
5         7.0  0.445378
6         8.0  0.474790
7         9.0  0.474790
8        10.0  0.453782
9        11.0  0.462185
10       12.0  0.445378
11       13.0  0.472689
12       14.0  0.491597


## Explore logistic regression coefficients
---
You will now explore the coefficients of the logistic regression to understand what is driving churn to go up or down. For this exercise, you will extract the logistic regression coefficients from your fitted model, and exponentiate them to make them more interpretable.

The fitted logistic regression instance is loaded as logreg and the scaled features are loaded as a pandas DataFrame called train_X. The numpy and pandas libraries are loaded as np and pd respectively.

In [29]:
# Initialize logistic regression instance 
logreg = LogisticRegression(penalty='l1', C=0.025, solver='liblinear')

# Fit the model on training data
logreg.fit(train_X, train_Y)

LogisticRegression(C=0.025, penalty='l1', solver='liblinear')

In [30]:
# Combine feature names and coefficients into pandas DataFrame
feature_names = pd.DataFrame(train_X.columns, columns = ['Feature'])
log_coef = pd.DataFrame(np.transpose(logreg.coef_), columns = ['Coefficient'])
coefficients = pd.concat([feature_names, log_coef], axis = 1)

# Calculate exponent of the logistic regression coefficients
coefficients['Exp_Coefficient'] = np.exp(coefficients['Coefficient'])

# Remove coefficients that are equal to zero
coefficients = coefficients[coefficients['Coefficient']!=0]

# Print the values sorted by the exponent coefficient
print(coefficients.sort_values(by=['Exp_Coefficient']))

                           Feature  Coefficient  Exp_Coefficient
27                          tenure    -0.880811         0.414447
4                 PhoneService_Yes    -0.851411         0.426812
22               Contract_Two year    -0.637240         0.528750
21               Contract_One year    -0.431142         0.649766
10              OnlineSecurity_Yes    -0.416260         0.659508
16                 TechSupport_Yes    -0.383711         0.681328
3                   Dependents_Yes    -0.138064         0.871043
12                OnlineBackup_Yes    -0.031121         0.969358
14            DeviceProtection_Yes    -0.021501         0.978729
23            PaperlessBilling_Yes     0.030106         1.030563
25  PaymentMethod_Electronic check     0.247494         1.280811
28                  MonthlyCharges     0.887041         2.427935


## Break down decision tree rules
---
In this exercise you will extract the if-else rules from the decision tree and plot them to identify the main drivers of churn.

The fitted decision tree instance is loaded as mytree and the scaled features are loaded as a pandas DataFrame called train_X. The tree module from the sklearn library and the graphviz library have already been loaded for you.

Note that the original exercise used a proprietary display_image() function to show the output; this notebook calls display(graph) instead.

In [31]:
import graphviz

mytree = tree.DecisionTreeClassifier(max_depth=5)
mytree.fit(train_X, train_Y)

# Export graphviz object from the trained decision tree 
exported = tree.export_graphviz(decision_tree=mytree,
                                # Assign feature names
                                out_file=None, feature_names=train_X.columns,
                                # Set precision to 1 and add class names
                                precision=1, class_names=['Not churn', 'Churn'], filled=True)

# Call the Source function and pass the exported graphviz object
graph = graphviz.Source(exported)

In [None]:
display(graph)