# Data Science for Business

## Spring 2020, module 4 @ HSE

---

## Home assignment 4


Author: **Miron Rogovets**

---

_"For this task your main goal is to decrease company losses due to customer’s churn. We will compare two discount strategies: providing a 20% discount with a 75% acceptance rate and a 30% discount with a 90% acceptance rate."_

In [83]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import roc_auc_score, confusion_matrix

In [2]:
pd.set_option('display.float_format', lambda x: '{:.2f}'.format(x))
sns.set_style('darkgrid')

In [24]:
df = pd.read_csv('data/telecom_data.csv')
df.head(3)

Unnamed: 0,index,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_group
0,1,5575-GNVDE,Male,No,No,No,34,Yes,No,DSL,...,No,No,No,One year,No,Mailed check,56.95,1889.5,No,Tenure_24-48
1,2,3668-QPYBK,Male,No,No,No,2,Yes,No,DSL,...,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,Tenure_0-12
2,3,7795-CFOCW,Male,No,No,No,45,No,No phone service,DSL,...,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No,Tenure_24-48


In [25]:
data = df.drop(columns=['index', 'tenure_group'])

### I. Data preprocessing (5).

This time we will skip all the exploration steps and only do some simple feature preprocessing:

1. Generate the `tenure_group` attribute: discretize `tenure` into 6 groups: “0-12”, “12-24”, “24-36”, “36-48”, “48-60”, “60+” (all are left-closed intervals). What are the sizes of these groups?

_Tenure refers to the number of months that a customer has subscribed for. Do not drop the `tenure` column._


In [27]:
data['tenure_group'] = 0
data.loc[(data.tenure >= 12) & (data.tenure < 24), 'tenure_group'] = 1
data.loc[(data.tenure >= 24) & (data.tenure < 36), 'tenure_group'] = 2
data.loc[(data.tenure >= 36) & (data.tenure < 48), 'tenure_group'] = 3
data.loc[(data.tenure >= 48) & (data.tenure < 60), 'tenure_group'] = 4
data.loc[data.tenure >= 60, 'tenure_group'] = 5

data.head(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_group
0,5575-GNVDE,Male,No,No,No,34,Yes,No,DSL,Yes,...,No,No,No,One year,No,Mailed check,56.95,1889.5,No,2
1,3668-QPYBK,Male,No,No,No,2,Yes,No,DSL,Yes,...,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,0
2,7795-CFOCW,Male,No,No,No,45,No,No phone service,DSL,Yes,...,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No,3


In [32]:
data['tenure_group'].value_counts().sort_index()

0    1573
1     844
2     719
3     670
4     775
5    1462
Name: tenure_group, dtype: int64
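
The same left-closed binning can also be written with `pd.cut`. The snippet below is a minimal sketch on toy tenure values (not the assignment data); the `edges` list and the toy series are assumptions made for illustration.

```python
import pandas as pd

# Toy tenure values (months) chosen to hit every bin edge; not the assignment data.
tenure = pd.Series([0, 11, 12, 23, 24, 35, 36, 47, 48, 59, 60, 72])

# Left-closed bins [0, 12), [12, 24), ..., [60, inf).
# labels=False returns integer codes 0..5, matching the codes assigned above.
edges = [0, 12, 24, 36, 48, 60, float("inf")]
tenure_group = pd.cut(tenure, bins=edges, right=False, labels=False)

print(tenure_group.tolist())
```

Setting `right=False` is what makes each interval include its left edge, so a tenure of exactly 12 lands in group 1 rather than group 0.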

2. Preprocess categorical columns with only 2 unique values (“binary” columns): replace one unique value with 0 and another with 1 (label encoding). How many __such__ columns do you have?

_E.g. for the `gender` attribute you may replace Female with 1 and Male with 0 or vice versa._


In [42]:
cats = data.dtypes[data.dtypes == 'object'].index.drop('customerID').values
binary = [c for c in cats if len(data[c].unique()) == 2]
nonbinary = [c for c in cats if len(data[c].unique()) != 2]

f"Binary columns count: {len(binary)}"

'Binary columns count: 13'

In [43]:
le = LabelEncoder()

for col in binary:
    data[col] = le.fit_transform(data[col])

data.head(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_group
0,5575-GNVDE,1,0,0,0,34,1,No,DSL,1,...,0,0,0,One year,0,Mailed check,56.95,1889.5,0,2
1,3668-QPYBK,1,0,0,0,2,1,No,DSL,1,...,0,0,0,Month-to-month,1,Mailed check,53.85,108.15,1,0
2,7795-CFOCW,1,0,0,0,45,0,No phone service,DSL,1,...,1,0,0,One year,0,Bank transfer (automatic),42.3,1840.75,0,3
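
To see which value ends up as 0 and which as 1, it can help to inspect the encoder's `classes_` attribute. The snippet below is a small sketch on a toy list of values; the `mapping` dictionary is a hypothetical helper, not part of the assignment.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values mirroring the `gender` column; not the assignment data.
le = LabelEncoder()
codes = le.fit_transform(["Male", "Female", "Female", "Male"])

# LabelEncoder sorts the unique values, so the code is the position in classes_.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(mapping)
print(codes.tolist())
```

Because the classes are sorted, `Female` maps to 0 and `Male` to 1, and for Yes/No columns `No` maps to 0 and `Yes` to 1, which agrees with the encoded rows shown above.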


3. Preprocess categorical columns with more than 2 unique values using dummy encoding (= one-hot encoding). How many __such__ columns (before dummy encoding) do you have?


In [40]:
f"Nonbinary columns count: {len(nonbinary)}"

'Nonbinary columns count: 4'

In [50]:
data = pd.get_dummies(data, prefix=nonbinary, columns=nonbinary, drop_first=True)
data.head(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,...,tenure_group,MultipleLines_No phone service,MultipleLines_Yes,InternetService_Fiber optic,InternetService_No,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,5575-GNVDE,1,0,0,0,34,1,1,0,1,...,2,0,0,0,0,1,0,0,0,1
1,3668-QPYBK,1,0,0,0,2,1,1,1,0,...,0,0,0,0,0,0,0,0,0,1
2,7795-CFOCW,1,0,0,0,45,0,1,0,1,...,3,1,0,0,0,1,0,0,0,0
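
The effect of `drop_first=True` is easier to see on a single toy column. This is a minimal sketch with a three-row `Contract` frame invented for illustration; `dtype=int` is added only so that the output prints as 0/1.

```python
import pandas as pd

# Toy frame with one three-level categorical column; not the assignment data.
toy = pd.DataFrame({"Contract": ["One year", "Month-to-month", "Two year"]})

# drop_first=True removes the first category in sorted order ("Month-to-month"),
# which then acts as the baseline encoded by all zeros.
encoded = pd.get_dummies(
    toy, columns=["Contract"], prefix=["Contract"], drop_first=True, dtype=int
)

print(encoded)
```

A column with k levels therefore yields k - 1 dummy columns, which avoids perfectly collinear features in the logistic regression.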


4. Drop customerID attribute.

In [51]:
data.drop(columns=['customerID'], inplace=True)

### II. Build a churn model (5).

1. Build 2 classification models to predict customer churn:
   - Logistic Regression. What is the ROC AUC of this model?
   - Random Forest. What is the ROC AUC of this model?


In [75]:
y = data.Churn
X = data.drop(columns=['Churn'])

In [98]:
logit = LogisticRegression(max_iter=1000)
forest = RandomForestClassifier()

In [99]:
#f"{cross_val_score(logit, X, y, scoring=('roc_auc')).mean():.2f}"

In [100]:
#f"{cross_val_score(forest, X, y, scoring=('roc_auc')).mean():.2f}"

In [101]:
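# cross_val_predict returns hard 0/1 labels by default, so this ROC AUC is computed
# from class labels (it equals the mean of TPR and TNR), not from predicted probabilities.
lpred = cross_val_predict(logit, X, y)
f"{roc_auc_score(y, lpred):.2f}"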

'0.85'

In [110]:
# As with the logistic regression, rpred holds hard 0/1 labels, not probabilities.
rpred = cross_val_predict(forest, X, y)
f"{roc_auc_score(y, rpred):.2f}"

'0.84'
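
ROC AUC is usually computed from predicted probabilities rather than from hard class labels. The sketch below shows both variants on synthetic data from `make_classification` (an assumption, used here only to keep the example self-contained); the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data; not the telecom dataset.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Out-of-fold hard labels (default) and out-of-fold class-1 probabilities.
labels = cross_val_predict(model, X_toy, y_toy)
proba = cross_val_predict(model, X_toy, y_toy, method="predict_proba")[:, 1]

auc_from_labels = roc_auc_score(y_toy, labels)
auc_from_proba = roc_auc_score(y_toy, proba)

# With hard labels the ROC curve has a single interior point,
# so the area reduces to the mean of TPR and TNR.
tn, fp, fn, tp = confusion_matrix(y_toy, labels).ravel()
balanced = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

print(f"AUC from labels: {auc_from_labels:.3f} (mean of TPR and TNR: {balanced:.3f})")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```

Keeping the hard labels around is still useful, since the confusion matrix at a 0.5 threshold needs them.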

### III. Compare two discount strategies (25 + 10).

**Assumptions:**
- Every customer pays the same price `p`, which is the average of `MonthlyCharges`.
- If we decide to provide a discount we provide it to all the customers who are predicted as Churn=Yes.
- When we compute gains, costs and losses we compute them for the short term.
- Strategy’s profit is the difference between gains, costs and losses: `profit = gains - costs - losses`
- Profit per customer is the total profit divided by the number of customers (if the person churns the person is not a customer anymore).

**Strategy A:** Provide a 20% discount with a 75% acceptance rate.

**Strategy B:** Provide a 30% discount with a 90% acceptance rate.
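
The formula `profit = gains - costs - losses` can be sketched in code once gains, costs and losses are pinned down. The assignment text does not define them precisely, so everything below is one possible reading and not the reference solution: the helper `strategy_profit`, the way acceptance is applied only to correctly predicted churners, the assumption that every false positive takes the discount, and all numbers in the usage lines are hypothetical.

```python
def strategy_profit(tp, fp, fn, tn, p, discount, acceptance):
    """Hypothetical helper: one possible reading of profit = gains - costs - losses."""
    # Assumed: only correctly predicted churners who accept the offer are retained.
    retained = tp * acceptance
    # Assumed: gains are the revenue kept from retained churners.
    gains = retained * p
    # Assumed: the discount is paid to retained churners and to every false positive.
    costs = (retained + fp) * discount * p
    # Assumed: losses are the revenue from missed churners and from those who decline.
    losses = (fn + tp - retained) * p
    profit = gains - costs - losses
    # Churned customers no longer count as customers.
    customers_left = tn + fp + retained
    return {
        "gains": gains,
        "costs": costs,
        "losses": losses,
        "profit": profit,
        "profit_per_customer": profit / customers_left,
    }


# Hypothetical counts and price, only to show the call for strategies A and B.
counts = dict(tp=10, fp=5, fn=3, tn=20)
result_a = strategy_profit(**counts, p=100.0, discount=0.2, acceptance=0.75)
result_b = strategy_profit(**counts, p=100.0, discount=0.3, acceptance=0.9)
print(result_a)
print(result_b)
```

Under a different reading of the assumptions (for example, applying the acceptance rate to false positives as well) the cost term would change, so the definitions should be checked against the grading criteria before relying on the numbers.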



In [None]:
p = data['MonthlyCharges'].mean()
f"{p:.2f}"

#### 1. Use the default threshold of 0.5 to compute the confusion matrix. Based on this confusion matrix, report (5 points):
- TP, FP, TN, FN
- Losses if you do not apply any discount strategy.
- Total **gains** from the discount strategy B.
- Total **costs** of the discount strategy B.
- Total **losses** of the discount strategy B.
- Total profit of the discount strategy B.
- Profit per customer $p_d$ (using strategy B). 


In [104]:
cm = confusion_matrix(y, lpred, labels=[1,0])
cm

array([[1416,  453],
       [ 228, 3946]])

In [109]:
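# With labels=[1, 0] the matrix is laid out as [[TP, FN], [FP, TN]],
# so ravel() yields the cells in the order tp, fn, fp, tn.
tp, fn, fp, tn = cm.ravel()
print(f"TP = {tp}\nFP = {fp}\nTN = {tn}\nFN = {fn}")

TP = 1416
FP = 228
TN = 3946
FN = 453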
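
The order in which `ravel()` returns the cells depends on the `labels` argument of `confusion_matrix`. The snippet below is a small sketch on toy labels (invented for illustration) that compares the default ordering with `labels=[1, 0]`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels with distinct counts in every cell: TP=2, FN=1, FP=2, TN=3.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

# Default label order [0, 1]: rows and columns start with the negative class.
tn_d, fp_d, fn_d, tp_d = confusion_matrix(y_true, y_pred).ravel()

# labels=[1, 0]: rows and columns start with the positive class.
tp_r, fn_r, fp_r, tn_r = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()

print(f"default:       TN={tn_d} FP={fp_d} FN={fn_d} TP={tp_d}")
print(f"labels=[1, 0]: TP={tp_r} FN={fn_r} FP={fp_r} TN={tn_r}")
```

Rows of the matrix are true classes and columns are predicted classes, so both unpackings describe the same four counts.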
