# Churn prediction 

The definition of churn depends on the company and its business model, but essentially a churn event happens when a customer stops buying a product, using a service, or engaging with a product or application. Churn can happen in either a contractual or a non-contractual business context.

Contractual churn happens when customers explicitly cancel a service or subscription, while non-contractual churn is harder to observe and requires in-depth data exploration. Churn can also be viewed as either voluntary or involuntary. Voluntary churn means customers decided to stop using the product or service, while involuntary churn happens when a subscription fails to renew automatically because of an expired credit card or another blocker.

## Explore churn rate

In [5]:
import pandas as pd
import numpy as np

In [6]:
# The file is semicolon-delimited; pass the separator by keyword
telco_raw = pd.read_csv('telco.csv', sep=';')

In [7]:
print(set(telco_raw['Churn']))

{'Yes', 'No'}


In [10]:
# Calculate the share of each churn group, in percent
telco_raw.groupby(['Churn']).size() / telco_raw.shape[0] * 100

Churn
No     73.463013
Yes    26.536987
dtype: float64
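
The same percentages can also be obtained with `value_counts(normalize=True)`. A minimal sketch on a made-up label column (the values are illustrative and not taken from `telco.csv`):

```python
import pandas as pd

# Hypothetical toy labels standing in for telco_raw['Churn']
churn = pd.Series(['No', 'No', 'Yes', 'No'], name='Churn')

# normalize=True returns fractions; multiply by 100 for percentages
churn_rate = churn.value_counts(normalize=True) * 100
print(churn_rate)
```

A roughly 73/27 split like the one above means the classes are imbalanced, which is worth keeping in mind when reading accuracy later on.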

## Target and Features

In [19]:
custid = ['customerID']
target = ['Churn']

# `telco` is the encoded and scaled frame built in the Scaling section; run that step first
features = [col for col in telco.columns
            if col not in custid + target]

X = telco[features]
Y = telco[target]

In [12]:
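# Treat columns with fewer than 10 distinct values as categorical
n_unique = telco_raw.nunique()
categorical = n_unique[n_unique < 10].index.tolist()
categorical.remove(target[0])

# Everything else, apart from the ID and the target, is numerical
numerical = [col for col in telco_raw.columns
             if col not in custid + target + categorical]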

In [14]:
telco_raw = pd.get_dummies(data=telco_raw, columns=categorical, drop_first=True)
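
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical frame (the column names and values are assumptions for illustration, not read from the dataset):

```python
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only
toy = pd.DataFrame({
    'tenure': [1, 24, 60],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
})

# drop_first=True drops the first level ('Month-to-month'),
# which then acts as the baseline category
encoded = pd.get_dummies(data=toy, columns=['Contract'], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters for linear models such as logistic regression.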

## Scaling

In [13]:
from sklearn.preprocessing import StandardScaler

In [16]:
scaler = StandardScaler()

scaled_numerical = scaler.fit_transform(telco_raw[numerical])
scaled_numerical = pd.DataFrame(scaled_numerical, columns=numerical)
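
As a quick sanity check of what `StandardScaler` does, the sketch below scales two hypothetical numeric columns (names and values are assumptions) and confirms that each ends up with mean 0 and unit standard deviation:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns; values are illustrative only
toy = pd.DataFrame({'tenure': [1.0, 12.0, 24.0, 60.0],
                    'MonthlyCharges': [20.0, 55.5, 70.0, 99.9]})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# StandardScaler uses the population standard deviation (ddof=0)
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note that the rebuilt DataFrame gets a fresh default integer index. If the original frame had a different index, passing `index=toy.index` would be needed for a later index-based merge to align rows correctly.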

In [17]:
# Swap the raw numerical columns for their scaled versions (rows align on the index)
telco_raw = telco_raw.drop(columns=numerical)
telco = telco_raw.merge(right=scaled_numerical, how='left', left_index=True, right_index=True)

## Predict churn with logistic regression

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [21]:
train_X, test_X, train_Y, test_Y = train_test_split(X,Y, test_size=0.25)

In [31]:
logreg = LogisticRegression(penalty='l1',C=0.1, solver='liblinear')

In [32]:
# Pass the target as a 1-D array; a one-column DataFrame triggers a DataConversionWarning
logreg.fit(train_X, train_Y.values.ravel())

LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
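
An L1 penalty can shrink some coefficients to exactly zero, so it is useful to inspect `coef_` after fitting. The sketch below does this on synthetic data from `make_classification`; the data and the variable names `X_toy`, `y_toy` and `model` are assumptions for illustration and have nothing to do with the telco dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=10,
                                   n_informative=3, random_state=0)

# Same settings as the model above
model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
model.fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem
n_zero = int(np.sum(model.coef_ == 0))
print(f'{n_zero} of {model.coef_.size} coefficients are exactly zero')
```

Smaller values of `C` mean stronger regularization and typically more zeroed coefficients.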

In [24]:
pred_train_Y = logreg.predict(train_X)
pred_test_Y = logreg.predict(test_X)

### Model performance Metrics

Key metrics

* Accuracy - The % of correctly predicted labels (both churn and non-churn)
* Precision - The % of the model's positive-class predictions (here, customers predicted as churn) that were correct
* Recall - The % of all positive-class samples (all churned customers) that were correctly classified
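
The three definitions can be worked through by hand. The sketch below uses made-up labels (not model output) and treats `'Yes'` as the positive class:

```python
# Hypothetical true and predicted labels, for illustration only
y_true = ['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No']
y_pred = ['Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'No']

pairs = list(zip(y_true, y_pred))
tp = sum(t == 'Yes' and p == 'Yes' for t, p in pairs)  # churners caught
fp = sum(t == 'No' and p == 'Yes' for t, p in pairs)   # false alarms
fn = sum(t == 'Yes' and p == 'No' for t, p in pairs)   # churners missed
tn = sum(t == 'No' and p == 'No' for t, p in pairs)    # correct non-churn

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, round(precision, 4), round(recall, 4))  # 0.75 0.6667 0.6667
```

With string labels like these, scikit-learn's `precision_score` and `recall_score` need `pos_label='Yes'` to know which class counts as positive.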

In [25]:
train_accuracy = accuracy_score(train_Y, pred_train_Y)
test_accuracy = accuracy_score(test_Y, pred_test_Y)

In [26]:
print('Training accuracy:', round(train_accuracy,4))
print('Test accuracy:', round(test_accuracy,4))

Training accuracy: 0.8035
Test accuracy: 0.8143


In [30]:
# The labels are the strings 'Yes'/'No', so the positive class must be named explicitly
train_precision = round(precision_score(train_Y, pred_train_Y, pos_label='Yes'),4)
test_precision = round(precision_score(test_Y, pred_test_Y, pos_label='Yes'),4)

In [None]:
print('Training precision:', round(train_precision,4))
print('Test precision:', round(test_precision,4))

In [None]:
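# As with precision, name the positive class explicitly for string labels
train_recall = round(recall_score(train_Y, pred_train_Y, pos_label='Yes'),4)
test_recall = round(recall_score(test_Y, pred_test_Y, pos_label='Yes'),4)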

In [None]:
print('Training recall:', round(train_recall,4))
print('Test recall:', round(test_recall,4))