Competition Link: https://www.kaggle.com/c/customerattritionprediction/leaderboard

# Customer Attrition Prediction
### Determine the Customer Attrition from the given dataset

## Problem Statement

The training dataset contains 6338 samples and the test set contains 705 samples. Each sample contains 15 features and 1 prediction variable, "CustomerAttrition", which indicates the class of the sample. The 15 input features and 1 prediction variable are:

    "ID", string, the Customer ID allocated to each customer,

    "sex", string, the gender of the person,

    "Aged", Boolean, tells whether the person is aged or not,

    "Married", Boolean, the marital status of the person,

    "TotalDependents", Boolean, Tells whether the person is dependent or independent,

    "ServiceSpan", numerical, gives the timespan of the service taken by the person,

    "4GService", string, the internet service taken by the person,

    "CyberProtection", Boolean, tells if cyber protection plan of company is taken by the person or not,

    "HardwareSupport", Boolean, tells if hardware support plan of company is taken by the person or not,

    "TechnicalAssistance", Boolean, tells if technical assistance of company is taken by the person or not,

    "FilmSubscription", Boolean, tells whether the person has subscribed for films,

    "SettlementProcess", string, The payment process chosen by the person,

    "QuarterlyPayment", numerical, the quarterly payment made by the person,

    "GrandPayment", numerical, the cumulative payment made by the person,

    "CustomerAttrition", Boolean, The choice of continuation of services taken by the customer,

## Objective

Your task is to predict the customer attrition for each customer in the given dataset using data science models.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# to ignore warnings
import warnings as wg
wg.filterwarnings("ignore")

In [3]:
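# Show all rows; the fully qualified option name avoids an ambiguous match in recent pandas versions
pd.set_option('display.max_rows', None)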

**Loading Data**

In [4]:
data = pd.read_csv('../input/customerattritionprediction/train.csv')

In [5]:
data.head()

In [6]:
data.info()

In [7]:
data.columns

**Describing continuous numerical data**

In [8]:
data[['ServiceSpan','QuarterlyPayment', 'GrandPayment']].describe()

We can clearly see that ServiceSpan has a negative minimum value, which is not possible and needs to be corrected.

**Describing categorical data**

In [9]:
data[['sex', 'Aged', 'Married', 'TotalDependents',
       'MobileService', '4GService', 'CyberProtection', 'HardwareSupport',
       'TechnicalAssistance', 'FilmSubscription', 'SettlementProcess',
         'CustomerAttrition']].describe()

**Train Test Split**

In [10]:
x = data.drop('CustomerAttrition', axis = 'columns')
y = data['CustomerAttrition']

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=1, stratify=y)

## EDA 

In [None]:
X_train.head()

In [None]:
# Dropping ID column
X_train = X_train.drop('ID', axis = 'columns')

In [None]:
X_train.columns

In [None]:
# Separating numerical and categorical data to understand better 

num = ['ServiceSpan','QuarterlyPayment', 'GrandPayment']
cat = ['sex', 'Aged', 'Married', 'TotalDependents', 
       'MobileService', '4GService', 'CyberProtection', 'HardwareSupport',
       'TechnicalAssistance', 'FilmSubscription', 'SettlementProcess']

In [None]:
plt.style.use('ggplot')

In [None]:
pair = pd.concat([X_train, Y_train], axis = 'columns')

**Checking out numerical features**

In [None]:
# Pair plot for numerical features
sns.pairplot(vars = num, data = pair, hue = 'CustomerAttrition')

    We can clearly see that ServiceSpan and GrandPayment are strongly correlated,
    while ServiceSpan and QuarterlyPayment appear uncorrelated

In [None]:
colormap = sns.diverging_palette(10, 220, as_cmap = True)
sns.heatmap(X_train[num].corr(), 
            annot = True, square = True,
           cmap = colormap)

In [None]:
Y_train.value_counts()

In [None]:
# Visualizing the value counts of customer attrition
sns.countplot(x = Y_train)

In [None]:
for i in num:
    # Box plot (left) and distribution (right) for each numerical feature
    fig, axes = plt.subplots(ncols=2, figsize=(15, 5))
    sns.boxplot(x = i , data = X_train, ax = axes[0])
    # histplot with kde = True replaces the deprecated distplot
    sns.histplot(X_train[i], kde = True, ax = axes[1])

In [None]:
# Checking null values in the columns
X_train.isna().sum()

So, GrandPayment contains 9 missing values, which need to be imputed.

In [None]:
X_train.columns

In [None]:
mean_payments = X_train[['QuarterlyPayment','GrandPayment','ServiceSpan']].groupby(['ServiceSpan']).mean()
mean_payments = mean_payments.reset_index()
mean_payments

In [None]:
ax = sns.lineplot(x = 'ServiceSpan', y = 'QuarterlyPayment', data = mean_payments)
ax2 = ax.twinx()
sns.lineplot(x = 'ServiceSpan', y = 'GrandPayment', data = mean_payments, ax=ax2, color = 'b')
plt.show()

We can also see that some ServiceSpan values are negative or 0. Let's explore them as well.

In [None]:
X_train[X_train['ServiceSpan'] <= 0].shape

So, 273 customers have a ServiceSpan of 0 or less. Now let's dive deeper.

In [None]:
X_train.shape

In [None]:
X_train[X_train['ServiceSpan'] <= 0]['ServiceSpan'].value_counts()

In [None]:
plt.figure(figsize = (10, 5))
X_train[X_train['ServiceSpan'] <= 0]['ServiceSpan'].value_counts().plot(kind = 'bar')

Out of the 5070 customers in the training split, only 109 have a negative ServiceSpan.

Only 1 record has a ServiceSpan of -2, so it should be merged with the other values.

In [None]:
negative_span_index = X_train[X_train['ServiceSpan'] < 0].index
negative_span_index

In [None]:
negative_span = pd.concat([X_train[X_train['ServiceSpan'] < 0], Y_train[negative_span_index]], axis = 'columns')
negative_span.head()

In [None]:
sns.countplot(x = 'CustomerAttrition', data = negative_span)

It looks like customers with a negative ServiceSpan are more likely to continue the service.

In [None]:
f = plt.figure(figsize = (15, 18))
i =1
for  c in cat:
    f.add_subplot(4, 3, i)
    sns.countplot(x = c, data = negative_span)
    i+=1
plt.tight_layout()
plt.show()

It looks like most of the customers with a negative ServiceSpan are working individuals, because most of them are not married, not aged, and not dependent.

They require good internet and mobile service but are not interested in CyberProtection, HardwareSupport, FilmSubscription, or TechnicalAssistance.

It is possible that most of them are freelancers who require these services only for a short period of time, as their GrandPayment is also low.

In [None]:
negative_span[['QuarterlyPayment', 'GrandPayment']].mean()

In [None]:
negative_span[['QuarterlyPayment', 'GrandPayment']].mean().plot(kind = 'bar')
plt.ylabel('Mean Values')

This supports the idea that customers with a negative ServiceSpan are those who require these services only for a short period, like freelancers.

In [None]:
negative_span.describe()

Now, let's have a look at the rows with 0 ServiceSpan.

In [None]:
zero_span = X_train[X_train['ServiceSpan'] == 0]
zero_span.head()

In [None]:
zero_span.shape

In [None]:
zero_span.index

In [None]:
sns.countplot(x = Y_train[zero_span.index])

In [None]:
f = plt.figure(figsize = (15, 18))
i =1
for  c in cat:
    f.add_subplot(4, 3, i)
    sns.countplot(x = c, data = zero_span)
    i+=1
plt.tight_layout()
plt.show()

Like before, customers with zero ServiceSpan show almost the same characteristics as those with negative ServiceSpan.

I guess they can be merged, i.e., changing the ServiceSpan of those with negative values to 0 should do no harm.
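
One way to apply such a merge is to clip ServiceSpan at zero. This is a minimal sketch on made-up values (the toy Series is an assumption for illustration, not the competition data), using pandas' `Series.clip`.

```python
import pandas as pd

# Toy ServiceSpan values (hypothetical), including negative ones
spans = pd.Series([-2, -1, 0, 3, 24], name="ServiceSpan")

# Replace every negative value with 0 and leave the rest untouched
clipped = spans.clip(lower=0)

print(clipped.tolist())
```

On the real data the same call would presumably be applied to `X_train['ServiceSpan']` and, with the same rule, to the held-out and test sets.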

In [None]:
zero_span[['QuarterlyPayment', 'GrandPayment']].mean()

In [None]:
zero_span[['QuarterlyPayment', 'GrandPayment']].mean().plot(kind = 'bar')
plt.ylabel('Mean Values')

In this case the GrandPayment is only slightly higher than the QuarterlyPayment; maybe these customers stayed a little longer than a quarter.

Now I will compare the trends of the zero-span and negative-span groups to make sure that merging them won't disrupt any trend for the final ML model.

In [None]:
for i in cat:

    fig, axes = plt.subplots(ncols=2, figsize=(15, 5))

    sns.countplot(x = i , data = negative_span, ax = axes[0])
    sns.countplot(x = i , data = zero_span, ax = axes[1])

Except for sex, all trends are the same in both groups, so merging the negative values into 0 should not affect our ML model much.
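
The visual comparison can also be backed by numbers. The sketch below uses made-up rows (the toy DataFrame and the `group` column are assumptions for illustration) and compares the share of each category between the negative-span and zero-span groups with a row-normalised `pd.crosstab`.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) with a ServiceSpan and one categorical feature
toy = pd.DataFrame({
    "ServiceSpan": [-1, -1, -2, -1, 0, 0, 0, 0, 5, 12],
    "Married": ["No", "No", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"],
})

# Keep only the non-positive spans and label each row by group
low = toy[toy["ServiceSpan"] <= 0].copy()
low["group"] = np.where(low["ServiceSpan"] < 0, "negative", "zero")

# Row-normalised shares: each row sums to 1
shares = pd.crosstab(low["group"], low["Married"], normalize="index")

# Largest absolute difference in category share between the two groups
max_gap = (shares.loc["negative"] - shares.loc["zero"]).abs().max()

print(shares)
print(max_gap)
```

A small `max_gap` for a feature suggests the two groups are distributed similarly on it; repeating this over every column in `cat` would give one number per feature instead of a pair of plots.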

### Checking missing values and outliers

In [None]:
# We saw above that 'GrandPayment' feature has some missing values
# Checking out the rows having missing 'GrandPayment' values
X_train[X_train['GrandPayment'].isnull()]

In [None]:
no_payment_index = X_train[X_train['GrandPayment'].isnull()].index

In [None]:
Y_train[no_payment_index]

In [None]:
no_payment = pd.concat([X_train[X_train['GrandPayment'].isnull()], Y_train[no_payment_index]], axis = 'columns')
no_payment

It looks like the customers with missing GrandPayment values were disappointed and did not continue with the service after the quarterly payment; their ServiceSpan is also very low.

Hence, these values are missing not at random.
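
Because the missingness itself seems informative, one option is to keep a flag for it before filling the gap. This is only a sketch on made-up rows: the `GrandPaymentMissing` column name and the choice to fill with the row's own QuarterlyPayment are assumptions, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values); two customers have no GrandPayment recorded
train = pd.DataFrame({
    "ServiceSpan": [0, 1, 24, 0, 60],
    "QuarterlyPayment": [60.0, 75.5, 180.0, 59.0, 300.0],
    "GrandPayment": [np.nan, 75.5, 1440.0, np.nan, 6000.0],
})

# Keep the information that the value was missing (assumed column name)
train["GrandPaymentMissing"] = train["GrandPayment"].isna().astype(int)

# Assumption: a customer with no recorded total paid one quarterly payment
train["GrandPayment"] = train["GrandPayment"].fillna(train["QuarterlyPayment"])

print(train)
```

Other fill values (0, or a statistic computed on the training split only) would be equally easy to swap in; the flag column lets a model learn from the missingness whichever value is chosen.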

In some cases GrandPayment is less than QuarterlyPayment. Let's have a look at those as well.

In [None]:
lesser_grand_payment = X_train[X_train['GrandPayment'] < X_train['QuarterlyPayment']]
lesser_grand_payment.head()

In [None]:
lesser_grand_payment.shape

In [None]:
lesser_grand_payment_index = X_train[X_train['GrandPayment'] < X_train['QuarterlyPayment']].index

In [None]:
sns.countplot(x = Y_train[lesser_grand_payment_index])

It looks like the reason behind the lower GrandPayment has nothing to do with CustomerAttrition.
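
A quick way to check such a claim numerically is to compare the attrition rate of the two groups. The sketch below runs on made-up rows; the "Yes"/"No" labels and the `lesser_grand` column are assumptions for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values and labels)
toy = pd.DataFrame({
    "QuarterlyPayment": [100, 100, 100, 100, 100, 100, 100, 100],
    "GrandPayment": [50, 80, 90, 20, 300, 500, 1000, 150],
    "CustomerAttrition": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No"],
})

# Flag rows where the cumulative payment is below the quarterly payment
toy["lesser_grand"] = toy["GrandPayment"] < toy["QuarterlyPayment"]

# Share of each attrition label within each group (rows sum to 1)
rates = pd.crosstab(toy["lesser_grand"], toy["CustomerAttrition"], normalize="index")

print(rates)
```

If the rates in the two rows are close on the real data, that supports the reading of the count plot; a large gap would suggest the flag is worth keeping as a feature.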

In [None]:
f = plt.figure(figsize = (15, 18))
i =1
for  c in cat:
    f.add_subplot(4, 3, i)
    sns.countplot(x = c, data = lesser_grand_payment)
    i+=1
plt.tight_layout()
plt.show()

In [None]:
greater_grand_payment = X_train[~(X_train['GrandPayment'] < X_train['QuarterlyPayment'])]
greater_grand_payment.head()

In [None]:
sns.countplot(x = Y_train[greater_grand_payment.index])

In [None]:
for i in cat:
    # Left: rows with GrandPayment < QuarterlyPayment; right: all remaining rows
    fig, axes = plt.subplots(ncols=2, figsize=(15, 5))
    sns.countplot(x = i , data = lesser_grand_payment, ax = axes[0])
    sns.countplot(x = i , data = greater_grand_payment, ax = axes[1])

**Checking out categorical features**

In [None]:
X_train[cat].describe()

In [None]:
# Checking the number of unique values in each column
X_train[cat].nunique()

In [None]:
X_train[cat].head()

In [None]:
for i in cat:
    print(X_train[i].value_counts(),'\n')

In [None]:
f = plt.figure(figsize = (15, 18))
i =1
for  c in cat:
    f.add_subplot(4, 3, i)
    sns.countplot(x = c, data = pair, hue = 'CustomerAttrition')
    i+=1
plt.tight_layout()
plt.show()

In [None]:
len(cat)

In [None]:
# This is just another way to lay out the count plots above (without the hue)
fig, ax = plt.subplots(6, 2, figsize = [16,25])
# One count plot per categorical feature, filled row by row; the last of the 12 axes stays empty
for axis, c in zip(ax.flatten(), cat):
    sns.countplot(x = c, data = X_train, ax = axis)

plt.tight_layout()