## Problem Statement

In the telecom industry, customers can choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average annual churn rate of 15-25%. Because it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.

In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

* A model built on PCA-reduced features can achieve only one of the two goals:
    - Predicting which customers will churn.

* You can’t use that model to identify the important features for churn, because PCA usually creates components that are not easy to interpret.
  
* Therefore, build another model with the main objective of identifying important predictor attributes which help the business understand indicators of churn. 

A good choice for identifying important variables is a logistic regression model or a model from the tree family. In the case of logistic regression, make sure to handle multicollinearity.
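
One common way to detect multicollinearity before fitting a logistic regression is the variance inflation factor (VIF): each feature is regressed on the others, and VIF = 1 / (1 - R²). The sketch below is an illustration only, not the method used later in this notebook. The helper `vif_table` and the toy columns `a`, `b`, `c` are hypothetical, and the usual cut-off (a VIF above roughly 5 to 10) is a rule of thumb rather than a fixed standard.

```python
import numpy as np
import pandas as pd


def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    values = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        # Add an intercept column so R^2 is measured around the mean of the target.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[name] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs).sort_values(ascending=False)


# Toy data (hypothetical): "b" is almost a copy of "a", while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

vif = vif_table(toy)
print(vif)
```

In this toy data `a` and `b` should show very large VIFs and `c` a VIF close to 1. A typical workflow drops the feature with the highest VIF, recomputes, and repeats until all remaining values are acceptable.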

After identifying important predictors, display them visually - you can use plots, summary tables etc. - whatever you think best conveys the importance of features.

#### Finally, recommend strategies to manage customer churn based on your observations.

## Step 1 : Data Loading & Data Understanding

In [None]:
## Importing all necessary libraries

import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

In [None]:
## Setting max columns and rows to display

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)

In [None]:
## Loading and viewing the DataSet 

telecom_data = pd.read_csv('telecom_churn_data.csv')
telecom_data.head()

In [None]:
## Shape of the dataset

telecom_data.shape

In [None]:
## Checking null percentage in data

((telecom_data.isnull().sum()/telecom_data.index.size)*100).sort_values(ascending=False)

## Step 2 : Data Cleaning & Preparation

In [None]:
## Let us check whether we have any duplicate data, i.e. more than one record for any mobile number

telecom_data.mobile_number.nunique()

Since the unique value count is the same as the number of rows obtained from `shape` above, there is no duplicate data. Let us prepare the data as per the requirement.
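
To make the duplicate check concrete, here is a minimal sketch on toy data showing what the check reports when a number does repeat. The frame `toy` and its values are made up for illustration; only the column name `mobile_number` comes from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) in which one mobile number appears twice.
toy = pd.DataFrame(
    {"mobile_number": [7000000001, 7000000002, 7000000002, 7000000003]}
)

n_rows = len(toy)                                   # total records
n_unique = toy["mobile_number"].nunique()           # distinct mobile numbers
n_duplicate_rows = int(toy["mobile_number"].duplicated().sum())  # repeats after the first occurrence

print(n_rows, n_unique, n_duplicate_rows)
```

When `n_unique` equals `n_rows` (equivalently, when `n_duplicate_rows` is 0), every record belongs to a different number.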

In [None]:
telecom_data.head()

In [None]:
## Before any modifications, let us keep the original dataset separate and work on a copy of it

telecom_df = telecom_data.copy()
telecom_df.shape

In [None]:
## The circle_id column (telecom circle area) is not relevant to the analysis or to modelling
## Therefore, let us drop this feature

telecom_df = telecom_df.drop(['circle_id'], axis=1)
telecom_df.head()

In [None]:
## Checking data statistics

telecom_df.describe()

In [None]:
## Checking the percentage of NA values in the data recharge columns for months 6, 7, 8 and 9

print((telecom_df.total_rech_data_6.isna().sum()/telecom_data.index.size)*100)
print((telecom_df.total_rech_data_7.isna().sum()/telecom_data.index.size)*100)
print((telecom_df.total_rech_data_8.isna().sum()/telecom_data.index.size)*100)
print((telecom_df.total_rech_data_9.isna().sum()/telecom_data.index.size)*100)

In [None]:
## We impute these values with 0 instead of dropping them, since we may need these columns to create new features

telecom_df.total_rech_data_6 = telecom_df.total_rech_data_6.replace(np.nan, 0.0)
telecom_df.total_rech_data_7 = telecom_df.total_rech_data_7.replace(np.nan, 0.0)
telecom_df.total_rech_data_8 = telecom_df.total_rech_data_8.replace(np.nan, 0.0)
telecom_df.total_rech_data_9 = telecom_df.total_rech_data_9.replace(np.nan, 0.0)

In [None]:
## Similar to the recharge features, we will impute 0.0 for all average recharge amount data features

telecom_df.av_rech_amt_data_6 = telecom_df.av_rech_amt_data_6.replace(np.nan, 0.0)
telecom_df.av_rech_amt_data_7 = telecom_df.av_rech_amt_data_7.replace(np.nan, 0.0)
telecom_df.av_rech_amt_data_8 = telecom_df.av_rech_amt_data_8.replace(np.nan, 0.0)
telecom_df.av_rech_amt_data_9 = telecom_df.av_rech_amt_data_9.replace(np.nan, 0.0)
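
The eight per-column replacements above follow a single naming pattern, so they can also be written as one `fillna` call over a generated column list. The sketch below shows the idea on a small stand-in frame. The frame `toy` and its values are hypothetical, and only two months are included to keep it short.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; only the column naming pattern is taken from the notebook.
toy = pd.DataFrame({
    "total_rech_data_6": [1.0, np.nan],
    "total_rech_data_7": [np.nan, 2.0],
    "av_rech_amt_data_6": [np.nan, 154.0],
    "av_rech_amt_data_7": [98.0, np.nan],
})

months = [6, 7]  # the notebook uses months 6, 7, 8 and 9
prefixes = ["total_rech_data", "av_rech_amt_data"]
rech_cols = [f"{prefix}_{month}" for prefix in prefixes for month in months]

# Fill missing values with 0.0 in all selected columns at once.
toy[rech_cols] = toy[rech_cols].fillna(0.0)
print(toy)
```

Building the column list from prefixes and months keeps the imputation in one place if more months or recharge columns need the same treatment.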

### Filter High-Value Customers

High-value customers are those who have recharged with an amount greater than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

In [None]:
## Calculating the average recharge amount in the good phase (months 6 and 7), used to find the 70th percentile of the data

telecom_df['avg_rech_good_phase'] = ((telecom_df['total_rech_amt_6'] + telecom_df['total_rech_amt_7'])/2)

In [None]:
X = telecom_df.avg_rech_good_phase.quantile(0.7)
X

Filtering the records where the average good-phase recharge amount is greater than or equal to the above value of X, which is 368.5.

In [None]:
telecom_df = telecom_df[telecom_df.avg_rech_good_phase >= X] 
telecom_df.shape
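
The same filter can be traced by hand on a tiny example. The sketch below uses ten made-up customers with hypothetical recharge amounts; the column names mirror the ones used above, and `quantile` is left at its default linear interpolation.

```python
import pandas as pd

# Ten toy customers (hypothetical amounts) with identical recharge in months 6 and 7.
amounts = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
toy = pd.DataFrame({"total_rech_amt_6": amounts, "total_rech_amt_7": amounts})

# Average recharge over the good phase (months 6 and 7).
toy["avg_rech_good_phase"] = (toy["total_rech_amt_6"] + toy["total_rech_amt_7"]) / 2

# 70th percentile of the average, then keep customers at or above it.
threshold = toy["avg_rech_good_phase"].quantile(0.7)
high_value = toy[toy["avg_rech_good_phase"] >= threshold]

print(threshold, len(high_value))
```

With linear interpolation the threshold falls between the 7th and 8th sorted values, so the three largest customers are kept, which is roughly the top 30% as intended.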