# Telecom Customer Churn Prediction

<b><br>Problem statement: Using all the information in this data set, build a model that predicts whether a particular customer will churn.
<br><br>During model development, the data set was split into train (70% of the data) and test (30% of the data) sets.
<br>On the train data, a model was built that identifies which attributes are significantly related to churn (e.g. 'tenure' (contract duration), 'PhoneService', 'PaperlessBilling', 'TotalCharges', 'OnlineBackup', 'TechSupport', ...).
<br>When the model was applied to the test data, churn was predicted with an accuracy of 78%.</b>
<br><br> Source: Kaggle
<br>The data set consists of the following files and features:
1. churn_data.csv
    * 'customerID'
    * 'tenure'
    * 'PhoneService'
    * 'PaperlessBilling'
    * 'PaymentMethod'
    * 'MonthlyCharges'
    * 'TotalCharges'
    * 'Churn' 
2. customer_data.csv
    * 'customerID'
    * 'gender'
    * 'SeniorCitizen'
    * 'Partner'
    * 'Dependents'    
3. internet_data.csv
    * 'customerID'
    * 'MultipleLines'
    * 'InternetService'
    * 'OnlineSecurity'
    * 'OnlineBackup'
    * 'DeviceProtection'
    * 'TechSupport'
    * 'StreamingTV'
    * 'StreamingMovies'



### IMPORTING NECESSARY LIBRARIES

In [None]:
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn import metrics
from sklearn.metrics import classification_report

### IMPORTING AND MERGING DATASETS

In [None]:
# importing datasets
churn = pd.read_csv('../input/logisticregression-telecomcustomer-churmprediction/churn_data.csv')
customer = pd.read_csv('../input/logisticregression-telecomcustomer-churmprediction/customer_data.csv')
internet = pd.read_csv('../input/logisticregression-telecomcustomer-churmprediction/internet_data.csv')

# merging churn and customer dataframe on customerID
df_1 = pd.merge(churn, customer, how='inner', on='customerID')

# merging df_1 and internet dataframe on customerID
data = pd.merge(df_1,internet, how='inner', on = 'customerID')
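
An inner merge silently drops customers that are missing from one of the files and silently duplicates rows if a `customerID` occurs twice. The sketch below shows one way to check for both cases. It uses made-up toy frames (`churn_toy`, `customer_toy`) rather than the real files, so the names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for churn_data.csv and customer_data.csv (values are made up)
churn_toy = pd.DataFrame({'customerID': ['A1', 'B2', 'C3'], 'tenure': [1, 34, 2]})
customer_toy = pd.DataFrame({'customerID': ['A1', 'B2', 'D4'],
                             'gender': ['Female', 'Male', 'Male']})

# validate='one_to_one' raises a MergeError if a customerID is duplicated on either side
merged_toy = pd.merge(churn_toy, customer_toy, how='inner', on='customerID',
                      validate='one_to_one')

# An outer merge with indicator=True reveals IDs that appear in only one frame
check = pd.merge(churn_toy, customer_toy, how='outer', on='customerID', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', 'customerID'].tolist()

print(len(merged_toy), sorted(unmatched))
```

If `unmatched` is empty for the real files, the inner merge keeps every customer.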

# 1. EXPLORATORY DATA ANALYSIS AND DATA CLEANING
### Checking merged dataframe and data statistics

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.info()

* `data.info()` shows that the data set has 7042 entries with textual (object) and numerical (int64 & float64) columns

In [None]:
# Checking for null values
data.isnull().sum()

* The data set does not contain any null values

Closer inspection shows that some values are stored in the wrong format and that some entries consist only of whitespace.
<br> For the model to work properly, the data must be cleaned first, e.g. by removing whitespace and converting columns to the correct type.

In [None]:
# TotalCharges is stored as object, not float.
# There are no null values, but the conversion below raises an error because
# the column 'TotalCharges' contains whitespace-only entries (' ').

# data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])

In [None]:
# Counting the whitespace-only entries (' ') in column 'TotalCharges'
data['TotalCharges'].str.isspace().value_counts()

* The data set contains 11 rows in which 'TotalCharges' is only whitespace

In [None]:
data['TotalCharges'].isnull().sum()

### Replacing whitespace with NaN values and converting to numeric data (float)

In [None]:
# Replacing whitespace with NaN values and converting to numeric data (float)
data['TotalCharges'] = data['TotalCharges'].replace(' ', np.nan)
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])
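
The same clean-up can be sketched in a single step with `pd.to_numeric(..., errors='coerce')`, which turns every value that cannot be parsed into NaN. Note that this is an alternative to the replacement used above, not what the notebook does, and that it would also hide other unparseable values. The toy series is made up for illustration.

```python
import pandas as pd

# Toy column mimicking 'TotalCharges': numbers stored as text, blanks as ' '
toy = pd.Series(['29.85', ' ', '108.15', ' '])

# Count whitespace-only entries before converting
n_blank = int(toy.str.isspace().sum())

# errors='coerce' converts unparseable entries (here the blanks) to NaN
as_float = pd.to_numeric(toy, errors='coerce')
n_missing = int(as_float.isnull().sum())

print(n_blank, n_missing, as_float.dtype)
```

Comparing `n_blank` with `n_missing` confirms that only the whitespace entries were turned into NaN.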

In [None]:
# How many NaN values are in the column
data['TotalCharges'].isnull().sum()

In [None]:
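# Replacing NaN values with an estimate: the mean ratio TotalCharges/MonthlyCharges
# over all rows, multiplied by the MonthlyCharges of the row

#new_value = data['TotalCharges'].astype('float').mean(axis=0)

new_value = (data['TotalCharges']/data['MonthlyCharges']).mean()*data['MonthlyCharges']
# fillna aligns new_value by index, so each missing row gets its own estimate
data['TotalCharges'] = data['TotalCharges'].fillna(new_value)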
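
The imputation above estimates a missing 'TotalCharges' from the customer's 'MonthlyCharges' and the average TotalCharges/MonthlyCharges ratio of all other customers. A minimal sketch on a made-up three-row frame (values are assumptions chosen so the arithmetic is easy to follow):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'MonthlyCharges': [20.0, 50.0, 80.0],
                    'TotalCharges': [40.0, np.nan, 320.0]})

# Row-wise ratios are 2.0, NaN, 4.0; .mean() skips NaN, giving 3.0
avg_ratio = (toy['TotalCharges'] / toy['MonthlyCharges']).mean()

# Per-row estimate: average ratio times that row's MonthlyCharges
estimate = avg_ratio * toy['MonthlyCharges']

# fillna with a Series fills each missing row with the value at the same index
toy['TotalCharges'] = toy['TotalCharges'].fillna(estimate)
print(toy)
```

Only the missing row changes (to 3.0 * 50.0 = 150.0); existing values are left untouched.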

In [None]:
# How many NaN values are in column 'TotalCharges' after the replacement
data['TotalCharges'].isnull().sum()

In [None]:
# Checking for null values
data.isnull().sum()

### Data Visualization

In [None]:
sns.pairplot(data=data)

In [None]:
sns.countplot(x = 'Contract', data=data)

* Ratio of contracts: Month-to-month vs. One year vs. Two year

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x = 'PaymentMethod', data=data)

In [None]:
sns.countplot(x = 'Churn', data=data)

* Of ~7000 customers, approximately 2000 have churned
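
With roughly 2000 churners among roughly 7000 customers, about 29% of customers churn, so a model that always predicts "no churn" would already be right about 71% of the time. The class shares can be read directly with `value_counts(normalize=True)`; the sketch below uses a made-up four-row series, not the real data.

```python
import pandas as pd

# Toy 'Churn' column (made-up values): one churner out of four customers
toy_churn = pd.Series(['No', 'No', 'Yes', 'No'], name='Churn')

# normalize=True returns shares instead of counts
shares = toy_churn.value_counts(normalize=True)
churn_rate = shares['Yes']

# Accuracy of a baseline that always predicts the majority class
baseline_accuracy = shares.max()
print(churn_rate, baseline_accuracy)
```

The baseline accuracy is a useful yardstick when judging the accuracy reported later.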

In [None]:
sns.countplot(x = 'gender', data=data)

# 2. DATA PREPROCESSING

In [None]:
pd.set_option('display.max_columns', 500)
data.head()

In [None]:
# List of binary (Yes/No) columns to encode as 1/0
lista = ['PhoneService','PaperlessBilling', 'Churn', 'Partner', 'Dependents']

# Using .map inside .apply to turn Yes/No into 1/0
data[lista] = data[lista].apply(lambda x:x.map({'Yes': 1, "No": 0}))
data.head()

In [None]:
# checking other columns, e.g. 'StreamingMovies'
data['StreamingMovies'].value_counts()

In [None]:
# Making dummy variables for categorical columns with more than two categories
# dtype=int keeps the dummies numeric (0/1) so that statsmodels accepts them

data_dummy = pd.get_dummies(data[['Contract', 'PaymentMethod', 'gender', 'MultipleLines', 'InternetService', 
                                     'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                                    'TechSupport', 'StreamingTV', 'StreamingMovies']], drop_first=True, dtype=int)
data_dummy.head()
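
To see what `drop_first=True` does, here is a minimal sketch on a made-up 'Contract' column. The first category in sorted order is dropped and acts as the reference level, so a row with all zeros means "Month-to-month". The `dtype=int` argument is a choice made here to get 0/1 integers instead of booleans.

```python
import pandas as pd

# Toy column with three categories (values made up for illustration)
toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year', 'One year']})

# k categories become k-1 dummy columns; 'Month-to-month' is the dropped reference
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Dropping one level avoids perfectly collinear columns once a constant is added to the regression.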

In [None]:
# Merging original data frame with 'dummy' dataframe
data = pd.concat([data,data_dummy], axis=1)
data.head()

In [None]:
data.columns

In [None]:
# Dropping attributes for which we made dummy variables

data = data.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetService', 
                        'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                        'TechSupport', 'StreamingTV', 'StreamingMovies'], axis=1)

# 3. MODEL DEVELOPMENT

In [None]:
# setting Independent variable (X) and Dependent variable (y)
X = data.drop(['Churn','customerID'], axis=1)
y = data['Churn']

# splitting data into train and test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
# the numeric features span very different ranges (0 - 8684), so they are standardized
# (StandardScaler was already imported at the top of the notebook)
scaler = StandardScaler()

### Model Training

In [None]:
# just checking X_train before standardization
X_train.head()

In [None]:
# standardization on X_train
X_train[['tenure','MonthlyCharges','TotalCharges']] = scaler.fit_transform(X_train[['tenure','MonthlyCharges','TotalCharges']])
X_train.head()

In [None]:
# Ordinary Least Squares: sm.OLS(y, X)
# (sm.OLS takes the arrays directly; the 'data' argument belongs to the formula API)
mod1 = sm.OLS(y_train, X_train)

In [None]:
results1 = mod1.fit()
print(results1.summary())

In [None]:
# LogisticRegression object as lr
lr = LogisticRegression()

In [None]:
#  RFE - Feature ranking with recursive feature elimination.
rfe = RFE(estimator=lr, n_features_to_select=20, step=1)    
rfe = rfe.fit(X_train, y_train)

In [None]:
# print summaries for the selection of attributes
print(rfe.support_)
print(rfe.ranking_)

In [None]:
# making a list and dataframe to see which attributes were selected
list_for_df = list(zip(X_train.columns, rfe.support_, rfe.ranking_))
df = pd.DataFrame(list_for_df, columns = ['X_train.columns', 'rfe.support_', 'rfe.ranking_'])
df.head()

In [None]:
# the list of attributes that are selected
sel_att = X_train.columns[rfe.support_]
sel_att
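
How `support_` and `ranking_` relate can be seen on a small synthetic problem. The sketch below uses `make_classification` data instead of the churn data, and `max_iter=1000` is an assumption added to avoid convergence warnings; neither is part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, 3 of them informative
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)

# Remove one feature per iteration until 3 remain
rfe_toy = RFE(estimator=LogisticRegression(max_iter=1000),
              n_features_to_select=3, step=1).fit(X_toy, y_toy)

n_selected = int(rfe_toy.support_.sum())   # number of True entries in the mask
ranks = sorted(rfe_toy.ranking_.tolist())  # selected features all have rank 1
print(n_selected, ranks)
```

Selected features share rank 1; eliminated features get higher ranks the earlier they were removed.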

In [None]:
#Adding a constant
X_train_const = sm.add_constant(X_train[sel_att])

In [None]:
# Ordinary Least Squares: sm.OLS(y, X)
# (sm.OLS takes the arrays directly; the 'data' argument belongs to the formula API)
mod2 = sm.OLS(y_train, X_train_const)
results2 = mod2.fit()
print(results2.summary())

In [None]:
# Getting the predicted values on the train set
y_predicted_train = results2.predict(X_train_const)
y_predicted_train.head()

In [None]:
# making a dataframe of actual and predicted train values
# note: the index holds the row labels of the merged dataframe (labelled 'customerID' below), not the ID strings
final_y_predicted_df = pd.DataFrame({'Churn':y_train.values, 'Churn_Predicted_Initial':y_predicted_train})
final_y_predicted_df.index.name = 'customerID'
final_y_predicted_df.head()

In [None]:
# TRAIN DATA & PREDICTED ON TRAIN DATA
# Creating new column 'Churn_Predicted_Final' with 1 if Churn_Predicted_Initial > 0.5 else 0
final_y_predicted_df['Churn_Predicted_Final'] = final_y_predicted_df.Churn_Predicted_Initial.map(lambda x: 1 if x > 0.5 else 0)
final_y_predicted_df.head()
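
Because the fitted model is an ordinary least squares regression on a 0/1 target, its predictions are scores that can fall below 0 or above 1; the 0.5 cut-off simply turns them into class labels. A vectorized sketch of the same thresholding on made-up scores:

```python
import pandas as pd

# Made-up prediction scores, including values outside [0, 1]
scores = pd.Series([-0.05, 0.2, 0.5, 0.73, 1.1])

# Strictly greater than 0.5 counts as churn (1), everything else as no churn (0)
labels = (scores > 0.5).astype(int)
print(labels.tolist())
```

A score of exactly 0.5 is labelled 0, matching the `x > 0.5` condition used in the notebook.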

In [None]:
# Confusion matrix for train data (rows: actual class, columns: predicted class)
confusion_matrix = metrics.confusion_matrix(final_y_predicted_df['Churn'], final_y_predicted_df['Churn_Predicted_Final'])
print(confusion_matrix)

In [None]:
# Classification report (precision, recall, f1-score) for train data
print(classification_report(final_y_predicted_df['Churn'], final_y_predicted_df['Churn_Predicted_Final']))

In [None]:
# Overall accuracy.
metrics.accuracy_score(final_y_predicted_df['Churn'], final_y_predicted_df['Churn_Predicted_Final'])
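
Accuracy is only one number that can be read from the confusion matrix. As a sketch, the snippet below unpacks a confusion matrix for made-up labels and derives accuracy, sensitivity (recall for churners) and specificity by hand.

```python
from sklearn import metrics

# Made-up actual and predicted labels (1 = churn)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of actual churners that were caught
specificity = tn / (tn + fp)   # share of non-churners correctly left alone
print(accuracy, sensitivity, specificity)
```

For an imbalanced target such as churn, sensitivity is often more informative than accuracy alone.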

### Model Testing

In [None]:
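# standardization on X_test, reusing the mean and std learned from X_train
# (transform, not fit_transform, so the test set is scaled exactly like the train set)
X_test[['tenure','MonthlyCharges','TotalCharges']] = scaler.transform(X_test[['tenure','MonthlyCharges','TotalCharges']])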

#Adding a constant
X_test_const = sm.add_constant(X_test[sel_att])
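
Scaling the test set is usually done with the statistics learned from the training set, so that the model sees test values on the same scale it was trained on. A minimal sketch with made-up one-column arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up train and test values for a single feature
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [30.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean=10 from train only
test_scaled = toy_scaler.transform(test)        # reuses the train mean and std

print(toy_scaler.mean_, test_scaled.ravel())
```

The test value 10.0 maps to 0 because it equals the training mean; refitting on the test set would instead center it on the test mean (20.0).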

In [None]:
# Getting the predicted values on the test set
y_predicted_test = results2.predict(X_test_const)
y_predicted_test.head()

In [None]:
# making a dataframe of actual and predicted test values
# note: despite its name, final_y_predicted_train holds the TEST results from here on
final_y_predicted_train = pd.DataFrame({'Churn':y_test.values, 'Churn_Predicted_Initial':y_predicted_test})
final_y_predicted_train.index.name = 'customerID'
final_y_predicted_train.head()

In [None]:
# TEST DATA & PREDICTED ON TEST DATA
# Creating new column 'Churn_Predicted_Final' with 1 if Churn_Predicted_Initial > 0.5 else 0
final_y_predicted_train['Churn_Predicted_Final'] = final_y_predicted_train.Churn_Predicted_Initial.map(lambda x: 1 if x > 0.5 else 0)
final_y_predicted_train.head()

In [None]:
# Confusion matrix for test data (rows: actual class, columns: predicted class)
confusion_matrix = metrics.confusion_matrix(final_y_predicted_train['Churn'], final_y_predicted_train['Churn_Predicted_Final'])
print(confusion_matrix)

In [None]:
# Classification report (precision, recall, f1-score) for test data
print(classification_report(final_y_predicted_train['Churn'], final_y_predicted_train['Churn_Predicted_Final']))

In [None]:
# Overall accuracy.
metrics.accuracy_score(final_y_predicted_train['Churn'], final_y_predicted_train['Churn_Predicted_Final'])

### Comparison: Logistic Regression without RFE (Recursive Feature Elimination)

In [None]:
lr.fit(X_train,y_train)

In [None]:
predictions_lr = lr.predict(X_test)

In [None]:
print(classification_report(y_test,predictions_lr))