### Churn Analysis

- 1 Data Cleaning (3 marks)
    - 1.1 Load the data and check whether it contains any missing data (3 marks)

- 2 Data Preparation & Analysis (22 marks)
    - 2.1 Drop variables that will not be used for the classification model: state, area code, phone number, customer service calls (3 marks)
    - 2.2 Replace yes with 1 and no with 0 for the following columns: international plan, voice mail plan (2 marks)
    - 2.3 Split the data into X and y, where X holds all the independent features and y holds the dependent feature (churn) (2 marks)
    - 2.4 Check the imbalance percentage: what percentage of customers in the data have churned? (3 marks)
    - 2.5 Randomly split the data into train and test. Use the following parameters: train_size=0.7, test_size=0.3, random_state=100 (2 marks)
    - 2.6 Use min-max scaler to fit_transform the train data and transform the test data
    - 2.7 Perform Logistic Regression using the scikit-learn package. Use the following parameters with the Logistic Regression model: max_iter = 1000, class_weight = 'balanced' (2 marks)
    - 2.8 Train the model on train data and check the sensitivity of the model in the test data (3 marks)
    - 2.9 Find the top positive influencing features and top negative influencing features using the coefficients 
    - 2.10 Explain the features and their importance to FREECELL based on the results generated in the last step (5 marks)
    

In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Importing all required packages
import numpy as np
import pandas as pd

# Data viz lib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from matplotlib.pyplot import xticks

In [2]:
# Read the file

churn = pd.read_csv('telcom.csv') 

# Checking top 5 rows
churn.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [3]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [4]:
# Analysis- 1.1
# Check whether the churn dataframe has any missing values
# Print the count of nulls for every column

# Write your code here

churn.isnull().sum()

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

## There are no null values in any column of the churn dataset.

In [5]:
# Analysis- 2.1
# Here we will be dropping those variables which will not be useful for building up the classification model.
# Write your code to drop the following columns: (state, area code, phone number, customer service calls) from the churn dataframe
# update the churn dataframe such that it doesn't contain the above mentioned columns

# Write your code here
# Hint: https://stackoverflow.com/a/37069701

churn.drop(['state', 'area code', 'phone number', 'customer service calls'], axis=1, inplace=True)


In [6]:
churn

Unnamed: 0,account length,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,churn
0,128,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,False
1,107,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,False
2,137,no,no,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,False
3,84,yes,no,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,False
4,75,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,192,no,yes,36,156.2,77,26.55,215.5,126,18.32,279.1,83,12.56,9.9,6,2.67,False
3329,68,no,no,0,231.1,57,39.29,153.4,55,13.04,191.3,123,8.61,9.6,4,2.59,False
3330,28,no,no,0,180.8,109,30.74,288.8,58,24.55,191.9,91,8.64,14.1,6,3.81,False
3331,184,yes,no,0,213.8,105,36.35,159.6,84,13.57,139.2,137,6.26,5.0,10,1.35,False


In [7]:
# Analysis- 2.2
# Here we will be replacing the values yes and no with 1 and 0 respectively for the columns 'international plan' and 'voice mail plan'
# This is required because, we will not be able to train the model with string values 'yes' and 'no'

# Write your code here
# Hint: https://stackoverflow.com/a/40901792

churn['international plan'] = churn['international plan'].map({'yes': 1, 'no': 0})
churn['voice mail plan'] = churn['voice mail plan'].map({'yes': 1, 'no': 0})


In [8]:
churn

Unnamed: 0,account length,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,churn
0,128,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,False
1,107,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,False
2,137,0,0,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,False
3,84,1,0,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,False
4,75,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,192,0,1,36,156.2,77,26.55,215.5,126,18.32,279.1,83,12.56,9.9,6,2.67,False
3329,68,0,0,0,231.1,57,39.29,153.4,55,13.04,191.3,123,8.61,9.6,4,2.59,False
3330,28,0,0,0,180.8,109,30.74,288.8,58,24.55,191.9,91,8.64,14.1,6,3.81,False
3331,184,1,0,0,213.8,105,36.35,159.6,84,13.57,139.2,137,6.26,5.0,10,1.35,False


In [9]:
# Analysis- 2.3
# Create the variables X and y
# X will have all the columns from the dataframe churn except 'churn', while y will only have the 'churn' column

# Write your code here
X = churn.drop(['churn'],axis=1)
y = churn[['churn']]

In [10]:
X

Unnamed: 0,account length,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge
0,128,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70
1,107,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70
2,137,0,0,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29
3,84,1,0,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78
4,75,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,192,0,1,36,156.2,77,26.55,215.5,126,18.32,279.1,83,12.56,9.9,6,2.67
3329,68,0,0,0,231.1,57,39.29,153.4,55,13.04,191.3,123,8.61,9.6,4,2.59
3330,28,0,0,0,180.8,109,30.74,288.8,58,24.55,191.9,91,8.64,14.1,6,3.81
3331,184,1,0,0,213.8,105,36.35,159.6,84,13.57,139.2,137,6.26,5.0,10,1.35


In [11]:
y

Unnamed: 0,churn
0,False
1,False
2,False
3,False
4,False
...,...
3328,False
3329,False
3330,False
3331,False


In [12]:
# Analysis- 2.4
# Find out the percentage of rows where churn=1 and churn=0. If the split is far from 50%-50%,
# we call the data imbalanced

# Write your code here

y.churn.value_counts(normalize = True)*100

False    85.508551
True     14.491449
Name: churn, dtype: float64

### The data is imbalanced: about 14.5% of customers churned and about 85.5% did not.
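The imbalance noted above is what the `class_weight = 'balanced'` setting used later in this notebook is meant to compensate for. As a rough sketch, scikit-learn documents the balanced weights as `n_samples / (n_classes * class_count)`. The counts below are derived from the output above (3333 rows, 14.491449% churned, which is 483 rows); the model itself computes the weights on the training labels only, so the numbers here are an illustration on the full dataset, not the exact weights used in the fit.

```python
# Class counts for the full dataset, derived from the value_counts output:
# 3333 rows in total, of which 483 have churn == True.
counts = {False: 2850, True: 483}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: n_samples / (n_classes * class_count).
# Rare classes receive a proportionally larger weight.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weights.items():
    print(f"churn={label}: weight={weight:.3f}")
```

With these weights each class contributes the same total weight to the loss, so the minority churn class is not drowned out by the majority.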

In [13]:
# Analysis- 2.5
# Here we will be splitting the data into train and test
# The train data will be having 70% of the churn data while test will have 30% of the churn data
# Use the following parameters
# train_size=0.7, test_size=0.3, random_state=100

from sklearn.model_selection import train_test_split

# Write your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

# Hint: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [14]:
# Analysis- 2.6
# Here we will be scaling our complete X_train and X_test data using min-max scaler
# The code has already been provided to you, you are not required to write anything over here
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [15]:
# Analysis- 2.7
# Here we will train our Logistic Regression model.
# Please use the following parameters with the LogisticRegression function: max_iter = 1000, class_weight = 'balanced'

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

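logreg = LogisticRegression(max_iter = 1000, class_weight = 'balanced') # Write your code here, define the Logistic Regression model with the parameters as provided

# y_train is a single-column DataFrame; ravel() passes it as the 1-D array that fit() expects
logreg.fit(X_train, y_train.values.ravel())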
y_pred = logreg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred))
print(metrics.recall_score(y_test, y_pred))
metrics.confusion_matrix(y_test, y_pred)

0.765
0.7131147540983607


array([[678, 200],
       [ 35,  87]], dtype=int64)

In [16]:
# Analysis-2.8
# Here we need to check the performance of the trained model
# Check the accuracy score
# Check the recall score
# Print the confusion matrix

# Write your code here
y_pred = logreg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred))
print(metrics.recall_score(y_test, y_pred))
metrics.confusion_matrix(y_test, y_pred)

0.765
0.7131147540983607


array([[678, 200],
       [ 35,  87]], dtype=int64)

In [17]:
print('Accuracy is : '+str(metrics.accuracy_score(y_test, y_pred)))
print('Recall is : '+str(metrics.recall_score(y_test, y_pred)))
print('\n')
print('Confusion Matrix below :')
metrics.confusion_matrix(y_test, y_pred)

Accuracy is : 0.765
Recall is : 0.7131147540983607


Confusion Matrix below :


array([[678, 200],
       [ 35,  87]], dtype=int64)

In [18]:
# Analysis-2.9
# Here we will be printing the coefficients of the model together with the variables
# You are not required to write anything here. The code is already provided
feature_importance = pd.DataFrame({"Feature":X.columns.tolist(),"Coefficients":logreg.coef_[0]})
feature_importance.sort_values(by = 'Coefficients')

Unnamed: 0,Feature,Coefficients
2,voice mail plan,-1.291948
14,total intl calls,-0.989424
11,total night calls,0.01689
0,account length,0.128915
13,total intl minutes,0.349296
15,total intl charge,0.357756
5,total day calls,0.371229
8,total eve calls,0.506451
12,total night charge,0.506992
10,total night minutes,0.509292


In [19]:
# Analysis-2.10
# Analyze the above results and explain the model to the FREECELL company
# Recommend the top positive influencing and top negative influencing variables.

# Add at least 5 recommendations for FREECELL company

## Analyze the above results and explain the model to the FREECELL company : 

The steps followed to build the model are listed below:

1. Data Cleaning Process : Loaded the data and checked the data types for any mismatches. Checked for null values as well; none were present.

2. Dropping unwanted variables : The dataset contained variables that would not be used for model building and prediction (state, area code, phone number, customer service calls), so we dropped them.

3. Data Mapping : Mapped the two variables international plan and voice mail plan from yes/no to 1/0, because the model cannot be trained on the string values 'yes' and 'no'.

4. Segregating Independent and Dependent Variables : Created two dataframes X and y, where X holds all the independent variables and y holds the dependent variable, churn.

5. Checking Imbalanced Data : Checked whether the data is balanced by computing the percentage of customers who churned and who did not. The data is imbalanced.

6. Train - Test Split : A model has to be trained on one part of the data and validated on another, so the data was split using the parameters train_size=0.7, test_size=0.3, random_state=100.

7. Min Max scaling : MinMax scaling was then performed to bring all the features to the same scale. The scaler was fitted on the train data and then applied to the test data.

8. Running Logistic Regression : Fitted a logistic regression on all the independent variables using LogisticRegression(max_iter = 1000, class_weight = 'balanced').

9. Checking Accuracy, Recall and Confusion Matrix : These are the performance measures for the model created above; they are interpreted below.
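The min-max scaling step can be sketched in plain Python to show why the scaler is fitted on the train data only. This is a minimal illustration, not the `MinMaxScaler` implementation: the helper names `fit_min_max` and `apply_min_max` and the sample values are hypothetical.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale each value with (x - min) / (max - min)."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical 'account length' values, not taken from the real split.
train = [28, 75, 84, 128, 192]
test = [68, 107, 200]

lo, hi = fit_min_max(train)                   # statistics come from train only
train_scaled = apply_min_max(train, lo, hi)   # always within [0, 1]
test_scaled = apply_min_max(test, lo, hi)     # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)
```

The test value 200 is larger than the train maximum, so it scales to slightly above 1. That is expected: reusing the train statistics keeps information about the test data out of the preprocessing.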


Accuracy - Accuracy is the ratio of correctly predicted observations to the total observations. It is a reliable measure only when the classes are roughly balanced and false positives and false negatives matter about equally, so other measures are also needed to evaluate the model. Our model scores 0.765, which means it classifies about 76.5% of the test customers correctly.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Recall - Recall is the ratio of correctly predicted positive observations to all observations that are actually positive. In other words, out of all the actual churn customers, how many did we predict correctly? It should be as high as possible.

Recall = TP / (TP + FN)


Accuracy does not tell the whole story when the classes are imbalanced, as in our case. Recall shows the share of actual churn customers that the model correctly identifies. Here recall matters more than accuracy because finding churn customers is more important.
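A small sketch of why accuracy alone can mislead on this data. It uses the test-set class counts implied by the confusion matrix printed earlier (678 + 200 = 878 non-churners, 35 + 87 = 122 churners) and a hypothetical baseline that predicts "no churn" for every customer.

```python
# Actual class counts in the test set, read from the confusion matrix rows.
actual_negative = 878   # customers who did not churn
actual_positive = 122   # customers who churned

# Hypothetical baseline: predict "no churn" for everyone.
tp, fn = 0, actual_positive   # every churner is missed
tn, fp = actual_negative, 0   # every non-churner is correct

baseline_accuracy = (tp + tn) / (tp + tn + fp + fn)
baseline_recall = tp / (tp + fn)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall:   {baseline_recall:.3f}")
```

The baseline reaches a higher accuracy (0.878) than the fitted model (0.765) while finding no churn customers at all, which is why recall is the measure to watch here.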
 
Confusion Matrix : It is a performance measurement for classification problems where the output can be two or more classes. For a two-class problem such as churn, it is a table with the 4 combinations of predicted and actual values.

True Positive:
Interpretation: You predicted positive and it’s true.

True Negative:
Interpretation: You predicted negative and it’s true.

False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false.

False Negative: (Type 2 Error)
Interpretation: You predicted negative and it’s false.
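The four cells can be read straight off the confusion matrix printed earlier. The sketch below assumes the scikit-learn layout, in which rows are actual classes and columns are predicted classes, ordered False then True, and recomputes accuracy and recall from those cells.

```python
# Confusion matrix from the notebook output: rows = actual, columns = predicted.
confusion = [[678, 200],
             [35, 87]]

(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy: {accuracy}")
print(f"Recall:   {recall}")
```

These values match the 0.765 accuracy and 0.7131 recall reported by `metrics.accuracy_score` and `metrics.recall_score`.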

# Recommend the top positive influencing and top negative influencing variables.

Top positive influencing variables (a positive coefficient increases the predicted likelihood of churn):

1. international plan	
2. total day minutes	
3. total day charge
4. number vmail messages	
5. total eve minutes	


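Top negative influencing variables (only two coefficients in the output above are negative):

1. voice mail plan (-1.291948)
2. total intl calls (-0.989424)

The next variables in the sorted output, total night calls (0.01689), account length (0.128915) and total intl minutes (0.349296), have small positive coefficients. They are the weakest positive influences, not negative ones.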
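One way to read the signs of the coefficients is to convert them to odds ratios with `exp(coefficient)`. The sketch below uses four coefficients copied from the output above. Because the features were min-max scaled, each ratio describes, approximately, how the odds of churn change when a feature moves from its minimum to its maximum in the train data with the other features held fixed. It shows association in this model, not causation.

```python
import math

# Coefficients copied from the sorted feature_importance output.
coefficients = {
    "voice mail plan": -1.291948,
    "total intl calls": -0.989424,
    "total night calls": 0.01689,
    "account length": 0.128915,
}

# exp(coefficient) is the multiplicative change in the odds of churn.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "lower" if ratio < 1 else "higher"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} odds of churn)")
```

An odds ratio below 1 means the feature is associated with less churn; a ratio above 1 means it is associated with more churn.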



## Add at least 5 recommendations for FREECELL company

1. The company should review and optimize its international plans. International plan is the top positive influencing variable, which means customers who hold one are more likely to churn.

2. The company should run marketing and retention campaigns aimed at customers with high total day minutes, since heavier daytime usage is associated with a higher likelihood of churn.

3. The company should keep the total day charge low at the start to attract customers and raise it only gradually, staying within an optimal limit so that it does not lose customers.

4. The company should launch plans designed specifically for evening usage, since customers with high total eve minutes are more likely to churn.

5. The company should not cut back voice mail plans. Voice mail plan has the largest negative coefficient (-1.29), so customers who have one are less likely to churn; promoting the plan is the safer choice.

6. Having an international plan is associated with higher churn, but total intl calls has a negative coefficient (-0.99), so customers who make more international calls are less likely to churn. The company should focus on these parameters when designing international plans.

7. Improve voice mail plans in line with customer satisfaction.

8. Marketing campaigns and new program teams should focus on creating plans that balance total day minutes against total day charge, and on plans for total eve minutes.