# Churn Prediction using Logistic Regression in Machine Learning

### Problem Definition
- **Type**: Binary classification (logistic regression is a classification model despite its name)

### Problem Statement
The goal is to predict customer churn using logistic regression. Churn prediction aims to identify customers who are likely to leave a service, so that the company can take proactive measures to retain them. Logistic regression suits this task because it is simple and effective for binary classification problems.

### Data Handling

#### Data Sourcing
Gather historical customer data from the company. These records describe each customer and indicate whether that customer left the service or stayed.

#### Defining Parameters
- Parameters include customer demographic information, account information, and service usage patterns.

#### Expert Consultation
- Consultation with domain experts to understand key factors influencing churn.

### Evaluation Metric
- The evaluation metric chosen for this project is accuracy.

In [106]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [108]:
dataset = pd.read_csv("ChurnData.csv")

In [110]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 28 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tenure    200 non-null    float64
 1   age       200 non-null    float64
 2   address   200 non-null    float64
 3   income    200 non-null    float64
 4   ed        200 non-null    float64
 5   employ    200 non-null    float64
 6   equip     200 non-null    float64
 7   callcard  200 non-null    float64
 8   wireless  200 non-null    float64
 9   longmon   200 non-null    float64
 10  tollmon   200 non-null    float64
 11  equipmon  200 non-null    float64
 12  cardmon   200 non-null    float64
 13  wiremon   200 non-null    float64
 14  longten   200 non-null    float64
 15  tollten   200 non-null    float64
 16  cardten   200 non-null    float64
 17  voice     200 non-null    float64
 18  pager     200 non-null    float64
 19  internet  200 non-null    float64
 20  callwait  200 non-null    float64

In [112]:
dataset.corr()

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,longmon,...,pager,internet,callwait,confer,ebill,loglong,logtoll,lninc,custcat,churn
tenure,1.0,0.431802,0.456328,0.109383,-0.070503,0.445755,-0.117102,0.42653,-0.07059,0.763134,...,0.018791,-0.164921,-0.009747,0.08065,-0.099128,0.864388,0.310045,0.246353,0.134237,-0.37686
age,0.431802,1.0,0.746566,0.211275,-0.071509,0.622553,-0.071357,0.170404,-0.065527,0.373547,...,0.006803,-0.078395,0.020002,0.030625,-0.048279,0.379413,0.0936,0.313359,0.041055,-0.287697
address,0.456328,0.746566,1.0,0.132807,-0.14555,0.520926,-0.148977,0.209204,-0.146478,0.421782,...,-0.105812,-0.191058,-0.019967,-0.030494,-0.172171,0.409357,0.018386,0.212929,-0.016841,-0.260659
income,0.109383,0.211275,0.132807,1.0,0.141241,0.345161,-0.010741,-0.019969,-0.029635,0.041808,...,0.056977,0.102809,0.081133,-0.031556,-0.041392,0.065595,-0.156498,0.680313,0.030725,-0.09079
ed,-0.070503,-0.071509,-0.14555,0.141241,1.0,-0.213886,0.488041,-0.071178,0.26767,-0.072735,...,0.258698,0.552996,-0.016247,-0.132215,0.427315,-0.054581,-0.007227,0.206718,0.013127,0.216112
employ,0.445755,0.622553,0.520926,0.345161,-0.213886,1.0,-0.17447,0.266612,-0.101187,0.363386,...,0.038381,-0.250044,0.119708,0.173247,-0.151965,0.377186,0.068718,0.540052,0.131292,-0.337969
equip,-0.117102,-0.071357,-0.148977,-0.010741,0.488041,-0.17447,1.0,-0.087051,0.386735,-0.097618,...,0.308633,0.623509,-0.034021,-0.103499,0.603133,-0.113065,-0.027882,0.083494,0.174955,0.275284
callcard,0.42653,0.170404,0.209204,-0.019969,-0.071178,0.266612,-0.087051,1.0,0.220118,0.322514,...,0.251069,-0.067146,0.370878,0.311056,-0.045058,0.35103,0.08006,0.15692,0.407553,-0.311451
wireless,-0.07059,-0.065527,-0.146478,-0.029635,0.26767,-0.101187,0.386735,0.220118,1.0,-0.073043,...,0.667535,0.343631,0.38967,0.382925,0.321433,-0.042637,0.178317,0.033558,0.598156,0.174356
longmon,0.763134,0.373547,0.421782,0.041808,-0.072735,0.363386,-0.097618,0.322514,-0.073043,1.0,...,-0.001372,-0.223929,0.032913,0.060614,-0.124605,0.901631,0.247302,0.12255,0.072519,-0.292026
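
The full correlation matrix is hard to scan, so one option is to look only at the `churn` column and rank features by the absolute value of their correlation with it. The sketch below uses a tiny made-up frame (`toy` and its values are illustrative, not taken from `ChurnData.csv`):

```python
import pandas as pd

# Tiny made-up frame standing in for the churn data (illustrative values only).
toy = pd.DataFrame({
    "tenure": [2, 5, 10, 30, 45, 60],
    "equip":  [1, 1, 0, 1, 0, 0],
    "churn":  [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target, without the target itself.
churn_corr = toy.corr()["churn"].drop("churn")

# Order features by absolute correlation, strongest first, keeping the sign.
order = churn_corr.abs().sort_values(ascending=False).index
ranked = churn_corr.reindex(order)
print(ranked)
```

On the real data the same two lines would be applied to `dataset.corr()` before the DataFrame is converted to an array.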


### The columns include:
- **tenure**: The number of months the customer has been with the company.
- **age**: The age of the customer.
- **address**: The number of years at the current address.
- **income**: The annual income of the customer.
- **ed**: Level of education.
- **employ**: Number of years with the current employer.
- **equip**: Whether the customer has the equipment.
- **callcard**: Whether the customer has a call card.
- **wireless**: Whether the customer has wireless service.
- **longmon**: Monthly long distance charges.
- Other columns related to customer service usage and preferences.
- **churn**: Indicates if the customer has churned (1) or not (0).


In [115]:
dataset.head(10)

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,longmon,...,pager,internet,callwait,confer,ebill,loglong,logtoll,lninc,custcat,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1.0,1.0,4.4,...,1.0,0.0,1.0,1.0,0.0,1.482,3.033,4.913,4.0,1.0
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,0.0,0.0,9.45,...,0.0,0.0,0.0,0.0,0.0,2.246,3.24,3.497,1.0,1.0
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0.0,0.0,6.3,...,0.0,0.0,0.0,1.0,0.0,1.841,3.24,3.401,3.0,0.0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,1.0,1.0,6.05,...,1.0,1.0,1.0,1.0,1.0,1.8,3.807,4.331,4.0,0.0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,1.0,0.0,7.1,...,0.0,0.0,1.0,1.0,0.0,1.96,3.091,4.382,3.0,0.0
5,68.0,52.0,17.0,120.0,1.0,24.0,0.0,1.0,0.0,20.7,...,0.0,0.0,0.0,0.0,0.0,3.03,3.24,4.787,1.0,0.0
6,42.0,40.0,7.0,37.0,2.0,8.0,1.0,1.0,1.0,8.25,...,0.0,1.0,1.0,1.0,1.0,2.11,3.157,3.611,4.0,0.0
7,9.0,21.0,1.0,17.0,2.0,2.0,0.0,0.0,0.0,2.9,...,0.0,0.0,0.0,0.0,0.0,1.065,3.24,2.833,1.0,0.0
8,35.0,50.0,26.0,140.0,2.0,21.0,0.0,1.0,0.0,6.5,...,0.0,0.0,1.0,1.0,0.0,1.872,3.314,4.942,3.0,0.0
9,49.0,51.0,27.0,63.0,4.0,19.0,0.0,1.0,0.0,12.85,...,0.0,1.0,1.0,0.0,1.0,2.553,3.248,4.143,2.0,0.0


#### This cell imports `SelectKBest` and `chi2` from the `sklearn.feature_selection` module. `SelectKBest` selects the top K features according to a scoring function, and `chi2` is the chi-squared scoring function used here for feature selection.

In [118]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [120]:
dataset.columns

Index(['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',
       'callcard', 'wireless', 'longmon', 'tollmon', 'equipmon', 'cardmon',
       'wiremon', 'longten', 'tollten', 'cardten', 'voice', 'pager',
       'internet', 'callwait', 'confer', 'ebill', 'loglong', 'logtoll',
       'lninc', 'custcat', 'churn'],
      dtype='object')

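### Convert dataset to NumPy array
#### `dataset = dataset.values` converts the dataset from a pandas DataFrame to a NumPy array, so that features and target can be selected by column position in the next cell. `SelectKBest` also accepts a DataFrame, so the conversion is a convenience rather than a requirement.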

In [123]:
dataset = dataset.values

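## Features and target
#### `X` takes the first 26 columns (indices 0 to 25) as features and `y` takes column 27 (`churn`). Column 26 (`custcat`) falls outside the `0:26` slice, so it is not used as a feature.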


In [126]:
X = dataset[:,0:26]
y = dataset[:,27]

#### Feature selection
##### `SelectKBest(score_func=chi2, k=6)` initializes the selector with the chi-squared scoring function and specifies that the top 6 features should be kept.
##### `fit = test.fit(X, y)` scores each feature against the target and records which 6 score highest.

In [129]:
test = SelectKBest(score_func=chi2, k=6)        # k is number of features
fit = test.fit(X, y)
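
Because the data was converted to a NumPy array, the selector reports scores without column names. A minimal sketch of one way to recover names, assuming the DataFrame is still available: fit on the DataFrame and index its columns with `get_support()`. The frame `toy` and its columns are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=100)

# Synthetic non-negative features (chi2 requires non-negative values):
# "signal" depends on the label, the other two columns do not.
toy = pd.DataFrame({
    "signal": y_toy * 10 + rng.integers(0, 3, size=100),
    "noise_a": rng.integers(0, 5, size=100),
    "noise_b": rng.integers(0, 5, size=100),
})

selector = SelectKBest(score_func=chi2, k=1).fit(toy, y_toy)

# get_support() returns a boolean mask over the input columns.
selected = list(toy.columns[selector.get_support()])
print(selected)
```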

### Print scores
##### `np.set_printoptions(precision=3)` sets NumPy to display numbers with 3 decimal places.
##### `print(fit.scores_)` prints the chi-squared score for each feature.

In [132]:
np.set_printoptions(precision=3)
print(fit.scores_)

[3.728e+02 6.842e+01 1.198e+02 3.601e+02 5.437e+00 1.784e+02 8.715e+00
 5.723e+00 4.317e+00 1.407e+02 7.581e-01 3.144e+02 9.504e+01 2.401e+02
 1.670e+04 9.338e+02 1.417e+04 2.856e+00 2.252e+00 7.274e+00 3.049e-01
 7.149e-01 7.274e+00 5.506e+00 2.395e-02 3.935e-01]
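
The scores above are listed in column order, so they can be matched to the names printed by `dataset.columns`. The sketch below copies the rounded scores from the output and assumes the first 26 column names line up with them one to one; it then picks the six largest.

```python
import numpy as np

# First 26 column names, copied from the dataset.columns output above.
columns = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',
           'callcard', 'wireless', 'longmon', 'tollmon', 'equipmon', 'cardmon',
           'wiremon', 'longten', 'tollten', 'cardten', 'voice', 'pager',
           'internet', 'callwait', 'confer', 'ebill', 'loglong', 'logtoll',
           'lninc']

# Rounded chi-squared scores, copied from the printed output above.
scores = np.array([
    3.728e+02, 6.842e+01, 1.198e+02, 3.601e+02, 5.437e+00, 1.784e+02, 8.715e+00,
    5.723e+00, 4.317e+00, 1.407e+02, 7.581e-01, 3.144e+02, 9.504e+01, 2.401e+02,
    1.670e+04, 9.338e+02, 1.417e+04, 2.856e+00, 2.252e+00, 7.274e+00, 3.049e-01,
    7.149e-01, 7.274e+00, 5.506e+00, 2.395e-02, 3.935e-01])

# Indices of the six largest scores, restored to original column order.
top6 = np.sort(np.argsort(scores)[-6:])
selected = [columns[i] for i in top6]
print(selected)
```

Under that assumption the selected features are `tenure`, `income`, `equipmon`, `longten`, `tollten` and `cardten`.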


#### Transform dataset: this line reduces the original feature set `X` to only the top K features (here, 6) selected by the chi-squared test.

In [135]:
X_SelectedFeatures = fit.transform(X)
print(X_SelectedFeatures)

[[  11.    136.      0.     42.    211.45  125.  ]
 [  33.     33.      0.    288.8     0.      0.  ]
 [  23.     30.      0.    157.05    0.      0.  ]
 ...
 [   6.     47.      0.     29.9   128.45   80.  ]
 [  24.     25.      0.    186.6  1152.9   780.  ]
 [  61.    190.     42.55 1063.15    0.   1600.  ]]


#### This cell imports the `StandardScaler` class from `sklearn.preprocessing`. `StandardScaler` standardizes features by removing the mean and scaling to unit variance.

In [138]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

#### Splitting the dataset: this splits the data into training and testing sets, with 33% of the rows held out for testing.
#### `X_SelectedFeatures` is the feature matrix.
#### `y` is the target.

In [141]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_SelectedFeatures, y, test_size=0.33, random_state=42)
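
To see what this split produces for 200 rows, the sketch below runs the same call on a placeholder array of the same shape. The `stratify` argument is an addition that the notebook does not use; it keeps the class ratio similar in both parts, which can help when churners are a minority. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the notebook's shape: 200 rows, 6 selected features.
X_toy = np.arange(200 * 6).reshape(200, 6)
y_toy = np.array([0, 0, 0, 1] * 50)  # made-up labels, 25% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42,
    stratify=y_toy,  # assumption: not part of the original call
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```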

## Scaling the training data and the test data
### `fit_transform` is called on `X_train` to compute the mean and standard deviation of the training data and then transform it, so the training data is scaled with its own statistics.
### `transform` is called on `X_test` to scale it using the mean and standard deviation computed from the training data. This ensures that the test data is scaled in the same way as the training data.

In [144]:
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
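
The difference between `fit_transform` and `transform` is easiest to see on a few hand-made numbers. In this small sketch (the arrays are illustrative), the test row is scaled with the training mean and standard deviation, so it does not end up with zero mean itself:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)    # per-column training means
print(train_scaled)
print(test_scaled)
```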

### Outputting the scaled training data

In [147]:
X_train_scaled

array([[ 0.692, -0.494, -0.805,  0.069,  2.677,  2.041],
       [ 1.45 ,  3.898, -0.805,  0.127, -0.662,  0.424],
       [-0.492, -0.213, -0.805, -0.557,  0.106, -0.739],
       [-0.777,  0.024,  2.288, -0.625, -0.136, -0.411],
       [-0.445, -0.169, -0.805, -0.473,  0.05 , -0.566],
       [-1.203, -0.701, -0.805, -0.675, -0.662, -0.739],
       [ 1.166, -0.272,  2.043,  0.414, -0.662,  1.206],
       [ 0.834, -0.479, -0.805, -0.268, -0.662, -0.184],
       [-0.587,  1.207, -0.805, -0.505, -0.112, -0.739],
       [-0.871, -0.568, -0.805, -0.566, -0.154, -0.739],
       [ 1.118,  0.467, -0.805,  0.339, -0.662,  0.687],
       [-0.871, -0.612, -0.805, -0.719, -0.662, -0.572],
       [-1.629,  0.97 ,  0.325, -0.753, -0.654, -0.736],
       [-1.44 , -0.346, -0.805, -0.726, -0.662, -0.697],
       [-0.587, -0.583, -0.805, -0.545, -0.662, -0.739],
       [-1.203, -0.257,  0.808, -0.691, -0.662, -0.739],
       [-0.35 ,  0.186,  0.754, -0.659, -0.662, -0.739],
       [ 1.071, -0.45 , -0.805,

## Model Training & Prediction
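Train a logistic regression model, then predict on the data held out for testing and calculate the accuracy score of the predictions. Note that the cell below fits and predicts on the unscaled `X_train` and `X_test`; the scaled arrays computed above are not passed to the model, so the scores reported below are for unscaled features.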

In [150]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)

In [152]:
y_pred = model.predict(X_test)
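
One way to make sure the scaler is always applied before the classifier is to chain the two in a pipeline. The following is a sketch on synthetic data, not the notebook's result: `X_toy`, `y_toy` and `clf` are illustrative names, and only the `C=0.01` and `solver='liblinear'` settings are taken from the cell above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales, similar in spirit to
# tenure (months) versus cumulative charges.
scales = np.array([1, 10, 100, 1, 10, 100])
X_toy = rng.normal(size=(200, 6)) * scales
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 10 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42)

# The pipeline fits the scaler on the training data only and applies the
# same scaling automatically at prediction time.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, solver='liblinear'),
)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)
print(clf.score(X_te, y_te))
```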

#### This cell imports the `warnings` module and sets it to ignore all warnings. This is often done to keep the output clean, but it is essential to read and understand warnings rather than simply ignoring them.

In [155]:
import warnings
warnings.filterwarnings("ignore")

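#### This cell imports `jaccard_score` from `sklearn.metrics` and computes it between the true labels `y_test` and the predicted labels `y_pred`. The Jaccard score measures the similarity between two sets: for the positive class it is the number of samples labelled as churners in both the true and the predicted labels, divided by the number labelled as churners in either.
##### Decision boundary: the logistic regression model predicts a class by comparing the predicted probability σ(z) with a threshold (0.5 by default), where z is the linear combination of the features and σ is the sigmoid function.
##### Y = 1 if σ(z) >= 0.5 (churned)
##### Y = 0 if σ(z) < 0.5 (not churned)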
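
The thresholding rule can be written out in a few lines of NumPy. This is a sketch with hand-picked values of z (the linear score), not output from the fitted model:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked linear scores, for illustration only.
z = np.array([-2.0, -0.1, 0.3, 4.0])
p = sigmoid(z)

# Apply the 0.5 threshold: 1 means churned, 0 means not churned.
labels = (p >= 0.5).astype(int)

print(p)
print(labels)
```

Since σ(0) = 0.5, comparing the probability with 0.5 is the same as checking the sign of z.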

In [158]:
from sklearn.metrics import jaccard_score
jaccard_score(y_test, y_pred)

0.5
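
For binary labels the Jaccard score of the positive class is TP / (TP + FP + FN), so true negatives do not count. The sketch below checks this by hand against `jaccard_score` on made-up labels (the arrays are illustrative, not the notebook's `y_test` and `y_pred`):

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Made-up labels for illustration.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_hat = np.array([1, 1, 0, 0, 0, 1, 0, 0])

tp = np.sum((y_true == 1) & (y_hat == 1))  # churners found
fp = np.sum((y_true == 0) & (y_hat == 1))  # false alarms
fn = np.sum((y_true == 1) & (y_hat == 0))  # churners missed

manual = tp / (tp + fp + fn)
print(manual, jaccard_score(y_true, y_hat))
```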

In [160]:
from sklearn.metrics import accuracy_score
print("Accuracy of the model is :",accuracy_score(y_test, y_pred))

Accuracy of the model is : 0.8333333333333334


### Error metrics to evaluate the model

In [165]:

from sklearn.metrics import mean_absolute_error, mean_squared_error 
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [167]:
print(mae)

0.16666666666666666


In [169]:
print(rmse)

0.408248290463863
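
With 0/1 labels, every wrong prediction contributes an absolute error of exactly 1, so MAE equals the misclassification rate (1 minus accuracy) and RMSE is its square root. The sketch below reproduces the reported figures under the assumption that the test set has 66 rows with 11 misclassified, which is inferred from the printed accuracy of 0.8333 rather than stated in the notebook:

```python
import numpy as np

# Assumed counts: 66 test rows, 11 of them misclassified (inferred, not given).
y_true = np.zeros(66)
y_hat = y_true.copy()
y_hat[:11] = 1  # flip 11 predictions to make them wrong

accuracy = np.mean(y_true == y_hat)
mae = np.mean(np.abs(y_true - y_hat))
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(accuracy, mae, rmse)
```

In other words, for a classifier these two numbers carry the same information as accuracy.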


### feature importance 
#### In the context of logistic regression, feature importance typically refers to the coefficients of the features in the model. Logistic regression is a linear model, so each coefficient describes how the log-odds of the target change with its feature. Magnitudes are comparable across features only when the features are on the same scale (for example, after standardization).

#### For logistic regression:
- Each feature has an associated coefficient (weight).
- The sign of the coefficient (positive or negative) indicates the direction of the relationship between the feature and the target.
  - A positive coefficient means that as the feature value increases, the likelihood of the positive class (or target class) increases.
  - A negative coefficient means that as the feature value increases, the likelihood of the positive class decreases.
- The magnitude of the coefficient indicates the strength of the relationship.
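
The points above can be illustrated on synthetic data where the true relationship is known. In this sketch (`X_toy`, `y_toy` and `toy_model` are illustrative names), the first feature pushes the label towards 1 and the second towards 0. Exponentiating a coefficient gives an odds ratio: the factor by which the odds of the positive class change for a one-unit increase in that (standardized) feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 2))

# Label depends positively on feature 0 and, more weakly, negatively on feature 1.
noise = rng.normal(scale=0.5, size=300)
y_toy = (2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + noise > 0).astype(int)

# Standardize so that coefficient magnitudes are comparable.
X_std = StandardScaler().fit_transform(X_toy)
toy_model = LogisticRegression().fit(X_std, y_toy)

coefs = toy_model.coef_[0]
odds_ratios = np.exp(coefs)
print(coefs)
print(odds_ratios)
```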



In [193]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load data
data = pd.read_csv('ChurnData.csv')

# data setup
X = data[['age','address','income','ed','employ']]
y = data['churn']

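# Preprocessing
# Note: 'address' appears in both lists, so it is scaled as a numeric feature
# and also one-hot encoded as a categorical one. That is why the output below
# is dominated by address_* columns.
numeric_features = ['age','address','income','ed','employ']
categorical_features = ['address']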

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
          ])

# Model pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression())])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model.fit(X_train, y_train)

# Extract the logistic regression model from the pipeline
log_reg = model.named_steps['classifier']

# Get feature names
feature_names = numeric_features + list(preprocessor.transformers_[1][1].named_steps['onehot'].get_feature_names_out(categorical_features))

# Get coefficients
coefficients = log_reg.coef_[0]

# Create a DataFrame of coefficients, sorted by absolute value
feature_importance = pd.DataFrame({'Feature': feature_names, 'Importance': coefficients})
feature_importance['Absolute Importance'] = feature_importance['Importance'].abs()
feature_importance = feature_importance.sort_values(by='Absolute Importance', ascending=False)

print(feature_importance)

         Feature  Importance  Absolute Importance
31  address_27.0    1.357011             1.357011
10   address_6.0    1.141270             1.141270
14  address_10.0   -1.030480             1.030480
17  address_13.0    0.993845             0.993845
4         employ   -0.841432             0.841432
36  address_33.0    0.812501             0.812501
7    address_3.0   -0.654386             0.654386
8    address_4.0   -0.557345             0.557345
3             ed    0.497200             0.497200
0            age   -0.462289             0.462289
20  address_16.0   -0.460108             0.460108
6    address_2.0    0.458957             0.458957
22  address_18.0    0.441784             0.441784
15  address_11.0   -0.441168             0.441168
33  address_29.0   -0.414984             0.414984
21  address_17.0   -0.362040             0.362040
25  address_21.0   -0.335741             0.335741
19  address_15.0   -0.307058             0.307058
27  address_23.0   -0.297083             0.297083


## Save model

In [213]:
import pickle


In [219]:
# Use a context manager so the file is closed after writing
with open("ml project 1.pkl", "wb") as f:
    pickle.dump(model, f)

In [205]:
# Use a context manager so the file is closed after reading
with open("ml project 1.pkl", "rb") as f:
    load = pickle.load(f)
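
A quick way to check a saved model is to reload it and confirm that it predicts the same labels as the original. The sketch below does this with a toy model and a temporary file; the data, the `toy_model.pkl` file name and the variable names are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the trained pipeline.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X_toy, y_toy)

# Save to a temporary location, then load it back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(clf, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should behave exactly like the original.
same = np.array_equal(clf.predict(X_toy), restored.predict(X_toy))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.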