![Cartoon of telecom customers](IMG_8811.png)


The telecommunications (telecom) sector in India is changing rapidly, with new telecom businesses being created and many customers switching between providers. "Churn" refers to customers or subscribers ceasing to use a company's services or products. Understanding the factors that influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve service quality and customer satisfaction. As the data scientist on this project, you aim to explore how customer behavior and demographics in the Indian telecom sector predict customer churn, using two datasets from four major telecom partners: Airtel, Reliance Jio, Vodafone, and BSNL:

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                      |
|----------------------|--------------------------------------------------|
| `customer_id`          | Unique identifier for each customer.             |
| `telecom_partner`      | The telecom partner associated with the customer.|
| `gender`               | The gender of the customer.                      |
| `age`                  | The age of the customer.                         |
| `state`                | The Indian state in which the customer is located.|
| `city`                 | The city in which the customer is located.       |
| `pincode`              | The pincode of the customer's location.          |
| `registration_event` | When the customer registered with the telecom partner.|
| `num_dependents`      | The number of dependents (e.g., children) the customer has.|
| `estimated_salary`     | The customer's estimated salary.                 |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


In [244]:
# Import libraries and methods/functions
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the demographics and usage datasets
telecom_demographics_df = pd.read_csv('telecom_demographics.csv')
telecom_usage_df = pd.read_csv('telecom_usage.csv')

In [245]:
# Join demographics and usage on the shared customer identifier
churn_df = telecom_demographics_df.merge(telecom_usage_df, on='customer_id')
churn_df

Unnamed: 0,customer_id,telecom_partner,gender,age,state,city,pincode,registration_event,num_dependents,estimated_salary,calls_made,sms_sent,data_used,churn
0,15169,Airtel,F,26,Himachal Pradesh,Delhi,667173,2020-03-16,4,85979,75,21,4532,1
1,149207,Airtel,F,74,Uttarakhand,Hyderabad,313997,2022-01-16,0,69445,35,38,723,1
2,148119,Airtel,F,54,Jharkhand,Chennai,549925,2022-01-11,2,75949,70,47,4688,1
3,187288,Reliance Jio,M,29,Bihar,Hyderabad,230636,2022-07-26,3,34272,95,32,10241,1
4,14016,Vodafone,M,45,Nagaland,Bangalore,188036,2020-03-11,4,34157,66,23,5246,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6495,78836,Airtel,M,54,Odisha,Chennai,125785,2021-01-29,4,124805,-2,39,5000,0
6496,146521,BSNL,M,69,Andhra Pradesh,Hyderabad,923076,2022-01-03,1,65605,20,31,3562,0
6497,40413,Airtel,M,19,Gujarat,Hyderabad,152201,2020-07-21,0,28632,73,14,65,0
6498,64961,Vodafone,M,26,Meghalaya,Chennai,782127,2020-11-21,3,119757,52,8,6835,0
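
The merge above relies on `customer_id` being unique in both files and on pandas' default inner join, which silently drops customers that appear in only one file. A minimal sketch of how both assumptions could be checked, using toy frames rather than the real CSVs (the `demo` and `usage` frames and their values are made up for illustration):

```python
import pandas as pd

# Toy frames standing in for the two CSV files (values are illustrative only).
demo = pd.DataFrame({"customer_id": [1, 2, 3], "age": [26, 74, 54]})
usage = pd.DataFrame({"customer_id": [1, 2, 4], "churn": [1, 0, 0]})

# validate="one_to_one" raises pandas.errors.MergeError if customer_id
# repeats on either side; indicator=True adds a _merge column that records
# whether each row came from the left frame, the right frame, or both.
merged = demo.merge(
    usage,
    on="customer_id",
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# Rows that an inner join would have dropped.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["customer_id", "_merge"]])
```

If `unmatched` is empty on the real data, the inner join used in the notebook loses nothing.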


In [246]:
churn_df.isna().sum()

customer_id           0
telecom_partner       0
gender                0
age                   0
state                 0
city                  0
pincode               0
registration_event    0
num_dependents        0
estimated_salary      0
calls_made            0
sms_sent              0
data_used             0
churn                 0
dtype: int64
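
No values are missing, but `isna()` does not catch implausible values: row 6495 in the preview above has `calls_made` equal to -2. A short sketch of a range check on the usage columns, run here on toy rows (the clipping step is only one possible treatment and is an assumption, not something the notebook does):

```python
import pandas as pd

# Toy rows; the -2 mirrors the calls_made value visible in row 6495 above.
toy = pd.DataFrame({
    "calls_made": [75, 35, -2],
    "sms_sent": [21, 38, 39],
    "data_used": [4532, 723, 5000],
})

usage_cols = ["calls_made", "sms_sent", "data_used"]

# Count negative entries per usage column.
negative_counts = (toy[usage_cols] < 0).sum()
print(negative_counts)

# One possible treatment (an assumption): clip negative counts to zero.
cleaned = toy[usage_cols].clip(lower=0)
```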

In [247]:
churn_df['city'].value_counts()

Delhi        1128
Chennai      1093
Mumbai       1090
Hyderabad    1068
Bangalore    1066
Kolkata      1055
Name: city, dtype: int64

In [249]:
#churned_df = churn_df
#encoder = OneHotEncoder(sparse = False)
#encoded = encoder.fit_transform(churn_df[['telecom_partner','gender','state','city' , 'registration_event']])
#encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['telecom_partner','gender','state','city' , 'registration_event']))
#churned_df.drop(['telecom_partner','gender','state','city'],axis = 1 ,inplace = True)
#churned_df = pd.concat([churned_df,encoded_df] , axis = 1)
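# One-hot encode the categorical columns. Encoding `registration_event` creates
# one column per distinct date, which is why the resulting frame is so wide.
churn_df2 = pd.get_dummies(churn_df, columns=['telecom_partner', 'gender', 'state', 'city', 'registration_event'])
# Drop the target and the identifier so that only features remain
churn_df2 = churn_df2.drop(['churn', 'customer_id'], axis=1)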
churn_df2

Unnamed: 0,age,pincode,num_dependents,estimated_salary,calls_made,sms_sent,data_used,telecom_partner_Airtel,telecom_partner_BSNL,telecom_partner_Reliance Jio,telecom_partner_Vodafone,gender_F,gender_M,state_Andhra Pradesh,state_Arunachal Pradesh,state_Assam,state_Bihar,state_Chhattisgarh,state_Goa,state_Gujarat,state_Haryana,state_Himachal Pradesh,state_Jharkhand,state_Karnataka,state_Kerala,state_Madhya Pradesh,state_Maharashtra,state_Manipur,state_Meghalaya,state_Mizoram,state_Nagaland,state_Odisha,state_Punjab,state_Rajasthan,state_Sikkim,state_Tamil Nadu,state_Telangana,state_Tripura,state_Uttar Pradesh,state_Uttarakhand,...,registration_event_2023-03-25,registration_event_2023-03-26,registration_event_2023-03-27,registration_event_2023-03-28,registration_event_2023-03-29,registration_event_2023-03-30,registration_event_2023-03-31,registration_event_2023-04-01,registration_event_2023-04-02,registration_event_2023-04-03,registration_event_2023-04-04,registration_event_2023-04-05,registration_event_2023-04-06,registration_event_2023-04-07,registration_event_2023-04-08,registration_event_2023-04-09,registration_event_2023-04-10,registration_event_2023-04-11,registration_event_2023-04-12,registration_event_2023-04-13,registration_event_2023-04-14,registration_event_2023-04-15,registration_event_2023-04-16,registration_event_2023-04-17,registration_event_2023-04-18,registration_event_2023-04-19,registration_event_2023-04-20,registration_event_2023-04-21,registration_event_2023-04-22,registration_event_2023-04-23,registration_event_2023-04-24,registration_event_2023-04-25,registration_event_2023-04-26,registration_event_2023-04-27,registration_event_2023-04-28,registration_event_2023-04-29,registration_event_2023-04-30,registration_event_2023-05-01,registration_event_2023-05-02,registration_event_2023-05-03
0,26,667173,4,85979,75,21,4532,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,74,313997,0,69445,35,38,723,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,54,549925,2,75949,70,47,4688,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,29,230636,3,34272,95,32,10241,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,45,188036,4,34157,66,23,5246,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6495,54,125785,4,124805,-2,39,5000,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6496,69,923076,1,65605,20,31,3562,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6497,19,152201,0,28632,73,14,65,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6498,26,782127,3,119757,52,8,6835,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
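
One-hot encoding `registration_event` yields a separate, almost always zero column for every registration date. A possible alternative, sketched here on toy dates, is to parse the column as a datetime and derive a few numeric features. The feature names and the `reference_date` are assumptions made for illustration; the reference date is simply the latest date visible in the column names above.

```python
import pandas as pd

# Toy dates in the same format as the registration_event column.
toy = pd.DataFrame({"registration_event": ["2020-03-16", "2022-01-16", "2023-05-03"]})
dates = pd.to_datetime(toy["registration_event"])

# Assumption: measure tenure relative to the latest registration date shown above.
reference_date = pd.Timestamp("2023-05-03")

features = pd.DataFrame({
    "registration_year": dates.dt.year,
    "registration_month": dates.dt.month,
    "tenure_days": (reference_date - dates).dt.days,
})
print(features)
```

This replaces the date dummies with three columns, at the cost of assuming that churn depends on when a customer registered rather than on the exact calendar day.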


In [250]:
#X = churned_df
#y = churned_df['churn']

In [251]:
#print(y.value_counts(normalize=True))


In [252]:
#X.drop(['customer_id' , 'age' ,'pincode' , 'registration_event'], axis = 1 , inplace = True)

In [253]:
#X

In [254]:
#numerical_cols = churn_df.select_dtypes(include=['int64', 'float64']).drop(['customer_id' , 'churn'], axis=1).column
#numerical_cols = churned_df.select_dtypes(include=['int64', 'float64']).columns
#numerical_cols

In [255]:
# Standardize every feature column (zero mean, unit variance)
ss = StandardScaler()
numerical_scaled = ss.fit_transform(churn_df2)
numerical_scaled

array([[-1.22296979,  0.45493603,  1.43653887, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [ 1.6963038 , -0.90419473, -1.41134604, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [ 0.47993981,  0.00372937,  0.01259641, ..., -0.04117252,
        -0.03283419, -0.02774568],
       ...,
       [-1.64869719, -1.5268359 , -1.41134604, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [-1.22296979,  0.89731466,  0.72456764, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [-1.64869719,  1.42210486,  0.72456764, ..., -0.04117252,
        -0.03283419, -0.02774568]])

In [256]:
X = numerical_scaled
#X.drop('churn' , axis = 1 , inplace = True)
X

array([[-1.22296979,  0.45493603,  1.43653887, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [ 1.6963038 , -0.90419473, -1.41134604, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [ 0.47993981,  0.00372937,  0.01259641, ..., -0.04117252,
        -0.03283419, -0.02774568],
       ...,
       [-1.64869719, -1.5268359 , -1.41134604, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [-1.22296979,  0.89731466,  0.72456764, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [-1.64869719,  1.42210486,  0.72456764, ..., -0.04117252,
        -0.03283419, -0.02774568]])

In [257]:
y = churn_df['churn']

In [258]:
X_train , X_test , y_train , y_test = train_test_split(X , y , random_state = 42 , test_size = 0.2)
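
In this notebook the scaler is fitted on all rows before the split, so the test rows contribute to the means and variances used for scaling. A minimal sketch of the alternative, in which the scaler is fitted on the training rows only by placing it in a scikit-learn `Pipeline`, is shown below on synthetic data (the `X_toy` and `y_toy` arrays are made up, and `stratify` is an addition that keeps the class ratio equal in both splits):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; NOT the telecom dataset.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```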

In [259]:
# Fit three classifiers on the same split and compare their test accuracy
LogReg = LogisticRegression()
RC = RidgeClassifier()
RFC2 = RandomForestClassifier()

LogReg.fit(X_train, y_train)
LogRegy_pred = LogReg.predict(X_test)
LogRegAcc = accuracy_score(y_test, LogRegy_pred)

RC.fit(X_train, y_train)
RCy_pred = RC.predict(X_test)
RCAcc = accuracy_score(y_test, RCy_pred)

# RandomForestClassifier is left unseeded, so its accuracy can vary slightly between runs
RFC2.fit(X_train, y_train)
RFC2y_pred = RFC2.predict(X_test)
RFC2Acc = accuracy_score(y_test, RFC2y_pred)

print(LogRegAcc, RCAcc, RFC2Acc)

0.7292307692307692 0.7323076923076923 0.7892307692307692


In [260]:
print(confusion_matrix(y_test , LogRegy_pred) ,"sep" , confusion_matrix(y_test , RCy_pred) , "sep" , confusion_matrix(y_test ,RFC2y_pred))

[[920 107]
 [245  28]] sep [[927 100]
 [248  25]] sep [[1026    1]
 [ 273    0]]
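
The accuracies printed earlier can be recomputed from these confusion matrices, and doing so puts them in context. The sketch below copies the three matrices from the output above and derives accuracy and churn recall for each model. It also computes the accuracy of always predicting "not churned", a baseline the notebook itself does not calculate.

```python
import numpy as np

# Confusion matrices copied from the output above
# (rows = true class, columns = predicted class; class 1 = churned).
matrices = {
    "LogisticRegression": np.array([[920, 107], [245, 28]]),
    "RidgeClassifier": np.array([[927, 100], [248, 25]]),
    "RandomForest": np.array([[1026, 1], [273, 0]]),
}

summary = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / cm.sum()
    churn_recall = tp / (tp + fn)  # share of actual churners that were flagged
    summary[name] = (accuracy, churn_recall)
    print(f"{name}: accuracy={accuracy:.4f}, churn recall={churn_recall:.4f}")

# Accuracy of always predicting class 0: 1027 of the 1300 test rows are non-churners.
majority_baseline = (920 + 107) / 1300
print(f"majority baseline: {majority_baseline:.4f}")
```

The random forest's accuracy comes almost entirely from predicting class 0 for nearly every row: it identifies none of the 273 churners, and its accuracy is slightly below the majority baseline.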


In [261]:
print(classification_report(y_test , LogRegy_pred) ,"sep" , classification_report(y_test , RCy_pred) , "sep" , classification_report(y_test ,RFC2y_pred))

              precision    recall  f1-score   support

           0       0.79      0.90      0.84      1027
           1       0.21      0.10      0.14       273

    accuracy                           0.73      1300
   macro avg       0.50      0.50      0.49      1300
weighted avg       0.67      0.73      0.69      1300
 sep               precision    recall  f1-score   support

           0       0.79      0.90      0.84      1027
           1       0.20      0.09      0.13       273

    accuracy                           0.73      1300
   macro avg       0.49      0.50      0.48      1300
weighted avg       0.67      0.73      0.69      1300
 sep               precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.39      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300
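
Recall for the churned class is at most 0.10 across the three reports, and the support column shows that roughly 21% of the test rows are churners. One commonly used option for imbalanced classes is `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below compares it with the default on synthetic data generated by `make_classification`. It is an illustration under made-up data, not a result for the telecom dataset, and it may not help if the features carry little signal.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly an 80% / 20% class split; NOT the telecom data.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Same model twice; the only difference is the class weighting.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(f"minority recall, default weights:  {recall_plain:.2f}")
print(f"minority recall, balanced weights: {recall_balanced:.2f}")
```

Balanced weights typically raise minority-class recall at the cost of more false positives, so precision and recall should be compared together.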



In [262]:
logreg_pred = LogRegy_pred
rf_pred = RFC2y_pred
target = churn_df['churn']
features_scaled = numerical_scaled
higher_accuracy = "RandomForest"
print("numerical_scaled shape:", numerical_scaled.shape)
features_scaled

numerical_scaled shape: (6500, 1263)


array([[-1.22296979,  0.45493603,  1.43653887, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [ 1.6963038 , -0.90419473, -1.41134604, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [ 0.47993981,  0.00372937,  0.01259641, ..., -0.04117252,
        -0.03283419, -0.02774568],
       ...,
       [-1.64869719, -1.5268359 , -1.41134604, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [-1.22296979,  0.89731466,  0.72456764, ..., -0.04117252,
        -0.03283419, -0.02774568],
       [-1.64869719,  1.42210486,  0.72456764, ..., -0.04117252,
        -0.03283419, -0.02774568]])