India's telecommunications (telecom) sector is changing rapidly: new telecom businesses keep entering the market, and many customers switch between providers. "Churn" refers to customers or subscribers who stop using a company's services or products. Understanding the factors that influence customer retention is crucial for telecom companies that want to predict churn and improve their service quality and customer satisfaction. As the data scientist on this project, you aim to explore how customer behavior and demographics in the Indian telecom sector relate to churn, using two datasets that cover four major telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL):

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                                  |
|----------------------|--------------------------------------------------------------|
| `customer_id`        | Unique identifier for each customer.                         |
| `telecom_partner`    | The telecom partner associated with the customer.            |
| `gender`             | The gender of the customer.                                  |
| `age`                | The age of the customer.                                     |
| `state`              | The Indian state in which the customer is located.           |
| `city`               | The city in which the customer is located.                   |
| `pincode`            | The pincode of the customer's location.                      |
| `registration_event` | When the customer registered with the telecom partner.       |
| `num_dependents`     | The number of dependents (e.g., children) the customer has.  |
| `estimated_salary`   | The customer's estimated salary.                             |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


In [51]:
# Import libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


In [52]:
demographics_df = pd.read_csv('telecom_demographics.csv')
usage_df = pd.read_csv('telecom_usage.csv')

In [53]:
print(demographics_df.head())
print(usage_df.head())

   customer_id telecom_partner  ... num_dependents  estimated_salary
0        15169          Airtel  ...              4             85979
1       149207          Airtel  ...              0             69445
2       148119          Airtel  ...              2             75949
3       187288    Reliance Jio  ...              3             34272
4        14016        Vodafone  ...              4             34157

[5 rows x 10 columns]
   customer_id  calls_made  sms_sent  data_used  churn
0        15169          75        21       4532      1
1       149207          35        38        723      1
2       148119          70        47       4688      1
3       187288          95        32      10241      1
4        14016          66        23       5246      1


In [54]:
# Inner join (the pd.merge default) on the shared customer_id key
churn_df = pd.merge(demographics_df, usage_df, on='customer_id')
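
The merge above assumes each `customer_id` appears exactly once in both files. A minimal sketch of how that assumption could be checked, using tiny stand-in frames with made-up values (the names `demo`, `usage`, `merged`, and `unmatched` are illustrative and not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-in frames that mirror the key column of the two CSV files
demo = pd.DataFrame({"customer_id": [1, 2, 3], "age": [25, 40, 31]})
usage = pd.DataFrame({"customer_id": [1, 2, 3], "churn": [0, 1, 0]})

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats on
# either side; indicator=True adds a _merge column recording each row's origin
merged = pd.merge(
    demo, usage, on="customer_id", how="outer",
    validate="one_to_one", indicator=True,
)

# Rows present in only one of the two frames
unmatched = merged[merged["_merge"] != "both"]
print(f"rows: {len(merged)}, unmatched: {len(unmatched)}")
```

An outer join is used here only so that unmatched keys become visible; the notebook's inner join would silently drop them.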

In [59]:
# Save the merged DataFrame to a CSV file
churn_df.to_csv('merged_telecom_data.csv', index=False)


In [60]:
print(churn_df.head())
print(churn_df.columns)

   customer_id telecom_partner gender  ...  sms_sent data_used churn
0        15169          Airtel      F  ...        21      4532     1
1       149207          Airtel      F  ...        38       723     1
2       148119          Airtel      F  ...        47      4688     1
3       187288    Reliance Jio      M  ...        32     10241     1
4        14016        Vodafone      M  ...        23      5246     1

[5 rows x 14 columns]
Index(['customer_id', 'telecom_partner', 'gender', 'age', 'state', 'city',
       'pincode', 'registration_event', 'num_dependents', 'estimated_salary',
       'calls_made', 'sms_sent', 'data_used', 'churn'],
      dtype='object')


In [61]:
print(churn_df.isnull().sum())

customer_id           0
telecom_partner       0
gender                0
age                   0
state                 0
city                  0
pincode               0
registration_event    0
num_dependents        0
estimated_salary      0
calls_made            0
sms_sent              0
data_used             0
churn                 0
dtype: int64


In [62]:
value_counts = churn_df['churn'].value_counts()
print(value_counts)

count_of_ones = value_counts.get(1, 0) 

proportion_of_ones = count_of_ones / len(churn_df)

print(f"Proportion of 1's: {proportion_of_ones:.2f}")


0    5197
1    1303
Name: churn, dtype: int64
Proportion of 1's: 0.20
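
With roughly one churner in five, a purely random train/test split can shift the class ratio between the two partitions. One common option, shown here as a sketch on synthetic data rather than the notebook's own split, is the `stratify` argument of `train_test_split` (`X_demo`, `y_demo`, and the other names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positives, similar to the churn rate above
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the class proportions (approximately) equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"train churn rate: {y_tr.mean():.3f}")
print(f"test churn rate:  {y_te.mean():.3f}")
```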


In [63]:
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6500 entries, 0 to 6499
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   telecom_partner     6500 non-null   object
 2   gender              6500 non-null   object
 3   age                 6500 non-null   int64 
 4   state               6500 non-null   object
 5   city                6500 non-null   object
 6   pincode             6500 non-null   int64 
 7   registration_event  6500 non-null   object
 8   num_dependents      6500 non-null   int64 
 9   estimated_salary    6500 non-null   int64 
 10  calls_made          6500 non-null   int64 
 11  sms_sent            6500 non-null   int64 
 12  data_used           6500 non-null   int64 
 13  churn               6500 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 761.7+ KB


In [64]:
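categorical_columns = ['gender', 'telecom_partner', 'state', 'city', 'registration_event']

# Convert categorical columns to dummy/indicator variables.
# 'city' and 'registration_event' have many distinct values, so each of them
# expands into a large number of indicator columns.
churn_df_encoded = pd.get_dummies(churn_df, columns=categorical_columns)

print(churn_df_encoded)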

      customer_id  ...  registration_event_2023-05-03
0           15169  ...                              0
1          149207  ...                              0
2          148119  ...                              0
3          187288  ...                              0
4           14016  ...                              0
...           ...  ...                            ...
6495        78836  ...                              0
6496       146521  ...                              0
6497        40413  ...                              0
6498        64961  ...                              0
6499        60427  ...                              0

[6500 rows x 1265 columns]
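
Most of the 1265 columns come from treating every distinct registration date as its own category. As an alternative that this notebook does not use, the date could be parsed and reduced to a few numeric parts before encoding. A minimal sketch on made-up rows, assuming the dates are ISO-formatted strings as the column names in the output suggest (`demo`, `reg_year`, and `reg_month` are hypothetical names):

```python
import pandas as pd

# Hypothetical rows that mimic three of the categorical columns
demo = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "telecom_partner": ["Airtel", "Reliance Jio", "Vodafone", "BSNL"],
    "registration_event": ["2020-03-16", "2022-01-16", "2022-01-11", "2023-05-03"],
})

# Parse the date strings and keep two numeric parts instead of one column per date
reg = pd.to_datetime(demo["registration_event"])
demo["reg_year"] = reg.dt.year
demo["reg_month"] = reg.dt.month

# One-hot encode only the low-cardinality columns
encoded = pd.get_dummies(
    demo.drop(columns="registration_event"),
    columns=["gender", "telecom_partner"],
)
print(encoded.shape)
print(list(encoded.columns))
```

Whether year and month carry any signal for churn is an open question; the sketch only shows how the width of the feature matrix could be kept small.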


In [66]:
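# Note: without parentheses, `info` is not called; the cell only displays the
# bound method's repr. Use churn_df_encoded.info() for the usual summary.
churn_df_encoded.info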

<bound method DataFrame.info of       customer_id  ...  registration_event_2023-05-03
0           15169  ...                              0
1          149207  ...                              0
2          148119  ...                              0
3          187288  ...                              0
4           14016  ...                              0
...           ...  ...                            ...
6495        78836  ...                              0
6496       146521  ...                              0
6497        40413  ...                              0
6498        64961  ...                              0
6499        60427  ...                              0

[6500 rows x 1265 columns]>

In [67]:
print(churn_df_encoded.columns[20:])  # Print the column names from index 20 onward
print(len(churn_df_encoded.columns))

Index(['state_Goa', 'state_Gujarat', 'state_Haryana', 'state_Himachal Pradesh',
       'state_Jharkhand', 'state_Karnataka', 'state_Kerala',
       'state_Madhya Pradesh', 'state_Maharashtra', 'state_Manipur',
       ...
       'registration_event_2023-04-24', 'registration_event_2023-04-25',
       'registration_event_2023-04-26', 'registration_event_2023-04-27',
       'registration_event_2023-04-28', 'registration_event_2023-04-29',
       'registration_event_2023-04-30', 'registration_event_2023-05-01',
       'registration_event_2023-05-02', 'registration_event_2023-05-03'],
      dtype='object', length=1245)
1265


In [68]:
X = churn_df_encoded.drop(['customer_id', 'churn'], axis=1)
y = churn_df_encoded['churn']

In [69]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6500 entries, 0 to 6499
Columns: 1263 entries, age to registration_event_2023-05-03
dtypes: int64(7), uint8(1256)
memory usage: 8.2 MB


In [70]:
y.info()

<class 'pandas.core.series.Series'>
Int64Index: 6500 entries, 0 to 6499
Series name: churn
Non-Null Count  Dtype
--------------  -----
6500 non-null   int64
dtypes: int64(1)
memory usage: 101.6 KB


In [71]:
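# Standardize the features (note: the scaler is fitted on all rows here,
# before the train/test split)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)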

In [72]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
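
In the cells above, the scaler is fitted on every row before the split, so test-set statistics influence the scaling of the training data. A frequently used alternative is to split first and wrap the scaler and model in a scikit-learn `Pipeline`, which fits the scaler on the training rows only. The sketch below swaps in that technique on synthetic data; all names in it (`X_demo`, `pipe`, `fitted_scaler`, and so on) are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and a binary target loosely tied to the first feature
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(500, 4))
y_demo = (X_demo[:, 0] + rng.normal(size=500) > 5.0).astype(int)

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then reuses it to transform X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
print(f"rows seen by the scaler: {fitted_scaler.n_samples_seen_}")
print(f"test accuracy: {pipe.score(X_te, y_te):.2f}")
```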

In [73]:

logistic_model = LogisticRegression(max_iter=1000)
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

In [74]:
logistic_model.fit(X_train, y_train)

In [75]:
random_forest_model.fit(X_train, y_train)

In [76]:
logreg_pred = logistic_model.predict(X_test)

In [77]:
rf_pred = random_forest_model.predict(X_test)

In [78]:
logistic_confusion_matrix = confusion_matrix(y_test, logreg_pred)
random_forest_confusion_matrix = confusion_matrix(y_test, rf_pred)

In [79]:

print("Confusion Matrix for Logistic Regression:")
print(logistic_confusion_matrix)

print("\nConfusion Matrix for Random Forest:")
print(random_forest_confusion_matrix)

Confusion Matrix for Logistic Regression:
[[920 107]
 [245  28]]

Confusion Matrix for Random Forest:
[[1026    1]
 [ 273    0]]
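
To make the matrices easier to read: scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]`, with rows for the true class and columns for the predicted class. The sketch below recomputes accuracy and the churn-class precision and recall by hand from the numbers printed above (the helper `summarize` and the dictionary names are illustrative only):

```python
# Confusion matrices copied from the output above, laid out as [[TN, FP], [FN, TP]]
matrices = {
    "LogisticRegression": [[920, 107], [245, 28]],
    "RandomForest": [[1026, 1], [273, 0]],
}


def summarize(cm):
    """Return (accuracy, precision, recall) for the positive (churn) class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


results = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, prec, rec) in results.items():
    print(f"{name}: accuracy={acc:.2f}, churn precision={prec:.2f}, churn recall={rec:.2f}")
```

Read this way, the random forest identifies none of the 273 churners in the test set, and logistic regression identifies 28 of them.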


In [80]:
print('classification report for Logistic Regression')
logistic_classification_report = classification_report(y_test, logreg_pred)
print(logistic_classification_report)
print('classification report for Random Forest')
random_forest_classification_report = classification_report(y_test, rf_pred)
print(random_forest_classification_report)


classification report for Logistic Regression
              precision    recall  f1-score   support

           0       0.79      0.90      0.84      1027
           1       0.21      0.10      0.14       273

    accuracy                           0.73      1300
   macro avg       0.50      0.50      0.49      1300
weighted avg       0.67      0.73      0.69      1300

classification report for Random Forest
              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.39      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300



In [81]:
from sklearn.metrics import accuracy_score

In [84]:
logistic_accuracy = accuracy_score(y_test, logreg_pred)
random_forest_accuracy = accuracy_score(y_test, rf_pred)

# Record which model has the higher test accuracy (a tie goes to RandomForest)
if logistic_accuracy > random_forest_accuracy:
    higher_accuracy = "LogisticRegression"
else:
    higher_accuracy = "RandomForest"
print(f'The higher accuracy model is {higher_accuracy}')

The higher accuracy model is RandomForest
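
Accuracy alone can be misleading on imbalanced data: with 1027 of the 1300 test customers not churning, a model that always predicts "no churn" would already score about 0.79. The sketch below rebuilds label arrays from the confusion-matrix counts reported earlier (row order does not affect these metrics) and compares both models with that constant baseline. The helper `labels_from_cm` and the variable names are hypothetical:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score


def labels_from_cm(tn, fp, fn, tp):
    """Rebuild (y_true, y_pred) arrays that reproduce the given confusion counts."""
    y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
    y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)
    return y_true, y_pred


# Counts taken from the confusion matrices printed earlier in the notebook
y_true, lr_pred_demo = labels_from_cm(920, 107, 245, 28)
_, rf_pred_demo = labels_from_cm(1026, 1, 273, 0)

# Constant baseline: predict "no churn" for every customer
baseline_pred = np.zeros_like(y_true)

scores = {}
for name, pred in [
    ("LogisticRegression", lr_pred_demo),
    ("RandomForest", rf_pred_demo),
    ("Always predict 0", baseline_pred),
]:
    scores[name] = (
        accuracy_score(y_true, pred),
        balanced_accuracy_score(y_true, pred),
        f1_score(y_true, pred, zero_division=0),
    )
    acc, bal, f1 = scores[name]
    print(f"{name}: accuracy={acc:.2f}, balanced accuracy={bal:.2f}, churn F1={f1:.2f}")
```

On these counts the random forest is nearly indistinguishable from the constant baseline, so a metric such as balanced accuracy or the churn-class F1 may be a more informative basis for comparison than raw accuracy.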
