![Cartoon of telecom customers](IMG_8811.png)


The telecommunications (telecom) sector in India is changing rapidly: new telecom businesses keep appearing, and many customers switch between providers. "Churn" refers to customers or subscribers who stop using a company's services or products. Understanding the factors that influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve service quality and customer satisfaction. As the data scientist on this project, you aim to explore how customer behavior and demographics in the Indian telecom sector relate to churn, using two datasets from four major telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL):

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                                  |
|----------------------|--------------------------------------------------------------|
| `customer_id`        | Unique identifier for each customer.                         |
| `telecom_partner`    | The telecom partner associated with the customer.            |
| `gender`             | The gender of the customer.                                  |
| `age`                | The age of the customer.                                     |
| `state`              | The Indian state in which the customer is located.           |
| `city`               | The city in which the customer is located.                   |
| `pincode`            | The pincode of the customer's location.                      |
| `registration_event` | When the customer registered with the telecom partner.       |
| `num_dependents`     | The number of dependents (e.g., children) the customer has.  |
| `estimated_salary`   | The customer's estimated salary.                             |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


In [1]:
# Import libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Start your code here!

## Step 1: Load the CSV files

In [2]:
# Load the telecom demographics and usage data
telecom_demographics = pd.read_csv('telecom_demographics.csv')
telecom_usage = pd.read_csv('telecom_usage.csv')

# Display the first few rows and info of the DataFrames
print("Telecom Demographics Data:")
print(telecom_demographics.head())
print(telecom_demographics.info())

print("\nTelecom Usage Data:")
print(telecom_usage.head())
print(telecom_usage.info())

Telecom Demographics Data:
   customer_id telecom_partner  ... num_dependents  estimated_salary
0        15169          Airtel  ...              4             85979
1       149207          Airtel  ...              0             69445
2       148119          Airtel  ...              2             75949
3       187288    Reliance Jio  ...              3             34272
4        14016        Vodafone  ...              4             34157

[5 rows x 10 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6500 entries, 0 to 6499
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   telecom_partner     6500 non-null   object
 2   gender              6500 non-null   object
 3   age                 6500 non-null   int64 
 4   state               6500 non-null   object
 5   city                6500 non-null   object
 6   pincode             6500 non-null   int64 

In [3]:
# Merge the two DataFrames on 'customer_id'
churn_df = pd.merge(telecom_demographics, telecom_usage, on='customer_id')

# Display the first few rows of the merged DataFrame
print(churn_df.head())

   customer_id telecom_partner gender  ...  sms_sent data_used churn
0        15169          Airtel      F  ...        21      4532     1
1       149207          Airtel      F  ...        38       723     1
2       148119          Airtel      F  ...        47      4688     1
3       187288    Reliance Jio      M  ...        32     10241     1
4        14016        Vodafone      M  ...        23      5246     1

[5 rows x 14 columns]
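
The default inner join silently drops customers that appear in only one file and multiplies rows if `customer_id` is duplicated. A minimal sketch of how the merge could be checked, using made-up toy frames (`demo` and `usage` are hypothetical stand-ins, not the project data):

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up for illustration)
demo = pd.DataFrame({"customer_id": [1, 2, 3], "age": [30, 41, 25]})
usage = pd.DataFrame({"customer_id": [1, 2, 4], "calls_made": [10, 7, 3]})

# validate= raises pandas.errors.MergeError if customer_id repeats on either side
inner = pd.merge(demo, usage, on="customer_id", validate="one_to_one")

# An outer join with indicator=True adds a "_merge" column showing the origin of each row
audit = pd.merge(demo, usage, on="customer_id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(inner))
print(unmatched[["customer_id", "_merge"]])
```

In this notebook both files have 6,500 rows and the merged frame is used directly, so a check like this would only confirm that nothing was lost.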


## Step 2: Calculate and print the churn rate


In [4]:
# Calculate churn rate
churn_rate = churn_df['churn'].mean()
print(f'Churn Rate: {churn_rate:.2f}')

Churn Rate: 0.20
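
A churn rate of about 0.20 means a model that always predicts "no churn" is already right about 80% of the time, so accuracy has to be read against that baseline. A small sketch with synthetic labels (the arrays are made up to mirror the 20% rate, and `DummyClassifier` is used here only as a reference point):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with the roughly 20% churn rate reported above (made-up data)
y = np.array([1] * 20 + [0] * 80)
X = np.zeros((len(y), 1))  # this baseline ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)

print(f"Churn rate: {y.mean():.2f}")
print(f"Majority-class accuracy: {baseline_accuracy:.2f}")
```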


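## Steps 3 and 4: Identify and one-hot encode categorical variables
Categorical variables hold non-numeric data that must be converted to a numeric format, for example with one-hot encoding, before modeling. In this dataset they are `telecom_partner`, `gender`, `state`, and `city`. `registration_event` is also stored as text, but it represents a date, so the cell below converts it to a day count instead of one-hot encoding it.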

In [5]:
from sklearn.preprocessing import OneHotEncoder

# Identify categorical features
categorical_features = ['telecom_partner', 'gender', 'state', 'city']

# Convert 'registration_event' to datetime and then to numerical value (e.g., number of days since the earliest date)
churn_df['registration_event'] = pd.to_datetime(churn_df['registration_event'])
churn_df['registration_event'] = (churn_df['registration_event'] - churn_df['registration_event'].min()).dt.days

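# One-hot encode the categorical features, dropping the first level of each to avoid redundant columns
# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`, and 1.4 removes `sparse`
one_hot_encoder = OneHotEncoder(sparse=False, drop='first')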
categorical_encoded = one_hot_encoder.fit_transform(churn_df[categorical_features])

# Convert the encoded categorical features into a DataFrame
categorical_encoded_df = pd.DataFrame(categorical_encoded, columns=one_hot_encoder.get_feature_names_out(categorical_features))

# Drop the original categorical columns and concatenate the encoded columns
churn_df = churn_df.drop(columns=categorical_features)
churn_df = pd.concat([churn_df, categorical_encoded_df], axis=1)
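
The cell above fits the encoder on the full DataFrame and stitches the result back in by hand. An alternative, sketched here as an assumption and not what this notebook does, is to declare the preprocessing once in a `ColumnTransformer`. With `handle_unknown="ignore"`, a category that was absent at fit time encodes to all zeros instead of raising an error. The toy frame and the new row are made up:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a few of the notebook's column names (values are illustrative)
toy = pd.DataFrame({
    "telecom_partner": ["Airtel", "Vodafone", "BSNL", "Airtel"],
    "gender": ["F", "M", "M", "F"],
    "age": [25, 40, 33, 58],
})

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["telecom_partner", "gender"]),
        ("num", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocess.fit_transform(toy)

# A partner that was not present when the encoder was fitted
new_row = pd.DataFrame({"telecom_partner": ["Reliance Jio"], "gender": ["F"], "age": [30]})
new_encoded = preprocess.transform(new_row)

print(encoded.shape)
print(new_encoded)
```

The first three output columns correspond to the three partners seen during fitting, so they are all zero for the unseen partner.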

## Step 5: Feature scaling

In [6]:
# Define the features to be scaled
features_to_scale = ['age', 'num_dependents', 'estimated_salary', 'calls_made', 'sms_sent', 'data_used']

# Initialize the StandardScaler
scaler = StandardScaler()

# Scale the features
scaled_features = scaler.fit_transform(churn_df[features_to_scale])

# Convert the scaled features into a DataFrame
scaled_features_df = pd.DataFrame(scaled_features, columns=features_to_scale)

# Drop the original columns and concatenate the scaled columns
churn_df = churn_df.drop(columns=features_to_scale)
churn_df = pd.concat([churn_df, scaled_features_df], axis=1)
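
Here the scaler is fitted on every row before the train/test split, so the test rows contribute to the means and standard deviations. A common alternative, shown as a hedged sketch on random data (not the telecom data), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and labels, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse those statistics

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean zero by construction; the test columns are close to zero but not exactly, which is expected.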

## Step 6: Define features and target variable

In [7]:
# Define the target variable
target = 'churn'

# Define the feature set
features = churn_df.drop(columns=[target])

# Extract the target column as a Series
target_variable = churn_df[target]

## Step 7: Split the data into training and testing sets

In [8]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target_variable, test_size=0.2, random_state=42)
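
With only about 20% churners, a purely random split can leave the two sets with slightly different churn rates. Passing `stratify=` keeps the class proportions the same in both. A sketch on made-up labels (the notebook itself does not stratify):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 200 churners and 800 non-churners (illustrative only)
y = np.array([1] * 200 + [0] * 800)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train churn rate: {y_train.mean():.3f}")
print(f"Test churn rate:  {y_test.mean():.3f}")
```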

## Step 8: Train Logistic Regression and Random Forest models

In [9]:
from sklearn.linear_model import LogisticRegression

# Initialize the models
logreg_model = LogisticRegression(random_state=42)
rf_model = RandomForestClassifier(random_state=42)

# Train the models
logreg_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

# Predict on the test data
logreg_pred = logreg_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
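
Because churners are the minority class (Step 2), a classifier trained with default settings can score well on accuracy while rarely predicting churn. One common mitigation, not used in this notebook, is `class_weight="balanced"`. The sketch below uses synthetic data from `make_classification` as an assumption; whether it helps on the telecom data depends on how much signal the features carry:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 80% class 0), not the telecom dataset
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

plain = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)
weighted = LogisticRegression(
    max_iter=1000, class_weight="balanced", random_state=42
).fit(X_train, y_train)

# Recall on the minority class: share of actual positives that were caught
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

print(f"Recall, default weights:  {recall_plain:.2f}")
print(f"Recall, balanced weights: {recall_weighted:.2f}")
```

Balanced weights usually trade some precision for recall, so the classification report is the place to check both.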

## Step 9: Assess the models on test data

In [10]:
from sklearn.metrics import accuracy_score

# Calculate accuracy scores
logreg_accuracy = accuracy_score(y_test, logreg_pred)
rf_accuracy = accuracy_score(y_test, rf_pred)

# Determine which model has higher accuracy (a tie falls through to "RandomForest")
higher_accuracy = "LogisticRegression" if logreg_accuracy > rf_accuracy else "RandomForest"

# Print the accuracy scores and the model with higher accuracy
print(f'Logistic Regression Accuracy: {logreg_accuracy:.2f}')
print(f'Random Forest Accuracy: {rf_accuracy:.2f}')
print(f'Higher Accuracy Model: {higher_accuracy}')

# Print classification reports
print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, logreg_pred))

print("\nRandom Forest Classification Report:")
print(classification_report(y_test, rf_pred))

# Print confusion matrices
print("\nLogistic Regression Confusion Matrix:")
print(confusion_matrix(y_test, logreg_pred))

print("\nRandom Forest Confusion Matrix:")
print(confusion_matrix(y_test, rf_pred))

Logistic Regression Accuracy: 0.79
Random Forest Accuracy: 0.79
Higher Accuracy Model: RandomForest

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.40      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300


Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.40      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300


Logistic Regression Confusion Matrix:
[[1027    0]
 [ 273    0]]

Random Forest Confusion Matrix:
[[1027    0]
 [ 273    0]]
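
Both confusion matrices have an empty second column: neither model predicted churn for a single test customer. The arithmetic below, using only the numbers printed above, shows that the 0.79 accuracy is exactly the share of non-churners in the test set, while recall for the churn class is zero:

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[1027, 0],
               [273, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
recall_churn = tp / (tp + fn)          # share of actual churners that were caught
majority_share = (tn + fp) / cm.sum()  # share of non-churners in the test set

print(f"Accuracy:       {accuracy:.2f}")
print(f"Churn recall:   {recall_churn:.2f}")
print(f"Majority share: {majority_share:.2f}")
```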


# Their solution
(I like mine better)

In [11]:
# Import required libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
# OneHotEncoder is not needed if using pd.get_dummies()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load data
telco_demog = pd.read_csv('telecom_demographics.csv')
telco_usage = pd.read_csv('telecom_usage.csv')

# Join data
churn_df = telco_demog.merge(telco_usage, on='customer_id')

# Identify churn rate
churn_rate = churn_df['churn'].value_counts() / len(churn_df)
print(churn_rate)

# Identify categorical variables
print(churn_df.info())

# One Hot Encoding for categorical variables
churn_df = pd.get_dummies(churn_df, columns=['telecom_partner', 'gender', 'state', 'city', 'registration_event'])

# Feature Scaling
scaler = StandardScaler()

# 'customer_id' is not a feature
features = churn_df.drop(['customer_id', 'churn'], axis=1)
features_scaled = scaler.fit_transform(features)

# Target variable
target = churn_df['churn']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)

# Instantiate the Logistic Regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

# Logistic Regression predictions
logreg_pred = logreg.predict(X_test)

# Logistic Regression evaluation
print(confusion_matrix(y_test, logreg_pred))
print(classification_report(y_test, logreg_pred))

# Instantiate the Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Random Forest predictions
rf_pred = rf.predict(X_test)

# Random Forest evaluation
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))

# Which accuracy score is higher? LogisticRegression or RandomForest
higher_accuracy = "RandomForest"

0    0.799538
1    0.200462
Name: churn, dtype: float64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6500 entries, 0 to 6499
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   telecom_partner     6500 non-null   object
 2   gender              6500 non-null   object
 3   age                 6500 non-null   int64 
 4   state               6500 non-null   object
 5   city                6500 non-null   object
 6   pincode             6500 non-null   int64 
 7   registration_event  6500 non-null   object
 8   num_dependents      6500 non-null   int64 
 9   estimated_salary    6500 non-null   int64 
 10  calls_made          6500 non-null   int64 
 11  sms_sent            6500 non-null   int64 
 12  data_used           6500 non-null   int64 
 13  churn               6500 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 761.7+ KB
None
[[920 107]
 

## Comparison Summary
My Code:
- Merges data using `pd.merge`.
- One-hot encodes `telecom_partner`, `gender`, `state`, and `city` with `OneHotEncoder` (`drop='first'`).
- Converts `registration_event` to a numerical value by calculating the number of days since the earliest date.
- Scales six numeric columns with `StandardScaler`; `customer_id`, `pincode`, and the `registration_event` day count stay in the feature set unscaled.
- Splits data into training and testing sets using `train_test_split`.
- Trains both `LogisticRegression` and `RandomForestClassifier` models and compares their accuracy with `accuracy_score`.
- Uses `classification_report` and `confusion_matrix` for model evaluation.

Proposed Solution:
- Merges data using `DataFrame.merge`.
- Uses `pd.get_dummies` for one-hot encoding instead of `OneHotEncoder`, and also encodes `registration_event` as a categorical column.
- Drops `customer_id` from the features and scales every remaining feature using `StandardScaler`.
- Splits data into training and testing sets using `train_test_split`.
- Trains both `LogisticRegression` and `RandomForestClassifier` models, but sets `higher_accuracy` by hand instead of computing it.
- Uses `classification_report` and `confusion_matrix` for model evaluation.
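
The two treatments of `registration_event` produce very different features. A small sketch on made-up date strings (the values are hypothetical) shows that `pd.get_dummies` creates one column per distinct date, while the day-count conversion keeps a single ordered numeric column:

```python
import pandas as pd

# Made-up registration dates, stored as strings like the raw column
toy = pd.DataFrame(
    {"registration_event": ["2020-01-01", "2020-01-15", "2021-06-30", "2022-03-10"]}
)

# Treating the date as a category: one indicator column per distinct value
dummies = pd.get_dummies(toy, columns=["registration_event"])

# Treating the date as a number: days elapsed since the earliest registration
dates = pd.to_datetime(toy["registration_event"])
days_since_first = (dates - dates.min()).dt.days

print(dummies.shape)
print(days_since_first.tolist())
```

On the full dataset the one-hot version can add as many columns as there are distinct registration dates, which is worth keeping in mind when comparing the two solutions.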