![Cartoon of telecom customers](IMG_8811.png)


The telecommunications (telecom) sector in India is changing rapidly: new telecom businesses keep entering the market, and many customers switch between providers. "Churn" refers to customers or subscribers who stop using a company's services or products. Understanding which factors influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve service quality and customer satisfaction. As the data scientist on this project, you aim to explore how customer behavior and demographics in the Indian telecom sector relate to churn, using two datasets from four major telecom partners: Airtel, Reliance Jio, Vodafone, and BSNL:

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                                  |
|----------------------|--------------------------------------------------------------|
| `customer_id`        | Unique identifier for each customer.                         |
| `telecom_partner`    | The telecom partner associated with the customer.            |
| `gender`             | The gender of the customer.                                  |
| `age`                | The age of the customer.                                     |
| `state`              | The Indian state in which the customer is located.           |
| `city`               | The city in which the customer is located.                   |
| `pincode`            | The pincode of the customer's location.                      |
| `registration_event` | When the customer registered with the telecom partner.       |
| `num_dependents`     | The number of dependents (e.g., children) the customer has.  |
| `estimated_salary`   | The customer's estimated salary.                             |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


In [64]:
# Import libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


# Load the Data

In [65]:
# Load data
telco_demog = pd.read_csv('telecom_demographics.csv')
telco_usage = pd.read_csv('telecom_usage.csv')


# Join the Data and Calculate Churn Rate

In this section, we will perform the following steps:

1. **Join the Data**: Merge the demographic and usage data on the `customer_id` column to create a unified DataFrame named `churn_df`.
2. **Identify Churn Rate**: Calculate the churn rate by determining the proportion of customers who have churned. This is done by using the `value_counts` method on the `churn` column and dividing by the total number of records in the DataFrame.

The code snippet below demonstrates these steps:

In [66]:
# Join data
churn_df = telco_demog.merge(telco_usage, on='customer_id')

# Identify churn rate
churn_rate = churn_df['churn'].value_counts() / len(churn_df)
print(churn_rate)



0    0.799538
1    0.200462
Name: churn, dtype: float64
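
As a small cross-check of the two steps above, the sketch below uses two tiny made-up frames (`toy_demog` and `toy_usage` are hypothetical stand-ins, not the real CSV files). It assumes each `customer_id` appears once per file: `validate="one_to_one"` asks pandas to raise an error if that is not the case, and `value_counts(normalize=True)` gives the same proportions as dividing the counts by the number of rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the two CSV files
toy_demog = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                          "age": [25, 41, 33, 58, 29]})
toy_usage = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                          "churn": [0, 0, 1, 0, 0]})

# validate="one_to_one" raises MergeError if customer_id repeats in either frame
toy_df = toy_demog.merge(toy_usage, on="customer_id", validate="one_to_one")

# Two equivalent ways to get the proportion of each class
manual_rate = toy_df["churn"].value_counts() / len(toy_df)
normalized_rate = toy_df["churn"].value_counts(normalize=True)

print(normalized_rate)
```

In this toy example one of five customers churned, so class `1` gets a proportion of 0.2 under both calculations.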


# Identify Categorical Variables and Perform One-Hot Encoding

In this section, we will identify the categorical variables in our unified DataFrame `churn_df` and perform one-hot encoding on them. One-hot encoding replaces each categorical variable with a set of binary indicator columns, one per category, so that algorithms that require numeric input can use the information.

1. **Identify Categorical Variables**: We will use the `info()` method to get a concise summary of the DataFrame, which includes the data type of each column. Columns with the `object` dtype hold text values and are the ones we treat as categorical.

2. **One-Hot Encoding**: We will use the `pd.get_dummies()` function to perform one-hot encoding on the identified categorical variables. This function will create new binary columns for each category in the specified columns.

The code snippet below demonstrates these steps:

In [67]:
# Identify categorical variables
print(churn_df.info())

# One Hot Encoding for categorical variables
churn_df = pd.get_dummies(churn_df, columns=['telecom_partner', 'gender', 'state', 'city', 'registration_event'])

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6500 entries, 0 to 6499
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   telecom_partner     6500 non-null   object
 2   gender              6500 non-null   object
 3   age                 6500 non-null   int64 
 4   state               6500 non-null   object
 5   city                6500 non-null   object
 6   pincode             6500 non-null   int64 
 7   registration_event  6500 non-null   object
 8   num_dependents      6500 non-null   int64 
 9   estimated_salary    6500 non-null   int64 
 10  calls_made          6500 non-null   int64 
 11  sms_sent            6500 non-null   int64 
 12  data_used           6500 non-null   int64 
 13  churn               6500 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 761.7+ KB
None
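
To make the effect of `pd.get_dummies()` concrete, here is a minimal sketch on a made-up frame (the column names mirror the dataset, but the values are hypothetical). It also counts the distinct values per categorical column with `nunique()`, since each distinct value becomes its own column; for high-cardinality columns such as `city` or `registration_event` this can widen the DataFrame considerably.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "city": ["Delhi", "Mumbai", "Chennai", "Delhi"],
    "age": [25, 41, 33, 58],
})

# Non-numeric columns are the categorical candidates
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Distinct values per column = number of indicator columns it will produce
cardinality = toy[categorical_cols].nunique()
print(cardinality)

# Each categorical column is replaced by one binary column per category
encoded = pd.get_dummies(toy, columns=categorical_cols)
print(encoded.columns.tolist())
```

Here `gender` contributes two columns and `city` three, so the three original columns become six.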


# Feature Scaling

Feature scaling is an important preprocessing step for algorithms that are sensitive to feature magnitudes, such as k-nearest neighbors (KNN), support vector machines (SVM), and regularized linear models like the logistic regression used below. Tree-based models such as random forests are largely unaffected by it. Scaling puts all features on a common scale so that features with large numeric ranges (for example, `estimated_salary`) do not dominate those with small ranges (for example, `num_dependents`).

In this section, we will use the `StandardScaler` from the `sklearn.preprocessing` module to standardize our features. Standardization rescales each feature to have a mean of zero and a standard deviation of one; it does not change the shape of the feature's distribution.

1. **Initialize the Scaler**: We will create an instance of the `StandardScaler` class.
2. **Select Features**: We will drop the `customer_id` and `churn` columns from our DataFrame `churn_df` as they are not features. The `customer_id` is an identifier, and `churn` is the target variable.
3. **Fit and Transform**: We will fit the scaler to our features and transform them to the standardized scale.

The code snippet below demonstrates these steps:

In [68]:
# Feature Scaling
scaler = StandardScaler()

# 'customer_id' is an identifier and 'churn' is the target, so neither is a feature
features = churn_df.drop(['customer_id', 'churn'], axis=1)
features_scaled = scaler.fit_transform(features)
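
What `StandardScaler` computes can be sketched on a small made-up matrix (the numbers below are hypothetical). Each column is transformed as z = (x - mean) / std, where the standard deviation is the population one (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with columns on very different scales
X = np.array([[25.0, 30000.0],
              [41.0, 85000.0],
              [33.0, 52000.0],
              [58.0, 120000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: subtract the column mean, divide by the column std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```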

# Target Variable and Dataset Splitting

After scaling our features, the next step is to define our target variable and split the dataset into training and testing sets. This is essential for evaluating the performance of our machine learning model.

1. **Target Variable**: The target variable, which we aim to predict, is the `churn` column from our DataFrame `churn_df`.
2. **Splitting the Dataset**: We will use the `train_test_split` function from the `sklearn.model_selection` module to split our dataset. This function will divide the scaled features and the target variable into training and testing sets. We will allocate 80% of the data for training and 20% for testing. Setting a `random_state` ensures reproducibility of the split.

The code snippet below demonstrates these steps:

In [69]:
# Target variable
target = churn_df['churn']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)
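
One possible variation on this split, sketched on synthetic data (the arrays `X` and `y` below are made up): passing `stratify` keeps the churn ratio identical in both subsets, and fitting the `StandardScaler` on the training rows only keeps test-set statistics out of the preprocessing. This illustrates the idea under those assumptions; it is not the approach used in this notebook and would not reproduce the outputs shown here exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical data: 1000 rows, 3 features, exactly 20% positives
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

# stratify=y keeps the 80/20 class ratio in both the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())
```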

# Logistic Regression Model

Logistic Regression is a popular and widely used machine learning algorithm for binary classification problems. It models the probability that a given input point belongs to a certain class. In this section, we will:

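1. **Instantiate the Logistic Regression Model**: We will create an instance of the `LogisticRegression` class from the `sklearn.linear_model` module. A `random_state` is passed for consistency with the other models; it only affects the `sag`, `saga`, and `liblinear` solvers, so with the default `lbfgs` solver the fit is already deterministic.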
2. **Fit the Model**: Using the training data (`X_train` and `y_train`), we will train the logistic regression model.
3. **Make Predictions**: We will use the trained model to make predictions on the test data (`X_test`).
4. **Evaluate the Model**: To assess the performance of our logistic regression model, we will use a confusion matrix and a classification report. These metrics will help us understand how well our model is performing in terms of precision, recall, F1-score, and overall accuracy.

The code snippet below demonstrates these steps:

In [70]:
# Instantiate the Logistic Regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

# Logistic Regression predictions
logreg_pred = logreg.predict(X_test)

# Logistic Regression evaluation
print(confusion_matrix(y_test, logreg_pred))
print(classification_report(y_test, logreg_pred))

[[920 107]
 [245  28]]
              precision    recall  f1-score   support

           0       0.79      0.90      0.84      1027
           1       0.21      0.10      0.14       273

    accuracy                           0.73      1300
   macro avg       0.50      0.50      0.49      1300
weighted avg       0.67      0.73      0.69      1300
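
The class 1 row of the report can be recomputed by hand from the confusion matrix printed above. The sketch below copies those four counts and assumes scikit-learn's layout, in which rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = np.array([[920, 107],
               [245, 28]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of the predicted churners, how many really churned
recall_1 = tp / (tp + fn)      # of the actual churners, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

The rounded values agree with the report: precision 0.21, recall 0.10, F1-score 0.14, and accuracy 0.73.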



# Random Forest Model

Random Forest is an ensemble learning method that trains many decision trees on bootstrap samples of the data and combines their predictions. In the classic formulation the predicted class is the majority vote of the individual trees; scikit-learn's implementation instead averages the trees' predicted class probabilities. The method is known for its robustness and its ability to handle large, high-dimensional datasets. In this section, we will:

1. **Instantiate the Random Forest Model**: We will create an instance of the `RandomForestClassifier` class from the `sklearn.ensemble` module. Setting a `random_state` ensures the reproducibility of our results.
2. **Fit the Model**: Using the training data (`X_train` and `y_train`), we will train the random forest model.
3. **Make Predictions**: We will use the trained model to make predictions on the test data (`X_test`).
4. **Evaluate the Model**: To assess the performance of our random forest model, we will use a confusion matrix and a classification report. These metrics will help us understand how well our model is performing in terms of precision, recall, F1-score, and overall accuracy.

The code snippet below demonstrates these steps:

In [71]:
# Instantiate the Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Random Forest predictions
rf_pred = rf.predict(X_test)

# Random Forest evaluation
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))

[[1026    1]
 [ 273    0]]
              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.39      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300



### Comparison

| Metric                        | Random Forest | Logistic Regression |
|-------------------------------|---------------|---------------------|
| Accuracy                      | 0.79          | 0.73                |
| F1-score, class 0 (majority)  | 0.88          | 0.84                |
| F1-score, class 1 (minority)  | 0.00          | 0.14                |
| Macro average F1-score        | 0.44          | 0.49                |
| Weighted average F1-score     | 0.70          | 0.69                |

### Interpretation

- **Accuracy:** The random forest model has a higher accuracy (0.79) than the logistic regression model (0.73). However, accuracy is misleading on this imbalanced dataset: 1,027 of the 1,300 test customers did not churn, so a model that always predicts class 0 would also score about 0.79.

- **Class 0 Performance:** Both models perform well for the majority class (class 0), but the random forest model performs slightly better with a higher F1-score (0.88 vs. 0.84).

- **Class 1 Performance:** The logistic regression model performs better for the minority class (class 1), with a non-zero F1-score (0.14 vs. 0.00). It identifies 28 of the 273 churners in the test set (recall 0.10), whereas the random forest identifies none. Both results are poor for a churn model, whose purpose is to find the customers who leave.

- **Macro and Weighted Average:** The logistic regression model has a slightly better macro average F1-score (0.49 vs. 0.44) but a slightly lower weighted average F1-score (0.69 vs. 0.70).
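
The averaged scores, and a majority-class baseline for comparison, can be recomputed from the two confusion matrices reported above. In this sketch `labels_from_confusion` is a hypothetical helper that rebuilds label arrays matching each matrix; the baseline that always predicts class 0 is added only as a reference point and is not a model from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def labels_from_confusion(cm):
    """Rebuild (y_true, y_pred) arrays that reproduce a 2x2 confusion matrix."""
    y_true, y_pred = [], []
    for actual in (0, 1):
        for predicted in (0, 1):
            count = cm[actual][predicted]
            y_true += [actual] * count
            y_pred += [predicted] * count
    return np.array(y_true), np.array(y_pred)

def summarize(y_true, y_pred):
    """Accuracy plus macro- and weighted-average F1 for one set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }

# Confusion matrices copied from the outputs above
matrices = {
    "LogisticRegression": [[920, 107], [245, 28]],
    "RandomForest": [[1026, 1], [273, 0]],
}

results = {name: summarize(*labels_from_confusion(cm)) for name, cm in matrices.items()}

# Reference baseline: always predict the majority class (0) for the same test set
y_true = np.array([0] * 1027 + [1] * 273)
results["AlwaysPredict0"] = summarize(y_true, np.zeros_like(y_true))

for name, scores in results.items():
    print(name, {metric: round(float(value), 2) for metric, value in scores.items()})
```

The macro average weights both classes equally, so it penalizes a model that ignores churners; the weighted average follows the class sizes and therefore stays close to the majority-class F1-score.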

# Conclusion
With respect to accuracy, the random forest model is the winner (0.79 vs. 0.73 for logistic regression). However, because the dataset is imbalanced, that accuracy comes almost entirely from predicting the majority class: the random forest finds none of the churners in the test set. Metrics such as recall and F1-score for the minority class should therefore also be taken into account for a comprehensive evaluation.

In [72]:
# Which accuracy score is higher? Logistic or RandomForest
higher_accuracy = "RandomForest"
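
The hard-coded answer can also be derived rather than typed in. The sketch below is self-contained, so it recomputes accuracy from the two confusion matrices reported earlier; inside the notebook, the same comparison could presumably be made with `accuracy_score(y_test, logreg_pred)` and `accuracy_score(y_test, rf_pred)`.

```python
import numpy as np

# Confusion matrices copied from the outputs earlier in the notebook
confusion = {
    "Logistic": np.array([[920, 107], [245, 28]]),
    "RandomForest": np.array([[1026, 1], [273, 0]]),
}

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy = {name: np.trace(cm) / cm.sum() for name, cm in confusion.items()}

# Pick the model name with the largest accuracy
higher_accuracy = max(accuracy, key=accuracy.get)
print(higher_accuracy)
```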