# Logistic Regression
---

### Objectives:

- Use Logistic Regression for classification

- Preprocess data for modeling

- Implement Logistic Regression on real-world data

### Installs:

In [0]:
%%capture
%pip install numpy==2.4.0
%pip install pandas==2.3.3
%pip install scikit-learn==1.8.0
%pip install matplotlib==3.10.8
%pip install seaborn==0.13.0

In [0]:
# Command to restart the kernel and update the installed libraries
%restart_python

### Imports:

In [0]:
# Data Analysis and Visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Data Modeling / Linear Model / Metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score, roc_auc_score, classification_report, ConfusionMatrixDisplay

# Warnings
import warnings
warnings.filterwarnings('ignore')

### Load the data

In [0]:
df = pd.read_csv('./data/ChurnData.csv')

### Verify successful load with some randomly selected records


In [0]:
df.sample(9)

### Understand the data
---

#### Scenario

A telecommunications company is concerned about the number of customers migrating from its fixed-line telephone services to cable TV competitors. They need to understand who is most likely to leave the company.

#### Loading Telecommunications Churn Data

Telecommunications Churn is a hypothetical data file that describes a telecommunications company's efforts to reduce customer churn. Each case corresponds to a different customer and records various demographic and service usage information.

#### About the Dataset

This is a historical customer dataset, where each row represents a customer. It is typically less expensive to retain customers than to acquire new ones; therefore, the focus of this analysis is to predict which customers will remain with the company.

This dataset provides information about customer preferences, chosen services, personal details, etc., which helps predict customer churn.

### Explore the data
First, consider a statistical summary of the data.

In [0]:
df.describe()

In [0]:
df.info()

### Note:

To keep this project simple, only some of the dataset's variables will be used. As the selection criterion, I will keep the variables with the strongest correlation to the target variable, `churn`.

### Checking the correlations between the variables

In [0]:
df.corr()['churn'].abs().sort_values(ascending = False)

In [0]:
plt.rc('font', size = 8)
fig, ax = plt.subplots(figsize = (14, 10))
sns.heatmap(df.corr(), annot = True, cmap = 'coolwarm', linewidths = 0.5, ax = ax)
ax.set_title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

In [0]:
# Drop features with low correlation to the target variable

df = df.drop(columns = ['income', 'confer', 'logtoll', 'tollten', 'callwait', 'custcat', 'tollmon'])
df.head()
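
The column list above is hard-coded. As a minimal sketch of the same selection criterion, the low-correlation columns could also be picked programmatically. The data below is synthetic, and the column names (`strong`, `noise`, `target`) and the `threshold` of 0.1 are assumptions made only for illustration; they do not come from `ChurnData.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the churn dataset)
rng = np.random.default_rng(0)
n = 5000
signal = rng.normal(size=n)
demo = pd.DataFrame({
    'strong': signal + rng.normal(scale=0.5, size=n),  # related to the target
    'noise': rng.normal(size=n),                       # unrelated to the target
    'target': (signal > 0).astype(int),
})

threshold = 0.1  # assumed cutoff, chosen only for this illustration

# Absolute correlation of every feature with the target
abs_corr = demo.corr()['target'].abs().drop('target')

# Columns whose absolute correlation falls below the cutoff
low_corr_cols = abs_corr[abs_corr < threshold].index.tolist()
demo_selected = demo.drop(columns=low_corr_cols)

print(abs_corr.sort_values(ascending=False))
print('Dropped:', low_corr_cols)
```

A threshold makes the criterion explicit and reproducible, though the right cutoff depends on the dataset and is a judgment call.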

### Extract the input features and labels from the data set
Extract the required columns and convert the resulting dataframes to NumPy arrays.


In [0]:
# Dataset shape
df.shape

In [0]:
df['churn'] = df['churn'].astype('int')
X = df.iloc[:, 0: 20].to_numpy()
y = df.iloc[:, 20].to_numpy()

# X and y still hold the full dataset here; the train/test split comes later
print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')
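
Positional slicing with `iloc` relies on `churn` being the last column. A hedged alternative, sketched on a tiny made-up frame (`demo` and its values are hypothetical), selects the features and the label by name so the extraction does not depend on column order:

```python
import pandas as pd

# Tiny hypothetical frame; in the notebook this role is played by `df`
demo = pd.DataFrame({
    'tenure': [11, 33, 23],
    'age': [33, 33, 30],
    'churn': [1.0, 1.0, 0.0],
})
demo['churn'] = demo['churn'].astype('int')

# Everything except the label becomes the feature matrix
X_demo = demo.drop(columns='churn').to_numpy()
y_demo = demo['churn'].to_numpy()

# Keep the feature names for later use (e.g. labelling coefficients)
feature_names = demo.columns.drop('churn')

print(X_demo.shape, y_demo.shape, list(feature_names))
```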


### Preprocess selected features and train-test split

Create train and test datasets

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 33)

The next step is to standardize the input variables so that the model does not inadvertently favor any variable due to its magnitude. The typical way to do this is to subtract the mean and divide by the standard deviation.

In [0]:
std_scaler = StandardScaler()
X_train = std_scaler.fit_transform(X_train)
X_test = std_scaler.transform(X_test)

In [0]:
pd.DataFrame(X_train).head()

In [0]:
pd.DataFrame(X_train).describe().round(2)

A standardized variable has zero mean and a standard deviation of one.
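
To make the "subtract the mean and divide by the standard deviation" step concrete, here is a small sketch on synthetic arrays (`train` and `test` are made-up data, not the churn features). It reproduces `StandardScaler` by hand and shows that the statistics come from the training set only, so the test set is close to, but not exactly, zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the train/test feature matrices
rng = np.random.default_rng(33)
train = rng.normal(loc=50, scale=10, size=(160, 3))
test = rng.normal(loc=50, scale=10, size=(40, 3))

scaler = StandardScaler().fit(train)

# Manual z-score using training statistics only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_manual = (train - mu) / sigma
test_manual = (test - mu) / sigma

print('train mean:', train_manual.mean(axis=0).round(2))
print('train std: ', train_manual.std(axis=0).round(2))
print('test mean: ', test_manual.mean(axis=0).round(2))
print('test std:  ', test_manual.std(axis=0).round(2))
```

Fitting the scaler on the training data alone, as the notebook does, keeps information about the test set out of the preprocessing step.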

### Build a Logistic Regression Classifier Model

In [0]:
# Create a model object
LR = LogisticRegression()

# Train the model on the training data
LR.fit(X_train, y_train)

In [0]:
# Predict the target variable in the test data
y_pred = LR.predict(X_test)
y_pred[: 10]

In [0]:
# Collecting the probabilities of the model classifications
y_prob = LR.predict_proba(X_test)
y_prob[ : 10]
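
For a binary problem, the two outputs above are closely related: the second column of `predict_proba` is the sigmoid of the linear score, and `predict` corresponds to thresholding that probability at 0.5. The sketch below checks this on synthetic data from `make_classification`; the names `clf`, `X_demo` and `y_demo` are illustrative and are not the notebook's objects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the churn dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=33)
clf = LogisticRegression().fit(X_demo, y_demo)

z = clf.decision_function(X_demo)        # linear score: w . x + b
p_positive = 1.0 / (1.0 + np.exp(-z))    # sigmoid maps the score to a probability

proba = clf.predict_proba(X_demo)        # columns: P(class 0), P(class 1)
labels = (proba[:, 1] > 0.5).astype(int) # default decision rule

print(np.allclose(p_positive, proba[:, 1]))
print(np.array_equal(labels, clf.predict(X_demo)))
```

Because the probabilities are available, a threshold other than 0.5 could be applied when missing a churner is costlier than a false alarm.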

### Model Evaluation

In [0]:
coefficients = pd.Series(LR.coef_[0], index = df.columns[: - 1])
coefficients

In [0]:
# Collect the data
data_ax = coefficients.sort_values().reset_index()
data_ax.columns = ['Feature', 'Coefficient']

# Create a category to color the bars automatically
# If > 0 the feature pushes toward churn (bad/red); otherwise it favors retention (good/teal)
data_ax['Impact'] = data_ax['Coefficient'].apply(lambda x: 'High Risk' if x > 0 else 'Low Risk')

# Figure
plt.figure(figsize = (12, 8))
sns.set_style('whitegrid')

# Barplot
sns.barplot(
    data = data_ax,
    y = 'Feature',
    x = 'Coefficient',
    edgecolor = 'black',
    hue = 'Impact',
    dodge = False,
    palette = {'High Risk': 'firebrick', 'Low Risk': 'teal'}
)

plt.axvline(x = 0, color = 'black', linestyle = '--', linewidth = 1)
plt.title('Feature Importance', fontsize = 15)
plt.xlabel('Coefficients', fontsize = 12)
plt.ylabel('Variables', fontsize = 12)
plt.legend(title = 'Type of Impact', loc = 'upper right', fontsize = 12)

plt.tight_layout()
plt.show()

In [0]:
print(classification_report(y_test, y_pred))

In [0]:
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, 
    display_labels = ['No-Churn', 'Churn'], 
    cmap = 'Blues', 
    values_format = 'd'
)

plt.grid(False) 
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

In [0]:
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob[:, 1]):.2f}')
print(f'Log loss: {log_loss(y_test, y_prob[:, 1]):.2f}')
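
The last metric printed above is computed with `log_loss`, which is the average negative log-likelihood (cross-entropy) of the true labels under the predicted probabilities. The sketch below recomputes it by hand on a few made-up values (`y_true` and `p_hat` are illustrative, not model output) and compares it with an uninformative baseline that predicts 0.5 for everyone.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 0, 1, 0])
p_hat = np.array([0.9, 0.2, 0.4, 0.6, 0.1])

# Binary cross-entropy computed directly from its definition
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# Baseline: predicting 0.5 for every sample gives ln(2)
baseline = log_loss(y_true, np.full(5, 0.5))

print(f'manual:   {manual:.4f}')
print(f'sklearn:  {log_loss(y_true, p_hat):.4f}')
print(f'baseline: {baseline:.4f}')
```

Lower is better, and a model is only useful on this metric if it scores below such a baseline.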

### Conclusion
---

- The Logistic Regression model shows solid predictive capacity on the test set. With an **Accuracy of 85%** and an **Area Under the ROC Curve (AUC-ROC) of 0.84**, the classifier discriminates far better than chance (an AUC of 0.5) between retained customers and those prone to cancellation (*churn*).

- The analysis of the coefficients reveals clear vectors of influence on consumer behavior, divided into risk factors and protective factors:

- The variables `wiremon` (+0.69) and `cardmon` (+0.44) show the highest positive coefficients. This indicates that higher monthly charges for wireless services and calling cards are associated with a higher predicted probability of cancellation, which suggests price sensitivity in these specific services.

- The variables `callcard` (-0.74) and `tenure` (-0.65) act as strong protective factors. Relationship longevity (*tenure*) and use of the calling card service are negatively associated with churn: customers engaged in these products demonstrate greater loyalty.

- Demographic variables such as `ed` (Education) and obsolete services such as `pager` showed coefficients close to zero, indicating that they contribute little to the model's predictions.

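- The final metric, **0.40**, is computed with `log_loss`, so it is a log loss (cross-entropy) rather than an RMSE. It is well below the ln 2 ≈ 0.69 that an uninformative prediction of 0.5 for every customer would score, which suggests the estimated probabilities are reasonably informative about the actual classes.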
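
One way to read the coefficients quoted in the bullets above is to exponentiate them into odds ratios. Because the features were standardized, each ratio is the multiplicative change in the odds of churn for a one-standard-deviation increase in that feature, holding the others fixed. This is a minimal sketch using only the four values quoted above; `quoted` is an illustrative name, not a notebook variable.

```python
import numpy as np
import pandas as pd

# Coefficients as quoted in the conclusion (standardized features)
quoted = pd.Series({
    'wiremon': 0.69,
    'cardmon': 0.44,
    'callcard': -0.74,
    'tenure': -0.65,
})

# exp(coefficient) = odds ratio per one-standard-deviation increase
odds_ratios = np.exp(quoted)

print(odds_ratios.round(2))
```

A ratio above 1 raises the odds of churn and a ratio below 1 lowers them; these are associations within the model, not causal effects.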