## **Introduction to Logistic Regression**
### 1. Short History of Logistic Regression

The logistic function was introduced in the **19th century** by **Pierre-François Verhulst** to model population growth.

In the **early 20th century**, statisticians realized that the same function could model **binary outcomes** (yes/no, success/failure).

By the **1940s–1950s**, logistic regression was formally developed as a statistical method, especially in **biostatistics and social sciences**, to model probabilities of events that have only two possible outcomes.

**Today**, logistic regression is one of the foundational algorithms in machine learning, widely used for problems such as:

- **Churn prediction**
- **Fraud detection**
- **Medical diagnosis**

**Because it is**:
- **Interpretable**
- **Probabilistic** 
- **Mathematically well-grounded**

---

### 2. What is Logistic Regression?

**Logistic regression** is a **supervised learning algorithm** used for **binary classification**.

**Its goal**: Model the probability that an outcome belongs to the **positive class** (e.g., customer churns).

**Key difference**: Instead of predicting a class label directly, logistic regression predicts a **probability**, which is then converted into a class decision using a **threshold** (commonly **0.5**).

---

### 3. Why the Linear Model Output is Not a Probability

Logistic regression starts with a **linear model**, just like linear regression:

$z = 0.8x - 1.2$


**Key issue**:

**Linear model outputs**: `z ∈ (-∞, +∞)` (any real number)

**Probabilities must satisfy**: `0 ≤ p ≤ 1`

**Therefore**:
- Linear output `z` **cannot** be interpreted as a probability
- We need a transformation that maps all real numbers into the interval `(0,1)`
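A quick numeric check makes the problem concrete. The sketch below evaluates the example line `z = 0.8x - 1.2` at a few values of `x` and reports whether each output could be read as a probability. The input values are arbitrary and chosen only for illustration; they are not taken from the dataset.

```python
def linear_score(x):
    """Example linear model from the text: z = 0.8x - 1.2."""
    return 0.8 * x - 1.2

# Illustrative inputs (assumed, not from the dataset)
xs = [-5, 0, 2, 10]
scores = [linear_score(x) for x in xs]

for x, z in zip(xs, scores):
    is_valid = 0 <= z <= 1
    print(f"x = {x:>3}  z = {z:6.2f}  valid probability: {is_valid}")
```

Only one of the four scores lands inside `[0, 1]`; the others are negative or greater than 1.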

---

### 4. Log-Odds (Logit): Linking Linear Models to Probability

Instead of modeling probability directly, logistic regression models the **log-odds**, also called the **logit**.

**Odds**:

$$odds = \frac{p}{1 - p}$$


**Log-odds (logit)**:

$$\log\left(\frac{p}{1 - p}\right)$$


**Important property**:

- Log-odds range: (-∞, +∞)
- Linear model range: (-∞, +∞)
  
✓ The two ranges match, so a linear model can predict log-odds without ever producing an invalid value.
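To see how odds and log-odds behave, the following sketch computes both for a handful of probabilities. The helper names `odds` and `logit` are illustrative and not part of the notebook.

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 < p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds: the natural log of the odds."""
    return math.log(odds(p))

for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    print(f"p = {p:4.2f}  odds = {odds(p):8.4f}  log-odds = {logit(p):8.4f}")
```

A probability of 0.5 corresponds to log-odds of 0, probabilities below 0.5 give negative log-odds, and the values grow without bound as `p` approaches 0 or 1.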


---

### 5. Logistic Regression Model (Example)

**Logistic regression fits a linear model to the log-odds**:

$$\log\left(\frac{p}{1-p}\right) = 0.8x - 1.2$$



**Left-hand side**: Log-odds (logit) of churning  
**Right-hand side**: Linear function of the feature `x`

**This equation means**:
- Features influence the **log-odds of churn linearly**
- **Not** the probability directly

---

### 6. Converting Log-Odds into a Valid Probability

To recover the probability `p`, we solve the log-odds equation.

This leads to the **logistic (sigmoid) function**:

$$p = \frac{1}{1 + e^{-(0.8x - 1.2)}}$$
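The algebra behind this step can be written out explicitly. Writing $z = 0.8x - 1.2$ for the linear part, exponentiate both sides of the log-odds equation and solve for $p$:

$$\log\left(\frac{p}{1-p}\right) = z \;\Rightarrow\; \frac{p}{1-p} = e^{z} \;\Rightarrow\; p = e^{z}(1 - p) \;\Rightarrow\; p\,(1 + e^{z}) = e^{z}$$

$$p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$

The last equality divides the numerator and denominator by $e^{z}$.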


**Properties of the logistic function**:
- Maps **any real number** to `(0,1)`
- Produces a **smooth, interpretable probability**
- **Ensures outputs are always valid probabilities**
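These properties can be checked numerically. The sketch below is a plain-Python version of the logistic function together with its inverse (the logit), showing that every score maps into `(0, 1)` and that applying the logit afterwards recovers the original score. The function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to (0, 1)."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: probability back to log-odds."""
    return math.log(p / (1 - p))

for z in [-10, -1.2, 0, 2.5, 10]:
    p = sigmoid(z)
    print(f"z = {z:6.2f}  p = {p:.6f}  logit(p) = {logit(p):6.2f}")
```

This naive form is fine for moderate scores; for very large negative `z`, `math.exp(-z)` can overflow, which is one reason libraries use numerically safer implementations.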

---

### 7. Final Interpretation

1. Linear model → z = 0.8x - 1.2 (log-odds score)

2. Logistic function → p = sigmoid(z) (probability)

3. Threshold → Class decision (churn if p > 0.5)


**Complete flow**:

Features → Linear Model → Log-odds → Sigmoid → Probability → Class
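The whole flow fits in a few lines. This is a minimal sketch using the example coefficients from this section (0.8 and -1.2), not a fitted model; `predict_churn` is a hypothetical helper name and the inputs are arbitrary.

```python
import math

def predict_churn(x, threshold=0.5):
    """Run one feature value through the full pipeline.

    Returns the log-odds score, the probability, and the class label.
    """
    z = 0.8 * x - 1.2               # linear model -> log-odds
    p = 1 / (1 + math.exp(-z))      # sigmoid -> probability
    label = int(p > threshold)      # threshold -> class (1 = churn)
    return z, p, label

for x in [0, 1, 2, 4]:
    z, p, label = predict_churn(x)
    print(f"x = {x}  z = {z:5.2f}  p = {p:.3f}  class = {label}")
```

With a threshold of 0.5, the decision `p > 0.5` is equivalent to `z > 0`, so for these coefficients the boundary sits at `x = 1.5`.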


**That's logistic regression!** 🎯


## **Data Preparation**

In [1]:
# import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('Telco-Customer-Churn.csv')
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
# convert column headers to lower case and replace space with _
# convert categorical column values to lower case and replace space with _

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

In [4]:
df.head().T

,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


In [5]:
# convert total charges to number and replace nulls with 0
df.totalcharges = pd.to_numeric(df.totalcharges, errors = 'coerce')
df.totalcharges = df.totalcharges.fillna(0)
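The effect of `errors='coerce'` is easiest to see on a tiny example. The values below are hypothetical and mimic what the column might look like after the cleaning step above, assuming the raw file contains blank strings for some customers (which the earlier replacement would have turned into `_`).

```python
import pandas as pd

# Hypothetical sample: two valid numbers and one non-numeric placeholder
raw = pd.Series(["29.85", "_", "108.15"])

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.isna().sum())   # number of values that could not be parsed

# Replace the resulting NaN values with 0, as in the cell above
filled = numeric.fillna(0)
print(filled.tolist())
```

Filling with 0 is a simple choice rather than the only one; whether it is appropriate depends on why the values are missing, which is worth checking against `tenure` before relying on it.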

In [6]:
# convert yes to 1 and no to 0 and convert the datatype to int
df.churn = (df.churn == 'yes').astype(int)

## **Setting up the Validation Framework**

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
# split the dataset into train, validation and test sets
df_full_train, df_test = train_test_split(df, test_size = 0.2, random_state = 1)
df_train, df_val = train_test_split(df_full_train, test_size = 0.25, random_state = 1)

print(len(df_train), len(df_val), len(df_test))

4225 1409 1409
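The second split uses `test_size = 0.25` because it is applied to the remaining 80% of the data: 0.25 × 0.8 = 0.2, which gives a 60/20/20 split overall. A small sketch on a toy frame of 1000 rows (an assumed stand-in for the real dataset) shows the arithmetic:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 1000 rows so the proportions are easy to read
toy = pd.DataFrame({"row": range(1000)})

# First split: hold out 20% for testing
full_train, test = train_test_split(toy, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% is 20% of the original data
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

print(len(train), len(val), len(test))
```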


In [9]:
# reset the index
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [10]:
# separate the target variables
y_train = df_train['churn'].values
y_val = df_val['churn'].values
y_test = df_test['churn'].values

In [11]:
# delete the target variable from the feature matrices
del df_train['churn']
del df_val['churn']
del df_test['churn']

In [12]:
df_full_train = df_full_train.reset_index(drop=True)

In [13]:
df_full_train.head()

,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,5442-pptjy,male,0,yes,yes,12,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.7,258.35,0
1,6261-rcvns,female,0,no,no,42,yes,no,dsl,yes,...,yes,yes,no,yes,one_year,no,credit_card_(automatic),73.9,3160.55,1
2,2176-osjuv,male,0,yes,no,71,yes,yes,dsl,yes,...,no,yes,no,no,two_year,no,bank_transfer_(automatic),65.15,4681.75,0
3,6161-erdgd,male,0,yes,yes,71,yes,yes,dsl,yes,...,yes,yes,yes,yes,one_year,no,electronic_check,85.45,6300.85,0
4,2364-ufrom,male,0,no,no,30,yes,no,dsl,yes,...,no,yes,yes,no,one_year,no,electronic_check,70.4,2044.75,0


In [14]:
# check for missing values
df_full_train.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [15]:
# check the proportion of churn vs no churn
df_full_train.churn.value_counts(normalize = True)

churn
0    0.730032
1    0.269968
Name: proportion, dtype: float64

In [16]:
global_churn_rate = round(df_full_train.churn.mean(),2)
print(global_churn_rate)

0.27


In [17]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']

In [18]:
df_full_train.columns

Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

In [19]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
        'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']

In [20]:
df_full_train[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## **Feature Importance: Churn rate and risk ratio**

Feature importance analysis is part of exploratory data analysis (EDA): it identifies which features affect the target variable. The measures used for categorical features are:

- Churn rate
- Risk ratio
- Mutual information

#### **Interpreting churn metrics**

**Difference (Group − Global):**

- If the value is less than 0, the group is less likely to churn than the global average.

- If the value is greater than 0, the group is more likely to churn than the global average.

**Risk Ratio (Group / Global):**

- If the risk ratio is greater than 1, the group is more likely to churn than the global population.

- If the risk ratio is less than 1, the group is less likely to churn than the global population.
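Both metrics can be worked through on a toy example. The data below is invented purely for illustration (eight customers, two contract types) so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Invented toy data: 4 monthly and 4 yearly customers
toy = pd.DataFrame({
    "contract": ["monthly"] * 4 + ["yearly"] * 4,
    "churn":    [1, 1, 1, 0, 0, 0, 0, 1],
})

global_rate = toy["churn"].mean()   # 4 churners out of 8 = 0.5

stats = toy.groupby("contract")["churn"].agg(["mean", "count"])
stats["diff"] = stats["mean"] - global_rate   # group rate minus global rate
stats["risk"] = stats["mean"] / global_rate   # group rate divided by global rate
print(stats)
```

Here the monthly group churns at 0.75, so its difference is +0.25 and its risk ratio is 1.5; the yearly group churns at 0.25, giving a difference of -0.25 and a risk ratio of 0.5.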

In [21]:
global_churn = df_full_train['churn'].mean()
print(global_churn)

0.26996805111821087


In [22]:
from IPython.display import display

In [23]:
for c in categorical:
    print(c)
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_churn
    df_group['risk'] = df_group['mean'] / global_churn
    display(df_group)
    print()
    print()

gender


gender,mean,count,diff,risk
female,0.276824,2796,0.006856,1.025396
male,0.263214,2838,-0.006755,0.97498




seniorcitizen


seniorcitizen,mean,count,diff,risk
0,0.24227,4722,-0.027698,0.897403
1,0.413377,912,0.143409,1.531208




partner


partner,mean,count,diff,risk
no,0.329809,2932,0.059841,1.221659
yes,0.205033,2702,-0.064935,0.759472




dependents


dependents,mean,count,diff,risk
no,0.31376,3968,0.043792,1.162212
yes,0.165666,1666,-0.104302,0.613651




phoneservice


phoneservice,mean,count,diff,risk
no,0.241316,547,-0.028652,0.89387
yes,0.273049,5087,0.003081,1.011412




multiplelines


multiplelines,mean,count,diff,risk
no,0.257407,2700,-0.012561,0.953474
no_phone_service,0.241316,547,-0.028652,0.89387
yes,0.290742,2387,0.020773,1.076948




internetservice


internetservice,mean,count,diff,risk
dsl,0.192347,1934,-0.077621,0.712482
fiber_optic,0.425171,2479,0.155203,1.574895
no,0.077805,1221,-0.192163,0.288201




onlinesecurity


onlinesecurity,mean,count,diff,risk
no,0.420921,2801,0.150953,1.559152
no_internet_service,0.077805,1221,-0.192163,0.288201
yes,0.153226,1612,-0.116742,0.56757




onlinebackup


onlinebackup,mean,count,diff,risk
no,0.404323,2498,0.134355,1.497672
no_internet_service,0.077805,1221,-0.192163,0.288201
yes,0.217232,1915,-0.052736,0.80466




deviceprotection


deviceprotection,mean,count,diff,risk
no,0.395875,2473,0.125907,1.466379
no_internet_service,0.077805,1221,-0.192163,0.288201
yes,0.230412,1940,-0.039556,0.85348




techsupport


techsupport,mean,count,diff,risk
no,0.418914,2781,0.148946,1.551717
no_internet_service,0.077805,1221,-0.192163,0.288201
yes,0.159926,1632,-0.110042,0.59239




streamingtv


streamingtv,mean,count,diff,risk
no,0.342832,2246,0.072864,1.269897
no_internet_service,0.077805,1221,-0.192163,0.288201
yes,0.302723,2167,0.032755,1.121328




streamingmovies


streamingmovies,mean,count,diff,risk
no,0.338906,2213,0.068938,1.255358
no_internet_service,0.077805,1221,-0.192163,0.288201
yes,0.307273,2200,0.037305,1.138182




contract


contract,mean,count,diff,risk
month-to-month,0.431701,3104,0.161733,1.599082
one_year,0.120573,1186,-0.149395,0.446621
two_year,0.028274,1344,-0.241694,0.10473




paperlessbilling


paperlessbilling,mean,count,diff,risk
no,0.172071,2313,-0.097897,0.637375
yes,0.338151,3321,0.068183,1.25256




paymentmethod


paymentmethod,mean,count,diff,risk
bank_transfer_(automatic),0.168171,1219,-0.101797,0.622928
credit_card_(automatic),0.164339,1217,-0.10563,0.608733
electronic_check,0.45589,1893,0.185922,1.688682
mailed_check,0.19387,1305,-0.076098,0.718121






## **Feature Importance: Mutual Information**

Mutual information is a concept from information theory: it tells us how much we can learn about one variable if we know the value of another.

- Higher mutual information means more information about churn from a feature.
- Lower mutual information means less information.
- [More Information on Mutual information](https://en.wikipedia.org/wiki/Mutual_information)
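A toy comparison helps calibrate what the scores mean. In the sketch below, all values are invented: one feature determines churn exactly and the other is statistically independent of it. `mutual_info_score` reports its result in nats (natural logarithm), so the maximum for a balanced binary target is ln 2 ≈ 0.693.

```python
import math
from sklearn.metrics import mutual_info_score

# Invented toy data: 8 customers, half of them churn
churn       = [0, 0, 1, 1, 0, 0, 1, 1]
informative = ["a", "a", "b", "b", "a", "a", "b", "b"]  # mirrors churn exactly
unrelated   = ["u", "v", "u", "v", "u", "v", "u", "v"]  # independent of churn

mi_informative = mutual_info_score(churn, informative)
mi_unrelated = mutual_info_score(churn, unrelated)

print(f"informative feature: {mi_informative:.4f}  (ln 2 = {math.log(2):.4f})")
print(f"unrelated feature:   {mi_unrelated:.4f}")
```

Real features fall between these two extremes, which is why the scores are most useful for ranking features against each other rather than as absolute values.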

In [24]:
from sklearn.metrics import mutual_info_score

In [25]:
categories = {}
for category in categorical:
    score = mutual_info_score(df_full_train['churn'], df_full_train[category])
    categories[category] = round(float(score), 5)

# DataFrame creation
df_mi = pd.DataFrame.from_dict(
    categories, 
    orient='index', 
    columns=['mutual_info_score']
).sort_values('mutual_info_score', ascending=False).reset_index()

df_mi.columns = ['features', 'mutual_info_score']
df_mi


,features,mutual_info_score
0,contract,0.09832
1,onlinesecurity,0.06309
2,techsupport,0.06103
3,internetservice,0.05587
4,onlinebackup,0.04692
5,deviceprotection,0.04345
6,paymentmethod,0.04321
7,streamingtv,0.03185
8,streamingmovies,0.03158
9,paperlessbilling,0.01759


## **Feature Importance: Correlation**

One way to measure feature importance for numerical variables is correlation.
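As a minimal sketch of the idea, the example below computes the Pearson correlation between each numerical column and a binary target using `DataFrame.corrwith`. The data is invented for illustration and is not drawn from the churn dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Invented toy data: short-tenure, high-charge customers churn
toy = pd.DataFrame({
    "tenure":         [1, 2, 3, 40, 50, 60],
    "monthlycharges": [90, 85, 80, 30, 25, 20],
    "churn":          [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of each numerical feature with the target
corr = toy[["tenure", "monthlycharges"]].corrwith(toy["churn"])
print(corr)
```

The coefficient lies between -1 and 1. A negative value means the feature tends to be lower for churned customers, a positive value means it tends to be higher, and values near 0 indicate little linear relationship.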