Check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) of ``LogisticRegression()`` from ``sklearn.linear_model`` for details.

In [None]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression

In [None]:
churn = pd.read_csv('Data/churn.csv',sep=' ')
churn.head()

Unnamed: 0,COLLEGE,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL,CONSIDERING_CHANGE_OF_PLAN,LEAVE
1,zero,31953,0,6,313378,161,0,4,unsat,little,no,STAY
2,one,36147,0,13,800586,244,0,6,unsat,little,considering,STAY
3,one,27273,230,0,305049,201,16,15,unsat,very_little,perhaps,STAY
4,zero,120070,38,33,788235,780,3,2,unsat,very_high,considering,LEAVE
5,one,29215,208,85,224784,241,21,1,very_unsat,little,never_thought,STAY


In [None]:
display(churn.shape, churn.dtypes) # some variables are object (string or mixed).

(20000, 12)

COLLEGE                        object
INCOME                          int64
OVERAGE                         int64
LEFTOVER                        int64
HOUSE                           int64
HANDSET_PRICE                   int64
OVER_15MINS_CALLS_PER_MONTH     int64
AVERAGE_CALL_DURATION           int64
REPORTED_SATISFACTION          object
REPORTED_USAGE_LEVEL           object
CONSIDERING_CHANGE_OF_PLAN     object
LEAVE                          object
dtype: object

## 1. Data pre-processing

### 1.1 Convert the target variable to numbers

Convert ``LEAVE`` to numeric so that ``LEAVE`` is the positive class (1).

``sklearn`` can handle a categorical target variable directly. In that case the class labels are ordered alphabetically (``LEAVE`` = 0, ``STAY`` = 1). If you fit the model without transforming the target variable, pay attention to the column order of the predicted probabilities: ``LEAVE`` is in the first column and ``STAY`` is in the second.

In [None]:
churn.loc[churn['LEAVE']=='LEAVE','LEAVE'] = 1    # "LEAVE" is the positive class
churn.loc[churn['LEAVE']=='STAY','LEAVE'] = 0
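
The alphabetical ordering of string labels can be checked through the fitted model's ``classes_`` attribute. The sketch below uses a tiny made-up dataset (``X_toy`` and ``y_toy`` are illustrative, not the churn data) and looks up the ``LEAVE`` column by name instead of assuming its position.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, string class labels
X_toy = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y_toy = np.array(['STAY', 'STAY', 'STAY', 'LEAVE', 'LEAVE', 'LEAVE'])

clf_toy = LogisticRegression().fit(X_toy, y_toy)

# classes_ lists the labels in the same order as the columns of predict_proba
print(clf_toy.classes_)

proba_toy = clf_toy.predict_proba(X_toy)

# Find the column that holds P(LEAVE) by label rather than by position
leave_col = list(clf_toy.classes_).index('LEAVE')
p_leave = proba_toy[:, leave_col]
```

With string labels, ``LEAVE`` sorts before ``STAY``, so ``leave_col`` is 0 here. After the numeric conversion above, class 1 (``LEAVE``) is in the second column instead.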

### 1.2 Convert categorical variables to numbers

Because ``sklearn`` estimators expect numeric predictors, we need to convert all categorical variables to numbers.

In [None]:
# Convert COLLEGE to numeric (zero -> 0, one -> 1)

churn.loc[churn['COLLEGE']=='zero','COLLEGE'] = 0  
churn.loc[churn['COLLEGE']=='one','COLLEGE'] = 1

In [None]:
# Check the unique values of the three variables that we treat as ordinal

for c in ['REPORTED_SATISFACTION','REPORTED_USAGE_LEVEL','CONSIDERING_CHANGE_OF_PLAN']:
    print(churn[c].unique())

['unsat' 'very_unsat' 'very_sat' 'avg' 'sat']
['little' 'very_little' 'very_high' 'high' 'avg']
['no' 'considering' 'perhaps' 'never_thought' 'actively_looking_into_it']


In [None]:
# Convert REPORTED_SATISFACTION to numeric
# very_unsat = 0, unsat = 1, ..., very_sat = 4

churn['REPORTED_SATISFACTION'] = pd.Categorical(churn['REPORTED_SATISFACTION'], 
                                                categories=["very_unsat", "unsat", "avg", "sat","very_sat"], 
                                                ordered= True)

codes, uniques = pd.factorize(churn['REPORTED_SATISFACTION'], sort=True)
churn['REPORTED_SATISFACTION'] = codes

In [None]:
# Convert REPORTED_USAGE_LEVEL to numeric (very_little = 0, ..., very_high = 4)

churn['REPORTED_USAGE_LEVEL'] = pd.Categorical(churn['REPORTED_USAGE_LEVEL'], 
                                                categories=["very_little", "little", "avg", "high","very_high"], 
                                                ordered= True)

codes, uniques = pd.factorize(churn['REPORTED_USAGE_LEVEL'], sort=True)
churn['REPORTED_USAGE_LEVEL'] = codes

In [None]:
# Convert CONSIDERING_CHANGE_OF_PLAN to numeric (never_thought = 0, ..., actively_looking_into_it = 4)

churn['CONSIDERING_CHANGE_OF_PLAN'] = pd.Categorical(churn['CONSIDERING_CHANGE_OF_PLAN'], 
                                                categories=["never_thought", "no", "perhaps", "considering","actively_looking_into_it"], 
                                                ordered= True)

codes, uniques = pd.factorize(churn['CONSIDERING_CHANGE_OF_PLAN'], sort=True)
churn['CONSIDERING_CHANGE_OF_PLAN'] = codes
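
As a possible shortcut, an ordered ``pd.Categorical`` already stores an integer code for each value, so the codes can be read directly. The sketch below uses a small made-up Series (``s_demo`` is hypothetical) with the satisfaction levels printed earlier.

```python
import pandas as pd

# Toy column (hypothetical values drawn from the satisfaction levels)
s_demo = pd.Series(['unsat', 'very_sat', 'avg', 'very_unsat', 'sat', 'unsat'])

levels = ['very_unsat', 'unsat', 'avg', 'sat', 'very_sat']
cat_demo = pd.Categorical(s_demo, categories=levels, ordered=True)

# .codes gives the position of each value in `levels`;
# a value that is not listed in `levels` would get the code -1
codes_demo = cat_demo.codes

print(codes_demo.tolist())  # [1, 4, 2, 0, 3, 1]
```

Note that ``.codes`` returns a small integer type (``int8``), so an ``astype('int64')`` may be wanted if a uniform dtype matters.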

If you check the data types again, ``COLLEGE`` and ``LEAVE`` are still ``object``.
Make sure every variable is numeric.

In [None]:
churn = churn.astype({'COLLEGE': 'int64', 'LEAVE': 'int64'})   

churn.dtypes

COLLEGE                        int64
INCOME                         int64
OVERAGE                        int64
LEFTOVER                       int64
HOUSE                          int64
HANDSET_PRICE                  int64
OVER_15MINS_CALLS_PER_MONTH    int64
AVERAGE_CALL_DURATION          int64
REPORTED_SATISFACTION          int64
REPORTED_USAGE_LEVEL           int64
CONSIDERING_CHANGE_OF_PLAN     int64
LEAVE                          int64
dtype: object

## 2.  Split into train and test 

In [None]:
from sklearn.model_selection import train_test_split

X = churn.drop('LEAVE', axis=1)

y = churn['LEAVE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  

display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(16000, 11)

(4000, 11)

(16000,)

(4000,)

### 2.1 Scale the data

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on X_train and transform it 

X_test_scaled  = scaler.transform(X_test)        # apply the scaler to transform test data directly

In [None]:
# convert array into dataframe with col names so that we can select features easily

X_train_scaled = pd.DataFrame(X_train_scaled, columns = X_train.columns)  

X_test_scaled = pd.DataFrame(X_test_scaled, columns = X_train.columns)
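
To make explicit what "apply the scaler to the test data" means, here is a minimal sketch on made-up numbers (``train_demo`` and ``test_demo`` are hypothetical). It reproduces ``MinMaxScaler`` by hand with the formula ``(x - min) / (max - min)``, where the minimum and maximum come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy train/test arrays (hypothetical values, one feature)
train_demo = np.array([[10.0], [20.0], [30.0]])
test_demo = np.array([[15.0], [40.0]])

scaler_demo = MinMaxScaler().fit(train_demo)      # learns min=10, max=30 from train
scaled_test = scaler_demo.transform(test_demo)

# Same computation by hand, using the TRAIN minimum and maximum
train_min = train_demo.min(axis=0)
train_max = train_demo.max(axis=0)
manual_test = (test_demo - train_min) / (train_max - train_min)

print(scaled_test.ravel())   # [0.25 1.5 ]
print(manual_test.ravel())   # [0.25 1.5 ]
```

Because the test value 40 lies above the training maximum, its scaled value exceeds 1. This is expected when the scaler is fit on the training set alone.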

## 3. Model Training

###  3.1 Train ``m1`` with only two predictors ``COLLEGE``, ``INCOME`` 

In [None]:
# Take only two predictors 

X_train_sub = X_train_scaled[['COLLEGE', 'INCOME']]   # 2D features

X_test_sub = X_test_scaled[['COLLEGE', 'INCOME']]

In [None]:
m1 = LogisticRegression().fit(X_train_sub, y_train)   # fit the model

display(m1.intercept_, m1.coef_, m1.feature_names_in_)  

array([-0.31032163])

array([[0.05265908, 0.60439315]])

array(['COLLEGE', 'INCOME'], dtype=object)

#### Predict and Check Model Accuracy on Train 

In [None]:
# Predict class labels (default cut-off 0.5)

train_pred1 = m1.predict(X_train_sub)  
train_pred1

array([1, 1, 1, ..., 0, 0, 1])

In [None]:
# Check model accuracy: proportion of correct predictions

from sklearn.metrics import accuracy_score    

accuracy_score(y_train, train_pred1)     # argument order is (y_true, y_pred)

# alternatively use m1.score(X_train_sub, y_train)

0.5434375
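
Accuracy is simply the share of predictions that match the true labels. A minimal sketch on made-up labels (``y_true_demo`` and ``y_pred_demo`` are hypothetical) shows that a manual computation agrees with ``accuracy_score``.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
y_true_demo = np.array([1, 0, 1, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 1])

# Element-wise comparison gives True/False; the mean is the proportion correct
acc_manual = np.mean(y_true_demo == y_pred_demo)
acc_sklearn = accuracy_score(y_true_demo, y_pred_demo)

print(acc_manual, acc_sklearn)  # 3 of 5 predictions are correct
```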

Estimate the ***class probabilities*** and predict on the train set with a different cut-off (e.g., 0.55), then check the model accuracy again.

In [None]:
# Class probability of class 0, 1

train_prob1 = m1.predict_proba(X_train_sub)

train_prob1    

array([[0.47976773, 0.52023227],
       [0.43973893, 0.56026107],
       [0.43823944, 0.56176056],
       ...,
       [0.57474123, 0.42525877],
       [0.547727  , 0.452273  ],
       [0.47457994, 0.52542006]])

In [None]:
train_pred2 = np.where(train_prob1[:,1] >= 0.55, 1, 0)     # take 2nd col (class 1)

accuracy_score(y_train, train_pred2)   # argument order is (y_true, y_pred)

0.5275
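
The effect of the cut-off can be seen on a handful of made-up class-1 probabilities (``p1_demo`` is hypothetical). Raising the cut-off can only move predictions from 1 to 0, never the other way.

```python
import numpy as np

# Hypothetical class-1 probabilities for five observations
p1_demo = np.array([0.52, 0.56, 0.43, 0.70, 0.55])

results = {}
for threshold in (0.5, 0.55, 0.7):
    # Same rule as np.where(p >= threshold, 1, 0)
    labels = (p1_demo >= threshold).astype(int)
    results[threshold] = labels.tolist()
    print(threshold, results[threshold])
```

With these numbers, four observations are labelled 1 at a cut-off of 0.5, three at 0.55, and only one at 0.7.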

#### Manually compute ``log_odds`` and ``probability`` of positive class

Try with only the first observation (row 1).

- Compute ``log_odds`` (i.e., ``f(x)``) from the fitted linear function of the features. 
- Calculate ``probability (y = 1)`` using the formula below. 

![image.png](attachment:image.png)
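
In case the attached image does not render, the standard logistic relationship, which is assumed here to be what the image shows and which matches the computation in the cells that follow, can be written as:

$$
f(x) = b + w_1 x_1 + w_2 x_2, \qquad \text{odds} = e^{f(x)}, \qquad P(y = 1 \mid x) = \frac{e^{f(x)}}{1 + e^{f(x)}} = \frac{1}{1 + e^{-f(x)}}
$$

As a worked example with the intercept and coefficients of ``m1`` printed above and the first observation (``COLLEGE`` = 1, ``INCOME`` = 0.560291):

$$
f(x) = -0.31032163 + 0.05265908 \cdot 1 + 0.60439315 \cdot 0.560291 \approx 0.0810, \qquad P(y = 1 \mid x) = \frac{1}{1 + e^{-0.0810}} \approx 0.5202
$$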

In [None]:
X_train_sub.iloc[0,:]   # feature values for the 1st observation

#m1.intercept_           # intercept: 1D array
#m1.coef_                # coefs: 1*2 shape 2D array 

COLLEGE    1.000000
INCOME     0.560291
Name: 0, dtype: float64

In [None]:
# log_odds = b + w1 * COLLEGE + w2 * INCOME

log_odds = m1.intercept_[0] + X_train_sub.iloc[0,0] * m1.coef_[0,0] + X_train_sub.iloc[0,1] * m1.coef_[0,1] 

log_odds  

0.0809732981955622

In [None]:
# Compute probability of class 1 for the 1st observation

odds = np.exp(log_odds)          # odds = exponential function of log_odds with base e     

prob = odds/(1 + odds)        

display(prob, train_prob1[0,1])   # same!

0.5202322710545506

0.5202322710545505

``log_odds`` are also returned by the ``decision_function()`` method.  

-  ``log_odds`` is proportional to the signed orthogonal distance of an observation from the decision hyperplane, and is also called the ``confidence score`` of that observation.

In [None]:
log_odds = m1.decision_function(X_train_sub)
log_odds

array([ 0.0809733 ,  0.24222168,  0.24831027, ..., -0.30122207,
       -0.19149099,  0.10176798])

In [None]:
odds = np.exp(log_odds)   # apply natural exponential function to get odds for all obs

probs = odds/(1 + odds)   # get probabilities for all obs      

probs

array([0.52023227, 0.56026107, 0.56176056, ..., 0.42525877, 0.452273  ,
       0.52542006])
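
The two relationships used above can be checked on synthetic stand-in data (``X_demo``, ``y_demo`` and ``clf_demo`` are illustrative, not the churn data). This sketch uses ``scipy.special.expit``, which computes ``1 / (1 + exp(-x))``, as an alternative to the manual odds calculation, and divides the log-odds by the norm of the coefficient vector to obtain the signed distance to the hyperplane.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

scores_demo = clf_demo.decision_function(X_demo)   # log-odds f(x) for every row
probs_demo = expit(scores_demo)                    # logistic function of the log-odds

# Signed orthogonal distance to the hyperplane f(x) = 0:
# the log-odds divided by the length of the coefficient vector
distances_demo = scores_demo / np.linalg.norm(clf_demo.coef_)

# The probabilities agree with the second column of predict_proba
print(np.allclose(probs_demo, clf_demo.predict_proba(X_demo)[:, 1]))
```

A positive log-odds value corresponds to a probability above 0.5, so the sign of the score decides the predicted class at the default cut-off.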

**Predict and Check Model Accuracy on Test Data**

<font color=red>***Exercise 1: Your Codes Here***</font>  

- Use ``m1`` to predict the class labels on the test data (use ``X_test_sub`` as features) with the default threshold (i.e., cut-off value), and calculate the model accuracy. 
- Estimate the ``probability (of 1)`` for each observation in the test data, use the estimated probabilities to make class predictions with a cut-off of 0.7, and check the model accuracy again.

In [None]:
test_pred1 = m1.predict(X_test_sub)

accuracy_score(y_test, test_pred1)  # or m1.score(X_test_sub, y_test)

0.55425

In [None]:
test_prob1 = m1.predict_proba(X_test_sub)  # get probabilities directly.

test_pred1 = np.where(test_prob1[:,1] >= 0.7 , 1, 0)   

accuracy_score(y_test, test_pred1)  # argument order is (y_true, y_pred)

0.51375

###  3.2 Train ``m2`` with all 11 features 

- ``X_train_scaled`` stores all scaled features of the train set.

In [None]:
m2 = LogisticRegression().fit(X_train_scaled, y_train)   # use X_train_scaled

display(m2.intercept_, m2.coef_, m2.feature_names_in_)  

array([-0.53458999])

array([[ 0.05545081,  0.43244772,  1.62343002,  0.7362108 , -1.58298302,
         0.35297282,  0.42827746,  0.34422411, -0.02602629,  0.01206385,
        -0.06650498]])

array(['COLLEGE', 'INCOME', 'OVERAGE', 'LEFTOVER', 'HOUSE',
       'HANDSET_PRICE', 'OVER_15MINS_CALLS_PER_MONTH',
       'AVERAGE_CALL_DURATION', 'REPORTED_SATISFACTION',
       'REPORTED_USAGE_LEVEL', 'CONSIDERING_CHANGE_OF_PLAN'], dtype=object)

#### Predict on Train Data

- If not specified otherwise, always use the default threshold (cut-off value = 0.5).

In [None]:
train_pred2 = m2.predict(X_train_scaled)          

accuracy_score(y_train, train_pred2)             # or m2.score(X_train_scaled, y_train)  

0.641375

#### Predict on Test Data

<font color=red>***Exercise 2: Your Codes Here***</font>  

- Use ``m2`` to predict the class labels of the test data (use ``X_test_scaled`` as predictors), then calculate the model accuracy. 

In [None]:
test_pred2 = m2.predict(X_test_scaled)       

accuracy_score(y_test, test_pred2)        # or m2.score(X_test_scaled, y_test)  

0.64375