---
# Logistic Regression - Model 1
---
In this notebook, we evaluate how accurately a Logistic Regression model predicts customer churn (clients leaving the company).


## Results

---

### Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

### Read data file

In [2]:
churn_df = pd.read_excel('../data/churn_cleaned_featEng.xlsx')
churn_df

Unnamed: 0,Latitude,Longitude,Tenure Months,Monthly Charges,Churn Value,Senior Citizen_Yes,Partner_Yes,Dependents_Yes,Internet Service_Fiber optic,Internet Service_No,...,Contract_Two year,Paperless Billing_Yes,Payment Method_Credit card (automatic),Payment Method_Electronic check,Payment Method_Mailed check,Gender_Male,Phone Service_Yes,Multiple Lines_Yes,Streaming TV_Yes,Streaming Movies_Yes
0,33.964131,118.272783,2,53.85,1,0,0,0,0,0,...,0,1,0,0,1,1,1,0,0,0
1,34.059281,118.307420,2,70.70,1,0,0,1,1,0,...,0,1,0,1,0,0,1,0,0,0
2,34.048013,118.293953,8,99.65,1,0,0,1,1,0,...,0,1,0,1,0,0,1,1,1,1
3,34.062125,118.315709,28,104.80,1,0,1,1,1,0,...,0,1,0,1,0,0,1,1,1,1
4,34.039224,118.266293,49,103.70,1,0,0,1,1,0,...,0,1,0,0,0,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,34.341737,116.539416,72,21.15,0,0,0,0,0,1,...,1,1,0,0,0,0,1,0,0,0
7039,34.667815,117.536183,24,84.80,0,0,1,1,0,0,...,0,1,0,0,1,1,1,1,1,1
7040,34.559882,115.637164,72,103.20,0,0,1,1,1,0,...,0,1,1,0,0,0,1,1,1,1
7041,34.167800,116.864330,11,29.60,0,0,1,1,0,0,...,0,1,0,1,0,0,0,0,0,0


---

<center>
    
## Preparing data

</center>

---

### Separate X and y features

In [3]:
# Separate the features (X) from the target (y)
X = churn_df.drop(columns=['Churn Value'])
y = churn_df['Churn Value']

### Split dataset (training/testing)

In [4]:
# Split the dataset into a training set (70%) and a combined testing+validation set (30%)
X_train, X_test_validation, y_train, y_test_validation = train_test_split(X, y, train_size=0.7, random_state=5)

# Split the testing+validation set in half: a validation set (15%) and a testing set (15%)
X_val, X_test, y_val, y_test = train_test_split(X_test_validation, y_test_validation, test_size=0.5, random_state=5)
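
As a quick check on the 70/15/15 split, the subset sizes can be worked out by hand. The sketch below assumes the dataset has 7042 rows (inferred from the last displayed index, 7041) and that `train_test_split` rounds a float `train_size` down and a float `test_size` up; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, train_size=0.7, holdout_test_size=0.5):
    """Return (n_train, n_val, n_test) for the two-stage split used above."""
    # First split: a float train_size is rounded down to a row count
    n_train = math.floor(train_size * n_samples)
    n_holdout = n_samples - n_train
    # Second split: a float test_size is rounded up to a row count
    n_test = math.ceil(holdout_test_size * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

# Assumption: 7042 rows, based on the last displayed index (7041)
n_train, n_val, n_test = split_sizes(7042)
print(n_train, n_val, n_test)  # 4929 1056 1057
```

Under these assumptions the test subset has 1057 rows, which agrees with the support of 1057 shown in the classification reports.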

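### Convert data subsets to DataFrames

In [5]:
# train_test_split already returns DataFrames when given a DataFrame,
# so this step only makes the column order explicit
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)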

### Scale X features

In [6]:
# Create instance of scaler
scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transformation to the test data
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
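
Note that `scaled_X_train` and `scaled_X_test` are not passed to the model fitted later in this notebook, which uses the unscaled columns. One possible way to make sure scaling is always applied is to wrap the scaler and the classifier in a `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn dataset; the names `X_demo`, `y_demo`, and `scaled_model` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=5)

# The scaler is fitted on the training data only, inside the pipeline
scaled_model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
scaled_model.fit(X_tr, y_tr)

# predict() applies the already-fitted scaler before the classifier
demo_pred = scaled_model.predict(X_te)
print(demo_pred.shape)
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the scaler is refitted on each training fold rather than on the full training set.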

---

<center>
    
## Model

</center>

---

### Select the best features for the model


In [7]:
# Local project module holding precomputed feature scores
import features

# Load the two sets of feature scores (dicts mapping feature name to score)
selection1 = features.features
selection2 = features.features_chi

# Keep the names of the 5 and 10 highest-scoring features from each set
selection1best5 = sorted(selection1, key=selection1.get, reverse=True)[:5]
selection1best10 = sorted(selection1, key=selection1.get, reverse=True)[:10]
selection2best5 = sorted(selection2, key=selection2.get, reverse=True)[:5]
selection2best10 = sorted(selection2, key=selection2.get, reverse=True)[:10]
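
The four selections above repeat the same "sort by score, keep the top k names" step. A small helper could express it once. This is a sketch only: `top_k_features` and the example scores are hypothetical, and it assumes the score dictionaries map feature names to numeric scores where higher is better.

```python
def top_k_features(scores, k):
    """Return the names of the k highest-scoring features, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical scores, for illustration only
example_scores = {"feature_a": 0.42, "feature_b": 0.07, "feature_c": 0.31, "feature_d": 0.19}
print(top_k_features(example_scores, 2))  # ['feature_a', 'feature_c']
```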

### Fit model on training dataset using the selected features

In [8]:
from sklearn.linear_model import LogisticRegression

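# Use the five highest-scoring features from the first selection
using_features = selection1best5

# Create instance of model
model = LogisticRegression()

# Fit the model on the training data (unscaled features)
model.fit(X_train[using_features], y_train)

# Predict churn for the test set
y_pred = model.predict(X_test[using_features])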

# Print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.84      0.91      0.87       783
           1       0.65      0.49      0.56       274

    accuracy                           0.80      1057
   macro avg       0.74      0.70      0.71      1057
weighted avg       0.79      0.80      0.79      1057



---

<center>
    
## Validation

</center>

---

### Run model on testing dataset

In [9]:
# Run the model on the testing data
y_pred = model.predict(X_test[using_features])

# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.91      0.87       783
           1       0.65      0.49      0.56       274

    accuracy                           0.80      1057
   macro avg       0.74      0.70      0.71      1057
weighted avg       0.79      0.80      0.79      1057



### Model validation

In [10]:
# Model validation
from sklearn.model_selection import cross_val_score

# Use cross-validation to find the best combination of the following hyperparameters:
# C: Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
# penalty: Norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties; ‘liblinear’, used below, supports both l1 and l2.
# solver: Algorithm to use in the optimization problem.
# tol: Tolerance for stopping criteria.
# max_iter: Maximum number of iterations taken for the solvers to converge.

# Create instance of model
model = LogisticRegression()

# Create a dictionary of hyperparameters
hyperparameters = {
    'C': [0.001, 0.1, 10, 100, 200, 500],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'tol': [0.0001, 0.001, 0.01, 0.1],
    'max_iter': [100, 1000, 10000]
}

# Create instance of GridSearchCV
grid = GridSearchCV(model, hyperparameters, cv=5)

# Fit the model
grid.fit(X_train[using_features], y_train)

# Print the best hyperparameters, mean cross-validated accuracy, fitted estimator, grid index, and coefficients
print(grid.best_params_)
print(grid.best_score_)
print(grid.best_estimator_)
print(grid.best_index_)
print(grid.best_estimator_.coef_)

{'C': 100, 'max_iter': 10000, 'penalty': 'l1', 'solver': 'liblinear', 'tol': 0.01}
0.799391480730223
LogisticRegression(C=100, max_iter=10000, penalty='l1', solver='liblinear',
                   tol=0.01)
90
[[-0.03056315  1.26669317 -1.33650621  0.64608374 -1.51939739]]
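
The coefficients printed above are on the log-odds scale, in the order of `using_features`. One way to read them is to exponentiate each value, which gives the multiplicative change in the odds of churn for a one-unit increase in that feature with the others held fixed. The sketch below uses placeholder feature names because the actual names in `using_features` are not shown in the notebook. Since the model was fitted on unscaled columns, the magnitudes depend on each feature's units.

```python
import math

# Coefficients printed above, in the order of the selected features
coefficients = [-0.03056315, 1.26669317, -1.33650621, 0.64608374, -1.51939739]

# Placeholder names: the actual names in using_features are not shown in the notebook
feature_names = [f"feature_{i}" for i in range(1, len(coefficients) + 1)]

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in that feature, other features held fixed
odds_ratios = {name: math.exp(coef) for name, coef in zip(feature_names, coefficients)}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
```

A ratio above 1 means the feature is associated with higher odds of churn, and a ratio below 1 with lower odds.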


---

<center>
    
## Results

</center>

---

### Classification report on the test set

In [11]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.91      0.87       783
           1       0.65      0.49      0.56       274

    accuracy                           0.80      1057
   macro avg       0.74      0.70      0.71      1057
weighted avg       0.79      0.80      0.79      1057
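
The report above reflects the default `LogisticRegression()` fitted in In [8]: `y_pred` was not recomputed after the grid search, and the validation subset (`X_val`, `y_val`) is not used. As a minimal sketch of how a tuned estimator could be scored on a held-out validation subset, the example below uses synthetic data and a reduced grid; all names (`X_demo`, `demo_grid`, `val_pred`) are illustrative and the numbers it prints say nothing about the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the churn dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=5, random_state=5)
X_tr, X_hold, y_tr, y_hold = train_test_split(X_demo, y_demo, train_size=0.7, random_state=5)
X_va, X_te, y_va, y_te = train_test_split(X_hold, y_hold, test_size=0.5, random_state=5)

# A reduced grid to keep the example fast
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
demo_grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
demo_grid.fit(X_tr, y_tr)

# GridSearchCV refits the best configuration on all training data by default,
# so predict() uses the tuned model
val_pred = demo_grid.predict(X_va)
val_report = classification_report(y_va, val_pred)
print(val_report)
```

Keeping the test subset untouched until the final comparison means the validation subset can be used to choose between feature sets or hyperparameters without biasing the reported test scores.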





**Explanation of Metrics:**

>    **Precision**: The proportion of true positive predictions out of all positive predictions.<br>
     **Recall**: The proportion of true positive predictions out of all actual positive instances.<br>
     **F1-score**: The harmonic mean of precision and recall, balancing both metrics.<br>
     **Support**: The number of actual occurrences of each class in the evaluated dataset.<br>
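
The definitions above can be written out directly from confusion-matrix counts. The sketch below uses hypothetical counts (not taken from the churn results), and `binary_metrics` is an illustrative helper for the positive class only.

```python
def binary_metrics(tp, fp, fn):
    """Precision, recall, F1 and support for the positive class."""
    precision = tp / (tp + fp)   # correct positives / all predicted positives
    recall = tp / (tp + fn)      # correct positives / all actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    support = tp + fn            # number of actual positives
    return precision, recall, f1, support

# Hypothetical counts, for illustration only
precision, recall, f1, support = binary_metrics(tp=60, fp=20, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
# precision=0.75 recall=0.60 f1=0.67 support=100
```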

### Comparison with other models

**DecisionTreeClassifier** (scoring=balanced_accuracy, class_weights={0: 0.5, 1: 1.5})

|               | Precision | Recall | F1-Score | Support |
|---------------|:---------:|:------:|:-------:|:-------:|
| **Class 0**   | 0.93     | 0.67   | 0.78    | 783     |
| **Class 1**   | 0.48     | 0.86   | 0.61    | 274     |
| **Accuracy**  | -        | -      | 0.72    | 1057    |
| **Macro Avg** | 0.71     | 0.77   | 0.70    | 1057    |
| **Weighted Avg** | 0.82     | 0.72   | 0.74    | 1057    |

<br><br>
**K-Nearest Neighbor** (weights=uniform)

|               | Precision | Recall | F1-Score | Support |
|---------------|:---------:|:------:|:-------:|:-------:|
| **Class 0**   | 0.83     | 0.81   | 0.82    | 783     |
| **Class 1**   | 0.49     | 0.51   | 0.50    | 274     |
| **Accuracy**  | -        | -   | 0.74       | 1057    |
| **Macro Avg** | 0.66     | 0.66   | 0.66    | 1057    |
| **Weighted Avg** | 0.74     | 0.74   | 0.74    | 1057    |

<br><br>
**Random Forest** (scoring=accuracy, class_weights={0: 1, 1: 1})

|               | Precision | Recall | F1-Score | Support |
|---------------|:---------:|:------:|:-------:|:-------:|
| **Class 0**   | 0.86     | 0.90   | 0.88    | 783     |
| **Class 1**   | 0.66     | 0.57   | 0.61    | 274     |
| **Accuracy**  | -        | -      | 0.81    | 1057    |
| **Macro Avg** | 0.76     | 0.73   | 0.74    | 1057    |
| **Weighted Avg** | 0.81     | 0.81   | 0.81    | 1057    |

<br><br>
**Logistic Regression**

|               | Precision | Recall | F1-Score | Support |
|---------------|:---------:|:------:|:-------:|:-------:|
| **Class 0**   | 0.84     | 0.91   | 0.87    | 783     |
| **Class 1**   | 0.65     | 0.49   | 0.56    | 274     |
| **Accuracy**  | -        | -      | 0.80    | 1057    |
| **Macro Avg** | 0.74     | 0.70   | 0.71    | 1057    |
| **Weighted Avg** | 0.79     | 0.80   | 0.79    | 1057    |
