# IndyCare Case: The NPS key drivers' analysis

## 1. Case introduction

### Context
<br>

- IndyCare has started to gather **patient feedback** and data on the **Net Promoter Score (NPS)**
- Management would like to have a better understanding of the **current deficiencies** in the hospitals and how these can be improved 

<center><img src="https://www.dropbox.com/s/2fagnil1d9hcdxq/charlie-nps-supporter.png?raw=1" width="15%" style="float:right"></center>

### Net Promoter Score (NPS)
<br>

- NPS is a leading indicator of **customer satisfaction**
<img src="https://www.dropbox.com/s/pve2ojgz1h7qlgu/NPS.png?raw=1" width="100%"> 
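
As a rough illustration of how the score is derived, the sketch below uses the standard NPS convention: scores of 9-10 count as promoters, 7-8 as passives and 0-6 as detractors, and the NPS is the percentage of promoters minus the percentage of detractors. The helper names and the example scores are hypothetical, and it is an assumption that the `NPS_Status` column in the IndyCare data follows exactly these cut-offs.

```python
import pandas as pd


def nps_status(score: int) -> str:
    """Map a 0-10 score to an NPS category (standard cut-offs, assumed here)."""
    if score >= 9:
        return "Promotor"  # spelled as in the dataset's labels
    if score >= 7:
        return "Passive"
    return "Detractor"


def net_promoter_score(scores: pd.Series) -> float:
    """NPS = % promoters - % detractors, on a scale from -100 to 100."""
    shares = scores.map(nps_status).value_counts(normalize=True)
    return 100 * (shares.get("Promotor", 0.0) - shares.get("Detractor", 0.0))


# Hypothetical scores: 4 promoters, 3 passives, 3 detractors
example_scores = pd.Series([10, 9, 8, 7, 6, 3, 10, 9, 5, 8])
example_nps = net_promoter_score(example_scores)
print(f"NPS: {example_nps:.1f}")
```

Passives do not enter the formula directly, but they do lower both shares, which is why they still matter for the final score.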

### CRISP-DM 
- The **CR**oss **I**ndustry **S**tandard **P**rocess for **D**ata **M**ining is a process model that serves as the base for a data science process with 6 sequential phases:

<br>

<center><img src="https://www.dropbox.com/s/sojcporoik37epw/crispdm.png?raw=1" width="50%"></center>


## 2. Reading in the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, recall_score, confusion_matrix, ConfusionMatrixDisplay

from imblearn.over_sampling import SMOTE


In [None]:
df = pd.read_csv('../data/Dataset IndyCare Case.csv').set_index('SN')
df.head() # We take a look at the first 5 rows of the dataset 

In [None]:
df.shape

In [None]:
df.info()

The output of `df.info()` shows that the dataset does not contain any missing values, so we can proceed to the Exploratory Data Analysis (EDA) stage, which allows us to better understand our data. We will start by looking at the distribution of our target variable.

### Goals of the case

1. **Explore and visualize** the data to better understand its structure and check for anomalies;
2. Build a **predictive model** to identify detractors early and **evaluate** the model's performance;
3. Determine which features have a **significant impact** on the NPS status.

## 3. Exploratory data analysis (EDA)

### Split the dataframe

For this part of the case, we only look at the survey answers of the patients. Therefore, we remove all the patient characteristics from the dataset and store them in df_background. We will use this dataset in the following session when we cluster patients based on their characteristics.

In [None]:
# We create a new dataframe to put these background variables in
df_background = df[["Department", "AgeYrs", "Gender", "MaritalStatus","BedCategory", "Estimatedcost"
                  ,"AdmissionDate", "DischargeDate","LengthofStay"]]

# Remove the background variables from the "df_survey" dataset as well as the target variables NPS_Score and NPS_Status
df_survey = df.drop(df_background.columns, axis=1)
df_survey = df_survey.drop(['NPS_Score','NPS_Status'], axis = 1) 
df_target = df['NPS_Status'] 

In [None]:
# This dataframe only contains the background variables
df_background.head()

In [None]:
# This dataframe only contains the answers to the survey questions
# We will call these the 'features' of our model; they represent the independent variables
df_survey.head() 

In [None]:
# This dataframe only contains the target variable a.k.a. the dependent variable
df_target.head()

### What type of predictive problem is this?

### Target variable: Absolute and Relative Count

In [None]:
# Give the absolute count
df_target.value_counts()

In [None]:
# Give the relative count
df_target.value_counts(normalize = True)

#### Exercise 1

Can you make a plot that shows the count of the target classes?

Hint: Ask ChatGPT to help you (use seaborn package).

#### Solution
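
One possible approach, sketched on a small stand-in series so that it runs on its own. The values in `demo_target` are made up; in the notebook you would pass `df_target` instead.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for df_target (made-up class counts)
demo_target = pd.Series(
    ["Promotor"] * 6 + ["Passive"] * 3 + ["Detractor"] * 1, name="NPS_Status"
)

fig, ax = plt.subplots(figsize=(6, 4))
# Order the bars from the most to the least frequent class
sns.countplot(x=demo_target, order=demo_target.value_counts().index, ax=ax)
ax.set_xlabel("NPS status")
ax.set_ylabel("Number of patients")
ax.set_title("Count of target classes")
```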

### Correlation matrix

In [None]:
plt.subplots(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True));

What can you already infer from this plot?

### Explore the features  

#### Exercise 2

Plot a histogram to visualize the distribution of at least one survey question in relation to the NPS_Status.

Hint: You can ask ChatGPT to help you plot the number of patients who gave a certain score in relation to their NPS Status

Can you plot this for all survey questions in a single code cell?

#### Solution

## 4. Predictive modeling

### Binary classification problem 
<br>

- *Passives* don't generate bad word-of-mouth
- The idea is to identify *detractors* early on and convert them into *promotors*
- Focus on distinguishing the *detractors* from the *passives* and *promotors*

In [None]:
# We create a new dataframe that contains the binary target
df_binary = df.copy()
# We then gather all the instances that are passives and promotors into "non-detractors"
df_binary['NPS_Status'] = df['NPS_Status'].replace(["Passive", "Promotor"], "Non-Detractor")

# We drop the background and target variables from our survey dataframe. It now only contains the features.
df_survey_binary = df_binary.drop(df_background.columns, axis=1)
df_survey_binary = df_survey_binary.drop(['NPS_Score','NPS_Status'], axis = 1) 
# We define our target dataframe
df_target_binary = df_binary['NPS_Status'] 

df_target_binary.value_counts(normalize=True)

We create a new dataframe for our binary target. We do however get an even more imbalanced dataset, so we will need to apply some resampling technique as well.

### Defining our training and test variables 
The <a href=https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>`train_test_split`</a> procedure involves taking a dataset and dividing it into two subsets. 

- Train Dataset: used to fit the machine learning model.
- Test Dataset: used to evaluate the fit of the machine learning model and its subsequent predictions.

#### Exercise 3

Define the X and y variable(s) for our model for both the training and test set. Use the following parameters:
- test_size = 0.2 
- random_state = 42

Hint: Look at the examples of <a href=https://www.geeksforgeeks.org/how-to-do-train-test-split-using-sklearn-in-python>`train_test_split`</a> or ask ChatGPT.

#### Solution
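
A minimal sketch of the split, run on a made-up stand-in so that it is self-contained. The `demo_*` names and their values are hypothetical; in the notebook you would pass `df_survey_binary` and `df_target_binary` and name the outputs `X_train`, `X_test`, `y_train`, `y_test`. The `stratify` argument is an optional extra that the exercise does not ask for: it keeps the class proportions the same in both subsets, which can help with an imbalanced target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 patients, 3 survey questions scored 1-4
rng = np.random.default_rng(42)
demo_features = pd.DataFrame(
    rng.integers(1, 5, size=(100, 3)), columns=["Q1", "Q2", "Q3"]
)
demo_target = pd.Series(
    ["Detractor"] * 20 + ["Non-Detractor"] * 80, name="NPS_Status"
)

# 80% of the rows go to training, 20% to testing
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    demo_features,
    demo_target,
    test_size=0.2,
    random_state=42,
    stratify=demo_target,  # optional: preserve the class ratio in both subsets
)

print(X_train_demo.shape, X_test_demo.shape)
print(y_test_demo.value_counts())
```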

In [None]:
# Show the first few rows of X_train 
X_train.head()

In [None]:
# Show the first few rows of y_train
y_train.head()

### Predictive models

###### Logistic regression
    
&#43; Easy to implement<br>
&#43; High interpretability<br>
&#8722; Lower performance<br>
&#8722; Linear boundary

<center><img src="https://www.dropbox.com/s/nfl722ylatv9sgo/LR_boundary.png?raw=1" width="70%" style="float:right"></center>

### Logistic regression

In [None]:
# Convert target variable to binary format
y_train = np.where(y_train == 'Detractor', 1, 0)
y_test = np.where(y_test == 'Detractor', 1, 0)
target_names = ['Non-Detractors', 'Detractors']

# Define a logistic regression classifier
LR = LogisticRegression(solver='newton-cg', max_iter=500)
# Fit the classifier on the training data
LR.fit(X_train, y_train)
y_pred= LR.predict(X_test)

# Estimate the accuracy of the classifier on both training and test data
print('Accuracy of LR classifier on overall sample training set: {:.2f}'
     .format(LR.score(X_train, y_train)))
print('Accuracy of LR classifier on overall sample test set: {:.2f}\n'
     .format(LR.score(X_test, y_test)))

#### Confusion matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels = target_names)
disp.plot()
plt.title('Logistic Regression \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)));

#### Scoring metrics
<br>

- **precision**: proportion of predicted instances that truly belong to that class $\frac{tp}{tp+fp}$
- **recall**: proportion of actual instances of a class correctly classified $\frac{tp}{tp+fn}$
- **f1-score**: harmonic mean of precision & recall $\frac{2 \cdot precision \cdot recall}{precision + recall}$
- **support**: number of occurrences of the given class in your test set 

&nbsp;
- **accuracy**: proportion of correctly classified instances $\frac{tp+tn}{tp+tn+fp+fn}$ 
- **macro average**: average, independent of distribution
- **weighted average**: average, dependent on distribution
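
To make these definitions concrete, the sketch below computes each metric by hand from a confusion matrix and compares it with scikit-learn's own functions. The labels in `y_true` and `y_hat` are made up for illustration (1 = detractor, 0 = non-detractor).

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
)

# Hypothetical true labels and predictions for 10 patients
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(precision, precision_score(y_true, y_hat)),
      np.isclose(recall, recall_score(y_true, y_hat)),
      np.isclose(f1, f1_score(y_true, y_hat)))
```

In this toy example accuracy is 0.7 even though only half of the detractors are found, which illustrates why accuracy alone can be misleading on imbalanced data.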

#### Classification report

In [None]:
print(classification_report(y_test, y_pred, target_names = target_names))

#### Are we happy with this result?

Ask ChatGPT to help you with the interpretation of the classification report.

Hint: You can simply copy-paste the table in ChatGPT.

&rarr; What can we do?

### Treating the data imbalance

Over-sample the minority (*detractor*) class using:<br>

- **Resampling** where you duplicate observations from your minority class in your training set
    - Simply adding duplicate records of the minority class often doesn’t add any new information to the model
    
 
- **SMOTE** where you synthesize new instances from the existing minority observations in your training set
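
The core idea behind SMOTE can be sketched in a few lines: pick a minority observation, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The function below is a simplified illustration under that assumption, not the `imblearn` implementation; the name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_sample(minority: np.ndarray, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = np.random.default_rng(seed)
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the chosen point to every minority point
    dists = np.linalg.norm(minority - base, axis=1)
    # Indices of the k closest points, skipping the first (the point itself)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    gap = rng.random()  # random position along the segment, in [0, 1)
    return base + gap * (neighbour - base)


# Hypothetical minority-class observations with two features
minority_points = np.array(
    [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.6]]
)
synthetic_point = smote_like_sample(minority_points)
print(synthetic_point)
```

Because the new point is interpolated rather than copied, it adds some variation to the minority class instead of repeating an existing record.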

In [None]:
# Balance the training set only; the test set keeps its original distribution
sm = SMOTE(random_state=0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

Check the **new distribution**: equal amounts for the two classes

In [None]:
# Old distribution
pd.Series(y_train).value_counts().rename({0: 'Non-Detractor', 1: 'Detractor'})

In [None]:
# New distribution
pd.Series(y_train_res).value_counts().rename({0: 'Non-Detractor', 1: 'Detractor'})

#### Exercise 4
Apply the logistic regression again and compare the results with those of the previous - imbalanced - classifier. <br>
There is no need to convert the target variable to binary format like previously. This needs to be done only once.


#### Solution
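
A hedged sketch of the comparison, run on synthetic data so that it is self-contained. Two things differ from the notebook and are assumptions of this example: the data come from `make_classification` rather than the IndyCare survey, and the training set is balanced with plain random oversampling (`sklearn.utils.resample`) rather than SMOTE. In the notebook you would instead fit the classifier on `X_train_res` and `y_train_res` and evaluate on the untouched `X_test` and `y_test`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 10% of the rows belong to class 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline classifier fitted on the imbalanced training set
baseline = LogisticRegression(max_iter=500).fit(X_tr, y_tr)

# Randomly duplicate minority rows until both classes are equally large
is_minority = y_tr == 1
n_majority = int((~is_minority).sum())
X_min_up, y_min_up = resample(
    X_tr[is_minority], y_tr[is_minority],
    replace=True, n_samples=n_majority, random_state=0,
)
X_bal = np.vstack([X_tr[~is_minority], X_min_up])
y_bal = np.concatenate([y_tr[~is_minority], y_min_up])

# Same classifier, fitted on the balanced training set
rebalanced = LogisticRegression(max_iter=500).fit(X_bal, y_bal)

# Compare minority-class recall on the same, untouched test set
recall_baseline = recall_score(y_te, baseline.predict(X_te))
recall_rebalanced = recall_score(y_te, rebalanced.predict(X_te))
print(f"Recall (imbalanced): {recall_baseline:.2f}")
print(f"Recall (rebalanced): {recall_rebalanced:.2f}")
```

When comparing the two classifiers, look at the recall for the minority class as well as the overall accuracy, since balancing typically trades some accuracy for more detected minority cases.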

## 5. Feature importance: Key drivers of NPS

Goal: Determine which <u>*features*</u> have a **significant impact** on our model's prediction of the NPS status
- Provides insights into explainability/interpretability of the model
- Can be used for feature selection to improve the final model's performance

Let's look at the **feature importance** for our **Logistic Regression** model, as this model had the highest recall for *detractors* of all classifiers and is thus most suited in correctly differentiating *detractors* from the *non-detractors*.

### Logistic Regression: Coefficients

In [None]:
# Get importance
importance = LR.coef_[0]

fig = plt.figure(figsize = (8,8))
plt.barh(X_train.columns, importance)
plt.title('Feature Importance')
plt.xlabel('Importance') 
plt.show()

In [None]:
import statsmodels.api as sm  # note: this rebinds `sm`, which held the SMOTE object above

# y_train_res is already encoded as 0/1 (1 = Detractor), so no further conversion is needed
y_train_binary = np.asarray(y_train_res)

# Add an intercept term; statsmodels does not include one by default
X_train_const = sm.add_constant(X_train_res)

# Create and fit the logistic regression model
logit_model = sm.Logit(y_train_binary, X_train_const)
result = logit_model.fit(disp=False)

# Get the coefficients, standard errors, and p-values from the same fitted model
coefficients = result.params
std_errors = result.bse
p_values = result.pvalues

# Create a DataFrame to store the results
results_df = pd.DataFrame({'Coefficients': coefficients,
                           'Standard Errors': std_errors,
                           'P-values': p_values})

# Add stars to indicate significance levels
alpha_10 = 0.10
alpha_5 = 0.05
alpha_1 = 0.01

results_df['Significance'] = results_df['P-values'].apply(lambda x: '***' if x < alpha_1 else '**' if x < alpha_5 else '*' if x < alpha_10 else '')

# Format the DataFrame
styled_results_df = results_df.style\
    .format({'Coefficients': '{:.4f}',
             'Standard Errors': '{:.4f}',
             'P-values': '{:.4f}'})\
    .set_caption('Logistic Regression Results')\
    .set_table_attributes('class="dataframe"')



# Display the formatted DataFrame
display(styled_results_df)

#### Exercise 5
How should we interpret the above coefficients and significance levels?
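
As a starting point for the interpretation: a logistic regression coefficient is the change in the log-odds of the positive class for a one-unit increase in that feature, holding the other features fixed. Exponentiating it gives an odds ratio, which is often easier to read. The sketch below shows this on made-up data; the feature names `StaffAttitude` and `FoodQuality` and the simulated relationship are hypothetical and are not taken from the IndyCare dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical survey answers on a 1-4 scale for 500 patients
rng = np.random.default_rng(0)
n = 500
demo_X = pd.DataFrame({
    "StaffAttitude": rng.integers(1, 5, size=n),
    "FoodQuality": rng.integers(1, 5, size=n),
})

# Simulate the target: higher StaffAttitude lowers the odds of being a
# detractor, while FoodQuality has no effect by construction
log_odds = 2.0 - 1.0 * demo_X["StaffAttitude"]
p_detractor = 1 / (1 + np.exp(-log_odds))
demo_y = (rng.random(n) < p_detractor).astype(int)  # 1 = detractor

model = LogisticRegression(max_iter=500).fit(demo_X, demo_y)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=demo_X.columns)
print(odds_ratios.round(2))
```

An odds ratio below 1 means that a higher score on that question lowers the odds of being a detractor, and a value close to 1 means little effect. The significance stars in the table above then indicate how confident we can be that a coefficient differs from zero (`***` for p < 0.01, `**` for p < 0.05, `*` for p < 0.10).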