This is a continuation of my previous notebook about liver patients, so I will not repeat certain elements of the usual data overview here.

# Contents 🐱‍👤

1. Initial data manipulation
   1. Loading data
   2. Drop NA
   3. Renaming values and columns
2. Converting a quantitative variable to a qualitative variable
   1. Kurtosis
   2. Boxplots
   3. Converting Aspartate and Alamine 
   4. Renaming values
3. A look at new qualitative variables
4. Outliers
5. Pearson's correlation
6. Standardization of quantitative variables
7. Testing databases
8. Logistic regression
9. Performance comparison
10. Conclusions
11. Appendix I: balancing classes
    1.  Random Undersampling
    2.  Random Oversampling
    3.  Logistic regression for undersampling and oversampling
    4.  Conclusions II

# Initial data manipulation

This step is discussed more fully in this [notebook](https://www.kaggle.com/mogrim/logistic-regression-with-all-outliers-removed).

In [None]:
# packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import kurtosis
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, balanced_accuracy_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

In [None]:
# data loading
df = pd.read_csv("../input/indian-liver-patient-records/indian_liver_patient.csv")

In [None]:
df = df.dropna() # simply drop NA
df_c = df.copy()  # database copy for qualitative data

In [None]:
# For df_c
df_c["Dataset"] = df["Dataset"].map({1: "Sick", 2: "Healthy"})

# For df
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['Dataset'] = df['Dataset'].map({1: 1, 2: 0})
df.rename(columns={'Gender': 'Male'}, inplace=True)
df.rename(columns={'Dataset': 'Target'}, inplace=True)

In [None]:
df_c.drop(columns=['Age', 'Total_Bilirubin', 'Direct_Bilirubin',
                   'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
                   'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
                   'Albumin_and_Globulin_Ratio'], inplace=True)

# Converting a quantitative variable to a qualitative variable

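**Kurtosis** describes how heavy the tails of a distribution are, so it can serve as a rough indicator of outliers: the higher its value, the more likely it is that the variable contains outliers. Note that pandas' `kurtosis()` returns excess kurtosis, for which a normal distribution scores 0.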
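
A small sketch may make this concrete. It uses synthetic data (not the liver dataset), and the variable names are made up for the illustration. A normal sample should score close to 0, while the same sample with a handful of extreme values should score far higher.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 10,000 draws from a standard normal distribution: excess kurtosis near 0
normal = pd.Series(rng.normal(size=10_000))

# Same data, but with 20 values replaced by an extreme value (synthetic outliers)
with_outliers = normal.copy()
with_outliers.iloc[:20] = 50.0

k_normal = normal.kurtosis()
k_outliers = with_outliers.kurtosis()
print("normal sample:", round(k_normal, 2))
print("sample with outliers:", round(k_outliers, 2))
```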

In [None]:
df.drop(columns=["Target", "Male"]).kurtosis()

Two variables, `Alamine_Aminotransferase` and `Aspartate_Aminotransferase`, have very high kurtosis. Since medical reference ranges for their blood levels are available, we will convert them into qualitative variables.

In [None]:
plt.rcParams['figure.figsize'] = [10, 8]  # for size
sns.boxplot(data=df.Aspartate_Aminotransferase, orient="h").set_title("Aspartate_Aminotransferase")
print("Min value:", min(df.Aspartate_Aminotransferase))

In [None]:
sns.boxplot(data=df.Alamine_Aminotransferase, orient="h").set_title("Alamine_Aminotransferase")
print("Min value:", min(df.Alamine_Aminotransferase))

## Converting Aspartate and Alamine to qualitative variables

+ **Alamine_Aminotransferase**: the normal test result ranges from 7 to 55 units per liter (units/L).
+ **Aspartate_Aminotransferase**: the normal ranges are 10-40 units/L (males) and 9-32 units/L (females).

In both cases the minimum value of the variable is 10, so no patient falls below the normal range. The conversion therefore assigns each person one of two values, based on the reference ranges given in the medical literature: 1 for above normal, 0 for normal.

In [None]:
def alamine(row):
    # 1: above the upper reference limit of 55 units/L, 0: within the normal range
    if row['Alamine_Aminotransferase'] <= 55: return 0
    else: return 1

In [None]:
def aspartate(row):
    # upper reference limit depends on sex: 40 units/L (males), 32 units/L (females)
    if row['Male'] == 1 and row['Aspartate_Aminotransferase'] <= 40: return 0
    elif row['Male'] == 0 and row['Aspartate_Aminotransferase'] <= 32: return 0
    else: return 1

I don't want to make changes to the main database, so I will make a copy.

In [None]:
df_1 = df.copy()

In [None]:
df_1['Alamine_Aminotransferase'] = df_1.apply(alamine, axis=1)
df_1['Aspartate_Aminotransferase'] = df_1.apply(aspartate, axis=1)
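
For larger tables, the same thresholds could also be applied column-wise instead of row by row. The following is only a sketch on a made-up four-row table (`toy` is not part of the notebook). It assumes the same cut-offs as the functions above: 55 units/L for alamine, and 40 (males) or 32 (females) units/L for aspartate.

```python
import numpy as np
import pandas as pd

# Made-up rows chosen to sit on both sides of each threshold
toy = pd.DataFrame({
    "Male": [1, 1, 0, 0],
    "Alamine_Aminotransferase": [30, 80, 55, 56],
    "Aspartate_Aminotransferase": [40, 41, 32, 33],
})

# Alamine: a single upper limit for everyone
alamine_flag = (toy["Alamine_Aminotransferase"] > 55).astype(int)

# Aspartate: the upper limit depends on sex, so build a per-row limit first
aspartate_limit = np.where(toy["Male"] == 1, 40, 32)
aspartate_flag = (toy["Aspartate_Aminotransferase"] > aspartate_limit).astype(int)

print(alamine_flag.tolist())    # [0, 1, 0, 1]
print(aspartate_flag.tolist())  # [0, 1, 0, 1]
```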

In [None]:
df_1.head()

For better readability of the qualitative variables, I will put them in a separate database and convert 1 and 0 to "Above_Normal" and "Normal".

In [None]:
to_add = df_1.loc[:,['Alamine_Aminotransferase', 'Aspartate_Aminotransferase']]
df_c = df_c.join(to_add)

In [None]:
df_c.head()

In [None]:
df_c['Alamine_Aminotransferase'] = df_c['Alamine_Aminotransferase'].map({0: "Normal", 1: "Above_Normal"})
df_c['Aspartate_Aminotransferase'] = df_c['Aspartate_Aminotransferase'].map({0: "Normal", 1: "Above_Normal"})

In [None]:
df_c.head()

# A look at new qualitative variables

In [None]:
sns.countplot(x="Alamine_Aminotransferase", hue="Dataset", data=df_c).set_title(
    "Liver disease by Alamine_Aminotransferase level")

In [None]:
sns.countplot(x="Aspartate_Aminotransferase", hue="Dataset", data=df_c).set_title(
    "Liver disease by Aspartate_Aminotransferase level")

In [None]:
df_c.groupby("Gender").Aspartate_Aminotransferase.value_counts()

In [None]:
df_c.groupby("Dataset").Aspartate_Aminotransferase.value_counts()


In [None]:
df_c.Alamine_Aminotransferase.value_counts()

In [None]:
df_c.Aspartate_Aminotransferase.value_counts()

# Outliers

As I mentioned in a previous notebook on this issue, **we cannot perform mathematical operations such as computing the mean or quartiles on qualitative variables**. Therefore, they must be excluded.

In [None]:
# database for quantitative variables
df_q = df_1.drop(columns=['Male', 'Target', 'Alamine_Aminotransferase', 'Aspartate_Aminotransferase'])

In [None]:
sns.boxplot(data=df_q, orient="h").set_title("Outliers")

The `Alkaline_Phosphotase` variable still has quite a few outliers (though not as many as the two variables we converted to qualitative ones), so we will run the outlier removal procedure once.

In [None]:
def remove_outliers(df_in):

    Q1 = df_in.quantile(0.25)
    Q3 = df_in.quantile(0.75)
    IQR = Q3 - Q1
    upper_limit = Q3 + 1.5*IQR
    lower_limit = Q1 - 1.5*IQR

    df_clean = df_in[~((df_in < lower_limit) | (df_in > upper_limit)).any(axis=1)]
    
    return df_clean

In [None]:
df_q = remove_outliers(df_q)

In [None]:
print("Number of cases in df:", len(df))
print("Number of cases in df_q:", len(df_q))
print("We've removed:", round(100-(len(df_q)*100/len(df)),2), "percent of rows.")

With this procedure we removed about 25% of the database, which is a very good result compared to the previous 80%.
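
To see what the IQR rule does on a small scale, here is a sketch on a made-up two-column table. The function body repeats `remove_outliers` from above so that the block runs on its own; the `toy` data is an assumption for the illustration.

```python
import pandas as pd

def remove_outliers(df_in):
    # Same IQR rule as in the notebook: drop rows with any value outside the fences
    Q1 = df_in.quantile(0.25)
    Q3 = df_in.quantile(0.75)
    IQR = Q3 - Q1
    upper_limit = Q3 + 1.5 * IQR
    lower_limit = Q1 - 1.5 * IQR
    return df_in[~((df_in < lower_limit) | (df_in > upper_limit)).any(axis=1)]

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 100],      # 100 is far outside the rest
    "b": [10, 11, 12, 13, 14, 15],  # no outliers in this column
})

# For column "a": Q1 = 2.25, Q3 = 4.75, IQR = 2.5, fences at -1.5 and 8.5
cleaned = remove_outliers(toy)
print(cleaned)
```

Note that the last row is dropped as a whole, including its perfectly ordinary value in `b`. This is why the share of removed rows grows quickly when several columns contain outliers.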

# Pearson's correlation

In [None]:
sns.heatmap(df_q.corr(), annot=True, cmap='coolwarm',
            mask=np.triu(df_q.corr())).set_title("Correlogram")

Two variables, `Albumin` and `Total_Bilirubin`, correlate very strongly with other variables, so I will remove them.
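
When the heatmap has many cells, the strongly correlated pairs can also be listed programmatically. The sketch below does this on synthetic data. The columns `a`, `b`, `c` and the 0.8 threshold are assumptions for the illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # almost a copy of "a"
    "c": rng.normal(size=200),                      # unrelated noise
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the part above the diagonal so every pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the (assumed) threshold
strong = pairs[pairs.abs() > 0.8]
print(strong)
```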

In [None]:
df_trimmed = df_1[df_1.index.isin(df_q.index)]
df_trimmed = df_trimmed.drop(columns=['Albumin', 'Total_Bilirubin'])

# Standardization of quantitative variables

$z = \frac{x-u}{s}$

where $x$ is the sample value, $u$ is the mean, and $s$ is the standard deviation.

Standardization puts different variables on the same scale, which allows you to compare scores between different types of variables.
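
As a quick check of the formula, the sketch below standardizes a made-up column by hand and with `StandardScaler`. One detail worth knowing: `StandardScaler` uses the population standard deviation (divisor n), which matches NumPy's default `ddof=0` but not pandas' default `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # made-up single column

z_sklearn = StandardScaler().fit_transform(x)

# Manual version of z = (x - u) / s; np.std uses ddof=0 by default
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(z_sklearn, z_manual))
```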

In [None]:
scaler = StandardScaler()
df_scaled = df_trimmed.copy()
# scaling trimmed data
df_scaled[['Age', 'Direct_Bilirubin', 'Alkaline_Phosphotase', 'Total_Protiens', 'Albumin_and_Globulin_Ratio']] = scaler.fit_transform(
    df_trimmed[['Age', 'Direct_Bilirubin', 'Alkaline_Phosphotase', 'Total_Protiens', 'Albumin_and_Globulin_Ratio']])

In [None]:
df_scaled_all = df.copy()
# scaling complete data
df_scaled_all[['Age', 'Total_Bilirubin', 'Direct_Bilirubin', 'Alkaline_Phosphotase', 'Alamine_Aminotransferase', 'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio']] = scaler.fit_transform(
    df[['Age', 'Total_Bilirubin', 'Direct_Bilirubin', 'Alkaline_Phosphotase', 'Alamine_Aminotransferase', 'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio']])
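
One caveat: in the cells above the scaler is fitted on all rows, before the train/test split, so the test rows contribute to the mean and standard deviation. Whether this changes the results noticeably here is not checked in this notebook. A possible alternative, sketched on made-up data (`X` and `y` below are not the liver dataset), is to fit the scaler on the training part only and reuse its statistics for the test part.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))  # made-up features
y = np.array([1] * 140 + [0] * 60)               # made-up labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse those statistics
```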


# Testing databases

I will use four databases to fit logistic regression models that predict whether a patient has a diseased liver or not.

In [None]:
# initial database
df.head() 

In [None]:
# scaled initial database
df_scaled_all.head()

In [None]:
# trimmed database
df_trimmed.head()

In [None]:
# scaled trimmed database
df_scaled.head()

# Logistic regression

## Splitting data to X & y

In [None]:
# for df
X_df = df.loc[:, df.columns != 'Target']
y_df = df.loc[:, 'Target']

# for df_scaled_all
X_df_scaled_all = df_scaled_all.loc[:, df_scaled_all.columns != 'Target']
y_df_scaled_all = df_scaled_all.loc[:, 'Target']

# for df trimmed
X_df_trimmed = df_trimmed.loc[:, df_trimmed.columns != 'Target']
y_df_trimmed = df_trimmed.loc[:, 'Target']

# for df trimmed and scaled
X_df_trimmed_scaled = df_scaled.loc[:, df_scaled.columns != 'Target']
y_df_trimmed_scaled = df_scaled.loc[:, 'Target']

In [None]:
# for df
X_train_df, X_test_df, y_train_df, y_test_df = train_test_split(X_df, y_df, test_size = 0.30, random_state = 0, stratify = y_df)

# for df_scaled_all
X_train_df_scaled_all, X_test_df_scaled_all, y_train_df_scaled_all, y_test_df_scaled_all = train_test_split(X_df_scaled_all, y_df_scaled_all, test_size = 0.30, random_state = 0, stratify = y_df_scaled_all)

# for df trimmed
X_train_df_trimmed, X_test_df_trimmed, y_train_df_trimmed, y_test_df_trimmed = train_test_split(X_df_trimmed, y_df_trimmed, test_size = 0.30, random_state = 0, stratify = y_df_trimmed)

# for df trimmed and scaled
X_train_df_trimmed_scaled, X_test_df_trimmed_scaled, y_train_df_trimmed_scaled, y_test_df_trimmed_scaled = train_test_split(X_df_trimmed_scaled, y_df_trimmed_scaled, test_size = 0.30, random_state = 0, stratify = y_df_trimmed_scaled)

## Model

In [None]:
model_1 = LogisticRegression(max_iter=1000)
model_2 = LogisticRegression(max_iter=1000)
model_3 = LogisticRegression(max_iter=1000)
model_4 = LogisticRegression(max_iter=1000)

In [None]:
res_1 = model_1.fit(X_train_df, y_train_df)
res_2 = model_2.fit(X_train_df_scaled_all, y_train_df_scaled_all)
res_3 = model_3.fit(X_train_df_trimmed, y_train_df_trimmed)
res_4 = model_4.fit(X_train_df_trimmed_scaled, y_train_df_trimmed_scaled)

y_predict_1 = model_1.predict(X_test_df)
y_predict_2 = model_2.predict(X_test_df_scaled_all)
y_predict_3 = model_3.predict(X_test_df_trimmed)
y_predict_4 = model_4.predict(X_test_df_trimmed_scaled)


# Performance comparison

In [None]:
def score_function(y_pred, y_test):

    Acc = accuracy_score(y_test, y_pred)
    Pre = precision_score(y_test, y_pred)
    Rec = recall_score(y_test, y_pred)
    Bal = balanced_accuracy_score(y_test, y_pred)

    data = pd.DataFrame()
    names = ["Accuracy", "Precision", "Recall", "Balanced accuracy"]
    values = [Acc, Pre, Rec, Bal]
    data["Names"] = names
    data['Scores'] = values

    return data

In [None]:
scores_1 = score_function(y_predict_1, y_test_df)
scores_2 = score_function(y_predict_2, y_test_df_scaled_all)
scores_3 = score_function(y_predict_3, y_test_df_trimmed)
scores_4 = score_function(y_predict_4, y_test_df_trimmed_scaled)

In [None]:
names = ["Accuracy", "Precision", "Recall", "Balanced accuracy"]

results = pd.DataFrame({"Names": names, "df": scores_1['Scores'], "df_scaled_all":scores_2['Scores'],
                        "df_trimmed": scores_3['Scores'], "df_trimmed_scaled": scores_4['Scores']})


In [None]:
results.set_index("Names")
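
The split, fit, predict and score steps above are identical for all four databases, so they could be folded into one loop. The following is one possible sketch, not the code used to produce the results in this notebook. `compare_datasets`, `METRICS` and the demo data are hypothetical names introduced for the illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

METRICS = {
    "Accuracy": accuracy_score,
    "Precision": precision_score,
    "Recall": recall_score,
    "Balanced accuracy": balanced_accuracy_score,
}

def compare_datasets(datasets, test_size=0.30, random_state=0):
    """Fit one logistic regression per (X, y) pair and tabulate test scores."""
    columns = {}
    for name, (X, y) in datasets.items():
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state, stratify=y)
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        y_pred = model.predict(X_test)
        columns[name] = {m: fn(y_test, y_pred) for m, fn in METRICS.items()}
    # One column per dataset, one row per metric, in a fixed row order
    return pd.DataFrame(columns).loc[list(METRICS)]

# Stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
demo_results = compare_datasets({
    "demo_a": (X_demo, y_demo),
    "demo_b": (X_demo * 10, y_demo),
})
print(demo_results)
```

In this notebook the call would look like `compare_datasets({"df": (X_df, y_df), "df_scaled_all": (X_df_scaled_all, y_df_scaled_all), ...})`.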

# Conclusions

## Confusion matrix
<img src="https://i.imgur.com/Xdtufpb.jpg" width="500">


$\operatorname{accuracy} = \frac{t_p+t_n}{t_p+t_n+f_p+f_n}$

$\operatorname{precision} = \frac{t_p}{t_p + f_p}$

$\operatorname{recall} = \frac{t_p}{t_p + f_n}$

$\operatorname{balanced-accuracy} = \frac{1}{2}\left(\frac{t_p}{t_p + f_n}+\frac{t_n}{t_n+f_p}\right)$
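
The formulas can be checked on a small made-up example (the labels below are not taken from the models in this notebook). The sketch computes each score by hand from the confusion matrix and compares it with the corresponding scikit-learn function.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, precision_score, recall_score)

y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels 0/1, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)                   # 7 / 10
precision = tp / (tp + fp)                                   # 4 / 5
recall = tp / (tp + fn)                                      # 4 / 6
balanced_accuracy = 0.5 * (tp / (tp + fn) + tn / (tn + fp))  # (4/6 + 3/4) / 2

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(balanced_accuracy, balanced_accuracy_score(y_true, y_pred))
```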

## Summary
+ Standardization of the variables has barely any effect on the results.
+ For the trimmed dataset, there is a slight decrease in accuracy (about 2%) and a somewhat larger decrease in precision (about 8%).
+ For the trimmed dataset, there is an increase in recall of approximately 6%.
+ Balanced accuracy remains almost unchanged.

# Appendix I: balancing classes

Strongly imbalanced classes can degrade the performance of classification algorithms, so we will try two ways of balancing them. <br />
You can read about both of them and many more [here](https://imbalanced-learn.org/stable/index.html).

In [None]:
df.Target.value_counts().plot.pie(autopct='%.2f')

In [None]:
df_trimmed.Target.value_counts().plot.pie(autopct='%.2f')

In [None]:
print("df class 0:", len(df.Target)-df.Target.sum())
print("df_trimmed class 0:", len(df_trimmed.Target)-df_trimmed.Target.sum())


## Random Undersampling

In [None]:
rus_1 = RandomUnderSampler(sampling_strategy=1)
rus_2 = RandomUnderSampler(sampling_strategy=1)
rus_3 = RandomUnderSampler(sampling_strategy=1)
rus_4 = RandomUnderSampler(sampling_strategy=1)

X_rus_df, y_rus_df = rus_1.fit_resample(X_df, y_df)
X_rus_df_scaled_all, y_rus_df_scaled_all = rus_2.fit_resample(X_df_scaled_all, y_df_scaled_all)
X_rus_df_trimmed, y_rus_df_trimmed = rus_3.fit_resample(X_df_trimmed, y_df_trimmed)
X_rus_df_trimmed_scaled, y_rus_df_trimmed_scaled = rus_4.fit_resample(X_df_trimmed_scaled, y_df_trimmed_scaled)

# for df
X_train_df_rus, X_test_df_rus, y_train_df_rus, y_test_df_rus = train_test_split(
    X_rus_df, y_rus_df, test_size=0.30, random_state=0, stratify=y_rus_df)

# for df_scaled_all
X_train_df_scaled_all_rus, X_test_df_scaled_all_rus, y_train_df_scaled_all_rus, y_test_df_scaled_all_rus = train_test_split(
    X_rus_df_scaled_all, y_rus_df_scaled_all, test_size=0.30, random_state=0, stratify=y_rus_df_scaled_all)

# for df trimmed
X_train_df_trimmed_rus, X_test_df_trimmed_rus, y_train_df_trimmed_rus, y_test_df_trimmed_rus = train_test_split(
    X_rus_df_trimmed, y_rus_df_trimmed, test_size=0.30, random_state=0, stratify=y_rus_df_trimmed)

# for df trimmed and scaled
X_train_df_trimmed_scaled_rus, X_test_df_trimmed_scaled_rus, y_train_df_trimmed_scaled_rus, y_test_df_trimmed_scaled_rus = train_test_split(
    X_rus_df_trimmed_scaled, y_rus_df_trimmed_scaled, test_size=0.30, random_state=0, stratify=y_rus_df_trimmed_scaled)


## Random Oversampling

In [None]:
ros_1 = RandomOverSampler(sampling_strategy=1)
ros_2 = RandomOverSampler(sampling_strategy=1)
ros_3 = RandomOverSampler(sampling_strategy=1)
ros_4 = RandomOverSampler(sampling_strategy=1)

X_ros_df, y_ros_df = ros_1.fit_resample(X_df, y_df)
X_ros_df_scaled_all, y_ros_df_scaled_all = ros_2.fit_resample(X_df_scaled_all, y_df_scaled_all)
X_ros_df_trimmed, y_ros_df_trimmed = ros_3.fit_resample(X_df_trimmed, y_df_trimmed)
X_ros_df_trimmed_scaled, y_ros_df_trimmed_scaled = ros_4.fit_resample(X_df_trimmed_scaled, y_df_trimmed_scaled)

# for df
X_train_df_ros, X_test_df_ros, y_train_df_ros, y_test_df_ros = train_test_split(
    X_ros_df, y_ros_df, test_size=0.30, random_state=0, stratify=y_ros_df)

# for df_scaled_all
X_train_df_scaled_all_ros, X_test_df_scaled_all_ros, y_train_df_scaled_all_ros, y_test_df_scaled_all_ros = train_test_split(
    X_ros_df_scaled_all, y_ros_df_scaled_all, test_size=0.30, random_state=0, stratify=y_ros_df_scaled_all)

# for df trimmed
X_train_df_trimmed_ros, X_test_df_trimmed_ros, y_train_df_trimmed_ros, y_test_df_trimmed_ros = train_test_split(
    X_ros_df_trimmed, y_ros_df_trimmed, test_size=0.30, random_state=0, stratify=y_ros_df_trimmed)

# for df trimmed and scaled
X_train_df_trimmed_scaled_ros, X_test_df_trimmed_scaled_ros, y_train_df_trimmed_scaled_ros, y_test_df_trimmed_scaled_ros = train_test_split(
    X_ros_df_trimmed_scaled, y_ros_df_trimmed_scaled, test_size=0.30, random_state=0, stratify=y_ros_df_trimmed_scaled)
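
A caveat about the order of operations: in the cells above, resampling is applied to the whole dataset before the train/test split. With oversampling in particular, copies of the same row can then land in both the training and the test set, which may make the test scores look better than they are. A common alternative is to split first and resample only the training part. The sketch below illustrates that order on made-up data. To stay free of extra dependencies it uses `sklearn.utils.resample` as a stand-in for `RandomOverSampler`; `toy` and the other names are assumptions for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "Target": [1] * 70 + [0] * 30,  # imbalanced made-up labels
})

# 1. Split first, so the test set keeps the original class balance
train, test = train_test_split(
    toy, test_size=0.30, random_state=0, stratify=toy["Target"])

# 2. Oversample the minority class inside the training part only
majority = train[train["Target"] == 1]
minority = train[train["Target"] == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Target"].value_counts())
print(test["Target"].value_counts())
```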


## Logistic regression for undersampling and oversampling 

### For Random Undersampling

In [None]:
model_1_rus = LogisticRegression(max_iter=1000)
model_2_rus = LogisticRegression(max_iter=1000)
model_3_rus = LogisticRegression(max_iter=1000)
model_4_rus = LogisticRegression(max_iter=1000)

In [None]:
res_1_rus = model_1_rus.fit(X_train_df_rus, y_train_df_rus)
res_2_rus = model_2_rus.fit(X_train_df_scaled_all_rus, y_train_df_scaled_all_rus)
res_3_rus = model_3_rus.fit(X_train_df_trimmed_rus, y_train_df_trimmed_rus)
res_4_rus = model_4_rus.fit(X_train_df_trimmed_scaled_rus, y_train_df_trimmed_scaled_rus)

y_predict_1_rus = model_1_rus.predict(X_test_df_rus)
y_predict_2_rus = model_2_rus.predict(X_test_df_scaled_all_rus)
y_predict_3_rus = model_3_rus.predict(X_test_df_trimmed_rus)
y_predict_4_rus = model_4_rus.predict(X_test_df_trimmed_scaled_rus)

In [None]:
scores_1_rus = score_function(y_predict_1_rus, y_test_df_rus)
scores_2_rus = score_function(y_predict_2_rus, y_test_df_scaled_all_rus)
scores_3_rus = score_function(y_predict_3_rus, y_test_df_trimmed_rus)
scores_4_rus = score_function(y_predict_4_rus, y_test_df_trimmed_scaled_rus)

In [None]:
names = ["Accuracy", "Precision", "Recall", "Balanced accuracy"]

results_rus = pd.DataFrame({"Names": names, "df_rus": scores_1_rus['Scores'], "df_scaled_all_rus": scores_2_rus['Scores'],
                          "df_trimmed_rus": scores_3_rus['Scores'], "df_trimmed_scaled_rus": scores_4_rus['Scores']})

In [None]:
results_rus.set_index("Names")

### For Random Oversampling

In [None]:
model_1_ros = LogisticRegression(max_iter=1000)
model_2_ros = LogisticRegression(max_iter=1000)
model_3_ros = LogisticRegression(max_iter=1000)
model_4_ros = LogisticRegression(max_iter=1000)

In [None]:
res_1_ros = model_1_ros.fit(X_train_df_ros, y_train_df_ros)
res_2_ros = model_2_ros.fit(X_train_df_scaled_all_ros, y_train_df_scaled_all_ros)
res_3_ros = model_3_ros.fit(X_train_df_trimmed_ros, y_train_df_trimmed_ros)
res_4_ros = model_4_ros.fit(X_train_df_trimmed_scaled_ros, y_train_df_trimmed_scaled_ros)

y_predict_1_ros = model_1_ros.predict(X_test_df_ros)
y_predict_2_ros = model_2_ros.predict(X_test_df_scaled_all_ros)
y_predict_3_ros = model_3_ros.predict(X_test_df_trimmed_ros)
y_predict_4_ros = model_4_ros.predict(X_test_df_trimmed_scaled_ros)

In [None]:
scores_1_ros = score_function(y_predict_1_ros, y_test_df_ros)
scores_2_ros = score_function(y_predict_2_ros, y_test_df_scaled_all_ros)
scores_3_ros = score_function(y_predict_3_ros, y_test_df_trimmed_ros)
scores_4_ros = score_function(y_predict_4_ros, y_test_df_trimmed_scaled_ros)

In [None]:
names = ["Accuracy", "Precision", "Recall", "Balanced accuracy"]

results_ros = pd.DataFrame({"Names": names, "df_ros": scores_1_ros['Scores'], "df_scaled_all_ros": scores_2_ros['Scores'],
                            "df_trimmed_ros": scores_3_ros['Scores'], "df_trimmed_scaled_ros": scores_4_ros['Scores']})

In [None]:
results_ros.set_index("Names")

## Conclusions II

For Random Undersampling:
+ Standardizing the variables for the original database (df) has positive effects.
+ Standardizing the variables for the trimmed database (df_trimmed) has negative effects.
+ I do not find the effect of using Undersampling to be satisfying.

For Random Oversampling:
+ Standardization of variables in both cases has positive effects.
+ The df has higher Accuracy, Precision and Balanced accuracy values, but lower Recall than the trimmed dataset.

In this case, Oversampling is better than Undersampling.<br />
Statistical treatments seem to have little effect on the final performance of the logistic regression classifier.

# Final Words

If anyone has suggestions on how to replace the repeated code with loops, or how to improve the notebook in general, feel free to write.