Before we start: this is my first longer written analysis, so please don't be too hard on me. English is also not my first language 😢.

<img src="https://i.imgur.com/f1uQ0uN.png" width="500">

# About
*Liver disease* is a growing problem of our time, and a good method for identifying the patients most at risk could help doctors diagnose and treat them faster. We must remember that statistical methods are intended to HELP diagnosticians, not replace them as unquestionable oracles. 

*Note*: During the analysis, certain variables may be removed for the sake of the model. I will apply what I have learned from textbooks, with **Logistic Regression** in mind as the default classification method. 


**Major variables:** 
+ **Bilirubin** is a bile pigment that comes from the breakdown of red blood cells. An elevated concentration may cause jaundice. 
+ **Alkaline phosphatase** is an enzyme found in the liver. When the liver is damaged, it may leak into the bloodstream, so high levels in the blood can indicate liver disease. 
+ **Alanine aminotransferase** (`Alamine_Aminotransferase` in the dataset): normal test results range from 7 to 55 units per liter. 
+ **Aspartate_Aminotransferase**: normal ranges are: 10-40 units/L (males), 9-32 units/L (females). 
+ In people with badly damaged livers, **proteins** are not properly processed. 
+ Low **albumin** levels can indicate a problem with liver or kidneys. 
+ **Globulins** play an important role in liver function, blood clotting, and fighting infection. Low globulin levels can be a sign of liver or kidney disease. High levels may indicate infection, inflammatory disease or immune disorders.

**Dataset variable:**
+ 1-liver patient
+ 2-non liver patient

The Dataset variable will be renamed and its values recoded.


# ANALYSIS OVERVIEW 🐱‍👤

1. Loading data and packages
2. First look
3. Missing values
4. Fix dataset
5. Dividing the dataset into categorical and quantitative variables
6. Operation on categorical variables
   1. Value counts for Gender & Liver, Disease among Gender, Barplot
7. Operation on quantitative variables
   1. Descriptive statistics
   2. Coefficient of variation
   3. Kurtosis
   4. Skewness
   5. Normality test
   6. Outliers
   7. Pearson correlation coefficients
8.  PCA
9.  Logistic Regression
    1.  Splitting data to X & y
    2.  Models
    3.  Comparison of results
10. Conclusions

# Loading data and packages

In [None]:
!pip install factor_analyzer

In [None]:
# packages

import pandas as pd
import matplotlib as plt
import seaborn as sns 
import numpy as np
from scipy.stats import kurtosis, skew, shapiro, zscore
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# plot_confusion_matrix is not used below and was removed in scikit-learn 1.2, so it is not imported
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, balanced_accuracy_score, roc_curve, roc_auc_score

In [None]:
df = pd.read_csv("../input/indian-liver-patient-records/indian_liver_patient.csv")

# First look

In [None]:
df.head(10)

In [None]:
# What is the Dtype of our variables?
df.info()

We have one variable of class "object", but remember that the variable "Dataset" is a qualitative (categorical) variable too, even though it is stored as a number.

In [None]:
# How many rows and columns do we have?
df.shape

# Missing values

Databases sometimes contain gaps. Handling them is quite a complex issue, so before we discuss it, we should check how many missing values our database contains. <br />
If the number is small, the simplest method is to remove the affected cases and not bother with the theory and validity of imputation methods.

In [None]:
df.isnull().sum()

Of the nearly 600 cases, only four have missing data in one column. Let's see how they look...

In [None]:
df[df.isna().any(axis=1)]

These four cases are spread evenly across the *Gender* and *Dataset* categories, so removing them should not harm us. 

In [None]:
# Just drop NA 
df = df.dropna()

# Fix dataset

This step is a bit convoluted, but it will prove very helpful in the following sections. <br />
First, we copy the current database as *df_c* and replace its Dataset values with Sick and Healthy, respectively. Next, in *df*, we rename the Gender column to Male (1 - yes, 0 - no) and, similarly, rename the Dataset column to Target (1 - liver disease, 0 - healthy).

In [None]:
# copy 
df_c = df.copy()

In [None]:
# For df_c
df_c["Dataset"] = df["Dataset"].map({1:"Sick", 2:"Healthy"})

# For df
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['Dataset'] = df['Dataset'].map({1: 1, 2: 0})
df.rename(columns={'Gender': 'Male'}, inplace=True)
df.rename(columns={'Dataset': 'Target'}, inplace=True)

# Dividing the dataset into categorical and quantitative variables

As a person closely related to the sciences, I have to say one important and unpleasant thing: you MUST NOT perform certain mathematical operations on qualitative variables, for example, the kind that *describe()* does. <br />
Before calculations, data should be divided in such a way that some operations can be performed on qualitative variables and others on quantitative variables.

We only need *df_c* for one purpose, to perform operations on qualitative variables, so we can remove all other variables from it.

*df* is our main database, so for operations on quantitative variables I will create an additional copy of it, containing only quantitative variables.

In [None]:
df_c.drop(columns=['Age', 'Total_Bilirubin', 'Direct_Bilirubin',
                   'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
                   'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
                   'Albumin_and_Globulin_Ratio'], inplace=True)

In [None]:
df_quantitative = df.drop(columns=["Male", "Target"])

# Operation on categorical variables

The categorical variables we have in the database are expressed on a nominal scale. This means that we can only perform the following operations on them: 
+ counting, 
+ calculating fractions, 
+ calculating mode.
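
The three operations listed above can be sketched on a toy nominal variable. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy nominal variable (hypothetical values, not taken from the dataset)
gender = pd.Series(["Male", "Female", "Male", "Male", "Female"], name="Gender")

counts = gender.value_counts()                   # counting
fractions = gender.value_counts(normalize=True)  # calculating fractions
mode = gender.mode()[0]                          # most frequent category

print(counts, fractions, mode, sep="\n")
```

Note that `mode()` returns a Series because ties are possible, which is why the first element is taken here.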

In [None]:
# Fraction for Gender
df_c.Gender.value_counts(normalize=True)

In [None]:
# Fraction for Dataset
df_c.Dataset.value_counts(normalize=True)

In [None]:
# Disease and Gender values
df_c.groupby("Dataset").Gender.value_counts()

In [None]:
# Fraction of healthy/sick by Gender
pd.crosstab(df_c['Gender'], df_c['Dataset']).apply(lambda r: r/r.sum()*100, axis=1)

In [None]:
plt.rcParams['figure.figsize'] = [10, 8]  # for size
sns.countplot(x="Gender", hue="Dataset", data=df_c).set_title("Liver disease by gender")

The conclusions we can draw from the above are: 
+ the database is mainly composed of men (76%) and sick people (72%),
+ despite the large difference in group sizes, the share of sick women is only about 9 percentage points lower than the share of sick men,
+ our database is heavily unbalanced.

# Operation on quantitative variables

One of the most popular functions for initial review of QUANTITATIVE data is "describe()". Most often its result is not discussed, but we are tempted to give a brief comment.

In [None]:
df_quantitative.describe()

**Mean +/- std:** 4 variables have a std greater than the mean, which indicates very high dispersion in our data. Alkaline_P also has a very high std, comparable to its mean. <br />
**Max/Min:** Keeping in mind that the normal range is 7-55 for Alamine and 10-40 (M) / 9-32 (F) for Aspartate, it is worrying to see maximum values in the thousands, while the minimums are within the normal range.

With the above, I believe the database will contain a great number of outliers.

In [None]:
# coefficient of variation
def cv(x): return np.std(x) / np.mean(x) * 100
df_quantitative.apply(cv)

$ V = \frac{std}{mean}*100 $

The coefficient of variation is often used to judge whether a variable is likely to be important to the model. In our case, the Protein and Albumin variables do not have a very high V compared to the other variables. So if they correlate strongly with other variables in the database, they will most likely be excluded.

In [None]:
# kurtosis
df_quantitative.apply(kurtosis, bias=False)

Kurtosis describes how heavy the tails of a distribution are, which makes it a useful indicator of outliers. The higher its value, the more likely there are outliers in the database. 
The lower the value, the more the results are clustered around the mean.

In our case, five variables exceed the safe threshold of $|K| = 3$, Aspartate and Alamine by a very wide margin. This strongly suggests that there will be many outliers in the database.
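
When reading kurtosis values it helps to know which definition is in use. The sketch below uses synthetic samples (an assumption, not the liver data) to show that `scipy.stats.kurtosis` returns excess kurtosis by default (`fisher=True`), so a normal sample scores close to 0, while `fisher=False` gives the Pearson definition, where a normal sample scores close to 3.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
heavy_tailed = rng.standard_t(df=5, size=100_000)  # Student's t, heavier tails than normal

excess = kurtosis(normal_sample)                 # Fisher definition (default): normal -> about 0
pearson = kurtosis(normal_sample, fisher=False)  # Pearson definition: normal -> about 3
excess_heavy = kurtosis(heavy_tailed)            # clearly positive for heavy tails

print(excess, pearson, excess_heavy)
```

The two definitions differ by exactly 3, so any threshold has to be read against the definition that produced the numbers.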

In [None]:
# skewness
df_quantitative.apply(skew, bias=False)

The skewness for most variables is positive, indicating that the distribution has a long right tail.

Based on the previous results I believe that almost none of the variables is normally distributed; to check this, we will perform the Shapiro-Wilk normality test.

In [None]:
# alpha = 0.05
# H0 = The sample comes from a normal distribution.
# H1 = The sample is not coming from a normal distribution.

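for col in df_quantitative:
    # Shapiro-Wilk test on a single column; shapiro returns (statistic, p-value)
    statistic, p_value = shapiro(df_quantitative[col])
    # Reject H0 when the p-value is below alpha = 0.05
    print(col, "-> H1" if p_value < 0.05 else "-> H0")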

## Outliers

We have already determined that our database is likely to contain a significant number of outliers, so to confirm this we will draw a boxplot.

In [None]:
sns.boxplot(data=df_quantitative, orient="h").set_title("Plot showing outliers")

To find outliers we will use the interquartile range $IQR = Q_3 - Q_1$. The outlier observations are below the lower bound defined as $lb = Q_1 - 1.5*IQR$ and above the upper bound defined as $ub = Q_3+1.5*IQR$.

In [None]:
def remove_outliers(df_in):

    Q1 = df_in.quantile(0.25)
    Q3 = df_in.quantile(0.75)
    IQR = Q3 - Q1
    upper_limit = Q3 + 1.5*IQR
    lower_limit = Q1 - 1.5*IQR

    df_clean = df_in[~((df_in < lower_limit) | (df_in > upper_limit)).any(axis=1)]
    
    return df_clean

Having a function defined to remove outliers we will apply it once on a data base containing quantitative variables.

In [None]:
df_clean = remove_outliers(df_quantitative)

In [None]:
sns.boxplot(data=df_clean, orient="h").set_title(
    "Plot showing outliers after the 1st removal of outliers")

Our database unfortunately has many outliers. A lot of them had a high value, so a single procedure didn't give a very good result. 
Therefore, we will create a loop that will repeat the procedure a certain number of times.

*Keep in mind that the written function will not make any changes to the database if there are no outliers left in the database.*
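
Because a pass with no outliers left changes nothing, the repetition could also run until the row count stops shrinking instead of a fixed number of times. The following is a minimal sketch of that idea on toy data; the helper names `remove_outliers_once` and `remove_outliers_until_stable` and the `max_passes` cap are hypothetical and not part of the notebook.

```python
import pandas as pd

def remove_outliers_once(df_in):
    """One pass of the 1.5 * IQR rule, mirroring the function defined above."""
    q1, q3 = df_in.quantile(0.25), df_in.quantile(0.75)
    iqr = q3 - q1
    outside = (df_in < q1 - 1.5 * iqr) | (df_in > q3 + 1.5 * iqr)
    return df_in[~outside.any(axis=1)]

def remove_outliers_until_stable(df_in, max_passes=20):
    """Repeat the pass until no row is removed; return the data and the number of effective passes."""
    current = df_in
    for n_passes in range(1, max_passes + 1):
        trimmed = remove_outliers_once(current)
        if len(trimmed) == len(current):
            return current, n_passes - 1
        current = trimmed
    return current, max_passes

# Toy column: 100 is removed in the first pass, 17 only in the second
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 17, 100]})
stable, passes = remove_outliers_until_stable(toy)
print(len(stable), passes)  # 10 rows left after 2 effective passes
```

The toy column also shows why one pass is not enough: removing the extreme value shrinks the IQR, so a point that was inside the first bounds falls outside the second.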

In [None]:
for i in range(5):
    df_clean = remove_outliers(df_clean)

In [None]:
sns.boxplot(data=df_clean, orient="h").set_title(
    "Plot showing outliers after the 6th removal of outliers")

Based on the plot, we can conclude that there are no more outliers in the database. 
We should now ask the question how many cases we had to remove to reach this state.

In [None]:
print("Number of cases in df:", len(df))
print("Number of cases in df_clean:", len(df_clean))
print("We've removed:", round(100-(len(df_clean)*100/len(df)),2), "percent of rows.")

By removing outliers we have erased almost 80% of the entire database. This is very bad and we could suggest another solution to this problem. 

*For example*, we could replace the variables with the largest spread of values with qualitative variables, e.g. below normal, in normal, above normal, based on the ranges given in the study. 
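
A minimal sketch of such a recoding, assuming the 7-55 units/L reference range quoted earlier for Alamine_Aminotransferase. The helper name `to_reference_level` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def to_reference_level(values, low, high):
    """Label each value relative to an inclusive reference range [low, high]."""
    labels = np.select(
        [values < low, values > high],
        ["below normal", "above normal"],
        default="normal",
    )
    return pd.Series(labels, index=values.index, name=values.name)

# Hypothetical measurements, including one extreme value in the thousands
alt = pd.Series([5, 7, 30, 55, 56, 2000], name="Alamine_Aminotransferase")
alt_level = to_reference_level(alt, low=7, high=55)
print(alt_level.tolist())
```

With this encoding a value of 2000 and a value of 56 land in the same category, so extreme measurements no longer need to be dropped.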

In this notebook, however, we will not do this. We will continue to work with a reduced database.
Using the indexes, we will examine how the qualitative variables for the base look after removing outliers.

In [None]:
df_c_trimmed = df_c[df_c.index.isin(df_clean.index)]

In [None]:
df_c_trimmed.Dataset.value_counts()

In [None]:
df_c_trimmed.Gender.value_counts()

As we can see by removing the outliers we have accidentally solved the problem of strongly unbalanced classes.

At the very end, all that is left is to trim the main database based on the removed outliers.

In [None]:
df_trimmed = df[df.index.isin(df_clean.index)]

## Pearson correlation coefficients

$\rho_{X, Y}=\frac{\operatorname{cov}(X, Y)}{\sigma_{X} \sigma_{Y}}$

The formula above describes the Pearson linear correlation between two variables. We can use it for our quantitative data before and after removing outliers.
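
As a quick check of the formula, the sketch below computes the coefficient by hand on two made-up arrays and compares it with `np.corrcoef`. The data are illustrative only.

```python
import numpy as np

# Made-up, almost perfectly linear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
rho = cov_xy / (x.std() * y.std())                 # population standard deviations (ddof=0)

print(rho, np.corrcoef(x, y)[0, 1])
```

The choice between population and sample estimates does not matter here, because the same normalizing factor appears in the numerator and the denominator and cancels out.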

In [None]:
sns.heatmap(df_quantitative.corr(), annot=True, cmap='coolwarm',
            mask=np.triu(df_quantitative.corr())).set_title("Before removing outliers")

In [None]:
sns.heatmap(df_clean.corr(), annot=True, cmap='coolwarm',
            mask=np.triu(df_clean.corr())).set_title("After removing outliers")

From the correlation results above, it can be seen that removing outliers reduced the correlations in the database. This is good for the logistic regression model, which suffers from multicollinearity, but bad for PCA, which relies on correlated variables.

The only highly correlated variable is Albumin, so we will remove it from the database before building the logistic regression model.

# PCA

...is a popular algorithm for dimensionality reduction. It performs a transformation of our current variables into principal components, the first two/three of which should explain a large enough percentage of the total variance to make the graph helpful in, for example, identifying groups. 

One of the requirements for this algorithm to work properly is that there is a strong correlation between our variables. Our database does not meet this requirement, and to prove this I will use Bartlett's test and KMO criterion.
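
To illustrate why correlation matters, here is a small sketch on synthetic data (an assumption, not the liver data) that compares the share of variance explained by the first principal component for correlated and for independent columns. It uses scikit-learn's `PCA`, which this notebook does not otherwise import.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Correlated case: four columns driven by one shared latent factor plus noise
latent = rng.normal(size=(n, 1))
correlated = latent + 0.3 * rng.normal(size=(n, 4))

# Uncorrelated case: four independent columns
independent = rng.normal(size=(n, 4))

def first_component_share(data):
    """Fraction of total variance explained by the first principal component."""
    scaled = StandardScaler().fit_transform(data)
    return PCA().fit(scaled).explained_variance_ratio_[0]

share_correlated = first_component_share(correlated)
share_independent = first_component_share(independent)
print(share_correlated, share_independent)
```

With independent columns each component explains roughly an equal share of the variance, so dropping components loses information instead of compressing it.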

## Bartlett 

Test the hypothesis that the correlation matrix is equal to the identity matrix. <br />
*H0*: The matrix of population correlations **is equal** to I. <br />
 *H1*: The matrix of population correlations **is not equal** to I.
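
For intuition, the test statistic can be written out by hand as $\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln|R|$ with $p(p-1)/2$ degrees of freedom, where $R$ is the sample correlation matrix, $n$ the number of cases and $p$ the number of variables. The sketch below applies it to synthetic data; the function name `bartlett_sphericity` is hypothetical and is not the factor_analyzer implementation.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    """Bartlett's test of sphericity: returns (chi-square statistic, p-value)."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)

rng = np.random.default_rng(0)
independent = rng.normal(size=(300, 4))                # columns unrelated to each other
shared = rng.normal(size=(300, 1))
correlated = shared + 0.5 * rng.normal(size=(300, 4))  # columns share a common factor

stat_ind, p_ind = bartlett_sphericity(independent)
stat_cor, p_cor = bartlett_sphericity(correlated)
print(stat_ind, p_ind)
print(stat_cor, p_cor)
```

The determinant of a correlation matrix is 1 for perfectly uncorrelated variables and approaches 0 as correlations grow, which is why a large statistic speaks against H0.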

In [None]:
calculate_bartlett_sphericity(df_clean)

## Kaiser-Meyer-Olkin

Calculate the Kaiser-Meyer-Olkin criterion for items and overall. This statistic represents the degree to which each observed variable is predicted, without error, by the other variables in the dataset. In general, a $KMO < 0.6$ is considered inadequate.

In [None]:
kmo_per_variable, kmo_total = calculate_kmo(df_clean)
print("per variable:", kmo_per_variable, "total:", kmo_total)

Based on both tests, it is safe to say that PCA would not help us in any way. 

# Logistic Regression

All necessary theoretical information can be found at this [link](https://en.wikipedia.org/wiki/Logistic_regression).

As I wrote previously we remove the highly correlated variable.

In [None]:
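# Reassign instead of dropping in place: df_trimmed is a slice of df,
# so an in-place drop can trigger pandas' SettingWithCopyWarning
df_trimmed = df_trimmed.drop(columns="Albumin")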

## Splitting data to X & y

Since we practically decimated our database, I decided to build two models: the first on our trimmed df, the second on the entire database.
The trimmed df is too small for typical machine learning purposes (in my opinion), so we will not split it into training and test sets. Its metrics are therefore computed on the same data the model was trained on and will tend to be optimistic.

In [None]:
X = df_trimmed.loc[:, df_trimmed.columns!='Target']
y = df_trimmed.loc[:, 'Target']

...but the df is large enough so we will split it.

In [None]:
X_all = df.loc[:, df.columns!='Target']
y_all = df.loc[:, 'Target']

In [None]:
X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(X_all, y_all, test_size = 0.30, random_state = 0, stratify = y_all)
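
When a table is too small for a single hold-out split, stratified k-fold cross-validation is one common alternative to scoring on the training data. This is a sketch under stated assumptions: the data are a synthetic stand-in generated with `make_classification`, and the sizes (120 rows, 8 features) are made up rather than taken from df_trimmed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a small table (assumed size: 120 rows, 8 features)
X_small, y_small = make_classification(n_samples=120, n_features=8, random_state=0)

# Each row is used for testing exactly once; class proportions are kept in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_small, y_small,
                         cv=cv, scoring="balanced_accuracy")

print("Balanced accuracy per fold:", np.round(scores, 2))
print("Mean:", scores.mean())
```

The spread of the per-fold scores also gives a rough idea of how unstable the estimate is on so few rows.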

## Model

I did not standardize the variables, so I increased the number of iterations to give the solver enough room to converge.
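
For comparison, standardization can be combined with the model in a scikit-learn pipeline, so that the scaler is fitted on the training rows only. This is a sketch on synthetic data (an assumption), not the approach used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature on a much larger scale than the others
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo)

# The scaler learns means and standard deviations from the training set only
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_tr, y_tr)
test_accuracy = scaled_model.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, and the solver usually converges within the default iteration limit on standardized inputs.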

In [None]:
model = LogisticRegression(max_iter=1000)

### Trimmed df

In [None]:
res_1 = model.fit(X, y)
y_predict_1 = model.predict(X)
confusion_matrix(y_pred=y_predict_1,y_true=y)

In [None]:
print("Accuracy:", accuracy_score(y, y_predict_1))
print("Precision:", precision_score(y, y_predict_1))
print("Recall:", recall_score(y, y_predict_1))
print("Balanced accuracy score:", balanced_accuracy_score(y, y_predict_1))

In [None]:
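# AUC must be computed from predicted probabilities, not from hard 0/1 predictions
logit_roc_auc_1 = roc_auc_score(y, res_1.predict_proba(X)[:, 1])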
fpr_1, tpr_1, thresholds_1 = roc_curve(y, res_1.predict_proba(X)[:, 1])
plt.pyplot.plot(fpr_1, tpr_1, label='Logistic Regression (area = %0.2f)' % logit_roc_auc_1)
plt.pyplot.plot([0, 1], [0, 1], 'r--')
plt.pyplot.xlim([0.0, 1.0])
plt.pyplot.ylim([0.0, 1.05])
plt.pyplot.xlabel('False Positive Rate')
plt.pyplot.ylabel('True Positive Rate')
plt.pyplot.title('Receiver operating characteristic for df_trimmed')
plt.pyplot.legend(loc="lower right")

### df

In [None]:
res_2 = model.fit(X_train_all, y_train_all)
y_predict_2 = model.predict(X_test_all)
confusion_matrix(y_pred=y_predict_2, y_true=y_test_all)

In [None]:
print("Accuracy:", accuracy_score(y_test_all, y_predict_2))
print("Precision:", precision_score(y_test_all, y_predict_2))
print("Recall:", recall_score(y_test_all, y_predict_2))
print("Balanced accuracy score:", balanced_accuracy_score(y_test_all, y_predict_2))

In [None]:
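# AUC must be computed from predicted probabilities, not from hard 0/1 predictions
logit_roc_auc_2 = roc_auc_score(y_test_all, res_2.predict_proba(X_test_all)[:, 1])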
fpr_2, tpr_2, thresholds_2 = roc_curve(y_test_all, res_2.predict_proba(X_test_all)[:, 1])
plt.pyplot.plot(fpr_2, tpr_2, label='Logistic Regression (area = %0.2f)' % logit_roc_auc_2)
plt.pyplot.plot([0, 1], [0, 1], 'r--')
plt.pyplot.xlim([0.0, 1.0])
plt.pyplot.ylim([0.0, 1.05])
plt.pyplot.xlabel('False Positive Rate')
plt.pyplot.ylabel('True Positive Rate')
plt.pyplot.title('Receiver operating characteristic for df_all')
plt.pyplot.legend(loc="lower right")

## Comparison of results

In [None]:
data = {"df_trimmed": [accuracy_score(y, y_predict_1), precision_score(y, y_predict_1), recall_score(y, y_predict_1), balanced_accuracy_score(y, y_predict_1)],
        "df": [accuracy_score(y_test_all, y_predict_2), precision_score(y_test_all, y_predict_2), recall_score(y_test_all, y_predict_2), balanced_accuracy_score(y_test_all, y_predict_2)]}

comparison = pd.DataFrame(data, index = ["Accuracy", "Precision", "Recall", "Balanced accuracy"])
print(comparison)

In [None]:
# Target is coded 0/1, so its mean is the share of liver patients; multiply by 100 for a percentage
print("Liver patients percentage in df_trimmed:", round(100 * df_trimmed.Target.mean(), 1))
print("Liver patients percentage in df:", round(100 * df.Target.mean(), 1))

# Conclusions

The first conclusion we can draw: DELETING almost 80% of our database because of outliers is a bad idea. We really shouldn't be doing this.

The results of the two models differ only slightly, by about 10% on average. This means that laboriously checking assumptions, eliminating outliers, removing a highly correlated variable, etc. produced a poor end result.

Of course, it should be noted that in our decimated database about 51% of cases had a diseased liver, and the accuracy of this model is 61%. With this model we therefore slightly improve on the assumption that all patients have the disease.
For the entire database, patients with a diseased liver make up about 72% of the cases, and our model has an accuracy of 70%. Judging by this rate alone, it makes no difference whether we use the model or simply assume that everyone has the disease.
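
The comparison with "assume everyone is sick" can be made explicit with a majority-class baseline. The labels below are hypothetical and only mimic the roughly 72% / 28% split mentioned above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 72 liver patients (1) and 28 healthy patients (0)
y_true = np.array([1] * 72 + [0] * 28)
y_majority = np.ones_like(y_true)  # always predict "liver disease"

baseline_accuracy = accuracy_score(y_true, y_majority)           # 0.72
baseline_balanced = balanced_accuracy_score(y_true, y_majority)  # 0.5

print(baseline_accuracy, baseline_balanced)
```

Plain accuracy rewards the trivial rule with 0.72, while balanced accuracy exposes it as no better than chance, which is why the latter is the more informative metric on unbalanced classes.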

The next step in further analysis should be to convert the variables with many outliers into qualitative variables, and then to apply a method for dealing with strongly unbalanced classes.
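
One possible method for the unbalanced classes is to reweight them inside the model with scikit-learn's `class_weight="balanced"` option. The sketch below runs on synthetic data with an assumed 28% / 72% class split; it is a starting point for the follow-up analysis, not a result from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic unbalanced data: class 0 is the minority (about 28% of rows)
X_imb, y_imb = make_classification(n_samples=1000, n_features=8,
                                   weights=[0.28, 0.72], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.30, random_state=0, stratify=y_imb)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# "balanced" weights each class inversely to its frequency in the training data
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_minority_plain = recall_score(y_te, plain.predict(X_te), pos_label=0)
recall_minority_weighted = recall_score(y_te, weighted.predict(X_te), pos_label=0)
print(recall_minority_plain, recall_minority_weighted)
```

Reweighting typically trades some overall accuracy for better recall on the minority class, so both should be reported side by side.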