___

## Indian Liver Project 

**Data Set Information**:

- This data set contains 416 liver patient records and 167 non-liver patient records. The data set was collected from test samples in the north east of Andhra Pradesh, India.
    - `is_patient` is a class label used to divide the records into two groups (liver patient or not).
    - This data set contains 441 male patient records and 142 female patient records.
    - Any patient whose age exceeded 89 is listed as being of age "90".
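
As a quick arithmetic check on the counts above, the short sketch below derives the total number of records and the share of the larger class. That share is the accuracy a trivial model would reach on the full data set by always predicting the majority class, so it is a useful reference point when reading accuracy scores. The variable names are illustrative and not part of the original notebook.

```python
# Class counts quoted in the data set description
n_patients = 416
n_non_patients = 167

total = n_patients + n_non_patients

# Accuracy of a trivial model that always predicts the larger class
majority_baseline = max(n_patients, n_non_patients) / total

print(f"{total} records, majority-class share: {majority_baseline:.3f}")
```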

**Attribute Information**:

- **age**: Age of the patient
- **gender**: Gender of the patient
- **tot_bilirubin**: Total Bilirubin
- **direct_bilirubin**: Direct Bilirubin
- **alkphos**: Alkaline Phosphatase
- **sgpt**: Alanine Aminotransferase
- **sgot**: Aspartate Aminotransferase
- **tot_proteins**: Total Proteins
- **albumin**: Albumin
- **ag_ratio**: Albumin and Globulin Ratio
- **is_patient**: Selector field used to split the data into two sets (labeled by the experts)

**[data source](https://www.kaggle.com/jeevannagaraj/indian-liver-patient-dataset)**

In [21]:
## =======================================================
#      Importing Necessary Tools For the project
## =======================================================

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score,  roc_auc_score
from sklearn.model_selection import train_test_split

# Import models that make the ensemble 
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import RandomForestClassifier

# Import Preprocessing tools
from sklearn.preprocessing import StandardScaler

# Import VotingClassifier
from sklearn.ensemble import VotingClassifier

In [22]:
import warnings
warnings.filterwarnings('ignore')

In [23]:
## ================================================
#    Read and explore the data
# =================================================
liver = pd.read_csv('liver.csv')
liver.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               583 non-null    int64  
 1   gender            583 non-null    object 
 2   tot_bilirubin     583 non-null    float64
 3   direct_bilirubin  583 non-null    float64
 4   tot_proteins      583 non-null    int64  
 5   albumin           583 non-null    int64  
 6   ag_ratio          583 non-null    int64  
 7   sgpt              583 non-null    float64
 8   sgot              583 non-null    float64
 9   alkphos           579 non-null    float64
 10  is_patient        583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


In [24]:
liver.head()

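| | age | gender | tot_bilirubin | direct_bilirubin | tot_proteins | albumin | ag_ratio | sgpt | sgot | alkphos | is_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |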


In [25]:
# Checking for missing data
# ------------------------
liver.isnull().sum()

age                 0
gender              0
tot_bilirubin       0
direct_bilirubin    0
tot_proteins        0
albumin             0
ag_ratio            0
sgpt                0
sgot                0
alkphos             4
is_patient          0
dtype: int64

In [26]:
liver[liver['alkphos'].isnull()]

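| | age | gender | tot_bilirubin | direct_bilirubin | tot_proteins | albumin | ag_ratio | sgpt | sgot | alkphos | is_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 209 | 45 | Female | 0.9 | 0.3 | 189 | 23 | 33 | 6.6 | 3.9 | NaN | 1 |
| 241 | 51 | Male | 0.8 | 0.2 | 230 | 24 | 46 | 6.5 | 3.1 | NaN | 1 |
| 253 | 35 | Female | 0.6 | 0.2 | 180 | 12 | 15 | 5.2 | 2.7 | NaN | 2 |
| 312 | 27 | Male | 1.3 | 0.6 | 106 | 25 | 54 | 8.5 | 4.8 | NaN | 2 |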


### Missing data Imputation

The missing data points are coded correctly as `NaN`. Because only four values are missing, we could simply drop those rows. Instead, we are going to use a simple imputation method, **mean imputation**.

To perform **mean imputation** we can use either the pandas `fillna()` method, which we have seen before, or `SimpleImputer` from `sklearn.impute`, which is the one we are going to use here.

**Here are the steps for doing simple imputation**:
 - Import `SimpleImputer` from `sklearn.impute`.
 - Instantiate the **imputer**.
 - Set the imputation strategy, for example `strategy='mean'`.
 - Fit and transform the data using the imputer's `fit_transform()` method.
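
For comparison, the `fillna()` route mentioned above can be sketched on a small toy frame. This is a minimal illustration only: the `toy` DataFrame and its values are made up and stand in for the `alkphos` column of the liver data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the liver data (values are made up for illustration)
toy = pd.DataFrame({"alkphos": [0.9, 0.74, np.nan, 1.0]})

# Mean of the observed values; pandas skips NaN by default
col_mean = toy["alkphos"].mean()

# Replace every missing entry with that mean
toy["alkphos"] = toy["alkphos"].fillna(col_mean)

print(toy)
```

Both approaches fill the gaps with the column mean. `SimpleImputer` additionally stores the fitted mean, so the same value can later be applied to new data with `transform()`.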

In [27]:
## ======================================
#     Missing Data Imputation
## ======================================

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# Imputing values
# ===============
# fit_transform returns a 2-D array of shape (n_rows, 1); flatten it before assigning
liver['alkphos'] = imp.fit_transform(liver[['alkphos']]).ravel()

# Check the missing data again
#===========================
liver['alkphos'].isnull().sum()

0

In [28]:
## ============================================
#         Encoding the gender example
##==============================================
liver['gender'].unique()

array(['Female', 'Male'], dtype=object)

In [29]:
# Here I am going to use the apply method to code the gender variable:
# Female = 0 and Male = 1

liver['gender'] = liver['gender'].apply(lambda x:1 if x == 'Male' else 0)
liver['gender'].unique()

array([0, 1])
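
An alternative worth knowing is `Series.map` with an explicit dictionary. The sketch below uses a toy series in place of `liver['gender']`; the names `gender` and `gender_codes` are illustrative. One difference from the `apply` version is that a label missing from the dictionary becomes `NaN` rather than being silently coded as 0, which can make unexpected categories easier to spot.

```python
import pandas as pd

# Toy series standing in for liver['gender'] (values are illustrative)
gender = pd.Series(["Female", "Male", "Male", "Female"])

# Explicit mapping: Female = 0 and Male = 1
gender_codes = gender.map({"Female": 0, "Male": 1})

print(gender_codes.tolist())
```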

In this tutorial, we will skip data exploration and focus on how the ensemble algorithm works.

In [30]:
#Now it's time to see the correlation
# ====================================
liver.corr().style.background_gradient(cmap='PuBu')

| | age | gender | tot_bilirubin | direct_bilirubin | tot_proteins | albumin | ag_ratio | sgpt | sgot | alkphos | is_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.05656 | 0.011763 | 0.007529 | 0.080425 | -0.086883 | -0.01991 | -0.187461 | -0.265924 | -0.216089 | -0.137351 |
| gender | 0.05656 | 1.0 | 0.089291 | 0.100436 | -0.027496 | 0.082332 | 0.080336 | -0.089121 | -0.093799 | -0.003404 | -0.082416 |
| tot_bilirubin | 0.011763 | 0.089291 | 1.0 | 0.874618 | 0.206669 | 0.214065 | 0.237831 | -0.008099 | -0.22225 | -0.206159 | -0.220208 |
| direct_bilirubin | 0.007529 | 0.100436 | 0.874618 | 1.0 | 0.234939 | 0.233894 | 0.257544 | -0.000139 | -0.228531 | -0.200004 | -0.246046 |
| tot_proteins | 0.080425 | -0.027496 | 0.206669 | 0.234939 | 1.0 | 0.12568 | 0.167196 | -0.028514 | -0.165453 | -0.23396 | -0.184866 |
| albumin | -0.086883 | 0.082332 | 0.214065 | 0.233894 | 0.12568 | 1.0 | 0.791966 | -0.042518 | -0.029742 | -0.002374 | -0.163416 |
| ag_ratio | -0.01991 | 0.080336 | 0.237831 | 0.257544 | 0.167196 | 0.791966 | 1.0 | -0.025645 | -0.08529 | -0.070024 | -0.151934 |
| sgpt | -0.187461 | -0.089121 | -0.008099 | -0.000139 | -0.028514 | -0.042518 | -0.025645 | 1.0 | 0.784053 | 0.233904 | 0.035008 |
| sgot | -0.265924 | -0.093799 | -0.22225 | -0.228531 | -0.165453 | -0.029742 | -0.08529 | 0.784053 | 1.0 | 0.686322 | 0.161388 |
| alkphos | -0.216089 | -0.003404 | -0.206159 | -0.200004 | -0.23396 | -0.002374 | -0.070024 | 0.233904 | 0.686322 | 1.0 | 0.162319 |


In [31]:
X = liver.drop('is_patient', axis=1)
y = liver['is_patient'].values

X_train, X_test, y_train, y_test = train_test_split(
                                    X, y,
                                    test_size= 0.3,
                                    random_state= 1)

In [32]:
X_train.head()

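| | age | gender | tot_bilirubin | direct_bilirubin | tot_proteins | albumin | ag_ratio | sgpt | sgot | alkphos |
|---|---|---|---|---|---|---|---|---|---|---|
| 440 | 49 | 0 | 0.8 | 0.2 | 198 | 23 | 20 | 7.0 | 4.3 | 1.5 |
| 110 | 24 | 0 | 0.7 | 0.2 | 188 | 11 | 10 | 5.5 | 2.3 | 0.71 |
| 396 | 74 | 1 | 1.0 | 0.3 | 175 | 30 | 32 | 6.4 | 3.4 | 1.1 |
| 311 | 54 | 0 | 23.2 | 12.6 | 574 | 43 | 47 | 7.2 | 3.5 | 0.9 |
| 395 | 45 | 1 | 0.8 | 0.2 | 140 | 24 | 20 | 6.3 | 3.2 | 1.0 |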


In [33]:
# Scaling the data
# ---------------

scaler=StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)


In [34]:

# Instantiate logreg
logreg = LogisticRegression(random_state=1)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf= 0.13, 
                            random_state=1)

# Define the list of (name, classifier) pairs
classifiers = [('Logistic Regression', logreg), 
               ('K Nearest Neighbours', knn), 
               ('Classification Tree', dt)]

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy; accuracy_score expects (y_true, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
   
    # Evaluate clf's accuracy on the test set
    print('{:20}: {:.3f}'.format(clf_name,  accuracy))

Logistic Regression : 0.726
K Nearest Neighbours: 0.720
Classification Tree : 0.726


In [35]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Predict the test set labels
y_pred = vc.predict(X_test)

# Calculate accuracy score; accuracy_score expects (y_true, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.737
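
By default `VotingClassifier` uses hard voting: each fitted classifier predicts a label, and the label chosen by the majority wins. The sketch below reproduces that idea in plain NumPy. It is an illustration only: the `preds` array holds made-up predictions from three hypothetical classifiers, and `majority_vote` is a helper written for this example, not a scikit-learn function.

```python
import numpy as np

# Hypothetical class predictions from three classifiers for five samples
# (labels 1 and 2, matching the coding of is_patient in this data set)
preds = np.array([
    [1, 1, 2, 1, 2],   # classifier A
    [1, 2, 2, 1, 1],   # classifier B
    [2, 1, 2, 1, 1],   # classifier C
])

def majority_vote(predictions):
    """Return the most frequent label in each column (one column per sample)."""
    voted = []
    for sample_preds in predictions.T:
        labels, counts = np.unique(sample_preds, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

ensemble_pred = majority_vote(preds)
print(ensemble_pred)
```

With three classifiers and two classes a tie cannot occur, so every sample gets the label that at least two of the three models agree on. This is why the ensemble can outperform each individual model when their errors do not fully overlap.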
