## Dataset Source
This dataset comes from the UCI ML Repository:
https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

## Dataset Information
This training dataset contains 467 records: 333 liver patient records and 134 non-liver-patient records. The data were collected in the north east of Andhra Pradesh, India. The Label field is a class label that divides the records into two groups (liver patient or not). Any patient whose age exceeded 89 is listed as being of age "90".

## Attribute Information:
1. Age: Age of the patient
2. Gender: Gender of the patient
3. TB: Total Bilirubin
4. DB: Direct Bilirubin (also called conjugated bilirubin)
5. Alkphos: Alkaline Phosphatase
6. Sgpt: Alanine Aminotransferase (GPT)
7. Sgot: Aspartate Aminotransferase (GOT)
8. TP: Total Proteins
9. ALB: Albumin
10. A/G Ratio: Albumin and Globulin Ratio
11. Label: Class label that divides the records into two groups (liver patient or not)

## Additional information
[How to read a liver function test report] (in Chinese)
https://www.jah.org.tw/form/index-1.asp?m=3&m1=8&m2=366&gp=361&id=522


### Download the training set

In [None]:
# Download from Google Drive
!gdown --id 1Y2gYY8XUWgcIA_GbytBuXoRkLlAWxnAF

Downloading...
From: https://drive.google.com/uc?id=1Y2gYY8XUWgcIA_GbytBuXoRkLlAWxnAF
To: /content/project1_indian_liver_patient.zip
100% 8.37k/8.37k [00:00<00:00, 11.5MB/s]


In [None]:
!unzip project1_indian_liver_patient.zip
# if seeing the message: "replace project1_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:"
# you may enter "A"

Archive:  project1_indian_liver_patient.zip
replace project1_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: project1_test.csv       
replace project1_train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: project1_train.csv      


In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('project1_train.csv')
df.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Label'],
      dtype='object')

In [None]:
df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Label
0,40,Female,0.9,0.3,293,232,245,6.8,3.1,0.8,1
1,78,Male,1.0,0.3,152,28,70,6.3,3.1,0.9,1
2,60,Male,2.0,0.8,190,45,40,6.0,2.8,0.8,1
3,75,Male,10.6,5.0,562,37,29,5.1,1.8,0.5,1
4,19,Female,0.7,0.2,186,166,397,5.5,3.0,1.2,1


### The stage is yours

In [None]:
# Data dimensions
df.shape

(467, 11)

In [None]:
# Data types
df.dtypes

Age                             int64
Gender                         object
Total_Bilirubin               float64
Direct_Bilirubin              float64
Alkaline_Phosphotase            int64
Alamine_Aminotransferase        int64
Aspartate_Aminotransferase      int64
Total_Protiens                float64
Albumin                       float64
Albumin_and_Globulin_Ratio    float64
Label                           int64
dtype: object

In [None]:
# Missing values
df.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Label                         0
dtype: int64

In [None]:
df['Albumin_and_Globulin_Ratio'] = df['Albumin_and_Globulin_Ratio'].fillna(df['Albumin_and_Globulin_Ratio'].mean())
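
The cell above fills the four missing ratio values with the column mean. A minimal sketch of computing that mean once on the training data and reusing it elsewhere, assuming (not verified here) that the test file could also contain missing ratios. The frames `train_toy` and `test_toy` are hypothetical stand-ins, not the real data.

```python
import numpy as np
import pandas as pd

COL = "Albumin_and_Globulin_Ratio"

# Hypothetical toy frames standing in for the train and test data.
train_toy = pd.DataFrame({COL: [0.8, np.nan, 1.2, 1.0]})
test_toy = pd.DataFrame({COL: [np.nan, 0.5]})

# Compute the fill value on the training data only; mean() skips NaN.
fill_value = train_toy[COL].mean()

# Reuse the same value for both frames so the test data does not influence it.
train_filled = train_toy.assign(**{COL: train_toy[COL].fillna(fill_value)})
test_filled = test_toy.assign(**{COL: test_toy[COL].fillna(fill_value)})
```

Keeping the fill value in a variable makes it easy to apply the identical imputation at prediction time.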

In [None]:
# Duplicate rows

df_duplicados = df[df.duplicated(keep = False)]
df_duplicados

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Label
14,30,Male,1.6,0.4,332,84,139,5.6,2.7,0.9,1
22,72,Male,0.7,0.1,196,20,35,5.8,2.0,0.5,1
28,42,Male,8.9,4.5,272,31,61,5.8,2.0,0.5,1
139,72,Male,0.7,0.1,196,20,35,5.8,2.0,0.5,1
149,58,Male,1.0,0.5,158,37,43,7.2,3.6,1.0,1
220,30,Male,1.6,0.4,332,84,139,5.6,2.7,0.9,1
222,58,Male,1.0,0.5,158,37,43,7.2,3.6,1.0,1
224,38,Female,2.6,1.2,410,59,57,5.6,3.0,0.8,0
239,36,Male,0.8,0.2,158,29,39,6.0,2.2,0.5,0
278,49,Male,0.6,0.1,218,50,53,5.0,2.4,0.9,1


In [None]:
df_duplicados.shape

(16, 11)
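
`duplicated(keep=False)` flags every member of a duplicated group, so the 16 rows counted above include both the first occurrences and their copies. The source does not say whether these are repeated entries or different patients with identical values, so dropping them is a judgment call. A small sketch on a hypothetical toy frame of how the two counts differ and how repeats could be dropped:

```python
import pandas as pd

# Hypothetical toy frame: rows 0/2 and rows 1/4 are identical.
toy = pd.DataFrame(
    {
        "Age": [30, 72, 30, 58, 72],
        "Gender": ["Male", "Male", "Male", "Male", "Male"],
        "Label": [1, 1, 1, 1, 1],
    }
)

# Every row that belongs to a duplicated group.
all_members = toy.duplicated(keep=False).sum()

# Only the repeats, i.e. rows that would actually be removed.
extra_copies = toy.duplicated(keep="first").sum()

# Keep the first occurrence of each group and renumber the index.
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)
```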

In [None]:
# Numeric variables
df.describe()

Unnamed: 0,Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Label
count,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0
mean,44.64454,3.40364,1.497859,290.687366,77.751606,111.413276,6.531049,3.141756,0.928683,0.713062
std,15.797878,6.497494,2.81543,240.896655,172.933719,305.893026,1.087601,0.80326,0.313539,0.452817
min,4.0,0.5,0.1,75.0,10.0,10.0,2.7,0.9,0.3,0.0
25%,33.0,0.8,0.2,177.0,23.0,25.0,5.8,2.6,0.7,0.0
50%,45.0,1.0,0.3,208.0,35.0,41.0,6.6,3.1,0.9,1.0
75%,57.0,2.65,1.3,298.0,61.0,88.0,7.2,3.8,1.1,1.0
max,78.0,75.0,19.7,2110.0,2000.0,4929.0,9.6,5.5,2.8,1.0


In [None]:
#def binary_encoding(df, column, positive_value):
 #   df = df.copy()
  #  df[column] = df[column].apply(lambda x: 1 if x == positive_value else 0)
   # return df

In [None]:
#df = binary_encoding(df, 'Gender', 'Male')

In [None]:
#data = binary_encoding(df, 'Label', 1)

In [None]:
df.Gender=df.Gender.map({'Male':0,'Female':1})

In [None]:
df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Label
0,40,1,0.9,0.3,293,232,245,6.8,3.1,0.8,1
1,78,0,1.0,0.3,152,28,70,6.3,3.1,0.9,1
2,60,0,2.0,0.8,190,45,40,6.0,2.8,0.8,1
3,75,0,10.6,5.0,562,37,29,5.1,1.8,0.5,1
4,19,1,0.7,0.2,186,166,397,5.5,3.0,1.2,1
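
`Series.map` with a dict returns NaN for any value that is not a key, so applying the same mapping twice to one column silently replaces it with NaN. A minimal sketch of a guard against that; `encode_gender` and `GENDER_CODES` are hypothetical names introduced here for illustration.

```python
import pandas as pd

GENDER_CODES = {"Male": 0, "Female": 1}

def encode_gender(frame: pd.DataFrame, column: str = "Gender") -> pd.DataFrame:
    """Return a copy of `frame` with `column` mapped to 0/1 (hypothetical helper)."""
    encoded = frame[column].map(GENDER_CODES)
    # map() yields NaN for values missing from the dict, e.g. already-encoded 0/1.
    if encoded.isnull().any():
        unexpected = frame.loc[encoded.isnull(), column].unique().tolist()
        raise ValueError(f"Unexpected values in {column!r}: {unexpected}")
    out = frame.copy()
    out[column] = encoded.astype(int)
    return out

toy = pd.DataFrame({"Gender": ["Female", "Male", "Male"]})
encoded_toy = encode_gender(toy)
```

Calling `encode_gender(encoded_toy)` a second time raises `ValueError` instead of producing a column of NaN.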


In [None]:
df.shape

(467, 11)

In [None]:
#data.shape

(467, 11)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
x = df.drop('Label', axis=1)
y = df['Label']

In [None]:
#scaler = StandardScaler()
#X = scaler.fit_transform(X)
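
The scaling step above is commented out (and it refers to `X`, while the features are stored in `x`). If scaling is wanted, one option is to wrap the scaler and the classifier in a pipeline so that the scaler is fit on the training split only. The sketch below uses synthetic data from `make_classification` rather than the liver dataset, and is an illustration rather than the notebook's own method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 10 features, like the feature matrix here.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# fit() learns the scaling statistics from X_tr only, then trains the classifier.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

# score() applies the training-set scaling to X_te before predicting.
toy_accuracy = scaled_model.score(X_te, y_te)
```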

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2,stratify=y,random_state=111111)
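
Passing `stratify=y` keeps the class proportions nearly equal in the train and test splits. A small sketch that reuses the label counts of this dataset (333 and 134) with a placeholder feature column; `X_toy` and `y_toy` are hypothetical arrays, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the training file.
y_toy = np.array([1] * 333 + [0] * 134)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Share of Label == 1 overall and in each split.
overall_rate = y_toy.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
```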

In [None]:
x.shape

(467, 10)

In [None]:
y.value_counts()

1    333
0    134
Name: Label, dtype: int64
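
Because the classes are imbalanced, a model that always predicts the majority class already reaches a fairly high accuracy. The arithmetic below uses only the counts printed above and gives a reference point for judging any accuracy reported later; an accuracy close to this value may mean the model mostly predicts the majority class.

```python
n_patients = 333   # records with Label == 1
n_controls = 134   # records with Label == 0

total = n_patients + n_controls

# Accuracy of always predicting the larger class on the full training file.
majority_baseline = max(n_patients, n_controls) / total
```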

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(max_iter=1000)

In [None]:
res = model.fit(X_train, y_train)

In [None]:
y_predict = model.predict(X_test)
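
The predictions in `y_predict` are not scored in the cell above. A minimal sketch of how predictions can be compared against true labels with `accuracy_score` and `confusion_matrix`, using hypothetical toy labels instead of the real split:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions.
y_true_toy = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_toy = [1, 1, 1, 1, 0, 1, 1, 1]

# Fraction of predictions that match the true label.
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes, both ordered [0, 1].
toy_confusion = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
```

The confusion matrix shows which class the errors fall on, which accuracy alone hides when one class dominates.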

In [None]:
# Load the test data before converting its Gender column
df_test = pd.read_csv('project1_test.csv')
df_test.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio'],
      dtype='object')

In [None]:
#def binary_encoding(df_test, column, positive_value):
 #   df_test = df_test.copy()
  #  df_test[column] = df_test[column].apply(lambda x: 1 if x == positive_value else 0)
   # return df_test

In [None]:
#df_test = binary_encoding(df, 'Gender', 'Male')

In [None]:
# df.Gender = df.Gender.map({'Male': 0, 'Female': 1})  # already applied to df above; re-running would turn Gender into NaN

In [None]:
df_test.shape

(116, 10)

In [None]:
#data = binary_encoding(df, 'Label', 1)

In [None]:
df_test.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
0,18,Male,0.8,0.2,282,72,140,5.5,2.5,0.8
1,34,Male,4.1,2.0,289,875,731,5.0,2.7,1.1
2,49,Male,2.0,0.6,209,48,32,5.7,3.0,1.1
3,65,Male,7.9,4.3,282,50,72,6.0,3.0,1.0
4,40,Female,0.9,0.3,293,232,245,6.8,3.1,0.8


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size = 0.8)

In [None]:
svc_model = SVC()  # use a separate name so the SVC class is not overwritten
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)
accuracy_score(y_test,y_pred)

0.7127659574468085

In [None]:
df_test.Gender=df_test.Gender.map({'Male':0,'Female':1}) # convert Gender to a numeric type

### Make prediction and submission file

In [None]:
df_test=pd.read_csv('project1_test.csv')
df_test.Gender=df_test.Gender.map({'Male':0,'Female':1})
df_submit = pd.DataFrame([], columns=['Id', 'Category'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(df_test))]
df_submit['Category'] = svc_model.predict(df_test)

In [None]:
df_submit.to_csv('submission_1.csv', index=None)
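
The `Id` values are zero-padded strings such as `000`. When a file like this is read back with pandas defaults, the column is parsed as integers and the padding is lost. A small sketch of checking a submission round trip, using a hypothetical three-row frame and an in-memory buffer instead of `submission_1.csv`:

```python
import io
import pandas as pd

# Hypothetical submission with the same column layout.
toy_submit = pd.DataFrame(
    {"Id": [f"{i:03d}" for i in range(3)], "Category": [1, 0, 1]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_submit.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default parsing turns "000" into the integer 0.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype preserves the zero padding.
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"Id": str})
```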