
## Dataset Source
This dataset comes from the UCI ML Repository:
https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

## Dataset Information
This training set contains 467 records: 333 from liver patients and 134 from non-liver patients. The data were collected in the northeast of Andhra Pradesh, India. The Label field is the class label that separates the two groups (liver patient or not). Any patient older than 89 is listed with an age of 90.
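
Because the two classes are unbalanced, it helps to know what a trivial classifier would score. The sketch below uses only the counts quoted above and computes the accuracy of always predicting the majority class; the variable names are illustrative.

```python
# Class counts quoted in the dataset description.
n_patient = 333
n_non_patient = 134
n_total = n_patient + n_non_patient

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(n_patient, n_non_patient) / n_total

print(f"total records: {n_total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model accuracy on this data is best read against that baseline of roughly 0.713.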

## Attribute Information:
1. Age: Age of the patient
2. Gender: Gender of the patient
3. TB: Total Bilirubin
4. DB: Direct Bilirubin (also called conjugated bilirubin)
5. Alkphos: Alkaline Phosphatase
6. Sgpt: Alanine Aminotransferase (glutamic-pyruvic transaminase, GPT)
7. Sgot: Aspartate Aminotransferase (glutamic-oxaloacetic transaminase, GOT)
8. TP: Total Proteins
9. ALB: Albumin
10. A/G Ratio: Albumin and Globulin Ratio
11. Label: Class label marking each record as liver patient or not

Note: the CSV column names keep the dataset's original spellings (`Alkaline_Phosphotase`, `Alamine_Aminotransferase`, `Total_Protiens`).

## Additional information
[How to interpret a liver function test report] (in Chinese)
https://www.jah.org.tw/form/index-1.asp?m=3&m1=8&m2=366&gp=361&id=522


### Download the training set

In [None]:
# Download from Google Drive
!gdown --id 1Y2gYY8XUWgcIA_GbytBuXoRkLlAWxnAF

Downloading...
From: https://drive.google.com/uc?id=1Y2gYY8XUWgcIA_GbytBuXoRkLlAWxnAF
To: /content/project1_indian_liver_patient.zip
100% 8.37k/8.37k [00:00<00:00, 9.13MB/s]


In [None]:
!unzip project1_indian_liver_patient.zip
# if seeing the message: "replace project1_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:"
# you may enter "A"

Archive:  project1_indian_liver_patient.zip
  inflating: project1_test.csv       
  inflating: project1_train.csv      


In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('project1_train.csv')
df.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Label'],
      dtype='object')

In [None]:
df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Label
0,40,Female,0.9,0.3,293,232,245,6.8,3.1,0.8,1
1,78,Male,1.0,0.3,152,28,70,6.3,3.1,0.9,1
2,60,Male,2.0,0.8,190,45,40,6.0,2.8,0.8,1
3,75,Male,10.6,5.0,562,37,29,5.1,1.8,0.5,1
4,19,Female,0.7,0.2,186,166,397,5.5,3.0,1.2,1


### The stage is yours

In [None]:
# Read the test data
test = pd.read_csv('project1_test.csv')

In [None]:
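# Encode Gender as an integer: Male -> 1, Female -> 0
# .map avoids chained assignment, which can silently fail to modify the frame
gender_codes = {'Male': 1, 'Female': 0}
df['Gender'] = df['Gender'].map(gender_codes)
test['Gender'] = test['Gender'].map(gender_codes)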

test

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
0,18,1,0.8,0.2,282,72,140,5.5,2.5,0.80
1,34,1,4.1,2.0,289,875,731,5.0,2.7,1.10
2,49,1,2.0,0.6,209,48,32,5.7,3.0,1.10
3,65,1,7.9,4.3,282,50,72,6.0,3.0,1.00
4,40,0,0.9,0.3,293,232,245,6.8,3.1,0.80
...,...,...,...,...,...,...,...,...,...,...
111,50,1,5.8,3.0,661,181,285,5.7,2.3,0.67
112,21,1,0.7,0.2,135,27,26,6.4,3.3,1.00
113,27,1,0.7,0.2,243,21,23,5.3,2.3,0.70
114,48,1,0.7,0.2,208,15,30,4.6,2.1,0.80


In [None]:
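# Count missing values per column
print(df.isnull().sum())

# Fill missing values with the training-set column means
# (assumption: the test set is imputed with the same means so predict() receives no NaN)
col_means = df.mean(numeric_only=True)
df.fillna(col_means, inplace=True)
test.fillna(col_means, inplace=True)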
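
To make the imputation step concrete, here is a minimal sketch on a made-up toy frame (the values are hypothetical, not taken from the dataset). It counts missing values per column, learns the fill values once, and applies them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Albumin": [3.1, 2.8, np.nan, 3.0],
    "Albumin_and_Globulin_Ratio": [0.8, np.nan, 0.5, 1.2],
})

# Count missing values per column before imputing.
missing_before = toy.isnull().sum()
print(missing_before)

# Learn the fill values once, then reuse them for any other split.
fill_values = toy.mean(numeric_only=True)
toy_filled = toy.fillna(fill_values)

missing_after = int(toy_filled.isnull().sum().sum())
print("missing after imputation:", missing_after)
```

Computing the means from the training data only, and reusing them elsewhere, keeps information from other splits out of the preprocessing.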

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

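# Drop Label and keep the remaining columns as features
X = df.drop(labels=['Label'], axis=1).values
y = df['Label'].values

# Split into a training set and a held-out test set (default: 75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create the logistic regression model
logisticModel = LogisticRegression()

# Fit the model on the training split
logisticModel.fit(X_train, y_train)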

LogisticRegression()
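
One possible variation, sketched here on synthetic data rather than the liver dataset: a stratified split with a fixed `random_state` makes the result repeatable, and standardizing features before logistic regression often helps the solver converge when columns have very different scales. The synthetic data from `make_classification` and all names ending in `_demo` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 467 rows, 10 features, roughly 71% positives.
X_demo, y_demo = make_classification(
    n_samples=467, n_features=10, weights=[0.29, 0.71], random_state=0
)

# Stratify so both splits keep the same class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Scale features, then fit logistic regression, as one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

demo_score = pipe.score(X_te, y_te)
print(f"held-out accuracy on synthetic data: {demo_score:.3f}")
```

Wrapping the scaler in a pipeline means the scaling statistics are learned from the training split only and applied unchanged at prediction time.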

In [None]:
# Predict on the held-out split
y_pred = logisticModel.predict(X_test)
y_pred

array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1])

In [None]:
# Accuracy (fraction of correct predictions)
print('Training set: ',logisticModel.score(X_train,y_train))
print('Test set: ',logisticModel.score(X_test,y_test))

Training set:  0.74
Test set:  0.7264957264957265
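
Accuracy alone can hide weak performance on the smaller class when the labels are unbalanced. The following sketch uses hand-made, hypothetical labels (not the model's real predictions) to show how a confusion matrix exposes per-class behavior.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 is the majority class, 0 the minority class.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
y_hat = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# Recall for class 0: correctly predicted 0s over all true 0s.
minority_recall = cm[0, 0] / cm[0].sum()

print("accuracy:", acc)
print(cm)
print("minority-class recall:", minority_recall)
```

In this toy case the accuracy is 0.8 even though only one of the three minority-class samples is recovered.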


### Make prediction and submission file

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Category'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(test))]
df_submit['Category'] = logisticModel.predict(test)

In [None]:
# Write the submission file without the DataFrame index
df_submit.to_csv('submission.csv', index=False)
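
A quick way to sanity-check the submission format is to write a tiny frame to an in-memory buffer and read it back. This sketch uses three hypothetical predictions; note that the zero-padded `Id` values only survive a round trip if the column is read as a string.

```python
import io
import pandas as pd

# Hypothetical predictions for three test rows.
preds = [1, 0, 1]
demo_submit = pd.DataFrame({
    "Id": [f"{i:03d}" for i in range(len(preds))],
    "Category": preds,
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo_submit.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read Id back as a string so the leading zeros are preserved.
round_trip = pd.read_csv(io.StringIO(csv_text), dtype={"Id": str})
print(round_trip["Id"].tolist())
```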