# Chronic Kidney Disease

The Chronic Kidney Disease dataset contains medical measurements and patient information related to chronic kidney disease (CKD). Kidney disease arises when the kidneys cannot adequately filter waste products and excess fluid from the blood. The dataset is used to assess patients' kidney function and to predict disease status.

1. Independent Variables (X):

The following variables describe each patient's kidney health and general health status. They are used to predict GFR, the dependent variable.

1_Age:
    Patient age in years.

2_Serum_Creatinine:
    Creatinine level in the blood (mg/dL).
    High creatinine can indicate that the kidneys are working poorly.

3_BUN (Blood Urea Nitrogen):
    Blood urea nitrogen level (mg/dL).
    High BUN can point to kidney dysfunction.

4_Sodium:
    Blood sodium level (mmol/L).
    Reflects the kidneys' role in regulating salt balance.

5_Potassium:
    Blood potassium level (mmol/L).
    Abnormal potassium levels can be a sign of kidney disease.

6_Chloride:
    Blood chloride level (mmol/L).
    Can affect electrolyte balance and kidney function.

7_Gender:
    Patient gender, recorded as Male or Female.

2. Dependent Variable (Y):

The target variable is predicted from the independent variables.

GFR (Glomerular Filtration Rate):

The filtration rate of the kidneys (mL/min/1.73 m²).
A low GFR indicates impaired kidney function.

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score,mean_absolute_error,mean_absolute_percentage_error,mean_squared_error

df=pd.read_csv("kidney_disease_dataset.csv",sep=",")

In [44]:
print(df.head())

   Patient_ID  Age  Gender  Serum_Creatinine        BUN      Sodium  \
0           1   69    Male          0.746344  12.011655  137.288755   
1           2   32  Female          1.063745  11.425032  140.968519   
2           3   78  Female          0.661483  14.038299  144.281821   
3           4   38  Female          1.038746   9.003886  144.291395   
4           5   41    Male          1.275930  14.198249  138.419053   

   Potassium    Chloride         GFR  
0   4.199876   97.109699  106.581117  
1   5.064852  103.416220   93.442259  
2   5.485029   95.303538   93.784130  
3   3.981341  103.438129  105.886008  
4   5.248615   99.603291   98.065846  


In [45]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Patient_ID        299 non-null    int64  
 1   Age               299 non-null    int64  
 2   Gender            299 non-null    object 
 3   Serum_Creatinine  299 non-null    float64
 4   BUN               299 non-null    float64
 5   Sodium            299 non-null    float64
 6   Potassium         299 non-null    float64
 7   Chloride          299 non-null    float64
 8   GFR               299 non-null    float64
dtypes: float64(6), int64(2), object(1)
memory usage: 21.2+ KB
None


In [46]:
print(df.describe())

       Patient_ID         Age  Serum_Creatinine         BUN      Sodium  \
count  299.000000  299.000000        299.000000  299.000000  299.000000   
mean   150.000000   50.775920          0.909313   13.548600  139.889691   
std     86.458082   19.888109          0.245498    4.296200    2.962817   
min      1.000000   18.000000          0.504607    7.001751  130.000000   
25%     75.500000   34.000000          0.705336   10.556650  137.311664   
50%    150.000000   50.000000          0.916766   13.355672  139.845380   
75%    224.500000   68.500000          1.081996   16.898434  142.550311   
max    299.000000   84.000000          2.500000   50.000000  144.995577   

        Potassium    Chloride         GFR  
count  299.000000  299.000000  299.000000  
mean     4.527042   99.901652  104.480611  
std      0.571743    2.947192    9.404979  
min      2.500000   95.098276   40.000000  
25%      4.037959   97.290098   97.088361  
50%      4.525749   99.920017  104.837912  
75%      5.01015

In [47]:
print(df.isnull().sum().sum())

0


In [48]:
df.drop("Patient_ID",inplace=True,axis=1)

In [49]:
print(df.duplicated())

0      False
1      False
2      False
3      False
4      False
       ...  
294    False
295    False
296    False
297    False
298    False
Length: 299, dtype: bool
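Because pandas elides the middle of a long boolean Series, the printout above cannot confirm that no duplicates exist. The following is a minimal sketch of counting them instead, using a small hypothetical frame named `toy` rather than the notebook's `df`; the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [69, 32, 69, 41],
    "BUN": [12.0, 11.4, 12.0, 14.2],
})

# duplicated() marks each repeat of an earlier row as True,
# so summing the boolean Series counts the duplicate rows.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()
print(deduplicated.shape)
```

Applied to the notebook's frame, the same idea would be `df.duplicated().sum()`.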


In [50]:
df=pd.get_dummies(df,columns=["Gender"])
print(df.head())

   Age  Serum_Creatinine        BUN      Sodium  Potassium    Chloride  \
0   69          0.746344  12.011655  137.288755   4.199876   97.109699   
1   32          1.063745  11.425032  140.968519   5.064852  103.416220   
2   78          0.661483  14.038299  144.281821   5.485029   95.303538   
3   38          1.038746   9.003886  144.291395   3.981341  103.438129   
4   41          1.275930  14.198249  138.419053   5.248615   99.603291   

          GFR  Gender_Female  Gender_Male  
0  106.581117          False         True  
1   93.442259           True        False  
2   93.784130           True        False  
3  105.886008           True        False  
4   98.065846          False         True  


In [51]:
df['Gender_Female'] = df['Gender_Female'].astype(int)
df['Gender_Male'] = df['Gender_Male'].astype(int)
print(df.head())

   Age  Serum_Creatinine        BUN      Sodium  Potassium    Chloride  \
0   69          0.746344  12.011655  137.288755   4.199876   97.109699   
1   32          1.063745  11.425032  140.968519   5.064852  103.416220   
2   78          0.661483  14.038299  144.281821   5.485029   95.303538   
3   38          1.038746   9.003886  144.291395   3.981341  103.438129   
4   41          1.275930  14.198249  138.419053   5.248615   99.603291   

          GFR  Gender_Female  Gender_Male  
0  106.581117              0            1  
1   93.442259              1            0  
2   93.784130              1            0  
3  105.886008              1            0  
4   98.065846              0            1  
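The two steps above (one-hot encoding, then casting to `int`) can likely be combined through the `dtype` parameter of `pd.get_dummies`. The sketch below also passes `drop_first=True`, which is a design choice and not what the notebook does: `Gender_Female` is always `1 - Gender_Male`, so keeping both columns gives a linear model two perfectly correlated inputs. The frame `toy` is hypothetical.

```python
import pandas as pd

# Toy frame standing in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [69, 32, 78],
    "Gender": ["Male", "Female", "Female"],
})

# dtype=int yields 0/1 columns directly; drop_first=True drops the first
# category (Female), leaving a single Gender_Male indicator.
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int, drop_first=True)
print(encoded)
```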


In [52]:
from scipy import stats
import numpy as np
# A z-score measures how many standard deviations a value lies from the mean.
z=np.abs(stats.zscore(df))
print(z)

          Age  Serum_Creatinine       BUN    Sodium  Potassium  Chloride  \
0    0.917867          0.664941  0.358345  0.879331   0.573185  0.948914   
1    0.945660          0.630110  0.495118  0.364733   0.942228  1.194513   
2    1.371157          1.011190  0.114175  1.484902   1.678367  1.562783   
3    0.643467          0.528111  1.059619  1.488139   0.956051  1.201960   
4    0.492370          1.495863  0.151468  0.497197   1.264175  0.101405   
..        ...               ...       ...       ...        ...       ...   
294  0.693832          0.670413  0.738674  1.144398   1.377600  1.126026   
295  1.572619          0.366416  0.090714  1.619777   1.562464  0.462572   
296  1.298220          0.050270  1.351506  0.142855   0.737609  0.563022   
297  0.895295          0.279272  1.376231  0.042425   1.455142  0.160117   
298  0.996026          1.243411  1.330568  0.819209   0.290774  0.635427   

          GFR  Gender_Female  Gender_Male  
0    0.223714       0.932068     0.932068  

In [53]:
print(np.where (z>3))

(array([ 5, 15, 25, 35, 45, 55], dtype=int64), array([1, 2, 3, 4, 5, 6], dtype=int64))
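To make the threshold test concrete, here is a minimal sketch of the same calculation written out with NumPy alone. It uses the population standard deviation (`ddof=0`), which is the default in `scipy.stats.zscore`. The array `values` is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Illustrative values only: nineteen typical readings and one extreme one.
values = np.array([1.0] * 10 + [1.2] * 9 + [5.0])

# Population z-score: distance from the mean in units of standard deviation.
z_manual = np.abs((values - values.mean()) / values.std())

# Positions whose absolute z-score exceeds 3 are flagged as outliers.
outlier_positions = np.where(z_manual > 3)[0]
print(outlier_positions)
```

On a two-dimensional input such as the notebook's `z`, `np.where` returns one array of row positions and one of column positions, which is why the output above is a pair of arrays.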


In [54]:
selected=df.iloc[np.where (z>3)[0]].index
print(selected)

Index([5, 15, 25, 35, 45, 55], dtype='int64')


In [55]:
# drop(..., inplace=True) modifies df and returns None, so the result is not assigned.
df.drop(selected,axis=0,inplace=True)

In [56]:
dfN=df
print(dfN.shape)

(293, 9)


In [57]:
dfN=(dfN-dfN.min())/(dfN.max()-dfN.min())
print(dfN.head())

        Age  Serum_Creatinine       BUN    Sodium  Potassium  Chloride  \
0  0.772727          0.304101  0.386271  0.228472   0.350861  0.203617   
1  0.212121          0.703385  0.341042  0.596852   0.784771  0.842030   
2  0.909091          0.197347  0.542528  0.928546   0.995551  0.020779   
3  0.303030          0.671937  0.154368  0.929505   0.241234  0.844247   
4  0.348485          0.970310  0.554861  0.341626   0.876955  0.456045   

        GFR  Gender_Female  Gender_Male  
0  0.550863            0.0          1.0  
1  0.109678            1.0          0.0  
2  0.121158            1.0          0.0  
3  0.527522            1.0          0.0  
4  0.264932            0.0          1.0  
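The expression `(dfN-dfN.min())/(dfN.max()-dfN.min())` is min-max scaling, and it is applied here to the whole frame (target included) before any train/test split. A common alternative, shown below as an assumption and not as the notebook's method, is to learn the minimum and maximum from the training rows only with `MinMaxScaler`, so that no information from the test rows enters preprocessing. The data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix; values are illustrative only.
rng = np.random.default_rng(0)
features = rng.normal(loc=100.0, scale=10.0, size=(50, 3))

train, test = train_test_split(features, test_size=0.2, random_state=42)

# Learn min and max from the training rows, then reuse them for the test rows.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Same formula as (x - min) / (max - min), written out by hand.
manual = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))
print(np.allclose(train_scaled, manual))
```

Test values scaled this way can fall slightly outside the 0 to 1 range, which is expected when the test set contains values beyond the training extremes.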


In [69]:
y=dfN.loc[:,"GFR"].values
x=dfN.drop("GFR",axis=1)

## Multiple Linear Regression

In [71]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_absolute_percentage_error,mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

mlr=LinearRegression()
mlr.fit(x_train,y_train)
mlr_pred=mlr.predict(x_test)

print("mlr r2",r2_score(y_test,mlr_pred))
print("mlr mae",mean_absolute_error(y_test,mlr_pred))
print("mlr mape",mean_absolute_percentage_error(y_test,mlr_pred))
print("mlr mse",mean_squared_error(y_test,mlr_pred))
print("mlr rmse",(mean_squared_error(y_test,mlr_pred))**0.5)

mlr r2 -0.06625093209531463
mlr mae 0.2566608690868629
mlr mape 3.729760970493711
mlr mse 0.08848005128772532
mlr rmse 0.297455965291882
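An R² below zero, as reported above, means the model's squared error on the test set is larger than the error of always predicting the test-set mean. The sketch below reproduces the definition by hand and checks it against `r2_score`; the arrays are made up for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative numbers only, not taken from the notebook.
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.8, 0.6, 0.4, 0.2])  # deliberately poor predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

Here the residual sum of squares is four times the total sum of squares, so R² comes out at about -3.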


## Polynomial Linear Regression

In [73]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3)
x_train_poly = poly.fit_transform(x_train)
x_test_poly = poly.transform(x_test)

plr= LinearRegression()
plr.fit(x_train_poly, y_train)

pol_pred = plr.predict(x_test_poly)

print("plr r2",r2_score(y_test,pol_pred))
print("plr mae",mean_absolute_error(y_test,pol_pred))
print("plr mape",mean_absolute_percentage_error(y_test,pol_pred))
print("plr mse",mean_squared_error(y_test,pol_pred))
print("plr rmse",(mean_squared_error(y_test,pol_pred))**0.5)

plr r2 -1.3143753860157825
plr mae 0.36958675357746057
plr mape 4.976778136454151
plr mse 0.19205240219702815
plr rmse 0.43823783747758266
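One factor that may contribute to the weaker polynomial result is the size of the expanded feature space. With the 8 predictor columns left after removing `GFR`, a degree-3 expansion produces 165 columns (bias term included), while an 80% split of 293 rows leaves only about 234 training rows. The sketch below checks that count against the combinatorial formula C(n + d, d); the dummy input is an assumption used only to obtain the column count.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 8  # predictor columns left after removing GFR
degree = 3

# fit() only needs the number of columns, so a dummy row of zeros is enough.
poly_counter = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_inputs)))

n_expanded = poly_counter.n_output_features_
print(n_expanded)
print(comb(n_inputs + degree, degree))
```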


## Decision Tree Regression

In [74]:
from sklearn.tree import DecisionTreeRegressor

dt=DecisionTreeRegressor()
dt.fit(x_train,y_train)
dt_pred=dt.predict(x_test)

print("dt r2",r2_score(y_test,dt_pred))
print("dt mae",mean_absolute_error(y_test,dt_pred))
print("dt mape",mean_absolute_percentage_error(y_test,dt_pred))
print("dt mse",mean_squared_error(y_test,dt_pred))
print("dt rmse",(mean_squared_error(y_test,dt_pred))**0.5)

dt r2 -1.5311406180471292
dt mae 0.3806623939673449
dt mape 5.615405854084821
dt mse 0.21004009934242654
dt rmse 0.4583013193767028
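`DecisionTreeRegressor()` with default settings grows the tree until the training rows are fit exactly, and without a `random_state` its scores can vary between runs. The sketch below illustrates the memorization effect on synthetic data whose target is pure noise, and compares it with a depth-limited tree. The value `max_depth=3` is an arbitrary choice for illustration, not a tuned setting.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target is pure noise, unrelated to the features.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = rng.uniform(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A fully grown tree reproduces the training targets exactly.
full_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
train_r2 = r2_score(y_train, full_tree.predict(X_train))
test_r2 = r2_score(y_test, full_tree.predict(X_test))

# A depth limit keeps the tree from memorizing individual rows.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)
shallow_test_r2 = r2_score(y_test, shallow_tree.predict(X_test))

print(train_r2, test_r2, shallow_test_r2)
```

A large gap between training and test R² of this kind is the usual sign of overfitting.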


## Random Forest Regression

In [75]:
from sklearn.ensemble import RandomForestRegressor

rf=RandomForestRegressor(n_estimators=100,random_state=42)
rf.fit(x_train,y_train.ravel())
rf_pred=rf.predict(x_test)

print("rf r2",r2_score(y_test,rf_pred))
print("rf mae",mean_absolute_error(y_test,rf_pred))
print("rf mape",mean_absolute_percentage_error(y_test,rf_pred))
print("rf mse",mean_squared_error(y_test,rf_pred))
print("rf rmse",(mean_squared_error(y_test,rf_pred))**0.5)

rf r2 -0.2335376211352722
rf mae 0.2789488083065021
rf mape 3.943469499922558
rf mse 0.10236190065401132
rf rmse 0.31994046423359973
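Each model cell repeats the same five print statements. A small helper could collect them in one place; the function name `regression_report` is hypothetical and the arrays are illustrative only. Note that MAPE divides by the true value, so on a min-max scaled target, where some values are at or near zero, it can become very large.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Return the five metrics printed in each model cell as a dict."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "mape": mean_absolute_percentage_error(y_true, y_pred),
        "mse": mse,
        "rmse": mse ** 0.5,
    }


# Illustrative values only.
y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.0, 2.0, 2.0])
report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

With the notebook's variables, a call such as `regression_report(y_test, rf_pred)` would return the same quantities as the print statements above.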
