# Chronic Kidney Disease and Preventive Medicine
In this notebook, we investigate the association between chronic kidney disease and environmental, lifestyle, and health-behavior factors.
The results are used to consider how healthcare professionals involved in preventive medicine should approach the issue.
The analysis uses chi-square tests and logistic regression.

## Dataset Source
This dataset is based on the synthetic Chronic Kidney Disease dataset created by Rabie El Kharoua, originally shared on Kaggle:  
https://www.kaggle.com/datasets/rabieelkharoua/chronic-kidney-disease-dataset-analysis  

The dataset is licensed under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author is credited.


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/chronic-kidney-disease-dataset-analysis/Chronic_Kidney_Dsease_data.csv


# Summary of the data

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Read CSV from Kaggle dataset
file_path = '/kaggle/input/chronic-kidney-disease-dataset-analysis/Chronic_Kidney_Dsease_data.csv'
df = pd.read_csv(file_path)
df.set_index('PatientID', inplace=True)

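# Inspect basic information: column dtypes, missing values, and summary statistics
df.info()  # prints directly and returns None, so it is not wrapped in print()
print(df.isnull().sum())
print(df.describe())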

<class 'pandas.core.frame.DataFrame'>
Index: 1659 entries, 1 to 1659
Data columns (total 53 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Age                            1659 non-null   int64  
 1   Gender                         1659 non-null   int64  
 2   Ethnicity                      1659 non-null   int64  
 3   SocioeconomicStatus            1659 non-null   int64  
 4   EducationLevel                 1659 non-null   int64  
 5   BMI                            1659 non-null   float64
 6   Smoking                        1659 non-null   int64  
 7   AlcoholConsumption             1659 non-null   float64
 8   PhysicalActivity               1659 non-null   float64
 9   DietQuality                    1659 non-null   float64
 10  SleepQuality                   1659 non-null   float64
 11  FamilyHistoryKidneyDisease     1659 non-null   int64  
 12  FamilyHistoryHypertension      1659 non-null   int64 

In [3]:
df['Diagnosis'].value_counts(normalize=True)

Diagnosis
1    0.918626
0    0.081374
Name: proportion, dtype: float64

# Correlation between Chronic Kidney Disease and Environmental and Occupational Exposures

In [4]:
# Correlation between Chronic Kidney Disease and HeavyMetalsExposure
ckd_heavymetal = pd.crosstab(df['HeavyMetalsExposure'], df['Diagnosis'])
print(ckd_heavymetal)

chi2, p, dof, expected = stats.chi2_contingency(ckd_heavymetal)

print("Chi-square test results:")
print(f"Chi2 = {chi2:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("There is a significant association between chronic kidney disease and HeavyMetalsExposure")
else:
    print("No significant association between chronic kidney disease and HeavyMetalsExposure")



Diagnosis              0     1
HeavyMetalsExposure           
0                    130  1456
1                      5    68
Chi-square test results:
Chi2 = 0.037, p-value = 0.8471
No significant association between chronic kidney disease and HeavyMetalsExposure
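
The chi-square test indicates whether an association is detectable, but not how large it is or whether the test's assumptions are reasonable. As a hedged sketch, the block below rebuilds the 2x2 table from the counts printed above, checks the expected cell counts (a common rule of thumb asks for at least 5 per cell), and computes an odds ratio. The variable names are illustrative and are not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Counts copied from the crosstab printed above.
# Rows: HeavyMetalsExposure 0 / 1, columns: Diagnosis 0 / 1
table = np.array([[130, 1456],
                  [5, 68]])

# Same call as in the notebook (Yates' correction is applied by default for 2x2 tables)
chi2, p, dof, expected = stats.chi2_contingency(table)

# Odds of a diagnosis in the exposed group relative to the unexposed group
odds_exposed = table[1, 1] / table[1, 0]
odds_unexposed = table[0, 1] / table[0, 0]
odds_ratio = odds_exposed / odds_unexposed

print(f"Smallest expected count = {expected.min():.2f}")
print(f"Chi2 = {chi2:.3f}, p-value = {p:.4f}")
print(f"Odds ratio (exposed vs. unexposed) = {odds_ratio:.2f}")
```

An odds ratio close to 1 together with a large p-value is consistent with the "no significant association" message printed by the cell.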


In [5]:
# Correlation between Chronic Kidney Disease and OccupationalExposureChemicals
ckd_chemical = pd.crosstab(df['OccupationalExposureChemicals'], df['Diagnosis'])
print(ckd_chemical)      


chi2, p, dof, expected = stats.chi2_contingency(ckd_chemical)
print("Chi-square test results:")

print(f"Chi2 = {chi2:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("There is a significant association between chronic kidney disease and OccupationalExposureChemicals")
else:
    print("No significant association between chronic kidney disease and OccupationalExposureChemicals")

Diagnosis                        0     1
OccupationalExposureChemicals           
0                              119  1369
1                               16   155
Chi-square test results:
Chi2 = 0.219, p-value = 0.6397
No significant association between chronic kidney disease and OccupationalExposureChemicals


In [6]:
# Correlation between Chronic Kidney Disease and WaterQuality
ckd_waterq = pd.crosstab(df['WaterQuality'], df['Diagnosis'])
print(ckd_waterq)      


chi2, p, dof, expected = stats.chi2_contingency(ckd_waterq)
print("Chi-square test results:")

print(f"Chi2 = {chi2:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("There is a significant association between chronic kidney disease and WaterQuality")
else:
    print("No significant association between chronic kidney disease and WaterQuality")


Diagnosis       0     1
WaterQuality           
0             105  1227
1              30   297
Chi-square test results:
Chi2 = 0.426, p-value = 0.5141
No significant association between chronic kidney disease and WaterQuality


No significant association was found between chronic kidney disease and any of heavy metal exposure, occupational chemical exposure, or water quality.
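
The three cells above repeat the same crosstab-and-test pattern. One possible way to reduce the repetition is a small helper, sketched below. The function name `chi_square_report` and the stand-in DataFrame `demo` are assumptions made for illustration; `demo` is rebuilt from the WaterQuality counts printed above so that the example runs without the Kaggle file.

```python
import pandas as pd
from scipy import stats


def chi_square_report(data, column, target="Diagnosis", alpha=0.05):
    """Cross-tabulate `column` against `target` and run a chi-square test."""
    table = pd.crosstab(data[column], data[target])
    chi2, p, dof, _ = stats.chi2_contingency(table)
    return {"factor": column, "chi2": chi2, "p_value": p, "significant": p < alpha}


# Stand-in data rebuilt from the WaterQuality crosstab printed above;
# in the notebook, `df` would be passed instead.
demo = pd.DataFrame({
    "WaterQuality": [0] * 1332 + [1] * 327,
    "Diagnosis": [0] * 105 + [1] * 1227 + [0] * 30 + [1] * 297,
})

report = chi_square_report(demo, "WaterQuality")
print(report)

# With the real data, the three exposure factors could be tested in one loop:
# factors = ["HeavyMetalsExposure", "OccupationalExposureChemicals", "WaterQuality"]
# results = pd.DataFrame([chi_square_report(df, c) for c in factors])
```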

# Correlation between Chronic Kidney Disease and Lifestyle Factors

In [7]:
# Correlation between Chronic Kidney Disease and BMI
import statsmodels.api as sm

X = sm.add_constant(df['BMI']) 
y = df['Diagnosis']

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.280586
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:              Diagnosis   No. Observations:                 1659
Model:                          Logit   Df Residuals:                     1657
Method:                           MLE   Df Model:                            1
Date:                Sat, 12 Apr 2025   Pseudo R-squ.:                0.005413
Time:                        08:54:48   Log-Likelihood:                -465.49
converged:                       True   LL-Null:                       -468.03
Covariance Type:            nonrobust   LLR p-value:                   0.02439
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.6722      0.340      4.919      0.000       1.006       2.338
BMI            0.0278      0.
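
The summary above reports the BMI coefficient on the log-odds scale (0.0278) and an LLR p-value of 0.02439. As a rough worked example, the block below converts the coefficient to an odds ratio and recomputes the likelihood-ratio test from the two log-likelihoods shown in the summary. Because the inputs are the rounded values printed above, the recomputed p-value only approximately matches the reported one.

```python
import math
from scipy import stats

# Values copied (rounded) from the Logit summary above
coef_bmi = 0.0278      # BMI coefficient, log-odds per BMI unit
ll_model = -465.49     # Log-Likelihood of the fitted model
ll_null = -468.03      # LL-Null (intercept-only model)

# exp(coefficient) is the multiplicative change in the odds per one-unit BMI increase
odds_ratio = math.exp(coef_bmi)

# Likelihood-ratio statistic, compared with a chi-square distribution
# with 1 degree of freedom (one predictor added to the null model)
llr_stat = 2 * (ll_model - ll_null)
llr_p = stats.chi2.sf(llr_stat, 1)

print(f"Odds ratio per BMI unit = {odds_ratio:.3f}")
print(f"LLR statistic = {llr_stat:.2f}, approximate p-value = {llr_p:.4f}")
```

An odds ratio only slightly above 1 per BMI unit, combined with the very small pseudo R-squared, suggests the effect is modest even though the test is significant.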

In [8]:
# Correlation between Chronic Kidney Disease and Smoking
ckd_smoking = pd.crosstab(df['Smoking'], df['Diagnosis'])
print(ckd_smoking)      


chi2, p, dof, expected = stats.chi2_contingency(ckd_smoking)
print("Chi-square test results:")

print(f"Chi2 = {chi2:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("There is a significant association between chronic kidney disease and smoking")
else:
    print("No significant association between chronic kidney disease and smoking")

Diagnosis    0     1
Smoking             
0          101  1072
1           34   452
Chi-square test results:
Chi2 = 0.992, p-value = 0.3193
No significant association between chronic kidney disease and smoking


In [9]:
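# Correlation between Chronic Kidney Disease and AlcoholConsumption
# NOTE: AlcoholConsumption is continuous, so this crosstab has one row per distinct value
# and the chi-square result below is not meaningful.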
ckd_alcohol = pd.crosstab(df['AlcoholConsumption'], df['Diagnosis'])
print(ckd_alcohol)      


chi2, p, dof, expected = stats.chi2_contingency(ckd_alcohol)
print("Chi-square test results:")

print(f"Chi2 = {chi2:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("There is a significant association between chronic kidney disease and AlcoholConsumption")
else:
    print("No significant association between chronic kidney disease and AlcoholConsumption")

Diagnosis           0  1
AlcoholConsumption      
0.021740            0  1
0.027360            0  1
0.043682            0  1
0.053363            0  1
0.059254            0  1
...                .. ..
19.950964           0  1
19.959241           0  1
19.981815           0  1
19.986598           0  1
19.992713           0  1

[1659 rows x 2 columns]
Chi-square test results:
Chi2 = 1659.000, p-value = 0.4885
No significant association between chronic kidney disease and AlcoholConsumption
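
The crosstab above has 1659 rows because AlcoholConsumption is continuous, so every patient falls into their own category and the chi-square statistic is not interpretable. One possible alternative, sketched here under stated assumptions, is to bin the variable into quartiles before testing; a logistic regression like the one used for BMI would be another option. The data below are synthetic stand-ins generated only so that the example runs, and the column name `AlcoholQuartile` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the Kaggle dataset), for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AlcoholConsumption": rng.uniform(0, 20, size=500),
    "Diagnosis": rng.binomial(1, 0.9, size=500),
})

# Bin the continuous variable into four equally sized groups
demo["AlcoholQuartile"] = pd.qcut(
    demo["AlcoholConsumption"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# The contingency table now has 4 rows instead of one row per patient
table = pd.crosstab(demo["AlcoholQuartile"], demo["Diagnosis"])
chi2, p, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"Chi2 = {chi2:.3f}, dof = {dof}, p-value = {p:.4f}")
```

In the notebook, `df` would replace `demo`. The number of bins is a judgment call; quartiles are used here only as an example.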


In [10]:
# Correlation between Chronic Kidney Disease and PhysicalActivity
X = sm.add_constant(df['PhysicalActivity']) 
y = df['Diagnosis']

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.281881
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              Diagnosis   No. Observations:                 1659
Model:                          Logit   Df Residuals:                     1657
Method:                           MLE   Df Model:                            1
Date:                Sat, 12 Apr 2025   Pseudo R-squ.:               0.0008234
Time:                        08:54:48   Log-Likelihood:                -467.64
converged:                       True   LL-Null:                       -468.03
Covariance Type:            nonrobust   LLR p-value:                    0.3800
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
const                2.5648      0.187     13.741      0.000       2.199       2.931
PhysicalAct

In [11]:
# Correlation between Chronic Kidney Disease and DietQuality
X = sm.add_constant(df['DietQuality']) 
y = df['Diagnosis']

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.281122
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:              Diagnosis   No. Observations:                 1659
Model:                          Logit   Df Residuals:                     1657
Method:                           MLE   Df Model:                            1
Date:                Sat, 12 Apr 2025   Pseudo R-squ.:                0.003515
Time:                        08:54:48   Log-Likelihood:                -466.38
converged:                       True   LL-Null:                       -468.03
Covariance Type:            nonrobust   LLR p-value:                   0.06970
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const           2.7216      0.193     14.099      0.000       2.343       3.100
DietQuality    -0.0570    

In [12]:
# Correlation between Chronic Kidney Disease and SleepQuality
X = sm.add_constant(df['SleepQuality']) 
y = df['Diagnosis']

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.281886
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              Diagnosis   No. Observations:                 1659
Model:                          Logit   Df Residuals:                     1657
Method:                           MLE   Df Model:                            1
Date:                Sat, 12 Apr 2025   Pseudo R-squ.:               0.0008037
Time:                        08:54:48   Log-Likelihood:                -467.65
converged:                       True   LL-Null:                       -468.03
Covariance Type:            nonrobust   LLR p-value:                    0.3857
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const            2.1084      0.372      5.663      0.000       1.379       2.838
SleepQuality     0.0458

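BMI showed a weak but statistically significant association with chronic kidney disease (LLR p-value = 0.024, pseudo R-squared = 0.005).
No significant association was found between chronic kidney disease and Smoking, Physical Activity, Diet Quality, or Sleep Quality.
The chi-square test for Alcohol Consumption is not informative, because the variable is continuous and the contingency table had one row per patient (1659 rows).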

# Correlation between Chronic Kidney Disease and Health Behaviors

In [13]:
# Correlation between Chronic Kidney Disease and MedicalCheckupsFrequency
X = sm.add_constant(df['MedicalCheckupsFrequency']) 
y = df['Diagnosis']

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.282037
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              Diagnosis   No. Observations:                 1659
Model:                          Logit   Df Residuals:                     1657
Method:                           MLE   Df Model:                            1
Date:                Sat, 12 Apr 2025   Pseudo R-squ.:               0.0002713
Time:                        08:54:48   Log-Likelihood:                -467.90
converged:                       True   LL-Null:                       -468.03
Covariance Type:            nonrobust   LLR p-value:                    0.6143
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                        2.3453      0.178     13.153      0.000       1.996

In [14]:
# Correlation between Chronic Kidney Disease and MedicationAdherence
X = sm.add_constant(df['MedicationAdherence']) 
y = df['Diagnosis']

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.282112
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              Diagnosis   No. Observations:                 1659
Model:                          Logit   Df Residuals:                     1657
Method:                           MLE   Df Model:                            1
Date:                Sat, 12 Apr 2025   Pseudo R-squ.:               4.158e-06
Time:                        08:54:48   Log-Likelihood:                -468.02
converged:                       True   LL-Null:                       -468.03
Covariance Type:            nonrobust   LLR p-value:                    0.9503
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   2.4142      0.179     13.513      0.000       2.064       2.764
Me

In [15]:
# Correlation between Chronic Kidney Disease and HealthLiteracy
X = sm.add_constant(df['HealthLiteracy']) 
y = df['Diagnosis']

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.282100
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              Diagnosis   No. Observations:                 1659
Model:                          Logit   Df Residuals:                     1657
Method:                           MLE   Df Model:                            1
Date:                Sat, 12 Apr 2025   Pseudo R-squ.:               4.766e-05
Time:                        08:54:48   Log-Likelihood:                -468.00
converged:                       True   LL-Null:                       -468.03
Covariance Type:            nonrobust   LLR p-value:                    0.8327
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const              2.3903      0.182     13.163      0.000       2.034       2.746
HealthLiteracy   

No significant association was found between chronic kidney disease and Medical Checkups Frequency, Medication Adherence, or Health Literacy.

# Conclusion
In the present analysis, BMI was the only factor found to be associated with a diagnosis of chronic kidney disease, and the association was weak.
To help prevent chronic kidney disease, education on exercise and diet aimed at avoiding excessive weight gain may be useful.
For example, holding health classes for people with a high BMI and distributing pamphlets on exercise and diet may be effective.
Some factors may have shown no association because the two groups are highly imbalanced: about 92% of the patients in the dataset are diagnosed with chronic kidney disease.
A future analysis should therefore compare similar numbers of people with and without a chronic kidney disease diagnosis.
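
As a minimal sketch of the balanced comparison suggested above, the majority class could be randomly downsampled to the size of the minority class before repeating the tests. The data below are synthetic stand-ins, and downsampling is only one of several possible approaches; it is not something the original analysis performed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with a class imbalance similar to the dataset (about 92% diagnosed)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "BMI": rng.uniform(15, 40, 1000),
    "Diagnosis": rng.binomial(1, 0.92, 1000),
})

# Size of the smaller class
n_minority = demo["Diagnosis"].value_counts().min()

# Draw the same number of rows from each Diagnosis group
balanced = demo.groupby("Diagnosis").sample(n=n_minority, random_state=0)

print(demo["Diagnosis"].value_counts())
print(balanced["Diagnosis"].value_counts())
```

Downsampling discards most of the diagnosed patients, which reduces statistical power, so the results would need to be interpreted with that trade-off in mind.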