# Chronic Kidney Disease Prediction with Machine Learning.
This project harnesses data science and machine learning to develop a powerful predictive model for Chronic Kidney Disease (CKD). By analyzing vast datasets and collaborating with healthcare experts, the team’s data-driven insights enable early detection, personalized interventions, and resource optimization—ultimately improving patient outcomes. Beyond the technical achievements, this work underscores the transformative potential of predictive analytics in healthcare, shaping a future where proactive, targeted care can combat CKD and enhance the quality of life for those at risk.

# Task 2: Unveiling Order in Complexity.
This step checks the "kidney_details.csv" dataset for duplicated rows. Duplicate records would give some patients extra weight during training and could end up in both the training and test sets, which inflates the measured accuracy. We therefore count them before doing anything else. The check finds none.

In [2]:
duplicates = df.duplicated().sum()

#--- Inspect data ---
print(f"Total number of duplicates = {duplicates}")

Total number of duplicates = 0
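
As a small illustration of what the check counts, the sketch below runs `duplicated` on a toy DataFrame (the data is made up for this example, not taken from "kidney_details.csv"). By default only the second and later copies of a row are flagged; `keep=False` flags every copy.

```python
import pandas as pd

# Toy data, invented for illustration: rows 0 and 2 are identical.
toy = pd.DataFrame({"age": [48, 7, 48], "bp": [80, 50, 80]})

# Default keep="first": only later copies count as duplicates.
n_later_copies = int(toy.duplicated().sum())

# keep=False marks every row that has an identical twin.
n_all_copies = int(toy.duplicated(keep=False).sum())

# drop_duplicates keeps the first occurrence of each row.
deduped = toy.drop_duplicates()

print(n_later_copies, n_all_copies, len(deduped))
```

If the count were non-zero, `drop_duplicates` would be the usual follow-up before splitting the data.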


# Task 3: Streamlining the Narrative.
This step drops the 'id' column from the "kidney_details.csv" dataset. The column only labels rows and does not describe the patient, so keeping it would hand the model a feature with no predictive meaning. Removing it leaves only the columns that are relevant to predicting Chronic Kidney Disease.

In [3]:
df = df.drop("id", axis=1)

#--- Inspect data ---
df

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,48.0,80.0,1.020,1.0,0.0,,normal,notpresent,notpresent,121.0,...,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.020,4.0,0.0,,normal,notpresent,notpresent,,...,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.010,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.010,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,35.0,7300.0,4.6,no,no,no,good,no,no,ckd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,140.0,...,47.0,6700.0,4.9,no,no,no,good,no,no,notckd
396,42.0,70.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,75.0,...,54.0,7800.0,6.2,no,no,no,good,no,no,notckd
397,12.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,100.0,...,49.0,6600.0,5.4,no,no,no,good,no,no,notckd
398,17.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,114.0,...,51.0,7200.0,5.9,no,no,no,good,no,no,notckd


# Task 4: Transforming Labels, Shaping Understanding.
The original column names in the "kidney_details.csv" dataset are terse abbreviations. This step passes a dictionary to `rename` that maps each abbreviation to a descriptive name (for example, 'bp' becomes 'blood_pressure' and 'hemo' becomes 'haemoglobin'). The data itself is unchanged; the longer names simply make the later steps and their output easier to read.

In [4]:
df = df.rename(columns={"bp":"blood_pressure",
          "sg":"specific_gravity",
          "al":"albumin",
          "su":"sugar",
          "rbc":"red_blood_cells",
          "pc":"pus_cell",
          "pcc":"pus_cell_clumps",
          "ba":"bacteria",
          "bgr":"blood_glucose_random",
          "bu":"blood_urea",
          "sc":"serum_creatinine",
          "sod":"sodium",
          "pot":"potassium",
          "hemo":"haemoglobin",
          "pcv":"packed_cell_volume",
          "wc":"white_blood_cell_count",
          "rc":"red_blood_cell_count",
          "htn":"hypertension",
          "dm":"diabetes_mellitus",
          "cad":"coronary_artery_disease",
          "appet":"appetite",
          "pe":"pedal_edema",
          "ane":"anemia"})

#--- Inspect data ---
df

Unnamed: 0,age,blood_pressure,specific_gravity,albumin,sugar,red_blood_cells,pus_cell,pus_cell_clumps,bacteria,blood_glucose_random,...,packed_cell_volume,white_blood_cell_count,red_blood_cell_count,hypertension,diabetes_mellitus,coronary_artery_disease,appetite,pedal_edema,anemia,classification
0,48.0,80.0,1.020,1.0,0.0,,normal,notpresent,notpresent,121.0,...,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.020,4.0,0.0,,normal,notpresent,notpresent,,...,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.010,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.010,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,35.0,7300.0,4.6,no,no,no,good,no,no,ckd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,140.0,...,47.0,6700.0,4.9,no,no,no,good,no,no,notckd
396,42.0,70.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,75.0,...,54.0,7800.0,6.2,no,no,no,good,no,no,notckd
397,12.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,100.0,...,49.0,6600.0,5.4,no,no,no,good,no,no,notckd
398,17.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,114.0,...,51.0,7200.0,5.9,no,no,no,good,no,no,notckd
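
`rename` silently skips dictionary keys that do not match any column, so a typo in the mapping goes unnoticed. A minimal sketch, using a made-up two-column frame, of how `errors="raise"` surfaces that:

```python
import pandas as pd

toy = pd.DataFrame({"bp": [80.0], "sg": [1.02]})  # invented sample values

# "hemo" is not a column of toy, so the default rename just ignores it.
mapping = {"bp": "blood_pressure", "hemo": "haemoglobin"}
renamed = toy.rename(columns=mapping)

# With errors="raise", the unmatched key triggers a KeyError instead.
try:
    toy.rename(columns=mapping, errors="raise")
    unmatched_detected = False
except KeyError:
    unmatched_detected = True

print(list(renamed.columns), unmatched_detected)
```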


# Task 5: Crafting Harmony from Diversity.
This step separates the columns of the "kidney_details.csv" dataset by dtype. Columns stored as `object` go into the list 'cat_col' (categorical), and all remaining columns go into the list 'num_col' (numerical). The two groups need different treatment later: numerical and categorical columns are imputed with different methods, and only the categorical columns are label-encoded.

In [5]:
cat_col = df.select_dtypes(include = "object").columns.to_list()
num_col = df.select_dtypes(exclude = "object").columns.to_list()

#--- Inspect data ---
cat_col, num_col

(['red_blood_cells',
  'pus_cell',
  'pus_cell_clumps',
  'bacteria',
  'hypertension',
  'diabetes_mellitus',
  'coronary_artery_disease',
  'appetite',
  'pedal_edema',
  'anemia',
  'classification'],
 ['age',
  'blood_pressure',
  'specific_gravity',
  'albumin',
  'sugar',
  'blood_glucose_random',
  'blood_urea',
  'serum_creatinine',
  'sodium',
  'potassium',
  'haemoglobin',
  'packed_cell_volume',
  'white_blood_cell_count',
  'red_blood_cell_count'])

# Task 6: Redefining the Language of Health.
Some categorical values in the "kidney_details.csv" dataset carry stray tab characters: '\tno' and '\tyes' appear in the diabetes mellitus and coronary artery disease columns, and 'ckd\t' appears in 'classification'. Left alone, each variant would be treated as a separate category during encoding. This step replaces them with the clean labels 'no', 'yes', and 'ckd'.



In [6]:
df.loc[df['diabetes_mellitus'] == '\tno', 'diabetes_mellitus'] = 'no'
df.loc[df['diabetes_mellitus'] == '\tyes', 'diabetes_mellitus'] = 'yes'

df.loc[df['coronary_artery_disease'] == '\tno', 'coronary_artery_disease'] = 'no'
df.loc[df['coronary_artery_disease'] == '\tyes', 'coronary_artery_disease'] = 'yes'

df.loc[df['classification'] == 'ckd\t', 'classification'] = 'ckd'

#--- Inspect data ---
df

Unnamed: 0,age,blood_pressure,specific_gravity,albumin,sugar,red_blood_cells,pus_cell,pus_cell_clumps,bacteria,blood_glucose_random,...,packed_cell_volume,white_blood_cell_count,red_blood_cell_count,hypertension,diabetes_mellitus,coronary_artery_disease,appetite,pedal_edema,anemia,classification
0,48.0,80.0,1.020,1.0,0.0,,normal,notpresent,notpresent,121.0,...,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.020,4.0,0.0,,normal,notpresent,notpresent,,...,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.010,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.010,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,35.0,7300.0,4.6,no,no,no,good,no,no,ckd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,140.0,...,47.0,6700.0,4.9,no,no,no,good,no,no,notckd
396,42.0,70.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,75.0,...,54.0,7800.0,6.2,no,no,no,good,no,no,notckd
397,12.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,100.0,...,49.0,6600.0,5.4,no,no,no,good,no,no,notckd
398,17.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,114.0,...,51.0,7200.0,5.9,no,no,no,good,no,no,notckd
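
The replacements above target three specific values. In the encoded table under Task 9, 'diabetes_mellitus' takes the values 1 and 2 in the rows shown, which hints that a third label survived this step. A more general approach, sketched here on made-up data, is to strip leading and trailing whitespace from every text column. Which variants the real file contains is an assumption; `value_counts` on each column would confirm it.

```python
import pandas as pd

# Invented values mimicking stray tabs and spaces around labels.
toy = pd.DataFrame({
    "diabetes_mellitus": ["yes", "\tno", " yes", "\tyes", "no"],
    "classification": ["ckd", "ckd\t", "notckd", "ckd", "notckd"],
})

# Strip surrounding whitespace (tabs, spaces, newlines) in each text column.
for col in ["diabetes_mellitus", "classification"]:
    toy[col] = toy[col].str.strip()

dm_labels = sorted(toy["diabetes_mellitus"].unique())
class_labels = sorted(toy["classification"].unique())
print(dm_labels, class_labels)
```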


# Task 7: Illuminating the Gaps in Understanding.
Before filling anything in, this step counts the missing values in each column of the "kidney_details.csv" dataset. 'null_values' is a pandas Series indexed by column name. The counts vary widely: 'red_blood_cells' has the most missing entries (152), several columns are missing only one or two, and 'classification' is complete. These numbers show how much of each column will come from imputation rather than measurement.

In [7]:
null_values = df.isnull().sum()

#--- Inspect data ---
null_values

age                          9
blood_pressure              12
specific_gravity            47
albumin                     46
sugar                       49
red_blood_cells            152
pus_cell                    65
pus_cell_clumps              4
bacteria                     4
blood_glucose_random        44
blood_urea                  19
serum_creatinine            17
sodium                      87
potassium                   88
haemoglobin                 52
packed_cell_volume          71
white_blood_cell_count     106
red_blood_cell_count       131
hypertension                 2
diabetes_mellitus            2
coronary_artery_disease      2
appetite                     1
pedal_edema                  1
anemia                       1
classification               0
dtype: int64
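
Raw counts are easier to judge as a share of all rows. A small sketch on made-up data (the column names are reused for readability, the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 48.0],
    "sodium": [np.nan, np.nan, 140.0, 111.0],
    "classification": ["ckd", "ckd", "ckd", "notckd"],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```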

# Task 8: Bridging Data Gaps with Precision.
This step fills the missing values in the "kidney_details.csv" dataset. Numerical columns use random value imputation: each missing entry is replaced by a draw from a uniform distribution between the column's minimum and maximum. Categorical columns use mode imputation: each missing entry is replaced by the column's most frequent value. A fixed seed (42) makes the result reproducible. After this step no column contains missing values, so every row can be used for training.

In [8]:
import numpy as np

def Random_value_Imputation(feature, df, random_seed=None):
    """Fill missing values of a numerical column with uniform random draws."""
    if random_seed is not None:
        # Re-seeding with the same value for every column means each column
        # receives the same sequence of underlying draws, rescaled to its range.
        np.random.seed(random_seed)

    # Draw from a uniform distribution between the observed min and max
    min_val = df[feature].min()
    max_val = df[feature].max()
    num_missing = int(df[feature].isnull().sum())

    random_sample = np.random.uniform(min_val, max_val, num_missing)

    # Assign the random values to the missing entries in the DataFrame
    df.loc[df[feature].isnull(), feature] = random_sample

    return df

def impute_mode(feature, df, random_seed=None):
    """Fill missing values of a categorical column with its most frequent value."""
    # random_seed is accepted for symmetry with Random_value_Imputation but is
    # unused: the mode is deterministic.
    mode_series = df[feature].mode()

    # mode() returns an empty Series when every value is missing
    if not mode_series.empty:
        df[feature] = df[feature].fillna(mode_series[0])

    return df


for col in num_col:
    df = Random_value_Imputation(col, df, random_seed=42)

for col in cat_col:
    df = impute_mode(col, df, random_seed=42)

#--- Inspect data ---
df

Unnamed: 0,age,blood_pressure,specific_gravity,albumin,sugar,red_blood_cells,pus_cell,pus_cell_clumps,bacteria,blood_glucose_random,...,packed_cell_volume,white_blood_cell_count,red_blood_cell_count,hypertension,diabetes_mellitus,coronary_artery_disease,appetite,pedal_edema,anemia,classification
0,48.0,80.0,1.020,1.0,0.0,normal,normal,notpresent,notpresent,121.000000,...,44.0,7800.0,5.200000,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.020,4.0,0.0,normal,normal,notpresent,notpresent,197.284776,...,38.0,6000.0,4.309787,no,no,no,good,no,no,ckd
2,62.0,80.0,1.010,2.0,3.0,normal,normal,notpresent,notpresent,423.000000,...,31.0,7500.0,7.709214,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.000000,...,32.0,6700.0,3.900000,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.010,2.0,0.0,normal,normal,notpresent,notpresent,106.000000,...,35.0,7300.0,4.600000,no,no,no,good,no,no,ckd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,140.000000,...,47.0,6700.0,4.900000,no,no,no,good,no,no,notckd
396,42.0,70.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,75.000000,...,54.0,7800.0,6.200000,no,no,no,good,no,no,notckd
397,12.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,100.000000,...,49.0,6600.0,5.400000,no,no,no,good,no,no,notckd
398,17.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,114.000000,...,51.0,7200.0,5.900000,no,no,no,good,no,no,notckd
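
Uniform draws between the minimum and maximum can produce values that never occur in the data, for instance fractional values in a column that only takes a few discrete levels. One common alternative, sketched below, samples replacements from the column's observed values instead. This is a swapped-in technique, not the notebook's method; the function name `impute_from_observed` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_from_observed(series, rng):
    """Fill NaNs with values drawn (with replacement) from the non-missing entries."""
    filled = series.copy()
    missing = filled.isnull()
    observed = filled[~missing].to_numpy()
    if missing.any() and observed.size > 0:
        filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

# One generator for the whole run, so columns do not share the same draws.
rng = np.random.default_rng(42)

toy = pd.DataFrame({"albumin": [0.0, 1.0, np.nan, 4.0, np.nan, 2.0]})  # invented
toy["albumin"] = impute_from_observed(toy["albumin"], rng)
print(toy["albumin"].tolist())
```

Sampling from observed values keeps each column's distribution roughly intact, at the cost of ignoring relationships between columns.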


# Task 9: Encoding Wisdom, Decoding Insights.
Most machine learning models need numerical input, so this step uses scikit-learn's LabelEncoder to map each label in the categorical columns of the "kidney_details.csv" dataset to an integer. Labels are numbered in sorted order within each column. In 'classification', for example, 'ckd' becomes 0 and 'notckd' becomes 1. After this step every column is numeric and the dataset can be passed to a model.

In [9]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# The encoder is refit on every column, so only the last mapping is retained
for col in cat_col:
    df[col] = encoder.fit_transform(df[col])

#--- Inspect data ---
df

Unnamed: 0,age,blood_pressure,specific_gravity,albumin,sugar,red_blood_cells,pus_cell,pus_cell_clumps,bacteria,blood_glucose_random,...,packed_cell_volume,white_blood_cell_count,red_blood_cell_count,hypertension,diabetes_mellitus,coronary_artery_disease,appetite,pedal_edema,anemia,classification
0,48.0,80.0,1.020,1.0,0.0,1,1,0,0,121.000000,...,44.0,7800.0,5.200000,1,2,0,0,0,0,0
1,7.0,50.0,1.020,4.0,0.0,1,1,0,0,197.284776,...,38.0,6000.0,4.309787,0,1,0,0,0,0,0
2,62.0,80.0,1.010,2.0,3.0,1,1,0,0,423.000000,...,31.0,7500.0,7.709214,0,2,0,1,0,1,0
3,48.0,70.0,1.005,4.0,0.0,1,0,1,0,117.000000,...,32.0,6700.0,3.900000,1,1,0,1,1,1,0
4,51.0,80.0,1.010,2.0,0.0,1,1,0,0,106.000000,...,35.0,7300.0,4.600000,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,1,1,0,0,140.000000,...,47.0,6700.0,4.900000,0,1,0,0,0,0,1
396,42.0,70.0,1.025,0.0,0.0,1,1,0,0,75.000000,...,54.0,7800.0,6.200000,0,1,0,0,0,0,1
397,12.0,80.0,1.020,0.0,0.0,1,1,0,0,100.000000,...,49.0,6600.0,5.400000,0,1,0,0,0,0,1
398,17.0,60.0,1.025,0.0,0.0,1,1,0,0,114.000000,...,51.0,7200.0,5.900000,0,1,0,0,0,0,1
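
Refitting a single LabelEncoder on each column overwrites the previous mapping, so the integer codes cannot be translated back afterwards. A minimal sketch, on invented data, that keeps one fitted encoder per column in a dictionary (the name `encoders` is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "appetite": ["good", "poor", "good"],
    "classification": ["ckd", "notckd", "ckd"],
})

encoders = {}
for col in ["appetite", "classification"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder for later decoding

# classes_ lists the labels in code order: index 0 -> code 0, and so on.
class_labels = list(encoders["classification"].classes_)
decoded = list(encoders["classification"].inverse_transform([0, 1]))
print(class_labels, decoded)
```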


# Task 10: Carving Pathways for Prediction.
This step prepares the "kidney_details.csv" dataset for modeling. 'ind_col' is the list of independent variables (every column except 'classification'), and 'dep_col' names the dependent variable. `train_test_split` then holds out 25% of the rows as a test set and keeps the rest for training, with a fixed `random_state` so the split is reproducible. The model never sees the test rows during training, which lets them serve as an estimate of how it performs on new patients.

In [10]:
from sklearn.model_selection import train_test_split

ind_col = [col for col in df.columns if col != 'classification']
dep_col = 'classification'

# Create our independent and dependent variables
X = df[ind_col]
Y = df[dep_col]
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0, test_size=0.25)

X_train, X_test, y_train, y_test

(          age  blood_pressure  specific_gravity  albumin  sugar  \
 250  40.00000            80.0             1.025      0.0    0.0   
 63   46.00000            70.0             1.015      1.0    0.0   
 312  80.00000            70.0             1.020      0.0    0.0   
 159  59.00000            80.0             1.010      1.0    0.0   
 283  60.00000            70.0             1.020      0.0    0.0   
 ..        ...             ...               ...      ...    ...   
 323  43.00000            80.0             1.025      0.0    0.0   
 192  46.00000           110.0             1.015      0.0    0.0   
 117  15.72964            70.0             1.020      0.0    0.0   
 47   11.00000            80.0             1.010      3.0    0.0   
 172  62.00000            80.0             1.010      1.0    2.0   
 
      red_blood_cells  pus_cell  pus_cell_clumps  bacteria  \
 250                1         1                0         0   
 63                 0         1                0         0
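
With only a few hundred rows, a purely random split can leave the two classes in different proportions in the training and test sets. `train_test_split` accepts a `stratify` argument that preserves the class ratio. The sketch below shows it on synthetic data; the arrays are invented, and whether the original split needs stratification is a judgment call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 40 rows, 24 of class 0 and 16 of class 1.
X = np.arange(80).reshape(40, 2)
y = np.array([0] * 24 + [1] * 16)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Both parts keep the 60/40 class ratio of the full data.
print(y_train.mean(), y_test.mean())
```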

# Task 11: Empowering Predictive Prowess.
This step trains an XGBClassifier on the training set, predicts labels for the test set, and measures accuracy, the fraction of test rows classified correctly. The model reaches an accuracy of 0.98 on the held-out data. Accuracy is a single summary number: it shows how often the model is right, but not which kind of error it makes when it is wrong.

In [12]:
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

model = XGBClassifier()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Accuracy on the held-out test set
acc = accuracy_score(y_test, y_pred)

print(acc)

0.98

# Task 12: Unveiling the Tapestry of Prediction.
This step looks at the individual predictions behind the accuracy score. 'y_pred2b' holds the model's predictions for the test set, and 'prediction_df' places them next to the actual labels, indexed by the original row number. Comparing the two columns row by row shows exactly which patients were misclassified, which is the starting point for examining errors and refining the model.

In [13]:
y_pred2b = model.predict(X_test)

prediction_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred2b})

prediction_df

Unnamed: 0,Actual,Predicted
132,0,0
309,1,1
341,1,1
196,0,0
246,0,0
...,...,...
146,0,0
135,0,0
390,1,1
264,1,1
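
A side-by-side table is hard to scan for errors once the test set grows. A confusion matrix summarizes the same comparison in four counts. The sketch below uses short made-up label arrays rather than the notebook's `y_test` and `y_pred2b`; 0 stands for 'ckd' and 1 for 'notckd', matching the encoded 'classification' column shown earlier.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 0 = ckd, 1 = notckd.
actual = np.array([0, 1, 1, 0, 0, 1, 0, 0])
predicted = np.array([0, 1, 1, 0, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(actual, predicted, labels=[0, 1])
print(cm)

# Positions where the prediction disagrees with the actual label.
mismatch_idx = np.flatnonzero(actual != predicted)
print(mismatch_idx)
```

In this toy case the single error is a 'ckd' case predicted as 'notckd', the kind of mistake that matters most in a screening setting.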
