# Attribute Definitions

- **ID**: A unique identifier assigned to each patient.
- **No_Pation**: The patient's assigned number within the dataset.
- **Gender**: The sex of the patient, typically categorized as male or female.
- **AGE**: The age of the patient in years.
- **Urea**: The concentration of urea in the blood, indicating kidney function.
- **Cr (Creatinine ratio)**: A measure of creatinine levels in the blood, used to assess kidney function.
- **HbA1c**: Glycated hemoglobin percentage, reflecting average blood sugar levels over the past two to three months.
- **Chol (Cholesterol)**: The total cholesterol level in the blood, important for assessing cardiovascular health.
- **TG (Triglycerides)**: The level of triglycerides in the blood, a type of fat linked to heart disease risk.
- **HDL (High-Density Lipoprotein)**: Often referred to as "good" cholesterol; higher levels are typically better.
- **LDL (Low-Density Lipoprotein)**: Known as "bad" cholesterol; higher levels can lead to plaque buildup in arteries.
- **VLDL (Very Low-Density Lipoprotein)**: Another type of "bad" cholesterol; elevated levels can contribute to plaque formation.
- **BMI (Body Mass Index)**: A value derived from the weight and height of the patient, used to categorize underweight, normal weight, overweight, and obesity.
- **CLASS**: The patient's diabetes status. The raw data uses `N` (Non-Diabetic), `P` (Pre-Diabetic), and `Y` (Diabetic); `Y` is renamed to `D` during cleaning.

# Feature Relationships in the Dataset

- **1. Gender**
  - **BMI**: Men and women may show different average BMI levels due to physiological differences.
  - **Chol, HDL, LDL**: Cholesterol levels might differ by gender due to hormonal factors.
  - **Class**: The prevalence of diabetes may vary between genders.

- **2. Age**
  - **Class**: Older individuals are more likely to have diabetes or pre-diabetes (Class `D` or `P`).
  - **Chol, TG, HDL, LDL**: Cholesterol and triglyceride levels tend to increase with age, which may indicate a higher risk of cardiovascular issues.
  - **HbA1c**: Glycated hemoglobin may increase with age due to a decline in glucose metabolism efficiency.
  - **BMI**: BMI might increase or decrease with age depending on lifestyle and health.

- **3. Urea**
  - **Cr (Creatinine)**: Both are indicators of kidney function; higher levels suggest kidney issues.
  - **Class**: Diabetic patients may have higher urea levels due to potential kidney damage from prolonged diabetes.
  - **Age**: Older individuals may show higher urea levels as kidney function declines with age.

- **4. Cr (Creatinine Ratio)**
  - **Urea**: Often correlated, as both indicate kidney function.
  - **Class**: Diabetic patients with poor kidney function may have higher creatinine levels.
  - **BMI**: High BMI can stress the kidneys, potentially leading to higher creatinine levels.

- **5. HbA1c**
  - **Class**: Strongly related, as HbA1c is a key marker for diagnosing diabetes. Higher values indicate diabetes (Class `D`) or pre-diabetes (Class `P`).
  - **Age**: May increase with age, as older individuals are more prone to glucose regulation issues.
  - **BMI**: Higher BMI is associated with poor glucose metabolism, leading to increased HbA1c levels.
  - **Chol, TG, LDL**: Poor glucose control often correlates with abnormal lipid profiles.

- **6. Chol (Cholesterol)**
  - **HDL, LDL, TG, VLDL**: Total cholesterol is made up of the HDL, LDL, and VLDL fractions, so high LDL and VLDL usually mean higher total cholesterol; HDL also counts toward the total but is considered protective.
  - **Class**: High cholesterol is often seen in diabetics due to poor lipid metabolism.
  - **Age**: Cholesterol levels typically increase with age.
  - **BMI**: Obesity often leads to higher cholesterol levels.

- **7. TG (Triglycerides)**
  - **Chol, VLDL**: VLDL particles carry most circulating triglycerides, so TG and VLDL tend to rise together, and VLDL in turn contributes to total cholesterol.
  - **Class**: High triglycerides are common in diabetic patients.
  - **BMI**: High BMI is associated with elevated triglycerides.
  - **Age**: Levels may increase with age due to slower metabolism.

- **8. HDL (High-Density Lipoprotein)**
  - **Chol, LDL, TG**: HDL is inversely related to LDL and TG; higher HDL is considered protective.
  - **Class**: Lower HDL is often observed in diabetics.
  - **Gender**: Women may have higher HDL levels compared to men due to hormonal differences.
  - **BMI**: Higher BMI is associated with lower HDL levels.

- **9. LDL (Low-Density Lipoprotein)**
  - **Chol, HDL**: LDL contributes significantly to total cholesterol; higher LDL often coincides with lower HDL.
  - **Class**: High LDL levels are more common in diabetics.
  - **Age**: LDL levels typically increase with age.
  - **BMI**: Obesity often correlates with higher LDL levels.

- **10. VLDL (Very Low-Density Lipoprotein)**
  - **TG, Chol**: VLDL is primarily composed of triglycerides and contributes to total cholesterol.
  - **Class**: High VLDL levels are often seen in diabetics.
  - **BMI**: VLDL tends to be higher in individuals with high BMI.

- **11. BMI (Body Mass Index)**
  - **Class**: Higher BMI is strongly associated with diabetes risk (Class `P` and `D`).
  - **HbA1c**: Poor glucose control is often observed in individuals with high BMI.
  - **Chol, TG, LDL**: High BMI is linked to poor lipid profiles.
  - **Age**: BMI trends may vary with age due to lifestyle and health factors.

- **12. Class**
  - **HbA1c**: A direct diagnostic criterion for diabetes; higher HbA1c values correspond to Class `P` or `D`.
  - **BMI**: Obesity increases the risk of diabetes (Class `P` or `D`).
  - **Chol, TG, LDL, HDL, VLDL**: Lipid metabolism disorders are common in diabetic patients.
  - **Age**: Older individuals are more likely to fall into Class `P` or `D`.
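
The relationships listed above are clinical expectations rather than findings from this dataset, so they are worth checking against the data. A minimal sketch, assuming the cleaned column names used later in this notebook (`Urea`, `Cr`, `HbA1c`, `Class`); the `toy_df` values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data (not from the dataset), using the notebook's column names
toy_df = pd.DataFrame({
    "Urea":  [4.0, 4.5, 5.0, 6.5, 8.0, 9.5],
    "Cr":    [45, 50, 58, 70, 95, 120],
    "HbA1c": [4.8, 5.1, 6.0, 6.2, 9.5, 11.0],
    "Class": ["N", "N", "P", "P", "D", "D"],
})

# Rank-based (Spearman) correlation is less sensitive to extreme values than Pearson
urea_cr_corr = toy_df[["Urea", "Cr"]].corr(method="spearman").loc["Urea", "Cr"]

# Mean HbA1c per class, to see whether it rises from N to P to D
hba1c_by_class = toy_df.groupby("Class")["HbA1c"].mean()

print(f"Spearman correlation (Urea, Cr): {urea_cr_corr:.2f}")
print(hba1c_by_class)
```

The same two calls could be run on the cleaned DataFrame once the `Class` labels have been normalized.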

# Importing Libraries and Loading Data

Libraries

In [2]:
# Essentials
#import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Warnings
import warnings
warnings.filterwarnings('ignore')

Data

In [3]:
iraqi_df = pd.read_csv("C:/Users/dahab/OneDrive/Desktop/T2D-Prediction-System--Data-Fusion-for-Enhanced-Decision-Making/datasets/clinical/Iraq_clinical_dataset.csv")

# Exploratory Data Analysis and Cleaning

First 5 rows

In [4]:
iraqi_df.head()

,ID,No_Pation,Gender,AGE,Urea,Cr,HbA1c,Chol,TG,HDL,LDL,VLDL,BMI,CLASS
0,502,17975,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
1,735,34221,M,26,4.5,62,4.9,3.7,1.4,1.1,2.1,0.6,23.0,N
2,420,47975,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
3,680,87656,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
4,504,34223,M,33,7.1,46,4.9,4.9,1.0,0.8,2.0,0.4,21.0,N


Dropping redundant columns

In [5]:
iraqi_df = iraqi_df.drop(['ID', 'No_Pation'], axis=1)
iraqi_df.head(1)

,Gender,AGE,Urea,Cr,HbA1c,Chol,TG,HDL,LDL,VLDL,BMI,CLASS
0,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N


Renaming some columns

In [6]:
# Rename columns
iraqi_df = iraqi_df.rename(columns={'AGE': 'Age', 'CLASS': 'Class'})

# Display the updated DataFrame columns
print(iraqi_df.columns)

Index(['Gender', 'Age', 'Urea', 'Cr', 'HbA1c', 'Chol', 'TG', 'HDL', 'LDL',
       'VLDL', 'BMI', 'Class'],
      dtype='object')


Checking Duplicates

In [7]:
iraqi_df.duplicated().sum()

169

Dropping Duplicates

In [8]:
iraqi_df = iraqi_df.drop_duplicates()
iraqi_df.duplicated().sum()

0

Number of Columns and Rows

In [9]:
iraqi_df.shape

(831, 12)

Concise Summary of Data

In [10]:
iraqi_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 831 entries, 0 to 999
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Gender  831 non-null    object 
 1   Age     831 non-null    int64  
 2   Urea    831 non-null    float64
 3   Cr      831 non-null    int64  
 4   HbA1c   831 non-null    float64
 5   Chol    831 non-null    float64
 6   TG      831 non-null    float64
 7   HDL     831 non-null    float64
 8   LDL     831 non-null    float64
 9   VLDL    831 non-null    float64
 10  BMI     831 non-null    float64
 11  Class   831 non-null    object 
dtypes: float64(8), int64(2), object(2)
memory usage: 84.4+ KB


Unique Values of Categorical Columns

In [11]:
# Identify categorical columns
categorical_columns = iraqi_df.select_dtypes(include=['object']).columns

# Display unique values for each categorical column
for column in categorical_columns:
    unique_values = iraqi_df[column].unique()
    print(f"Unique values in '{column}' column: {unique_values}")

Unique values in 'Gender' column: ['F' 'M' 'f']
Unique values in 'Class' column: ['N' 'N ' 'P' 'Y' 'Y ']


Example of each

In [12]:
# Identify categorical columns
categorical_columns = iraqi_df.select_dtypes(include=['object']).columns

# Display one example for each unique value in each categorical column
for column in categorical_columns:
    print(f"Examples for column: {column}")
    examples = iraqi_df.groupby(column).first()
    print(examples)
    print("\n")

Examples for column: Gender
        Age  Urea  Cr  HbA1c  Chol   TG  HDL  LDL  VLDL   BMI Class
Gender                                                             
F        50   4.7  46    4.9   4.2  0.9  2.4  1.4   0.5  24.0     N
M        26   4.5  62    4.9   3.7  1.4  1.1  2.1   0.6  23.0     N
f        55   4.1  34   13.9   5.4  1.6  1.6  3.1   0.7  33.0     Y


Examples for column: Class
      Gender  Age  Urea  Cr  HbA1c  Chol   TG  HDL  LDL  VLDL   BMI
Class                                                              
N          F   50   4.7  46    4.9   4.2  0.9  2.4  1.4   0.5  24.0
N          M   38   6.1  83    5.4   4.5  1.7  0.9  2.8   0.8  24.6
P          M   34   3.9  81    6.0   6.2  3.9  0.8  1.9   1.8  23.0
Y          M   31   3.0  60   12.3   4.1  2.2  0.7  2.4  15.4  37.2
Y          M   31   3.0  60   12.3   4.1  2.2  0.7  2.4  15.4  37.2




Unique Values of Categorical Columns after cleanup

In [13]:
# Clean the 'Gender' column
iraqi_df['Gender'] = iraqi_df['Gender'].str.strip().str.upper()

# Clean the 'Class' column
iraqi_df['Class'] = iraqi_df['Class'].str.strip()

# Verify the unique values after cleanup
print("Unique values in 'Gender' column after cleanup:", iraqi_df['Gender'].unique())
print("Unique values in 'Class' column after cleanup:", iraqi_df['Class'].unique())

Unique values in 'Gender' column after cleanup: ['F' 'M']
Unique values in 'Class' column after cleanup: ['N' 'P' 'Y']


Renaming Values in the `Class` Column

In [14]:
# Replace 'Y' with 'D' in the 'Class' column
iraqi_df['Class'] = iraqi_df['Class'].replace('Y', 'D')

# Verify the unique values after the change
print("Unique values in the 'Class' column after replacing Y with D:", iraqi_df['Class'].unique())

Unique values in the 'Class' column after replacing Y with D: ['N' 'P' 'D']


Value Counts of `Class` column

In [15]:
iraqi_df['Class'].value_counts()

Class
D    695
N     96
P     40
Name: count, dtype: int64
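
The counts above are easier to compare as percentages. A small sketch that reuses the three counts printed at this point in the notebook (before the second round of duplicate removal); the variable names are illustrative:

```python
import pandas as pd

# Counts copied from the value_counts() output above
class_counts = pd.Series({"D": 695, "N": 96, "P": 40}, name="count")

# Share of each class as a percentage of all rows
class_share = (class_counts / class_counts.sum() * 100).round(1)
print(class_share)
```

Class `D` accounts for roughly 84% of the rows, so the classes are heavily imbalanced, which is worth keeping in mind when the data is later split or modelled.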

Checking duplicates again

In [16]:
iraqi_df.duplicated().sum()

5

Dropping Duplicates

In [17]:
iraqi_df = iraqi_df.drop_duplicates()
iraqi_df.duplicated().sum()

0
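
New duplicates appear here because rows that differed only in label spelling (`'F'` vs `'f'`, `'Y'` vs `'Y '`) become identical once the labels are normalized. A minimal sketch of that effect on hypothetical toy rows (not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy rows: identical measurements, labels differ only by case/whitespace
toy_df = pd.DataFrame({
    "Gender": ["F", "f", "M"],
    "HbA1c":  [8.1, 8.1, 5.0],
    "Class":  ["Y", "Y ", "N"],
})

# Before cleanup, 'F'/'f' and 'Y'/'Y ' still count as different values
before = int(toy_df.duplicated().sum())

# Same normalization as in the notebook
toy_df["Gender"] = toy_df["Gender"].str.strip().str.upper()
toy_df["Class"] = toy_df["Class"].str.strip()

# After cleanup, the first two rows are identical
after = int(toy_df.duplicated().sum())

print(before, after)
```

Normalizing the categorical labels before the first `drop_duplicates()` call would catch these rows in a single pass.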

Checking Missing Values

In [18]:
iraqi_df.isnull().sum().sum()

0

Statistics of the Dataset

In [19]:
iraqi_df.describe()

,Age,Urea,Cr,HbA1c,Chol,TG,HDL,LDL,VLDL,BMI
count,826.0,826.0,826.0,826.0,826.0,826.0,826.0,826.0,826.0,826.0
mean,53.490315,5.184677,69.024213,8.326344,4.898208,2.39937,1.211804,2.590061,1.774576,29.459274
std,8.808427,3.077831,59.557108,2.602589,1.328812,1.45685,0.67961,1.132863,3.517931,4.996676
min,20.0,0.5,6.0,0.9,0.0,0.3,0.2,0.3,0.1,19.0
25%,51.0,3.615,48.0,6.5,4.0,1.5,0.9,1.7,0.7,26.0
50%,55.0,4.6,59.0,8.1,4.8,2.015,1.1,2.5,1.0,30.0
75%,59.0,5.7,73.0,10.2,5.6,3.0,1.3,3.3,1.5,33.0
max,79.0,38.9,800.0,16.0,10.3,13.8,9.9,9.9,35.0,47.75
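
Some extremes in this summary stand out, for example a minimum `Chol` of 0.0 and a maximum `Cr` of 800 against a 75th percentile of 73. One common way to flag such values for review is the 1.5 × IQR rule. The sketch below applies it to hypothetical values; `iqr_outlier_mask` is an illustrative helper, not part of the notebook, and whether a flagged value is an error or a genuine measurement needs clinical judgment:

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical creatinine-like values; 800 mimics the maximum seen in the summary
toy_cr = pd.Series([46, 48, 55, 59, 62, 70, 73, 800], name="Cr")

mask = iqr_outlier_mask(toy_cr)
print(toy_cr[mask])
```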


# Feature Engineering

Label Encoding for `Gender` column

In [20]:
# Label encode Gender
iraqi_df['Gender_Encoded'] = iraqi_df['Gender'].astype('category').cat.codes

# Drop the original Gender column
iraqi_df.drop(['Gender'], axis=1, inplace=True)

# Output the modified dataset
iraqi_df.head()

,Age,Urea,Cr,HbA1c,Chol,TG,HDL,LDL,VLDL,BMI,Class,Gender_Encoded
0,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N,0
1,26,4.5,62,4.9,3.7,1.4,1.1,2.1,0.6,23.0,N,1
4,33,7.1,46,4.9,4.9,1.0,0.8,2.0,0.4,21.0,N,1
5,45,2.3,24,4.0,2.9,1.0,1.0,1.5,0.4,21.0,N,0
6,50,2.0,50,4.0,3.6,1.3,0.9,2.1,0.6,24.0,N,0


Label Encoding for `Class` column

In [21]:
# Label encode Class
iraqi_df['Class_Encoded'] = iraqi_df['Class'].astype('category').cat.codes

# Drop the original Class column
iraqi_df.drop(['Class'], axis=1, inplace=True)

# Output the modified dataset
iraqi_df.head()

,Age,Urea,Cr,HbA1c,Chol,TG,HDL,LDL,VLDL,BMI,Gender_Encoded,Class_Encoded
0,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,0,1
1,26,4.5,62,4.9,3.7,1.4,1.1,2.1,0.6,23.0,1,1
4,33,7.1,46,4.9,4.9,1.0,0.8,2.0,0.4,21.0,1,1
5,45,2.3,24,4.0,2.9,1.0,1.0,1.5,0.4,21.0,0,1
6,50,2.0,50,4.0,3.6,1.3,0.9,2.1,0.6,24.0,0,1
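
Because the original `Gender` and `Class` columns are dropped, it helps to record which integer stands for which label. `cat.codes` numbers the categories in sorted order, which is why `N` appears as 1 above. A minimal sketch on hypothetical labels standing in for the cleaned `Class` column:

```python
import pandas as pd

# Hypothetical labels standing in for the cleaned Class column
labels = pd.Series(["N", "P", "D", "D", "N"])

as_category = labels.astype("category")
codes = as_category.cat.codes

# Categories are sorted, so codes follow the alphabetical order of the labels
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)  # {0: 'D', 1: 'N', 2: 'P'}

# Decoding with the mapping restores the original labels
decoded = codes.map(code_to_label)
```

Saving a mapping like `code_to_label` next to the processed file would make the encoded columns easier to interpret later.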


# Data Visualization

Count of Genders

In [22]:
# Create a temporary DataFrame for plotting with original gender labels
temp_df = iraqi_df.copy()
temp_df['Gender_Plot'] = temp_df['Gender_Encoded'].map({0: 'F', 1: 'M'})

# Plot Gender counts with specific colors and display counts on bars
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='Gender_Plot', data=temp_df, palette=['pink', 'skyblue'])

# Adding counts on top of each bar
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=12, color='black', xytext=(0, 5),
                textcoords='offset points')

# Add labels and title
plt.title('Count of Genders', fontsize=16)
plt.xlabel('Gender', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.show()

Distribution of Numerical Features

In [23]:
# Select numerical columns from the dataset
numerical_features = iraqi_df.select_dtypes(include=['float64', 'int64']).columns

# Set up the figure for multiple subplots
plt.figure(figsize=(15, 15))
for i, column in enumerate(numerical_features, 1):
    plt.subplot(len(numerical_features) // 3 + 1, 3, i)  # Create subplots dynamically
    sns.histplot(iraqi_df[column], kde=True, color='skyblue', bins=30)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')

plt.tight_layout()  # Adjust subplots to fit into figure area neatly
plt.show()

Correlation Matrix of Numerical Features

In [24]:
# Filter numeric columns only
numeric_columns = iraqi_df.select_dtypes(include=['float64', 'int64'])

# Compute the correlation matrix
correlation_matrix = numeric_columns.corr()

# Display the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)

# Add a title
plt.title("Correlation Matrix of Numerical Features", fontsize=16)
plt.show()

Gender Distribution Across Classes

In [25]:
# Reconstruct the original Class labels from the label-encoded column
# (cat.codes follows sorted category order: D=0, N=1, P=2)
temp_df = iraqi_df.copy()
temp_df['Class_Plot'] = temp_df['Class_Encoded'].map({0: 'D', 1: 'N', 2: 'P'})

# Map encoded Gender values back to original labels
temp_df['Gender_Plot'] = temp_df['Gender_Encoded'].map({0: 'F', 1: 'M'})

# Create a grouped bar plot
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='Class_Plot', hue='Gender_Plot', data=temp_df, palette=['pink', 'skyblue'])

# Add counts on top of each bar
for p in ax.patches:
    ax.annotate(
        f'{int(p.get_height())}',  # Text to display (bar height as integer)
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position at the center top of the bar
        ha='center',  # Align horizontally at center
        va='center',  # Align vertically
        fontsize=10,  # Font size
        color='black',  # Text color
        xytext=(0, 5),  # Offset from the top of the bar
        textcoords='offset points'  # Use offset positioning
    )

# Add labels and title
plt.title('Gender Distribution Across Classes', fontsize=16)
plt.xlabel('Class', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Gender')
plt.show()

Pairplot of HDL, LDL, and VLDL Grouped by Class

In [26]:
# Create a temporary DataFrame for visualization
temp_df = iraqi_df.copy()

# Reconstruct the Class labels from the label-encoded column
# (cat.codes follows sorted category order: D=0, N=1, P=2)
temp_df['Class_Reconstructed'] = temp_df['Class_Encoded'].map({0: 'D', 1: 'N', 2: 'P'})

# Keep only the columns needed for the plot and drop rows with missing values
temp_df = temp_df[['HDL', 'LDL', 'VLDL', 'Class_Reconstructed']].dropna()

# Create a pair plot with Seaborn
sns.pairplot(
    temp_df,
    vars=['HDL', 'LDL', 'VLDL'],  # Numerical variables to plot
    hue='Class_Reconstructed',    # Group by Class
    palette='Set2',               # Set color palette
    diag_kind='kde',              # Kernel density estimation for diagonal
)

# Add a title
plt.suptitle('Pairplot of HDL, LDL, and VLDL Grouped by Class', y=1.02, fontsize=16)

# Show the plot
plt.show()

Urea Values Across Classes

In [27]:
# Reconstruct the Class labels from the label-encoded column
# (cat.codes follows sorted category order: D=0, N=1, P=2)
temp_df = iraqi_df.copy()
temp_df['Class_Reconstructed'] = temp_df['Class_Encoded'].map({0: 'D', 1: 'N', 2: 'P'})

# Create the swarm plot for Urea grouped by reconstructed Class
plt.figure(figsize=(8, 6))
sns.swarmplot(x='Class_Reconstructed', y='Urea', data=temp_df, palette='husl', alpha=0.8)

# Add labels and title
plt.title('Urea Values Across Classes', fontsize=16)
plt.xlabel('Class', fontsize=12)
plt.ylabel('Urea', fontsize=12)
plt.show()

Joint Plot of Cr vs Urea

In [28]:
# Reconstruct the Class labels from the label-encoded column
# (cat.codes follows sorted category order: D=0, N=1, P=2)
temp_df = iraqi_df.copy()
temp_df['Class_Reconstructed'] = temp_df['Class_Encoded'].map({0: 'D', 1: 'N', 2: 'P'})

# Create a joint plot for Cr vs Urea, grouped by reconstructed Class
sns.jointplot(
    data=temp_df,
    x='Cr',
    y='Urea',
    hue='Class_Reconstructed',
    kind='scatter',
    palette='Set1',
    height=8
)

# Add a title
plt.suptitle('Joint Plot of Cr vs Urea', y=1.02, fontsize=16)
plt.show()

HbA1c Values Across Classes

In [29]:
# Reconstruct the Class labels from the label-encoded column
# (cat.codes follows sorted category order: D=0, N=1, P=2)
temp_df = iraqi_df.copy()
temp_df['Class_Reconstructed'] = temp_df['Class_Encoded'].map({0: 'D', 1: 'N', 2: 'P'})

# Create the swarm plot for HbA1c grouped by reconstructed Class
plt.figure(figsize=(8, 6))
sns.swarmplot(x='Class_Reconstructed', y='HbA1c', data=temp_df, palette='husl', alpha=0.8)

# Add labels and title
plt.title('HbA1c Values Across Classes', fontsize=16)
plt.xlabel('Class', fontsize=12)
plt.ylabel('HbA1c', fontsize=12)
plt.show()

Pairwise Relationships Between Features Grouped by Class

In [30]:
# Reconstruct the Class labels from the label-encoded column
# (cat.codes follows sorted category order: D=0, N=1, P=2)
temp_df = iraqi_df.copy()
temp_df['Class_Reconstructed'] = temp_df['Class_Encoded'].map({0: 'D', 1: 'N', 2: 'P'})

# Select numerical columns for pair plot (the int8 encoded columns are excluded)
numeric_columns = iraqi_df.select_dtypes(include=['float64', 'int64']).columns  # Index of column names

# Create the pair plot for numerical features grouped by reconstructed Class
sns.pairplot(
    temp_df,
    vars=numeric_columns,  # Numerical features to include
    hue='Class_Reconstructed',  # Group by reconstructed Class
    diag_kind='kde',  # Kernel density estimation on diagonals
    palette='Set2',  # Color palette
    height=3
)

# Add a title
plt.suptitle('Pairwise Relationships Between Features Grouped by Class', y=1.02, fontsize=16)

# Show the plot
plt.show()

# Saving Processed CSV File

In [31]:
# Define the exact output file path
output_file_path = r'C:\Users\dahab\OneDrive\Desktop\T2D-Prediction-System--Data-Fusion-for-Enhanced-Decision-Making\processed_datasets\clinical\Iraq_pro.csv'

# Save the updated DataFrame to a CSV file
iraqi_df.to_csv(output_file_path, index=False)

print(f"Updated dataset saved to: {output_file_path}")

Updated dataset saved to: C:\Users\dahab\OneDrive\Desktop\T2D-Prediction-System--Data-Fusion-for-Enhanced-Decision-Making\processed_datasets\clinical\Iraq_pro.csv


# Conclusion