<a href="https://www.kaggle.com/code/mohamdhussein/sick-detection-for-infants?scriptVersionId=214879796" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

### **Objective:** Develop a neural network model to identify potential health issues in infants.

### **Dataset:** Synthetic data including Chest X-ray results, Body O2 and CO2 levels, Birth Diseases, and Age in days.

# Importing Libraries

In [1]:
import numpy as np
import pandas as pd

# Importing Data

In [2]:
dataset = pd.read_csv('/kaggle/input/synthetic-infant-health-dataset/Synthetic-Infant-Health-Data.csv', index_col=0)
dataset.head(10)

Unnamed: 0,BirthAsphyxia,HypDistrib,HypoxiaInO2,CO2,ChestXray,Grunting,LVHreport,LowerBodyO2,RUQO2,CO2Report,XrayReport,Disease,GruntingReport,Age,LVH,DuctFlow,CardiacMixing,LungParench,LungFlow,Sick
0,no,Equal,Severe,Normal,Normal,yes,no,5-12,<5,<7.5,Asy/Patchy,TGA,no,4-10_days,no,Lt_to_Rt,Transp.,Normal,Normal,no
1,no,Equal,Moderate,High,Grd_Glass,no,no,<5,5-12,>=7.5,Grd_Glass,Fallot,no,0-3_days,no,Rt_to_Lt,Mild,Abnormal,High,no
2,no,Equal,Severe,Normal,Plethoric,no,yes,5-12,5-12,>=7.5,Normal,PFC,no,0-3_days,no,Lt_to_Rt,Complete,Normal,High,no
3,no,Equal,Moderate,Normal,Plethoric,no,no,5-12,<5,<7.5,Plethoric,PAIVS,no,0-3_days,no,,Complete,Normal,Low,no
4,no,Equal,Moderate,Normal,Plethoric,no,yes,12+,5-12,<7.5,Plethoric,PAIVS,no,0-3_days,yes,Lt_to_Rt,Complete,Normal,Normal,yes
5,no,Equal,Moderate,High,Plethoric,no,no,12+,<5,<7.5,Grd_Glass,PAIVS,no,0-3_days,no,,,Congested,High,yes
6,yes,Equal,Moderate,Normal,Plethoric,no,no,5-12,5-12,<7.5,Grd_Glass,PAIVS,no,0-3_days,no,Lt_to_Rt,Complete,Abnormal,High,no
7,no,Equal,Severe,Normal,Oligaemic,yes,no,<5,<5,<7.5,Oligaemic,Fallot,yes,0-3_days,no,,Complete,Normal,Low,yes
8,no,Equal,Severe,Normal,Plethoric,no,yes,<5,5-12,<7.5,Plethoric,TGA,no,0-3_days,no,,Mild,Normal,Low,yes
9,no,Equal,Moderate,Normal,Plethoric,no,yes,<5,<5,<7.5,Plethoric,Fallot,no,0-3_days,yes,Lt_to_Rt,Complete,Normal,High,no


# Data Review

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15000 entries, 0 to 14999
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   BirthAsphyxia   15000 non-null  object
 1   HypDistrib      15000 non-null  object
 2   HypoxiaInO2     15000 non-null  object
 3   CO2             15000 non-null  object
 4   ChestXray       15000 non-null  object
 5   Grunting        15000 non-null  object
 6   LVHreport       15000 non-null  object
 7   LowerBodyO2     15000 non-null  object
 8   RUQO2           15000 non-null  object
 9   CO2Report       15000 non-null  object
 10  XrayReport      15000 non-null  object
 11  Disease         15000 non-null  object
 12  GruntingReport  15000 non-null  object
 13  Age             15000 non-null  object
 14  LVH             15000 non-null  object
 15  DuctFlow        9311 non-null   object
 16  CardiacMixing   14273 non-null  object
 17  LungParench     15000 non-null  object
 18  LungFlow   

## Here are the descriptions for each column in the dataset:

1. **BirthAsphyxia**: Indicates whether the infant experienced birth asphyxia (a condition of oxygen deprivation at birth).
2. **HypDistrib**: Information related to hypoxemia distribution (oxygen deficiency in the blood).
3. **HypoxiaInO2**: Indicates hypoxia (low oxygen levels) related to oxygen intake.
4. **CO2**: Carbon dioxide levels in the body.
5. **ChestXray**: Results of the infant's chest X-ray.
6. **Grunting**: Indicates whether the infant is grunting, which can be a sign of respiratory distress.
7. **LVHreport**: Report on left ventricular hypertrophy (enlargement of the left ventricle of the heart).
8. **LowerBodyO2**: Oxygen levels in the lower body.
9. **RUQO2**: Oxygen levels in the right upper quadrant of the body.
10. **CO2Report**: Detailed report on carbon dioxide levels.
11. **XrayReport**: Detailed report on the chest X-ray.
12. **Disease**: The diagnosed condition.
13. **GruntingReport**: Detailed report on the infant's grunting.
14. **Age**: Age group of the infant, given as a range in days.
15. **LVH**: Indicates the presence of left ventricular hypertrophy.
16. **DuctFlow**: Information on ductal flow (related to the ductus arteriosus in the heart).
17. **CardiacMixing**: Indicates the mixing of oxygenated and deoxygenated blood in the heart.
18. **LungParench**: Information on lung parenchyma (functional tissue of the lung).
19. **LungFlow**: Information on lung blood flow.
20. **Sick**: Indicates whether the infant is sick (the prediction target).

In [4]:
for column in dataset.columns:
    print(f"Column Name: {column},","Unique Values:", dataset[column].unique())

Column Name: BirthAsphyxia, Unique Values: ['no' 'yes']
Column Name: HypDistrib, Unique Values: ['Equal' 'Unequal']
Column Name: HypoxiaInO2, Unique Values: ['Severe' 'Moderate' 'Mild']
Column Name: CO2, Unique Values: ['Normal' 'High' 'Low']
Column Name: ChestXray, Unique Values: ['Normal' 'Grd_Glass' 'Plethoric' 'Oligaemic' 'Asy/Patch']
Column Name: Grunting, Unique Values: ['yes' 'no']
Column Name: LVHreport, Unique Values: ['no' 'yes']
Column Name: LowerBodyO2, Unique Values: ['5-12' '<5' '12+']
Column Name: RUQO2, Unique Values: ['<5' '5-12' '12+']
Column Name: CO2Report, Unique Values: ['<7.5' '>=7.5']
Column Name: XrayReport, Unique Values: ['Asy/Patchy' 'Grd_Glass' 'Normal' 'Plethoric' 'Oligaemic']
Column Name: Disease, Unique Values: ['TGA' 'Fallot' 'PFC' 'PAIVS' 'TAPVD' 'Lung']
Column Name: GruntingReport, Unique Values: ['no' 'yes']
Column Name: Age, Unique Values: ['4-10_days' '0-3_days' '11-30_days']
Column Name: LVH, Unique Values: ['no' 'yes']
Column Name: DuctFlow, Uniq

## Here are the summaries for each unique value in the `ChestXray` and `Disease` columns:

### ChestXray
1. **Normal**: The chest X-ray shows no abnormalities.
2. **Grd_Glass**: "Ground Glass" opacity indicates a hazy area that reduces visibility of lung structures, often associated with various lung conditions.
3. **Plethoric**: Indicates that the lungs appear overly full of blood or fluid, which may suggest certain medical conditions like heart failure.
4. **Oligaemic**: Refers to reduced blood volume in the lungs, possibly indicating pulmonary conditions.
5. **Asy/Patch**: "Asymmetry/Patchy" refers to uneven or patchy areas on the chest X-ray, which can be indicative of infections, inflammations, or other lung conditions.

### Disease
1. **TGA (Transposition of the Great Arteries)**: A condition where the two main arteries leaving the heart are reversed.
2. **Fallot (Tetralogy of Fallot)**: A congenital heart defect that involves four heart malformations.
3. **PFC (Persistent Fetal Circulation)**: A condition where a newborn's circulation system doesn't adapt to breathing outside the womb, leading to high blood pressure in the lungs.
4. **PAIVS (Pulmonary Atresia with Intact Ventricular Septum)**: A rare heart defect where the pulmonary valve doesn't form properly, restricting blood flow to the lungs.
5. **TAPVD (Total Anomalous Pulmonary Venous Drainage)**: A congenital heart defect where the pulmonary veins don't connect normally to the left atrium.
6. **Lung**: Refers to lung-related conditions or diseases.

# Data Preprocessing

## Removing Nulls

- `DuctFlow` is 37.93% null. That is too large a share to drop the rows or to impute with a single value, so the nulls are handled through one-hot encoding.
- `CardiacMixing` is 4.85% null, so the nulls are replaced with the most frequent value.
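
The percentages above follow from the non-null counts reported by `dataset.info()` (9311 and 14273 out of 15000). A minimal sketch of the same calculation, run on a small hypothetical frame (`toy`) rather than the real `dataset`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `dataset`; only the column names come from the notebook.
toy = pd.DataFrame({
    "DuctFlow": ["Lt_to_Rt", np.nan, "Rt_to_Lt", np.nan],
    "CardiacMixing": ["Complete", "Mild", np.nan, "Complete"],
    "LungParench": ["Normal", "Abnormal", "Normal", "Congested"],
})

# isnull().mean() is the fraction of missing entries in each column.
null_percent = (toy.isnull().mean() * 100).round(2)

# Keep only the columns that actually contain nulls.
null_percent = null_percent[null_percent > 0]
print(null_percent)
```

Replacing `toy` with `dataset` should report only `DuctFlow` and `CardiacMixing`, since every other column is fully populated.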

In [5]:
# Replace nulls in CardiacMixing with the most frequent value (mode)
most_frequent_value = dataset['CardiacMixing'].mode()[0]
print("The most_frequent_value:", most_frequent_value)
# Assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas
dataset['CardiacMixing'] = dataset['CardiacMixing'].fillna(most_frequent_value)

# Verify the changes
print("Number of Nulls:", dataset['CardiacMixing'].isnull().sum())

The most_frequent_value: Complete
Number of Nulls: 0


## Encoding data

### Ordinal encoding for ranked categories

In [6]:
for column in dataset.columns:
    if column in ['BirthAsphyxia', 'Grunting', 'LVHreport', 'GruntingReport', 'LVH', 'Sick']:
        dataset[column] = [1 if x == 'yes' else 0 for x in dataset[column]]
    elif column == 'HypDistrib':
        dataset[column] = [1 if x == 'Equal' else 0 for x in dataset[column]]
    elif column == 'HypoxiaInO2':
        dataset[column] = [2 if x == 'Severe' else 1 if x == 'Moderate' else 0 for x in dataset[column]]
    elif column in ['CO2', 'LungFlow']:
        dataset[column] = [2 if x == 'High' else 1 if x == 'Normal' else 0 for x in dataset[column]]
    elif column in ['LowerBodyO2', 'RUQO2']:
        dataset[column] = [2 if x == '12+' else 1 if x == '5-12' else 0 for x in dataset[column]]
    elif column == 'CO2Report':
        dataset[column] = [1 if x == '>=7.5' else 0 for x in dataset[column]]
    elif column == 'Age':
        dataset[column] = [2 if x == '11-30_days' else 1 if x == '4-10_days' else 0 for x in dataset[column]]
    elif column == 'CardiacMixing':
        dataset[column] = [2 if x == 'Complete' else 1 if x == 'Mild' else 0 for x in dataset[column]]
    elif column == 'LungParench':
        dataset[column] = [2 if x == 'Congested' else 1 if x == 'Abnormal' else 0 for x in dataset[column]]
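
Each branch above sends any label it does not name to 0, so a misspelled or unexpected category would be encoded without warning. One alternative, sketched here on a hypothetical frame (`toy`) with the same ranks, is an explicit mapping table per column, which makes unmapped labels visible as `NaN`:

```python
import pandas as pd

# Hypothetical sample; the category labels are the ones shown in the notebook output.
toy = pd.DataFrame({
    "HypoxiaInO2": ["Severe", "Moderate", "Mild"],
    "CO2": ["Normal", "High", "Low"],
    "Age": ["4-10_days", "0-3_days", "11-30_days"],
})

# One explicit rank table per column (same ranks as the cell above).
ordinal_maps = {
    "HypoxiaInO2": {"Mild": 0, "Moderate": 1, "Severe": 2},
    "CO2": {"Low": 0, "Normal": 1, "High": 2},
    "Age": {"0-3_days": 0, "4-10_days": 1, "11-30_days": 2},
}

encoded = toy.copy()
for column, mapping in ordinal_maps.items():
    encoded[column] = toy[column].map(mapping)
    # map() yields NaN for labels missing from the table, so they surface here
    # instead of being silently encoded as 0.
    assert encoded[column].notna().all(), f"unmapped label in {column}"

print(encoded)
```

The remaining columns would need their own entries in `ordinal_maps`; the three shown are only an illustration.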


### Label Encoding

In [7]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
encoder = LabelEncoder()

# Columns to encode
columns_to_encode = ['Disease', 'XrayReport', 'ChestXray']

# Apply label encoding to each specified column
for column in columns_to_encode:
    dataset[column] = encoder.fit_transform(dataset[column])
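
Because the loop refits a single `encoder` on each column, only the mapping for the last column is still available afterwards. If the integer codes need to be turned back into labels later, one option is to keep a separate encoder per column. A small sketch on hypothetical data (`toy` and `encoders` are illustrative names, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample using labels that appear in the notebook output.
toy = pd.DataFrame({
    "Disease": ["TGA", "Fallot", "PFC", "Fallot"],
    "ChestXray": ["Normal", "Plethoric", "Normal", "Oligaemic"],
})

# Keep one fitted encoder per column so each mapping can be inspected or reversed.
encoders = {}
for column in ["Disease", "ChestXray"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# classes_ lists the original labels in the order of their integer codes.
print(encoders["Disease"].classes_)

# inverse_transform recovers the original labels from the codes.
disease_labels = encoders["Disease"].inverse_transform(toy["Disease"])
print(disease_labels)
```

Note that `LabelEncoder` assigns codes in sorted label order, so the resulting integers carry no clinical ranking.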

### One-hot encoding for DuctFlow to handle Nulls

In [8]:
# One-hot encode the 'DuctFlow' column 
dataset = pd.get_dummies(dataset, columns=['DuctFlow'], prefix='DuctFlow').astype('float32')
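
With its default settings, `pd.get_dummies` creates no column for missing values, so a row with a null `DuctFlow` ends up with 0 in every `DuctFlow_*` column. The sketch below illustrates this on a hypothetical three-row frame and also shows the `dummy_na=True` option, which adds an explicit indicator column instead:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the second row has a missing DuctFlow value.
toy = pd.DataFrame({
    "DuctFlow": ["Lt_to_Rt", np.nan, "Rt_to_Lt"],
    "Sick": [0, 1, 1],
})

# Same call as in the notebook: the null row becomes all zeros.
dummies = pd.get_dummies(toy, columns=["DuctFlow"], prefix="DuctFlow").astype("float32")
print(dummies)

# Optional variant: dummy_na=True adds one extra column that flags the nulls.
with_na = pd.get_dummies(
    toy, columns=["DuctFlow"], prefix="DuctFlow", dummy_na=True
).astype("float32")
print(with_na)
```

Either encoding lets the model distinguish missing from observed values; the default form simply uses one fewer column.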

In [9]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15000 entries, 0 to 14999
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   BirthAsphyxia      15000 non-null  float32
 1   HypDistrib         15000 non-null  float32
 2   HypoxiaInO2        15000 non-null  float32
 3   CO2                15000 non-null  float32
 4   ChestXray          15000 non-null  float32
 5   Grunting           15000 non-null  float32
 6   LVHreport          15000 non-null  float32
 7   LowerBodyO2        15000 non-null  float32
 8   RUQO2              15000 non-null  float32
 9   CO2Report          15000 non-null  float32
 10  XrayReport         15000 non-null  float32
 11  Disease            15000 non-null  float32
 12  GruntingReport     15000 non-null  float32
 13  Age                15000 non-null  float32
 14  LVH                15000 non-null  float32
 15  CardiacMixing      15000 non-null  float32
 16  LungParench        15000 no

# Model Training

## Split the data 

In [10]:
from sklearn.model_selection import train_test_split

X = dataset.drop("Sick", axis=1)
y = dataset["Sick"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
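
The split above is purely random. When the target classes are imbalanced, passing `stratify=y` keeps the class ratio the same in both parts. This is an optional variation, not what the notebook ran; the sketch uses synthetic stand-ins (`X_toy`, `y_toy`) for `X` and `y`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y, with roughly 25% positive labels.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
y_toy = pd.Series((rng.random(1000) < 0.25).astype("float32"), name="Sick")

# stratify=y_toy preserves the positive rate in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```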

## TensorFlow

In [49]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Define the model
model = Sequential([
    Input(shape=(X_train.shape[1],)), 
    Dense(64, activation='sigmoid'),
    Dense(32, activation='sigmoid'),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.2f}')

Epoch 1/10
300/300 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.7623 - loss: 0.5515 - val_accuracy: 0.7600 - val_loss: 0.5434
Epoch 2/10
300/300 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7554 - loss: 0.5469 - val_accuracy: 0.7588 - val_loss: 0.5413
Epoch 3/10
300/300 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7521 - loss: 0.5467 - val_accuracy: 0.7588 - val_loss: 0.5396
Epoch 4/10
300/300 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7532 - loss: 0.5419 - val_accuracy: 0.7592 - val_loss: 0.5422
Epoch 5/10
300/300 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7579 - loss: 0.5311 - val_accuracy: 0.7592 - val_loss: 0.5391
Epoch 6/10
300/300 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7579 - loss: 0.5360 - val_accuracy: 0.7592 - val_loss: 0.5398
Epoch 7/10
300/300

## PyTorch

In [51]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data_utils
import pandas as pd

# Assuming X_train, y_train, X_test, y_test are DataFrames
train_dataset = data_utils.TensorDataset(torch.tensor(X_train.values, dtype=torch.float32), torch.tensor(y_train.values, dtype=torch.long))
test_dataset = data_utils.TensorDataset(torch.tensor(X_test.values, dtype=torch.float32), torch.tensor(y_test.values, dtype=torch.long))

train_loader = data_utils.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = data_utils.DataLoader(test_dataset, batch_size=32, shuffle=False)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        x = torch.sigmoid(self.fc3(x)) 
        return x

net = Net()
criterion = nn.BCELoss()  # use binary cross-entropy loss
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

# Function to calculate accuracy
def calculate_accuracy(loader):
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in loader:
            outputs = net(inputs)
            predicted = (outputs > 0.5).float().squeeze()  # convert outputs to binary predictions and squeeze to match labels
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100 * correct / total

# Train the Model
for epoch in range(10):  # number of epochs
    running_loss = 0.0
    net.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = net(inputs).squeeze()
        loss = criterion(outputs, labels.float())
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    
    train_accuracy = calculate_accuracy(train_loader)
    
    print(f'Epoch {epoch + 1}, Loss: {running_loss / len(train_loader):.3f}, Train Accuracy: {train_accuracy:.2f}%')

print('Finished Training')

# Calculate Test Accuracy
net.eval()
test_accuracy = calculate_accuracy(test_loader)
print(f'Test Accuracy: {test_accuracy:.2f}%')

Epoch 1, Loss: 0.554, Train Accuracy: 75.99%
Epoch 2, Loss: 0.552, Train Accuracy: 75.99%
Epoch 3, Loss: 0.552, Train Accuracy: 75.99%
Epoch 4, Loss: 0.552, Train Accuracy: 75.99%
Epoch 5, Loss: 0.551, Train Accuracy: 75.99%
Epoch 6, Loss: 0.550, Train Accuracy: 75.99%
Epoch 7, Loss: 0.551, Train Accuracy: 75.99%
Epoch 8, Loss: 0.549, Train Accuracy: 75.99%
Epoch 9, Loss: 0.548, Train Accuracy: 75.99%
Epoch 10, Loss: 0.547, Train Accuracy: 75.99%
Finished Training
Test Accuracy: 77.17%
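
The training accuracy is identical in every epoch, which is what one would see if the network predicted the same class for every sample. One way to check is to compare the reported accuracy with the share of the most frequent class. A minimal sketch with hypothetical labels; `majority_baseline_accuracy` is an illustrative helper, and on the notebook's data `labels` would be `y_train` or `y_test`:

```python
import pandas as pd

def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy (in %) obtained by always predicting the most frequent class."""
    return 100 * labels.value_counts(normalize=True).max()

# Hypothetical labels: 9 negatives and 3 positives.
labels = pd.Series([0.0] * 9 + [1.0] * 3)
print(f"Majority-class baseline: {majority_baseline_accuracy(labels):.2f}%")
```

If the baseline matches the model's accuracy, the model has not yet learned anything beyond the class prior.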
