AI-Based Early Prediction of Non-Communicable Diseases (NCDs)
Introduction
Non-Communicable Diseases (NCDs), such as cardiovascular diseases, diabetes, cancer, and chronic respiratory diseases, are among the leading causes of death globally. Early prediction and intervention can significantly reduce mortality rates and improve patient outcomes.

This project leverages Artificial Intelligence (AI) and Machine Learning (ML) techniques to develop a predictive model capable of identifying individuals at risk of NCDs based on clinical, demographic, and lifestyle data.

Objectives
To collect and preprocess relevant datasets containing NCD risk factors.
To build and evaluate machine learning models for early NCD prediction.
To identify key features contributing to NCD risk using explainable AI (XAI) techniques.
Methodology
The workflow involves the following steps:

Data Collection and Cleaning: Sourcing open-access datasets, handling missing values, and normalizing data.
Exploratory Data Analysis (EDA): Visualizing trends, correlations, and outliers.
Model Selection: Testing various ML algorithms (Logistic Regression, Random Forest, etc.).
Model Evaluation: Using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
Interpretability: Implementing SHAP or LIME to highlight the most influential risk factors.
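The Model Selection and Model Evaluation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the data is a synthetic binary-classification set from `make_classification` standing in for real NCD records, and the helper name `compare_models` is hypothetical. Scaling sits inside a pipeline so that it is refit on each training fold.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Metrics named in the Model Evaluation step (binary-target scorer names).
SCORING = ["accuracy", "precision", "recall", "f1", "roc_auc"]


def compare_models(X, y, n_splits=5, seed=42):
    """Hypothetical helper: mean cross-validated scores per candidate model."""
    candidates = {
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = {}
    for name, model in candidates.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=SCORING)
        rows[name] = {m: scores[f"test_{m}"].mean() for m in SCORING}
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic stand-in data (assumption): 500 samples, 10 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
results = compare_models(X_demo, y_demo)
print(results.round(3))
```

For a target with more than two classes, the macro-averaged scorer names (`precision_macro`, `recall_macro`, `f1_macro`, `roc_auc_ovr`) would be needed in place of the binary ones.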
Tools and Libraries
Python: NumPy, Pandas
ML Libraries: Scikit-learn, TensorFlow/PyTorch
Data Visualization: Matplotlib, Seaborn, Plotly
Notebooks: Jupyter

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
file_path = r"C:\Users\91973\Downloads\project 1.xlsx"
df = pd.read_excel(file_path, engine='openpyxl')
# Display dataset info before cleaning
print("\nDataset Info Before Cleaning:")
print(df.info())
# Drop irrelevant columns if necessary (modify as needed)
df = df.drop(columns=["encounter_id", "patient_n"], errors='ignore')
# Handle missing values
df.replace("?", np.nan, inplace=True) # Convert '?' to NaN
df.fillna(df.median(numeric_only=True), inplace=True) # Fill NaN with column medians
df.fillna(df.mode().iloc[0], inplace=True) # Fill remaining NaNs with mode (for categorical)
# Check if dataset is still valid
if df.shape[0] == 0:
    raise ValueError("Dataset is empty after preprocessing!")
# Encode categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str)) # Convert non-numeric data
    label_encoders[col] = le
# Define features (X) and target (y)
target_column = "A1Cresult" # Change this to the correct target column
if target_column not in df.columns:
    raise ValueError(f"Target column '{target_column}' not found in dataset!")
X = df.drop(columns=[target_column])
y = df[target_column]
# Split dataset into training and testing sets first, so the scaler never sees test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features using statistics from the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Display dataset shape after processing
print(f"Training Data Shape: {X_train.shape}, Testing Data Shape: {X_test.shape}")
# Train Random Forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.2f}")
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix visualization
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
# Feature importance visualization
feature_importance = pd.Series(
    clf.feature_importances_, index=X.columns
).sort_values(ascending=False)
plt.figure(figsize=(10, 5))
sns.barplot(x=feature_importance, y=feature_importance.index)
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance from Random Forest")
plt.show()
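
The impurity-based `feature_importances_` plotted above are derived from the training data and can favor features with many distinct values. The Interpretability step names SHAP or LIME; as a lighter stand-in (not either of those methods), the sketch below uses scikit-learn's `permutation_importance` on held-out data. The dataset is synthetic and the column names `feature_0` to `feature_7` are placeholders, so this only illustrates the technique.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); shuffle=False keeps the
# informative features in the first three columns.
X_arr, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and record the drop in accuracy.
result = permutation_importance(
    model, X_te, y_te, n_repeats=10, random_state=42, scoring="accuracy"
)
perm_importance = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_importance)
```

A feature whose permutation barely changes held-out accuracy contributes little to the model's predictions, whatever its impurity-based score says.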
