AI-Driven Epidemiological Modeling for Non-Communicable Disease (NCD) Trends
Introduction
Non-Communicable Diseases (NCDs) — including cardiovascular diseases, diabetes, cancer, and chronic respiratory conditions — are a major global health burden. Understanding the progression and trends of these diseases is crucial for effective public health planning and intervention strategies.

This project utilizes Artificial Intelligence (AI) and Machine Learning (ML) techniques to build epidemiological models that predict and analyze NCD trends over time. By leveraging historical data, the model identifies patterns and provides insights into how NCD prevalence may evolve, supporting data-driven decision-making for healthcare systems.

Objectives
To model the time series progression of NCDs using AI algorithms.
To identify key factors influencing NCD trends (age, lifestyle, socio-economic status, etc.).
To forecast future NCD incidence and assess the impact of potential interventions.

Methodology
The workflow includes:

Data Acquisition and Preprocessing: Collecting time-series health data, cleaning, and handling missing values.
Exploratory Data Analysis (EDA): Visualizing historical trends and correlations.
Model Development: Using algorithms like ARIMA, LSTM, and XGBoost for trend prediction; the code cell below trains a Random Forest regressor.
Model Evaluation: Assessing model accuracy using metrics like RMSE, MAE, and R².
Interpretability: Applying explainable AI (XAI) tools to highlight influential risk factors.
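The interpretability step above could be sketched with permutation importance from scikit-learn. This is a minimal illustration on synthetic data, not the project's own analysis: the feature names (`age`, `bmi`, `noise`) and the generated target are assumptions made only for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the feature names are hypothetical, not from the project dataset.
rng = np.random.default_rng(0)
feature_names = ["age", "bmi", "noise"]
X = rng.normal(size=(500, 3))
# The target depends strongly on "age", weakly on "bmi", and not at all on "noise".
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the drop in R².
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranking = sorted(zip(feature_names, result.importances_mean), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Computing the importances on the held-out split rather than the training split keeps the ranking from rewarding features the model has merely memorized.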

Tools and Libraries
Python: NumPy, Pandas, Matplotlib, Seaborn
ML Libraries: Scikit-learn, TensorFlow, Keras
Time Series Analysis: Statsmodels, Prophet
Data Visualization: Plotly, Seaborn
Notebooks: Jupyter

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Load dataset (file_path points to a local workbook; adjust it for your machine)
file_path = r"C:\Users\91973\Downloads\Project 2.xlsx"
df = pd.read_excel(file_path, engine='openpyxl')
# Keep only numeric columns; all non-numeric columns are dropped
df = df.select_dtypes(include=[np.number])
# Remove rows with missing target values
target_col = 'VALUE_NUMERIC'
df = df.dropna(subset=[target_col])
# Handle missing values in features (fill with each column's median)
df = df.fillna(df.median(numeric_only=True))
# Outlier Removal using IQR
Q1 = df[target_col].quantile(0.25)
Q3 = df[target_col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df[target_col] >= lower_bound) & (df[target_col] <= upper_bound)]
# Define Features & Target
X = df.drop(columns=[target_col])
y = df[target_col]
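# Train-Test Split (split first so test data does not leak into the scalers)
X_train, X_test, y_train_raw, y_test_raw = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize Features & Target (fit on the training split only)
scaler_X = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)
scaler_y = StandardScaler()
y_train = scaler_y.fit_transform(y_train_raw.values.reshape(-1, 1)).flatten()
y_test = scaler_y.transform(y_test_raw.values.reshape(-1, 1)).flatten()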
# Model Training
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred_scaled = model.predict(X_test)
# Convert predictions back to original scale
y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten()
y_test_original = scaler_y.inverse_transform(y_test.reshape(-1, 1)).flatten()
# Performance Metrics
mae = mean_absolute_error(y_test_original, y_pred)
mse = mean_squared_error(y_test_original, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_original, y_pred)
print("Model Performance Metrics:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-Squared (R²): {r2:.4f}")
# Scatter Plot for Predictions
plt.figure(figsize=(8,6))
plt.scatter(y_test_original, y_pred, alpha=0.5)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Values")
plt.show()
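The cell above shuffles rows in `train_test_split`, which can let later observations inform predictions for earlier ones when the data are a time series. A time-aware evaluation might look like the sketch below. It is an assumption-laden illustration: the yearly series is synthetic, the column names `year` and `value` are hypothetical, and the two lag features are only one possible way to frame the forecasting problem.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic yearly series standing in for an NCD indicator (hypothetical values).
rng = np.random.default_rng(42)
years = np.arange(1980, 2020)
values = 10 + 0.3 * (years - 1980) + rng.normal(scale=0.5, size=years.size)
series = pd.DataFrame({"year": years, "value": values})

# Lag features: predict this year's value from the previous two years.
series["lag_1"] = series["value"].shift(1)
series["lag_2"] = series["value"].shift(2)
series = series.dropna().reset_index(drop=True)

X = series[["lag_1", "lag_2"]].to_numpy()
y = series["value"].to_numpy()

# Each fold trains on earlier years and tests on the years that follow.
fold_mae = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    print(f"train up to {series['year'][train_idx[-1]]}, "
          f"test {series['year'][test_idx[0]]}-{series['year'][test_idx[-1]]}, "
          f"MAE {fold_mae[-1]:.2f}")
```

Tree-based models cannot predict values outside the range seen in training, so on a steadily rising series the fold errors here tend to be larger than a shuffled split would suggest. That gap is one reason to compare against trend-aware models such as the ARIMA and LSTM options listed in the methodology.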