Machine Learning for Identifying Socioeconomic Risk Factors of Non-Communicable Diseases (NCDs)
Introduction
Non-Communicable Diseases (NCDs), such as cardiovascular diseases, diabetes, and cancer, are influenced not only by biological factors but also by socioeconomic determinants such as income, education, occupation, and access to healthcare. Understanding these risk factors is essential for developing targeted public health interventions and reducing health disparities.

This project uses Machine Learning (ML) techniques to identify and quantify the impact of socioeconomic factors on NCD risk. By analyzing large datasets, the models identify which social and economic variables are most strongly associated with disease prevalence, which can help policymakers design data-driven strategies to mitigate these risks.

Objectives
To collect and preprocess socioeconomic and health datasets.
To build ML models that identify key predictors of NCDs.
To interpret model outputs using Explainable AI (XAI) to highlight influential socioeconomic factors.
Methodology
The project follows these steps:

Data Collection and Cleaning: Sourcing datasets containing both health outcomes and socioeconomic variables.
Exploratory Data Analysis (EDA): Understanding data distributions, correlations, and outliers.
Feature Selection: Using statistical tests and feature importance scores to identify significant predictors.
Model Building: Training various models (Logistic Regression, Decision Trees, Random Forest) for classification and regression tasks.
Model Evaluation: Assessing performance using accuracy, precision, recall, and F1-score metrics.
Interpretability: Utilizing SHAP values or LIME to explain how socioeconomic factors influence predictions.
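
The feature selection, model building, and evaluation steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual pipeline: the data is synthetic, the column names (`income`, `education_years`, `healthcare_access`, `noise`, `has_ncd`) and effect sizes are hypothetical, and the ANOVA F-test (`f_classif`) is just one possible choice of statistical test.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; every column name and coefficient is hypothetical.
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "income": rng.normal(size=n),
    "education_years": rng.normal(size=n),
    "healthcare_access": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Assumed outcome: NCD risk falls as income and healthcare access rise.
logit = -1.5 * demo["income"] - 1.0 * demo["healthcare_access"]
demo["has_ncd"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = demo.drop(columns=["has_ncd"])
y = demo["has_ncd"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Feature selection: keep the two features with the highest F-statistic,
# computed on the training split only.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_train, y_train)
selected = list(X_train.columns[selector.get_support()])
print("Selected features:", selected)

# Model building: the three model families named in the methodology.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Model evaluation: accuracy, precision, recall, and F1 on the held-out split.
scores = {}
for name, clf in models.items():
    clf.fit(X_train[selected], y_train)
    y_pred = clf.predict(X_test[selected])
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }

print(pd.DataFrame(scores).T.round(2))
```

Fitting the selector and the scaler on the training split only keeps information from the test rows out of the preprocessing, so the reported metrics are closer to what the models would achieve on unseen data.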
Tools and Libraries
Python: NumPy, Pandas, Matplotlib, Seaborn
ML Libraries: Scikit-learn, XGBoost, TensorFlow
Data Visualization: Plotly, Seaborn
Explainable AI: SHAP, LIME
Notebooks: Jupyter

from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Assumes `df` is a pandas DataFrame loaded in an earlier cell, with numeric
# feature columns and an "IncomeGroup" column used as the target.

# Cast IncomeGroup to string so the encoder sees uniform labels
df["IncomeGroup"] = df["IncomeGroup"].astype(str)

# Encode the target as integer class labels
target_encoder = LabelEncoder()
df["IncomeGroup"] = target_encoder.fit_transform(df["IncomeGroup"])

# Define features (X) and target (y)
X = df.drop(columns=["IncomeGroup"])  # remove target column from features
y = df["IncomeGroup"]

# Split before scaling so test-set statistics do not leak into the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize features: fit on the training split only, then apply to both
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize XGBoost classifier with reduced complexity
model = XGBClassifier(
    n_estimators=50,
    max_depth=3,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the held-out split
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=target_encoder.classes_))
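
The methodology names SHAP and LIME for interpretability. As a lighter stand-in that needs only scikit-learn, the sketch below uses permutation importance instead: it shuffles one feature at a time on the held-out split and measures how much the score drops. It is not SHAP or LIME, and it is not the project's actual analysis. The data is synthetic, the column names are hypothetical, and a Random Forest is used in place of the XGBoost model so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 800
X_demo = pd.DataFrame({
    "education_years": rng.normal(size=n),
    "healthcare_access": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Assumed target: driven by education_years only, plus a little random jitter.
y_demo = (X_demo["education_years"] + 0.3 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out split and record the mean
# drop in accuracy; larger drops indicate more influential features.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
ranking = pd.Series(result.importances_mean, index=X_te.columns).sort_values(
    ascending=False
)
print(ranking)
```

The same call should work with any fitted estimator that follows the scikit-learn interface, so the `model` trained above could be passed in place of `clf`, together with the scaled test split.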
