Step 1: Data Collection

Gather a dataset containing relevant health features. This example uses the processed Cleveland subset of the public Heart Disease dataset from the UCI Machine Learning Repository.

In [1]:
import pandas as pd

# Load the dataset (replace the URL with the path to your dataset)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=columns)

print(data.head())


    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3   
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5   
2  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6   
3  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5   
4  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4   

   slope   ca thal  target  
0    3.0  0.0  6.0       0  
1    2.0  3.0  3.0       2  
2    2.0  2.0  7.0       1  
3    3.0  0.0  3.0       0  
4    1.0  0.0  3.0       0  
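Step 2 treats "?" as the missing-value marker in this file. As an alternative, read_csv can parse those markers as NaN at load time, which keeps every column numeric. A minimal sketch, assuming a small inline sample in place of the download (the three rows are made up for illustration):

```python
import io

import pandas as pd

columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Made-up sample rows in the same comma-separated layout; "?" marks a missing value
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
    "57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,?,1\n"
)

# na_values tells pandas which strings to read as NaN
df = pd.read_csv(sample, names=columns, na_values="?")
missing_per_column = df.isna().sum()

print(df.dtypes)
print(missing_per_column)
```

With this approach the later replace('?', np.nan) and pd.to_numeric calls become unnecessary, although keeping them does no harm.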


Step 2: Data Preprocessing

1) Handle missing values.

2) Normalize or standardize features.

3) Convert categorical variables into numerical ones if necessary.

In [4]:
import numpy as np
# Handling missing values
data = data.replace('?', np.nan)
data = data.dropna()

# Convert data types
data = data.apply(pd.to_numeric)

# Standardize the features (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data.drop('target', axis=1))

# Prepare the final dataset; reset y's index so it lines up with X after dropna()
X = pd.DataFrame(data_scaled, columns=data.columns[:-1])
y = data['target'].reset_index(drop=True)

print(X.head())
print(y.head())


        age       sex        cp  trestbps      chol       fbs   restecg  \
0  0.936181  0.691095 -2.240629  0.750380 -0.276443  2.430427  1.010199   
1  1.378929  0.691095  0.873880  1.596266  0.744555 -0.411450  1.010199   
2  1.378929  0.691095  0.873880 -0.659431 -0.353500 -0.411450  1.010199   
3 -1.941680  0.691095 -0.164289 -0.095506  0.051047 -0.411450 -1.003419   
4 -1.498933 -1.446980 -1.202459 -0.095506 -0.835103 -0.411450  1.010199   

    thalach     exang   oldpeak     slope        ca      thal  
0  0.017494 -0.696419  1.068965  2.264145 -0.721976  0.655877  
1 -1.816334  1.435916  0.381773  0.643781  2.478425 -0.894220  
2 -0.899420  1.435916  1.326662  0.643781  1.411625  1.172577  
3  1.633010 -0.696419  2.099753  2.264145 -0.721976 -0.894220  
4  0.978071 -0.696419  0.295874 -0.976583 -0.721976 -0.894220  
0    0
1    2
2    1
3    0
4    0
Name: target, dtype: int64
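Item 3 is not exercised by the cell above, which only casts every column to a numeric type. Columns such as cp and thal hold category codes rather than quantities, so one option is to one-hot encode them. A minimal sketch with pd.get_dummies, assuming a tiny made-up frame and treating only cp and thal as categorical:

```python
import pandas as pd

# Made-up rows; cp and thal hold category codes, age is a true quantity
toy = pd.DataFrame({
    "age": [63.0, 67.0, 37.0],
    "cp": [1.0, 4.0, 3.0],
    "thal": [6.0, 3.0, 3.0],
})

# Cast the codes to int so the generated column names read "cp_1" rather than "cp_1.0"
toy[["cp", "thal"]] = toy[["cp", "thal"]].astype(int)

# One indicator column per observed category; age passes through unchanged
encoded = pd.get_dummies(toy, columns=["cp", "thal"], dtype=int)
print(encoded)
```

Whether encoding helps depends on the model: tree-based classifiers often cope with raw codes, while linear models and SVMs read them as ordered magnitudes.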


Step 3: Feature Selection

Use feature selection to identify the most influential variables. Here, SelectKBest keeps the 10 features (out of 13) with the highest ANOVA F-scores.

In [5]:
from sklearn.feature_selection import SelectKBest, f_classif

# Select the top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get the selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features)


Selected features: Index(['age', 'sex', 'cp', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
       'ca', 'thal'],
      dtype='object')
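SelectKBest with f_classif scores each feature by its ANOVA F-statistic against the class labels and keeps the k highest. The sketch below assumes synthetic data from make_classification rather than the heart-disease frame, with placeholder feature names f0 to f5, and shows how the per-feature scores can be inspected alongside the selection mask:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(6)])

selector_demo = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# scores_ holds one F-statistic per input feature; higher means stronger class separation
scores = pd.Series(selector_demo.scores_, index=X_demo.columns).sort_values(ascending=False)
kept = X_demo.columns[selector_demo.get_support()]

print(scores)
print("Kept:", list(kept))
```

Printing the scores makes it easier to judge whether the cut-off at k separates strong features from weak ones or falls between near-equal scores.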


Step 4: Model Development

Train several classifiers on an 80/20 train/test split and compare their accuracy on the test set.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Initialize the models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC()
}

# Train and evaluate the models
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}:")
    print("Accuracy:", accuracy_score(y_test, y_pred))
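    # target is multiclass (0-4), so these metrics need an explicit average:
    # print("Precision:", precision_score(y_test, y_pred, average="macro", zero_division=0))
    # print("Recall:", recall_score(y_test, y_pred, average="macro", zero_division=0))
    # print("F1 Score:", f1_score(y_test, y_pred, average="macro", zero_division=0))
    # print("-" * 30)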


Logistic Regression:
Accuracy: 0.6
Decision Tree:
Accuracy: 0.55
Random Forest:
Accuracy: 0.6666666666666666
SVM:
Accuracy: 0.6666666666666666
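Precision, recall, and F1 need an average argument here because target takes more than two values (the output in Step 2 shows 0, 1, and 2), and scikit-learn's default average='binary' raises an error for multiclass labels. A minimal sketch on made-up label arrays, using macro averaging as one possible choice:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true and predicted labels for three classes
y_true = [0, 0, 1, 2, 2, 1]
y_hat = [0, 1, 1, 2, 0, 1]

# Macro averaging computes the metric per class, then takes the unweighted mean
precision = precision_score(y_true, y_hat, average="macro", zero_division=0)
recall = recall_score(y_true, y_hat, average="macro", zero_division=0)
f1 = f1_score(y_true, y_hat, average="macro", zero_division=0)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

Macro averaging weights every class equally; average='weighted' is an alternative when the classes are imbalanced and the score should reflect class frequencies.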


Step 5: Cross-Validation

Use 5-fold cross-validation to estimate how well each model generalizes.

In [9]:
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X_selected, y, cv=5, scoring='accuracy')
    print(f"{name} Cross-Validation Accuracy: {scores.mean()} ± {scores.std()}")


Logistic Regression Cross-Validation Accuracy: 0.5854237288135593 ± 0.059743084712457026
Decision Tree Cross-Validation Accuracy: 0.4915254237288136 ± 0.04625321284712805
Random Forest Cross-Validation Accuracy: 0.6027118644067797 ± 0.024778117794799624
SVM Cross-Validation Accuracy: 0.5521468926553672 ± 0.023316022086803587
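Because X_selected was scaled and filtered using every row before cross-validation, each fold's held-out rows have already influenced the preprocessing. One way to avoid that, sketched here on synthetic data (make_classification stands in for the heart-disease frame), is to wrap the scaler, selector, and classifier in a Pipeline so all three are refit inside each fold:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 unscaled features
X_raw, y_raw = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=42
)

# Each step is fitted on the training folds only, then applied to the held-out fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class proportions similar across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_raw, y_raw, cv=cv, scoring="accuracy")

print(f"Pipeline CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

The same pipeline object can be passed to GridSearchCV, with parameter names prefixed by the step name (for example clf__C).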


Step 6: Hyperparameter Tuning

Fine-tune the hyperparameters of selected machine learning models.

In [10]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Random Forest as an example
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best parameters for Random Forest:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)


Best parameters for Random Forest: {'max_depth': None, 'n_estimators': 100}
Best cross-validation accuracy: 0.586613475177305
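The best_score_ value is the mean accuracy over the inner folds of the training split. The refit best_estimator_ can still be scored on the held-out test rows and persisted with joblib.dump so that a separate process can load it. A minimal sketch under stated assumptions: synthetic data, a deliberately small grid, and a temporary directory for the file (the name random_forest_model.pkl is the one the Flask cell in Step 8 tries to load):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Small illustrative grid; GridSearchCV refits the best setting on all of X_tr
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [None, 5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# Score the refit model on rows the search never saw
test_accuracy = search.best_estimator_.score(X_te, y_te)
print("Held-out accuracy:", test_accuracy)

# Persist the model, then load it back to confirm the round trip
model_path = os.path.join(tempfile.mkdtemp(), "random_forest_model.pkl")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```

A serving process would also need the fitted scaler and selector, so saving them alongside the model (or saving a single fitted Pipeline) avoids a mismatch at prediction time.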


Step 7: Model Interpretability (Optional)

Enhance the interpretability of the models using SHAP values or feature importance plots.

In [12]:
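import shap  # requires the shap package (pip install shap)

# SHAP values for Random Forest
best_model = grid_search.best_estimator_
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)

# Plot the contributions for class 1. This indexing assumes shap returns one
# array per class (a list); some versions return a single 3-D array instead,
# in which case the class axis must be selected differently
shap.summary_plot(shap_values[1], X_test, feature_names=selected_features)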


ModuleNotFoundError: No module named 'shap'
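When shap is not installed, as in the error above, scikit-learn's own tools cover the feature-importance route this step mentions. A minimal sketch using permutation_importance on synthetic data (the feature names f0 to f5 are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 3 informative features out of 6
X_pi, y_pi = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(6)]
X_pi_tr, X_pi_te, y_pi_tr, y_pi_te = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_pi_tr, y_pi_tr)

# Shuffle one column at a time on held-out data and record the drop in accuracy
result = permutation_importance(forest, X_pi_te, y_pi_te, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)

print(importances)
```

Permutation importance is computed on held-out rows, so it reflects what the model relies on for predictions rather than what it split on during training.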

Step 8: User Interface (Optional)

Expose the model through a small JSON prediction endpoint using a framework like Flask.

In [13]:
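from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the trained model. The file must exist first, e.g. after running
# joblib.dump(grid_search.best_estimator_, 'random_forest_model.pkl')
model = joblib.load('random_forest_model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.json
    # The scaler was fitted on all 13 features, so build and scale the full row first
    features = [payload[name] for name in X.columns]
    features_scaled = scaler.transform([features])
    # Then keep only the columns the selector chose, matching the model's training input
    features_selected = selector.transform(features_scaled)
    prediction = model.predict(features_selected)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)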


FileNotFoundError: [Errno 2] No such file or directory: 'random_forest_model.pkl'

Step 9: Integration with Electronic Health Records (EHR) (Optional)

Explore the integration with EHR systems to facilitate seamless information flow.

Step 10: Documentation (Optional)

Provide comprehensive documentation covering data sources, methodology, model architecture, and usage instructions.

Step 11: Validation and Testing

Conduct extensive testing and validation to ensure the accuracy, reliability, and robustness of the system.

In [None]:
# Assuming validation is already performed through cross-validation
# Further testing can be done using unseen data if available
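For the unseen-data check mentioned in the comments above, a per-class breakdown is often more informative than a single accuracy figure. A minimal sketch on made-up labels, using confusion_matrix and classification_report:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels standing in for an unseen evaluation set
y_unseen = [0, 0, 1, 2, 2, 1, 0]
y_unseen_pred = [0, 1, 1, 2, 0, 1, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_unseen, y_unseen_pred, labels=[0, 1, 2])
print(matrix)

# Per-class precision, recall, and F1 in one table
print(classification_report(y_unseen, y_unseen_pred, zero_division=0))
```

Off-diagonal cells show which classes the model confuses with each other, which a single accuracy number hides.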