# Diabetes prediction model project: data processing, model training and business suggestion generation
This notebook implements a complete diabetes prediction project, covering the whole process from data loading to generating business recommendations. The project aims to build an effective diabetes prediction model and to derive practical suggestions for related businesses.

The following code imports the libraries and modules required for the project, in preparation for the subsequent data analysis, modeling, and visualization work.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

Part I: Read the CSV file to obtain the diabetes dataset. The function prints basic information about the data (column names, data types, and non-null counts) together with the number of rows and columns, so that users can quickly gauge the size of the data. This helps identify potential issues early, such as too few rows or too few features.

In [2]:
# 1. Data loading and exploration
def load_data(file_path):
    """Load diabetes dataset"""
    df = pd.read_csv(file_path)
    print("Basic data information:")
    df.info()

    # Get the number of rows and columns in the dataset
    rows, columns = df.shape

    # Soft checks: warn when the dataset is small or has few features
    if rows < 500:
        print("Warning: the dataset has fewer than 500 rows, which may cause the model to overfit")
    if columns < 10:
        print("Warning: the dataset has fewer than 10 columns, so there may be too few features")

    # Hard check: refuse to continue when there is too little data to model
    if rows < 30:
        raise ValueError("Error: the dataset has fewer than 30 rows, which is too little data to model")

    # Report the size of the dataset
    print(f"Number of rows in the dataset: {rows}, number of columns: {columns}")
    
    return df
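
load_data checks only the shape of the table. A complementary check, sketched below, verifies that the columns used later in this notebook are present. The helper name check_columns is hypothetical, and the column list assumes the file follows the layout referenced in the preprocessing and visualization cells.

```python
import pandas as pd

# Column names referenced later in this notebook (assumed to match the CSV header)
EXPECTED_COLUMNS = [
    'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
    'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome',
]


def check_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Toy frame that deliberately lacks most columns
toy = pd.DataFrame({'Glucose': [120, 0], 'BMI': [30.1, 22.4], 'Outcome': [1, 0]})
missing = check_columns(toy)
print("Missing columns:", missing)
```

Calling such a check right after pd.read_csv would surface a renamed or missing column before the later cells fail with a KeyError.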

Part II: The preprocess_data function preprocesses the diabetes dataset to prepare it for subsequent machine learning modeling.
1. Identify invalid zero values in medical indicators and convert them into missing values.
2. Use median imputation to fill the missing values and ensure data integrity.
3. Create new categorical features (age group and BMI category) through binning to enrich the feature set.
4. One-hot encode the categorical features so that machine learning models can process them.

In [None]:
# 2. Data preprocessing
def preprocess_data(df):
    """Data preprocessing: handling missing values, outliers, and feature engineering"""
    # copy data
    df_processed = df.copy()
    
    # Treat 0 as a missing value in the medical measurements (Pregnancies and Outcome can legitimately be 0)
    columns_to_replace = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
    for col in columns_to_replace:
        df_processed[col] = df_processed[col].replace(0, np.nan)
    
    # Count the missing values in each column
    missing_values = df_processed.isnull().sum()
    print("\nMissing value statistics:")
    print(missing_values[missing_values > 0])
    
    # Fill missing values with the column median
    imputer = SimpleImputer(strategy='median')
    df_processed[columns_to_replace] = imputer.fit_transform(df_processed[columns_to_replace])
    
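    # Feature engineering: create new categorical features
    # 5 bin edges define 4 bins, so pd.cut needs 4 unique labels ('Senior' for ages 45-60 is an assumed name)
    df_processed['AgeGroup'] = pd.cut(df_processed['Age'], bins=[0, 30, 45, 60, 100], labels=['Youth', 'Middle aged', 'Senior', 'Elderly'])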
    df_processed['BMI_Category'] = pd.cut(df_processed['BMI'], bins=[0, 18.5, 25, 30, 100], labels=['underweight', 'normal', 'overweight', 'obese'])
    
    # One-hot encode the categorical features
    df_processed = pd.get_dummies(df_processed, columns=['AgeGroup', 'BMI_Category'], drop_first=True)
    
    return df_processed
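
The zero-to-missing conversion, median imputation, and binning steps can be traced on a small example. This is a minimal sketch on made-up values rather than the real dataset; it reuses the BMI bin edges and labels from preprocess_data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows: the zeros stand for measurements that were never recorded
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'BMI': [33.6, 26.6, 0.0, 28.1],
})

# Step 1: turn the invalid zeros into NaN so they count as missing
masked = toy.replace(0, np.nan)
n_missing = masked.isnull().sum()

# Step 2: fill each column with the median of its observed values
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(masked), columns=toy.columns)

# Step 3: bin BMI with the same edges and labels as preprocess_data
bmi_category = pd.cut(filled['BMI'], bins=[0, 18.5, 25, 30, 100],
                      labels=['underweight', 'normal', 'overweight', 'obese'])

print(n_missing)
print(filled)
print(list(bmi_category))
```

The missing Glucose value becomes 148 (the median of 148, 183, and 89) and the missing BMI becomes 28.1, which then falls into the overweight bin.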

Part III: The visualize_data function visualizes the diabetes dataset to help us understand the features and their relationship with the target variable.
1. Correlation heatmap: displays the correlation between features, helping us discover multicollinearity and key predictive features.
2. Box plots: compare the distribution of each feature between patients and non-patients, providing a basis for feature engineering and model interpretation.
3. Target variable distribution: checks whether the classes are balanced, to avoid model bias caused by imbalanced data.
4. Correlation with the target: prints each feature's correlation with Outcome as a rough indication of its predictive ability and a reference for feature selection.

In [None]:
# 3. Data Visualization and Analysis
def visualize_data(df):
    """Data Visualization and Analysis"""
    # Set image clarity
    plt.rcParams['figure.dpi'] = 300
    
    # Draw feature correlation heatmap
    plt.figure(figsize=(12, 10))
    corr = df.corr()
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True)
    plt.title("Feature correlation heatmap")
    plt.tight_layout()
    plt.savefig('correlation_heatmap.png')
    plt.close()
    
    # Draw a box plot of each numeric feature, split by outcome
    numeric_features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
    fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
    axes = axes.flatten()
    
    for i, feature in enumerate(numeric_features):
        sns.boxplot(x='Outcome', y=feature, data=df, ax=axes[i])
        axes[i].set_title(f'{feature} distribution')
    
    plt.tight_layout()
    plt.savefig('feature_distribution.png')
    plt.close()
    
    # Draw the distribution of target variables
    plt.figure(figsize=(6, 4))
    sns.countplot(x='Outcome', data=df)
    plt.title('Distribution of diabetes patients')
    plt.savefig('target_distribution.png')
    plt.close()
    
    # Print each feature's correlation with the target as a rough importance ranking
    print("\nCorrelation between features and diabetes:")
    print(corr['Outcome'].drop('Outcome').sort_values(ascending=False))
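
Two of the checks above can also be expressed as numbers rather than plots. The sketch below uses made-up values, and the Activity column is hypothetical and not part of the diabetes dataset. It computes the class shares and ranks features by the absolute value of their correlation with Outcome, so that a strongly negative feature is not pushed to the bottom of the list.

```python
import pandas as pd

# Made-up data; 'Activity' is a hypothetical, negatively correlated feature
toy = pd.DataFrame({
    'Glucose':  [85, 90, 95, 100, 150, 160],
    'Activity': [9, 8, 7, 8, 2, 1],
    'Age':      [50, 30, 25, 40, 45, 35],
    'Outcome':  [0, 0, 0, 0, 1, 1],
})

# Share of each class: a quick numeric check of class balance
class_share = toy['Outcome'].value_counts(normalize=True)

# Correlation of each feature with the target, excluding the target itself
target_corr = toy.corr()['Outcome'].drop('Outcome')

# Rank by absolute correlation while keeping the sign visible
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(class_share)
print(ranked)
```

Sorting by the signed value alone would list Activity last even though it is nearly as informative as Glucose in this toy example.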

Part IV: The train_model function performs feature selection, model training, and evaluation to build a diabetes prediction model.
1. Data partitioning: split the data into training and test sets with stratification, so that both keep the same class proportions.
2. Automated process: integrate preprocessing, feature selection, and model training through pipelines.
3. Hyperparameter optimization: use grid search with cross-validation to find the best model configuration.
4. Multi-model comparison: train logistic regression and SVM models and evaluate their performance from multiple dimensions.
5. Result visualization: visualize model performance through the ROC curve, the precision-recall curve, and a feature importance plot.
The returned best_models and results objects contain the best configuration and the evaluation results of each model, providing data support for subsequent model selection and business decisions.

In [None]:
# 4. Feature selection and model training
def train_model(df):
    """Feature selection, model training, and evaluation"""
    # Prepare features and target variables
    X = df.drop(['Outcome'], axis=1)
    y = df['Outcome']
    
    # Divide the training set and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    print(f"\nTraining set size: {X_train.shape[0]}, test set size: {X_test.shape[0]}")
    
    # Create feature selector
    feature_selector = SelectKBest(score_func=f_classif, k=8)
    
    # Create a standardization tool
    scaler = StandardScaler()
    
    # Create model pipeline
    pipelines = {
        'Logistic Regression': Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', scaler),
            ('feature_selector', feature_selector),
            ('classifier', LogisticRegression(random_state=42))
        ]),
        'SVM': Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', scaler),
            ('feature_selector', feature_selector),
            ('classifier', SVC(random_state=42, probability=True))
        ])
    }
    
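    # Set hyperparameter grids. Logistic regression uses a list of sub-grids because
    # not every solver supports every penalty (for example, lbfgs does not support l1).
    # 'elasticnet' also needs l1_ratio and the string 'none' is not accepted by recent
    # scikit-learn releases, so both are omitted here.
    param_grids = {
        'Logistic Regression': [
            {
                'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
                'classifier__penalty': ['l2'],
                'classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
            },
            {
                'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
                'classifier__penalty': ['l1'],
                'classifier__solver': ['liblinear', 'saga']
            }
        ],
        'SVM': {
            'classifier__C': [0.1, 1, 10, 100],
            'classifier__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'classifier__gamma': ['scale', 'auto']
        }
    }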
    
    # cross validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    # Training and Evaluating Models
    best_models = {}
    results = {}
    
    for name, pipeline in pipelines.items():
        print(f"\nTrain the {name} model...")
        
        # grid search
        grid_search = GridSearchCV(
            pipeline, 
            param_grids[name], 
            cv=cv, 
            scoring='accuracy',
            n_jobs=-1,
            verbose=1
        )
        
        # training model
        grid_search.fit(X_train, y_train)
        
        # Save the best model
        best_models[name] = grid_search.best_estimator_
        
        # evaluation model
        y_pred = best_models[name].predict(X_test)
        y_prob = best_models[name].predict_proba(X_test)[:, 1]
        
        # Calculate evaluation indicators
        accuracy = accuracy_score(y_test, y_pred)
        cm = confusion_matrix(y_test, y_pred)
        report = classification_report(y_test, y_pred)
        
        # Calculate ROC curve
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        roc_auc = auc(fpr, tpr)
        
        # Calculate the precision recall curve
        precision, recall, _ = precision_recall_curve(y_test, y_prob)
        
        # Save the Results
        results[name] = {
            'accuracy': accuracy,
            'confusion_matrix': cm,
            'report': report,
            'fpr': fpr,
            'tpr': tpr,
            'roc_auc': roc_auc,
            'precision': precision,
            'recall': recall
        }
        
        print(f"{name} best parameters: {grid_search.best_params_}")
        print(f"{name} accuracy: {accuracy:.4f}")
        print(f"{name} confusion matrix:\n{cm}")
        print(f"{name} Classification report:\n{report}")
    
    # Draw ROC curve
    plt.figure(figsize=(10, 8))
    for name, result in results.items():
        plt.plot(result['fpr'], result['tpr'], lw=2, label=f'{name} (AUC = {result["roc_auc"]:.3f})')
    
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.title('ROC curve')
    plt.legend(loc="lower right")
    plt.savefig('roc_curve.png')
    plt.close()
    
    # Draw precision recall curve
    plt.figure(figsize=(10, 8))
    for name, result in results.items():
        plt.plot(result['recall'], result['precision'], lw=2, label=f'{name}')
    
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-recall curve')
    plt.legend(loc="upper right")
    plt.savefig('precision_recall_curve.png')
    plt.close()
    
    # Feature importance analysis (applicable only to logistic regression)
    if 'Logistic Regression' in best_models:
        lr_model = best_models['Logistic Regression'].named_steps['classifier']
        feature_names = X.columns[best_models['Logistic Regression'].named_steps['feature_selector'].get_support()]
        coefficients = lr_model.coef_[0]
        
        # Create Feature Importance DataFrame
        feature_importance = pd.DataFrame({
            'Feature': feature_names,
            'Coefficient': coefficients,
            'Importance': np.abs(coefficients)
        }).sort_values('Importance', ascending=False)
        
        print("\nFeature Importance Analysis (Logistic Regression):")
        print(feature_importance)
        
        # Draw a feature importance map
        plt.figure(figsize=(10, 6))
        sns.barplot(x='Importance', y='Feature', data=feature_importance)
        plt.title('Feature importance (absolute value of logistic regression coefficient)')
        plt.tight_layout()
        plt.savefig('feature_importance.png')
        plt.close()
    
    return best_models, results
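
train_model selects hyperparameters by accuracy. When the classes are imbalanced, a threshold-independent metric such as ROC AUC is one possible alternative selection criterion. The sketch below shows the same pipeline-plus-grid-search pattern with scoring='roc_auc'. It runs on synthetic data from make_classification as a stand-in for the diabetes dataset, and the dataset parameters and the small C grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in data (an assumption, not the real dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           weights=[0.65, 0.35], class_sep=1.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling and feature selection live inside the pipeline, so each CV fold
# fits them on its own training part only
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selector', SelectKBest(score_func=f_classif, k=5)),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, {'classifier__C': [0.1, 1, 10]},
                      cv=cv, scoring='roc_auc')
search.fit(X_train, y_train)

# Evaluate the refitted best pipeline on the held-out test set
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(search.best_score_, 3), round(test_auc, 3))
```

Whether accuracy or ROC AUC is the better criterion depends on how the predictions will be used; the point here is only that the scoring argument is the single place where that choice is made.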

Part V: The generate_insights function produces business insights and recommendations based on the model training results, turning machine learning output into input for business decisions:
1. Model interpretation: extract meaningful business insights by analyzing the performance of the best model.
2. Decision support: based on the model results, propose specific actions to help the medical team optimize the screening process.
3. Knowledge accumulation: save the conclusions as a text file for cross-departmental communication and long-term reference.

In [None]:
# 5. Model Explanation and Business Suggestions
def generate_insights(results, best_models, df):
    """Generate model explanations and business recommendations"""
    # Find the best model
    best_model_name = max(results, key=lambda k: results[k]['accuracy'])
    best_accuracy = results[best_model_name]['accuracy']
    
    print(f"\nBest model: {best_model_name}, accuracy: {best_accuracy:.4f}")
    
    # Generate insights
    insights = [
        f"1. The best model is {best_model_name}, with an accuracy of {best_accuracy:.2%} on the held-out test set.",
        "2. Glucose, BMI and the diabetes pedigree function are the most important factors for predicting diabetes.",
        "3. From the confusion matrix, the model still has room for improvement in identifying diabetes patients (positive), and more positive samples can be collected.",
        "4. This model can serve as a preliminary screening tool to help doctors identify high-risk patients, but the final diagnosis still needs to be combined with clinical symptoms."
    ]
    
    # Generate suggestions
    recommendations = [
        "1. Carry out prevention and publicity activities targeting high-risk populations (high BMI, high blood sugar levels) to promote a healthy lifestyle.",
        "2. Collect more sample data, especially diabetes patient data, to improve model performance.",
        "3. Consider using ensemble learning methods or deep learning to further improve prediction accuracy.",
        "4. Develop a simple application that allows doctors to enter patient data and obtain diabetes risk predictions."
    ]
    
    # Save insights and suggestions to a file
    with open('insights_and_recommendations.txt', 'w') as f:
        f.write("### Model Insights ###\n")
        for insight in insights:
            f.write(f"- {insight}\n")
        
        f.write("\n### Business Suggestions ###\n")
        for recommendation in recommendations:
            f.write(f"- {recommendation}\n")
    
    print("\nInsights and suggestions have been saved to insights_and_recommendations.txt")
    
    return insights, recommendations
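
The insight about positive-class performance is written as fixed text. One way to back it with numbers is to derive recall and precision from the stored confusion matrix. The sketch below assumes a hypothetical helper named positive_class_summary and uses made-up labels for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def positive_class_summary(cm):
    """Describe positive-class performance from a 2x2 confusion matrix (hypothetical helper)."""
    # For binary labels 0/1, scikit-learn orders the cells as tn, fp, fn, tp
    tn, fp, fn, tp = cm.ravel()
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return (f"The model finds {tp} of {tp + fn} diabetic patients (recall {recall:.0%}); "
            f"{precision:.0%} of its positive predictions are correct.")


# Made-up labels: 4 true positives in the data, 2 of them detected
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
cm = confusion_matrix(y_true, y_pred)
summary = positive_class_summary(cm)
print(summary)
```

In generate_insights, the matrix stored under results[best_model_name]['confusion_matrix'] could be passed to such a helper in place of the toy matrix.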

Part VI: This code defines the project's main function, which calls each functional module in sequence and connects the complete workflow from data loading to generating business suggestions.

In [1]:
# main
def main():
    print("===== Diabetes prediction model project =====")
    
    # 1. Load data
    df = load_data('diabetes.csv')
    
    # 2. Data preprocessing
    df_processed = preprocess_data(df)
    
    # 3. Data Visualization and Analysis
    visualize_data(df_processed)
    
    # 4. Model Training and Evaluation
    best_models, results = train_model(df_processed)
    
    # 5. Generate insights and recommendations
    insights, recommendations = generate_insights(results, best_models, df_processed)
    
    print("\n===== Project Completion =====")


# Run the full workflow when this cell or script is executed
if __name__ == "__main__":
    main()