In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, plot_confusion_matrix

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('../Data/new_data.csv')
df.head()

Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,BMI
0,50,2,168,62,110,80,1,1,0,0,1,0,21
1,55,1,156,85,140,90,3,1,0,0,1,1,34
2,51,1,165,64,130,70,3,1,0,0,0,1,23
3,48,2,169,82,150,100,1,1,0,0,1,1,28
4,47,1,156,56,100,60,1,1,0,0,0,0,23


## Machine Learning

### Train | Validation | Test Split Procedure

In [3]:
X = df.drop('cardio', axis=1)
y = df['cardio']

# Split the data into training and testing sets. 80% of data is training data, set aside other 20% for test
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remaining 80% into training and validation sets (75/25, i.e. 60%/20% of the full data).
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

# Scale the data using standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Scale the data using normalization (note: X_train, X_val and X_test were already standardized above)
normalizer = MinMaxScaler()
X_train_norm = normalizer.fit_transform(X_train)
X_val_norm = normalizer.transform(X_val)
X_test_norm = normalizer.transform(X_test)

Here, we prepare the data for a machine learning model that predicts whether someone has cardiovascular disease, based on various health-related features. Each step is broken down below:

1. The first line `X = df.drop('cardio', axis=1)` selects all columns from the input dataframe except for the `cardio` column. These are the features that the machine learning model will use to make its predictions. Note that this keeps every other column shown in the preview, including `Unnamed: 0`, which looks like a leftover row index rather than a health feature. The second line `y = df['cardio']` selects only the `cardio` column, which contains the labels we are trying to predict.

2. The third line uses the `train_test_split` function from `sklearn` to split the data into two parts: 80% is kept for training and validation, and 20% is set aside for testing. The `random_state` parameter is set to 42, which ensures that the data is split in the same way every time the code is run.

3. The fourth line further splits that 80% into training and validation sets using a 75/25 split. Relative to the full dataset, this gives 60% for training, 20% for validation, and 20% for testing. The validation set is used to compare models and tune hyperparameters later.

4. The next three lines scale the data using standardization, which rescales each feature to have a mean of 0 and a standard deviation of 1 on the training set. This helps algorithms that are sensitive to feature scale, such as logistic regression; it does not make the features normally distributed. `fit_transform` learns the mean and standard deviation from the training data and transforms it, and `transform` then applies those same training statistics to the validation and test data. The original `X_train`, `X_val`, and `X_test` variables are overwritten with the transformed data.

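5. The last three lines scale the data using min-max normalization, which rescales each feature so that its training values lie between 0 and 1. This is useful for algorithms that are sensitive to the scale of the features. `fit_transform` fits the normalizer on the training data and transforms it, and `transform` applies the same training minimum and maximum to the validation and test data, so their values can fall slightly outside [0, 1]. The results are stored in the new variables `X_train_norm`, `X_val_norm`, and `X_test_norm`. Because `X_train` was overwritten in the previous step, the normalizer here is fit on the already standardized data rather than on the raw features.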
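The split and scaling steps above can be sketched on synthetic data. This is a minimal illustration, not the notebook's own pipeline: the array `X_demo`, the labels `y_demo`, and all variable names ending in `_std` or `_mm` are hypothetical, and the sketch deliberately keeps the raw training features so that each scaler is fit on unscaled data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Synthetic stand-in for the features and labels (assumption: 1000 rows, 3 features)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y_demo = rng.integers(0, 2, size=1000)

# 80/20 split, then 75/25 of the remaining 80% -> 60/20/20 overall
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.25, random_state=42)
split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)

# Fit each scaler on the raw training data only, then reuse it for the other sets
std_scaler = StandardScaler().fit(X_tr)
X_tr_std = std_scaler.transform(X_tr)
X_va_std = std_scaler.transform(X_va)
X_te_std = std_scaler.transform(X_te)

mm_scaler = MinMaxScaler().fit(X_tr)
X_tr_mm = mm_scaler.transform(X_tr)
X_va_mm = mm_scaler.transform(X_va)
X_te_mm = mm_scaler.transform(X_te)

# Training columns are centred with unit variance, or bounded by 0 and 1
print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
print(X_tr_mm.min(axis=0), X_tr_mm.max(axis=0))
```

Only the training set is guaranteed to have exactly zero mean and unit variance, or to span exactly 0 to 1; the validation and test sets are transformed with the training statistics and will deviate slightly.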

### Modelling

In [4]:
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "Adaboost Classifier": AdaBoostClassifier(),
          "Gradientboost Classifier": GradientBoostingClassifier(),
          "Random Forest": RandomForestClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_val, X_test, y_train, y_val, y_test):
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the training set
        model.fit(X_train, y_train)
        # Make predictions on the validation and test sets
        y_val_pred = model.predict(X_val)
        y_test_pred = model.predict(X_test)
        # Calculate accuracy and precision scores on the validation and test sets
        val_acc = accuracy_score(y_val, y_val_pred)
        val_prec = precision_score(y_val, y_val_pred)
        test_acc = accuracy_score(y_test, y_test_pred)
        test_prec = precision_score(y_test, y_test_pred)
        # Save the scores to the model_scores dictionary
        model_scores[name] = {"validation_accuracy": val_acc,
                              "validation_precision": val_prec,
                              "test_accuracy": test_acc,
                              "test_precision": test_prec}
        # Print the model's results on a new line
        print("\n" + name)
        print(model_scores[name])
    return model_scores

This code defines a dictionary containing four machine learning models: `LogisticRegression`, `AdaBoostClassifier`, `GradientBoostingClassifier`, and `RandomForestClassifier`, all with their default hyperparameters. Then, a function named `fit_and_score` is defined to fit and score these models. The function takes as input the models and the training, validation, and testing data, and returns a dictionary of model scores that holds the validation and test accuracy and precision for each model.

The function loops through each model and fits it on the training data, then makes predictions on the validation and testing data. It calculates accuracy and precision scores on both sets and saves them to a dictionary called `model_scores`. The function prints the results for each model as it goes and returns the `model_scores` dictionary at the end.
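To make the two metrics concrete, here is a small sketch on made-up labels (the lists `y_true_demo` and `y_pred_demo` are hypothetical and not taken from the dataset). Accuracy is the share of all predictions that are correct, while precision is the share of predicted positives that are truly positive, TP / (TP + FP).

```python
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

# Hypothetical labels: 1 = has cardiovascular disease, 0 = does not
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, confusion_matrix(...).ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Compute the metrics by hand from the confusion matrix counts
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_precision = tp / (tp + fp)

# Compare with the scikit-learn helpers used in fit_and_score
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
demo_precision = precision_score(y_true_demo, y_pred_demo)

print(manual_accuracy, demo_accuracy)
print(manual_precision, demo_precision)
```

By default `precision_score` treats the label `1` as the positive class, which here corresponds to a predicted case of cardiovascular disease.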

In [5]:
# Call the fit_and_score function and print the results
model_scores = fit_and_score(models, X_train, X_val, X_test, y_train, y_val, y_test)


Logistic Regression
{'validation_accuracy': 0.7195714285714285, 'validation_precision': 0.7410451232749263, 'test_accuracy': 0.7200714285714286, 'test_precision': 0.7406254862299674}

Adaboost Classifier
{'validation_accuracy': 0.7338571428571429, 'validation_precision': 0.7748795480976907, 'test_accuracy': 0.7345, 'test_precision': 0.7738075452883497}

Gradientboost Classifier
{'validation_accuracy': 0.7366428571428572, 'validation_precision': 0.7543299908842297, 'test_accuracy': 0.7383571428571428, 'test_precision': 0.7537505682679194}

Random Forest
{'validation_accuracy': 0.7141428571428572, 'validation_precision': 0.7186733958183129, 'test_accuracy': 0.7107857142857142, 'test_precision': 0.7128286165780778}
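
The printed dictionaries are easier to compare side by side in a table. The sketch below is one possible way to do that, assuming the nested dictionary has the same shape as the one returned by `fit_and_score`; the name `model_scores_demo` is hypothetical, and its values are copied from the output above and rounded to four decimals.

```python
import pandas as pd

# Scores copied from the printed output above, rounded to four decimals
model_scores_demo = {
    "Logistic Regression": {"validation_accuracy": 0.7196, "validation_precision": 0.7410,
                            "test_accuracy": 0.7201, "test_precision": 0.7406},
    "Adaboost Classifier": {"validation_accuracy": 0.7339, "validation_precision": 0.7749,
                            "test_accuracy": 0.7345, "test_precision": 0.7738},
    "Gradientboost Classifier": {"validation_accuracy": 0.7366, "validation_precision": 0.7543,
                                 "test_accuracy": 0.7384, "test_precision": 0.7538},
    "Random Forest": {"validation_accuracy": 0.7141, "validation_precision": 0.7187,
                      "test_accuracy": 0.7108, "test_precision": 0.7128},
}

# One row per model, one column per metric, best validation accuracy first
scores_df = pd.DataFrame(model_scores_demo).T
scores_df = scores_df.sort_values("validation_accuracy", ascending=False)
print(scores_df)

# Pick the leaders on the validation set only, keeping the test set for a final check
best_by_accuracy = scores_df["validation_accuracy"].idxmax()
best_by_precision = scores_df["validation_precision"].idxmax()
print(best_by_accuracy, best_by_precision)
```

On these numbers the gradient boosting model has the highest validation accuracy and AdaBoost the highest validation precision, though the gaps between the boosted models are small.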
