<a href="https://colab.research.google.com/github/111DataScienceWizard/TREBIRTH/blob/main/Collab%20Notes/Binary_Classification_SKLearn_models_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installing Necessary libraries**

In [None]:
!pip install onnx



In [None]:
!pip install skl2onnx



In [None]:
!pip install onnxruntime



In [None]:
!pip install tf2onnx



**Importing Necessary Libraries**

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import onnx
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort
from tf2onnx import convert
from tensorflow import keras

In [None]:
def add_labels(df, label):
    df['label'] = label
    return df

In [None]:
def preprocess_data(data):
    columns_to_keep = ['Radar ADC', 'LSM Magnitude']
    return data[columns_to_keep]

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Loading train data

In [None]:
healthy_data_1 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/HealthyS1_ButtonTop.xlsx'), 'healthy')
healthy_data_2 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/HealthyS2_ButtonRight.xlsx'), 'healthy')
healthy_data_3 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/HealthyS3_ButtonBottom.xlsx'), 'healthy')

In [None]:
infected_data_1 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/Handheld_I1.xlsx'), 'Infected')
infected_data_2 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/Handheld_I2.xlsx'), 'Infected')
infected_data_3 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/Handheld_I3.xlsx'), 'Infected')

Concatenating data

In [None]:
healthy_data = pd.concat([healthy_data_1, healthy_data_2, healthy_data_3])
infected_data= pd.concat([infected_data_1, infected_data_2, infected_data_3])

In [None]:
healthy_data = preprocess_data(healthy_data)
infected_data = preprocess_data(infected_data)

In [None]:
numeric_columns1 = healthy_data.select_dtypes(include=np.number).columns
numeric_columns2 = infected_data.select_dtypes(include=np.number).columns

Loading test data

In [None]:
h_d1 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/HealthyS4_ButtonLeft.xlsx'), 'healthy')

In [None]:
i_d1 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/Handheld_I4.xlsx'), 'Infected')
i_d2 = add_labels(pd.read_excel('/content/drive/MyDrive/New Version/Handheld_I5.xlsx'), 'Infected')

Concatenating test data

In [None]:
h_d = pd.concat([h_d1])
i_d= pd.concat([i_d1, i_d2])

In [None]:
h_d = preprocess_data(h_d)
i_d = preprocess_data(i_d)

In [None]:
numeric_columns4 = h_d.select_dtypes(include=np.number).columns
numeric_columns5 = i_d.select_dtypes(include=np.number).columns

Defining the RollingStatisticsExtractor class

This cell defines a class, RollingStatisticsExtractor, that inherits from BaseEstimator and TransformerMixin. It is meant to be used as a preprocessing step in a scikit-learn pipeline.

BaseEstimator and TransformerMixin are part of scikit-learn and are used to create custom transformers for scikit-learn pipelines.

BaseEstimator:

BaseEstimator is the base class for all estimators in scikit-learn.
An estimator in scikit-learn is any object that learns from data. It may be a classification algorithm, a regression algorithm, or a transformer that extracts features or preprocesses data.
Inheriting from BaseEstimator gives the class get_params and set_params, which scikit-learn tools such as Pipeline, clone, and grid search rely on.

TransformerMixin:

TransformerMixin is a mixin class. It does not extend BaseEstimator, which is why the custom class inherits from both.
It supplies a default fit_transform method that calls fit and then transform, so a custom transformer only needs to define those two methods.
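To make the division of labour between the two base classes concrete, here is a minimal sketch. ColumnCenterer and its scale parameter are hypothetical names invented for this illustration; they are not part of the notebook or of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: centers each column, then rescales it."""

    def __init__(self, scale=1.0):
        # Constructor arguments are stored unchanged so get_params() can report them.
        self.scale = scale

    def fit(self, X, y=None):
        # Learn the column means; the trailing underscore marks a fitted attribute.
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.means_) * self.scale


X = np.array([[1.0, 10.0], [3.0, 30.0]])
centerer = ColumnCenterer(scale=2.0)

params = centerer.get_params()        # inherited from BaseEstimator
centered = centerer.fit_transform(X)  # inherited from TransformerMixin
print(params)
print(centered)
```

Only __init__, fit, and transform are written by hand; get_params and fit_transform come from the base classes.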

In [None]:
class RollingStatisticsExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, window_size):
        self.window_size = window_size
        self.feature_names_out_ = None

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        numeric_columns = X.select_dtypes(include=np.number).columns
        rolling_statistics = X[numeric_columns].rolling(window=self.window_size, min_periods=5)
        aggregated_features = pd.DataFrame()

        for column in numeric_columns:
            aggregated_features[f'{column}_mean'] = rolling_statistics[column].mean()
            aggregated_features[f'{column}_median'] = rolling_statistics[column].median()
            aggregated_features[f'{column}_std'] = rolling_statistics[column].std()
            aggregated_features[f'{column}_rms'] = rolling_statistics[column].apply(
                lambda x: np.sqrt(np.mean(x ** 2)))
            aggregated_features[f'{column}_peak2peak'] = rolling_statistics[column].max() - rolling_statistics[column].min()

        # Replace NaN values with the mean of the corresponding column
        aggregated_features = aggregated_features.fillna(aggregated_features.mean())

        # Set feature names for this transformer
        self.feature_names_out_ = aggregated_features.columns.tolist()
        return aggregated_features

    def fit_transform(self, X, y=None, **fit_params):
        return self.fit(X).transform(X)

    def get_feature_names_out(self, input_features=None):
        return self.feature_names_out_
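
The effect of the rolling window, min_periods, and the NaN fill can be seen on a small series. This is a sketch on made-up values with a window of 3 and min_periods of 2, chosen only to keep the output short; the notebook itself uses window_size=500 and min_periods=5.

```python
import numpy as np
import pandas as pd

# Toy signal; the values and the window settings are illustrative only.
signal = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name="toy")

# window=3 looks back three samples; min_periods=2 leaves only the first row as NaN.
windows = signal.rolling(window=3, min_periods=2)

features = pd.DataFrame({
    "mean": windows.mean(),
    "rms": windows.apply(lambda x: np.sqrt(np.mean(x ** 2)), raw=True),
    "peak2peak": windows.max() - windows.min(),
})

# Same gap-filling strategy as the transformer: replace NaN with the column mean.
filled = features.fillna(features.mean())
print(features)
print(filled)
```

Rows that come before min_periods samples are available start out as NaN and end up holding the column mean, which is why the first few rows of the extracted features in the outputs further down are identical.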

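Defining a transformer class for EDA

Example usage (your_data is a placeholder DataFrame; run this after the class below has been defined):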

# Create an instance of the EDATransformer
eda_transformer = EDATransformer()

# Apply the EDA transformation
transformed_data = eda_transformer.fit_transform(your_data)

In [None]:
class EDATransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Data Information:")
        print(X.info())

        print("\nData head:")
        print(X.head())

        print("\nData Columns:")
        print(X.columns)

        print("\nData Shape:")
        print(X.shape)

        print("\nData types:")
        print(X.dtypes)

        print("\nData Summary Statistics:")
        print(X.describe())

        print("\nMissing Values:")
        print(X.isnull().sum())

        print("\nUnique Values:")
        for column in X.columns:
            print(f"{column}: {X[column].nunique()} unique values")

        print("\nValue Counts:")
        for column in X.columns:
            print(f"{column}:\n{X[column].value_counts()}\n")

        print("\nCorrelation Map:")
        plt.figure(figsize=(12, 8))
        sns.heatmap(X.corr(), annot=True, cmap='coolwarm', linewidths=.5)
        plt.title('Correlation Map')
        plt.show()

        # Per-column detail: number of unique values and the values themselves
        print("\nUnique values per column:")
        for column in X.columns:
            print(column)
            print(X[column].nunique())
            print(X[column].unique())

        # Return the data unchanged so the transformer can sit inside a pipeline
        return X

numerical columns

In [None]:
numeric_columns_healthy = healthy_data[numeric_columns1].select_dtypes(include=np.number).columns
numeric_columns_infected = infected_data[numeric_columns2].select_dtypes(include=np.number).columns
numeric_columns_h_t = h_d[numeric_columns4].select_dtypes(include=np.number).columns
numeric_columns_i_t = i_d[numeric_columns5].select_dtypes(include=np.number).columns


Extracting features

In [None]:
# Create a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('rolling_stats_healthy', RollingStatisticsExtractor(window_size=500), numeric_columns_healthy),
        ('rolling_stats_infected', RollingStatisticsExtractor(window_size=500), numeric_columns_infected),
        ('rolling_stats_h_t', RollingStatisticsExtractor(window_size=500), numeric_columns_h_t),
        ('rolling_stats_i_t', RollingStatisticsExtractor(window_size=500), numeric_columns_i_t)
    ],
    remainder='passthrough'  # Include non-numeric columns as they are
)

Assigning a pipeline to each training dataset and creating DataFrames from the transformed data and column names

In [None]:
# Access the transformers using the get_params method
healthy_transformer = preprocessor.get_params()['rolling_stats_healthy']
infected_transformer = preprocessor.get_params()['rolling_stats_infected']

# Create separate pipelines for healthy and infected data
healthy_pipeline = Pipeline([
    ('preprocessor', healthy_transformer)
])

infected_pipeline = Pipeline([
    ('preprocessor', infected_transformer)
])

# Fit and transform the healthy data
healthy_data_with_features_array = healthy_pipeline['preprocessor'].fit_transform(healthy_data)

# Get column names from the fitted transformer
column_names_healthy = healthy_pipeline['preprocessor'].get_feature_names_out()

# Create DataFrame with the transformed data and column names
healthy_data_with_features = pd.DataFrame(healthy_data_with_features_array, columns=column_names_healthy)

# Fit and transform the infected data
infected_data_with_features_array = infected_pipeline['preprocessor'].fit_transform(infected_data)

# Get column names from the fitted transformer
column_names_infected = infected_pipeline['preprocessor'].get_feature_names_out()

# Create DataFrame with the transformed data and column names
infected_data_with_features = pd.DataFrame(infected_data_with_features_array, columns=column_names_infected)


Combining original features with extracted features, assigning labels, and building the training set

In [None]:
# Assuming `healthy_data` and `infected_data` contain the original features
# Combine the original features with the extracted features for healthy data
healthy_data_with_combined_features = pd.concat([healthy_data.reset_index(drop=True), healthy_data_with_features.reset_index(drop=True)], axis=1)

# Add labels to the combined healthy data
healthy_data_with_combined_features['label'] = 0  # Healthy label is 0

# Combine the original features with the extracted features for infected data
infected_data_with_combined_features = pd.concat([infected_data.reset_index(drop=True), infected_data_with_features.reset_index(drop=True)], axis=1)

# Add labels to the combined infected data
infected_data_with_combined_features['label'] = 1  # Infected label is 1

# Concatenate healthy and infected data
train_data = pd.concat([healthy_data_with_combined_features, infected_data_with_combined_features], ignore_index=True, axis=0)

# Print the columns of the combined training data
print(train_data.columns)

# Specify selected features
selected_features = ['LSM Magnitude_mean', 'LSM Magnitude_median', 'LSM Magnitude_rms', 'LSM Magnitude', 'Radar ADC_peak2peak']

# Split the training data into features (X_train) and labels (y_train)
X_train = train_data[selected_features]
y_train = train_data['label']

# Print the first few rows of X_train and y_train
print(X_train.head())
print(y_train.head())


Index(['Radar ADC', 'LSM Magnitude', 'Radar ADC_mean', 'Radar ADC_median',
       'Radar ADC_std', 'Radar ADC_rms', 'Radar ADC_peak2peak',
       'LSM Magnitude_mean', 'LSM Magnitude_median', 'LSM Magnitude_std',
       'LSM Magnitude_rms', 'LSM Magnitude_peak2peak', 'label'],
      dtype='object')
   LSM Magnitude_mean  LSM Magnitude_median  LSM Magnitude_rms  LSM Magnitude  \
0            1.003203              1.003151           1.003207       0.995339   
1            1.003203              1.003151           1.003207       0.995339   
2            1.003203              1.003151           1.003207       0.995339   
3            1.003203              1.003151           1.003207       0.995339   
4            0.995506              0.995339           0.995506       0.996173   

   Radar ADC_peak2peak  
0           408.671412  
1           408.671412  
2           408.671412  
3           408.671412  
4            21.000000  
0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64


Assigning a pipeline to each test dataset and creating DataFrames from the transformed data and column names

In [None]:
# Access the transformers using the get_params method
h_transformer = preprocessor.get_params()['rolling_stats_h_t']
i_transformer = preprocessor.get_params()['rolling_stats_i_t']

# Create separate pipelines for the healthy and infected test data
h_pipeline = Pipeline([
    ('preprocessor', h_transformer)
])

i_pipeline = Pipeline([
    ('preprocessor', i_transformer)
])

# Fit and transform the healthy data
healthy_data_with_features_array = h_pipeline['preprocessor'].fit_transform(h_d)

# Get column names from the fitted transformer
column_names_healthy = h_pipeline['preprocessor'].get_feature_names_out()

# Create DataFrame with the transformed data and column names
healthy_data_with_features = pd.DataFrame(healthy_data_with_features_array, columns=column_names_healthy)

# Fit and transform the infected data
infected_data_with_features_array = i_pipeline['preprocessor'].fit_transform(i_d)

# Get column names from the fitted transformer
column_names_infected = i_pipeline['preprocessor'].get_feature_names_out()

# Create DataFrame with the transformed data and column names
infected_data_with_features = pd.DataFrame(infected_data_with_features_array, columns=column_names_infected)



Combining original features with extracted features, assigning labels, and building the test set

In [None]:
# `h_d` and `i_d` contain the original test features
# Combine the original features with the extracted features for the healthy test data
healthy_data_with_combined_features = pd.concat([h_d, healthy_data_with_features], axis=1)

# Add labels to the combined healthy data
healthy_data_with_combined_features['label'] = 0  # Healthy label is 0

# Combine the original features with the extracted features for infected data
infected_data_with_combined_features = pd.concat([i_d.reset_index(drop=True), infected_data_with_features.reset_index(drop=True)], axis=1)

# Add labels to the combined infected data
infected_data_with_combined_features['label'] = 1  # Infected label is 1

# Concatenate healthy and infected data
test_data = pd.concat([healthy_data_with_combined_features, infected_data_with_combined_features], ignore_index=True, axis=0)

# Print the columns of the combined test data
print(test_data.columns)

# Specify selected features
selected_features = ['LSM Magnitude_mean', 'LSM Magnitude_median', 'LSM Magnitude_rms', 'LSM Magnitude', 'Radar ADC_peak2peak']

# Split the test data into features (X_test) and labels (y_test)
X_test = test_data[selected_features]
y_test = test_data['label']

# Print the first few rows of X_test and y_test
print(X_test.head())
print(y_test.head())


Index(['Radar ADC', 'LSM Magnitude', 'Radar ADC_mean', 'Radar ADC_median',
       'Radar ADC_std', 'Radar ADC_rms', 'Radar ADC_peak2peak',
       'LSM Magnitude_mean', 'LSM Magnitude_median', 'LSM Magnitude_std',
       'LSM Magnitude_rms', 'LSM Magnitude_peak2peak', 'label'],
      dtype='object')
   LSM Magnitude_mean  LSM Magnitude_median  LSM Magnitude_rms  LSM Magnitude  \
0            0.979301              0.979219           0.979304       0.982328   
1            0.979301              0.979219           0.979304       0.982328   
2            0.979301              0.979219           0.979304       0.978678   
3            0.979301              0.979219           0.979304       0.978678   
4            0.980138              0.978678           0.980140       0.978678   

   Radar ADC_peak2peak  
0          1061.072688  
1          1061.072688  
2          1061.072688  
3          1061.072688  
4           118.000000  
0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64


In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)

SMOTE to balance the imbalanced training data

In [None]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_imputed, y_train)

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), selected_features)
    ])

defining function to train and evaluate the models in pipeline

In [None]:
def train_and_evaluate(model, X_train_resampled, y_train_resampled, X_test, y_test, threshold=0.5):
    full_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])

    full_pipeline.fit(X_train_resampled, y_train_resampled)

    # For Training Set
    y_train_pred_probs = full_pipeline.predict_proba(X_train_resampled)
    print(f"{type(model).__name__} Training Predicted Probabilities:\n{y_train_pred_probs}")

    y_train_pred = (y_train_pred_probs[:, 1] > threshold).astype(int)  # Convert to class labels using threshold
    print(f"{type(model).__name__} Training Accuracy: {accuracy_score(y_train_resampled, y_train_pred)}")
    print(f"{type(model).__name__} Training Classification Report:\n{classification_report(y_train_resampled, y_train_pred)}")

    # For Test Set
    y_test_pred_probs = full_pipeline.predict_proba(X_test)
    print(f"{type(model).__name__} Test Predicted Probabilities:\n{y_test_pred_probs}")

    y_test_pred = (y_test_pred_probs[:, 1] > threshold).astype(int)  # Convert to class labels using threshold
    print(f"{type(model).__name__} Test Accuracy: {accuracy_score(y_test, y_test_pred)}")
    print(f"{type(model).__name__} Test Classification Report:\n{classification_report(y_test, y_test_pred)}")

    return full_pipeline


Random forest pipeline training and evaluation

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

In [None]:
import pandas as pd

# Convert NumPy arrays to pandas DataFrames
X_train_resampled_df = pd.DataFrame(X_train_resampled, columns=selected_features)
X_test_df = pd.DataFrame(X_test, columns=selected_features)

# Assuming you have your labels as 1D arrays or lists
y_train_resampled_df = pd.Series(y_train_resampled, name='label')
y_test_df = pd.Series(y_test, name='label')

# Now, you can use these DataFrames in your function
rf_pipeline = train_and_evaluate(rf_model, X_train_resampled_df, y_train_resampled_df, X_test_df, y_test_df, threshold=0.5)

RandomForestClassifier Training Predicted Probabilities:
[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
RandomForestClassifier Training Accuracy: 1.0
RandomForestClassifier Training Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     10314
           1       1.00      1.00      1.00     10314

    accuracy                           1.00     20628
   macro avg       1.00      1.00      1.00     20628
weighted avg       1.00      1.00      1.00     20628

RandomForestClassifier Test Predicted Probabilities:
[[0. 1.]
 [0. 1.]
 [0. 1.]
 ...
 [0. 1.]
 [0. 1.]
 [0. 1.]]
RandomForestClassifier Test Accuracy: 0.6970997528937443
RandomForestClassifier Test Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      2329
           1       0.70      1.00      0.82      5360

    accuracy                           0.70      7689
   macro avg    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


logistic regression pipeline training and evaluation

In [None]:
logreg_model = LogisticRegression(random_state=42, max_iter=1000)
logreg_pipeline = train_and_evaluate(logreg_model, X_train_resampled_df, y_train_resampled_df, X_test_df, y_test_df, threshold=0.5)


LogisticRegression Training Predicted Probabilities:
[[0.92795393 0.07204607]
 [0.92795393 0.07204607]
 [0.92795393 0.07204607]
 ...
 [0.99237014 0.00762986]
 [0.97744221 0.02255779]
 [0.98819249 0.01180751]]
LogisticRegression Training Accuracy: 0.9290285049447353
LogisticRegression Training Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.93      0.93     10314
           1       0.93      0.92      0.93     10314

    accuracy                           0.93     20628
   macro avg       0.93      0.93      0.93     20628
weighted avg       0.93      0.93      0.93     20628

LogisticRegression Test Predicted Probabilities:
[[0.05476728 0.94523272]
 [0.05476728 0.94523272]
 [0.04146013 0.95853987]
 ...
 [0.03886936 0.96113064]
 [0.03891066 0.96108934]
 [0.038952   0.961048  ]]
LogisticRegression Test Accuracy: 0.7062036675770581
LogisticRegression Test Classification Report:
              precision    recall  f1-score   suppo

decision tree pipeline training and evaluation

In [None]:
dt_model = DecisionTreeClassifier(random_state=42)
dt_pipeline = train_and_evaluate(dt_model, X_train_resampled_df, y_train_resampled_df, X_test_df, y_test_df, threshold=0.5)


DecisionTreeClassifier Training Predicted Probabilities:
[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
DecisionTreeClassifier Training Accuracy: 1.0
DecisionTreeClassifier Training Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     10314
           1       1.00      1.00      1.00     10314

    accuracy                           1.00     20628
   macro avg       1.00      1.00      1.00     20628
weighted avg       1.00      1.00      1.00     20628

DecisionTreeClassifier Test Predicted Probabilities:
[[0. 1.]
 [0. 1.]
 [0. 1.]
 ...
 [0. 1.]
 [0. 1.]
 [0. 1.]]
DecisionTreeClassifier Test Accuracy: 0.7893094030433087
DecisionTreeClassifier Test Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.31      0.47      2329
           1       0.77      1.00      0.87      5360

    accuracy                           0.79      7689
   macro avg    

svm pipeline training and evaluation

In [None]:
svm_model = SVC(probability=True, random_state=42)
svm_pipeline = train_and_evaluate(svm_model, X_train_resampled_df, y_train_resampled_df, X_test_df, y_test_df, threshold=0.5)

SVC Training Predicted Probabilities:
[[9.99989465e-01 1.05348079e-05]
 [9.99989465e-01 1.05348079e-05]
 [9.99989465e-01 1.05348079e-05]
 ...
 [9.98523533e-01 1.47646679e-03]
 [9.99535811e-01 4.64189483e-04]
 [9.99793783e-01 2.06216978e-04]]
SVC Training Accuracy: 0.9794454140003879
SVC Training Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.97      0.98     10314
           1       0.97      0.99      0.98     10314

    accuracy                           0.98     20628
   macro avg       0.98      0.98      0.98     20628
weighted avg       0.98      0.98      0.98     20628

SVC Test Predicted Probabilities:
[[0.06265009 0.93734991]
 [0.06265009 0.93734991]
 [0.06470434 0.93529566]
 ...
 [0.03443122 0.96556878]
 [0.03443557 0.96556443]
 [0.03443994 0.96556006]]
SVC Test Accuracy: 0.6970997528937443
SVC Test Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00     

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Installing the libraries needed for ONNX

In [None]:
!pip install onnxruntime-tools


Collecting onnxruntime-tools
  Downloading onnxruntime_tools-1.7.0-py3-none-any.whl (212 kB)
Collecting py3nvml (from onnxruntime-tools)
  Downloading py3nvml-0.2.7-py3-none-any.whl (55 kB)
Collecting xmltodict (from py3nvml->onnxruntime-tools)
  Downloading xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict, py3nvml, onnxruntime-tools
Successfully installed onnxruntime-tools-1.7.0 py3nvml-0.2.7 xmltodict-0.13.0


In [None]:
from skl2onnx.common.data_types import FloatTensorType

Defining a function to convert models to ONNX

In [None]:
def convert_to_onnx(model, X, onnx_path, input_names=None):
    if input_names is None:
        input_names = list(X.columns)

    # Declare one float input tensor of shape [batch, 1] per feature column
    initial_type = [(name, FloatTensorType([None, 1])) for name in input_names]

    onnx_model = convert_sklearn(model, initial_types=initial_type)

    with open(onnx_path, 'wb') as f:
        f.write(onnx_model.SerializeToString())

Converting the random forest pipeline to ONNX and loading it with ONNX Runtime

In [None]:
# Example usage with RandomForest model
convert_to_onnx(rf_pipeline, X_test_df, 'rf_model.onnx', input_names=selected_features)
rf_ort_model = ort.InferenceSession('rf_model.onnx')

Converting the logistic regression pipeline to ONNX and loading it with ONNX Runtime

In [None]:
convert_to_onnx(logreg_pipeline, X_test_df, 'logreg_model.onnx', input_names=selected_features)
logreg_ort_model = ort.InferenceSession('logreg_model.onnx')

Converting the decision tree pipeline to ONNX and loading it with ONNX Runtime

In [None]:
convert_to_onnx(dt_pipeline, X_test_df, 'dt_model.onnx', input_names=selected_features)
dt_ort_model = ort.InferenceSession('dt_model.onnx')

Converting the SVM pipeline to ONNX and loading it with ONNX Runtime

In [None]:
convert_to_onnx(svm_pipeline, X_test_df, 'svm_model.onnx', input_names=selected_features)
svm_ort_model = ort.InferenceSession('svm_model.onnx')

Defining a function to evaluate an ONNX model

threshold = 0.5: Sets the cutoff for converting predicted values to predicted classes. A value greater than the threshold is assigned class 1; otherwise it is assigned class 0.

input_names = [input.name for input in dt_ort_model.get_inputs()]: Retrieves the names of the input nodes of the ONNX model (dt_ort_model). Here the model has one input per feature, so input_names holds one name per selected feature.

X_test[selected_features].iloc[:, i].values.astype(np.float32)[:, None]: Takes the i-th selected feature column from the test data (X_test), converts it to a float32 NumPy array, and adds a second axis so that its shape is (n_samples, 1), matching the [None, 1] input shape declared at conversion.

{input_names[i]: ... for i in range(len(input_names))}: A dictionary comprehension that maps each ONNX input name to its prepared column. This relies on the model's inputs being in the same order as selected_features.

dt_ort_model.run(None, ...)[0]: Runs inference with the ONNX model (dt_ort_model). Passing None requests all outputs, and run returns them as a list; [0] selects the first one. For classifiers converted with skl2onnx, the first output is normally the predicted label and the probabilities are a separate second output, which is consistent with the integer values shown in the 'Predicted Probability' column of the results below.

In summary, this code sets a threshold, builds the input dictionary from the test data, and runs the ONNX model on it. The first output (dt_ort_predictions) is then compared against the threshold; when that output already holds 0/1 labels, the comparison leaves them unchanged.
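
The thresholding step can be tried in isolation. The sketch below uses a made-up probability array rather than real model output, and shows both the usual case (probabilities in, classes out) and what happens when the values are already 0/1 labels.

```python
import numpy as np

# Hypothetical two-column probability array: column 0 = class 0, column 1 = class 1.
probabilities = np.array([
    [0.90, 0.10],
    [0.40, 0.60],
    [0.50, 0.50],
])
threshold = 0.5

# Strict comparison: a probability exactly equal to the threshold maps to class 0.
predicted_classes = (probabilities[:, 1] > threshold).astype(int)
print(predicted_classes)  # [0 1 0]

# Thresholding values that are already 0/1 labels returns the same labels.
labels = np.array([0, 1, 1])
relabelled = (labels > threshold).astype(int)
print(relabelled)  # [0 1 1]
```

Because the comparison is strict, lowering the threshold slightly below 0.5 would move the third sample into class 1.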

In [None]:
def evaluate_onnx_model(model, X_test, y_test, selected_features, threshold=0.5):
    input_names = [input.name for input in model.get_inputs()]
    print(f"Expected input names for the model: {input_names}")

    # Run the ONNX model and take its first output (for skl2onnx classifiers this is normally the predicted label)
    ort_predictions = model.run(None, {input_names[i]: X_test[selected_features].iloc[:, i].values.astype(np.float32)[:, None] for i in range(len(input_names))})[0]

    # Check if the array has only one dimension
    if len(ort_predictions.shape) == 1:
        ort_predictions = ort_predictions[:, None]  # Add a second dimension

    # Convert predicted probabilities to predicted classes using the threshold
    ort_predicted_classes_prob = (ort_predictions > threshold).astype(int)

    # Create a DataFrame to display true labels and predicted probabilities side by side
    result_df = pd.DataFrame({
        'True Label': y_test,
        'Predicted Probability': ort_predictions.flatten(),
        'Predicted Class': ort_predicted_classes_prob.flatten()
    })

    # Print the DataFrame
    print(result_df)

    # Calculate accuracy
    ort_accuracy_prob = accuracy_score(y_test, ort_predicted_classes_prob)
    print(f"\nONNX Model Accuracy using Probabilities and Threshold: {ort_accuracy_prob}")


'True Label': The actual labels from the test set (y_test), i.e. the ground truth.

'Predicted Probability': The values returned by the ONNX model (ort_predictions). flatten() converts the (n_samples, 1) array into a one-dimensional array so that it can be used as a DataFrame column. In the outputs below these values are integers, which indicates that the first model output holds class labels rather than raw probabilities.

'Predicted Class': The result of comparing ort_predictions with the threshold (ort_predicted_classes_prob), flattened in the same way.
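
As a small illustration of how the three columns line up, the sketch below builds the same kind of table from made-up values. y_true and scores are hypothetical stand-ins for y_test and the model output, not data from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for y_test and the ONNX model output.
y_true = pd.Series([0, 1, 1, 0], name="label")
scores = np.array([[0.2], [0.8], [0.4], [0.1]])  # shape (4, 1)

# Compare each score with the threshold to get a 0/1 class per row.
predicted = (scores > 0.5).astype(int)

result = pd.DataFrame({
    "True Label": y_true,
    "Predicted Probability": scores.flatten(),  # (4, 1) -> (4,)
    "Predicted Class": predicted.flatten(),
})
accuracy = accuracy_score(y_true, predicted.flatten())
print(result)
print(accuracy)
```

flatten() is what allows a column-shaped array to be used as a DataFrame column; the accuracy is then the fraction of rows where 'True Label' and 'Predicted Class' agree.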

Printing the random forest ONNX Runtime predictions and accuracy

In [None]:
evaluate_onnx_model(rf_ort_model, X_test_df, y_test_df, selected_features, threshold=0.5)

Expected input names for the model: ['LSM_Magnitude_mean', 'LSM_Magnitude_median', 'LSM_Magnitude_rms', 'LSM_Magnitude', 'Radar_ADC_peak2peak']
      True Label  Predicted Probability  Predicted Class
0              0                      1                1
1              0                      1                1
2              0                      1                1
3              0                      1                1
4              0                      1                1
...          ...                    ...              ...
7684           1                      1                1
7685           1                      1                1
7686           1                      1                1
7687           1                      1                1
7688           1                      1                1

[7689 rows x 3 columns]

ONNX Model Accuracy using Probabilities and Threshold: 0.6970997528937443


Printing the logistic regression ONNX Runtime predictions and accuracy

In [None]:
evaluate_onnx_model(logreg_ort_model, X_test_df, y_test_df, selected_features, threshold=0.5)

Expected input names for the model: ['LSM_Magnitude_mean', 'LSM_Magnitude_median', 'LSM_Magnitude_rms', 'LSM_Magnitude', 'Radar_ADC_peak2peak']
      True Label  Predicted Probability  Predicted Class
0              0                      1                1
1              0                      1                1
2              0                      1                1
3              0                      1                1
4              0                      0                0
...          ...                    ...              ...
7684           1                      1                1
7685           1                      1                1
7686           1                      1                1
7687           1                      1                1
7688           1                      1                1

[7689 rows x 3 columns]

ONNX Model Accuracy using Probabilities and Threshold: 0.7062036675770581


Printing the decision tree ONNX Runtime predictions and accuracy

In [None]:
evaluate_onnx_model(dt_ort_model, X_test_df, y_test_df, selected_features, threshold=0.5)

Expected input names for the model: ['LSM_Magnitude_mean', 'LSM_Magnitude_median', 'LSM_Magnitude_rms', 'LSM_Magnitude', 'Radar_ADC_peak2peak']
      True Label  Predicted Probability  Predicted Class
0              0                      1                1
1              0                      1                1
2              0                      1                1
3              0                      1                1
4              0                      0                0
...          ...                    ...              ...
7684           1                      1                1
7685           1                      1                1
7686           1                      1                1
7687           1                      1                1
7688           1                      1                1

[7689 rows x 3 columns]

ONNX Model Accuracy using Probabilities and Threshold: 0.7893094030433087


Printing the SVM ONNX Runtime predictions and accuracy

In [None]:
evaluate_onnx_model(svm_ort_model, X_test_df, y_test_df, selected_features, threshold=0.5)

Expected input names for the model: ['LSM_Magnitude_mean', 'LSM_Magnitude_median', 'LSM_Magnitude_rms', 'LSM_Magnitude', 'Radar_ADC_peak2peak']
      True Label  Predicted Probability  Predicted Class
0              0                      1                1
1              0                      1                1
2              0                      1                1
3              0                      1                1
4              0                      1                1
...          ...                    ...              ...
7684           1                      1                1
7685           1                      1                1
7686           1                      1                1
7687           1                      1                1
7688           1                      1                1

[7689 rows x 3 columns]

ONNX Model Accuracy using Probabilities and Threshold: 0.6970997528937443
