# Preprocessing
We will import our dataset (https://www.kaggle.com/datasets/gauravduttakiit/smoker-status-prediction-using-biosignals?resource=download&select=train_dataset.csv) and start preprocessing. The dataset also needs to be split into train and test sets.

In [1]:
import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler


In [2]:
full_dataset_df = pd.read_csv("data/kaggle_smoker_dataset.csv", header=0)

All of the columns are numerical, so there is no need to create dummy variables at this stage.
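The "all numerical" claim can also be checked programmatically rather than by reading the `info()` output. The sketch below is only an illustration: `toy_df` is a hypothetical stand-in for `full_dataset_df`, and the `gender` column is invented to show what a non-numeric column would look like.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame standing in for full_dataset_df.
toy_df = pd.DataFrame({
    "age": [35, 20, 45],
    "waist(cm)": [97.0, 110.0, 86.0],
    "smoking": [1, 0, 0],
})

# Collect any column whose dtype is not numeric; those would need encoding.
non_numeric_columns = [c for c in toy_df.columns if not is_numeric_dtype(toy_df[c])]
print(non_numeric_columns)

# Invented categorical column, added only to show a column being flagged.
toy_with_text_df = toy_df.assign(gender=["M", "F", "M"])
flagged_columns = [
    c for c in toy_with_text_df.columns if not is_numeric_dtype(toy_with_text_df[c])
]
print(flagged_columns)
```

An empty list means no column needs dummy variables. Any flagged column would have to be encoded before it can be scaled.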

In [3]:
full_dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38984 entries, 0 to 38983
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  38984 non-null  int64  
 1   height(cm)           38984 non-null  int64  
 2   weight(kg)           38984 non-null  int64  
 3   waist(cm)            38984 non-null  float64
 4   eyesight(left)       38984 non-null  float64
 5   eyesight(right)      38984 non-null  float64
 6   hearing(left)        38984 non-null  int64  
 7   hearing(right)       38984 non-null  int64  
 8   systolic             38984 non-null  int64  
 9   relaxation           38984 non-null  int64  
 10  fasting blood sugar  38984 non-null  int64  
 11  Cholesterol          38984 non-null  int64  
 12  triglyceride         38984 non-null  int64  
 13  HDL                  38984 non-null  int64  
 14  LDL                  38984 non-null  int64  
 15  hemoglobin           38984 non-null 

In [4]:
full_dataset_df.head()

,age,height(cm),weight(kg),waist(cm),eyesight(left),eyesight(right),hearing(left),hearing(right),systolic,relaxation,...,HDL,LDL,hemoglobin,Urine protein,serum creatinine,AST,ALT,Gtp,dental caries,smoking
0,35,170,85,97.0,0.9,0.9,1,1,118,78,...,70,142,19.8,1,1.0,61,115,125,1,1
1,20,175,110,110.0,0.7,0.9,1,1,119,79,...,71,114,15.9,1,1.1,19,25,30,1,0
2,45,155,65,86.0,0.9,0.9,1,1,110,80,...,57,112,13.7,3,0.6,1090,1400,276,0,0
3,45,165,80,94.0,0.8,0.7,1,1,158,88,...,46,91,16.9,1,0.9,32,36,36,0,0
4,20,165,60,81.0,1.5,0.1,1,1,109,64,...,47,92,14.9,1,1.2,26,28,15,0,0


Separate the class labels from the full dataset into their own Series (`labels_df`).

In [5]:
# Re-running this cell raises a KeyError because "smoking" has already been
# dropped; the try/except makes the cell safe to re-run.
try:
    labels_df = full_dataset_df["smoking"]
    full_dataset_df.drop(["smoking"], axis=1, inplace=True)
except KeyError:
    pass

label_names = ["non-smoker", "smoker"]  # 0 = non-smoker, 1 = smoker
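As an alternative sketch (not what the notebook itself does), the same re-run safety can be had with an explicit membership check and `DataFrame.pop`, which removes the column and returns it in one step. `toy_df` and `toy_labels` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for full_dataset_df.
toy_df = pd.DataFrame({"age": [35, 20, 45], "smoking": [1, 0, 0]})

# Pop the label column only while it is still present, so running this
# block a second time leaves the previously extracted labels untouched.
if "smoking" in toy_df.columns:
    toy_labels = toy_df.pop("smoking")

print(list(toy_df.columns))
print(toy_labels.tolist())
```

Compared with a bare `except KeyError: pass`, the explicit check does not silence unrelated `KeyError`s raised elsewhere in the cell.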


In [6]:
train_df, test_df, train_labels_df, test_labels_df = train_test_split(full_dataset_df, labels_df, test_size=0.2, random_state=33)
features = train_df.columns
train_df.to_csv("output/train_data.csv", index=False)
test_df.to_csv("output/test_data.csv", index=False)
train_labels_df.to_csv("output/train_labels.csv", index=False)
test_labels_df.to_csv("output/test_labels.csv", index=False)
train_df.shape

(31187, 22)

In [7]:
train_labels_df.value_counts()

smoking
0    19706
1    11481
Name: count, dtype: int64

In [8]:
test_df.shape

(7797, 22)

In [9]:
test_labels_df.value_counts()

smoking
0    4960
1    2837
Name: count, dtype: int64

Both the train and test sets contain more non-smokers than smokers (roughly 63% versus 37%), so `smoking`, the class we are trying to predict, is the minority class.
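To put numbers on the imbalance, the sketch below turns the class counts reported above into proportions. It then shows, on synthetic labels only, how the `stratify` argument of `train_test_split` keeps the class ratio the same in both splits. The split in this notebook does not use `stratify`; the `toy_*` names and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() outputs above.
train_counts = pd.Series({0: 19706, 1: 11481})
test_counts = pd.Series({0: 4960, 1: 2837})

train_share = (train_counts / train_counts.sum()).round(3)
test_share = (test_counts / test_counts.sum()).round(3)
print(train_share.to_dict())
print(test_share.to_dict())

# Synthetic example: 1000 rows with a 63/37 class ratio.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(1000, 3))
toy_y = np.array([0] * 630 + [1] * 370)

# stratify=toy_y asks for the same class ratio in both splits.
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=33, stratify=toy_y
)
print(toy_y_train.mean(), toy_y_test.mean())
```

Here the unstratified split already gives similar ratios in train and test, so stratification mainly matters for smaller datasets or rarer classes.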

In [10]:
train_labels_array = train_labels_df.to_numpy()
test_labels_array = test_labels_df.to_numpy()

min_max_scaler = MinMaxScaler().fit(train_df)
train_normalized_array = min_max_scaler.transform(train_df)
test_normalized_array = min_max_scaler.transform(test_df)
np.set_printoptions(precision=2, linewidth=80, suppress=True)

train_normalized_array

array([[0.54, 0.36, 0.19, ..., 0.  , 0.01, 0.  ],
       [0.46, 0.45, 0.38, ..., 0.01, 0.04, 0.  ],
       [0.23, 0.55, 0.43, ..., 0.01, 0.06, 0.  ],
       ...,
       [0.31, 0.64, 0.33, ..., 0.  , 0.01, 0.  ],
       [0.54, 0.27, 0.19, ..., 0.01, 0.02, 0.  ],
       [0.62, 0.18, 0.14, ..., 0.01, 0.01, 0.  ]])
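The scaler above is fitted on the training data only and then applied to both sets, which keeps information from the test set out of the preprocessing. A small sketch with made-up numbers (`toy_train`, `toy_test`, and `toy_scaler` are hypothetical) shows one consequence: a test value outside the training range is mapped outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: two features, three training rows, one test row.
toy_train = np.array([[20.0, 150.0], [40.0, 170.0], [60.0, 190.0]])
toy_test = np.array([[30.0, 200.0]])

# Learn per-feature min and max from the training rows only.
toy_scaler = MinMaxScaler().fit(toy_train)

# Each value becomes (x - train_min) / (train_max - train_min).
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled)
print(toy_test_scaled)
```

The second test feature (200) exceeds the training maximum (190), so it scales to about 1.25. If a model requires inputs strictly within [0, 1], `MinMaxScaler(clip=True)` is one option in recent scikit-learn versions.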

In [11]:
pd.DataFrame(train_normalized_array, columns=features).to_csv("output/train_data_normalized.csv", index=False)
pd.DataFrame(test_normalized_array, columns=features).to_csv("output/test_data_normalized.csv", index=False)

The dataset now has a baseline min-max normalization. Each team member can use this version or apply their own preprocessing, depending on their model.

Our baseline model will use this preprocessed data.
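For anyone picking up the saved files, the sketch below shows one way to read them back. It writes small made-up data to a temporary directory instead of touching `output/`, so the values and the `toy_*` names are illustrative assumptions; only the file names and the `index=False` convention come from the cells above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-ins for the normalized features and the labels.
toy_features = pd.DataFrame({"age": [0.54, 0.46], "height(cm)": [0.36, 0.45]})
toy_labels = pd.Series([0, 1], name="smoking")

with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = Path(tmp_dir) / "train_data_normalized.csv"
    labels_path = Path(tmp_dir) / "train_labels.csv"

    # Same convention as the notebook: no index column in the files.
    toy_features.to_csv(features_path, index=False)
    toy_labels.to_csv(labels_path, index=False)

    loaded_features = pd.read_csv(features_path)
    # The label file holds a single "smoking" column; selecting it
    # returns a Series again instead of a one-column DataFrame.
    loaded_labels = pd.read_csv(labels_path)["smoking"]

print(loaded_features.shape, loaded_labels.tolist())
```

Because the files are written without an index, rows in the feature and label files line up by position only, so neither file should be shuffled or filtered on its own.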