# Data Analysis before Data Transformation
Data visualization has provided the following information about the unprocessed dataset:
1. When the relationship between age and diabetes is analyzed, there is a small fall in cases between the mid-60s and the 70s. Afterwards, an abrupt jump is observed at around 80.

2. The number of diabetes cases in males and females differs very little.

3. The mean and median do not differ greatly, which suggests that the numerical attributes are almost symmetrically distributed.

4. The quartiles show an expected variation in attributes.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For term documentation, please visit the Wiki on GitLab: Statistical Term Documentation #

df = pd.read_csv('smotted_dataset.csv')

#Show ratio through chart
#sns.countplot(x='diabetes', data=df)

# Age Distribution
sns.histplot(data=df, x='age', hue='diabetes', multiple='stack')

# Blood Glucose Levels Distribution
#sns.histplot(data=df, x='current_blood_glucose_level', hue='diabetes', multiple='stack')

#HbA1c_level
#sns.histplot(data=df, x='average_blood_glucose_level', hue='diabetes', multiple='stack')

# Weight Distribution
# Plot KDE for BMI with diabetes status
#sns.kdeplot(data=df, x='bmi', hue='diabetes', fill=True)

numerical_df = df.select_dtypes(include=['float64', 'int64']).drop(columns=['hypertension', 'heart_disease', 'diabetes'])
# Mean and Median
print("Median:", numerical_df.median(), "\n")
print("Mean:", numerical_df.mean(), "\n")

# Standard Deviation
print("Standard Deviation:", numerical_df.std())

# The values at the quartile divisions
print(numerical_df.quantile(q=[0.25, 0.5, 0.75], axis=0, numeric_only=True))

plt.show()
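
The symmetry observation in point 3 can be made more concrete by expressing the gap between mean and median in units of standard deviation, alongside the sample skewness. The following is a minimal sketch on a hypothetical toy DataFrame (the column names and values are invented for illustration and are not taken from our dataset); a gap and a skewness close to 0 suggest a roughly symmetrical attribute.

```python
import pandas as pd

# Hypothetical toy data: one symmetric column and one right-skewed column
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 3, 13],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "std": toy.std(),
})
# Distance between mean and median, measured in standard deviations
summary["gap_in_std"] = (summary["mean"] - summary["median"]) / summary["std"]
# Sample skewness: positive values indicate a longer right tail
summary["skew"] = toy.skew()

print(summary.round(3))
```

The same summary could be built from numerical_df in the cell above to compare all numerical attributes at once.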

# Removal of Incomplete Examples in Dataset
Using pandas, we removed any incomplete or duplicate examples present in our dataset. Afterwards, we analyzed the data to ensure that all of our labels are direct labels and not proxy labels.

In [None]:
import pandas as pd

df = pd.read_csv('cleaned_dataset.csv')
# dropna() returns a new DataFrame, so the result must be assigned back
df = df.dropna()
df = df.drop_duplicates()
df.reset_index(drop=True, inplace=True)

df.to_csv('cleaned_dataset.csv', index=False)
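
As a quick sanity check of the cleaning step, the sketch below runs the same calls on a hypothetical four-row DataFrame (values invented for illustration). It shows that calling dropna() without assigning the result leaves the DataFrame unchanged, and that chaining dropna() and drop_duplicates() removes both incomplete and repeated rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are duplicates, rows 2 and 3 are incomplete
toy = pd.DataFrame({
    "age": [54, 54, np.nan, 36],
    "bmi": [27.3, 27.3, 22.1, np.nan],
    "diabetes": [0, 0, 1, 0],
})

toy.dropna()  # result is discarded, so toy still contains the incomplete rows
rows_before = len(toy)

# Assigning the result keeps the cleaned version
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

print("rows before:", rows_before)
print("rows after:", len(cleaned))
print("missing values left:", int(cleaned.isna().sum().sum()))
```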

# Balancing Dataset
The next step was to make the dataset as balanced as possible. Initially, the dataset had a ratio of approximately 10:1 (10 non-diabetic individuals for every diabetic patient).

The following steps were performed:
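1. The majority class was undersampled until the ratio was 5:1 (sampling_strategy = 0.2, i.e. one diabetic patient for every five non-diabetic individuals), which roughly halves the majority class. Undersampling further, down to a 1:1 ratio, would have been possible, but this would lead to extreme data loss.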

In [None]:
from imblearn.under_sampling import RandomUnderSampler
import pandas as pd

df = pd.read_csv('cleaned_dataset.csv')

X = df.drop('diabetes', axis=1)
y = df['diabetes']

# Undersampling the majority class
# sampling_strategy is the desired minority-to-majority ratio after resampling.
# With 0.2, five non-diabetics are kept for every diabetic (5:1).
rus = RandomUnderSampler(random_state=42, sampling_strategy = 0.2)
X_res, y_res = rus.fit_resample(X, y)

# Ratio after undersampling (42410 non-diabetics to 8482 diabetics in our run)
class_counts = y_res.value_counts()
ratio = class_counts.max() / class_counts.min()
#print(class_counts, f"Ratio: {ratio}")
df_resampled = pd.concat([X_res, y_res], axis=1)
df_resampled['age'] = df_resampled['age'].astype(int)
df_resampled.to_csv('downsampled_dataset.csv', index=False) # Creates a different file with removed majority #
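
To illustrate what a minority-to-majority ratio of 0.2 means in practice, here is a minimal pandas-only sketch on hypothetical toy data. The helper undersample_majority is our own illustrative name and is not part of imbalanced-learn; it only mimics the idea of keeping all minority rows and a random subset of majority rows.

```python
import pandas as pd

def undersample_majority(df, label, ratio, seed=42):
    """Keep all minority rows and sample majority rows so that
    minority / majority is approximately `ratio` (illustrative helper)."""
    counts = df[label].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    # Number of majority rows needed to reach the requested ratio
    n_keep = int(round(float(counts[minority]) / ratio))
    n_keep = min(n_keep, int(counts[majority]))
    kept = df[df[label] == majority].sample(n=n_keep, random_state=seed)
    return pd.concat([kept, df[df[label] == minority]]).reset_index(drop=True)

# Hypothetical toy data with a 10:1 imbalance
toy = pd.DataFrame({"x": range(110), "diabetes": [0] * 100 + [1] * 10})
balanced = undersample_majority(toy, "diabetes", ratio=0.2)

counts = balanced["diabetes"].value_counts()
print(counts)
print("Ratio:", counts[0] / counts[1])
```

Reading the ratio from value_counts() rather than from fixed numbers keeps the check valid if the input file changes.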

2. The minority class was oversampled using the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTENC) method. SMOTENC is required, as the dataset contains both categorical and numerical data. Overfitting remains a concern, as the new examples are not newly collected data but are synthesized from pre-existing examples; even so, SMOTENC reduces the chance of overfitting in comparison to Random Oversampling.
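
The way SMOTE-style methods derive a new example from existing ones can be sketched for the continuous features as a linear interpolation between a minority sample and one of its nearest minority neighbours. The snippet below is a simplified illustration with hypothetical values (age, bmi); it is not the library's implementation and leaves out the neighbour search and the handling of categorical features.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority sample and one of its nearest minority neighbours (age, bmi)
sample = np.array([50.0, 30.0])
neighbour = np.array([60.0, 34.0])

# Random position on the line segment between the two points
lam = rng.uniform(0.0, 1.0)
synthetic = sample + lam * (neighbour - sample)

print("lambda:", round(lam, 3))
print("synthetic example:", synthetic.round(2))
```

Because every synthetic point lies between two real ones, no information from outside the original data is introduced, which is why overfitting still has to be kept in mind.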

In [None]:
from imblearn.over_sampling import SMOTENC
import pandas as pd

# Read the data
df = pd.read_csv('downsampled_dataset.csv')
X = df.drop('diabetes', axis=1)
y = df['diabetes']

# Convert categorical variables to numeric codes before SMOTE
valid_smoking_history = ['never', 'former', 'current', 'No Info']
X['smoking_history'] = pd.Categorical(X['smoking_history'], 
                                    categories=valid_smoking_history,
                                    ordered=False)
X['smoking_history'] = X['smoking_history'].cat.codes

# Get the indices of categorical features
categorical_features_indices = [X.columns.get_loc(col) for col in ['gender', 'smoking_history']]

# Apply SMOTE
smote = SMOTENC(random_state=42, 
                sampling_strategy=0.5, 
                k_neighbors=5, 
                categorical_features=categorical_features_indices)
X_res, y_res = smote.fit_resample(X, y)

# Convert smoking_history back to categories
X_res['smoking_history'] = pd.Categorical.from_codes(
    X_res['smoking_history'].astype('int'),
    categories=valid_smoking_history
)

# Round numerical columns to match the format of the original data
# (columns are addressed by name, so their order does not matter)
X_res['age'] = X_res['age'].round().astype(int)
X_res['bmi'] = X_res['bmi'].round(2)
X_res['average_blood_glucose_level'] = X_res['average_blood_glucose_level'].round(1)
X_res['current_blood_glucose_level'] = X_res['current_blood_glucose_level'].round().astype(int)
# hypertension and heart_disease should already be 0 or 1

# Create the resampled dataset
df_resampled = pd.concat([X_res, y_res], axis=1)

# Save with correct formatting
df_resampled.to_csv('smotted_dataset.csv', index=False, float_format='%.2f')
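
The conversion of smoking_history to integer codes and back can be checked in isolation. The sketch below uses the same category list as above on a hypothetical Series; the value 'ever' is invented here purely to show that anything outside the listed categories is encoded as -1 and comes back as a missing value.

```python
import pandas as pd

valid_smoking_history = ['never', 'former', 'current', 'No Info']

# Hypothetical input; 'ever' is deliberately not in the category list
raw = pd.Series(['never', 'current', 'ever', 'No Info'])

# Encode: position in valid_smoking_history, or -1 for unknown values
codes = pd.Series(pd.Categorical(raw, categories=valid_smoking_history)).cat.codes

# Decode: -1 is turned back into a missing value
restored = pd.Categorical.from_codes(codes, categories=valid_smoking_history)

print(codes.tolist())
print(list(restored))
```

If the real column contains values outside the list, they would therefore need to be mapped or removed before resampling.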

# Splitting Dataset
The dataset was split into a Training Set, Validation Set and Test Set:
- Training Set: 70%
- Validation Set: 15%
- Test Set: 15%

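The dataset being split contains roughly one diabetic in every three examples (a 1:2 ratio of diabetics to non-diabetics, following sampling_strategy = 0.5 in the SMOTENC step). Variations will be tested on the models being trained and documented in the future.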

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv('smotted_dataset.csv')
X = df.drop('diabetes', axis=1)
y = df['diabetes']

# First split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    random_state=104,
    test_size=0.3,
    shuffle=True)

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis = 1)

df_train.to_csv('train_dataset.csv', index = False, float_format='%.2f')

# Second split
X = df_test.drop('diabetes', axis=1)
y = df_test['diabetes']
X_validation, X_test, y_validation, y_test = train_test_split(
    X, y,
    random_state=104,
    test_size=0.5,
    shuffle=True)

df_test = pd.concat([X_test, y_test], axis = 1)
df_validation = pd.concat([X_validation, y_validation], axis = 1)

df_test.to_csv('test_dataset.csv', index= False, float_format='%.2f')
df_validation.to_csv('validation_dataset.csv', index= False, float_format='%.2f')
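
The 70/15/15 proportions follow from holding out 30% and then halving that hold-out set (0.3 × 0.5 = 0.15). The sketch below verifies this on hypothetical toy data with one positive in every three examples. Note that the stratify argument is an assumption added here for illustration; the cells above do not use it, so their class proportions may vary slightly between the three sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 diabetics and 200 non-diabetics
toy = pd.DataFrame({"feature": range(300), "diabetes": [1] * 100 + [0] * 200})
X, y = toy.drop("diabetes", axis=1), toy["diabetes"]

# First split: 70% training, 30% held out
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=104, stratify=y)

# Second split: the held-out 30% is halved into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=104, stratify=y_hold)

# Size and share of diabetics in each set
for name, part in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 3))
```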

# Transforming Categorical Data into Floating Point using One-Hot Encoding
As models can only train with floating point values, categorical data (strings, or numbers that cannot be treated as numerical quantities, such as postal codes) must be transformed into numerical data. As no categorical feature in our dataset has more than 4 different categories, we have opted for One-Hot Encoding.

One-Hot Encoding splits a categorical column into one column per category. For example, when the column "Gender" is One-Hot Encoded, it yields Gender_Female and Gender_Male, where 1 represents presence and 0 absence; this avoids the bias that an artificial ordering of the categories would introduce. Because the encoder below is created with drop='first', the first category of each column (for example Gender_Female) is dropped as redundant, since it is implied when all remaining columns are 0.

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Read the dataset
df = pd.read_csv('Data Transformation/linear_scaled_dataset.csv')

# Extract categorical columns from the dataframe
categorical_columns = ['gender', 'hypertension', 'heart_disease', 'smoking_history', 'diabetes']

# Initialize OneHotEncoder (dropping the first category to avoid redundant columns)
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Apply one-hot encoding to the categorical columns
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

# Create a DataFrame with the one-hot encoded columns
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

# Create a DataFrame for the original numerical columns
numerical_columns = ['age', 'bmi', 'average_blood_glucose_level', 'current_blood_glucose_level']
numerical_df = df[numerical_columns]

# Concatenate the one-hot encoded dataframe with the original numerical dataframe
df_encoded = pd.concat([numerical_df, one_hot_df], axis=1)

# Save the encoded DataFrame to a CSV file
df_encoded.to_csv('cat_to_num.csv', index=False)
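
The effect of dropping the first category is easiest to see on a small example. The sketch below uses pandas.get_dummies as a lightweight stand-in for the OneHotEncoder used above, applied to a hypothetical three-row DataFrame (values invented for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"gender": ["Female", "Male", "Female"], "age": [54, 36, 61]})

# One column per category
full = pd.get_dummies(toy, columns=["gender"], dtype=float)

# First category dropped: gender_Female is implied when gender_Male is 0
reduced = pd.get_dummies(toy, columns=["gender"], drop_first=True, dtype=float)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```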

# Normalization
Data visualization after the SMOTE-NC balancing technique has been applied shows the following:
1. When the relationship between age and diabetes is analyzed, a zig-zag pattern is observed, and a sudden jump in registered cases occurs at 75-80, to about double the next-highest registered number of cases. To combat prediction bias, both will be addressed.

2. The number of diabetes cases in males and females differs slightly.

3. The mean and median still do not differ greatly, although the SMOTE-NC balancing has added some variation. The standard deviation shows the biggest change in the blood glucose levels, where it rose from 40.90 to 52.55.

4. The quartiles show an expected variation in attributes, although the variation is higher than in the original dataset.

## Linear Scaling
The dataset does not appear to follow a normal distribution, which Z-Score standardization calls for, nor does it follow a power law, so we have opted for linear scaling: the lower and upper bounds should, in theory, not change over time. Additionally, the dataset contains few to no outliers.

The one drawback is that the features are not all uniformly distributed across their ranges, as some are right-skewed. Age is one example.

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('smotted_dataset.csv')

numerical_columns = ['age', 'bmi', 'average_blood_glucose_level', 'current_blood_glucose_level']

# Rescale each numerical column to the [0, 1] range
scaler = MinMaxScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

df.to_csv('linear_scaled_dataset.csv', index=False)
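
Linear (min-max) scaling maps each value x to (x - min) / (max - min), so that the smallest value of a column becomes 0 and the largest becomes 1. The sketch below applies this formula by hand to a hypothetical toy DataFrame (values invented for illustration) and shows that the original values can be recovered as long as the column minimum and maximum are kept.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"age": [20, 40, 80], "bmi": [18.5, 27.0, 35.5]})

col_min = toy.min()
col_max = toy.max()

# Min-max scaling, column by column
scaled = (toy - col_min) / (col_max - col_min)

# Inverse transform using the stored minimum and maximum
restored = scaled * (col_max - col_min) + col_min

print(scaled)
print(restored)
```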