## ML Final Exam

### Question 1:  Classification Task (15 Marks)



The given “Health Monitoring” dataset has 3000 rows, 4 features (Heart_Rate, Blood_Pressure,
Cholesterol, Blood_Sugar), a target variable (Risk_Level), and some missing values. The target
variable has 3 classes ('Low', 'Medium', or 'High'). Perform the following tasks:

In [None]:
#!pip install numpy pandas matplotlib seaborn scikit-learn

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split,  GridSearchCV
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
# Load the data to the DataFrame
df_ML_q1 = pd.read_csv('Health_Monitoring_System_Data.csv')

print(df_ML_q1.shape)
df_ML_q1.head()

In [None]:
df_ML_q1.describe()

#df_ML_q1.dtypes

In [None]:
# Check for missing values in the dataset
df_ML_q1.isnull().sum()

### a) Data Preprocessing

There are several methods for handling missing values. The following are some that we can consider using.
1) Dropping Missing Values:
    Benefits:
        - Simple and straightforward.
        - Preserves the structure of the dataset.
    Shortcomings:
        - May result in loss of a significant amount of data, especially if there are many missing values.
        - If the missing values are not completely at random, dropping them may introduce bias.

2) Imputation using Mean/Median/Mode:
    Benefits:
        - Easy to implement.
        - Preserves the number of observations in the dataset.
        - Works well for variables with a normal distribution.
    Shortcomings:
        - Ignores relationships between features.
        - Can distort the distribution and relationships in the data.
        - May not be suitable for variables with skewed distributions.
    
3) Imputation using Predictive Models (e.g., Regression):
    Benefits:
        - Utilizes relationships between variables.
        - Preserves the variability and structure of the data.
    Shortcomings:
        - Assumes the relationship between variables is linear.
        - May not work well if the dataset is small or if the relationship is complex.
        - Sensitive to outliers.

4) Multiple Imputation:
    Benefits:
        - Takes into account the uncertainty associated with imputed values.
        - Preserves the variance and relationships in the data.
    Shortcomings:
        - More computationally expensive.
        - Requires assumptions about the distribution of the data.

5) Imputation using K-Nearest Neighbors (KNN):
    Benefits:
        - Considers relationships between variables.
        - Non-parametric and can handle non-linear relationships.
    Shortcomings:
        - Sensitive to the choice of the number of neighbors (K).
        - Computationally expensive for large datasets.

The choice of the strategy depends on the nature of the data, the extent of missingness, and the goals of the analysis. It's often recommended to explore the data, understand the missing data mechanisms, and select an approach that aligns with the specific characteristics of the dataset.
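
As a rough illustration of how three of these strategies behave, the sketch below applies dropping, median imputation, and KNN imputation to a small made-up frame. The `toy` frame and its values are assumptions for illustration only and are not rows from the exam dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data; not taken from Health_Monitoring_System_Data.csv
toy = pd.DataFrame({
    "Heart_Rate":     [70.0, np.nan, 82.0, 90.0, 76.0],
    "Blood_Pressure": [120.0, 118.0, 135.0, 140.0, 125.0],
    "Blood_Sugar":    [95.0, 92.0, np.nan, 130.0, 100.0],
})

# 1) Dropping: keeps only the rows that are fully observed
dropped = toy.dropna()

# 2) Median imputation: each column is filled independently of the others
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(toy), columns=toy.columns
)

# 5) KNN imputation: each gap is filled from the 2 most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns
)

print(f"Rows after dropping: {len(dropped)} of {len(toy)}")
print(median_filled.round(1))
print(knn_filled.round(1))
```

Dropping shrinks the frame, while both imputers keep every row; they differ only in how the filled values are chosen.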








In [None]:
# Verify the nature of the missing values:
# do the rows with a missing 'Heart_Rate' also have a missing 'Blood_Sugar'?

# Boolean masks marking the rows where each column is missing
missing_heart_rate = df_ML_q1['Heart_Rate'].isnull()
missing_blood_sugar = df_ML_q1['Blood_Sugar'].isnull()

# True only if every row with a missing 'Heart_Rate' also has a missing 'Blood_Sugar'
# (Series.equals would instead test whether the two masks are identical)
all_missing_heart_rate_have_missing_blood_sugar = missing_blood_sugar[missing_heart_rate].all()

# Print the result
print(f"All rows with missing 'Heart_Rate' also missing 'Blood_Sugar': {all_missing_heart_rate_have_missing_blood_sugar}")


Since the target column, 'Risk_Level', has no missing values, we can use machine learning (ML) imputation methods based on the other columns to populate the missing values in the 'Heart_Rate' and 'Blood_Sugar' columns.

In this predictive imputation method, we train a machine learning model on the rows where the column to be imputed has values. The model then predicts the missing values in that column.

Here's a general outline of the steps:
    - Identify the columns to impute: 'Heart_Rate' and 'Blood_Sugar'.
    - Split the dataset into two parts: a training set containing all rows without missing values, and a test set containing the rows with missing values.
    - Use the non-missing values in the other columns as features to train the model.
    - Use the trained model to predict the missing values.
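
The steps above can be sketched for a single column as follows. This is a minimal illustration on made-up data, assuming only 'Heart_Rate' has gaps and that the predictor columns are fully observed; the `toy` frame and the choice of `n_neighbors=2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data; only "Heart_Rate" has gaps
toy = pd.DataFrame({
    "Blood_Pressure": [110.0, 120.0, 130.0, 140.0, 150.0, 125.0],
    "Cholesterol":    [180.0, 190.0, 210.0, 230.0, 250.0, 200.0],
    "Heart_Rate":     [65.0, 70.0, 78.0, np.nan, 92.0, np.nan],
})
target = "Heart_Rate"
predictors = ["Blood_Pressure", "Cholesterol"]

# Split into rows where the target is known and rows where it is missing
known = toy[toy[target].notna()]
unknown = toy[toy[target].isna()]

# Train on the complete rows, then predict the gaps
model = KNeighborsRegressor(n_neighbors=2)
model.fit(known[predictors], known[target])

filled = toy.copy()
filled.loc[unknown.index, target] = model.predict(unknown[predictors])
print(filled)
```

When two columns are missing in the same rows, as may be the case here, the predictors for each model have to come from the columns that are fully observed.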


The choice of imputation model, including whether to use a nearest-neighbour approach such as KNeighborsRegressor or the closely related KNNImputer, depends on factors such as the nature of the data, the distribution of missing values, and the underlying relationships between variables. KNeighborsRegressor is just one option, and its suitability varies with the use case. Here are some reasons why it might be recommended:

Local Relationships: KNeighborsRegressor imputes missing values by considering the values of the nearest neighbors. This can be beneficial if the underlying relationships in your data are local or if there is spatial/temporal structure. For example, similar individuals might have similar heart rates.

Non-linearity: If the relationship between the features and the target variable (Heart_Rate or Blood_Sugar) is non-linear, KNeighborsRegressor can capture these non-linear patterns. Mean imputation ignores the relationships between features altogether, and linear regression assumes they are linear, so neither may be suitable for all datasets.

Flexibility: KNeighborsRegressor is a non-parametric method and doesn't make strong assumptions about the distribution of the data. It can adapt to complex patterns without assuming a specific functional form.

However, it's important to note the limitations as well:

Computational Cost: Calculating distances to find nearest neighbors can be computationally expensive, especially with large datasets. Other imputation methods, like mean imputation or regression imputation, might be computationally more efficient.

Sensitivity to Noise: If your dataset has noise or outliers, KNeighborsRegressor might be sensitive to them, potentially leading to less robust imputation results.

Hyperparameter Tuning: The performance of KNeighborsRegressor depends on the choice of hyperparameters, such as the number of neighbors (n_neighbors). Tuning these parameters is crucial for achieving good imputation results.

In practice, it's often a good idea to try multiple imputation methods, including both parametric and non-parametric approaches, and evaluate their performance using cross-validation or other validation strategies to choose the method that works best for your specific dataset.
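
One way to compare imputation settings is to hide some values that are actually known, impute them, and score the result against the truth. The sketch below does this for several `n_neighbors` values on synthetic data; the data, the 10% masking rate, and the use of RMSE as the score are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)

# Hypothetical synthetic data: two correlated columns plus one noise column
n = 300
x = rng.normal(size=n)
data = np.column_stack([
    x,
    2.0 * x + rng.normal(scale=0.3, size=n),
    rng.normal(size=n),
])

# Hide about 10% of the values in column 1 so the true values are known for scoring
mask = rng.random(n) < 0.10
data_missing = data.copy()
data_missing[mask, 1] = np.nan

# Score each candidate k by RMSE on the hidden entries
scores = {}
for k in range(2, 11):
    imputed = KNNImputer(n_neighbors=k).fit_transform(data_missing)
    scores[k] = float(np.sqrt(np.mean((imputed[mask, 1] - data[mask, 1]) ** 2)))

best_k = min(scores, key=scores.get)
print(f"Best n_neighbors by held-out RMSE: {best_k}")
```

The same masking idea can be applied to the complete rows of the real dataset to compare KNN imputation against simpler baselines such as the median.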

In [None]:
# Impute missing values using KNNImputer


# Columns that contain missing values
columns_to_impute = ['Heart_Rate', 'Blood_Sugar']
# All numeric features; the imputer uses them together to find the nearest rows
feature_columns = ['Heart_Rate', 'Blood_Pressure', 'Cholesterol', 'Blood_Sugar']

# Create a copy of the DataFrame for imputation
df_imputed = df_ML_q1.copy()

# Define a range of n_neighbors values to try
neighbor_values = list(range(2, 11))

# Initialize variables to keep track of the best imputation
best_imputation = None
best_n_neighbors = None
smallest_nulls = float('inf')  # Initialize with a large value

for n_neighbors in neighbor_values:
    imputer = KNNImputer(n_neighbors=n_neighbors)

    # Impute on all feature columns so each filled value lands in its own column
    # (fitting on only the other columns would overwrite 'Heart_Rate' and
    # 'Blood_Sugar' with the values of those columns)
    df_temp = df_imputed.copy()  # Create a temporary DataFrame for each iteration
    df_temp[feature_columns] = imputer.fit_transform(df_temp[feature_columns])

    # Check the number of remaining null values
    nulls = df_temp[columns_to_impute].isnull().sum().sum()

    # Keep the current imputation only if it leaves fewer nulls.
    # KNNImputer fills every gap, so this count is 0 for each n_neighbors and
    # the first value tried is kept; it does not measure imputation quality.
    if nulls < smallest_nulls:
        best_imputation = df_temp.copy()
        best_n_neighbors = n_neighbors
        smallest_nulls = nulls

# Print the best n_neighbors and the corresponding imputation
print(f"Best n_neighbors: {best_n_neighbors}")
print("Imputed DataFrame:")
print(best_imputation)
# Verify the imputed DataFrame
print(best_imputation.isnull().sum())


In [None]:
# Save the imputed DataFrame as the DataFrame for the question.
df_ML_q1 = best_imputation.copy()
df_ML_q1.isnull().sum()

#### Normalize the features in the Dataset.
There are two major factors that determine the normalization method for a dataset: the shape of the distribution and the presence of outliers. Plotting the features helps determine the most appropriate method.

Normal Distribution:

If the data is roughly normally distributed, StandardScaler is a good choice. It scales each feature to a mean of 0 and a standard deviation of 1; it does not require normality, but the result is easiest to interpret when the features are approximately Gaussian.
If the data deviates significantly from a normal distribution, other scaling methods such as MinMaxScaler or RobustScaler may be more appropriate.

Outliers:

Sensitive to Outliers: MinMaxScaler is sensitive to outliers because the minimum and maximum define the scale, so extreme values disproportionately influence it. StandardScaler is less affected, although outliers still shift its mean and standard deviation.
Robust to Outliers: RobustScaler scales with the median and interquartile range, making it a suitable choice if the data has extreme values.
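
To make the outlier sensitivity concrete, the following sketch scales one made-up feature that contains a single extreme reading and reports how much of the scaled range is left for the typical values. The numbers are hypothetical and chosen only to exaggerate the effect.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature with one extreme value (500) among typical readings
values = np.array([[70.0], [72.0], [75.0], [78.0], [80.0], [500.0]])

scaled = {
    "StandardScaler": StandardScaler().fit_transform(values).ravel(),
    "MinMaxScaler": MinMaxScaler().fit_transform(values).ravel(),
    "RobustScaler": RobustScaler().fit_transform(values).ravel(),
}

spreads = {}
for name, result in scaled.items():
    # Spread of the five typical readings after scaling
    spreads[name] = result[:5].max() - result[:5].min()
    print(f"{name:>14}: typical-value spread = {spreads[name]:.3f}, outlier = {result[5]:.2f}")
```

The larger the spread of the typical readings, the less the single outlier has compressed them.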

In [None]:
sns.pairplot(df_ML_q1, diag_kind = "kde")

Based on the above graph, all the features resemble a normal distribution with no outliers. So, the most appropriate normalization method is StandardScaler.
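
A visual reading of the pair plot can be backed up with numbers. The sketch below computes skewness and the share of points outside the usual 1.5 × IQR boxplot fences; it runs on synthetic stand-in columns (an assumption), and the same calls could be pointed at the real feature columns instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for two feature columns of the real DataFrame
features = pd.DataFrame({
    "Heart_Rate": rng.normal(75, 10, 1000),
    "Blood_Sugar": rng.normal(100, 15, 1000),
})

# Skewness near 0 suggests a roughly symmetric distribution
skewness = features.skew()

# Share of points beyond 1.5 * IQR from the quartiles (the usual boxplot rule)
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
outside = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)
outlier_share = outside.mean()

print(pd.DataFrame({"skewness": skewness, "outlier_share": outlier_share}).round(3))
```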

In [None]:
df_ML_q1.head(20)

In [None]:
# Extract features and target variable
X = df_ML_q1[['Heart_Rate', 'Blood_Pressure', 'Cholesterol', 'Blood_Sugar']]
y = df_ML_q1['Risk_Level']

# Normalize features using StandardScaler
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)

#### Split the data into training and test sets.


In [None]:
# Split the data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42, stratify=y)

# Now X_train, X_test, y_train, y_test can be used for training and testing your machine learning model
print(f'X_Train\'s Shape: {X_train.shape}')
print(f'y_Train\'s Shape: {y_train.shape}')
print(f'X_Test\'s Shape: {X_test.shape}')
print(f'y_Test\'s Shape: {y_test.shape}')
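
Because the imputer and the scaler above are fitted on the full dataset before the split, statistics from the test rows can leak into preprocessing. A minimal sketch of a split-first alternative is shown below, assuming synthetic stand-in data and a scikit-learn `Pipeline`; the step names, `n_neighbors=5`, and the classifier are illustrative choices, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical synthetic stand-in for the exam data (values are made up)
X_raw = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["Heart_Rate", "Blood_Pressure", "Cholesterol", "Blood_Sugar"],
)
X_raw.iloc[rng.choice(300, size=30, replace=False), 0] = np.nan  # inject gaps
y_raw = pd.Series(rng.choice(["Low", "Medium", "High"], size=300))

# Split first, so the test rows never influence the imputer or the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# Impute, scale, then classify; every step is fitted on the training rows only
pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2%}")
```

A pipeline like this can also be passed directly to `GridSearchCV`, so that each cross-validation fold refits the preprocessing on its own training portion.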

### b) Model Building

There are several machine learning models commonly used for classification problems.  Below are some of the common ones and their suitability.

1) Logistic Regression: A simple and interpretable model suitable for binary and multi-class classification.
2) Decision Trees: Easy to understand and interpret, can handle both numerical and categorical data.
3) Random Forest: This is an improved version of Decision Trees; an ensemble of decision trees.  Often more robust and accurate than the individual trees. It handles over-fitting well.
4) Support Vector Machines (SVM): Effective in high-dimensional space, good for binary and multi-class classification.
5) K-Nearest Neighbors (KNN): This is a non-parametric learning algorithm suitable for both binary and multi-class classification.
6) Naive Bayes: A fast and simple classification model that assumes independence between features.

Let's test them all and see which one is the most accurate based on the accuracy score: Accuracy = Number of Correct Predictions / Total Number of Predictions 
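
As a small worked example of the formula, the sketch below counts correct predictions by hand on six made-up labels and compares the result with `accuracy_score`. The label lists are hypothetical.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for six patients
y_true = ["Low", "Medium", "High", "Low", "Medium", "High"]
y_hat = ["Low", "Medium", "Low", "Low", "High", "High"]

# Count matches by hand, then divide by the number of predictions
correct = sum(t == p for t, p in zip(y_true, y_hat))
manual_accuracy = correct / len(y_true)

print(f"{correct} of {len(y_true)} correct -> accuracy = {manual_accuracy:.3f}")
print(f"accuracy_score gives {accuracy_score(y_true, y_hat):.3f}")
```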

In [None]:
# 1) Logistic Regression 


# Create a Logistic Regression model
logistic_model = LogisticRegression(random_state=42)

# Define hyperparameters for grid search
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l2'], 'solver': ['lbfgs', 'liblinear']}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(logistic_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model
best_logistic_model = grid_search.best_estimator_

# Make predictions on the test data
y_pred_logistic = best_logistic_model.predict(X_test)

# Calculate accuracy score
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)

# Capture the model name and accuracy score dynamically
model_name_accuracy_logistic = {type(best_logistic_model).__name__: accuracy_logistic}
print(model_name_accuracy_logistic)

# Print a detailed classification report with dynamic model name
print("Classification Report for", type(best_logistic_model).__name__, ":\n", classification_report(y_test, y_pred_logistic))

In [None]:
# 2) Decision Tree


# Create a Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)

# Define hyperparameters for grid search
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(dt_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model
best_dt_model = grid_search.best_estimator_

# Make predictions on the test data
y_pred_dt = best_dt_model.predict(X_test)

# Calculate accuracy score
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# Capture the model name and accuracy score dynamically
model_name_accuracy_dt = {type(best_dt_model).__name__: accuracy_dt}
print(model_name_accuracy_dt)

# Print the results
print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Accuracy: {accuracy_dt * 100:.2f}%")
print("Classification Report for", type(best_dt_model).__name__, ":\n", classification_report(y_test, y_pred_dt))


In [None]:
# 3) Random Forest.  This can take up to 5 mins.  


# Create a Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Define hyperparameters for grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(rf_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model
best_rf_model = grid_search.best_estimator_

# Make predictions on the test data
y_pred_rf = best_rf_model.predict(X_test)

# Calculate accuracy score
accuracy_rf = accuracy_score(y_test, y_pred_rf)

# Capture the model name and accuracy score dynamically
model_name_accuracy_rf = {type(best_rf_model).__name__: accuracy_rf}
print(model_name_accuracy_rf)

# Print the results
print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Accuracy: {accuracy_rf * 100:.2f}%")
print("Classification Report for", type(best_rf_model).__name__, ":\n", classification_report(y_test, y_pred_rf))

In [None]:
# 4) Support Vector Machine(SVM)


# Create an SVM model
svm_model = SVC(random_state=42)

# Define hyperparameters for grid search
param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf', 'poly', 'sigmoid']}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(svm_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model
best_svm_model = grid_search.best_estimator_

# Make predictions on the test data
y_pred_svm = best_svm_model.predict(X_test)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
model_name_accuracy_svm = {type(best_svm_model).__name__: accuracy_svm}

# Print the results with dynamic model name
print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Accuracy for {type(best_svm_model).__name__}: {accuracy_svm * 100:.2f}%")
print("Classification Report for", type(best_svm_model).__name__, ":\n", classification_report(y_test, y_pred_svm, zero_division=1))

In [None]:
# 5) K-Nearest Neighbors (KNN)


# Create a KNN model
knn_model = KNeighborsClassifier()

# Define hyperparameters for grid search
param_grid = {
    'n_neighbors': list(range(2, 11)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1 for Manhattan distance, 2 for Euclidean distance
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(knn_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model
best_knn_model = grid_search.best_estimator_

# Make predictions on the test data
y_pred_knn = best_knn_model.predict(X_test)

# Evaluate the model
accuracy_knn = accuracy_score(y_test, y_pred_knn)
model_name_accuracy_knn = {type(best_knn_model).__name__: accuracy_knn}

# Print the results with dynamic model name
print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Accuracy for {type(best_knn_model).__name__}: {accuracy_knn * 100:.2f}%")
print("Classification Report for", type(best_knn_model).__name__, ":\n", classification_report(y_test, y_pred_knn))

In [None]:
# 6) Naive Bayes


# Define hyperparameters for grid search
param_grid_gnb = {
    'var_smoothing': [1e-20, 1e-15, 1e-10, 1e-5, 1e-2, 1e-1, 1.0]
}

# Use GridSearchCV to find the best hyperparameters
grid_search_gnb = GridSearchCV(GaussianNB(), param_grid_gnb, cv=5)
grid_search_gnb.fit(X_train, y_train)

# Get the best model
best_gnb_model = grid_search_gnb.best_estimator_

# Make predictions on the test data
y_pred_gnb = best_gnb_model.predict(X_test)

# Evaluate the model
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
model_name_accuracy_gnb = {type(best_gnb_model).__name__: accuracy_gnb}

# Print the results with dynamic model name
print(f"Best Hyperparameters for {type(best_gnb_model).__name__}: {grid_search_gnb.best_params_}")
print(f"Accuracy for {type(best_gnb_model).__name__}: {accuracy_gnb * 100:.2f}%")
print("Classification Report for", type(best_gnb_model).__name__, ":\n", classification_report(y_test, y_pred_gnb))

In [None]:
# Plot the features with the Risk Level
# using  data=df_ML_q1 or best_imputation


plt.figure(figsize=(10,7))

# List of column to plot
columns = ['Heart_Rate', 'Blood_Pressure', 'Cholesterol', 'Blood_Sugar']

for i, column in enumerate(columns, 1):
    plt.subplot(2,2,i)
    sns.boxplot(x='Risk_Level', y=column, data=df_ML_q1, order=['Low', 'Medium', 'High'])
    plt.title(f'Boxplot of {column} by Risk Level')
    
plt.tight_layout()
plt.show()

In [None]:
# Compile the results from the models into a list.


# Map each model's display name to the test accuracy computed in the cells above
model_accuracies = {
    'Logistic Regression': accuracy_logistic,
    'Decision Tree': accuracy_dt,
    'Random Forest': accuracy_rf,
    'Support Vector Machines(SVM)': accuracy_svm,
    'K-Nearest Neighbours(KNN)': accuracy_knn,
    'Naive Bayes: GaussianNB': accuracy_gnb,
}

# Build one result record (model name and accuracy score) per model
all_results = [
    {'Model Name': model_name, 'Accuracy': accuracy}
    for model_name, accuracy in model_accuracies.items()
]

# Print compiled results
for result in all_results:
    print(result)

In [None]:
# Plot - best performing model by accuracy


# Convert the list of dictionaries into a DataFrame
results_df = pd.DataFrame(all_results)

# Identify the best model
best_model = results_df.loc[results_df['Accuracy'].idxmax(), 'Model Name']

# Plot the results using seaborn
plt.figure(figsize=(10, 6))
plot = sns.barplot(x='Accuracy', y='Model Name', data=results_df, palette='viridis', orient='h')  # orient='h' for horizontal
plt.title('Model Accuracy Comparison')
plt.xlabel('Accuracy')
plt.ylabel('Model')
plt.xlim(0.0, 0.5)  # Set the x-axis limit between 0 and 0.5 for accuracy
plt.yticks(rotation=0)  # Keep y-axis labels horizontal for better visibility

# Display the accuracy value at the end of each bar
for index, value in enumerate(results_df['Accuracy']):
    plt.text(value, index, f'{value:.2%}', va='center')

# Highlight the best model with a red bar and a dashed line
best_model_index = results_df.index[results_df['Model Name'] == best_model].tolist()[0]
plot.patches[best_model_index].set_facecolor('red')
plt.axvline(results_df.loc[results_df['Model Name'] == best_model, 'Accuracy'].values[0],
            color='red', linestyle='--', label=f'Best Model: {best_model}')
plt.legend()  # Show the 'Best Model' label attached to the dashed line
plt.show()