In [187]:
# Name    : Aprilyanto Setiyawan Siburian
# NIM     : 24060121120022
# Dataset : National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset
# Link    : https://archive.ics.uci.edu/dataset/887/national+health+and+nutrition+health+survey+2013-2014+(nhanes)+age+prediction+subset

In [188]:
# 1. Explore another available classification algorithm! (LOGISTIC REGRESSION CLASSIFICATION)

In [189]:
# Importing the necessary functions and libraries
import pandas as pd
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Importing IRIS Dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)

In [190]:
# Displaying the first few rows of the dataset to quickly inspect the structure and content of a dataset
dataset.head()

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [191]:
# Checking the shape of the dataset, which provides information about the number of rows and columns in the Dataset
dataset.shape

(150, 5)

In [192]:
# Preprocessing the dataset
X = dataset.values[:,0:4] # Using the first four columns as the features of the model
Y = dataset.values[:,4] # Target feature (labels or classes) is in the fifth column, which is "class" column.

# Splitting-out validation dataset, validation options and evaluation metric
validation_size = 0.20
seed = 7
scoring = 'accuracy'

# Splitting the data into training and validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

# Printing the shapes of training and validation data
print("X_train =", X_train.shape)
print("X_validation =", X_validation.shape)
print("Y_train =", Y_train.shape)
print("Y_validation =", Y_validation.shape)

X_train = (120, 4)
X_validation = (30, 4)
Y_train = (120,)
Y_validation = (30,)


In [193]:
# Creating a logistic regression model
model = LogisticRegression(max_iter=2000, random_state=seed)

# Training the model
model.fit(X_train, Y_train)

# Making the predictions on the validation set
predictions = model.predict(X_validation)

# Evaluating the model
accuracy = accuracy_score(Y_validation, predictions)
conf_matrix = confusion_matrix(Y_validation, predictions)
classification_rep = classification_report(Y_validation, predictions)

print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{classification_rep}')

Accuracy: 0.8666666666666667
Confusion Matrix:
[[ 7  0  0]
 [ 0 10  2]
 [ 0  2  9]]
Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.83      0.83      0.83        12
 Iris-virginica       0.82      0.82      0.82        11

       accuracy                           0.87        30
      macro avg       0.88      0.88      0.88        30
   weighted avg       0.87      0.87      0.87        30



From the output above, the logistic regression model performs reasonably well, with an accuracy of 0.8667 (86.67%) on the 30 validation samples. The confusion matrix breaks the result down by class: all 7 Iris-setosa samples are classified correctly, while 2 Iris-versicolor samples are predicted as Iris-virginica and 2 Iris-virginica samples are predicted as Iris-versicolor. The classification report gives precision, recall, and F1-score for each class, along with the support (number of instances) of each class. Iris-setosa has the highest precision (1.00) and Iris-virginica the lowest (0.82).

The key components of the Classification Report:
*   Precision: The ratio of correctly predicted positive observations to the total predicted positives.
*   Recall: The ratio of correctly predicted positive observations to the total actual positives.
*   F1-score: The harmonic mean of precision and recall.
*   Support: The number of actual occurrences of the class in the specified dataset.
*   Macro Avg: The unweighted average of precision, recall, and F1-score.
*   Weighted Avg: The weighted average based on the support for each class.
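
The definitions above can be checked by hand. The following is a minimal sketch that recomputes the report's numbers from the logistic regression confusion matrix printed earlier; the variable names are illustrative only and it assumes NumPy is available.

```python
import numpy as np

# Confusion matrix copied from the logistic regression output
# (rows = true class, columns = predicted class).
cm = np.array([[7, 0, 0],
               [0, 10, 2],
               [0, 2, 9]])

true_positives = np.diag(cm)
support = cm.sum(axis=1)      # actual samples per class
predicted = cm.sum(axis=0)    # predictions made per class

precision = true_positives / predicted
recall = true_positives / support
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_positives.sum() / cm.sum()
macro_precision = precision.mean()                           # unweighted mean
weighted_precision = np.average(precision, weights=support)  # weighted by support

print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("f1-score :", f1.round(2))
print("accuracy :", round(float(accuracy), 4))
print("macro avg precision   :", round(float(macro_precision), 2))
print("weighted avg precision:", round(float(weighted_precision), 2))
```

The rounded values should agree with the precision, recall, and accuracy rows of the classification report for the Iris validation set.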

In [194]:
# 2. Evaluate the algorithms on the dataset used in the previous lab assignment (using three models: KNN, NB, and SVM)!

In [195]:
# Installing the ucimlrepo package
# ucimlrepo is a Python package for easily importing datasets from the UC Irvine Machine Learning Repository into scripts and notebooks.

!pip3 install -U ucimlrepo



In [196]:
# Importing the necessary functions and libraries
from ucimlrepo import fetch_ucirepo
from sklearn import model_selection
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

# Fetching the National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset
# The dataset has 10 columns; they are renamed below using this list of names to make them more readable
url = "https://archive.ics.uci.edu/static/public/887/data.csv"
names = ['ID', 'age_group', 'age', 'gender', 'activity_level', 'bmi', 'blood_glucose', 'is_diabetic', 'ogtt', 'insulin_levels']

# Reading the csv file from the URL and generating a DataFrame via read_csv function
dataset = pd.read_csv(url)

# Renaming the columns with the array of names
dataset.rename(columns={'SEQN': names[0], 'age_group': names[1], 'RIDAGEYR': names[2], 'RIAGENDR': names[3], 'PAQ605': names[4], 'BMXBMI': names[5], 'LBXGLU': names[6], 'DIQ010': names[7], 'LBXGLT': names[8], 'LBXIN': names[9]}, inplace=True)

In [197]:
# The "age_group" column is categorical and holds the strings "Adult" and "Senior", so we encode it as 0 and 1 respectively (LabelEncoder assigns codes in sorted order)
label_encoder = LabelEncoder()
dataset['age_group'] = label_encoder.fit_transform(dataset['age_group'])
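
To confirm which integer each label receives, the fitted encoder can be inspected. This is a small sketch on a hypothetical list of labels (not the real dataset); `toy_labels` and `mapping` are names introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the "age_group" column.
toy_labels = ["Adult", "Senior", "Adult", "Senior", "Adult"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ is sorted, and the position of each class is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(list(encoded))
print(list(encoder.inverse_transform([0, 1])))
```

The same `classes_` attribute on `label_encoder` shows the mapping used for the real column.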

In [198]:
# Displaying the first few rows of the dataset to quickly inspect the structure and content of a dataset
dataset.head()

Unnamed: 0,ID,age_group,age,gender,activity_level,bmi,blood_glucose,is_diabetic,ogtt,insulin_levels
0,73564.0,0,61.0,2.0,2.0,35.7,110.0,2.0,150.0,14.91
1,73568.0,0,26.0,2.0,2.0,20.3,89.0,2.0,80.0,3.85
2,73576.0,0,16.0,1.0,2.0,23.2,89.0,2.0,68.0,6.14
3,73577.0,0,32.0,1.0,2.0,28.9,104.0,2.0,84.0,16.15
4,73580.0,0,38.0,2.0,1.0,35.9,103.0,2.0,81.0,10.92


In [199]:
# Checking the shape of the dataset, which provides information about the number of rows and columns in the Dataset
dataset.shape

(2278, 10)

In [200]:
# Preprocessing the dataset
X = dataset.values[:,0:4] # Columns 0-3 are ID, age_group, age and gender. NOTE: this slice includes the target column (age_group) itself, so the label leaks into the features
Y = dataset.values[:,1] # Target (class label) is the second column, "age_group"

# Splitting-out validation dataset, validation options and evaluation metric
validation_size = 0.20
seed = 7
scoring = 'accuracy'

# Splitting the data into training and validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

# Printing the shapes of training and validation data
print("X_train =", X_train.shape)
print("X_validation =", X_validation.shape)
print("Y_train =", Y_train.shape)
print("Y_validation =", Y_validation.shape)

X_train = (1822, 4)
X_validation = (456, 4)
Y_train = (1822,)
Y_validation = (456,)


In [201]:
# Spot-Checking Algorithms
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# Evaluating each model in turn
results = []
names = []
for name, model in models:
  kfold = model_selection.KFold(n_splits=10, random_state=seed, shuffle=True)
  cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
  results.append(cv_results)
  names.append(name)
  msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
  print(msg)

KNN: 0.859479 (0.018690)
NB: 1.000000 (0.000000)
SVM: 0.838639 (0.019217)


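From the output above, NB has the highest cross-validated accuracy, 1.000000 (100%), followed by KNN (0.859) and SVM (0.839). A perfect score with zero spread across the folds is a warning sign rather than a result to trust: X was built from columns 0 to 3, which include the age_group target itself, so a model can read the label directly from its input. All three models are still fitted below and compared on the validation set.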
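
The effect of leaving the target inside the feature matrix can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not a rerun of the notebook: the `toy` DataFrame is randomly generated, only borrows some column names from the dataset, and its labels are independent of the other columns by construction. `holdout_accuracy` is a helper defined here for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): labels are random, so no honest
# model should be able to predict them from bmi or blood_glucose.
rng = np.random.default_rng(7)
n = 400
toy = pd.DataFrame({
    "ID": np.arange(n, dtype=float),
    "age_group": rng.integers(0, 2, size=n),
    "bmi": rng.normal(27, 5, size=n),
    "blood_glucose": rng.normal(100, 15, size=n),
})

target = "age_group"
leaky_cols = ["ID", "age_group", "bmi"]  # target included by mistake
clean_cols = [c for c in toy.columns if c not in ("ID", target)]


def holdout_accuracy(cols):
    """Fit GaussianNB on the given columns and score it on a 20% hold-out."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        toy[cols].to_numpy(), toy[target].to_numpy(),
        test_size=0.2, random_state=7)
    return GaussianNB().fit(X_tr, y_tr).score(X_va, y_va)


leaky_acc = holdout_accuracy(leaky_cols)
clean_acc = holdout_accuracy(clean_cols)
print("with target in X   :", leaky_acc)
print("without target in X:", clean_acc)
```

With the target among the features the score is perfect even though the labels are pure noise. For the real dataset, selecting features by name and excluding both ID and age_group would avoid the problem.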

In [202]:
# Creating the KNN Model
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train, Y_train)

# Make predictions
predictions = knn.predict(X_validation)

# Printing the evaluation of the performance of a machine learning model
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions, zero_division=1))

0.8947368421052632
[[383   3]
 [ 45  25]]
              precision    recall  f1-score   support

         0.0       0.89      0.99      0.94       386
         1.0       0.89      0.36      0.51        70

    accuracy                           0.89       456
   macro avg       0.89      0.67      0.73       456
weighted avg       0.89      0.89      0.87       456



From the output above, the KNN model reaches an accuracy of 0.8947 (89.47%) on the validation set. The confusion matrix shows that the result is uneven across classes: 383 of the 386 Adult (0) samples are classified correctly, but only 25 of the 70 Senior (1) samples are, with the other 45 predicted as Adult. Precision is 0.89 for both classes, but recall is 0.99 for Adult and only 0.36 for Senior, so the overall accuracy is driven mostly by the majority class.


In [203]:
# Creating the Naive Bayes Model (GaussianNB for continuous features)
nb = GaussianNB()

# Train the model
nb.fit(X_train, Y_train)

# Make predictions
predictions = nb.predict(X_validation)

# Printing the evaluation of the performance of a machine learning model
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions, zero_division=1))

1.0
[[386   0]
 [  0  70]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       386
         1.0       1.00      1.00      1.00        70

    accuracy                           1.00       456
   macro avg       1.00      1.00      1.00       456
weighted avg       1.00      1.00      1.00       456



From the output above, the NB model scores 1.0 (100%) accuracy, with precision, recall, and F1-score of 1.00 for both classes and no misclassifications in the confusion matrix. This should not be read as a flawless model: the feature matrix contains the age_group column that is also the target, so the classifier is effectively given the answer. The score says nothing about how well Adult (0) and Senior (1) can be separated using the remaining attributes.


In [204]:
# Creating the SVM model
svm = SVC()

# Train the model
svm.fit(X_train, Y_train)

# Make predictions
predictions = svm.predict(X_validation)

# Printing the evaluation of the performance of a machine learning model
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions, zero_division=1))

0.8464912280701754
[[386   0]
 [ 70   0]]
              precision    recall  f1-score   support

         0.0       0.85      1.00      0.92       386
         1.0       1.00      0.00      0.00        70

    accuracy                           0.85       456
   macro avg       0.92      0.50      0.46       456
weighted avg       0.87      0.85      0.78       456



From the output above, the SVM model has an accuracy of 0.8465 (84.65%), slightly lower than the KNN model. The confusion matrix shows that it predicts Adult (0) for all 456 validation samples: recall is 1.00 for Adult and 0.00 for Senior (1), and the accuracy simply equals the share of Adult samples (386/456). The precision of 1.00 reported for Senior does not reflect any correct prediction. No sample was predicted as Senior, so that precision is undefined (0/0), and zero_division=1 makes the report print 1.00 in its place.
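
The role of zero_division can be seen by rebuilding the SVM confusion matrix from labels alone. This is a minimal sketch: `y_true` and `y_pred` are reconstructed from the counts printed above (386 Adult, 70 Senior, every prediction equal to 0), not taken from the fitted model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Labels reconstructed from the printed confusion matrix [[386, 0], [70, 0]].
y_true = np.array([0] * 386 + [1] * 70)
y_pred = np.zeros_like(y_true)  # the model never predicts class 1

# Precision for class 1 is 0/0; zero_division chooses what to report instead.
senior_precision_as_one = precision_score(y_true, y_pred, pos_label=1, zero_division=1)
senior_precision_as_zero = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
senior_recall = recall_score(y_true, y_pred, pos_label=1)
accuracy = accuracy_score(y_true, y_pred)

print("Senior precision (zero_division=1):", senior_precision_as_one)
print("Senior precision (zero_division=0):", senior_precision_as_zero)
print("Senior recall:", senior_recall)
print("Accuracy:", accuracy)
```

Reporting with zero_division=0 makes it obvious from the precision column that no Senior sample was ever identified.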


# CONCLUSION
On the validation set, NB reports the highest accuracy of the three models for the National Health and Nutrition Examination Survey 2013-2014 (NHANES) Age Prediction Subset, 1.0 (100%). That score is explained by the age_group target being included among the features, so it cannot be taken as evidence that NB is the best model. KNN (89.47%) and SVM (84.65%) are close in accuracy but behave differently: KNN identifies 25 of the 70 Senior (1) samples, with a precision of 0.89 for both classes, while SVM predicts Adult (0) for every sample, and its 1.00 precision for Senior is an artifact of zero_division=1. A fair comparison would require rebuilding X without the ID and age_group columns and evaluating the models again.