## Task 1: Classification

### (0) Problem and Data Description

##### (0.1) Data Description
The source of this dataset we use (Body performance Data) is from [here](https://www.kaggle.com/kukuroo3/body-performance-data/code) (Kaggle).
It has 12 columns:
Age (20 to 64),
Gender (Female and Male),
height (cm),
weight (kg),
body fat (percent),
diastolic blood pressure (min),
systolic blood pressure (min),
grip force,
sit and bend forward (cm),
sit-ups (counts),
broad jump (cm),
class (A to D) 

##### (0.2) Problem description
This dataset reflects people's health levels ranging from A class (best) to D class, in relation to some physical data (e.g., age, diastolic and weight). Therefore, SVM, Randomforest, and XGBoost are introduced to fit this dataset and then predict the health levels given data of physical performance.
<br/>
<br/> SVM, Randomforest, and XGBoost are all from sklearn package, while GridSearchCV and RandomizedSearchCV are applied to tune the (hyper)parameters. In addition, this dataset contains two columns in which the elements are the String type. Accordingly, we do some conversion and modification on it (See the Data Preprocessing section) before it is taken to train.
<br/>
<br/> The results of the performance of these three models are shown as classification reports which chiefly include four measures, i.e., f1-score, recall, precision, and accuracy.



##### (0.3) Data Preprocessing

In [14]:
# The class of health levels are converted from ABCD to 3210.
# To make the gender feature numeric, male and female are changed to 1 and 0, respectively.
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
dataset = pd.read_csv('bodyPerformance.csv')
dataset.gender = [1 if gd == 'M' else 0 for gd in dataset.gender]
Y = dataset['class'].values
Y = np.array([ord('a')-ord(cls)-29 for cls in Y]).astype('int')
X = dataset.drop(["class"], axis=1)
X = (X - np.min(X)) / (np.max(X) - np.min(X)).values # normalization
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=132)
target_names = ['D', 'C', 'B', 'A']
print('-----Exploratory Data Analysis-----')
print('(1) Dataset structure: ')
print(dataset.head(2))
print()
print('(2) Shape of dataset (includes the label col): ')
print(dataset.shape)
print()
print('(3) Shape of train and test sets: ')
print('Shape of X_train:', X_train.shape)
print('Shape of Y_train:', Y_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of Y_test:', Y_test.shape)
print()
print('(4) Columns: ')
print(dataset.columns)
print()
print('(5) Length of data: ', len(dataset))
# print()
# print('(6) Data Information: ', dataset.info())


-----Exploratory Data Analysis-----
(1) Dataset structure: 
    age  gender  height_cm  weight_kg  body fat_%  diastolic  systolic  \
0  27.0       1      172.3      75.24        21.3       80.0     130.0   
1  25.0       1      165.0      55.80        15.7       77.0     126.0   

   gripForce  sit and bend forward_cm  sit-ups counts  broad jump_cm class  
0       54.9                     18.4            60.0          217.0     C  
1       36.4                     16.3            53.0          229.0     A  

(2) Shape of dataset (includes the label col): 
(13393, 12)

(3) Shape of train and test sets: 
Shape of X_train: (9375, 11)
Shape of Y_train: (9375,)
Shape of X_test: (4018, 11)
Shape of Y_test: (4018,)

(4) Columns: 
Index(['age', 'gender', 'height_cm', 'weight_kg', 'body fat_%', 'diastolic',
       'systolic', 'gripForce', 'sit and bend forward_cm', 'sit-ups counts',
       'broad jump_cm', 'class'],
      dtype='object')

(5) Length of data:  13393


### (1) SVM

##### (1.1) Parameter tuning for SVM using GridSearchCV

In [34]:
from sklearn import svm
svm_clf = svm.SVC(decision_function_shape='ovo', random_state=333)

In [35]:
svm_hyperparams  = {
        'C': [0.1, 1, 10, 100, 1000],
        'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
        }

svm_randCV = GridSearchCV(svm_clf, svm_hyperparams, cv = 3)
svm_randCV.fit(X, Y)
svm_randCV.best_estimator_

SVC(C=1000, decision_function_shape='ovo', gamma=1, random_state=333)

##### (1.2) Results of SVM

In [36]:
from sklearn.metrics import accuracy_score

svm_clf = svm.SVC(C=1000, decision_function_shape='ovo', gamma=1, random_state=333)
svm_clf.fit(X_train, Y_train)
svm_Y_predicted = svm_clf.predict(X_test)
svm_acc = accuracy_score(Y_test, svm_Y_predicted)
print("Accuracy of SVM on bodyperformance dataset: %.2f%%" % (svm_acc * 100.0))
print(classification_report(Y_test, svm_Y_predicted, target_names=target_names))


Accuracy of SVM on bodyperformance dataset: 70.03%
              precision    recall  f1-score   support

           D       0.88      0.78      0.83      1018
           C       0.66      0.63      0.65       985
           B       0.55      0.59      0.57       967
           A       0.73      0.79      0.76      1048

    accuracy                           0.70      4018
   macro avg       0.70      0.70      0.70      4018
weighted avg       0.71      0.70      0.70      4018



### (2) RandomForest

##### (2.1) Parameter tuning for RandomForest using RandomizedSearchCV

In [27]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(random_state=333)

In [28]:

rf_hyperparams = {"n_estimators": [20, 50, 100],
                "max_depth": [5, 20, 40, 60, 100],
                "max_features": ['auto', 'sqrt'] ,
                "min_samples_split": [2, 4, 6, 8],
                "min_samples_leaf": [1, 2, 3],
                "bootstrap": [True, False],
                "criterion": ["gini", "entropy"]}
rf_randCV = RandomizedSearchCV(rf_clf, rf_hyperparams, n_iter = 30)
rf_randCV.fit(X, Y)
rf_randCV.best_estimator_

RandomForestClassifier(criterion='entropy', max_depth=20, max_features='sqrt',
                       min_samples_leaf=2, random_state=333)

##### (2.2) Results of RandomForest

In [29]:
from sklearn.metrics import accuracy_score
rf_clf = RandomForestClassifier(criterion='entropy', max_depth=20, max_features='sqrt',
                       min_samples_leaf=2, random_state=333)
rf_clf.fit(X_train, Y_train)
rf_Y_predicted = rf_clf.predict(X_test)
rf_acc = accuracy_score(Y_test, rf_Y_predicted)
print("Accuracy of RandomForest on bodyperformance dataset: %.2f%%" % (rf_acc * 100.0))
print(classification_report(Y_test, rf_Y_predicted, target_names=target_names))

Accuracy of RandomForest on bodyperformance dataset: 73.32%
              precision    recall  f1-score   support

           D       0.90      0.81      0.85      1018
           C       0.73      0.67      0.70       985
           B       0.59      0.61      0.60       967
           A       0.72      0.83      0.77      1048

    accuracy                           0.73      4018
   macro avg       0.74      0.73      0.73      4018
weighted avg       0.74      0.73      0.73      4018



### (3) XGBoost

##### (3.1) Parameter tuning for XGBoost using RandomizedSearchCV

In [30]:
import xgboost as xgb
xgb_clf = xgb.XGBClassifier(random_state=333, objective='softprob', eval_metric='mlogloss', use_label_encoder =False)

In [31]:
xgb_hyperparams  = {
        'n_estimators': range(0, 220, 20),
        'max_depth': range(3, 10, 1),
        'learning_rate': np.linspace(0.01, 0.2, 20),
        'subsample': np.linspace(0.5, 1, 5),
        'colsample_bytree': np.linspace(0.5, 1, 5),
        'gamma': [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
        }

xgb_randCV = RandomizedSearchCV(xgb_clf, xgb_hyperparams, random_state=333, cv = 3, scoring = 'neg_log_loss', n_iter = 30)
xgb_randCV.fit(X, Y)
xgb_randCV.best_estimator_



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.875,
              enable_categorical=False, eval_metric='mlogloss', gamma=0.4,
              gpu_id=-1, importance_type=None, interaction_constraints='',
              learning_rate=0.19, max_delta_step=0, max_depth=8,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=60, n_jobs=16, num_parallel_tree=1,
              objective='multi:softprob', predictor='auto', random_state=333,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=0.875,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, ...)

##### (3.2) Results of XGBoost

In [33]:
from sklearn.metrics import accuracy_score
xgb_clf = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.875,
              enable_categorical=False, eval_metric='mlogloss', gamma=0.4,
              gpu_id=-1, importance_type=None, interaction_constraints='',
              learning_rate=0.19, max_delta_step=0, max_depth=8,
              min_child_weight=1, monotone_constraints='()',
              n_estimators=60, n_jobs=16, num_parallel_tree=1,
              objective='multi:softprob', predictor='auto', random_state=333,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=0.875,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1)
xgb_clf.fit(X_train, Y_train)
xgb_Y_predicted = xgb_clf.predict(X_test)
acc = accuracy_score(Y_test, xgb_Y_predicted)
print("Accuracy of XGBoost on bodyperformance dataset: %.2f%%" % (acc * 100.0))
print(classification_report(Y_test, xgb_Y_predicted, target_names=target_names))

Accuracy of XGBoost on bodyperformance dataset: 74.64%
              precision    recall  f1-score   support

           D       0.92      0.83      0.87      1018
           C       0.74      0.67      0.71       985
           B       0.60      0.63      0.61       967
           A       0.74      0.84      0.79      1048

    accuracy                           0.75      4018
   macro avg       0.75      0.74      0.75      4018
weighted avg       0.75      0.75      0.75      4018



### (4) Conclusion

We found from the results of the classification reports that the performance of the Randomforest and XGBoost models are similar (73.32% and 74.64%, respectively) on this dataset for the multi-classification tasks. XGBoost, however, has better results with an accuracy of around 75% over the other two models, while the accuracy on prediction for SVM is about 70%. Additionally, all three models are good at predicting the health class 'D' but falling down on predicting the health class 'B'.
<br/>
<br/> 
These three models all show the ability to deal with multi-classification tasks but are still not ideal. Perhaps using a Convolutional neural network or its variants (e.g. ResNet) is capable of substantially improving the results.