# Assignment 8: Ensemble-Based Classifier
### DSC-540
### David Bui

# Dataset
Description: This dataset contains info in the matters of body movement during several activities. The goal of this learning model is to implement the ensemble method with 4 models and follow the instructions written within this article. https://journals.lww.com/acsm-msse/Fulltext/2017/09000/Ensemble_Methods_for_Classification_of_Physical.24.aspx

Source: UCI Machine Learning Repository
Link: http://archive.ics.uci.edu/ml/datasets/pamap2+physical+activity+monitoring


In [158]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB # For the Behavior Knowledge Space
from sklearn.naive_bayes import GaussianNB # Naive Bayes
from sklearn.metrics import accuracy_score
import itertools
import pickle

# Initializing the Dataset 
There 9 dat files which represent the 9 test subjects. It is composed of 8 males and 1 female. These dat files are combined into a single text file and then placed into a dataframe. The data goes through several layers of filtering and feature selection before handing the newly formed data over to the learning models.

In [3]:
import os # for the file merging
path = '..\OneDrive\Desktop\GCU Studies\DSC-540\Topic 8\Assignment\Data'

file_list = os.listdir(path)
for filename in sorted(file_list):
    out_filename = 'pamap.txt'
    with open(out_filename, 'a') as outfile:
        with open(path + '/' + filename, 'r') as infile:
            outfile.write(infile.read())

In [159]:
cols = ['timestamp', 'activityID', 'heartrate', 'x-axis', 'y-axis', 'z-axis', 'w1', 'w2', 'w3'
        ,'w4', 'w5', 'w6', 'w7', 'w8', 'w9', 'w10', 'w11', 'w12', 'w13', 'w14']

df = pd.read_csv('pamap.txt', ' ', header=None)
df.drop(df.iloc[:, 20:], axis = 1, inplace=True)
df.columns=cols
df.head()

Unnamed: 0,timestamp,activityID,heartrate,x-axis,y-axis,z-axis,w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12,w13,w14
0,8.38,0,104.0,30.0,2.37223,8.60074,3.51048,2.43954,8.76165,3.35465,-0.092217,0.056812,-0.015845,14.6806,-69.2128,-5.58905,1.0,0.0,0.0,0.0
1,8.39,0,,30.0,2.18837,8.5656,3.66179,2.39494,8.55081,3.64207,-0.024413,0.047759,0.006474,14.8991,-69.2224,-5.82311,1.0,0.0,0.0,0.0
2,8.4,0,,30.0,2.37357,8.60107,3.54898,2.30514,8.53644,3.7328,-0.057976,0.032574,-0.006988,14.242,-69.5197,-5.12442,1.0,0.0,0.0,0.0
3,8.41,0,,30.0,2.07473,8.52853,3.66021,2.33528,8.53622,3.73277,-0.002352,0.03281,-0.003747,14.8908,-69.5439,-6.17367,1.0,0.0,0.0,0.0
4,8.42,0,,30.0,2.22936,8.83122,3.7,2.23055,8.59741,3.76295,0.012269,0.018305,-0.053325,15.5612,-68.8196,-6.28927,1.0,0.0,0.0,0.0


### Preprocessing:
Under ActivityID the '0' element was removed because it represented 'other' which gives little to no value for interpretation or distinction. The 'other' activity is also quite significant in size, being nearly 4 times the size of the 2nd largest activity. WIthin the article they focus on hand and wrist movement, which allows reason for a feature dimension reduction. Lastly, null elements can be found, which attribute to roughly 0.0057% of the data. Due to the size of the dataframe imputation would be quiet lenthy, for this reason they are simply removed.

In [160]:
# filling missing data
def clean_data(df):
    # removing data with transient activity
    df = df.drop(df[df['activityID']==0].index)
    # remove non-numeric data cells
    df = df.apply(pd.to_numeric, errors = 'coerce')
    # fill in NaN values using iterpolation which is what they used in the article.
    df = df.interpolate()
    return df
  
df = clean_data(df)
df.reset_index(drop=True,inplace=True)
df.drop('heartrate', inplace=True, axis=1)
print('Number of NaN elements: ',df.isnull().sum().sum())

Number of NaN elements:  0


In [161]:
# WIthin the article 'other' which is 0 was removed from the study. It also would of created an imbalance
# in the data which would effect the sampling methods later on. Heartrate is also dropped since they
# are focusing on 3D acceleration only.
df.drop(df[df['activityID'] == 0].index, inplace=True)
df = df[['timestamp', 'activityID', 'x-axis', 'y-axis', 'z-axis']] # Dataset 1
df = df.loc[df['activityID'] < 9]
df.describe()

Unnamed: 0,timestamp,activityID,x-axis,y-axis,z-axis
count,1257309.0,1257309.0,1257309.0,1257309.0,1257309.0
mean,1791.47,3.883755,32.30984,-4.515079,3.534577
std,1244.218,2.013225,1.786229,6.788416,7.683087
min,37.66,1.0,27.5,-145.367,-104.301
25%,535.75,2.0,30.9375,-9.02574,0.873487
50%,2289.19,4.0,32.6875,-5.18562,3.33814
75%,2908.59,6.0,33.6875,0.0980879,6.28399
max,4007.81,7.0,35.25,62.8596,155.699


### Feature Extraction

In [162]:
# Slicing of 10 second windows with 50% overlapping
window = df['timestamp'].random()
indexNames = df[(df['timestamp'] <= window-5) | (df['timestamp'] >= window+5)].index
#indexNames = df[(df['timestamp'] <= df['timestamp'].min()-5) | (df['timestamp'] >= df['timestamp'].max()+5)].index

# Here the dataframe is restricted into the select window slice.
df.drop(indexNames , inplace=True)

# Reset the index after doing all this
df.reset_index(inplace=True)
df.drop('index', inplace=True, axis=1)

(1257309, 5)

In [163]:
# Here I am not implementing the 25 and 75 percentiles, this is due to me reachings 45 features already,
# which is the exact number used within the article and I'm trying to get as close to them as possible.
sample = df.sample(n=5000)
sample.reset_index(inplace=True)
sample.drop('index', inplace=True, axis=1)
t = sample.sample(n=10)

series = t.agg(['sum','mean','var','std','skew','kurt','median','min','max']).unstack()

new_df = pd.concat([sample,series.set_axis([f'{x}_{y}'
                                for x, y in series.index])
                                  .to_frame().T], axis=1)
    

In [164]:
for i in range(0, len(sample)):
    t = sample.sample(n=10)
    series = t.agg(['sum','mean','var','std','skew','kurt','median','min','max']).unstack()
    
    #new_df.loc[i, :] = series.set_axis([f'{x}_{y}' for x, y in series.index])

In [85]:
# In the process of feature extraction, the original data was erased for some reason
new_df['timestamp'] = sample['timestamp']
new_df['activityID'] = sample['activityID']
new_df['x-axis'] = sample['x-axis']
new_df['y-axis'] = sample['y-axis']
new_df['z-axis'] = sample['z-axis']
new_df.head(5)

Unnamed: 0,timestamp,activityID,x-axis,y-axis,z-axis,timestamp_mean,timestamp_var,timestamp_std,timestamp_skew,timestamp_kurt,...,y-axis_min,y-axis_max,z-axis_mean,z-axis_var,z-axis_std,z-axis_skew,z-axis_kurt,z-axis_median,z-axis_min,z-axis_max
0,843.83,3,34.25,-8.41404,4.86328,2160.733,1391820.0,1179.754255,-0.683112,-1.076137,...,-14.669,6.7711,3.955801,18.754033,4.330593,0.008089,0.209104,3.385215,-3.87675,11.0376
1,720.36,3,33.25,-9.11462,2.97958,1818.26,1726689.0,1314.035502,-0.133315,-1.86633,...,-15.2855,5.15601,5.05912,23.70582,4.868862,0.811003,-0.350633,3.827275,-0.649272,13.2915
2,133.88,1,33.625,6.76227,2.89743,2108.648,2172930.0,1474.086143,-0.37284,-1.769826,...,-13.5641,12.4191,7.476966,31.410942,5.604547,0.750276,-1.243794,5.326455,0.925382,15.8905
3,607.92,3,34.0,-9.32691,2.75772,1954.612,1544999.0,1242.980012,-0.204804,-1.930438,...,-12.2669,12.3184,2.758507,19.977624,4.469634,-0.527023,1.913948,2.881975,-6.67623,9.96933
4,744.49,3,33.125,2.29445,7.96671,1798.173,1703125.0,1305.038129,-0.369391,-1.923656,...,-10.6631,6.88393,5.550876,34.80841,5.899865,2.021176,5.310004,4.50556,-1.06283,20.5761


### Normalization and Feature Selection

In [None]:
# The article says it uses correlation to select which features to use in the models. Here x-axis and
# y-axis are the only ones left. The article gets jumbled up between depicting several datasets, so I might
# of cross the wires somewhere. Since its focusing on 3d movement, I'm leaving z-axis in, I just want to 
# make note of this to show that it is not being ignored.
corr = new_df.corr().abs()
corr[corr['activityID']>0.25].index


In [86]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#se = StandardScaler()
se = MinMaxScaler(feature_range = (0, 1))
X = new_df.drop('activityID', axis=1)
X = se.fit_transform(X)
y = new_df['activityID']
new_df.to_csv('out.csv')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

In [100]:
new_df.isnull().values.any()

False

# Bagging Classifier

In [None]:
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(random_state=42, max_depth=20),
    n_estimators = 100, # Number of base estimators in the esemble.
    max_samples=0.05, # Percent of training data taken
    oob_score=True, # Replacement implemented
    random_state=42
)
bag_clf.fit(X_train,y_train)
bag_pred = dt_clf.predict(X_test)

# Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
boost_clf = BaggingClassifier(
    base_estimator=GradientBoostingClassifier(random_state=42, max_depth=20),
    n_estimators = 100, # Number of base estimators in the esemble.
    max_samples=0.05, # Percent of training data taken
    oob_score=True, # Replacement implemented
    random_state=42
)
boost_clf.fit(X_train,y_train)
boost_pred = dt_clf.predict(X_test)

# Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
bag_clf = BaggingClassifier(
    base_estimator=RandomForestCLassifier(random_state=42, max_depth=20),
    n_estimators = 100, # Number of base estimators in the esemble.
    max_samples=0.05, # Percent of training data taken
    oob_score=True, # Replacement implemented
    random_state=42
)
bag_clf.fit(X_train,y_train)
bag_pred = dt_clf.predict(X_test)

# Weighted Majority Voting Model

### Decision Tree

In [105]:
# The researchers used a decision tree with a depth of 20
from sklearn.tree import DecisionTreeClassifier

dt_clf = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(random_state=42, max_depth=20),
    n_estimators = 100, # Number of base estimators in the esemble.
    max_samples=0.05, # Percent of training data taken
    oob_score=True, # Replacement implemented
    random_state=42
)
dt_clf.fit(X_train,y_train)
dt_pred = dt_clf.predict(X_test)

### K Nearest Neighbor

In [106]:
from sklearn.neighbors import KNeighborsClassifier

# Within the article they use 7 neighbors.
k_clf = BaggingClassifier(
    base_estimator=KNeighborsClassifier(n_neighbors=7),
    n_estimators = 100, # Number of base estimators in the esemble.
    max_samples=0.05, # Percent of training data taken
    oob_score=True, # Replacement implemented
    random_state=42
)
k_clf.fit(X_train,y_train)
k_pred = k_clf.predict(X_test)

### Support Vector Machine

In [107]:
from sklearn.svm import SVC # Support vector machine
# Implement a 'one-vs-rest' type SVM
# This is definately the most time expensive to implement.
svm_clf = BaggingClassifier(
    base_estimator=SVC(decision_function_shape='ovr', probability=True, kernel='linear'),
    n_estimators = 100, # Number of base estimators in the esemble.
    max_samples=0.05, # Percent of training data taken
    oob_score=True, # Replacement implemented
    random_state=42
)
svm_clf.fit(X_train,y_train)
svm_pred = svm_clf.predict(X_train)

### Artificial Neural Network

In [108]:
from sklearn.neural_network import MLPClassifier # ANN
# 50 neurons in the hidden layer
# linear activation 'ReLu'
# learning rate = 0.001
# 100 epochs
ann_clf = BaggingClassifier(
    base_estimator=MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, random_state=42),
    n_estimators = 100, # Number of base estimators in the esemble.
    max_samples=0.05, # Percent of training data taken
    oob_score=True, # Replacement implemented
    random_state=42
)
ann_clf.fit(X_train,y_train)
ann_pred = ann_clf.predict(X_test)



### Weighted Majority Voting

In [95]:
from sklearn.ensemble import VotingClassifier 
wmv_clf = VotingClassifier(estimators=[
     ('BDT', dt_clf)
    ,('knn', k_clf)
    ,('svm',svm_clf)
    ,('ann', ann_clf)], voting='soft', n_jobs=-1)

wmv_clf.fit(X_train,y_train)

VotingClassifier(estimators=[('BDT',
                              BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=20,
                                                                                      random_state=42),
                                                max_samples=0.05,
                                                n_estimators=100,
                                                oob_score=True,
                                                random_state=42)),
                             ('knn',
                              BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=7),
                                                max_samples=0.05,
                                                n_estimators=100,
                                                oob_score=True,
                                                random_state=42)),
                             ('svm',
                              BaggingClassifier(base_estima

### NB combination 


In [139]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

nb_clf = MultinomialNB()
nb_clf.fit(X_train,y_train)

nb_pred = nb_clf.predict(X_test)

acc_score = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(
        y_test, y_pred)

print(acc_score)
print(conf_mat)


#wmv_pred = wmv_clf.predict(X_test)
#confusion_matrix(wmv_pred, y_test)

ValueError: Negative values in data passed to MultinomialNB (input X)

In [145]:
wmv_pred = wmv_clf.predict(X_test)
confusion_matrix(wmv_pred, y_test)

array([[315,   1,   2,   0,   0,   0,   0],
       [  4, 291,  17,   0,   0,   0,   0],
       [  5,  19, 335,   0,   0,   0,   0],
       [  0,   0,   0, 353,   5,  10,   8],
       [  0,   0,   0,   0, 101,   0,   0],
       [  0,   0,   0,  13,  31, 232,  11],
       [  0,   0,   0,   5,  19,   7, 216]], dtype=int64)

# Evaluation and Results
There is an analysis of the precision between Weighted Majority Vote and the other models. In the article they did the exact same, difference being they added f1_scoring with a cross validation with a LeaveOneOut splitting strategy.

In [97]:
wmv_clf.score(X_test, y_test)

0.9215

In [141]:

print('Ann:', {wmv_clf.named_estimators_['ann'].score(X_test, y_test)})
print('BDT:', {wmv_clf.named_estimators_['BDT'].score(X_test, y_test)})
print('SVM:', {wmv_clf.named_estimators_['svm'].score(X_test, y_test)})
print('knn:', {wmv_clf.named_estimators_['knn'].score(X_test, y_test)})

Ann: {0.112}
BDT: {0.0205}
SVM: {0.1395}
knn: {0.153}


In [133]:
# (LOSO) Cross-validation
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

scoring = {'accuracy' : make_scorer(accuracy_score), 
           'precision' : make_scorer(precision_score),
           'recall' : make_scorer(recall_score), 
           'f1_score' : make_scorer(f1_score)}

#kfold = model_selection.KFold(n_splits=10, random_state=42)
model=MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, random_state=42)

results = ann_clf.cross_val_score(estimator=model,
                                          X=features,
                                          y=labels,
                                          cv=LeaveOneOut(),#(LOSO)
                                          scoring=scoring)

AttributeError: 'BaggingClassifier' object has no attribute 'cross_val_score'