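# **Motion Classification from Smartphone Sensor Data**
# Step 2: Basic Modeling


## 0. Mission

* Data preprocessing
    * Perform the necessary preprocessing: dummy encoding, data splitting, NaN checks and handling, scaling, and so on
* Build classification models with a variety of algorithms
    * Apply at least four algorithms
    * Compare performance
    * Create a separate DataFrame that stores each model's performance and compare the models
* Optional: the following items are optional. Do them as time allows.
    * Select the top N features, then model and compare performance
        * Modeling does not always need every feature.
        * Select the top N features by feature importance, build a model, and compare its performance with the other models.
        * To choose N, add features one at a time, fit and validate a model at each step, and find a suitable cutoff.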
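
One way to keep each model's score in a separate DataFrame, as the mission asks, is sketched below. The helper name `add_result` and the `records` list are illustrative assumptions; the example scores are the accuracies reported later in this notebook, rounded to four decimals.

```python
import pandas as pd

records = []  # one dict per evaluated model (hypothetical container)

def add_result(model_name, accuracy):
    """Hypothetical helper: remember the score of one model."""
    records.append({"model": model_name, "accuracy": accuracy})

add_result("RandomForest (all features)", 0.9788)
add_result("RandomForest (tuned)", 0.9822)
add_result("mljar AutoML", 0.9941)

# Build the comparison table once all models have been scored
results = (
    pd.DataFrame(records)
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

Collecting plain dicts and building the DataFrame at the end avoids repeatedly concatenating onto an empty frame.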

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Environment Setup

### (1) Import libraries

* Detailed requirements
    - Code that imports the basic libraries is already provided.
    - Add any other libraries you consider necessary.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Add any other libraries you consider necessary.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

* Define helper functions

In [None]:
# Rank features by importance and optionally plot them
def plot_feature_importance(importance, names, result_only = False, topn = 'all'):
    feature_importance = np.array(importance)
    feature_name = np.array(names)

    data={'feature_name':feature_name,'feature_importance':feature_importance}
    fi_temp = pd.DataFrame(data)

    # Sort features by importance, highest first
    fi_temp.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_temp.reset_index(drop=True, inplace = True)

    # Keep every feature, or only the top `topn`
    if topn == 'all' :
        fi_df = fi_temp.copy()
    else :
        fi_df = fi_temp.iloc[:topn]

    # Draw the importances as a horizontal bar chart
    if not result_only :
        plt.figure(figsize=(10,20))
        sns.barplot(x='feature_importance', y='feature_name', data = fi_df)

        plt.xlabel('importance')
        plt.ylabel('feature name')
        plt.grid()

    return fi_df

### (2) Load the data

* Provided dataset
    * data01_train.csv: for training and validation
* Detailed requirements
    - Load the full dataset 'data01_train.csv' and store it as 'data'.
        - Drop the variable subject from data.
    - Check basic information about the DataFrame (.head(), .shape, etc.).

#### 1) Data loading

In [None]:
data = pd.read_csv("/content/drive/MyDrive/딥러닝/5차 미프/data01_train.csv")

In [None]:
data.drop("subject", axis = 1, inplace = True)

In [None]:
data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity
0,0.288508,-0.009196,-0.103362,-0.988986,-0.962797,-0.967422,-0.989,-0.962596,-0.96565,-0.929747,...,-0.487737,-0.816696,-0.042494,-0.044218,0.307873,0.07279,-0.60112,0.331298,0.165163,STANDING
1,0.265757,-0.016576,-0.098163,-0.989551,-0.994636,-0.987435,-0.990189,-0.99387,-0.987558,-0.937337,...,-0.23782,-0.693515,-0.062899,0.388459,-0.765014,0.771524,0.345205,-0.769186,-0.147944,LAYING
2,0.278709,-0.014511,-0.108717,-0.99772,-0.981088,-0.994008,-0.997934,-0.982187,-0.995017,-0.942584,...,-0.535287,-0.829311,0.000265,-0.525022,-0.891875,0.021528,-0.833564,0.202434,-0.032755,STANDING
3,0.289795,-0.035536,-0.150354,-0.231727,-0.006412,-0.338117,-0.273557,0.014245,-0.347916,0.008288,...,-0.004012,-0.408956,-0.255125,0.612804,0.747381,-0.072944,-0.695819,0.287154,0.111388,WALKING
4,0.394807,0.034098,0.091229,0.088489,-0.106636,-0.388502,-0.010469,-0.10968,-0.346372,0.584131,...,-0.157832,-0.563437,-0.044344,-0.845268,-0.97465,-0.887846,-0.705029,0.264952,0.137758,WALKING_DOWNSTAIRS


In [None]:
data.shape

(5881, 562)

In [None]:
data.info(verbose = True)

#### 2) Basic information

In [None]:
sum(data.isnull().sum())

0

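## **2. Data Preprocessing**

* Perform the necessary preprocessing: dummy encoding, data splitting, NaN checks and handling, scaling, and so on.


### (1) Data split 1: x, y

* Detailed requirements
    - Split the data into x and y.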

In [None]:
# Separate the features from the target
x = data.drop("Activity", axis = 1)
y = data.loc[:, "Activity"]
# Hold out 20% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

### (2) Scaling (if needed)


* Detailed requirements
    - Run this code when using algorithms that require scaling.
    - Use either min-max scaling or standard scaling.

In [None]:
# scaler = StandardScaler()
# x_train_sc = scaler.fit_transform(x_train)
# x_test_sc = scaler.transform(x_test)
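
The scaling cell is left commented out in this notebook; tree-based models such as random forests do not depend on feature scale. For algorithms that do, a minimal sketch of the usual pattern follows: fit the scaler on the training rows only, then reuse its statistics on the held-out rows. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the sensor data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 4 features
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)  # learn mean and std from the training rows only
X_va_sc = scaler.transform(X_va)      # apply the same statistics to the held-out rows

print(X_tr_sc.mean(axis=0).round(3))  # per-feature mean of the scaled training data
print(X_tr_sc.std(axis=0).round(3))   # per-feature std of the scaled training data
```

Swapping `StandardScaler` for `MinMaxScaler` from the same module gives the min-max variant with the same fit/transform calls.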

### (3) Data split 2: train, validation

* Detailed requirements
    - train : val = 8 : 2 or 7 : 3
    - Set the random_state option so that results are reproducible and can be compared with other models.

In [None]:
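# Same arguments as the earlier split (x, y, test_size = 0.2, random_state = 42),
# so x_val / y_val contain exactly the same rows as x_test / y_test.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size = 0.2, random_state = 42)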
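
If a validation set that is distinct from the test set is wanted, one common approach is to split twice: first set the test rows aside, then split the remainder into train and validation. The sketch below uses synthetic arrays and the `stratify` option to keep class proportions similar across the parts; the `_d` variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 5 features, 6 classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 6, size=1000)

# Step 1: set aside 20% of all rows as the test set
X_rest, X_test_d, y_rest, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Step 2: split the remaining rows 8 : 2 into train and validation
X_train_d, X_val_d, y_train_d, y_val_d = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42, stratify=y_rest
)

print(len(X_train_d), len(X_val_d), len(X_test_d))
```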

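## **3. Basic Modeling**



* Detailed requirements
    - Apply at least four algorithms.
    - For each algorithm, build one model with all features and one with the top N selected features, then compare their performance.
    - (Optional) For one or two of the algorithms, select the top N features by feature importance, build a model, and compare its performance with the other models.
        * To choose N, add features one at a time, fit and validate a model at each step, and find a suitable cutoff.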
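
The "add features one at a time" search described above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's own procedure; the ranking model, the tree counts, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 20 features, 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_redundant=2,
    n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Rank the features once with a forest trained on all of them
ranker = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
order = np.argsort(ranker.feature_importances_)[::-1]  # most important first

# Grow the feature set one feature at a time and score each size on validation data
scores = []
for n in range(1, len(order) + 1):
    cols = order[:n]
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_tr[:, cols], y_tr)
    scores.append(accuracy_score(y_va, clf.predict(X_va[:, cols])))

best_n = int(np.argmax(scores)) + 1  # smallest N that reaches the best score
print(best_n, scores[best_n - 1])
```

Plotting `scores` against N usually makes the cutoff easier to judge than the single best value, since accuracy often flattens well before the maximum.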

### (1) Algorithm 1: AutoML

In [None]:
!pip install mljar-supervised

In [None]:
from supervised.automl import AutoML

automl = AutoML(mode = "Perform")
automl.fit(x_train, y_train)

AutoML directory: AutoML_1
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['Linear', 'Random Forest', 'LightGBM', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble']
* Step simple_algorithms will try to check up to 1 model
1_Linear logloss 0.048747 trained in 65.53 seconds (1-sample predict time 0.1359 seconds)
* Step default_algorithms will try to check up to 3 models
2_Default_LightGBM logloss 0.037312 trained in 764.67 seconds (1-sample predict time 0.1479 seconds)
3_Default_NeuralNetwork logloss 0.107223 trained in 43.32 seconds (1-sample predict time 0.1718 seconds)
4_Default_RandomForest logloss 0.313477 trained in 114.85 seconds (1-sample predict time 0.1742 seconds)
* Step not_so_random will try to check up to 12 models
5_LightGBM loglos

In [None]:
!zip -r AutoML_1.zip /content/AutoML_1

In [None]:
predictions = automl.predict_all(x_test)
print("Accuracy:", accuracy_score(y_test, predictions["label"]))

Accuracy: 0.994052676295667


### (2) Algorithm 2: RandomForest

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth = 20, random_state = 42)
model.fit(x_train, y_train)

In [None]:
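y_pred = model.predict(x_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Note: classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns below are swapped and support counts predicted labels.
print(classification_report(y_pred, y_test))

Accuracy: 0.9787595581988106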
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.97      0.96      0.97       204
          STANDING       0.96      0.98      0.97       222
           WALKING       0.98      0.98      0.98       197
WALKING_DOWNSTAIRS       0.97      0.97      0.97       145
  WALKING_UPSTAIRS       0.98      0.98      0.98       178

          accuracy                           0.98      1177
         macro avg       0.98      0.98      0.98      1177
      weighted avg       0.98      0.98      0.98      1177
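
The report above was produced with the arguments in the order `(y_pred, y_test)`, while scikit-learn's metric functions take `(y_true, y_pred)`. The toy example below, with made-up labels, illustrates the effect: per-class precision computed in the documented order equals per-class recall computed with the arguments swapped.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only
y_true_demo = ["A", "A", "A", "B", "B", "C"]
y_pred_demo = ["A", "A", "B", "B", "C", "C"]
labels = ["A", "B", "C"]

# Documented order: (y_true, y_pred)
prec_correct = precision_score(y_true_demo, y_pred_demo, labels=labels, average=None)

# Swapped order: what is labelled "recall" is really the precision
rec_swapped = recall_score(y_pred_demo, y_true_demo, labels=labels, average=None)

print(prec_correct)
print(rec_swapped)
```

Accuracy and F1 are symmetric in the two arguments, which is why the swap does not change those values.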



In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' means the same as 'sqrt' for classifiers and is deprecated in newer scikit-learn releases
    'max_features': ['auto', 'sqrt', 'log2']
}

# Use a separate name for the search estimator so that `model` keeps the forest fitted earlier;
# GridSearchCV fits clones and leaves the estimator passed to it unfitted.
rf_base = RandomForestClassifier()
grid = GridSearchCV(estimator = rf_base, param_grid = param_grid, cv = 3, verbose = 2, n_jobs = -1)
grid.fit(x_train, y_train)

best_rf = grid.best_estimator_
y_pred = best_rf.predict(x_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Fitting 3 folds for each of 243 candidates, totalling 729 fits

Accuracy: 0.9821580288870009
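
`GridSearchCV` fits clones of the estimator it is given, so the fitted model is available as `best_estimator_` while the object passed in stays unfitted. The small sketch below shows this on synthetic data with a deliberately tiny grid; the grid values and variable names are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

base = RandomForestClassifier(random_state=42)
small_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(base, small_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)            # best combination found by cross-validation
print(round(search.best_score_, 4))   # mean cross-validated accuracy of that combination

print(hasattr(base, "estimators_"))                    # False: only clones were fitted
print(hasattr(search.best_estimator_, "estimators_"))  # True: refit on all of X_demo
```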


### (3) Algorithm 3: Top 25 features by importance

In [None]:
# Feature importances of the random forest
importances = model.feature_importances_

# Feature indices sorted by importance, highest first
select_index = np.argsort(importances)[::-1]

top_25 = select_index[:25]

x_train_selected = x_train.iloc[:, top_25]
x_test_selected = x_test.iloc[:, top_25]

model_selected = RandomForestClassifier(max_depth = 20, random_state = 42)
model_selected.fit(x_train_selected, y_train)

y_pred_selected = model_selected.predict(x_test_selected)
print("Accuracy:", accuracy_score(y_test, y_pred_selected))
# Note: y_pred still holds predictions from an earlier cell, not y_pred_selected,
# so the report below does not describe the 25-feature model.
print(classification_report(y_pred, y_test))

Accuracy: 0.9762107051826678
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.97      0.96      0.97       204
          STANDING       0.96      0.98      0.97       222
           WALKING       0.98      0.98      0.98       197
WALKING_DOWNSTAIRS       0.97      0.97      0.97       145
  WALKING_UPSTAIRS       0.98      0.98      0.98       178

          accuracy                           0.98      1177
         macro avg       0.98      0.98      0.98      1177
      weighted avg       0.98      0.98      0.98      1177



### (4) Algorithm 4: Features selected in Exercise 1 (importance from all perspectives)

In [None]:
lst = ["tGravityAcc-min()-Y", "angle(Y,gravityMean)", "tGravityAcc-max()-Y",
"tGravityAcc-mean()-Y", "tGravityAcc-energy()-Y", "tGravityAcc-max()-Z",
"tGravityAcc-max()-X", "tGravityAcc-mean()-X", "angle(X,gravityMean)",
"angle(Z,gravityMean)", "fBodyAcc-max()-X", "tGravityAcc-min()-X",
"tGravityAcc-energy()-X", "tBodyAccJerk-entropy()-X", "tBodyAcc-max()-X",
"fBodyAcc-mad()-X", "tGravityAcc-min()-Z", "tGravityAcc-entropy()-Y",
"tGravityAcc-energy()-Z", "fBodyAcc-std()-X", "tGravityAcc-mean()-Z",
"fBodyAccMag-std()", "fBodyAcc-bandsEnergy()-1,8", "tBodyAccMag-std()",
"fBodyAccMag-mad()", "tBodyAccMag-mad()", "tBodyAcc-energy()-X",
"fBodyAccJerk-max()-X", "tBodyGyroJerk-iqr()-Z", "tBodyAccJerk-mad()-X",
"tBodyAccJerk-std()-X", "tBodyAccJerk-iqr()-X", "fBodyAccJerk-entropy()-X",
"fBodyAccJerk-bandsEnergy()-1,16", "fBodyAccJerk-mean()-X",
"tBodyAccJerk-energy()-X", "fBodyAccJerk-bandsEnergy()-1,8",
"fBodyAccJerk-bandsEnergy()-1,24", "fBodyAccJerk-mad()-X", "Activity"]

In [None]:
select_data = data[lst]

In [None]:
x_tmp = select_data.drop("Activity", axis = 1)
y_tmp = select_data.loc[:, "Activity"]
x_train_tmp, x_test_tmp, y_train_tmp, y_test_tmp = train_test_split(x_tmp, y_tmp, test_size = 0.2, random_state = 42)

In [None]:
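model = RandomForestClassifier(max_depth = 20, random_state = 42)
model.fit(x_train_tmp, y_train_tmp)

y_pred = model.predict(x_test_tmp)
print("Accuracy:", accuracy_score(y_test_tmp, y_pred))
# y_test_tmp holds the same rows as y_test (identical split arguments).
# Note: classification_report expects (y_true, y_pred); in this order precision and recall are swapped.
print(classification_report(y_pred, y_test_tmp))

Accuracy: 0.9762107051826678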
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.97      0.96      0.97       203
          STANDING       0.96      0.98      0.97       223
           WALKING       0.99      0.98      0.98       200
WALKING_DOWNSTAIRS       0.96      0.95      0.96       146
  WALKING_UPSTAIRS       0.96      0.98      0.97       174

          accuracy                           0.98      1177
         macro avg       0.97      0.97      0.97      1177
      weighted avg       0.98      0.98      0.98      1177



### (5) Algorithm 5: PyCaret

In [None]:
!pip install pycaret
!pip show pycaret

In [None]:
!pip install pycaret[full]

In [None]:
from pycaret.classification import predict_model
from pycaret.classification import setup, compare_models

clf_setup = setup(data = data, target = 'Activity', session_id = 123)

best_model = compare_models()
predictions = predict_model(best_model)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Activity
2,Target type,Multiclass
3,Target mapping,"LAYING: 0, SITTING: 1, STANDING: 2, WALKING: 3, WALKING_DOWNSTAIRS: 4, WALKING_UPSTAIRS: 5"
4,Original data shape,"(5881, 562)"
5,Transformed data shape,"(5881, 562)"
6,Transformed train set shape,"(4116, 562)"
7,Transformed test set shape,"(1765, 562)"
8,Numeric features,561
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.9891,0.9998,0.9891,0.9892,0.9891,0.9868,0.9869,43.578
xgboost,Extreme Gradient Boosting,0.9859,0.9998,0.9859,0.9861,0.9859,0.983,0.9831,73.939
lr,Logistic Regression,0.9849,0.9993,0.9849,0.9851,0.9849,0.9819,0.9819,12.336
gbc,Gradient Boosting Classifier,0.9837,0.9995,0.9837,0.9839,0.9837,0.9804,0.9805,368.93
et,Extra Trees Classifier,0.982,0.9994,0.982,0.9824,0.982,0.9784,0.9784,2.05
ridge,Ridge Classifier,0.9798,0.0,0.9798,0.9801,0.9798,0.9757,0.9758,0.476
lda,Linear Discriminant Analysis,0.9794,0.999,0.9794,0.9795,0.9793,0.9751,0.9752,1.4
rf,Random Forest Classifier,0.9728,0.9992,0.9728,0.9731,0.9728,0.9673,0.9673,5.937
svm,SVM - Linear Kernel,0.9706,0.0,0.9706,0.9726,0.9705,0.9646,0.9651,0.847
knn,K Neighbors Classifier,0.9531,0.9948,0.9531,0.9541,0.9529,0.9436,0.9439,0.548


Processing:   0%|          | 0/65 [00:00<?, ?it/s]

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Light Gradient Boosting Machine,0.9898,0.9998,0,0,0,0.9877,0.9877


In [None]:
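from sklearn.metrics import classification_report
# Note: the arguments are passed as (predicted, actual); classification_report expects (y_true, y_pred),
# so the precision and recall columns below are swapped.
print(classification_report(predictions["prediction_label"], predictions["Activity"]))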

                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       335
           SITTING       0.98      0.98      0.98       310
          STANDING       0.98      0.98      0.98       326
           WALKING       1.00      0.99      1.00       301
WALKING_DOWNSTAIRS       0.99      1.00      0.99       235
  WALKING_UPSTAIRS       0.99      0.99      0.99       258

          accuracy                           0.99      1765
         macro avg       0.99      0.99      0.99      1765
      weighted avg       0.99      0.99      0.99      1765



In [None]:
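# Accuracy ranking. PyCaret is scored on its own 1,765-row hold-out set, the others on the 1,177-row test split.
# mljar AutoML (0.994) > PyCaret LightGBM (0.9898) > RF (all features) + tuning (0.982)
#   > RF (all features) (0.979) > RF (top 25) (0.976) == RF (39 features from Exercise 1) (0.976)
# The 0.9998 reported by PyCaret is AUC, not accuracy.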