# **Motion Classification from Smartphone Sensor Data**
# Step 3: Stage-wise Modeling


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


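## 0. Mission

The modeling is carried out in stages.  

* Stage 1: build a model that classifies static (0) versus dynamic (1) activity
* Stage 2: build classifiers for the detailed activities
    * Stage 1 predicts 0 -> model the three static activities
    * Stage 1 predicts 1 -> model the three dynamic activities
* Model integration
    * Combine the two stages into a function that returns the final predictions and a performance evaluation for new data
* Performance comparison
    * Compare against the performance of the baseline modeling
    * Every model must go through [multiple algorithms + performance tuning].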


## 1. Environment setup

### (1) Importing libraries

* Requirements
    - The code already imports the libraries that are needed by default.
    - Add any other libraries you consider necessary.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add any other libraries you consider necessary.





### (2) Loading the data

* Provided dataset
    * data01_train.csv : for training and validation

 <br/>  

* Requirements
    - Load data01_train.csv and store it under the name 'data'.
        - Drop the variable subject from data.
    - Load data01_test.csv and store it under the name 'new_data'.


In [None]:
data=pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/data01_train.csv')

In [None]:
data.drop('subject', axis=1, inplace=True)
new_data=pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/data01_test.csv')

## 2. Data preprocessing

* Requirements
    - Add a label: add Activity_dynamic to data. Activity_dynamic has the same values as is_dynamic from Assignment 1.
    - Split the data into x, y1, and y2.
        * y1 : Activity
        * y2 : Activity_dynamic
    - train : val = 8 : 2 or 7 : 3
    - Use the random_state option so that results are reproducible when compared with other models.
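The requirements above ask for a reproducible split. A minimal sketch of one way to do that is shown here on a toy table; the feature columns `f1` and `f2` and the 50/50 split size are assumptions made only to keep the example small, and `stratify` is an optional extra that keeps the static/dynamic ratio equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the sensor table; f1 and f2 are hypothetical feature columns.
toy = pd.DataFrame({
    "f1": range(12),
    "f2": [v * 0.5 for v in range(12)],
    "Activity": ["STANDING", "SITTING", "LAYING",
                 "WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS"] * 2,
})

STATIC = ["STANDING", "SITTING", "LAYING"]
# 0 = static, 1 = dynamic
toy["Activity_dynamic"] = (~toy["Activity"].isin(STATIC)).astype(int)

X = toy.drop(columns=["Activity", "Activity_dynamic"])
y2 = toy["Activity_dynamic"]

# A fixed random_state makes the split repeatable across runs;
# stratify keeps the 0/1 proportion the same in train and validation.
x_tr, x_va, y_tr, y_va = train_test_split(
    X, y2, test_size=0.5, stratify=y2, random_state=42
)
print(y_tr.value_counts().to_dict(), y_va.value_counts().to_dict())
```

Re-running the cell with the same `random_state` returns the same rows in each part, which is what makes scores comparable between models.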

In [None]:
data['Activity_dynamic']=data['Activity'].apply(lambda x:0 if x in ['STANDING', 'SITTING', 'LAYING'] else 1)

In [None]:
X=data.drop(['Activity_dynamic','Activity'], axis=1)
y1=data['Activity']
y2=data['Activity_dynamic']

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(X,y2,test_size=0.2)

## **3. Stage-wise Modeling**

![](https://github.com/DA4BAM/image/blob/main/step%20by%20step.png?raw=true)

### (1) Stage 1: static/dynamic activity classifier

* Requirements
    * Build a model that separates static activities (Laying, Sitting, Standing) from dynamic activities (Walking, Walking-Up, Walking-Down).
    * Build several models and select the best-performing one.
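Selecting "the best-performing model" from a single validation split can be noisy. One hedged alternative is to compare candidates by mean cross-validated accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in for the sensor features; the candidate names and the 5-fold setting are assumptions, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for the static/dynamic task.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1500),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean 5-fold accuracy for each candidate
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, "->", best_name)
```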

#### 1) Algorithm 1: LogisticRegression

In [None]:
from sklearn.metrics import *

In [None]:
from sklearn.linear_model import LogisticRegression
model1=LogisticRegression()
model1.fit(x_train,y_train)
pred1=model1.predict(x_val)
print('accuracy:',accuracy_score(y_val, pred1))

accuracy: 1.0


#### 2) Algorithm 2: RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
model2=RandomForestClassifier()
model2.fit(x_train,y_train)
pred2=model2.predict(x_val)
print('accuracy:',accuracy_score(y_val, pred2))

accuracy: 1.0


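### (2) Stage 2-1: detailed classification of static activities

* Requirements
    * Extract the rows with static activities (Laying, Sitting, Standing)
    * Build a model that classifies Laying, Sitting, and Standing
    * Build several models and select the best-performing one.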
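The mission also calls for performance tuning. A minimal sketch of a grid search with `GridSearchCV` follows, run on synthetic three-class data; the parameter grid and the 3-fold setting are illustrative assumptions rather than recommended values for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for Laying / Sitting / Standing.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=3, random_state=0
)

# Small, hypothetical grid: 2 x 2 = 4 parameter combinations
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

After fitting, `search.best_estimator_` is refit on all the data passed to `fit` and can be used like any other classifier.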

In [None]:
data['Activity'].unique()

array(['STANDING', 'LAYING', 'WALKING', 'WALKING_DOWNSTAIRS',
       'WALKING_UPSTAIRS', 'SITTING'], dtype=object)

In [None]:
data.loc[data['Activity'].isin(['LAYING', 'SITTING', 'STANDING'])]

In [None]:
st_data=data.loc[data['Activity'].isin(['LAYING', 'SITTING', 'STANDING'])]
st_data['Activity'].value_counts()

LAYING      1115
STANDING    1087
SITTING     1032
Name: Activity, dtype: int64

In [None]:
X=st_data.drop(['Activity_dynamic','Activity'], axis=1)
y=st_data['Activity']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)

In [None]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression(max_iter=1500)
model.fit(x_train,y_train)
pred=model.predict(x_val)
print('accuracy:',round(accuracy_score(y_val, pred),4))

accuracy: 0.9784


In [None]:
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier()
model.fit(x_train,y_train)
pred=model.predict(x_val)
print('accuracy:',round(accuracy_score(y_val, pred),4))

accuracy: 0.9784


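### (3) Stage 2-2: detailed classification of dynamic activities

* Requirements
    * Extract the rows with dynamic activities (Walking, Walking Upstairs, Walking Downstairs)
    * Build a model that classifies Walking, Walking Upstairs, and Walking Downstairs
    * Build several models and select the best-performing one.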

In [None]:
dy_data=data.loc[data['Activity'].isin(['WALKING', 'WALKING_DOWNSTAIRS','WALKING_UPSTAIRS'])]
dy_data['Activity'].value_counts()

WALKING               998
WALKING_UPSTAIRS      858
WALKING_DOWNSTAIRS    791
Name: Activity, dtype: int64

In [None]:
dy_data.head()

In [None]:
X=dy_data.drop(['Activity_dynamic','Activity'], axis=1)
y=dy_data['Activity']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)

In [None]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression(max_iter=1500)
model.fit(x_train,y_train)
pred=model.predict(x_val)
print('accuracy:',round(accuracy_score(y_val, pred),4))

accuracy: 1.0


In [None]:
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier()
model.fit(x_train,y_train)
pred=model.predict(x_val)
print('accuracy:',round(accuracy_score(y_val, pred),4))

accuracy: 0.9887


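### (4) Combining the classifiers


* Requirements
    * Combine the two stages into a function that returns the final predictions and a performance evaluation for new data (test)
    * Build a data pipeline: the test data is loaded, preprocessed, and then used for prediction and performance evaluation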

![](https://github.com/DA4BAM/image/blob/main/pipeline%20function.png?raw=true)
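The routing idea can be sketched independently of the notebook's data. In the sketch below, `EncodedClassifier` and `two_stage_predict` are hypothetical helpers, not part of scikit-learn: the first keeps a `LabelEncoder` together with the estimator it was fitted for, so the prediction function does not need to look up a global encoder, and the second sends each row to the static or dynamic model according to the stage-1 prediction. The toy data is constructed so that feature 0 decides static versus dynamic and feature 1 decides the detailed class.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


class EncodedClassifier:
    """Hypothetical helper: pairs an estimator with its own LabelEncoder."""

    def __init__(self, estimator):
        self.estimator = estimator
        self.encoder = LabelEncoder()

    def fit(self, X, y):
        # Train on integer codes, remember the mapping back to strings
        self.estimator.fit(X, self.encoder.fit_transform(y))
        return self

    def predict(self, X):
        codes = np.asarray(self.estimator.predict(X)).astype(int)
        return self.encoder.inverse_transform(codes)


def two_stage_predict(X, stage1, static_model, dynamic_model):
    """Route each row through stage 1, then to the matching stage-2 model."""
    X = np.asarray(X)
    is_dynamic = np.asarray(stage1.predict(X)) == 1
    out = np.empty(len(X), dtype=object)
    if (~is_dynamic).any():
        out[~is_dynamic] = static_model.predict(X[~is_dynamic])
    if is_dynamic.any():
        out[is_dynamic] = dynamic_model.predict(X[is_dynamic])
    return out


# Toy data: feature 0 -> static/dynamic, feature 1 -> detailed class
X_demo = np.array([[-1.0, 0.0], [-1.0, 1.0], [1.0, 0.0], [1.0, 1.0]] * 5)
y_fine = np.array(["SITTING", "STANDING", "WALKING", "WALKING_UPSTAIRS"] * 5)
y_coarse = (X_demo[:, 0] > 0).astype(int)

stage1 = DecisionTreeClassifier(random_state=0).fit(X_demo, y_coarse)
static_model = EncodedClassifier(DecisionTreeClassifier(random_state=0)).fit(
    X_demo[y_coarse == 0], y_fine[y_coarse == 0]
)
dynamic_model = EncodedClassifier(DecisionTreeClassifier(random_state=0)).fit(
    X_demo[y_coarse == 1], y_fine[y_coarse == 1]
)

final_pred = two_stage_predict(X_demo, stage1, static_model, dynamic_model)
print(final_pred[:4])
```

Guarding each branch with `.any()` avoids calling `predict` on an empty array when stage 1 assigns every row to one group.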

#### 1) Building the function

In [None]:
from sklearn.model_selection import train_test_split
data['Activity_dynamic']=data['Activity'].apply(lambda x:0 if x in ['STANDING', 'SITTING', 'LAYING'] else 1)
X=data.drop(['Activity_dynamic','Activity'], axis=1)
y1=data['Activity']
y2=data['Activity_dynamic']
x_train, x_val, y_train, y_val = train_test_split(X,y2,test_size=0.2)

#### Model 1

In [None]:
from sklearn.linear_model import LogisticRegression
model1=LogisticRegression(max_iter=1500)
model1.fit(x_train,y_train)
pred1=model1.predict(x_val)
print('accuracy:',round(accuracy_score(y_val, pred1),4))

accuracy: 1.0


#### Model 2-1 (static)

In [None]:
st_data=data.loc[data['Activity'].isin(['LAYING', 'SITTING', 'STANDING'])]
X=st_data.drop(['Activity_dynamic','Activity'], axis=1)
y=st_data['Activity']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)

In [None]:
from sklearn.linear_model import LogisticRegression
model2_1=LogisticRegression(max_iter=1500)
model2_1.fit(x_train,y_train)
pred2_1=model2_1.predict(x_val)
print('accuracy:',round(accuracy_score(y_val, pred2_1),4))

accuracy: 0.9722


#### Model 2-2 (dynamic)

In [None]:
dy_data=data.loc[data['Activity'].isin(['WALKING', 'WALKING_DOWNSTAIRS','WALKING_UPSTAIRS'])]
X=dy_data.drop(['Activity_dynamic','Activity'], axis=1)
y=dy_data['Activity']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)

In [None]:
from sklearn.linear_model import LogisticRegression
model2_2=LogisticRegression(max_iter=1500)
model2_2.fit(x_train,y_train)
pred2_2=model2_2.predict(x_val)
print('accuracy:',round(accuracy_score(y_val, pred2_2),4))

accuracy: 0.9962


#### Combining the models

In [None]:
new_data=pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/data01_test.csv')

In [None]:
result=pd.DataFrame({'index':range(new_data.shape[0])})
result.head()

In [None]:
new_data.head()

In [None]:
new_data.drop('subject',axis=1,inplace=True)

In [None]:
x_test = new_data.drop('Activity', axis=1)

In [None]:
x_test.head(5)

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-meanFreq(),fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)"
0,0.284379,-0.021981,-0.116683,-0.99249,-0.97964,-0.963321,-0.992563,-0.977304,-0.958142,-0.93885,...,0.255432,-0.509523,-0.850065,-0.018043,0.092304,0.07422,-0.714534,-0.671943,-0.018351,-0.185733
1,0.27744,-0.028086,-0.118412,-0.99662,-0.927676,-0.972294,-0.997346,-0.931405,-0.971788,-0.939837,...,-0.166341,-0.210792,-0.613367,-0.022456,-0.155414,0.247498,-0.112257,-0.826816,0.184489,-0.068699
2,0.305833,-0.041023,-0.087303,0.00688,0.1828,-0.237984,0.005642,0.028616,-0.236474,0.016311,...,0.468354,0.579587,0.394388,-0.362616,0.171069,0.576349,-0.688314,-0.743234,0.272186,0.053101
3,0.276053,-0.016487,-0.108381,-0.995379,-0.983978,-0.975854,-0.995877,-0.98528,-0.974907,-0.941425,...,0.337635,-0.566291,-0.841455,0.289548,0.079801,-0.020033,0.291898,-0.639435,-0.111998,-0.123298
4,0.271998,0.016904,-0.078856,-0.973468,-0.702462,-0.86945,-0.97981,-0.711601,-0.856807,-0.92076,...,-0.594792,0.447577,0.214219,0.010111,0.114179,-0.830776,-0.325098,-0.840817,0.116237,-0.096615


In [None]:
# Stage 1: static vs. dynamic
pred1=model1.predict(x_test)
new_data['model1_pred']=pred1

In [None]:
new_data['model2_pred']=""

In [None]:
new_data.head(5)

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity,model1_pred,model2_pred
0,0.284379,-0.021981,-0.116683,-0.99249,-0.97964,-0.963321,-0.992563,-0.977304,-0.958142,-0.93885,...,-0.018043,0.092304,0.07422,-0.714534,-0.671943,-0.018351,-0.185733,SITTING,0,
1,0.27744,-0.028086,-0.118412,-0.99662,-0.927676,-0.972294,-0.997346,-0.931405,-0.971788,-0.939837,...,-0.022456,-0.155414,0.247498,-0.112257,-0.826816,0.184489,-0.068699,STANDING,0,
2,0.305833,-0.041023,-0.087303,0.00688,0.1828,-0.237984,0.005642,0.028616,-0.236474,0.016311,...,-0.362616,0.171069,0.576349,-0.688314,-0.743234,0.272186,0.053101,WALKING,1,
3,0.276053,-0.016487,-0.108381,-0.995379,-0.983978,-0.975854,-0.995877,-0.98528,-0.974907,-0.941425,...,0.289548,0.079801,-0.020033,0.291898,-0.639435,-0.111998,-0.123298,SITTING,0,
4,0.271998,0.016904,-0.078856,-0.973468,-0.702462,-0.86945,-0.97981,-0.711601,-0.856807,-0.92076,...,0.010111,0.114179,-0.830776,-0.325098,-0.840817,0.116237,-0.096615,STANDING,0,


In [None]:
# Static rows: the last three columns (Activity, model1_pred, model2_pred) are not features
pred0 = new_data.loc[new_data['model1_pred']==0].iloc[:,:-3]
pred2_1=model2_1.predict(pred0)
new_data.loc[new_data['model1_pred']==0, 'model2_pred']=pred2_1

In [None]:
new_data.head(5)

In [None]:
# Dynamic rows: the last three columns (Activity, model1_pred, model2_pred) are not features
pred1 = new_data.loc[new_data['model1_pred']==1].iloc[:,:-3]
pred2_2=model2_2.predict(pred1)
new_data.loc[new_data['model1_pred']==1, 'model2_pred']=pred2_2

In [None]:
new_data.head(5)

In [None]:
accuracy_score(new_data['Activity'], new_data['model2_pred'])

0.9809653297076818

In [None]:
max_len=10

In [None]:
# Build the pipeline function
def test(new_data, model1, model2_1, model2_2):
    # Preprocessing: drop() returns a new frame, so the caller's DataFrame is not modified
    new_data = new_data.drop('subject', axis=1)
    x_test = new_data.drop('Activity', axis=1)

    # Stage 1: static (0) vs. dynamic (1)
    new_data['model1_pred'] = model1.predict(x_test)

    # Column that will hold the stage-2 predictions
    new_data['model2_pred'] = ""

    # Static rows
    is_static = new_data['model1_pred'] == 0
    pred2_1 = model2_1.predict(x_test.loc[is_static])
    if model2_1.__class__.__name__ == 'XGBClassifier':
        # XGBoost predicts integer codes; label2_1 is the global LabelEncoder fitted with that model
        new_data.loc[is_static, 'model2_pred'] = list(label2_1.inverse_transform(pred2_1))
    else:
        new_data.loc[is_static, 'model2_pred'] = pred2_1

    # Dynamic rows
    is_dynamic = new_data['model1_pred'] == 1
    pred2_2 = model2_2.predict(x_test.loc[is_dynamic])
    if model2_2.__class__.__name__ == 'XGBClassifier':
        # label2_2 is the global LabelEncoder fitted with the dynamic XGBoost model
        new_data.loc[is_dynamic, 'model2_pred'] = list(label2_2.inverse_transform(pred2_2))
    else:
        new_data.loc[is_dynamic, 'model2_pred'] = pred2_2

    # Accuracy
    print('Accuracy:', round(accuracy_score(new_data['Activity'], new_data['model2_pred']), 4))

    return new_data[['Activity', 'model1_pred', 'model2_pred']]

In [None]:
new_data=pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/data01_test.csv')

In [None]:
result = test(new_data, model1, model2_1, model2_2)

Accuracy: 0.981
10


In [None]:
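from sklearn.preprocessing import LabelEncoder  # needed here; it is otherwise first imported in a later cell
label = LabelEncoder()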
data['label_Activity']=label.fit_transform(data['Activity'])
data.head()

In [None]:
# XGBoost, static activities
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
st_data=data.loc[data['Activity'].isin(['LAYING', 'SITTING', 'STANDING'])]
X=st_data.drop(['Activity_dynamic','Activity','label_Activity'], axis=1)
y=st_data['Activity']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)
label2_1 = LabelEncoder()
y_train_la = label2_1.fit_transform(y_train)
y_val_la = label2_1.transform(y_val)
xgb2_1=XGBClassifier()
xgb2_1.fit(x_train, y_train_la)
xgb_pred=xgb2_1.predict(x_val)
print(classification_report(y_val_la,xgb_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       233
           1       0.99      0.99      0.99       204
           2       0.99      0.99      0.99       210

    accuracy                           0.99       647
   macro avg       0.99      0.99      0.99       647
weighted avg       0.99      0.99      0.99       647



In [None]:
list(label2_1.inverse_transform(xgb_pred))

In [None]:
# XGBoost, dynamic activities
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
dy_data=data.loc[data['Activity'].isin(['WALKING', 'WALKING_DOWNSTAIRS','WALKING_UPSTAIRS'])]
X=dy_data.drop(['Activity_dynamic','Activity','label_Activity'], axis=1)
y=dy_data['Activity']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)
label2_2 = LabelEncoder()
y_train_la = label2_2.fit_transform(y_train)
y_val_la = label2_2.transform(y_val)
xgb2_2=XGBClassifier()
xgb2_2.fit(x_train, y_train_la)
xgb_pred=xgb2_2.predict(x_val)
print(classification_report(y_val_la,xgb_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       197
           1       0.99      1.00      1.00       152
           2       1.00      0.99      0.99       181

    accuracy                           1.00       530
   macro avg       1.00      1.00      1.00       530
weighted avg       1.00      1.00      1.00       530



In [None]:
new_data=pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/data01_test.csv')

In [None]:
result = test(new_data, model1, xgb2_1, xgb2_2)

Accuracy: 0.9905


In [None]:
result.head()

Unnamed: 0,Activity,model1_pred,model2_pred
0,SITTING,0,SITTING
1,STANDING,0,STANDING
2,WALKING,1,WALKING
3,SITTING,0,SITTING
4,STANDING,0,STANDING


In [None]:
result.loc[result['Activity']!=result['model2_pred']]

Unnamed: 0,Activity,model1_pred,model2_pred
62,STANDING,0,SITTING
207,SITTING,0,STANDING
295,SITTING,0,STANDING
302,WALKING,1,WALKING_UPSTAIRS
310,SITTING,0,STANDING
332,WALKING,1,WALKING_UPSTAIRS
570,LAYING,0,SITTING
590,SITTING,0,STANDING
647,WALKING_DOWNSTAIRS,1,WALKING
835,SITTING,0,STANDING
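The ten misclassified rows listed above can be summarized by error type. The sketch below re-enters those ten (actual, predicted) pairs by hand and counts them; on the real `result` frame the same `groupby` would apply to the filtered rows directly.

```python
import pandas as pd

# The ten misclassified rows copied from the listing above
errors = pd.DataFrame({
    "Activity": ["STANDING", "SITTING", "SITTING", "WALKING", "SITTING",
                 "WALKING", "LAYING", "SITTING", "WALKING_DOWNSTAIRS", "SITTING"],
    "model2_pred": ["SITTING", "STANDING", "STANDING", "WALKING_UPSTAIRS", "STANDING",
                    "WALKING_UPSTAIRS", "SITTING", "STANDING", "WALKING", "STANDING"],
})

# Count each (actual, predicted) pair, most frequent first
pair_counts = (
    errors.groupby(["Activity", "model2_pred"])
    .size()
    .sort_values(ascending=False)
)
print(pair_counts)
```

Half of the errors are SITTING predicted as STANDING, and every error stays within its static or dynamic group, so in this listing the remaining mistakes come from the stage-2 models rather than from stage 1.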


# Reducing the features by importance, then training and predicting
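As a compact illustration of the idea in this section, the sketch below ranks features by `feature_importances_` using only the training split and then retrains on the top-k columns. The synthetic data, the column names `feat_0` to `feat_19`, and k = 5 are assumptions for the example; ranking on the training split only is a design choice that keeps the validation rows out of the feature selection.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 20 features, 5 of them informative
X_arr, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(20)])

x_tr, x_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Rank features using the training split only
ranker = RandomForestClassifier(n_estimators=100, random_state=42).fit(x_tr, y_tr)
ranking = pd.Series(ranker.feature_importances_, index=x_tr.columns)
ranking = ranking.sort_values(ascending=False)
top_k = ranking.index[:5].tolist()

# Retrain on the reduced feature set and score on the same validation rows
small = RandomForestClassifier(n_estimators=100, random_state=42)
small.fit(x_tr[top_k], y_tr)
small_score = small.score(x_va[top_k], y_va)
print(top_k, round(small_score, 4))
```

Reusing one split for both the full and the reduced model makes their scores directly comparable.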

In [None]:
# Compute the feature importances
def plot_feature_importance(importance, names, result_only = False, topn = 'all'):
    feature_importance = np.array(importance)
    feature_name = np.array(names)

    data={'feature_name':feature_name,'feature_importance':feature_importance}
    fi_temp = pd.DataFrame(data)

    # Sort by feature importance, highest first
    fi_temp.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_temp.reset_index(drop=True, inplace = True)

    if topn == 'all' :
        fi_df = fi_temp.copy()
    else :
        fi_df = fi_temp.iloc[:topn]

    # Plot the feature importances as a bar chart
    if not result_only :
        plt.figure(figsize=(10,100))
        sns.barplot(x='feature_importance', y='feature_name', data = fi_df)

        plt.xlabel('importance')
        plt.ylabel('feature name')
        plt.grid()

    return fi_df

In [None]:
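target='Activity'
# Drop the target and the two columns derived from it; leaving them in X would leak the label
X=data.drop([target,'Activity_dynamic','label_Activity'],axis=1)
y=data[target]
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)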

In [None]:
model = RandomForestClassifier()
model.fit(x_train, y_train)
y_pred=model.predict(x_val)
print(classification_report(y_val, y_pred))
fi_df = plot_feature_importance(model.feature_importances_, list(x_train), result_only = True, topn = 'all')

                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       223
           SITTING       0.94      0.99      0.96       203
          STANDING       0.99      0.94      0.97       225
           WALKING       0.99      1.00      1.00       198
WALKING_DOWNSTAIRS       0.98      0.99      0.98       147
  WALKING_UPSTAIRS       0.99      0.98      0.99       181

          accuracy                           0.98      1177
         macro avg       0.98      0.98      0.98      1177
      weighted avg       0.98      0.98      0.98      1177



In [None]:
data.head(1)

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity,Activity_dynamic,label_Activity
0,0.288508,-0.009196,-0.103362,-0.988986,-0.962797,-0.967422,-0.989,-0.962596,-0.96565,-0.929747,...,-0.042494,-0.044218,0.307873,0.07279,-0.60112,0.331298,0.165163,STANDING,0,2


In [None]:
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
target='Activity'
X=data.drop(['Activity','Activity_dynamic','label_Activity'],axis=1)
y=data[target]
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2, random_state=42)
label = LabelEncoder()
y_train_la = label.fit_transform(y_train)
y_val_la = label.transform(y_val)
xgb=XGBClassifier(random_state=42)
xgb.fit(x_train, y_train_la)
xgb_pred=xgb.predict(x_val)
print(classification_report(y_val_la,xgb_pred))
fi_df = plot_feature_importance(xgb.feature_importances_, list(x_train), result_only = True, topn = 'all')

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       231
           1       0.98      0.99      0.99       200
           2       1.00      0.98      0.99       226
           3       0.99      0.98      0.99       198
           4       0.99      0.99      0.99       145
           5       0.99      1.00      0.99       177

    accuracy                           0.99      1177
   macro avg       0.99      0.99      0.99      1177
weighted avg       0.99      0.99      0.99      1177



In [None]:
len(xgb.feature_importances_)

561

In [None]:
fi_df = plot_feature_importance(xgb.feature_importances_, list(x_train), result_only = True, topn = 50)

In [None]:
top50_df = data[list(fi_df['feature_name'])]

In [None]:
target='Activity'
X=top50_df
y=data[target]
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)

In [None]:
x_train.shape, x_val.shape, y_train.shape, y_val.shape

((4704, 50), (1177, 50), (4704,), (1177,))

In [None]:
label = LabelEncoder()
y_train_la = label.fit_transform(y_train)
y_val_la = label.transform(y_val)
xgb=XGBClassifier()
xgb.fit(x_train, y_train_la)
xgb_pred=xgb.predict(x_val)
print(classification_report(y_val_la,xgb_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       207
           1       0.97      0.96      0.96       201
           2       0.96      0.97      0.97       211
           3       0.98      0.99      0.98       202
           4       0.99      0.97      0.98       179
           5       0.97      0.99      0.98       177

    accuracy                           0.98      1177
   macro avg       0.98      0.98      0.98      1177
weighted avg       0.98      0.98      0.98      1177



In [None]:
fi_df = plot_feature_importance(xgb.feature_importances_, list(x_train), result_only = True, topn = 30)
top30_df = data[list(fi_df['feature_name'])]
target='Activity'
X=top30_df
y=data[target]
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)
label = LabelEncoder()
y_train_la = label.fit_transform(y_train)
y_val_la = label.transform(y_val)
xgb=XGBClassifier()
xgb.fit(x_train, y_train_la)
xgb_pred=xgb.predict(x_val)
print(classification_report(y_val_la,xgb_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       198
           1       0.96      0.98      0.97       200
           2       0.99      0.96      0.97       231
           3       0.98      1.00      0.99       201
           4       1.00      0.98      0.99       167
           5       0.99      0.99      0.99       180

    accuracy                           0.98      1177
   macro avg       0.99      0.99      0.99      1177
weighted avg       0.98      0.98      0.98      1177



# Kaggle

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
data = pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/train.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,timestamp,A_x,A_y,A_z,B_x,B_y,B_z,label
0,0,2019-01-12 00:45:54.450,-0.25913,-0.834869,-0.485499,0.196409,,0.384934,8
1,1,2000-01-01 01:37:06.440,0.37049,0.175042,0.122625,-0.338242,0.358245,0.126491,2
2,2,2019-01-12 00:45:33.900,-0.257837,-0.881947,-0.391895,0.196027,0.894537,0.411221,8
3,3,2000-01-01 00:46:22.680,-0.937753,-0.055961,0.362041,-0.929881,0.087673,0.134609,11
4,4,2000-01-01 00:49:56.620,-0.98832,-0.19039,0.157909,-0.954669,-0.02481,-0.38842,6


In [19]:
x_test = pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/test.csv')
x_test.drop(['Unnamed: 0', 'timestamp'], axis=1, inplace=True)
x_test.head()

Unnamed: 0,A_x,A_y,A_z,B_x,B_y,B_z
0,-1.000957,-0.170691,0.124889,-0.979561,0.00315,-0.264673
1,-0.87483,0.132696,-0.501727,-1.274911,0.045122,0.12127
2,-1.219112,0.074678,0.435331,-0.86082,0.22274,0.008689
3,-0.907752,-0.171816,0.211507,-0.972017,0.337799,1.013534
4,-1.031261,0.00034,-0.091693,-0.217434,-0.323466,0.931614


In [21]:
sample=pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/sample.csv')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  100000 non-null  int64  
 1   timestamp   100000 non-null  object 
 2   A_x         90000 non-null   float64
 3   A_y         90000 non-null   float64
 4   A_z         90000 non-null   float64
 5   B_x         90000 non-null   float64
 6   B_y         90000 non-null   float64
 7   B_z         90000 non-null   float64
 8   label       100000 non-null  int64  
dtypes: float64(6), int64(2), object(1)
memory usage: 6.9+ MB


In [10]:
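# .copy() makes dn_data an independent frame, so adding columns later does not raise SettingWithCopyWarning
dn_data = data.dropna().copy()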
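Dropping incomplete rows leaves 53,219 of the 100,000 training rows (42,575 + 10,644 in the split further down). A hedged alternative is to keep every row and impute the gaps inside a pipeline. The sketch below uses a toy frame with two of the sensor columns; the median strategy is an assumption, and whether imputation helps on this dataset is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy frame with gaps, using two of the sensor column names
toy = pd.DataFrame({
    "A_x": [0.1, np.nan, 0.3, 0.4, -0.5, -0.6],
    "B_x": [1.0, 1.1, np.nan, 0.9, -1.0, -1.2],
    "label": [1, 1, 1, 1, 2, 2],
})

# Option 1: drop incomplete rows; .copy() gives an independent frame
dropped = toy.dropna().copy()

# Option 2: keep all rows and fill gaps with the per-column median learned during fit
clf = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=20, random_state=42),
)
clf.fit(toy[["A_x", "B_x"]], toy["label"])

# The fitted pipeline can also score rows that contain missing values
pred = clf.predict(pd.DataFrame({"A_x": [np.nan], "B_x": [-1.1]}))
print(len(toy), len(dropped), pred)
```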

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import *
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [11]:
list(dn_data)

['Unnamed: 0', 'timestamp', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z', 'label']

In [12]:
X = dn_data.drop(['Unnamed: 0', 'timestamp','label'],axis=1)
y=dn_data['label']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)

In [13]:
x_train.shape, x_val.shape, y_train.shape, y_val.shape

((42575, 6), (10644, 6), (42575,), (10644,))

In [15]:
model1 = LogisticRegression(max_iter=2000)
model1.fit(x_train, y_train)
pred1 = model1.predict(x_val)
print(classification_report(y_val, pred1))

              precision    recall  f1-score   support

           1       0.48      0.56      0.51      1238
           2       0.38      0.41      0.39      1291
           3       0.43      0.60      0.50       886
           4       0.51      0.45      0.48       770
           5       0.48      0.06      0.11       582
           6       0.57      0.37      0.45       572
           7       0.94      1.00      0.97      1134
           8       0.95      0.98      0.96      1211
           9       0.78      0.95      0.86       978
          10       0.48      0.48      0.48      1014
          11       0.64      0.51      0.56       968

    accuracy                           0.62     10644
   macro avg       0.60      0.58      0.57     10644
weighted avg       0.61      0.62      0.60     10644



In [16]:
model2=RandomForestClassifier(random_state=42)
model2.fit(x_train, y_train)
pred2=model2.predict(x_val)
print(classification_report(y_val, pred2))

              precision    recall  f1-score   support

           1       0.80      0.86      0.83      1238
           2       0.92      0.89      0.90      1291
           3       0.79      0.73      0.76       886
           4       0.81      0.86      0.83       770
           5       0.67      0.60      0.63       582
           6       0.97      0.94      0.95       572
           7       1.00      1.00      1.00      1134
           8       1.00      1.00      1.00      1211
           9       0.99      0.96      0.98       978
          10       0.86      0.92      0.89      1014
          11       0.91      0.92      0.92       968

    accuracy                           0.89     10644
   macro avg       0.88      0.88      0.88     10644
weighted avg       0.89      0.89      0.89     10644



In [32]:
accuracy_score(y_val, pred2)

0.8931792559188275

In [20]:
result = model2.predict(x_test)

In [23]:
result[:5]

array([ 6,  4, 11,  9,  7])

In [24]:
sample['label'] = result

In [25]:
sample.head()

Unnamed: 0,ID,label
0,0,6
1,1,4
2,2,11
3,3,9
4,4,7


In [28]:
sample.to_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/result1.csv', index=False)

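# Predicting after separating dynamic and static activities
* Dynamic: walking (1), running (2), walking slowly (3), climbing stairs (4), descending stairs (5), cycling (9), cycling while standing (10), sitting on a bicycle (11)
* Static: standing (6), sitting (7), lying down (8)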
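The grouping above maps directly to a binary flag. A minimal sketch on a toy label column follows; the constant names `DYNAMIC_LABELS` and `STATIC_LABELS` are introduced here for illustration, and `assign` returns a new frame instead of writing into a possible slice.

```python
import pandas as pd

# Label groups taken from the list above
DYNAMIC_LABELS = [1, 2, 3, 4, 5, 9, 10, 11]
STATIC_LABELS = [6, 7, 8]

toy = pd.DataFrame({"label": [8, 2, 11, 6, 2, 7]})

# 1 = dynamic, 0 = static; assign() returns a new DataFrame
toy = toy.assign(is_dy=toy["label"].isin(DYNAMIC_LABELS).astype(int))

# Sanity check: every label belongs to one of the two groups
unknown = set(toy["label"]) - set(DYNAMIC_LABELS) - set(STATIC_LABELS)
print(toy["is_dy"].tolist(), unknown)
```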

In [49]:
dn_data['is_dy']=0
dn_data.loc[dn_data['label'].isin([1,2,3,4,5,9,10,11]),'is_dy']=1
dn_data.loc[dn_data['label'].isin([6,7,8]),'is_dy']=0
dn_data.head()


Unnamed: 0.1,Unnamed: 0,timestamp,A_x,A_y,A_z,B_x,B_y,B_z,label,is_dy
1,1,2000-01-01 01:37:06.440,0.37049,0.175042,0.122625,-0.338242,0.358245,0.126491,2,1
2,2,2019-01-12 00:45:33.900,-0.257837,-0.881947,-0.391895,0.196027,0.894537,0.411221,8,0
3,3,2000-01-01 00:46:22.680,-0.937753,-0.055961,0.362041,-0.929881,0.087673,0.134609,11,1
4,4,2000-01-01 00:49:56.620,-0.98832,-0.19039,0.157909,-0.954669,-0.02481,-0.38842,6,0
5,5,2000-01-01 01:34:24.140,-0.654583,0.068285,-0.029109,-0.176341,-0.256252,-0.510816,2,1


### Dynamic vs. static classification

In [50]:
X = dn_data.drop(['Unnamed: 0', 'timestamp','label','is_dy'],axis=1)
y=dn_data['is_dy']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)

In [51]:
model2=RandomForestClassifier(random_state=42)
model2.fit(x_train, y_train)
pred2=model2.predict(x_val)
print(classification_report(y_val, pred2))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      2917
           1       0.99      1.00      0.99      7727

    accuracy                           0.99     10644
   macro avg       0.99      0.99      0.99     10644
weighted avg       0.99      0.99      0.99     10644



### Classification of dynamic activities

In [63]:
dy_data=dn_data.loc[dn_data['label'].isin([1,2,3,4,5,9,10,11])]
X=dy_data.drop(['Unnamed: 0', 'timestamp','label','is_dy'], axis=1)
y=dy_data['label']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)

In [69]:
model2_2=RandomForestClassifier(n_estimators=1000, max_depth=15, random_state=42)
model2_2.fit(x_train, y_train)
pred2_2=model2_2.predict(x_val)
print(classification_report(y_val, pred2_2))

              precision    recall  f1-score   support

           1       0.79      0.89      0.84      1247
           2       0.92      0.88      0.90      1256
           3       0.81      0.77      0.79       880
           4       0.83      0.85      0.84       790
           5       0.68      0.55      0.61       594
           9       0.98      0.97      0.98       958
          10       0.87      0.91      0.89      1025
          11       0.93      0.92      0.93       977

    accuracy                           0.86      7727
   macro avg       0.85      0.84      0.85      7727
weighted avg       0.86      0.86      0.86      7727



### Classification of static activities

In [70]:
st_data=dn_data.loc[dn_data['label'].isin([6,7,8])]
X=st_data.drop(['Unnamed: 0', 'timestamp','label','is_dy'], axis=1)
y=st_data['label']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2)

In [71]:
model2_1=RandomForestClassifier(random_state=42)
model2_1.fit(x_train, y_train)
pred2_1=model2_1.predict(x_val)
print(classification_report(y_val, pred2_1))

              precision    recall  f1-score   support

           6       1.00      1.00      1.00       568
           7       1.00      1.00      1.00      1151
           8       1.00      1.00      1.00      1199

    accuracy                           1.00      2918
   macro avg       1.00      1.00      1.00      2918
weighted avg       1.00      1.00      1.00      2918



In [59]:
x_test.head()

Unnamed: 0,A_x,A_y,A_z,B_x,B_y,B_z,model1_pred,model2_pred
0,-1.000957,-0.170691,0.124889,-0.979561,0.00315,-0.264673,0,6
1,-0.87483,0.132696,-0.501727,-1.274911,0.045122,0.12127,1,4
2,-1.219112,0.074678,0.435331,-0.86082,0.22274,0.008689,0,11
3,-0.907752,-0.171816,0.211507,-0.972017,0.337799,1.013534,1,9
4,-1.031261,0.00034,-0.091693,-0.217434,-0.323466,0.931614,0,7


In [72]:
x_test = pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/test.csv')
x_test.drop(['Unnamed: 0', 'timestamp'], axis=1, inplace=True)

In [73]:
# Stage 1: static vs. dynamic
pred1=model2.predict(x_test)
x_test['model1_pred']=pred1

# Column that will hold the stage-2 predictions
x_test['model2_pred']=0
# Static rows
pred0 = x_test.loc[x_test['model1_pred']==0].drop(['model1_pred','model2_pred'], axis=1)
pred2_1=model2_1.predict(pred0)
x_test.loc[x_test['model1_pred']==0, 'model2_pred']=pred2_1

# Dynamic rows
pred1 = x_test.loc[x_test['model1_pred']==1].drop(['model1_pred','model2_pred'], axis=1)
pred2_2=model2_2.predict(pred1)
x_test.loc[x_test['model1_pred']==1, 'model2_pred']=pred2_2

# Accuracy (left commented out: x_test here has no Activity column)
# print('Accuracy:',round(accuracy_score(x_test['Activity'], x_test['model2_pred']),4))
x_test.head()

Unnamed: 0,A_x,A_y,A_z,B_x,B_y,B_z,model1_pred,model2_pred
0,-1.000957,-0.170691,0.124889,-0.979561,0.00315,-0.264673,0,6
1,-0.87483,0.132696,-0.501727,-1.274911,0.045122,0.12127,1,4
2,-1.219112,0.074678,0.435331,-0.86082,0.22274,0.008689,1,11
3,-0.907752,-0.171816,0.211507,-0.972017,0.337799,1.013534,1,9
4,-1.031261,0.00034,-0.091693,-0.217434,-0.323466,0.931614,0,7


In [47]:
x_test['model2_pred']

0         6
1         4
2        11
3         9
4         7
         ..
13229     4
13230     9
13231     4
13232     1
13233     7
Name: model2_pred, Length: 13234, dtype: int64

In [74]:
sample['label'] = x_test['model2_pred']
sample.to_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/result3.csv', index=False)

# Training with hour, minute, and second features
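A minimal sketch of the time-feature extraction on two sample timestamps from the training data. The column names `hour`, `minute`, and `second` are chosen for the example, and `assign` is used so the new columns are added to a fresh frame rather than to a slice.

```python
import pandas as pd

# Two timestamps copied from the training data preview
toy = pd.DataFrame({
    "timestamp": ["2019-01-12 00:45:54.450", "2000-01-01 01:37:06.440"],
})

# Parse once, then read the components through the .dt accessor
ts = pd.to_datetime(toy["timestamp"])
toy = toy.assign(hour=ts.dt.hour, minute=ts.dt.minute, second=ts.dt.second)
print(toy[["hour", "minute", "second"]])
```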

In [80]:
dn_data['timestamp'] = pd.to_datetime(dn_data['timestamp'])


In [103]:
dn_data['hour'] = dn_data['timestamp'].dt.hour
dn_data['min'] = dn_data['timestamp'].dt.minute
dn_data['sec'] = dn_data['timestamp'].dt.second
dn_data['year'] = dn_data['timestamp'].dt.year
dn_data['month'] = dn_data['timestamp'].dt.month
dn_data['day'] = dn_data['timestamp'].dt.day
dn_data.head()


Unnamed: 0.1,Unnamed: 0,timestamp,A_x,A_y,A_z,B_x,B_y,B_z,label,is_dy,hour,min,sec,year,month,day
1,1,2000-01-01 01:37:06.440,0.37049,0.175042,0.122625,-0.338242,0.358245,0.126491,2,1,1,37,6,2000,1,1
2,2,2019-01-12 00:45:33.900,-0.257837,-0.881947,-0.391895,0.196027,0.894537,0.411221,8,0,0,45,33,2019,1,12
3,3,2000-01-01 00:46:22.680,-0.937753,-0.055961,0.362041,-0.929881,0.087673,0.134609,11,1,0,46,22,2000,1,1
4,4,2000-01-01 00:49:56.620,-0.98832,-0.19039,0.157909,-0.954669,-0.02481,-0.38842,6,0,0,49,56,2000,1,1
5,5,2000-01-01 01:34:24.140,-0.654583,0.068285,-0.029109,-0.176341,-0.256252,-0.510816,2,1,1,34,24,2000,1,1
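
The same timestamp expansion is applied to both the training frame and the test frame, so it may be worth factoring into one helper. This is a sketch under assumptions: `add_time_features` is a hypothetical name, and working on a copy is one common way to avoid the `SettingWithCopyWarning` shown above when the input frame is a slice of another frame.

```python
import pandas as pd


def add_time_features(df, column="timestamp"):
    """Return a copy of df with hour/min/sec/year/month/day columns added."""
    out = df.copy()  # assign into a copy, never into a slice of another frame
    ts = pd.to_datetime(out[column])
    out[column] = ts
    out["hour"] = ts.dt.hour
    out["min"] = ts.dt.minute
    out["sec"] = ts.dt.second
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    return out


# Two timestamps taken from the table above.
raw = pd.DataFrame({"timestamp": ["2000-01-01 01:37:06.440", "2019-01-12 00:45:33.900"]})
expanded = add_time_features(raw)
print(expanded[["hour", "min", "sec", "year", "month", "day"]])
```

Calling the helper on both frames keeps the derived columns in the same order, which matters because the fitted model expects the test features to line up with the training features.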


In [104]:
X = dn_data.drop(['Unnamed: 0', 'timestamp','label','is_dy'],axis=1)
y=dn_data['label']
x_train, x_val, y_train, y_val = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)

In [98]:
model2=RandomForestClassifier(random_state=42)
model2.fit(x_train, y_train)
pred2=model2.predict(x_val)
print(classification_report(y_val, pred2))

              precision    recall  f1-score   support

           1       0.96      0.95      0.95      1238
           2       0.97      0.96      0.97      1291
           3       0.98      0.90      0.94       886
           4       0.94      0.99      0.97       770
           5       0.92      0.99      0.95       582
           6       1.00      0.99      0.99       572
           7       1.00      1.00      1.00      1134
           8       1.00      1.00      1.00      1211
           9       1.00      1.00      1.00       978
          10       0.94      0.98      0.96      1014
          11       0.99      0.97      0.98       968

    accuracy                           0.97     10644
   macro avg       0.97      0.98      0.97     10644
weighted avg       0.97      0.97      0.97     10644



In [90]:
dn_data.groupby('label',as_index=False)[['hour','min','sec']].mean()

Unnamed: 0,label,hour,min,sec
0,1,0.344053,29.399806,27.79638
1,2,0.430342,23.806137,28.378894
2,3,0.74458,30.883695,30.352529
3,4,0.310667,40.503763,27.494679
4,5,0.711981,32.471335,26.798833
5,6,0.0,33.188046,29.679133
6,7,0.532099,27.716578,29.663845
7,8,0.0,45.391843,30.503798
8,9,1.559599,26.144551,31.570844
9,10,0.0,27.670087,33.091949


In [92]:
dn_data.loc[dn_data['label']==1,['hour','min','sec']].describe()

Unnamed: 0,hour,min,sec
count,6188.0,6188.0,6188.0
mean,0.344053,29.399806,27.79638
std,0.550139,16.558046,17.109406
min,0.0,0.0,0.0
25%,0.0,22.0,15.0
50%,0.0,26.0,26.0
75%,1.0,28.0,42.0
max,2.0,59.0,59.0


In [105]:
x_test = pd.read_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/test.csv')
x_test['timestamp'] = pd.to_datetime(x_test['timestamp'])
x_test['hour'] = x_test['timestamp'].dt.hour
x_test['min'] = x_test['timestamp'].dt.minute
x_test['sec'] = x_test['timestamp'].dt.second
x_test['year'] = x_test['timestamp'].dt.year
x_test['month'] = x_test['timestamp'].dt.month
x_test['day']=x_test['timestamp'].dt.day
x_test.drop(['Unnamed: 0', 'timestamp'], axis=1, inplace=True)

In [100]:
result = model2.predict(x_test)

In [101]:
sample['label'] = result
sample.to_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/result5.csv', index=False)

In [106]:
from sklearn.ensemble import ExtraTreesClassifier
model3=ExtraTreesClassifier()
model3.fit(x_train, y_train)
pred3 = model3.predict(x_val)
print(classification_report(y_val, pred3))

              precision    recall  f1-score   support

           1       0.99      0.97      0.98      1238
           2       0.98      0.99      0.98      1291
           3       0.99      0.92      0.95       886
           4       0.95      1.00      0.97       770
           5       0.93      0.99      0.96       582
           6       1.00      1.00      1.00       572
           7       1.00      1.00      1.00      1134
           8       1.00      1.00      1.00      1211
           9       1.00      1.00      1.00       978
          10       0.98      1.00      0.99      1014
          11       1.00      0.99      1.00       968

    accuracy                           0.99     10644
   macro avg       0.98      0.99      0.98     10644
weighted avg       0.99      0.99      0.99     10644



In [107]:
result = model3.predict(x_test)

In [108]:
sample['label'] = result
sample.to_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/result6.csv', index=False)

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
# Define the ExtraTreesClassifier model
model4 = ExtraTreesClassifier()

# Hyperparameter grid
# 'auto' is omitted from max_features: for classifiers it was an alias of 'sqrt'
# and it was removed in scikit-learn 1.3, where it raises an error.
param_grid = {
    'n_estimators': [300, 500, 1000],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None, 5, 10, 15, 25],
    'bootstrap': [True, False],
    'class_weight': [None, 'balanced'],
}
grid_search = GridSearchCV(estimator=model4, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
# Run the grid search on the training split; verbose=2 reports progress per fit
grid_search.fit(x_train, y_train)
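
Before launching an exhaustive search it can help to count how many fits it implies. The sketch below uses scikit-learn's `ParameterGrid` on a grid shaped like the one above; restricting `max_features` to `'sqrt'` and `'log2'` is an assumption of this example.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [300, 500, 1000],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5, 10, 15, 25],
    "bootstrap": [True, False],
    "class_weight": [None, "balanced"],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 2 * 5 * 2 * 2 combinations
cv_folds = 5
total_fits = n_candidates * cv_folds  # cross-validation fits, excluding the final refit
print(n_candidates, total_fits)
```

With forests of up to 1000 trees per fit, a count of this size suggests trimming the grid or switching to a randomized search when runtime is a concern.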

In [None]:
grid_search.best_params_

In [None]:
pred4 = grid_search.predict(x_val)
print(classification_report(y_val, pred4))
print(round(accuracy_score(y_val, pred4),4))

In [124]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
param_dist = {
    'n_estimators': sp_randint(50, 200),
    'max_depth': sp_randint(3, 20),
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_samples': sp_randint(2, 20),
}
# model8 must already hold an LGBMClassifier instance when this cell runs
random_search = RandomizedSearchCV(model8, param_distributions=param_dist, n_iter=10, scoring='accuracy', cv=3, verbose=2, n_jobs=-1, random_state=42)

# Run the randomized search
random_search.fit(x_train, y_train)

# Print the best hyperparameters and the best cross-validation score
print("Best Parameters: ", random_search.best_params_)
print("Best Score: ", random_search.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002282 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1653
[LightGBM] [Info] Number of data points in the train set: 42575, number of used features: 11
[LightGBM] [Info] Start training from score -2.151880
[LightGBM] [Info] Start training from score -2.109943
[LightGBM] [Info] Start training from score -2.486576
[LightGBM] [Info] Start training from score -2.625364
[LightGBM] [Info] Start training from score -2.904970
[LightGBM] [Info] Start training from score -2.923152
[LightGBM] [Info] Start training from score -2.239222
[LightGBM] [Info] Start training from score -2.173320
[LightGBM] [Info] Start training from score -2.386963
[LightGBM] [Info] Start training from score -2.351563
[LightGBM] [Info] Start training from score -2
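
To see what a randomized search actually tries, the candidates can be drawn ahead of time with `ParameterSampler`. This sketch uses only a subset of the distributions above, so the sampled values will not match the candidates evaluated in the notebook run; it is meant to illustrate that `sp_randint(low, high)` draws integers with `high` excluded.

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

# Subset of the search space, chosen for illustration only.
param_dist = {
    "n_estimators": sp_randint(50, 200),  # integers in [50, 200)
    "max_depth": sp_randint(3, 20),       # integers in [3, 20)
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}

candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=42))
for params in candidates[:3]:
    print(params)
```

Ten candidates from a space this large cover it only sparsely, so the reported best parameters are better read as a reasonable starting point than as an optimum.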

In [125]:
pred8=random_search.predict(x_val)
print(classification_report(y_val,pred8))

              precision    recall  f1-score   support

           1       0.99      0.98      0.98      1238
           2       0.99      0.99      0.99      1291
           3       0.99      0.97      0.98       886
           4       0.98      0.99      0.98       770
           5       0.96      0.99      0.98       582
           6       1.00      1.00      1.00       572
           7       1.00      1.00      1.00      1134
           8       1.00      1.00      1.00      1211
           9       1.00      1.00      1.00       978
          10       0.99      1.00      0.99      1014
          11       1.00      1.00      1.00       968

    accuracy                           0.99     10644
   macro avg       0.99      0.99      0.99     10644
weighted avg       0.99      0.99      0.99     10644



In [126]:
result = random_search.predict(x_test)
sample['label'] = result
sample.to_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/result10.csv', index=False)



In [155]:
from lightgbm import LGBMClassifier
model8 = LGBMClassifier(learning_rate=0.2, max_depth=20, n_estimators=200, subsample=0.7)
model8.fit(x_train, y_train)
pred8=model8.predict(x_val)
print(classification_report(y_val,pred8))
print(accuracy_score(y_val,pred8))

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002494 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1653
[LightGBM] [Info] Number of data points in the train set: 42575, number of used features: 11
[LightGBM] [Info] Start training from score -2.151880
[LightGBM] [Info] Start training from score -2.109943
[LightGBM] [Info] Start training from score -2.486576
[LightGBM] [Info] Start training from score -2.625364
[LightGBM] [Info] Start training from score -2.904970
[LightGBM] [Info] Start training from score -2.923152
[LightGBM] [Info] Start training from score -2.239222
[LightGBM] [Info] Start training from score -2.173320
[LightGBM] [Info] Start training from score -2.386963
[LightGBM] [Info] Start training from score -2.351563
[LightGBM] [Info] Start training from score -2.398013
              precision    recall  f1-score   support

In [160]:
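# Refit on the full dataset (X, y) before predicting the test set for submission.
# Caution: x_val is a subset of X, so the report below scores rows the model has
# already seen and is not a held-out estimate.
model8 = LGBMClassifier(learning_rate=0.22, max_depth=20, n_estimators=200, subsample=0.7, random_state=27)
model8.fit(X, y)
pred8=model8.predict(x_val)
print(classification_report(y_val,pred8))
print(accuracy_score(y_val,pred8))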

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002754 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1653
[LightGBM] [Info] Number of data points in the train set: 53219, number of used features: 11
[LightGBM] [Info] Start training from score -2.151804
[LightGBM] [Info] Start training from score -2.109870
[LightGBM] [Info] Start training from score -2.486467
[LightGBM] [Info] Start training from score -2.625563
[LightGBM] [Info] Start training from score -2.905232
[LightGBM] [Info] Start training from score -2.923244
[LightGBM] [Info] Start training from score -2.239226
[LightGBM] [Info] Start training from score -2.173366
[LightGBM] [Info] Start training from score -2.387019
[LightGBM] [Info] Start training from score -2.351469
[LightGBM] [Info] Start training from score -2.397914
              precision    recall  f1-score   support
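
When a model is fit on the full data and then scored on a validation split drawn from that same data, the score tends to look better than it would on unseen rows. The following sketch illustrates the effect on synthetic data with a decision tree; the dataset, the model, and every variable name here are assumptions made for the illustration and are unrelated to the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so a perfect held-out score is out of reach.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Honest estimate: the model never sees the validation rows.
honest = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
honest_acc = accuracy_score(y_va, honest.predict(X_va))

# Leaky estimate: the model is fit on all rows, including the validation rows.
leaky = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
leaky_acc = accuracy_score(y_va, leaky.predict(X_va))

print(f"held-out accuracy: {honest_acc:.3f}")
print(f"accuracy after fitting on all rows: {leaky_acc:.3f}")
```

A common pattern is to report the metrics from the model fit on the training split only, and to treat the full-data refit purely as the model used for the final submission.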

In [161]:
result = model8.predict(x_test)
sample['label'] = result
sample.to_csv('/content/drive/MyDrive/에이블스쿨_미니프로젝트/Mini_Project5_2/result17.csv', index=False)



### Separating cycling from non-cycling
###