Universidade Estadual de Campinas - UNICAMP

Faculdade de Engenharia Elétrica e de Computação - FEEC

### IA048 – Machine Learning

Students:
* Tiago Corrêa de Araújo de Amorim (RA: 100.675)
* Taylon L C Martins (RA: 177.379)

# Problem Set 02

## Task

Address the problem of human activity recognition (HAR) using data captured by smartphone sensors.

Use logistic regression and kNN, on both the preprocessed data and the raw data.

**Dataset**

* [UCI HAR](https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones)

* Human Activity Recognition database built from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors.

In [66]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix


## Read Preprocessed Dataset

In [46]:
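def read_csv_(path, sep=r"\s+|;|:|,"):
    """Read a header-less text file, splitting fields on whitespace, ';', ':' or ','."""
    return pd.read_csv(
        filepath_or_buffer=path,
        sep=sep,
        engine='python',  # multi-character regex separators need the Python parsing engine
        header=None)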
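
As a quick illustration of how the regex separator in `read_csv_` behaves, the sketch below parses a small in-memory sample instead of a file. The sample text, with its mix of separators, is hypothetical and only meant to show that each alternative in the pattern splits a line into the same two fields.

```python
import io

import pandas as pd

# Hypothetical sample (not taken from the dataset files): each line uses a different separator.
sample = io.StringIO("1 WALKING\n2;WALKING_UPSTAIRS\n3,WALKING_DOWNSTAIRS\n")

# Same arguments that read_csv_ passes to pandas.
df = pd.read_csv(sample, sep=r"\s+|;|:|,", engine="python", header=None)
print(df)
print(df.shape)
```

Because the pattern also splits on `:` and `,`, this helper is only suitable for files whose fields never contain those characters.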

In [49]:
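labels = read_csv_(r'..\Lista02\UCI_HAR_Dataset\activity_labels.txt')
# Drop the numeric id column and keep only the activity name.
labels.drop(0, axis=1, inplace=True)
labels = labels.rename(columns={1: 'activity'})

def get_label_name(i):
    # Activity ids are 1-based, while the DataFrame rows are 0-based.
    return labels['activity'][i-1]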

In [56]:
features = read_csv_(r'..\Lista02\UCI_HAR_Dataset\features.txt', sep=r"\s+")
# Drop the numeric id column and keep only the feature name.
features.drop(0, axis=1, inplace=True)
features = features.rename(columns={1: 'feature'})

def get_features_name(i):
    # i is the 0-based column index of X.
    return features['feature'][i]

In [13]:
X = read_csv_(r'..\Lista02\UCI_HAR_Dataset\train\X_train.txt')
y = read_csv_(r'..\Lista02\UCI_HAR_Dataset\train\y_train.txt')
X_test = read_csv_(r'..\Lista02\UCI_HAR_Dataset\test\X_test.txt')
y_test = read_csv_(r'..\Lista02\UCI_HAR_Dataset\test\y_test.txt')

In [57]:
print('Train data')
print(f'  X: {X.shape}')
print(f'  y: {y.shape}')
print('Test data')
print(f'  X: {X_test.shape}')
print(f'  y: {y_test.shape}')

Train data
  X: (7352, 561)
  y: (7352, 1)
Test data
  X: (2947, 561)
  y: (2947, 1)


In [62]:
X.describe()

| statistic | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 551 | 552 | 553 | 554 | 555 | 556 | 557 | 558 | 559 | 560 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | ... | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 | 7352.0 |
| mean | 0.274488 | -0.017695 | -0.109141 | -0.605438 | -0.510938 | -0.604754 | -0.630512 | -0.526907 | -0.60615 | -0.468604 | ... | 0.125293 | -0.307009 | -0.625294 | 0.008684 | 0.002186 | 0.008726 | -0.005981 | -0.489547 | 0.058593 | -0.056515 |
| std | 0.070261 | 0.040811 | 0.056635 | 0.448734 | 0.502645 | 0.418687 | 0.424073 | 0.485942 | 0.414122 | 0.544547 | ... | 0.250994 | 0.321011 | 0.307584 | 0.336787 | 0.448306 | 0.608303 | 0.477975 | 0.511807 | 0.29748 | 0.279122 |
| min | -1.0 | -1.0 | -1.0 | -1.0 | -0.999873 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -0.995357 | -0.999765 | -0.97658 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 25% | 0.262975 | -0.024863 | -0.120993 | -0.992754 | -0.978129 | -0.980233 | -0.993591 | -0.978162 | -0.980251 | -0.936219 | ... | -0.023692 | -0.542602 | -0.845573 | -0.121527 | -0.289549 | -0.482273 | -0.376341 | -0.812065 | -0.017885 | -0.143414 |
| 50% | 0.277193 | -0.017219 | -0.108676 | -0.946196 | -0.851897 | -0.859365 | -0.950709 | -0.857328 | -0.857143 | -0.881637 | ... | 0.134 | -0.343685 | -0.711692 | 0.009509 | 0.008943 | 0.008735 | -0.000368 | -0.709417 | 0.182071 | 0.003181 |
| 75% | 0.288461 | -0.010783 | -0.097794 | -0.242813 | -0.034231 | -0.262415 | -0.29268 | -0.066701 | -0.265671 | -0.017129 | ... | 0.289096 | -0.126979 | -0.503878 | 0.150865 | 0.292861 | 0.506187 | 0.359368 | -0.509079 | 0.248353 | 0.107659 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 0.916238 | 1.0 | 1.0 | 0.967664 | 1.0 | 1.0 | ... | 0.9467 | 0.989538 | 0.956845 | 1.0 | 1.0 | 0.998702 | 0.996078 | 1.0 | 0.478157 | 1.0 |


## Logistic Regression

Options used:

* Stratified 5-fold cross-validation.
* $\ell_2$ regularization ($\frac{1}{2} ||w||_2^2$), tuning the inverse of its strength ($C = \frac{1}{\lambda}$, where $\lambda$ is the regularization strength).
* Model selection metric during cross-validation: accuracy.
* Strategy: multinomial (cross-entropy loss).
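
To make the search over $C$ concrete, the sketch below reproduces the candidate grid that `LogisticRegressionCV` uses when `Cs` is left at its default of 10: ten values on a logarithmic scale between 1e-4 and 1e4. It assumes the default was not overridden, which matches the call used in this notebook; the variable names are illustrative.

```python
import numpy as np

# Default grid for Cs=10: ten log-spaced candidates between 1e-4 and 1e4.
Cs = np.logspace(-4, 4, 10)

for k, C in enumerate(Cs):
    print(f"{k}: C = {C:.7g}")
```

The value reported by `clf.C_` later in this notebook (2.7825594) is the sixth point of this grid, so the selected regularization strength is neither at the lower nor at the upper edge of the search range.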

Sources:

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html

* https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [74]:
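# 5-fold stratified CV over the default grid of C values; scoring defaults to accuracy.
clf = LogisticRegressionCV(cv=5,
                           solver='saga',
                           random_state=42)
# y is a single-column DataFrame, so pass the labels as a 1-D array.
clf.fit(X, y[0].values)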



In [63]:
clf.C_

array([2.7825594, 2.7825594, 2.7825594, 2.7825594, 2.7825594, 2.7825594])

In [65]:
y_pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.9623345775364778


In [67]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[494   0   2   0   0   0]
 [ 25 445   1   0   0   0]
 [  3   9 408   0   0   0]
 [  0   3   0 431  57   0]
 [  1   0   0  10 521   0]
 [  0   0   0   0   0 537]]
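
To read the matrix per activity, the sketch below recomputes recall, precision and overall accuracy from the counts printed above (rows are true classes, columns are predicted classes). The activity names are assumed to follow the order of the dataset's `activity_labels.txt` (ids 1 to 6); the matrix is copied by hand rather than taken from `conf_matrix`, so the block runs on its own.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([
    [494,   0,   2,   0,   0,   0],
    [ 25, 445,   1,   0,   0,   0],
    [  3,   9, 408,   0,   0,   0],
    [  0,   3,   0, 431,  57,   0],
    [  1,   0,   0,  10, 521,   0],
    [  0,   0,   0,   0,   0, 537],
])

# Assumed order of activity ids 1..6 in activity_labels.txt.
activities = ["WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
              "SITTING", "STANDING", "LAYING"]

diag = np.diag(cm)
recall = diag / cm.sum(axis=1)     # fraction of each true class that was recovered
precision = diag / cm.sum(axis=0)  # fraction of each predicted class that was correct
overall_accuracy = diag.sum() / cm.sum()

for name, r, p in zip(activities, recall, precision):
    print(f"{name:<20s} recall={r:.3f}  precision={p:.3f}")
print(f"accuracy={overall_accuracy:.4f}")

# Locate the largest off-diagonal entry, i.e. the most frequent confusion.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"most frequent confusion: {activities[i]} predicted as {activities[j]} ({off_diag[i, j]} samples)")
```

Under these assumptions, the dominant error is between the two static postures (SITTING predicted as STANDING), followed by WALKING_UPSTAIRS predicted as WALKING.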
