# "Stati emozionali alla guida" ("Emotional states while driving")

## Part 1 - Data manipulation

This cell imports the libraries and modules used in the notebook for data manipulation, machine learning, and evaluation. They cover file access (os, glob), DataFrames (pandas), arrays (numpy), dataset splitting, the random forest classifier, the accuracy metric, and feature scaling.

In [1]:
import os
import glob
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

This cell prepares what is needed to read several Excel files and store them as separate DataFrames for further processing or analysis: the folder path, the list of files, and the column names.

In [2]:
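# Set the folder path (raw string, so the backslashes are kept literally)
folder_path2 = r'..\Dati\Dati_Luigi_Nuovo'

excel_files2 = glob.glob(os.path.join(folder_path2, "*.xlsx"))

# List that will hold one DataFrame per file
dfs2 = []

# Column names for the import, since the Excel files have no header row.
# Note: past 'Z' this yields punctuation characters rather than 'AA', 'AB', ...;
# only columns D, H, J and L are used below, so this is harmless here.
column_headers = [chr(ord('A') + i) for i in range(32)]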

In [3]:
print(column_headers)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`']
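The printed list shows that, after 'Z', the comprehension continues with the next ASCII characters ('[', '\\', ']', ...) rather than with spreadsheet-style labels. The following is a minimal sketch of how Excel-style labels could be generated instead; the helper name `excel_column_name` is hypothetical and not part of the original notebook.

```python
def excel_column_name(index: int) -> str:
    """Return the spreadsheet-style label for a zero-based column index."""
    if index < 0:
        raise ValueError("index must be non-negative")
    label = ""
    n = index + 1  # switch to 1-based numbering (bijective base 26)
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


# Same length as the list in the notebook (32 columns)
excel_style_headers = [excel_column_name(i) for i in range(32)]
print(excel_style_headers[24:])
```

Since the notebook only selects columns D, H, J and L, both naming schemes lead to the same selection.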


This cell reads each Excel file, keeps only the columns that are needed, adds a 'File_Name' column taken from the file name, and concatenates the resulting DataFrames into a single DataFrame.

In [4]:
# For each file, load the data into a DataFrame
for excel_file2 in excel_files2:
    # Read the file once (it has no header row), then name its columns A, B, C, ...
    df2 = pd.read_excel(excel_file2, header=None)
    df2.columns = column_headers[:df2.shape[1]]
    df2['File_Name'] = os.path.basename(excel_file2)  # Add a 'File_Name' column holding the file's base name

    df2 = df2[['File_Name', 'D', 'H', 'J', 'L']]  # Keep only file name, Time, HR, RR, HRV

    dfs2.append(df2)

# Concatenate all the DataFrames into a single one
merged_df2 = pd.concat(dfs2)

In [5]:
print(merged_df2)

                  File_Name         D   H   J     L
0     AVL_BD_20211104G.xlsx  10:07:43  73  22   872
1     AVL_BD_20211104G.xlsx  10:07:43  73  22   872
2     AVL_BD_20211104G.xlsx  10:07:44  72  22   849
3     AVL_BD_20211104G.xlsx  10:07:45  74  22   842
4     AVL_BD_20211104G.xlsx  10:07:46  71  24  1068
...                     ...       ...  ..  ..   ...
5587  AVL_SN_20211103G.xlsx  18:25:41  73  15   812
5588  AVL_SN_20211103G.xlsx  18:25:42  73  15   733
5589  AVL_SN_20211103G.xlsx  18:25:43  74  15   760
5590  AVL_SN_20211103G.xlsx  18:25:44  74  15   747
5591  AVL_SN_20211103G.xlsx  18:25:45  75  15   762

[105616 rows x 5 columns]
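In the output above the last row is labeled 5591 although the frame has 105616 rows: `pd.concat` keeps each file's own index, so labels repeat across files. The sketch below, on two made-up toy frames, shows the difference that `ignore_index=True` makes; whether a fresh index is wanted here is a judgment call, since the later merge does not use the index.

```python
import pandas as pd

# Two toy frames standing in for two per-file DataFrames (values are made up)
first = pd.DataFrame({"File_Name": ["a.xlsx", "a.xlsx"], "HR": [73, 72]})
second = pd.DataFrame({"File_Name": ["b.xlsx", "b.xlsx"], "HR": [80, 81]})

# Default behavior: each frame keeps its own row labels
kept_index = pd.concat([first, second])

# ignore_index=True: rows are relabeled 0..n-1 in the result
fresh_index = pd.concat([first, second], ignore_index=True)

print(kept_index.index.tolist())
print(fresh_index.index.tolist())
```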


This cell reads the Excel files with the arousal annotations, adds two extra columns to each DataFrame, and collects the DataFrames in a list for further processing.

In [6]:
folder_path1 = r"..\Dati\Dati_Luigi_arousal"

excel_files1 = glob.glob(os.path.join(folder_path1, "*.xlsx"))

dfs1 = []

for i, excel_file1 in enumerate(excel_files1):
    df1 = pd.read_excel(excel_file1)
    df1['File_Name'] = os.path.basename(excel_file1)
    df1['Progressivo'] = i  # Add a 'Progressivo' (sequence number) column with the file's index, which may help when studying a single file
    dfs1.append(df1)
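The 'Progressivo' number depends on the order in which `glob.glob` returns the files, and that order is not guaranteed to be the same on every system. A small sketch, assuming a stable numbering is wanted, is to sort the paths first; the helper name `list_excel_files` and the file names in the demonstration are hypothetical.

```python
import glob
import os
import tempfile


def list_excel_files(folder: str) -> list[str]:
    """Return the .xlsx paths found in `folder`, sorted by name."""
    return sorted(glob.glob(os.path.join(folder, "*.xlsx")))


# Demonstration on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.xlsx", "a.xlsx", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    demo_names = [os.path.basename(p) for p in list_excel_files(tmp)]

print(demo_names)
```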

In [7]:
# concatenate all the DataFrames
merged_df1 = pd.concat(dfs1)

merged_df1.reset_index(drop=True, inplace=True)

In [8]:
print(merged_df1)

            Time  Valance  Arousal              File_Name  Progressivo
0       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0
1       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0
2       10:07:44       -1      201  AVL_BD_20211104G.xlsx            0
3       10:07:45       -1      201  AVL_BD_20211104G.xlsx            0
4       10:07:46       -1      201  AVL_BD_20211104G.xlsx            0
...          ...      ...      ...                    ...          ...
105599  18:25:41       84      116  AVL_SN_20211103G.xlsx           14
105600  18:25:42       83      117  AVL_SN_20211103G.xlsx           14
105601  18:25:43       82      118  AVL_SN_20211103G.xlsx           14
105602  18:25:44       81      119  AVL_SN_20211103G.xlsx           14
105603  18:25:45       80      120  AVL_SN_20211103G.xlsx           14

[105604 rows x 5 columns]


In [9]:
# rename the columns to more descriptive names
merged_df2 = merged_df2.rename(columns={'D':'Time','H':'HR','J':'RR','L':'HRV'})

In [10]:
print(merged_df2)

                  File_Name      Time  HR  RR   HRV
0     AVL_BD_20211104G.xlsx  10:07:43  73  22   872
1     AVL_BD_20211104G.xlsx  10:07:43  73  22   872
2     AVL_BD_20211104G.xlsx  10:07:44  72  22   849
3     AVL_BD_20211104G.xlsx  10:07:45  74  22   842
4     AVL_BD_20211104G.xlsx  10:07:46  71  24  1068
...                     ...       ...  ..  ..   ...
5587  AVL_SN_20211103G.xlsx  18:25:41  73  15   812
5588  AVL_SN_20211103G.xlsx  18:25:42  73  15   733
5589  AVL_SN_20211103G.xlsx  18:25:43  74  15   760
5590  AVL_SN_20211103G.xlsx  18:25:44  74  15   747
5591  AVL_SN_20211103G.xlsx  18:25:45  75  15   762

[105616 rows x 5 columns]


In [11]:
# join the two DataFrames on file name and time. With how='left' every row of merged_df1 is kept:
# rows whose Time has no match in merged_df2 get NaN in HR, RR and HRV (how='inner' would discard them instead)
joined_df = pd.merge(merged_df1, merged_df2, on=['File_Name','Time'], how='left')

print(joined_df)

            Time  Valance  Arousal              File_Name  Progressivo  HR   
0       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73  \
1       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
2       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
3       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
4       10:07:44       -1      201  AVL_BD_20211104G.xlsx            0  72   
...          ...      ...      ...                    ...          ...  ..   
169461  18:25:41       84      116  AVL_SN_20211103G.xlsx           14  73   
169462  18:25:42       83      117  AVL_SN_20211103G.xlsx           14  73   
169463  18:25:43       82      118  AVL_SN_20211103G.xlsx           14  74   
169464  18:25:44       81      119  AVL_SN_20211103G.xlsx           14  74   
169465  18:25:45       80      120  AVL_SN_20211103G.xlsx           14  75   

        RR  HRV  
0       22  872  
1       22  872  
2       2
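The joined frame has more rows (169466) than either input (105604 and 105616). This is what a merge does when the same (File_Name, Time) key occurs more than once on both sides, as with the two 10:07:43 rows at the top of each frame. The sketch below reproduces the effect on made-up toy data and shows one possible way to keep a single right-hand row per key; whether dropping duplicates is appropriate for these recordings is an assumption, not something the notebook states.

```python
import pandas as pd

# Toy frames with a repeated (File_Name, Time) key on both sides (made-up values)
left = pd.DataFrame({
    "File_Name": ["f.xlsx", "f.xlsx", "f.xlsx"],
    "Time": ["10:07:43", "10:07:43", "10:07:44"],
    "Arousal": [201, 201, 201],
})
right = pd.DataFrame({
    "File_Name": ["f.xlsx", "f.xlsx", "f.xlsx"],
    "Time": ["10:07:43", "10:07:43", "10:07:44"],
    "HR": [73, 73, 72],
})
keys = ["File_Name", "Time"]

# Every left row pairs with every right row sharing its key: 2 x 2 + 1 = 5 rows
many_to_many = pd.merge(left, right, on=keys, how="left")

# Keeping one right-hand row per key gives one output row per left row.
# validate= raises if the right side still has duplicate keys;
# indicator= adds a '_merge' column telling where each row came from.
right_unique = right.drop_duplicates(subset=keys)
many_to_one = pd.merge(left, right_unique, on=keys, how="left",
                       validate="many_to_one", indicator=True)

print(len(left), len(many_to_many), len(many_to_one))
```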

## Part 2 - Equal Width Binning

The arousal values are cut automatically into 4 equal-width bins.

In [12]:
joined_df['Arousal_Bin'] = pd.cut(joined_df['Arousal'], bins=4)
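To see where the bin edges fall, `pd.cut` can return them with `retbins=True`. The sketch below uses made-up values in place of the 'Arousal' column and compares the edges with evenly spaced points from the minimum to the maximum. pandas lowers the first edge slightly so that the minimum value falls inside the first right-closed interval.

```python
import numpy as np
import pandas as pd

# Made-up values spanning 0..200, standing in for the 'Arousal' column
values = pd.Series([0, 10, 50, 100, 150, 200])

# labels=False returns the bin number directly; retbins=True also returns the edges
codes, edges = pd.cut(values, bins=4, labels=False, retbins=True)

# Equal-width edges computed by hand: 5 evenly spaced points from min to max
manual_edges = np.linspace(values.min(), values.max(), 5)

print(codes.tolist())
print(edges)
print(manual_edges)
```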

In [13]:
print(joined_df)

            Time  Valance  Arousal              File_Name  Progressivo  HR   
0       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73  \
1       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
2       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
3       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
4       10:07:44       -1      201  AVL_BD_20211104G.xlsx            0  72   
...          ...      ...      ...                    ...          ...  ..   
169461  18:25:41       84      116  AVL_SN_20211103G.xlsx           14  73   
169462  18:25:42       83      117  AVL_SN_20211103G.xlsx           14  73   
169463  18:25:43       82      118  AVL_SN_20211103G.xlsx           14  74   
169464  18:25:44       81      119  AVL_SN_20211103G.xlsx           14  74   
169465  18:25:45       80      120  AVL_SN_20211103G.xlsx           14  75   

        RR  HRV     Arousal_Bin  
0       22  872  (151.0, 201.

In [14]:
# each bin gets a sequence number from 0 to 3, in increasing order of its bounds

# sort the bins
sorted_bins = sorted(joined_df['Arousal_Bin'].unique())

# map each bin to its number
bin_mapping = {bin_val: i for i, bin_val in enumerate(sorted_bins)}

# store the mapping in a new column
joined_df['Bin_Num'] = joined_df['Arousal_Bin'].map(bin_mapping)
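Because `pd.cut` returns an ordered categorical, the same numbering is also available as the categorical codes. The sketch below, on made-up values, checks that the two approaches agree when every bin contains at least one value. If a bin were empty, numbering the observed bins would skip it while `.cat.codes` would not, so the two could differ.

```python
import pandas as pd

values = pd.Series([0, 10, 50, 100, 150, 200])  # made-up values
binned = pd.cut(values, bins=4)

# Mapping built as in the cell above: number the observed bins in sorted order
sorted_bins = sorted(binned.unique())
bin_mapping = {bin_val: i for i, bin_val in enumerate(sorted_bins)}
mapped = [bin_mapping[b] for b in binned]

# Categorical codes: position of each value's bin among all categories
codes = binned.cat.codes.tolist()

print(mapped, codes)
```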

In [15]:
print(joined_df)

            Time  Valance  Arousal              File_Name  Progressivo  HR   
0       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73  \
1       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
2       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
3       10:07:43       -1      201  AVL_BD_20211104G.xlsx            0  73   
4       10:07:44       -1      201  AVL_BD_20211104G.xlsx            0  72   
...          ...      ...      ...                    ...          ...  ..   
169461  18:25:41       84      116  AVL_SN_20211103G.xlsx           14  73   
169462  18:25:42       83      117  AVL_SN_20211103G.xlsx           14  73   
169463  18:25:43       82      118  AVL_SN_20211103G.xlsx           14  74   
169464  18:25:44       81      119  AVL_SN_20211103G.xlsx           14  74   
169465  18:25:45       80      120  AVL_SN_20211103G.xlsx           14  75   

        RR  HRV     Arousal_Bin Bin_Num  
0       22  872  (151

## Part 3 - Random Forest

Chosen hyperparameters:
150 trees (n_estimators), found with a grid search to improve accuracy; 100 trees were too few for this problem
at most 30 leaves per tree (max_leaf_nodes)
at most 10 levels per tree (max_depth)
at least 5 samples to split a node (min_samples_split)
at least 2 samples per leaf (min_samples_leaf)
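The notebook does not show the grid search itself. As an illustration only, a search over the number of trees could look like the sketch below; the synthetic data, the grid values and the 3-fold cross-validation are all assumptions chosen to keep the example small, not the settings used by the author.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for the (HR, RR, HRV, time) features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Hypothetical grid: only the number of trees varies, the other values are fixed
param_grid = {"n_estimators": [20, 50]}
base_model = RandomForestClassifier(
    max_depth=10, min_samples_split=5, min_samples_leaf=2,
    max_features="sqrt", max_leaf_nodes=30, random_state=0,
)

# Each candidate is scored by cross-validated accuracy
search = GridSearchCV(base_model, param_grid, scoring="accuracy", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```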

In [16]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# lists that store the models and their accuracies
rf_models = []
rf_accuracies = []

for file_name in joined_df['File_Name'].unique():

    file_data = joined_df[joined_df['File_Name'] == file_name]

    # inputs X and output y, where y is the label
    # (.copy() makes X an independent DataFrame, which avoids
    # pandas' SettingWithCopyWarning when columns are added below)
    X = file_data[['HR', 'RR', 'HRV']].copy()
    y = file_data['Bin_Num']

    # split the 'Time' column into hour, minute and second
    X.loc[:, 'Hour'] = file_data['Time'].apply(lambda x: x.hour)
    X.loc[:, 'Minute'] = file_data['Time'].apply(lambda x: x.minute)
    X.loc[:, 'Second'] = file_data['Time'].apply(lambda x: x.second)


    # scale the features to the range 0 to 1
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # split into training and test sets (70% training, 30% test)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

    # create the RF model
    # rf_model = RandomForestClassifier()
    rf_model = RandomForestClassifier(max_depth=10, min_samples_split=5, min_samples_leaf=2, max_features='sqrt', n_estimators=150, max_leaf_nodes=30)


    # fit the model on the training set
    rf_model.fit(X_train, y_train)

    # evaluate the model on the test set
    y_pred = rf_model.predict(X_test)

    # compute the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    rf_accuracies.append(accuracy)

    # store the model
    rf_models.append(rf_model)
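In the cell above the scaler is fitted on all rows of a file before the split, so the test rows contribute to the minimum and maximum used for scaling. A possible variation, shown here on synthetic data and not taken from the notebook, is to put the scaler and the classifier in a pipeline so that scaling is learned from the training rows only. For a random forest the scaling itself has little effect, because tree splits do not change under a monotonic rescaling of a feature; the pattern matters more for scale-sensitive models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 4-class data standing in for one file's features and labels
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits MinMaxScaler on the training rows only, then reuses
# those minimum/maximum values when transforming the test rows
pipeline = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(
        n_estimators=150, max_depth=10, min_samples_split=5,
        min_samples_leaf=2, max_features="sqrt", max_leaf_nodes=30,
        random_state=42,  # fixed seed, so repeated runs give the same forest
    ),
)
pipeline.fit(X_train, y_train)

demo_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(demo_accuracy)
```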






In [17]:
for i, accuracy in enumerate(rf_accuracies):
    print(f"Accuracy for File {i+1}: {accuracy}")

Accuracy for File 1: 0.8257825370675453
Accuracy for File 2: 0.7508884150675196
Accuracy for File 3: 0.9527104959630911
Accuracy for File 4: 0.8317838466008135
Accuracy for File 5: 0.8083333333333333
Accuracy for File 6: 0.7659005879208979
Accuracy for File 7: 0.8566344687341129
Accuracy for File 8: 0.7649205108157414
Accuracy for File 9: 0.848816029143898
Accuracy for File 10: 0.7380254154447703
Accuracy for File 11: 0.7336013400335009
Accuracy for File 12: 0.9035977859778598
Accuracy for File 13: 0.9026315789473685
Accuracy for File 14: 0.8984375
Accuracy for File 15: 0.8179214402618658


In [18]:
print("Max  accuracy for RF: ",max(rf_accuracies))
print("Min  accuracy for RF: ",min(rf_accuracies))
print("Mean accuracy for RF: ",sum(rf_accuracies)/len(rf_accuracies))

Max  accuracy for RF:  0.9527104959630911
Min  accuracy for RF:  0.7336013400335009
Mean accuracy for RF:  0.8266656856874879


## Part 4 - SVM

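The data are not linearly separable: kernel='linear' gave worse results, so the RBF kernel is used.
C = 500 still gives a soft-margin SVM, but such a large C penalizes margin violations heavily, so the fit is closer to a hard margin than with the default C = 1.0.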
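The effect of the kernel on data that no straight line can separate can be illustrated with a small sketch. The two-ring data set below is synthetic and only stands in for the idea; it says nothing about how the physiological data are actually distributed.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-ring data: one class forms a ring around the other
X_demo, y_demo = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Same C for both kernels, so only the shape of the decision boundary differs
scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(C=500.0, kernel=kernel, gamma="scale")
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

print(scores)
```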

In [19]:
from sklearn.svm import SVC

svm_models = []
svm_accuracies = []

for file_name in joined_df['File_Name'].unique():

    file_data = joined_df[joined_df['File_Name'] == file_name]

    # inputs and label; .copy() avoids pandas' SettingWithCopyWarning below
    X = file_data[['HR', 'RR', 'HRV']].copy()
    y = file_data['Bin_Num']

    # split the 'Time' column into hour, minute and second
    X.loc[:, 'Hour'] = file_data['Time'].apply(lambda x: x.hour)
    X.loc[:, 'Minute'] = file_data['Time'].apply(lambda x: x.minute)
    X.loc[:, 'Second'] = file_data['Time'].apply(lambda x: x.second)

    # scale the features to the range 0 to 1
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # split into training and test sets (70% training, 30% test)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

    # RBF-kernel SVM with C=500 (earlier attempts are left commented out)
    #svm_model = SVC()
    #svm_model = SVC(C=1.0, kernel='rbf', gamma='scale', shrinking=True)
    svm_model = SVC(C=500.0, kernel='rbf', gamma='scale', shrinking=True, decision_function_shape='ovr', random_state=42)

    # fit on the training set
    svm_model.fit(X_train, y_train)

    # predict on the test set
    y_pred = svm_model.predict(X_test)

    # compute and store the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    svm_accuracies.append(accuracy)

    # store the model
    svm_models.append(svm_model)




In [20]:
for i, accuracy in enumerate(svm_accuracies):
    print(f"Accuracy for File {i+1}: {accuracy}")

Accuracy for File 1: 0.8434925864909391
Accuracy for File 2: 0.7853589196872779
Accuracy for File 3: 0.9238754325259516
Accuracy for File 4: 0.8672283556072051
Accuracy for File 5: 0.7929487179487179
Accuracy for File 6: 0.7808658471405665
Accuracy for File 7: 0.8235892221657346
Accuracy for File 8: 0.8571800886108939
Accuracy for File 9: 0.8583788706739527
Accuracy for File 10: 0.8198110133594004
Accuracy for File 11: 0.8542043551088777
Accuracy for File 12: 0.8731549815498155
Accuracy for File 13: 0.9026315789473685
Accuracy for File 14: 0.8152901785714286
Accuracy for File 15: 0.8694762684124386


In [21]:
print("Max  accuracy for SVM: ",max(svm_accuracies))
print("Min  accuracy for SVM: ",min(svm_accuracies))
print("Mean accuracy for SVM: ",sum(svm_accuracies)/len(svm_accuracies))

Max  accuracy for SVM:  0.9238754325259516
Min  accuracy for SVM:  0.7808658471405665
Mean accuracy for SVM:  0.8444990944533711
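The two per-file listings can be compared side by side. The sketch below copies the printed accuracies (rounded to four decimals) and counts the files on which each model scores higher. Note that the random forest in Part 3 is created without a fixed random_state, so rerunning the notebook may give slightly different random forest figures and the counts are only indicative.

```python
# Accuracies copied from the two per-file listings above (rounded to 4 decimals)
rf = [0.8258, 0.7509, 0.9527, 0.8318, 0.8083, 0.7659, 0.8566, 0.7649,
      0.8488, 0.7380, 0.7336, 0.9036, 0.9026, 0.8984, 0.8179]
svm = [0.8435, 0.7854, 0.9239, 0.8672, 0.7929, 0.7809, 0.8236, 0.8572,
       0.8584, 0.8198, 0.8542, 0.8732, 0.9026, 0.8153, 0.8695]

# Count, file by file, which model has the higher accuracy
svm_better = sum(s > r for r, s in zip(rf, svm))
rf_better = sum(r > s for r, s in zip(rf, svm))
ties = sum(r == s for r, s in zip(rf, svm))

# Difference between the mean accuracies (SVM minus random forest)
mean_gain = sum(svm) / len(svm) - sum(rf) / len(rf)

print(f"SVM higher on {svm_better} files, RF higher on {rf_better}, ties: {ties}")
print(f"Mean accuracy difference (SVM - RF): {mean_gain:.4f}")
```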
