https://www.kaggle.com/competitions/brist1d/data 

**Files**  
**activities.txt** - a list of activity names that appear in the activity-X:XX columns  
**sample_submission.csv** - a sample submission file in the correct format  
**test.csv** - the test set  
**train.csv** - the training set

**Columns**  
**train.csv**  
- id - row id consisting of participant number and a count for that participant  
- p_num - participant number  
- time - time of day in the format HH:MM:SS  
- bg-X:XX - blood glucose reading in mmol/L, X:XX(H:MM) time in the past (e.g. bg-2:35, would be the blood glucose reading from 2 hours and 35 minutes before the time value for that row), recorded by the continuous glucose monitor  
- insulin-X:XX - total insulin dose received in units in the last 5 minutes, X:XX(H:MM) time in the past (e.g. insulin-2:35, would be the total insulin dose received between 2 hours and 40 minutes and 2 hours and 35 minutes before the time value for that row), recorded by the insulin pump  
- carbs-X:XX - total carbohydrate value consumed in grammes in the last 5 minutes, X:XX(H:MM) time in the past (e.g. carbs-2:35, would be the total carbohydrate value consumed between 2 hours and 40 minutes and 2 hours and 35 minutes before the time value for that row), recorded by the participant
- hr-X:XX - mean heart rate in beats per minute in the last 5 minutes, X:XX(H:MM) time in the past (e.g. hr-2:35, would be the mean heart rate between 2 hours and 40 minutes and 2 hours and 35 minutes before the time value for that row), recorded by the smartwatch  
- steps-X:XX - total steps walked in the last 5 minutes, X:XX(H:MM) time in the past (e.g. steps-2:35, would be the total steps walked between 2 hours and 40 minutes and 2 hours and 35 minutes before the time value for that row), recorded by the smartwatch  
- cals-X:XX - total calories burnt in the last 5 minutes, X:XX(H:MM) time in the past (e.g. cals-2:35, would be the total calories burned between 2 hours and 40 minutes and 2 hours and 35 minutes before the time value for that row), calculated by the smartwatch  
- activity-X:XX - self-declared activity performed in the last 5 minutes, X:XX(H:MM) time in the past (e.g. activity-2:35, would show a string name of the activity performed between 2 hours and 40 minutes and 2 hours and 35 minutes before the time value for that row), set on the smartwatch  
- bg+1:00 - blood glucose reading in mmol/L an hour in the future, this is the value you will be predicting **(not provided in test.csv)**
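
The naming scheme described above can be decoded mechanically. The following is a minimal sketch that is not part of the original notebook; the helper name `parse_lag_column` is hypothetical. It splits a column name into its signal and a signed offset in minutes relative to the row's `time` value (negative for the past, positive for the future).

```python
import re

# Matches names such as "bg-2:35", "insulin-0:05" or "bg+1:00"
_LAG_PATTERN = re.compile(r"^([a-z]+)([+-])(\d+):(\d{2})$")


def parse_lag_column(column):
    """Return (signal, offset_minutes) for a lagged column, or None otherwise."""
    match = _LAG_PATTERN.match(column)
    if match is None:
        return None  # e.g. "id", "p_num", "time"
    signal, sign, hours, minutes = match.groups()
    offset = int(hours) * 60 + int(minutes)
    return signal, (offset if sign == "+" else -offset)


print(parse_lag_column("bg-2:35"))  # ('bg', -155)
print(parse_lag_column("bg+1:00"))  # ('bg', 60)
print(parse_lag_column("p_num"))    # None
```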

Only the last hour of history is used. All features are used, and the DataFrame is read with index_col='id'. 'time' is converted to 'hour' and 'minute'. In the activity columns, NaN is replaced with the category 'No activity'.  

Note the problems with the insulin-X:XX columns for participant p11.

The algorithm is HistGradientBoostingRegressor, an implementation of gradient-boosted trees. This estimator has built-in support for missing values (NaN) and categorical data.  
HistGradientBoostingRegressor(l2_regularization=0, max_leaf_nodes=np.int64(100),
                              min_samples_leaf=np.int64(80),
                              random_state=RandomState(MT19937) at 0x2B71B6F2140)   

Cross-validation is performed   
with cv=KFold(n_splits=5, shuffle=True, random_state=rng). Resulting RMSE: mean = 1.8112113344521983, std = 0.010219906419843432  
with cv=GroupKFold, where the groups are formed by patient (p_num) and the data are shuffled beforehand. Resulting RMSE: mean = 2.0924813206032367, std = 0.11052920756134064  
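
The gap between the two scores is consistent with how the splitters treat patients: KFold can place rows of one patient on both sides of a split, while GroupKFold keeps each patient entirely in train or entirely in test. Since the test.csv count table further down lists patients such as p15 and p16 that do not occur in train.csv, the grouped estimate is probably the more relevant one. The sketch below uses six hypothetical patients with four rows each; `shared_patients` is an illustrative helper, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Toy data: 6 hypothetical patients with 4 rows each
groups = np.repeat(["p01", "p02", "p03", "p04", "p05", "p06"], 4)
X_toy = np.arange(len(groups)).reshape(-1, 1)


def shared_patients(cv, **split_kwargs):
    """Count the patients that occur on both sides of each split."""
    return [
        len(set(groups[train_idx]) & set(groups[test_idx]))
        for train_idx, test_idx in cv.split(X_toy, **split_kwargs)
    ]


kfold_overlap = shared_patients(KFold(n_splits=3, shuffle=True, random_state=0))
group_overlap = shared_patients(GroupKFold(n_splits=3), groups=groups)

print(kfold_overlap)  # typically non-zero: the same patient is in train and test
print(group_overlap)  # [0, 0, 0]
```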

train and valid: the validation set contains all of patient p03 plus the last 1172 rows of each of the other patients; the rest (80%) stays in the training set. The rows within each set are shuffled.  
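RMSE = 2.124073113303095 (the run recorded in cell 34 gives 2.126720496212023; the shuffles in cell 32 are unseeded, so the value can vary between runs)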
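
The choice of 1172 rows can be checked with simple arithmetic. The sketch below uses the per-patient row counts from the count table in cell 5 and shows that p03 plus 1172 rows from each of the other eight patients comes to roughly 20% of train.csv.

```python
# Rows per patient in train.csv, taken from the count table in cell 5
rows_per_patient = {
    "p01": 8459, "p02": 25872, "p03": 26028, "p04": 24686, "p05": 8288,
    "p06": 8383, "p10": 25454, "p11": 24555, "p12": 25299,
}
tail_rows = 1172  # last rows taken from every patient except p03

total_rows = sum(rows_per_patient.values())
other_patients = [p for p in rows_per_patient if p != "p03"]
valid_rows = rows_per_patient["p03"] + tail_rows * len(other_patients)

print(total_rows)                         # 177024
print(valid_rows)                         # 35404
print(round(valid_rows / total_rows, 4))  # 0.2
```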

In [1]:
import numpy as np

import pandas as pd
from pandas.api.types import CategoricalDtype

from sklearn.model_selection import GridSearchCV, GroupKFold, cross_val_score
from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import HistGradientBoostingRegressor

In [2]:
random_state = 42

In [3]:
csv_file_path = 'train.csv'
train_data = pd.read_csv(csv_file_path, index_col='id')

train_data.head()



id,p_num,time,bg-5:55,bg-5:50,bg-5:45,bg-5:40,bg-5:35,bg-5:30,bg-5:25,bg-5:20,...,activity-0:40,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00,bg+1:00
p01_0,p01,06:10:00,,,9.6,,,9.7,,,...,,,,,,,,,,13.4
p01_1,p01,06:25:00,,,9.7,,,9.2,,,...,,,,,,,,,,12.8
p01_2,p01,06:40:00,,,9.2,,,8.7,,,...,,,,,,,,,,15.5
p01_3,p01,06:55:00,,,8.7,,,8.4,,,...,,,,,,,,,,14.8
p01_4,p01,07:10:00,,,8.4,,,8.1,,,...,,,,,,,,,,12.7


In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 177024 entries, p01_0 to p12_25298
Columns: 507 entries, p_num to bg+1:00
dtypes: float64(433), object(74)
memory usage: 686.1+ MB


In [5]:
# Number of non-null values in each column, per patient
train_data.groupby(['p_num']).count()

p_num,time,bg-5:55,bg-5:50,bg-5:45,bg-5:40,bg-5:35,bg-5:30,bg-5:25,bg-5:20,bg-5:15,...,activity-0:40,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00,bg+1:00
p01,8459,71,3896,4309,64,3759,4454,57,3625,4595,...,341,342,329,341,348,331,342,348,334,8459
p02,25872,25482,25483,25484,25485,25486,25487,25488,25489,25489,...,244,245,246,246,245,244,243,242,241,25872
p03,26028,25701,25702,25703,25704,25705,25706,25707,25708,25709,...,112,112,112,112,112,112,112,112,112,26028
p04,24686,24396,24396,24396,24396,24395,24395,24395,24395,24396,...,284,284,284,284,284,284,285,284,283,24686
p05,8288,439,2750,4622,407,2674,4731,377,2591,4845,...,94,102,98,95,104,99,98,106,102,8288
p06,8383,108,2722,5255,104,2618,5368,100,2515,5479,...,215,218,219,214,222,223,219,223,226,8383
p10,25454,25106,25105,25105,25105,25105,25106,25105,25105,25105,...,903,902,901,901,901,902,903,904,906,25454
p11,24555,23914,23920,23925,23930,23935,23940,23945,23950,23955,...,465,463,462,460,458,456,455,454,453,24555
p12,25299,24553,24559,24565,24571,24577,24583,24589,24595,24601,...,80,80,80,80,80,80,80,80,80,25299


In [6]:
# Descriptive statistics per patient
train_data.groupby(['p_num']).describe()

,bg-5:55,bg-5:55,bg-5:55,bg-5:55,bg-5:55,bg-5:55,bg-5:55,bg-5:55,bg-5:50,bg-5:50,...,cals-0:00,cals-0:00,bg+1:00,bg+1:00,bg+1:00,bg+1:00,bg+1:00,bg+1:00,bg+1:00,bg+1:00
p_num,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
p01,71.0,9.459155,3.2173,4.7,6.2,9.9,12.1,14.6,3896.0,8.884702,...,10.98,53.0,8459.0,8.888781,4.132187,2.2,5.6,8.1,11.5,27.8
p02,25482.0,9.309242,2.917305,2.2,7.2,8.8,10.9,22.2,25483.0,9.309724,...,5.4,116.1,25872.0,9.338358,2.92694,2.2,7.2,8.8,11.0,22.2
p03,25701.0,8.576223,3.141339,2.2,6.3,7.9,10.4,22.2,25702.0,8.576134,...,7.3,65.85,26028.0,8.58308,3.140478,2.2,6.3,7.9,10.4,22.2
p04,24396.0,7.763166,2.249274,2.2,6.2,7.4,9.0,18.4,24396.0,7.762408,...,6.18,42.51,24686.0,7.761359,2.246061,2.2,6.2,7.4,9.0,18.4
p05,439.0,8.966515,3.009345,3.2,6.7,8.8,10.85,17.3,2750.0,8.051091,...,6.67,45.38,8288.0,8.135582,3.117917,2.2,5.8,7.8,10.1,22.2
p06,108.0,8.815741,2.552593,4.2,6.55,9.1,10.8,13.7,2722.0,8.717377,...,13.19,68.03,8383.0,8.936872,3.76678,2.9,6.2,8.1,10.8,27.8
p10,25106.0,6.366976,1.57987,2.2,5.3,6.0,7.1,15.9,25105.0,6.367222,...,11.99,71.77,25454.0,6.372932,1.576365,2.2,5.3,6.0,7.2,15.9
p11,23914.0,9.378527,2.88051,2.2,7.3,9.2,11.3,20.8,23920.0,9.37819,...,7.62,66.39,24555.0,9.380721,2.88773,2.2,7.3,9.3,11.3,21.6
p12,24553.0,7.862624,2.837006,2.8,6.0,7.2,9.0,22.2,24559.0,7.862828,...,20.69,57.88,25299.0,7.847757,2.830094,2.8,6.0,7.2,9.0,22.2


In [7]:
train_data.duplicated().sum()

np.int64(0)

In [8]:
list_columns = ['bg', 'insulin', 'carbs', 'hr', 'steps', 'cals', 'activity']

In [9]:
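def data_hour(data=train_data):
    """Keep only the last hour of history: drop every X-5:55 ... X-1:00 column."""
    df = data.copy()
    for name in list_columns:
        # Label-based slicing is inclusive on both ends and relies on column order;
        # slice the frame that was passed in, not the global train_data
        old_columns = df.loc[:, f'{name}-5:55':f'{name}-1:00'].columns
        df = df.drop(columns=old_columns)
    return df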

In [10]:
train_data_hour = data_hour()
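
`data_hour` relies on the columns of each signal being stored contiguously and in chronological order. As an alternative, the following sketch selects the same columns by parsing the lag from each name, so it does not depend on column order. It runs on a one-row toy frame; `is_older_than_last_hour` and `lag_pattern` are hypothetical names that do not appear in the notebook.

```python
import re

import pandas as pd

# Toy frame with a handful of columns in the train.csv naming scheme
toy = pd.DataFrame(
    [["p01", "06:10:00", 9.6, 9.7, 17.5, 16.2, 13.4]],
    columns=["p_num", "time", "bg-5:55", "bg-1:00", "bg-0:55", "bg-0:00", "bg+1:00"],
)

lag_pattern = re.compile(r"^[a-z]+-(\d+):(\d{2})$")


def is_older_than_last_hour(column):
    """True for X-H:MM columns that lie 60 or more minutes in the past."""
    match = lag_pattern.match(column)
    if match is None:
        return False  # p_num, time and the bg+1:00 target are kept
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes >= 60


last_hour = toy.drop(columns=[c for c in toy.columns if is_older_than_last_hour(c)])
print(list(last_hour.columns))
# ['p_num', 'time', 'bg-0:55', 'bg-0:00', 'bg+1:00']
```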

In [11]:
train_data_hour

id,p_num,time,bg-0:55,bg-0:50,bg-0:45,bg-0:40,bg-0:35,bg-0:30,bg-0:25,bg-0:20,...,activity-0:40,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00,bg+1:00
p01_0,p01,06:10:00,,,17.5,,,17.3,,,...,,,,,,,,,,13.4
p01_1,p01,06:25:00,,,17.3,,,16.2,,,...,,,,,,,,,,12.8
p01_2,p01,06:40:00,,,16.2,,,15.1,,,...,,,,,,,,,,15.5
p01_3,p01,06:55:00,,,15.1,,,14.4,,,...,,,,,,,,,,14.8
p01_4,p01,07:10:00,,,14.4,,,13.9,,,...,,,,,,,,,,12.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
p12_25294,p12,23:35:00,6.3,6.5,6.9,7.5,7.9,8.2,8.7,8.6,...,,,,,,,,,,11.1
p12_25295,p12,23:40:00,6.5,6.9,7.5,7.9,8.2,8.7,8.6,8.9,...,,,,,,,,,,10.9
p12_25296,p12,23:45:00,6.9,7.5,7.9,8.2,8.7,8.6,8.9,9.3,...,,,,,,,,,,10.7
p12_25297,p12,23:50:00,7.5,7.9,8.2,8.7,8.6,8.9,9.3,9.7,...,,,,,,,,,,10.5


In [12]:
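with open("activities.txt", "r") as file:
    # Strip the trailing newline from each line and skip blank lines
    activities = [line.strip() for line in file if line.strip()]
activities.append('No activity')  # extra category used to fill NaN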
activities

['Indoor climbing',
 'Run',
 'Strength training',
 'Swim',
 'Bike',
 'Dancing',
 'Stairclimber',
 'Spinning',
 'Walking',
 'HIIT',
 'Outdoor Bike',
 'Walk',
 'Aerobic Workout',
 'Tennis',
 'Workout',
 'Hike',
 'Zumba',
 'Sport',
 'Yoga',
 'Swimming',
 'Weights',
 'Running',
 'No activity']

In [13]:
cat_dtype = CategoricalDtype(categories=activities, ordered=False)
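
A fixed `CategoricalDtype` gives every activity column the same label-to-code mapping in both train and test. The sketch below uses a shortened, illustrative category list and a made-up label ("Skiing") to show what happens to a value that is not among the categories: it becomes NaN (code -1).

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Shortened, illustrative category list (the notebook uses all of activities.txt)
toy_dtype = CategoricalDtype(categories=["Run", "Walk", "No activity"], ordered=False)

train_col = pd.Series(["Run", "No activity", "Walk"]).astype(toy_dtype)
test_col = pd.Series(["Walk", "Walk", "Skiing"]).astype(toy_dtype)  # "Skiing" is not listed

print(train_col.cat.codes.tolist())  # [0, 2, 1]
print(test_col.cat.codes.tolist())   # [1, 1, -1]
print(test_col.isna().tolist())      # [False, False, True]
```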

In [14]:
number_of_elements = train_data_hour.size  # Total number of cells in the dataset
train_len = train_data_hour.shape[0]  # Total number of records (rows) in the dataset

number_missing_values_in_column = train_data_hour.isnull().sum()  # NaN count per column
number_missing_values = number_missing_values_in_column.sum()  # Total NaN count

print(f"Total records: {train_len}, total values: {number_of_elements}")
print(f"Missing values: {number_missing_values}, {round(((number_missing_values / number_of_elements) * 100), 2)} %")

percentage = train_data_hour.isnull().mean() * 100
missing = pd.DataFrame({"Missing values": number_missing_values_in_column, 
                        "Missing (%)": round(percentage, 2)}).sort_values("Missing (%)")
missing

Total records: 177024, total values: 15401088
Missing values: 6692289, 43.45 %


,Missing values,Missing (%)
p_num,0,0.00
time,0,0.00
bg+1:00,0,0.00
bg-0:00,2696,1.52
bg-0:15,3272,1.85
...,...,...
carbs-0:35,174467,98.56
carbs-0:10,174481,98.56
carbs-0:25,174487,98.57
carbs-0:40,174488,98.57


In [15]:
print(f"Total records: {train_len}")
print(f"Needed for a 20 % validation set: {round(train_len * 0.2)}")

Total records: 177024
Needed for a 20 % validation set: 35405


In [16]:
csv_file_path = 'test.csv'
test_data = pd.read_csv(csv_file_path, index_col='id')
test_data.head()

id,p_num,time,bg-5:55,bg-5:50,bg-5:45,bg-5:40,bg-5:35,bg-5:30,bg-5:25,bg-5:20,...,activity-0:45,activity-0:40,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00
p01_8459,p01,06:45:00,,9.2,,,10.2,,,10.3,...,,,,,,,,,,
p01_8460,p01,11:25:00,,,9.9,,,9.4,,,...,,,,,,,,Walk,Walk,Walk
p01_8461,p01,14:45:00,,5.5,,,5.5,,,5.2,...,,,,,,,,,,
p01_8462,p01,04:30:00,,3.4,,,3.9,,,4.7,...,,,,,,,,,,
p01_8463,p01,04:20:00,,,8.3,,,10.0,,,...,,,,,,,,,,


In [17]:
test_data.groupby(['p_num']).count()

p_num,time,bg-5:55,bg-5:50,bg-5:45,bg-5:40,bg-5:35,bg-5:30,bg-5:25,bg-5:20,bg-5:15,...,activity-0:45,activity-0:40,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00
p01,244,2,96,146,1,94,146,1,95,148,...,5,4,5,4,4,4,4,4,5,3
p02,227,222,222,222,222,222,221,221,221,221,...,9,9,9,8,8,8,8,8,7,6
p04,258,256,256,256,256,256,256,256,256,256,...,7,5,6,7,6,6,5,5,5,5
p05,276,4,127,140,5,119,148,5,116,150,...,5,5,3,4,4,5,6,4,5,5
p06,234,3,94,133,3,95,133,2,91,136,...,7,9,12,14,12,11,8,7,6,7
p10,179,175,175,175,175,175,175,174,175,175,...,9,8,10,10,11,11,9,11,13,14
p11,221,217,218,218,218,218,218,217,217,217,...,4,4,4,3,2,2,1,1,1,1
p12,288,282,282,281,281,281,281,282,282,282,...,1,1,2,3,3,3,2,2,2,1
p15,294,291,292,292,291,290,291,291,291,291,...,0,0,0,0,0,0,0,0,0,0
p16,248,240,240,240,240,240,240,239,239,239,...,4,3,3,3,3,3,2,2,2,2


In [18]:
test_data_hour = data_hour(test_data)

In [19]:
number_of_elements = test_data_hour.size  # Total number of cells in the dataset
test_len = test_data_hour.shape[0]  # Total number of records (rows) in the dataset

number_missing_values_in_column = test_data_hour.isnull().sum()  # NaN count per column
number_missing_values = number_missing_values_in_column.sum()  # Total NaN count

print(f"test \nTotal records: {test_len}, total values: {number_of_elements}")
print(f"Missing values: {number_missing_values}, {round(((number_missing_values / number_of_elements) * 100), 2)} %")

percentage = test_data_hour.isnull().mean() * 100
missing = pd.DataFrame({"Missing values": number_missing_values_in_column, "Missing (%)": round(percentage, 2)}).sort_values("Missing (%)")
missing

test 
Total records: 3644, total values: 313384
Missing values: 142566, 45.49 %


,Missing values,Missing (%)
p_num,0,0.00
time,0,0.00
bg-0:00,132,3.62
insulin-0:55,159,4.36
insulin-0:50,159,4.36
...,...,...
activity-0:40,3582,98.30
activity-0:00,3583,98.33
activity-0:25,3583,98.33
activity-0:10,3586,98.41


In [20]:
def transformation_ds(data=train_data_hour):
    X = data.copy()
    # Parse the 'time' column as datetime
    X['time'] = pd.to_datetime(X['time'], format='%H:%M:%S')
    # Extract hour and minute as numeric features
    X['hour'] = X['time'].dt.hour
    X['minute'] = X['time'].dt.minute
    # Drop the original 'time' column
    X.drop('time', axis=1, inplace=True)
    X['p_num'] = X['p_num'].astype('category')
    # The remaining object columns are the activity-X:XX columns
    activity_columns = X.select_dtypes(include=['object']).columns
    # Replace NaN with 'No activity' and cast to the shared categorical dtype
    X[activity_columns] = X[activity_columns].fillna(activities[-1])
    X[activity_columns] = X[activity_columns].astype(cat_dtype)
    return X

In [21]:
X = transformation_ds(train_data_hour)
y = X.pop('bg+1:00')

In [22]:
X[::3]

id,p_num,bg-0:55,bg-0:50,bg-0:45,bg-0:40,bg-0:35,bg-0:30,bg-0:25,bg-0:20,bg-0:15,...,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00,hour,minute
p01_0,p01,,,17.5,,,17.3,,,16.2,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,6,10
p01_3,p01,,,15.1,,,14.4,,,13.9,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,6,55
p01_6,p01,,,13.8,,,13.4,,,12.8,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,7,40
p01_9,p01,,,15.5,,,14.8,,,12.7,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,8,25
p01_12,p01,,,11.4,,,11.9,,,15.1,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,9,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
p12_25284,p12,7.7,7.7,7.4,7.1,6.9,6.7,6.4,6.2,6.2,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,22,45
p12_25287,p12,7.1,6.9,6.7,6.4,6.2,6.2,6.1,6.3,6.5,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,23,0
p12_25290,p12,6.4,6.2,6.2,6.1,6.3,6.5,6.9,7.5,7.9,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,23,15
p12_25293,p12,6.1,6.3,6.5,6.9,7.5,7.9,8.2,8.7,8.6,...,No activity,No activity,No activity,No activity,No activity,No activity,No activity,No activity,23,30


In [23]:
from sklearn.utils import shuffle
rng = np.random.RandomState(random_state)
X_shuffle, y_shuffle = shuffle(X, y, random_state=rng)

In [24]:
from sklearn.model_selection import KFold

rng = np.random.RandomState(random_state)
estimator_hgbr = HistGradientBoostingRegressor(categorical_features='from_dtype', random_state=rng)
hyperparameter_grid = {
    'l2_regularization': range(2),
    'max_leaf_nodes': np.arange(80, 110, 10),
    'min_samples_leaf': np.arange(50, 90, 10),
}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=rng)

grid_search = GridSearchCV(
    estimator=estimator_hgbr,
    param_grid=hyperparameter_grid,    
    scoring='neg_root_mean_squared_error',
    cv=inner_cv,
    n_jobs=-1
    )
grid_search.fit(X_shuffle, y_shuffle)
hgbr = grid_search.best_estimator_  # best model, refit on the full training data

outer_cv = KFold(n_splits=5, shuffle=True, random_state=rng)
results = cross_val_score(
    grid_search,
    X, y,
    cv=outer_cv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
    )
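
In the cell above, a single `RandomState` instance (`rng`) is shared by the estimator, `inner_cv` and `outer_cv`, so the exact splits depend on the order in which these objects draw from it. The sketch below, on toy data, shows the underlying behaviour for `KFold`: an integer seed reproduces the same folds on every `split()` call, whereas a `RandomState` instance is consumed and yields different folds the second time. `fold_test_indices` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(-1, 1)


def fold_test_indices(cv):
    """Return the test indices of every fold as plain lists."""
    return [test_idx.tolist() for _, test_idx in cv.split(X_toy)]


# An int seed re-creates the same generator on every split() call
cv_int = KFold(n_splits=5, shuffle=True, random_state=42)
same_with_int = fold_test_indices(cv_int) == fold_test_indices(cv_int)

# A RandomState instance is consumed, so a second split() continues the stream
cv_instance = KFold(n_splits=5, shuffle=True, random_state=np.random.RandomState(42))
same_with_instance = fold_test_indices(cv_instance) == fold_test_indices(cv_instance)

print(same_with_int)       # True
print(same_with_instance)  # expected to be False
```

If identical folds across repeated runs matter, passing an integer such as `random_state=42` to each splitter is one option.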

In [25]:
-results, -results.mean(), results.std()

(array([1.82975771, 1.80582309, 1.80600333, 1.81396328, 1.80050925]),
 np.float64(1.8112113344521983),
 np.float64(0.010219906419843432))

In [26]:
grid_search.best_estimator_

HistGradientBoostingRegressor(l2_regularization=0, max_leaf_nodes=np.int64(100),
                              min_samples_leaf=np.int64(80),
                              random_state=RandomState(MT19937) at 0x2B71B6F2140)

In [27]:
grid_search.best_score_

np.float64(-1.8085438619891654)

In [28]:
from sklearn.model_selection import ShuffleSplit

# Cross-validation with ShuffleSplit (random 80/20 splits)
rng = np.random.RandomState(random_state)
cv_shuffle = ShuffleSplit(test_size=0.2, random_state=rng)
results_score = cross_val_score(
    hgbr,
    X, y,
    cv=cv_shuffle,  
    scoring="neg_root_mean_squared_error"
    )
-results_score.mean(), results_score.std()

(np.float64(1.8180871655073534), np.float64(0.009522829949999395))

In [29]:
# Cross-validation with patients (p_num) as groups
outer_cv = GroupKFold(n_splits=5)
results_score = cross_val_score(
    hgbr,
    X_shuffle, y_shuffle,
    groups=X_shuffle.p_num,
    cv=outer_cv,   
    scoring="neg_root_mean_squared_error"
    )
-results_score.mean(), results_score.std()

(np.float64(2.0924813206032367), np.float64(0.11052920756134064))

In [30]:
-results_score

array([2.17649484, 2.21303054, 1.89408284, 2.08190962, 2.09688877])

In [31]:
root_mean_squared_error(y_shuffle, hgbr.predict(X_shuffle))

1.6787142256951564
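
This value is computed on the same rows the model was fitted on, which is why it is lower than the cross-validated scores above. For reference, the metric itself is the square root of the mean squared error; a minimal NumPy sketch (the helper `rmse` and the toy values are illustrative and are not part of the notebook):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy blood glucose values in mmol/L, not taken from the dataset
print(rmse([6.0, 8.0, 10.0], [6.0, 8.0, 10.0]))  # 0.0
print(rmse([6.0, 8.0, 10.0], [7.0, 7.0, 12.0]))  # 1.4142135623730951
```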

In [32]:
train_data_hour_p03 = train_data_hour.groupby(['p_num'], sort=False, observed=False).get_group(('p03',))
train_data_hour_other = train_data_hour.drop(train_data_hour_p03.index).groupby(['p_num'], sort=False, observed=False).tail(1172)
X_valid = pd.concat([train_data_hour_other, train_data_hour_p03])
X_train = train_data_hour.drop(X_valid.index)
# Separate the target from the features
y_train = X_train.pop('bg+1:00')
y_valid = X_valid.pop('bg+1:00')
X_train = transformation_ds(X_train)
X_valid = transformation_ds(X_valid)

X_train_shuffle, y_train_shuffle = shuffle(X_train, y_train)
X_valid_shuffle, y_valid_shuffle = shuffle(X_valid, y_valid)

In [33]:
rng = np.random.RandomState(random_state)
hgbr_my = HistGradientBoostingRegressor(categorical_features='from_dtype', 
                                     random_state=rng, 
                                     l2_regularization=0, 
                                     max_leaf_nodes=100,
                                     min_samples_leaf=80,
                                     )
hgbr_my.fit(X_train_shuffle, y_train_shuffle)

In [34]:
root_mean_squared_error(y_valid_shuffle, hgbr_my.predict(X_valid_shuffle))

2.126720496212023

In [35]:
X_test = transformation_ds(test_data_hour)
predictions = hgbr.predict(X_test)

In [36]:
output = pd.DataFrame({'id': test_data.index, 'bg+1:00': predictions.round(1)})
output.to_csv('submission_hour_shuffle.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


In [37]:
output

,id,bg+1:00
0,p01_8459,9.3
1,p01_8460,5.7
2,p01_8461,8.1
3,p01_8462,10.2
4,p01_8463,6.6
...,...,...
3639,p24_256,6.3
3640,p24_257,10.7
3641,p24_258,6.8
3642,p24_259,8.1
