we will forecast twelve-hours of traffic flow in a U.S. metropolis. The time series in this dataset are labelled with both location coordinates 
and a direction of travel -- a combination of features that will test your skill at spatio-temporal forecasting within a highly dynamic traffic network.
Which model 
will prevail? The venerable linear regression? The deservedly-popular ensemble of decision trees? Or maybe a cutting-edge graph neural-network? We can't wait to see!



Data Description

In this competition, you'll forecast twelve-hours of traffic flow in a major U.S. metropolitan area. Time, space, and directional 
features give you the chance to model interactions across a network of roadways.

Files and Field Descriptions

train.csv - the training set, comprising
 measurements of traffic congestion across 65 roadways from April through September of 1991.

row_id - a unique identifier for this instance

time - the 20-minute period 
in which each measurement was taken

x - the east-west midpoint coordinate of the roadway

y - the north-south midpoint coordinate of the roadway

direction - the direction 
of travel of the roadway. EB indicates "eastbound" travel, for example, while SW indicates a "southwest" direction of travel.

congestion - congestion 
for the roadway during each hour; the target. The congestion measurements have been normalized to the range 0 to 100.

test.csv - the test set; you will make hourly 
predictions for roadways identified by a coordinate location and a direction of travel on the day of 1991-09-30.

sample_submission.csv - a sample submission file in 
the correct format

### Import train_data and test_data

In [11]:
import pandas as pd
import numpy as np


In [3]:
train_df = pd.read_csv('Desktop/Новая папка/Data/dataset/data_1/train.csv')
test_df = pd.read_csv('Desktop/Новая папка/Data/dataset/data_1/test.csv')


In [24]:
train_df.head()

Unnamed: 0,row_id,time,x,y,direction,congestion
0,0,1991-04-01 00:00:00,0,0,EB,70
1,1,1991-04-01 00:00:00,0,0,NB,49
2,2,1991-04-01 00:00:00,0,0,SB,24
3,3,1991-04-01 00:00:00,0,1,EB,18
4,4,1991-04-01 00:00:00,0,1,NB,60


In [7]:
train_df.dtypes


row_id         int64
time          object
x              int64
y              int64
direction     object
congestion     int64
dtype: object

### Data preprocessing

1. Feature engineering

In [25]:
def feature_engineering(data):
    data['time'] = pd.to_datetime(data['time'])
    data['month'] = pd.to_datetime(data['time']).dt.month
    data['weekday'] = pd.to_datetime(data['time']).dt.weekday
    data['hour'] = pd.to_datetime(data['time']).dt.hour
    data['minute'] = pd.to_datetime(data['time']).dt.minute
    data['is_month_start'] = data['time'].dt.is_month_start.astype('int')
    data['is_month_end'] = data['time'].dt.is_month_end.astype('int')
    data['is_weekend'] = (data['time'].dt.dayofweek > 5).astype('int')
    data['is_afternoon'] = (data['time'].dt.hour > 12).astype('int')
    data['road'] = data['x'].astype(str) + data['y'].astype(str) + data['direction']
    data['moment'] = data['time'].dt.hour*3 + data['time'].dt.minute // 20
    data = data.drop(['row_id', 'direction'], axis = 1)
    return data


In [35]:
train_df = feature_engineering(train_df)
test_df = feature_engineering(test_df)

In [36]:
train_df

Unnamed: 0,time,x,y,congestion,month,weekday,hour,minute,is_month_start,is_month_end,is_weekend,is_afternoon,road,moment
0,1991-04-01 00:00:00,0,0,70,4,0,0,0,1,0,0,0,00EB,0
1,1991-04-01 00:00:00,0,0,49,4,0,0,0,1,0,0,0,00NB,0
2,1991-04-01 00:00:00,0,0,24,4,0,0,0,1,0,0,0,00SB,0
3,1991-04-01 00:00:00,0,1,18,4,0,0,0,1,0,0,0,01EB,0
4,1991-04-01 00:00:00,0,1,60,4,0,0,0,1,0,0,0,01NB,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
848830,1991-09-30 11:40:00,2,3,54,9,0,11,40,0,1,0,0,23NB,35
848831,1991-09-30 11:40:00,2,3,28,9,0,11,40,0,1,0,0,23NE,35
848832,1991-09-30 11:40:00,2,3,68,9,0,11,40,0,1,0,0,23SB,35
848833,1991-09-30 11:40:00,2,3,17,9,0,11,40,0,1,0,0,23SW,35


mins = pd.DataFrame(train_df.groupby(['road', 'weekday', 'hour', 'minute']).congestion.min().astype(int)).reset_index() mins = mins.rename(columns={'congestion':'min'}) train_df = train_df.merge(mins, on=['road', 'weekday', 'hour', 'minute'], how='left') test_df = test_df.merge(mins, on=['road', 'weekday', 'hour', 'minute'], how='left') 

In [7]:

maxs = pd.DataFrame(train_df.groupby(['road', 'weekday', 'hour', 'minute']).congestion.max().astype(int)).reset_index() maxs = maxs.rename(columns={'congestion':'max'}) train_df = train_df.merge(maxs, on=['road', 'weekday', 'hour', 'minute'], how='left') test_df = test_df.merge(maxs, on=['road', 'weekday', 'hour', 'minute'], how='left') 

In [8]:

medians = pd.DataFrame(train_df.groupby(['road', 'weekday', 'hour', 'minute']).congestion.median().astype(int)).reset_index() medians = medians.rename(columns={'congestion':'median'}) train_df = train_df.merge(medians, on=['road', 'weekday', 'hour', 'minute'], how='left') test_df = test_df.merge(medians, on=['road', 'weekday', 'hour', 'minute'], how='left') 

In [9]:

pd.get_option('display.max_columns') pd.set_option('display.max_columns', 20) 

In [10]:

train_df.head()

test_df.head()



2. Label Encoding

In [12]:

from sklearn.preprocessing import LabelEncoder cate_features = ['road'] le = LabelEncoder() for feature in cate_features: le.fit(train_df[feature]) train_df[feature] = le.transform(train_df[feature]) test_df[feature] = le.transform(test_df[feature]) 

In [13]:

train_df.head()

test_df.head()




Modelling

1. Split train_df to train data and valid data

In [15]:

tst_start = pd.to_datetime('1991-09-23 12:00') tst_finish = pd.to_datetime('1991-09-23 23:40') X_train = train_df[train_df['time'] < tst_start] y_train = X_train['congestion'] X_train = X_train.drop(['congestion', 'time'], axis=1) X_valid = train_df[(train_df['time'] >= tst_start) & (train_df['time'] <= tst_finish)] y_valid = X_valid['congestion'] X_valid = X_valid.drop(['time', 'congestion'], axis=1) 

2. Define model and check the validation score

In [16]:

from sklearn.metrics import mean_absolute_error def mae_valid(model): model.fit(X_train, y_train) y_pred = model.predict(X_valid) mae = mean_absolute_error(y_valid, y_pred) return(mae) 

In [17]:

from catboost import CatBoostRegressor model_cat = CatBoostRegressor(logging_level='Silent', depth=8, eval_metric='MAE', loss_function='MAE', n_estimators=800) 

In [18]:

score = mae_valid(model_cat) print(f'\nCAT score : {score}') 

CAT score : 4.83885047450609 

3. Train the model

In [19]:

y_train = train_df['congestion'] train_df = train_df.drop(['congestion', 'time'], axis=1) test_df = test_df.drop('time', axis=1) 

In [20]:

model_cat.fit(train_df, y_train) cat_prediction = model_cat.predict(test_df) 

Create submission data

In [21]:

submission = pd.read_csv('../input/tabular-playground-series-mar-2022/sample_submission.csv') 

In [22]:

submission['congestion'] = cat_prediction submission['congestion'] = submission['congestion'].round().astype(int)
submission.to_csv('submission.csv', index=False) 











In [51]:
mins = pd.DataFrame(train_df.groupby(['road', 'weekday', 'hour', 'minute']).congestion.min().astype(int)).reset_index() 
mins = mins.rename(columns={'congestion':'min'}) 
train_df = train_df.merge(mins, on=['road', 'weekday', 'hour', 'minute'], how='left') 
test_df = test_df.merge(mins, on=['road', 'weekday', 'hour', 'minute'], how='left')

In [54]:
maxs = pd.DataFrame(train_df.groupby(['road', 'weekday', 'hour', 'minute']).congestion.max().astype(int)).reset_index() 
maxs = maxs.rename(columns={'congestion':'max'}) 
train_df = train_df.merge(maxs, on=['road', 'weekday', 'hour', 'minute'], how='left') 
test_df = test_df.merge(maxs, on=['road', 'weekday', 'hour', 'minute'], how='left') 


In [55]:
medians = pd.DataFrame(train_df.groupby(['road', 'weekday', 'hour', 'minute']).congestion.median().astype(int)).reset_index() 
medians = medians.rename(columns={'congestion':'median'}) 
train_df = train_df.merge(medians, on=['road', 'weekday', 'hour', 'minute'], how='left') 
test_df = test_df.merge(medians, on=['road', 'weekday', 'hour', 'minute'], how='left')

In [56]:
pd.get_option('display.max_columns') 
pd.set_option('display.max_columns', 20)

In [57]:
train_df.head()

Unnamed: 0,time,x,y,congestion,month,weekday,hour,minute,is_month_start,is_month_end,...,is_afternoon,road,moment,min_x,min_y,min,max_x,max_y,max,median
0,1991-04-01,0,0,70,4,0,0,0,1,0,...,0,00EB,0,30,30,30,80,80,80,35
1,1991-04-01,0,0,49,4,0,0,0,1,0,...,0,00NB,0,13,13,13,69,69,69,29
2,1991-04-01,0,0,24,4,0,0,0,1,0,...,0,00SB,0,21,21,21,91,91,91,24
3,1991-04-01,0,1,18,4,0,0,0,1,0,...,0,01EB,0,0,0,0,26,26,26,17
4,1991-04-01,0,1,60,4,0,0,0,1,0,...,0,01NB,0,52,52,52,72,72,72,63


In [58]:
test_df.head()

Unnamed: 0,time,x,y,month,weekday,hour,minute,is_month_start,is_month_end,is_weekend,is_afternoon,road,moment,min_x,min_y,min,max_x,max_y,max,median
0,1991-09-30 12:00:00,0,0,9,0,12,0,0,1,0,0,00EB,36,23,23,23,63,63,63,47
1,1991-09-30 12:00:00,0,0,9,0,12,0,0,1,0,0,00NB,36,24,24,24,52,52,52,35
2,1991-09-30 12:00:00,0,0,9,0,12,0,0,1,0,0,00SB,36,28,28,28,74,74,74,56
3,1991-09-30 12:00:00,0,1,9,0,12,0,0,1,0,0,01EB,36,10,10,10,34,34,34,22
4,1991-09-30 12:00:00,0,1,9,0,12,0,0,1,0,0,01NB,36,59,59,59,95,95,95,72


2. Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder 
cate_features = ['road'] 
le = LabelEncoder() 
for feature in cate_features: 
    le.fit(train_df[feature]) 
    train_df[feature] = le.transform(train_df[feature]) 
    test_df[feature] = le.transform(test_df[feature]) 

In [None]:
train_df.head()

In [None]:
test_df.head()

### Modeling

1. Split train_df to train data and valid data

In [None]:
tst_start = pd.to_datetime('1991-09-23 12:00')
tst_finish = pd.to_datetime('1991-09-23 23:40')
X_train = train_df[train_df['time'] < tst_start]
y_train = X_train['congestion']
X_train = X_train.drop(['congestion', 'time'], axis = 1)

X_valid = train_df[(train_df['time'] >= tst_start) & (train_df['time'] <= tst_finish)]
y_valid = X_valid['congestion']
X_valid = X_valid.drop(['time', 'congestion'], axis = 1)

2. Define model and check the validation score

In [None]:
from sklearn.metrics import mean_absolute_error

def mae_valid(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    mae = mean_absolute_error(y_valid, y_pred)
    return (mae)

In [None]:
from catboost import CatBoostRegressor
model_cat = CatBoostRegressor(logging_level = 'Silent', loss_function = 'MAE', n_estimators = 800)

In [None]:
score = mae_valid(model_cat)
print(f'\nCAT score : {score}')

3. Train the model

In [None]:
y_train = train_df['congestion']
train_df = train_df.drop(['congestion', 'time'], axis = 1)
test_df = test_df.drop('time', axis = 1)

In [None]:
model_cat.fit(train_df, y_train)
cat_prediction = model_cat,predict(test_df)

### Create submission data

In [2]:
submission = pd.read_csv('Desktop/Новая папка/Data/dataset/data_1/sample_submission.csv')

In [None]:
submission['congestion'] = cat_prediction
submission['congestion'] = submission['congestion'].round().astype(int)
submission.to_csv('submission.csv', index = False)

In [None]:
submission