# 🚍 Tabular Playground Series - Mar 2022
Practice your ML skills on this approachable dataset!

## Work in Progress...


### Objective
In this competition, you'll forecast twelve hours of traffic flow in a major U.S. metropolitan area. Time, space, and directional features give you the chance to model interactions across a network of roadways.

### Strategy
...

### Data

**Files and Field Descriptions**
* train.csv - the training set, comprising measurements of traffic congestion across 65 roadways from April through September of 1991.
    * row_id - a unique identifier for this instance
    * time - the 20-minute period in which each measurement was taken
    * x - the east-west midpoint coordinate of the roadway
    * y - the north-south midpoint coordinate of the roadway
    * direction - the direction of travel of the roadway. EB indicates "eastbound" travel, for example, while SW indicates a "southwest" direction of travel.
    * congestion - congestion levels for the roadway during each hour; the target. The congestion measurements have been normalized to the range 0 to 100.
* test.csv - the test set; you will make hourly predictions for roadways identified by a coordinate location and a direction of travel on the day of 1991-09-30.
* sample_submission.csv - a sample submission file in the correct format

### References / Other Notebooks
https://www.kaggle.com/alexryzhkov/lightautoml-with-fe-tps-mar-22?scriptVersionId=89294062 (I used some feature Engineering from this Notebook)


### Updates
**03/04/2022**
* Created the baseline model.
* Created the CV validation loop and added spatial features.

**03/05/2022**
* Improved the notebook documentation, added a table of contents.
* Added Optuna hyperparameter optimization sections.

### Table of Contents
1.  Loading the Required Libraries.
2.  Setting the Notebook.
3.  Loading the Information (CSV) Into A Dataframe.
4.  Exploring the Information Available.
        4.1. Analysing the Train Dataset.
        4.2. Analysing the Test Dataset.
5.  Creating New Model Features.
        5.1. Creating Time-Based Features.
        5.2. Creating Spatial Features.
        5.3. Creating Multi-Field Combinational Features.
        5.4. Visualizing The Features Created.
6.  Post Processing the Information for the Model.
7.  Creating a Simple Train / Test Split Strategy.
8.  Building a Baseline GBT Model.
9.  Baseline Model Submission File Generation.
10. Creating A More Complex Cross Validation Loop Strategy.
11. Training A Model Using the CV Loop.
12. CV Loop Model Submission File Generation.
13. Optuna Hyperparameter Optimization.
14. Optuna Hyperparameter + XGBoost CV Loop.
        


___

# 1. Loading the Required Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_absolute_error

___

# 2. Setting the Notebook

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to only use 2 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

___

# 3. Loading the Information (CSV) Into A Dataframe

In [None]:
%%time
trn_data = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2022/train.csv', parse_dates = ['time'])
tst_data = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2022/test.csv', parse_dates = ['time'])

submission = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2022/sample_submission.csv')

___

# 4. Exploring the Information Available

## 4.1. Analysing the Train Dataset

In [None]:
%%time
trn_data.info()

In [None]:
%%time
trn_data.head()

In [None]:
%%time
trn_data.describe()

In [None]:
%%time
trn_data['x'].unique()

In [None]:
%%time
trn_data['y'].unique()

In [None]:
%%time
trn_data['direction'].unique()

In [None]:
%%time
summary = trn_data.groupby(['x', 'y', 'direction'])['row_id'].count().reset_index()
summary

___

## 4.2. Analysing the Test Dataset

In [None]:
%%time
tst_data.head()

In [None]:
%%time
tst_data.describe()

___

# 5. Creating New Model Features

## 5.1. Creating Time-Based Features

In [None]:
%%time
def create_time_features(df):
    df['minute'] = df['time'].dt.minute
    df['hour'] = df['time'].dt.hour
    df['day'] = df['time'].dt.day
    df['dayofweek'] = df['time'].dt.dayofweek
    df['month'] = df['time'].dt.month
    df['is_month_start'] = df['time'].dt.is_month_start.astype('int')
    df['is_month_end'] = df['time'].dt.is_month_end.astype('int')
    df['year'] = df['time'].dt.year
    df['is_weekend'] = np.where(df['dayofweek'] > 4, 1, 0)
    df['is_afternoon'] = (df['time'].dt.hour > 12).astype('int')
    
    return df

In [None]:
%%time
trn_data = create_time_features(trn_data)
tst_data = create_time_features(tst_data)
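
As a quick sanity check, here is a minimal sketch that applies a few of the same `.dt` accessors to a tiny hand-made frame. The two timestamps and the `toy` name are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps, chosen only to illustrate the accessors.
toy = pd.DataFrame({'time': pd.to_datetime(['1991-04-01 00:20:00',     # a Monday
                                            '1991-09-29 13:40:00'])})  # a Sunday

toy['hour'] = toy['time'].dt.hour
toy['minute'] = toy['time'].dt.minute
toy['dayofweek'] = toy['time'].dt.dayofweek               # Monday = 0 ... Sunday = 6
toy['is_weekend'] = np.where(toy['dayofweek'] > 4, 1, 0)  # Saturday or Sunday
toy['is_month_start'] = toy['time'].dt.is_month_start.astype('int')

print(toy)
```

Since `dayofweek` counts from Monday = 0, the `> 4` test flags Saturday and Sunday only.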

___

## 5.2. Creating Spatial Features

In [None]:
%%time
def calculate_distances(df):
    df['dist_to_00'] = np.sqrt(np.square(df['x'] - 0) + np.square(df['y'] - 0))
    df['dist_to_01'] = np.sqrt(np.square(df['x'] - 0) + np.square(df['y'] - 1))
    df['dist_to_02'] = np.sqrt(np.square(df['x'] - 0) + np.square(df['y'] - 2))
    df['dist_to_03'] = np.sqrt(np.square(df['x'] - 0) + np.square(df['y'] - 3))
    df['dist_to_10'] = np.sqrt(np.square(df['x'] - 1) + np.square(df['y'] - 0))
    df['dist_to_11'] = np.sqrt(np.square(df['x'] - 1) + np.square(df['y'] - 1))
    df['dist_to_12'] = np.sqrt(np.square(df['x'] - 1) + np.square(df['y'] - 2))
    df['dist_to_13'] = np.sqrt(np.square(df['x'] - 1) + np.square(df['y'] - 3))
    df['dist_to_20'] = np.sqrt(np.square(df['x'] - 2) + np.square(df['y'] - 0))
    df['dist_to_21'] = np.sqrt(np.square(df['x'] - 2) + np.square(df['y'] - 1)) 
    df['dist_to_22'] = np.sqrt(np.square(df['x'] - 2) + np.square(df['y'] - 2))
    df['dist_to_23'] = np.sqrt(np.square(df['x'] - 2) + np.square(df['y'] - 3))    
    return df

In [None]:
%%time
trn_data = calculate_distances(trn_data)
tst_data = calculate_distances(tst_data)
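
The twelve distance columns above follow one pattern: the Euclidean distance from each roadway midpoint to every grid point with x in 0..2 and y in 0..3. A loop-based sketch of the same idea is shown below. `calculate_distances_loop` is a hypothetical helper name and the toy coordinates are made up; it is meant as an equivalent illustration, not a replacement for the function used in this notebook.

```python
import numpy as np
import pandas as pd

def calculate_distances_loop(df, x_points=range(3), y_points=range(4)):
    # Hypothetical helper: builds the same 'dist_to_<x><y>' columns in a loop.
    for px in x_points:
        for py in y_points:
            df[f'dist_to_{px}{py}'] = np.sqrt(np.square(df['x'] - px) + np.square(df['y'] - py))
    return df

# Toy coordinates for illustration only.
toy = pd.DataFrame({'x': [0, 2], 'y': [0, 3]})
toy = calculate_distances_loop(toy)

print(toy[['dist_to_00', 'dist_to_23']])
```

A roadway sitting exactly on a grid point gets a distance of 0 to that point, which is what the two printed columns show for the two toy rows.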

___

## 5.3. Creating Multi-Field Combinational Features

In [None]:
%%time
dir_mapper = {'EB': [1,0], 
              'NB': [0,1], 
              'SB': [0,-1], 
              'WB': [-1,0], 
              'NE': [1,1], 
              'SW': [-1,-1], 
              'NW': [-1,1], 
              'SE': [1,-1]}

In [None]:
%%time
def feature_engineering(df):
    df['converted_direction_coord_0'] = df['direction'].map(lambda x: dir_mapper[x][0])
    df['converted_direction_coord_1'] = df['direction'].map(lambda x: dir_mapper[x][1])
    df['hour+minute'] = df['time'].dt.hour * 60 + df['time'].dt.minute
    df['x+y'] = df['x'].astype('str') + df['y'].astype('str')
    df['x+y+direction'] = df['x'].astype('str') + df['y'].astype('str') + df['direction'].astype('str')
    df['x+y+direction_0'] = df['x'].astype('str') + df['y'].astype('str') + df['converted_direction_coord_0'].astype('str')
    df['x+y+direction_1'] = df['x'].astype('str') + df['y'].astype('str') + df['converted_direction_coord_1'].astype('str')
    df['hour+direction'] = df['hour'].astype('str') + df['direction'].astype('str')
    df['hour+x+y'] = df['hour'].astype('str') + df['x'].astype('str') + df['y'].astype('str')
    df['hour+direction+x'] = df['hour'].astype('str') + df['direction'].astype('str') + df['x'].astype('str')
    df['hour+direction+y'] = df['hour'].astype('str') + df['direction'].astype('str') + df['y'].astype('str')
    df['hour+direction+x+y'] = df['hour'].astype('str') + df['direction'].astype('str') + df['x'].astype('str') + df['y'].astype('str')
    df['hour+x'] = df['hour'].astype('str') + df['x'].astype('str')
    df['hour+y'] = df['hour'].astype('str') + df['y'].astype('str')
    
    return df

In [None]:
%%time
trn_data = feature_engineering(trn_data)
tst_data = feature_engineering(tst_data)
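
To make the direction mapping concrete, the sketch below applies `dir_mapper` to three made-up rows and builds one of the string-concatenated features. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

dir_mapper = {'EB': [1, 0], 'NB': [0, 1], 'SB': [0, -1], 'WB': [-1, 0],
              'NE': [1, 1], 'SW': [-1, -1], 'NW': [-1, 1], 'SE': [1, -1]}

# Toy rows for illustration only.
toy = pd.DataFrame({'x': [0, 2, 1], 'y': [3, 1, 0], 'direction': ['EB', 'SW', 'NB']})

# Each direction becomes a pair of components: (east-west, north-south).
toy['converted_direction_coord_0'] = toy['direction'].map(lambda d: dir_mapper[d][0])
toy['converted_direction_coord_1'] = toy['direction'].map(lambda d: dir_mapper[d][1])

# Concatenating the fields as strings gives one categorical key per roadway.
toy['x+y+direction'] = toy['x'].astype('str') + toy['y'].astype('str') + toy['direction'].astype('str')

print(toy)
```

The combined key identifies a roadway (location plus direction of travel), so a tree model can split on individual roadways once the key is label-encoded.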

___

## 5.4. Visualizing The Features Created

In [None]:
%%time
trn_data.sample(10)

In [None]:
%%time
trn_data.info()

___

# 6. Post Processing the Information for the Model

In [None]:
%%time
from sklearn.preprocessing import LabelEncoder 

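def encode_categorical(train_df, test_df, categ_feat = ['direction']):
    '''
    Label-encode every column listed in categ_feat.

    Each encoder is fitted on the combined train and test values so that
    both frames share one mapping. A '<col>_enc' column is added for each
    feature and the original columns are dropped.
    '''
    encoder_dict = {}  # fitted encoders, keyed by column name
    
    # Fit on the frames passed in, so the function does not depend on globals.
    concat_data = pd.concat([train_df[categ_feat], test_df[categ_feat]])
    
    for col in concat_data.columns:
        print('Encoding: ', col, '...')
        encoder = LabelEncoder()
        encoder.fit(concat_data[col])
        encoder_dict[col] = encoder

        train_df[col + '_enc'] = encoder.transform(train_df[col])
        test_df[col + '_enc'] = encoder.transform(test_df[col])
    
    train_df = train_df.drop(columns = categ_feat)
    test_df = test_df.drop(columns = categ_feat)

    return train_df, test_df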

In [None]:
%%time
categorical_features = ['direction', 'x+y', 'x+y+direction', 
                        'x+y+direction_0', 'x+y+direction_1', 'hour+direction', 'hour+x+y', 'hour+direction+x', 'hour+direction+y', 'hour+direction+x+y',
                        'hour+x', 'hour+y']

trn_data, tst_data = encode_categorical(trn_data, tst_data, categorical_features)
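
The encoder is fitted on train and test together so that a label appearing only in the test set can still be transformed. A minimal sketch on made-up frames (`train_toy` and `test_toy` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames for illustration; 'SB' appears only in the test frame.
train_toy = pd.DataFrame({'direction': ['EB', 'NB', 'EB']})
test_toy = pd.DataFrame({'direction': ['SB', 'NB']})

encoder = LabelEncoder()
encoder.fit(pd.concat([train_toy['direction'], test_toy['direction']]))

train_toy['direction_enc'] = encoder.transform(train_toy['direction'])
test_toy['direction_enc'] = encoder.transform(test_toy['direction'])

print(list(encoder.classes_))
print(train_toy)
print(test_toy)
```

Had the encoder been fitted on `train_toy` alone, transforming `'SB'` would raise a `ValueError` for the unseen label.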

In [None]:
%%time
ignore = ['row_id', 'time', 'congestion', 'direction', 'year']
features = [feat for feat in trn_data.columns if feat not in ignore]
target_feature = 'congestion'

___

# 7. Creating a Simple Train / Test Split Strategy

In [None]:
%%time
from sklearn.model_selection import train_test_split
test_size_pct = 0.15
X_train, X_valid, y_train, y_valid = train_test_split(trn_data[features], trn_data[target_feature], test_size = test_size_pct, random_state = 42)

___

# 8. Building a Baseline GBT Model

In [None]:
%%time
from xgboost  import XGBRegressor

In [None]:
%%time
optuna_param = {'n_estimators': 2676,
                 'max_depth': 7,
                 'learning_rate': 0.13428054009987778,
                 'subsample': 0.9473205460026051,
                 'colsample_bytree': 0.5939032159218427,
                 'reg_lambda': 1.475346592023048,
                 'reg_alpha': 6.056263903457767,
                 'gamma': 1.3720990427109092,
                 'random_state': 69,
                 'objective': 'reg:squarederror',
                 'tree_method': 'gpu_hist',
               }

param = {'n_estimators': 512,
         'learning_rate': 0.1,
         'random_state': 69,
         'objective': 'reg:squarederror',
         'tree_method': 'gpu_hist',} 

In [None]:
%%time
xgb = XGBRegressor(**param)
xgb.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['mae'], early_stopping_rounds = 128, verbose = 500)

In [None]:
%%time
import matplotlib.pyplot as plt

def feature_importance(xgb):
    importances = xgb.feature_importances_
    i = np.argsort(importances)
    features = X_train.columns
    plt.title('Feature Importance')
    plt.barh(range(len(i)), importances[i], align='center')
    plt.yticks(range(len(i)), [features[x] for x in i])
    plt.xlabel('Scale')
    plt.show()

In [None]:
%%time
plt.figure(figsize = (8,6))
feature_importance(xgb)

In [None]:
%%time
from sklearn.metrics import mean_absolute_error
preds = xgb.predict(X_valid)
print("Validation MAE:", mean_absolute_error(y_valid, preds))
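
The score printed above is a mean absolute error, so lower is better. As a small illustration on made-up numbers, `mean_absolute_error` matches the average of the absolute differences computed by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions, for illustration only.
y_true_toy = np.array([10, 50, 90])
y_pred_toy = np.array([12.0, 47.0, 95.0])

mae_sklearn = mean_absolute_error(y_true_toy, y_pred_toy)
mae_manual = np.mean(np.abs(y_true_toy - y_pred_toy))  # (2 + 3 + 5) / 3

print(mae_sklearn, mae_manual)
```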

___

# 9. Baseline Model Submission File Generation

In [None]:
%%time
preds = xgb.predict(tst_data[features])

In [None]:
%%time
submission['congestion'] = preds
submission.to_csv('submission_simple_split_80_20_031122.csv', index = False)

___

# 10. Creating A More Complex Cross Validation Loop Strategy

In [None]:
%%time
N_SPLITS = 10
folds = StratifiedKFold(n_splits = N_SPLITS, shuffle = True)
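
`StratifiedKFold` is designed for classification targets. It can presumably be used here because the congestion values are integers, which scikit-learn treats as class labels (an assumption about how the column is parsed). The sketch below, on a made-up integer target, shows that each validation fold receives the same mix of target values:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy integer target: 3 distinct values, 4 rows each (illustration only).
y_toy = np.repeat([10, 20, 30], 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

folds_toy = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

val_targets = []
for fold, (trn_id, val_id) in enumerate(folds_toy.split(X_toy, y_toy)):
    val_targets.append(sorted(y_toy[val_id].tolist()))
    print("Fold:", fold, "validation targets:", val_targets[-1])
```

Passing a `random_state` together with `shuffle=True` makes the fold assignment reproducible between runs.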

___

# 11. Training A Model Using the CV Loop

In [None]:
%%time
optuna_param = {'n_estimators': 2676,
                 'max_depth': 7,
                 'learning_rate': 0.13428054009987778,
                 'subsample': 0.9473205460026051,
                 'colsample_bytree': 0.5939032159218427,
                 'reg_lambda': 1.475346592023048,
                 'reg_alpha': 6.056263903457767,
                 'gamma': 1.3720990427109092,
                 'random_state': 69,
                 'objective': 'reg:squarederror',
                 'tree_method': 'gpu_hist',}

param = {'n_estimators': 512,
         'learning_rate': 0.1,
         'random_state': 69,
         'objective': 'reg:squarederror',
         'tree_method': 'gpu_hist',} 

In [None]:
%%time

scores  = []
y_preds = []

for fold, (trn_id, val_id) in enumerate(folds.split(trn_data[features], trn_data[target_feature])):  
    X_train, y_train = trn_data[features].iloc[trn_id], trn_data[target_feature].iloc[trn_id]
    X_valid, y_valid = trn_data[features].iloc[val_id], trn_data[target_feature].iloc[val_id]

    model = XGBRegressor(**param)
    model.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['mae'], early_stopping_rounds = 128, verbose = False)
    
    valid_pred = model.predict(X_valid)
    score = mean_absolute_error(y_valid, valid_pred)
    
    print("Fold:", fold, "MAE:", score)
    scores.append(score)
    y_preds.append(model.predict(tst_data[features]))

In [None]:
%%time
print("Mean CV MAE:", np.array(scores).mean())

---

# 12. CV Loop Model Submission File Generation

In [None]:
%%time
submission['congestion'] = np.mean(y_preds, axis = 0)
submission.to_csv('submission_10_folds_031122.csv', index = False)
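
`np.mean(y_preds, axis = 0)` averages the fold models row by row, so every test row gets the mean of its per-fold predictions. A tiny sketch with two made-up prediction arrays:

```python
import numpy as np

# Two hypothetical fold predictions for the same three test rows.
y_preds_toy = [np.array([10.0, 20.0, 30.0]),
               np.array([14.0, 18.0, 36.0])]

# axis=0 averages across folds, keeping one value per test row.
blended = np.mean(y_preds_toy, axis=0)

print(blended)
```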

In [None]:
%%time
submission

---

# 13. Optuna Hyperparameter Optimization

In [None]:
%%time
import optuna

In [None]:
%%time
def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 8, 4096)
    max_depth = trial.suggest_int("max_depth", 2, 8)
    learning_rate = trial.suggest_float("learning_rate", 0.01, 0.2)
    subsample = trial.suggest_float("subsample", 0.5, 1)
    colsample_bytree = trial.suggest_float("colsample_bytree", 0.5, 1)
    reg_lambda = trial.suggest_float("reg_lambda", 1, 10)
    reg_alpha = trial.suggest_float("reg_alpha", 0, 10)
    gamma = trial.suggest_float("gamma", 0, 10)
    min_child_weight  = trial.suggest_int("min_child_weight", 0, 128)
    
    xgr = XGBRegressor(n_estimators  = n_estimators,
                       learning_rate = learning_rate,
                       max_depth = max_depth,
                       subsample = subsample,
                       colsample_bytree = colsample_bytree,
                       reg_lambda = reg_lambda,
                       reg_alpha = reg_alpha,
                       gamma = gamma,
                       min_child_weight = min_child_weight,
                       random_state  = 69,
                       objective = 'reg:squarederror',
                       tree_method = 'gpu_hist',
                      )
    
    xgr.fit(X_train, y_train)
    
    valid_pred = xgr.predict(X_valid)
    score = mean_absolute_error(y_valid, valid_pred)
    
    return score

In [None]:
%%time
study = optuna.create_study(direction = "minimize")
study.optimize(objective, n_trials = 30)

In [None]:
%%time
parameters = study.best_params
parameters

In [None]:
%%time
optuna_param = {'n_estimators': 3190,
                 'max_depth': 8,
                 'learning_rate': 0.12874747409892442,
                 'subsample': 0.9199579343366618,
                 'colsample_bytree': 0.9211378127079498,
                 'reg_lambda': 3.005844076201087,
                 'reg_alpha': 6.019297057378581,
                 'gamma': 5.190287854922152,
                 'min_child_weight': 81,
                 'random_state': 69,
                 'objective': 'reg:squarederror',
                 'tree_method': 'gpu_hist',
               }

In [None]:
%%time
# Train with the tuned Optuna parameters defined in the previous cell.
xgb = XGBRegressor(**optuna_param)
xgb.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['mae'], early_stopping_rounds = 128, verbose = 500)

In [None]:
%%time
from sklearn.metrics import mean_absolute_error
preds = xgb.predict(X_valid)
print("Validation MAE:", mean_absolute_error(y_valid, preds))

In [None]:
%%time
preds = xgb.predict(tst_data[features])

In [None]:
%%time
submission['congestion'] = preds
submission.to_csv('submission_optuna_hyperparam_simple_split_8020_031122.csv', index = False)

---

# 14. Optuna Hyperparameter + XGBoost CV Loop

In [None]:
%%time
N_SPLITS = 10
folds = StratifiedKFold(n_splits = N_SPLITS, shuffle = True)

In [None]:
%%time
optuna_param = {'n_estimators': 3190,
                 'max_depth': 8,
                 'learning_rate': 0.12874747409892442,
                 'subsample': 0.9199579343366618,
                 'colsample_bytree': 0.9211378127079498,
                 'reg_lambda': 3.005844076201087,
                 'reg_alpha': 6.019297057378581,
                 'gamma': 5.190287854922152,
                 'min_child_weight': 81,
                 'random_state': 69,
                 'objective': 'reg:squarederror',
                 'tree_method': 'gpu_hist',
               }

In [None]:
%%time

scores  = []
y_preds = []

for fold, (trn_id, val_id) in enumerate(folds.split(trn_data[features], trn_data[target_feature])):  
    X_train, y_train = trn_data[features].iloc[trn_id], trn_data[target_feature].iloc[trn_id]
    X_valid, y_valid = trn_data[features].iloc[val_id], trn_data[target_feature].iloc[val_id]

    # Use the tuned Optuna parameters for every fold.
    model = XGBRegressor(**optuna_param)
    model.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['mae'], early_stopping_rounds = 128, verbose = False)
    
    valid_pred = model.predict(X_valid)
    score = mean_absolute_error(y_valid, valid_pred)
    
    print("Fold:", fold, "MAE:", score)
    scores.append(score)
    y_preds.append(model.predict(tst_data[features]))

In [None]:
%%time
print("Mean CV MAE:", np.array(scores).mean())

In [None]:
%%time
submission['congestion'] = np.mean(y_preds, axis = 0)
submission.to_csv('submission_optuna_hyperparam_10_folds_031122.csv', index = False)