# <b>1 <span style='color:lightseagreen'>|</span> Introduction</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.1 | Goal</b></p>
</div>

Forecast twelve hours of **<span style='color:lightseagreen'>traffic flow</span>** in a U.S. metropolis. The time series in this dataset are labelled with both **<span style='color:lightseagreen'>location coordinates</span>** and a **<span style='color:lightseagreen'>direction of travel</span>** -- a combination of features that will test your skill at **<span style='color:lightseagreen'>spatio-temporal</span>** forecasting within a highly dynamic traffic network.

![](https://www.wsp.com/-/media/Hubs/Global/Congestion-Management/bnr-congestion.jpg)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.2 | Metric</b></p>
</div>

Submissions are evaluated on the mean absolute error between predicted and actual congestion values for each time period in the test set. The congestion target has integer values from 0 to 100.
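
As a quick illustration of the metric, the sketch below computes the mean absolute error on a handful of made-up congestion values. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted congestion values (illustrative only)
y_true = np.array([50, 62, 35, 48])
y_pred = np.array([47, 60, 40, 48])

# MAE is the average of the absolute differences
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.2f}")
```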

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.3 | How to import data?</b></p>
</div>

First, we import all the datasets needed for this kernel. The time series column is imported as a datetime column using the **<span style='color:lightseagreen'>parse_dates</span>** parameter and is also selected as the index of the dataframe using the **<span style='color:lightseagreen'>index_col</span>** parameter.
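
A minimal sketch of how these two `pd.read_csv` parameters behave, using a tiny in-memory CSV instead of the competition files. The column names follow the data used later in this kernel; the values and the `df_example` name are made up for illustration.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for train.csv (values are made up)
csv_text = """row_id,time,x,y,direction,congestion
0,2022-01-01 00:00:00,0,0,EB,70
1,2022-01-01 00:20:00,0,0,EB,65
"""

# parse_dates converts the column to datetime; index_col makes it the index
df_example = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["time"], index_col="time"
)
print(df_example.index)
```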

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.4 | Timestamps and Periods</b></p>
</div>

Timestamps represent a point in time, while Periods represent an interval of time. Periods can be used to check whether a specific event falls within the given period. Each can also be converted to the other's form.

📌 Video: [How to use dates and times with pandas](https://campus.datacamp.com/courses/manipulating-time-series-data-in-python/working-with-time-series-in-pandas?ex=1): explains **<span style='color:lightseagreen'>TimeStamp and Period</span>** data.
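
The difference between the two types can be sketched as follows. The dates are arbitrary examples, and the variable names are only for illustration.

```python
import pandas as pd

# A Timestamp is a single point in time
ts = pd.Timestamp("2022-03-15 08:30")

# A Period is an interval; here, the whole month of March 2022
period = pd.Period("2022-03", freq="M")

# Check whether the timestamp falls inside the period
in_period = period.start_time <= ts <= period.end_time
print(in_period)

# Convert between the two representations
ts_as_period = ts.to_period("M")
period_as_ts = period.to_timestamp()  # start of the period by default
```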

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.5 | Using date_range</b></p>
</div>

`date_range` is a pandas function that returns a fixed-**<span style='color:lightseagreen'>frequency DatetimeIndex</span>**. It is quite useful when creating your own time series attribute for pre-existing data, or when arranging the whole dataset around a time series attribute you created.
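
For example, a fixed-frequency index can be built and attached to existing values as sketched below. The 20-minute step, the start date, and the values are illustrative choices, not properties read from the dataset.

```python
import pandas as pd

# Six timestamps spaced 20 minutes apart (illustrative frequency)
idx = pd.date_range(start="2022-03-01 00:00", periods=6, freq="20min")

# Attach the index to pre-existing values (made up)
series = pd.Series([40, 42, 45, 43, 41, 39], index=idx, name="congestion")
print(series)
```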

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.6 | Acknowledgements</b></p>
</div>

1. [Geunho Yu Kernel](https://www.kaggle.com/zhangcheche/xgboost-tps-2022-03)
2. [Sy-Tuan Nguyen Kernel](https://www.kaggle.com/sytuannguyen/tps-mar-2022-eda-model)

In [None]:
#%%capture
#!pip install -U lightautoml
import os
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_profiling as pp
import seaborn as sns
from IPython.display import display
from pandas.api.types import CategoricalDtype

from category_encoders import MEstimateEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Algorithms
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# Optuna - Bayesian Optimization 
import optuna
from optuna.samplers import TPESampler

# Plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go

warnings.filterwarnings('ignore')

def load_data():
    data_dir = Path("../input/tabular-playground-series-mar-2022")
    df_train = pd.read_csv(data_dir / "train.csv")
    df_test = pd.read_csv(data_dir / "test.csv")
    # Merge the splits so we can process them together
    df = pd.concat([df_train, df_test])
    return df

def plot_feature_importance(importance,names,model_type):
    
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    
    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)
    
    #Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    
    #Define size of bar plot
    plt.figure(figsize=(20,10))
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

df_data = load_data()
pp.ProfileReport(df_data)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.7 | Reducing Memory Usage</b></p>
</div>

As the previous report shows, the dataset contains more than **<span style='color:lightseagreen'>800k rows</span>**. To avoid **<span style='color:lightseagreen'>issues with memory</span>** in the kernel, we reduce memory usage with the following function, which downcasts each numeric column to a smaller type whenever its value range allows. The output below shows that the reduction was successful: we achieve a **<span style='color:lightseagreen'>reduction of 39.7%</span>**.

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes

        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2

    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
 
    return df

df_data = reduce_mem_usage(df_data)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.8 | Datetime Features</b></p>
</div>

As the previous report shows, the dataset has no missing values, so we proceed directly to feature engineering. We'll start by **<span style='color:lightseagreen'>breaking down the date</span>** into different columns:

* One for the **<span style='color:lightseagreen'>year</span>**
* One for the **<span style='color:lightseagreen'>month</span>**
* One for the **<span style='color:lightseagreen'>week</span>**
* One for the **<span style='color:lightseagreen'>hour</span>** and one for the **<span style='color:lightseagreen'>minute</span>**
* One for the **<span style='color:lightseagreen'>day of the week</span>**
* One for the **<span style='color:lightseagreen'>day of the year</span>**
* One for **<span style='color:lightseagreen'>weekend</span>**

In [None]:
df_data.time = pd.to_datetime(df_data.time)
df_data['year'] = df_data.time.dt.year
df_data['month'] = df_data.time.dt.month
df_data['week'] = df_data.time.dt.isocalendar().week
df_data['hour'] = df_data.time.dt.hour
df_data['minute'] = df_data.time.dt.minute
df_data['day_of_week'] = df_data.time.dt.day_name()
df_data['day_of_year'] = df_data.time.dt.dayofyear
df_data['is_weekend'] = (df_data.time.dt.dayofweek >= 5).astype("int")
df_data = df_data.set_index('time')

# <b>2 <span style='color:lightseagreen'>|</span> Exploratory Data Analysis</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.1 | General Congestion Analysis</b></p>
</div>

In [None]:
from itertools import cycle
palette = cycle(px.colors.sequential.Viridis)
df_graph = df_data[df_data['congestion'].isnull() == False]

# Defining all our palette colours.
primary_blue = "#496595"
primary_blue2 = "#85a1c1"
primary_blue3 = "#3f4d63"
primary_grey = "#c6ccd8"
primary_black = "#202022"
primary_bgcolor = "#f4f0ea"

# "coffee" palette, turquoise-gold.
f1 = "#a2885e"
f2 = "#e9cf87"
f3 = "#f1efd9"
f4 = "#8eb3aa"
f5 = "#235f83"
f6 = "#b4cde3"

# chart
fig = make_subplots(rows=3, cols=2, 
                    specs=[[{"type": "bar"}, {"type": "scatter"}], [{"colspan": 2}, None], [{'type':'histogram'}, {'type':'bar'}]],
                    column_widths=[0.4, 0.6], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=("Mean Congestion per Day of Week", "Hourly Congestion Trend", "Daily Congestion Trend", "Congestion Distribution","Congestion Value Counts"))

# Upper Left chart
df_day = df_graph.groupby(['day_of_week']).agg({"congestion" : "mean"}).reset_index().sort_values(by='congestion', ascending = False)
values = list(range(7))
fig.add_trace(go.Bar(x=df_day['day_of_week'], y=df_day['congestion'], marker = dict(color=values, colorscale="Viridis"), 
                     name = 'Day of Week'),
                      row=1, col=1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

# Upper Right chart
df_hour = df_graph.groupby(['hour']).agg({"congestion" : "mean"}).reset_index('hour')
fig.add_trace(go.Scatter(x=df_hour['hour'], y=df_hour['congestion'], mode='lines+markers',
                 marker = dict(color = primary_blue3), name='Hourly Congestion'), row = 1, col = 2)

# Rectangle to highlight range
fig.add_vrect(x0=12.5, x1=18.5,
              fillcolor=px.colors.sequential.Viridis[4],
              layer="below", 
              opacity=0.25, 
              line_width=0, 
              row = 1, col = 2
)

fig.add_annotation(dict(
        x=7.9,
        y=df_hour.loc[8,'congestion']+0.45,
        text="There is a <b>peak at <br>8am</b> coinciding with<br>going to work.",
        ax="-20",
        ay="-60",
        showarrow = True,
        arrowhead = 7,
        arrowwidth = 0.7
), row=1, col=2)

fig.add_annotation(dict(
        x=15.50,
        y=49,
        text="Midday hours are <br><b>the rush hours</b>.",
        showarrow = False
), row=1, col=2)

fig.add_annotation(dict(
        x=18.5,
        y=df_hour.loc[18,'congestion'],
        text="From 6pm <br>on <b>congestion<br> ratio falls</b>.",
        ax="50",
        ay="-40",
        showarrow = True,
        arrowhead = 7,
        arrowwidth = 0.7
), row=1, col=2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray', linewidth=2, row=1, col=2)

# Middle chart
df_week = df_graph.groupby(['day_of_year']).agg({"congestion" : "mean"}).reset_index()
fig.add_trace(go.Scatter(x = df_week['day_of_year'], y = df_week['congestion'], mode='lines',
                        marker = dict(color = px.colors.sequential.Viridis[5]),
                        name='Daily Congestion'), row = 2, col = 1)

from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(df_week['congestion'], period=61, model='additive', extrapolate_trend='freq')

fig.add_trace(go.Scatter(x = df_week['day_of_year'], y = decomp.trend, mode='lines',
                        marker = dict(color = primary_blue3),
                        name='Congestion Trend'), row = 2, col = 1)
    
fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, row=2, col=1)
fig.update_yaxes(gridcolor = 'gray', gridwidth = 0.15, linecolor='gray',linewidth=2, row=2, col=1)

# Left Bottom Chart
fig.add_trace(go.Histogram(x=df_graph.congestion, name='Congestion Distribution', marker = dict(color = px.colors.sequential.Viridis[3])), row = 3, col = 1)

fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row = 3, col = 1)
fig.update_yaxes(showgrid = False, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row = 3, col = 1)

# Right Bottom Chart
con_bar = df_graph.copy()
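# include_lowest=True keeps congestion == 0 in the first bin instead of dropping it as NaN
con_bar['congestion_group'] = pd.cut(con_bar.congestion, bins=[0,20,40,60,80,100], labels=['0-20', '20-40', '40-60', '60-80', '80-100'], include_lowest=True)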
con_bar = con_bar.groupby('congestion_group').agg({'congestion':'count'}).reset_index().sort_values(by='congestion')

values = list(range(5))
fig.add_trace(go.Bar(x=con_bar['congestion'], y=con_bar['congestion_group'], marker = dict(color=values, colorscale="Viridis_r"), name = 'Congestion Values', orientation = 'h'),
                      row=3, col=2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=3, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=3, col=2)

fig.add_annotation(dict(
        x=con_bar.loc[4,'congestion']+0.15,
        y=0,
        text="Highest congestion <br>ratios are <b> more unusual</b>.",
        ax="110",
        ay="-20",
        showarrow = True,
        arrowhead = 7,
        arrowwidth = 0.7
), row=3, col=2)

# General Styling
fig.update_layout(height=1100, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Congestion Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

📌 **Interpret:** As we can see, the **<span style='color:lightseagreen'>working days</span>** of the week have a similar congestion rate, while the weekend days have the least traffic, with Sunday being the quietest day. Moving to the upper right chart, we observe an **<span style='color:lightseagreen'>increase</span>** in traffic at the **<span style='color:lightseagreen'>beginning of the day</span>**. The busiest hours are between **<span style='color:lightseagreen'>13h - 17h</span>**, after which the congestion rate decreases as night falls. In the middle chart it is easy to see a **<span style='color:lightseagreen'>strong seasonality with respect to congestion rate per week</span>**. Moreover, the trend remains almost constant, increasing only slightly over time. Finally, at the bottom, we can see that the congestion feature is approximately **<span style='color:lightseagreen'>normally distributed</span>** (as the initial report showed), and that the most frequent congestion values fall in the **<span style='color:lightseagreen'>40-60 category</span>**.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.2 | Daily Congestion Analysis</b></p>
</div>

In [None]:
df_graph = df_data[df_data['congestion'].isnull() == False]

# chart
fig = make_subplots(rows=2, cols=3, 
                    specs=[[{"type": "scatter"}, {"type": "scatter"}, {'type':'scatter'}], [{"type": "scatter"}, {"type": "scatter"}, {'type':'scatter'}]],
                    column_widths=[0.33, 0.33, 0.34], vertical_spacing=0.125, horizontal_spacing=0.1,
                    subplot_titles=("Monday Congestion", "Tuesday Congestion", "Wednesday Congestion",'Thursday Congestion','Friday Congestion','Weekend Congestion'))

# Upper Left chart
df_monday = df_graph[df_graph.day_of_week == 'Monday'].groupby(['hour']).agg({"congestion" : "mean"}).reset_index()
fig.add_trace(go.Scatter(x=df_monday['hour'], y=df_monday['congestion'], mode='lines+markers',
                 marker = dict(color = px.colors.sequential.Viridis[0]), name='Monday Congestion'), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, row=1, col=1)

# Upper Middle chart
df_tuesday = df_graph[df_graph.day_of_week == 'Tuesday'].groupby(['hour']).agg({"congestion" : "mean"}).reset_index()
fig.add_trace(go.Scatter(x=df_tuesday['hour'], y=df_tuesday['congestion'], mode='lines+markers',
                 marker = dict(color = px.colors.sequential.Viridis[2]), name='Tuesday Congestion'), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, row=1, col=2)

# Upper Right chart
df_wednesday = df_graph[df_graph.day_of_week == 'Wednesday'].groupby(['hour']).agg({"congestion" : "mean"}).reset_index()
fig.add_trace(go.Scatter(x=df_wednesday['hour'], y=df_wednesday['congestion'], mode='lines+markers',
                 marker = dict(color = px.colors.sequential.Viridis[4]), name='Wednesday Congestion'), row = 1, col = 3)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=3)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, row=1, col=3)

# Bottom Left chart
df_thursday = df_graph[df_graph.day_of_week == 'Thursday'].groupby(['hour']).agg({"congestion" : "mean"}).reset_index()
fig.add_trace(go.Scatter(x=df_thursday['hour'], y=df_thursday['congestion'], mode='lines+markers',
                 marker = dict(color = px.colors.sequential.Viridis[6]), name='Thursday Congestion'), row = 2, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=2, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, row=2, col=1)

# Bottom Middle chart
df_friday = df_graph[df_graph.day_of_week == 'Friday'].groupby(['hour']).agg({"congestion" : "mean"}).reset_index()
fig.add_trace(go.Scatter(x=df_friday['hour'], y=df_friday['congestion'], mode='lines+markers',
                 marker = dict(color = px.colors.sequential.Viridis[9]), name='Friday Congestion'), row = 2, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=2, col=2)
fig.update_yaxes(showgrid = False,linecolor='gray', linewidth=2, row=2, col=2)

# Bottom Right chart
df_weekend = df_graph[df_graph.is_weekend == True].groupby(['hour']).agg({"congestion" : "mean"}).reset_index()
df_saturday = df_graph[df_graph.day_of_week == 'Saturday'].groupby(['hour']).agg({"congestion" : "mean"}).reset_index()
df_sunday = df_graph[df_graph.day_of_week == 'Sunday'].groupby(['hour']).agg({"congestion" : "mean"}).reset_index()

fig.add_trace(go.Scatter(x=df_weekend['hour'], y=df_weekend['congestion'], mode='lines+markers',
                 marker = dict(color = px.colors.sequential.Viridis[6]), name='Weekend Average Congestion'), row = 2, col = 3)
fig.add_trace(go.Scatter(x=df_saturday['hour'], y=df_saturday['congestion'], mode='lines+markers',
                 marker = dict(color = px.colors.sequential.Viridis[3]), name='Saturday Congestion'), row = 2, col = 3)
fig.add_trace(go.Scatter(x=df_sunday['hour'], y=df_sunday['congestion'], mode='lines+markers',
                 marker = dict(color = px.colors.sequential.Viridis[9]), name='Sunday Congestion'), row = 2, col = 3)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=2, col=3)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, row=2, col=3)

# General Styling
fig.update_layout(height=750, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Daily Congestion Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

📌 **Interpret:** As we can see, on **<span style='color:lightseagreen'>working days</span>** the congestion rate at each hour is quite similar from one day to the next. However, this changes at the weekend: because most people do not have to work, congestion rates are much lower. Moreover, the weekend congestion curve does not have as many ups and downs as the working-day curves.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.3 | Direction and Location Analysis</b></p>
</div>

In [None]:
palette = cycle(px.colors.sequential.Sunsetdark)
df_graph = df_data[df_data['congestion'].isnull() == False]

# chart
fig = make_subplots(rows=1, cols=3, 
                    specs=[[{"type": "bar"}, {"type": "bar"}, {'type':'bar'}]],
                    column_widths=[0.33, 0.34, 0.33], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=("Mean Congestion per Direction", "Mean Congestion Per X Location", "Mean Congestion per Y Location"))

# Left chart
df_direction = df_graph.groupby(['direction']).agg({"congestion" : "mean"})
values = list(range(8))
fig.add_trace(go.Bar(x=df_direction.index, y=df_direction['congestion'], marker = dict(color=values, colorscale="Viridis"), name = 'Direction'),
                      row=1, col=1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

# Middle chart
df_x = df_graph.groupby(['x']).agg({"congestion" : "mean"})
values = list(range(3))
fig.add_trace(go.Bar(x=df_x.index, y=df_x['congestion'], marker = dict(color=values, colorscale="Viridis"), name = 'X Location'),
                      row=1, col=2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# Right chart
df_y = df_graph.groupby(['y']).agg({"congestion" : "mean"})
values = list(range(4))
fig.add_trace(go.Bar(x=df_y.index, y=df_y['congestion'], marker = dict(color=values, colorscale="Viridis"), name = 'Y Location'),
                      row=1, col=3)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=3)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=3)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Direction and Location Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

📌 **Interpret:** As we can see, the direction with the highest congestion rate is **<span style='color:lightseagreen'>south bound</span>**. In the middle chart we observe that the **<span style='color:lightseagreen'>X = 1 location</span>** is the busiest, while X = 0 has the least traffic. Finally, on the right we see that the **<span style='color:lightseagreen'>Y = 0 and Y = 2 locations</span>** are the busiest, and the gap between these two and the other Y locations is noticeable.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.4 | Outliers</b></p>
</div>

### 2.4.1 | Outliers Definition
An outlier is an observation that is numerically distant from the rest of the data; put simply, it is a value that falls outside the expected range. Comparing a data set with and without outliers shows how strongly they can affect summary statistics.
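
A small sketch of that comparison, using a made-up sample in which the last value is deliberately extreme. The numbers and variable names are illustrative only.

```python
import numpy as np

# Made-up sample of typical values
clean = np.array([48, 50, 52, 49, 51])

# The same sample with one deliberately extreme value appended
with_outlier = np.append(clean, 100)

# A single outlier shifts the mean and inflates the standard deviation
print("Without outlier:", clean.mean(), clean.std())
print("With outlier:   ", with_outlier.mean(), with_outlier.std())
```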

### 2.4.2 | Outliers Detection

Outliers can be of two types: univariate and multivariate. Univariate outliers can be found by looking at the distribution of a single variable, whereas multivariate outliers are outliers in an n-dimensional space. We'll start by detecting whether there are outliers in our dataset.

#### 2.4.2.1 | Grubbs Test

Grubbs' test is defined for the hypotheses:

$$
\begin{array}{ll}
H_{0}: & \text{There are no outliers in the data set} \\
H_{1}: & \text{There is exactly one outlier in the data set}
\end{array}
$$

The Grubbs' test statistic is defined as:

$$
G_{calculated}=\frac{\max \left|X_{i}-\overline{X}\right|}{SD}
$$

with $\overline{X}$ and $SD$ denoting the sample mean and standard deviation, respectively. The critical value is:

$$
G_{critical}=\frac{(N-1)}{\sqrt{N}} \sqrt{\frac{\left(t_{\alpha /(2 N), N-2}\right)^{2}}{N-2+\left(t_{\alpha /(2 N), N-2}\right)^{2}}}
$$

If the calculated value is greater than the critical value, you can reject the null hypothesis and conclude that one of the values is an outlier.

In [None]:
import scipy.stats as stats

def grubbs_test(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean_x = np.mean(x)
    # Grubbs' test uses the sample standard deviation (ddof=1)
    sd_x = np.std(x, ddof=1)
    g_calculated = np.max(np.abs(x - mean_x)) / sd_x
    print("Grubbs Calculated Value:", g_calculated)
    # Two-sided critical value from the t-distribution with n - 2 degrees of freedom
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = ((n - 1) / np.sqrt(n)) * np.sqrt(np.square(t_value) / (n - 2 + np.square(t_value)))
    print("Grubbs Critical Value:", g_critical)
    if g_critical > g_calculated:
        print("From grubbs_test we observe that the calculated value is less than the critical value. We fail to reject the null hypothesis and conclude that there are no outliers\n")
    else:
        print("From grubbs_test we observe that the calculated value is greater than the critical value. We reject the null hypothesis and conclude that there is an outlier\n")

grubbs_test(df_data[df_data.congestion.notnull()]['congestion'])

#### 2.4.2.2 | Z-score method

Using the Z-score method, we can find out how many standard deviations a value lies away from the mean.

![minipic](https://i.pinimg.com/originals/cd/14/73/cd1473c4c82980c6596ea9f535a7f41c.jpg)

The figure above shows the area under the normal curve and how much of that area each number of standard deviations covers.
* 68% of the data points lie within ±1 standard deviation.
* 95% of the data points lie within ±2 standard deviations.
* 99.7% of the data points lie within ±3 standard deviations.

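The z-score of a data point is $Z=\frac{X_{i}-\overline{X}}{SD}$. A robust variant replaces the mean and standard deviation with the median and the median absolute deviation (MAD):

$R.Z.score=\frac{0.6745 \, (X_{i} - Median)}{MAD}$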

If the absolute z-score of a data point is greater than 3 (the ±3 standard deviation range covers 99.7% of the area), the value is quite different from the other values and is treated as an outlier.
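
The robust z-score formula above can be sketched as follows. This is only an illustration on made-up values: the `robust_zscore` helper is hypothetical and is not part of this kernel, and the threshold of 3 simply mirrors the rule stated above. Note that the sketch assumes the MAD is non-zero.

```python
import numpy as np

def robust_zscore(values):
    """Robust z-score: 0.6745 * (x - median) / MAD (assumes MAD != 0)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    return 0.6745 * (values - median) / mad

# Made-up sample; the last value is deliberately extreme
sample = np.array([48, 50, 52, 49, 51, 100])
scores = robust_zscore(sample)

# Flag points whose absolute robust z-score exceeds the threshold
flags = np.abs(scores) > 3
print(flags)
```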

In [None]:
df_outlier = df_data.reset_index().set_index('row_id').copy()

def Zscore_outlier(series, threshold=3):
    # Z-score of every value, computed in one vectorized step
    z = (series - series.mean()) / series.std(ddof=0)
    # Return the index labels (row_id) of the outliers rather than their positions
    return series.index[np.abs(z) > threshold].tolist()

outliers_index = Zscore_outlier(df_outlier[df_outlier.congestion.notnull()]['congestion'])

In [None]:
df_outlier['outlier'] = 0
df_outlier.loc[outliers_index,'outlier'] = 1

fig = px.scatter(df_outlier, x=df_outlier.index, y='congestion', color = 'outlier')
fig.update_xaxes(visible = False, zeroline = False)
fig.update_yaxes(showgrid = False, gridcolor = 'gray', gridwidth = .5, zeroline = False)

#General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Congestion Outliers Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

# <b>3 <span style='color:lightseagreen'>|</span> Feature Engineering</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.1 | Dataset Score</b></p>
</div>

Next, we define a baseline score that will help us judge whether a set of features we've assembled has actually led to any **<span style='color:lightseagreen'>improvement</span>**. In this first step, we are going to drop rows containing outliers (after comparison, the dataset score is better without outliers).

In [None]:
def score_dataset(X, y, model=XGBRegressor(tree_method='gpu_hist', predictor='gpu_predictor'), model_2 = CatBoostRegressor(task_type = 'GPU', silent=True)):
#def score_dataset(X, y, model=XGBRegressor(), model_2 = CatBoostRegressor(silent=True)):
    # Label encoding is good for XGBoost and RandomForest, but one-hot
    # would be better for models like Lasso or Ridge. `LabelEncoder`
    # maps each category level to an integer code.
    for colname in X.select_dtypes(["object"]).columns:
        X[colname] = LabelEncoder().fit_transform(X[colname])
    X['week'] = X['week'].astype(int)
    X = X.drop('row_id',axis=1)
    # Metric for TPS Mar22 competition is MAE (Mean Absolute Error)
    score_xgb = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error", n_jobs=-1
    )
    
    score_cat = cross_val_score(
        model_2, X, y, cv=5, scoring="neg_mean_absolute_error", n_jobs=-1
    )
    
    score = -0.5 * (score_xgb.mean() + score_cat.mean())
    return score

#df_data = df_data.reset_index().set_index('row_id')
#df_data = df_data.drop(outliers_index,axis=0)
#df_data = df_data.reset_index().set_index('time')

x = df_data[df_data['congestion'].isnull() == False].copy()
y = pd.DataFrame(x.pop('congestion'))

baseline_score = score_dataset(x, y)
print(f"Baseline score: {baseline_score:.5f} MAE")

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.2 | Cyclical Features</b></p>
</div>

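Since map directions are **<span style='color:lightseagreen'>cyclical</span>**, we can try to capture this by breaking the direction into **<span style='color:lightseagreen'>sin and cos components</span>**. Treating each direction as a compass bearing measured clockwise from north, the sine is its east-west component and the cosine its north-south component. I hand-code the values for the four cardinal directions to avoid floating-point noise.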

In [None]:
from math import sin, cos, pi, exp
sin_vals = {
    'NB': 0.0,
    'NE': sin(1 * pi/4),
    'EB': 1.0,
    'SE': sin(3 * pi/4),
    'SB': 0.0,
    'SW': sin(5 * pi/4),    
    'WB': -1.0,    
    'NW': sin(7 * pi/4),  
}

cos_vals = {
    'NB': 1.0,
    'NE': cos(1 * pi/4),
    'EB': 0.0,
    'SE': cos(3 * pi/4),
    'SB': -1.0,
    'SW': cos(5 * pi/4),    
    'WB': 0.0,    
    'NW': cos(7 * pi/4),  
}

df_data['sin'] = df_data['direction'].map(sin_vals)
df_data['cos'] = df_data['direction'].map(cos_vals)
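
As an alternative to hand-coding, the same table can be derived from compass bearings and rounded to suppress the floating-point noise. This is only a sketch: the `BEARINGS` table, the `encode_direction` helper and the rounding precision are illustrative assumptions, not part of the original kernel.

```python
from math import sin, cos, radians

# Compass bearings in degrees, measured clockwise from north. This convention
# is an assumption chosen to match the hand-coded tables above
# (NB -> sin 0, cos 1; EB -> sin 1, cos 0).
BEARINGS = {'NB': 0, 'NE': 45, 'EB': 90, 'SE': 135,
            'SB': 180, 'SW': 225, 'WB': 270, 'NW': 315}

def encode_direction(direction, ndigits=12):
    """Return (sin, cos) of a direction label, rounded to remove float noise."""
    angle = radians(BEARINGS[direction])
    # Adding 0.0 turns a -0.0 produced by rounding into a plain 0.0.
    return round(sin(angle), ndigits) + 0.0, round(cos(angle), ndigits) + 0.0

encoded = {d: encode_direction(d) for d in BEARINGS}
print(encoded['WB'])
```

The resulting dictionary could be split into two mappings and passed to `Series.map` in the same way as `sin_vals` and `cos_vals`.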

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.2 | Geography and Direction</b></p>
</div>

As congestion is measured for **<span style='color:lightseagreen'>certain points and directions in space</span>**, we can expect the congestion at one point to predict future congestion at the **<span style='color:lightseagreen'>nearest point</span>** in the given direction. For instance, congestion at (0, 1, eastbound) should be correlated with future congestion at (1, 1, ...). The correlation goes both ways, so we can also try to predict **<span style='color:lightseagreen'>backwards</span>**: congestion at (1, 1, eastbound) should be correlated with past congestion at (0, 1, ...).

At first sight, it is unclear whether we need the geography at all. A simple approach for the competition would be to ignore the geography and create 65 independent time series. Another simple approach is to one-hot encode the 65 position/direction combinations and use them as features. Note also that, although the y coordinate in the diagram grows from bottom to top (south to north), we can't take this for granted; it may instead grow from top to bottom.
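
To make the "nearest point in the given direction" idea concrete, here is a minimal sketch of a neighbour lookup. The `STEPS` table and both helper functions are hypothetical, and the orientation of the y axis is exposed as a parameter precisely because it is uncertain. The returned point may lie outside the grid, which a real feature would have to handle.

```python
# Unit steps on the grid for each direction label (hypothetical helper table).
# The sign of the y step assumes that y grows northwards.
STEPS = {
    'NB': (0, 1), 'NE': (1, 1), 'EB': (1, 0), 'SE': (1, -1),
    'SB': (0, -1), 'SW': (-1, -1), 'WB': (-1, 0), 'NW': (-1, 1),
}

def downstream_point(x, y, direction, y_grows_north=True):
    """Grid point that traffic at (x, y) heading `direction` reaches next."""
    dx, dy = STEPS[direction]
    if not y_grows_north:
        dy = -dy
    return x + dx, y + dy

def upstream_point(x, y, direction, y_grows_north=True):
    """Grid point that traffic at (x, y) heading `direction` came from."""
    dx, dy = STEPS[direction]
    if not y_grows_north:
        dy = -dy
    return x - dx, y - dy

print(downstream_point(0, 1, 'EB'))  # the example from the text
print(upstream_point(1, 1, 'EB'))
```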

### 3.2.1 | Roadway
Let's start by creating a feature that allows us to **<span style='color:lightseagreen'>differentiate each of the possible roadways</span>**, taking into account the x and y coordinates as well as the direction. We then draw a box plot of congestion per roadway to check whether this variable is useful.

In [None]:
x = df_data[df_data['congestion'].isnull() == False].copy()
y = pd.DataFrame(x.pop('congestion'))
baseline_score = score_dataset(x, y)
print(f"Baseline score: {baseline_score:.5f} MAE")

In [None]:
df_data['roadway'] = df_data.x.astype(str) + df_data.y.astype(str) + df_data.direction.astype(str)
px.box(df_data[df_data.congestion.isnull() == False], x="roadway", y="congestion", color = 'roadway')

📌 **Interpret:** As we can see, the feature we have just created allows us to make a good **<span style='color:lightseagreen'>distinction</span>** between the congestion levels at each location. In other words, this variable will be useful for us.
### 3.2.2 | Mathematical Transformations
Let's keep going with our feature engineering. Now we add features based on mathematical transformations. We **<span style='color:lightseagreen'>multiply</span>** the x coordinate by the cosine of the direction and the y coordinate by its sine, and then multiply each product by the hour.

In [None]:
df_data['x_cos_hour'] = df_data.x * df_data.cos * df_data.hour
df_data['y_sen_hour'] = df_data.y * df_data.sin * df_data.hour

df_data = df_data.drop(['year','x','y','direction'], axis=1)

x = df_data[df_data['congestion'].isnull() == False].copy()
y = pd.DataFrame(x.pop('congestion'))
baseline_score = score_dataset(x, y)
print(f"Baseline score: {baseline_score:.5f} MAE")

### 3.2.3 | Mean, median, maximum, minimum congestion per roadway / time
Next, we create **<span style='color:lightseagreen'>statistical</span>** features from the congestion target. They are **<span style='color:lightseagreen'>grouped by</span>** roadway and by time (day of week, hour and minute). We are going to create the following features: 
* Mean congestion
* Median congestion
* Minimum congestion
* Maximum congestion

In [None]:
df_data = df_data.reset_index()
keys = ['roadway', 'day_of_week','hour', 'minute']

# Aggregate the target once per (roadway, day_of_week, hour, minute) group.
# Rows with a missing congestion value are skipped by the aggregations.
stats = (
    df_data.groupby(by=keys)['congestion']
    .agg(['mean', 'median', 'min', 'max'])
    .rename(columns=lambda name: f'{name} congestion')
    .reset_index()
)

# Attach 'mean congestion', 'median congestion', 'min congestion' and
# 'max congestion' to every row of the matching group.
df_data = df_data.merge(stats, how='left', on=keys)

df_data = df_data.set_index('time')

x = df_data[df_data['congestion'].isnull() == False].copy()
y = pd.DataFrame(x.pop('congestion'))
baseline_score = score_dataset(x, y)
print(f"Baseline score: {baseline_score:.5f} MAE")
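
To see what these group statistics look like on individual rows, here is a small self-contained sketch on made-up data. The `toy` frame and its values are purely illustrative; it uses `groupby(...).transform`, which broadcasts each group's statistic back to every row of the group, including rows whose target is missing (as is the case for the test rows).

```python
import pandas as pd

# Made-up data: two roadways observed at the same hour; None marks a row
# without a known target, like the rows of the test set.
toy = pd.DataFrame({
    'roadway':    ['00EB', '00EB', '00EB', '21NB', '21NB'],
    'hour':       [8, 8, 8, 8, 8],
    'congestion': [40.0, 60.0, None, 10.0, 30.0],
})
keys = ['roadway', 'hour']

grouped = toy.groupby(keys)['congestion']
for stat in ['mean', 'median', 'min', 'max']:
    # transform() returns one value per original row, aligned on the index.
    toy[f'{stat} congestion'] = grouped.transform(stat)

print(toy)
```

Note that statistics computed from the target over all training rows also see the rows later used for validation, so cross-validated scores obtained with such features may look better than they really are.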

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.3 | Mutual Information</b></p>
</div>

Mutual information describes **<span style='color:lightseagreen'>relationships</span>** in terms of **<span style='color:lightseagreen'>uncertainty</span>**. The mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. If you knew the value of a feature, how much more confident would you be about the target? Scikit-learn has two mutual information **<span style='color:lightseagreen'>metrics</span>** in its feature_selection module: one for **<span style='color:lightseagreen'>real-valued targets</span>** (mutual_info_regression) and one for **<span style='color:lightseagreen'>categorical targets</span>** (mutual_info_classif). Our target, congestion, takes numeric values from 0 to 100, so we treat it as real-valued and use mutual_info_regression. The next cell computes the MI scores for our features and wraps them up in a nice dataframe.

In [None]:
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object"]):
        X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    #discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    mi_scores = mutual_info_regression(X, y, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

y = df_data[df_data['congestion'].isnull() == False]['congestion']
x = df_data[df_data['congestion'].isnull() == False].drop('congestion', axis=1)
mi_scores = make_mi_scores(x, y)
mi_scores = pd.DataFrame(mi_scores).reset_index().rename(columns={'index':'Feature'})
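
For intuition about what an MI score measures, the definition can be evaluated by hand for two discrete variables. The sketch below is not what `mutual_info_regression` does internally (scikit-learn uses a nearest-neighbour based estimator for continuous data); the function name and the toy sequences are illustrative assumptions.

```python
from collections import Counter
from math import log

def discrete_mutual_information(xs, ys):
    """Mutual information, in nats, between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += p_xy * log(p_xy * n * n / (count_x[x] * count_y[y]))
    return mi

# Toy data: `period` determines `level` exactly, while `coin` is unrelated to it.
period = ["rush", "rush", "night", "night", "rush", "night", "rush", "night"]
level  = ["high", "high", "low", "low", "high", "low", "high", "low"]
coin   = ["h", "t", "h", "t", "h", "h", "t", "t"]

mi_informative = discrete_mutual_information(period, level)
mi_uninformative = discrete_mutual_information(coin, level)
print(mi_informative, mi_uninformative)
```

A feature that fully determines the target reaches the entropy of the target (here log 2), while an unrelated feature scores zero.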

In [None]:
fig = px.bar(mi_scores, x='MI Scores', y='Feature', color="MI Scores",
             color_continuous_scale='darkmint')
fig.update_layout(height = 750, title_text="Mutual Information Scores",
                  title_font=dict(size=29, family="Lato, sans-serif"), xaxis={'categoryorder':'category ascending'}, margin=dict(t=80))

In [None]:
qualitative = [col for col in df_data if df_data[col].dtype == 'object']
for feature in qualitative:
    df_data[feature] = LabelEncoder().fit_transform(df_data[feature])
df_data = reduce_mem_usage(df_data)

df_data = df_data.drop(['month','minute','week','day_of_week','is_weekend','day_of_year','cos','sin'],axis=1)

df_train = df_data[df_data.congestion.isnull() == False]
df_test = df_data[df_data.congestion.isnull() == True]

from sklearn.model_selection import train_test_split
X = df_train.drop(['congestion','row_id'],axis = 1)
y = df_train['congestion']

# <b>4 <span style='color:lightseagreen'>|</span> Modeling</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.1 | Catboost</b></p>
</div>

### 4.1.1 | Hyperparameter Tuning - Optuna

For CatBoost only, we do the **<span style='color:lightseagreen'>tuning with Optuna</span>**. The code for hyperparameter tuning is included below. However, to avoid **<span style='color:lightseagreen'>wasting CPU time</span>**, since I have already run it once, I simply create the model with the hyperparameter values it found. The **<span style='color:lightseagreen'>allow_optimize</span>** flag controls whether the tuning is run. Finally, note that the tuning code takes a long time to run, which is why I enabled the GPU.

In [None]:
from sklearn.metrics import mean_absolute_error as mae
def objective(trial):
    params = {
        "random_state":trial.suggest_categorical("random_state", [2022]),
        'learning_rate' : trial.suggest_float('learning_rate', 0.0001, 0.3, log=True),
        'bagging_temperature' :trial.suggest_float('bagging_temperature', 0.01, 100.00, log=True),
        "n_estimators": 1000,
        "max_depth":trial.suggest_int("max_depth", 4, 16),
        'random_strength' :trial.suggest_int('random_strength', 0, 100),
        "l2_leaf_reg":trial.suggest_float("l2_leaf_reg",1e-8,3e-5),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "max_bin": trial.suggest_int("max_bin", 200, 500),
        'od_type': trial.suggest_categorical('od_type', ['IncToDec', 'Iter']),
        'task_type': trial.suggest_categorical('task_type', ['GPU']),
        'loss_function': trial.suggest_categorical('loss_function', ['MAE']),
        'eval_metric': trial.suggest_categorical('eval_metric', ['MAE'])
    }

    model = CatBoostRegressor(**params)
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(
        X_train_tmp, y_train_tmp,
        eval_set=[(X_valid_tmp, y_valid_tmp)],
        early_stopping_rounds=35, verbose=0
    )
        
    y_train_pred = model.predict(X_train_tmp)
    y_valid_pred = model.predict(X_valid_tmp)
    train_mae = mae(y_train_tmp, y_train_pred)
    valid_mae = mae(y_valid_tmp, y_valid_pred)
    
    print(f'MAE of Train: {train_mae}')
    print(f'MAE of Validation: {valid_mae}')
    
    return valid_mae

allow_optimize = 0

In [None]:
TRIALS = 100
TIMEOUT = 3600

if allow_optimize:
    sampler = TPESampler(seed=42)

    study = optuna.create_study(
        study_name = 'cat_parameter_opt',
        direction = 'minimize',
        sampler = sampler,
    )
    study.optimize(objective, n_trials=TRIALS, timeout=TIMEOUT)
    print("Best Score:",study.best_value)
    print("Best trial",study.best_trial.params)
    
    best_params = study.best_params
    
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model_tmp = CatBoostRegressor(**best_params, n_estimators=30000, verbose=1000).fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)

### 4.1.2 | Fitting - Feature Importances

In [None]:
if allow_optimize:
    model = CatBoostRegressor(**best_params, n_estimators=model_tmp.get_best_iteration(), verbose=1000).fit(X, y)
else:
    model = CatBoostRegressor(
        verbose=1000,
        early_stopping_rounds=10,
        #iterations=5000,
        random_state = 2022, learning_rate = 0.0824038781081412, bagging_temperature = 0.03568558360430449, max_depth = 16, 
        random_strength = 47, l2_leaf_reg = 7.459775961819184e-06, min_child_samples = 49, max_bin = 320, od_type = 'Iter', 
        task_type = 'GPU', loss_function = 'MAE', eval_metric = 'MAE'
    ).fit(X, y)    

In [None]:
plot_feature_importance(model.get_feature_importance(),X.columns,'CatBoost')

### 4.1.3 | Making Predictions

In [None]:
x_test = df_test.drop(['congestion','row_id'],axis = 1).copy()
predictions = model.predict(x_test)
submit_cat = pd.DataFrame({'row_id':df_test.row_id, 'congestion':predictions})
submit_cat = submit_cat.reset_index().drop('time',axis=1).set_index('row_id')

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.2 | XGBoost</b></p>
</div>

### 4.2.1 | Hyperparameter Tuning - GridSearch / RandomizedSearch

The **<span style='color:lightseagreen'>grid search</span>** approach is fine when you are exploring relatively **<span style='color:lightseagreen'>few combinations</span>**. However, when the hyperparameter **<span style='color:lightseagreen'>search space is large</span>**, it is often preferable to use **<span style='color:lightseagreen'>RandomizedSearchCV</span>** instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a **<span style='color:lightseagreen'>random</span>** value for each hyperparameter at every iteration. This approach has two main benefits:

1. If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
2. You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
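
The randomized variant described above can be sketched as follows. This is an illustration under stated assumptions, not the tuning used in this kernel: the data is synthetic, scikit-learn's `GradientBoostingRegressor` stands in for `XGBRegressor` so that the example runs without a GPU, and the parameter ranges are arbitrary.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Distributions to sample from instead of fixed grids.
param_distributions = {
    "max_depth": randint(2, 6),          # integers 2..5
    "min_samples_leaf": randint(1, 10),  # integers 1..9
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_distributions=param_distributions,
    n_iter=8,                            # the computing budget
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best CV MAE: {-search.best_score_:.4f}")
```

The `n_iter` argument is the budget knob from point 2. The cell below keeps `GridSearchCV`, which is reasonable for its small grid of 6 x 6 combinations.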

In [None]:
from sklearn.model_selection import GridSearchCV
allow_optimize = 0
if allow_optimize:
    param_grid={'max_depth': [4,5,6,7,8,9],
            #'n_estimators': [100,200,300,400,500,600,700,800,900,1000],
            'min_child_weight' : [1,2,3,4,5,6],
            'gpu_id' : [0]
        }

    regressor = XGBRegressor(tree_method = 'gpu_hist', predictor = 'gpu_predictor')
    CV_regressor = GridSearchCV(regressor, param_grid, cv=3, scoring="neg_mean_absolute_error", n_jobs= -1, return_train_score = True, verbose = 0)
    CV_regressor.fit(X, y)
    
    print("The best hyperparameters are : ","\n")
    print(CV_regressor.best_params_)

### 4.2.2 | Fitting - Feature Importance

In [None]:
if allow_optimize: 
    CV_regressor = CV_regressor.best_estimator_
else:
    CV_regressor = XGBRegressor(tree_method = 'gpu_hist', predictor = 'gpu_predictor', gpu_id = 0, max_depth = 4, n_estimators = 100)
CV_regressor.fit(X, y)

In [None]:
plot_feature_importance(CV_regressor.feature_importances_,X.columns,'XGBOOST')

### 4.2.3 | Making Predictions

In [None]:
predictions = CV_regressor.predict(x_test)
submit_xgb = pd.DataFrame({'row_id':df_test.row_id, 'congestion':predictions})
submit_xgb = submit_xgb.reset_index().drop('time',axis=1).set_index('row_id')

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.3 | Blending and Submitting</b></p>
</div>

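To finish this kernel, we blend the two models' predictions and submit. The blend below gives the CatBoost predictions a weight of 1 and the XGBoost predictions a weight of 0, so the submission is effectively CatBoost only. We'll be using [Martynov Andrey special value dataset](https://www.kaggle.com/martynovandrey/tps-mar-22-don-t-forget-special-values/data) in order to make our score even better: wherever it provides a value for a row, that value replaces our prediction. If you found this kernel interesting or if you think you learned something new, feel free to give it an **<span style='color:lightseagreen'>upvote</span>**. 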

In [None]:
submit = pd.DataFrame({'congestion': submit_cat['congestion']+0*submit_xgb['congestion']})
special = pd.read_csv('../input/tps-mar-22-special-values/special v2.csv', index_col="row_id")
special = special[['congestion']].rename(columns={'congestion':'special'})
submit = submit.merge(special, left_index=True, right_index=True, how='left')
submit['special'] = submit['special'].fillna(submit['congestion']).round().astype(int)
submit = submit.drop(['congestion'], axis=1).rename(columns={'special':'congestion'})
submit['congestion'] = round(submit['congestion'])
submit.to_csv('./submission.csv')
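
If you want to experiment with other blend weights, a small helper keeps them explicit. This is a sketch on made-up predictions: the `blend` function and the example values are hypothetical, and the clipping to the range 0 to 100 simply reflects the range of the congestion target stated in the introduction.

```python
import pandas as pd

def blend(predictions, weights):
    """Weighted average of prediction Series that share the same index."""
    total = sum(weights.values())
    return sum(predictions[name] * w for name, w in weights.items()) / total

# Made-up predictions indexed by row_id (illustrative values only).
cat = pd.Series([40.2, 55.8, 10.1], index=[1, 2, 3], name='congestion')
xgb = pd.Series([44.2, 51.8, 14.1], index=[1, 2, 3], name='congestion')

# Weights of 1 and 0 reproduce the CatBoost-only blend used above.
catboost_only = blend({'cat': cat, 'xgb': xgb}, {'cat': 1.0, 'xgb': 0.0})

# An equal-weight blend, rounded and clipped to the valid target range.
blended = blend({'cat': cat, 'xgb': xgb}, {'cat': 0.5, 'xgb': 0.5})
submission = blended.round().clip(0, 100).astype(int)
print(submission)
```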