# Part 6: Adapting Our Strategy After First Submission

Our first Kaggle submission was a reality check. We landed at the bottom of the leaderboard, signaling that our approach needed a serious revision. This prompted us to shift gears from building a complex model to a more iterative, cautious approach.

We realized the necessity of starting with a simple model, one that was robust against overfitting, and then incrementally introducing complexity. This method allowed us to evaluate each feature's impact thoroughly, ensuring we were capturing valuable insights rather than just data noise.
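
As a rough illustration of this incremental workflow, the sketch below scores progressively larger feature sets on a chronological hold-out. Everything in it is an assumption made for illustration: the synthetic data, the feature names (`hour_sin`, `is_weekend`, `noise`), the helper `holdout_rmse`, and the use of ordinary least squares as a stand-in for a simple baseline model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic, time-ordered data (hypothetical stand-in for the bike counts)
hour = np.arange(n) % 24
df = pd.DataFrame({
    "hour_sin": np.sin(2 * np.pi * hour / 24),
    "is_weekend": ((np.arange(n) // 24) % 7 >= 5).astype(float),
    "noise": rng.normal(size=n),  # carries no signal by construction
})
df["target"] = 2.0 * df["hour_sin"] - 1.0 * df["is_weekend"] + rng.normal(scale=0.5, size=n)


def holdout_rmse(data, features, target="target", valid_fraction=0.2):
    """Fit least squares on the earliest rows and return RMSE on the latest rows."""
    split = int(len(data) * (1 - valid_fraction))
    train, valid = data.iloc[:split], data.iloc[split:]

    def design(frame):
        # Add an intercept column to the selected features
        return np.column_stack([np.ones(len(frame)), frame[features].to_numpy()])

    coef, *_ = np.linalg.lstsq(design(train), train[target].to_numpy(), rcond=None)
    residuals = valid[target].to_numpy() - design(valid) @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))


# Add one feature at a time and compare the hold-out score at each step
feature_steps = [["hour_sin"], ["hour_sin", "is_weekend"], ["hour_sin", "is_weekend", "noise"]]
scores = {", ".join(f): holdout_rmse(df, f) for f in feature_steps}
for name, score in scores.items():
    print(f"{name:35s} RMSE = {score:.4f}")
```

A feature earns its place only if the hold-out score improves by more than the run-to-run variation; here the pure-noise column should leave the score essentially unchanged.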

In this file, we document how we deconstructed our initial complex model and rebuilt it, step by step. We focused on discerning which features genuinely mattered, refining our feature engineering process to enhance model performance.

This journey from the last rank to a top-tier position (our score improved from 2.1887 to 0.6044) underscores our learning curve and adaptability. It highlights how strategic changes, grounded in thoughtful analysis and careful experimentation, can lead to significant improvements in a competitive environment like Kaggle.

## First Kaggle Submission

$$ Score : 2.1887 $$

In [None]:
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install('pyarrow')
install('vacances-scolaires-france')
install('meteostat')
install('lockdowndates')

#imports
from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa
from sklearn.impute import SimpleImputer
from IPython.display import display
from IPython import get_ipython
import os
import holidays
from vacances_scolaires_france import SchoolHolidayDates
from lockdowndates.core import LockdownDates
import catboost as cb
from meteostat import Point, Hourly

#Load the data
bike_df_train = pd.read_parquet("/kaggle/input/mdsb-2023/train.parquet")
bike_df_test = pd.read_parquet("/kaggle/input/mdsb-2023/final_test.parquet")


# Feature engineering part

# Define a function to map months to seasons
def get_season(month):
    if 3 <= month <= 5:
        return 'Spring'
    elif 6 <= month <= 8:
        return 'Summer'
    elif 9 <= month <= 11:
        return 'Fall'
    else:
        return 'Winter' # December, January, February

# Initialize the SchoolHolidayDates object (class imported above)
school_holidays = SchoolHolidayDates()

#Check if a given date is a school holiday
def is_school_holiday(datetime_obj):
    # Extracting just the date part from the datetime object
    date_obj = datetime_obj.date()
    return school_holidays.is_holiday_for_zone(date_obj, 'C')  # Paris is Zone C

def encode_dates(df):
    if 'date' not in df.columns or not pd.api.types.is_datetime64_any_dtype(df['date']):
        raise ValueError("DataFrame must have a 'date' column of datetime type.")

    X = df.copy()

    # Extracting date components
    X['year'] = X['date'].dt.year
    X['month'] = X['date'].dt.month
    X['day'] = X['date'].dt.day
    X['hour'] = X['date'].dt.hour
    X['weekday'] = X['date'].dt.weekday

    # Adding season based on month
    X['season'] = X['month'].apply(get_season)

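    # Identifying working days (weekdays not in French holidays)
    # holidays.France() is populated lazily, so request the years present in the data explicitly
    fr_holidays = holidays.France(years=X['year'].unique().tolist())
    X['working_day'] = ((X['date'].dt.weekday < 5) & ~X['date'].dt.date.isin(list(fr_holidays.keys()))).astype(int)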

    # Identifying school holidays
    X['school_holiday'] = X['date'].apply(is_school_holiday).astype(int)

    return X

def get_weather_data(X):
    if not all(col in X.columns for col in ['date', 'latitude', 'longitude']):
        raise ValueError("DataFrame must have 'date', 'latitude', and 'longitude' columns.")

    unique_locations = X[['latitude', 'longitude']].drop_duplicates()
    weather_data_list = []

    for lat, lon in unique_locations.itertuples(index=False):
        point = Point(lat, lon)
        location_data = Hourly(point, X['date'].min(), X['date'].max()).fetch()
        location_data.reset_index(inplace=True)  # Reset index to make 'date' a column
        location_data.rename(columns={'time': 'date'}, inplace=True)  # Rename 'time' column to 'date'
        location_data['latitude'], location_data['longitude'] = lat, lon
        weather_data_list.append(location_data)

    weather_data = pd.concat(weather_data_list, ignore_index=True)

    # Ensure 'date' column is in the right format if it's not already
    if weather_data['date'].dtype != 'datetime64[ns]':
        weather_data['date'] = pd.to_datetime(weather_data['date'])

    # Merge data
    merged_data = pd.merge(X, weather_data, on=['date', 'latitude', 'longitude'], how='left')
    
    return merged_data


def handle_missing_values(df, knn_flag=False, save_path=None, load_path=None):
    """
    Handles missing values in the DataFrame with specific strategies for each column.

    Args:
    df (DataFrame): The input DataFrame.
    knn_flag (bool): Flag to determine whether to perform KNN imputation.
    save_path (str): Path to save the DataFrame after KNN imputation.
    load_path (str): Path to load the DataFrame if KNN imputation is not performed.

    Returns:
    DataFrame: The DataFrame after handling missing values.
    """
    # Define columns that require specific imputation strategies
    zero_fill_cols = ['prcp', 'snow']  # Assuming no precipitation for missing values
    mean_fill_cols = ['temp', 'rhum', 'wspd']  # Using mean for these columns
    #knn_fill_cols = ['snow']  # KNN imputation for snow

    # Fill with zeros (assign back instead of chained in-place fillna)
    for col in zero_fill_cols:
        if col in df.columns:
            df[col] = df[col].fillna(0)

    # Mean imputation
    mean_imputer = SimpleImputer(strategy="mean")
    for col in mean_fill_cols:
        if col in df.columns:
            df[col] = mean_imputer.fit_transform(df[[col]]).ravel()


    # Fill any remaining missing values in numeric columns with the median
    # (the median is undefined for non-numeric columns, so those are left untouched)
    for col in df.select_dtypes(include='number').columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].median())

    return df

def remove_duplicates(df):
    
    df = df.drop_duplicates()

    return df

def clean_data(df):
    """
    Cleans the DataFrame by handling missing values, removing duplicates, setting the time index, 
    and dropping redundant or low correlation columns.

    Args:
    df (DataFrame): The input DataFrame.

    Returns:
    DataFrame: Cleaned DataFrame.
    """
    # Handling missing values
    # If performing KNN imputation and saving the result
    df = handle_missing_values(df)

    # List of columns to drop
    drop_columns = ['site_name', 'site_id', 'coordinates', 'bike_count', 'latitude', 'longitude', 'year', 'day', 'tsun', 'weekday', 
                    'counter_id', 'counter_installation_date', 'counter_technical_id', 'date'
                     #'season'
                     ]
    
    # Only drop columns that are present in the DataFrame
    columns_to_drop = [col for col in drop_columns if col in df.columns]
    df = df.drop(columns_to_drop, axis=1)
    
    return df

# Create a new column 'weather' based on conditions
def categorize_weather(row, mean_rain, mean_snow, mean_windspeed):

    #Goal: Create a column weather with the categories: Clear (could also be cloudy), Windy, Rain, Snow
    #Consider the above if the value is greater than the mean value.
    if row['prcp'] > mean_rain:
        return 'Rainy'
    elif row['snow'] > mean_snow:
        return 'Snowy'
    elif row['wspd'] > mean_windspeed:
        return 'Windy'
    else:
        return 'Clear' #Note, we cannot differentiate whether it is sunny or cloudy. Just that the conditions above are not met

# Time of Day Category
def categorize_time_of_day(hour):
        if 6 <= hour < 12:
            return 'Morning'
        elif 12 <= hour < 18:
            return 'Afternoon'
        elif 18 <= hour < 24:
            return 'Evening'
        else:
            return 'Night'

def flag_rush_hour(hour):
    """
    Flags rush hour periods based on the hour of the day.

    Args:
    hour (int): Hour of the day (0-23).

    Returns:
    int: 1 if it's rush hour, otherwise 0.
    """
    # Define morning and evening rush hours (you can adjust these based on local patterns)
    morning_rush = (7, 8, 9)
    evening_rush = (16, 17, 18)

    if hour in morning_rush or hour in evening_rush:
        return 1
    else:
        return 0

def add_lockdown_curfew_features(df):
    """
    Add lockdown and curfew features to the DataFrame.
    
    Args:
    df (DataFrame): The input DataFrame.
    
    Returns:
    DataFrame: The DataFrame with lockdown and curfew features added.
    """
    # Reset the index to make 'date' a column if it's not already a column
    if isinstance(df.index, pd.DatetimeIndex):
        df.reset_index(inplace=True)

    # Get the start and end dates in string format
    start_date_str = df['date'].min().strftime('%Y-%m-%d')
    end_date_str = df['date'].max().strftime('%Y-%m-%d')

    # Initialize LockdownDates for France with the specified dates and restrictions
    ld = LockdownDates("France", start_date_str, end_date_str, ("stay_at_home", "masks"))
    lockdown_dates = ld.dates()

    # Check if the returned DataFrame from LockdownDates is empty
    if not lockdown_dates.empty:
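        # Merge the lockdown information on the calendar day: 'date' is hourly,
        # while the lockdown table is assumed to be indexed by day
        df['_day'] = df['date'].dt.normalize()
        df = df.merge(lockdown_dates, left_on='_day', right_index=True, how='left')
        df.drop(columns='_day', inplace=True)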

        df.drop(columns='france_country_code', inplace=True)

        # Fill NaN values that resulted from merge operation
        df['france_masks'] = df['france_masks'].fillna(0)
        df['france_stay_at_home'] = df['france_stay_at_home'].fillna(0)

        df['france_masks'] = df['france_masks'].astype(int)
        df['france_stay_at_home'] = df['france_stay_at_home'].astype(int)

        # Rename columns
        df.rename(columns={'france_masks': 'masks_code', 'france_stay_at_home': 'stay_at_home_code'}, inplace=True)

    else:
        # If no lockdown data is available, add default columns with 0
        # (same names as the renamed columns in the branch above)
        df['masks_code'] = 0
        df['stay_at_home_code'] = 0

    return df

def create_new_features(df,mean_rain, mean_snow, mean_windspeed):
    """
    Create new features in the DataFrame.

    Args:
    df (DataFrame): The input DataFrame.

    Returns:
    DataFrame: The DataFrame with new features.
    """
    # Categorizing weather
    df['weather'] = df.apply(lambda row: categorize_weather(row, mean_rain, mean_snow, mean_windspeed), axis=1)

    # Categorizing time of the day
    df['time_of_day'] = df['hour'].apply(categorize_time_of_day)

    # Flagging rush hour
    df['rush_hour'] = df['hour'].apply(flag_rush_hour)
    
    df = add_lockdown_curfew_features(df)

    return df

def feature_transformation(df):
    """
    Transforms features in the given DataFrame.

    Args:
    df (DataFrame): The input DataFrame.

    Returns:
    DataFrame: Transformed DataFrame.
    """
    # Dropping redundant or low correlation columns
    #drop_columns = ['counter_name', 'bike_count', 'latitude', 'longitude', 'year', 'day', 'weekday', 'season']
    #df = df.drop(drop_columns, axis=1)

    # One-hot encoding for categorical variables
    df = pd.get_dummies(df, columns=['counter_name', 'weather', 'month', 'hour', 'season', 'time_of_day'], drop_first=True,  dtype=int)

    return df

def split_dataset_by_working_day(df):
    """
    Splits the dataset into two based on working day.

    Args:
    df (DataFrame): The input DataFrame.

    Returns:
    tuple: DataFrames split by working day.
    """
    # Splitting dataset
    df_working_day = df[df['working_day'] == 1]
    df_non_working_day = df[df['working_day'] == 0]

    return df_working_day, df_non_working_day

def align_datasets(train_df, test_df):
    # Combine columns from both datasets (sorted so the column order is deterministic)
    all_columns = sorted(set(train_df.columns).union(test_df.columns))

    # Reindex both datasets to have the same columns, fill missing with 0
    train_df_aligned = train_df.reindex(columns=all_columns, fill_value=0)
    test_df_aligned = test_df.reindex(columns=all_columns, fill_value=0)

    return train_df_aligned, test_df_aligned

def feature_engineering(df):
    """
    Apply all feature engineering steps to the given DataFrame.

    Args:
    df (DataFrame): The input DataFrame.

    Returns:
    DataFrame: The DataFrame after feature engineering.
    """

    # Apply date encoding
    df = encode_dates(df)
    
    # Incorporate weather data
    df = get_weather_data(df)

    # Calculate mean values for each weather column before cleaning
    mean_rain = df['prcp'].mean()
    mean_snow = df['snow'].mean()
    mean_windspeed = df['wspd'].mean()

    # Create new features (including weather category using the calculated means)
    df = create_new_features(df, mean_rain, mean_snow, mean_windspeed)

    # Transform features (one-hot encoding, etc.)
    df = feature_transformation(df)
    
    # Clean the data (handle missing values, etc.)
    df = clean_data(df)

    return df


# Preprocess the data
train_processed = feature_engineering(bike_df_train)
test_processed = feature_engineering(bike_df_test)

# Split between working day and non working day
train_processed_w, train_processed_nw = split_dataset_by_working_day(train_processed)
test_processed_w, test_processed_nw = split_dataset_by_working_day(test_processed)


# modeling part

# Separate features and target variables for training
# For working days
X_train_w = train_processed_w.drop(columns=["log_bike_count"])
y_train_w = train_processed_w["log_bike_count"]
# For non-working days
X_train_nw = train_processed_nw.drop(columns=["log_bike_count"])
y_train_nw = train_processed_nw["log_bike_count"]

# Reindex both datasets to have the same columns, fill missing with 0
X_train_w, X_test_w = align_datasets(X_train_w, test_processed_w)
X_train_nw, X_test_nw = align_datasets(X_train_nw, test_processed_nw)

# catboost optuna tuned study results
params_w = {
    'iterations': 1096, 
    'depth': 9, 
    'learning_rate': 0.22603548878280666, 
    'random_strength': 2, 
    'bagging_temperature': 0.1867337573932248, 
    'l2_leaf_reg': 2.304593084966779e-05, 
    'border_count': 66, 
    'grow_policy': 'Lossguide',
    'loss_function': 'RMSE',
    'verbose': False
}

params_nw = {
    'iterations': 366,
    'depth': 9,
    'learning_rate': 0.13516379949083754,
    'random_strength': 10,
    'bagging_temperature': 0.27795134630855506,
    'l2_leaf_reg': 0.10661335848192686,
    'border_count': 1,
    'loss_function': 'RMSE',
    'verbose': False
}

# Create and train the CatBoost model for working days
model_w = cb.CatBoostRegressor(**params_w)
model_w.fit(X_train_w, y_train_w, verbose=False)

# Predict for working days
y_pred_w = model_w.predict(X_test_w)

# Create and train the CatBoost model for non-working days
model_nw = cb.CatBoostRegressor(**params_nw)
model_nw.fit(X_train_nw, y_train_nw, verbose=False)

# Predict for non-working days
y_pred_nw = model_nw.predict(X_test_nw)

# Create dataframes with predictions and test data
df_pred_w = pd.DataFrame({'y_pred_w': y_pred_w}, index=X_test_w.index)
df_pred_nw = pd.DataFrame({'y_pred_nw': y_pred_nw}, index=X_test_nw.index)

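# Concatenate the dataframes vertically, then sort by the original row index:
# without the sort, all working-day rows would precede the non-working-day rows
# and the Ids assigned below would not line up with the test set
y_pred = pd.concat([df_pred_w, df_pred_nw]).sort_index()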

# Rename the index to make it consistent (optional)
y_pred.index.name = 'index'

# Add a common column name 'predictions'
y_pred['predictions'] = y_pred['y_pred_w'].combine_first(y_pred['y_pred_nw'])

# Drop the individual prediction columns if needed
y_pred = y_pred.drop(['y_pred_w', 'y_pred_nw'], axis=1)

# Create dataframe in the right format for Kaggle submission
results = pd.DataFrame(
    dict(
        Id=np.arange(y_pred.shape[0]),
        log_bike_count=y_pred['predictions'].tolist(),
    )
)

# Save to CSV for submission
results.to_csv("submission.csv", index=False)
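
Because the predictions in this cell come from two separate models, a plain concatenation lists every working-day row before any non-working-day row. The sketch below, on a toy frame with made-up values and a hypothetical `combine_predictions` helper, shows how sorting on the original index realigns predictions with the test rows before `Id` is assigned. It assumes the index still reflects the original test order.

```python
import numpy as np
import pandas as pd

# Toy test set: the index encodes the original row order (values are made up)
test = pd.DataFrame(
    {"working_day": [1, 0, 1, 0, 1], "truth": [10.0, 20.0, 30.0, 40.0, 50.0]}
)


def combine_predictions(pred_working, pred_non_working):
    """Stack two prediction Series and restore the original row order."""
    combined = pd.concat([pred_working, pred_non_working])
    return combined.sort_index()


# Pretend each sub-model predicts the truth perfectly for its own subset
pred_w = test.loc[test["working_day"] == 1, "truth"]
pred_nw = test.loc[test["working_day"] == 0, "truth"]

unsorted = pd.concat([pred_w, pred_nw])
restored = combine_predictions(pred_w, pred_nw)

print(unsorted.tolist())  # working-day rows first: order differs from the test set
print(restored.tolist())  # same order as the test set

submission = pd.DataFrame({"Id": np.arange(len(restored)), "log_bike_count": restored.to_numpy()})
```

Even with perfect sub-models, the unsorted version would be scored against the wrong rows, so a misaligned submission can look like a modeling failure.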

This first Kaggle attempt scored 2.1887, which we attributed to severe overfitting caused by an overly complex model. We therefore pivoted to a simpler, iterative approach, adding features one at a time and evaluating each. This shift led to a significant improvement in our leaderboard position, as the following sections show.

## Simplifying the Model: A Key Breakthrough

$$ Previous Best Score : 2.1887 $$
$$ New Best Score : 0.6722 $$

In [None]:
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

#install('vacances-scolaires-france')
#install('meteostat')

import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load data
bike_df_train = pd.read_parquet("/kaggle/input/mdsb-2023/train.parquet")
bike_df_test = pd.read_parquet("/kaggle/input/mdsb-2023/final_test.parquet")

# Encode dates function (simplified to essential components)
def encode_dates(df):
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['hour'] = df['date'].dt.hour
    df['weekday'] = df['date'].dt.weekday
    return df.drop(columns=['date'])

# Apply date encoding to the datasets
bike_df_train = encode_dates(bike_df_train)
bike_df_test = encode_dates(bike_df_test)

# Column selection (simplified)
columns_to_use = ['year', 'month', 'day', 'hour', 'weekday', 'counter_name']

# Preprocessing pipeline (simplified)
preprocessor = ColumnTransformer(
    [
        ("std_scaler", StandardScaler(), ['year', 'month', 'day', 'hour', 'weekday']),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["counter_name"]),
    ]
)

# XGBoost regressor (simplified)
model = XGBRegressor()

# Full pipeline
pipe = make_pipeline(preprocessor, model)

# Separate features and target
X_train = bike_df_train[columns_to_use]
y_train = bike_df_train["log_bike_count"]
X_test = bike_df_test[columns_to_use]

# Train the model
pipe.fit(X_train, y_train)

# Make predictions
y_pred = pipe.predict(X_test)

# Prepare submission
results = pd.DataFrame({'Id': np.arange(len(y_pred)), 'log_bike_count': y_pred})
results.to_csv("submission.csv", index=False)

## Exploring Model Complexity: CatBoost Implementation
Here we shift to a different model (CatBoost) while still focusing on simplicity.

$$ Previous Best Score : 0.6722 $$
$$ New Best Score : 0.6518 $$

In [None]:
# Simpler version using catboost
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

#install('vacances-scolaires-france')
#install('meteostat')

import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load data
bike_df_train = pd.read_parquet("/kaggle/input/mdsb-2023/train.parquet")
bike_df_test = pd.read_parquet("/kaggle/input/mdsb-2023/final_test.parquet")

# Encode dates function (simplified to essential components)
def encode_dates(df):
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['hour'] = df['date'].dt.hour
    df['weekday'] = df['date'].dt.weekday
    return df.drop(columns=['date'])

# Apply date encoding to the datasets
bike_df_train = encode_dates(bike_df_train)
bike_df_test = encode_dates(bike_df_test)

# Column selection (simplified)
columns_to_use = ['year', 'month', 'day', 'hour', 'weekday', 'counter_name']

# Preprocessing pipeline (simplified)
preprocessor = ColumnTransformer(
    [
        ("std_scaler", StandardScaler(), ['year', 'month', 'day', 'hour', 'weekday']),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["counter_name"]),
    ]
)

# CatBoost regressor (simplified)
model = CatBoostRegressor(verbose=0)  # Set verbose to 0 to reduce log output

# Full pipeline
pipe = make_pipeline(preprocessor, model)

# Separate features and target
X_train = bike_df_train[columns_to_use]
y_train = bike_df_train["log_bike_count"]
X_test = bike_df_test[columns_to_use]

# Train the model
pipe.fit(X_train, y_train)

# Make predictions
y_pred = pipe.predict(X_test)

# Prepare submission
results = pd.DataFrame({'Id': np.arange(len(y_pred)), 'log_bike_count': y_pred})
results.to_csv("submission.csv", index=False)

## Refining the Model: Simple GridSearch Tuning
At this stage, we decided to check how a simple tuning strategy would change our score before adding more complex features and risking overfitting.

$$ Previous Best Score : 0.6518 $$
$$ New Best Score : 0.6358 $$

In [None]:
# Simpler version using catboost
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

#install('vacances-scolaires-france')
#install('meteostat')

import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load data
bike_df_train = pd.read_parquet("/kaggle/input/mdsb-2023/train.parquet")
bike_df_test = pd.read_parquet("/kaggle/input/mdsb-2023/final_test.parquet")

# Encode dates function (simplified to essential components)
def encode_dates(df):
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['hour'] = df['date'].dt.hour
    df['weekday'] = df['date'].dt.weekday
    return df.drop(columns=['date'])

# Apply date encoding to the datasets
bike_df_train = encode_dates(bike_df_train)
bike_df_test = encode_dates(bike_df_test)

# Column selection (simplified)
columns_to_use = ['year', 'month', 'day', 'hour', 'weekday', 'counter_name']

# Preprocessing pipeline (simplified)
preprocessor = ColumnTransformer(
    [
        ("std_scaler", StandardScaler(), ['year', 'month', 'day', 'hour', 'weekday']),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["counter_name"]),
    ]
)

# Best hyperparameters found during tuning
best_params = {
    'iterations': 448,
    'depth': 8,
    'learning_rate': 0.19833451396224458,
    'random_strength': 19,
    'bagging_temperature': 0.7521783474435499,
    'od_type': 'IncToDec',
    'l2_leaf_reg': 0.2758522910440182,
}


# CatBoost regressor with the tuned parameters
model = CatBoostRegressor(**best_params, random_seed=42, verbose=0)  # Set verbose to 0 to reduce log output

# Full pipeline
pipe = make_pipeline(preprocessor, model)

# Separate features and target
X_train = bike_df_train[columns_to_use]
y_train = bike_df_train["log_bike_count"]
X_test = bike_df_test[columns_to_use]

# Train the model
pipe.fit(X_train, y_train)

# Make predictions
y_pred = pipe.predict(X_test)

# Prepare submission
results = pd.DataFrame({'Id': np.arange(len(y_pred)), 'log_bike_count': y_pred})
results.to_csv("submission.csv", index=False)
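
The parameter values in this cell are the outcome of a tuning run that is not shown here. As a rough sketch of what an exhaustive search over a small grid looks like, the example below scores every candidate on a held-out split and keeps the best one. It uses closed-form ridge regression on synthetic data as a lightweight stand-in for CatBoost; the grid, the data, and the helper names `fit_ridge` and `rmse` are all illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression problem (hypothetical stand-in for the bike data)
X = rng.normal(size=(500, 10))
true_coef = rng.normal(size=10)
y = X @ true_coef + rng.normal(scale=0.5, size=500)

# Hold out the last 20% of rows for validation
split = int(0.8 * len(X))
X_train, X_valid = X[:split], X[split:]
y_train, y_valid = y[:split], y[split:]


def fit_ridge(X, y, alpha, fit_intercept):
    """Closed-form ridge regression; returns (coefficients, intercept)."""
    x_mean = X.mean(axis=0) if fit_intercept else np.zeros(X.shape[1])
    y_mean = y.mean() if fit_intercept else 0.0
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Every combination of the candidate values is evaluated once
param_grid = {"alpha": [0.01, 1.0, 100.0, 10000.0], "fit_intercept": [True, False]}
results = []
for alpha, fit_intercept in itertools.product(*param_grid.values()):
    coef, intercept = fit_ridge(X_train, y_train, alpha, fit_intercept)
    score = rmse(y_valid, X_valid @ coef + intercept)
    results.append(({"alpha": alpha, "fit_intercept": fit_intercept}, score))

best_candidate, best_score = min(results, key=lambda item: item[1])
print(best_candidate, round(best_score, 4))
```

Keeping the grid small limits how much the search can overfit the validation split, which is in the spirit of tuning cautiously before adding features.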

## Incorporating New Features: Balancing Simplicity and Complexity
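We carefully added a few new features (counter latitude and longitude, a time-of-day category, and a weekend flag), balancing the added complexity against the risk of overfitting. This step is trained with LightGBM, using early stopping on a held-out validation split.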

$$ Previous Best Score : 0.6358 $$
$$ New Best Score : 0.6291 $$

In [None]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load data
bike_df_train = pd.read_parquet("/kaggle/input/mdsb-2023/train.parquet")
bike_df_test = pd.read_parquet("/kaggle/input/mdsb-2023/final_test.parquet")

# Convert date to datetime and extract components
bike_df_train['date'] = pd.to_datetime(bike_df_train['date'])
bike_df_train['hour'] = bike_df_train['date'].dt.hour
bike_df_train['day'] = bike_df_train['date'].dt.day
bike_df_train['weekday'] = bike_df_train['date'].dt.weekday
bike_df_train['month'] = bike_df_train['date'].dt.month
bike_df_train['year'] = bike_df_train['date'].dt.year

bike_df_test['date'] = pd.to_datetime(bike_df_test['date'])
bike_df_test['hour'] = bike_df_test['date'].dt.hour
bike_df_test['day'] = bike_df_test['date'].dt.day
bike_df_test['weekday'] = bike_df_test['date'].dt.weekday
bike_df_test['month'] = bike_df_test['date'].dt.month
bike_df_test['year'] = bike_df_test['date'].dt.year

# Feature Engineering
# Time of day (morning, midday, afternoon, night)
def categorize_hour(hour):
    if 5 <= hour < 10:
        return 'morning'
    elif 10 <= hour < 15:
        return 'midday'
    elif 15 <= hour < 20:
        return 'afternoon'
    else:
        return 'night'

bike_df_train['time_of_day'] = bike_df_train['hour'].apply(categorize_hour)
bike_df_test['time_of_day'] = bike_df_test['hour'].apply(categorize_hour)

# Weekday/Weekend
bike_df_train['is_weekend'] = bike_df_train['weekday'].apply(lambda x: 1 if x >= 5 else 0)
bike_df_test['is_weekend'] = bike_df_test['weekday'].apply(lambda x: 1 if x >= 5 else 0)

# Define columns to use
columns_to_use = ['year', 'month', 'day', 'hour', 'weekday', 'latitude', 'longitude', 
                  'time_of_day', 'is_weekend', 'counter_name']

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ['year', 'month', 'day', 'hour', 'weekday', 'latitude', 'longitude']),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ['time_of_day', 'counter_name']),
    ],
    remainder='passthrough'
)

# Separate features and target
X_train = bike_df_train[columns_to_use]
y_train = bike_df_train["log_bike_count"]
X_test = bike_df_test[columns_to_use]

# Splitting the data for validation
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Preprocess the data
X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
X_test = preprocessor.transform(X_test)

# Create LGBM datasets (the validation set reuses the training set's binning via `reference`)
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

# Define parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.1,
    'num_leaves': 31,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5
}

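# Train the model with early stopping on the validation set
# (callbacks replace the early_stopping_rounds / verbose_eval arguments removed in LightGBM 4.0)
model = lgb.train(params,
                  train_data,
                  valid_sets=[train_data, valid_data],
                  num_boost_round=2000,
                  callbacks=[lgb.early_stopping(stopping_rounds=50),
                             lgb.log_evaluation(period=50)])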

# Make predictions
y_pred = model.predict(X_test, num_iteration=model.best_iteration)

# Prepare submission
results = pd.DataFrame({'Id': np.arange(len(y_pred)), 'log_bike_count': y_pred})
results.to_csv("submission.csv", index=False)
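
The validation split in this cell is random, so validation rows are interleaved in time with training rows. For time-ordered data such as hourly counts, a chronological hold-out is a common alternative, and it is closer to the real task if the test period lies after the training period. The sketch below is one way to do it; the helper `chronological_split` and the toy frame are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd


def chronological_split(df, date_col="date", valid_fraction=0.2):
    """Split so that every validation timestamp is later than every training timestamp."""
    timestamps = df[date_col].drop_duplicates().sort_values()
    cutoff = timestamps.iloc[int(round(len(timestamps) * (1 - valid_fraction)))]
    train = df[df[date_col] < cutoff]
    valid = df[df[date_col] >= cutoff]
    return train, valid


# Toy data: 100 hourly timestamps observed at two hypothetical counters
dates = pd.date_range("2021-01-01", periods=100, freq="h")
toy = pd.DataFrame({
    "date": list(dates) * 2,
    "counter_name": ["A"] * 100 + ["B"] * 100,
})
toy = toy.sample(frac=1.0, random_state=0)  # shuffle: the split does not rely on row order

train, valid = chronological_split(toy)
print(len(train), len(valid))
print(train["date"].max() < valid["date"].min())
```

Splitting on unique timestamps rather than on rows keeps all counters for a given hour on the same side of the cutoff.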

## Evolving the Model: Incorporating Weather and School Holidays data
After getting a good score with simple features, and since further tuning was no longer improving our performance, we decided it was time to add more complex features to better predict bike traffic in Paris. In particular, we added school holidays and a weather dimension.

Since the dataset grew with the external data and the additional packages, we used the more efficient LightGBM, which seemed to offer the best balance for our model.

$$ Previous Best Score : 0.6291 $$
$$ New Best Score : 0.6145 $$

In [None]:
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install('vacances-scolaires-france')


import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from vacances_scolaires_france import SchoolHolidayDates

# Load data
bike_df_train = pd.read_parquet("/kaggle/input/mdsb-2023/train.parquet")
bike_df_test = pd.read_parquet("/kaggle/input/mdsb-2023/final_test.parquet")

# Load the weather data
weather_data = pd.read_csv("/kaggle/input/mdsb-2023/external_data.csv")

# Initialize the SchoolHolidayDates object
school_holidays = SchoolHolidayDates()

# Check if a given date is a school holiday
def is_school_holiday(datetime_obj):
    date_obj = datetime_obj.date()
    return school_holidays.is_holiday_for_zone(date_obj, 'C')  # Paris is Zone C

# Function to categorize weather
def simplified_weather_categorization(row):
    # Thresholds: temperature 't' is taken to be in Kelvin (278.15 K = 5 °C, 298.15 K = 25 °C)
    temp_cold_threshold = 278.15
    temp_warm_threshold = 298.15
    rain_threshold = 1.0  # compared against 'rr1'

    temp = row['t']
    rain = row.get('rr1', 0)

    if rain >= rain_threshold:
        return "Rainy"
    elif temp <= temp_cold_threshold:
        return "Cold"
    elif temp >= temp_warm_threshold:
        return "Warm"
    else:
        return "Moderate"

# Apply the function to create the 'simplified_weather_category' column
weather_data['date'] = pd.to_datetime(weather_data['date'])
weather_data['simplified_weather_category'] = weather_data.apply(simplified_weather_categorization, axis=1)

# Merge with training and testing data
bike_df_train = bike_df_train.merge(weather_data[['date', 'simplified_weather_category']], on='date', how='left')
bike_df_test = bike_df_test.merge(weather_data[['date', 'simplified_weather_category']], on='date', how='left')

# Convert date to datetime and extract components
bike_df_train['date'] = pd.to_datetime(bike_df_train['date'])
bike_df_test['date'] = pd.to_datetime(bike_df_test['date'])

# Extracting date components and school holiday feature
for df in [bike_df_train, bike_df_test]:
    df['hour'] = df['date'].dt.hour
    df['day'] = df['date'].dt.day
    df['weekday'] = df['date'].dt.weekday
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['is_school_holiday'] = df['date'].apply(is_school_holiday)

# Feature Engineering
def categorize_hour(hour):
    if 5 <= hour < 10:
        return 'morning'
    elif 10 <= hour < 15:
        return 'midday'
    elif 15 <= hour < 20:
        return 'afternoon'
    else:
        return 'night'

bike_df_train['time_of_day'] = bike_df_train['hour'].apply(categorize_hour)
bike_df_test['time_of_day'] = bike_df_test['hour'].apply(categorize_hour)

bike_df_train['is_weekend'] = bike_df_train['weekday'].apply(lambda x: 1 if x >= 5 else 0)
bike_df_test['is_weekend'] = bike_df_test['weekday'].apply(lambda x: 1 if x >= 5 else 0)

# Define columns to use
columns_to_use = ['year', 'month', 'day', 'hour', 'weekday', 'is_school_holiday', 'latitude', 'longitude', 
                  'time_of_day', 'is_weekend', 'counter_name', 'simplified_weather_category']

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ['year', 'month', 'day', 'hour', 'weekday', 'latitude', 'longitude']),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ['time_of_day', 'is_school_holiday', 'counter_name', 'simplified_weather_category']),
    ],
    remainder='passthrough'
)

# Separate features and target
X_train = bike_df_train[columns_to_use]
y_train = bike_df_train["log_bike_count"]
X_test = bike_df_test[columns_to_use]

# Splitting the data for validation
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Preprocess the data
X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
X_test = preprocessor.transform(X_test)

# Create LGBM datasets (the validation set reuses the training set's binning via `reference`)
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

# Define parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.1,
    'num_leaves': 31,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5
}

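# Train the model with early stopping on the validation set
# (callbacks replace the early_stopping_rounds / verbose_eval arguments removed in LightGBM 4.0)
model = lgb.train(params,
                  train_data,
                  valid_sets=[train_data, valid_data],
                  num_boost_round=2000,
                  callbacks=[lgb.early_stopping(stopping_rounds=50),
                             lgb.log_evaluation(period=50)])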

# Make predictions
y_pred = model.predict(X_test, num_iteration=model.best_iteration)

# Prepare submission
results = pd.DataFrame({'Id': np.arange(len(y_pred)), 'log_bike_count': y_pred})
results.to_csv("submission.csv", index=False)

## Final Enhancements: Optimizing Weather-Time Interactions
Final tweaks to the model: we add a `weather_time_interaction` feature that combines the simplified weather category with the time of day.

$$ Previous Best Score : 0.6145 $$
$$ New Best Score : 0.6044 $$
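
As a rough illustration of what this interaction feature looks like, the sketch below reuses the hour buckets and the four weather labels from the cell that follows and enumerates every combined category. The helper `weather_time_label` and the constant `WEATHER_LABELS` are names introduced here for illustration only; they are not part of the notebook.

```python
from itertools import product

# The four labels returned by simplified_weather_categorization in the notebook
WEATHER_LABELS = ["Rainy", "Cold", "Warm", "Moderate"]

def categorize_hour(hour):
    """Bucket an hour (0-23) the same way as the notebook cell."""
    if 5 <= hour < 10:
        return "morning"
    elif 10 <= hour < 15:
        return "midday"
    elif 15 <= hour < 20:
        return "afternoon"
    return "night"

def weather_time_label(weather, hour):
    """Hypothetical helper: join a weather label with an hour bucket."""
    return f"{weather}_{categorize_hour(hour)}"

# Every bucket that can occur over a full day
time_buckets = sorted({categorize_hour(h) for h in range(24)})

# All combinations the one-hot encoder can see when every row has a weather label
all_interactions = [f"{w}_{t}" for w, t in product(WEATHER_LABELS, time_buckets)]

print(len(all_interactions))
print(weather_time_label("Rainy", 8))
```

With four weather labels and four time buckets, the interaction adds 16 one-hot columns. Rows left without a weather match by the left merge would show up as additional `nan_*` categories, since a missing value is formatted as `nan` in an f-string.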

In [None]:
import sys
import subprocess
def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install('vacances-scolaires-france')

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from vacances_scolaires_france import SchoolHolidayDates

# Load data
bike_df_train = pd.read_parquet("/kaggle/input/mdsb-2023/train.parquet")
bike_df_test = pd.read_parquet("/kaggle/input/mdsb-2023/final_test.parquet")

# Load the weather data
weather_data = pd.read_csv("/kaggle/input/mdsb-2023/external_data.csv")

# Initialize the SchoolHolidayDates object
school_holidays = SchoolHolidayDates()

def is_school_holiday(datetime_obj):
    date_obj = datetime_obj.date()
    return school_holidays.is_holiday_for_zone(date_obj, 'C')

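def simplified_weather_categorization(row):
    # Temperature thresholds are in kelvin: 278.15 K = 5 °C, 298.15 K = 25 °C
    temp_cold_threshold = 278.15
    temp_warm_threshold = 298.15
    # Rain threshold, assuming 'rr1' is precipitation over the last hour in mm
    rain_threshold = 1.0
    temp = row['t']
    rain = row.get('rr1', 0)
    # Rain takes priority over temperature when assigning a category
    if rain >= rain_threshold:
        return "Rainy"
    elif temp <= temp_cold_threshold:
        return "Cold"
    elif temp >= temp_warm_threshold:
        return "Warm"
    else:
        return "Moderate"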

weather_data['date'] = pd.to_datetime(weather_data['date'])
weather_data['simplified_weather_category'] = weather_data.apply(simplified_weather_categorization, axis=1)

bike_df_train = bike_df_train.merge(weather_data[['date', 'simplified_weather_category']], on='date', how='left')
bike_df_test = bike_df_test.merge(weather_data[['date', 'simplified_weather_category']], on='date', how='left')

for df in [bike_df_train, bike_df_test]:
    df['date'] = pd.to_datetime(df['date'])
    df['hour'] = df['date'].dt.hour
    df['day'] = df['date'].dt.day
    df['weekday'] = df['date'].dt.weekday
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['is_school_holiday'] = df['date'].apply(is_school_holiday)

def categorize_hour(hour):
    if 5 <= hour < 10:
        return 'morning'
    elif 10 <= hour < 15:
        return 'midday'
    elif 15 <= hour < 20:
        return 'afternoon'
    else:
        return 'night'

for df in [bike_df_train, bike_df_test]:
    df['time_of_day'] = df['hour'].apply(categorize_hour)

def create_weather_time_interaction(row):
    return f"{row['simplified_weather_category']}_{row['time_of_day']}"

for df in [bike_df_train, bike_df_test]:
    df['weather_time_interaction'] = df.apply(create_weather_time_interaction, axis=1)

for df in [bike_df_train, bike_df_test]:
    df['is_weekend'] = df['weekday'].apply(lambda x: 1 if x >= 5 else 0)

columns_to_use = ['year', 'month', 'hour', 'weekday', 'is_school_holiday', 'latitude', 'longitude', 
                  'time_of_day', 'is_weekend', 'counter_name', 'simplified_weather_category', 'weather_time_interaction']

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ['year', 'month', 'hour', 'weekday', 'latitude', 'longitude']),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ['time_of_day', 'is_school_holiday', 'counter_name', 'simplified_weather_category', 'weather_time_interaction']),
    ],
    remainder='passthrough'
)

X_train = bike_df_train[columns_to_use]
y_train = bike_df_train["log_bike_count"]
X_test = bike_df_test[columns_to_use]

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
X_test = preprocessor.transform(X_test)

train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid)

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.1,
    'num_leaves': 31,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5
}

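# Early stopping and logging go through callbacks: LightGBM >= 4.0 no longer
# accepts early_stopping_rounds / verbose_eval as arguments to lgb.train
model = lgb.train(params,
                  train_data,
                  valid_sets=[train_data, valid_data],
                  num_boost_round=2000,
                  callbacks=[lgb.early_stopping(stopping_rounds=50),
                             lgb.log_evaluation(period=50)])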

y_pred = model.predict(X_test, num_iteration=model.best_iteration)

results = pd.DataFrame({'Id': np.arange(len(y_pred)), 'log_bike_count': y_pred})
results.to_csv("submission.csv", index=False)

## Final Best Submission

<p align="center" style="color:#32CD32">
  <strong>Best Final Public Score : 0.6048</strong><br><br>
  <strong>Best Final Private Score : 0.5900</strong>
</p>

In [None]:
import sys
import subprocess
def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install('vacances-scolaires-france')

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from vacances_scolaires_france import SchoolHolidayDates

# Load data
bike_df_train = pd.read_parquet("/kaggle/input/mdsb-2023/train.parquet")
bike_df_test = pd.read_parquet("/kaggle/input/mdsb-2023/final_test.parquet")

# Load the weather data
weather_data = pd.read_csv("/kaggle/input/mdsb-2023/external_data.csv")

# Initialize the SchoolHolidayDates object
school_holidays = SchoolHolidayDates()

def is_school_holiday(datetime_obj):
    date_obj = datetime_obj.date()
    return school_holidays.is_holiday_for_zone(date_obj, 'C')

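def simplified_weather_categorization(row):
    # Temperature thresholds are in kelvin: 278.15 K = 5 °C, 298.15 K = 25 °C
    temp_cold_threshold = 278.15
    temp_warm_threshold = 298.15
    # Rain threshold, assuming 'rr1' is precipitation over the last hour in mm
    rain_threshold = 1.0
    temp = row['t']
    rain = row.get('rr1', 0)
    # Rain takes priority over temperature when assigning a category
    if rain >= rain_threshold:
        return "Rainy"
    elif temp <= temp_cold_threshold:
        return "Cold"
    elif temp >= temp_warm_threshold:
        return "Warm"
    else:
        return "Moderate"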

weather_data['date'] = pd.to_datetime(weather_data['date'])
weather_data['simplified_weather_category'] = weather_data.apply(simplified_weather_categorization, axis=1)

bike_df_train = bike_df_train.merge(weather_data[['date', 'simplified_weather_category']], on='date', how='left')
bike_df_test = bike_df_test.merge(weather_data[['date', 'simplified_weather_category']], on='date', how='left')

for df in [bike_df_train, bike_df_test]:
    df['date'] = pd.to_datetime(df['date'])
    df['hour'] = df['date'].dt.hour
    df['day'] = df['date'].dt.day
    df['weekday'] = df['date'].dt.weekday
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['is_school_holiday'] = df['date'].apply(is_school_holiday)

def categorize_hour(hour):
    if 5 <= hour < 10:
        return 'morning'
    elif 10 <= hour < 15:
        return 'midday'
    elif 15 <= hour < 20:
        return 'afternoon'
    else:
        return 'night'

for df in [bike_df_train, bike_df_test]:
    df['time_of_day'] = df['hour'].apply(categorize_hour)

def create_weather_time_interaction(row):
    return f"{row['simplified_weather_category']}_{row['time_of_day']}"

for df in [bike_df_train, bike_df_test]:
    df['weather_time_interaction'] = df.apply(create_weather_time_interaction, axis=1)

for df in [bike_df_train, bike_df_test]:
    df['is_weekend'] = df['weekday'].apply(lambda x: 1 if x >= 5 else 0)

# Same feature set as the previous run, with 'longitude' removed
columns_to_use = ['year', 'month', 'hour', 'weekday', 'is_school_holiday', 'latitude',
                  'time_of_day', 'is_weekend', 'counter_name', 'simplified_weather_category', 'weather_time_interaction']

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ['year', 'month', 'hour', 'weekday', 'latitude']),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ['time_of_day', 'is_school_holiday', 'counter_name', 'simplified_weather_category', 'weather_time_interaction']),
    ],
    remainder='passthrough'
)

X_train = bike_df_train[columns_to_use]
y_train = bike_df_train["log_bike_count"]
X_test = bike_df_test[columns_to_use]

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
X_test = preprocessor.transform(X_test)

train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid)

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.1,
    'num_leaves': 31,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5
}

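# Early stopping and logging go through callbacks: LightGBM >= 4.0 no longer
# accepts early_stopping_rounds / verbose_eval as arguments to lgb.train
model = lgb.train(params,
                  train_data,
                  valid_sets=[train_data, valid_data],
                  num_boost_round=2000,
                  callbacks=[lgb.early_stopping(stopping_rounds=50),
                             lgb.log_evaluation(period=50)])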

y_pred = model.predict(X_test, num_iteration=model.best_iteration)

results = pd.DataFrame({'Id': np.arange(len(y_pred)), 'log_bike_count': y_pred})
results.to_csv("submission.csv", index=False)

# Conclusion: Reflections and Lessons Learned

Our journey in this Kaggle challenge highlights the importance of adaptability and strategic thinking in data science. Initially hindered by overfitting due to a complex model, we shifted to a simpler, iterative approach, adding complexity only when beneficial. This change in strategy led to a steady climb up the leaderboard.
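
The "add complexity only when beneficial" loop can be sketched roughly as follows. This is a generic greedy forward-selection sketch, not a record of the exact procedure we ran: `forward_select`, `toy_evaluate`, and the numbers in `TOY_EFFECT` are invented for illustration, and in practice `evaluate` would train the model on the given features and return its validation RMSE.

```python
def forward_select(base_features, candidates, evaluate, min_gain=0.0):
    """Greedily keep each candidate feature only if it lowers the score.

    evaluate(features) must return a validation RMSE (lower is better).
    """
    selected = list(base_features)
    best_score = evaluate(selected)
    history = [(tuple(selected), best_score)]
    for feature in candidates:
        trial = selected + [feature]
        score = evaluate(trial)
        # Accept the feature only if the improvement exceeds min_gain
        if best_score - score > min_gain:
            selected, best_score = trial, score
            history.append((tuple(selected), best_score))
    return selected, best_score, history


# Toy stand-in for "train the model and return validation RMSE".
# These effects are made up purely to exercise the loop.
TOY_EFFECT = {"hour": -0.50, "counter_name": -0.40, "weather": -0.05, "noise": 0.02}

def toy_evaluate(features):
    return round(2.0 + sum(TOY_EFFECT.get(f, 0.0) for f in features), 4)

selected, score, history = forward_select(
    base_features=["hour"],
    candidates=["counter_name", "noise", "weather"],
    evaluate=toy_evaluate,
)
print(selected, score)
```

In the toy run, `noise` is rejected because it raises the score, while `counter_name` and `weather` are kept. A positive `min_gain` is one way to avoid keeping features whose apparent benefit is within validation noise.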

By adding targeted features such as simplified weather categories, school holidays, and a weather-by-time-of-day interaction, we balanced feature richness against model simplicity. We ultimately chose LightGBM for its efficiency, and we refined the final model through careful feature selection.

This experience reinforces key data science principles: start simple, evaluate rigorously, and adapt continuously. Our score fell from 2.1887 on the first submission to 0.6048 (public) and 0.5900 (private) on the final one, which supports these lessons. We will carry them, along with a deeper understanding of model building and feature selection, into our future data science projects.
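
The cells above validate on a random 20% split via `train_test_split`. One possible way to make evaluation stricter for time-stamped data, shown only as a sketch under our own assumptions and not something used in this notebook, is to hold out the most recent rows instead. The function name `time_ordered_split` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def time_ordered_split(timestamps, valid_fraction=0.2):
    """Return (train_idx, valid_idx) with the latest rows held out."""
    # Row positions sorted from oldest to newest timestamp
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_valid = max(1, round(len(order) * valid_fraction))
    cut = len(order) - n_valid
    return order[:cut], order[cut:]

# Ten hourly timestamps, deliberately out of order
start = datetime(2021, 1, 1)
stamps = [start + timedelta(hours=h) for h in (3, 0, 7, 1, 9, 4, 2, 8, 5, 6)]

train_idx, valid_idx = time_ordered_split(stamps, valid_fraction=0.2)

# Every training row is older than every validation row
latest_train = max(stamps[i] for i in train_idx)
earliest_valid = min(stamps[i] for i in valid_idx)
print(len(train_idx), len(valid_idx), latest_train < earliest_valid)
```

The returned index lists could be used to slice the feature matrix and target before fitting the preprocessor, so that the validation score reflects predictions on dates the model has not seen.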