# Date Features from the datetime variable

In this notebook, we will see how to derive many date-related features using the `dt` accessor from pandas.


## Features from the date part:

Below are some of the features that we can extract from the date part of the datetime variable off-the-shelf using [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components):


- pandas.Series.dt.year
- pandas.Series.dt.quarter
- pandas.Series.dt.month
- pandas.Series.dt.isocalendar().week
- pandas.Series.dt.day
- pandas.Series.dt.day_of_week
- pandas.Series.dt.weekday
- pandas.Series.dt.dayofyear
- pandas.Series.dt.day_of_year

- pandas.Series.dt.is_month_start
- pandas.Series.dt.is_month_end
- pandas.Series.dt.is_quarter_start
- pandas.Series.dt.is_quarter_end
- pandas.Series.dt.is_year_start
- pandas.Series.dt.is_year_end
- pandas.Series.dt.is_leap_year
- pandas.Series.dt.days_in_month

We can use the features obtained with pandas to create even more features, such as:

- Semester
- Is Weekend?


## The dataset

We will use the Online Retail II Data Set available in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/00502/).

Download the xlsx file from the link above and save it in the **root** folder of this repo.

**Citation**:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


## In this demo

We will extract different features from the datetime variable: **InvoiceDate**

In [39]:
import numpy as np
import pandas as pd

In [40]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = (10, 8)
sns.set_palette('flare')
#sns.set_palette('Set2')
#sns.set_theme('notebook')

## Load the data

In [41]:
data = pd.read_csv('../../data/online_retail_II.csv')

In this dataset, we have the datetime variable in a column called **InvoiceDate**.

The dataset contains sales information for different customers in different countries. Customers may have made one or multiple purchases from the business that provided the data.

In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   Invoice      1067371 non-null  object 
 1   StockCode    1067371 non-null  object 
 2   Description  1062989 non-null  object 
 3   Quantity     1067371 non-null  int64  
 4   InvoiceDate  1067371 non-null  object 
 5   Price        1067371 non-null  float64
 6   Customer ID  824364 non-null   float64
 7   Country      1067371 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 65.1+ MB


## Variable format

In [43]:
# Let's determine the type of data in the datetime variable.
data['InvoiceDate'].dtypes

dtype('O')

In this dataset, the datetime variable is cast as an object (`dtype('O')`), i.e., strings. Before carrying on with the rest of the notebook, we need to re-cast it from object into datetime, as we do in the following cell:

In [44]:
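# This is how we parse date strings into datetime format:
data['date'] = pd.to_datetime(data['InvoiceDate'])
# Display the frame without the original string column.
# `drop` returns a new DataFrame here, so `data` itself still contains InvoiceDate.
data.drop(columns='InvoiceDate')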

Unnamed: 0,Invoice,StockCode,Description,Quantity,Price,Customer ID,Country,date
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,6.95,13085.0,United Kingdom,2009-12-01 07:45:00
1,489434,79323P,PINK CHERRY LIGHTS,12,6.75,13085.0,United Kingdom,2009-12-01 07:45:00
2,489434,79323W,WHITE CHERRY LIGHTS,12,6.75,13085.0,United Kingdom,2009-12-01 07:45:00
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2.10,13085.0,United Kingdom,2009-12-01 07:45:00
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,1.25,13085.0,United Kingdom,2009-12-01 07:45:00
...,...,...,...,...,...,...,...,...
1067366,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2.10,12680.0,France,2011-12-09 12:50:00
1067367,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,4.15,12680.0,France,2011-12-09 12:50:00
1067368,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,4.15,12680.0,France,2011-12-09 12:50:00
1067369,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,4.95,12680.0,France,2011-12-09 12:50:00
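
When the layout of the date strings is known, parsing can be made more explicit. The sketch below is not part of the original notebook: it assumes the strings follow the `YYYY-MM-DD HH:MM:SS` layout seen in the output above, and it adds one deliberately malformed, hypothetical entry to show how `errors="coerce"` behaves.

```python
import pandas as pd

# Hypothetical strings in the same layout as InvoiceDate, plus one malformed entry
raw = pd.Series(["2009-12-01 07:45:00", "2011-12-09 12:50:00", "not a date"])

# An explicit format avoids per-value format inference;
# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Confirm the result is a datetime column and count the failed parses
print(pd.api.types.is_datetime64_any_dtype(parsed))
print(parsed.isna().sum())
```

Counting the `NaT` values after parsing is a quick way to spot rows whose dates need manual attention.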


## Date and time

In [45]:
def extract_datetime_features(df):
    # define column name for transformation
    feature = 'date'
    # Create a copy of the DataFrame to avoid modifying the original
    df = df.copy()
    
    # Ensure the feature is of datetime type (any resolution, with or without timezone)
    if not pd.api.types.is_datetime64_any_dtype(df[feature]):
        df[feature] = pd.to_datetime(df[feature])
    
    # Assign to a variable to avoid repetitive access
    dt_ = df[feature].dt
    
    df['date_part'] = dt_.date
    df['time_part'] = dt_.time
    df['year'] = dt_.year
    df['year_start'] = dt_.is_year_start
    df['year_end'] = dt_.is_year_end
    df['year_leap'] = dt_.is_leap_year
    df['quarter'] = dt_.quarter
    df['quarter_start'] = dt_.is_quarter_start
    df['quarter_end'] = dt_.is_quarter_end
    df['semester'] = np.where(dt_.quarter.isin([1, 2]), 1, 2)
    df['month'] = dt_.month
    df['days_in_month'] = dt_.days_in_month
    df['month_start'] = dt_.is_month_start
    df['month_end'] = dt_.is_month_end
    df['week'] = dt_.isocalendar().week
    df['day'] = dt_.day
    df['dayofweek'] = dt_.dayofweek
    df['day_name'] = dt_.day_name()
    df['is_weekend'] = dt_.weekday.isin([5, 6])
    df['day_year'] = dt_.dayofyear
    
    df.drop(columns=feature, inplace=True)
    
    return df
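
One detail worth checking when using the `week` feature is that `isocalendar().week` follows the ISO 8601 calendar, so dates near a year boundary can belong to a week of the adjacent ISO year. The sketch below illustrates this with three dates inside the period covered by the dataset; the `check` frame and its column names are only illustrative.

```python
import pandas as pd

# Dates around the year boundaries within the dataset's date range
dates = pd.Series(pd.to_datetime(["2009-12-01", "2010-01-01", "2011-01-01"]))

iso = dates.dt.isocalendar()  # DataFrame with columns: year, week, day

check = pd.DataFrame({
    "date": dates,
    "calendar_year": dates.dt.year,
    "iso_year": iso["year"],
    "iso_week": iso["week"],
})
print(check)
```

Here 2010-01-01 falls in ISO week 53 of 2009 and 2011-01-01 in ISO week 52 of 2010, so combining `year` with `week` can mislabel early-January rows. Pairing the week with the ISO year is one way to avoid that.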

In [52]:
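def transform_types(df):
    df = df.copy()
    # Replace missing customer IDs with 0, then cast from float to int.
    # Assigning the result back avoids calling fillna(inplace=True) on a column selection.
    df['Customer ID'] = df['Customer ID'].fillna(0).astype(int)
    return df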
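
Filling missing customer IDs with 0 makes the integer cast possible, but it also makes missing customers look like a real ID. As a possible alternative (not what this notebook does), pandas' nullable `Int64` dtype keeps the values as integers while preserving the missing entries. The IDs below are a small hypothetical sample.

```python
import numpy as np
import pandas as pd

# Hypothetical customer IDs stored as floats because of missing values
ids = pd.Series([13085.0, np.nan, 12680.0], name="Customer ID")

# Sentinel approach: replace missing IDs with 0, then cast to int
filled = ids.fillna(0).astype(int)

# Alternative: the nullable integer dtype keeps missing IDs as <NA>
nullable = ids.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

Which option fits better depends on whether downstream steps can handle `<NA>` values.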

In [49]:
from sklearn.base import BaseEstimator, TransformerMixin

class StateLessTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, transform_funcs):
        # Ensure transform_funcs is a list, even if a single function is passed
        if not isinstance(transform_funcs, list):
            transform_funcs = [transform_funcs]
            
        self.transform_funcs = transform_funcs

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_transformed = X
        for func in self.transform_funcs:
            X_transformed = func(X_transformed)
        return X_transformed
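
The loop inside `transform` simply feeds the output of each function into the next one. For comparison, the sketch below shows the same chaining idea outside scikit-learn, using `DataFrame.pipe`. The two step functions and the `toy` frame are hypothetical stand-ins, not the functions defined in this notebook.

```python
import pandas as pd

# Two hypothetical stateless steps, each returning a new DataFrame
def add_year(df):
    df = df.copy()
    df["year"] = df["date"].dt.year
    return df

def add_is_weekend(df):
    df = df.copy()
    df["is_weekend"] = df["date"].dt.dayofweek >= 5
    return df

toy = pd.DataFrame({"date": pd.to_datetime(["2011-12-09", "2011-12-10"])})

# Explicit loop, mirroring the logic of `transform`
looped = toy
for func in [add_year, add_is_weekend]:
    looped = func(looped)

# The same chain written with DataFrame.pipe
piped = toy.pipe(add_year).pipe(add_is_weekend)

print(piped.equals(looped))
```

Wrapping the functions in a transformer class, as done above, has the added benefit that the chain can be placed inside a scikit-learn pipeline.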

In [50]:
slt = StateLessTransformer([extract_datetime_features, transform_types])

In [51]:
df = slt.fit_transform(data)
df

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country,date_part,time_part,...,month,days_in_month,month_start,month_end,week,day,dayofweek,day_name,is_weekend,day_year
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085,United Kingdom,2009-12-01,07:45:00,...,12,31,True,False,49,1,1,Tuesday,False,335
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085,United Kingdom,2009-12-01,07:45:00,...,12,31,True,False,49,1,1,Tuesday,False,335
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085,United Kingdom,2009-12-01,07:45:00,...,12,31,True,False,49,1,1,Tuesday,False,335
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.10,13085,United Kingdom,2009-12-01,07:45:00,...,12,31,True,False,49,1,1,Tuesday,False,335
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085,United Kingdom,2009-12-01,07:45:00,...,12,31,True,False,49,1,1,Tuesday,False,335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1067366,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680,France,2011-12-09,12:50:00,...,12,31,False,False,49,9,4,Friday,False,343
1067367,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680,France,2011-12-09,12:50:00,...,12,31,False,False,49,9,4,Friday,False,343
1067368,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680,France,2011-12-09,12:50:00,...,12,31,False,False,49,9,4,Friday,False,343
1067369,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680,France,2011-12-09,12:50:00,...,12,31,False,False,49,9,4,Friday,False,343
