# Chapter 7: Handling Dates and Times

## 7.1 Converting Strings to Dates

Converting a vector of strings representing dates and times into time series data.

The format of the dates can vary significantly depending on the data source, for example 24-11-2023, Nov, 24, 2023, or 24/11/2023.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# creating strings

date_strings = np.array(['24-11-2023 05:33 AM',
                        '18-11-2023 11:54 PM',
                        '18-05-2023 09:09 AM'])

In [3]:
# converting to datetimes using pandas to_datetime

[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p') for date in date_strings]

[Timestamp('2023-11-24 05:33:00'),
 Timestamp('2023-11-18 23:54:00'),
 Timestamp('2023-05-18 09:09:00')]

In [4]:
# we can add the errors parameter so that a value that cannot be parsed does not raise an error;
# with errors='coerce' the offending value is set to the missing value NaT instead

[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p', errors='coerce') for date in date_strings]

[Timestamp('2023-11-24 05:33:00'),
 Timestamp('2023-11-18 23:54:00'),
 Timestamp('2023-05-18 09:09:00')]
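
All three strings above parse cleanly, so the effect of `errors='coerce'` is not visible in that output. The sketch below assumes one deliberately malformed entry (`'not a date'`, invented for illustration) and passes the whole array to `pd.to_datetime` in a single call instead of looping.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the last entry is deliberately malformed
raw_strings = np.array(['24-11-2023 05:33 AM',
                        '18-11-2023 11:54 PM',
                        'not a date'])

# errors='coerce' turns values that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw_strings,
                        format='%d-%m-%Y %I:%M %p',
                        errors='coerce')

print(parsed)
print(parsed.isna())  # True marks the entry that failed to parse
```

Checking `isna()` afterwards is a simple way to count how many values failed to parse before deciding whether to drop or repair them.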

## 7.2 Handling Time Zones

Adding or changing time zone information to time series data.

In [5]:
# creating timestamp with timezone

pd.Timestamp('2023-11-24 05:48:50', tz='US/Pacific')

Timestamp('2023-11-24 05:48:50-0800', tz='US/Pacific')

In [8]:
# adding timezone to previously created datetime using tz_localize

date = pd.Timestamp('2023-11-24 05:48:50')

In [10]:
londonTime = date.tz_localize('Europe/London')

londonTime

Timestamp('2023-11-24 05:48:50+0000', tz='Europe/London')

In [12]:
# we can convert this to another time zone

londonTime.tz_convert('Africa/Abidjan')

Timestamp('2023-11-24 05:48:50+0000', tz='Africa/Abidjan')
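
London and Abidjan share the same UTC offset in November, so the clock time above does not change. As a small sketch, converting to a zone with a different offset (here `Asia/Tokyo`, chosen only for illustration) shows that `tz_convert` shifts the wall-clock time while keeping the same instant.

```python
import pandas as pd

# Localize a naive timestamp to London, then view the same instant in Tokyo
london_time = pd.Timestamp('2023-11-24 05:48:50').tz_localize('Europe/London')
tokyo_time = london_time.tz_convert('Asia/Tokyo')

print(london_time)
print(tokyo_time)

# Both objects refer to the same moment in time, so they compare equal
print(london_time == tokyo_time)
```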

In [20]:
# we can apply tz_localize and tz_convert to each element in pandas series

# creating 3 dates

dates = pd.Series(pd.date_range('11/24/2023', periods = 3, freq = 'M'))

# periods - no of periods to generate

In [15]:
# setting time zone
dates.dt.tz_localize('US/Pacific')

0   2023-11-30 00:00:00-08:00
1   2023-12-31 00:00:00-08:00
2   2024-01-31 00:00:00-08:00
dtype: datetime64[ns, US/Pacific]
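
The same two steps can be chained on a Series through the `.dt` accessor. This is a minimal sketch, assuming the three month-end dates are written out explicitly rather than generated with `date_range`.

```python
import pandas as pd

# Three naive (time zone unaware) dates
dates = pd.Series(pd.to_datetime(['2023-11-30', '2023-12-31', '2024-01-31']))

# First attach a time zone, then convert every element to UTC
pacific = dates.dt.tz_localize('US/Pacific')
utc = pacific.dt.tz_convert('UTC')

print(utc)
```

Midnight Pacific Standard Time is eight hours behind UTC, so each converted value falls at 08:00 UTC on the same calendar day.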

pandas supports two sets of strings for representing time zones; we can list all the names in the `pytz` set by importing `all_timezones`.

In [18]:
# importing library

from pytz import all_timezones

In [19]:
# show 5 time zones

all_timezones[:5]

['Africa/Abidjan',
 'Africa/Accra',
 'Africa/Addis_Ababa',
 'Africa/Algiers',
 'Africa/Asmara']

## 7.3 Selecting Dates and Times

Selecting one or more dates from a vector of dates.

In [21]:
# creating a dataframe

dataframe = pd.DataFrame()

# creating datetimes

dataframe['date'] = pd.date_range('1/1/2023', periods = 100000, freq = 'H')

In [24]:
# selecting observations between 2 dates

dataframe[(dataframe['date'] > '2023-1-1 00:00:00') &
         (dataframe['date'] <= '2023-1-1 04:00:00')]

                 date
1 2023-01-01 01:00:00
2 2023-01-01 02:00:00
3 2023-01-01 03:00:00
4 2023-01-01 04:00:00


In [25]:
# we can also use loc using date column as index

# setting the index

dataframe = dataframe.set_index(dataframe['date'])

In [29]:
# selecting observations between 2 dates

dataframe.loc['2023-1-1 00:00:00' : '2023-1-1 04:00:00']

                                   date
date
2023-01-01 00:00:00 2023-01-01 00:00:00
2023-01-01 01:00:00 2023-01-01 01:00:00
2023-01-01 02:00:00 2023-01-01 02:00:00
2023-01-01 03:00:00 2023-01-01 03:00:00
2023-01-01 04:00:00 2023-01-01 04:00:00
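
Two further ways of selecting dates may be useful. The sketch below assumes a small frame with a hypothetical `value` column added for illustration: `Series.between` expresses a range that includes both endpoints, and a partial date string passed to `loc` on a datetime index selects every row on that day.

```python
import pandas as pd

# 72 hourly timestamps (three days); 'value' is a made-up column for illustration
dates = pd.date_range('2023-01-01', periods=72, freq=pd.Timedelta(hours=1))
dataframe = pd.DataFrame({'date': dates, 'value': range(72)})

# between() includes both endpoints by default
window = dataframe[dataframe['date'].between('2023-01-01 01:00:00',
                                             '2023-01-01 04:00:00')]
print(window)

# With a datetime index, a partial date string selects the whole day
indexed = dataframe.set_index('date')
second_day = indexed.loc['2023-01-02']
print(len(second_day))
```

Note that slicing with `loc` on a datetime index also includes both endpoints, unlike ordinary Python slices.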


## 7.4 Breaking Up Date Data into Multiple Features

Using a column of dates and times to create features for year, month, day, hour, and minute.

In [55]:
# create a dataframe

dataframe = pd.DataFrame()

In [56]:
# creating 150 weekly dates

dataframe['date'] = pd.date_range('1/1/2023', periods = 150, freq = 'W')

In [57]:
# creating features for year, month, day, hour and minute

dataframe['year'] = dataframe['date'].dt.year
dataframe['month'] = dataframe['date'].dt.month
dataframe['day'] = dataframe['date'].dt.day
dataframe['hour'] = dataframe['date'].dt.hour
dataframe['minute'] = dataframe['date'].dt.minute

In [59]:
# Show head 

dataframe.head()

        date  year  month  day  hour  minute
0 2023-01-01  2023      1    1     0       0
1 2023-01-08  2023      1    8     0       0
2 2023-01-15  2023      1   15     0       0
3 2023-01-22  2023      1   22     0       0
4 2023-01-29  2023      1   29     0       0
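
The `.dt` accessor exposes more calendar attributes than the five used above. As an illustrative sketch with three hand-picked dates (not the weekly range from the cell above), the same pattern gives quarter, day of year, and a month-end flag.

```python
import pandas as pd

# Three hand-picked dates, chosen only to make the results easy to check
dataframe = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-04-15', '2023-12-31'])
})

# Each attribute returns one value per row
dataframe['quarter'] = dataframe['date'].dt.quarter
dataframe['day_of_year'] = dataframe['date'].dt.dayofyear
dataframe['is_month_end'] = dataframe['date'].dt.is_month_end

print(dataframe)
```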


## 7.5 Calculating the Difference Between Dates

Calculating the time between two datetime features.

In [60]:
dataframe = pd.DataFrame()

In [65]:
# Creating 2 datetime features

dataframe['Arrived'] = [pd.Timestamp('01-01-2023'), pd.Timestamp('01-04-2023')]
dataframe['Left'] = [pd.Timestamp('01-01-2023'), pd.Timestamp('01-10-2023')]

In [66]:
# duration between the features


dataframe['Left'] - dataframe['Arrived']

0   0 days
1   6 days
dtype: timedelta64[ns]

In [68]:
# calculating duration between features (keeping only the number of days as an integer)

pd.Series(delta.days for delta in (dataframe['Left'] - dataframe['Arrived']))

0    0
1    6
dtype: int64
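
A vectorized alternative to looping over the deltas is the `.dt` accessor on the timedelta Series. The sketch below rebuilds the same two rows and reads off the days directly; the conversion to hours is an extra step added for illustration.

```python
import pandas as pd

dataframe = pd.DataFrame({
    'Arrived': [pd.Timestamp('2023-01-01'), pd.Timestamp('2023-01-04')],
    'Left': [pd.Timestamp('2023-01-01'), pd.Timestamp('2023-01-10')],
})

# Subtracting two datetime columns gives a timedelta Series
duration = dataframe['Left'] - dataframe['Arrived']

# Whole days as integers, without a Python-level loop
days = duration.dt.days

# Total length in hours, derived from total seconds
hours = duration.dt.total_seconds() / 3600

print(days)
print(hours)
```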

In [69]:
# Time Deltas

## 7.6 Encoding Days of the Week

Finding the day of the week for each date in a vector of dates.

In [70]:
# Creating dates

dates = pd.Series(pd.date_range('2/2/2023', periods = 3, freq = 'M'))

In [80]:
# showing days of the week

dates.dt.day_name()

0    Tuesday
1     Friday
2     Sunday
dtype: object

In [82]:
# if we want the output to be a numeric value

dates.dt.weekday

0    1
1    4
2    6
dtype: int64
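
Since `weekday` numbers Monday as 0 and Sunday as 6, a weekend indicator follows from a single comparison. This is a small sketch using the same three dates written out explicitly; the `is_weekend` name is made up for illustration.

```python
import pandas as pd

# The same three month-end dates as above, written out explicitly
dates = pd.Series(pd.to_datetime(['2023-02-28', '2023-03-31', '2023-04-30']))

# Monday is 0, Sunday is 6
weekday_number = dates.dt.weekday

# Saturday (5) and Sunday (6) count as the weekend
is_weekend = weekday_number >= 5

print(is_weekend)
```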

## 7.7 Creating a Lagged Feature

Creating a feature that is lagged n time periods

In [84]:
dataframe = pd.DataFrame()

In [86]:
# creating data

dataframe['date'] = pd.date_range('1/1/2023', periods = 5, freq = 'D')
dataframe['stock_price'] = [1.1,2.2,3.3,4.4,5.5]

In [91]:
# adding a feature with the previous day's stock price

dataframe['previous_days_stock_price'] = dataframe['stock_price'].shift(1)

dataframe

        date  stock_price  previous_days_stock_price
0 2023-01-01          1.1                        NaN
1 2023-01-02          2.2                        1.1
2 2023-01-03          3.3                        2.2
3 2023-01-04          4.4                        3.3
4 2023-01-05          5.5                        4.4


We can use shift to build a feature for predicting the next day's stock price from the previous day's value. In the result above, the first value is missing (NaN) because there is no earlier observation to shift into that row.
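
The same idea extends to several lags at once. The following is a minimal sketch, assuming lags of 1 and 2 days and hypothetical column names `lag_1` and `lag_2`; rows that lack a full history are dropped at the end.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {'stock_price': [1.1, 2.2, 3.3, 4.4, 5.5]},
    index=pd.date_range('2023-01-01', periods=5, freq='D'),
)

# One lagged column per lag; shift(n) moves values down by n rows
for lag in (1, 2):
    dataframe[f'lag_{lag}'] = dataframe['stock_price'].shift(lag)

# The first rows have no earlier values, so they contain NaN and are dropped
complete = dataframe.dropna()

print(complete)
```

Dropping the incomplete rows is one option; whether to drop or fill them depends on the model that consumes the features.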

## 7.8 Using Rolling Time Windows

Calculating a statistic over a rolling time window.

In [92]:
# creating datetimes

time_index = pd.date_range('01/01/2023', periods = 5, freq = 'M')

In [94]:
# creating a dataframe and setting index

dataframe = pd.DataFrame(index = time_index)

# creating the feature

dataframe['stock_price'] = [1,2,3,4,5]

In [95]:
# calculating  rolling mean

dataframe.rolling(window = 2).mean()

            stock_price
2023-01-31          NaN
2023-02-28          1.5
2023-03-31          2.5
2023-04-30          3.5
2023-05-31          4.5
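
The first row above is missing because a window of 2 needs two observations. As a sketch, `min_periods=1` lets the window produce a value as soon as one observation is available, and other statistics such as `max` work in the same way as `mean`.

```python
import pandas as pd

# The same five month-end dates, written out explicitly
time_index = pd.to_datetime(['2023-01-31', '2023-02-28', '2023-03-31',
                             '2023-04-30', '2023-05-31'])
dataframe = pd.DataFrame({'stock_price': [1, 2, 3, 4, 5]}, index=time_index)

# min_periods=1: compute the mean even when the window is not yet full
rolling_mean = dataframe['stock_price'].rolling(window=2, min_periods=1).mean()

# Any rolling statistic can replace mean(); here the rolling maximum
rolling_max = dataframe['stock_price'].rolling(window=2).max()

print(rolling_mean)
print(rolling_max)
```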


## 7.9 Handling Missing Data in Time Series

Handling missing values in time series data.

In [96]:
# creating dates

time_index = pd.date_range('01/01/2023', periods = 5, freq ='M')

In [97]:
# creating dataframe and setting index

dataframe = pd.DataFrame(index = time_index)

In [98]:
# feature with gap of missing values

dataframe['sales'] = [1.0, 2.0, np.nan, np.nan, 5.0]

In [99]:
# Interpolating missing values
# drawing a curve or line between known values to find missing values

dataframe.interpolate()

            sales
2023-01-31    1.0
2023-02-28    2.0
2023-03-31    3.0
2023-04-30    4.0
2023-05-31    5.0


In [100]:
# replacing missing values with last known value
# Forward-fill

dataframe.ffill()

            sales
2023-01-31    1.0
2023-02-28    2.0
2023-03-31    2.0
2023-04-30    2.0
2023-05-31    5.0


In [101]:
# we can also fill with the next known value
# Back-fill

dataframe.bfill()

            sales
2023-01-31    1.0
2023-02-28    2.0
2023-03-31    5.0
2023-04-30    5.0
2023-05-31    5.0


In [102]:
# if we think the curve between known values is non-linear, we can use a different interpolation method
# interpolating missing values with a quadratic curve (this method requires SciPy)

dataframe.interpolate(method = 'quadratic')

               sales
2023-01-31  1.000000
2023-02-28  2.000000
2023-03-31  3.059808
2023-04-30  4.038069
2023-05-31  5.000000


In [103]:
# if there are large gaps, limit caps how many consecutive values are interpolated
# and limit_direction sets the direction in which the gap is filled

dataframe.interpolate(limit = 1, limit_direction = 'forward')

            sales
2023-01-31    1.0
2023-02-28    2.0
2023-03-31    3.0
2023-04-30    NaN
2023-05-31    5.0
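
The default linear interpolation treats the rows as equally spaced and ignores the dates in the index. When observations are irregularly spaced, `method='time'` weights by the actual time gaps instead. The sketch below assumes three made-up observations with a one-day gap followed by a three-day gap.

```python
import numpy as np
import pandas as pd

# Irregular spacing: 1 day between the first two dates, 3 days to the last
index = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05'])
sales = pd.Series([1.0, np.nan, 5.0], index=index)

# Default: rows are treated as equally spaced, so the gap gets the midpoint
linear = sales.interpolate()

# method='time': the missing value sits a quarter of the way along in time
by_time = sales.interpolate(method='time')

print(linear)
print(by_time)
```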


    Format codes used with to_datetime in section 7.1:

    Code   Description                            Example
    %Y     Full year                              2001
    %m     Month w/ zero padding                  04
    %d     Day of the month w/ zero padding       09
    %I     Hour (12hr clock) w/ zero padding      02
    %p     AM or PM                               AM
    %M     Minute w/ zero padding                 05
    %S     Second w/ zero padding                 09
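
As a quick sketch of how these codes combine, the example values from the table can be assembled into one string, parsed with `pd.to_datetime`, and written back out with `strftime`. The input string and the output layout are made up for illustration, and the `AM`/`PM` text assumes an English locale.

```python
import pandas as pd

# A string built from the example values in the table above
text = '09-04-2001 02:05:09 AM'

# The format string mirrors the layout of the text, code by code
timestamp = pd.to_datetime(text, format='%d-%m-%Y %I:%M:%S %p')
print(timestamp)

# The same codes work in the other direction with strftime
formatted = timestamp.strftime('%Y/%m/%d %I:%M %p')
print(formatted)
```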