# INFO 212: Data Science Programming

## Week 9: Lecture 1: Time Series Data Analysis

---

**Agenda:**
- Apply techniques to time series data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Time Series
Time series data is an important form of structured data in many different fields, such
as finance, economics, ecology, neuroscience, and physics. Anything that is observed
or measured at many points in time forms a time series. Many time series are fixed
frequency, which is to say that data points occur at regular intervals according to some
rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can
also be irregular without a fixed unit of time or offset between units. How you mark
and refer to time series data depends on the application, and you may have one of the
following:

- Timestamps, specific instants in time
- Fixed periods, such as the month January 2007 or the full year 2010
- Intervals of time, indicated by a start and end timestamp. Periods can be thought
of as special cases of intervals
- Experiment or elapsed time; each timestamp is a measure of time relative to a
particular start time (e.g., the diameter of a cookie baking each second since
being placed in the oven)
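
As a rough illustration, the kinds of time references listed above map onto pandas objects approximately as follows. The dates and variable names are made up for this sketch.

```python
import pandas as pd

# A timestamp: one specific instant in time
instant = pd.Timestamp("2007-01-15 09:30")

# A fixed period: the whole month of January 2007
month = pd.Period("2007-01", freq="M")

# An interval: an arbitrary start and end timestamp (illustrative dates)
interval = pd.Interval(pd.Timestamp("2007-01-01"), pd.Timestamp("2007-01-10"))

# Elapsed time: a duration measured relative to some starting point
elapsed = pd.Timedelta(seconds=90)

print(month.start_time <= instant <= month.end_time)  # is the instant inside the period?
print(pd.Timestamp("2007-01-05") in interval)         # is this date inside the interval?
print(elapsed.total_seconds())
```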

In [2]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Date and Time Data Types and Tools
The Python standard library includes data types for date and time data, as well as
calendar-related functionality. The datetime, time, and calendar modules are the
main places to start. The datetime.datetime type, or simply datetime, is widely
used.
```
from datetime import datetime
now = datetime.now()

now.year, now.month, now.day
```

In [6]:
from datetime import datetime
now = datetime.now()

now.year, now.month, now.day

(2024, 11, 18)

In [3]:
import pytz

In [4]:
local_tz = pytz.timezone('America/New_York')

In [5]:
now = now.astimezone(local_tz)  # astimezone is a method of the datetime object

We can apply arithmetic operations to datetime objects:
```
datetime.now() - datetime(2024, 4, 20)
```

The result is a timedelta object with time-related properties:
```
delta = datetime(2021, 1, 7) - datetime(2008, 6, 24, 8, 15)

delta.days
delta.seconds
```
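
A small sketch of the arithmetic above: subtracting two datetime objects yields a `timedelta`, and a `timedelta` can be added to or subtracted from a datetime. The variable names `start`, `later`, and `earlier` are illustrative only.

```python
from datetime import datetime, timedelta

# Subtracting datetimes gives a timedelta (whole days plus leftover seconds)
delta = datetime(2021, 1, 7) - datetime(2008, 6, 24, 8, 15)
print(delta.days, delta.seconds)

# A timedelta can shift a datetime forward or backward
start = datetime(2021, 1, 7)
later = start + timedelta(days=12)
earlier = start - 2 * timedelta(days=12)
print(later, earlier)
```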

## Time Zone Handling

Working with time zones is generally considered one of the most unpleasant parts of
time series manipulation. As a result, many time series users choose to work with
time series in coordinated universal time or UTC, which is the successor to Greenwich
Mean Time and is the current international standard. Time zones are expressed as
offsets from UTC; for example, New York is four hours behind UTC during daylight
saving time and five hours behind the rest of the year.

In Python, time zone information comes from the third-party pytz library (installable
with pip or conda), which exposes the Olson database, a compilation of world
time zone information. This is especially important for historical data because the
daylight saving time (DST) transition dates (and even UTC offsets) have been
changed numerous times depending on the whims of local governments. In the United
States, the DST transition times have been changed many times since 1900!

```
import pytz
local_tz = pytz.timezone('America/New_York')

now = datetime.now()

now = now.astimezone(local_tz)

now.year

now.hour

now.tzinfo
```

In [8]:
import pytz
local_tz = pytz.timezone('America/New_York')

now = datetime.now()

now = now.astimezone(local_tz)

now.year

now.hour

now.tzinfo

<DstTzInfo 'America/New_York' EST-1 day, 19:00:00 STD>
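
To make the UTC offsets described above concrete, here is a minimal sketch that checks New York's offset in summer and in winter. It uses pandas `Timestamp` objects with a `tz` argument rather than pytz directly, which is a choice made for this example; the two dates are arbitrary.

```python
from datetime import timedelta
import pandas as pd

# One timestamp during daylight saving time, one during standard time
summer = pd.Timestamp("2021-07-01 12:00", tz="America/New_York")
winter = pd.Timestamp("2021-01-01 12:00", tz="America/New_York")

print(summer.utcoffset())  # offset from UTC in July
print(winter.utcoffset())  # offset from UTC in January

# Converting to UTC applies the offset
summer_utc = summer.tz_convert("UTC")
print(summer_utc)
```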

```
for tz in pytz.common_timezones:
    if 'America' in tz:
        print(tz)
```

In [9]:
for tz in pytz.common_timezones:
    if 'America' in tz:
        print(tz)

America/Adak
America/Anchorage
America/Anguilla
America/Antigua
America/Araguaina
America/Argentina/Buenos_Aires
America/Argentina/Catamarca
America/Argentina/Cordoba
America/Argentina/Jujuy
America/Argentina/La_Rioja
America/Argentina/Mendoza
America/Argentina/Rio_Gallegos
America/Argentina/Salta
America/Argentina/San_Juan
America/Argentina/San_Luis
America/Argentina/Tucuman
America/Argentina/Ushuaia
America/Aruba
America/Asuncion
America/Atikokan
America/Bahia
America/Bahia_Banderas
America/Barbados
America/Belem
America/Belize
America/Blanc-Sablon
America/Boa_Vista
America/Bogota
America/Boise
America/Cambridge_Bay
America/Campo_Grande
America/Cancun
America/Caracas
America/Cayenne
America/Cayman
America/Chicago
America/Chihuahua
America/Ciudad_Juarez
America/Costa_Rica
America/Creston
America/Cuiaba
America/Curacao
America/Danmarkshavn
America/Dawson
America/Dawson_Creek
America/Denver
America/Detroit
America/Dominica
America/Edmonton
America/Eirunepe
America/El_Salvador
America/Fo

### Converting Between String and Datetime
Format datetime objects and pandas Timestamp objects as strings using str or the strftime method, passing a format specification.
```
stamp = datetime(2021, 1, 3)
str(stamp)
stamp.strftime('%Y-%m-%d')

s = stamp.strftime('%m/%d/%Y')
```

In [12]:
stamp = datetime(2021, 1, 3)
str(stamp)
stamp.strftime('%Y-%m-%d')

s = stamp.strftime('%m/%d/%Y')
s

'01/03/2021'

Convert from string to datetime:
```
d = datetime.strptime(s, '%m/%d/%Y')
```

In [14]:
d = datetime.strptime(s, '%m/%d/%Y')
d

datetime.datetime(2021, 1, 3, 0, 0)

## Exercise:
```
value = '2021-01-03'
datetime.strptime(value, '%Y-%m-%d')
datestrs = ['7/6/2021', '8/6/2021']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
```

In [15]:
value = '2021-01-03'
datetime.strptime(value, '%Y-%m-%d')
datestrs = ['7/6/2021', '8/6/2021']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

[datetime.datetime(2021, 7, 6, 0, 0), datetime.datetime(2021, 8, 6, 0, 0)]

pandas is generally oriented toward working with arrays of dates, whether used as an
axis index or a column in a DataFrame. The pandas.to_datetime function parses many different
kinds of date representations. Standard date formats like ISO 8601 can be
parsed very quickly:
```
datestrs = ['2021-07-06 12:00:00', '2021-08-06 00:00:00']
pd.to_datetime(datestrs)
```

In [16]:
datestrs = ['2021-07-06 12:00:00', '2021-08-06 00:00:00']
pd.to_datetime(datestrs)

DatetimeIndex(['2021-07-06 12:00:00', '2021-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
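
As a small follow-up sketch, `pd.to_datetime` also handles values that should be treated as missing: a `None` entry becomes `NaT` (Not a Time), the pandas null value for timestamps. The name `idx` is illustrative.

```python
import pandas as pd

datestrs = ['2021-07-06 12:00:00', '2021-08-06 00:00:00']

# Appending None produces a NaT entry in the resulting DatetimeIndex
idx = pd.to_datetime(datestrs + [None])
print(idx)

# pd.isna flags the missing timestamp
print(pd.isna(idx))
```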

## Time Series Basics
A basic kind of time series object in pandas is a Series indexed by timestamps, which
are often represented outside pandas as Python strings or datetime objects.

```
from datetime import datetime
dates = [datetime(2021, 1, 2), datetime(2021, 1, 5),
         datetime(2021, 1, 7), datetime(2021, 1, 8),
         datetime(2021, 1, 10), datetime(2021, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts
```

In [18]:
from datetime import datetime
dates = [datetime(2021, 1, 2), datetime(2021, 1, 5),
         datetime(2021, 1, 7), datetime(2021, 1, 8),
         datetime(2021, 1, 10), datetime(2021, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts

2021-01-02   -0.609492
2021-01-05   -0.092874
2021-01-07    0.793695
2021-01-08   -0.405201
2021-01-10   -0.316311
2021-01-12    0.332994


```
ts.index
```

In [19]:
ts.index


DatetimeIndex(['2021-01-02', '2021-01-05', '2021-01-07', '2021-01-08',
               '2021-01-10', '2021-01-12'],
              dtype='datetime64[ns]', freq=None)

As with other Series, arithmetic operations between differently indexed time series automatically
align on the dates:
```
ts + ts[::2]
```

In [20]:
ts + ts[::2]

2021-01-02   -1.218983
2021-01-05         NaN
2021-01-07    1.587391
2021-01-08         NaN
2021-01-10   -0.632623
2021-01-12         NaN


### Indexing, Selection, Subsetting
A time series behaves like any other pandas.Series when you are indexing and selecting
data based on label:
```
stamp = ts.index[2]
ts[stamp]
```

In [21]:
stamp = ts.index[2]
ts[stamp]

0.7936953204360424

For longer time series, a year or only a year and month can be passed to easily select
slices of data:
```
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2023', periods=1000))
longer_ts
```

In [22]:
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2023', periods=1000))
longer_ts

2023-01-01    0.779283
2023-01-02    0.889889
2023-01-03   -1.439462
2023-01-04   -1.824704
2023-01-05    0.462699
                ...
2025-09-22   -0.369942
2025-09-23   -0.532757
2025-09-24   -0.263389
2025-09-25    0.485013


```
longer_ts['2023-5']
```

In [23]:
longer_ts['2023-5']

2023-05-01   -0.104483
2023-05-02   -1.549521
2023-05-03   -0.441235
2023-05-04   -1.016231
2023-05-05    0.169679
2023-05-06   -1.766143
2023-05-07    0.026971
2023-05-08   -0.717987
2023-05-09   -0.981280
2023-05-10   -1.117939
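
The following sketch shows a few more ways to select by date, using a deterministic series (values 0 to 999) so the results are easy to check. The series `s` and the result names are made up for this example.

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting 2023-01-01, with values 0..999
s = pd.Series(np.arange(1000),
              index=pd.date_range('2023-01-01', periods=1000))

year_2024 = s.loc['2024']                        # every day in 2024
may_2023 = s.loc['2023-05']                      # every day in May 2023
first_days = s.loc['2023-05-01':'2023-05-03']    # date slice, both ends included
head = s.truncate(after='2023-01-03')            # drop everything after this date

print(len(year_2024), len(may_2023), len(first_days), len(head))
```

Note that slicing by date labels includes both endpoints, unlike ordinary Python slicing.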


### Time Series with Duplicate Indices
In some applications, there may be multiple data observations falling on a particular
timestamp. Here is an example:
```
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts
```

In [24]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4


Suppose you wanted to aggregate the data having non-unique timestamps. One way
to do this is to use groupby and pass level=0:
```
grouped = dup_ts.groupby(level=0)
grouped.mean()
grouped.count()
```

In [25]:
grouped = dup_ts.groupby(level=0)
grouped.mean()
grouped.count()


2000-01-01    1
2000-01-02    3
2000-01-03    1
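
Before aggregating, it can help to check whether an index actually contains duplicates. The sketch below uses `is_unique` and `duplicated` for that, then keeps the first observation per timestamp as one possible (assumed) deduplication policy.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)

# False here, because 2000-01-02 appears three times
print(dup_ts.index.is_unique)

# Mark every row whose timestamp occurs more than once
dup_mask = dup_ts.index.duplicated(keep=False)
print(dup_ts[dup_mask])

# One possible policy: keep the first observation for each timestamp
first_per_day = dup_ts.groupby(level=0).first()
print(first_per_day)
```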


## Date Ranges, Frequencies, and Shifting
Generic time series in pandas are assumed to be irregular; that is, they have no fixed
frequency. For many applications this is sufficient. However, it’s often desirable to
work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if
that means introducing missing values into a time series. Fortunately pandas has a
full suite of standard time series frequencies and tools for resampling, inferring frequencies,
and generating fixed-frequency date ranges. For example, you can convert
the sample time series to a fixed daily frequency by calling resample:

```
from datetime import datetime
dates = [datetime(2021, 1, 2), datetime(2021, 1, 5),
         datetime(2021, 1, 7), datetime(2021, 1, 8),
         datetime(2021, 1, 10), datetime(2021, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts
```

In [26]:
from datetime import datetime
dates = [datetime(2021, 1, 2), datetime(2021, 1, 5),
         datetime(2021, 1, 7), datetime(2021, 1, 8),
         datetime(2021, 1, 10), datetime(2021, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts

2021-01-02   -0.217425
2021-01-05    0.061264
2021-01-07    0.300080
2021-01-08    0.867471
2021-01-10   -0.378101
2021-01-12    0.544666


```
ts.resample('D')
```

In [29]:
ts.resample('D')
# 'D' is the daily frequency alias

<pandas.core.resample.DatetimeIndexResampler object at 0x7dc25e4a52a0>

```
resampler = ts.resample('D')
resampler
```

In [30]:
resampler = ts.resample('D')
resampler

<pandas.core.resample.DatetimeIndexResampler object at 0x7dc25e4a5480>
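
As the output above suggests, `resample` returns a lazy resampler object; no data is computed until a method is called on it. A minimal sketch, using deterministic values instead of random ones, shows how `asfreq` introduces missing values for the days that had no observation and how `ffill` fills them forward.

```python
from datetime import datetime
import numpy as np
import pandas as pd

dates = [datetime(2021, 1, 2), datetime(2021, 1, 5),
         datetime(2021, 1, 7), datetime(2021, 1, 8),
         datetime(2021, 1, 10), datetime(2021, 1, 12)]
ts = pd.Series(np.arange(6.0), index=dates)

# Convert to a fixed daily frequency; days without data become NaN
daily = ts.resample('D').asfreq()
print(daily)

# Alternatively, carry the last observation forward into the gaps
daily_filled = ts.resample('D').ffill()
print(daily_filled)
```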

### Generating Date Ranges
pandas.date_range is responsible for
generating a DatetimeIndex with an indicated length according to a particular
frequency:
```
index = pd.date_range('2012-04-01', '2012-06-01')
index
```

In [31]:
index = pd.date_range('2012-04-01', '2012-06-01')
index

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
      

```
pd.date_range(start='2012-04-01', periods=20)
pd.date_range(end='2012-06-01', periods=20)
```

In [32]:
pd.date_range(start='2012-04-01', periods=20)
pd.date_range(end='2012-06-01', periods=20)


DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

```
pd.date_range('2000-01-01', '2000-12-01', freq='BME')  # 'BME' = business month end (spelled 'BM' before pandas 2.2)
```

In [33]:
pd.date_range('2000-01-01', '2000-12-01', freq='BME')


DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BME')

```
pd.date_range('2012-05-02 12:56:31', periods=5)
```

In [34]:
pd.date_range('2012-05-02 12:56:31', periods=5)


DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

```
pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
```

In [35]:
pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)


DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

### Frequencies and Date Offsets

```
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour
```

In [36]:
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour

<Hour>

```
four_hours = Hour(4)
four_hours
```

In [37]:
four_hours = Hour(4)
four_hours

<4 * Hours>
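
Offset objects like these can be combined by addition and passed as the `freq` argument of `pd.date_range`. The following is a small sketch; the names `offset`, `rng`, and `rng4` are illustrative.

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# Adding offsets yields a single combined offset (2 hours 30 minutes)
offset = Hour(2) + Minute(30)
print(offset)

# Offsets can be used directly as a frequency
rng = pd.date_range('2000-01-01', periods=3, freq=offset)
print(rng)

# The same works for a multiple of a base offset
rng4 = pd.date_range('2000-01-01', periods=3, freq=Hour(4))
print(rng4)
```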

#### Week of month dates
One useful frequency class is “week of month,” starting with WOM. This enables you to
get dates like the third Friday of each month:
```
rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
list(rng)
```

In [38]:
rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
list(rng)

[Timestamp('2012-01-20 00:00:00'),
 Timestamp('2012-02-17 00:00:00'),
 Timestamp('2012-03-16 00:00:00'),
 Timestamp('2012-04-20 00:00:00'),
 Timestamp('2012-05-18 00:00:00'),
 Timestamp('2012-06-15 00:00:00'),
 Timestamp('2012-07-20 00:00:00'),
 Timestamp('2012-08-17 00:00:00')]

### Shifting (Leading and Lagging) Data
“Shifting” refers to moving data backward and forward through time. Both Series and
DataFrame have a shift method for doing naive shifts forward or backward, leaving
the index unmodified:

```
ts = pd.Series(np.random.randn(4),
               index=pd.date_range('1/1/2000', periods=4, freq='ME'))  # 'ME' (month end) replaces 'M' in pandas 2.2+
ts
ts.shift(2)
ts.shift(-2)
```

In [39]:
ts = pd.Series(np.random.randn(4),
               index=pd.date_range('1/1/2000', periods=4, freq='ME'))
ts
ts.shift(2)
ts.shift(-2)


2000-01-31   -0.409682
2000-02-29   -0.927374
2000-03-31         NaN
2000-04-30         NaN
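
A common use of `shift` is computing period-over-period percent changes. The sketch below uses made-up values and also shows that passing `freq` moves the timestamps instead of the data, so no values are lost.

```python
import numpy as np
import pandas as pd

# Illustrative values on four consecutive days
ts = pd.Series([100.0, 110.0, 99.0, 108.9],
               index=pd.date_range('2000-01-01', periods=4, freq='D'))

# Naive shift: data moves, index stays, so the first entry becomes NaN
pct = ts / ts.shift(1) - 1
print(pct)

# Shift with a frequency: the index moves forward two days, data is unchanged
shifted_index = ts.shift(2, freq='D')
print(shifted_index)
```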


## Time Zone Handling in Pandas


### Time Zone Localization and Conversion
By default, time series in pandas are time zone naive. For example, consider the following
time series:
```
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
```

In [40]:
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-03-09 09:30:00    0.293595
2012-03-10 09:30:00   -1.859131
2012-03-11 09:30:00   -0.234802
2012-03-12 09:30:00    0.368968
2012-03-13 09:30:00   -1.872337
2012-03-14 09:30:00    0.311492
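
A minimal sketch of the localization and conversion steps: the index of a naive series has no `tz`; `tz_localize` attaches a time zone, and `tz_convert` translates to another one. Deterministic values are used here instead of random ones, and the names `ts_utc` and `ts_ny` are illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = pd.Series(np.arange(len(rng)), index=rng)

# A naive index carries no time zone information
print(ts.index.tz)

# Interpret the naive timestamps as UTC
ts_utc = ts.tz_localize('UTC')

# Convert to New York time; this range spans a US DST transition,
# so the local hour changes partway through
ts_ny = ts_utc.tz_convert('America/New_York')
print(ts_ny)
```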


### Operations with Time Zone−Aware Timestamp Objects
As with time series and date ranges, individual Timestamp objects can be
localized from naive to time zone–aware and converted from one time zone to
another:
```
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')
```

In [41]:
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')

Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')

### Operations Between Different Time Zones
If two time series with different time zones are combined, the result will be UTC.
Since the timestamps are stored under the hood in UTC, this is a straightforward
operation and requires no conversion to happen:
```
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index
```

## Periods and Period Arithmetic
Periods represent timespans, like days, months, quarters, or years. The Period class
represents this data type, requiring a string or integer and a frequency.
```
p = pd.Period(2007, freq='A-DEC')
p
```
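
As a sketch of period arithmetic, adding or subtracting an integer shifts a period by that many units of its frequency, and subtracting two periods of the same frequency gives the number of units between them. Monthly periods are used here purely for illustration.

```python
import pandas as pd

p = pd.Period('2007-01', freq='M')

# Integer arithmetic shifts by whole months
p_plus = p + 5
p_minus = p - 2
print(p_plus, p_minus)

# The difference of two periods is an offset counting the months between them
diff = pd.Period('2008-01', freq='M') - p
print(diff.n)

# period_range builds a regular sequence of periods
rng = pd.period_range('2007-01', '2007-06', freq='M')
print(rng)
```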

### Period Frequency Conversion
```
p = pd.Period('2007', freq='A-DEC')
p
p.asfreq('M', how='start')
p.asfreq('M', how='end')
```

### Quarterly Period Frequencies
```
p = pd.Period('2012Q4', freq='Q-JAN')
p
```
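
With `Q-JAN`, the fiscal year ends in January, so `2012Q4` covers November 2011 through January 2012. A short sketch confirming this by converting the quarter to daily frequency (the names `start` and `end` are illustrative):

```python
import pandas as pd

p = pd.Period('2012Q4', freq='Q-JAN')

# First and last day of the fiscal quarter
start = p.asfreq('D', how='start')
end = p.asfreq('D', how='end')
print(start, end)
```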

### Converting Timestamps to Periods (and Back)
```
rng = pd.date_range('2000-01-01', periods=3, freq='ME')  # 'ME' (month end) replaces 'M' in pandas 2.2+
ts = pd.Series(np.random.randn(3), index=rng)
ts
pts = ts.to_period()
pts
```
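
For the reverse direction, `to_timestamp` converts a period index back to timestamps. The sketch below uses daily data that straddles a month boundary (an assumption made for illustration) to show that several timestamps can fall into the same period, and that converting back yields the start of each period by default.

```python
import numpy as np
import pandas as pd

# Six consecutive days spanning the end of January
rng = pd.date_range('2000-01-29', periods=6, freq='D')
ts = pd.Series(np.arange(6), index=rng)

# Each day is mapped to the month that contains it, so periods repeat
pts = ts.to_period('M')
print(pts)

# Converting back gives the first instant of each period
back = pts.to_timestamp()
print(back)
```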

## Resampling and Frequency Conversion
```
rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
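ts.resample('ME').mean()                  # labeled by month-end timestamps ('ME' replaces 'M' in pandas 2.2+)
ts.resample('ME').mean().to_period('M')   # same means labeled by monthly period (kind='period' is deprecated)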
```

### Downsampling
```
rng = pd.date_range('2000-01-01', periods=12, freq='min')  # one-minute frequency; the 'T' alias is deprecated
ts = pd.Series(np.arange(12), index=rng)
ts
```

```
ts.resample('5min', closed='right').sum()
```
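
To see what `closed` and `label` do, the sketch below compares the default left-closed bins with right-closed bins on twelve one-minute values (0 to 11). The result names are illustrative.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(12),
               index=pd.date_range('2000-01-01', periods=12, freq='min'))

# Default: bins are [00:00, 00:05), [00:05, 00:10), ... and labeled by their left edge
left = ts.resample('5min').sum()
print(left)

# closed='right': bins are (23:55, 00:00], (00:00, 00:05], ... still labeled by the left edge
right = ts.resample('5min', closed='right').sum()
print(right)

# label='right' labels each bin by its right edge instead
right_labeled = ts.resample('5min', closed='right', label='right').sum()
print(right_labeled)
```

With right-closed bins, the 00:00 observation falls into the bin that ends at 00:00, which is why an extra bin starting at 23:55 on the previous day appears.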

```
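# loffset was removed in pandas 2.0; shift the result index by one second instead
result = ts.resample('5min', closed='right', label='right').sum()
result.index = result.index - pd.Timedelta('1s')
result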
```

#### Open-High-Low-Close (OHLC) resampling
```
ts.resample('5min').ohlc()
```

### Upsampling and Interpolation
```
frame = pd.DataFrame(np.random.randn(2, 4),
                     index=pd.date_range('1/1/2000', periods=2,
                                         freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame
```

```
frame.resample('W-THU').ffill()
```
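
A minimal sketch of upsampling from weekly to daily frequency, using deterministic values so the gaps are easy to see: `asfreq` leaves the new rows empty, while `ffill` fills them forward, optionally only up to a `limit`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8.0).reshape(2, 4),
                     index=pd.date_range('1/1/2000', periods=2,
                                         freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# Upsample to daily frequency without filling: new rows are NaN
daily = frame.resample('D').asfreq()
print(daily)

# Forward fill, but only for the two days after each observation
filled = frame.resample('D').ffill(limit=2)
print(filled)
```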

### Resampling with Periods
```
frame = pd.DataFrame(np.random.randn(24, 4),
                     index=pd.period_range('1-2000', '12-2001',
                                           freq='M'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]
annual_frame = frame.resample('A-DEC').mean()
annual_frame
```

```
# Q-DEC: Quarterly, year ending in December
annual_frame.resample('Q-DEC').ffill()
annual_frame.resample('Q-DEC', convention='end').ffill()
```

```
annual_frame.resample('Q-MAR').ffill()
```