# Every Function You Can (Should) Use in Pandas to Manipulate Time Series
## From basic time series metrics to window functions
![](https://cdn-images-1.medium.com/max/1200/1*goDYbZULUkLiheRkrJpmJQ.jpeg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@bentonphotocinema?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Jordan Benton</a>
        on 
        <a href='https://www.pexels.com/photo/shallow-focus-of-clear-hourglass-1095601/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

## Setup

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams

rcParams["xtick.labelsize"] = 15
rcParams["ytick.labelsize"] = 15
rcParams["legend.fontsize"] = "small"

pd.set_option("display.precision", 2)
warnings.filterwarnings("ignore")

## Introduction to this project on Time Series Forecasting

Recently, the Optiver Realized Volatility Prediction competition has been launched on Kaggle. As the name suggests, it is a time series forecasting challenge.

I wanted to participate, but it turns out my knowledge in time series couldn't even begin to suffice to participate in a competition of such a magnitude. So, I accepted this as the 'kick in the pants' I needed to start paying serious attention to this large sphere of ML.

As the first step, I wanted to learn and teach every single Pandas function you can use to manipulate time-series data. These functions are the basic requirements for dealing with any time series data you encounter in the wild.

I have got rather cool and interesting articles/notebooks planned on this topic, and today, you will be reading the first taste of what is to come. Enjoy!

## Table of Contents <small id='toc'></small>

#### [**1. Basic date and time functions**](#1)
  * [1.1 Importing time series data](#1.1)
  * [1.2 Pandas Timestamp](#1.2)
  * [1.3 Sequence of dates (timestamps)](#1.3)
  * [1.4 Slicing](#1.4)

#### [**2. Missing data imputation/interpolation in time series**](#2)
  * [2.1 Mean, median and mode imputation](#2.1)
  * [2.2 Forward and backward filling](#2.2)
  * [2.3 Using `interpolate`](#2.3)
  * [2.4 Model based imputation with KNN](#2.4)

#### [**3. Basic time series calculations and metrics**](#3)
  * [3.1 Shifts and lags](#3.1)
  * [3.2 Percentage changes](#3.2)

#### [**4. Resampling - upsample and downsample**](#4)
  * [4.1 Changing the frequency with `asfreq`](#4.1)
  * [4.2 Downsampling with resample and aggregating](#4.2)
  * [4.3 Upsampling with resample and interpolating](#4.3)
  * [4.4 Plotting the resampled data](#4.4)

#### [**5. Comparing the growth of multiple time series**](#5)

#### [**6. Window functions**](#6)
  * [6.1 Rolling window functions](#6.1)
  * [6.2 Expanding window functions](#6.2)

#### [**7. Summary**](#7)

## 1. Basic date and time functions in Pandas <small id='1'></small>

### 1.1 Importing time series data <small id='1.1'></small>

[Back to top🔝](#toc)

When using the `pd.read_csv` function to import time series, there are 2 arguments you should always use - `parse_dates` and `index_col`:

In [None]:
# Import Apple/Google stock prices
aapl_googl = pd.read_csv(
    "https://raw.githubusercontent.com/BexTuychiev/medium_stories/master/2021/july/3_time_series_manipulation/data/apple_google.csv",
    parse_dates=["Date"],
    index_col="Date",
).dropna()

In [None]:
aapl_googl.head()

In [None]:
# Import S&P500 stock prices
sp500 = pd.read_csv(
    "https://raw.githubusercontent.com/BexTuychiev/medium_stories/master/2021/july/3_time_series_manipulation/data/sp500.csv",
    parse_dates=["date"],
    index_col="date",
)

In [None]:
sp500.head()

`parse_dates` converts date-like strings to DateTime objects and `index_col` sets the passed column as the index. This operation is the basis for all time-series manipulation you will do with Pandas.

When you don't know which column contains dates upon importing, you can perform the date conversion using `pd.to_datetime` function afterward:

In [None]:
# Import the data with unknown date column
sp500 = pd.read_csv("https://raw.githubusercontent.com/BexTuychiev/medium_stories/master/2021/july/3_time_series_manipulation/data/sp500.csv")

# Inspect the dtypes
sp500.dtypes

Now, inspect the datetime format string:

In [None]:
sp500.head()

It is in the format "%Y-%m-%d" (full list of datetime format strings can be found [here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)). Pass this to `pd.to_datetime`:

In [None]:
sp500["date"] = pd.to_datetime(sp500["date"], format="%Y-%m-%d", errors="coerce")

# Check if the conversion is successful
assert sp500["date"].dtype == "datetime64[ns]"

Passing a format string to `pd.to_datetime` significantly speeds up the conversion for large datasets. Set `errors` to "coerce" to mark invalid dates as `NaT` (Not a Time, the datetime equivalent of a missing value).
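To make the effect of `errors="coerce"` concrete, here is a minimal sketch on a made-up Series. The values and the name `raw_dates` are illustrative assumptions, not taken from the S&P 500 file:

```python
import pandas as pd

# Hypothetical raw strings: two valid dates and one that does not match the format
raw_dates = pd.Series(["2021-07-01", "2021-07-02", "not a date"])

parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# The unparseable entry becomes NaT instead of raising an error
print(parsed)
print(parsed.isna().sum())
```

Counting the `NaT` values right after the conversion is a quick way to check how many rows failed to parse.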

After conversion, set the datetime column as the index (most of the time series functionality in this article relies on it):

In [None]:
sp500.set_index("date", inplace=True)

### 1.2 Pandas Timestamp <small id='1.2'></small>

[Back to top🔝](#toc)

The basic date data structure in Pandas is a timestamp:

In [None]:
stamp = pd.Timestamp("2020/12/26")  # You can pass any date-like string
stamp

You can make even more granular timestamps using the right format or, better yet, using the `datetime` module:

In [None]:
from datetime import datetime

stamp = pd.Timestamp(
    datetime(year=2021, month=10, day=5, hour=13, minute=59, second=59)
)
stamp

A full timestamp has useful attributes such as these:

In [None]:
attributes = [
    ".year",
    ".month",
    ".quarter",
    ".day",
    ".hour",
    ".minute",
    ".second",
    ".weekday()",
    ".dayofweek",
    ".weekofyear",
    ".dayofyear",
]

pd.DataFrame(
    {
        "Attribute": attributes,
        "'2021-10-05 13:59:59'": [
            eval(f"stamp{attribute}") for attribute in attributes
        ],
    }
)

### 1.3 Sequence of dates (timestamps) <small id='1.3'></small>

[Back to top🔝](#toc)

A datetime column/index in pandas is represented as a sequence of `Timestamp` objects.

`pd.date_range` returns a special `DatetimeIndex` object that is a collection of `Timestamp` objects with a custom frequency over a given range:

In [None]:
index = pd.date_range(start="2010-10-10", end="2020-10-10", freq="M")
index

After specifying the date range (from October 10, 2010, to the same date in 2020), we tell pandas to generate `Timestamp` objects at month-end frequency with `freq='M'`, so the first timestamp falls on the last day of October 2010:

In [None]:
index[0]

Another way to create date ranges is passing the start date and telling how many periods you want, and specifying the frequency:

In [None]:
pd.date_range(start="2020-01-01", periods=5, freq="Y")

Since we set the frequency to year-end, `date_range` with 5 periods returns 5 timestamps, one for the end of each year. The [list of frequency aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases) that can be passed to `freq` is large, so I will only mention the most important ones here:

In [None]:
aliases = ["B", "D", "W", "M", "BM", "MS", "Q", "H", "A, Y"]
values = [
    "Business days",
    "Calendar days",
    "Weekly",
    "Month end frequency",
    "Business month end frequency",
    "Month start frequency",
    "Quarterly",
    "Hourly",
    "Year end",
]

pd.DataFrame({"Frequency Alias": aliases, "Definition": values})

It is also possible to pass custom frequencies such as "1h30min", "5D", "2W", etc. Again, check out [this link](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases) for the full info.
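As a quick illustration of custom frequencies, the sketch below builds two small ranges. The start dates and the variable names are arbitrary choices made for the example:

```python
import pandas as pd

# Every 90 minutes, starting at midnight
ninety_min = pd.date_range(start="2021-01-01", periods=4, freq="1h30min")

# Every 5 calendar days
five_days = pd.date_range(start="2021-01-01", periods=3, freq="5D")

print(ninety_min)
print(five_days)
```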

### 1.4 Slicing <small id='1.4'></small>

[Back to top🔝](#toc)

Slicing time series data can be very intuitive if the index is a `DatetimeIndex`. You can use something called partial slicing:

In [None]:
aapl_googl["2010":"2015"].sample(5)  # All rows from 2010 through 2015

In [None]:
aapl_googl["2012-4":"2012-12"].sample(5)  # Rows from April through December of 2012

You can even go down to hours, minutes, or seconds levels if the DateTime is granular enough.

Note that pandas slices dates in closed intervals. For example, using "2010": "2013" returns rows for all 4 years - it does not exclude the end of the period like integer slicing.
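The closed-interval behavior is easy to check on a synthetic series. This is a small sketch with made-up data (`ts` is a hypothetical daily series, not one of the datasets above):

```python
import pandas as pd

# Hypothetical daily series covering 2010 through 2014
days = pd.date_range("2010-01-01", "2014-12-31", freq="D")
ts = pd.Series(range(len(days)), index=days)

subset = ts["2010":"2013"]

# Both endpoints are included: the slice runs through the last day of 2013
print(subset.index.min(), subset.index.max())
print(sorted(subset.index.year.unique()))
```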

This date slicing logic applies to other operations like choosing a specific column after the slice:

In [None]:
aapl_googl.loc["2012-10-10":"2012-12-10", "GOOG"].head()

## 2. Missing data imputation or interpolation <small id='2'></small>

[Back to top🔝](#toc)

Missing data is ubiquitous no matter the type of the dataset. This section is all about imputing it in the context of time series. 

> You may also hear it called **interpolation** of missing data in time series lingo.

Besides the basic mean, median and mode imputation, some of the most common techniques include:

1. Forward filling
2. Backward filling
3. Intermediate imputations with the `interpolate` method

We will also discuss model-based imputation such as KNN imputing. Moreover, we will explore visual methods of comparing the efficiency of the techniques and choose the one that best fits the underlying distribution.

### 2.1 Mean, median and mode imputation <small id='2.1'></small>

[Back to top🔝](#toc)

Let's start with the basics. We will randomly select data points in Apple/Google stock dataset and convert them to NaN:

In [None]:
# Choose 200 random row positions without repetition
random_indices = np.random.choice(len(aapl_googl), size=200, replace=False)

# Mark the indices as missing
clone = aapl_googl.copy(deep=True).drop("AAPL", axis=1)
clone.iloc[random_indices, 0] = np.nan

We will also create a function that plots the original distribution alongside the distribution(s) produced by one or more imputations:

In [None]:
def compare_dists(original_dist, imputed_dists: dict):
    """
    Plot original_dist and imputed_dists on top of each other
    to see the difference in distributions.
    """
    fig, ax = plt.subplots(figsize=(12, 7), dpi=140)
    # Plot the original
    sns.kdeplot(
        original_dist, linewidth=5, ax=ax, color="black", label="Original dist."
    )
    for key, value in imputed_dists.items():
        sns.kdeplot(value, linewidth=3, label=key, ax=ax)

    plt.legend()
    plt.show();

We will start trying out techniques with `SimpleImputer` from Sklearn:

In [None]:
from sklearn.impute import SimpleImputer

for method in ["mean", "median", "most_frequent"]:
    clone[method] = SimpleImputer(strategy=method).fit_transform(
        clone["GOOG"].values.reshape(-1, 1)
    )

Let's plot the original GOOG distribution against the 3 imputed features we just created:

In [None]:
compare_dists(
    clone["GOOG"],
    {"mean": clone["mean"], "median": clone["median"], "mode": clone["most_frequent"]},
)

It is hard to say which line most closely resembles the black line, but I will go with the blue.

In [None]:
clone.drop(["mean", "median", "most_frequent"], axis=1, inplace=True)

### 2.2 Forward and backward filling <small id='2.2'></small>

[Back to top🔝](#toc)

Consider this small distribution:

In [None]:
sample = pd.Series([np.nan, 2, 3, np.nan, 4, np.nan, np.nan, 5, 12, np.nan]).to_frame(
    name="original"
)
sample

We will use both forward and backward filling and assign them back to the DataFrame as separate columns:

In [None]:
sample["ffill"] = sample["original"].ffill()
sample["bfill"] = sample["original"].bfill()

sample

It should be fairly obvious how these methods work once you examine the above output. 

Now, let's apply these methods to the Air Quality in India dataset:

In [None]:
air_q = pd.read_csv(
    "https://raw.githubusercontent.com/BexTuychiev/medium_stories/master/2021/july/3_time_series_manipulation/data/station_day.csv",
    usecols=["Date", "NO2"],
    parse_dates=["Date"],
    index_col="Date",
)

for method in ["ffill", "bfill"]:
    air_q[method] = getattr(air_q["NO2"], method)()

compare_dists(air_q["NO2"], {"ffill": air_q["ffill"], "bfill": air_q["bfill"]})

Even though they are very basic, forward and backward filling actually work pretty well on climate and stock data since the differences between nearby data points are small.

### 2.3 Using `interpolate` <small id='2.3'></small>

[Back to top🔝](#toc)

Pandas provides a whole suite of other statistical imputation techniques through the `interpolate` method of Series and DataFrames. Its `method` parameter accepts the name of the technique as a string.

The most popular ones are 'linear' and 'nearest' (the latter requires SciPy to be installed), but you can see the full list in the method's documentation. Here, we will only discuss those two.

Consider this small distribution:

In [None]:
sample = pd.Series([1] + [np.nan] * 6 + [10]).to_frame(name="original")
sample

Once again, we apply the methods and assign their results back:

In [None]:
sample["linear"] = sample.original.interpolate(method="linear")
sample["nearest"] = sample.original.interpolate(method="nearest")

sample

Neat, huh? The linear method treats the missing points between two known values as evenly spaced and places them on the straight line that connects those values (like `np.linspace`). The 'nearest' method copies the value of the closest non-missing point, as its name and the above output suggest.
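The comparison with `np.linspace` can be verified directly. The sketch below rebuilds the same small sample under the assumed name `s` and checks the linear result against eight evenly spaced points:

```python
import numpy as np
import pandas as pd

# Same shape as the sample above: 1, six gaps, then 10
s = pd.Series([1] + [np.nan] * 6 + [10], dtype="float64")

linear = s.interpolate(method="linear")

# Eight evenly spaced points from 1 to 10
expected = np.linspace(1, 10, num=8)

print(np.allclose(linear.to_numpy(), expected))
```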

### 2.4 Model based imputation with KNN <small id='2.4'></small>

[Back to top🔝](#toc)

The last method we will see is the K-Nearest-Neighbors algorithm. I won't detail how the algorithm works but only show how you can use it with Sklearn. If you want the details, I have a separate article [here](https://towardsdatascience.com/going-beyond-the-simpleimputer-for-missing-data-imputation-dd8ba168d505?source=your_stories_page-------------------------------------).

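The most important parameter of KNN is `k`, the number of neighbors (`n_neighbors` in Sklearn). Keep in mind that `KNNImputer` finds neighbors using the other features of each row, so with a single column, as here, the distance computation has little to work with. We will apply the technique to Apple/Google data with several values of `k` and find the best one the same way as we did in the previous sections: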

In [None]:
from sklearn.impute import KNNImputer

n_neighbors = [2, 3, 5, 7, 9]

for k in n_neighbors:
    imp = KNNImputer(n_neighbors=k)
    clone[f"k={k}"] = imp.fit_transform(clone["GOOG"].values.reshape(-1, 1))

compare_dists(clone["GOOG"], {f"k={k}": clone[f"k={k}"] for k in n_neighbors})

## 3. Basic time series calculations <small id='3'></small>

[Back to top🔝](#toc)

Pandas offers built-in functions for the most common time series calculations: shifts, lags, and percentage changes.

### 3.1 Shifts and lags <small id='3.1'></small>

[Back to top🔝](#toc)

A common operation in time series is to move all data points one or more periods backward or forward to compare past and future values. You can do these operations using `shift` function of pandas. Let's see how to move the data points 1 and 2 periods into the future:

In [None]:
sp500 = pd.read_csv("https://raw.githubusercontent.com/BexTuychiev/medium_stories/master/2021/july/3_time_series_manipulation/data/sp500.csv", parse_dates=["date"], index_col="date")

sp500.head()

In [None]:
sp500["shifted_1"] = sp500["SP500"].shift(periods=1)  # the default
sp500["shifted_2"] = sp500["SP500"].shift(periods=2)

sp500.head(6)

Shifting forward enables you to compare the current data point to those recorded one or more periods before.

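You can also shift backward by passing a negative number of periods, which pulls future values up to the current row. This operation is called "lagging" here, although you will often see it called "leading", with "lag" reserved for the forward shift above: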

In [None]:
sp500.drop(["shifted_1", "shifted_2"], axis=1, inplace=True)

sp500["lagged_1"] = sp500["SP500"].shift(periods=-1)
sp500["lagged_2"] = sp500["SP500"].shift(periods=-2)

sp500.tail(6)

Shifting backward enables us to see the difference between the current data point and the one that comes one or more periods later.

A common operation after shifting or lagging is finding the difference and plotting it:

In [None]:
sp500.drop("lagged_2", axis=1, inplace=True)

sp500["diff_lag"] = sp500["lagged_1"] - sp500["SP500"]
sp500.head()

In [None]:
sp500["diff_lag"].plot(figsize=(16, 4));

Since this operation is so common, Pandas has the `diff` function, which computes the difference between each value and the one a given number of periods away:

In [None]:
sp500.drop(["lagged_1", "diff_lag"], axis=1, inplace=True)

sp500["shifted_diff_1"] = sp500["SP500"].diff(periods=1)
sp500["shifted_diff_3"] = sp500["SP500"].diff(periods=3)
sp500["shifted_lagg_1"] = sp500["SP500"].diff(periods=-1)
sp500["shifted_lagg_3"] = sp500["SP500"].diff(periods=-3)

sp500.drop("SP500", axis=1).plot(figsize=(16, 8), subplots=True);
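
To see exactly what `diff` computes, here is a small sketch on made-up prices (the Series `prices` is hypothetical). It checks that `diff` with a given period matches subtracting a shifted copy of the series:

```python
import pandas as pd

# Made-up prices, just for illustration
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])

# diff(n) is the current value minus the value n rows earlier
assert prices.diff(periods=1).equals(prices - prices.shift(1))

# A negative period subtracts the value n rows later instead
assert prices.diff(periods=-1).equals(prices - prices.shift(-1))

print(prices.diff(periods=1).tolist())
print(prices.diff(periods=-1).tolist())
```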

### 3.2 Percentage changes <small id='3.2'></small>

[Back to top🔝](#toc)

Another common metric that can be derived from time-series data is day-to-day percentage change:

In [None]:
sp500.drop(
    ["shifted_diff_1", "shifted_diff_3", "shifted_lagg_1", "shifted_lagg_3"],
    axis=1,
    inplace=True,
)

In [None]:
sp500["shifted"] = sp500["SP500"].shift(1)
sp500["change"] = sp500["SP500"].div(sp500["shifted"]).sub(1).mul(100)

sp500.head()

To calculate the day-to-day percentage change, shift the series one period forward, divide the original series by the shifted one, subtract 1, and multiply by 100. The resulting values express each day's change as a percentage of the previous day's value.

Since it is a common operation, Pandas implements it with the `pct_change` function:

In [None]:
sp500["pct_change"] = sp500["SP500"].pct_change().mul(100)

sp500.head()
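
The equivalence between the manual calculation and `pct_change` can be confirmed on a tiny example. The prices below are invented for the sketch:

```python
import numpy as np
import pandas as pd

# Made-up prices: up 10%, down 10%, then flat
prices = pd.Series([100.0, 110.0, 99.0, 99.0])

manual = prices.div(prices.shift(1)).sub(1).mul(100)
built_in = prices.pct_change().mul(100)

print(built_in.round(2).tolist())
print(np.allclose(manual, built_in, equal_nan=True))
```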

In [None]:
sp500.drop(["shifted", "change", "pct_change"], axis=1, inplace=True)

## 4. Resampling <small id='4'></small>

[Back to top🔝](#toc)

Often, you may want to increase or decrease the granularity of time series to generate new insights. These operations are called resampling or changing the frequency of time series, and we will discuss the Pandas functions related to them in this section.

### 4.1 Changing the frequency with `asfreq` <small id='4.1'></small>

The SP500 stocks data does not have a fixed date frequency, i.e., the period difference between each date is not the same:

In [None]:
sp500.head()

Let's fix this by giving it a calendar day frequency (daily):

In [None]:
sp500.asfreq("D").head(7)

We just made the frequency of the date in SP500 more granular. As a result, new dates were added, leading to more missing values. You can now interpolate them using any of the techniques we discussed earlier.

You can see the list of built-in frequency aliases [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases). A more interesting scenario would be using custom frequencies:

In [None]:
# 5-hour frequency
sp500.asfreq("5h").head(7)  # This makes the dataset very large

In [None]:
# 10 day frequency
sp500.asfreq("10d", method="ffill").head(7)  # This makes the dataset smaller

In [None]:
# 10 month frequency
sp500.asfreq("10M", method="bfill").head(7)

There is also a `reindex` function that operates similarly and supports additional missing value filling logic. We won't discuss it here as there are better options we will consider.

### 4.2 Downsampling with `resample` and aggregating <small id='4.2'></small>

[Back to top🔝](#toc)

In time series lingo, making the frequency of a `DateTime` less granular is called downsampling. The examples are changing the frequency from hourly to daily, from daily to weekly, etc.

We saw how to downsample with `asfreq`. A more powerful alternative is `resample`, which behaves like `DataFrame.groupby`. Just like `groupby` groups the data based on categorical values, `resample` groups the data by date frequencies.

Let's downsample the Apple/Google stock prices by month-end frequency:

In [None]:
aapl_googl.resample("M")

Unlike `asfreq`, calling `resample` on its own only returns a resampler object that holds the grouped data. To see each group, we need to chain some type of function, similar to how we use `groupby`.

Since downsampling decreases the number of data points, we need an aggregation function like mean, median, or mode:

In [None]:
aapl_googl.resample("M").mean().tail()

There are also functions that return the first or last record of a group:

In [None]:
# Resample with business-month frequency
# and return the first record of each group
aapl_googl.resample("BM").first().tail()

In [None]:
# Opposite of first()
aapl_googl.resample("Y").last().tail()  # Year-end frequency

It is also possible to use multiple aggregating functions using `agg`:

In [None]:
aapl_googl.resample("Y").agg(["mean", "median", "std"]).head()
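
For a result small enough to check by hand, here is a sketch that downsamples two weeks of made-up daily values to weekly bins. The data and names are assumptions, and a weekly frequency is used only to keep the example short:

```python
import pandas as pd

# Hypothetical daily values over two full weeks
# (Monday 2021-01-04 through Sunday 2021-01-17)
days = pd.date_range("2021-01-04", periods=14, freq="D")
daily = pd.Series(range(1, 15), index=days, dtype="float64")

# Weekly bins (ending on Sunday by default), aggregated three ways
weekly = daily.resample("W").agg(["mean", "min", "max"])

print(weekly)
```

Each of the two rows summarizes seven daily values, which is the grouping-then-aggregating pattern described above.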

### 4.3 Upsampling with `resample` and interpolating <small id='4.3'></small>

[Back to top🔝](#toc)

The opposite of downsampling is making the `DateTime` more granular. This is called upsampling and includes operations like changing the frequency from daily to hourly, hourly to seconds, etc.

When upsampling, you introduce new dates leading to more missing values. This means you need to use some type of imputation:

In [None]:
# Resample with business day freq and forward-fill
aapl_googl.resample("B").ffill().tail()

In [None]:
# Resample with 20-hour frequency and back-fill
aapl_googl.resample("20h").bfill().sample(5)
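
The gaps created by upsampling are easier to see on a tiny series. The sketch below uses made-up values recorded every other day (`sparse` is a hypothetical name) and fills the new dates in two ways:

```python
import pandas as pd

# Hypothetical values recorded every other day
idx = pd.date_range("2021-01-01", periods=3, freq="2D")
sparse = pd.Series([1.0, 3.0, 5.0], index=idx)

# Upsample to daily frequency; the new dates start out empty
upsampled = sparse.resample("D").asfreq()

# Fill the gaps two different ways
filled_forward = sparse.resample("D").ffill()
filled_linear = upsampled.interpolate(method="linear")

print(
    pd.DataFrame(
        {"raw": upsampled, "ffill": filled_forward, "linear": filled_linear}
    )
)
```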

### 4.4 Plotting the resampled data <small id='4.4'></small>

[Back to top🔝](#toc)

Resampling isn't going to give much if you don't plot its results.

In most cases, you will see new trends and patterns when you downsample. This is because downsampling reduces the granularity, thus eliminating noise:

In [None]:
quarter_google = aapl_googl.resample("Q")["GOOG"].mean()
yearly_google = aapl_googl.resample("Y")["GOOG"].mean()

quarter_apple = aapl_googl.resample("Q")["AAPL"].mean()
yearly_apple = aapl_googl.resample("Y")["AAPL"].mean()

In [None]:
# Plot Apple's downsampled stocks
aapl_googl["AAPL"].plot(figsize=(16, 5), label="Original")
quarter_apple.plot(label="Quarterly")
yearly_apple.plot(label="Yearly")
plt.legend(fontsize="x-large");

In [None]:
# Plot Google's downsampled stocks
aapl_googl["GOOG"].plot(figsize=(16, 5), label="Original")
quarter_google.plot(label="Quarterly")
yearly_google.plot(label="Yearly")
plt.legend(fontsize="x-large");

Plotting the upsampled distribution is only going to introduce more noise, so we won't do it here.

## 5. Comparing the growth of multiple time series <small id='5'></small>

[Back to top🔝](#toc)

It is common to compare two or more numeric values that change over time. For example, we might want to see the growth rate of Google and Apple's stock prices. But here is the problem:

In [None]:
aapl_googl.mean()

Google's stock prices are way higher than Apple's. Plotting the stocks together would probably squish Apple's to a flat line. In other words, the two stocks have different scales.

To fix this, statisticians use normalization. The most common variation is choosing the first recorded value and dividing the rest of the samples by that amount. This shows how each record changes compared to the first.

Here is an example:

In [None]:
aapl_googl.dropna(inplace=True)

# The first rows will contain ones because
# they are being divided by themselves
aapl_googl.div(aapl_googl.iloc[0]).head(10)

The above output shows that for the first 3 dates, Apple stocks didn't change. Then, it increased by 1% of what it was on the first date ('2010-12-16'). Google's prices are more volatile, fluctuating between 1 and 2% increases during the first 10 dates.
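The same normalization can be traced on a toy table where the arithmetic is easy to follow. The column names and prices below are invented for the sketch:

```python
import pandas as pd

# Made-up prices on very different scales
prices = pd.DataFrame(
    {"cheap": [10.0, 11.0, 12.0], "expensive": [500.0, 525.0, 600.0]},
    index=pd.date_range("2021-01-01", periods=3, freq="D"),
)

# Divide every row by the first row so both columns start at 1
normalized = prices.div(prices.iloc[0])

# Express the result as percentage growth since the first date
growth_pct = normalized.sub(1).mul(100)

print(growth_pct.round(1))
```

Although the two columns differ by a factor of 50 in absolute terms, after normalization they share a common starting point and can be compared on one axis.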

Now, let's plot them to compare growth:

In [None]:
# Normalize
normalized_aapl_goog = aapl_googl.div(aapl_googl.iloc[0])

normalized_aapl_goog.plot(figsize=(16, 5))
plt.legend(fontsize="xx-large");

Both Apple and Google achieved over 300% growth from 2011 to 2017. This plot may be even more interesting if we compare their growth to the S&P 500, an index that tracks 500 large US companies:

In [None]:
# Normalize SP500 dataset
normalized_sp500 = sp500.div(sp500.iloc[0])

# Plot
fig, ax = plt.subplots(figsize=(16, 5))

normalized_aapl_goog.plot(ax=ax)
normalized_sp500["2011":].plot(label="S&P500", ax=ax)

plt.legend(fontsize="xx-large");

As you can see, Apple and Google have grown much faster than the S&P 500 index as a whole.

## 6. Window functions <small id='6'></small>

[Back to top🔝](#toc)

There is another type of function that helps you analyze time-series data in novel ways. These are called window functions, and they help you aggregate over a custom number of rows called 'windows.'

For example, I can create a 30-day window over my [Medium subscribers](https://medium.com/@ibexorigin) data to see the total number of subscribers for the past 30 days on any given day. Or a restaurant owner might create a weekly window to see average sales of the past week. Examples are endless as you can create a window of any size over your data.

Let's explore these in more detail.

### 6.1 Rolling window functions <small id='6.1'></small>

[Back to top🔝](#toc)

Rolling windows all have the same length. As they slide through the data, their coverage (the number of rows) doesn't change. Here is an example window of 5 periods sliding through the data:

![](https://cdn-images-1.medium.com/max/800/1*AsTSxTsolMRce59M3dw-KA.png)

Here is how we create rolling windows in pandas:

In [None]:
aapl_googl.rolling(window=5)

Just like `resample`, `rolling` on its own only returns a window object; to use each window, we should chain some type of function. For example, let's compute a rolling sum over the past 5 periods:

In [None]:
aapl_googl["GOOG_5d_roll"] = aapl_googl["GOOG"].rolling(window=5).sum()

aapl_googl.head(10)

Obviously, the first 4 rows will be NaNs. Any other row will contain the sum of the previous 4 rows and the current one.

Pay attention to the `window` argument. If you pass an integer, each window contains exactly that number of rows. If you pass a fixed time offset such as "5h" or "90D", each window contains however many rows fall within that time span, so the row count can vary from window to window (offset windows need a datetime-like index, and variable-length units such as months or years are not accepted). In other words, a 5-period window might have a different size than a 5-day window.
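The difference between the two kinds of window shows up as soon as the dates have a gap. Here is a minimal sketch with a made-up, irregular series (the dates and values are assumptions chosen for the example):

```python
import pandas as pd

# Hypothetical series with a gap: no rows for January 4-6
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-07"])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Integer window: always the current row plus the 2 rows before it
by_rows = s.rolling(window=3).sum()

# Offset window: only rows whose timestamps fall within the last 3 days
by_days = s.rolling(window="3D").sum()

print(pd.DataFrame({"3 rows": by_rows, "3 days": by_days}))
```

On the last date, the integer window still reaches back across the gap, while the 3-day window contains only the final row.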

As an example, let's calculate 90 and 360-day moving averages for Google stock prices and plot them:

In [None]:
aapl_googl["90D_roll_mean"] = aapl_googl["GOOG"].rolling(window="90D").mean()
aapl_googl["360D_roll_mean"] = aapl_googl["GOOG"].rolling(window="360D").mean()

In [None]:
fig, ax = plt.subplots(figsize=(16, 5))

aapl_googl[["90D_roll_mean", "360D_roll_mean", "GOOG"]].plot(ax=ax)

plt.legend(fontsize="xx-large");

Just like `groupby` and `resample`, you can calculate multiple metrics with the `agg` function for each window.

### 6.2 Expanding window functions <small id='6.2'></small>

[Back to top🔝](#toc)

Another type of window function deals with expanding windows. Each new window will contain all the records up to the current date:

![](https://cdn-images-1.medium.com/max/800/1*lqNZULHaEUHJDcaevMz1cA.png)

Expanding windows are useful for calculating 'running' metrics, for example, running sum, mean, min and max, running rate of return, etc.

Below, you will see how to calculate the cumulative sum. The cumulative sum is actually an expanding window sum with `min_periods` set to 1:

In [None]:
aapl_googl.drop(
    ["GOOG_5d_roll", "90D_roll_mean", "360D_roll_mean"], axis=1, inplace=True
)

In [None]:
aapl_googl["expanding_cumsum"] = aapl_googl["GOOG"].expanding(min_periods=1).sum()
# The same operation with cumsum() func
aapl_googl["cumsum_function"] = aapl_googl["GOOG"].cumsum()

aapl_googl.head()

The `expanding` function has a `min_periods` parameter that sets the minimum number of observations a window must contain before a result is returned; earlier rows get NaN.
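A short sketch on a made-up Series shows how `min_periods` changes the output. The values are arbitrary and `late_start` is a hypothetical name:

```python
import pandas as pd

s = pd.Series([5.0, 3.0, 8.0, 1.0])

# With min_periods=1 the expanding sum matches cumsum() exactly
assert s.expanding(min_periods=1).sum().equals(s.cumsum())

# With min_periods=3 the first two results are NaN because
# fewer than 3 observations are available at those rows
late_start = s.expanding(min_periods=3).sum()

print(late_start.tolist())
```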

Now, let's plot the running min and max of S&P500 stocks:

In [None]:
sp500["running_min"] = sp500["SP500"].expanding().min()  # same as cummin()
sp500["running_max"] = sp500["SP500"].expanding().max()

fig, ax = plt.subplots(figsize=(10, 8))

sp500.plot(ax=ax)
plt.legend(fontsize="xx-large");

## Summary <small id='7'></small>

[Back to top🔝](#toc)

I think congratulations are in order!

Now, you know every single Pandas function you can use to manipulate time-series data. It has been an excruciatingly long post, but it was definitely worth it since now, you can tackle any time series data thrown at you.

This post was mainly focused on data manipulation. The next posts in the series will be about more in-depth time series analyses, similar posts on every single plot you can create on time series, and dedicated articles on forecasting. Stay tuned!

### You might also be interested...

- [Matplotlib vs. Plotly: Let’s Decide Once and for All](https://towardsdatascience.com/matplotlib-vs-plotly-lets-decide-once-and-for-all-ad25a5e43322?source=your_stories_page-------------------------------------)
- [6 Things I Do to Consistently Improve My Machine Learning Models](https://medium.com/me/stories/public#:~:text=6%20Things%20I%20Do%20to%20Consistently%20Improve%20My%20Machine%20Learning%20Models)
- [5 Super Productive Things To Do While Training Machine Learning Models](https://towardsdatascience.com/5-short-but-super-productive-things-to-do-during-model-training-b02e2d7f0d06?source=your_stories_page-------------------------------------)