# Introduction to Time Series Data

Data comes in many different formats, shapes, and sizes. You may have heard of tabular data, a format you may already know from working in something like Excel.

We will explore two main kinds of tabular data in this module. The first is time series data. Time series data will be *indexed* with a date and time. We'll look a bit more closely at that soon, but for now just think of it as each row having a date or time, rather than a row number.

## Loading Data

One of the most popular packages in Python for working with tabular data is called Pandas. Today we'll get acquainted with Pandas.

The first thing we'll do is `import` the `pandas` package. Convention has us use a shortform name - `pd` - because we'll be using the package so often.

In [1]:
import pandas as pd


And below we'll use pandas' `read_csv()` to load the data into a `DataFrame`. DataFrames are the main data structure in pandas for tabular data, and lots of other programming languages use the concept of a DataFrame too! By convention, you'll often see `df` used as a variable name.

In [2]:
df = pd.read_csv("data/NVDA_2024.csv")


In [11]:
df

,Date,Close,High,Low,Open,Volume
0,2024-01-02,48.149918,49.276493,47.577135,49.225514,411254000
1,2024-01-03,47.551140,48.165907,47.302233,47.467172,320896000
2,2024-01-04,47.979984,48.481795,47.490167,47.749068,306535000
3,2024-01-05,49.078564,49.528395,48.287860,48.443804,415039000
4,2024-01-08,52.233379,52.255374,49.460423,49.493411,642510000
...,...,...,...,...,...,...
246,2024-12-23,139.657150,139.777134,135.107566,136.267463,176053500
247,2024-12-24,140.207108,141.886946,138.637245,139.987127,105157000
248,2024-12-26,139.917130,140.837058,137.717335,139.687155,116205600
249,2024-12-27,136.997391,139.007216,134.697615,138.537258,170582600


Before we do anything else, it's a good idea to take a look at the DataFrame. Some methods will let us take a closer look at parts of our data. 

In [3]:
# Print the first five rows
print(df.head())

# Print the last 15 rows
print(df.tail(15))

         Date      Close       High        Low       Open     Volume
0  2024-01-02  48.149918  49.276493  47.577135  49.225514  411254000
1  2024-01-03  47.551140  48.165907  47.302233  47.467172  320896000
2  2024-01-04  47.979984  48.481795  47.490167  47.749068  306535000
3  2024-01-05  49.078564  49.528395  48.287860  48.443804  415039000
4  2024-01-08  52.233379  52.255374  49.460423  49.493411  642510000
           Date       Close        High         Low        Open     Volume
236  2024-12-09  138.797226  139.937120  137.117388  138.957215  189308600
237  2024-12-10  135.057587  141.806966  133.777690  138.997212  210020900
238  2024-12-11  139.297180  140.157102  135.197567  137.347363  184905200
239  2024-12-12  137.327362  138.427267  135.787510  137.067391  159211400
240  2024-12-13  134.237656  139.587170  132.527806  138.927227  231514900
241  2024-12-16  131.987854  134.387627  130.407998  134.167646  237951100
242  2024-12-17  130.378006  131.577893  126.848332  129.0781

## Looking at data

Other methods and attributes give us an overview of, and further insight into, our data in general. The `shape` attribute (no parentheses, since it is not a method) tells us the number of rows and columns in our DataFrame, while the `info()` method summarises each column, including its data type (`dtype`).

You'll notice the types are slightly different from the usual Python types - this is because they belong to the `numpy` package, which sits under the hood of `pandas`. We'll look more at `numpy` tomorrow, but for now here is a word about each of the types in our data frame.

- `float64` - 64-bit floating point (number with a decimal point)
- `int64` - 64-bit integer (whole number)
- `object` - other Python data types (strings in this case)

In [4]:
# Print rows and columns
print("Rows and columns: ", df.shape)

# Print summary info
print("Info")
print(df.info())

Rows and columns:  (251, 6)
Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    251 non-null    object 
 1   Close   251 non-null    float64
 2   High    251 non-null    float64
 3   Low     251 non-null    float64
 4   Open    251 non-null    float64
 5   Volume  251 non-null    int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 11.9+ KB
None


For a look at some actual data within the DataFrame, we can use square bracket notation and `iloc` to access columns and rows. The `i` in `iloc` refers to **integer-based indexing**, that is, selecting a row or column by its position *number*.
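
As a minimal sketch of both access styles, the example below rebuilds three of the rows shown earlier in a small stand-in frame. The name `sample` is an assumption, used only so the snippet runs without the CSV file; the same calls apply to `df`.

```python
import pandas as pd

# Stand-in for the first three rows of the data shown earlier
sample = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03", "2024-01-04"],
    "Close": [48.149918, 47.551140, 47.979984],
    "Volume": [411254000, 320896000, 306535000],
})

# Square brackets with a column name select one column (a Series)
close = sample["Close"]

# A list of names selects several columns (a DataFrame)
close_and_volume = sample[["Close", "Volume"]]

# iloc selects by position: row 0, then rows 0 and 1 (the end is excluded)
first_row = sample.iloc[0]
first_two_rows = sample.iloc[0:2]

# Row position first, column position second: row 1, column 2 (Volume)
one_value = sample.iloc[1, 2]

print(first_two_rows)
print(one_value)
```

Note that `iloc` slices behave like ordinary Python slices, so `0:2` stops before position 2.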

## Filtering Data

In Pandas, we can use a technique known as *boolean indexing* or *masking* to filter rows depending on some condition. We can express conditions using a *boolean expression* or *compound boolean expression* with either `&` (and) *or* `|` (or). These are also called *logical expressions*.
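
The sketch below illustrates masking on a small stand-in frame built from five of the rows shown earlier. The name `sample` and the thresholds are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# Stand-in built from five rows shown earlier in this notebook
sample = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"],
    "Close": [48.149918, 47.551140, 47.979984, 49.078564, 52.233379],
    "Volume": [411254000, 320896000, 306535000, 415039000, 642510000],
})

# A comparison on a column gives one True/False value per row: the mask
high_volume = sample["Volume"] > 400_000_000

# Passing the mask in square brackets keeps only the True rows
print(sample[high_volume])

# Compound conditions need parentheses around each part
both = sample[(sample["Volume"] > 400_000_000) & (sample["Close"] < 50)]
either = sample[(sample["Close"] > 50) | (sample["Volume"] < 310_000_000)]

print(both)
print(either)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>` in Python.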

### Exercise: End of Year

Display the data for the entire month of December 2024.

In [12]:
## YOUR CODE GOES HERE
# Date is still a text column here; ISO-formatted dates compare correctly as text
df[(df["Date"] >= "2024-12-01") & (df["Date"] <= "2024-12-31")]

### Exercise: The First Fifty

Display the data for the first fifty (50) days of trading in the period.

In [None]:
## YOUR CODE GOES HERE

### Optional Exercise: Big Days

Display rows where trading volume exceeded 800 000 000.

In [None]:
## YOUR CODE GOES HERE
df[df["Volume"] > 800_000_000]

,Date,Close,High,Low,Open,Volume
35,2024-02-22,78.508522,78.545503,74.192142,74.99684,865100000
36,2024-02-23,78.787407,82.36306,77.540874,80.759666,829388000
46,2024-03-08,87.499252,97.368012,86.477585,95.106754,1142269000
75,2024-04-19,76.174973,84.296305,75.581173,83.122695,875198000
99,2024-05-23,103.764908,106.285076,101.486649,101.994486,835065000


## Setting the Index

In a DataFrame, each row has an index label. By default, this is just a row number (starting at 0). When it makes sense, we can choose one of the other columns to be the index. For time series data, where each row represents a different point in time, we'll set our `Date` column as the index. This will make it easier for us to work with the data, and can speed up other operations later on.

In [None]:
df["Date"] = pd.to_datetime(df["Date"])

df.set_index("Date", inplace=True)
df

Date,Close,High,Low,Open,Volume
2024-01-02,48.149918,49.276493,47.577135,49.225514,411254000
2024-01-03,47.551140,48.165907,47.302233,47.467172,320896000
2024-01-04,47.979984,48.481795,47.490167,47.749068,306535000
2024-01-05,49.078564,49.528395,48.287860,48.443804,415039000
2024-01-08,52.233379,52.255374,49.460423,49.493411,642510000
...,...,...,...,...,...
2024-12-23,139.657150,139.777134,135.107566,136.267463,176053500
2024-12-24,140.207108,141.886946,138.637245,139.987127,105157000
2024-12-26,139.917130,140.837058,137.717335,139.687155,116205600
2024-12-27,136.997391,139.007216,134.697615,138.537258,170582600


We convert the `Date` column from text to datetime values because pandas can recognise and work efficiently with datetimes. We set the `Date` column as the index because in time series data like ours, most operations are time-based.
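
To make the benefit of the conversion concrete, here is a small sketch using two dates from our data. The variable names are assumptions for illustration only.

```python
import pandas as pd

# Two dates from the data, stored as plain text
dates_as_text = pd.Series(["2024-01-02", "2024-01-08"])

# After conversion, pandas treats them as points in time
dates = pd.to_datetime(dates_as_text)

print(dates_as_text.dtype)
print(dates.dtype)

# Date arithmetic only works on the converted values
gap = dates[1] - dates[0]
print(gap.days)
```

Subtracting the two text values instead would raise an error, since pandas has no way to know they represent dates.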

With the index set, we can now use it to access different portions of our data a little bit more easily. Because our indices are labeled, we can use `loc` for **label-based indexing**.

In [None]:
df.loc["2024-08"]

Date,Close,High,Low,Open,Volume
2024-08-01,109.183098,120.130405,106.783687,117.501048,523462300
2024-08-02,107.243561,108.693208,101.345021,103.734431,482027500
2024-08-05,100.425247,103.384525,90.667657,92.037315,552842400
2024-08-06,104.224312,107.683458,100.525227,103.814409,409012100
2024-08-07,98.885635,108.773198,98.665688,107.783437,411440400
2024-08-08,104.944138,105.474006,97.495969,101.974868,391910000
2024-08-09,104.72419,106.573732,103.404515,105.61397,290844200
2024-08-12,108.993141,111.042639,106.233827,106.293809,325559900
2024-08-13,116.111382,116.201363,111.552508,112.412296,312646700
2024-08-14,118.050911,118.570779,114.041897,118.500797,339246400
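
Beyond selecting a whole month, `loc` also accepts a single date or a slice of dates. The sketch below shows this on a stand-in frame built from five of the August rows above; the name `sample` is an assumption so the snippet runs on its own.

```python
import pandas as pd

# Stand-in built from five August rows shown above
sample = pd.DataFrame(
    {"Close": [109.183098, 107.243561, 100.425247, 104.224312, 98.885635]},
    index=pd.to_datetime(
        ["2024-08-01", "2024-08-02", "2024-08-05", "2024-08-06", "2024-08-07"]
    ),
)

# A full date string selects a single row
one_day = sample.loc["2024-08-05"]

# A slice of date strings selects a range; unlike iloc, the end label is included
first_days = sample.loc["2024-08-01":"2024-08-05"]

print(one_day)
print(first_days)
```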


## Getting the Index

Oftentimes it is helpful to retrieve the index of the dataframe for a given row or rows. Let's say we wanted to see the dates where Nvidia's `Volume` was less than 150 000 000. After some smart boolean indexing, we can look at the `index` attribute.

In [None]:
df[df["Volume"] < 150_000_000].index

DatetimeIndex(['2024-11-29', '2024-12-24', '2024-12-26'], dtype='datetime64[ns]', name='Date', freq=None)

Because we converted the `Date` column to datetime before setting it as the index, we can pull specific information, such as the month or the day of the week, out of the index.
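
As a brief sketch, the snippet below rebuilds the three low-volume dates returned above and reads a few attributes from them. The variable name `dates` is an assumption; with the real data, the same attributes are available on `df.index`.

```python
import pandas as pd

# The three low-volume dates returned by the cell above
dates = pd.DatetimeIndex(["2024-11-29", "2024-12-24", "2024-12-26"])

print(dates.month)       # month number of each date
print(dates.day)         # day of the month
print(dates.day_name())  # weekday name
```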

## Aggregations

There are many basic operations we can do with pandas, such as calculating the mean of a column, the maximum of a column, and so on. We generally refer to these as *aggregations* since they reduce multiple values to one summary value.


In [None]:
# Calculate the mean of 'Close' prices
print("Mean close", df['Close'].mean())

# Find the maximum volume traded
print("Max volume", df['Volume'].max())

# Find the day that had the max volume traded
print("Max volume day", df['Volume'].idxmax())

# Be careful when using these operations on multiple columns
# We can calculate the mean of the high and low column like so
print("High/low COLUMN average")
print(df[["Low", "High"]].mean())

# Or we can calculate the mean high low of each row
print("High/low ROW average")
print(df[["Low", "High"]].mean(axis=1))


### Exercise: Nvidia Quarters

Did Q3 or Q4 have more trading days where the `Close` price was above the annual average (i.e. above the mean)?

In [18]:
## YOUR CODE GOES HERE
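# True on days where Close finished above the annual mean
condition = df["Close"] > df["Close"].mean()
# Label each trading day with its calendar quarter
quarters = df.index.to_period("Q")
# Summing a boolean Series counts its True values
print("Q3 days above mean:", condition[quarters == pd.Period("2024Q3")].sum())
print("Q4 days above mean:", condition[quarters == pd.Period("2024Q4")].sum())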

### Exercise: March Madness

Looking only at the month of March, print the following information:

* First opening price of the period
* Last close price of the period
* Total volume traded over the period

In [None]:
## YOUR CODE GOES HERE