# Introduction to Time Series Data

Data can come in many different formats, and in many different shapes and sizes. You may have heard of tabular data, a format you may be familiar with from working in something like Excel.

We will explore two main kinds of tabular data in this module. The first is time series data. Time series data will be *indexed* with a date and time. We'll look a bit more closely at that soon, but for now just think of it as each row having a date or time, rather than a row number.

## Loading Data

One of the most popular packages in Python for working with tabular data is called Pandas. Today we'll get acquainted with Pandas.

The first thing we'll do is `import` the `pandas` package. Convention has us use a shortform name - `pd` - because we'll be using the package so often.

In [3]:
import pandas as pd

And below we'll use pandas' `read_csv()` to load the data into a `DataFrame`. DataFrames are the main data structure in pandas for tabular data, and lots of other programming languages use the concept of a DataFrame too! By convention, you'll often see `df` used as a variable name.

In [4]:
df = pd.read_csv("data/NVDA_2024.csv")
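If the CSV file isn't to hand, the same call can be tried on an in-memory string. This is only a sketch: `csv_text` and `sample` are names made up for the illustration, and the three rows are copied from the output shown further down in this notebook.

```python
import io

import pandas as pd

# Three rows copied from the NVDA data shown in this notebook, held in a
# string so the example runs without data/NVDA_2024.csv.
csv_text = """Date,Close,High,Low,Open,Volume
2024-01-02,48.149918,49.276493,47.577135,49.225514,411254000
2024-01-03,47.551140,48.165907,47.302233,47.467172,320896000
2024-01-04,47.979984,48.481795,47.490167,47.749068,306535000
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))  # the result is a DataFrame
print(sample.shape)  # (number of rows, number of columns)
```

The first line of the text becomes the column names, and every following line becomes one row.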

Before we do anything else, it's a good idea to take a look at the DataFrame. Some methods will let us take a closer look at parts of our data. 

In [5]:
df
df.head()
print(df.head())


         Date      Close       High        Low       Open     Volume
0  2024-01-02  48.149918  49.276493  47.577135  49.225514  411254000
1  2024-01-03  47.551140  48.165907  47.302233  47.467172  320896000
2  2024-01-04  47.979984  48.481795  47.490167  47.749068  306535000
3  2024-01-05  49.078564  49.528395  48.287860  48.443804  415039000
4  2024-01-08  52.233379  52.255374  49.460423  49.493411  642510000


In [6]:
df.tail()

,Date,Close,High,Low,Open,Volume
246,2024-12-23,139.65715,139.777134,135.107566,136.267463,176053500
247,2024-12-24,140.207108,141.886946,138.637245,139.987127,105157000
248,2024-12-26,139.91713,140.837058,137.717335,139.687155,116205600
249,2024-12-27,136.997391,139.007216,134.697615,138.537258,170582600
250,2024-12-30,137.477356,140.257099,134.007674,134.817597,167734700


## Looking at data

Other methods and attributes give us an overview of the data as a whole. The `shape` attribute (no parentheses, because it is not a method) holds the number of rows and columns in our data frame, while the `info()` method reports the data type (`dtype`) of each column.

You'll notice the types are slightly different from the usual Python types - this is because they belong to the `numpy` package, which sits under the hood of `pandas`. We'll look more at `numpy` tomorrow, but for now here is a word about each of the types in our data frame.

- `float64` - 64-bit floating point (number with a decimal point)
- `int64` - 64-bit integer (whole number)
- `object` - other Python data types (strings in this case)
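As a small sketch of how these types get assigned, the frame below is built by hand from two of the rows shown earlier. The name `sample` is only for illustration.

```python
import pandas as pd

# Two rows copied from the head of the NVDA data, typed in by hand.
sample = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03"],   # strings
    "Close": [48.149918, 47.551140],        # numbers with a decimal point
    "Volume": [411254000, 320896000],       # whole numbers
})

print(sample.shape)   # attribute, so no parentheses: (rows, columns)
print(sample.dtypes)  # one dtype per column
```

pandas infers `float64` for `Close` and `int64` for `Volume` from the values themselves. Until it is converted, `Date` is stored as text rather than as dates.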

In [7]:
df.shape

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    251 non-null    object 
 1   Close   251 non-null    float64
 2   High    251 non-null    float64
 3   Low     251 non-null    float64
 4   Open    251 non-null    float64
 5   Volume  251 non-null    int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 11.9+ KB


For a look at some actual data within the data frame, we can use square bracket notation and `iloc` to access columns and rows. The `i` in `iloc` refers to **integer-based indexing**, so looking at a row or column *number*.

In [8]:
df["Close"]

df[["Open","Close"]]

df.iloc[0]

df.iloc[0,2]


np.float64(49.27649344260116)
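The selections above each return a different kind of object. The sketch below spells that out on a three-row frame built from the values shown earlier; `sample` and the other variable names are illustrative only.

```python
import pandas as pd

# Three rows and three columns copied from the head of the NVDA data.
sample = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03", "2024-01-04"],
    "Close": [48.149918, 47.551140, 47.979984],
    "High": [49.276493, 48.165907, 48.481795],
})

closes = sample["Close"]          # one column by name: a Series
first_row = sample.iloc[0]        # one row by position: a Series
high_day_one = sample.iloc[0, 2]  # row 0, column 2 ("High"): a single value
first_two = sample.iloc[:2]       # positions 0 and 1 (the end is excluded)

print(high_day_one)
print(first_two)
```

Positions count from 0, so column 2 is the third column.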

## Filtering Data

In Pandas, we can use a technique known as *boolean indexing* or *masking* to filter rows depending on some condition. We can express a condition as a single *boolean expression*, or combine several into a *compound boolean expression* with `&` (and) or `|` (or). These are also called *logical expressions*. In a compound expression, wrap each comparison in parentheses.
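A mask is just a column of `True`/`False` values, one per row. The sketch below builds two masks on a four-row frame (values copied from the head of the data; the variable names and thresholds are chosen only for illustration) and combines them.

```python
import pandas as pd

sample = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
    "Close": [48.149918, 47.551140, 47.979984, 49.078564],
    "Volume": [411254000, 320896000, 306535000, 415039000],
})

busy = sample["Volume"] > 400_000_000  # True, False, False, True
high = sample["Close"] > 47.9          # True, False, True, True

both = sample[busy & high]    # rows where both masks are True
either = sample[busy | high]  # rows where at least one mask is True

# Written inline, each comparison needs its own parentheses because
# & binds more tightly than > in Python.
inline = sample[(sample["Volume"] > 400_000_000) & (sample["Close"] > 47.9)]

print(both)
print(either)
```

Naming the masks first, as with `busy` and `high`, tends to keep longer filters readable.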

In [9]:
condition = df["Date"] == "2024-08-08"

df[condition]

df[(df["Date"] >= "2024-08-08") & (df["Date"] <= "2024-08-31")]

,Date,Close,High,Low,Open,Volume
151,2024-08-08,104.944138,105.474006,97.495969,101.974868,391910000
152,2024-08-09,104.72419,106.573732,103.404515,105.61397,290844200
153,2024-08-12,108.993141,111.042639,106.233827,106.293809,325559900
154,2024-08-13,116.111382,116.201363,111.552508,112.412296,312646700
155,2024-08-14,118.050911,118.570779,114.041897,118.500797,339246400
156,2024-08-15,122.829735,123.209638,117.441063,118.730746,318086700
157,2024-08-16,124.549301,124.969196,121.150137,121.909952,302589900
158,2024-08-19,129.967972,129.967972,123.389591,124.24938,318333600
159,2024-08-20,127.218643,129.848,125.858978,128.368354,300087400
160,2024-08-21,128.468353,129.31815,126.62881,127.288644,257883600


### Exercise: End of Year

Display the data for the entire month of December 2024.

In [10]:
## YOUR CODE GOES HERE

df[(df["Date"] >= "2024-12-01") & (df["Date"] <= "2024-12-31")]

df[df["Date"] >= "2024-12-01"] # relying on the data ending in 2024

,Date,Close,High,Low,Open,Volume
231,2024-12-02,138.607697,140.427396,137.797829,138.807661,171682800
232,2024-12-03,140.237442,140.517396,137.927816,138.237764,164414000
233,2024-12-04,145.116653,145.766543,140.267427,141.977159,231224300
234,2024-12-05,145.046661,146.526521,143.936763,145.09666,172621200
235,2024-12-06,142.426895,145.68659,141.296994,144.5867,188505600
236,2024-12-09,138.797226,139.93712,137.117388,138.957215,189308600
237,2024-12-10,135.057587,141.806966,133.77769,138.997212,210020900
238,2024-12-11,139.29718,140.157102,135.197567,137.347363,184905200
239,2024-12-12,137.327362,138.427267,135.78751,137.067391,159211400
240,2024-12-13,134.237656,139.58717,132.527806,138.927227,231514900


### Exercise: The First Fifty

Display the data for the first fifty (50) days of trading in the period.

In [11]:
## YOUR CODE GOES HERE

df.head(50)

df.iloc[0:50] # or without the 0

,Date,Close,High,Low,Open,Volume
0,2024-01-02,48.149918,49.276493,47.577135,49.225514,411254000
1,2024-01-03,47.55114,48.165907,47.302233,47.467172,320896000
2,2024-01-04,47.979984,48.481795,47.490167,47.749068,306535000
3,2024-01-05,49.078564,49.528395,48.28786,48.443804,415039000
4,2024-01-08,52.233379,52.255374,49.460423,49.493411,642510000
5,2024-01-09,53.120052,54.304609,51.670596,52.381331,773100000
6,2024-01-10,54.329597,54.579504,53.468921,53.595876,533796000
7,2024-01-11,54.801418,55.325224,53.539895,54.978354,596759000
8,2024-01-12,54.689457,54.949361,54.309602,54.599491,352994000
9,2024-01-16,56.360836,56.813665,54.879394,54.99735,449580000


### Optional Exercise: Big Days

Display rows where trading volume exceeded 800 000 000.

In [12]:
## YOUR CODE GOES HERE

df[df["Volume"] > 800_000_000]


,Date,Close,High,Low,Open,Volume
35,2024-02-22,78.508522,78.545503,74.192142,74.99684,865100000
36,2024-02-23,78.787407,82.36306,77.540874,80.759666,829388000
46,2024-03-08,87.499252,97.368012,86.477585,95.106754,1142269000
75,2024-04-19,76.174973,84.296305,75.581173,83.122695,875198000
99,2024-05-23,103.764908,106.285076,101.486649,101.994486,835065000


## Setting the Index

In a DataFrame, each row is assigned a unique index value. By default, this is just a number (starting at 0). When it makes sense, we can choose one of the other columns to be an index. For time series data, where each row represents a different point in time, we'll set our `Date` column as the index. This will make it easier for us to work with the data, and can speed up other operations later on.


In [13]:
df["Date"] = pd.to_datetime(df["Date"])

df.set_index("Date", inplace=True) # run this cell only once: afterwards "Date" is the index, not a column, so a second run fails (use Run All to start over)

df # preview the re-indexed data frame

Date,Close,High,Low,Open,Volume
2024-01-02,48.149918,49.276493,47.577135,49.225514,411254000
2024-01-03,47.551140,48.165907,47.302233,47.467172,320896000
2024-01-04,47.979984,48.481795,47.490167,47.749068,306535000
2024-01-05,49.078564,49.528395,48.287860,48.443804,415039000
2024-01-08,52.233379,52.255374,49.460423,49.493411,642510000
...,...,...,...,...,...
2024-12-23,139.657150,139.777134,135.107566,136.267463,176053500
2024-12-24,140.207108,141.886946,138.637245,139.987127,105157000
2024-12-26,139.917130,140.837058,137.717335,139.687155,116205600
2024-12-27,136.997391,139.007216,134.697615,138.537258,170582600


We convert the `Date` column from strings to datetime values so that pandas can treat them as dates: selecting a whole month or looking up the day of the week both rely on this. We set the `Date` column as the index because in time-series data like ours, operations are time-based.
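One way around the run-once limitation of `inplace=True` is to leave the original frame alone and store the re-indexed result under a new name. This is a sketch of that alternative, not what the cell above does; `raw` and `indexed` are names chosen for the illustration.

```python
import pandas as pd

# Three rows copied from the head of the NVDA data.
raw = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03", "2024-01-04"],
    "Close": [48.149918, 47.551140, 47.979984],
})

# assign() returns a copy with the converted column, and set_index()
# returns another new frame, so raw itself is never modified.
indexed = raw.assign(Date=pd.to_datetime(raw["Date"])).set_index("Date")

print(indexed.index)  # a DatetimeIndex
print(raw.columns)    # "Date" is still an ordinary column here
```

Because `raw` is untouched, this cell can be re-run as often as needed.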

With the index set, we can now use it to access different portions of our data a little bit more easily. Because our indices are labeled, we can use `loc` for **label-based indexing**.

In [14]:
df.loc["2024-08"]

df.loc["2024-01-01":"2024-01-07"]

df.loc["2024-08-08", "Close"]

np.float64(104.9441375732422)
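With a datetime index, `loc` accepts date strings at several levels of detail. The sketch below uses the first five trading days shown earlier; the variable names are illustrative.

```python
import pandas as pd

# The first five trading days of 2024, copied from the head of the data.
dates = pd.to_datetime(
    ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"]
)
sample = pd.DataFrame(
    {"Close": [48.149918, 47.551140, 47.979984, 49.078564, 52.233379]},
    index=dates,
)
sample.index.name = "Date"

january = sample.loc["2024-01"]                     # every row in that month
first_week = sample.loc["2024-01-01":"2024-01-07"]  # both ends are included
one_close = sample.loc["2024-01-05", "Close"]       # one row label, one column

print(len(january), len(first_week), one_close)
```

The slice endpoints do not have to be trading days themselves: 1 January and 7 January are missing from the index, and the slice still returns the four rows that fall between them.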

## Getting the Index

Oftentimes it is helpful to retrieve the index of the dataframe for a given row or rows. Let's say we wanted to see the dates where Nvidia's `Volume` was less than 150 000 000. After some smart boolean indexing, we can look at the `index` attribute.

In [15]:
dates = df[df["Volume"] < 150_000_000].index
dates

DatetimeIndex(['2024-11-29', '2024-12-24', '2024-12-26'], dtype='datetime64[ns]', name='Date', freq=None)

Because we converted the dates to datetime before setting them as the index, we can pull specific information out of the index.

In [16]:
dates.day_name()
dates.to_period("Q")
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 251 entries, 2024-01-02 to 2024-12-30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Close   251 non-null    float64
 1   High    251 non-null    float64
 2   Low     251 non-null    float64
 3   Open    251 non-null    float64
 4   Volume  251 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 19.9 KB
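To see what those index methods return, the sketch below rebuilds the three low-volume dates found above as a standalone `DatetimeIndex`. The names `day_names`, `months` and `quarters` are only for illustration.

```python
import pandas as pd

# The three dates returned by the low-volume filter above.
dates = pd.DatetimeIndex(["2024-11-29", "2024-12-24", "2024-12-26"])

day_names = list(dates.day_name())                  # weekday of each date
months = list(dates.month)                          # month number, 1 to 12
quarters = [str(q) for q in dates.to_period("Q")]   # calendar quarter

print(day_names)  # ['Friday', 'Tuesday', 'Thursday']
print(months)     # [11, 12, 12]
print(quarters)   # ['2024Q4', '2024Q4', '2024Q4']
```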


## Aggregations

There are many basic operations we can do with pandas, such as calculating the mean of a column, the maximum of a column, and so on. We generally refer to these as *aggregations* since they reduce multiple values to one summary value.


In [17]:
df.Close
df["Close"] # These two forms are equivalent; the dot form is shorter, but only works when the column name is a valid Python identifier

df.Close.mean() # the parentheses call the method
df.Volume.max()
df.Volume.idxmax() # index label (here, the date) of the row with the largest Volume

df[["Low","High"]].mean(axis=1) # row-wise average of Low and High: axis=1 takes the mean across columns for each row

Date
2024-01-02     48.426814
2024-01-03     47.734070
2024-01-04     47.985981
2024-01-05     48.908128
2024-01-08     50.857898
                 ...    
2024-12-23    137.442350
2024-12-24    140.262095
2024-12-26    139.277197
2024-12-27    136.852415
2024-12-30    137.132386
Length: 251, dtype: float64
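The `axis` argument decides which direction an aggregation collapses. The sketch below contrasts the two directions on three rows copied from the head of the data; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Low": [47.577135, 47.302233, 47.490167],
        "High": [49.276493, 48.165907, 48.481795],
        "Volume": [411254000, 320896000, 306535000],
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

max_volume = sample["Volume"].max()      # the largest value in the column
busiest_day = sample["Volume"].idxmax()  # the index label where it occurs

column_means = sample[["Low", "High"]].mean()     # default axis=0: one value per column
midpoints = sample[["Low", "High"]].mean(axis=1)  # axis=1: one value per row

print(max_volume, busiest_day)
print(column_means)
print(midpoints)
```

`column_means` has two entries (one per column), while `midpoints` has three (one per date).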

### Exercise: Nvidia Quarters

Did Q3 or Q4 have more trading days where the `Close` price was above the annual average (i.e. above the mean)?

In [None]:
## YOUR CODE GOES HERE

condition = df.Close > df.Close.mean() # True on days where Close is above the annual average

quarters = condition.index.to_period("Q") # the quarter each trading day falls in

# True counts as 1, so sum() counts the above-average days in each quarter
print(condition.loc[quarters == "2024Q3"].sum())
print(condition.loc[quarters == "2024Q4"].sum())

53
63


### Exercise: March Madness

Looking only at the month of March, print the following information:

* First opening price of the period
* Last close price of the period
* Total volume traded over the period

In [28]:
## YOUR CODE GOES HERE

march_data = df.loc["2024-03"]
march_data

march_data.Open.iloc[0]
march_data.Close.iloc[-1]
march_data.Volume.sum()

np.int64(12149218000)