# How are trading volume and volatility related for energy stocks?

## Goals

In this lab, we will be replicating some of the analyses we conducted in the main lecture. You will gain more familiarity with the `pandas` library, which was introduced in the pre-foundational Python program, and will learn how to do basic feature engineering, how to plot distributions and how to calculate summary statistics.

## Importing the required libraries

Let's start by importing `pandas`:

In [None]:
# Import the Pandas package
import pandas as pd

We also import a second package, `matplotlib`, which is the basic Python library for plotting. We only need its `pyplot` module, which we import under the customary alias `plt`:

In [None]:
# Import the Matplotlib package (for plotting)
import matplotlib.pyplot as plt

## Basic feature engineering

As you know, we can create new columns (also known as features) in our dataset using data from other columns. This is often useful when the existing features aren't in the right format or unit, or don't provide a lot of information by themselves.

Let's load in the CSV file for the stock symbol `D`:

In [None]:
# Load a file as a DataFrame and assign to the variable name D
D = pd.read_csv("data/D.csv")

In [None]:
D.head(10)

As you see, our DataFrame has 6 columns (`Date`, `Open`, `High`, `Low`, `Close`, and `Volume`) and one default index (the vertical sequence of integers in bold to the left of the table).

It is usually a good idea to set one of the columns as the index, instead of using the default. We can do that with the **`.set_index()`** method:

In [None]:
D = D.set_index('Date') # In this context, Date is the best candidate to be the index
D.head(10)

### Exercise 1

Load the other four datasets, namely `DUK.csv`, `EXC.csv`, `NEE.csv`, and `SO.csv`, and assign their contents to variables whose name is the symbol (that is, `DUK.csv` should be assigned to the variable `DUK`, and so on). Don't forget to set `Date` as the index for each one of the DataFrames.

**Answer.** Shown below.

In [None]:
DUK = pd.read_csv("data/DUK.csv")
DUK = DUK.set_index('Date')

EXC = pd.read_csv("data/EXC.csv")
EXC = EXC.set_index('Date')

NEE = pd.read_csv("data/NEE.csv")
NEE = NEE.set_index('Date')

SO = pd.read_csv("data/SO.csv")
SO = SO.set_index('Date')
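
If the repetition above feels tedious, a loop can do the same work. What follows is a minimal sketch, not the lab's official answer: it assumes you are happy to keep the DataFrames in a dictionary (here called `frames`, a hypothetical name) instead of in separate variables. The CSV text is made up so that the snippet runs without the data files, and `index_col="Date"` is an alternative to calling `.set_index()` after loading.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real files in data/ (values are illustrative only)
fake_files = {
    "DUK": "Date,Open,High,Low,Close,Volume\n"
           "2014-07-01,10.0,11.0,9.5,10.5,1000\n"
           "2014-07-02,10.5,11.5,10.0,11.0,1500\n",
    "EXC": "Date,Open,High,Low,Close,Volume\n"
           "2014-07-01,20.0,21.0,19.0,20.5,2000\n"
           "2014-07-02,20.5,22.0,20.0,21.0,2500\n",
}

frames = {}
for symbol, text in fake_files.items():
    # With the real data you would pass f"data/{symbol}.csv" instead of the StringIO object
    frames[symbol] = pd.read_csv(io.StringIO(text), index_col="Date")

print(frames["DUK"])
```

You would then reach a given DataFrame with `frames["DUK"]` rather than `DUK`.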

### The `Symbol` column

To create a new column in `pandas`, you use this syntax:

~~~python
my_dataframe["name_of_new_column"] = the_definition_of_the_new_column
~~~

So, to create a new column with the name of the symbol, you use:

In [None]:
D["Symbol"] = "D"
D.head(10)

Here we defined a new column called `Symbol` and asked `pandas` to assign the string value `D` to *all* of its rows. Let's create the corresponding `Symbol` column in `DUK`, `EXC`, `NEE` and `SO`.

In [None]:
DUK["Symbol"] = "DUK"
EXC["Symbol"] = "EXC"
NEE["Symbol"] = "NEE"
SO["Symbol"] = "SO"

#### Concatenating our DataFrames

Below we stitch all the DataFrames together with the `pd.concat()` function (we will need this concatenated DataFrame in a moment):

In [None]:
stocks = pd.concat([D, DUK, EXC, NEE, SO])
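
To see what `pd.concat()` does to the rows and to the index, here is a small sketch with two made-up DataFrames (`a` and `b` are hypothetical names, and the prices are invented). It suggests why the `Symbol` column matters: after stacking, each date appears once per symbol, so the index alone no longer identifies a row.

```python
import pandas as pd

# Two tiny made-up DataFrames indexed by date
dates = ["2014-07-01", "2014-07-02"]
a = pd.DataFrame({"Close": [10.0, 11.0], "Symbol": "A"}, index=dates)
b = pd.DataFrame({"Close": [20.0, 21.0], "Symbol": "B"}, index=dates)

# Stack the rows of b underneath the rows of a
combined = pd.concat([a, b])

print(combined)
print(len(combined))             # rows of a plus rows of b
print(combined.index.is_unique)  # False: every date now appears once per symbol
```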

### The `Return` column

This is the daily return formula:

$$
Return = \frac{Close_t}{Close_{t-1}} - 1
$$

Let's look at our DataFrame `D` (we only show the first three rows):

In [None]:
D.head(3)

Let's compute $Return$ for July 3:

In [None]:
close_2_july = 61.730
close_3_july = 60.863
return_3_july = (close_3_july / close_2_july) - 1
print(return_3_july)

We could write a `for` loop to do this operation on each of the rows one by one, but thankfully `pandas` offers us an easier way - the **`.shift()`** method. The `.shift()` method takes a Series and shifts it up or down (depending on your input) in relation to the index of the Series. This is better understood with an example - here's the head of the `High` Series:

In [None]:
D["High"].head(5)

You can see that the `High` price for July 3 was 61.394.

You can shift the Series so that the value that is associated with July 3 is no longer its own, but rather the one corresponding to July 7:

In [None]:
D["High"].head(5).shift(-1)

You can, of course, reverse the direction of the shift and move the whole Series "downwards" instead:

In [None]:
D["High"].head(5).shift(1)

And you can shift it as many rows as you want:

In [None]:
D["High"].head(5).shift(3)

### Exercise 2


This is the formula we want to code:

$$
Return = \frac{Close_t}{Close_{t-1}} - 1
$$

Getting $Close_t$ is easy:

~~~python
D["Close"]
~~~

But how would you get its shifted version, $Close_{t-1}$?

**Hint:** Let's say $t=30$. You would then want to get $Close_{29}$. So, in your table, the row for $t=30$ should have a column that contains $Close_{30}$ (this one already exists, it's in `Close`) and another one (call it `Close_Previous`) that contains $Close_{29}$.

**Answer.** Shown below.

In [None]:
D["Close_Previous"] = D["Close"].shift(1)
D.head(10)

### Exercise 3

How would you code the definition of `Return`?

**Answer.**

In [None]:
D["Return"] = (D["Close"] / D["Close_Previous"]) - 1.0
D.head(10)
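
Two related points are worth a quick sketch, using made-up prices for two hypothetical symbols `A` and `B`. First, `pandas` has a built-in `.pct_change()` method that should give the same numbers as the manual formula. Second, if you ever compute returns on a stacked DataFrame such as `stocks`, a plain `.shift(1)` would pair the first row of one symbol with the last row of the previous one; shifting within each group avoids that. This is one possible approach, not necessarily the one used in the lecture.

```python
import pandas as pd

# Made-up closing prices for two hypothetical symbols, stacked like `stocks`
toy = pd.DataFrame(
    {
        "Symbol": ["A", "A", "A", "B", "B", "B"],
        "Close": [100.0, 110.0, 99.0, 50.0, 55.0, 55.0],
    }
)

# Manual formula on a single symbol
a_close = toy.loc[toy["Symbol"] == "A", "Close"]
manual = a_close / a_close.shift(1) - 1

# Built-in equivalent
builtin = a_close.pct_change()

# On stacked data, shift within each symbol so that the first row of B
# is not divided by the last row of A
previous_close = toy.groupby("Symbol")["Close"].shift(1)
toy["Return"] = toy["Close"] / previous_close - 1

print(toy)
```

The first row of each symbol has no previous close, so its `Return` is `NaN`.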

### Trading volume in millions

Recall the formula:

$$
Volume\_Millions = \frac{Volume}{1{,}000{,}000}
$$

You probably remember that you can divide in Python using the `/` symbol. Since `pandas` Series support Python arithmetic, you can simply do this in order to create the new column (notice that we now use the concatenated DataFrame `stocks` instead of the `D` DataFrame):

In [None]:
stocks["Volume_Millions"] = stocks["Volume"] / 1000000
stocks.head(10)

### A measure of volatility

This is another formula that we need to replicate in our concatenated DataFrame:

$$
VolStat = \frac{High_t - Low_t}{Open_t}
$$

Let's add it to our dataset:

In [None]:
stocks["VolStat"] = (stocks["High"] - stocks["Low"]) / stocks["Open"]

## Creating histograms

This is a histogram of the `Volume_Millions` column in `stocks` (therefore it includes the data for all five symbols). Here we use the `.plot.hist()` method and accept the default bins that `pandas` calculates.

In [None]:
hvm = stocks['Volume_Millions'].plot.hist()

hvm.set_title("Histogram of Volume_Millions")
hvm.set_xlabel("Millions of shares")

plt.show()
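
To get a feel for what the bins are, you can compute the counts yourself without drawing anything. The sketch below uses made-up volumes and NumPy's `np.histogram` with 10 equal-width bins, which matches the default `bins=10` of `.plot.hist()`; treat it as an illustration of the idea rather than a description of the plotting internals.

```python
import numpy as np
import pandas as pd

# Made-up volumes, in millions of shares
volumes = pd.Series([1.0, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 4.0, 6.0, 10.0])

# 10 equal-width bins between the minimum and the maximum
counts, edges = np.histogram(volumes, bins=10)

print(counts)  # number of observations that fall in each bin
print(edges)   # 11 edges delimit 10 bins
```

If the default looks too coarse or too fine, you can pass a different number to the plotting method, for example `.plot.hist(bins=30)`.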

### Exercise 4

Plot the histogram of the `Open` column in `stocks`.

**Answer.** This is the histogram of the `Open` column:

In [None]:
ho = stocks['Open'].plot.hist()

ho.set_title("Histogram of Open")
ho.set_xlabel("USD")

plt.show()

## Summary statistics

### Minimum, maximum, mean, median and mode

To find the minimum, maximum, mean, median, and mode of a distribution, we can use these methods:

~~~python
.min()
.max()
.mean()
.median()
.mode()
~~~

For instance:

In [None]:
stocks["Close"].min()

### Percentiles

To compute the $p$-th percentile, you use

~~~python
my_series.quantile(p / 100)
~~~

The resulting number will be such that roughly $p$% of your data points are less than or equal to that value (when the percentile falls between two data points, `pandas` interpolates linearly between them by default). So, for instance, if you want to find the $p=30$ percentile (a value such that about 30% of your data points lie at or below it), you need to pass 0.3 to the method, like this:

~~~python
my_series.quantile(0.3)
~~~

For instance:

In [None]:
per0 = stocks["Close"].quantile(0)
per25 = stocks["Close"].quantile(0.25)
per50 = stocks["Close"].quantile(0.50)
per75 = stocks["Close"].quantile(0.75)
per100 = stocks["Close"].quantile(1)

print(per0)
print(per25)
print(per50)
print(per75)
print(per100)
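
A toy Series makes the interpolation easier to see. In this sketch (made-up data), the 30th percentile of the values 1 to 5 falls between 2 and 3, so the default linear interpolation returns a number that is not in the data. The `interpolation="lower"` argument is shown as one alternative that always returns an observed value.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])  # made-up data

# Default: linear interpolation between the two neighboring data points
p30 = s.quantile(0.3)

# Alternative: take the lower of the two neighboring data points
p30_lower = s.quantile(0.3, interpolation="lower")

# Passing a list returns a Series indexed by the requested quantiles
quartiles = s.quantile([0.25, 0.5, 0.75])

print(p30)        # approximately 2.2
print(p30_lower)  # 2
print(quartiles)
```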

## Using `.groupby()`

We can easily calculate the summary statistics for any given column with the **`.describe()`** method:

In [None]:
stocks["Volume_Millions"].describe()

However, it would be more useful if we computed the summary statistics *for each symbol individually.* Let's group our `stocks` DataFrame by `Symbol`. The resulting `DataFrameGroupBy` object should have five elements, because there are five symbols:

In [None]:
g = stocks.groupby(['Symbol'])
len(g)

So, this is basically the `stocks` DataFrame split up into five chunks, one for each stock symbol.

The real power of `.groupby()` is evident when you pair it with **aggregation functions** like `.sum()`, `.mean()` and others. For instance, selecting the `VolStat` column from the groups and taking its median gives one value per symbol:

In [None]:
stocks.groupby(['Symbol'])["VolStat"].median()

### Exercise 5

Copy the preceding code and modify it to call the `.describe()` method on the grouped `VolStat` column.

**Answer.**

In [None]:
stocks.groupby(['Symbol'])["VolStat"].describe()

This is the output we wanted. We have the summary statistics of the `VolStat` column *for each symbol*.
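
If you only need a few of these statistics, the `.agg()` method accepts a list of aggregation names. The sketch below uses a made-up DataFrame with two hypothetical symbols; the choice of `count`, `mean` and `max` is arbitrary.

```python
import pandas as pd

# Made-up VolStat values for two hypothetical symbols
toy = pd.DataFrame(
    {
        "Symbol": ["A", "A", "A", "B", "B"],
        "VolStat": [0.01, 0.02, 0.03, 0.10, 0.20],
    }
)

# One row per symbol, one column per requested statistic
summary = toy.groupby("Symbol")["VolStat"].agg(["count", "mean", "max"])

print(summary)
```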

## Plotting the time series

Let's now plot the `VolStat` time series. Since we have several stocks, we can use one series for each stock.

This bit that follows is important because you'll see it a lot in professional practice. Let's first find out what data type our index is:

In [None]:
stocks.index

So you can see that this index is of `dtype='object'`, which means that for `pandas` its elements are not dates, but rather simple strings. If we were to plot our variables using this index as the time axis, the labels would be treated as text rather than as points in time, which can lead to unexpected ordering and spacing. That's why it is advisable to cast our columns into the appropriate data types before plotting or doing analyses with them.

Let's convert this index to the `datetime` data type:

In [None]:
stocks.index = pd.to_datetime(stocks.index)
stocks.index

Now you see that our new data type is `datetime64[ns]`, which is what we wanted. For more details about the `datetime` data type, check the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html).
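
The following sketch shows one way text dates can misbehave. It assumes dates written as month/day/year, which is a made-up format chosen for illustration and not necessarily the one in the dataset: sorted as text they come out in alphabetical order, while after conversion they sort chronologically.

```python
import pandas as pd

# Made-up dates written as month/day/year text
labels = pd.Index(["7/3/2014", "10/1/2014", "7/10/2014"])

# Sorted as plain strings: alphabetical, so October comes before July
as_text = sorted(labels)

# Converted to datetimes first: chronological
as_dates = pd.to_datetime(labels, format="%m/%d/%Y").sort_values()

print(as_text)
print(as_dates)
```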

The actual plot is very easy to make. You just take the `SeriesGroupBy` object that corresponds to `VolStat` and call the `.plot()` method on it:

In [None]:
stocks.groupby("Symbol")["VolStat"].plot()

This plots one series for each symbol. The output is accurate, but the plot looks somewhat raw at the moment. Let's add the all-important legend and a title:

In [None]:
stocks.groupby("Symbol")["VolStat"].plot(legend=True, title="Energy Sector Trends - VolStat")

And let's make it a bit bigger with the `figsize` argument:

In [None]:
stocks.groupby("Symbol")["VolStat"].plot(legend=True,
                                         title="Energy Sector Trends - VolStat",
                                         figsize=(15,6)
                                        )
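
An alternative route to the same kind of chart is to reshape the data into a "wide" table, with one column per symbol, and plot that. The sketch below only performs the reshaping, on made-up data with two hypothetical symbols; calling `.plot()` on the resulting DataFrame would draw one line per column. It is offered as an option, not as the method used in this lab.

```python
import pandas as pd

# Made-up long-format data: one row per (date, symbol) pair
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2014-07-01", "2014-07-02", "2014-07-01", "2014-07-02"]
        ),
        "Symbol": ["A", "A", "B", "B"],
        "VolStat": [0.01, 0.02, 0.03, 0.04],
    }
)

# One row per date, one column per symbol
wide = toy.pivot(index="Date", columns="Symbol", values="VolStat")

print(wide)
```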

## Attribution

"Reshaping and pivot" (modified from the original), Pandas Developers, [BSD-3 license](https://github.com/pandas-dev/pandas/blob/master/LICENSE), https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html.


"Huge Stock Market Dataset", No. 10, 2017, Boris Marjanovic, Public Domain. https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs