
# Data Visualization
Now that we have learned to import and work with data (in particular, DataFrames), we can look at some visualization tools. We will discuss pandas and seaborn plots. However, Python has many more libraries for data visualization. The best-known one besides these two is [matplotlib](https://matplotlib.org/), which is often used in combination with them.

## Python is an object-oriented programming language

What does that mean? Every "thing" (e.g. integers, floats, lists, dicts, DataFrames) is an "object". We have seen that some objects have methods (e.g. DataFrame**.sum()**, list**.append()**, string**.lower()**), which are essentially functions associated with that specific object.

Objects can also have attributes (e.g. DataFrame**.shape**), which are essentially properties of that object.

Therefore, when we create a DataFrame via `pd.DataFrame`, we are not calling an ordinary *function*; we are instantiating an object of the type `DataFrame`. The correct term for this is creating an instance of the `DataFrame` **`class`**. How this works is, however, beyond the scope of this course.

The key thing to remember is that `DataFrame`s and other objects can keep state. That is, if you add rows to a DataFrame, it stores them. A function, on the other hand, only returns its value; its local variables disappear together with the local namespace once it has finished.
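
To make the difference between an object that keeps state and a function that does not more concrete, here is a minimal sketch. The names `Portfolio` and `count_tickers` are made up for this illustration and are not part of pandas or any other library.

```python
# Hypothetical names, for illustration only
class Portfolio:
    """A tiny class whose instances remember the tickers added to them."""

    def __init__(self):
        self.tickers = []  # attribute: state stored on the instance

    def add(self, ticker):
        """Method: a function attached to the object that changes its state."""
        self.tickers.append(ticker)


def count_tickers(tickers):
    """Plain function: the local variable `n` is gone once the function returns."""
    n = len(tickers)
    return n


portfolio = Portfolio()  # create an instance of the class
portfolio.add("AAPL")
portfolio.add("MSFT")

print(portfolio.tickers)                 # the object kept both additions
print(count_tickers(portfolio.tickers))  # only the returned value survives
```

After the two calls to `add`, the object still holds both entries, whereas nothing from inside `count_tickers` is accessible afterwards except the value it returned.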

## Pandas Plots
Pandas allows us to make simple plots very quickly and easily.

In [None]:
import pandas as pd

Let's use some real-world data. The authors of the paper Jensen, T., Kelly, B., and Pedersen, L., “Is There a Replication Crisis in Finance?”, Journal of Finance (2023), provide [replication data](https://jkpfactors.com/factor-returns?country=usa&theme=all_factors&frequency=monthly&weight=vw_cap#) on [their website](https://jkpfactors.com/?country=usa&factor=all_factors).

`pandas` can directly import zipped CSV files from web links, which is very convenient. Let's try it.

In [None]:
df = pd.read_csv(
    "https://jkpfactors.s3.amazonaws.com/public/%5Busa%5D_%5Ball_factors%5D_%5Bmonthly%5D_%5Bvw_cap%5D.zip",
    parse_dates=["date"] # Note that we are attempting to parse the dates in the column named 'date'
)
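
If you want to see what the zip handling and `parse_dates` do without downloading anything, the following sketch may help. It writes a small made-up table to a temporary zip file and reads it back twice. The data, the file name `toy.zip`, and the variable names are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({"date": ["2020-01-31", "2020-02-29"], "ret": [0.01, -0.02]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.zip")
    toy.to_csv(path, index=False)  # compression is inferred from the .zip suffix

    as_text = pd.read_csv(path)                         # dates stay plain text
    as_dates = pd.read_csv(path, parse_dates=["date"])  # dates become timestamps

print(as_text["date"].dtype)
print(as_dates["date"].dtype)
```

Only the second call yields a datetime column, which is what pandas needs to draw a proper time axis later on.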

In [None]:
# Let's check a random sample of rows
df.sample(5)

Let's assume we don't like the column name `name`, and we want to rename it to `factor`. We can do that using the `.rename()` method. Because we want to rename a column, we use the keyword argument `columns`:

In [None]:
df = df.rename(columns={'name': 'factor'})
df.head()
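
Note that the result of `.rename()` is assigned back to `df`. By default, `.rename()` returns a new DataFrame and leaves the original untouched. A small sketch on made-up data (the frame `toy` is not part of the dataset) shows this:

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({"name": ["cash_at", "debt_me"], "ret": [0.01, -0.02]})

renamed = toy.rename(columns={"name": "factor"})

print(list(toy.columns))      # the original still has the column 'name'
print(list(renamed.columns))  # the returned copy has the column 'factor'
```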

For now, we only want to work with a single factor: `cash_at`. So we use the `query` method to filter the data:

In [None]:
input_data = df.query("factor=='cash_at'")

In [None]:
input_data.head()
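
As a side note, `query` is a more readable way of writing a boolean filter. The sketch below, again on made-up data, suggests that both spellings select the same rows:

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "factor": ["cash_at", "debt_me", "cash_at"],
    "ret": [0.01, -0.02, 0.03],
})

via_query = toy.query("factor == 'cash_at'")  # filter written as a string expression
via_mask = toy[toy["factor"] == "cash_at"]    # the same filter as a boolean mask

print(via_query.equals(via_mask))
```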

We can make a chart (or a "plot") by calling the `plot` method.

In [None]:
input_data.plot()

Well... it's a chart, but we can improve it by focusing on the relevant things. Let's only look at the number of stocks:

In [None]:
input_data['n_stocks'].plot()

Better! But the x-axis still does not look the way we want. Why is that? By default, pandas puts the index of the data on the x-axis, but we want to use the date instead. So let's set the x-axis explicitly and plot again. Instead of taking out a single column and plotting it, we take the full DataFrame and tell pandas which columns should go where. Let's plot both `n_stocks` and `n_stocks_min`.

In [None]:
input_data.plot(x='date', y=['n_stocks', 'n_stocks_min'])

Now that looks a lot better!
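
Since pandas uses the index for the x-axis by default, an alternative is to make the date the index before plotting. Here is a minimal sketch on made-up numbers (the frame `toy` is an assumption for illustration, not the factor data):

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30"]),
    "n_stocks": [100, 110, 105, 120],
})

# With the date as index, a single column can be plotted against time directly
ax = toy.set_index("date")["n_stocks"].plot()
```

Both approaches should give a time axis. Passing `x='date'` keeps the DataFrame unchanged, while `set_index` is handy if you plan to work with the dates as an index anyway.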

**Quick exercise**

Plot the `n_stocks_min` against `n_stocks` as a scatterplot. Set the title to `Scatterplot` and use log-scaling for the y-axis. Check [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) for hints.

## Seaborn Plots
The pandas library is very nice to use for some quick-and-dirty plots. For plots that you might want to put into your thesis, and some special types of plots, seaborn is the better choice.

To make them look nicer, we can choose styles. You can look [here](https://python-charts.com/seaborn/themes/) for some examples.

In [None]:
import seaborn as sns
sns.set_style("darkgrid") # This is important for making it look nicer!

In seaborn, we can pass `pd.DataFrame` objects directly as the data argument of the plot functions. Using [`lineplot`](https://seaborn.pydata.org/generated/seaborn.lineplot.html) we can create line plots.

To pass data to the plot, we use the `data` argument:

In [None]:
sns.lineplot(x="date", y="n_stocks", data=input_data)

Using [`histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) we can create histograms.

In [None]:
sns.histplot(
    x='n_stocks',
    data = input_data,
)

And using [`scatterplot`](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) we can create scatter plots. To color the dots based on their return (column `ret`), we can set the `hue` argument.

In [None]:
sns.scatterplot(x="n_stocks_min", y="n_stocks", hue='ret', data=input_data)

Another very useful type of plot is the [`heatmap`](https://seaborn.pydata.org/generated/seaborn.heatmap.html). We can use it, for example, to show a visual representation of a correlation matrix.

In [None]:
correlation_matrix = input_data[['ret','n_stocks','n_stocks_min']].corr()
correlation_matrix

In [None]:
sns.heatmap(correlation_matrix, annot=True, vmin=-1, vmax=1, cmap="crest")
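
The arguments `vmin=-1` and `vmax=1` match the range of possible values: `.corr()` computes Pearson correlations by default, which always lie between -1 and 1. The following sketch uses made-up columns that are constructed to hit both extremes:

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # exactly 2 * a: perfect positive correlation
    "c": [4.0, 3.0, 2.0, 1.0],  # falls as a rises: perfect negative correlation
})

corr = toy.corr()  # pairwise correlations between all numeric columns
print(corr.round(2))
```

The diagonal is always 1, because every column is perfectly correlated with itself, and the matrix is symmetric.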

**Quick exercise**

- Check out available [color palettes](https://seaborn.pydata.org/tutorial/color_palettes.html) for seaborn.
- Plot the correlation heatmap using a diverging color palette.

## When to use matplotlib

matplotlib is the underlying package that pandas and seaborn rely on. They simply make it nicer and easier to use.

For some features, matplotlib still comes in very handy.

For example, we can use it to draw a horizontal or vertical line at a specific point.

In [None]:
# sns.scatterplot returns a matplotlib Axes object. We store it in a variable so that we can modify it and add features to it
ax = sns.scatterplot(x="n_stocks_min", y="n_stocks", hue='ret', data=input_data)

ax.axhline(y=1000, color='r')  # horizontal red line at y = 1000

ax.axvline(x=1000, color='b')  # vertical blue line at x = 1000
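
The same idea applies to pandas plots, because they are also built on matplotlib. A minimal sketch on made-up data (the frame `toy` and the title are assumptions for illustration):

```python
import matplotlib.axes
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 15]})

ax = toy.plot(x="x", y="y")                  # the pandas plot returns a matplotlib object
print(isinstance(ax, matplotlib.axes.Axes))  # check what kind of object we got back

ax.axhline(y=15, color="r")  # matplotlib method: horizontal reference line
ax.set_title("Toy data")     # matplotlib method: set the title
```

Whatever pandas or seaborn returns can therefore be refined further with regular matplotlib methods.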

## Exercises

a)
Filter the `df` dataframe so that it only includes rows where "factor" is either "cash_at" or "debt_me".

- Create a seaborn line plot of `n_stocks` over time for each factor.
- Draw a seaborn histogram of the distribution of `n_stocks` for each of the two factors.
- Create a seaborn scatterplot of "ret" (on x) and "n_stocks" (on y). Ensure that the two factors have different markers and colors. Use an appropriate type of color palette for this.


b)
- Take a random sample of 5 values of "factor".
- Filter the dataframe to the rows corresponding to those factors. Check the help for `query`. It tells you how to access variables.
- `pivot` the data, so that index is `date`, columns are `factor`, values are `ret`.
- Calculate the return correlation
- Plot it as heatmap; use an appropriate color palette

c)
- Take the simple lineplot from part a).
- Save the output to a PDF file. Check the type and then use the internet to find help.
- Download it to your computer and take a look.
- Turn your code for saving the plot into a function, which allows you to specify the filename.