# QF 627 Extras - Financial Analytics
## Lesson 3 | Exploratory Data Analysis with Grammar of Graphics | `RE`view

<table style="width: 100%;">
  <tr>
    <td style="width: 25%; text-align: center; padding-right: 20px;">
      <img src="https://images.squarespace-cdn.com/content/v1/53f3eb3ce4b077de0318f4ea/1705649199392-2MVU89VL4DTZM4HCYC2B/thankyou_prof_roh.gif" alt="Prof. Roh" width="100%" height="auto">
    </td>
    <td style="vertical-align: top;">

> Hi, Team 👋

> In our Week 3 lesson, we will build on your understanding of tabular data—a structured format where observations (rows) and variables (columns) can be easily accessed. The next step in your learning journey is data visualization.

> For those new to computational data science, data visualization is often perceived as a supplementary step that merely supports quantitative modeling with machine learning techniques such as regression, classification, clustering, and dimensionality reduction. However, in practice, visualization is a fundamental process in any data science endeavor.

> Here, I will first demonstrate how accessible it is for you to build computational data visualizations once you understand the universal grammar structure that governs data visualization 😊

> Next, I’ll show you why data visualization matters by working with a small dataset in the context of fintech products.

> In the final segment of this interactive Python lecture note, you will follow a step-by-step guide, using a piecemeal approach, to get your first hands-on experience with the Grammar of Graphics through the open-source Python library lets-plot.

> Let's begin our journey—yay! 🎉
    </td>
  </tr>
</table>

## Dependencies

In [1]:
import time # for our customized timer function

def countdown(Time):
    
    while Time:
        minutes, seconds = divmod(Time, 60)
        timer = "{:02d}:{:02d}".format(minutes, seconds)
        
        print(timer,
             end = "\r")
        
        time.sleep(1)
        Time -= 1
        
    print("Let's solve this problem together, Team :)")

In [2]:
countdown(5)

Let's solve this problem together, Team :)
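> The timer string in `countdown()` comes from `divmod`, which returns the quotient and the remainder in one call. Below is a minimal sketch of that formatting step, pulled out into a hypothetical helper named `format_timer` (it is not part of the original function).

```python
def format_timer(total_seconds):
    """Format a number of seconds as MM:SS, as countdown() does on each tick."""
    minutes, seconds = divmod(total_seconds, 60)  # (quotient, remainder)
    return "{:02d}:{:02d}".format(minutes, seconds)

print(format_timer(5))    # 00:05
print(format_timer(125))  # 02:05
```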


## 👉 <a id = "top">Learning Pointers</a> 👈 

## [1. Learning the Grammar of Graphics Framework Through Live Coding](#p1)

> ### <font color = red> Unpacking the Principles of the Grammar of Graphics </font>

## [2. The Importance of Data Visualization in Analytics](#p2)

> ### <font color = red> Why Visual Analytics Matter and How They Drive Insights </font>

## [3. First Look at the Grammar of Graphics with the lets-plot Library](#p3)

> ### <font color = red> Exploring the Layers of lets-plot for Effective Data Visualization </font>

## <a id = "p1"> 1. </a> <font color = "green"> Learning the Grammar of Graphics Framework Through Live Coding </font> [back to table of contents](#top)

    The Grammar of Graphics is a theoretical framework for describing and building data visualizations. Because it has been implemented in multiple programming languages through various packages, the same way of thinking carries over from one platform to the next.

> The `Grammar of Graphics` is a unifying method of visualizing your data and can be implemented across various computational languages (a cross-language concept used for data visualization).

* `Python` (plotnine, `lets-plot`): In Python, there are packages like plotnine and lets-plot that bring the Grammar of Graphics to the Python ecosystem.

* `JavaScript` (`G2`, Vega, and Vega-Lite): For web-based visualizations, Vega and Vega-Lite are declarative languages for creating, sharing, and exploring visualizations. They are inspired by the Grammar of Graphics and provide a high-level grammar for specifying visualizations.

* `R` (`ggplot2`): The best-known implementation of the Grammar of Graphics is the ggplot2 package in R, created by Hadley Wickham. It provides a powerful and flexible system for creating complex and multi-layered visualizations.

* `Julia` (`Gadfly`): Julia, a high-performance programming language for technical computing, has the Gadfly package, which is based on the Grammar of Graphics principles and offers a consistent approach to creating visualizations.

> First, I will walk you through how the Grammar of Graphics operates using synthetic tabular data. A deeper understanding can be achieved when `the underlying rationale is unpacked in real-time`—so let me provide you with that learning experience right from the beginning.

In [3]:
from IPython.display import Image

Image(url="https://static1.squarespace.com/static/53f3eb3ce4b077de0318f4ea/t/66d6857e966cef1122cc340b/1725334912400/ggplot_layers.jpg", 
      width = 600)

### Import

In [4]:
# %pip installs into the environment of the running kernel (the shell could not find `pip` via `!pip`)
%pip install numpy pandas


In [5]:
%whos

Variable    Type        Data/Info
---------------------------------
Image       type        <class 'IPython.core.display.Image'>
countdown   function    <function countdown at 0x1047cac00>
time        module      <module 'time' (built-in)>


In [6]:
import numpy

In [7]:
%whos

Variable    Type        Data/Info
---------------------------------
Image       type        <class 'IPython.core.display.Image'>
countdown   function    <function countdown at 0x1047cac00>
numpy       module      <module 'numpy' from '/op<...>kages/numpy/__init__.py'>
time        module      <module 'time' (built-in)>


In [8]:
import numpy as np

In [9]:
dir(np)

['False_',
 'ScalarType',
 'True_',
 '_CopyMode',
 '_NoValue',
 '__NUMPY_SETUP__',
 '__all__',
 '__array_api_version__',
 '__array_namespace_info__',
 '__builtins__',
 '__cached__',
 '__config__',
 '__dir__',
 '__doc__',
 '__expired_attributes__',
 '__file__',
 '__former_attrs__',
 '__future_scalars__',
 '__getattr__',
 '__loader__',
 '__name__',
 '__numpy_submodules__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_array_api_info',
 '_core',
 '_distributor_init',
 '_expired_attrs_2_0',
 '_globals',
 '_int_extended_msg',
 '_mat',
 '_msg',
 '_pyinstaller_hooks_dir',
 '_pytesttester',
 '_specific_msg',
 '_type_info',
 '_typing',
 '_utils',
 'abs',
 'absolute',
 'acos',
 'acosh',
 'add',
 'all',
 'allclose',
 'amax',
 'amin',
 'angle',
 'any',
 'append',
 'apply_along_axis',
 'apply_over_axes',
 'arange',
 'arccos',
 'arccosh',
 'arcsin',
 'arcsinh',
 'arctan',
 'arctan2',
 'arctanh',
 'argmax',
 'argmin',
 'argpartition',
 'argsort',
 'argwhere',
 'around',
 'array',
 'arr

In [10]:
%whos

Variable    Type        Data/Info
---------------------------------
Image       type        <class 'IPython.core.display.Image'>
countdown   function    <function countdown at 0x1047cac00>
np          module      <module 'numpy' from '/op<...>kages/numpy/__init__.py'>
numpy       module      <module 'numpy' from '/op<...>kages/numpy/__init__.py'>
time        module      <module 'time' (built-in)>


In [11]:
%who

Image	 countdown	 np	 numpy	 time	 


In [12]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### 1d-data

In [13]:
sample_list =\
    [1,
     2,
     3,
     4] # a representation of vector (1d data)

In [14]:
sample_list

[1, 2, 3, 4]

In [15]:
sample_list * 3

[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]

In [16]:
type(
    np
    .array(sample_list)
)

numpy.ndarray

In [17]:
(
    np
    .array(sample_list)
).ndim

1

In [18]:
sample_array =\
(
    np
    .array(sample_list)
)

In [19]:
sample_array * 3 # vectorized computation (element-wise operation)

array([ 3,  6,  9, 12])
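> The two `* 3` results above differ because a list and an array give `*` different meanings. Here is a small side-by-side sketch; the variable names `repeated`, `scaled_by_loop`, and `scaled` are mine, chosen for illustration.

```python
import numpy as np

sample_list = [1, 2, 3, 4]

repeated = sample_list * 3                     # list: repeats the sequence
scaled_by_loop = [x * 3 for x in sample_list]  # list: element-wise needs a loop
scaled = np.array(sample_list) * 3             # array: element-wise by default

print(repeated)         # [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
print(scaled_by_loop)   # [3, 6, 9, 12]
print(scaled.tolist())  # [3, 6, 9, 12]
```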

### 2d-data (Matrix)

### With Built-in Python

In [20]:
nested_list =\
    [
        [1,2,3],
        [2,3,4],
        [3,4,5]
    ]

In [21]:
nested_list # lists-in-list

[[1, 2, 3], [2, 3, 4], [3, 4, 5]]

### With NumPy

In [22]:
array_2d =\
(
    np
    .array(nested_list)
)
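> To see what the conversion buys you, the sketch below reads the same cell and the same column from the nested list and from the array. Nothing here is new data; it only reuses the matrix defined above.

```python
import numpy as np

nested_list = [[1, 2, 3],
               [2, 3, 4],
               [3, 4, 5]]
array_2d = np.array(nested_list)

print(array_2d.ndim)   # 2
print(array_2d.shape)  # (3, 3)

# Same cell (first row, second column), two syntaxes
print(nested_list[0][1])  # 2
print(array_2d[0, 1])     # 2

# A whole column is one slice on the array; the list needs a loop
print(array_2d[:, 0].tolist())          # [1, 2, 3]
print([row[0] for row in nested_list])  # [1, 2, 3]
```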

### With pandas

In [23]:
import pandas as pd

In [24]:
%whos

Variable       Type        Data/Info
------------------------------------
Image          type        <class 'IPython.core.display.Image'>
array_2d       ndarray     3x3: 9 elems, type `int64`, 72 bytes
countdown      function    <function countdown at 0x1047cac00>
nested_list    list        n=3
np             module      <module 'numpy' from '/op<...>kages/numpy/__init__.py'>
numpy          module      <module 'numpy' from '/op<...>kages/numpy/__init__.py'>
pd             module      <module 'pandas' from '/o<...>ages/pandas/__init__.py'>
sample_array   ndarray     4: 4 elems, type `int64`, 32 bytes
sample_list    list        n=4
this           module      <module 'this' from '/opt<...>/lib/python3.13/this.py'>
time           module      <module 'time' (built-in)>


In [25]:
our_first_DF =\
(
    pd
    .DataFrame(array_2d,
               columns = ["A", "B", "C"],
               index = ["D", "E", "F"]
              )
)

In [26]:
our_first_DF

| | A | B | C |
|---|---|---|---|
| D | 1 | 2 | 3 |
| E | 2 | 3 | 4 |
| F | 3 | 4 | 5 |


### Three Elements of DataFrame
- `.values`: cell-entry values
- `.columns`: name of variables (vectors)
- `.index`: name of observations

In [27]:
our_first_DF.values

array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

In [28]:
our_first_DF.columns

Index(['A', 'B', 'C'], dtype='object')

In [29]:
our_first_DF.index

Index(['D', 'E', 'F'], dtype='object')
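> One way to convince yourself that these three elements fully describe a DataFrame is to take it apart and rebuild it. The sketch below does that and then uses the labels to address a single cell; the name `rebuilt` is mine.

```python
import numpy as np
import pandas as pd

our_first_DF = pd.DataFrame(np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]]),
                            columns=["A", "B", "C"],
                            index=["D", "E", "F"])

# Take the DataFrame apart ...
values = our_first_DF.values
columns = our_first_DF.columns
index = our_first_DF.index

# ... and put it back together from the three pieces
rebuilt = pd.DataFrame(values, columns=columns, index=index)
print(rebuilt.equals(our_first_DF))  # True

# Labels address a cell: observation "E", variable "C"
print(our_first_DF.loc["E", "C"])    # 4
```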

### <mark>The meaning of assignment in Python is `refers to`</mark>

In [30]:
our_first_DF.columns =\
    ["QF", "QF627", "QF999"]

In [31]:
our_first_DF

| | QF | QF627 | QF999 |
|---|---|---|---|
| D | 1 | 2 | 3 |
| E | 2 | 3 | 4 |
| F | 3 | 4 | 5 |
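> "Refers to" is easiest to see with two names and one list. In this plain-Python sketch (the names `a`, `b`, `c` are illustrative), assignment makes a second name point at the same object, while `.copy()` creates a new one.

```python
a = [1, 2, 3]
b = a          # b refers to the same list object as a
b.append(4)

print(a)       # [1, 2, 3, 4]
print(a is b)  # True

c = a.copy()   # c refers to a new, separate list
c.append(5)
print(a)       # [1, 2, 3, 4]
print(c)       # [1, 2, 3, 4, 5]
```

> In the same spirit, assigning a new list to `.columns` makes the DataFrame's column labels refer to the new names; the cell values are untouched.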


### To Wei Min

In [32]:
to_wei_min =\
(
    ["Wei Min", "Rocks", "Hear Hear"]
)
# Python is an indentation-sensitive language

## <a id = "p2"> 2. </a> <font color = "green"> The Importance of Data Visualization in Analytics </font> [back to table of contents](#top)

> Imagine you are working as a Quantitative Analyst at JPMorgan Chase in their fintech products department. In this scenario, you have received four datasets from four different fintech applications.

> Below is the context you are addressing: analyzing the relationship between page views and time spent by fintech product users (clients).

    Potential Variables in the Dataset

* `Page Views (Weekly)`: A simple usage metric: how many times were pages (or screens) within your application viewed in a given week? This can be a total count (e.g., 5,000 page views this week) or aggregated by user segment (e.g., free vs. paid users).

* `Time Spent (Weekly in Hours)`: The total amount of time users spent on all pages in your application during that week, for instance, “2,000 total minutes (about 33 hours) spent in-app by all users from Monday to Sunday.” You could also measure the average time spent per user, depending on your analysis goals.

> Studying the relationship between weekly page views and weekly time spent can provide several user insights:

    Deeper Engagement vs. Quick Visits

* If you see high page views and high total time spent, it might suggest deep engagement (users aren’t just clicking around; they’re genuinely spending time in the product).

* If page views are high but total time spent is relatively low, it might suggest “quick hits” of usage—lots of visits but short durations (perhaps tasks can be completed quickly, or users find what they need and leave).

    Signs of Usability Issues

* If time spent is unusually high for a small number of page views, it could indicate users are stuck, confused, or searching too long for information.

* Conversely, low time spent with low page views may indicate low adoption or a lack of interest in the application’s content.

    Trends Over Time

> Tracking weekly page views vs. weekly time spent can reveal trends:

* Are they moving together? (When page views go up, so does time spent.)
* Are they diverging? (Time spent stays flat while page views grow—maybe more quick bounces?)
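> The four patterns described above (deep engagement, quick visits, possible usability issue, low adoption) can be sketched as a simple two-way split. Everything in this block is an assumption for illustration: the weekly figures are made up, and cutting at the median is just one convenient threshold.

```python
from statistics import median

# Hypothetical weekly figures, for illustration only
weeks = {
    "week 1": {"page_views": 5000, "hours": 400},
    "week 2": {"page_views": 5200, "hours": 120},
    "week 3": {"page_views": 900,  "hours": 380},
    "week 4": {"page_views": 800,  "hours": 60},
}

# Assumed thresholds: the median of each metric across the weeks
views_cut = median(w["page_views"] for w in weeks.values())
hours_cut = median(w["hours"] for w in weeks.values())

def engagement_pattern(page_views, hours):
    """Label a week by whether each metric sits above its median."""
    high_views = page_views > views_cut
    high_hours = hours > hours_cut
    if high_views and high_hours:
        return "deep engagement"
    if high_views:
        return "quick visits"
    if high_hours:
        return "possible usability issue"
    return "low adoption"

patterns = {name: engagement_pattern(**w) for name, w in weeks.items()}
for name, label in patterns.items():
    print(name, "->", label)
```

> A label like this is only a starting point for a conversation with the product team; the plots later in this note show why you should look at the raw points as well.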

#### IMPORT

```python
(
    pd
    .read_csv("https://talktoroh.com/s/app_usage.csv")
)
```

In [33]:
data =\
(
    pd
    .read_csv("https://talktoroh.com/s/app_usage.csv")
)

In [34]:
data

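| | app1_page_views | app2_page_views | app3_page_views | app4_page_views | app1_time_spent | app2_time_spent | app3_time_spent | app4_time_spent |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 1 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 2 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 3 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 4 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 5 | 14 | 14 | 14 | 8 | 9.96 | 8.1 | 8.84 | 7.04 |
| 6 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 7 | 4 | 4 | 4 | 19 | 4.26 | 3.1 | 5.39 | 12.5 |
| 8 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 9 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |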


#### WRANGLE

> The variables are all set for your analysis.

#### MODEL

> Let's say that you ran a quick regression analysis.

> Model Specification (Least Squares; Linear Regression):

In [35]:
# %pip installs into the environment of the running kernel (the shell could not find `pip` via `!pip`)
%pip install statsmodels


In [36]:
import statsmodels.formula.api as smf

In [37]:
data.columns

Index(['app1_page_views', 'app2_page_views', 'app3_page_views',
       'app4_page_views', 'app1_time_spent', 'app2_time_spent',
       'app3_time_spent', 'app4_time_spent'],
      dtype='object')

### Model of Interest

$$
    {TimeSpent} = B_0 + B_1 \times {PageViews} + \epsilon
$$

In [38]:
our_first_ols_model =\
(
    # statsmodels
    # .formula
    # .api
    smf
    .ols(formula = "app1_time_spent ~ app1_page_views",
         data = data) # ordinary least squares: time spent regressed on page views
    .fit() # estimates the coefficients and returns the fitted results object
)
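> If you are curious what `.fit()` works out for a one-predictor model, the slope is the co-variation of x and y divided by the variation of x, and the intercept makes the line pass through the two means. The sketch below applies those textbook formulas with NumPy to the ten app 1 rows displayed above, typed in by hand. Because the summary output reports 11 observations, expect values near, but not identical to, the statsmodels estimates.

```python
import numpy as np

# The ten app1 rows shown in the data preview above (typed in by hand)
page_views = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7], dtype=float)
time_spent = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82])

x_bar, y_bar = page_views.mean(), time_spent.mean()

# Closed-form least-squares estimates for a single predictor
slope = (((page_views - x_bar) * (time_spent - y_bar)).sum()
         / ((page_views - x_bar) ** 2).sum())
intercept = y_bar - slope * x_bar

# Cross-check against NumPy's degree-1 least-squares polynomial fit
slope_np, intercept_np = np.polyfit(page_views, time_spent, deg=1)

print(round(float(slope), 4), round(float(intercept), 4))
print(round(float(slope_np), 4), round(float(intercept_np), 4))
```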

In [39]:
dir(our_first_ols_model.summary()
   )



['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__firstlineno__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__static_attributes__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_repr_html_',
 '_repr_latex_',
 'add_extra_txt',
 'add_table_2cols',
 'add_table_params',
 'as_csv',
 'as_html',
 'as_latex',
 'as_text',
 'extra_txt',
 'tables']

In [40]:
our_first_ols_model.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | app1_time_spent | R-squared: | 0.667 |
| Model: | OLS | Adj. R-squared: | 0.629 |
| Method: | Least Squares | F-statistic: | 17.99 |
| Date: | Wed, 04 Jun 2025 | Prob (F-statistic): | 0.00217 |
| Time: | 19:25:19 | Log-Likelihood: | -16.841 |
| No. Observations: | 11 | AIC: | 37.68 |
| Df Residuals: | 9 | BIC: | 38.48 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.0001 | 1.125 | 2.667 | 0.026 | 0.456 | 5.544 |
| app1_page_views | 0.5001 | 0.118 | 4.241 | 0.002 | 0.233 | 0.767 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.082 | Durbin-Watson: | 3.212 |
| Prob(Omnibus): | 0.96 | Jarque-Bera (JB): | 0.289 |
| Skew: | -0.122 | Prob(JB): | 0.865 |
| Kurtosis: | 2.244 | Cond. No. | 29.1 |


In [41]:
our_first_ols_model.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.0001 | 1.125 | 2.667 | 0.026 | 0.456 | 5.544 |
| app1_page_views | 0.5001 | 0.118 | 4.241 | 0.002 | 0.233 | 0.767 |


### APP 1

$$
    {TimeSpent} = 3.0001 + 0.5001 \times {PageViews} + \epsilon
$$

### APP 2

$$
    {TimeSpent} = 3.0009 + 0.5000 \times {PageViews} + \epsilon
$$

In [42]:
data.columns

Index(['app1_page_views', 'app2_page_views', 'app3_page_views',
       'app4_page_views', 'app1_time_spent', 'app2_time_spent',
       'app3_time_spent', 'app4_time_spent'],
      dtype='object')

In [43]:
model_for_app2 =\
(
    smf
    .ols(formula = "app2_time_spent ~ app2_page_views",
         data = data)
    .fit()
)

In [44]:
(
    model_for_app2
    .summary()
    .tables[1]
)



| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.0009 | 1.125 | 2.667 | 0.026 | 0.455 | 5.547 |
| app2_page_views | 0.5000 | 0.118 | 4.239 | 0.002 | 0.233 | 0.767 |


### VISUALIZE YOUR DATA

In [45]:
# %pip installs into the environment of the running kernel (the shell could not find `pip` via `!pip`)
%pip install lets-plot


In [46]:
from lets_plot import *

In [47]:
LetsPlot.setup_html()

In [48]:
%whos

Variable                     Type                        Data/Info
------------------------------------------------------------------
GGBunch                      type                        <class 'lets_plot.plot.plot.GGBunch'>
Image                        type                        <class 'IPython.core.display.Image'>
LetsPlot                     type                        <class 'lets_plot.LetsPlot'>
aes                          function                    <function aes at 0x163d3b420>
array_2d                     ndarray                     3x3: 9 elems, type `int64`, 72 bytes
arrow                        function                    <function arrow at 0x163dcefc0>
as_discrete                  function                    <function as_discrete at 0x163dcc220>
coord_cartesian              function                    <function coord_cartesian at 0x163db16c0>
coord_fixed                  function                    <function coord_fixed at 0x163db1760>
coord_flip                   fun

In [49]:
from lets_plot import *

LetsPlot.setup_html() # enables plot rendering inside the notebook

In [50]:
data.columns

Index(['app1_page_views', 'app2_page_views', 'app3_page_views',
       'app4_page_views', 'app1_time_spent', 'app2_time_spent',
       'app3_time_spent', 'app4_time_spent'],
      dtype='object')

In [51]:
ols_model_1 =\
(
    ggplot(data,
          aes(x = "app1_page_views",
              y = "app1_time_spent")
          ) 
    + geom_smooth()
)

In [52]:
ols_model_1.show()

In [53]:
ols_model_2 =\
(
    ggplot(data,
          aes(x = "app2_page_views",
              y = "app2_time_spent")
          ) 
    + geom_smooth(color = "blue")
)

In [54]:
ols_model_2

In [55]:
gggrid([ols_model_1, ols_model_2],
       ncol = 2)

In [56]:
help(geom_smooth)

Help on function geom_smooth in module lets_plot.plot.geom:

geom_smooth(
    mapping=None,
    *,
    data=None,
    stat=None,
    position=None,
    show_legend=None,
    inherit_aes=None,
    manual_key=None,
    sampling=None,
    tooltips=None,
    orientation=None,
    method=None,
    n=None,
    se=None,
    level=None,
    span=None,
    deg=None,
    seed=None,
    max_n=None,
    color_by=None,
    fill_by=None,
    **other_args
)
    Add a smoothed conditional mean.

    Parameters
    ----------
    mapping : `FeatureSpec`
        Set of aesthetic mappings created by `aes()` function.
        Aesthetic mappings describe the way that variables in the data are
        mapped to plot "aesthetics".
    data : dict or Pandas or Polars `DataFrame`
        The data to be displayed in this layer. If None, the default, the data
        is inherited from the plot data as specified in the call to ggplot.
    stat : str, default='smooth'
        The statistical transformation to use 

### To Michelle

In [57]:
regression_with_99_CI =\
(
    ggplot(data,
           aes(x = "app2_page_views",
               y = "app2_time_spent")
          )
    + geom_smooth(method = "lm", # linear model fit (`formula` is not among geom_smooth's listed parameters)
                  color = "red",
                  level = 0.99) # 99% confidence level
)

In [58]:
regression_with_95_CI =\
(
    ggplot(data,
           aes(x = "app2_page_views",
               y = "app2_time_spent")
          )
    + geom_smooth(method = "lm", # linear model fit (`formula` is not among geom_smooth's listed parameters)
                  color = "red",
                  level = 0.95) # 95% confidence level
)

In [59]:
gggrid([regression_with_99_CI 
        + labs(title = "99% CIs", 
               subtitle = "99% confidence intervals are wider than\n95% confidence intervals."),
        regression_with_95_CI 
        + labs(title = "95% CIs")
       ],
        ncol = 2 # Number of columns
      )
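> Why is the 99% band wider? An interval is roughly "estimate plus or minus critical value times standard error", and a higher confidence level needs a larger critical value. The sketch below uses a normal approximation from the standard library, which is an assumption on my part: with a sample this small, a t-distribution gives somewhat larger critical values. The standard error 0.118 is the slope's `std err` from the coefficient table earlier in this note; the helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

def critical_value(level):
    """Two-sided critical value under a normal approximation."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

std_err = 0.118  # slope standard error reported in the coefficient table

half_widths = {}
for level in (0.95, 0.99):
    half_widths[level] = critical_value(level) * std_err
    print(level, round(critical_value(level), 3), round(half_widths[level], 3))
```

> The ratio between the two half-widths does not depend on the standard error, so the 99% band is wider at every point along the line.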

### Learn how to learn

In [60]:
help(gggrid)

Help on function gggrid in module lets_plot.plot.gggrid_:

gggrid(
    plots: list,
    ncol: int = None,
    *,
    sharex: str = None,
    sharey: str = None,
    widths: list = None,
    heights: list = None,
    hspace: float = None,
    vspace: float = None,
    fit: bool = None,
    align: bool = None
) -> lets_plot.plot.subplots.SupPlotsSpec
    Combine several plots on one figure, organized in a regular grid.

    Parameters
    ----------
    plots : list
        A list where each element is a plot specification, a subplots specification, or None.
        Use value None to fill-in empty cells in grid.
    ncol : int
        Number of columns in grid.
        If not specified, shows plots horizontally, in one row.
    sharex, sharey : bool or str, default=False
        Controls sharing of axis limits between subplots in the grid.

        - 'all'/True - share limits between all subplots.
        - 'none'/False - do not share limits between subplots.
        - 'row' - share limi

In [61]:
gggrid([regression_with_99_CI,
        regression_with_95_CI],
        ncol = 1 # Number of columns (here you go—I told you in class I’d given you all the hints! 🙂)
      )

In [62]:
gggrid([ols_model_1 
        + geom_point(), 
        ols_model_2 
        + geom_point()
        + geom_smooth(method = "lm", # smoother --> model fit
                      deg = 2, # polynomial of degree 2 (quadratic)
                      color = "green")
       ],
       ncol = 2)

#### REPORT

> What would you report: Are user behaviors the same or different across the four applications?
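> Before you answer, it may help to see how similar the fitted lines are across all four applications. The sketch below fits a straight line per app with `np.polyfit` on the ten rows displayed in the data preview, typed in by hand (the full file has one more observation, so the numbers will not match the statsmodels tables exactly). The dictionary names are mine.

```python
import numpy as np

# The ten rows shown in the data preview, typed in by hand
page_views = {
    "app1": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7],
    "app2": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7],
    "app3": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7],
    "app4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8],
}
time_spent = {
    "app1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82],
    "app2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26],
    "app3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42],
    "app4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91],
}

fits = {}
for app in page_views:
    x = np.array(page_views[app], dtype=float)
    y = np.array(time_spent[app])
    slope, intercept = np.polyfit(x, y, deg=1)  # straight-line least squares
    fits[app] = {"slope": slope,
                 "intercept": intercept,
                 "corr": np.corrcoef(x, y)[0, 1]}

for app, fit in fits.items():
    print(app, {k: round(float(v), 3) for k, v in fit.items()})
```

> On these rows the four slopes all land close to 0.5, even though the scatter plots look nothing alike. That gap between the numbers and the pictures is the point of this section.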

### To Wei Min: geom_line() connects the dots

In [63]:
gggrid([ols_model_1 
        + geom_point()
        + geom_line(color = "red"), 
        ols_model_2 
        + geom_point()
        + geom_smooth(method = "lm", # smoother --> model fit
                      deg = 2, # polynomial of degree 2 (quadratic)
                      color = "green")
       ],
       ncol = 2)

## <a id = "p3"> 3. </a> <font color = "green"> First Look at the Grammar of Graphics with the `lets-plot` Library </font> [back to table of contents](#top)

* You will get a 360-degree view of how the principles of the Grammar of Graphics operate. The Grammar of Graphics is a unifying method of visualizing your data and can be implemented across various computational languages (a cross-language concept used for data visualization).

> Again, Grammar of Graphics is a theoretical framework for describing and building data visualizations, which has been implemented in multiple programming languages through various packages. The Grammar of Graphics is a versatile and powerful framework that transcends individual programming languages, making it a valuable tool for data visualization across various platforms.

* Python (plotnine, lets-plot): In Python, there are packages like plotnine and lets-plot that bring the Grammar of Graphics to the Python ecosystem.

* JavaScript (G2, Vega, and Vega-Lite): For web-based visualizations, Vega and Vega-Lite are declarative languages for creating, sharing, and exploring visualizations. They are inspired by the Grammar of Graphics and provide a high-level grammar for specifying visualizations.

* R (ggplot2): The best-known implementation of the Grammar of Graphics is the ggplot2 package in R, created by Hadley Wickham. It provides a powerful and flexible system for creating complex and multi-layered visualizations.

* Julia (Gadfly): Julia, a high-performance programming language for technical computing, has the Gadfly package, which is based on the Grammar of Graphics principles and offers a consistent approach to creating visualizations.

> Among the languages mentioned above, Python is one of the most common and accessible for day-to-day data visualization in organizations around the world, including both for-profit companies and government agencies. Python is particularly useful for individuals who do not have formal training in computer science but work with data for behavioral strategies and insights. Therefore, I will demonstrate use cases for Python throughout our lesson to help you see how the principles of data visualization can be applied in the field.

### A Gentle Reminder: Grammar of Graphics is a Layer-by-Layer Visualization Approach

In [64]:
from IPython.display import Image

Image(url="https://static1.squarespace.com/static/53f3eb3ce4b077de0318f4ea/t/66d6857e966cef1122cc340b/1725334912400/ggplot_layers.jpg", 
      width = 600)
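> The `+` you will type between layers is ordinary Python operator overloading. The toy sketch below is not how lets-plot is implemented; the `Layer` and `Plot` classes are hypothetical and exist only to show the idea that a plot is an ordered stack and each `+` adds one more layer on top.

```python
class Layer:
    """One building block of a plot: data, aesthetics, geometry, scale, facet, theme."""
    def __init__(self, kind, **settings):
        self.kind = kind
        self.settings = settings

class Plot:
    """A plot is an ordered stack of layers; `+` returns a new, longer stack."""
    def __init__(self, layers=()):
        self.layers = tuple(layers)

    def __add__(self, layer):
        return Plot(self.layers + (layer,))

    def describe(self):
        return [layer.kind for layer in self.layers]

plot = (
    Plot()
    + Layer("data", name="data_for_ggplot")
    + Layer("aesthetics", x="A", y="B")
    + Layer("geometry", shape="line")
    + Layer("geometry", shape="point")
)
print(plot.describe())  # ['data', 'aesthetics', 'geometry', 'geometry']
```

> Because the stack is ordered, a geometry added later is drawn on top of one added earlier, which is why the order of `geom_line()` and `geom_point()` matters in the cells that follow.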

### Step-by-Step Guide on Each Layer of Grammar of Graphics in lets-plot

#### Generate Synthetic DataFrame

In [65]:
data_for_ggplot =\
(
    pd
    .DataFrame(
        {"A": [2,3,4,5,6,7],
         "B": [3,4,5,6,7,8],
         "banks": ["DBS", "UOB", "DBS", "UOB", "DBS", "UOB"]
        }
    )
)

In [66]:
data_for_ggplot

| | A | B | banks |
|---|---|---|---|
| 0 | 2 | 3 | DBS |
| 1 | 3 | 4 | UOB |
| 2 | 4 | 5 | DBS |
| 3 | 5 | 6 | UOB |
| 4 | 6 | 7 | DBS |
| 5 | 7 | 8 | UOB |


In [67]:
(
    ggplot(data_for_ggplot, # would you draw grammar of graphics plot? with data named data_for_ggplot
           aes(x = "A",
               y = "B")
          )
    + geom_point(aes(color = "banks",
                     shape = "banks"),
                 size = 5
                ) # would you add data points with geometric feature of points? with color and shape mapped to banks?
)

In [68]:
(
    ggplot(data_for_ggplot, # would you draw grammar of graphics plot? with data named data_for_ggplot
           aes(x = "A",
               y = "B")
          )
    + geom_point(aes(color = "banks",
                     shape = "banks"),
                 size = 5
                ) # would you add data points with geometric feature of points? with color and shape mapped to banks?
    + geom_line(color = "orange")
)

In [89]:
(
    ggplot(data_for_ggplot, # would you draw grammar of graphics plot? with data named data_for_ggplot
           aes(x = "A",
               y = "B")
          )
    + geom_point(aes(color = "banks",
                     shape = "banks"),
                 size = 5
                ) # would you add data points with geometric feature of points? with color and shape mapped to banks?
    + scale_x_continuous()
)

In [90]:
(
    ggplot(data_for_ggplot, # would you draw grammar of graphics plot? with data named data_for_ggplot
           aes(x = "A",
               y = "B")
          )
    + geom_line(color = "orange")
    + geom_point(aes(color = "banks",
                     shape = "banks"),
                 size = 5
                ) # would you add data points with geometric feature of points? with color and shape mapped to banks?
    + scale_x_continuous(limits = [0, 10] # would you set x axis scale that runs from zero through 10
                         )
    + scale_y_continuous(limits = [0, 10] # would you set y axis scale that runs from zero through 10
                         )
   #  + geom_line(color = "orange") # The order matters here!
    + facet_grid(x = "banks") # column-wise faceting
    + theme(legend_position = "none")
    
)

In [91]:
(
    ggplot(data_for_ggplot, # would you draw grammar of graphics plot? with data named data_for_ggplot
           aes(x = "A",
               y = "B")
          )
    + geom_line(color = "orange")
    + geom_point(aes(color = "banks",
                     shape = "banks"),
                 size = 5
                ) # would you add data points with geometric feature of points? with color and shape mapped to banks?
    + scale_x_continuous(limits = [0, 10] # would you set x axis scale that runs from zero through 10
                         )
    + scale_y_continuous(limits = [0, 10] # would you set y axis scale that runs from zero through 10
                         )
   #  + geom_line(color = "orange")
    + facet_grid(y = "banks")
    + theme(legend_position = "none")
    
)

#### Can we create interactive Grammar of Graphics visualizations? Yes, we can!

> In Python, you can use the lets-plot library to build interactive visualizations following the Grammar of Graphics framework. 🚀

In [72]:
data_abc =\
(
    pd
    .DataFrame(
        {
    "a": [1, 2, 4, 6, 3, 2],
    "b": [2, 3, 4, 7, 5, 3],
    "c": ["Male", "Male", "Male", "Female", "Female", "Female"]
        }
    )
)

In [73]:
data_abc

| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | Male |
| 1 | 2 | 3 | Male |
| 2 | 4 | 4 | Male |
| 3 | 6 | 7 | Female |
| 4 | 3 | 5 | Female |
| 5 | 2 | 3 | Female |


### Dependencies

In [74]:
# !pip install lets_plot

In [75]:
import pandas as pd

from lets_plot import *

In [76]:
LetsPlot.setup_html()

## IMPORT

In [77]:
df =\
(
    pd
    .read_csv("https://talktoroh.com/s/netflix_show.csv")
)

## WRANGLE

In [78]:
df =\
(
    df
    [df["release_year"] >= 2000]
)

In [79]:
movie_ratins_ordered = ["NR", "UR", "G", "PG", "PG-13", "R", "NC-17"]
tv_ratins_ordered = ["NR", "UR", "TV-Y", "TV-Y7", "TV-Y7-FV", "TV-G", "TV-PG", "TV-14", "TV-MA"]

In [80]:
movies_df =\
(
    df
    [(df.type == "Movie")
     &
     (df.rating.isin(movie_ratins_ordered)
     )
    ]
)
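> The filter above combines two boolean masks with `&`. Here is the same pattern on a tiny table so you can see each mask on its own. The four rows are hypothetical, made up for illustration; only the list of movie ratings is taken from the cell above.

```python
import pandas as pd

# Hypothetical rows, for illustration only
shows = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "rating": ["PG", "TV-MA", "TV-14", "R"],
})
movie_ratings = ["NR", "UR", "G", "PG", "PG-13", "R", "NC-17"]

is_movie = shows["type"] == "Movie"                      # one True/False per row
has_movie_rating = shows["rating"].isin(movie_ratings)   # one True/False per row

print(is_movie.tolist())                       # [True, False, True, True]
print(has_movie_rating.tolist())               # [True, False, False, True]
print((is_movie & has_movie_rating).tolist())  # [True, False, False, True]

movies = shows[is_movie & has_movie_rating]    # keep rows where both are True
print(movies["title"].tolist())                # ['A', 'D']
```

> When you write the conditions inline, keep each one in its own parentheses: `&` binds more tightly than `==`, so leaving them out changes the meaning.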

## VISUALIZE

In [81]:
movies_plot =\
(
    ggplot(movies_df, 
           aes(x = "rating", 
               fill = "..count..")
          ) +
    geom_bar() +
    scale_x_discrete(breaks = movie_ratins_ordered) + 
    scale_y_log10(limits = [0, 1100]
                 ) +
    scale_fill_viridis(name = "Movies count", 
                       limits = [0, 1100], 
                       option = "plasma") +
    ggtitle("Movies count by rating") +
    theme(axis_text = element_text(size = 8, 
                                   angle = 0.0)
         )
)

In [82]:
tv_df =\
(
    df
    [
    (df.type == "TV Show")
    &
    (df.rating.isin(tv_ratins_ordered)
    )
    ]
)

In [83]:
tv_plot =\
(
    ggplot(tv_df, 
           aes(x = "rating", 
               fill = "..count..")
          ) +
    geom_bar() +
    scale_x_discrete(breaks=tv_ratins_ordered) + 
    scale_y_log10(limits=[0, 1100]
                 ) +
    scale_fill_viridis(name = "TV-shows count", 
                       limits = [0, 1100], 
                       option = "plasma") +
    ggtitle("TV-shows count by rating") +
    theme(axis_text = element_text(size = 8, 
                                   angle = 0.0)
         )
)

### How can we display two plots side by side?

In [84]:
gggrid([movies_plot, tv_plot],
       ncol = 2
      )

## Do you remember the sneaky average that you should keep in mind in data literacy?

In [85]:
movies_df =\
(
    df
    [(df["type"] == "Movie")
     &
     (df["genres"] != "Movies")
    ]
)

In [86]:
by_genre_df =\
(
    pd
    .melt(
          movies_df["genres"].str.split(", ", expand = True)
          .assign(duration = movies_df["duration"]
                  ),
          id_vars = ["duration"], 
          value_vars = [0, 1, 2], 
          value_name = "genre"
         )
    [["genre", "duration"]]
     .dropna(subset = ["genre"]
            )
)
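> The reshaping step above does two things: it splits the comma-separated `genres` text into separate columns, then melts those columns into one long `genre` column so that a movie with two genres contributes two rows. The sketch below repeats the pattern on three hypothetical movies (made-up values, at most two genres each).

```python
import pandas as pd

# Hypothetical movies, for illustration only
movies = pd.DataFrame({
    "genres": ["Dramas, Comedies", "Documentaries", "Dramas, Thrillers"],
    "duration": [100, 80, 120],
})

# Wide: one column per genre position (0, 1), plus the duration
wide = (
    movies["genres"]
    .str.split(", ", expand=True)
    .assign(duration=movies["duration"])
)

# Long: one row per (movie, genre) pair; missing genre slots are dropped
long = (
    pd.melt(wide, id_vars=["duration"], value_vars=[0, 1], value_name="genre")
    [["genre", "duration"]]
    .dropna(subset=["genre"])
)

print(len(long))              # 5
print(sorted(long["genre"]))  # ['Comedies', 'Documentaries', 'Dramas', 'Dramas', 'Thrillers']
```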

In [87]:
by_genre_df =\
(
    by_genre_df
    .assign(duration_mean = by_genre_df
                            .groupby("genre")["duration"]
                            .transform("mean") # each row gets its genre's mean duration
           )
    .sort_values(by = "duration_mean", 
                 ascending = False)
)
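> One way to read the "sneaky average" is that a single mean can hide the shape of the distribution behind it. The sketch below uses hypothetical durations in which one long title pulls a genre's mean well above its median, and then shows how `transform("mean")` attaches each group's mean to every row of that group.

```python
import pandas as pd

# Hypothetical durations in minutes, for illustration only
durations = pd.DataFrame({
    "genre": ["Dramas", "Dramas", "Dramas", "Comedies", "Comedies"],
    "duration": [90, 100, 230, 95, 105],
})

# Mean vs. median per genre: the 230-minute title moves only the mean
summary = durations.groupby("genre")["duration"].agg(["mean", "median"])
print(summary)

# transform() broadcasts each group's mean back onto its rows
durations["duration_mean"] = (
    durations.groupby("genre")["duration"].transform("mean")
)
print(durations["duration_mean"].tolist())  # [140.0, 140.0, 140.0, 100.0, 100.0]
```

> This is why the ridge plot that follows draws the whole distribution per genre and uses the mean only for the fill color.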



In [88]:
(
    ggplot(by_genre_df, 
           aes(x = "duration", 
               y = "genre")
          ) 
    + geom_area_ridges(aes(group = "genre", 
                             fill = "duration_mean"),
                         scale = 4, 
                         sampling = sampling_pick(by_genre_df.shape[0]
                                                 ),
                         tooltips = layer_tooltips()
                                     .title("@genre")
                                     .line("duration|@duration")
                        ) 
    + scale_x_log10() 
    + scale_fill_viridis(name = "average duration", 
                           option = "plasma") 
    + ggsize(800, 600) 
    + ggtitle("Average Netflix Movie Duration") 
    + theme(axis_line_x = "blank")
)

> `Thank you for working with the Python script, Team 👍`