### Implementing CDFs

Copyright 2019 Allen Downey

BSD 3-clause license: https://opensource.org/licenses/BSD-3-Clause

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd

import seaborn as sns
sns.set_style('white')

import matplotlib.pyplot as plt

In [2]:
import inspect

def psource(obj):
    """Prints the source code for a given object.

    obj: function or method object
    """
    print(inspect.getsource(obj))

The `Cdf` class inherits from `pd.Series`.  Its `__init__` method mostly defers to the `Series` constructor, but it includes a workaround for behavior I consider a bug: `Series()` and `Series([])` yield different results.

In [3]:
from distribution import Cdf

psource(Cdf.__init__)

    def __init__(self, *args, **kwargs):
        """Initialize a Cdf.

        Note: this cleans up a weird Series behavior, which is
        that Series() and Series([]) yield different results.
        See: https://github.com/pandas-dev/pandas/issues/16737
        """
        if args:
            super().__init__(*args, **kwargs)
        else:
            underride(kwargs, dtype=np.float64)
            super().__init__([], **kwargs)
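The `underride` helper is not shown in this notebook. Here is a minimal sketch of what it presumably does, assuming it fills in keyword defaults without overwriting anything the caller supplied. The definition below is an assumption for illustration, not the library's source.

```python
import numpy as np
import pandas as pd


def underride(d, **options):
    """Add key-value pairs to d only for keys d does not already have.

    Assumed behavior; the real helper in `distribution` may differ.
    """
    for key, val in options.items():
        d.setdefault(key, val)
    return d


# With no dtype supplied, the float64 default is filled in.
kwargs = underride({}, dtype=np.float64)
empty = pd.Series([], **kwargs)

# A dtype supplied by the caller is left alone.
kwargs_int = underride(dict(dtype=np.int64), dtype=np.float64)
empty_int = pd.Series([], **kwargs_int)
```

With this behavior, an empty `Cdf()` would get a float dtype by default, while a caller who passes `dtype` explicitly keeps that choice.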



### Working with Cdfs

Create a Cdf object to represent a six-sided die.

In [4]:
d6 = Cdf()

A Cdf is a map from possible outcomes to their cumulative probabilities: the entry for `x` is the probability of getting a value less than or equal to `x`.

In [5]:
for x in [1,2,3,4,5,6]:
    d6[x] = 1

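A Cdf is normalized when its last cumulative probability is 1.  Assigning 1 to every outcome already satisfies that, but it is not the CDF of a fair die: read as cumulative probabilities, these values put all of the probability on the outcome 1.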

In [6]:
d6

|   | probs |
|---|-------|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
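For comparison, here is a sketch of the CDF of a fair six-sided die in plain pandas, without the `Cdf` class. It assumes each outcome has probability 1/6, so the cumulative probability of outcome `k` is `k / 6`. The name `fair_die_cdf` is made up for this example.

```python
import numpy as np
import pandas as pd

outcomes = np.arange(1, 7)

# Cumulative counts 1, 2, ..., 6 divided by the total give P(X <= x).
fair_die_cdf = pd.Series(outcomes / 6, index=outcomes, name="probs")
```

The values increase from 1/6 to 1, which is what distinguishes a cumulative distribution from a table of per-outcome probabilities.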


`normalize` divides through by the last cumulative probability, `self.ps[-1]`, so that the final value becomes 1.  The return value is that last value before normalizing.

In [7]:
psource(Cdf.normalize)

    def normalize(self):
        """Make the probabilities add up to 1 (modifies self).

        returns: normalizing constant
        """
        total = self.ps[-1]
        self /= total
        return total
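The same idea can be sketched on a plain `Series`, assuming the values are cumulative and sorted by quantity. Unlike the method above, this hypothetical helper `normalize_cdf` returns a new object rather than modifying its argument.

```python
import pandas as pd


def normalize_cdf(cdf):
    """Return (normalized copy, normalizing constant).

    Hypothetical helper: divides by the last cumulative value.
    """
    total = cdf.iloc[-1]
    return cdf / total, total


# Cumulative counts for a fair die, not yet normalized.
unnormalized = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                         index=[1, 2, 3, 4, 5, 6])
normalized, total = normalize_cdf(unnormalized)
```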



In [8]:
d6.normalize()

1

Now the Cdf is normalized.

In [9]:
d6

|   | probs |
|---|-------|
| 1 | 1.0 |
| 2 | 1.0 |
| 3 | 1.0 |
| 4 | 1.0 |
| 5 | 1.0 |
| 6 | 1.0 |


`Cdf` provides `_repr_html_`, so it looks good when displayed in a notebook.

In [10]:
psource(Cdf._repr_html_)

    def _repr_html_(self):
        """Returns an HTML representation of the series.

        Mostly used for Jupyter notebooks.
        """
        df = pd.DataFrame(dict(probs=self))
        return df._repr_html_()



We can also compute the mean.  `mean` converts the Cdf to a Pmf first, and the result is only correct if the Cdf is normalized.

In [11]:
psource(Cdf.mean)

    def mean(self):
        """Expected value.

        returns: float
        """
        return self.make_pmf().mean()



In [12]:
d6.mean()

1.0
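To see where this value comes from, here is a sketch of the computation with NumPy alone. It assumes `make_pmf` recovers per-outcome probabilities by differencing the cumulative ones, which is the standard relationship between a CDF and a PMF but is not confirmed by the source shown here.

```python
import numpy as np

qs = np.arange(1, 7)

# The Cdf built above: every cumulative probability is 1.
ps_all_ones = np.ones(6)
pmf_all_ones = np.diff(ps_all_ones, prepend=0)   # all mass lands on outcome 1
mean_all_ones = np.sum(qs * pmf_all_ones)

# A fair die, with cumulative probabilities k / 6.
ps_fair = qs / 6
pmf_fair = np.diff(ps_fair, prepend=0)           # 1/6 for each outcome
mean_fair = np.sum(qs * pmf_fair)
```

Under that assumption, the all-ones Cdf has mean 1, matching the output above, while a fair die has mean 3.5.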

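`choice` chooses random values from the Cdf.  Note that the last line of the source passes `*kwargs` rather than `**kwargs`, which unpacks the keyword names as positional arguments; that is the likely cause of the `TypeError` in the call below.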

In [13]:
psource(Cdf.choice)

    def choice(self, *args, **kwargs):
        """Makes a random sample.

        Uses the probabilities as weights unless `p` is provided.

        args: same as np.random.choice
        options: same as np.random.choice

        returns: NumPy array
        """
        # TODO: Make this more efficient by implementing the inverse CDF method.
        pmf = self.make_pmf()
        return pmf.choice(*args, *kwargs)



In [14]:
d6.choice(size=10)

TypeError: cannot perform reduce with flexible type
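The TODO in the source mentions the inverse CDF method. A minimal sketch of that technique follows, assuming `qs` is sorted and `ps` holds normalized cumulative probabilities. The function `sample_from_cdf` is hypothetical and not part of the `distribution` module.

```python
import numpy as np


def sample_from_cdf(qs, ps, size, rng=None):
    """Draw `size` values by inverting the CDF (hypothetical helper).

    qs: sorted array of quantities
    ps: cumulative probabilities, with ps[-1] == 1
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(size)                        # uniform in [0, 1)
    # Index of the first cumulative probability strictly greater than u.
    idx = np.searchsorted(ps, u, side="right")
    return qs[idx]


qs = np.arange(1, 7)
ps = qs / 6
sample = sample_from_cdf(qs, ps, size=1000, rng=np.random.default_rng(17))
```

Because `searchsorted` is a binary search, each draw costs logarithmic time in the number of quantities, and no intermediate Pmf is needed.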

`bar` plots the Cdf as a bar chart.

In [None]:
psource(Cdf.bar)

In [None]:
def decorate_dice(title):
    """Labels the axes.
    
    title: string
    """
    plt.xlabel('Outcome')
    plt.ylabel('Cdf')
    plt.title(title)

In [None]:
d6.bar()
decorate_dice('One die')

`Cdf` provides `__add__`, which computes the distribution of the sum.

In [None]:
psource(Cdf.__add__)

In [None]:
from distribution import Cdf_add

psource(Cdf_add)
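As an illustration of what "the distribution of the sum" means, here is one way to compute it with NumPy. This sketch assumes the two variables are independent and that the quantities are consecutive integers. It is not necessarily how `Cdf_add` works.

```python
import numpy as np
import pandas as pd

qs = np.arange(1, 7)
cdf_ps = qs / 6                           # CDF of a fair die

# Difference the CDF to get per-outcome probabilities.
pmf_ps = np.diff(cdf_ps, prepend=0)

# The PMF of a sum of independent variables is the convolution of their PMFs.
sum_pmf = np.convolve(pmf_ps, pmf_ps)
sum_qs = np.arange(2 * qs[0], 2 * qs[-1] + 1)   # possible sums 2..12

# Accumulate to get back to a CDF.
sum_cdf = pd.Series(np.cumsum(sum_pmf), index=sum_qs)
```

For two fair dice this gives a cumulative probability of 1/36 at 2 and 21/36 at 7.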

Here's the distribution of the sum of two dice.

In [None]:
twice = d6 + d6
twice

In [None]:
twice = d6 + d6
twice.bar()
decorate_dice('Two dice')
twice.mean()

`Cdf` overrides `__getitem__` to return 0 for values that are not in the distribution.

In [None]:
psource(Cdf.__getitem__)

In [None]:
twice[2]

In [None]:
twice[12]

In [None]:
twice[1]
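The lookup behavior described above can be imitated on a plain `Series` with `Series.get`, which takes a default for missing labels. The helper name `lookup` is made up for this sketch, and the actual `__getitem__` override may be implemented differently.

```python
import pandas as pd


def lookup(series, x, default=0):
    """Return series[x] if x is in the index, else `default`."""
    return series.get(x, default)


s = pd.Series([0.2, 0.6, 0.8, 1.0], index=[1, 2, 3, 5])

present = lookup(s, 2)   # label 2 is in the index
missing = lookup(s, 4)   # label 4 is not, so the default is returned
```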

`Cdf` objects are mutable, but after you modify one, the result is generally not normalized.

In [None]:
twice = d6 + d6
twice[2] = 0
twice[3] = 0
twice.sum()

In [None]:
twice.normalize()
twice.sum()

### Make Cdf from sequence

The following function makes a `Cdf` object from a sequence of values.

In [None]:
psource(Cdf.from_seq)
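For readers without the `distribution` module, here is a rough sketch of one way to build a CDF from a sequence: count each unique value, accumulate the counts, and divide by the number of values. The function `cdf_from_seq` is hypothetical and returns a plain `Series`, and `Cdf.from_seq` may work differently.

```python
import numpy as np
import pandas as pd


def cdf_from_seq(seq):
    """Make a CDF (as a plain Series) from a sequence of values."""
    # np.unique returns the sorted unique values and how often each occurs.
    qs, counts = np.unique(seq, return_counts=True)
    ps = np.cumsum(counts) / len(seq)
    return pd.Series(ps, index=qs)


example = cdf_from_seq([1, 2, 2, 3, 5])
```

For the sequence `[1, 2, 2, 3, 5]`, the quantities are 1, 2, 3, 5 and the cumulative probabilities are 0.2, 0.6, 0.8, 1.0.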

We'll use `Cdf.from_seq` to create a Cdf from a short sequence.

In [None]:
cdf = Cdf.from_seq([1, 2, 2, 3, 5])
cdf

`Cdf` provides properties, `qs` and `ps`, to access the quantities and probabilities as NumPy arrays.

In [None]:
cdf.qs

In [None]:
cdf.ps

Because a `Cdf` is a `Series`, you can initialize it with any type the `Series` constructor can handle.

In [None]:
cdf = Cdf.from_seq([1, 2, 2, 3, 5])
cdf2 = Cdf(dict(zip(cdf.qs, cdf.ps)))
cdf2

In [None]:
cdf = Cdf.from_seq([1, 2, 2, 3, 5])
cdf3 = Cdf(pd.Series(cdf.ps, index=cdf.qs))
cdf3

However, you have to be careful about sharing.  In this example, `cdf` and `cdf3` share the same arrays, so modifying one changes the other.

In [None]:
cdf3[1] = np.pi
cdf[1]

To avoid sharing, pass `copy=True`, which gives the new object its own copy of the data.

In [None]:
cdf = Cdf.from_seq([1, 2, 2, 3, 5])
cdf4 = Cdf(pd.Series(cdf.ps, index=cdf.qs), copy=True)
cdf4

In [None]:
cdf4[1] = np.pi
cdf[1]
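The sharing hazard is not specific to `Cdf`. The sketch below shows the same effect with a NumPy view, and shows that `pd.Series(..., copy=True)` produces independent data. The array names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

ps = np.array([0.2, 0.6, 0.8, 1.0])
qs = np.array([1, 2, 3, 5])

# A view refers to the same memory, so writes show up in the original.
view = ps.view()
shares = np.shares_memory(ps, view)
view[0] = np.pi          # ps[0] is now pi as well

# With copy=True the Series owns its data, so the original is untouched.
independent = pd.Series(ps, index=qs, copy=True)
independent.loc[1] = 0.0
```

After this runs, `ps[0]` holds `np.pi` from the write through the view, while the write to `independent` has no effect on `ps`.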