Here is a common network example about a sidewalk.  Maybe it will inform a package-delivery agent as to the right speed or tires to use on its delivery route!

The network tells us that the variable **Season** directly influences both **Sprinkler** and **Rain**, that they both in turn influence **Wet**, which in turn influences **Slippery**.

A data set of historical observations of these variables is in the file `slippery.csv`.  In this file, Season is coded as 0 to 3 (Winter, Spring, Summer, Fall) and the other variables are binary (0 for false, 1 for true).
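
For readable output, the integer Season codes can be mapped back to names. This is a minimal sketch on a small hand-made frame standing in for `slippery.csv`; the names `SEASON_NAMES` and `toy` are illustrative and not part of the notebook.

```python
import pandas

# Season coding stated above: 0 to 3 = Winter, Spring, Summer, Fall
SEASON_NAMES = {0: "Winter", 1: "Spring", 2: "Summer", 3: "Fall"}

# Hypothetical rows standing in for slippery.csv
toy = pandas.DataFrame({"Season": [3, 0, 2, 1], "Wet": [0, 0, 1, 1]})

# Add a readable label column and leave the numeric column untouched
toy["SeasonName"] = toy["Season"].map(SEASON_NAMES)
print(toy)
```

The numeric column is kept because the probability tables later in the exercise are indexed by the integer codes.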

First determine the parameters you need to build this network.

In [1]:
# Read the file into a data frame and look at the first few rows

# Import the module itself so that the pandas.* calls below resolve
import pandas
df = pandas.read_csv("slippery.csv")

df.head()

,Season,Sprinkler,Rain,Wet,Slippery
0,3,1,0,0,0
1,0,0,0,0,0
2,2,1,0,1,1
3,3,0,1,1,1
4,3,1,0,1,0


In [10]:
df.columns

Index(['Season', 'Sprinkler', 'Rain', 'Wet', 'Slippery'], dtype='object')

In [3]:
pandas.crosstab(df.Season, df.Wet)

Wet,0,1
Season,,
0,9183,3440
1,8810,3578
2,3236,9189
3,5714,6850


In [4]:
pandas.crosstab(df.Wet, [df.Sprinkler, df.Rain])

Sprinkler,0,0,1,1
Rain,0,1,0,1
Wet,,,,
0,18957,3003,3956,1027
1,14,9078,3892,10073


In [7]:
# Values in the Season column and count of each value
df.Season.value_counts().sort_index()

0    12623
1    12388
2    12425
3    12564
Name: Season, dtype: int64

In [8]:
#  Now it's easy to get the prior probability on Season
(df.Season.value_counts() / len(df.Season)).sort_index()

0    0.25246
1    0.24776
2    0.24850
3    0.25128
Name: Season, dtype: float64

In [9]:
##  Probabilities and Conditional Probabilities
##  Counting:
##      Example 1:  Count the number of observations that are in the summer

len(df[df.Season == 2])


12425

In [22]:
##  For variables that are binary, to count the number of observations where a 
##  variable is true, just use sum (sum the 1 values gives you the count)
#
#  Number of observations where the sidewalk is wet
print(df.Wet.sum())
# Fraction of the observations where the sidewalk is wet
print(df.Wet.sum() / len(df.Wet))

23057
0.46114


In [17]:
##  To get P(Slippery | Wet = 0) and P(Slippery | Wet = 1)

## The general idea is to first restrict the dataframe to rows where the conditioning 
##  expression is true.  For example, the rows where Wet == 0.  This is a data frame with the 
##  same columns but a subset of the rows

wet0 = df[df.Wet == 0]
print(wet0.head())

wet1 = df[df.Wet == 1]
print(wet1.head())

    Season  Sprinkler  Rain  Wet  Slippery
0        3          1     0    0         0
1        0          0     0    0         0
5        2          0     0    0         0
9        1          0     0    0         0
12       0          0     1    0         0
   Season  Sprinkler  Rain  Wet  Slippery
2       2          1     0    1         1
3       3          0     1    1         1
4       3          1     0    1         0
6       2          1     1    1         0
7       0          0     1    1         1


In [20]:
##  Now to get P(Slippery | Wet = 0), take the fraction of records with Slippery == 1 in the restricted dataframe
## This is the fraction of observations that are slippery when the pavement is not wet, which we expect to be low
print(wet0.Slippery.sum()/len(wet0.Slippery))


0.00883346323720447


In [21]:
###  P(Slippery | Wet = 1) should be higher
print(wet1.Slippery.sum()/len(wet1.Slippery))

0.7474953376414971
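
Both conditionals above can also be computed in one step, because the mean of a 0/1 column is the fraction of ones. This is a sketch on a tiny hand-made frame; the eight rows are invented for illustration and are not taken from `slippery.csv`.

```python
import pandas

# Invented rows for illustration; the real data lives in slippery.csv
toy = pandas.DataFrame({
    "Wet":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Slippery": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Group by the conditioning variable, then average the binary target.
# Entry w of the result estimates P(Slippery = 1 | Wet = w).
p_slippery_given_wet = toy.groupby("Wet")["Slippery"].mean()
print(p_slippery_given_wet)
```

On the full data frame the equivalent call would be `df.groupby("Wet")["Slippery"].mean()`.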


#### Gather the Model Parameters

The network structure tells you the probabilities you need:  a distribution over values for Season, a conditional probability table for Rain that depends on the value of Season, and so on.   Collect these values, either by printing them or by storing them in variables.
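
One way to collect a conditional probability table is a row-normalized crosstab. The sketch below estimates P(Rain | Season) from a small invented frame and flattens it into rows of the form `[parent value, child value, probability]`. That row layout is an assumption about what the table constructor expects, so check it against the example networks before relying on it.

```python
import pandas

# Invented rows for illustration; substitute the frame read from slippery.csv
toy = pandas.DataFrame({
    "Season": [0, 0, 0, 0, 1, 1, 1, 1],
    "Rain":   [1, 0, 0, 0, 1, 1, 0, 0],
})

# normalize="index" divides each row by its total, so row s holds
# the estimated distribution P(Rain | Season = s)
cpt = pandas.crosstab(toy["Season"], toy["Rain"], normalize="index")

# Flatten to one row per (parent value, child value) pair
rows = [[int(s), int(r), float(cpt.loc[s, r])]
        for s in cpt.index for r in cpt.columns]
print(rows)
```

For a node with two parents, such as Wet, passing a list of two columns as the first crosstab argument gives one row per parent combination.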

#### Build the Network

Using the example networks as a guide, build the distributions, the nodes, then the model.
You have succeeded when **model.bake()** runs without error.

**Hint:**  It is very easy to make small errors as you go, and if you build the whole model before testing, you will likely get an obscure error message and not know where to look.

Start small and build incrementally, each time building the model and looking at its **proba** distribution to verify your inputs.

* Start with just Season and its unconditional distribution.   Build a model with only that node and no arcs.
* Once that works, add Sprinkler with its conditional probability table depending on Season.
* Then you can add Rain, which should look exactly the same except for the values in the probability table.
* Then add Wet, which depends on both Rain and Sprinkler.  Think first about what its probability table should look like.  Draw out the template of the probability table arrays before putting in actual values.  Be careful about 0 and 1.   You will tend to write your 0 entry before your 1 entry, but you also tend to think of True before False.
* After Wet works, Slippery is easy, and you're done!
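
The advice to draw out the table template first can be sketched in plain Python. The helpers `cpt_template` and `rows_sum_to_one` are hypothetical and not part of any library. They assume binary variables and rows laid out as parent values, then the child value, then the probability.

```python
from itertools import product

def cpt_template(n_parents):
    """Rows [parent_1, ..., parent_n, child, None] for binary variables."""
    return [list(values) + [None]
            for values in product([0, 1], repeat=n_parents + 1)]

def rows_sum_to_one(rows, tol=1e-9):
    """True if, for every parent combination, both child values are
    present and their probabilities sum to 1."""
    totals, counts = {}, {}
    for row in rows:
        key = tuple(row[:-2])  # parent values only
        totals[key] = totals.get(key, 0.0) + row[-1]
        counts[key] = counts.get(key, 0) + 1
    return all(counts[k] == 2 and abs(totals[k] - 1.0) <= tol for k in totals)

# Wet has two binary parents (Sprinkler, Rain): 2 * 2 * 2 = 8 rows to fill in
template = cpt_template(2)
for row in template:
    print(row)
```

The check catches missing rows and probabilities that do not sum to 1. It cannot catch a 0/1 swap, so each filled-in row still needs a read-through.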

In [None]:
#  Network code here, ending in model.bake()

### Answer some questions

* Run the predict_proba method on the network.  What is it telling you?   Are those numbers plausible?  Useful?
* Compare the difference in the probability of **Slippery** based on the season being Summer rather than Winter.  Is it what you were expecting?  Why or why not?
* For the fixed **Season** value Spring, suppose you know the sprinklers are running but it's not raining.  What is the joint probability distribution over the values of **Wet** and **Slippery**?
* Suppose you know for sure that **Wet** is true.  What is the probability of **Slippery**?  Now fix the value of **Rain** to true.  Does that change the probability of **Slippery**?  Why or why not?


## Delete Below Here

In [37]:
## Build the network
from pomegranate import *

#  Season, Sprinkler, Rain, Wet, Slippery

seasondist = DiscreteDistribution({0: .25, 1: .25, 2: .25, 3: .25})

sprinklerdist = ConditionalProbabilityTable(
    [[0, 1, .01], [0, 0, .99],
     [1, 1, .25], [1, 0, .75],
     [2, 1, .75], [2, 0, .25],
     [3, 1, .50], [3, 0, .50]], [seasondist])

raindist = ConditionalProbabilityTable(
    [[0, 1, .35], [0, 0, .65],
     [1, 1, .25], [1, 0, .75],
     [2, 1, .05], [2, 0, .95],
     [3, 1, .50], [3, 0, .50]], [seasondist])

wetdist = ConditionalProbabilityTable(
    [[0, 0, 1, .001], [0, 0, 0, .999],
     [0, 1, 1, .75],  [0, 1, 0, .25],
     [1, 0, 1, .50],  [1, 0, 0, .50],
     [1, 1, 1, .90],  [1, 1, 0, .10]], [sprinklerdist, raindist])

slipperydist = ConditionalProbabilityTable(
    [[0, 1, .01], [0, 0, .99],
     [1, 1, .75], [1, 0, .25]], [wetdist])


season = Node(seasondist, name="Season")
sprinkler = Node(sprinklerdist, name="Sprinkler")
rain = Node(raindist, name="Rain")
wet = Node(wetdist, name="Wet")
slippery = Node(slipperydist, name="Slippery")


model = BayesianNetwork("Slippery Sidewalk")
model.add_states(season, sprinkler, rain, wet, slippery)
model.add_edge(season, sprinkler)
model.add_edge(season, rain)
model.add_edge(sprinkler, wet)
model.add_edge(rain, wet)
model.add_edge(wet, slippery)

model.bake()



In [46]:
model.predict_proba({"Wet":1, "Rain":0})

array([{
    "class" :"Distribution",
    "dtype" :"int",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "0" :0.006713585030266369,
            "1" :0.16262360040245324,
            "2" :0.6146935003151162,
            "3" :0.2159693142521643
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"int",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "0" :0.0029201120453978477,
            "1" :0.9970798879546022
        }
    ],
    "frozen" :false
},
       0, 1,
       {
    "class" :"Distribution",
    "dtype" :"int",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "0" :0.25000000000000017,
            "1" :0.7499999999999999
        }
    ],
    "frozen" :false
}], dtype=object)
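
The posteriors over Season and Sprinkler in the output above can be cross-checked by brute-force enumeration, without the library. This is a minimal sketch using the probabilities from the network cell; the helper `bernoulli` and the dictionary names are illustrative. Slippery is a leaf with no evidence, so it sums out and does not appear.

```python
# Parameters copied from the network cell above
p_season = {0: .25, 1: .25, 2: .25, 3: .25}
p_sprinkler = {0: .01, 1: .25, 2: .75, 3: .50}  # P(Sprinkler=1 | Season)
p_rain = {0: .35, 1: .25, 2: .05, 3: .50}       # P(Rain=1 | Season)
p_wet = {(0, 0): .001, (0, 1): .75,
         (1, 0): .50, (1, 1): .90}              # P(Wet=1 | Sprinkler, Rain)

def bernoulli(p_true, value):
    """Probability that a binary variable takes `value` when P(1) = p_true."""
    return p_true if value == 1 else 1 - p_true

# Unnormalized P(Season=s, Sprinkler=k, Rain=0, Wet=1) for every s and k
weights = {}
for s in p_season:
    for k in (0, 1):
        weights[(s, k)] = (p_season[s]
                           * bernoulli(p_sprinkler[s], k)
                           * bernoulli(p_rain[s], 0)
                           * bernoulli(p_wet[(k, 0)], 1))

total = sum(weights.values())  # P(Rain=0, Wet=1)
season_posterior = {s: (weights[(s, 0)] + weights[(s, 1)]) / total
                    for s in p_season}
sprinkler_posterior = sum(weights[(s, 1)] for s in p_season) / total

print(season_posterior)
print(sprinkler_posterior)
```

Enumeration is only practical for a network this small, but it gives an independent check on the tables entered into the model.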

In [28]:
'''
Generate the data file
Season
Sprinkler
Rain

Season -- [.25, .25, .25, .25]
Sprinkler -- given season [.01, .25, .75, .50]
Rain -- given season [.35, .25, .75, .50]
Wet -- given sprinkler and rain [0.001, .75, .50, .90]
Slippery -- given Wet [.01, .75]
'''

import random
def generateFile():
    # The with block closes the file even if a write fails
    with open("slippery.csv", "w") as f:
        f.write("Season,Sprinkler,Rain,Wet,Slippery\n")
        for i in range(50000):
            f.write(generateRecord() + "\n")

def generateRecord():
    season = random.randint(0, 3)
    sprinkler = randBool([.01, .25, .75, .50][season])
    rain = randBool([.35, .25, .75, .50][season])
    wet = randBool([[0.001, 0.75], [0.50, 0.90]][sprinkler][rain])
    slippery = randBool([.01, .75][wet])
    return ",".join(str(x) for x in [season, sprinkler, rain, wet, slippery])

def randBool(prob):
    return 1 if random.random() > (1 - prob) else 0

In [29]:
generateFile()
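
The sampling test in `randBool` can be checked empirically. `random.random()` returns a value in [0, 1), so it exceeds `1 - prob` with probability `prob`. The sketch below repeats that test under the illustrative name `rand_bool`, with a fixed seed and an arbitrary sample size.

```python
import random

def rand_bool(prob):
    """Return 1 with probability prob, else 0 (same test as randBool above)."""
    return 1 if random.random() > (1 - prob) else 0

random.seed(0)  # fixed seed so the sketch is repeatable
n = 100_000     # arbitrary sample size for the check
freq = sum(rand_bool(0.75) for _ in range(n)) / n
print(freq)     # should land close to 0.75
```

The same idea applies to the generated file: each conditional frequency estimated from `slippery.csv` should sit close to the value used in `generateRecord`.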