# Resampling

Big Idea: Statistics modelled in a program are easier to get right and understand than a formulaic approach. It also extends to more complicated situations.

Topics to prepare for resampling:
* f-strings
* ```Counter()```, ```most_common```, ```elements```
* Statistics
* Random: ```seed```, ```gauss```, ```triangular```, ```expovariate```, ```choice```, ```choices```, ```sample```, ```shuffle```
* Review: list concatenation, slicing, index/count, ```sorted()```
* Review: lambda expressions and chained comparisons

# F-Strings

## %-formatting

The classic style of formatting uses percents ```%``` for placeholders in the templates.

In [1]:
x = 10

In [2]:
"The answer is %d today" % x

'The answer is 10 today'

After the string, the ```%``` operator is used followed by the variable to be inserted.

In [3]:
x = 10
y = 10
z = 20

In [4]:
"%d plus %d is %d" % (x, y, z)

'10 plus 10 is 20'

## .format()

The newer style of formatting changes quite a bit: we place a numbered placeholder in braces within the string and pass the value as a positional argument to the string method ```format```:

In [5]:
x = 10

In [6]:
"The answer is {0} today".format(x)

'The answer is 10 today'

One thing that we don't like about this approach is that a numeric positional argument ```0``` is used inside the template, instead of the name of the variable ```x```, so we can improve this just a little bit by using keyword arguments.

In [7]:
"The answer is {x} today".format(x=x)

'The answer is 10 today'

In [8]:
"The answer is {x} today".format(x=0)

'The answer is 0 today'

And so that's the new style of string formatting. It's really nice; once you get used to it, it'll become your preferred style of formatting.

That said, one thing we don't like is typing the ```.format``` method and the ```x=x```, and we can get rid of that using an ```f-string```.

## F-string

```f```, for formatted string, knows how to reach into the environment and find the ```x```. As you can see, it is a lot more succinct than the older style, and it evaluates the expression.

In [9]:
x = 10

In [10]:
print(f"The answer is {x} today")

The answer is 10 today


One of the nice things is that the expressions inside an f-string are compiled along with the rest of your code; so if you include this in your actual scripts, there's no ```eval``` call at run time. So the template can't be used to inject code the way an ```eval``` on user-supplied text could, which is fantastic.

One of the interesting things about f-strings is that you can use all of your classic formatting operators. For instance, I would like to make this 8 characters wide with leading zeros and format it as a decimal integer.

In [11]:
print(f"The answer is {x :08d} today")

The answer is 00000010 today


Another interesting thing we can do is run Python expressions inside. There's some question of how much of this you want to do and whether it's good practice to put this inside a string; however, for small expressions, it's actually kind of elegant. Say I want to show the square of ```x```.

In [12]:
print(f"The answer is {x ** 2 :08d} today")

The answer is 00000100 today
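As a small sketch of the same idea, here are a few more format specifications used inside f-strings. The names ```price``` and ```count``` are made up for illustration; the format codes themselves are the standard ones shared with ```format()```.

```python
# Illustrative values; `price` and `count` are hypothetical names.
x = 10
price = 3.14159
count = 1234567

print(f"{x:08d}")         # zero-padded to 8 characters: 00000010
print(f"{x ** 2:08d}")    # the expression is evaluated first: 00000100
print(f"{price:.2f}")     # two digits after the decimal point: 3.14
print(f"{count:,}")       # thousands separators: 1,234,567
print(f"{x:>5}|{x:<5}|")  # right- and left-aligned in 5 columns
```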


f-strings are useful when you are going to write messages for your exceptions. For example, raising a ValueError when a float is expected. ```x``` with bang r, ```!r```, gives the representation of the variable ```x```, and ```type(x).__name__``` gives the name of the class of ```x```. This is something that wouldn't look particularly good with the ```.format()``` style; and look here, we've got the 10 dropped in and we also got the data type dropped in. I think this reads very nicely: show the representation of ```x``` and show the name of the datatype along the way.

In [13]:
#raise ValueError(f"Expected {x!r} to be a float, not a {type(x).__name__}")

So particularly for exception messages, I like to use f-strings throughout. I've been using these for several months now, and they grow on you very quickly and it'll quickly become one of your favourite features of Python.
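Here is a minimal sketch of that exception-message pattern in a runnable form. The helper ```require_float``` is a hypothetical name invented for this example, not part of any library.

```python
def require_float(x):
    """Return x unchanged if it is a float, otherwise raise ValueError."""
    if not isinstance(x, float):
        # !r shows the repr (quotes included); __name__ shows the type name.
        raise ValueError(f"Expected {x!r} to be a float, not a {type(x).__name__}")
    return x

try:
    require_float("10")
except ValueError as exc:
    message = str(exc)

print(message)  # Expected '10' to be a float, not a str
```

Because ```!r``` keeps the quotes, the message makes it obvious that the caller passed the string ```'10'``` rather than the number 10.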

# Counter object

The next idea that we'll want to cover is how to use the ```Counter``` object, which has been present in Python for quite some time; and so for a lot of you this will be a quick review, but when you're doing data analysis, ```Counter``` is one of your best friends.

In [14]:
from collections import Counter

A traditional dictionary, if it has a missing key, will raise a ```KeyError```. How many dragons do you have?

In [15]:
d = {}

In [16]:
#d["dragons"]

So traditionally when somebody asks you how many dragons you have, the answer is 0. However, Python's dictionaries would answer with a ```KeyError```. How can we improve on this situation? We can use ```Counter```. ```Counter``` is a subclass of dictionary and has very dictionary-like behaviour; however, if I go to look up the number of dragons, it answers quite sensibly and says I have no dragons.

In [17]:
d = Counter()

In [18]:
d["dragons"]

0

That said, it is pretty easy to increment your number of dragons.

In [19]:
d["dragons"] += 1

In [20]:
d

Counter({'dragons': 1})

It works just like a regular dictionary; however, if you look up a missing key, rather than raising a ```KeyError```, it just returns 0. This makes it suitable for counting, which is why, of course, it's called a ```Counter```.

One of the other nice things about a counter is that you can pass into it a list of objects and it will go ahead and count them for you. It saves you from building a loop in order to construct your counter. For example, I have a counter of the colors red, green, red, blue, red, blue, green. What the ```split``` will do is split the string of colors on whitespace, making a list. What the counter will do is iterate over the list and count all the colors, telling us that we have three reds, two greens, and two blues.

In [21]:
Counter("red green red blue red blue green".split())

Counter({'red': 3, 'green': 2, 'blue': 2})

In [22]:
c = Counter("red green red blue red blue green".split())

One of the nice things that can be done with a Counter is to ask it what the most common colors are. What is the one most common? Notice that it returned a list of tuples; the first part of the tuple is the color, and the second part is its count.

In [23]:
c.most_common(n=1)

[('red', 3)]

I can ask for the two most common.

In [24]:
c.most_common(n=2)

[('red', 3), ('green', 2)]

The choice between green and blue is arbitrary, since they tie; in Python, it'll give you the one that was encountered first when two counts are equal, and this is really handy.

Part of the design concept for our counter was that it was modelled on Smalltalk bags. The notion of a bag was that you can put 50 red marbles into a bag, then another 50 green marbles and another 10 yellow marbles, and the bag would let you take these out one at a time. So it worked like any other container except that it was quite efficient in the way it stored them. Instead of storing all of the marbles separately, it listed each kind just one time along with its count. In C++ this is called a multiset. So one of the notions of the bag is that, since there are a lot of marbles in there, you can pull them all out one at a time. The tool for doing that for us is called ```elements```.

So if I were to list all of the elements of the counter:

In [25]:
list(c.elements())

['red', 'red', 'red', 'green', 'green', 'blue', 'blue']

we get back all of our colors. Notice the reds have been grouped together, and the greens and the blues have been put together. So it doesn't remember the order of the individual items; the groups come out in the order each color was first seen, and all it remembers beyond that is the multiplicity, how many times we saw each color.

Now, what else would list do? Without elements, remember, this is just a dictionary; and by default, when you iterate over a dictionary, you get all of the keys of the dictionary. 

In [26]:
list(c)

['red', 'green', 'blue']

And since it's just a dictionary, we can also ask for the values

In [27]:
list(c.values())

[3, 2, 2]

or the key value pairs, which are the items:

In [28]:
list(c.items())

[('red', 3), ('green', 2), ('blue', 2)]

None of these, though, gave us all of the individual elements, and we'll be needing that shortly for doing data analytics.
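To tie these pieces together, here is a short sketch, using the same color data as above, that checks how ```elements()```, ```values()``` and ```most_common()``` relate to one another.

```python
from collections import Counter

words = "red green red blue red blue green".split()
c = Counter(words)

# Missing keys count as zero instead of raising KeyError.
assert c["purple"] == 0

# elements() repeats each key by its count; rebuilding a Counter
# from those elements gives back an equal Counter.
expanded = list(c.elements())
assert len(expanded) == sum(c.values()) == len(words)
assert Counter(expanded) == c

# most_common() with no argument lists every (key, count) pair,
# highest count first.
print(c.most_common())
```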

# The Statistics Module

The next thing I'd like to show you is the statistics module. In the statistics module are all the classic descriptive statistics: mean, median, mode, standard deviation, and population standard deviation.

In [29]:
from statistics import mean, median, mode, stdev, pstdev

As people start to do more and more data analytics, as they get interested in data science, as they get interested in machine learning, or as they get interested in data in general, descriptive statistics are really handy; and it's nice to have those built in to Python.

One of the interesting design aspects of the statistics module was that it was designed for accuracy. There's some question whether that was a really good design goal. After all, many times when you want to know the average of a series of numbers, you only want to know it out to one or two decimal places rather than having it perfect out to 17 decimal places. That said, a number of statistical formulas are very easily gotten wrong and can amplify very small errors, so a great deal of care was taken in this module to make sure that it gives you as accurate a result as possible.

In other words, even though you could implement most of these functions yourself quite easily, the statistics module will tend to give you better results.
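As a rough illustration of that accuracy point, here is a sketch comparing a hand-rolled running total with ```statistics.mean``` on data chosen to provoke floating-point rounding. The data values are contrived for the demonstration.

```python
from statistics import mean

data = [1e20, 1.0, -1e20]

# Naive approach: accumulate floats one at a time, then divide.
total = 0.0
for value in data:
    total += value          # the 1.0 is lost when added to 1e20
naive = total / len(data)

careful = mean(data)        # statistics.mean avoids that rounding loss

print(naive)    # 0.0
print(careful)  # 0.3333333333333333
```

The true mean is one third, so the simple loop is off by the entire answer while the module's result is correct to full float precision.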

So mean does the obvious thing. I've got a mean of several measurements of 50, 52 and 53; the mean of those is 51 and two-thirds. 

In [30]:
mean([50, 52, 53])

51.666666666666664

Now the median of those will be the center-most value.

In [31]:
median([50, 52, 53])

52

Where median becomes more interesting is if there are two center values, it will average those two, of course, as they taught you in school.

In [32]:
median([50, 51, 52, 53])

51.5

What is the mode all about? If one value occurs more than the others, it becomes the mode. 

In [33]:
mode([51, 50, 52, 53, 51, 51])

51

If you had to pick exactly one representative element, the one that occurs more often than any other, mode is a reasonably good choice.

Median is mainly used for skewed distributions.

In [34]:
median([51, 50, 52, 53, 51, 51])

51.0

And then there's the classic standard deviation and population standard deviation. Some of you have probably forgotten the difference between those two. The population standard deviation is the one that most people remember where you divide by the number N when you're doing the computation. The sample standard deviation is interesting. It's offset a little bit. We divide by N minus one.

In [35]:
stdev([51, 50, 52, 53, 51, 51])

1.0327955589886444

In [36]:
pstdev([51, 50, 52, 53, 51, 51])

0.9428090415820634

The idea is if you have a sample of size one, it gives you some estimate of the mean of the population. Your best guess is the one that you chose in your sample. On the other hand, it gives you no idea of the variability. Say you reach into a bag and pull out a marble, and it's red. Do you have any idea whether there were many different marbles, only a few different marbles, or whether they're all red? And so the sample standard deviation of a sample of size one is undefined: dividing by N minus one means dividing by zero, so the uncertainty is effectively unbounded, and ```stdev``` raises a ```StatisticsError``` rather than returning a number.

In [37]:
stdev([1, 0.9])

0.07071067811865474

In [38]:
pstdev([1, 0.9])

0.04999999999999999

On the other hand, if you knew that there was exactly only one marble and you chose it, you would know everything about the population; so we would divide by the population size of one, and the population standard deviation of that one marble would be zero. So are these two very close to each other? They give answers very close to each other when N is fairly large. But, remember, when N is one, the sample version has no finite answer at all and the population version gives you zero, two results that are quite far apart.
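The difference between the two divisors can be checked by hand. The following sketch recomputes both standard deviations from the sum of squared deviations and compares them with the module's results; it then shows what happens with a single observation.

```python
from math import isclose, sqrt
from statistics import StatisticsError, mean, pstdev, stdev

data = [51, 50, 52, 53, 51, 51]
m = mean(data)
ss = sum((x - m) ** 2 for x in data)      # sum of squared deviations

population_sd = sqrt(ss / len(data))      # divide by N
sample_sd = sqrt(ss / (len(data) - 1))    # divide by N - 1

assert isclose(population_sd, pstdev(data))
assert isclose(sample_sd, stdev(data))

# With a single observation, pstdev is zero but stdev has no answer:
# the N - 1 divisor would be zero, so the module raises an error.
single_population_sd = pstdev([7])
try:
    stdev([7])
    single_sample_failed = False
except StatisticsError:
    single_sample_failed = True

print(single_population_sd, single_sample_failed)
```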

# list Concatenation

The next one up is list concatenation. This is just a little bit of review from intro Python. I have two lists, and adding them concatenates the two end to end. This works quite differently from numpy. Adding these in numpy would add them element-wise: the 10 to the 40, the 20 to the 50, and the 30 to the 60.

In [39]:
s = [10, 20, 30]

In [40]:
t = [40, 50, 60]

In [41]:
u = s + t
u

[10, 20, 30, 40, 50, 60]

So Python list concatenation works differently. 

List slicing says that we can take the first few elements. This says give me the first two elements of list ```u```, which is the 10 and 20.

In [42]:
u[:2]

[10, 20]

And then if I want the last two, we can use a negative index. This says count two from the right, so this is one back and two back. Minus two is the 50, and then the colon says go all the way to the end.

In [43]:
u[-2:]

[50, 60]

If I wanted to make a brand new list, I could concatenate these together and say I want the first two plus the last two.

In [44]:
u[:2] + u[-2:]

[10, 20, 50, 60]
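For contrast with the element-wise behaviour mentioned above, here is a small pure-Python sketch. The ```zip``` comprehension is just one way to mimic element-wise addition without numpy; it is an illustration, not something the rest of this lesson depends on.

```python
s = [10, 20, 30]
t = [40, 50, 60]

concatenated = s + t                          # lists join end to end
elementwise = [a + b for a, b in zip(s, t)]   # pairwise sums, numpy-style

print(concatenated)  # [10, 20, 30, 40, 50, 60]
print(elementwise)   # [50, 70, 90]

# Slices are lists too, so they can be recombined the same way.
ends = concatenated[:2] + concatenated[-2:]
print(ends)          # [10, 20, 50, 60]
```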

# sequence methods

Another interesting characteristic of lists: if you do a ```dir``` of ```list```, you'll see a couple of methods that people tend to forget about. All sequences have ```count```, and all sequences have ```index```.

In [45]:
dir(list)

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

So let's go ahead and try this with a string sequence, ```"abracadabra"```; and I can ask the question: At what location is the letter ```"c"```? And ```"c"``` is up here at position 4. 

In [46]:
s = "abracadabra"

In [47]:
i = s.index("c")
i

4

So ```i``` is four, but I could also ask the question: How many times does the letter ```"c"``` occur? It's in there only once, as opposed to the five ```"a"```s.

In [48]:
s.count("c")

1

In [49]:
s.count("a")

5

A lot of people forget that these methods exist. However, when we're doing data analytics; it's going to become really common, very handy to determine how many times an element occurs. 
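As a sketch of how ```count``` feeds into data analysis, the following turns raw counts into relative frequencies and also shows one way to guard an ```index``` lookup. The ```-1``` sentinel is an arbitrary choice for this example.

```python
s = "abracadabra"

# Relative frequency of each distinct letter, using count() and len().
frequencies = {letter: s.count(letter) / len(s) for letter in sorted(set(s))}
print(frequencies["a"])   # 5 of the 11 letters

# index() raises ValueError for a missing item, so test membership first.
position = s.index("z") if "z" in s else -1
print(position)           # -1
```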

# sorting

I'd like to also remind you of ```sorted```. There are two different ways to sort. I could take this list and sort it in place using the method ```sort```.

In [50]:
s = [10, 5, 70, 2]

In [51]:
s.sort()

In [52]:
s

[2, 5, 10, 70]

And if you already have a list, and if you want that list sorted, this is a really good way to do it.

However, if you'd like to leave that list alone and not sort it in place, you can use the built-in function ```sorted```. ```sorted``` constructs a new list, so the old list is the same as before, but the new list is in sorted order.

In [53]:
s = [10, 5, 70, 2]

In [54]:
t = sorted(s)

In [55]:
t

[2, 5, 10, 70]

One of the nice things about ```sorted``` is that it accepts any iterable, which means I can use it on immutable objects, such as the string ```"cat"```, to sort the letters in alphabetical order.

In [56]:
sorted("cat")

['a', 'c', 't']

So sorted is often much more convenient to use than sort itself, in part, because you don't have to convert to a list first; it does that for you.
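A brief sketch of that convenience: ```sorted``` works directly on a string and on a tuple, neither of which has a ```sort``` method, and it leaves the input unchanged.

```python
word = "cat"
letters = sorted(word)               # a new list; the string is untouched
print(letters)                       # ['a', 'c', 't']
print("".join(letters))              # act

scores = (10, 5, 70, 2)              # a tuple has no sort() method
print(sorted(scores))                # [2, 5, 10, 70]
print(sorted(scores, reverse=True))  # [70, 10, 5, 2]
print(scores)                        # (10, 5, 70, 2), unchanged
```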

# lambda

Another one that people often forget is ```lambda```. Lambda used to be very popular in the Python world. People used to use ```map``` and ```lambda``` quite a bit, but then list comprehensions came along; and list comprehensions are so much more beautiful, most of the time, that people started to forget about lambda or they decided that lambda was unattractive.

Python has grown a lot of tools to help people avoid lambda. So we used to use lambda for everything; but now we have partial objects, we have item getters, attribute getters, and a whole zoo of other objects whose sole purpose is to make sure that you don't have to use a lambda. I say just get over it and learn to use lambda. Why is it that so many people react badly to lambda? I think it's because of the name. Had it been called **make function**, no one would have any questions about what it does.

```lambda x``` makes a function.

In [57]:
lambda x: x ** 2

<function __main__.<lambda>(x)>

What can you do with functions? You can call them. A common use of lambdas is to make anonymous or throwaway functions; so this is a silly use. Make an expression that has a lambda in it and then consume it right away.

In [58]:
(lambda x: x ** 2)(5)

25

The next expression says: take 100, put it on the stack, build a function, put it on the stack, call that function with five, substituting the five for the ```x``` and computing the 25, adding it to the 100, getting 125, and then adding 50 to get to 175. lambda is actually really straightforward if you remember that it means **make function**.

In [59]:
100 + (lambda x: x ** 2)(5) + 50

175

That said, some people remember those; but they tend to forget that it can also be used to make functions of multiple arguments. So this lets me make a brand new function that I can call, where the three gets substituted for the ```x``` and the eight gets substituted for the ```y```.

In [60]:
f = lambda x, y: 3 * x + y

In [61]:
f(x=3, y=8)

17

In [62]:
f(3, 8)

17

Even for people who remember those uses of lambda, there's another case that you don't see that often, which is to defer a computation or to make a promise; and the idea is you run a computation at some point in the future. For instance, I know that ```x``` is 10 and ```y``` is 20; and I want to raise ```x``` to the power ```y```, but I don't want to do it right now. So it's reasonably common to make a lambda that has no arguments at all that then runs the computation that you're interested in.

In [63]:
x = 10
y = 20

In [64]:
f = lambda : x ** y

The idea is we're making a promise: we've frozen this computation, and in the future we can thaw it out. Only when we call it do we run it.

In [65]:
f()

100000000000000000000

And this is a little bit of a silly example for it; however, it's very common in callback style programming, where we only want to run the function when somebody triggers an event in an async I/O type program or when they trigger an event by pressing a button in a GUI widget. So you'll see these all the time in GUIs. Some people call it freeze and thaw, some people call it promises, some people call them thunks; but the idea is we're deferring a computation to the future by making a function of no arguments, and once we call it, that's when the function runs.
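Here is a minimal sketch of the callback pattern described above. The dictionary ```deferred``` and the function ```handle_event``` are hypothetical stand-ins for whatever hook a GUI toolkit or async framework would provide.

```python
x = 10
y = 20

# Each lambda takes no arguments and does no work until it is called.
deferred = {
    "power": lambda: x ** y,
    "total": lambda: x + y,
}

# `handle_event` is a hypothetical stand-in for a GUI or async callback hook.
def handle_event(name):
    """Look up the frozen computation for this event and thaw it."""
    return deferred[name]()

print(handle_event("total"))  # 30
print(handle_event("power"))  # 100000000000000000000
```

Note that the lambdas look up ```x``` and ```y``` at the moment they are called, so rebinding those names before the call changes the result.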

# Chained Comparisons

Now here's one that everyone knows, which is that we can do comparisons like this: Is ```x``` greater than 6; is ```x``` less than 10? 

In [66]:
x = 15

In [67]:
x > 6

True

In [68]:
x < 10

False

We could combine two such tests with ```and```, as in ```x > 6 and x < 20```; but we can also chain them together by asking, in one expression, is ```x``` greater than 6 and less than 20?

In [69]:
6 < x < 20

True

Joining two binary comparisons with ```and``` is very common in a lot of languages; but Python offers something new and special, which is chained comparisons. It lets you write this the same way that you would've learned to write it in mathematics or back in school. The chained form ```6 < x < 20``` is equivalent to ```6 < x and x < 20```, except that the chained one is a little bit more efficient because it doesn't have to load ```x``` onto the stack two times.
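The single evaluation of the middle operand can be observed directly. In this sketch, ```reading``` is a made-up function that counts how many times it is called.

```python
calls = 0

def reading():
    """Hypothetical sensor read; counts how often it is evaluated."""
    global calls
    calls += 1
    return 15

calls = 0
chained = 6 < reading() < 20                     # middle operand evaluated once
chained_calls = calls

calls = 0
spelled_out = 6 < reading() and reading() < 20   # evaluated twice
spelled_out_calls = calls

print(chained, chained_calls)          # True 1
print(spelled_out, spelled_out_calls)  # True 2
```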

# The random module

Next up is a little tour of the random module. When we do resampling statistics, we make heavy use of the random module. So I'll give you a quick tour. A lot of people have seen one or two pieces, but they haven't seen how they all fit together. 

In [70]:
from random import random, seed

In order to get the same random numbers every time, we would want to seed the random number generator. Without a seed, if I were to restart my shell right now and type ```random()```, I would get a different random number than before.

In [71]:
random()

0.03931061598070873

In [72]:
random()

0.7475454215857801

However, if I were to seed the random number generator, I could reliably produce the exact same sequence over and over again.

In [73]:
seed(8675309)

In [74]:
random()

0.40224696110279223

In [75]:
random()

0.5102471779215914

In [76]:
random()

0.6637431122665531

This, of course, would be terrible for gambling applications but is fantastic for simulations, because later you can reproduce the results of the simulation. It's also commonly used in fields where you pick samples to check: you publish your seed so that people know that you didn't cherry pick your results. So always using a random seed is a step toward reproducible research. If I use the same seed again, I should be able to produce exactly the same set of random numbers once again; and you can see that 0.4022 came up both times, as did 0.5102 and 0.6637.

In [77]:
seed(8675309)

In [78]:
random()

0.40224696110279223

In [79]:
random()

0.5102471779215914

In [80]:
random()

0.6637431122665531
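The same check can be written as code rather than compared by eye. This sketch wraps the seeding in a hypothetical helper, ```draw```, and also shows a separate ```random.Random``` instance, which is one way to keep a simulation's random stream independent of other code.

```python
from random import Random, random, seed

def draw(n, seed_value):
    """Seed the shared generator, then return n random floats."""
    seed(seed_value)
    return [random() for _ in range(n)]

first = draw(3, 8675309)
second = draw(3, 8675309)
assert first == second            # same seed, same sequence

# A private Random instance keeps a simulation's stream separate from
# any other code that also calls the module-level functions.
rng = Random(8675309)
private = [rng.random() for _ in range(3)]
assert private == first
```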

Next up is a set of continuous distributions. So one of them is uniform. Uniform does exactly what it says. I want a number chosen uniformly in the range from 1000 to 1100; and it gives me a floating point number, as is befitting of a continuous distribution. 


In [81]:
from random import uniform

In [82]:
uniform(1000, 1100)

1086.0716692339552

However, we can also get distributions that are shaped not in a flat line. One of them, that I believe is well named, is a triangular distribution. Say I want a number between 1000 and 1100; however, I want the halfway point to be chosen much more frequently than the edges. And so these will all be centered about the middle, and there will be very few entries toward the 1000 or 1100.

In [83]:
from random import triangular

In [84]:
triangular(low=1000, high=1100)

1037.4795077189701

Now if you would like the tails to taper off and the distribution to be a little bit less angular, a gaussian or normal distribution is a really good choice.

So let's pick our random IQ. The average IQ according to its original definition is 100, and the standard deviation was 15. So this generates random IQs for people. So just a simple normal distribution.

In [85]:
from random import gauss

In [86]:
gauss(mu=100, sigma=15)

86.34454152701

In [87]:
gauss(mu=100, sigma=15)

83.1958332105035

Expovariate is used to simulate arrival times and, interestingly, its argument is called lambda. Python spells the parameter ```lambd``` because ```lambda``` is a reserved word. The mean of the output is the reciprocal of ```lambd```: if ```lambd``` is 20, the results will average around 1/20 = 0.05.

In [88]:
from random import expovariate

In [89]:
expovariate(lambd=20)

0.012883883534313367

In [90]:
expovariate(lambd=20)

0.062187163410823934

In [91]:
expovariate(lambd=20)

0.0047789956500410785

It is easy to see what the averages of these are using a list comprehension, so let me go ahead and import ```mean``` and ```stdev``` from ```statistics``` and demonstrate our distributions.

In [92]:
from statistics import mean, stdev
seed(8675309)

So let's get some data. I would like numbers drawn from ```triangular(low=1000, high=1100)```, and I want a thousand of them.

In [93]:
data = [triangular(low=1000, high=1100) for i in range(1000)]

We can take the mean, and we would expect that to be centered right in the middle of the triangular distribution.

In [94]:
mean(data)

1048.7066714907694

1048 is just about in the middle, and we can check out its sample standard deviation, which is about what we would expect.

In [95]:
stdev(data)

20.122912471146602

Now, let's try the same thing for a uniform distribution. Our expectation is that the mean will also be in the center; however, the standard deviation will be much wider because it's been flattened out.

In [96]:
data = [uniform(a=1000, b=1100) for i in range(1000)]

In [97]:
mean(data)

1051.1985410381749

In [98]:
stdev(data)

28.468072698653245

Let's look at the gaussian distribution. This one should have very predictable results. We've told it that we want the mean to be 100 and that we want the standard deviation to be 15.

In [99]:
data = [gauss(mu=100, sigma=15) for i in range(1000)]

So the mean of data and standard deviation of data; I'm hoping that that'll come out to be around 100 and 15 respectively.

In [100]:
mean(data)

100.25337488818245

In [101]:
stdev(data)

15.304329132339166

And my hopes were matched.

Expovariate was the one that I found to be surprising when I first started to use it. Expovariate of 20 is going to give us an average very close to 1/20, not 20 itself; and these are used to simulate arrival times in a queuing system. And so if you're simulating the arrival of packets in a network or the arrival rate of requests from users, expovariate is a really good way to model this because every now and then, users aren't going to request anything at all and then a whole bunch of packets or requests will show up all at the same time.

In [102]:
data = [expovariate(lambd=20) for i in range(1000)]

In [103]:
mean(data)

0.04672929739028364

In [104]:
stdev(data)

0.04694950835620672

So the mean of this is, in fact, about one-twentieth. The standard deviation is broader than most people expect because this is an exponential distribution so occasionally you get really large outliers. And so those are the continuous distributions.
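To put numbers on "what we would expect", the sketch below compares sample standard deviations with the textbook values for each distribution: the range divided by the square root of 24 for a symmetric triangular distribution, the range divided by the square root of 12 for a uniform one, and 1/```lambd``` for the exponential. The sample size of 20,000 is an arbitrary choice made here so that the estimates settle down.

```python
from math import sqrt
from random import expovariate, seed, triangular, uniform
from statistics import mean, stdev

seed(8675309)
n = 20_000
low, high, rate = 1000, 1100, 20

samples = {
    "triangular": [triangular(low, high) for _ in range(n)],
    "uniform": [uniform(low, high) for _ in range(n)],
    "expovariate": [expovariate(rate) for _ in range(n)],
}

# Textbook standard deviations for each distribution.
expected_sd = {
    "triangular": (high - low) / sqrt(24),   # symmetric, mode at the midpoint
    "uniform": (high - low) / sqrt(12),
    "expovariate": 1 / rate,                 # equal to the mean
}

for name, data in samples.items():
    print(f"{name:12} mean={mean(data):10.4f} "
          f"sd={stdev(data):8.4f} expected sd={expected_sd[name]:8.4f}")
```

The exponential case is the one that surprises people: its standard deviation equals its mean, which is why the two values above both came out near 0.047.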

The next thing I'd like to show is some of the discrete distributions. We've got ```choice```; something new, ```choices```; and then ```sample``` and ```shuffle```.

In [105]:
from random import choice, choices, sample, shuffle

So choice is fantastic. It's typically used to pick a single choice out of a list. So I could have some outcomes which are win, lose, draw, play again, and a double win. 

In [106]:
outcomes = ["win", "lose", "draw", "play again", "double win"]

Several possible outcomes, and if I were to pick a choice of those outcomes, it just gives me a single one; and it's equidistributed.

In [107]:
choice(outcomes)

'double win'

In [108]:
choice(outcomes)

'lose'

In [109]:
choice(outcomes)

'lose'

In [110]:
choice(outcomes)

'play again'

Sometimes we want to make multiple choices, so we just turn this into a plural and tell it how many we want. I would like to choose 10 times, straightforward enough.

In [111]:
choices(outcomes, k=10)

['double win',
 'double win',
 'draw',
 'lose',
 'double win',
 'double win',
 'win',
 'double win',
 'win',
 'double win']

Something that's kind of cool about this is we can now use it with ```collections.Counter```, and these tools fit together really well; and we can see how often we got a particular outcome. A Counter's plain ```repr``` lists entries from the most frequent to the least frequent, though some notebook front ends, as in the output below, display them in a different order.

In [112]:
seed(8675309)

In [113]:
from collections import Counter

In [114]:
Counter(choices(outcomes, k=10))

Counter({'draw': 2, 'play again': 4, 'double win': 1, 'lose': 2, 'win': 1})

That said, these are somewhat evenly distributed. For instance, if I were to go out to run this 10000 times, you'll see that each of these occurred with about the same frequency. Double win happened slightly less and lose slightly more, but they're all roughly centered around 2000. 

In [115]:
Counter(choices(outcomes, k=10000))

Counter({'win': 1970,
         'double win': 1967,
         'play again': 2023,
         'lose': 2067,
         'draw': 1973})

What if I'd like to change that? One thing that ```choices``` can do that ```choice``` can't is accept weights for the outcomes. Say I would like win to occur five times for every four losses, three draws, two play agains, and only one double win. I'm expecting that with this change we'll get the outcomes in a different ratio: a ratio of five to four to three to two to one.

In [116]:
Counter(choices(outcomes, [5, 4, 3, 2, 1], k=10000))

Counter({'draw': 2001,
         'win': 3322,
         'lose': 2646,
         'play again': 1355,
         'double win': 676})

And you could see roughly that distribution here. So we have five times as many wins as double wins. 
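The expected counts follow directly from the weights: each outcome's share is its weight divided by the total weight. The following sketch prints expected and observed counts side by side for the same outcomes and weights.

```python
from collections import Counter
from random import choices, seed

seed(8675309)
outcomes = ["win", "lose", "draw", "play again", "double win"]
weights = [5, 4, 3, 2, 1]
k = 10_000

observed = Counter(choices(outcomes, weights, k=k))

# Each outcome's expected count is its share of the total weight.
total = sum(weights)
for outcome, weight in zip(outcomes, weights):
    expected = k * weight / total
    print(f"{outcome:12} expected={expected:7.1f} observed={observed[outcome]}")
```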

Another thing we can do with the outcomes is shuffle them. So let's look at them unshuffled and then shuffled. Shuffle is a little bit like sort in that it works in place. So it actually mutates the input, whereas nothing else mutated the input.

In [117]:
outcomes

['win', 'lose', 'draw', 'play again', 'double win']

In [118]:
shuffle(outcomes)

In [119]:
outcomes

['draw', 'lose', 'double win', 'win', 'play again']

Now, earlier, we saw that when we made multiple choices, if I chose five times, there might be some duplications in there. If I'd like to choose without duplication and do sampling without replacement, there is ```sample```, which has a similar format; and so ```sample``` of outcomes with ```k=4``` says choose four of these outcomes, but with no duplicates at all.

In [120]:
sample(outcomes, k=4)

['lose', 'play again', 'double win', 'win']

So that's sampling without replacement.

So if I would like to choose numbers for a California lottery, these tools combine very nicely. We can take a sample, and the lottery numbers are in the range of 1 to 56 inclusive; and you have to get six numbers right. We chose ```sample``` because the lottery doesn't choose any number twice, so it samples without replacement. And when the lottery-winning numbers are published, they publish them in sorted order to make it easier to figure out whether you've won or whether you're going to be teaching Python for the rest of your life. Let's see how I did. Oh, it looks like I'll be teaching Python for the rest of my life.

In [121]:
sorted(sample(range(1, 57), k=6))

[3, 12, 27, 48, 50, 55]
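Wrapped up as a function, the lottery pick might look like the sketch below. The function name and its parameters ```highest``` and ```how_many``` are invented for this example; the defaults simply mirror the numbers used above.

```python
from random import sample, seed

def lottery_pick(highest=56, how_many=6):
    """Pick distinct numbers from 1..highest and return them sorted."""
    return sorted(sample(range(1, highest + 1), k=how_many))

seed(8675309)
ticket = lottery_pick()
print(ticket)
```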

All right. So that's sampling without replacement, sampling with replacement, and weighted sampling. One of the things that I can do for you, besides giving you a little tour of these features, is to show you some of the relationships between all of them. What if you took a sample of the outcomes and only wanted one of them? The data type it would give you is a list, a list of length one; and the first element of that list is at index zero.

In [122]:
sample(outcomes, k=1)[0]

'draw'

This code is exactly the equivalent of choice.

In [123]:
choice(outcomes)

'lose'

What we've discovered here is that choice is really a special case of a sample.

Now what we could do is shuffle all of the outcomes and then take a look at what the outcomes are, but there's another way to do that. I could've sampled the outcomes where k is the length of the outcomes.

In [124]:
sample(outcomes, k=len(outcomes))

['double win', 'draw', 'play again', 'win', 'lose']

What this shows us is that really shuffle is just a special case of sample where you've sampled them all, and choice is a special case of sample where you've sampled just one. 
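These relationships can be stated as checks. The sketch below does not claim the functions return identical values for a given seed, only that they have the same kind of result.

```python
from random import choice, sample, seed, shuffle

seed(8675309)
outcomes = ["win", "lose", "draw", "play again", "double win"]

one = sample(outcomes, k=1)[0]                  # behaves like choice(outcomes)
everything = sample(outcomes, k=len(outcomes))  # behaves like a shuffled copy

assert one in outcomes
assert choice(outcomes) in outcomes
assert sorted(everything) == sorted(outcomes)   # same items, new order

# Unlike shuffle(), sample() leaves its input untouched.
assert outcomes == ["win", "lose", "draw", "play again", "double win"]
deck = list(outcomes)
shuffle(deck)                                   # mutates deck in place
assert sorted(deck) == sorted(outcomes)
```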

In fact, all of these functions are quite related to each other. That said, you rarely want to use this style and would instead prefer saying ```shuffle``` to shuffle and ```choice``` for a single choice; and then the interesting one is ```choices``` itself, which allows sampling with replacement.

So I hope you enjoyed that little tour of Python. Some of this might've seemed basic to some of you; some of you might've thought that some parts were new. I think your favorite part is going to be f-strings. That said, these represent the core tools that we use in data analytics. So very soon, in our next lesson, we're going to put all of these things together and see how they can be used for data analytics. I think you're going to see that even though the tools are simple, they can be combined together to express big ideas.

Remember, the theme of our overall presentation here is that a lot of people write way too much code. They take way too many lines to express their ideas; and in doing so, the ideas become muddy, the code becomes hard to change, and it isn't very expressive. What I'd like to teach you is to write Python and enjoy it the way I do, to express big ideas with only a small amount of code.

And so what's coming up next is resampling statistics. Don't let the statistics part scare you off. It's going to be very approachable, and we're going to be able to do things that will be easy for us that are hard for other people, in part, because they're using statistics the way they were taught in school. Resampling statistics is fun, it's easy, it takes very little code, and we've now laid all of the groundwork for it.