In [1]:
from datascience import *
#path_data = '../../../assets/data/'
path_data = '/content/gdrive/MyDrive/DataScience/data/'
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Random Sampling in Python

This section summarizes the ways you have learned to sample at random using Python, and introduces a new way.

## Review: Sampling from a Population in a Table
If you are sampling from a population of individuals whose data are represented in the rows of a table, then you can use the Table method `sample` to [randomly select rows](https://inferentialthinking.com/chapters/10/1/Empirical_Distributions.html#id1) of the table. That is, you can use `sample` to select a random sample of individuals.

By default, `sample` draws uniformly at random with replacement. This is a natural model for chance experiments such as rolling a die.

In [3]:
faces = np.arange(1, 7)
die = Table().with_columns('Face', faces)
die

Face
1
2
3
4
5
6


Run the cell below to simulate 6 rolls of a die.

In [7]:
die.sample(6)

Face
1
3
4
3
6
2


Sometimes it is more natural to sample individuals at random without replacement. This is called a simple random sample. The argument `with_replacement=False` allows you to do this.

In [8]:
actors = Table.read_table(path_data + 'actors.csv')
actors

Actor,Total Gross,Number of Movies,Average per Movie,#1 Movie,Gross
Harrison Ford,4871.7,41,118.8,Star Wars: The Force Awakens,936.7
Samuel L. Jackson,4772.8,69,69.2,The Avengers,623.4
Morgan Freeman,4468.3,61,73.3,The Dark Knight,534.9
Tom Hanks,4340.8,44,98.7,Toy Story 3,415.0
"Robert Downey, Jr.",3947.3,53,74.5,The Avengers,623.4
Eddie Murphy,3810.4,38,100.3,Shrek 2,441.2
Tom Cruise,3587.2,36,99.6,War of the Worlds,234.3
Johnny Depp,3368.6,45,74.9,Dead Man's Chest,423.3
Michael Caine,3351.5,58,57.8,The Dark Knight,534.9
Scarlett Johansson,3341.2,37,90.3,The Avengers,623.4


In [9]:
# Simple random sample of 5 rows
actors.sample(5, with_replacement=False)

Actor,Total Gross,Number of Movies,Average per Movie,#1 Movie,Gross
Daniel Radcliffe,2634.4,17,155.0,Harry Potter / Deathly Hallows (P2),381.0
Robert DeNiro,3081.3,79,39.0,Meet the Fockers,279.3
Michael Caine,3351.5,58,57.8,The Dark Knight,534.9
Jeremy Renner,2500.3,21,119.1,The Avengers,623.4
Philip Seymour Hoffman,2463.7,40,61.6,Catching Fire,424.7


Since `sample` gives you the entire sample in the order in which the rows were selected, you can use Table methods on the sampled table to answer many questions about the sample. For example, you can find the number of times the die showed six spots, or the average number of movies in which the sampled actors appeared, or whether one of two specified actors appeared in the sample. You might need multiple lines of code to get some of this information.

## Review: Sampling from a Population in an Array

If you are sampling from a population of individuals whose data are represented as an array, you can use the NumPy function `np.random.choice` to [randomly select elements of the array](https://inferentialthinking.com/chapters/09/3/Simulation.html#example-number-of-heads-in-100-tosses).

By default, `np.random.choice` samples at random with replacement.

In [10]:
# The faces of a die, as an array
faces

array([1, 2, 3, 4, 5, 6])

In [11]:
# 7 rolls of the die
np.random.choice(faces, 7)

array([3, 5, 2, 6, 1, 6, 3])

The argument `replace=False` allows you to get a simple random sample, that is, a sample drawn at random without replacement.

In [12]:
# Array of actor names
actor_names = actors.column('Actor')

In [13]:
# Simple random sample of 5 actor names
np.random.choice(actor_names, 5, replace=False)

array(['Cameron Diaz', 'Ben Stiller', 'Tom Hanks', 'Cate Blanchett',
       'Eddie Murphy'],
      dtype='<U22')
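Because a simple random sample is drawn without replacement, no element can appear in it twice, and the sample can never be larger than the population. The sketch below checks both facts on a small hypothetical array of names, used here in place of the `actors` data so that the example runs on its own.

```python
import numpy as np

# Hypothetical stand-in population; these names are for illustration only
names = np.array(['Ana', 'Ben', 'Cy', 'Dee', 'Eli', 'Fay'])

# A simple random sample of 4 names: every sampled name is distinct
srs = np.random.choice(names, 4, replace=False)
all_distinct = len(np.unique(srs)) == len(srs)

# Asking for more elements than the population contains raises a ValueError
try:
    np.random.choice(names, 10, replace=False)
    too_large_allowed = True
except ValueError:
    too_large_allowed = False
```

With the default `replace=True`, the same request for 10 names would succeed, because names are allowed to repeat.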

Like `sample`, `np.random.choice` gives you the entire sequence of sampled elements. You can use array operations to answer many questions about the sample. For example, you can find which actor was the second one to be drawn, or the number of faces of the die that appeared more than once. Some answers might need multiple lines of code.
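As a minimal sketch of the kinds of questions mentioned above, the cell below rolls a die 7 times and then finds the second roll and the number of faces that appeared more than once. The variable names are chosen for illustration, and `np.unique` with `return_counts=True` is one of several reasonable ways to count repeats.

```python
import numpy as np

faces = np.arange(1, 7)

# 7 rolls of the die, in the order in which they were drawn
rolls = np.random.choice(faces, 7)

# The second roll (array indices start at 0)
second_roll = rolls.item(1)

# Each face that appeared, along with how many times it appeared
faces_seen, counts = np.unique(rolls, return_counts=True)

# Number of distinct faces that appeared more than once
num_repeated_faces = np.count_nonzero(counts > 1)
```

Since there are 7 rolls and only 6 faces, at least one face must appear more than once, so `num_repeated_faces` is always at least 1 here.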

## Sampling from a Categorical Distribution
Sometimes we are interested in a categorical attribute of our sampled individuals. For example, we might be looking at whether a coin lands Heads or Tails. Or we might be interested in the political parties of randomly selected voters.

In such cases, we frequently need the proportions of sampled individuals in the different categories. If we have the entire sample, we can calculate these proportions. The function `sample_proportions` in the `datascience` library does that work for us. It is tailored for sampling at random with replacement from a categorical distribution and returns the proportions of sampled elements in each category.

The `sample_proportions` function takes two arguments:
- the sample size
- the distribution of the categories in the population, as a list or array of proportions that add up to 1

It returns an array containing the distribution of the categories in a random sample of the given size taken from the population. That's an array consisting of the sample proportions in all the different categories, in the same order in which they appeared in the population distribution.
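To make the description above concrete, here is a sketch of an equivalent computation using only NumPy. The function name `sample_proportions_sketch` is hypothetical, and this is not presented as the library's own code. It relies on the fact that the category counts in a random sample drawn with replacement follow a multinomial distribution.

```python
import numpy as np

def sample_proportions_sketch(sample_size, probabilities):
    """Hypothetical stand-in: category proportions in a random sample
    of the given size, drawn with replacement."""
    # Draw the counts in each category for the whole sample in one step
    counts = np.random.multinomial(sample_size, probabilities)
    # Convert counts to proportions; the category order is unchanged
    return counts / sample_size

# Proportions in a sample of 100 from a three-category population
proportions = sample_proportions_sketch(100, [0.25, 0.5, 0.25])
```

The result has one entry per category, and the entries add up to 1 apart from possible floating-point rounding.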

For example, suppose each plant of a species is red-flowering with chance 25%, pink-flowering with chance 50%, and white-flowering with chance 25%, regardless of the flower colors of all other plants. You can use `sample_proportions` to see the proportions of the different colors among 100 plants of the species.

In [87]:
# Species distribution of flower colors:
# Proportions are in the order Red, Pink, White
species_proportions = [0.25, 0.5, 0.25]

sample_size = 100

# Distribution of sample
sample_distribution = sample_proportions(sample_size, species_proportions)
sample_distribution

array([ 0.29,  0.55,  0.16])

As you would expect, the proportions in the sample sum to 1.

In [88]:
sum(sample_distribution)

1.0

The categories in `species_proportions` are in the order Red, Pink, White. That order is preserved by `sample_proportions`. If you just want the proportion of pink-flowering plants in the sample, you can use `item`:

In [89]:
# Sample proportion of pink-flowering plants
sample_distribution.item(1)

0.55

You can use `sample_proportions` and array operations to answer questions based only on the proportions of sampled individuals in the different categories. You will not be able to answer questions that require more detailed information about the sample, such as which of the sampled plants had each of the different colors.
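As an illustration of such array operations, the sketch below starts from the sample distribution displayed earlier (entered by hand so that the cell runs on its own) and the sample size of 100 used in the code. It converts the proportions back to counts and finds the category whose sample proportion is furthest from its population proportion. The variable names are chosen for illustration.

```python
import numpy as np

# Population distribution, in the order Red, Pink, White
species_proportions = np.array([0.25, 0.5, 0.25])

# The sample distribution shown above, from a sample of 100 plants
sample_distribution = np.array([0.29, 0.55, 0.16])
sample_size = 100

# Number of sampled plants of each color
sample_counts = np.round(sample_distribution * sample_size).astype(int)

# Distance between sample and population proportions, color by color
abs_diffs = np.abs(sample_distribution - species_proportions)

# Index of the color with the largest difference (0 = Red, 1 = Pink, 2 = White)
largest_gap_index = np.argmax(abs_diffs)
```

Here the counts are 29 red, 55 pink, and 16 white, and the largest difference is in the white-flowering category. None of this tells you which individual plants had which color; that information is not part of the output of `sample_proportions`.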