## How do I select survey participants from a pool of potential customers at my soon-to-be-opened sandwich shop in order to plan the menu?

## Introduction

**Business Context.** Your friends want to open a sandwich shop in an area surrounded by office buildings. This sounds like a good business opportunity! However, new ventures, especially in the restaurant business, require careful planning to increase the likelihood of a successful opening and a sustainable business afterwards. A key aspect of this planning is understanding what your potential customers would like to order and which menu items will be popular. Your friends will conduct a survey to learn more about the preferences of their potential customers. However, they are unsure whom to send the survey to, so they have hired you as a consultant to help them figure this out.

**Business Problem.** Your task is to **devise a plan for how to select people from the potential customer base to send a survey to, while still obtaining answers helpful for planning the menu.**

**Analytical Context.** The case will proceed as follows: you will (1) understand some more background on the problem and pull in the data; (2) learn about and use some simple methods of sampling our population; (3) look at shortcomings of those simple methods and turn to more sophisticated methods instead; and finally (4) implement those complex methods and analyze the results.

## Background on the problem

Your new sandwich shop will be located in a busy neighborhood in Bogotá, and you have estimated that there are about 100,000 potential customers. There is a mix of residential and commercial buildings around your shop. In such a scenario, it is impractical, if not outright infeasible, to ask every potential customer about their preferences. We need to find a way to select a sample of individuals whose answers, when put together, will provide useful information for creating a menu for the entire potential customer base.

### Designing survey questions

Designing survey questions requires careful thought and some domain expertise. After consulting with experts in the restaurant planning business, we have planned to ask the following questions:

1. What kind of meat/veggie do you prefer? (a) steak (b) roasted veggies (c) chicken (d) ham
2. What is your preference for cheese? (a) provolone (b) Swiss (c) cheddar (d) doble crema
3. What is your preference for bread? (a) white (b) wheat (c) rye (d) artisanal
4. Do you prefer to eat out for lunch or dinner? (a) lunch (b) dinner
5. What is your preference for dessert? (a) oblea (b) tres leches (c) arequipe (d) ice cream
6. What days do you prefer to eat out? (a) weekdays (b) weekends

These questions will help us put together a nice menu. This is a great start; however, putting together a survey that proves useful to our enterprise requires more than just asking a few good questions.

Recall the task is to *devise a plan for how to select people from the potential customer base to send a survey to while still obtaining answers helpful for planning the menu*. This involves more than just knowing what questions to ask; it involves knowing the profiles of people to ask and how to source those people!

### Exercise 1:

Talk to your group and list some ways to approach sampling in this dataset.

**Answer.**

---------

Four common approaches are convenience sampling, simple random sampling, stratified sampling, and cluster sampling. We'll now look at sampling from an invented target population using these four methods.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from numpy import random as r

In [None]:
## Creating a dataset.
def get_income_quartile(i):
    rand = r.random()
    if df.loc[i,'building'] == 1:
        if rand < 0.25:
            return 1
        elif rand < 0.5:
            return 2
        elif rand < 0.75:
            return 3
        else:
            return 4
    elif df.loc[i,'building'] == 2:
        if rand < 0.15:
            return 1
        elif rand < 0.35:
            return 2
        elif rand < 0.7:
            return 3
        else:
            return 4
    elif df.loc[i,'building'] == 3:
        if rand < 0.4:
            return 1
        elif rand < 0.8:
            return 2
        elif rand < 0.9:
            return 3
        else:
            return 4
    elif df.loc[i,'building'] == 4:
        if rand < 0.1:
            return 1
        elif rand < 0.5:
            return 2
        elif rand < 0.9:
            return 3
        else:
            return 4
        

columns = ['building','kids', 'crosses_corner', 'income_quartile', 'desserts', 'weekday/weekend', 'lunch/dinner', 'meat', 'cheese', 'bread']
index = np.arange(100000)
df = pd.DataFrame(columns=columns, index=index)
df['building'] = [1] * 25000 + [2] * 25000 + [3] * 25000 + [4] * 25000
df['crosses_corner'] = [r.random() > 0.75 for _ in range(len(index))]
df['income_quartile'] = [get_income_quartile(i) for i in range(len(index))]
df['kids'] = [r.choice([0, 1, 2, 3, 4], p=[0.15, 0.25, 0.45, 0.1, 0.05]) for _ in range(len(index))]

In [None]:
def get_answers(i):
    building, kids, corner, income = df.loc[i,['building', 'kids', 'crosses_corner', 'income_quartile']]
    seed_dessert = r.random() # Desserts
    seed_weekday = r.random() # Weekday/weekend
    seed_lunch = r.random() # Lunch/dinner
    seed_meat = r.random() # Meat
    seed_cheese = r.random() # Cheese
    seed_bread = r.random() # Bread
    
    desserts = ''
    weekday = ''
    lunch = ''
    meat = ''
    cheese = ''
    bread = ''
    
    dessert_options = ['oblea', 'tres leches', 'arequipe', 'ice cream']
    weekday_options = ['weekday', 'weekend']
    lunch_options = ['lunch', 'dinner']
    meat_options = ['roasted veggies', 'chicken', 'ham', 'steak']
    cheese_options = ['provolone', 'Swiss', 'cheddar', 'doble crema']
    bread_options = ['white', 'wheat', 'rye', 'artisinal']
    
    def responses(dessert_probs, weekday_probs, lunch_probs, meat_probs, cheese_probs, bread_probs):
        responses = []
        for i, prob in enumerate(dessert_probs):
            if seed_dessert < prob:
                responses.append(dessert_options[i])
                break
        for i, prob in enumerate(weekday_probs):
            if seed_weekday < prob:
                responses.append(weekday_options[i])
                break
        for i, prob in enumerate(lunch_probs):
            if seed_lunch < prob:
                responses.append(lunch_options[i])
                break
        for i, prob in enumerate(meat_probs):
            if seed_meat < prob:
                responses.append(meat_options[i])
                break
        for i, prob in enumerate(cheese_probs):
            if seed_cheese < prob:
                responses.append(cheese_options[i])
                break
        for i, prob in enumerate(bread_probs):
            if seed_bread < prob:
                responses.append(bread_options[i])
                break
        return responses
    
    dessert_probs = [1] * len(dessert_options)
    weekday_probs = [1] * len(weekday_options)
    lunch_probs = [1] * len(lunch_options)
    meat_probs = [1] * len(meat_options)
    cheese_probs = [1] * len(cheese_options)
    bread_probs = [1] * len(bread_options)
    
    
    
    # The part that actually requires judgment
    
    if kids == 0:
        seed_lunch = -np.exp(-3 * seed_lunch) + 1
        if income == 1:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.2, 0.4, 0.7, 1]
                cheese_probs = [0.23, 0.46, 0.69, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
        elif income == 2:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.25, 0.5, 0.75, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
        elif income == 3:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.4, 0.7, 0.85, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
        elif income == 4:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.45, 0.8, 0.9, 1]
                cheese_probs = [0.33, 0.66, 0.99, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.35, 0.7, 0.9, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
    elif kids == 1:
        seed_lunch = -np.exp(-2 * seed_lunch) + 1
        if income == 1:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.2, 0.4, 0.7, 1]
                cheese_probs = [0.23, 0.46, 0.69, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
        elif income == 2:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.25, 0.5, 0.75, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
        elif income == 3:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.4, 0.7, 0.85, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
        elif income == 4:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.45, 0.8, 0.9, 1]
                cheese_probs = [0.33, 0.66, 0.99, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.35, 0.7, 0.9, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
    elif kids == 2:
        seed_lunch = -np.exp(-seed_lunch) + 1
        if income == 1:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.2, 0.4, 0.7, 1]
                cheese_probs = [0.23, 0.46, 0.69, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
        elif income == 2:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.25, 0.5, 0.75, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
        elif income == 3:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.4, 0.7, 0.85, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
        elif income == 4:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.45, 0.8, 0.9, 1]
                cheese_probs = [0.33, 0.66, 0.99, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.35, 0.7, 0.9, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
    elif kids == 3:
        seed_lunch = -np.exp(-0.5 * seed_lunch) + 1
        if income == 1:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.2, 0.4, 0.7, 1]
                cheese_probs = [0.23, 0.46, 0.69, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
        elif income == 2:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.25, 0.5, 0.75, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
        elif income == 3:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.4, 0.7, 0.85, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
        elif income == 4:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.45, 0.8, 0.9, 1]
                cheese_probs = [0.33, 0.66, 0.99, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.35, 0.7, 0.9, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
    elif kids == 4:
        seed_lunch = -np.exp(-0.25 * seed_lunch) + 1
        if income == 1:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.2, 0.4, 0.7, 1]
                cheese_probs = [0.23, 0.46, 0.69, 1]
                bread_probs = [0.3, 0.6, 0.9, 1]
        elif income == 2:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.25, 0.5, 0.75, 1]
                cheese_probs = [0.25, 0.5, 0.75, 1]
                bread_probs = [0.25, 0.5, 0.75, 1]
        elif income == 3:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.4, 0.7, 0.85, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.3, 0.6, 0.8, 1]
                cheese_probs = [0.26, 0.52, 0.78, 1]
                bread_probs = [0.2, 0.4, 0.6, 1]
        elif income == 4:
            if corner:
                dessert_probs = [0.4, 0.6, 0.8,1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.45, 0.8, 0.9, 1]
                cheese_probs = [0.33, 0.66, 0.99, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
            else:
                dessert_probs = [0.1, 0.5, 0.75, 1]
                weekday_probs = [0.5, 1]
                lunch_probs = [0.5, 1]
                meat_probs = [0.35, 0.7, 0.9, 1]
                cheese_probs = [0.3, 0.6, 0.9, 1]
                bread_probs = [0.15, 0.3, 0.45, 1]
    
    return responses(dessert_probs, weekday_probs, lunch_probs, meat_probs, cheese_probs, bread_probs)

In [None]:
get_answers(0)

In [None]:
df.head()

The code below creates `data.csv`, which is loaded into a DataFrame later. It takes a long time to run, so it is provided for reference only; you should use the `data.csv` file already provided with this case.

``` py
for i in df.index:
    responses = get_answers(i)
    for j, col in enumerate(['desserts', 'weekday/weekend', 'lunch/dinner', 'meat', 'cheese', 'bread']):
        df.loc[i, col] = responses[j]
        
df.to_csv('data.csv')
```

## Target populations

We are interested in learning about attributes of the individuals in a population. In our case, the attributes we are interested in are the preferences that will help us design a good menu for our sandwich shop.

The population we intend to study is called the **target population**. In our case the population we would like to study is the set of potential customers, which is the set of all (~100,000) people working in the buildings surrounding our new shop:

In [None]:
df = pd.read_csv('data.csv', index_col=0)
df.head()

The above dataset contains the true preferences of all the 100,000 customers in our target population. For the sake of this problem, we'll sample from this dataset. However, in a real-life sampling context, we can't view this entire dataset, and we must attempt to draw conclusions based on imperfect and/or incomplete samples. Keep this in mind as we go through this case – we will make note of when this restriction is likely to affect real-life workflows vs. our workflow here, as well as how to generally deal with it.

### Convenience sampling

One naive way to collect a sample would be to pick a fixed location on the corner of our shop and give the questionnaire to the first 100 people willing to take it. This is called a **convenience sample**.

### Exercise 2:

What are some problems with a convenience sample?

**Answer.**

---------

Let's sample from the population based on whether or not a potential customer crosses the corner where we are sampling:

In [None]:
survey_answers = ['desserts', 'weekday/weekend', 'lunch/dinner', 'meat', 'cheese', 'bread']

In [None]:
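# Draw 100 distinct people among those who cross our corner
# (replace=False keeps the same person from being surveyed twice)
corner_idx = df.loc[df['crosses_corner']].index
conv_sample = df.loc[r.choice(corner_idx, 100, replace=False)]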

for col in survey_answers:
    fig, ax = plt.subplots()
    ax.set_title(col + ' Value Counts')
    conv_sample[col].value_counts().plot(kind='bar')
    plt.show()

Of course, the results above exhibit randomness, and will not stay the same each time we run the same block of code. This is because we will be sampling a different 100 people every time.

As statisticians, we should find this deeply disturbing: our first sample indicated that roasted veggies are significantly more popular than chicken, but a more careful look shows that about 30% of the time, chicken actually wins! The moral of the story is that drawing a small sample from a large population has inherent variability and can lead us astray if we aren't careful with how we conduct our sampling.
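
One way to see this kind of variability is to repeat the draw many times and tally which answer comes out on top. The following is a minimal sketch on an invented two-option population (the `toy` frame, its 52/48 split, and the helper `top_answer` are assumptions for illustration, not part of the case data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy population: two options with similar true popularity.
toy = pd.DataFrame({
    "meat": rng.choice(["roasted veggies", "chicken"], size=10_000, p=[0.52, 0.48])
})

def top_answer(sample, col):
    """Return the most frequent answer in one column of a sample."""
    return sample[col].value_counts().idxmax()

n_repeats, sample_size = 200, 100
winners = []
for _ in range(n_repeats):
    # Draw 100 distinct row positions and record the winning answer.
    rows = rng.choice(len(toy), size=sample_size, replace=False)
    winners.append(top_answer(toy.iloc[rows], "meat"))
winners = pd.Series(winners)

# Share of repetitions in which each option came out on top.
win_share = winners.value_counts(normalize=True)
print(win_share)
```

When two options are close in the population, the less popular one can still win a noticeable share of small samples.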

Let's see how well the bar charts above match the true values in our full population dataset. Of course, note that in a real-life setting, we would not have the luxury of being able to compare our samples to the full population – this is just for instructional purposes:

In [None]:
# Draw a fresh convenience sample of 100 distinct people who cross our corner
corner_idx = df.loc[df['crosses_corner']].index
conv_sample = df.loc[r.choice(corner_idx, 100, replace=False)]

for col in survey_answers:
    fig, ax = plt.subplots()
    ax.set_title(col + ' Value Counts')
    conv_sample[col].value_counts().plot(kind='bar')
    plt.show()
    fig, ax = plt.subplots()
    ax.set_title(col + ' Value Counts in True Population')
    df[col].value_counts().plot(kind='bar')
    plt.show()

Once again, we should be very disheartened to see these results. It seems like our convenience sampling isn't representative of our population as a whole along most metrics. This motivates us to look at other sampling methods.

## Simple random sampling

Rather than sampling based on our convenience, we can be a little more scientific by randomly selecting potential customers from the entirety of our target population. The approach here is as follows:

1. Obtain 100 elements from {1, 2, ..., 100000} using a random number generator
2. Apply the questionnaire to the members of the population associated with the numbers we obtained in step 1

The procedure just described can be applied for any population with $N$ elements and a desired sample size $n$, with $n < N$. This is known as **simple random sampling**.
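
The two steps above can be sketched as follows. This is an illustration on an invented population (the `population` frame and its uniformly random `cheese` answers are assumptions), not the case's `data.csv`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

N, n = 100_000, 100  # population size and desired sample size

# Hypothetical population with one survey answer per person.
population = pd.DataFrame({
    "cheese": rng.choice(["provolone", "Swiss", "cheddar", "doble crema"], size=N)
})

# Step 1: draw n distinct positions from {0, ..., N-1}.
chosen = rng.choice(N, size=n, replace=False)

# Step 2: "apply the questionnaire" to the selected members.
srs = population.iloc[chosen]
print(srs["cheese"].value_counts(normalize=True))
```

In pandas, `population.sample(n=100)` draws a simple random sample without replacement in a single call.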

### Exercise 3:

Create bar charts to view the survey results under a simple random sample of the dataset. Also, plot a bar chart that shows the variability of the top answer for `cheese` over 200 simple random samples.

**Answer.**

---------

This illustrates a prime problem with simple random sampling. In theory, our sample is supposed to be representative of the population. However, because our samples are so small, the variance between samples is significant.

### Exercise 4:

Create a bar chart that shows the variability in the top `cheese` choice in 200 simple random samples when each sample is of size 50000 rather than 100.

**Answer.**

---------

This shows how we can reduce variability by increasing the sample size. But in the real world, we'll run up against cost and time constraints (in the extreme, this approach reduces to surveying every single person in the target population). We will now consider methods that aim to improve representativeness of the population without requiring such prohibitively large samples.
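
To put rough numbers on this trade-off, the standard error of a sample proportion under simple random sampling without replacement is $\sqrt{p(1-p)/n}\,\sqrt{(N-n)/(N-1)}$. The sketch below evaluates it for a few sample sizes; the value `p = 0.25` is an assumed share chosen only for illustration:

```python
import math

def standard_error(p, n, N):
    """Standard error of a sample proportion under simple random sampling
    without replacement, including the finite population correction."""
    fpc = math.sqrt((N - n) / (N - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

N = 100_000
p = 0.25  # assumed true share of one answer, for illustration only

for n in (100, 1_000, 10_000, 50_000):
    print(f"n = {n:>6}: standard error = {standard_error(p, n, N):.4f}")
```

The error shrinks roughly with the square root of the sample size, so each further gain in precision costs disproportionately more surveys.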

### Heterogeneity

Summaries like the estimated proportion of people who often go for the vegetarian option implicitly assume that the data is generated by a single, simple model. Under this assumption, breaking down the population into categories, like income bracket, should create sub-populations with the same behavior as the entire population.

Of course, this is often not the case: project managers, because of their higher income, may well go for healthier options with pricier ingredients more regularly than workers with noticeably lower income. If splitting the population along a certain feature produces sub-populations with notably different behaviors, then we say that the population is **heterogeneous**.

Heterogeneity hurts the precision of the statistics obtained under simple random sampling. But we can make heterogeneity our friend with a divide-and-conquer strategy: we can group the population by income bracket, and perform simple random sampling on each group!
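
A quick way to check for heterogeneity is to compare the overall answer shares with the shares inside each subgroup. The sketch below does this on an invented population; the bracket labels and the 20% versus 60% vegetarian rates are assumptions made up for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_per_group = 5_000

# Hypothetical preferences that differ sharply between two income brackets.
low = rng.choice(["veggie", "meat"], size=n_per_group, p=[0.2, 0.8])
high = rng.choice(["veggie", "meat"], size=n_per_group, p=[0.6, 0.4])

toy = pd.DataFrame({
    "income_bracket": ["low"] * n_per_group + ["high"] * n_per_group,
    "choice": np.concatenate([low, high]),
})

# Overall shares hide the difference between the brackets...
overall = toy["choice"].value_counts(normalize=True)
# ...while row-normalized shares per bracket reveal it.
by_bracket = pd.crosstab(toy["income_bracket"], toy["choice"], normalize="index")

print(overall)
print(by_bracket)
```

If the rows of `by_bracket` look alike, the feature does little to separate behavior; if they differ markedly, it is a natural candidate for grouping.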

## Stratified sampling

**Stratified sampling** is a strategy designed to take into account the heterogeneity of the population. By accounting for it, we obtain estimates with smaller standard error. There are three steps to forming a stratified sample:

1. Choose the feature we want to use to divide the population into subgroups (strata).
2. Randomly sample a proportional amount from each subgroup. For example, if 20% of our population has 2 children, 20% of our sample should come from this subgroup.
3. Combine these samples to form a stratified sample of the entire dataset.
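
These three steps can be sketched as a small helper. The function name `stratified_sample` and the toy population are hypothetical, and the per-stratum counts are simply rounded, which is an assumption about how to handle fractional allocations:

```python
import numpy as np
import pandas as pd

def stratified_sample(frame, feature, n, rng):
    """Proportionally allocated stratified sample of roughly n rows."""
    # Step 1: the strata are the categories of `feature`, with their shares.
    shares = frame[feature].value_counts(normalize=True)
    parts = []
    for category, share in shares.items():
        members = frame.index[frame[feature] == category].to_numpy()
        # Step 2: sample a proportional number of rows from this stratum.
        k = int(round(n * share))
        picked = rng.choice(members, size=k, replace=False)
        parts.append(frame.loc[picked])
    # Step 3: combine the per-stratum samples.
    return pd.concat(parts)

rng = np.random.default_rng(3)

# Hypothetical population: 20% with 0 kids, 50% with 1 kid, 30% with 2 kids.
toy = pd.DataFrame({"kids": [0] * 2_000 + [1] * 5_000 + [2] * 3_000})
toy["dessert"] = rng.choice(["oblea", "tres leches", "arequipe", "ice cream"], size=len(toy))

sample = stratified_sample(toy, "kids", 100, rng)
print(sample["kids"].value_counts())
```

Because of the rounding, the total can differ from `n` by a few rows when the shares do not divide evenly.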

### Exercise 5:

For the columns `building`, `kids`, and `income_quartile` in `df`, find the proportion of the population in each category within these features. (Note that in practice, this may not be easy to compute, but for our sake we can calculate the proportion precisely.)

**Answer.**

---------

### Exercise 6:

For each of the above features, take a stratified sample of `df` along the categories of said feature. (Please stratify along each feature one at a time, rather than all three at once; this means that you will be taking three separate samples). Plot the bar chart of dessert preferences under each stratification, as well as the variability of most favored dessert over 200 iterations under each stratification.

**Answer.**

---------

## Cluster sampling

The mechanics of **cluster sampling** are very similar to stratified sampling. However, both the motivation and the technique are different. For stratified sampling, we choose a categorical breakdown of our dataset, and randomly sample within each of our breakdowns. For cluster sampling, we split up our dataset into representative clusters, and then *treat each of those clusters as an entire population*. We might choose our clusters randomly, or have other domain-specific reasons to believe that they are representative of the population.

The main reason we would want to do this is that knowledge constraints can keep us from effectively stratifying our population. For example, we may not have access to the proportions of people that fall under each category for the population as a whole, but we can figure this out for Building 1. Then, we treat Building 1 as our entire population, and can perform stratified sampling for categories within this population.

Again, this *crucially* relies on having some extrinsic (and often qualitative) reason for believing that Building 1 is representative of the entire target population. If, for example, Building 1 is filled with residents with the highest-paying jobs, then this can be a very bad way to sample.
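
A minimal sketch of the idea: pick one cluster, then sample only inside it. The toy population is invented, and choosing the building at random is an assumption standing in for whatever domain reasoning justifies the choice in practice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical population spread evenly over four buildings (the clusters).
toy = pd.DataFrame({
    "building": np.repeat([1, 2, 3, 4], 2_500),
    "dessert": rng.choice(["oblea", "tres leches", "arequipe", "ice cream"], size=10_000),
})

# Pick one cluster and treat it as if it were the whole population.
chosen_building = rng.choice(toy["building"].unique())
cluster = toy[toy["building"] == chosen_building]

# Sample within the chosen cluster (here a simple random sample of 100 people).
rows = rng.choice(len(cluster), size=100, replace=False)
cluster_sample = cluster.iloc[rows]

print(chosen_building)
print(cluster_sample["dessert"].value_counts(normalize=True))
```

The within-cluster step could equally be a stratified sample; what defines the approach here is that every respondent comes from the selected cluster.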

### Exercise 7:

Find the proportions of individuals falling into each category in Building 1, then perform a stratified sample of dessert preferences based on `kids`. Plot the bar chart of most favored dessert over 200 iterations under this stratification.

**Answer.**

---------

## Conclusions

In this case, we looked at various methods of sampling our sandwich shop's potential customer base. We saw how purely random stratified samples are the gold standard, with high precision and often high representativeness of the population. However, it can be difficult to identify the proportions of categories in our population, so in such cases we need to resort to cluster sampling in order to get measurable results.

A pure convenience sample is generally a recipe for disaster. But sampling is the part of statistics where data scientists have to think of ways to balance cost and time with good results. Without good data, everything in this case falls apart. But without cost-consciousness, even the strongest theory can't be put into practice.

## Takeaways

In this case, you developed your skills in identifying what sampling technique is most appropriate for the problem at hand. You now know that:

1. The behaviour of our statistics is driven by the sampling design we use, which in turn affects the quality of our answers
2. Different sampling designs have different capabilities and limitations
3. Selecting an appropriate sampling design depends heavily on the context of the problem at hand