Generating random numbers
You've used .sample() to generate pseudo-random numbers from a set of values in a DataFrame. A related task is to generate random numbers that follow a statistical distribution, like the uniform distribution or the normal distribution.

Each random number generation function has distribution-specific arguments and an argument for specifying the number of random numbers to generate.

matplotlib.pyplot is loaded as plt, and numpy is loaded as np.

Generate 5000 numbers from a uniform distribution, setting the parameters low to -3 and high to 3
Generate 5000 numbers from a normal distribution, setting the parameters loc to 5 and scale to 2


In [None]:
# Generate random numbers from a Uniform(-3, 3)
uniforms = np.random.uniform(low=-3, high=3, size=5000)

# Generate random numbers from a Normal(5, 2)
normals = np.random.normal(loc=5, scale=2, size=5000)


# Print normals
print(normals)

Plot a histogram of uniforms with bins of width 0.25 from -3 to 3 using plt.hist().

In [None]:
# Generate random numbers from a Uniform(-3, 3)
uniforms = np.random.uniform(low=-3, high=3, size=5000)

# Plot a histogram of uniform values, binwidth 0.25
plt.hist(uniforms, bins=np.arange(-3, 3.25, 0.25))
plt.show()

Plot a histogram of normals with bins of width 0.5 from -2 to 13 using plt.hist().

In [None]:
# Generate random numbers from a Normal(5, 2)
normals = np.random.normal(loc=5, scale=2, size=5000)

# Plot a histogram of normal values, binwidth 0.5
plt.hist(normals, bins=np.arange(-2, 13.5, 0.5))
plt.show()
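
As a numeric complement to the histograms, the sketch below checks that the draws behave as the distribution-specific arguments suggest. The seed value 0 is an arbitrary choice made here only so the check is repeatable; it is not part of the exercise.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed (an assumption), only for repeatability

uniforms = np.random.uniform(low=-3, high=3, size=5000)
normals = np.random.normal(loc=5, scale=2, size=5000)

# Uniform draws stay inside the half-open interval [low, high)
print(uniforms.min() >= -3, uniforms.max() < 3)

# Normal draws should have a mean near loc and a standard deviation near scale
print(round(normals.mean(), 2), round(normals.std(), 2))
```

With 5000 draws, the sample mean and standard deviation should land close to 5 and 2, though not exactly on them.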

Understanding random seeds
While random numbers are important for many analyses, they create a problem: the results you get can differ from one run to the next. This can cause awkward conversations with your boss when your script for calculating the sales forecast gives different answers each time.

Setting the seed for numpy's random number generator helps avoid such problems by making the random number generation reproducible.
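
A minimal sketch of what reproducibility means here: re-seeding with the same value restarts the random stream, whereas drawing again without re-seeding continues it. The variable names first, second, and repeat are illustrative only.

```python
import numpy as np

# Seed, then draw five standard normal values
np.random.seed(123)
first = np.random.normal(size=5)

# A second draw without re-seeding continues the same stream
second = np.random.normal(size=5)

# Re-seeding with the same value restarts the stream from the beginning
np.random.seed(123)
repeat = np.random.normal(size=5)

print(np.array_equal(first, repeat))  # True
print(np.array_equal(first, second))  # False
```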

Question
Which statement about x and y is true?

import numpy as np
np.random.seed(123)
x = np.random.normal(size=5)
y = np.random.normal(size=5)
Possible Answers

x and y have identical values.

The first value of x is identical to the first value of y, but other values are different.

The values of x are different from those of y.

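Ans: The values of x are different from those of y. The seed is set only once, so the second call to np.random.normal() continues the random stream rather than restarting it.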

Simple random sampling
The simplest method of sampling a population is the one you've seen already. It is known as simple random sampling (sometimes abbreviated to "SRS"), and involves picking rows at random, one at a time, where each row has the same chance of being picked as any other.

In this chapter, you'll apply sampling methods to a synthetic (fictional) employee attrition dataset from IBM, where "attrition" in this context means leaving the company.

attrition_pop is available; pandas as pd is loaded.

Sample 70 rows from attrition_pop using simple random sampling, setting the random seed to 18900217.
Print the sample dataset, attrition_samp. What do you notice about the indices?

In [None]:
# Sample 70 rows using simple random sampling and set the seed
attrition_samp = attrition_pop.sample(n=70, random_state=18900217)

# Print the sample
print(attrition_samp)
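
On the question about the indices: .sample() keeps each row's original index label, so the sample's index is a scattered subset of the population's index rather than 0, 1, 2, and so on. The sketch below illustrates this on a small made-up DataFrame (pop and its Age column are hypothetical stand-ins, not the attrition data).

```python
import pandas as pd

# Hypothetical stand-in for the population (not the real attrition data)
pop = pd.DataFrame({"Age": range(20, 120)})

# Simple random sample of 5 rows with a fixed seed
samp = pop.sample(n=5, random_state=18900217)

# The index labels are the rows' original positions in pop
print(samp.index.tolist())

# Optionally renumber the sample from 0 if the original labels are not needed
renumbered = samp.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```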

Systematic sampling
One sampling method that avoids randomness is called systematic sampling. Here, you pick rows from the population at regular intervals.

For example, if the population dataset had one thousand rows, and you wanted a sample size of five, you could pick rows 0, 200, 400, 600, and 800.
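
The example above can be sketched on a made-up population of one thousand rows (the population DataFrame and its value column are assumptions for illustration):

```python
import pandas as pd

# Hypothetical population of 1000 rows
population = pd.DataFrame({"value": range(1000)})

sample_size = 5

# Integer division gives the spacing between sampled rows
interval = len(population) // sample_size  # 200

# Take every interval-th row, starting at row 0
systematic_sample = population.iloc[::interval]
print(systematic_sample.index.tolist())  # [0, 200, 400, 600, 800]
```

When the population size is not an exact multiple of the sample size, slicing this way can return one row more than requested, so it may be worth checking the length of the result.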

attrition_pop is available; pandas has been pre-loaded as pd.

Set the sample size to 70.
Calculate the population size from attrition_pop.
Calculate the interval between the rows to be sampled.

In [None]:
# Set the sample size to 70
sample_size = 70

# Calculate the population size from attrition_pop
pop_size = len(attrition_pop)

# Calculate the interval
interval = pop_size // sample_size

Systematically sample attrition_pop to get the rows of the population at each interval, starting at 0; assign the rows to attrition_sys_samp.

In [None]:
# Systematically sample every interval-th row, starting at row 0
attrition_sys_samp = attrition_pop.iloc[::interval]

# Print the sample
print(attrition_sys_samp)

Is systematic sampling OK?
Systematic sampling has a problem: if the data has been sorted, or there is some sort of pattern or meaning behind the row order, then the resulting sample may not be representative of the whole population. The problem can be solved by shuffling the rows, but then systematic sampling is equivalent to simple random sampling.
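
To make the risk concrete, here is a deliberately extreme sketch with made-up data: the value column repeats 0 through 9 in order, and the sampling interval happens to match that period. The data, the interval, and the seed 42 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 1000 rows whose values cycle 0, 1, ..., 9 in order
population = pd.DataFrame({"value": np.tile(np.arange(10), 100)})

interval = 10

# Sampling in the original row order only ever lands on the value 0
ordered_sample = population.iloc[::interval]
print(ordered_sample["value"].mean())  # 0.0

# Shuffling first breaks the link between row order and value
shuffled = population.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_sample = shuffled.iloc[::interval]
print(shuffled_sample["value"].mean())

# For comparison, the population mean
print(population["value"].mean())  # 4.5
```

The sample taken in row order misses nine of the ten values entirely, while the sample taken after shuffling should land much closer to the population mean.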

Here you'll look at how to determine whether or not there is a problem.

attrition_pop is available; pandas is loaded as pd, and matplotlib.pyplot as plt.

Add an index column to attrition_pop, assigning the result to attrition_pop_id.
Create a scatter plot of YearsAtCompany versus index for attrition_pop_id using pandas .plot().

In [None]:
# Add an index column to attrition_pop
attrition_pop_id = attrition_pop.reset_index()

# Plot YearsAtCompany vs. index for attrition_pop_id
attrition_pop_id.plot(x="index", y="YearsAtCompany", kind="scatter")
plt.show()

Randomly shuffle the rows of attrition_pop.
Reset the row indexes, and add an index column to attrition_shuffled.
Repeat the scatter plot of YearsAtCompany versus index, this time using attrition_shuffled.

In [None]:
# Shuffle the rows of attrition_pop
attrition_shuffled = attrition_pop.sample(frac=1)

# Reset the row indexes and create an index column
attrition_shuffled = attrition_shuffled.reset_index(drop=True).reset_index()

# Plot YearsAtCompany vs. index for attrition_shuffled
attrition_shuffled.plot(x="index", y="YearsAtCompany", kind="scatter")
plt.show()

Does a systematic sample always produce a sample similar to a simple random sample?

Possible Answers

Yes. All sampling (random or non-random) methods will lead us to similar results.

Yes. We should always expect a representative sample for both systematic and simple random sampling.

No. This only holds if a seed has been set for both processes.

No. This is not true if the data is sorted in some way.
Ans: No. This is not true if the data is sorted in some way.