Learn what sampling is and why it is so powerful. You’ll also learn about the problems caused by convenience sampling and the differences between true randomness and pseudo-randomness.

## Simple sampling with pandas
We'll be exploring song data from Spotify. Each row of this population dataset represents a song, and there are over 40,000 rows. Columns include the song name, the artists who performed it, the release year, and attributes of the song like its duration, tempo, and danceability. We'll start by looking at the durations.

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-9.0.0-cp39-cp39-win_amd64.whl (19.6 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-9.0.0
Note: you may need to restart the kernel to use updated packages.


In [7]:
# Load the population dataset (pd.read_feather needs pyarrow installed)
spotify_population = pd.read_feather('spotify_2000_2020.feather')

# Sample 1000 rows from spotify_population
spotify_sample = spotify_population.sample(n=1000)

# Print the sample
print(spotify_sample)

       acousticness                                 artists  danceability  \
23699      0.995000  ['Ludwig van Beethoven', 'Paul Lewis']         0.184   
23608      0.189000                             ['Omarion']         0.775   
23242      0.000272                          ['Audioslave']         0.379   
12297      0.284000                       ['PARTYNEXTDOOR']         0.782   
26735      0.102000        ['Jon Bellion', 'Travis Mendes']         0.574   
...             ...                                     ...           ...   
15198      0.251000              ['Nujabes', 'Substantial']         0.596   
11354      0.017400                                 ['ABN']         0.662   
32953      0.052500                               ['TWICE']         0.708   
4605       0.006810                            ['Kasabian']         0.613   
9504       0.426000   ['El Coyote Y Su Banda Tierra Santa']         0.769   

       duration_ms  duration_minutes   energy  explicit  \
23699     315427
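
Because `.sample()` draws rows pseudo-randomly, rerunning the cell above returns a different set of songs each time. The sketch below illustrates how passing `random_state` makes a draw repeatable. It does not use the Spotify file: `songs` is a hypothetical stand-in built from synthetic durations, and the gamma parameters are arbitrary assumptions chosen only to give plausible song lengths.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for spotify_population: 40,000 synthetic durations
rng = np.random.default_rng(2024)
songs = pd.DataFrame({
    "duration_minutes": rng.gamma(shape=9.0, scale=0.43, size=40_000),
})

# Without random_state, each call draws a different set of rows
draw_a = songs.sample(n=1000)
draw_b = songs.sample(n=1000)

# With the same random_state, the draw is reproducible
seeded_a = songs.sample(n=1000, random_state=42)
seeded_b = songs.sample(n=1000, random_state=42)

# Compare which rows were selected in each pair of draws
print("seeded draws identical:  ", seeded_a.index.equals(seeded_b.index))
print("unseeded draws identical:", draw_a.index.equals(draw_b.index))
```

Setting a seed is useful when results need to be reproduced exactly, for example when sharing a notebook.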

In [8]:
# Calculate the mean duration in mins from spotify_population
mean_dur_pop = spotify_population["duration_minutes"].mean()

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp = spotify_sample["duration_minutes"].mean()

# Print the means
print(mean_dur_pop)
print(mean_dur_samp)

3.8521519140899896
3.8994895500000024

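
The two means printed above differ by roughly 0.05 minutes, even though the sample contains only 1000 of the 40,000+ songs. As a rough illustration of how sample size affects this gap, the following sketch draws samples of increasing size from a synthetic population. The `population` Series and its gamma parameters are assumptions made for the example, not the Spotify data, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

# Synthetic population of 40,000 "durations" (an assumption, not real data)
rng = np.random.default_rng(0)
population = pd.Series(
    rng.gamma(shape=9.0, scale=0.43, size=40_000),
    name="duration_minutes",
)
pop_mean = population.mean()

# Draw samples of increasing size and record each sample mean
sample_means = {}
for n in (10, 100, 1000, 10_000):
    sample_means[n] = population.sample(n=n, random_state=1).mean()

# Report how far each sample mean falls from the population mean
print(f"population mean: {pop_mean:.4f}")
for n, samp_mean in sample_means.items():
    error = abs(samp_mean - pop_mean)
    print(f"n={n:>6}: sample mean={samp_mean:.4f}, abs. error={error:.4f}")
```

Larger samples tend to land closer to the population mean, although any single small sample can get lucky or unlucky.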

## Simple sampling and calculating with NumPy
We can also use NumPy to calculate population parameters or sample statistics from a list or a pandas Series.

We'll be turning it up to eleven and looking at the loudness property of each song.

In [9]:
# Create a pandas Series from the loudness column of spotify_population
loudness_pop = spotify_population['loudness']

# Sample 100 values of loudness_pop
loudness_samp = loudness_pop.sample(n=100)

# Print the sample
print(loudness_samp)

40608    -7.873
18698   -35.348
20095   -10.304
1563    -10.310
11650    -6.533
          ...  
39791    -3.231
37305    -4.649
2064     -3.680
8412     -8.694
9489     -5.844
Name: loudness, Length: 100, dtype: float64


In [12]:
# Calculate the mean of loudness_pop
mean_loudness_pop = np.mean(loudness_pop)

# Calculate the mean of loudness_samp
mean_loudness_samp = np.mean(loudness_samp)

# Print the means
print(mean_loudness_pop)
print(mean_loudness_samp)

-7.366856851353918
-7.313239999999997
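
Here the sample mean (about -7.31) is again close to the population mean (about -7.37). To make the point about input types concrete, this small sketch applies `np.mean()` to a plain Python list and to a pandas Series holding the same values, and compares both with the Series' own `.mean()` method. The five values are the first five shown in the printed loudness sample above; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# First five loudness values from the printed sample above
loudness_list = [-7.873, -35.348, -10.304, -10.310, -6.533]
loudness_series = pd.Series(loudness_list, name="loudness")

# np.mean() accepts a list or a pandas Series
mean_from_list = np.mean(loudness_list)
mean_from_series = np.mean(loudness_series)

# The pandas method gives the same answer
mean_from_method = loudness_series.mean()

print(mean_from_list)
print(mean_from_series)
print(mean_from_method)
```

All three calls compute the same arithmetic mean, so the choice between them is mostly a matter of convenience.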
