# Understanding Randomness in Machine Learning

Randomness plays a significant role in machine learning, serving as both a tool and a feature in data preparation and learning algorithms.
- Used well, it helps these algorithms produce more robust models and more accurate predictions.
- To grasp the importance of statistical methods in machine learning, it's essential to understand where randomness comes from in this field.
- In software, randomness is produced by a mathematical technique known as a pseudorandom number generator.

In this tutorial, we will explore pseudorandom number generators and learn when to manage and control randomness in machine learning. By the end of this tutorial, we will gain insights into:

- The origins of randomness in applied machine learning, with a focus on algorithms.
- What pseudorandom number generators are and how to use them in Python.
- When to control the sequence of random numbers and when to control for randomness.

---


# 1. Randomness in Machine Learning

Randomness is a fundamental element in applied machine learning. It's a valuable tool that helps learning algorithms become more robust and generate accurate predictions and models. Let's explore a few aspects of randomness in this context.

## 1.1. Randomness in Data

- When we collect data for training and evaluating machine learning models, there's often a random element involved.
  - This data might contain errors and uncertainties, making it challenging to establish a clear relationship between inputs and outputs.

  ---

## 1.2. Randomness in Evaluation

- In machine learning, we typically work with a small sample of data, not every observation from the problem domain. To account for this, we use randomness in our evaluation procedures.
  - For example, techniques like k-fold cross-validation allow us to assess the model's performance on different subsets of the available data, providing a more comprehensive understanding of its capabilities.
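
As a rough illustration of how randomness enters evaluation, the sketch below shuffles the row indices of a small dataset and deals them into k folds using only the standard library. The helper name `k_fold_indices` and the dataset size are assumptions made for this example, not part of any library.

```python
from random import Random

def k_fold_indices(n_samples, k, seed_value):
    """Hypothetical helper: shuffle indices 0..n_samples-1 and split them into k folds."""
    rng = Random(seed_value)          # private generator, so global state is untouched
    indices = list(range(n_samples))
    rng.shuffle(indices)              # the random element of the evaluation
    # Deal the shuffled indices round-robin into k folds of near-equal size
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_samples=10, k=5, seed_value=1)
for i, test_fold in enumerate(folds):
    # Every index outside the held-out fold is used for training
    train = [idx for fold in folds if fold is not test_fold for idx in fold]
    print(f"fold {i}: test={sorted(test_fold)} train_size={len(train)}")
```

Each row is held out exactly once, so the model is scored on every part of the available data.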

  ---

## 1.3. Randomness in Algorithms

- Machine learning algorithms leverage randomness during the learning process.
  - This randomness is a feature that helps algorithms create more effective mappings of data.
  - It helps prevent overfitting to the training set and promotes generalization to a broader problem.

- Algorithms using randomness are often referred to as stochastic algorithms because they introduce controlled randomness into the modeling process. Some common examples of randomness in machine learning algorithms include:

  - Shuffling training data before each training epoch in stochastic gradient descent.
  - Selecting random subsets of input features for split points in random forest algorithms.
  - Using random initial weights in artificial neural networks.
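
The three examples above can be sketched with the standard library alone. The sizes and names used here (`row_indices`, `n_features`, `weights`) are illustrative assumptions; real libraries implement these steps through their own APIs.

```python
from random import Random

rng = Random(1)  # a private, seeded generator for repeatability

# 1. Shuffle the order of training rows before each epoch (as in SGD)
row_indices = list(range(8))
for epoch in range(2):
    rng.shuffle(row_indices)
    print(f"epoch {epoch} row order: {row_indices}")

# 2. Pick a random subset of features to consider at a split (as in random forests)
n_features = 10
candidate_features = rng.sample(range(n_features), 3)
print("features considered at this split:", candidate_features)

# 3. Initialize weights with small random values (as in neural networks)
weights = [rng.uniform(-0.1, 0.1) for _ in range(4)]
print("initial weights:", weights)
```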

In machine learning, there are sources of randomness that we need to control for, such as noise in the data, and sources of randomness that we can harness for our advantage, such as algorithm evaluation and the algorithms themselves.

---


# Pseudorandom Number Generators

- In computer programs and algorithms, we often need a source of randomness, and this is where pseudorandom number generators come into play.
  - True randomness typically comes from a physical source, such as a Geiger counter; in machine learning, we use pseudorandomness instead.
  - Pseudorandom numbers appear close to random but are generated through a deterministic process.

- Some common uses of pseudorandom number generators in machine learning include shuffling data and initializing coefficients with random values.

- These generators are usually functions that you can call to obtain a random number. When called repeatedly, they provide a new random number each time.
  - Additionally, wrapper functions allow you to customize your randomness, choosing between integers, floating-point numbers, specific distributions, and ranges, among other options.

- Pseudorandom numbers are generated in a sequence, and this sequence is entirely determined by an initial number called the seed.
  - If you don't specify a seed explicitly, the generator typically seeds itself from a changing source, such as the current system time or randomness provided by the operating system.
  - The specific value of the seed doesn't matter; you can choose any value you like. What's essential is that using the same seed will result in the same sequence of random numbers.
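
To make the idea of a deterministic sequence concrete, here is a minimal sketch of a linear congruential generator, one of the simplest pseudorandom number generators. It is not the generator Python uses, and the constants are one commonly cited choice (from Numerical Recipes), included purely for illustration. The helper names `lcg` and `take` are assumptions for this example.

```python
def lcg(seed_value, a=1664525, c=1013904223, m=2**32):
    """Yield an endless stream of pseudorandom floats in [0, 1)."""
    state = seed_value
    while True:
        # Each new state depends only on the previous one, so the
        # whole sequence is fixed once the seed is chosen
        state = (a * state + c) % m
        yield state / m

def take(generator, n):
    """Return the next n values from a generator."""
    return [next(generator) for _ in range(n)]

first = take(lcg(seed_value=1), 3)
second = take(lcg(seed_value=1), 3)
other = take(lcg(seed_value=2), 3)

print(first == second)  # compares two runs that share a seed
print(first == other)   # compares two runs with different seeds
```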

Let's solidify this concept with some practical examples.

---


# Generating Random Numbers with Python

Python's standard library includes a module named "random" that provides a set of functions for generating random numbers.
- Python employs a well-established pseudorandom number generator known as the Mersenne Twister.

In this section, we'll explore various scenarios where you can generate and utilize random numbers and randomness using Python's standard API.


## Seeding the Random Number Generator

In the case of pseudorandom number generation, a mathematical function creates a sequence of numbers that appear nearly random.

- To start this sequence, we use a parameter known as the "seed."
  - Importantly, this function is deterministic, which means that given the same seed, it will produce the same sequence of numbers consistently.

- The actual value of the seed is not critical. To seed the pseudorandom number generator in Python, you can use the `seed()` function, supplying it with an integer value like 1 or 7.
  - If you don't explicitly seed the generator, Python seeds it from a randomness source provided by the operating system when one is available, and otherwise falls back to the current system time.

  ---

- Here's an example of how seeding works, generating some random numbers, and showcasing how reseeding the generator results in the same sequence of numbers:
  - This example seeds the pseudorandom number generator with the value 1, generates three random numbers, reseeds the generator, and demonstrates that the same three random numbers are produced.

  - Seeding can be valuable when you want to control the randomness, ensuring that your code consistently produces the same result, such as in a production model.

  - In experimental scenarios where randomization controls for confounding variables, you may choose different seeds for each run of the experiment.


```python
# Seed the pseudorandom number generator
from random import seed, random

# Seed the random number generator
seed(1)

# Generate some random numbers
print(random(), random(), random())

# Reset the seed
seed(1)

# Generate some random numbers
print(random(), random(), random())
```

```
0.13436424411240122 0.8474337369372327 0.763774618976614
0.13436424411240122 0.8474337369372327 0.763774618976614
```


---

## Random Floating Point Values

To generate random floating-point values, you can use Python's `random()` function.
- These values are generated within the range of 0 to 1, specifically within the interval [0, 1).

- Each value is drawn from a uniform distribution, which means that every possible value has an equal chance of being selected.

- Here's an example that generates 10 random floating-point values:

  - Running the example will generate and print each random floating-point value.

- If you need to rescale these floating-point values to a desired range, you can use the following formula:

  ```
  scaled_value = min + (value * (max - min))
  ```

  - Where `min` and `max` represent the minimum and maximum values of the desired range, and `value` is the randomly generated floating-point value in the range between 0 and 1.


```python
# Generate random floating-point values
from random import seed, random

# Seed the random number generator
seed(1)

# Generate random numbers between 0 and 1
for _ in range(10):
    value = random()
    print(value)
```

```
0.13436424411240122
0.8474337369372327
0.763774618976614
0.2550690257394217
0.49543508709194095
0.4494910647887381
0.651592972722763
0.7887233511355132
0.0938595867742349
0.02834747652200631
```
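
The rescaling formula given earlier in this section can be sketched as follows. The helper name `rescale` and the target range of 5 to 10 are assumptions for illustration; the standard library's `uniform()` function draws from such a range directly.

```python
from random import seed, random, uniform

def rescale(value, low, high):
    """Map a value from [0, 1) onto the range that starts at low and ends at high."""
    return low + value * (high - low)

seed(1)
scaled = [rescale(random(), 5.0, 10.0) for _ in range(5)]
print(scaled)

# uniform(a, b) draws from the same kind of range in a single call
direct = [uniform(5.0, 10.0) for _ in range(5)]
print(direct)
```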


## Random Integer Values

To generate random integer values, you can use the `randint()` function in Python.
- This function requires two arguments: the starting and ending values for the range within which the integer values will be generated.
  - The generated integers include both the start and end values, forming an interval [start, end].
  - Random values are drawn from a uniform distribution.

- Here's an example that generates 10 random integer values between 0 and 10:



```python
# Generate random integer values
from random import seed, randint

# Seed the random number generator
seed(1)

# Generate some integers
for _ in range(10):
    value = randint(0, 10)
    print(value)
```

```
2
9
1
4
1
7
7
7
10
6
```


## Random Gaussian Values

To generate random floating-point values drawn from a Gaussian distribution, you can use the `gauss()` function in Python.

- This function requires two parameters that determine the characteristics of the distribution: the mean and the standard deviation.

- The example below generates 10 random values drawn from a Gaussian distribution with a mean of 0.0 and a standard deviation of 1.0. It's important to note that these parameters do not represent bounds on the values but rather control the shape and spread of the distribution, creating a bell-shaped curve with values distributed around the mean.


- Running the example will generate and print 10 random values following a Gaussian distribution.


```python
# Generate random Gaussian values
from random import seed, gauss

# Seed the random number generator
seed(1)

# Generate some Gaussian values
for _ in range(10):
    value = gauss(0, 1)
    print(value)
```

```
1.2881847531554629
1.449445608699771
0.06633580893826191
-0.7645436509716318
-1.0921732151041414
0.03133451683171687
-1.022103170010873
-1.4368294451025299
0.19931197648375384
0.13337460465860485
```


## Randomly Choosing From a List

You can use random numbers to randomly select an item from a list.
- For instance, if you have a list with 10 items indexed from 0 to 9, you can generate a random integer between 0 and 9 to make a random selection from the list.
- Python's `choice()` function simplifies this process for you, making selections with a uniform likelihood.

- The example below generates a list of 20 integers and demonstrates five random selections from that list:



```python
# Choose a random element from a list
from random import seed, choice

# Seed the random number generator
seed(1)

# Prepare a sequence
sequence = list(range(20))
print(sequence)

# Make choices from the sequence
for _ in range(5):
    selection = choice(sequence)
    print(selection)
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
4
18
2
8
3
```
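
The manual approach mentioned at the start of this section, drawing a random index and using it to look up an item, might look like the sketch below. It is shown only to illustrate the idea; `choice()` remains the simpler option.

```python
from random import seed, randint

seed(1)
sequence = list(range(20))

# Draw a valid index; randint() includes both endpoints, hence the "- 1"
index = randint(0, len(sequence) - 1)
selection = sequence[index]
print(index, selection)
```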


## Random Subsample From a List

Sometimes, you might want to repeatedly select items from a list to create a randomly chosen subset.
- Importantly, you don't want to select the same item more than once; this is known as "selection without replacement."
- Once an item is chosen for the subset, it should not be available for re-selection.
- The `sample()` function in Python provides this behavior, allowing you to select a random sample from a list without replacement.

  - The function takes both the list and the desired size of the subset as arguments.
  - It's important to note that items are not removed from the original list; instead, the selected items are returned in a new list.
  
- The example below demonstrates selecting a subset of five items from a list of 20 integers:
  - Running the example will first display the list of integer values, followed by the selection of a random sample for comparison.



```python
# Select a random sample without replacement
from random import seed, sample

# Seed the random number generator
seed(1)

# Prepare a sequence
sequence = list(range(20))
print(sequence)

# Select a subset without replacement
subset = sample(sequence, 5)
print(subset)
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[4, 18, 2, 8, 3]
```
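
The two properties described in this section, that the original list is left intact and that no item is picked twice, can be checked with a short sketch. For contrast it also calls `choices()`, the standard library function that selects with replacement (available in Python 3.6 and later).

```python
from random import seed, sample, choices

seed(1)
sequence = list(range(20))
subset = sample(sequence, 5)

# The original list still holds all 20 items
print(len(sequence))

# Every item in the subset is distinct
print(len(set(subset)) == len(subset))

# choices() selects WITH replacement, so repeated items are possible
with_replacement = choices(sequence, k=5)
print(with_replacement)
```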


## Randomly Shuffle a List

Randomness can be used to shuffle a list of items, just like shuffling a deck of cards.
- Python provides the `shuffle()` function for this purpose.
- It shuffles the list in place, meaning that the list you provide as an argument to the `shuffle()` function is shuffled directly, without creating a separate shuffled copy.

- The example below demonstrates how to randomly shuffle a list of integer values:
  - Running the example first prints the list of integers in its original order and then displays the same list after it has been randomly shuffled.


```python
# Randomly shuffle a sequence
from random import seed, shuffle

# Seed the random number generator
seed(1)

# Prepare a sequence
sequence = list(range(20))
print(sequence)

# Randomly shuffle the sequence
shuffle(sequence)
print(sequence)
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[11, 5, 17, 19, 9, 0, 16, 1, 15, 6, 10, 13, 14, 12, 7, 3, 8, 2, 18, 4]
```


---

# Random Numbers with NumPy

In machine learning, you often work with libraries like scikit-learn and Keras, which, in turn, rely on NumPy.
- NumPy is a powerful library for efficient manipulation of numerical arrays and matrices. It also provides its own implementation of a pseudorandom number generator and convenient wrapper functions.

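  - The legacy `numpy.random` functions used in this tutorial, such as `seed()` and `rand()`, employ the Mersenne Twister pseudorandom number generator. Newer NumPy code can instead create a generator with `numpy.random.default_rng()`, which uses a different algorithm by default.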
  
Below, we'll explore a few examples of generating random numbers and utilizing randomness with NumPy arrays.



## Seed the Random Number Generator in NumPy

- It's important to note that the NumPy pseudorandom number generator operates independently from the Python standard library's pseudorandom number generator.

- Seeding the Python generator does not affect the NumPy generator, and both must be seeded separately.

- To seed the NumPy pseudorandom number generator, you can use the `seed()` function, providing it with an integer value as the seed.

- The example below demonstrates how to seed the generator and how reseeding it results in the same sequence of random numbers:
  - Running the example seeds the NumPy pseudorandom number generator, generates a sequence of random numbers, and then reseeds it, showing that the exact same sequence of random numbers is produced.


```python
# Seed the pseudorandom number generator in NumPy
from numpy.random import seed, rand

# Seed the random number generator
seed(1)

# Generate some random numbers
print(rand(3))

# Reset the seed
seed(1)

# Generate some random numbers again
print(rand(3))
```

```
[4.17022005e-01 7.20324493e-01 1.14374817e-04]
[4.17022005e-01 7.20324493e-01 1.14374817e-04]
```
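
The independence of the two generators can be checked with a short sketch. It seeds and uses Python's generator between two NumPy draws and shows that NumPy's sequence simply continues; only reseeding NumPy itself restarts it.

```python
import random
import numpy as np

# Seed only NumPy and draw three values
np.random.seed(1)
first = np.random.rand(3)

# Seeding and using Python's generator leaves NumPy's stream untouched...
random.seed(1)
random.random()

# ...so NumPy carries on with its sequence rather than restarting it
continued = np.random.rand(3)
print(np.array_equal(first, continued))

# Only reseeding NumPy itself reproduces the original values
np.random.seed(1)
repeated = np.random.rand(3)
print(np.array_equal(first, repeated))
```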


## Array of Random Floating Point Values with NumPy

You can generate an array of random floating point values using the `rand()` function from NumPy.
- If no argument is provided, it creates a single random value. However, you can specify the size of the array if needed.

- The example below demonstrates how to create an array of 10 random floating point values drawn from a uniform distribution:

  - Running the example generates and prints a NumPy array containing random floating point values.


```python
# Generate random floating point values with NumPy
from numpy.random import seed, rand

# Seed the random number generator
seed(1)

# Generate random numbers between 0 and 1
values = rand(10)
print(values)
```

```
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01
 3.96767474e-01 5.38816734e-01]
```


## Array of Random Integer Values with NumPy

You can create an array of random integers using the `randint()` function from NumPy.
- This function takes three arguments: the lower bound of the range, the upper bound of the range, and the number of integer values to generate or the size of the array.
- Random integers are drawn from a uniform distribution, including the lower bound but excluding the upper bound, meaning they fall within the interval [lower, upper).

- The example below demonstrates how to generate an array of 20 random integers with a lower bound of 0 and an upper bound of 10:
  - Running the example generates and prints an array containing 20 random integer values between 0 and 9, because the upper bound of 10 is excluded.


```python
# Generate random integer values with NumPy
from numpy.random import seed, randint

# Seed the random number generator
seed(1)

# Generate random integers
values = randint(0, 10, 20)
print(values)
```

```
[5 8 9 5 0 0 1 7 6 9 2 4 5 2 4 2 4 7 7 9]
```
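
Because the standard library's `randint()` includes its upper bound while NumPy's excludes it, the two are easy to mix up. The sketch below draws 1,000 values from each with the same arguments and compares the largest value seen; the sample size is an arbitrary choice for illustration.

```python
import random
import numpy as np

random.seed(1)
np.random.seed(1)

# Same arguments, different conventions for the upper bound
python_values = [random.randint(0, 10) for _ in range(1000)]
numpy_values = np.random.randint(0, 10, 1000)

print(max(python_values))   # standard library: the upper bound can be drawn
print(numpy_values.max())   # NumPy: the upper bound is never drawn
```

To draw from 0 to 10 inclusive with NumPy, pass 11 as the upper bound.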


## Array of Random Gaussian Values with NumPy

You can generate an array of random Gaussian values using the `randn()` function from NumPy.
- When called with a single integer argument, it returns a one-dimensional array of that length; additional arguments add further dimensions.
  - The generated Gaussian values are drawn from a standard Gaussian distribution with a mean of 0.0 and a standard deviation of 1.0.

- The example below demonstrates how to create an array of 10 random Gaussian values:
  - Running the example generates and prints an array of 10 random values following a standard Gaussian distribution.


- If you wish to scale these values to a different Gaussian distribution with a specific mean and standard deviation, you can use the formula:

  ```
  scaled_value = mean + value * stdev
  ```

  - Where `mean` and `stdev` are the desired mean and standard deviation for the scaled Gaussian distribution, and `value` represents the randomly generated value from a standard Gaussian distribution.




```python
# Generate random Gaussian values with NumPy
from numpy.random import seed, randn

# Seed the random number generator
seed(1)

# Generate random Gaussian values
values = randn(10)
print(values)
```

```
[ 1.62434536 -0.61175641 -0.52817175 -1.07296862  0.86540763 -2.3015387
  1.74481176 -0.7612069   0.3190391  -0.24937038]
```
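
The scaling formula from this section applies to a whole array at once. In the sketch below, the target mean of 50 and standard deviation of 5 are assumed values chosen for illustration, and a larger sample is drawn so that the sample statistics land close to the targets.

```python
from numpy.random import seed, randn

seed(1)
mean, stdev = 50.0, 5.0          # assumed target distribution

values = randn(10000)            # standard Gaussian: mean 0, stdev 1
scaled = mean + values * stdev   # shift and stretch every value at once

# The sample statistics should be close to the targets, not exactly equal
print(scaled.mean(), scaled.std())
```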


## Shuffle a NumPy Array

You can shuffle a sequence in place using the `shuffle()` function from NumPy. It accepts a NumPy array or, as in the example here, a Python list.

- The example below demonstrates how to shuffle a list of integers with NumPy:
  - Running the example first generates a list of 20 integer values and prints it, then shuffles the list and prints it again.



```python
# Randomly shuffle a sequence with NumPy
from numpy.random import seed, shuffle

# Seed the random number generator
seed(1)

# Prepare a sequence
sequence = list(range(20))
print(sequence)

# Randomly shuffle the sequence
shuffle(sequence)
print(sequence)
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[3, 16, 6, 10, 2, 14, 4, 17, 7, 1, 13, 0, 19, 18, 9, 15, 8, 12, 11, 5]
```


# When to Seed the Random Number Generator

There are specific scenarios during a predictive modeling project when it's essential to consider seeding the random number generator. Let's explore two key cases:

- **Data Preparation**: Data preparation often uses randomness for tasks such as shuffling the data or selecting values.
  - It's crucial that data preparation remains consistent throughout the project.
  - This ensures that the data is consistently prepared during model training, evaluation, and when making predictions with the final model.

- **Data Splits**: When dividing data into subsets, such as for a train/test split or k-fold cross-validation, it's vital to maintain consistency.
  - This ensures that each algorithm is trained and evaluated on the same subsamples of data.

- You may choose to seed the pseudorandom number generator either once before each individual task or once before performing a batch of tasks. The choice between the two generally doesn't matter.
  - However, there are situations where you may want an algorithm to behave consistently, such as when it's used in a production environment or during algorithm demonstrations in tutorials. In such cases, initializing the seed before fitting the algorithm makes sense.
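
As a minimal sketch of a consistent data split, the example below builds a train/test split from a seeded shuffle of row indices. The helper name `train_test_split_indices`, the 25% test fraction, and the dataset size are assumptions for illustration; libraries such as scikit-learn provide their own splitting utilities.

```python
from random import Random

def train_test_split_indices(n_samples, test_fraction, seed_value):
    """Hypothetical helper: return (train, test) index lists for a seeded split."""
    rng = Random(seed_value)          # private generator seeded for this split only
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# The same seed gives every algorithm the same train and test rows
train_a, test_a = train_test_split_indices(20, 0.25, seed_value=1)
train_b, test_b = train_test_split_indices(20, 0.25, seed_value=1)
print(test_a == test_b)

# A different seed will generally produce a different split
_, test_c = train_test_split_indices(20, 0.25, seed_value=2)
print(test_a, test_c)
```

Using a private `Random` instance here keeps the split reproducible without resetting the global generator that other code may rely on.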


---


# Controlling for Randomness in Machine Learning

- In stochastic machine learning algorithms, the model may learn slightly differently each time it's run on the same data.
  - This leads to variations in model performance from one run to the next.
  - Although you can fit the model with the same sequence of random numbers every time, this is not advisable when evaluating the model because it conceals the inherent uncertainty in the model's performance.

- A better approach is to evaluate the algorithm in a way that includes this uncertainty in the reported performance.
  - This can be achieved by repeatedly evaluating the algorithm with different sequences of random numbers.
  - The pseudorandom number generator can be seeded once at the start of the evaluation, or it can be seeded with a different value at the beginning of each evaluation.
  
- There are two key aspects of uncertainty to consider:

  1. **Data Uncertainty**: Evaluating an algorithm on multiple data splits reveals how the algorithm's performance varies with changes in the train and test data.

  2. **Algorithm Uncertainty**: Evaluating an algorithm multiple times on the same data splits provides insights into how the algorithm's performance varies because of its own randomness alone.

- In general, it's recommended to report on both sources of uncertainty combined. This involves fitting the algorithm on different data splits during each evaluation run, each with a new sequence of randomness.
  - The evaluation procedure can seed the random number generator once at the beginning, and this process can be repeated, possibly 30 or more times, to generate a population of performance scores. Summarizing this population provides a fair representation of model performance, taking into account variance in both the training data and the learning algorithm itself.
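
The repeated evaluation procedure described above can be sketched with a toy problem. Everything here is an assumption made for illustration: the synthetic dataset, the helper names `make_data` and `evaluate_once`, and the deliberately simple "algorithm" (a least-squares slope fitted on a random subsample of the training rows). The point is the structure: seed once, repeat 30 times with fresh splits and fresh algorithm randomness, then summarize the scores.

```python
from random import Random
from statistics import mean, stdev

def make_data(rng, n=200):
    """Hypothetical noisy dataset: y = 2x + Gaussian noise."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in xs]

def evaluate_once(data, rng):
    """One evaluation run: a new random split plus a randomized fit."""
    rows = data[:]
    rng.shuffle(rows)                          # data uncertainty: new split
    train, test = rows[:150], rows[150:]
    batch = rng.sample(train, 50)              # algorithm uncertainty: random subsample
    # Fit a slope through the origin by least squares on the subsample
    slope = sum(x * y for x, y in batch) / sum(x * x for x, _ in batch)
    # Score with mean absolute error on the held-out rows
    return mean(abs(y - slope * x) for x, y in test)

rng = Random(1)                                # seeded once at the start
data = make_data(rng)
scores = [evaluate_once(data, rng) for _ in range(30)]
print(f"MAE: mean={mean(scores):.4f} stdev={stdev(scores):.4f} over {len(scores)} runs")
```

Reporting the mean together with the standard deviation conveys both the expected performance and how much it varies across runs.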


---


# Common Questions

**1. Can I predict random numbers?**
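   - In practice, you should not expect to predict the sequence of random numbers, even with a deep neural network. Note, however, that the Mersenne Twister is not cryptographically secure, so it should not be used where unpredictability is a security requirement.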

**2. Will real random numbers lead to better results?**
   - As far as I have read, using real randomness does not help in general, unless you are working with simulations of physical processes.

**3. What about the final model?**
   - The final model is the chosen algorithm and configuration trained on all available training data, which you can then use to make predictions. The performance of this model is expected to fall within the variance observed when evaluating the model.

---


# Further Reading

### API Documentation
- [random Python API](https://docs.python.org/3/library/random.html)
- [Random Sampling in NumPy](https://docs.scipy.org/doc/numpy/reference/routines.random.html)

### Articles
- [Random number generation on Wikipedia](https://en.wikipedia.org/wiki/Random_number_generation)
- [Pseudorandom number generator on Wikipedia](https://en.wikipedia.org/wiki/Pseudorandom_number_generator)
- [Mersenne Twister on Wikipedia](https://en.wikipedia.org/wiki/Mersenne_Twister)

---
