***

*Course:* [Math 535](https://people.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS)  
*Chapter:* 1-Introduction   
*Author:* [Sebastien Roch](https://people.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison  
*Updated:* Jan 4, 2024   
*Copyright:* &copy; 2024 Sebastien Roch

***

In [None]:
# IF RUNNING ON GOOGLE COLAB, UNCOMMENT THE FOLLOWING CODE CELL
# When prompted, upload: 
#     * mmids.py
#     * penguins-measurements.csv
#     * penguins-species.csv
# from your local file system
# Files at: https://github.com/MMiDS-textbook/MMiDS-textbook.github.io/tree/main/utils
# Alternative instructions: https://colab.research.google.com/notebooks/io.ipynb

In [None]:
# from google.colab import files

# uploaded = files.upload()

# for fn in uploaded.keys():
#     print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

In [None]:
# PYTHON 3
import numpy as np
from numpy import linalg as LA
from numpy.random import default_rng
rng = default_rng(535)
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx
import mmids

## Motivating example: species identification

Here is a [penguin dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pallter.marine.rutgers.edu/). We will load the data as a data table (similar to a spreadsheet), called a [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html) in [`pandas`](https://pandas.pydata.org/docs/), where the columns are different measurements (or features) and the rows are different samples. Below, we load the data using [`pandas.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#) and show the first $5$ rows of the dataset (see [`DataFrame.head`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)). This dataset is a simplified version (i.e., with some columns removed) of the full dataset, maintained by [Allison Horst](https://allisonhorst.com/) at this [GitHub page](https://github.com/allisonhorst/palmerpenguins/blob/main/README.md). 

In [None]:
df = pd.read_csv('penguins-measurements.csv')
df.head()

Observe that this dataset has missing values (i.e., the entries `NaN` above). A common way to deal with this issue is to remove all rows with missing values. This can be done using [`pandas.DataFrame.dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html).

In [None]:
df = df.dropna()
df.head()
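
Before dropping rows, it can be useful to count the missing values in each column. The following is a minimal sketch on a small made-up table: the column names mirror the dataset, but the values and the names `toy`, `missing_per_column` and `toy_clean` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy table with invented values (not the actual penguin data)
toy = pd.DataFrame({
    'bill_length_mm': [39.1, np.nan, 40.3],
    'body_mass_g': [3750.0, 3800.0, np.nan],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows without any missing entry
toy_clean = toy.dropna()
print(toy_clean.shape)
```

Here two of the three rows contain a `NaN`, so only one row survives.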

There are $342$ samples, as can be seen by using [`pandas.DataFrame.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) which gives the dimensions of the DataFrame as a tuple.

In [None]:
df.shape[0]

Here is a summary of the data (see [`pandas.DataFrame.describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)).

In [None]:
df.describe()

Let's first extract the columns into a Numpy array using [`pandas.DataFrame.to_numpy()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html).

In [None]:
X = df[['bill_length_mm', 'bill_depth_mm', 
        'flipper_length_mm', 'body_mass_g']].to_numpy()
print(X)

We visualize two measurements in the data, the bill depth and flipper length. (The original dataset used the more precise term [culmen](https://en.wikipedia.org/wiki/Beak#Culmen) depth.) Below, each point is a sample. This is called a [scatter plot](https://en.wikipedia.org/wiki/Scatter_plot). 

In [None]:
plt.scatter(X[:,1], X[:,2], s=10)
plt.xlabel('bill_depth_mm')
plt.ylabel('flipper_length_mm')
plt.show()

Now let's look at the full dataset. Visualizing the full $4$-dimensional data is not straightforward. One way to do this is to consider all pairwise scatter plots. We use the function [`seaborn.pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html) from the library [Seaborn](https://seaborn.pydata.org/index.html). 

In [None]:
import seaborn as sns
sns.pairplot(df, vars=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], height=2)
plt.show()

## Background: review of basic linear algebra, calculus, and probability

**NUMERICAL CORNER:** In Numpy, a vector is defined as a 1d array. We first must import the [Numpy](https://numpy.org) package, which is often abbreviated by `np`.

In [None]:
import numpy as np
u = np.array([1., 3., 5. ,7.])
print(u)

To obtain the norm of a vector, we can use the function [`linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) (which requires the `numpy.linalg` package):

In [None]:
from numpy import linalg as LA
LA.norm(u)

which we check next "by hand":

In [None]:
np.sqrt(np.sum(u ** 2))

In Numpy, [`**`](https://numpy.org/doc/stable/reference/generated/numpy.power.html) indicates element-wise exponentiation.
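
As a small additional check, the sketch below compares three ways of computing the Euclidean norm of the same vector, including one based on the inner product of the vector with itself. The variable names are chosen for illustration only.

```python
import numpy as np
from numpy import linalg as LA

u = np.array([1., 3., 5., 7.])

# Three equivalent ways to compute the Euclidean norm of u
norm_builtin = LA.norm(u)
norm_by_hand = np.sqrt(np.sum(u ** 2))
norm_via_dot = np.sqrt(u @ u)  # inner product of u with itself

print(norm_builtin, norm_by_hand, norm_via_dot)
```

All three values agree up to floating-point error, since $1 + 9 + 25 + 49 = 84$ in each case.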

$\unlhd$

**NUMERICAL CORNER:** We will often work with collections of $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in $\mathbb{R}^d$ and it will be convenient to stack them up into a matrix

$$
X =
\begin{bmatrix}
\mathbf{x}_1^T \\
\mathbf{x}_2^T \\
\vdots \\
\mathbf{x}_n^T \\
\end{bmatrix}
=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd} \\
\end{bmatrix}.
$$

To create a matrix out of two vectors, we use the function [`numpy.stack`](https://numpy.org/doc/stable/reference/generated/numpy.stack.html).

In [None]:
u = np.array([1., 3., 5., 7.])
v = np.array([2., 4., 6., 8.])
X = np.stack((u,v),axis=0)
print(X)

Quoting the documentation:

> The axis parameter specifies the index of the new axis in the dimensions of the result. For example, if axis=0 it will be the first dimension and if axis=-1 it will be the last dimension.

The same scheme still works with more than two vectors.

In [None]:
u = np.array([1., 3., 5., 7.])
v = np.array([2., 4., 6., 8.])
w = np.array([9., 8., 7., 6.])
X = np.stack((u,v,w))
print(X)
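
To illustrate the role of the `axis` parameter, here is a short sketch stacking the same two vectors as rows and as columns. The names `X_rows` and `X_cols` are made up for this example.

```python
import numpy as np

u = np.array([1., 3., 5., 7.])
v = np.array([2., 4., 6., 8.])

X_rows = np.stack((u, v), axis=0)  # u and v become the rows
X_cols = np.stack((u, v), axis=1)  # u and v become the columns

print(X_rows.shape)
print(X_cols.shape)
```

The second matrix is the transpose of the first one.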

$\unlhd$

**NUMERICAL CORNER:** In Numpy, the Frobenius norm of a matrix can be computed using the function [`numpy.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).

In [None]:
A = np.array([[1., 0.],[0., 1.],[0., 0.]])
print(A)

In [None]:
LA.norm(A)
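
As with vectors, the result can be checked "by hand". The sketch below assumes the usual definition of the Frobenius norm as the square root of the sum of the squared entries; the variable names are illustrative.

```python
import numpy as np
from numpy import linalg as LA

A = np.array([[1., 0.], [0., 1.], [0., 0.]])

frob_builtin = LA.norm(A)               # default for a 2d array is Frobenius
frob_by_hand = np.sqrt(np.sum(A ** 2))  # square root of the sum of squared entries

print(frob_builtin, frob_by_hand)
```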

$\unlhd$

**NUMERICAL CORNER:** The function $f(x) = x^2$ over $\mathbb{R}$ has a global minimizer at $x^* = 0$. Indeed, we clearly have $f(x) \geq 0$ for all $x$ while $f(0) = 0$. To plot the function, we use the [matplotlib](https://matplotlib.org) package, and specifically its function [`matplotlib.pyplot.plot`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html). We also use the function [`numpy.linspace`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html) to create an array of evenly spaced numbers where we evaluate $f$.

In [None]:
import matplotlib.pyplot as plt
x = np.linspace(-2,2,100)
y = x ** 2
plt.plot(x,y)
plt.show()

The function $f(x) = e^x$ over $\mathbb{R}$ does not have a global minimizer. Indeed, $f(x) > 0$ but no $x$ achieves $0$. And, for any $m > 0$, there is $x$ small enough such that $f(x) < m$. Note that $\mathbb{R}$ is *not* bounded, therefore the *Extreme Value Theorem* does not apply here.

In [None]:
x = np.linspace(-2,2,100)
y = np.exp(x)
plt.plot(x,y)
plt.ylim(0,5)
plt.show()

The function $f(x) = (x+1)^2 (x-1)^2$ over $\mathbb{R}$ has two global minimizers at $x^* = -1$ and $x^{**} = 1$. Indeed, $f(x) \geq 0$ and $f(x) = 0$ if and only if $x = x^*$ or $x = x^{**}$.

In [None]:
x = np.linspace(-2,2,100)
y = ((x+1)**2) * ((x-1)**2)
plt.plot(x,y)
plt.ylim(0,5)
plt.show()
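
The two minimizers can also be located numerically on a grid. This is only a rough sketch: it assumes a grid fine enough to contain $-1$ and $1$, and it relies on the default tolerance of `numpy.isclose` to decide which grid values count as minimal.

```python
import numpy as np

def f(x):
    return ((x + 1) ** 2) * ((x - 1) ** 2)

# Grid chosen so that -1 and 1 are grid points (spacing 0.01)
x = np.linspace(-2, 2, 401)
y = f(x)

# All grid points where f attains its smallest value on the grid
minimizers = x[np.isclose(y, y.min())]
print(minimizers)
```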

$\unlhd$

**NUMERICAL CORNER:** We can use simulations to confirm the *Weak Law of Large Numbers*. Recall that a uniform random variable over the interval $[a,b]$ has density

$$
f_{X}(x)
= \begin{cases}
\frac{1}{b-a} & x \in [a,b] \\
0 & \text{o.w.}
\end{cases}
$$

We write $X \sim \mathrm{U}[a,b]$. We can obtain a sample from $\mathrm{U}[0,1]$ by using the method `random` of a [random number generator](https://numpy.org/doc/stable/reference/random/generator.html) in Numpy.  

In [None]:
from numpy.random import default_rng
rng = default_rng(535)
rng.random()

Now we take $n$ samples from $\mathrm{U}[0,1]$ and compute their sample mean. We repeat $k$ times and display the empirical distribution of the sample means using a [histogram](https://en.wikipedia.org/wiki/Histogram).

In [None]:
def lln_unif(n, k):
    sample_mean = [np.mean(rng.random(n)) for i in range(k)]
    plt.hist(sample_mean,bins=15)
    plt.xlim(0,1)
    plt.title(f'n={n}')
    plt.show()

We start with $n=10$.

In [None]:
lln_unif(10, 1000)

Taking $n$ much larger leads to more concentration around the mean.

In [None]:
lln_unif(100, 1000)
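
The concentration can also be quantified. Since the variance of $\mathrm{U}[0,1]$ is $1/12$, the standard deviation of the sample mean of $n$ independent samples is $\sqrt{1/(12n)}$. The following sketch compares this value to an empirical estimate; the helper `sample_mean_std` is a name made up for this illustration.

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(535)

def sample_mean_std(n, k):
    # k independent sample means, each over n draws from U[0,1]
    sample_means = rng.random((k, n)).mean(axis=1)
    return np.std(sample_means)

for n in [10, 100, 1000]:
    # empirical value next to the theoretical value sqrt(1/(12 n))
    print(n, sample_mean_std(n, 1000), np.sqrt(1 / (12 * n)))
```

The empirical standard deviation should shrink roughly like $1/\sqrt{n}$.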

$\unlhd$

**NUMERICAL CORNER:** We plot the PDF of a standard normal distribution. We use the function [scipy.stats.norm](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) from the [SciPy library](https://scipy.org), which outputs the PDF. The following code was adapted from [here](https://commons.wikimedia.org/wiki/File:Standard_Normal_Distribution.svg) with the help of ChatGPT.

In [None]:
from scipy.stats import norm

# Plot the normal distribution curve
x = np.linspace(-4, 4, 100)
y = norm.pdf(x)
plt.plot(x, y, color='black')

# Fill areas under the curve for different standard deviations
plt.fill_between(x, y, where=(x > -1) & (x < 1), color='red', alpha=0.25)
plt.fill_between(x, y, where=(x > -2) & (x < 2), color='red', alpha=0.25)
plt.hlines(norm.pdf(1), -1, 1, color='black', linestyle='dashed')
plt.hlines(norm.pdf(2), -2, 2, color='black', linestyle='dashed')
plt.text(0, norm.pdf(1) + 0.01, "68.3%", ha='center')
plt.text(0, norm.pdf(2) + 0.01, "95.4%", ha='center')

# Set labels, title, and xticks
plt.xlabel("Standard Deviations from Mean")
plt.ylabel("PDF")
plt.xticks(range(-4, 5, 2), [f'{i}' for i in range(-4, 5, 2)])
plt.show()
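
The percentages shown in the plot can be recovered from the cumulative distribution function. This is a small sketch using `norm.cdf` from SciPy; the variable names are illustrative.

```python
from scipy.stats import norm

# Probability that a standard normal lies within 1 and 2 standard deviations
within_1 = norm.cdf(1) - norm.cdf(-1)
within_2 = norm.cdf(2) - norm.cdf(-2)

print(f'{100 * within_1:.1f}%')
print(f'{100 * within_2:.1f}%')
```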

$\unlhd$

**NUMERICAL CORNER:** The following function generates $n$ data points from a spherical $d$-dimensional Gaussian with variance $1$ and mean $w \mathbf{e}_1$. We will use it later in the chapter to simulate interesting datasets. 

Below, `rng.normal(0,1,d)` generates a `d`-dimensional spherical Gaussian with mean $\mathbf{0}$. Below we use the function [`numpy.concatenate`](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) to create a vector by concatenating two given vectors. We use `[w]` to create a vector with a single entry `w`. We also use the function [`numpy.zeros`](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html) to create an all-zero vector. 

In [None]:
def one_cluster(d, n, w):
    X = np.stack(
        [np.concatenate(([w], np.zeros(d-1))) + rng.normal(0,1,d) for _ in range(n)]
    )
    return X

We generate $100$ data points in dimension $d=2$.

In [None]:
d, n, w = 2, 100, 3.
X = one_cluster(d, n, w)
plt.scatter(X[:,0], X[:,1])
plt.show()
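
As an aside, the same kind of dataset can be generated without an explicit loop by drawing an $n \times d$ matrix of standard normals at once and shifting every row by the mean. The function `one_cluster_vectorized` below is a hypothetical variant written for this sketch, not part of `mmids.py`.

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(535)

def one_cluster_vectorized(d, n, w):
    # mean vector w * e_1
    mean = np.concatenate(([w], np.zeros(d - 1)))
    # n-by-d matrix of independent standard normals, shifted row by row
    return mean + rng.normal(0, 1, (n, d))

X = one_cluster_vectorized(2, 10000, 3.)
print(X.shape)
print(X.mean(axis=0))
```

With many samples, the empirical mean should be close to $(w, 0)$.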

$\unlhd$

## Clustering: an objective, an algorithm and a guarantee

**NUMERICAL CORNER:** Here's a numerical example. We first define a quadratic function.

In [None]:
def q(a,b,c,x):
    return a * (x**2) + b * x + c

We plot it for different values of the coefficients. 

In [None]:
x = np.linspace(-2, 2, 100)

plt.plot(x, q(2,4,-1,x))
plt.plot(x, q(2,-4,4,x))
plt.plot(x, q(-2,0,4,x))

plt.legend(['y1', 'y2', 'y3'])

plt.show()
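
When $a > 0$, the quadratic $a x^2 + b x + c$ is minimized at $x^* = -b/(2a)$. Here is a quick numerical check for the first set of coefficients plotted above, comparing the value at $x^*$ to the smallest value on a grid. The grid size is an arbitrary choice.

```python
import numpy as np

def q(a, b, c, x):
    return a * (x ** 2) + b * x + c

a, b, c = 2., 4., -1.
x_star = -b / (2 * a)  # vertex of the parabola

x = np.linspace(-2, 2, 401)
# q at the vertex should be no larger than q anywhere on the grid
print(x_star, q(a, b, c, x_star), q(a, b, c, x).min())
```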

$\unlhd$

We are now ready to describe the <a href="https://en.wikipedia.org/wiki/K-means_clustering">$k$-means algorithm</a>, also known as Lloyd's algorithm. We start from a random assignment of clusters. (An alternative [initialization strategy](https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods) is to choose $k$ representatives at random among the data points.) We then alternate between the optimal choices in the lemmas. In lieu of pseudo-code, we write out the algorithm in Python. 

The input `X` is assumed to be a collection of $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ stacked into a matrix, with one row for each data point. The other input, `k`, is the desired number of clusters. There is an optional input `maxiter` for the maximum number of iterations, which is set to $10$ by default.

We first define separate functions for the two main steps. To find the minimum of an array, we use the function [`numpy.argmin`](https://numpy.org/doc/stable/reference/generated/numpy.argmin.html). We also use [`numpy.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) to compute the Euclidean distance.

In [None]:
def opt_reps(X, k, assign):
    (n, d) = X.shape
    reps = np.zeros((k, d))
    for i in range(k):
        in_i = [j for j in range(n) if assign[j] == i]             
        reps[i,:] = np.sum(X[in_i,:],axis=0) / len(in_i)
    return reps

def opt_clust(X, k, reps):
    (n, d) = X.shape
    dist = np.zeros(n)
    assign = np.zeros(n, dtype=int)
    for j in range(n):
        dist_to_i = np.array([LA.norm(X[j,:] - reps[i,:]) for i in range(k)])
        assign[j] = np.argmin(dist_to_i)
        dist[j] = dist_to_i[assign[j]]
    G = np.sum(dist ** 2)
    print(G)
    return assign

The main function follows. Below, `rng.integers(0,k,n)` is an array of `n` uniformly chosen integers between `0` and `k-1` (inclusive). (See [random.Generator.integers](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.integers.html) for details.)

In [None]:
def kmeans(X, k, maxiter=10):
    (n, d) = X.shape
    assign = rng.integers(0,k,n)
    reps = np.zeros((k, d), dtype=int)
    for iter in range(maxiter):
        # Step 1: Optimal representatives for fixed clusters
        reps = opt_reps(X, k, assign) 
        # Step 2: Optimal clusters for fixed representatives
        assign = opt_clust(X, k, reps) 
    return assign
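
For larger datasets, the assignment step can be written without Python loops by using broadcasting. The sketch below is one possible variant, not the version in `mmids.py`; the function name `opt_clust_vectorized` and the toy inputs are made up for illustration. Unlike `opt_clust`, it returns the objective value instead of printing it.

```python
import numpy as np
from numpy import linalg as LA

def opt_clust_vectorized(X, reps):
    # diff[j, i, :] = X[j, :] - reps[i, :], via broadcasting
    diff = X[:, np.newaxis, :] - reps[np.newaxis, :, :]
    dist = LA.norm(diff, axis=2)           # n-by-k matrix of distances
    assign = np.argmin(dist, axis=1)       # closest representative for each point
    G = np.sum(np.min(dist, axis=1) ** 2)  # k-means objective
    return assign, G

# Toy example: two well-separated pairs of points
X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
reps = np.array([[0., 0.5], [10., 10.5]])
assign, G = opt_clust_vectorized(X, reps)
print(assign, G)
```

Each point is at distance $1/2$ from its closest representative, so the objective is $4 \times 1/4 = 1$.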

**NUMERICAL CORNER:** We apply our implementation of $k$-means to the example above. We fix `k` to $3$. Here the data matrix `X` is the following:

In [None]:
X = np.array([[1., 0.],[-2., 0.],[-2.,1.],[1.,-3.],[-10.,10.],[2.,-2.],[-3.,1.],[3.,-1.]])
assign = kmeans(X, 3)

We visualize the output by coloring the points according to their cluster assignment.

In [None]:
plt.scatter(X[:,0], X[:,1], c=assign, s=10)
plt.show()

We can compute the final representatives (optimal for the final assignment) by using the subroutine `opt_reps`.

In [None]:
print(opt_reps(X, 3, assign))

Each row is the center of the corresponding cluster. Note these match with the ones we previously computed. Indeed, the clustering is the same (although not necessarily in the same order).

$\unlhd$

We will test our implementation of $k$-means on the penguins dataset introduced earlier in the chapter.

We first extract the columns and combine them into a data matrix `X`. As we did previously, we also remove the rows with missing values.

In [None]:
df = pd.read_csv('penguins-measurements.csv')
df = df.dropna()
X = df[['bill_length_mm', 'bill_depth_mm', 
        'flipper_length_mm', 'body_mass_g']].to_numpy()

We visualize a two-dimensional slice of the data. 

In [None]:
plt.scatter(X[:,1], X[:,3], s=10)
plt.xlabel('bill_depth_mm')
plt.ylabel('body_mass_g')
plt.show()

Observe that the features have quite different scales (tens versus thousands in the plot above). In such a case, it is common to standardize the data so that each feature has roughly the same scale. That is accomplished by, for each column of `X`, subtracting its empirical mean and dividing by its empirical standard deviation. We do this next. 

In [None]:
mean = np.mean(X, axis=0)  # Compute mean for each column
std = np.std(X, axis=0)  # Compute standard deviation for each column
X = (X - mean) / std # Standardize each column
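
To see what standardization does, here is a minimal sketch on a small made-up array with two columns on very different scales (the values are invented, not taken from the dataset). After standardization, each column has empirical mean $0$ and empirical standard deviation $1$.

```python
import numpy as np

# Toy data with two columns on very different scales (invented values)
X_toy = np.array([[18.7, 3750.], [17.4, 3800.], [18.0, 3250.], [19.3, 3450.]])

mean = np.mean(X_toy, axis=0)  # column means
std = np.std(X_toy, axis=0)    # column standard deviations
X_std = (X_toy - mean) / std   # standardize each column

print(np.mean(X_std, axis=0))
print(np.std(X_std, axis=0))
```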

Now we run the $k$-means algorithm with $k=2$ clusters. 

In [None]:
assign = kmeans(X, 2)

We visualize the output as we did before.

In [None]:
plt.scatter(X[:,1], X[:,3], c=assign, s=10)
plt.xlabel('bill_depth_mm (standardized)')
plt.ylabel('body_mass_g (standardized)')
plt.show()

This clustering looks quite good. Nevertheless recall that:

1. in this plot we are looking at only two of the four variables while $k$-means uses all of them, 

2. we are not guaranteed to find the best solution, 

3. our objective function is somewhat arbitrary, and 

4. it is not clear what the right choice of $k$ is. 

In fact, the original dataset provided the correct answer. Despite what the figure above may lead us to believe, there are in reality three separate species. So let's try with $k=3$ clusters.

In [None]:
assign = kmeans(X, 3)

The output does not seem quite right.

In [None]:
plt.scatter(X[:,1], X[:,3], c=assign, s=10)
plt.xlabel('bill_depth_mm (standardized)')
plt.ylabel('body_mass_g (standardized)')
plt.show()

But, remembering the warnings mentioned previously, let's look at a different two-dimensional slice.

In [None]:
plt.scatter(X[:,0], X[:,3], c=assign, s=10)
plt.xlabel('bill_length_mm (standardized)')
plt.ylabel('body_mass_g (standardized)')
plt.show()

Let's load up the truth and compare. We only keep those samples that were not removed because of missing values (see [`pandas.DataFrame.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)).

In [None]:
df_truth = pd.read_csv('penguins-species.csv') 
df_truth = df_truth.iloc[df.index]
df_truth.head()

The species are:

In [None]:
species = df_truth['species']
print(species.unique())

To plot the outcome, we color the species blue-green-red using a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries).

In [None]:
species2color = {'Adelie': 'b', 'Chinstrap': 'g', 'Gentoo': 'r'}
truth = species.replace(species2color)

Finally, we can compare the output to the truth. The match is quite good, though not perfect.

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True)
ax1.scatter(X[:,0], X[:,3], c=truth, s=3)
ax1.set_title('truth')
ax2.scatter(X[:,0], X[:,3], c=assign, s=3)
ax2.set_title('kmeans')
plt.show()
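
Beyond a visual comparison, one way to quantify the agreement is to count how many samples of each species fall into each cluster. The following is a sketch using `pandas.crosstab` on invented labels (`species_toy` and `assign_toy` are made up and do not reproduce the output above).

```python
import numpy as np
import pandas as pd

# Invented labels for illustration only (not the actual output above)
species_toy = pd.Series(['Adelie', 'Adelie', 'Chinstrap', 'Gentoo', 'Gentoo', 'Gentoo'])
assign_toy = np.array([0, 0, 0, 2, 2, 1])

# Rows: true species; columns: cluster index; entries: counts
table = pd.crosstab(species_toy, assign_toy,
                    rownames=['species'], colnames=['cluster'])
print(table)
```

A good clustering would show, up to a relabeling of the clusters, one dominant entry in each row.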

**TRY IT!** Run the analysis again, but this time *without the standardization step*. What do you observe? Is one feature more influential than the others on the final output? Why do you think that is?

$\unlhd$

## Some observations about high-dimensional data

The following function generates $n$ data points from two spherical $d$-dimensional Gaussians with variance $1$, one with mean $-w\mathbf{e}_1$ and one with mean $w \mathbf{e}_1$. 

In [None]:
def one_cluster(d, n, w):
    X = np.stack(
        [np.concatenate(([w], np.zeros(d-1))) + rng.normal(0,1,d) for _ in range(n)]
    )
    return X

def two_clusters(d, n, w):
    X1 = one_cluster(d, n, -w)
    X2 = one_cluster(d, n, w)
    return X1, X2

We will mix these two datasets to form an interesting case for clustering.

**Two dimensions** We start with $d=2$.

In [None]:
d, n, w = 2, 100, 3.
X1, X2 = two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)

We use a scatterplot to visualize the data. Each dot corresponds to one data point. Observe the two clearly delineated clusters.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], s=10)
plt.show()

Let's run $k$-means on this dataset using $k=2$. We use `kmeans()` from the `mmids.py` file.

In [None]:
assign = mmids.kmeans(X, 2)

Our default of $10$ iterations seems to have been enough for the algorithm to converge. We can visualize the result by [coloring](https://matplotlib.org/stable/api/_as_gen/matplotlib.lines.Line2D.html) the points according to the assignment.  

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign, s=10)
plt.show()

**General dimension** Let's see what happens in higher dimension. We repeat our experiment with $d=1000$.

In [None]:
d, n, w = 1000, 100, 3.
X1, X2 = two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)

Again, we observe two clearly delineated clusters.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], s=10)
plt.show()

This dataset is in $1000$ dimensions, but we've plotted the data in only the first two dimensions. If we plot in any two dimensions not including the first one instead, we see only one cluster. 

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,1], X[:,2], s=10)
plt.show()

Let's see how $k$-means fares on this dataset.

In [None]:
assign = mmids.kmeans(X, 2)

Our attempt at clustering does not appear to have been successful.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign, s=10)
plt.show()

What happened? While these clusters are easy to tease apart *if we know to look at the first coordinate only*, in the full space the within-cluster and between-cluster distances become harder to distinguish: the noise overwhelms the signal. 

The function below plots the histograms of within-cluster and between-cluster distances for a sample of size $n$ in $d$ dimensions with a given offset. As $d$ increases, the two distributions become increasingly indistinguishable. Later in the course, we will develop dimension-reduction techniques that help deal with this issue.

In [None]:
def highdim_2clusters(d, n, w):
    # generate datasets
    X1, X2 = two_clusters(d, n, w)
    
    # within-cluster distances for X1
    intra = np.stack(
        [LA.norm(X1[i,:] - X1[j,:]) for i in range(n) for j in range(n) if j>i]
    )
    plt.hist(intra, density=True, label='within-cluster')
 
    # between-cluster distances
    inter = np.stack(
        [LA.norm(X1[i,:] - X2[j,:]) for i in range(n) for j in range(n)]
    )
    plt.hist(inter, density=True, alpha=0.75, label='between-cluster')

    plt.legend(loc='upper right')
    plt.title(f'dim={d}')

Next we plot the results for dimensions $d=2, 100, 1000$. What do you observe?

In [None]:
highdim_2clusters(2, 100, 3)

In [None]:
highdim_2clusters(100, 100, 3)

In [None]:
highdim_2clusters(1000, 100, 3)

As the dimension increases, the distributions of intra-cluster and inter-cluster distances overlap significantly and become more or less indistinguishable. That provides some insights into why clustering may fail here. Note that we used the same offset for all simulations. On the other hand, if the separation between the clusters is sufficiently large, one would expect clustering to work even in high dimension. 
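
A back-of-the-envelope computation points in the same direction: for two points in the same cluster the expected squared distance is $2d$, while for two points in different clusters it is $2d + 4w^2$, so their ratio tends to $1$ as $d$ grows with $w$ fixed. The sketch below estimates the ratio of the mean between-cluster distance to the mean within-cluster distance. The helper `mean_distance_ratio` is a name made up here, and the sample size is kept small to limit memory use.

```python
import numpy as np
from numpy import linalg as LA
from numpy.random import default_rng

rng = default_rng(535)

def mean_distance_ratio(d, n, w):
    # two clusters with means -w e_1 and +w e_1, identity covariance
    shift = np.concatenate(([w], np.zeros(d - 1)))
    X1 = -shift + rng.normal(0, 1, (n, d))
    X2 = shift + rng.normal(0, 1, (n, d))
    # all pairwise distances via broadcasting
    intra = LA.norm(X1[:, None, :] - X1[None, :, :], axis=2)
    inter = LA.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    # exclude the zero distances from each point to itself
    intra_mean = intra[np.triu_indices(n, k=1)].mean()
    return inter.mean() / intra_mean

for d in [2, 100, 1000]:
    print(d, mean_distance_ratio(d, 50, 3.))
```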

**TRY IT!** What precedes (and what follows in the next subsection) is not a formal proof that $k$-means clustering will be unsuccessful here. The behavior of the algorithm is quite complex and depends, in particular, on the initialization and the density of points. Here, increasing the number of data points eventually leads to a much better performance. Explore this behavior on your own by modifying the code. (For some theoretical justifications (beyond this course), see [here](https://arxiv.org/pdf/0912.0086.pdf) and [here](http://www.stat.yale.edu/~pollard/Papers/Pollard81AS.pdf).)

**NUMERICAL CORNER:** We can check the theorem in a simulation. Here we pick $n$ points uniformly at random in the $d$-cube $\mathcal{C}$, for a range of dimensions up to `dmax`. We then plot the frequency of landing in the inscribed $d$-ball $\mathcal{B}$ and see that it rapidly converges to $0$. Alternatively, we could just plot the formula for the volume of $\mathcal{B}$. But knowing how to do simulations is useful in situations where explicit formulas are unavailable or intractable.

In [None]:
def highdim_cube(dmax, n):
    
    in_ball = np.zeros(dmax)
    for d in range(dmax):
        # recall that d starts at 0 so we add 1 below
        in_ball[d] = np.mean(
            [(LA.norm(rng.random(d+1) - 1/2) < 1/2) for _ in range(n)]
        )
    
    plt.plot(np.arange(1,dmax+1), in_ball) 
    plt.xlabel('dim')
    plt.ylabel('in-ball freq')
    plt.show()

We plot the result up to dimension $10$.

In [None]:
highdim_cube(10, 1000)
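
For comparison, here is a sketch of the explicit formula. It assumes the standard expression for the volume of a $d$-ball of radius $r$, namely $\pi^{d/2} r^d / \Gamma(d/2 + 1)$, with $r = 1/2$; since the cube has volume $1$, this is also the probability of landing in the ball. The function name `ball_volume_fraction` is made up for this example.

```python
import math

def ball_volume_fraction(d):
    # volume of the d-ball of radius 1/2; the enclosing cube has volume 1
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * (1 / 2) ** d

for d in range(1, 11):
    print(d, ball_volume_fraction(d))
```

For instance, $d = 2$ gives $\pi/4$ and $d = 3$ gives $\pi/6$, and the values decrease quickly from there.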

$\unlhd$

**NUMERICAL CORNER:** We check our claim in a simulation. We generate standard Normal $d$-vectors using the `rng.normal(0,1,d)` function and plot the histogram of their $2$-norm.

In [None]:
def normal_shell(d, n):
    one_sample_norm = [LA.norm(rng.normal(0,1,d)) for _ in range(n)]
    plt.hist(one_sample_norm, bins=20)
    plt.xlim(0, np.max(one_sample_norm))
    plt.show()

We first plot it in one dimension.

In [None]:
normal_shell(1, 10000)

In higher dimension:

In [None]:
normal_shell(100, 10000)
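
Since the expected squared norm of a standard Normal $d$-vector is $d$, one would expect the norms to be close to $\sqrt{d}$ when $d$ is large. The following sketch compares the empirical mean and standard deviation of the norms to $\sqrt{d}$ for $d = 100$; it generates all samples at once as the rows of a matrix.

```python
import numpy as np
from numpy import linalg as LA
from numpy.random import default_rng

rng = default_rng(535)

d, n = 100, 10000
norms = LA.norm(rng.normal(0, 1, (n, d)), axis=1)  # one norm per row

print(np.sqrt(d))
print(norms.mean(), norms.std())
```

The spread of the norms is small compared to their typical size, consistent with the histogram above.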

$\unlhd$