In [2]:
# Initialize Otter
import otter
grader = otter.Notebook("ps6.ipynb")

In [3]:
import numpy as np

## Question 1: Down with `for` loops
Each of the problems below contains a function that uses `for` loops to perform a certain operation on arrays. Your job is to rewrite these functions using only NumPy array manipulations and library functions (e.g. `np.xxx()`). Do not use any `for` or `while` loops, iterators, generators, or list comprehensions in your solutions.

**1(a)** (3 pts) Return all of the rows of the integer matrix `A` in which every entry of the row is distinct:

In [4]:
def distinct_rows_py(A):
    'Return all rows of A that have completely distinct entries.'
    return np.array([a for a in A if len(set(a)) == len(a)])

In [5]:
A = np.eye(5)
distinct_rows_py(A)

array([], dtype=float64)

In [6]:
A = np.arange(9).reshape(3, 3)
distinct_rows_py(A)

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [7]:
A = np.array([
    [1, 2, 3],
    [4, 4, 4],
    [5, 6, 6]])
distinct_rows_py(A)

array([[1, 2, 3]])

In [8]:
def distinct_rows_np(A):
    distinct = np.apply_along_axis(lambda x: len(set(x)) == len(x), axis=1, arr=A)
    return A[distinct]
    ...

In [9]:
grader.check("1a")
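One loop-free way to do this, sketched here under the assumption that `A` is a 2-D array, is to sort each row and check that no two neighbouring entries are equal. The name `distinct_rows_sorted` is made up for illustration and is not part of the problem set.

```python
import numpy as np

def distinct_rows_sorted(A):
    """Return the rows of A whose entries are all distinct (hypothetical helper)."""
    A = np.asarray(A)
    # Sort each row so that equal entries end up next to each other.
    S = np.sort(A, axis=1)
    # A row has distinct entries exactly when no two sorted neighbours are equal.
    keep = np.all(S[:, 1:] != S[:, :-1], axis=1)
    return A[keep]

A = np.array([[1, 2, 3],
              [4, 4, 4],
              [5, 6, 6]])
print(distinct_rows_sorted(A))
```

Unlike the list-comprehension version, this sketch returns an empty array of shape `(0, ncols)` rather than shape `(0,)` when no row qualifies.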

**1(b)** (3 pts) Given a vector $v$ of length $n$, and an integer $0<k<n$, return the matrix
```
[[v[0], ..., v[k-1]],
 [v[1], ..., v[k]  ],
 [v[2], ..., v[k+1]],
 [ .     .       . ],
 [ .     .       . ],
 [ .     .       . ],
 [v[n-k], ..., v[n-1]]]
 ```

In [10]:
def sliding_stack_py(v, k):
    "Stack sliding windows of v of length k."
    rows = []
    for i in range(len(v) - k + 1):
        rows.append(v[i : (i + k)])
    return np.array(rows)

In [11]:
sliding_stack_py(np.array([1, 2, 3, 4, 5]), 3)

array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

In [12]:
def sliding_stack_np(v, k):
    A = np.zeros((len(v)-k+1,k),dtype=int)
    
    def count_n_slide(x,i=[-1]):
        i[0]+=1                                              # count how many times a function was run, start at -1 so first row gets 0
        return v[i[0]:(i[0]+k)]                              # return slicing from position according the times a function was run
 
    return np.apply_along_axis(count_n_slide, axis=1, arr=A) # apply the function on each row using an np function
    ...

In [13]:
grader.check("1b")
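The window matrix can also be built without any per-row function call by broadcasting a column of start positions against a row of offsets. This is a minimal sketch; `sliding_stack_idx` is a hypothetical name, and `v` is assumed to be one-dimensional.

```python
import numpy as np

def sliding_stack_idx(v, k):
    """Stack all length-k sliding windows of v (hypothetical helper)."""
    v = np.asarray(v)
    starts = np.arange(len(v) - k + 1)[:, None]   # shape (n-k+1, 1): first index of each window
    offsets = np.arange(k)[None, :]               # shape (1, k): position inside a window
    # Broadcasting gives an (n-k+1, k) index array; fancy indexing gathers the windows.
    return v[starts + offsets]

print(sliding_stack_idx(np.array([1, 2, 3, 4, 5]), 3))
```

In NumPy 1.20 and later, `np.lib.stride_tricks.sliding_window_view(v, k)` should give the same windows as a read-only view.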

**1(c)** (3 pts) Given a vector of non-negative integers `v`, with `max(v) = m`, return a vector `c` of length `m + 1` such that `c[i]` is the number of times that the integer `i` appears in `v`.

In [14]:
def digit_count_py(v):
    m = max(v)
    ret = np.zeros(m + 1, int)
    for vv in v:
        ret[vv] += 1
    return ret

In [15]:
v = np.array([0, 0, 1, 1, 2, 2, 2, 3, 4])
digit_count_py(v)

array([2, 2, 3, 1, 1])

In [16]:
def digit_count_np(v):
    return np.bincount(v).astype(int)
    ...

In [17]:
grader.check("1c")
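It may be tempting to translate the loop body `ret[vv] += 1` directly into `ret[v] += 1`. The short sketch below illustrates why that does not count repeats, and how the unbuffered `np.add.at` does; the variable names `buffered` and `unbuffered` are only for illustration.

```python
import numpy as np

v = np.array([0, 0, 1, 1, 2, 2, 2, 3, 4])

# Fancy-indexed in-place addition is buffered: each repeated index is incremented only once.
buffered = np.zeros(v.max() + 1, dtype=int)
buffered[v] += 1

# np.add.at applies the addition once per occurrence, matching the Python loop.
unbuffered = np.zeros(v.max() + 1, dtype=int)
np.add.at(unbuffered, v, 1)

print(buffered)
print(unbuffered)
```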

**1(d)** (3 pts) Call a square $n\times n$ matrix $A$ *countersymmetric* if $A_{ij} = A_{n+1-j,\,n+1-i}$ for all $1 \le i, j \le n$ (in zero-based array indexing, `A[i, j] == A[n-1-j, n-1-i]`). An example of such a matrix is:

$$
\begin{pmatrix}
4 & 3 & 2 & 1 & 0\\
8 & 7 & 6 & 5 & 1\\
11 & 10 & 9 & 6 & 2\\
13 & 12 & 10 & 7 & 3\\
14 & 13 & 11 & 8 & 4
\end{pmatrix}
$$

Write a function `is_countersym` that checks this property:

In [18]:
def is_countersym_py(A):
    "Returns True if A is countersymmetric"
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            if A[i, j] != A[n - j - 1, n - i - 1]:
                return False
    return True

In [19]:
cs_matrix = np.array([[ 4,  3,  2,  1,  0], [ 8,  7,  6,  5,  1], [11, 10,  9,  6,  2], [13, 12, 10,  7,  3], [14, 13, 11,  8,  4]])
is_countersym_py(cs_matrix)

True

In [20]:
def is_countersym_np(A):
    counter_sym = A.T                            # Transpose the matrix
    counter_sym = np.flip(counter_sym, axis=1)   # Flip over columns
    counter_sym = np.flip(counter_sym, axis=0)   # Flip over rows
    return np.all(counter_sym == A)
    ...

In [21]:
grader.check("1d")
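The transpose-and-flip idea can be written compactly with slicing: reversing both axes and then transposing maps entry `[i, j]` to `[n-1-j, n-1-i]`. A minimal sketch, with the hypothetical name `is_countersym_slice`:

```python
import numpy as np

def is_countersym_slice(A):
    """True if A[i, j] == A[n-1-j, n-1-i] for all i, j (hypothetical helper)."""
    A = np.asarray(A)
    # A[::-1, ::-1] reverses rows and columns; .T then swaps the two indices.
    return bool(np.array_equal(A, A[::-1, ::-1].T))

cs_matrix = np.array([[ 4,  3,  2,  1,  0],
                      [ 8,  7,  6,  5,  1],
                      [11, 10,  9,  6,  2],
                      [13, 12, 10,  7,  3],
                      [14, 13, 11,  8,  4]])
print(is_countersym_slice(cs_matrix))
print(is_countersym_slice(np.arange(9).reshape(3, 3)))
```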

**1(e)** (5 pts extra credit) [Sudoku](https://en.wikipedia.org/wiki/Sudoku) is a number game played on a $9\times 9$ grid, arranged into nine $3\times 3$ subgrids. In order to win Sudoku, you must fill in the numbers 1-9 exactly once in every row, column, and $3\times 3$ subgrid. Here is an example of a winning solution:

![sudoku](sudoku_solved.png)

*Generalized Sudoku* is played on an $n^2 \times n^2$ grid, arranged into $n^2$ subgrids of size $n\times n$. A winning solution contains each of the numbers 1 through $n^2$ exactly once in every row, column, and $n\times n$ subgrid.

Write a function that takes an $n^2\times n^2$ integer array and returns `True` if it is a winning (generalized) Sudoku solution.

In [34]:
def sudoku_win_py(A):
    "Returns True if A is a winning Sudoku board"
    import math
    import itertools

    n = math.isqrt(A.shape[0])
    s = set(range(1, n ** 2 + 1))
    horiz, vert = [all(set(row) == s for row in M) for M in (A, A.T)]
    if not (horiz and vert):
        return False
    for h, v in itertools.product(range(0, n * n, n), repeat=2):
        block = A[h : h + n, v : v + n]
        if set(block.flat) != s:
            return False
    return True

In [35]:
good_board = [[3, 1, 6, 2, 4, 9, 7, 8, 5], 
              [2, 7, 8, 3, 6, 5, 9, 1, 4], 
              [4, 9, 5, 7, 8, 1, 6, 2, 3], 
              [8, 5, 1, 9, 7, 4, 3, 6, 2], 
              [9, 2, 3, 8, 1, 6, 4, 5, 7], 
              [6, 4, 7, 5, 3, 2, 1, 9, 8], 
              [7, 6, 2, 1, 5, 3, 8, 4, 9], 
              [1, 3, 9, 4, 2, 8, 5, 7, 6], 
              [5, 8, 4, 6, 9, 7, 2, 3, 1]]
sudoku_win_py(np.array(good_board, dtype=int))

True

In [36]:
bad_board = np.arange(81).reshape(9, 9)
sudoku_win_py(bad_board)

False

In [66]:
def sudoku_win_np(A):
    n = np.sqrt(A.shape[0]).astype(int)
    s = set(range(1, n ** 2 + 1))
    
    def uni(axis):
        return set(axis) == s
    
    # check rows
    row_check = np.all(np.apply_along_axis(uni, axis=1,arr=A))
    if row_check == False:
        return False
    
    # check columns
    col_check = np.all(np.apply_along_axis(uni, axis=0,arr=A))
    if col_check == False:
        return False
    
    # check subgrids
    rows = A.strides[0]
    cols = A.strides[1]

    subgrids = np.lib.stride_tricks.as_strided(A, shape=(n, n, n, n), strides=(rows*n, cols*n, rows, cols))
    
    flat_grids = subgrids.flatten()
    row_grids = flat_grids.reshape(n**2, n**2)      # one row per subgrid; works for any n, not only 9x9 boards
    
    grid_check = np.all(np.apply_along_axis(uni, axis=1,arr=row_grids))
    
    if grid_check == False:
        return False
    
    return True

    ...

In [67]:
grader.check("1e")
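The three checks (rows, columns, subgrids) can be folded into one comparison by stacking every group of $n^2$ cells as a row and sorting. This is only a sketch under the assumption that the input is a square integer array; `sudoku_win_sorted` is a hypothetical name.

```python
import numpy as np

def sudoku_win_sorted(A):
    """True if A is a winning generalized Sudoku board (hypothetical helper)."""
    A = np.asarray(A)
    N = A.shape[0]
    n = int(round(np.sqrt(N)))
    if A.shape != (N, N) or n * n != N:
        return False
    # reshape gives axes (block row, row in block, block col, col in block);
    # swapping the middle axes groups the cells of each n-by-n block together.
    blocks = A.reshape(n, n, n, n).swapaxes(1, 2).reshape(N, N)
    groups = np.concatenate([A, A.T, blocks])      # 3N groups of N cells each
    # Every sorted group must equal 1, 2, ..., N.
    return bool(np.all(np.sort(groups, axis=1) == np.arange(1, N + 1)))

good_4x4 = np.array([[1, 2, 3, 4],
                     [3, 4, 1, 2],
                     [2, 1, 4, 3],
                     [4, 3, 2, 1]])
latin_only = np.array([[1, 2, 3, 4],
                       [2, 3, 4, 1],
                       [3, 4, 1, 2],
                       [4, 1, 2, 3]])   # rows and columns are fine, blocks are not
print(sudoku_win_sorted(good_4x4), sudoku_win_sorted(latin_only))
```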

## Question 2: $k$-means clustering

$k$-means is a fundamental algorithm for clustering multivariate data. The inputs to the algorithm are:
- An $n\times p$ data matrix $X$ consisting of $n$ observations of a $p$-dimensional feature vector, and
- A $k\times p$ matrix $C$ containing initial guesses for each of the $k$ cluster centers.

The algorithm proceeds by iteratively a) assigning each point to the nearest cluster center, and b) recomputing the cluster centers as the mean of all of the currently assigned points. Here is a partial implementation:

In [39]:
def kmeans(X, C):
    """
    K-means algorithm.

    Args:
        - X: ndarray, shape (n, p), n observations of p-dimensional feature vector
        - C: ndarray, shape (k, p), k initial cluster centers

    Returns:
        Tuple of length two:
        The first entry is integer ndarray, shape (n), cluster assignments for each data point
        The second entry is ndarray, shape (k, p), centers of each cluster
    """
    assert X.shape[1] == C.shape[1]  # p should match
    while True:
        new_assignments = nearest_cluster(X, C)
        try:
            if np.all(new_assignments == assignments):
                # converged
                return assignments, C
        except NameError:  # first iteration, no assignments
            pass 
        assignments = new_assignments
        C = compute_centroids(X, assignments)

You will finish implementing this algorithm by completing the missing functions `nearest_cluster()` and `compute_centroids()` below. Note: as in Question 1, do not use any loops, iterators, or comprehensions.

**2(a)** (3 pts) Implement the function `nearest_cluster`. It should take two array arguments, the data points `X` and the cluster centers `C`, and return an integer array giving the index in `C` which is nearest to each point in `X`.

In [40]:
def nearest_cluster(X, C):
    """
    For each point in X, find the nearest point in C.

    Args:
        X: ndarray, shape (n, p), n points of dimension p.
        C: ndarray, shape (k, p), k points of dimension p.

    Returns:
        Integer array of length n, [j[1], j[2], ..., j[n]], such that |X[i] - C[j[i]]| <= |X[i] - C[ell]| for 1 <= ell <= k.
    """
    diffs = (X - C[:, None])                        # pairwise differences, shape (k, n, p): diffs[j, i] = X[i] - C[j]

    sq_diffs = diffs**2                             # square the differences so that every term is non-negative
    euc_dist = np.sqrt(sq_diffs)                    # elementwise root, i.e. |X[i] - C[j]| per coordinate
    
    distances = euc_dist.sum(axis=2)                # sum over the p coordinates: this is the Manhattan (L1) distance, shape (k, n)
                                                    # (a Euclidean distance would sum the squares before taking the root)
    return np.argmin(distances, axis=0).astype(int) # for each point (column), the index of the closest center
    ...

In [41]:
X = np.array([[-101, 1, 3, 4], [1,1,2,3], [1, 1.01,2,1], [1.5, 1.6,3,4], [2.5, -3,3,4], [2.5, 2.7,4,4], [3, 33,4,4]])
C = np.array([[1, 1,1,1], [2,2,2,2], [3,3,3,3]])

nearest_cluster(X,C)

array([0, 0, 0, 1, 2, 2, 2])

In [42]:
grader.check("2a")
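Note that the function above takes the square root of each squared coordinate difference before summing, which yields the sum of absolute differences (Manhattan distance), not the Euclidean distance. A sketch of a Euclidean variant follows; the name `nearest_cluster_sq` is hypothetical, and the inputs are cast to float as a precaution against wraparound when the data are stored as unsigned integers.

```python
import numpy as np

def nearest_cluster_sq(X, C):
    """Index of the Euclidean-nearest row of C for each row of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    # diffs[i, j] = X[i] - C[j], shape (n, k, p)
    diffs = X[:, None, :] - C[None, :, :]
    # Squared Euclidean distances; the square root is monotone, so argmin is unchanged.
    sq_dist = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dist, axis=1)

# A point where the two distances disagree: (3, 3) is Euclidean-nearest to the origin,
# while (0, 5) is nearest in Manhattan distance.
X_demo = np.array([[0.0, 0.0]])
C_demo = np.array([[3.0, 3.0], [0.0, 5.0]])
print(nearest_cluster_sq(X_demo, C_demo))
```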

**2(b)** (3 pts) Implement the function `compute_centroids`. It should take two array arguments, the data points `X` and the assignment array `a`, and return a $k \times p$ array containing the centroids (averages) of the points assigned to each of the clusters $0, \dots, k-1$. (You may assume that every entry of $a$ is between $0$ and $k-1$, inclusive.)

In [43]:
def compute_centroids(X, a):
    A = np.zeros((len(np.unique(a)),X.shape[1]),dtype=int)
    
    def count_n_mean(x,i=[-1]):
        i[0]+=1                   # count how many times a function was run, start at -1 so first row gets 0
        x = X[a==i[0]]            # change the row to array of numbers according to a
        mean = np.mean(x,axis=0)  # calculate mean of the numbers
        return mean
    
    return np.apply_along_axis(count_n_mean, axis=1, arr=A) # use numpy function to go over rows
    ...

In [44]:
grader.check("2b")
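The centroids can also be computed as per-cluster sums divided by per-cluster counts, which avoids calling a Python function once per cluster. A minimal sketch, assuming the labels in `a` are integers in `0..k-1`; the name `compute_centroids_sums` and the optional `k` argument are made up for illustration.

```python
import numpy as np

def compute_centroids_sums(X, a, k=None):
    """Mean of the rows of X within each cluster label in a (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    a = np.asarray(a)
    if k is None:
        k = int(a.max()) + 1                 # assumption: the largest label is k - 1
    sums = np.zeros((k, X.shape[1]))
    np.add.at(sums, a, X)                    # sums[a[i]] += X[i] for every i, repeats included
    counts = np.bincount(a, minlength=k)     # number of points in each cluster
    return sums / counts[:, None]

X_demo = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]])
a_demo = np.array([0, 0, 1])
print(compute_centroids_sums(X_demo, a_demo))
```

A cluster with no assigned points has a count of zero, so its row comes out as `nan`; how to handle that case is left open here.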

**2(c)** (5 pts) The performance of the $k$-means algorithm is known to depend heavily on the starting point (the initial cluster centers `C` passed in as the second argument). In some cases, using a "good" starting point can dramatically improve the performance of the algorithm.

The $k$-means++ algorithm is designed to find such a good starting point. [According to Wikipedia](https://en.wikipedia.org/wiki/K-means%2B%2B), the steps of $k$-means++ are:

1. Choose one center uniformly at random among the data points.
2. For each data point $x$ not chosen yet, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$.
4. Repeat Steps 2 and 3 until $k$ centers have been chosen.

Implement this algorithm using the skeleton provided below. As before, your implementation should only use Numpy functions--no additional loops or comprehensions. 

**Note**: To ensure reproducibility, the parts of the algorithm that rely on randomness are provided for you. Your job is to fill in the missing lines necessary to complete the algorithm.

In [45]:
def kmeanspp(X, k, rng):
    """
    k-means++ algorithm.

    Args:
        - X: ndarray, shape (n, p), as above.
        - k, the number of clusters.
        - rng: instance of np.random.Generator().

    Returns:
        ndarray, shape (k, p), cluster centers.
    """
    n, p = X.shape
    C = np.zeros((k, p))
    # step 1
    j = rng.choice(n)
    C[0] = X[j]
    step = 0
    for i in range(1, k):
        
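        # distances from every point to every row of C (sum of absolute differences), shape (k, n)
        diffs = (X - C[:, None])
        euc_dist = np.sqrt(diffs**2).sum(axis=2)
        
        # vector of probabilities, proportional to the distance to the most recently chosen center C[step]
        w = euc_dist[step]/np.sum(euc_dist[step])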
        
        # step 3
        j = rng.choice(n, p=w)
        C[i] = X[j]
        step += 1
        
    return C

In [46]:
grader.check("2c")
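For comparison, Steps 2 and 3 as listed above (distance to the *nearest* chosen center, weights proportional to $D(x)^2$) could be sketched as follows. This is an illustration under stated assumptions, not the graded solution: `kmeanspp_sketch` is a hypothetical name, Euclidean distance is assumed, and the data are assumed to contain at least `k` distinct points.

```python
import numpy as np

def kmeanspp_sketch(X, k, rng):
    """k-means++ seeding following the four listed steps (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    C = np.zeros((k, p))
    C[0] = X[rng.choice(n)]                          # step 1: uniform first center
    for i in range(1, k):
        # step 2: squared distance from each point to each of the i centers chosen so far
        diffs = X[:, None, :] - C[None, :i, :]       # shape (n, i, p)
        d2 = np.sum(diffs ** 2, axis=2).min(axis=1)  # D(x)^2, zero for chosen points
        # step 3: sample the next center with probability proportional to D(x)^2
        w = d2 / d2.sum()
        C[i] = X[rng.choice(n, p=w)]
    return C

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [9.0, 0.0]])
C_demo = kmeanspp_sketch(X_demo, 4, np.random.default_rng(0))
print(C_demo)
```

Because already-chosen points have $D(x)^2 = 0$, they receive zero probability, so with four distinct points and `k = 4` every point ends up as a center exactly once.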

**2(d)** (2 pts) In order to measure how good a clustering is, we can define the *within-class variance* 

$$ V(\mathbf{X}, \mathbf{a}, \mathbf{C}) = \sum_{i=1}^n \| \mathbf{x}_i - \mathbf{c}_{a_i} \|^2,$$

where the $i$-th element of $\mathbf{a}=\{a_1,\dots,a_n\}$ is the cluster assignment of observation $i$, and $\mathbf{C}=(\mathbf{c}_1,\dots,\mathbf{c}_k)$ are the centers of each cluster. Thus, $V(\mathbf{X}, \mathbf{a}, \mathbf{C})$ is the sum of the squared distance from each data point to the center of its assigned cluster.

Implement this function. (Again, no loops, just use Numpy functions.)

In [47]:
def V(X, a, C):
    
    def count_n_deduct(x,i=[-1]):
        i[0]+=1                   # count how many times a function was run, start at -1 so first row gets 0
        diff = (x-C[a[i[0]]])**2   # squared differences between row x and its assigned center C[a[i]]
        return diff
    
    return np.sum(np.apply_along_axis(count_n_deduct, axis=1, arr=X)) # use numpy function to go over rows
    ...

In [48]:
grader.check("2d")
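Since `C[a]` gathers the assigned center for every observation at once, the within-class variance reduces to a single expression. A minimal sketch with the hypothetical name `V_vectorized`:

```python
import numpy as np

def V_vectorized(X, a, C):
    """Sum of squared distances from each point to its assigned center (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    # C[a] has shape (n, p): row i is the center assigned to observation i.
    return float(np.sum((X - C[a]) ** 2))

X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]])
a_demo = np.array([0, 0, 1])
C_demo = np.array([[1.0, 0.0], [5.0, 4.0]])
print(V_vectorized(X_demo, a_demo, C_demo))
```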

<!-- BEGIN QUESTION -->

**2(e)** (5 pts) Recall from lecture the file `mnist.npz`, which contains the labeled image data for handwritten digits.

In [49]:
mnist = np.load("mnist.npz")
mnist["images"].shape

(60000, 784)

We will experiment with clustering these data. For memory and performance reasons, we will only look at the first 1000 images:

In [50]:
X = mnist["images"][:1000]

Which performs better, $k$-means++ or random initialization? Do the clusters make sense to you? How do the clusters relate to the true labels given in `mnist['labels']`? What are some examples of images where the clustering is nearly ambiguous (meaning they were almost assigned to another cluster)?

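In terms of within-class variance, random initialization and $k$-means++ performed very similarly (about $2.47\times 10^9$ versus $2.46\times 10^9$ in the run below), and the $k$-means++ result depended heavily on the chosen seed. When each cluster is labeled with its most common true digit and compared to the true labels, the clustering got 57.5% of the images correct with $k$-means++ and 45.8% with random initialization. On these 1,000 observations, the algorithm had trouble differentiating between 4 and 9, 0 and 6, and 8 and 2. The true labels of the images in each $k$-means++ cluster are printed in the dictionary below, which can be used to check which digits were mixed together.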
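One way to look for nearly ambiguous images numerically is to compare each point's distance to its nearest center with its distance to the runner-up; a small gap means the point was almost assigned elsewhere. The following is a rough sketch, assuming Euclidean distance and at least two centers; `ambiguity_margin` is a hypothetical helper.

```python
import numpy as np

def ambiguity_margin(X, C):
    """Gap between the second-nearest and nearest center distances (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    d = np.sqrt(np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2))  # shape (n, k)
    # After partitioning at position 1, columns 0 and 1 hold the two smallest distances in order.
    two_smallest = np.partition(d, 1, axis=1)[:, :2]
    return two_smallest[:, 1] - two_smallest[:, 0]

X_demo = np.array([[0.0, 0.0], [1.0, 0.0]])
C_demo = np.array([[0.0, 0.0], [2.0, 0.0]])
margins = ambiguity_margin(X_demo, C_demo)
print(margins, np.argsort(margins))
```

Applied to the fitted centers, something like `np.argsort(ambiguity_margin(X, C_kpp))[:10]` would list the indices of the ten images with the smallest margin.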

In [51]:
n, p = X.shape
labels = mnist['labels'][:1000]
cnt_labels = len(np.unique(labels))

# Random initialization
random_indices = np.random.choice(n, size=cnt_labels, replace=False)
random_points = X[random_indices, :]

a_rand, C_rand = kmeans(X,random_points)

variance_random = V(X, a_rand, C_rand)
print('Variance with random initialization:', variance_random)

rng = np.random.default_rng(1234567) # heavily dependent on the seed
kpp_points = kmeanspp(X,cnt_labels , rng)

a_kpp, C_kpp = kmeans(X,kpp_points)

variance_kpp = V(X, a_kpp, C_kpp)
print('Variance K++:',variance_kpp)


rand_check = {}
for i in range(0,10):
    if i in rand_check:
        rand_check[i].append(labels[np.where(a_rand == i)])
    else:
        rand_check[i] = []
        rand_check[i].append(labels[np.where(a_rand == i)])

rand_real = [np.bincount(x[0]).argmax() for x in rand_check.values()]
print('Guesses with random initialization, based on the highest count:\n',sorted(rand_real))

kpp_check = {}
for i in range(0,10):
    if i in kpp_check:
        kpp_check[i].append(labels[np.where(a_kpp == i)])
    else:
        kpp_check[i] = []
        kpp_check[i].append(labels[np.where(a_kpp == i)])

kpp_real = [np.bincount(x[0]).argmax() for x in kpp_check.values()]

print('Guesses with k++, based on the highest count:\n',sorted(kpp_real))

# replace values in a by values here and see the %
a_rand_real = []
for a in a_rand:
    a_rand_real.append(rand_real[a])
    
a_kpp_real = []
for a in a_kpp:
    a_kpp_real.append(kpp_real[a])
 
print('% correct using random initialization:',sum(a_rand_real == labels)/len(labels))
print('% correct using kpp:',sum(a_kpp_real == labels)/len(labels))

Variance with random initialization: 2468322256.2957196
Variance K++: 2455865952.223058
Guesses with random initialization, based on the highest count:
 [0, 0, 1, 1, 2, 3, 3, 4, 6, 7]
Guesses with k++, based on the highest count:
 [0, 1, 2, 3, 4, 4, 5, 6, 7, 8]
% correct using random initialization: 0.458
% correct using kpp: 0.575


In [52]:
for k, v in kpp_check.items():
    print('%s:' % k, v )

0: [array([4, 4, 4, 4, 9, 7, 4, 4, 7, 4, 6, 9, 9, 4, 4, 9, 9, 4, 9, 9, 7, 6,
       5, 8, 4, 7, 5, 4, 6, 4, 4, 9, 4, 4, 9, 4, 7, 9, 4, 4, 4, 4, 6, 4,
       4, 9, 9, 6, 6, 4, 4], dtype=uint8)]
1: [array([3, 0, 9, 3, 3, 3, 5, 3, 5, 0, 0, 0, 5, 3, 5, 5, 3, 5, 0, 5, 8, 8,
       3, 8, 3, 5, 0, 8, 5, 5, 0, 0, 5, 5, 5, 5, 3, 3, 5, 6, 3, 5, 3, 3,
       5, 5, 0, 5, 5, 3, 3], dtype=uint8)]
2: [array([6, 0, 6, 0, 6, 6, 6, 6, 0, 6, 6, 6, 0, 6, 6, 6, 6, 6, 0, 2, 2, 6,
       6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 6, 6, 6, 0, 6, 0, 0, 6, 6, 6,
       6, 0, 6, 6, 0, 6, 0, 6, 0, 0, 6, 6, 6, 0, 6, 6, 6, 6, 0, 6, 0, 6,
       6, 6, 0, 6, 6, 0, 6, 0, 0, 6, 0, 6, 0, 0, 0, 0, 6, 0, 0, 6, 6, 0,
       0, 6], dtype=uint8)]
3: [array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2], dtype=uint8)]
4: [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

<!-- END QUESTION -->

## Question 3: Working with pandas DataFrames

In this problem, you'll get practice working with pandas `DataFrames`, reading
them into and out of memory, changing their contents and performing
aggregation operations. We'll use the file `iris.csv` included with this problem set to practice.
**Note:** for the sake of consistency, please use the CSV included with the problem set, and not one from elsewhere.

In [53]:
import pandas as pd

<!-- BEGIN QUESTION -->

**3(a)** (2 pts) Read the data into a variable called `iris`. How many data points are
    there in this data set? What are the data types of the columns? What
    are the column names? The column names correspond to flower species
    names, as well as four basic measurements one can make of a flower:
    the width and length of its petals and the width and length of its
    sepal (the part of the plant that supports and protects the flower
    itself). How many species of flower are included in the data? Show your work by including the
    pandas commands you used to figure out the answers.

a. There are 150 data points. b. The four measurement columns are `float64` and `Species` is `object` (string). c. The column names are `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`. d. There are 3 different species in the data.

In [54]:
iris = pd.read_csv('iris.csv')
print('DF info:')
print(iris.info(),'\n')
print('No. of species:')
print(iris.Species.nunique())
...

DF info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal.Length  150 non-null    float64
 1   Sepal.Width   150 non-null    float64
 2   Petal.Length  150 non-null    float64
 3   Petal.Width   150 non-null    float64
 4   Species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None 

No. of species:
3


Ellipsis

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**3(b)** It is now known that this dataset contains errors
    in two of its rows (see the documentation at
    <https://archive.ics.uci.edu/ml/datasets/Iris>). Using 1-indexing,
    these errors are in the 35th and 38th rows. The 35th row should read
    `4.9,3.1,1.5,0.2,"setosa"`, 
    where the fourth feature is incorrect as it appears in the file,
    and the 38th row should read `4.9,3.6,1.4,0.1,"setosa"`, where the second and third features
    are incorrect as they appear in the file. Correct these entries of
    your DataFrame.


In [55]:
print(iris.iloc[34],'\n')
print(iris.iloc[37],'\n')
print('Entries are already correct.')
...

Sepal.Length       4.9
Sepal.Width        3.1
Petal.Length       1.5
Petal.Width        0.2
Species         setosa
Name: 34, dtype: object 

Sepal.Length       4.9
Sepal.Width        3.6
Petal.Length       1.4
Petal.Width        0.1
Species         setosa
Name: 37, dtype: object 

Entries are already correct.


Ellipsis
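In the copy of the file used here the two rows already match the corrected values. If they had not, label-based assignment with `.loc` would be one way to fix them. The sketch below uses a small stand-in frame, `demo`, with deliberately wrong values invented for illustration; the 1-indexed rows 35 and 38 correspond to index labels 34 and 37.

```python
import pandas as pd

# Stand-in for the two affected rows; the wrong values are made up for this example.
demo = pd.DataFrame(
    {
        "Sepal.Length": [4.9, 4.9],
        "Sepal.Width": [3.1, 3.1],
        "Petal.Length": [1.5, 1.5],
        "Petal.Width": [0.1, 0.1],
        "Species": ["setosa", "setosa"],
    },
    index=[34, 37],
)

# Row 35 (label 34): fix the fourth feature.
demo.loc[34, "Petal.Width"] = 0.2
# Row 38 (label 37): fix the second and third features in one assignment.
demo.loc[37, ["Sepal.Width", "Petal.Length"]] = [3.6, 1.4]

print(demo)
```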

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**3(c)** The iris dataset is commonly used in machine learning as a
        proving ground for clustering and classification algorithms.
        Some researchers have found it useful to use two additional features,
        called *Petal ratio* and *Sepal ratio*,
        defined as the ratio of the petal length to petal width
        and the ratio of the sepal length to sepal width, respectively.
        Add two columns to your DataFrame corresponding to these two
        new features.
        Name these columns
        `Petal.Ratio` and `Sepal.Ratio`, respectively.

In [56]:
iris['Petal.Ratio'] = iris['Petal.Length'] / iris['Petal.Width']
iris['Sepal.Ratio'] = iris['Sepal.Length'] / iris['Sepal.Width']
iris.head()

|   | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Petal.Ratio | Sepal.Ratio |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 7.0 | 1.457143 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 7.0 | 1.633333 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 6.5 | 1.46875 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 7.5 | 1.483871 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 7.0 | 1.388889 |


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**3(d)** (2 pts)
Use a pandas aggregate operation to determine the
        mean, median, minimum, maximum and standard deviation of the
        petal and sepal ratio for each of the three species in the data set.
        **Note**: you should be able to get all five numbers in a single
        table (indeed, in a single line of code)
        using a well-chosen group-by or aggregate operation.

In [57]:
iris.groupby('Species').agg({'Petal.Ratio': ['mean', 'median', 'min', 'max', 'std'], 'Sepal.Ratio': ['mean', 'median', 'min', 'max', 'std']})

| Species | Petal.Ratio mean | Petal.Ratio median | Petal.Ratio min | Petal.Ratio max | Petal.Ratio std | Sepal.Ratio mean | Sepal.Ratio median | Sepal.Ratio min | Sepal.Ratio max | Sepal.Ratio std |
|---|---|---|---|---|---|---|---|---|---|---|
| setosa | 6.908 | 7.0 | 2.666667 | 15.0 | 2.854545 | 1.470188 | 1.463063 | 1.268293 | 1.956522 | 0.11875 |
| versicolor | 3.242837 | 3.240385 | 2.666667 | 4.1 | 0.312456 | 2.160402 | 2.16129 | 1.764706 | 2.818182 | 0.228658 |
| virginica | 2.780662 | 2.666667 | 2.125 | 4.0 | 0.407367 | 2.230453 | 2.16954 | 1.823529 | 2.961538 | 0.246992 |
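When the same statistics are wanted for several columns, the list of aggregations need not be repeated per column: selecting the columns first and passing one list to `agg` should produce the same two-level table. A small sketch on a made-up frame (`demo` and its values are invented for illustration):

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Species": ["a", "a", "b", "b"],
        "Petal.Ratio": [1.0, 3.0, 2.0, 6.0],
        "Sepal.Ratio": [2.0, 4.0, 1.0, 1.0],
    }
)

stats = ["mean", "median", "min", "max", "std"]
# One list of statistics is applied to every selected column within each group.
summary = demo.groupby("Species")[["Petal.Ratio", "Sepal.Ratio"]].agg(stats)
print(summary)
```

The result has one row per species and a column for each (feature, statistic) pair; note that pandas' `std` uses the sample standard deviation (`ddof=1`) by default.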


<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [58]:
grader.check_all()

1a results: All test cases passed!

1b results: All test cases passed!

1c results: All test cases passed!

1d results: All test cases passed!

1e results: All test cases passed!

2a results: All test cases passed!

2b results: All test cases passed!

2c results: All test cases passed!

2d results: All test cases passed!

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Upload this .zip file to Gradescope for grading.

In [59]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)