# Dataset Statistics

The MNIST dataset is a collection of handwritten digit images that has been modified from two datasets originally collected by the National Institute of Standards and Technology in the United States.

It is a popular dataset and can be easily loaded in a Google Colab notebook using the provided code snippet:
```python
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```
The dataset contains every digit from 0 to 9.

Could you help us determine which of the following statements are correct about the dataset?
+ The digit 1 has the most images in the dataset, totaling 7,877 images.

+ The digit 0 has the least images in the test set, totaling 600 images.

+ The train and test sets contain the same number of images.

+ The digit 3 has 6,131 images in the train set.

## Official Solution
We need to write some code to answer this question. Let's start by concatenating the train and test sets:
```python
import numpy as np
labels = np.concatenate((y_train, y_test))
```
Notice that we don't need to worry about `X_train` and `X_test` because those arrays contain the images. We can answer the question by looking at the labels only.

NumPy's `unique()` function lets us group and count the labels. It returns the distinct values in sorted order, so position `i` in the counts corresponds to digit `i`:
```python
digit, count = np.unique(labels, return_counts=True)
```
If we print the count corresponding to digit 1, we will find out that there are 7,877 images:
```python
print(count[1])
```
To see how frequent each digit is, we can print the entire `count` array and get the following result: `[6903, 7877, 6990, 7141, 6824, 6313, 6876, 7293, 6825, 6958]`. Notice that digit 1 is indeed the most frequent in the dataset, so the first statement is correct.
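
Rather than reading the array by eye, the most and least frequent digits can be picked out programmatically. The following is a minimal sketch that assumes the combined counts quoted above, hard-coded here so the snippet runs without downloading the dataset; the variable names `most_common_digit` and `least_common_digit` are illustrative only.

```python
import numpy as np

# Combined train + test counts per digit, copied from the result quoted above.
# Index i holds the count for digit i.
count = np.array([6903, 7877, 6990, 7141, 6824, 6313, 6876, 7293, 6825, 6958])

most_common_digit = int(np.argmax(count))   # index of the largest count
least_common_digit = int(np.argmin(count))  # index of the smallest count

print(most_common_digit, int(count[most_common_digit]))    # 1 7877
print(least_common_digit, int(count[least_common_digit]))  # 5 6313
print(int(count.sum()))                                    # 70000
```

The total of 70,000 matches the 60,000 training and 10,000 test images.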

We can check how many instances of digit 0 are in the test set with the following code, which prints 980. That is not 600, so the second statement is incorrect:
```python
print(np.where(y_test == 0)[0].shape[0])
```
We can find the total number of images in each set by printing `y_train.shape[0]` and `y_test.shape[0]`. The results are 60,000 and 10,000, respectively, so the two sets do not contain the same number of images and the third statement is incorrect.

Finally, we can check how many instances of digit 3 are in the train set with the following code, which prints 6131 and confirms the fourth statement:
```python
print(np.where(y_train == 3)[0].shape[0])
```
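
The `np.where(...)[0].shape[0]` pattern works, but NumPy offers more direct ways to count matches. The sketch below shows two alternatives, `np.count_nonzero` and `np.bincount`, on a small toy array; `toy_labels` is a hypothetical stand-in for `y_train` or `y_test`, not real MNIST data.

```python
import numpy as np

# Toy stand-in for y_train / y_test (hypothetical values, not real MNIST labels).
toy_labels = np.array([0, 3, 3, 1, 0, 3, 9])

# count_nonzero counts the True entries of the boolean mask directly.
threes = int(np.count_nonzero(toy_labels == 3))

# bincount returns one count per integer value; minlength=10 keeps a slot
# for every digit 0-9 even when a digit does not appear in the array.
per_digit = np.bincount(toy_labels, minlength=10)

print(threes)              # 3
print(per_digit.tolist())  # [2, 1, 0, 3, 0, 0, 0, 0, 0, 1]
```

With the real labels, `np.bincount(y_test, minlength=10)` would give every per-digit test count in one call, which makes it easy to check several statements at once.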

## My Solution

I simply used the `Counter` class from **Python**'s *collections* module:

```python
# get necessary libs
from collections import Counter
from keras.datasets import mnist

# get train/test split
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# get count for train/test labels
y_train_count = Counter(y_train)
y_test_count = Counter(y_test)
```

### Answer 1
*The digit 1 has the most images in the dataset, totaling 7,877 images:* `True`

```python
# add both counter objects to see total dataset counts
y_train_count + y_test_count
```
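
Instead of scanning the combined counter by eye, `Counter` can report the top entry itself. This is a minimal sketch on toy label lists (`toy_train` and `toy_test` are hypothetical stand-ins for `y_train` and `y_test`); it assumes nothing beyond the standard library.

```python
from collections import Counter

# Toy label lists standing in for y_train / y_test (hypothetical values).
toy_train = [1, 1, 3, 0, 1, 3]
toy_test = [0, 1, 5]

train_count = Counter(toy_train)
test_count = Counter(toy_test)

# Adding two Counters sums the counts key by key.
total_count = train_count + test_count

# most_common(1) returns a one-element list holding the (label, count) pair
# with the highest count.
top_label, top_count = total_count.most_common(1)[0]

# Counter has no "least common" helper, so take the minimum over its items.
rare_label, rare_count = min(train_count.items(), key=lambda item: item[1])

print(top_label, top_count)    # 1 4
print(rare_label, rare_count)  # 0 1
```

Applied to the real counters, `(y_train_count + y_test_count).most_common(1)` would return the most frequent digit together with its total.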

### Answer 2
*The digit 0 has the least images in the test set, totaling 600 images:* `False`

```python
# check test set counts
y_test_count
```

### Answer 3
*The train and test sets contain the same number of images:* `False`

```python
# look at the shape or lengths
len(y_train), len(y_test)
```

### Answer 4
*The digit 3 has 6,131 images in the train set:* `True`

```python
# look at train set counts
y_train_count
```