In [35]:
import pandas as pd
import numpy as np

### Exploring ways to return the most frequent number in a sequence

It's a fairly common task: you have a collection of objects and need to find which one appears most often.<br>
This can be done in many ways <u>using existing functions</u> (we won't be writing our own algorithms here). So let's find out which way is the fastest!

N.B. Sometimes several objects will tie for most frequent. In that case we just pick the first one; handling ties is not our topic here.

In [38]:
SET_SIZE = 1000
MAX_NUM = 200

# first we create a set of SET_SIZE random integers in the range [0, MAX_NUM)
set = pd.DataFrame(data=np.array(np.random.randint(MAX_NUM, size=SET_SIZE)))
display(set[:7])

     0
0   76
1  170
2  140
3  199
4  153
5  145
6   99


#### Using pandas 'mode' method

The first obvious idea is to use the function defined for exactly this task: **mode**. The mode is the most frequent value (or values) in a distribution. Luckily, pandas provides it.
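
To see why the timed cell ends in `[0][0]`, here is a minimal sketch on a tiny hand-made frame (the toy data and the names `df`, `modes` and `most_frequent` are mine, not part of the benchmark). `mode()` gives back a DataFrame, so we first pick column `0` and then row `0`.

```python
import pandas as pd

# Hypothetical toy data, not the random set used for timing
df = pd.DataFrame([3, 1, 3, 2, 1, 3])

modes = df.mode()            # a DataFrame with one row per mode
most_frequent = modes[0][0]  # column labelled 0, then row labelled 0
print(most_frequent)         # 3
```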

In [41]:
%%timeit -n10000
set.mode()[0][0]

202 μs ± 55.4 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


> Results: <br> 215 μs ± 58 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
> ⚠ The exact timings will differ from run to run

#### Using pandas 'value_counts' method

We could use **value_counts()** and take the most frequent value from its result.
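
As far as I can tell, calling `value_counts()` on a whole DataFrame labels each count with the entire row, which is why the timed cell needs a trailing `[0]`. A small sketch of a variant that selects the column first and so gets a plain value back (toy data and names are mine, and this variant was not timed here):

```python
import pandas as pd

# Hypothetical toy data, not the random set used for timing
df = pd.DataFrame([3, 1, 3, 2, 1, 3])

counts = df[0].value_counts()  # Series: index = value, data = its count
print(counts.idxmax())         # 3
```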

In [45]:
%%timeit -n10000
set.value_counts().idxmax()[0]

410 μs ± 105 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


> Results: <br> 486 μs ± 33.5 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <br>
About twice as slow as the mode method!

#### Using numpy's 'unique' method

It turns out numpy has functions that can help here too. We could use 'np.unique', which returns the unique values and, if asked, their counts :)
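
A quick sketch of what the two returned arrays look like, using a tiny hand-made array (the data and the names `values`, `uniques` and `counts` are mine). The values come back sorted, and `counts[i]` belongs to `uniques[i]`, so `argmax` on the counts gives the position of the most frequent value.

```python
import numpy as np

# Hypothetical toy data, not the random set used for timing
values = np.array([3, 1, 3, 2, 1, 3])

uniques, counts = np.unique(values, return_counts=True)
print(uniques)                   # [1 2 3]
print(counts)                    # [2 1 3]
print(uniques[counts.argmax()])  # 3
```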

In [49]:
%%timeit -n10000
a, b = np.unique(set, return_counts=True)
a[b.argmax()]

21.6 μs ± 215 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


> Results: <br> 37.5 μs ± 14.5 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <br>
Wow, that's impressive.<br>It runs about <b>6 times faster</b> than the pandas 'mode' method and about <b>13 times faster</b> than 'value_counts()'!

### Does this code even work?

We did check that before timing anything, but let's double-check.

In [53]:
# let's look at the 5 values with the highest counts

print(set.value_counts()[:5])

0  
185    12
182    11
66     11
47     11
35     10
Name: count, dtype: int64


The first column is the value; the second column is how many times it appears in the dataset. So the result we want is in the upper-left corner (never mind the 0 above it, that is just the column name).

In [55]:
# what if our objects are words, not numbers? spoiler: numpy will deliver
# set = ['Am', 'Dm', 'G', 'G', 'G', 'C', 'Em', 'Em']

In [56]:
print('Most frequent number by "mode" method =', set.mode()[0][0])
print('Most frequent number by "value_counts" method =', set.value_counts().idxmax()[0])
a, b = np.unique(set, return_counts=True)
print('Most frequent number by "np.unique" method =', a[b.argmax()])

Most frequent number by "mode" method = 185
Most frequent number by "value_counts" method = 185
Most frequent number by "np.unique" method = 185
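
All three agree here because 185 is a clear winner. For the tie case mentioned at the start, here is a small sketch with hand-made data (names are mine) showing what "pick the first" means in practice: `mode()` lists every tied value in ascending order, and `argmax` stops at the first maximum, so both end up at the smallest tied value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 and 5 both appear twice
tied = pd.DataFrame([5, 2, 5, 2, 9])

all_modes = tied.mode()[0].tolist()   # every tied value, sorted ascending
print(all_modes)                      # [2, 5]

uniques, counts = np.unique(tied, return_counts=True)
first_hit = uniques[counts.argmax()]  # argmax keeps the first maximum
print(first_hit)                      # 2
```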


### Conclusion

**Numpy** is the clear champion here, what a beast!<br>
Note that it wins by a wide margin even though the data is stored in a pd.DataFrame.<br>
There's also a big '+': if we change the data to, say, a list of words, the numpy approach will still work, whereas the two other methods will raise an error, since a plain list has no 'mode' or 'value_counts' method.
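
A minimal sketch of that last point, using the list of chords from the commented-out cell above (the names `chords`, `uniques` and `counts` are mine):

```python
import numpy as np

# The word list from the commented-out cell
chords = ['Am', 'Dm', 'G', 'G', 'G', 'C', 'Em', 'Em']

uniques, counts = np.unique(chords, return_counts=True)
print(uniques[counts.argmax()])  # G
```

To use the pandas methods on the same data, the list would first have to be wrapped in a DataFrame or Series.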
