# Reduction Operations in NumPy Arrays

NumPy arrays have become indispensable in the world of data science. They offer a wide range of functions which are encountered very often when dealing with data. One such family of functions is the family of reduction operations. As the name suggests, reduction operations are the ones which reduce the number of elements in an array.

The definition alone doesn't convey how significant these operations are. Here is a list of the functions we're going to look at; their usefulness will become clear as we go.

- descriptive stats - sum, mean, std, min, max, median
- argmax
- argmin
- reduce and accumulate

So, let's get started.

Let's begin with sum. Just like its name says, it sums up the elements of an array. When we're dealing with a one-dimensional array, this is nothing special: it's just a simple addition of a few numbers.

But when we go to higher dimensions, we can see the real power of NumPy. We can sum along any given axis of an array. Consider a 2-dimensional array (a matrix), for example.

<img src="https://media.geeksforgeeks.org/wp-content/uploads/sum-of-each-row-and-column.jpg" width="40%">

Many real-world tabular data problems are processed in pandas DataFrames, which are built on top of NumPy arrays. Typically, each row is the record of an individual data point and each column is a feature of that record. When we want to get an idea of the distribution of the features, we can use reduction operations like sum over individual axes (particularly the columns). Here's an example.

In [2]:
import numpy as np
np.random.seed(10)
scores = np.random.randint(30, 50, (5, 4))
print(scores)

[[39 34 45 30]
 [47 46 47 38]
 [39 30 40 38]
 [34 49 46 34]
 [45 41 41 31]]


Let's say the values above are the scores of 5 students in 4 subjects. Here, rows represent students (records) and columns represent marks in a subject (features). Summing along axis 0 collapses the rows and gives the total score per subject, one value per column. Summing along axis 1 collapses the columns and gives the total score per student, one value per row. Here's how we can do both.

In [3]:
# Total per subject: sum down each column (collapses the rows)
scores.sum(axis = 0)

array([204, 200, 219, 171])

In [4]:
# Total per student: sum across each row (collapses the columns)
scores.sum(axis = 1)

array([148, 178, 147, 163, 158])
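
A quick way to keep the two axes straight is to look at the shape of the result: the axis you name is the one that disappears. The sketch below hard-codes the scores printed above so that it runs on its own; the variable names `per_subject` and `per_student` are only illustrative.

```python
import numpy as np

# The scores printed above, hard-coded so the sketch is self-contained
scores = np.array([
    [39, 34, 45, 30],
    [47, 46, 47, 38],
    [39, 30, 40, 38],
    [34, 49, 46, 34],
    [45, 41, 41, 31],
])

per_subject = scores.sum(axis=0)  # axis 0 (the 5 students) is collapsed
per_student = scores.sum(axis=1)  # axis 1 (the 4 subjects) is collapsed

print(scores.shape)       # (5, 4)
print(per_subject.shape)  # (4,)
print(per_student.shape)  # (5,)

# keepdims=True keeps the collapsed axis around with length 1
column_of_totals = scores.sum(axis=1, keepdims=True)
print(column_of_totals.shape)  # (5, 1)
```

Keeping the collapsed axis with `keepdims=True` can be handy when the totals are later combined with the original array, because the shapes still line up.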

Now let's say we wanted to focus on the average marks scored by each student. We can reduce the scores array using the mean function in NumPy as follows.

In [5]:
# Average score per student
scores.mean(axis = 1)

array([37.  , 44.5 , 36.75, 40.75, 39.5 ])

In [6]:
# Average score per subject
scores.mean(axis = 0)

array([40.8, 40. , 43.8, 34.2])

We can also look at the standard deviation of the marks. Intuitively, the standard deviation measures how tightly bound or how spread out a set of values is around its mean. In the cells below it is applied to the per-subject and per-student averages computed above, so it tells us how much those averages vary.

In [8]:
# See the spread of average marks in the 4 subjects
np.std(scores.mean(axis = 0))

3.47706773014274

In [9]:
# See the spread of average marks of students in the class
np.std(scores.mean(axis = 1))

2.8346075566116733
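
The two cells above measure the spread of the averages. If we are instead interested in the spread of the raw marks within each subject or within each student's row, std accepts the same axis argument as sum and mean. The following is a minimal sketch, again with the scores hard-coded; it also writes the computation out from the definition to show what is being reduced. Note that `np.std` computes the population standard deviation by default (`ddof=0`).

```python
import numpy as np

# The scores printed earlier, hard-coded so the sketch is self-contained
scores = np.array([
    [39, 34, 45, 30],
    [47, 46, 47, 38],
    [39, 30, 40, 38],
    [34, 49, 46, 34],
    [45, 41, 41, 31],
])

# Spread of the raw marks within each subject (one value per column)
per_subject_std = scores.std(axis=0)

# The same quantity from the definition: the square root of the
# mean squared deviation from the per-subject mean
deviations = scores - scores.mean(axis=0)
manual_std = np.sqrt((deviations ** 2).mean(axis=0))

print(np.allclose(per_subject_std, manual_std))  # True

# Spread of the raw marks within each student's row (one value per row)
per_student_std = scores.std(axis=1)
print(per_student_std.shape)  # (5,)
```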

We can see the lowest and highest scores which a student has secured using the min and max functions respectively. They too are as simple to use as the ones above.

In [10]:
# See the lowest score secured in every subject
scores.min(axis = 0)

array([34, 30, 40, 30])

In [11]:
# See the lowest score secured by each student
scores.min(axis = 1)

array([30, 38, 30, 34, 31])

In [12]:
# See the highest score secured in every subject
scores.max(axis = 0)

array([47, 49, 47, 38])

In [13]:
# See the highest score secured by each student
scores.max(axis = 1)

array([45, 47, 40, 49, 45])
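
The median from the list at the top works the same way, with one difference worth knowing: arrays have no `median` method, so it is called as the function `np.median`. A minimal sketch with the same hard-coded scores:

```python
import numpy as np

# The scores printed earlier, hard-coded so the sketch is self-contained
scores = np.array([
    [39, 34, 45, 30],
    [47, 46, 47, 38],
    [39, 30, 40, 38],
    [34, 49, 46, 34],
    [45, 41, 41, 31],
])

# Median score in every subject (middle value of each column)
median_per_subject = np.median(scores, axis=0)
print(median_per_subject.tolist())  # [39.0, 41.0, 45.0, 34.0]

# Median score of each student; with 4 subjects per student this is
# the average of the two middle values
median_per_student = np.median(scores, axis=1)
print(median_per_student.tolist())  # [36.5, 46.5, 38.5, 40.0, 41.0]
```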

Notice that we have always specified the axis argument until now, telling NumPy explicitly that we want to reduce along that dimension. Sometimes you're interested in a global reduction, i.e. reducing the entire matrix to a single value. Let's say you wanted to find the sum of scores of all students in all subjects, or the mean score across all students in all subjects. This can be done by omitting the axis argument altogether. The reduction is then performed across the entire array or matrix.

In [14]:
# Reducing globally
print(f"The global sum of scores is:     {scores.sum()}")
print(f"The global mean of scores is:    {scores.mean()}")
print(f"The global minimum of scores is: {scores.min()}")
print(f"The global maximum of scores is: {scores.max()}")

The global sum of scores is:     794
The global mean of scores is:    39.7
The global minimum of scores is: 30
The global maximum of scores is: 49


In addition to the above, there are two more reduction operations which are of great significance, especially in Machine Learning and Deep Learning.

## argmax & argmin 

Given an array, argmax and argmin find the index of the maximum or minimum element along a given dimension. Let us continue with the example above. Suppose we wanted to find out which of the five students scored highest in subject 1, i.e. the column at index 1 (the second column, since indexing starts at 0). We can slice out the scores for that subject and call argmax on them.

In [15]:
subject1scores = scores[:, 1]
print(f"Scores: {subject1scores}")
subject1scores.argmax()

Scores: [34 46 30 49 41]


3

Since indexing starts at 0, the result 3 refers to the 4th student, and the scores above confirm that this is where the highest score (49) sits.

Now suppose we wanted to find out which student got the lowest marks in the same subject. We can call argmin on the same array to figure that out.

In [16]:
subject1scores = scores[:, 1]
print(f"Scores: {subject1scores}")
subject1scores.argmin()

Scores: [34 46 30 49 41]


2
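
The index returned by argmin or argmax can be fed straight back into the array to recover the value itself. The sketch below checks this for the scores printed above, and also shows what happens on a tie, using the last subject's column from the scores array (where two students both scored 38): the first occurrence wins.

```python
import numpy as np

# Scores for the subject at column index 1, copied from the output above
subject1scores = np.array([34, 46, 30, 49, 41])

lowest_idx = subject1scores.argmin()
print(lowest_idx)                  # 2
print(subject1scores[lowest_idx])  # 30
print(subject1scores[lowest_idx] == subject1scores.min())  # True

# The last subject's column has a tie for the highest score (38 twice);
# argmax returns the index of the first occurrence
last_subject = np.array([30, 38, 38, 34, 31])
tie_idx = last_subject.argmax()
print(tie_idx)  # 1
```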

We can find the same for all the subjects at once by using the entire array and specifying the axis to reduce along. Since we want to reduce across the students (the rows), we'll specify axis = 0.

In [17]:
# Student who scored max in each of the subjects
scores.argmax(axis = 0)

array([1, 3, 1, 1])

In [18]:
# Student who scored the least in each of the subjects
scores.argmin(axis = 0)

array([3, 2, 2, 0])

Conversely, we can find out which subject every student performed best or worst in by specifying axis = 1, as shown below.

In [19]:
# The subject which each of the five students topped in
scores.argmax(axis = 1)

array([2, 0, 2, 1, 0])

In [20]:
# The subject which each of the five students struggled the most with
scores.argmin(axis = 1)

array([3, 3, 1, 0, 3])
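
Like the other reductions, argmax and argmin can also be called without an axis. In that case the index refers to the flattened array, so it needs one more step to be read as a (student, subject) pair. The sketch below uses `np.unravel_index` for that step, with the scores hard-coded as before.

```python
import numpy as np

# The scores printed earlier, hard-coded so the sketch is self-contained
scores = np.array([
    [39, 34, 45, 30],
    [47, 46, 47, 38],
    [39, 30, 40, 38],
    [34, 49, 46, 34],
    [45, 41, 41, 31],
])

# Without an axis, the index counts through the array row by row
flat_idx = scores.argmax()
print(flat_idx)  # 13

# Convert the flat index back into (row, column) coordinates
row, col = np.unravel_index(flat_idx, scores.shape)
print(int(row), int(col))  # 3 1
print(scores[row, col])    # 49
```

So the single highest score in the table belongs to the student at index 3, in the subject at index 1.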

> In classification tasks in Machine Learning and Deep Learning applications, a model typically outputs a probability distribution or likelihood scores over the classes for a given input. The argmax() function helps find the most likely class for that input.
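
As a rough illustration of the classification use case, the sketch below picks the most likely class for three inputs. The class names and probabilities are made up for the example and do not come from any real model.

```python
import numpy as np

# Hypothetical class labels and model outputs (one row per input,
# one column per class); the numbers are invented for illustration
class_names = np.array(["cat", "dog", "bird"])
probabilities = np.array([
    [0.10, 0.70, 0.20],
    [0.80, 0.15, 0.05],
    [0.25, 0.25, 0.50],
])

# Reduce along the class axis to get the most likely class per input
predicted = probabilities.argmax(axis=1)
print(predicted.tolist())               # [1, 0, 2]
print(class_names[predicted].tolist())  # ['dog', 'cat', 'bird']
```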


> In nearest neighbour search applications, e.g. picking items similar to a given input item, a distance matrix is calculated and the argmin() function helps find the nearest neighbour, because we're looking to minimize the distance from the given input item.
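
The nearest neighbour idea can be sketched in a few lines as well. The item coordinates and the query point below are hypothetical, and Euclidean distance is just one possible choice of distance. Note that computing the distances is itself a reduction along axis 1.

```python
import numpy as np

# Hypothetical items described by two features each, and a query item
items = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
    [6.0, 5.0],
])
query = np.array([4.8, 4.9])

# Euclidean distance from the query to every item (reduces axis 1)
distances = np.linalg.norm(items - query, axis=1)

# The nearest neighbour is the item with the smallest distance
nearest = distances.argmin()
print(nearest)  # 2

# For more than one neighbour, sort the indices by distance instead
two_nearest = np.argsort(distances)[:2]
print(two_nearest.tolist())  # [2, 3]
```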

## reduce & accumulate operations

NumPy provides reduce and accumulate as methods on its universal functions (ufuncs) such as np.add, in order to reduce an array with a particular operation. They work by repeatedly applying the operation to the elements of an array until only a single element remains.

The key difference between the two is that reduce keeps only the final result, whereas accumulate keeps all the intermediate results of the computation as well.

In [21]:
x = np.array([1,2,3,4,5,6])
print(np.add.reduce(x))
print(np.add.accumulate(x))

21
[ 1  3  6 10 15 21]
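
The same two methods are available on other binary ufuncs too. The sketch below tries them with multiplication and maximum, and relates `np.add.reduce` and `np.add.accumulate` to the more familiar `sum` and `cumsum`. One detail to watch for: on a 2-dimensional array, `reduce` works along axis 0 by default, whereas `sum` without an axis reduces everything.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

# Running product: reduce keeps the final value, accumulate keeps every step
print(np.multiply.reduce(x))               # 720
print(np.multiply.accumulate(x).tolist())  # [1, 2, 6, 24, 120, 720]

# Running maximum of a sequence
y = np.array([3, 1, 4, 1, 5, 9, 2, 6])
running_max = np.maximum.accumulate(y)
print(running_max.tolist())  # [3, 3, 4, 4, 5, 9, 9, 9]

# For addition, these match the usual sum and cumsum
print(np.add.reduce(x) == x.sum())                         # True
print(np.array_equal(np.add.accumulate(x), np.cumsum(x)))  # True

# Default axis differs on 2-D input: reduce uses axis 0, sum uses all axes
m = np.array([[1, 2], [3, 4]])
print(np.add.reduce(m).tolist())  # [4, 6]
print(m.sum())                    # 10
```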


These functions help when we have to perform cumulative operations, for example:
- Finding out the cumulative distribution of a feature.
- Creating elbow charts to determine how many significant components to keep in Principal Component Analysis (a method used for dimensionality reduction in ML)
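
To make the second use case a little more concrete, here is a minimal sketch of the cumulative explained variance that such a chart is usually drawn from. The per-component variances and the 85% threshold are invented for illustration; in practice they would come from an actual PCA fit and from your own requirements.

```python
import numpy as np

# Hypothetical variance explained by each principal component,
# sorted from the most to the least important component
explained_variance = np.array([4.0, 2.5, 1.5, 1.0, 0.6, 0.4])

# Share of the total variance per component, then the running total
ratio = explained_variance / np.add.reduce(explained_variance)
cumulative = np.add.accumulate(ratio)

# Number of components needed to cover at least 85% of the variance:
# argmax on a boolean array returns the index of the first True
n_components = int(np.argmax(cumulative >= 0.85)) + 1
print(n_components)  # 4
```

Plotting `cumulative` against the component number gives the curve whose bend suggests how many components are worth keeping.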

In this post, we studied reduction operations in NumPy, which are very handy in ML, DL and Data Science work in general. In our next post we shall talk about some advanced operations in NumPy which are commonplace and indispensable to Data Scientists.

## References
1. [Array Image](https://www.geeksforgeeks.org/)
2. [Array Computations](https://jakevdp.github.io/PythonDataScienceHandbook/02.03-computation-on-arrays-ufuncs.html)
3. [Matrix Rain Image](http://www.teachmeidea.com/2018/09/how-to-build-matrix-rain-in-java.html)