# Statistical Methods

NumPy provides several statistical functions. We use basic statistics such as the average and the sum very often, both in daily life and in data work.

In Machine Learning, we often have to perform several statistical operations. Even plotting charts requires some of them. Let's take a look at some of the statistical functions provided by NumPy.

In [1]:
import numpy as np

In [2]:
data_list = np.random.rand(4,4) # produces a random (4,4) sized 2D matrix with values in [0, 1)

In [3]:
data_list

array([[0.72686874, 0.19474382, 0.05061276, 0.35455887],
       [0.61115868, 0.55553073, 0.44741632, 0.92858649],
       [0.02789257, 0.69665108, 0.31917997, 0.48274819],
       [0.81314283, 0.11493034, 0.7677079 , 0.53254535]])

# Sum

Returns the sum of all values within the matrix.

In [4]:
data_list.sum()

7.624274629565662
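`sum()` also accepts an `axis` argument when you want one total per column or per row instead of a single number. The following is a minimal sketch using a small hand-made matrix `m` (an assumption for illustration, not the random data above), so the results are easy to verify by hand.

```python
import numpy as np

# Small fixed matrix (illustrative data) so each result can be checked by hand
m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = m.sum()           # every element added together: 21
col_sums = m.sum(axis=0)  # collapse the rows, one sum per column: [5 7 9]
row_sums = m.sum(axis=1)  # collapse the columns, one sum per row: [ 6 15]

print(total, col_sums, row_sums)
```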

# Mean

The mean is simply the average value. We add up all the numbers with `sum()` and then divide the total by how many numbers we added.

![image.png](attachment:image.png)

`np.mean(array)` - returns the mean of the array.

In [5]:
np.mean(data_list)

0.47651716434785385
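To make the definition concrete, here is a small sketch that computes the mean by hand and compares it with `np.mean`. The array `values` is made-up illustrative data.

```python
import numpy as np

# Illustrative data: four numbers whose total is 20
values = np.array([2.0, 4.0, 6.0, 8.0])

# Sum of all elements divided by the number of elements: 20 / 4
manual_mean = values.sum() / values.size
numpy_mean = np.mean(values)

print(manual_mean, numpy_mean)
```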

# Median

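The median is the middle value of the sorted data: half of the values are lower than it and the other half are higher. When the number of values is even, as with our 16-element matrix, NumPy returns the average of the two middle values, so the result need not be one of the data points.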

![image.png](attachment:image.png)

`np.median(array)` - returns the median (middle value) of the array.

In [6]:
np.median(data_list)

0.5076467722525979
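The sketch below shows how the median behaves for an odd and an even number of values. Both arrays are small made-up examples chosen so the sorted order is obvious.

```python
import numpy as np

# Illustrative data; np.median sorts a copy internally, so input order does not matter
odd = np.array([7, 1, 3])      # sorted: 1, 3, 7 -> the middle value is 3
even = np.array([7, 1, 3, 5])  # sorted: 1, 3, 5, 7 -> average of 3 and 5

median_odd = np.median(odd)
median_even = np.median(even)

print(median_odd, median_even)
```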

# Standard Deviation

Standard deviation provides an indication of the spread of the data.

![image.png](attachment:image.png)

If the standard deviation is <b><i>Low</i></b>, most of the data points are close to the mean of the data. If it is <b><i>High</i></b>, the data points are spread out away from the mean, or in simple words, distributed over a wide range. <br>

`np.std(array)` - returns the standard deviation.

In [7]:
np.std(data_list)

0.2709284253978271
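As a rough illustration of what `np.std` computes, the sketch below rebuilds the standard deviation step by step on a small made-up array. By default NumPy divides by the number of values (the population form); passing `ddof=1` divides by one less, which gives the sample form.

```python
import numpy as np

# Illustrative data with mean 5; the squared deviations add up to 32
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Square root of the average squared distance from the mean
manual_std = np.sqrt(np.mean((values - values.mean()) ** 2))

population_std = np.std(values)      # divides by N (the default, ddof=0)
sample_std = np.std(values, ddof=1)  # divides by N - 1

print(manual_std, population_std, sample_std)
```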

# Variance

Variance is another number that represents the spread of the data points. It is the square of the standard deviation.

![image.png](attachment:image.png)

`np.var(array)` - returns the variance of the array.

In [8]:
np.var(data_list)

0.07340221168854594
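The relationship between variance and standard deviation can be checked directly. This is a small sketch on made-up data; `np.isclose` is used for the comparison because floating-point results may differ in the last digits.

```python
import numpy as np

# Illustrative data (mean 5, population variance 4)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

variance = np.var(values)
std = np.std(values)

# Variance should equal the standard deviation squared
matches = np.isclose(variance, std ** 2)

print(variance, std, matches)
```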

# Min/Max

`np.min(array)` and `np.max(array)` return the minimum and maximum value of the array. Let's look at an example.

In [9]:
np.min(data_list)

0.02789257109099408

In [10]:
np.max(data_list)

0.9285864855860084
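Like `sum()`, both functions take an optional `axis` argument. The sketch below uses a small hand-made matrix `m` (illustrative data only) to show the per-column minimum and the per-row maximum.

```python
import numpy as np

# Illustrative 2x3 matrix
m = np.array([[3, 7, 1],
              [9, 2, 8]])

col_min = np.min(m, axis=0)  # smallest value in each column: [3 2 1]
row_max = np.max(m, axis=1)  # largest value in each row: [7 9]

print(col_min, row_max)
```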

# Argmin/Argmax

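Let's say we want to find the index of the element with the min or max value; we can do that using the `argmin` or `argmax` function. For a multi-dimensional array, the returned index refers to the flattened array by default, which is why the results here are single numbers rather than (row, column) pairs.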

In [11]:
np.argmax(data_list)

7

In [12]:
np.argmin(data_list)

8
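If you need the (row, column) position rather than the flat index, one option is `np.unravel_index`, which converts a flat index back into coordinates for a given shape. The sketch below uses a small made-up matrix `m` rather than the random data above.

```python
import numpy as np

# Illustrative 2x3 matrix; flattened it reads 3, 7, 1, 9, 2, 8
m = np.array([[3, 7, 1],
              [9, 2, 8]])

flat_max = np.argmax(m)  # 9 sits at flat index 3
flat_min = np.argmin(m)  # 1 sits at flat index 2

# Convert the flat indices into (row, column) positions
max_pos = np.unravel_index(flat_max, m.shape)  # row 1, column 0
min_pos = np.unravel_index(flat_min, m.shape)  # row 0, column 2

# With an axis, argmax returns one index per row instead of a flat index
row_argmax = np.argmax(m, axis=1)  # [1 0]

print(flat_max, max_pos, flat_min, min_pos, row_argmax)
```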

# Cumsum

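Returns the cumulative sum of the array: each output element is the sum of all elements up to and including that position, so the first output equals the first element. By default, a multi-dimensional array is flattened first.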

In [13]:
np.cumsum(data_list)

array([0.72686874, 0.92161255, 0.97222531, 1.32678418, 1.93794286,
       2.49347359, 2.94088991, 3.8694764 , 3.89736897, 4.59402004,
       4.91320002, 5.39594821, 6.20909104, 6.32402138, 7.09172928,
       7.62427463])
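A small sketch of how `np.cumsum` behaves with and without an `axis`, using a hand-made 2x2 matrix (illustrative data) so the running totals are easy to follow.

```python
import numpy as np

# Illustrative 2x2 matrix
m = np.array([[1, 2],
              [3, 4]])

flat = np.cumsum(m)             # flattened running total: [ 1  3  6 10]
down_cols = np.cumsum(m, axis=0)  # running total down each column: [[1 2] [4 6]]
along_rows = np.cumsum(m, axis=1) # running total along each row: [[1 3] [3 7]]

print(flat)
print(down_cols)
print(along_rows)
```

Note that the last element of the flattened result equals `m.sum()`.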

# Cumprod

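Returns the cumulative product of the array: each output element is the product of all elements up to and including that position, so the first output equals the first element. By default, a multi-dimensional array is flattened first.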

In [14]:
np.cumprod(data_list)

array([7.26868735e-01, 1.41553191e-01, 7.16439728e-03, 2.54020062e-03,
       1.55246566e-03, 8.62442384e-04, 3.85870794e-04, 3.58314404e-04,
       9.99431000e-06, 6.96254683e-06, 2.22230550e-06, 1.07281396e-06,
       8.72350983e-07, 1.00259595e-07, 7.69700829e-08, 4.09900599e-08])
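The same idea on a small integer array makes the running product easier to read than the tiny floating-point values above. The array `values` is made-up illustrative data.

```python
import numpy as np

# Illustrative data
values = np.array([1, 2, 3, 4])

running_product = np.cumprod(values)  # 1, 1*2, 1*2*3, 1*2*3*4
full_product = np.prod(values)        # product of all elements

print(running_product, full_product)
```

The last element of the cumulative product equals the full product returned by `np.prod`.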

# Some more common methods

We encourage you to try out these functions. They will come in handy in many real-world data science problems you encounter.

![image-2.png](attachment:image-2.png)

# Array and Set operations

## Unique

`np.unique(array)` - returns the unique values of array in sorted order

In [15]:
fruits = np.array(['apple','banana','mango','orange','kiwi','mango','lemon'])
fruits

array(['apple', 'banana', 'mango', 'orange', 'kiwi', 'mango', 'lemon'],
      dtype='<U6')

In [16]:
np.unique(fruits)

array(['apple', 'banana', 'kiwi', 'lemon', 'mango', 'orange'], dtype='<U6')
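`np.unique` can also report how many times each value occurs via `return_counts=True`. The sketch below reuses the `fruits` array from this section; the variable names `values` and `counts` are just illustrative.

```python
import numpy as np

fruits = np.array(['apple', 'banana', 'mango', 'orange', 'kiwi', 'mango', 'lemon'])

# return_counts=True returns a second array with the number of occurrences
values, counts = np.unique(fruits, return_counts=True)

for value, count in zip(values, counts):
    print(value, count)
```

Here `mango` is the only value with a count of 2.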

## Intersection

Let us take 2 sets and perform some operations on the sets. Here a set is nothing but a regular ndarray. 

In [17]:
fruits = np.array(['apple','banana','mango','orange','kiwi','mango','lemon'])
veges = np.array(['tomato','spinach','lemon','capsicum'])

![image-3.png](attachment:image-3.png)

Intersection means the elements that are common to both sets.

In [18]:
np.intersect1d(fruits, veges)

array(['lemon'], dtype='<U8')

## Union

Fetches the elements present in either set, with duplicates removed. A standard set union operation.

![image.png](attachment:image.png)
<br>


In [19]:
np.union1d(fruits, veges)

array(['apple', 'banana', 'capsicum', 'kiwi', 'lemon', 'mango', 'orange',
       'spinach', 'tomato'], dtype='<U8')
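One way to think about `np.union1d` is as "concatenate both arrays, then keep the sorted unique values". The sketch below checks that this manual version gives the same result on the arrays from this section.

```python
import numpy as np

fruits = np.array(['apple', 'banana', 'mango', 'orange', 'kiwi', 'mango', 'lemon'])
veges = np.array(['tomato', 'spinach', 'lemon', 'capsicum'])

union = np.union1d(fruits, veges)

# Manual equivalent: join the two arrays, then drop duplicates and sort
manual_union = np.unique(np.concatenate((fruits, veges)))

same = np.array_equal(union, manual_union)
print(same)
```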

## Difference

Performs a difference operation between the two sets. Set difference returns the elements that are present in the first set but not in the second.

![image-2.png](attachment:image-2.png)

As per the above example, we get everything from Set A, that is, every element of fruits that is not also in veges. Lemon appears in both arrays, so it is not present in the result of the difference operation.

In [20]:
np.setdiff1d(fruits, veges)

array(['apple', 'banana', 'kiwi', 'mango', 'orange'], dtype='<U6')
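Unlike intersection and union, set difference depends on the order of its arguments. The short sketch below swaps the two arrays from this section to show the other direction; the names `fruits_only` and `veges_only` are illustrative.

```python
import numpy as np

fruits = np.array(['apple', 'banana', 'mango', 'orange', 'kiwi', 'mango', 'lemon'])
veges = np.array(['tomato', 'spinach', 'lemon', 'capsicum'])

fruits_only = np.setdiff1d(fruits, veges)  # in fruits but not in veges
veges_only = np.setdiff1d(veges, fruits)   # in veges but not in fruits

print(fruits_only)
print(veges_only)
```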

## Symmetric Difference

Returns the elements that are in exactly one of the two sets; elements common to both sets are excluded.

![image-2.png](attachment:image-2.png)



In [21]:
np.setxor1d(fruits, veges)

array(['apple', 'banana', 'capsicum', 'kiwi', 'mango', 'orange',
       'spinach', 'tomato'], dtype='<U8')
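The symmetric difference can also be built from the operations covered earlier: take the union and remove the intersection. The sketch below checks that this manual construction matches `np.setxor1d` for the two arrays from this section.

```python
import numpy as np

fruits = np.array(['apple', 'banana', 'mango', 'orange', 'kiwi', 'mango', 'lemon'])
veges = np.array(['tomato', 'spinach', 'lemon', 'capsicum'])

xor = np.setxor1d(fruits, veges)

# Manual equivalent: everything in either set, minus what both sets share
manual_xor = np.setdiff1d(np.union1d(fruits, veges),
                          np.intersect1d(fruits, veges))

same = np.array_equal(xor, manual_xor)
print(same)
```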

Here is a quick reference of several other set operations you can perform. We encourage you to try these on your own by trying out different sets.

![image-3.png](attachment:image-3.png)