### 03 - Statistics in NumPy

In this notebook, we will explore common statistical operations using NumPy, which will help you summarize and understand data distributions.

### 1\. **Mean (Average)**

The **mean** is the sum of all the values divided by how many values there are. It gives you a "central" value for your data, although a single extreme value can pull it noticeably up or down.

In [None]:
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Mean
mean_value = np.mean(data)
print(f"Mean: {mean_value}")
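As a quick check, the mean can be reproduced by hand as the sum divided by the count. The sketch below also shows how one extreme value shifts the mean; the `with_outlier` array is a made-up example and not part of this notebook's data.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Mean by hand: sum of the values divided by the number of values
manual_mean = data.sum() / data.size
print(f"Manual mean: {manual_mean}")

# Hypothetical array with one extreme value, to show how it pulls the mean
with_outlier = np.array([1, 2, 3, 4, 100])
outlier_mean = np.mean(with_outlier)
print(f"Mean with outlier: {outlier_mean}")
```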

### 2\. **Median**

The **median** is the middle value of a sorted list of numbers. If there’s an even number of values, the median is the average of the two middle numbers.

In [None]:
# Median
median_value = np.median(data)
print(f"Median: {median_value}")
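To see the even-length rule in action, here is a small sketch with a hypothetical four-element array (`even_data` is an assumption for illustration). Note that `np.median` does not need the input to be sorted first.

```python
import numpy as np

# Hypothetical even-length array; sorted it is [1, 2, 3, 4],
# so the two middle values are 2 and 3
even_data = np.array([4, 1, 3, 2])

even_median = np.median(even_data)
print(f"Median of even-length data: {even_median}")
```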

### 3\. **Mode (Most Frequent Value)**

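The **mode** is the number that appears most often in a dataset. If no number repeats, the dataset has **no mode**. Note that `scipy.stats.mode` still returns a value in that case: the smallest value, with a count of 1.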

In [None]:
%pip install scipy

In [None]:
from scipy import stats

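# Mode
mode_value = stats.mode(data)
# Newer SciPy versions return a scalar here, older ones a length-1 array;
# np.atleast_1d handles both cases
print(f"Mode: {np.atleast_1d(mode_value.mode)[0]}")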
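If you would rather stay within NumPy, the mode can also be found by counting values with `np.unique`. The helper below is a minimal sketch, not a replacement for `scipy.stats.mode`: the name `find_modes` and the `repeated` array are assumptions made for this example. It returns every tied value and an empty array when nothing repeats.

```python
import numpy as np

def find_modes(values):
    """Return all most-frequent values, or an empty array if nothing repeats."""
    unique_values, counts = np.unique(values, return_counts=True)
    max_count = counts.max()
    if max_count == 1:
        # Every value appears exactly once, so there is no mode
        return unique_values[:0]
    # Keep every value whose count ties for the maximum
    return unique_values[counts == max_count]

# Hypothetical data in which 3 appears most often
repeated = np.array([1, 2, 2, 3, 3, 3, 4])
print(f"Modes: {find_modes(repeated)}")
print(f"Modes of all-unique data: {find_modes(np.array([1, 2, 3, 4, 5]))}")
```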

### 4\. **Variance**

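**Variance** tells you how spread out your data is: it is the average of the squared differences from the mean. If the values are far from the mean, the variance will be high. If they are close to the mean, the variance will be low. By default, `np.var` computes the population variance (dividing by `n`); pass `ddof=1` to get the sample variance (dividing by `n - 1`).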

In [None]:
# Variance
variance_value = np.var(data)
print(f"Variance: {variance_value}")
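The sketch below compares the default population variance with the sample variance and reproduces the population value by hand. The variable names are chosen for this example only.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Population variance (default, ddof=0): divide by n
population_var = np.var(data)

# Sample variance (ddof=1): divide by n - 1
sample_var = np.var(data, ddof=1)

# By hand: average of the squared differences from the mean
manual_var = np.mean((data - data.mean()) ** 2)

print(f"Population variance: {population_var}")
print(f"Sample variance: {sample_var}")
print(f"Manual variance: {manual_var}")
```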

### 5\. **Standard Deviation**

The **standard deviation** is the square root of the variance. It measures the same spread, but in the same units as the original data, which makes it easier to interpret.

In [None]:
# Standard Deviation
std_dev_value = np.std(data)
print(f"Standard Deviation: {std_dev_value}")
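A short check of that relationship: taking the square root of `np.var` should match `np.std`, since both use the population formula by default.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

std_from_numpy = np.std(data)
std_from_variance = np.sqrt(np.var(data))

print(f"np.std: {std_from_numpy}")
print(f"sqrt(np.var): {std_from_variance}")
```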

### 6\. **Minimum and Maximum**

The **minimum** is the smallest number, and the **maximum** is the largest number in a dataset.

In [None]:
# Minimum and Maximum
min_value = np.min(data)
max_value = np.max(data)

print(f"Minimum: {min_value}")
print(f"Maximum: {max_value}")
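Two related values are often useful alongside the minimum and maximum: where they occur, and the distance between them (the range). This is a small sketch using `np.argmin` and `np.argmax`; the variable names are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Positions (indices) of the smallest and largest values
min_index = np.argmin(data)
max_index = np.argmax(data)

# Range: difference between the largest and smallest values
value_range = np.max(data) - np.min(data)

print(f"Index of minimum: {min_index}")
print(f"Index of maximum: {max_index}")
print(f"Range: {value_range}")
```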

### 7\. **Percentiles**

A **percentile** tells you the value below which a certain percentage of your data lies. For example, the **50th percentile** is the **median**, which is the middle value.

In [None]:
# Percentile (e.g., 25th, 50th, 75th)
percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50)
percentile_75 = np.percentile(data, 75)

print(f"25th Percentile: {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"75th Percentile: {percentile_75}")
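`np.percentile` also accepts a list of percentiles, so all three values can be computed in one call. The sketch below does that and derives the interquartile range (the 75th percentile minus the 25th), a common spread measure that is not otherwise covered in this notebook.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Compute several percentiles in a single call
q25, q50, q75 = np.percentile(data, [25, 50, 75])

# Interquartile range: spread of the middle 50% of the data
iqr = q75 - q25

print(f"25th, 50th, 75th percentiles: {q25}, {q50}, {q75}")
print(f"Interquartile range: {iqr}")
```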

### 8\. **Correlation**

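**Correlation** tells you how strongly two sets of data move together in a linear way. `np.corrcoef` returns the Pearson correlation coefficient, which ranges from -1 to 1: values near 1 mean that when one set increases the other tends to increase, values near -1 mean the other tends to decrease, and values near 0 mean there is little linear relationship. The result is a matrix: the diagonal entries are each dataset's correlation with itself (always 1), and the off-diagonal entries are the correlation between the two datasets.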

In [13]:
data_2 = np.array([5, 4, 3, 2, 1])

# Correlation coefficient
correlation = np.corrcoef(data, data_2)
print(f"Correlation Coefficient:\n{correlation}")

Correlation Coefficient:
[[ 1. -1.]
 [-1.  1.]]
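Since the output is a 2x2 matrix, the single coefficient between the two arrays sits in the off-diagonal position. The sketch below extracts it and, as a cross-check, recomputes it from the usual Pearson definition (covariance divided by the product of the standard deviations). The names `r` and `manual_r` are chosen for this example.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
data_2 = np.array([5, 4, 3, 2, 1])

# The off-diagonal entry is the correlation between data and data_2
r = np.corrcoef(data, data_2)[0, 1]

# Cross-check by hand: covariance / (std of data * std of data_2)
covariance = np.mean((data - data.mean()) * (data_2 - data_2.mean()))
manual_r = covariance / (data.std() * data_2.std())

print(f"Correlation coefficient: {r}")
print(f"Manual correlation: {manual_r}")
```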


### 9\. **Cumulative Sum**

The **cumulative sum** is the running total of the values. It gives you a sense of how the sum changes as you move through the data.

In [None]:
# Cumulative Sum
cumsum_value = np.cumsum(data)
print(f"Cumulative Sum: {cumsum_value}")
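Two properties of the running total are worth checking: its last element equals the overall sum, and the differences between consecutive entries give back the original values (after the first). A minimal sketch:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

running_total = np.cumsum(data)

# The last running total is the sum of the whole array
total = data.sum()

# Differences between consecutive running totals recover data[1:]
recovered = np.diff(running_total)

print(f"Running total: {running_total}")
print(f"Total: {total}")
print(f"Recovered values: {recovered}")
```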