# Statistics Primer


## Statistical Symbols and Characters Reference

This reference lists common statistical symbols, what each one denotes, and a short example of its use.

### Greek Letters

| Symbol   | Description                               | Usage in Statistics                          | Example                                 |
|----------|-------------------------------------------|----------------------------------------------|-----------------------------------------|
| \(\mu\)  | Population mean (average)                | Central tendency for a population            | \(\mu = 170\) cm (average height)      |
| \(\sigma^2\) | Population variance                  | Variability of a population                  | \(\sigma^2 = 25\)                      |
| \(\sigma\) | Population standard deviation          | Square root of population variance           | \(\sigma = 5\)                         |
| \(\Sigma\) | Summation                              | Adding up all values in a dataset            | \(\Sigma x_i = x_1 + x_2 + \dots + x_n\) |
| \(\rho\)  | Population correlation coefficient       | Strength/direction of linear relationships   | \(\rho = 0.8\)                         |
| \(\beta\) | Regression coefficient (slope)          | Coefficient for predictors in regression     | \(\beta = 2\)                          |
| \(\alpha\) | Significance level                     | Threshold for rejecting null hypothesis      | \(\alpha = 0.05\)                      |
| \(\chi^2\) | Chi-square statistic                   | Used in chi-square tests                     | \(\chi^2 = 10.8\)                      |

### Latin Letters

| Symbol       | Description                                    | Usage in Statistics                          | Example                                 |
|--------------|------------------------------------------------|----------------------------------------------|-----------------------------------------|
| \(X\)        | Random variable                               | Variable representing a set of outcomes      | \(X =\) number of heads in coin flips   |
| \(x\)        | Observed value of a random variable          | Actual outcome of a random variable          | \(x = 3\) (heads)                       |
| \(N\)        | Number of observations in a population       | Total size of a population                   | \(N = 1000\) (students in a school)     |
| \(n\)        | Number of observations in a sample           | Size of a sample                             | \(n = 50\) (students in a sample)       |
| \(P(x)\)     | Probability of an event                      | Likelihood of a specific outcome             | \(P(x = \text{heads}) = 0.5\)          |
| \(E(X)\)     | Expected value (mean) of a random variable X  | Long-run average of repeated trials          | \(E(X) = 5\)                           |
| \(\bar{X}\)  | Sample mean                                   | Average of a sample                          | \(\bar{X} = 172\) cm (sample average height) |
| \(s^2\)      | Sample variance                               | Variability of a sample                      | \(s^2 = 28\)                           |
| \(s\)        | Sample standard deviation                    | Square root of sample variance               | \(s = 5.3\)                            |
| \(r\)        | Sample correlation coefficient               | Strength/direction of sample relationships   | \(r = 0.75\)                           |
| \(b\)        | Sample estimate of regression coefficient    | Slope of a regression line from sample data  | \(b = 1.8\)                            |
| \(t\)        | t-statistic                                   | Test statistic in t-tests                    | \(t = 2.5\)                            |
| \(F\)        | F-statistic                                   | Test statistic in F-tests                    | \(F = 4.2\)                            |
| \(p\)        | p-value                                       | Probability, assuming the null hypothesis is true, of data at least as extreme as observed | \(p = 0.02\)                    |

### Other Symbols

| Symbol | Description                            | Usage in Statistics                          | Example                                |
|--------|----------------------------------------|----------------------------------------------|----------------------------------------|
| \(\approx\) | Approximately equal to             | Indicates a value is close but not exact      | \(x \approx 5.3\)                     |
| \(\neq\)    | Not equal to                      | Indicates inequality                          | \(x \neq 5\)                          |
| \(\leq\)    | Less than or equal to             | Indicates a value is less than or equal       | \(x \leq 10\)                         |
| \(\geq\)    | Greater than or equal to          | Indicates a value is greater than or equal    | \(x \geq 2\)                          |
| \(\infty\)  | Infinity                          | Represents an unbounded limit                | \(\infty\)                            |
| \(\in\)     | Belongs to (element of a set)     | Indicates membership in a set                | \(x \in A\)                           |
| \(\subset\) | Subset of                        | Indicates one set is contained in another    | \(A \subset B\)                       |
| \(\cap\)    | Intersection                     | Common elements of two sets                  | \(A \cap B\)                          |
| \(\cup\)    | Union                            | All elements of two sets                     | \(A \cup B\)                          |
| \(\emptyset\) | Empty set                      | Represents a set with no elements            | \(\emptyset\)                         |
| \(\sqrt{}\) | Square root                      | Represents the square root of a value        | \(\sqrt{25} = 5\)                     |

**Note:** These tables are not exhaustive. They cover the symbols used most often in introductory statistical analysis and their practical applications.
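As a rough illustration of how several of these symbols correspond to computations, the sketch below maps them to NumPy calls. The `heights` and `weights` arrays are hypothetical values made up for this example, and the variable names are chosen only to mirror the symbols in the tables.

```python
import numpy as np

# Hypothetical sample data (illustrative values only, not from a real study)
heights = np.array([168.0, 172.0, 175.0, 169.0, 176.0])  # cm
weights = np.array([60.0, 68.0, 74.0, 63.0, 75.0])       # kg

n = heights.size                          # n: number of observations in the sample
x_bar = heights.mean()                    # sample mean
s2 = heights.var(ddof=1)                  # s^2: sample variance (divides by n - 1)
s = heights.std(ddof=1)                   # s: sample standard deviation
sigma2 = heights.var(ddof=0)              # sigma^2, if the data were the whole population
r = np.corrcoef(heights, weights)[0, 1]   # r: sample correlation coefficient

print("n =", n)
print("sample mean =", x_bar)
print("s^2 =", s2, " s =", s)
print("sigma^2 =", sigma2)
print("r =", r)
```

The only difference between `s2` and `sigma2` here is the `ddof` argument, which controls whether the sum of squared deviations is divided by n - 1 or by n.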



## Random Variables

**Definition:** A random variable maps the outcomes of random processes to numbers.

**Example:** Flipping a coin

* **Random Variable (X):** 
    * 1 if heads
    * 0 if tails
* **Sample Space:** {heads, tails} (all possible outcomes of the flip); X maps these outcomes to the values {1, 0}
* **Event:** A set of outcomes from the sample space (e.g., "the coin lands tails", which corresponds to X = 0)
* **Probability (P(X = x)):** Likelihood that the random variable takes a particular value (e.g., P(X = 1) = 0.5 for a fair coin)

**In essence, a random variable assigns numerical values to the outcomes of a random process, allowing us to analyze and quantify the probability of different events.**
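The coin-flip example can be sketched as a small simulation. This is only an illustration under stated assumptions: the coin is assumed to be fair, and the seed and the number of flips are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Simulate the random process: 10,000 flips of a fair coin, recorded as labels
outcomes = rng.choice(["heads", "tails"], size=10_000)

# The random variable X maps each outcome to a number: heads -> 1, tails -> 0
x = np.where(outcomes == "heads", 1, 0)

# Relative frequency of the event {X = 1}, an estimate of P(X = 1)
p_heads_estimate = x.mean()
print("Estimated P(X = 1):", p_heads_estimate)
```

Because X is coded as 0 or 1, its average over many flips is the fraction of heads, which should land near 0.5 for a fair coin.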

## Mean

The **mean**, also known as the **average**, is a measure of the central tendency of a dataset. It is calculated by summing all the values in the dataset and dividing by the number of values. For a random variable, the corresponding quantity is the **expected value**.

**Sample Mean**

The sample mean, which is used to estimate the population mean μ, is calculated as:

**$\bar{x}$ = ( Σ xi ) / n**

Where:

* $\bar{x}$ represents the sample mean.
* xi represents each individual value in the sample.
* Σ xi represents the sum of all values in the sample.
* n represents the number of observations in the sample.

**Expectation**

The mean of a random variable is its **expectation**, denoted by E(X) or μ. For random variables X and Y, the expectations are:

* **E(X) = $\mu_X$**
* **E(Y) = $\mu_Y$**

The sample means $\bar{X}$ and $\bar{Y}$ are estimates of these expectations computed from observed data.

The expectation represents the average value that a random variable is expected to take over a large number of trials.
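For a discrete random variable, the expectation can be computed as a probability-weighted sum, E(X) = Σ x · P(X = x). The sketch below uses a fair six-sided die as an assumed example (the die is not part of the text above) and compares the exact expectation with the average of simulated rolls.

```python
import numpy as np

# Assumed example: a fair six-sided die
values = np.arange(1, 7)        # possible values 1..6
probs = np.full(6, 1 / 6)       # each value has probability 1/6

# E(X) = sum over all values of x * P(X = x)
expected_value = np.sum(values * probs)

# Average of many simulated rolls, which should be close to E(X)
rng = np.random.default_rng(seed=0)  # arbitrary seed
rolls = rng.choice(values, size=100_000, p=probs)
sample_mean = rolls.mean()

print("E(X):", expected_value)
print("Mean of simulated rolls:", sample_mean)
```

The simulated average differs slightly from E(X) on any given run, and the gap tends to shrink as the number of rolls grows.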

In [2]:
# Example of Mean
import numpy as np
import math

# Calculate the mean of a dataset using np.mean
x = np.array([1,3,5,7,9])
mean_x = np.mean(x)
print("Mean of x is: ", mean_x)

# Account for cases where the data contains NaN values using np.nanmean
y_nan = np.array([1,3,5,7,9, math.nan])
mean_y_nan = np.nanmean(y_nan)
print("Mean of y_nan is: ", mean_y_nan)


Mean of x is:  5.0
Mean of y_nan is:  5.0


## Variance

Variance measures the dispersion or spread of data points around the mean (average). A higher variance indicates that the data points are more spread out, while a lower variance indicates they are clustered more closely around the mean.

**Real-world example:**

Imagine you're comparing the heights of students in two different classrooms. 

* **Classroom A:**  Most students are around the same height, with only a few slightly taller or shorter. This classroom would have a **low variance** in heights.
* **Classroom B:** There's a wide range of heights, with some very tall students, some very short students, and everything in between. This classroom would have a **high variance** in heights.

**Population Variance**

The population variance (σ²) is calculated as:

**σ² = Σ (xi - μ)² / N**

Where:

* σ² represents the population variance.
* xi represents each individual value in the dataset (e.g., the height of each student).
* μ represents the population mean (e.g., the average height of all students in the classroom).
* Σ (xi - μ)² represents the sum of squared differences between each data point (xi) and the population mean (μ).
* N represents the number of observations in the population (e.g., the total number of students in the classroom).

**Key Points**

* Variance is always non-negative (σ² ≥ 0).
* The square root of the variance is the standard deviation, which is another measure of dispersion.
* When calculating the variance of a sample (as an estimate of the population variance), a slight modification is made to the formula (using N-1 instead of N in the denominator) to ensure an unbiased estimate. This is known as the sample variance.
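The difference between the two denominators can be checked by computing the variance directly from the formula. This is a minimal sketch; the variable names are chosen here for illustration, and the results are cross-checked against `np.var`.

```python
import numpy as np

x = np.array([1, 3, 5, 7, 9])
n = x.size

# Sum of squared differences between each value and the mean
sum_sq_dev = np.sum((x - x.mean()) ** 2)

population_variance = sum_sq_dev / n        # divide by N
sample_variance = sum_sq_dev / (n - 1)      # divide by N - 1

# Cross-check against NumPy's implementation
assert np.isclose(population_variance, np.var(x))          # ddof=0 is the default
assert np.isclose(sample_variance, np.var(x, ddof=1))

print("Population variance:", population_variance)
print("Sample variance:", sample_variance)
```

The sample variance is always at least as large as the population variance computed from the same data, because the same sum is divided by a smaller number.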

In [9]:
# Example of Variance
x = np.array([1, 3, 5, 7, 9])

# Population variance: np.var divides by N by default (ddof=0)
population_variance_x = np.var(x)
print("Population variance of x is:", population_variance_x)  # Output: 8.0

# Sample variance: ddof=1 divides by (N-1) instead of N
sample_variance_x = np.var(x, ddof=1)
print("Sample variance of x is:", sample_variance_x)  # Output: 10.0

data = np.array([10, 12, 15, 13, 11])

# ddof (Delta Degrees of Freedom) = 1 corrects for bias in the sample variance
# by dividing the sum of squared differences by (N-1) instead of N.
sample_variance = np.var(data, ddof=1)
print("Sample variance of data is:", sample_variance)  # Output: 3.7


Population variance of x is: 8.0
Sample variance of x is: 10.0
Sample variance of data is: 3.7


## Standard Deviation

Standard deviation measures how spread out the data points in a dataset are. It is the square root of the average squared deviation from the mean, so it indicates roughly how far a typical data point lies from the mean.

**Real-world examples:**

* **Test scores:** Imagine two classes took the same test. Class A had a mean score of 80 with a standard deviation of 5, while Class B had a mean score of 80 with a standard deviation of 10. The scores in Class B were therefore more spread out than in Class A. If the scores are roughly normally distributed, about 68% of students in Class A scored within 5 points of the average (between 75 and 85), whereas in Class B the same share of students is spread over the wider range from 70 to 90.

* **Manufacturing:** A factory produces bolts with a target diameter of 10mm. A higher standard deviation in the bolt diameters would mean that the bolts produced have more variability in their size, potentially affecting the quality and consistency of the product.
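The test-score example can be explored with a simulation. The sketch below assumes the scores are normally distributed, which the example does not state, and uses a large simulated group rather than a realistic class size so that the fractions are stable. The helper `fraction_within` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

# Assumption: normally distributed scores with the means and standard
# deviations from the example (mean 80; standard deviation 5 versus 10)
class_a = rng.normal(loc=80, scale=5, size=100_000)
class_b = rng.normal(loc=80, scale=10, size=100_000)

def fraction_within(scores, center, width):
    """Return the share of scores within `width` points of `center`."""
    return np.mean(np.abs(scores - center) <= width)

frac_a = fraction_within(class_a, 80, 5)
frac_b = fraction_within(class_b, 80, 5)

print("Class A, share of scores between 75 and 85:", frac_a)
print("Class B, share of scores between 75 and 85:", frac_b)
```

Under the normality assumption, roughly two thirds of Class A falls between 75 and 85, while a noticeably smaller share of Class B does, which is what a larger standard deviation means in practice.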

**Formula**

The standard deviation (σ) is calculated as the square root of the variance:

**σ = √σ²**

Where:

* σ represents the standard deviation.
* σ² represents the variance.

**Key Points**

* Standard deviation is always non-negative (σ ≥ 0).
* A higher standard deviation indicates greater variability in the data.
* Standard deviation has the same unit as the data, making it easier to interpret than variance.
* When calculating the standard deviation of a sample (as an estimate of the population standard deviation), the variance formula is modified to use N-1 instead of N in the denominator, and the square root of the result is the sample standard deviation. The correction makes the variance estimate unbiased; the sample standard deviation itself remains slightly biased, but it is the conventional estimator.
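Two of the points above can be verified numerically: the standard deviation is the square root of the variance, and it is expressed in the same unit as the data. In this sketch, multiplying the data by 10 stands in for a change of unit (for example, cm to mm); that conversion is an assumption made for illustration.

```python
import numpy as np

x = np.array([1, 3, 5, 7, 9])

# Standard deviation as the square root of the variance
population_std = np.sqrt(np.var(x))          # population version (divide by N)
sample_std = np.sqrt(np.var(x, ddof=1))      # sample version (divide by N - 1)

assert np.isclose(population_std, np.std(x))
assert np.isclose(sample_std, np.std(x, ddof=1))

# Changing the unit by a factor of 10 scales the standard deviation by 10,
# but scales the variance by 10**2 = 100
x_scaled = x * 10
std_ratio = np.std(x_scaled) / np.std(x)
var_ratio = np.var(x_scaled) / np.var(x)

print("Population std:", population_std)
print("Sample std:", sample_std)
print("Std ratio after scaling:", std_ratio)
print("Variance ratio after scaling:", var_ratio)
```

The variance is in squared units, which is why its ratio is the square of the scaling factor, while the standard deviation scales with the data.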

In [12]:
# Example of Standard Deviation
x = np.array([1, 3, 5, 7, 9])

# Population standard deviation: np.std divides by N by default (ddof=0)
std_dev_x = np.std(x)
print("Standard deviation of x is: ", std_dev_x)  # Output: 2.828

# Sample standard deviation that ignores NaN values: np.nanstd() with ddof=1
x_nan = np.array([1, 3, 5, 7, 9, math.nan])
std_dev_x_nan = np.nanstd(x_nan, ddof=1)
print("Standard deviation of x_nan is: ", std_dev_x_nan)  # Output: 3.162

Standard deviation of x is:  2.8284271247461903
Standard deviation of x_nan is:  3.1622776601683795


In [None]:
# Covariance