# Statistics and Probability for Data Science

The work done in the name of data science has roots in various fields, including computer science, mathematics, and statistics. Broadly speaking, **statistics** involves collecting and analyzing data to find useful information, with the goal of making decisions based on that knowledge.

There are two major areas in the field of statistics:

1. **Descriptive statistics**, which covers collecting and organizing data and usually performing general calculations to measure central tendency and variability (or spread) of that data
2. **Inferential statistics**, which are procedures to understand relationships, draw conclusions, or make predictions about a population based on sample data from that population

A **population** is the set of all measurements of interest, and a number describing a population is a **parameter**. A **sample** is a subset of individuals or objects selected to represent the entire group (the population), and a number describing a sample is a **statistic**. Population parameters are often unknown, hence the need to use sample data to make inferences about them.

A number of inferential statistics techniques rely on **probability theory**, which is a branch of mathematics that's concerned with calculating the likelihood of outcomes of experiments. In other words, it's the science of uncertainty.

## Introductory Terminology - Types of Data

Before diving into each field, it's important to understand the different **types of data** out there. Generally, it can be qualitative or quantitative, each with further sub-categories.

- **Quantitative Data**: numeric observations, such as height, the number of classes a student is enrolled in, or the percentage of households with pets by city. This can be further broken down into either discrete or continuous data
- **Qualitative Data**: observations that are non-numeric attributes, such as gender, the mode of transportation a person takes, or the type of material something is made of. Also referred to as **categorical** data
- **Discrete Numerical Data**: a subset of quantitative data, where you are able to *count* the possible values (think integers), such as the number of people who contracted the flu this year
- **Continuous Numerical Data**: a subset of quantitative data, where you aren't able to count the possible values (think decimals - usually the result of a measurement that can take on any real number). An example is the fluid ounces of coffee you drink each morning
- **Ordinal Data**: qualitative or categorical data that has an inherent order to it, such as months (January, February, March,...), sizes (small, medium, large), or quality ratings (1-10)
- **Nominal Data**: qualitative data where the categories don't have an inherent order or rank, such as the style of a home (colonial, ranch, contemporary, etc.)
- **Binary Data**: a subset of qualitative data where observations fall into one of two mutually exclusive categories, such as true/false or meets standards / is defective

## Formulae for Descriptive Statistics and Random Variables

**General Measures of Central Tendency and Variability**

- **Arithmetic Mean**: also known as an average or the expected value. It's the sum of values ($x_i$'s) divided by the number of values ($n$)
- **Median**: the middle value of $x_i$'s in a sorted dataset (also the 50th percentile, or 0.5 quantile)
- **Mode**: the most common value in the set
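- **Variance**: a measure of spread within a collection of data. It's the average of the squared differences between each value and the mean (the sum of squared differences divided by $n$ for a population, or by $n-1$ for a sample)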
- **Standard Deviation**: the square root of variance
- **Z-Score**: the number of standard deviations a data point is above or below the mean
- **Random Variable**: assigns a numerical value to each possible outcome of a random experiment. The value depends on chance. The formulae below use $p$ to indicate the probability of an outcome happening
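
As a rough illustration of the measures listed above, the following sketch uses Python's standard `statistics` module. The `values` list is made-up data (ounces of coffee on seven mornings), not taken from any real dataset, and it is treated as a sample.

```python
import statistics

# Hypothetical sample: fluid ounces of coffee drunk on seven mornings
values = [8, 12, 12, 10, 16, 12, 14]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most common value

# Sample standard deviation (divides by n - 1 before taking the square root)
s = statistics.stdev(values)

# Z-score: how many standard deviations the observation 16 sits above the mean
z_16 = (16 - mean) / s

print(mean, median, mode, round(s, 3), round(z_16, 3))
```

Here the mean, median, and mode all happen to equal 12 because the made-up data is roughly symmetric; with skewed data the three measures would generally differ.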

| **Statistic** |**Population** | **Sample** | **Random Variables**|
| ------- | :-------: | :-------: | :-------: |
| Mean | $\mu = \frac{\Sigma x}{n}$ | $\bar{x} = \frac{\Sigma x}{n}$ | $\text{E(x)} = \sum_i p_i x_i$ |
| Variance | $\sigma^2 = \frac{\Sigma (x-\mu)^2}{n}$ | $s^2 = \frac{\Sigma (x-\bar{x})^2}{n-1}$ | $\text{Var(x)} = \sum_i p_i(x_i-\mu)^2$ |
| Standard Deviation | $\sigma = \sqrt{\frac{\Sigma (x-\mu)^2}{n}}$ | $s = \sqrt{\frac{\Sigma (x-\bar{x})^2}{n-1}}$ | $\text{Std(x)} = \sqrt{\sum_i p_i (x_i-\mu)^2}$ |
| Average Deviation | $\frac{\Sigma \lvert x -\mu \rvert}{n}$ | $\frac{\Sigma \lvert x-\bar{x} \rvert}{n}$ | $\sum_i p_i \lvert x_i-\mu \rvert$ |
| Z-Score | $z = \frac{x-\mu}{\sigma}$ | $\hat{z} = \frac{x-\bar{x}}{s}$ | $z = \frac{x-\mu}{\sigma}$ |
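
To make the difference between the table's columns concrete, here is a minimal sketch using Python's standard library. The `data` list and the fair six-sided die are assumed examples chosen for illustration; `statistics.pvariance` and `statistics.pstdev` divide by $n$, while `statistics.variance` and `statistics.stdev` divide by $n-1$.

```python
import math
import statistics

# Hypothetical measurements
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the data as the entire population: divide by n
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Treating the same data as a sample: divide by n - 1
samp_var = statistics.variance(data)
samp_std = statistics.stdev(data)

# Discrete random variable: a fair six-sided die (assumed example),
# where every outcome x_i has probability p_i = 1/6
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

expected = sum(p * x for p, x in zip(probs, outcomes))                  # E(x)
rv_var = sum(p * (x - expected) ** 2 for p, x in zip(probs, outcomes))  # Var(x)
rv_std = math.sqrt(rv_var)                                              # Std(x)
rv_avg_dev = sum(p * abs(x - expected) for p, x in zip(probs, outcomes))

print(pop_var, samp_var, expected, rv_var, rv_avg_dev)
```

The sample variance comes out larger than the population variance for the same numbers, since the sum of squared differences is divided by a smaller denominator.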


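- **Chebyshev's Theorem**: let $k$ be any number $\ge 1$, then the proportion of the distribution that lies within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, whatever the shape of the distribution. For example, with $k = 2$, at least 75% of the values lie within 2 standard deviations of the mean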
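
The bound can be checked empirically. The sketch below is only an illustration: `chebyshev_bound` and `proportion_within` are hypothetical helper names, and `data` is a made-up, deliberately skewed dataset treated as a whole population.

```python
import statistics


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - 1 / k**2


def proportion_within(data, k):
    """Observed proportion of values strictly within k population std devs."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mu) < k * sigma]
    return len(inside) / len(data)


# Hypothetical, deliberately skewed data with one large outlier
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

for k in (1.5, 2, 3):
    observed = proportion_within(data, k)
    bound = chebyshev_bound(k)
    print(f"k={k}: observed {observed:.3f} >= bound {bound:.3f}")
```

The observed proportions are typically well above the bound; Chebyshev's theorem gives a guaranteed minimum, not an estimate of the actual proportion.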

## Inferential Statistics

[TO COME]