NajiElKotob/Awesome-Statistics-For-Data-Science

Awesome Statistics for Data Science

{Awesome Works in Progress}

Let's Talk Data


Data Bias

  • Selection bias
  • Berkson’s Bias - statology.org | Berkson’s bias occurs when the way a sample is selected depends on both variables under study, so they can appear correlated in the sample (often negatively) even though they are unrelated, or related differently, in the overall population.
  • Historical Bias
  • Outlier Bias
  • Visualization Bias
  • Simpson's Paradox
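
Simpson's Paradox is easier to see with numbers. The sketch below uses illustrative counts (treat them as hypothetical data, not as results from this repository) for two treatments, `A` and `B`, split into two subgroups. The names `outcomes`, `rate`, and `pooled_rate` are made up for this example.

```python
# Illustrative counts only (hypothetical data): (successes, trials) per subgroup.
outcomes = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Success rate inside a single subgroup."""
    return successes / trials


def pooled_rate(groups):
    """Success rate after merging all subgroups of one treatment."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return successes / trials


for subgroup in ("small", "large"):
    a = rate(*outcomes["A"][subgroup])
    b = rate(*outcomes["B"][subgroup])
    print(f"{subgroup}: A={a:.2f}  B={b:.2f}")

print(f"pooled: A={pooled_rate(outcomes['A']):.2f}  B={pooled_rate(outcomes['B']):.2f}")
```

With these counts, `A` has the higher rate inside each subgroup, yet `B` looks better once the subgroups are pooled, because the two treatments were applied to very different mixes of cases.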

Numbers and Statistics

Statistical Analysis

Samples and Populations

  • Samples & Populations - stat.psu.edu
  • Sampling Methods
    • Probability sampling; Non-probability sampling
  • Sample Size
  • Law Of Large Numbers (LLN) - investopedia.com | The law of large numbers, in probability and statistics, states that as a sample size grows, its mean gets closer to the average of the whole population.
  • Central Limit Theorem (CLT) - investopedia.com | In probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (i.e., a “bell curve”) as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population's actual distribution shape (provided its variance is finite).
  • Confidence Intervals
    • Confidence intervals explained - scribbr.com
    • Understanding Confidence Intervals 📺 - Dr Nic's Maths and Stats
    • The correct interpretation of a 95% confidence interval is that "we are 95% confident that the population parameter is between X and Y." Learn more
    • A confidence interval gives a range of plausible values for the population mean, built around the sample mean.
    • Critical Values (Z): 99%=2.576; 95%=1.960; 90%=1.645; 85%=1.440; 80%=1.282
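
Critical values like those listed above can be computed rather than memorized. The following is a minimal sketch using only Python's standard library; `z_critical` and `mean_confidence_interval` are hypothetical helper names, the sample is invented, and the interval uses the normal approximation (for small samples a t-based interval is usually preferred).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_critical(confidence):
    """Two-sided critical value z* for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval: mean ± z* · s / √n."""
    n = len(sample)
    margin = z_critical(confidence) * stdev(sample) / sqrt(n)
    centre = mean(sample)
    return centre - margin, centre + margin


for level in (0.99, 0.95, 0.90, 0.85, 0.80):
    print(f"{level:.0%}: z* = {z_critical(level):.3f}")

# Hypothetical measurements, used only to exercise the function.
sample = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.0, 4.7, 5.1, 4.9]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```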

Descriptive Statistics

Describe, show or summarize data in a meaningful way

  • Measures of Central Tendency - are single values that attempt to describe the central position of a set of data.
    • Mean (Average) - Most meaningful with normally distributed data
      • Arithmetic Mean = Σx/n; Geometric Mean = ⁿ√(∏x) = (∏x)^(1/n); Harmonic Mean = n/Σ(1/x)
      • The Greek letter μ (mu) is used in statistics to represent the population mean of a distribution.
    • Median (The "middle" of a sorted list of numbers) - Diminishes the effect of outliers (aka Med, M, x̃ 'x-tilde')
    • Mode (Most Often) - bi-modal distribution; categorical data
    • Numerical Summarization - stat.psu.edu
    • Sensitivity to skewness
  • Measures of Variability (Dispersion)
    • Range
    • Interquartile Range (IQR)
    • Variance σ²
    • Standard Deviation σ
      • Standard deviation (S) = square root of the variance
    • Standard Error | The standard deviation of a statistic's sampling distribution; for the sample mean, SE = σ/√n
  • Shapes of Distribution
    • Normal Distribution
      • The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal (aka Gaussian or Gauss or Laplace–Gauss) distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).
      • The quincunx (or Galton Board) - mathsisfun.com | Simulator
    • Non-normal Distribution (Flat, Bi-modal, Parabolic)
    • Skewness (-ve Skewness/Skewed Left, +ve Skewness/Skewed Right)
      • When data are skewed right, the mean is larger than the median.
      • When data are skewed left, the mean is smaller than the median.
    • Kurtosis (Leptokurtic, Mesokurtic, Platykurtic)
  • Percentiles, Quartiles, Quintile and Decile
    • Percentiles
      • The 30th percentile is the value that is greater than 30% of the observations and less than the remaining 70%.
      • Median = 50th percentile
      • 1st Quartile = 25th percentile
      • 3rd Quartile = 75th percentile
      • IQR = The difference between Q3 and Q1. IQR contains the middle 50% of data
    • Quartiles are the values that divide a list of numbers into quarters - mathsisfun.com
      • Interquartile range = 3rd quartile - 1st quartile
      • Exclusive method vs inclusive method - The exclusive method excludes the median when identifying Q1 and Q3, while the inclusive method includes the median in identifying the quartiles.
      • Quartile Deviation - byjus.com | Quartile Deviation for Ungrouped/Grouped Data
    • Quintile - "A quintile is one of five values that divide a range of data into five equal parts, each being 1/5th (20 percent) of the range." Investopedia
    • Decile - Divides a range of data into ten equal parts, each being 1/10th (10 percent) of the range. The 1st and 9th deciles equal the 10th and 90th percentiles.
    • Outliers
  • Percentage
  • Binning
  • Measures of Association
  • Normalization vs Standardization
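
Several of the descriptive measures above can be tried out with Python's built-in `statistics` module. This is a rough sketch on invented data; the 1.5 × IQR fence used to flag outliers is a common convention assumed here, not something this list prescribes, and `st.quantiles` uses Python's own "exclusive" interpolation by default.

```python
import statistics as st

# Hypothetical positive values for the three means.
values = [1, 2, 4, 8]
arithmetic = st.mean(values)            # Σx / n
geometric = st.geometric_mean(values)   # (∏x)^(1/n)
harmonic = st.harmonic_mean(values)     # n / Σ(1/x)

# Hypothetical data with one deliberate outlier (30).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
q1, q2, q3 = st.quantiles(data, n=4)    # default method is "exclusive"
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed convention
outliers = [x for x in data if x < low_fence or x > high_fence]

# Population variance σ² and standard deviation σ.
variance = st.pvariance(data)
sigma = st.pstdev(data)

# Standardization (z-scores) vs. min-max normalization.
mu = st.mean(data)
standardized = [(x - mu) / sigma for x in data]
normalized = [(x - min(data)) / (max(data) - min(data)) for x in data]

print(arithmetic, round(geometric, 3), round(harmonic, 3))
print(q1, q2, q3, iqr, outliers)
print(st.mean(data), st.median(data))  # the outlier pulls the mean above the median
```

Standardization rescales the data to mean 0 and standard deviation 1, while min-max normalization squeezes it into the range 0 to 1; with an outlier present, most normalized values end up crowded near 0.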

Probabilities

Probabilities refer to the measure of the likelihood that a particular event will occur. It quantifies uncertainty and is a fundamental concept in statistics and mathematics. Probabilities are expressed as numbers between 0 and 1, where 0 indicates that an event will not occur, and 1 indicates that an event will certainly occur.

Probability Basics

Percentiles

Percentiles are measures that indicate the relative position of a value within a data set. A percentile represents the percentage of values in the data set that fall below a given value. For example, the 50th percentile (median) is the value below which 50% of the data points lie.
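
As a small illustration of that definition, the sketch below computes the percentage of values that fall strictly below a given value. `percentile_rank` is a hypothetical helper and the scores are invented; other conventions (for example counting ties as half) also exist.

```python
def percentile_rank(data, value):
    """Percentage of observations strictly below `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical scores
print(percentile_rank(scores, 80))  # 5 of the 10 scores are below 80
```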

Permutations

Permutations refer to the different ways in which a set of items can be arranged in a specific order. The order of arrangement is important in permutations. For example, the permutations of the set {A, B, C} are ABC, ACB, BAC, BCA, CAB, and CBA.
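
The {A, B, C} example can be reproduced with the standard library. This is only a sketch; `arrangements` is a name chosen for the example.

```python
from itertools import permutations
from math import factorial, perm

items = ["A", "B", "C"]

# Every ordering of all three items.
arrangements = ["".join(p) for p in permutations(items)]
print(arrangements)

# Counting without listing: n! orderings of n items,
# and perm(n, k) ordered selections of k items out of n.
print(factorial(3), perm(3, 2))
```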

Combinations

Combinations are the different ways of selecting items from a larger set where the order of selection does not matter. For example, the combinations of choosing 2 items from the set {A, B, C} are AB, AC, and BC.
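
The same kind of check works for combinations; here is a short sketch in which `pairs` is just an illustrative name.

```python
from itertools import combinations
from math import comb

# Unordered selections of 2 items from {A, B, C}.
pairs = ["".join(c) for c in combinations("ABC", 2)]
print(pairs)

# Counting without listing: "n choose k".
print(comb(3, 2))
```

Because order does not matter, BA is not listed separately from AB, which is why there are fewer combinations than ordered selections of the same size.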


Inferential Statistics

Hypothesis
  • Hypothesis tests attempt to provide an answer to questions such as "How likely is it that an observation is just random chance?"
  • Null Hypothesis
    • The null hypothesis, H0, is the commonly accepted fact; it is the opposite of the alternate hypothesis. Researchers come up with an alternate hypothesis, one that they think explains a phenomenon, and then work to reject, nullify or disprove the null hypothesis. learn more
    • The null statement must always contain some form of equality (=, ≤ or ≥). Always write the alternative hypothesis, typically denoted with Ha or H1, using less than, greater than, or not equals symbols, i.e., (≠, >, or <). learn more
  • Confidence Intervals
  • Confidence Level
  • Alpha value (aka significance level)
  • Type I and Type II errors - scribbr.com
    • Which is more dangerous for a smoke detector? A type I (false positive) or type II error (false negative)?
  • t-test - compares the means of two groups
  • Chi-Square test - determines whether categorical variables are associated
  • z-test
    • Z-Scores
      • Simply put, a z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is. learn more
      • Z-Table - z-table.net
  • Probability
  • P-value ⬆⬇
    • Meet P. Value (aka p-value) - Alteryx Community Team
      • P-Values, clearly explained (Video) - StatQuest
      • In general, P values larger than 0.01 should be reported to two decimal places, those between 0.01 and 0.001 to three decimal places; P values smaller than 0.001 should be reported as P<0.001. learn more
      • A p-value of 0.05 or less is conventionally treated as statistically significant. It means that, if the null hypothesis were true, results at least as extreme as those observed would occur no more than 5% of the time. It is not the probability that the null hypothesis is correct. At that threshold we reject the null hypothesis in favor of the alternative hypothesis. learn more
    • Cheatsheet
  • Univariate, bivariate, multivariate analysis and multivariate multiple regression (MMR)
  • Upsampling and Downsampling
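
To tie z-scores, p-values and the alpha value together, here is a minimal sketch of a one-sample, two-sided z-test. It assumes the population standard deviation is known; the function name `one_sample_z_test` and all the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-sided p-value) for H0: population mean == mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability of a result at least this extreme if H0 were true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical study: 100 observations averaging 103, tested against mu0 = 100.
z, p = one_sample_z_test(sample_mean=103, mu0=100, sigma=15, n=100)

alpha = 0.05  # significance level
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.4f}, decision: {decision}")
```

The p-value here is computed under the assumption that H0 is true, so a small value signals that the data would be surprising under H0, not that H0 has a small probability of being true.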
Error Types
  • Fail to reject H0 AND H0 is True = Correct
  • Reject H0 AND H0 is False = Correct
  • Reject H0 AND H0 is True = Type I Error (false positive)
  • Fail to reject H0 AND H0 is False = Type II Error (false negative)
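
A quick simulation can make the Type I error concrete. In this sketch (an illustration under stated assumptions, not a rigorous study) every sample is drawn from a population where H0 really is true, so each rejection is a false positive; the helper name `type_i_error_rate` and its defaults are invented for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Fraction of two-sided z-tests that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Data really come from N(0, 1), so H0: mean == 0 is true.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = (sum(sample) / n) / (1 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # rejecting a true H0 is a Type I error
    return rejections / trials


rate = type_i_error_rate()
print(f"estimated Type I error rate: {rate:.3f}")
```

The estimated rate should land close to the chosen alpha, which is what the significance level is meant to control.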
Statistical Tests and Analysis
Error Estimations
Effect Size

Other Topics


Tools


Write-Up


Surveys

Books


Mathematics for Machine Learning

  • Linear Algebra: Fundamental concepts such as matrices, vectors, dot products, and matrix multiplication are crucial. Linear algebra provides the language and framework for describing and manipulating data in ML.
  • Calculus: Understanding of derivatives and gradients is important for optimization problems in ML, including gradient descent.
  • Probability and Statistics: Basic understanding of probabilities, probability distributions, means, variances, and expectation values is essential for understanding models, making predictions, and evaluating model performance.
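
Two of the ideas mentioned above, the dot product and gradient descent, fit in a few lines of plain Python. This is a toy sketch: `dot` and `gradient_descent` are hypothetical helpers, and the function being minimized, f(x) = (x - 3)², was picked only because its minimum is easy to verify.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6

# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 6))
```

The learning rate controls the step size: too small and convergence is slow, too large and the updates can overshoot the minimum.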

Pre-Algebra

Linear Algebra

YouTube 📺
Tools
  • Graphing Calculator ⭐ - desmos.com | Explore math with our beautiful, free online graphing calculator.

Calculus

Precalculus

Matrices

Gradient

Optimization Methods

Extra Knowledge

Books

Python

  • manim - Mathematical Animation Engine

YouTube 📺


Learning


Special Videos 📺

Special Channels 📺


Related Topics
