Awesome Statistics for Data Science

{Awesome Works in Progress}

Let's Talk Data

What is data?
- What is Data - Data represent facts or something that has actually taken place, observed and measured.
  - Data > (beats) Opinion
Data literacy
Data literacy is the ability to read, work with, analyze and communicate with data.
- How to build data literacy in your company - mit.edu
- Boost Your Team’s Data Literacy - hbr.org
- Data Analytics vs Data Analysis - bmc.com
- Data literacy training - statcan.gc.ca
- Data Literacy Preview: Study Hall: ASU + Crash Course 📺 ⭐ - Arizona State University
A Data Culture is the collective behaviors and beliefs of people who value, practice, and encourage the use of data to improve decision-making.
Data Types
- Qualitative
  - Nominal, Ordinal, Binary
- Quantitative
  - Discrete, Continuous
- Learn more
  - Types of Data & Measurement Scales: Nominal, Ordinal, Interval and Ratio - mymarketresearchmethods.com
  - Types of Variable - Laerd.com
  - Interval scale vs Ratio scale: Interval scales have no true zero and can take values below zero, e.g., temperature in degrees Celsius can be -10. Ratio scales have a true zero that means "none of the quantity", so values such as height and weight start at 0 and never fall below it.
  - When a Variable’s Level of Measurement Isn’t Obvious - theanalysisfactor.com
  - Foot Size vs Shoe Size - Shoe size is discrete, but the underlying measure, foot length, is continuous measurement data. learn more
  - Discrete vs. Continuous Variables: Meaning and Differences - outlier.org
  - Is Age Discrete or Continuous?
  - The numbers on the basketball players' t-shirts?

Data Bias

Selection bias
Berkson’s Bias - statology.org | Berkson’s bias is a type of selection bias that occurs when the way a sample is selected makes two variables appear correlated in the sample data (often negatively), even though they are uncorrelated, or even positively correlated, in the overall population.
Historical Bias
Outlier Bias
Visualization Bias
Simpson's Paradox

Numbers and Statistics

Statistical Analysis

Samples and Populations

Samples & Populations - stat.psu.edu
Sampling Methods
- Probability sampling; Non-probability sampling
Sample Size
- Sample Size Calculator - calculator.net
- A general rule of thumb for the Large Enough Sample Condition is that n≥30, where n is your sample size. Learn more - statisticshowto.com
- Margin of Error & Sample size Calculator - aytm.com
- Number of Samples (You can Afford) = Budget / Cost per Sample
- Power Analysis
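As a rough sketch of the sample-size ideas above, the snippet below combines the standard formula for estimating a proportion, n = z² · p(1 − p) / E², with the budget rule from the list. The function names, the 5% margin of error, and the budget figures are illustrative assumptions, not values taken from the linked calculators.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95, p=0.5):
    """Approximate sample size needed to estimate a proportion.

    p=0.5 is the most conservative guess (it maximizes p * (1 - p)).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


def affordable_samples(budget, cost_per_sample):
    """Number of samples (you can afford) = budget / cost per sample."""
    return int(budget // cost_per_sample)


n_needed = sample_size_for_proportion(margin_of_error=0.05)
n_affordable = affordable_samples(budget=2000, cost_per_sample=4)  # hypothetical figures

print("needed:", n_needed)
print("affordable:", n_affordable)
print("budget is sufficient:", n_affordable >= n_needed)
```

Comparing the two numbers is a quick sanity check before running a power analysis with a dedicated tool.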
Law Of Large Numbers (LLN) - investopedia.com | The law of large numbers, in probability and statistics, states that as a sample size grows, its mean gets closer to the average of the whole population.
Central Limit Theorem (CLT) - investopedia.com | In probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (i.e., a “bell curve”) as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the shape of the population's actual distribution.
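The two theorems above can be explored with a small simulation. This is a minimal sketch that assumes an exponential population with mean 1 (deliberately skewed); the seed, sample sizes, and variable names are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def draw():
    """One observation from a skewed population (exponential, mean 1)."""
    return random.expovariate(1.0)


# Law of large numbers: the mean of a large sample tends to be close to 1.
small_sample_mean = statistics.fmean(draw() for _ in range(10))
large_sample_mean = statistics.fmean(draw() for _ in range(100_000))

# Central limit theorem: means of many samples of size 50 cluster around 1
# with spread close to sigma / sqrt(n) = 1 / sqrt(50), roughly 0.141.
sample_means = [statistics.fmean(draw() for _ in range(50)) for _ in range(2000)]
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print("mean of 10 draws:", round(small_sample_mean, 3))
print("mean of 100,000 draws:", round(large_sample_mean, 3))
print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:", round(sd_of_means, 3))
```

Plotting a histogram of `sample_means` would show the bell shape, even though the individual draws are far from normal.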
Confidence Intervals
- Confidence intervals explained - scribbr.com
- Understanding Confidence Intervals 📺 - Dr Nic's Maths and Stats
- The correct interpretation of a 95% confidence interval is that "we are 95% confident that the population parameter is between X and Y." Learn more
- A confidence interval gives a range of plausible values for the population mean, centered on the sample mean.
- Critical Values (Z) 99%=2.575; 95%=1.960; 90%=1.645; 85%=1.440; 80%=1.282
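The critical values listed above can be reproduced with Python's standard library, and then used to build an interval around a sample mean. This is a sketch under the assumption that a z-based interval is acceptable; for small samples a t-based interval is usually preferred. The data values and function names are made up for illustration.

```python
import math
from statistics import NormalDist, fmean, stdev


def z_critical(confidence):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def mean_confidence_interval(data, confidence=0.95):
    """z-based confidence interval for the mean: mean +/- z * s / sqrt(n)."""
    center = fmean(data)
    standard_error = stdev(data) / math.sqrt(len(data))
    margin = z_critical(confidence) * standard_error
    return center - margin, center + margin


for level in (0.99, 0.95, 0.90, 0.85, 0.80):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")

# Hypothetical measurements, used only to show the call.
measurements = [9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3, 10.1, 9.9]
low, high = mean_confidence_interval(measurements, confidence=0.95)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```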

Descriptive Statistics

Describe, show or summarize data in a meaningful way

Measures of Central Tendency - are single values that attempt to describe the central position of a set of data.
- Mean (Average) - Most meaningful with normally distributed data
  - Arithmetic Mean Σx/n, Geometric Mean ⁿ√(∏x), Harmonic Mean n/Σ(1/x)
  - The Greek letter μ (mu) is used in statistics to represent the population mean of a distribution.
- Median (The "middle" of a sorted list of numbers) - Diminishes the effect of outliers (aka Med, M, x̃ 'x-tilde')
- Mode (Most Often) - bi-modal distribution; categorical data
- Numerical Summarization - stat.psu.edu
- Sensitivity to skewness
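The measures of central tendency above map directly onto Python's `statistics` module. The sketch below uses a small made-up data set and then appends one extreme value to show why the median is less sensitive to outliers than the mean.

```python
import statistics

data = [2, 4, 4, 8, 16]  # illustrative values

arithmetic = statistics.mean(data)            # sum(x) / n
geometric = statistics.geometric_mean(data)   # n-th root of the product of x
harmonic = statistics.harmonic_mean(data)     # n / sum(1 / x)
middle = statistics.median(data)
most_common = statistics.mode(data)

print("arithmetic mean:", arithmetic)
print("geometric mean:", round(geometric, 3))
print("harmonic mean:", round(harmonic, 3))
print("median:", middle)
print("mode:", most_common)

# One extreme value moves the mean a lot and the median only a little.
with_outlier = data + [1000]
print("mean with outlier:", round(statistics.mean(with_outlier), 2))
print("median with outlier:", statistics.median(with_outlier))
```

For positive data the three means always satisfy arithmetic ≥ geometric ≥ harmonic, which the output illustrates.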
Measures of Variability (Dispersion)
- Range
- Interquartile Range (IQR)
- Variance σ²
- Standard Deviation σ
  - Standard deviation (S) = square root of the variance
- Standard Error | The standard deviation of a statistic's sampling distribution; for the sample mean, SE = σ/√n
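A short sketch of the dispersion measures above, using the standard library. The data set is invented for illustration; note the difference between the population formulas (divide by n) and the sample formulas (divide by n − 1).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean = 5

value_range = max(data) - min(data)

# Population versions divide by n.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

# Sample versions divide by n - 1.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Standard error of the mean, estimated from the sample.
standard_error = sample_sd / math.sqrt(len(data))

print("range:", value_range)
print("population variance and sd:", population_variance, population_sd)
print("sample variance and sd:", round(sample_variance, 3), round(sample_sd, 3))
print("standard error of the mean:", round(standard_error, 3))
```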
Shapes of Distribution
- Normal Distribution
  - The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal (aka Gaussian or Gauss or Laplace–Gauss) distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).
  - The quincunx (or Galton Board) - mathsisfun.com | Simulator
- Non-normal Distribution (Flat, Bi-modal, Parabolic)
- Skewness (-ve Skewness/Skewed Left, +ve Skewness/Skewed Right)
  - When data are skewed right, the mean is larger than the median.
  - When data are skewed left, the mean is smaller than the median.
- Kurtosis (Leptokurtic, Mesokurtic, Platykurtic)
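The 68-95-99.7 figures of the empirical rule can be checked from the normal cumulative distribution function, and the mean-versus-median behaviour under skew can be seen on a tiny example. This is a minimal sketch; the skewed data set is made up.

```python
from statistics import NormalDist, fmean, median

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")

# A right-skewed sample: the long right tail pulls the mean above the median.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
print("mean:", fmean(right_skewed), "median:", median(right_skewed))
```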
Percentiles, Quartiles, Quintile and Decile
- Percentiles
  - The 30th percentile is the value from the data set greater than 30% of observations, and therefore less than 70% of observations.
  - Median = 50th percentile
  - 1st Quartile = 25th percentile
  - 3rd Quartile = 75th percentile
  - IQR = The difference between Q3 and Q1. IQR contains the middle 50% of data
- Quartiles are the values that divide a list of numbers into quarters - mathsisfun.com
  - Interquartile range = 3rd quartile - 1st quartile
  - Exclusive method vs inclusive method - The exclusive method excludes the median when identifying Q1 and Q3, while the inclusive method includes the median in identifying the quartiles.
  - Quartile Deviation - byjus.com | Quartile Deviation for Ungrouped/Grouped Data
- Quintile - "A quintile is one of five values that divide a range of data into five equal parts, each being 1/5th (20 percent) of the range." Investopedia
- Decile - Divide a range of data into ten equal parts, each being 1/10th (10 percent) of the range. The 1st and 9th deciles are equal to the 10th and 90th percentiles.
- Outliers
  - Identifying outliers with the 1.5xIQR rule - khanacademy.org
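The quartile, IQR, and 1.5×IQR ideas above can be sketched with `statistics.quantiles`. One caveat: Python's `method="inclusive"` and `method="exclusive"` options are interpolation rules and are not necessarily the same as the textbook "include or exclude the median" procedure described in the list, so results can differ slightly between tools. The data set is invented for illustration.

```python
import statistics

data = [1, 3, 4, 5, 6, 7, 8, 9, 40]  # illustrative values with one extreme point

# n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# 1.5 x IQR rule: anything outside the fences is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Deciles and quintiles follow the same pattern with `n=10` and `n=5`.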
Percentage
- Percentage Difference, Percentage Error, Percentage Change - mathsisfun.com
- Percentage Change and Percent Difference - sumn.org
- Change = ((New - Old) / |Old|) * 100
- Difference = |(First - Second)/((First + Second)/2)| * 100
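The two formulas above translate directly into small helper functions. The function names and the example numbers are illustrative.

```python
def percent_change(old, new):
    """Change = ((New - Old) / |Old|) * 100. Direction matters; old must be non-zero."""
    return (new - old) / abs(old) * 100


def percent_difference(first, second):
    """Difference = |(First - Second) / ((First + Second) / 2)| * 100. Symmetric."""
    return abs((first - second) / ((first + second) / 2)) * 100


print(percent_change(50, 75))       # going from 50 to 75
print(percent_change(75, 50))       # going back from 75 to 50
print(percent_difference(50, 75))   # same result in either order
print(percent_difference(75, 50))
```

Percentage change depends on which value is treated as the starting point, while percentage difference treats both values equally.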
Binning
- Freedman–Diaconis rule
- Sturges' rule - Optimal Bins = ⌈log₂(n) + 1⌉
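Both binning rules above can be sketched in a few lines. Sturges' rule depends only on the number of observations, while the Freedman–Diaconis rule uses the bin width 2 · IQR / n^(1/3). The function names are illustrative, and the IQR here comes from `statistics.quantiles` with its "inclusive" interpolation, which is one of several possible quartile conventions.

```python
import math
import statistics


def sturges_bins(n):
    """Sturges' rule: ceil(log2(n) + 1) bins for n observations."""
    return math.ceil(math.log2(n) + 1)


def freedman_diaconis_bins(data):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule is not usable for this data")
    width = 2 * iqr / len(data) ** (1 / 3)
    return math.ceil((max(data) - min(data)) / width)


values = list(range(1, 101))  # illustrative data: 1, 2, ..., 100
print("Sturges:", sturges_bins(len(values)))
print("Freedman-Diaconis:", freedman_diaconis_bins(values))
```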
Measures of Association
- Correlation and Causation
  - Correlation vs Causation: Understand the Difference for Your Product - amplitude.com
  - Correlation means there is a relationship or pattern between the values of two variables. A scatterplot displays data about two variables as a set of points in the xy-plane and is a useful tool for determining if there is a correlation between the variables.
  - Causation means that one event causes another event to occur. Causation can only be determined from an appropriately designed experiment. In such experiments, similar groups receive different treatments, and the outcomes of each group are studied. We can only conclude that a treatment causes an effect if the groups have noticeably different outcomes.
  - Confounding Variables - scribbr.com
- Correlation Coefficients
  - What Do Correlation Coefficients Positive, Negative, and Zero Mean? - investopedia.com
- Covariance
  - What Is Covariance? - investopedia.com
  - Covariance (CFI) - corporatefinanceinstitute.com | A measure of the relationship between random variables
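To make the link between covariance and the correlation coefficient concrete, here is a minimal pure-Python sketch that computes the sample covariance and the Pearson correlation by hand. The paired values are invented for illustration.

```python
import math
from statistics import fmean


def sample_covariance(xs, ys):
    """Average product of deviations from the means, using n - 1."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def pearson_r(xs, ys):
    """Covariance rescaled by both standard deviations, so -1 <= r <= 1."""
    sd_x = math.sqrt(sample_covariance(xs, xs))
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print("covariance:", sample_covariance(x, y))
print("correlation:", round(pearson_r(x, y), 4))
```

Covariance depends on the units of the variables, while the correlation coefficient is unit-free, which is why it is easier to compare across data sets.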
Normalization vs Standardization

Probabilities

Probability is a measure of how likely a particular event is to occur. It quantifies uncertainty and is a fundamental concept in statistics and mathematics. Probabilities are expressed as numbers between 0 and 1, where 0 indicates that an event will not occur and 1 indicates that it will certainly occur.

Probability Basics

Probability Basics - 365 Data Science

Percentiles

Percentiles are measures that indicate the relative position of a value within a data set. A percentile represents the percentage of values in the data set that fall below a given value. For example, the 50th percentile (median) is the value below which 50% of the data points lie.

Permutations

Permutations refer to the different ways in which a set of items can be arranged in a specific order. The order of arrangement is important in permutations. For example, the permutations of the set {A, B, C} are ABC, ACB, BAC, BCA, CAB, and CBA.

Combinations

Combinations are the different ways of selecting items from a larger set where the order of selection does not matter. For example, the combinations of choosing 2 items from the set {A, B, C} are AB, AC, and BC.
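The permutation and combination examples above can be reproduced with `itertools`, and the counts can be checked with `math.perm` and `math.comb` (available in Python 3.8 and later).

```python
import itertools
import math

items = ["A", "B", "C"]

# Order matters: every arrangement of all three items.
permutations = ["".join(p) for p in itertools.permutations(items)]

# Order does not matter: every way to choose 2 of the 3 items.
combinations = ["".join(c) for c in itertools.combinations(items, 2)]

print(permutations)
print(combinations)

# Counting without listing: n! arrangements, and "n choose k" selections.
print(math.perm(len(items)))
print(math.comb(len(items), 2))
```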

Inferential Statistics

Hypothesis

Hypothesis tests attempt to provide an answer to questions such as "How likely is it that an observation is just random chance?"
Null Hypothesis
- The null hypothesis, H0 is the commonly accepted fact; it is the opposite of the alternate hypothesis. Researchers work to reject, nullify or disprove the null hypothesis. Researchers come up with an alternate hypothesis, one that they think explains a phenomenon, and then work to reject the null hypothesis. learn more
- The null statement must always contain some form of equality (=, ≤ or ≥). Always write the alternative hypothesis, typically denoted Ha or H1, using not-equal, greater-than, or less-than symbols, i.e., (≠, >, or <). learn more
Confidence Intervals
Confidence Level
Alpha value (aka significance level)
Type I and Type II errors - scribbr.com
- Which is more dangerous for a smoke detector? A type I (false positive) or type II error (false negative)?
t-test - compares the means of two groups
Chi-Square test - determines whether categorical variables are associated
z-test
- Z-Scores
  - Simply put, a z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is. learn more
  - Z-Table - z-table.net
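A z-score is computed as z = (x − μ) / σ, and a z-table lookup corresponds to the standard normal cumulative distribution function. The sketch below shows both; the score, mean, and standard deviation are hypothetical numbers chosen for illustration.

```python
from statistics import NormalDist


def z_score(x, population_mean, population_sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - population_mean) / population_sd


# Hypothetical exam: population mean 70, standard deviation 10, raw score 85.
z = z_score(85, population_mean=70, population_sd=10)

# What a z-table gives: the share of the population at or below this z-score.
share_below = NormalDist().cdf(z)

print("z-score:", z)
print("share of scores below:", round(share_below, 4))
```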
Probability
- Probability Line - dcp.edu.gov.on.ca
- Intro to Probability for Data Science (Free e-book) - probability4datascience.com
- Statistical significance ♟
  - Statistical significance refers to the claim that a result from data generated by testing or experimentation is not likely to occur randomly or by chance but is instead likely to be attributable to a specific cause. learn more
  - A Refresher on Statistical Significance - Amy Gallo (Harvard Business Review)
P-value ⬆⬇
- Meet P. Value (aka p-value) - Alteryx Community Team
  - P-Values, clearly explained (Video) - StatQuest
  - In general, P values larger than 0.01 should be reported to two decimal places, those between 0.01 and 0.001 to three decimal places; P values smaller than 0.001 should be reported as P<0.001. learn more
  - A p-value at or below the chosen significance level (commonly 0.05) is considered statistically significant. The p-value is the probability of observing results at least as extreme as those measured if the null hypothesis were true; it is not the probability that the null hypothesis is correct. When p ≤ 0.05, we reject the null hypothesis in favor of the alternative hypothesis. learn more
- Cheatsheet
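To show how a p-value is obtained in the simplest case, here is a sketch of a two-sided one-sample z-test, which assumes the population standard deviation is known. The function names and the numbers in the example are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a result at least as extreme as z, if H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def one_sample_z_test(sample_mean, null_mean, population_sd, n):
    """Return the z statistic and its two-sided p-value."""
    z = (sample_mean - null_mean) / (population_sd / math.sqrt(n))
    return z, two_sided_p_value(z)


# Hypothetical study: H0 says the mean is 100; a sample of 100 has mean 103.
alpha = 0.05
z, p = one_sample_z_test(sample_mean=103, null_mean=100, population_sd=15, n=100)

print("z =", z)
print("p =", round(p, 4))
print("reject H0" if p <= alpha else "fail to reject H0")
```

The decision compares p with the significance level alpha chosen before looking at the data.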
Univariate, bivariate, multivariate and multivariate multiple regression (MMR) analysis
- What’s the difference between univariate, bivariate and multivariate descriptive statistics? - scribbr.com
- What is a multivariate relationship?
Upsampling and Downsampling

Error Types

Fail to reject H0 AND H0 is True = Correct
Reject H0 AND H0 is False = Correct
Reject H0 AND H0 is True = Type I Error (false positive)
Fail to reject H0 AND H0 is False = Type II Error (false negative)
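The four outcomes above form a small decision table, sketched below as a function. The function name is illustrative, and the smoke-detector reading assumes that H0 means "there is no fire".

```python
def classify_outcome(reject_h0, h0_is_true):
    """Map a test decision and the (usually unknown) truth to an outcome."""
    if reject_h0 and h0_is_true:
        return "Type I error (false positive)"
    if not reject_h0 and not h0_is_true:
        return "Type II error (false negative)"
    return "Correct decision"


# Smoke detector, with H0 = "there is no fire":
print(classify_outcome(reject_h0=True, h0_is_true=True))    # alarm, no fire
print(classify_outcome(reject_h0=False, h0_is_true=False))  # no alarm, fire
print(classify_outcome(reject_h0=True, h0_is_true=False))   # alarm, fire
print(classify_outcome(reject_h0=False, h0_is_true=True))   # no alarm, no fire
```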

Statistical Tests and Analysis

Statistical tests: which one should you use? - scribbr.com
Choosing the correct statistical test in SAS, Stata, SPSS and R - stats.idre.ucla.edu
How to choose the right statistical test? - Barun Nayak and Avijit Hazra
Frequency Distribution
Cross Tabulation
Correspondence analysis
Multinomial Logistic Regression
Cluster Analysis
One-hot encoding
Numerical encoding
Ordinal encoding

Error Estimations

Effect Size

Cohen's d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units, so it tells you how many standard deviations lie between the two means. learn more
- How To Calculate Cohen's d In Excel
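Cohen's d can also be computed in a few lines of Python. The sketch below uses the pooled standard deviation as the unit, which is one common convention (other variants exist); the two groups are invented for illustration.

```python
import math
from statistics import fmean, variance


def cohens_d(group_a, group_b):
    """Difference between two means in units of the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / math.sqrt(pooled_variance)


treatment = [2, 4, 6, 8, 10]  # illustrative values, mean 6
control = [1, 3, 5, 7, 9]     # illustrative values, mean 5

d = cohens_d(treatment, control)
print("Cohen's d:", round(d, 4))
```

The sign of d shows the direction of the difference, and swapping the groups flips it.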
Power Analysis, Clearly Explained! 📺 - StatQuest

Tools

Statistics Kingdom - statskingdom.com
DrawMyData - a tool for teaching stats and data science by Robert Grant
Desmos - Graphing Calculator - desmos.com
Standard Normal Distribution Table - mathsisfun.com
Probability Distribution Applets - divms.uiowa.edu (Matt Bognar, Ph.D.)
Simulated Sampling Distributions ⭐ - albany.edu
One Sample T-Test Calculator - statskingdom.com
Comparing Two Independent Samples (Sample Size) ⭐ - stat.ubc.ca
Poisson Distribution Calculator - statology.org | Example: if the restaurant's normal average is 100 (λ = 100), the probability that it receives exactly 130 is P(X = 130) ≈ 0.00058
Online Statistics Calculator ⭐ - datatab.net
Dice Roller - random.org
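The Poisson example in the list above (an average of 100, and the probability of seeing exactly 130) can be checked without an online calculator. This sketch evaluates the Poisson probability mass function in log space to avoid very large intermediate numbers; the function name is illustrative.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam.

    Uses exp(k * ln(lam) - lam - ln(k!)) with lgamma(k + 1) = ln(k!).
    """
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


p_exactly_130 = poisson_pmf(130, lam=100)

# Probability of more than 130, from the complement of P(X <= 130).
p_more_than_130 = 1 - sum(poisson_pmf(k, lam=100) for k in range(131))

print(f"P(X = 130) = {p_exactly_130:.5f}")
print(f"P(X > 130) = {p_more_than_130:.5f}")
```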

Write-Up

Analytical Report – What Is It and How to Write It? - whatagraph.com
APA
- Numbers & Statistics - APA Formatting And Style Guide (7th Edition)
- Reporting a multiple linear regression in APA

Surveys

Survey testing - abs.gov.au

Books

Pattern Recognition and Machine Learning (Free)
Analysis of Multiple Dependent Variables
Introductory Statistics - saylordotorg.github.io
Introduction to Statistics - courses.lumenlearning.com
Mathspace ⭐ - mathspace.co | We bring all of your learning tools together in one place, from video lessons, textbooks, to adaptive practice. Encourage your students to become self-directed learners.
Introduction to Probability for Data Science - probability4datascience.com

Mathematics for Machine Learning

- Linear Algebra: Fundamental concepts such as matrices, vectors, dot products, and matrix multiplication are crucial. Linear algebra provides the language and framework for describing and manipulating data in ML.
- Calculus: Understanding of derivatives and gradients is important for optimization problems in ML, including gradient descent.
- Probability and Statistics: Basic understanding of probabilities, probability distributions, means, variances, and expectation values is essential for understanding models, making predictions, and evaluating model performance.
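To make the linear algebra and calculus items a little more concrete, here is a minimal pure-Python sketch of a dot product, a matrix-vector product, and gradient descent on a one-variable function. The function f(w) = (w − 3)², the learning rate, and the step count are arbitrary choices for illustration; real ML code would normally use a library such as NumPy.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    """Matrix-vector product: one dot product per matrix row."""
    return [dot(row, vector) for row in matrix]


def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


print(dot([1, 2, 3], [4, 5, 6]))
print(matvec([[1, 0], [0, 2]], [3, 4]))

# Minimize f(w) = (w - 3)^2, whose derivative is f'(w) = 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(round(w_min, 6))
```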

Pre-Algebra

Factors and multiples
Prime numbers

Linear Algebra

Khan Academy - Linear algebra
Linear Algebra for Machine Learning - Jon Krohn | (48 videos) This is a complete course on linear algebra for machine learning.

YouTube 📺

Essence of linear algebra - 3Blue1Brown
The Art of Linear Programming 📺 ~19min ⭐ - Tom S

Tools

Graphing Calculator ⭐ - desmos.com | Explore math with our beautiful, free online graphing calculator.

Calculus

Optimization Methods

Extra Knowledge

Mathematics is the queen of Sciences (Video)
What Is The Fibonacci Sequence? - The Fibonacci Sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55… (Xn = Xn-1 + Xn-2)
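The recurrence Xn = Xn-1 + Xn-2 can be turned into a short generator-free function. This is a simple sketch that starts from 0 and 1, as in the sequence shown above.

```python
def fibonacci(count):
    """Return the first `count` Fibonacci numbers, starting 0, 1."""
    sequence = []
    previous, current = 0, 1
    for _ in range(count):
        sequence.append(previous)
        # Each new term is the sum of the two terms before it.
        previous, current = current, previous + current
    return sequence


print(fibonacci(11))
```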

Books

Mathematics for Machine Learning (Free)

Python

manim - Mathematical Animation Engine

YouTube 📺

MIT 18.650 Statistics for Applications, Fall 2016 - MIT OpenCourseWare
Joshua Emmanuel
zedstatistics ⭐
Stephanie Glen - statisticshowto.com
Khan Academy - The average, Descriptive statistics, Probability and Statistics
Evidence-Based Practice - Rich Simpson
How Imaginary Numbers Were Invented

Learning

Praxis Core Math - khanacademy.org
Data Science for Beginners (Microsoft)
Introduction to Data Science - umich.edu
The Data Journey - statcan.gc.ca
Free Data Science Courses - Harvard University - harvard.edu

Special Videos 📺

The Trillion Dollar Equation - Veritasium


NajiElKotob/Awesome-Statistics-For-Data-Science
