# Statistics Tutorial

Hi!

This notebook is a gentle introduction to essential concepts in statistics. I try to present them in a fun, interactive way, and I encourage you to play with the code to get a better grasp of each concept.

I will be using a "[Toy Dataset](https://www.kaggle.com/carlolepelaars/toy-dataset)" to illustrate concepts in this kernel.

The Jupyter Notebook and dataset are also available as a [GitHub repository](https://github.com/CarloLepelaars/stats_tutorial).

![](https://i.stack.imgur.com/c88K3.png)

## Table of contents

- [Preparation](#1)
- [Discrete and Continuous Variables](#2)
  - PMF (Probability Mass Function)
  - PDF (Probability Density Function)
  - CDF (Cumulative Distribution Function)
- [Distributions](#3)
  - Uniform Distribution
  - Normal Distribution
  - Binomial Distribution
  - Poisson Distribution
  - Log-normal Distribution
- [Summary Statistics and Moments](#4)
- [Bias, MSE and SE](#5)
- [Sampling Methods](#6)
- [Covariance](#7)
- [Correlation](#8)
- [Linear Regression](#9)
  - Anscombe's Quartet
- [Bootstrapping](#10)
- [Hypothesis Testing](#11)
  - p-value
  - q-q plot
- [Outliers](#12)
  - Grubbs Test
  - Tukey's Method
- [Overfitting](#20)
  - Prevention of Overfitting
  - Cross-Validation
- [Generalized Linear Models (GLMs)](#13)
  - Link Functions
  - Logistic Regression
- [Frequentist vs. Bayes](#14)
- [Bonus: Free Statistics Courses](#15)
- [Sources](#16)

## Preparation <a id="1"></a>

In [1]:
# Dependencies

# Standard Dependencies
import os
import numpy as np
import pandas as pd
from math import sqrt

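# Visualization
# Note: the wildcard import below pulls NumPy and pyplot names into the
# namespace, so it shadows math.sqrt imported above with NumPy's sqrt
from pylab import *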
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
from statistics import median
from scipy import signal
from scipy.special import factorial
import scipy.stats as stats
from scipy.stats import sem, binom, lognorm, poisson, bernoulli, spearmanr
from scipy.fftpack import fft, fftshift

# Scikit-learn for Machine Learning models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Seed for reproducibility
seed = 12345
np.random.seed(seed)

# Directory containing the dataset (adjust this path to your environment)
KAGGLE_DIR = 'D:/python/kaggle/1. toy dataset/dataset/'

# Read in csv of Toy Dataset
# We will use this dataset throughout the tutorial
df = pd.read_csv(os.path.join(KAGGLE_DIR, 'toy_dataset.csv'))

# Files and file sizes
print('\n# Files and file sizes')
for file in os.listdir(KAGGLE_DIR):
    # File size in megabytes, rounded to two decimals
    size_mb = round(os.path.getsize(os.path.join(KAGGLE_DIR, file)) / 1000000, 2)
    print('{}| {} MB'.format(file.ljust(30), size_mb))


# Files and file sizes
toy_dataset.csv               | 5.74 MB
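
The preparation cell sets a seed so that results can be reproduced. As a minimal sketch of what that means in practice (the helper `draw_sample` and the second seed value `54321` are illustrative assumptions, not part of the original notebook), the snippet below reseeds NumPy's global generator and checks that the same seed gives the same random numbers:

```python
import numpy as np

def draw_sample(seed, size=5):
    """Seed NumPy's global generator, then draw `size` standard-normal values.

    Hypothetical helper for illustration only.
    """
    np.random.seed(seed)
    return np.random.normal(size=size)

# Same seed as the preparation cell, used twice
first = draw_sample(12345)
second = draw_sample(12345)

# An arbitrary different seed, chosen only for comparison
different = draw_sample(54321)

# Identical seeds reproduce the exact same draws
print(np.array_equal(first, second))

# A different seed gives a different sequence
print(np.array_equal(first, different))
```

This is why rerunning the notebook from the top should give the same plots and numbers each time, as long as the cells are executed in the same order.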
