# Skewness and kurtosis tutorial

## Outline
- Introduction
    - Provide a brief introduction to the importance of understanding data distributions in data science.
    - Explain how skewness and kurtosis are key statistical measures to evaluate the shape of data.
    - Introduce the goal of the tutorial: to help beginners grasp the concepts of skewness and kurtosis.
- What is Skewness?
    - Define skewness and explain its significance in data analysis.
    - Discuss how skewness measures the asymmetry in the data distribution.
- Types of Skewness
    - Explain the three types of skewness: positive skew, negative skew, and zero skew (symmetrical).
    - Provide examples and real-world scenarios for each type.
- Calculating Skewness
    - Offer step-by-step instructions on how to calculate skewness manually. 
    - Introduce the skewness formula and provide examples for practice.
- What is Kurtosis?
    - Define kurtosis and its importance in characterizing data distributions.
    - Discuss how kurtosis measures the tail behavior of data.
- Types of Kurtosis
    - Explain the three types of kurtosis: leptokurtic, mesokurtic, and platykurtic.
    - Provide examples and real-world data representations for each type.
- Calculating Kurtosis
    - Describe the methods to calculate kurtosis manually.
    - Introduce the kurtosis formula and walk through practical examples.
- Interpreting Skewness and Kurtosis
    - Guide readers on how to interpret skewness and kurtosis values.
    - Explain how the results can be used to make data-driven decisions.
- Real-World Applications
    - Provide examples of real-world scenarios where understanding skewness and kurtosis is crucial.
    - Discuss how these statistics are used in fields like finance, healthcare, and social sciences.
- Skewness and Kurtosis in Python
    - Introduction to Python Libraries
        - Introduce popular Python libraries (e.g., NumPy, SciPy, and pandas) for calculating skewness and kurtosis.
        - Explain how to import these libraries and their key functions.
    - Practical Implementation
        - Walk readers through Python code snippets for calculating skewness and kurtosis.
        - Provide sample datasets and show how to apply these calculations.
- Visualization of Skewness and Kurtosis
    - Explain the importance of data visualization in data analysis.
    - Introduce data visualization libraries such as Matplotlib and Seaborn.
    - Creating Visualizations
        - Provide guidance on creating histograms, box plots, and probability plots.
        - Interpret the visualizations in the context of skewness and kurtosis.
- Conclusion


## Introduction

After collecting data and spending hours cleaning it, you can finally start exploring it! This stage, often called Exploratory Data Analysis (EDA), is perhaps the most important step in a data project. The insights gained from EDA affect everything downstream.

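Two statistics are especially useful at this stage for describing the shape of a distribution: skewness and kurtosis. This tutorial is aimed at beginners and walks through what each one measures, how to calculate it, and how to interpret the result.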

## What is skewness?

We see the normal distribution everywhere: in human body measurements, weights of objects, IQ scores, test results, or even at the gym:

![](images/gym.jpeg)

[Source](https://twitter.com/alvinfoo/status/1588664741211570176).

Besides being nature's favorite distribution, it is loved by almost all machine learning algorithms. Some need it to improve and stabilize their performance, while others outright refuse to work well with anything else (I am talking to you, linear models).

So, to satisfy the algorithms' need for normality, we need a way to measure how similar (or dissimilar) our own distributions are to the perfect bell-shaped curve.

CREATE AN IMAGE THAT SHOWS NORMAL, SKEWED AND KURTOSIS DISTRIBUTIONS

When the two tails of a distribution are unequal, giving it a lopsided, squished-to-one-side look, we say it is skewed. And you guessed it, we measure the extent of this asymmetry with __skewness__.

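Correctly categorizing and measuring skewness provides insight into how values are spread around the mean, and it influences the choice of statistical techniques and data transformations. For example, a highly skewed distribution might benefit from a transformation, such as taking the logarithm, that makes it more closely resemble a normal distribution. Note that plain rescaling (such as standardization) changes the units but not the shape, so it does not reduce skew. A less skewed distribution can in turn help model performance.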
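
As a rough illustration of how a transformation can reduce skew, the sketch below measures skewness before and after a log transform. The data is synthetic (log-normal) and the helper `sample_skewness` is a hypothetical name used only for this example; the log transform is one common choice, assumed here for illustration.

```python
import numpy as np


def sample_skewness(values):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5


# Synthetic right-skewed data (an assumption, standing in for something like prices)
rng = np.random.default_rng(seed=0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# The log of log-normal data is normally distributed, so the skew should shrink
transformed = np.log(raw)

skew_before = sample_skewness(raw)
skew_after = sample_skewness(transformed)

print(f"Skewness before the log transform: {skew_before:.2f}")
print(f"Skewness after the log transform:  {skew_after:.2f}")
```

On real data the improvement is usually less clean than on this synthetic sample, so it is worth re-measuring skewness after any transformation.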

## Types of skewness

There are three types of skewness: positive skewness, negative skewness, and zero skewness.

Let's start with the last one - a distribution with zero skewness:

- Symmetric distribution with values evenly centered around the mean.
- No skew, lean or tail to either side. 
- The mean, median and mode are all at the center point.

In practice, the mean, median and mode rarely coincide exactly. They may be slightly apart from each other, but the difference is too small to matter.

In a distribution with positive skewness (right skewed):
- The right tail of the distribution is longer or fatter than the left.
- The mean is typically greater than the median, and the mode is less than both mean and median.
- Lower values are clustered in the "hill" of the distribution while extreme values are in the long right tail. 
- It is also known as right-skewed distribution.

In a distribution with negative skewness (left skewed):
- The left tail of the distribution is longer or fatter than the right.
- The mean is typically less than the median, and the mode is greater than both mean and median.
- Higher values are clustered in the "hill" of the distribution while extreme values are in the long left tail.
- It is also known as left-skewed distribution.

To remember the difference between positive and negative skewness, think of it this way: to increase the mean of a distribution, you add values much higher than the mean. To lower the mean, you do the opposite and add values much lower than the mean. So, if most of the extreme values are higher than the mean, the skewness is positive because they pull the mean up. If most of the extreme values are lower than the mean, the skewness is negative because they pull the mean down.
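
The mnemonic can be checked with a few numbers. This is a minimal sketch on made-up values, using only Python's standard library; the variable names are hypothetical.

```python
from statistics import mean, median

# Made-up, roughly symmetric sample: mean and median coincide
values = [48, 49, 50, 50, 50, 51, 52]
print("symmetric:   ", mean(values), median(values))

# A few extreme high values pull the mean above the median (positive skew)
right_skewed = values + [90, 120]
print("right-skewed:", round(mean(right_skewed), 2), median(right_skewed))

# A few extreme low values pull the mean below the median (negative skew)
left_skewed = values + [-20, 10]
print("left-skewed: ", round(mean(left_skewed), 2), median(left_skewed))
```

In both modified samples the median stays at 50 while the mean moves toward the extreme values, which is exactly the direction of the skew.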

Here is a summary table on the types of skewness:

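| Type | Longer tail | Mean, median and mode | Where most values cluster |
|---|---|---|---|
| Zero skewness (symmetric) | Neither | All at the center point | Evenly around the mean |
| Positive skewness (right-skewed) | Right | Mean > median > mode | Lower values |
| Negative skewness (left-skewed) | Left | Mean < median < mode | Higher values |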

## Calculating skewness in Python

There are many ways to [calculate skewness](https://en.wikipedia.org/wiki/Skewness#Other_measures_of_skewness), but the simplest one is Pearson's second skewness coefficient, also known as median skewness.

$$
\text{Median skewness} = \frac{3 \times (\text{mean} - \text{median})}{\text{standard deviation}}
$$

Let's do it manually in Python:

In [22]:
import numpy as np
import pandas as pd
import seaborn as sns

# Example dataset: prices from seaborn's built-in diamonds data
diamonds = sns.load_dataset("diamonds")
diamond_prices = diamonds["price"]

mean_price = diamond_prices.mean()
median_price = diamond_prices.median()
std = diamond_prices.std()  # pandas uses the sample standard deviation (ddof=1)

# Pearson's second skewness coefficient
skewness = (3 * (mean_price - median_price)) / std

print(
    f"The Pearson's second skewness score of diamond prices distribution is {skewness:.5f}"
)

The Pearson's second skewness score of diamond prices distribution is 1.15189
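
To check the arithmetic by hand, here is the same coefficient on a tiny made-up dataset, using only the standard library so it runs without downloading anything. The function name `pearson_second_skewness` and the data are hypothetical, chosen for illustration.

```python
from statistics import mean, median, stdev


def pearson_second_skewness(values):
    """Pearson's second coefficient: 3 * (mean - median) / sample std."""
    return 3 * (mean(values) - median(values)) / stdev(values)


# Made-up sample with one large value on the right
data = [1, 2, 3, 4, 10]

# mean = 4, median = 3, sample std = sqrt(12.5), so the result is 3 / sqrt(12.5)
result = pearson_second_skewness(data)
print(f"{result:.3f}")  # prints 0.849
```

The single large value pulls the mean above the median, so the coefficient comes out positive.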


Another option, also rooted in the work of Karl Pearson, is the moment-based formula for approximating skewness. It is more reliable and is given as follows:

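$$
\text{skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3
$$

Here:
- $n$ represents the number of values in a distribution
- $x_i$ denotes each data point
- $\bar{x}$ is the mean of the distribution
- $s$ is its standard deviation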

Let's implement it in Python:

In [17]:
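def moment_based_skew(distribution):
    """Approximate skewness from the sum of cubed standardized deviations."""
    n = len(distribution)
    mean = np.mean(distribution)
    # Note: np.std defaults to the population standard deviation (ddof=0)
    std = np.std(distribution)

    # Divide the formula into two parts
    first_part = n / ((n - 1) * (n - 2))  # sample-size correction factor
    second_part = np.sum(((distribution - mean) / std) ** 3)  # sum of cubed z-scores

    skewness = first_part * second_part

    return skewness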

In [21]:
moment_based_skew(diamond_prices)

1.618440289857168

If, like me, you don't want to calculate skewness manually, you can use the built-in methods from pandas or SciPy:

In [23]:
# Pandas version
diamond_prices.skew()

1.618395283383529

In [24]:
# SciPy version
from scipy.stats import skew

skew(diamond_prices)

1.6183502776053016

While the formulas for approximating skewness return different scores, the differences are too small to be significant or to change the categorization of the skew (Pearson's second skewness coefficient, a rougher approximation, is the exception). For example, the three methods we've used rely on three different formulas under the hood, but their results agree to the third decimal place.
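
The small gaps between the three results come down to bias corrections. The sketch below is my own illustration on a made-up sample, assuming (as I understand their documentation) that `scipy.stats.skew` defaults to the biased estimator and that pandas' `skew` returns the bias-adjusted one. The function names are hypothetical, and the `ddof` argument shows what changes when the manual formula uses the population versus the sample standard deviation.

```python
import numpy as np


def biased_skew(x):
    """g1: third central moment divided by the second raised to 1.5."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5


def adjusted_skew(x):
    """G1: g1 multiplied by sqrt(n * (n - 1)) / (n - 2)."""
    n = len(x)
    return np.sqrt(n * (n - 1)) / (n - 2) * biased_skew(x)


def sum_form_skew(x, ddof):
    """n / ((n - 1)(n - 2)) * sum of cubed z-scores, for a chosen std ddof."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=ddof)
    return n / ((n - 1) * (n - 2)) * np.sum(z ** 3)


# Made-up right-skewed sample
sample = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]

print("biased g1:             ", biased_skew(sample))
print("adjusted G1:           ", adjusted_skew(sample))
print("sum form, sample std:  ", sum_form_skew(sample, ddof=1))
print("sum form, population:  ", sum_form_skew(sample, ddof=0))
```

With the sample standard deviation, the sum form is algebraically the same as the adjusted estimator. With the population standard deviation it comes out slightly larger. On a dataset with tens of thousands of rows, all of these corrections shrink to the fifth decimal place or so.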

Once you calculate skewness, you can categorize the extent of the skew:
- Between -0.5 and 0.5: low or approximately symmetric.
- Between -1 and -0.5, or between 0.5 and 1: moderately skewed.
- Below -1 or above 1: highly skewed.
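
These rules of thumb can be wrapped in a small helper. This is a minimal sketch: the function name `categorize_skewness` is hypothetical, and assigning the exact boundary values (0.5 and 1) to the higher category is my assumption, since the ranges above leave them open.

```python
def categorize_skewness(skewness):
    """Map a skewness score to a rule-of-thumb category."""
    magnitude = abs(skewness)  # the sign gives direction, not extent
    if magnitude < 0.5:
        return "approximately symmetric"
    if magnitude < 1:
        return "moderately skewed"
    return "highly skewed"


# The diamond prices score from earlier falls in the top category
print(categorize_skewness(1.618))  # highly skewed
print(categorize_skewness(-0.7))   # moderately skewed
print(categorize_skewness(0.2))    # approximately symmetric
```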

## What is kurtosis?

## Types of kurtosis

## Calculating kurtosis

## Kurtosis in Python

## Conclusion