I was today years old when I learnt that the UK has a loneliness minister:

<div style="width:480px"><iframe allow="fullscreen" frameBorder="0" height="361" src="https://giphy.com/embed/TKHYxGvybuPNGMLcdf/video" width="480"></iframe></div>

Stay curious 🔍

# <a id='toc1_'></a>[Key items for this class: <span style="color:red">numerical</span>, <span style="color:orange">categorical</span>, <span style="color:yellow">continuous</span>, <span style="color:green">discrete</span>, <span style="color:blue">percentile</span>, <span style="color:purple">outlier</span> + more terminology](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Key items for this class: <span style="color:red">numerical</span>, <span style="color:orange">categorical</span>, <span style="color:yellow">continuous</span>, <span style="color:green">discrete</span>, <span style="color:blue">percentile</span>, <span style="color:purple">outlier</span> + more terminology](#toc1_)    
- [What is statistics?](#toc2_)    
- [Why statistics?](#toc3_)    
- [Descriptive vs Inferential Statistics](#toc4_)    
- [What is data?](#toc5_)    
- [Types of data](#toc6_)    
- [Summary statistics](#toc7_)    
  - [Measures of central tendency](#toc7_1_)    
    - [Mean](#toc7_1_1_)    
    - [Median](#toc7_1_2_)    
    - [Mode](#toc7_1_3_)    
  - [Measures of dispersion](#toc7_2_)    
    - [Range](#toc7_2_1_)    
    - [Variance](#toc7_2_2_)    
    - [Standard deviation](#toc7_2_3_)    
    - [Quartiles](#toc7_2_4_)    
    - [Percentile](#toc7_2_5_)    
      - [Example: Which 🍕 place should I choose when I'm REALLY hungry?](#toc7_2_5_1_)    
    - [Coefficient of variation](#toc7_2_6_)    
    - [💡 Check for understanding](#toc7_2_7_)    
- [Outliers](#toc8_)    
  - [Measures of shape](#toc8_1_)    
    - [Skewness](#toc8_1_1_)    
    - [Kurtosis](#toc8_1_2_)    
- [Resources](#toc9_)    
- [Extra: Estimating the minimum proportion of observations for any distribution (Chebyshev's Theorem)](#toc10_)    
- [Extra: Sources of error in data analysis](#toc11_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[What is statistics?](#toc0_)

> **The field of statistics is the science of learning from data**. Statistical knowledge helps you use the proper methods to collect the data, employ the correct analyses, and effectively present the results. Statistics is a crucial process behind how we make discoveries in science, make decisions based on data, and make predictions. Statistics allows you to understand a subject much more deeply.

![image.png](attachment:image.png)

# <a id='toc3_'></a>[Why statistics?](#toc0_)

 > Statistics facilitates the creation of new knowledge. 
 
 Also, knowing statistics well helps you avoid drawing wrong conclusions. One great example of this is the Dunning-Kruger effect, which is arguably not a real phenomenon but [an example of incorrectly interpreting charts](https://economicsfromthetopdown.com/2022/04/08/the-dunning-kruger-effect-is-autocorrelation/#fn2):

 ![image.png](attachment:image.png)

> Statistics allow you to evaluate claims based on quantitative evidence and help you differentiate between reasonable and dubious conclusions. That aspect is particularly vital these days because data are so plentiful along with interpretations presented by people with unknown motivations.


![image.png](attachment:image.png)

# <a id='toc4_'></a>[Descriptive vs Inferential Statistics](#toc0_)

![image.png](attachment:image.png)

|                            | Descriptive Statistics                                          | Inferential Statistics |
| -------------------------- | --------------------------------------------------------------- | ---------------------- |
| Purpose                    | Describe and summarize data                                                    | Make inferences and draw conclusions about a population based on sample data |
| Data Analysis              | Analyzes and interprets the characteristics of a dataset        | Uses sample data to make generalizations or predictions about a larger population |
| Population vs Sample       | Focuses on the entire population or dataset                     | Focuses on a subset of the population (sample) to draw conclusions about the entire population |
| Measurements               | Provides measures of central tendency and dispersion            | Estimates parameters, tests hypotheses, and determines the level of confidence or significance in the results |
| Examples                   | Mean, median, mode, standard deviation, range, frequency tables | Hypothesis testing, confidence intervals, regression analysis, ANOVA (analysis of variance), chi-square tests, t-tests, etc. |
| Goal                       | Summarize, organize, and present data                           | Generalize findings to a larger population, make predictions, test hypotheses, evaluate relationships, and support decision-making |
| Population Parameters      | Not typically estimated                                         | Estimated using sample statistics (e.g., sample mean as an estimate of population mean) |
| Sample Representativeness  | Not required                                                    | Crucial; the sample should be representative of the population to ensure accurate inferences |

Today we will focus on **descriptive statistics**.

# <a id='toc5_'></a>[What is data?](#toc0_)

> Data are evidence you can use to answer questions. For example:   

**Descriptive statistics questions**
- How much did sales improve for a supermarket in the past 2 years?
- What score do the highest performing students get in the SAT exams?
- What do people Google search the most around Christmas?

**Inferential statistics questions**
- Do flu shots prevent the flu?
- Does exercise improve your health?
- Does a gasoline additive improve gas mileage?


# <a id='toc6_'></a>[Types of data](#toc0_)

![image.png](attachment:image.png)

<span style="color:orange">Is time a numerical discrete or continuous variable?</span>

# <a id='toc7_'></a>[Summary statistics](#toc0_)

> A summary statistic is a number derived from a dataset that summarizes a property of the entire dataset. There are four categories of summary statistics:   
  
> - Measures of central tendency or location, such as the mean.   
> - Measures of spread or dispersion, such as the standard deviation.  
> - Measures of the shape of a distribution, such as skewness or thickness of the tails.    
> - Measures of correlation or dependency, such as Pearson’s correlation coefficient.    

## <a id='toc7_1_'></a>[Measures of central tendency](#toc0_)

> + **Arithmetic mean (average)** of a variable is found by adding all numbers in the variable and then dividing by the number of values.
> + **Median** is the middle value when a variable is ordered from least to greatest. (With an even number of values, it is the arithmetic mean of the two middle values.) It is also the second quartile (Q2), i.e. the 50th percentile.
> + **Mode** is the value/category that occurs most often in a variable, i.e. the most frequent one.

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

# Show all columns in pandas
pd.set_option('display.max_columns', None)

# Remove warnings (not necessary)
import warnings
warnings.filterwarnings('ignore')

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

In [None]:
heights = [192, 184, 184, 184, 182, 179, 175, 174, 174, 172, 172, 170, 165, 162, 160, 158, 157, 154, 153]
shoe_sizes = [46, 46, 44.5, 44.5, 44, 42, 42, 42, 39.5, 39, 39, 39, 38, 38, 38, 37, 36.5, 36, 34.5]

Before we get into the mean... behold, the [normal distribution](https://www.youtube.com/watch?v=rzFX5NWojp0)! 

![image.png](attachment:image.png)

You have probably seen this bell curve before. On the x-axis we have the values our data can take, and on the y-axis we have the number (or in this case, percentage) of records (data points) corresponding to a value on the x-axis. We will learn more about this plot in the EDA class, but for now I want you to remember that this plot is looking at numerical continuous data ONLY, i.e. **only numerical continuous values can have a normal distribution**.

<span style="color:orange">If we look at this plot, where would we find the mean / median?</span>

### <a id='toc7_1_1_'></a>[Mean](#toc0_)

**Numerical data only**

> + **Arithmetic mean (average)** of a variable is found by adding all numbers in the variable and then dividing by the number of values.
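
As a minimal sketch of this definition, the mean can be computed by hand and checked against NumPy. The `heights` list is the one defined earlier in this notebook, repeated here only so the snippet runs on its own; the variable names `manual_mean` and `numpy_mean` are just illustrative.

```python
import numpy as np

# Same values as the `heights` list defined earlier in this notebook
heights = [192, 184, 184, 184, 182, 179, 175, 174, 174, 172,
           172, 170, 165, 162, 160, 158, 157, 154, 153]

# "Classical" way: add all the values, then divide by how many there are
manual_mean = sum(heights) / len(heights)

# NumPy applies the same formula
numpy_mean = np.mean(heights)

print(round(manual_mean, 2), round(numpy_mean, 2))
```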

In [None]:
# Get the heights mean the classical way

In [None]:
# Now with numpy

In [None]:
# Shoe sizes mean

### <a id='toc7_1_2_'></a>[Median](#toc0_)

**Numerical data only**

> + **Median** is the middle value when a variable is ordered from least to greatest. (With an even number of values, it is the arithmetic mean of the two middle values.) It is also the second quartile (Q2), i.e. the 50th percentile.
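
To make the odd/even distinction concrete, here is a small sketch. The `shoe_sizes` list is the one from this notebook (19 values, so there is a single middle value); `even_values` is a made-up list added only to show the even case.

```python
import numpy as np

# Same values as the `shoe_sizes` list defined earlier in this notebook
shoe_sizes = [46, 46, 44.5, 44.5, 44, 42, 42, 42, 39.5, 39,
              39, 39, 38, 38, 38, 37, 36.5, 36, 34.5]

# Odd number of values: the median is the single middle value after sorting
median_odd = np.median(shoe_sizes)
middle_value = sorted(shoe_sizes)[len(shoe_sizes) // 2]

# Even number of values (hypothetical data): the median is the mean of the two middle values
even_values = [1, 3, 5, 7]
median_even = np.median(even_values)

print(median_odd, middle_value, median_even)
```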


In [None]:
# Heights median

In [None]:
# Shoe sizes median

### <a id='toc7_1_3_'></a>[Mode](#toc0_)

**Numerical and categorical data** - but more useful for categorical data.

> + **Mode** is the value/category that occurs most often in a variable. The most frequent.
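
A short sketch of finding the mode with pandas, using a made-up `sectors` Series rather than the real Fortune 1000 file, so the category names below are purely illustrative:

```python
import pandas as pd

# Hypothetical categorical data, not taken from the Fortune 1000 file
sectors = pd.Series(["Technology", "Retailing", "Technology",
                     "Energy", "Retailing", "Technology"])

# .mode() returns a Series because several values can tie for "most frequent"
most_common = sectors.mode()[0]

# value_counts() shows the full frequency table behind that answer
counts = sectors.value_counts()

print(most_common)
print(counts)
```

Taking `[0]` keeps only the first mode, so when ties matter it may be safer to look at the whole result of `.mode()`.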

In [None]:
fortune = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/Fortune_1000.csv")
# isi's recommendation
fortune.sample(10)

In [None]:
# Get the mode for one categorical column

## <a id='toc7_2_'></a>[Measures of dispersion](#toc0_)

All of these measures apply to **numerical data only**!  
> + **Range:** defines the difference between the highest and lowest values.
> + **Variance**: the average of the squared differences between each value and the mean; it measures how far the values are spread out from the mean.
> + **Standard deviation:** The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.
> + **Quartiles:** A quartile is a statistical term that describes a division of observations into four defined intervals based on the values of the data and how they compare to the entire set of observations.
> + **Percentiles:** the same idea as quartiles, but dividing the observations into 100 groups.
> + **Coefficient of variation (CV):** the ratio of the standard deviation to the mean (so std/mean). It shows the extent of variability in relation to the mean of the population.

### <a id='toc7_2_1_'></a>[Range](#toc0_)

> + **Range:** defines the difference between the highest and lowest values.
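
A minimal sketch of the range for the `heights` list used in this notebook (repeated here so the snippet is self-contained). `np.ptp` ("peak to peak") is NumPy's built-in for the same subtraction.

```python
import numpy as np

# Same values as the `heights` list defined earlier in this notebook
heights = [192, 184, 184, 184, 182, 179, 175, 174, 174, 172,
           172, 170, 165, 162, 160, 158, 157, 154, 153]

# Range by hand: highest value minus lowest value
height_range = max(heights) - min(heights)

# NumPy shortcut for the same calculation
numpy_range = np.ptp(heights)

print(height_range, numpy_range)
```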


In [None]:
# Min - max difference

### <a id='toc7_2_2_'></a>[Variance](#toc0_)

> + **Variance**: the average of the squared differences between each value and the mean; it measures how far the values are spread out from the mean.
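
One detail worth a sketch: NumPy and pandas use different defaults. `np.var` divides by `n` (population variance, `ddof=0`), while `Series.var()` divides by `n - 1` (sample variance, `ddof=1`). The snippet below only illustrates that difference on the `heights` list from this notebook; which one to use depends on whether the data is treated as a whole population or as a sample.

```python
import numpy as np
import pandas as pd

# Same values as the `heights` list defined earlier in this notebook
heights = [192, 184, 184, 184, 182, 179, 175, 174, 174, 172,
           172, 170, 165, 162, 160, 158, 157, 154, 153]

# Population variance: divide the sum of squared differences by n (NumPy default)
population_var = np.var(heights)

# Sample variance: divide by n - 1 instead
sample_var = np.var(heights, ddof=1)

# pandas uses the sample version (ddof=1) by default
pandas_var = pd.Series(heights).var()

print(round(population_var, 2), round(sample_var, 2), round(pandas_var, 2))
```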


In [None]:
# Manual (population) variance - mean of the squared differences from the mean
mean_height = np.mean(heights)
var = sum((height - mean_height) ** 2 for height in heights) / len(heights)
var

In [None]:
# Numpy variance

### <a id='toc7_2_3_'></a>[Standard deviation](#toc0_)

> + **Standard deviation:** The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.
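
A short sketch of that relationship on the `heights` list from this notebook: taking the square root of the (population) variance should match `np.std`, which also uses `ddof=0` by default.

```python
import numpy as np

# Same values as the `heights` list defined earlier in this notebook
heights = [192, 184, 184, 184, 182, 179, 175, 174, 174, 172,
           172, 170, 165, 162, 160, 158, 157, 154, 153]

# Manual: square root of the variance
std_from_var = np.sqrt(np.var(heights))

# NumPy shortcut (population standard deviation by default)
numpy_std = np.std(heights)

print(round(std_from_var, 2), round(numpy_std, 2))
```

Because the result is in the same unit as the data (centimetres here), it is usually easier to interpret than the variance.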

In [None]:
# Manual STD

In [None]:
# Numpy std

When looking at a **normal distribution** (which is symmetrical), we can say that about 68.2% of our observations lie within one standard deviation of the mean.

![image.png](https://imgs.search.brave.com/890nJAyetvjXKqV3Dcfu4-Pj-BUk56L5MSP-zc5wArg/rs:fit:860:0:0/g:ce/aHR0cHM6Ly90NC5m/dGNkbi5uZXQvanBn/LzA1LzY4Lzk1LzU5/LzM2MF9GXzU2ODk1/NTk2MV9Pc0dkYmpo/MXFQa1N5czlsVU1z/bTVQa3VzTTdGR1B2/SC5qcGc)

*Notes:* 
- The sigma (σ) symbol represents the standard deviation in mathematical notation.
- The mu (μ) symbol represents the mean in mathematical notation.
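
The 68% figure can be checked with simulated data. This is only a sketch: the mean of 170, the standard deviation of 10, the sample size and the random seed are all arbitrary choices made for illustration, not values from the class data.

```python
import numpy as np

# Simulated, normally distributed "heights" (all parameters are arbitrary assumptions)
rng = np.random.default_rng(42)
samples = rng.normal(loc=170, scale=10, size=100_000)

mu = samples.mean()
sigma = samples.std()

# Share of observations that lie within one standard deviation of the mean
within_one_sigma = np.mean(np.abs(samples - mu) <= sigma)

print(f"{within_one_sigma:.1%} of the simulated values are within 1 standard deviation")
```

The result should land close to 68%, with small differences from run to run if the seed is changed.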

### <a id='toc7_2_4_'></a>[Quartiles](#toc0_)

> + **Quartiles:** A quartile (from quarter!) is a statistical term that describes a division of observations into four defined intervals based on the values of the data and how they compare to the entire set of observations.
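
As a sketch, the three cut points that split the data into four groups can be read with `np.percentile`. The example uses the `heights` list from this notebook and NumPy's default (linear interpolation) method; other methods can give slightly different values on small datasets.

```python
import numpy as np

# Same values as the `heights` list defined earlier in this notebook
heights = [192, 184, 184, 184, 182, 179, 175, 174, 174, 172,
           172, 170, 165, 162, 160, 158, 157, 154, 153]

# Q1, Q2 and Q3 are the 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(heights, [25, 50, 75])

print(q1, q2, q3)

# Q2 is just another name for the median
print(q2 == np.median(heights))
```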


![quartiles](https://www.onlinemathlearning.com/image-files/median-quartiles.png)

In [None]:
# 1st quartile

In [None]:
# 2nd quartile

In [None]:
# 3rd quartile

In [None]:
# 4th quartile - what is it equivalent to?

<span style="color:orange">How do we summarize this in words?</span>

To look at quartiles we usually use **box plots** (top graph in the image below), which we'll talk about more in the next lesson.

![image](https://imgs.search.brave.com/YwALrTkM3oqQ9ifCcdNPsUTCyMc4gFUD97oOEPQNpJM/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9jZG4x/LmJ5anVzLmNvbS93/cC1jb250ZW50L3Vw/bG9hZHMvMjAyMC8x/MC9Cb3gtUGxvdC1h/bmQtV2hpc2tlci1Q/bG90LTIucG5n)

*Notes:* 
- The sigma (σ) symbol represents the standard deviation in mathematical notation.
- The "probability density" curve shows how the observations are distributed: the area under the curve between two values is the share of your data that sits in that range. For example, for a normal distribution, 50% of the data lies within about 0.67 standard deviations of the mean.

### <a id='toc7_2_5_'></a>[Percentile](#toc0_)

> + **Percentiles:** the same idea as quartiles, but dividing the observations into 100 groups.
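
A minimal sketch of reading percentiles with NumPy, using the `heights` and `shoe_sizes` lists from this notebook. The 95th percentile is the value that separates the top 5% from the rest, and the "middle 50%" is the interval between the 25th and 75th percentiles. The default interpolation method is assumed here.

```python
import numpy as np

# Same values as the lists defined earlier in this notebook
heights = [192, 184, 184, 184, 182, 179, 175, 174, 174, 172,
           172, 170, 165, 162, 160, 158, 157, 154, 153]
shoe_sizes = [46, 46, 44.5, 44.5, 44, 42, 42, 42, 39.5, 39,
              39, 39, 38, 38, 38, 37, 36.5, 36, 34.5]

# Heights at or above this value are in the top 5%
p95_height = np.percentile(heights, 95)

# The middle 50% of shoe sizes lies between the 25th and 75th percentiles
middle_low, middle_high = np.percentile(shoe_sizes, [25, 75])

print(p95_height)
print(middle_low, middle_high)
```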

<span style="color:orange">What height do people in the top 5% of this class have?</span>

In [None]:
# 

<span style="color:orange">What shoe size do people in the middle 50% have?</span>

In [None]:
# 

**Advanced**: The [quantile calculation algorithm](https://www.fon.hum.uva.nl/praat/manual/quantile_algorithm.html).

> ### Why Understanding Variability is Important   


> Analysts frequently use the mean to summarize the center of a population or a process. While the mean is relevant, people often react to variability even more. When a distribution has lower variability, the values in a dataset are more consistent. However, when the variability is higher, the data points are more dissimilar and extreme values become more likely. Consequently, **understanding variability helps you grasp the likelihood of unusual events**.

![image (9).png](https://github.com/sabinagio/data-analytics/blob/main/images/distribution_variability.png?raw=True)

#### <a id='toc7_2_5_1_'></a>[Example: Which 🍕 place should I choose when I'm REALLY hungry?](#toc0_)

> Let’s take a look at two hypothetical pizza restaurants. They both advertise a mean delivery time of 20 minutes. When we’re ravenous, they sound equally good! However, this equivalence can be deceptive! To determine the restaurant that you should order from when you’re hungry, we need to analyze their variability.

> The graphs below display the distribution of delivery times and provide the answer. 
 
![image (10).png](https://github.com/sabinagio/data-analytics/blob/main/images/pizza_high_variability.png?raw=True)![image (11).png](https://github.com/sabinagio/data-analytics/blob/main/images/pizza_low_variability.png?raw=True)

> Nearly 16% of the deliveries for the high variability restaurant exceed 30 minutes. On the other hand, only 2% of the deliveries take too long with the low variability restaurant. They both have an average delivery time of 20 minutes, but I know where I’d place my order when I’m hungry!
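
The percentages above can be roughly reproduced with a normal model. This is a sketch under assumptions: delivery times are treated as normally distributed, and the standard deviations of 10 and 5 minutes are guesses chosen because they match the quoted 16% and 2%. They are not stated in the source.

```python
from scipy.stats import norm

mean_delivery = 20  # both restaurants advertise a 20-minute mean

# Assumed standard deviations (in minutes), chosen to match the quoted percentages
high_variability_std = 10
low_variability_std = 5

# norm.sf is the survival function: P(delivery time > 30 minutes)
late_high = norm.sf(30, loc=mean_delivery, scale=high_variability_std)
late_low = norm.sf(30, loc=mean_delivery, scale=low_variability_std)

print(f"High variability: {late_high:.1%} of deliveries take over 30 minutes")
print(f"Low variability:  {late_low:.1%} of deliveries take over 30 minutes")
```

With the same mean, halving the standard deviation moves the 30-minute mark from one to two standard deviations above the mean, which is why late deliveries become so much rarer.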

### <a id='toc7_2_6_'></a>[Coefficient of variation](#toc0_)

> - **Coefficient of variation (CV)** - the ratio of the standard deviation to the mean (so std/mean). It shows the extent of variability in relation to the mean of the population
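
A small sketch of the calculation on the two lists from this notebook. The helper name `coefficient_of_variation` is made up for this example, and it uses the population standard deviation (NumPy's default).

```python
import numpy as np

# Same values as the lists defined earlier in this notebook
heights = [192, 184, 184, 184, 182, 179, 175, 174, 174, 172,
           172, 170, 165, 162, 160, 158, 157, 154, 153]
shoe_sizes = [46, 46, 44.5, 44.5, 44, 42, 42, 42, 39.5, 39,
              39, 39, 38, 38, 38, 37, 36.5, 36, 34.5]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (a unitless ratio)."""
    return np.std(values) / np.mean(values)


cv_heights = coefficient_of_variation(heights)
cv_shoes = coefficient_of_variation(shoe_sizes)

print(f"CV heights:    {cv_heights:.3f}")
print(f"CV shoe sizes: {cv_shoes:.3f}")
```

Because the units cancel out, the two ratios can be compared directly even though heights are in centimetres and shoe sizes are not.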

The reason this is awesome is that it allows us to compare variability across 2 completely different distributions!!

In [None]:
# CV for height and shoe sizes

In [None]:
# Check the coefficients

<span style="color:orange">Which one is more dispersed: shoe size or height?</span>

**Caveat / Problem:** This doesn't work for all numerical continuous variables! It only works for variables that have what is called a <span style="color:red">*true zero*</span>, aka when the number is 0, it means there is nothing. `Height`, `weight`, `shoe size` all have a true zero but `degrees in Celsius` do not (because 0 degrees Celsius is not equivalent to no temperature!).

### <a id='toc7_2_7_'></a>[💡 Check for understanding](#toc0_)

Let's work with the Fortune 1000 dataset we had before:

In [None]:
fortune = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/Fortune_1000.csv")

**Questions**
- What's the most common sector for companies in the Fortune 1000? What about location?
- What's the average revenue made by these companies? What about profit?
- What's the median revenue made by these companies? What about profit?
- How much do companies in the top 1% make?
- What sector is most common for the companies in the top 10%? What about the bottom 10%?
- How dispersed is the revenue made by these companies compared to their profit? What about market cap?
- What did the previous questions tell you about Fortune 1000 companies? Any surprises?

*Note:* Remember you might need to convert the `Market Cap` data type to numerical using `pd.to_numeric()`! 
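
A sketch of what that conversion can look like. The `market_cap_raw` Series below is made up (including the `"-"` placeholder for a missing value), so the real column may contain different non-numeric entries.

```python
import pandas as pd

# Hypothetical values: numbers stored as text, with "-" standing in for missing data
market_cap_raw = pd.Series(["1200.5", "-", "87.3", "450"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising an error
market_cap = pd.to_numeric(market_cap_raw, errors="coerce")

print(market_cap.dtype)
print(market_cap.isna().sum(), "value(s) could not be converted")
```

Without `errors="coerce"`, `pd.to_numeric` raises a `ValueError` on the first entry it cannot parse, which is often a useful hint about what the problematic values look like.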

# <a id='toc8_'></a>[Outliers](#toc0_)

![image.png](https://i0.wp.com/37.media.tumblr.com/tumblr_m0w7ccQomh1rpjhxuo1_400.jpg?zoom=2)

> An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal.

![image (9).png](https://github.com/sabinagio/data-analytics/blob/main/images/outlier_boxplot.png?raw=True)
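
One common convention for deciding what counts as "abnormal" is the 1.5 × IQR rule used by box plots: anything more than 1.5 interquartile ranges below Q1 or above Q3 is flagged. The sketch below applies it to made-up numbers, not to the Fortune 1000 profits, and the 1.5 multiplier is a convention rather than a law.

```python
import pandas as pd

# Hypothetical profit figures (not from the Fortune 1000 file) with one extreme value
profits = pd.Series([12, 15, 14, 10, 13, 16, 11, 14, 95])

# 1. Get the IQR
q1 = profits.quantile(0.25)
q3 = profits.quantile(0.75)
iqr = q3 - q1

# 2. Build the fences and the pandas condition
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (profits < lower_fence) | (profits > upper_fence)

# 3. Apply the condition to the data
outliers = profits[is_outlier]

print(lower_fence, upper_fence)
print(outliers.tolist())
```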

In [None]:
# Let's get the outliers for the profit data

# 1. Get the IQR

# 2. Create pandas condition

# 3. Apply to profit column

## <a id='toc8_1_'></a>[Measures of shape](#toc0_)

> + **Skewness** is a measurement of the distortion of symmetrical distribution or asymmetry in a data set. 
> + **Kurtosis** is a measure of the tailedness of a distribution. Tailedness is how often outliers occur.

### <a id='toc8_1_1_'></a>[Skewness](#toc0_)

> + **Skewness** is a measurement of the distortion of symmetrical distribution or asymmetry in a data set. 

![image (12).png](https://github.com/sabinagio/data-analytics/blob/main/images/skewness.png?raw=True)

**Normal distribution**:  
- data points are equally distributed around the mean
- `Mode = Median = Mean`

**Positively skewed distribution**:  
- most of the data points are below the mean (i.e. the mean is more positive)
- `Mode < Median < Mean`

**Negatively skewed distribution**:
- most of the data points are above the mean (i.e. the mean is more negative)
- `Mode > Median > Mean`
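
These relationships can be checked on simulated data with `scipy.stats.skew`. The sketch below uses an exponential distribution as an arbitrary example of a positively skewed variable and mirrors it to get a negatively skewed one; the seed and sample size are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Exponential data has a long right tail, so it is positively skewed
right_skewed = rng.exponential(scale=1.0, size=10_000)

# Mirroring the data flips the tail to the left (negative skew)
left_skewed = -right_skewed

for name, data in [("positive skew", right_skewed), ("negative skew", left_skewed)]:
    print(f"{name}: skew={skew(data):.2f}, "
          f"median={np.median(data):.2f}, mean={np.mean(data):.2f}")
```

In the positively skewed sample the mean sits above the median, and in the mirrored sample the order is reversed.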

In [None]:
# Let's have a look at our fortune dataset

In [None]:
# Let's check profits - what is the skew?

In [None]:
# What about revenue?

In [None]:
# What will market cap look like?

Uh-oh! Why do I have an issue with Market Cap?

In [None]:
# Check dtype & number of nulls

In [None]:
# Convert to numerical without errors argument

In [None]:
# Convert to numerical with errors argument & check dtype

In [None]:
# Check number of nulls

In [None]:
# Plot Market Cap

We can also get an idea of skewness from box plots:  

![image (13).png](https://github.com/sabinagio/data-analytics/blob/main/images/quartiles_distributions_box_plots.png?raw=True)

### <a id='toc8_1_2_'></a>[Kurtosis](#toc0_)

> + Kurtosis is a measure of the tailedness of a distribution. Tailedness is how often outliers occur.

![image.png](https://imgs.search.brave.com/3klMtOz-A_6d6Nbe31s558TJL5fnyOtscuMhm6f2YPA/rs:fit:860:0:0/g:ce/aHR0cHM6Ly93d3cu/c2ltcGx5cHN5Y2hv/bG9neS5vcmcvd3At/Y29udGVudC91cGxv/YWRzL0t1cnRvc2lz/LmdpZg.gif)

*Notes:*
- Meso = middle in Greek. The mesokurtic curve corresponds to the typical symmetrical normal distribution.
- Lepto = thin in Greek. Lepto also means minute in Greek, so you can think it looks thin like a minute.
- Platy = wide in Greek. It also sounds like platypus, so think of something flat like the platypus tail.

![values](https://archive.ph/Y4za4/20e270146abbc12d53842787265b02f0a6a2cc08.webp)

In [None]:
from scipy.stats import kurtosis
print(kurtosis(fortune.dropna(subset='profit').profit, fisher=False))

*Note:*  
> Setting `fisher=False` in the above code uses Pearson's definition of kurtosis, under which a normal distribution has a kurtosis of 3. With the default `fisher=True`, 3 is subtracted so that a normal distribution has a kurtosis of 0.
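
The two definitions can be compared on simulated normal data. This is only a sketch: the seed and sample size are arbitrary, and the names `excess_kurtosis` and `pearson_kurtosis` are chosen just for this example.

```python
import numpy as np
from scipy.stats import kurtosis

# Simulated standard normal data (arbitrary seed and size)
rng = np.random.default_rng(1)
normal_data = rng.normal(size=100_000)

# Fisher's definition (the default): a normal distribution scores about 0
excess_kurtosis = kurtosis(normal_data)

# Pearson's definition: a normal distribution scores about 3
pearson_kurtosis = kurtosis(normal_data, fisher=False)

print(round(excess_kurtosis, 2), round(pearson_kurtosis, 2))
```

The two numbers always differ by exactly 3, so it matters mostly that you know which definition a given value was computed with.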

<span style="color:orange">What type of distribution does the profit data have? How might we ensure that the kurtosis is not so high?</span>

In [None]:
# Try to get kurtosis without dropping the null values
print(kurtosis(fortune.profit, fisher=False))

# <a id='toc9_'></a>[Resources](#toc0_)

- [StatQuest - Statistics Fundamentals](https://www.youtube.com/watch?v=vikkiwjQqfU&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9) - for now, the first 3 videos.
- [Statistics by Jim - Intro to Statistics](https://www.amazon.com/dp/1735431109) - pretty good intro where many examples in this lesson were taken from, but you have to bear with the ugly charts 😃
- [Statistics by Jim - the blog](https://statisticsbyjim.com/graphs/) - only if you're super keen, lots of interesting topics, many more advanced!
- [Our World in Data](https://ourworldindata.org/) - specific case studies centered on social issues such as pandemics, mental health, economic inequalities, etc.

# <a id='toc10_'></a>[Extra: Estimating the minimum proportion of observations for any distribution (Chebyshev's Theorem)](#toc0_)

> Chebyshev’s Theorem helps you determine where most of your data fall within a distribution of values. This theorem provides helpful results when you have only the mean and standard deviation. You do not need to know the distribution your data follow.

![image (9) g.png](https://github.com/sabinagio/data-analytics/blob/main/images/chebyshev_distribution.png?raw=true)

Minimum proportion of observations that are within k standard deviations from the mean:

![image (9) d.png](https://github.com/sabinagio/data-analytics/blob/main/images/chebyshev_table.png?raw=true)
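
The values in such a table come from the bound in Chebyshev's theorem: for any `k > 1`, at least `1 - 1/k²` of the observations lie within `k` standard deviations of the mean. A minimal sketch, with a helper name (`chebyshev_min_proportion`) invented for this example:

```python
def chebyshev_min_proportion(k: float) -> float:
    """Minimum share of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2


for k in (2, 3, 4):
    print(f"k={k}: at least {chebyshev_min_proportion(k):.1%} of observations")
```

Note that this is a lower bound that holds for any distribution, so real data will often have a much larger share inside the interval.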

In [None]:
from scipy import stats
import numpy as np
import plotly.express as px

In [None]:
px.histogram(fortune.profit)  # We know it's positively skewed

> Will this data obey the Chebyshev criteria for k=2 (2 standard deviations)?

In [None]:
skewed_data = fortune.dropna(subset='profit').profit
mean = skewed_data.mean()
std = skewed_data.std()

> Chebyshev's theorem guarantees that 75% or more of the data lies within 2 standard deviations of the mean:

In [None]:
points_within_2_std = skewed_data[(skewed_data < (mean + 2 * std)) & (skewed_data > (mean - 2 * std))]
proportion = round(len(points_within_2_std) * 100 / len(skewed_data), 1)
print(f"{proportion}% of the profit data falls within 2 standard deviations from the mean")

Since 97.4% > 75%, the profit data satisfies the bound from Chebyshev's theorem, as any distribution must.

# <a id='toc11_'></a>[Extra: Sources of error in data analysis](#toc0_)

For a more detailed overview of where statistics can go wrong, [here is an article](https://archive.ph/20230428034256/https://towardsdatascience.com/misleading-with-data-statistics-c6d506bdb9cf) for you. Some of the concepts here may be unfamiliar (A/B testing, p-values) but worry not, by the mid-bootcamp project we will learn about all of these techniques!

![image.png](https://archive.ph/C26ns/e5f3e70310de0774fa88fd91e02f6fafbfe99b34.webp)