I was today years old when I learnt that the UK has a loneliness minister:

Stay curious 🔍

# <a id='toc1_'></a>[Key items for this class: <span style="color:red">apply</span>, <span style="color:orange">map</span>, <span style="color:yellow">applymap</span>, <span style="color:green">str</span>, <span style="color:blue">&</span>, <span style="color:purple">|</span>.](#toc0_)

# What is statistics?

> **The field of statistics is the science of learning from data**. Statistical knowledge helps you use the proper methods to collect the data, employ the correct analyses, and effectively present the results. Statistics is a crucial process behind how we make discoveries in science, make decisions based on data, and make predictions. Statistics allows you to understand a subject much more deeply.

![image.png](attachment:image.png)

# Why statistics?

 > Statistics facilitates the creation of new knowledge. 
 
 Also, knowing statistics well helps you avoid drawing wrong conclusions. One great example of this is the Dunning-Kruger effect, which is arguably not a real phenomenon but [an example of incorrectly interpreting charts](https://economicsfromthetopdown.com/2022/04/08/the-dunning-kruger-effect-is-autocorrelation/#fn2):

 ![image.png](attachment:image.png)

> Statistics allow you to evaluate claims based on quantitative evidence and help you differentiate between reasonable and dubious conclusions. That aspect is particularly vital these days because data are so plentiful along with interpretations presented by people with unknown motivations.


![image.png](attachment:image.png)

# Why is it cool?

You can use it everywhere:

# Types of statistics

# What is data?

> Data are evidence you can use to answer questions. For example:
>
> - Do flu shots prevent the flu?
> - Does exercise improve your health?
> - Does a gasoline additive improve gas mileage?


# Types of data

![image.png](attachment:image.png)

# Summary statistics

> A summary statistic is a number derived from a dataset that summarizes a property of the entire dataset. There are four categories of summary statistics:   
  
> - Measures of central tendency or location, such as the mean.   
> - Measures of spread or dispersion, such as the standard deviation.   
> - Measures of correlation or dependency, such as Pearson’s correlation coefficient.    
> - Measures of the shape of a distribution, such as skewness or thickness of the tails.   
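
As a rough illustration of the four categories above, the sketch below computes one statistic of each kind with NumPy and SciPy. The array `y` is a hypothetical second variable made up for this example so that a correlation can be computed; it is not part of the class data.

```python
import numpy as np
from scipy import stats

# Small numerical sample (the same values are used as sample1 later in the class)
x = np.array([1, 1, 2, 2, 3, 4, 5, 7, 8, 9])

# Hypothetical second variable, perfectly linear in x (assumption for illustration)
y = 2 * x + 1

# Central tendency: the arithmetic mean
center = np.mean(x)

# Spread: the sample standard deviation (ddof=1 divides by n - 1)
spread = np.std(x, ddof=1)

# Correlation: Pearson's correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Shape: skewness (positive means a longer tail to the right)
skewness = stats.skew(x)

print(f"mean={center:.2f}, std={spread:.2f}, r={r:.2f}, skew={skewness:.2f}")
```

Because `y` is an exact linear function of `x`, the correlation comes out at 1; real data would rarely be that tidy.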

## Measures of central tendency

> + **Arithmetic mean (average)** of a variable is found by adding all numbers in the variable and then dividing by the number of values.
> + **Median** is the middle value when a variable is ordered from least to greatest. If the number of values is even, it is the arithmetic mean of the two middle values. It is also the second quartile (Q2), or 50th percentile.
> + **Mode** is the value or category that occurs most often in a variable.

In [13]:
import pandas as pd
import numpy as np
from scipy import stats

# Show all columns in pandas
pd.set_option('display.max_columns', None)

# Remove warnings (not necessary)
import warnings
warnings.filterwarnings('ignore')

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

In [None]:
sample1 = [1, 1, 2, 2, 3, 4, 5, 7, 8, 9]
salaries = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/data_science_salaries.csv")
salaries.head()

### Mean

For numerical data

In [9]:
my_mean = sum(sample1)/len(sample1)
my_mean

4.2

In [10]:
import numpy as np
np.mean(sample1)

4.2

### Median

For numerical data

In [12]:
sample1 = [1, 1, 2, 2, 3, 4, 5, 7, 8, 9]

In [11]:
np.median(sample1)

3.5

### Mode

For categorical data

In [None]:
salaries = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/data_science_salaries.csv")
salaries.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,Senior,Full Time,Principal Data Scientist,80000,EUR,85847,ES,100,ES,Large
1,2023,Mid,Contract,ML Engineer,30000,USD,30000,US,100,US,Small
2,2023,Mid,Contract,ML Engineer,25500,USD,25500,US,100,US,Small
3,2023,Senior,Full Time,Data Scientist,175000,USD,175000,CA,100,CA,Medium
4,2023,Senior,Full Time,Data Scientist,120000,USD,120000,CA,100,CA,Medium


In [None]:
# Isi's recommendation: look at a random sample of rows, not only the first ones
salaries.sample(10)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
1061,2023,Senior,Full Time,Data Manager,120000,USD,120000,US,0,US,Medium
613,2023,Senior,Full Time,Data Scientist,172000,USD,172000,US,0,US,Medium
3467,2021,Entry,Full Time,Data Scientist,80000,USD,80000,US,100,US,Medium
2030,2022,Mid,Full Time,Data Analyst,100000,USD,100000,US,100,US,Medium
123,2023,Senior,Full Time,Analytics Engineer,289800,USD,289800,US,0,US,Medium
2779,2022,Mid,Full Time,Data Engineer,55000,GBP,67723,GB,100,GB,Medium
3734,2021,Mid,Full Time,Lead Data Analyst,1450000,INR,19609,IN,100,IN,Large
379,2023,Senior,Full Time,Data Engineer,145000,USD,145000,US,0,US,Medium
1010,2023,Senior,Full Time,Data Analyst,121904,USD,121904,US,0,US,Medium
1042,2023,Senior,Full Time,Data Engineer,153600,USD,153600,US,0,US,Medium
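
The mode itself can be computed with pandas. The sketch below uses a small hypothetical Series as a stand-in for a categorical column such as `salaries["experience_level"]`, so that it runs without downloading the dataset; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a categorical column (assumption, not the real data)
experience = pd.Series(["Senior", "Mid", "Mid", "Senior", "Senior", "Entry"])

# Series.mode() returns a Series because several values can tie for most frequent
modes = experience.mode()
print(modes.tolist())

# value_counts() shows how often each category appears, most frequent first
counts = experience.value_counts()
print(counts)

# The mode also works on numerical data; this sample has two modes (1 and 2)
sample1 = pd.Series([1, 1, 2, 2, 3, 4, 5, 7, 8, 9])
print(sample1.mode().tolist())
```

On the real DataFrame the same call would be `salaries["experience_level"].mode()`.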


## Measures of dispersion

> #### Why Understanding Variability is Important   

![image.png](attachment:image.png)

> Analysts frequently use the mean to summarize the center of a population or a process. While the mean is relevant, people often react to variability even more. When a distribution has lower variability, the values in a dataset are more consistent. However, when the variability is higher, the data points are more dissimilar and extreme values become more likely. Consequently, **understanding variability helps you grasp the likelihood of unusual events**.

> #### Example of Different Amounts of Variability 

> Let’s take a look at two hypothetical pizza restaurants. They both advertise a mean delivery time of 20 minutes. When we’re ravenous, they sound equally good! However, this equivalence can be deceptive! To determine the restaurant that you should order from when you’re hungry, we need to analyze their variability.

> The graphs below display the distribution of delivery times and provide the answer. 

![image-2.png](attachment:image-2.png)

> In these graphs, we consider a 30-minute wait or longer to be unacceptable. We’re hungry after all! The shaded area in each chart represents the proportion of delivery times that surpass 30 minutes. Nearly 16% of the deliveries for the high variability restaurant exceed 30 minutes. On the other hand, only 2% of the deliveries take too long with the low variability restaurant. They both have an average delivery time of 20 minutes, but I know where I’d place my order when I’m hungry!
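
The percentages in the pizza example can be roughly reproduced in code. The sketch below assumes delivery times are normally distributed with standard deviations of 10 and 5 minutes; those values are not stated in the text and are chosen here only because they give proportions close to the 16% and 2% quoted above.

```python
from scipy import stats

mean_minutes = 20      # both restaurants advertise this mean
late_threshold = 30    # waits of 30 minutes or more are unacceptable

# Assumed standard deviations (not given in the text)
high_var_sd = 10
low_var_sd = 5

# norm.sf is the survival function: the probability of exceeding the threshold
p_late_high = stats.norm.sf(late_threshold, loc=mean_minutes, scale=high_var_sd)
p_late_low = stats.norm.sf(late_threshold, loc=mean_minutes, scale=low_var_sd)

print(f"High variability: {p_late_high:.1%} of deliveries exceed 30 minutes")
print(f"Low variability:  {p_late_low:.1%} of deliveries exceed 30 minutes")
```

Same mean, very different risk of a late pizza: only the spread changed.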

> + **Range:** the difference between the highest and lowest values.
> + **Variance:** the average squared distance of each value from the mean; the larger it is, the more spread out the values are.
> + **Standard deviation:** the square root of the variance. It measures the dispersion of a dataset relative to its mean, in the same units as the data.
> + **Quartiles:** the three cut points (Q1, Q2, Q3) that divide the ordered observations into four groups, each holding about a quarter of the data.
> + **Percentiles:** the same idea, with the ordered observations divided into 100 groups.
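
These measures can be computed with NumPy. The sketch below reuses the values of `sample1` from the central tendency section; the choice between `ddof=0` and `ddof=1` depends on whether the data are treated as a whole population or as a sample, so both are shown.

```python
import numpy as np

sample1 = [1, 1, 2, 2, 3, 4, 5, 7, 8, 9]

# Range: highest value minus lowest value
data_range = max(sample1) - min(sample1)

# Variance: average squared distance from the mean.
# ddof=0 divides by n (population); ddof=1 divides by n - 1 (sample).
pop_var = np.var(sample1, ddof=0)
sample_var = np.var(sample1, ddof=1)

# Standard deviation: square root of the variance, in the data's own units
pop_std = np.std(sample1, ddof=0)

# Quartiles are the 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(sample1, [25, 50, 75])

print(data_range)  # 8
print(q1, q2, q3)  # 2.0 3.5 6.5
```

Note that `q2` matches the median computed earlier with `np.median(sample1)`.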

# Outliers

![image.png](attachment:image.png)

# Extra: Sources of error in data analysis

For a more detailed overview of where statistics can go wrong, [here is an article](https://archive.ph/20230428034256/https://towardsdatascience.com/misleading-with-data-statistics-c6d506bdb9cf) for you. Some of the concepts in it may be unfamiliar (A/B testing, p-values), but worry not: by the mid-bootcamp project we will have learned about all of these techniques!

![image.png](attachment:image.png)