# **<span style="color:#2E86C1">Statistics</span>**

<span style="color:#2E86C1"><b>What is Statistics?</b></span>

- <span style="color:#D35400"><b>Definition:</b></span>  
  **Statistics** is a branch of mathematics that deals with collecting, analyzing, interpreting, presenting, and organizing data. It provides methods and techniques for making sense of data, allowing researchers, analysts, and decision-makers to draw meaningful conclusions and make informed decisions based on numerical information.

- <span style="color:#D35400"><b>Key Components of Statistics:</b></span>
    - <span style="color:#28B463"><b>Data Collection:</b></span>  
      Gathering data through surveys, experiments, or observational studies.
      
    - <span style="color:#28B463"><b>Data Analysis:</b></span>  
      Using various mathematical and computational methods to analyze and summarize data. This includes descriptive statistics (mean, median, mode, etc.) and inferential statistics (hypothesis testing, confidence intervals, etc.).
      
    - <span style="color:#28B463"><b>Data Interpretation:</b></span>  
      Drawing conclusions and making predictions based on the analyzed data, considering the context and limitations of the data.
      
    - <span style="color:#28B463"><b>Data Presentation:</b></span>  
      Communicating the findings effectively through charts, graphs, tables, and reports to convey the insights to others.

- <span style="color:#D35400"><b>Importance of Statistics:</b></span>
    - Helps in understanding trends and patterns.
    - Supports decision-making in various fields, including business, healthcare, social sciences, and more.
    - Provides tools for testing hypotheses and validating scientific research.


# **<span style="color:#2E86C1">Descriptive Statistics</span>**

After collecting data, one of the first steps is to graph the data, calculate the mean, and get an overview of how the data are distributed. This is the task of descriptive statistics.

**<span style="color:#D35400; font-size: 20px;">Definition:</span>**  
"The term descriptive statistics covers statistical methods for describing data using statistical characteristics, charts, graphics, or tables."

Depending on the question and the available measurement scale, different key figures, tables, and graphics are used for evaluation.
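As a rough illustration of such key figures, the sketch below summarizes a small metric sample with Python's standard `statistics` module. The `ages` list and the `summary` dictionary are hypothetical names and values chosen only for this example.

```python
import statistics

# Hypothetical sample: ages of eight survey respondents (illustrative values only)
ages = [23, 25, 25, 31, 34, 38, 41, 47]

# A few common descriptive key figures for a metric variable
summary = {
    "n": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Which of these figures is meaningful depends on the level of measurement of the variable.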

##  <span style="color:#28B463"><b>Level of Measurement:</b></span>

The level of measurement of a variable can be nominal, ordinal, or metric. In a nutshell:
- For **nominal variables**, the values can be differentiated.
- For **ordinal variables**, the values can be sorted.
- For **metric variables**, the distances between the values can be calculated.

Nominal and ordinal variables are also called **categorical variables**.

<center><img src="../../images/level_of_measurement.png" alt="error" width="600"/></center>

Different levels of measurement support different statistical analyses. For instance, the mean and standard deviation are suitable for metric data. In some cases they may also be computed for ordinal data, but only if the results are interpreted with care, because the distances between ordinal categories are not defined. For nominal data, calculating a mean or standard deviation makes no sense.
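One way to picture this is a small helper that picks a measure of central tendency according to the level of measurement. This is only a sketch under a common convention (mode for nominal, median for ordinal, mean for metric); the function name `central_tendency` and its parameters are hypothetical and not part of any library.

```python
import statistics


def central_tendency(values, level, order=None):
    """Return a measure of central tendency suited to the level of measurement.

    level: "nominal", "ordinal", or "metric".
    order: for ordinal data, the categories listed from lowest to highest.
    """
    if level == "nominal":
        # Only equality is defined, so the most frequent category is reported
        return statistics.mode(values)
    if level == "ordinal":
        if order is None:
            raise ValueError("ordinal data needs the category order")
        # Sorting is defined, so the (lower) median category is reported
        ranks = [order.index(v) for v in values]
        return order[statistics.median_low(ranks)]
    if level == "metric":
        # Distances are defined, so the arithmetic mean is meaningful
        return statistics.mean(values)
    raise ValueError(f"unknown level of measurement: {level}")


print(central_tendency(["red", "blue", "red", "green"], "nominal"))
print(central_tendency(["low", "high", "medium", "low", "medium"],
                       "ordinal", order=["low", "medium", "high"]))
print(central_tendency([1.5, 2.5, 3.5], "metric"))
```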

The level of measurement also indicates which hypothesis tests are possible and determines the most effective type of data visualization.

---

### **<span style="color:purple">Categorical Variables</span>**

Variables that have a nominal scale or an ordinal scale are referred to as categorical variables. Categorical is an umbrella term for nominally and ordinally scaled variables.
- Categorical variables take a limited and usually fixed number of values, e.g., country with "Germany," "Austria," ... or gender with "female" and "male."
- It is essential that there is a finite number of categories or groups. The categories may have a ranking, but they do not have to.


- **<span style="color:#D35400; font-size: 20px;">Nominal Variables</span>**
    - The nominal measurement scale is the lowest level of measurement in statistics and therefore carries the least information. Possible values of the variable can be distinguished, but they have no meaningful order. If there are only two categories, as in the case of gender (male and female), we also speak of dichotomous or binary variables.
        - Only the relations "equal" and "unequal" are possible.
        - No logical ranking of categories.
        - The order of the answer categories is interchangeable.
        - Nominal variables with only two categories are also called "binary" or "dichotomous."

```python
# Nominal Variables
from collections import Counter

nominal_data = ['Male', 'Female', 'Female', 'Male', 'Other', 'Female', 'Male']

# Count occurrences of each category
nominal_counts = Counter(nominal_data)

# Output nominal data
print("Nominal Data Counts:")
for category, count in nominal_counts.items():
    print(f"{category}: {count}")
```

```
Nominal Data Counts:
Male: 3
Female: 3
Other: 1
```
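Because only "equal" and "unequal" are defined for nominal data, the usual summaries are frequencies and the mode. The sketch below reuses the sample values from the cell above and adds relative frequencies; the names `relative` and `modes` are introduced here only for illustration.

```python
from collections import Counter

nominal_data = ['Male', 'Female', 'Female', 'Male', 'Other', 'Female', 'Male']

counts = Counter(nominal_data)
total = sum(counts.values())

# Relative frequency: share of observations in each category
relative = {category: n / total for category, n in counts.items()}

# Mode(s): the most frequent category or categories (ties are possible)
top = max(counts.values())
modes = [category for category, n in counts.items() if n == top]

for category, share in relative.items():
    print(f"{category}: {share:.1%}")
print("Mode(s):", modes)
```

A mean of these categories is not defined, which is why no average appears here.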



- **<span style="color:#D35400; font-size: 20px;">Ordinal Variables</span>**
    - The ordinal level of measurement is the next higher level, containing nominal information with the added ability to form a ranking. A classic example of the ordinal scale is school grades; here a ranking can be formed, but it cannot be said that the distance between A and B is the same as the distance between B and C.
        - Next higher scale of measurement.
        - "Equal" and "unequal," as well as "greater" and "smaller," can be determined.
        - There is a logical hierarchy of categories.
        - The distances between the categories are not necessarily equal and cannot be meaningfully interpreted.


```python
# Ordinal Variables
from collections import Counter

ordinal_data = ['Poor', 'Fair', 'Good', 'Good', 'Excellent', 'Fair', 'Poor']

# Map each category to its rank (1 = lowest, 4 = highest)
ordinal_rank = {
    'Poor': 1,
    'Fair': 2,
    'Good': 3,
    'Excellent': 4
}

# Count occurrences
ordinal_counts = Counter(ordinal_data)

# Output ordinal data with ranks
print("Ordinal Data Counts with Ranks:")
for category, count in ordinal_counts.items():
    rank = ordinal_rank[category]
    print(f"{category} (Rank {rank}): {count}")
```

```
Ordinal Data Counts with Ranks:
Poor (Rank 1): 2
Fair (Rank 2): 2
Good (Rank 3): 2
Excellent (Rank 4): 1
```
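Since ordinal values can be sorted, a median category is meaningful even though a mean is not. The following sketch reuses the ratings and rank mapping from the cell above; `median_low` is used so the result is always an existing category. The names `rank_to_label` and `median_label` are introduced only for this example.

```python
import statistics

ordinal_data = ['Poor', 'Fair', 'Good', 'Good', 'Excellent', 'Fair', 'Poor']
ordinal_rank = {'Poor': 1, 'Fair': 2, 'Good': 3, 'Excellent': 4}

# Translate categories to ranks; only the order of the ranks is used
ranks = [ordinal_rank[value] for value in ordinal_data]

# median_low returns an actual data point, so it maps back to a category
median_rank = statistics.median_low(ranks)

# Invert the mapping to recover the category label
rank_to_label = {rank: label for label, rank in ordinal_rank.items()}
median_label = rank_to_label[median_rank]

print(f"Median category: {median_label}")
```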


### **<span style="color:purple">Numeric Variables</span>**


- **<span style="color:#D35400; font-size: 20px;">Metric Variables</span>**
    - Metric variables represent the highest possible level of measurement. With a metric level of measurement, characteristic values can be compared, sorted, and the distances between the values can be calculated. Examples include the weight and age of subjects.
        - Highest level of measurement.
        - Creation of rankings is possible.
        - "Equal" and "unequal," as well as "greater" and "smaller," can be determined.
        - Differences and sums can be meaningfully formed.


```python
# Metric Variables
import statistics

metric_data = [45.5, 23.0, 56.7, 34.2, 67.8, 89.1]

# Calculate the mean and the sample standard deviation
# (both are explained in the following notebooks)
mean_metric = statistics.mean(metric_data)
std_dev_metric = statistics.stdev(metric_data)

# Output metric data
print("Metric Data:")
print(f"Mean: {mean_metric:.2f}")
print(f"Standard Deviation: {std_dev_metric:.2f}")
```

```
Metric Data:
Mean: 52.72
Standard Deviation: 23.85
```


- **<span style="color:#D35400; font-size: 20px;">Ratio Scale and Interval Scale</span>**
    - The metric level of measurement can be further subdivided into interval scale and ratio scale. As the name suggests, the values of the ratio scale can be expressed in a ratio. Thus, a statement like the following can be made: "One value is twice as large as another." For this, an absolute zero must be available as a reference.
        
        - **Example Ratio Scale:**  
          The finishing times of marathon runners are measured. Here, a statement such as "the fastest runner needed only half as much time as the slowest runner" is meaningful. This is possible because there is an absolute zero point: the start of the marathon, where every runner's clock begins at zero.
        
        - **Example Interval Scale:**  
          If the stopwatch was not started at the beginning of the marathon and only the gaps to the fastest runner are measured, the runners can no longer be compared proportionally. It is still possible to say how large the interval between two runners is (e.g., runner A finished 22 minutes ahead of runner B), but not that runner A ran 20 percent faster than runner B.


```python
# Ratio Scale Example
# Time taken by marathon runners, in seconds
marathon_times = [1800, 2100, 2400, 2700]

# Convert to minutes for interpretation
marathon_times_minutes = [time / 60 for time in marathon_times]

# Output ratio scale data
print("Marathon Times (in minutes):")
for time in marathon_times_minutes:
    print(f"{time:.2f} minutes")
```

```
Marathon Times (in minutes):
30.00 minutes
35.00 minutes
40.00 minutes
45.00 minutes
```
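The difference between the two metric scales can be checked numerically. In this sketch, `finish_times` holds hypothetical finishing times measured from the start (ratio scale), and `gaps` holds the same runners measured only relative to the fastest runner (interval scale). All values and names are assumptions made for illustration.

```python
# Hypothetical finishing times in minutes, measured from the start (ratio scale)
finish_times = [150.0, 180.0, 300.0]

# Same runners, but only the gap to the fastest runner is known (interval scale)
gaps = [t - min(finish_times) for t in finish_times]

# Differences between two runners agree on both scales
diff_ratio_scale = finish_times[2] - finish_times[1]
diff_interval_scale = gaps[2] - gaps[1]

# Ratios between the same two runners do not agree
ratio_ratio_scale = finish_times[2] / finish_times[1]
ratio_interval_scale = gaps[2] / gaps[1]

print(f"Difference (ratio scale):    {diff_ratio_scale:.1f} min")
print(f"Difference (interval scale): {diff_interval_scale:.1f} min")
print(f"Ratio (ratio scale):         {ratio_ratio_scale:.2f}")
print(f"Ratio (interval scale):      {ratio_interval_scale:.2f}")
```

The differences match because shifting the zero point leaves distances unchanged, while the ratios differ because a ratio depends on where zero lies.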
