# Introduction to Statistics

Statistics provides a set of tools that enable us to find answers to crucial questions about data. These tools can be divided into two main categories:

1. **Descriptive Statistics:** These methods allow us to convert raw data into understandable and shareable information. They help us summarize and present the data effectively.

2. **Inferential Statistics:** These methods help us draw conclusions about entire populations based on data collected from smaller samples.

In this chapter, we will explore the significance of statistics, not just in general but also in the context of machine learning. By the end of this chapter, you will have a clear understanding of the following points:

- Statistics is considered a fundamental component of applied machine learning.
- We rely on statistics to transform raw data into valuable information and to address questions related to data samples.
- Statistics encompasses a range of tools developed over centuries for summarizing data and quantifying the characteristics of a domain based on sample observations.

---



# Statistics as a Required Foundation

Machine learning and statistics are closely intertwined fields of study. In fact, statisticians often refer to machine learning as "applied statistics" or "statistical learning" rather than its computer science-centric name.

Machine learning texts typically assume that you already have some foundational knowledge of statistics. A few selected examples illustrate this:

- In the well-known book "Applied Predictive Modeling," the authors state that readers should have a grasp of basic statistics. This includes concepts like variance, correlation, simple linear regression, and fundamental hypothesis testing (such as p-values and test statistics).

- Similarly, the book "An Introduction to Statistical Learning" expects its readers to have completed at least an introductory course in statistics.

- Even when a book doesn't explicitly require prior knowledge of statistics, it often suggests that having some basic understanding of topics like trigonometry and statistics will facilitate the comprehension of the material.

Understanding machine learning therefore depends on a working knowledge of statistics. To appreciate this connection, it helps to look at what statistics is and where it appears in a machine learning project.

---

# The Significance of Learning Statistics

Raw data, by itself, is not information or knowledge. Instead, it raises questions, such as:

- What is the most common or expected observation?
- What are the boundaries of these observations?
- What does the data actually look like?

To transform raw data into useful information, we need to answer these seemingly simple questions. Moreover, when we conduct experiments to gather observations, we encounter more complex questions:

- Which variables are the most relevant to our analysis?
- What differences exist between the outcomes of different experiments?
- Are these differences genuine or merely a result of random noise in the data?

These questions are critical because their answers impact our projects, stakeholders, and the decisions we make. To find answers to such questions, we turn to statistical methods.

In the context of machine learning, statistics plays a vital role in understanding the data used to train a model and in interpreting the results of testing different machine learning models. Statistical methods are used at every step of a predictive modeling project, from exploring the data to comparing and selecting models, and they underpin the decisions made along the way.

---


# Understanding Statistics

Statistics is a branch of mathematics that provides a set of tools for working with data and using that data to find answers to various questions. It's often described as the art of making numerical conjectures about puzzling questions.

The methods used in statistics have evolved over centuries as people sought solutions to their questions. However, to beginners, statistics can appear vast and somewhat complex. It's not always easy to distinguish between statistical methods and those used in other fields of study. In fact, many techniques can straddle the line between classical statistical methods and modern algorithms employed in areas like feature selection and modeling.

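While a deep theoretical understanding is not necessary for practical use, familiarity with a few fundamental theorems helps. For instance, the law of large numbers explains why the mean of a larger sample tends to lie closer to the population's expected value, which is why larger samples are generally more reliable. The central limit theorem states that, under mild conditions, the means of sufficiently large samples are approximately normally distributed, and this provides a basis for comparing expected values between samples, such as mean values.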
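Both theorems can be observed in a small simulation. The sketch below is an illustration rather than part of the original material: it assumes observations drawn from a uniform distribution on [0, 1), whose true mean is 0.5, and uses only Python's standard library. The helper name `sample_mean`, the seed, and the sample sizes are arbitrary choices.

```python
import random
import statistics


def sample_mean(n, rng):
    """Mean of n draws from a uniform distribution on [0, 1) (true mean 0.5)."""
    return statistics.fmean(rng.random() for _ in range(n))


rng = random.Random(42)  # fixed seed so the run is repeatable

# Law of large numbers: the sample mean tends toward 0.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n, rng), 4))

# Central limit theorem: the means of many samples of size 30 cluster around
# 0.5 with a spread close to sigma / sqrt(n), where sigma = sqrt(1/12) for
# the uniform distribution on [0, 1).
means = [sample_mean(30, rng) for _ in range(2_000)]
observed_spread = statistics.stdev(means)
expected_spread = (1 / 12) ** 0.5 / 30 ** 0.5
print(round(observed_spread, 4), round(expected_spread, 4))
```

As `n` grows, the printed means should settle near 0.5, and the observed spread of the sample means should sit close to the value the central limit theorem predicts.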

In practice, statistics can be divided into two main categories of methods:

1. **Descriptive Statistics:** These methods are used for summarizing data, allowing us to understand typical experiences and differences between groups.

2. **Inferential Statistics:** These methods enable us to draw conclusions from samples of data, assess relationships between variables, and make predictions.

Statistics serves as a powerful tool for collecting information from a large number of individuals, summarizing their experiences, and making informed conclusions about general differences and relationships between groups. It plays a critical role in research and decision-making processes.

---


## Descriptive Statistics

Descriptive statistics encompass methods for transforming raw observations into understandable and shareable information. These methods involve:

- Calculating summary values, such as the mean or median, to describe the central tendency of the data.
- Assessing the spread of data through measures like variance and standard deviation.
- Utilizing graphical methods, including charts and graphics, to visualize data samples. These visuals provide qualitative insights into the distribution of observations and the relationships between variables.
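The numerical summaries listed above can be computed with Python's built-in `statistics` module. This is a minimal sketch: the `observations` list is made-up data chosen only for illustration, and the `summary` dictionary is just one convenient way to collect the results.

```python
import statistics

# Hypothetical sample: ten made-up measurements, used only for illustration.
observations = [4.1, 5.0, 5.2, 4.8, 6.3, 5.0, 4.7, 5.5, 9.8, 5.0]

summary = {
    "mean": statistics.mean(observations),
    "median": statistics.median(observations),
    "mode": statistics.mode(observations),
    "min": min(observations),
    "max": max(observations),
    "variance": statistics.variance(observations),  # sample variance (n - 1)
    "stdev": statistics.stdev(observations),        # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

In this sample the single large value, 9.8, pulls the mean above the median, which is one reason to report more than one measure of central tendency.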

---


## Inferential Statistics

Inferential statistics are methods designed to quantify characteristics of an entire population using data obtained from a smaller subset known as a sample.

At its core, inferential statistics involves estimating population parameters, such as the expected value or the variability, from sample data.
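Estimation from a sample can be sketched as follows. Everything here is assumed for the sake of the example: the simulated `population` (normal, mean 50, standard deviation 10), the sample size of 200, and the use of a normal-approximation interval, which is one common choice among several.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 values drawn from a normal distribution
# with mean 50 and standard deviation 10 (assumed for this example).
population = [rng.gauss(50, 10) for _ in range(100_000)]

# In practice we only observe a sample, not the whole population.
sample = rng.sample(population, 200)

estimate = statistics.mean(sample)       # point estimate of the population mean
spread = statistics.stdev(sample)        # estimate of the population standard deviation
std_error = spread / len(sample) ** 0.5  # standard error of the mean

# Approximate 95% interval using the normal quantile (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)
interval = (estimate - z * std_error, estimate + z * std_error)

print(f"sample mean: {estimate:.2f}")
print(f"95% interval: {interval[0]:.2f} to {interval[1]:.2f}")
print(f"population mean: {statistics.mean(population):.2f}")
```

The sample mean is a point estimate, and the interval expresses how much that estimate could plausibly vary from one sample to the next. Here the population mean can be printed for comparison only because the population was simulated.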

More advanced inferential tools quantify how likely it is to observe a data sample under a given assumption. These tools are commonly known as statistical hypothesis tests, and the assumption being tested is referred to as the null hypothesis. Many different tests exist because many different hypotheses can be formulated, and because each test places its own constraints on the data in order to make its findings more reliable.
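One way to make the idea of a null hypothesis concrete is a permutation test, sketched below with the standard library only. This is one test among many, picked here because it needs few assumptions, and it is not a method the text prescribes. The function name `permutation_test`, its parameters, and the two lists of accuracy scores are hypothetical.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both groups come from the same distribution, so the
    group labels are interchangeable. Returns an approximate p-value.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign observations to the two groups at random
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical accuracy scores from repeated runs of two models (made-up numbers).
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80]
model_b = [0.86, 0.88, 0.85, 0.87, 0.89, 0.86, 0.84, 0.88]

p_value = permutation_test(model_a, model_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value means that a difference at least as large as the observed one would rarely arise if the null hypothesis were true, which is evidence that the difference is not just random noise. The threshold for "small" (often 0.05) is a convention to be chosen before looking at the data.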

---


## Further Reading

### Books
- [**Applied Predictive Modeling** (2013)](https://amzn.to/2InAS0T)
- [**An Introduction to Statistical Learning with Applications in R** (2013)](https://amzn.to/2Gvhkqz)
- [**Programming Collective Intelligence: Building Smart Web 2.0 Applications** (2007)](https://amzn.to/2GIN9jc)
- [**Statistics, Fourth Edition** (2007)](https://amzn.to/2pUA0tU)
- [**All of Statistics: A Concise Course in Statistical Inference** (2004)](https://amzn.to/2H224Tp)
- [**Statistics in Plain English, Third Edition** (2010)](https://amzn.to/2Gv0A2V)

### Articles
- [**Statistics on Wikipedia**](https://en.wikipedia.org/wiki/Statistics)
- [**Portal: Statistics on Wikipedia**](https://en.wikipedia.org/wiki/Portal:Statistics)
- [**List of statistics articles on Wikipedia**](https://en.wikipedia.org/wiki/List_of_statistics_articles)
- [**Mathematical statistics on Wikipedia**](https://en.wikipedia.org/wiki/Mathematical_statistics)
- [**History of statistics on Wikipedia**](https://en.wikipedia.org/wiki/History_of_statistics)
- [**Descriptive Statistics on Wikipedia**](https://en.wikipedia.org/wiki/Descriptive_statistics)
- [**Statistical Inference on Wikipedia**](https://en.wikipedia.org/wiki/Statistical_inference)

---
