# **Statistics**
Statistics is a fundamental discipline that underpins data analysis, data science, and many machine learning techniques.

In simple terms, statistics involves collecting, summarizing, analyzing, and interpreting data to identify patterns, relationships, and trends.

Technically, statistics covers concepts such as measures of central tendency (mean, median, mode), variability (variance, standard deviation), probability distributions, sampling, and hypothesis testing.

Statistical analysis is typically conducted using tools like Python (with libraries such as pandas, NumPy, SciPy), R, Excel, and SPSS. It is an essential step in building predictive models, which combine statistical methods and machine learning algorithms to forecast future outcomes based on historical data.

---

## Statistics and the Role of Questions

Statistics is not just about numbers; it is about answering questions. Traditionally, a statistical study begins with a clear question, because the question determines what data to collect and how to analyze it.
However, in today’s data-rich world, the process can start either way:

❓ Question First:
Define a problem → Gather data to answer it.

📂 Data First:
Explore existing data → Discover patterns → Form new questions.



Regardless of the starting point, statistics provides the tools—such as measures of central tendency, variability, and hypothesis testing—to interpret data and draw meaningful conclusions.
Good statistics depends on relevant data. The closer the data aligns with the question, the more accurate and actionable the insights. This is why statistical analysis is essential in research, business decisions, and predictive modeling.

---

## Statistics Terms to Know

| **Term**                   | **Definition**                                                                                                                                         |
|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Data**                    | Individual observations, which may be numerical or descriptive                                                                                         |
| **Population**              | The group containing all possible entities of concern                                                                                                 |
| **Sample**                  | A part of the population from which data is collected                                                                                                 |
| **Observation**             | Each separate collection of one bit of data                                                                                                           |
| **Study**                   | Collecting data and using statistics to make an inference about it                                                                                    |
| **Inference**               | An educated, statistically supported “guess” about a group of data                                                                                    |
| **Descriptive statistics**  | Values that describe (e.g., center, spread, shape) data sets                                                                                          |
| **Inferential statistics**  | Making educated guesses, testing theories, modeling observations’ relationships, and predicting outcomes with data analysis                            |
| **Descriptive observations**| Data that describes qualities rather than amounts (such as hair color, eye color, etc.)                                                               |
| **Random variables**        | Numerical or descriptive observations that happen by chance                                                                                           |
| **Data set**                | A group of collected or observed data bits (or data points)                                                                                           |
| **Quantitative data**       | Data that is numerical                                                                                                                                |
| **Qualitative data**        | Data that is not numerical                                                                                                                            |

---

## Types of Data Analysis

1. **Descriptive Analysis**:
Summarizes and organizes data to describe its main features. It focuses only on the dataset you have and does not generalize beyond it.
(Simply a way to describe data.)

🛠️ Techniques used for Descriptive Analysis:
- Measures of central tendency: Mean, median, mode.
- Measures of dispersion: Range, variance, standard deviation.
- Visualizations: Histograms, boxplots, frequency tables.

✅ Example: Calculating the average test score of 1,000 students and plotting a histogram of their scores.
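The techniques listed above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not a prescribed workflow: the `scores` list is made up, and the population versions of variance and standard deviation are used on the assumption that we only want to describe this dataset.

```python
import statistics
from collections import Counter

# Hypothetical test scores (made-up values, for illustration only).
scores = [72, 85, 90, 66, 85, 78, 95, 85, 60, 78]

# Measures of central tendency
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_score = statistics.mode(scores)

# Measures of dispersion (population versions, since we describe only this dataset)
score_range = max(scores) - min(scores)
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)

print(f"mean={mean_score}, median={median_score}, mode={mode_score}")
print(f"range={score_range}, variance={variance:.2f}, std dev={std_dev:.2f}")

# Frequency table by 10-point bin, drawn as a simple text histogram
bins = Counter((s // 10) * 10 for s in scores)
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```

With real data, the same summaries are usually computed with pandas or NumPy, and the histogram is drawn with a plotting library.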

2. **Inferential Analysis**:
Inferential analysis is a set of techniques used when the purpose of the study is not to describe the collected data but to make generalizations and inferences from it. Inferential statistics are computed on samples rather than the whole population and allow us to generalize from the sample to the whole population. Statistics is also used to assess the quality of the data and to judge whether any inferences or educated guesses are accurate.


🛠️ Techniques used for Inferential Analysis:
- Hypothesis Testing (t-test, Chi-square).
- Confidence Intervals.
- Regression Analysis.
- Analysis of Variance (ANOVA).

✅ Example: Using a sample of 100 students to estimate the average test score of all students in a country.
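As a rough sketch of the example above, the snippet below estimates a population mean from a sample of 100 and attaches a 95% confidence interval. The data is synthetic (drawn from an assumed normal distribution with mean 70 and standard deviation 10), and the interval uses the normal approximation, which is a simplifying assumption that is generally reasonable at this sample size.

```python
import random
import statistics

random.seed(42)  # fixed seed so the synthetic data is reproducible

# Synthetic sample of 100 test scores (assumed distribution, illustration only).
sample = [random.gauss(70, 10) for _ in range(100)]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
standard_error = sample_sd / n ** 0.5

# 95% confidence interval using the normal approximation
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"estimated mean: {sample_mean:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

For small samples, a t-distribution based interval (for example via SciPy) would be the more appropriate choice.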


---

## Types of data

Data in statistics and data science is broadly classified into **Categorical** and **Numerical** types, with further subcategories as shown in the figure.

![image.png](attachment:image.png)


**1.** **Categorical Data**
Categorical data represents **qualities or characteristics** rather than numbers, and can be:

- **Nominal**: Categories without any order or ranking. Example - Gender (Male/Female), Hair Color (Black, Brown, Blonde). **Special Case:** **Binary** – only two categories (e.g., Yes/No, True/False).

- **Ordinal**: Categories with a meaningful order, but differences between ranks are not measurable. Example - Education Level (High School < Bachelor < Master), Customer Satisfaction (Poor < Good < Excellent).

**2.** **Numerical Data**: Numerical data represents **quantities** and can be measured.

- **Discrete**: Countable values, often integers. Example - Number of students in a class, number of cars in a parking lot, or number of users served by an IT network.

- **Continuous**: Values within a range that can take any real number. Example - Height, Weight, Temperature. It can be further subdivided into:
    - _Interval_: Numeric scale with equal intervals, but no true zero. Example - Temperature in Celsius or Fahrenheit.
    - _Ratio_: Numeric scale with a true zero, allowing meaningful ratios. Examples - Weight (0 kg means no weight), Age, Income.
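The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below uses two arbitrary temperatures: on the Celsius scale (interval) the differences are meaningful but the ratio is not, while on the Kelvin scale (ratio, with a true zero) both are.

```python
# Two arbitrary temperatures in degrees Celsius (interval scale).
celsius_a, celsius_b = 10.0, 20.0

# Dividing Celsius values gives 2.0, but 20 °C is not "twice as hot" as 10 °C,
# because 0 °C is not a true zero.
naive_ratio = celsius_b / celsius_a

# Kelvin has a true zero, so the ratio is meaningful here.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
true_ratio = kelvin_b / kelvin_a

# Differences are meaningful on both scales and agree with each other.
difference_c = celsius_b - celsius_a
difference_k = kelvin_b - kelvin_a

print(f"Celsius ratio: {naive_ratio:.3f}  (not meaningful)")
print(f"Kelvin ratio:  {true_ratio:.3f}")
print(f"Differences:   {difference_c:.1f} °C, {difference_k:.1f} K")
```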


### Population (N) and Sample (n)

![image.png](attachment:image.png)

 - Small samples tend to show greater variability from one sample to another. In other words, their results can differ more because there are fewer data points.
 - Large samples usually have less variability, so their results are more consistent.
 - If a sample is representative of the population, its summary measures (like mean or median) should be close to the population’s measures.
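The effect of sample size on variability can be seen in a small simulation. This is only a sketch: the population is synthetic (an assumed normal distribution), and the helper `spread_of_sample_means` is a hypothetical name introduced for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

# Synthetic population of 10,000 values (assumed distribution, illustration only).
population = [random.gauss(50, 15) for _ in range(10_000)]

def spread_of_sample_means(sample_size, repeats=200):
    """Standard deviation of the means of many random samples of one size."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(10)    # small samples
spread_large = spread_of_sample_means(500)   # large samples

print(f"spread of sample means, n=10:  {spread_small:.2f}")
print(f"spread of sample means, n=500: {spread_large:.2f}")
```

The means of the small samples should scatter much more widely around the population mean than the means of the large samples.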

#### Sampling Techniques
 - **Simple random sampling**: Randomly select members of the population so that each has an equal chance of being chosen.
 - **Stratified sampling**: Divide the population into non-overlapping **strata** (groups) based on characteristics (e.g., age, gender), then sample from each stratum.
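The two sampling techniques named above can be sketched as follows. The population, the `age_group` field, and the `stratified_sample` helper are all hypothetical, and proportional allocation (each stratum contributes in proportion to its size) is an assumption here; other allocation schemes exist.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed for reproducibility

# Hypothetical population: 1,000 people, each tagged with an age group.
population = (
    [{"id": i, "age_group": "18-29"} for i in range(0, 500)]
    + [{"id": i, "age_group": "30-49"} for i in range(500, 800)]
    + [{"id": i, "age_group": "50+"} for i in range(800, 1000)]
)

# Simple random sampling: every member has the same chance of selection.
simple_sample = random.sample(population, 100)

def stratified_sample(people, key, total_size):
    """Sample each non-overlapping stratum in proportion to its size."""
    strata = {}
    for person in people:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        # Rounding may make the total differ slightly from total_size in general.
        share = round(total_size * len(members) / len(people))
        sample.extend(random.sample(members, share))
    return sample

strat_sample = stratified_sample(population, "age_group", 100)

print("simple random:", Counter(p["age_group"] for p in simple_sample))
print("stratified:   ", Counter(p["age_group"] for p in strat_sample))
```

The stratified sample reproduces the population's group proportions exactly, whereas the simple random sample only approximates them.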


# Practice - Krish 