## What Is Statistics?

**Statistics** is the science of **collecting, organizing, and analyzing data**.

**Data:**  
Facts or pieces of information.

---

## Types of Statistics

### 1. Descriptive Statistics  
Descriptive statistics consist of **organizing and summarizing data**.

**Examples:**
- Measures of central tendency: **Mean, Median, Mode**
- Measures of dispersion: **Variance, Standard Deviation**
- Different types of data distributions
- **Histogram**
- **PDF (Probability Density Function)**
- **PMF (Probability Mass Function)**

**Question Type:**  
- What is the most common age in your statistics class?

---

### 2. Inferential Statistics  
Inferential statistics consist of **using measured data to form conclusions** about a population.

**Examples:**
- **Z-test**
- **T-test**
- **Chi-square test**
- **ANOVA**
- Hypothesis testing: **H₀ (null hypothesis), H₁ (alternative hypothesis)**
- **p-value**
- **Level of significance**

**Question Type:**  
- Are the ages of students in the classroom similar to the ages of students in the university?


## Population and Sample

**Population:**  
The entire group that we are interested in studying.  
- Population size is denoted by **N**

**Sample:**  
A subset of the population selected for study.  
- Sample size is denoted by **n**

*The goal of sampling is to create a sample that is representative of the entire population.*

---

## Types of Sampling

### 1. Simple Random Sampling  
When performing simple random sampling, **every member of the population (N)** has an **equal chance** of being selected for the sample (n).

---

### 2. Stratified Sampling  
- The population is divided into **non-overlapping groups** called **strata**, and a random sample is then drawn from each stratum  
- "Stratified" means **layered**

---

### 3. Systematic Sampling  
- Selection is made at **regular intervals** from the population list

---

### 4. Convenience Sampling / Voluntary Response Sampling  
- Samples are chosen based on **ease of access** or **self-selection**
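
The first three schemes above can be sketched with Python's standard `random` module. This is a minimal illustration, not a production sampler: the `students` population, the function names, and the 10% sampling fraction are all hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has an equal chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, step, start=0):
    """Select every `step`-th member, beginning at index `start`."""
    return population[start::step]

def stratified_sample(population, key, fraction, seed=None):
    """Split the population into strata by `key`, then sample each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 100 students, 60 in "CS" and 40 in "EE".
students = [{"id": i, "dept": "CS" if i < 60 else "EE"} for i in range(100)]

srs = simple_random_sample(students, 10, seed=42)
systematic = systematic_sample(students, step=10)
stratified = stratified_sample(students, key=lambda s: s["dept"], fraction=0.1, seed=42)
```

With a 10% fraction, the stratified sample keeps the 60/40 department split (6 CS and 4 EE students), which simple random sampling does not guarantee.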


## Variables

A **variable** is a property that can take on many values.

> For any one observation, a variable holds a **single value**.  
> Example: `Ages = [1, 5, 3, 6]` is **not** a variable itself; it is a list of observed values of the variable *age*.

---

## Types of Variables

### 1. Quantitative Variables (Numerical)

#### a. Discrete Variable  
- Takes **whole number** values  
- **Example:**  
  - Number of children = 2, 3

#### b. Continuous Variable  
- Takes **any real number**  
- **Example:**  
  - Height = 175.24, 180.9

---

### 2. Qualitative / Categorical Variables

#### a. Nominal  
- No **order** in the data  
- No category is greater than another  
- **Example:**  
  - Gender = Male, Female

#### b. Ordinal  
- There **is an order** in the data  
- One category is better or higher than another  
- **Example:**  
  - Grade = Poor, Good, Excellent


## Measure of Central Tendency

**Definition:**  
Central tendency refers to the measure used to determine the **center** of a data distribution.

**Examples:**  
- Mean  
- Median  
- Mode  

---

### Mean (Average)

#### Population Data (N)
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

#### Sample Data (n)
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

---

### Median

**Steps to calculate the median:**
1. Sort all the numbers in ascending order  
2. Find the central element  

**Cases:**
- **Odd length:** middle value  
- **Even length:** average of the two middle values  

---

### Mode

- The **most frequently occurring** element in the dataset

---

### Notes
- Use **median** when the data contains **outliers**, since it is far less affected by extreme values than the mean  
- Use **mode** for **categorical features**
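
As a small illustration of the three measures, the sketch below uses Python's built-in `statistics` module. The `ages` and `grades` values are made up for the example.

```python
import statistics

ages = [21, 22, 22, 23, 24]        # hypothetical sample
ages_with_outlier = ages + [90]    # same data plus one extreme value

mean_clean = statistics.mean(ages)                      # 22.4
median_clean = statistics.median(ages)                  # 22
mean_outlier = statistics.mean(ages_with_outlier)       # about 33.67
median_outlier = statistics.median(ages_with_outlier)   # 22.5

mode_age = statistics.mode(ages)                        # 22

# Mode also works for categorical data, where mean and median are undefined.
grades = ["Good", "Poor", "Good", "Excellent"]
mode_grade = statistics.mode(grades)                    # "Good"
```

One extreme value moves the mean from 22.4 to about 33.67, while the median only shifts from 22 to 22.5.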


## Measure of Dispersion

Measures of dispersion describe how **spread out** the data values are.

### Types of Measure of Dispersion
1. Variance  
2. Standard Deviation  

---

## Variance

**Variance** measures the **spread of the data** around the mean.

### For Population (N)
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

### For Sample (n)
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

**Note:**  
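- $n - 1$ is used in place of $n$ for sample data. This is **Bessel's correction**: the sample has $n - 1$ degrees of freedom because the deviations are measured from the sample mean $\bar{x}$, and dividing by $n - 1$ makes $s^2$ an unbiased estimate of $\sigma^2$.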
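
To make the two formulas concrete, here is a short sketch that computes both by hand and cross-checks them against the standard library. The `data` values are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values

mean = sum(data) / len(data)                           # 5.0
squared_deviations = [(x - mean) ** 2 for x in data]   # sums to 32.0

population_variance = sum(squared_deviations) / len(data)      # divide by N
sample_variance = sum(squared_deviations) / (len(data) - 1)    # divide by n - 1

# The standard library implements the same two formulas.
builtin_population = statistics.pvariance(data)
builtin_sample = statistics.variance(data)
```

The same sum of squared deviations (32) gives 4.0 when divided by N = 8 and about 4.57 when divided by n - 1 = 7, so the sample formula always yields the larger value.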

---

## Standard Deviation

- Standard Deviation is the **square root of variance**.

$$\text{S.D.} = \sqrt{\text{Variance}}$$

<img src="https://image4.slideserve.com/254161/standard-deviation-l.jpg" alt="Variance vs Standard Deviation" width="300" style="background-color:white;">

### Interpretation
- When **variance is large**, standard deviation is large → **spread increases**
- When **variance is small**, standard deviation is small → **data is more concentrated**
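
The interpretation above can be checked on two made-up datasets that share the same mean but differ in spread. This sketch uses `statistics.pstdev`, which treats each list as a full population.

```python
import statistics

tight = [48, 49, 50, 51, 52]   # hypothetical: values close to the mean
wide = [10, 30, 50, 70, 90]    # hypothetical: values far from the mean

mean_tight = statistics.mean(tight)   # 50
mean_wide = statistics.mean(wide)     # 50

sd_tight = statistics.pstdev(tight)   # sqrt(2), about 1.41
sd_wide = statistics.pstdev(wide)     # sqrt(800), about 28.28
```

Both lists are centered on 50, yet the second has a standard deviation about twenty times larger.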

---

## Visualization

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Comparison_standard_deviations.svg/960px-Comparison_standard_deviations.svg.png" alt="Variance vs Standard Deviation" width="300" style="background-color:white;">


## Coefficient of Variation (CV)

The **Coefficient of Variation (CV)** is a **relative measure of dispersion**.  
It expresses the **standard deviation as a percentage of the mean**.

---

## Formula

Since CV is a ratio, the formula is the same for both **population** and **sample**, as long as the corresponding **mean** and **standard deviation** are used.

### Standard Formula

$$CV = \left( \frac{\sigma}{\mu} \right) \times 100$$

**Where:**
* $\sigma$ or $s$: Standard Deviation
* $\mu$ or $\bar{x}$: Mean 
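
A minimal sketch of the formula, using the sample standard deviation from Python's `statistics` module. The helper name `coefficient_of_variation` and both datasets are assumptions for illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Sample CV as a percentage: (s / mean) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

weights_kg = [60, 65, 70, 75, 80]        # hypothetical
heights_cm = [160, 165, 170, 175, 180]   # hypothetical

cv_weight = coefficient_of_variation(weights_kg)   # about 11.29
cv_height = coefficient_of_variation(heights_cm)   # about 4.65

# Converting kg to grams rescales every value by 1000 but leaves the CV unchanged.
cv_weight_grams = coefficient_of_variation([w * 1000 for w in weights_kg])
```

Both datasets have the same standard deviation (about 7.9 in their own units), but relative to the mean the weights vary more than the heights.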

---

## Key Properties

- **Unitless:**  
  Units cancel out, allowing comparison between datasets with different units  
  *(e.g., weight in kg vs. height in cm)*

- **Relative Variability:**  
  Measures how much **noise** (standard deviation) exists relative to the **signal** (mean)

- **Scale Invariant:**  
  Multiplying all data points by a positive constant does **not** change the CV  
  *(unlike standard deviation)*

---

## Interpretation

- **Low CV:**  
  Data is more **consistent** and **stable** relative to its size

- **High CV:**  
  Data is more **volatile** or **inconsistent** relative to its size

---

## Comparing Two Series

| Aspect         | Lower CV        | Higher CV       |
|----------------|-----------------|-----------------|
| Consistency    | More consistent | Less consistent |
| Stability      | More stable     | Less stable     |
| Risk (Finance) | Less risky      | More risky      |

---

## When to Use CV vs. Standard Deviation

### Use **Standard Deviation**
- To understand the spread **within a single dataset**
- When **units matter**  
  *Example: "The error margin is 5 cm"*

### Use **Coefficient of Variation (CV)**
- To compare **two or more datasets**
- When datasets have **different means or different units**  
  *Example: "Which is more volatile: the price of Gold or the price of Milk?"*

---

## Important Note

- CV is meaningful **only for ratio-scale data** (data with a true zero), such as:
  - Height
  - Weight
  - Income

- CV is **not useful** for interval-scale data such as temperature in Celsius ($^\circ\text{C}$) or Fahrenheit ($^\circ\text{F}$), where zero is arbitrary.
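
The temperature caveat can be seen numerically. In this sketch the readings are hypothetical, and the helper `cv` is an illustrative name; the same five temperatures are expressed in both scales.

```python
import statistics

def cv(values):
    """Population CV as a percentage."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

celsius = [10, 15, 20, 25, 30]                    # hypothetical readings
fahrenheit = [c * 9 / 5 + 32 for c in celsius]    # the same readings in °F

cv_celsius = cv(celsius)          # about 35.4
cv_fahrenheit = cv(fahrenheit)    # about 18.7
```

The physical variability is identical, yet the CV nearly halves just by changing the scale, because the +32 offset shifts the mean without changing the spread.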

# Frequency Distribution Table

A **Frequency Distribution Table** is a statistical tool used to organize and summarize raw data.  
Instead of analyzing a long, unorganized list of numbers, this table clearly shows how many times each value or range of values occurs.

**In short:**

A table that organizes data by showing how often each value or range of values occurs.

---

## Components of a Frequency Distribution Table

| Term | Description |
|------|-------------|
| **Class Interval** | Range of values (e.g., 0–10, 10–20) |
| **Class Limits** | Lower limit and upper limit of each class |
| **Class Width (Class Size)** | Difference between upper and lower limit |
| **Class Midpoint (Class Mark)** | Middle value of the class: (Lower limit + Upper limit) / 2 |
| **Frequency (f)** | Number of observations in each class |
| **Cumulative Frequency (cf)** | Running total of frequencies |
| **Relative Frequency** | Frequency expressed as a fraction or percentage |

---

## Types of Frequency Distribution

### A. Ungrouped Frequency Distribution

An **Ungrouped Frequency Distribution** is used for **small datasets** or **discrete data** where values repeat.  
Each unique value is listed separately along with its frequency.

**Example:**  
Counting the number of pets per household on a small street  
(0, 1, 2, or 3 pets)
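
The pet-count example can be tallied with `collections.Counter`. The survey values below are invented for illustration.

```python
from collections import Counter

pets_per_household = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]   # hypothetical survey
frequency = Counter(pets_per_household)

# Print one row per unique value, in ascending order.
for value in sorted(frequency):
    print(f"{value} pets: {frequency[value]} households")
```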

<img src="https://www.scribbr.com/wp-content/uploads/2022/07/02.png" alt="Ungrouped Frequency Distribution" width="300" style="background-color:white;">

---

### B. Grouped Frequency Distribution

A **Grouped Frequency Distribution** is used for **large datasets** or **continuous data**.  
Data is organized into **class intervals (ranges)** to make analysis easier.

**Example:**  
Student test scores grouped into:
- 0–10  
- 11–20  
- 21–30  
- and so on

<img src="https://www.scribbr.com/wp-content/uploads/2022/07/03.png" alt="Grouped Frequency Distribution" width="300" style="background-color:white;">

#### Key Terms
- **Class Limits:** The lowest and highest values that a class can contain
- **Class Width:** The difference between the lower limits of two consecutive classes (e.g., width = 10)
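
A rough sketch of grouping scores into class intervals follows. It assumes half-open classes `[low, low + width)` so that every value falls into exactly one class; the function name `grouped_frequency` and the `scores` are hypothetical.

```python
def grouped_frequency(values, lower, width, num_classes):
    """Count how many values fall in each class [low, low + width)."""
    counts = [0] * num_classes
    for v in values:
        index = int((v - lower) // width)
        if 0 <= index < num_classes:   # ignore values outside all classes
            counts[index] += 1
    return [
        ((lower + i * width, lower + (i + 1) * width), counts[i])
        for i in range(num_classes)
    ]

scores = [3, 8, 12, 15, 19, 21, 25, 27, 29, 34]   # hypothetical test scores
table = grouped_frequency(scores, lower=0, width=10, num_classes=4)

# Class midpoint = (lower limit + upper limit) / 2
midpoints = [(low + high) / 2 for (low, high), _ in table]
```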

---

### C. Cumulative Frequency Distribution

A **Cumulative Frequency Distribution** shows the **running total** of frequencies.

It is calculated by adding the frequency of the current class to the sum of all previous classes.

**Purpose:**  
Helps determine how many observations fall **below or above** a specific value.
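
The running total can be computed with `itertools.accumulate`. The class boundaries and frequencies here are assumed values for four classes of width 10 starting at 0.

```python
from itertools import accumulate

class_upper_limits = [10, 20, 30, 40]   # hypothetical class boundaries
frequencies = [2, 3, 4, 1]              # hypothetical counts per class

cumulative = list(accumulate(frequencies))   # running total per class

total = cumulative[-1]
below_30 = cumulative[class_upper_limits.index(30)]   # observations under 30
at_or_above_30 = total - below_30
```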

<img src="https://www.scribbr.com/wp-content/uploads/2022/07/05.png" alt="Cumulative Frequency Distribution" width="300" style="background-color:white;">

---

### D. Relative Frequency Distribution

A **Relative Frequency Distribution** displays the **proportion or percentage** of observations in each class instead of just the count.

**Formula:**

### Relative Frequency Formula
$$\text{Relative Frequency} = \frac{f}{n}$$

**Where:**
* $f$: Frequency of a specific class
* $n$: Total number of observations ($\sum f$)

This value can be expressed as a **fraction, decimal, or percentage**.
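
A short sketch of the formula, with hypothetical class labels and counts:

```python
frequencies = {"0-10": 2, "10-20": 3, "20-30": 4, "30-40": 1}   # hypothetical

n = sum(frequencies.values())   # total number of observations
relative = {cls: f / n for cls, f in frequencies.items()}
percentages = {cls: round(r * 100, 1) for cls, r in relative.items()}
```

The relative frequencies always sum to 1 (or 100%), which is a quick sanity check on any such table.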

<img src="https://www.scribbr.com/wp-content/uploads/2022/07/04.png" alt="Relative Frequency Distribution" width="300" style="background-color:white;">

---


For more information, refer to this article:  
https://www.scribbr.com/statistics/frequency-distributions/



# Graphs for Univariate Analysis

## What is Univariate Analysis?

**Univariate Analysis** is the study of a single variable at a time.  
When you independently analyze one variable in a dataset, it is called univariate analysis.

There are **two types of data** commonly analyzed using univariate analysis:

- **Categorical Data**
- **Numerical Data**

To understand these variables, we usually visualize them using different graphs.

---

## 📊 Graphs for Categorical Data

### 1. Count Plot

**Definition:**  
A count plot is a bar chart that shows how many times each category appears in the dataset.

**Interpretation:**  
- Taller bars indicate more frequent categories.
- Helps quickly identify the most and least common categories.

**Visualization:**  

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20250403164321196778/single-categorical-variable.png" alt="Count Plot" width="300" style="background-color:white;">

---

### 2. Pie Chart

**Definition:**  
A circular chart divided into slices, where each slice represents a category's proportion of the whole.

**Interpretation:**  
- Larger slices indicate a higher percentage.
- Useful for understanding the relative contribution of each category.

**Visualization:**  

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20231005115505/Pie-Diagram-copy.webp" alt="Pie Chart" width="300" style="background-color:white;">

---

## 📈 Graphs for Numerical Data

### 1. Histogram

**Definition:**  
A bar chart that groups continuous values into ranges (bins) and shows how many values fall into each range.

**Interpretation:**  
- Shows the distribution shape of data.
- Peaks indicate where most values are concentrated.
- Gaps show ranges with few observations.

**Visualization:**  

<img src="https://d138zd1ktt9iqe.cloudfront.net/media/seo_landing_files/histogram-1614091901.png" alt="Histogram" width="300" style="background-color:white;">

**Note: How to decide the number of bins**

Reference: https://medium.datadriveninvestor.com/how-to-decide-on-the-number-of-bins-of-a-histogram-3c36dc5b1cd8

**Method 1: Sturges' rule**

$$\text{bins} = 1 + \text{ceil}(\log_2(n))$$

**Method 2: Freedman-Diaconis rule**

$$\text{bin width} = \frac{2(Q_3 - Q_1)}{\sqrt[3]{n}}$$
$$\text{bins} = \text{ceil}\left(\frac{\max(x) - \min(x)}{\text{bin width}}\right)$$

Here $n$ is the number of observations and $Q_3 - Q_1$ is the interquartile range (IQR).
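
Both rules can be sketched in a few lines of standard-library Python. The function names are illustrative, and the quartiles use `statistics.quantiles` with `method="inclusive"`, which is one of several common quartile conventions, so other tools may give slightly different bin counts.

```python
import math
import statistics

def sturges_bins(values):
    """Sturges' rule: 1 + ceil(log2(n))."""
    return 1 + math.ceil(math.log2(len(values)))

def freedman_diaconis_bins(values):
    """Freedman-Diaconis rule; assumes the IQR is not zero."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    bin_width = 2 * (q3 - q1) / len(values) ** (1 / 3)
    return math.ceil((max(values) - min(values)) / bin_width)

data = list(range(1, 101))   # hypothetical data: the integers 1 to 100

bins_sturges = sturges_bins(data)          # 8
bins_fd = freedman_diaconis_bins(data)     # 5
```

Sturges' rule depends only on the sample size, while Freedman-Diaconis also adapts to the spread of the data, so the two can disagree as they do here.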

---

### 2. Dist Plot

**Definition:**  
A distribution plot combines a histogram with a smooth curve called **KDE (Kernel Density Estimation)**.

**Interpretation:**  
- The smooth curve highlights the overall distribution pattern.
- The highest peak represents the most common values.

**Visualization:**  

<img src="https://seaborn.pydata.org/_images/displot_7_0.png" alt="Dist Plot" width="300" style="background-color:white;">

**KDE (Kernel Density Estimation)**

**Definition:** 
A smooth curve that estimates the probability density of your data by placing a small "bump" (kernel) at each data point and averaging the bumps together.

**Interpretation:** 
- Shows where data values are concentrated. Higher curve = more data points in that region. Think of it as a smoothed-out histogram that makes patterns easier to see.
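
The "bumps" idea can be written out directly. This is a minimal sketch that assumes a Gaussian kernel and a hand-picked `bandwidth` of 0.5; plotting libraries usually choose the bandwidth automatically, and the `data` values are hypothetical.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x.

    One Gaussian bump is centred on each data point and the bumps are averaged.
    """
    n = len(data)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

data = [1.0, 1.2, 1.5, 4.8, 5.0, 5.1, 5.3]   # hypothetical, two clusters
density = gaussian_kde(data, bandwidth=0.5)

near_cluster = density(5.0)   # high: many points nearby
in_gap = density(3.0)         # low: no points nearby
```

A larger bandwidth gives a smoother curve that may hide the two clusters; a smaller one gives a spikier curve that follows individual points.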

---

### 3. Box Plot

**Definition:**  
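A box-and-whisker plot displays five key statistics:
- Lower whisker limit ("minimum") = Q1 - 1.5 × IQR
- Q1 (25th percentile)
- Median (50th percentile)
- Q3 (75th percentile)
- Upper whisker limit ("maximum") = Q3 + 1.5 × IQR

Here IQR = Q3 - Q1 is the interquartile range. The whiskers extend to the most extreme data points that still lie within these limits. Values beyond the limits are treated as outliers and shown as individual dots.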

**Interpretation:**  
- The box represents the middle 50% of the data.
- The line inside the box is the median.
- Dots outside the whiskers indicate outliers.
- Useful for detecting spread and unusual values.
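
The numbers behind a box plot can be computed without drawing anything. This sketch uses the 1.5 × IQR rule from the definition above; the `data` values are hypothetical, and `method="inclusive"` is an assumed quartile convention (plotting libraries may differ slightly).

```python
import statistics

data = [7, 9, 10, 11, 12, 13, 14, 15, 16, 40]   # hypothetical; 40 is extreme

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # values below this are outliers
upper_fence = q3 + 1.5 * iqr   # values above this are outliers

outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers stop at the most extreme values that are still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
```

Here the fences are 3.5 and 21.5, so 40 is flagged as an outlier and the whiskers end at 7 and 16.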

**Visualization:**  

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20251121105754642527/419171619.webp" alt="Box Plot Components" width="300" style="background-color:white;">
<img src="https://i0.wp.com/statisticsbyjim.com/wp-content/uploads/2019/01/boxplot_pdf.png?w=437&ssl=1" alt="Box Plot Components" width="300" style="background-color:white;">

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Michelsonmorley-boxplot.svg/960px-Michelsonmorley-boxplot.svg.png" alt="Box Plot Example" width="300" style="background-color:white;">

---


# Graphs for Bivariate Analysis

When you analyse **two variables** in a dataset together, it is called **bivariate analysis**.

There can be **three cases** while doing bivariate analysis:

a. **Numerical – Numerical analysis**  
(i.e. both variables are numerical data type)

b. **Categorical – Categorical analysis**  
(i.e. both variables are categorical data type)

c. **Numerical – Categorical analysis**  
(i.e. one variable is categorical and the other is numerical)

---

## Types of Graphs for Numerical – Numerical Analysis

### 1. Scatter Plot

**Definition:**  
A plot with dots where each dot represents one observation, showing the relationship between two numerical variables on x and y axes.

**Interpretation:**  
Shows correlation patterns:
- Upward trend → positive correlation  
- Downward trend → negative correlation  
- Random scatter → no correlation  

Helps identify trends and outliers.
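
The direction seen in a scatter plot can also be summarized as a number. The sketch below computes the Pearson correlation coefficient, which is one common choice and is not named in these notes; the function name and all three data lists are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect upward line, -1 a perfect downward line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]           # hypothetical study hours
scores = [52, 55, 61, 64, 70]     # rises with hours
errors = [9, 8, 6, 5, 3]          # falls with hours

r_positive = pearson_r(hours, scores)   # close to +1
r_negative = pearson_r(hours, errors)   # close to -1
```

Pearson's coefficient only measures linear association, so a scatter plot is still worth inspecting for curved patterns and outliers.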

**Visualization:**

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/Scatter_diagram_for_quality_characteristic_XXX.svg/500px-Scatter_diagram_for_quality_characteristic_XXX.svg.png" alt="Scatter Plot" width="300" style="background-color:white;">

---

### 2. Line Plot

**Definition:**  
Connects data points with lines to show how one numerical variable changes with respect to another, typically over time.

**Interpretation:**  
- Rising lines indicate increase  
- Falling lines indicate decrease  
- Helps spot trends, patterns, and seasonal behavior

**Visualization:**

<img src="https://www.math-salamanders.com/image-files/line-plot-vs-line-graph-image.gif" alt="Line Plot" width="800" style="background-color:white;">

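**Note:**  
A line plot is appropriate only when the x-axis variable has a natural order, most commonly **time**. Connecting points across unordered categories suggests a trend that does not exist.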

---

## Types of Graphs for Numerical – Categorical Analysis

### 3. Bar Plot

**Definition:**  
Uses rectangular bars of different heights to compare numerical values across different categories.

**Interpretation:**  
- Taller bars = higher values  
- Makes comparison of averages, totals, or other statistics easy
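
The bar heights are typically a per-category summary such as the mean. A minimal sketch of that aggregation step, with made-up `(category, value)` records:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (category, numerical value)
records = [("A", 70), ("B", 85), ("A", 80), ("C", 60), ("B", 95)]

grouped = defaultdict(list)
for category, value in records:
    grouped[category].append(value)

# One bar per category; the height is the mean of that category's values.
bar_heights = {category: mean(values) for category, values in grouped.items()}
```

Swapping `mean` for `sum` or `len` would give bars for totals or counts instead.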

**Visualization:**

<img src="https://media.datacamp.com/cms/google/ad_4nxdts-ho-0fohkf80nmzxucdfx8aj1mikr28useuoin3hiup5hud8uv0pdl0v3se2016eeyqgsumvu9riowb7sh8-m7xv075-r8qyy13ytnq72s8bd7ixh8exrovpv9iayuiamdc1q.png" alt="Bar Plot" width="300" style="background-color:white;">

---

### 4. Box Plot

**Definition:**  
Shows the distribution of a numerical variable for each category using boxes and whiskers  
(minimum, Q1, median, Q3, maximum).

**Interpretation:**  
- Compare distributions across categories  
- Identify higher/lower medians  
- Detect spread and outliers  
- Non-overlapping boxes suggest that the groups may differ, which should be confirmed with a statistical test

![barPlot-numerical-categorical-analysis.png](attachment:barPlot-numerical-categorical-analysis.png)

---

### 5. Dist Plot

**Definition:**  
Shows the distribution (histogram + KDE curve) of a numerical variable separately for each category.

**Interpretation:**  
- Compare shapes and centers of distributions  
- Identify if one group has higher/lower values  
- Helps see similarity or difference between groups

![displot-numerical-categorical-analysis.png](attachment:displot-numerical-categorical-analysis.png)

---

## Types of Graphs for Categorical – Categorical Analysis

### 6. Heatmap

**Definition:**  
A grid of colored cells where rows and columns represent categories, and color intensity shows frequency or strength of relationship.

**Interpretation:**  
- Darker/brighter colors = higher values or stronger association  
- Quickly spot common or rare category combinations
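
For two categorical variables, the grid behind a heatmap is usually a table of counts for each category combination. A sketch with invented observations:

```python
from collections import Counter

# Hypothetical observations: (gender, grade)
observations = [
    ("Male", "Good"), ("Female", "Excellent"), ("Male", "Poor"),
    ("Female", "Good"), ("Male", "Good"), ("Female", "Excellent"),
]

counts = Counter(observations)
rows = sorted({gender for gender, _ in observations})   # row labels
cols = sorted({grade for _, grade in observations})     # column labels

# matrix[i][j] = number of observations with rows[i] and cols[j]
matrix = [[counts[(r, c)] for c in cols] for r in rows]
```

Each cell of `matrix` would be mapped to a color, with larger counts drawn darker or brighter.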

**Visualization:**

<img src="https://tse1.mm.bing.net/th/id/OIP.rfx1V5ADPk2nfUTigAf-bwHaD7?w=768&h=408&rs=1&pid=ImgDetMain&o=7&rm=3" alt="Heatmap" width="500" style="background-color:white;">

---

### 7. Cluster Map

**Definition:**  
A heatmap with dendrograms (tree diagrams) that group similar rows and columns together automatically.

**Interpretation:**  
- Similar categories cluster together  
- Helps discover hidden groupings  
- Color patterns reveal relationships within and between clusters

**Visualization:**

<img src="https://seaborn.pydata.org/_images/clustermap_1_0.png" alt="Cluster Map" width="300" style="background-color:white;">
<img src="https://media.geeksforgeeks.org/wp-content/uploads/20240612184116/Clustered-Heatmap.webp" alt="Cluster Map Example" width="400" style="background-color:white;">

**Note:**  
The dendrograms are built by hierarchical clustering of the rows and columns.

---

### 8. Pair Plot

**Definition:**  
A grid showing scatter plots for every pair of numerical variables, with distributions on the diagonal.

**Interpretation:**  
- View all relationships at once  
- Identify correlated variable pairs  
- Detect outliers  
- Diagonal shows individual distributions

**Visualization:**

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/02/image-528.png" alt="Pair Plot" width="800" style="background-color:white;">


---

## Credits

**Prepared by:**  
**Chetan Sharma**  
VGEC | AIML / Data Science Notes  

🔗 **GitHub:** [github.com/Chetan559](https://github.com/Chetan559)  
🌐 **Portfolio:** [chetan559.github.io](https://chetan559.github.io)  
💼 **LinkedIn:** [linkedin.com/in/sharma-chetan-k](https://www.linkedin.com/in/sharma-chetan-k/)  

These notes were compiled for learning, revision, and academic understanding. 
