# Descriptive Statistics

## In-Class Notes

- Statistics:
    - Inferential Statistics
    - Descriptive Statistics
- Descriptive Statistics: summarizing and visualizing data that has already been collected.
- Measures of central tendency (where the data is centered):
    - Mean ($\mu$) 
    - Median
    - Mode
- Measures of spread/variability (dispersion: how far the data values spread out from the center):
    - Range: The difference between the maximum and minimum value.
        - Max Value - Minimum Value
    - Variance: How much the data varies from the mean value.
        $$
            \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
        $$
    - **Example:**  Suppose we have the following data: 2, 4, 6, 8, 10  
        - Mean ($\mu$) = (2 + 4 + 6 + 8 + 10) / 5 = 6  
        - Variance ($\sigma^2$):  
        $$
        \sigma^2 = \frac{1}{5} \left[(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2\right] = \frac{1}{5} (16 + 4 + 0 + 4 + 16) = \frac{40}{5} = 8
        $$
        - High variance indicates that the data is widely dispersed around the mean. Low variance indicates that the data is clustered close to the mean.
    - Standard Deviation: The square root of variance.
        Standard Deviation ($\sigma$):  
        $$
        \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
        $$
        - **Example:** Using the same data: 2, 4, 6, 8, 10  
            - Mean ($\mu$) = 6  
            - Standard Deviation ($\sigma$):  
                $$
                \sigma = \sqrt{8} \approx 2.83
                $$
        - Interpretation is the same as for variance: it measures the variability in the data.
        - Standard deviation is measured in the same unit as the data, while variance is measured in squared units.
        - This makes standard deviation the more interpretable measure of spread.
        - Standard deviation is not scale-invariant. Adding a constant to every value leaves it unchanged, but multiplying every value by a constant $c$ multiplies the standard deviation by $|c|$. If one dataset is measured in inches and another in centimeters, their standard deviations cannot be compared directly: convert both to a common unit first, or compare a unitless measure such as the coefficient of variation ($\sigma / \mu$).
        - If the data follows a normal distribution, about 68% of the data points fall within one standard deviation of the mean.
            - Example: for normally distributed test scores with a mean ($\mu$) of 50 and a standard deviation ($\sigma$) of 5, approximately 68% of students score between 45 and 55 (i.e., within one standard deviation of the mean). This property gives a quick sense of the spread and concentration of the data.
            - ![image.png](attachment:image.png)
        - Impact of outliers: Outliers have a heavy impact on standard deviation. Because each difference from the mean is squared, an extreme value contributes far more than the other data points. Be mindful that standard deviation naturally gives more weight to extreme values.
    - Interquartile Range (IQR):
        - Position of the first quartile (Q1) = (n + 1) x 1/4
        - Position of the second quartile (Q2), or the median = (n + 1) x 2/4
        - Position of the third quartile (Q3) = (n + 1) x 3/4
        - Here n is the number of data points in your dataset, and the result is the position of the quartile in the sorted dataset. When the position is not a whole number, interpolate between the two neighboring values.
        - Q1 (Lower Quartile): The value below which 25% of the data lies.
        - Q2 (Median): The value below which 50% of the data lies.
        - Q3 (Upper Quartile): The value below which 75% of the data lies.
        - IQR = Q3 - Q1: the range covered by the middle 50% of the data.
        - Lower fence (the "minimum" in a box plot): Q1 - 1.5*IQR
        - Upper fence (the "maximum" in a box plot): Q3 + 1.5*IQR
        - Outliers: Generally, the data points lying below the lower fence or above the upper fence.
        - <img src="1_0MPDTLn8KoLApoFvI0P2vQ.png" alt="alt text" width="500"/>
        - If Q1 is farther from the median than Q3 is, there is greater dispersion among the smaller values of the dataset than among the larger values. This is called quartile skewness.
        - The purpose of quartiles is to give shape to a distribution, primarily indicating whether or not a distribution is skewed.
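
The quartile positions and the 1.5*IQR bounds described above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name `iqr_outliers` and the sample data are made up for this example, and it assumes that `statistics.quantiles` with `method="exclusive"` is an acceptable stand-in for the (n + 1) position rule, interpolating when a position is not a whole number.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return quartiles, IQR, outlier bounds, and outliers (hypothetical helper)."""
    # method="exclusive" places quartile i at position (n + 1) * i / 4
    # in the sorted data and interpolates between neighbors when needed.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "lower_bound": lower_bound,
        "upper_bound": upper_bound,
        "outliers": outliers,
    }


# Made-up data with n = 7, so the quartile positions are 2, 4, and 6.
scores = [1, 3, 5, 7, 9, 11, 40]
summary = iqr_outliers(scores)
print(summary)
```

With seven values the positions land exactly on the 2nd, 4th, and 6th sorted values (3, 7, and 11), so the IQR is 8, the bounds are -9 and 23, and only 40 is flagged as an outlier.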

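Returning to the variance and standard deviation example (2, 4, 6, 8, 10), the hand calculation can be checked with a short Python sketch. It uses the population formulas (dividing by N) from these notes and cross-checks the result against the standard library's `statistics.pvariance` and `statistics.pstdev`; the variable names are only illustrative.

```python
import math
import statistics

data = [2, 4, 6, 8, 10]

# Range: maximum value minus minimum value.
data_range = max(data) - min(data)

# Population mean, variance, and standard deviation (divide by N).
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)

print(data_range)  # 8
print(mean)        # 6.0
print(variance)    # 8.0
print(std_dev)     # about 2.83

# Cross-check against the standard library's population versions.
assert math.isclose(variance, statistics.pvariance(data))
assert math.isclose(std_dev, statistics.pstdev(data))
```

Note that `statistics.variance` and `statistics.stdev` (without the `p` prefix) divide by N - 1 instead, which gives the sample versions rather than the population versions used here.
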
- Measure of distribution (Shape of the data):
    - Normal Distribution (symmetric distribution):
        - Mean, median, and mode are the same.
        - Data will follow a bell curve structure.
        - ![image-2.png](attachment:image-2.png)
    - Skewness
        - Skewness indicates the direction and degree to which the data deviates from a symmetrical bell curve. A distribution with zero skewness is perfectly symmetrical, meaning the left and right sides of the distribution are mirror images. Positive skewness means that the right tail is longer or fatter than the left: most values are relatively low, with a few unusually high values. Negative skewness means that the left tail is longer or fatter: most values are relatively high, with a few unusually low values.
        - Positive Skew (right skew): The tail extends to the right, while the bulk of the data sits on the left side of the distribution. The extreme high values pull the mean to the right, so the mean is typically greater than the median.
        - Negative Skew (left skew): The tail extends to the left, while the bulk of the data sits on the right side of the distribution. The extreme low values pull the mean to the left, so the mean is typically less than the median.
    - Kurtosis: Measure of the tailedness of your data
        - The data points farthest from the mean form the tails on each side of the distribution curve.
        - Tailedness: How much of the data lies away from the central region of the distribution curve.
        - Kurtosis indicates how much data resides in the tails.
        - Types of kurtosis:
            - Mesokurtic (tails like a normal distribution): Kurtosis = 3.0
            - Platykurtic (lighter tails than normal): Kurtosis < 3.0
            - Leptokurtic (heavier tails than normal): Kurtosis > 3.0
        - Not to be confused with peakedness.

    - Kurtosis measures the tailedness of a distribution. Skewness measures the asymmetry of a distribution.
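
Skewness and kurtosis can be computed from central moments. The sketch below assumes the common moment-based (population) definitions, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where m_k is the average of (x - mean)^k. This kurtosis is the version that equals 3.0 for a normal distribution, matching the thresholds above. The helper names and the sample data are made up for illustration, and library routines may apply bias corrections that give slightly different numbers.

```python
import statistics


def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** order for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Pearson kurtosis: m4 / m2 ** 2 (3.0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [2, 4, 6, 8, 10]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # made-up data with one large value

print(skewness(symmetric))     # 0.0: both sides mirror each other
print(kurtosis(symmetric))     # about 1.7, below 3.0 (platykurtic)
print(skewness(right_skewed))  # positive: long right tail

# For the right-skewed data the mean (4.75) exceeds the median (3.0).
print(statistics.fmean(right_skewed), statistics.median(right_skewed))
```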

- Mean: Sum of all values divided by the number of values.

- For normally distributed data the mean and median will be the same value.

- Null value imputation strategies:
    - When the data is normally distributed, the mean and median are (approximately) the same, so either one can be used to fill in null values.
    - If there are outliers present in the data, it is not recommended to impute the missing data with the mean.
        - The mean is calculated from all the values present and hence is influenced by the outliers.
        - It is advised to use the median in such cases.
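
The effect of an outlier on the two imputation strategies can be seen in a small sketch. It assumes missing values are represented as `None` in a plain Python list; the function name `impute` and the data are hypothetical, chosen only to illustrate the point.

```python
import statistics


def impute(values, strategy="median"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


# Made-up data: two missing values and one outlier (500).
incomes = [30, 32, None, 35, 31, 500, None, 33]

print(impute(incomes, "mean"))    # gaps filled with about 110.17
print(impute(incomes, "median"))  # gaps filled with 32.5
```

The mean fill value (about 110) is larger than every observed value except the outlier itself, while the median fill value (32.5) stays close to the typical observations.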

- Z-score is a measure that describes a value's position relative to the mean of a group of values, measured in terms of standard deviations.

The formula for Z-score is:
$$
z = \frac{x - \mu}{\sigma}
$$
where:
- $x$ = the value
- $\mu$ = mean of the dataset
- $\sigma$ = standard deviation

**Example:**  
Suppose the mean test score in a class is 70 with a standard deviation of 10. If a student scored 85, the z-score would be:
$$
z = \frac{85 - 70}{10} = 1.5
$$
This means the student's score is 1.5 standard deviations above the mean.
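
The same calculation as a small Python sketch; the function name `z_score` is just an illustrative choice, and the second part assumes the population standard deviation (`statistics.pstdev`) when standardizing a whole dataset.

```python
import statistics


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


# The test-score example: mean 70, standard deviation 10, score 85.
print(z_score(85, 70, 10))  # 1.5

# Z-scores for every value in a dataset, using its own mean and
# population standard deviation.
data = [2, 4, 6, 8, 10]
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)
z_scores = [z_score(x, mu, sigma) for x in data]
print(z_scores)  # symmetric around 0, from about -1.41 to about 1.41
```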

The empirical rule, also known as the 68-95-99.7 rule, states that for a normal distribution:
- Approximately 68% of the data falls within one standard deviation of the mean.
- Approximately 95% falls within two standard deviations.
- Approximately 99.7% falls within three standard deviations.
    - ![image-3.png](attachment:image-3.png)
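
The three percentages can be reproduced from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# The result does not depend on the particular mean and standard deviation:
# for mean 50 and standard deviation 5, about 68% lies between 45 and 55.
print(round(share_within(1, mu=50, sigma=5), 4))  # 0.6827
```
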
- Transformation of numerical data types:
    - Standardization
        - Min-max scaler
        - Robust scaler
    - Normalization
    - Winsorization
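
The notes only name these transformations, so the sketch below relies on commonly used textbook definitions, which are an assumption here: z-score standardization as (x - mean) / standard deviation, min-max scaling as (x - min) / (max - min), robust scaling as (x - median) / IQR, and winsorization as capping values at chosen percentiles (5th and 95th in this example). All function names and the data are illustrative, and library implementations may use different quartile or percentile conventions.

```python
import statistics


def standardize(values):
    """Z-score scaling: result has mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def min_max_scale(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def robust_scale(values):
    """Center on the median and divide by the IQR (less outlier-sensitive)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return [(x - median) / (q3 - q1) for x in values]


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles instead of dropping them."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, low), high) for x in values]


data = [1, 3, 5, 7, 9, 11, 40]  # made-up data with one outlier

print(standardize(data))
print(min_max_scale(data))
print(robust_scale(data))
print(winsorize(data))  # the 40 is pulled in; middle values are unchanged
```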