# Basics of Statistics

1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.

-> Data can broadly be categorized into two types:

1. Qualitative (Categorical) Data

2. Quantitative (Numerical) Data

> Qualitative Data :

Qualitative data describes qualities or characteristics. It is not measured numerically; instead, observations are grouped into categories based on traits and attributes.

  - Nominal Scale

Definition : Data is categorized without any order or ranking.

Examples:

Gender (Male, Female, Other)

Eye color (Brown, Blue, Green)

Nationality (Indian, American, Canadian)

  - Ordinal Scale :

Definition : Data is categorized with a meaningful order, but the differences between ranks are not necessarily equal or measurable.

Examples:

Movie ratings (Poor, Fair, Good, Excellent)

Education level (High School, Bachelor's, Master's, PhD)

Customer satisfaction (Dissatisfied, Neutral, Satisfied)

> Quantitative Data

Quantitative data refers to numerical values that can be measured or counted.

  - Interval Scale

Definition: Numerical data with equal intervals between values, but no true zero point.

Examples:

Temperature in Celsius or Fahrenheit

Dates (e.g., years like 1990, 2000)

IQ scores

  - Ratio Scale

Definition: Numerical data with equal intervals and a true zero point, allowing for meaningful ratios.

Examples:

Weight (kg, lb)

Height (cm, inches)

Income, age, distance, time duration

#2.  What are the measures of central tendency, and when should you use each? Discuss the mean, median and mode with examples and situations where each is appropriate.

-> These are values that represent the center or typical value of a dataset. The three main measures are:

1. Mean
2. Median
3. Mode


#1. Mean (Average)

Definition:
The sum of all values divided by the number of values.

Formula:

    Mean= ∑x / n

Example:
If test scores are: 70, 75, 80, 85, 90

    Mean = (70+75+80+85+90) / 5

    Mean= 400 / 5

    Mean = 80

Use When:

  - Data is numerical and roughly symmetric (not heavily skewed)

  - No extreme outliers (which can distort the average)

Real-Life Example:

  - Calculating the average marks in a class

  - Average income (when data is not skewed)

#2. Median

Definition:
The middle value when the data is sorted in order.

How to Find:

  - Odd number of values → Middle value

  - Even number of values → Average of two middle values


Example:

```
Data: 10, 20, 30, 100, 1000
Median = 30 (middle value)
```



Use When:

  - Data has outliers or is skewed

  - You want the central position rather than the arithmetic average

Real-Life Example:

  - Median household income (avoids distortion from very rich individuals)

  - Home prices in real estate  

#3. Mode

Definition:
The value that appears most frequently in a dataset.

Example:

    Data: 3, 4, 4, 4, 5, 6, 6
    Mode = 4

Use When:

  - You want the most common or frequent item

  - Data is categorical or discrete

Real-Life Example:

  - Most popular shoe size in a store

  - Most common exam grade in a class

  - Favorite color in a survey
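The three measures can be checked against the example values used in this answer. This is a minimal sketch using Python's built-in `statistics` module; the variable names are only illustrative. It also shows how a single extreme value pulls the mean away from the median.

```python
import statistics

scores = [70, 75, 80, 85, 90]      # test scores from the mean example
skewed = [10, 20, 30, 100, 1000]   # data from the median example (one extreme value)
values = [3, 4, 4, 4, 5, 6, 6]     # data from the mode example

mean_scores = statistics.mean(scores)       # 400 / 5 = 80
median_skewed = statistics.median(skewed)   # middle value = 30
mean_skewed = statistics.mean(skewed)       # 1160 / 5 = 232, pulled up by 1000
mode_values = statistics.mode(values)       # most frequent value = 4

print(mean_scores, median_skewed, mean_skewed, mode_values)
```

For the skewed data the mean (232) is far from the median (30), which is why the median is preferred when outliers are present.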

#3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?

-> In statistics, dispersion refers to the extent to which data points in a dataset are spread out or clustered around the mean. Variance and standard deviation are key measures of dispersion, quantifying how much data deviates from the average, with standard deviation being the square root of variance and expressed in the same units as the original data.

- Dispersion:
  
  - It's a measure of how spread out or varied a set of data is.
  - It complements measures of central tendency (like mean, median, and mode) by providing insight into the data's variability.
  - A high dispersion indicates that data points are widely scattered, while a low dispersion means data points are clustered closely around the mean.

-  Variance:

  - It's calculated by finding the average of the squared differences between each data point and the mean (dividing by n for a whole population, or by n − 1 when estimating from a sample).
  - Squaring the differences emphasizes larger deviations from the mean, giving more weight to outliers.
  - A larger variance indicates a greater spread of data, while a smaller variance suggests data points are closer to the mean.

- Standard Deviation:
  
  - It's the square root of the variance.
  - Unlike variance, which is expressed in squared units, standard deviation is in the same units as the original data, making it easier to interpret.
  - A larger standard deviation means the data is more spread out, and a smaller standard deviation indicates data points are clustered more tightly around the mean.

  
  Relationship between Variance and Standard Deviation:

  - Standard deviation is a more intuitive measure of dispersion than variance because it's in the same units as the original data.
  - Variance is a crucial intermediate step in calculating standard deviation, and it's used in various statistical analyses.
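As a small worked sketch, the calculation can be done step by step on the test scores 70, 75, 80, 85, 90 (reused here only as sample data). Both the population form (divide by n) and the sample form (divide by n − 1) are shown; which one applies depends on whether the data is the whole population or a sample from it.

```python
import math
import statistics

scores = [70, 75, 80, 85, 90]
mean = statistics.fmean(scores)  # 80.0

# Squared deviations from the mean: 100, 25, 0, 25, 100
squared_devs = [(x - mean) ** 2 for x in scores]

# Population variance and standard deviation (divide by n)
pop_variance = sum(squared_devs) / len(scores)   # 250 / 5 = 50.0
pop_std = math.sqrt(pop_variance)                # about 7.07

# Sample variance and standard deviation (divide by n - 1)
sample_variance = sum(squared_devs) / (len(scores) - 1)  # 250 / 4 = 62.5
sample_std = math.sqrt(sample_variance)                  # about 7.91

print(pop_variance, pop_std, sample_variance, sample_std)
```

The standard deviations are in the same units as the scores (points), while the variances are in squared points.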

#4. What is a box plot, and what can it tell you about the distribution of data?

-> A box plot (aka box and whisker plot) uses boxes and lines to depict the distributions of one or more groups of numeric data. Box limits indicate the range of the central 50% of the data, with a central line marking the median value. Lines extend from each box to capture the range of the remaining data, with dots placed past the line edges to indicate outliers.

A Box Plot Tells You :

1. Spread of the Data

  - The length of the box (Q3 − Q1) represents the interquartile range (IQR) → middle 50% of the data.

  - Longer boxes or whiskers indicate greater variability.

2. Center of the Data

  - The line inside the box shows the median → a key measure of central tendency.

3. Skewness (Symmetry)

  - If the median is centered in the box and whiskers are of equal length, the data is symmetrical.

  - If the median is closer to Q1 or Q3, or if one whisker is longer, it suggests skewness (left or right).

4. Outliers

  - Dots or asterisks outside the whiskers represent outliers → values far from the rest of the data.

Why Use a Box Plot?

  - Quick summary of large datasets

  - Visual comparison between groups (e.g., comparing scores across different classes)

  - Easily detect skewness, spread, and outliers  

#5.  Discuss the role of random sampling in making inferences about populations.

-> Random sampling is crucial for making valid inferences about a population because it gives every member a known chance of being selected. This makes the sample likely to be representative, minimizes selection bias, and allows generalizations about the larger group based on the sample's characteristics.

Example : A researcher randomly selects patients from a hospital to study the effectiveness of a new treatment.

Types of Random Sampling:

- Simple Random Sampling: Every member of the population has an equal chance of being selected.
- Stratified Random Sampling: The population is divided into subgroups (strata), and then a random sample is taken from each stratum.
- Cluster Sampling: The population is divided into clusters, a random sample of clusters is selected, and the members of the selected clusters are then studied.


How It Helps in Making Inferences

 1. Reduces Bias

  - Random sampling prevents systematic errors.

  - Every group or type within the population has a fair shot at being selected.

 2. Enables Generalization

  - If the sample is random and representative, we can extend conclusions from the sample to the entire population.

 3. Supports Valid Statistical Tests
  
  - Many statistical techniques (e.g., hypothesis testing, confidence intervals) assume randomness in sampling for the results to be valid.

 4. Ensures Diversity in Data

  - Random selection tends to capture different segments of the population, especially with larger samples, leading to more accurate and complete insights.
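Simple and stratified random sampling can be sketched with the standard `random` module. Everything in this example is hypothetical: the `patients` list, the `ward` field, and the helper function names are made up for illustration, and the fixed seed is only there to make the run repeatable.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k units so that every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Split the population into strata, then sample the same fraction from each."""
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 100 patients, 60 in ward A and 40 in ward B
patients = [{"id": i, "ward": "A" if i < 60 else "B"} for i in range(100)]

srs = simple_random_sample(patients, 10, seed=0)
strat = stratified_sample(patients, lambda p: p["ward"], 0.1, seed=0)

print(len(srs), len(strat))
```

The stratified sample always contains 6 patients from ward A and 4 from ward B, while the simple random sample may over- or under-represent a ward by chance.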

#6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?

-> Skewness measures the asymmetry of a data distribution, indicating whether the data is spread out more on one side than the other. It can be positive (right-skewed), negative (left-skewed), or zero (symmetrical).

- A symmetric, unimodal distribution has mean = median = mode
- A skewed distribution is not symmetric: the tail on one side is longer or fatter than the other

Types of Skewness :

1. Positive Skew (Right-Skewed)
  
  - Tail is longer on the right

  - Typically, Mean > Median > Mode

  - Most values are clustered on the lower end, with a few large outliers

Example:

Income distribution: Most people earn a modest amount, but a few earn extremely high salaries

2. Negative Skew (Left-Skewed)

  - Tail is longer on the left

  - Typically, Mean < Median < Mode

  - Most values are clustered on the higher end, with a few small outliers

Example:

Test scores where most students score high, but a few score very low

3. Zero Skew (Symmetrical)

  - Data is distributed symmetrically around the center

  - Mean = Median = Mode

  - Classic example: Normal distribution (bell curve)

How Skewness Affects Interpretation :

 1. Mean is Sensitive to Skewness

  - Mean gets pulled in the direction of the tail

  - Right-skew → mean is higher than the median

  - Left-skew → mean is lower than the median

 2. Median is More Robust

  - Median is less affected by extreme values, making it more reliable in skewed data

 3. Impacts Statistical Analysis

  - Skewed data can violate assumptions of statistical tests (like normality in t-tests)

  - May require data transformation (e.g., log transformation) or use of non-parametric tests
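One common way to put a number on skewness is the moment coefficient, m3 / m2^1.5, where m2 and m3 are the average squared and cubed deviations from the mean. The sketch below assumes that definition (other definitions with sample-size corrections exist) and uses small made-up datasets.

```python
import statistics

def skewness(data):
    """Moment coefficient of skewness (population form): m3 / m2**1.5."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

right_skewed = [10, 20, 30, 100, 1000]      # long tail on the right
symmetric = [70, 75, 80, 85, 90]            # evenly spaced around 80
left_skewed = [-x for x in right_skewed]    # mirror image, long tail on the left

for name, data in [("right", right_skewed), ("symmetric", symmetric), ("left", left_skewed)]:
    print(name, skewness(data), statistics.fmean(data), statistics.median(data))
```

The sign of the result matches the direction of the tail, and in the right-skewed data the mean sits above the median, as described above.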


#7. What is the interquartile range (IQR), and how is it used to detect outliers?

-> The Interquartile Range (IQR) is a measure of statistical dispersion.
It represents the range of the middle 50% of the data.

`Formula: IQR = Q3 - Q1`

Where:

  - Q1 (First Quartile): 25th percentile (the value below which 25% of the data lies)

  - Q3 (Third Quartile): 75th percentile (the value below which 75% of the data lies)

### What Does IQR Tell You?

  - It shows how spread out the middle values are

  - It is resistant to outliers (unlike the range)

  - It helps you understand data concentration and variability

Outliers are data points that lie far outside the typical range of the data.

### Outlier Rules Using IQR:

A data point is considered an outlier if it falls:

  - Below:  Q1 - 1.5 * IQR

  - Above: Q3 + 1.5 * IQR  
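The 1.5 × IQR rule can be sketched as follows. Quartile conventions differ between textbooks and tools; this example assumes the "inclusive" method of `statistics.quantiles`, and the function name `iqr_outliers` is just illustrative.

```python
import statistics

def iqr_outliers(data):
    """Return Q1, Q3, IQR, the (lower, upper) fences, and the values outside them."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return q1, q3, iqr, (lower, upper), outliers

data = [10, 20, 30, 100, 1000]
q1, q3, iqr, fences, outliers = iqr_outliers(data)

# q1 = 20.0, q3 = 100.0, iqr = 80.0, fences = (-100.0, 220.0), outliers = [1000]
print(q1, q3, iqr, fences, outliers)
```

Only 1000 lies above the upper fence of 220, so it is the one value flagged as an outlier.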

#8.  Discuss the conditions under which the binomial distribution is used.

-> The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes:

a) Success (e.g., heads, pass, win)

b) Failure (e.g., tails, fail, lose)

To use the binomial distribution, the following four key conditions must be met:

1. Fixed Number of Trials (n)

  - The number of experiments or observations (n) is set in advance.
  - Example: Flipping a coin 10 times.

2. Two Possible Outcomes (Success/Failure)

  - Each trial results in just one of two outcomes.
  - You must be able to classify each outcome as a "success" or a "failure".
  - Examples:

    Success = getting heads; Failure = getting tails

    Success = passing an exam; Failure = failing the exam  

3. Constant Probability of Success (p)

  - The probability of success remains the same for every trial.
  - For example, in a fair coin toss, P(heads) = 0.5 for each toss.

4. Independent Trials

  - The outcome of one trial does not affect the outcome of any other trial.
  - Example: one coin flip has no influence on the next.
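When all four conditions hold, the probability of exactly k successes in n trials is P(X = k) = C(n, k) × p^k × (1 − p)^(n − k). A minimal sketch for the coin example (10 flips of a fair coin), with an illustrative function name:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 flips of a fair coin
p_five_heads = binomial_pmf(5, 10, 0.5)   # 252 / 1024 = 0.24609375

# Probabilities over all possible counts (0 to 10 heads) add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(p_five_heads, total)
```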



#9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).

-> The normal distribution is a bell-shaped, symmetric probability distribution that is widely used in statistics to model real-world data.

Some of the important properties of the normal distribution are listed below:

- In a normal distribution, the mean, median, and mode are equal (i.e., Mean = Median = Mode).
- The total area under the curve is equal to 1.
- The curve is symmetric about the centre (the mean).
- Exactly half of the values lie to the right of the centre and exactly half lie to the left.
- The distribution is fully defined by its mean and standard deviation.
- The curve has only one peak (i.e., it is unimodal).
- The tails approach the x-axis as they extend farther from the mean, but they never touch it.

### Empirical Rule (68-95-99.7 Rule)

In statistics, the 68-95-99.7 rule, also known as the empirical rule, and sometimes abbreviated 3sr or 3σ, is a shorthand used to remember the percentage of values that lie within an interval estimate in a normal distribution: approximately 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively.
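The three percentages can be checked numerically with `statistics.NormalDist` from the standard library. This sketch uses the standard normal distribution (mean 0, standard deviation 1); the same coverage holds for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def within_k_sigma(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: within_k_sigma(k) for k in (1, 2, 3)}

# Roughly 0.6827, 0.9545 and 0.9973
for k, prob in coverage.items():
    print(k, round(prob, 4))
```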

#10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.

-> Real-Life Example :  Calls per Hour at a Call Center

Call centers use the Poisson distribution to model the number of expected calls per hour that they’ll receive so they know how many call center reps to keep on staff.

For example, suppose a given call center receives an average of 10 calls per hour (λ = 10). The Poisson formula P(X = k) = e^(−λ) × λ^k / k!, or a Poisson distribution calculator, gives the probability that the call center receives 0, 1, 2, 3, … calls in a given hour:

P(X = 0 calls) = 0.00005

P(X = 1 call) = 0.00045

P(X = 2 calls) = 0.00227

P(X = 3 calls) = 0.00757

And so on.

This gives call center managers an idea of how many calls they’re likely to receive per hour and enables them to manage employee schedules based on the number of expected calls.
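The probabilities listed above can be reproduced from the Poisson formula P(X = k) = e^(−λ) × λ^k / k! with λ = 10. A minimal sketch (the function name is illustrative):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average number of events is lam."""
    return exp(-lam) * lam ** k / factorial(k)

lam = 10  # average calls per hour
probs = {k: round(poisson_pmf(k, lam), 5) for k in range(4)}

# 0 -> 0.00005, 1 -> 0.00045, 2 -> 0.00227, 3 -> 0.00757
for k, p in probs.items():
    print(f"P(X = {k}) = {p:.5f}")
```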

#11. Explain what a random variable is and differentiate between discrete and continuous random variables.

-> A random variable is a numerical value assigned to the outcome of a random experiment. It helps translate outcomes (like "heads" or "tails") into numbers (like 0 or 1) so we can apply mathematical analysis.

Types of Random Variables :

###1. Discrete Random Variable

  - Takes on a countable number of distinct values
  - Often involves counting outcomes

  Examples:

    - Number of students in a class (0, 1, 2, ...)
    - Number of heads in 3 coin tosses (0, 1, 2, 3)
    - Number of emails received per day

  Key Features:

    - Values can be listed
    - Often used in binomial, Poisson, or geometric distributions

###2. Continuous Random Variable

  - Can take any value within a range (uncountably many possible values)
  - Involves measuring something

  Examples:

    - Height of students (e.g., 160.5 cm)
    - Time it takes to complete a test
    - Temperature in a day

  Key Features:

    - Values cannot be listed one by one
    - Any value in an interval is possible
    - Often used in normal, exponential, or uniform distributions  
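To make the idea of a random variable concrete, the sketch below treats "number of heads in 3 coin tosses" as a function that maps each outcome to a number, then lists the probability of each value. It assumes a fair coin, so all 8 outcomes are equally likely; the names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 8 sequences of three tosses, e.g. ('H', 'T', 'H')
outcomes = list(product("HT", repeat=3))

def heads(outcome):
    """The random variable X: maps an outcome to its number of heads."""
    return outcome.count("H")

# Count how many outcomes give each value of X, then convert to probabilities
counts = Counter(heads(o) for o in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

# X takes the values 0, 1, 2, 3 with probabilities 1/8, 3/8, 3/8, 1/8
for x, p in pmf.items():
    print(x, p)
```

Because X has only four possible values that can be listed, it is a discrete random variable. A continuous variable such as height cannot be tabulated this way; probabilities are assigned to intervals instead.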

#12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.

-> Example Dataset:
Imagine you're analyzing the relationship between the number of hours students study and their exam scores. Here's a sample dataset:

    Hours Studied (X)       Exam Score (Y)

      2                         60
      4                         75
      6                         85
      8                         90
      10                        95


1. Calculate the Covariance:
  
  - Step 1: Calculate the mean of each variable:
  
  Mean of X (Hours Studied): (2+4+6+8+10)/5 = 6
  
  Mean of Y (Exam Score): (60+75+85+90+95)/5 = 405/5 = 81
  
  - Step 2: Calculate the covariance:
  
  Cov(X,Y) = Σ[(Xᵢ - Mean(X)) × (Yᵢ - Mean(Y))] / (n - 1)
  
  Cov(X,Y) = [(2-6)×(60-81) + (4-6)×(75-81) + (6-6)×(85-81) + (8-6)×(90-81) + (10-6)×(95-81)] / (5-1)
  
  Cov(X,Y) = [(-4 × -21) + (-2 × -6) + (0 × 4) + (2 × 9) + (4 × 14)] / 4
  
  Cov(X,Y) = (84 + 12 + 0 + 18 + 56) / 4
  
  Cov(X,Y) = 170 / 4 = 42.5
  
2. Calculate the Correlation (Pearson's Correlation Coefficient):
  
  - Step 1: Calculate the standard deviation of each variable:
  
  Standard Deviation of X (Hours Studied): √[Σ((Xᵢ - Mean(X))²) / (n - 1)]
  
  Standard Deviation of X = √[((2-6)² + (4-6)² + (6-6)² + (8-6)² + (10-6)²) / (5-1)]
  
  Standard Deviation of X = √[(16 + 4 + 0 + 4 + 16) / 4] = √[40/4] = √10 ≈ 3.16
  
  Standard Deviation of Y (Exam Score): √[Σ((Yᵢ - Mean(Y))²) / (n - 1)]
  
  Standard Deviation of Y = √[((60-81)² + (75-81)² + (85-81)² + (90-81)² + (95-81)²) / (5-1)]
  
  Standard Deviation of Y = √[(441 + 36 + 16 + 81 + 196) / 4] = √[770/4] = √192.5 ≈ 13.87
  
  - Step 2: Calculate the correlation coefficient:
  
  Correlation(X, Y) = Cov(X, Y) / (Standard Deviation of X × Standard Deviation of Y)
  
  Correlation(X, Y) = 42.5 / (√10 × √192.5) = 42.5 / √1925
  
  Correlation(X, Y) ≈ 42.5 / 43.87 ≈ 0.97

3. Interpretation:

  - Covariance:

The covariance of 42.5 indicates a positive relationship between hours studied and exam scores. As hours studied increases, exam scores tend to increase as well. The positive sign indicates that the variables move in the same direction. Its magnitude depends on the units of the two variables (hours × points), so on its own it does not show how strong the relationship is.

  - Correlation:

The correlation coefficient of approximately 0.97 indicates a strong positive linear relationship between hours studied and exam scores. Pearson's correlation always lies between -1 and +1, and the closer it is to +1, the stronger the positive linear relationship.
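The hand calculation can be cross-checked in code. This is a minimal sketch using the sample (n − 1) formulas on the hours/score table from this answer; the function names are illustrative. Since Pearson's correlation coefficient can never fall outside the range -1 to +1, a result outside that range signals an arithmetic slip.

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [60, 75, 85, 90, 95]

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    std_x = math.sqrt(sample_covariance(xs, xs))  # covariance of a variable with itself is its variance
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)

cov_xy = sample_covariance(hours, scores)   # 170 / 4 = 42.5
r_xy = pearson_r(hours, scores)             # about 0.97

print(cov_xy, round(r_xy, 2))
```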