## Q1

The three measures of central tendency are:


1.   Mean
2.   Median
3.   Mode



## Q2

The **mean**, **median**, and **mode** are measures of central tendency:

1. **Mean**: The average of all values: \( \text{Sum of values} / \text{Count} \).  
   - **Use**: For symmetric data.  
   - **Limitation**: Sensitive to outliers.

2. **Median**: The middle value when data is ordered.  
   - **Use**: For skewed data or outliers.  
   - **Limitation**: Ignores the magnitudes of all values except the middle one(s).

3. **Mode**: The most frequent value(s).  
   - **Use**: For categorical or repeated data.  
   - **Limitation**: May not exist or be unique.


## Q3

In [2]:
import numpy as np
from scipy import stats
arr = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [3]:
median = np.median(arr)
mean = np.mean(arr)
mode = stats.mode(arr)

In [4]:
ans = {
    'mean':mean,
    'median':median,
    'mode':mode
}

In [5]:
ans

{'mean': 177.01875, 'median': 177.0, 'mode': ModeResult(mode=177.0, count=3)}
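One caveat about the mode reported above: in this dataset both 177 and 178 occur three times, and `scipy.stats.mode` reports only one of the tied values (here 177). A minimal sketch, using the standard-library `statistics.multimode` (an alternative that the original cells do not use), lists every tied value:

```python
from collections import Counter
from statistics import multimode

arr = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
       178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value that shares the highest count,
# in order of first appearance in the data.
modes = multimode(arr)
top_count = Counter(arr)[modes[0]]

print(modes)      # [178, 177]
print(top_count)  # 3
```

So the data is bimodal, and a single reported mode should be read with that in mind.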

## Q4

Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [6]:
arr = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [7]:
np.std(arr)

1.7885814036548633
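Note that `np.std` divides by \( n \) by default (population standard deviation). If the 16 values are treated as a sample, the divisor becomes \( n-1 \), which in NumPy corresponds to `np.std(arr, ddof=1)`. The sketch below shows both variants with the standard-library `statistics` module, chosen here only to keep the example dependency-free:

```python
from statistics import pstdev, stdev

arr = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
       178.9, 176.2, 177, 172.5, 178, 176.5]

pop_sd = pstdev(arr)     # divides by n, like np.std(arr)
sample_sd = stdev(arr)   # divides by n - 1, like np.std(arr, ddof=1)
ratio = sample_sd / pop_sd  # equals sqrt(n / (n - 1))

print(round(pop_sd, 4))     # 1.7886
print(round(sample_sd, 4))  # 1.8472
```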

## Q5

Measures of dispersion such as **range**, **variance**, and **standard deviation** describe how spread out the data values are around the central tendency. Here's how each works:



### **1. Range**
- **Definition**: The difference between the maximum and minimum values in the dataset.
  \[
  \text{Range} = \text{Max} - \text{Min}
  \]
- **Usage**: Measures the total spread of the data but ignores intermediate values and is sensitive to outliers.
- **Example**:
  - Dataset: \( \{2, 4, 6, 8, 100\} \)
  - Range: \( 100 - 2 = 98 \)



### **2. Variance**
- **Definition**: The average of the squared differences between each data point and the mean. It quantifies how much the data varies.
  \[
  \text{Variance} = \frac{\sum (x_i - \text{mean})^2}{n}
  \]
  (For a population; divide by \( n-1 \) for a sample.)
- **Usage**: Provides a measure of spread but in squared units.
- **Example**:
  - Dataset: \( \{2, 4, 6\} \)
  - Mean: \( 4 \)
  - Variance: \( \frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3} = \frac{8}{3} \approx 2.67 \)



### **3. Standard Deviation (SD)**
- **Definition**: The square root of variance, representing the spread in the same units as the data.
  \[
  \text{SD} = \sqrt{\text{Variance}}
  \]
- **Usage**: Indicates the typical distance of data points from the mean.
- **Example**:
  - Dataset: \( \{2, 4, 6\} \), population variance: \( \frac{8}{3} \approx 2.67 \)
  - Standard Deviation: \( \sqrt{2.67} \approx 1.63 \)
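The three definitions can be checked numerically. The sketch below uses the population formulas (divide by \( n \)) from the standard-library `statistics` module on the same small datasets as the examples:

```python
from statistics import pstdev, pvariance

spread_data = [2, 4, 6, 8, 100]
data_range = max(spread_data) - min(spread_data)  # max - min

small = [2, 4, 6]
pop_var = pvariance(small)  # sum of squared deviations / n = 8 / 3
pop_sd = pstdev(small)      # square root of the population variance

print(data_range)         # 98
print(round(pop_var, 2))  # 2.67
print(round(pop_sd, 2))   # 1.63
```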



### **Comparison and Applications**
| Measure           | Pros                                      | Cons                                      | Best Used For                       |
|--------------------|-------------------------------------------|-------------------------------------------|-------------------------------------|
| **Range**         | Simple and quick                         | Ignores intermediate data; affected by outliers | General spread estimation          |
| **Variance**      | Considers all data points                | Hard to interpret due to squared units    | Detailed spread analysis            |
| **Standard Deviation** | Same units as data; widely used        | Sensitive to outliers                     | General measure of average deviation |




## Q6

A Venn diagram is a visual representation of relationships between different sets. It uses overlapping circles to show commonalities (intersection) and differences (non-overlapping areas) among sets.
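The regions of a two-set Venn diagram map directly onto set operations. In the sketch below, the sets `A` and `B` are hypothetical values chosen only for illustration:

```python
# Hypothetical example sets, chosen only for illustration.
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # the overlapping region of the two circles
union = A | B         # everything covered by either circle
only_a = A - B        # the part of A outside the overlap

print(sorted(intersection))  # [2, 6]
print(sorted(union))         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
print(sorted(only_a))        # [3, 4, 5, 7]
```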



## Q7

1.  {2,6}
2.  {0,2,3,4,5,6,7,8,10}

## Q8

**Skewness** in data refers to the asymmetry of a dataset's distribution:

1. **Symmetric**: Skewness = 0. The data is evenly distributed around the mean (e.g., normal distribution).  
2. **Positive Skew (Right Skew)**: Skewness > 0. Tail extends to the right. Typically Mean > Median.  
3. **Negative Skew (Left Skew)**: Skewness < 0. Tail extends to the left. Typically Mean < Median.  
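As a rough illustration, skewness can be computed as the third central moment divided by the second central moment raised to the power 1.5. The helper `skewness` and the three datasets below are assumptions made for this sketch, not part of the original answer:

```python
from statistics import fmean, median

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (population moments)."""
    mu = fmean(values)
    n = len(values)
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [2, 4, 6, 8, 100]             # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right", right_skewed),
                   ("left", left_skewed)]:
    # Print the skewness next to the mean and median for comparison.
    print(name, round(skewness(data), 2), fmean(data), median(data))
```

The sign of the result follows the direction of the tail, and the mean sits on the same side of the median as the tail in these examples.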


## Q9

If a dataset is right-skewed (positively skewed):

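The mean will typically be greater than the median, because the long right tail pulls the mean upward while the median stays near the bulk of the data.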

## Q10

**Covariance** and **correlation** are both measures of the relationship between two variables, but they differ in scale and interpretation:

---

### **1. Covariance**
- **Definition**: Measures the direction of the linear relationship between two variables.  
  \[
  \text{Covariance} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n}
  \]
  (For a population; divide by \( n-1 \) for a sample.)
- **Range**: Can take any value (positive, negative, or zero).  
- **Interpretation**:
  - Positive covariance: Variables move in the same direction.  
  - Negative covariance: Variables move in opposite directions.  
  - Zero covariance: No linear relationship.  
- **Limitation**: Magnitude depends on the scale of variables, making comparison difficult.

---

### **2. Correlation**
- **Definition**: Standardized measure of the strength and direction of the linear relationship between two variables.  
  \[
  \text{Correlation} (r) = \frac{\text{Covariance}(x, y)}{\sigma_x \sigma_y}
  \]
  (\( \sigma_x, \sigma_y \) are the standard deviations of \( x \) and \( y \)).  
- **Range**: Always between -1 and 1.
  - \( r = 1 \): Perfect positive correlation.
  - \( r = -1 \): Perfect negative correlation.
  - \( r = 0 \): No linear relationship.  
- **Interpretation**:
  - Indicates both strength and direction, irrespective of scale.  
- **Advantage**: Easier to interpret and compare.

---

### **Usage in Statistical Analysis**:
- **Covariance**: Helps identify the direction of relationships. Often a precursor to correlation.  
- **Correlation**: Quantifies relationships for better comparison and is widely used in fields like finance, machine learning, and research.

**Example**:
- Two variables: Study hours and exam scores (illustrative values).  
  - Covariance: \( 20 \) (positive, units depend on the variables).  
  - Correlation: \( 0.8 \) (strong positive relationship).
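The scale dependence of covariance, and the scale independence of correlation, can be seen in a small sketch. The study-hours and exam-score values below are hypothetical, and the helper functions are written out by hand using the population formulas given above:

```python
from math import sqrt
from statistics import fmean

def population_covariance(xs, ys):
    """Average product of deviations from the means (divides by n)."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the population standard deviations."""
    sx = sqrt(population_covariance(xs, xs))
    sy = sqrt(population_covariance(ys, ys))
    return population_covariance(xs, ys) / (sx * sy)

hours = [1, 2, 3, 4, 5]            # hypothetical study hours
scores = [52, 58, 65, 70, 80]      # hypothetical exam scores
minutes = [h * 60 for h in hours]  # same data, different unit

cov_hours = population_covariance(hours, scores)
cov_minutes = population_covariance(minutes, scores)
r_hours = correlation(hours, scores)
r_minutes = correlation(minutes, scores)

print(round(cov_hours, 2), round(cov_minutes, 2))  # 13.6 816.0
print(round(r_hours, 3), round(r_minutes, 3))      # 0.994 0.994
```

Changing the unit from hours to minutes multiplies the covariance by 60 but leaves the correlation unchanged.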

## Q11

### **Formula for Sample Mean**:
The sample mean (\( \bar{x} \)) is calculated as:  
\[
\bar{x} = \frac{\sum x_i}{n}
\]
Where:  
- \( x_i \): Each data point in the sample  
- \( n \): Total number of data points in the sample  

---

### **Example Calculation**:
#### Dataset: \( \{5, 10, 15, 20, 25\} \)

1. **Step 1**: Calculate the sum of all data points:  
   \[
   \sum x_i = 5 + 10 + 15 + 20 + 25 = 75
   \]

2. **Step 2**: Count the number of data points:  
   \[
   n = 5
   \]

3. **Step 3**: Divide the sum by the number of data points:  
   \[
   \bar{x} = \frac{\sum x_i}{n} = \frac{75}{5} = 15
   \]

---

### **Result**: The sample mean is \( \bar{x} = 15 \).
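The same three steps can be written as a short Python check:

```python
sample = [5, 10, 15, 20, 25]

total = sum(sample)      # Step 1: sum of all data points
n = len(sample)          # Step 2: number of data points
sample_mean = total / n  # Step 3: divide the sum by the count

print(total, n, sample_mean)  # 75 5 15.0
```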

## Q12

---

### **1. Perfectly Normal Distribution**
- **Relationship**: Mean = Median = Mode.  
- In a perfectly normal distribution, the data is symmetric around the central peak, and all three measures of central tendency coincide.

### **2. Right-Skewed Distribution (Positive Skew)**
- **Relationship**: Typically Mode < Median < Mean.  
- The long tail on the right pulls the mean higher than the median, while the mode remains the smallest.


### **3. Left-Skewed Distribution (Negative Skew)**
- **Relationship**: Typically Mode > Median > Mean.  
- The long tail on the left pulls the mean lower than the median, while the mode remains the largest.
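These orderings can be illustrated on small datasets. The values below are hypothetical, picked so that each dataset has a single mode; real data can deviate from this rule of thumb:

```python
from statistics import fmean, median, mode

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]  # long tail to the right
left_skewed = [16 - x for x in right_skewed]    # mirror image: long tail to the left

right_stats = (mode(right_skewed), median(right_skewed), fmean(right_skewed))
left_stats = (mode(left_skewed), median(left_skewed), fmean(left_skewed))

print(right_stats)  # (2, 3.0, 4.6)   -> mode < median < mean
print(left_stats)   # (14, 13.0, 11.4) -> mode > median > mean
```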



## Q13
- **Covariance**: Indicates the **direction** of the linear relationship between two variables (positive, negative, or no relationship). However, it does not quantify the strength of the relationship and is influenced by the scale of the variables.  

- **Correlation**: Measures both the **direction** and the **strength** of the linear relationship between two variables. It is scale-independent and ranges between -1 and 1, making it easier to interpret and compare across datasets.  

### **In short**:  
"Covariance indicates the direction of the linear relationship between two random variables, while correlation provides both the direction and the strength of the relationship."

## Q14

Outliers are data points that significantly differ from the rest of the dataset. They can affect both **measures of central tendency** (mean, median, mode) and **measures of dispersion** (range, variance, standard deviation).

### **Impact on Measures of Central Tendency**:
1. **Mean**:  
   - **Effect**: Outliers can **skew the mean** significantly because the mean is sensitive to extreme values. A single large or small outlier can pull the mean toward it, making it unrepresentative of the dataset.
   - **Example**:  
     - Dataset: \( \{2, 4, 6, 8, 100\} \)  
     - Mean: \( \frac{2 + 4 + 6 + 8 + 100}{5} = 24 \)  
     - The outlier 100 increases the mean, making it much higher than the center of most values.

2. **Median**:  
   - **Effect**: The median is **less affected by outliers** because it depends only on the middle position of the ordered data, not on the magnitude of the extreme values.
   - **Example**:  
     - Dataset: \( \{2, 4, 6, 8, 100\} \)  
     - Median: \( 6 \) (middle value)  
     - The outlier 100 does not change the median.

3. **Mode**:  
   - **Effect**: The mode is usually **unaffected by outliers**, unless the outlier is a frequent value, in which case it may become the new mode.
   - **Example**:  
     - Dataset: \( \{2, 4, 6, 8, 100\} \)  
     - Mode: No mode (no repetition)  
     - The outlier 100 does not change the mode.

---

### **Impact on Measures of Dispersion**:
1. **Range**:  
   - **Effect**: The range is highly sensitive to outliers because it depends on the **maximum** and **minimum** values in the dataset. Outliers increase the range, making it appear more spread out.
   - **Example**:  
     - Dataset: \( \{2, 4, 6, 8, 100\} \)  
     - Range: \( 100 - 2 = 98 \)  
     - The outlier 100 increases the range drastically.

2. **Variance and Standard Deviation**:  
   - **Effect**: Both variance and standard deviation are **sensitive to outliers** because they are based on squared differences from the mean. Outliers increase the squared deviations, leading to higher variance and standard deviation.
   - **Example**:  
     - Dataset: \( \{2, 4, 6, 8, 100\} \)  
     - Mean: \( 24 \)  
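     - Population variance: \( \frac{(-22)^2 + (-20)^2 + (-18)^2 + (-16)^2 + 76^2}{5} = 1448 \), so the standard deviation is \( \sqrt{1448} \approx 38.05 \).  
     - Without the outlier, \( \{2, 4, 6, 8\} \) has variance \( 5 \) and standard deviation \( \approx 2.24 \), so the single value 100 inflates both measures dramatically.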
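To see these effects side by side, the sketch below summarizes the example dataset with and without the outlier. The helper `summarize` is a hypothetical name introduced for this illustration, and the standard deviation is the population version:

```python
from statistics import fmean, median, pstdev

def summarize(values):
    """Return the central-tendency and dispersion measures discussed above."""
    return {
        "mean": fmean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "sd": round(pstdev(values), 2),  # population standard deviation
    }

with_outlier = summarize([2, 4, 6, 8, 100])
without_outlier = summarize([2, 4, 6, 8])

print(with_outlier)     # {'mean': 24.0, 'median': 6, 'range': 98, 'sd': 38.05}
print(without_outlier)  # {'mean': 5.0, 'median': 5.0, 'range': 6, 'sd': 2.24}
```

The median moves by only 1, while the mean, range, and standard deviation change by an order of magnitude.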


### **Summary**:
- **Mean** and **range/variance/standard deviation** are highly **sensitive** to outliers.
- **Median** and **mode** are **less affected** by outliers, making them more robust for datasets with extreme values.