Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. **Mean:**
   - **Definition:** The arithmetic average of a set of values.
   - **Calculation:** \( \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \)
   - **Use:** Provides a balance point, representing the center of the distribution. Sensitive to outliers.

2. **Median:**
   - **Definition:** The middle value in a sorted dataset.
   - **Calculation:** Identify the middle value or average the two middle values.
   - **Use:** Less sensitive to outliers; represents the central position.

3. **Mode:**
   - **Definition:** The value(s) that occur most frequently in a dataset.
   - **Use:** Identifies the most common values; applicable to nominal and ordinal data.

Q2. What is the difference between the mean, median, and mode? How are they used to measure the
central tendency of a dataset?

### Mean:

- **Definition:** The mean is the arithmetic average of a set of values.
- **Calculation:** \( \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \)
- **Use:** Represents the balance point of the distribution; sensitive to outliers.

### Median:

- **Definition:** The median is the middle value in a sorted dataset.
- **Calculation:** Identify the middle value or average the two middle values.
- **Use:** Less sensitive to outliers than the mean; represents the central position.

### Mode:

- **Definition:** The mode is the value(s) that occur most frequently in a dataset.
- **Calculation:** Identify the most frequently occurring value(s).
- **Use:** Identifies the most common values; applicable to nominal and ordinal data.

### Differences:

1. **Sensitivity to Outliers:**
   - **Mean:** Sensitive to extreme values; outliers can significantly influence the mean.
   - **Median:** Less sensitive to outliers; the middle value is not affected by extreme values.
   - **Mode:** Not sensitive to outliers as it is based on frequencies.

2. **Calculation Method:**
   - **Mean:** Involves summing all values and dividing by the number of values.
   - **Median:** Involves ordering the values and identifying the middle one or averaging the two middle values.
   - **Mode:** Involves identifying the value(s) with the highest frequency.

3. **Applicability to Data Types:**
   - **Mean:** Suitable for interval and ratio data; requires numerical values.
   - **Median:** Applicable to ordinal, interval, and ratio data; less affected by skewed distributions.
   - **Mode:** Applicable to any data type, and the only one of the three measures that can be used for nominal data.
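
The data-type distinction above can be sketched with Python's standard `statistics` module. The `colors` and `sizes` lists are made-up illustrations, not data from this assignment.

```python
import statistics

# Hypothetical nominal data: categories with no order and no arithmetic.
colors = ["red", "blue", "red", "green", "blue", "red"]
color_mode = statistics.mode(colors)    # most frequent category

# Hypothetical ordinal data coded as ranks (1 = small, 2 = medium, 3 = large).
sizes = [1, 2, 2, 3, 3, 3, 2, 2]
size_median = statistics.median(sizes)  # only needs an ordering of the values
size_mode = statistics.mode(sizes)

print(color_mode, size_median, size_mode)
```

A mean of `colors` is not defined, since category labels cannot be summed; only the mode applies there.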

### Use in Measuring Central Tendency:

- **Mean:**
  - Provides a measure of the center by considering the sum of all values.
  - Influenced by the magnitude of values.

- **Median:**
  - Represents the center by identifying the middle position.
  - Less affected by extreme values, making it suitable for skewed distributions.

- **Mode:**
  - Identifies the most frequently occurring values.
  - Useful for identifying common patterns or modes in the data.

Q3. Measure the three measures of central tendency for the given height data:

[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
l=[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
import numpy as np

In [3]:
np.mean(l)

177.01875

In [4]:
np.median(l)

177.0

In [5]:
from scipy import stats

In [6]:
stats.mode(l)

ModeResult(mode=array([177.]), count=array([3]))
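
In this dataset both 177 and 178 occur three times, and `scipy.stats.mode` reports only one value when counts are tied. As a cross-check, a minimal sketch with the standard library's `statistics.multimode` lists every tied value. The variable name `heights` is introduced here for illustration, and the exact shape of SciPy's `ModeResult` may differ between SciPy versions.

```python
import statistics

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value that reaches the highest count.
modes = statistics.multimode(heights)
print(sorted(modes))  # [177, 178]
```

The data is therefore bimodal, with modes 177 and 178.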

Q4. Find the standard deviation for the given data:

[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [7]:
l1=[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [8]:
np.std(l1)  # population standard deviation (NumPy default, ddof=0)

1.7885814036548633
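
The value above is the population standard deviation, because `np.std` divides by n by default. If the heights are treated as a sample from a larger population, the n - 1 version may be more appropriate. A short sketch of both, using the illustrative name `heights` for the same data:

```python
import numpy as np

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

pop_std = np.std(heights)             # divides by n (ddof=0)
sample_std = np.std(heights, ddof=1)  # divides by n - 1

print(pop_std, sample_std)
```

The sample value is always slightly larger, by a factor of sqrt(n / (n - 1)).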

Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.

### Measures of Dispersion:

1. **Range:**
   - **Definition:** The difference between the maximum and minimum values in a dataset.
   - **Calculation:** \( \text{Range} = \text{Max} - \text{Min} \)
   - **Use:** Provides a quick assessment of the spread of data.

2. **Variance:**
   - **Definition:** The average of the squared differences from the mean.
   - **Calculation:** \( \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \) (for population) or \( s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \) (for sample)
   - **Use:** Quantifies the overall variability in a dataset.

3. **Standard Deviation:**
   - **Definition:** The square root of the variance.
   - **Calculation:** \( \sigma = \sqrt{\sigma^2} \) (for population) or \( s = \sqrt{s^2} \) (for sample)
   - **Use:** Measures the typical distance of data points from the mean (strictly, the root-mean-square deviation); it is expressed in the same units as the data, so it is easier to interpret than variance.

### How They Describe the Spread:

- **Range:**
  - **Use:** Describes the overall range or extent of values in a dataset.
  - **Example:** For a dataset of daily temperatures in Celsius, if the range is 20°C, it indicates that temperatures vary by 20 degrees between the coldest and warmest days.

- **Variance:**
  - **Use:** Quantifies the average squared deviation of each data point from the mean.
  - **Example:** In a dataset of exam scores, a high variance suggests that scores deviate more from the average, indicating greater variability in performance.

- **Standard Deviation:**
  - **Use:** Provides a more interpretable measure of variability, as it is in the same units as the data.
  - **Example:** In a dataset of product weights, a standard deviation of 2 grams suggests that a typical product deviates from the mean weight by about 2 grams.
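
The three definitions can also be computed directly, without any library. This is a minimal sketch on made-up values chosen so that the arithmetic is easy to follow by hand.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean is 5

n = len(data)
mean = sum(data) / n
data_range = max(data) - min(data)

squared_devs = [(x - mean) ** 2 for x in data]  # 9, 1, 1, 1, 0, 0, 4, 16
pop_var = sum(squared_devs) / n                 # population variance
sample_var = sum(squared_devs) / (n - 1)        # sample variance
pop_std = math.sqrt(pop_var)

print(data_range, pop_var, pop_std)  # 7 4.0 2.0
```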

In [9]:
a=[20, 25, 18, 22, 28, 24]
b=[15, 30, 20, 25, 22, 18]

In [10]:
print("Range of a is {}".format(max(a)-min(a)))

Range of a is 10


In [11]:
print("Range of b is {}".format(max(b)-min(b)))

Range of b is 15


In [12]:
print("Var of a is {}".format(np.var(a)))

Var of a is 10.805555555555555


In [13]:
print("Var of b is {}".format(np.var(b)))

Var of b is 23.555555555555557


In [14]:
print("Std dev of a is {}".format(np.std(a)))

Std dev of a is 3.2871804872193366


In [15]:
print("Std dev of b is {}".format(np.std(b)))

Std dev of b is 4.853406592853679


Q6. What is a Venn diagram?

A Venn diagram is a visual representation of the relationships between different sets or groups of items. It consists of overlapping circles, each representing a set, and the overlapping regions represent the elements shared between those sets. Venn diagrams are widely used to illustrate the intersections and differences between various categories or groups.

Key features of a Venn diagram:

1. **Sets Representation:** Each circle in a Venn diagram represents a set, and the elements of that set are drawn inside the circle.

2. **Overlap Regions:** The overlapping areas between circles represent the elements that are common to multiple sets. In a standard Venn diagram the size of an overlap carries no meaning; only area-proportional variants scale a region to the number of shared elements.

3. **Distinct Regions:** The non-overlapping portions of the circles represent elements that are unique to each set and do not belong to any other set.

4. **Universal Set:** In some cases, a rectangle or another shape is used to enclose all the circles, representing the universal set from which the individual sets are drawn.

Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:

(i) A ∩ B

In [16]:
a=set((2,3,4,5,6,7))
b=set((0,2,6,8,10))
a.intersection(b)

{2, 6}

(ii) A ∪ B

In [17]:
a.union(b)

{0, 2, 3, 4, 5, 6, 7, 8, 10}
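
The remaining regions of the two-circle Venn diagram can be obtained the same way. A short sketch using Python's set operators; the names `A`, `B`, `only_a`, `only_b` and `either_not_both` are chosen here for illustration.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

only_a = A - B            # elements in A but not in B
only_b = B - A            # elements in B but not in A
either_not_both = A ^ B   # symmetric difference: everything outside the overlap

print(only_a, only_b, either_not_both)
```

Together with the intersection, `only_a` and `only_b` partition the union into the three regions of the diagram.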

Q8. What do you understand about skewness in data?

Skewness is a statistical measure that describes the asymmetry or lack of symmetry in a dataset's distribution. In other words, it quantifies the degree and direction of skew (departure from horizontal symmetry) in the data. Skewness provides insights into the shape of the distribution, particularly regarding the concentration of values on one side of the mean compared to the other.

### Characteristics of Skewness:

1. **Positively Skewed (Right Skewed):**
   - The right tail of the distribution is longer or fatter than the left.
   - Most values are concentrated on the left side of the mean.
   - The mean is typically greater than the median.

2. **Negatively Skewed (Left Skewed):**
   - The left tail of the distribution is longer or fatter than the right.
   - Most values are concentrated on the right side of the mean.
   - The mean is typically less than the median.

3. **Symmetrical (Zero Skewness):**
   - The distribution is perfectly balanced, with equal tail lengths on both sides of the mean.
   - The mean and median are equal.

### Calculation of Skewness:

The skewness of a dataset (denoted as \( \text{Skewness} \)) can be calculated using the formula:

\[ \text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n \cdot \sigma^3} \]

Where:
- \( n \) is the number of data points.
- \( x_i \) is each individual data point.
- \( \bar{x} \) is the mean of the dataset.
- \( \sigma \) is the standard deviation of the dataset.
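
The formula translates almost line by line into NumPy. The sketch below is one way to write it, with an illustrative helper name `skewness` and made-up data; `scipy.stats.skew` with its default settings is expected to give the same population-based value.

```python
import numpy as np

def skewness(values):
    """Population skewness: mean cubed deviation divided by sigma cubed."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    sigma = x.std()  # population standard deviation (ddof=0)
    return np.mean(deviations ** 3) / sigma ** 3

right_skewed = [5, 8, 10, 12, 15, 20, 25, 30]   # illustrative values
mirrored = [35 - v for v in right_skewed]        # reflect the data to flip the tail

print(skewness(right_skewed))  # positive
print(skewness(mirrored))      # same magnitude, negative
```

Reflecting the data changes the sign of the skewness, whereas merely listing the same values in a different order changes nothing.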

### Interpretation:

1. **Skewness = 0:**
   - The distribution is symmetric, or its left and right asymmetries balance out.

2. **Skewness > 0:**
   - Positively skewed distribution.
   - The tail on the right side is longer or fatter.

3. **Skewness < 0:**
   - Negatively skewed distribution.
   - The tail on the left side is longer or fatter.

### Example:

Consider two datasets:

1. Positively Skewed:
   - Data: [5, 8, 10, 12, 15, 20, 25, 30]
   - Mean (15.625) > Median (13.5)
   - Right tail is longer.

2. Negatively Skewed:
   - Data: [5, 10, 15, 20, 23, 25, 27, 30]
   - Mean (19.375) < Median (21.5)
   - Left tail is longer.

Skewness provides a quantitative measure to confirm visual observations about the symmetry of a distribution. It is a valuable tool in exploratory data analysis, helping to understand the shape and characteristics of the data distribution.

Q9. If data is right-skewed, what will be the position of the median with respect to the mean?

In a right-skewed distribution (positively skewed), the right tail of the distribution is longer or fatter than the left tail. This means that the majority of values are concentrated on the left side of the distribution, and there are a few larger values on the right side. In such a distribution:

1. **Position of Mean:**
   - The mean is influenced by the presence of larger values in the right tail.
   - The mean will be pulled in the direction of the skewness, toward the larger values.

2. **Position of Median:**
   - The median, being the middle value when the data is ordered, is less affected by extreme values in the right tail.
   - The median remains closer to the bulk of the data, reflecting the central tendency of the majority of values.

3. **Relationship Between Mean and Median:**
   - In a right-skewed distribution, the mean is typically greater than the median.
   - This is because the mean is being influenced by the larger values in the right tail, pulling it in that direction.

As a rule of thumb, the relationship can be summarized as follows:

- If \( \text{Skewness} > 0 \) (right-skewed),
  - \( \text{Mean} > \text{Median} \)

This ordering holds for most unimodal right-skewed distributions but is not guaranteed for every dataset. When it holds, the gap between the mean and the median gives a rough indication of how strongly the distribution is skewed to the right: the larger the gap relative to the spread of the data, the more pronounced the skew.
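
A small numeric check of this ordering, using hypothetical waiting times in which a single long wait forms the right tail:

```python
import statistics

# Hypothetical waiting times in minutes: mostly short, one very long.
waits = [2, 3, 3, 4, 4, 5, 5, 6, 8, 40]

mean_wait = statistics.fmean(waits)     # pulled toward the long wait
median_wait = statistics.median(waits)  # stays with the bulk of the data

print(mean_wait, median_wait)  # 8.0 4.5
```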

Q10. Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?

### Covariance:

1. **Definition:**
   - Covariance measures the degree to which two variables change together.
   - It indicates whether an increase in one variable corresponds to an increase or decrease in another.

2. **Calculation:**
   - For two variables \(X\) and \(Y\):
   \[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n} \]
   - \(n\) is the number of data points.
   - \(\bar{X}\) and \(\bar{Y}\) are the means of \(X\) and \(Y\), respectively.

3. **Interpretation:**
   - Positive covariance: Indicates a positive relationship (as one variable increases, the other tends to increase).
   - Negative covariance: Indicates a negative relationship (as one variable increases, the other tends to decrease).
   - Covariance is not standardized, so its magnitude is influenced by the scale of the variables.

### Correlation:

1. **Definition:**
   - Correlation is a standardized measure of the strength and direction of the linear relationship between two variables.
   - It ranges from -1 to 1, where:
     - \( \text{Correlation} = 1 \) indicates a perfect positive linear relationship.
     - \( \text{Correlation} = -1 \) indicates a perfect negative linear relationship.
     - \( \text{Correlation} = 0 \) indicates no linear relationship.

2. **Calculation:**
   - For two variables \(X\) and \(Y\):
   \[ \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \]
   - \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\), respectively.

3. **Interpretation:**
   - The correlation coefficient is dimensionless and always falls between -1 and 1.
   - Positive correlation: \(0 < \text{Correlation} \le 1\)
   - Negative correlation: \(-1 \le \text{Correlation} < 0\)
   - Correlation provides a standardized measure, making it easier to compare relationships across different pairs of variables.

### Use in Statistical Analysis:

- **Covariance:**
  - Provides information about the direction of the relationship between variables but does not indicate the strength or the degree of the relationship.
  - Sensitive to the scale of the variables.

- **Correlation:**
  - Standardized measure, making it suitable for comparing relationships.
  - Provides information about both the direction and strength of the linear relationship.
  - Ranges from -1 to 1, offering a more interpretable measure.

In statistical analysis, covariance and correlation are used to understand the relationship between two variables. Correlation is often preferred in practice because of its standardized nature, which allows for easier comparison between different pairs of variables and is not influenced by the scale of the variables. Both measures are valuable tools in regression analysis, portfolio management, and various fields where understanding relationships between variables is essential.
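
Both formulas can be checked on a small made-up dataset. The sketch below follows the population formulas given above (dividing by n) and compares the results with NumPy's built-ins; note that `np.cov` divides by n - 1 unless `bias=True` is passed.

```python
import numpy as np

# Illustrative values only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Population covariance: mean product of deviations from the means.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
# Correlation: covariance divided by the product of the standard deviations.
corr_xy = cov_xy / (x.std() * y.std())

# Cross-check against NumPy's implementations.
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])
assert np.isclose(corr_xy, np.corrcoef(x, y)[0, 1])

print(cov_xy, corr_xy)
```

For these values the covariance is about 1.2 and the correlation about 0.77.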

Q11. What is the formula for calculating the sample mean? Provide an example calculation for a
dataset.

The sample mean (\( \bar{x} \)) is calculated by summing up all the values in a dataset and dividing the sum by the number of observations. The formula for the sample mean is as follows:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Where:
- \( \bar{x} \) is the sample mean.
- \( x_i \) represents each individual value in the dataset.
- \( n \) is the number of observations in the dataset.

In [None]:
import seaborn as sns

In [None]:
flights=sns.load_dataset("flights")

In [None]:
flights.head()

In [None]:
np.mean(flights["passengers"])
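
For a calculation that can be followed by hand and does not depend on downloading a dataset, here is a minimal sketch on made-up observations:

```python
observations = [4, 8, 15, 16, 23, 42]  # illustrative values

total = sum(observations)   # 108
n = len(observations)       # 6
sample_mean = total / n

print(sample_mean)  # 18.0
```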

Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, which is a symmetric distribution, the relationship between its measures of central tendency (mean, median, and mode) is as follows:

1. **Mean (μ):**
   - The mean of a normal distribution is located at the center.
   - In a perfectly symmetrical normal distribution, the mean is equal to the median.
   - The mean is the point of balance, and the tails on both sides are of equal length.

2. **Median:**
   - The median of a normal distribution is also located at the center.
   - In a perfectly symmetrical normal distribution, the median is equal to the mean.
   - The median is the middle value when the data is ordered.

3. **Mode:**
   - In a normal distribution, the mode is also at the center.
   - In a perfectly symmetrical normal distribution, the mode is equal to both the mean and the median.
   - The mode represents the most frequently occurring value.

### Mathematical Precision:

For a truly normal distribution, where the probability density function is given by the bell-shaped curve, the mean, median, and mode coincide at the center. Mathematically, the equality holds:

\[ \text{Mean} = \text{Median} = \text{Mode} \]

This relationship is a characteristic property of a perfectly symmetric normal distribution. Real-world data, however, are only approximately normal: sampling noise or mild skewness breaks the exact equality. In such cases the mean and median usually remain close, while the mode may not coincide with them.

### Visualization:

A visual representation of a normal distribution shows a symmetric bell-shaped curve, and the measures of central tendency are concentrated at the peak of the curve, representing the center of the distribution. The tails on both sides of the distribution extend equally from the center.

It's important to note that deviations from perfect symmetry may occur in practice, and the relationship between mean, median, and mode may vary in non-ideal cases.
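
A quick simulation illustrates the point. The sketch below draws a large sample from a normal distribution with an arbitrarily chosen mean of 50 and standard deviation of 10; because the data are continuous, the mode is only approximated by the centre of the fullest histogram bin.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for repeatability
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)

mean = sample.mean()
median = np.median(sample)

# Approximate the mode by the centre of the most populated histogram bin.
counts, edges = np.histogram(sample, bins=50)
peak = counts.argmax()
approx_mode = (edges[peak] + edges[peak + 1]) / 2

print(mean, median, approx_mode)
```

All three values should land near 50, with the histogram-based mode being the noisiest of the three.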

Q13. How is covariance different from correlation?

Covariance and correlation are both measures that describe the relationship between two variables, but they differ in terms of scale, interpretation, and standardization:

### Covariance:

1. **Scale:**
   - Covariance is not standardized and can take any value, positive or negative.
   - The magnitude of covariance depends on the scales of the variables being measured.

2. **Interpretation:**
   - Covariance measures the direction of the linear relationship between two variables.
   - A positive covariance indicates a positive relationship, and a negative covariance indicates a negative relationship.

3. **Formula:**
   - For two variables \(X\) and \(Y\):
     \[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n} \]

4. **Units:**
   - The unit of covariance is the product of the units of the two variables.

### Correlation:

1. **Scale:**
   - Correlation is standardized and always falls between -1 and 1.
   - A correlation coefficient of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

2. **Interpretation:**
   - Correlation not only measures the direction but also the strength of the linear relationship.
   - It provides a standardized measure, allowing for easier comparison between different pairs of variables.

3. **Formula:**
   - For two variables \(X\) and \(Y\):
     \[ \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \]
   - \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\), respectively.

4. **Units:**
   - Correlation is dimensionless, making it easier to compare relationships between variables.

### Use Cases:

- **Covariance:**
  - Used to identify the direction of the linear relationship.
  - Sensitive to changes in scale.

- **Correlation:**
  - Provides a standardized measure of both direction and strength.
  - Facilitates comparison between different pairs of variables.
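
The effect of scale can be demonstrated by changing units. In this sketch the height and weight values are hypothetical; the same heights are expressed once in centimetres and once in metres.

```python
import numpy as np

# Hypothetical measurements.
height_cm = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
weight_kg = np.array([52.0, 58.0, 63.0, 70.0, 79.0])
height_m = height_cm / 100.0  # same heights, different unit

# bias=True divides by n, matching the covariance formula above.
cov_cm = np.cov(height_cm, weight_kg, bias=True)[0, 1]
cov_m = np.cov(height_m, weight_kg, bias=True)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]

print(cov_cm, cov_m)    # covariance shrinks by a factor of 100
print(corr_cm, corr_m)  # correlation is unchanged
```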

Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers, which are extreme values in a dataset, can have a significant impact on measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation). Their influence depends on the extent of their deviation from the rest of the data. Here's how outliers affect these measures:

### Measures of Central Tendency:

1. **Mean:**
   - **Impact:** Outliers can heavily influence the mean, pulling it in the direction of the outliers.
   - **Example:** For the dataset [10, 12, 14, 15, 100], the mean is 30.2; without the outlier 100 it would be 12.75.

2. **Median:**
   - **Impact:** Less affected by outliers since it represents the middle value.
   - **Example:** For the same dataset [10, 12, 14, 15, 100], the median is 14; without the outlier it would be 13.

3. **Mode:**
   - **Impact:** Typically not influenced by outliers since it represents the most frequent value.
   - **Example:** In [10, 12, 14, 15, 100] every value occurs exactly once, so there is no single mode with or without the outlier. An outlier changes the mode only if it is itself the most frequent value.

### Measures of Dispersion:

1. **Range:**
   - **Impact:** Outliers can significantly increase the range.
   - **Example:** For [10, 12, 14, 15, 100], the range is 90; without the outlier 100 it would be 5.

2. **Variance and Standard Deviation:**
   - **Impact:** Outliers can inflate both variance and standard deviation.
   - **Example:** For [10, 12, 14, 15, 100], the population variance is 1220.96 (standard deviation about 34.94); without the outlier it is 3.6875 (standard deviation about 1.92).

3. **Interquartile Range (IQR):**
   - **Impact:** Less affected by outliers as it focuses on the central portion of the data.
   - **Example:** The IQR of [10, 12, 14, 15, 100] is less influenced by the outlier.

### Example:

Consider the following dataset representing the incomes of a group of individuals (in thousands of dollars):

\[ [30, 35, 40, 45, 50, 500] \]

- **Mean:** About 116.67, heavily influenced by the outlier 500; without it the mean would be 40.
- **Median:** 42.5, which stays close to the bulk of the data; without the outlier it would be 40.
- **Mode:** Every income occurs once, so there is no mode for the outlier to distort.

- **Range:** 470 because of the extreme value 500; without it the range would be 20.
- **Variance and Standard Deviation:** Both are inflated by the outlier; the population standard deviation is about 171.6 with it and about 7.1 without it.
- **IQR:** Less affected, as it focuses on the central portion of the data.

In summary, outliers can distort measures of central tendency and dispersion, particularly the mean, range, variance, and standard deviation. Using robust measures like the median and interquartile range can be more appropriate when dealing with datasets containing outliers.
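
The comparison for the income data can be reproduced with a few lines of NumPy. The helper `summarize` is an illustrative name, and the quartiles use `np.percentile` with its default linear interpolation; other quartile conventions can give different IQR values on such a small dataset.

```python
import numpy as np

incomes = np.array([30, 35, 40, 45, 50, 500], dtype=float)
trimmed = incomes[incomes < 500]  # same data without the outlier

def summarize(values):
    """Return central tendency and dispersion measures for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": values.mean(),
        "median": np.median(values),
        "range": values.max() - values.min(),
        "std": values.std(),  # population standard deviation
        "iqr": q3 - q1,
    }

with_outlier = summarize(incomes)
without_outlier = summarize(trimmed)
for key in with_outlier:
    print(f"{key:>6}: {with_outlier[key]:8.2f} -> {without_outlier[key]:8.2f}")
```

The mean, range and standard deviation change drastically once the outlier is removed, while the median and IQR move only slightly.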