# **Statistics Basic-2**

### **Q1. What are the three measures of central tendency?**

The three measures of central tendency are:

1. **Mean**: The arithmetic average of a set of values, calculated by summing all the values and dividing by the number of values.
2. **Median**: The middle value in a set of ordered values, such that half the values are above it and half are below it. If there is an even number of values, the median is the average of the two middle values.
3. **Mode**: The value that appears most frequently in a set of values. A data set may have one mode, more than one mode, or no mode at all if no number repeats.

These measures provide different perspectives on the central point of a data set.

### **Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?**

The mean, median, and mode are all measures of central tendency, each providing a different method to represent the center of a dataset. Here are the key differences and their uses:

1. **Mean**:
    - **Definition**: The mean is the arithmetic average of all the values in a dataset. It is calculated by summing all the values and then dividing by the number of values.
    - **Usage**: The mean is most useful for datasets whose values are roughly symmetric and free of extreme outliers. It takes every value into account, which gives a balanced measure of central tendency.
    - **Sensitivity**: It is sensitive to outliers and skewed data, which can distort the mean.

2. **Median**:
    - **Definition**: The median is the middle value of a dataset when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.
    - **Calculation**: Order the values and find the middle one (or the average of the two middle ones).
    - **Usage**: The median is useful for skewed distributions or datasets with outliers because it is largely unaffected by extreme values. It represents the point at which half the values are above and half are below.
    - **Sensitivity**: It is robust to outliers and skewed data.

3. **Mode**:
    - **Definition**: The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode if no value repeats.
    - **Calculation**: Identify the value(s) that occur most frequently.
    - **Usage**: The mode is useful for categorical data or for identifying the most common value in a dataset. It can provide insights into the frequency distribution of the data.
    - **Sensitivity**: It is not affected by outliers but may not be a useful measure for all types of data, especially if the data set has many unique values or no repeating values.

### Summary of Uses:
- **Mean**: Best for roughly symmetric datasets without extreme outliers; provides a comprehensive measure considering all data points.
- **Median**: Best for skewed distributions or data with outliers; represents the middle point of the dataset.
- **Mode**: Best for categorical data or for identifying the most frequent value in a dataset; useful for understanding frequency distribution.

Each measure gives a different perspective on the central tendency, and the choice of which to use depends on the nature of the dataset and the specific analysis goals.
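
As a rough illustration of these differences, the sketch below uses a small made-up salary list (the numbers are hypothetical and not part of this notebook's data) with Python's standard `statistics` module. It shows how one extreme value pulls the mean sharply upward while the median moves only slightly and the mode stays put.

```python
import statistics

# Hypothetical monthly salaries (in thousands); invented for illustration only.
salaries = [30, 32, 32, 35, 38, 40, 41]
salaries_with_outlier = salaries + [250]  # add one extreme value


def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


mean_a, median_a, mode_a = summarize(salaries)
mean_b, median_b, mode_b = summarize(salaries_with_outlier)

print(f"Without outlier: mean={mean_a:.2f}, median={median_a}, mode={mode_a}")
print(f"With outlier:    mean={mean_b:.2f}, median={median_b}, mode={mode_b}")
```

The mean changes by far more than the median here, which is why the median is usually the safer summary for skewed data.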

### **Q3. Compute the three measures of central tendency for the given height data:**
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

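```python
import statistics

import numpy as np

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Mean: arithmetic average of all values
mean_height = np.mean(height_data)

# Median: middle value of the sorted data
median_height = np.median(height_data)

# Mode: statistics.multimode returns every value tied for the highest count
modes = sorted(statistics.multimode(height_data))
mode_count = height_data.count(modes[0])

print(f"Mean: {mean_height}")
print(f"Median: {median_height}")
print(f"Mode(s): {modes} (each appears {mode_count} times)")
```

Output:

```
Mean: 177.01875
Median: 177.0
Mode(s): [177, 178] (each appears 3 times)
```

The data is bimodal: 177 and 178 each appear three times. `scipy.stats.mode` reports only the smallest of the tied values, so it would show 177 alone.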


### **Q4. Find the standard deviation for the given data:**
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

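```python
import numpy as np

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# np.std defaults to ddof=0, i.e. the population standard deviation (divide by n).
# Pass ddof=1 for the sample standard deviation (divide by n - 1).
standard_deviation = np.std(data)

print(f"Standard Deviation: {standard_deviation}")
```

Output:

```
Standard Deviation: 1.7885814036548633
```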


### **Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.**

Measures of dispersion, such as range, variance, and standard deviation, are used to describe the spread or variability of a dataset. They provide insights into how much the data points differ from the central tendency (mean, median, or mode) and from each other. Here's how each measure is used:

1. **Range**:
    - **Definition**: The range is the difference between the highest and lowest values in a dataset.
    - **Usage**: It gives a quick sense of the spread but does not provide information about the distribution of values within that range.
    - **Example**: For the dataset \([178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]\), the range is \(180 - 172.5 = 7.5\).

2. **Variance**:
    - **Definition**: Variance measures the average squared deviation of each data point from the mean. It gives an idea of how much the data points vary from the mean.
    - **Usage**: It is useful for understanding the degree of spread in the data, with larger variance indicating greater dispersion.

    - **Example**: For the same dataset, calculating variance involves finding the mean first and then averaging the squared differences from the mean.

3. **Standard Deviation**:
    - **Definition**: Standard deviation is the square root of the variance. It is expressed in the same units as the data, making it easier to interpret.
    - **Usage**: It indicates the typical distance of the data points from the mean. A smaller standard deviation indicates that the data points are close to the mean, while a larger standard deviation indicates greater spread.
    - **Example**: For the dataset \([178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]\), we previously calculated the standard deviation to be approximately 1.7886.

### Example Calculation:

Consider the dataset \([178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]\):

1. **Range**:
    - Highest value: 180
    - Lowest value: 172.5
    - Range: 180 - 172.5 = 7.5

2. **Variance**:
    - Mean: 177.01875
    - Variance: 3.199 (population variance, dividing by \(n\))

3. **Standard Deviation**:
    - Standard Deviation: 1.7886

These measures help describe the spread of the dataset, providing insight into the variability and distribution of the data points around the central tendency.
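
The calculation above can be reproduced step by step with NumPy. This is a minimal sketch that uses the population formulas (dividing by \(n\)), matching NumPy's default `ddof=0`; the variable names are illustrative. The sample versions, which divide by \(n - 1\), are printed at the end for comparison.

```python
import numpy as np

heights = np.array([178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
                    178.9, 176.2, 177, 172.5, 178, 176.5])

data_range = heights.max() - heights.min()   # highest minus lowest value
mean = heights.mean()
variance = np.mean((heights - mean) ** 2)    # population variance (divide by n)
std_dev = np.sqrt(variance)                  # square root, same units as the data

print(f"Range: {data_range}")
print(f"Mean: {mean}")
print(f"Variance: {variance:.4f}")
print(f"Standard deviation: {std_dev:.4f}")

# Sample versions divide by n - 1 instead of n (ddof=1 in NumPy).
print(f"Sample variance: {heights.var(ddof=1):.4f}")
print(f"Sample standard deviation: {heights.std(ddof=1):.4f}")
```

The sample figures come out slightly larger than the population ones because the divisor is smaller.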

### **Q6. What is a Venn diagram?**

A Venn diagram is a visual representation used to show the relationships between different sets. It consists of overlapping circles, each representing a set. The areas where the circles overlap represent the common elements shared by the sets, while the areas where the circles do not overlap represent elements unique to each set.

### Key Features of a Venn Diagram:
- **Circles**: Each circle represents a different set.
- **Overlapping Regions**: The intersections of circles represent the common elements between the sets.
- **Non-overlapping Regions**: These areas indicate elements that are unique to a particular set.

### Uses of Venn Diagrams:
- **Set Theory**: Illustrating relationships and operations between sets, such as unions, intersections, and differences.
- **Logic and Probability**: Showing logical relations and probabilities of events.
- **Data Science**: Visualizing similarities and differences between data groups.
- **Problem Solving**: Identifying commonalities and differences among concepts or groups.

### Example:
Consider three sets:
- \(A = \{1, 2, 3, 4\}\)
- \(B = \{3, 4, 5, 6\}\)
- \(C = \{4, 6, 7, 8\}\)

A Venn diagram for these sets would have three overlapping circles. The intersections would show:
- \(A \cap B = \{3, 4\}\)
- \(A \cap C = \{4\}\)
- \(B \cap C = \{4, 6\}\)
- \(A \cap B \cap C = \{4\}\)

The areas outside the intersections would show the elements unique to each set:
- Unique to \(A\): \(\{1, 2\}\)
- Unique to \(B\): \(\{5\}\)
- Unique to \(C\): \(\{7, 8\}\)

Venn diagrams are powerful tools for visualizing and analyzing the relationships between multiple sets, making complex information easier to understand.
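
The regions of this three-set example can be checked with Python's built-in `set` type. This is only a sketch of the set arithmetic behind the diagram, not a drawing of it; the variable names are illustrative.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {4, 6, 7, 8}

# Overlapping regions: pairwise and three-way intersections.
a_and_b = A & B
a_and_c = A & C
b_and_c = B & C
all_three = A & B & C

# Non-overlapping regions: elements that belong to exactly one set.
only_a = A - (B | C)
only_b = B - (A | C)
only_c = C - (A | B)

print("A ∩ B:", sorted(a_and_b))
print("A ∩ C:", sorted(a_and_c))
print("B ∩ C:", sorted(b_and_c))
print("A ∩ B ∩ C:", sorted(all_three))
print("Only in A:", sorted(only_a))
print("Only in B:", sorted(only_b))
print("Only in C:", sorted(only_c))
```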

### **Q7. For the two given sets A = {2, 3, 4, 5, 6, 7} and B = {0, 2, 6, 8, 10}, find:**
(i) A ∩ B

(ii) A ∪ B

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Elements common to both sets
intersection = A & B

# Elements that appear in either set
union = A | B

print(f"A ∩ B: {intersection}")
print(f"A ∪ B: {union}")
```

Output:

```
A ∩ B: {2, 6}
A ∪ B: {0, 2, 3, 4, 5, 6, 7, 8, 10}
```


### **Q8. What do you understand about skewness in data?**

Skewness is a measure of the asymmetry of the distribution of values in a dataset. It indicates how far, and in which direction, the distribution departs from symmetry; a perfectly symmetric distribution, such as the normal distribution, has zero skewness. Skewness can be positive, negative, or zero, and it helps in understanding the direction and magnitude of the skew.

### Types of Skewness:

1. **Positive Skewness (Right Skewness)**:
    - **Description**: The tail on the right side of the distribution is longer or fatter than the left side.
    - **Characteristics**: Most data points are concentrated on the left, with a few larger values stretching the tail to the right.
    - **Implications**: The mean is typically greater than the median.
    - **Example**: Income distribution in many societies, where most people earn below average, but a few high earners increase the mean.

2. **Negative Skewness (Left Skewness)**:
    - **Description**: The tail on the left side of the distribution is longer or fatter than the right side.
    - **Characteristics**: Most data points are concentrated on the right, with a few smaller values stretching the tail to the left.
    - **Implications**: The mean is typically less than the median.
    - **Example**: Age of retirement, where most people retire at an age close to the upper limit, but a few retire earlier.

3. **Zero Skewness (Symmetrical Distribution)**:
    - **Description**: The distribution is symmetrical, with tails on both sides of the mean being mirror images of each other.
    - **Characteristics**: For a symmetric, unimodal distribution, the mean, median, and mode coincide.
    - **Example**: Heights of adult men in a well-defined population tend to be symmetrically distributed around the average height.

### Measuring Skewness:
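Skewness can be quantified using statistical formulas. One common measure is the moment-based sample skewness:

\[ g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}} \]

where \(\bar{x}\) is the sample mean and \(n\) is the number of values. A positive result indicates right skew, a negative result indicates left skew, and a value near zero indicates approximate symmetry.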

### Importance of Skewness:
- **Data Analysis**: Understanding skewness helps in choosing the appropriate statistical methods and transformations for analysis.
- **Data Interpretation**: Skewness provides insights into the underlying patterns and potential anomalies in the data.
- **Decision Making**: In fields like finance, skewness helps in assessing the risk and return profiles of investments.

### Example:
Consider a dataset of exam scores:
- **Positive Skewness**: Most students scored between 50 and 70, but a few scored above 90.
- **Negative Skewness**: Most students scored between 70 and 90, but a few scored below 50.
- **Zero Skewness**: Scores are symmetrically distributed around the mean, so the pattern of scores above the mean mirrors the pattern below it.

Understanding skewness helps in better interpreting the distribution of data and making informed decisions based on its characteristics.
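
To make the three cases concrete, the sketch below computes a moment-based skewness (the third central moment divided by the second central moment raised to the power 1.5) and compares it with `scipy.stats.skew`, which uses the same definition by default. The exam scores are hypothetical values chosen to match the three scenarios described above, and `moment_skewness` is an illustrative helper name.

```python
import numpy as np
from scipy import stats


def moment_skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using central moments."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (population variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical exam scores, invented for illustration only.
right_skewed = [50, 52, 55, 58, 60, 62, 65, 68, 92, 95]  # a few very high scores
left_skewed = [45, 48, 72, 75, 78, 80, 82, 85, 88, 90]   # a few very low scores
symmetric = [60, 65, 70, 75, 80, 85, 90]                 # mirror image around 75

for name, scores in [("right-skewed", right_skewed),
                     ("left-skewed", left_skewed),
                     ("symmetric", symmetric)]:
    manual = moment_skewness(scores)
    library = stats.skew(scores)
    print(f"{name}: manual={manual:.3f}, scipy={library:.3f}")
```

The sign of the result gives the direction of the skew, and its magnitude gives a rough sense of how pronounced the longer tail is.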

### **Q9. If a data is right skewed then what will be the position of median with respect to mean?**

In a right-skewed distribution, the tail on the right-hand side of the distribution is longer or fatter than the tail on the left-hand side. In this case, the mean is typically larger than the median because the mean is influenced by extreme values in the longer right tail.

So, with respect to the mean:

- In a right-skewed distribution, the median will be to the left of the mean (median < mean).
- In a left-skewed distribution, it is the opposite: the median will be to the right of the mean (median > mean).
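
A quick numerical check of this ordering, using a small hypothetical dataset and its mirror image (the values are invented for illustration):

```python
import statistics

# Hypothetical right-skewed values: most are small, one sits far out on the right.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Reflecting every value around 21 gives a left-skewed copy of the same shape.
left_skewed = [21 - x for x in right_skewed]

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print(f"Right-skewed: mean={right_mean}, median={right_median}")
print(f"Left-skewed:  mean={left_mean}, median={left_median}")
```

In the right-skewed list the long right tail pulls the mean above the median; in the mirrored list the long left tail pulls it below.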

### **Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?**

Covariance and correlation are both measures used to quantify the relationship between two variables in statistics, but they have some key differences.

1. **Covariance**:
   - Covariance measures the degree to which two variables change together. 
   - It indicates the direction of the linear relationship between variables. A positive covariance means that as one variable increases, the other tends to also increase, while a negative covariance means that as one variable increases, the other tends to decrease.
   - However, covariance doesn't provide a standardized measure, so it can be difficult to interpret. It's influenced by the scale of the variables, making it hard to compare covariances across different datasets.

2. **Correlation**:
   - Correlation is a standardized measure of the relationship between two variables.
   - It ranges from -1 to 1, where:
      - 1 indicates a perfect positive linear relationship,
      - -1 indicates a perfect negative linear relationship, and
      - 0 indicates no linear relationship.
   - Correlation standardizes the covariance by dividing it by the product of the standard deviations of the two variables, making it unitless and easier to interpret.
   - Correlation indicates both the direction and the strength of the linear relationship, whereas covariance reliably indicates only the direction, because its magnitude depends on the units of the variables.

In statistical analysis:
- Covariance is used to understand how two variables change together. For example, it can be used to study the relationship between the returns of two different stocks in finance.
- Correlation is often preferred over covariance because it provides a standardized measure that is easier to interpret and compare. It's widely used in various fields, such as finance, economics, social sciences, and more, to understand the strength and direction of relationships between variables while controlling for differences in scale.
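
The effect of scale can be seen in a short NumPy sketch. The paired measurements below are hypothetical, and `cov_and_corr` is an illustrative helper. Converting one variable from hours to minutes multiplies the covariance by 60 but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical paired measurements, invented for illustration only.
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
exam_score = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 74.0])


def cov_and_corr(x, y):
    """Return (sample covariance, Pearson correlation) for two 1-D arrays."""
    covariance = np.cov(x, y)[0, 1]        # off-diagonal entry of the 2x2 matrix
    correlation = np.corrcoef(x, y)[0, 1]
    return covariance, correlation


cov_hours, corr_hours = cov_and_corr(hours_studied, exam_score)
# Same data with study time expressed in minutes instead of hours.
cov_minutes, corr_minutes = cov_and_corr(hours_studied * 60, exam_score)

# Correlation is covariance divided by the product of the standard deviations.
manual_corr = cov_hours / (hours_studied.std(ddof=1) * exam_score.std(ddof=1))

print(f"Covariance (hours):    {cov_hours:.2f}")
print(f"Covariance (minutes):  {cov_minutes:.2f}")
print(f"Correlation (hours):   {corr_hours:.4f}")
print(f"Correlation (minutes): {corr_minutes:.4f}")
print(f"Correlation by hand:   {manual_corr:.4f}")
```

Note that `np.cov` uses the sample formula (dividing by \(n - 1\)) by default, so the manual check uses `ddof=1` for the standard deviations to stay consistent.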

### **Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.**

The formula for calculating the sample mean, denoted by \(\bar{x}\), is:


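\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]

where \(x_i\) are the individual observations and \(n\) is the number of observations in the sample.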


Here's an example calculation:

Let's say we have the following dataset representing the scores of 5 students in a test: 

[75, 80, 85, 90, 95]

To find the sample mean, we sum up all the scores and divide by the total number of scores:


\[ \bar{x} = \frac{75 + 80 + 85 + 90 + 95}{5} = \frac{425}{5} = 85 \]


So, the sample mean score for this dataset is 85.
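
The same calculation in plain Python, as a minimal sketch with illustrative variable names:

```python
scores = [75, 80, 85, 90, 95]

n = len(scores)          # number of observations
total = sum(scores)      # sum of all observations
sample_mean = total / n  # x-bar = (sum of x_i) / n

print(f"n = {n}, sum = {total}, sample mean = {sample_mean}")
```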

### **Q12. For normally distributed data, what is the relationship between its measures of central tendency?**

For a normal distribution, the measures of central tendency, namely the mean, median, and mode, are all equal and located at the center of the distribution.

- **Mean**: In a normal distribution, the mean is the arithmetic average of all the data points. It represents the balance point of the distribution.
- **Median**: The median is the middle value of the dataset when arranged in ascending or descending order. In a normal distribution, the median is also equal to the mean.
- **Mode**: The mode is the value that appears most frequently in the dataset. In a normal distribution, the mode is also equal to the mean and median.

In summary, for a normal distribution:

Mean = Median = Mode

This relationship holds for every normal distribution, whatever its mean (location) and standard deviation (spread), because the normal curve is always symmetric and has a single peak.
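
This can be checked numerically with `scipy.stats.norm`. In the sketch below the parameters `mu` and `sigma` are arbitrary hypothetical values, and the mode is estimated by locating the peak of the density on a fine grid rather than computed in closed form.

```python
import numpy as np
from scipy import stats

mu, sigma = 170.0, 8.0  # hypothetical mean and standard deviation
dist = stats.norm(loc=mu, scale=sigma)

mean_value = dist.mean()
median_value = dist.median()

# Estimate the mode as the grid point where the density is highest.
grid = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 8001)
mode_estimate = grid[np.argmax(dist.pdf(grid))]

print(f"Mean:   {mean_value}")
print(f"Median: {median_value}")
print(f"Mode:   {mode_estimate:.3f} (grid estimate)")
```

Changing `mu` or `sigma` moves or stretches the curve, but all three values continue to coincide.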

### **Q13. How is covariance different from correlation?**

Covariance and correlation are both measures used to quantify the relationship between two variables, but they have some key differences:

1. **Definition**:
   - **Covariance**: Covariance measures the degree to which two variables change together. It indicates the direction of the linear relationship between variables. A positive covariance means that as one variable increases, the other tends to also increase, while a negative covariance means that as one variable increases, the other tends to decrease.
   - **Correlation**: Correlation is a standardized measure of the relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. Correlation standardizes the covariance by dividing it by the product of the standard deviations of the two variables, making it unitless and easier to interpret.

2. **Scale**:
   - **Covariance** is influenced by the scale (units) of the variables. Therefore, it is difficult to compare covariances across different datasets.
   - **Correlation** is unitless and does not depend on the scale of the variables, making it easier to compare relationships between variables.

3. **Interpretation**:
   - **Covariance** indicates the direction of the relationship between two variables, but because its magnitude depends on the units of measurement, it does not by itself indicate how strong the relationship is.
   - **Correlation** indicates both the direction and the strength of the linear relationship between variables.

4. **Range**:
   - **Covariance** can range from negative infinity to positive infinity.
   - **Correlation** ranges from -1 to 1, with -1 indicating a perfect negative linear relationship, 1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship.

In summary, while both covariance and correlation measure the relationship between two variables, correlation provides a standardized measure that is easier to interpret and compare across different datasets, as it is unitless and always ranges between -1 and 1.

### **Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.**

Outliers can significantly impact measures of central tendency and dispersion:

1. **Measures of Central Tendency**:
   - **Mean**: Outliers can greatly influence the mean, pulling it towards their extreme values. A single outlier can distort the mean significantly, especially in smaller datasets.
   - **Median**: The median is less affected by outliers compared to the mean. Since it represents the middle value when the data is sorted, outliers have less influence on its value.
   - **Mode**: Outliers do not affect the mode directly because it represents the most frequently occurring value in the dataset. However, if an outlier appears frequently, it can create a new mode or change the existing mode.

2. **Measures of Dispersion**:
   - **Range**: Outliers can increase the range of the dataset, as they contribute to the maximum or minimum values.
   - **Standard Deviation/Variance**: Outliers can significantly inflate the standard deviation and variance. Each deviation from the mean is squared, so a single distant point contributes disproportionately to the result.
   - **Interquartile Range (IQR)**: The IQR is comparatively robust because it depends only on the middle 50% of the data. An outlier can shift the quartiles slightly, but it widens the IQR far less than it widens the range or the standard deviation.

**Example**:
Consider a dataset of exam scores:

[65, 70, 75, 80, 85, 90, 95, 100, 105]

Adding an outlier, say 150, would significantly affect various measures:

- **Mean**: Without the outlier: 765/9 = 85. With the outlier: 915/10 = 91.5.
- **Median**: Without the outlier: 85. With the outlier: (85 + 90)/2 = 87.5.
- **Standard Deviation** (population formula, dividing by \(n\)): Without the outlier: approximately 12.91. With the outlier: approximately 23.03.
- **Range**: Without the outlier: 105 - 65 = 40. With the outlier: 150 - 65 = 85.

In this example, the outlier has shifted both the mean and median upwards, increased the standard deviation, and widened the range of the dataset.
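
The measures for this example can be recomputed with NumPy. The sketch below uses the population standard deviation (NumPy's default, `ddof=0`) and NumPy's default linear-interpolation quartiles for the IQR; `describe` is an illustrative helper name.

```python
import numpy as np

scores = np.array([65, 70, 75, 80, 85, 90, 95, 100, 105])
scores_with_outlier = np.append(scores, 150)


def describe(values):
    """Return central tendency and dispersion measures as a dict."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values)),  # population standard deviation (ddof=0)
        "range": float(np.max(values) - np.min(values)),
        "iqr": float(q3 - q1),
    }


before = describe(scores)
after = describe(scores_with_outlier)

for key in before:
    print(f"{key:>6}: {before[key]:.2f} -> {after[key]:.2f}")
```

Comparing the two columns shows which measures react strongly to the added value and which barely move.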

# *Complete*