# Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. Mean: The mean is calculated by adding up all the values in a data set and then dividing the sum by the total number of values. It represents the average value of the data set and is influenced by extreme values.

2. Median: The median is the middle value of a data set when the values are arranged in ascending or descending order. If the data set has an odd number of values, the median is the middle value. If the data set has an even number of values, the median is the average of the two middle values.

3. Mode: The mode is the value that appears most frequently in a data set. A data set can have one mode (unimodal) or multiple modes (multimodal) if there are several values with the same highest frequency. It is useful for categorical or discrete data, but it can also be applied to continuous data with grouped frequency distributions.
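
The definitions above can be sketched with Python's standard `statistics` module. This is a minimal illustration on a made-up sample (the `values` list is an assumption, not data from this assignment), chosen to have an even number of values and two tied modes:

```python
import statistics

# Hypothetical sample: even number of values, with 3 and 7 tied as most frequent
values = [2, 3, 3, 5, 7, 7, 8, 9]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # average of the two middle values (5 and 7)
modes = statistics.multimode(values)      # every value tied for the highest frequency

print("mean:", mean_value)
print("median:", median_value)
print("modes:", modes)
```

Because the sample has an even number of values, the median falls between the two middle values, and `multimode` reports both modes rather than only one.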

# Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are three different measures of central tendency, used to understand the central or typical value of a dataset. Here's how they differ and how they are used:

1. Mean:
- The mean is the average value of a dataset.
- It is calculated by summing up all the values in the dataset and then dividing the sum by the total number of values.
- The mean is sensitive to extreme values, which means outliers can significantly affect its value.
- It is commonly used when dealing with continuous and interval data, where the values have a natural order and can be averaged.

2. Median:
- The median is the middle value of a dataset when the values are arranged in ascending or descending order.
- If the dataset has an odd number of values, the median is the middle value itself.
- If the dataset has an even number of values, the median is the average of the two middle values.
- The median is not affected by extreme values or outliers, making it a more robust measure of central tendency in the presence of skewed data.
- It is commonly used when dealing with ordinal or skewed data, where the central value is more representative of the dataset than the mean.

3. Mode:
- The mode is the value that appears most frequently in a dataset.
- A dataset can have one mode (unimodal) or multiple modes (multimodal) if there are several values with the same highest frequency.
- The mode is useful for categorical or discrete data, but it can also be applied to continuous data with grouped frequency distributions.
- It is not affected by extreme values and is particularly helpful when dealing with nominal data, such as categories or labels.

In summary, these measures of central tendency provide different perspectives on the typical value of a dataset. The mean is commonly used when dealing with continuous data and has convenient mathematical properties (for example, it is the value that minimizes the sum of squared deviations). The median is preferred when dealing with skewed data or ordinal variables to avoid the influence of outliers. The mode is useful for categorical or nominal data to identify the most frequent value or category. The choice of which measure to use depends on the nature of the data and the research question at hand.

# Q3. Compute the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [19]:
import numpy as np
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
np.mean(data)

177.01875

In [20]:
np.median(data)

177.0

In [23]:
from scipy import stats as st

In [24]:
st.mode(data)

ModeResult(mode=array([177.]), count=array([3]))
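
Note that `st.mode` reports a single mode, and the exact form of the `ModeResult` output can differ between SciPy versions. As a cross-check, the sketch below uses the standard library's `statistics.multimode`, which lists every value tied for the highest count. It suggests that this height data has two modes rather than one:

```python
from collections import Counter
import statistics

data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

modes = statistics.multimode(data)  # all values tied for the highest frequency
counts = Counter(data)              # frequency of every distinct value

for m in modes:
    print("mode:", m, "count:", counts[m])
```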

# Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [25]:
data2 = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [26]:
np.std(data2)

1.7885814036548633
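
The value above is the population standard deviation, because `np.std` divides by n unless `ddof` is set (`np.std(data2, ddof=1)` would divide by n - 1 instead). A minimal sketch of the two forms, using the standard library equivalents `statistics.pstdev` and `statistics.stdev`:

```python
import math
import statistics

data2 = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

population_std = statistics.pstdev(data2)  # divides by n, like np.std(data2)
sample_std = statistics.stdev(data2)       # divides by n - 1, like np.std(data2, ddof=1)

print("population std:", population_std)
print("sample std:", sample_std)
```

Which form is appropriate depends on whether the 16 heights are treated as the whole population or as a sample drawn from a larger one.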

# Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion, such as range, variance, and standard deviation, are used to describe the spread or variability of a dataset. They provide valuable information about how the data points are distributed around the central tendency (mean, median, or mode) and help to understand the degree of variability within the dataset. Let's explore each measure and provide an example to illustrate their usage:

1. Range:
The range is the simplest measure of dispersion, and it represents the difference between the largest and smallest values in the dataset. It gives an idea of how much the data is spread out from one extreme to the other.

Example:
Consider the following dataset representing the ages of a group of people: [20, 25, 22, 28, 19, 30]
Range = Maximum value - Minimum value = 30 - 19 = 11
The range of this dataset is 11, indicating that the ages are spread out across an 11-year range.

2. Variance:
Variance measures the average squared deviation from the mean. It quantifies how much the data points deviate from the mean and provides a measure of the overall dispersion.

Example:
Consider the dataset of exam scores: [85, 89, 92, 78, 80]
Step 1: Calculate the mean: (85 + 89 + 92 + 78 + 80) / 5 = 84.8
Step 2: Calculate the squared deviations from the mean:
(85 - 84.8)^2 = 0.04
(89 - 84.8)^2 = 17.64
(92 - 84.8)^2 = 51.84
(78 - 84.8)^2 = 46.24
(80 - 84.8)^2 = 23.04
Step 3: Calculate the variance (population form, dividing by the number of values):
Variance = (0.04 + 17.64 + 51.84 + 46.24 + 23.04) / 5 = 138.8 / 5 = 27.76
The variance of this dataset is 27.76.

3. Standard Deviation:
The standard deviation is the square root of the variance. It is widely used because it is expressed in the same units as the data, which makes it easier to interpret than the variance. It indicates how far data points typically lie from the mean.

Example (using the same dataset as above):
Standard Deviation = √27.76 ≈ 5.269
The standard deviation of this dataset is approximately 5.269.

In summary, measures of dispersion like range, variance, and standard deviation complement measures of central tendency by providing information about the spread and variability of the data. They are valuable in various fields, such as finance, research, and data analysis, to understand the distribution and characteristics of datasets.
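
The three measures can be cross-checked with Python's standard library. This is a minimal sketch that reuses the ages and exam scores from the examples; it uses the population forms (dividing by n), which is the form used in the hand calculation:

```python
import statistics

ages = [20, 25, 22, 28, 19, 30]
scores = [85, 89, 92, 78, 80]

age_range = max(ages) - min(ages)              # largest value minus smallest value
score_variance = statistics.pvariance(scores)  # mean of the squared deviations from the mean
score_std = statistics.pstdev(scores)          # square root of the population variance

print("range of ages:", age_range)
print("variance of scores:", score_variance)
print("standard deviation of scores:", score_std)
```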

# Q6. What is a Venn diagram?

A Venn diagram is a visual representation used to show the relationships between different sets of elements or groups. It consists of overlapping circles (or other shapes) that represent the sets, with each circle representing a specific category or group. The overlapping regions show the elements that belong to multiple sets, while the non-overlapping regions represent elements unique to each set.

The primary purpose of a Venn diagram is to illustrate the commonalities and differences between sets and to help visualize set operations like union, intersection, and complement. These diagrams are widely used in mathematics, logic, statistics, and various fields where set theory and data analysis are involved.

The standard Venn diagram uses circles, but variations with different shapes, such as rectangles, can also be used to represent the sets. Venn diagrams are simple and intuitive tools, making them valuable for educational purposes and presenting complex relationships in an easy-to-understand format.

Let's look at a simple example of a Venn diagram:

Consider two sets: Set A, representing people who like pizza, and Set B, representing people who like burgers.

- Circle A: Contains elements (people) who like pizza.
- Circle B: Contains elements (people) who like burgers.
- Overlapping region: Contains elements (people) who like both pizza and burgers.
- Non-overlapping regions: Contain elements (people) who like only pizza or only burgers.

By using a Venn diagram in this scenario, you can quickly see the number of people who like only pizza, only burgers, or both, as well as the total number of people who like either pizza or burgers.
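
The regions of this Venn diagram map directly onto Python's built-in `set` operations. The names below are made up purely for illustration:

```python
# Hypothetical survey answers (names are illustrative only)
likes_pizza = {"Asha", "Ben", "Chen", "Dara"}
likes_burgers = {"Chen", "Dara", "Eli"}

both = likes_pizza & likes_burgers          # overlapping region (intersection)
only_pizza = likes_pizza - likes_burgers    # part of circle A outside circle B
only_burgers = likes_burgers - likes_pizza  # part of circle B outside circle A
either = likes_pizza | likes_burgers        # everything inside either circle (union)

print("both:", sorted(both))
print("only pizza:", sorted(only_pizza))
print("only burgers:", sorted(only_burgers))
print("pizza or burgers:", sorted(either))
```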

# Q7. For the two given sets A = {2,3,4,5,6,7} & B = {0,2,6,8,10}. Find:
(i) A ∩ B
(ii) A ∪ B

In [4]:
#(i)A ∩ B = {2,6}
#(ii)A ∪ B = {0,2,3,4,5,6,7,8,10}
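
The same results can be checked with Python's built-in `set` type, where `&` is intersection and `|` is union. A minimal sketch:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements that are in both A and B
union = A | B         # elements that are in A, in B, or in both

print("A ∩ B =", sorted(intersection))
print("A ∪ B =", sorted(union))
```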

# Q8. What do you understand about skewness in data?

Skewness is a statistical measure that helps us understand the asymmetry of the probability distribution of a dataset. In simpler terms, it indicates the degree to which a dataset's distribution deviates from being symmetrical around its mean (the bell-shaped normal curve is the most familiar symmetrical example).

In a symmetrical distribution, the data points are evenly distributed on both sides of the mean, and the graph of the distribution would appear to be roughly symmetric. However, in a skewed distribution, the data points tend to be concentrated more on one side of the mean, causing the graph to be stretched towards that direction.

There are three types of skewness:

1. Positive Skewness (Right Skewness):
   - In a positively skewed distribution, the tail of the distribution extends towards the right side (higher values).
   - The mean is typically greater than the median and mode.
   - This commonly occurs in datasets with a few extremely large values that pull the mean in their direction.

2. Negative Skewness (Left Skewness):
   - In a negatively skewed distribution, the tail of the distribution extends towards the left side (lower values).
   - The mean is typically less than the median and mode.
   - This often happens in datasets with a few extremely small values that pull the mean in their direction.

3. Symmetrical Distribution:
   - In a symmetrical distribution, the data is evenly distributed around the mean, and there is no skewness.
   - The mean, median, and mode are approximately equal.

Skewness is an important concept in statistics because it helps us understand the shape of a dataset's distribution and identify potential outliers or extreme values that can impact the interpretation of data analysis. Skewness can be quantitatively measured using different methods, with the most common measure being Pearson's first skewness coefficient or the third standardized moment. These measures provide a numeric value representing the direction and degree of skewness in the data.
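
As an illustration of the third standardized moment, the sketch below computes it by hand for three made-up datasets. The helper name `skewness` and the data are assumptions for this example; to the best of my knowledge, `scipy.stats.skew` with its default settings computes the same population-form quantity.

```python
import statistics

def skewness(values):
    """Third standardized moment (population form): m3 / m2 ** 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 20]       # one very large value: long right tail
left_skewed = [1, 18, 18, 18, 19, 19, 20]   # one very small value: long left tail
symmetric = [1, 2, 3, 4, 5]                 # evenly spread around the mean

print("right-skewed:", skewness(right_skewed))
print("left-skewed:", skewness(left_skewed))
print("symmetric:", skewness(symmetric))
```

A positive result indicates a right skew, a negative result a left skew, and a value near zero a roughly symmetric distribution.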

# Q9. If data is right-skewed, what will be the position of the median with respect to the mean?

If a dataset is right-skewed, the median will typically lie to the left of the mean (median < mean). Let's understand why this is the case:

In a right-skewed distribution:
- The tail of the distribution extends towards the right side, which means most values are concentrated at the lower end while a few values are much higher.
- As a result, the extreme values on the right side (higher values) have a greater influence on the mean, pulling it in that direction.

On the other hand:
- The median is not affected by how extreme the values are; it only depends on the central value(s) of the dataset.
- Since most of the values sit at the lower end of a right-skewed distribution, the central value (median) stays within that bulk, away from the extreme higher values that are pulling the mean to the right.

To summarize:
- Mean: The mean will be pulled towards the right side due to the presence of higher extreme values.
- Median: The median will be positioned to the left of the mean, closer to the center of the dataset.

In graphical terms, in a right-skewed distribution, the mean will be shifted to the right, towards the longer tail, while the median will be positioned to the left, towards the thicker part of the distribution.
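
A small made-up example shows the effect. The `incomes` list is hypothetical (values in thousands), with one very large value forming the right tail:

```python
import statistics

# Hypothetical incomes in thousands; 250 is the extreme value in the right tail
incomes = [30, 32, 35, 36, 38, 40, 250]

mean_income = statistics.mean(incomes)      # pulled towards the large value
median_income = statistics.median(incomes)  # middle value of the sorted data

print("mean:", mean_income)
print("median:", median_income)
```

Here the median stays among the bulk of the values while the mean is dragged well to the right of it.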

# Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are two important statistical measures used to quantify the relationship between two variables in a dataset. While they are related concepts, they serve different purposes and have distinct interpretations.

1. Covariance:
Covariance measures the degree to which two variables change together. It indicates the direction of the relationship between two variables (whether they move in the same or opposite direction) and the magnitude of their joint variability. The formula for the sample covariance of two variables X and Y with n data points is:

\[ \text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \]

where:
- \(X_i\) and \(Y_i\) are individual data points of variables X and Y, respectively.
- \(\bar{X}\) and \(\bar{Y}\) are the sample means of X and Y, respectively.
- \(n\) is the number of data points.

Interpretation of covariance:
- If \(\text{cov}(X, Y) > 0\): The variables X and Y have a positive covariance, indicating they tend to increase or decrease together.
- If \(\text{cov}(X, Y) < 0\): The variables X and Y have a negative covariance, indicating they tend to move in opposite directions.
- If \(\text{cov}(X, Y) = 0\): The variables X and Y have no linear relationship or association, but this doesn't necessarily mean they are independent.

2. Correlation:
Correlation is a standardized measure of the linear relationship between two variables. It expresses the strength and direction of the relationship, ranging from -1 to 1. The most commonly used correlation coefficient is the Pearson correlation coefficient, denoted by \(r\). The formula for the sample Pearson correlation coefficient is:

\[ r = \frac{\text{cov}(X, Y)}{s_X \cdot s_Y} \]

where:
- \(\text{cov}(X, Y)\) is the covariance between X and Y.
- \(s_X\) and \(s_Y\) are the sample standard deviations of X and Y, respectively.

Interpretation of correlation:
- \(r > 0\): Positive correlation. As one variable increases, the other tends to increase as well.
- \(r < 0\): Negative correlation. As one variable increases, the other tends to decrease.
- \(r \approx 0\): Little to no linear relationship between the variables.

How are these measures used in statistical analysis?
- Covariance and correlation are used to identify the relationship between two variables. They help determine if changes in one variable are associated with changes in another.
- Correlation, being a standardized measure, is particularly useful as it allows for comparison between different datasets with different scales and units.
- These measures are often used in data exploration, regression analysis, portfolio management, and various fields of research to gain insights into the associations between variables and to make data-driven decisions.

It's essential to note that correlation does not imply causation. Even if two variables have a strong correlation, it does not necessarily mean that one variable causes the other to change.
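
The two formulas above can be sketched directly in plain Python. The helper names and the small dataset (`hours_studied`, `quiz_score`) are assumptions made for illustration:

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return math.sqrt(sample_covariance(xs, xs))

def pearson_r(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

# Hypothetical paired observations
hours_studied = [1, 2, 3, 4, 5]
quiz_score = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(hours_studied, quiz_score)
r_xy = pearson_r(hours_studied, quiz_score)

print("covariance:", cov_xy)
print("correlation:", r_xy)
```

The covariance is positive, so the variables tend to move together, and the correlation expresses the strength of that relationship on the fixed scale from -1 to 1.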

# Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

In [8]:
dataset = [10,15,20,25,30,35,40]
sample_mean = sum(dataset) / len(dataset)
print("sample_mean: " , sample_mean)

sample_mean:  25.0
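
The cell above applies the sample-mean formula \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\), where \(x_i\) are the observations and \(n\) is their count. The sketch below spells out the two parts of the formula and cross-checks the result against the standard library's `statistics.mean`:

```python
import statistics

dataset = [10, 15, 20, 25, 30, 35, 40]

n = len(dataset)         # number of observations
total = sum(dataset)     # sum of all observations
sample_mean = total / n  # x-bar = (1 / n) * sum of x_i

print("n:", n)
print("sum:", total)
print("sample_mean:", sample_mean)
print("statistics.mean:", statistics.mean(dataset))
```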


# Q12. For normally distributed data, what is the relationship between its measures of central tendency?

In a normal distribution, which is a symmetric bell-shaped distribution, the three measures of central tendency, namely the mean, median, and mode, have a special relationship:

1. Mean:
In a normal distribution, the mean (average) is located at the center of the distribution, exactly at the peak of the symmetric bell-shaped curve. The mean is often denoted by the symbol \(\mu\) (mu).

2. Median:
The median in a normal distribution is also located at the center of the distribution, just like the mean. Because the normal distribution is symmetric, the mean and median are the same. This means that the median is equal to the mean, and it is also denoted by the symbol \(\mu\) (mu).

3. Mode:
In a normal distribution, the mode is the value at the peak of the bell-shaped curve, where the probability density is highest. Because the curve is symmetric with a single peak at its center, the mode coincides with the mean and the median.

In short, for a normal distribution: mean = median = mode.
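
For a normal distribution, the mean, median, and mode coincide. This can be illustrated with `statistics.NormalDist` from the standard library. The parameters (mean 170, standard deviation 5) are assumptions chosen for the example:

```python
from statistics import NormalDist, mean, median

# Hypothetical normal distribution of heights: mu = 170, sigma = 5
heights = NormalDist(mu=170, sigma=5)

# Theoretical values: all three measures sit at the center of the curve
print("mean:", heights.mean, "median:", heights.median, "mode:", heights.mode)

# A random sample only approximates this: its mean and median are close, not identical
samples = heights.samples(10_000, seed=42)
sample_mean = mean(samples)
sample_median = median(samples)
print("sample mean:", sample_mean)
print("sample median:", sample_median)
```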

# Q13. How is covariance different from correlation?

Covariance and correlation are two related but different measures used to quantify the relationship between two variables in a dataset. Let's explore their differences:

1. Definition and Interpretation:
- Covariance: Covariance measures the degree to which two variables change together. It indicates the direction of the relationship between the variables (whether they move in the same or opposite direction) and the magnitude of their joint variability. However, it doesn't provide information about the strength of the relationship in a standardized way.
- Correlation: Correlation is a standardized measure of the linear relationship between two variables. It expresses the strength and direction of the relationship on a scale from -1 to 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

2. Scale:
- Covariance: The value of covariance is not standardized, which means it depends on the units of the variables. If the variables are measured in different units or have different scales, the magnitude of covariance can be significantly affected.
- Correlation: The correlation coefficient is standardized, and it is unitless. This makes it easier to compare the strength of the relationship between different pairs of variables, regardless of their original units.

3. Range:
- Covariance: The covariance can take any real value, ranging from negative infinity to positive infinity. It is not bounded.
- Correlation: The correlation coefficient is bounded between -1 and 1, with -1 indicating a perfect negative linear relationship, 1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship.

4. Interpretation of Strength:
- Covariance: Since covariance is not standardized, it is challenging to interpret its magnitude. A large positive covariance indicates that the variables tend to increase or decrease together, while a large negative covariance indicates they tend to move in opposite directions. However, the exact strength of the relationship is not immediately evident from the covariance value.
- Correlation: The correlation coefficient provides a clear indication of the strength of the linear relationship between two variables. A correlation of exactly 1 or -1 indicates a perfect linear relationship, values close to 1 or -1 indicate a strong one, and a correlation close to 0 indicates a weak or no linear relationship.
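
The scale difference can be illustrated by changing the units of one variable. In this sketch (made-up data and helper names), converting heights from centimeters to meters shrinks the covariance by a factor of 100 but leaves the correlation unchanged:

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Covariance standardized by the two sample standard deviations."""
    std_x = math.sqrt(sample_covariance(xs, xs))
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)

# Hypothetical measurements
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 67, 75, 88]
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit

cov_cm = sample_covariance(heights_cm, weights_kg)
cov_m = sample_covariance(heights_m, weights_kg)
r_cm = pearson_r(heights_cm, weights_kg)
r_m = pearson_r(heights_m, weights_kg)

print("covariance (cm, kg):", cov_cm)
print("covariance (m, kg):", cov_m)
print("correlation (cm, kg):", r_cm)
print("correlation (m, kg):", r_m)
```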

# Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

In [14]:
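import numpy as np

# Dataset containing one extreme value (200)
dataset = [5,10,15,200]

mean = np.mean(dataset)
median = np.median(dataset)
std = np.std(dataset)        # population standard deviation (ddof=0)
max_val = np.max(dataset)    # named max_val to avoid shadowing the built-in max()
min_val = np.min(dataset)    # named min_val to avoid shadowing the built-in min()

print("with outliers")
print("mean:",mean)
print("median:",median)
print("std:",std)
print("max:",max_val)
print("min:" , min_val)


# Same dataset with the extreme value replaced by a typical one (20)
data2 = [5,10,15,20]

mean2 = np.mean(data2)
median2 = np.median(data2)
std2 = np.std(data2)
max2 = np.max(data2)
min2 = np.min(data2)

print("\nWithout Outlier:")
print("mean:" , mean2)
print("median:" , median2)
print("std:" , std2)
print("max:" , max2)
print("min:" , min2)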


with outliers
mean: 57.5
median: 12.5
std: 82.34834546000302
max: 200
min: 5

Without Outlier:
mean: 12.5
median: 12.5
std: 5.5901699437494745
max: 20
min: 5


As we can see, the outlier (200) strongly affects the mean and the measures of dispersion. With the outlier included, the mean rises from 12.5 to 57.5, the standard deviation from about 5.59 to about 82.35, and the range (max minus min) from 15 to 195. The median, which depends only on the middle values, stays at 12.5 in both cases. Replacing the outlier with a typical value (20) brings the mean, standard deviation, and range back to values that represent the center and spread of the bulk of the data.

This example illustrates the importance of identifying and handling outliers appropriately when analyzing datasets to ensure more accurate and meaningful statistical conclusions.