## Q1. What are the three measures of central tendency?

1. Mean
2. Median
3. Mode

## Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are all measures of central tendency used to describe a dataset's typical or central value. However, they are calculated differently and have different characteristics that make them appropriate for different types of data.

The mean, also known as the arithmetic mean, is the sum of all values in a dataset divided by the total number of values. It is the most common measure of central tendency and is highly sensitive to outliers. If a dataset has outliers, the mean may not be a representative measure of central tendency. The mean is calculated as:

mean = (sum of all values) / (total number of values)

The median is the middle value in a dataset when the values are arranged in numerical order. If there is an even number of values, the median is the average of the two middle values. The median is less sensitive to outliers than the mean and is more appropriate for skewed distributions. The median is calculated as:

1. Arrange the data in ascending or descending order.
2. Find the middle number.
3. If there is an even number of values, find the mean of the two middle numbers.

The mode is the most common value in a dataset, i.e., the value that appears most frequently. A dataset can have multiple modes, one mode, or no mode at all. The mode is appropriate for nominal or categorical data, such as colors or names. The mode is calculated as:

1. Count the frequency of each value in the dataset.
2. The mode is the value with the highest frequency.
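The median and mode steps above can be sketched directly in Python. This is a minimal illustration, not a replacement for the `statistics` module; the function names and the sample list are made up for the example, and empty input is not handled.

```python
from collections import Counter


def median_by_steps(values):
    """Median following the steps above: sort, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes_by_steps(values):
    """All values that share the highest frequency (there may be several)."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


sample = [3, 1, 4, 1, 5, 9, 2, 6]  # made-up numbers for illustration
print(median_by_steps(sample))  # sorted: 1 1 2 3 4 5 6 9, so (3 + 4) / 2
print(modes_by_steps(sample))
```

Returning a list from `modes_by_steps` keeps ties visible, which matters because a dataset can have more than one mode.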

## Q3. Calculate the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
import statistics

In [2]:
height_data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [11]:
mean_height = statistics.mean(height_data)
mean_height

177.01875

In [12]:
median_height = statistics.median(height_data)
median_height


177.0

In [13]:
mode_height = statistics.mode(height_data)
mode_height

178
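Note that 177 and 178 each occur three times in this data, so it has two modes. From Python 3.8 onward, `statistics.mode` returns the first mode it encounters, which is why only 178 is reported. A short sketch using `statistics.multimode` (also available from Python 3.8) lists every mode:

```python
import statistics

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
               178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns all values tied for the highest frequency,
# in the order they first appear in the data.
all_modes = statistics.multimode(height_data)
print(all_modes)  # [178, 177]
```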

## Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [14]:
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [15]:
std_dev = statistics.stdev(data)

In [16]:
std_dev

1.8472389305844188
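`statistics.stdev` computes the sample standard deviation, which divides by n - 1. If the 16 heights are treated as the entire population rather than a sample, `statistics.pstdev` (which divides by n) is the matching function. The sketch below compares the two; which one is appropriate depends on how the data was collected, which the question does not state.

```python
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

sample_sd = statistics.stdev(data)       # divides by n - 1
population_sd = statistics.pstdev(data)  # divides by n

print(sample_sd)
print(population_sd)  # slightly smaller than the sample value
```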

## Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion such as range, variance, and standard deviation are used to describe the spread or variability of a dataset. These measures provide information about how much the data points in a dataset differ from each other.

The range is the simplest measure of dispersion, which is the difference between the maximum and minimum values in a dataset. It gives an idea about the spread of the data, but it is not very informative as it does not consider the values between the minimum and maximum.

The variance and standard deviation are more informative measures of dispersion. The variance is the average of the squared differences of each value from the mean, while the standard deviation is the square root of the variance. The standard deviation is a more commonly used measure of dispersion as it has the same units as the data, while the variance has squared units.

For example, suppose we have the following dataset of exam scores:

80, 85, 90, 92, 95

The range of the dataset is 15, which tells us that the highest and lowest scores differ by 15 points. However, this does not give us much information about the spread of the scores.

The mean of the scores is (80 + 85 + 90 + 92 + 95) / 5 = 88.4. The population variance (dividing by the number of values) can then be calculated as:

variance = [(80-88.4)^2 + (85-88.4)^2 + (90-88.4)^2 + (92-88.4)^2 + (95-88.4)^2] / 5
variance = 141.2 / 5
variance = 28.24

The standard deviation of the dataset can be calculated as:

standard deviation = sqrt(variance)
standard deviation = sqrt(28.24)
standard deviation ≈ 5.31

The standard deviation tells us that the scores typically deviate from the mean of 88.4 by roughly 5.3 points, which gives us a more informative measure of the variability in the dataset than the range.
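As a cross-check on the hand calculation, the same quantities can be computed with Python's `statistics` module. This sketch uses the population formulas (`pvariance`, `pstdev`), matching the division by 5 above; `variance` and `stdev` divide by n - 1 instead and give larger values.

```python
import statistics

scores = [80, 85, 90, 92, 95]

score_range = max(scores) - min(scores)
mean_score = statistics.mean(scores)
population_variance = statistics.pvariance(scores)  # divides by n
population_sd = statistics.pstdev(scores)
sample_variance = statistics.variance(scores)        # divides by n - 1

print("range:", score_range)
print("mean:", mean_score)
print("population variance:", population_variance)
print("population standard deviation:", round(population_sd, 2))
print("sample variance:", sample_variance)
```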

## Q6. What is a Venn diagram?

A Venn diagram is a visual representation of sets, showing the relationships between different sets of data. It consists of overlapping circles or other shapes, with each circle representing a set and the overlapping areas representing the intersections between sets.

## Q7. For the two given sets A = {2,3,4,5,6,7} and B = {0,2,6,8,10}, find:
1. A ∩ B
2. A ∪ B

1. A ∩ B = {2,6}
2. A ∪ B = {0,2,3,4,5,6,7,8,10}
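Both results can be verified with Python's built-in `set` type, where `&` is intersection and `|` is union:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements in both A and B
union = A | B         # elements in A, in B, or in both

# Sets are unordered, so sort before printing for a stable display.
print(sorted(intersection))  # [2, 6]
print(sorted(union))         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```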


## Q8. What do you understand about skewness in data?

Skewness is a measure of the degree of asymmetry in a distribution of data. In other words, it measures how much a dataset deviates from a perfectly symmetrical distribution. A distribution is symmetric if its left and right sides are mirror images of each other around the center, in which case the mean and median coincide and the skewness is zero.

Positive skewness occurs when a distribution is skewed to the right, meaning that the tail of the distribution extends further to the right than to the left. Most of the data points are concentrated on the left side of the distribution, with a few large values on the right side.

Negative skewness occurs when a distribution is skewed to the left, meaning that the tail of the distribution extends further to the left than to the right. Most of the data points are concentrated on the right side of the distribution, with a few small values on the left side.

Skewness is a useful measure for understanding the shape of a distribution and can provide insights into the nature of the data. For example, if a dataset has a positive skewness, it may indicate that there are outliers or extreme values on the right side of the distribution. Similarly, if a dataset has a negative skewness, it may indicate that there are outliers or extreme values on the left side of the distribution. Understanding skewness is important for choosing appropriate statistical techniques and making accurate inferences from the data.
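One common way to put a number on skewness is the moment coefficient, the third central moment divided by the 1.5 power of the second central moment. The sketch below assumes that definition (other conventions apply small-sample corrections), and the three datasets are invented for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]          # long tail to the right
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]   # mirror image of the above
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
print(skewness(symmetric))     # zero
```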

## Q9. If data is right-skewed, what is the position of the median relative to the mean?

If data is right-skewed, the median will typically be less than the mean. This is because a right-skewed distribution has a long tail on the right side, indicating the presence of some unusually large values. These large values pull the mean towards the right side of the distribution, making it larger than the median, which is the value that divides the dataset into two equal parts.

In other words, in a right-skewed distribution, the mean is pulled towards the tail, while the median stays close to where most of the data lies. The extent to which the mean and median differ depends on the degree of skewness in the dataset.
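A small numeric illustration, using invented income figures with one very large value in the right tail:

```python
import statistics

# Hypothetical incomes in thousands; the last value forms the right tail.
incomes = [30, 32, 35, 38, 40, 42, 45, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income)    # 64, pulled upward by the single large value
print(median_income)  # 39.0, the average of the two middle values 38 and 40
```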

## Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are two measures used in statistical analysis to describe the relationship between two variables.

Covariance measures how two variables vary together. Its sign gives the direction of the linear relationship: a positive covariance indicates that the two variables tend to increase or decrease together, while a negative covariance indicates that as one variable increases, the other tends to decrease.

However, covariance by itself is not a standardized measure and can be difficult to interpret since it is affected by the scales of the variables being measured. Correlation, on the other hand, is a standardized measure of the relationship between two variables that ranges from -1 to 1.

Correlation measures how closely the relationship between the two variables follows a straight line. A correlation of 1 indicates a perfect positive linear relationship, where the points lie exactly on an upward-sloping line, whereas a correlation of -1 indicates a perfect negative linear relationship, where they lie exactly on a downward-sloping line. A correlation near 0 indicates little or no linear relationship.

In statistical analysis, correlation is used to determine the strength and direction of the linear relationship between two variables on a unit-free scale, while covariance describes the direction of the relationship in the original units of the variables. Both measures help characterize how two variables are related and can be used to identify patterns and inform predictions.
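The scale dependence of covariance can be seen by changing units. The sketch below uses hand-written sample formulas (dividing by n - 1) and invented height and weight values; converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical measurements for five people.
heights_m = [1.60, 1.65, 1.70, 1.75, 1.80]
weights_kg = [55, 60, 63, 70, 74]
heights_cm = [h * 100 for h in heights_m]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)  # the second is about 100 times the first
print(r_m, r_cm)      # the same up to floating-point rounding
```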

## Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The formula for calculating the sample mean is:

Sample Mean = (Sum of all values in the dataset) / (Number of values in the dataset)

Here is an example calculation of the sample mean for the dataset [2, 5, 8, 10, 12]:

Sample Mean = (2 + 5 + 8 + 10 + 12) / 5
Sample Mean = 37 / 5
Sample Mean = 7.4

Therefore, the sample mean for the dataset [2, 5, 8, 10, 12] is 7.4.
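The same calculation in Python, applying the formula directly:

```python
values = [2, 5, 8, 10, 12]

# Sum of all values divided by the number of values.
sample_mean = sum(values) / len(values)
print(sample_mean)  # 7.4
```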

## Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For normally distributed data, the mean, median, and mode are all equal to each other. This is because a normal distribution is symmetric, with half of the data falling on either side of the mean. The peak of a normal distribution occurs at the mean, indicating that the mode is also at the same point.

Furthermore, the median, which is the value that divides the dataset into two equal parts, is also equal to the mean in a normal distribution because of the symmetry of the distribution.
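Python's `statistics.NormalDist` (available from Python 3.8) exposes these three measures as properties, which makes the equality easy to see. The parameters below are arbitrary. For a finite random sample the mean and median are only approximately equal, as the second half of the sketch shows.

```python
from statistics import NormalDist, fmean, median

# Hypothetical parameters chosen only for illustration.
dist = NormalDist(mu=170, sigma=5)
print(dist.mean, dist.median, dist.mode)  # all three are 170.0

# A random sample drawn from the distribution: close, but not identical.
samples = dist.samples(10_000, seed=42)
sample_mean = fmean(samples)
sample_median = median(samples)
print(sample_mean, sample_median)
```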

## Q13. How is covariance different from correlation?

Covariance and correlation are two measures of the relationship between two variables.

Covariance is a measure of how two variables change or vary together. It measures the degree to which two variables are linearly related, with a positive covariance indicating that the variables tend to increase or decrease together, and a negative covariance indicating that they tend to move in opposite directions.

However, covariance by itself is not a standardized measure and can be difficult to interpret because it is affected by the scale of the variables being measured. Correlation, on the other hand, is a standardized measure that indicates the degree to which two variables are related on a scale from -1 to 1.

Correlation is calculated by dividing the covariance by the product of the standard deviations of the two variables. This normalization makes it easier to interpret the strength and direction of the relationship between two variables. A correlation of 1 indicates a perfect positive linear relationship, a correlation of -1 indicates a perfect negative linear relationship, and a correlation of 0 indicates no linear relationship (the variables may still be related in a non-linear way).

In summary, while covariance measures the degree to which two variables vary together, correlation measures the strength and direction of the linear relationship between two variables on a standardized scale.
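Because correlation only captures linear association, a correlation of 0 does not rule out a relationship. In the sketch below, `y` is completely determined by `x`, yet the correlation is zero. The `pearson` helper is written out by hand using the covariance-over-standard-deviations formula described above, and the data is made up.

```python
import statistics


def pearson(xs, ys):
    """Correlation = covariance / (std of x * std of y), population versions."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y depends entirely on x, but not linearly

r = pearson(xs, ys)
print(r)  # 0.0
```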

## Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers are data points that are significantly different from the rest of the data in a dataset. Outliers can have a significant impact on measures of central tendency and dispersion.

Among the measures of central tendency, the mean is particularly sensitive to outliers because it is calculated by summing up all the values in the dataset and dividing by the number of values. If an outlier is present in the dataset, it can significantly shift the mean, making it a less reliable measure of the center of the data. The median and mode, by contrast, are largely unaffected by a few extreme values.

Measures of dispersion, such as the range, standard deviation, and variance, are also affected by outliers. Outliers can increase the range of the dataset, making it appear more spread out than it actually is. Outliers can also increase the standard deviation and variance, making it appear that the data is more variable than it actually is.

Here is an example of how outliers can affect measures of central tendency and dispersion:

Consider the following dataset: [1, 2, 3, 4, 5, 100]. The mean of this dataset is (1+2+3+4+5+100)/6 ≈ 19.17, which is larger than five of the six values. Without the outlier value "100", the mean would be 3, so the outlier makes the mean a less reliable measure of the center of the data. The median barely moves: it is 3.5 with the outlier and 3 without it.

Similarly, the range of the dataset is 99 (100-1), which appears to indicate a large amount of spread in the data. However, this range is primarily due to the outlier value of "100"; without it, the range would be 4.

Finally, the standard deviation and variance of the dataset are much higher than they would be without the outlier. The sample standard deviation is about 39.6 with the outlier and about 1.58 without it, which again makes it appear that the data is more variable than it actually is.
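The effect can be reproduced with a short script that summarizes the dataset with and without the outlier. The `summarize` helper is a name chosen for this sketch, and the sample (n - 1) standard deviation is assumed.

```python
import statistics


def summarize(values):
    """Return the main measures of center and spread for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


with_outlier = summarize([1, 2, 3, 4, 5, 100])
without_outlier = summarize([1, 2, 3, 4, 5])

for label, summary in [("with outlier", with_outlier),
                       ("without outlier", without_outlier)]:
    print(label, {name: round(value, 2) for name, value in summary.items()})
```

The median changes only slightly, while the mean, range, and standard deviation all change substantially.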