### Q1. What are the three measures of central tendency?

1. Mean
2. Median
3. Mode

### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

1. The `mean` is the average value of all the data points in the dataset. It is calculated by adding up all the values in the dataset and then dividing by the number of data points. The mean is sensitive to extreme values or outliers, as they can heavily influence its value.

2. The `median` is the middle value in a dataset when the data points are arranged in order from smallest to largest. If there is an even number of data points, the median is the average of the two middle values. The median is often used when the dataset has outliers, as it is not as affected by extreme values as the mean.

3. The `mode` is the value that appears most frequently in a dataset. A dataset may have one mode, multiple modes (in which case it is called bimodal or multimodal), or no mode at all. The mode is often used for categorical or discrete data, where there may not be a meaningful concept of a "middle" value.

Which measure to use depends on the nature of the data and the research question. For example, if we want to know the typical income of a group of people with broadly similar earnings, we might use the mean. If we want to know the typical house price in a neighborhood, we might use the median, since a few very expensive houses can heavily skew the mean. If we want to know the most common color of a type of flower, we use the mode, because colors have no numeric average or middle value.
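The choice between the three measures can be illustrated with a minimal sketch using Python's standard `statistics` module. The house prices and flower colors below are made-up values chosen only for illustration; they are not data from this notebook.

```python
from statistics import mean, median, mode

# Hypothetical house prices: four similar homes and one very expensive one.
house_prices = [250_000, 260_000, 270_000, 280_000, 2_500_000]

# Hypothetical categorical data: only the mode is meaningful here.
flower_colors = ["red", "white", "red", "pink", "red", "white"]

mean_price = mean(house_prices)        # pulled upward by the expensive house
median_price = median(house_prices)    # stays with the bulk of the prices
most_common_color = mode(flower_colors)

print(f"Mean price   : {mean_price}")
print(f"Median price : {median_price}")
print(f"Modal color  : {most_common_color}")
```

The single expensive house moves the mean far above what most homes cost, while the median remains at a typical price.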

### Q3. Compute the three measures of central tendency for the given height data:


In [1]:
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

1. Mean:
To find the mean, we add up all the values in the dataset and divide by the number of data points:
    (178 + 177 + 176 + 177 + 178.2 + 178 + 175 + 179 + 180 + 175 + 178.9 + 176.2 + 177 + 172.5 + 178 + 176.5) / 16 = 2832.3 / 16 = 177.01875
So the mean of the dataset is 177.01875 (about 177.02).

2. Median:
To find the median, we first need to arrange the data points in order from smallest to largest:
172.5, 175, 175, 176, 176.2, 176.5, 177, 177, 177, 178, 178, 178, 178.2, 178.9, 179, 180
There are 16 data points, so the median is the average of the two middle values, which are 177 and 177:
(177 + 177) / 2 = 177
So the median of the dataset is 177.

3. Mode:
To find the mode, we identify the value that appears most frequently in the dataset. Here, 177 and 178 each appear 3 times, more often than any other value, so the dataset is bimodal:
mode = 177 and 178

In [2]:
import numpy as np

data = np.array(data)

In [3]:
mean = np.mean(data)
print(f"Mean : {mean}")

Mean : 177.01875


In [4]:
median = np.median(data)
print(f"Median : {median}")

Median : 177.0


In [7]:
from scipy import stats
mode = stats.mode(data)
print(f"Mode : {mode}")

Mode : ModeResult(mode=array([177.]), count=array([3]))
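The `ModeResult` above lists only 177 because, when several values tie for the highest count, `scipy.stats.mode` reports just the smallest of them. A minimal sketch using the standard library's `statistics.multimode` (a swapped-in alternative, not the call used in this notebook) returns every tied value; the variable name `heights` is introduced here only for illustration.

```python
from statistics import multimode

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value that ties for the highest count,
# in the order each value first appears in the data.
modes = multimode(heights)
print(f"Modes : {modes}")
```

Both 177 and 178 appear in the result, which agrees with the manual count of three occurrences each.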



### Q4. Find the standard deviation for the given data:


In [8]:
data_4 = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [9]:
standard_deviation = np.std(data_4)
print(f"Standard Deviation : {standard_deviation}")

Standard Deviation : 1.7885814036548633
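Note that `np.std` divides by n by default (`ddof=0`), so the value above is the population standard deviation. If the heights are treated as a sample drawn from a larger population, the n - 1 version is usually preferred; in NumPy that is `np.std(data_4, ddof=1)`. A minimal standard-library sketch of both versions, with the variable name `heights` assumed for illustration:

```python
from statistics import pstdev, stdev

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# Population standard deviation: divides the squared deviations by n.
population_sd = pstdev(heights)

# Sample standard deviation: divides by n - 1 (Bessel's correction).
sample_sd = stdev(heights)

print(f"Population standard deviation : {population_sd}")
print(f"Sample standard deviation     : {sample_sd}")
```

The sample version is always slightly larger, and the gap shrinks as the number of data points grows.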


### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion such as range, variance, and standard deviation are used to describe how spread out the data is in a dataset. Here's a brief explanation of each measure and how they can be used:

* `Range`: The range of a dataset is simply the difference between the maximum and minimum values. It provides a rough idea of how spread out the data is, but it can be influenced by outliers. For example, if we have a dataset of test scores ranging from 50 to 100, the range would be 50 (100 - 50).

* `Variance`: The variance of a dataset measures how much the data points deviate from the mean. It is calculated by subtracting the mean from each data point, squaring the differences, summing them up, and dividing by the number of data points minus 1 (the n - 1 divisor gives the sample variance). A higher variance means the data points are more spread out from the mean. For example, consider the following dataset of test scores: [60, 70, 80, 90, 100]. The mean is 80, so the variance would be:

    [(60-80)^2 + (70-80)^2 + (80-80)^2 + (90-80)^2 + (100-80)^2] / (5-1) = 1000 / 4 = 250

* `Standard deviation`: The standard deviation is the square root of the variance. It is reported more often than the variance because it is expressed in the same units as the original data. A higher standard deviation means the data points are more spread out from the mean. For the same dataset of test scores [60, 70, 80, 90, 100], the standard deviation would be:

    sqrt(250) ≈ 15.81
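These three measures can be checked in a few lines. The following is a minimal sketch for the test scores [60, 70, 80, 90, 100] using Python's standard `statistics` module with the sample (n - 1) formulas; the variable names are chosen here for illustration.

```python
from statistics import stdev, variance

scores = [60, 70, 80, 90, 100]

# Range: difference between the largest and smallest value.
score_range = max(scores) - min(scores)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
score_variance = variance(scores)

# Sample standard deviation: square root of the sample variance.
score_sd = stdev(scores)

print(f"Range              : {score_range}")
print(f"Variance           : {score_variance}")
print(f"Standard deviation : {score_sd:.2f}")
```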

### Q6. What is a Venn diagram?

A Venn diagram is a visual representation of the relationships between different sets or groups of data. It consists of overlapping circles or ellipses, each representing a set or group, with the overlapping parts representing the elements that are shared between the sets.

### Q7. For the two given sets A = {2, 3, 4, 5, 6, 7} and B = {0, 2, 6, 8, 10}, find:
(i) A ∩ B
(ii) A ∪ B

1. A ∩ B = {2, 6}

2. A ∪ B = {0, 2, 3, 4, 5, 6, 7, 8, 10}
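Python's built-in `set` type supports both operations directly, so the answers can be verified with a short sketch:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Intersection: elements that belong to both sets.
intersection = A & B

# Union: elements that belong to at least one of the sets.
union = A | B

print(f"A ∩ B = {sorted(intersection)}")
print(f"A ∪ B = {sorted(union)}")
```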

### Q8. What do you understand about skewness in data?

Skewness is a statistical measure that describes the asymmetry of a distribution of data around its mean. Its sign shows which side of the distribution has the longer tail, and its magnitude shows how pronounced the asymmetry is.

A distribution is said to be skewed if it is not symmetric. If the tail of the distribution is longer on the left side, it is said to be negatively skewed or left-skewed, and if the tail is longer on the right side, it is said to be positively skewed or right-skewed. A distribution that is symmetric has zero skewness.

### Q9. If data is right-skewed, what will be the position of the median with respect to the mean?

If data is right-skewed, meaning it has a long tail to the right, the median will typically be less than the mean. The few large values in the right tail pull the mean upward, while the median depends only on the ordering of the values and stays closer to the bulk of the data.
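A minimal sketch can make the link between skewness and the mean/median ordering concrete. The `skewness` helper below is a hypothetical function written for this example (it computes the moment coefficient of skewness from population moments), and both samples are made-up values.

```python
from statistics import mean, median

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample: mostly small values with a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 20]

# Hypothetical symmetric sample for comparison.
symmetric = [1, 2, 3, 4, 5]

print(f"Right-skewed: skewness={skewness(right_skewed):.2f}, "
      f"mean={mean(right_skewed):.2f}, median={median(right_skewed)}")
print(f"Symmetric   : skewness={skewness(symmetric):.2f}, "
      f"mean={mean(symmetric)}, median={median(symmetric)}")
```

In the right-skewed sample the skewness is positive and the mean lies above the median; in the symmetric sample the skewness is zero and the two coincide.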

### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are two statistical measures used to describe the relationship between two variables. Here's the difference between them:

`Covariance` measures how two variables vary together around their respective means. A positive covariance indicates that when one variable is above its mean, the other tends to be above its mean as well, and when one is below its mean, the other tends to be below too. A negative covariance indicates that when one variable is above its mean, the other tends to be below its mean, and vice versa. Its magnitude depends on the units of the two variables, so it is hard to interpret on its own.

`Correlation`, on the other hand, measures the strength and direction of the linear relationship between two variables. It is the covariance divided by the product of the two standard deviations, which makes it unitless and keeps it between -1 and 1. A correlation of +1 indicates a perfect positive linear relationship (the points lie exactly on an upward-sloping line), a correlation of 0 indicates no linear relationship between the variables, and a correlation of -1 indicates a perfect negative linear relationship (the points lie exactly on a downward-sloping line).

Covariance describes the direction in which two variables move together, while correlation also measures the strength of that relationship on a standardized scale. Both are used in statistical analysis to identify patterns and relationships in data. Covariance is often used in finance to measure how the returns of two assets move together, which feeds into the risk of a portfolio. Correlation is used in a variety of fields, including finance, economics, and the social sciences, to measure the relationship between two variables and to support predictions based on that relationship.
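The two measures can be computed from their definitions. The sketch below is a minimal illustration, not a library API: `sample_covariance` and `correlation` are hypothetical helper functions, and the study-hours data is made up.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Average product of deviations from the means, using the n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    sd_x = sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    sd_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data: hours studied and quiz score for five students.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

cov_hs = sample_covariance(hours, score)
corr_hs = correlation(hours, score)

print(f"Covariance  : {cov_hs}")
print(f"Correlation : {corr_hs:.4f}")
```

The covariance is positive, showing that the two variables tend to rise together, and the correlation places the strength of that tendency on the fixed -1 to 1 scale.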

### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The formula for calculating the sample mean is:

sample mean = (sum of all values in the sample) / (number of values in the sample)

Here's an example calculation for a dataset:

Consider the following dataset of 10 values: 5, 7, 8, 6, 10, 2, 4, 9, 7, 3.

To calculate the sample mean, we first add up all the values in the sample:

5 + 7 + 8 + 6 + 10 + 2 + 4 + 9 + 7 + 3 = 61

Next, we divide the sum by the number of values in the sample, which is 10:

61 / 10 = 6.1

Therefore, the sample mean for this dataset is 6.1.
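The same calculation takes two lines of Python; the variable names are chosen here for illustration.

```python
values = [5, 7, 8, 6, 10, 2, 4, 9, 7, 3]

# Sample mean: sum of the values divided by how many values there are.
total = sum(values)
sample_mean = total / len(values)

print(f"Sum         : {total}")
print(f"Sample mean : {sample_mean}")
```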





### Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, the mean, median, and mode are all equal. The normal distribution is symmetric around its mean, so the mean sits at the center of the distribution. The median, which divides the data into two equal halves, falls at the same point because half of the probability lies on each side of the center of symmetry. The mode, which for a continuous distribution is the location of the peak of the density curve, also falls there because the bell curve has a single peak at its center. In a finite sample drawn from a normal distribution, the three values will be close to each other but not exactly identical.
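A small simulation can illustrate this. The sketch below draws a large sample with the standard library's `random.gauss` and compares the sample mean and median; the mean of 170, the standard deviation of 10, the sample size, and the seed are arbitrary assumptions made for the example. The mode is omitted because, for continuous samples, it would have to be estimated by binning the data.

```python
import random
from statistics import fmean, median

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical normally distributed sample: mean 170, standard deviation 10.
sample = [random.gauss(170, 10) for _ in range(100_000)]

sample_mean = fmean(sample)
sample_median = median(sample)

print(f"Mean   : {sample_mean:.3f}")
print(f"Median : {sample_median:.3f}")
```

With a sample this large, the two values should agree closely, and both should sit near the chosen center of 170.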

### Q13. How is covariance different from correlation?

Covariance and correlation are both measures of the relationship between two variables, but they differ in several ways:

1. Definition: Covariance measures the extent to which two variables vary together, while correlation measures the strength and direction of the linear relationship between two variables.

2. Units of measure: Covariance has units of measure that are the product of the units of the two variables being measured, while correlation is a unitless measure that ranges between -1 and 1.

3. Interpretation: A covariance value can be positive, negative, or zero. A positive covariance indicates that as one variable increases, the other variable tends to increase as well, while a negative covariance indicates that as one variable increases, the other variable tends to decrease. A covariance of zero indicates that there is no linear relationship between the two variables; it does not by itself mean they are independent, since a nonlinear dependence can still produce zero covariance. Correlation, on the other hand, is always between -1 and 1, with a positive correlation indicating a positive linear relationship between the variables, a negative correlation indicating a negative linear relationship, and a correlation of zero indicating no linear relationship.

4. Scale dependence: Covariance is dependent on the scale of the variables being measured, while correlation is not. For example, if one variable is measured in dollars and the other variable is measured in euros, the covariance between them will depend on the exchange rate used to convert between the two currencies. Correlation, however, will be unaffected by the choice of units of measurement.

5. Role in regression: In simple linear regression, the least-squares slope equals the covariance between the independent and dependent variables divided by the variance of the independent variable. The correlation is used to assess the strength of the linear relationship. The two are linked: the slope also equals the correlation multiplied by the ratio of the standard deviation of the dependent variable to that of the independent variable.
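Three of these points can be checked numerically: covariance changes when the units change while correlation does not, zero covariance does not by itself imply independence, and the simple regression slope is covariance divided by variance. The sketch below is a minimal illustration under stated assumptions; the helper functions are hypothetical and all data is made up.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Average product of deviations from the means, using the n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_covariance(xs, ys) / sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )

# Scale dependence: the same amounts expressed in dollars and in cents.
dollars = [1, 2, 3, 4, 5]
cents = [100 * d for d in dollars]
y = [2, 4, 5, 4, 5]

cov_dollars = sample_covariance(dollars, y)
cov_cents = sample_covariance(cents, y)      # 100 times larger
corr_dollars = correlation(dollars, y)
corr_cents = correlation(cents, y)           # unchanged by the rescaling

# Zero covariance without independence: v is fully determined by u (v = u**2).
u = [-2, -1, 0, 1, 2]
v = [value ** 2 for value in u]
cov_uv = sample_covariance(u, v)

# Simple linear regression slope: covariance divided by the variance of x.
slope = cov_dollars / sample_covariance(dollars, dollars)

print(f"Covariance (dollars, cents)  : {cov_dollars}, {cov_cents}")
print(f"Correlation (dollars, cents) : {corr_dollars:.4f}, {corr_cents:.4f}")
print(f"Covariance of u and u**2     : {cov_uv}")
print(f"Regression slope             : {slope}")
```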

### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can have a significant effect on measures of central tendency and dispersion. The mean is pulled towards the outlier, while the median and mode are largely unaffected. Dispersion measures such as the range, variance, and standard deviation are all inflated by an outlier.

For example, consider the following dataset of 10 values: 5, 7, 8, 6, 10, 2, 4, 9, 7, 100.

The mean of this dataset is (5+7+8+6+10+2+4+9+7+100)/10 = 158/10 = 15.8, which is heavily influenced by the outlier value of 100: nine of the ten values are 10 or less. The median, on the other hand, is the average of the two middle values when the data is ordered from smallest to largest (2, 4, 5, 6, 7, 7, 8, 9, 10, 100), which in this case is (7 + 7)/2 = 7, and it is barely affected by the outlier.

The outlier also inflates every measure of dispersion. The range is the difference between the largest and smallest values in the dataset, so it is determined entirely by the extremes: here it is 100 - 2 = 98, compared with 10 - 2 = 8 if the outlier is dropped. The variance and standard deviation are strongly affected as well, because they are built from squared deviations from the mean, so a single distant value contributes disproportionately. In this example the sample standard deviation is about 29.7 with the outlier and about 2.5 without it.

Overall, outliers can have a strong influence on measures of central tendency and dispersion, so it's important to identify and handle them appropriately in statistical analysis.
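The effect can be seen by computing each measure with and without the outlier. The sketch below is a minimal illustration using the standard `statistics` module and the sample (n - 1) formulas for variance and standard deviation; the `summarize` helper is a hypothetical function written for this example.

```python
from statistics import mean, median, stdev, variance

with_outlier = [5, 7, 8, 6, 10, 2, 4, 9, 7, 100]
without_outlier = [x for x in with_outlier if x != 100]

def summarize(values):
    """Return the main measures of central tendency and dispersion for a list."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "variance": variance(values),  # sample variance (n - 1 divisor)
        "std": stdev(values),          # sample standard deviation
    }

summary_with = summarize(with_outlier)
summary_without = summarize(without_outlier)

for name in summary_with:
    print(f"{name:<8} with outlier: {summary_with[name]:8.2f}   "
          f"without: {summary_without[name]:8.2f}")
```

The median is the only measure that stays the same; the mean, range, variance, and standard deviation all change substantially when the single value of 100 is removed.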