## Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. `Mean`: It is the arithmetic average of all the values in a dataset. It is calculated by adding up all the values in the dataset and dividing by the total number of values.

2. `Median`: It is the middle value in a dataset when the values are arranged in order. If there is an odd number of values, the median is the single value exactly in the middle. If there is an even number of values, the median is the average of the two middle values.

3. `Mode`: It is the most frequently occurring value in a dataset. It is calculated by finding the value that appears most often in the dataset. A dataset can have one or more modes, or it can have no mode if no value appears more than once.
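
The three definitions above can be sketched with Python's standard `statistics` module. The `values` list is a hypothetical sample chosen only for illustration: it has an even number of entries, so the median is the average of the two middle values, and two values tie for most frequent, so it has two modes.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [4, 8, 6, 5, 3, 8, 9, 5]

mean_value = statistics.mean(values)      # sum of the values / number of values
median_value = statistics.median(values)  # even count: average of the two middle values
modes = statistics.multimode(values)      # every value tied for the highest frequency

print("Mean:", mean_value)
print("Median:", median_value)
print("Modes:", sorted(modes))
```

`statistics.multimode` returns a list, so a dataset with several modes is reported in full rather than reduced to a single value.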

## Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The `mean`, `median`, and `mode` are three measures of central tendency used to describe the central or typical value of a dataset.

- The `mean` is calculated by adding up all the values in the dataset and dividing by the total number of values. It is sensitive to extreme values or outliers, so it may not be representative of a dataset that contains them.

- The `median` is the middle value in a dataset when the values are arranged in order. It is less sensitive to extreme values or outliers than the mean, making it more representative of the dataset in skewed distributions.

- The `mode` is the most frequently occurring value in a dataset. It is useful when the goal is to determine the most common value in the dataset.

- The selection of a measure of central tendency depends on the nature of the dataset and the research question. If the dataset is symmetrical with a single peak, then the `mean`, `median`, and `mode` will be close to each other. In skewed distributions, the `median` is often preferred as a measure of central tendency because it is less affected by `outliers`.

## Q3. Calculate the three measures of central tendency for the given height data:
## [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [5]:
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [8]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
from scipy import stats
import numpy as np

In [9]:
# calculate the mean
mean = np.mean(data)
print("Mean of the data:", mean)

# calculate the median
median = np.median(data)
print("Median of the data:", median)

# calculate the mode
mode = stats.mode(data)
print("Mode of the data:", mode.mode)

Mean of the data: 177.01875
Median of the data: 177.0
Mode of the data: [177.]


## Q4. Find the standard deviation for the given data:
## [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [13]:
data =[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [14]:
print("The Standard Deviation of given Data :", np.std(data))

The Standard Deviation of given Data : 1.7885814036548633
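
Note that `np.std` uses `ddof=0` by default, so the value above is the population standard deviation (dividing by n). The sketch below, which uses only the standard library, compares it with the sample standard deviation (dividing by n - 1), which is what `np.std(data, ddof=1)` would compute. Which one is appropriate depends on whether the heights are treated as a full population or as a sample, and the question does not say.

```python
import math
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

# Population standard deviation: divide the squared deviations by n.
population_std = statistics.pstdev(data)

# Sample standard deviation: divide the squared deviations by n - 1.
sample_std = statistics.stdev(data)

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

The sample version is always slightly larger, because it divides the same sum of squared deviations by a smaller number.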


## Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

`Measures of dispersion`, such as `range`, `variance`, and `standard deviation`, are used to describe the spread or variability of a dataset. They help to provide information about how far the data points are from the central tendency of the dataset.

1. `Range` is the difference between the largest and smallest values in a dataset. It is easy to calculate but depends only on the two extreme values, so it is very sensitive to outliers. For example, if test scores run from 60 to 100, the range is 40.

2. `Variance` is a more informative measure of dispersion because it takes into account the deviation of every data point from the mean. The sample variance is calculated by squaring the difference between each data point and the mean, summing these values, and dividing by the number of data points minus one. For example, a set of test scores with a mean of 80 and a variance of 100 is more spread out than one with the same mean and a variance of 25.

3. `Standard deviation` is the square root of the variance. It is used more often than variance because it is expressed in the same units as the original data and is easier to interpret. For example, if test scores are roughly normally distributed with a mean of 80 and a standard deviation of 10, about 68% of the scores fall within 10 points of the mean (between 70 and 90).
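
As a rough illustration of the three measures, the sketch below computes them for a small set of test scores. The `scores` list is hypothetical, picked so that the scores run from 60 to 100 with a mean of 80, as in the examples above.

```python
import statistics

# Hypothetical test scores (assumed for illustration): minimum 60, maximum 100, mean 80.
scores = [60, 72, 75, 80, 85, 88, 100]

score_range = max(scores) - min(scores)         # largest value minus smallest value
sample_variance = statistics.variance(scores)   # squared deviations summed, divided by n - 1
sample_std = statistics.stdev(scores)           # square root of the sample variance

print("Range:", score_range)
print("Sample variance:", sample_variance)
print("Sample standard deviation:", round(sample_std, 2))
```

The standard deviation is in score points, like the data, whereas the variance is in squared points.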

## Q6. What is a Venn diagram?

A `Venn diagram` is a visual representation of sets or groups of data that shows their relationships and overlaps. The diagram consists of overlapping closed curves, usually circles, each representing a set. In a standard Venn diagram the size of a circle carries no meaning; only area-proportional variants scale the circles to the number of elements. The regions where circles overlap contain the elements that belong to more than one set, and the region common to all circles contains the elements that belong to every set. Venn diagrams are commonly used in mathematics, statistics, logic, and other fields to illustrate set operations, logical relationships, and overlaps in data.

## Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10).
## Find:
## $$ (i)  A \cap B $$
## $$ (ii) A \cup B $$

## $$ (i) A \cap B = (2,6) $$

## $$ (ii) A \cup B = (0,2,3,4,5,6,7,8,10) $$
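
The two results can be checked with Python's built-in `set` type, where `&` keeps only the elements common to both sets (intersection) and `|` collects every element that appears in either set (union). This is only a quick verification sketch.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements that are in both A and B
union = A | B         # elements that are in A, in B, or in both

print("Intersection:", sorted(intersection))
print("Union:", sorted(union))
```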

## Q8. What do you understand about skewness in data?

`Skewness` in data refers to the degree of asymmetry of the distribution of the data. In a symmetrical distribution, the data is evenly distributed around the mean, with equal frequencies on either side. In a skewed distribution, the data is not evenly distributed around the mean, and the tail of the distribution is longer on one side than the other. Skewness can be positive (right-skewed), negative (left-skewed), or zero (no skewness). Positive skewness means that the tail of the distribution is longer on the right side, while negative skewness means that the tail is longer on the left side. Skewness is an important measure in data analysis because it describes the shape of the distribution and indicates how the mean and median are likely to sit relative to each other.
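
One common way to put a number on skewness is the moment-based coefficient: the third central moment divided by the second central moment raised to the power 1.5. The sketch below implements that formula directly. The helper name `skewness` and the three small datasets are assumptions made for illustration, and other definitions of skewness exist.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical datasets chosen to show each sign of skewness.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]    # long tail on the right
left_skewed = [-x for x in right_skewed]    # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]                 # balanced around the mean

print("Right-skewed:", skewness(right_skewed))
print("Left-skewed:", skewness(left_skewed))
print("Symmetric:", skewness(symmetric))
```

A positive result corresponds to a longer right tail, a negative result to a longer left tail, and a result of zero to a symmetric sample.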

## Q9. If a dataset is right-skewed, what will be the position of the median with respect to the mean?

If a dataset is right-skewed, the median will typically be less than the mean. In a right-skewed distribution, the tail is longer on the right side, which pulls the mean in that direction. The median is the value that divides the data into two equal halves, while the mean is the arithmetic average of all the values. The extreme values in the right tail raise the mean, but they have little effect on the median, because the median depends only on the order of the values and not on how large the largest ones are. Therefore, the mean is usually higher than the median in a right-skewed distribution.
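
A small numeric sketch makes the ordering concrete. The `incomes` list is hypothetical (values in thousands) and has one large value forming the right tail.

```python
import statistics

# Hypothetical incomes in thousands; the single large value creates a right tail.
incomes = [28, 30, 32, 35, 38, 40, 45, 150]

mean_income = statistics.mean(incomes)      # pulled upward by the value 150
median_income = statistics.median(incomes)  # average of the two middle values, 35 and 38

print("Mean:", mean_income)
print("Median:", median_income)
```

Here the mean is well above the median, and replacing 150 with an even larger value would raise the mean further while leaving the median unchanged.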

## Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

`Covariance` and `correlation` are both measures of the relationship between two variables, but they have some important differences.

`Covariance` measures how much two variables change together. Its sign gives the direction of the linear relationship between the two variables. Covariance can be positive, negative, or zero. A positive covariance indicates a positive relationship, meaning that as one variable increases, the other variable tends to increase as well. A negative covariance indicates a negative relationship, meaning that as one variable increases, the other variable tends to decrease. A covariance of zero indicates that there is no linear relationship between the two variables, although a nonlinear relationship may still exist. The magnitude of the covariance depends on the units of the variables, so on its own it does not tell us how strong the relationship is.

`Correlation`, on the other hand, measures both the strength and direction of the linear relationship between two variables. It does so by dividing the covariance by the product of the two standard deviations, which standardizes it. This means that correlation is a unitless measure that ranges from -1 to 1. A correlation of +1 indicates a perfect positive linear relationship, a correlation of -1 indicates a perfect negative linear relationship, and a correlation of 0 indicates no linear relationship. Correlation is easier to interpret than covariance because it is not affected by the scale of the variables.

In `statistical analysis`, `covariance` and `correlation` are used to study the relationship between two variables. `Covariance` is used to determine the direction of the relationship between two variables and to calculate the degree to which they change together. `Correlation` is used to measure the strength and direction of the linear relationship between two variables, and to determine the degree to which one variable can be predicted from the other variable. `Correlation` is preferred over `covariance` when the scale of the variables is different or when the researcher wants to compare the relationship between two different pairs of variables.
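
The effect of scale can be sketched in a few lines. The helper functions and the height and weight values below are assumptions made for illustration. The example computes the sample covariance and the Pearson correlation for heights in centimetres, then again after converting the same heights to metres.

```python
import math

def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

# Hypothetical measurements, chosen only for illustration.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 90]
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit

cov_cm = sample_covariance(heights_cm, weights_kg)
cov_m = sample_covariance(heights_m, weights_kg)
corr_cm = pearson_correlation(heights_cm, weights_kg)
corr_m = pearson_correlation(heights_m, weights_kg)

print("Covariance (cm, kg):", cov_cm)
print("Covariance (m, kg):", cov_m)
print("Correlation (cm, kg):", corr_cm)
print("Correlation (m, kg):", corr_m)
```

Changing the unit of height divides the covariance by 100 but leaves the correlation unchanged, which is why correlation is the usual choice for comparing relationships across variables measured on different scales.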

## Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The formula for calculating the sample mean is:

mean = (sum of all the values in the dataset) / (number of values in the dataset)

For example, let's calculate the sample mean for the following dataset:

[10, 12, 15, 18, 20]

mean = (10 + 12 + 15 + 18 + 20) / 5
= 75 / 5
= 15

Therefore, the sample mean for this dataset is 15.
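
In symbols, writing the sample values as $x_1, \dots, x_n$ (notation introduced here for illustration), the same formula and calculation read:

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{10 + 12 + 15 + 18 + 20}{5} = \frac{75}{5} = 15 $$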

## Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a `normal distribution`, the `mean`, `median`, and `mode` are all equal. This is because a normal distribution is symmetric around its mean, and the mean is the point around which the data are balanced. The median is also at this point, since symmetry places half of the data on each side of it. And since the distribution has a single peak at its center, the mode is at the same point as the mean and median. Therefore, for a normal distribution, the measures of central tendency (mean, median, and mode) are all equal. In a finite sample drawn from a normal distribution, the three values will be close to each other rather than exactly equal.

## Q13. How is covariance different from correlation?

`Covariance` and `correlation` are both measures of the relationship between two variables, but they are different in their interpretation and scale.

`Covariance` is a measure of how two variables vary together. It measures the direction of the relationship (positive or negative) and the magnitude of the relationship. However, covariance is affected by the scale of the variables, which makes it difficult to compare the strength of relationships between variables with different scales. A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that one variable tends to increase when the other decreases.

`Correlation`, on the other hand, is a standardized measure of the linear relationship between two variables. It is expressed on a scale from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. Correlation is not affected by the scale of the variables, which makes it easier to compare the strength of relationships between variables with different scales.

In summary, covariance indicates the direction of the relationship between two variables, but its magnitude is difficult to interpret and compare because it depends on the scale of the variables. Correlation, on the other hand, measures the linear relationship between two variables on a standardized scale, which makes it easier to interpret and compare.

## Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

`Outliers` can significantly affect measures of central tendency and dispersion in a dataset.

For example, consider a dataset of ages of employees in a company: [22, 25, 23, 24, 27, 28, 30, 35, 40, 45, 65]. The mean age is about 33.09, the median is 28, and the sample standard deviation is about 12.87. If we add an outlier value of 100 to the dataset, the mean age increases to about 38.67 and the sample standard deviation increases to about 22.88, while the median only moves to 29. This shows that a single outlier can greatly affect the mean and the standard deviation, whereas the median is much more resistant.

Outliers can also affect the interpretation of results obtained from statistical tests. For example, if we test whether the mean age of employees in the company is 30, the outlier value of 100 shifts the sample mean and inflates the standard deviation. Both changes alter the test statistic and can lead to a wrong conclusion about the null hypothesis. Therefore, it is important to identify and handle outliers appropriately in statistical analysis.
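
The summary statistics for the age example can be recomputed with the standard library, as in the sketch below. It assumes the sample standard deviation (dividing by n - 1), and the helper name `summarize` is introduced here for illustration.

```python
import statistics

ages = [22, 25, 23, 24, 27, 28, 30, 35, 40, 45, 65]
ages_with_outlier = ages + [100]  # same data plus the outlier value of 100

def summarize(values):
    """Return the mean, median, and sample standard deviation of the values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

before = summarize(ages)
after = summarize(ages_with_outlier)

for label, summary in (("Without outlier", before), ("With outlier", after)):
    print(label)
    for name, value in summary.items():
        print(f"  {name}: {value:.2f}")
```

Comparing the two summaries shows how far each statistic moves when the single value 100 is added, with the median moving far less than the mean or the standard deviation.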