### Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. Mean: The mean is the most commonly used measure of central tendency. It is the average of all the values in a dataset. To calculate the mean, you add up all the values and divide the sum by the total number of values.

2. Median: The median is the middle value in a dataset when the values are arranged in numerical order. If there is an even number of values, the median is the average of the two middle values.

3. Mode: The mode is the value that appears most frequently in a dataset. It is possible for a dataset to have more than one mode (if multiple values appear with equal frequency) or no mode (if no value appears more than once).

### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are different measures of central tendency, which provide information about the center or typical value of a dataset.

The mean is calculated by adding up all the values in the dataset and dividing by the total number of values. It is most informative when the data is roughly symmetric, as in a normal distribution, where the values are spread evenly around the mean. The mean is sensitive to extreme values (outliers): a single very large or very small value can pull it noticeably away from the bulk of the data.

The median is the middle value in the dataset when the values are arranged in numerical order. Half of the values in the dataset are above the median, and half are below it. The median is a good measure of central tendency when the dataset is skewed or has outliers, as it is less sensitive to these extreme values.

The mode is the value that occurs most frequently in the dataset. It is used to describe the most common value in the dataset. The mode can be useful for categorical data or when the data has a high frequency of a particular value.

All three measures can be used to describe the central tendency of a dataset, but the appropriate measure depends on the nature of the data and the research question. In general, the mean is used for normally distributed data, the median for skewed data or when there are outliers, and the mode for categorical data or when there is a high frequency of a particular value.
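As a rough illustration of these guidelines, the sketch below uses Python's standard `statistics` module on two made-up datasets. The names `incomes` and `colors` and their values are hypothetical, chosen only to show a skewed numeric sample and a categorical sample.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed numeric data: one large value dominates.
incomes = [32, 35, 38, 40, 41, 45, 250]
income_mean = mean(incomes)      # pulled upward by the value 250
income_median = median(incomes)  # stays with the bulk of the data

# Hypothetical categorical data: mean and median are not defined here,
# so the mode is the only one of the three measures that applies.
colors = ["red", "blue", "red", "green", "red", "blue"]
color_mode = mode(colors)

print("Mean income:", income_mean)
print("Median income:", income_median)
print("Most common color:", color_mode)
```

Here the median sits among the typical incomes while the mean is dragged toward the single large value, which is why the median is usually reported for skewed data.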

### Q3. Calculate the three measures of central tendency for the given height data:
        [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
heights = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [15]:
# Mean
import numpy as np
mean = np.mean(heights)
print("Mean is:",mean)


Mean is: 177.01875


In [14]:
# Median
import numpy as np
median=np.median(heights)
print("Median is:",median)


Median is: 177.0


In [16]:
#Mode
from scipy.stats import mode
heights_mode = mode(heights,keepdims=False)
print(f'Mode for given height data is {heights_mode}')

Mode for given height data is ModeResult(mode=177.0, count=3)
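`scipy.stats.mode` reports a single value, but in this data 178 also appears three times, so the dataset has two modes. One way to list every mode is `statistics.multimode` from the Python standard library; the sketch below is an alternative check, not part of the original solution.

```python
from statistics import multimode

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value that shares the highest count.
all_modes = multimode(heights)
print("All modes:", sorted(all_modes))
```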


### Q4. Find the standard deviation for the given data:
        [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [23]:
import numpy as np
data= [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
data_std = np.std(data)
print(f'Standard Deviation for given data is {data_std}')

Standard Deviation for given data is 1.7885814036548633
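By default `np.std` divides by n, which gives the population standard deviation. If the heights are treated as a sample from a larger population, the sample standard deviation (dividing by n - 1) may be more appropriate. The sketch below compares the two using the standard library, assuming the same data as above.

```python
from statistics import pstdev, stdev

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

# Population standard deviation: divides by n, like np.std(data).
population_std = pstdev(data)

# Sample standard deviation: divides by n - 1, like np.std(data, ddof=1).
sample_std = stdev(data)

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

The sample value is always slightly larger, and the gap shrinks as the number of observations grows.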


### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion, such as range, variance, and standard deviation, are used to describe the spread or variability of a dataset. They provide information about how much the individual data points in the dataset differ from the central tendency measure, such as the mean or median.

The range is the simplest measure of dispersion and is calculated as the difference between the largest and smallest values in the dataset. It is a useful measure to give an idea of how much the data varies but can be affected by extreme values.

The variance is a measure of how much the individual data points deviate from the mean of the dataset. It is calculated by taking the sum of the squared differences between each value and the mean, divided by the total number of values in the dataset (the population variance; for a sample, the sum is divided by n - 1 instead). The variance is a useful measure of variability, but its units are the square of the data's units, so it is harder to interpret directly.

The standard deviation is the square root of the variance and is used more commonly as it has the same units as the original data. It provides a measure of how much the individual data points deviate from the mean in the same units as the data itself. A larger standard deviation indicates that the data points are more spread out from the mean, while a smaller standard deviation indicates that the data points are closer to the mean.

For example, consider the following dataset: [3, 5, 7, 8, 10]. The mean of this dataset is 6.6. To calculate the range, we subtract the smallest value (3) from the largest value (10) and get a range of 7. The variance is calculated by subtracting the mean from each value, squaring the differences (12.96, 2.56, 0.16, 1.96, 11.56), summing them (29.2), and dividing by the total number of values (n = 5). This gives a variance of 5.84. The standard deviation is the square root of the variance, which is approximately 2.42. Therefore, the dataset has a mean of 6.6, a range of 7, a variance of 5.84, and a standard deviation of approximately 2.42, meaning a typical value lies roughly 2.4 units from the mean.
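The arithmetic for this example can be checked with a few lines of Python. This is a minimal sketch using the population formulas (dividing by n) from the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, pvariance, pstdev

values = [3, 5, 7, 8, 10]

value_range = max(values) - min(values)  # largest minus smallest
value_mean = mean(values)
value_variance = pvariance(values)       # mean of squared deviations (divides by n)
value_std = pstdev(values)               # square root of the population variance

print("Range:", value_range)
print("Mean:", value_mean)
print("Variance:", value_variance)
print("Standard deviation:", round(value_std, 2))
```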

### Q6. What is a Venn diagram?

A Venn diagram is a visual representation of the relationships between sets of elements or groups of objects. It is named after John Venn, a British mathematician who introduced the concept in the 1880s. A Venn diagram is usually composed of overlapping circles or other shapes, with each circle representing a set or a group of objects, and the overlapping regions representing the elements or objects that are common to both sets or groups.

The Venn diagram is commonly used in mathematics, statistics, logic, and computer science to illustrate set theory concepts and relationships between sets. It can be used to solve problems involving set operations, such as union, intersection, complement, and symmetric difference.

### Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
        (i) A ∩ B
        (ii) A ∪ B

In [25]:
A = {2,3,4,5,6,7}
B = {0,2,6,8,10}

In [26]:
#Union
A.union(B)

{0, 2, 3, 4, 5, 6, 7, 8, 10}

In [27]:
#Intersection
A.intersection(B)

{2, 6}
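Python sets also support operator forms of these operations, along with the difference and symmetric difference mentioned in Q6. The sketch below repeats the two results with operators and adds the two extra operations for the same sets.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B          # elements in both A and B
union = A | B                 # elements in A or B (or both)
only_in_a = A - B             # elements in A that are not in B
symmetric_difference = A ^ B  # elements in exactly one of the two sets

print("A ∩ B:", sorted(intersection))
print("A ∪ B:", sorted(union))
print("A - B:", sorted(only_in_a))
print("A Δ B:", sorted(symmetric_difference))
```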

### Q8. What do you understand about skewness in data?

Skewness is a measure of the asymmetry of a probability distribution or dataset. It describes how far the shape of the distribution departs from symmetry about its center.

A dataset is said to be skewed if it is not symmetrical, meaning that the data is not evenly distributed around a central value or mean. In a positively skewed dataset, the tail of the distribution is longer on the right-hand side, and the majority of the data is concentrated on the left-hand side. In a negatively skewed dataset, the tail of the distribution is longer on the left-hand side, and the majority of the data is concentrated on the right-hand side.

Skewness is commonly measured using the skewness coefficient, which is a numerical value that indicates the degree and direction of skewness in a dataset. The skewness coefficient can take positive or negative values, with positive values indicating a positive skew (right-skewed distribution), and negative values indicating a negative skew (left-skewed distribution). A perfectly symmetric distribution has a skewness coefficient of 0, although a coefficient of 0 does not by itself guarantee symmetry.

Skewness is an important aspect to consider when analyzing data, as it can affect the choice of statistical tests and the interpretation of results. For example, in a positively skewed dataset the mean is typically greater than the median and mode, while in a negatively skewed dataset the mean is typically less than the median and mode. Thus, understanding the skewness of a dataset is crucial for selecting appropriate statistical methods and interpreting the results accurately.
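To make the sign of the coefficient concrete, the sketch below computes the moment-based (Fisher-Pearson) skewness by hand for two made-up datasets. The helper `skewness` and the sample values are illustrative assumptions; the result should agree with `scipy.stats.skew` at its default settings, though that is worth confirming against the SciPy documentation.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    center = mean(values)
    m2 = sum((x - center) ** 2 for x in values) / n  # population variance
    m3 = sum((x - center) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data with a long right tail, and its mirror image.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
left_skewed = [-x for x in right_skewed]

right_skew = skewness(right_skewed)
left_skew = skewness(left_skewed)

print("Right-skewed:", right_skew, "mean", mean(right_skewed), "median", median(right_skewed))
print("Left-skewed:", left_skew, "mean", mean(left_skewed), "median", median(left_skewed))
```

The right-skewed sample gives a positive coefficient with the mean above the median, and the mirrored sample reverses both.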

### Q9. If data is right-skewed, what will be the position of the median with respect to the mean?

If data is right-skewed, the median will typically be lower than the mean. This is because the skewed data has a longer tail on the right-hand side, with extreme values pulling the mean towards the right. In contrast, the median is the value that separates the lower half of the data from the upper half, so it depends on the order of the values and is far less affected by how extreme the largest ones are.

![ice_screenshot_20230311-132317.png](attachment:47af2b23-4771-4050-925d-2a63cd37671e.png)

### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are both measures of the relationship between two variables, but they differ in their scaling and interpretation.

Covariance measures the degree to which two variables vary together. Its sign gives the direction of the linear relationship, but its magnitude depends on the units of both variables, so it does not directly indicate strength. A positive covariance means that the two variables tend to increase or decrease together, while a negative covariance means that one variable tends to increase when the other decreases.

Correlation, on the other hand, measures the strength and direction of the linear relationship between two variables, but it is scaled to always be between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, a correlation of -1 indicates a perfect negative linear relationship, and a correlation of 0 indicates no linear relationship.

Both covariance and correlation are used in statistical analysis to understand the relationship between two variables. Covariance is used to identify the direction of the relationship, while correlation is used to quantify the strength of the relationship. In particular, correlation is often preferred over covariance because it is standardized and can be more easily compared across datasets with different units of measurement.

In addition, correlation is used to test for the significance of the relationship between two variables, and to identify whether the relationship is likely to have occurred by chance. This is typically done using hypothesis testing and calculating a p-value.

Overall, both covariance and correlation are useful measures in statistical analysis for identifying and quantifying the relationship between two variables. However, correlation is generally preferred due to its standardized interpretation and ease of comparison across datasets.
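The effect of units can be seen in a small sketch. The helpers `covariance` and `correlation` below are written by hand from the standard sample formulas, and the height and weight values are made up for illustration.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements; the same heights expressed in two units.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]
weights_kg = [52, 58, 69, 77, 88]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)

print("Covariance (cm, kg):", cov_cm)
print("Covariance (m, kg):", cov_m)
print("Correlation (cm, kg):", corr_cm)
print("Correlation (m, kg):", corr_m)
```

Switching from centimetres to metres shrinks the covariance by a factor of 100, while the correlation is unchanged. In practice `np.cov` and `np.corrcoef` compute the same quantities.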

### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The formula for calculating the sample mean is:

x̄ = (x₁ + x₂ + ... + xₙ) / n

where x₁, x₂, ..., xₙ are the values in the dataset, and n is the number of values in the sample.

Consider the following 5 values, sampled from a larger dataset:

Sample: 5, 6, 7, 8, 9

x̄ = (5 + 6 + 7 + 8 + 9) / 5

x̄ = 35 / 5

x̄ = 7
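The same calculation can be written as a short function. The name `sample_mean` is an illustrative choice; `statistics.mean` or `np.mean` would give the same result.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("sample_mean requires at least one value")
    return sum(values) / len(values)

sample = [5, 6, 7, 8, 9]
x_bar = sample_mean(sample)
print("Sample mean:", x_bar)
```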

### Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, the mean, median, and mode are all equal, and they are located at the center of the distribution.

This is because a normal distribution is a symmetric distribution, with an equal number of observations on both sides of the mean. Therefore, the middle observation of the dataset (i.e., the median) is also equal to the mean, which represents the average of all the observations. Additionally, since the normal distribution is a unimodal distribution (i.e., it has one peak), the mode is also equal to the mean and median.

Thus, for a normal distribution, the measures of central tendency (i.e., mean, median, and mode) are all equal and located at the center of the distribution. This makes the normal distribution a convenient distribution to work with, as it simplifies calculations and interpretations of statistical analyses.

![distrubution.png](attachment:c8541a5e-61b2-4fae-a754-9e01f685f60c.png)
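This relationship can be checked empirically on simulated data. The sketch below draws values from a normal distribution with an assumed mean of 50 and standard deviation of 10 (arbitrary choices) and compares the sample mean and median. In a finite sample they are close rather than exactly equal.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the run is repeatable

# 10,000 draws from a normal distribution with mean 50 and standard deviation 10.
sample = [random.gauss(50, 10) for _ in range(10_000)]

sample_mean = mean(sample)
sample_median = median(sample)

print("Mean:", sample_mean)
print("Median:", sample_median)
print("Difference:", abs(sample_mean - sample_median))
```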

### Q13. How is covariance different from correlation?

Covariance and correlation are both measures of the relationship between two variables, but they differ in their scaling and interpretation.

Covariance measures the degree to which two variables vary together. Its sign gives the direction of the linear relationship, but its magnitude depends on the units of both variables, so it does not directly indicate strength. A positive covariance means that the two variables tend to increase or decrease together, while a negative covariance means that one variable tends to increase when the other decreases.

Correlation, on the other hand, measures the strength and direction of the linear relationship between two variables, but it is scaled to always be between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, a correlation of -1 indicates a perfect negative linear relationship, and a correlation of 0 indicates no linear relationship.

While both covariance and correlation are used to describe the relationship between two variables, correlation is often preferred over covariance because it is standardized and can be more easily compared across datasets with different units of measurement. Because covariance is not scaled, the magnitude of the relationship is difficult to read from it. For example, a covariance of 1000 could come from a strong relationship or simply from variables measured on large scales, so it cannot be judged large or small without knowing the spread of each variable.

Overall, covariance and correlation are both useful measures in statistical analysis for identifying and quantifying the relationship between two variables. However, correlation is generally preferred due to its standardized interpretation and ease of comparison across datasets.






### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers are extreme values that fall far outside of the range of typical values in a dataset. Outliers can have a significant impact on both measures of central tendency and measures of dispersion.

In terms of measures of central tendency, outliers can pull the mean away from the rest of the data, so that it no longer represents a typical value. This is because the mean takes into account the magnitude and sign of every observation. For example, consider a dataset of salaries for a company, where the majority of employees earn between $50,000 and $100,000 per year. If one employee earns $1 million per year, the mean salary will be significantly higher than the salary of a typical employee.

On the other hand, outliers have less of an effect on measures of central tendency that are less sensitive to extreme values, such as the median. The median is the middle value in a dataset, and is less affected by extreme values than the mean.

In terms of measures of dispersion, outliers can increase the range, variance, and standard deviation of a dataset. The range depends only on the two most extreme values, so a single outlier changes it directly. The variance and standard deviation are based on the squared distance between each observation and the mean, so an observation far from the mean contributes disproportionately to them.

For example, consider a dataset of exam scores for a class, where the majority of scores range from 60 to 80, but one student scores 100. The range of the dataset will be much larger than the typical range of scores, as it will now be from 60 to 100. Additionally, the variance and standard deviation will be increased, as the distance between the outlier and the mean will be much larger than the typical distance between scores and the mean.

Overall, outliers can have a significant impact on both measures of central tendency and measures of dispersion, and should be carefully considered when analyzing data.
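The salary example can be sketched in code. The specific salary values below are hypothetical, chosen to fall in the $50,000 to $100,000 band described above, with one $1 million outlier added.

```python
from statistics import mean, median, pstdev

# Hypothetical salaries in the typical band, then the same list plus one outlier.
salaries = [50_000, 60_000, 70_000, 80_000, 100_000]
with_outlier = salaries + [1_000_000]

mean_before, mean_after = mean(salaries), mean(with_outlier)
median_before, median_after = median(salaries), median(with_outlier)
range_before = max(salaries) - min(salaries)
range_after = max(with_outlier) - min(with_outlier)
std_before, std_after = pstdev(salaries), pstdev(with_outlier)

print("Mean:", mean_before, "->", round(mean_after, 2))
print("Median:", median_before, "->", median_after)
print("Range:", range_before, "->", range_after)
print("Standard deviation:", round(std_before), "->", round(std_after))
```

The mean, range, and standard deviation all jump sharply, while the median moves only from 70,000 to 75,000.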