## Q1. What are the three measures of central tendency?

## Ans:

1. Mean: The mean, also known as the average, is calculated by summing up all the values in a data set and then dividing that sum by the number of data points. It provides a measure of the central value of a data set.

2. Median: The median is the middle value in a data set when the data points are arranged in ascending or descending order. If there is an even number of data points, the median is the average of the two middle values. The median is less sensitive to extreme outliers compared to the mean.

3. Mode: The mode is the value that appears most frequently in a data set. A data set can have one mode (unimodal) if one value occurs more frequently than any other, or it can have multiple modes (multimodal) if two or more values have the same highest frequency. The mode is especially useful for categorical or discrete data.

## Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

## Ans:

1. Mean:
        Calculation: The mean is calculated by summing up all the values in a dataset and then dividing that sum by the number of data points.\
        Use: The mean represents the arithmetic average of the dataset. It is sensitive to the value of every data point and is most appropriate for datasets with a roughly symmetrical and bell-shaped distribution. However, it can be heavily influenced by extreme outliers.

2. Median:
        Calculation: The median is the middle value in a dataset when the data points are arranged in ascending or descending order. If there is an even number of data points, the median is the average of the two middle values.\
        Use: The median is less affected by extreme values (outliers) than the mean. It is useful when dealing with skewed data distributions, as it reflects the value that divides the data into two equal halves. The median is also appropriate for ordinal or interval data.

3. Mode:
        Calculation: The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal) if one value occurs more frequently than any other, or it can have multiple modes (multimodal) if two or more values have the same highest frequency.\
        Use: The mode is commonly used for categorical or discrete data and can provide insights into the most common category or value in the dataset. Unlike the mean and median, the mode can be used with nominal data, where values have no inherent numerical order.
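The contrast between the three measures can be sketched with Python's standard `statistics` module. The numbers and the `commute` list below are hypothetical values chosen for illustration, not data from this assignment.

```python
from statistics import mean, median, mode

# Hypothetical values with one extreme observation (40).
values = [2, 3, 3, 4, 5, 5, 5, 6, 40]

print("mean:  ", round(mean(values), 2))  # pulled upward by the extreme value
print("median:", median(values))          # middle value of the sorted data
print("mode:  ", mode(values))            # most frequent value

# The mode also works for nominal data, where mean and median are undefined.
commute = ["bus", "car", "car", "bike", "car"]
print("most common commute:", mode(commute))
```

The single extreme value moves the mean well above the median and mode, which both stay at the bulk of the data.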

## Q3. Calculate the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

## Ans:

In [1]:
import numpy as np
from scipy import stats

In [2]:
values = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
# Mean
mean = np.mean(values)
# Median
median = np.median(values)
# Mode
mode = stats.mode(values)
print('Mean of the data:', mean)
print('Median of the data:', median)
print('Mode of the data:', mode)

Mean of the data: 177.01875
Median of the data: 177.0
Mode of the data: ModeResult(mode=array([177.]), count=array([3]))
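
Note that 178 also occurs three times in this dataset, so the data are actually bimodal; `stats.mode` reports only one of the tied values (177 here). A minimal sketch for listing every mode, assuming Python 3.8+ where `statistics.multimode` is available:

```python
from statistics import multimode

values = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
          178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value tied for the highest frequency.
modes = sorted(multimode(values))
print("Modes of the data:", modes)
```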




## Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

## Ans:

In [3]:
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
# np.std uses ddof=0 by default, i.e. the population standard deviation
std = np.std(data)
print('Standard Deviation:', std)

Standard Deviation: 1.7885814036548633
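
The value above is the population standard deviation (divide by n). If the heights are treated as a sample from a larger population, the sample standard deviation (divide by n - 1) may be the better choice. A short sketch of both, using NumPy's `ddof` parameter:

```python
import numpy as np

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = np.std(data)        # ddof=0: divide by n
sample_std = np.std(data, ddof=1)    # ddof=1: divide by n - 1

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the number of data points grows.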


## Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

## Ans:

1. Range:
        Calculation: The range is the simplest measure of dispersion and is calculated by subtracting the minimum value from the maximum value in the dataset.\
        Use: The range provides a basic understanding of how far apart the extreme values in the dataset are. It's easy to calculate but can be sensitive to outliers.\

    Example: Consider a dataset of exam scores for a class of students:
    Scores: 65, 70, 75, 80, 95
    Range = Maximum value (95) - Minimum value (65) = 30

2. Variance:
        Calculation: The variance is the average of the squared differences between each data point and the mean. The population variance divides the sum of squared differences by the number of data points n; the sample variance divides by n - 1 instead.\
        Use: Variance quantifies the overall spread or dispersion of the data points. A higher variance indicates greater variability among the data points.

    Example: Using the same exam scores dataset (population variance):
    Mean = (65 + 70 + 75 + 80 + 95) / 5 = 77
    Variance = [(65 - 77)^2 + (70 - 77)^2 + (75 - 77)^2 + (80 - 77)^2 + (95 - 77)^2] / 5 = (144 + 49 + 4 + 9 + 324) / 5 = 530 / 5 = 106

3. Standard Deviation:
        Calculation: The standard deviation is the square root of the variance. It provides a more interpretable measure of dispersion that is in the same units as the original data.\
        Use: Standard deviation is commonly used because it gives a sense of the typical "distance" between data points and the mean. A higher standard deviation indicates greater variability.

    Example: Using the same exam scores dataset:
    Standard Deviation = √Variance = √106 ≈ 10.30
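
The three measures for the exam scores can be checked with a few lines of Python. This sketch uses the population formulas (divide by n) from the standard `statistics` module:

```python
from statistics import mean, pstdev, pvariance

scores = [65, 70, 75, 80, 95]

score_range = max(scores) - min(scores)  # maximum minus minimum
avg = mean(scores)                       # sum of scores / number of scores
variance = pvariance(scores)             # sum of squared deviations / n
std_dev = pstdev(scores)                 # square root of the variance

print("Range:", score_range)
print("Mean:", avg)
print("Variance:", variance)
print("Standard deviation:", round(std_dev, 2))
```

Replacing `pvariance` and `pstdev` with `variance` and `stdev` from the same module would give the sample versions instead.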

## Q6. What is a Venn diagram?

## Ans:

A Venn diagram is a graphical representation used to illustrate the relationships between sets or groups of objects, elements, or concepts. It was developed by the British mathematician and logician John Venn in the late 19th century. Venn diagrams are composed of overlapping circles, each circle representing a set, and the overlapping areas showing the intersections between sets. They are commonly used in mathematics, logic, statistics, and various fields to visualize and understand the relationships and commonalities between different sets.

## Q7. For the two given sets 
A = {2,3,4,5,6,7} and B = {0,2,6,8,10}. Find:\
(i) A $\cap$ B\
(ii) A $\cup$ B

## Ans:

(i) A $\cap$ B = {2,6}\
(ii) A $\cup$ B = {0,2,3,4,5,6,7,8,10}
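
The same result can be reproduced with Python's built-in `set` type, where `&` is intersection and `|` is union:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B   # elements present in both sets
union = A | B          # elements present in at least one set

print("A ∩ B =", sorted(intersection))
print("A ∪ B =", sorted(union))
```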

## Q8. What do you understand about skewness in data?

## Ans:

Skewness is a statistical measure that describes the asymmetry or lack of symmetry in the distribution of data. It quantifies the degree to which the data distribution deviates from being perfectly symmetrical or bell-shaped. Understanding skewness is important in data analysis and statistics because it can provide insights into the shape and characteristics of a dataset.

There are three main types of skewness:

1. Positive Skew (Right Skew):
        In a positively skewed distribution, the tail on the right side (the upper tail) is longer or fatter than the tail on the left side (the lower tail).
        The majority of data points are concentrated on the left side of the distribution, and there are a few extreme values on the right side.
        The mean is typically greater than the median in a positively skewed distribution because the extreme values on the right side pull the mean in that direction.\

    Example: Income distribution in a population where most people have moderate incomes, but a small number of individuals have extremely high incomes.

2. Negative Skew (Left Skew):
        In a negatively skewed distribution, the tail on the left side (the lower tail) is longer or fatter than the tail on the right side (the upper tail).
        The majority of data points are concentrated on the right side of the distribution, and there are a few extreme values on the left side.
        The mean is typically less than the median in a negatively skewed distribution because the extreme values on the left side pull the mean in that direction.\

    Example: Test scores in a class where most students perform well, but a few students perform poorly, bringing down the average.

3. Symmetrical (No Skew):
        In a symmetrical distribution, the data is evenly distributed on both sides of the central point (usually the mean or median).
        There is no pronounced skew to either the left or right.
        The mean and median are close to each other in a symmetrical distribution.\

    Example: The heights of a randomly selected group of people, where the distribution of heights is roughly bell-shaped.
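
The three cases can be illustrated numerically. The sketch below computes the moment coefficient of skewness (the mean of the cubed standardized deviations, population form) by hand; the helper `skewness` and all three datasets are hypothetical, chosen only to mirror the examples above.

```python
from statistics import mean, median, pstdev

def skewness(data):
    """Moment coefficient of skewness: average cubed z-score (population form)."""
    mu = mean(data)
    sigma = pstdev(data)
    return sum(((x - mu) / sigma) ** 3 for x in data) / len(data)

right_skewed = [30, 32, 35, 38, 40, 42, 45, 200]  # incomes, one very high
left_skewed = [20, 85, 88, 90, 92, 94, 95, 98]    # test scores, one very low
symmetric = [160, 165, 170, 175, 180]             # heights, evenly spread

for name, data in [("right", right_skewed),
                   ("left", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, "mean:", mean(data), "median:", median(data),
          "skewness:", round(skewness(data), 3))
```

A positive result goes with mean > median, a negative result with mean < median, and a result near zero with mean ≈ median.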

## Q9. If the data is right-skewed, what will be the position of the median with respect to the mean?

## Ans:

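For right-skewed (positively skewed) data, the median lies to the left of the mean: mean $\ge$ median, and typically mean > median. The few extreme values in the right tail pull the mean upward, while the median depends only on the ordering of the values.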

## Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

## Ans:

1. Covariance:

        Calculation: Covariance measures the degree to which two variables change together. It is calculated as the average of the product of the differences between each variable's value and its respective mean.

        Range: The range of covariance is unbounded, which means it can take any real value, positive or negative.

        Interpretation: A positive covariance indicates a positive relationship, where both variables tend to increase or decrease together. A negative covariance suggests an inverse relationship, where one variable tends to increase as the other decreases. A covariance of zero implies no linear relationship, but it does not necessarily mean no relationship at all.

        Units: The units of covariance are the product of the units of the two variables. Therefore, it is not a standardized measure and can be challenging to interpret, especially when comparing different datasets with different scales.

        Use: Covariance is primarily used to understand the direction of the relationship between two variables. However, its value alone does not provide a clear measure of the strength or degree of association, as it is sensitive to the scale of the variables.

2. Correlation:

        Calculation: Correlation is a standardized measure that quantifies the strength and direction of the linear relationship between two variables. The most commonly used measure of correlation is the Pearson correlation coefficient (r), which is calculated as the covariance of the variables divided by the product of their standard deviations.

        Range: The range of correlation is between -1 and 1, inclusive. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

        Interpretation: Correlation provides a more interpretable measure of association compared to covariance. The sign of the correlation coefficient (+/-) indicates the direction of the relationship, while the magnitude (absolute value) indicates the strength of the relationship. A correlation of 0 implies no linear relationship.

        Units: Correlation is a unitless measure, making it easier to compare and interpret relationships between variables with different scales.

        Use: Correlation is widely used in statistical analysis, data science, and research to assess the strength and direction of linear associations between variables. It is particularly useful for identifying whether and to what extent two variables move together, and it helps in making predictions and decisions based on these relationships.

In summary, covariance and correlation both measure the association between two variables, but correlation provides a standardized measure that is easier to interpret and compare across different datasets. Correlation is often preferred in statistical analysis because of its scale-invariant properties and clear interpretation, while covariance is mainly used when the absolute scale of the relationship is important, or when more advanced statistical techniques are being applied.
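
The scale sensitivity of covariance can be seen in a small sketch. The helper functions and the `hours`/`score` data are hypothetical; the formulas are the population covariance and the Pearson correlation described above.

```python
from math import sqrt

def covariance(x, y):
    """Population covariance: average product of deviations from the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)

def correlation(x, y):
    """Pearson r: covariance divided by the product of standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

hours = [1, 2, 3, 4, 5]            # study time in hours
score = [2, 4, 5, 4, 5]            # quiz score
minutes = [h * 60 for h in hours]  # same study time, different unit

print("cov(hours, score):   ", covariance(hours, score))
print("cov(minutes, score): ", covariance(minutes, score))
print("corr(hours, score):  ", round(correlation(hours, score), 4))
print("corr(minutes, score):", round(correlation(minutes, score), 4))
```

Changing the unit from hours to minutes multiplies the covariance by 60 but leaves the correlation unchanged.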

## Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

## Ans:

The formula for calculating the sample mean (often denoted as x̄, pronounced as "x-bar") of a dataset is as follows:

$\bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}$

Where:\
$\bar{x}$ is the sample mean,\
$x_{i}$ is the $i$-th data point,\
$n$ is the number of data points in the sample, and\
$\sum_{i=1}^{n} x_{i}$ is the sum of all the data points in the sample.

To calculate the sample mean, you add up all the individual data points and then divide by the total number of data points in the sample.

Here's an example calculation for a dataset:

Suppose you have the following dataset representing the scores of five students on a math test:

Scores: 85, 92, 78, 88, 95

To calculate the sample mean ($\bar{x}$), you would:

    Add up all the individual scores:
    Sum of scores = 85+92+78+88+95 = 438 
    
    Determine the total number of data points in the dataset (n), which is 5 in this case.

    Use the formula to calculate the sample mean:

$\bar{x}$ = 438/5 = 87.6\
So, the sample mean of the dataset is 87.6. This means that, on average, the students scored 87.6 on the math test.
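
The same calculation as a short Python sketch, following the steps above:

```python
scores = [85, 92, 78, 88, 95]

total = sum(scores)       # add up all the individual scores
n = len(scores)           # number of data points
sample_mean = total / n   # apply the formula: sum / n

print("Sum of scores:", total)
print("n:", n)
print("Sample mean:", sample_mean)
```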

## Q12. For normally distributed data, what is the relationship between its measures of central tendency?

## Ans:

In a normal distribution, also known as a Gaussian distribution or a bell curve, there is a specific and well-defined relationship between its measures of central tendency: the mean, median, and mode. This relationship is a key characteristic of the normal distribution, and it holds true for any normal distribution. Here's the relationship between these measures:

1. Mean (μ):
        The mean is located at the center of the distribution, which is also the peak (highest point) of the normal curve.

2. Median:
        Because the normal curve is symmetric about its center, half of the values lie on each side of the mean, so median = μ.

3. Mode:
        The curve has a single peak at its center, so the most likely value is also μ.

Mathematically, in a normal distribution, mean = median = mode = μ.
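
This can be checked with `statistics.NormalDist` from the Python standard library (3.8+). The parameters `mu=170` and `sigma=8` are hypothetical, and the random sample is seeded so the run is repeatable.

```python
from statistics import NormalDist, mean, median

# Hypothetical normal distribution of heights in centimetres.
heights = NormalDist(mu=170, sigma=8)

# For the theoretical distribution the three measures coincide exactly.
print("mean:", heights.mean, "median:", heights.median, "mode:", heights.mode)

# In a finite random sample they agree only approximately.
sample = heights.samples(10_000, seed=42)
print("sample mean:  ", round(mean(sample), 2))
print("sample median:", round(median(sample), 2))
```

The mode of a continuous sample is not meaningful without binning, so only the sample mean and median are compared here.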

## Q13. How is covariance different from correlation?

## Ans:

1. Covariance:

 Calculation: Covariance measures the degree to which two variables change together. It is calculated as the average of the product of the differences between each variable's value and its respective mean.

 Range: The range of covariance is unbounded, which means it can take any real value, positive or negative.

 Interpretation: A positive covariance indicates a positive relationship, where both variables tend to increase or decrease together. A negative covariance suggests an inverse relationship, where one variable tends to increase as the other decreases. A covariance of zero implies no linear relationship, but it does not necessarily mean no relationship at all.

 Units: The units of covariance are the product of the units of the two variables. Therefore, it is not a standardized measure and can be challenging to interpret, especially when comparing different datasets with different scales.

 Use: Covariance is primarily used to understand the direction of the relationship between two variables. However, its value alone does not provide a clear measure of the strength or degree of association, as it is sensitive to the scale of the variables.

2. Correlation:

 Calculation: Correlation is a standardized measure that quantifies the strength and direction of the linear relationship between two variables. The most commonly used measure of correlation is the Pearson correlation coefficient (r), which is calculated as the covariance of the variables divided by the product of their standard deviations.

 Range: The range of correlation is between -1 and 1, inclusive. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

 Interpretation: Correlation provides a more interpretable measure of association compared to covariance. The sign of the correlation coefficient (+/-) indicates the direction of the relationship, while the magnitude (absolute value) indicates the strength of the relationship. A correlation of 0 implies no linear relationship.

 Units: Correlation is a unitless measure, making it easier to compare and interpret relationships between variables with different scales.

 Use: Correlation is widely used in statistical analysis, data science, and research to assess the strength and direction of linear associations between variables. It is particularly useful for identifying whether and to what extent two variables move together, and it helps in making predictions and decisions based on these relationships.

## Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

## Ans:

Outliers are data points that are significantly different from the majority of the data in a dataset. They can have a substantial impact on measures of central tendency (such as the mean, median, and mode) and measures of dispersion (such as the range, variance, and standard deviation). The extent of the impact depends on the number and magnitude of the outliers. Here's how outliers affect these measures:

    Measures of Central Tendency:

        Mean: Outliers can have a significant impact on the mean because every value enters the sum. An extreme value pulls the mean in its direction, and the effect is larger when the dataset is small.

        Median: The median is much less affected by outliers because it depends on the rank order of the values, not on how extreme they are. Adding a single outlier can move the median only as far as a neighboring value in the sorted data, and making an existing extreme value even more extreme does not change the median at all.

        Mode: The mode is generally not affected by outliers because it represents the most frequently occurring value(s) in the dataset, and outliers are often isolated data points.

Example for Measures of Central Tendency:

Suppose you have the following dataset representing the salaries of employees in a company:

Salaries: $40,000, $45,000, $42,000, $39,000, $250,000

    Mean: Without the outlier, the mean of the four typical salaries is $41,500. Adding the outlier salary of $250,000 raises the mean to $83,200, which is higher than what most employees earn, making it less representative of the central salary.

    Median: Without the outlier, the median is $41,000 (the average of $40,000 and $42,000). With the outlier it is $42,000, so it moves only slightly and still represents a typical salary.

    Mode: Every salary in this dataset occurs exactly once, so there is no single most frequent value, with or without the outlier. The outlier does not create a mode.

    Measures of Dispersion:

        Range: Outliers can significantly affect the range because it is calculated as the difference between the maximum and minimum values. If there are outliers with extreme values, they can increase the range substantially.

        Variance and Standard Deviation: Outliers can increase the variance and standard deviation because these measures consider the squared differences between data points and the mean. Outliers with large deviations from the mean contribute more to the overall variance and standard deviation.

Example for Measures of Dispersion:

Using the same salary dataset as before:

    Range: Without the outlier, the range is $45,000 - $39,000 = $6,000. Including the outlier salary of $250,000 increases the range to $250,000 - $39,000 = $211,000, a significant expansion.

    Variance and Standard Deviation: The outlier salary of $250,000 has a very large squared deviation from the mean, so it dominates the variance. The population standard deviation rises from roughly $2,300 without the outlier to roughly $83,400 with it.
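
The effect can be reproduced with a short sketch that summarizes the salary data with and without the outlier. The helper `summarize` is a hypothetical name, and the standard deviation uses the population formula.

```python
from statistics import mean, median, pstdev

typical = [40_000, 45_000, 42_000, 39_000]
with_outlier = typical + [250_000]

def summarize(salaries):
    """Return the central tendency and dispersion measures for a list of salaries."""
    return {
        "mean": mean(salaries),
        "median": median(salaries),
        "range": max(salaries) - min(salaries),
        "std": pstdev(salaries),  # population standard deviation
    }

before = summarize(typical)
after = summarize(with_outlier)

for key in before:
    print(f"{key:>6}: {before[key]:>12,.2f} -> {after[key]:>12,.2f}")
```

The mean, range, and standard deviation change dramatically, while the median barely moves.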