Q1. What are the three measures of central tendency?

Mean: The mean is the most commonly used measure of central tendency. It is calculated by summing up all the values in a data set and dividing the sum by the total number of values. The mean represents the average value of the data set and is sensitive to extreme values.

Median: The median is the middle value in a data set when the values are arranged in ascending or descending order. If there is an odd number of values, the median is the middle value. If there is an even number of values, the median is the average of the two middle values. The median is largely unaffected by extreme values and is often used when there are outliers present in the data.

Mode: The mode is the value or values that occur most frequently in a data set. In other words, it represents the peak(s) or the most common value(s) in the data set. A data set can have no mode (when no value is repeated), a single mode (when one value occurs most frequently), or multiple modes (when more than one value occurs with the same highest frequency). The mode is useful for categorical or discrete data, but it can also be used with numerical data.



Q2. What is the difference between the mean, median, and mode? How are they used to measure the
central tendency of a dataset?

The mean, median, and mode are different measures of central tendency used to describe the typical or central value of a dataset. Here's an explanation of their differences and how they are used:

Mean: The mean is calculated by summing up all the values in a dataset and dividing the sum by the total number of values. It represents the average value of the dataset. The mean is commonly used and provides a balance between all the values in the dataset. However, it is sensitive to extreme values, also known as outliers. Even a single outlier can significantly affect the mean, pulling it towards the extreme value. The mean is useful when the data is normally distributed or symmetrically distributed around a central value.

Median: The median is the middle value in a dataset when the values are arranged in ascending or descending order. It is largely unaffected by extreme values or outliers because it depends only on the order of the values, not on their magnitudes. The median is useful when the data contains outliers or when the distribution is skewed. Skewness refers to the asymmetry of the dataset, where the tail of the distribution is elongated in one direction. The median provides a more robust measure of central tendency in such cases.

Mode: The mode is the value or values that occur most frequently in a dataset. It represents the peak(s) or the most common value(s) in the dataset. Unlike the mean and median, the mode can be used with categorical or discrete data, as well as with numerical data. The mode is useful for identifying the most frequent category or value in a dataset. It is particularly helpful when dealing with qualitative or nominal data, such as colors, types of cars, or survey responses.
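
The contrast between the three measures can be illustrated with a small sketch. The salary figures below are hypothetical and chosen only to show how one extreme value moves the mean while the median and mode barely change.

```python
import statistics

# Hypothetical salaries (in thousands), made up for illustration.
salaries = [30, 32, 32, 35, 38]
with_outlier = salaries + [400]  # the same data plus one extreme value

for label, values in [("without outlier", salaries),
                      ("with outlier", with_outlier)]:
    print(label,
          "| mean =", round(statistics.mean(values), 2),
          "| median =", statistics.median(values),
          "| mode =", statistics.mode(values))
```

In this made-up sample the mean jumps from 33.4 to 94.5 once the outlier is added, while the median only moves from 32 to 33.5 and the mode stays at 32.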

Q3. Measure the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
import numpy as np
from scipy import stats
height_data = [[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]]


In [2]:
mean_data = np.mean(height_data)
print("Mean height", mean_data)

Mean height 177.01875


In [3]:
median_data = np.median(height_data)
print("Median height", median_data)

Median height 177.0


In [4]:
# stats.mode returns a single value even if several are tied for most frequent
mode_data = stats.mode(height_data[0])
print("mode height", mode_data)

mode height ModeResult(mode=array([177.]), count=array([3]))
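
Note that stats.mode reports a single value even when several values are tied. As a possible cross-check, the standard library's statistics.multimode lists every value that shares the highest count. This is a sketch using the same height list, not part of the original solution.

```python
from statistics import multimode

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns all values tied for the highest frequency,
# in the order they first appear in the data.
all_modes = multimode(heights)
print("All modes:", all_modes)
```

Here both 177 and 178 occur three times, so the height data is bimodal.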




Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [5]:
import numpy as np
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
# np.std uses ddof=0 by default, i.e. the population standard deviation
std_dev = np.std(data)
print("Standard Deviation", std_dev)

Standard Deviation 1.7885814036548633
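
The value above is the population standard deviation (division by n). If the heights are treated as a sample from a larger population, the sample standard deviation (division by n - 1) may be more appropriate. A minimal sketch of both, using NumPy's ddof parameter:

```python
import numpy as np

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = np.std(data)        # ddof=0: divide by n
sample_std = np.std(data, ddof=1)    # ddof=1: divide by n - 1

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.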


Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.

Measures of dispersion, such as range, variance, and standard deviation, are used to describe the spread or variability of a dataset. They provide information about how the data points are dispersed or spread out from the central tendency measures (such as the mean or median). Here's an example to illustrate their usage:

Consider the following dataset representing the daily temperatures (in degrees Celsius) recorded over a week: [28, 30, 29, 32, 27, 31, 29].

Range:
The range is the simplest measure of dispersion and is calculated as the difference between the maximum and minimum values in the dataset. In this example, the range is 32 - 27 = 5 degrees Celsius. The range provides a basic understanding of the spread of the data but can be sensitive to outliers.

Variance:
Variance measures how far the data points deviate from the mean. It is the average squared difference between each data point and the mean. To calculate the population variance, subtract the mean from each data point, square the differences, sum them, and divide by the number of data points (dividing by n - 1 instead gives the sample variance). In this example, the variance is calculated as follows:
Mean: (28 + 30 + 29 + 32 + 27 + 31 + 29) / 7 = 206 / 7 ≈ 29.43
Differences from the mean: [-1.43, 0.57, -0.43, 2.57, -2.43, 1.57, -0.43]
Squared differences: [2.04, 0.33, 0.18, 6.61, 5.90, 2.47, 0.18]
Sum of squared differences: ≈ 17.71
Variance: 17.71 / 7 ≈ 2.53

The variance gives the average squared deviation from the mean, providing information about the spread of the data. However, it is expressed in squared units (here, degrees Celsius squared) rather than the original units of the data, which makes it harder to interpret.

Standard Deviation:
The standard deviation is the square root of the variance. It measures the typical deviation from the mean and is expressed in the same units as the original data. In this example, the standard deviation is the square root of the variance calculated earlier: √2.53 ≈ 1.59 degrees Celsius. The standard deviation allows for a more intuitive understanding of the spread of the data and is widely used in statistical analysis.
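
As a check on the hand calculation, the three measures can be recomputed for the temperature list. This is a sketch that assumes the population formulas (division by n), which is NumPy's default.

```python
import numpy as np

temperatures = np.array([28, 30, 29, 32, 27, 31, 29])

data_range = temperatures.max() - temperatures.min()
variance = temperatures.var()   # population variance (ddof=0)
std_dev = temperatures.std()    # population standard deviation (ddof=0)

print("Mean:", round(temperatures.mean(), 2))
print("Range:", data_range)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 2))
```

Passing ddof=1 to var() and std() would give the sample versions instead.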

Q6. What is a Venn diagram? 

A Venn diagram is a graphical representation of the relationships between different sets of elements. It consists of overlapping circles or other closed curves, each representing a set. The areas where the circles overlap indicate the elements that are common to the sets being compared.

Q7. For the two given sets A = {2,3,4,5,6,7} and B = {0,2,6,8,10}, find:
(i) A ∩ B
(ii) A ∪ B

In [9]:
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}


intersection = A.intersection(B)
print("A ∩ B:", intersection)


union = A.union(B)
print("A ∪ B:", union)


A ∩ B: {2, 6}
A ∪ B: {0, 2, 3, 4, 5, 6, 7, 8, 10}
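
The same two sets also illustrate the three regions of a two-circle Venn diagram: elements only in A, elements in both, and elements only in B. A small sketch using Python's set operators (the variable names are chosen here for illustration):

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

only_a = A - B   # left part of the diagram: in A but not in B
both = A & B     # overlapping region: in A and in B
only_b = B - A   # right part of the diagram: in B but not in A

print("Only in A:", sorted(only_a))
print("In both:", sorted(both))
print("Only in B:", sorted(only_b))
```

Together the three regions cover the union of A and B without any overlap.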


Q8. What do you understand about skewness in data? 


Skewness is a measure of the asymmetry or departure from symmetry in a dataset's distribution. It quantifies the extent to which a dataset's values are concentrated on one side of the distribution compared to the other.

In a symmetrical distribution, the data points are evenly distributed around the mean; the bell-shaped normal curve is the most familiar example. Skewness measures how much longer or heavier one tail of the distribution is than the other.

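Skewness is usually described as positive, negative, or zero (a symmetric distribution, where the mean and median roughly coincide). The two skewed cases are: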

Positive Skewness (Right-skewed): In a positively skewed distribution, the tail on the right side of the distribution is longer or fatter than the left side. The mean is typically greater than the median, and the majority of the values are concentrated towards the lower end of the range.

Negative Skewness (Left-skewed): In a negatively skewed distribution, the tail on the left side of the distribution is longer or fatter than the right side. The mean is typically less than the median, and the majority of the values are concentrated towards the higher end of the range.
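
The sign of the skewness can be checked numerically. The sketch below uses the moment-based coefficient g1 = m3 / m2^1.5, which is one common definition among several; the helper name sample_skewness and the three small datasets are made up for illustration.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, using division by n."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)   # second central moment
    m3 = np.mean(deviations ** 3)   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]      # long tail to the right
left_skewed = [-v for v in right_skewed]      # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right-skewed", right_skewed),
                     ("left-skewed", left_skewed),
                     ("symmetric", symmetric)]:
    print(name, "skewness:", round(sample_skewness(values), 3))
```

The right-skewed list gives a positive value, its mirror image a negative value of the same size, and the symmetric list gives zero.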

Q9. If a dataset is right-skewed, what is the position of the median with respect to the mean?

If a dataset is right-skewed, it means that the tail of the distribution extends towards the right side, indicating a higher concentration of values on the left side and a few larger values on the right side. In this case, the position of the median with respect to the mean can provide insights into the distribution of the data.

In a right-skewed distribution:

The mean is typically greater than the median.
The median lies to the left of the mean, closer to the bulk of the lower values, because the long right tail pulls the mean upward.
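
A quick numerical illustration of this ordering, using a hypothetical right-skewed list of incomes (values invented for the example):

```python
import statistics

# Hypothetical right-skewed sample: most values are low, one is very high.
incomes = [20, 22, 25, 27, 30, 35, 120]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print("Mean:", round(mean_income, 2))
print("Median:", median_income)
print("Mean is greater than median:", mean_income > median_income)
```

The single large value raises the mean to about 39.86, while the median stays at 27.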

Q10. Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?

Covariance and correlation are both measures used in statistical analysis to understand the relationship between variables, but they capture different aspects of the relationship. Here's an explanation of the difference between covariance and correlation and how they are used:

Covariance:
Covariance measures the extent to which two variables vary together. Its sign gives the direction of the linear relationship: a positive covariance means the variables tend to increase or decrease together, and a negative covariance means one tends to increase while the other decreases. Its magnitude, however, depends on the units of the variables, so on its own it does not give a clear measure of the strength of the relationship.
Covariance is calculated using the following formula:

Cov(X, Y) = Σ[(Xᵢ - μₓ)(Yᵢ - μᵧ)] / (n - 1)

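Where X and Y are the variables, Xᵢ and Yᵢ are individual data points, μₓ and μᵧ are the means of X and Y computed from the data, and n is the number of data points. Dividing by n - 1 gives the sample covariance; dividing by n gives the population covariance.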


Correlation:
Correlation measures the strength and direction of the linear relationship between two variables but is standardized to a range between -1 and +1. It provides a more interpretable measure of the relationship compared to covariance, as it is unitless. A correlation of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
Correlation is calculated using the following formula:

Corr(X, Y) = Cov(X, Y) / (σₓ * σᵧ)

Where Cov(X, Y) is the covariance between X and Y, and σₓ and σᵧ are the standard deviations of X and Y, respectively.
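
Both formulas can be tried out with NumPy. The sketch below uses hypothetical paired observations (hours studied and exam scores, invented for illustration) and checks that dividing the covariance by the product of the standard deviations reproduces the value from np.corrcoef.

```python
import numpy as np

# Hypothetical paired data, made up for this example.
hours_studied = np.array([1, 2, 3, 4, 5])
exam_score = np.array([52, 55, 61, 64, 70])

# np.cov returns a 2x2 matrix and divides by n - 1 by default;
# the off-diagonal entry is Cov(X, Y).
cov_xy = np.cov(hours_studied, exam_score)[0, 1]

# Correlation from the formula Corr = Cov / (sigma_x * sigma_y),
# using sample standard deviations (ddof=1) to match np.cov.
manual_corr = cov_xy / (hours_studied.std(ddof=1) * exam_score.std(ddof=1))

# Correlation computed directly by NumPy.
corr_xy = np.corrcoef(hours_studied, exam_score)[0, 1]

print("Covariance:", cov_xy)
print("Correlation (formula):", round(manual_corr, 4))
print("Correlation (np.corrcoef):", round(corr_xy, 4))
```

The covariance here is 11.25 in units of hours times score points, while the correlation is a unitless number close to +1.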

Q11. What is the formula for calculating the sample mean? Provide an example calculation for a
dataset.

The formula for calculating the sample mean is as follows:

Sample Mean = (Sum of all data points) / (Number of data points)

To calculate the sample mean, you sum up all the data points in the dataset and divide the sum by the total number of data points.

Here's an example calculation of the sample mean for a dataset:

Dataset: [10, 15, 20, 25, 30]

Step 1: Add up all the data points: 10 + 15 + 20 + 25 + 30 = 100

Step 2: Determine the number of data points: 5 (since there are five data points in the dataset)

Step 3: Calculate the sample mean: 100 / 5 = 20

Therefore, the sample mean of the given dataset is 20.
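
The same three steps can be written as a short sketch in plain Python:

```python
dataset = [10, 15, 20, 25, 30]

total = sum(dataset)          # Step 1: add up all the data points
count = len(dataset)          # Step 2: number of data points
sample_mean = total / count   # Step 3: divide the sum by the count

print("Sum:", total)
print("Count:", count)
print("Sample mean:", sample_mean)
```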

Q12. For normally distributed data, what is the relationship between its measures of central tendency?

In a normal distribution, the three measures of central tendency—the mean, median, and mode—have a specific relationship:

Mean: In a normal distribution, the mean is equal to the median. The mean represents the average value of the data points and is located at the center of the distribution. Since a normal distribution is symmetric, with equal probabilities on both sides of the mean, the mean and median coincide.

Median: As mentioned above, the median is equal to the mean in a normal distribution. It represents the middle value of the dataset when arranged in ascending or descending order. The median splits the distribution into two equal halves.

Mode: In a normal distribution, the mode is also equal to the mean and median. The mode represents the value that occurs with the highest frequency or the peak of the distribution. Since a normal distribution is symmetric, with a single peak, the mode is located at the same point as the mean and median.
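
This relationship can be observed approximately on simulated data. The sketch below draws a large sample from a normal distribution with an assumed mean of 50 and standard deviation of 10 (arbitrary choices), and estimates the mode as the midpoint of the tallest histogram bin, which is only a rough approximation for continuous data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Assumed parameters for illustration: mean 50, standard deviation 10.
sample = rng.normal(loc=50, scale=10, size=100_000)

sample_mean = sample.mean()
sample_median = np.median(sample)

# Rough mode estimate: midpoint of the histogram bin with the most values.
counts, edges = np.histogram(sample, bins=30)
peak = counts.argmax()
approx_mode = (edges[peak] + edges[peak + 1]) / 2

print("Mean:", round(sample_mean, 2))
print("Median:", round(sample_median, 2))
print("Approximate mode:", round(approx_mode, 2))
```

All three values should land close to 50, with small differences due to sampling noise and the bin width.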

Q13. How is covariance different from correlation?

Covariance and correlation are both measures used to assess the relationship between variables, but they differ in terms of their interpretation and standardization. Here are the key differences between covariance and correlation:

Interpretation:
Covariance: Covariance measures the extent and direction of the linear relationship between two variables. A positive covariance indicates a positive relationship where both variables tend to increase or decrease together, while a negative covariance indicates an inverse relationship where one variable tends to increase as the other decreases. However, the magnitude of covariance is not standardized and depends on the scale of the variables, making it difficult to interpret the strength of the relationship.

Correlation: Correlation measures the strength and direction of the linear relationship between two variables. It is a standardized measure that ranges between -1 and +1. A correlation of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The magnitude of the correlation coefficient provides insights into the strength of the relationship, regardless of the scale of the variables.

Standardization:
Covariance: Covariance is not standardized and is affected by the units of the variables being measured. As a result, it is not directly comparable across different datasets or variables with different scales.

Correlation: Correlation is standardized, meaning it is unitless and does not depend on the scale of the variables. This allows for direct comparison of the strength of the relationship across different datasets or variables.

Range of Values:
Covariance: Covariance can take any real value, positive or negative. The magnitude of the covariance is influenced by the units of the variables and can vary widely depending on the dataset.

Correlation: Correlation always lies between -1 and +1, whatever the units of the variables.
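
The effect of units can be seen by rescaling one variable. In this sketch the heights and weights are hypothetical values; converting height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical measurements, invented for illustration.
height_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80])
weight_kg = np.array([55, 60, 63, 70, 74])
height_cm = height_m * 100  # same heights, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print("Covariance (m, kg):", round(cov_m, 4))
print("Covariance (cm, kg):", round(cov_cm, 4))
print("Correlation (m, kg):", round(corr_m, 4))
print("Correlation (cm, kg):", round(corr_cm, 4))
```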

Q14. How do outliers affect measures of central tendency and dispersion? Provide an example. 