The three measures of central tendency are:

1. Mean: The mean is the arithmetic average of a set of values. To calculate the mean, you add up all the values in the dataset and then divide the sum by the number of values. The mean is sensitive to extreme values (outliers) and is commonly used when the data follows a normal distribution.

2. Median: The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values. The median is less affected by outliers and is a robust measure of central tendency.

3. Mode: The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode at all. The mode is often used for categorical or nominal data and can also be used for numerical data.

These measures help summarize the central or typical value in a dataset, and they each have their own strengths and limitations depending on the characteristics of the data.
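
As a quick illustration of the three definitions above, the following minimal sketch uses Python's standard `statistics` module. The `scores` list is hypothetical data invented for this example and does not come from the assignment.

```python
import statistics

# Hypothetical exam scores, chosen only for illustration.
scores = [60, 70, 70, 80, 95]

mean_score = statistics.mean(scores)      # sum of the values divided by their count
median_score = statistics.median(scores)  # middle value of the sorted list
mode_score = statistics.mode(scores)      # most frequently occurring value

print("Mean:", mean_score)
print("Median:", median_score)
print("Mode:", mode_score)
```

In this sample the mean sits above the median and mode because the single high score of 95 pulls it upward.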

The mean, median, and mode are three different measures of central tendency, and they are used to describe where the "center" or the typical value of a dataset lies. Here are the key differences between these measures and how they are used:

Mean:
The mean is the average of all the values in a dataset.
To calculate the mean, you add up all the values and then divide the sum by the number of values.
The mean is sensitive to outliers, as a single extremely high or low value can significantly affect the mean.
It is commonly used when the data follows a normal distribution or when you want to capture the overall "balance" of the dataset.
The formula for the mean is: Mean = (Sum of values) / (Number of values).

Median:
The median is the middle value in a dataset when the values are ordered from smallest to largest or vice versa.
If there is an even number of values, the median is the average of the two middle values.
The median is less affected by outliers compared to the mean, making it a robust measure of central tendency.
It is particularly useful when the data contains outliers or when the distribution of the data is not symmetrical.
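The median depends on the rank order of the values rather than on their magnitudes: making an extreme value even more extreme does not move the median, as long as the values in the middle of the ordering stay the same.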

Mode:
The mode is the value that appears most frequently in a dataset.
A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode at all.
The mode is often used for categorical or nominal data but can also be applied to numerical data.
It provides information about the most common or popular value in the dataset.
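
The sensitivity to outliers described above can be sketched as follows. The income figures are hypothetical, and `statistics.multimode` (available in Python 3.8 and later) is used here only to show a dataset with more than one mode.

```python
import statistics

# Hypothetical incomes (in thousands); 400 is a deliberately extreme value.
incomes = [30, 32, 35, 36, 38]
incomes_with_outlier = incomes + [400]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print("Mean before/after outlier:  ", mean_before, mean_after)
print("Median before/after outlier:", median_before, median_after)

# multimode returns every value tied for the highest frequency.
modes = statistics.multimode([1, 2, 2, 3, 3, 4])
print("Modes:", modes)
```

The single outlier moves the mean a long way while the median shifts only slightly, which is why the median is usually preferred for skewed data.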

In [2]:
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]
# statistics.stdev returns the sample standard deviation (it divides by n - 1).
stdev = statistics.stdev(data)
print("Standard deviation: ", stdev)

Standard deviation:  1.8472389305844188


Measures of dispersion, such as range, variance, and standard deviation, are used to describe how the data points in a dataset are spread out or how much they vary from the central tendency measures (mean, median, mode). Here's how they are used:

1. Range:
   - The range is the simplest measure of dispersion and is calculated by finding the difference between the maximum and minimum values in a dataset.
   - It provides a quick way to understand the spread of data but can be affected by extreme outliers.
   - Range = Maximum value - Minimum value

   Example: Consider a dataset of exam scores for a class, where the scores range from 60 to 95. The range in this case would be 95 - 60 = 35, indicating that the scores vary by 35 points.

2. Variance:
   - Variance measures the average of the squared differences between each data point and the mean.
   - A higher variance indicates greater spread or variability in the data.
   - Because the deviations are squared, variance is expressed in squared units of the data and gives extra weight to points that lie far from the mean.

   Example: Suppose you have a dataset of the ages of people in a city. If the variance is high, it means that the ages are spread out over a wide range, indicating a diverse population. If the variance is low, it suggests that most people have similar ages.

3. Standard Deviation:
   - The standard deviation is a more interpretable measure of dispersion compared to variance. It is calculated as the square root of the variance.
   - It summarizes the typical distance between the data points and the mean (formally, the square root of the average squared deviation). A larger standard deviation implies greater dispersion.
   - Standard deviation is commonly used because it is in the same unit as the data, making it easier to relate to the dataset's characteristics.

   Example: Consider a dataset of daily temperatures in a city over a year. If the standard deviation of temperatures is high, the temperatures vary widely over the year, for example because of strong seasonal swings. A low standard deviation suggests that temperatures stay close to the yearly mean.
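
The three measures of dispersion can be computed together, as in this minimal sketch. The `scores` list is hypothetical. Only its minimum (60) and maximum (95) are taken from the range example in the text.

```python
import math
import statistics

# Hypothetical exam scores; only the minimum and maximum mirror the range example.
scores = [60, 72, 78, 85, 95]

score_range = max(scores) - min(scores)         # maximum value minus minimum value
sample_var = statistics.variance(scores)        # squared deviations divided by n - 1
population_var = statistics.pvariance(scores)   # squared deviations divided by n
sample_sd = math.sqrt(sample_var)               # square root, back in the original units

print("Range:", score_range)
print("Sample variance:", sample_var)
print("Population variance:", population_var)
print("Sample standard deviation:", sample_sd)
```

Whether to divide by n or n - 1 depends on whether the data is treated as a whole population or as a sample from one. The earlier `statistics.stdev` cell uses the sample (n - 1) form.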

Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
(i) A ⋂ B
(ii) A ⋃ B
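
One way to check the answer is with Python's built-in `set` type. This sketch assumes part (i) asks for the intersection of A and B, since the operator is missing from the question as written.

```python
# The two sets from Q7, written as Python sets.
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B   # elements present in both A and B
union = A | B          # elements present in A, in B, or in both

print("Intersection:", sorted(intersection))  # [2, 6]
print("Union:", sorted(union))                # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```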

Q8. What do you understand about skewness in data?

Skewness is a statistical measure that helps us understand the asymmetry or lack of symmetry in the distribution of data. It provides information about the shape of a data distribution and the direction in which the data is skewed or tilted. In particular, skewness tells us whether the data is concentrated more to the left or right of the central point (mean, median, or mode) and to what degree.

There are three types of skewness:

1. Positive Skewness (Right Skew):
   - In a positively skewed distribution, the tail on the right-hand side (the larger values) is longer or more spread out than the tail on the left.
   - The majority of the data points are concentrated on the left side, and there are relatively few extreme values on the right.
   - The mean is typically greater than the median in a positively skewed distribution.

2. Negative Skewness (Left Skew):
   - In a negatively skewed distribution, the tail on the left-hand side (the smaller values) is longer or more spread out than the tail on the right.
   - The majority of the data points are concentrated on the right side, and there are relatively few extreme values on the left.
   - The mean is typically less than the median in a negatively skewed distribution.

3. Zero Skewness:
   - A perfectly symmetrical distribution has zero skewness: it is balanced, with tails of equal length on both sides of the central point. The converse does not always hold, because an asymmetric distribution can also produce a skewness coefficient of zero.
   - For a symmetric distribution with a single peak, the mean, median, and mode are all equal and are located at the same position in the distribution.
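
The text does not say which skewness coefficient is meant. The sketch below assumes the common moment-based (Fisher-Pearson) coefficient, the third central moment divided by the 1.5 power of the second central moment. The function name `skewness` and the three small datasets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population form, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 10]     # long tail toward larger values
left_skewed = [-10, -3, -2, -2, -1] # mirror image: long tail toward smaller values
symmetric = [1, 2, 3, 4, 5]         # balanced around 3

print("Right-skewed:", skewness(right_skewed))  # positive
print("Left-skewed: ", skewness(left_skewed))   # negative
print("Symmetric:   ", skewness(symmetric))     # zero
```

The sign of the result matches the three cases listed above: positive for a right tail, negative for a left tail, and zero for symmetric data.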

Skewness is a valuable tool in statistics and data analysis because it provides insights into the shape of data distributions. It helps analysts understand not only the central tendency but also the overall distribution characteristics, which can be important in making informed decisions and drawing conclusions about the data. Skewness is often used in conjunction with other statistical measures to gain a more comprehensive understanding of the data.

In a right-skewed (positively skewed) data distribution, the median is typically positioned to the left of the mean. Here's why:

- Right-skewed distributions have a long tail on the right side (the larger values), indicating that there are relatively few extremely large values that can pull the mean to the right.
- The majority of data points are concentrated on the left side, with smaller values. This concentration of values on the left side tends to "drag" the median towards the lower values.

As a result, in a right-skewed distribution:

- The mean is greater than the median because the larger values on the right side have a greater influence on the mean.
- The median is positioned to the left of the mean and is closer to the bulk of the data points.

The relationship between the mean and median gives a quick indication of a distribution's shape and asymmetry. As a rule of thumb, the stronger the right skew, the larger the gap between the mean and the median tends to be, so this gap serves as a rough indicator of the direction and degree of skewness in the data.
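
To see how a longer right tail widens the gap between mean and median, consider this minimal sketch. The base values and the tail values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical bulk of the data, concentrated at small values.
bulk = [1, 2, 2, 3, 3, 3, 4, 4, 5]

gaps = []
for tail_value in (10, 20, 40):
    data = bulk + [tail_value]           # stretch the right tail further each time
    mean = statistics.mean(data)
    median = statistics.median(data)
    gaps.append(mean - median)
    print(f"tail={tail_value:>2}  mean={mean:.1f}  median={median}  gap={mean - median:.1f}")
```

The median stays at the bulk of the data while the mean follows the tail, so the gap grows as the tail lengthens.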

Covariance and correlation are two measures used in statistical analysis to quantify the relationship between two variables, particularly in the context of multivariate data analysis. While they both assess the association between variables, they have distinct differences in terms of scale and interpretation:

1. Covariance:

   - Covariance is a measure of the degree to which two variables change together. It indicates the direction of the linear relationship between the variables.
   - The formula for calculating the covariance between two variables X and Y is:
     Cov(X, Y) = Σ [(Xᵢ - X̄) * (Yᵢ - Ȳ)] / (n - 1)
     where Xᵢ and Yᵢ are individual data points, X̄ and Ȳ are the means of X and Y, and n is the number of data points.
   - Covariance can be positive, negative, or zero, indicating a positive linear relationship, a negative linear relationship, or no linear relationship, respectively.
   - The magnitude of covariance is not standardized and depends on the units of the variables, making it difficult to compare covariances between different datasets.

2. Correlation:

   - Correlation is a standardized measure that assesses the strength and direction of a linear relationship between two variables. It provides a more interpretable and comparable metric compared to covariance.
   - The most commonly used correlation coefficient is the Pearson correlation coefficient (Pearson's r), which ranges between -1 and 1.
   - The formula for calculating the Pearson correlation coefficient between X and Y is:
     r(X, Y) = Cov(X, Y) / (σ(X) * σ(Y))
     where Cov(X, Y) is the covariance between X and Y, and σ(X) and σ(Y) are the standard deviations of X and Y, respectively.
   - A correlation of 1 represents a perfect positive linear relationship, -1 represents a perfect negative linear relationship, and 0 represents no linear relationship.
   - Correlation is scale-independent and can be used to compare relationships between different pairs of variables and datasets.
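
The two formulas above can be written out directly, as in this minimal sketch. The helper names `covariance` and `pearson_r` and the study-hours data are hypothetical. Rescaling one variable shows why covariance depends on units while correlation does not.

```python
import math

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the sample variance of X
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
scores_rescaled = [s * 10 for s in scores]  # same scores in different units

cov = covariance(hours, scores)
cov_rescaled = covariance(hours, scores_rescaled)
r = pearson_r(hours, scores)
r_rescaled = pearson_r(hours, scores_rescaled)

print("Covariance:", cov, "after rescaling:", cov_rescaled)
print("Correlation:", r, "after rescaling:", r_rescaled)
```

Multiplying the scores by 10 multiplies the covariance by 10 but leaves Pearson's r unchanged.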

In statistical analysis:

- Covariance is primarily used to determine whether two variables tend to increase or decrease together. However, its units make it difficult to interpret or compare.
- Correlation, and in particular Pearson's correlation coefficient, is widely used to measure the strength and direction of a linear relationship between two variables. It is suitable for comparing and interpreting the relationships between different sets of data, and it provides a standardized metric for assessing the degree of association.

In summary, while both covariance and correlation assess the relationship between variables, correlation provides a more standardized and interpretable measure of the strength and direction of this relationship and is preferred in many statistical analyses and data interpretation tasks.

In [4]:
data = [85, 76, 92, 88, 79, 91, 83, 80, 87, 90]
sample_mean = sum(data) / len(data)
print("Sample mean =", sample_mean)

Sample mean = 85.1
