Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. **Mean**: The average of a set of numbers, calculated by adding up all the values and then dividing by the number of values.

2. **Median**: The middle value in a set of numbers when they are arranged in order. If the set has an even number of values, the median is the average of the two middle numbers.

3. **Mode**: The value that appears most frequently in a set of numbers. A set may have one mode, more than one mode, or no mode if all values are unique.
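The three definitions above can be checked on a small dataset. This is a minimal sketch using Python's standard `statistics` module; the sample values are made up for illustration and are not part of the assignment data.

```python
import statistics

# Illustrative values (an assumption, not from the assignment)
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # (2 + 3 + 3 + 5 + 7 + 10) / 6
median_value = statistics.median(values)  # even count: average of the middle pair 3 and 5
modes = statistics.multimode(values)      # list of every value with the highest count

print(mean_value, median_value, modes)

# Two values tied for most frequent: both are reported.
two_modes = statistics.multimode([1, 1, 2, 2, 3])

# All values unique: multimode returns every value, which is how
# the "no mode" case shows up with this function.
no_mode = statistics.multimode([4, 5, 6])
print(two_modes, no_mode)
```

`multimode` is used here instead of `statistics.mode` because it exposes ties rather than silently picking one value.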

Q2. What is the difference between the mean, median, and mode? How are they used to measure the
central tendency of a dataset?

The mean, median, and mode are all measures of central tendency, but they each represent the center of a dataset in different ways:

### 1. **Mean**
- **Definition**: The mean is the arithmetic average of a set of numbers.
- **Calculation**: Add all the values in the dataset and divide by the number of values.
- **Usage**: The mean uses every data point, which makes it informative but also sensitive to outliers: a few extreme values can pull it away from the bulk of the data. It works best for roughly symmetric data without extreme values, such as normally distributed data.

### 2. **Median**
- **Definition**: The median is the middle value in an ordered dataset.
- **Calculation**: Arrange the data in ascending or descending order, and find the middle value. If the dataset has an even number of values, the median is the average of the two middle numbers.
- **Usage**: The median is useful when the dataset has outliers or is skewed, as it is far less affected by extremely high or low values than the mean. It represents the point where half the data lies below and half lies above.

### 3. **Mode**
- **Definition**: The mode is the most frequently occurring value in a dataset.
- **Calculation**: Identify the value(s) that appear most frequently.
- **Usage**: The mode is useful for categorical data where you want to know the most common category. It can also be used for numerical data, particularly when the dataset has multiple peaks (bimodal or multimodal distributions).

### **Key Differences and Usage in Measuring Central Tendency**
- **Mean** is best for datasets with a normal distribution and no outliers, giving a comprehensive measure by including all data points.
- **Median** is preferred for skewed distributions or when outliers are present, as it better represents the center of the data by focusing on the middle.
- **Mode** is ideal for identifying the most common value, particularly in categorical data, or when the dataset has multiple repeated values. 

Each measure provides a different perspective on the "center" of the data, and the choice of which to use depends on the specific characteristics of the dataset.
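For categorical data the mean and median are undefined, so the mode is the only one of the three measures that applies. The sketch below uses hypothetical survey answers (the `transport` list is an assumption for illustration) and `collections.Counter` to find the most common category.

```python
from collections import Counter

# Hypothetical survey answers: categories cannot be averaged or ordered
transport = ["bus", "car", "bike", "car", "bus", "car", "walk"]

counts = Counter(transport)

# most_common(1) returns a list with one (value, count) pair
most_common_category, frequency = counts.most_common(1)[0]

print(most_common_category, frequency)
```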

Q3. Measure the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
import numpy as np
from scipy import stats

# Given height data
height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Calculating the three measures of central tendency
mean_height = np.mean(height_data)
median_height = np.median(height_data)
mode_height = stats.mode(height_data, keepdims=True)[0][0]

mean_height, median_height, mode_height


(177.01875, 177.0, 177.0)

For the given height data:

- Mean: 177.02 (approximately)
- Median: 177.0
- Mode: 177.0

These values indicate that the heights cluster around 177 cm, with the mean slightly above 177 cm. Note that 177 and 178 each occur three times, so the data is actually bimodal; `stats.mode` reports only one of the tied values (the smallest), which is why it returns 177.0.
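Because `stats.mode` returns a single value, it can hide ties. As a quick check, the standard library's `statistics.multimode` (not used in the original notebook) lists every value that reaches the highest count:

```python
import statistics

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
               178.9, 176.2, 177, 172.5, 178, 176.5]

# Every value that occurs with the maximum frequency, in order of first appearance
modes = statistics.multimode(height_data)

print(sorted(modes))
```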

Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
# Calculating the standard deviation for the given height data
standard_deviation = np.std(height_data)

standard_deviation


1.7885814036548633

The standard deviation for the given height data is approximately **1.79** cm. `np.std` computes the population standard deviation by default (`ddof=0`). The value describes how far the individual heights typically lie from the mean height of the dataset.
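Whether to divide by n or by n - 1 depends on whether the 16 heights are treated as the whole population or as a sample from a larger one. The sketch below compares the two using the standard `statistics` module; it is an addition for illustration, not part of the original notebook.

```python
import math
import statistics

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
               178.9, 176.2, 177, 172.5, 178, 176.5]
n = len(height_data)

# Population standard deviation: divides by n, like np.std(height_data) with ddof=0
population_sd = statistics.pstdev(height_data)

# Sample standard deviation: divides by n - 1, like np.std(height_data, ddof=1)
sample_sd = statistics.stdev(height_data)

print(round(population_sd, 2), round(sample_sd, 2))
```

The sample value is always slightly larger, by a factor of sqrt(n / (n - 1)).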

Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.

Measures of dispersion, such as range, variance, and standard deviation, describe the spread or variability of a dataset. They help us understand how much the data points differ from each other and from the central tendency (mean, median, or mode). Here's how each measure is used:

### 1. **Range**
- **Definition**: The range is the difference between the maximum and minimum values in a dataset.
- **Usage**: The range provides a simple measure of the spread by showing the extent of the data. However, it only considers the two extreme values and does not account for the distribution of the other data points.
- **Example**: If the heights in a dataset range from 172.5 cm to 180 cm, the range is \(180 - 172.5 = 7.5\) cm.

### 2. **Variance**
- **Definition**: Variance measures the average of the squared differences from the mean. It quantifies the overall spread of the data by considering how far each data point is from the mean.
- **Usage**: Variance is useful for understanding the overall dispersion in a dataset, but because it involves squaring the differences, it can be harder to interpret directly, especially in terms of the original units.
- **Example**: For a dataset of heights, the variance will tell us how much the heights vary around the mean height, on average. A high variance indicates a wide spread of heights, while a low variance indicates that most heights are close to the mean.

### 3. **Standard Deviation**
- **Definition**: The standard deviation is the square root of the variance. It represents the typical distance of the data points from the mean (strictly, the root-mean-square deviation), in the same units as the original data.
- **Usage**: The standard deviation is widely used because it is easier to interpret than variance: it gives a measure of spread in the same units as the data. It tells us how much the data points typically deviate from the mean.
- **Example**: If the standard deviation of the heights in a dataset is 1.79 cm, the heights typically lie about 1.79 cm away from the mean height.

### **Example: Understanding Spread in a Dataset**
Consider two datasets of student exam scores:

- **Dataset A**: [85, 86, 87, 88, 89]
- **Dataset B**: [60, 70, 80, 90, 100]

The means are 87 for Dataset A and 80 for Dataset B, so the centers are fairly close, but the spreads are very different.

- **Range**: 
  - Dataset A: \(89 - 85 = 4\)
  - Dataset B: \(100 - 60 = 40\)
  
  Dataset B has a much wider range, indicating greater variability in scores.

- **Variance and Standard Deviation**:
  - Dataset A: Low variance and standard deviation, as all scores are close to the mean.
  - Dataset B: High variance and standard deviation, as scores are spread out over a wide range.

These measures show that although the two means are only 7 points apart (87 and 80), Dataset B has a much greater spread, indicating more variability among student scores.
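Computing the statistics directly makes the comparison concrete. This is a minimal sketch with the standard `statistics` module; `describe_spread` is a hypothetical helper name, and population formulas (dividing by n) are assumed.

```python
import statistics

dataset_a = [85, 86, 87, 88, 89]
dataset_b = [60, 70, 80, 90, 100]

def describe_spread(scores):
    """Return mean, range, population variance and population standard deviation."""
    return {
        "mean": statistics.mean(scores),
        "range": max(scores) - min(scores),
        "variance": statistics.pvariance(scores),
        "std_dev": statistics.pstdev(scores),
    }

spread_a = describe_spread(dataset_a)
spread_b = describe_spread(dataset_b)

print("Dataset A:", spread_a)
print("Dataset B:", spread_b)
```

Each score in Dataset B lies ten times as far from its mean as the matching score in Dataset A, so the standard deviation is ten times larger and the variance a hundred times larger.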

Q6. What is a Venn diagram?

A **Venn diagram** is a visual tool used to illustrate the relationships between different sets. It consists of overlapping circles, where each circle represents a set, and the areas of overlap between circles represent the common elements shared by the sets. Venn diagrams are commonly used in logic, mathematics, statistics, and set theory to show intersections, unions, and differences between sets.

### Key Components:
- **Circles**: Each circle represents a set of items or elements.
- **Overlapping Areas**: The regions where circles overlap show the elements that are common to the sets involved.
- **Non-overlapping Areas**: Parts of a circle that do not overlap with others represent elements that are unique to that set.

### Example:
Consider two sets:
- Set A: {1, 2, 3, 4}
- Set B: {3, 4, 5, 6}

In a Venn diagram:
- One circle represents Set A and another circle represents Set B.
- The overlapping area between the two circles would contain the elements {3, 4}, which are common to both sets.
- The non-overlapping parts of the circles would show {1, 2} (unique to Set A) and {5, 6} (unique to Set B).
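The three regions of this two-circle diagram correspond to Python set operations. A short sketch (the variable names are chosen here for illustration):

```python
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}

only_a = set_a - set_b   # part of circle A outside the overlap
both = set_a & set_b     # the overlapping region
only_b = set_b - set_a   # part of circle B outside the overlap

print(only_a, both, only_b)
```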

### Uses:
- **Logic**: To illustrate logical relationships and operations like AND, OR, and NOT.
- **Set Theory**: To show unions, intersections, and complements of sets.
- **Probability**: To visualize events and their probabilities.
- **Comparison**: To compare and contrast different groups or categories.

Venn diagrams are helpful in making abstract relationships more concrete and easier to understand visually.

Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
(i) A ∩ B
(ii) A ⋃ B

A = (2,3,4,5,6,7), B = (0,2,6,8,10)

A ∩ B (A intersection B):
The intersection contains the elements that are common to both A and B.
For the sets above, A ∩ B = (2,6).

A ⋃ B (A union B):
The union contains every distinct element that appears in A, in B, or in both.
For the sets above, A ⋃ B = (0,2,3,4,5,6,7,8,10).

In [3]:
A = {2,3,4,5,6,7}
B = {0,2,6,8,10}

a_intersection_b = A.intersection(B)
print("Intersection of A and B is : ",a_intersection_b)
a_union_b = A.union(B)
print("Union of A and B is : ",a_union_b)

Intersection of A and B is :  {2, 6}
Union of A and B is :  {0, 2, 3, 4, 5, 6, 7, 8, 10}


Q8. What do you understand about skewness in data?

Skewness is a statistical term that refers to the asymmetry or lack of symmetry in the distribution of a dataset. In a symmetrical distribution, the data is evenly distributed around the mean, with the left and right sides of the distribution mirroring each other. In a skewed distribution, the bulk of the data is concentrated on one side, resulting in a longer tail on the opposite side.

Skewness is an essential concept in data analysis as it helps to understand the shape and characteristics of a dataset. There are three types of skewness:

1. **Positive Skewness (Right Skewness)**: In a positively skewed distribution, the majority of the data is concentrated towards the lower values (left side) of the distribution, and the tail extends to the right. This results in a longer right tail and a shorter left tail. The mean is typically greater than the median in positively skewed data.
   - Typical ordering: Mean > Median > Mode
   - Example: Consider the distribution of household incomes in a country. Most households have lower incomes, while a few extremely high-income households create a long right tail.

2. **Negative Skewness (Left Skewness)**: In a negatively skewed distribution, the majority of the data is concentrated towards the higher values (right side) of the distribution, and the tail extends to the left. This results in a longer left tail and a shorter right tail. The mean is typically less than the median in negatively skewed data.
   - Typical ordering: Mean < Median < Mode
   - Example: Consider the distribution of test scores in an easy exam. Most students score high marks, but a few perform poorly, leading to a long left tail.

3. **Zero Skewness (Symmetrical Distribution, e.g. Gaussian)**: In a symmetrical distribution, the data is evenly distributed around the mean, and there is no skewness. The left and right sides of the distribution mirror each other, and the mean and median are approximately equal.
   - For a symmetrical distribution with a single peak: Mean = Median = Mode
   - Example: The distribution of heights in a population, where most people have heights near the average and the distribution is roughly symmetrical around the mean height.
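The sign of the skewness coefficient can be verified numerically. The sketch below implements the moment-based (Fisher-Pearson) coefficient m3 / m2^1.5, which is the same definition `scipy.stats.skew` uses by default; the three small datasets are made up for illustration.

```python
import statistics

def skewness(data):
    """Fisher-Pearson skewness: third central moment over variance**1.5 (population moments)."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # one large value stretches the right tail
left_skewed = [21 - x for x in right_skewed]    # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3),
          "mean:", statistics.fmean(data), "median:", statistics.median(data))
```

A positive result goes with mean > median, a negative result with mean < median, and the symmetric data gives zero.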

Q9. If a data is right skewed then what will be the position of median with respect to mean?

If a dataset is right-skewed, the median will be positioned to the left of the mean, i.e., the value of the median will be smaller than the mean.

In a right-skewed distribution, the majority of the data is concentrated towards the lower values, resulting in a long tail extending to the right. This elongated tail on the right side is caused by a few extremely high values that pull the mean towards the right. As a result, the mean gets inflated by the presence of these high values, causing it to be larger than the median.

The median, on the other hand, is less influenced by extreme values or outliers because it represents the middle value of the dataset when the data is ordered. It is not affected by the specific values in the tail of the distribution.

Therefore, in a right-skewed distribution, where the tail is longer on the right, the median will be closer to the bulk of the data on the left side, and it will be to the left of the mean.
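A small numeric illustration of this effect, using hypothetical monthly incomes (the numbers are an assumption chosen to produce a long right tail):

```python
import statistics

# Hypothetical incomes in thousands: most are modest, one is very high
incomes = [30, 32, 35, 38, 40, 500]

mean_income = statistics.mean(incomes)      # 675 / 6
median_income = statistics.median(incomes)  # average of the middle pair 35 and 38

print("mean:", mean_income, "median:", median_income)
```

The single large value pulls the mean far to the right, while the median stays with the bulk of the data.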

Q10. Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?

Covariance and Correlation are useful measures in statistical analysis to assess the relationship between two variables. Covariance gives the direction of the relationship, while correlation provides a standardized measure of both direction and strength.

Covariance:
Covariance measures the degree to which two variables change together. It indicates whether the variables increase or decrease simultaneously. A positive covariance indicates that as one variable increases, the other tends to increase as well. Conversely, a negative covariance indicates that as one variable increases, the other tends to decrease. However, the magnitude of the covariance depends on the units of the variables, so covariance alone does not indicate how strong the relationship is.

Formula for Covariance (for a sample): Cov(X, Y) = Σ((Xi - X̄)(Yi - Ȳ)) / (n - 1)

Where:

- Cov(X, Y) is the covariance between variables X and Y.
- Xi and Yi are individual data points in the X and Y datasets, respectively.
- X̄ and Ȳ are the sample means of the X and Y datasets, respectively.
- n is the number of data points in the datasets.

Correlation:
Correlation, on the other hand, is a standardized measure that provides the strength and direction of the relationship between two variables. It normalizes the covariance by dividing it by the product of the standard deviations of the two variables, resulting in a value between -1 and +1. A correlation of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

Formula for Correlation (for a sample): Corr(X, Y) = Cov(X, Y) / (sX * sY)

Where:

- Corr(X, Y) is the correlation coefficient between variables X and Y.
- Cov(X, Y) is the covariance between variables X and Y.
- sX and sY are the sample standard deviations of variables X and Y, respectively.

Both covariance and correlation are used to analyze the relationship between two variables in a dataset:

Covariance: Covariance provides a measure of the direction of the relationship (positive or negative) between two variables. However, the value of covariance itself does not give information about the strength of the relationship or its significance, making it less interpretable than correlation.

Correlation: Correlation provides a standardized measure of the strength and direction of the linear relationship between two variables. It is widely used because it is scale-independent, making it easier to interpret and compare across different datasets. Positive correlation values indicate that the variables move together, while negative correlation values indicate that they move in opposite directions. A correlation close to +1 or -1 indicates a strong linear relationship, while a correlation close to 0 suggests a weak or no linear relationship.
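The two sample formulas above can be implemented directly to show the scale dependence of covariance. This is a minimal sketch: the function names and the study-time data are hypothetical, chosen only for illustration.

```python
import math

def sample_covariance(xs, ys):
    """Cov(X, Y) = sum((Xi - mean_x) * (Yi - mean_y)) / (n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Corr(X, Y) = Cov(X, Y) / (sX * sY), with sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # Cov(X, X) is the sample variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical data: study time and exam score for five students
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]
minutes = [h * 60 for h in hours]  # same quantity in different units

cov_hours = sample_covariance(hours, scores)
cov_minutes = sample_covariance(minutes, scores)
corr_hours = sample_correlation(hours, scores)
corr_minutes = sample_correlation(minutes, scores)

print("covariance:", cov_hours, cov_minutes)
print("correlation:", round(corr_hours, 4), round(corr_minutes, 4))
```

Changing the unit from hours to minutes multiplies the covariance by 60 but leaves the correlation unchanged, which is why correlation is the measure to compare across datasets.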

Q11. What is the formula for calculating the sample mean? Provide an example calculation for a
dataset.

The formula for calculating the sample mean is:

Sample Mean (x̄) = (Sum of all data points) / (Number of data points)

In mathematical notation: x̄ = (Σ Xi) / n

Where:

- x̄ is the sample mean.
- Σ represents the summation symbol (sum of all data points).
- Xi represents individual data points in the dataset.
- n is the number of data points in the sample.

Example: Consider a sample of a student's grades: [81, 79, 90, 98, 92]

Number of data points (n) = 5

Mean = (81 + 79 + 90 + 98 + 92) / 5 = 440 / 5 = 88

Therefore, the mean of the sample data is 88.
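The same calculation written as code, following the formula x̄ = (Σ Xi) / n step by step:

```python
grades = [81, 79, 90, 98, 92]

n = len(grades)           # number of data points
total = sum(grades)       # Σ Xi
sample_mean = total / n   # x̄

print(total, n, sample_mean)
```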

Q12. For a normal distribution data what is the relationship between its measure of central tendency?

For a normal distribution, the three measures of central tendency, namely the mean, median, and mode, are all equal to each other. In a perfectly symmetrical normal distribution, the data is evenly distributed around the center, resulting in the same value for each measure of central tendency.

Mean = Median = Mode

The relationship between the measures of central tendency in a normal distribution is as follows:

Mean: The mean of a normal distribution is located at the center of the distribution, and it is equal to the median and the mode.

Median: The median of a normal distribution is also located at the center of the distribution, and it is equal to the mean and the mode.

Mode: The mode of a normal distribution is the peak point of the curve, and it is equal to both the mean and the median.

This equality follows from the symmetry and single peak of the normal distribution. The bell-shaped curve is symmetric about the mean, so half of the data lies on either side of it, which makes the median equal to the mean. The peak of the curve sits at the center, which makes the mode equal to both.
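For real samples the equality holds only approximately. The sketch below draws simulated heights from a normal distribution (mean 170 and standard deviation 10 are assumptions for illustration) and compares the sample mean and median; the mode is left out because a continuous sample would need binning to estimate it.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# 10,000 simulated heights from a normal distribution (assumed parameters)
sample = [random.gauss(170, 10) for _ in range(10_000)]

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

print(round(sample_mean, 2), round(sample_median, 2))
```

Both values should land close to 170 and close to each other, with the small gap shrinking as the sample size grows.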

Q13. How is covariance different from correlation?

Covariance and Correlation are useful measures in statistical analysis to assess the relationship between two variables. Covariance gives the direction of the relationship, while correlation provides a standardized measure of both direction and strength.

Covariance:
Covariance measures the degree to which two variables change together. It indicates whether the variables increase or decrease simultaneously. A positive covariance indicates that as one variable increases, the other tends to increase as well. Conversely, a negative covariance indicates that as one variable increases, the other tends to decrease. However, the magnitude of the covariance depends on the units of the variables, so covariance alone does not indicate how strong the relationship is.

Formula for Covariance (for a sample): Cov(X, Y) = Σ((Xi - X̄)(Yi - Ȳ)) / (n - 1)

Where:

- Cov(X, Y) is the covariance between variables X and Y.
- Xi and Yi are individual data points in the X and Y datasets, respectively.
- X̄ and Ȳ are the sample means of the X and Y datasets, respectively.
- n is the number of data points in the datasets.

Correlation:
Correlation, on the other hand, is a standardized measure that provides the strength and direction of the relationship between two variables. It normalizes the covariance by dividing it by the product of the standard deviations of the two variables, resulting in a value between -1 and +1. A correlation of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

Formula for Correlation (for a sample): Corr(X, Y) = Cov(X, Y) / (sX * sY)

Where:

- Corr(X, Y) is the correlation coefficient between variables X and Y.
- Cov(X, Y) is the covariance between variables X and Y.
- sX and sY are the sample standard deviations of variables X and Y, respectively.

Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can have a significant impact on measures of central tendency and dispersion in a dataset. An outlier is an extreme value that is unusually distant from the rest of the data points. When present in a dataset, outliers can distort the typical characteristics of the data, affecting both the central tendency and the spread of the data.

In [4]:
import numpy as np
from scipy import stats
data1 = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5,220]
data2 = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5,220,210,204]
data3 = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [5]:
print("Mean Of Data3 : ",np.mean(data3))
print("Mean Of Data1 : ",np.mean(data1))
print("Mean Of Data2 : ",np.mean(data2))

Mean Of Data3 :  177.01875
Mean Of Data1 :  179.54705882352943
Mean Of Data2 :  182.43684210526317


In [6]:
print("Median of Data3 : ",np.median(data3))
print("Median of Data2 : ",np.median(data2))
print("Median of Data1 : ",np.median(data1))

Median of Data3 :  177.0
Median of Data2 :  178.0
Median of Data1 :  177.0


In [7]:
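# keepdims=True preserves the array-shaped result shown below on newer SciPy versions
print("Mode of Data3 : ",stats.mode(data3, keepdims=True))
print("Mode of Data2 : ",stats.mode(data2, keepdims=True))
print("Mode of Data1 : ",stats.mode(data1, keepdims=True))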

Mode of Data3 :  ModeResult(mode=array([177.]), count=array([3]))
Mode of Data2 :  ModeResult(mode=array([177.]), count=array([3]))
Mode of Data1 :  ModeResult(mode=array([177.]), count=array([3]))




Here we observe that:

- The mean is the measure most affected by outliers. It changes noticeably as more outliers are added.
  - The mean of the data without outliers is 177.02
  - The mean of the data with 1 outlier is 179.55
  - The mean of the data with 3 outliers is 182.44
- The median is affected by outliers, but far less than the mean.
  - The median of the data without outliers is 177.00
  - The median of the data with 1 outlier is 177.00
  - The median of the data with 3 outliers is 178.00
- The mode is not affected by the outliers in this data.
  - The mode of the data without outliers is 177.00
  - The mode of the data with 1 outlier is 177.00
  - The mode of the data with 3 outliers is 177.00
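The cells above cover central tendency only. To see the effect of the same outliers on dispersion, the sketch below computes the range and the population standard deviation for the three datasets. It is an addition using the standard `statistics` module, and `spread` is a hypothetical helper name.

```python
import statistics

data3 = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
         178.9, 176.2, 177, 172.5, 178, 176.5]   # no outliers
data1 = data3 + [220]                             # 1 outlier
data2 = data3 + [220, 210, 204]                   # 3 outliers

def spread(data):
    """Return (range, population standard deviation)."""
    return max(data) - min(data), statistics.pstdev(data)

for name, data in [("Data3", data3), ("Data1", data1), ("Data2", data2)]:
    value_range, std_dev = spread(data)
    print(f"{name}: range = {value_range}, std = {std_dev:.2f}")
```

A single outlier is enough to stretch the range from 7.5 to 47.5 and to multiply the standard deviation several times over, so both measures of dispersion are even more sensitive to outliers than the mean.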