# Assignment-27 Solutions

---

### Q1. What are the three measures of central tendency?

1. Mean: The mean, also known as the average, is calculated by summing up all the values in a dataset and then dividing by the number of values. It represents the "typical" value of the data and is sensitive to extreme outliers.

2. Median: The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of data points, the median is the average of the two middle values. The median is less affected by extreme values compared to the mean, making it a more robust measure in the presence of outliers.

3. Mode: The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (multimodal), or no mode at all if all values occur with the same frequency. The mode is particularly useful for categorical or discrete data, but it can also be applied to continuous data.
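
As a quick illustration of the three definitions above, the following minimal sketch uses Python's standard `statistics` module on a small made-up list. The values are hypothetical and chosen only so that one value repeats and one is extreme.

```python
import statistics

# Hypothetical sample: one repeated value (3) and one extreme value (40)
data = [2, 3, 3, 5, 7, 40]

mean = statistics.mean(data)         # sum of values / number of values
median = statistics.median(data)     # average of the two middle values (even n)
modes = statistics.multimode(data)   # list of all most-frequent values

print('Mean :', mean)
print('Median :', median)
print('Mode(s) :', modes)
```

Here the single extreme value 40 pulls the mean up to 10, while the median stays at 4.0 and the mode at 3.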

---

### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

1. Mean:
* Calculation: The mean, also known as the average, is calculated by adding up all the values in a dataset and then dividing by the number of values.
* Sensitivity to Outliers: The mean is sensitive to extreme outliers in the data. A single very large or very small value can significantly impact the mean, pulling it in the direction of the outlier.
* Use: The mean is commonly used when you want to find the arithmetic average of a dataset. It's appropriate for data that is approximately normally distributed and does not have extreme outliers.

2. Median:
* Calculation: The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of data points, the median is the average of the two middle values.
* Sensitivity to Outliers: The median is less sensitive to extreme outliers than the mean. Outliers have minimal effect on the median because it focuses on the middle value(s).
* Use: The median is often used when the data contains outliers or is skewed. It provides a more robust measure of central tendency in such cases. It's also used when you are concerned about the ordinal position of values rather than their actual numerical values.

3. Mode:
* Calculation: The mode is the value that occurs most frequently in a dataset.
* Sensitivity to Outliers: The mode is not influenced by outliers because it is based on frequency rather than numerical values.
* Use: The mode is used primarily with categorical or discrete data, such as categories or groups. It can also be applied to continuous data, but it may not be as informative in such cases. The mode can help identify the most common category or value in a dataset.
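
For categorical data the mean and median are not defined, but the mode still works. A small sketch, assuming a hypothetical list of shirt sizes, using `collections.Counter` from the standard library:

```python
from collections import Counter

# Hypothetical categorical data: shirt sizes sold in a day
sizes = ['M', 'L', 'M', 'S', 'M', 'L', 'XL']

counts = Counter(sizes)
# most_common(1) returns a one-element list holding the (value, count) pair
# of the most frequent category
mode_value, mode_count = counts.most_common(1)[0]

print('Counts :', dict(counts))
print('Mode :', mode_value, '(appears', mode_count, 'times)')
```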

---

### Q3. Measure the three measures of central tendency for the given height data:

 [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
import numpy as np
from scipy import stats 

height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
print('Height :',height)
print()

mean = np.mean(height)
median = np.median(height)
mode = stats.mode(height,keepdims=True)

print('Mean :',mean)
print('Median :',median)
print('Mode :',mode.mode[0])

Height : [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

Mean : 177.01875
Median : 177.0
Mode : 177.0
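
One caveat about the output above: in this list both 177 and 178 occur three times, so the data is bimodal, and `scipy.stats.mode` reports only the smallest of the tied values. The sketch below uses the standard library's `statistics.multimode`, which returns every tied value. It is offered as an alternative, not as part of the original solution.

```python
import statistics

height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

# multimode returns all values that share the highest frequency,
# in the order in which they first appear in the list
modes = statistics.multimode(height)
print('Modes :', modes)
```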


---

### Q4. Find the standard deviation for the given data:

[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
print('Height :',height)
print()

standard_deviation = np.std(height)

print('Standard Deviation :',standard_deviation)

Height : [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

Standard Deviation : 1.7885814036548633
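
Note that `np.std` uses the population formula (dividing by n) by default. If the heights are treated as a sample from a larger population, the sample standard deviation (dividing by n - 1) is usually preferred. A minimal sketch of the difference, using NumPy's `ddof` parameter:

```python
import numpy as np

height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

population_std = np.std(height)        # ddof=0 (default): divide by n
sample_std = np.std(height, ddof=1)    # ddof=1: divide by n - 1

print('Population standard deviation :', population_std)
print('Sample standard deviation :', sample_std)
```

The sample value is always slightly larger, and the gap shrinks as n grows.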


---

### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

1. **Range**:
* Calculation: The range is the simplest measure of dispersion and is calculated as the difference between the maximum and minimum values in a dataset.
* Use: The range gives a rough idea of how spread out the data is. A larger range indicates greater variability in the data. However, it can be influenced by outliers and may not provide a complete picture of the data's distribution.
* Example: Consider the following dataset of exam scores: [65, 75, 80, 85, 90]. The range is 90 (maximum) - 65 (minimum) = 25.

2. **Variance**:
* Calculation: Variance measures how each data point deviates from the mean. It is calculated by taking the average of the squared differences between each data point and the mean.
* Use: Variance quantifies the overall spread of the data. A higher variance indicates greater variability. However, because it involves squared differences, it is expressed in squared units (for example cm² for heights measured in cm), which makes it harder to interpret directly.

3. **Standard Deviation**:
* Calculation: The standard deviation is the square root of the variance. It measures the typical (root-mean-square) deviation of data points from the mean and is expressed in the same units as the original data.
* Use: The standard deviation provides a more interpretable measure of spread compared to variance. A higher standard deviation indicates greater variability.
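
The range example above can be extended to the other two measures. The sketch below computes all three for the same exam scores, using the population formulas (dividing by n) to match NumPy's defaults:

```python
import numpy as np

scores = [65, 75, 80, 85, 90]

data_range = max(scores) - min(scores)   # 90 - 65
variance = np.var(scores)                # mean of squared deviations from the mean (79)
std_dev = np.std(scores)                 # square root of the variance

print('Range :', data_range)
print('Variance :', variance)
print('Standard deviation :', std_dev)
```

The mean is 79 and the squared deviations sum to 370, so the variance is 370 / 5 = 74 and the standard deviation is √74 ≈ 8.6, in the same units as the scores.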

---

### Q6. What is a Venn diagram?

#### A Venn diagram is a graphical representation used to illustrate the relationships and similarities between sets or groups of items. It consists of overlapping circles, where each circle represents a specific set or category, and the overlap between circles indicates the elements that belong to multiple sets or share common characteristics.

![VDAIntersectsB.png](attachment:VDAIntersectsB.png)
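
The regions of a two-circle Venn diagram map directly onto set operations. A minimal sketch with two hypothetical sets (the names and members are illustrative only):

```python
# Hypothetical sets: students enrolled in each course
maths = {'Asha', 'Ben', 'Chen', 'Dev'}
physics = {'Chen', 'Dev', 'Elif'}

only_maths = maths - physics      # left circle, excluding the overlap
both = maths & physics            # the overlapping region
only_physics = physics - maths    # right circle, excluding the overlap

print('Only maths :', sorted(only_maths))
print('Both :', sorted(both))
print('Only physics :', sorted(only_physics))
```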

---

### Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:

(i) A intersection B.

(ii) A union B.

In [3]:
A = {2,3,4,5,6,7}
B = {0,2,6,8,10}

print('Set A :',A)
print('Set B :',B)
print()

print('A intersection B :',A.intersection(B))
print('A union B :',A.union(B))

Set A : {2, 3, 4, 5, 6, 7}
Set B : {0, 2, 6, 8, 10}

A intersection B : {2, 6}
A union B : {0, 2, 3, 4, 5, 6, 7, 8, 10}


---

### Q8. What do you understand about skewness in data?

#### Skewness is a statistical measure that helps us understand the asymmetry or lack of symmetry in the distribution of data. It provides information about the shape of a dataset's probability distribution, specifically how the data is distributed around the mean.

There are three main types of skewness:

1. **Positive Skewness (Right-skewed):** In a positively skewed distribution, the tail on the right-hand side is longer or fatter than the left-hand side. This means that the majority of the data points are concentrated on the left side, and there are relatively few data points on the right side. In a positively skewed distribution, typically mean > median > mode.



2. **Negative Skewness (Left-skewed):** In a negatively skewed distribution, the tail on the left-hand side is longer or fatter than the right-hand side. This indicates that the majority of data points are concentrated on the right side, and there are relatively few data points on the left side. In a negatively skewed distribution, typically mean < median < mode.


3. **Zero Skewness (Symmetrical):** In a symmetrical distribution, there is an equal balance on both sides of the mean, resulting in a distribution that is perfectly balanced and has no skew. In such cases, the mean, median and mode are equal.


Understanding skewness is important in data analysis because it provides insights into the nature of the data distribution. Skewed data can have implications for statistical analysis, such as the choice of appropriate statistical tests and the interpretation of results. For example:

* In positively skewed data, the mean is typically greater than the median, and extreme values on the right side can significantly impact the mean. Therefore, in such cases, the median is often a better measure of central tendency.
* In negatively skewed data, the mean is typically less than the median, and extreme values on the left side can affect the mean. Again, the median may be a more appropriate measure of central tendency.

Skewness can also affect the choice of data transformation or modeling techniques, as well as the assessment of the assumptions underlying various statistical methods. It's a valuable tool for understanding the shape and characteristics of data distributions in statistical analysis.
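
Skewness can also be measured numerically. The sketch below uses `scipy.stats.skew` on a small hypothetical right-skewed sample. A positive result indicates a longer right tail, and the mean sits above the median as described above.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data: most values are small, one is large
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

skewness = stats.skew(data)   # > 0 for right-skewed, < 0 for left-skewed
mean = np.mean(data)
median = np.median(data)

print('Skewness :', skewness)
print('Mean :', mean)
print('Median :', median)
```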

---

### Q9. If a data is right skewed then what will be the position of median with respect to mean?

* Mean: The mean is influenced by extreme values or outliers in a dataset. In a right-skewed distribution, there are typically a few relatively large values on the right side of the distribution (the tail), which pull the mean in that direction. As a result, the mean is usually greater than the median.

* Median: The median is the middle value when the data is sorted in ascending order. In a right-skewed distribution, most of the data points are concentrated on the left side (the bulk of the distribution), and there are relatively few data points on the right side. Because of this concentration on the left, the median is typically located to the left of the mean.

### If the data is right-skewed, the median is typically less than the mean.

---

### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are both measures used in statistics to quantify the relationship between two variables, but they serve slightly different purposes and have different properties:

1. **Covariance:**
* **Definition:** Covariance is a measure that indicates the degree to which two random variables change together. It measures the direction of the linear relationship between two variables.
* **Formula:** For two variables X and Y, the covariance (cov) is calculated as follows:

cov(X, Y) = Σ[(Xᵢ - μₓ) * (Yᵢ - μᵧ)] / (n - 1)

* Where:
* Xᵢ and Yᵢ are individual data points for X and Y.
* μₓ and μᵧ are the means (averages) of X and Y, respectively.
* n is the number of data points.
* **Properties:** Covariance can take on positive, negative, or zero values.
* Positive covariance indicates that as one variable increases, the other tends to increase.
* Negative covariance indicates that as one variable increases, the other tends to decrease.
* Zero covariance suggests no linear relationship between the variables.
* **Units:** The units of covariance are the product of the units of the two variables, which can make it difficult to interpret and compare across different datasets.

2. **Correlation:**
* **Definition:** Correlation is a standardized measure of linear relationship between two variables. It quantifies both strength and direction of the relationship.
* **Formula:** The Pearson correlation coefficient (ρ), which measures linear correlation, is the most commonly used form of correlation:

ρ(X, Y) = cov(X, Y) / (σₓ * σᵧ)

* Where:
* cov(X, Y) is the covariance between X and Y.
* σₓ and σᵧ are the standard deviations of X and Y, respectively.
* **Properties:** Correlation values range from -1 to 1.
* ρ = 1 indicates a perfect positive linear relationship.
* ρ = -1 indicates a perfect negative linear relationship.
* ρ = 0 indicates no linear relationship (although nonlinear relationships may exist).
* **Units:** Correlation is a unitless measure, making it easier to interpret and compare across datasets.


#### How they are used in statistical analysis:

* Covariance: Covariance is used to determine whether two variables tend to increase or decrease together and in which direction. It's often used in finance to analyze the relationships between different financial assets. However, its value alone can be difficult to interpret, and it is not standardized, making comparisons between datasets challenging.

* Correlation: Correlation, particularly the Pearson correlation coefficient, is widely used in various fields, including statistics, economics, social sciences, and natural sciences. It provides a standardized measure of the strength and direction of the linear relationship between two variables. Correlation is valuable for understanding associations between variables, identifying patterns, and making predictions. It's also helpful in selecting appropriate variables for regression analysis and assessing the goodness of fit in regression models.
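
The two formulas above can be checked numerically. The sketch below assumes two small hypothetical variables (hours studied and exam score), computes the sample covariance and Pearson correlation by hand, and compares them with `np.cov` and `np.corrcoef`:

```python
import numpy as np

# Hypothetical paired observations
hours = np.array([1, 2, 3, 4, 5])
score = np.array([52, 58, 61, 70, 74])

n = len(hours)
# Sample covariance: sum of products of deviations, divided by n - 1
cov_manual = np.sum((hours - hours.mean()) * (score - score.mean())) / (n - 1)
# Pearson correlation: covariance divided by the product of the sample standard deviations
corr_manual = cov_manual / (hours.std(ddof=1) * score.std(ddof=1))

print('Covariance (manual) :', cov_manual)
print('Covariance (np.cov) :', np.cov(hours, score)[0, 1])
print('Correlation (manual) :', corr_manual)
print('Correlation (np.corrcoef) :', np.corrcoef(hours, score)[0, 1])
```

The covariance here is 14.0 in units of hours × points, while the correlation is a unitless number close to 1.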

---

### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

### The formula for calculation of the sample mean is

x̄ = (Σ xᵢ) / n = (x₁ + x₂ + ... + xₙ) / n

where xᵢ are the individual observations and n is the number of observations in the sample.


#### Example: consider weights = [85, 90, 78, 92, 88]. To find the mean, sum the elements of the list and divide by the number of elements in the list.
* sum = 85 + 90 + 78 + 92 + 88 = 433
* n = 5
* mean = sum / n = 433 / 5 = 86.6

In [4]:
weights = [85, 90, 78, 92, 88]

print('Weight :',weights)
print()

weights_sum = sum(weights)
print('Sum :',weights_sum)

n = len(weights)
print('n :',n)
print()

print(f'Mean = {weights_sum} / {n} = {weights_sum/n}')

import numpy as np 
print('Mean :',np.mean(weights))

Weight : [85, 90, 78, 92, 88]

Sum : 433
n : 5

Mean = 433 / 5 = 86.6
Mean : 86.6


---

### Q12. For a normal distribution data what is the relationship between its measure of central tendency?

1. Mean (μ): The mean of a normal distribution is equal to the median. This is a fundamental property of a normal distribution. Mathematically, μ = median.

2. Median: The median is located at the center of a normal distribution. Because the distribution is symmetric, the median is the point at which the data is divided into two equal halves.

3. Mode: In a normal distribution, the mode is also equal to the mean and the median. In other words, there is a single peak in the distribution, and that peak corresponds to the mean and median.

#### The normal distribution is characterized by its bell-shaped curve, and it is perfectly symmetric around the mean. This symmetry results in the mean, median, and mode all being located at the center of the distribution, with the same value.
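
This can be checked empirically. The sketch below draws a large sample from a normal distribution and compares the sample mean and median. The mean of 170, standard deviation of 10, seed, and sample size are arbitrary assumptions, and with random data the two values agree only approximately.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed for reproducibility
sample = rng.normal(loc=170, scale=10, size=100_000)

mean = np.mean(sample)
median = np.median(sample)

print('Mean :', mean)
print('Median :', median)
print('Difference :', abs(mean - median))
```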

---

### Q13. How is covariance different from correlation?

Covariance and correlation are both measures used in statistics to quantify the relationship between two variables, but they serve different purposes and have different properties. Here's a summary of the key differences between covariance and correlation:

1. **Definition**:
* **Covariance:** Covariance measures the degree to which two random variables change together. It indicates the direction of the linear relationship between two variables.
* **Correlation:** Correlation is a standardized measure of linear relationship between two variables. It quantifies both strength and direction of the relationship.

2. **Formula**:
* **Covariance:** For two variables X and Y, the covariance (cov) is calculated as:

cov(X, Y) = Σ[(Xᵢ - μₓ) * (Yᵢ - μᵧ)] / (n - 1)

* Where μₓ and μᵧ are the means of X and Y, respectively, and n is the number of data points.
* **Correlation:** The Pearson correlation coefficient (ρ) is the most commonly used form of correlation:

ρ(X, Y) = cov(X, Y) / (σₓ * σᵧ)

* Where σₓ and σᵧ are the standard deviations of X and Y, respectively.

3. **Range of Values**:
* **Covariance:** Covariance can take on any real value. It has no fixed range, which can make it challenging to interpret and compare across datasets.
* **Correlation:** Correlation values range from -1 to 1.
* ρ = 1 indicates a perfect positive linear relationship.
* ρ = -1 indicates a perfect negative linear relationship.
* ρ = 0 indicates no linear relationship (though nonlinear relationships may exist).

4. **Units**:
* **Covariance:** The units of covariance are the product of the units of the two variables, which can make it difficult to interpret and compare.
* **Correlation:** Correlation is a unitless measure, making it easier to interpret and compare across datasets.

5. **Standardization**:
* **Covariance:** Covariance is not standardized, and its value depends on the units of the variables.
* **Correlation:** Correlation is standardized, allowing for meaningful comparisons between datasets regardless of units.
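
The unit dependence described in points 4 and 5 is easy to demonstrate. In the sketch below (hypothetical heights and weights), converting height from centimetres to metres shrinks the covariance by a factor of 100 but leaves the correlation unchanged:

```python
import numpy as np

# Hypothetical paired measurements
height_cm = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
weight_kg = np.array([50.0, 58.0, 63.0, 70.0, 79.0])
height_m = height_cm / 100   # same data, different unit

cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]

print('Covariance (cm, kg) :', cov_cm)
print('Covariance (m, kg) :', cov_m)
print('Correlation (cm, kg) :', corr_cm)
print('Correlation (m, kg) :', corr_m)
```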

---

### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can significantly impact measures of central tendency (mean, median, and mode) and measures of dispersion (range, variance, and standard deviation) in a dataset. Here's how they affect these measures with an example:

1. **Measures of Central Tendency:**
* **Mean:** Outliers have a substantial effect on the mean. A single extreme value can pull the mean in its direction. Therefore, the mean is sensitive to outliers. When outliers are present, the mean may not accurately represent the typical or central value of the data.
   
* Example: Consider a dataset of salaries for a small company: [40,000, 42,000, 41,000, 45,000, 1,000,000]. The outlier, 1,000,000, inflates the mean to 233,600, far higher than what is typical for most employees.

* **Median:** The median is much less affected by outliers because it depends only on the middle value(s) of the sorted data. In the example above, the median is 42,000, which better represents the typical salary of the majority of employees.
* **Mode:** Like the median, the mode is not heavily influenced by outliers, as it represents the most frequently occurring value. In the example above every salary occurs exactly once, so there is no single mode, and adding the outlier does not change that.

2. **Measures of Dispersion:**
* **Range:** Outliers can significantly affect the range by expanding it. The range is the difference between the maximum and minimum values in the dataset. Outliers can pull the maximum value higher or the minimum value lower, resulting in a wider range.

* **Variance and Standard Deviation:** Outliers have a substantial impact on the variance and standard deviation because these measures involve the squared differences between data points and the mean. Outliers with large deviations from the mean contribute disproportionately to these measures. They increase the spread of the data, leading to higher variance and standard deviation values.

Example: Consider a dataset of test scores for a class: [80, 85, 90, 95, 50, 110]. The extreme scores 50 and 110 both lie far from the rest and increase the variance and standard deviation, indicating more variability in the scores than there actually is for the majority of students.
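
The salary example from point 1 can be verified directly. A minimal sketch comparing the mean and median with and without the outlier:

```python
import numpy as np

salaries = [40_000, 42_000, 41_000, 45_000]
salaries_with_outlier = salaries + [1_000_000]

print('Without outlier -> mean :', np.mean(salaries), ', median :', np.median(salaries))
print('With outlier -> mean :', np.mean(salaries_with_outlier),
      ', median :', np.median(salaries_with_outlier))
```

The mean jumps from 42,000 to 233,600, while the median moves only from 41,500 to 42,000.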


In [5]:
age = [24,36,78,43,98,23,17]
print('List without outlier :',age)
print()

print('Measure of Central Tendency.')
print('Mean :',np.mean(age))
print('Median :',np.median(age))
print('Mode :',stats.mode(age,keepdims = True).mode[0])
print()

print('Measure of Dispersion.')
print('Variance :',np.var(age))
print('Standard deviation :',np.std(age))

List without outlier : [24, 36, 78, 43, 98, 23, 17]

Measure of Central Tendency.
Mean : 45.57142857142857
Median : 36.0
Mode : 17

Measure of Dispersion.
Variance : 812.8163265306122
Standard deviation : 28.50993382192621


In [6]:
# adding an outlier to the age
age.append(200)
print('List with outlier :',age)
print()

print('Measure of Central Tendency.')
print('Mean :',np.mean(age))
print('Median :',np.median(age))
print('Mode :',stats.mode(age,keepdims = True).mode[0])
print()

print('Measure of Dispersion.')
print('Variance :',np.var(age))
print('Standard deviation :',np.std(age))

List with outlier : [24, 36, 78, 43, 98, 23, 17, 200]

Measure of Central Tendency.
Mean : 64.875
Median : 39.5
Mode : 17

Measure of Dispersion.
Variance : 3319.609375
Standard deviation : 57.61605136591712


---