#### Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. Mean: The mean is the average of a set of values. To calculate it, we sum up all the values and then divide by the total number of values. The mean is sensitive to extreme values (outliers) in the data.

2. Median: The median is the middle value in a data set when the values are arranged in order. If there's an even number of values, the median is the average of the two middle values. The median is less affected by extreme values compared to the mean.

3. Mode: The mode is the value that occurs most frequently in a data set. There can be one mode (unimodal), more than one mode (multimodal), or no mode at all. The mode is useful for categorical or discrete data.

These measures provide different ways to summarize and understand the central or typical value in a data set. The choice of which measure to use depends on the nature of the data and the specific questions you want to answer.

#### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are three different measures of central tendency used to describe the typical or central value of a dataset. Here are the key differences between them and how they are used:

1. Mean:
   - Calculation: The mean is calculated by summing up all the values in the dataset and then dividing by the total number of values.
   - Use: It provides an average value for the dataset. It is the most commonly used measure of central tendency.
   - Sensitivity to Outliers: The mean is sensitive to extreme values (outliers) in the data, and a single outlier can significantly affect the mean.

2. Median:
   - Calculation: The median is the middle value in a dataset when the values are arranged in order. If there's an even number of values, the median is the average of the two middle values.
   - Use: It represents the middle value of the dataset and is less affected by outliers. It's useful when dealing with skewed data or datasets with extreme values.
   - Sensitivity to Outliers: The median is less sensitive to outliers compared to the mean.

3. Mode:
   - Calculation: The mode is the value that occurs most frequently in a dataset.
   - Use: It identifies the most common value in the dataset. It's particularly useful for categorical or discrete data.
   - Sensitivity to Outliers: The mode is not influenced by outliers since it's based on frequency counts.
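
The three calculations above can be sketched with Python's standard `statistics` module. The `values` and `colors` lists are made-up illustration data, not part of the assignment; the second list shows that the mode still applies to categorical data, where a mean or median would not be meaningful.

```python
import statistics

# Hypothetical illustration data (not from the assignment)
values = [2, 3, 3, 5, 7, 10]
colors = ["red", "blue", "red", "green"]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # average of the two middle values (3 and 5)
mode_value = statistics.mode(values)      # most frequent value
mode_color = statistics.mode(colors)      # mode also works for categories

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
print("Most common color:", mode_color)
```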


#### Q3. Measure the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [4]:
from scipy import stats
import numpy as np

height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
height_mean = np.mean(height)
height_median = np.median(height)
# 177 and 178 both occur three times; stats.mode reports only the smallest of the tied values
height_mode = stats.mode(height)

print("The mean of the height is:", height_mean)
print("The median of the height is:", height_median)
print("The mode of the height is:", height_mode)


The mean of the height is: 177.01875
The median of the height is: 177.0
The mode of the height is: ModeResult(mode=array([177.]), count=array([3]))
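
Because `stats.mode` returns a single value, it can hide ties. As a small cross-check, and assuming Python 3.8 or later for `statistics.multimode`, the sketch below counts every height and lists all values that share the highest frequency.

```python
from collections import Counter
import statistics

height = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
          178.9, 176.2, 177, 172.5, 178, 176.5]

counts = Counter(height)                          # frequency of each distinct height
all_modes = sorted(statistics.multimode(height))  # every value tied for the top count

print("Most common values:", counts.most_common(2))
print("All modes:", all_modes)
```

For this data the height list is bimodal, so reporting both modes is more faithful than reporting 177 alone.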




#### Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [17]:
val = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
# np.std uses ddof=0 by default, i.e. the population standard deviation
val_std = np.std(val)
print("The standard deviation is:", val_std)

The standard deviation is: 1.7885814036548633
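
The value above is the population standard deviation (divide by n). If the 16 heights are treated as a sample from a larger group, the sample standard deviation (divide by n - 1) is usually reported instead; in NumPy that corresponds to `np.std(val, ddof=1)`. The sketch below shows both with the standard `statistics` module, as one possible way to compare them.

```python
import statistics

val = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
       178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = statistics.pstdev(val)  # divides by n, matches np.std(val)
sample_std = statistics.stdev(val)       # divides by n - 1, matches np.std(val, ddof=1)

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.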


#### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion, including range, variance, and standard deviation, are used to quantify and describe the extent to which data points in a dataset deviate from the central tendency (mean, median, mode). They provide valuable insights into the spread, variability, and consistency of data. Here's how each measure is used, along with an example:

1. Range:
   - Definition: The range is the simplest measure of dispersion and represents the difference between the maximum and minimum values in a dataset.
   - Use: It provides a quick, rough estimate of the spread of data but is highly sensitive to outliers.
   - Example: Consider the following dataset of test scores: [60, 70, 75, 80, 95]. The range is 95 (maximum) - 60 (minimum) = 35.

2. Variance:
   - Definition: Variance measures how far data points deviate from the mean. It is the average of the squared differences between each data point and the mean.
   - Use: Variance quantifies the overall spread of data while giving more weight to larger deviations. However, it is expressed in squared units rather than the original units of the data.
   - Example: For the test scores mentioned earlier, the population variance is calculated as follows:
     - Mean = (60 + 70 + 75 + 80 + 95) / 5 = 76.
     - Variance = [(60 - 76)² + (70 - 76)² + (75 - 76)² + (80 - 76)² + (95 - 76)²] / 5 = (256 + 36 + 1 + 16 + 361) / 5 = 670 / 5 = 134.

3. Standard Deviation:
   - Definition: Standard deviation is the square root of the variance. It indicates the typical deviation of data points from the mean and is expressed in the same units as the data.
   - Use: Standard deviation is the most commonly used measure of dispersion. It quantifies the spread while maintaining the original units of measurement.
   - Example: Using the same test scores, the standard deviation is the square root of the variance, √134 ≈ 11.58.
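
The three dispersion measures for the test scores can be checked with a few lines of Python. This is a minimal sketch using the population formulas (divide by n), as in the worked example; `statistics.variance` and `statistics.stdev` would give the sample versions instead.

```python
import statistics

scores = [60, 70, 75, 80, 95]

score_range = max(scores) - min(scores)             # maximum minus minimum
population_variance = statistics.pvariance(scores)  # mean of squared deviations
population_std = statistics.pstdev(scores)          # square root of the variance

print("Range:", score_range)
print("Variance:", population_variance)
print("Standard deviation:", round(population_std, 2))
```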


#### Q6. What is a Venn diagram?

A Venn diagram is a graphical representation used to illustrate the relationships and commonalities between different sets of data. It consists of overlapping circles or ellipses, each representing a specific set or category. The overlapping regions between these circles represent elements that belong to more than one set, highlighting the shared characteristics or data points.

Key features of a Venn diagram:

- Circles or Ellipses: Each circle or ellipse corresponds to a specific set or category. The elements within each set are represented within the boundaries of the corresponding circle.
- Overlap: When two or more circles intersect, the overlapping region represents elements that belong to multiple sets, indicating their shared characteristics.
- Non-overlapping Areas: The non-overlapping regions of each circle represent elements that are unique to that set and do not belong to any other set.
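
The regions of a two-circle Venn diagram map directly onto Python set operations. In the sketch below, the two sets and the names in them are hypothetical; each variable corresponds to one region of the diagram.

```python
# Hypothetical sets: people who use each tool
python_users = {"Asha", "Ben", "Chen", "Dina"}
sql_users = {"Chen", "Dina", "Eli"}

only_python = python_users - sql_users  # left circle, non-overlapping area
both = python_users & sql_users         # overlap of the two circles
only_sql = sql_users - python_users     # right circle, non-overlapping area

print("Only Python:", sorted(only_python))
print("Both:", sorted(both))
print("Only SQL:", sorted(only_sql))
```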

#### Q7. For the two given sets A = {2,3,4,5,6,7} and B = {0,2,6,8,10}, find: (i) A ∩ B (ii) A ∪ B

In [15]:
A, B = {2, 3, 4, 5, 6, 7}, {0, 2, 6, 8, 10}
intersection_set = A.intersection(B)  # elements common to both sets
union_set = A.union(B)                # elements that appear in either set

print("Intersection of A and B is :", intersection_set)
print("Union of A and B is :", union_set)

Intersection of A and B is : {2, 6}
Union of A and B is : {0, 2, 3, 4, 5, 6, 7, 8, 10}


#### Q8. What do you understand about skewness in data?

Skewness in data refers to the measure of the asymmetry or lack of symmetry in the distribution of data points within a dataset. It quantifies the extent to which the data deviates from a perfectly symmetric or normal distribution. Skewness is an essential concept in statistics and data analysis, as it provides valuable insights into the shape and characteristics of data distributions.

Key points about skewness in data:

1. Symmetry: In a perfectly symmetric, unimodal distribution, the mean, median, and mode are all equal and located at the center of the distribution. Such a distribution has zero skewness.

2. Positive Skewness (Right-skewed): When the right tail (larger values) of the distribution is longer or more stretched out than the left tail (smaller values), the data is said to be positively skewed. In a positively skewed distribution, the mean is typically greater than the median. The majority of data points are concentrated on the left side of the distribution, with a few extreme values on the right.

3. Negative Skewness (Left-skewed): When the left tail (smaller values) of the distribution is longer or more stretched out than the right tail (larger values), the data is said to be negatively skewed. In a negatively skewed distribution, the mean is typically less than the median. The majority of data points are concentrated on the right side of the distribution, with a few extreme values on the left.

4. Skewness Coefficient: Skewness is quantified using a skewness coefficient or index. A positive skewness coefficient indicates positive skewness, while a negative skewness coefficient indicates negative skewness. A perfectly symmetric distribution has a skewness coefficient of zero, although a coefficient of zero does not by itself guarantee symmetry.

5. Causes of Skewness: Skewness can be caused by various factors, including outliers or extreme values, natural constraints in the data (e.g., age cannot be negative), or the inherent nature of the data itself (e.g., income data often exhibits positive skewness due to a few high-income individuals).
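
One common skewness coefficient is the Fisher-Pearson coefficient, the third central moment divided by the second central moment raised to the power 1.5. The sketch below implements that formula by hand on three small made-up datasets; other definitions (for example, bias-corrected sample skewness) give slightly different numbers but the same sign.

```python
def skewness(data):
    """Fisher-Pearson skewness coefficient using population moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical illustration data
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]
left_skewed = [-20, -3, -3, -3, -2, -2, -1]

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
skew_left = skewness(left_skewed)

print("Symmetric:", skew_symmetric)
print("Right-skewed:", skew_right)
print("Left-skewed:", skew_left)
```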


#### Q9. If the data is right-skewed, what will be the position of the median with respect to the mean?

In a right-skewed distribution (positively skewed), the position of the median is typically to the left of the mean. Here's why:

1. Right Skewness (Positively Skewed): In a right-skewed distribution, the tail of the distribution is stretched out towards the right, indicating that there are a few extreme values on the right side of the distribution. Most of the data points are concentrated on the left side.

2. Mean vs. Median: The mean (average) is sensitive to extreme values or outliers. When there are outliers on the right side (higher values) of the distribution, they pull the mean towards the right. As a result, the mean is greater than the median.

3. Median: The median is the middle value of the dataset when it's arranged in ascending order. Because most of the data points are on the left side, the median stays close to the bulk of the data and therefore lies to the left of the mean.
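
A quick numerical check of this ordering, using a made-up right-skewed list of incomes (in thousands) where one large value stretches the right tail:

```python
import statistics

# Hypothetical incomes in thousands; 150 is the extreme value in the right tail
incomes = [20, 22, 25, 27, 30, 32, 150]

income_mean = statistics.mean(incomes)
income_median = statistics.median(incomes)

print("Mean:", round(income_mean, 2))
print("Median:", income_median)
print("Median is to the left of the mean:", income_median < income_mean)
```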


#### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are both measures used to quantify the relationship between two variables in statistics. However, they serve slightly different purposes and have distinct characteristics:

Covariance:
1. Definition: Covariance measures the degree to which two random variables change together. It quantifies whether an increase in one variable corresponds to an increase, decrease, or no change in the other variable.
2. Range: Covariance can take any value, positive or negative. A positive covariance indicates a positive relationship (both variables tend to increase or decrease together), a negative covariance indicates a negative relationship (one variable tends to increase when the other decreases), and a covariance near zero indicates little to no linear relationship.
3. Units: Covariance is not scaled and depends on the units of the variables being measured. Therefore, it can be challenging to interpret the magnitude of covariance.
4. Formula: The sample covariance between two variables X and Y is:

   $$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

   where $n$ is the number of data points, $x_i$ and $y_i$ are the individual data points, and $\bar{x}$ and $\bar{y}$ are the means of X and Y, respectively.

Correlation:
1. Definition: Correlation is a standardized measure of the strength and direction of the linear relationship between two variables. It quantifies the degree to which two variables are linearly related.
2. Range: Correlation values range between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
3. Units: Correlation is unitless and does not depend on the units of the variables. This makes it easier to compare the strength of relationships between different pairs of variables.
4. Formula: The Pearson correlation coefficient between X and Y is:

   $$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

   where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.

Use in Statistical Analysis:
- Covariance: Covariance can be used to identify the direction of the relationship between two variables, whether positive or negative. It's commonly used in finance to analyze the risk and return of assets in a portfolio. However, it doesn't provide a standardized measure of the strength of the relationship, and its interpretation can be challenging due to units dependence.

- Correlation: Correlation is widely used in statistics and data analysis because of its standardization and ease of interpretation. It's used to:
  - Determine the strength and direction of the linear relationship between two variables.
  - Assess the degree to which changes in one variable are associated with changes in another.
  - Identify patterns and associations in data.
  - Make predictions and assess the validity of regression models.
  - Explore possible dependencies between variables, keeping in mind that correlation alone does not establish cause and effect.
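
The unit dependence of covariance, and the lack of it in correlation, can be seen by rescaling one variable. The sketch below implements the sample covariance (dividing by n - 1) and the Pearson correlation by hand; the `hours` and `scores` data and the helper function names are made up for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Hypothetical data: study time and exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
minutes = [h * 60 for h in hours]  # same information, different unit

cov_hours = covariance(hours, scores)
cov_minutes = covariance(minutes, scores)
corr_hours = correlation(hours, scores)
corr_minutes = correlation(minutes, scores)

print("Covariance (hours):", cov_hours)
print("Covariance (minutes):", cov_minutes)
print("Correlation (hours):", round(corr_hours, 4))
print("Correlation (minutes):", round(corr_minutes, 4))
```

Switching from hours to minutes multiplies the covariance by 60, while the correlation stays the same.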


#### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The formula for calculating the sample mean (average) of a dataset is:

Sample Mean = (Sum of all data points) / (Total number of data points)

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Here's an example calculation:

Suppose we have the following dataset of exam scores:

Scores = [85, 90, 78, 92, 88] 

To calculate the sample mean:

1. Add up all the data points:
85 + 90 + 78 + 92 + 88 = 433 

2. Count the total number of data points, which in this case is 5.

3. Apply the formula:
Sample Mean = 433 / 5 = 86.6

So, the sample mean of this dataset is 86.6.

In [26]:
import numpy as np 
scores = [85, 90, 78, 92, 88]
sample_mean = np.mean(scores)
print("The Sample mean is :",sample_mean )

The Sample mean is : 86.6


#### Q12. For normally distributed data, what is the relationship between its measures of central tendency?

In a normal distribution, the three measures of central tendency (mean, median, and mode) are all equal:

1. Mean (Average): The mean sits at the center of the bell curve. Because the distribution is symmetric, the mean divides it into two equal halves.

2. Median: The median equals the mean, because the center of symmetry is also the point with half of the data on each side.

3. Mode: The mode also equals the mean and the median. A normal distribution is unimodal, and its single peak, the most likely value, lies exactly at the center. In a finite sample drawn from a normal distribution, the three values are usually close to each other but not exactly identical.
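
This relationship can be illustrated with a simulation. The sketch below draws values from a normal distribution with an assumed mean of 50 and standard deviation of 10 (both chosen arbitrarily), then compares the sample mean, the sample median, and the center of the most populated histogram bin as a rough stand-in for the mode of continuous data.

```python
import random
import statistics
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Assumed parameters for illustration: mean 50, standard deviation 10
sample = [random.gauss(50, 10) for _ in range(100_000)]

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

# Approximate the mode by grouping values into bins of width 5
bin_width = 5
bins = Counter(bin_width * round(x / bin_width) for x in sample)
modal_bin = bins.most_common(1)[0][0]

print("Mean:", round(sample_mean, 2))
print("Median:", round(sample_median, 2))
print("Center of the most populated bin:", modal_bin)
```

All three results land near 50, the center of the distribution.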

#### Q13. How is covariance different from correlation?

Covariance and correlation are related statistical concepts that both deal with the relationship between two variables, but they differ in several key ways:

1. Definition:

- Covariance: Covariance measures the degree to which two random variables change together. It quantifies whether an increase in one variable corresponds to an increase, decrease, or no change in the other variable. In other words, it indicates the direction of the linear relationship between the two variables.
- Correlation: Correlation is a standardized measure of the strength and direction of the linear relationship between two variables. It is obtained by dividing the covariance by the product of the two standard deviations, which removes the influence of the variables' scales.

2. Range:

- Covariance: Covariance can take any value, positive or negative. A positive covariance indicates a positive relationship (both variables tend to increase or decrease together), a negative covariance indicates a negative relationship (one variable tends to increase when the other decreases), and a covariance near zero indicates little to no linear relationship.
- Correlation: Correlation values range between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. Correlation provides a more standardized measure of the strength of the relationship.

3. Units:

- Covariance: Covariance is not scaled and depends on the units of the variables being measured. Therefore, it can be challenging to interpret the magnitude of covariance.
- Correlation: Correlation is unitless and does not depend on the units of the variables. This makes it easier to compare the strength of relationships between different pairs of variables.

4. Standardization:

- Covariance: Covariance is not standardized, so its magnitude depends on the scales of the variables. This can make it difficult to compare covariances between different pairs of variables.
- Correlation: Correlation is standardized, which means that its values are not affected by the scales of the variables. This standardization makes correlation a more useful measure for comparing relationships between different pairs of variables.

Use Cases:

- Covariance: Covariance is used to identify the direction of the relationship between two variables (positive or negative). It is sometimes used in financial analysis to analyze the risk and return of assets in a portfolio. However, its interpretation can be challenging due to units dependence.

- Correlation: Correlation is widely used in statistics and data analysis because of its standardization and ease of interpretation. It is used to determine the strength and direction of linear relationships, assess the degree of association between variables, support predictions, and flag possible dependencies worth investigating. Correlation alone does not establish cause and effect.


#### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can significantly affect measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation, range) in a dataset. Here's how:

Measures of Central Tendency:

1. Mean: Outliers have a substantial impact on the mean. Since the mean is the sum of all values divided by the number of values, extreme outliers can pull the mean in their direction. For example, in a dataset of income where most people earn around Rs. 50,000, a single outlier earning Rs. 1,000,000 can significantly inflate the mean.

2. Median: The median is less affected by outliers. It represents the middle value when the data is sorted, so a few extreme values have minimal influence on it. In the example above, the median would still be close to Rs. 50,000.

3. Mode: The mode, or most frequent value, may not be influenced at all by outliers unless they create a new peak in the distribution.

Measures of Dispersion:

1. Variance and Standard Deviation: Outliers increase the spread of data, leading to larger variances and standard deviations. These measures are sensitive to extreme values, giving them more weight.

2. Range: Outliers can significantly increase the range, which is the difference between the maximum and minimum values.

Here's an example:

Consider a dataset of the ages of students in a high school class:


15, 16, 17, 15, 17, 16, 15, 16, 16, 70


- Mean without Outlier: (15 + 16 + 17 + 15 + 17 + 16 + 15 + 16 + 16) / 9 = 143 / 9 ≈ 15.89
- Mean with Outlier: (15 + 16 + 17 + 15 + 17 + 16 + 15 + 16 + 16 + 70) / 10 = 213 / 10 = 21.3

The mean is significantly affected by the outlier (age 70), increasing from about 15.89 to 21.3.

- Median without Outlier: 16 (middle value)
- Median with Outlier: 16 (middle value)

The median remains the same regardless of the outlier.

- Range without Outlier: 17 - 15 = 2
- Range with Outlier: 70 - 15 = 55

The range is greatly affected by the outlier, increasing from 2 to 55.

In summary, outliers can distort the measures of central tendency, especially the mean, and can increase measures of dispersion like range, variance, and standard deviation. It's essential to consider and sometimes handle outliers appropriately when analyzing data.
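
The comparison for the ages example can be reproduced with a short script. This is a minimal sketch using the standard `statistics` module; the `summarize` helper is a name introduced here for illustration, and the standard deviation shown is the population version.

```python
import statistics

ages_with_outlier = [15, 16, 17, 15, 17, 16, 15, 16, 16, 70]
ages_without_outlier = [age for age in ages_with_outlier if age != 70]

def summarize(data):
    """Return the central tendency and dispersion measures for one dataset."""
    return {
        "mean": round(statistics.mean(data), 2),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "std": round(statistics.pstdev(data), 2),
    }

summary_without = summarize(ages_without_outlier)
summary_with = summarize(ages_with_outlier)

print("Without outlier:", summary_without)
print("With outlier:   ", summary_with)
```

The median is unchanged by the extra value, while the mean, range, and standard deviation all increase.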