# Statistics Assignment 2

## Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. **Mean**: 
   - **Definition**: The mean, often referred to as the average, is the sum of all the values in a dataset divided by the number of values.
   - **Usage**: It works best when the data is roughly symmetric and has no extreme outliers.
   - **Example**: In a dataset of test scores, the mean score can be calculated by summing all the scores and dividing by the number of students.

2. **Median**: 
   - **Definition**: The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle numbers.
   - **Usage**: It's useful when the data has outliers or is skewed, as it better represents the center of the data.
   - **Example**: In a list of house prices, the median price is the price at the middle of the list when the prices are sorted from lowest to highest.

3. **Mode**: 
   - **Definition**: The mode is the value that occurs most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode if no value repeats.
   - **Usage**: It is used with categorical data or when identifying the most common item is important.
   - **Example**: In a survey of favorite ice cream flavors, the mode would be the flavor chosen by the most people.
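
The three definitions can be checked on a small dataset. The following is a minimal sketch using Python's standard `statistics` module; the test scores are made-up values chosen only for illustration.

```python
import statistics

# Hypothetical test scores, chosen only to illustrate the definitions
scores = [60, 85, 85, 90, 100]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

print(f"mean={mean_score}, median={median_score}, mode={mode_score}")
```

Here the low score of 60 pulls the mean (84) slightly below the median and mode (both 85).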

## Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are all measures of central tendency, each providing a different way to summarize the center of a dataset. Here's how they differ and their typical uses:

### **Mean**
- **Calculation**: The mean is calculated by adding all the values in the dataset and then dividing by the number of values.
- **Sensitivity to Outliers**: The mean is sensitive to outliers, meaning that extreme values can significantly affect it. This can sometimes give a misleading representation of the data's center, especially in skewed distributions.
- **Use Cases**: The mean is useful for data that is symmetrically distributed and has no extreme outliers. It's often used in quantitative data analysis, such as calculating average scores, salaries, or temperatures.

### **Median**
- **Calculation**: The median is the middle value when the data is arranged in ascending or descending order. If there are an even number of observations, the median is the average of the two middle numbers.
- **Sensitivity to Outliers**: The median is less affected by outliers and skewed data, making it a better measure of central tendency in such cases.
- **Use Cases**: The median is preferred when dealing with skewed distributions or data with outliers, such as income levels, home prices, or any scenario where extreme values could skew the mean.

### **Mode**
- **Calculation**: The mode is the value that appears most frequently in the dataset. There can be more than one mode if multiple values occur with the same highest frequency.
- **Sensitivity to Outliers**: The mode is not affected by outliers but may not always reflect the central tendency if the data has multiple modes or if the most common value is not near the center of the distribution.
- **Use Cases**: The mode is most useful with categorical data where we want to know the most common category, such as the most common color of cars in a parking lot, or when data are nominal.

### **Choosing the Right Measure**
- **Mean**: Use the mean when the data is roughly symmetric, has no extreme values, and you want an arithmetic average.
- **Median**: Use the median when dealing with skewed data or outliers, as it provides a better representation of the data's center.
- **Mode**: Use the mode with categorical data to find the most common category or with numerical data to find the most frequent value.
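
The guidance above can be sketched in code. This example assumes two hypothetical datasets: a skewed list of salaries, where the median is the safer summary, and a list of car colors, where only the mode is meaningful.

```python
import statistics

# Hypothetical salaries (in thousands); one very large value skews the data
salaries = [30, 32, 35, 38, 40, 300]
salary_mean = statistics.mean(salaries)      # pulled upward by the value 300
salary_median = statistics.median(salaries)  # stays near the typical salary

# Hypothetical car colors: nominal data, so only the mode applies
car_colors = ["white", "black", "white", "red", "white", "black"]
most_common_color = statistics.mode(car_colors)

print(f"mean={salary_mean:.1f}, median={salary_median}, mode={most_common_color}")
```

The mean (about 79.2) is larger than five of the six salaries, while the median (36.5) sits among the typical values.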

## Q3. Compute the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [14]:
import numpy as np
from scipy import stats
# Given height data
height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Calculating the mean
mean_height = np.mean(height_data)

# Calculating the median
median_height = np.median(height_data)

# Calculating the mode
mode_height = stats.mode(height_data)

print(f"Mean:  {mean_height} \nMedian : { median_height} \nMODE : {mode_height}")


Mean:  177.01875 
Median : 177.0 
MODE : ModeResult(mode=array([177.]), count=array([3]))
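
Note that `stats.mode` returns a single value even when several values tie for the highest count. In this dataset both 177 and 178 occur three times, so the data is bimodal. A minimal sketch that lists every mode, using `statistics.multimode` from the standard library (available from Python 3.8):

```python
from statistics import multimode

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
               178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value tied for the highest frequency
modes = sorted(multimode(height_data))

print(modes)  # [177, 178]
```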




## Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [15]:
# Calculating the standard deviation for the given height data
standard_deviation = np.std(height_data)

print(f"Standard_Deviation = {standard_deviation}")


Standard_Deviation = 1.7885814036548633
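
`np.std` uses the population formula by default (dividing by \( n \), `ddof=0`), which is what the value above reports. If the 16 heights are treated as a sample from a larger population, the sample standard deviation (dividing by \( n - 1 \)) may be the intended answer; which one the question expects is an assumption. A short sketch comparing the two:

```python
import numpy as np

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
               178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = np.std(height_data)       # ddof=0: divide by n
sample_std = np.std(height_data, ddof=1)   # ddof=1: divide by n - 1

print(f"Population standard deviation: {population_std:.4f}")
print(f"Sample standard deviation:     {sample_std:.4f}")
```

The sample value is always slightly larger, and the gap shrinks as the number of observations grows.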


## Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion are statistical tools used to describe the spread or variability within a dataset. They help us understand how much the data points differ from each other and from the central tendency (mean, median, mode). The most common measures of dispersion are range, variance, and standard deviation. Here's how they are used:

### **1. Range**
- **Definition**: The range is the difference between the maximum and minimum values in a dataset.
- **Usage**: It provides a quick sense of the extent of data spread but is sensitive to outliers.
- **Example**: In a dataset of exam scores ranging from 40 to 95, the range is \( 95 - 40 = 55 \). This tells us that the scores span 55 points.

### **2. Variance**
- **Definition**: Variance is the average of the squared deviations from the mean. Squaring the deviations gives more weight to extreme values.
- **Usage**: Variance provides a measure of how data points spread around the mean but is less intuitive because it's in squared units of the data.
- **Example**: For a dataset with exam scores, a high variance indicates that scores are spread out over a wide range, while a low variance suggests that scores are closer to the mean.

### **3. Standard Deviation**
- **Definition**: Standard deviation is the square root of the variance. It provides a measure of dispersion in the same units as the data, making it easier to interpret.
- **Usage**: Standard deviation indicates the typical distance of data points from the mean (strictly, it is the root-mean-square deviation rather than the plain average distance). A smaller standard deviation implies that data points are close to the mean, while a larger standard deviation indicates greater spread.
- **Example**: If the standard deviation of exam scores is 5, scores typically lie about 5 points from the mean.

### **Example to Illustrate Dispersion**
Consider a dataset of monthly temperatures in a city:
\[ [15, 16, 18, 19, 20, 21, 22, 23, 22, 21, 19, 17] \]

- **Range**: The range is \( 23 - 15 = 8 \), showing the temperature difference between the coldest and warmest months.
- **Variance**: The mean temperature is about 19.42, and the population variance (the average squared deviation from that mean) is about 5.91 squared degrees.
- **Standard Deviation**: The population standard deviation is \( \sqrt{5.91} \approx 2.43 \) degrees, so a typical month lies roughly 2.4 degrees from the mean. This summarizes how consistent or variable the temperature is over the year.
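
The three measures for the temperature data can be computed directly. This sketch uses the population formulas from the standard `statistics` module (dividing by \( n \)); using the sample formulas instead would give slightly larger values.

```python
import statistics

temps = [15, 16, 18, 19, 20, 21, 22, 23, 22, 21, 19, 17]

temp_range = max(temps) - min(temps)
temp_variance = statistics.pvariance(temps)  # population variance (divide by n)
temp_std = statistics.pstdev(temps)          # square root of the variance

print(f"range={temp_range}, variance={temp_variance:.2f}, std={temp_std:.2f}")
# range=8, variance=5.91, std=2.43
```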

In summary, while measures of central tendency tell us where the center of a dataset lies, measures of dispersion provide crucial information about how spread out or concentrated the data points are.

## Q6. What is a Venn diagram?

A Venn diagram is a graphical representation used to illustrate the relationships between different sets or groups. It consists of circles or other shapes that overlap to show the logical relationships and commonalities between the sets.

### Key Features of a Venn Diagram:
1. **Circles (or other shapes)**: Each circle represents a set. The contents of the set are typically written inside the circle.
2. **Overlapping Areas**: The overlapping areas between circles represent the elements that are common to the sets. These intersections show where sets share common elements.
3. **Non-overlapping Areas**: These represent elements that are unique to a particular set and not shared with others.

### Uses of Venn Diagrams:
- **Set Theory**: To visually represent the logical relationships between different sets, such as unions, intersections, and complements.
- **Problem Solving**: To organize information and solve problems that involve comparison and contrast.
- **Probability and Statistics**: To depict events and their relationships, such as independent, mutually exclusive, or overlapping events.

### Example:
Consider two sets: Set A represents people who like apples, and Set B represents people who like bananas. In a Venn diagram:
- Circle A includes people who like apples.
- Circle B includes people who like bananas.
- The overlapping area between A and B represents people who like both apples and bananas.
- Areas outside the circles represent people who like neither apples nor bananas.

This visual tool helps in understanding the relationships between different groups and is commonly used in education, research, and data analysis.
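
The regions of the apples-and-bananas diagram map directly onto set operations. A small sketch with Python's built-in `set` type; the respondent names are hypothetical.

```python
# Hypothetical survey respondents
everyone = {"Ana", "Ben", "Cara", "Dev", "Eli"}
likes_apples = {"Ana", "Ben", "Cara"}   # Set A
likes_bananas = {"Cara", "Dev"}         # Set B

both = likes_apples & likes_bananas                   # overlap of the two circles
only_apples = likes_apples - likes_bananas            # part of A outside B
only_bananas = likes_bananas - likes_apples           # part of B outside A
neither = everyone - (likes_apples | likes_bananas)   # outside both circles

print(sorted(both), sorted(only_apples), sorted(only_bananas), sorted(neither))
# ['Cara'] ['Ana', 'Ben'] ['Dev'] ['Eli']
```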

## Q7. For the two given sets A = {2, 3, 4, 5, 6, 7} and B = {0, 2, 6, 8, 10}, find: (i) A ∩ B  (ii) A ∪ B

Let's solve for the two given sets \( A \) and \( B \):

- \( A = \{2, 3, 4, 5, 6, 7\} \)
- \( B = \{0, 2, 6, 8, 10\} \)

### (i) **Intersection of A and B (\( A \cap B \))**
The intersection of two sets includes only the elements that are present in both sets.

**Common elements between A and B**: 
- Both sets include 2 and 6.

So, \( A \cap B = \{2, 6\} \).

### (ii) **Union of A and B (\( A \cup B \))**
The union of two sets includes all the elements from both sets, with duplicates removed.

**All elements from A and B**:
- From A: 2, 3, 4, 5, 6, 7
- From B: 0, 2, 6, 8, 10

Combining all elements and removing duplicates gives:
- \( A \cup B = \{0, 2, 3, 4, 5, 6, 7, 8, 10\} \).
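
Both results can be verified with Python's built-in set operators, where `&` is intersection and `|` is union:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B   # elements present in both sets
union = A | B          # elements present in either set

print(sorted(intersection))  # [2, 6]
print(sorted(union))         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```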

## Q8. What do you understand about skewness in data?

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. It describes how much, and in which direction, a distribution departs from symmetry about its mean. A symmetric distribution, such as the normal distribution, has zero skewness.

### Types of Skewness:

1. **Positive Skewness (Right Skewness)**:
   - **Description**: In a positively skewed distribution, the tail on the right side (higher values) is longer or fatter than the left side. This means that there are relatively few high-value outliers pulling the mean to the right.
   - **Characteristics**: The mean is usually greater than the median, and the median is greater than the mode.
   - **Example**: Income distribution in many economies, where a small number of people earn significantly more than the majority.

2. **Negative Skewness (Left Skewness)**:
   - **Description**: In a negatively skewed distribution, the tail on the left side (lower values) is longer or fatter than the right side. This indicates that there are relatively few low-value outliers pulling the mean to the left.
   - **Characteristics**: The mean is usually less than the median, and the median is less than the mode.
   - **Example**: Test scores where a few students score exceptionally low, but most students score high.

3. **Zero Skewness (Symmetric Distribution)**:
   - **Description**: A distribution with zero skewness is perfectly symmetrical. The left and right sides of the distribution are mirror images.
   - **Characteristics**: The mean, median, and mode are all equal.
   - **Example**: A perfectly normal distribution (bell curve).
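
The sign of the skewness can be computed from the data. The sketch below implements the Fisher-Pearson coefficient \( g_1 = m_3 / m_2^{3/2} \) by hand, where \( m_2 \) and \( m_3 \) are the second and third central moments; this should match what `scipy.stats.skew` returns with its default settings. The three datasets and the helper name `skewness` are made up for illustration.

```python
import statistics

def skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]         # long tail toward high values
left_skewed = [20, 19, 19, 18, 18, 18, 17, 1]    # mirror image: long tail toward low values
symmetric = [1, 2, 3, 4, 5]

print(f"right-skewed: {skewness(right_skewed):+.3f}")
print(f"left-skewed:  {skewness(left_skewed):+.3f}")
print(f"symmetric:    {skewness(symmetric):+.3f}")
```

A positive result indicates right skew, a negative result indicates left skew, and a value of zero indicates symmetry.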

### Importance of Understanding Skewness:
- **Data Analysis**: Skewness helps in understanding the nature of the data distribution, which is crucial for choosing the appropriate statistical methods for analysis.
- **Descriptive Statistics**: It provides insight into the likelihood of extreme values, which can impact measures like the mean and standard deviation.
- **Decision Making**: Knowing the skewness of data can inform decisions in various fields, such as finance, where skewed returns may affect investment strategies.

In summary, skewness is a key aspect of data distribution analysis, offering valuable information about the asymmetry and potential outliers in the dataset.

## Q9. If data is right-skewed, where does the median lie relative to the mean?

In a right-skewed distribution (also known as positively skewed), the distribution's tail is longer on the right side. This skewness indicates that there are a few exceptionally high values (outliers) that pull the mean to the right.

In this scenario:
- The **mean** is affected by the extreme high values and is pulled towards the right (higher values).
- The **median** is the middle value when all observations are ordered. It is less affected by outliers and skew than the mean.

### Position of the Median Relative to the Mean:
In a right-skewed distribution, the **median** is typically less than the **mean**. This is because the mean, being influenced by the high outliers, is larger, while the median, as the middle value, remains closer to the center of the data distribution. 

Thus, in right-skewed data:
\[ \text{Median} < \text{Mean} \]
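
This relationship can be seen in simulated data. The sketch below assumes an exponential distribution as the right-skewed example (many small values, a few large ones) and uses a fixed seed so the run is repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Exponential samples are right-skewed: many small values, a few large ones
samples = [random.expovariate(1.0) for _ in range(10_000)]

sample_mean = statistics.fmean(samples)
sample_median = statistics.median(samples)

print(f"mean={sample_mean:.3f}, median={sample_median:.3f}")
print(sample_median < sample_mean)  # True
```

For this distribution the theoretical mean is 1 and the theoretical median is \( \ln 2 \approx 0.693 \), so the sample values should land near those figures.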

## Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are both measures of the relationship between two variables, but they have some key differences:

### Covariance
- **Definition**: Covariance measures the degree to which two variables change together. Specifically, it indicates whether increases in one variable correspond to increases (or decreases) in another variable.
- **Formula**: For two variables \( X \) and \( Y \), the covariance is given by:
  \[
  \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
  \]
  where \( X_i \) and \( Y_i \) are individual data points, \( \bar{X} \) and \( \bar{Y} \) are the means of \( X \) and \( Y \), respectively, and \( n \) is the number of data points.
- **Scale**: The covariance value is not standardized and depends on the units of the variables. Thus, it is difficult to interpret on its own.
- **Range**: Covariance can range from \(-\infty\) to \(+\infty\). A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that one variable tends to increase when the other decreases.

### Correlation
- **Definition**: Correlation measures the strength and direction of the linear relationship between two variables. It standardizes the measure of covariance by dividing by the product of the standard deviations of the variables.
- **Formula**: The correlation coefficient, typically denoted as \( \rho \) (for population) or \( r \) (for sample), is given by:
  \[
  \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
  \]
  where \( \sigma_X \) and \( \sigma_Y \) are the standard deviations of \( X \) and \( Y \), respectively.
- **Scale**: Correlation is standardized and is dimensionless, which makes it easier to interpret.
- **Range**: Correlation values range from \(-1\) to \(+1\). A correlation of \(+1\) indicates a perfect positive linear relationship, \(-1\) indicates a perfect negative linear relationship, and \(0\) indicates no linear relationship.

### Use in Statistical Analysis
- **Covariance**: Useful in understanding the direction of the relationship between variables. It’s used in portfolio theory in finance to assess how different investments move together.
- **Correlation**: Provides a more interpretable measure of the strength and direction of a relationship. It is widely used in regression analysis, hypothesis testing, and data exploration to understand relationships between variables and to make predictions.

In summary, while covariance gives a raw measure of how two variables change together, correlation normalizes this measure to allow for easier interpretation and comparison.
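
Both formulas can be applied with NumPy. This sketch uses hypothetical paired data (hours studied and exam scores) and checks that dividing the covariance by the product of the sample standard deviations reproduces `np.corrcoef`.

```python
import numpy as np

# Hypothetical paired observations: hours studied and exam score
hours = np.array([1, 2, 3, 4, 5])
scores = np.array([52, 55, 61, 70, 72])

# np.cov returns the 2x2 covariance matrix and divides by n - 1 by default
cov_xy = np.cov(hours, scores)[0, 1]

# Correlation = covariance / (product of the sample standard deviations)
corr_manual = cov_xy / (np.std(hours, ddof=1) * np.std(scores, ddof=1))
corr_numpy = np.corrcoef(hours, scores)[0, 1]

print(f"covariance={cov_xy:.2f}, correlation={corr_manual:.4f}")
```

The covariance (13.75) is in units of hours times points, while the correlation (about 0.98) is unitless and shows a strong positive linear relationship.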

## Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The sample mean is a measure of the central tendency of a dataset. It is calculated by summing all the data points and then dividing by the number of data points.

### Formula
For a sample dataset \( X \) with \( n \) data points, the sample mean \( \bar{X} \) is given by:
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
where \( X_i \) represents each data point in the sample.

### Example Calculation in Python

Here's how you can calculate the sample mean for a dataset using Python:

In [16]:
import numpy as np

# Sample dataset
data = [10, 15, 23, 7, 9, 12]

# Calculate the sample mean
mean = np.mean(data)

print(f"The sample mean is: {mean}")

The sample mean is: 12.666666666666666


### Explanation
- `import numpy as np`: Imports the NumPy library, which is commonly used for numerical operations in Python.
- `data`: Defines the dataset as a list of values.
- `np.mean(data)`: Computes the mean of the dataset.
- `print(...)`: Outputs the result.

When you run this code, it calculates and prints the sample mean of the dataset. For the dataset `[10, 15, 23, 7, 9, 12]`, the sum is 76, so the sample mean is \( 76 / 6 \approx 12.67 \).

## Q12. For normally distributed data, what is the relationship between its measures of central tendency?

In a normal distribution, the measures of central tendency (mean, median, and mode) are all equal. This follows from the distribution being symmetric with a single peak; other symmetric, single-peaked distributions share the property. Here’s a breakdown:

1. **Mean**: The arithmetic average of all data points. In a normal distribution, the mean is the center of the distribution.
2. **Median**: The value that divides the dataset into two equal halves. For a normal distribution, the median is also at the center of the distribution.
3. **Mode**: The value that appears most frequently in the dataset. In a normal distribution, the mode is the same as the mean and median, located at the peak of the bell curve.

### Relationship in a Normal Distribution
- **Mean = Median = Mode**: In a perfectly normal distribution, these three measures are identical. This equality is due to the symmetric nature of the normal distribution, where the highest frequency of data points occurs at the center, and the distribution is evenly spread on either side.

The equality follows from the distribution's symmetry about its central point. In a finite sample drawn from a normal distribution, the three values are close but rarely identical.
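
A quick simulation illustrates this. The sketch below assumes a normal distribution with mean 170 and standard deviation 10 (arbitrary values), and estimates the mode as the center of the fullest 5-unit-wide bin, since continuous samples almost never repeat exactly.

```python
import random
import statistics
from collections import Counter

random.seed(42)  # fixed seed so the run is repeatable

# Simulated values from a normal distribution (assumed mean 170, std 10)
samples = [random.gauss(170, 10) for _ in range(100_000)]

sample_mean = statistics.fmean(samples)
sample_median = statistics.median(samples)

# Group samples into 5-unit-wide bins labeled by their centers (..., 165, 170, 175, ...)
bin_counts = Counter(5 * round(x / 5) for x in samples)
mode_estimate = bin_counts.most_common(1)[0][0]

print(f"mean={sample_mean:.2f}, median={sample_median:.2f}, mode bin={mode_estimate}")
```

All three estimates land close to 170, the center of the distribution.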

## Q13. How is covariance different from correlation?

Covariance and correlation both measure the relationship between two variables, but they differ in their scale and interpretation:

### Covariance
- **Definition**: Covariance measures how two variables change together. It indicates whether increases in one variable tend to be associated with increases (or decreases) in another variable.
- **Formula**: For two variables \( X \) and \( Y \), the covariance is calculated as:
  \[
  \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
  \]
  where \( X_i \) and \( Y_i \) are individual data points, \( \bar{X} \) and \( \bar{Y} \) are the means of \( X \) and \( Y \), respectively, and \( n \) is the number of data points.
- **Scale**: The covariance value is not standardized and depends on the units of the variables. This makes it difficult to interpret the magnitude of covariance on its own.
- **Range**: Covariance can range from \(-\infty\) to \(+\infty\). A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that one variable tends to increase when the other decreases.

### Correlation
- **Definition**: Correlation measures both the strength and direction of the linear relationship between two variables. It standardizes the covariance by dividing it by the product of the standard deviations of the variables.
- **Formula**: The correlation coefficient, typically denoted as \( \rho \) (for population) or \( r \) (for sample), is given by:
  \[
  \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
  \]
  where \( \sigma_X \) and \( \sigma_Y \) are the standard deviations of \( X \) and \( Y \), respectively.
- **Scale**: Correlation is a standardized measure and is dimensionless, making it easier to interpret. It normalizes the covariance.
- **Range**: Correlation values range from \(-1\) to \(+1\). A correlation of \(+1\) indicates a perfect positive linear relationship, \(-1\) indicates a perfect negative linear relationship, and \(0\) indicates no linear relationship.

### Key Differences
- **Interpretation**: Covariance is useful for understanding the direction of the relationship but is hard to interpret due to its units. Correlation provides a standardized measure that makes it easier to interpret the strength and direction of the relationship.
- **Range and Units**: Covariance has no fixed range and its value depends on the units of the variables. Correlation has a fixed range of \(-1\) to \(+1\) and is unitless, making it more interpretable.

In summary, while covariance gives a raw measure of how two variables move together, correlation standardizes this measure, allowing for easier interpretation and comparison.
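
The difference in scale shows up as soon as the units change. In this sketch, with hypothetical height and weight data, converting heights from centimeters to meters shrinks the covariance by a factor of 100 while the correlation stays the same.

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg) for five people
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([52.0, 58.0, 69.0, 77.0, 88.0])

height_m = height_cm / 100  # same data expressed in meters

cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]

print(f"covariance:  {cov_cm:.2f} (cm) vs {cov_m:.4f} (m)")
print(f"correlation: {corr_cm:.4f} (cm) vs {corr_m:.4f} (m)")
```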

## Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can significantly affect both measures of central tendency and dispersion. Here's how:

### Measures of Central Tendency
1. **Mean**: Outliers can have a substantial impact on the mean because the mean is calculated by summing all data points and dividing by the number of data points. A single extreme value can skew the mean significantly.
   - **Example**: Consider the dataset `[1, 2, 3, 4, 100]`. The mean is:
     \[
     \text{Mean} = \frac{1 + 2 + 3 + 4 + 100}{5} = \frac{110}{5} = 22
     \]
     Here, the value `100` is an outlier and has raised the mean significantly compared to the central values of the dataset.

2. **Median**: The median is less affected by outliers because it is the middle value in a sorted list. If there is an even number of data points, it is the average of the two middle values. Outliers generally do not affect the median as much as they do the mean.
   - **Example**: For the same dataset `[1, 2, 3, 4, 100]`, when sorted, the median is `3`, which is not significantly affected by the `100`.

3. **Mode**: The mode is the value that appears most frequently. Outliers usually have little effect on the mode unless the outlier is repeated frequently enough to be the most common value.

### Measures of Dispersion
1. **Range**: The range, which is the difference between the maximum and minimum values, is highly affected by outliers. A single outlier can dramatically increase the range.
   - **Example**: In the dataset `[1, 2, 3, 4, 100]`, the range is:
     \[
     \text{Range} = 100 - 1 = 99
     \]
     The outlier `100` has significantly increased the range compared to what it would be without it.

2. **Variance and Standard Deviation**: Both variance and standard deviation measure the spread of data points from the mean. Outliers can increase these measures substantially because they contribute disproportionately to the squared differences from the mean.
   - **Example**: For the dataset `[1, 2, 3, 4, 100]`, the population variance (dividing by \( n = 5 \)) is:
     \[
     \text{Variance} = \frac{(1-22)^2 + (2-22)^2 + (3-22)^2 + (4-22)^2 + (100-22)^2}{5} = \frac{7610}{5} = 1522
     \]
     so the standard deviation is \( \sqrt{1522} \approx 39.0 \). Without the outlier, the dataset `[1, 2, 3, 4]` has a variance of 1.25 and a standard deviation of about 1.12.

3. **Interquartile Range (IQR)**: The IQR is less affected by outliers compared to range, variance, and standard deviation. It measures the spread of the middle 50% of the data.
   - **Example**: For the values `1`, `2`, `3`, `4`, and `100`, the IQR is the difference between the 75th and 25th percentiles. With linear interpolation between data points (NumPy's default), these are 4 and 2, giving an IQR of 2, the same as for `[1, 2, 3, 4, 5]`. The outlier barely moves the quartiles, so the IQR is far more robust than the range.

### Summary
- **Mean**: Highly affected by outliers, leading to potential misrepresentation of the central location.
- **Median**: Less affected, providing a better measure of central tendency in the presence of outliers.
- **Mode**: Generally unaffected unless outliers occur frequently.
- **Range, Variance, and Standard Deviation**: All can be strongly inflated by outliers, which can give a misleading picture of the spread of the data.
- **Interquartile Range**: Largely unaffected, since it depends only on the middle 50% of the data.
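
The effect can be confirmed by computing each measure with and without the outlier. A minimal sketch using the standard `statistics` module and population formulas; the helper name `summarize` is chosen here for illustration.

```python
import statistics

def summarize(values):
    """Return mean, median, range, and population standard deviation."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "std": statistics.pstdev(values),
    }

without_outlier = summarize([1, 2, 3, 4])
with_outlier = summarize([1, 2, 3, 4, 100])

for name in ("mean", "median", "range", "std"):
    print(f"{name:>6}: {without_outlier[name]:8.2f} -> {with_outlier[name]:8.2f}")
```

Adding the single value 100 moves the mean from 2.5 to 22 and the standard deviation from about 1.12 to about 39.0, while the median only shifts from 2.5 to 3.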