## Q1. What are the three measures of central tendency?


The three measures of central tendency are:

* Mean: It is the average of a set of values, calculated by summing all the values and dividing by the number of data points.

* Median: It is the middle value in a dataset when the values are arranged in ascending or descending order. If the dataset has an even number of values, the median is the average of the two middle values.

* Mode: It is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more (multimodal). It is possible for a dataset to have no mode if all values occur with equal frequency.
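As a minimal sketch, the three definitions above can be checked with Python's standard `statistics` module. The `values` list is made-up illustration data, not part of the exercise.

```python
import statistics

# Hypothetical illustration data (not from the exercise)
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # even count: average of the two middle values (3 and 5)
mode_value = statistics.mode(values)      # the value that occurs most often

print(mean_value, median_value, mode_value)
```

Here the mean is 5, the median is 4.0, and the mode is 3, so the three measures need not coincide.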

## Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?


The mean, median, and mode are three different measures of central tendency used to describe the "center" or typical value of a dataset. They represent different ways of summarizing and understanding the distribution of data.

* Mean: The mean is often used when dealing with continuous numerical data and provides a sense of the average value in the dataset. It is useful for symmetrical distributions without significant outliers.

* Median: The median is used when the dataset contains outliers or is skewed, as it is less influenced by extreme values. It gives a representation of the "typical" value that is less affected by extreme observations.

* Mode: The mode is employed when dealing with categorical data or when identifying the most frequent value in a dataset. It is also useful for identifying the peaks in a distribution of continuous data.
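The mode is the only one of the three measures that works for categorical data, since categories cannot be summed or ordered. A small sketch, assuming a made-up list of survey answers named `colors`:

```python
from collections import Counter

# Hypothetical categorical data: favorite colors from a small survey
colors = ["red", "blue", "green", "blue", "red", "blue"]

# Count how often each category occurs and pick the most frequent one
counts = Counter(colors)
most_common_color, frequency = counts.most_common(1)[0]

print(most_common_color, frequency)
```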

## Q3. Compute the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]
import numpy as np

In [3]:
np.mean(height_data)

177.01875

In [4]:
np.median(height_data)

177.0

In [8]:
from scipy import stats
stats.mode(height_data)

ModeResult(mode=array([177.]), count=array([3]))
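Note that 177 and 178 each occur three times in this data, so the dataset is bimodal; `stats.mode` reports only one of the tied values. As a sketch of one way to list every mode, the standard library's `statistics.multimode` can be used instead:

```python
import statistics

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
               178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value tied for the highest frequency,
# in the order each one first appears in the data
modes = statistics.multimode(height_data)

print(modes)
```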

## Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [10]:
import numpy as np

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]
standard_deviation = np.std(height_data)
standard_deviation

1.7885814036548633
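By default `np.std` divides by n, which gives the population standard deviation. If the heights are treated as a sample from a larger population, the usual estimate divides by n - 1 instead. The sketch below compares the two using the standard `statistics` module; which version is appropriate depends on how the data was collected, which the exercise does not state.

```python
import statistics

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
               178.9, 176.2, 177, 172.5, 178, 176.5]

# Population standard deviation: divides by n (matches np.std(height_data))
population_std = statistics.pstdev(height_data)

# Sample standard deviation: divides by n - 1 (matches np.std(height_data, ddof=1))
sample_std = statistics.stdev(height_data)

print(population_std, sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.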

## Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.


Measures of dispersion, such as range, variance, and standard deviation, are used to describe the spread or variability of a dataset.

* Range: The range is the simplest measure of dispersion, calculated as the difference between the maximum and minimum values in the dataset. It provides an indication of the total spread of the data.

* Variance: The variance is the average of the squared differences between each data point and the mean of the dataset, so it is expressed in squared units. Dividing the sum of squared differences by n gives the population variance; dividing by n - 1 gives the sample variance. A higher variance indicates a more spread-out distribution, while a lower variance suggests that the data points are closer to the mean.

* Standard Deviation: The standard deviation is the square root of the variance. It measures the dispersion around the mean in the original units of the data, making it more interpretable than variance. Like variance, a higher standard deviation implies more variability, while a lower standard deviation means the data points are closer to the mean.

In [13]:
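# Example: range, variance, and standard deviation of a small dataset
import numpy as np

data = [10, 12, 14, 16, 18]  # Example dataset

# Range: difference between the largest and smallest values
data_range = np.max(data) - np.min(data)

# Population variance (np.var divides by n by default, i.e. ddof=0)
data_variance = np.var(data)

# Population standard deviation: square root of the variance above
data_std_deviation = np.std(data)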

In [15]:
data_range

8

In [16]:
data_variance

8.0

In [17]:
data_std_deviation

2.8284271247461903
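To make the arithmetic behind these results explicit, the same quantities can be worked out step by step from the definitions. This is a sketch in plain Python; the variable names are chosen for illustration only.

```python
import math

data = [10, 12, 14, 16, 18]

n = len(data)
mean = sum(data) / n                                  # 70 / 5 = 14.0
squared_deviations = [(x - mean) ** 2 for x in data]  # [16.0, 4.0, 0.0, 4.0, 16.0]

population_variance = sum(squared_deviations) / n     # 40 / 5 = 8.0
sample_variance = sum(squared_deviations) / (n - 1)   # 40 / 4 = 10.0
population_std = math.sqrt(population_variance)       # square root of 8.0

print(population_variance, sample_variance, population_std)
```

The population values agree with the `np.var` and `np.std` results; the sample variance (10.0) is what `np.var(data, ddof=1)` would return.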

## Q6. What is a Venn diagram?

A Venn diagram is a graphical representation used to show the relationships between different sets or groups of items. It consists of overlapping circles, where each circle represents a set, and the overlapping regions represent the intersection of those sets. Venn diagrams are commonly used in mathematics, statistics, logic, and various other fields to visually illustrate set operations and the relationships between different groups or categories.

Venn diagrams provide a simple and intuitive way to represent set relationships and are widely used in educational contexts, problem-solving, and data analysis to visualize data with overlapping categories or attributes.

## Q7. For the two given sets A = {2, 3, 4, 5, 6, 7} and B = {0, 2, 6, 8, 10}, find:
## (i) A ∩ B
## (ii) A ∪ B

Given sets:
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

(i) A ∩ B (Intersection):
The intersection contains the elements that are common to both sets A and B.
A ∩ B = {2, 6}

(ii) A ∪ B (Union):
The union contains all the unique elements from both sets A and B.
A ∪ B = {0, 2, 3, 4, 5, 6, 7, 8, 10}

So, the intersection of sets A and B is {2, 6}, and the union of sets A and B is {0, 2, 3, 4, 5, 6, 7, 8, 10}.
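The same result can be verified with Python's built-in `set` type, as a quick check of the answer above:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements in both sets; same as A.intersection(B)
union = A | B         # elements in either set; same as A.union(B)

print(sorted(intersection))  # [2, 6]
print(sorted(union))         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```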


## Q8. What do you understand about skewness in data?


Skewness in data refers to the asymmetry or lack of symmetry in the distribution of values. It is a statistical measure that helps us understand the shape of a dataset's distribution and how the data points are spread around the central value.

A distribution can be positively skewed, negatively skewed, or symmetric:

* Positive skewness (right skew): Most of the data points are concentrated on the left side of the distribution, and the tail extends to the right. The mean is typically greater than the median, because the long right tail pulls the mean in that direction.

* Negative skewness (left skew): Most of the data points are clustered on the right side of the distribution, and the tail extends to the left. The mean is typically less than the median, because the long left tail pulls the mean in that direction.

* Symmetric distribution (zero skewness): The data is evenly distributed on both sides of the central value, and the left and right tails are of similar length. The mean and median are equal.
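Skewness can also be expressed as a number. One common definition is the third central moment divided by the second central moment raised to the power 1.5; its sign matches the direction of the tail. The helper `skewness` and the three small datasets below are hypothetical, written only to illustrate the three cases.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up datasets chosen to illustrate each case
right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]
left_skewed = [-x for x in right_skewed]  # mirror image of the data above
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive: tail on the right
print(skewness(left_skewed))   # negative: tail on the left
print(skewness(symmetric))     # zero: no skew
```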

## Q9. If a dataset is right-skewed, where does the median lie relative to the mean?

If a dataset is right-skewed, the median typically lies to the left of the mean. In a right-skewed distribution, the tail extends towards the right side and pulls the mean in that direction, so the mean is greater than the median.

The median stays closer to the bulk of the data points because it depends only on the ordering of the values, not on how extreme the values in the tail are.
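A small numerical illustration, assuming a made-up list of incomes in which one value is far larger than the rest:

```python
import statistics

# Hypothetical right-skewed data: incomes (in thousands) with one very high earner
incomes = [20, 22, 25, 27, 30, 32, 35, 150]

mean_income = statistics.mean(incomes)      # 341 / 8 = 42.625
median_income = statistics.median(incomes)  # (27 + 30) / 2 = 28.5

print(mean_income, median_income)
print(median_income < mean_income)  # True: the median lies to the left of the mean
```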

## Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are both measures used to quantify the relationship between two variables in a dataset. While they are related concepts, they have some key differences in their interpretation and application:

* Covariance: Covariance measures how two variables vary together. Its sign indicates the direction of the linear relationship: a positive covariance means that when one variable increases, the other tends to increase as well, and a negative covariance means that when one variable increases, the other tends to decrease. Its magnitude depends on the units of the two variables, so it is hard to interpret on its own.

* Correlation: Correlation is a standardized version of covariance, obtained by dividing the covariance by the product of the two standard deviations, so it always falls between -1 and 1. It gives an interpretable measure of both the strength and the direction of the linear relationship. A correlation of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

* Usage in statistical analysis:

1. Both covariance and correlation are fundamental measures used in statistical analysis and data exploration to understand the relationship between two variables.
2. Covariance is useful for identifying the direction of the relationship between variables. For example, it is used in finance to understand how the returns of two assets move together, and in manufacturing to analyze how two process variables vary together.
3. Correlation is widely used in finance, economics, the social sciences, and the natural sciences. It is especially valuable when comparing relationships across different datasets, because it is a standardized, unit-free measure of the strength and direction of the relationship.
4. In regression analysis, correlation is used to check for multicollinearity among predictor variables. High correlation between predictors may indicate redundant information and can reduce the stability and interpretability of the regression model.
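As a minimal sketch of both measures, the helpers below compute the population covariance and the Pearson correlation from their definitions. The function names and the three small datasets are hypothetical, chosen only to show one positive and one negative relationship.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean of the products of deviations (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # covariance of a variable with itself is its variance
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Made-up data: hours studied, exam scores, and number of mistakes
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]  # tends to rise with hours
mistakes = [9, 7, 6, 4, 2]     # tends to fall with hours

print(covariance(hours, scores), correlation(hours, scores))      # both positive
print(covariance(hours, mistakes), correlation(hours, mistakes))  # both negative
```

The covariances here are 12.0 and -3.4, which cannot be compared directly because scores and mistakes are on different scales; the correlations can.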

## Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.


The formula for calculating the sample mean (often denoted as "x̄") is the sum of all the data points divided by the number of data points in the sample. Mathematically, it can be expressed as:

Sample Mean (x̄) = (Sum of all data points) / (Number of data points)

Let's go through an example to calculate the sample mean for a dataset:

Example:
Consider the following dataset of exam scores: [85, 90, 78, 92, 88]

Sum of data points = 85 + 90 + 78 + 92 + 88 = 433
Number of data points = 5

Apply the formula to find the sample mean.
Sample Mean (x̄) = 433 / 5 = 86.6

So, the sample mean for the given dataset is 86.6. This means that the average exam score for this sample of students is 86.6.
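The same calculation in Python, following the formula step by step:

```python
scores = [85, 90, 78, 92, 88]

total = sum(scores)      # 433
n = len(scores)          # 5
sample_mean = total / n  # 433 / 5

print(sample_mean)
```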

## Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, the three measures of central tendency (mean, median, and mode) are all equal. In other words, they have the same value and are located at the center of the normal distribution.

In a normal distribution:

* Mean: The arithmetic mean (average) of the data points is the same as the median and the mode. This is because the normal distribution is symmetric around its center, and the mean is at the exact center of the distribution.

* Median: The median, which is the middle value in the dataset, is also the same as the mean and mode. Because the normal distribution is symmetric, the middle value coincides with the center of the distribution.

* Mode: The mode, which is the value that occurs most frequently, is also the same as the mean and median in a normal distribution. Again, due to the symmetry of the normal distribution, the peak of the distribution (mode) aligns with the center.
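For a finite sample drawn from a normal distribution, the three measures are only approximately equal, and they get closer as the sample grows. The sketch below simulates this with the standard library. The parameters (mean 50, standard deviation 10), the sample size, and the seed are arbitrary assumptions, and the mode is approximated by rounding values into integer bins because continuous values rarely repeat exactly.

```python
import random
import statistics
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed parameters for illustration: mean 50, standard deviation 10
sample = [rng.gauss(50, 10) for _ in range(100_000)]

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

# Approximate the mode: round each value to the nearest integer
# and take the most common bin
binned_mode = Counter(round(x) for x in sample).most_common(1)[0][0]

print(sample_mean, sample_median, binned_mode)
```

All three printed values should land close to 50, the center of the simulated distribution.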


## Q13. How is covariance different from correlation?

Covariance and correlation are both measures used to quantify the relationship between two variables, but they are different in their interpretation, scale, and properties:
1. Definition
   * Covariance measures how two variables vary together. Its sign indicates the direction of the linear relationship: whether the variables tend to increase or decrease together, or whether one tends to increase while the other decreases.
   * Correlation is a standardized version of covariance: the covariance divided by the product of the two standard deviations. It measures both the strength and the direction of the linear relationship.

2. Scale
   * Covariance is not scaled and can take any positive or negative value. Its magnitude depends on the units and spread of the data.
   * Correlation is unit-free and always lies between -1 and 1.

3. Interpretation
   * A positive covariance indicates that when one variable increases, the other tends to increase as well. A negative covariance indicates that when one variable increases, the other tends to decrease. A covariance of zero indicates no linear relationship.
   * A correlation of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. Values in between indicate how strong the linear relationship is.

4. Usage
   * Covariance is useful for identifying the direction of the relationship between variables. It has applications in finance, engineering, and other fields where understanding the joint behavior of variables is important.
   * Correlation is widely used in finance, economics, the social sciences, and the natural sciences. It is especially valuable when comparing relationships across different datasets, because it is a standardized measure of the strength and direction of the relationship.
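The difference in scale is easiest to see by changing units. In the sketch below, the same heights are expressed in metres and in centimetres: the covariance with weight changes by a factor of 100, while the correlation stays the same. The helper functions and the data are hypothetical illustrations.

```python
import math

def covariance(xs, ys):
    """Population covariance (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up data: heights in metres and weights in kilograms
heights_m = [1.60, 1.65, 1.70, 1.75, 1.80]
weights_kg = [55, 60, 66, 70, 79]

heights_cm = [h * 100 for h in heights_m]  # same heights, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)    # covariance scales with the unit
print(corr_m, corr_cm)  # correlation does not
```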

## Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can have a significant impact on measures of central tendency (e.g., mean, median, mode) and measures of dispersion (e.g., range, variance, standard deviation). Outliers are extreme values that are significantly different from the rest of the data points in a dataset. Their presence can distort the characteristics of the data and affect the interpretation of statistical measures.

* Measures of Central Tendency:

1. Mean: The mean is sensitive to outliers because every value contributes to the sum. If there are extreme values in the dataset, the mean is pulled towards them and may no longer represent the typical value of the majority of the data.

2. Median: The median is much less affected by outliers. Because it depends only on the middle of the ordered data, extreme values at the tails of the distribution have little influence on it. Therefore, the median is often preferred over the mean when dealing with skewed or outlier-prone data.

3. Mode: The mode represents the most frequent value(s) in the dataset. Outliers usually do not affect the mode unless they become the most frequently occurring value, which is rare.

* Measures of Dispersion:

1. Range: The range, which is the difference between the maximum and minimum values in the dataset, can be significantly affected by outliers. A single outlier can make the range overstate the spread of the rest of the data.

2. Variance and Standard Deviation: Both variance and standard deviation are based on the squared differences between each data point and the mean. Outliers, with their large deviations from the mean, can dramatically increase the variance and standard deviation, suggesting more dispersion than is typical of the bulk of the data.



In [18]:
# Example: test scores of ten students, including one extreme value (200)
import numpy as np

students = [85, 92, 78, 86, 95, 88, 82, 89, 91, 200]


In [21]:
np.mean(students)

98.6

In [22]:
np.median(students)

88.5

In [23]:
from scipy import stats
stats.mode(students)

ModeResult(mode=array([78]), count=array([1]))

Every score occurs exactly once (count = 1), so this dataset has no meaningful mode; `stats.mode` simply returns the smallest of the tied values, 78.

In [25]:
# range
np.max(students) - np.min(students)

122

In [26]:
# variance
np.var(students)

1164.44

In [27]:
# standard deviation
np.std(students)

34.12389192340171
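To see how much of this is due to the single score of 200, the same statistics can be recomputed with that value removed. This is only a sketch for comparison, using the standard `statistics` module; whether an outlier should actually be dropped depends on the context.

```python
import statistics

students = [85, 92, 78, 86, 95, 88, 82, 89, 91, 200]
without_outlier = [score for score in students if score != 200]  # drop the extreme score

for label, data in [("with outlier", students), ("without outlier", without_outlier)]:
    print(label,
          "mean:", round(statistics.fmean(data), 2),
          "median:", statistics.median(data),
          "std:", round(statistics.pstdev(data), 2))
```

Removing the outlier lowers the mean from 98.6 to about 87.33 and the population standard deviation from about 34.12 to about 4.94, while the median barely moves (88.5 to 88).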