Q1. What are the three measures of central tendency?

### Mean:
The mean, also known as the average, is calculated by adding up all the values in a dataset and then dividing by the number of values. It represents the arithmetic "average" of the data.

### Median:
The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.

### Mode:
The mode is the value or values that occur most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (multimodal), or no mode if all values are equally frequent.

Q2. What is the difference between the mean, median, and mode? How are they used to measure the 
central tendency of a dataset?

## Measures of Central Tendency:

##### Mean (Average):
* The mean is the sum of all values in a dataset divided by the number of values.
* Mean = (Sum of all values) / (Number of values)
* Use: It represents the central or typical value in a dataset. The mean is sensitive to extreme values (outliers).

##### Median:
* The median is the middle value when the data is sorted. If there is an even number of values, it's the average of the two middle values.
* Median = Middle value (or average of two middle values)
* Use: It represents the middle value and is less affected by extreme values than the mean. It's often used when data is skewed.

##### Mode:
* The mode is the most frequently occurring value(s) in a dataset.
* Mode = Most frequently occurring value(s)
* Use: It represents the most common value(s) and is used for categorical or discrete data. A dataset can have one mode (unimodal), more than one mode (multimodal), or no mode.

## How They Measure Central Tendency:

Mean: The mean measures central tendency by balancing all values equally. It gives equal weight to each data point and is suitable for symmetric distributions with no outliers.

Median: The median measures central tendency by identifying the middle value. It's particularly useful when the data is skewed or contains outliers, because it depends only on the order of the values and is therefore largely unaffected by extreme values.

Mode: The mode measures central tendency by identifying the most frequently occurring value(s). It's used for data with distinct categories and helps identify the most common observation(s).
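As a small illustration of the mode on categorical data, where a mean or median cannot be computed at all, here is a minimal sketch. The survey answers are hypothetical and only serve as an example.

```python
import statistics as st

# Hypothetical survey answers (illustrative data, not part of the exercise)
favourite_colours = ["red", "blue", "red", "green", "blue", "red"]

# The mode works on labels because it only counts occurrences
most_common = st.mode(favourite_colours)
print("Most common answer:", most_common)
```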

Q3. Measure the three measures of central tendency for the given height data:

 [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

import numpy as np
import statistics as st # Mode is not available in numpy

mean = np.mean(data)
median = np.median(data)
mode = st.mode(data)

print("Mean of this Data is: ", mean)
print("Median of this Data is: ", median)
print("Mode of this Data is: ", mode)

Mean of this Data is:  177.01875
Median of this Data is:  177.0
Mode of this Data is:  178
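Note that 177 and 178 each occur three times in this data, so it is actually bimodal. On Python 3.8 or newer, `statistics.mode` returns only the first mode it encounters, which is why the output shows 178. A sketch using `statistics.multimode` (also Python 3.8+) lists every mode:

```python
import statistics as st

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns all values tied for the highest frequency
modes = st.multimode(data)
print("All modes of this data:", modes)
```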


Q4. Find the standard deviation for the given data:

[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [4]:
# Reuses `data` from the cell above

# np.std computes the population standard deviation by default (ddof=0)
std = np.std(data)
print("Standard Deviation of above data is: ", std)

Standard Deviation of above data is:  1.7885814036548633
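The value above is the population standard deviation (dividing by N). If the heights are treated as a sample from a larger population, the sample standard deviation (dividing by N - 1) is usually preferred. The sketch below compares the two using the standard library; with NumPy, the equivalent would be `np.std(data)` versus `np.std(data, ddof=1)`.

```python
import statistics as st

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = st.pstdev(data)  # divides by N, like np.std(data)
sample_std = st.stdev(data)       # divides by N - 1, like np.std(data, ddof=1)

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the dataset grows.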


Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe 
the spread of a dataset? Provide an example.

They provide information about how data points are distributed and how they deviate from the central tendency measures like the mean or median. 
 
## Range:
The range is the simplest measure of dispersion. It quantifies the spread of data by calculating the difference between the maximum and minimum values in the dataset.
* Formula: Range = Maximum Value - Minimum Value
* Use: It provides a basic understanding of the overall spread of data but is sensitive to outliers.
* Example: Consider a dataset of daily temperatures (in degrees Fahrenheit) for a city over seven days: [68, 72, 73, 66, 85, 79, 64]. The range is 85 - 64 = 21 degrees, the gap between the hottest and the coldest day.

## Variance:
Variance quantifies how data points deviate from the mean by calculating the average of the squared differences between each data point and the mean.
* Formula: Variance = Σ((X - μ)^2) / N, where X is each data point, μ is the mean, and N is the number of data points.
* Use: It provides a more precise measure of data spread, but it is sensitive to outliers.
* Example: Using the same temperature dataset, the mean is μ = 507 / 7 ≈ 72.43, so the variance is σ^2 = [(68 - 72.43)^2 + (72 - 72.43)^2 + ... + (64 - 72.43)^2] / 7 ≈ 47.67. The result quantifies how much temperatures deviate from the mean, expressed in squared degrees.
    
## Standard Deviation:
The standard deviation is the square root of the variance. It measures the typical distance of data points from the mean.
* Formula: Standard Deviation (σ) = √Variance
* Use: It is commonly used because it is in the same unit as the data. Like the variance, it is sensitive to outliers.
* Example: Continuing with the temperature dataset, the standard deviation would be the square root of the previously calculated variance, providing a single number that describes the typical deviation of temperatures from the mean.
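The three measures for the temperature example can be checked with a short sketch. It uses the population formulas (dividing by N) to match the variance formula given above; the variable names are illustrative.

```python
import statistics as st

temperatures = [68, 72, 73, 66, 85, 79, 64]  # degrees Fahrenheit

value_range = max(temperatures) - min(temperatures)
mean = st.mean(temperatures)
variance = st.pvariance(temperatures)  # population variance, divides by N
std_dev = st.pstdev(temperatures)      # square root of the variance

print("Range:", value_range)
print("Mean:", round(mean, 2))
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 2))
```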

Q6. What is a Venn diagram?

* A Venn diagram is a visual representation used to depict the relationships between different sets or groups. 
* It consists of overlapping circles or ellipses, each representing a set, and the overlapping regions represent the intersections or common elements between the sets. 
* Venn diagrams are commonly used in mathematics, logic, statistics, and various fields.

Q7. For the two given sets A = {2,3,4,5,6,7} & B = {0,2,6,8,10}, find:

(i) 	A ∩ B

(ii)	A ⋃ B

1) A ∩ B (intersection) = {2, 6}
2) A ⋃ B (union) = {0, 2, 3, 4, 5, 6, 7, 8, 10}
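These results can be verified with Python's built-in `set` type, where `&` is intersection and `|` is union. A minimal sketch:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements present in both sets
union = A | B         # elements present in at least one set

print("A intersection B:", sorted(intersection))
print("A union B:", sorted(union))
```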

Q8. What do you understand about skewness in data?

Skewness is a statistical measure that describes the asymmetry of the distribution of values within a dataset. It indicates which side of the distribution has the longer tail, and a distribution falls into one of three types:
* Negatively Skewed (Left Skewed)
* Positively Skewed (Right Skewed)
* Symmetrical Distribution (Zero Skewness)

Q9. If data is right-skewed, what will be the position of the median with respect to the mean?

The median will be less than the mean.
* In a right-skewed distribution, the tail on the right side (the higher values) is longer or fatter, which means that there are some extremely high values that pull the mean in that direction. 
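A quick numerical check of this relationship, as a sketch: the helper `skewness` below is a hypothetical function written for this example (the moment-based population skewness), and the data is the outlier dataset used in Q14 of this notebook.

```python
import statistics as st

def skewness(values):
    """Moment-based population skewness: mean cubed deviation / std**3."""
    mu = st.mean(values)
    sigma = st.pstdev(values)
    return sum((x - mu) ** 3 for x in values) / (len(values) * sigma ** 3)

right_skewed = [20, 25, 28, 30, 35, 40, 45, 50, 900]

skew = skewness(right_skewed)
mean = st.mean(right_skewed)
median = st.median(right_skewed)

print("Skewness:", skew)   # positive for a right-skewed dataset
print("Mean:", mean)
print("Median:", median)   # smaller than the mean here
```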

Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

## Covariance and Correlation:
Covariance and correlation are both measures used to describe the relationship between two variables in statistics, particularly in the context of multivariate data analysis. However, they have distinct characteristics and purposes:

#### Covariance	
* Covariance is a measure to indicate the extent to which two random variables change in tandem.	
* Covariance is the unscaled counterpart of correlation: dividing it by the product of the two standard deviations gives the correlation.	
* Covariance indicates the direction of the linear relationship between variables.	
* Covariance can vary between -∞ and +∞	
* Covariance assumes the units from the product of the units of the two variables.	
* Covariance is zero for independent variables, because they do not systematically move together. The converse does not hold: a covariance of zero only rules out a linear relationship, not every kind of dependence.

#### Correlation
* Correlation is a measure used to represent how strongly two random variables are related to each other.
* Correlation refers to the scaled form of covariance.
* Correlation on the other hand measures both the strength and direction of the linear relationship between two variables.
* Correlation ranges between -1 and +1
* Correlation is dimensionless, i.e., it is a unit-free measure of the relationship between variables.
* Completely independent variables have a correlation of zero. As with covariance, a zero correlation by itself does not prove independence.

## Usage in Statistical Analysis:

Covariance: Covariance is primarily used to understand the direction of the linear relationship between two variables. However, its unstandardized nature makes it less suitable for comparing relationships across different datasets or variables with different units of measurement.

Correlation: Correlation is widely used in statistical analysis because of its standardized and interpretable nature. It helps assess the strength and direction of linear relationships and allows for comparisons across different datasets and variables. Common correlation coefficients include Pearson's correlation coefficient (for linear relationships), Spearman's rank correlation coefficient (for monotonic relationships), and others.
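The effect of units can be seen in a short sketch. The data and the helper functions `covariance` and `correlation` are hypothetical, written for illustration with the sample formulas (dividing by N - 1). Changing study time from hours to minutes multiplies the covariance by 60 but leaves the Pearson correlation unchanged.

```python
import math

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data: study time and test score for five students
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
minutes = [h * 60 for h in hours]  # same information, different unit

cov_hours = covariance(hours, scores)
cov_minutes = covariance(minutes, scores)
corr_hours = correlation(hours, scores)
corr_minutes = correlation(minutes, scores)

print("Covariance (hours):", cov_hours)
print("Covariance (minutes):", cov_minutes)
print("Correlation (hours):", corr_hours)
print("Correlation (minutes):", corr_minutes)
```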

Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset

Sample Mean (X̄) = ΣX / N

In [7]:
Data =  [85, 92, 78, 88, 95, 89, 76, 90, 82, 94]
# ΣX = 85 + 92 + 78 + 88 + 95 + 89 + 76 + 90 + 82 + 94 = 869
# N = 10

# X̄ = ΣX / N = 869 / 10 = 86.9
np.mean(Data)

86.9

Q12. For normally distributed data, what is the relationship between its measures of central tendency?

In a normal distribution (also known as a Gaussian distribution or bell curve), there is a specific relationship between its measures of central tendency, which include the mean, median, and mode. 

Mean (Average): In a normal distribution, the mean is located exactly at the center of the distribution. This means that the mean is equal to the median and the mode.

Median: The median in a normal distribution is also located at the center of the distribution, just like the mean. Therefore, the median is equal to the mean and the mode.

Mode: In a normal distribution, the mode is the same as the mean and the median, and they all coincide at the peak or center of the distribution.

In short, for a normal distribution:

Mean = Median = Mode

This relationship is a fundamental property of the normal distribution and is a result of its symmetric and bell-shaped nature. It means that the central tendency measures are all equal and located at the same point in the distribution, making them useful for describing the typical or central value in normally distributed data.
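The equality holds exactly for the theoretical distribution; in a finite sample the three values are only approximately equal. The following sketch simulates this under stated assumptions: the mean of 50, standard deviation of 10, sample size, and seed are arbitrary choices, and because continuous values rarely repeat, the mode is estimated by grouping values into bins of width 5.

```python
import random
import statistics as st

random.seed(0)  # fixed seed so the run is reproducible

# Simulated normally distributed data (parameters chosen for illustration)
samples = [random.gauss(50, 10) for _ in range(100_000)]

mean = st.mean(samples)
median = st.median(samples)
# Estimate the mode as the centre of the most populated bin of width 5
binned_mode = st.mode(5 * round(x / 5) for x in samples)

print("Mean:", mean)
print("Median:", median)
print("Binned mode:", binned_mode)
```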

Q13. How is covariance different from correlation?

### Definition:

Covariance: Covariance measures the degree to which two variables change together. It indicates the direction of the linear relationship between the variables, whether they move in the same direction (positive covariance) or in opposite directions (negative covariance).
Correlation: Correlation is a standardized measure of the strength and direction of the linear relationship between two variables. It quantifies the degree to which the variables are related and scales the relationship to a range between -1 and 1.

### Range:

Covariance: Covariance can take any real value, which makes it challenging to interpret its magnitude. It depends on the units of the variables.
Correlation: Correlation values range from -1 to 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

### Unit of Measurement:
Covariance: The unit of covariance is the product of the units of the two variables (e.g., cm·kg if one variable is measured in centimetres and the other in kilograms).
Correlation: Correlation is a unitless measure, making it easier to compare relationships across different datasets or variables with different units.

### Standardization:

Covariance: Covariance is not standardized, so its magnitude can vary widely depending on the scale of the variables. This makes it difficult to compare covariances between different datasets.
Correlation: Correlation is standardized, making it suitable for comparing relationships across different datasets. It provides a consistent measure of the strength and direction of the linear relationship.

### Interpretation:

Covariance: A positive covariance indicates a positive relationship, and a negative covariance indicates a negative relationship. However, the magnitude of covariance is not easily interpretable in isolation.
Correlation: Correlation is easier to interpret because it provides a standardized measure. Positive correlation values indicate a positive linear relationship, negative values indicate a negative linear relationship, and the magnitude indicates the strength of the relationship.

In short, while both covariance and correlation describe relationships between variables, correlation is more widely used in statistics and data analysis because it provides a standardized and interpretable measure of the strength and direction of the linear relationship, making it easier to compare and draw conclusions from different datasets. Covariance, on the other hand, is useful for understanding the direction of the relationship but lacks standardization for easy comparison.

Q14. How do outliers affect measures of central tendency and dispersion? Provide an example

Outliers can significantly impact measures of central tendency (e.g., mean, median, mode) and measures of dispersion (e.g., range, variance, standard deviation) in a dataset.

## Measures of Central Tendency:

##### Mean:
Outliers can have a substantial effect on the mean because the mean is calculated by summing all values and dividing by the number of values. If there are extreme outliers, they can pull the mean toward their values. For positively skewed data, where there is a high outlier, the mean will be higher than expected. Conversely, for negatively skewed data, where there is a low outlier, the mean will be lower than expected.

##### Median:
The median is less affected by outliers because it is the middle value when data is sorted. An outlier adds one value at the end of the sorted list, so it can shift the median by at most one position, no matter how extreme the outlier is. In cases where there are outliers, the median may provide a better representation of the central value.

##### Mode:
The mode represents the most frequently occurring value(s) and is less influenced by outliers, especially if the outliers are isolated cases.

## Measures of Dispersion:

##### Range:
Outliers can significantly affect the range because the range is calculated as the difference between the maximum and minimum values. Even a single extreme outlier can widen the range substantially.

##### Variance and Standard Deviation:
Outliers can increase the variance and standard deviation because these measures take into account the squared differences between each data point and the mean. Outliers tend to have larger squared differences, increasing the overall variability in the dataset.



In [8]:
Data2 =  [20, 25, 28, 30, 35, 40, 45, 50] # Without outlier

Without_Outlier_mean = np.mean(Data2)
Without_Outlier_median = np.median(Data2)

print("Mean Without Outlier: ", Without_Outlier_mean)
print("Median Without Outlier ", Without_Outlier_median)

Mean Without Outlier:  34.125
Median Without Outlier  32.5


In [9]:
Data2 =  [20, 25, 28, 30, 35, 40, 45, 50, 900] # With outlier

With_Outlier_mean = np.mean(Data2)
With_Outlier_median = np.median(Data2)

print("Mean With Outlier: ", With_Outlier_mean)
print("Median With Outlier ", With_Outlier_median)

Mean With Outlier:  130.33333333333334
Median With Outlier  35.0


In [None]:
# From the outputs above: the outlier raises the mean sharply (34.125 -> 130.33), while the median barely moves (32.5 -> 35.0).
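The same comparison can be made for the dispersion measures. This is a sketch using the same two datasets; the helper `spread` is a hypothetical function introduced here, and the population standard deviation is used.

```python
import statistics as st

without_outlier = [20, 25, 28, 30, 35, 40, 45, 50]
with_outlier = without_outlier + [900]

def spread(values):
    """Return (range, population standard deviation) for a list of numbers."""
    return max(values) - min(values), st.pstdev(values)

range_without, std_without = spread(without_outlier)
range_with, std_with = spread(with_outlier)

print("Range without / with outlier:", range_without, "/", range_with)
print("Std dev without / with outlier:", std_without, "/", std_with)
```

A single extreme value widens the range from 30 to 880 and inflates the standard deviation as well, because its squared deviation from the mean dominates the sum.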