### Problem_1: What are the three measures of central tendency?

The three most commonly used measures of central tendency are:
1. Mean
2. Median
3. Mode

### Problem_2: What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

All three (mean, median, and mode) try to represent the center or most typical value in a dataset, but they do it in different ways:

  - Mean:  This is the classic average, calculated by summing all the values and dividing by the number of values. It's good for symmetrical datasets, but can be skewed by extreme values (outliers).

  - Median:  Imagine ordering your data points from least to greatest. The median is the "middle" value. If you have an even number of data points, it's the average of the two middle ones. Median is less affected by outliers than the mean.

  - Mode: This is the most frequent value. You can have one mode (most common scenario), multiple modes (bimodal or multimodal data), or even no mode if all values appear equally often. Mode is useful for categorical data (e.g., shoe sizes) where finding an average doesn't make sense.
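
As a minimal sketch of the three definitions above, the snippet below uses Python's built-in `statistics` module. The numbers and the color list are made-up illustrative values, not data from this notebook.

```python
import statistics as st

# Hypothetical values chosen only to illustrate the definitions
values = [2, 4, 4, 6, 8, 10]
colors = ["red", "blue", "red", "green"]

mean_value = st.mean(values)      # sum of the values divided by their count
median_value = st.median(values)  # even count: average of the two middle values (4 and 6)
mode_value = st.mode(values)      # most frequent value

# The mode also works for categorical data, where a mean is not defined
mode_color = st.mode(colors)

print(mean_value, median_value, mode_value, mode_color)
```

Note that the median (5.0) is not itself a member of the dataset, because the count of values is even.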

### Problem_3: Compute the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [13]:
import numpy as np
import statistics as st
import pandas as pd
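data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]
# Measuring the central tendency for the data.
# Note: 177 and 178 each occur three times; st.mode() returns only the
# first mode it encounters (Python 3.8+), which is 178 here.
summary = {
    'Mean': [np.mean(data)],
    'median': [np.median(data)],
    'mode': [st.mode(data)]
}
df = pd.DataFrame(summary)
df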

|   | Mean      | median | mode |
|---|-----------|--------|------|
| 0 | 177.01875 | 177.0  | 178  |
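
The height data actually has two values tied for most frequent. A hedged sketch of how to list every mode, assuming Python 3.8 or later (where `statistics.multimode` is available):

```python
import statistics as st

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode() returns every value tied for the highest count,
# whereas mode() reports only the first one it encounters.
modes = st.multimode(data)
print(sorted(modes))  # [177, 178]
```

Reporting both modes makes it clear that the data is bimodal rather than having a single most typical height.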


### Problem_4: Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [15]:
import numpy as np
data=[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
# Standard deviation of the given data (np.std uses the population formula, ddof=0, by default):
std=np.std(data)
std

1.7885814036548633
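
The value above is the population standard deviation, because `np.std` divides by `n` by default. If the heights are treated as a sample from a larger population, the sample standard deviation (dividing by `n - 1`) may be more appropriate. A short sketch of the difference:

```python
import numpy as np

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = np.std(data)       # ddof=0: divide by n
sample_std = np.std(data, ddof=1)   # ddof=1: divide by n - 1

print(population_std, sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.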

### Problem_5: How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

1. Range:

  - This is the simplest measure, calculated as the difference between the highest and lowest values in the data set.
  - Example: In the height data from Problem_3 (172.5, 175, 175, ..., 180), the range is 180 - 172.5 = 7.5.
* Range is easy to calculate but has limitations:
  - It only considers the two extreme values and ignores the distribution of the rest of the data.
  - Outliers can significantly affect the range, making it less informative for skewed data.
2. Variance:

  - This gives a more detailed picture of spread. It calculates the average of the squared deviations of each data point from the mean.
  - A larger variance indicates that the data points are further away from the mean on average.
  - Calculation: We'd first find the mean of the data (about 177.02 for the height data). Then, for each data point, subtract the mean, square the difference, and finally, average all those squared deviations.
* Variance has its own drawbacks:
  - Since it uses squared deviations, the units are squared compared to the original data, making interpretation less intuitive.
3. Standard Deviation:

  - This solves the interpretability issue of variance by taking the square root of the variance.
  - Standard deviation has the same units as the original data, making it easier to understand the spread in the context of the data values.
  - A higher standard deviation indicates that the data points are further away from the mean on average.
  
#### Example:

Let's say you have exam scores for 20 students:

  - Range: 85 (highest) - 50 (lowest) = 35 points
  - Variance: (Let's say the calculated variance is 121)
  - Standard Deviation: √(Variance) = √(121) = 11 points
#### The range tells us that all scores fall within a 35-point span. The variance of 121 is the average squared deviation from the mean score, measured in squared points. Finally, the standard deviation of 11 points gives a clearer picture: scores typically lie about 11 points from the mean.
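
The calculation steps for range, variance, and standard deviation can be written out by hand. This is a sketch using the height data from Problem_3 and the population formulas (dividing by `n`); the variable names are illustrative.

```python
data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]
n = len(data)

# Range: highest value minus lowest value
data_range = max(data) - min(data)

# Variance: average of the squared deviations from the mean
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / n

# Standard deviation: square root of the variance (back in the original units)
std_dev = variance ** 0.5

print(data_range, mean, variance, std_dev)
```

The standard deviation computed this way should agree with the `np.std(data)` result from Problem_4.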

### Problem_6: What is a Venn diagram?

  - A Venn diagram is a graphical tool that uses overlapping shapes to represent relationships between sets. It's a way to visually show how different groups of things are similar and different. Circles are the most common shape used.

### Problem_7: For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
1. A intersection B (A ∩ B)
2. A union B (A ∪ B)

In [4]:
# Sets A and B
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Intersection (elements in both sets) and union (elements in either set)
intersection=A.intersection(B)
union=A.union(B)

print('Intersection:',intersection)
print('Union:',union)

Intersection: {2, 6}
Union: {0, 2, 3, 4, 5, 6, 7, 8, 10}
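
To connect this result back to the Venn diagram from Problem_6, the sketch below splits the same two sets into the three regions a two-circle diagram would show. The variable names are illustrative.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

only_a = A - B   # left circle only: in A but not in B
both = A & B     # overlapping region: same as A.intersection(B)
only_b = B - A   # right circle only: in B but not in A

print('Only A:', only_a)
print('Both:', both)
print('Only B:', only_b)

# The three regions together make up the union
assert only_a | both | only_b == A | B
```

The operators `&`, `|`, and `-` are shorthand for the `intersection`, `union`, and `difference` methods on Python sets.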


### Problem_8: What do you understand about skewness in data?

#### Skewness tells us how asymmetric a distribution of data is. In other words, it describes whether the data is clustered more towards one side (left or right) of the center compared to the other. Here's a breakdown of key points about skewness:

#### Visualizing Skewness:
Imagine a normal distribution (bell curve). If the data points are distributed symmetrically around the mean, the curve is perfectly balanced. This signifies zero skewness.

  - Positive Skew: When the tail of the distribution extends further to the right of the mean, the bulk of the data sits on the left and the distribution is positively skewed (right-skewed). Imagine income data where most people have a lower income, but a few have very high incomes, creating a long tail to the right.
  - Negative Skew: When the tail extends further to the left of the mean, the bulk of the data sits on the right and the distribution is negatively skewed (left-skewed). An example might be exam scores where most students score well (clustered on the right) and only a few score very low.

#### How to Measure Skewness:

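There are formulas to calculate skewness, but a simpler way to get a general idea is to compare the mean, median, and mode. For typical single-peaked distributions the following rule of thumb holds, although it is not guaranteed for every dataset: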

  - In a symmetrical distribution (no skew), mean, median, and mode will be close in value.
  - In a positively skewed distribution, the mean will be greater than the median, which will be greater than the mode (mean > median > mode).
  - In a negatively skewed distribution, the mean will be less than the median, which will be less than the mode (mean < median < mode).
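
As a rough sketch of one such formula, the snippet below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second central moment) for a small, made-up income dataset. The data is hypothetical and chosen to have a long right tail.

```python
import numpy as np
import statistics as st

# Hypothetical incomes (in thousands) with one very high value
incomes = [20, 22, 25, 25, 27, 30, 32, 35, 40, 120]

x = np.asarray(incomes, dtype=float)
deviations = x - x.mean()
m2 = np.mean(deviations ** 2)   # second central moment (population variance)
m3 = np.mean(deviations ** 3)   # third central moment
skewness = m3 / m2 ** 1.5       # positive value means right-skewed

print('skewness:', skewness)
print('mean:', x.mean(), 'median:', np.median(x), 'mode:', st.mode(incomes))
```

For this dataset the coefficient is positive and the ordering mean > median > mode holds, matching the rule of thumb for positive skew.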

### Problem_9: If a dataset is right-skewed, what will be the position of the median with respect to the mean?

#### If a data set is right skewed, then the median will be less than the mean.

#### Here's why:

  - In a right-skewed distribution, there are more data points clustered on the left side of the distribution and a longer tail extending towards the right side.
  - The mean is sensitive to extreme values. Since the tail is on the right with higher values, the mean will be pulled slightly towards the right, placing it higher than the center of the data.
  - The median, however, is the middle value when the data is ordered from least to greatest. In a right-skewed distribution, the middle value will be located more towards the left side of the data where most of the data points are clustered.

### Problem_10: Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

#### Covariance and correlation are both used to measure the relationship between two variables, but they do so in slightly different ways:

#### Covariance:

  - What it measures: Covariance tells you how two variables change together. A positive covariance indicates that when one variable increases, the other tends to increase as well; a negative covariance indicates that when one increases, the other tends to decrease. A covariance of zero means there's no linear relationship between the variables.
  - Units: Covariance is measured in the units obtained by multiplying the units of the two variables. For example, if you're looking at weight (kg) and height (cm), the covariance would be in kg*cm. This makes comparing covariance values across different datasets tricky.
  - Interpretation: The sign of the covariance (positive or negative) tells you the direction of the relationship, but its magnitude depends on the units of the variables, so it's difficult to judge the strength of the relationship from covariance alone.
  
#### Correlation:

* What it measures: Correlation is a standardized version of covariance. It takes the covariance and divides it by the product of the standard deviations of the two variables. This removes the influence of units and puts the correlation on a scale of -1 to +1.
* Interpretation:
  - A correlation coefficient of +1 indicates a perfect positive linear relationship (as one variable increases, the other always increases proportionally).
  - A correlation coefficient of -1 indicates a perfect negative linear relationship (as one variable increases, the other always decreases proportionally).
  - Values closer to 0 indicate a weaker linear relationship, either positive or negative.
* Benefits: Since it's standardized, correlation allows you to compare the strength of linear relationships between different pairs of variables, regardless of their original units.
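
The unit dependence of covariance, and the unit independence of correlation, can be checked numerically. This is a minimal sketch with made-up height and weight values; `cov_and_corr` is a hypothetical helper, and `np.cov` is used with its default sample normalization (dividing by `n - 1`).

```python
import numpy as np

# Hypothetical measurements for five people
height_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80])
weight_kg = np.array([55.0, 60.0, 63.0, 70.0, 76.0])

def cov_and_corr(x, y):
    """Return the sample covariance and Pearson correlation of x and y."""
    cov = np.cov(x, y)[0, 1]         # off-diagonal entry of the 2x2 covariance matrix
    corr = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
    return cov, corr

cov_m, corr_m = cov_and_corr(height_m, weight_kg)

# Same heights expressed in centimetres instead of metres
height_cm = height_m * 100
cov_cm, corr_cm = cov_and_corr(height_cm, weight_kg)

print('covariance (m*kg): ', cov_m)
print('covariance (cm*kg):', cov_cm)   # 100 times larger
print('correlation:', corr_m, corr_cm) # unchanged by the unit switch

# Correlation is covariance divided by the product of the standard deviations
manual_corr = cov_m / (np.std(height_m, ddof=1) * np.std(weight_kg, ddof=1))
```

Changing the unit of height rescales the covariance but leaves the correlation untouched, which is why correlation is the easier quantity to compare across datasets.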

### Problem_11: What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

#### The formula for calculating the sample mean (average) is:

Sample mean (x̄) = Σ (xi) / n = (x1 + x2 + ... + xn) / n

#### where:

  - Σ (sigma) represents the sum of all the values in the data set (xi).
  - xi represents each individual value in the data set.
  - n represents the total number of values in the data set (sample size).

#### Here's an example calculation:

Let's say you have a dataset of exam scores for 5 students: {75, 82, 90, 85, 88}

  - Σ (xi): Add up all the exam scores: 75 + 82 + 90 + 85 + 88 = 420
  - n: The number of students (data points) is 5.
  - Sample Mean: Apply the formula: Σ (xi) / n = 420 / 5 = 84
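
The same calculation in Python, as a short sketch that follows the formula step by step:

```python
scores = [75, 82, 90, 85, 88]

total = sum(scores)        # Σ (xi)
n = len(scores)            # sample size
sample_mean = total / n    # Σ (xi) / n

print(total, n, sample_mean)  # 420 5 84.0
```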

### Problem_12: For normally distributed data, what is the relationship between its measures of central tendency?

#### In a normal distribution, all three measures of central tendency –  mean (average), median (middle value), and mode (most frequent value) –  coincide and have the same value.

#### Here's why:

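* A normal distribution is symmetrical. The data points are mirrored around a central point, forming a bell-shaped curve.
* Because of this symmetry, the balance point of the data (the mean) lies at the center, and half of the values fall on each side of it, so the median lies there too.
* The curve has a single peak located at that same central point, so the mode coincides with the mean and median.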
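
This can be checked approximately by simulation. The sketch below draws a large random sample from a standard normal distribution (the seed, sample size, and bin count are arbitrary assumptions) and estimates the mode as the center of the tallest histogram bin, since continuous data rarely repeats exact values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean = sample.mean()
median = np.median(sample)

# Rough mode estimate: midpoint of the histogram bin with the most observations
counts, edges = np.histogram(sample, bins=50)
peak = np.argmax(counts)
mode_estimate = (edges[peak] + edges[peak + 1]) / 2

print(mean, median, mode_estimate)
```

In a finite sample the three values are only approximately equal; the mean and median agree closely, while the histogram-based mode estimate is coarser because it depends on the bin width.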

### Problem_13: How is covariance different from correlation?

#### Covariance:

* What it measures: Covariance tells you how two variables change together.
  - A positive covariance indicates that when one variable increases, the other tends to increase as well; a negative covariance indicates that when one increases, the other tends to decrease.
  - A covariance of zero means there's no linear relationship between the variables.
* Units: Covariance is measured in the units obtained by multiplying the units of the two variables. (For example, weight in kg and height in cm would result in covariance in kg*cm). This makes comparing covariance values across different datasets difficult.
* Interpretation: The sign of the covariance (positive or negative) tells you the direction of the relationship, but its magnitude depends on the units of the variables, so it's hard to judge the strength from covariance alone.

#### Correlation:

* What it measures: Correlation is a standardized version of covariance. It takes the covariance and divides it by the product of the standard deviations of the two variables. This removes the influence of units and puts the correlation on a scale of -1 to +1.
* Interpretation:
  - A correlation coefficient of +1 indicates a perfect positive linear relationship (as one variable increases, the other always increases proportionally).
  - A correlation coefficient of -1 indicates a perfect negative linear relationship (as one variable increases, the other always decreases proportionally).
  - Values closer to 0 indicate a weaker linear relationship, either positive or negative.
* Benefits: Since it's standardized, correlation allows you to compare the strength of linear relationships between different pairs of variables, regardless of their original units.

### Problem_14: How do outliers affect measures of central tendency and dispersion? Provide an example.

#### Outliers, which are data points that fall significantly outside the overall pattern of the data, can significantly impact how we measure both central tendency (mean, median, mode) and dispersion (range, variance, standard deviation).

#### Central tendency:

  - Mean: The mean is very sensitive to outliers. A single extreme outlier can pull the mean noticeably in its direction, so it may misrepresent the center of the data, especially in small datasets.
  - Median: The median is less affected by outliers compared to the mean. Since it's the middle value, extreme outliers on either side won't necessarily influence its position as long as they don't become the new middle value.
  - Mode: The mode (most frequent value) is generally not affected by outliers unless the outlier itself becomes the most frequent value, which is uncommon.
  
#### Dispersion:

  - Range: Range, which is the difference between the highest and lowest values, is heavily influenced by outliers. A single outlier on either end can significantly inflate the range, making the data seem more spread out than it actually is.
  - Variance and Standard Deviation: Both variance and standard deviation are sensitive to outliers because they consider the squared deviations of each data point from the mean. Outliers, by definition, have large deviations from the mean, thus inflating the overall variance and standard deviation.
  
#### Example:

* Imagine a dataset representing exam scores: {75, 82, 85, 88, 90, 150}. Here, 150 is a clear outlier much higher than the rest.

  - Mean: The mean with the outlier is 95 (570 / 6), compared with 84 (420 / 5) without it. The outlier pulls the mean upwards by 11 points.
  - Median: The median with the outlier is 86.5 (the average of 85 and 88), compared with 85 without it. The median shifts only slightly, so it is far less influenced by the outlier.
  - Mode: Every score appears exactly once, so there is no mode with or without the outlier; the outlier does not affect it.
  - Range: The range with the outlier is much larger (150 - 75 = 75) compared to the range without it (90 - 75 = 15). The outlier significantly inflates the range.
  - Variance and Standard Deviation: Both will be higher when the outlier is included, as the squared deviation of 150 from the mean is much larger than the deviations of other scores.
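
The effect of the outlier in this example can be verified with a short sketch using the built-in `statistics` module. The `summarize` helper is a hypothetical name introduced for illustration, and the population standard deviation (`pstdev`) is an assumed choice.

```python
import statistics as st

scores = [75, 82, 85, 88, 90, 150]
scores_without_outlier = scores[:-1]  # drop the last value, 150

def summarize(values):
    """Return the main central-tendency and dispersion measures for values."""
    return {
        'mean': st.mean(values),
        'median': st.median(values),
        'range': max(values) - min(values),
        'std': st.pstdev(values),  # population standard deviation
    }

with_outlier = summarize(scores)
without_outlier = summarize(scores_without_outlier)

for name in with_outlier:
    print(name, with_outlier[name], without_outlier[name])
```

Comparing the two columns shows the mean, range, and standard deviation moving substantially, while the median changes by only 1.5 points.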