The three measures of central tendency are:

Mean: The mean is the average of a set of numbers. It is calculated by adding up all the values in a dataset and then dividing by the number of values.

Median: The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.

Mode: The mode is the value that appears most frequently in a dataset. It's possible for a dataset to have more than one mode if multiple values occur with the same highest frequency, in which case the dataset is called multimodal. If no value repeats, the dataset is considered to have no mode.

The mean, median, and mode are all measures of central tendency, but they each capture different aspects of the data distribution:

Mean: The mean represents the arithmetic average of a dataset. It is calculated by adding up all the values in the dataset and then dividing by the total number of values. Because it takes the magnitude of every value into account, the mean is sensitive to extreme values, also known as outliers. It is most appropriate when the data is roughly symmetric and free of extreme outliers, as with normally distributed data.

Median: The median is the middle value of a dataset when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values. The median is less affected by extreme values compared to the mean, making it a more robust measure of central tendency, especially when dealing with skewed or non-normally distributed data.

Mode: The mode is the value that appears most frequently in a dataset. Unlike the mean and median, the mode doesn't consider the actual values of the dataset but focuses solely on frequency. The mode is useful for categorical or discrete data, where values represent categories or groups. It's often used to identify the most common category or response in a dataset.
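The definitions above translate directly into a few lines of Python. The following is a minimal sketch using only the standard library; the helper names `mean`, `median`, and `modes` and the sample values are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All values sharing the highest frequency; empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [v for v, c in counts.items() if c == top]


scores = [70, 75, 80, 80, 85, 90, 400]  # hypothetical data with one extreme value
print(mean(scores))    # about 125.7, pulled upward by the value 400
print(median(scores))  # 80, the middle value
print(modes(scores))   # [80]
print(modes(["red", "blue", "red", "green"]))  # ['red'], the mode also works for categories
```

The mean lands far above every typical score, while the median and mode stay at 80, which is the contrast the three descriptions draw.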

In [1]:
height=[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
import numpy as np

In [3]:
np.mean(height)

177.01875

In [5]:
np.median(height)

177.0

In [6]:
from scipy import stats

In [10]:
stats.mode(height)

ModeResult(mode=array([177.]), count=array([3]))
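Note that in `height` both 177 and 178 occur three times, so the data is bimodal, and `stats.mode` reports only one of the tied values (here 177). As a sketch of one way to list every mode, the standard-library function `statistics.multimode` (Python 3.8+) returns all tied values in the order they first appear.

```python
from statistics import multimode

height = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
          178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value that shares the highest frequency
all_modes = multimode(height)
print(all_modes)  # [178, 177]
```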

In [11]:
data=[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [12]:
np.std(data)

1.7885814036548633

Range: The range is the simplest measure of dispersion. It's calculated by subtracting the minimum value from the maximum value in the dataset. A larger range indicates greater variability in the dataset. However, the range alone doesn't provide information about the distribution of values within the dataset.

Variance: The variance measures the average squared deviation of each data point from the mean of the dataset. It gives a sense of how much the data points differ from the mean on average. A higher variance indicates greater variability or spread in the dataset.

Standard Deviation: The standard deviation is the square root of the variance. It's often used because it's in the same units as the original data, making it more interpretable. A higher standard deviation indicates greater variability in the dataset.
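The three measures can be sketched on the same `data` list with the standard-library `statistics` module. `np.std` divides by n by default (population formula), which corresponds to `pstdev` here; the sample versions, which divide by n - 1, are shown for comparison. Variable names are illustrative.

```python
import math
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

data_range = max(data) - min(data)           # maximum minus minimum
variance = statistics.pvariance(data)        # population variance (divides by n)
std_dev = statistics.pstdev(data)            # square root of the population variance
sample_variance = statistics.variance(data)  # sample variance (divides by n - 1)

print(data_range)                            # 7.5
print(variance, std_dev)
print(math.isclose(std_dev, math.sqrt(variance)))  # True
print(sample_variance)                       # slightly larger than the population variance
```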

A Venn diagram is a visual representation of the relationships between different sets or groups of objects. It consists of overlapping circles or other shapes, each representing a set, and the overlaps represent the intersections between these sets.

A = {2, 3, 4, 5, 6, 7} and B = {0, 2, 6, 8, 10}

A intersection B = {2, 6}

A union B = {0, 2, 3, 4, 5, 6, 7, 8, 10}
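The same result can be checked with Python's built-in `set` type, where `&` is intersection and `|` is union. This is a small sketch, and the variable names are illustrative.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements in both sets (the overlap in the Venn diagram)
union = A | B         # elements in either set

print(sorted(intersection))  # [2, 6]
print(sorted(union))         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```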

Skewness refers to the asymmetry of the distribution of values within a dataset. A distribution is skewed to the left when most values are concentrated at the higher end with a longer tail on the left side, and skewed to the right when most values are concentrated at the lower end with a longer tail on the right side. In a perfectly symmetrical, unimodal distribution, the mean, median, and mode are all equal.

There are three main types of skewness:

Positive Skewness (Right Skewness): In a positively skewed distribution, the tail on the right side of the distribution is longer or fatter than the tail on the left side. This indicates that there are more extreme values on the right side of the distribution, and the mean is typically greater than the median and mode.

Negative Skewness (Left Skewness): In a negatively skewed distribution, the tail on the left side of the distribution is longer or fatter than the tail on the right side. This indicates that there are more extreme values on the left side of the distribution, and the mean is typically less than the median and mode.

Zero Skewness: A perfectly symmetrical distribution has zero skewness. If it is also unimodal, as with the bell-shaped normal distribution, the mean, median, and mode are all equal.

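In a right-skewed (positively skewed) distribution, the mean is typically greater than the median.

The median lies closer to the bulk of the smaller values, away from the longer tail on the right side, because the extreme values in that tail pull the mean upward more than they move the median.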
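One common way to quantify skewness is the moment coefficient, the third central moment divided by the second central moment raised to the power 1.5. The sketch below assumes that definition (other definitions exist) and uses made-up datasets to show the sign of the coefficient and the position of the mean relative to the median.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # hypothetical: long tail on the right
left_skewed = [-x for x in right_skewed]         # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, skewness(values), statistics.mean(values), statistics.median(values))
```

For the right-skewed sample the coefficient is positive and the mean (4.7) exceeds the median (3); mirroring the data flips both the sign and the ordering.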

Covariance:
Covariance measures the degree to which two variables change together. It indicates the direction of the linear relationship between two variables.
A positive covariance indicates that as one variable increases, the other variable also tends to increase, while a negative covariance indicates that as one variable increases, the other variable tends to decrease.

Correlation:
Correlation is a standardized measure of the linear relationship between two variables. It measures both the strength and direction of the relationship between the variables.
Correlation coefficients range between -1 and 1. A correlation coefficient of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
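Both quantities can be computed from their definitions in a few lines. The sketch below uses the population formulas (dividing by n) and hypothetical data; the names `covariance`, `correlation`, `hours`, `score_up`, and `score_down` are assumptions made for illustration.

```python
import math


def covariance(x, y):
    """Population covariance: average product of the deviations from each mean."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def correlation(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


hours = [1, 2, 3, 4, 5]             # hypothetical study hours
score_up = [52, 55, 61, 64, 70]     # rises with hours
score_down = [70, 64, 61, 55, 52]   # falls with hours

print(covariance(hours, score_up), correlation(hours, score_up))      # both positive
print(covariance(hours, score_down), correlation(hours, score_down))  # both negative
```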

In [14]:
import numpy as np
import seaborn as sns
df=sns.load_dataset('tips')
df.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [16]:
np.mean(df["total_bill"])

19.78594262295082

For a normal distribution, also known as a Gaussian distribution, the measures of central tendency, namely the mean, median, and mode, are all equal.

Mean: The mean represents the average value of the dataset. In a normal distribution, the mean is located at the center of the distribution and divides it symmetrically into two halves.

Median: The median is the middle value of the dataset when the values are arranged in ascending or descending order. In a normal distribution, the median is also located at the center of the distribution, coinciding with the mean.

Mode: The mode is the value that appears most frequently in the dataset. In a normal distribution, the mode is also located at the center of the distribution, where the peak of the bell-shaped curve occurs.
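This can be checked numerically with `statistics.NormalDist` from the standard library (Python 3.8+). The sketch below assumes a hypothetical normal distribution with mean 170 and standard deviation 10, takes the median as the 50% quantile, and finds the mode by locating the peak of the density on a grid.

```python
from statistics import NormalDist

dist = NormalDist(mu=170, sigma=10)  # hypothetical parameters

mean = dist.mean
median = dist.inv_cdf(0.5)                  # value with half of the probability below it
grid = [x / 10 for x in range(1400, 2001)]  # 140.0 to 200.0 in steps of 0.1
mode = max(grid, key=dist.pdf)              # grid point where the density is highest

print(mean, median, mode)  # all three coincide at 170.0
```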

Covariance measures the direction of the relationship between two variables but does not provide a standardized measure of the strength of the relationship.

Correlation provides a standardized measure of both the direction and strength of the relationship between two variables, making it easier to interpret and compare across different datasets or variables.
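The practical consequence is that covariance changes with the units of measurement while correlation does not. As a sketch, the code below takes the five `total_bill` and `tip` values shown in the `df.head()` output, converts dollars to cents, and compares both measures; the helper names `cov` and `corr` are illustrative, and the population formula (dividing by n) is assumed.

```python
import math


def cov(x, y):
    """Population covariance of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def corr(x, y):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))


total_bill = [16.99, 10.34, 21.01, 23.68, 24.59]  # first five rows, in dollars
tip = [1.01, 1.66, 3.50, 3.31, 3.61]

bill_cents = [100 * v for v in total_bill]        # same data, in cents
tip_cents = [100 * v for v in tip]

print(cov(total_bill, tip), cov(bill_cents, tip_cents))    # differs by a factor of 10,000
print(corr(total_bill, tip), corr(bill_cents, tip_cents))  # essentially identical
```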

Outliers can have a significant impact on measures of central tendency and dispersion, often distorting their values and leading to misinterpretation of the data.

Measures of Central Tendency:

Mean: Outliers can heavily influence the mean, pulling it towards their extreme values. For example, if you have a dataset of test scores where most students score between 70 and 90, but one student scores 10, the mean would be significantly lower than the typical scores of the rest of the students.

Median: The median is far less affected by outliers because it depends only on the order of the values, not on their magnitude. Adding an outlier can still shift the median slightly by changing which values fall in the middle, especially in smaller datasets.

Mode: Outliers typically don't affect the mode, since it represents the most frequently occurring value(s) in the dataset and an outlier is, by nature, rare. Only if an extreme value were repeated more often than any other value would it become the mode.

Measures of Dispersion:

Range: Outliers can greatly impact the range, especially if they are extreme values. For example, in a dataset of ages where most individuals are between 20 and 40 years old, but there's one individual who is 100 years old, the range would be much larger due to the outlier.

Variance and Standard Deviation: Outliers can inflate the variance and standard deviation, as these measures quantify the spread of data around the mean. Outliers that are far from the mean contribute significantly to the squared deviations in the variance calculation, leading to larger variance and standard deviation values.

Example:

Consider the following dataset representing the income (in thousands of dollars) of individuals in a neighborhood:

20,25,30,35,40,45,50,500

Mean: Without the outlier (500), the mean income would be

(20+25+30+35+40+45+50)/7 = 35

However, with the outlier, the mean becomes

(20+25+30+35+40+45+50+500)/8 = 93.125

Median: The median is barely affected by the outlier: it is 35 without the outlier and 37.5 with it.

Mode: There is no mode since no value repeats.

Range: Without the outlier, the range is 

50−20=30,

but with the outlier, the range becomes 

500−20=480.

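Variance and Standard Deviation: The outlier significantly increases the variance and standard deviation, making them less representative of the typical spread of incomes in the neighborhood. Using the population formulas, the standard deviation rises from 10 without the outlier to about 154.07 with it.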
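The figures for this example can be recomputed with the standard-library `statistics` module. This is a minimal sketch: the helper `summarize` is a hypothetical name, and the population standard deviation (`pstdev`) is assumed, which corresponds to NumPy's default for `np.std`.

```python
import statistics

incomes = [20, 25, 30, 35, 40, 45, 50, 500]  # in thousands of dollars
without_outlier = incomes[:-1]               # drop the value 500


def summarize(values):
    """Return the central tendency and dispersion measures discussed above."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "std": statistics.pstdev(values),    # population standard deviation
    }


print("without outlier:", summarize(without_outlier))
print("with outlier:   ", summarize(incomes))
```

The mean, range, and standard deviation all jump once 500 is included, while the median moves only from 35 to 37.5.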