In [None]:
"""
Mean: The mean is the average of a set of numbers. It is calculated by adding up all the numbers in the 
set and then dividing by how many numbers there are. For example, the mean of the numbers 2, 4, 6, and 8 
is (2 + 4 + 6 + 8) / 4 = 5.

Median: The median is the middle value in a set of numbers when the numbers are arranged in order.
If there is an even number of numbers, the median is the average of the two middle numbers. For example,
the median of the numbers 1, 3, 5, 7, and 9 is 5.

Mode: The mode is the number that appears most frequently in a set of numbers. A set of numbers can have
one mode, more than one mode (multimodal), or no mode (no number appears more than once). For example, 
in the set of numbers 2, 3, 4, 4, 5, 6, 6, 6, the mode is 6 because it appears most frequently.
"""

In [None]:
"""
The mean is most appropriate when the data are roughly symmetric (for example, approximately normally
distributed) and free of extreme outliers. The median is preferred when the data are skewed or contain
outliers. The mode is used for categorical data, or when the goal is to identify the most common value
in a dataset.
"""

In [2]:
import numpy as np
from scipy import stats

height_data = np.array([178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5])

# Mean
mean_height = np.mean(height_data)

# Median
median_height = np.median(height_data)

# Mode
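# np.atleast_1d handles both SciPy behaviours: older releases return the mode
# as a one-element array, newer ones return a scalar (where [0] would fail).
mode_height = np.atleast_1d(stats.mode(height_data).mode)[0]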

print(f"Mean: {mean_height:.2f}")
print(f"Median: {median_height}")
print(f"Mode: {mode_height}")


Mean: 177.02
Median: 177.0
Mode: 177.0




In [3]:
import numpy as np

height_data = np.array([178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5])

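# np.std computes the population standard deviation by default (ddof=0);
# pass ddof=1 for the sample standard deviation.
std_dev = np.std(height_data)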

print(f"Standard Deviation: {std_dev:.2f}")


Standard Deviation: 1.79


In [None]:
"""
Measures of dispersion, such as range, variance, and standard deviation, are used to describe the spread 
or variability of a dataset. They provide information about how spread out the values in the dataset are
from the central tendency (mean, median, mode). Here's how each measure is used to describe the spread of
a dataset:

Range: The range is the simplest measure of dispersion and is calculated as the difference between the
maximum and minimum values in a dataset. It provides a rough estimate of the spread of the data. 
For example, if the range of test scores in a class is 40 points (from 60 to 100), it indicates that
there is a wide variability in the scores.

Variance: The variance uses every data point rather than only the two extremes. It is calculated as the
average of the squared differences between each data point and the mean (dividing by n for a population,
or by n - 1 for a sample estimate). A higher variance indicates greater variability in the data. For
example, a set of test scores with a variance of 100 is more spread out around its mean than a set with
a variance of 50.

Standard Deviation: The standard deviation is the square root of the variance. Because it is expressed in
the same units as the data, it is easier to interpret than the variance. It represents the typical
(root-mean-square) distance of data points from the mean. A larger standard deviation indicates greater
variability in the data. For example, if test scores are roughly normally distributed with a standard
deviation of 10, about 68% of the scores lie within 10 points of the mean.
"""

In [None]:
"""
A Venn diagram is a visual representation of the relationships between different sets of data. 
It is composed of circles, usually overlapping, that represent the sets. The overlapping areas 
represent the intersections between the sets, showing common elements shared by the sets. Venn diagrams 
are commonly used in mathematics, statistics, logic, and computer science to illustrate set theory concepts
and to visualize the relationships between different groups or categories of data.
"""

In [4]:
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Intersection of A and B
intersection = A.intersection(B)
print(f"Intersection of A and B: {intersection}")

# Union of A and B
union = A.union(B)
print(f"Union of A and B: {union}")


Intersection of A and B: {2, 6}
Union of A and B: {0, 2, 3, 4, 5, 6, 7, 8, 10}
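
The remaining regions of a two-set Venn diagram (the parts of each circle outside the overlap) map onto
Python's set difference operators. The sketch below reuses the same A and B; the variable names are
illustrative.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

only_in_a = A - B        # left circle, excluding the overlap
only_in_b = B - A        # right circle, excluding the overlap
in_exactly_one = A ^ B   # symmetric difference: everything outside the overlap

print(f"Only in A: {sorted(only_in_a)}")
print(f"Only in B: {sorted(only_in_b)}")
print(f"In exactly one set: {sorted(in_exactly_one)}")

# The three regions (A only, overlap, B only) together make up the union.
regions_cover_union = (only_in_a | (A & B) | only_in_b) == (A | B)
print(f"Regions cover the union: {regions_cover_union}")
```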


In [None]:
"""
Skewness refers to the asymmetry of a distribution of data points. It measures how far, and in which
direction, the distribution departs from symmetry. A symmetric distribution, such as the normal
distribution, has zero skewness.

Positive Skewness (Right Skew): If the tail on the right side of the distribution is longer or fatter than 
the left side, the data are said to be positively skewed. Most data points are concentrated on the left
side of the distribution, while a few large values stretch the tail to the right. In a positively skewed
distribution, the mean is typically greater than the median.

Negative Skewness (Left Skew): If the tail on the left side of the distribution is longer or fatter than
the right side, the data are said to be negatively skewed. Most data points are concentrated on the right
side of the distribution, while a few small values stretch the tail to the left. In a negatively skewed
distribution, the mean is typically less than the median.
"""

In [None]:
"""
If the data are right-skewed, the median will typically be less than the mean, because the long right
tail pulls the mean upward.
"""

In [None]:
"""
Covariance:

Covariance measures the extent to which two variables change together.
It is calculated as the average of the product of the deviations of each data point from the mean of the 
respective variables.
Covariance can be positive, negative, or zero.
The magnitude of the covariance is not standardized, making it difficult to interpret the strength of
the relationship.

Correlation:

Correlation is a standardized measure of the relationship between two variables.
It is calculated as the covariance divided by the product of the standard deviations of the variables.
Correlation ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a 
perfect negative linear relationship, and 0 indicates no linear relationship.
Correlation provides a more interpretable measure of the strength and direction of the relationship between
variables compared to covariance.

How they are used:

Covariance: Covariance is used to determine the direction of the linear relationship between two
variables (positive or negative). Because its magnitude depends on the units of the variables, it does
not by itself indicate the strength of the relationship, nor whether the relationship is linear.
Correlation: Correlation is used to measure both the strength and direction of the linear relationship 
between two variables. It is widely used in statistical analysis, hypothesis testing, and model building
to understand the relationship between variables and make predictions based on the data.
"""

In [None]:
"""
In a normal distribution, the mean is located at the center of the distribution, and the distribution is
symmetric around the mean with a single peak there.
Therefore, the mean, median, and mode all have the same value.
"""

In [None]:
"""
Mean: Outliers can greatly influence the mean. Since the mean is calculated by summing all values and then 
dividing by the number of values, a single outlier with a very large or very small value can significantly 
pull the mean in its direction. This can give a misleading representation of the central tendency, 
especially if the dataset is small.
Median: The median is less affected by outliers because it depends only on the middle value(s) of the
sorted data, not on how extreme the largest or smallest values are. A single extreme value can therefore
move it only slightly.
Mode: Outliers generally do not affect the mode, as the mode is simply the most frequently occurring
value in the dataset.

Measures of Dispersion:

Range: Outliers can affect the range, especially if they are extreme values. The range is the difference 
between the maximum and minimum values in the dataset, so an outlier can increase the range significantly.
Variance and Standard Deviation: Outliers can increase the variance and standard deviation, as these 
measures are based on the deviations of each data point from the mean. Since outliers can be far from
the mean, they can increase the overall variability of the dataset.
Example:
Consider the following dataset of exam scores: 85, 88, 90, 92, 95, 100, 105. Without the outlier (105),
the mean is 91.67. However, with the outlier included, the mean increases to 93.57, showing the impact of
the outlier on the mean. The median moves only slightly, from 91 to 92, and there is no mode in either
case because no score repeats. The outlier also increases the range from 15 to 20, and increases the
variance and standard deviation, indicating greater variability in the dataset.
"""