In [1]:
#Q1. The three measures of central tendency

The three common measures of central tendency are:

1. Mean: The mean, also known as the average, is calculated by summing up all the values in a dataset and dividing by the total number of observations. It represents the arithmetic center of the data and is sensitive to extreme values.

2. Median: The median is the middle value in a dataset when the values are arranged in ascending or descending order. It divides the dataset into two equal halves, with 50% of the observations falling below and 50% above the median. The median is not affected by extreme values and is useful when dealing with skewed or non-normally distributed data.

3. Mode: The mode is the value or values that occur most frequently in a dataset. It represents the peak or most common value(s) in the data. Unlike the mean and median, the mode can be used for both numerical and categorical data. A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode.

In [2]:
#Q2. Difference between the mean, median and mode and their usage to measure the central tendency of a dataset

|Mean|Median|Mode|
|---|---|---|
|The mean is the average of the given observations.|The median is the middle value of the given observations once they are sorted.|The mode is the most frequently occurring value in the given observations.|
|Mean is affected by outliers.|Median is not affected by outliers.|Mode is not directly affected by outliers in a dataset.|
|The mean is obtained by adding up all the values and dividing by the number of values. The population mean is usually represented by the Greek letter μ; <p style = 'text-align: center;'> $ μ = \frac{\sum{x_i}}{n} $ </p>|The median is found by arranging the values in ascending or descending order and taking the middle one. If the number of data points is odd, the median is the $ (\frac{n+1}{2})^{th} $ value. If the number of data points is even, the median is the mean of the two middle values, that is, the $ (\frac{n}{2})^{th} $ and $ (\frac{n+2}{2})^{th} $ values.|The mode is the most frequently occurring value in the dataset. There can be one modal value or more than one. If no value occurs more often than the others, the dataset has no mode.|
|When data is normally distributed, the mean is widely preferred. The mean measures the central tendency of a dataset by providing an average or typical value that represents the dataset as a whole.|When the data distribution is skewed, the median is the best representative. The median measures central tendency by giving the middle value of the dataset, which locates the 'center' of the data distribution.|When the data is nominal (categorical), the mode is preferred. The mode is primarily used to describe categorical or discrete data. It identifies the value(s) that occur with the highest frequency, providing insight into the typical category or the most common occurrence within a dataset.|
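
The positional rule for the median given in the table can be sketched in plain Python. The helper name `median_by_position` is hypothetical and introduced here only for illustration; its result is compared against the standard library's `statistics.median`.

```python
import statistics


def median_by_position(values):
    """Median via the (n+1)/2-th value (odd n) or the mean of the
    n/2-th and (n+2)/2-th values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError('median of an empty dataset is undefined')
    if n % 2 == 1:
        # The formula uses 1-based positions, so subtract 1 for Python indexing.
        return ordered[(n + 1) // 2 - 1]
    lower = ordered[n // 2 - 1]          # (n/2)-th value
    upper = ordered[(n + 2) // 2 - 1]    # ((n+2)/2)-th value
    return (lower + upper) / 2


odd_data = [7, 1, 5]        # illustrative values, not from the notebook
even_data = [7, 1, 5, 3]

odd_median = median_by_position(odd_data)
even_median = median_by_position(even_data)

print(odd_median, statistics.median(odd_data))
print(even_median, statistics.median(even_data))
```

Sorting first is what makes the positions meaningful; the rule itself only picks out the middle position(s).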

In [3]:
#Q3. Measuring the three measures of central tendency for the given height data

import pandas


heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

height_df = pandas.DataFrame({'heights': heights})

mean = height_df['heights'].mean()
median = height_df['heights'].median()
mode = str(height_df['heights'].mode().values.tolist()).strip('[]')

data = {'Mean': mean,
        'Median': median,
        'Mode': mode
       }

print('\n{:<20} {:<10}\n'.format('Central Tendency', 'Value'))

for s, v in data.items():
    print('{:<20} {:<10}'.format(s, v))


Central Tendency     Value     

Mean                 177.01875 
Median               177.0     
Mode                 177.0, 178.0


In [4]:
#Q4. Finding the standard deviation for the given data

import pandas


data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

data_df = pandas.DataFrame({'Data': data})

std = data_df['Data'].std()        # pandas uses ddof=1 by default, i.e. the sample standard deviation (divides by n - 1)

print(f'The standard deviation of the given data is {std}.')

The standard deviation of the given data is 1.847238930584419.
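
The value printed above is the sample standard deviation, because pandas' `Series.std()` divides by `n - 1` unless `ddof` is changed. As a small sketch using only the standard library, the two conventions can be compared side by side on the same data:

```python
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

sample_std = statistics.stdev(data)         # divides by n - 1 (matches pandas' default)
population_std = statistics.pstdev(data)    # divides by n

print(f'Sample standard deviation:     {sample_std}')
print(f'Population standard deviation: {population_std}')
```

Which one is appropriate depends on whether the 16 values are treated as a sample or as the entire population.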


In [5]:
#Q5. Use of range, variance, and standard deviation to describe the spread of a dataset along with an example to describe them

1. Range: The range is the simplest measure of spread and represents the difference between the maximum and minimum values in a dataset. It provides an idea of the total span of the data. For example: We consider a dataset representing the ages of individuals in a group: {20, 25, 30, 35, 40}. The range in this case is calculated as the difference between the maximum value (40) and the minimum value (20), which is 20. Therefore, the range of the dataset is 20 years.

2. Variance: Variance quantifies the average squared deviation of the data points from the mean of the dataset. It measures how much the individual values deviate from the mean. For example: We continue with the ages dataset. To calculate the variance, we first find the mean age by summing all the values (20 + 25 + 30 + 35 + 40 = 150) and dividing by the count (5), which gives a mean age of 30 years. Then, we calculate the squared difference of each value from the mean: $ (20 - 30)^2 $, $ (25 - 30)^2 $, $ (30 - 30)^2 $, $ (35 - 30)^2 $, and $ (40 - 30)^2 $, that is, 100, 25, 0, 25, and 100. The sum of squared differences is 250, and dividing by the count (5) gives a population variance of 50 (in years squared). If the five ages were treated as a sample rather than the whole population, we would divide by $ n - 1 = 4 $ instead, giving a sample variance of 62.5.

3. Standard Deviation: The standard deviation is the square root of the variance. It provides a measure of the average distance between each data point and the mean. The standard deviation is often preferred as it is in the same unit as the original data. For example: Building on the ages dataset, the standard deviation is obtained by taking the square root of the variance calculated earlier. If the variance is 50, the standard deviation is the square root of 50, which is approximately 7.07 years.
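
The three calculations for the ages dataset can be checked with a short sketch using the standard library. The population versions (`pvariance`, `pstdev`) divide by the count, as the worked example does; `variance` is shown only for contrast.

```python
import statistics

ages = [20, 25, 30, 35, 40]

age_range = max(ages) - min(ages)                   # 40 - 20
population_variance = statistics.pvariance(ages)    # sum of squared deviations / n
population_std = statistics.pstdev(ages)            # square root of the population variance
sample_variance = statistics.variance(ages)         # sum of squared deviations / (n - 1)

print(f'Range:               {age_range}')
print(f'Population variance: {population_variance}')
print(f'Population std:      {population_std:.2f}')
print(f'Sample variance:     {sample_variance}')
```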

In [6]:
#Q6. Venn diagram

Venn diagrams are visual representations used to illustrate the relationships and overlaps between different sets or groups of objects, elements, or concepts. They were introduced by the English mathematician John Venn in the late $ 19^{th} $ century. A Venn diagram consists of circles (or other closed curves) that represent sets, with each circle representing a different set. The overlapping regions of the circles indicate the elements or characteristics that are common to multiple sets.

In [7]:
#Q7. Finding the following for the given sets; A = {2, 3, 4, 5, 6, 7} & B = {0, 2, 6, 8, 10}

    (i) A ∩ B = {2, 6}

    (ii) A ∪ B = {0, 2, 3, 4, 5, 6, 7, 8, 10}
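
Both results can be reproduced with Python's built-in `set` type; a minimal sketch:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B    # elements in both sets; same as A.intersection(B)
union = A | B           # elements in either set; same as A.union(B)

# Sets are unordered, so sort them for a readable printout.
print(f'A ∩ B = {sorted(intersection)}')
print(f'A ∪ B = {sorted(union)}')
```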

In [8]:
#Q8. Understanding skewness in data

Skewness in data refers to the asymmetry of the distribution of values. It measures the extent to which a distribution departs from symmetry about its center; a symmetric distribution, such as the bell-shaped normal distribution, has zero skewness.

Skewness can be characterized as either positive skewness or negative skewness:

- Positive Skewness: A positively skewed distribution has a longer or fatter tail on the right side of the distribution. In other words, the majority of values are concentrated towards the left side of the distribution, while a few larger values extend the tail towards the right. The mean of a positively skewed distribution tends to be greater than the median.

- Negative Skewness: A negatively skewed distribution has a longer or fatter tail on the left side of the distribution. The majority of values are concentrated towards the right side of the distribution, while a few smaller values extend the tail towards the left. The mean of a negatively skewed distribution tends to be smaller than the median.
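
As an illustration, the sign of the skewness and the relative position of mean and median can be checked numerically. This is a sketch under stated assumptions: the helper `skewness` is hypothetical and uses one common definition (the moment-based coefficient $ g_1 = m_3 / m_2^{3/2} $), and both datasets are made up for the example.

```python
import statistics


def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, using population moments.
    Assumes the values are not all identical (otherwise m2 is zero)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]        # mirror image: tail on the left

for name, data in [('right-skewed', right_skewed), ('left-skewed', left_skewed)]:
    print(f'{name:<14} skewness = {skewness(data):+.2f}, '
          f'mean = {statistics.mean(data)}, median = {statistics.median(data)}')
```

Other skewness estimators (for example, bias-corrected sample versions) give different magnitudes but the same sign on data like this.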

In [9]:
#Q9. Position of median with respect to mean in right skewed data

In right-skewed (positively skewed) data, the mean is typically greater than the median, because the few large values in the right tail pull the mean upward. The median therefore lies to the left of the mean.

In [10]:
#Q10. Difference between covariance and correlation

|Covariance|Correlation|
|---|---|
|Covariance indicates how two random variables vary together. It indicates the direction of the linear relationship between two variables; its magnitude depends on the units of the variables, so it is hard to interpret on its own. A positive covariance suggests that the variables tend to move in the same direction, while a negative covariance suggests they move in opposite directions.|Correlation indicates how strongly two variables are linearly related. A value of +1 represents a perfect positive linear relationship, -1 represents a perfect negative linear relationship, and 0 indicates no linear relationship.|
|We can deduce Correlation from Covariance.|Correlation provides a measure of Covariance on a standard scale. It is obtained by dividing the Covariance by the product of the standard deviations of the two variables.|
|The value of Covariance lies in the range of -∞ and +∞.|Correlation is limited to values between -1 and +1.|
|Covariance has units: the product of the units of the two variables.|Correlation is a unitless number between -1 and +1, including decimal values.|
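
The relationship between the two quantities can be sketched in a few lines of Python. The helpers `sample_covariance` and `correlation` are hypothetical names written for this example, and the study-hours data is invented; the correlation is computed as the covariance divided by the product of the two standard deviations.

```python
import statistics


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both sample standard deviations."""
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


hours_studied = [1, 2, 3, 4, 5]         # illustrative data
exam_scores = [52, 55, 61, 70, 72]

cov = sample_covariance(hours_studied, exam_scores)
corr = correlation(hours_studied, exam_scores)

print(f'Covariance:  {cov} (units: hours x points)')
print(f'Correlation: {corr:.3f} (unitless)')
```

The same divisor (`n - 1`) is used for the covariance and the standard deviations so that the ratio stays within -1 and +1.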

### Uses of Covariance and Correlation

1. Relationship Analysis: Covariance and correlation help in understanding how changes in one variable are associated with changes in another variable. They are used to identify and measure the strength of relationships in fields such as finance, economics, social sciences, and data analysis.

2. Variable Selection: In feature selection or model building, covariance and correlation are employed to assess the interdependence between variables. They help identify variables that are highly correlated with the target variable or with each other, aiding in the selection of relevant and independent variables.

3. Data Exploration: Covariance and correlation matrices provide valuable insights into the relationships among multiple variables in exploratory data analysis. They can reveal patterns and dependencies that assist in forming hypotheses or identifying areas for further investigation.

In [11]:
#Q11. Formula for calculating the sample mean along with an example calculation to illustrate it

The sample mean is a measure of the center of the data. It is the mean of a sample drawn from a larger population. It is a good estimate of the population mean when the sample is large and is drawn at random from the population.

We can use the following steps to calculate the sample mean of a dataset:

1. Adding up the sample items - First, we count how many items the sample contains and add up their values.

2. Dividing sum by the number of samples - Next, we divide the sum from step one by the number of items in the sample.

3. The result is the mean - After dividing, the resulting quotient becomes our sample mean, or average.

### Formula for calculating sample mean

<p style = 'text-align: center;'> $ \bar{x} = \frac{\sum{x_i}}{n} $ </p>

In [12]:
### Example calculation for a dataset

import seaborn


tips = seaborn.load_dataset('tips')

# Draw a random 10% sample from the 'total_bill' population.
# No random_state is set, so the sample (and the printed mean) changes between runs.
sample_data = tips['total_bill'].sample(frac = 0.1).tolist()

# Sum the sampled values and divide by the number of elements in the sample.
sample_mean = sum(sample_data)/len(sample_data)

print(f'The sample mean of the total_bill dataset is {sample_mean}.')

The sample mean of the total_bill dataset is 18.96458333333333.


In [13]:
#Q12. Relationship between the measures of central tendency for a normal distribution

For a normal distribution, the three measures of central tendency — mean, median, and mode — have a specific relationship:

- Mean: In a normal distribution, the mean is located at the center of the distribution. It is equal to the median, and both are positioned at the peak of the symmetrical bell-shaped curve. This is a key characteristic of the normal distribution, where the mean divides the distribution into two equal halves.

- Median: The median of a normal distribution is the same as the mean. It represents the middle value in the dataset when the values are arranged in ascending or descending order. Since the normal distribution is symmetric, the median falls at the highest point of the curve, which is also where the mean is located.

- Mode: The mode of a normal distribution is also equal to the mean and the median. In a perfectly symmetrical normal distribution, there is a single peak, and that peak represents the most frequently occurring value—the mode. Therefore, the mode coincides with both the mean and median.
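
This equality can be checked with `statistics.NormalDist` from the standard library (Python 3.8+), which exposes the mean, median, and mode of a normal distribution as properties. The parameters below (`mu=177`, `sigma=2`) are arbitrary values chosen for the sketch.

```python
from statistics import NormalDist

heights = NormalDist(mu=177, sigma=2)    # arbitrary illustrative parameters

print(f'Mean:   {heights.mean}')
print(f'Median: {heights.median}')
print(f'Mode:   {heights.mode}')

# The median is also the 50th percentile, i.e. the inverse CDF evaluated at 0.5.
fiftieth_percentile = heights.inv_cdf(0.5)
print(f'50th percentile: {fiftieth_percentile}')
```

This holds for the theoretical distribution; in a finite sample drawn from it, the three measures are only approximately equal.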

In [14]:
#Q13. Difference between covariance and correlation

Although covariance and correlation both measure the relationship between two variables, covariance is affected by the scales of the variables and lacks a standardized interpretation of strength. Correlation, on the other hand, is scale-independent, provides a standardized interpretation of strength and direction, and allows for easier comparison across different datasets. Correlation is generally considered a more useful and informative measure when assessing the linear relationship between variables.
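
The scale dependence can be made concrete by changing the units of one variable. In this sketch the data and the helper names `sample_covariance` and `correlation` are hypothetical; converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import statistics


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


heights_m = [1.60, 1.70, 1.75, 1.80, 1.90]      # illustrative data, in metres
weights_kg = [55, 65, 70, 78, 90]
heights_cm = [h * 100 for h in heights_m]       # same heights, in centimetres

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)

print(f'Covariance  (m, kg):  {cov_m:.4f}')
print(f'Covariance  (cm, kg): {cov_cm:.4f}')
print(f'Correlation (m, kg):  {corr_m:.4f}')
print(f'Correlation (cm, kg): {corr_cm:.4f}')
```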

In [15]:
#Q14. Outliers affecting measures of central tendency and dispersion along with example

Outliers can significantly impact measures of central tendency and dispersion, influencing their values and potentially distorting the overall representation of the data.

## Measures of Central Tendency

1. Mean: Outliers can have a notable impact on the mean since it is sensitive to extreme values. Even a single outlier that is significantly larger or smaller than the other data points can shift the mean towards that extreme value.

2. Median: The median is less affected by outliers since it depends only on the middle value(s) of the sorted dataset. An extreme value changes the median only slightly, by shifting which observations fall in the central position.

3. Mode: The mode is generally not affected by outliers, as it represents the most frequent value(s) in the dataset. Outliers do not alter the mode unless they become the most frequently occurring value(s).

For example: Consider a dataset representing the salaries of employees in a company: {30,000, 35,000, 40,000, 45,000, 100,000}. The value 100,000 deviates markedly from the other salaries. It pulls the mean up to 50,000, which is higher than four of the five salaries. The median, however, is the middle value (40,000) and would be the same if the highest salary were 50,000 instead of 100,000. The mode is also unaffected: every salary occurs exactly once, so the outlier does not become the most frequently occurring value.
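
The effect on the mean and median can be checked with a short sketch. The second list, `salaries_capped`, is a hypothetical variant introduced only for comparison, in which the outlier is replaced by 50,000.

```python
import statistics

salaries = [30_000, 35_000, 40_000, 45_000, 100_000]
salaries_capped = [30_000, 35_000, 40_000, 45_000, 50_000]    # hypothetical: outlier replaced

for name, data in [('With outlier', salaries), ('Outlier replaced', salaries_capped)]:
    print(f'{name:<18} mean = {statistics.mean(data)}, median = {statistics.median(data)}')

# Every salary occurs once, so multimode returns all five values (no single mode).
modes = statistics.multimode(salaries)
print(f'Most frequent value(s): {modes}')
```

The mean moves by 10,000 between the two lists while the median stays put, which is the sensitivity difference described above.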

## Measures of Dispersion

1. Range: Outliers can significantly impact the range by increasing or decreasing the spread of the data. If an outlier is present, the range may expand to encompass the extreme value(s).

2. Variance and Standard Deviation: Outliers can have a substantial effect on the variance and standard deviation since they involve squared differences from the mean. Outliers tend to increase the variability, resulting in larger variances and standard deviations.

For example: In the salary dataset above, the outlier of 100,000 stretches the range to 70,000 (100,000 - 30,000); without it, the range would be only 15,000 (45,000 - 30,000). The outlier also inflates the variance and standard deviation, because its squared difference from the mean dominates the sum of squared differences.
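
A brief sketch makes the size of the effect visible by computing the three measures with and without the outlier. The helper `spread` is a hypothetical name, and the population versions of variance and standard deviation are used (dividing by the count) as an assumption for this example.

```python
import statistics


def spread(values):
    """Range, population variance, and population standard deviation of the values."""
    return {
        'range': max(values) - min(values),
        'variance': statistics.pvariance(values),
        'std': statistics.pstdev(values),
    }


salaries = [30_000, 35_000, 40_000, 45_000, 100_000]
without_outlier = salaries[:-1]     # drop the 100,000 salary

with_stats = spread(salaries)
without_stats = spread(without_outlier)

print('{:<12} {:>15} {:>15}'.format('Measure', 'With outlier', 'Without outlier'))
for key in with_stats:
    print('{:<12} {:>15.1f} {:>15.1f}'.format(key, with_stats[key], without_stats[key]))
```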