## <u>Central Values</u>
#### Central values are representative values that describe the middle or center of a dataset. These values aim to summarize the dataset by providing a typical or "average" value around which the data points tend to cluster. In other words, central values attempt to capture the central location of the data distribution.
#### Central values are crucial because they help us understand the overall trend in the data without having to look at every individual data point. By knowing a central value, we can have a general sense of what a typical observation might look like within the dataset.

### Examples of Central Values:
#### In a dataset representing ages of a group of people, the central value might represent a typical age in the group.
#### In a dataset of exam scores, the central value might represent the average or most common score.

## <u>Measure of Central Tendency</u>
#### A measure of central tendency is a statistical metric used to quantify the central value in a dataset. It is a formalized method for identifying this central value and gives a single number that represents the "typical" value of the dataset. The three most commonly used measures of central tendency are:

### 1. Mean (Arithmetic Average):
#### This is the sum of all the data values divided by the number of data points.

### 2. Median:
#### This is the middle value when the data points are arranged in ascending or descending order. If the number of data points is odd, the median is the middle value. If the number is even, the median is the average of the two middle values.

### 3. Mode:
#### The mode is the value that appears most frequently in the dataset. There can be more than one mode if multiple values occur with the same highest frequency.
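
#### The three definitions above can be sketched in plain Python. This is a minimal illustration using only the standard library; the helper names `mean_of`, `median_of`, and `modes_of` are made up for this example and do not come from any library.

```python
from collections import Counter


def mean_of(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median_of(values):
    """Middle value of the sorted data (average of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes_of(values):
    """All values that share the highest frequency, in order of first appearance."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


sample = [20, 18, 19, 20, 20, 24, 21, 19, 18, 20]
print(mean_of(sample))            # 19.9
print(median_of(sample))          # 20.0
print(modes_of(sample))           # [20]
print(modes_of([1, 1, 2, 2, 3]))  # [1, 2]
```

#### The last call shows the multi-mode case: 1 and 2 both occur twice, so both are returned.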

In [1]:
# Measure of Central Tendency

In [3]:
# Mean (Average)
age = [20, 18, 19, 20, 20, 24, 21, 19, 18, 20]

import numpy as np
avg_age = np.mean(age)
print(avg_age)

19.9


In [5]:
weights = [54, 45, 55, 50, 48, 56, 40]
np.mean(weights)

49.714285714285715

In [21]:
import seaborn as sns

df = sns.load_dataset('tips')
np.mean(df['total_bill'])

19.78594262295082

In [17]:
# Same ages as before plus one extreme value (200) to show the effect of an outlier
age_out = [20, 18, 19, 20, 20, 24, 21, 19, 18, 20, 200]

print('Mean without outliers', np.mean(age))
print('Mean with outliers', np.mean(age_out))
print('Median without outliers', np.median(age))
print('Median with outliers', np.median(age_out))

Mean without outliers 19.9
Mean with outliers 36.27272727272727
Median without outliers 20.0
Median with outliers 20.0


In [19]:
from scipy import stats

stats.mode(age)

ModeResult(mode=20, count=4)

## 1. Mean (Arithmetic Average)
### When Is It Useful?
#### Symmetric distributions: The mean is best suited for datasets where the data is symmetrically distributed (e.g., normal distribution), without extreme values or skewness.
#### Continuous data: It works well with continuous numerical data, such as heights, weights, or test scores, where all data points contribute equally to the overall calculation.
#### Large datasets: The mean becomes more reliable in large datasets because individual variations are averaged out over many data points.
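
#### One way to see the "large datasets" point is a small simulation: draw many samples of two different sizes from the same population and compare how much the sample means vary. The population (uniform values between 18 and 60), the sample sizes, the number of repetitions, and the seed are all arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(size):
    """Mean of `size` values drawn from a hypothetical uniform population."""
    return statistics.fmean(random.uniform(18, 60) for _ in range(size))


# Repeat the experiment 200 times for a small and a large sample size.
small_means = [sample_mean(10) for _ in range(200)]
large_means = [sample_mean(1000) for _ in range(200)]

# The standard deviation of the sample means measures how much they jump around.
spread_small = statistics.stdev(small_means)
spread_large = statistics.stdev(large_means)

print(f"Spread of means, n=10:   {spread_small:.3f}")
print(f"Spread of means, n=1000: {spread_large:.3f}")
```

#### The means of the larger samples should vary far less from run to run, which is the sense in which the mean is "more reliable" there.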

### Advantages:
#### Easy to compute and understand: The mean is a simple, intuitive measure and can be easily calculated.
#### Takes into account all data points: It provides a comprehensive summary because every value in the dataset influences the mean.

### Disadvantages:
#### Sensitive to outliers: The mean is heavily influenced by extreme values. For example, if one data point is significantly larger or smaller than the rest, it can distort the mean and make it less representative of the majority of the data.
#### Not useful for skewed data: Skewness is the asymmetry of a distribution, that is, whether the values spread further toward one side than the other. When a dataset is skewed, the mean is pulled toward the long tail because it reflects the sum of all the values, so it can sit away from where most of the data lies and misrepresent the "typical" value.
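
#### The effect of skew on the mean can be illustrated with a small made-up dataset. The income figures below are hypothetical; they were chosen so that one large value creates a long right tail.

```python
import statistics

# Hypothetical right-skewed data: most incomes are modest, one is very large.
incomes = [2200, 2500, 2700, 3000, 3100, 3400, 3800, 25000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income)    # 5712.5
print(median_income)  # 3050.0

# Count how many observations fall below the mean.
below_mean = sum(1 for x in incomes if x < mean_income)
print(below_mean, "of", len(incomes))  # 7 of 8
```

#### Seven of the eight values lie below the mean, so here the mean describes almost nobody in the group, while the median stays in the middle of the bulk of the data.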

### When the Mean Is Better:
#### When you have symmetric data without extreme outliers.
#### When you need to calculate the expected value or average performance of a dataset (e.g., average score of students in a class).

## 2. Median (Middle Value)
### When Is It Useful?
#### Skewed distributions: The median is ideal for datasets that are not symmetrically distributed or that contain outliers. It represents the central value more accurately in such cases because it isn’t influenced by extremely high or low values.
#### Ordinal data: It is particularly useful for ordinal data (data that can be ordered) where you want to find the middle point, but you don’t want to consider exact numerical distances between values.
#### Small datasets with outliers: In small datasets where a few outliers could significantly impact the mean, the median is more robust.
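
#### For the ordinal case, a possible approach is to take the median of rank positions and map it back to a label. The rating scale and the responses below are hypothetical. `statistics.median_low` is used instead of `statistics.median` because it always returns a value that is present in the data, so no averaging of ranks is needed.

```python
import statistics

# Hypothetical ordinal scale, listed from lowest to highest.
scale = ["poor", "fair", "good", "very good", "excellent"]
rank = {label: position for position, label in enumerate(scale)}

responses = ["good", "poor", "excellent", "good", "fair", "very good", "good", "fair"]

# Work with rank positions only; the gaps between labels carry no meaning.
ranks = [rank[r] for r in responses]

# median_low picks the lower of the two middle ranks when the count is even,
# so the result always maps back to a real label.
median_label = scale[statistics.median_low(ranks)]
print(median_label)  # good
```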

### Advantages:
#### Insensitive to outliers: The median is not affected by extremely large or small values, which makes it a good choice when outliers are present.
#### Better for skewed data: For highly skewed datasets, the median provides a better representation of the "typical" value.

### Disadvantages:
#### Does not consider all values: The median only depends on the middle point(s), so it ignores the magnitude or distribution of the remaining data points.
#### Less mathematically useful: The median is less convenient for further mathematical analysis. Variance and standard deviation, for example, are defined in terms of deviations from the mean, not the median.
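
#### The "does not consider all values" point can be shown with two made-up lists that share the same middle value but differ sharply in their upper values.

```python
import statistics

original = [1, 2, 3, 4, 5]
stretched = [1, 2, 3, 40, 500]  # same middle value, very different upper tail

# The median only looks at the middle position, so it is identical for both lists.
median_original = statistics.median(original)
median_stretched = statistics.median(stretched)
print(median_original, median_stretched)  # 3 3

# The mean uses every value, so it changes.
mean_original = statistics.mean(original)
mean_stretched = statistics.mean(stretched)
print(mean_original, mean_stretched)
```

#### The identical medians hide the fact that the two datasets are very different, which is the cost of the median's robustness.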

### When the Median Is Better:
#### When you have skewed distributions or outliers in your data (e.g., income data, property prices).
#### When working with ordinal data where ranking matters, but distances between values are not meaningful.

## 3. Mode (Most Frequent Value)
### When Is It Useful?
#### Categorical data: The mode is the only measure of central tendency that is suitable for nominal (unordered) categorical data (e.g., colors, types of cars, voting preferences), where you want to know the most common category.
#### Discrete data with repetitions: The mode is helpful for data where certain values repeat, and you are interested in identifying the most common value.
#### Bimodal or multimodal distributions: If a dataset has two or more peaks (bimodal or multimodal distributions), the mode can highlight these multiple "centers."
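
#### Two of these cases can be sketched with the standard library: `collections.Counter` for the most common category and `statistics.multimode` for data with more than one peak. The color and shoe-size lists are hypothetical.

```python
from collections import Counter
import statistics

# Hypothetical categorical data: the mode is the most common label.
colors = ["red", "blue", "green", "blue", "red", "blue"]
top_color = Counter(colors).most_common(1)
print(top_color)  # [('blue', 3)]

# Hypothetical bimodal data: two values tie for the highest frequency.
shoe_sizes = [7, 7, 7, 8, 9, 10, 10, 10, 11]
size_modes = statistics.multimode(shoe_sizes)
print(size_modes)  # [7, 10]
```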

### Advantages:
#### Only choice for nominal data: For unordered, non-numerical data, like survey responses (e.g., “What’s your favorite color?”), the mode is the only applicable measure.
#### Identifies frequency: The mode is useful for identifying the most frequent or popular item in a dataset.
#### Works well with discrete data: When you have discrete data with repeated values, the mode provides a good indication of what value is most common.

### Disadvantages:
#### Not always representative: In some datasets, the mode may not give a meaningful measure of central tendency, especially if there is no clear repetition or if the mode is far from other values.
#### Multiple modes: Some datasets have multiple modes, making it difficult to summarize the dataset with just one value.
#### May not exist: In continuous data, every value may occur exactly once, so all values are equally common and there is no meaningful mode unless the values are first grouped into bins.
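
#### A short sketch of the "may not exist" case, using made-up height measurements. Grouping into 5 cm bins is one possible workaround; the bin width is an arbitrary assumption, and a different width can give a different modal bin.

```python
from collections import Counter
import statistics

# Hypothetical continuous measurements (heights in cm); no value repeats.
heights = [160.2, 171.5, 168.9, 171.1, 175.4, 170.8, 182.3]

# Every value occurs once, so multimode returns all of them.
raw_modes = statistics.multimode(heights)
print(len(raw_modes) == len(heights))  # True

# Grouping into 5 cm bins (labelled by their lower edge) gives a modal bin instead.
bins = Counter(int(h // 5) * 5 for h in heights)
modal_bin, count = bins.most_common(1)[0]
print(f"{modal_bin}-{modal_bin + 5} cm: {count} values")  # 170-175 cm: 3 values
```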

### When the Mode Is Better:
#### When you are working with categorical data and need to find the most frequent category or label.
#### When you are interested in the most frequent observation in discrete data.

In [6]:
# Summary Table

# Import necessary library
import pandas as pd
pd.set_option('display.max_colwidth', 5000)

# Create data for the table
data = {
    "Measure": ["Mean", "Median", "Mode"],
    "Best When": [
        "Symmetric data without outliers",
        "Skewed data or outliers",
        "Categorical or discrete data with repetitions"
    ],
    "Strengths": [
        "Uses all data points, easy to calculate",
        "Not affected by extreme values",
        "Best for identifying the most common category/value"
    ],
    "Weaknesses": [
        "Sensitive to outliers and skewness",
        "Ignores the exact distribution of data",
        "May not be meaningful for all datasets, can have multiple modes"
    ]
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Display the DataFrame as a table
df

| | Measure | Best When | Strengths | Weaknesses |
|---|---|---|---|---|
| 0 | Mean | Symmetric data without outliers | Uses all data points, easy to calculate | Sensitive to outliers and skewness |
| 1 | Median | Skewed data or outliers | Not affected by extreme values | Ignores the exact distribution of data |
| 2 | Mode | Categorical or discrete data with repetitions | Best for identifying the most common category/value | May not be meaningful for all datasets, can have multiple modes |
