# **Measures of Central Tendency**

This notebook takes a practical approach to one part of descriptive statistics: measuring the central tendency of a dataset, that is, the point around which its values cluster.

****
<br>

The following will be explained:


*   Types of Data
*   Mean
*   Median
*   Mode


****


## **Let's get to it!**



In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd

In [6]:
# Create a simple dataset using a dictionary
data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Age': [28, 35, 42, 29, 55],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Purchase_Amount': [150.50, 200.75, 50.00, 320.10, 88.99],
    'Product_Category': ['Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing'],
    'Satisfaction_Rating': ['Good', 'Excellent', 'Poor', 'Excellent', 'Good'] # Ordinal
}

df = pd.DataFrame(data)

print("\nLet's look at the first few rows:\n")
print(df.head())


Let's look at the first few rows:

   CustomerID  Age  Gender  Purchase_Amount Product_Category  \
0         101   28    Male           150.50      Electronics   
1         102   35  Female           200.75         Clothing   
2         103   42    Male            50.00      Electronics   
3         104   29  Female           320.10       Home Goods   
4         105   55    Male            88.99         Clothing   

  Satisfaction_Rating  
0                Good  
1           Excellent  
2                Poor  
3           Excellent  
4                Good  


In [7]:
# Let's look at the data types pandas infers
print("\nLet's look at the data information:\n")
df.info()  # info() prints its report directly and returns None, so it needs no print()


Let's look at the data information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   CustomerID           5 non-null      int64  
 1   Age                  5 non-null      int64  
 2   Gender               5 non-null      object 
 3   Purchase_Amount      5 non-null      float64
 4   Product_Category     5 non-null      object 
 5   Satisfaction_Rating  5 non-null      object 
dtypes: float64(1), int64(2), object(3)
memory usage: 372.0+ bytes


## **1. Understanding the data types columnwise:**

* `CustomerID` - `int64` (integer) data type; used as an identifier, so it behaves as a label even though it is stored as a number
* `Age` - `int64` (integer) data type; discrete data (quantitative)
* `Gender` - `object` (string) data type; nominal data (qualitative/categorical data)
* `Purchase_Amount` - `float64` (float) data type; continuous data (quantitative)
* `Product_Category` - `object` (string) data type; nominal data (qualitative/categorical data)
* `Satisfaction_Rating` - `object` (string) data type; ordinal data (qualitative/categorical data)
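
pandas stores every text column as `object`, so on its own it cannot tell that the satisfaction ratings are ordinal. The sketch below shows one way to make that order explicit with an ordered categorical. The ranking `Poor < Good < Excellent` is an assumption read off the labels in our sample, and the names `ratings`, `rating_order` and `ordered_ratings` are illustrative only.

```python
import pandas as pd

# Same ratings as in the sample dataset above
ratings = pd.Series(['Good', 'Excellent', 'Poor', 'Excellent', 'Good'])

# Assumed ranking from worst to best (the dataset itself does not state it)
rating_order = ['Poor', 'Good', 'Excellent']
rating_dtype = pd.CategoricalDtype(categories=rating_order, ordered=True)

# Convert the plain strings into an ordered categorical column
ordered_ratings = ratings.astype(rating_dtype)

print(ordered_ratings.dtype)                    # category
print(ordered_ratings.min())                    # lowest rating under the assumed order
print(ordered_ratings.max())                    # highest rating under the assumed order
print(ordered_ratings.sort_values().tolist())   # sorted by rank, not alphabetically
```

Once the order is declared, comparisons and sorting follow the ranking rather than the alphabet, which is what separates ordinal data from nominal data.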


In [13]:
print("--- Measures of Central Tendency for Numerical Data ---\n")

# 1. --- For Numerical Data (e.g., Age, Purchase_Amount) ---

# Mean
mean_age = df['Age'].mean()
mean_purchase = df['Purchase_Amount'].mean()
print(f"Mean Age: {mean_age:.2f}") # .2f for 2 decimal places
print(f"Mean Purchase Amount: {mean_purchase:.2f}\n")

# Median
median_age = df['Age'].median()
median_purchase = df['Purchase_Amount'].median()
print(f"Median Age: {median_age:.2f}")
print(f"Median Purchase Amount: {median_purchase:.2f}\n")

# Mode (for numerical data - useful if specific values repeat often)
mode_age = df['Age'].mode()
mode_purchase = df['Purchase_Amount'].mode() # Returns every tied value; if all values are unique, all of them are returned
print(f"Mode Age: {mode_age.tolist()}") # .tolist() to display nicely
print(f"Mode Purchase Amount: {mode_purchase.tolist()}") # Note: floating-point values are often all unique

--- Measures of Central Tendency for Numerical Data ---

Mean Age: 37.80
Mean Purchase Amount: 162.07

Median Age: 35.00
Median Purchase Amount: 150.50

Mode Age: [28, 29, 35, 42, 55]
Mode Purchase Amount: [50.0, 88.99, 150.5, 200.75, 320.1]


## **Understanding the output**

1. **Mean**

The mean is the average of the values in numerical data: their sum divided by their count. Here is what ours tells us:<br>

* `Age` - The average age of the customers is **37.80**
* `Purchase_Amount` - The average spending amount on products by the customers is **162.07**
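
To see where these two figures come from, the mean can be reproduced by hand. This is a minimal sketch in plain Python; the helper name `arithmetic_mean` is made up for illustration and is not part of pandas.

```python
# Values copied from the Age and Purchase_Amount columns of the sample dataset
ages = [28, 35, 42, 29, 55]
purchases = [150.50, 200.75, 50.00, 320.10, 88.99]

def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    return sum(values) / len(values)

print(f"Mean Age: {arithmetic_mean(ages):.2f}")
print(f"Mean Purchase Amount: {arithmetic_mean(purchases):.2f}")
```

For `Age` the sum is 189 over 5 customers, which gives 37.80, matching the pandas result.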

<br>

2. **Median**

The median is the middle value of the data once it is sorted in ascending order. Here is what ours tells us:
* `Age` - The median customer age is **35.00**
* `Purchase_Amount` - The median customer purchase amount is **150.50**
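
The sort-then-pick-the-middle procedure can be sketched as follows. The helper `median_of` is a hypothetical name used only for this illustration; it averages the two middle values when the count is even, which is the usual convention.

```python
# Values copied from the Age and Purchase_Amount columns of the sample dataset
ages = [28, 35, 42, 29, 55]
purchases = [150.50, 200.75, 50.00, 320.10, 88.99]

def median_of(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return ordered[middle]
    # Even count: average the two values around the middle
    return (ordered[middle - 1] + ordered[middle]) / 2

print(sorted(ages))
print(f"Median Age: {median_of(ages):.2f}")
print(f"Median Purchase Amount: {median_of(purchases):.2f}")
```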

<br>

3. **Mode**

The mode is the value that appears most often. In our data every value in `Age` and `Purchase_Amount` appears exactly once, so neither column has a meaningful mode. That is why pandas returns a list containing all of the values for each column: they are all tied at one occurrence.
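
A list containing every value is easy to misread as "five modes". One way to tell the two situations apart is to look at how often the most frequent value occurs. The sketch below uses `collections.Counter` from the standard library; the helper `modes_with_count` is a hypothetical name for this example.

```python
from collections import Counter

def modes_with_count(values):
    """Return the most frequent value(s) and how many times they occur."""
    counts = Counter(values)
    highest = max(counts.values())
    modes = sorted(value for value, count in counts.items() if count == highest)
    return modes, highest

ages = [28, 35, 42, 29, 55]
modes, highest = modes_with_count(ages)

if highest == 1:
    print("Every value appears once, so there is no meaningful mode.")
else:
    print(f"Mode(s): {modes}, each appearing {highest} times")
```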

<br>



In [21]:
print("--- Measures of Central Tendency for Categorical Data ---\n")

# --- For Categorical Data (e.g., Gender, Product_Category) ---
# Note: Only Mode is appropriate for nominal categorical data

mode_gender = df['Gender'].mode()
mode_product = df['Product_Category'].mode()
mode_satisfaction = df['Satisfaction_Rating'].mode()

print(f"Mode Gender: {mode_gender.tolist()}")
print(f"Mode Product Category: {mode_product.tolist()}")
print(f"Mode Satisfaction Rating: {mode_satisfaction.tolist()}") # Even though ordinal, mode works

--- Measures of Central Tendency for Categorical Data ---

Mode Gender: ['Male']
Mode Product Category: ['Clothing', 'Electronics']
Mode Satisfaction Rating: ['Excellent', 'Good']


## **Understanding the output**

Mode is the only measure of central tendency suited to nominal categorical data (*categories with no inherent order*). It reports the value that appears most often in a column, and it also works for ordinal data such as `Satisfaction_Rating`.<br>


Let's understand the values in our output:
* `Gender` - **Male** customers appear most often in our data.
* `Product_Category` - **Clothing** and **Electronics** are tied as the most purchased categories in our data.
* `Satisfaction_Rating` - **Excellent** and **Good** are tied as the most common ratings customers give.
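
The counts behind these modes, including the ties, can be inspected with `value_counts()`. This is a small sketch that rebuilds only the three categorical columns; `df_cat` and `top_values` are illustrative names, not part of the notebook above.

```python
import pandas as pd

# Categorical columns copied from the sample dataset
df_cat = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Product_Category': ['Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing'],
    'Satisfaction_Rating': ['Good', 'Excellent', 'Poor', 'Excellent', 'Good'],
})

top_values = {}
for column in df_cat.columns:
    counts = df_cat[column].value_counts()          # frequency of each category
    tied = counts[counts == counts.max()]           # keep only the most frequent one(s)
    top_values[column] = sorted(tied.index.tolist())
    print(f"{column}: {top_values[column]} ({counts.max()} occurrences each)")
```

A tie simply means several categories share the highest count, which is why `mode()` returns more than one value for `Product_Category` and `Satisfaction_Rating`.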


<br>


## **Testing the effect of outliers on the Central Tendency Measures**

In [19]:
# Let's add an outlier row (extreme Age and Purchase_Amount values) and see how it affects Mean vs Median
df_with_outlier = df.copy()
# Add a row with extreme values
df_with_outlier.loc[5] = {
    'CustomerID': 106,
    'Age': 200,
    'Gender': 'Female',
    'Purchase_Amount': 10000,
    'Product_Category': 'Electronics',
    'Satisfaction_Rating': 'Excellent'
}
print(df_with_outlier.head(6))

print("\n--- Measures with Outlier ---\n")
print(f"Mean Age (with outlier): {df_with_outlier['Age'].mean():.2f}")
print(f"Median Age (with outlier): {df_with_outlier['Age'].median():.2f} \n")
print(f"Mean Purchase Amount (with outlier): {df_with_outlier['Purchase_Amount'].mean():.2f}")
print(f"Median Purchase Amount (with outlier): {df_with_outlier['Purchase_Amount'].median():.2f} \n")

   CustomerID  Age  Gender  Purchase_Amount Product_Category  \
0         101   28    Male           150.50      Electronics   
1         102   35  Female           200.75         Clothing   
2         103   42    Male            50.00      Electronics   
3         104   29  Female           320.10       Home Goods   
4         105   55    Male            88.99         Clothing   
5         106  200  Female         10000.00      Electronics   

  Satisfaction_Rating  
0                Good  
1           Excellent  
2                Poor  
3           Excellent  
4                Good  
5           Excellent  

--- Measures with Outlier ---

Mean Age (with outlier): 64.83
Median Age (with outlier): 38.50 

Mean Purchase Amount (with outlier): 1801.72
Median Purchase Amount (with outlier): 175.62 



## **Understanding the output**

1. **Mean**

The mean is sensitive to outliers because every value contributes to the sum, so a single extreme value can pull it far from the bulk of the data. That is what the drastic changes below show:
* `Age` mean moved from 37.80 to 64.83 (the unusual age of 200 for `CustomerID` 106 skews the calculation)
* `Purchase_Amount` mean moved from 162.07 to 1801.72 (the huge purchase amount of 10000 for `CustomerID` 106 skews the calculation)

<br>

2. **Median**

The median depends only on the middle of the sorted data, so it is usually far less affected by outliers. This explains the small differences before and after the outlier was introduced:

* `Age` median moved from 35.00 to 38.50
* `Purchase_Amount` median moved from 150.50 to 175.62
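
To put the two reactions side by side, the shift caused by the extra row can be computed directly. This is a minimal sketch on the `Age` values only; the variable names are illustrative.

```python
import pandas as pd

# Age values before and after adding the extreme customer (age 200)
ages = pd.Series([28, 35, 42, 29, 55])
ages_with_outlier = pd.concat([ages, pd.Series([200])], ignore_index=True)

# How far each measure moves once the outlier is included
mean_shift = ages_with_outlier.mean() - ages.mean()
median_shift = ages_with_outlier.median() - ages.median()

print(f"Mean shift:   {mean_shift:.2f}")
print(f"Median shift: {median_shift:.2f}")
```

The mean moves by roughly 27 years while the median moves by 3.5, even though only one row was added.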

<br>

****

<br>

## **Conclusion**

The **median** is a reliable measure of central tendency for data with outliers, while the **mean** is not, because extreme values skew it. Hence, before working with data, it is important to check for outliers in order to pick the right measure for it.
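
This notebook does not prescribe a way of checking for outliers. As one possible sketch, the widely used 1.5 × IQR rule of thumb flags values that lie far outside the middle half of the data. The helper `iqr_outliers` and its parameter `k` are hypothetical names chosen for this example, and the threshold of 1.5 is a convention rather than a requirement.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1                      # spread of the middle 50% of the data
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Columns from the dataset that includes the extreme customer
ages = pd.Series([28, 35, 42, 29, 55, 200])
purchases = pd.Series([150.50, 200.75, 50.00, 320.10, 88.99, 10000.00])

print("Age outliers:", iqr_outliers(ages).tolist())
print("Purchase_Amount outliers:", iqr_outliers(purchases).tolist())
```

If the check flags any values, the median is the safer summary; if it flags none, the mean and median will usually tell a similar story.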

# **The End!**

Thank you for following along to the end; hopefully you've grasped the concepts along the way. For more on the **Math in Data Science Series**, you can check out my socials and follow me for more learning insights.

[**LinkedIn**](https://www.linkedin.com/in/richard-muchoki-2408b7205/)<br>
[**GitHub**](https://github.com/Equivocal-Richie)<br>
[**Portfolio**](https://richardmuchoki.vercel.app/)
