## Topic: Measures of Central Tendency

### OUTCOMES

- 1. Introduction to Measures of Central Tendency

- 2. Mean and when it's appropriate.

- 3. Median and when it's appropriate.

- 4. Mode and when it's appropriate.

- 5. Comparison and when to use

- 6. Common Mistakes.

### 1. Introduction to Measures of Central Tendency

- Measures of central tendency are statistical values that represent the "center" or typical value of a dataset.

- Why do we need them?
    - They summarize the data by identifying a single value that is most representative of the dataset as a whole.

- How do we find it?
    - By computing the mean, median, or mode.

### 2. Mean and when it's appropriate.

- Definition:
    - The sum of all values divided by the number of values.
    - Mean = (Sum of all observations) ÷ (Number of observations)

- Formula
    - Simple mean: $\bar{X} = \dfrac{\sum_{i=1}^{N} X_i}{N}$
    - Here, $X_i$ => the i-th value in the dataset
    -       $N$  => total number of values

- Use:
    - When the dataset has no outliers and is roughly symmetric.
    - The mean is sensitive to outliers, so check for them first.

- Limitations:
    - The mean is affected by outliers.
    - It can be misleading for skewed data.
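The formula above can be sketched directly in plain Python before turning to pandas. This is a minimal illustration, not a replacement for `Series.mean()`; the function name `mean_from_formula` and the `scores` list are hypothetical and chosen only for this example.

```python
def mean_from_formula(values):
    """Return (sum of all observations) / (number of observations)."""
    if len(values) == 0:
        # The mean is undefined when there are no observations.
        raise ValueError("mean is undefined for an empty dataset")
    return sum(values) / len(values)


scores = [70, 80, 90, 100]  # hypothetical values, no outliers
print(mean_from_formula(scores))  # 85.0
```

Every value enters the sum, which is why a single extreme value can shift the result.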

In [46]:
# Example - use of mean [when data has no outliers]
import numpy as np
import pandas as pd

In [51]:
# Create an unordered dataset (20 employee salaries)
data = {
    "Employee": [f"E{i}" for i in range(1, 21)],
    "Salary": [4500, 3200, 3800, 6000, 2800, 5000, 3400, 4100, 3900, 3700, 3600, 4200, 3100, 3300, 4800, 2900, 3500, 4000, 4300, 3000]
}

# Create DataFrame
df = pd.DataFrame(data)


In [52]:
# find the mean value 

mean_value = df['Salary'].mean()

print(f"Mean Value: {mean_value:.2f}")

Mean Value: 3855.00


#### [when outliers are present] - mean

In [63]:
# Example - limitation of mean

# Simple dataset (e.g., house prices in thousands)
data = [120, 125, 128, 122, 126, 124, 600]  # 600 is an outlier
df1 = pd.DataFrame({'HousePrice': data})

df1

Unnamed: 0,HousePrice
0,120
1,125
2,128
3,122
4,126
5,124
6,600


In [64]:
mean_val = df1['HousePrice'].mean()

print("Mean value: ", mean_val)  # output: 192.142...

# Here the outlier is 600.
# The outlier pulls the mean toward itself,
# so the mean no longer reflects a typical house price.

Mean value:  192.14285714285714
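To see how far the single outlier moves the mean, the same house prices can be averaged with and without the 600 entry. The sketch below uses only the standard-library `statistics` module; the variable names are illustrative and not part of the original notebook.

```python
from statistics import mean

prices = [120, 125, 128, 122, 126, 124, 600]  # same house prices (in thousands)
prices_without_outlier = [p for p in prices if p != 600]  # drop the outlier

mean_with = mean(prices)
mean_without = mean(prices_without_outlier)

print(f"Mean with outlier:    {mean_with:.2f}")     # 192.14
print(f"Mean without outlier: {mean_without:.2f}")  # 124.17
```

Removing one value out of seven changes the mean by roughly 68, while the remaining six prices all lie between 120 and 128.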


### 3. Median and when it's appropriate.

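- Definition:
    - The median is the middle value when the data is sorted.

- Formula:
    - Sort the values (ascending or descending order).

    - Odd number of values: Median = middle value

    - Even number of values: Median = (sum of the two middle values) / 2


- Use:
    - When the data contains outliers.
    - When the data is skewed.

- Limitation:
    - Can be less reliable for small datasets, where adding or removing one value may shift the middle noticeably.
    - Not applicable to categorical values that have no natural order.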
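The sort-then-pick-the-middle rule above can be sketched in a few lines of plain Python. The function name `median_from_definition` and the sample lists are hypothetical; in practice `np.median` or `Series.median()` gives the same result.

```python
def median_from_definition(values):
    """Return the median by sorting and taking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median is undefined for an empty dataset")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_from_definition([7, 3, 9]))     # 7
print(median_from_definition([7, 3, 9, 5]))  # 6.0
```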


#### [when outliers are not present] - median

In [None]:
# Example - median for an even number of values (20 salaries)
df

Unnamed: 0,Employee,Salary
0,E1,4500
1,E2,3200
2,E3,3800
3,E4,6000
4,E5,2800
5,E6,5000
6,E7,3400
7,E8,4100
8,E9,3900
9,E10,3700


In [59]:
# sort the dataset
sort_val = df['Salary'].sort_values()

sort_val

4     2800
15    2900
19    3000
12    3100
1     3200
13    3300
6     3400
16    3500
10    3600
9     3700
2     3800
8     3900
17    4000
7     4100
11    4200
18    4300
0     4500
14    4800
5     5000
3     6000
Name: Salary, dtype: int64

In [60]:
# find the median using numpy (np.median does not require pre-sorted input)
np.median(sort_val)

np.float64(3750.0)

In [61]:
# find the median using pandas
df['Salary'].median()

3750.0

#### [when outliers are present] - median

In [69]:
# Simple dataset [median for an odd number of values]
data = [122, 120, 127, 125, 128, 122, 126, 124, 600, 700, 1000]  # 600, 700 and 1000 are outliers
df1 = pd.DataFrame({'HousePrice': data})

df1

Unnamed: 0,HousePrice
0,122
1,120
2,127
3,125
4,128
5,122
6,126
7,124
8,600
9,700


In [71]:
# using numpy to find the median
# sorting first is optional: np.median does not require sorted input

s_val = df1['HousePrice'].sort_values()

np.median(s_val)


np.float64(126.0)

In [73]:
# using pandas to find median

df1['HousePrice'].median()

126.0

### 4. Mode and when it's appropriate.

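- Definition:
    - The mode is the most frequent value in a dataset.

- Use:
    - Best suited to categorical values.

- Limitation:
    - Less useful for continuous numerical values, where exact repeats are rare.
    - A dataset can have more than one mode, or no repeated value at all.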
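The definition above can be sketched with the standard-library `collections.Counter`. Because several values can share the highest frequency, this illustrative helper (the name `modes` is hypothetical) returns a list of all of them rather than a single value.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Keep all values tied for the highest count, in first-seen order.
    return [value for value, count in counts.items() if count == highest]


print(modes(["red", "blue", "red", "green"]))  # ['red']
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]
```

The second call shows a tie: both 1 and 2 appear twice, so the data has two modes.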

In [77]:
# Example of Mode 
# Example Dataset: Employees with Department and Gender
data = {
    "Employee": [f"E{i}" for i in range(1, 21)],
    "Department": ["IT", "Finance", "HR", "Marketing", "IT", "Finance", "IT", "HR", "Marketing", "IT",
                   "Finance", "HR", "IT", "Finance", "Marketing", "IT", "HR", "Marketing", "Finance", "IT"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female", "Male", "Female", "Male", "Male",
               "Female", "Female", "Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male"]
}

# Create DataFrame
df_new = pd.DataFrame(data)

df_new.head(7)

Unnamed: 0,Employee,Department,Gender
0,E1,IT,Male
1,E2,Finance,Female
2,E3,HR,Female
3,E4,Marketing,Male
4,E5,IT,Male
5,E6,Finance,Female
6,E7,IT,Male


In [79]:
df_new['Gender']

0       Male
1     Female
2     Female
3       Male
4       Male
5     Female
6       Male
7     Female
8       Male
9       Male
10    Female
11    Female
12      Male
13    Female
14      Male
15      Male
16    Female
17      Male
18    Female
19      Male
Name: Gender, dtype: object

In [83]:
# apply mode to the Gender column
# mode() returns a Series because ties are possible; [0] takes the first mode
mode_val = df_new['Gender'].mode()[0]

print("Mode values: ", mode_val)

Mode values:  Male
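The mode reports only the winner, so it can help to look at the full frequency counts behind it. The sketch below does this for the Department values from the example dataset, using the standard-library `Counter` rather than pandas; the variable names are illustrative.

```python
from collections import Counter

# Department values copied from the example dataset above.
departments = [
    "IT", "Finance", "HR", "Marketing", "IT", "Finance", "IT", "HR", "Marketing", "IT",
    "Finance", "HR", "IT", "Finance", "Marketing", "IT", "HR", "Marketing", "Finance", "IT",
]

department_counts = Counter(departments)

# most_common() lists values from most to least frequent.
for department, count in department_counts.most_common():
    print(f"{department}: {count}")

top_department, top_count = department_counts.most_common(1)[0]
print("Mode:", top_department)  # Mode: IT
```

Here IT appears 7 times, Finance 5, and HR and Marketing 4 each, so IT is the mode of the Department column.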


### 5. Comparison and when to use

| Best Measure | Situation | Reason |
|--------------|-----------|--------|
| 1. Mean | Symmetric data, e.g. height, weight. Good for normal, balanced data. | All values contribute equally. |
| 2. Median | Skewed data with outliers, e.g. income, price. Good for skewed or outlier-rich data. | Resistant to outliers. |
| 3. Mode | Categorical data, e.g. gender, color. Best for categorical data. | Works with non-numeric data. |
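The comparison can be checked on small made-up datasets. The sketch below uses the standard-library `statistics` module; all three lists are hypothetical values chosen only to illustrate a symmetric case, a skewed case, and a categorical case.

```python
from statistics import mean, median, mode

heights = [160, 165, 170, 175, 180]   # hypothetical symmetric data (cm)
incomes = [30, 32, 35, 38, 40, 250]   # hypothetical skewed data (thousands)
colors = ["red", "blue", "red", "green"]  # hypothetical categorical data

# Symmetric data: mean and median agree.
print(f"heights: mean={mean(heights):.2f}, median={median(heights):.2f}")
# heights: mean=170.00, median=170.00

# Skewed data: the outlier (250) pulls the mean well above the median.
print(f"incomes: mean={mean(incomes):.2f}, median={median(incomes):.2f}")
# incomes: mean=70.83, median=36.50

# Categorical data: only the mode is meaningful.
print("colors: mode =", mode(colors))
# colors: mode = red
```

For the skewed incomes, the median of 36.5 sits among the typical values, while the mean of about 70.83 matches none of them.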

### 6. Common Mistakes.

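- Using the mean on skewed data or data with outliers, where it can be misleading.

- Using the median on categorical data that has no natural order.

- Relying on the mode for continuous numerical data, where values rarely repeat.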